Decode is sequential, but verification does not have to be. Speculative decoding asks a cheaper model to draft several tokens, then lets the target model verify them in parallel while preserving the target distribution.
Day 15 split inference into prefill and decode. Day 20 showed why decode becomes memory-bound and user-visible. Speculative decoding attacks the one-token-at-a-time loop without changing the target model. If the draft is right often enough, one target forward can advance multiple tokens.
- K is the number of proposed draft tokens per verification step.
- gamma is the probability that one proposed token is accepted.
- pi_d is the draft model distribution.
- pi_t is the target model distribution.
- u is a random number from 0 to 1 used for the accept/reject test.

Standard decode spends one target forward per output token. If a target forward takes 50 ms, then five new tokens take about 250 ms before sampling overhead. Speculative decoding changes the schedule: a small draft model proposes K tokens cheaply, then the target model verifies those positions in one parallel forward.
Concrete latency: draft cost is 5 ms per token, target verify is 50 ms, and K = 5. Drafting costs 25 ms. Verifying costs 50 ms. If the target accepts about three tokens, the effective cost is 75 / 3 = 25 ms/token, a 2x speedup.
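The arithmetic above can be reproduced in a few lines. This is a minimal sketch of the same back-of-the-envelope model, not a profiler: the function name `effective_ms_per_token` and its parameters are illustrative, and it assumes the K draft steps run sequentially and that nothing else (sampling, cache management) costs time.

```python
def effective_ms_per_token(draft_ms, verify_ms, k, tokens_advanced):
    """Wall-clock cost per emitted token for one draft-then-verify round."""
    # K sequential draft steps plus a single parallel target forward.
    round_ms = k * draft_ms + verify_ms
    return round_ms / tokens_advanced


baseline_ms = 50.0  # standard decode: one target forward per token
spec_ms = effective_ms_per_token(draft_ms=5.0, verify_ms=50.0, k=5, tokens_advanced=3)
speedup = baseline_ms / spec_ms

print(f"{spec_ms:.1f} ms/token, {speedup:.1f}x speedup")  # 25.0 ms/token, 2.0x speedup
```

Changing `tokens_advanced` shows how sensitive the result is: if only one token survives the round, the same 75 ms buys less than plain decode would.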
At each proposed token, compare the target probability with the draft probability.
accept if u <= min(1, pi_t(token) / pi_d(token))
If the draft underestimates a token that the target likes, the ratio is above 1 and the token is always accepted. If the draft overestimates the token, the ratio falls below 1 and the token is rejected with probability 1 - pi_t(token) / pi_d(token). On rejection, sample a replacement from the residual distribution, max(0, pi_t - pi_d) renormalized to sum to 1, so the final stream still follows the target model.
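The accept/reject rule for a single position can be sketched as follows. This is an illustrative toy, not a serving implementation: distributions are plain dicts over a three-token vocabulary, the helper name `verify_token` is hypothetical, and the residual is taken to be max(0, pi_t - pi_d) renormalized. The loop at the end checks empirically that tokens drafted from pi_d and passed through the rule come out distributed like pi_t.

```python
import random


def verify_token(token, pi_t, pi_d, rng=random):
    """Accept a drafted token or replace it; returns (token_out, accepted).

    pi_t and pi_d map token -> probability at this position.
    """
    u = rng.random()  # uniform in [0, 1)
    # pi_d[token] > 0 because the draft model sampled this token.
    ratio = pi_t.get(token, 0.0) / pi_d[token]
    # Strict comparison: with u in [0, 1) this matches the rule above
    # and never accepts a token the target assigns zero probability.
    if u < min(1.0, ratio):
        return token, True
    # Rejected: resample from the residual max(0, pi_t - pi_d).
    # random.choices normalizes the weights for us.
    residual = {t: max(0.0, p - pi_d.get(t, 0.0)) for t, p in pi_t.items()}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False


# Toy distributions (assumed values, chosen only for illustration).
pi_t = {"a": 0.6, "b": 0.3, "c": 0.1}
pi_d = {"a": 0.3, "b": 0.5, "c": 0.2}

rng = random.Random(0)
n = 100_000
counts = {t: 0 for t in pi_t}
for _ in range(n):
    drafted = rng.choices(list(pi_d), weights=list(pi_d.values()), k=1)[0]
    out, _accepted = verify_token(drafted, pi_t, pi_d, rng)
    counts[out] += 1

freqs = {t: counts[t] / n for t in counts}
print(freqs)  # should be close to pi_t, not pi_d
```

In a real decoder the same test runs left to right over the K drafted positions and stops at the first rejection; everything after the rejected position is discarded.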
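If each position is accepted independently with probability gamma, the expected number of tokens advanced by one verification step, counting the one token the target itself supplies at the first rejection or after the last accepted draft, is: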
E[tokens] = (1 - gamma^(K + 1)) / (1 - gamma)
For gamma = 0.8 and K = 5, E = (1 - 0.8^6) / 0.2 = 3.69 tokens. Real systems then subtract draft overhead, verification overhead, cache management, and batching effects.
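The closed form is a finite geometric series, 1 + gamma + gamma^2 + ... + gamma^K, which a short check makes concrete. This is a sketch under the same idealized assumption as the formula (one fixed, independent acceptance probability per position); the function names are illustrative.

```python
def expected_tokens(gamma, k):
    """Expected tokens advanced per verification step (closed form)."""
    if gamma == 1.0:
        return float(k + 1)  # every draft accepted, plus the target's own token
    return (1 - gamma ** (k + 1)) / (1 - gamma)


def expected_tokens_sum(gamma, k):
    """Same quantity written as the geometric series it comes from."""
    return sum(gamma ** i for i in range(k + 1))


e = expected_tokens(0.8, 5)
print(f"{e:.2f}")  # 3.69
```

Note the diminishing returns: with gamma = 0.8, raising K from 5 to 7 adds less than half a token in expectation while adding two more draft steps per round.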
| Method | Draft mechanism | Strength | Cost |
|---|---|---|---|
| Separate draft model | Small autoregressive model | Easy when tokenizer matches | Extra model and cache |
| Medusa | Multiple prediction heads on target hidden state | No separate full model | Needs head fine-tuning |
| EAGLE | Feature-level draft model | High acceptance | Needs trained drafter |
| DFlash | Block-diffusion drafter | Parallel block proposal | Needs DFlash checkpoint |
All methods exploit target verification. They differ in how candidates are proposed and how much integration they require.
Speculative decoding loves low-batch interactive serving, where the target model is underutilized and every token matters to the user. It is less impressive at high batch, where continuous batching already fills matmul shapes and the draft model becomes extra work. It also fails when the draft is weak: low gamma means many rejections and little progress per verify.
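To see how a weak draft erodes the benefit, the expected-token formula can be combined with the per-round cost from the latency example. This is a rough model under assumed costs (5 ms per draft token, 50 ms per target forward, both taken from the example numbers earlier), with hypothetical function names; it ignores cache management and batching effects, so treat the outputs as directional only.

```python
def expected_tokens(gamma, k):
    """Expected tokens advanced per verification step."""
    if gamma == 1.0:
        return float(k + 1)
    return (1 - gamma ** (k + 1)) / (1 - gamma)


def estimated_speedup(gamma, k, draft_ms=5.0, target_ms=50.0):
    """Speedup over plain decode, which pays target_ms per token."""
    round_ms = k * draft_ms + target_ms  # K draft steps + one verify
    return expected_tokens(gamma, k) * target_ms / round_ms


for gamma in (0.6, 0.7, 0.8, 0.9):
    row = "  ".join(f"K={k}: {estimated_speedup(gamma, k):.2f}x" for k in (3, 5, 7))
    print(f"gamma={gamma}: {row}")
```

Raising `draft_ms` relative to `target_ms` is a crude stand-in for the high-batch case, where the draft's extra work is no longer nearly free. With a very weak draft (for example gamma = 0.2 and K = 7) the model returns a value below 1, meaning speculation is slower than plain decode.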
gamma in {0.6, 0.7, 0.8, 0.9} and K in {3, 5, 7}.

"Speculative decoding is latency arbitrage: spend cheap guesses to buy fewer expensive target steps."
Primary references and the companion notebook for today's exercise.