LLM Inference Engineer · Day 23
Day 23 · Week 4 · Optimization & Capstone

Speculative Decoding: Draft, Verify, Accept

Decode is sequential, but verification does not have to be. Speculative decoding asks a cheaper model to draft several tokens, then lets the target model verify them in parallel while preserving the target distribution.

Time: ~190 min
Difficulty: Hard
Prerequisite: Days 15, 20
Notebook: day-23-spec-decoding-simulator
Why This Lesson

Why this optimization matters.

Day 15 split inference into prefill and decode. Day 20 showed why decode becomes memory-bound and user-visible. Speculative decoding attacks the one-token-at-a-time loop without changing the target model. If the draft is right often enough, one target forward can advance multiple tokens.

Learning Objectives

What you should be able to do today.

  1. Walk through draft-verify speculative decoding with concrete probabilities.
  2. Compute expected accepted tokens from acceptance probability gamma and draft length K.
  3. Explain why rejection sampling preserves the target model distribution.
  4. Compare separate draft models, Medusa, EAGLE, and DFlash.
  5. Predict when speculative decoding helps and when batching already fills the GPU.
Notation Cheatsheet

Decode the symbols before using them.

  • K is the number of proposed draft tokens per verification step.
  • gamma is the probability that one proposed token is accepted.
  • pi_d is the draft model distribution.
  • pi_t is the target model distribution.
  • u is a random number from 0 to 1 used for the accept/reject test.
The Bottleneck

Use a cheap guess to reduce expensive target steps.

Standard decode spends one target forward per output token. If a target forward takes 50 ms, then five new tokens take about 250 ms before sampling overhead. Speculative decoding changes the schedule: a small draft model proposes K tokens cheaply, then the target model verifies those positions in one parallel forward.

Concrete latency: draft cost is 5 ms per token, target verify is 50 ms, and K = 5. Drafting costs 25 ms. Verifying costs 50 ms. If the target accepts about three tokens, the effective cost is 75 / 3 = 25 ms/token, a 2x speedup.
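The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `speculative_cost_per_token` is hypothetical, and it assumes the K draft calls run sequentially and that one verify pass costs the same as one ordinary target step.

```python
def speculative_cost_per_token(draft_ms, verify_ms, k, tokens_per_step):
    """Effective latency per output token for one draft-then-verify step."""
    step_ms = k * draft_ms + verify_ms  # K sequential draft calls + one target forward
    return step_ms / tokens_per_step


baseline_ms = 50.0  # standard decode: one target forward per token
spec_ms = speculative_cost_per_token(
    draft_ms=5.0, verify_ms=50.0, k=5, tokens_per_step=3
)
speedup = baseline_ms / spec_ms

print(f"{spec_ms:.1f} ms/token, {speedup:.2f}x speedup")  # 25.0 ms/token, 2.00x speedup
```

Changing `tokens_per_step` to 1 shows the downside: the same 75 ms step then yields a single token, which is slower than the 50 ms baseline.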

Figure: Sequential Decode vs Speculative Decode. Sequential decode runs target steps t1 through t5 one at a time; speculative decode runs a draft burst, then one verify of K tokens with about 3 accepted. Speculation replaces several target steps with one target verification plus cheaper draft work.
Accept or Reject

The ratio test keeps the target distribution.

At each proposed token, compare the target probability with the draft probability.

accept if u <= min(1, pi_t(token) / pi_d(token))

If the draft underestimates a token that the target likes, the ratio is above 1 and the token is always accepted. If the draft overestimates the token, the ratio is below 1 and the token is accepted only with probability equal to that ratio. On rejection, sample a replacement from the residual distribution, which is proportional to max(0, pi_t - pi_d), so the final stream still follows the target model.
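A short derivation shows why the rule above reproduces the target distribution. It is a sketch for a single position, writing the residual as the normalized positive part of pi_t - pi_d (the symbol \pi_{\mathrm{res}} is introduced here for illustration):

```latex
\begin{aligned}
\pi_{\mathrm{res}}(x) &= \frac{\max\bigl(0,\ \pi_t(x) - \pi_d(x)\bigr)}
                              {\sum_{y} \max\bigl(0,\ \pi_t(y) - \pi_d(y)\bigr)} \\[4pt]
P(\text{reject}) &= 1 - \sum_{y} \pi_d(y)\,\min\!\Bigl(1,\ \tfrac{\pi_t(y)}{\pi_d(y)}\Bigr)
                  = 1 - \sum_{y} \min\bigl(\pi_d(y),\ \pi_t(y)\bigr)
                  = \sum_{y} \max\bigl(0,\ \pi_t(y) - \pi_d(y)\bigr) \\[4pt]
P(\text{emit } x) &= \underbrace{\min\bigl(\pi_d(x),\ \pi_t(x)\bigr)}_{\text{drafted and accepted}}
                   + \underbrace{P(\text{reject})\,\pi_{\mathrm{res}}(x)}_{\text{rejected, then resampled}} \\
                  &= \min\bigl(\pi_d(x),\ \pi_t(x)\bigr) + \max\bigl(0,\ \pi_t(x) - \pi_d(x)\bigr)
                   = \pi_t(x)
\end{aligned}
```

The rejection probability equals the normalizer of the residual, so the two cancel and every token ends up with exactly its target probability.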

Figure: One Candidate Token Decision. Draft token id 42 has pi_d = 0.50; target verify gives pi_t = 0.05, so the ratio is 0.10 and the token is rejected 90% of the time. Overconfident draft tokens are rejected most often.
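The accept/reject rule can be sketched on toy distributions as follows. This is a minimal illustration rather than a production sampler: the three-token vocabulary and the helper names `residual` and `speculative_sample` are assumptions, and token 0 reuses the 0.50 versus 0.05 probabilities from the example above.

```python
import random


def residual(pi_t, pi_d):
    """Normalized positive part of (pi_t - pi_d); only needed after a rejection."""
    raw = [max(0.0, t - d) for t, d in zip(pi_t, pi_d)]
    total = sum(raw)
    return [r / total for r in raw]


def speculative_sample(pi_t, pi_d, rng):
    """Return (token, accepted) for one position, with pi_d as the proposer."""
    vocab = range(len(pi_t))
    token = rng.choices(vocab, weights=pi_d)[0]   # draft proposes a token
    u = rng.random()                              # u in [0, 1)
    if u <= min(1.0, pi_t[token] / pi_d[token]):  # ratio test
        return token, True
    # Rejected: resample from the residual distribution instead.
    token = rng.choices(vocab, weights=residual(pi_t, pi_d))[0]
    return token, False


pi_d = [0.50, 0.30, 0.20]  # toy draft distribution (assumed)
pi_t = [0.05, 0.35, 0.60]  # toy target distribution (assumed)

rng = random.Random(0)
n = 200_000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    token, ok = speculative_sample(pi_t, pi_d, rng)
    counts[token] += 1
    accepted += ok

freqs = [c / n for c in counts]
accept_rate = accepted / n
print("empirical:", [round(f, 3) for f in freqs], "target:", pi_t)
print("acceptance rate:", round(accept_rate, 3))
```

The empirical frequencies should land close to pi_t even though every proposal came from pi_d, and the acceptance rate should be close to the sum of min(pi_d, pi_t), which is 0.55 for these toy values.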
Expected Tokens Per Step

Acceptance rate sets the speedup ceiling.

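If each position is accepted independently with probability gamma, the expected tokens advanced by one verification step are given below. The count includes the one token the target always contributes itself: a resampled token after a rejection, or a bonus token when all K drafts are accepted.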

E[tokens] = (1 - gamma^(K + 1)) / (1 - gamma)

For gamma = 0.8 and K = 5, E = (1 - 0.8^6) / 0.2 = 3.69 tokens. Real systems then subtract draft overhead, verification overhead, cache management, and batching effects.
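The closed form can be checked against a quick Monte Carlo run. This sketch assumes acceptances are independent with a fixed gamma, which real models only approximate; the function names are hypothetical.

```python
import random


def expected_tokens(gamma, k):
    """Closed form for 1 + gamma + ... + gamma^K (requires gamma < 1)."""
    return (1 - gamma ** (k + 1)) / (1 - gamma)


def simulate_tokens(gamma, k, steps, rng):
    """Monte Carlo estimate of tokens advanced per verification step."""
    total = 0
    for _ in range(steps):
        accepted = 0
        # Accept drafts left to right until the first rejection or K accepts.
        while accepted < k and rng.random() < gamma:
            accepted += 1
        total += accepted + 1  # the target always contributes one more token
    return total / steps


rng = random.Random(0)
for gamma in (0.6, 0.7, 0.8, 0.9):
    closed = expected_tokens(gamma, k=5)
    approx = simulate_tokens(gamma, k=5, steps=100_000, rng=rng)
    print(f"gamma={gamma:.1f}  closed={closed:.2f}  simulated={approx:.2f}")
```

The `+ 1` in the simulation is what turns the exponent into K + 1: a step advances at least one token even when the first draft is rejected.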

Expected tokens per verify, K = 5

gamma | E[tokens]
0.60 | 2.38
0.70 | 2.94
0.80 | 3.69
0.90 | 4.69

Acceptance rate dominates the speedup ceiling.
Families of Speculation

Drafting families differ in integration cost.

Method | Draft mechanism | Strength | Cost
Separate draft model | Small autoregressive model | Easy when tokenizer matches | Extra model and cache
Medusa | Multiple prediction heads on target hidden state | No separate full model | Needs head fine-tuning
EAGLE | Feature-level draft model | High acceptance | Needs trained drafter
DFlash | Block-diffusion drafter | Parallel block proposal | Needs DFlash checkpoint

All methods exploit target verification. They differ in how candidates are proposed and how much integration they require.

Figure: Speculation Methods. Draft model (small AR model), Medusa (parallel heads), EAGLE (feature-level drafter), DFlash (block diffusion). The verifier is still the target model; the candidate generator changes.
When It Fails

Speculation is not free.

Speculative decoding loves low-batch interactive serving, where the target model is underutilized and every token matters to the user. It is less impressive at high batch, where continuous batching already fills matmul shapes and the draft model becomes extra work. It also fails when the draft is weak: low gamma means many rejections and little progress per verify.
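A small cost model makes these failure modes concrete. It is a rough sketch under stated assumptions: acceptances are independent with probability gamma, draft calls are sequential, and `verify_ms` is a hypothetical knob for a verify pass that costs more than one decode step, as it can when a large batch already keeps the GPU busy.

```python
def expected_tokens(gamma, k):
    """Expected tokens advanced per verification step (requires gamma < 1)."""
    return (1 - gamma ** (k + 1)) / (1 - gamma)


def speedup(gamma, k, draft_ms, target_ms, verify_ms=None):
    """Baseline ms/token divided by speculative ms/token."""
    if verify_ms is None:
        verify_ms = target_ms  # assumption: verify costs one ordinary target step
    step_ms = k * draft_ms + verify_ms
    return expected_tokens(gamma, k) * target_ms / step_ms


def best_k(gamma, draft_ms, target_ms, max_k=10):
    """Draft length in 1..max_k that maximizes the modeled speedup."""
    return max(range(1, max_k + 1),
               key=lambda k: speedup(gamma, k, draft_ms, target_ms))


for gamma, draft_ms in [(0.8, 5.0), (0.4, 5.0), (0.4, 25.0)]:
    k = best_k(gamma, draft_ms, target_ms=50.0)
    s = speedup(gamma, k, draft_ms, target_ms=50.0)
    print(f"gamma={gamma}, draft={draft_ms} ms -> best K={k}, speedup={s:.2f}x")

# High-batch stand-in: the verify pass is assumed to cost 3x a decode step.
print(f"{speedup(0.8, 5, 5.0, 50.0, verify_ms=150.0):.2f}x")
```

In this model a weak or expensive draft can push the speedup below 1x, and the best K shrinks as gamma falls, which matches the intuition that acceptance rate and draft cost together set the draft length.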

Figure: Cache Implication. The target cache holds the prefix and the draft cache holds the candidate path; the target verifies K tokens, the cache is cropped on reject, and decoding continues. Draft and target caches must stay consistent after acceptance or rejection.
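The bookkeeping behind that caption can be sketched with a toy cache. Everything here is hypothetical: `ToyKVCache` stores token ids in place of per-layer key/value tensors, and real engines differ in exactly which positions each cache holds at verify time.

```python
class ToyKVCache:
    """Stand-in for a KV cache: one entry per cached position."""

    def __init__(self, token_ids=()):
        self.entries = list(token_ids)

    def append(self, token_ids):
        self.entries.extend(token_ids)

    def crop(self, length):
        """Drop every position at index >= length."""
        del self.entries[length:]

    def __len__(self):
        return len(self.entries)


def commit_step(target_cache, draft_cache, prefix_len, n_accepted):
    """Roll both caches back to the prefix plus the accepted draft tokens."""
    keep = prefix_len + n_accepted
    target_cache.crop(keep)
    draft_cache.crop(keep)
    return keep


prefix = [11, 12, 13, 14]
drafts = [21, 22, 23, 24, 25]  # K = 5 proposed tokens

target_cache = ToyKVCache(prefix)
draft_cache = ToyKVCache(prefix)
draft_cache.append(drafts)   # the draft model cached its own candidate path
target_cache.append(drafts)  # the verify pass wrote K speculative positions

# Suppose the target accepted the first two drafts and rejected the third.
kept = commit_step(target_cache, draft_cache, prefix_len=len(prefix), n_accepted=2)
print(kept, target_cache.entries, draft_cache.entries)
```

After the crop, the token that replaced the rejected draft is fed to both models as the next input, so both caches grow again from the same accepted prefix.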
Did You Know?

A detail worth remembering.

Speculative decoding was proposed by multiple groups around the same time because the bottleneck is so visible: decode exposes expensive sequential dependence even when GPUs prefer wide parallel work.
Exercise

Build the habit with code.

  1. Implement the accept/reject test on toy distributions.
  2. Simulate expected tokens per step for gamma in {0.6, 0.7, 0.8, 0.9} and K in {3, 5, 7}.
  3. Measure when draft overhead erases speedup by changing draft latency in the notebook.
  4. Optional: try HuggingFace assisted generation with a small target/draft pair and record tokens/sec.
Self-Check

Answer these from memory.

  1. Why does speculative decoding preserve target quality? Because the target model checks every candidate with the ratio test, and each rejected token is resampled from the residual distribution.
  2. What controls the optimal K? Acceptance rate and draft cost.
  3. Why does high batch reduce benefit? Batching already keeps the target model busy, so speculation has less idle compute to exploit and the draft becomes extra work.
  4. How is Medusa different from a separate draft model? Medusa adds prediction heads to the target hidden state instead of running another autoregressive model.
  5. What must happen to KV cache after a rejection? The target cache is cropped to the accepted prefix, and the draft cache must be kept consistent with it.

"Speculative decoding is latency arbitrage: spend cheap guesses to buy fewer expensive target steps."

Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

  • Paper: Speculative Decoding. Draft and verify with rejection sampling.
  • Paper: Medusa. Multiple decoding heads for LLM inference.
  • Paper: EAGLE. Feature-level speculative drafting.
  • Blog: HuggingFace assisted generation. Practical API framing.
  • Notebook: Day 23 notebook. Runnable companion notebook for the lesson.