Day 23 · Week 4 · Optimization & Capstone

Speculative Decoding: Draft, Verify, Accept

Decode is sequential, but verification does not have to be. Speculative decoding asks a cheaper model to draft several tokens, then lets the target model verify them in parallel while preserving the target distribution.

Time: ~190 min

Difficulty: Hard

Prerequisite: Days 15, 20

Notebook: day-23-spec-decoding-simulator

Why This Lesson

Why this optimization matters.

Day 15 split inference into prefill and decode. Day 20 showed why decode becomes memory-bound and user-visible. Speculative decoding attacks the one-token-at-a-time loop without changing the target model. If the draft is right often enough, one target forward can advance multiple tokens.

Learning Objectives

What you should be able to do today.

Walk through draft-verify speculative decoding with concrete probabilities.
Compute expected accepted tokens from acceptance probability gamma and draft length K.
Explain why rejection sampling preserves the target model distribution.
Compare separate draft models, Medusa, EAGLE, and DFlash.
Predict when speculative decoding helps and when batching already fills the GPU.

Notation Cheatsheet

Decode the symbols before using them.

K is the number of proposed draft tokens per verification step.
gamma is the probability that one proposed token is accepted.
pi_d is the draft model distribution.
pi_t is the target model distribution.
u is a random number drawn uniformly from [0, 1), used for the accept/reject test.

The Bottleneck

Use a cheap guess to reduce expensive target steps.

Standard decode spends one target forward per output token. If a target forward takes 50 ms, then five new tokens take about 250 ms before sampling overhead. Speculative decoding changes the schedule: a small draft model proposes K tokens cheaply, then the target model verifies those positions in one parallel forward.

Concrete latency: draft cost is 5 ms per token, target verify is 50 ms, and K = 5. Drafting costs 5 x 5 = 25 ms. Verifying costs 50 ms, so one step costs 75 ms. If the step advances about three tokens, the effective cost is 75 / 3 = 25 ms/token, a 2x speedup over the 50 ms/token baseline.

Speculation replaces several target steps with one target verification plus cheaper draft work.
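The latency arithmetic above can be written as a small helper. This is a minimal sketch under the lesson's numbers; the function name `ms_per_token` and its parameters are illustrative, not part of the notebook.

```python
def ms_per_token(draft_ms: float, verify_ms: float, k: int, tokens_advanced: float) -> float:
    """Effective latency per output token for one draft-then-verify step."""
    # One step pays for k draft forwards plus a single target verification.
    step_ms = k * draft_ms + verify_ms
    return step_ms / tokens_advanced


baseline_ms = 50.0  # standard decode: one target forward per token
spec_ms = ms_per_token(draft_ms=5.0, verify_ms=50.0, k=5, tokens_advanced=3)
speedup = baseline_ms / spec_ms

print(f"{spec_ms:.1f} ms/token, {speedup:.1f}x speedup")  # 25.0 ms/token, 2.0x speedup
```

Changing `tokens_advanced` shows how sensitive the result is to acceptance: with only one token advanced, the same step costs 75 ms/token, which is slower than the baseline.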

Accept or Reject

The ratio test keeps the target distribution.

At each proposed token, compare the target probability with the draft probability.

accept if u <= min(1, pi_t(token) / pi_d(token))

If the draft underestimates a token that the target likes, the ratio is above 1 and the token is always accepted. If the draft overestimates the token, the ratio is low and rejection is likely. On the first rejection, discard the remaining drafted tokens and sample a replacement from the residual distribution, max(0, pi_t - pi_d) renormalized to sum to 1, so the final stream still follows the target model.

Overconfident draft tokens are rejected most often.
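The ratio test and the residual resample can be checked on a toy three-token vocabulary. The sketch below is an illustration, not the notebook's implementation: the distributions `pi_t` and `pi_d` are made-up values, and the helper names `residual` and `speculative_sample` are assumptions. It draws many single tokens and compares the empirical frequencies with the target distribution.

```python
import random


def residual(pi_t, pi_d):
    """Renormalized max(0, pi_t - pi_d): mass the target wants beyond the draft's."""
    raw = [max(0.0, t - d) for t, d in zip(pi_t, pi_d)]
    total = sum(raw)
    return [r / total for r in raw]


def speculative_sample(pi_t, pi_d, rng):
    """Propose one token from pi_d, then accept it or resample from the residual."""
    vocab = list(range(len(pi_t)))
    token = rng.choices(vocab, weights=pi_d)[0]
    u = rng.random()
    if u <= min(1.0, pi_t[token] / pi_d[token]):
        return token, True
    # Rejected: replace the proposal with a draw from the residual distribution.
    return rng.choices(vocab, weights=residual(pi_t, pi_d))[0], False


# Toy distributions (illustrative): the draft is overconfident about token 2.
pi_t = [0.6, 0.3, 0.1]
pi_d = [0.3, 0.3, 0.4]

rng = random.Random(0)
trials = 50_000
counts = [0] * len(pi_t)
accepted = 0
for _ in range(trials):
    token, ok = speculative_sample(pi_t, pi_d, rng)
    counts[token] += 1
    accepted += ok

freqs = [c / trials for c in counts]
accept_rate = accepted / trials
print("empirical:", [round(f, 3) for f in freqs], "target:", pi_t)
print("acceptance rate:", round(accept_rate, 3))
```

For these toy values the acceptance probability works out to the sum of min(pi_t, pi_d) over the vocabulary, 0.3 + 0.3 + 0.1 = 0.7, and the empirical frequencies should land close to pi_t even though every proposal came from pi_d.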

Expected Tokens Per Step

Acceptance rate sets the speedup ceiling.

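If each position is accepted independently with probability gamma, the expected number of tokens advanced by one verification step is given below. The count includes the one token the target itself supplies: a replacement after the first rejection, or a bonus token when all K drafts pass.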

E[tokens] = (1 - gamma^(K + 1)) / (1 - gamma)

For gamma = 0.8 and K = 5, E = (1 - 0.8^6) / 0.2 = 3.69 tokens. Real systems then subtract draft overhead, verification overhead, cache management, and batching effects.

Acceptance rate dominates the speedup ceiling.
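The closed form can be cross-checked against a direct simulation. This is a minimal sketch that assumes each draft position is accepted independently with probability gamma, as the formula does; the function names are illustrative.

```python
import random


def expected_tokens(gamma: float, k: int) -> float:
    """Closed form (1 - gamma^(k+1)) / (1 - gamma), with the gamma = 1 limit."""
    if gamma == 1.0:
        return float(k + 1)
    return (1.0 - gamma ** (k + 1)) / (1.0 - gamma)


def simulate_tokens(gamma: float, k: int, trials: int, rng: random.Random) -> float:
    """Average tokens advanced per verify step under independent acceptance."""
    total = 0
    for _ in range(trials):
        accepted = 0
        # Accept drafts in order and stop at the first rejection.
        while accepted < k and rng.random() < gamma:
            accepted += 1
        # The target always contributes one token (replacement or bonus).
        total += accepted + 1
    return total / trials


closed = expected_tokens(0.8, 5)
simulated = simulate_tokens(0.8, 5, trials=50_000, rng=random.Random(0))
print(f"closed form: {closed:.2f}, simulated: {simulated:.2f}")
```

The simulation makes the "+1" explicit: even when the first draft token is rejected, the step still advances one token, which is why the formula never drops below 1.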

Families of Speculation

Drafting families differ in integration cost.

Method	Draft mechanism	Strength	Cost
Separate draft model	Small autoregressive model	Easy when tokenizer matches	Extra model and cache
Medusa	Multiple prediction heads on target hidden state	No separate full model	Needs head fine-tuning
EAGLE	Feature-level draft model	High acceptance	Needs trained drafter
DFlash	Block-diffusion drafter	Parallel block proposal	Needs DFlash checkpoint

All methods exploit target verification. They differ in how candidates are proposed and how much integration they require.

The verifier is still the target model; the candidate generator changes.

When It Fails

Speculation is not free.

Speculative decoding loves low-batch interactive serving, where the target model is underutilized and every token matters to the user. It is less impressive at high batch, where continuous batching already fills matmul shapes and the draft model becomes extra work. It also fails when the draft is weak: low gamma means many rejections and little progress per verify.
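A simple cost model shows where the speedup disappears. The sketch below is an assumption-laden illustration: it treats acceptance as independent per position, charges `k` draft forwards plus one verification per step, and by default assumes verification costs the same as one ordinary target forward. The scenario numbers are hypothetical.

```python
def expected_tokens(gamma: float, k: int) -> float:
    """Expected tokens advanced per verify step: 1 + gamma + ... + gamma^k."""
    return sum(gamma ** i for i in range(k + 1))


def speedup(gamma, k, draft_ms, target_ms, verify_ms=None):
    """Speedup over standard decode; values below 1.0 mean speculation is slower."""
    # Assumption: verifying k positions costs one target forward unless overridden.
    verify_ms = target_ms if verify_ms is None else verify_ms
    step_ms = k * draft_ms + verify_ms
    return expected_tokens(gamma, k) * target_ms / step_ms


def best_k(gamma, draft_ms, target_ms, max_k=16):
    """Draft length in 1..max_k that maximizes the modeled speedup."""
    return max(range(1, max_k + 1), key=lambda k: speedup(gamma, k, draft_ms, target_ms))


# Hypothetical scenarios, all with a 50 ms target forward.
scenarios = {
    "good draft": dict(gamma=0.8, draft_ms=5.0),
    "weak draft": dict(gamma=0.3, draft_ms=5.0),
    "slow draft": dict(gamma=0.8, draft_ms=30.0),
}
for name, cfg in scenarios.items():
    at_k5 = speedup(cfg["gamma"], 5, cfg["draft_ms"], target_ms=50.0)
    k_star = best_k(cfg["gamma"], cfg["draft_ms"], target_ms=50.0)
    print(f"{name}: speedup at K=5 is {at_k5:.2f}x, best K is {k_star}")
```

In this model both the weak draft and the slow draft fall below 1.0x at K = 5, and the best K shrinks as gamma drops or draft cost rises. Passing a `verify_ms` larger than `target_ms` is one crude way to represent a batch that already saturates the GPU, where verifying extra positions is no longer close to free.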

Draft and target caches must stay consistent after acceptance or rejection.
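Cache consistency can be pictured with a toy cache that stores one entry per token position. This is a deliberately simplified sketch: `ToyKVCache` and `rollback` are hypothetical stand-ins, real KV caches hold per-layer key and value tensors, and the exact bookkeeping differs between implementations.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a KV cache: one entry per token position."""
    entries: list = field(default_factory=list)

    def crop(self, length: int) -> None:
        # Drop every position past `length`, keeping the accepted prefix.
        del self.entries[length:]


def rollback(cache: ToyKVCache, prefix_len: int, n_accepted: int) -> None:
    """Keep the original prefix plus the accepted draft tokens, discard the rest."""
    cache.crop(prefix_len + n_accepted)


prefix = ["The", "cat", "sat", "on"]
drafted = ["the", "mat", "with", "a", "hat"]
n_accepted = 2  # the third drafted token was rejected

# After drafting and verifying, both caches hold entries for all drafted positions.
target_cache = ToyKVCache(prefix + drafted)
draft_cache = ToyKVCache(prefix + drafted)

for cache in (target_cache, draft_cache):
    rollback(cache, len(prefix), n_accepted)

print(target_cache.entries)
print(draft_cache.entries == target_cache.entries)
```

After the rollback both caches end at the same position, so the replacement token sampled by the target can be fed to both models as the next input.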

Exercise

Build the habit with code.

Implement the accept/reject test on toy distributions.
Simulate expected tokens per step for gamma in {0.6, 0.7, 0.8, 0.9} and K in {3, 5, 7}.
Measure when draft overhead erases speedup by changing draft latency in the notebook.
Optional: try HuggingFace assisted generation with a small target/draft pair and record tokens/sec.

Self-Check

Answer these from memory.

Why does speculative decoding preserve target quality? Because the target model checks every candidate with the ratio test, and a rejected token is replaced by a sample from the residual distribution, so the output still follows the target distribution.
What controls the optimal K? Acceptance rate and draft cost.
Why does high batch reduce benefit? Batching already keeps the target model busy, so there is little idle compute left for verification to exploit and the draft model becomes extra work.
How is Medusa different from a separate draft model? Medusa adds prediction heads to the target hidden state instead of running another autoregressive model.
What must happen to the KV cache after a rejection? The target cache is cropped to the accepted prefix, and the draft cache must be brought back to the same position.

Go deeper.

Primary references and the companion notebook for today's exercise.

Paper: Speculative Decoding. Draft and verify with rejection sampling.

Paper: Medusa. Multiple decoding heads for LLM inference.

Paper: EAGLE. Feature-level speculative drafting.

Blog: HuggingFace assisted generation. Practical API framing.

Notebook: Day 23 notebook. Runnable companion notebook for the lesson.
