DFlash replaces autoregressive drafting with a lightweight block-diffusion drafter. The target model still verifies candidates, but the draft phase can propose a whole block in one parallel pass.
The original Week 4 plan treated DFlash as a repo to investigate. That investigation matters: as of June 11, 2026, z-lab/dflash is an active implementation of block-diffusion speculative decoding with Transformers, vLLM, SGLang, and MLX paths. Today is a source-reading lesson: connect the Day 23 speculative decoding math to a current research implementation.
- block_size is the number of candidate tokens DFlash tries to draft together.
- target_hidden is hidden-state context extracted from selected target layers.
- mask_token_id is the placeholder token embedded for draft positions before denoising.
- acceptance_length is how many proposed tokens survive target verification.
- crop(start) removes unaccepted cache entries after rejection.

DFlash is not a PagedAttention replacement. It is a speculative decoding method. The bottleneck it attacks is the sequential draft phase inside speculative decoding. Standard speculative decoding may still draft token 1, then token 2, then token 3 autoregressively. DFlash instead uses a lightweight block diffusion model to fill a block of masked positions in parallel.
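To make the vocabulary above concrete, here is a minimal toy sketch of one draft-then-verify cycle under greedy verification. It is not the repo's dflash_generate: `target_next`, `toy_draft_block`, `verify_block`, and the arithmetic "models" are hypothetical stand-ins chosen so the example runs without any checkpoint. It also assumes the verifier contributes one token of its own at the first mismatch.

```python
from typing import List, Tuple

VOCAB = 11  # toy vocabulary size (assumption for illustration)


def target_next(context: List[int]) -> int:
    """Toy stand-in for the target model's greedy next token."""
    return (sum(context) + len(context)) % VOCAB


def toy_draft_block(context: List[int], block_size: int) -> List[int]:
    """Toy drafter: agrees with the target except at block index 2."""
    ctx, block = list(context), []
    for i in range(block_size):
        tok = target_next(ctx)
        if i == 2:
            tok = (tok + 1) % VOCAB  # deliberate draft error
        block.append(tok)
        ctx.append(tok)
    return block


def verify_block(context: List[int], block: List[int]) -> Tuple[int, List[int]]:
    """Return (acceptance_length, tokens to append) under greedy verification."""
    # A real target scores every block position in one forward pass;
    # the toy recomputes each position in a loop.
    targets = [target_next(context + block[:i]) for i in range(len(block) + 1)]
    acceptance_length = 0
    for drafted, wanted in zip(block, targets):
        if drafted != wanted:
            break
        acceptance_length += 1
    # Keep the accepted draft tokens, then the target's own token at the
    # first mismatch (or after the block if everything was accepted).
    return acceptance_length, block[:acceptance_length] + [targets[acceptance_length]]


def generate(prefix: List[int], max_new_tokens: int, block_size: int) -> List[int]:
    """Speculative loop: draft a block, verify it, append what survives."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        block = toy_draft_block(tokens, block_size)
        _, new_tokens = verify_block(tokens, block)
        tokens.extend(new_tokens)
    return tokens[len(prefix):len(prefix) + max_new_tokens]


def greedy(prefix: List[int], max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target only."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prefix):]


acceptance_length, new_tokens = verify_block([1, 2, 3], toy_draft_block([1, 2, 3], 4))
print(acceptance_length, new_tokens)  # 2 [9, 8, 6]
```

The property worth checking in any real integration is the same one the toy satisfies: with greedy verification, `generate` produces exactly the tokens `greedy` would, only in larger bursts.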
The public repo is intentionally small. The core files are dflash/model.py, dflash/model_mlx.py, and dflash/benchmark.py. At repository HEAD 94e4abc5e0c31b67bc1a9d30f1cc34ece28a8756, the PyTorch path defines dflash_generate, DFlashDraftModel, Qwen3-specific draft layers, and sampling helpers.
Autoregressive drafting predicts position t+1, feeds that token back in, predicts t+2, and repeats. Block diffusion creates a block of unknown future positions, conditions on context features, and denoises the block together. For inference engineering, the shape change is the main idea: for a block of K drafted tokens, draft cost moves from roughly K sequential forward passes toward one parallel pass.
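The shape change can be seen by counting forward passes. The sketch below is a toy under stated assumptions: `toy_token` replaces a real drafter, `MASK_TOKEN_ID` is a hypothetical placeholder id, and `denoise_steps=1` assumes a single denoising pass. It only illustrates that one pass updates every masked position, while the autoregressive loop needs one pass per token.

```python
from typing import Dict, List

MASK_TOKEN_ID = -1  # hypothetical placeholder id for undrafted positions


def toy_token(position: int) -> int:
    """Toy rule standing in for a drafter's prediction at a position."""
    return (position * 7 + 3) % 13


def draft_autoregressive(context: List[int], k: int, stats: Dict[str, int]) -> List[int]:
    """Draft k tokens one at a time, feeding each token back in."""
    tokens = list(context)
    for _ in range(k):
        stats["passes"] += 1  # one sequential forward pass per drafted token
        tokens.append(toy_token(len(tokens)))
    return tokens[len(context):]


def draft_block(context: List[int], k: int, stats: Dict[str, int],
                denoise_steps: int = 1) -> List[int]:
    """Draft k tokens by filling a masked block; each pass updates all positions."""
    block = [MASK_TOKEN_ID] * k
    for _ in range(denoise_steps):
        stats["passes"] += 1  # one pass covers the whole block
        block = [toy_token(len(context) + i) for i in range(k)]
    return block


ar_stats, block_stats = {"passes": 0}, {"passes": 0}
ar = draft_autoregressive([5, 6, 7], 8, ar_stats)
bl = draft_block([5, 6, 7], 8, block_stats)
print(ar == bl, ar_stats["passes"], block_stats["passes"])  # True 8 1
```

In the toy both drafters produce identical tokens, so the only difference is the pass count. A real block drafter trades some per-position accuracy for that parallelism, which is why acceptance length has to be measured rather than assumed.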
| Backend | Current lesson takeaway |
|---|---|
| Transformers | Useful for reading and small supported-model experiments. |
| vLLM | Serving path uses --speculative-config with method dflash. |
| SGLang | Launch path uses speculative algorithm DFLASH. |
| MLX | Apple Silicon path exists in model_mlx.py; good for this course audience. |
Treat exact install flags as moving parts and verify before using them in production.
DFlash is most attractive when block_size can be large, acceptance length is high, and the serving regime is low-batch or latency-sensitive. It is less attractive when no matching DFlash checkpoint exists, when acceptance is low for your prompt distribution, or when high-concurrency batching already saturates the target model.
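A back-of-envelope model helps turn these conditions into numbers. The sketch below is a simplification, not a measurement: it assumes each cycle emits acceptance_length + 1 tokens (the accepted drafts plus one token from the verifier), and it expresses draft and verification cost as hypothetical ratios relative to one ordinary target decode step.

```python
def estimated_speedup(acceptance_length: float,
                      draft_cost_ratio: float,
                      verify_cost_ratio: float = 1.0) -> float:
    """Rough tokens-per-cost speedup over plain decoding.

    acceptance_length: mean number of drafted tokens accepted per cycle.
    draft_cost_ratio:  draft-phase cost / cost of one target decode step.
    verify_cost_ratio: verification cost / cost of one target decode step.
                       About 1.0 when decoding is memory-bound (low batch);
                       larger when the target is already compute-saturated.
    """
    tokens_per_cycle = acceptance_length + 1.0  # assumption: one verifier token
    cycle_cost = draft_cost_ratio + verify_cost_ratio
    return tokens_per_cycle / cycle_cost


# Illustrative inputs only; none of these numbers come from a benchmark.
print(round(estimated_speedup(5.0, 0.2), 2))                         # 5.0
print(round(estimated_speedup(1.0, 0.2), 2))                         # 1.67
print(round(estimated_speedup(5.0, 0.2, verify_cost_ratio=4.0), 2))  # 1.43
```

The three calls mirror the three cases in the paragraph above: high acceptance at low batch, low acceptance, and a saturated target where verifying a block is no longer nearly free. Replace the ratios with values measured on your own hardware and prompts before drawing conclusions.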
In the capstone, DFlash is an optional accelerator layered on top of the Day 28 generation loop and Day 29 scheduler. It needs a draft module, a target verifier, cache cropping, and scheduler accounting for accepted token bursts.
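Two of those requirements, cache cropping and burst accounting, are easy to get subtly wrong, so a small sketch may help. `ToyCache` and `Request` are hypothetical names for illustration and do not mirror the repo's cache class or the Day 29 scheduler. The crop point used here (prefix plus accepted draft positions) is an assumption, and the real bookkeeping in dflash/model.py may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCache:
    """Stand-in for a KV cache: one entry per cached position."""
    entries: List[int] = field(default_factory=list)

    def extend(self, tokens: List[int]) -> None:
        self.entries.extend(tokens)

    def crop(self, start: int) -> None:
        """Drop every entry at position >= start."""
        del self.entries[start:]

    def __len__(self) -> int:
        return len(self.entries)


@dataclass
class Request:
    """Scheduler-side view of one sequence with a token budget."""
    max_new_tokens: int
    generated: int = 0

    def credit(self, burst: int) -> int:
        """Record an accepted burst, clamped to the remaining budget."""
        emitted = min(burst, self.max_new_tokens - self.generated)
        self.generated += emitted
        return emitted

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens


cache = ToyCache(entries=list(range(10)))  # 10 prefix positions already cached
prefix_len = len(cache)

cache.extend([101, 102, 103, 104])         # verification wrote 4 draft positions
acceptance_length = 2                      # suppose only the first 2 survived
cache.crop(prefix_len + acceptance_length) # discard the rejected positions

request = Request(max_new_tokens=5)
first = request.credit(acceptance_length + 1)   # burst of 3 tokens
second = request.credit(acceptance_length + 1)  # clamped to the 2 remaining
print(len(cache), first, second, request.done)  # 12 3 2 True
```

The clamp in `credit` is the scheduler-facing detail: a burst can overshoot max_new_tokens, so the accounting layer, not the drafter, decides how many of the accepted tokens are actually emitted.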
Read dflash/model.py and identify the target prefill, draft block, target verification, acceptance length, and cache crop.

Add a draft_block -> verify_block accelerator after single-sequence cache correctness.

"DFlash is speculative decoding with the draft bottleneck moved from token-by-token to block-at-once."
Primary references and the companion notebook for today's exercise.