Production Inference Engines Compared

Before building your own engine, know the field. vLLM, TGI, llama.cpp, MLX-LM, TensorRT-LLM, and SGLang optimize for different hardware, request patterns, and operational constraints.

Time: ~170 min

Difficulty: Medium

Prerequisite: Days 20-25

Notebook: day-26-engine-benchmark

Why This Lesson

Why engine choice matters.

A production engine is a bundle of choices: scheduler, KV cache layout, kernels, quantization format, API surface, deployment model, and hardware assumptions. The right choice for a MacBook demo is not the right choice for an H100 fleet.

Learning Objectives

What you should be able to do today.

Map each major engine to its design center of gravity.
Choose an engine for NVIDIA throughput, Apple Silicon, CPU/local, HF ecosystem, and structured generation.
Read feature matrices without confusing marketing claims with measured performance.
Set up a benchmark plan that records TTFT, tokens/sec, peak memory, and output quality notes.
Identify which Days 20-25 optimizations each engine implements.

Notation Cheatsheet

Decode the symbols before using them.

TTFT is time to first token.
TPOT is time per output token.
tok/s is output tokens per second.
concurrency is simultaneous in-flight requests.
TBD in benchmark tables means not measured on your hardware yet.
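
The three timing metrics above can be derived from a request start time and the arrival time of each output token. The sketch below is one reasonable way to compute them, not a standard API. The names `latency_metrics` and `LatencyMetrics` are hypothetical, and it assumes tok/s is measured end to end (including TTFT); some tools report decode-only throughput instead.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LatencyMetrics:
    ttft_s: float            # time to first token, in seconds
    tpot_s: Optional[float]  # mean time per output token after the first
    tok_per_s: float         # output tokens per second, end to end


def latency_metrics(request_start: float,
                    token_times: Sequence[float]) -> LatencyMetrics:
    """Compute TTFT, TPOT, and tok/s from token arrival timestamps.

    Assumption: token_times holds one monotonically increasing timestamp
    per output token, on the same clock as request_start.
    """
    if not token_times:
        raise ValueError("need at least one output token timestamp")
    n = len(token_times)
    ttft = token_times[0] - request_start
    # TPOT is undefined for a single token, so report None in that case.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else None
    total = token_times[-1] - request_start
    tok_per_s = n / total if total > 0 else float("inf")
    return LatencyMetrics(ttft_s=ttft, tpot_s=tpot, tok_per_s=tok_per_s)
```

For a single stream, tok/s approaches 1/TPOT as the output gets long, because TTFT becomes a small share of total time.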

Decision Table

Engines optimize for different constraints.

Engine	Center of gravity	Best for	Main caution
vLLM	Continuous batching, PagedAttention, OpenAI-compatible serving	High-concurrency NVIDIA serving	Python/runtime overhead at tiny batch
TGI	HuggingFace ecosystem and deployment ergonomics	HF-centric teams	Peak throughput can lag specialized stacks
llama.cpp	GGUF, CPU/Metal/CUDA local inference	Local, edge, Apple, oversized models	Not the high-concurrency GPU throughput king
MLX-LM	Apple Silicon native arrays and unified memory	Mac-native experimentation	Apple-only
TensorRT-LLM	NVIDIA kernel fusion and compiled engines	Maximum NVIDIA throughput	Complex build and shape planning
SGLang	RadixAttention and structured generation runtime	Agentic/structured serving	Smaller community and operational track record

Start from constraints, then pick the engine family.
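
The decision table can be read as a lookup from constraints to an engine family. The sketch below encodes only the "Best for" column. The constraint labels (`"high_concurrency"`, `"structured"`, and so on) and the function name `shortlist` are hypothetical names chosen for illustration, and the result is a starting point to benchmark, not a verdict.

```python
from typing import Optional

# (hardware, workload) -> engine family, transcribed from the decision table.
# The key strings are illustrative labels, not an established taxonomy.
ENGINE_BY_CONSTRAINT = {
    ("nvidia", "high_concurrency"): "vLLM",
    ("nvidia", "hf_ecosystem"): "TGI",
    ("nvidia", "max_throughput"): "TensorRT-LLM",
    ("nvidia", "structured"): "SGLang",
    ("apple", "experimentation"): "MLX-LM",
    ("apple", "local"): "llama.cpp",
    ("cpu", "local"): "llama.cpp",
}


def shortlist(hardware: str, workload: str) -> Optional[str]:
    """Return the engine family the table suggests, or None if unmapped."""
    return ENGINE_BY_CONSTRAINT.get((hardware.lower(), workload.lower()))


if __name__ == "__main__":
    print(shortlist("NVIDIA", "structured"))
    print(shortlist("cpu", "local"))
```

A `None` result means the table has no opinion for that combination, which is itself a prompt to measure rather than guess.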

Do Not Invent Benchmarks

A professional page starts benchmark cells as TBD.

Benchmark numbers depend on hardware, model, quantization, prompt, and concurrency, so no single table applies everywhere. The notebook therefore creates a measurement template with TBD cells for you to fill in on your own machine.
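
A template with TBD cells might look like the sketch below. This is not the notebook's actual code. The helper names `make_template` and `to_csv` and the reduced column set are assumptions made for illustration.

```python
import csv
import io
from itertools import product
from typing import Dict, List, Sequence

# Columns that must be measured locally; they start as the literal string "TBD".
MEASURED = ["ttft_ms", "tok_per_s", "peak_mem_gb"]


def make_template(engines: Sequence[str],
                  concurrencies: Sequence[int]) -> List[Dict[str, object]]:
    """Build one row per (engine, concurrency) pair with unmeasured cells."""
    rows = []
    for engine, concurrency in product(engines, concurrencies):
        row: Dict[str, object] = {"engine": engine, "concurrency": concurrency}
        row.update({metric: "TBD" for metric in MEASURED})
        rows.append(row)
    return rows


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize the template so it can be filled in by hand or by a harness."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(make_template(["vLLM", "llama.cpp"], [1, 4, 8, 16])))
```

Keeping "TBD" as an explicit value makes an unmeasured cell impossible to mistake for a result.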

Different engines emphasize different parts of the optimization stack.

Feature Matrix

Feature checklists are maps, not proof.

Feature	vLLM	TGI	llama.cpp	MLX-LM	TensorRT-LLM	SGLang
Continuous batching	Yes	Yes	Limited/dynamic	Local loop	In-flight	Yes
Paged/prefix KV	Yes	Evolving	Direct/local	Direct/local	Yes	Radix/prefix
Quantization	GPTQ/AWQ/FP8 paths	HF quant paths	GGUF K-quants	MLX quant	FP8/INT4/AWQ	Engine-dependent
Best hardware	NVIDIA	NVIDIA	CPU/Metal/CUDA	Apple Silicon	NVIDIA	NVIDIA

Professional benchmarking records measurements from your hardware, not copied guesses.

Benchmark Harness

Record the workload, not just the result.

Each row should record engine, version, model, quantization, hardware, concurrency, prompt_tokens, output_tokens, ttft_ms, tok_per_s, peak_mem_gb, and notes. Sweep batch size or concurrency over 1, 4, 8, and 16 where the engine supports it.
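
The row schema and the concurrency sweep can be sketched as a small engine-agnostic harness. Everything here is an assumption for illustration: `BenchRow`, `run_level`, and `fake_generate` are hypothetical names, and the harness expects you to wrap your engine in a callable that returns its own TTFT and output token count. Peak memory is left unset because measuring it is engine- and platform-specific.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass
from typing import Callable, Optional, Tuple


@dataclass
class BenchRow:
    engine: str
    version: str
    model: str
    quantization: str
    hardware: str
    concurrency: int
    prompt_tokens: int
    output_tokens: int
    ttft_ms: Optional[float] = None
    tok_per_s: Optional[float] = None
    peak_mem_gb: Optional[float] = None  # fill in from your own memory probe
    notes: str = ""


# A generate callable returns (ttft_seconds, output_token_count) for one request.
GenerateFn = Callable[[], Tuple[float, int]]


def run_level(generate: GenerateFn, concurrency: int,
              warmup: int = 1, **meta) -> BenchRow:
    """Issue `concurrency` simultaneous requests and summarize them in one row."""
    for _ in range(warmup):  # warm caches and lazy initialization first
        generate()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(generate) for _ in range(concurrency)]
        results = [f.result() for f in futures]
    wall = time.perf_counter() - start
    ttfts = [ttft for ttft, _ in results]
    total_tokens = sum(count for _, count in results)
    return BenchRow(concurrency=concurrency,
                    ttft_ms=1000 * statistics.median(ttfts),
                    tok_per_s=total_tokens / wall,
                    **meta)


def fake_generate() -> Tuple[float, int]:
    """Stand-in for a real engine call: sleeps briefly, reports fixed values."""
    time.sleep(0.01)
    return 0.005, 8


if __name__ == "__main__":
    meta = dict(engine="fake", version="0", model="none", quantization="none",
                hardware="local", prompt_tokens=16, output_tokens=8)
    for level in [1, 4, 8, 16]:
        print(asdict(run_level(fake_generate, level, **meta)))
```

Threads are enough here because a real `generate` would spend its time waiting on a server or native library; swap in your engine's client call and keep the rest of the row unchanged across engines.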

Memory strategy changes how much useful work fits into the same window.

How to Read Results

The winner depends on the shape of traffic.

At batch 1, launch overhead, sampling, and memory bandwidth matter. At high concurrency, scheduler quality and KV memory management dominate. The winner is the engine that satisfies your deployment constraint with measured headroom.
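
"Satisfies your deployment constraint with measured headroom" can be made concrete as a filter followed by a ranking. The sketch below assumes rows are dictionaries with `ttft_ms`, `tok_per_s`, and `peak_mem_gb` already measured. The function names and the default 20% headroom are illustrative choices, not recommendations from any engine's documentation.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def meets_constraints(row: Row, ttft_slo_ms: float, mem_budget_gb: float,
                      headroom: float = 0.2) -> bool:
    """True if the row stays under both limits with the given safety margin."""
    margin = 1.0 - headroom
    return (row["ttft_ms"] <= ttft_slo_ms * margin
            and row["peak_mem_gb"] <= mem_budget_gb * margin)


def pick_winner(rows: List[Row], ttft_slo_ms: float, mem_budget_gb: float,
                headroom: float = 0.2) -> Optional[Row]:
    """Among rows that satisfy the constraints, return the highest throughput."""
    eligible = [r for r in rows
                if meets_constraints(r, ttft_slo_ms, mem_budget_gb, headroom)]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["tok_per_s"])
```

The order matters: throughput only breaks ties among engines that already fit the latency and memory limits, so a faster engine that misses the TTFT target never wins.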

A repeatable harness matters more than a single impressive run.

Did You Know?

A detail worth remembering.

TensorRT-LLM often wins peak NVIDIA throughput because it turns model execution into compiled engine plans, but that same compilation step is why it feels heavier than Python-first engines.

Exercise

Build the habit with code.

Pick two engines that run on your hardware.
Use the notebook schema to record version, model, quantization, prompt length, output length, TTFT, throughput, and memory.
Run at least three concurrency or batch settings.
Write one page explaining which engine you would deploy and what constraint drove the choice.

Self-Check

Answer these from memory.

Why should benchmark cells start as TBD? Because performance must be measured on the actual hardware/model/request shape.
When is TensorRT-LLM worth its complexity? When NVIDIA throughput at scale matters enough to justify compile and deployment friction.
Why is llama.cpp strong locally? GGUF and mature CPU/Metal paths make it practical on consumer hardware.
What does SGLang add conceptually? Runtime support for structured generation and prefix sharing.
What is a fair engine comparison? Same model, quantization, prompts, output lengths, hardware, warmup, and metrics.
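
The fairness criteria in the last answer can be checked mechanically before two rows are compared. This is a minimal sketch: `CONTROLLED` and `mismatched_fields` are hypothetical names, and `warmup_runs` is an assumed extra column that the schema in this lesson does not define.

```python
from typing import Dict, List

# Fields that must be identical for a comparison between two engines to be fair.
CONTROLLED = ("model", "quantization", "hardware",
              "prompt_tokens", "output_tokens", "warmup_runs")


def mismatched_fields(a: Dict[str, object], b: Dict[str, object]) -> List[str]:
    """List the controlled fields on which two benchmark rows differ."""
    return [field for field in CONTROLLED if a.get(field) != b.get(field)]
```

An empty list means the two rows differ only in engine and measured results; anything else names the variable that still needs to be held constant.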