Before building your own engine, know the field. vLLM, TGI, llama.cpp, MLX-LM, TensorRT-LLM, and SGLang optimize for different hardware, request patterns, and operational constraints.
A production engine is a bundle of choices: scheduler, KV cache layout, kernels, quantization format, API surface, deployment model, and hardware assumptions. The right choice for a MacBook demo is not the right choice for an H100 fleet.
TTFT is time to first token. TPOT is time per output token. tok/s is output tokens per second. Concurrency is the number of simultaneous in-flight requests. TBD in a benchmark table means the value has not yet been measured on your hardware.

| Engine | Center of gravity | Best for | Main caution |
|---|---|---|---|
| vLLM | Continuous batching, PagedAttention, OpenAI-compatible serving | High-concurrency NVIDIA serving | Python/runtime overhead at tiny batch |
| TGI | HuggingFace ecosystem and deployment ergonomics | HF-centric teams | Peak throughput can lag specialized stacks |
| llama.cpp | GGUF, CPU/Metal/CUDA local inference | Local, edge, Apple, oversized models | Not the high-concurrency GPU throughput king |
| MLX-LM | Apple Silicon native arrays and unified memory | Mac-native experimentation | Apple-only |
| TensorRT-LLM | NVIDIA kernel fusion and compiled engines | Maximum NVIDIA throughput | Complex build and shape planning |
| SGLang | RadixAttention and structured generation runtime | Agentic/structured serving | Smaller production footprint and ecosystem |
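
The metrics defined before the table (TTFT, TPOT, tok/s) can all be derived from per-token arrival timestamps. The sketch below is a minimal illustration, not any engine's API: `RequestTiming` and the three helper functions are hypothetical names, and the choice to compute tok/s over the full request (including the wait for the first token) is an assumption, since tools differ on that convention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Timestamps (seconds) for one streamed request. Hypothetical structure."""
    sent_at: float             # when the request was submitted
    token_times: List[float]   # arrival time of each output token, in order


def ttft_ms(t: RequestTiming) -> float:
    """Time to first token, in milliseconds."""
    return (t.token_times[0] - t.sent_at) * 1000.0


def tpot_ms(t: RequestTiming) -> float:
    """Mean time per output token after the first one, in milliseconds."""
    n = len(t.token_times)
    if n < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (t.token_times[-1] - t.token_times[0]) / (n - 1) * 1000.0


def tok_per_s(t: RequestTiming) -> float:
    """Output tokens per second over the whole request (assumed convention)."""
    return len(t.token_times) / (t.token_times[-1] - t.sent_at)


# Illustrative timestamps only, not a measurement of any engine.
timing = RequestTiming(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft_ms(timing), tpot_ms(timing), round(tok_per_s(timing), 2))
```

Whichever convention you pick for tok/s, state it in the notes column so results from different engines stay comparable.
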
Benchmark numbers are hardware-, model-, quantization-, prompt-, and concurrency-dependent. A professional course page should not pretend one table applies everywhere. The notebook therefore creates a measurement template with TBD cells. Fill it on your machine.
| Feature | vLLM | TGI | llama.cpp | MLX-LM | TensorRT-LLM | SGLang |
|---|---|---|---|---|---|---|
| Continuous batching | Yes | Yes | Limited/dynamic | Local loop | In-flight | Yes |
| Paged/prefix KV | Yes | Evolving | Direct/local | Direct/local | Yes | Radix/prefix |
| Quantization | GPTQ/AWQ/FP8 paths | HF quant paths | GGUF K-quants | MLX quant | FP8/INT4/AWQ | Engine-dependent |
| Best hardware | NVIDIA | NVIDIA | CPU/Metal/CUDA | Apple Silicon | NVIDIA | NVIDIA |
Each row should include engine, version, model, quantization, hardware, concurrency, prompt_tokens, output_tokens, ttft_ms, tok_per_s, peak_mem_gb, and notes. Sweep batch size or concurrency over [1, 4, 8, 16] wherever the engine supports it.
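
One way to build such a template is sketched below. This is not the companion notebook's code: `build_template` and `to_csv` are hypothetical helpers, and the only things taken from the text are the column names, the engine list, and the concurrency levels. Every unmeasured cell starts as TBD.

```python
import csv
import io
from itertools import product

# Column names as listed in the text.
FIELDS = [
    "engine", "version", "model", "quantization", "hardware", "concurrency",
    "prompt_tokens", "output_tokens", "ttft_ms", "tok_per_s", "peak_mem_gb",
    "notes",
]
ENGINES = ["vLLM", "TGI", "llama.cpp", "MLX-LM", "TensorRT-LLM", "SGLang"]
CONCURRENCY = [1, 4, 8, 16]


def build_template(engines=ENGINES, levels=CONCURRENCY):
    """Return one row per (engine, concurrency) pair with TBD placeholders."""
    rows = []
    for engine, level in product(engines, levels):
        row = {field: "TBD" for field in FIELDS}
        row["engine"] = engine
        row["concurrency"] = level
        row["notes"] = ""  # free text, left empty until there is something to say
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


template = build_template()
print(to_csv(template).splitlines()[0])  # header row
print(len(template), "rows to fill in")
```

Drop the rows for engines or concurrency levels your hardware cannot run, rather than leaving them as TBD forever.
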
At batch 1, launch overhead, sampling, and memory bandwidth matter. At high concurrency, scheduler quality and KV memory management dominate. The winner is the engine that satisfies your deployment constraint with measured headroom.
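
The "constraint with measured headroom" rule can be written as a simple filter over the filled-in table. The sketch below is one possible reading: the `Measurement` and `Constraint` types, the 20% default margin, and the engine names `engine_a` and `engine_b` are all hypothetical, and the numbers are placeholders rather than benchmark results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Measurement:
    engine: str
    concurrency: int
    ttft_ms: float
    tok_per_s: float
    peak_mem_gb: float


@dataclass
class Constraint:
    concurrency: int        # the load level you actually expect
    max_ttft_ms: float      # latency budget
    min_tok_per_s: float    # throughput floor
    mem_budget_gb: float    # memory available on the target device
    headroom: float = 0.2   # assumed safety margin (20%)


def satisfies(m: Measurement, c: Constraint) -> bool:
    """True if the measurement meets every limit with the margin to spare."""
    return (
        m.concurrency == c.concurrency
        and m.ttft_ms <= c.max_ttft_ms * (1 - c.headroom)
        and m.tok_per_s >= c.min_tok_per_s * (1 + c.headroom)
        and m.peak_mem_gb <= c.mem_budget_gb * (1 - c.headroom)
    )


def candidates(ms: List[Measurement], c: Constraint) -> List[str]:
    """Engines whose measured numbers pass the constraint."""
    return [m.engine for m in ms if satisfies(m, c)]


# Placeholder values for illustration only.
measurements = [
    Measurement("engine_a", 8, ttft_ms=300.0, tok_per_s=900.0, peak_mem_gb=60.0),
    Measurement("engine_b", 8, ttft_ms=150.0, tok_per_s=700.0, peak_mem_gb=40.0),
]
constraint = Constraint(concurrency=8, max_ttft_ms=250.0,
                        min_tok_per_s=500.0, mem_budget_gb=64.0)
print(candidates(measurements, constraint))
```

In this made-up case the engine with the higher throughput is rejected because it misses the latency and memory limits, which is the point of choosing by constraint instead of by a single headline number.
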
"An inference engine is an opinionated answer to one workload, one hardware target, and one operations model."
Primary references and the companion notebook for today's exercise.