LLM Inference Engineer · Day 26
Day 26 · Week 4 · Optimization & Capstone

Production Inference Engines Compared

Before building your own engine, know the field. vLLM, TGI, llama.cpp, MLX-LM, TensorRT-LLM, and SGLang optimize for different hardware, request patterns, and operational constraints.

Time: ~170 min
Difficulty: Medium
Prerequisite: Days 20-25
Notebook: day-26-engine-benchmark

Why This Lesson

Why engine choice matters.

A production engine is a bundle of choices: scheduler, KV cache layout, kernels, quantization format, API surface, deployment model, and hardware assumptions. The right choice for a MacBook demo is not the right choice for an H100 fleet.

Learning Objectives

What you should be able to do today.

  1. Map each major engine to its design center of gravity.
  2. Choose an engine for NVIDIA throughput, Apple Silicon, CPU/local, HF ecosystem, and structured generation.
  3. Read feature matrices without confusing marketing claims with measured performance.
  4. Set up a benchmark plan that records TTFT, tokens/sec, peak memory, and output quality notes.
  5. Identify which Days 20-25 optimizations each engine implements.
Notation Cheatsheet

Decode the symbols before using them.

  • TTFT is time to first token.
  • TPOT is time per output token, averaged over the tokens after the first.
  • tok/s is output tokens per second.
  • concurrency is the number of simultaneous in-flight requests.
  • TBD in benchmark tables means not measured on your hardware yet.
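
As a rough illustration of how these quantities relate, the sketch below derives them from per-token arrival timestamps. The names `LatencyMetrics` and `latency_metrics` are hypothetical, and counting TTFT inside the tok/s window is an assumption of this sketch; some tools report decode-only throughput instead.

```python
from dataclasses import dataclass


@dataclass
class LatencyMetrics:
    ttft_ms: float    # time to first token, in milliseconds
    tpot_ms: float    # mean time per output token after the first
    tok_per_s: float  # output tokens per second over the whole request


def latency_metrics(request_start: float, token_times: list[float]) -> LatencyMetrics:
    """Derive TTFT, TPOT, and tok/s from timestamps given in seconds."""
    if not token_times:
        raise ValueError("need at least one generated token")
    n = len(token_times)
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    if total <= 0:
        raise ValueError("token timestamps must come after request_start")
    # TPOT averages the gaps between consecutive tokens, so it excludes TTFT.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    # Assumption: throughput is measured end to end, including the TTFT wait.
    return LatencyMetrics(ttft * 1000.0, tpot * 1000.0, n / total)


# Five tokens: the first arrives at 0.5 s, then one every 0.25 s.
m = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)  # ttft_ms=500.0, tpot_ms=250.0, tok_per_s is about 3.33
```

Keeping TTFT and TPOT separate matters because the first is dominated by prefill and queueing while the second reflects steady-state decode.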
Decision Table

Engines optimize for different constraints.

Engine | Center of gravity | Best for | Main caution
vLLM | Continuous batching, PagedAttention, OpenAI-compatible serving | High-concurrency NVIDIA serving | Python/runtime overhead at tiny batch
TGI | Hugging Face ecosystem and deployment ergonomics | HF-centric teams | Peak throughput can lag specialized stacks
llama.cpp | GGUF, CPU/Metal/CUDA local inference | Local, edge, Apple, oversized models | Not the high-concurrency GPU throughput king
MLX-LM | Apple Silicon native arrays and unified memory | Mac-native experimentation | Apple-only
TensorRT-LLM | NVIDIA kernel fusion and compiled engines | Maximum NVIDIA throughput | Complex build and shape planning
SGLang | RadixAttention and structured generation runtime | Agentic/structured serving | Smaller ecosystem and operational track record

[Figure: Engine to Deployment Fit. vLLM: concurrency; TGI: HF deploy; llama.cpp: local/edge; MLX-LM: Apple Silicon; TensorRT-LLM: max NVIDIA; SGLang: structured output.]
Start from constraints, then pick the engine family.
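
One way to make "start from constraints" concrete is to encode the "Best for" column of the decision table as a lookup. This is only a sketch: the constraint keys and the `shortlist` helper are hypothetical names invented here, and the result is a list of candidates to benchmark, not a verdict.

```python
# Mirrors the "Best for" column of the decision table.
# The constraint keys are hypothetical labels chosen for this sketch.
ENGINE_BY_CONSTRAINT = {
    "nvidia_high_concurrency": "vLLM",
    "hf_ecosystem": "TGI",
    "cpu_local_edge": "llama.cpp",
    "apple_silicon_native": "MLX-LM",
    "nvidia_max_throughput": "TensorRT-LLM",
    "structured_generation": "SGLang",
}


def shortlist(constraints: list[str]) -> list[str]:
    """Return candidate engines in the order the constraints were given."""
    candidates: list[str] = []
    for constraint in constraints:
        if constraint not in ENGINE_BY_CONSTRAINT:
            raise KeyError(f"unknown constraint: {constraint}")
        engine = ENGINE_BY_CONSTRAINT[constraint]
        if engine not in candidates:  # keep each engine once
            candidates.append(engine)
    return candidates


picks = shortlist(["apple_silicon_native", "cpu_local_edge"])
print(picks)  # ['MLX-LM', 'llama.cpp']
```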
Do Not Invent Benchmarks

A professional page starts benchmark cells as TBD.

Benchmark numbers are hardware-, model-, quantization-, prompt-, and concurrency-dependent. A professional course page should not pretend one table applies everywhere. The notebook therefore creates a measurement template with TBD cells. Fill it on your machine.

[Figure: Optimization Coverage Snapshot. batching: 12/16 active; KV cache: 10/16 active; quant: 11/16 active.]
Different engines emphasize different parts of the optimization stack.
Feature Matrix

Feature checklists are maps, not proof.

Feature | vLLM | TGI | llama.cpp | MLX-LM | TensorRT-LLM | SGLang
Continuous batching | Yes | Yes | Limited/dynamic | Local loop | In-flight | Yes
Paged/prefix KV | Yes | Evolving | Direct/local | Direct/local | Yes | Radix/prefix
Quantization | GPTQ/AWQ/FP8 paths | HF quant paths | GGUF K-quants | MLX quant | FP8/INT4/AWQ | Engine-dependent
Best hardware | NVIDIA | NVIDIA | CPU/Metal/CUDA | Apple Silicon | NVIDIA | NVIDIA

[Figure: Benchmark Table Starts Empty. TTFT: TBD; tok/s: TBD; peak memory: TBD; quality notes: TBD.]
Professional benchmarking records measurements from your hardware, not copied guesses.
Benchmark Harness

Record the workload, not just the result.

Each row should record: engine, version, model, quantization, hardware, concurrency, prompt_tokens, output_tokens, ttft_ms, tok_per_s, peak_mem_gb, notes. Sweep batch size or concurrency over [1, 4, 8, 16] where the engine supports it.
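
A minimal sketch of that row schema follows, with measured cells defaulting to TBD until you fill them in. The field names come from the list above; the `BenchmarkRow` class, the helper functions, and the placeholder model and hardware strings are assumptions of this sketch and may not match the companion notebook.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from itertools import product
from typing import Union

TBD = "TBD"  # marks a cell that has not been measured on this machine


@dataclass
class BenchmarkRow:
    engine: str
    version: str
    model: str
    quantization: str
    hardware: str
    concurrency: int
    prompt_tokens: int
    output_tokens: int
    ttft_ms: Union[float, str] = TBD
    tok_per_s: Union[float, str] = TBD
    peak_mem_gb: Union[float, str] = TBD
    notes: str = ""


def empty_matrix(engines, model, quantization, hardware,
                 prompt_tokens, output_tokens, concurrencies=(1, 4, 8, 16)):
    """One unmeasured row per (engine, concurrency) pair."""
    return [
        BenchmarkRow(engine, TBD, model, quantization, hardware,
                     concurrency, prompt_tokens, output_tokens)
        for engine, concurrency in product(engines, concurrencies)
    ]


def to_csv(rows) -> str:
    """Serialize rows to CSV text with one column per schema field."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(BenchmarkRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()


# Placeholder values below are hypothetical; substitute your own setup.
rows = empty_matrix(["vLLM", "llama.cpp"], "example-model", "example-quant",
                    "my-machine", prompt_tokens=512, output_tokens=128)
print(to_csv(rows))
```

The version column also starts as TBD here so that it gets pinned when the run actually happens rather than guessed in advance.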

[Figure: 60 Second Serving Window. Labels: preallocated, reserved, idle, active, fragmented, paged/prefix, reuse.]
Memory strategy changes how much useful work fits into the same window.
How to Read Results

The winner depends on the shape of traffic.

At batch 1, launch overhead, sampling, and memory bandwidth matter. At high concurrency, scheduler quality and KV memory management dominate. The winner is the engine that satisfies your deployment constraint with measured headroom.

[Figure: Benchmark Workflow. Steps: choose model, pin versions, warm up, run matrix, write analysis.]
A repeatable harness matters more than a single impressive run.
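
The warm-up and run-matrix steps of that workflow might look like the sketch below. It is engine-agnostic: `fake_stream` is a hypothetical stand-in for a real streaming client, and `time_stream` and `run_matrix` are names invented for this example. It runs requests sequentially, so a real concurrency sweep would need threads or async on top.

```python
import time
from typing import Callable, Iterator


def fake_stream(prompt: str, max_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for an engine client that streams tokens."""
    for i in range(max_tokens):
        time.sleep(0.001)  # simulated per-token latency
        yield f"tok{i}"


def time_stream(stream_fn: Callable[[str, int], Iterator[str]],
                prompt: str, max_tokens: int) -> dict:
    """Time one streamed request and return TTFT and throughput."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream_fn(prompt, max_tokens):
        if first is None:
            first = time.perf_counter()  # arrival of the first token
        count += 1
    end = time.perf_counter()
    if first is None:
        raise RuntimeError("stream produced no tokens")
    return {
        "ttft_ms": (first - start) * 1000.0,
        "tok_per_s": count / (end - start),
        "output_tokens": count,
    }


def run_matrix(stream_fn, prompts, max_tokens, warmup=1, repeats=3):
    """Discard warm-up runs, then time every prompt `repeats` times."""
    for _ in range(warmup):
        for _ in stream_fn(prompts[0], max_tokens):
            pass  # results intentionally thrown away
    return [time_stream(stream_fn, p, max_tokens)
            for p in prompts for _ in range(repeats)]


results = run_matrix(fake_stream, ["short prompt", "longer prompt"],
                     max_tokens=5, warmup=1, repeats=2)
for r in results:
    print(r)
```

Repeating each cell and keeping every sample, rather than only the best one, is what makes the analysis step trustworthy.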
Did You Know?

A detail worth remembering.

TensorRT-LLM often wins peak NVIDIA throughput because it turns model execution into compiled engine plans, but that same compilation step is why it feels heavier than Python-first engines.
Exercise

Build the habit with code.

  1. Pick two engines that run on your hardware.
  2. Use the notebook schema to record version, model, quantization, prompt length, output length, TTFT, throughput, and memory.
  3. Run at least three concurrency or batch settings.
  4. Write one page explaining which engine you would deploy and what constraint drove the choice.
Self-Check

Answer these from memory.

  1. Why should benchmark cells start as TBD? Because performance must be measured on the actual hardware/model/request shape.
  2. When is TensorRT-LLM worth its complexity? When NVIDIA throughput at scale matters enough to justify compile and deployment friction.
  3. Why is llama.cpp strong locally? GGUF and mature CPU/Metal paths make it practical on consumer hardware.
  4. What does SGLang add conceptually? Runtime support for structured generation and prefix sharing.
  5. What is a fair engine comparison? Same model, quantization, prompts, output lengths, hardware, warmup, and metrics.
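
The fairness rule in the last answer can be checked mechanically before two rows are compared. The sketch below assumes rows are plain dicts using the benchmark field names from this lesson; `CONTROLLED` and `mismatched_fields` are hypothetical names, and treating concurrency as a controlled variable is an assumption added here.

```python
# Fields that must match for a like-for-like comparison.
# Including "concurrency" is an assumption of this sketch.
CONTROLLED = ("model", "quantization", "hardware",
              "prompt_tokens", "output_tokens", "concurrency")


def mismatched_fields(row_a: dict, row_b: dict) -> list[str]:
    """Return the controlled fields on which two benchmark rows differ."""
    return [k for k in CONTROLLED if row_a.get(k) != row_b.get(k)]


# Placeholder values are hypothetical.
a = {"engine": "vLLM", "model": "example-model", "quantization": "example-quant",
     "hardware": "my-machine", "prompt_tokens": 512, "output_tokens": 128,
     "concurrency": 8}
b = dict(a, engine="SGLang", prompt_tokens=256)

diff = mismatched_fields(a, b)
print(diff)  # ['prompt_tokens']
```

An empty list means the two rows differ only in engine-side variables, which is the comparison you actually want.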

"An inference engine is an opinionated answer to one workload, one hardware target, and one operations model."

Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

Repo: vLLM. PagedAttention and high-concurrency serving.
Repo: Text Generation Inference. Hugging Face production server.
Repo: llama.cpp. GGUF local inference stack.
Docs: TensorRT-LLM. NVIDIA compiled inference runtime.
Notebook: Day 26 notebook. Runnable companion notebook for the lesson.