Pre-training is expensive and rare. Fine-tuning is cheap and constant — it's how a base model becomes a chat assistant, a coding model, or a domain specialist. Today you learn supervised fine-tuning and the parameter-efficient methods (LoRA, QLoRA) that let you adapt a 7B model on a single consumer GPU, then serve the result with zero overhead or with live adapter swapping.
Pre-training a frontier model costs millions of dollars and is done by a handful of labs. Fine-tuning takes that base model, which already knows language, and specializes it: turning a raw next-token predictor into an instruction-following assistant, a SQL generator, a medical-notes summarizer. This is the part of the training stack that most engineers actually touch, and LoRA and QLoRA have made it far more accessible: they let you adapt a large model with a small fraction of the memory and produce adapters measured in megabytes, not gigabytes.
For an inference engineer this matters twice. First, the artifact you serve is usually a fine-tuned model, and LoRA adapters can be merged into the base weights or served dynamically per request — both are inference-system design decisions with direct throughput and cost implications. Second, the quantization ideas behind QLoRA (4-bit NF4) are a direct preview of Week 4's quantization lessons. Fine-tuning is where training and inference start to blur.
W + (α/r)·BA, explain why it works, count trainable params, and choose rank and target modules.
The term "fine-tuning" is used loosely to mean several different things. Let's pin them down, because each has different data requirements, cost, and inference implications.
Take a base model and keep running the standard next-token prediction loss on new raw text — medical papers, legal documents, a new language, code in a niche framework. No chat template, no instruction pairs. This teaches the model knowledge and vocabulary in a new domain without changing its behavior format. The cost profile is similar to pre-training: all weights update, all the Adam states are alive. Used when you need deep domain adaptation before task-specific tuning.
Train on (prompt, response) pairs with a chat template. The model learns to follow instructions in the expected format. This is how GPT-3 became InstructGPT, how Llama-2-base became Llama-2-chat. Full SFT updates all weights; PEFT-SFT (LoRA/QLoRA) updates only a small adapter. Both use the same data; the distinction is which parameters move.
Freeze the pre-trained weights and train only a small number of new parameters. The dominant method is LoRA. Others include prefix tuning (learned key/value vectors prepended at every attention layer), prompt tuning (learned soft tokens prepended at the embedding layer only), and adapter layers (tiny bottleneck MLPs inserted between transformer sub-layers). LoRA wins in practice: it adds no inference latency when merged, and the quality gap with full SFT is small for most tasks.
For SFT and PEFT, each training example is a formatted conversation. The exact format depends on the model family's chat template: a strict convention of special tokens marking system/user/assistant roles. Here is a ChatML-style example (used by Qwen and others):
<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a Python function that reverses a linked list.<|im_end|>
<|im_start|>assistant
def reverse_linked_list(head):
    prev = None
    current = head
    while current:
        nxt = current.next
        current.next = prev
        prev = current
        current = nxt
    return prev<|im_end|>
The entire formatted string is tokenized and fed as a single sequence. The crucial detail is loss masking: you compute the cross-entropy loss only on the assistant turn tokens. The system and user tokens are masked to -100 so PyTorch's cross-entropy ignores them. You want the model to learn to produce good responses, not to predict the user's questions — which are given to it at inference time anyway.
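Labels for the system and user tokens are set to -100; PyTorch's cross_entropy(ignore_index=-100) skips them. Only the assistant-turn tokens accumulate gradients. This ensures the model learns to generate responses, not memorize prompts.
import torch.nn.functional as F

# Loss masking in code: the essential pattern.
# Assumes logits has shape (batch, seq_len, vocab_size) and every row shares prompt_len.
labels = input_ids.clone()
labels[:, :prompt_len] = -100              # mask system + user tokens
# Causal LM: the logits at position t predict token t+1, so shift by one.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
    ignore_index=-100,                     # masked positions add nothing to the loss
)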
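To see the masking effect numerically without a GPU or PyTorch, here is a minimal dependency-free sketch. The function name `masked_cross_entropy` and the toy logits are illustrative assumptions, and the logits are taken to be already aligned with their targets (no shift needed here).

```python
import math

IGNORE_INDEX = -100  # the same sentinel value PyTorch's cross_entropy ignores by default

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits[t] is the list of vocabulary scores predicting labels[t].
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, labels):
        if target == ignore_index:
            continue  # prompt position: contributes nothing to the loss
        peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0

# Toy sequence: 2 prompt positions, then 2 response positions, vocabulary of 3.
logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
targets = [1, 2, 2, 0]
prompt_len = 2

labels = list(targets)
labels[:prompt_len] = [IGNORE_INDEX] * prompt_len  # mask system + user tokens

loss_masked = masked_cross_entropy(logits, labels)
loss_unmasked = masked_cross_entropy(logits, targets)
print(f"masked: {loss_masked:.4f}  unmasked: {loss_unmasked:.4f}")
```

Changing the logits at the two prompt positions changes the unmasked loss but leaves the masked loss untouched, which is exactly the property the training loop relies on.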
One detail engineers trip over: every model family has its own chat template, such as ChatML, LLaMA-2's [INST]/[/INST], or Alpaca's ### Instruction:/### Response:. At inference time, you must apply exactly the same template that was used in training. Using the wrong template silently degrades quality, sometimes catastrophically. Hugging Face tokenizers now ship apply_chat_template() to reduce this class of bug.
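As a rough illustration of what a chat template does, the sketch below renders a message list into the ChatML-style layout shown earlier. `format_chatml` is a hypothetical helper written for this lesson, not a library function; in practice you would rely on the template bundled with the model's tokenizer so training and inference cannot drift apart.

```python
def format_chatml(messages, add_generation_prompt=False):
    """Render [{"role": ..., "content": ...}, ...] as a ChatML-style string.

    Hypothetical helper for illustration; real templates ship with the tokenizer.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that reverses a linked list."},
]

# Inference: stop after the open assistant header and let the model generate.
prompt = format_chatml(messages, add_generation_prompt=True)
print(prompt)
```

For training, the same function would be called with the assistant message included and `add_generation_prompt=False`, so the string the model sees at inference is a strict prefix of what it saw during fine-tuning.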
You saw on Day 11 that training a model is far more memory-hungry than running inference. The gap is stark and worth understanding quantitatively before LoRA makes sense.
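In mixed-precision training (the default since Day 10), each parameter lives in multiple copies simultaneously:
- fp16/bf16 working weights: 2 bytes
- fp16/bf16 gradients: 2 bytes
- fp32 master weights: 4 bytes
- fp32 Adam first moment (m): 4 bytes
- fp32 Adam second moment (v): 4 bytes
That totals 16 bytes per parameter. For a 7B model:
7 × 10⁹ parameters × 16 bytes ≈ 112 GB, before counting activations.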
This arithmetic explains the entire point of PEFT. If the base weights are frozen, you only need gradients and Adam states for the tiny adapter: a 7B model's LoRA adapter at rank 16 on all attention and MLP projections has roughly 40 M parameters (0.6%), requiring just ~0.6 GB of training state at 16 bytes per parameter (rank 16 on the attention projections alone is closer to 17 M). The frozen base sits in fp16 at 14 GB. Total: under 16 GB before activations, which fits on a single consumer GPU. QLoRA then compresses the frozen base to 4-bit (≈3.5 GB), pushing the total under 6 GB and enabling fine-tuning on an RTX 3090.
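The arithmetic above is easy to turn into a small estimator. This is a back-of-the-envelope sketch: the function names are made up for this lesson, activations and framework overhead are ignored, and the adapter is assumed to cost the same 16 bytes per parameter as full mixed-precision training.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text

def full_ft_gb(n_params, bytes_per_param=16):
    """Weights + gradients + fp32 master copy + two Adam moments for every parameter."""
    return n_params * bytes_per_param / GB

def lora_gb(n_params, n_adapter, base_bytes=2, adapter_bytes=16):
    """Frozen 16-bit base plus full training state for the adapter only."""
    return (n_params * base_bytes + n_adapter * adapter_bytes) / GB

def qlora_gb(n_params, n_adapter, base_bits=4, adapter_bytes=16):
    """Frozen 4-bit base (quantization constants ignored) plus adapter training state."""
    return (n_params * base_bits / 8 + n_adapter * adapter_bytes) / GB

n_params, n_adapter = 7e9, 40e6  # 7B base, ~40 M adapter parameters
print(f"full FT: {full_ft_gb(n_params):6.1f} GB")
print(f"LoRA:    {lora_gb(n_params, n_adapter):6.1f} GB")
print(f"QLoRA:   {qlora_gb(n_params, n_adapter):6.1f} GB")
```

Treat the outputs as lower bounds: long sequences and large batches add activation memory on top of these figures for all three methods.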
LoRA (Low-Rank Adaptation) rests on an empirical observation from the paper: the weight update a model needs during fine-tuning has low intrinsic rank. Even if W is a 4096×4096 matrix, the useful fine-tuning update ΔW can be well-approximated by the product of two skinny matrices. So instead of learning a full-rank ΔW (same size as W, as many parameters), LoRA learns ΔW = BA, where A and B are low-rank.
The forward pass of a LoRA-wrapped layer is h = Wx + (α/r)·BAx, where B is d×r, A is r×k, and r is much smaller than d and k. Read the formula as: the frozen weight W acts as before, and a small side branch learns a correction ΔW = (α/r)·BA. At the start of training, B=0, so the output is exactly what the original model would give. You start precisely at the pretrained optimum and walk away from it only as training proceeds. This is far more stable than, say, initializing ΔW randomly and hoping to cancel out.
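A minimal sketch of the LoRA forward pass Wx + (α/r)·BAx, written with plain Python lists so it runs without PyTorch. The helper names and toy dimensions are assumptions for illustration; a real implementation would wrap `nn.Linear` and let autograd handle A and B.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute Wx + (alpha / r) * B(Ax) without materializing the full BA matrix."""
    base = matvec(W, x)               # frozen path
    delta = matvec(B, matvec(A, x))   # low-rank side branch: k -> r -> d
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

random.seed(0)
d, k, r, alpha = 6, 5, 2, 4           # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(k)]

# With B = 0 the side branch contributes exactly zero, so the output is the base output.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))
```

Computing B(Ax) rather than (BA)x is the point of the factorization at training time: the side branch costs r·(d + k) multiply-adds instead of d·k.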
Figure: the low-rank branch A → B runs alongside the frozen weight W. Only A and B have gradients. Because B is initialized to zero, ΔW = 0 initially, so the model starts exactly at the pretrained optimum. The scaling factor α/r lets you control update magnitude independently of rank.
You can LoRA-wrap any linear layer. In practice, the attention projections (Q, K, V, and the output projection O) give the most benefit per parameter. Some practitioners also wrap the MLP's up-projection and gate. A typical config targets q_proj and v_proj (the LoRA paper's default) with rank 8–16 for light adaptation, or all four attention projections plus MLP at rank 32–64 for deeper task adaptation. Targets that are not commonly wrapped: layer norms (tiny; already cheap to full-train), embeddings (very high-dimensional; marginal benefit from LoRA).
Rank r controls capacity. For a simple style/format change (instruction tuning on a well-formatted dataset), r=8 is often enough. For complex domain adaptation or when the gap between base and target is large, try r=32 or r=64. A common heuristic: set alpha = r (so α/r = 1) or alpha = 2r (so α/r = 2). A higher alpha makes the LoRA update more aggressive — handy when you want faster adaptation, risky when you want to preserve base behavior.
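To make the rank and target-module trade-off concrete, the sketch below counts adapter parameters for a few ranks. The shape used (hidden size 4096, 32 layers, square attention projections) is an assumed 7B-class configuration, and the function names are illustrative.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one wrapped linear layer: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

def adapter_param_count(r, target_shapes, n_layers):
    """Total adapter parameters when the same targets are wrapped in every layer."""
    per_layer = sum(lora_param_count(d_out, d_in, r) for d_out, d_in in target_shapes)
    return per_layer * n_layers

HIDDEN, N_LAYERS = 4096, 32                 # assumed 7B-class shape
ATTN_QV = [(HIDDEN, HIDDEN)] * 2            # q_proj and v_proj only
ATTN_ALL = [(HIDDEN, HIDDEN)] * 4           # q, k, v, and o projections

for r in (4, 8, 16, 64):
    qv = adapter_param_count(r, ATTN_QV, N_LAYERS)
    full = adapter_param_count(r, ATTN_ALL, N_LAYERS)
    # Two bytes per parameter gives the on-disk size of a bf16 adapter.
    print(f"r={r:>2}: q/v {qv / 1e6:5.1f} M ({qv * 2 / 1e6:6.1f} MB)"
          f" | all attention {full / 1e6:5.1f} M ({full * 2 / 1e6:6.1f} MB)")
```

The count is linear in r, so doubling the rank doubles both the adapter file and its optimizer state, while the frozen base cost stays fixed.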
For completeness, the alternatives you'll see referenced:
LoRA already removes most of the optimizer-state cost, but the frozen base weights still sit in memory at 16 bits — that's 14 GB for a 7B model. QLoRA (Quantized LoRA) attacks exactly that: it loads the frozen base in 4-bit precision and trains LoRA adapters on top. Because the base is frozen and is only used in the forward/backward pass (dequantized on the fly for each matrix multiply, then discarded), 4-bit precision is acceptable — the trainable adapters stay in bf16 and carry all the learning. The result: fine-tuning a 7B model in under 6 GB of GPU memory, and a 65B model on a single 48 GB GPU — previously impossible.
QLoRA has three carefully engineered ingredients:
Standard int4 (or fp4) places its 16 representable values at evenly-spaced or power-of-two intervals. But neural network weights follow a roughly normal distribution, heavily concentrated near zero. Evenly-spaced quantization wastes most of its precision on the tails where few weights live.
NF4 is a 4-bit data type whose 16 quantization levels are chosen to be information-theoretically optimal for a normal distribution: they partition the standard normal distribution into 16 equal-probability buckets, so each representable value covers the same probability mass. This means a typical weight is quantized with less error than with int4. Each block of 64 weights gets its own scale factor (block-wise quantization) computed from the max absolute value, so a single outlier only degrades precision within its own block rather than across the whole tensor.
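The idea of equal-probability levels plus a per-block absmax scale can be sketched in a few lines. This is a simplified illustration, not the exact NF4 code table used by real libraries (which, among other differences, reserves a level for exact zero); the function names are assumptions.

```python
from statistics import NormalDist

def quantile_levels(bits=4):
    """One level per equal-probability bucket of N(0, 1), rescaled into [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]  # bucket medians
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]

LEVELS = quantile_levels()

def quantize_block(block, levels=LEVELS):
    """Return (4-bit codes, scale) for one block using absmax scaling."""
    scale = max(abs(w) for w in block) or 1.0   # avoid dividing by zero on an all-zero block
    codes = [min(range(len(levels)), key=lambda i: abs(w / scale - levels[i]))
             for w in block]
    return codes, scale

def dequantize_block(codes, scale, levels=LEVELS):
    """Map codes back to approximate weights."""
    return [levels[c] * scale for c in codes]

block = [0.02, -0.01, 0.005, -0.03, 0.0, 0.015, -0.002, 0.04]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(codes, scale)
print([round(v, 4) for v in restored])
```

Notice that the levels are packed more densely near zero than near ±1, which is where most normally distributed weights fall after scaling.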
Block-wise quantization produces one fp32 scale factor per 64 weights: that's an extra 4 bytes per 64 params, or an extra 0.0625 bytes (0.5 bits) per param. Double quantization then quantizes those scale factors too, representing them in 8-bit with blocks of 256, which cuts the overhead to 8/64 + 32/(64·256) ≈ 0.127 bits per param. The net saving is roughly 0.37 bits per parameter, small but essentially free. Together with NF4, the total cost of the frozen base is ≈ 4.13 bits per parameter (4.5 without double quantization) instead of 16.
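The per-parameter overhead can be tallied directly. The sketch below uses the block sizes stated in the text (64 weights per block, 256 scale factors per second-level block); the function names are illustrative.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Bits per weight spent on one fp32 scale factor per block."""
    return scale_bits / block_size

def double_quant_overhead_bits(block_size=64, second_block=256,
                               quantized_scale_bits=8, second_scale_bits=32):
    """Bits per weight after the scale factors themselves are quantized to 8-bit."""
    first_level = quantized_scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block)
    return first_level + second_level

plain = scale_overhead_bits()
double = double_quant_overhead_bits()
print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saving:                               {plain - double:.3f} bits/param")
print(f"total for a 4-bit base:               {4 + double:.3f} bits/param")

# Scaled up to a 7B-parameter base, in decimal gigabytes.
print(f"7B base: {7e9 * (4 + double) / 8 / 1e9:.2f} GB")
```

The saving looks tiny per parameter, but it scales with model size: on a 65B model 0.37 bits per parameter is roughly 3 GB.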
Even with a tiny adapter, processing long sequences causes memory spikes that can kill the GPU process. QLoRA uses NVIDIA's unified memory (UM) to page optimizer states from GPU DRAM to CPU RAM when the GPU is close to capacity, and page them back before the optimizer step. This is the safety valve that makes the tight memory budget survivable on consumer hardware. Expect some CPU↔GPU transfer overhead — paging is slower than on-device — but it means the process doesn't crash.
The original QLoRA paper fine-tuned a 65B-parameter model on a single 48 GB GPU in about 24 hours and produced Guanaco, which reached 99.3% of ChatGPT's performance level on the Vicuna benchmark. That is hardware you could rent for a few dollars an hour. It is one of the clearest examples in ML of a systems trick democratizing a capability that was previously behind a data-center wall.
You will see NF4 and block-wise quantization again on Day 22, where we cover quantization for inference (GPTQ, AWQ, FP8). QLoRA is your first encounter with the idea that 4 bits is often enough — a theme that runs through the rest of the course.
This is the section that most training-focused tutorials skip, and the one an inference engineer most needs. Once you have a LoRA adapter, you face a binary choice at serve time:
Fold the adapter into the base weight once, offline, before deployment: W_merged = W + (α/r)·BA.
The merged weight is a standard nn.Linear. At inference, there is no extra computation: the model runs at full speed as if no adapter were ever there. You can quantize W_merged to fp8 or int4 like any other model. The trade-off: you lose the adapter's identity. You cannot hot-swap tasks without loading a different checkpoint. If you have one task, merge. If you have many, read on.
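A small sketch of the merge step, checking that the merged weight reproduces the adapter path. It uses plain Python lists and tiny hand-picked matrices so the result can be verified by eye; the helper names are assumptions, and a real merge would operate on the model's tensors.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * BA as a new matrix with the same shape as W."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * ba for w, ba in zip(w_row, ba_row)]
            for w_row, ba_row in zip(W, BA)]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # d=3, k=2 frozen weight
A = [[0.5, -1.0]]                          # r=1, k=2
B = [[1.0], [0.0], [2.0]]                  # d=3, r=1
alpha, r = 2, 1
x = [1.0, 1.0]

W_merged = merge_lora(W, A, B, alpha, r)
adapter_out = [b + (alpha / r) * d
               for b, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged_out = matvec(W_merged, x)
print(W_merged)
print(adapter_out, merged_out)
```

With real floating-point weights the two paths agree only up to rounding error, so compare with a tolerance rather than exact equality.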
Keep the base model on the GPU and store many small adapter sets on disk or in CPU memory. For each incoming request, identify which adapter to use (e.g., customer A uses the legal adapter, customer B uses the coding adapter) and apply it during the forward pass. This is the multi-adapter serving pattern.
The naive implementation — load one adapter, run forward, swap another — is slow and serializes requests. S-LoRA (Sheng et al., 2023) showed that you can batch requests with different adapters together efficiently: the base-model computation is shared across the batch, and the per-adapter ΔW·x terms are computed with a unified CUDA kernel that handles the heterogeneity. vLLM and other serving frameworks now support this natively. The win is enormous for multi-tenant scenarios: one base model on one set of GPUs serves thousands of customers' custom models.
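The routing logic behind multi-adapter serving can be illustrated with a toy batch in which each request names its adapter. This is only a conceptual sketch with made-up names: it loops in Python, whereas systems such as S-LoRA run the base matmul once for the whole batch and handle the per-adapter terms in custom GPU kernels.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def serve_batch(W, adapters, requests):
    """Run a mixed batch: every request shares W, each may add its own low-rank delta.

    adapters: {name: (A, B, scale)} where scale is alpha / r for that adapter.
    requests: list of (adapter_name_or_None, input_vector).
    """
    outputs = []
    for name, x in requests:
        y = matvec(W, x)                      # base computation, same weights for all
        if name is not None:
            A, B, scale = adapters[name]      # look up this tenant's adapter
            delta = matvec(B, matvec(A, x))   # per-adapter low-rank term
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs

W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "legal":  ([[1.0, 1.0]],  [[1.0], [0.0]], 1.0),
    "coding": ([[1.0, -1.0]], [[0.0], [2.0]], 0.5),
}
requests = [("legal", [1.0, 2.0]), ("coding", [1.0, 2.0]), (None, [1.0, 2.0])]

for (name, _), y in zip(requests, serve_batch(W, adapters, requests)):
    print(name, y)
```

The same input produces three different outputs from one copy of W, which is the property that lets one base model serve many tenants.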
If you merge the adapter before quantizing, you get the full benefit of inference quantization (e.g., GPTQ or AWQ to int4) with no interaction effects — the merged weight is just a weight. If you keep adapters separate and serve them dynamically, you typically keep the adapters in fp16 or bf16 even if the base is quantized, since the adapter is tiny and the precision matters for quality. Either way, the adapter does not block your path to a quantized deployment.
Use this table to make the right choice for your situation. Quality differences are typically small between LoRA and full SFT for most adaptation tasks; they widen if the target domain is very different from pretraining.
| Method | Trainable params | Training memory (7B) | Artifact size (7B) | Inference overhead | Quality vs full SFT | Best when |
|---|---|---|---|---|---|---|
| Full SFT | 100% | ~112 GB | ~14 GB checkpoint | None | Ceiling | Max quality, one task, large cluster |
| LoRA | ~0.1–1% | ~16 GB | ~10–200 MB adapter | Zero (if merged) | Within ~1–2% on most tasks | Most adaptation tasks; many tasks from one base |
| QLoRA | ~0.1–1% | ~6 GB | ~10–200 MB adapter | Zero (if merged into the dequantized base) | Slightly below LoRA on hard tasks | Big model, single consumer GPU |
| Prefix tuning | <0.1% | ~14–16 GB | <1 MB | Extra KV-cache entries | Below LoRA for complex tasks | Very few params needed; frozen base mandatory |
| Prompt tuning | <0.01% | ~14 GB | <0.1 MB | Extra prompt tokens | Competitive only at large scale (10B+) | Massive models, minimal storage |
| Adapter layers | ~1–3% | ~16 GB | ~100–500 MB | Sequential bottleneck per layer | Similar to LoRA | Legacy; LoRA preferred today |
A persistent risk: fine-tuning narrowly on new data can erode the broad capabilities the model gained in pre-training — it gets better at your task but worse at everything else. This is catastrophic forgetting. The intuition: the weights that represent general knowledge are being overwritten by gradient updates specialized for your narrow dataset.
Always eval on two distributions: the target task and a general-domain benchmark (MMLU, HellaSwag, or a held-out sample of the pretraining data). A fine-tune that looks great on task accuracy but causes large regressions on general benchmarks is a regression in disguise. Log both numbers before shipping.
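One way to make the two-distribution check routine is a small gate that compares before and after scores on both. The function name, the score values, and the 0.02 tolerance are all hypothetical; pick a threshold that reflects how much general-capability loss your product can absorb.

```python
def fine_tune_report(task_before, task_after, general_before, general_after,
                     max_general_drop=0.02):
    """Summarize a fine-tune using target-task and general-benchmark accuracy.

    All scores are fractions in [0, 1]; max_general_drop is an assumed tolerance.
    """
    task_gain = task_after - task_before
    general_drop = general_before - general_after
    return {
        "task_gain": task_gain,
        "general_drop": general_drop,
        "ok_to_ship": task_gain > 0 and general_drop <= max_general_drop,
    }

# Hypothetical numbers: both runs improve the task, only one keeps general ability.
healthy = fine_tune_report(0.60, 0.85, general_before=0.65, general_after=0.64)
forgetful = fine_tune_report(0.60, 0.85, general_before=0.65, general_after=0.55)
print(healthy)
print(forgetful)
```

Both runs show the same task gain, so a task-only dashboard would not distinguish them; the general-benchmark column is what exposes the regression.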
Companion notebook: day-13-fine-tuning.ipynb.
nn.Linear: freeze W, add trainable A (random) and B (zero), forward Wx + (α/r)·BAx. Confirm output equals the base at init (B=0 guarantee). Count trainable vs frozen params.
r ∈ {4, 8, 16, 64}. Build a memory calculator that estimates full FT vs LoRA vs QLoRA training memory for arbitrary model sizes.
-100, and confirm the loss only sees the response portion. Show the loss numerically on masked vs unmasked.
A and B have gradients; plot training loss to confirm it drops.
BA into W and confirm the merged model produces numerically identical outputs to the adapter model. Benchmark inference speed (merged should have zero overhead).
Close the page and answer from memory. If you can't, re-read the relevant section.
A and B? State the trainable-parameter count in terms of r, d, k.
B initialized to zero, and what would go wrong if it were randomly initialized?
"Pre-training builds the engine. Fine-tuning is steering. LoRA is steering with a 200-megabyte steering wheel, and S-LoRA lets you hand that wheel to a thousand different drivers without buying a thousand cars."
The fine-tuning canon.
Low-Rank Adaptation of Large Language Models. The decomposition, the intrinsic-dimensionality argument, and empirical comparisons.
NF4, double quantization, paged optimizers. 65B on one GPU. Read §3 carefully; it is packed with systems insights.
The SFT-then-RLHF recipe; instruction-tuning blueprint. The original paper behind ChatGPT's training pipeline.
Serving thousands of LoRA adapters from one base model with a unified batched CUDA kernel. The inference side of the fine-tuning story.
Production LoRA/QLoRA/adapters/IA3/prefix-tuning in one library. Read it after writing your own from scratch.
Hands-on guidance on rank, target modules, and hyperparameters. Sebastian Raschka's empirically grounded advice.
The original prefix/prompt tuning paper. Useful contrast to LoRA: same era, different trade-offs, still used in some production systems.
Gradient Low-Rank Projection: applies the low-rank idea to gradients rather than weights, enabling full-parameter training at lower memory. A natural extension to know about.