Day 13 — Fine-tuning: SFT, LoRA, QLoRA · LLM Inference Engineer Curriculum

Why This Lesson

Almost nobody pre-trains. Almost everybody fine-tunes.

Pre-training a frontier model costs millions of dollars and is done by a handful of labs. Fine-tuning takes that base model, which already knows language, and specializes it: turning a raw next-token predictor into an instruction-following assistant, a SQL generator, a medical-notes summarizer. This is the part of the training stack that most engineers actually touch, and LoRA and QLoRA have made an enormous practical difference: they let you adapt a large model with a small fraction of the memory and produce adapters measured in megabytes, not gigabytes.

For an inference engineer this matters twice. First, the artifact you serve is usually a fine-tuned model, and LoRA adapters can be merged into the base weights or served dynamically per request — both are inference-system design decisions with direct throughput and cost implications. Second, the quantization ideas behind QLoRA (4-bit NF4) are a direct preview of Week 4's quantization lessons. Fine-tuning is where training and inference start to blur.

Learning objectives

Map the fine-tuning landscape: continued pretraining vs. SFT vs. PEFT — when each applies and what the data looks like.
Explain instruction tuning and chat templates; implement loss masking on the prompt.
Understand the memory cost of full fine-tuning (weights + gradients + Adam states ≈ 16 bytes/param) and why a 7B full FT needs 100 GB+.
Derive the LoRA decomposition W + (α/r)·BA, explain why it works, count trainable params, and choose rank and target modules.
Explain QLoRA's three ingredients — 4-bit NF4, double quantization, paged optimizers — and the resulting memory win.
Distinguish adapter merging vs. dynamic adapter serving (S-LoRA) and their inference trade-offs.
Recognize catastrophic forgetting and apply mitigations.
Compare full SFT, LoRA, QLoRA, and other PEFT methods across trainable params, memory, quality, and serving cost.

The Fine-tuning Landscape

Three distinct modes: continued pretraining, SFT, and PEFT.

The term "fine-tuning" is used loosely to mean several different things. Let's pin them down, because each has different data requirements, cost, and inference implications.

Continued pretraining

Take a base model and keep running the standard next-token prediction loss on new raw text — medical papers, legal documents, a new language, code in a niche framework. No chat template, no instruction pairs. This teaches the model knowledge and vocabulary in a new domain without changing its behavior format. The cost profile is similar to pre-training: all weights update, all the Adam states are alive. Used when you need deep domain adaptation before task-specific tuning.

Supervised fine-tuning (SFT) / instruction tuning

Train on (prompt, response) pairs with a chat template. The model learns to follow instructions in the expected format. This is the first stage of how GPT-3 became InstructGPT and how Llama-2-base became Llama-2-chat; both then added RLHF on top. Full SFT updates all weights; PEFT-SFT (LoRA/QLoRA) updates only a small adapter. Both use the same data; the distinction is which parameters move.

Parameter-efficient fine-tuning (PEFT)

Freeze the pre-trained weights and train only a small number of new parameters. The dominant method is LoRA. Others include prefix tuning (prepend learned soft tokens to the input), prompt tuning (same but only at the embedding layer), and adapter layers (tiny bottleneck MLPs inserted between transformer sub-layers). LoRA wins in practice: it adds no inference latency when merged, and the quality gap with full SFT is small for most tasks.

The three fine-tuning modes differ in data format, which parameters move, and what artifact you deploy. PEFT (LoRA/QLoRA) is the default choice for most adaptation tasks because the adapter is tiny and can be served separately from the shared base.

What the data looks like

For SFT and PEFT, each training example is a formatted conversation. The exact format depends on the model family's chat template: a strict convention of special tokens marking system/user/assistant roles. Here is a ChatML-style example (the format used by Qwen and others):

<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a Python function that reverses a linked list.<|im_end|>
<|im_start|>assistant
def reverse_linked_list(head):
    prev = None
    current = head
    while current:
        nxt = current.next
        current.next = prev
        prev = current
        current = nxt
    return prev<|im_end|>

The entire formatted string is tokenized and fed as a single sequence. The crucial detail is loss masking: you compute the cross-entropy loss only on the assistant turn tokens. The system and user tokens are masked to -100 so PyTorch's cross-entropy ignores them. You want the model to learn to produce good responses, not to predict the user's questions — which are given to it at inference time anyway.

Loss masking in instruction tuning. System and user tokens are assigned label -100; PyTorch's cross_entropy(ignore_index=-100) skips them. Only the assistant-turn tokens accumulate gradients. This ensures the model learns to generate responses, not memorize prompts.

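# Loss masking in code: the essential pattern.
import torch.nn.functional as F

labels = input_ids.clone()
labels[:, :prompt_len] = -100      # mask system + user tokens
# Causal LM: logits at position t predict token t+1, so shift before the loss.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()
loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
    ignore_index=-100              # masked positions are excluded from the mean
)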
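
To make the effect of the mask concrete, here is a dependency-free sketch that reimplements a masked mean cross-entropy in plain Python and compares it with the unmasked loss. The toy logits, the vocabulary of four tokens, and the helper name `masked_cross_entropy` are illustrative assumptions, not part of any library. Note the shift: the scores at position t are compared against the token at position t+1.

```python
import math

IGNORE_INDEX = -100  # same sentinel PyTorch's cross_entropy uses by default

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-position score lists (seq_len x vocab_size)
    labels: list of target token ids (seq_len), already shifted
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, labels):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / count if count else 0.0

# Toy sequence: 3 prompt tokens followed by 2 response tokens, vocab of 4.
input_ids = [0, 1, 2, 3, 1]
prompt_len = 3
# Hypothetical model outputs: position t holds scores for token t+1.
logits = [
    [0.0, 0.0, 0.0, 0.0],   # predicts token 1 (prompt): uniform guess
    [0.0, 0.0, 0.0, 0.0],   # predicts token 2 (prompt): uniform guess
    [0.0, 0.0, 0.0, 5.0],   # predicts token 3 (response): confident, correct
    [0.0, 5.0, 0.0, 0.0],   # predicts token 1 (response): confident, correct
    [0.0, 0.0, 0.0, 0.0],   # no next token; dropped by the shift
]

labels = list(input_ids)
labels[:prompt_len] = [IGNORE_INDEX] * prompt_len   # mask system + user tokens

shift_logits = logits[:-1]   # position t predicts token t+1
masked_loss = masked_cross_entropy(shift_logits, labels[1:])
unmasked_loss = masked_cross_entropy(shift_logits, input_ids[1:])

print(f"masked   (response only): {masked_loss:.4f}")
print(f"unmasked (prompt + resp): {unmasked_loss:.4f}")
```

In this toy case the unmasked loss is dominated by the prompt positions the model cannot predict, which is exactly the signal the mask removes.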

One detail engineers trip over: every model family has its own chat template, such as ChatML, Llama 2's [INST]/[/INST], or Alpaca's ### Instruction:/### Response:. At inference time, you must apply exactly the same template that was used in training. Using the wrong template silently degrades quality, sometimes catastrophically. Hugging Face tokenizers ship apply_chat_template(), which renders the template stored with the tokenizer, to reduce this class of bug.
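
As a rough illustration of what a chat template does, the sketch below renders the ChatML layout shown earlier from a list of role/content messages. The helper `format_chatml` is hypothetical and hand-rolled for this lesson; in a real project you would rely on the template shipped with the model's tokenizer instead of building strings yourself.

```python
def format_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in ChatML layout.

    Illustrative helper only: real projects should rely on the template
    shipped with the model's tokenizer rather than hand-rolled strings.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # At inference time, open the assistant turn and let the model fill it.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that reverses a linked list."},
]

prompt = format_chatml(messages, add_generation_prompt=True)
print(prompt)
```

For training you would append the assistant message and leave `add_generation_prompt` off; for inference you stop at the opened assistant turn, as above.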

Full Fine-tuning Memory Cost

Weights + gradients + Adam states ≈ 16 bytes per parameter.

You saw on Day 11 that training a model is far more memory-hungry than running inference. The gap is stark and worth understanding quantitatively before LoRA makes sense.

In mixed-precision training (the default since Day 10), each parameter lives in multiple copies simultaneously:

fp16 forward weights — 2 bytes/param. These are what the forward pass sees.
fp32 master weights — 4 bytes/param. Kept to accumulate small gradient updates accurately; fp16 precision is insufficient for this.
fp16 gradients — 2 bytes/param. One gradient per weight from backprop.
fp32 Adam first moment (m) — 4 bytes/param. The exponential moving average of gradients.
fp32 Adam second moment (v) — 4 bytes/param. The EMA of squared gradients.

That totals 16 bytes per parameter. For a 7B model:

7,000,000,000 params × 16 bytes = 112,000,000,000 bytes = 112 GB (≈ 104 GiB)

Breakdown:
  fp16 weights:  7B × 2 = 14 GB
  fp32 master:   7B × 4 = 28 GB
  fp16 grads:    7B × 2 = 14 GB
  Adam m:        7B × 4 = 28 GB
  Adam v:        7B × 4 = 28 GB
  ─────────────────────────────
  Total:                 112 GB

Plus activations (batch-size × sequence-length dependent). A single H100 (80 GB) cannot hold this alone.

Training memory breakdown for a 7B model. Full fine-tuning needs ~112 GB — impossible on a single 80 GB GPU. LoRA freezes the base (no master weights or Adam states for it), slashing cost to ~16 GB. QLoRA compresses the frozen base to 4-bit NF4, reaching ~6 GB and fitting comfortably on a 24 GB consumer GPU.

This arithmetic explains the entire point of PEFT. If the base weights are frozen, you need gradients and Adam states only for the tiny adapter. A rank-16 LoRA adapter on every attention and MLP projection of a Llama-style 7B model has roughly 40 M parameters (about 0.6%); restricted to the four attention projections it is about 17 M (about 0.24%). At 16 bytes/param, 40 M trainable parameters need only ~0.6 GB for weights, gradients, and Adam states. The frozen base sits in fp16 at 14 GB. Total: under 16 GB before activations, which fits on a single 24 GB consumer GPU. QLoRA then compresses the frozen base to 4-bit (≈3.5 GB), pushing the total under 6 GB and enabling fine-tuning on an RTX 3090.
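
The same arithmetic can be wrapped in a small calculator. This is a back-of-the-envelope sketch under stated assumptions: it ignores activations, quantization constants, and framework overhead, and the default `adapter_fraction=0.006` is simply the roughly 0.6% figure used in this section, not a property of any particular model.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the text

def full_ft_bytes(n_params):
    # fp16 weights + fp32 master + fp16 grads + Adam m + Adam v
    return n_params * (2 + 4 + 2 + 4 + 4)

def lora_bytes(n_params, n_adapter, base_bits=16):
    # Frozen base stored at base_bits per weight; only the adapter pays
    # the full 16 bytes/param training cost.
    return n_params * base_bits / 8 + n_adapter * 16

def estimate(n_params, adapter_fraction=0.006):
    n_adapter = n_params * adapter_fraction
    return {
        "full_ft": full_ft_bytes(n_params) / GB,
        "lora": lora_bytes(n_params, n_adapter, base_bits=16) / GB,
        "qlora": lora_bytes(n_params, n_adapter, base_bits=4) / GB,
    }

for name, n in [("1B", 1e9), ("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    e = estimate(n)
    print(f"{name:>4}: full FT {e['full_ft']:7.1f} GB | "
          f"LoRA {e['lora']:6.1f} GB | QLoRA {e['qlora']:6.1f} GB")
```

Treat the output as a lower bound: real runs add activation memory that grows with batch size and sequence length.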

LoRA

Freeze W. Learn a low-rank update BA. Train 1% of the parameters.

LoRA (Low-Rank Adaptation) rests on a hypothesis from the LoRA paper, backed by its experiments: the weight update a model needs during fine-tuning has low intrinsic rank. Even if W is a 4096×4096 matrix, the useful fine-tuning update ΔW can be well approximated by the product of two skinny matrices. So instead of learning a full ΔW (the same size as W, with just as many parameters), LoRA learns ΔW = BA, where B and A share a small inner dimension r.

The math

Original layer:  y = W · x                      W ∈ ℝ^{d×k} (frozen, no gradient)
LoRA layer:      y = W · x + (α/r) · B · A · x  A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, r ≪ min(d, k)

r = rank (e.g. 8, 16, 64)
α = scaling constant (often set = r, so α/r = 1)
A ~ N(0, σ²) at init (random)
B = 0 at init  →  ΔW = BA = 0 at start (stable no-op)

Trainable parameters: r·k (for A) + d·r (for B) = r·(d + k)

Example: a 4096×4096 projection with r = 16
  Full FT:    4096 × 4096 = 16,777,216 params
  LoRA:       16 × (4096 + 4096) = 131,072 params
  Reduction:  131k / 16.7M ≈ 0.78%

Read the formula as: the frozen weight W acts as before, and a small side branch learns a correction ΔW = (α/r)·BA. At the start of training, B=0, so the output is exactly what the original model would give — you start precisely at the pretrained optimum and walk away from it only as training proceeds. This is far more stable than, say, initializing ΔW randomly and hoping to cancel out.
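
The formula can be checked on a toy layer. The sketch below uses plain Python lists instead of tensors so it runs anywhere; the class name `LoRALinear`, the init standard deviation of 0.02, and the tiny shapes are assumptions made for illustration, and no training is performed.

```python
import random

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    """Toy LoRA layer on plain Python lists: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d, k = len(W), len(W[0])
        self.W = W                                   # frozen: never updated
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(k)]
                  for _ in range(r)]                 # r x k, random init
        self.B = [[0.0] * r for _ in range(d)]       # d x r, zero init
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        # B (A x): two skinny products, the d x k matrix BA is never formed.
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * dl for b, dl in zip(base, delta)]

    def trainable_params(self):
        r, k, d = len(self.A), len(self.A[0]), len(self.B)
        return r * (d + k)

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]   # d=2, k=3
layer = LoRALinear(W, r=1, alpha=1)
x = [1.0, 1.0, 1.0]

print(layer.forward(x))          # identical to W x while B is all zeros
print(layer.trainable_params())  # r * (d + k)
```

Because B starts at zero, the side branch contributes nothing until the first optimizer step, which is the "stable no-op" property described above.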

LoRA adds a parallel branch A → B alongside the frozen weight W. Only A and B have gradients. Because B is initialized to zero, ΔW = 0 initially — the model starts exactly at the pretrained optimum. The scaling factor α/r lets you control update magnitude independently of rank.

Where to apply LoRA

You can LoRA-wrap any linear layer. In practice, the attention projections (Q, K, V, and the output projection O) give the most benefit per parameter. Some practitioners also wrap the MLP's up-projection and gate. A typical config targets q_proj and v_proj (the LoRA paper's default) with rank 8–16 for light adaptation, or all four attention projections plus MLP at rank 32–64 for deeper task adaptation. Targets that are not commonly wrapped: layer norms (tiny; already cheap to full-train), embeddings (very high-dimensional; marginal benefit from LoRA).
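
To see how the choice of target modules and rank drives adapter size, the sketch below counts trainable parameters for three common configurations. The geometry (32 layers, hidden size 4096, MLP width 11008) is an assumed Llama-style 7B layout, and 7e9 is a round figure for the total parameter count; substitute the numbers for your own model.

```python
def lora_params(shapes, r):
    """Trainable LoRA parameters for a list of (d_out, d_in) linear shapes."""
    return sum(r * (d_out + d_in) for d_out, d_in in shapes)

# Assumed Llama-style 7B geometry (illustrative constants, not read from a model).
N_LAYERS, D_MODEL, D_FFN, N_TOTAL = 32, 4096, 11008, 7e9

ATTN = {name: (D_MODEL, D_MODEL) for name in ("q_proj", "k_proj", "v_proj", "o_proj")}
MLP = {
    "gate_proj": (D_FFN, D_MODEL),
    "up_proj": (D_FFN, D_MODEL),
    "down_proj": (D_MODEL, D_FFN),
}

configs = {
    "q_proj + v_proj": [ATTN["q_proj"], ATTN["v_proj"]],
    "all attention": list(ATTN.values()),
    "attention + MLP": list(ATTN.values()) + list(MLP.values()),
}

def count(targets, r):
    # Every transformer block gets its own copy of each adapter.
    return N_LAYERS * lora_params(targets, r)

for label, targets in configs.items():
    for r in (8, 16, 64):
        n = count(targets, r)
        print(f"{label:<16} r={r:<3} {n / 1e6:7.1f} M  ({100 * n / N_TOTAL:.2f}%)")
```

The count scales linearly in r and in the number of wrapped layers, so moving from two attention projections to every linear layer costs far more than doubling the rank on a light config.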

Choosing rank and alpha

Rank r controls capacity. For a simple style/format change (instruction tuning on a well-formatted dataset), r=8 is often enough. For complex domain adaptation or when the gap between base and target is large, try r=32 or r=64. A common heuristic: set alpha = r (so α/r = 1) or alpha = 2r (so α/r = 2). A higher alpha makes the LoRA update more aggressive — handy when you want faster adaptation, risky when you want to preserve base behavior.

Other PEFT methods

For completeness, the alternatives you'll see referenced:

Prefix tuning / prompt tuning: Prepend learned soft-token vectors to the input (or to all layers' keys/values). No architectural change to W. Fewer parameters than LoRA, but less expressive; often underperforms LoRA on hard tasks.
Adapter layers: Insert tiny bottleneck MLP modules (down-project → nonlinearity → up-project) inside each transformer block. Adds some inference latency even after training. LoRA avoids this by merging.
IA³: Learn per-dimension scale vectors for keys, values, and FFN activations. Very few parameters; good for few-shot tasks; less flexible than LoRA for full SFT.

QLoRA

4-bit frozen base + LoRA adapters in bf16. A 7B on a 24 GB GPU.

LoRA already removes most of the optimizer-state cost, but the frozen base weights still sit in memory at 16 bits — that's 14 GB for a 7B model. QLoRA (Quantized LoRA) attacks exactly that: it loads the frozen base in 4-bit precision and trains LoRA adapters on top. Because the base is frozen and is only used in the forward/backward pass (dequantized on the fly for each matrix multiply, then discarded), 4-bit precision is acceptable — the trainable adapters stay in bf16 and carry all the learning. The result: fine-tuning a 7B model in under 6 GB of GPU memory, and a 65B model on a single 48 GB GPU — previously impossible.

QLoRA has three carefully engineered ingredients:

1 · 4-bit NormalFloat (NF4)

Standard int4 (or fp4) places its 16 representable values at evenly-spaced or power-of-two intervals. But neural network weights follow a roughly normal distribution, heavily concentrated near zero. Evenly-spaced quantization wastes most of its precision on the tails where few weights live.

NF4 is a 4-bit data type whose 16 quantization levels are chosen to be information-theoretically optimal for a normal distribution: they are placed at quantiles of the standard normal, so each representable value covers roughly the same probability mass. This means a typical weight is quantized with less error than with int4. Each block of 64 weights gets its own scale factor (block-wise quantization) computed from the block's max absolute value, so an outlier inflates the scale of its own block only, not of the whole tensor.
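
The idea can be sketched with the standard library alone. The code below builds 16 levels from equal-probability slices of a standard normal and applies block-wise absmax quantization to one synthetic block of 64 weights. This is a simplified construction for intuition: the published NF4 table is built somewhat differently (for example, it includes an exact zero), and the function names and the synthetic weights with standard deviation 0.02 are assumptions.

```python
from statistics import NormalDist

def equal_mass_levels(n_levels=16):
    """Levels at the centers of n equal-probability slices of N(0, 1), scaled to [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]

def quantize_block(block, levels):
    """Block-wise absmax quantization: returns (4-bit codes, scale)."""
    scale = max(abs(w) for w in block) or 1.0
    codes = [min(range(len(levels)), key=lambda i: abs(w / scale - levels[i]))
             for w in block]
    return codes, scale

def dequantize_block(codes, scale, levels):
    return [levels[c] * scale for c in codes]

levels = equal_mass_levels()
block = NormalDist(0.0, 0.02).samples(64, seed=0)   # one block of synthetic weights
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
err = sum(abs(a - b) for a, b in zip(block, restored)) / len(block)

print([round(v, 3) for v in levels])
print(f"mean abs error: {err:.5f} (block absmax {scale:.4f})")
```

Printing the levels shows them packed tightly around zero and spread out toward ±1, which is where the precision advantage over evenly spaced int4 comes from.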

2 · Double quantization

Block-wise quantization produces one fp32 scale factor per 64 weights: an extra 4 bytes per 64 params, or 0.0625 bytes (0.5 bits) per param. Double quantization then quantizes those scale factors too, storing them in 8-bit with one fp32 second-level scale per block of 256. The overhead drops from 0.5 to about 0.127 bits per parameter, a saving of roughly 0.37 bits that is small but essentially free. Together with NF4, the frozen base costs ≈ 4.13 bits per parameter (versus 4.5 without double quantization) instead of 16.
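
The overhead arithmetic is short enough to verify directly. The sketch below uses only the block sizes stated above (64 weights per scale, 256 scales per second-level scale); the 7B footprint on the last line is a derived estimate that ignores everything except the quantized base weights.

```python
BLOCK = 64          # weights per first-level scale factor
SUPER_BLOCK = 256   # first-level scales per second-level scale factor

# Without double quantization: one fp32 scale per block.
overhead_plain = 32 / BLOCK

# With double quantization: one 8-bit scale per block, plus one fp32
# second-level scale shared by SUPER_BLOCK first-level scales.
overhead_dq = 8 / BLOCK + 32 / (BLOCK * SUPER_BLOCK)

saving = overhead_plain - overhead_dq
bits_per_param = 4 + overhead_dq

print(f"scale overhead, plain : {overhead_plain:.3f} bits/param")
print(f"scale overhead, DQ    : {overhead_dq:.3f} bits/param")
print(f"saving                : {saving:.3f} bits/param")
print(f"total with NF4 + DQ   : {bits_per_param:.3f} bits/param")
print(f"7B base at that rate  : {7e9 * bits_per_param / 8 / 1e9:.2f} GB")
```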

3 · Paged optimizers

Even with a tiny adapter, processing long sequences causes memory spikes that can kill the GPU process. QLoRA uses NVIDIA's unified memory (UM) to page optimizer states from GPU DRAM to CPU RAM when the GPU is close to capacity, and page them back before the optimizer step. This is the safety valve that makes the tight memory budget survivable on consumer hardware. Expect some CPU↔GPU transfer overhead — paging is slower than on-device — but it means the process doesn't crash.

QLoRA's GPU layout: the frozen base is stored in 4-bit NF4 with double-quantized scale factors; LoRA adapters and their Adam states stay in bf16; when the GPU is close to capacity, optimizer states page out to CPU RAM and back. The total for a 7B model is around 6 GB.

The original QLoRA paper fine-tuned a 65B-parameter model on a single 48 GB GPU in about 24 hours and produced Guanaco, which reached about 99% of ChatGPT's performance level on the Vicuna benchmark. That is hardware you could rent for a few dollars an hour. It is one of the clearest examples in ML of a systems trick democratizing a capability that was previously behind a data-center wall.

You will see NF4 and block-wise quantization again on Day 22, where we cover quantization for inference (GPTQ, AWQ, FP8). QLoRA is your first encounter with the idea that 4 bits is often enough — a theme that runs through the rest of the course.

Inference Implications

Merge for zero overhead — or serve live adapters per request.

This is the section that most training-focused tutorials skip, and the one an inference engineer most needs. Once you have a LoRA adapter, you face a binary choice at serve time:

Option 1 — Merge the adapter (zero inference overhead)

Fold the adapter into the base weight once, offline, before deployment:

W_merged = W + (α/r) · B · A

The merged weight is a standard nn.Linear. At inference, there is no extra computation — the model runs at full speed as if no adapter was ever there. You can quantize W_merged to fp8 or int4 like any other model. The trade-off: you lose the adapter's identity. You cannot hot-swap tasks without loading a different checkpoint. If you have one task, merge. If you have many, read on.
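
A quick numerical check of the merge, again on plain Python lists so it runs without any framework. The matrices and the helper name `merge_lora` are toy assumptions; the point is only that the adapter path and the merged path give the same output.

```python
def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A as a plain d x k matrix."""
    r = len(A)
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]

# Toy values (hypothetical): d=2, k=3, r=1, alpha=2.
W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
A = [[0.5, -1.0, 2.0]]          # r x k
B = [[1.0], [-2.0]]             # d x r
alpha = 2.0
x = [1.0, 2.0, 3.0]

# Adapter path: base output plus the scaled side branch.
side = matvec(B, matvec(A, x))
adapter_out = [b + (alpha / len(A)) * s for b, s in zip(matvec(W, x), side)]

# Merged path: a single matrix-vector product with the folded weight.
W_merged = merge_lora(W, A, B, alpha)
merged_out = matvec(W_merged, x)

print(adapter_out, merged_out)
```

With real fp16 or bf16 weights the two paths agree only up to rounding error, so compare with a tolerance rather than exact equality.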

Option 2 — Keep adapters separate and serve dynamically

Keep the base model on the GPU and store many small adapter sets on disk or in CPU memory. For each incoming request, identify which adapter to use (e.g., customer A uses the legal adapter, customer B uses the coding adapter) and apply it during the forward pass. This is the multi-adapter serving pattern.

The naive implementation — load one adapter, run forward, swap another — is slow and serializes requests. S-LoRA (Sheng et al., 2023) showed that you can batch requests with different adapters together efficiently: the base-model computation is shared across the batch, and the per-adapter ΔW·x terms are computed with a unified CUDA kernel that handles the heterogeneity. vLLM and other serving frameworks now support this natively. The win is enormous for multi-tenant scenarios: one base model on one set of GPUs serves thousands of customers' custom models.
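
The structure of multi-adapter batching can be sketched in a few lines. This is not S-LoRA's implementation: the Python loop stands in for the custom batched kernel, and the adapter names, weights, and the `serve_batch` helper are hypothetical. It shows only the split between the shared base computation and the per-request low-rank correction.

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# One shared base weight (d=2, k=2) and two hypothetical tenant adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
ADAPTERS = {
    "legal":  {"A": [[1.0, 0.0]], "B": [[1.0], [0.0]], "scale": 1.0},
    "coding": {"A": [[0.0, 1.0]], "B": [[0.0], [2.0]], "scale": 0.5},
}

def serve_batch(requests):
    """requests: list of (adapter_id or None, input vector)."""
    # 1) Base computation: the same W for every request, so a real engine
    #    runs it as one batched matmul regardless of adapter.
    base_out = [matvec(W, x) for _, x in requests]

    # 2) Per-request low-rank correction, looked up by adapter id.
    outputs = []
    for (adapter_id, x), y in zip(requests, base_out):
        if adapter_id is None:
            outputs.append(y)          # plain base-model request
            continue
        ad = ADAPTERS[adapter_id]
        delta = matvec(ad["B"], matvec(ad["A"], x))
        outputs.append([yi + ad["scale"] * di for yi, di in zip(y, delta)])
    return outputs

batch = [("legal", [1.0, 2.0]), ("coding", [1.0, 2.0]), (None, [1.0, 2.0])]
for (adapter_id, _), out in zip(batch, serve_batch(batch)):
    print(adapter_id, out)
```

The same input produces three different outputs in one batch, which is the behavior a multi-tenant server needs; the engineering work in real systems goes into making step 2 fast when adapters differ in rank and live in different memory tiers.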

Two LoRA serving modes. Merging is simplest and fastest for single-task deployments. Multi-adapter serving (S-LoRA/punica) keeps a single base on GPU and batches requests across heterogeneous adapters with a unified CUDA kernel — one server handles thousands of customers, each with their own fine-tuned behavior.

Quantization at inference

If you merge the adapter before quantizing, you get the full benefit of inference quantization (e.g., GPTQ or AWQ to int4) with no interaction effects — the merged weight is just a weight. If you keep adapters separate and serve them dynamically, you typically keep the adapters in fp16 or bf16 even if the base is quantized, since the adapter is tiny and the precision matters for quality. Either way, the adapter does not block your path to a quantized deployment.

Full FT vs LoRA vs QLoRA vs prefix/prompt tuning at a glance.

Method	Trainable params	Training memory (7B)	Artifact size (7B)	Inference overhead	Quality vs full SFT	Best when
Full SFT	100%	~112 GB	~14 GB checkpoint	None	Ceiling	Max quality, one task, large cluster
LoRA	~0.1–1%	~16 GB	~10–200 MB adapter	Zero (if merged)	Within ~1–2% on most tasks	Most adaptation tasks; many tasks from one base
QLoRA	~0.1–1%	~6 GB	~10–200 MB adapter	Zero (if merged into a dequantized 16-bit base)	Slightly below LoRA on hard tasks	Big model, single consumer GPU
Prefix tuning	<0.1%	~14–16 GB	<1 MB	Extra KV-cache entries	Below LoRA for complex tasks	Very few params needed; frozen base mandatory
Prompt tuning	<0.01%	~14 GB	<0.1 MB	Extra prompt tokens	Competitive only at large scale (10B+)	Massive models, minimal storage
Adapter layers	~1–3%	~16 GB	~100–500 MB	Sequential bottleneck per layer	Similar to LoRA	Legacy; LoRA preferred today

Catastrophic Forgetting & Overfitting

Fine-tune too hard and the model forgets what it knew.

A persistent risk: fine-tuning narrowly on new data can erode the broad capabilities the model gained in pre-training — it gets better at your task but worse at everything else. This is catastrophic forgetting. The intuition: the weights that represent general knowledge are being overwritten by gradient updates specialized for your narrow dataset.

Symptoms to watch for

The fine-tuned model loses coherence on general prompts ("who is Albert Einstein?") even though nothing like them appeared in the fine-tuning data.
Perplexity on a general-domain eval set (e.g. WikiText) rises after fine-tuning.
The model starts repeating patterns from the training set (overfitting on small data).
Safety guardrails from RLHF erode — the model becomes more willing to produce harmful outputs after SFT on uncurated data.

Mitigations

Low learning rate: for full fine-tuning, use a peak LR well below the original pretraining LR (roughly 10× lower is a common starting point). Smaller steps mean smaller drift.
Few epochs — 1–3 epochs on the fine-tuning data is usually enough; more risks memorization.
Data mixing — include a fraction (~10%) of general-domain data in the fine-tuning mix.
PEFT (LoRA/QLoRA): often the most effective mitigation. Because the base weights are frozen and the update is constrained to rank r, the model can drift only within a low-rank subspace, and detaching the adapter recovers the original model exactly. This reduces forgetting but does not eliminate it: an aggressive adapter can still degrade general behavior, so keep evaluating for it.
Regularization — L2 toward the original weights, or EWC (Elastic Weight Consolidation) — penalize moving weights that were important for the original data distribution.
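
Of the mitigations above, data mixing is the easiest to show in code. The sketch below builds a training list in which about 10% of examples come from a general-domain pool. The helper `mix_datasets`, the placeholder strings, and the pool sizes are assumptions for illustration; a real pipeline would sample from tokenized datasets rather than Python lists.

```python
import random

def mix_datasets(task_examples, general_examples, general_fraction=0.10, seed=0):
    """Return a shuffled training list where ~general_fraction of items are general-domain."""
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(general_examples, n_general)
    rng.shuffle(mixed)
    return mixed

task = [f"task-{i}" for i in range(900)]          # placeholder fine-tuning data
general = [f"general-{i}" for i in range(5000)]   # placeholder general-domain pool

mixed = mix_datasets(task, general)
share = sum(x.startswith("general") for x in mixed) / len(mixed)
print(len(mixed), f"{share:.2%}")
```

The fraction is a knob, not a rule: raise it if general benchmarks regress, lower it if the task metric stalls.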

Evaluating fine-tuned models

Always eval on two distributions: the target task and a general-domain benchmark (MMLU, HellaSwag, or a held-out sample of the pretraining data). A fine-tune that looks great on task accuracy but causes large regressions on general benchmarks is a regression in disguise. Log both numbers before shipping.
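
One way to make "log both numbers before shipping" enforceable is a small gate in the evaluation script. The sketch below is a hypothetical check, with made-up metric values and an arbitrary 0.02 tolerance on the general benchmark; choose thresholds that match your own metrics and risk tolerance.

```python
def regression_report(before, after, max_general_drop=0.02):
    """Compare metrics before/after fine-tuning; higher is better for both."""
    task_gain = after["task"] - before["task"]
    general_drop = before["general"] - after["general"]
    return {
        "task_gain": task_gain,
        "general_drop": general_drop,
        "ship": task_gain > 0 and general_drop <= max_general_drop,
    }

# Made-up numbers purely for illustration.
base = {"task": 0.41, "general": 0.62}
candidate_a = {"task": 0.78, "general": 0.61}   # small general regression
candidate_b = {"task": 0.85, "general": 0.48}   # strong on task, forgets a lot

print(regression_report(base, candidate_a))
print(regression_report(base, candidate_b))
```

Candidate B has the better task score but fails the gate, which is the "regression in disguise" case described above.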

Exercise

Eight exercises, all in the notebook.

Companion notebook: day-13-fine-tuning.ipynb.

Implement LoRALinear from scratch. Wrap an nn.Linear: freeze W, add trainable A (random) and B (zero), forward Wx + (α/r)·BAx. Confirm output equals the base at init (B=0 guarantee). Count trainable vs frozen params.
Sweep rank and count params. Apply LoRA to a small model's attention projections and report the trainable fraction for r ∈ {4, 8, 16, 64}. Build a memory calculator that estimates full FT vs LoRA vs QLoRA training memory for arbitrary model sizes.
Loss masking for instruction data. Build a ChatML-formatted example, tokenize it, mask prompt tokens to -100, and confirm the loss only sees the response portion. Show the loss numerically on masked vs unmasked.
Fine-tune the Day 9 GPT with LoRA. Freeze the pretrained tiny GPT, attach LoRA to its attention, and fine-tune on a small new corpus. Assert only A and B have gradients; plot training loss to confirm it drops.
Merge the adapter and verify. Fold BA into W and confirm the merged model produces numerically identical outputs to the adapter model. Benchmark inference speed (merged should have zero overhead).
Memory calculator. Implement the 16-bytes/param formula for full FT and the LoRA/QLoRA reductions. Print a table for model sizes 1B, 7B, 13B, 70B.
Catastrophic forgetting demo. Full-fine-tune the tiny GPT hard on a new corpus and show its loss on the original data rises; repeat with LoRA and show it rises less. Sweep LoRA rank.
(Optional) Real QLoRA with HF peft + bitsandbytes. If GPU + packages available, load a small HF model in 4-bit NF4 and attach a LoRA adapter; report the GPU memory footprint and compare to fp16 baseline.

Self-Check

Ten questions before moving on.

Close the page and answer from memory. If you can't, re-read the relevant section.

What is the difference between continued pretraining and supervised fine-tuning (SFT)? When would you do one vs. the other?
Why is the SFT loss masked on the prompt? What sentinel value is conventional, and where in the code does it take effect?
Write the LoRA forward pass formula. What are the shapes of A and B? State the trainable-parameter count in terms of r, d, k.
Why is B initialized to zero, and what would go wrong if it were randomly initialized?
Compute the trainable fraction for LoRA on a 7B model's attention projections with r=16 (assume 32 layers, d=4096, Q/K/V/O).
Name QLoRA's three ingredients. What specific problem does each one solve?
What is NF4 and why does it give lower quantization error for neural-network weights than plain int4?
Contrast merging a LoRA adapter with serving it dynamically via S-LoRA. What does each approach optimize for?
What is catastrophic forgetting, and why does LoRA reduce it compared to full fine-tuning?
Name three PEFT methods other than LoRA and QLoRA. For each, state one advantage and one disadvantage versus LoRA.

Go deeper.

The fine-tuning canon.

Paper · 2021

Hu et al. — LoRA

Low-Rank Adaptation of Large Language Models. The decomposition, the intrinsic-dimensionality argument, and empirical comparisons.

Paper · 2023

Dettmers et al. — QLoRA

NF4, double quantization, paged optimizers. 65B on one GPU. Read §3 carefully — it is packed with systems insights.

Paper · 2022

Ouyang et al. — InstructGPT

The SFT-then-RLHF recipe; instruction-tuning blueprint. The original paper behind ChatGPT's training pipeline.

Paper · 2023

Sheng et al. — S-LoRA

Serving thousands of LoRA adapters from one base model with a unified batched CUDA kernel. The inference side of the fine-tuning story.

Library · HF

huggingface/peft

Production LoRA/QLoRA/adapters/IA3/prefix-tuning in one library. Read it after writing your own from scratch.

Blog · Raschka

Practical Tips for Fine-tuning with LoRA

Hands-on guidance on rank, target modules, and hyperparameters. Sebastian Raschka's empirically grounded advice.

Paper · 2021

Li & Liang — Prefix Tuning

The original prefix-tuning paper and the precursor to prompt tuning. Useful contrast to LoRA: same era, different trade-offs, still used in some production systems.

Paper · 2024

Zhao et al. — GaLore

Gradient Low-Rank Projection — applies the low-rank idea to gradients rather than weights, enabling full-parameter training at lower memory. A natural extension to know about.
