Day 05 — Tokenization & Embeddings · LLM Inference Engineer Curriculum

Why This Lesson

Every LLM call begins with a tokenizer and ends with a detokenizer.

Language models cannot read text directly. Before any neural network can do anything with a sentence, that sentence has to be turned into numbers. The piece of software that does this conversion is called a tokenizer. It chops the input string into smaller pieces and assigns each piece a unique integer ID. The model takes those integers as input and produces integers as output, and at the end of the pipeline a matching detokenizer turns those output integers back into readable text. Every single LLM call in the world goes through this round trip.

The choice of tokenizer matters more than you might expect. It affects how long your prompts are (longer prompts cost more compute), how well the model handles other languages (a tokenizer trained mostly on English is wildly inefficient on Chinese), and even model behavior on rare tokens that the model barely saw during training.

Once we have integer IDs, the next step is to turn each integer into a dense vector that the model can actually do math on. This is what the embedding layer does. We will build it up from a simple lookup table, see why it has to be learned rather than fixed, and understand the geometric mental model that all of modern NLP relies on.

We also need to give the model some sense of token order. Self-attention by itself does not know whether a token came before or after another. The standard trick is to add a positional encoding to each embedding. We will look at the original sinusoidal encoding from the 2017 Transformer paper, the simpler learned variant used by GPT-2 and BERT, and a quick preview of RoPE, the rotary scheme used by most modern open LLMs. RoPE itself is a Day 12 topic.

By the end of today you will have implemented Byte-Pair Encoding from scratch and understood every byte of input that flows into a Transformer.

Learning objectives

Explain why both character-level and word-level tokenization fail in practice, and how subword tokenization (BPE) avoids both pitfalls.
Implement Byte-Pair Encoding training and encoding in pure Python.
Use tiktoken, Hugging Face tokenizers, and sentencepiece interchangeably.
Describe an embedding matrix in detail — its shape, parameter count, and role.
Compare sinusoidal and learned positional embeddings, and preview RoPE.
Explain weight tying between the input embedding and the language-model head, and why it works.

The Three Regimes

Characters, words, or subwords? Subwords are the right answer, and here's why.

When you're designing a tokenizer, you have three obvious choices: split the text into individual characters, split it into whole words, or split it into something in between. The third option turns out to be much better than the first two, but understanding why requires looking at what each option costs you.

Strategy	Vocab size	Tokens for "the cat sat on the mat"	Handles new words?
Character	~100 (printable ASCII)	22	Yes
Word	50,000–500,000	6	No (UNK for unseen)
Subword (BPE)	32,000–128,000	6–8	Yes — composes from sub-pieces

Why character-level fails. The vocabulary is small and tidy — every character gets its own ID. But the sequences become extremely long. The phrase "the cat sat on the mat" is six words, but it's twenty-two characters. So a character-level model needs to process roughly four times as many tokens as a subword model for the same input. This is a big deal because attention scales as O(T²) in sequence length: making T four times bigger makes attention sixteen times more expensive. There's also a subtler problem. The model has to spend its parameters learning that the characters t-h-e form a unit, every time it sees the word "the." That representation effort is wasted work.
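To make the arithmetic concrete, here is a small sketch that counts tokens for the example phrase under character-level and word-level splitting and squares the length ratio. It only uses plain string operations; the variable names are illustrative. The exact ratio comes out a little under four, which is why the text says "roughly."

```python
phrase = "the cat sat on the mat"

char_tokens = list(phrase)       # one token per character, spaces included
word_tokens = phrase.split()     # one token per whitespace-separated word

length_ratio = len(char_tokens) / len(word_tokens)
attention_ratio = length_ratio ** 2   # attention cost grows with T squared

print(len(char_tokens), len(word_tokens))                  # 22 6
print(round(length_ratio, 2), round(attention_ratio, 1))   # 3.67 13.4
```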

Why word-level fails. Sequences are nice and short, one token per word. But the vocabulary explodes. English has hundreds of thousands of words, and any meaningful corpus contains typos, names, neologisms, and other strings that the vocabulary doesn't cover. These get mapped to a special "unknown" token, which throws away information the model could have used. Worse, word-level tokenization can't share information between morphological variants. The words running, runs, and ran would be three completely separate entries in the vocabulary, with three independent embeddings, and the model would have to learn from scratch that they are related.
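The unknown-token problem is easy to reproduce. The sketch below assumes a hypothetical six-entry word vocabulary and a hypothetical `<unk>` symbol; any word outside the vocabulary collapses to the same ID, so the model cannot tell which word was actually there.

```python
UNK = "<unk>"
vocab = {UNK: 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}  # toy vocabulary

def word_encode(text, vocab):
    # Whitespace split, then dictionary lookup with an UNK fallback.
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

seen = word_encode("the cat sat on the mat", vocab)
unseen = word_encode("the capybara sat on the mat", vocab)

print(seen)     # [1, 2, 3, 4, 1, 5]
print(unseen)   # [1, 0, 3, 4, 1, 5]  ("capybara" is lost)
```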

The same phrase under three schemes. Character-level keeps the vocabulary tiny but makes the sequence long; word-level keeps the sequence short but explodes the vocabulary and chokes on unseen words; subword BPE keeps common words whole and splits rare ones into pieces — short sequences, a bounded vocabulary, and no out-of-vocabulary tokens.

Why subwords win. A subword tokenizer learns a vocabulary of common chunks of text. The chunks can be whole words for frequent words, or fragments for rarer ones. The word "running" might be tokenized as ["run", "ning"]; "transformer" might be ["trans", "former"] or a single piece, depending on how it appeared in the training corpus. Common words stay whole; rare or novel words decompose into sub-pieces, so there are no out-of-vocabulary tokens. The vocabulary stays small enough to train, and the sequences stay short enough to process. It is the best of both worlds, and it has been the standard since GPT-2.
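As an illustration of how sub-pieces compose, the sketch below segments words with a greedy longest-match rule over a tiny hand-written vocabulary. This is a simplification chosen for readability: it is closer to how WordPiece encodes than to BPE's merge replay, and both the vocabulary and the function name `greedy_segment` are hypothetical. The point is only that a character fallback means no word is ever out of vocabulary.

```python
def greedy_segment(word, vocab):
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

vocab = {"run", "ning", "trans", "former", "the"}   # toy subword vocabulary

print(greedy_segment("running", vocab))       # ['run', 'ning']
print(greedy_segment("transformer", vocab))   # ['trans', 'former']
print(greedy_segment("runs", vocab))          # ['run', 's']
```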

The standard subword schemes

BPE (Byte-Pair Encoding). The bottom-up approach: start with bytes and repeatedly merge the most common pair. Used by GPT-2, GPT-3, GPT-4, LLaMA, and most modern models.
WordPiece. Used by BERT. Similar in spirit to BPE, but picks merges based on which one most increases the likelihood of the training corpus, rather than which pair is most frequent.
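Unigram (usually via SentencePiece). A top-down algorithm: start with a large candidate vocabulary and iteratively prune the pieces that contribute least to the likelihood of the corpus. SentencePiece is the library that popularized it, and the same library also implements BPE. Used by T5 and many multilingual models.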

The three algorithms produce broadly similar vocabularies in practice. We will implement BPE, which is the one you are most likely to encounter.

BPE From Scratch

Greedy bottom-up merging. The whole algorithm fits in a few lines.

Byte-Pair Encoding has exactly one core idea, and once you see it, the algorithm is obvious. Repeatedly find the most frequent pair of adjacent tokens, and merge that pair into a new token. Stop when you've reached your target vocabulary size.

The training algorithm

Start with a vocabulary that contains all individual bytes (or characters). On a UTF-8 corpus that's 256 starting tokens.
Look at the entire training corpus and count every adjacent pair of tokens.
Find the most frequent pair. Merge it into a new token, give it the next available ID, and write the merge down in a list (the order matters for encoding later).
Apply the merge throughout the corpus, replacing every instance of the pair with the new token.
Repeat until you've added enough new tokens to reach your target vocabulary size.

BPE walks the corpus by frequency. The most-frequent adjacent pair becomes a new token; the merge is recorded with a priority order (the order it was learned in). To encode new text later, you replay the merges in that same order.
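A single training step can be traced by hand on a toy string. The sketch below works on characters rather than bytes so the output is readable, and the string "aaabdaaabac" is just an illustrative corpus. It counts adjacent pairs, picks the most frequent one, and replaces it left to right.

```python
from collections import Counter

seq = list("aaabdaaabac")                    # toy corpus, one symbol per character
pair_counts = Counter(zip(seq, seq[1:]))     # count every adjacent pair
best_pair, best_count = pair_counts.most_common(1)[0]

merged, i = [], 0
while i < len(seq):                          # apply the merge, scanning left to right
    if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best_pair:
        merged.append("".join(best_pair))
        i += 2
    else:
        merged.append(seq[i])
        i += 1

print(best_pair, best_count)   # ('a', 'a') 4
print(merged)                  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Note that the pair was counted four times but only two replacements happen, because overlapping occurrences inside "aaa" cannot both be merged.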

The minimum implementation

Three small functions cover the algorithm. The first counts adjacent pairs in a sequence; the second applies one merge; the third runs the training loop.

from collections import Counter

def get_pairs(seq):
    return Counter(zip(seq, seq[1:]))

def replace_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and seq[i] == pair[0] and seq[i+1] == pair[1]:
            out.append(new_id); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

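def train_bpe(text, vocab_size):
    seq = list(text.encode('utf-8'))      # start with raw bytes (IDs 0-255)
    merges, next_id = {}, 256             # merges maps (left, right) -> new ID, in learned order
    while next_id < vocab_size:
        pairs = get_pairs(seq)
        if not pairs: break               # fewer than two tokens left, nothing to merge
        pair, freq = pairs.most_common(1)[0]
        if freq < 2: break                # no pair repeats, so further merges add nothing
        merges[pair] = next_id
        seq = replace_pair(seq, pair, next_id)
        next_id += 1
    return merges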

Encoding new text replays training. Each saved merge is applied once, in the order it was learned. A single pass is enough, because a merge can only combine tokens that were created by earlier merges. Because Python dictionaries preserve insertion order (guaranteed since Python 3.7), we can simply iterate over the merges dict.

def encode(text, merges):
    seq = list(text.encode('utf-8'))
    for pair, new_id in merges.items():
        seq = replace_pair(seq, pair, new_id)
    return seq
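Decoding is the missing half of the round trip. The sketch below repeats the functions from this section so that it runs on its own, then adds two hypothetical helpers, `build_vocab` and `decode`, which expand every merged ID back into its byte string. The training corpus and the vocabulary size of 280 are arbitrary choices for illustration.

```python
from collections import Counter

def get_pairs(seq):
    return Counter(zip(seq, seq[1:]))

def replace_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def train_bpe(text, vocab_size):
    seq = list(text.encode("utf-8"))
    merges, next_id = {}, 256
    while next_id < vocab_size:
        pairs = get_pairs(seq)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        merges[pair] = next_id
        seq = replace_pair(seq, pair, next_id)
        next_id += 1
    return merges

def encode(text, merges):
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        seq = replace_pair(seq, pair, new_id)
    return seq

def build_vocab(merges):
    # IDs 0-255 are single bytes; each merged ID is the concatenation of its two parts.
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab

def decode(ids, merges):
    vocab = build_vocab(merges)
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

corpus = "the cat sat on the mat. the cat ate the rat. " * 20
merges = train_bpe(corpus, vocab_size=280)

sample = "the cat sat \U0001F642 caf\u00e9"      # includes an emoji and an accented letter
ids = encode(sample, merges)
roundtrip = decode(ids, merges)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(roundtrip == sample)
```

Because the base vocabulary covers all 256 byte values, the round trip is lossless even for characters the training corpus never contained.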

What you have just seen is the concept. Production tokenizers add several refinements that the textbook version skips. They typically pre-tokenize the text first, splitting on whitespace and punctuation so that merges cannot accidentally span word boundaries. They use regex patterns to control which kinds of merges are allowed. They define special tokens like end-of-text. They have byte-level fallback for arbitrary input. And they are written in fast Rust or C++ rather than Python. If you want to read a clean reference implementation, karpathy/minbpe is the place to go.
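To show what pre-tokenization does, here is a minimal sketch using Python's standard `re` module. The pattern is a deliberately simplified, ASCII-only stand-in written for this example; it is not the pattern any production tokenizer ships. It splits text into letter runs, digit runs, punctuation runs, and leftover whitespace, attaching one leading space to the chunk that follows it. BPE merges would then be learned within each chunk, never across chunks.

```python
import re

# Simplified, ASCII-only pattern (an assumption for illustration only).
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text):
    return PRETOKENIZE.findall(text)

text = "Hello, world! 123"
pieces = pretokenize(text)

print(pieces)                     # ['Hello', ',', ' world', '!', ' 123']
print("".join(pieces) == text)    # True: nothing is dropped
```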

Production Tokenizers

Three libraries cover almost everything you'll meet in the wild.

You essentially never roll your own tokenizer in production. There are three libraries you should know how to use.

tiktoken (OpenAI)

OpenAI's tokenizer library, used for GPT-2, GPT-3, and GPT-4. Written in Rust, very fast, exposed to Python with a clean API.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # GPT-3.5/4 tokenizer
ids = enc.encode("Hello, world!")
print(ids)                                      # [9906, 11, 1917, 0]
print(enc.decode(ids))                          # 'Hello, world!'
print(enc.n_vocab)                              # 100277

Three encodings are worth knowing by name. gpt2 has 50,257 tokens and is what GPT-2 used. cl100k_base has 100,277 tokens and powers GPT-3.5 and GPT-4. o200k_base has 200,019 tokens and is the larger vocabulary used by GPT-4o.

Hugging Face — `transformers.AutoTokenizer`

The Hugging Face approach is to load whatever tokenizer was published alongside a particular model.

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tok.encode("Hello, world!")
print(ids)                                      # [1, 15043, 29892, 3186, 29991]
print(tok.decode(ids))

The 1 at the very start is the BOS (beginning-of-sequence) token. The LLaMA tokenizer prepends one by default whenever it encodes text (pass add_special_tokens=False to suppress it). This is a model-specific convention; not every tokenizer does it.

SentencePiece

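SentencePiece is the third major player. T5, LLaMA-1, LLaMA-2, and many multilingual models all use it. Internally it treats the input as a raw stream of Unicode characters (including whitespace) and uses the special character ▁ (Unicode U+2581) to mark word boundaries. Bytes enter only through its optional byte-fallback mode, which LLaMA uses for characters outside the vocabulary.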

Vocab sizes by model

Model	Tokenizer	Vocab
GPT-2	BPE	50,257
GPT-3.5 / GPT-4	tiktoken cl100k_base	100,277
GPT-4o	tiktoken o200k_base	200,019
LLaMA-2	SentencePiece BPE	32,000
LLaMA-3	tiktoken-like BPE	128,256

A larger vocabulary means each input takes fewer tokens to represent (so attention is cheaper) but also means the model carries more parameters in its embedding layer and language-model head. The models in the table above span roughly 32,000 to 200,000 tokens, with most open LLMs sitting between 32,000 and 128,000.

Tokenizer Pitfalls

Six gotchas that bite real systems.

Tokenizers are the single most common source of subtle bugs in LLM applications. Here are six failure modes that come up over and over.

Leading whitespace matters.
The strings "hello" and " hello" tokenize differently in most BPE schemes, because the leading space is part of the token. So when you concatenate already-tokenized sub-sequences without paying attention to spaces, you can produce inputs that the model has never seen during training, which often produces strange outputs.
Trailing whitespace plus completion.
If you end a prompt with a trailing space, you have effectively "used up" the leading-space slot of the next token. The model is then forced to predict a continuation that starts without a space — and most natural continuations don't. The result is often garbled output. Best practice: never end a prompt with a trailing space.
Numbers.
GPT-2 tokenized numbers inconsistently: the digit string 1234 could end up as one token or several, depending on the digits and their context. This makes arithmetic harder for the model. LLaMA-1 and LLaMA-2 split every number into individual digits; the cl100k_base tokenizer used by GPT-4 caps digit groups at three digits, which is more regular than GPT-2's behavior.
Byte-level UTF-8.
Multi-byte characters such as emoji or non-Latin scripts decompose into their UTF-8 bytes when there is no specialized vocabulary entry. A single emoji can become four tokens. This is harmless for correctness but inflates token counts on multilingual or emoji-heavy inputs.
Glitch tokens.
Some tokens appear in the vocabulary but barely appear in the model's training data. (Tokenizer training and model training use different corpora at different times.) The classic example is " SolidGoldMagikarp", a single token in the GPT-2 vocabulary that GPT-3 inherited. The embedding for such a token stays roughly random because gradients almost never flow through it during training, so the model behaves erratically when it encounters one.
Token-level metrics, not character-level.
When a model spec says "context length 4096", that number is in tokens, not characters. As a rough guide, 4096 tokens corresponds to about 3000 English words.
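The byte-level pitfall above is easy to check without any tokenizer. The sketch below counts UTF-8 bytes for four sample characters. With a byte-level tokenizer and no dedicated vocabulary entry, the byte count is the worst-case token count for that character.

```python
samples = {
    "ASCII letter": "a",
    "accented letter": "\u00e9",     # é
    "CJK character": "\u4e2d",       # 中
    "emoji": "\U0001F642",           # 🙂
}

byte_counts = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, n in byte_counts.items():
    print(f"{name}: {n} byte(s)")
# ASCII letter: 1, accented letter: 2, CJK character: 3, emoji: 4
```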

Token efficiency varies wildly across languages

For English, most BPE tokenizers compress to around four characters per token. Code is similar. But Chinese and Japanese text, in older tokenizers, often needs one or more tokens for every character, so the same content can cost several times more tokens than its English equivalent. That translates directly into higher API cost, latency, and attention compute. LLaMA-3 and GPT-4o widened their vocabularies in part to reduce this disparity.

Embeddings

A learned dictionary that maps token IDs to D-dimensional geometry.

Once we have integer IDs, we still need to give the model something it can do math on. Integers themselves carry no useful structure: token 5379 isn't "close" to token 5380 in any meaningful way. We need to convert each integer ID into a dense vector of floating-point numbers, and we need that vector to encode something about the meaning of the token.

The solution is breathtakingly simple. We create a learnable matrix E with shape (V, D), where V is the vocabulary size and D is the model's hidden dimension. Every row of E is the embedding for one token. To embed a token with ID i, we just look up row i. The whole operation is integer indexing into a tall matrix.

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=50257, embedding_dim=768)   # GPT-2 small
ids = torch.tensor([[15496, 11, 995]])                        # shape (1, 3)
x = emb(ids)                                                  # shape (1, 3, 768)

The shape transformation is worth tracing carefully. The input ids has shape (B, T): a batch of B sequences, each with T token positions, holding integer IDs. The output x has shape (B, T, D): the same batch and same sequence length, but now each token has been replaced by a D-dimensional vector.

nn.Embedding is a tall (V, D) matrix. The forward operation is integer indexing — pull row i for token id i. During the backward pass, the gradient updates only the rows that were actually used.
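Both claims, that the forward pass is plain row indexing and that the backward pass only touches the rows that were used, can be checked on a tiny table. The sizes below (V=10, D=4) and the token IDs are arbitrary values picked for this sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)   # tiny table: V=10, D=4
ids = torch.tensor([[2, 7, 2]])                          # shape (B=1, T=3)

x = emb(ids)                                             # shape (1, 3, 4)
same_as_indexing = torch.equal(x, emb.weight[ids])       # forward == row lookup

x.sum().backward()                                       # any scalar loss will do
row_grad_norm = emb.weight.grad.abs().sum(dim=1)         # one value per vocabulary row
touched_rows = row_grad_norm.nonzero().flatten().tolist()

print(same_as_indexing)   # True
print(touched_rows)       # [2, 7]
```

Row 2 appears twice in the input, so its gradient is accumulated twice; the other eight rows receive exactly zero gradient.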

Parameter count

The embedding table has V × D parameters. To get a feel for what this means in practice:

GPT-2 small. 50,257 × 768 ≈ 39M parameters in the embedding alone — about 30% of the model's 124M total.
LLaMA-2-7B. 32,000 × 4,096 ≈ 131M parameters — about 1.9% of the 7B total.
LLaMA-3 with 128k vocab. 128,256 × 4,096 ≈ 525M parameters, about 6.5% of an 8B-parameter model.

The fraction matters. In small models, the embedding is a meaningful chunk of the parameter budget and of the optimizer's gradient memory. In large models, it shrinks to a rounding error.
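The figures above are a one-line multiplication each. The sketch below recomputes them, using the rounded totals quoted in the text (124M, 7B, 8B) as the denominators; those totals are approximations, so the percentages are too.

```python
models = {
    # name: (vocab size V, hidden size D, approximate total parameter count)
    "GPT-2 small": (50_257, 768, 124e6),
    "LLaMA-2-7B": (32_000, 4_096, 7e9),
    "LLaMA-3-8B": (128_256, 4_096, 8e9),
}

def embedding_params(V, D):
    return V * D          # one D-dimensional row per vocabulary entry

report = {}
for name, (V, D, total) in models.items():
    n = embedding_params(V, D)
    report[name] = (n, 100 * n / total)
    print(f"{name}: {n / 1e6:.1f}M embedding parameters, {100 * n / total:.1f}% of total")
```

These counts cover the input embedding only. A model that does not tie its output head carries a second matrix of the same size.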

The language-model head and weight tying

The output side of the model is the inverse operation. Given a final hidden state h ∈ ℝ^D, we need to produce a probability distribution over the entire vocabulary. We do that with a linear layer that maps D-dimensional hidden states to V-dimensional logits: logits = h @ W_lm.T, where W_lm ∈ ℝ^{V × D} has the same shape as the input embedding.

Notice that the input embedding and the output projection are essentially mirror operations: one maps from vocabulary to embedding space, the other maps back. Many models exploit this symmetry by sharing the same parameters for both. This trick is called weight tying:

self.lm_head.weight = self.tok_emb.weight

The benefit is parameter savings — instead of two V × D matrices, you have one. For small models where the embedding is a large fraction of total parameters, this is significant. GPT-2 ties; LLaMA-2 does not (it uses a separate lm_head.weight). The choice is mostly a matter of empirical preference.
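A minimal sketch of weight tying in context, assuming a hypothetical `TinyLM` module with no Transformer blocks at all (the embedding feeds the head directly). The attribute names follow the one-line snippet above; the sizes are arbitrary. It shows that the tied model really holds one matrix, not two.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, tie_weights):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)           # weight: (V, D)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # weight: (V, D)
        if tie_weights:
            self.lm_head.weight = self.tok_emb.weight              # share one Parameter

    def forward(self, ids):
        h = self.tok_emb(ids)      # stand-in for the full Transformer stack
        return self.lm_head(h)     # logits = h @ W.T, shape (B, T, V)

def n_params(model):
    # parameters() yields each shared Parameter only once
    return sum(p.numel() for p in model.parameters())

tied = TinyLM(vocab_size=1000, d_model=32, tie_weights=True)
untied = TinyLM(vocab_size=1000, d_model=32, tie_weights=False)
logits = tied(torch.tensor([[1, 2, 3]]))

print(n_params(tied), n_params(untied))   # 32000 64000
print(tuple(logits.shape))                # (1, 3, 1000)
```

Tying works without any transpose because `nn.Linear` already stores its weight as (out_features, in_features), which is the same (V, D) layout as the embedding table.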

The geometric mental model

It helps to think of the embedding table as a learned dictionary that places each token at a specific location in D-dimensional "meaning space." During training, semantically similar tokens drift toward similar locations, so that the model can use spatial proximity as a signal. This is the same intuition behind the famous Word2Vec demonstration that king − man + woman ≈ queen. LLM embeddings work the same way; they are just learned jointly with the rest of the model rather than as a separate first step.

Positional Encoding

Self-attention is permutation-equivariant. Position has to come from somewhere.

Self-attention has a strange property: by itself, it does not care about the order of tokens. If you shuffle the tokens in your input and re-run attention, you get the same outputs in the corresponding shuffled positions. The model can't tell "the cat sat" from "sat the cat". For language modeling this is obviously a disaster, so we have to inject position information ourselves.
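The shuffling claim can be verified numerically. The sketch below is a single attention head with random projection matrices and no causal mask, all of which are assumptions made to keep the example short. It shuffles the input rows and checks that the outputs are the original outputs in the same shuffled order.

```python
import torch

torch.manual_seed(0)
T, D = 5, 8
x = torch.randn(T, D)                                   # T token vectors, no position info
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))      # random stand-in projections

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = torch.softmax(q @ k.T / D ** 0.5, dim=-1)  # (T, T), unmasked
    return weights @ v

perm = torch.randperm(T)
out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the inputs just shuffles the outputs the same way.
equivariant = torch.allclose(out_shuffled, out[perm], atol=1e-4)
print(equivariant)   # True
```

A causal mask breaks this symmetry to some degree, which is the observation behind the NoPE line of work mentioned later in this lesson.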

Sinusoidal positional encoding (original Transformer)

The 2017 Transformer paper added a fixed (non-learned) function of position to the token embedding. For position pos and dimension index i:

PE(pos, 2i)   = sin(pos / 10000^(2i/D))
PE(pos, 2i+1) = cos(pos / 10000^(2i/D))

Why use sinusoids of all things? Two reasons.

The first reason is that different dimensions of the encoding oscillate at different frequencies. The first few dimensions cycle very quickly with position (high frequency), so they encode fine-grained positional information. The last few dimensions cycle very slowly (low frequency), so they encode coarse positional information. Together, the model can read off both relative and absolute position.

The second reason is more subtle. For any fixed offset k, the encoding at position pos + k is a linear function of the encoding at position pos. (This falls out of the sum-of-angles formulas for sine and cosine.) So attention can encode the rule "look k steps back" as a fixed linear transformation, which is much easier to learn than encoding it as a complicated function of absolute position.

Each embedding dimension is a sine (or cosine) wave of a different wavelength. Low dimensions cycle quickly with position; high dimensions cycle slowly. Read a vertical slice at any position and you get a unique pattern of values — the model's coordinate for "where am I in the sequence."
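The "linear function of the encoding at pos" claim can be checked directly. The sketch below implements the formula with the standard `math` module and then builds the encoding at pos + k from the encoding at pos by rotating each (sin, cos) pair. The rotation angles depend only on the offset k, never on pos. The function names and the values D=16, pos=7, k=5 are illustrative choices.

```python
import math

def sinusoidal_pe(pos, D):
    pe = []
    for i in range(D // 2):
        angle = pos / (10000 ** (2 * i / D))
        pe += [math.sin(angle), math.cos(angle)]   # dimensions 2i and 2i+1
    return pe

def shift(pe, k, D):
    """Rotate each (sin, cos) pair by k * w_i, using the angle-sum identities."""
    out = []
    for i in range(D // 2):
        w = 1.0 / (10000 ** (2 * i / D))           # frequency of pair i
        s, c = pe[2 * i], pe[2 * i + 1]
        out += [s * math.cos(k * w) + c * math.sin(k * w),   # sin(a + b)
                c * math.cos(k * w) - s * math.sin(k * w)]   # cos(a + b)
    return out

D, pos, k = 16, 7, 5
direct = sinusoidal_pe(pos + k, D)
via_shift = shift(sinusoidal_pe(pos, D), k, D)
max_error = max(abs(a - b) for a, b in zip(direct, via_shift))

print(max_error < 1e-9)   # True
```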

Learned positional embeddings (GPT-2, BERT)

GPT-2 and BERT take a simpler approach. They allocate another embedding table P of shape (T_max, D) — one row per position — and learn it as part of the model. The full input is then the sum of token embedding and positional embedding: x = E[ids] + P[positions]. This is conceptually the simplest approach: no math required, just another lookup table.

The downside is that the maximum sequence length is fixed at training time. The model can't directly generate or process inputs longer than T_max, because there are no learned embeddings for positions it hasn't seen.
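A minimal sketch of the learned variant, with arbitrary small sizes. The explicit length check is an addition for this example; it makes the T_max limitation visible as an error instead of an out-of-range index.

```python
import torch
import torch.nn as nn

V, D, T_max = 100, 16, 8
tok_emb = nn.Embedding(V, D)        # E, shape (V, D)
pos_emb = nn.Embedding(T_max, D)    # P, shape (T_max, D)

def embed(ids):
    B, T = ids.shape
    if T > T_max:
        raise ValueError(f"sequence length {T} exceeds T_max={T_max}")
    positions = torch.arange(T)                  # 0, 1, ..., T-1
    return tok_emb(ids) + pos_emb(positions)     # (B, T, D) + (T, D) broadcasts over B

x = embed(torch.randint(0, V, (2, 5)))
print(tuple(x.shape))   # (2, 5, 16)
```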

Modern (RoPE, ALiBi, NoPE) — preview

Day 12 will cover these in depth. As a preview:

RoPE (Rotary Position Embedding). Rotate the Q and K vectors by an angle proportional to position. Used by LLaMA, Mistral, Qwen, and GPT-NeoX. Naturally encodes relative position, and extends well beyond the training-time maximum length using simple scaling tricks.
ALiBi. Add a linear bias to attention scores based on the distance between two positions. Used by BLOOM and MPT. Has good extrapolation properties.
NoPE. Literally no positional encoding at all. Some recent work shows decoder-only models can implicitly learn position from the causal mask. Not yet mainstream.
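As a taste of why RoPE "naturally encodes relative position," here is a two-dimensional toy, not a full implementation. The query is rotated by an angle proportional to its position m, the key by an angle proportional to its position n, and the resulting dot product depends only on m − n. The base angle theta=0.1 and the vectors are arbitrary values chosen for this sketch.

```python
import math

def rotate(vec, angle):
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def rope_score(q, k, m, n, theta=0.1):
    # Rotate q by its position m and k by its position n, then take the dot product.
    qm, kn = rotate(q, m * theta), rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 2.0), (0.5, -1.0)
near_start = rope_score(q, k, m=3, n=1)     # offset 2, early in the sequence
far_along = rope_score(q, k, m=10, n=8)     # offset 2, later in the sequence

print(abs(near_start - far_along) < 1e-9)   # True: only the offset matters
```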

Exercise

Six exercises, all in the notebook.

Companion notebook: day-5-tokenization-embeddings.ipynb.

Implement BPE training. Pick a corpus of your choice — a chapter of a public-domain book is fine — and train your BPE tokenizer to a vocabulary size of 1000. Print the top 50 most common merges. You should see common English subwords emerging naturally.
Implement BPE encode and decode. Verify that decode(encode(text)) == text for several sentences, including ones containing emoji and accented characters.
Tokenizer comparison. Take a single paragraph and tokenize it three ways: with tiktoken cl100k_base, with your own trained BPE, and with the gpt2 encoding. Compare the token counts.
Embed and visualize. Initialize a fresh nn.Embedding(1000, 32). Pick five token IDs and compute pairwise cosine similarities — pre-training, these should be roughly random. Then train the embedding on a tiny next-token-prediction task and recheck the similarities. Structure should emerge for tokens that often co-occur.
Sinusoidal PE heatmap. Implement the sinusoidal formula. Plot the resulting (T, D) matrix as a heatmap. You should see characteristic stripes whose frequency changes along the embedding axis.
Test glitch tokens. Tokenize " SolidGoldMagikarp" with tiktoken.get_encoding("gpt2"). Note that it tokenizes to a single token. (This particular token caused famously bizarre behavior in GPT-3.)

Go deeper.

Hand-picked references for this lesson.

Karpathy, "Let's build the GPT Tokenizer" (YouTube, 2.5 hr). The single best tokenizer tutorial. Builds a tiktoken-equivalent from scratch.

karpathy/minbpe (repo, Karpathy). The code from the video. Cleanest readable BPE reference.

Sennrich et al., "Subword Units" (paper, 2015). The original "BPE for translation" paper. Adapted Gage's 1994 compression idea for NLP.

Kudo & Richardson, "SentencePiece" (paper, 2018). Language-independent subword tokenizer. Default for T5, LLaMA-1/2, multilingual models.

openai/tiktoken (repo, OpenAI). Reference Rust implementation. Very fast.

OpenAI Tokenizer Playground (tool). Paste any text, see tokenization for cl100k_base / o200k_base interactively.

Vaswani et al., "Attention Is All You Need" (paper, 2017). Section 3.5 introduces the sinusoidal positional encoding.

Su et al., "RoFormer (RoPE)" (paper, 2021). Rotary Position Embedding. The Day 12 deep-dive lives here.

Alammar, "Illustrated Word2Vec" (blog, visual). Best visual intuition for what an embedding actually represents.

Rumbelow & Watkins, "SolidGoldMagikarp" (post, LessWrong). The original glitch-token write-up. Why some tokens make GPT-3 produce gibberish.
