A neural network can't read text. Before any language model can do anything, the text has to be converted to numbers — first to integer token IDs, then to dense vectors. Today we build that pipeline from scratch and unpack every piece of it.
Language models cannot read text directly. Before any neural network can do anything with a sentence, that sentence has to be turned into numbers. The piece of software that does this conversion is called a tokenizer. It chops the input string into smaller pieces and assigns each piece a unique integer ID. The model takes those integers as input and produces integers as output, and at the end of the pipeline a matching detokenizer turns those output integers back into readable text. Every single LLM call in the world goes through this round trip.
The choice of tokenizer matters more than you might expect. It affects how long your prompts are (longer prompts cost more compute), how well the model handles other languages (a tokenizer trained mostly on English is wildly inefficient on Chinese), and even model behavior on rare tokens that the model barely saw during training.
Once we have integer IDs, the next step is to turn each integer into a dense vector that the model can actually do math on. This is what the embedding layer does. We will build it up from a simple lookup table, see why it has to be learned rather than fixed, and understand the geometric mental model that all of modern NLP relies on.
We also need to give the model some sense of token order. Self-attention by itself does not know whether a token came before or after another. The standard trick is to add a positional encoding to each embedding. We will look at the original sinusoidal encoding from the 2017 Transformer paper, the simpler learned variant used by GPT-2 and BERT, and a quick preview of RoPE, the rotary scheme used by most modern open LLMs. RoPE itself is a Day 12 topic.
By the end of today you will have implemented Byte-Pair Encoding from scratch and understood every byte of input that flows into a Transformer.
You will also be able to use tiktoken, Hugging Face tokenizers, and sentencepiece interchangeably.
When you're designing a tokenizer, you have three obvious choices: split the text into individual characters, split it into whole words, or split it into something in between. The third option turns out to be much better than the first two, but understanding why requires looking at what each option costs you.
| Strategy | Vocab size | Tokens for "the cat sat on the mat" | Handles new words? |
|---|---|---|---|
| Character | ~100 (printable ASCII) | 22 | Yes |
| Word | 50,000–500,000 | 6 | No (UNK for unseen) |
| Subword (BPE) | 32,000–128,000 | 6–8 | Yes — composes from sub-pieces |
Why character-level fails. The vocabulary is small and tidy — every character gets its own ID. But the sequences become extremely long. The phrase "the cat sat on the mat" is six words, but it's twenty-two characters. So a character-level model needs to process roughly four times as many tokens as a subword model for the same input. This is a big deal because attention scales as O(T²) in sequence length: making T four times bigger makes attention sixteen times more expensive. There's also a subtler problem. The model has to spend its parameters learning that the characters t-h-e form a unit, every time it sees the word "the." That representation effort is wasted work.
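The arithmetic in that paragraph can be checked in a few lines. This is a minimal sketch in plain Python; it treats whitespace splitting as a stand-in for a word-level tokenizer, which is an assumption made purely for illustration.

```python
phrase = "the cat sat on the mat"

char_tokens = list(phrase)      # one token per character, spaces included
word_tokens = phrase.split()    # one token per whitespace-separated word

length_ratio = len(char_tokens) / len(word_tokens)
attention_ratio = length_ratio ** 2   # attention cost grows with T squared

print(len(char_tokens), len(word_tokens))   # 22 6
print(round(length_ratio, 2))               # 3.67
print(round(attention_ratio, 1))            # 13.4
```

The exact ratio is about 3.7, which the text rounds up to "roughly four times"; squaring it gives a 13x to 16x increase in attention cost.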
Why word-level fails. Sequences are nice and short, one token per word. But the vocabulary explodes. English has hundreds of thousands of words, and any meaningful corpus contains typos, names, neologisms, and other strings that the vocabulary doesn't cover. These get mapped to a special "unknown" token, which throws away information the model could have used. Worse, word-level tokenization can't share information between morphological variants. The words running, runs, and ran would be three completely separate entries in the vocabulary, with three independent embeddings, and the model would have to learn from scratch that they are related.
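To make the unknown-token problem concrete, here is a small sketch of a word-level vocabulary. The helper names (`build_word_vocab`, `encode_words`) and the `<unk>` string are hypothetical, chosen only for this example.

```python
UNK = "<unk>"

def build_word_vocab(corpus):
    """Give every distinct whitespace-separated word an ID; ID 0 is reserved for UNK."""
    vocab = {UNK: 0}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

def encode_words(text, vocab):
    """Map each word to its ID, falling back to the UNK ID for unseen words."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

vocab = build_word_vocab("the cat sat on the mat")
print(encode_words("the cat ran", vocab))   # [1, 2, 0]
```

The word "ran" never appeared in the corpus, so it collapses to ID 0 and the model can no longer tell it apart from any other unseen word.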
Why subwords win. A subword tokenizer learns a vocabulary of common chunks of text. The chunks can be whole words for frequent words, or fragments for rarer ones. The word "running" might be tokenized as ["run", "ning"]; "transformer" might be ["trans", "former"] or a single piece, depending on how it appeared in the training corpus. Common words stay whole; rare or novel words decompose into sub-pieces, so there are no out-of-vocabulary tokens. The vocabulary stays small enough to train, and the sequences stay short enough to process. It is the best of both worlds, and it has been the standard since GPT-2.
The three main subword algorithms (BPE, WordPiece, and Unigram) produce broadly similar vocabularies in practice. We will implement BPE, which is the one you are most likely to encounter.
Byte-Pair Encoding has exactly one core idea, and once you see it, the algorithm is obvious. Repeatedly find the most frequent pair of adjacent tokens, and merge that pair into a new token. Stop when you've reached your target vocabulary size.
Three small functions cover the algorithm. The first counts adjacent pairs in a sequence; the second applies one merge; the third runs the training loop.
from collections import Counter

def get_pairs(seq):
    # Count every adjacent (left, right) pair in the sequence.
    return Counter(zip(seq, seq[1:]))

def replace_pair(seq, pair, new_id):
    # Scan left to right, replacing each occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and seq[i] == pair[0] and seq[i+1] == pair[1]:
            out.append(new_id); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def train_bpe(text, vocab_size):
    seq = list(text.encode('utf-8'))  # start with raw bytes (256 IDs)
    merges, next_id = {}, 256
    while next_id < vocab_size:
        pairs = get_pairs(seq)
        if not pairs: break  # fewer than two tokens left, nothing to merge
        pair, freq = pairs.most_common(1)[0]
        if freq < 2: break  # no pair repeats, so another merge saves nothing
        merges[pair] = next_id
        seq = replace_pair(seq, pair, next_id)
        next_id += 1
    return merges
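Encoding new text is the dual of training. Each saved merge is applied in the order it was learned. One pass over the merge list is enough, because a later merge only introduces a new, higher ID and therefore cannot recreate a pair that an earlier merge was looking for. Because Python dictionaries preserve insertion order (guaranteed since Python 3.7), we can simply iterate over the merges dict.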
def encode(text, merges):
    seq = list(text.encode('utf-8'))
    for pair, new_id in merges.items():
        seq = replace_pair(seq, pair, new_id)
    return seq
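The reverse direction is not shown above, so here is one possible sketch of a decoder plus a round-trip check. The training and encoding functions are restated compactly so the block runs on its own; `build_vocab` and `decode` are names introduced here for illustration, not part of the lesson's code.

```python
from collections import Counter

def get_pairs(seq):
    return Counter(zip(seq, seq[1:]))

def replace_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def train_bpe(text, vocab_size):
    seq = list(text.encode("utf-8"))
    merges, next_id = {}, 256
    while next_id < vocab_size:
        pairs = get_pairs(seq)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        merges[pair] = next_id
        seq = replace_pair(seq, pair, next_id)
        next_id += 1
    return merges

def encode(text, merges):
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        seq = replace_pair(seq, pair, new_id)
    return seq

def build_vocab(merges):
    """Map every token ID to the raw bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        # Merges are stored in learned order, so both halves already exist.
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab

def decode(ids, merges):
    """Concatenate the bytes behind each ID, then interpret them as UTF-8."""
    vocab = build_vocab(merges)
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges = train_bpe("aaabdaaabac", vocab_size=300)
ids = encode("aaabdaaabac", merges)
print(merges)               # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(ids)                  # [258, 100, 258, 97, 99]
print(decode(ids, merges))  # aaabdaaabac
```

Training stops after three merges because no remaining pair occurs twice. Since the base vocabulary is all 256 byte values, text the tokenizer never saw during training still round-trips; it simply compresses less.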
What you have just seen is the concept. Production tokenizers add several refinements that the textbook version skips. They typically pre-tokenize the text first, splitting on whitespace and punctuation so that merges cannot accidentally span word boundaries. They use regex patterns to control which kinds of merges are allowed. They define special tokens like end-of-text. They have byte-level fallback for arbitrary input. And they are written in fast Rust or C++ rather than Python. If you want to read a clean reference implementation, karpathy/minbpe is the place to go.
BPE was invented in 1994 by Philip Gage for data compression, and published in The C Users Journal under the title "A New Algorithm for Data Compression." It then sat in obscurity for two decades. In 2015, Sennrich and colleagues realized the same algorithm could be repurposed for tokenizing text in machine translation models. Today it powers GPT-4, LLaMA, and Mistral. The lesson is that today's "obvious" choices in machine learning often hide a thirty-year history nobody remembers.
You essentially never roll your own tokenizer in production. There are two libraries you should know how to use.
tiktoken is OpenAI's tokenizer library, used for GPT-2, GPT-3, and GPT-4. It is written in Rust, is very fast, and is exposed to Python with a clean API.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-3.5/4 tokenizer
ids = enc.encode("Hello, world!")
print(ids) # [9906, 11, 1917, 0]
print(enc.decode(ids)) # 'Hello, world!'
print(enc.n_vocab) # 100277
Three encodings are worth knowing by name. gpt2 has 50,257 tokens and is what GPT-2 used. cl100k_base has 100,277 tokens and powers GPT-3.5 and GPT-4. o200k_base has 200,019 tokens and is the larger vocabulary used by GPT-4o.
With transformers.AutoTokenizer, the Hugging Face approach is to load whatever tokenizer was published alongside a particular model.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tok.encode("Hello, world!")
print(ids) # [1, 15043, 29892, 3186, 29991]
print(tok.decode(ids))
The 1 at the very start is the BOS (beginning-of-sequence) token. LLaMA always prepends one to every input. This is a model-specific convention; not every tokenizer does it.
SentencePiece is the third major player. T5, LLaMA-1, LLaMA-2, and many multilingual models all use it. Internally it treats the input as a raw stream of Unicode characters (including whitespace) and uses a special character ▁ (Unicode U+2581) to mark word boundaries.
| Model | Tokenizer | Vocab |
|---|---|---|
| GPT-2 | BPE | 50,257 |
| GPT-3.5 / GPT-4 | tiktoken cl100k_base | 100,277 |
| GPT-4o | tiktoken o200k_base | 200,019 |
| LLaMA-2 | SentencePiece BPE | 32,000 |
| LLaMA-3 | tiktoken-like BPE | 128,256 |
A larger vocabulary means each input takes fewer tokens to represent (so attention is cheaper) but also means the model carries more parameters in its embedding layer and language-model head. Most of the models in the table sit between 32,000 and 128,000 tokens, with GPT-4o's 200,019 at the high end.
Tokenizers are the single most common source of subtle bugs in LLM applications. Here are six failure modes that come up over and over.
The strings "hello" and " hello" tokenize differently in most BPE schemes, because the leading space is part of the token. So when you concatenate already-tokenized sub-sequences without paying attention to spaces, you can produce inputs that the model has never seen during training, which often produces strange outputs.
If you end a prompt with a trailing space, you have effectively "used up" the leading-space slot of the next token. The model is then forced to predict a continuation that starts without a space — and most natural continuations don't. The result is often garbled output. Best practice: never end a prompt with a trailing space.
GPT-2 tokenized numbers inconsistently: the digit string 1234 could end up as one token or several depending on context. This makes arithmetic harder for the model. LLaMA-1 and LLaMA-2 split every number into individual digits, while the cl100k_base tokenizer behind GPT-3.5 and GPT-4 and the LLaMA-3 tokenizer group digits into chunks of at most three, which is far more regular than GPT-2's behavior.
Multi-byte characters such as emoji or non-Latin scripts decompose into their UTF-8 bytes when there is no specialized vocabulary entry. A single emoji can become four tokens. This is harmless for correctness but inflates token counts on multilingual or emoji-heavy inputs.
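The byte counts behind that claim are easy to inspect with the standard library. The sketch below only measures UTF-8 length, which is the worst-case token count for a byte-level tokenizer that has learned no merge covering the character; `utf8_byte_count` is a helper name made up for this example.

```python
def utf8_byte_count(ch):
    """Number of UTF-8 bytes in a string (worst-case tokens if no merge applies)."""
    return len(ch.encode("utf-8"))

samples = {
    "ascii letter": "a",
    "accented e": "\u00e9",     # é
    "CJK character": "\u4e2d",  # 中
    "emoji": "\U0001F642",      # 🙂
}

for name, ch in samples.items():
    print(name, utf8_byte_count(ch), list(ch.encode("utf-8")))
```

The four samples take 1, 2, 3, and 4 bytes respectively, which is why the same visible length can cost very different numbers of tokens.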
Some tokens appear in the vocabulary but barely appear in the model's training data. (Tokenizer training and model training use different corpora at different times.) The classic example is " SolidGoldMagikarp" in GPT-2. The embedding for such a token stays roughly random because gradients almost never flow through it during training, so the model behaves erratically when it encounters one.
When a model spec says "context length 4096", that number is in tokens, not characters. As a rough guide, 4096 tokens corresponds to about 3000 English words.
For English, most BPE tokenizers compress to around four characters per token. Code is similar. But Chinese and Japanese, in older tokenizers, can tokenize to roughly one or two characters per token — meaning the same content costs five to ten times more tokens than its English equivalent. That translates directly to five to ten times the API cost, latency, and attention compute. LLaMA-3 and GPT-4o specifically widened their vocabularies to fix this disparity.
Once we have integer IDs, we still need to give the model something it can do math on. Integers themselves carry no useful structure: token 5379 isn't "close" to token 5380 in any meaningful way. We need to convert each integer ID into a dense vector of floating-point numbers, and we need that vector to encode something about the meaning of the token.
The solution is breathtakingly simple. We create a learnable matrix E with shape (V, D), where V is the vocabulary size and D is the model's hidden dimension. Every row of E is the embedding for one token. To embed a token with ID i, we just look up row i. The whole operation is integer indexing into a tall matrix.
import torch
import torch.nn as nn
emb = nn.Embedding(num_embeddings=50257, embedding_dim=768) # GPT-2 small
ids = torch.tensor([[15496, 11, 995]]) # shape (1, 3)
x = emb(ids) # shape (1, 3, 768)
The shape transformation is worth tracing carefully. The input ids has shape (B, T): a batch of B sequences, each with T token positions, holding integer IDs. The output x has shape (B, T, D): the same batch and same sequence length, but now each token has been replaced by a D-dimensional vector.
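One way to convince yourself that the layer is nothing more than row indexing is to compare it with a one-hot matrix multiply. This is a sketch with toy sizes (V = 10, D = 4) picked only for illustration; it also checks which rows receive gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 10, 4                         # toy vocabulary size and hidden dimension
emb = nn.Embedding(V, D)
ids = torch.tensor([[3, 1, 3]])      # shape (B=1, T=3); token 3 appears twice

lookup = emb(ids)                                  # (1, 3, 4) via indexing
one_hot = F.one_hot(ids, num_classes=V).float()    # (1, 3, 10)
matmul = one_hot @ emb.weight                      # (1, 3, 4) via matrix multiply

print(lookup.shape)                       # torch.Size([1, 3, 4])
print(torch.allclose(lookup, matmul))     # True

# Backward pass: only the rows that were looked up get a non-zero gradient.
lookup.sum().backward()
used_rows = emb.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(used_rows)                          # [1, 3]
```

Indexing is used in practice because it avoids materializing the (B, T, V) one-hot tensor, which would be enormous at real vocabulary sizes.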
nn.Embedding is a tall (V, D) matrix. The forward operation is integer indexing: pull row i for token id i. During the backward pass, the gradient updates only the rows that were actually used.
The embedding table has V × D parameters. To get a feel for what this means in practice:
- GPT-2 small: 50,257 × 768 ≈ 39M parameters in the embedding alone, about 30% of the model's 124M total.
- LLaMA-2 7B: 32,000 × 4,096 ≈ 131M parameters, about 1.9% of the 7B total.
- LLaMA-3 8B: 128,256 × 4,096 ≈ 525M parameters, about 6.6% of an 8B-parameter model.
The fraction matters. In small models, the embedding is a meaningful chunk of the parameter budget and of the optimizer's gradient memory. In large models, it shrinks to a few percent.
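These shares can be recomputed directly. The sketch below assumes the three figures refer to GPT-2 small, LLaMA-2 7B, and LLaMA-3 8B (the labels are inferred from the vocabulary and hidden sizes) and uses the rounded totals of 124M, 7B, and 8B quoted in the text.

```python
def embedding_share(vocab_size, hidden_dim, total_params):
    """Return (embedding parameter count, fraction of the whole model)."""
    params = vocab_size * hidden_dim
    return params, params / total_params

models = [
    # (assumed label, V, D, rounded total parameter count)
    ("GPT-2 small", 50_257, 768, 124e6),
    ("LLaMA-2 7B", 32_000, 4_096, 7e9),
    ("LLaMA-3 8B", 128_256, 4_096, 8e9),
]

for name, V, D, total in models:
    params, share = embedding_share(V, D, total)
    print(f"{name}: {params / 1e6:.1f}M embedding parameters, {share:.1%} of the model")
```

With these rounded totals the shares come out near 31%, 1.9%, and 6.6%. An untied output head of the same shape would double each figure.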
The output side of the model is the inverse operation. Given a final hidden state h ∈ ℝ^D, we need to produce a probability distribution over the entire vocabulary. We do that with a linear layer that maps D-dimensional hidden states to V-dimensional logits: logits = h @ W_lm.T, where W_lm ∈ ℝ^{V × D} has the same shape as the input embedding.
Notice that the input embedding and the output projection are essentially mirror operations: one maps from vocabulary to embedding space, the other maps back. Many models exploit this symmetry by sharing the same parameters for both. This trick is called weight tying:
self.lm_head.weight = self.tok_emb.weight
The benefit is parameter savings — instead of two V × D matrices, you have one. For small models where the embedding is a large fraction of total parameters, this is significant. GPT-2 ties; LLaMA-2 does not (it uses a separate lm_head.weight). The choice is mostly a matter of empirical preference.
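The following sketch shows the tying line in context and measures the saving. `TinyLM` is a hypothetical toy module with no Transformer blocks at all; it exists only to show that the embedding and the output head can share one parameter tensor.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical toy model: token embedding fed straight into the output head."""

    def __init__(self, vocab_size, dim, tie_weights=True):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)            # weight shape (V, D)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)   # weight shape (V, D)
        if tie_weights:
            # Both modules now point at the same Parameter object.
            self.lm_head.weight = self.tok_emb.weight

    def forward(self, ids):
        h = self.tok_emb(ids)     # (B, T, D)
        return self.lm_head(h)    # (B, T, V), computed as h @ W.T

def count_params(model):
    # parameters() yields a shared tensor only once, so tied weights are not double counted.
    return sum(p.numel() for p in model.parameters())

tied = TinyLM(1000, 32, tie_weights=True)
untied = TinyLM(1000, 32, tie_weights=False)
logits = tied(torch.tensor([[1, 2, 3]]))

print(count_params(tied), count_params(untied))   # 32000 64000
print(logits.shape)                               # torch.Size([1, 3, 1000])
```

Tying works without any transpose because nn.Linear stores its weight as (out_features, in_features), which is exactly the (V, D) layout of the embedding table.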
It helps to think of the embedding table as a learned dictionary that places each token at a specific location in D-dimensional "meaning space." During training, semantically similar tokens drift toward similar locations, so that the model can use spatial proximity as a signal. This is the same intuition behind the famous Word2Vec demonstration that king − man + woman ≈ queen. LLM embeddings work the same way; they are just learned jointly with the rest of the model rather than as a separate first step.
Self-attention has a strange property: by itself, it does not care about the order of tokens. If you shuffle the tokens in your input and re-run attention, you get the same outputs in the corresponding shuffled positions. The model can't tell "the cat sat" from "sat the cat". For language modeling this is obviously a disaster, so we have to inject position information ourselves.
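The shuffling claim can be tested numerically. The sketch below uses a single unmasked attention head with random weights, all of which are assumptions made for the demonstration; a causal mask is deliberately left out so that only the attention operation itself is being tested.

```python
import torch

torch.manual_seed(0)
T, D = 5, 8
x = torch.randn(T, D, dtype=torch.float64)   # token embeddings with no position signal
Wq, Wk, Wv = (torch.randn(D, D, dtype=torch.float64) for _ in range(3))

def self_attention(x):
    """Single-head, unmasked scaled dot-product attention."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / D ** 0.5
    return torch.softmax(scores, dim=-1) @ v

perm = torch.randperm(T)                 # a random reordering of the T positions
out = self_attention(x)
out_shuffled = self_attention(x[perm])   # run attention on the shuffled input

# Shuffling the input just shuffles the output rows in the same way.
print(torch.allclose(out[perm], out_shuffled))   # True
```

Each token receives exactly the same output vector regardless of where it sits, which is why some position signal has to be added to the input.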
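The 2017 Transformer paper added a fixed (non-learned) function of position to the token embedding. For position pos and dimension index i:
PE(pos, 2i) = sin(pos / 10000^{2i/D})
PE(pos, 2i+1) = cos(pos / 10000^{2i/D})
Here i runs from 0 to D/2 − 1, so even dimensions hold sines and odd dimensions hold cosines, with the wavelength growing as i increases.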
Why use sinusoids of all things? Two reasons.
The first reason is that different dimensions of the encoding oscillate at different frequencies. The first few dimensions cycle very quickly with position (high frequency), so they encode fine-grained positional information. The last few dimensions cycle very slowly (low frequency), so they encode coarse positional information. Together, the model can read off both relative and absolute position.
The second reason is more subtle. For any fixed offset k, the encoding at position pos + k is a linear function of the encoding at position pos. (This falls out of the sum-of-angles formulas for sine and cosine.) So attention can encode the rule "look k steps back" as a fixed linear transformation, which is much easier to learn than encoding it as a complicated function of absolute position.
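Both properties can be checked in code. The sketch below builds the sinusoidal table using the formula from the 2017 paper (sine on even dimensions, cosine on odd ones, base 10000) and then constructs the fixed matrix that shifts any encoding by k positions. The function names are made up for this example.

```python
import math
import torch

def frequencies(D, base=10000.0):
    """One angular frequency per (sin, cos) pair: base ** (-2i / D)."""
    return torch.exp(-math.log(base) * torch.arange(0, D, 2, dtype=torch.float64) / D)

def sinusoidal_encoding(T, D):
    """Return a (T, D) table; even columns hold sines, odd columns hold cosines."""
    pos = torch.arange(T, dtype=torch.float64).unsqueeze(1)   # (T, 1)
    angles = pos * frequencies(D)                             # (T, D/2)
    pe = torch.zeros(T, D, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

def shift_matrix(k, D):
    """Block-diagonal (D, D) matrix M such that PE[pos + k] = M @ PE[pos]."""
    M = torch.zeros(D, D, dtype=torch.float64)
    for j, w in enumerate(frequencies(D).tolist()):
        c, s = math.cos(k * w), math.sin(k * w)
        # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
        M[2 * j, 2 * j], M[2 * j, 2 * j + 1] = c, s
        # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
        M[2 * j + 1, 2 * j], M[2 * j + 1, 2 * j + 1] = -s, c
    return M

pe = sinusoidal_encoding(T=50, D=16)
M = shift_matrix(k=7, D=16)

# The same matrix maps every position to the one 7 steps later.
print(torch.allclose(pe[7:], pe[:-7] @ M.T))   # True
```

Note that M depends only on the offset k, never on pos, which is exactly the "fixed linear transformation" the text describes.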
GPT-2 and BERT take a simpler approach. They allocate another embedding table P of shape (T_max, D) — one row per position — and learn it as part of the model. The full input is then the sum of token embedding and positional embedding: x = E[ids] + P[positions]. This is conceptually the simplest approach: no math required, just another lookup table.
The downside is that the maximum sequence length is fixed at training time. The model can't directly generate or process inputs longer than T_max, because there are no learned embeddings for positions it hasn't seen.
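A minimal sketch of the learned variant makes the length limit visible. `LearnedPositionalInput` is a hypothetical module name, and the sizes are toy values; the explicit length check is added here for clarity.

```python
import torch
import torch.nn as nn

class LearnedPositionalInput(nn.Module):
    """Computes x = E[ids] + P[positions] with a learned position table."""

    def __init__(self, vocab_size, t_max, dim):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)   # E, shape (V, D)
        self.pos_emb = nn.Embedding(t_max, dim)        # P, shape (T_max, D)
        self.t_max = t_max

    def forward(self, ids):
        T = ids.size(1)
        if T > self.t_max:
            # There is simply no row in P for positions >= T_max.
            raise ValueError(f"sequence length {T} exceeds T_max={self.t_max}")
        positions = torch.arange(T, device=ids.device)          # [0, 1, ..., T-1]
        return self.tok_emb(ids) + self.pos_emb(positions)      # (B, T, D) + (T, D)

layer = LearnedPositionalInput(vocab_size=100, t_max=8, dim=16)
x = layer(torch.randint(0, 100, (2, 8)))
print(x.shape)   # torch.Size([2, 8, 16])

try:
    layer(torch.randint(0, 100, (1, 9)))   # one token too long
except ValueError as err:
    print(err)
```

Without the explicit check, the position lookup would fail with an index error instead, since row 8 of an 8-row table does not exist.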
Day 12 will cover RoPE and related modern positional schemes in depth.
Let's put all the pieces together. Here is the entire path from a raw string to the first input that the Transformer sees.
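import torch
import tiktoken

# Assumed stand-ins so the walkthrough runs; a real model loads trained tables.
tokenizer = tiktoken.get_encoding("gpt2")
V, T_max, D = tokenizer.n_vocab, 1024, 768   # GPT-2 small sizes
embedding_table = torch.randn(V, D)
position_table = torch.randn(T_max, D)
dropout = torch.nn.Dropout(p=0.1)

text = "Hello, world"  # no "!", which would add a fourth token (ID 0)
# 1. Tokenize: string → list of integer IDs
ids = tokenizer.encode(text)  # [15496, 11, 995]
ids = torch.tensor([ids])  # (1, 3), adds a batch dimension
# 2. Token embeddings: integer IDs → dense vectors via lookup
tok_emb = embedding_table[ids]  # (1, 3, D)
# 3. Positional embeddings (learned variant)
pos_ids = torch.arange(ids.size(1))  # [0, 1, 2]
pos_emb = position_table[pos_ids]  # (3, D), broadcasts over batch
# 4. Sum the two
x = tok_emb + pos_emb  # (1, 3, D), the residual stream input
# 5. Optionally apply dropout
x = dropout(x)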
That final x tensor is what flows into Transformer block 1. By the end of Day 7, you will have built every layer that processes it.
Companion notebook: day-5-tokenization-embeddings.ipynb.
- Verify that decode(encode(text)) == text for several sentences, including ones containing emoji and accented characters.
- Tokenize the same text with tiktoken cl100k_base, with your own trained BPE, and with the gpt2 encoding. Compare the token counts.
- Create an nn.Embedding(1000, 32). Pick five token IDs and compute pairwise cosine similarities; before training, these should be roughly random. Then train the embedding on a tiny next-token-prediction task and recheck the similarities. Structure should emerge for tokens that often co-occur.
- Plot the sinusoidal positional-encoding (T, D) matrix as a heatmap. You should see characteristic stripes that rotate at different frequencies along the embedding axis.
- Encode " SolidGoldMagikarp" with tiktoken.get_encoding("gpt2"). Note that it tokenizes to a single token. (This particular token caused famously bizarre behavior in GPT-3.)
Close the page and answer from memory. If you can't, re-read the relevant section.
- Consider "hello" and " hello". Why might these tokenize differently in GPT-2?
- An embedding table has shape (32000, 4096). How many parameters is that? What fraction of the model's 7B?
- pos/T?
"Models don't read text. They read integers, and we choose the integers."
Hand-picked references for this lesson.
- The single best tokenizer tutorial. Builds a tiktoken-equivalent from scratch.
- The original "BPE for translation" paper. Adapted Gage's 1994 compression idea for NLP.
- Language-independent subword tokenizer. Default for T5, LLaMA-1/2, multilingual models.
- Paste any text, see tokenization for cl100k_base / o200k_base interactively.
- Section 3.5 introduces the sinusoidal positional encoding.
- Best visual intuition for what an embedding actually represents.
- The original glitch-token write-up. Why some tokens make GPT-3 produce gibberish.