Tokenizer

The model cannot read letters — it only understands numbers. The tokenizer converts each character to a numeric ID, and adds a special BOS (Begin/End of Sequence) token to mark the boundaries of each name.

Character → ID Mapping

Every unique character in the dataset gets a numeric ID. There are 26 characters (a–z) plus one special BOS token (id=26). Total vocabulary size: 27.

a → 0     b → 1     c → 2     d → 3     e → 4     f → 5     g → 6
h → 7     i → 8     j → 9     k → 10    l → 11    m → 12    n → 13
o → 14    p → 15    q → 16    r → 17    s → 18    t → 19    u → 20
v → 21    w → 22    x → 23    y → 24    z → 25    BOS → 26
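
As a sketch of how this mapping might be built in code (the names stoi, itos, and BOS below are illustrative, not necessarily the ones this project uses):

    # Build the 27-token vocabulary: a-z get IDs 0-25, BOS gets ID 26.
    chars = [chr(c) for c in range(ord("a"), ord("z") + 1)]

    stoi = {ch: i for i, ch in enumerate(chars)}   # character -> ID
    BOS = len(stoi)                                # 26, the boundary token
    itos = {i: ch for ch, i in stoi.items()}       # ID -> character
    itos[BOS] = "<BOS>"

    vocab_size = len(itos)                         # 27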

In the example input "emma", only the characters e, m, and a are used.

Try It: Type a Name

Type any name below. Watch how it becomes a sequence of token IDs. The model will learn from sequences like this — predicting each next token from the previous ones.

Only letters a-z are used. Maximum 16 characters.
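
Here is a minimal sketch of what encoding and decoding could look like, reusing the illustrative stoi, itos, and BOS names from above; the project's actual implementation may differ:

    def encode(name):
        # Keep only a-z and cap the length at 16 characters (matching the
        # input rules above), then wrap the character IDs in BOS on both sides.
        cleaned = [ch for ch in name.lower() if ch in stoi][:16]
        return [BOS] + [stoi[ch] for ch in cleaned] + [BOS]

    def decode(tokens):
        # Map IDs back to characters, dropping the BOS boundary tokens.
        return "".join(itos[t] for t in tokens if t != BOS)

    print(encode("emma"))                  # [26, 4, 12, 12, 0, 26]
    print(decode([26, 4, 12, 12, 0, 26]))  # emma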

Token Sequence

For "emma": BOS(26)  e(4)  m(12)  m(12)  a(0)  BOS(26)

What just happened

The input "emma" was converted to 6 tokens: a BOS token at the start, 4 character tokens (e→4, m→12, m→12, a→0), and a BOS token at the end. The model sees only the numbers: [26, 4, 12, 12, 0, 26].
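
To make "predicting each next token from the previous ones" concrete, here is one way to view the training pairs hidden in this sequence (an illustrative sketch, not necessarily how this project prepares its data):

    tokens = [26, 4, 12, 12, 0, 26]   # BOS, e, m, m, a, BOS for "emma"

    # Every prefix of the sequence predicts the token that follows it.
    for i in range(1, len(tokens)):
        context, target = tokens[:i], tokens[i]
        print(context, "->", target)
    # [26]               -> 4
    # [26, 4]            -> 12
    # [26, 4, 12]        -> 12
    # [26, 4, 12, 12]    -> 0
    # [26, 4, 12, 12, 0] -> 26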

How tokenization works in real GPTs

This micro GPT uses character-level tokenization — each letter is one token.

Real GPTs (like GPT-4) use subword tokenization (Byte Pair Encoding, or BPE), where common word pieces like "ing", "tion", and "the" become single tokens. This makes the vocabulary much larger (roughly 50,000 to 100,000 tokens) but lets the model cover the same text with far fewer tokens, so it can process it more efficiently.
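
For comparison, a quick illustration using OpenAI's tiktoken library (not part of this micro GPT); the exact IDs depend on the chosen encoding:

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-4

    ids = enc.encode("tokenization")
    print(ids)              # a handful of subword IDs, not 12 per-letter IDs
    print(enc.decode(ids))  # tokenization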