Tokenizer

The model cannot read letters — it only understands numbers. The tokenizer converts each character to a numeric ID, and adds a special BOS (Begin/End of Sequence) token to mark the boundaries of each name.

Character → ID Mapping

Every unique character in the dataset gets a numeric ID. There are 26 characters (a–z) plus one special BOS token (id=26). Total vocabulary size: 27.

a → 0     b → 1     c → 2     d → 3     e → 4     f → 5     g → 6
h → 7     i → 8     j → 9     k → 10    l → 11    m → 12    n → 13
o → 14    p → 15    q → 16    r → 17    s → 18    t → 19    u → 20
v → 21    w → 22    x → 23    y → 24    z → 25    BOS → 26
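
As a sketch of how this mapping might be built in code (the names stoi, itos, and BOS below are illustrative, not necessarily the ones this project uses):

    # Build the 27-token vocabulary: a-z get IDs 0-25, BOS gets ID 26.
    chars = [chr(c) for c in range(ord("a"), ord("z") + 1)]

    stoi = {ch: i for i, ch in enumerate(chars)}   # character -> ID
    BOS = len(stoi)                                # 26, the boundary token
    itos = {i: ch for ch, i in stoi.items()}       # ID -> character
    itos[BOS] = "<BOS>"

    vocab_size = len(itos)                         # 27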

In the example input "emma", only the characters e, m, and a are used.

Try It: Type a Name

Type any name below. Watch how it becomes a sequence of token IDs. The model will learn from sequences like this — predicting each next token from the previous ones.

Only letters a-z are used. Maximum 16 characters.
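
Here is a minimal sketch of what encoding and decoding could look like, reusing the illustrative stoi, itos, and BOS names from above; the project's actual implementation may differ:

    def encode(name):
        # Keep only a-z and cap the length at 16 characters (matching the
        # input rules above), then wrap the character IDs in BOS on both sides.
        cleaned = [ch for ch in name.lower() if ch in stoi][:16]
        return [BOS] + [stoi[ch] for ch in cleaned] + [BOS]

    def decode(tokens):
        # Map IDs back to characters, dropping the BOS boundary tokens.
        return "".join(itos[t] for t in tokens if t != BOS)

    print(encode("emma"))                  # [26, 4, 12, 12, 0, 26]
    print(decode([26, 4, 12, 12, 0, 26]))  # emma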

Token Sequence

For "emma": BOS(26)  e(4)  m(12)  m(12)  a(0)  BOS(26)

What just happened

The input "emma" was converted to 6 tokens: a BOS token at the start, 4 character tokens (e→4, m→12, m→12, a→0), and a BOS token at the end. The model sees only the numbers: [26, 4, 12, 12, 0, 26].
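
To make "predicting each next token from the previous ones" concrete, here is one way to view the training pairs hidden in this sequence (an illustrative sketch, not necessarily how this project prepares its data):

    tokens = [26, 4, 12, 12, 0, 26]   # BOS, e, m, m, a, BOS for "emma"

    # Every prefix of the sequence predicts the token that follows it.
    for i in range(1, len(tokens)):
        context, target = tokens[:i], tokens[i]
        print(context, "->", target)
    # [26]               -> 4
    # [26, 4]            -> 12
    # [26, 4, 12]        -> 12
    # [26, 4, 12, 12]    -> 0
    # [26, 4, 12, 12, 0] -> 26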

How tokenization works in real GPTs

This micro GPT uses character-level tokenization — each letter is one token.

Real GPTs (like GPT-4) use subword tokenization (Byte Pair Encoding, or BPE), where common word pieces like "ing", "tion", and "the" become single tokens. This makes the vocabulary much larger (roughly 50,000 to 100,000 tokens) but lets the model cover the same text with far fewer tokens, so it can process it more efficiently.
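
For comparison, a quick illustration using OpenAI's tiktoken library (not part of this micro GPT); the exact IDs depend on the chosen encoding:

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-4

    ids = enc.encode("tokenization")
    print(ids)              # a handful of subword IDs, not 12 per-letter IDs
    print(enc.decode(ids))  # tokenization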