Tokenizer
The model cannot read letters; it only understands numbers. The tokenizer converts each character to a numeric ID and adds a special BOS (Beginning of Sequence) token at both ends to mark the boundaries of each name.
Character → ID Mapping
Every unique character in the dataset gets a numeric ID. There are 26 characters (a–z) plus one special BOS token (id=26). Total vocabulary size: 27.
Characters highlighted in green are used in your input "emma"
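Here is a minimal sketch of how such a mapping could be built in Python; the names `stoi`, `itos`, and `BOS_ID` are illustrative choices, not taken from this page's actual code:

```python
import string

# 'a'..'z' get IDs 0..25, in alphabetical order.
stoi = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
BOS_ID = len(stoi)                      # 26, the special boundary token
itos = {i: ch for ch, i in stoi.items()}
itos[BOS_ID] = "<BOS>"

VOCAB_SIZE = len(itos)                  # 27 = 26 letters + 1 BOS token
print(VOCAB_SIZE)                       # 27
print(stoi["e"], stoi["m"], stoi["a"])  # 4 12 0
```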
Try It: Type a Name
Type any name below. Watch how it becomes a sequence of token IDs. The model will learn from sequences like this — predicting each next token from the previous ones.
Only letters a-z are used. Maximum 16 characters.
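A tiny encoder along these lines might look like the sketch below; the `encode` helper and its validation are assumptions based on the description above, not this page's actual implementation:

```python
import string

stoi = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
BOS_ID = 26  # the special token that marks both ends of a name

def encode(name: str, max_len: int = 16) -> list[int]:
    """Convert a name to token IDs, wrapped in BOS tokens at both ends."""
    name = name.lower()
    if not all(ch in stoi for ch in name):
        raise ValueError("only letters a-z are allowed")
    if len(name) > max_len:
        raise ValueError(f"at most {max_len} characters")
    return [BOS_ID] + [stoi[ch] for ch in name] + [BOS_ID]

print(encode("emma"))  # [26, 4, 12, 12, 0, 26]
```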
Token Sequence
What just happened
Your input "emma" was converted to 0 tokens: a BOS token at the start, 4 character tokens (e→4, m→12, m→12, a→0), and a BOS token at the end. The model sees only the numbers: [].
How tokenization works in real GPTs
This micro GPT uses character-level tokenization — each letter is one token.
Real GPTs (like GPT-4) use subword tokenization (BPE), where common word parts like "ing", "tion", and "the" become single tokens. This makes the vocabulary much larger (roughly 50,000 tokens for GPT-2 and GPT-3, around 100,000 for GPT-4) but lets the model process text far more efficiently, since each token covers several characters at once.
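For contrast, a quick sketch using OpenAI's tiktoken library (a separate dependency, not part of this micro GPT) shows subword tokenization in action; the exact IDs and splits depend on the encoding chosen:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Tokenization of everything")
print(enc.n_vocab)                        # vocabulary size, on the order of 100,000
print(tokens)                             # a few subword IDs, far fewer than characters
print([enc.decode([t]) for t in tokens])  # each ID maps back to a word piece
```

Notice that a whole word like "of" comes out as a single token here, whereas the character-level tokenizer above would spend one token per letter.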