Embeddings (wte/wpe)
After turning letters into IDs, the model converts each ID into a list of 16 numbers called an embedding. It combines token embeddings (what) with position embeddings (where).
Take the letter 'a' at position 0 (positions in the sequence run from 0 to 15).
The model looks up two things: wte['a'] (the token embedding for 'a', id=0) and wpe[0] (the position embedding for position 0). These are both learned 16-dimensional vectors — the model discovers what values work best during training.
Example calculation
For dimension 0, the combined value is wte['a'][0] + wpe[0][0]; each of the 16 dimensions is added independently in the same way.
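Here is a rough sketch of that calculation in plain Python. The vocabulary size of 26 lowercase letters, the random initial values, and the fixed block size of 16 are assumptions for illustration; in the real model both tables are learned.

```python
import numpy as np

# Assumed sizes: 26 lowercase letters, 16 positions (0-15), 16-dim embeddings.
vocab_size, block_size, n_embd = 26, 16, 16

rng = np.random.default_rng(0)
wte = rng.normal(scale=0.02, size=(vocab_size, n_embd))  # token embeddings (what)
wpe = rng.normal(scale=0.02, size=(block_size, n_embd))  # position embeddings (where)

token_id, position = 0, 0        # 'a' has id 0 and sits at position 0
tok_vec = wte[token_id]          # 16 numbers for "which character"
pos_vec = wpe[position]          # 16 numbers for "where it is"
combined = tok_vec + pos_vec     # element-wise: 16 independent additions

print(combined.shape)                                  # (16,)
print(tok_vec[0], "+", pos_vec[0], "=", combined[0])   # dimension 0 of the sum
```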
How wte + wpe combine
When the model processes a token, it looks up wte[token_id] and wpe[position], then adds them together element-by-element. The result is a single vector that encodes both which character and where it is.
Changing either the token or the position changes the combined embedding. The same letter at different positions produces different combined vectors, so the model knows that 'a' at position 0 is different from 'a' at position 5.
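A minimal PyTorch sketch of this lookup-and-add step for a whole sequence, assuming a 26-character vocabulary, a block size of 16, and a hypothetical id mapping where 'a'=0, 'b'=1, 'c'=2 (the layer names follow the wte/wpe convention used above):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 26, 16, 16   # assumed sizes, matching the text

wte = nn.Embedding(vocab_size, n_embd)        # token embedding table (what)
wpe = nn.Embedding(block_size, n_embd)        # position embedding table (where)

ids = torch.tensor([2, 0, 1])                 # hypothetical ids for "cab"
pos = torch.arange(len(ids))                  # tensor([0, 1, 2])

x = wte(ids) + wpe(pos)                       # element-wise add, one row per token
print(x.shape)                                # torch.Size([3, 16])

# Same letter, different positions -> different combined vectors.
a_at_0 = wte(torch.tensor([0])) + wpe(torch.tensor([0]))
a_at_5 = wte(torch.tensor([0])) + wpe(torch.tensor([5]))
print(torch.equal(a_at_0, a_at_5))            # False
```

The result x is what the rest of the network sees: one 16-dimensional vector per input character that already encodes both identity and position.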
Why embeddings?
- A bare number like "4" doesn't tell the model much. An embedding is a richer representation: 16 numbers that encode meaningful properties.
- Similar letters might get similar embeddings, helping the model generalize.
- Position embeddings let the model understand order; without them, "ab" and "ba" would look the same.
- These embeddings are learned: they start random and improve during training (see the sketch after this list).
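To ground that last bullet, here is a tiny, hedged sketch of what "learned" means in practice. The sizes and the throwaway loss are assumptions for illustration, but it shows an embedding row starting at random values and changing once a gradient step touches it.

```python
import torch
import torch.nn as nn

wte = nn.Embedding(26, 16)                     # starts with random values
opt = torch.optim.SGD(wte.parameters(), lr=0.1)

before = wte.weight[0].detach().clone()        # the row for token id 0 ('a')

loss = wte(torch.tensor([0])).pow(2).sum()     # a toy loss that involves that row
loss.backward()                                # gradients flow back into wte.weight
opt.step()                                     # the row for 'a' moves

after = wte.weight[0].detach()
print(torch.allclose(before, after))           # False: training changed the embedding
```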