Embeddings (wte/wpe)
After turning letters into IDs, the model converts each ID into a list of 16 numbers called an embedding. It combines token embeddings (what) with position embeddings (where).
Take the letter 'a' at position 0 (positions in the sequence run from 0 to 15).
The model looks up two things: wte['a'] (the token embedding for 'a', id=0) and wpe[0] (the position embedding for position 0). These are both learned 16-dimensional vectors — the model discovers what values work best during training.
Example calculation
For dimension 0, the combined value is wte['a'][0] + wpe[0][0]; each of the 16 dimensions is added independently in the same way.
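Here is a rough sketch of that calculation in plain Python. The vocabulary size of 26 lowercase letters, the random initial values, and the fixed block size of 16 are assumptions for illustration; in the real model both tables are learned.

```python
import numpy as np

# Assumed sizes: 26 lowercase letters, 16 positions (0-15), 16-dim embeddings.
vocab_size, block_size, n_embd = 26, 16, 16

rng = np.random.default_rng(0)
wte = rng.normal(scale=0.02, size=(vocab_size, n_embd))  # token embeddings (what)
wpe = rng.normal(scale=0.02, size=(block_size, n_embd))  # position embeddings (where)

token_id, position = 0, 0        # 'a' has id 0 and sits at position 0
tok_vec = wte[token_id]          # 16 numbers for "which character"
pos_vec = wpe[position]          # 16 numbers for "where it is"
combined = tok_vec + pos_vec     # element-wise: 16 independent additions

print(combined.shape)                                  # (16,)
print(tok_vec[0], "+", pos_vec[0], "=", combined[0])   # dimension 0 of the sum
```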
How wte + wpe combine
When the model processes a token, it looks up wte[token_id] and wpe[position], then adds them together element-by-element. The result is a single vector that encodes both which character and where it is.
Changing either the token or the position changes the combined embedding. The same letter at different positions produces different combined vectors, so the model knows that 'a' at position 0 is different from 'a' at position 5.
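A minimal PyTorch sketch of this lookup-and-add step for a whole sequence, assuming a 26-character vocabulary, a block size of 16, and a hypothetical id mapping where 'a'=0, 'b'=1, 'c'=2 (the layer names follow the wte/wpe convention used above):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 26, 16, 16   # assumed sizes, matching the text

wte = nn.Embedding(vocab_size, n_embd)        # token embedding table (what)
wpe = nn.Embedding(block_size, n_embd)        # position embedding table (where)

ids = torch.tensor([2, 0, 1])                 # hypothetical ids for "cab"
pos = torch.arange(len(ids))                  # tensor([0, 1, 2])

x = wte(ids) + wpe(pos)                       # element-wise add, one row per token
print(x.shape)                                # torch.Size([3, 16])

# Same letter, different positions -> different combined vectors.
a_at_0 = wte(torch.tensor([0])) + wpe(torch.tensor([0]))
a_at_5 = wte(torch.tensor([0])) + wpe(torch.tensor([5]))
print(torch.equal(a_at_0, a_at_5))            # False
```

The result x is what the rest of the network sees: one 16-dimensional vector per input character that already encodes both identity and position.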
Why embeddings?
- A bare number like "4" doesn't tell the model much. An embedding is a richer representation: 16 numbers that encode meaningful properties.
- Similar letters might get similar embeddings, helping the model generalize.
- Position embeddings let the model understand order; without them, "ab" and "ba" would look the same.
- These embeddings are learned: they start random and improve during training (see the sketch after this list).
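To ground that last bullet, here is a tiny, hedged sketch of what "learned" means in practice. The sizes and the throwaway loss are assumptions for illustration, but it shows an embedding row starting at random values and changing once a gradient step touches it.

```python
import torch
import torch.nn as nn

wte = nn.Embedding(26, 16)                     # starts with random values
opt = torch.optim.SGD(wte.parameters(), lr=0.1)

before = wte.weight[0].detach().clone()        # the row for token id 0 ('a')

loss = wte(torch.tensor([0])).pow(2).sum()     # a toy loss that involves that row
loss.backward()                                # gradients flow back into wte.weight
opt.step()                                     # the row for 'a' moves

after = wte.weight[0].detach()
print(torch.allclose(before, after))           # False: training changed the embedding
```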