Training
During training, the model processes many names and learns to predict the next letter. Loss measures how wrong the predictions are — lower is better. Press play to watch the model learn.
Training Controls
Set how many steps to train (1-1000), then click Play to watch the model learn. Use the speed slider to control how fast the animation runs (1-100x speed).
Step 0: Early Training
The model is making nearly random predictions. Loss is high because it hasn't learned any patterns yet. The weights are still close to their random initial values.
Currently processing "emma" — phase: Forward. Loss has improved 0.0% from the initial value.
→ What's Happening Right Now: Forward Phase
Forward Pass: The model reads the name "emma" one character at a time.
For each position, it tries to predict what comes next. For example:
• Input: [start] → Predict: likely "e"
• Input: "e" → Predict: likely "m"
• Input: "em" → Predict: likely "m"
• ...and so on for all 4 characters
The model processes each position through {embeddings → hidden layers → output logits} to get probability scores for all 27 possible next characters.
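As a rough sketch of that pipeline, here is what one forward step could look like, assuming a single hidden layer and made-up sizes (the demo's actual architecture and dimensions may differ):

```python
import numpy as np

VOCAB = 27                       # a-z plus the special start/end token
EMB, HIDDEN = 8, 64              # illustrative sizes, not the demo's actual ones

rng = np.random.default_rng(0)
C  = rng.normal(0, 0.1, (VOCAB, EMB))       # character embeddings
W1 = rng.normal(0, 0.1, (EMB, HIDDEN))      # embedding -> hidden layer
W2 = rng.normal(0, 0.1, (HIDDEN, VOCAB))    # hidden -> output logits

def forward(char_index):
    """Probability of each of the 27 possible next characters, given one character."""
    x = C[char_index]                        # embeddings
    h = np.tanh(x @ W1)                      # hidden layer
    logits = h @ W2                          # output logits
    probs = np.exp(logits - logits.max())    # softmax (numerically stable)
    return probs / probs.sum()

print(forward(5))   # some character index; near initialization every entry is roughly 1/27 ≈ 0.037
```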
Loss Over Time
The dashed line at ~3.3 represents random guessing (ln(27) ≈ 3.3). As the model learns, loss drops below this baseline. A loss of 2.2 means the model is significantly better than random — it has learned real patterns in names.
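Where the ~3.3 baseline comes from: a uniform guess assigns probability 1/27 to every token, so its cross-entropy is ln(27).

```python
import math

# Cross-entropy of a uniform guess over the 27 tokens: -ln(1/27) = ln(27)
print(math.log(27))   # ≈ 3.2958, the dashed "random guessing" line
```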
Training Cycle
→ Forward
1/4 Pass "emma" through the model to get predictions for each next character
The model processes each character position independently, predicting what comes next.
📉 Loss
2/4 Calculate how wrong the predictions are (current: 3.4500)
← Backward
3/4 Compute gradients — how much each parameter contributed to the error
⟳ Update
4/4 Adjust all parameters using Adam optimizer (lr: 0.010000)
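One way to see all four phases together is a short training-loop sketch. This uses PyTorch purely for illustration; the layer sizes, the "." placeholder for the special token, and reusing the single name "emma" every step are assumptions for brevity, not the demo's actual implementation (the demo draws a different name from the dataset each step).

```python
import torch
import torch.nn.functional as F

VOCAB, EMB, HIDDEN = 27, 8, 64          # sizes are illustrative assumptions
chars = ".abcdefghijklmnopqrstuvwxyz"   # "." stands in for the special token
stoi = {c: i for i, c in enumerate(chars)}

model = torch.nn.Sequential(
    torch.nn.Embedding(VOCAB, EMB),                  # embeddings
    torch.nn.Linear(EMB, HIDDEN), torch.nn.Tanh(),   # hidden layer
    torch.nn.Linear(HIDDEN, VOCAB),                  # output logits
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
total_steps, lr0 = 1000, 0.01

def pairs(name):
    """(current char, next char) index pairs for one name, e.g. 'emma'."""
    s = "." + name + "."
    xs = torch.tensor([stoi[c] for c in s[:-1]])
    ys = torch.tensor([stoi[c] for c in s[1:]])
    return xs, ys

xs, ys = pairs("emma")                     # the demo would use a new name each step
for step in range(total_steps):
    logits = model(xs)                     # 1/4 Forward: predictions for each position
    loss = F.cross_entropy(logits, ys)     # 2/4 Loss: how wrong the predictions are
    optimizer.zero_grad()
    loss.backward()                        # 3/4 Backward: gradients for every parameter
    for g in optimizer.param_groups:       # linear learning-rate decay from 0.01 to 0
        g["lr"] = lr0 * (1 - step / total_steps)
    optimizer.step()                       # 4/4 Update: Adam adjusts all parameters
```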
What the model learns
• You can train from 10 to 1000 steps — more steps = better learning (but diminishing returns)
• Each step uses one name from the dataset to update the model
• Loss starts at ~3.3 (random guessing among 27 tokens: a-z + special token)
• As training progresses, loss decreases — the model gets better at predicting next characters
• Learning rate decays linearly from 0.01 to 0 — big updates early, small refinements late
• Adam optimizer uses momentum and adaptive rates — it "remembers" previous gradients for smarter updates (see the sketch after this list)
• The model learns patterns like:
→ Names often start with certain letters (A, J, M are common)
→ Letter combinations like "qu", "th", "er" appear frequently
→ Vowels and consonants alternate in realistic patterns
→ Names tend to end with vowels or certain consonants
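To make the two optimizer bullets concrete, here is a minimal sketch of the linear learning-rate schedule and a textbook Adam update for one parameter array. The beta and epsilon constants are standard defaults, assumed here rather than taken from the demo.

```python
import numpy as np

def lr_at(step, total_steps, lr0=0.01):
    """Linear decay: big updates early, tiny refinements late, reaching 0 at the end."""
    return lr0 * (1 - step / total_steps)

def adam_update(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: m is momentum (a memory of past gradients), v adapts the step size."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage: one update of a 3-element parameter at step 0 of 1000.
p, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
p, m, v = adam_update(p, np.array([0.5, -1.0, 2.0]), m, v, t=1, lr=lr_at(0, 1000))
```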