Training
During training, the model processes many names and learns to predict the next letter. Loss measures how wrong the predictions are — lower is better. Press play to watch the model learn.
Training Controls
Set how many steps to train (1-1000), then click Play to watch the model learn. Use the speed slider to control how fast the animation runs (1-100x speed).
Step 0: Early Training
The model is making nearly random predictions. Loss is high because it hasn't learned any patterns yet. The weights are still close to their random initial values.
Currently processing "emma" — phase: Forward. Loss has improved 0.0% from the initial value.
→ What's Happening Right Now: Forward Phase
Forward Pass: The model reads the name "emma" one character at a time.
For each position, it tries to predict what comes next. For example:
• Input: [start] → Predict: likely "e"
• Input: "e" → Predict: likely "m"
• Input: "em" → Predict: likely "m"
• ...and so on for all 4 characters
The model processes each position through {embeddings → hidden layers → output logits} to get probability scores for all 27 possible next characters.
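As a rough sketch of that pipeline, here is what one forward step could look like, assuming a single hidden layer and made-up sizes (the demo's actual architecture and dimensions may differ):

```python
import numpy as np

VOCAB = 27                       # a-z plus the special start/end token
EMB, HIDDEN = 8, 64              # illustrative sizes, not the demo's actual ones

rng = np.random.default_rng(0)
C  = rng.normal(0, 0.1, (VOCAB, EMB))       # character embeddings
W1 = rng.normal(0, 0.1, (EMB, HIDDEN))      # embedding -> hidden layer
W2 = rng.normal(0, 0.1, (HIDDEN, VOCAB))    # hidden -> output logits

def forward(char_index):
    """Probability of each of the 27 possible next characters, given one character."""
    x = C[char_index]                        # embeddings
    h = np.tanh(x @ W1)                      # hidden layer
    logits = h @ W2                          # output logits
    probs = np.exp(logits - logits.max())    # softmax (numerically stable)
    return probs / probs.sum()

print(forward(5))   # some character index; near initialization every entry is roughly 1/27 ≈ 0.037
```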
Loss Over Time
The dashed line at ~3.3 represents random guessing (ln(27) ≈ 3.3). As the model learns, loss drops below this baseline. A loss of 2.2 means the model is significantly better than random — it has learned real patterns in names.
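Where the ~3.3 baseline comes from: a uniform guess assigns probability 1/27 to every token, so its cross-entropy is ln(27).

```python
import math

# Cross-entropy of a uniform guess over the 27 tokens: -ln(1/27) = ln(27)
print(math.log(27))   # ≈ 3.2958, the dashed "random guessing" line
```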
Training Cycle
→ Forward
1/4 Pass "emma" through the model to get predictions for each next character
The model processes each character position independently, predicting what comes next.
📉 Loss
2/4 Calculate how wrong the predictions are (current: 3.4500)
← Backward
3/4 Compute gradients — how much each parameter contributed to the error
⟳ Update
4/4 Adjust all parameters using Adam optimizer (lr: 0.010000)
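One way to see all four phases together is a short training-loop sketch. This uses PyTorch purely for illustration; the layer sizes, the "." placeholder for the special token, and reusing the single name "emma" every step are assumptions for brevity, not the demo's actual implementation (the demo draws a different name from the dataset each step).

```python
import torch
import torch.nn.functional as F

VOCAB, EMB, HIDDEN = 27, 8, 64          # sizes are illustrative assumptions
chars = ".abcdefghijklmnopqrstuvwxyz"   # "." stands in for the special token
stoi = {c: i for i, c in enumerate(chars)}

model = torch.nn.Sequential(
    torch.nn.Embedding(VOCAB, EMB),                  # embeddings
    torch.nn.Linear(EMB, HIDDEN), torch.nn.Tanh(),   # hidden layer
    torch.nn.Linear(HIDDEN, VOCAB),                  # output logits
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
total_steps, lr0 = 1000, 0.01

def pairs(name):
    """(current char, next char) index pairs for one name, e.g. 'emma'."""
    s = "." + name + "."
    xs = torch.tensor([stoi[c] for c in s[:-1]])
    ys = torch.tensor([stoi[c] for c in s[1:]])
    return xs, ys

xs, ys = pairs("emma")                     # the demo would use a new name each step
for step in range(total_steps):
    logits = model(xs)                     # 1/4 Forward: predictions for each position
    loss = F.cross_entropy(logits, ys)     # 2/4 Loss: how wrong the predictions are
    optimizer.zero_grad()
    loss.backward()                        # 3/4 Backward: gradients for every parameter
    for g in optimizer.param_groups:       # linear learning-rate decay from 0.01 to 0
        g["lr"] = lr0 * (1 - step / total_steps)
    optimizer.step()                       # 4/4 Update: Adam adjusts all parameters
```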
What the model learns
• You can train from 10 to 1000 steps — more steps = better learning (but diminishing returns)
• Each step uses one name from the dataset to update the model
• Loss starts at ~3.3 (random guessing among 27 tokens: a-z + special token)
• As training progresses, loss decreases — the model gets better at predicting next characters
• Learning rate decays linearly from 0.01 to 0 — big updates early, small refinements late
• Adam optimizer uses momentum and adaptive rates — it "remembers" previous gradients for smarter updates (see the sketch after this list)
• The model learns patterns like:
→ Names often start with certain letters (A, J, M are common)
→ Letter combinations like "qu", "th", "er" appear frequently
→ Vowels and consonants alternate in realistic patterns
→ Names tend to end with vowels or certain consonants
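To make the two optimizer bullets concrete, here is a minimal sketch of the linear learning-rate schedule and a textbook Adam update for one parameter array. The beta and epsilon constants are standard defaults, assumed here rather than taken from the demo.

```python
import numpy as np

def lr_at(step, total_steps, lr0=0.01):
    """Linear decay: big updates early, tiny refinements late, reaching 0 at the end."""
    return lr0 * (1 - step / total_steps)

def adam_update(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: m is momentum (a memory of past gradients), v adapts the step size."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage: one update of a 3-element parameter at step 0 of 1000.
p, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
p, m, v = adam_update(p, np.array([0.5, -1.0, 2.0]), m, v, t=1, lr=lr_at(0, 1000))
```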