
Checkpoint Trajectory

99 checkpoints (10M–990M bytes): the Boolean automaton strengthens, the factor map collapses

Three Phases of Training

Phase I: Learning, 10–110M, R² 0.83–0.86
Phase II: Stable, 110–400M, R² 0.80–0.84
Phase III: Collapse, 450–990M, R² 0.40–0.76

Key numbers: mean margin 2.78 → 61.3 (monotonic growth); 450M, R² cliff (coincides with forgetting); 640M, by goes all-negative (phase transition); Wh std growth 0.031 per 10M bytes (linear).

Margin Trajectory: Always Boolean

The mean margin grows monotonically from 2.78 (10M) to 61.3 (990M). The model is Boolean from the very first checkpoint.

Mean Margin vs Training Progress

The anomalous 100M point (margin 59.7) is from Run 1 (SGD), not Run 2 (Adam). Excluding it, margin growth is smooth and monotonic. The grey band marks Phase II (stable). Data: 99 checkpoints from trajectory.csv.
There is no analog regime at any point during training. The tanh RNN is Boolean from the very first checkpoint (75.7% of margins > 1 at 10M). Adam with gradient clipping drives margins higher: weights grow unbounded, producing larger pre-activations.

R² Trajectory: The 450M Cliff

The 2-offset conjunction R² holds steady at 0.80–0.86 for the first 400M bytes, then collapses sharply.

Factor Map R² vs Training Progress

The R² cliff at 450–460M coincides exactly with the onset of catastrophic forgetting. During Phase I–II (10–400M), R² oscillates in [0.80, 0.86]. After 450M, it drops to 0.40–0.76. Data: trajectory.csv.

Neurons with R² ≥ 0.80

Before 450M: 73–126 neurons above the threshold. After 460M: drops to 0–8 (with transient spikes). The conjunction structure fragments during catastrophic forgetting.
Catastrophic forgetting destroys the readout, not the dynamics. Margins grow through the collapse; the Boolean automaton strengthens. What fails is the Wy/by readout: the model can no longer decode its own internal representation.

Wh Drift: Perfectly Linear

Wh Standard Deviation vs Training Progress

std(Wh) = 0.22 + 0.031 × (checkpoint − 10M)/10M. R² > 0.99 (excluding 100M anomaly). The growth is perfectly linear and never stabilizes. This drives margin growth and eventual readout failure.

Output Bias Collapse: The 640M Phase Transition

by Range (min, max) vs Training Progress

max(by) crosses zero between 630M (+0.66) and 640M (−0.37). After 640M, all 256 output biases are negative; at 990M the range is [−46.3, −29.9]. The reversal starts at exactly 450M: from 10–450M, max(by) grows (0.79 → 8.18); from 450–640M it collapses (8.18 → −0.37). Three transitions align at 450M: the R² cliff, the by reversal, and the training bpc cliff.

BPC and Hebbian Correlation

BPC on Eval Window

BPC on the first 1024 bytes. It stays at 5.3–7.5 throughout (the model has moved on from this data).

Hebbian r(cov, Wh)

Hebbian correlation drifts down from ~0.42 to ~0.30, then rebounds slightly in Phase III.

The Anomalous 100M Checkpoint

epoch1_100M.bin is from Run 1 (SGD), not Run 2 (Adam).

The 100M checkpoint has margin 59.7, Wh std 3.30, R² = 0.49, by range [−47.4, −30.9]. These values are roughly 7× those at the surrounding checkpoints (90M: margin 7.4, std 0.50; 110M: margin 8.2, std 0.55). The Run 1 model at 100M had already undergone the same weight explosion that Run 2 doesn't reach until 990M.

Combined Dashboard: All Metrics

Margin, R², and Wh std (normalized)

All three metrics normalized to [0,1] for comparison. Margin and Wh std grow together (positive feedback). R² holds steady then collapses at 450M. The 100M anomaly is excluded for clarity.

Related

Source: trajectory.tex · Data: trajectory.csv · Program: trajectory.c