Archive 2026-02-11_2
Scaling the Total Interpretation to Full enwik9
Papers
- experimental-design.pdf — Experimental Design: Scaling the Total Interpretation to Full enwik9. Seven predictions (P1–P7), four-phase experimental program. (6 pages)
(source)
- scaling-results.pdf — Scaling Results: The enwik9 Model Is More Boolean, More Shallow, and More Distributed. Full Q1–Q7 results on 110M checkpoint. P1 WRONG: margins increase 6.5× (8.56 vs 1.31). P2 confirmed: top-1 neuron 10.4% (not 99.7%). P3 confirmed: 78% at d=1 (not d=15–25). Two kinds of Boolean automaton: deep/concentrated (1024B) vs shallow/distributed (enwik9). (6 pages)
(source)
- factor-weight-scaling.pdf — The Factor Map and Weight Construction Scale: R²=0.83 Is an Architectural Invariant. P4 WRONG: R²=0.830 (not 0.4–0.6). P5 WRONG: analytic 4.21 bpc (not >5). P6 MIXED: overall r drops (0.38 vs 0.56), large-weight r increases (0.77 vs 0.74). Word-length subtraction 13× less catastrophic. (4 pages)
(source)
- trajectory.pdf — Checkpoint Trajectory: The Boolean Automaton Strengthens, the Factor Map Collapses. 99-checkpoint sweep (10M–990M) + 10 Xavier/AdamW checkpoints. Three phases: learning, stable (R²=0.80–0.86), collapse (450M+). Xavier R²=0.87–0.89 (higher than Adam). W_h std perfectly linear. b_y all-negative at 640M. (5 pages)
(source, data)
Key finding: The Boolean automaton regime is a structural property of the 128-hidden tanh RNN, not of training. Both the 1024-byte model (0.079 bpc, memorization) and the enwik9 model (2.81 bpc, generalization) are Boolean, but of opposite character: the 1024B model is deep/sparse/concentrated, the enwik9 model is shallow/dense/distributed.
Results vs Predictions
P1 WRONG: Margins INCREASE 6.5×
Mean margin 8.56 (enwik9) vs 1.31 (1024B). Only 6.4% of margins are small (vs 52.4% for 1024B). Adam + gradient clipping drives deeper saturation.
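A minimal sketch of how such margins can be measured, assuming the margin of a tanh unit is the absolute value of its pre-activation (all weights and states below are random stand-ins, not the trained checkpoint; the 1.0 small-margin threshold is also an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins shaped like the 128-hidden tanh RNN: recurrent
# weights, input weights, bias, previous states, and one-hot byte inputs.
H, X, T = 128, 256, 1000
W_h = rng.normal(0, 0.5, (H, H))
W_x = rng.normal(0, 0.5, (H, X))
b_h = rng.normal(0, 0.1, H)
h = np.tanh(rng.normal(0, 1, (T, H)))      # previous hidden states
x = np.eye(X)[rng.integers(0, X, T)]       # one-hot byte inputs

# Pre-activation margin: distance of the tanh argument from 0.
# A large |pre-activation| means the unit is saturated (Boolean-like).
pre = h @ W_h.T + x @ W_x.T + b_h
margin = np.abs(pre)

mean_margin = margin.mean()
frac_small = (margin < 1.0).mean()         # "small margin" cutoff is an assumption
print(f"mean margin {mean_margin:.2f}, small-margin fraction {frac_small:.1%}")
```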
P2 CONFIRMED: Distributed neuron roles
Top-1 neuron h82 captures 10.4% of gap. 60 neurons needed for 97%. Top-70 beat full model (104.9%).
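The gap-capture numbers come from neuron ablations; a toy sketch of the procedure, with random stand-ins for the hidden states and readout (the real h, W_y, b_y come from the trained checkpoint, so the numbers printed here are meaningless except as a shape check):

```python
import numpy as np

rng = np.random.default_rng(1)
H, V, T = 128, 256, 2000

# Hypothetical stand-ins: hidden states, readout weights, byte targets.
h = np.tanh(rng.normal(0, 1, (T, H)))
W_y = rng.normal(0, 0.3, (V, H))
b_y = rng.normal(0, 0.1, V)
targets = rng.integers(0, V, T)

def bpc(mask):
    """Bits per character with only the masked neurons active."""
    logits = (h * mask) @ W_y.T + b_y
    logp = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return -logp[np.arange(T), targets].mean() / np.log(2)

full = bpc(np.ones(H))       # all neurons
baseline = bpc(np.zeros(H))  # readout bias only
# Score each neuron by the bpc increase when it alone is ablated,
# then evaluate the top-70 subset, as in the top-70-vs-full comparison.
scores = np.array([bpc(1.0 - np.eye(H)[i]) - full for i in range(H)])
order = np.argsort(scores)[::-1]
top70 = bpc(np.isin(np.arange(H), order[:70]).astype(float))
print(f"full {full:.2f}, baseline {baseline:.2f}, top-70 {top70:.2f} bpc")
```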
P3 CONFIRMED: Shallow offsets (even shallower than predicted)
78.1% of neurons at d=1 (predicted d=5–15, actual d=1–3). 85.6% of attribution mass at d≤4.
New: Two kinds of Boolean automaton
1024B: deep (d=15–25), sparse (fan-in 3.5), concentrated (h28=99.7%), slow (dwell 3.3).
enwik9: shallow (d=1–3), dense (fan-in 105), distributed (h82=10.4%), fast (dwell 1.6).
New: Routing backbone via h75
h75 self-loop (−3.2) dominates justifications, analogous to h54 in the 1024B model.
P4 WRONG: R² is an architectural invariant
2-offset conjunction R²=0.830 (enwik9) vs 0.837 (1024B). Predicted 0.4–0.6. 128/128 neurons above 0.70. The factor map is a property of the architecture, not the data.
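A self-contained illustration of a 2-offset conjunction fit: a toy neuron that genuinely depends on a conjunction of the bytes at offsets 1 and 2 is regressed on pair-indicator features, recovering R² ≈ 1. The alphabet size of 4 and the target rule are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5000
bytes_ = rng.integers(0, 4, T + 2)   # tiny 4-symbol alphabet for illustration

# Toy neuron: saturated high when byte[t-1]==1 AND byte[t-2]==3, low otherwise.
y = np.tanh(3.0 * ((bytes_[1:-1] == 1) & (bytes_[:-2] == 3)) - 1.0)

# 2-offset conjunction features: one indicator per (byte at d=1, byte at d=2) pair.
f1 = bytes_[1:-1]
f2 = bytes_[:-2]
X = np.zeros((T, 16))
X[np.arange(T), f1 * 4 + f2] = 1.0

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"2-offset conjunction R^2 = {r2:.3f}")
```

Because the toy neuron is an exact function of the offset pair, the indicator basis fits it perfectly; the interesting empirical fact above is that real enwik9 neurons sit at R² ≈ 0.83 under the same kind of fit.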
P5 WRONG: Analytic construction closes the loop
Fully analytic W_y: 4.21 bpc (predicted >5). Beats the trained model (6.42 bpc), which suffers catastrophic forgetting. Hebbian + optimized W_y: 1.90 bpc.
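For reference, bpc (bits per character) is the mean negative log2 probability the model assigns to each byte; a minimal computation:

```python
import math

def bpc(probs):
    """Mean negative log2 probability over a sequence of per-byte probabilities."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# A uniform distribution over 256 bytes gives exactly 8 bpc, the ceiling
# against which values like 4.21 (analytic) and 2.81 (trained) are read.
print(bpc([1 / 256] * 4))  # -> 8.0
```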
P6 MIXED: Hebbian correlation splits
Overall r(cov, W_h) drops 0.56→0.38. But large-weight r increases 0.74→0.77. Sign accuracy 74.4%. Word-length subtraction 13× less catastrophic (+0.54 vs +7.3 bpc).
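A sketch of the split correlation measurement, assuming r(cov, W_h) is the Pearson correlation between off-diagonal activation covariances and the matching W_h entries, with the large-weight subset taken as the top decile by |weight| (that cutoff, and all the synthetic data, are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
H, T = 128, 4000

# Synthetic stand-ins in which the covariance partly tracks W_h by construction.
W_h = rng.normal(0, 1, (H, H))
h = np.tanh(rng.normal(0, 1, (T, H)) @ (0.2 * W_h.T + np.eye(H)))

cov = np.cov(h.T)
i, j = np.where(~np.eye(H, dtype=bool))    # off-diagonal index pairs
c, w = cov[i, j], W_h[i, j]

r_all = np.corrcoef(c, w)[0, 1]
big = np.abs(w) > np.quantile(np.abs(w), 0.9)   # "large weights" = top 10%
r_big = np.corrcoef(c[big], w[big])[0, 1]
sign_acc = (np.sign(c[big]) == np.sign(w[big])).mean()
print(f"overall r {r_all:.2f}, large-weight r {r_big:.2f}, sign acc {sign_acc:.1%}")
```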
Phase 4: Checkpoint Trajectory (99 + 10 checkpoints)
Three training phases: I (learning, 10–110M), II (stable, R²=0.80–0.86), III (collapse at 450M+, R² drops to 0.4–0.7).
Margins grow monotonically 2.8→61.3. W_h std perfectly linear (+0.031/checkpoint). b_y crosses all-negative at 640M.
Forgetting destroys readout (W_y/b_y), not dynamics (W_h/margins).
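The per-checkpoint W_h std drift can be checked with an ordinary least-squares line. The data below are synthetic stand-ins shaped like trajectory.csv's (checkpoint, W_h std) columns, constructed to match the reported +0.031/checkpoint slope:

```python
import numpy as np

# 99 checkpoints; synthetic W_h std values with slope 0.031 plus small noise.
ckpt = np.arange(99)
wh_std = 0.5 + 0.031 * ckpt + np.random.default_rng(4).normal(0, 1e-3, 99)

# Least-squares line; a near-zero residual std is what "perfectly linear" means.
slope, intercept = np.polyfit(ckpt, wh_std, 1)
resid = wh_std - (slope * ckpt + intercept)
print(f"slope {slope:.3f} per checkpoint, residual std {resid.std():.4f}")
```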
Xavier/AdamW: R² even higher (0.87–0.89)
Xavier init produces R²=0.87–0.89 (vs 0.83–0.86 for random init). Margins grow 8× slower.
All 128 neurons above 0.80 throughout training. Confirms R²≈0.83 is an architectural floor, not a ceiling.
Interactive Experiment Viewers
- Scaling the Total Interpretation — Predictions scorecard, two kinds of Boolean automaton, margin comparison, offset/neuron distribution, R² invariant, weight construction scaling, word-length robustness. Data from all three results papers.
- Checkpoint Trajectory — 99-checkpoint sweep with interactive charts: margin growth (2.8→61.3), R² cliff at 450M, W_h std linear drift, b_y phase transition at 640M, combined dashboard. Full CSV data from trajectory.csv.
Navigation
Next: 20260212 →
Carrier signal problem. Orthogonal offsets. Byte KN = 2.29 bpc at 10M.
← Previous: 20260211
Total interpretation, weight construction, E onto N, quotient chain.