
Archive 2026-02-11_2

Scaling the Total Interpretation to Full enwik9


Key finding: The Boolean automaton regime is a structural property of the 128-hidden tanh RNN, not of training. Both the 1024-byte model (0.079 bpc, memorization) and the enwik9 model (2.81 bpc, generalization) are Boolean, but of opposite character: the 1024B model is deep/sparse/concentrated, the enwik9 model is shallow/dense/distributed.

Results vs Predictions

P1 WRONG: Margins INCREASE 6.5×
Mean margin 8.56 (enwik9) vs 1.31 (1024B); only 6.4% of neurons have small margins (vs 52.4%). Adam + gradient clipping drives deeper saturation.
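The margin statistics above can be sketched as follows. This is a minimal synthetic example, assuming "margin" means the magnitude of a neuron's pre-tanh activation (distance from the Boolean decision boundary at 0); the weight names W_h, W_x, b_h and the random values are illustrative, not the trained model's.

```python
import numpy as np

# Hypothetical 128-hidden tanh RNN with byte (one-hot) input.
rng = np.random.default_rng(0)
H, V = 128, 256
W_h = rng.normal(scale=1.0, size=(H, H))
W_x = rng.normal(scale=1.0, size=(H, V))
b_h = np.zeros(H)

def step_margins(h, x_onehot):
    """One RNN step; the per-neuron 'margin' is the pre-activation
    magnitude. Large margin => saturated, Boolean-like behavior."""
    pre = W_h @ h + W_x @ x_onehot + b_h
    return np.abs(pre), np.tanh(pre)

h = np.zeros(H)
x = np.zeros(V); x[ord('a')] = 1.0
margins, h = step_margins(h, x)
small = np.mean(margins < 1.0)  # fraction of small-margin neurons
```

The same two summary numbers reported above (mean margin, small-margin fraction) fall out of `margins.mean()` and `small` when run over a real byte stream.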
P2 CONFIRMED: Distributed neuron roles
The top neuron, h82, captures 10.4% of the gap; 60 neurons are needed for 97%, and the top 70 beat the full model (104.9%).
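The "60 neurons for 97%" style of statistic is a cumulative-coverage computation over per-neuron attribution scores. A sketch with synthetic scores (in the experiment these would be each neuron's contribution to the bpc gap):

```python
import numpy as np

rng = np.random.default_rng(1)
attr = rng.exponential(size=128)            # hypothetical per-neuron attributions
order = np.argsort(attr)[::-1]              # neurons sorted by attribution, descending
coverage = np.cumsum(attr[order]) / attr.sum()
k_97 = int(np.searchsorted(coverage, 0.97) + 1)  # smallest top-k covering 97%
top1_share = attr[order[0]] / attr.sum()         # analogue of h82's 10.4%
```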
P3 CONFIRMED: Shallow offsets (even shallower than predicted)
78.1% of neurons at d=1 (predicted d=5–15, actual d=1–3). 85.6% of attribution mass at d≤4.
New: Two kinds of Boolean automaton
1024B: deep (d=15–25), sparse (fan-in 3.5), concentrated (h28=99.7%), slow (dwell 3.3).
enwik9: shallow (d=1–3), dense (fan-in 105), distributed (h82=10.4%), fast (dwell 1.6).
New: Routing backbone via h75
h75 self-loop (−3.2) dominates justifications, analogous to h54 in the 1024B model.
P4 WRONG: R² is an architectural invariant
2-offset conjunction R²=0.830 (enwik9) vs 0.837 (1024B). Predicted 0.4–0.6. 128/128 neurons above 0.70. The factor map is a property of the architecture, not the data.
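The per-neuron R² numbers come from fitting each hidden unit as a function of the bytes at two fixed offsets. A minimal sketch of that fit, under the assumption that the "2-offset conjunction" is a linear model over one-hot features of the two bytes; the data here is synthetic, with hidden states built to be nearly linear in those features:

```python
import numpy as np

rng = np.random.default_rng(2)
T, H = 2000, 4                      # short stream, few neurons for illustration
stream = rng.integers(0, 256, size=T + 2)

# One-hot features for the bytes at offsets d=1 and d=2.
X = np.zeros((T, 512))
for t in range(T):
    X[t, stream[t]] = 1.0           # offset d=1
    X[t, 256 + stream[t + 1]] = 1.0 # offset d=2

# Synthetic hidden states: linear in the two offsets plus small noise.
h = X @ rng.normal(size=(512, H)) + 0.1 * rng.normal(size=(T, H))

beta, *_ = np.linalg.lstsq(X, h, rcond=None)
resid = h - X @ beta
r2 = 1 - resid.var(axis=0) / h.var(axis=0)   # per-neuron R^2
```

With real hidden states the same `r2` vector is what the 0.830 mean and the "128/128 above 0.70" claim summarize.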
P5 WRONG: Analytic construction closes the loop
A fully analytic W_y reaches 4.21 bpc (predicted >5) and beats the trained model (6.42 bpc), whose readout was destroyed by catastrophic forgetting. Hebbian + optimized W_y: 1.90 bpc.
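One way to build a readout analytically is a Hebbian-style outer product between hidden states and next-byte targets, with a log-frequency bias. This is a hedged sketch of that idea on synthetic data, not the construction used in the experiment; shapes follow the 128-hidden byte-level setup:

```python
import numpy as np

rng = np.random.default_rng(3)
T, H, V = 5000, 128, 256
h = rng.normal(size=(T, H))                 # stand-in hidden states
targets = rng.integers(0, V, size=T)        # stand-in next bytes
Y = np.eye(V)[targets]                      # one-hot targets

W_y = Y.T @ h / T                           # Hebbian outer-product readout
b_y = np.log(Y.mean(axis=0) + 1e-9)         # log-frequency bias

logits = h @ W_y.T + b_y
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
bpc = -np.mean(np.log2(probs[np.arange(T), targets]))
```

On real hidden states, `bpc` is the quantity compared above (4.21 analytic vs 6.42 trained vs 1.90 Hebbian + optimized).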
P6 MIXED: Hebbian correlation splits
Overall r(cov, W_h) drops 0.56→0.38, but large-weight r increases 0.74→0.77; sign accuracy is 74.4%. Word-length subtraction is 13× less catastrophic (+0.54 vs +7.3 bpc).
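The Hebbian-correlation diagnostic reduces to Pearson r between covariance entries and recurrent weights, plus sign agreement, optionally restricted to the largest weights. A sketch on synthetic, partially aligned matrices (the names `cov` and `W_h` follow the summary above):

```python
import numpy as np

rng = np.random.default_rng(4)
H = 128
cov = rng.normal(size=(H, H))                          # stand-in activity covariance
W_h = 0.5 * cov + rng.normal(scale=0.5, size=(H, H))   # partially aligned weights

r = np.corrcoef(cov.ravel(), W_h.ravel())[0, 1]        # overall correlation
sign_acc = np.mean(np.sign(cov) == np.sign(W_h))       # sign agreement
big = np.abs(W_h) > np.quantile(np.abs(W_h), 0.9)      # top-10% weights by magnitude
r_big = np.corrcoef(cov[big], W_h[big])[0, 1]          # large-weight correlation
```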
Phase 4: Checkpoint Trajectory (99 + 10 checkpoints)
Three training phases: I (learning, 10–110M), II (stable, R²=0.80–0.86), III (collapse at 450M+, R² drops to 0.4–0.7). Margins grow monotonically 2.8→61.3. W_h std perfectly linear (+0.031/checkpoint). b_y crosses all-negative at 640M. Forgetting destroys readout (W_y/b_y), not dynamics (W_h/margins).
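The "perfectly linear" W_h std claim is checkable by computing std(W_h) per checkpoint and fitting a line. A trivial sketch with simulated checkpoints whose std grows exactly at the reported +0.031 per checkpoint:

```python
import numpy as np

n_ckpt = 99
# Stand-in for [W.std() for W in checkpoint_W_h]; exactly linear here.
stds = 0.5 + 0.031 * np.arange(n_ckpt)
slope, intercept = np.polyfit(np.arange(n_ckpt), stds, 1)
```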
Xavier/AdamW: R² even higher (0.87–0.89)
Xavier init produces R²=0.87–0.89 (vs 0.83–0.86 for random init). Margins grow 8× slower. All 128 neurons above 0.80 throughout training. Confirms R²≈0.83 is an architectural floor, not a ceiling.


Navigation

Next: 20260212 →
Carrier signal problem. Orthogonal offsets. Byte KN = 2.29 bpc at 10M.
← Previous: 20260211
Total interpretation, weight construction, E onto N, quotient chain.