
Scaling the Total Interpretation

1024 bytes → 10⁹ bytes: seven predictions tested, two kinds of Boolean automaton discovered

Predictions Scorecard

Seven predictions were made before running the experiments. The results:

WRONG · P1: Margins decrease
RIGHT · P2: Distributed neurons
RIGHT · P3: Shallow offsets
WRONG · P4: R² drops
WRONG · P5: Analytic >5 bpc
MIXED · P6: Hebbian improves
RIGHT · P7: E→N→Q holds
The wrong predictions are wrong in the interesting direction: the architecture preserves more structure than expected. R² stays at 0.83, margins increase, analytic construction reaches 4.21 bpc.

Predictions vs Reality

Each prediction (blue) vs actual result (green). P1 (margins) and P4 (R²) are the most spectacularly wrong. Data from scaling-results.tex and factor-weight-scaling.tex.

Q1: The Boolean Automaton Is Structural

Prediction P1 expected margins to decrease from 60.5 into the 5–15 range. Instead, the enwik9 model's mean margin on eval data is 8.56, versus only 1.31 for the 1024B model on the same metric.

8.56 · enwik9 mean margin (6.5× higher)
1.31 · 1024B mean margin (on same eval)
6.4% · enwik9 small margins (vs 52.4%)
105 · enwik9 fan-in (vs 3.5)

Margin Distribution: 1024B vs enwik9

The 1024B model clusters near the boundary (100% in [0,5]). The enwik9 model spreads deep into saturation (tail to [45,50]). Data from scaling-results.tex.
The Boolean regime is not an artifact of overtraining. Both models—trained in opposite regimes (memorization vs generalization)—converge to Boolean dynamics. The enwik9 model is more strongly Boolean. This is a structural property of the 128-hidden tanh RNN.
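
A minimal sketch of how these margin statistics could be computed, assuming the margin of a unit at time t is its absolute pre-activation |a_t| with h_t = tanh(a_t), and that the "small margin" cutoff matches the [0,5] histogram bin; array names are illustrative:

```python
import numpy as np

def margin_stats(pre_acts, small=5.0):
    """Saturation margins of tanh units.

    pre_acts: (T, H) array of pre-activations a_t, where h_t = tanh(a_t).
    A large |a| puts the unit deep in the saturated (Boolean) regime;
    a small |a| leaves it near the linear boundary.
    """
    m = np.abs(pre_acts)
    return {
        "mean_margin": float(m.mean()),           # 8.56 (enwik9) vs 1.31 (1024B)
        "frac_small": float((m < small).mean()),  # 6.4% vs 52.4%
    }
```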

Q2–Q3: Shallow and Distributed

Dominant Offset Distribution

78.1% of neurons at d=1 (enwik9) vs deep d=15–25 (1024B).
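
One hedged operationalization of "dominant offset": for each neuron, the single past-byte position d whose conditional-mean fit explains the most activation variance. The source doesn't spell out its procedure, so this is an assumption, chosen to be consistent with the 2-offset conjunction analysis later on:

```python
import numpy as np

def dominant_offset(acts, xs, max_d=30):
    """Dominant input offset of one neuron.

    acts: (T,) activations of one hidden unit; xs: (T,) input bytes.
    For each lag d, predict acts[t] by the mean activation conditioned
    on the byte xs[t-d]; the dominant offset is the lag with highest R².
    """
    best_d, best_r2 = None, -np.inf
    for d in range(1, max_d + 1):
        y = acts[d:]
        key = xs[:-d].astype(np.int64)
        sums, counts = np.zeros(256), np.zeros(256)
        np.add.at(sums, key, y)
        np.add.at(counts, key, 1)
        pred = sums[key] / counts[key]          # conditional mean per byte
        r2 = 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        if r2 > best_r2:
            best_d, best_r2 = d, r2
    return best_d, best_r2
```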

Neuron Contribution Distribution

Top-1: 10.4% (enwik9) vs 99.7% (1024B). 60 neurons for 97%.
Property | sat-rnn-1024 | sat-rnn-enwik9
Training regime | Memorization (1024 epochs) | Generalization (1 epoch)
BPC | 0.079 | 2.81 (eval: 3.88)
Boolean regime | Yes (margin 1.31) | Yes (margin 8.56)
Offset depth | Deep (d=15–25) | Shallow (d=1–3)
Top-1 neuron | h28 = 99.7% | h82 = 10.4%
Neurons for 97% | 1 | 60
Mean dwell | 3.3 steps | 1.6 steps
Mean flips/step | 31.6 | 49.6
Fan-in | 3.5 | 105.0
Signal location | d=11–20 | d=1–4
Two kinds of Boolean automaton. The 1024B model is deep, sparse, concentrated, slow. The enwik9 model is shallow, dense, distributed, fast. Both are Boolean. The architecture determines the structure; the training regime determines the allocation.
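
The dwell and flip statistics in the table follow directly from the sign-thresholded hidden trajectory; a sketch, assuming the Boolean state of a unit is sign(h):

```python
import numpy as np

def boolean_dynamics(h):
    """Flip rate and dwell time of the sign-thresholded trajectory.

    h: (T, H) array of hidden states; the Boolean state is sign(h).
    """
    s = np.sign(h)
    flips = s[1:] != s[:-1]                       # (T-1, H) flip events
    flips_per_step = flips.sum(axis=1).mean()     # 31.6 (1024B) vs 49.6 (enwik9)
    runs_per_unit = flips.sum(axis=0) + 1         # runs = flips + 1
    mean_dwell = (len(h) / runs_per_unit).mean()  # 3.3 vs 1.6 steps
    return flips_per_step, mean_dwell
```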

What the Neurons Encode

Top 10 Neurons: Knockout Impact (enwik9)

No single neuron dominates. h82 (+0.237), h30 (+0.207), h75 (+0.141)... The model distributes computation across ~60 neurons. Data from scaling-results.tex.
The top-70 model beats the full model (104.9%). Removing the bottom 58 neurons improves performance: 3.68 bpc vs the 3.88 baseline. The effective enwik9 model is a 70-neuron Boolean automaton.
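
A sketch of the knockout protocol behind these numbers: clamp one hidden unit to zero after every update and re-measure bits per character. The `model.recur`/`model.readout` interface is assumed, not the source's actual API:

```python
import numpy as np

def knockout_bpc(model, data, neuron=None):
    """Bits per character with one hidden unit clamped to zero.

    Assumed interface: model.recur(x, h) -> next hidden state,
    model.readout(h) -> next-byte logits. Impact of neuron i is
    knockout_bpc(model, data, i) - knockout_bpc(model, data, None);
    knocking out whole groups (e.g. the bottom 58) works the same way.
    """
    h = np.zeros(model.hidden_size)
    bits = 0.0
    for x, y in zip(data[:-1], data[1:]):
        h = model.recur(x, h)
        if neuron is not None:
            h[neuron] = 0.0                       # the knockout
        logits = model.readout(h)
        m = logits.max()
        logp = logits - m - np.log(np.exp(logits - m).sum())
        bits -= logp[y] / np.log(2.0)
    return bits / (len(data) - 1)
```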

P4: R² = 0.83 Is an Architectural Invariant

The prediction that 2-offset conjunction R² would drop to 0.4–0.6 was spectacularly wrong.

0.830 · enwik9 R² (predicted 0.4–0.6)
0.837 · 1024B R² (within 1%)
128/128 · neurons with R² ≥ 0.70 (all)
100.2% · BPC gain captured by conditional means

Offset Pair Distribution

The dominant pair shifts from (1,7) to (1,8), but offset 1 is universal: every neuron uses it. Data from factor-weight-scaling.tex.
"The 2-offset conjunction R² of ~0.83 reflects the fraction of variance these conjunctions capture; the remaining ~17% is higher-order interaction that the architecture cannot avoid."
— factor-weight-scaling.tex
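
A sketch of the 2-offset conjunction fit behind the R² ≈ 0.83 figure: for a neuron and offset pair (d1, d2), predict its activation by the conditional mean given the byte pair, the best possible function of that conjunction, then maximize R² over pairs. The in-sample evaluation here is an assumption:

```python
import numpy as np

def conjunction_r2(acts, xs, d1, d2):
    """R² of the 2-offset conjunction fit for one neuron.

    acts: (T,) activations of one hidden unit; xs: (T,) input bytes.
    The predictor is the conditional mean of the activation given the
    pair (xs[t-d1], xs[t-d2]); per-neuron R² is the max over (d1, d2).
    """
    t0 = max(d1, d2)
    y = acts[t0:]
    a = xs[t0 - d1:len(xs) - d1].astype(np.int64)
    b = xs[t0 - d2:len(xs) - d2].astype(np.int64)
    key = a * 256 + b                             # one class per byte pair
    sums, counts = np.zeros(256 * 256), np.zeros(256 * 256)
    np.add.at(sums, key, y)
    np.add.at(counts, key, 1)
    pred = sums[key] / counts[key]                # conditional mean per class
    return 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```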

P5–P6: Weight Construction Scales

Construction Results: 1024B vs enwik9

The fully analytic construction achieves 4.21 bpc on enwik9 (predicted >5). Hebbian + optimized Wy reaches 1.90 bpc—far below the trained model's 6.42 on this data. Data from factor-weight-scaling.tex.
Metric | 1024B | enwik9
r(cov, Wh), all weights | 0.56 | 0.38
r(cov, Wh), |w| ≥ 3 | 0.74 | 0.77
Sign accuracy, |w| ≥ 3 | 93.3% |
Hebbian (all) | 7.44 bpc | 7.44 bpc
Hebbian + opt Wy | 1.15 bpc | 1.90 bpc
Analytic Wy (zero opt) | 1.89 bpc | 4.21 bpc
Trained model | 0.079 bpc | 6.42 bpc
The analytic construction beats the trained model on early data (4.21 vs 6.42 bpc): the trained model suffers catastrophic forgetting, while the analytic construction captures local statistics and cannot forget. The loop is closed for a second model.
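
A hedged sketch of the Hebbian comparison: build a candidate recurrent matrix from the time-lagged covariance of hidden states and correlate it with the trained Wh, overall and restricted to large weights. The exact Hebbian rule in factor-weight-scaling.tex may differ; this is one natural reading of r(cov, Wh):

```python
import numpy as np

def hebbian_wh(h):
    """Hebbian candidate for the recurrent matrix.

    Wh_hat[i, j] ~ cov(h_t[i], h_{t-1}[j]): units that co-fire across
    one step get a connection of matching sign.
    """
    post, pre = h[1:] - h[1:].mean(0), h[:-1] - h[:-1].mean(0)
    return post.T @ pre / (len(h) - 1)

def weight_correlation(cov, wh, thresh=None):
    """Pearson r between the Hebbian candidate and the trained Wh,
    optionally restricted to large trained weights (|w| >= thresh).
    Sign agreement on the same mask gives figures like the 93.3% above.
    """
    mask = np.abs(wh) >= thresh if thresh is not None else np.ones(wh.shape, bool)
    return np.corrcoef(cov[mask], wh[mask])[0, 1]
```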

Word-Length Encoding: 13× Less Catastrophic

Direction Subtraction Cost

Subtracting the word-length direction costs only +0.54 bpc (enwik9) vs +7.3 (1024B). The enwik9 model distributes information; no single direction matters. Data from factor-weight-scaling.tex.
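
Direction subtraction itself is a one-liner: project every hidden state off the candidate direction and re-measure bpc. A sketch, where `u` is the (assumed) word-length direction, e.g. from a linear probe:

```python
import numpy as np

def subtract_direction(h, u):
    """Remove one direction from every hidden state: h' = h - (h·u)u.

    h: (T, H) hidden states; u: (H,) candidate direction. Re-evaluating
    bpc on h' measures reliance on that single direction
    (+0.54 bpc for enwik9 vs +7.3 for 1024B above).
    """
    u = u / np.linalg.norm(u)
    return h - np.outer(h @ u, u)
```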

Synthesis

"The central lesson: the 128-hidden tanh RNN imposes a specific computational structure regardless of training regime. The 2-offset conjunction factor map (R² ≈ 0.83), the Boolean dynamics, and the Hebbian alignment of large weights are all architectural invariants."
— factor-weight-scaling.tex, Discussion
The architecture determines the structure; the training regime determines the allocation. What changes between models: deep/concentrated (1024B) vs shallow/distributed (enwik9). What stays the same: Boolean dynamics, R² ≈ 0.83, weight constructibility.

Related

Source: experimental-design.tex · scaling-results.tex · factor-weight-scaling.tex