Scaling the Total Interpretation
1024 bytes → 10⁹ bytes: seven predictions tested, two kinds of Boolean automaton discovered
Predictions Scorecard
Seven predictions were made before running the experiments. The results:
- P1 (margins decrease): WRONG
- P2 (distributed neurons): RIGHT
- P6 (Hebbian improves): MIXED
The wrong predictions are wrong in the interesting direction:
the architecture preserves more structure than expected. R² stays at 0.83, margins increase, and the analytic construction reaches 4.21 bpc.
Q1: The Boolean Automaton Is Structural
Prediction P1 expected margins to decrease from 60.5 to 5–15. Instead they increased to 8.56 on eval data, versus a mean of only 1.31 for the 1024B model on the same metric.
- 8.56: enwik9 mean margin (6.5× higher)
- 1.31: 1024B mean margin (on the same eval)
- 6.4%: enwik9 small margins (vs 52.4% for 1024B)
Margin Distribution: 1024B vs enwik9
The 1024B model clusters near the boundary (100% in [0,5]).
The enwik9 model spreads deep into saturation (tail to [45,50]).
Data from scaling-results.tex.
The Boolean regime is not an artifact of overtraining.
Both models—trained in opposite regimes (memorization vs generalization)—converge to Boolean dynamics.
The enwik9 model is more strongly Boolean. This is a structural property of the 128-hidden tanh RNN.
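The margin measurement can be sketched in a few lines. This assumes the margin of a unit is its absolute tanh pre-activation (its distance from the sign boundary at 0); the function name, toy shapes, and weights below are illustrative, not the experiment's code.

```python
import numpy as np

def saturation_margins(W_h, W_x, b, h_seq, x_seq):
    """Per-step |pre-activation| margins of a tanh RNN.

    A unit behaves Boolean when its pre-activation sits far from 0,
    so tanh is pinned near ±1; the margin is that distance.
    h_seq: (T, H) hidden states, x_seq: (T, D) inputs.
    """
    pre = h_seq @ W_h.T + x_seq @ W_x.T + b   # (T, H) pre-activations
    return np.abs(pre)

# Toy usage with random weights and states (illustrative only).
rng = np.random.default_rng(0)
H, D, T = 128, 256, 10
W_h, W_x = rng.normal(size=(H, H)), rng.normal(size=(H, D))
m = saturation_margins(W_h, W_x, np.zeros(H),
                       rng.normal(size=(T, H)), rng.normal(size=(T, D)))
mean_margin = m.mean()   # compare: 1.31 (1024B) vs 8.56 (enwik9)
```

On this definition, the histograms above are just the distribution of `m` pooled over steps and units.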
Q2–Q3: Shallow and Distributed
Dominant Offset Distribution
78.1% of neurons at d=1 (enwik9) vs deep d=15–25 (1024B).
Neuron Contribution Distribution
Top-1: 10.4% (enwik9) vs 99.7% (1024B). 60 neurons for 97%.
| Property | sat-rnn-1024 | sat-rnn-enwik9 |
| --- | --- | --- |
| Training regime | Memorization (1024 epochs) | Generalization (1 epoch) |
| BPC | 0.079 | 2.81 (eval: 3.88) |
| Boolean regime | Yes (margin 1.31) | Yes (margin 8.56) |
| Offset depth | Deep (d=15–25) | Shallow (d=1–3) |
| Top-1 neuron | h28 = 99.7% | h82 = 10.4% |
| Neurons for 97% | 1 | 60 |
| Mean dwell | 3.3 steps | 1.6 steps |
| Mean flips/step | 31.6 | 49.6 |
| Fan-in | 3.5 | 105.0 |
| Signal location | d=11–20 | d=1–4 |
Two kinds of Boolean automaton.
The 1024B model is deep, sparse, concentrated, slow. The enwik9 model is shallow, dense, distributed, fast.
Both are Boolean. The architecture determines the structure; the training regime determines the allocation.
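The dwell and flip rows of the table can be computed from sign-binarized hidden states. A sketch, assuming a flip is a sign change of one unit between consecutive steps and dwell is the mean run length of a constant sign; that is one plausible reading of the metrics, not a quoted definition.

```python
import numpy as np

def boolean_stats(h_seq):
    """Mean dwell time and flips/step of sign-binarized states.

    h_seq: (T, H) array of tanh activations, binarized at 0.
    """
    s = np.sign(h_seq)                        # (T, H) in {-1, 0, +1}
    flips = (s[1:] != s[:-1])                 # (T-1, H) sign changes
    flips_per_step = flips.sum(axis=1).mean() # avg units flipping per step
    # each unit has (its flips + 1) runs, so total runs = flips + H
    n_runs = flips.sum() + h_seq.shape[1]
    mean_dwell = s.size / n_runs              # unit-steps per run
    return mean_dwell, flips_per_step

# Toy check: unit 0 never flips, unit 1 flips every step.
h = np.ones((6, 2))
h[:, 1] = [1, -1, 1, -1, 1, -1]
dwell, fps = boolean_stats(h)
```

On this toy, unit 0 contributes one run of length 6 and unit 1 six runs of length 1, so the mean dwell is 12/7 and exactly one unit flips per step.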
What the Neurons Encode
Top 10 Neurons: Knockout Impact (enwik9)
No single neuron dominates. h82 (+0.237), h30 (+0.207), h75 (+0.141)...
The model distributes computation across ~60 neurons.
Data from scaling-results.tex.
The top-70 neurons beat the full model (104.9%).
Removing the bottom 58 neurons improves performance: 3.68 bpc vs the 3.88 baseline.
The effective enwik9 model is a 70-neuron Boolean automaton.
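A hedged sketch of the knockout protocol: zero one unit and re-measure bpc. For simplicity this toy knocks the unit out only at the softmax readout, not through the recurrence (which a full knockout presumably would); all names and shapes are illustrative.

```python
import numpy as np

def bpc_from_hidden(h_seq, W_y, targets, knock=None):
    """Bits-per-character of a softmax readout over hidden states,
    optionally with one unit knocked out (zeroed) at the readout."""
    h = h_seq.copy()
    if knock is not None:
        h[:, knock] = 0.0
    logits = h @ W_y.T                                   # (T, V)
    logits -= logits.max(axis=1, keepdims=True)          # stabilize
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets].mean() # nats/char
    return nll / np.log(2)                               # nats -> bits

# Toy usage: saturated random states, random readout and targets.
rng = np.random.default_rng(0)
T, H, V = 200, 128, 256
h = np.tanh(rng.normal(size=(T, H)) * 3)
W_y = rng.normal(size=(V, H)) * 0.1
y = rng.integers(0, V, size=T)
base = bpc_from_hidden(h, W_y, y)
impact = bpc_from_hidden(h, W_y, y, knock=82) - base  # e.g. h82
```

Ranking units by `impact` gives a knockout table like the one above; a negative impact for the bottom units is how pruning to 70 neurons can beat the full model.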
P4: R² = 0.83 Is an Architectural Invariant
The prediction that 2-offset conjunction R² would drop to 0.4–0.6 was spectacularly wrong.
- 0.830: enwik9 R² (predicted 0.4–0.6)
- 128/128: neurons with R² ≥ 0.70 (all of them)
- 100.2%: BPC gain captured by conditional means
Offset Pair Distribution
The dominant pair shifts from (1,7) to (1,8), but offset 1 is universal: every neuron uses it.
Data from factor-weight-scaling.tex.
"The 2-offset conjunction R² of ~0.83 reflects the fraction of variance these conjunctions capture;
the remaining ~17% is higher-order interaction that the architecture cannot avoid."
— factor-weight-scaling.tex
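One way to reproduce a 2-offset conjunction R², assuming the factor map is the conditional-mean predictor over the byte pair at two offsets (consistent with the "captured by conditional means" figure above). The search loop and toy data are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from itertools import combinations

def best_pair_r2(h_i, bytes_, max_offset=8):
    """Best R² over offset pairs for predicting one neuron h_i[t]
    from the conditional mean given (bytes_[t-d1], bytes_[t-d2])."""
    T = len(h_i)
    best = (0.0, None)
    for d1, d2 in combinations(range(1, max_offset + 1), 2):
        t0 = max(d1, d2)
        # composite key for the byte pair at the two offsets
        keys = bytes_[t0 - d1:T - d1] * 256 + bytes_[t0 - d2:T - d2]
        y = h_i[t0:]
        pred = np.zeros_like(y)
        for k in np.unique(keys):            # conditional mean per cell
            m = keys == k
            pred[m] = y[m].mean()
        r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        if r2 > best[0]:
            best = (r2, (d1, d2))
    return best

# Toy check: a neuron that is exactly a function of offsets 1 and 3.
rng = np.random.default_rng(0)
b = rng.integers(0, 4, size=2000)
h_i = np.zeros(2000)
h_i[3:] = np.tanh(b[2:-1] - b[:-3])
r2, pair = best_pair_r2(h_i, b, max_offset=4)
```

On the toy neuron the search recovers the pair (1, 3) with R² at 1; on a real neuron the residual is the higher-order interaction the quote describes.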
P5–P6: Weight Construction Scales
Construction Results: 1024B vs enwik9
The fully analytic construction achieves 4.21 bpc on enwik9 (predicted >5).
Hebbian + optimized Wy reaches 1.90 bpc, far below the trained model's 6.42 on this data.
Data from factor-weight-scaling.tex.
| Metric | 1024B | enwik9 |
| --- | --- | --- |
| r(cov, Wh), all weights | 0.56 | 0.38 |
| r(cov, Wh), \|w\| ≥ 3 | 0.74 | 0.77 |
| Sign accuracy, \|w\| ≥ 3 | — | 93.3% |
| Hebbian, all | 7.44 bpc | 7.44 bpc |
| Hebbian + opt Wy | 1.15 bpc | 1.90 bpc |
| Analytic Wy (zero opt) | 1.89 bpc | 4.21 bpc |
| Trained model | 0.079 bpc | 6.42 bpc |
The analytic construction beats the trained model on early data (4.21 vs 6.42 bpc).
The trained model suffers catastrophic forgetting; the analytic construction captures local statistics
and cannot forget. The loop is closed for a second model.
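The r(cov, Wh) rows of the table can be sketched as the correlation between a Hebbian co-activity estimate and the trained recurrent weights. The exact covariance the experiment uses is an assumption here; this version takes pre/post co-activity across one time step.

```python
import numpy as np

def hebbian_estimate(h_seq):
    """Hebbian estimate of recurrent weights from co-activity:
    W_hat[i, j] ∝ E[h_t[i] * h_{t-1}[j]] (post x pre)."""
    post, pre = h_seq[1:], h_seq[:-1]
    return post.T @ pre / (len(h_seq) - 1)

def alignment(W_hat, W_h, thresh=0.0):
    """Pearson r between estimate and trained weights, restricted to
    entries with |W_h| >= thresh (thresh=3 mirrors the |w| >= 3 row)."""
    m = np.abs(W_h) >= thresh
    a, b = W_hat[m].ravel(), W_h[m].ravel()
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

# Toy usage: random saturated states and a random reference W.
rng = np.random.default_rng(0)
h = np.tanh(rng.normal(size=(50, 8)))
W_hat = hebbian_estimate(h)
W = rng.normal(size=(8, 8))
r_all = alignment(W_hat, W)
```

The finding above is that `r_all` rises sharply once restricted to large weights: the big weights are the Hebbian-aligned ones.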
Word-Length Encoding: 13× Less Catastrophic
Direction Subtraction Cost
Subtracting the word-length direction costs only +0.54 bpc (enwik9) vs +7.3 (1024B).
The enwik9 model distributes information; no single direction matters.
Data from factor-weight-scaling.tex.
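The subtraction test removes one direction from the hidden states before the readout and re-measures bpc. A minimal sketch, assuming the standard orthogonal projection; the +0.54 vs +7.3 bpc gap above is the cost of this operation on the word-length direction.

```python
import numpy as np

def subtract_direction(h_seq, v):
    """Project one direction out of the hidden states:
    h <- h - (h . v_hat) v_hat, with v_hat the unit vector along v."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return h_seq - np.outer(h_seq @ v, v)

# Toy usage: after subtraction the states are orthogonal to v.
rng = np.random.default_rng(0)
h = rng.normal(size=(20, 128))
v = rng.normal(size=128)
h_proj = subtract_direction(h, v)
```

Evaluating the model's readout on `h_proj` instead of `h` gives the "direction subtraction cost": large for a concentrated code, near zero for a distributed one.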
Synthesis
"The central lesson: the 128-hidden tanh RNN imposes a specific computational structure
regardless of training regime. The 2-offset conjunction factor map (R² ≈ 0.83),
the Boolean dynamics, and the Hebbian alignment of large weights are all architectural invariants."
— factor-weight-scaling.tex, Discussion
The architecture determines the structure; the training regime determines the allocation.
What changes between models: deep/concentrated (1024B) vs shallow/distributed (enwik9).
What stays the same: Boolean dynamics, R² ≈ 0.83, weight constructibility.