
Scaling the Total Interpretation

1024 bytes → 10⁹ bytes: seven predictions tested, two kinds of Boolean automaton discovered

Predictions Scorecard

Seven predictions were made before running the experiments. The results:

WRONG · P1: Margins decrease
RIGHT · P2: Distributed neurons
RIGHT · P3: Shallow offsets
WRONG · P4: R² drops
WRONG · P5: Analytic >5 bpc
MIXED · P6: Hebbian improves
RIGHT · P7: E→N→Q holds
The wrong predictions are wrong in the interesting direction: the architecture preserves more structure than expected. R² stays at 0.83, margins increase, analytic construction reaches 4.21 bpc.

Predictions vs Reality

Each prediction (blue) vs actual result (green). P1 (margins) and P4 (R²) are the most spectacularly wrong. Data from scaling-results.tex and factor-weight-scaling.tex.

Q1: The Boolean Automaton Is Structural

Prediction P1 expected margins to decrease from 60.5 into the 5–15 range. Instead, the enwik9 model's mean margin on eval data is 8.56, versus only 1.31 for the 1024B model on the same metric.

8.56 · enwik9 mean margin (6.5× higher)
1.31 · 1024B mean margin (on same eval)
6.4% · enwik9 small margins (vs 52.4%)
105 · enwik9 fan-in (vs 3.5)

Margin Distribution: 1024B vs enwik9

The 1024B model clusters near the boundary (100% in [0,5]). The enwik9 model spreads deep into saturation (tail to [45,50]). Data from scaling-results.tex.
The Boolean regime is not an artifact of overtraining. Both models—trained in opposite regimes (memorization vs generalization)—converge to Boolean dynamics. The enwik9 model is more strongly Boolean. This is a structural property of the 128-hidden tanh RNN.
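
A minimal sketch of how these margin statistics could be computed, assuming the margin of a unit at time t is its absolute pre-activation |a_t| with h_t = tanh(a_t), and that the "small margin" cutoff matches the [0,5] histogram bin; array names are illustrative:

```python
import numpy as np

def margin_stats(pre_acts, small=5.0):
    """Saturation margins of tanh units.

    pre_acts: (T, H) array of pre-activations a_t, where h_t = tanh(a_t).
    A large |a| puts the unit deep in the saturated (Boolean) regime;
    a small |a| leaves it near the linear boundary.
    """
    m = np.abs(pre_acts)
    return {
        "mean_margin": float(m.mean()),           # 8.56 (enwik9) vs 1.31 (1024B)
        "frac_small": float((m < small).mean()),  # 6.4% vs 52.4%
    }
```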

Q2–Q3: Shallow and Distributed

Dominant Offset Distribution

78.1% of neurons at d=1 (enwik9) vs deep d=15–25 (1024B).
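
One hedged operationalization of "dominant offset": for each neuron, the single past-byte position d whose conditional-mean fit explains the most activation variance. The source doesn't spell out its procedure, so this is an assumption, chosen to be consistent with the 2-offset conjunction analysis later on:

```python
import numpy as np

def dominant_offset(acts, xs, max_d=30):
    """Dominant input offset of one neuron.

    acts: (T,) activations of one hidden unit; xs: (T,) input bytes.
    For each lag d, predict acts[t] by the mean activation conditioned
    on the byte xs[t-d]; the dominant offset is the lag with highest R².
    """
    best_d, best_r2 = None, -np.inf
    for d in range(1, max_d + 1):
        y = acts[d:]
        key = xs[:-d].astype(np.int64)
        sums, counts = np.zeros(256), np.zeros(256)
        np.add.at(sums, key, y)
        np.add.at(counts, key, 1)
        pred = sums[key] / counts[key]          # conditional mean per byte
        r2 = 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        if r2 > best_r2:
            best_d, best_r2 = d, r2
    return best_d, best_r2
```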

Neuron Contribution Distribution

Top-1: 10.4% (enwik9) vs 99.7% (1024B). 60 neurons for 97%.
Property | sat-rnn-1024 | sat-rnn-enwik9
Training regime | Memorization (1024 epochs) | Generalization (1 epoch)
BPC | 0.079 | 2.81 (eval: 3.88)
Boolean regime | Yes (margin 1.31) | Yes (margin 8.56)
Offset depth | Deep (d=15–25) | Shallow (d=1–3)
Top-1 neuron | h28 = 99.7% | h82 = 10.4%
Neurons for 97% | 1 | 60
Mean dwell | 3.3 steps | 1.6 steps
Mean flips/step | 31.6 | 49.6
Fan-in | 3.5 | 105.0
Signal location | d=11–20 | d=1–4
Two kinds of Boolean automaton. The 1024B model is deep, sparse, concentrated, slow. The enwik9 model is shallow, dense, distributed, fast. Both are Boolean. The architecture determines the structure; the training regime determines the allocation.
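
The dwell and flip statistics in the table follow directly from the sign-thresholded hidden trajectory; a sketch, assuming the Boolean state of a unit is sign(h):

```python
import numpy as np

def boolean_dynamics(h):
    """Flip rate and dwell time of the sign-thresholded trajectory.

    h: (T, H) array of hidden states; the Boolean state is sign(h).
    """
    s = np.sign(h)
    flips = s[1:] != s[:-1]                       # (T-1, H) flip events
    flips_per_step = flips.sum(axis=1).mean()     # 31.6 (1024B) vs 49.6 (enwik9)
    runs_per_unit = flips.sum(axis=0) + 1         # runs = flips + 1
    mean_dwell = (len(h) / runs_per_unit).mean()  # 3.3 vs 1.6 steps
    return flips_per_step, mean_dwell
```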

What the Neurons Encode

Top 10 Neurons: Knockout Impact (enwik9)

No single neuron dominates. h82 (+0.237), h30 (+0.207), h75 (+0.141)... The model distributes computation across ~60 neurons. Data from scaling-results.tex.
The top-70 model beats the full model (104.9%). Removing the bottom 58 neurons improves performance: 3.68 bpc vs the 3.88 baseline. The effective enwik9 model is a 70-neuron Boolean automaton.
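
A sketch of the knockout protocol behind these numbers: clamp one hidden unit to zero after every update and re-measure bits per character. The `model.recur`/`model.readout` interface is assumed, not the source's actual API:

```python
import numpy as np

def knockout_bpc(model, data, neuron=None):
    """Bits per character with one hidden unit clamped to zero.

    Assumed interface: model.recur(x, h) -> next hidden state,
    model.readout(h) -> next-byte logits. Impact of neuron i is
    knockout_bpc(model, data, i) - knockout_bpc(model, data, None);
    knocking out whole groups (e.g. the bottom 58) works the same way.
    """
    h = np.zeros(model.hidden_size)
    bits = 0.0
    for x, y in zip(data[:-1], data[1:]):
        h = model.recur(x, h)
        if neuron is not None:
            h[neuron] = 0.0                       # the knockout
        logits = model.readout(h)
        m = logits.max()
        logp = logits - m - np.log(np.exp(logits - m).sum())
        bits -= logp[y] / np.log(2.0)
    return bits / (len(data) - 1)
```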

P4: R² = 0.83 Is an Architectural Invariant

The prediction that 2-offset conjunction R² would drop to 0.4–0.6 was spectacularly wrong.

0.830 · enwik9 R² (predicted 0.4–0.6)
0.837 · 1024B R² (within 1%)
128/128 · neurons with R² ≥ 0.70 (all)
100.2% · BPC gain captured by conditional means

Offset Pair Distribution

The dominant pair shifts from (1,7) to (1,8), but offset 1 is universal: every neuron uses it. Data from factor-weight-scaling.tex.
"The 2-offset conjunction R² of ~0.83 reflects the fraction of variance these conjunctions capture; the remaining ~17% is higher-order interaction that the architecture cannot avoid."
— factor-weight-scaling.tex
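
A sketch of the 2-offset conjunction fit behind the R² ≈ 0.83 figure: for a neuron and offset pair (d1, d2), predict its activation by the conditional mean given the byte pair, the best possible function of that conjunction, then maximize R² over pairs. The in-sample evaluation here is an assumption:

```python
import numpy as np

def conjunction_r2(acts, xs, d1, d2):
    """R² of the 2-offset conjunction fit for one neuron.

    acts: (T,) activations of one hidden unit; xs: (T,) input bytes.
    The predictor is the conditional mean of the activation given the
    pair (xs[t-d1], xs[t-d2]); per-neuron R² is the max over (d1, d2).
    """
    t0 = max(d1, d2)
    y = acts[t0:]
    a = xs[t0 - d1:len(xs) - d1].astype(np.int64)
    b = xs[t0 - d2:len(xs) - d2].astype(np.int64)
    key = a * 256 + b                             # one class per byte pair
    sums, counts = np.zeros(256 * 256), np.zeros(256 * 256)
    np.add.at(sums, key, y)
    np.add.at(counts, key, 1)
    pred = sums[key] / counts[key]                # conditional mean per class
    return 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```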

P5–P6: Weight Construction Scales

Construction Results: 1024B vs enwik9

The fully analytic construction achieves 4.21 bpc on enwik9 (predicted >5). Hebbian + optimized Wy reaches 1.90 bpc—far below the trained model's 6.42 on this data. Data from factor-weight-scaling.tex.
Metric | 1024B | enwik9
r(cov, Wh), all weights | 0.56 | 0.38
r(cov, Wh), |w| ≥ 3 | 0.74 | 0.77
Sign accuracy, |w| ≥ 3 | 93.3% |
Hebbian (all) | 7.44 bpc | 7.44 bpc
Hebbian + opt Wy | 1.15 bpc | 1.90 bpc
Analytic Wy (zero opt) | 1.89 bpc | 4.21 bpc
Trained model | 0.079 bpc | 6.42 bpc
The analytic construction beats the trained model on early data (4.21 vs 6.42 bpc): the trained model suffers catastrophic forgetting, while the analytic construction captures local statistics and cannot forget. The loop is closed for a second model.
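
A hedged sketch of the Hebbian comparison: build a candidate recurrent matrix from the time-lagged covariance of hidden states and correlate it with the trained Wh, overall and restricted to large weights. The exact Hebbian rule in factor-weight-scaling.tex may differ; this is one natural reading of r(cov, Wh):

```python
import numpy as np

def hebbian_wh(h):
    """Hebbian candidate for the recurrent matrix.

    Wh_hat[i, j] ~ cov(h_t[i], h_{t-1}[j]): units that co-fire across
    one step get a connection of matching sign.
    """
    post, pre = h[1:] - h[1:].mean(0), h[:-1] - h[:-1].mean(0)
    return post.T @ pre / (len(h) - 1)

def weight_correlation(cov, wh, thresh=None):
    """Pearson r between the Hebbian candidate and the trained Wh,
    optionally restricted to large trained weights (|w| >= thresh).
    Sign agreement on the same mask gives figures like the 93.3% above.
    """
    mask = np.abs(wh) >= thresh if thresh is not None else np.ones(wh.shape, bool)
    return np.corrcoef(cov[mask], wh[mask])[0, 1]
```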

Word-Length Encoding: 13× Less Catastrophic

Direction Subtraction Cost

Subtracting the word-length direction costs only +0.54 bpc (enwik9) vs +7.3 (1024B). The enwik9 model distributes information; no single direction matters. Data from factor-weight-scaling.tex.
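
Direction subtraction itself is a one-liner: project every hidden state off the candidate direction and re-measure bpc. A sketch, where `u` is the (assumed) word-length direction, e.g. from a linear probe:

```python
import numpy as np

def subtract_direction(h, u):
    """Remove one direction from every hidden state: h' = h - (h·u)u.

    h: (T, H) hidden states; u: (H,) candidate direction. Re-evaluating
    bpc on h' measures reliance on that single direction
    (+0.54 bpc for enwik9 vs +7.3 for 1024B above).
    """
    u = u / np.linalg.norm(u)
    return h - np.outer(h @ u, u)
```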

Synthesis

"The central lesson: the 128-hidden tanh RNN imposes a specific computational structure regardless of training regime. The 2-offset conjunction factor map (R² ≈ 0.83), the Boolean dynamics, and the Hebbian alignment of large weights are all architectural invariants."
— factor-weight-scaling.tex, Discussion
The architecture determines the structure; the training regime determines the allocation. What changes between models: deep/concentrated (1024B) vs shallow/distributed (enwik9). What stays the same: Boolean dynamics, R² ≈ 0.83, weight constructibility.

Related

Source: experimental-design.tex · scaling-results.tex · factor-weight-scaling.tex