1. The Model
A vanilla RNN: 128 hidden units, tanh activation, trained on 1024 bytes of enwik9 (Wikipedia XML) by BPTT-50 + Adam for 4000 epochs. 82,304 trainable parameters. Achieves 0.079 bits per character.
The starting point: 82,304 opaque floating-point numbers found by stochastic optimization. What do they encode? Can they be explained? Can they be reconstructed?
2. The RNN Is a Boolean Automaton
The first discovery: the tanh activations are so deeply saturated that the sign bit carries virtually all the computation. The mantissa is not memory — it is noise.
Mean pre-activation margin: 60.5
Neuron-steps with margin > 1.0: 98.9%
Max mantissa perturbation: 4.7×10⁻⁵
Safety factor (margin / perturbation): ~10⁶×
Sign-Only vs Full f32: The Mantissa Is Noise
Sign-only dynamics (replacing tanh with sgn) outperforms full f32 by 0.031 bpc. The mantissa actively degrades prediction by 0.095 bpc through fragile cascade transitions. Data from q1_boolean.c run on the sat-rnn.
Bit Leverage Hierarchy: 300:52:1
Per-bit KL divergence: sign bit = 0.046, each exponent bit = 0.008, each mantissa bit = 0.00015. Pattern ranking ρ = 1.000 at BPTT depth ≥ 11.
Boolean Dynamics: Every Neuron Is Volatile
All 128 neurons classified as "volatile" (> 50 flips over 1023 positions). Top: h20 (667 flips), h88 (653), h36 (633). Zero frozen neurons. Mean dwell: 1.9 steps. Data from q4_saturation.c.
The computation is Boolean: mean margin 60.5 vs max perturbation 4.7×10⁻⁵.
The tanh function is equivalent to sgn for 98.9% of neuron-steps, and sign-only dynamics outperforms full f32.
The mantissa was the ladder that gradient descent climbed; the result is a 128-bit Boolean automaton.
"The mantissa is not memory — it is noise injected at the 0.1% of neuron-steps where the margin is small enough for mantissa perturbation to flip a sign bit, cascading through ~4.6 downstream neurons."
— boolean-automaton.tex
3. Most of the Model Is Noise
The knockout experiments reveal extreme concentration: a handful of neurons carry nearly all the signal.
h8, the king neuron: +0.035 bpc when removed
Neurons needed (out of 128): 20
Best pruned model vs full model: 0.15 bpc better
Neuron Knockout: Individual Impact (Top 30)
Each bar shows the bpc increase when that neuron is removed. h8 alone accounts for 0.035 bpc.
The top 10 neurons (h8, h56, h68, h99, h15, h52, h76, h90, h73, h50) account for the majority of compression.
Data from q3_neurons.c.
Cumulative: Keep Only the Top-k Neurons
Keeping only the top 20 neurons captures 83.2% of the compression gap from uniform (6.66 bpc) to trained (0.079 bpc).
Top 30 = 92.1%. Top 50 = 97.4%. The remaining 78 neurons contribute 2.6%.
Red dashed line: full 128-neuron model (0.079 bpc). Blue dashed line: uniform baseline (6.66 bpc).
What the Top Neurons Predict
h8 (King, +0.035 bpc)
Promotes: '/'(1.3) 'i'(1.3) ' '(0.8) 'c'(0.8) 'd'(0.8)
Demotes: 'a'(-4.1) 'e'(-2.5) 'o'(-1.1)
|Wy| = 6.10, mean|h| = 0.72
h68 (#3, +0.025 bpc)
Promotes: 'm'(1.7) 'i'(1.7) 'e'(1.6) 'n'(1.2)
Demotes: ' '(-2.5) 'k'(-1.8) '>'(-1.6)
|Wy| = 5.83, mean|h| = 0.73
| Configuration | bpc | Δ from full | Parameters |
| Full f32 (baseline) | 4.965 | — | 82,304 |
| Top 20 neurons + Wh prune (>3.0) | 4.811 | −0.154 | 25,857 |
| Top 20 neurons alone | 4.882 | −0.083 | 37,689 |
| Wh prune (>3.0) alone | 4.903 | −0.063 | 49,753 |
Every pruned variant outperforms the full model.
The best redux (20 neurons, 36% of Wh) is 0.15 bpc better than the full model with 31% of the parameters.
Training needed 82k parameters for gradient flow; inference needs 26k.
4. The RNN Uses Deep Offsets
The skip-k-gram analysis showed the data has structure at offsets 1, 8, 20. But the RNN looks deeper — many neurons are dominated by offsets d=14–28.
Dominant Offset Per Neuron (128 neurons)
Histogram of each neuron's dominant offset (the depth at which flipping input causes the most sign changes).
Peak at d=14 (19 neurons, 14.8%). Strong representation at d=17 (12), d=21 (9), d=28 (8).
The MI-greedy offsets [1,3,8,20] capture only 10.3% of the total sign-change signal.
Data from q2_offsets.c.
Output KL Divergence by Depth
How much the output distribution changes when input at depth d is perturbed.
Deep offsets (d=21: KL=4.63, d=28: KL=3.54) have more output impact than shallow ones.
The RNN routes information through long chains in Wh.
The RNN uses depths 14–28 most heavily. This is deeper than the MI-greedy offsets [1,3,8,20] that capture 0.069 bpc. The routing backbone (h54 ← h121 ← h78) carries information from these deep offsets through Wh propagation.
5. Every Prediction Has a Justification
Each prediction can be traced backward through ~5 neurons × ~3 Wh hops = ~15 weight entries (0.1% of Wh). Here are real worked examples from q6_justify.c.
Position t=10: Predicting after "<mediawiki " → ?
<mediawiki x
P('x') = 0.996, bpc = 0.005
h52 (sign=−1, Δbpc=+0.069): z=−1.2, driven by input ' ' via Wx=−1.7
→ Top Wh source: h8 (+1.0) at t−1, driven by input 'i'
h90 (sign=+1, Δbpc=+0.059): z=4.1, top Wh source: h8 (+0.8)
h99 (sign=+1, Δbpc=+0.051): top Wh source: h68 (−0.9)
Position t=50: Predicting after "...wiki.org/" → ?
...mediawiki.org/r (expecting "xml/export")
P('r') = 0.980, bpc = 0.030
h99 (sign=+1, Δbpc=+2.027): z=1.9, CRITICAL — flipping would add 2 bpc
→ Top Wh: h68 (+1.0) at t−1, driven by input 'p' via h26(+0.7)
h76 (sign=−1, Δbpc=+0.533): top source: h50 (−1.3)
h26 (sign=+1, Δbpc=+0.166): top source: h61 (+1.2)
Position t=100: Predicting after "...XMLSche" → ?
...XMLSchem
P('m') = 0.998, bpc = 0.004
h68 (sign=+1, Δbpc=+0.202): z=1.7, top source: h26 (−0.7) from h61 (+1.0) via input 'h'
h52 (sign=−1, Δbpc=+0.132): top source: h8 (−1.2), self-loop h8←h8 (−0.9)
h16 (sign=+1, Δbpc=+0.110): top source: h76 (+0.6)
Each prediction traces to ~15 weights. The routing backbone
(h8, h68, h99, h52, h76, h26, h61, h50) appears across all examples. h8 is the hub —
it appears as a Wh source in virtually every justification chain.
6. The Weights Are a Function of Data Statistics
If the RNN is really counting events, its weights should be predictable from event co-occurrence counts. They are.
Weight Prediction: Pearson Correlation with Trained Values
Hebbian covariance cov(hj(t), hi(t+1)) predicts Wh at r = 0.56 for dynamically important entries. Sign accuracy: 72.7%. bh from sign log-odds: r = 0.58. Wy from sign-split log-probabilities: r = 0.54. Data from write_weights.c.
Substitution: Replacing Trained Weights
Effect of Data-Derived Weight Substitution
Three substitutions improve the trained model: the 50% Hebbian Wy blend (−0.658 bpc), the 80/20 Hebbian Wh blend (−0.046), and the Hebbian bh (−0.011). The trained readout is worse than its data-derived correction.
"Mixing 50% Hebbian Wy with 50% trained Wy yields 4.31 bpc, 0.66 better than the trained model. The trained Wy is over-optimized for the training dynamics and slightly miscalibrated for the Boolean dynamics that actually matter."
— narrative.tex
7. The Loop Closes: All 82k Parameters from Data
The strongest test: build a working model from data statistics alone, with zero gradient descent.
Wx: Deterministic Hash
128 neurons split into 16 groups of 8. Group 0 encodes the current byte via hash. Deterministic.
↓
Wh: Shift Register
Group g carries hash of g steps ago. Diagonal copy with weight 5.0. 100% encoding accuracy.
↓
bh: Sign Log-Odds
Bias = log(fraction positive) − log(fraction negative). From data counts.
↓
Wy: Skip-Bigram Log-Ratios
Wy[o][j] = s · (log P(o | hj > 0) − log P(o | hj < 0)). All from counting.
↓
by: Byte Marginals
by[o] = log c(o) − const. The count of each output byte.
Analytic construction (zero optimization): 1.89 bpc
Optimized Wy (1000 epochs SGD): 0.59 bpc
Trained model (~2M gradient steps): 4.97 bpc
All 82,304 parameters from data counts
The Optimization Continuum: From Counting to Construction
Every point uses shift-register dynamics (Wx, Wh, bh from data). Only Wy varies. The trained model (red dashed) uses ~2M gradient steps on all 82k parameters yet is beaten by the closed-form solution with zero optimization. Data from write_weights12.c.
Generalization: Analytic vs Trained on Unseen Data
On the test half (unseen during Wy construction), the analytic model achieves 4.88 bpc
vs the trained model's 5.08 — within 0.2 bpc. The analytic model captures the
same statistical structure that BPTT discovers.
The fully analytic construction beats the trained model by 3.08 bpc
on the full data with zero optimization. All 82,304 parameters are determined by data statistics:
Wx and Wh by construction (hash + shift register), Wy by skip-bigram
log-ratios, by by byte marginals. The loop is closed.
The sparse 26k-parameter redux cannot train from scratch.
Random initialization → 7.74 bpc after 50 epochs (barely below uniform 8.0). The full 82k architecture
reaches 5.16 bpc. Gradient flow requires the dense Wh as scaffolding. Training needs 82k parameters;
inference needs 26k; construction needs 0.
Full-Bandwidth Carrier Signal: Architecture Scaling
Doubling hidden size (256 = 16×16) gives 253/256 distinct patterns and drops the closed-form PI from 1.41 to 0.44 bpc. The hybrid (semantic + hash) achieves 0.40 bpc, approaching SGD territory with zero optimization. Data from write_weights14.c.
8. The Bi-Embedding: Events ↔ Numbers
The mathematical structure that makes the loop close: every weight is simultaneously a number (from counting events) and an event description (encoding a relationship).
E → N (Counting)
Each event gets a count in the dataset. The count is a natural number. The log-count is the SN strength. Shannon entropy = average log-luck.
In the RNN: Wh[k,j] ≈ scale · cov(hj(t), hk(t+1)). The weight IS the co-occurrence count.
N → E (Construction)
Each weight value encodes an event relationship. Given the numbers, reconstruct the events they describe. The factor map does exactly this.
In the RNN: R² = 0.837 (120/128 neurons). Each neuron is a 2-offset conjunction detector.
| Direction | Method | Quality Measure | Result |
| E → N (counting → weights) | Hebbian covariance | Correlation with Wh | r = 0.56 |
| E → N (counting → model) | Analytic construction | BPC (zero optimization) | 1.89 bpc |
| N → E (weights → events) | Factor map | R² per neuron | 0.837 mean |
| N → E (weights → events) | Attribution chains | PMI alignment | 74% overall |
| φ ∘ ψ (round trip) | Count → build → run | Test bpc gap | 0.2 bpc |
"The bi-embedding is the content of the twelve-day arc. Neither direction is exact. The gap is the higher-order structure: patterns involving 3+ events, cross-offset synergies (> 1.0 bits between offset pairs), and the non-linear interactions that BPTT captures but first-order counting misses. But the bi-embedding is approximately invertible, which is why the RNN can be cracked open."
— e-onto-n.tex
9. The Entropy Identity
This is not an analogy. The UM's forward pass IS a thermodynamic partition function. The SN strength IS the Boltzmann entropy.
H = log2 N − ⟨SB⟩
Shannon entropy = log of total positions − expected Boltzmann entropy of macrostates
The Factoring Hierarchy: From No Description to Full Specification
Each level represents a different factoring of the event space — a choice of which event spaces to observe.
More event spaces = more macrostates = less residual entropy = better compression.
The UM's learning problem: find the factoring that minimizes bpc with the fewest macrostate variables.
| Factoring Level | Macrostates | Avg log2 W | Residual (bpc) |
| Uniform (no factoring) | 1 | 10.0 | 8.00 |
| Unigram (input char) | 52 | 5.26 | 4.74 |
| Bigram (input + prev) | 231 | 7.22 | ~2.8 |
| sat-rnn (128 neurons) | ~1000 | 9.92 | 0.079 |
| Skip-8 UM (834 patterns) | ~834 | 9.96 | 0.043 |
| Full (all ESes) | 1024 | 10.0 | 0.000 |
The skip-8 UM beats the sat-rnn with fewer macrostates (834 vs ~1000) because its macrostates are more informative. Compression is factoring: finding the description that absorbs the most of log2 N into the Boltzmann term.
10. The Full Evidence Chain
Day 1 (Jan 31): Training
82,304 opaque parameters → 0.079 bpc
↓
Days 2–4 (Feb 1–4): UM Isomorphism
Every RNN prediction has a UM pattern witness. 10⁻⁶ bpc error.
↓
Days 7–8 (Feb 7–8): Pattern Discovery
6,180 patterns at order 12. Skip-k-grams. 0.043 bpc at skip-8.
↓
Days 9–11 (Feb 9–11): Boolean Automaton
98.9% Boolean. 20 neurons suffice. Mantissa is noise.
↓
Day 11: Attribution Chains
~15 weights per prediction. 74% PMI alignment. Routing backbone h8 ← h68 ← h99.
↓
Day 11: Weight Construction
ALL 82k parameters from data counts. 1.89 bpc, ZERO optimization. Beats trained by 3.08 bpc.
The loop closes.
"The weights are not arbitrary parameters found by stochastic optimization. They are a noisy encoding of the data's covariance structure, filtered through the f32 quotient and the BPTT optimization landscape. The mantissa was the ladder. The result is counting."
— narrative.tex, final passage
Detailed Experiment Pages