1. The Model
A vanilla RNN: 128 hidden units, tanh activation, trained on 1024 bytes of enwik9 (Wikipedia XML) by BPTT-50 + Adam for 4000 epochs. 82,304 trainable parameters. Achieves 0.079 bits per character.
The starting point: 82,304 opaque floating-point numbers found by stochastic optimization. What do they encode? Can they be explained? Can they be reconstructed?
2. The RNN Is a Boolean Automaton
The first discovery: the tanh activations are so deeply saturated that the sign bit carries virtually all the computation. The mantissa is not memory — it is noise.
Mean pre-activation margin: 60.5
Neuron-steps with margin > 1.0: 98.9%
Max mantissa perturbation: 4.7×10⁻⁵
Safety factor (margin / perturbation): ~10⁶×
Sign-Only vs Full f32: The Mantissa Is Noise
Sign-only dynamics (replacing tanh with sgn) outperforms full f32 by 0.031 bpc. The mantissa actively degrades prediction by 0.095 bpc through fragile cascade transitions. Data from q1_boolean.c run on the sat-rnn.
Bit Leverage Hierarchy: 300:52:1
Per-bit KL divergence: sign bit = 0.046, each exponent bit = 0.008, each mantissa bit = 0.00015. Pattern ranking ρ = 1.000 at BPTT depth ≥ 11.
Boolean Dynamics: Every Neuron Is Volatile
All 128 neurons classified as "volatile" (> 50 flips over 1023 positions). Top: h20 (667 flips), h88 (653), h36 (633). Zero frozen neurons. Mean dwell: 1.9 steps. Data from q4_saturation.c.
The computation is Boolean: mean margin 60.5 vs max perturbation 4.7×10⁻⁵.
The tanh function is equivalent to sgn for 98.9% of neuron-steps, and sign-only dynamics outperforms full f32.
The mantissa was the ladder that gradient descent climbed; the result is a 128-bit Boolean automaton.
"The mantissa is not memory — it is noise injected at the 0.1% of neuron-steps where the margin is small enough for mantissa perturbation to flip a sign bit, cascading through ~4.6 downstream neurons."
— boolean-automaton.tex
3. Most of the Model Is Noise
The knockout experiments reveal extreme concentration: a handful of neurons carry nearly all the signal.
h8, the king neuron: +0.035 bpc when removed
Neurons needed (out of 128): 20
Best pruned model vs full model: 0.15 bpc better
Neuron Knockout: Individual Impact (Top 30)
Each bar shows the bpc increase when that neuron is removed. h8 alone accounts for 0.035 bpc.
The top 10 neurons (h8, h56, h68, h99, h15, h52, h76, h90, h73, h50) account for the majority of compression.
Data from q3_neurons.c.
Cumulative: Keep Only the Top-k Neurons
Keeping only the top 20 neurons captures 83.2% of the compression gap from uniform (6.66 bpc) to trained (0.079 bpc).
Top 30 = 92.1%. Top 50 = 97.4%. The remaining 78 neurons contribute 2.6%.
Red dashed line: full 128-neuron model (0.079 bpc). Blue dashed line: uniform baseline (6.66 bpc).
What the Top Neurons Predict
h8 (King, +0.035 bpc)
Promotes: '/'(1.3) 'i'(1.3) ' '(0.8) 'c'(0.8) 'd'(0.8)
Demotes: 'a'(-4.1) 'e'(-2.5) 'o'(-1.1)
|Wy| = 6.10, mean|h| = 0.72
h68 (#3, +0.025 bpc)
Promotes: 'm'(1.7) 'i'(1.7) 'e'(1.6) 'n'(1.2)
Demotes: ' '(-2.5) 'k'(-1.8) '>'(-1.6)
|Wy| = 5.83, mean|h| = 0.73
| Configuration | bpc | Δ from full | Parameters |
| Full f32 (baseline) | 4.965 | — | 82,304 |
| Top 20 neurons + Wh prune (>3.0) | 4.811 | −0.154 | 25,857 |
| Top 20 neurons alone | 4.882 | −0.083 | 37,689 |
| Wh prune (>3.0) alone | 4.903 | −0.063 | 49,753 |
Every pruned variant outperforms the full model.
The best redux (20 neurons, 36% of Wh) is 0.15 bpc better than the full model with 31% of the parameters.
Training needed 82k parameters for gradient flow; inference needs 26k.
4. The RNN Uses Deep Offsets
The skip-k-gram analysis showed the data has structure at offsets 1, 8, 20. But the RNN looks deeper — many neurons are dominated by offsets d=14–28.
Dominant Offset Per Neuron (128 neurons)
Histogram of each neuron's dominant offset (the depth at which flipping input causes the most sign changes).
Peak at d=14 (19 neurons, 14.8%). Strong representation at d=17 (12), d=21 (9), d=28 (8).
The MI-greedy offsets [1,3,8,20] capture only 10.3% of the total sign-change signal.
Data from q2_offsets.c.
Output KL Divergence by Depth
How much the output distribution changes when input at depth d is perturbed.
Deep offsets (d=21: KL=4.63, d=28: KL=3.54) have more output impact than shallow ones.
The RNN routes information through long chains in Wh.
The RNN uses depths 14–28 most heavily. This is deeper than the MI-greedy offsets [1,3,8,20] that capture 0.069 bpc. The routing backbone (h54 ← h121 ← h78) carries information from these deep offsets through Wh propagation.
5. Every Prediction Has a Justification
Each prediction can be traced backward through ~5 neurons × ~3 Wh hops = ~15 weight entries (0.1% of Wh). Here are real worked examples from q6_justify.c.
Position t=10: Predicting after "<mediawiki " → ?
<mediawiki x
P('x') = 0.996, bpc = 0.005
h52 (sign=−1, Δbpc=+0.069): z=−1.2, driven by input ' ' via Wx=−1.7
→ Top Wh source: h8 (+1.0) at t−1, driven by input 'i'
h90 (sign=+1, Δbpc=+0.059): z=4.1, top Wh source: h8 (+0.8)
h99 (sign=+1, Δbpc=+0.051): top Wh source: h68 (−0.9)
Position t=50: Predicting after "...wiki.org/" → ?
...mediawiki.org/r (expecting "xml/export")
P('r') = 0.980, bpc = 0.030
h99 (sign=+1, Δbpc=+2.027): z=1.9, CRITICAL — flipping would add 2 bpc
→ Top Wh: h68 (+1.0) at t−1, driven by input 'p' via h26(+0.7)
h76 (sign=−1, Δbpc=+0.533): top source: h50 (−1.3)
h26 (sign=+1, Δbpc=+0.166): top source: h61 (+1.2)
Position t=100: Predicting after "...XMLSche" → ?
...XMLSchem
P('m') = 0.998, bpc = 0.004
h68 (sign=+1, Δbpc=+0.202): z=1.7, top source: h26 (−0.7) from h61 (+1.0) via input 'h'
h52 (sign=−1, Δbpc=+0.132): top source: h8 (−1.2), self-loop h8←h8 (−0.9)
h16 (sign=+1, Δbpc=+0.110): top source: h76 (+0.6)
Each prediction traces to ~15 weights. The routing backbone
(h8, h68, h99, h52, h76, h26, h61, h50) appears across all examples. h8 is the hub —
it appears as a Wh source in virtually every justification chain.
6. The Weights Are a Function of Data Statistics
If the RNN is really counting events, its weights should be predictable from event co-occurrence counts. They are.
Weight Prediction: Pearson Correlation with Trained Values
Hebbian covariance cov(hj(t), hi(t+1)) predicts Wh at r = 0.56 for dynamically important entries. Sign accuracy: 72.7%. bh from sign log-odds: r = 0.58. Wy from sign-split log-probabilities: r = 0.54. Data from write_weights.c.
Substitution: Replacing Trained Weights
Effect of Data-Derived Weight Substitution
Three substitutions improve the trained model: the 50% Hebbian Wy blend (−0.658 bpc), the 80/20 Hebbian Wh blend (−0.046), and the Hebbian bh (−0.011). The trained readout is worse than its data-derived correction.
"Mixing 50% Hebbian Wy with 50% trained Wy yields 4.31 bpc, 0.66 better than the trained model. The trained Wy is over-optimized for the training dynamics and slightly miscalibrated for the Boolean dynamics that actually matter."
— narrative.tex
7. The Loop Closes: All 82k Parameters from Data
The strongest test: build a working model from data statistics alone, with zero gradient descent.
Wx: Deterministic Hash
128 neurons split into 16 groups of 8. Group 0 encodes the current byte via hash. Deterministic.
↓
Wh: Shift Register
Group g carries hash of g steps ago. Diagonal copy with weight 5.0. 100% encoding accuracy.
↓
bh: Sign Log-Odds
Bias = log(fraction positive) − log(fraction negative). From data counts.
↓
Wy: Skip-Bigram Log-Ratios
Wy[o][j] = s · (log P(o | hj > 0) − log P(o | hj < 0)). All from counting.
↓
by: Byte Marginals
by[o] = log c(o) − const. The count of each output byte.
Analytic construction (zero optimization): 1.89 bpc
Optimized Wy (1000 epochs SGD): 0.59 bpc
Trained model (~2M gradient steps): 4.97 bpc
All 82,304 parameters from data counts
The Optimization Continuum: From Counting to Construction
Every point uses shift-register dynamics (Wx, Wh, bh from data). Only Wy varies. The trained model (red dashed) uses ~2M gradient steps on all 82k parameters yet is beaten by the closed-form solution with zero optimization. Data from write_weights12.c.
Generalization: Analytic vs Trained on Unseen Data
On the test half (unseen during Wy construction), the analytic model achieves 4.88 bpc
vs the trained model's 5.08 — within 0.2 bpc. The analytic model captures the
same statistical structure that BPTT discovers.
The fully analytic construction beats the trained model by 3.08 bpc
on the full data with zero optimization. All 82,304 parameters are determined by data statistics:
Wx and Wh by construction (hash + shift register), Wy by skip-bigram
log-ratios, by by byte marginals. The loop is closed.
The sparse 26k-parameter redux cannot train from scratch.
Random initialization → 7.74 bpc after 50 epochs (barely below uniform 8.0). The full 82k architecture
reaches 5.16 bpc. Gradient flow requires the dense Wh as scaffolding. Training needs 82k parameters;
inference needs 26k; construction needs 0.
Full-Bandwidth Carrier Signal: Architecture Scaling
Doubling hidden size (256 = 16×16) gives 253/256 distinct patterns and drops the closed-form PI from 1.41 to 0.44 bpc. The hybrid (semantic + hash) achieves 0.40 bpc, approaching SGD territory with zero optimization. Data from write_weights14.c.
8. The Bi-Embedding: Events ↔ Numbers
The mathematical structure that makes the loop close: every weight is simultaneously a number (from counting events) and an event description (encoding a relationship).
E → N (Counting)
Each event gets a count in the dataset. The count is a natural number. The log-count is the SN strength. Shannon entropy = average log-luck.
In the RNN: Wh[k,j] ≈ scale · cov(hj(t), hk(t+1)). The weight IS the co-occurrence count.
N → E (Construction)
Each weight value encodes an event relationship. Given the numbers, reconstruct the events they describe. The factor map does exactly this.
In the RNN: R² = 0.837 (120/128 neurons). Each neuron is a 2-offset conjunction detector.
| Direction | Method | Quality Measure | Result |
| E → N (counting → weights) | Hebbian covariance | Correlation with Wh | r = 0.56 |
| E → N (counting → model) | Analytic construction | BPC (zero optimization) | 1.89 bpc |
| N → E (weights → events) | Factor map | R² per neuron | 0.837 mean |
| N → E (weights → events) | Attribution chains | PMI alignment | 74% overall |
| φ ∘ ψ (round trip) | Count → build → run | Test bpc gap | 0.2 bpc |
"The bi-embedding is the content of the twelve-day arc. Neither direction is exact. The gap is the higher-order structure: patterns involving 3+ events, cross-offset synergies (> 1.0 bits between offset pairs), and the non-linear interactions that BPTT captures but first-order counting misses. But the bi-embedding is approximately invertible, which is why the RNN can be cracked open."
— e-onto-n.tex
9. The Entropy Identity
This is not an analogy. The UM's forward pass IS a thermodynamic partition function. The SN strength IS the Boltzmann entropy.
H = log2 N − ⟨SB⟩
Shannon entropy = log of total positions − expected Boltzmann entropy of macrostates
The Factoring Hierarchy: From No Description to Full Specification
Each level represents a different factoring of the event space — a choice of which event spaces to observe.
More event spaces = more macrostates = less residual entropy = better compression.
The UM's learning problem: find the factoring that minimizes bpc with the fewest macrostate variables.
| Factoring Level | Macrostates | Avg log2 W | Residual (bpc) |
| Uniform (no factoring) | 1 | 10.0 | 8.00 |
| Unigram (input char) | 52 | 5.26 | 4.74 |
| Bigram (input + prev) | 231 | 7.22 | ~2.8 |
| sat-rnn (128 neurons) | ~1000 | 9.92 | 0.079 |
| Skip-8 UM (834 patterns) | ~834 | 9.96 | 0.043 |
| Full (all ESes) | 1024 | 10.0 | 0.000 |
The skip-8 UM beats the sat-rnn with fewer macrostates (834 vs ~1000) because its macrostates are more informative. Compression is factoring: finding the description that absorbs the most of log2 N into the Boltzmann term.
10. The Full Evidence Chain
Day 1 (Jan 31): Training
82,304 opaque parameters → 0.079 bpc
↓
Days 2–4 (Feb 1–4): UM Isomorphism
Every RNN prediction has a UM pattern witness. 10⁻⁶ bpc error.
↓
Days 7–8 (Feb 7–8): Pattern Discovery
6,180 patterns at order 12. Skip-k-grams. 0.043 bpc at skip-8.
↓
Days 9–11 (Feb 9–11): Boolean Automaton
98.9% Boolean. 20 neurons suffice. Mantissa is noise.
↓
Day 11: Attribution Chains
~15 weights per prediction. 74% PMI alignment. Routing backbone h8 ← h68 ← h99.
↓
Day 11: Weight Construction
ALL 82k parameters from data counts. 1.89 bpc, ZERO optimization. Beats trained by 3.08 bpc.
The loop closes.
"The weights are not arbitrary parameters found by stochastic optimization. They are a noisy encoding of the data's covariance structure, filtered through the f32 quotient and the BPTT optimization landscape. The mantissa was the ladder. The result is counting."
— narrative.tex, final passage
Detailed Experiment Pages