Seven Questions

A complete interpretation of the sat-rnn through seven experimental questions (Q1–Q7)

Overview

Seven questions, each answered by running C programs against the trained model (128 hidden units, 0.079 bpc on 1024 bytes of enwik9). All data below is from real experimental runs.

Q1 (Sparsity): 300:52:1. Bit leverage hierarchy, sign : exponent : mantissa. Pattern ranking ρ = 1.000 at depth ≥ 11.
Q2 (Offsets): d=14 peak. 19 neurons dominated by d=14. MI-greedy [1,3,8,20] captures only 10.3% of the signal.
Q3 (Neurons): h8 = King. h8 alone: +0.035 bpc impact. Top 20 neurons = 83% of compression.
Q4 (Saturation): 128/128. All neurons volatile (>50 flips). Mean dwell 1.9 steps. Zero frozen.
Q5 (Redux): 20 + 36%. 20 neurons + 36% of Wh = 4.81 bpc (0.15 BETTER than the full 82k model).
Q6 (Justification): ~15 weights. Each prediction: ~5 neurons × ~3 backward steps. h8 in every chain.
Q7 (Algebra): 74% aligned. RNN attribution matches data PMI at shallow offsets (88%), diverges at depth (24–37%).
Q1
How Sparse Is the Explanation?
Answer: 300:52:1 bit leverage. The sign bit IS the computation.

We compare f32 and exact (MPFR-256) arithmetic to measure how much each bit of the floating-point representation contributes to the model's predictions.

Sign bit KL (1 bit): 0.046
Per exponent bit KL (8 bits): 0.008
Per mantissa bit KL (23 bits): 0.00015
Lyapunov exponent (chaos at depth): 3.44

Per-Bit KL Divergence

The sign bit carries roughly 300× the information of a single mantissa bit. Per-bit KL ratio: sign (0.046) : exponent bit (0.008) : mantissa bit (0.00015) ≈ 300:52:1. Source: q1_bit_sample.c

Sign-Only vs Full f32

Sign-only dynamics OUTPERFORMS full f32 by 0.031 bpc. The mantissa degrades prediction. Source: q1_boolean.c
Key finding: The gradient decorrelates at BPTT depth 1 between f32 and exact arithmetic, but pattern ranking ρ = 1.000 at depth ≥ 11. The f32 quotient costs 0.071 bpc. The mantissa matters for optimization but is noise for inference.
Q2
Which Offsets Does the RNN Use?
Answer: Deep offsets d=14–28. Not the MI-greedy [1,3,8,20].

Neurons per Dominant Offset

The peak is at d=14 (19 neurons), with strong counts at d=17 (12), d=21 (9), and d=28 (8). The MI-greedy offsets [1,3,8,20], which the pattern-chain analysis found optimal, capture only 10.3% of the RNN's actual sign-change signal. The RNN routes information through deeper Wh chains. Source: q2_offsets.c

Output KL by Depth: How Much Does Each Offset Matter?

Depth d=21 stands out: KL = 4.63 bits. Deep offsets d=28 (3.54), d=27 (3.10) also dominate. The RNN has learned to propagate information through long Wh chains, accessing structure at depths 15–28 that skip-k-gram models can only reach by including more offsets.
The RNN is deeper than it looks. With 128 neurons and dense Wh, information from offset d=28 reaches the output through chains like h8 ← h8 (−1.1) ← h15 (+0.7). The factor-map pair (1,7) captures only 4.8% of the signal.
Q3
Which Neurons Carry the Signal?
Answer: h8 is king (+0.035 bpc). Top 20 = 83%. 108 neurons are noise.

Individual Neuron Impact (Top 30 by Knockout Δbpc)

Each bar: the increase in bpc when that neuron's Wy column is zeroed. h8 (+0.035) > h56 (+0.025) > h68 (+0.025) > h99 (+0.020) > h15 (+0.020). The top 10 account for 0.218 bpc; the bottom 98 account for 0.066. Source: q3_neurons.c

Cumulative Compression: Keep Only Top-k

Keeping only the top-k neurons (zeroing the rest) and measuring bpc. 20 neurons: 83.2% of the compression gap. 30: 92.1%. 50: 97.4%. 120 neurons: 100.0% (actually 0.1% better than 128, because the last 8 add noise).

What the Top 10 Neurons Predict (Wy Profile)

Neuron  Δbpc    |Wy|  Promotes               Demotes
h8      +0.035  6.10  '/' 'i' ' ' 'c' 'd'    'a' 'e' 'o' 'p' ':'
h56     +0.025  4.43  ' ' 'y' 's' 'd' 'l'    'a' 'c' 'n' 'p' 'e'
h68     +0.025  5.83  'm' 'i' 'e' 'n' 'p'    ' ' 'k' '>' 'h' '.'
h99     +0.020  5.02  '>' 'a' 'r' 'd' 'U'    '/' 'i' 'e' 't' ' '
h15     +0.020  5.19  's' '/' 'g' 'l' '"'    'k' ' ' 'h' 't' '.'
h52     +0.018  5.83  'e' 'i' 'a' '.' 'p'    'k' '<' '/' 'n' '"'
h76     +0.017  4.56  's' 'l' 'm' 'd' 'c'    ' ' 'e' 'n' '<' '/'
h90     +0.015  4.69  'a' 'k' '"' 'h' '='    'M' 'm' '.' 'p' 's'
h73     +0.015  4.32  'a' '"' ' ' 'n' '>'    'r' 't' 'k' 'M' 'l'
h50     +0.013  3.29  'e' '"' 'k' '<' '0'    'a' 's' ' ' 'c' '>'
Q4
What Is the Saturation Structure?
Answer: All 128 neurons are volatile. Zero frozen. Mean dwell 1.9 steps.
Neurons classified as "volatile": 128/128
Frozen neurons (zero flips): 0
Mean dwell time (steps between flips): 1.9
Top co-flip Jaccard (h3, h36): 0.650

Dwell Time Distribution

How long neurons stay in one sign before flipping. Mode = 1 step (25,672 occurrences). Exponential decay. Most neurons flip every 1–3 steps. Source: q4_saturation.c

Top Co-Flip Pairs (Jaccard > 0.45)

Neurons that flip at the same position share features. h3&h36: Jaccard 0.650 (494 co-flips). h97&h117: 0.598. h44&h69: 0.592. These are functional clusters encoding correlated input features.
The snapshot is not the dynamics. All 128 neurons are volatile — none stay fixed. The hidden state is fully mixed at each step: 51.2 sign changes per step on average. Co-flip groups reveal functional clusters.
Q5
How Small Can the Model Be?
Answer: 20 neurons + 36% of Wh = 0.15 bpc BETTER than full.

Model Size vs Performance

Every pruned configuration outperforms the full model. The sweet spot: 20 neurons + Wh entries > 3.0 = 25,857 parameters at 4.81 bpc. The full 82k model: 4.97 bpc. Removing noise improves prediction.
Configuration       bpc    Δ       Parameters
Full f32            4.965          82,304
Top 20 + Wh > 3.0   4.811  −0.154  25,857
Top 20 neurons      4.882  −0.083  37,689
Top 15 + Wh > 3.0   4.879  −0.086  22,017
Wh prune (>3.0)     4.903  −0.063  49,753
Training needed 82k; inference needs 26k; construction needs 0. The dense Wh is scaffolding for gradient flow. Training the sparse 26k redux from scratch fails: 7.74 bpc after 50 epochs.
Q6
Can Each Prediction Be Justified?
Answer: Yes. ~5 neurons × ~3 backward hops = ~15 weight entries per prediction.

The Routing Backbone

h8 (hub): appears in every justification chain. Self-loop h8←h8 (−1.1); top Wh source for h52, h90, h56.
h68 (secondary hub): source for h99 (+1.0), h52 (−1.2); driven by h26, h90.
h99 (critical predictor): flipping costs +2.027 bpc at t=50; sources: h68 (+1.0), h17 (+0.8).
h52 (context encoder): driven by h8 (+1.3), h56 (+0.8); encodes vowel/consonant structure.
h76 (relay): source h50 (−1.3); main signal path for 's','l','m','d','c'.
"Each prediction traces to ~5 neurons × ~3 backward steps = 15 weights. The routing backbone h8 ← h68 ← h99 carries the plurality of prediction-relevant information."
— synthesis.tex
Q7
Do RNN Attributions Match Data Statistics?
Answer: 74% overall. 88% at shallow offsets, 24–37% at depth.

RNN Attribution vs Data PMI Alignment by Offset

At shallow offsets (d=1–4), the RNN's per-neuron attributions match the data's pairwise mutual information at 88%. At d=5–8: 87–96%. Beyond d=15, alignment drops to 24–37% as the RNN develops higher-order patterns that pairwise PMI cannot capture. Source: q7_algebraic.c
The bi-embedding is tighter at short range and loosens at long range. The 26% overall gap is higher-order structure: patterns over three or more events and cross-offset synergies (> 1.0 bits between offset pairs) that pairwise statistics miss.

The Complete Picture

"The weights are not arbitrary parameters found by stochastic optimization. They are a noisy encoding of the data's covariance structure, filtered through the f32 quotient and the BPTT optimization landscape."
— narrative.tex
Question           Key Number        What It Proves
Q1: Sparsity       300:52:1          The computation is Boolean (sign bits only)
Q2: Offsets        d=14 peak         Information routes through deep Wh chains
Q3: Neurons        h8 = king         Extreme concentration; most neurons are noise
Q4: Saturation     128/128 volatile  All neurons flip; the dynamics is fully mixed
Q5: Redux          20 + 36%          82k params for training, 26k for inference
Q6: Justification  ~15 weights       Every prediction is human-readable
Q7: Algebra        74% aligned       Weights are a function of data PMI