Overview
Seven questions, each answered by running C programs against the trained model (128 hidden units, 0.079 bpc on 1024 bytes of enwik9). All data below is from real experimental runs.
| Question | Key Number | Summary |
| Q1: Sparsity | 300:52:1 | Bit leverage hierarchy (sign:exponent:mantissa). Pattern ranking ρ = 1.000 at depth ≥ 11. |
| Q2: Offsets | d=14 peak | 19 neurons dominated by d=14. MI-greedy [1,3,8,20] captures only 10.3% of signal. |
| Q3: Neurons | h8 = king | h8 alone: +0.035 bpc impact. Top 20 neurons = 83% of compression. |
| Q4: Saturation | 128/128 | All neurons volatile (>50 flips). Mean dwell 1.9 steps. Zero frozen. |
| Q5: Redux | 20 + 36% | 20 neurons + 36% of Wh = 4.81 bpc (0.15 better than the full 82k model). |
| Q6: Justification | ~15 weights | Each prediction: ~5 neurons × ~3 backward steps. h8 in every chain. |
| Q7: Algebra | 74% aligned | RNN attribution matches data PMI at shallow offsets (88%), diverges at depth (24–37%). |
We compare f32 and exact (MPFR-256) arithmetic to measure how much each bit of the floating-point representation contributes to the model's predictions.
Per-exponent-bit KL (8 bits): 0.008
Per-mantissa-bit KL (23 bits): 0.00015
Lyapunov exponent (chaos at depth): 3.44
Per-Bit KL Divergence
The sign bit carries 300× the information of a mantissa bit. Per-bit ratio: sign (0.046) : exponent (0.008) : mantissa (0.00015) ≈ 300:52:1. Totals per field: sign 0.046, exponent 8×0.008 = 0.064, mantissa 23×0.00015 = 0.0035. Source: q1_bit_sample.c
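As a minimal sketch of the perturbation behind these numbers (the KL measurement itself requires the trained model and is not reproduced here), flipping an individual bit of an f32 weight is a one-line XOR on the binary32 representation:

```c
#include <stdint.h>
#include <string.h>

/* Flip one bit of an IEEE-754 single: bit 31 is the sign, bits 30..23
 * the exponent, bits 22..0 the mantissa.  This is the perturbation used
 * to price each bit; the KL is measured downstream on model outputs. */
static float flip_bit(float x, int k) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);     /* bit-cast without aliasing UB */
    u ^= (uint32_t)1u << k;
    memcpy(&x, &u, sizeof u);
    return x;
}
```

Flipping the sign bit negates the value, flipping the exponent LSB halves or doubles it, and flipping the top mantissa bit nudges it by 50% at most, which is why the three fields price so differently.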
Sign-Only vs Full f32
Sign-only dynamics outperforms full f32 by 0.031 bpc. The mantissa degrades prediction. Source: q1_boolean.c
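A sketch of what "sign-only dynamics" means, with an illustrative 4-neuron hidden state (the real model uses 128 neurons and lives in q1_boolean.c; the exact update rule here is an assumption):

```c
#define N 4  /* illustrative; the real hidden state has 128 neurons */

/* sgn(x): -1, 0, or +1 */
static float sgn(float x) { return (float)((x > 0.0f) - (x < 0.0f)); }

/* One recurrent step that keeps only the SIGN of each pre-activation:
 * h' = sgn(Wh h + Wx x).  Magnitude and mantissa are discarded. */
static void sign_step(const float Wh[N][N], const float Wx[N],
                      float x, float h[N]) {
    float pre[N];
    for (int i = 0; i < N; i++) {
        pre[i] = Wx[i] * x;
        for (int j = 0; j < N; j++)
            pre[i] += Wh[i][j] * h[j];
    }
    for (int i = 0; i < N; i++)
        h[i] = sgn(pre[i]);
}
```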
Key finding: The gradient decorrelates at BPTT depth 1 between f32 and exact arithmetic, but pattern ranking ρ = 1.000 at depth ≥ 11. The f32 quotient costs 0.071 bpc. The mantissa matters for optimization but is noise for inference.
Neurons per Dominant Offset
The peak is at d=14 (19 neurons). Strong at d=17 (12), d=21 (9), d=28 (8). The MI-greedy offsets [1,3,8,20] that the pattern-chain analysis found optimal capture only 10.3% of the RNN's actual sign-change signal. The RNN routes information through deeper Wh chains. Source: q2_offsets.c
Output KL by Depth: How Much Does Each Offset Matter?
Depth d=21 stands out: KL = 4.63 bits. Deep offsets d=28 (3.54), d=27 (3.10) also dominate. The RNN has learned to propagate information through long Wh chains, accessing structure at depths 15–28 that skip-k-gram models can only reach by including more offsets.
The RNN is deeper than it looks. With 128 neurons and dense Wh, information from offset d=28 reaches the output through chains like h8 ← h8 (−1.1) ← h15 (+0.7). The factor-map pair (1,7) captures only 4.8% of the signal.
Individual Neuron Impact (Top 30 by Knockout Δbpc)
Each bar: the increase in bpc when that neuron's Wy column is zeroed. h8 (+0.035) > h56 (+0.025) > h68 (+0.025) > h99 (+0.020) > h15 (+0.020). The top 10 account for 0.218 bpc; the bottom 98 account for 0.066. Source: q3_neurons.c
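The knockout itself is a one-line surgery on the output map. A sketch, assuming Wy is stored row-major as 256 output bytes × 128 hidden units (the layout is an assumption):

```c
#define H 128   /* hidden units */
#define O 256   /* output alphabet (bytes) */

/* Knock out neuron k: zero its column of Wy so its activation can no
 * longer reach the output.  Re-measuring bpc afterwards gives Δbpc. */
static void knockout(float *Wy, int k) {
    for (int o = 0; o < O; o++)
        Wy[o * H + k] = 0.0f;
}
```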
Cumulative Compression: Keep Only Top-k
We keep only the top-k neurons (zeroing the rest) and measure bpc. 20 neurons recover 83.2% of the compression gap; 30: 92.1%; 50: 97.4%; 120: 100.0% (actually 0.1% better than all 128, because the last 8 add noise).
What the Top 10 Neurons Predict (Wy Profile)
| Neuron | Δbpc | |Wy| | Promotes | Demotes |
| h8 | +0.035 | 6.10 | '/' 'i' ' ' 'c' 'd' | 'a' 'e' 'o' 'p' ':' |
| h56 | +0.025 | 4.43 | ' ' 'y' 's' 'd' 'l' | 'a' 'c' 'n' 'p' 'e' |
| h68 | +0.025 | 5.83 | 'm' 'i' 'e' 'n' 'p' | ' ' 'k' '>' 'h' '.' |
| h99 | +0.020 | 5.02 | '>' 'a' 'r' 'd' 'U' | '/' 'i' 'e' 't' ' ' |
| h15 | +0.020 | 5.19 | 's' '/' 'g' 'l' '"' | 'k' ' ' 'h' 't' '.' |
| h52 | +0.018 | 5.83 | 'e' 'i' 'a' '.' 'p' | 'k' '<' '/' 'n' '"' |
| h76 | +0.017 | 4.56 | 's' 'l' 'm' 'd' 'c' | ' ' 'e' 'n' '<' '/' |
| h90 | +0.015 | 4.69 | 'a' 'k' '"' 'h' '=' | 'M' 'm' '.' 'p' 's' |
| h73 | +0.015 | 4.32 | 'a' '"' ' ' 'n' '>' | 'r' 't' 'k' 'M' 'l' |
| h50 | +0.013 | 3.29 | 'e' '"' 'k' '<' '0' | 'a' 's' ' ' 'c' '>' |
Neurons classified as "volatile": 128/128
Frozen neurons (zero flips): 0
Mean dwell time (steps between flips): 1.9
Top co-flip Jaccard (h3, h36): 0.650
Dwell Time Distribution
How long neurons stay in one sign before flipping. Mode = 1 step (25,672 occurrences), with exponential decay: most neurons flip every 1–3 steps. Source: q4_saturation.c
Top Co-Flip Pairs (Jaccard > 0.45)
Neurons that flip at the same position share features. h3&h36: Jaccard 0.650 (494 co-flips). h97&h117: 0.598. h44&h69: 0.592. These are functional clusters encoding correlated input features.
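The Jaccard score is computed over flip-indicator sequences: for each neuron, mark the time steps at which its sign changes, then compare the two sets. A sketch:

```c
/* Jaccard similarity |A∩B| / |A∪B| of two flip-indicator sequences:
 * f[t] is nonzero iff the neuron changed sign at step t. */
static double jaccard(const int *fa, const int *fb, int T) {
    int both = 0, either = 0;
    for (int t = 0; t < T; t++) {
        both   += (fa[t] != 0) && (fb[t] != 0);
        either += (fa[t] != 0) || (fb[t] != 0);
    }
    return either ? (double)both / either : 0.0;
}
```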
The snapshot is not the dynamics. All 128 neurons are volatile — none stay fixed. The hidden state is fully mixed at each step: 51.2 sign changes per step on average. Co-flip groups reveal functional clusters.
Model Size vs Performance
Every pruned configuration outperforms the full model. The sweet spot: 20 neurons + Wh entries > 3.0 = 25,857 parameters at 4.81 bpc. The full 82k model: 4.97 bpc. Removing noise improves prediction.
| Configuration | bpc | Δ | Parameters |
| Full f32 | 4.965 | — | 82,304 |
| Top 20 + Wh > 3.0 | 4.811 | −0.154 | 25,857 |
| Top 20 neurons | 4.882 | −0.083 | 37,689 |
| Top 15 + Wh > 3.0 | 4.879 | −0.086 | 22,017 |
| Wh prune (>3.0) | 4.903 | −0.063 | 49,753 |
Training needed 82k; inference needs 26k; construction needs 0. The dense Wh is scaffolding for gradient flow. Training the sparse 26k redux from scratch fails: 7.74 bpc after 50 epochs.
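The pruning step behind the redux table is plain magnitude thresholding on Wh. A sketch, assuming a row-major 128×128 layout:

```c
#include <math.h>
#define H 128

/* Zero every Wh entry with |w| <= thresh (3.0 in the redux experiment)
 * and return how many weights survive. */
static int prune_wh(float *Wh, float thresh) {
    int kept = 0;
    for (int i = 0; i < H * H; i++) {
        if (fabsf(Wh[i]) <= thresh)
            Wh[i] = 0.0f;
        else
            kept++;
    }
    return kept;
}
```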
The Routing Backbone
| Neuron | Role | Evidence |
| h8 | Hub — appears in every justification chain | Self-loop h8←h8 (−1.1). Top Wh source for h52, h90, h56. |
| h68 | Secondary hub | Source for h99 (+1.0), h52 (−1.2). Driven by h26, h90. |
| h99 | Critical predictor | Flipping costs +2.027 bpc at t=50. Source: h68 (+1.0), h17 (+0.8). |
| h52 | Context encoder | Driven by h8 (+1.3), h56 (+0.8). Encodes vowel/consonant structure. |
| h76 | Relay | Source: h50 (−1.3). Main signal path for 's','l','m','d','c'. |
"Each prediction traces to ~5 neurons × ~3 backward steps = 15 weights. The routing backbone h8 ← h68 ← h99 carries the plurality of prediction-relevant information."
— synthesis.tex
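The ~3-backward-step chains can be recovered greedily: starting from a neuron, repeatedly follow its strongest incoming Wh weight. A sketch (ties and sign bookkeeping are omitted, and the row-major convention Wh[k*H + j] = weight from neuron j into neuron k is an assumption):

```c
#include <math.h>
#define H 128

/* Greedy justification chain: from neuron k, step `depth` times to the
 * incoming neuron with the largest |Wh| weight.  chain holds depth+1 ids. */
static void trace_chain(const float *Wh, int k, int depth, int *chain) {
    chain[0] = k;
    for (int s = 1; s <= depth; s++) {
        int best = 0;
        for (int j = 1; j < H; j++)
            if (fabsf(Wh[k * H + j]) > fabsf(Wh[k * H + best]))
                best = j;
        chain[s] = k = best;
    }
}
```

With weights matching the table above (h99 ← h68, h68 driven by h26), the trace from h99 recovers the backbone in two steps.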
RNN Attribution vs Data PMI Alignment by Offset
At shallow offsets (d=1–4), the RNN's per-neuron attributions match the data's pairwise mutual information at 88%. At d=5–8: 87–96%. Beyond d=15, alignment drops to 24–37% as the RNN develops higher-order patterns that pairwise PMI cannot capture. Source: q7_algebraic.c
The bi-embedding is tighter at short range and loosens at long range. The 26% gap at depth is the higher-order structure: 3+ event patterns and cross-offset synergies (> 1.0 bits between offset pairs) that pairwise statistics miss.
The Complete Picture
"The weights are not arbitrary parameters found by stochastic optimization. They are a noisy encoding of the data's covariance structure, filtered through the f32 quotient and the BPTT optimization landscape."
— narrative.tex
| Question | Key Number | What It Proves |
| Q1: Sparsity | 300:52:1 | The computation is Boolean (sign bits only) |
| Q2: Offsets | d=14 peak | Information routes through deep Wh chains |
| Q3: Neurons | h8 = king | Extreme concentration; most neurons are noise |
| Q4: Saturation | 128/128 volatile | All neurons flip; the dynamics is fully mixed |
| Q5: Redux | 20 + 36% | 82k params for training, 26k for inference |
| Q6: Justification | ~15 weights | Every prediction is human-readable |
| Q7: Algebra | 74% aligned | Weights are a function of data PMI |