Overview
Seven questions, each answered by running C programs against the trained model (128 hidden units, 0.079 bpc on 1024 bytes of enwik9). All data below is from real experimental runs.
| Question | Key Number | Summary |
| Q1: Sparsity | 300:52:1 | Bit leverage hierarchy (sign:exponent:mantissa). Pattern ranking ρ = 1.000 at depth ≥ 11. |
| Q2: Offsets | d=14 peak | 19 neurons dominated by d=14. MI-greedy [1,3,8,20] captures only 10.3% of signal. |
| Q3: Neurons | h8 = king | h8 alone: +0.035 bpc impact. Top 20 neurons = 83% of compression. |
| Q4: Saturation | 128/128 | All neurons volatile (>50 flips). Mean dwell 1.9 steps. Zero frozen. |
| Q5: Redux | 20 + 36% | 20 neurons + 36% of Wh = 4.81 bpc (0.15 better than the full 82k model). |
| Q6: Justification | ~15 weights | Each prediction: ~5 neurons × ~3 backward steps. h8 in every chain. |
| Q7: Algebra | 74% aligned | RNN attribution matches data PMI at shallow offsets (88%), diverges at depth (24–37%). |
We compare f32 and exact (MPFR-256) arithmetic to measure how much each bit of the floating-point representation contributes to the model's predictions.
Per-exponent-bit KL (8 bits): 0.008
Per-mantissa-bit KL (23 bits): 0.00015
Lyapunov exponent (chaos at depth): 3.44
Per-Bit KL Divergence
The sign bit carries 300× the information of a mantissa bit. Per-bit ratio: sign (0.046) : exponent (0.008) : mantissa (0.00015) ≈ 300:52:1. Totals per field: sign 0.046, exponent 8×0.008 = 0.064, mantissa 23×0.00015 = 0.0035. Source: q1_bit_sample.c
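As a minimal sketch of the perturbation behind these numbers (the KL measurement itself requires the trained model and is not reproduced here), flipping an individual bit of an f32 weight is a one-line XOR on the binary32 representation:

```c
#include <stdint.h>
#include <string.h>

/* Flip one bit of an IEEE-754 single: bit 31 is the sign, bits 30..23
 * the exponent, bits 22..0 the mantissa.  This is the perturbation used
 * to price each bit; the KL is measured downstream on model outputs. */
static float flip_bit(float x, int k) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);     /* bit-cast without aliasing UB */
    u ^= (uint32_t)1u << k;
    memcpy(&x, &u, sizeof u);
    return x;
}
```

Flipping the sign bit negates the value, flipping the exponent LSB halves or doubles it, and flipping the top mantissa bit nudges it by 50% at most, which is why the three fields price so differently.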
Sign-Only vs Full f32
Sign-only dynamics outperforms full f32 by 0.031 bpc. The mantissa degrades prediction. Source: q1_boolean.c
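A sketch of what "sign-only dynamics" means, with an illustrative 4-neuron hidden state (the real model uses 128 neurons and lives in q1_boolean.c; the exact update rule here is an assumption):

```c
#define N 4  /* illustrative; the real hidden state has 128 neurons */

/* sgn(x): -1, 0, or +1 */
static float sgn(float x) { return (float)((x > 0.0f) - (x < 0.0f)); }

/* One recurrent step that keeps only the SIGN of each pre-activation:
 * h' = sgn(Wh h + Wx x).  Magnitude and mantissa are discarded. */
static void sign_step(const float Wh[N][N], const float Wx[N],
                      float x, float h[N]) {
    float pre[N];
    for (int i = 0; i < N; i++) {
        pre[i] = Wx[i] * x;
        for (int j = 0; j < N; j++)
            pre[i] += Wh[i][j] * h[j];
    }
    for (int i = 0; i < N; i++)
        h[i] = sgn(pre[i]);
}
```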
Key finding: The gradient decorrelates at BPTT depth 1 between f32 and exact arithmetic, but pattern ranking ρ = 1.000 at depth ≥ 11. The f32 quotient costs 0.071 bpc. The mantissa matters for optimization but is noise for inference.
Neurons per Dominant Offset
The peak is at d=14 (19 neurons). Strong at d=17 (12), d=21 (9), d=28 (8). The MI-greedy offsets [1,3,8,20] that the pattern-chain analysis found optimal capture only 10.3% of the RNN's actual sign-change signal. The RNN routes information through deeper Wh chains. Source: q2_offsets.c
Output KL by Depth: How Much Does Each Offset Matter?
Depth d=21 stands out: KL = 4.63 bits. Deep offsets d=28 (3.54), d=27 (3.10) also dominate. The RNN has learned to propagate information through long Wh chains, accessing structure at depths 15–28 that skip-k-gram models can only reach by including more offsets.
The RNN is deeper than it looks. With 128 neurons and dense Wh, information from offset d=28 reaches the output through chains like h8 ← h8 (−1.1) ← h15 (+0.7). The factor-map pair (1,7) captures only 4.8% of the signal.
Individual Neuron Impact (Top 30 by Knockout Δbpc)
Each bar: the increase in bpc when that neuron's Wy column is zeroed. h8 (+0.035) > h56 (+0.025) > h68 (+0.025) > h99 (+0.020) > h15 (+0.020). The top 10 account for 0.218 bpc; the bottom 98 account for 0.066. Source: q3_neurons.c
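The knockout itself is a one-line surgery on the output map. A sketch, assuming Wy is stored row-major as 256 output bytes × 128 hidden units (the layout is an assumption):

```c
#define H 128   /* hidden units */
#define O 256   /* output alphabet (bytes) */

/* Knock out neuron k: zero its column of Wy so its activation can no
 * longer reach the output.  Re-measuring bpc afterwards gives Δbpc. */
static void knockout(float *Wy, int k) {
    for (int o = 0; o < O; o++)
        Wy[o * H + k] = 0.0f;
}
```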
Cumulative Compression: Keep Only Top-k
We keep only the top-k neurons (zeroing the rest) and measure bpc. 20 neurons recover 83.2% of the compression gap; 30: 92.1%; 50: 97.4%; 120: 100.0% (actually 0.1% better than all 128, because the last 8 add noise).
What the Top 10 Neurons Predict (Wy Profile)
| Neuron | Δbpc | |Wy| | Promotes | Demotes |
| h8 | +0.035 | 6.10 | '/' 'i' ' ' 'c' 'd' | 'a' 'e' 'o' 'p' ':' |
| h56 | +0.025 | 4.43 | ' ' 'y' 's' 'd' 'l' | 'a' 'c' 'n' 'p' 'e' |
| h68 | +0.025 | 5.83 | 'm' 'i' 'e' 'n' 'p' | ' ' 'k' '>' 'h' '.' |
| h99 | +0.020 | 5.02 | '>' 'a' 'r' 'd' 'U' | '/' 'i' 'e' 't' ' ' |
| h15 | +0.020 | 5.19 | 's' '/' 'g' 'l' '"' | 'k' ' ' 'h' 't' '.' |
| h52 | +0.018 | 5.83 | 'e' 'i' 'a' '.' 'p' | 'k' '<' '/' 'n' '"' |
| h76 | +0.017 | 4.56 | 's' 'l' 'm' 'd' 'c' | ' ' 'e' 'n' '<' '/' |
| h90 | +0.015 | 4.69 | 'a' 'k' '"' 'h' '=' | 'M' 'm' '.' 'p' 's' |
| h73 | +0.015 | 4.32 | 'a' '"' ' ' 'n' '>' | 'r' 't' 'k' 'M' 'l' |
| h50 | +0.013 | 3.29 | 'e' '"' 'k' '<' '0' | 'a' 's' ' ' 'c' '>' |
Neurons classified as "volatile": 128/128
Frozen neurons (zero flips): 0
Mean dwell time (steps between flips): 1.9
Top co-flip Jaccard (h3, h36): 0.650
Dwell Time Distribution
How long neurons stay in one sign before flipping. Mode = 1 step (25,672 occurrences), with exponential decay: most neurons flip every 1–3 steps. Source: q4_saturation.c
Top Co-Flip Pairs (Jaccard > 0.45)
Neurons that flip at the same position share features. h3&h36: Jaccard 0.650 (494 co-flips). h97&h117: 0.598. h44&h69: 0.592. These are functional clusters encoding correlated input features.
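The Jaccard score is computed over flip-indicator sequences: for each neuron, mark the time steps at which its sign changes, then compare the two sets. A sketch:

```c
/* Jaccard similarity |A∩B| / |A∪B| of two flip-indicator sequences:
 * f[t] is nonzero iff the neuron changed sign at step t. */
static double jaccard(const int *fa, const int *fb, int T) {
    int both = 0, either = 0;
    for (int t = 0; t < T; t++) {
        both   += (fa[t] != 0) && (fb[t] != 0);
        either += (fa[t] != 0) || (fb[t] != 0);
    }
    return either ? (double)both / either : 0.0;
}
```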
The snapshot is not the dynamics. All 128 neurons are volatile — none stay fixed. The hidden state is fully mixed at each step: 51.2 sign changes per step on average. Co-flip groups reveal functional clusters.
Model Size vs Performance
Every pruned configuration outperforms the full model. The sweet spot: 20 neurons + Wh entries > 3.0 = 25,857 parameters at 4.81 bpc. The full 82k model: 4.97 bpc. Removing noise improves prediction.
| Configuration | bpc | Δ | Parameters |
| Full f32 | 4.965 | — | 82,304 |
| Top 20 + Wh > 3.0 | 4.811 | −0.154 | 25,857 |
| Top 20 neurons | 4.882 | −0.083 | 37,689 |
| Top 15 + Wh > 3.0 | 4.879 | −0.086 | 22,017 |
| Wh prune (>3.0) | 4.903 | −0.063 | 49,753 |
Training needed 82k; inference needs 26k; construction needs 0. The dense Wh is scaffolding for gradient flow. Training the sparse 26k redux from scratch fails: 7.74 bpc after 50 epochs.
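The pruning step behind the redux table is plain magnitude thresholding on Wh. A sketch, assuming a row-major 128×128 layout:

```c
#include <math.h>
#define H 128

/* Zero every Wh entry with |w| <= thresh (3.0 in the redux experiment)
 * and return how many weights survive. */
static int prune_wh(float *Wh, float thresh) {
    int kept = 0;
    for (int i = 0; i < H * H; i++) {
        if (fabsf(Wh[i]) <= thresh)
            Wh[i] = 0.0f;
        else
            kept++;
    }
    return kept;
}
```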
The Routing Backbone
| Neuron | Role | Evidence |
| h8 | Hub — appears in every justification chain | Self-loop h8←h8 (−1.1). Top Wh source for h52, h90, h56. |
| h68 | Secondary hub | Source for h99 (+1.0), h52 (−1.2). Driven by h26, h90. |
| h99 | Critical predictor | Flipping costs +2.027 bpc at t=50. Source: h68 (+1.0), h17 (+0.8). |
| h52 | Context encoder | Driven by h8 (+1.3), h56 (+0.8). Encodes vowel/consonant structure. |
| h76 | Relay | Source: h50 (−1.3). Main signal path for 's','l','m','d','c'. |
"Each prediction traces to ~5 neurons × ~3 backward steps = 15 weights. The routing backbone h8 ← h68 ← h99 carries the plurality of prediction-relevant information."
— synthesis.tex
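The ~3-backward-step chains can be recovered greedily: starting from a neuron, repeatedly follow its strongest incoming Wh weight. A sketch (ties and sign bookkeeping are omitted, and the row-major convention Wh[k*H + j] = weight from neuron j into neuron k is an assumption):

```c
#include <math.h>
#define H 128

/* Greedy justification chain: from neuron k, step `depth` times to the
 * incoming neuron with the largest |Wh| weight.  chain holds depth+1 ids. */
static void trace_chain(const float *Wh, int k, int depth, int *chain) {
    chain[0] = k;
    for (int s = 1; s <= depth; s++) {
        int best = 0;
        for (int j = 1; j < H; j++)
            if (fabsf(Wh[k * H + j]) > fabsf(Wh[k * H + best]))
                best = j;
        chain[s] = k = best;
    }
}
```

With weights matching the table above (h99 ← h68, h68 driven by h26), the trace from h99 recovers the backbone in two steps.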
RNN Attribution vs Data PMI Alignment by Offset
At shallow offsets (d=1–4), the RNN's per-neuron attributions match the data's pairwise mutual information at 88%. At d=5–8: 87–96%. Beyond d=15, alignment drops to 24–37% as the RNN develops higher-order patterns that pairwise PMI cannot capture. Source: q7_algebraic.c
The bi-embedding is tighter at short range and loosens at long range. The 26% gap at depth is the higher-order structure: 3+ event patterns and cross-offset synergies (> 1.0 bits between offset pairs) that pairwise statistics miss.
The Complete Picture
"The weights are not arbitrary parameters found by stochastic optimization. They are a noisy encoding of the data's covariance structure, filtered through the f32 quotient and the BPTT optimization landscape."
— narrative.tex
| Question | Key Number | What It Proves |
| Q1: Sparsity | 300:52:1 | The computation is Boolean (sign bits only) |
| Q2: Offsets | d=14 peak | Information routes through deep Wh chains |
| Q3: Neurons | h8 = king | Extreme concentration; most neurons are noise |
| Q4: Saturation | 128/128 volatile | All neurons flip; the dynamics is fully mixed |
| Q5: Redux | 20 + 36% | 82k params for training, 26k for inference |
| Q6: Justification | ~15 weights | Every prediction is human-readable |
| Q7: Algebra | 74% aligned | Weights are a function of data PMI |