What This Experiment Shows
For each of the 1024 predictions the sat-rnn makes, how many of its 44,794 patterns (those with |w| > 0.01) actually participate in the backward attribution chain? This experiment computes the full backward trace at every position, sweeping the attribution threshold τ across five orders of magnitude.
The answer: typical predictions are extremely sparse, but the distribution has a heavy tail. Most positions need a handful of patterns; a few surprising positions activate thousands.
"The explanation is very sparse for typical predictions but heavy-tailed. The median position at τ = 0.01 uses only 15 patterns out of 44,794. But the mean is 1,166, pulled up by a minority of positions with large attribution counts."
— q1-sparsity.tex
The Numbers at a Glance
44,794
patterns with |w| > 0.01
15
median active patterns at τ = 0.01
87%
of active patterns are Wh (recurrent)
95.5%
of Wh patterns active somewhere
Sparsity Distribution
As the threshold τ increases, fewer patterns survive. The distribution over positions is highly skewed: from τ = 10⁻³ upward the mean far exceeds the median, revealing a heavy-tailed distribution where most predictions are cheap but some are expensive.
| Threshold τ | Mean n | Median n | Min | Max | Mean n / 44,794 |
| 10⁻⁴ | 9,807 | 10,283 | 0 | 19,850 | 0.219 |
| 10⁻³ | 4,357 | 1,664 | 0 | 19,352 | 0.097 |
| 10⁻² | 1,166 | 15 | 0 | 17,710 | 0.026 |
| 10⁻¹ | 157 | 0 | 0 | 11,012 | 0.004 |
| 1.0 | 8 | 0 | 0 | 2,127 | 0.000 |
Mean vs Median Active Patterns
The gap between mean and median shows the heavy tail. At τ = 0.01, the mean is 78× the median.
Key finding: At τ = 0.01, the median prediction uses just 15 patterns (0.03% of the total), but the mean is 1,166 — a 78× gap driven by a minority of hard-to-predict positions.
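The mean/median gap that diagnoses the heavy tail is easy to compute from the per-position active-pattern counts. A minimal sketch, assuming hypothetical helper names; the counts fed in here are illustrative, not the experiment's data:

```c
#include <stdlib.h>
#include <string.h>

/* Comparator for qsort over ints. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Mean of the per-position active-pattern counts. */
double mean_count(const int *n, size_t len) {
    double s = 0.0;
    for (size_t i = 0; i < len; i++) s += n[i];
    return s / (double)len;
}

/* Median of the per-position counts (averaging the middle pair
 * when len is even). Sorts a private copy, leaving n untouched. */
double median_count(const int *n, size_t len) {
    int *tmp = malloc(len * sizeof *tmp);
    memcpy(tmp, n, len * sizeof *tmp);
    qsort(tmp, len, sizeof *tmp, cmp_int);
    double m = (len % 2) ? (double)tmp[len / 2]
                         : 0.5 * (tmp[len / 2 - 1] + tmp[len / 2]);
    free(tmp);
    return m;
}
```

With counts like {0, 5, 15, 20, 17710}, one expensive position drags the mean far above the median, which is exactly the 78× pattern in the table.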
Breakdown by Pattern Class
The model has three classes of patterns from its three weight matrices. Wh (recurrent) patterns dominate at every threshold — the backward attribution chain flows primarily through recurrent connections.
| Threshold τ | Mean n_x (input) | Mean n_h (recurrent) | Mean n_y (output) | Wh share |
| 10⁻³ | 481 | 3,834 | 42 | 88% |
| 10⁻² | 136 | 1,018 | 12 | 87% |
| 10⁻¹ | 22 | 134 | 2 | 85% |
Pattern Class Composition at τ = 0.01
Wh dominates: Recurrent patterns account for ~87% of all active patterns at every threshold. The backward attribution chain is primarily a story of neuron-to-neuron connections. Output patterns (Wy) are the sparsest — only 12 on average at τ = 0.01, meaning just a handful of neurons contribute meaningfully to each prediction.
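The Wh share column is just each class's fraction of the combined mean. A one-line check (helper name is illustrative), using the table's τ = 10⁻² row:

```c
/* Share of one pattern class in the mean active count at a threshold:
 * share = n_class / (n_x + n_h + n_y). Inputs are the per-class means. */
double class_share(double n_class, double n_x, double n_h, double n_y) {
    return n_class / (n_x + n_h + n_y);
}
```

class_share(1018, 136, 1018, 12) gives ≈ 0.873, matching the 87% Wh share at τ = 10⁻².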
Never-Active Patterns
At τ = 0.01, how many of the model's patterns are never active at any of the 1024 positions?
87%
Wx never active (28,635 / 32,768)
4.5%
Wh never active (735 / 16,384)
85%
Wy never active (27,965 / 32,768)
Pattern Utilization by Class
Most Wx patterns are irrelevant because most input byte values never occur in the 1024-byte dataset (only 52 of 256 byte values appear). Similarly, most Wy patterns are irrelevant because most output byte values are never the prediction target.
Wh is almost fully utilized: 95.5% of recurrent connections matter at some position. The 128-neuron recurrent core is not over-provisioned; nearly every connection plays a role.
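The never-active counts come from checking each pattern against every position. A sketch, with a hypothetical 0/1 activity matrix standing in for the real attribution pass:

```c
#include <stddef.h>

/* A pattern is "never active" if it fails the τ test at every position.
 * active is a row-major n_patterns × n_positions 0/1 matrix, which the
 * attribution pass would fill in; here it can be any such matrix. */
size_t never_active(const unsigned char *active, size_t n_patterns,
                    size_t n_positions) {
    size_t dead = 0;
    for (size_t p = 0; p < n_patterns; p++) {
        size_t t = 0;
        while (t < n_positions && !active[p * n_positions + t]) t++;
        if (t == n_positions) dead++;  /* no position ever activated p */
    }
    return dead;
}
```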
Depth Profile: The Gradient Does Not Vanish
A common belief about RNNs is that gradients vanish with depth, making long-range dependencies unlearnable. This experiment measures the actual attribution mass at each backward offset d.
[Chart: Attribution Mass vs Backward Offset, showing the mean mass at each offset d and its fraction of the d = 0 value]
Mass does not decay monotonically. It grows from d=1 to a peak at d≈21, exceeding the d=0 value, then oscillates around 0.5–0.7× d=0 out to d=50.
Peak at d ≈ 20–21: Attribution mass at d=21 (0.827) exceeds d=0 (0.757). The RNN mixes information into a carrier arising from its recurrent dynamics. This is consistent with the greedy MI ordering [1, 8, 20, ...] where offset 20 was selected third.
"The gradient does not vanish. Mass grows from d=1 to a peak at d ≈ 20–21 (exceeding d=0), then oscillates around 0.5–0.7× the d=0 value out to d=50. The RNN mixes information into a carrier arising from its recurrent dynamics."
— q1-sparsity.tex
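One step of the Jacobian chain behind these mass measurements can be sketched for a tanh recurrence h(t) = tanh(Wx·x(t) + Wh·h(t−1)), pulling the gradient back one offset. The toy state size and the index layout of Wh are assumptions; q1_sparsity.c may order things differently:

```c
#include <math.h>

#define N 4  /* toy state size; the report's model uses 128 neurons */

/* One backward step: g_out = Wh^T (g_in ⊙ (1 - h²)), where h is the
 * post-tanh state at the step being traversed and 1 - h² is the tanh
 * derivative. Layout assumption: Wh[j][k] = weight from neuron k to j. */
void backstep(const double Wh[N][N], const double h[N],
              const double g_in[N], double g_out[N]) {
    for (int k = 0; k < N; k++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += Wh[j][k] * (1.0 - h[j] * h[j]) * g_in[j];
        g_out[k] = s;
    }
}

/* Attribution mass at one offset: L1 norm of the propagated gradient. */
double mass(const double g[N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++) s += fabs(g[j]);
    return s;
}
```

Iterating backstep d times from the output gradient gives g_{t,d}; tracking mass(g) across d produces the depth profile. Whether the mass grows toward d ≈ 20 depends entirely on the learned Wh and states, which is the point of the measurement.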
How It Works
For each position t = 0, ..., 1023, predicting y = x_{t+1}:
- Compute the output gradient g_t ∈ R^128
- For each offset d = 1, ..., D_max, compute the backward gradient g_{t,d} via the Jacobian chain
- For each Wy pattern (j, y): attribution = |[g_t]_j · 1[h_j has the correct sign]|
- For each Wx pattern (x_{t−d}, j) at offset d: attribution = |Wx[j, x_{t−d}] · [g_{t,d}]_j|
- For each Wh pattern (j, k): attribution at offset d = |(1 − h_j(t−d)²) · Wh[k, j] · [g_{t,d}]_j|; take the max over offsets
A pattern is active for position t if its attribution exceeds threshold τ.
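The per-pattern attribution and thresholding steps above can be sketched as follows; the function names and index layout are illustrative, not the actual q1_sparsity.c internals:

```c
#include <math.h>
#include <stddef.h>

/* Attribution of one Wh pattern (j, k) at one offset, per the formula
 * above: |(1 - h_j(t-d)²) · Wh[k, j] · [g_{t,d}]_j|. The caller would
 * take the max of this over all offsets d. */
double wh_attribution(double h_j, double wh_kj, double g_j) {
    return fabs((1.0 - h_j * h_j) * wh_kj * g_j);
}

/* Attribution of one Wx pattern (x_{t-d}, j): |Wx[j, x_{t-d}] · [g_{t,d}]_j|. */
double wx_attribution(double wx_jx, double g_j) {
    return fabs(wx_jx * g_j);
}

/* A pattern is active for position t if its attribution exceeds τ;
 * counting survivors at several τ values reproduces the sweep. */
size_t count_active(const double *attrs, size_t n, double tau) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (attrs[i] > tau) c++;
    return c;
}
```

Running count_active over the same attribution vector at τ = 10⁻⁴ ... 1.0 yields one row per threshold, which aggregated over the 1024 positions gives the sparsity table above.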
Tool: q1_sparsity.c · Model: sat_model.bin from archive/20260209 · Data: first 1024 bytes of enwik9