Q2: Which Offsets Does the RNN Use?

Experiment: q2_offsets — 2026-02-11
"The RNN compensates for PMI decay by mixing information through recurrent dynamics."

What This Experiment Shows

The RNN predicts the next byte based on its hidden state, which accumulates information from all previous inputs. But which past inputs matter most? The "offset" d measures how far back in time: d=1 is the immediately previous byte, d=7 is 7 bytes ago, etc.

We measure each neuron's sensitivity to inputs at different offsets by perturbing the input at position t-d and observing how much the neuron's sign changes at position t. This gives us the RNN's "depth profile" — its memory reach.
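The core sensitivity measure can be sketched in C. This is a hypothetical helper (the name `count_sign_changes` and the driver that runs the RNN twice, once with the byte at t-d perturbed, are assumptions, not taken from `q2_offsets.c`):

```c
/* Count how many neurons flip sign between the clean and the perturbed
 * hidden state. The caller is assumed to have run the RNN twice up to
 * position t: once on the original input, once with the byte at t-d
 * changed, yielding the two hidden-state vectors compared here. */
int count_sign_changes(const float *h_clean, const float *h_pert, int n)
{
    int flips = 0;
    for (int i = 0; i < n; i++) {
        /* A neuron "changes sign" when its clean and perturbed
         * activations land on opposite sides of zero. */
        if ((h_clean[i] >= 0.0f) != (h_pert[i] >= 0.0f))
            flips++;
    }
    return flips;
}
```

Summing this count over many positions t, separately for each depth d, gives the depth profile.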

The earlier factor-map analysis found that neurons act as 2-offset conjunction detectors, with the dominant pair being (1,7). But that analysis used a different method (conditional statistics); here we use direct perturbation to measure the actual information flow.

"MI-greedy [1,3,8,20] captures only 9.4% of sign-change signal. The RNN's information flow is distributed broadly across offsets, not concentrated at the MI-optimal positions."
— q234-results.tex

Key Numbers

- d=14: most common dominant offset (19 neurons, 14.8%)
- 10.3%: MI-greedy [1,3,8,20] signal capture
- 4.8%: factor-map pair (1,7) signal capture
- 1-30: range of dominant offsets across neurons

Depth Profile: Sign Changes and Output KL

For each depth d from 1 to 30, we measure (1) how many neuron signs change when the input at t-d is perturbed, and (2) how much the output distribution changes (KL divergence).

Mean Sign Changes and Output KL by Depth

Both sign changes and output KL increase with depth, peaking around d=17-21. This is surprising: the input 20+ steps ago affects the output more than the input one step ago. The RNN amplifies long-range signals through its recurrent dynamics.

Depth d    Mean Sign Changes    Mean Output KL
1          6.69                 0.119
4          11.00                0.886
7          14.23                1.236
8          16.08                1.372
14         21.77                1.783
17         23.38                2.043
21         21.46                4.631
25         15.38                1.542
28         19.00                3.542
30         14.69                2.766
The gradient does NOT vanish. Despite the chaotic dynamics (spectral radius > 1), the RNN maintains sensitivity to inputs 20-30 steps in the past. The recurrent dynamics amplify some signals while damping others, creating a selective long-range memory.
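A toy scalar recurrence illustrates why a spectral radius above 1 lets deep perturbations dominate. This is an illustration only, not the experiment's network; `depth_sensitivity` and the constant input stream are assumptions for the sketch:

```c
#include <math.h>

/* Toy illustration: run the scalar recurrence h_t = w*h_{t-1} + x_t
 * twice, once with the input at position t-d perturbed by eps, and
 * return |delta h_t|. The difference propagates as eps * |w|^d, so it
 * GROWS with depth d when |w| > 1 (amplification, as observed) and
 * decays when |w| < 1 (the classic vanishing-gradient regime). */
double depth_sensitivity(double w, int t, int d, double eps)
{
    double h = 0.0, h_pert = 0.0;
    for (int i = 0; i <= t; i++) {
        double x = 1.0;                      /* constant input stream */
        double x_pert = (i == t - d) ? x + eps : x;
        h      = w * h      + x;
        h_pert = w * h_pert + x_pert;
    }
    return fabs(h_pert - h);
}
```

With w = 1.2, a perturbation 20 steps back moves h_t far more than one a single step back; with w = 0.5 the ordering reverses. The real RNN's nonlinearity makes this amplification selective rather than uniform.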

Per-Neuron Dominant Offsets

Each neuron has a "dominant offset" — the depth at which perturbing the input causes the most sign changes in that neuron. The distribution of dominant offsets reveals the model's memory structure.
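Given a neuron's per-depth flip counts, the dominant offset is a simple argmax. A sketch (the name `dominant_offset` and the flat count array are assumptions about how the bookkeeping might look):

```c
/* For one neuron, return the depth d in 1..max_d whose perturbation
 * flips that neuron's sign most often. sign_flips[d-1] holds the flip
 * count at depth d, accumulated over many positions t. Ties go to the
 * smaller depth. */
int dominant_offset(const int *sign_flips, int max_d)
{
    int best_d = 1;
    for (int d = 2; d <= max_d; d++) {
        if (sign_flips[d - 1] > sign_flips[best_d - 1])
            best_d = d;
    }
    return best_d;
}
```

Running this over all neurons and histogramming the results yields the distribution discussed below.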

Histogram: Dominant Offset per Neuron

The distribution is broad, spanning d=1 to d=30, with a peak at d=14 (19 neurons). There is no single "memory depth" — different neurons specialize in different time scales. This is how the RNN encodes context at multiple resolutions simultaneously.

Short-range specialists (d ≤ 8)

Offset    Neurons    %
d=1       3          2.3%
d=4       7          5.5%
d=7       7          5.5%
d=8       6          4.7%

Long-range specialists (d ≥ 14)

Offset    Neurons    %
d=14      19         14.8%
d=17      12         9.4%
d=21      9          7.0%
d=28      8          6.2%

MI-Greedy vs RNN Offsets

The MI-greedy offsets [1,3,8,20] were found in the pattern-chain analysis by greedily selecting offsets that maximize mutual information with the output. How well do they match what the RNN actually uses?

Signal Captured by Different Offset Sets

MI-greedy offsets capture only 10.3% of the sign-change signal. The RNN does not concentrate its computation at the MI-optimal offsets. Instead, it distributes information processing broadly across d=1 to d=30, with most activity at d=14-28. The factor-map pair (1,7) captures even less: 4.8%.
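The capture percentage is just the share of total sign-change mass falling at the chosen offsets. A sketch, assuming a per-depth signal array summed over all neurons and positions (`signal_fraction` is an illustrative name, not the experiment's code):

```c
/* Fraction of the total sign-change signal captured by a chosen offset
 * set, e.g. the MI-greedy set {1,3,8,20}. signal[d-1] is the total
 * number of sign changes caused by perturbations at depth d, for d in
 * 1..max_d. Returns 0 if there is no signal at all. */
double signal_fraction(const double *signal, int max_d,
                       const int *offsets, int n_offsets)
{
    double total = 0.0, captured = 0.0;
    for (int d = 1; d <= max_d; d++)
        total += signal[d - 1];
    for (int i = 0; i < n_offsets; i++)
        captured += signal[offsets[i] - 1];
    return total > 0.0 ? captured / total : 0.0;
}
```

Because the depth profile peaks near d=14-28, any small set of shallow offsets is bound to capture only a thin slice of the total.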

This discrepancy makes sense: the MI-greedy selection maximizes direct mutual information between a past byte and the output, ignoring the indirect chains the RNN uses. The RNN amplifies weak but complementary signals from many offsets, achieving better coverage than concentrating on the strongest individual offsets.

Why This Matters

The RNN has deep memory, not shallow. Dominant offsets at d=14-28 mean the model relies heavily on context 14-28 characters back. For XML/Wikipedia data, this corresponds to tag names, attribute values, and word-level patterns — not just character bigrams.
Attention-like behavior without attention. Different neurons "attend" to different offsets, creating a distributed multi-scale representation. This is functionally similar to multi-head attention in transformers, achieved through recurrent dynamics rather than explicit attention mechanisms.
Information is amplified, not attenuated. The growing sign changes with depth show that the RNN amplifies signals from far back, consistent with the spectral radius > 1 and the Lyapunov exponent of 3.44 found in the f32 experiments.

Source & Related

Papers: q234-results.pdf, q1-exact-results.pdf

Programs: q2_offsets.c

Related experiments: Boolean Automaton, Neuron Knockout, Saturation Dynamics, Per-Prediction Justifications