The RNN predicts the next byte based on its hidden state, which accumulates information from all previous inputs. But which past inputs matter most? The "offset" d measures how far back in time: d=1 is the immediately previous byte, d=7 is 7 bytes ago, etc.
We measure each neuron's sensitivity to inputs at different offsets by perturbing the input at position t-d and checking whether the neuron's sign flips at position t. Aggregated over depths, this gives the RNN's "depth profile" — its memory reach.
The earlier factor map analysis found that neurons act as 2-offset conjunction detectors, with the dominant pair being (1,7). But that used a different method (conditional statistics). Here we use direct perturbation to measure the actual information flow.
For each depth d from 1 to 30, we measure (1) how many neuron signs change when the input at t-d is perturbed, and (2) how much the output distribution changes (KL divergence).
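The measurement above can be sketched as follows. This is a minimal illustration on a vanilla tanh RNN with random weights standing in for the trained model; the hidden size, vocabulary, and the "flip to the next byte value" perturbation are assumptions, not details from the original experiment.

```python
# Sketch of the depth-profile measurement: perturb the byte d steps back,
# count hidden-unit sign flips, and compute the output KL divergence.
# Weights are random placeholders; in the real experiment they come from
# the trained byte-level RNN.
import numpy as np

rng = np.random.default_rng(0)
H, V = 128, 256                      # hidden size, byte vocabulary (assumed)
Wxh = rng.normal(0, 0.1, (H, V))     # input-to-hidden
Whh = rng.normal(0, 0.1, (H, H))     # hidden-to-hidden
Why = rng.normal(0, 0.1, (V, H))     # hidden-to-output

def run(seq):
    """Run the RNN over a byte sequence; return final hidden state and output distribution."""
    h = np.zeros(H)
    for b in seq:
        h = np.tanh(Wxh[:, b] + Whh @ h)
    logits = Why @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

def probe(seq, d):
    """Perturb the byte d steps back; return (#sign flips, KL(base || perturbed))."""
    h0, p0 = run(seq)
    pert = list(seq)
    pert[-d] = (pert[-d] + 1) % V    # minimal perturbation: bump to the next byte value
    h1, p1 = run(pert)
    flips = int(np.sum(np.sign(h0) != np.sign(h1)))
    kl = float(np.sum(p0 * np.log(p0 / p1)))
    return flips, kl

seq = rng.integers(0, V, 40)
for d in (1, 7, 17):
    print(d, probe(seq, d))
```

Averaging `probe` over many probe positions in a corpus yields the per-depth means reported in the table below.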
Both sign changes and output KL increase with depth: sign changes peak at d=17 and output KL at d=21. This is surprising — the input 20+ steps ago affects the output more than the input 1 step ago. The RNN amplifies long-range signals through its recurrent dynamics.
| Depth d | Mean Sign Changes | Mean Output KL |
|---|---|---|
| 1 | 6.69 | 0.119 |
| 4 | 11.00 | 0.886 |
| 7 | 14.23 | 1.236 |
| 8 | 16.08 | 1.372 |
| 14 | 21.77 | 1.783 |
| 17 | 23.38 | 2.043 |
| 21 | 21.46 | 4.631 |
| 25 | 15.38 | 1.542 |
| 28 | 19.00 | 3.542 |
| 30 | 14.69 | 2.766 |
Each neuron has a "dominant offset" — the depth at which perturbing the input causes the most sign changes in that neuron. The distribution of dominant offsets reveals the model's memory structure.
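Computing dominant offsets is a single argmax over the depth axis. A hypothetical sketch, using synthetic flip counts in place of the measured ones (the 128-neuron population size is taken from the percentage tables below):

```python
# Given a (neurons x depths) matrix of mean sign-flip counts, the dominant
# offset of each neuron is the depth that flips it most often.
# The counts here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(1)
depths = np.arange(1, 31)                     # d = 1..30
flip_counts = rng.poisson(5.0, (128, 30))     # synthetic per-neuron, per-depth flip counts

dominant = depths[np.argmax(flip_counts, axis=1)]   # dominant offset per neuron

# Distribution of dominant offsets across the population
offsets, counts = np.unique(dominant, return_counts=True)
for d, n in zip(offsets, counts):
    print(f"d={d}: {n} neurons ({100 * n / len(dominant):.1f}%)")
```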
The distribution is broad, spanning d=1 to d=30, with a peak at d=14 (19 neurons). There is no single "memory depth" — different neurons specialize in different time scales. This is how the RNN encodes context at multiple resolutions simultaneously.
| Offset | Neurons | % |
|---|---|---|
| d=1 | 3 | 2.3% |
| d=4 | 7 | 5.5% |
| d=7 | 7 | 5.5% |
| d=8 | 6 | 4.7% |
| d=14 | 19 | 14.8% |
| d=17 | 12 | 9.4% |
| d=21 | 9 | 7.0% |
| d=28 | 8 | 6.2% |
The MI-greedy offsets [1,8,20,3] were found in the pattern-chain analysis by selecting offsets that maximize mutual information with the output. How well do they match what the RNN actually uses?
The overlap is only partial: d=1 and d=8 do appear among the dominant offsets, but each accounts for few neurons, while the largest share of the population is dominated by d=14-17, which MI-greedy misses entirely. This discrepancy makes sense: the MI-greedy selection maximizes direct mutual information between a past byte and the output, ignoring the indirect chains the RNN uses. The RNN amplifies weak but complementary signals from many offsets, achieving better coverage than it would by concentrating on the strongest individual offsets.
Papers: q234-results.pdf • q1-exact-results.pdf
Programs: q2_offsets.c
Related experiments: Boolean Automaton • Neuron Knockout • Saturation Dynamics • Per-Prediction Justifications