For each prediction the model makes, we can produce a human-readable justification: which neurons drove the prediction, what inputs activated them, and how the signal flowed through the network. Each justification uses only ~15 weight entries out of 16,384 — 0.1% of the weight matrix.
The method: for each neuron, flip its sign bit and measure how much the bits-per-character (BPC) of the prediction changes. The neurons with the largest ΔBPC are the "explanation" for that prediction. Then trace backward: which inputs and recurrent connections put those neurons into their current state?
Context: `<mediawiki ` (t=10)

| Neuron | Δbpc | Sign | W_y·h | Input | Top W_h Source |
|---|---|---|---|---|---|
| h52 | +0.069 | -1 | +0.482 | ' ' (space) | h8 (+1.0) |
| h90 | +0.059 | +1 | +0.458 | ' ' (space) | h8 (+0.8) |
| h99 | +0.051 | +1 | +0.603 | ' ' (space) | h68 (-0.9) |
| h17 | +0.033 | -1 | +0.497 | ' ' (space) | h76 (-0.5) |
| h70 | +0.012 | +1 | +0.508 | ' ' (space) | h17 (+0.5) |
The model has just seen the space after "mediawiki". It knows with 99.6% confidence that 'x' comes next (the start of "xmlns"). The space character is the key input: it activates h52 (via W_x=-1.7) and h90 (W_x=+0.9). Both route through h8, the most important neuron.
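
The "which neurons does this input drive" step reduces to scanning one column of the input weight matrix. A sketch, assuming a hypothetical `W_x[H][V]` layout where `W_x[i][c]` is the weight from input byte `c` into neuron `i`:

```c
#include <math.h>

#define H 128   /* hidden size (assumed) */
#define V 256   /* byte vocabulary (assumed) */

/* Return the neuron the input byte c most directly drives,
   i.e. the row with the largest |W_x[i][c]|. */
int top_input_neuron(const double W_x[H][V], int c) {
    int best = 0;
    for (int i = 1; i < H; i++)
        if (fabs(W_x[i][c]) > fabs(W_x[best][c])) best = i;
    return best;
}
```

With weights like those quoted above (W_x = -1.7 into h52 for the space character), this scan would pick out h52.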
Context: `...mediawiki.org/`

| Neuron | Δbpc | Sign | Input | Top W_h Source | Chain |
|---|---|---|---|---|---|
| h61 | +0.033 | +1 | '/' | h53 (-0.6) | h61 ← h53 ← h26 |
| h56 | +0.016 | -1 | '/' | h56 (-0.7) | h56 ← h56 ← h50 (self-loop) |
| h53 | +0.015 | +1 | '/' | h8 (-0.9) | h53 ← h8 ← h8 |
| h124 | +0.009 | +1 | '/' | h117 (+0.5) | h124 ← h117 ← h20 |
| h52 | +0.007 | -1 | '/' | h8 (+1.3) | h52 ← h8 ← h8 |
After the '/' in "mediawiki.org/", the model predicts 'x' (start of "xml/export"). The '/' character is the trigger. h61 and h53 route through h8 (the hub neuron with its self-loop). This prediction involves a different set of neurons than t=10, but h8 appears in the backward chains again.
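
The backward chains in the table can be reproduced mechanically: from each neuron, follow its largest-magnitude recurrent weight back to the source neuron, and repeat. A minimal sketch, assuming a 128×128 recurrent matrix where `W_h[i][j]` is the weight from source `j` into target `i` (this layout is an assumption):

```c
#include <math.h>

#define H 128   /* hidden size (assumed) */

/* Fill chain[0..depth] with the backward attribution path from `start`:
   each hop follows the largest-magnitude incoming recurrent weight.
   A hop where best == cur marks a self-loop, as with h8. */
void trace_back(const double W_h[H][H], int start, int depth, int *chain) {
    chain[0] = start;
    for (int d = 0; d < depth; d++) {
        int cur = chain[d], best = 0;
        for (int j = 1; j < H; j++)
            if (fabs(W_h[cur][j]) > fabs(W_h[cur][best])) best = j;
        chain[d + 1] = best;
    }
}
```

Run with depth 2 this produces two-hop chains of the form h61 ← h53 ← h26.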
Context: `...xml/expo`

| Neuron | Δbpc | Sign | Input | Top W_h Source |
|---|---|---|---|---|
| h99 | +2.027 | +1 | 'o' | h68 (+1.0) |
| h76 | +0.533 | -1 | 'o' | h50 (-1.3) |
| h26 | +0.166 | +1 | 'o' | h61 (+1.2) |
| h73 | +0.161 | -1 | 'o' | h112 (-1.1) |
| h113 | +0.096 | -1 | 'o' | h106 (-0.5) |
After "expo", the model needs to predict 'r' (for "export"). h99 is the key: it promotes 'r' and 'a' (see neuron knockout data) and is receiving strong input from h68. The second-strongest competitor is 'd' (for "expod"?), showing the model is less certain here than at positions with URL-like patterns.
Across all predictions, certain neurons appear repeatedly in the backward attribution chains. This small set of neurons forms the "routing backbone" through which most information flows.
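
One simple way to surface that backbone is to count how often each neuron appears across the per-prediction chains; a sketch under the same assumed 128-neuron layout, taking the chains as a flattened array:

```c
#include <string.h>

#define H 128   /* hidden size (assumed) */

/* Count neuron appearances across n_chains backward chains, each of
   chain_len entries, stored flattened in `chains`. Neurons that recur
   across many predictions (such as h8 above) form the routing backbone. */
void backbone_counts(const int *chains, int n_chains, int chain_len,
                     int counts[H]) {
    memset(counts, 0, H * sizeof(int));
    for (int c = 0; c < n_chains; c++)
        for (int k = 0; k < chain_len; k++)
            counts[chains[c * chain_len + k]]++;
}
```

Sorting neurons by count then ranks the hubs that most information flows through.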
Papers: q6-justifications.pdf • total-interp.pdf • q1-sparsity.pdf
Programs: q6_justify.c
Related experiments: Boolean Automaton • Neuron Knockout • Saturation Dynamics • Offset Analysis