Q6: Human-Readable Justifications for Every Prediction

Experiment: q6_justify — 2026-02-11
"~15 weight entries out of 16,384 total. The explanation is ~0.1% of the full weight matrix."

What This Experiment Shows

For each prediction the model makes, we can produce a human-readable justification: which neurons drove the prediction, what inputs activated them, and how the signal flowed through the network. Each justification uses only ~15 weight entries out of 16,384 — roughly 0.1% of the 128×128 weight matrix.

The method: for each neuron, flip its sign bit and measure how much the prediction's bits-per-character (BPC) changes. The neurons with the largest effect form the "explanation" for that prediction. Then trace backward: which inputs and recurrent connections put those neurons in their current state?

"Each prediction: ~15 neurons, ~15 weight entries (0.1% of W_h). The explanation is very sparse for typical predictions but heavy-tailed."
— q6-justifications.tex, q1-sparsity.tex

Key Numbers

~15   weight entries per justification
0.1%  of W_h used per prediction
5     top neurons explain most predictions
2-3   depth of typical backward chain

Worked Example: Position t=10

Predicting after "<mediawiki "

Context: <mediawiki  → predicted 'x'
CORRECT: P('x') = 0.996 · BPC = 0.005 · Top: 'x' (0.996), '.' (0.002), 'k' (0.001)

Top Contributing Neurons

Neuron  Δbpc    Sign  W_y·h   Input        Top W_h Source
h52     +0.069  -1    +0.482  ' ' (space)  h8 (+1.0)
h90     +0.059  +1    +0.458  ' ' (space)  h8 (+0.8)
h99     +0.051  +1    +0.603  ' ' (space)  h68 (-0.9)
h17     +0.033  -1    +0.497  ' ' (space)  h76 (-0.5)
h70     +0.012  +1    +0.508  ' ' (space)  h17 (+0.5)

Detail for h52: input ' ' (W_x = -1.7) + h8 (+1.0); at t-1: input 'i', source h70 (+0.6)

The model has just seen the space after "mediawiki". It knows with 99.6% confidence that 'x' comes next (the start of "xmlns"). The space character is the key input: it activates h52 (via W_x=-1.7) and h90 (W_x=+0.9). Both route through h8, the most important neuron.

Worked Example: Position t=42

Predicting after "...mediawiki.org/"

Context: mlns="http://www.mediawiki.org/ → predicted 'x'
CORRECT: P('x') = 0.998 · BPC = 0.003 · Top: 'x' (0.998), 'a' (0.001), 'w' (0.000)

Top Contributing Neurons

Neuron  Δbpc    Sign  Input  Top W_h Source  Chain
h61     +0.033  +1    '/'    h53 (-0.6)      h61 ← h53 ← h26
h56     +0.016  -1    '/'    h56 (-0.7)      h56 ← h56 ← h50 (self-loop)
h53     +0.015  +1    '/'    h8 (-0.9)       h53 ← h8 ← h8
h124    +0.009  +1    '/'    h117 (+0.5)     h124 ← h117 ← h20
h52     +0.007  -1    '/'    h8 (+1.3)       h52 ← h8 ← h8

After the '/' in "mediawiki.org/", the model predicts 'x' (start of "xml/export"). The '/' character is the trigger. h61 and h53 route through h8 (the hub neuron with its self-loop). This prediction involves a different set of neurons than t=10, but h8 appears in the backward chains again.
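The "Chain" column can be reproduced by greedily following the strongest recurrent source at each hop. A minimal sketch, with hypothetical names and one fixed previous state for simplicity (the real trace in q6_justify.c steps back through the actual states at t-1, t-2, ...):

```c
#include <math.h>

#define N_H 4   /* toy size; the experiment uses 128 hidden units */

/* Starting from `neuron`, follow the source j maximizing
 * |W_h[i][j] * h_prev[j]| for `depth` hops, recording each link.
 * This reproduces chains like h61 <- h53 <- h26. Self-loops
 * (h56 <- h56) appear when a neuron's strongest source is itself. */
static void trace_chain(const double W_h[N_H][N_H], const int h_prev[N_H],
                        int neuron, int depth, int chain[])
{
    chain[0] = neuron;
    for (int d = 1; d <= depth; d++) {
        int best = 0;
        double best_mag = -1.0;
        for (int j = 0; j < N_H; j++) {
            double mag = fabs(W_h[chain[d - 1]][j] * h_prev[j]);
            if (mag > best_mag) { best_mag = mag; best = j; }
        }
        chain[d] = best;   /* strongest source becomes the next link */
    }
}
```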

Worked Example: Position t=50 (Harder Prediction)

Predicting after "...xml/expo"

Context: tp://www.mediawiki.org/xml/expo → predicted 'r'
CORRECT: P('r') = 0.980 · BPC = 0.030 · Top: 'r' (0.980), 'd' (0.007), 'c' (0.006), '/' (0.005)

Top Contributing Neurons

Neuron  Δbpc    Sign  Input  Top W_h Source
h99     +2.027  +1    'o'    h68 (+1.0)
h76     +0.533  -1    'o'    h50 (-1.3)
h26     +0.166  +1    'o'    h61 (+1.2)
h73     +0.161  -1    'o'    h112 (-1.1)
h113    +0.096  -1    'o'    h106 (-0.5)

Note the scale difference. Flipping h99 costs +2.027 bpc — far more than any neuron at t=10 or t=42. This prediction is more "fragile": fewer neurons carry the signal, and h99 alone is critical. If h99 were wrong, the model would predict the wrong character with near certainty.

After "expo", the model needs to predict 'r' (completing "export"). h99 is the key: it promotes 'r' and 'a' (see the neuron knockout data) and receives strong input from h68. The second-strongest candidate is 'd', showing the model is less certain here than at positions with URL-like patterns.

The Routing Backbone

Across all predictions, certain neurons appear repeatedly in the backward attribution chains. This small set of neurons forms the "routing backbone" through which most information flows.

Neurons Appearing in Multiple Justifications

"The RNN has a small 'routing backbone' through which most information flows. h54 is the bottleneck — smallest margin, most volatile, most important."
— q6-justifications.tex
h8 is the infrastructure; h52, h90, h99 are the signal carriers. h8 appears as a source in chains across all three worked examples. Its self-connection (W_h[8,8] = -1.25) makes it an oscillator that sustains information across time steps. The other top neurons (h52, h90, h99) carry the prediction-specific signal, routed through h8's persistent dynamics.

Why This Matters

Total interpretability is achievable. For every prediction the model makes, we can produce a human-readable explanation citing specific neurons, weight entries, and input characters. The explanation uses only 0.1% of the weight matrix — the model's behavior is extremely sparse.
The explanations are verifiable. Each justification is a chain of concrete operations (neuron activations, weight multiplications, sign flips) that can be checked arithmetically. This is not post-hoc rationalization — it is the actual computation.
Sparsity enables trust. A model that uses 15 out of 16,384 weight entries per prediction is one where we can actually verify the reasoning. This is a concrete path toward trustworthy AI: make the computation sparse enough that humans can follow it.
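The verification step itself is mechanical: recompute the logits from only the cited large entries and check that the same character still wins. A sketch under the same toy assumptions as above (hypothetical names, tiny dimensions):

```c
#include <math.h>

#define N_H   4   /* toy size; the experiment uses 128 hidden units */
#define N_OUT 2

/* Argmax class using only contributions with |W_y[k][j] * h[j]| >= thresh.
 * thresh = 0 reproduces the full model; a positive threshold keeps only
 * the handful of entries a justification cites. If both agree, the
 * sparse explanation accounts for the prediction. */
static int sparse_argmax(const double W_y[N_OUT][N_H], const int h[N_H],
                         double thresh)
{
    int best = 0;
    double best_z = -1e300;
    for (int k = 0; k < N_OUT; k++) {
        double z = 0.0;
        for (int j = 0; j < N_H; j++) {
            double c = W_y[k][j] * h[j];
            if (fabs(c) >= thresh)
                z += c;               /* keep only the large entries */
        }
        if (z > best_z) { best_z = z; best = k; }
    }
    return best;
}
```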

Source & Related

Papers: q6-justifications.pdf · total-interp.pdf · q1-sparsity.pdf

Programs: q6_justify.c

Related experiments: Boolean Automaton · Neuron Knockout · Saturation Dynamics · Offset Analysis