1. What This Shows
The forward pass of the RNN is a chain of quotient computations. At each step, an event (E) is counted (N),
and the ratio of total to count gives the quotient (Q). The log-quotient is the surprisal.
The BPC is the average log-quotient.
This page traces the complete E → N → Q chain through all four stages of the prediction pipeline:
input encoding, hidden-layer partition functions, recurrent propagation, and output readout. Each stage
computes a quotient, and the product of all quotients is the model's prediction.
"Each term is an E → N → Q step: 1. Identify the event (E). 2. Count it or accumulate support (N).
3. Compute the quotient (Q = N/count)."
— e-onto-n.tex
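The chain above has a useful arithmetic consequence: since surprisal is a log-quotient, a product of stage quotients turns into a sum of bits. A minimal check (the quotient values here are illustrative, not taken from the model):

```python
import math

# Illustrative per-stage quotients (hypothetical values)
stage_quotients = [26.26, 1.04, 5.13]

# log2 of the product equals the sum of the per-stage log-quotients
product_bits = math.log2(math.prod(stage_quotients))
sum_bits = sum(math.log2(q) for q in stage_quotients)
assert abs(product_bits - sum_bits) < 1e-9
```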
2. The Input Quotient
For the 1024-byte dataset, each input character has a frequency count. The quotient Q = 1024/c(x) measures
how surprising that character is: rare characters have high quotients, common characters have low quotients.
The log-quotient log2(Q) is the information content in bits.
Character Frequency and Quotient (Top 20 by Count)
Left axis (bars): raw count c(x) in the 1024-byte dataset.
Right axis (line): quotient Q = 1024/c(x). Higher Q = rarer character = more information per occurrence.
Space is the most common event. With 127 occurrences, space carries only 3.01 bits per occurrence.
Characters like 'w' (23 occurrences) carry 5.48 bits — nearly twice as much information. The input quotient
is the first link in the chain: it determines how much surprise each input byte contributes.
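The input quotient is simple enough to compute directly. A minimal sketch, assuming only the two character counts quoted above from the 1024-byte dataset:

```python
import math

TOTAL = 1024  # dataset size in bytes

def input_bits(count):
    # Q = TOTAL / c(x); surprisal is the log-quotient in bits
    q = TOTAL / count
    return math.log2(q)

# Counts quoted in the text: space occurs 127 times, 'w' 23 times
assert round(input_bits(127), 2) == 3.01
assert round(input_bits(23), 2) == 5.48
```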
3. The Hidden-Layer Quotient
Each of the 128 neurons computes a partition function. The Boltzmann probability of neuron j being positive is:
P(h_j^+) = 1 / (1 + 2^(−D_j))
where D_j is the accumulator difference (the pre-activation). When |D_j| is large, the neuron is saturated: Q ≈ 1.0 (no surprise). When D_j is near zero, the neuron is at threshold: Q can be large (high information).
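The per-neuron quotient follows directly from this formula. A sketch in base 2, where (as in the dashboard below) Q is taken as 1/P(h_j^+):

```python
import math

def p_positive(d):
    # Boltzmann probability of the + state, base-2 partition function:
    # P(h_j^+) = 1 / (1 + 2^(-D_j))
    return 1.0 / (1.0 + 2.0 ** (-d))

def quotient(d):
    # Quotient for the + state: Q = 1 / P(h_j^+)
    return 1.0 / p_positive(d)

# Deeply saturated positive neuron (h56-like, D = 49.2): Q ~ 1.0
assert abs(quotient(49.2) - 1.0) < 1e-9
# At threshold (D = 0): P = 0.5, so Q = 2
assert abs(quotient(0.0) - 2.0) < 1e-9
```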
Neuron Dashboard at Position t=42
Context: "mediawiki.org/", input = '/'

Neuron   z        bias    Wx['/']   P(h^+)   Q
h56      49.2     −3.9    −1.7      ≈1.0     ≈1.0
h90      126.0    −4.8    −2.4      ≈1.0     ≈1.0
h20      −71.5    −5.4    +3.0      ≈0.0     ≈∞
h43      77.4     −0.7    −4.1      ≈1.0     ≈1.0
h60      46.8     −0.7    +8.7      ≈1.0     ≈1.0
P(h_j^+) vs Pre-activation z — Neurons at t=42 with Sigmoid Overlay
The sigmoid curve shows P(h_j^+) = 1/(1 + 2^(−z)). All five neurons are deeply saturated (|z| ≫ 0), so their quotients are all near 1.0. The information lives in the rare near-threshold neurons.
Cumulative Hidden Quotient
- 984: distinct binary states
- 1.04: cumulative hidden quotient (Q_H = 1024/984)
- 10.0: state entropy in bits (log2(1023))
Almost all neurons are saturated. At t=42, 123 of 128 neurons have |z| > 10, giving Q ≈ 1.0.
Only 5 neurons are near threshold. The hidden-layer quotient is nearly 1 — meaning the binary state
is almost fully determined by the input and recurrent context. The information is concentrated in those
few near-threshold neurons.
4. The Recurrent Quotient (Wh Propagation)
Information propagates through Wh: each entry Wh[i][j] contributes a quotient factor
to neuron i from the previous state of neuron j. The key insight is that each weight acts as a "force"
pushing the neuron toward + or −, and the accumulated sum determines the pre-activation z.
Force Diagram for h56 at t=42: top Wh contributions summing to z = 49.2.
Recurrent Chain: Information Flow Through Time
How Input at t−d Reaches h56 at t=42 Through d Steps of Wh
Each bar shows the effective influence strength from input at time t−d on neuron h56 at time t=42.
The influence decays with distance but does not vanish — the saturated RNN carries information
through sign-preserving chains in Wh.
Wh is a sign-preserving channel. The forces from h7 (−11.0) and h42 (−8.7)
push h56 toward negative, but the net sum of all 128 sources plus bias plus Wx is +49.2 —
overwhelmingly positive. This is the recurrent quotient in action: each Wh entry contributes a factor,
and the product determines the neuron's state.
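The accumulation described here is a plain weighted sum. A minimal sketch, assuming signed states h_prev ∈ {−1, +1} and hypothetical names for the weight slices (the actual layout of Wh, Wx, and the bias is not specified on this page):

```python
def pre_activation(wh_row, wx_entry, bias, h_prev):
    # z_i = bias_i + Wx[i][x] + sum_j Wh[i][j] * h_prev[j]
    # Each Wh entry acts as a signed "force" pushing the neuron toward + or -.
    return bias + wx_entry + sum(w * h for w, h in zip(wh_row, h_prev))

# Toy example: two strong negative forces outweighed by a positive remainder
z = pre_activation([-11.0, -8.7, 30.0], wx_entry=-1.7, bias=-3.9, h_prev=[+1, +1, +1])
assert abs(z - 4.7) < 1e-9
```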
5. The Output Quotient
The final step: Wy maps the 128-dimensional hidden state to 256 output logits. The softmax normalizes these into a probability distribution. The partition function is Z = Σ_o 2^(A(o)) over all output bytes o, and the quotient for the true character is Q = Z / 2^(A(o_true)) = 1/P(o_true).
Position t=50: Predicting After "expo"
True next character: 'r' (for "export")
P('r') = 0.195 (the true character's probability)
Output Distribution at t=50 (Top 20 Characters)
The true character 'r' (green) is the second-highest prediction. The model assigns P('r')=0.195, giving
a quotient of 5.13 and a per-position BPC of 2.36 bits. The leading prediction is space (0.215).
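The output quotient can be sketched in the same base-2 form as the partition function above (the logit values here are illustrative, not the model's):

```python
import math

def output_quotient(logits, true_idx):
    # Z = sum_o 2^A(o); Q = Z / 2^A(o_true) = 1 / P(o_true)
    z = sum(2.0 ** a for a in logits)
    p_true = 2.0 ** logits[true_idx] / z
    return 1.0 / p_true

# Uniform logits over 4 outputs: P = 1/4, so Q = 4 and surprisal = 2 bits
q = output_quotient([0.0, 0.0, 0.0, 0.0], true_idx=0)
assert abs(q - 4.0) < 1e-9
assert abs(math.log2(q) - 2.0) < 1e-9
```

In practice one would subtract max(logits) before exponentiating to avoid overflow; the quotient is unchanged because the shift cancels in the ratio.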
6. Worked Example — Complete Chain at t=42
Here is the complete E → N → Q chain for a single prediction. Position t=42: context
"mediawiki.org/", predicting the next byte.
Stage 1: input encoding. The input character '/' is observed. How surprising is it?
Quotient (Q): 1024/39 = 26.26
log2(Q) = 4.71 bits of input information
↓
Stage 2: hidden-layer partition functions. 128 neurons compute partition functions. How many distinct states exist?
Event (E): 128 neuron signs
Count (N): 123 saturated, 5 near-threshold
984 distinct binary states in 1024 positions → cumulative Q ≈ 1.04
↓
Stage 3: recurrent propagation. Information from 30+ past positions propagates through Wh.
Event (E): past context pattern
Each offset contributes quotient factors through sign-preserving Wh chains
↓
Stage 4: output readout. Wy maps the hidden state to output probabilities.
Event (E): true character 'x' (P = 0.0001)
Top predictions: 't': 0.165, '[': 0.086
Quotient (Q): 1/0.0001 ≈ 10,000
log2(Q) = 13.32 bits — BPC at this position: 13.32
This is a high-surprise position. The model assigns only P=0.0001 to the true character 'x',
giving a quotient of 10,000 and 13.32 bits of surprise. This is because after "mediawiki.org/", the model
expects 't' (perhaps for "talk") or '[' — not the 'x' that begins "xml/export". The quotient chain
reveals exactly where the surprise comes from: the output layer's distribution is too diffuse to capture
this specific continuation.
7. The Quotient Decomposition of BPC
The total BPC is the average log-quotient across all positions. But individual positions vary enormously:
some are near-certain (low Q, low bits), others are surprises (high Q, many bits). This decomposition
shows where the model's uncertainty lives.
Per-Position BPC (First 100 Positions)
Each bar is the log-quotient (bits) at one position. The first position (t=0) is maximally surprised
at 24.41 bits (no context). Most positions are 2–6 bits. Spikes correspond to unexpected characters.
The red dashed line shows the overall average: 5.72 bpc.
- 5.72: average BPC (overall)
- 24.41: max BPC (t=0, no context)
- 1.13: min BPC shown (t=1)
- 13.88: BPC at t=10 (surprise spike)
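The decomposition reduces to one line: BPC is the mean log-quotient. A sketch, assuming `p_true` holds the model's probability for the realized byte at each position:

```python
import math

def bpc(p_true):
    # Average log-quotient in bits per position; each term is log2(1/p) = log2(Q)
    return sum(math.log2(1.0 / p) for p in p_true) / len(p_true)

# Toy check: surprisals of 1 bit and 2 bits average to 1.5 bpc
assert abs(bpc([0.5, 0.25]) - 1.5) < 1e-9
```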
"The gap between bpc and zero is the residual luck: positions where the model's macrostate still leaves
multiple output events equally consistent. Better patterns → sharper quotients → less residual luck → lower bpc."
— e-onto-n.tex