The Quotient Chain: E → N → Q at Every Step

Tracing the forward pass as a chain of partition functions — 2026-02-11
"Every operation in the UM forward pass is a quotient: a ratio of counts."

1. What This Shows

The forward pass of the RNN is a chain of quotient computations. At each step an event (E) occurs, its occurrences are counted (N), and the ratio of the total to the count gives the quotient (Q). The log-quotient is the surprisal, and the BPC is the average log-quotient over all positions.

This page traces the complete E → N → Q chain through all four stages of the prediction pipeline: input encoding, hidden-layer partition functions, recurrent propagation, and output readout. Each stage computes a quotient, and the product of all quotients is the model's prediction.

"Each term is an E → N → Q step: 1. Identify the event (E). 2. Count it or accumulate support (N). 3. Compute the quotient (Q = N/count)."
— e-onto-n.tex

2. The Input Quotient

For the 1024-byte dataset, each input character has a frequency count. The quotient Q = 1024/c(x) measures how surprising that character is: rare characters have high quotients, common characters have low quotients. The log-quotient log2(Q) is the information content in bits.

Character Frequency and Quotient (Top 20 by Count)
Left axis (bars): raw count c(x) in the 1024-byte dataset. Right axis (line): quotient Q = 1024/c(x). Higher Q = rarer character = more information per occurrence.
Space is the most common event. With 127 occurrences, space carries only 3.01 bits per occurrence. Characters like 'w' (23 occurrences) carry 5.48 bits — nearly twice as much information. The input quotient is the first link in the chain: it determines how much surprise each input byte contributes.
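The input quotient can be checked directly from the counts quoted above; a minimal sketch in Python, assuming the 1024-byte total and the per-character counts stated in the text:

```python
from math import log2

N = 1024                            # total bytes in the dataset
counts = {" ": 127, "w": 23}        # counts quoted above

for ch, c in counts.items():
    Q = N / c                       # quotient: total over count
    print(f"{ch!r}: Q = {Q:.2f}, {log2(Q):.2f} bits")
```

This reproduces the 3.01 bits per occurrence for space and 5.48 bits for 'w'.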

3. The Hidden-Layer Quotient

Each of the 128 neurons computes a partition function. The Boltzmann probability of neuron j being positive is:

P(h_j+) = 1 / (1 + 2^(-D_j))

where D_j is the accumulator difference (the pre-activation). When |D_j| is large, the neuron is saturated: Q ≈ 1.0 (no surprise). When D_j is near zero, the neuron is at threshold: Q can be large (high information).
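The per-neuron quotient follows directly from the base-2 Boltzmann form; a minimal sketch (function names are illustrative, not from the source):

```python
def p_positive(z: float) -> float:
    # Base-2 sigmoid: P(h_j+) = 1 / (1 + 2^(-z))
    return 1.0 / (1.0 + 2.0 ** (-z))

def quotient_positive(z: float) -> float:
    # Quotient for a neuron whose sign lands positive: Q = 1 / P(h_j+)
    return 1.0 / p_positive(z)

print(quotient_positive(49.2))  # deeply saturated: Q ≈ 1.0 (no surprise)
print(quotient_positive(0.0))   # at threshold: Q = 2.0 (one full bit)
```

At threshold the two signs are equally likely, so each realized sign costs exactly one bit.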

Neuron Dashboard at Position t=42

Context: "mediawiki.org/", input = '/'

neuron    z        bias    Wx['/']   P(h+)   Q
h56       49.2     −3.9    −1.7      ≈1.0    ≈1.0
h90       126.0    −4.8    −2.4      ≈1.0    ≈1.0
h20       −71.5    −5.4    +3.0      ≈0.0    ≈∞
h43       77.4     −0.7    −4.1      ≈1.0    ≈1.0
h60       46.8     −0.7    +8.7      ≈1.0    ≈1.0
P(h_j+) vs Pre-activation z — Neurons at t=42 with Sigmoid Overlay
The sigmoid curve shows P(h_j+) = 1/(1 + 2^(-z)). All five neurons are deeply saturated (|z| >> 0), so their quotients are all near 1.0. The information lives in the rare near-threshold neurons.

Cumulative Hidden Quotient

distinct binary states:      984
total positions:             1024
cumulative hidden quotient:  Q_H = 1024/984 ≈ 1.04
state entropy:               log2(1023) ≈ 10.0 bits
Almost all neurons are saturated. At t=42, 123 of 128 neurons have |z| > 10, giving Q ≈ 1.0. Only 5 neurons are near threshold. The hidden-layer quotient is nearly 1 — meaning the binary state is almost fully determined by the input and recurrent context. The information is concentrated in those few near-threshold neurons.
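The cumulative hidden quotient is simply total positions over distinct sign patterns; a toy sketch (the tuple-of-signs state encoding is an assumption for illustration):

```python
def hidden_quotient(states) -> float:
    # Q_H = total positions / distinct binary hidden states
    return len(states) / len(set(states))

# Toy run: 4 positions, 3 distinct sign patterns -> Q_H = 4/3
states = [(1, -1), (1, -1), (-1, 1), (1, 1)]
print(hidden_quotient(states))

# The figures above: 1024 positions, 984 distinct states
print(1024 / 984)  # ≈ 1.04
```

A quotient near 1 means almost every position lands in its own hidden state: the state is nearly a deterministic function of the context.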

4. The Recurrent Quotient (Wh Propagation)

Information propagates through Wh: each entry Wh[i][j] contributes a quotient factor to neuron i from the previous state of neuron j. The key insight is that each weight acts as a "force" pushing the neuron toward + or −, and the accumulated sum determines the pre-activation z.

Force Diagram for h56 at t=42

Top Wh contributions summing to z=49.2:

source    contribution
h7        −11.0
h42       −8.7
h18       +8.3
Wx['/']   −1.7
bias      −3.9
other     +66.2

Σ = 49.2 → P(h56+) ≈ 1.0

Recurrent Chain: Information Flow Through Time

How Input at t−d Reaches h56 at t=42 Through d Steps of Wh
Each bar shows the effective influence strength from input at time t−d on neuron h56 at time t=42. The influence decays with distance but does not vanish — the saturated RNN carries information through sign-preserving chains in Wh.
Wh is a sign-preserving channel. The forces from h7 (−11.0) and h42 (−8.7) push h56 toward negative, but the net sum of all 128 sources plus bias plus Wx is +49.2 — overwhelmingly positive. This is the recurrent quotient in action: each Wh entry contributes a factor, and the product determines the neuron's state.
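The force diagram is just an annotated sum; recomputing z for h56 from the contributions quoted above:

```python
contributions = {
    "h7": -11.0,       # recurrent force via Wh
    "h42": -8.7,
    "h18": +8.3,
    "other": +66.2,    # remaining Wh sources combined
    "Wx['/']": -1.7,   # input term
    "bias": -3.9,
}
z = sum(contributions.values())
print(f"z = {z:.1f}")  # z = 49.2, so P(h56+) = 1/(1 + 2^(-49.2)) ≈ 1.0
```

The two strongest individual forces are negative, but the aggregate of the remaining sources dominates, which is why inspecting only the top entries can be misleading.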

5. The Output Quotient

The final step: Wy maps the 128-dimensional hidden state to 256 output logits. The softmax normalizes these into a probability distribution. The partition function is Z = Σ_o 2^(A(o)) over all output bytes o, and the quotient for the true character is Q = Z / 2^(A(o_true)) = 1/P(o_true).
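The output quotient in the base-2 softmax, as a sketch (the 4-symbol alphabet and logit values are toy illustrations, not from the experiment):

```python
from math import log2

def output_quotient(logits, true_idx):
    # Partition function Z = sum over outputs of 2^A(o);
    # quotient Q = Z / 2^A(o_true) = 1 / P(o_true)
    Z = sum(2.0 ** a for a in logits)
    p_true = (2.0 ** logits[true_idx]) / Z
    return 1.0 / p_true

# Toy logits A(o) over a 4-byte alphabet; true byte at index 2
Q = output_quotient([1.0, 0.0, 2.0, -1.0], true_idx=2)
print(Q, log2(Q))  # quotient and its surprisal in bits
```

Here Z = 2 + 1 + 4 + 0.5 = 7.5 and Q = 7.5/4 = 1.875, i.e. about 0.91 bits of surprise.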

Position t=50: Predicting After "expo"

True next character: 'r' (for "export")

P('r') (true character):  0.195
Q = 1/P('r'):             5.13
log2(Q):                  2.36 bits
Output Distribution at t=50 (Top 20 Characters)
The true character 'r' (green) is the second-highest prediction. The model assigns P('r')=0.195, giving a quotient of 5.13 and a per-position BPC of 2.36 bits. The leading prediction is space (0.215).

6. Worked Example — Complete Chain at t=42

Here is the complete E → N → Q chain for a single prediction. Position t=42: context "mediawiki.org/", predicting the next byte.

Step 1: Input Quotient Q_in = N / c(x)

The input character '/' is observed. How surprising is it?

Event (E):     '/' (0x2f)
Count (N):     c('/') = 39
Quotient (Q):  1024/39 = 26.26

log2(Q) = 4.71 bits of input information

Step 2: Hidden Quotient Q_H = ∏_j Q_j

128 neurons compute partition functions. How many distinct states exist?

Event (E):     128 neuron signs
Count (N):     123 saturated, 5 near-threshold
Quotient (Q):  Q_H ≈ 1.04

984 distinct binary states in 1024 positions → cumulative Q ≈ 1.04

Step 3: Recurrent Quotient Q_R via Wh chains

Information from 30+ past positions propagates through Wh.

Event (E):     past context pattern
Key offsets:   d = 1, 7, 12, 17
Sources:       '/', 'k', 'd', 'w'

Each offset contributes quotient factors through sign-preserving Wh chains

Step 4: Output Quotient Q_out = 1/P(o)

Wy maps hidden state to output probabilities.

Event (E):     true byte 'x' (P = 0.0001)
Predictions:   't': 0.165, '[': 0.086
Quotient (Q):  1/0.0001 = 10000

log2(Q) = 13.32 bits — BPC at this position: 13.32

This is a high-surprise position. The model assigns only P=0.0001 to the true character 'x', giving a quotient of 10,000 and 13.32 bits of surprise. This is because after "mediawiki.org/", the model expects 't' (perhaps for "talk") or '[' — not the 'x' that begins "xml/export". The quotient chain reveals exactly where the surprise comes from: the output layer's distribution is too diffuse to capture this specific continuation.
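The four-step trace can be replayed numerically from the quoted figures; a sketch (note that log2(1/0.0001) = 13.29, slightly below the quoted 13.32, which implies the probability was rounded to 0.0001):

```python
from math import log2

# Quotients at t=42, taken from the worked example above
chain = {
    "input  Q_in":  1024 / 39,    # N / c('/')
    "hidden Q_H":   1024 / 984,   # positions / distinct states
    "output Q_out": 1 / 0.0001,   # 1 / P('x')
}
for stage, Q in chain.items():
    print(f"{stage}: Q = {Q:.2f} ({log2(Q):.2f} bits)")
```

The per-position BPC reported above is the output-stage surprisal; the earlier quotients locate where that surprise enters the chain.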

7. The Quotient Decomposition of BPC

The total BPC is the average log-quotient across all positions. But individual positions vary enormously: some are near-certain (low Q, low bits), others are surprises (high Q, many bits). This decomposition shows where the model's uncertainty lives.
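The decomposition is an average of per-position log-quotients; a minimal sketch with toy probabilities (not data from the experiment):

```python
from math import log2

def bpc(p_true: list) -> float:
    # BPC = mean over positions of log2(Q_t) = -log2 P(true byte at t)
    return sum(-log2(p) for p in p_true) / len(p_true)

# Toy: three positions with P(true) = 1/2, 1/4, 1/8 -> (1 + 2 + 3) / 3 bits
print(bpc([0.5, 0.25, 0.125]))  # → 2.0
```

Because the average is over log-quotients, a single high-surprise position (large Q) can dominate many near-certain ones.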

Per-Position BPC (First 100 Positions)
Each bar is the log-quotient (bits) at one position. The first position (t=0) is maximally surprised at 24.41 bits (no context). Most positions are 2–6 bits. Spikes correspond to unexpected characters. The red dashed line shows the overall average: 5.72 bpc.
average BPC (overall):          5.72
max BPC (t=0, no context):      24.41
min BPC shown (t=1):            1.13
BPC at t=10 (surprise spike):   13.88
"The gap between bpc and zero is the residual luck: positions where the model's macrostate still leaves multiple output events equally consistent. Better patterns → sharper quotients → less residual luck → lower bpc."
— e-onto-n.tex

8. Sources & Related

Papers: e-onto-n.pdf · entropy-bridge.pdf

Experiment pages: