1. What This Shows
The forward pass of the RNN is a chain of quotient computations. At each step, an event (E) is counted (N),
and the ratio of total to count gives the quotient (Q). The log-quotient is the surprisal.
The BPC is the average log-quotient.
This page traces the complete E → N → Q chain through all four stages of the prediction pipeline:
input encoding, hidden-layer partition functions, recurrent propagation, and output readout. Each stage
computes a quotient, and the product of all quotients is the model's prediction.
"Each term is an E → N → Q step: 1. Identify the event (E). 2. Count it or accumulate support (N).
3. Compute the quotient (Q = N/count)."
— e-onto-n.tex
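The chain above has a useful arithmetic consequence: since surprisal is a log-quotient, a product of stage quotients turns into a sum of bits. A minimal check (the quotient values here are illustrative, not taken from the model):

```python
import math

# Illustrative per-stage quotients (hypothetical values)
stage_quotients = [26.26, 1.04, 5.13]

# log2 of the product equals the sum of the per-stage log-quotients
product_bits = math.log2(math.prod(stage_quotients))
sum_bits = sum(math.log2(q) for q in stage_quotients)
assert abs(product_bits - sum_bits) < 1e-9
```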
2. The Input Quotient
For the 1024-byte dataset, each input character has a frequency count. The quotient Q = 1024/c(x) measures
how surprising that character is: rare characters have high quotients, common characters have low quotients.
The log-quotient log2(Q) is the information content in bits.
Character Frequency and Quotient (Top 20 by Count)
Left axis (bars): raw count c(x) in the 1024-byte dataset.
Right axis (line): quotient Q = 1024/c(x). Higher Q = rarer character = more information per occurrence.
Space is the most common event. With 127 occurrences, space carries only 3.01 bits per occurrence.
Characters like 'w' (23 occurrences) carry 5.48 bits — nearly twice as much information. The input quotient
is the first link in the chain: it determines how much surprise each input byte contributes.
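The input quotient is simple enough to compute directly. A minimal sketch, assuming only the two character counts quoted above from the 1024-byte dataset:

```python
import math

TOTAL = 1024  # dataset size in bytes

def input_bits(count):
    # Q = TOTAL / c(x); surprisal is the log-quotient in bits
    q = TOTAL / count
    return math.log2(q)

# Counts quoted in the text: space occurs 127 times, 'w' 23 times
assert round(input_bits(127), 2) == 3.01
assert round(input_bits(23), 2) == 5.48
```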
3. The Hidden-Layer Quotient
Each of the 128 neurons computes a partition function. The Boltzmann probability of neuron j being positive is:
P(h_j^+) = 1 / (1 + 2^(−D_j))
where D_j is the accumulator difference (the pre-activation). When |D_j| is large, the neuron is saturated: Q ≈ 1.0 (no surprise). When D_j is near zero, the neuron is at threshold: Q can be large (high information).
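The per-neuron quotient follows directly from this formula. A sketch in base 2, where (as in the dashboard below) Q is taken as 1/P(h_j^+):

```python
import math

def p_positive(d):
    # Boltzmann probability of the + state, base-2 partition function:
    # P(h_j^+) = 1 / (1 + 2^(-D_j))
    return 1.0 / (1.0 + 2.0 ** (-d))

def quotient(d):
    # Quotient for the + state: Q = 1 / P(h_j^+)
    return 1.0 / p_positive(d)

# Deeply saturated positive neuron (h56-like, D = 49.2): Q ~ 1.0
assert abs(quotient(49.2) - 1.0) < 1e-9
# At threshold (D = 0): P = 0.5, so Q = 2
assert abs(quotient(0.0) - 2.0) < 1e-9
```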
Neuron Dashboard at Position t=42
Context: "mediawiki.org/", input = '/'

Neuron   z        bias    Wx['/']   P(h^+)   Q
h56      49.2     −3.9    −1.7      ≈1.0     ≈1.0
h90      126.0    −4.8    −2.4      ≈1.0     ≈1.0
h20      −71.5    −5.4    +3.0      ≈0.0     ≈∞
h43      77.4     −0.7    −4.1      ≈1.0     ≈1.0
h60      46.8     −0.7    +8.7      ≈1.0     ≈1.0
P(h_j^+) vs Pre-activation z — Neurons at t=42 with Sigmoid Overlay
The sigmoid curve shows P(h_j^+) = 1/(1 + 2^(−z)). All five neurons are deeply saturated (|z| ≫ 0), so their quotients are all near 1.0. The information lives in the rare near-threshold neurons.
Cumulative Hidden Quotient
- 984: distinct binary states
- 1.04: cumulative hidden quotient (Q_H = 1024/984)
- 10.0: state entropy in bits (log2(1023))
Almost all neurons are saturated. At t=42, 123 of 128 neurons have |z| > 10, giving Q ≈ 1.0.
Only 5 neurons are near threshold. The hidden-layer quotient is nearly 1 — meaning the binary state
is almost fully determined by the input and recurrent context. The information is concentrated in those
few near-threshold neurons.
4. The Recurrent Quotient (Wh Propagation)
Information propagates through Wh: each entry Wh[i][j] contributes a quotient factor
to neuron i from the previous state of neuron j. The key insight is that each weight acts as a "force"
pushing the neuron toward + or −, and the accumulated sum determines the pre-activation z.
Force Diagram for h56 at t=42: top Wh contributions summing to z = 49.2.
Recurrent Chain: Information Flow Through Time
How Input at t−d Reaches h56 at t=42 Through d Steps of Wh
Each bar shows the effective influence strength from input at time t−d on neuron h56 at time t=42.
The influence decays with distance but does not vanish — the saturated RNN carries information
through sign-preserving chains in Wh.
Wh is a sign-preserving channel. The forces from h7 (−11.0) and h42 (−8.7)
push h56 toward negative, but the net sum of all 128 sources plus bias plus Wx is +49.2 —
overwhelmingly positive. This is the recurrent quotient in action: each Wh entry contributes a factor,
and the product determines the neuron's state.
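The accumulation described here is a plain weighted sum. A minimal sketch, assuming signed states h_prev ∈ {−1, +1} and hypothetical names for the weight slices (the actual layout of Wh, Wx, and the bias is not specified on this page):

```python
def pre_activation(wh_row, wx_entry, bias, h_prev):
    # z_i = bias_i + Wx[i][x] + sum_j Wh[i][j] * h_prev[j]
    # Each Wh entry acts as a signed "force" pushing the neuron toward + or -.
    return bias + wx_entry + sum(w * h for w, h in zip(wh_row, h_prev))

# Toy example: two strong negative forces outweighed by a positive remainder
z = pre_activation([-11.0, -8.7, 30.0], wx_entry=-1.7, bias=-3.9, h_prev=[+1, +1, +1])
assert abs(z - 4.7) < 1e-9
```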
5. The Output Quotient
The final step: Wy maps the 128-dimensional hidden state to 256 output logits. The softmax normalizes these into a probability distribution. The partition function is Z = Σ_o 2^(A(o)) over all output bytes o, and the quotient for the true character is Q = Z / 2^(A(o_true)) = 1/P(o_true).
Position t=50: Predicting After "expo"
True next character: 'r' (for "export")
P('r') = 0.195 (the true character's probability)
Output Distribution at t=50 (Top 20 Characters)
The true character 'r' (green) is the second-highest prediction. The model assigns P('r')=0.195, giving
a quotient of 5.13 and a per-position BPC of 2.36 bits. The leading prediction is space (0.215).
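The output quotient can be sketched in the same base-2 form as the partition function above (the logit values here are illustrative, not the model's):

```python
import math

def output_quotient(logits, true_idx):
    # Z = sum_o 2^A(o); Q = Z / 2^A(o_true) = 1 / P(o_true)
    z = sum(2.0 ** a for a in logits)
    p_true = 2.0 ** logits[true_idx] / z
    return 1.0 / p_true

# Uniform logits over 4 outputs: P = 1/4, so Q = 4 and surprisal = 2 bits
q = output_quotient([0.0, 0.0, 0.0, 0.0], true_idx=0)
assert abs(q - 4.0) < 1e-9
assert abs(math.log2(q) - 2.0) < 1e-9
```

In practice one would subtract max(logits) before exponentiating to avoid overflow; the quotient is unchanged because the shift cancels in the ratio.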
6. Worked Example — Complete Chain at t=42
Here is the complete E → N → Q chain for a single prediction. Position t=42: context
"mediawiki.org/", predicting the next byte.
Stage 1: input encoding. The input character '/' is observed. How surprising is it?
Quotient (Q): 1024/39 = 26.26
log2(Q) = 4.71 bits of input information
↓
Stage 2: hidden-layer partition functions. 128 neurons compute partition functions. How many distinct states exist?
Event (E): 128 neuron signs
Count (N): 123 saturated, 5 near-threshold
984 distinct binary states in 1024 positions → cumulative Q ≈ 1.04
↓
Stage 3: recurrent propagation. Information from 30+ past positions propagates through Wh.
Event (E): past context pattern
Each offset contributes quotient factors through sign-preserving Wh chains
↓
Stage 4: output readout. Wy maps the hidden state to output probabilities.
Event (E): true character 'x' (P = 0.0001)
Top predictions: 't': 0.165, '[': 0.086
Quotient (Q): 1/0.0001 ≈ 10,000
log2(Q) = 13.32 bits — BPC at this position: 13.32
This is a high-surprise position. The model assigns only P=0.0001 to the true character 'x',
giving a quotient of 10,000 and 13.32 bits of surprise. This is because after "mediawiki.org/", the model
expects 't' (perhaps for "talk") or '[' — not the 'x' that begins "xml/export". The quotient chain
reveals exactly where the surprise comes from: the output layer's distribution is too diffuse to capture
this specific continuation.
7. The Quotient Decomposition of BPC
The total BPC is the average log-quotient across all positions. But individual positions vary enormously:
some are near-certain (low Q, low bits), others are surprises (high Q, many bits). This decomposition
shows where the model's uncertainty lives.
Per-Position BPC (First 100 Positions)
Each bar is the log-quotient (bits) at one position. The first position (t=0) is maximally surprised
at 24.41 bits (no context). Most positions are 2–6 bits. Spikes correspond to unexpected characters.
The red dashed line shows the overall average: 5.72 bpc.
- 5.72: average BPC (overall)
- 24.41: max BPC (t=0, no context)
- 1.13: min BPC shown (t=1)
- 13.88: BPC at t=10 (surprise spike)
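The decomposition reduces to one line: BPC is the mean log-quotient. A sketch, assuming `p_true` holds the model's probability for the realized byte at each position:

```python
import math

def bpc(p_true):
    # Average log-quotient in bits per position; each term is log2(1/p) = log2(Q)
    return sum(math.log2(1.0 / p) for p in p_true) / len(p_true)

# Toy check: surprisals of 1 bit and 2 bits average to 1.5 bpc
assert abs(bpc([0.5, 0.25]) - 1.5) < 1e-9
```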
"The gap between bpc and zero is the residual luck: positions where the model's macrostate still leaves
multiple output events equally consistent. Better patterns → sharper quotients → less residual luck → lower bpc."
— e-onto-n.tex