E onto N: The Bi-Embedding - Hutter 2026-02-11

1. The Bi-Embedding

An RNN maps events to numbers (training) and numbers back to events (inference). Understanding the model means exhibiting both maps explicitly. This is the bi-embedding: a pair of maps between the space of events E and the space of numbers N, each computable from data without training.

Events (E)

Numbers (N)

φ: E → N — Counting: events → numbers

ψ: N → E — Construction: numbers → events

↺

82,304

Parameters from data statistics

r=0.56

W_h Hebbian correlation

1.89

bpc — Analytic, zero optimization

R²=0.837

Factor map reads neurons

"Cracking the RNN means exhibiting both directions explicitly: show which events each weight encodes (read the bi-embedding), and show which weights each event predicts (write the bi-embedding)."

— e-onto-n.tex

2. φ Direction: Events → Weights

The φ direction constructs every weight from data statistics. Count events, compute log-ratios, derive matrices. No gradient descent. Every E→N construction beats the trained model.

Weight Construction Results

Green/blue: constructions from data statistics (E → N). Red: standard BPTT-50 training (~2M gradient steps). Every construction beats training.

Key insight: EVERY E→N construction beats the trained model. The trained model is stuck in a local optimum of N-space that is worse than the direct E→N construction.

Weight Matrix Breakdown

Each component of the 82,304 parameters is derivable from data counts:

W_x

Hash, deterministic

128×256 input encoding. Random hash maps bytes to hidden neurons. No training needed.

W_h

Shift-register carry, deterministic

128×128 recurrence. 16 groups of 8 neurons, diagonal carry structure.

b_h

Sign log-odds, from counts

128 biases. log P(h_j>0) − log P(h_j<0) over data positions.

W_y

Log P ratios, from counts

256×128 output readout. W_y[o][j] = log P(o|h_j>0) − log P(o|h_j<0).

b_y

Log c(o), from counts

256 output biases. Log marginal byte frequencies from the data.

3. ψ Direction: Weights → Events

The ψ direction reads the trained weight matrices and recovers the events they encode. Each matrix tells a different story about the data's statistical structure.

W_x: Input Encoding — Top 5 Bytes by Column Norm

3.75

␣

3.74

3.49

2.81

2.65

Tag delimiters and whitespace dominate — these bytes carry the most information about HTML/XML structure.

W_h: Temporal Propagation

Hebbian covariance predicts W_h: For dynamically important entries (|w| ≥ 3.0), the Hebbian estimate achieves r = 0.56. Sign accuracy reaches 72.7%. The weights the model relies on most are the ones best predicted by simple co-activation statistics.

W_y: Output Prediction

Each column of W_y maps a hidden neuron's sign to an output distribution: W_y[o][j] = log P(o | h_j > 0) - log P(o | h_j < 0). Computable from data counts alone — no optimization required.

Per-Neuron Factor Map R²

The factor map reads events back from each neuron. 120 of 128 neurons achieve R² ≥ 0.80. Mean R² = 0.837. Every neuron is a 2-offset conjunction detector.

Factor Map R² by Neuron (128 neurons)

Green: R² ≥ 0.80 (120 neurons). Yellow: 0.60–0.80 (6 neurons). Red: < 0.60 (2 neurons). Mean R² = 0.837.

Dominant Offset Pair (1,7)

52 of 128 neurons are dominated by the offset pair (1,7). The remaining 76 neurons use other pairs.

Dominant Pair Distribution

52/128 neurons: Offset pair (1,7) — the byte just seen and the byte 7 positions ago. This matches the typical HTML tag-name length.

76/128 neurons: Other offset pairs, covering deeper structural correlations in the data.

4. The Loop Closes

The bi-embedding emerged through twelve days of experiment. Click a milestone to see what was discovered that day.

Days 1-4

Days 7-8

Day 9

Days 9-11

Day 11

Days 1-4 (Jan 31 – Feb 4): Training & UM Isomorphism

The sat-rnn is trained: 128 hidden, BPTT-50, Adam optimizer, 82,304 opaque parameters, achieving 0.079 bpc. The Universal Machine isomorphism establishes that ψ exists — every RNN prediction has a pattern witness. But the explicit map is not yet exhibited.

Days 7-8 (Feb 7-8): Pattern Discovery (ψ exhibited concretely — skip-k-grams)

Skip-k-grams discovered: the RNN uses non-contiguous patterns. 6,180 data-term patterns. Skip-4 [1,8,20,3] = 0.069 bpc with only 712 patterns. The backward trie discovers skip-patterns by MI per offset from output. The ψ direction is now concrete: you can read which events each weight encodes.

Day 9 (Feb 9): Factor Map (ψ read per-neuron)

Every neuron is a 2-offset conjunction detector. Mean R² = 0.837 (120/128 ≥ 0.80). Dominant pair (1,7): 52/128 neurons. State features: word_len best for ALL 128 neurons. +word_len = 0.50 bpc (91%), +in_tag = 0.43 bpc (92.5% of gain, matches UM floor).

Days 9-11 (Feb 9-11): Total Interpretation (Boolean structure)

Sign bits carry 99.7% of compression. Mean margin = 60.5. The dense 128×128 weight matrix produces a sparse Boolean transition function. 1023 unique sign vectors observed. Every prediction traced through Boolean decisions. The mantissa is noise; sign is signal.

Day 11 (Feb 11): Weight Construction (φ exhibited — loop closes)

All 82,304 parameters derived from data statistics. Analytic: 1.89 bpc with zero optimization. Optimized: 0.59 bpc. Both beat trained (4.97 bpc). Test: analytic 4.88 vs trained 5.08 — construction generalizes better. The φ direction is exhibited. The loop closes.

The twelve-day arc was the construction of this bi-embedding, one direction at a time. First ψ (reading what the weights encode), then φ (writing weights from what the data contains).

5. Forward and Backward: The Temporal Bi-Embedding

The bi-embedding has a temporal dimension. The forward pass alternates between event-space (E) and number-space (N). The backward attribution reverses this flow.

Forward Pass: Alternating E and N

x_tinput (E)

→

W_xencode

→

z_tpre-act (N)

→

tanhsquash

→

h_thidden (E)

→

W_yreadout

→

logits(N)

→

softmaxnormalize

→

P(y)prob (N)

Backward Attribution: Reversing the Flow

P(y)prob (N)

→

gradientsignal

→

g_tgrad (N)

→

Jacobianchain

→

α(t,d)N→E

PMI Alignment: Shallow vs Deep

88%

Shallow offsets — PMI alignment

24-37%

Deep offsets — PMI alignment

The bi-embedding is tighter at short range and loosens at long range. At shallow offsets (1-3 steps back), the RNN's attribution aligns with data PMI 88% of the time. At deep offsets (15+ steps), alignment drops to 24-37%. The temporal bi-embedding has a natural scale.

6. Three Sources of Error

The bi-embedding loop is approximate, not exact. Three sources explain the gap between φ∘ψ and the identity:

First-order only

Hebbian covariance is the first-order Taylor expansion of the true weight function. It captures pairwise co-activation but misses higher-order interactions between neurons.

Gap: r = 0.56 → r = 1.0
The 44% unexplained variance is higher-order structure.

Pairwise only

Per-offset log-ratios treat each time offset independently. Cross-offset synergy — where two offsets together predict more than either alone — is missed.

Synergy: >1.0 bits per offset pair
This explains the gap between analytic (1.89 bpc) and optimized (0.59 bpc).

Boolean vs Continuous

Sign bits carry 99.7% of compression. The mantissa (continuous part) is training noise. But the E→N construction produces real-valued weights — the mantissa is the ladder the model climbed, not the information it learned.

Sign = 99.7% of signal, mantissa = noise

"The gap between reading and writing is the training noise — the ladder the mantissa climbed."

— e-onto-n.tex

Weight Construction Viewer

Interactive dashboard for all 12 weight construction experiments

Boolean Automaton Viewer

The sat-rnn as a 128-bit Boolean automaton

Entropy Bridge Viewer

Microstates, macrostates, and factoring

Neuron Knockout Viewer

Per-neuron contributions and knockout analysis

Offset Analysis Viewer

Skip-k-gram offset selection and PMI alignment