The RNN learned a bi-embedding: counting events → weights, reading weights → events
1. The Bi-Embedding
An RNN maps events to numbers (training) and numbers back to events (inference).
Understanding the model means exhibiting both maps explicitly. This is the bi-embedding:
a pair of maps between the space of events E and
the space of numbers N, each computable from data without training.
Events (E)
Numbers (N)
φ: E → N — Counting: events → numbers
ψ: N → E — Construction: numbers → events
↺
82,304
Parameters from data statistics
r=0.56
Wh Hebbian correlation
1.89
bpc — Analytic, zero optimization
R²=0.837
Factor map reads neurons
"Cracking the RNN means exhibiting both directions explicitly: show which events each weight
encodes (read the bi-embedding), and show which weights each event predicts (write the bi-embedding)."
— e-onto-n.tex
2. φ Direction: Events → Weights
The φ direction constructs every weight from data statistics.
Count events, compute log-ratios, derive matrices. No gradient descent.
Every E→N construction beats the trained model.
Weight Construction Results
Green/blue: constructions from data statistics (E → N).
Red: standard BPTT-50 training (~2M gradient steps).
Every construction beats training.
Key insight: EVERY E→N construction beats the trained model.
The trained model is stuck in a local optimum of N-space that is worse than the direct E→N construction.
Weight Matrix Breakdown
Each component of the 82,304 parameters is derivable from data counts:
Wx
Hash, deterministic
128×256 input encoding. Random hash maps bytes to hidden neurons. No training needed.
Wh
Shift-register carry, deterministic
128×128 recurrence. 16 groups of 8 neurons, diagonal carry structure.
bh
Sign log-odds, from counts
128 biases. log P(hj>0) − log P(hj<0) over data positions.
256 output biases. Log marginal byte frequencies from the data.
3. ψ Direction: Weights → Events
The ψ direction reads the trained weight matrices and recovers
the events they encode. Each matrix tells a different story about the data's statistical structure.
Wx: Input Encoding — Top 5 Bytes by Column Norm
<
3.75
␣
3.74
>
3.49
a
2.81
e
2.65
Tag delimiters and whitespace dominate — these bytes carry the most
information about HTML/XML structure.
Wh: Temporal Propagation
Hebbian covariance predicts Wh:
For dynamically important entries (|w| ≥ 3.0), the Hebbian estimate achieves
r = 0.56. Sign accuracy reaches 72.7%.
The weights the model relies on most are the ones best predicted by simple co-activation statistics.
Wy: Output Prediction
Each column of Wy maps a hidden neuron's sign to an output distribution:
W_y[o][j] = log P(o | h_j > 0) - log P(o | h_j < 0).
Computable from data counts alone — no optimization required.
Per-Neuron Factor Map R²
The factor map reads events back from each neuron. 120 of 128 neurons achieve R² ≥ 0.80.
Mean R² = 0.837. Every neuron is a 2-offset conjunction detector.
52 of 128 neurons are dominated by the offset pair (1,7). The remaining 76 neurons use other pairs.
Dominant Pair Distribution
52/128 neurons: Offset pair (1,7) — the byte just seen
and the byte 7 positions ago. This matches the typical HTML tag-name length.
76/128 neurons: Other offset pairs, covering deeper
structural correlations in the data.
4. The Loop Closes
The bi-embedding emerged through twelve days of experiment.
Click a milestone to see what was discovered that day.
Days 1-4
Days 7-8
Day 9
Days 9-11
Day 11
Days 1-4 (Jan 31 – Feb 4): Training & UM Isomorphism
The sat-rnn is trained: 128 hidden, BPTT-50, Adam optimizer,
82,304 opaque parameters, achieving 0.079 bpc.
The Universal Machine isomorphism establishes that ψ exists —
every RNN prediction has a pattern witness. But the explicit map is not yet exhibited.
Skip-k-grams discovered: the RNN uses non-contiguous patterns.
6,180 data-term patterns. Skip-4 [1,8,20,3] = 0.069 bpc
with only 712 patterns. The backward trie discovers skip-patterns by MI per offset from output.
The ψ direction is now concrete: you can read which events each weight encodes.
Day 9 (Feb 9): Factor Map (ψ read per-neuron)
Every neuron is a 2-offset conjunction detector. Mean R² = 0.837
(120/128 ≥ 0.80). Dominant pair (1,7): 52/128 neurons.
State features: word_len best for ALL 128 neurons.
+word_len = 0.50 bpc (91%), +in_tag = 0.43 bpc (92.5% of gain, matches UM floor).
Days 9-11 (Feb 9-11): Total Interpretation (Boolean structure)
Sign bits carry 99.7% of compression. Mean margin = 60.5.
The dense 128×128 weight matrix produces a sparse Boolean transition function.
1023 unique sign vectors observed. Every prediction traced through Boolean decisions.
The mantissa is noise; sign is signal.
Day 11 (Feb 11): Weight Construction (φ exhibited — loop closes)
All 82,304 parameters derived from data statistics.
Analytic: 1.89 bpc with zero optimization. Optimized: 0.59 bpc.
Both beat trained (4.97 bpc). Test: analytic 4.88 vs trained 5.08 —
construction generalizes better. The φ direction is exhibited. The loop closes.
The twelve-day arc was the construction of this bi-embedding, one direction at a time.
First ψ (reading what the weights encode), then φ (writing weights from what the data contains).
5. Forward and Backward: The Temporal Bi-Embedding
The bi-embedding has a temporal dimension. The forward pass alternates between event-space (E) and
number-space (N). The backward attribution reverses this flow.
Forward Pass: Alternating E and N
xtinput (E)
→
Wxencode
→
ztpre-act (N)
→
tanhsquash
→
hthidden (E)
→
Wyreadout
→
logits(N)
→
softmaxnormalize
→
P(y)prob (N)
Backward Attribution: Reversing the Flow
P(y)prob (N)
→
gradientsignal
→
gtgrad (N)
→
Jacobianchain
→
α(t,d)N→E
PMI Alignment: Shallow vs Deep
88%
Shallow offsets — PMI alignment
24-37%
Deep offsets — PMI alignment
The bi-embedding is tighter at short range and loosens at long range.
At shallow offsets (1-3 steps back), the RNN's attribution aligns with data PMI 88% of the time.
At deep offsets (15+ steps), alignment drops to 24-37%. The temporal bi-embedding has a natural scale.
6. Three Sources of Error
The bi-embedding loop is approximate, not exact.
Three sources explain the gap between φ∘ψ and the identity:
1
First-order only
Hebbian covariance is the first-order Taylor expansion of the true weight function.
It captures pairwise co-activation but misses higher-order interactions between neurons.
Gap: r = 0.56 → r = 1.0
The 44% unexplained variance is higher-order structure.
2
Pairwise only
Per-offset log-ratios treat each time offset independently.
Cross-offset synergy — where two offsets together predict more than either alone —
is missed.
Synergy: >1.0 bits per offset pair
This explains the gap between analytic (1.89 bpc) and optimized (0.59 bpc).
3
Boolean vs Continuous
Sign bits carry 99.7% of compression. The mantissa (continuous part)
is training noise. But the E→N construction produces real-valued weights —
the mantissa is the ladder the model climbed, not the information it learned.
Sign = 99.7% of signal, mantissa = noise
"The gap between reading and writing is the training noise —
the ladder the mantissa climbed."