January 31, 2026 — Session 4
We demonstrate empirically that patterns extracted from a Universal Model (bigram statistics) can be injected into RNN weights via SVD factorization, providing a ~1 bit/char head start over random initialization.
The Universal Model (UM) captures byte-level patterns as a matrix:
P[a,b] = log P(b | a) // Bigram log-probabilities
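To make this concrete, here is a minimal sketch of how such a matrix can be gathered from raw bytes (`training_bytes` is an illustrative name, not the project's actual variable); the add-0.5 smoothing matches the extraction code later in this section:

```python
import numpy as np

# Minimal sketch of gathering the UM's bigram statistics from raw bytes.
# 'training_bytes' is an illustrative name for the training corpus.
counts = np.zeros((256, 256), dtype=np.float64)
for a, b in zip(training_bytes, training_bytes[1:]):
    counts[a, b] += 1                        # count each byte pair (a, b)
row_sums = counts.sum(axis=1, keepdims=True)

# P[a, b] = log P(b | a), with add-0.5 smoothing so no entry is log(0)
P = np.log((counts + 0.5) / (row_sums + 0.5 * 256))
```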
For an RNN, the forward pass is:
```
h = tanh(W_ih @ one_hot(input) + W_hh @ h_prev)
logits = W_ho @ h + bias
```
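As a reference, a minimal numpy sketch of one step of this forward pass (shapes assumed: W_ih is H × 256, W_hh is H × H, W_ho is 256 × H):

```python
import numpy as np

# One RNN step: embed the input byte, update the hidden state, score the next byte.
def rnn_step(x, h_prev, W_ih, W_hh, W_ho, bias):
    one_hot = np.zeros(256)
    one_hot[x] = 1.0
    h = np.tanh(W_ih @ one_hot + W_hh @ h_prev)   # new hidden state, shape (H,)
    logits = W_ho @ h + bias                      # next-byte scores, shape (256,)
    return logits, h
```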
The key insight: if we ignore recurrence (W_hh = 0) and tanh ≈ identity for small values:
logits ≈ W_ho @ W_ih[:, input]
We want this to equal P[input, :], so:
(W_ho @ W_ih)^T ≈ P ⟹ W_ho @ W_ih ≈ P^T
Using SVD: P^T = U·S·V^T, we set:
```
W_ho = U[:, :H] · √S[:H]       // 256 × H
W_ih = √S[:H] · V^T[:H, :]     // H × 256
```
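Why splitting √S between the two factors works: they multiply back to the truncated SVD of P^T, which is its best rank-H approximation in the least-squares sense (Eckart-Young). A quick sketch of that check, assuming U, S, Vt from the SVD above:

```python
import numpy as np

H = 64
W_ho = U[:, :H] * np.sqrt(S[:H])           # scale the columns of U_H by √S
W_ih = np.sqrt(S[:H])[:, None] * Vt[:H]    # scale the rows of V_H^T by √S

# The product recovers the rank-H truncated SVD of P^T
assert np.allclose(W_ho @ W_ih, U[:, :H] @ np.diag(S[:H]) @ Vt[:H])
```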
| Model | bits/char |
|---|---|
| Raw bigram (direct lookup) | 3.84 |
| W = P^T (injected) | 3.84 |
Perfect match — the injection exactly recovers bigram predictions.
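For reference, a minimal sketch of the bits/char measurement behind these tables (the project's actual evaluation script isn't shown; this assumes the log-probability matrix holds natural logs, so dividing by log 2 converts to bits):

```python
import numpy as np

# Average cross-entropy in bits per character over a byte stream,
# given a 256 × 256 matrix of natural-log next-byte probabilities.
def bits_per_char(logprobs, data: bytes) -> float:
    nats = -np.mean([logprobs[a, b] for a, b in zip(data, data[1:])])
    return float(nats / np.log(2))
```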
| Model | bits/char |
|---|---|
| Raw bigram | 3.84 |
| SVD rank-64 injection | 3.92 |
| Gap | 0.08 |
Only 0.08 bits/char are lost by compressing to 64 dimensions.
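To see how this gap depends on the rank, a hedged sketch of the sweep (`test_bytes` is an illustrative name for held-out data; the exact numbers depend on the corpus):

```python
import numpy as np

# Evaluate softmax-renormalized rank-H approximations of P on held-out bytes.
U, S, Vt = np.linalg.svd(P.T)
for H in (256, 128, 64, 32):
    P_hat = (U[:, :H] @ np.diag(S[:H]) @ Vt[:H]).T                    # rank-H logits, one row per context
    logq = P_hat - np.log(np.exp(P_hat).sum(axis=1, keepdims=True))   # log-softmax per row
    nats = -np.mean([logq[a, b] for a, b in zip(test_bytes, test_bytes[1:])])
    print(f"rank {H:3d}: {nats / np.log(2):.2f} bits/char")
```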
| Initialization | Initial (bits/char) | After 10 epochs |
|---|---|---|
| Random | 5.46 | 4.81 |
| Pattern Injection | 4.47 | 4.53 |
| Gap | 0.99 | 0.28 |
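As a reference for how the injected factors can seed a trainable model, here is a minimal sketch assuming a PyTorch char-RNN with the shapes above; the actual um_to_rnn.py pipeline is not shown here and may be organized differently:

```python
import torch
import torch.nn as nn

H = 64
rnn = nn.RNN(input_size=256, hidden_size=H, nonlinearity="tanh", batch_first=True)
head = nn.Linear(H, 256)

with torch.no_grad():
    # Only the "local" weights are injected; W_hh and the biases keep
    # their default initialization.
    rnn.weight_ih_l0.copy_(torch.from_numpy(W_ih).float())   # H × 256
    head.weight.copy_(torch.from_numpy(W_ho).float())        # 256 × H
```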
Pattern extraction and SVD factorization in Python:
```python
import numpy as np

# counts[a, b] = bigram counts over the training bytes (256 × 256)
row_sums = counts.sum(axis=1, keepdims=True)

# Compute bigram log-probs (add-0.5 smoothing)
P = np.log((counts + 0.5) / (row_sums + 0.5 * 256))

# SVD of transpose
U, S, Vt = np.linalg.svd(P.T)

# Factor into RNN weights
H = 64
W_ho = U[:, :H] @ np.diag(np.sqrt(S[:H]))    # 256 × H
W_ih = np.diag(np.sqrt(S[:H])) @ Vt[:H, :]   # H × 256
```
Spot checks of the injected model's next-byte predictions against the empirical bigram frequencies:

```
After 't': Pred 'h'=26.5%, Truth 'h'=22.6%
After 'h': Pred 'e'=41.7%, Truth 'e'=43.4%
After 'e': Pred '_'=22.5%, Truth '_'=24.5%
After '_': Pred 't'=12.3%, Truth 't'=13.1%
```
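A sketch of how such a spot check can be produced, using `counts`, `W_ih`, and `W_ho` as defined earlier (exact percentages depend on the corpus):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for prev, nxt in [(ord("t"), ord("h")), (ord("h"), ord("e")), (ord("e"), ord(" "))]:
    pred = softmax(W_ho @ np.tanh(W_ih[:, prev]))     # injected model, W_hh = 0, h_prev = 0
    truth = counts[prev, nxt] / counts[prev].sum()    # empirical bigram frequency
    print(f"after {chr(prev)!r}: pred {chr(nxt)!r}={pred[nxt]:.1%}, truth={truth:.1%}")
```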
We successfully inject W_ih and W_ho (the "local" weights), but what about W_hh (the recurrent weights)? Initial experiments with trigram-derived W_hh injection failed. This section explains why.
In arithmetic coding, the coder maintains a quotient Q = N/D that tracks position within the probability interval. Ideally, Q carries unbounded information—each new symbol narrows the interval, accumulating context from arbitrarily far back.
But in practice, Q is stored in float32 with only 24 mantissa bits. This creates a depth limit:
d_max ≈ 24 / (-log₂ p_avg)
After d_max symbols, precision loss forces renormalization. Older context is forgotten.
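A quick worked example of this limit, with assumed average probabilities rather than measured ones:

```python
import math

# With 24 mantissa bits, symbols carrying 4 bits each (p_avg = 1/16)
# exhaust the precision after roughly 24 / 4 = 6 symbols.
for p_avg in (0.5, 1 / 16, 1 / 256):
    print(f"p_avg = {p_avg:<10} d_max ≈ {24 / -math.log2(p_avg):.0f} symbols")
```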
The RNN hidden state h is exactly analogous to the AC interval Q:
| Arithmetic Coding | RNN |
|---|---|
| Q = N/D (interval position) | h (hidden vector) |
| Symbol narrows interval | Input updates hidden state |
| Q carries past context | h carries temporal memory |
| float32 limits depth | float32 limits temporal reach |
Key insight: Carrying entropy through layers = carrying entropy through time.
W_ih and W_ho encode local patterns (bigrams): given this input, predict that output. These don't require carrying state—they're pure matrix factorization.
W_hh encodes temporal patterns: how to carry information from step t to step t+1. This requires patterns that span multiple timesteps (trigrams, 4-grams, ...). But:
The fundamental limit: an ideal Q carries unbounded information; its float32 realization (and likewise a float32 h) does not.
This precision limit explains several phenomena, including why the trigram-derived W_hh injection failed while the bigram injection succeeded.
Files:

- inject2.c — Direct 1-Markov injection (C)
- inject_svd2.py — SVD factorization (Python)
- um_to_rnn.py — Complete training pipeline
- injected_weights.h — Exported C header