Pattern Injection: UM → RNN
The Universal Model (UM) captures byte-level patterns as a matrix:
P[a,b] = log P(b | a) // Bigram log-probabilities
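As a concrete sketch of how such a matrix could be built (not from the source), P can be estimated from byte-pair counts over a corpus; the add-one smoothing and the corpus path below are assumptions:

```python
import numpy as np

def bigram_log_probs(data: bytes, alpha: float = 1.0) -> np.ndarray:
    """Estimate P[a, b] = log P(b | a) over all 256 byte values."""
    counts = np.full((256, 256), alpha)               # add-alpha smoothing (assumption)
    for a, b in zip(data[:-1], data[1:]):
        counts[a, b] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)
    return np.log(probs)                              # natural log, matching P[a,b] = log P(b|a)

# P = bigram_log_probs(open("corpus.bin", "rb").read())   # "corpus.bin" is a hypothetical path
```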
For an RNN with no recurrence, the forward pass is:
logits ≈ W_ho @ W_ih[:, input]
We want this to equal P[input, :], so:
W_ho @ W_ih ≈ P^T
Using SVD: P^T = U·S·V^T, we set:
W_ho = U[:, :H] · √S[:H] // 256 × H
W_ih = √S[:H] · V^T[:H, :] // H × 256
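A minimal NumPy sketch of this factorization, assuming a 256×256 log-probability matrix P as above (the helper name `inject_factored` is ours, not the source's):

```python
import numpy as np

def inject_factored(P: np.ndarray, H: int):
    """Split P^T = U S V^T into W_ho (256 x H) and W_ih (H x 256), sharing sqrt(S)."""
    U, S, Vt = np.linalg.svd(P.T)
    sqrt_S = np.sqrt(S[:H])
    W_ho = U[:, :H] * sqrt_S            # equals U[:, :H] @ diag(sqrt_S)
    W_ih = sqrt_S[:, None] * Vt[:H, :]  # equals diag(sqrt_S) @ V^T[:H, :]
    return W_ho, W_ih

# Sanity check: for any byte a, W_ho @ W_ih[:, a] should approximate P[a, :]
# W_ho, W_ih = inject_factored(P, H=64)
```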
Direct Injection (no hidden layer):
| Model | bits/char |
|---|---|
| Raw bigram (direct lookup) | 3.84 |
| W = P^T (injected) | 3.84 |
Perfect match — injection exactly recovers bigram predictions.
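For context, here is a hedged sketch of how a bits/char figure like these could be measured: the average -log₂ probability of the true next byte under a softmax over the model's logits, evaluated on held-out data (`test_bytes` is an assumed variable, not named in the source):

```python
import numpy as np

def bits_per_char(logits_fn, data: bytes) -> float:
    """Average -log2 P(next byte) under softmax(logits_fn(prev byte))."""
    total = 0.0
    for a, b in zip(data[:-1], data[1:]):
        logits = logits_fn(a)                     # shape (256,)
        log_z = np.logaddexp.reduce(logits)       # stable log-sum-exp
        total += (log_z - logits[b]) / np.log(2)  # nats -> bits
    return total / (len(data) - 1)

# Direct injection is W = P^T, so the logits for byte a are just P[a, :]:
# print(bits_per_char(lambda a: P[a, :], test_bytes))
```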
Factored Injection (H=64):
| Model | bits/char |
|---|---|
| Raw bigram | 3.84 |
| SVD rank-64 | 3.92 |
| Gap | 0.08 |
Only 0.08 bits/char lost from compression to 64 dimensions.
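Combining the two sketches above would reproduce a gap measurement like this one (again assuming P and `test_bytes` exist):

```python
# Hypothetical reproduction of the compression gap, reusing inject_factored and bits_per_char.
W_ho, W_ih = inject_factored(P, H=64)
rank64 = bits_per_char(lambda a: W_ho @ W_ih[:, a], test_bytes)
full = bits_per_char(lambda a: P[a, :], test_bytes)
print(f"raw bigram: {full:.2f}  rank-64: {rank64:.2f}  gap: {rank64 - full:.2f}")
```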
RNN Training Comparison:
| Initialization | Initial | After 10 epochs |
|---|---|---|
| Random | 5.46 | 4.81 |
| Pattern Injection | 4.47 | 4.53 |
| Gap | 0.99 | 0.28 |
Predicted vs. empirical next-character probabilities:
After 't': Pred 'h'=26.5%, Truth 'h'=22.6%
After 'h': Pred 'e'=41.7%, Truth 'e'=43.4%
After 'e': Pred '_'=22.5%, Truth '_'=24.5%
After '_': Pred 't'=12.3%, Truth 't'=13.1%
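A hypothetical PyTorch sketch of the setup behind this comparison: a minimal byte-level RNN with explicit W_ih / W_hh / W_ho, plus an injection helper that copies in the SVD factors. The tanh nonlinearity, the 0.01 random scale, and the layer layout are assumptions, so the injected starting point only approximates the linear derivation above:

```python
import numpy as np
import torch
import torch.nn as nn

class ByteRNN(nn.Module):
    """Minimal byte-level RNN with explicit W_ih / W_hh / W_ho (hypothetical architecture)."""
    def __init__(self, H: int = 64):
        super().__init__()
        self.W_ih = nn.Parameter(torch.randn(H, 256) * 0.01)  # input  -> hidden
        self.W_hh = nn.Parameter(torch.randn(H, H) * 0.01)    # hidden -> hidden
        self.W_ho = nn.Parameter(torch.randn(256, H) * 0.01)  # hidden -> logits

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:  # byte_ids: (T,) int64
        h = torch.zeros(self.W_hh.shape[0])
        logits = []
        for b in byte_ids:
            h = torch.tanh(self.W_ih[:, b] + self.W_hh @ h)
            logits.append(self.W_ho @ h)
        return torch.stack(logits)                               # (T, 256)

def pattern_inject(model: ByteRNN, W_ho: np.ndarray, W_ih: np.ndarray) -> None:
    """Copy the SVD factors into the input/output weights; W_hh stays small and random."""
    with torch.no_grad():
        model.W_ih.copy_(torch.from_numpy(W_ih).float())
        model.W_ho.copy_(torch.from_numpy(W_ho).float())
```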
W_ih and W_ho encode local patterns (bigrams). W_hh encodes temporal patterns: how to carry information through time. Injecting temporal patterns runs into a fundamental limit.
The Arithmetic Coding Analogy:
| Arithmetic Coding | RNN |
|---|---|
| Q = N/D (interval position) | h (hidden vector) |
| Symbol narrows interval | Input updates hidden state |
| Q carries past context | h carries temporal memory |
| float32 limits depth | float32 limits temporal reach |
Key insight: Carrying entropy through layers = carrying entropy through time.
In ideal arithmetic coding, Q carries unbounded information. But float32 has only 24 mantissa bits, creating a depth limit:
d_max ≈ 24 / (-log₂ p_avg)
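As an illustrative (not measured) example: if the average next-byte probability were p_avg ≈ 1/16, i.e. about 4 bits of surprise per step, the bound gives d_max ≈ 24 / 4 = 6 steps of reliable context.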
Each recurrent update h' = W_hh @ h introduces precision loss. Trigram patterns (two steps back) may already exceed what float32 can carry stably. This is why W_hh injection failed while W_ih/W_ho injection succeeded.
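To see the precision argument concretely (a sketch, not a result from the source), one can run the same recurrent multiply in float32 and float64 and watch the relative error grow with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 64
W_hh = rng.standard_normal((H, H)) / np.sqrt(H)  # hypothetical recurrent matrix
h64 = rng.standard_normal(H)                     # float64 reference trajectory
h32 = h64.astype(np.float32)                     # float32 trajectory
W32 = W_hh.astype(np.float32)

for step in range(1, 33):
    h64 = W_hh @ h64
    h32 = W32 @ h32
    if step in (1, 2, 4, 8, 16, 32):
        rel = np.linalg.norm(h64 - h32) / np.linalg.norm(h64)
        print(f"step {step:2d}: relative float32 error ≈ {rel:.1e}")
```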