
Archive: 2026-01-31_4

Pattern Injection: UM → RNN

Focus: Injecting patterns from the Universal Model into RNN weights via SVD factorization.
Key Result: Pattern injection reduces initial loss from 5.46 to 4.47 bits/char (~18% improvement, roughly a 1 bit/char head start over random initialization) without any training.
Builds On: 2026-01-31_3 (memory traces, factor maps).

Theory

The Universal Model (UM) captures byte-level patterns as a matrix:

P[a,b] = log P(b | a)    // Bigram log-probabilities
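
One way P could be estimated from a byte stream, as a hedged sketch (the add-alpha smoothing and the corpus file name are illustrative assumptions, not necessarily how the UM builds its table):

import numpy as np

def build_bigram_logprobs(data: bytes, alpha: float = 1.0) -> np.ndarray:
    """Estimate P[a, b] = log P(b | a) from raw bytes with add-alpha smoothing."""
    counts = np.full((256, 256), alpha, dtype=np.float64)  # assumed smoothing prior
    for a, b in zip(data, data[1:]):
        counts[a, b] += 1.0
    probs = counts / counts.sum(axis=1, keepdims=True)     # row-normalize: P(b | a)
    return np.log(probs)                                    # 256 x 256 log-probabilities

# Example (hypothetical corpus file):
# P = build_bigram_logprobs(open("corpus.txt", "rb").read())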

For an RNN with no recurrence, the forward pass is:

logits ≈ W_ho @ W_ih[:, input]

We want this to equal P[input, :], so:

W_ho @ W_ih ≈ P^T

Using SVD: P^T = U·S·V^T, we set:

W_ho = U[:, :H] · diag(√S[:H])     // 256 × H
W_ih = diag(√S[:H]) · V^T[:H, :]   // H × 256

Their product is U[:, :H] · diag(S[:H]) · V^T[:H, :], the best rank-H approximation of P^T.
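
A minimal numpy sketch of this factorization (consistent with the formulas above; variable names are illustrative and not necessarily those used in inject_svd2.py):

import numpy as np

def factor_injection(P: np.ndarray, H: int = 64):
    """Factor P^T into W_ho (256 x H) and W_ih (H x 256) via truncated SVD."""
    U, S, Vt = np.linalg.svd(P.T, full_matrices=False)  # P^T = U @ diag(S) @ Vt
    sqrtS = np.sqrt(S[:H])
    W_ho = U[:, :H] * sqrtS             # columns scaled by sqrt(S): 256 x H
    W_ih = sqrtS[:, None] * Vt[:H, :]   # rows scaled by sqrt(S): H x 256
    return W_ho, W_ih                   # W_ho @ W_ih is the rank-H approximation of P^T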

Experiments

Direct Injection (no hidden layer):

Model                         bits/char
Raw bigram (direct lookup)    3.84
W = P^T (injected)            3.84
Perfect match — injection exactly recovers bigram predictions.
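
This equivalence can be checked numerically: with W = P^T and a one-hot input, the logits are exactly P[input, :], and softmax over log-probabilities returns the original bigram distribution. A minimal check, assuming a softmax output layer:

import numpy as np

def check_direct_injection(P: np.ndarray, a: int) -> bool:
    """With W = P^T, the logits for input byte a are P[a, :]; softmax recovers P(b | a)."""
    W = P.T                                    # injected 256 x 256 weight
    onehot = np.zeros(256)
    onehot[a] = 1.0
    logits = W @ onehot                        # equals P[a, :]
    pred = np.exp(logits - logits.max())
    pred /= pred.sum()                         # softmax over log-probs gives back the probs
    return np.allclose(pred, np.exp(P[a, :]))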

Factored Injection (H=64):

Model          bits/char
Raw bigram     3.84
SVD rank-64    3.92
Gap            0.08

Only 0.08 bits/char lost from compression to 64 dimensions.
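
A sketch of how this gap could be measured (the helper names and the evaluation corpus `data` are assumptions): reconstruct the rank-H logits, renormalize them, and score the same byte stream under both models.

import numpy as np

def bits_per_char(logP: np.ndarray, data: bytes) -> float:
    """Average code length in bits when each byte is predicted from its predecessor."""
    nats = -sum(logP[a, b] for a, b in zip(data, data[1:]))
    return nats / (np.log(2) * (len(data) - 1))

def rank_h_logprobs(P: np.ndarray, H: int = 64) -> np.ndarray:
    """Rank-H reconstruction of the bigram logits, renormalized into log-probabilities."""
    U, S, Vt = np.linalg.svd(P.T, full_matrices=False)
    logits = ((U[:, :H] * S[:H]) @ Vt[:H, :]).T          # back to P[a, b] layout
    logits -= logits.max(axis=1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax per row

# gap = bits_per_char(rank_h_logprobs(P, 64), data) - bits_per_char(P, data)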

RNN Training Comparison:

Initialization       Initial    After 10 epochs
Random               5.46       4.81
Pattern Injection    4.47       4.53
Gap                  0.99       0.28
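
A minimal sketch of how the two initializations could be set up, assuming a vanilla RNN with one-hot byte inputs (the random scales and the near-zero W_hh are assumptions; um_to_rnn.py is the actual pipeline):

import numpy as np

def init_rnn(H=64, P=None, seed=0):
    """Return (W_ih, W_hh, W_ho): UM-injected if P is given, random otherwise."""
    rng = np.random.default_rng(seed)
    if P is None:
        W_ih = rng.normal(0.0, 0.1, (H, 256))     # random baseline (assumed scale)
        W_ho = rng.normal(0.0, 0.1, (256, H))
    else:
        U, S, Vt = np.linalg.svd(P.T, full_matrices=False)
        sqrtS = np.sqrt(S[:H])
        W_ho = U[:, :H] * sqrtS                    # 256 x H
        W_ih = sqrtS[:, None] * Vt[:H, :]          # H x 256
    W_hh = rng.normal(0.0, 0.01, (H, H))           # small recurrence: step 0 behaves like the bigram
    return W_ih, W_hh, W_ho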

Sample Predictions

After 't': Pred 'h'=26.5%, Truth 'h'=22.6%
After 'h': Pred 'e'=41.7%, Truth 'e'=43.4%
After 'e': Pred '_'=22.5%, Truth '_'=24.5%
After '_': Pred 't'=12.3%, Truth 't'=13.1%
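
A sketch of how such a comparison could be produced, assuming a hypothetical model_probs(a) helper that returns the network's softmax over the next byte; the "Truth" column is just the empirical next-byte frequency in the corpus:

import numpy as np

def truth_probs(data: bytes, a: int) -> np.ndarray:
    """Empirical P(next byte | previous byte == a) from the corpus."""
    nxt = [b for x, b in zip(data, data[1:]) if x == a]
    counts = np.bincount(nxt, minlength=256).astype(float)
    return counts / max(counts.sum(), 1.0)

def report(data: bytes, a: int, model_probs) -> None:
    """Print the model's top next-byte guess next to its empirical frequency."""
    pred = model_probs(a)                 # hypothetical: softmax over 256 bytes
    truth = truth_probs(data, a)
    b = int(np.argmax(pred))
    print(f"After {chr(a)!r}: Pred {chr(b)!r}={pred[b]:.1%}, Truth {chr(b)!r}={truth[b]:.1%}")

# report(data, ord('t'), model_probs)   # e.g. After 't': Pred 'h'=..., Truth 'h'=...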

Files

inject2.c: Direct 1-Markov injection (C). W = P^T.
inject_svd2.py: SVD factorization (Python). P^T = U·S·V^T → W_ho, W_ih.
um_to_rnn.py: Complete training pipeline. Compares random vs injected init.

Why W_hh Injection is Hard

W_ih and W_ho encode local patterns (bigrams). W_hh encodes temporal patterns—how to carry information through time. This hits a fundamental limit.

The Arithmetic Coding Analogy:

Arithmetic Coding              RNN
Q = N/D (interval position)    h (hidden vector)
Symbol narrows interval        Input updates hidden state
Q carries past context         h carries temporal memory
float32 limits depth           float32 limits temporal reach

Key insight: Carrying entropy through layers = carrying entropy through time.

In ideal arithmetic coding, Q carries unbounded information. But float32 has only 24 mantissa bits, creating a depth limit:

d_max ≈ 24 / (-log₂ p_avg)
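
As a worked example (illustrative numbers, not measured here): at an average surprisal of about 2 bits/symbol, -log₂ p_avg ≈ 2 and d_max ≈ 24 / 2 = 12 steps; at 8 bits/symbol (near-uniform bytes) the budget drops to about 3 steps.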

Each h' = W_hh @ h multiply introduces precision loss. Trigram patterns (two steps back) may already exceed what float32 can carry stably. This is why W_hh injection failed while W_ih/W_ho injection succeeded: an ideal Q carries unbounded information, but float32 does not.

Conclusions

  1. Patterns are transferable: UM statistics can initialize RNN weights
  2. SVD provides compression: 256×256 → 256×64 + 64×256 with minimal loss
  3. Head start is significant: ~1 bit/char better than random
  4. Training refines the prior: Gradient descent handles longer context
  5. Temporal patterns hit precision limits: W_hh injection fails because carrying entropy through time requires more precision than float32 provides