
Archive: 2026-01-31_4

Pattern Injection: UM → RNN

Focus: Injecting patterns from the Universal Model into RNN weights via SVD factorization.
Key Result: Pattern injection reduces initial loss from 5.46 to 4.47 bits/char (~18% improvement, roughly a 1 bit/char head start over random initialization) without any training.
Builds On: 2026-01-31_3 (memory traces, factor maps).

Theory

The Universal Model (UM) captures byte-level patterns as a matrix:

P[a,b] = log P(b | a)    // Bigram log-probabilities
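
One way P could be estimated from a byte stream, as a hedged sketch (the add-alpha smoothing and the corpus file name are illustrative assumptions, not necessarily how the UM builds its table):

import numpy as np

def build_bigram_logprobs(data: bytes, alpha: float = 1.0) -> np.ndarray:
    """Estimate P[a, b] = log P(b | a) from raw bytes with add-alpha smoothing."""
    counts = np.full((256, 256), alpha, dtype=np.float64)  # assumed smoothing prior
    for a, b in zip(data, data[1:]):
        counts[a, b] += 1.0
    probs = counts / counts.sum(axis=1, keepdims=True)     # row-normalize: P(b | a)
    return np.log(probs)                                    # 256 x 256 log-probabilities

# Example (hypothetical corpus file):
# P = build_bigram_logprobs(open("corpus.txt", "rb").read())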

For an RNN with no recurrence, the forward pass is:

logits ≈ W_ho @ W_ih[:, input]

We want this to equal P[input, :], so:

W_ho @ W_ih ≈ P^T

Using SVD: P^T = U·S·V^T, we set:

W_ho = U[:, :H] · diag(√S[:H])     // 256 × H
W_ih = diag(√S[:H]) · V^T[:H, :]   // H × 256

Their product is U[:, :H] · diag(S[:H]) · V^T[:H, :], the best rank-H approximation of P^T.
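
A minimal numpy sketch of this factorization (consistent with the formulas above; variable names are illustrative and not necessarily those used in inject_svd2.py):

import numpy as np

def factor_injection(P: np.ndarray, H: int = 64):
    """Factor P^T into W_ho (256 x H) and W_ih (H x 256) via truncated SVD."""
    U, S, Vt = np.linalg.svd(P.T, full_matrices=False)  # P^T = U @ diag(S) @ Vt
    sqrtS = np.sqrt(S[:H])
    W_ho = U[:, :H] * sqrtS             # columns scaled by sqrt(S): 256 x H
    W_ih = sqrtS[:, None] * Vt[:H, :]   # rows scaled by sqrt(S): H x 256
    return W_ho, W_ih                   # W_ho @ W_ih is the rank-H approximation of P^T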

Experiments

Direct Injection (no hidden layer):

Model                         bits/char
Raw bigram (direct lookup)    3.84
W = P^T (injected)            3.84
Perfect match — injection exactly recovers bigram predictions.
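
This equivalence can be checked numerically: with W = P^T and a one-hot input, the logits are exactly P[input, :], and softmax over log-probabilities returns the original bigram distribution. A minimal check, assuming a softmax output layer:

import numpy as np

def check_direct_injection(P: np.ndarray, a: int) -> bool:
    """With W = P^T, the logits for input byte a are P[a, :]; softmax recovers P(b | a)."""
    W = P.T                                    # injected 256 x 256 weight
    onehot = np.zeros(256)
    onehot[a] = 1.0
    logits = W @ onehot                        # equals P[a, :]
    pred = np.exp(logits - logits.max())
    pred /= pred.sum()                         # softmax over log-probs gives back the probs
    return np.allclose(pred, np.exp(P[a, :]))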

Factored Injection (H=64):

Model          bits/char
Raw bigram     3.84
SVD rank-64    3.92
Gap            0.08

Only 0.08 bits/char lost from compression to 64 dimensions.
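
A sketch of how this gap could be measured (the helper names and the evaluation corpus `data` are assumptions): reconstruct the rank-H logits, renormalize them, and score the same byte stream under both models.

import numpy as np

def bits_per_char(logP: np.ndarray, data: bytes) -> float:
    """Average code length in bits when each byte is predicted from its predecessor."""
    nats = -sum(logP[a, b] for a, b in zip(data, data[1:]))
    return nats / (np.log(2) * (len(data) - 1))

def rank_h_logprobs(P: np.ndarray, H: int = 64) -> np.ndarray:
    """Rank-H reconstruction of the bigram logits, renormalized into log-probabilities."""
    U, S, Vt = np.linalg.svd(P.T, full_matrices=False)
    logits = ((U[:, :H] * S[:H]) @ Vt[:H, :]).T          # back to P[a, b] layout
    logits -= logits.max(axis=1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax per row

# gap = bits_per_char(rank_h_logprobs(P, 64), data) - bits_per_char(P, data)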

RNN Training Comparison:

Initialization       Initial    After 10 epochs
Random               5.46       4.81
Pattern Injection    4.47       4.53
Gap                  0.99       0.28
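
A minimal sketch of how the two initializations could be set up, assuming a vanilla RNN with one-hot byte inputs (the random scales and the near-zero W_hh are assumptions; um_to_rnn.py is the actual pipeline):

import numpy as np

def init_rnn(H=64, P=None, seed=0):
    """Return (W_ih, W_hh, W_ho): UM-injected if P is given, random otherwise."""
    rng = np.random.default_rng(seed)
    if P is None:
        W_ih = rng.normal(0.0, 0.1, (H, 256))     # random baseline (assumed scale)
        W_ho = rng.normal(0.0, 0.1, (256, H))
    else:
        U, S, Vt = np.linalg.svd(P.T, full_matrices=False)
        sqrtS = np.sqrt(S[:H])
        W_ho = U[:, :H] * sqrtS                    # 256 x H
        W_ih = sqrtS[:, None] * Vt[:H, :]          # H x 256
    W_hh = rng.normal(0.0, 0.01, (H, H))           # small recurrence: step 0 behaves like the bigram
    return W_ih, W_hh, W_ho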

Sample Predictions

After 't': Pred 'h'=26.5%, Truth 'h'=22.6%
After 'h': Pred 'e'=41.7%, Truth 'e'=43.4%
After 'e': Pred '_'=22.5%, Truth '_'=24.5%
After '_': Pred 't'=12.3%, Truth 't'=13.1%
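
A sketch of how such a comparison could be produced, assuming a hypothetical model_probs(a) helper that returns the network's softmax over the next byte; the "Truth" column is just the empirical next-byte frequency in the corpus:

import numpy as np

def truth_probs(data: bytes, a: int) -> np.ndarray:
    """Empirical P(next byte | previous byte == a) from the corpus."""
    nxt = [b for x, b in zip(data, data[1:]) if x == a]
    counts = np.bincount(nxt, minlength=256).astype(float)
    return counts / max(counts.sum(), 1.0)

def report(data: bytes, a: int, model_probs) -> None:
    """Print the model's top next-byte guess next to its empirical frequency."""
    pred = model_probs(a)                 # hypothetical: softmax over 256 bytes
    truth = truth_probs(data, a)
    b = int(np.argmax(pred))
    print(f"After {chr(a)!r}: Pred {chr(b)!r}={pred[b]:.1%}, Truth {chr(b)!r}={truth[b]:.1%}")

# report(data, ord('t'), model_probs)   # e.g. After 't': Pred 'h'=..., Truth 'h'=...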

Files

inject2.c: Direct 1-Markov injection (C). W = P^T.
inject_svd2.py: SVD factorization (Python). P^T = U·S·V^T → W_ho, W_ih.
um_to_rnn.py: Complete training pipeline. Compares random vs injected init.

Why W_hh Injection is Hard

W_ih and W_ho encode local patterns (bigrams). W_hh encodes temporal patterns—how to carry information through time. This hits a fundamental limit.

The Arithmetic Coding Analogy:

Arithmetic Coding              RNN
Q = N/D (interval position)    h (hidden vector)
Symbol narrows interval        Input updates hidden state
Q carries past context         h carries temporal memory
float32 limits depth           float32 limits temporal reach

Key insight: Carrying entropy through layers = carrying entropy through time.

In ideal arithmetic coding, Q carries unbounded information. But float32 has only 24 mantissa bits, creating a depth limit:

d_max ≈ 24 / (-log₂ p_avg)
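
As a worked example (illustrative numbers, not measured here): at an average surprisal of about 2 bits/symbol, -log₂ p_avg ≈ 2 and d_max ≈ 24 / 2 = 12 steps; at 8 bits/symbol (near-uniform bytes) the budget drops to about 3 steps.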

Each h' = W_hh @ h multiply introduces precision loss. Trigram patterns (two steps back) may already exceed what float32 can carry stably. This is why W_hh injection failed while W_ih/W_ho injection succeeded: an ideal Q carries unbounded information, but float32 does not.

Conclusions

  1. Patterns are transferable: UM statistics can initialize RNN weights
  2. SVD provides compression: 256×256 → 256×64 + 64×256 with minimal loss
  3. Head start is significant: ~1 bit/char better than random
  4. Training refines the prior: Gradient descent handles longer context
  5. Temporal patterns hit precision limits: W_hh injection fails because carrying entropy through time requires more precision than float32 provides