January 31, 2026 — Session 4
We demonstrate empirically that patterns extracted from a Universal Model (bigram statistics) can be injected into RNN weights via SVD factorization, providing a ~1 bit/char head start over random initialization.
The Universal Model (UM) captures byte-level patterns as a matrix:
P[a,b] = log P(b | a) // Bigram log-probabilities
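To make this concrete, here is a minimal sketch of how such a matrix can be gathered from raw bytes (`training_bytes` is an illustrative name, not the project's actual variable); the add-0.5 smoothing matches the extraction code later in this section:

```python
import numpy as np

# Minimal sketch of gathering the UM's bigram statistics from raw bytes.
# 'training_bytes' is an illustrative name for the training corpus.
counts = np.zeros((256, 256), dtype=np.float64)
for a, b in zip(training_bytes, training_bytes[1:]):
    counts[a, b] += 1                        # count each byte pair (a, b)
row_sums = counts.sum(axis=1, keepdims=True)

# P[a, b] = log P(b | a), with add-0.5 smoothing so no entry is log(0)
P = np.log((counts + 0.5) / (row_sums + 0.5 * 256))
```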
For an RNN, the forward pass is:
```
h = tanh(W_ih @ one_hot(input) + W_hh @ h_prev)
logits = W_ho @ h + bias
```
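As a reference, a minimal numpy sketch of one step of this forward pass (shapes assumed: W_ih is H × 256, W_hh is H × H, W_ho is 256 × H):

```python
import numpy as np

# One RNN step: embed the input byte, update the hidden state, score the next byte.
def rnn_step(x, h_prev, W_ih, W_hh, W_ho, bias):
    one_hot = np.zeros(256)
    one_hot[x] = 1.0
    h = np.tanh(W_ih @ one_hot + W_hh @ h_prev)   # new hidden state, shape (H,)
    logits = W_ho @ h + bias                      # next-byte scores, shape (256,)
    return logits, h
```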
The key insight: if we ignore recurrence (W_hh = 0) and tanh ≈ identity for small values:
logits ≈ W_ho @ W_ih[:, input]
We want this to equal P[input, :], so:
(W_ho @ W_ih)^T ≈ P ⟹ W_ho @ W_ih ≈ P^T
Using SVD: P^T = U·S·V^T, we set:
```
W_ho = U[:, :H] · √S[:H]       // 256 × H
W_ih = √S[:H] · V^T[:H, :]     // H × 256
```
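Why splitting √S between the two factors works: they multiply back to the truncated SVD of P^T, which is its best rank-H approximation in the least-squares sense (Eckart-Young). A quick sketch of that check, assuming U, S, Vt from the SVD above:

```python
import numpy as np

H = 64
W_ho = U[:, :H] * np.sqrt(S[:H])           # scale the columns of U_H by √S
W_ih = np.sqrt(S[:H])[:, None] * Vt[:H]    # scale the rows of V_H^T by √S

# The product recovers the rank-H truncated SVD of P^T
assert np.allclose(W_ho @ W_ih, U[:, :H] @ np.diag(S[:H]) @ Vt[:H])
```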
| Model | bits/char |
|---|---|
| Raw bigram (direct lookup) | 3.84 |
| W = P^T (injected) | 3.84 |
Perfect match — the injection exactly recovers bigram predictions.
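For reference, a minimal sketch of the bits/char measurement behind these tables (the project's actual evaluation script isn't shown; this assumes the log-probability matrix holds natural logs, so dividing by log 2 converts to bits):

```python
import numpy as np

# Average cross-entropy in bits per character over a byte stream,
# given a 256 × 256 matrix of natural-log next-byte probabilities.
def bits_per_char(logprobs, data: bytes) -> float:
    nats = -np.mean([logprobs[a, b] for a, b in zip(data, data[1:])])
    return float(nats / np.log(2))
```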
| Model | bits/char |
|---|---|
| Raw bigram | 3.84 |
| SVD rank-64 injection | 3.92 |
| Gap | 0.08 |
Only 0.08 bits/char are lost by compressing to 64 dimensions.
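To see how this gap depends on the rank, a hedged sketch of the sweep (`test_bytes` is an illustrative name for held-out data; the exact numbers depend on the corpus):

```python
import numpy as np

# Evaluate softmax-renormalized rank-H approximations of P on held-out bytes.
U, S, Vt = np.linalg.svd(P.T)
for H in (256, 128, 64, 32):
    P_hat = (U[:, :H] @ np.diag(S[:H]) @ Vt[:H]).T                    # rank-H logits, one row per context
    logq = P_hat - np.log(np.exp(P_hat).sum(axis=1, keepdims=True))   # log-softmax per row
    nats = -np.mean([logq[a, b] for a, b in zip(test_bytes, test_bytes[1:])])
    print(f"rank {H:3d}: {nats / np.log(2):.2f} bits/char")
```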
| Initialization | Initial (bits/char) | After 10 epochs |
|---|---|---|
| Random | 5.46 | 4.81 |
| Pattern Injection | 4.47 | 4.53 |
| Gap | 0.99 | 0.28 |
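As a reference for how the injected factors can seed a trainable model, here is a minimal sketch assuming a PyTorch char-RNN with the shapes above; the actual um_to_rnn.py pipeline is not shown here and may be organized differently:

```python
import torch
import torch.nn as nn

H = 64
rnn = nn.RNN(input_size=256, hidden_size=H, nonlinearity="tanh", batch_first=True)
head = nn.Linear(H, 256)

with torch.no_grad():
    # Only the "local" weights are injected; W_hh and the biases keep
    # their default initialization.
    rnn.weight_ih_l0.copy_(torch.from_numpy(W_ih).float())   # H × 256
    head.weight.copy_(torch.from_numpy(W_ho).float())        # 256 × H
```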
Pattern extraction and SVD factorization in Python:
```python
import numpy as np

# counts[a, b] = bigram counts over the training bytes (256 × 256)
row_sums = counts.sum(axis=1, keepdims=True)

# Compute bigram log-probs (add-0.5 smoothing)
P = np.log((counts + 0.5) / (row_sums + 0.5 * 256))

# SVD of transpose
U, S, Vt = np.linalg.svd(P.T)

# Factor into RNN weights
H = 64
W_ho = U[:, :H] @ np.diag(np.sqrt(S[:H]))    # 256 × H
W_ih = np.diag(np.sqrt(S[:H])) @ Vt[:H, :]   # H × 256
```
Spot checks of the injected model's next-byte predictions against the empirical bigram frequencies:

```
After 't': Pred 'h'=26.5%, Truth 'h'=22.6%
After 'h': Pred 'e'=41.7%, Truth 'e'=43.4%
After 'e': Pred '_'=22.5%, Truth '_'=24.5%
After '_': Pred 't'=12.3%, Truth 't'=13.1%
```
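A sketch of how such a spot check can be produced, using `counts`, `W_ih`, and `W_ho` as defined earlier (exact percentages depend on the corpus):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for prev, nxt in [(ord("t"), ord("h")), (ord("h"), ord("e")), (ord("e"), ord(" "))]:
    pred = softmax(W_ho @ np.tanh(W_ih[:, prev]))     # injected model, W_hh = 0, h_prev = 0
    truth = counts[prev, nxt] / counts[prev].sum()    # empirical bigram frequency
    print(f"after {chr(prev)!r}: pred {chr(nxt)!r}={pred[nxt]:.1%}, truth={truth:.1%}")
```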
We successfully inject W_ih and W_ho (the "local" weights), but what about W_hh (the recurrent weights)? Initial experiments with trigram-derived W_hh injection failed. This section explains why.
In arithmetic coding, the coder maintains a quotient Q = N/D that tracks position within the probability interval. Ideally, Q carries unbounded information—each new symbol narrows the interval, accumulating context from arbitrarily far back.
But in practice, Q is stored in float32 with only 24 mantissa bits. This creates a depth limit:
d_max ≈ 24 / (-log₂ p_avg)
After d_max symbols, precision loss forces renormalization. Older context is forgotten.
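A quick worked example of this limit, with assumed average probabilities rather than measured ones:

```python
import math

# With 24 mantissa bits, symbols carrying 4 bits each (p_avg = 1/16)
# exhaust the precision after roughly 24 / 4 = 6 symbols.
for p_avg in (0.5, 1 / 16, 1 / 256):
    print(f"p_avg = {p_avg:<10} d_max ≈ {24 / -math.log2(p_avg):.0f} symbols")
```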
The RNN hidden state h is exactly analogous to the AC interval Q:
| Arithmetic Coding | RNN |
|---|---|
| Q = N/D (interval position) | h (hidden vector) |
| Symbol narrows interval | Input updates hidden state |
| Q carries past context | h carries temporal memory |
| float32 limits depth | float32 limits temporal reach |
Key insight: Carrying entropy through layers = carrying entropy through time.
W_ih and W_ho encode local patterns (bigrams): given this input, predict that output. These don't require carrying state—they're pure matrix factorization.
W_hh encodes temporal patterns: how to carry information from step t to step t+1. This requires patterns that span multiple timesteps (trigrams, 4-grams, ...). But:
The fundamental limit: an ideal Q carries unbounded information; its float32 realization (and likewise a float32 h) does not.
This precision limit explains several phenomena, including why the trigram-derived W_hh injection failed while the bigram injection succeeded.
Files:

- inject2.c — Direct 1-Markov injection (C)
- inject_svd2.py — SVD factorization (Python)
- um_to_rnn.py — Complete training pipeline
- injected_weights.h — Exported C header