Pattern Injection: UM → RNN

January 31, 2026 — Session 4

Abstract

We demonstrate empirically that patterns extracted from a Universal Model (bigram statistics) can be injected into RNN weights via SVD factorization, providing a ~1 bit/char head start over random initialization.

Key Result: Pattern injection reduces initial loss from 5.46 to 4.47 bits/char (~18% improvement) without any training.

Theory

The Universal Model (UM) captures byte-level patterns as a matrix:

P[a,b] = log P(b | a)    // Bigram log-probabilities
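
For concreteness, here is a minimal sketch of how such a matrix can be estimated from raw bytes (the add-0.5 smoothing matches the Implementation section below; the function name is illustrative):

import numpy as np

def bigram_logprobs(data: bytes) -> np.ndarray:
    """P[a, b] = log P(b | a), estimated with add-0.5 smoothing."""
    counts = np.zeros((256, 256))
    for a, b in zip(data, data[1:]):
        counts[a, b] += 1.0              # count byte b following byte a
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.log((counts + 0.5) / (row_sums + 0.5 * 256))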

For an RNN, the forward pass is:

h = tanh(W_ih @ one_hot(input) + W_hh @ h_prev)
logits = W_ho @ h + bias
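
As a concrete reference, a minimal numpy sketch of this forward step (dimensions follow the factorization below: W_ih is H×256, W_hh is H×H, W_ho is 256×H):

import numpy as np

def rnn_step(byte_in, h_prev, W_ih, W_hh, W_ho, bias):
    """One recurrent step over a single input byte (0..255)."""
    x = np.zeros(256)
    x[byte_in] = 1.0                        # one_hot(input)
    h = np.tanh(W_ih @ x + W_hh @ h_prev)   # hidden update
    logits = W_ho @ h + bias                # next-byte logits
    return h, logits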

The key insight: if we ignore recurrence (W_hh = 0) and tanh ≈ identity for small values:

logits ≈ W_ho @ W_ih[:, input]

We want this to equal P[input, :], so:

(W_ho @ W_ih)^T ≈ P
⟹ W_ho @ W_ih ≈ P^T

Using SVD: P^T = U·S·V^T, we set:

W_ho = U[:, :H] · √S[:H]     // 256 × H
W_ih = √S[:H] · V^T[:H, :]   // H × 256
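
A quick numerical check of this factorization (a sketch with a random stand-in for P; at full rank H = 256 the reconstruction is exact, while smaller H gives the low-rank approximation used below):

import numpy as np

P = np.random.randn(256, 256)                # stand-in for the bigram log-prob matrix
U, S, Vt = np.linalg.svd(P.T)

H = 256                                      # full rank: exact reconstruction
W_ho = U[:, :H] @ np.diag(np.sqrt(S[:H]))    # 256 × H
W_ih = np.diag(np.sqrt(S[:H])) @ Vt[:H, :]   # H × 256

assert np.allclose(W_ho @ W_ih, P.T)         # (W_ho @ W_ih)^T ≈ P holds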

Experiments

Direct Injection (no hidden layer)

Model                         bits/char
Raw bigram (direct lookup)    3.84
W = P^T (injected)            3.84

Perfect match — the injection exactly recovers bigram predictions.

Factored Injection (H=64)

Model                    bits/char
Raw bigram               3.84
SVD rank-64 injection    3.92
Gap                      0.08

Only 0.08 bits/char lost from compression to 64 dimensions.
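
One way such a bits/char score can be computed, as a sketch: score the factored model's next-byte predictions on a held-out byte string (W_ho and W_ih as defined above; the corpus itself is not shown here):

import numpy as np

def bits_per_char(W_ho, W_ih, data: bytes) -> float:
    """Average -log2 p(next byte | current byte) under the factored model."""
    logits = W_ho @ W_ih                                  # 256 × 256; column a = logits after byte a
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    nats = -sum(logp[b, a] for a, b in zip(data, data[1:]))
    return float(nats / np.log(2) / (len(data) - 1))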

RNN Training Comparison

Initialization       Initial (bits/char)    After 10 epochs (bits/char)
Random               5.46                   4.81
Pattern Injection    4.47                   4.53
Gap                  0.99                   0.28

Implementation

Pattern extraction and SVD factorization in Python:

import numpy as np

# counts is assumed to be a 256 × 256 array of bigram counts
# (counts[a, b] = number of times byte b follows byte a in the training data)
row_sums = counts.sum(axis=1, keepdims=True)

# Compute bigram log-probs with add-0.5 smoothing
P = np.log((counts + 0.5) / (row_sums + 0.5 * 256))

# SVD of the transpose (we need W_ho @ W_ih ≈ P^T)
U, S, Vt = np.linalg.svd(P.T)

# Factor into RNN weights, truncated to hidden size H
H = 64
W_ho = U[:, :H] @ np.diag(np.sqrt(S[:H]))   # 256 × H
W_ih = np.diag(np.sqrt(S[:H])) @ Vt[:H, :]  # H × 256
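
Splitting the singular values as √S across the two factors is a balanced choice: it gives W_ih and W_ho comparable scale rather than loading all of the magnitude into one matrix. The factors can then be copied into a framework RNN before training; the sketch below uses PyTorch's nn.RNN and nn.Linear purely as an illustration (not the post's actual training code):

import torch
import torch.nn as nn

H = 64
rnn = nn.RNN(input_size=256, hidden_size=H, nonlinearity='tanh', batch_first=True)
head = nn.Linear(H, 256)

with torch.no_grad():
    rnn.weight_ih_l0.copy_(torch.as_tensor(W_ih, dtype=torch.float32))  # H × 256
    rnn.weight_hh_l0.zero_()   # matches the W_hh = 0 assumption; small random init also works
    rnn.bias_ih_l0.zero_()
    rnn.bias_hh_l0.zero_()
    head.weight.copy_(torch.as_tensor(W_ho, dtype=torch.float32))       # 256 × H
    head.bias.zero_()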

Sample Predictions

After 't': Pred 'h'=26.5%, Truth 'h'=22.6%
After 'h': Pred 'e'=41.7%, Truth 'e'=43.4%
After 'e': Pred '_'=22.5%, Truth '_'=24.5%
After '_': Pred 't'=12.3%, Truth 't'=13.1%
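
The Pred column follows directly from the factored model: each row is a softmax over the logits column for the previous byte. A minimal sketch (W_ho, W_ih as above; exact percentages depend on the corpus):

import numpy as np

def next_byte_probs(W_ho, W_ih, prev_byte: int) -> np.ndarray:
    """Predicted distribution over the next byte given the previous byte."""
    logits = W_ho @ W_ih[:, prev_byte]
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

p = next_byte_probs(W_ho, W_ih, ord('t'))
print(f"P('h' | 't') = {p[ord('h')]:.1%}")  # ≈ 26.5% in the run above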

Why W_hh Injection is Hard

We successfully inject W_ih and W_ho (the "local" weights), but what about W_hh (the recurrent weights)? Initial experiments with trigram-derived W_hh injection failed. This section explains why.

The Arithmetic Coding Analogy

In arithmetic coding, the coder maintains a quotient Q = N/D that tracks position within the probability interval. Ideally, Q carries unbounded information—each new symbol narrows the interval, accumulating context from arbitrarily far back.

But in practice, Q is stored in float32 with only 24 mantissa bits. This creates a depth limit:

d_max ≈ 24 / (-log₂ p_avg)

After d_max symbols, precision loss forces renormalization. Older context is forgotten.
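
Plugging in some illustrative per-symbol probabilities makes the bound concrete (these p_avg values are examples, not measurements from the corpus):

import math

def d_max(p_avg: float, mantissa_bits: int = 24) -> float:
    """Depth at which a float32 interval state runs out of mantissa precision."""
    return mantissa_bits / (-math.log2(p_avg))

print(d_max(0.5))    # 1 bit/symbol  -> ~24 symbols of usable context
print(d_max(0.25))   # 2 bits/symbol -> ~12 symbols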

Hidden State as Interval State

The RNN hidden state h is exactly analogous to the AC interval Q:

Arithmetic Coding               RNN
Q = N/D (interval position)     h (hidden vector)
Symbol narrows interval         Input updates hidden state
Q carries past context          h carries temporal memory
float32 limits depth            float32 limits temporal reach

Key insight: Carrying entropy through layers = carrying entropy through time.

Why W_hh Can't Be Injected

W_ih and W_ho encode local patterns (bigrams): given this input, predict that output. These don't require carrying state—they're pure matrix factorization.

W_hh encodes temporal patterns: how to carry information from step t to step t+1. This requires patterns that span multiple timesteps (trigrams, 4-grams, ...). But:

  1. Each matrix multiply h' = W_hh @ h introduces precision loss
  2. After ~24 / (bits per step) iterations, early information is washed out
  3. Trigram patterns (2 steps back) may already exceed stable precision for some contexts

The fundamental limit: an ideal Q carries unbounded information, but a float32 state does not.
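
A toy illustration of that washout (a sketch only: the recurrent matrix and inputs are random, and the point is just that a difference injected at step 0 decays geometrically under a stable tanh recurrence until it sits at or below float32 resolution):

import numpy as np

rng = np.random.default_rng(0)
H = 64
A = rng.standard_normal((H, H)).astype(np.float32)
W_hh = 0.7 * A / np.linalg.norm(A, 2)             # spectral norm 0.7: a stable recurrence

h_a = 0.1 * rng.standard_normal(H).astype(np.float32)
h_b = h_a.copy()
h_b[0] += 1.0                                     # information injected only at step 0

for t in range(1, 61):
    x = rng.standard_normal(H).astype(np.float32) # identical inputs from step 1 on
    h_a = np.tanh(W_hh @ h_a + x)
    h_b = np.tanh(W_hh @ h_b + x)
    if t % 15 == 0:
        print(t, float(np.abs(h_a - h_b).max()))  # gap shrinks toward f32 noise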

Implications

This explains several phenomena:

Conclusions

  1. Patterns are transferable: UM statistics can initialize RNN weights
  2. SVD provides compression: 256×256 → 256×64 + 64×256 with minimal loss
  3. Head start is significant: ~1 bit/char better than random
  4. Training refines the prior: Gradient descent handles longer context
  5. Temporal patterns hit precision limits: W_hh injection fails because carrying entropy through time requires more precision than f32 provides
