
The Bayesian Pattern Story

From joint events to conditional probabilities, and how ES normalization redistributes entropy.

1. From Log Support to Conditional Probability

We have a single atomic pattern: 'e' → 'vowel'

With log support T(e, vowel) = log count(e, vowel)

This is a sufficient statistic - nothing more can be said about this joint event.

P(vowel | e) = P(e, vowel) / P(e)

In log-support form:

T(vowel | e) = T(e, vowel) - T(e)

where T(e) = log Σ_y exp(T(e, y)) is the marginal over the "rest of the event space" on the output side.
Key: To go from joint to conditional, we need the marginal. The marginal = logsumexp over all other events in the space.
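A minimal sketch of the joint-to-conditional step, assuming the log supports sit in a NumPy array T[x, y]; the counts below are random placeholders rather than real co-occurrence data:

```python
import numpy as np
from scipy.special import logsumexp

# Illustrative log-support table T[x, y] = log count(x, y).
rng = np.random.default_rng(0)
T = np.log(rng.integers(1, 100, size=(4, 3)).astype(float))

# Marginal on the input side: T(x) = log sum_y exp(T(x, y)).
T_x = logsumexp(T, axis=1, keepdims=True)

# Joint -> conditional in log-support form: T(y | x) = T(x, y) - T(x).
T_cond = T - T_x

# Each row now exponentiates to a normalized distribution P(y | x).
assert np.allclose(np.exp(T_cond).sum(axis=1), 1.0)
```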

2. Pattern Depth Without Growing the NN

Standard RNN:    input x_t             (256 dims)
Augmented RNN:   input (x_t, ES(x_t))  (260 dims)

Key insight: ES(x) is DETERMINISTIC given x.

So the pattern 'vowel' → 'consonant' can be captured in ONE tick, even though it spans TWO levels in tick-tock.

Effective pattern depth increases:

Before: W_hh encodes x_{t-1} → x_t
After:  W_hh + W_ES encodes (x_{t-1}, ES(x_{t-1})) → x_t

This is FREE - no learning needed for ES membership.
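A minimal sketch of one augmented recurrent step for a plain tanh RNN; the weight names (W_xh, W_es, W_hh) and sizes are illustrative assumptions, not a reference implementation:

```python
import numpy as np

V, K, H = 256, 4, 128              # vocab size, ES features, hidden units
rng = np.random.default_rng(0)
W_xh = rng.normal(0.0, 0.01, (H, V))
W_es = rng.normal(0.0, 0.01, (H, K))   # extra block for the deterministic ES bits
W_hh = rng.normal(0.0, 0.01, (H, H))

def step(x_onehot, es_bits, h_prev):
    # ES(x_t) rides along with x_t, so a pattern like 'vowel' -> 'consonant'
    # is visible to the recurrence within a single tick.
    return np.tanh(W_xh @ x_onehot + W_es @ es_bits + W_hh @ h_prev)

h = step(np.eye(V)[ord("e")], np.array([1.0, 0.0, 0.0, 0.0]), np.zeros(H))
```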

3. Deterministic ES Extraction

ES membership is a lookup table:

Function         Definition
vowel(x)         1 if x ∈ {a, e, i, o, u, ...}
digit(x)         1 if x ∈ {0, 1, ..., 9}
punct(x)         1 if x ∈ {., !, ?, ...}
whitespace(x)    1 if x ∈ {space, \n, \t}

No learning required - compile to lookup table, inject into RNN input.
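A sketch of that compilation step over raw bytes, assuming the four ES features above (the exact character sets are illustrative):

```python
import numpy as np

# Compile ES membership into a fixed 256 x 4 lookup table, one row per byte.
# Columns: vowel, digit, punct, whitespace.  No parameters, nothing to learn.
ES_TABLE = np.zeros((256, 4), dtype=np.float32)
for b in b"aeiouAEIOU":
    ES_TABLE[b, 0] = 1.0
for b in b"0123456789":
    ES_TABLE[b, 1] = 1.0
for b in b".,!?;:":
    ES_TABLE[b, 2] = 1.0
for b in b" \n\t":
    ES_TABLE[b, 3] = 1.0

def es(byte):
    # Injecting ES(x) into the RNN input is a single table lookup per timestep.
    return ES_TABLE[byte]

assert es(ord("e")).tolist() == [1.0, 0.0, 0.0, 0.0]
```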

4. Correlation: 'e' Contains 'vowel'

'e' and 'vowel' are tightly coupled:

If x = 'e', then vowel(x) = 1 (deterministic).
If vowel(x) = 1, then x ∈ {a, e, i, o, u} (5 possibilities).

Mutual information: I(x; vowel) = H(vowel) - H(vowel | x) = H(vowel) - 0 = H(vowel) ≈ 0.96 bits.
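A quick check of that figure, assuming vowels make up roughly 38% of letters in English text (the 0.38 frequency is an assumption, not from this note):

```python
import numpy as np

# Because vowel(x) is deterministic given x, H(vowel | x) = 0,
# so I(x; vowel) = H(vowel), a binary entropy at the assumed vowel rate.
p = 0.38
H_vowel = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
print(round(H_vowel, 2))    # ~0.96 bits
```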
The Bayesian story:
P(next | e) contains all information.
P(next | vowel) is coarser.
P(next | e, vowel) = P(next | e), because the event x = 'e' is contained in the vowel event ({e} ⊂ vowel), so conditioning on 'e' already implies vowel-ness.

But: The RNN with just 'e' must LEARN vowel-ness.
The RNN with (e, vowel) gets it FREE.

5. ES Normalization → Uniform Entropy Redistribution

Entropy decomposition (exact, because ES(X) is a deterministic function of X):

H(X) = H(ES(X)) + H(X | ES(X)) = "coarse" + "fine"

Example: H(byte) ≈ H(which ES?) + H(which byte within ES?)
         8 bits  ≈    2 bits    +        6 bits
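A numerical check of the decomposition on a made-up five-symbol distribution with two ES classes:

```python
import numpy as np

p_x = np.array([0.30, 0.20, 0.25, 0.15, 0.10])   # P(x)
es_of = np.array([0, 0, 1, 1, 1])                # ES(x): coarse class of each x

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_es = np.array([p_x[es_of == k].sum() for k in (0, 1)])            # P(ES)
H_within = sum(p_es[k] * H(p_x[es_of == k] / p_es[k]) for k in (0, 1))

assert np.isclose(H(p_x), H(p_es) + H_within)    # coarse + fine = total
```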

When we normalize BY the ES:

P(x | ES) = P(x) / P(ES) for x ∈ ES
Vowel x    P(x)      P(x | vowel ES)
'a'        0.082     0.215
'e'        0.127     0.332
'i'        0.070     0.183
'o'        0.075     0.196
'u'        0.028     0.073

Entropy within vowel ES: H(X | X ∈ vowel) = -Σ_x P(x | vowel ES) log2 P(x | vowel ES) ≈ 2.19 bits (versus log2 5 ≈ 2.32 bits if all five vowels were equally likely).
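The table and the entropy figure can be reproduced directly from the marginal probabilities above (a minimal NumPy sketch):

```python
import numpy as np

# Normalize the marginal vowel probabilities by P(vowel ES),
# then compute the entropy within the ES.
p = np.array([0.082, 0.127, 0.070, 0.075, 0.028])   # P(x) for a, e, i, o, u
p_es = p.sum()                                       # P(vowel ES) ≈ 0.382
p_cond = p / p_es                                    # P(x | vowel ES)

H_within = -np.sum(p_cond * np.log2(p_cond))
print(p_cond.round(3))      # [0.215 0.332 0.183 0.196 0.073]
print(round(H_within, 2))   # ~2.19 bits, vs. log2(5) ≈ 2.32 if uniform
```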

Embedding interpretation:

Standard: e_x represents P(next | x)

ES-normalized: e_x represents P(next | x, ES(x))
               = the excess predictive power beyond ES membership

The ES-normalized embedding keeps only the RESIDUAL information that 'e' provides beyond just knowing "it's a vowel".
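One way to read this residual view in code, assuming byte-level and ES-level next-symbol predictors are available as log-probability tables; the function and argument names here are hypothetical:

```python
import numpy as np

def es_normalized_logits(log_p_next_given_x, log_p_next_given_es, es_of_x):
    """Residual log-probabilities: log P(next | x) - log P(next | ES(x)).

    log_p_next_given_x:  (vocab, next_vocab) byte-level predictor
    log_p_next_given_es: (n_es, next_vocab)  ES-level predictor
    es_of_x:             (vocab,)            ES index of each byte
    Each row keeps only what x adds beyond its ES membership.
    """
    return log_p_next_given_x - log_p_next_given_es[es_of_x]
```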

Summary: Five Key Points

1. Joint → Conditional: T(vowel | e) = T(e, vowel) - T(e). Need marginals = "rest of event space".
2. Pattern Depth: Feed (x, ES(x)) → doubles effective depth for free.
3. Deterministic: ES membership is a lookup table, no learning needed.
4. Correlation: 'e' determines 'vowel', but the RNN must learn this. Augmentation gives it for free.
5. Normalization: H(X) = H(ES) + H(X | ES). Normalizing redistributes entropy uniformly within each ES.