The Bayesian Pattern Story
From joint events to conditional probabilities, and how ES normalization redistributes entropy.
1. From Log Support to Conditional Probability
We have a single atomic pattern: 'e' → 'vowel'
With log support T(e, vowel) = log count(e, vowel)
This is a sufficient statistic - nothing more can be said about this joint event.
P(vowel | e) = P(e, vowel) / P(e)
In log-support:
T(vowel | e) = T(e, vowel) - T(e)
where:
T(e) = log Σ_y exp(T(e, y)) ← "rest of event space" on output side
Key: To go from joint to conditional, we need the marginal.
The marginal = logsumexp over all other events in the space.
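A minimal sketch of this joint → conditional step, assuming toy counts for 'e' paired with a few output-side events (the count values and the numpy/scipy dependencies are illustrative, not from the original):

```python
import numpy as np
from scipy.special import logsumexp

# Toy joint counts for the input 'e' paired with output-side events (made up for illustration).
counts = {"vowel": 900.0, "consonant": 80.0, "other": 20.0}

# Log support of each joint event: T(e, y) = log count(e, y)
T_joint = {y: np.log(c) for y, c in counts.items()}

# Marginal on the input side: T(e) = logsumexp over the rest of the output event space
T_e = logsumexp(list(T_joint.values()))

# Conditional log support: T(y | e) = T(e, y) - T(e)
T_cond = {y: T_joint[y] - T_e for y in T_joint}

# Exponentiating recovers conditional probabilities P(y | e), which sum to 1.
print({y: np.exp(t) for y, t in T_cond.items()})  # {'vowel': 0.9, 'consonant': 0.08, 'other': 0.02}
```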
2. Pattern Depth Without Growing the NN
Standard RNN: input x_t (256 dims)  →  Augmented RNN: input (x_t, ES(x_t)) (260 dims)
Key insight: ES(x) is DETERMINISTIC given x.
- is_vowel('e') = 1
- is_vowel('x') = 0
So the pattern 'vowel' → 'consonant' can be captured in ONE tick,
even though it spans TWO levels in tick-tock.
Effective pattern depth increases:
Before: W_hh encodes x_{t-1} → x_t
After: W_hh + W_ES encodes (x_{t-1}, ES(x_{t-1})) → x_t
This is FREE - no learning needed for ES membership.
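A minimal sketch of the augmented recurrence, assuming a plain tanh RNN over byte one-hots; the sizes, the `es_features` helper, and the weight names are illustrative stand-ins, not the original architecture:

```python
import numpy as np

H, V, E = 128, 256, 4                 # hidden size, byte vocabulary, number of ES features
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.01, (H, V))    # learned: byte one-hot -> hidden
W_es = rng.normal(0, 0.01, (H, E))    # learned: ES features -> hidden
W_hh = rng.normal(0, 0.01, (H, H))    # learned: hidden -> hidden

def es_features(byte):
    """Deterministic ES membership bits: (vowel, digit, punct, whitespace)."""
    ch = chr(byte)
    return np.array([ch.lower() in "aeiou",
                     ch in "0123456789",
                     ch in ".,!?;:",
                     ch in " \n\t"], dtype=float)

def step(h_prev, byte):
    x = np.zeros(V); x[byte] = 1.0    # one-hot byte (256 dims)
    es = es_features(byte)            # deterministic ES bits (4 dims) -> 260 dims total
    return np.tanh(W_xh @ x + W_es @ es + W_hh @ h_prev)

h = np.zeros(H)
for b in b"ex":
    h = step(h, b)
```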
3. Deterministic ES Extraction
ES membership is a lookup table:
| Function      | Definition                    |
|---------------|-------------------------------|
| vowel(x)      | 1 if x ∈ {a, e, i, o, u, ...} |
| digit(x)      | 1 if x ∈ {0, 1, 2, ..., 9}    |
| punct(x)      | 1 if x ∈ {., !, ?, ...}       |
| whitespace(x) | 1 if x ∈ {space, \n, \t}      |
No learning required - compile to lookup table, inject into RNN input.
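One way to compile such a table, sketched under the assumption of byte-valued inputs and the four functions above (the 256×4 layout and the exact punctuation set are illustrative choices):

```python
import numpy as np

# Compile ES membership into a 256 x 4 table indexed by byte value.
# Columns: vowel, digit, punct, whitespace (the four functions above).
ES_TABLE = np.zeros((256, 4), dtype=np.float32)
for b in range(256):
    ch = chr(b)
    ES_TABLE[b, 0] = ch.lower() in "aeiou"
    ES_TABLE[b, 1] = ch in "0123456789"
    ES_TABLE[b, 2] = ch in ".,!?;:"
    ES_TABLE[b, 3] = ch in " \n\t"

# Injecting into the RNN input is a single row lookup -- no parameters, no learning.
es_of_e = ES_TABLE[ord('e')]   # array([1., 0., 0., 0.], dtype=float32)
```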
4. Correlation: 'e' Contains 'vowel'
'e' and 'vowel' are perfectly correlated:
If x = 'e', then vowel(x) = 1 (deterministic)
If vowel(x) = 1, then x ∈ {a,e,i,o,u} (5 possibilities)
Mutual information:
I(x; vowel) = H(vowel) - H(vowel | x)
= H(vowel) - 0
= H(vowel) ≈ 0.96 bits
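The 0.96-bit figure is just the binary entropy of the vowel indicator; a quick check, taking P(vowel) ≈ 0.382 from the column sum of the vowel table in section 5:

```python
import numpy as np

p_vowel = 0.082 + 0.127 + 0.070 + 0.075 + 0.028   # ≈ 0.382, from the table in section 5

def binary_entropy(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# I(x; vowel) = H(vowel) - H(vowel | x) = H(vowel) - 0, since vowel(x) is deterministic.
print(binary_entropy(p_vowel))   # ≈ 0.96 bits
```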
The Bayesian story:
P(next | e) contains all information.
P(next | vowel) is coarser.
P(next | e, vowel) = P(next | e), because the event 'e' is contained in the event 'vowel' (e ⊂ vowel), so conditioning on vowel adds nothing once 'e' is known.
But: The RNN with just 'e' must LEARN vowel-ness.
The RNN with (e, vowel) gets it FREE.
5. ES Normalization → Uniform Entropy Redistribution
Entropy decomposition:
H(X) = H(ES(X)) + H(X | ES(X))
= "coarse" + "fine"
Example:
H(byte) ≈ H(which ES?) + H(which byte within ES?)
8 bits ≈ 2 bits + 6 bits
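A numerical check of this chain-rule decomposition on a toy four-symbol distribution (the numbers below are invented purely to verify the identity):

```python
import numpy as np

# Toy distribution over four symbols, partitioned into two ES classes.
p = {"a": 0.3, "e": 0.2,      # ES class "vowel"
     "x": 0.4, "z": 0.1}      # ES class "consonant"
es = {"a": "vowel", "e": "vowel", "x": "consonant", "z": "consonant"}

def H(probs):
    return -sum(q * np.log2(q) for q in probs if q > 0)

H_X = H(p.values())

# Marginal over ES classes and conditional entropy within each class.
p_es = {c: sum(p[s] for s in p if es[s] == c) for c in set(es.values())}
H_ES = H(p_es.values())
H_X_given_ES = sum(p_es[c] * H([p[s] / p_es[c] for s in p if es[s] == c])
                   for c in p_es)

print(H_X, H_ES + H_X_given_ES)   # both ≈ 1.846 bits: H(X) = H(ES) + H(X | ES)
```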
When we normalize BY the ES:
P(x | ES) = P(x) / P(ES) for x ∈ ES
| Letter x | P(x)  | P(x \| vowel ES) |
|----------|-------|------------------|
| 'a'      | 0.082 | 0.215            |
| 'e'      | 0.127 | 0.332            |
| 'i'      | 0.070 | 0.183            |
| 'o'      | 0.075 | 0.196            |
| 'u'      | 0.028 | 0.073            |
Entropy within the vowel ES:
- Before normalization (the vowels' raw contribution to H(X), i.e. -Σ P(x) log₂ P(x) over the five vowels): 1.37 bits
- After normalization (H(X | ES = vowel)): 2.19 bits
- Uniform over the 5 vowels (log₂ 5): 2.32 bits
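The normalized column and all three entropy figures can be reproduced directly from the marginal probabilities above; a short sketch:

```python
import numpy as np

# Marginal letter probabilities from the table above: a, e, i, o, u.
p = np.array([0.082, 0.127, 0.070, 0.075, 0.028])

p_es = p.sum()                    # P(vowel ES) ≈ 0.382
q = p / p_es                      # P(x | vowel ES) ≈ [0.215, 0.332, 0.183, 0.196, 0.073]

before = -(p * np.log2(p)).sum()  # vowels' contribution to H(X)   ≈ 1.37 bits
after = -(q * np.log2(q)).sum()   # H(X | ES = vowel)              ≈ 2.19 bits
uniform = np.log2(len(p))         # uniform over the 5 vowels      ≈ 2.32 bits
print(before, after, uniform)
```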
Embedding interpretation:
Standard: e_x represents P(next | x)
ES-normalized: e_x represents P(next | x, ES(x))
= excess predictive power beyond ES membership
The ES-normalized embedding keeps only the RESIDUAL information
that 'e' provides beyond just knowing "it's a vowel".
Summary: Five Key Points
1. Joint → Conditional: T(vowel|e) = T(e,vowel) - T(e). Need marginals = "rest of event space".
2. Pattern Depth: Feed (x, ES(x)) → doubles effective depth for free.
3. Deterministic: ES membership is a lookup table, no learning needed.
4. Correlation: 'e' determines 'vowel', but RNN must learn this. Augmentation gives it free.
5. Normalization: H(X) = H(ES) + H(X|ES). Normalizing redistributes entropy uniformly within ES.