The Bayesian Pattern Story
From joint events to conditional probabilities, and how ES normalization redistributes entropy.
1. From Log Support to Conditional Probability
We have a single atomic pattern: 'e' → 'vowel'
With log support T(e, vowel) = log count(e, vowel)
This is a sufficient statistic - nothing more can be said about this joint event.
P(vowel | e) = P(e, vowel) / P(e)
In log-support:
T(vowel | e) = T(e, vowel) - T(e)
where:
T(e) = log Σ_y exp(T(e, y)) ← "rest of event space" on output side
Key: To go from joint to conditional, we need the marginal.
The marginal = logsumexp over all other events in the space.
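A minimal sketch of this joint → conditional step, assuming toy counts for 'e' paired with a few output-side events (the count values and the numpy/scipy dependencies are illustrative, not from the original):

```python
import numpy as np
from scipy.special import logsumexp

# Toy joint counts for the input 'e' paired with output-side events (made up for illustration).
counts = {"vowel": 900.0, "consonant": 80.0, "other": 20.0}

# Log support of each joint event: T(e, y) = log count(e, y)
T_joint = {y: np.log(c) for y, c in counts.items()}

# Marginal on the input side: T(e) = logsumexp over the rest of the output event space
T_e = logsumexp(list(T_joint.values()))

# Conditional log support: T(y | e) = T(e, y) - T(e)
T_cond = {y: T_joint[y] - T_e for y in T_joint}

# Exponentiating recovers conditional probabilities P(y | e), which sum to 1.
print({y: np.exp(t) for y, t in T_cond.items()})  # {'vowel': 0.9, 'consonant': 0.08, 'other': 0.02}
```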
2. Pattern Depth Without Growing the NN
Standard RNN: input x_t (256 dims)  →  Augmented RNN: input (x_t, ES(x_t)) (260 dims)
Key insight: ES(x) is DETERMINISTIC given x.
- is_vowel('e') = 1
- is_vowel('x') = 0
So the pattern 'vowel' → 'consonant' can be captured in ONE tick,
even though it spans TWO levels in tick-tock.
Effective pattern depth increases:
Before: W_hh encodes x_{t-1} → x_t
After: W_hh + W_ES encodes (x_{t-1}, ES(x_{t-1})) → x_t
This is FREE - no learning needed for ES membership.
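A minimal sketch of the augmented recurrence, assuming a plain tanh RNN over byte one-hots; the sizes, the `es_features` helper, and the weight names are illustrative stand-ins, not the original architecture:

```python
import numpy as np

H, V, E = 128, 256, 4                 # hidden size, byte vocabulary, number of ES features
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.01, (H, V))    # learned: byte one-hot -> hidden
W_es = rng.normal(0, 0.01, (H, E))    # learned: ES features -> hidden
W_hh = rng.normal(0, 0.01, (H, H))    # learned: hidden -> hidden

def es_features(byte):
    """Deterministic ES membership bits: (vowel, digit, punct, whitespace)."""
    ch = chr(byte)
    return np.array([ch.lower() in "aeiou",
                     ch in "0123456789",
                     ch in ".,!?;:",
                     ch in " \n\t"], dtype=float)

def step(h_prev, byte):
    x = np.zeros(V); x[byte] = 1.0    # one-hot byte (256 dims)
    es = es_features(byte)            # deterministic ES bits (4 dims) -> 260 dims total
    return np.tanh(W_xh @ x + W_es @ es + W_hh @ h_prev)

h = np.zeros(H)
for b in b"ex":
    h = step(h, b)
```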
3. Deterministic ES Extraction
ES membership is a lookup table:
| Function      | Definition                    |
|---------------|-------------------------------|
| vowel(x)      | 1 if x ∈ {a, e, i, o, u, ...} |
| digit(x)      | 1 if x ∈ {0, 1, 2, ..., 9}    |
| punct(x)      | 1 if x ∈ {., !, ?, ...}       |
| whitespace(x) | 1 if x ∈ {space, \n, \t}      |
No learning required - compile to lookup table, inject into RNN input.
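One way to compile such a table, sketched under the assumption of byte-valued inputs and the four functions above (the 256×4 layout and the exact punctuation set are illustrative choices):

```python
import numpy as np

# Compile ES membership into a 256 x 4 table indexed by byte value.
# Columns: vowel, digit, punct, whitespace (the four functions above).
ES_TABLE = np.zeros((256, 4), dtype=np.float32)
for b in range(256):
    ch = chr(b)
    ES_TABLE[b, 0] = ch.lower() in "aeiou"
    ES_TABLE[b, 1] = ch in "0123456789"
    ES_TABLE[b, 2] = ch in ".,!?;:"
    ES_TABLE[b, 3] = ch in " \n\t"

# Injecting into the RNN input is a single row lookup -- no parameters, no learning.
es_of_e = ES_TABLE[ord('e')]   # array([1., 0., 0., 0.], dtype=float32)
```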
4. Correlation: 'e' Contains 'vowel'
'e' and 'vowel' are perfectly correlated:
If x = 'e', then vowel(x) = 1 (deterministic)
If vowel(x) = 1, then x ∈ {a,e,i,o,u} (5 possibilities)
Mutual information:
I(x; vowel) = H(vowel) - H(vowel | x)
= H(vowel) - 0
= H(vowel) ≈ 0.96 bits
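The 0.96-bit figure is just the binary entropy of the vowel indicator; a quick check, taking P(vowel) ≈ 0.382 from the column sum of the vowel table in section 5:

```python
import numpy as np

p_vowel = 0.082 + 0.127 + 0.070 + 0.075 + 0.028   # ≈ 0.382, from the table in section 5

def binary_entropy(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# I(x; vowel) = H(vowel) - H(vowel | x) = H(vowel) - 0, since vowel(x) is deterministic.
print(binary_entropy(p_vowel))   # ≈ 0.96 bits
```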
The Bayesian story:
P(next | e) contains all information.
P(next | vowel) is coarser.
P(next | e, vowel) = P(next | e), because the event 'e' is contained in the event 'vowel' (e ⊂ vowel), so conditioning on vowel adds nothing once 'e' is known.
But: The RNN with just 'e' must LEARN vowel-ness.
The RNN with (e, vowel) gets it FREE.
5. ES Normalization → Uniform Entropy Redistribution
Entropy decomposition:
H(X) = H(ES(X)) + H(X | ES(X))
= "coarse" + "fine"
Example:
H(byte) ≈ H(which ES?) + H(which byte within ES?)
8 bits ≈ 2 bits + 6 bits
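A numerical check of this chain-rule decomposition on a toy four-symbol distribution (the numbers below are invented purely to verify the identity):

```python
import numpy as np

# Toy distribution over four symbols, partitioned into two ES classes.
p = {"a": 0.3, "e": 0.2,      # ES class "vowel"
     "x": 0.4, "z": 0.1}      # ES class "consonant"
es = {"a": "vowel", "e": "vowel", "x": "consonant", "z": "consonant"}

def H(probs):
    return -sum(q * np.log2(q) for q in probs if q > 0)

H_X = H(p.values())

# Marginal over ES classes and conditional entropy within each class.
p_es = {c: sum(p[s] for s in p if es[s] == c) for c in set(es.values())}
H_ES = H(p_es.values())
H_X_given_ES = sum(p_es[c] * H([p[s] / p_es[c] for s in p if es[s] == c])
                   for c in p_es)

print(H_X, H_ES + H_X_given_ES)   # both ≈ 1.846 bits: H(X) = H(ES) + H(X | ES)
```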
When we normalize BY the ES:
P(x | ES) = P(x) / P(ES) for x ∈ ES
| Letter x | P(x)  | P(x \| vowel ES) |
|----------|-------|------------------|
| 'a'      | 0.082 | 0.215            |
| 'e'      | 0.127 | 0.332            |
| 'i'      | 0.070 | 0.183            |
| 'o'      | 0.075 | 0.196            |
| 'u'      | 0.028 | 0.073            |
Entropy within the vowel ES:
- Before normalization (the vowels' raw contribution to H(X), i.e. -Σ P(x) log₂ P(x) over the five vowels): 1.37 bits
- After normalization (H(X | ES = vowel)): 2.19 bits
- Uniform over the 5 vowels (log₂ 5): 2.32 bits
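The normalized column and all three entropy figures can be reproduced directly from the marginal probabilities above; a short sketch:

```python
import numpy as np

# Marginal letter probabilities from the table above: a, e, i, o, u.
p = np.array([0.082, 0.127, 0.070, 0.075, 0.028])

p_es = p.sum()                    # P(vowel ES) ≈ 0.382
q = p / p_es                      # P(x | vowel ES) ≈ [0.215, 0.332, 0.183, 0.196, 0.073]

before = -(p * np.log2(p)).sum()  # vowels' contribution to H(X)   ≈ 1.37 bits
after = -(q * np.log2(q)).sum()   # H(X | ES = vowel)              ≈ 2.19 bits
uniform = np.log2(len(p))         # uniform over the 5 vowels      ≈ 2.32 bits
print(before, after, uniform)
```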
Embedding interpretation:
Standard: e_x represents P(next | x)
ES-normalized: e_x represents P(next | x, ES(x))
= excess predictive power beyond ES membership
The ES-normalized embedding keeps only the RESIDUAL information
that 'e' provides beyond just knowing "it's a vowel".
Summary: Five Key Points
1. Joint → Conditional: T(vowel|e) = T(e,vowel) - T(e). Need marginals = "rest of event space".
2. Pattern Depth: Feed (x, ES(x)) → doubles effective depth for free.
3. Deterministic: ES membership is a lookup table, no learning needed.
4. Correlation: 'e' determines 'vowel', but RNN must learn this. Augmentation gives it free.
5. Normalization: H(X) = H(ES) + H(X|ES). Normalizing redistributes entropy uniformly within ES.