Deriving Bayes from a Joint Table
Given ONLY T(x,y), derive all marginals and conditionals.
Key Point: You need BOTH event spaces.
Input ES: The alphabet (256 bytes)
Output ES: Character classification we've factored out
Without P(x) from the input ES, we can't apply Bayes' theorem.
Step 0: The Joint Count Table
Raw bigram counts: C(prev_char, next_class)
| prev | Vowel | Cons | Digit | Punct | Space | Other |
| 'a' | 800 | 4200 | 10 | 100 | 900 | 50 |
| 'e' | 1200 | 3800 | 15 | 120 | 1100 | 60 |
| 'i' | 600 | 3500 | 8 | 80 | 800 | 40 |
| ' ' | 500 | 8000 | 200 | 50 | 100 | 300 |
| '.' | 100 | 200 | 1500 | 30 | 2000 | 100 |
| '0' | 50 | 100 | 3000 | 80 | 200 | 150 |
Step 1: Convert to Log Support
T(x, y) = log C(x, y)
This table is the SUFFICIENT STATISTIC - the data have nothing more to say about (x,y) beyond it.
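Here is a minimal numpy sketch of Steps 0-1 (the names `C`, `T`, `chars`, `classes` are mine, not part of the original derivation):

```python
import numpy as np

# Step 0: joint bigram counts C(prev_char, next_class) from the table above.
chars   = ['a', 'e', 'i', ' ', '.', '0']
classes = ['Vowel', 'Cons', 'Digit', 'Punct', 'Space', 'Other']
C = np.array([
    [ 800, 4200,   10, 100,  900,  50],   # 'a'
    [1200, 3800,   15, 120, 1100,  60],   # 'e'
    [ 600, 3500,    8,  80,  800,  40],   # 'i'
    [ 500, 8000,  200,  50,  100, 300],   # ' '
    [ 100,  200, 1500,  30, 2000, 100],   # '.'
    [  50,  100, 3000,  80,  200, 150],   # '0'
], dtype=float)

# Step 1: log support. From here on, T is the only thing we touch.
T = np.log(C)
```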
Step 2: Input Marginal (marginalize out the OUTPUT event space)
T(x) = logsumexp_y T(x, y) ← sum over all OUTPUT classes
For 'e':
T('e') = log[ exp(T('e',Vow)) + exp(T('e',Con)) + ... ]
= log[ 1200 + 3800 + 15 + 120 + 1100 + 60 ]
= log(6295)
= 8.75
(P(x) here is normalized over the full 256-character table, of which only six rows are shown.)
| char | T(x) | P(x) |
| 'a' | 8.71 | 9.3% |
| 'e' | 8.75 | 9.6% |
| 'i' | 8.52 | 7.7% |
| ' ' | 9.12 | 14.0% |
| '.' | 8.28 | 6.0% |
| '0' | 8.18 | 5.5% |
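Continuing the sketch, the input marginal is one logsumexp per row (scipy's `logsumexp` assumed). This reproduces the T(x) column exactly; the P(x) column would need the full 256-row table:

```python
from scipy.special import logsumexp

# Step 2: input marginal T(x) - marginalize out the OUTPUT classes (axis=1).
Tx = logsumexp(T, axis=1)
for ch, t in zip(chars, Tx):
    print(f"T({ch!r}) = {t:.2f}")   # 8.71, 8.75, 8.52, 9.12, 8.28, 8.18
```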
Step 3: Output Marginal (marginalize out the INPUT event space)
T(y) = logsumexp_x T(x, y) ← sum over all INPUT chars
For 'Vowel':
T(Vowel) = log[ exp(T('a',Vow)) + exp(T('e',Vow)) + exp(T('i',Vow)) + ... ]
= log[ 800 + 1200 + 600 + ... ] (over all chars in table)
= 9.48
(Each T(y) sums over all 256 characters, not just the six rows shown above.)
| class | T(y) | P(y) |
| Vowel | 9.48 | 20.0% |
| Cons | 10.42 | 51.5% |
| Digit | 8.90 | 11.2% |
| Punct | 6.91 | 1.5% |
| Space | 9.14 | 14.3% |
| Other | 6.95 | 1.6% |
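The output marginal is the same operation down the columns. Caveat: the values above were summed over all 256 characters, so the six visible rows alone give smaller numbers (e.g. T(Vowel) = log 3250 ≈ 8.09 here, not 9.48):

```python
# Step 3: output marginal T(y) - marginalize out the INPUT chars (axis=0).
Ty = logsumexp(T, axis=0)
for cl, t in zip(classes, Ty):
    print(f"T({cl}) = {t:.2f}")
```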
Step 4: Forward Conditional T(y|x)
T(y|x) = T(x,y) - T(x)
For P(Vowel | 'e'):
T(Vowel|'e') = T('e',Vowel) - T('e')
= 7.09 - 8.75
= -1.66
P(Vowel|'e') = exp(-1.66) = 0.19 (19%)
This is what the RNN learns to predict!
(Punct and Other columns omitted, so the rows shown sum to slightly less than 1.)
| prev | P(Vow|x) | P(Con|x) | P(Dig|x) | P(Spc|x) |
| 'a' | 0.13 | 0.69 | 0.00 | 0.15 |
| 'e' | 0.19 | 0.60 | 0.00 | 0.18 |
| ' ' | 0.05 | 0.87 | 0.02 | 0.01 |
| '.' | 0.03 | 0.05 | 0.38 | 0.51 |
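In code the forward conditional is a single broadcast subtraction, and since T(x) involves only its own row, the table above is exactly reproducible:

```python
# Step 4: forward conditional T(y|x) = T(x,y) - T(x); rows of P now sum to 1.
Tyx = T - Tx[:, None]
Pyx = np.exp(Tyx)

e, v = chars.index('e'), classes.index('Vowel')
print(f"P(Vowel|'e') = {Pyx[e, v]:.2f}")   # 0.19, as derived above
```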
Step 5: Inverse Conditional T(x|y)
T(x|y) = T(x,y) - T(y)
For P('e' | Vowel):
T('e'|Vowel) = T('e',Vowel) - T(Vowel)
= 7.09 - 9.48
= -2.39
P('e'|Vowel) = exp(-2.39) = 0.09 (9%)
"Given the next char is a vowel, what's the probability prev was 'e'?"
Step 6: Verify Bayes Theorem
Bayes: P(y|x) = P(x|y) × P(y) / P(x)
In log form:
T(y|x) = T(x|y) + T(y) - T(x)
For P(Vowel | 'e'):
Via direct: T(Vowel|'e') = T('e',Vowel) - T('e')
= 7.09 - 8.75 = -1.66
Via Bayes: T(Vowel|'e') = T('e'|Vowel) + T(Vowel) - T('e')
= -2.39 + 9.48 - 8.75 = -1.66 ✓
Both methods give identical results!
Bayes theorem is just a rearrangement of the same joint table T(x,y).
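The identity holds exactly on any joint table, however many rows it has, because both sides are rearrangements of the same T(x,y); numpy confirms it to machine precision:

```python
# Step 6: Bayes in log form, T(y|x) = T(x|y) + T(y) - T(x), checked everywhere.
via_direct = T - Tx[:, None]
via_bayes  = Txy + Ty[None, :] - Tx[:, None]
assert np.allclose(via_direct, via_bayes)
```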
The Complete Picture
┌───────────────────────────────────────────────┐
│       JOINT TABLE  T(x,y) = log C(x,y)        │
│  The ONLY data we need - sufficient statistic │
└───────────────────────────────────────────────┘
            │                           │
            ▼                           ▼
┌───────────────────────┐   ┌───────────────────────┐
│ T(x) = LSE_y T(x,y)   │   │ T(y) = LSE_x T(x,y)   │
│ INPUT marginal        │   │ OUTPUT marginal       │
│ (alphabet freqs)      │   │ (class freqs)         │
└───────────────────────┘   └───────────────────────┘
            │                           │
            └─────────────┬─────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 │                 ▼
┌───────────────┐         │         ┌───────────────┐
│ T(y|x)        │         │         │ T(x|y)        │
│ = T(x,y)-T(x) │         │         │ = T(x,y)-T(y) │
│ FORWARD       │         │         │ INVERSE       │
└───────────────┘         │         └───────────────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                          ▼
           ┌─────────────────────────────┐
           │        BAYES THEOREM        │
           │ T(y|x) = T(x|y)+T(y)-T(x)   │
           │                             │
           │   All from T(x,y) alone!    │
           └─────────────────────────────┘
(LSE = logsumexp)
Summary
To derive P(y|x) from the joint table T(x,y):
- Compute T(x) = logsumexp over the OUTPUT classes (marginalizes out the output ES)
- Compute T(y) = logsumexp over the INPUT chars (marginalizes out the input ES)
- T(y|x) = T(x,y) - T(x)
Both marginals are needed:
- T(x) for normalization (what we divide by)
- T(y) for Bayes inversion (if we need P(x|y))
Event spaces:
- INPUT ES = alphabet (256 bytes)
- OUTPUT ES = classification we factored out
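To close the loop, here is the whole note condensed into one self-contained function (a sketch; `derive_all` is my name for it, not the author's):

```python
import numpy as np
from scipy.special import logsumexp

def derive_all(C):
    """Derive every marginal and conditional from one joint count table C[x, y]."""
    T   = np.log(C)                    # Step 1: log support
    Tx  = logsumexp(T, axis=1)         # Step 2: input marginal T(x)
    Ty  = logsumexp(T, axis=0)         # Step 3: output marginal T(y)
    Tyx = T - Tx[:, None]              # Step 4: forward conditional T(y|x)
    Txy = T - Ty[None, :]              # Step 5: inverse conditional T(x|y)
    # Step 6: Bayes is an identity on these tables
    assert np.allclose(Tyx, Txy + Ty[None, :] - Tx[:, None])
    return Tx, Ty, Tyx, Txy
```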