
Deriving Bayes from a Joint Table

Given ONLY T(x,y), derive all marginals and conditionals.

Key Point: You need BOTH event spaces.
Input ES: The alphabet (256 bytes)
Output ES: The character classes we've factored out (Vowel, Cons, Digit, Punct, Space, Other)

Without P(x) from the input ES, we can't do Bayes.

Step 0: The Joint Count Table

Raw bigram counts: C(prev_char, next_class)

prev   Vowel   Cons   Digit   Punct   Space   Other
'a'      800   4200      10     100     900      50
'e'     1200   3800      15     120    1100      60
'i'      600   3500       8      80     800      40
' '      500   8000     200      50     100     300
'.'      100    200    1500      30    2000     100
'0'       50    100    3000      80     200     150

Step 1: Convert to Log Support

T(x, y) = log C(x, y)

This is the SUFFICIENT STATISTIC: the joint table carries everything the data can tell us about (x, y).
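The conversion is mechanical; a minimal Python sketch (the variable names are my own, not from the text) that builds T(x, y) from the raw counts in Step 0:

```python
import math

# Bigram counts C(prev_char, next_class) from the Step 0 table.
CLASSES = ["Vowel", "Cons", "Digit", "Punct", "Space", "Other"]
COUNTS = {
    "a": [800, 4200, 10, 100, 900, 50],
    "e": [1200, 3800, 15, 120, 1100, 60],
    "i": [600, 3500, 8, 80, 800, 40],
    " ": [500, 8000, 200, 50, 100, 300],
    ".": [100, 200, 1500, 30, 2000, 100],
    "0": [50, 100, 3000, 80, 200, 150],
}

# T(x, y) = log C(x, y): the joint table in log space.
T = {(x, y): math.log(c) for x, row in COUNTS.items() for y, c in zip(CLASSES, row)}

print(round(T[("e", "Vowel")], 2))  # 7.09, i.e. log(1200)
```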

Step 2: Input Marginal (marginalize over the OUTPUT event space)

T(x) = logsumexp_y T(x, y)   ← sum over all OUTPUT classes

For 'e':
  T('e') = log[ exp(T('e',Vow)) + exp(T('e',Con)) + ... ]
         = log[ 1200 + 3800 + 15 + 120 + 1100 + 60 ]
         = log(6295)
         = 8.75
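The row marginal can be reproduced with a numerically stable logsumexp; a sketch (the helper is my own, not from the text) using the 'e' row of the joint table:

```python
import math

def logsumexp(vals):
    # Numerically stable log(sum(exp(v))): subtract the max before exponentiating.
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# The 'e' row of T(x, y), i.e. log counts over the six output classes.
T_e_row = [math.log(c) for c in [1200, 3800, 15, 120, 1100, 60]]
T_marginal_e = logsumexp(T_e_row)

print(round(T_marginal_e, 2))  # 8.75, i.e. log(6295)
```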
char   T(x)    P(x)
'a'    8.71    9.3%
'e'    8.75    9.6%
'i'    8.52    7.7%
' '    9.12   14.0%
'.'    8.28    6.0%
'0'    8.18    5.5%

Step 3: Output Marginal (marginalize over the INPUT event space)

T(y) = logsumexp_x T(x, y)   ← sum over all INPUT chars

For 'Vowel':
  T(Vowel) = log[ exp(T('a',Vow)) + exp(T('e',Vow)) + exp(T('i',Vow)) + ... ]
           = log[ 800 + 1200 + 600 + ... ]   (over ALL chars in the alphabet, not just the rows shown)
           = 9.48
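The same helper works column-wise. Note that over just the six rows shown it gives log(3250) ≈ 8.09, not 9.48: the document's value marginalizes over the full alphabet, which the excerpted table doesn't include.

```python
import math

def logsumexp(vals):
    # Numerically stable log(sum(exp(v))).
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# The 'Vowel' column over only the six rows shown in Step 0.
vowel_counts = [800, 1200, 600, 500, 100, 50]
T_Vowel_subset = logsumexp([math.log(c) for c in vowel_counts])

print(round(T_Vowel_subset, 2))  # 8.09, i.e. log(3250) -- a partial sum
```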
class    T(y)    P(y)
Vowel    9.48   20.0%
Cons    10.42   51.5%
Digit    8.90   11.2%
Punct    6.91    1.5%
Space    9.14   14.3%
Other    6.95    1.6%

Step 4: Forward Conditional T(y|x)

T(y|x) = T(x,y) - T(x)

For P(Vowel | 'e'):
  T(Vowel|'e') = T('e',Vowel) - T('e')
               = 7.09 - 8.75
               = -1.66
  P(Vowel|'e') = exp(-1.66) = 0.19  (19%)

This is what the RNN learns to predict!
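The forward conditional for the worked example can be checked numerically; a sketch using only the numbers above:

```python
import math

T_e_Vowel = math.log(1200)   # T('e', Vowel) from the joint table
T_e = math.log(6295)         # T('e'), the input marginal from Step 2

# T(y|x) = T(x,y) - T(x); exponentiate to get back to probability.
T_Vowel_given_e = T_e_Vowel - T_e

print(round(T_Vowel_given_e, 2))            # -1.66
print(round(math.exp(T_Vowel_given_e), 2))  # 0.19
```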

prev   P(Vow|x)   P(Con|x)   P(Dig|x)   P(Spc|x)
'a'        0.13       0.69       0.00       0.15
'e'        0.19       0.60       0.00       0.18
' '        0.05       0.87       0.02       0.01
'.'        0.03       0.05       0.38       0.51

Step 5: Inverse Conditional T(x|y)

T(x|y) = T(x,y) - T(y)

For P('e' | Vowel):
  T('e'|Vowel) = T('e',Vowel) - T(Vowel)
               = 7.09 - 9.48
               = -2.39
  P('e'|Vowel) = exp(-2.39) = 0.09  (9%)

"Given the next char is a vowel, what's the probability prev was 'e'?"

Step 6: Verify Bayes Theorem

Bayes: P(y|x) = P(x|y) × P(y) / P(x)

In log form: T(y|x) = T(x|y) + T(y) - T(x)

For P(Vowel | 'e'):
  Via direct: T(Vowel|'e') = T('e',Vowel) - T('e')
                           = 7.09 - 8.75 = -1.66
  Via Bayes:  T(Vowel|'e') = T('e'|Vowel) + T(Vowel) - T('e')
                           = -2.39 + 9.48 - 8.75 = -1.66  ✓
Both methods give identical results!
Bayes theorem is just rearranging the same joint table T(x,y).
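The verification can be run in code; a sketch with the worked numbers, showing the two routes agree because the Bayes route is an algebraic rearrangement of the direct one:

```python
import math

T_e_Vowel = math.log(1200)   # T('e', Vowel)
T_e = math.log(6295)         # T('e'), input marginal
T_Vowel = 9.48               # T(Vowel), output marginal

direct = T_e_Vowel - T_e                           # T(y|x) = T(x,y) - T(x)
via_bayes = (T_e_Vowel - T_Vowel) + T_Vowel - T_e  # T(x|y) + T(y) - T(x)

# Identical up to floating-point noise: the T_Vowel terms cancel exactly.
print(round(direct, 2), round(via_bayes, 2))  # -1.66 -1.66
```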

The Complete Picture

┌─────────────────────────────────────────────────────────────────┐
│              JOINT TABLE  T(x,y) = log count(x,y)               │
│           The ONLY data we need - sufficient statistic          │
└─────────────────────────────────────────────────────────────────┘
            │                                     │
            ▼                                     ▼
┌───────────────────────────┐       ┌───────────────────────────┐
│ T(x) = logsumexp_y T(x,y) │       │ T(y) = logsumexp_x T(x,y) │
│       INPUT marginal      │       │      OUTPUT marginal      │
│      (alphabet freqs)     │       │       (class freqs)       │
└───────────────────────────┘       └───────────────────────────┘
            │                                     │
            └──────────────────┬──────────────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
           ▼                   │                   ▼
   ┌───────────────┐           │           ┌───────────────┐
   │    T(y|x)     │           │           │    T(x|y)     │
   │ = T(x,y)-T(x) │           │           │ = T(x,y)-T(y) │
   │    FORWARD    │           │           │    INVERSE    │
   └───────────────┘           │           └───────────────┘
           │                   │                   │
           └───────────────────┼───────────────────┘
                               │
                               ▼
               ┌───────────────────────────┐
               │       BAYES THEOREM       │
               │ T(y|x) = T(x|y)+T(y)-T(x) │
               │                           │
               │  All from T(x,y) alone!   │
               └───────────────────────────┘

Summary

To derive P(y|x) from joint table T(x,y):

  1. Compute T(x) = logsumexp_y T(x,y)  (marginalize over the OUTPUT classes)
  2. Compute T(y) = logsumexp_x T(x,y)  (marginalize over the INPUT chars)
  3. Compute T(y|x) = T(x,y) - T(x)
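The recipe above can be wrapped into one function; a sketch (the function name and dict layout are my own, not from the text):

```python
import math

def logsumexp(vals):
    # Numerically stable log(sum(exp(v))).
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward_conditional(counts, x, y, classes):
    """P(y|x) from raw joint counts via the log-domain recipe above."""
    T_xy = math.log(counts[x][classes.index(y)])       # T(x, y)
    T_x = logsumexp([math.log(c) for c in counts[x]])  # T(x) = logsumexp_y T(x, y)
    return math.exp(T_xy - T_x)                        # P(y|x) = exp(T(y|x))

CLASSES = ["Vowel", "Cons", "Digit", "Punct", "Space", "Other"]
COUNTS = {"e": [1200, 3800, 15, 120, 1100, 60]}

print(round(forward_conditional(COUNTS, "e", "Vowel", CLASSES), 2))  # 0.19
```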

Both marginals are needed: T(x) for the forward conditional T(y|x), and T(y) for the inverse conditional T(x|y) and the Bayes rearrangement.

Event spaces: the input ES (the 256-byte alphabet) supplies P(x); the output ES (the character classes) supplies P(y). Without both, Bayes cannot be applied.
