Deriving Bayes from a Joint Table
Given ONLY T(x,y), derive all marginals and conditionals.
Key Point: You need BOTH event spaces.
Input ES: The alphabet (256 bytes)
Output ES: Character classification we've factored out
Without P(x) from the input ES, we can't apply Bayes' theorem.
Step 0: The Joint Count Table
Raw bigram counts: C(prev_char, next_class)
| prev | Vowel | Cons | Digit | Punct | Space | Other |
| 'a' | 800 | 4200 | 10 | 100 | 900 | 50 |
| 'e' | 1200 | 3800 | 15 | 120 | 1100 | 60 |
| 'i' | 600 | 3500 | 8 | 80 | 800 | 40 |
| ' ' | 500 | 8000 | 200 | 50 | 100 | 300 |
| '.' | 100 | 200 | 1500 | 30 | 2000 | 100 |
| '0' | 50 | 100 | 3000 | 80 | 200 | 150 |
Step 1: Convert to Log Support
T(x, y) = log C(x, y)
This table is the SUFFICIENT STATISTIC - the data have nothing more to say about (x,y) beyond it.
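Here is a minimal numpy sketch of Steps 0-1 (the names `C`, `T`, `chars`, `classes` are mine, not part of the original derivation):

```python
import numpy as np

# Step 0: joint bigram counts C(prev_char, next_class) from the table above.
chars   = ['a', 'e', 'i', ' ', '.', '0']
classes = ['Vowel', 'Cons', 'Digit', 'Punct', 'Space', 'Other']
C = np.array([
    [ 800, 4200,   10, 100,  900,  50],   # 'a'
    [1200, 3800,   15, 120, 1100,  60],   # 'e'
    [ 600, 3500,    8,  80,  800,  40],   # 'i'
    [ 500, 8000,  200,  50,  100, 300],   # ' '
    [ 100,  200, 1500,  30, 2000, 100],   # '.'
    [  50,  100, 3000,  80,  200, 150],   # '0'
], dtype=float)

# Step 1: log support. From here on, T is the only thing we touch.
T = np.log(C)
```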
Step 2: Input Marginal (marginalize out the OUTPUT event space)
T(x) = logsumexp_y T(x, y) ← sum over all OUTPUT classes
For 'e':
T('e') = log[ exp(T('e',Vow)) + exp(T('e',Con)) + ... ]
= log[ 1200 + 3800 + 15 + 120 + 1100 + 60 ]
= log(6295)
= 8.75
(P(x) here is normalized over the full 256-character table, of which only six rows are shown.)
| char | T(x) | P(x) |
| 'a' | 8.71 | 9.3% |
| 'e' | 8.75 | 9.6% |
| 'i' | 8.52 | 7.7% |
| ' ' | 9.12 | 14.0% |
| '.' | 8.28 | 6.0% |
| '0' | 8.18 | 5.5% |
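Continuing the sketch, the input marginal is one logsumexp per row (scipy's `logsumexp` assumed). This reproduces the T(x) column exactly; the P(x) column would need the full 256-row table:

```python
from scipy.special import logsumexp

# Step 2: input marginal T(x) - marginalize out the OUTPUT classes (axis=1).
Tx = logsumexp(T, axis=1)
for ch, t in zip(chars, Tx):
    print(f"T({ch!r}) = {t:.2f}")   # 8.71, 8.75, 8.52, 9.12, 8.28, 8.18
```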
Step 3: Output Marginal (marginalize out the INPUT event space)
T(y) = logsumexp_x T(x, y) ← sum over all INPUT chars
For 'Vowel':
T(Vowel) = log[ exp(T('a',Vow)) + exp(T('e',Vow)) + exp(T('i',Vow)) + ... ]
= log[ 800 + 1200 + 600 + ... ] (over all chars in table)
= 9.48
(Each T(y) sums over all 256 characters, not just the six rows shown above.)
| class | T(y) | P(y) |
| Vowel | 9.48 | 20.0% |
| Cons | 10.42 | 51.5% |
| Digit | 8.90 | 11.2% |
| Punct | 6.91 | 1.5% |
| Space | 9.14 | 14.3% |
| Other | 6.95 | 1.6% |
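The output marginal is the same operation down the columns. Caveat: the values above were summed over all 256 characters, so the six visible rows alone give smaller numbers (e.g. T(Vowel) = log 3250 ≈ 8.09 here, not 9.48):

```python
# Step 3: output marginal T(y) - marginalize out the INPUT chars (axis=0).
Ty = logsumexp(T, axis=0)
for cl, t in zip(classes, Ty):
    print(f"T({cl}) = {t:.2f}")
```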
Step 4: Forward Conditional T(y|x)
T(y|x) = T(x,y) - T(x)
For P(Vowel | 'e'):
T(Vowel|'e') = T('e',Vowel) - T('e')
= 7.09 - 8.75
= -1.66
P(Vowel|'e') = exp(-1.66) = 0.19 (19%)
This is what the RNN learns to predict!
(Punct and Other columns omitted, so the rows shown sum to slightly less than 1.)
| prev | P(Vow|x) | P(Con|x) | P(Dig|x) | P(Spc|x) |
| 'a' | 0.13 | 0.69 | 0.00 | 0.15 |
| 'e' | 0.19 | 0.60 | 0.00 | 0.18 |
| ' ' | 0.05 | 0.87 | 0.02 | 0.01 |
| '.' | 0.03 | 0.05 | 0.38 | 0.51 |
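In code the forward conditional is a single broadcast subtraction, and since T(x) involves only its own row, the table above is exactly reproducible:

```python
# Step 4: forward conditional T(y|x) = T(x,y) - T(x); rows of P now sum to 1.
Tyx = T - Tx[:, None]
Pyx = np.exp(Tyx)

e, v = chars.index('e'), classes.index('Vowel')
print(f"P(Vowel|'e') = {Pyx[e, v]:.2f}")   # 0.19, as derived above
```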
Step 5: Inverse Conditional T(x|y)
T(x|y) = T(x,y) - T(y)
For P('e' | Vowel):
T('e'|Vowel) = T('e',Vowel) - T(Vowel)
= 7.09 - 9.48
= -2.39
P('e'|Vowel) = exp(-2.39) = 0.09 (9%)
"Given the next char is a vowel, what's the probability prev was 'e'?"
Step 6: Verify Bayes Theorem
Bayes: P(y|x) = P(x|y) × P(y) / P(x)
In log form:
T(y|x) = T(x|y) + T(y) - T(x)
For P(Vowel | 'e'):
Via direct: T(Vowel|'e') = T('e',Vowel) - T('e')
= 7.09 - 8.75 = -1.66
Via Bayes: T(Vowel|'e') = T('e'|Vowel) + T(Vowel) - T('e')
= -2.39 + 9.48 - 8.75 = -1.66 ✓
Both methods give identical results!
Bayes theorem is just a rearrangement of the same joint table T(x,y).
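The identity holds exactly on any joint table, however many rows it has, because both sides are rearrangements of the same T(x,y); numpy confirms it to machine precision:

```python
# Step 6: Bayes in log form, T(y|x) = T(x|y) + T(y) - T(x), checked everywhere.
via_direct = T - Tx[:, None]
via_bayes  = Txy + Ty[None, :] - Tx[:, None]
assert np.allclose(via_direct, via_bayes)
```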
The Complete Picture
┌───────────────────────────────────────────────┐
│       JOINT TABLE  T(x,y) = log C(x,y)        │
│  The ONLY data we need - sufficient statistic │
└───────────────────────────────────────────────┘
            │                           │
            ▼                           ▼
┌───────────────────────┐   ┌───────────────────────┐
│ T(x) = LSE_y T(x,y)   │   │ T(y) = LSE_x T(x,y)   │
│ INPUT marginal        │   │ OUTPUT marginal       │
│ (alphabet freqs)      │   │ (class freqs)         │
└───────────────────────┘   └───────────────────────┘
            │                           │
            └─────────────┬─────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 │                 ▼
┌───────────────┐         │         ┌───────────────┐
│ T(y|x)        │         │         │ T(x|y)        │
│ = T(x,y)-T(x) │         │         │ = T(x,y)-T(y) │
│ FORWARD       │         │         │ INVERSE       │
└───────────────┘         │         └───────────────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                          ▼
           ┌─────────────────────────────┐
           │        BAYES THEOREM        │
           │ T(y|x) = T(x|y)+T(y)-T(x)   │
           │                             │
           │   All from T(x,y) alone!    │
           └─────────────────────────────┘
(LSE = logsumexp)
Summary
To derive P(y|x) from the joint table T(x,y):
- Compute T(x) = logsumexp over the OUTPUT classes (marginalizes out the output ES)
- Compute T(y) = logsumexp over the INPUT chars (marginalizes out the input ES)
- T(y|x) = T(x,y) - T(x)
Both marginals are needed:
- T(x) for normalization (what we divide by)
- T(y) for Bayes inversion (if we need P(x|y))
Event spaces:
- INPUT ES = alphabet (256 bytes)
- OUTPUT ES = classification we factored out
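To close the loop, here is the whole note condensed into one self-contained function (a sketch; `derive_all` is my name for it, not the author's):

```python
import numpy as np
from scipy.special import logsumexp

def derive_all(C):
    """Derive every marginal and conditional from one joint count table C[x, y]."""
    T   = np.log(C)                    # Step 1: log support
    Tx  = logsumexp(T, axis=1)         # Step 2: input marginal T(x)
    Ty  = logsumexp(T, axis=0)         # Step 3: output marginal T(y)
    Tyx = T - Tx[:, None]              # Step 4: forward conditional T(y|x)
    Txy = T - Ty[None, :]              # Step 5: inverse conditional T(x|y)
    # Step 6: Bayes is an identity on these tables
    assert np.allclose(Tyx, Txy + Ty[None, :] - Tx[:, None])
    return Tx, Ty, Tyx, Txy
```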