Flat Space vs Factored Space

[fig-20260131_2-01: The 256-byte alphabet factors into ES membership × within-ES identity; flat space (top-left) vs ES × within-ES (bottom-left, right)]
Key Insight: The RNN learns to predict at two levels:
Which ES? After "th", predict the Vowels ES (not Digits, not Punct)
Which member? Within Vowels, predict 'e' over 'a', 'i', 'o', 'u'
This factorization explains 59% of the model's compression. The ES-level prediction accounts for 1.37 bits/char of the 2.31 bits/char total.
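As a minimal sketch of this two-level split (the partition and the probabilities below are made-up illustrations, not the trained RNN's outputs), the chain rule P(c) = P(ES(c)) · P(c | ES(c)) means the surprise in bits decomposes additively:

```python
import math

# Hypothetical predicted distributions after the context "th".
# These numbers are illustrative stand-ins, not the model's actual outputs.
p_es = {"Vowels": 0.7, "Consonants": 0.15, "Punct": 0.1, "Digits": 0.05}
p_within_vowels = {"e": 0.6, "a": 0.2, "i": 0.1, "o": 0.07, "u": 0.03}

def bits(p: float) -> float:
    """Surprise of an event with probability p, in bits."""
    return -math.log2(p)

# P(c) = P(ES(c)) * P(c | ES(c)), so the surprise splits additively:
# bits(P(c)) = bits(P(ES)) + bits(P(c | ES)).
es_bits = bits(p_es["Vowels"])
within_bits = bits(p_within_vowels["e"])
total_bits = bits(p_es["Vowels"] * p_within_vowels["e"])

assert abs(total_bits - (es_bits + within_bits)) < 1e-9
print(f"{es_bits:.2f} (ES) + {within_bits:.2f} (within) = {total_bits:.2f} bits")
```

The 1.37 bits/char at the ES level and the remaining 0.94 bits/char within ESs sum to the 2.31 bits/char total in exactly this way.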
Input × Output: The Full Joint Space
The I×O joint space (65K events) reduces to 25 ES-pairs for coarse prediction
The standard learning function ω₀ from the CMP paper records joint events (i, o) where i is the input byte and o is the output byte. This gives a 256×256 contingency table.
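A minimal sketch of what ω₀ amounts to, assuming we represent the 256×256 table as a sparse counter (the CMP paper's actual data structure may differ):

```python
from collections import Counter

def omega0(stream: bytes) -> Counter:
    """Record joint events (i, o): each input byte i followed by output byte o.
    A sparse stand-in for the 256x256 contingency table."""
    counts = Counter()
    for i, o in zip(stream, stream[1:]):
        counts[(i, o)] += 1
    return counts

table = omega0(b"the theme then")
print(table[(ord("t"), ord("h"))])  # 3: 'h' followed 't' three times
```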
By factoring through ESs, we get a hierarchical joint space:
Coarse level: ES_prev × ES_next (25 joint events)
Fine level: within_prev × within_next (table size varies by ES pair)
Compression benefit: Instead of learning 65,536 probabilities at once, we learn a 25-entry coarse table plus a within-ES table for each ES pair.
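A sketch of the factored bookkeeping, assuming a hypothetical hand-coded 5-way partition standing in for whatever ESs the model actually induces:

```python
from collections import Counter, defaultdict

# Hypothetical 5-way partition of the byte alphabet; the model's actual
# ESs are learned, not hand-coded like this.
def es_of(b: int) -> str:
    c = chr(b)
    if c in "aeiou":
        return "Vowels"
    if c.isalpha():
        return "Consonants"
    if c.isdigit():
        return "Digits"
    if c in ".,;:!?":
        return "Punct"
    return "Other"

def factored_counts(stream: bytes):
    coarse = Counter()           # ES_prev x ES_next: at most 25 cells
    fine = defaultdict(Counter)  # per ES pair: within_prev x within_next
    for i, o in zip(stream, stream[1:]):
        pair = (es_of(i), es_of(o))
        coarse[pair] += 1
        fine[pair][(i, o)] += 1
    return coarse, fine

coarse, fine = factored_counts(b"the theme then 42.")
print(coarse[("Consonants", "Vowels")])               # coarse: ES-level count
print(fine[("Consonants", "Vowels")].most_common(3))  # fine: within-ES identities
```

Under this sketch, the 25-cell coarse table fills in quickly, while fine tables are only materialized for ES pairs that actually occur, which is where the practical saving comes from.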