Joint Event Space: Byte Alphabet Factorization

fig-20260131_2-01 | The 256-byte space factors into ES membership × within-ES identity

Flat Space vs Factored Space

[Figure: the byte alphabet as a flat space vs a factored space. Flat: |E| = 256, every byte treated as an atomic event. Factored: E = ES × within-ES, with the ES coordinate ranging over 5 events and the within-ES coordinate over that ES's members. Drawn as a grid, each row is one ES and the columns are its members: Digits (0-9, 10 members), Vowels (a e i o u, 5), Whitespace (space, newline, tab, ..., 4), Punct (. , ! ? : ; " ' ( ) [ ] { } ..., 32), Other (b c d f g ..., 205). Total: 10 + 5 + 4 + 32 + 205 = 256.]

The 256-byte alphabet viewed as flat space (top-left) vs factored into ES × within-ES (bottom-left, right)

Key Insight: The RNN learns to predict at two levels:
  1. Which ES? After "th", predict Vowels ES (not Digits, not Punct)
  2. Which member? Within Vowels, predict 'e' over 'a', 'i', 'o', 'u'

This factorization explains 59.3% of the model's compression: the ES-level prediction alone accounts for 1.37 of the 2.31 bits/char total.
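The split follows the chain rule: the cost of a byte is the cost of naming its ES plus the cost of naming the member within that ES. A minimal sketch of that accounting (the function and array names here are illustrative, not the hutter.c API):

    #include <math.h>

    /* Split one prediction's cost into ES-level and within-ES bits via
     *   -log2 P(o) = -log2 P(ES(o)) - log2 P(o | ES(o)).
     * p[256] holds the model's next-byte probabilities, es[256] maps each
     * byte to its ES index (0..4), o is the byte that actually occurred. */
    void split_bits(const double p[256], const int es[256], int o,
                    double *es_bits, double *within_bits)
    {
        double p_es = 0.0;                 /* P(ES(o)) = sum of p over o's ES */
        for (int b = 0; b < 256; b++)
            if (es[b] == es[o]) p_es += p[b];
        *es_bits     = -log2(p_es);        /* cost of naming the ES            */
        *within_bits = -log2(p[o] / p_es); /* cost of the member inside the ES */
    }

Averaged over enwik9, the first term is the 1.37 bits/char quoted above and the second is the remaining 0.94 bits/char.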

Input × Output: The Full Joint Space

[Figure: the full joint space of input byte (prev) × output byte (next), |I × O| = 256 × 256 = 65,536 joint events. The ES-level view collapses it to a 5 × 5 = 25 grid of ES-pairs for coarse prediction; a single cell such as Vowel → Other covers 5 × 205 = 1,025 joint events.]

The I×O joint space (65K events) reduces to 25 ES-pairs for coarse prediction

The standard learning function ω₀ from the CMP paper records joint events (i, o), where i is the input (previous) byte and o is the output (next) byte. This gives a 256×256 contingency table.
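A minimal sketch of that flat table, assuming we stream over the corpus one byte at a time (the names are illustrative, not ω₀'s actual implementation):

    #include <stddef.h>
    #include <stdint.h>

    static uint64_t joint[256][256];   /* joint[i][o] = count of (prev = i, next = o) */

    /* Record every adjacent (input, output) byte pair in the buffer. */
    void count_flat(const unsigned char *buf, size_t n)
    {
        for (size_t t = 1; t < n; t++)
            joint[buf[t - 1]][buf[t]]++;
    }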

By factoring through ESs, we get a hierarchical joint space: a coarse 5 × 5 table of ES-pairs, refined by within-ES detail only where it matters (see the counting sketch after the list below).

Compression benefit: Instead of learning 65,536 probabilities, we learn:
  • 25 ES-transition probabilities (coarse structure)
  • ~50-100 significant within-ES patterns (fine detail)
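A sketch of the factored counterpart, assuming an es[256] lookup table that maps each byte to its ES index (it can be filled with the es_of() sketch under Reproduction below); again the names are illustrative, not the hutter.c API:

    #include <stddef.h>
    #include <stdint.h>

    static uint64_t es_pair[5][5];    /* coarse: ES(prev) → ES(next), 25 cells */
    static uint64_t within[5][256];   /* fine: ES(prev) → specific next byte   */

    void count_factored(const unsigned char *buf, size_t n, const int es[256])
    {
        for (size_t t = 1; t < n; t++) {
            es_pair[es[buf[t - 1]]][es[buf[t]]]++;   /* 5 × 5 coarse structure */
            within[es[buf[t - 1]]][buf[t]]++;        /* within-ES detail       */
        }
    }

The 25-cell table carries the coarse structure; the within table is where the ~50-100 significant fine-grained patterns live.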

Reproduction

Model: Elman RNN 256-128-256, trained on enwik9

ES definitions: ./hutter es (Digits, Vowels, Whitespace, Punct, Other)
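The five ESs can be sketched as a classifier over the byte value. The exact membership here is an assumption about what ./hutter es reports (in particular, treating space, newline, tab, and carriage return as the 4 Whitespace bytes, and uppercase vowels as Other), not code lifted from hutter.c:

    #include <ctype.h>
    #include <string.h>

    enum { ES_DIGIT, ES_VOWEL, ES_WSPACE, ES_PUNCT, ES_OTHER };

    /* Assumed membership; sizes 10 + 5 + 4 + 32 + 205 = 256. */
    int es_of(unsigned char b)
    {
        if (b >= '0' && b <= '9')                  return ES_DIGIT;   /*  10 */
        if (b && strchr("aeiou", b))               return ES_VOWEL;   /*   5 */
        if (b == ' ' || b == '\n' || b == '\t' || b == '\r')
                                                   return ES_WSPACE;  /*   4 */
        if (b < 128 && ispunct(b))                 return ES_PUNCT;   /*  32 */
        return ES_OTHER;                           /* everything else: 205 */
    }

Filling an int es[256] table by calling es_of(b) for each byte value gives the lookup used in the sketches above.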

Result: 5 ESs explain 59.3% of compression (1.37 of 2.31 bits/char)

Source: #es_eval, #method blocks in hutter.c