| Hypothesis | Characters | Within-class Similarity | Cross-class Similarity | Result |
|---|---|---|---|---|
| H1: Vowels | a, e, i, o, u | 0.988 | 0.986 | ✓ SUPPORTED |
| H2: Consonants | t, n, s, r, h | 0.984 | 0.986 | ✗ NOT SUPPORTED |
| H4: Punctuation | . , ! ? | 0.999 | 0.986 | ✓ SUPPORTED |
| H5: Digits | 0, 1, 2, 5, 9 | 0.9996 | 0.987 | ✓ SUPPORTED |
Even an undertrained model (6.02 bits per character) has learned basic character categories that are discoverable from bigram statistics alone. This is consistent with the hypothesis that short Markov chains give rise to early event spaces.
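To make "discoverable from bigram statistics" concrete, here is a minimal sketch: represent each character by its distribution over successor characters, which is all a short Markov chain can "see". The corpus, `bigram_vectors`, and variable names are invented for this illustration and are not the repository's code.

```python
from collections import Counter

def bigram_vectors(text, alphabet):
    """Represent each character by its normalized distribution over the next character."""
    counts = Counter(zip(text, text[1:]))  # bigram counts (c, d) -> frequency
    vecs = {}
    for c in alphabet:
        row = [counts[(c, d)] for d in alphabet]
        total = sum(row) or 1  # avoid division by zero for characters never seen first
        vecs[c] = [n / total for n in row]
    return vecs

# Toy corpus: characters with similar successor statistics get similar vectors.
text = "the cat sat on the mat and the rat ran then the bat"
alphabet = sorted(set(text))
vecs = bigram_vectors(text, alphabet)
```

Clustering these vectors (e.g. by cosine similarity) on a real corpus is one way categories like vowels vs. punctuation can emerge without any model at all.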
The failure of H2 (consonants) suggests that consonants do not form a single event space (ES): they may split into subgroups based on phonetic properties or positional statistics.
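The within-class vs. cross-class comparison behind the table can be sketched as follows. This is a hedged illustration on synthetic embeddings, not the repository's `test_hypothesis.py`; `class_similarities` and the toy vectors are assumptions made for the example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def class_similarities(embs, members):
    """Mean pairwise cosine within `members`, and from members to all other characters."""
    others = [c for c in embs if c not in members]
    within = [cosine(embs[a], embs[b])
              for i, a in enumerate(members) for b in members[i + 1:]]
    cross = [cosine(embs[a], embs[b]) for a in members for b in others]
    return float(np.mean(within)), float(np.mean(cross))

# Synthetic check: "vowels" share a common direction plus noise, "consonants" are random.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
embs = {c: base + 0.1 * rng.normal(size=16) for c in "aeiou"}
embs.update({c: rng.normal(size=16) for c in "tnsrh"})
within, cross = class_similarities(embs, list("aeiou"))
```

A hypothesis counts as supported when `within` clearly exceeds `cross`; on the synthetic data above the gap is large by construction, whereas the table's real gaps are small because all character embeddings of an undertrained model stay close together.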
Repository: https://github.com/TBD/hutter (not yet public)
Model:
Data: enwik9 (the first 10^9 bytes of an English Wikipedia XML dump)
Training:
To reproduce:
```sh
wget http://mattmahoney.net/dc/enwik9.zip && unzip enwik9.zip
make
./hutter predict "a" model.bin   # get the hidden state for each character
python3 test_hypothesis.py       # run the hypothesis tests
```