Hypothesis Test Results - fig-20260131-04

Testing character-class hypotheses against a trained RNN (6.02 bpc)

Hypothesis	Characters	Within-class Similarity	Cross-class Similarity	Result
H1: Vowels	a, e, i, o, u	0.988	0.986	✓ SUPPORTED
H2: Consonants	t, n, s, r, h	0.984	0.986	✗ NOT SUPPORTED
H4: Punctuation	. , ! ?	0.999	0.986	✓ SUPPORTED
H5: Digits	0, 1, 2, 5, 9	0.9996	0.987	✓ SUPPORTED
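
The table compares a within-class similarity against a cross-class similarity for each hypothesis. The metric is not stated here; the sketch below assumes cosine similarity between per-character hidden-state vectors and assumes a hypothesis counts as supported when the within-class mean exceeds the cross-class mean. The function names and toy vectors are hypothetical and are not taken from test_hypothesis.py.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_similarity(states, members):
    """Return (within-class mean, cross-class mean) cosine similarity.

    states  -- dict mapping a character to its hidden-state vector
    members -- set of characters in the hypothesized class
    """
    inside = [c for c in states if c in members]
    outside = [c for c in states if c not in members]
    within = [cosine(states[a], states[b]) for a, b in combinations(inside, 2)]
    cross = [cosine(states[a], states[b]) for a in inside for b in outside]
    return sum(within) / len(within), sum(cross) / len(cross)

def supported(within, cross):
    """Assumed decision rule: class members are closer to each other than to outsiders."""
    return within > cross

# Toy 2-D vectors, made up for illustration (real hidden states would have 128 entries).
toy_states = {"a": [1.0, 0.1], "e": [0.9, 0.2], "t": [0.1, 1.0], "n": [0.2, 0.9]}
within, cross = class_similarity(toy_states, {"a", "e"})
print(within, cross, supported(within, cross))
```

With this rule, H2 fails because its within-class value (0.984) is below its cross-class value (0.986), while H1 passes by a margin of only 0.002.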

Key Insights

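1. Digits form the tightest cluster (0.9996 within-class similarity vs 0.987 cross-class): the RNN treats all tested digits nearly identically.

2. Punctuation is highly uniform (0.999 vs 0.986): it appears to be learned as a single event space (ES).

3. Vowels cluster together only weakly (0.988 within-class vs 0.986 cross-class): the effect is present, but the margin is just 0.002.

4. Consonants are diverse (0.984 within-class, below the 0.986 cross-class value): they form no single ES and may need subgroups (stops, fricatives, etc.).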

Interpretation

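Even an undertrained model (6.02 bpc) has learned basic character categories of the kind that are discoverable from bigram statistics. This is consistent with the hypothesis that short Markov chains give rise to early event spaces, although the margins are small: every similarity in the table lies between 0.984 and 0.9996.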

The failure of H2 (consonants) suggests that consonants do not form a single event space (ES). They may instead split into subgroups based on phonetic properties or positional statistics.
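
To make "discoverable from bigram statistics" concrete, here is a minimal sketch that compares two characters by the distribution of characters that follow them. It is an illustration only, not the method used for the table: the helper names are hypothetical, and using successor counts (rather than predecessors or both) is an assumption.

```python
import math
from collections import Counter, defaultdict

def successor_counts(text):
    """Map each character to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return counts

def cosine_sparse(p, q):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # a character with no observed successors
    return dot / (norm_p * norm_q)

# Tiny made-up text: "x" and "y" are both always followed by "a".
counts = successor_counts("xa ya")
print(cosine_sparse(counts["x"], counts["y"]))  # 1.0: identical successor profile
print(cosine_sparse(counts["x"], counts["a"]))  # 0.0: no shared successors
```

Running the same comparison over the evaluation text would show which character groups already look alike from bigram counts alone, before any RNN is involved.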

Reproducibility

Repository: https://github.com/TBD/hutter (not yet public)

Model:

Architecture: Elman RNN, 256→128→256, tanh activation
Checkpoint: model.bin
Performance: 6.02 bpc (undertrained)
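
For context on the 6.02 bpc figure, the sketch below computes bits per character under the usual definition: the average negative base-2 log of the probability the model assigned to each true next character. It assumes a 256-symbol output, which matches the 256→128→256 layer sizes but is not confirmed here; the function name is hypothetical.

```python
import math

def bits_per_character(probs):
    """Average of -log2(p) over the probabilities given to the true next characters."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# A model that guesses uniformly over 256 symbols scores exactly 8 bpc.
uniform_bpc = bits_per_character([1 / 256] * 1000)
print(uniform_bpc)  # 8.0
```

Against that 8.0 bpc baseline, 6.02 bpc is roughly 2 bits per character better than a uniform guess, which fits the "undertrained" label.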

Data:

Dataset: enwik9 (mattmahoney.net/dc/enwik9.zip)
Subset: first 10KB for evaluation

Training:

Hyperparameters: lr=0.01, BPTT=50, grad_clip=5.0
Status: ~1 partial epoch, training diverged
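
The grad_clip=5.0 setting can be read in more than one way. The sketch below assumes clipping by global L2 norm, which is one common choice; the actual training code may instead clip each gradient element to [-5, 5]. The function name is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads]

# A gradient with norm 50 is scaled down to norm 5; direction is preserved.
clipped = clip_by_global_norm([30.0, 40.0])
print(clipped)
```

Clipping bounds the size of each update but does not by itself prevent divergence at lr=0.01, which is in line with the reported training status.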

To reproduce:

wget http://mattmahoney.net/dc/enwik9.zip && unzip enwik9.zip
make
./hutter predict "a" model.bin   # Get hidden state for one char; repeat for each char tested
python3 test_hypothesis.py       # Run hypothesis tests
