Training Loss Curves: Baseline vs Augmented

fig-20260131_2-04 | ES features dramatically accelerate learning

[Figure: training bpc (y-axis, 3.0–8.0) vs training progress (x-axis, 0–100% of 1 epoch on 1M chars). Curves: random reference at 8.0 bpc; Baseline (256 inputs) ending at 5.51 bpc (train); Augmented (261 inputs) ending at 3.42 bpc (train); final gap 2.09 bpc.]

Training bpc over 1 epoch on 1M characters. Augmented model learns 2× faster.
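For reference, bits per character is the mean negative log2 probability the model assigns to each observed character (a generic definition, not specific to `hutter`):

```python
import math

def bits_per_char(probs):
    """Mean -log2(p) over the probabilities the model assigned
    to the characters actually observed."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# A uniform model over 256 byte values assigns p = 1/256 everywhere,
# giving exactly 8.0 bpc -- the 'random' reference line in the figure.
print(bits_per_char([1 / 256] * 4))  # 8.0
```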

Key finding: Adding 5 ES features (Digits, Vowels, Whitespace, Punct, Other) as one-hot input alongside the byte dramatically accelerates learning.
Metric                          Baseline   Augmented   Improvement
Training bpc (end of epoch 1)   5.51       3.42        -2.09 bpc (38%)
Eval bpc                        7.31       3.44        -3.87 bpc (53%)
Input dimensions                256        261         +5 (2%)
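The improvement column follows directly from the bpc values; a quick check (numbers copied from the table above):

```python
# Absolute delta and relative reduction for each metric in the table.
rows = {
    "train": (5.51, 3.42),
    "eval":  (7.31, 3.44),
}
for name, (base, aug) in rows.items():
    delta = base - aug
    pct = 100 * delta / base
    print(f"{name}: -{delta:.2f} bpc ({pct:.0f}%)")
```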

Why the large gap?

The ES features provide explicit structure that the baseline must discover implicitly: each input byte arrives pre-tagged with its character class, so the model never has to learn which of the 256 byte values are digits, vowels, whitespace, or punctuation.

The 5 extra input dimensions (a 2% increase) carry at most log2(5) ≈ 2.32 bits of side information per character, yet they leverage into a 3.87 bpc eval improvement.
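One way to build such an augmented input is sketched below. The classification rules (`es_class`, the ASCII punctuation ranges, the vowel set) are illustrative assumptions; the actual rules used by `hutter` are not documented here.

```python
# Sketch: tag each byte with one of 5 ES classes and build the 261-dim
# input vector (256-way byte one-hot + 5-way class one-hot).
# NOTE: classification rules are assumptions, not hutter's actual rules.
CLASSES = ["Digits", "Vowels", "Whitespace", "Punct", "Other"]

def es_class(b: int) -> int:
    if 0x30 <= b <= 0x39:
        return 0  # Digits
    if b in b"aeiouAEIOU":
        return 1  # Vowels (ASCII only)
    if b in b" \t\n\r":
        return 2  # Whitespace
    if 33 <= b <= 47 or 58 <= b <= 64 or 91 <= b <= 96 or 123 <= b <= 126:
        return 3  # Punct (ASCII punctuation ranges)
    return 4      # Other

def encode(b: int) -> list:
    x = [0.0] * 261
    x[b] = 1.0                  # byte one-hot (dims 0..255)
    x[256 + es_class(b)] = 1.0  # class one-hot (dims 256..260)
    return x
```

Exactly two dimensions are hot per character, so the extra signal is a single 5-way choice: at most log2(5) ≈ 2.32 bits.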

Reproduction

Data: enwik9, first 1M bytes

Commands:

head -c 1000000 enwik9 > /tmp/test1m.txt
./hutter train /tmp/test1m.txt models/base_1m.bin 1
./hutter train-aug /tmp/test1m.txt _ 1
./hutter eval /tmp/test1m.txt models/base_1m.bin
./hutter eval-aug /tmp/test1m.txt models/aug_epoch1.bin
            

Timestamp: 2026-01-31T11:35Z

Generated: 2026-01-31 | Archive: cmpr.ai/hutter/archive/20260131_2/