Training Loss Curves: Baseline vs Augmented

fig-20260131_2-04 | ES features dramatically accelerate learning

[Figure: training bpc (y-axis, 3.0–8.0) vs training progress (x-axis, 0–100% of 1 epoch on 1M chars). Curves: random reference at 8.0 bpc; Baseline (256 inputs) ending at 5.51 bpc (train); Augmented (261 inputs) ending at 3.42 bpc (train); final gap 2.09 bpc.]

Training bpc over 1 epoch on 1M characters. Augmented model learns 2× faster.
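For reference, bits per character is the mean negative log2 probability the model assigns to each observed character (a generic definition, not specific to `hutter`):

```python
import math

def bits_per_char(probs):
    """Mean -log2(p) over the probabilities the model assigned
    to the characters actually observed."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# A uniform model over 256 byte values assigns p = 1/256 everywhere,
# giving exactly 8.0 bpc -- the 'random' reference line in the figure.
print(bits_per_char([1 / 256] * 4))  # 8.0
```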

Key finding: Adding 5 ES features (Digits, Vowels, Whitespace, Punct, Other) as one-hot input alongside the byte dramatically accelerates learning.
Metric                          Baseline   Augmented   Improvement
Training bpc (end of epoch 1)   5.51       3.42        -2.09 bpc (38%)
Eval bpc                        7.31       3.44        -3.87 bpc (53%)
Input dimensions                256        261         +5 (2%)
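The improvement column follows directly from the bpc values; a quick check (numbers copied from the table above):

```python
# Absolute delta and relative reduction for each metric in the table.
rows = {
    "train": (5.51, 3.42),
    "eval":  (7.31, 3.44),
}
for name, (base, aug) in rows.items():
    delta = base - aug
    pct = 100 * delta / base
    print(f"{name}: -{delta:.2f} bpc ({pct:.0f}%)")
```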

Why the large gap?

The ES features provide explicit structure that the baseline must discover implicitly: each input byte arrives pre-tagged with its character class, so the model never has to learn which of the 256 byte values are digits, vowels, whitespace, or punctuation.

The 5 extra input dimensions (a 2% increase) carry at most log2(5) ≈ 2.32 bits of side information per character, yet they leverage into a 3.87 bpc eval improvement.
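One way to build such an augmented input is sketched below. The classification rules (`es_class`, the ASCII punctuation ranges, the vowel set) are illustrative assumptions; the actual rules used by `hutter` are not documented here.

```python
# Sketch: tag each byte with one of 5 ES classes and build the 261-dim
# input vector (256-way byte one-hot + 5-way class one-hot).
# NOTE: classification rules are assumptions, not hutter's actual rules.
CLASSES = ["Digits", "Vowels", "Whitespace", "Punct", "Other"]

def es_class(b: int) -> int:
    if 0x30 <= b <= 0x39:
        return 0  # Digits
    if b in b"aeiouAEIOU":
        return 1  # Vowels (ASCII only)
    if b in b" \t\n\r":
        return 2  # Whitespace
    if 33 <= b <= 47 or 58 <= b <= 64 or 91 <= b <= 96 or 123 <= b <= 126:
        return 3  # Punct (ASCII punctuation ranges)
    return 4      # Other

def encode(b: int) -> list:
    x = [0.0] * 261
    x[b] = 1.0                  # byte one-hot (dims 0..255)
    x[256 + es_class(b)] = 1.0  # class one-hot (dims 256..260)
    return x
```

Exactly two dimensions are hot per character, so the extra signal is a single 5-way choice: at most log2(5) ≈ 2.32 bits.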

Reproduction

Data: enwik9, first 1M bytes

Commands:

head -c 1000000 enwik9 > /tmp/test1m.txt
./hutter train /tmp/test1m.txt models/base_1m.bin 1
./hutter train-aug /tmp/test1m.txt _ 1
./hutter eval /tmp/test1m.txt models/base_1m.bin
./hutter eval-aug /tmp/test1m.txt models/aug_epoch1.bin
            

Timestamp: 2026-01-31T11:35Z

Generated: 2026-01-31 | Archive: cmpr.ai/hutter/archive/20260131_2/