fig-20260131_2-06 | Trained on 1M chars, revealing ES-specific representations
The ES weight columns (positions 256-260 in Wx) learn values 10-100x larger than the mean of the byte weights within each class. The ES inputs are therefore amplified, not marginalized.
This suggests the model uses ES features as strong class-level signals that complement rather than replace byte-level patterns.
| ES | Mean | Std | Min | Max |
|---|---|---|---|---|
| Digit | -0.020 | 0.439 | -1.315 | +1.121 |
| Punct | +0.008 | 0.318 | -0.995 | +1.322 |
| Vowel | +0.054 | 0.518 | -1.625 | +1.662 |
| Whitespace | +0.032 | 0.607 | -1.513 | +1.353 |
| Other | -0.060 | 0.472 | -1.628 | +1.272 |
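The per-class statistics above can be recomputed from the weight matrix in a few lines. This is a minimal sketch, assuming Wx is held as a list of rows (one per hidden unit) of 261 input weights, that the ES columns follow the table's order starting at index 256, and that Std is the population standard deviation. `column_stats` and the toy matrix are hypothetical and not part of the `hutter` tool.

```python
import statistics

ES_OFFSET = 256  # ES columns follow the 256 byte columns
ES_NAMES = ["Digit", "Punct", "Vowel", "Whitespace", "Other"]  # assumed order


def column_stats(wx, col):
    """Summarize one input column of Wx across all hidden units."""
    values = [row[col] for row in wx]
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population std (assumption)
        "min": min(values),
        "max": max(values),
    }


# Toy 3-unit matrix standing in for the real 128 x 261 Wx.
toy_wx = [[0.0] * 261 for _ in range(3)]
for unit, weight in enumerate([-1.0, 0.0, 1.0]):
    toy_wx[unit][ES_OFFSET] = weight  # fill the assumed Digit column

digit_stats = column_stats(toy_wx, ES_OFFSET + ES_NAMES.index("Digit"))
print(digit_stats)
```

Running the same function over columns 256 through 260 of the real matrix would yield one table row per ES class, provided the layout assumptions hold.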
Negative correlations show the model is learning to discriminate between classes.
Strongest negative: Vowel-Punct (-0.45). Green = positive, Red = negative.
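If the correlations are taken between pairs of ES weight columns across the hidden units (an assumption, since the source does not state how they were computed), a Pearson coefficient can be sketched as follows. `pearson` and the toy columns are illustrative only and do not reproduce the reported -0.45.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length weight columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy columns (not real model weights): units that respond positively to one
# class respond negatively to the other, which gives a negative coefficient.
toy_vowel = [1.5, 0.2, -0.4, 1.1]
toy_punct = [-1.0, 0.1, 0.6, -0.7]
r = pearson(toy_vowel, toy_punct)
print(f"r = {r:+.2f}")
```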
| ES | Top 5 units (weight) | | | | |
|---|---|---|---|---|---|
| Digit | h39 (-1.32) | h91 (-1.16) | h35 (+1.12) | h94 (-1.08) | h3 (-1.00) |
| Punct | h95 (+1.32) | h102 (-0.99) | h87 (-0.95) | h120 (+0.83) | h121 (-0.78) |
| Vowel | h43 (+1.66) | h79 (+1.56) | h68 (+1.54) | h102 (+1.50) | h120 (-1.62) |
| Whitespace | h124 (-1.51) | h102 (-1.48) | h79 (-1.46) | h1 (+1.35) | h19 (+1.33) |
| Other | h43 (-1.63) | h29 (+1.27) | h51 (-1.25) | h88 (+1.11) | h100 (-1.10) |
h43 is the Vowel/Other discriminator: +1.66 for Vowel, -1.63 for Other.
h102 appears in the top 5 for Punct (-0.99), Vowel (+1.50), and Whitespace (-1.48), making it an ES-sensitive unit.
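The overlap between rows can be checked mechanically. The sketch below copies the unit indices from the top-5 table and lists every unit that ranks for more than one ES class. The variable names are illustrative.

```python
from collections import defaultdict

# Top-5 unit indices per ES class, copied from the table above.
top5 = {
    "Digit": [39, 91, 35, 94, 3],
    "Punct": [95, 102, 87, 120, 121],
    "Vowel": [43, 79, 68, 102, 120],
    "Whitespace": [124, 102, 79, 1, 19],
    "Other": [43, 29, 51, 88, 100],
}

classes_by_unit = defaultdict(list)
for es_class, units in top5.items():
    for unit in units:
        classes_by_unit[unit].append(es_class)

# Keep only units that rank in the top 5 for more than one class.
shared = {u: c for u, c in classes_by_unit.items() if len(c) > 1}

for unit, classes in sorted(shared.items()):
    print(f"h{unit}: {', '.join(classes)}")
```

On this table the check flags h43, h79, h102, and h120. Only h102 is shared by three classes.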
| Unit | Digit | Punct | Vowel | Whitespace | Other |
|---|---|---|---|---|---|
| h0 | 0.57 | 0.35 | 0.62 | 0.37 | -0.02 |
| h1 | 0.81 | 0.93 | 0.68 | 0.98 | 0.49 |
| h2 | 0.93 | 0.63 | -0.81 | 0.90 | 0.11 |
| h3 | -0.85 | 0.16 | 0.71 | 0.94 | 0.26 |
h3 shows the widest separation of the units shown: Digit = -0.85, Whitespace = +0.94 (a spread of 1.79).
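The spread quoted for h3 is the maximum minus the minimum across the five ES columns. A short sketch using only the four rows shown in the table above (the names `rows` and `spread` are illustrative):

```python
# Values copied from the four rows shown in the table above.
ES_NAMES = ["Digit", "Punct", "Vowel", "Whitespace", "Other"]
rows = {
    "h0": [0.57, 0.35, 0.62, 0.37, -0.02],
    "h1": [0.81, 0.93, 0.68, 0.98, 0.49],
    "h2": [0.93, 0.63, -0.81, 0.90, 0.11],
    "h3": [-0.85, 0.16, 0.71, 0.94, 0.26],
}


def spread(values):
    """Range of a unit's values across the ES classes."""
    return max(values) - min(values)


spreads = {unit: round(spread(v), 2) for unit, v in rows.items()}
widest = max(spreads, key=spreads.get)
print(spreads, widest)
```

Among these rows h3 has the largest spread (1.79), with h2 close behind at 1.74 because of its negative Vowel value.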
| ES | Best Unit | Mean Wy |
|---|---|---|
| Digit | h34 | +0.212 |
| Punct | h46 | +0.254 |
| Vowel | h87 | +0.218 |
| Whitespace | h95 | +0.307 |
| Other | h33 | +0.018 |
Other has the weakest predictor (+0.018, versus +0.21 to +0.31 for the other classes), consistent with its role as the residual class.
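One plausible reading of "Mean Wy" is the average of a hidden unit's output weights over the bytes that belong to a class, with the best unit being the one that maximizes this average. The source does not define the metric, so the sketch below is an assumption. `best_unit`, the (output byte, hidden unit) layout of Wy, and the toy weights are all hypothetical.

```python
def best_unit(wy, class_bytes):
    """Return (unit index, mean weight) for the hidden unit whose output
    weights, averaged over the bytes in one class, are largest."""
    n_hidden = len(wy[0])
    means = [
        sum(wy[b][h] for b in class_bytes) / len(class_bytes)
        for h in range(n_hidden)
    ]
    unit = max(range(n_hidden), key=means.__getitem__)
    return unit, means[unit]


# Toy Wy with 256 output rows and 3 hidden units (the real model has 128).
toy_wy = [[0.0, 0.0, 0.0] for _ in range(256)]
digit_bytes = list(range(ord("0"), ord("9") + 1))
for b in digit_bytes:
    toy_wy[b][1] = 0.2  # hidden unit 1 pushes probability toward digits

unit, mean_weight = best_unit(toy_wy, digit_bytes)
print(f"h{unit}: {mean_weight:+.3f}")
```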
Command: ./hutter probe-aug enwik9 models/aug_epoch1.bin
Model: Augmented RNN, 128 hidden units, trained 1 epoch on 1M chars
Input size: 261 (256 bytes + 5 ES one-hot features)
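The 261-dimensional input can be pictured as a byte one-hot vector with a second one-hot for the ES class appended. The sketch below is an illustration under stated assumptions: the class order follows the tables above, and the membership rules (ASCII digits, `string.punctuation`, `aeiouAEIOU`, `string.whitespace`, everything else as Other) are guesses, not the definitions used by the `hutter` tool.

```python
import string

ES_OFFSET = 256
ES_NAMES = ["Digit", "Punct", "Vowel", "Whitespace", "Other"]  # assumed order


def es_class(byte):
    """Map a byte value to an ES class index (membership rules are assumed)."""
    ch = chr(byte)
    if ch in string.digits:
        return 0
    if ch in string.punctuation:
        return 1
    if ch in "aeiouAEIOU":
        return 2
    if ch in string.whitespace:
        return 3
    return 4  # consonants, control bytes, and all bytes >= 128


def encode(byte):
    """Build the 261-dim input: byte one-hot plus ES-class one-hot."""
    x = [0.0] * (ES_OFFSET + len(ES_NAMES))
    x[byte] = 1.0
    x[ES_OFFSET + es_class(byte)] = 1.0
    return x


x = encode(ord("a"))
print(len(x), [i for i, v in enumerate(x) if v == 1.0])
```

Each input then has exactly two active positions, one in the byte range 0-255 and one in the ES range 256-260.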