SN Model Series

From byte frequencies to the Hutter Prize: a hierarchy of compressors in the SN format, and where we stand.

1 Model Hierarchy (M0 – M5)

[Chart: bits per character by model, lower is better: M0 Unigram 5.069, M1 Bigram 3.889, M2 KN-3 2.551, M3 KN-6 2.043, M4 KN-6+words 1.999, M5 +word bigrams 1.999]
M0 — Unigram 5.069 bpc
Byte frequency distribution. 256 events, no patterns. Every byte gets strength ⌊log₂(count)⌋.
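As a sanity check, the unigram figure can be reproduced in a few lines. A minimal Python sketch, assuming the corpus is a local file; it computes the idealized empirical byte entropy, ignoring the integer quantization of strengths and the model's own description cost:

import math
from collections import Counter

def unigram_bpc(path):
    data = open(path, "rb").read()
    counts = Counter(data)
    n = len(data)
    # average -log2 p(byte) under the empirical byte distribution
    return sum(c * -math.log2(c / n) for c in counts.values()) / n

print(unigram_bpc("enwik8"))   # ~5 bpc on English Wikipedia text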
M1 — Bigram 3.889 bpc
Byte pairs. 512 events, 17,781 patterns. Each pattern conditions one byte on the previous one: "The input is X." "The output is Y." strength.
M2 — KN-3 2.551 bpc
Kneser–Ney (KN) smoothed trigram. Interpolates orders 1–3 with a fixed discount D = 0.8.
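A minimal sketch of the interpolation, assuming plain absolute discounting with the fixed D = 0.8 (full Kneser–Ney additionally replaces lower-order counts with continuation counts, omitted here for brevity):

from collections import defaultdict

D = 0.8                  # fixed discount, as above
N = 3                    # trigram model: context lengths 0..2
counts = [defaultdict(lambda: defaultdict(int)) for _ in range(N)]

def train(data):
    for i, b in enumerate(data):
        for k in range(N):                        # context length k
            if i >= k:
                counts[k][bytes(data[i - k:i])][b] += 1

def prob(ctx, b, k=N - 1):
    # Subtract D from each seen count; hand the freed mass to the
    # (k-1)-byte context, bottoming out at the uniform distribution.
    if k < 0:
        return 1 / 256
    table = counts[k][bytes(ctx[-k:]) if k else b""]
    total = sum(table.values())
    if total == 0:
        return prob(ctx, b, k - 1)
    lam = D * len(table) / total                  # escaped mass
    return max(table.get(b, 0) - D, 0) / total + lam * prob(ctx, b, k - 1)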
M3 — KN-6 2.043 bpc
Same structure as M2, deeper context (order 6). Contexts are hashed into a 33M-entry table.
M4 — KN-6 + words 1.999 bpc
KN-6 + word absorption, with 2,000 word events. At word-final positions, the word-conditioned prediction replaces the KN prediction when it is more confident.
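A sketch of the absorption rule. The confidence test here (compare the peak probability of the two distributions) is an assumption; the actual criterion is not spelled out above:

def absorb(kn_dist, word_dist):
    # kn_dist, word_dist: dicts mapping byte -> probability.
    # word_dist is None when the preceding word falls outside the
    # 2,000-word event vocabulary.
    if word_dist is not None and max(word_dist.values()) > max(kn_dist.values()):
        return word_dist          # word-conditioned prediction wins
    return kn_dist                # fall back to KN-6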
M5 — KN-6 + word bigrams 1.999 bpc
KN-6 + word bigrams. Adds 88,447 word-bigram patterns. Same bpc as M4: at the byte level, word bigrams don't help.

2 Hutter Prize Dashboard

The Hutter Prize measures total = compressor + compressed archive on enwik9 (10⁹ bytes). Prize: 500,000€ × (1 − S/L), where S is the new submission's total and L is the standing record. Constraints: 1 CPU core, <10 GB RAM, ≤50 hours.
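In code, with a 5% improvement chosen purely for illustration:

def prize_eur(S, L):
    # 500,000 EUR scaled by the relative improvement over the record
    return 500_000 * (1 - S / L)

print(round(prize_eur(0.95 * 110.8e6, 110.8e6)))   # 25000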

Our KN-6 + match (online, enwik9): 1.654 bpc = 197.2 MB total
fx2-cmix (record, Sep 2024): 0.886 bpc = 110.8 MB total
[Chart: total size on enwik9 in MB: fx2-cmix 110.8, KN-6+match 197.2 (1.87× gap), bzip2 254, gzip 323]

3 The Gap Decomposition

The 0.80 bpc gap between our KN-6 (1.683 bpc on enwik9) and fx2-cmix (0.886 bpc) breaks into three components:

[Chart: gap decomposition, 0.08 + 0.10 + ~0.62 = 0.80 bpc total]
HT saturation + higher order — ~0.08 bpc. Our 128M-entry hash table saturates at 100%; a larger table (the prize allows 10 GB of RAM) plus deeper orders would recover this.
Word-level patterns — ~0.10 bpc. Word absorption and word bigrams; requires a tokenization bridge to extract these gains at the byte level.
Context mixing + neural — ~0.62 bpc. What cmix does: adaptive mixing of PPM, LSTM, match, and indirect models with neural mixing weights, as sketched below.
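A minimal sketch of logistic mixing, the core of the cmix family: stretch each model's bit probability into the logit domain, mix with learned weights, and nudge the weights toward each observed bit (the learning rate is illustrative):

import math

def stretch(p): return math.log(p / (1 - p))   # requires 0 < p < 1
def squash(x):  return 1 / (1 + math.exp(-x))

class LogisticMixer:
    def __init__(self, n_models, lr=0.02):
        self.w = [0.0] * n_models
        self.lr = lr

    def mix(self, probs):
        # probs: each model's P(next bit = 1)
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, bit):
        # online logistic-regression step toward the bit actually coded
        err = bit - self.p
        self.w = [w + self.lr * err * x for w, x in zip(self.w, self.x)]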

4 SN Format

An SN model is a list of events and patterns, each with an integer strength. The format is human-readable and fully explicit.

Events (unigram strengths):
"The output is ' '." 22.   /* space is the most common byte */
"The output is 'e'." 19.   /* 'e' is second */
"The output is 'a'." 19.
Patterns (conditional strengths):
"The input is ' '." "The output is 't'." 17.
"The input is 't'." "The output is 'h'." 16.
"The input is 'h'." "The output is 'e'." 16.
Strength = ⌊log₂(count)⌋. A strength of 17 means the pair appeared 2¹⁷–2¹⁸ times (~131K–262K).
Each pattern costs about 80 bytes of description. This is the fundamental trade-off: more patterns improve prediction but cost description length.
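A sketch of how such a file could be generated from raw byte counts, following the ⌊log₂(count)⌋ rule above (the exact surface syntax and escaping are assumptions; printable bytes only, for illustration):

import math
from collections import Counter

def sn_lines(data: bytes):
    uni = Counter(data)
    bi = Counter(zip(data, data[1:]))
    for byte, c in uni.most_common():
        yield f'"The output is \'{chr(byte)}\'." {int(math.log2(c))}.'
    for (a, b), c in bi.most_common():
        yield (f'"The input is \'{chr(a)}\'." '
               f'"The output is \'{chr(b)}\'." {int(math.log2(c))}.')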

5 The Description Length Trade-off

Adding patterns improves prediction (lowers bpc) but each pattern costs ~80 bytes. The optimal model minimizes total description = compressed data + model description.
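The break-even point follows directly: at ~80 bytes (640 bits) per pattern, a pattern earns its place only if it shaves more than 640 bits off the compressed data. A sketch:

PATTERN_COST_BITS = 80 * 8     # ~80 bytes of description per pattern

def worth_adding(bpc_saved, corpus_bytes):
    # bits saved across the corpus vs. bits spent describing the pattern
    return bpc_saved * corpus_bytes > PATTERN_COST_BITS

print(worth_adding(1e-6, 1e9))   # True: 1000 bits saved > 640 spent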

[Chart: megabytes vs. number of patterns, log scale from 0 to 33M, showing compressed data, model description, and their total. With 0 patterns: 5.069 bpc prediction, 0 B description, 634 MB total; the total curve dips to an optimum before description cost dominates.]