SN Model Series

From byte frequencies to the Hutter Prize: a hierarchy of compressors in the SN format, and where we stand.

1 Model Hierarchy (M0 – M5)

[Chart: bits per character by model, lower is better: M0 Unigram 5.069, M1 Bigram 3.889, M2 KN-3 2.551, M3 KN-6 2.043, M4 KN-6+words 1.999, M5 +word bigrams 1.999]
M0 — Unigram 5.069 bpc
Byte frequency distribution. 256 events, no patterns. Every byte gets strength ⌊log₂(count)⌋.
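As a sanity check, the unigram figure can be reproduced in a few lines. A minimal Python sketch, assuming the corpus is a local file; it computes the idealized empirical byte entropy, ignoring the integer quantization of strengths and the model's own description cost:

import math
from collections import Counter

def unigram_bpc(path):
    data = open(path, "rb").read()
    counts = Counter(data)
    n = len(data)
    # average -log2 p(byte) under the empirical byte distribution
    return sum(c * -math.log2(c / n) for c in counts.values()) / n

print(unigram_bpc("enwik8"))   # ~5 bpc on English Wikipedia text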
M1 — Bigram 3.889 bpc
Byte pairs. 512 events, 17,781 patterns. Each pattern conditions one byte on the previous one: "The input is X." "The output is Y." strength.
M2 — KN-3 2.551 bpc
Kneser–Ney (KN) smoothed trigram. Interpolates orders 1–3 with a fixed discount D = 0.8.
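A minimal sketch of the interpolation, assuming plain absolute discounting with the fixed D = 0.8 (full Kneser–Ney additionally replaces lower-order counts with continuation counts, omitted here for brevity):

from collections import defaultdict

D = 0.8                  # fixed discount, as above
N = 3                    # trigram model: context lengths 0..2
counts = [defaultdict(lambda: defaultdict(int)) for _ in range(N)]

def train(data):
    for i, b in enumerate(data):
        for k in range(N):                        # context length k
            if i >= k:
                counts[k][bytes(data[i - k:i])][b] += 1

def prob(ctx, b, k=N - 1):
    # Subtract D from each seen count; hand the freed mass to the
    # (k-1)-byte context, bottoming out at the uniform distribution.
    if k < 0:
        return 1 / 256
    table = counts[k][bytes(ctx[-k:]) if k else b""]
    total = sum(table.values())
    if total == 0:
        return prob(ctx, b, k - 1)
    lam = D * len(table) / total                  # escaped mass
    return max(table.get(b, 0) - D, 0) / total + lam * prob(ctx, b, k - 1)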
M3 — KN-6 2.043 bpc
Same structure as M2, deeper context (order 6). Contexts are hashed into a 33M-entry table.
M4 — KN-6 + words 1.999 bpc
KN-6 + word absorption, with 2,000 word events. At word-final positions, the word-conditioned prediction replaces the KN prediction when it is more confident.
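A sketch of the absorption rule. The confidence test here (compare the peak probability of the two distributions) is an assumption; the actual criterion is not spelled out above:

def absorb(kn_dist, word_dist):
    # kn_dist, word_dist: dicts mapping byte -> probability.
    # word_dist is None when the preceding word falls outside the
    # 2,000-word event vocabulary.
    if word_dist is not None and max(word_dist.values()) > max(kn_dist.values()):
        return word_dist          # word-conditioned prediction wins
    return kn_dist                # fall back to KN-6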
M5 — KN-6 + word bigrams 1.999 bpc
KN-6 + word bigrams. Adds 88,447 word-bigram patterns. Same bpc as M4: at the byte level, word bigrams don't help.

2 Hutter Prize Dashboard

The Hutter Prize measures total = compressor + compressed archive on enwik9 (10⁹ bytes). Prize: 500,000€ × (1 − S/L), where S is the new submission's total and L is the standing record. Constraints: 1 CPU core, <10 GB RAM, ≤50 hours.
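In code, with a 5% improvement chosen purely for illustration:

def prize_eur(S, L):
    # 500,000 EUR scaled by the relative improvement over the record
    return 500_000 * (1 - S / L)

print(round(prize_eur(0.95 * 110.8e6, 110.8e6)))   # 25000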

Our KN-6 + match (online, enwik9): 1.654 bpc = 197.2 MB total
fx2-cmix (record, Sep 2024): 0.886 bpc = 110.8 MB total
[Chart: total size on enwik9 in MB: fx2-cmix 110.8, KN-6+match 197.2 (1.87× gap), bzip2 254, gzip 323]

3 The Gap Decomposition

The 0.80 bpc gap between our KN-6 (1.683 bpc on enwik9) and fx2-cmix (0.886 bpc) breaks into three components:

[Chart: gap decomposition, 0.08 + 0.10 + ~0.62 = 0.80 bpc total]
HT saturation + higher order — ~0.08 bpc. Our 128M-entry hash table saturates at 100%; a larger table (the prize allows 10 GB of RAM) plus deeper orders would recover this.
Word-level patterns — ~0.10 bpc. Word absorption and word bigrams; requires a tokenization bridge to extract these gains at the byte level.
Context mixing + neural — ~0.62 bpc. What cmix does: adaptive mixing of PPM, LSTM, match, and indirect models with neural mixing weights, as sketched below.
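A minimal sketch of logistic mixing, the core of the cmix family: stretch each model's bit probability into the logit domain, mix with learned weights, and nudge the weights toward each observed bit (the learning rate is illustrative):

import math

def stretch(p): return math.log(p / (1 - p))   # requires 0 < p < 1
def squash(x):  return 1 / (1 + math.exp(-x))

class LogisticMixer:
    def __init__(self, n_models, lr=0.02):
        self.w = [0.0] * n_models
        self.lr = lr

    def mix(self, probs):
        # probs: each model's P(next bit = 1)
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, bit):
        # online logistic-regression step toward the bit actually coded
        err = bit - self.p
        self.w = [w + self.lr * err * x for w, x in zip(self.w, self.x)]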

4 SN Format

An SN model is a list of events and patterns, each with an integer strength. The format is human-readable and fully explicit.

Events (unigram strengths):
"The output is ' '." 22.   /* space is the most common byte */
"The output is 'e'." 19.   /* 'e' is second */
"The output is 'a'." 19.
Patterns (conditional strengths):
"The input is ' '." "The output is 't'." 17.
"The input is 't'." "The output is 'h'." 16.
"The input is 'h'." "The output is 'e'." 16.
Strength = ⌊log₂(count)⌋. A strength of 17 means the pair appeared 2¹⁷–2¹⁸ times (~131K–262K).
Each pattern costs about 80 bytes of description. This is the fundamental trade-off: more patterns improve prediction but cost description length.
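A sketch of how such a file could be generated from raw byte counts, following the ⌊log₂(count)⌋ rule above (the exact surface syntax and escaping are assumptions; printable bytes only, for illustration):

import math
from collections import Counter

def sn_lines(data: bytes):
    uni = Counter(data)
    bi = Counter(zip(data, data[1:]))
    for byte, c in uni.most_common():
        yield f'"The output is \'{chr(byte)}\'." {int(math.log2(c))}.'
    for (a, b), c in bi.most_common():
        yield (f'"The input is \'{chr(a)}\'." '
               f'"The output is \'{chr(b)}\'." {int(math.log2(c))}.')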

5 The Description Length Trade-off

Adding patterns improves prediction (lowers bpc) but each pattern costs ~80 bytes. The optimal model minimizes total description = compressed data + model description.
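The break-even point follows directly: at ~80 bytes (640 bits) per pattern, a pattern earns its place only if it shaves more than 640 bits off the compressed data. A sketch:

PATTERN_COST_BITS = 80 * 8     # ~80 bytes of description per pattern

def worth_adding(bpc_saved, corpus_bytes):
    # bits saved across the corpus vs. bits spent describing the pattern
    return bpc_saved * corpus_bytes > PATTERN_COST_BITS

print(worth_adding(1e-6, 1e9))   # True: 1000 bits saved > 640 spent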

[Chart: megabytes vs. number of patterns, log scale from 0 to 33M, showing compressed data, model description, and their total. With 0 patterns: 5.069 bpc prediction, 0 B description, 634 MB total; the total curve dips to an optimum before description cost dominates.]