Match Model Combination Problem

Why combining a match model with KN smoothing for enwik9 compression is harder than it looks, and what strategies actually work.

1. The Combination Problem

We have a KN-6 model (1.682 bpc on full enwik9) and a match model that copies from prior occurrences of the current context. How should we combine their predictions? Nine strategies, wildly different results.

Combination Strategy Performance (bpc, lower is better)
The combination gap: Oracle absorption achieves 1.625 bpc by using the actual next byte to decide when to trust the match model. The best valid strategy (extended match + logistic mixer) achieves 1.634 bpc on full enwik9; on the 100 MB prefix the oracle's lead grows to 0.28 bpc. That gap represents how much we leave on the table by not knowing in advance whether the match will be correct.
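A minimal sketch of the logistic-mixer idea referenced above (the single-weight-per-model setup and learning rate are illustrative, not the exact per-bucket implementation): each model's bit probability is mapped to the logit domain, combined with learned weights, and the weights are trained online by gradient descent on coding loss.

```python
import math

def stretch(p):
    """Probability -> logit, clamped away from 0 and 1."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    return math.log(p / (1.0 - p))

def squash(x):
    """Logit -> probability."""
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    """Mix per-model bit probabilities in the logit domain (sketch)."""
    def __init__(self, n_models, lr=0.02):
        self.w = [0.0] * n_models   # start neutral: mix = 0.5
        self.lr = lr

    def mix(self, probs):
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, bit):
        # Gradient of the log loss with respect to the mixed logit
        # is (bit - p); each weight moves along its model's logit.
        err = bit - self.p
        for i in range(len(self.w)):
            self.w[i] += self.lr * err * self.x[i]
```

Models that are consistently right earn larger weights; a model stuck at 0.5 contributes a zero logit and its weight never matters, which is one reason logit-domain mixing tolerates uninformative experts gracefully.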

2. Why Geometric Mixing Fails

Geometric mixing (used successfully in PAQ/cmix) multiplies probability distributions: P(x) ∝ P1(x)^α × P2(x)^(1-α), renormalized over x. This is ideal when both models are well-calibrated. But when the match model is confidently wrong, it becomes catastrophic.

Scenario: Match predicts the wrong byte. The actual next byte is 'e'. KN assigns it 30%. The match model, copying from a prior occurrence whose continuation differs, predicts 'a' at 80% and gives 'e' only 0.2%.

Method                          Formula                              P(correct byte)   Cost
Geometric mean (w=0.3 match)    P('e') ∝ 0.30^0.7 x 0.002^0.3        0.054             4.21 bits
Linear mix (w=0.3 match)        P('e') = 0.7 x 0.30 + 0.3 x 0.002    0.211             2.24 bits

(The geometric value is after renormalizing over all 256 byte values; the raw product is 0.067.)
Impact on P(correct byte) across mixing methods
The catastrophe: Geometric mixing multiplies in the confident wrong model's near-zero probability, dragging the correct byte's probability from 0.30 down to 0.054 -- a 5.5x reduction. Linear mixing preserves a floor: even with 30% weight on the wrong model, P('e') only drops from 0.30 to 0.211.
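These numbers can be reproduced on a toy three-symbol alphabet (the distributions below are illustrative; the 0.054 in the text comes from normalizing over all 256 byte values, so the toy geometric value differs slightly, but the catastrophe is the same):

```python
import math

# Toy distributions over {'e', 'a', 'x'} -- illustrative, not the real models.
kn    = {'e': 0.30,  'a': 0.10, 'x': 0.60}    # KN-6: correct byte 'e' at 30%
match = {'e': 0.002, 'a': 0.80, 'x': 0.198}   # match model: confidently wrong

W = 0.3  # weight on the match model

def geometric_mix(p1, p2, w):
    """P(s) proportional to p1(s)^(1-w) * p2(s)^w, renormalized."""
    raw = {s: p1[s] ** (1 - w) * p2[s] ** w for s in p1}
    z = sum(raw.values())
    return {s: v / z for s, v in raw.items()}

def linear_mix(p1, p2, w):
    """P(s) = (1-w) * p1(s) + w * p2(s); already sums to 1."""
    return {s: (1 - w) * p1[s] + w * p2[s] for s in p1}

geo = geometric_mix(kn, match, W)
lin = linear_mix(kn, match, W)
for name, p in [('geometric', geo), ('linear', lin)]:
    print(f"{name}: P('e') = {p['e']:.3f}, cost = {-math.log2(p['e']):.2f} bits")
```

The geometric mix crushes P('e') well below the linear mix's floor of (1-w) × 0.30, so the correct byte costs over a bit more to code every time the match model is confidently wrong.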

3. Match Model Accuracy

The match model copies the byte that follows the longest matching context. Accuracy improves dramatically with context length: from 73% at length 4 to 97.6% at length 64. Extended match contexts (up to 64 bytes) are the key to higher accuracy.

Match Prediction Accuracy by Context Length
Key insight: Extending match contexts to 64 bytes yields 97.6% accuracy, nearly perfect. At length 64, matches cover 89M positions (8.9%) of enwik9, contributing most of the improvement. Context length matters more than mixer sophistication: the jump from 16 bytes (81.8%) to 64 bytes (97.6%) is worth a further 0.013 bpc over the logistic mixer with 4–16 byte matches alone.
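The core mechanism can be sketched in a few lines, assuming a dictionary from contexts to the position of the byte that followed their last occurrence (real implementations use rolling hashes, fixed-size tables, and the chain/length-bucket machinery mentioned later; this sketch uses a single fixed order):

```python
class MatchModel:
    """Predict the byte that followed the most recent prior occurrence
    of the current context (minimal single-order sketch)."""
    def __init__(self, order=4):
        self.order = order
        self.index = {}          # context -> position of the byte that followed it
        self.history = bytearray()
        self.match_pos = None    # where the predicted byte lives, if any

    def predict(self):
        """Return the predicted next byte value, or None if no match."""
        if self.match_pos is None:
            return None
        return self.history[self.match_pos]

    def update(self, byte):
        """Consume the actual byte, then refresh the match pointer."""
        self.history.append(byte)
        if len(self.history) >= self.order:
            ctx = bytes(self.history[-self.order:])
            self.match_pos = self.index.get(ctx)   # prior occurrence, if any
            self.index[ctx] = len(self.history)    # byte after ctx lands here
        else:
            self.match_pos = None
```

Note the ordering inside update: the prior occurrence is looked up before the table entry is overwritten with the current position, otherwise the model would only ever find itself.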

4. Adaptive Weight Dynamics

The adaptive linear strategy adjusts the match model weight w based on recent match performance. In template-heavy regions (XML tags, repeated markup), matches are reliable and w rises. In novel content (article text, numbers), matches fail and w drops rapidly.

Adaptive Match Weight (w) Over Time

The weight oscillates between roughly 0.05 and 0.35, spending most time in the 0.10 -- 0.20 range. The adaptation rate is tuned so w can rise quickly when entering a template region (several consecutive correct matches) but falls faster on a single miss, reflecting the asymmetric cost structure: one catastrophic miss costs more than several small gains.
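The asymmetric update can be sketched as follows; the rates are illustrative assumptions, since the text specifies only the asymmetry and the roughly 0.05 to 0.35 operating range:

```python
def update_weight(w, match_correct, rise=0.05, fall=0.5,
                  w_min=0.05, w_max=0.35):
    """Asymmetric adaptation of the match weight (sketch).

    w climbs gradually toward w_max on each correct match but is cut
    sharply toward w_min on a miss: one confident miss costs more bits
    than several correct matches save, so recovery must be fast.
    rise/fall rates are illustrative, not the tuned values.
    """
    if match_correct:
        w += rise * (w_max - w)    # slow geometric approach to the ceiling
    else:
        w -= fall * (w - w_min)    # a single miss undoes many hits
    return min(max(w, w_min), w_max)
```

Running a streak of hits followed by one miss reproduces the saw-tooth shape described above: a slow climb through template regions, then a collapse at the first novel byte.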

Result: Adding sparse contexts (non-contiguous byte patterns) in a separate hash table to the extended match + KN-6 system achieves 1.588 bpc on full enwik9, 0.094 bpc better than KN-6 alone (1.682), a saving of 11.2 MB for a final compressed size of 189.3 MB. Sparse contexts alone improve KN-6 by 0.089 bpc and the match model by 0.048 bpc; together the gain is 0.094 rather than the sum, so their contributions overlap, but different model types still capture complementary patterns.

5. Hutter Prize Scoreboard

Where our combination approach sits relative to the state of the art for enwik9 compression (1 GB of English Wikipedia).

Entry                              Compressed size   bpc     Notes
fx2-cmix (record)                  110.8 MB          0.886   Hundreds of models, context mixing
Our KN-6 baseline                  210.4 MB          1.683   Single KN-smoothed 6-gram model
Our KN-6 + adaptive match          197.2 MB          1.654   Adaptive linear mix
Our KN-6 + logistic mixer (4–16)   196.3 MB          1.647   Per-bucket logistic weights + EMA
Our KN-6 + extended match (4–64)   194.7 MB          1.634   Chain-8, 9 length buckets, logistic mixer
Our KN-6 + sparse + ext match      189.3 MB          1.588   Separate 16M sparse HT, 4 skip patterns
Gap remaining                      ~79 MB            0.702   Requires many more model types
Visual Scale: Compressed Size (MB)
The lesson: cmix achieves its record by combining hundreds of specialized models. Our three-component system (KN-6 + sparse + match) demonstrates the key principle: different model types capture complementary patterns. Sparse contexts (+0.089 bpc) capture more than the match model (+0.048 bpc) despite being individually much weaker (2.6 vs 1.6 bpc). The 79 MB gap to fx2-cmix requires many more such complementary components. Next targets: word models, XML structure, indirect contexts.