Hutter Prize Compressor

Compressing enwik9 via the Universal Model. Based on the CMP paper (Clement, 2026).

Resources

Active Papers

Start here. The current research front, plus the foundational paper. See #paper_index for the full catalog (~150 papers).

CMP — The Constructive Model of Prediction (Clement, 2026)
Foundation of the research program. The Universal Model, event spaces, patterns, forward pass, learning. Everything below is built on this.
The Lexicon Embedding — From bytes to words: synthesis, experimental plan, predictions (Latest)
Complete picture: agent model, orthographic bijection, two-level trace (word events + spelling residual), MDL lexicon, factor tower, quotient ring algebra, P-programming, formal tokenization contrast. 100-word results (-0.552 bpc). Experimental plan for full lexicon. 14pp.
The Path to the Lexicon — Bag-of-letters embeddings, first change of base
What “the” is, looking down toward orthography. Word events as support on letter events. Reset signal at space. The first Tock of the Tick-Tock process.
The Memory Trace — Definition, replay, LATD, and the path forward (current)
Joint events as digits in a positional number system. Boolean algebra of the forward pass. Clearing triggers write events. Tick-Tock: freeze, extend, replay, rewrite. LATD: Look At The Data.
The Unigram Memory Trace — Product encoding, E→N→Q, LSI gap (background)
The simplest memory trace: 256 bytes. Tenv=30. Product encoding vs arithmetic coding. The LSI gap (0.522 bpc) is underfitting, not overfitting — correct for a model that transfers.
Working Memory — Agent model, shift chain, ES clearing (background)
O is both prediction and input. No separate ESI. Retroactive pass for late-born neurons. ES clearing = biological default. Conjunction = product events from LPPs.
Timing Resolution — Solves marginal dominance via pattern chain length
Settling — Sharpness preference and ES epistemics. 4.9→2.0 bpc
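All bpc figures in this index are ideal code lengths under arithmetic coding: the sum of −log2 p over the symbols actually emitted, divided by stream length. A minimal sketch of that metric (the probability list is illustrative, not output of the UM):

```python
import math

def bits_per_char(probs):
    """Ideal arithmetic-coding cost: sum -log2 p(symbol) over the
    stream, divided by its length, giving bits per character (bpc)."""
    total_bits = sum(-math.log2(p) for p in probs)
    return total_bits / len(probs)

# probs[i] = probability the model assigned to the symbol that actually
# occurred at position i (illustrative values only).
probs = [0.5, 0.25, 0.125, 0.125]
print(bits_per_char(probs))  # 2.25
```

A real coder pays a small constant overhead on top of this ideal, so reported bpc is the model-quality floor, not the exact file size.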

Archives

2026-03-17 (Latest)
P-Programming Namespace: first formalization pass for the #p_programming_* hierarchy. Places P-programming above specific realizations such as #umr_spec and cmpr. Introduces the core state u = (E,T,P), first namespace-level operators, and a web + PDF overview.
2026-03-14
CSLA 1994 Reconstruction: OCR-based typeset reconstruction of On Structuring Probabilistic Dependences in Stochastic Language Modelling, preserving the source OCR artifacts for later manual cleanup while making the paper available in the archive stream.
2026-03-13
Scaling Corrections: Scoring protocol correction (retracted “UM beats KN-6”), coverage bottleneck analysis (23% at order 6), threshold scaling degradation (τ=1 degrades beyond 300K), multi-retro diminishing returns. The UM KN chain is structurally limited; path forward = external KN for compression + UM for interpretability.
2026-03-12
MCP Explainers, Combination Problem, Ablation, and Conviction Analysis: 42 papers, 6 interactive explainers, and versioned LATD snapshots. Key new results: count-augmented LPPs (2.766 bpc at 100K, 3.049 at 1M), oracle-correlation conviction-vs-accuracy tradeoff, conviction-depth fewer-wins-bigger-wins asymmetry, and H3 normalized conviction as leading higher-order hybrid. LATD three-regime decomposition with versioned snapshots and prior-art lineage back to Feb trace viewers. Plus quotient/fiber, combination problem, blend analysis, and response-cluster papers.
2026-03-06
Trigram Embedding & Consolidation: P-program for the orthographic bijection (7pp). Consolidation v2: GCD decomposition (Bayes from Counting), ring tower (KN-quotient v2), experimental findings (English context results). OBSERVE LPP RNG coupling, log-stochastic quantization, L∞ vs L1 gap, two-level trace, CRT word extension (11pp). Suffix collision analysis: depth 1 = 12.6%, depth 5 needed for >90%.
2026-03-04
Exp L1: Word Discovery: UM discovers vocabulary from data via threshold creation. 73K unique words at 10M, 8K reified, 88% token coverage. No oracle. 1 paper (4pp), 4 SN models (10K–10M).
2026-03-03
The Lexicon Embedding: synthesis paper. Agent model, working memory, orthographic bijection, two-level trace, MDL lexicon, factor tower, quotient ring algebra, P-programming, formal tokenization contrast. 100-word results (−0.552 bpc). Experimental plan for full lexicon. 1 paper (14pp).
2026-02-24
The Path to the Lexicon + English Context Neuron: bag-of-letters embeddings, first change of base. Eight experiments (Exp 2, 3/6, 4, 5.2). Context neurons help when sharper than baseline (−0.12 bpc at unigram level). LSA primitives (⊕, ⊖) implemented; surgery fails at integer precision. 4 papers.
2026-02-22
Working Memory & Memory Traces: agent model (O=pred+input), shift chain, ES clearing, DYNAMICS. The Memory Trace paper: LATD (Look At The Data), boolean sentences, clearing as write trigger, replay/Tick-Tock, embeddings as change of base. Unigram memory trace (256 bytes, Tenv=30, LSI gap = generalization cost). 8 trace viewers, learning rate viewer. 3 papers, 8 viewers.
2026-02-20
UMR Mathematical Specification (7pp): maps every CMP formula to code (f0, softmax, ω0, LPPs as shorthand for P, threshold creation, support gap, envelope, SN format). Sharpest-LPP Scoring: support gap (s1−s2) selects most confident LPP. Threshold-4 creation (part of ω) filters sparse noise. Frozen rerun order 4 = 4.407 bpc at 64K (−0.41 vs order 2). Fixed softmax bug (s=0 → 2^0=1, not ε). Three scoring modes, genesis viewer, SN export. 2 papers, 1 viewer, 5 SN models.
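The softmax fix noted above, as a hedged sketch: with base-2 scores, a zero score must contribute weight 2^0 = 1 rather than being floored to ε. The function name and shape here are illustrative, not the #umr_spec code:

```python
def softmax2(scores):
    """Base-2 softmax over LPP scores. The corrected behavior: a score
    of 0 contributes weight 2**0 = 1.0 instead of a tiny epsilon, so
    zero-score LPPs still carry their proper share of the mass."""
    weights = [2.0 ** s for s in scores]  # s = 0 -> 1.0, as fixed
    z = sum(weights)
    return [w / z for w in weights]

print(softmax2([0, 0]))  # equal scores -> uniform: [0.5, 0.5]
```

Under the buggy ε floor, a zero-score LPP would be effectively silenced; with 2^0 = 1 it competes normally against positive scores (e.g. scores [1, 0] give weights 2:1).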
2026-02-19
Timing Resolution: solves marginal dominance via pattern chain length. Settling: sharpest-LPP drops from 4.9 to 2.0 bpc (externally instrumented, trained and tested online on first 4K of enwik9). The 2.8 bpc gap between max-min and external interpolation IS the settling problem. Research Summary: 20 days reviewed. 3 papers + 1 interactive timeline.
2026-02-18
Context events & surprise: generalized oversupport, ring pattern (91% from-the-left), three-frequency model (+0.184 bpc). Marginal Dominance Theorem: pure UM bigram at ~5.3 bpc (lower-order always wins under max-min). Generic UM runner (#umr_core). UM Viewer (SN-loading, 3D rings, JS forward pass). UM Connectome (pure UM 3D viewer). 8 papers, 3 viewers, 9 experiments.
2026-02-16
Sparse contexts + match + KN-6: 1.588 bpc = 189.3 MiB (1.79× fx2-cmix). 19 scoring experiments, 8 negative results. KN-quotient: ring structure, discount = subtraction vs GCD (g=1 in 98.3%). UM Runner (umr): full v16 pipeline as P-programs, all scores exact match. 13 papers, 36 experiments.
2026-02-15
The Extended Event Space: injecting lexical structure. Nested model (Hext = I′ × Hinner × O′). Tokenization as information loss (≥0.05 bpc, strawberry impossibility). P-Programs: position, accumulator, bag-of-letters, graded word support. 6 papers.
2026-02-14
Lexemes as Binary Event Spaces. Neutral injection EXACT on 10M. Causal+onset model beats oracle: 10M/500w = 4.57 bpc (−30.4%) vs oracle 4.62. Word-onset distribution is the key causal ingredient. 2 papers, 13 tools.
2026-02-13
Tock Protocol: designing lexicon injection into the isomorphic UM.
2026-02-12
The Carrier Signal Problem + 20 math papers. Byte KN-5 = 2.29 bpc. Logic from counting (forward pass = existential prob. syllogism). The Tock Step (factorization tower; strawberry theorem). Mathematical foundations: counting monad, renormalization, expressiveness, information geometry. 21 papers, 8 viewers.
2026-02-11_2
Scaling to full enwik9. P1–P7 predictions (3 wrong, 2 right, 1 mixed). R²=0.83 architectural invariant. 99+10 checkpoint trajectory (Adam + Xavier): three training phases, R² cliff at 450M, b_y collapse at 640M. Xavier R²=0.87–0.89. 4 papers.
2026-02-11
Total interpretation & weight construction. E onto N bi-embedding, quotient chain, temporal bi-embedding, microstates/macrostates. 20+ papers, 15 viewers.
2026-02-10
Narrative paper (31 Jan–10 Feb). Event arithmetic (E onto N, prime encoding). Sparse Differentiation viewer v8 (oscilloscope, attribution arcs, pattern overlays).
2026-02-09
The real factor map: interpretable patterns onto H dynamics. Write-in, subtract-out, UM around the RNN.
2026-02-08
Factor map: 128/128 neurons explained as 2-offset conjunctions (R²=0.84, 87% bpc gain). Reverse isomorphism (0.107 bpc). Pattern viewer.
2026-02-07
SN visibility, export gap, pattern chains, skip-patterns, greedy offsets, weight construction (0.137 bpc via MLP readout).
2026-02-06
Synthesis (8 days of findings), SIMD optimization (13.3× speedup), saturation experiment (RNN vs UM on 1024 bytes).
2026-02-04
UM Interpretation (Tock 1): 302 patterns, 18 natural ESs, word boundary & syllable momentum. <5% bpc explained.
2026-02-01
SN Visibility: Interactive viewer for doubled-E UM. 302 significant patterns extracted.
2026-01-31_6
Memory Traces: AC ↔ RNN dual visualization.
2026-01-31_5
Retrospective and predictions. Testable claims for future work.
2026-01-31_4
Pattern injection, unification (Q = λ), time-energy-bits.
2026-01-31_3
Memory traces, time, integration, explanatory sufficiency.
2026-01-31_2
Tock methodology. ES captures 15.9% of Markov MI (not 59%). 9 figures, 7 papers.
2026-01-31
Initial ES experiments. Activation probing, RNN-UM mapping, model.sn visualization.