Tracking progress: model + trace = exact input (lossless invariant).
Goal: compress enwik9. Record: fx2-cmix 110,793,128 bytes (0.886 bpc). Prize threshold: <109,685,197 (1% improvement).
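Sanity check on the numbers above; integer math for the 1% bar, since the bar is "strictly below 0.99 × record":

```python
# Reproduce the record/threshold arithmetic quoted above.
RECORD = 110_793_128  # fx2-cmix, bytes
ENWIK9 = 1_000_000_000

bpc = RECORD * 8 / ENWIK9              # bits per character over enwik9
threshold = (RECORD * 99) // 100 + 1   # smallest integer NOT strictly below 0.99 * record

print(round(bpc, 3))   # 0.886
print(threshold)       # 109685197: a winning encoder must land below this
```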
| Model | Trace (bytes) | Model (bytes) | Total (bytes, incl. 8-byte header) | bpc | vs Record |
|---|---|---|---|---|---|
| Unigram (envelope→output) | 681,894,692 | 256 | 681,894,956 | 5.455 | 6.15× |
Model: Unigram. One LPP: envelope → byte_output. The 256 log-support values are learned via ω₀ (log-stochastic counting). Two-pass: pass 1 learns the counts, pass 2 encodes against the frozen model.
Range coder: Subbotin-style, byte-aligned, with carry propagation handled through the output buffer. RC_FREQ_TOTAL = 2¹⁴ = 16384. The cumulative frequency table is built from the 2^s softmax, with a minimum frequency of 1 per symbol so every byte remains codable.
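A sketch of the frequency-table construction, assuming the log-support values are base-2 (matching the 2^s softmax); function and variable names are illustrative, not from the codebase:

```python
RC_FREQ_TOTAL = 1 << 14  # 16384

def quantize_to_freq_table(log_support):
    """Turn 256 log-support values into integer frequencies summing to
    RC_FREQ_TOTAL, each >= 1 so every byte stays codable."""
    m = max(log_support)
    weights = [2.0 ** (s - m) for s in log_support]  # 2^s softmax, max-shifted for stability
    z = sum(weights)
    freq = [max(1, int(w / z * RC_FREQ_TOTAL)) for w in weights]
    # Repair rounding drift on the most probable symbols; the 1-count
    # minimum survives because only entries above 1 get decremented.
    while sum(freq) > RC_FREQ_TOTAL:
        freq[freq.index(max(freq))] -= 1
    while sum(freq) < RC_FREQ_TOTAL:
        freq[freq.index(max(freq))] += 1
    cum = [0] * 257
    for i in range(256):
        cum[i + 1] = cum[i] + freq[i]
    return freq, cum
```

The 1-count floor costs at most 256/16384 of the probability mass but guarantees the decoder never meets a zero-frequency symbol.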
| Scale | Input | Bitstream | Total | bpc | Status |
|---|---|---|---|---|---|
| 1K | 1,024 | 681 | 945 | 7.383 | PASS |
| 10K | 10,000 | 6,771 | 7,035 | 5.628 | PASS |
| 100K | 100,000 | 64,266 | 64,530 | 5.162 | PASS |
| 1M | 1,000,000 | 663,616 | 663,880 | 5.311 | PASS |
| 10M | 10,000,000 | 6,836,203 | 6,836,467 | 5.469 | PASS |
| 1B | 1,000,000,000 | 681,894,692 | 681,894,956 | 5.455 | PASS |
Notes: Shannon H₀ = 4.926 bpc. LSI gap (online vs ideal) = 0.52 bpc. Model size = 256 bytes (negligible). This is the baseline — every future model must beat this AND maintain the lossless invariant.
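The H₀ floor in the notes is the order-0 Shannon entropy of the byte histogram; a minimal sketch of how that figure is computed (the 4.926 bpc value itself comes from the enwik9 counts, not reproduced here):

```python
import math

def h0_bpc(data: bytes) -> float:
    """Order-0 Shannon entropy of a byte string, in bits per character.
    On enwik9 this is the floor for any unigram coder (~4.926 bpc)."""
    n = len(data)
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return -sum(c / n * math.log2(c / n) for c in counts if c)
```

Subtracting this floor from the measured 5.455 bpc gives the ≈0.53 bpc gap the notes attribute to LSI (online vs ideal).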
Blocks: #umr_range_coder (Subbotin range coder), #umr_trace_ac (write/read/verify), #umr_main (trace-write, trace-read, trace-verify commands).
It took five range coder implementations before the working one.
Tick-Tock cycle (see #root, lexicon paper §3):
The model is always a UM: event spaces, LPPs, a forward pass (max-min on log-support), and a softmax output distribution. The trace is encoded against the UM's output distribution via arithmetic coding. The frozen model is stored in the trace file (SN or equivalent), NOT as a flat lookup table.
Lossless invariant: for any frozen model M and trace T, decode(M, T) = the exact original input. This must hold at every step.
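The invariant can be property-checked end to end with a stand-in codec. The sketch below is a textbook 32-bit arithmetic coder with bit output and a frozen count+1 table, NOT the project's byte-aligned Subbotin coder or its SN serialization; message length is passed out of band here. It exists only to show the shape of the check decode(M, encode(M, x)) == x:

```python
HALF, QUARTER, THREEQ, MASK = 1 << 31, 1 << 30, 3 << 30, (1 << 32) - 1

def freeze(data: bytes) -> list:
    """Frozen model M: cumulative table of (count + 1) so every byte is codable."""
    freq = [1] * 256
    for b in data:
        freq[b] += 1
    cum = [0] * 257
    for i in range(256):
        cum[i + 1] = cum[i] + freq[i]
    return cum                          # cum[256] is the total

def encode(cum: list, data: bytes) -> list:
    total, low, high, pending, out = cum[256], 0, MASK, 0, []

    def emit(bit):
        nonlocal pending
        out.append(bit)
        out.extend([1 - bit] * pending)  # release deferred carry bits
        pending = 0

    for b in data:
        span = high - low + 1
        high = low + span * cum[b + 1] // total - 1
        low = low + span * cum[b] // total
        while True:                      # renormalize
            if high < HALF:
                emit(0)
            elif low >= HALF:
                emit(1); low -= HALF; high -= HALF
            elif low >= QUARTER and high < THREEQ:
                pending += 1; low -= QUARTER; high -= QUARTER
            else:
                break
            low, high = low << 1, (high << 1) | 1
    pending += 1                         # flush: pin down the final interval
    emit(0 if low < QUARTER else 1)
    return out

def decode(cum: list, bits: list, n: int) -> bytes:
    total, low, high, pos, code = cum[256], 0, MASK, 0, 0

    def next_bit():
        nonlocal pos
        bit = bits[pos] if pos < len(bits) else 0   # zero-pad past the end
        pos += 1
        return bit

    for _ in range(32):
        code = (code << 1) | next_bit()
    out = bytearray()
    for _ in range(n):
        span = high - low + 1
        target = ((code - low + 1) * total - 1) // span
        s = 0
        while cum[s + 1] <= target:      # find s with cum[s] <= target < cum[s+1]
            s += 1
        out.append(s)
        high = low + span * cum[s + 1] // total - 1
        low = low + span * cum[s] // total
        while True:                      # mirror the encoder's renormalization
            if high < HALF:
                pass
            elif low >= HALF:
                low -= HALF; high -= HALF; code -= HALF
            elif low >= QUARTER and high < THREEQ:
                low -= QUARTER; high -= QUARTER; code -= QUARTER
            else:
                break
            low, high = low << 1, (high << 1) | 1
            code = (code << 1) | next_bit()
    return bytes(out)
```

The check itself is one line, run at every scale: `decode(M, encode(M, x), len(x)) == x`.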
Spec: lexicon paper §6 ("What We Need to Build").
| # | Step | Status |
|---|---|---|
| 1 | AC encoder/decoder | DONE |
| 2 | Extended shift chain (depth 3+, SN model, word-boundary clearing) | NEXT |
| 3 | Trace recording under UM output distribution | DONE (unigram baseline) |
| 4 | Threshold creation at reset (word event discovery) | planned |
| 5 | Word-level LPPs (spelling + word sequences) | planned |
| 6 | Replay: decode old trace, re-encode under extended model | planned |
Current baseline: unigram, 5.455 bpc, 681.9 MB. Target: <109,685,197 bytes (0.877 bpc).