Trace Experiments Log

Tracking progress: model + trace = exact input (lossless invariant).
Goal: compress enwik9. Record: fx2-cmix, 110,793,128 bytes (0.886 bpc). Prize threshold: <109,685,197 bytes (a 1% improvement on the record).

Current Best

| Model | Trace (bytes) | Model (bytes) | Total | bpc | vs Record |
|---|---|---|---|---|---|
| Unigram (envelope→output) | 681,894,692 | 256 | 681,894,956 | 5.455 | 6.15× |

Experiments

E1: Unigram AC trace — full enwik9

2026-02-24 · revstore 20260224-052210 · git 4dd3b8c (uncommitted)

Model: Unigram. One LPP: envelope → byte_output. 256 log-support values learned via ω₀ (log-stochastic counting). Two-pass: learn counts on pass one, then encode with the frozen model on pass two.
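A minimal sketch of the two-pass pipeline. Plain byte counting with a +1 floor stands in for the ω₀ log-stochastic counting rule (the real update differs); the function names are illustrative, not from the codebase.

```python
import math

def learn_log_support(data: bytes) -> list[float]:
    """Pass 1: count byte frequencies, freeze them as log2 supports.
    Plain counting is an assumption standing in for omega_0."""
    counts = [1] * 256                 # +1 floor so every symbol stays codable
    for b in data:
        counts[b] += 1
    return [math.log2(c) for c in counts]

def ideal_code_length_bits(data: bytes, log_support: list[float]) -> float:
    """Pass 2 (cost only): sum of -log2 p(b) under the frozen 2^s softmax."""
    z = sum(2.0 ** s for s in log_support)   # softmax normalizer
    return sum(math.log2(z) - log_support[b] for b in data)
```

The AC bitstream length tracks this ideal cost to within coder overhead, which is how the bpc figures in the table below are sanity-checked.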

Range coder: Subbotin-style, byte-aligned, with carry propagation handled through the output buffer. RC_FREQ_TOTAL = 2¹⁴ = 16384. The cumulative frequency table is built from the 2^s softmax, with a minimum frequency of 1 per symbol.
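A sketch of quantizing the 2^s softmax into a cumulative table summing to RC_FREQ_TOTAL, with the 1-per-symbol floor. The residual-dumping policy (adding the rounding remainder to the last symbol) is an assumption, not necessarily what the coder does.

```python
RC_FREQ_TOTAL = 1 << 14  # 16384

def quantize_freqs(log_support: list[float]) -> list[int]:
    """Scale 2^s weights to integer frequencies summing to RC_FREQ_TOTAL,
    each at least 1. Residual handling here is an illustrative choice."""
    weights = [2.0 ** s for s in log_support]
    z = sum(weights)
    freqs = [max(1, int(w / z * RC_FREQ_TOTAL)) for w in weights]
    freqs[-1] += RC_FREQ_TOTAL - sum(freqs)   # dump rounding residual
    return freqs

def cumulative(freqs: list[int]) -> list[int]:
    """Prefix sums: cum[k]..cum[k+1] is symbol k's slice of the range."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    return cum                                # cum[256] == RC_FREQ_TOTAL
```

The min-1 floor matters for losslessness: a zero-frequency symbol would be unencodable if it ever appeared.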

| Scale | Input | Bitstream | Total | bpc | Status |
|---|---|---|---|---|---|
| 1K | 1,024 | 681 | 945 | 7.383 | PASS |
| 10K | 10,000 | 6,771 | 7,035 | 5.628 | PASS |
| 100K | 100,000 | 64,266 | 64,530 | 5.162 | PASS |
| 1M | 1,000,000 | 663,616 | 663,880 | 5.311 | PASS |
| 10M | 10,000,000 | 6,836,203 | 6,836,467 | 5.469 | PASS |
| 1B | 1,000,000,000 | 681,894,692 | 681,894,956 | 5.455 | PASS |

Notes: Shannon H₀ = 4.926 bpc. LSI gap (online vs ideal) = 0.52 bpc. Model size = 256 bytes (negligible). This is the baseline: every future model must beat it while maintaining the lossless invariant.
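For reference, the order-0 entropy H₀ quoted above is just the byte-histogram entropy; a sketch, shown on toy data rather than enwik9:

```python
import math
from collections import Counter

def h0_bpc(data: bytes) -> float:
    """Order-0 Shannon entropy in bits per character (byte)."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

The 0.52 bpc gap between the 5.455 bpc trace and H₀ = 4.926 is the cost of learning the distribution online plus quantization overhead in the frequency table.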

Blocks: #umr_range_coder (Subbotin range coder), #umr_trace_ac (write/read/verify), #umr_main (trace-write, trace-read, trace-verify commands).

Range coder attempts that failed

2026-02-24 · revstore 20260224-051440 through 20260224-052113

Five range coder implementations failed before the working one; the revstore range above covers those attempts.

Architecture

Tick-Tock cycle (see #root, lexicon paper §3):

  1. Tick 0: Train model on enwik9. Write AC trace. Freeze model.
  2. Query: LATD — find where the model fails and why.
  3. Tock 1: Extend model architecture (+E, +P) to address gaps.
  4. Tick 1: Decode old trace with frozen model → bytes → encode under extended model → shorter trace.
  5. If not shorter, stop and reconsider.

The model is always a UM: event spaces, LPPs, a forward pass (max-min on log-support), and a softmax output distribution. The trace encodes against the UM's output distribution via arithmetic coding. The frozen model is stored in the trace file (SN or equivalent); it is NOT a flat lookup table.

Lossless invariant: for any frozen model M and trace T, decode(M, T) reproduces the original input exactly. This must hold at every step.
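The invariant can be property-checked with a round-trip harness. A sketch using zlib purely as a stand-in codec pair (the real check runs decode with the frozen UM against the trace):

```python
import zlib

def check_lossless(data: bytes) -> int:
    """Round-trip check: encode, decode, assert byte equality.
    zlib stands in for the real trace-write/trace-read pair."""
    trace = zlib.compress(data)        # stand-in for encode under frozen model
    assert zlib.decompress(trace) == data, "lossless invariant violated"
    return len(trace)                  # trace size, for the bpc ladder
```

Swapping in the real trace-write/trace-read pair and running this at every scale in the ladder above is what the PASS column records.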

Roadmap

Spec: lexicon paper §6 ("What We Need to Build").

| # | Step | Status |
|---|---|---|
| 1 | AC encoder/decoder | DONE |
| 2 | Extended shift chain (depth 3+, SN model, word-boundary clearing) | NEXT |
| 3 | Trace recording under UM output distribution | DONE (unigram baseline) |
| 4 | Threshold creation at reset (word event discovery) | planned |
| 5 | Word-level LPPs (spelling + word sequences) | planned |
| 6 | Replay: decode old trace, re-encode under extended model | planned |

Current baseline: unigram, 5.455 bpc, 681.9 MB. Target: <109,685,197 bytes (0.877 bpc).

Last updated: 2026-02-24