Tracking progress: model + trace = exact input (lossless invariant).
Goal: compress enwik9. Record: fx2-cmix 110,793,128 bytes (0.886 bpc). Prize threshold: <109,685,197 (1% improvement).
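Sanity check on the numbers above; integer math for the 1% bar, since the bar is "strictly below 0.99 × record":

```python
# Reproduce the record/threshold arithmetic quoted above.
RECORD = 110_793_128  # fx2-cmix, bytes
ENWIK9 = 1_000_000_000

bpc = RECORD * 8 / ENWIK9              # bits per character over enwik9
threshold = (RECORD * 99) // 100 + 1   # smallest integer NOT strictly below 0.99 * record

print(round(bpc, 3))   # 0.886
print(threshold)       # 109685197: a winning encoder must land below this
```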
| Model | Trace (bytes) | Model (bytes) | Total (bytes, incl. 8-byte header) | bpc | vs Record |
|---|---|---|---|---|---|
| Unigram (envelope→output) | 681,894,692 | 256 | 681,894,956 | 5.455 | 6.15× |
Model: Unigram. One LPP: envelope → byte_output. The 256 log-support values are learned via ω₀ (log-stochastic counting). Two-pass: pass 1 learns the counts, pass 2 encodes against the frozen model.
Range coder: Subbotin-style, byte-aligned, with carry propagation handled through the output buffer. RC_FREQ_TOTAL = 2¹⁴ = 16384. The cumulative frequency table is built from the 2^s softmax, with a minimum frequency of 1 per symbol so every byte remains codable.
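A sketch of the frequency-table construction, assuming the log-support values are base-2 (matching the 2^s softmax); function and variable names are illustrative, not from the codebase:

```python
RC_FREQ_TOTAL = 1 << 14  # 16384

def quantize_to_freq_table(log_support):
    """Turn 256 log-support values into integer frequencies summing to
    RC_FREQ_TOTAL, each >= 1 so every byte stays codable."""
    m = max(log_support)
    weights = [2.0 ** (s - m) for s in log_support]  # 2^s softmax, max-shifted for stability
    z = sum(weights)
    freq = [max(1, int(w / z * RC_FREQ_TOTAL)) for w in weights]
    # Repair rounding drift on the most probable symbols; the 1-count
    # minimum survives because only entries above 1 get decremented.
    while sum(freq) > RC_FREQ_TOTAL:
        freq[freq.index(max(freq))] -= 1
    while sum(freq) < RC_FREQ_TOTAL:
        freq[freq.index(max(freq))] += 1
    cum = [0] * 257
    for i in range(256):
        cum[i + 1] = cum[i] + freq[i]
    return freq, cum
```

The 1-count floor costs at most 256/16384 of the probability mass but guarantees the decoder never meets a zero-frequency symbol.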
| Scale | Input | Bitstream | Total | bpc | Status |
|---|---|---|---|---|---|
| 1K | 1,024 | 681 | 945 | 7.383 | PASS |
| 10K | 10,000 | 6,771 | 7,035 | 5.628 | PASS |
| 100K | 100,000 | 64,266 | 64,530 | 5.162 | PASS |
| 1M | 1,000,000 | 663,616 | 663,880 | 5.311 | PASS |
| 10M | 10,000,000 | 6,836,203 | 6,836,467 | 5.469 | PASS |
| 1B | 1,000,000,000 | 681,894,692 | 681,894,956 | 5.455 | PASS |
Notes: Shannon H₀ = 4.926 bpc. LSI gap (online vs ideal) = 0.52 bpc. Model size = 256 bytes (negligible). This is the baseline — every future model must beat this AND maintain the lossless invariant.
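The H₀ floor in the notes is the order-0 Shannon entropy of the byte histogram; a minimal sketch of how that figure is computed (the 4.926 bpc value itself comes from the enwik9 counts, not reproduced here):

```python
import math

def h0_bpc(data: bytes) -> float:
    """Order-0 Shannon entropy of a byte string, in bits per character.
    On enwik9 this is the floor for any unigram coder (~4.926 bpc)."""
    n = len(data)
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return -sum(c / n * math.log2(c / n) for c in counts if c)
```

Subtracting this floor from the measured 5.455 bpc gives the ≈0.53 bpc gap the notes attribute to LSI (online vs ideal).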
Blocks: #umr_range_coder (Subbotin range coder), #umr_trace_ac (write/read/verify), #umr_main (trace-write, trace-read, trace-verify commands).
It took five range coder implementations before the working one.
Tick-Tock cycle (see #root, lexicon paper §3):
The model is always a UM: event spaces, LPPs, a forward pass (max-min on log-support), and a softmax output distribution. The trace is encoded against the UM's output distribution via arithmetic coding. The frozen model is stored in the trace file (SN or equivalent), NOT as a flat lookup table.
Lossless invariant: for any frozen model M and trace T, decode(M, T) = the exact original input. This must hold at every step.
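The invariant can be property-checked end to end with a stand-in codec. The sketch below is a textbook 32-bit arithmetic coder with bit output and a frozen count+1 table, NOT the project's byte-aligned Subbotin coder or its SN serialization; message length is passed out of band here. It exists only to show the shape of the check decode(M, encode(M, x)) == x:

```python
HALF, QUARTER, THREEQ, MASK = 1 << 31, 1 << 30, 3 << 30, (1 << 32) - 1

def freeze(data: bytes) -> list:
    """Frozen model M: cumulative table of (count + 1) so every byte is codable."""
    freq = [1] * 256
    for b in data:
        freq[b] += 1
    cum = [0] * 257
    for i in range(256):
        cum[i + 1] = cum[i] + freq[i]
    return cum                          # cum[256] is the total

def encode(cum: list, data: bytes) -> list:
    total, low, high, pending, out = cum[256], 0, MASK, 0, []

    def emit(bit):
        nonlocal pending
        out.append(bit)
        out.extend([1 - bit] * pending)  # release deferred carry bits
        pending = 0

    for b in data:
        span = high - low + 1
        high = low + span * cum[b + 1] // total - 1
        low = low + span * cum[b] // total
        while True:                      # renormalize
            if high < HALF:
                emit(0)
            elif low >= HALF:
                emit(1); low -= HALF; high -= HALF
            elif low >= QUARTER and high < THREEQ:
                pending += 1; low -= QUARTER; high -= QUARTER
            else:
                break
            low, high = low << 1, (high << 1) | 1
    pending += 1                         # flush: pin down the final interval
    emit(0 if low < QUARTER else 1)
    return out

def decode(cum: list, bits: list, n: int) -> bytes:
    total, low, high, pos, code = cum[256], 0, MASK, 0, 0

    def next_bit():
        nonlocal pos
        bit = bits[pos] if pos < len(bits) else 0   # zero-pad past the end
        pos += 1
        return bit

    for _ in range(32):
        code = (code << 1) | next_bit()
    out = bytearray()
    for _ in range(n):
        span = high - low + 1
        target = ((code - low + 1) * total - 1) // span
        s = 0
        while cum[s + 1] <= target:      # find s with cum[s] <= target < cum[s+1]
            s += 1
        out.append(s)
        high = low + span * cum[s + 1] // total - 1
        low = low + span * cum[s] // total
        while True:                      # mirror the encoder's renormalization
            if high < HALF:
                pass
            elif low >= HALF:
                low -= HALF; high -= HALF; code -= HALF
            elif low >= QUARTER and high < THREEQ:
                low -= QUARTER; high -= QUARTER; code -= QUARTER
            else:
                break
            low, high = low << 1, (high << 1) | 1
            code = (code << 1) | next_bit()
    return bytes(out)
```

The check itself is one line, run at every scale: `decode(M, encode(M, x), len(x)) == x`.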
Spec: lexicon paper §6 ("What We Need to Build").
| # | Step | Status |
|---|---|---|
| 1 | AC encoder/decoder | DONE |
| 2 | Extended shift chain (depth 3+, SN model, word-boundary clearing) | NEXT |
| 3 | Trace recording under UM output distribution | DONE (unigram baseline) |
| 4 | Threshold creation at reset (word event discovery) | planned |
| 5 | Word-level LPPs (spelling + word sequences) | planned |
| 6 | Replay: decode old trace, re-encode under extended model | planned |
Current baseline: unigram, 5.455 bpc, 681.9 MB. Target: <109,685,197 bytes (0.877 bpc).