Claude and MJC, February 2026
We inject word-level conditional distributions into a byte-level Kneser-Ney (KN) model by absorbing word evidence: after a known word boundary, we replace the byte-level prediction with the word-conditional next-byte distribution for a single byte, then resume normal KN prediction.
| Word | H(next byte \| w) (bits) | KN bpc | Saved (bits) |
|---|---|---|---|
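The injection step can be sketched as follows. This is a minimal sketch, not the actual implementation: `kn_next` (context to next-byte distribution) and `word_cond` (word to next-byte distribution) are hypothetical stand-ins for the two models, and the context order is an arbitrary choice.

```python
import math

def code_length(data: bytes, kn_next, word_cond, sep=ord(' '), order=8):
    """Total code length of `data` in bits under the mixed predictor.

    kn_next(ctx) -> {byte: prob} stands in for the byte-level KN model;
    word_cond[word] -> {byte: prob} is the word-conditional next-byte
    distribution. Both interfaces are assumptions for this sketch.
    """
    bits, word, key, inject = 0.0, bytearray(), b'', False
    for i, b in enumerate(data):
        ctx = bytes(data[max(0, i - order):i])
        if inject and key in word_cond:
            dist = word_cond[key]        # absorb word evidence for one byte
        else:
            dist = kn_next(ctx)          # plain byte-level KN
        bits += -math.log2(dist.get(b, 1e-9))   # floor avoids log(0)
        if b == sep:
            key, inject = bytes(word), True     # next byte is word-conditioned
            word.clear()
        else:
            word.append(b)
            inject = False
    return bits
```

With a uniform stand-in for KN (8 bits/byte) and `word_cond = {b'the': {ord('c'): 1.0}}`, coding `b"the cat"` costs 48 bits instead of 56: the one injected byte is free, every other byte still costs 8 bits.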
We build a word-bigram count matrix from the first 100M bytes, apply SVD, then k-means-cluster the left singular vectors. Syntactic categories emerge purely from co-occurrence counting — no labels, no parsing, no neural network.
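The pipeline can be sketched in a few lines of numpy. All sizes here (vocabulary cutoff, embedding dimension, cluster count) are illustrative defaults, not the values used in the experiment, and the log1p damping of counts is our choice.

```python
from collections import Counter
import numpy as np

def bigram_embed_and_cluster(text, vocab_size=1000, dims=50, k=20,
                             iters=25, seed=0):
    """Bigram counts -> SVD embeddings -> k-means labels (a sketch)."""
    words = text.split()
    vocab = [w for w, _ in Counter(words).most_common(vocab_size)]
    idx = {w: i for i, w in enumerate(vocab)}
    # C[i, j] = number of times word i is immediately followed by word j
    C = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(words, words[1:]):
        if a in idx and b in idx:
            C[idx[a], idx[b]] += 1.0
    # left singular vectors capture "what tends to follow this word"
    U, S, _ = np.linalg.svd(np.log1p(C), full_matrices=False)
    X = U[:, :dims] * S[:dims]              # low-rank word embeddings
    # plain Lloyd's k-means on the embeddings
    rng = np.random.default_rng(seed)
    cent = X[rng.choice(len(X), size=min(k, len(X)), replace=False)]
    for _ in range(iters):
        lab = ((X[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
        for j in range(len(cent)):
            if (lab == j).any():
                cent[j] = X[lab == j].mean(0)
    return ({w: int(lab[i]) for w, i in idx.items()},
            {w: X[i] for w, i in idx.items()})
```

On a toy corpus where determiners, nouns, and verbs occur in fixed slots, words with the same right-context (e.g. "the" and "a") land at nearly identical embeddings, which is exactly the signal k-means picks up.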
P-programs are small deterministic state machines that maintain an accumulator updated at each byte. They define features that the byte-level model conditions on. We evaluate each P-program's contribution to compression.
| Position | Observed states | Theoretical max (26^n) | Ratio |
|---|---|---|---|
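A minimal sketch of such a machine, for concreteness. The specific update rule here (a rolling window of the last n lowercase letters, reset on any non-letter) is our illustrative choice, not the P-program evaluated above; it does, however, yield the 26^n theoretical state bound the table compares against.

```python
def make_pprogram(n=2):
    """A toy P-program: deterministic, one accumulator update per byte.

    State = last n lowercase letters seen; any non-letter resets it.
    """
    def step(state: str, byte: int) -> str:
        c = chr(byte).lower()
        return (state + c)[-n:] if c.isalpha() else ''
    return step

def states_visited(data: bytes, step, init=''):
    """Run the machine over `data`, collecting the state after each byte."""
    out, s = [], init
    for b in data:
        s = step(s, b)
        out.append(s)
    return out

# Observed vs. theoretical state-space size, as in the table above:
obs = set(states_visited(b"the cat sat on the mat", make_pprogram(2)))
ratio = len(obs) / 26 ** 2
```

Because the machine is deterministic, its state is a pure function of the bytes seen so far, so conditioning the byte-level model on it costs nothing at decode time; the compression contribution is just the drop in conditional entropy of the next byte given the state.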