Archive 2026-02-14

Lexemes as binary event spaces. The neutral introduction of “the” into the isomorphic UM.

Key insight: No new FSAs needed. The model's existing atomic byte patterns (skip-bigrams) already carry word-level information as a "bag of letters." Each lexeme is a binary ES (word occurred / negation), with the negative event distributing over the prefix class. The char→lexeme map is itself a complete UM, composable with the byte-level model via UM algebra.
Neutrality theorem: Introducing “the” as a binary ES and marginalizing out recovers exactly the original predictions (0.079 bpc). The operation is a pure rearrangement of count tables via E→N→Q. Value arrives only when intra-lexical patterns are added. The negative event distribution models how negative weights distribute generally.
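The neutrality claim can be illustrated on a toy bigram model. The sketch below is not the E→N→Q pipeline itself; it assumes a one-byte context and a binary event defined as "the byte string `the` ends just before this position", and every function name is hypothetical. It splits each count table by that event, then marginalizes the event out and compares against the unsplit predictions.

```python
from collections import Counter, defaultdict

def base_counts(data: bytes):
    """Count next-byte occurrences per previous byte (toy stand-in for the model)."""
    counts = defaultdict(Counter)
    for i in range(1, len(data)):
        counts[data[i - 1]][data[i]] += 1
    return counts

def split_counts(data: bytes, word: bytes = b"the"):
    """Same counts, split by a binary event: did `word` end right before position i?
    Assumption: a raw byte-string match, not a tokenized word."""
    counts = defaultdict(Counter)
    for i in range(1, len(data)):
        occurred = data[max(0, i - len(word)):i] == word
        counts[(data[i - 1], occurred)][data[i]] += 1
    return counts

def predict_base(counts, ctx):
    total = sum(counts[ctx].values())
    return {b: n / total for b, n in counts[ctx].items()}

def predict_marginalized(split, ctx):
    """P(b | ctx) = sum over e of P(e | ctx) * P(b | ctx, e)."""
    sizes = {e: sum(split[(ctx, e)].values()) for e in (True, False)}
    total = sizes[True] + sizes[False]
    out = Counter()
    for e, n_e in sizes.items():
        if n_e == 0:
            continue  # event never seen in this context
        for b, n in split[(ctx, e)].items():
            out[b] += (n_e / total) * (n / n_e)
    return dict(out)

def max_abs_diff(data: bytes, word: bytes = b"the") -> float:
    """Largest prediction difference between the base and marginalized models."""
    base, split = base_counts(data), split_counts(data, word)
    worst = 0.0
    for ctx in list(base):
        p, q = predict_base(base, ctx), predict_marginalized(split, ctx)
        for b in set(p) | set(q):
            worst = max(worst, abs(p.get(b, 0.0) - q.get(b, 0.0)))
    return worst
```

Under these assumptions the difference is floating-point noise only, because the split tables sum back to the original table in every context. Any gain has to come from conditioning on the event, not from introducing it.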

Papers

Experimental results: Neutrality verified EXACTLY on 10M bytes (4×10⁻¹⁴ difference). RNN gives P(space|the+) = 12% uniformly. Top-50 words: 1.02 bpc oracle value (16% of loss). Top-500: 2.32 bpc (36%). Internal bytes dominate 5:1 over boundary.
Causal+onset model beats oracle: Prefix trie with log-linear mix + onset distribution. At 10M/500 words: RNN 6.56 → Causal+onset 4.57 bpc (−30.4%), vs Oracle 4.62 bpc. 102.5% of oracle gain recovered.
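A log-linear mix of two next-byte distributions can be sketched as below. The exact form used in the experiments is not given here, so this assumes the common geometric form P(b) ∝ P_lex(b)^α · P_rnn(b)^(1−α), with α weighting the lexical (trie) side. The function name, the dict representation, and the `floor` parameter are hypothetical.

```python
import math

def log_linear_mix(p_lex, p_rnn, alpha, floor=1e-12):
    """Geometric (log-linear) interpolation of two distributions over the same symbols.

    alpha = 1.0 returns the lexical distribution, alpha = 0.0 the RNN distribution.
    `floor` guards against log(0); it is an assumption, not a tuned value.
    """
    symbols = set(p_lex) | set(p_rnn)
    scores = {
        b: alpha * math.log(max(p_lex.get(b, 0.0), floor))
        + (1.0 - alpha) * math.log(max(p_rnn.get(b, 0.0), floor))
        for b in symbols
    }
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return {b: math.exp(s - top) / z for b, s in scores.items()}
```

For example, `log_linear_mix({"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}, alpha=0.9)` keeps the lexical ranking but pulls mass slightly toward the RNN's preference.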
Split-alpha optimization: Onset positions want α=1.0 (pure lexical: the RNN is useless at word onset, 7.8 bpc). Mid-word positions want α=0.9 (trie dominates but RNN helps). 5K words at 10M: 3.56 bpc whole-sample (45.7% reduction), 3.72 bpc split-sample (43.3%). Gap analysis: onset accounts for 80% of the remaining causal-oracle gap. 2-byte conditional onset adds 0.026 bpc over 1-byte.
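A minimal sketch of position-dependent α selection, plus the arithmetic behind the reported reductions. Two assumptions are made: "onset" means the previous byte is not an ASCII letter (the notes do not define it), and the baseline for the percentages is the 6.56 bpc RNN figure at 10M, which reproduces the reported numbers. All names are hypothetical.

```python
ONSET_ALPHA = 1.0  # pure lexical at word onset, per the notes
MID_ALPHA = 0.9    # trie dominates mid-word, RNN still contributes

def is_onset(prev_byte):
    """Assumed onset test: start of stream, or previous byte is not an ASCII letter."""
    if prev_byte is None:
        return True
    return not (65 <= prev_byte <= 90 or 97 <= prev_byte <= 122)

def split_alpha(prev_byte):
    """Pick the mixing weight for the current position."""
    return ONSET_ALPHA if is_onset(prev_byte) else MID_ALPHA

def relative_reduction(baseline_bpc, model_bpc):
    """Percentage reduction in bits per character relative to the baseline."""
    return 100.0 * (baseline_bpc - model_bpc) / baseline_bpc

whole = relative_reduction(6.56, 3.56)  # whole-sample, 5K words at 10M
split = relative_reduction(6.56, 3.72)  # split-sample, 5K words at 10M
```

Rounded to one decimal, `whole` and `split` come out at 45.7 and 43.3, matching the figures above.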
Byte KN dominates RNN: KN-6 at 1M = 1.24 bpc whole, 2.93 bpc split (vs RNN 6.43). KN captures everything the RNN knows plus much more. KN+RNN ensemble at α=0.2: 2.75 bpc split (RNN adds 0.18 bpc of independent information). KN+RNN+trie: 2.63 bpc split—beats RNN+lex (3.70) by 1.07 bpc. The lexical model was compensating for the RNN’s weakness, not adding genuine structure.
