← Back to Hutter
Archive 2026-02-14
Lexemes as binary event spaces. The neutral introduction of “the” into the isomorphic UM.
Key insight: no new FSAs are needed. The model's existing atomic byte patterns (skip-bigrams) already carry word-level information as a "bag of letters." Each lexeme is a binary event space (word occurred vs. not), with the negative event distributing over the prefix class. The char→lexeme map is itself a complete UM, composable with the byte-level model via UM algebra.
Neutrality theorem: introducing “the” as a binary ES and marginalizing it out recovers exactly the original predictions (0.079 bpc). The operation is a pure rearrangement of count tables via E→N→Q. Value arrives only once intra-lexical patterns are added. The negative-event distribution serves as a general model of how negative weights distribute.
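The E→N→Q bookkeeping can be checked mechanically. A minimal sketch in C, with a toy count table and an arbitrary the+/the− split (both invented here for illustration; the real tools factor the skip-bigram tables the same way):

```c
/* Toy neutrality check: factor a count table through a binary event
   E_the and marginalize it back out. All names and sizes are invented;
   only the E -> N -> Q mechanics mirror the real factorization. */

enum { CTX = 4, SYM = 8 };

/* toy original counts: any deterministic positive table */
static int c_orig(int ctx, int sym) { return 3 * ctx + sym + 1; }

/* split each count through E_the: an arbitrary the+/the- partition
   (here: floor half vs. remainder) */
static int c_pos(int ctx, int sym) { return c_orig(ctx, sym) / 2; }
static int c_neg(int ctx, int sym) { return c_orig(ctx, sym) - c_pos(ctx, sym); }

/* E -> N -> Q: marginalize the binary dimension back out */
static int marginal(int ctx, int sym) { return c_pos(ctx, sym) + c_neg(ctx, sym); }

/* exact neutrality: integer arithmetic, no rounding anywhere */
static int is_neutral(void) {
    for (int i = 0; i < CTX; i++)
        for (int j = 0; j < SYM; j++)
            if (marginal(i, j) != c_orig(i, j)) return 0;
    return 1;
}
```

Because the split and the marginalization are exact integer operations, equality is bit-exact; the 10M-byte verification only shows floating-point noise because the bpc comparison goes through doubles.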
Papers
- lexeme-es.pdf — Lexemes as Binary Event Spaces: From Atomic Patterns to Bag-of-Letters Prediction. Words emerge from conjunction of letter evidence at offsets—no word detector needed. Matches psycholinguistic findings (transposed-letter effects). Binary ES with negation distributing over prefix class ("the" vs "there/them/then/..."). Lexeme support is graded at every position (positive even on just "t") and is just another atomic input to the rest of the model. The char→lexeme map is a complete UM; UM join composes it with the byte model. (9 pages)
- the-experiment.pdf — Lexeme Injection Experiments: From Neutral Factorization to KN Dominance. Compiles results from 31 iterations (the_inject.c–the_inject30.c). Neutrality exact on 10M (4×10−14). Causal+onset beats oracle (4.57 vs 4.62 bpc). Split-alpha 5K/10M: 3.56 whole, 3.72 split. KN-6 dominates RNN 5×. KN+RNN+trie: 2.63 split. RNN contributes 2.4% at 10M. Negative results: centroids, entropy-adaptive alpha, word bigrams. (7 pages)
- the-neutral.pdf — Introducing “the”: A Neutral Factorization of the Byte-Level Model. Adds Ethe = {the+, the−} to the sat-rnn’s isomorphic UM (DSS=1024). Proves via E→N→Q Bayesian marginalization that the introduction is neutral (Δbpc = 0). Factors skip-bigram count tables through the new dimension with exact integer arithmetic. Characterizes the negative event distribution over the prefix class. Establishes bookkeeping for future lexeme-level patterns. (8 pages)
Experimental results: Neutrality verified EXACTLY on 10M bytes (4×10−14 difference). RNN gives P(space|the+) = 12% uniformly. Top-50 words: 1.02 bpc oracle value (16% of loss). Top-500: 2.32 bpc (36%). Internal bytes dominate boundary bytes 5:1.
Causal+onset model beats oracle: Prefix trie with log-linear mix + onset distribution. At 10M/500 words: RNN 6.56 → Causal+onset 4.57 bpc (−30.4%), vs Oracle 4.62 bpc. 102.5% of oracle gain recovered.
Split-alpha optimization: Onset positions want α=1.0 (pure lexical—RNN is useless at word onset, 7.8 bpc). Mid-word positions want α=0.9 (trie dominates but RNN helps). 5K words at 10M: 3.56 bpc whole-sample (45.7% reduction), 3.72 bpc split-sample (43.3%). Gap analysis: onset = 80% of remaining causal-oracle gap. 2-byte conditional onset adds 0.026 bpc over 1-byte.
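Given these findings, the split-alpha combination is just a two-weight linear interpolation keyed on position class. A sketch (function and argument names are illustrative, not the tools' API; the two constants are the optima reported above):

```c
/* Split-alpha linear mix (the idea behind the_inject19/20): one mixing
   weight at word onsets, another mid-word. p_trie and p_rnn stand in
   for the lexical-trie and RNN next-byte probabilities. */

#define ALPHA_ONSET 1.0   /* pure lexical: the RNN is useless at onsets */
#define ALPHA_MID   0.9   /* trie dominates, but the RNN still helps */

static double mix(double p_trie, double p_rnn, int at_onset) {
    double a = at_onset ? ALPHA_ONSET : ALPHA_MID;
    return a * p_trie + (1.0 - a) * p_rnn;
}
```

Setting ALPHA_ONSET to 1.0 drops the RNN term entirely at onsets, matching the finding that the RNN carries no usable information there (7.8 bpc).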
Byte KN dominates RNN: KN-6 at 1M = 1.24 bpc whole, 2.93 bpc split (vs RNN 6.43). KN captures everything the RNN knows plus much more. KN+RNN ensemble at α=0.2: 2.75 bpc split (RNN adds 0.18 bpc of independent information). KN+RNN+trie: 2.63 bpc split—beats RNN+lex (3.70) by 1.07 bpc. The lexical model was compensating for the RNN’s weakness, not adding genuine structure.
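At its core, the KN family above is absolute-discounting interpolation. A toy single-order sketch with a uniform backoff and invented counts (the actual the_inject24+ tools use full Kneser-Ney with continuation counts and a swept discount D):

```c
/* Toy absolute-discounting interpolation for one context. The counts
   table is illustrative; real KN backs off recursively through lower
   orders using continuation counts rather than a uniform distribution. */

enum { SYM = 4 };

static const int counts[SYM] = {2, 1, 0, 1};  /* toy context counts */

static double kn_prob(int sym, double D) {
    int total = 0, distinct = 0;
    for (int i = 0; i < SYM; i++) { total += counts[i]; if (counts[i]) distinct++; }
    double disc   = counts[sym] ? (counts[sym] - D) / total : 0.0; /* max(c-D,0)/N */
    double lambda = D * distinct / total;      /* mass released by discounting */
    return disc + lambda * (1.0 / SYM);        /* uniform backoff for the toy */
}
```

Discounting D from every seen count and routing the released mass through the backoff keeps the distribution normalized while reserving probability for unseen continuations.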
Tools
- the_inject.c — Neutral skip-bigram factorization through E_the. Verifies exact neutrality (0 errors / 786K cells), reports asymmetry, conditional luck, negative event distribution.
- the_inject2.c — RNN-level neutrality verification. Loads sat_1024 model, partitions bpc by the+/the−. Result: EXACT neutrality on 10M bytes (4×10−14).
- the_inject3.c — Hidden state separation analysis. cos(hpos, hneg) = 1.0000 (no separation). MI(the; next_byte) = 10−5 bits. Two-channel model comparison.
- the_inject4.c — Lexeme value measurement. Bayesian and log-linear mixing at the+ positions. Log-linear β=1 optimal. −0.020 bpc overall from just “the” boundary.
- the_inject5.c — Intra-lexical analysis. Value at each byte within “the”: offset 1 (h after t) = 8.14 bpc gain, offset 0 (t) = 5.90, offset 2 (e) = 3.02, boundary = 3.11. Total: 0.130 bpc.
- the_inject6.c — Top-N word value. 50 words = 1.02 bpc (16%), 200 = 1.74 bpc (27%), 500 = 2.32 bpc (36%). Internal/boundary ratio 5:1.
- the_inject7.c — Actual model improvement. Causal prefix trie with log-linear mixing. 500 words: 6.43→4.63 bpc (−28%). Oracle: 4.19 bpc (−35%).
- the_inject8.c — Position effects and alpha tuning. Found a bug in log-linear mixing with unknown continuations; diagnostic only.
- the_inject9.c — Linear mix with oracle positions. Coverage 38.3% for 100 words. Alpha=1.0 is best (=oracle).
- the_inject10.c — Causal trie with linear mix + alpha sweep. Best: α=0.9, 80% oracle recovery.
- the_inject11.c — Linear vs log-linear comparison. Log-linear slightly better at low α, similar at optimal α.
- the_inject12.c — Per-offset gap analysis. Key finding: offset 0 (word onset) = entire causal-oracle gap. Ambiguity breakdown.
- the_inject13.c — Causal+onset model. Adds onset distribution at word boundaries. Beats oracle: 10M/500w = 4.57 bpc (102.5% recovery).
- the_inject14.c — Fair oracle+onset vs causal+onset + split-sample validation. 500w split-sample: causal+onset 4.54 vs oracle+onset 4.58 (causal wins even with train/test split).
- the_inject15.c — Vocabulary scaling: 100–10K words. 10K = 92% coverage, 52% bpc reduction. Causal-oracle gap widens at large vocab.
- the_inject16.c — Conditional onset P(first_byte|prev_byte). Modest improvement: 3.246 vs 3.299 bpc at 5K words.
- the_inject17.c — Per-offset gap analysis with onset. Onset = 80.4% of remaining gap. Offset 1 = 41.8%. Non-word positions: causal beats oracle.
- the_inject18.c — Word bigram onset: P(first_byte|prev_word). No improvement over prev_byte conditioning. Sparse counts dominate.
- the_inject19.c — Split-alpha optimization. Sweep α_onset × α_mid independently. Best: α_onset=1.0, α_mid=0.9 → 3.16 bpc (vs uniform α=0.7 → 3.25). RNN is useless at onset.
- the_inject20.c — Optimized combined model. Split-alpha + conditional onset + split-sample validation. 10M: 3.56 bpc whole (45.7%), 3.72 bpc split-sample (43.3%). 1M: 3.16 bpc whole (50.8%).
- the_inject21.c — Higher-order onset: P(first|prev_2) adds 0.026 bpc over P(first|prev_1). 3-byte overfits at 1M.
- the_inject22.c — RNN hidden-state centroid onset prediction. Negative result: +0.085 bpc worse than conditional onset. cos(h) ≈ 1.0 at boundaries—centroids don’t discriminate words.
- the_inject23.c — Entropy-adaptive alpha (per-bin sweep). Negative result: ZERO improvement. 99.96% of onset positions in single entropy bin [4,5). RNN entropy is constant at word boundaries.
- the_inject24.c — KN-6 byte n-gram integration. Per-category comparison: KN onset 3.04 vs cond_onset 4.77 (−1.73!). KN midword 0.91 vs RNN 5.84. KN everywhere 1.24 bpc (whole). Trie HURTS midword (1.42 vs 0.91).
- the_inject25.c — KN+RNN alpha sweep. RNN adds ZERO value to KN at whole-sample (pure KN=1.24 is optimal). All alpha>0 makes it worse.
- the_inject26.c — Split-sample KN comparison. KN-6 split: 2.93 bpc (vs RNN 6.51, RNN+lex 3.70). KN overfits 1.24→2.93 but still dominates.
- the_inject27.c — Comprehensive split-sample ensemble. KN+RNN α=0.2: 2.75. KN+RNN+trie: 2.63 bpc. Beats RNN+lex (3.70) by 1.07 bpc.
- the_inject28.c — KN order sweep. 1M split: order 4 optimal (2.81 bpc). 10M split: order 5 optimal (2.39 bpc). Hash table saturates at order 7+ (16M entries). RNN: 6.58 bpc—KN-5 is 2.76× better.
- the_inject29.c — KN+RNN at 10M split + D sweep. D=0.9 best (2.383 vs 2.390 at D=0.75). KN+RNN α=0.2: 2.332 bpc. RNN adds 0.057 bpc (2.4%)—shrinking contribution as KN gets more data.
- the_inject30.c — Word bigram KN at onset. Negative result: word bigram 5.63 vs byte KN 4.73 at 10M split. Word-pair sparsity dominates. Byte context is richer than word context at feasible data sizes.
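Most of the tools above share one structure: the causal prefix trie. A hypothetical minimal version (the lowercase-only alphabet and the demo_trie vocabulary, deliberately the "the"/"them"/"then" prefix class, are invented for illustration):

```c
#include <stdlib.h>

/* Toy causal prefix trie: count word occurrences along character paths,
   then read P(next letter | prefix) off the child counts. Not the tools'
   actual data structure, just the shared idea. */

typedef struct Node {
    struct Node *child[26];
    int count;               /* occurrences of words passing through */
} Node;

static Node *node_new(void) { return calloc(1, sizeof(Node)); }

static void trie_add(Node *root, const char *word, int n) {
    for (Node *p = root; *word; word++) {
        int c = *word - 'a';
        if (!p->child[c]) p->child[c] = node_new();
        p = p->child[c];
        p->count += n;
    }
}

/* P(next letter | prefix) under the trie; 0 if the prefix is unseen */
static double trie_next(Node *root, const char *prefix, char next) {
    Node *p = root;
    for (; *prefix; prefix++) {
        p = p->child[*prefix - 'a'];
        if (!p) return 0.0;
    }
    Node *q = p->child[next - 'a'];
    int total = 0;
    for (int i = 0; i < 26; i++)
        if (p->child[i]) total += p->child[i]->count;
    return (total && q) ? (double)q->count / total : 0.0;
}

/* illustrative vocabulary: "the" vs its prefix class */
static Node *demo_trie(void) {
    Node *r = node_new();
    trie_add(r, "the", 3);
    trie_add(r, "them", 1);
    trie_add(r, "then", 1);
    return r;
}
```

trie_next gives graded lexical support at every position, in line with the lexeme-es paper's point: in this toy, 'e' after "th" already has probability 1.0 before any word has completed.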
Navigation
Next: 20260215 →
The extended event space: injecting lexical structure into H.
← Previous: 20260213
Tock protocol: systematic lexicon injection into the isomorphic UM.