2026-03-04: Exp L1 — Word Discovery
First lexicon experiment. UM discovers vocabulary from data via threshold creation on word-boundary conjunctions. No oracle, no external word list. Shift chain depth 8, threshold 16 observations (tau=4).
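A minimal sketch of threshold-based word discovery, for illustration only. It is not the UM or its P-program: the helper names `discover_words` and `token_coverage` are hypothetical, the word-boundary rule (maximal runs of ASCII letters, lowercased) is an assumption, and the shift chain is not modelled. Only the threshold of 16 observations comes from the entry above. Coverage is assumed here to count every occurrence of a reified word, including those seen before it crossed the threshold.

```python
import re
from collections import Counter

THRESHOLD = 16  # observations required before a word is reified (from the log entry)


def discover_words(text, threshold=THRESHOLD):
    """Count candidate words and list those that reach the threshold.

    Assumption: a candidate word is a maximal run of ASCII letters,
    lowercased. The real experiment detects boundaries differently.
    """
    counts = Counter()
    reified = []  # words in the order they crossed the threshold
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        counts[word] += 1
        if counts[word] == threshold:  # fires exactly once per word
            reified.append(word)
    return counts, reified


def token_coverage(counts, reified):
    """Fraction of all word tokens that belong to reified words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[w] for w in reified) / total
```

With no oracle or external word list, the vocabulary is whatever crosses the threshold, so the unique count and the reified count diverge as in the Results table.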
Results
| Scale | Unique | Reified | Token Coverage | Bigram bpc |
|---|---|---|---|---|
| 10K | 464 | 11 | 23.8% | 5.333 |
| 100K | 3,213 | 126 | 50.6% | 4.443 |
| 1M | 15,806 | 1,179 | 72.3% | 4.172 |
| 10M | 73,053 | 8,048 | 87.8% | 4.140 |
The 73K unique words at 10M match the prediction. The 8K reified words cover 88% of word tokens. The length distribution peaks at 5–6 chars. Top words: the, of, and, in, a, to, quot, is, s, as.
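A small worked calculation from the results table, showing what share of the unique vocabulary is reified at each scale alongside the reported token coverage. The figures are copied from the table. The `reified_share` helper and the `RESULTS` layout are illustrative assumptions.

```python
# (unique words, reified words, token coverage %) per scale, from the results table
RESULTS = {
    "10K": (464, 11, 23.8),
    "100K": (3213, 126, 50.6),
    "1M": (15806, 1179, 72.3),
    "10M": (73053, 8048, 87.8),
}


def reified_share(unique, reified):
    """Percentage of unique words that crossed the reification threshold."""
    return 100.0 * reified / unique


for scale, (unique, reified, coverage) in RESULTS.items():
    share = reified_share(unique, reified)
    print(f"{scale}: {share:.1f}% of unique words reified, "
          f"{coverage}% token coverage")
```

At 10M, roughly 11% of the unique words are reified, yet they account for 87.8% of word tokens.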
Paper
word-discovery.pdf
Experiment L1: Word Discovery via Threshold Creation. P-program design, results at 4 scales, SN format conventions, P-programming gap analysis.
Models (.sn)
word-discover-10K.sn
11 word events. 464 unique words at 10K.
word-discover-100K.sn
126 word events. 3,213 unique words at 100K.
word-discover-1M.sn
1,179 word events. 15,806 unique words at 1M.
word-discover-10M.sn
8,048 word events. 73,053 unique words at 10M.