2026-03-04: Exp L1 — Word Discovery

First lexicon experiment. UM discovers vocabulary from data via threshold creation on word-boundary conjunctions. No oracle, no external word list. Shift chain depth 8, threshold 16 observations (tau=4).
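
As a rough illustration of threshold creation, the sketch below counts candidate words and reifies one on its 16th observation. It is a plain-counting analogue and not the UM mechanism itself: the boundary rule (maximal runs of ASCII letters), the name discover_words, and the omission of the shift chain are all assumptions made for the example.

```python
import re
from collections import Counter

THRESHOLD = 16  # observations needed before a word is reified (value from this entry)


def discover_words(text, threshold=THRESHOLD):
    """Count candidate words and reify those that reach the threshold.

    Assumption: a word is a maximal run of ASCII letters, lowercased.
    The actual experiment detects boundaries through its own
    word-boundary conjunctions, which are not modelled here.
    """
    counts = Counter()
    reified = []  # words in the order they crossed the threshold
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        counts[word] += 1
        if counts[word] == threshold:  # fires exactly once per word
            reified.append(word)
    return counts, reified
```

Here len(counts) plays the role of the unique-word count and len(reified) the reified-word count; no external word list is consulted at any point.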

Results

Scale   Unique   Reified   Token Coverage   Bigram bpc
10K        464        11            23.8%        5.333
100K     3,213       126            50.6%        4.443
1M      15,806     1,179            72.3%        4.172
10M     73,053     8,048            87.8%        4.140

The 73,053 unique words found at 10M match the prediction. The 8,048 reified words cover 87.8% of word tokens. The length distribution peaks at 5–6 chars. Top words: the, of, and, in, a, to, quot, is, s, as.
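
Token coverage can be read as the share of all word tokens whose word is in the reified set. The sketch below computes it that way; the function name token_coverage is hypothetical, and it assumes every occurrence of a reified word counts as covered, including those seen before the word crossed the threshold. The entry does not say which convention the experiment uses.

```python
def token_coverage(counts, reified):
    """Fraction of word tokens that belong to a reified word.

    counts  : mapping from word to number of occurrences
    reified : iterable of words that were reified
    Assumption: all occurrences of a reified word count as covered.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty corpus: define coverage as zero
    covered = sum(counts.get(word, 0) for word in set(reified))
    return covered / total
```

Because word frequencies are heavily skewed, a small reified set can cover most tokens, which is consistent with roughly 8K of 73K unique words reaching 88% coverage at 10M.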

Paper

word-discovery.pdf
Experiment L1: Word Discovery via Threshold Creation. P-program design, results at 4 scales, SN format conventions, P-programming gap analysis.

Models (.sn)

word-discover-10K.sn
11 word events. 464 unique words at 10K.
word-discover-100K.sn
126 word events. 3,213 unique words at 100K.
word-discover-1M.sn
1,179 word events. 15,806 unique words at 1M.
word-discover-10M.sn
8,048 word events. 73,053 unique words at 10M.

Navigation

← Previous: 20260303
The Lexicon Embedding: synthesis paper (14pp).
Next: 20260306 →
Exp L2 Design: Trigram Embedding. Suffix collision analysis.