2026-03-04: Exp L1 — Word Discovery

First lexicon experiment. UM discovers vocabulary from the data by threshold creation on word-boundary conjunctions, with no oracle and no external word list. Parameters: shift chain depth 8; creation threshold 16 observations (tau=4).
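A minimal sketch of the threshold idea, not the UM P-program itself. Only the threshold of 16 observations comes from this entry. Everything else is an assumption made for illustration: the hypothetical name discover_words, treating a word as a maximal run of lowercase ASCII letters, and defining token coverage as the share of word tokens whose type ends up reified. The shift chain and the conjunction mechanism are not modeled.

```python
import re
from collections import Counter

THRESHOLD = 16  # observations before a word is reified (value from this entry)


def discover_words(text, threshold=THRESHOLD):
    """Count boundary-delimited words and reify those seen `threshold` times.

    Returns (unique_word_count, reified_word_set, token_coverage).
    Hypothetical helper: a plain counter stands in for the UM mechanism.
    """
    counts = Counter()
    reified = set()
    # Assumption: word boundaries are any non-letter characters.
    for token in re.findall(r"[a-z]+", text.lower()):
        counts[token] += 1
        if counts[token] >= threshold:
            reified.add(token)

    total = sum(counts.values())
    covered = sum(c for word, c in counts.items() if word in reified)
    # Assumption: coverage counts every occurrence of a reified word,
    # including those observed before it crossed the threshold.
    coverage = covered / total if total else 0.0
    return len(counts), reified, coverage


if __name__ == "__main__":
    sample = "the cat " * 16 + "dog " * 3
    unique, reified, coverage = discover_words(sample)
    print(unique, sorted(reified), f"{coverage:.1%}")
```

In the sample, "the" and "cat" reach 16 observations and are reified, while "dog" stays below the threshold and counts only toward the unique total.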

Results

Scale	Unique words	Reified words	Token coverage	Bigram bpc
10K	464	11	23.8%	5.333
100K	3,213	126	50.6%	4.443
1M	15,806	1,179	72.3%	4.172
10M	73,053	8,048	87.8%	4.140

The count of 73K unique words at 10M matches the prediction. The 8K reified words cover 88% of word tokens. The length distribution peaks at 5–6 characters. Top words: the, of, and, in, a, to, quot, is, s, as.
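The relation between reified types and token coverage can be checked directly from the results table. The sketch below only re-derives ratios from the reported numbers; the names ROWS and reified_share are hypothetical.

```python
# Rows copied from the results table:
# (scale, unique words, reified words, token coverage in percent)
ROWS = [
    ("10K", 464, 11, 23.8),
    ("100K", 3213, 126, 50.6),
    ("1M", 15806, 1179, 72.3),
    ("10M", 73053, 8048, 87.8),
]


def reified_share(unique, reified):
    """Percentage of unique word types that crossed the creation threshold."""
    return 100.0 * reified / unique


if __name__ == "__main__":
    for scale, unique, reified, coverage in ROWS:
        share = reified_share(unique, reified)
        print(f"{scale:>5}: {share:5.1f}% of types reified, "
              f"{coverage:.1f}% of tokens covered")
```

At 10M, roughly 11% of the unique word types are reified, yet they account for 87.8% of word tokens.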

Paper

word-discovery.pdf

Experiment L1: Word Discovery via Threshold Creation. P-program design, results at 4 scales, SN format conventions, P-programming gap analysis.

Models (.sn)

word-discover-10K.sn

11 word events. 464 unique words at 10K.

word-discover-100K.sn

126 word events. 3,213 unique words at 100K.

word-discover-1M.sn

1,179 word events. 15,806 unique words at 1M.

word-discover-10M.sn

8,048 word events. 73,053 unique words at 10M.
