← Archive 2026-02-16
Kneser–Ney as UM Quotient
Concrete examples from the first 1024 bytes of enwik9. See kn-quotient.pdf for the formal treatment.
1. The Bigram Table (LPP)
This is the raw count table c(ea, eb) for byte bigrams. Each row is a context byte (Ea), each column is an output byte (Eb). The entry is the number of times that bigram occurs.
Each row also carries three summary columns: the row total, the type count (number of distinct continuations), and the row GCD. These three quantities drive the decomposition in the next section.
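Reconstructing the table and its summary columns takes only a few lines; a minimal Python sketch (the toy input and function names are ours, not the paper's):

```python
from collections import Counter
from functools import reduce
from math import gcd

def bigram_table(data: bytes) -> Counter:
    """Raw count table c(a, b) over adjacent byte pairs."""
    return Counter(zip(data, data[1:]))

def row_summary(table: Counter, a: int):
    """Summary columns for context byte a: row total, type count, row GCD."""
    counts = [c for (ctx, _), c in table.items() if ctx == a]
    if not counts:
        return 0, 0, 0
    return sum(counts), len(counts), reduce(gcd, counts)

table = bigram_table(b"abracadabra")
print(row_summary(table, ord("a")))  # → (4, 3, 1)
```

On the real input, `data` would be the first 1024 bytes of enwik9 rather than the toy string.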
2. Row Decomposition: Discount as Quotient
For a given context byte, KN smoothing decomposes its row by subtracting a flat discount D = 0.75 from every nonzero count; the removed mass, D times the row's type count, is redistributed to the lower order.
Discount as GCD approximation
In the UM, the GCD g(c) of a row removes the common evidence shared by all continuations. KN's flat discount D is a rough approximation of that quotient:

P_KN(b | a) = max(c(a, b) − D, 0) / c(a, ·) + (D · N1+(a, ·) / c(a, ·)) · P_cont(b)

where c(a, ·) is the row total and N1+(a, ·) is the row's type count.
When the row GCD = 1 (as in our 1024-byte sample), the common factor is trivial. With more data, rows develop non-trivial GCDs. The discount D = 0.75 subtracts a constant amount from every nonzero count regardless — this is the “discount–GCD gap” identified in the paper.
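The gap is easy to see side by side; a sketch on a toy row with a non-trivial GCD (the counts are invented for illustration):

```python
from functools import reduce
from math import gcd

def kn_discount(row: dict, D: float = 0.75) -> dict:
    """KN: subtract the same constant *amount* D from every nonzero count."""
    return {b: max(c - D, 0.0) for b, c in row.items()}

def gcd_quotient(row: dict) -> dict:
    """UM: divide out the same constant *factor* g(c), the row GCD."""
    g = reduce(gcd, row.values())
    return {b: c // g for b, c in row.items()}

row = {0x61: 6, 0x62: 4, 0x63: 2}   # row GCD = 2
print(kn_discount(row))    # additive: every count drops by 0.75
print(gcd_quotient(row))   # multiplicative: every count divides by 2
```

On a row with GCD 1 the quotient is the identity, while the discount still removes 0.75 per entry: exactly the gap described above.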
3. Column Types: Continuation Count
The continuation count N1+(·, b) asks: for output byte b, how many distinct context bytes lead to it? This is the “column type count” in UM terms — measuring how general each output byte is across contexts.
UM interpretation: Column types measure the generality of an output event. Output ‘e’ has 15 distinct contexts (very general). Output ‘0’ has 3 (specialized). KN uses this as the backoff distribution: general bytes get more probability mass when context is novel.
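Column type counts and the backoff distribution they induce can be read straight off the bigram table; a sketch, again on toy data rather than the enwik9 sample:

```python
from collections import Counter

def continuation_counts(table: Counter) -> Counter:
    """N1+(., b): number of distinct context bytes that precede output b."""
    return Counter(b for (_, b) in table)

def continuation_prob(table: Counter) -> dict:
    """KN backoff distribution: P_cont(b) = N1+(., b) / total bigram types."""
    n1 = continuation_counts(table)
    types = sum(n1.values())
    return {b: n / types for b, n in n1.items()}

data = b"abracadabra"
table = Counter(zip(data, data[1:]))
n1 = continuation_counts(table)
print(n1[ord("a")], n1[ord("b")])  # 'a' follows 3 distinct contexts, 'b' only 1
```

Note the normalizer is the number of distinct bigram *types*, not the token count: generality, not frequency, is what the backoff rewards.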
4. The Interpolation Tower
KN interpolation cascades through orders. Each level is a quotient (projection) dropping one context position.
Order 3: P(b | a1, a2)
Ea1 × Ea2 × Eb — full trigram context
↓ drop a1, redistribute D · ntypes
Order 2: P(b | a2)
Ea2 × Eb — bigram (our table above)
↓ drop a2, redistribute D · ntypes
Order 1: P(b)
Eb — unigram (column marginals)
↓ uniform floor
Order 0: P(b) = 1/256
uniform — no context
The UM view: Each arrow is a quotient map π: En → En-1 that projects out one context dimension. The redistributed mass (D · ntypes) flows to the lower level, weighted by continuation counts. This is exactly the hierarchical combination rule from the pattern-space paper applied to n-gram event spaces.
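The whole tower fits in one recursion. A sketch under one simplification: it uses raw counts at every level, where full KN would use continuation counts below the top order; the function names and toy data are ours:

```python
from collections import Counter

def ngram_tables(data: bytes, order: int):
    """tables[k] maps (context of length k, next byte) -> count, k = 0..order-1."""
    return [
        Counter((data[i:i + k], data[i + k]) for i in range(len(data) - k))
        for k in range(order)
    ]

def p_kn(tables, context: bytes, b: int, D: float = 0.75) -> float:
    """Interpolated KN as the quotient tower: each level subtracts D from its
    counts and redistributes D * ntypes to the next-shorter context,
    bottoming out at the uniform 1/256 floor."""
    k = len(context)
    assert k < len(tables), "context longer than the highest order"
    row = {out: c for (ctx, out), c in tables[k].items() if ctx == context}
    total = sum(row.values())
    # one level down the tower: drop the leftmost context byte (or hit the floor)
    lower = p_kn(tables, context[1:], b, D) if k > 0 else 1.0 / 256
    if total == 0:
        return lower              # novel context: back off entirely
    ntypes = len(row)             # distinct continuations of this context
    return max(row.get(b, 0) - D, 0) / total + (D * ntypes / total) * lower
```

For integer counts ≥ 1 and D < 1 the max(·) never clips, so each level redistributes exactly the D · ntypes it removed and the probabilities over all 256 bytes sum to 1.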
5. The Sliding Window & Context Events
KN’s fixed-width window drops information as it slides forward.
The context events to the left of the window carry information that the n-gram model discards. The kn-quotient paper argues this “excess entropy” is precisely what higher-order structures (word embeddings, phrase structure) should capture. As the higher tower improves, losing the left context matters less.
Window: ... [a_{n−5} a_{n−4} a_{n−3} a_{n−2} a_{n−1}] → b
↓ slide ↓
Window: ... a_{n−5} [a_{n−4} a_{n−3} a_{n−2} a_{n−1} b] → ?
Dropped: a_{n−5} — but its contribution persists in the quotient space
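The slide itself is a one-line generator; a sketch with width 5, as in the diagram (names are illustrative):

```python
def sliding_ngrams(data: bytes, width: int = 5):
    """Yield (dropped, context, next_byte): the byte that just left the window,
    the fixed-width context, and the byte being predicted."""
    for i in range(len(data) - width):
        dropped = data[i - 1] if i > 0 else None
        yield dropped, data[i:i + width], data[i + width]

for dropped, ctx, nxt in sliding_ngrams(b"abracad"):
    print(dropped, ctx, nxt)  # the dropped byte is gone from the model's view
```

Everything to the left of `ctx` is the "excess" the n-gram model discards; the kn-quotient paper's claim is that this is the part a higher level of the tower should absorb.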
Data source: first 1024 bytes of enwik9 (<mediawiki xmlns="http://...">).
Companion paper: kn-quotient.pdf.
Visualization: LPP 3D viewer.