← Archive 2026-02-16
Kneser–Ney as UM Quotient
Concrete examples from the first 1024 bytes of enwik9. See kn-quotient.pdf for the formal treatment.
1. The Bigram Table (LPP)
This is the raw count table c(ea, eb) for byte bigrams. Each row is a context byte (Ea), each column is an output byte (Eb). The entry is the number of times that bigram occurs.
Each row also carries three summary columns: the row total, the type count (number of distinct continuations), and the row GCD. These three quantities drive the decomposition in the next section.
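Reconstructing the table and its summary columns takes only a few lines; a minimal Python sketch (the toy input and function names are ours, not the paper's):

```python
from collections import Counter
from functools import reduce
from math import gcd

def bigram_table(data: bytes) -> Counter:
    """Raw count table c(a, b) over adjacent byte pairs."""
    return Counter(zip(data, data[1:]))

def row_summary(table: Counter, a: int):
    """Summary columns for context byte a: row total, type count, row GCD."""
    counts = [c for (ctx, _), c in table.items() if ctx == a]
    if not counts:
        return 0, 0, 0
    return sum(counts), len(counts), reduce(gcd, counts)

table = bigram_table(b"abracadabra")
print(row_summary(table, ord("a")))  # → (4, 3, 1)
```

On the real input, `data` would be the first 1024 bytes of enwik9 rather than the toy string.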
2. Row Decomposition: Discount as Quotient
For a given context byte, KN smoothing decomposes its row by subtracting a flat discount D = 0.75 from every nonzero count; the removed mass, D times the row's type count, is redistributed to the lower order.
Discount as GCD approximation
In the UM, the GCD g(c) of a row removes the common evidence shared by all continuations. KN's flat discount D is a rough approximation of that quotient:

P_KN(b | a) = max(c(a, b) − D, 0) / c(a, ·) + (D · N1+(a, ·) / c(a, ·)) · P_cont(b)

where c(a, ·) is the row total and N1+(a, ·) is the row's type count.
When the row GCD = 1 (as in our 1024-byte sample), the common factor is trivial. With more data, rows develop non-trivial GCDs. The discount D = 0.75 subtracts a constant amount from every nonzero count regardless — this is the “discount–GCD gap” identified in the paper.
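The gap is easy to see side by side; a sketch on a toy row with a non-trivial GCD (the counts are invented for illustration):

```python
from functools import reduce
from math import gcd

def kn_discount(row: dict, D: float = 0.75) -> dict:
    """KN: subtract the same constant *amount* D from every nonzero count."""
    return {b: max(c - D, 0.0) for b, c in row.items()}

def gcd_quotient(row: dict) -> dict:
    """UM: divide out the same constant *factor* g(c), the row GCD."""
    g = reduce(gcd, row.values())
    return {b: c // g for b, c in row.items()}

row = {0x61: 6, 0x62: 4, 0x63: 2}   # row GCD = 2
print(kn_discount(row))    # additive: every count drops by 0.75
print(gcd_quotient(row))   # multiplicative: every count divides by 2
```

On a row with GCD 1 the quotient is the identity, while the discount still removes 0.75 per entry: exactly the gap described above.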
3. Column Types: Continuation Count
The continuation count N1+(·, b) asks: for output byte b, how many distinct context bytes lead to it? This is the “column type count” in UM terms — measuring how general each output byte is across contexts.
UM interpretation: Column types measure the generality of an output event. Output ‘e’ has 15 distinct contexts (very general). Output ‘0’ has 3 (specialized). KN uses this as the backoff distribution: general bytes get more probability mass when context is novel.
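Column type counts and the backoff distribution they induce can be read straight off the bigram table; a sketch, again on toy data rather than the enwik9 sample:

```python
from collections import Counter

def continuation_counts(table: Counter) -> Counter:
    """N1+(., b): number of distinct context bytes that precede output b."""
    return Counter(b for (_, b) in table)

def continuation_prob(table: Counter) -> dict:
    """KN backoff distribution: P_cont(b) = N1+(., b) / total bigram types."""
    n1 = continuation_counts(table)
    types = sum(n1.values())
    return {b: n / types for b, n in n1.items()}

data = b"abracadabra"
table = Counter(zip(data, data[1:]))
n1 = continuation_counts(table)
print(n1[ord("a")], n1[ord("b")])  # 'a' follows 3 distinct contexts, 'b' only 1
```

Note the normalizer is the number of distinct bigram *types*, not the token count: generality, not frequency, is what the backoff rewards.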
4. The Interpolation Tower
KN interpolation cascades through orders. Each level is a quotient (projection) dropping one context position.
Order 3: P(b | a1, a2)
Ea1 × Ea2 × Eb — full trigram context
↓ drop a1, redistribute D · ntypes
Order 2: P(b | a2)
Ea2 × Eb — bigram (our table above)
↓ drop a2, redistribute D · ntypes
Order 1: P(b)
Eb — unigram (column marginals)
↓ uniform floor
Order 0: P(b) = 1/256
uniform — no context
The UM view: Each arrow is a quotient map π: En → En-1 that projects out one context dimension. The redistributed mass (D · ntypes) flows to the lower level, weighted by continuation counts. This is exactly the hierarchical combination rule from the pattern-space paper applied to n-gram event spaces.
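The whole tower fits in one recursion. A sketch under one simplification: it uses raw counts at every level, where full KN would use continuation counts below the top order; the function names and toy data are ours:

```python
from collections import Counter

def ngram_tables(data: bytes, order: int):
    """tables[k] maps (context of length k, next byte) -> count, k = 0..order-1."""
    return [
        Counter((data[i:i + k], data[i + k]) for i in range(len(data) - k))
        for k in range(order)
    ]

def p_kn(tables, context: bytes, b: int, D: float = 0.75) -> float:
    """Interpolated KN as the quotient tower: each level subtracts D from its
    counts and redistributes D * ntypes to the next-shorter context,
    bottoming out at the uniform 1/256 floor."""
    k = len(context)
    assert k < len(tables), "context longer than the highest order"
    row = {out: c for (ctx, out), c in tables[k].items() if ctx == context}
    total = sum(row.values())
    # one level down the tower: drop the leftmost context byte (or hit the floor)
    lower = p_kn(tables, context[1:], b, D) if k > 0 else 1.0 / 256
    if total == 0:
        return lower              # novel context: back off entirely
    ntypes = len(row)             # distinct continuations of this context
    return max(row.get(b, 0) - D, 0) / total + (D * ntypes / total) * lower
```

For integer counts ≥ 1 and D < 1 the max(·) never clips, so each level redistributes exactly the D · ntypes it removed and the probabilities over all 256 bytes sum to 1.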
5. The Sliding Window & Context Events
KN’s fixed-width window drops information as it slides forward.
The context events to the left of the window carry information that the n-gram model discards. The kn-quotient paper argues this “excess entropy” is precisely what higher-order structures (word embeddings, phrase structure) should capture. As the higher tower improves, losing the left context matters less.
Window: ... [a_{n−5} a_{n−4} a_{n−3} a_{n−2} a_{n−1}] → b
↓ slide ↓
Window: ... a_{n−5} [a_{n−4} a_{n−3} a_{n−2} a_{n−1} b] → ?
Dropped: a_{n−5} — but its contribution persists in the quotient space
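The slide itself is a one-line generator; a sketch with width 5, as in the diagram (names are illustrative):

```python
def sliding_ngrams(data: bytes, width: int = 5):
    """Yield (dropped, context, next_byte): the byte that just left the window,
    the fixed-width context, and the byte being predicted."""
    for i in range(len(data) - width):
        dropped = data[i - 1] if i > 0 else None
        yield dropped, data[i:i + width], data[i + width]

for dropped, ctx, nxt in sliding_ngrams(b"abracad"):
    print(dropped, ctx, nxt)  # the dropped byte is gone from the model's view
```

Everything to the left of `ctx` is the "excess" the n-gram model discards; the kn-quotient paper's claim is that this is the part a higher level of the tower should absorb.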
Data source: first 1024 bytes of enwik9 (<mediawiki xmlns="http://...">).
Companion paper: kn-quotient.pdf.
Visualization: LPP 3D viewer.