SVD Component Interpretation

What do the leading SVD components of the byte bigram matrix mean?

Singular Value Spectrum

Component	Singular value (S)	Variance explained
0	2892.8	97.5%
1	306.5	1.1%
2	151.9	0.3%
3	127.8	0.2%
4	108.0	0.1%
5	91.8	0.1%

The top 10 components capture 99.4% of the variance. The top 64 capture 99.9%.
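The variance column can be reproduced from the singular values, assuming "variance" here means each component's share of the total squared singular values (the usual convention, but not stated in the table). The sketch below is illustrative: `variance_explained` is a hypothetical helper name, and because only the six leading values are listed, it compares ratios rather than absolute percentages.

```python
import numpy as np

def variance_explained(singular_values):
    """Share of the total squared singular values carried by each component."""
    s = np.asarray(singular_values, dtype=float)
    energy = s ** 2
    return energy / energy.sum()

# The six leading singular values from the table above. The rest of the
# spectrum is not listed, so absolute percentages cannot be recomputed here.
top = np.array([2892.8, 306.5, 151.9, 127.8, 108.0, 91.8])

# Ratios relative to component 0 do not depend on the missing tail.
ratio = (top[1:] ** 2) / (top[0] ** 2)
print(ratio.round(4))
```

The first ratio comes out near 0.011, in line with the table's 1.1% against 97.5%.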

Component Interpretations

Component 0: Frequency Baseline (97.5%)

PREV: i o t e a ] vs x01 xFE xFD

NEXT: SP ] a e i r vs xFD xFE xFF

Common bytes (letters, space) vs rare bytes (x01 is a control character; xFD, xFE and xFF never occur in valid UTF-8). This component is just marginal byte frequency.
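One way to produce PREV/NEXT lists like these is to sort a singular vector and print the bytes at both ends. This is a minimal sketch, not the original analysis code: it assumes rows of the matrix index the previous byte and columns the next byte, uses a random stand-in matrix `M`, and the helpers `byte_label` and `extremes` are hypothetical names.

```python
import numpy as np

def byte_label(b):
    """Label a byte value the way the lists above do (SP, NL, xFE, ...)."""
    if b == 0x20:
        return "SP"
    if b == 0x0A:
        return "NL"
    if 0x21 <= b <= 0x7E:
        return chr(b)
    return f"x{b:02X}"

def extremes(vec, k=6):
    """Return labels of the k most positive and k most negative entries."""
    order = np.argsort(vec)
    positive = [byte_label(int(i)) for i in order[::-1][:k]]
    negative = [byte_label(int(i)) for i in order[:k]]
    return positive, negative

# Stand-in for the 256x256 bigram matrix (assumption: rows = previous byte,
# columns = next byte). Replace with the real matrix to get real lists.
rng = np.random.default_rng(0)
M = rng.random((256, 256))
U, S, Vt = np.linalg.svd(M, full_matrices=False)

prev_pos, prev_neg = extremes(U[:, 0])   # PREV: left singular vector
next_pos, next_neg = extremes(Vt[0])     # NEXT: right singular vector
print("PREV:", *prev_pos, "vs", *prev_neg)
print("NEXT:", *next_pos, "vs", *next_neg)
```

The sign of a singular vector pair is arbitrary (negating both the left and right vector leaves the product unchanged), so which side of "vs" a group lands on carries no meaning by itself; only the pairing between the PREV and NEXT sides does.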

Component 1: ASCII vs UTF-8 (1.1%)

PREV: SP e n r s a vs xC3 xD0 x83 xB8

NEXT: e a i o t s vs xD0 xE0 xE3 xD1

After ASCII text → expect ASCII. After bytes of a multi-byte UTF-8 sequence (lead or continuation) → expect further non-ASCII bytes.
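The byte classes involved can be checked directly from the UTF-8 encoding rules. The sketch below classifies the non-ASCII bytes listed for this component; `utf8_byte_class` is a hypothetical helper, not part of the original analysis.

```python
def utf8_byte_class(b):
    """Classify one byte value according to its role in UTF-8."""
    if b < 0x80:
        return "ascii"
    if b < 0xC0:
        return "continuation"   # 0x80..0xBF
    if 0xC2 <= b <= 0xF4:
        return "lead"           # starts a 2-, 3- or 4-byte sequence
    return "invalid"            # 0xC0, 0xC1, 0xF5..0xFF never appear in valid UTF-8

prev_side = [0xC3, 0xD0, 0x83, 0xB8]
next_side = [0xD0, 0xE0, 0xE3, 0xD1]

prev_classes = [utf8_byte_class(b) for b in prev_side]
next_classes = [utf8_byte_class(b) for b in next_side]
print(prev_classes)  # ['lead', 'lead', 'continuation', 'continuation']
print(next_classes)  # ['lead', 'lead', 'lead', 'lead']
```

By this classification the PREV side mixes lead and continuation bytes, while the listed NEXT bytes are all lead bytes.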

Component 2: Letters vs Digits/XML (0.3%)

PREV: u i o r a l vs > = : 9 NL 1

NEXT: o l i n u y vs 6 1 T 5 8 9

After letters (mostly vowels) → expect letters. After XML punctuation or digits → expect digits.

Component 3: Bracket Structure (0.2%)

PREV: 1 5 9 4 8 6 vs [ { ( SP |

NEXT: ] , ; SP < ) vs I S L M R P

After digits → expect closing brackets and separators. After opening brackets → expect uppercase letters. This matches Wikipedia citation markers such as [1], [2].

Component 4: UTF-8 Internal (0.1%)

PREV: xC3 xD0 xCE xD7 vs SP : ( x84

NEXT: e . u o : y vs xD0 xE0 xE3 xCE

After UTF-8 lead bytes → expect ASCII letters and punctuation. After space/punctuation → expect UTF-8 lead bytes.

Component 5: Word Structure (0.1%)

PREV: [ a i o / e vs L C W V H N

NEXT: p c m g f d vs o e a ] i .

After vowels → expect consonants. After uppercase consonants → expect vowels. These are phonotactic patterns.

Reconstruction Quality

Rank  1: MSE = 3.31
Rank  2: MSE = 1.88
Rank  4: MSE = 1.28
Rank  8: MSE = 0.83
Rank 16: MSE = 0.60
Rank 32: MSE = 0.38
Rank 64: MSE = 0.16
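A table like this can be produced by truncating the SVD at each rank and averaging the squared reconstruction error over all matrix entries. The following is a sketch under assumptions: `M` is a random stand-in for the bigram matrix, and `rank_k_mse` is a hypothetical helper name.

```python
import numpy as np

def rank_k_mse(M, ranks):
    """Mean squared error of the best rank-k approximation, for each k."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    result = {}
    for k in ranks:
        approx = (U[:, :k] * S[:k]) @ Vt[:k]   # keep the k leading components
        result[k] = float(np.mean((M - approx) ** 2))
    return result

# Stand-in matrix; substitute the real 256x256 bigram matrix here.
rng = np.random.default_rng(0)
M = rng.random((256, 256))
ranks = [1, 2, 4, 8, 16, 32, 64]
mse = rank_k_mse(M, ranks)
for k in ranks:
    print(f"Rank {k:2d}: MSE = {mse[k]:.4f}")

# The residual energy equals the sum of the discarded squared singular values.
S = np.linalg.svd(M, compute_uv=False)
tail_mse = {k: float(np.sum(S[k:] ** 2) / M.size) for k in ranks}
```

The last two lines give a useful cross-check: the MSE at rank k is the sum of the discarded squared singular values divided by the number of entries. The figures above appear consistent with a 256×256 matrix, since the drop from rank 1 to rank 2 (3.31 - 1.88 = 1.43) matches 306.5² / 65536 ≈ 1.43.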

Key Insight

Component 0 (frequency baseline) captures 97.5% of the variance. The interpretable structure (encoding type, text vs markup, bracket matching, phonotactics) lives in the remaining 2.5%.

When we inject the rank-64 approximation, we keep 99.9% of the variance, including all of this structure, and what we discard is mostly noise.
