The Carrier Signal Problem
Why product patterns need orthogonal offsets. PI-SGD gap analysis, SVD events, byte KN baselines.
27.2 bpc with 8 Bayesian pairs (worse than uniform 8.0)
2.29 bpc byte KN-5 at 10M (beats sat-rnn 2.81)
1.64 bpc SVD-16 events at 1024 bytes
90% neuron MI captured by K=4 events
1. The Shared-Offset Problem
Every neuron in the 128-hidden tanh RNN is a 2-offset conjunction detector: h_j ≈ E[h_j | data[t−d1], data[t−d2]]. But all top-8 pairs share offset d=1, making naïve Bayesian combination catastrophic.
Top Offset Pairs by Mutual Information (1024 bytes)
All 8 pairs share d=1. This means the conditionals P(o|x1,xd) are NOT independent—they all contain P(o|x1).
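The pair MI values in the tables below can be estimated directly by counting. A minimal sketch (the text here is a synthetic stand-in for the actual 1024-byte sample):

```python
from collections import Counter
from math import log2

def pair_mi(data, d1, d2):
    """Empirical MI between the next byte o = data[t] and the
    context pair (data[t-d1], data[t-d2]), estimated by counting."""
    lo = max(d1, d2)
    joint, ctx, out = Counter(), Counter(), Counter()
    n = 0
    for t in range(lo, len(data)):
        c = (data[t - d1], data[t - d2])
        joint[(c, data[t])] += 1
        ctx[c] += 1
        out[data[t]] += 1
        n += 1
    # I(C;O) = sum p(c,o) log2[ p(c,o) / (p(c) p(o)) ]
    return sum(k / n * log2(k * n / (ctx[c] * out[o]))
               for (c, o), k in joint.items())

data = b"the quick brown fox jumps over the lazy dog " * 40
print(round(pair_mi(data, 1, 7), 3))
```

On the real sample, ranking all (d1, d2) pairs by this quantity produces the shared-offset table below.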
Bayesian Combination: Catastrophic Failure
With shared offsets, adding more pairs makes prediction WORSE. 8 pairs = 27.2 bpc (3.4× worse than uniform 8.0 bpc). The shared d=1 factor is raised to the K-th power.
Why it fails: all top-8 pairs share offset d=1. The Bayesian combination formula P(o | all) ∝ P(o)^(1−K) ∏_k P(o | x_d1(k), x_d2(k)) assumes the pairs are conditionally independent given o. With shared d=1, the factor P(o | x_1) effectively appears K times instead of once, distorting the combined probabilities exponentially in K.
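A small numerical sketch of the failure mode: combining K copies of the same conditional (the repeated P(o|x1) factor) drives the combined distribution toward a degenerate one, which is exactly the exponential distortion described above. The 3-symbol prior and conditional are illustrative values, not measurements from the experiment:

```python
import numpy as np

def naive_bayes_combine(prior, conditionals):
    """P(o | all) ∝ P(o)^(1-K) * prod_k P(o | pair_k); valid only
    when the pairs are conditionally independent given o."""
    K = len(conditionals)
    logp = (1 - K) * np.log(prior)
    for c in conditionals:
        logp += np.log(c)
    p = np.exp(logp - logp.max())  # stabilized softmax-style renorm
    return p / p.sum()

# Stand-ins for P(o) and the repeated P(o | x1) factor
prior = np.array([0.5, 0.3, 0.2])
shared = np.array([0.6, 0.3, 0.1])

for K in (1, 2, 4, 8):
    print(K, np.round(naive_bayes_combine(prior, [shared] * K), 4))
```

As K grows, the mass collapses onto the symbol with the largest P(o|x1)/P(o) ratio, regardless of what the data actually supports.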
2. Orthogonal Offset Selection
Shared Offset Pairs (standard MI ranking)
| Pair | MI (bits) | Shared offset |
|---|---|---|
| (1,7) | 4.483 | d=1 |
| (1,8) | 4.465 | d=1 |
| (1,4) | 4.463 | d=1 |
| (1,10) | 4.456 | d=1 |
| (1,3) | 4.452 | d=1 |
| (1,6) | 4.452 | d=1 |
| (1,5) | 4.445 | d=1 |
| (1,9) | 4.409 | d=1 |
Bayes-2: 8.41 | Bayes-4: 14.84 | Bayes-8: 27.35 bpc
Disjoint Offset Pairs (greedy selection)
| Pair | MI (bits) | Overlap |
|---|---|---|
| (1,7) | 4.483 | none |
| (2,11) | 4.403 | none |
| (3,12) | 4.320 | none |
| (4,10) | 4.232 | none |
| (5,9) | 4.090 | none |
| (6,8) | 3.807 | none |
Bayes-2: 8.69 | Bayes-4: 16.04 bpc — still catastrophic because the Bayesian assumption itself is wrong for product patterns
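The greedy disjoint selection behind the table above fits in a few lines: walk the MI-ranked pairs and skip any pair that reuses an already-claimed offset. The ranked prefix here is taken from the tables in this section:

```python
def greedy_disjoint_pairs(ranked_pairs):
    """Pick MI-ranked offset pairs greedily, skipping any pair that
    reuses an already-selected offset."""
    used, picked = set(), []
    for (d1, d2), mi in ranked_pairs:
        if d1 not in used and d2 not in used:
            picked.append(((d1, d2), mi))
            used.update((d1, d2))
    return picked

# MI-ranked prefix from the tables above
ranked = [((1, 7), 4.483), ((1, 8), 4.465), ((1, 4), 4.463),
          ((2, 11), 4.403), ((3, 12), 4.320)]
print(greedy_disjoint_pairs(ranked))  # keeps (1,7), (2,11), (3,12)
```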
Fixed Consecutive Pairs
| Pair | MI (bits) | Overlap |
|---|---|---|
| (1,2) | 4.399 | none |
| (3,4) | 3.980 | none |
| (5,6) | 3.739 | none |
| (7,8) | 3.560 | none |
Bayes-2: 6.90 | Bayes-4: 13.34 bpc — better than shared but Bayesian combination fundamentally fails
Key insight: The fix isn't just disjoint offsets. Naïve Bayes is fundamentally wrong for combining product patterns. The correct approach is KN n-gram counting over events, which handles context dependencies properly.
3. KN N-gram: The Right Approach
Byte KN vs Data Size
Byte KN-5 reaches 2.29 bpc at 10M, already beating the sat-rnn (2.81 bpc at 110M).
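For reference, here is a simplified interpolated byte n-gram with absolute discounting, a stand-in for the byte KN-5 model (full modified Kneser-Ney additionally uses continuation counts and count-dependent discounts for the lower orders):

```python
from collections import Counter, defaultdict
from math import log2

class AbsDiscountLM:
    """Interpolated byte n-gram with absolute discounting: a simplified
    stand-in for byte KN-5 (no Kneser-Ney continuation counts)."""
    def __init__(self, order=5, discount=0.75):
        self.order, self.d = order, discount
        self.counts = [defaultdict(Counter) for _ in range(order)]

    def train(self, data):
        for t in range(len(data)):
            for n in range(self.order):
                if t >= n:
                    self.counts[n][data[t - n:t]][data[t]] += 1

    def prob(self, history, o):
        p = 1.0 / 256  # uniform base distribution over bytes
        for n in range(self.order):
            if n > len(history):
                break
            cnt = self.counts[n].get(history[len(history) - n:])
            if not cnt:
                continue
            total = sum(cnt.values())
            # discounted ML estimate; freed mass backs off to p
            p = max(cnt[o] - self.d, 0) / total + self.d * len(cnt) / total * p
        return p

    def bpc(self, data):
        bits = sum(-log2(self.prob(data[max(0, t - self.order + 1):t], data[t]))
                   for t in range(len(data)))
        return bits / len(data)

text = b"abracadabra " * 200
lm = AbsDiscountLM(order=5)
lm.train(text[:2000])
print(round(lm.bpc(text[2000:]), 2))
```

The key property, shared with KN, is that the model sees exact counts: no gradient noise, no forgetting.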
SVD Events vs Byte KN (N=1024)
SVD-16 events reach 1.64 bpc. Raw bytes do better (1.40) at this scale because the 1024-byte sample contains only 22 unique byte values.
| Method | Events/offset | Order | Test bpc | Notes |
|---|---|---|---|---|
| SVD-4 KN | 4 | 12 | 2.41 | Too few events |
| SVD-8 KN | 8 | 12 | 1.98 | Below 2 bpc |
| SVD-16 KN | 16 | 12 | 1.64 | Best event-based |
| Byte KN-7 | 256 | 7 | 1.40 | Best overall at 1024 |
| Skip-KN-6 (12 offsets) | 256 | 6 | 1.40 | Matches byte KN |
| sat-rnn | — | — | 8.22 | Memorized, doesn't generalize |
4. Event Space Discovery in the Trained RNN
Analyzing the 128 neurons of the sat-rnn on 1024 bytes reveals that K=2 (sign, the doubled-E) wastes half the information. K=4 is the sweet spot.
Top Neurons by Mutual Information
Top 12 neurons account for most MI. Mean over all 128 neurons: 0.072 bits.
MI Retention by Event Count K
K=2 (sign/doubled-E) captures only 41–52% of information for top neurons. K=4 captures 90%+. K=8 captures 100%.
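The K-event retention measurement can be reproduced in miniature: quantize an activation into K equal-mass events and compare its MI with the next byte against a near-lossless fine-grained reference. The toy neuron below is synthetic (a noisy monotone function of the byte), standing in for a real sat-rnn hidden unit:

```python
import numpy as np
from collections import Counter
from math import log2

def discrete_mi(xs, ys):
    """Empirical MI between two discrete sequences, in bits."""
    n = len(xs)
    jx, jy, jxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(k / n * log2(k * n / (jx[x] * jy[y]))
               for (x, y), k in jxy.items())

def quantile_events(h, K):
    """Partition real activations into K equal-mass events (quantile bins)."""
    edges = np.quantile(h, np.linspace(0, 1, K + 1)[1:-1])
    return np.digitize(h, edges)

# Synthetic neuron: activation is a noisy function of the current byte
rng = np.random.default_rng(0)
byte_seq = rng.integers(0, 8, 20000)
h = np.tanh(byte_seq / 4.0 - 1.0 + 0.1 * rng.standard_normal(20000))

full = discrete_mi(np.round(h, 2), byte_seq)  # near-lossless reference
for K in (2, 4, 8):
    frac = discrete_mi(quantile_events(h, K), byte_seq) / full
    print(K, round(frac, 2))
```

Because each K's quantile grid refines the previous one (the median is among the quartiles, the quartiles among the eighths), retention is guaranteed non-decreasing in K.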
Event Space Contents (K=4, top neurons)
h56 (MI = 0.774 bits)
Separates letters from delimiters
E0: pmcasn/i
E1: :lgWMo-w
E2: t23."0=>
E3: rdh\n<SP
h112 (MI = 0.757 bits)
Opposite polarity: delimiters first
E0: SP<\nxk1yd
E1: h>t.30r2
E2: Mg-"w=oe
E3: Wl/i:sanmcp
h68 (MI = 0.646 bits)
Responds to 'm' (positive) vs space (negative)
Best: m (0.340) Worst: SP (-0.255)
Range: 0.595
h106 (MI = 0.677 bits)
Separates content chars from whitespace
Best: SP (0.277) Worst: m (-0.359)
Range: 0.637
The 4-event partition is linguistic: delimiters (space, newline, <), common content letters (p,m,c,a,s,n), mixed alphanumeric (t,2,3,.,"), and rare/structural characters. This is the natural event space for byte-level text compression.
5. Scale-Up Path to <2 bpc
Compression Methods Compared
Current best: byte KN-5 at 10M = 2.29 bpc. Target: <2 bpc via orthogonal offsets + event factoring.
| Step | Method | Target bpc | Status |
|---|---|---|---|
| 1 | Byte KN-5 at 10M | 2.29 | Done (beats sat-rnn 2.81) |
| 2 | Orthogonal pairs + valid combination | 2.0 | Requires disjoint-offset counting |
| 3 | Skip-KN over event context | 1.8 | Larger context window |
| 4 | Full UM with learned ESs | <1.5 | Complete interpretable model |
Why the RNN loses: (1) Only 128 hidden units (limited capacity). (2) BPTT-50 limits context. (3) Catastrophic forgetting after 110M bytes. (4) Gradient noise vs exact counts. The n-gram sees exact statistics; the RNN's advantage is compact generalization. The UM combines both.
6. Weight Construction Methods (Wy)
| Method | Train bpc | Test bpc | Notes |
|---|---|---|---|
| Per-neuron log-ratio | 2.78 | 3.09 | scale = 1/16 |
| Per-byte PI | 4.91 | 4.73 | pseudo-inverse |
| Byte lookup (nonlinear) | 14.75 | 14.93 | 166/256 decode (hash collisions) |
| PI + SGD-500 | 0.065 | 3.39 | Overfits (matches sat-rnn 0.079) |
The PI-SGD gap (1.89 → 0.59 bpc) has two sources. (1) Hash collisions: 90 of 256 byte values are lost (35% of the information). (2) Shared-offset non-independence: the Bayesian combination amplifies the d=1 factor exponentially. SGD learns to compensate for both; the closed-form constructions cannot.
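The per-byte PI row is a closed-form least-squares readout. A shape-level sketch, using synthetic hidden states and targets rather than the trained sat-rnn's:

```python
import numpy as np

# Closed-form least-squares readout: fit Wy so that H @ Wy ≈ Y, where Y
# is one-hot next-byte targets. Shapes mirror the sat-rnn (128 hidden
# units, 256 bytes); the data is random, not the network's real states.
rng = np.random.default_rng(1)
T, H_DIM, VOCAB = 5000, 128, 256
H = np.tanh(rng.standard_normal((T, H_DIM)))
targets = rng.integers(0, VOCAB, T)
Y = np.eye(VOCAB)[targets]

Wy = np.linalg.pinv(H) @ Y  # pseudo-inverse fit of the readout

def bpc(logits, targets):
    """Cross-entropy in bits per symbol from raw readout scores."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean() / np.log(2)

print(round(bpc(H @ Wy, targets), 2))
```

With random targets this stays near the 8-bpc uniform baseline; the point is the mechanics, not the number. SGD on the same readout can keep reducing train bpc far past the least-squares solution, which is the overfitting seen in the PI + SGD-500 row.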