
The Carrier Signal Problem

Why product patterns need orthogonal offsets. PI-SGD gap analysis, SVD events, byte KN baselines.

27.2 bpc: Bayesian combination of 8 shared-offset pairs (worse than uniform 8.0)
2.29 bpc: byte KN-5 at 10M bytes (beats sat-rnn's 2.81)
1.64 bpc: SVD-16 events at 1024 bytes
90%: neuron MI captured by K=4 events

1. The Shared-Offset Problem

Every neuron in the 128-hidden tanh RNN acts as a 2-offset conjunction detector: h_j ≈ E[h_j | data[t−d1], data[t−d2]]. But all top-8 pairs share offset d=1, which makes naïve Bayesian combination catastrophic.

Top Offset Pairs by Mutual Information (1024 bytes)

All 8 pairs share d=1, so the conditionals P(o | x_1, x_d) are not independent: every one of them contains the common factor P(o | x_1).
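The MI ranking above comes from a plug-in estimate of I((x[t−d1], x[t−d2]); x[t]). A minimal sketch of that estimate, run on a toy byte string standing in for the article's 1024-byte sample (`pair_mi` and the toy data are illustrative, not the article's code):

```python
import math
from collections import Counter
from itertools import combinations

def pair_mi(data, d1, d2):
    """Plug-in estimate of I((x[t-d1], x[t-d2]); x[t]) in bits."""
    start = max(d1, d2)
    n = len(data) - start
    joint = Counter((data[t - d1], data[t - d2], data[t]) for t in range(start, len(data)))
    ctx = Counter((data[t - d1], data[t - d2]) for t in range(start, len(data)))
    out = Counter(data[t] for t in range(start, len(data)))
    return sum(c / n * math.log2((c / n) / ((ctx[a, b] / n) * (out[o] / n)))
               for (a, b, o), c in joint.items())

# Toy periodic text stands in for the article's 1024-byte sample.
data = b"the cat sat on the mat. " * 50
ranked = sorted(((d1, d2, pair_mi(data, d1, d2))
                 for d1, d2 in combinations(range(1, 11), 2)),
                key=lambda t: -t[2])
for d1, d2, mi in ranked[:4]:
    print(f"({d1},{d2}): {mi:.3f} bits")
```

On real text with an MI-greedy ranking, the small-offset pairs dominate, which is exactly how the shared d=1 problem arises.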

Bayesian Combination: Catastrophic Failure

With shared offsets, adding pairs makes prediction worse: 8 pairs give 27.2 bpc, 3.4× worse than the uniform baseline of 8.0 bpc.
Why it fails: all top-8 pairs share offset d=1. The combination formula P(o | all pairs) ∝ P(o)^(1−K) ∏_k P(o | x_{d1,k}, x_{d2,k}) assumes the pairs are conditionally independent given o. With a shared d=1, the factor P(o | x_1) appears K times instead of once, so it is effectively raised to the K-th power and the combined distribution is distorted exponentially.
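The distortion is easy to see numerically. A minimal sketch of the combination formula on a hypothetical 4-symbol alphabet, where the same conditional is repeated K times (the shared-offset extreme):

```python
import numpy as np

def naive_bayes_combine(prior, conditionals):
    """Combine per-pair conditionals via P(o|all) ∝ P(o)^(1-K) · ∏_k P(o|pair_k).

    Valid only when the pairs are conditionally independent given o;
    with a shared offset the common factor is overcounted K times.
    """
    K = len(conditionals)
    log_p = (1 - K) * np.log(prior)
    for cond in conditionals:
        log_p += np.log(cond)
    log_p -= log_p.max()          # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum()

# Hypothetical 4-symbol alphabet and a uniform prior (made-up numbers):
# repeating one conditional K times sharpens the posterior far past the evidence.
prior = np.array([0.25, 0.25, 0.25, 0.25])
cond = np.array([0.4, 0.3, 0.2, 0.1])
for K in (1, 4, 8):
    print(K, np.round(naive_bayes_combine(prior, [cond] * K), 4))
```

With K=8 copies of the same mild conditional, nearly all the mass collapses onto one symbol; this is the same mechanism that drives the 27.2 bpc blow-up.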

2. Orthogonal Offset Selection

Shared (MI-greedy)
Disjoint (greedy)
Fixed pairs

Shared Offset Pairs (standard MI ranking)

Pair    MI (bits)  Shared offset
(1,7)   4.483      d=1
(1,8)   4.465      d=1
(1,4)   4.463      d=1
(1,10)  4.456      d=1
(1,3)   4.452      d=1
(1,6)   4.452      d=1
(1,5)   4.445      d=1
(1,9)   4.409      d=1

Bayes-2: 8.41 | Bayes-4: 14.84 | Bayes-8: 27.35 bpc

Disjoint Offset Pairs (greedy selection)

Pair    MI (bits)  Overlap
(1,7)   4.483      none
(2,11)  4.403      none
(3,12)  4.320      none
(4,10)  4.232      none
(5,9)   4.090      none
(6,8)   3.807      none

Bayes-2: 8.69 | Bayes-4: 16.04 bpc. Still catastrophic, because the Bayesian independence assumption itself is wrong for product patterns.
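The greedy disjoint selection can be sketched directly from the MI tables; `greedy_disjoint_pairs` is an illustrative name, and the scores below are the values reported above:

```python
def greedy_disjoint_pairs(scored_pairs, n_pairs):
    """Greedily pick the highest-MI pairs whose offsets don't overlap
    with any previously chosen pair."""
    chosen, used = [], set()
    for (d1, d2), mi in sorted(scored_pairs.items(), key=lambda kv: -kv[1]):
        if d1 not in used and d2 not in used:
            chosen.append(((d1, d2), mi))
            used |= {d1, d2}
            if len(chosen) == n_pairs:
                break
    return chosen

# MI scores from the tables above (a subset; offsets 1..12).
scores = {(1, 7): 4.483, (1, 8): 4.465, (1, 4): 4.463, (1, 10): 4.456,
          (2, 11): 4.403, (3, 12): 4.320, (4, 10): 4.232, (5, 9): 4.090,
          (6, 8): 3.807}
print(greedy_disjoint_pairs(scores, 6))
```

Running this reproduces the disjoint table: (1,7) is taken first, every other d=1 pair is skipped, and the remaining offsets pair up greedily.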

Fixed Consecutive Pairs

Pair   MI (bits)  Overlap
(1,2)  4.399      none
(3,4)  3.980      none
(5,6)  3.739      none
(7,8)  3.560      none

Bayes-2: 6.90 | Bayes-4: 13.34 bpc. Better than shared offsets, but Bayesian combination still fundamentally fails.

Key insight: The fix isn't just disjoint offsets. Naïve Bayes is fundamentally wrong for combining product patterns. The correct approach is KN n-gram counting over events, which handles context dependencies properly.

3. KN N-gram: The Right Approach

Byte KN vs Data Size

Byte KN-5 reaches 2.29 bpc at 10M, already beating the sat-rnn (2.81 bpc at 110M).
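A full interpolated Kneser-Ney implementation is beyond a blog snippet, but the counting approach can be sketched with absolute-discount backoff, a simplification of KN. The class, discount value, and toy corpus below are illustrative assumptions, not the article's code:

```python
import math
from collections import Counter, defaultdict

class BackoffByteLM:
    """Byte n-gram with absolute discounting and recursive backoff,
    a simplified stand-in for full interpolated Kneser-Ney."""

    def __init__(self, data, order=5, discount=0.5):
        self.order, self.d = order, discount
        # counts[n][context] -> Counter of next bytes, for contexts of length n
        self.counts = [defaultdict(Counter) for _ in range(order)]
        for t in range(len(data)):
            for n in range(order):
                if t >= n:
                    self.counts[n][data[t - n:t]][data[t]] += 1

    def prob(self, ctx, byte, n=None):
        """P(byte | ctx), backing off from order n down to uniform 1/256."""
        if n is None:
            n = self.order - 1
        c = self.counts[n][ctx[-n:] if n else b""]
        total = sum(c.values())
        lower = self.prob(ctx, byte, n - 1) if n else 1 / 256
        if total == 0:
            return lower
        disc = max(c[byte] - self.d, 0) / total   # discounted ML estimate
        lam = self.d * len(c) / total             # mass released to backoff
        return disc + lam * lower

def bpc(model, data):
    """Average bits per character of data under the model."""
    lp = sum(-math.log2(model.prob(data[max(0, t - model.order + 1):t], data[t]))
             for t in range(len(data)))
    return lp / len(data)

train = b"the cat sat on the mat. " * 40   # toy corpus (hypothetical)
lm = BackoffByteLM(train, order=5)
print(f"train bpc: {bpc(lm, train):.2f}")
```

The discounting keeps every conditional properly normalized, which is exactly the property the naïve Bayesian combination lacks.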

SVD Events vs Byte KN (N=1024)

SVD-16 events reach 1.64 bpc. Raw bytes are better (1.40) at this scale because the 1024-byte sample contains only 22 unique byte values.
Method               Events/offset  Order  Test bpc  Notes
SVD-4 KN             4              12     2.41      Too few events
SVD-8 KN             8              12     1.98      Below 2 bpc
SVD-16 KN            16             12     1.64      Best event-based
Byte KN-7            256            7      1.40      Best overall at 1024
Skip-KN-6 (12 offs)  256            6      1.40      Matches byte KN
sat-rnn                                    8.22      Memorized, doesn't generalize

4. Event Space Discovery in the Trained RNN

Analyzing the 128 neurons of the sat-rnn on 1024 bytes reveals that K=2 (sign, the doubled-E) wastes half the information. K=4 is the sweet spot.

Top Neurons by Mutual Information

Neuron  MI (bits)
h56     0.774
h112    0.757
h61     0.719
h76     0.684
h106    0.677
h68     0.646
h15     0.608
h52     0.468
h26     0.459
h17     0.431
h101    0.393
h80     0.348
Top 12 neurons account for most MI. Mean over all 128 neurons: 0.072 bits.

MI Retention by Event Count K

K=2 (sign/doubled-E) captures only 41–52% of information for top neurons. K=4 captures 90%+. K=8 captures 100%.
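The retention measurement can be sketched as follows, with synthetic activations standing in for a sat-rnn neuron (the quantile binning, noise model, and 64-bin reference are assumptions of this sketch):

```python
import numpy as np

def mi_bits(x, y):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    xi = np.unique(x, return_inverse=True)[1]
    yi = np.unique(y, return_inverse=True)[1]
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)
    joint /= joint.sum()
    px = joint.sum(1, keepdims=True)
    py = joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def quantize(h, K):
    """Equal-count (quantile) partition of activations into K events."""
    edges = np.quantile(h, np.linspace(0, 1, K + 1)[1:-1])
    return np.digitize(h, edges)

# Synthetic stand-in for one neuron: activation correlated with the next byte.
rng = np.random.default_rng(0)
sym = rng.integers(0, 8, 4096)                 # "next byte" proxy, 8 classes
h = sym + 0.3 * rng.standard_normal(4096)      # noisy activation
full = mi_bits(quantize(h, 64), sym)           # fine-grained reference
retained = {K: mi_bits(quantize(h, K), sym) / full for K in (2, 4, 8)}
for K, frac in retained.items():
    print(f"K={K}: {100 * frac:.0f}% of MI retained")
```

The qualitative pattern matches the article's: a 2-event (sign-like) split throws away a large share of the neuron's information, while a handful of events recovers nearly all of it.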

Event Space Contents (K=4, top neurons)

h56 (MI = 0.774 bits)

Separates letters from delimiters

E0: pmcasn/i

E1: :lgWMo-w

E2: t23."0=>

E3: rdh\n<SP

h112 (MI = 0.757 bits)

Opposite polarity: delimiters first

E0: SP<\nxk1yd

E1: h>t.30r2

E2: Mg-"w=oe

E3: Wl/i:sanmcp

h68 (MI = 0.646 bits)

Responds to 'm' (positive) vs space (negative)

Best: m (0.340)   Worst: SP (-0.255)

Range: 0.595

h106 (MI = 0.677 bits)

Separates content chars from whitespace

Best: SP (0.277)   Worst: m (-0.359)

Range: 0.637

The 4-event partition is linguistic: delimiters (space, newline, <), common content letters (p,m,c,a,s,n), mixed alphanumeric (t,2,3,.,"), and rare/structural characters. This is the natural event space for byte-level text compression.

5. Scale-Up Path to <2 bpc

Compression Methods Compared

Current best: byte KN-5 at 10M = 2.29 bpc. Target: <2 bpc via orthogonal offsets + event factoring.
Step  Method                                Target bpc  Status
1     Byte KN-5 at 10M                      2.29        Done (beats sat-rnn 2.81)
2     Orthogonal pairs + valid combination  2.0         Requires disjoint-offset counting
3     Skip-KN over event context            1.8         Larger context window
4     Full UM with learned ESs              <1.5        Complete interpretable model
Why the RNN loses:
(1) Only 128 hidden units (limited capacity).
(2) BPTT-50 limits the usable context.
(3) Catastrophic forgetting after 110M bytes.
(4) Gradient noise vs. exact counts.
The n-gram sees exact statistics; the RNN's advantage is compact generalization. The UM combines both.

6. Weight Construction Methods (Wy)

Method                   Train bpc  Test bpc  Notes
Per-neuron log-ratio     2.78       3.09      scale = 1/16
Per-byte PI              4.91       4.73      pseudo-inverse
Byte lookup (nonlinear)  14.75      14.93     166/256 bytes decoded (hash collisions)
PI + SGD-500             0.065      3.39      Overfits (matches sat-rnn 0.079)
The PI-SGD gap (1.89 → 0.59 bpc) has two sources. (1) Hash collisions: 90/256 byte values are lost (35% of the information). (2) Shared-offset non-independence: Bayesian combination amplifies the shared d=1 factor exponentially. SGD learns to compensate for both; the analytic construction cannot.
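The per-byte PI construction is plain least squares on one-hot targets. A sketch with synthetic hidden states standing in for the sat-rnn's (the 128-d hidden size follows the article; the alphabet size, noise model, and data are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: T hidden states (128-d, matching the sat-rnn) and
# next-byte targets over a reduced alphabet.
T, H, V = 2000, 128, 32
Hs = rng.standard_normal((T, H))
W_true = rng.standard_normal((H, V))
targets = (Hs @ W_true + 0.5 * rng.standard_normal((T, V))).argmax(1)

# Per-byte pseudo-inverse readout: least-squares fit of one-hot targets.
# Wy = pinv(Hs) @ Y minimizes ||Hs @ Wy - Y||_F with no iterative training.
Y = np.eye(V)[targets]
Wy = np.linalg.pinv(Hs) @ Y

acc = ((Hs @ Wy).argmax(1) == targets).mean()
print(f"train accuracy of PI readout: {acc:.2f}")
```

This closed-form readout minimizes squared error, not cross-entropy, which is one reason SGD fine-tuning of the same matrix can still improve the bpc substantially.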
