The Carrier Signal Problem
Why product patterns need orthogonal offsets. PI-SGD gap analysis, SVD events, byte KN baselines.
27.2 bpc with 8 Bayesian pairs (worse than uniform 8.0)
2.29 bpc byte KN-5 at 10M (beats sat-rnn 2.81)
1.64 bpc SVD-16 events at 1024 bytes
90% neuron MI captured by K=4 events
1. The Shared-Offset Problem
Every neuron in the 128-hidden tanh RNN is a 2-offset conjunction detector: h_j ≈ E[h_j | data[t−d1], data[t−d2]]. But all top-8 pairs share offset d=1, making naïve Bayesian combination catastrophic.
Top Offset Pairs by Mutual Information (1024 bytes)
All 8 pairs share d=1. This means the conditionals P(o|x1,xd) are NOT independent—they all contain P(o|x1).
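The pair MI values in the tables below can be estimated directly by counting. A minimal sketch (the text here is a synthetic stand-in for the actual 1024-byte sample):

```python
from collections import Counter
from math import log2

def pair_mi(data, d1, d2):
    """Empirical MI between the next byte o = data[t] and the
    context pair (data[t-d1], data[t-d2]), estimated by counting."""
    lo = max(d1, d2)
    joint, ctx, out = Counter(), Counter(), Counter()
    n = 0
    for t in range(lo, len(data)):
        c = (data[t - d1], data[t - d2])
        joint[(c, data[t])] += 1
        ctx[c] += 1
        out[data[t]] += 1
        n += 1
    # I(C;O) = sum p(c,o) log2[ p(c,o) / (p(c) p(o)) ]
    return sum(k / n * log2(k * n / (ctx[c] * out[o]))
               for (c, o), k in joint.items())

data = b"the quick brown fox jumps over the lazy dog " * 40
print(round(pair_mi(data, 1, 7), 3))
```

On the real sample, ranking all (d1, d2) pairs by this quantity produces the shared-offset table below.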
Bayesian Combination: Catastrophic Failure
With shared offsets, adding more pairs makes prediction WORSE. 8 pairs = 27.2 bpc (3.4× worse than uniform 8.0 bpc). The shared d=1 factor is raised to the K-th power.
Why it fails: all top-8 pairs share offset d=1. The Bayesian combination formula P(o | all) ∝ P(o)^(1−K) ∏_k P(o | x_d1(k), x_d2(k)) assumes the pairs are conditionally independent given o. With shared d=1, the factor P(o | x_1) effectively appears K times instead of once, distorting the combined probabilities exponentially in K.
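A small numerical sketch of the failure mode: combining K copies of the same conditional (the repeated P(o|x1) factor) drives the combined distribution toward a degenerate one, which is exactly the exponential distortion described above. The 3-symbol prior and conditional are illustrative values, not measurements from the experiment:

```python
import numpy as np

def naive_bayes_combine(prior, conditionals):
    """P(o | all) ∝ P(o)^(1-K) * prod_k P(o | pair_k); valid only
    when the pairs are conditionally independent given o."""
    K = len(conditionals)
    logp = (1 - K) * np.log(prior)
    for c in conditionals:
        logp += np.log(c)
    p = np.exp(logp - logp.max())  # stabilized softmax-style renorm
    return p / p.sum()

# Stand-ins for P(o) and the repeated P(o | x1) factor
prior = np.array([0.5, 0.3, 0.2])
shared = np.array([0.6, 0.3, 0.1])

for K in (1, 2, 4, 8):
    print(K, np.round(naive_bayes_combine(prior, [shared] * K), 4))
```

As K grows, the mass collapses onto the symbol with the largest P(o|x1)/P(o) ratio, regardless of what the data actually supports.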
2. Orthogonal Offset Selection
Shared Offset Pairs (standard MI ranking)
| Pair | MI (bits) | Shared offset |
|---|---|---|
| (1,7) | 4.483 | d=1 |
| (1,8) | 4.465 | d=1 |
| (1,4) | 4.463 | d=1 |
| (1,10) | 4.456 | d=1 |
| (1,3) | 4.452 | d=1 |
| (1,6) | 4.452 | d=1 |
| (1,5) | 4.445 | d=1 |
| (1,9) | 4.409 | d=1 |
Bayes-2: 8.41 | Bayes-4: 14.84 | Bayes-8: 27.35 bpc
Disjoint Offset Pairs (greedy selection)
| Pair | MI (bits) | Overlap |
|---|---|---|
| (1,7) | 4.483 | none |
| (2,11) | 4.403 | none |
| (3,12) | 4.320 | none |
| (4,10) | 4.232 | none |
| (5,9) | 4.090 | none |
| (6,8) | 3.807 | none |
Bayes-2: 8.69 | Bayes-4: 16.04 bpc — still catastrophic because the Bayesian assumption itself is wrong for product patterns
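The greedy disjoint selection behind the table above fits in a few lines: walk the MI-ranked pairs and skip any pair that reuses an already-claimed offset. The ranked prefix here is taken from the tables in this section:

```python
def greedy_disjoint_pairs(ranked_pairs):
    """Pick MI-ranked offset pairs greedily, skipping any pair that
    reuses an already-selected offset."""
    used, picked = set(), []
    for (d1, d2), mi in ranked_pairs:
        if d1 not in used and d2 not in used:
            picked.append(((d1, d2), mi))
            used.update((d1, d2))
    return picked

# MI-ranked prefix from the tables above
ranked = [((1, 7), 4.483), ((1, 8), 4.465), ((1, 4), 4.463),
          ((2, 11), 4.403), ((3, 12), 4.320)]
print(greedy_disjoint_pairs(ranked))  # keeps (1,7), (2,11), (3,12)
```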
Fixed Consecutive Pairs
| Pair | MI (bits) | Overlap |
|---|---|---|
| (1,2) | 4.399 | none |
| (3,4) | 3.980 | none |
| (5,6) | 3.739 | none |
| (7,8) | 3.560 | none |
Bayes-2: 6.90 | Bayes-4: 13.34 bpc — better than shared but Bayesian combination fundamentally fails
Key insight: The fix isn't just disjoint offsets. Naïve Bayes is fundamentally wrong for combining product patterns. The correct approach is KN n-gram counting over events, which handles context dependencies properly.
3. KN N-gram: The Right Approach
Byte KN vs Data Size
Byte KN-5 reaches 2.29 bpc at 10M, already beating the sat-rnn (2.81 bpc at 110M).
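For reference, here is a simplified interpolated byte n-gram with absolute discounting, a stand-in for the byte KN-5 model (full modified Kneser-Ney additionally uses continuation counts and count-dependent discounts for the lower orders):

```python
from collections import Counter, defaultdict
from math import log2

class AbsDiscountLM:
    """Interpolated byte n-gram with absolute discounting: a simplified
    stand-in for byte KN-5 (no Kneser-Ney continuation counts)."""
    def __init__(self, order=5, discount=0.75):
        self.order, self.d = order, discount
        self.counts = [defaultdict(Counter) for _ in range(order)]

    def train(self, data):
        for t in range(len(data)):
            for n in range(self.order):
                if t >= n:
                    self.counts[n][data[t - n:t]][data[t]] += 1

    def prob(self, history, o):
        p = 1.0 / 256  # uniform base distribution over bytes
        for n in range(self.order):
            if n > len(history):
                break
            cnt = self.counts[n].get(history[len(history) - n:])
            if not cnt:
                continue
            total = sum(cnt.values())
            # discounted ML estimate; freed mass backs off to p
            p = max(cnt[o] - self.d, 0) / total + self.d * len(cnt) / total * p
        return p

    def bpc(self, data):
        bits = sum(-log2(self.prob(data[max(0, t - self.order + 1):t], data[t]))
                   for t in range(len(data)))
        return bits / len(data)

text = b"abracadabra " * 200
lm = AbsDiscountLM(order=5)
lm.train(text[:2000])
print(round(lm.bpc(text[2000:]), 2))
```

The key property, shared with KN, is that the model sees exact counts: no gradient noise, no forgetting.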
SVD Events vs Byte KN (N=1024)
SVD-16 events reach 1.64 bpc. Raw bytes do better (1.40) at this scale because the 1024-byte sample contains only 22 unique byte values.
| Method | Events/offset | Order | Test bpc | Notes |
|---|---|---|---|---|
| SVD-4 KN | 4 | 12 | 2.41 | Too few events |
| SVD-8 KN | 8 | 12 | 1.98 | Below 2 bpc |
| SVD-16 KN | 16 | 12 | 1.64 | Best event-based |
| Byte KN-7 | 256 | 7 | 1.40 | Best overall at 1024 |
| Skip-KN-6 (12 offsets) | 256 | 6 | 1.40 | Matches byte KN |
| sat-rnn | — | — | 8.22 | Memorized, doesn't generalize |
4. Event Space Discovery in the Trained RNN
Analyzing the 128 neurons of the sat-rnn on 1024 bytes reveals that K=2 (sign, the doubled-E) wastes half the information. K=4 is the sweet spot.
Top Neurons by Mutual Information
Top 12 neurons account for most MI. Mean over all 128 neurons: 0.072 bits.
MI Retention by Event Count K
K=2 (sign/doubled-E) captures only 41–52% of information for top neurons. K=4 captures 90%+. K=8 captures 100%.
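The K-event retention measurement can be reproduced in miniature: quantize an activation into K equal-mass events and compare its MI with the next byte against a near-lossless fine-grained reference. The toy neuron below is synthetic (a noisy monotone function of the byte), standing in for a real sat-rnn hidden unit:

```python
import numpy as np
from collections import Counter
from math import log2

def discrete_mi(xs, ys):
    """Empirical MI between two discrete sequences, in bits."""
    n = len(xs)
    jx, jy, jxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(k / n * log2(k * n / (jx[x] * jy[y]))
               for (x, y), k in jxy.items())

def quantile_events(h, K):
    """Partition real activations into K equal-mass events (quantile bins)."""
    edges = np.quantile(h, np.linspace(0, 1, K + 1)[1:-1])
    return np.digitize(h, edges)

# Synthetic neuron: activation is a noisy function of the current byte
rng = np.random.default_rng(0)
byte_seq = rng.integers(0, 8, 20000)
h = np.tanh(byte_seq / 4.0 - 1.0 + 0.1 * rng.standard_normal(20000))

full = discrete_mi(np.round(h, 2), byte_seq)  # near-lossless reference
for K in (2, 4, 8):
    frac = discrete_mi(quantile_events(h, K), byte_seq) / full
    print(K, round(frac, 2))
```

Because each K's quantile grid refines the previous one (the median is among the quartiles, the quartiles among the eighths), retention is guaranteed non-decreasing in K.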
Event Space Contents (K=4, top neurons)
h56 (MI = 0.774 bits)
Separates letters from delimiters
E0: pmcasn/i
E1: :lgWMo-w
E2: t23."0=>
E3: rdh\n<SP
h112 (MI = 0.757 bits)
Opposite polarity: delimiters first
E0: SP<\nxk1yd
E1: h>t.30r2
E2: Mg-"w=oe
E3: Wl/i:sanmcp
h68 (MI = 0.646 bits)
Responds to 'm' (positive) vs space (negative)
Best: m (0.340) Worst: SP (-0.255)
Range: 0.595
h106 (MI = 0.677 bits)
Separates content chars from whitespace
Best: SP (0.277) Worst: m (-0.359)
Range: 0.637
The 4-event partition is linguistic: delimiters (space, newline, <), common content letters (p,m,c,a,s,n), mixed alphanumeric (t,2,3,.,"), and rare/structural characters. This is the natural event space for byte-level text compression.
5. Scale-Up Path to <2 bpc
Compression Methods Compared
Current best: byte KN-5 at 10M = 2.29 bpc. Target: <2 bpc via orthogonal offsets + event factoring.
| Step | Method | Target bpc | Status |
|---|---|---|---|
| 1 | Byte KN-5 at 10M | 2.29 | Done (beats sat-rnn 2.81) |
| 2 | Orthogonal pairs + valid combination | 2.0 | Requires disjoint-offset counting |
| 3 | Skip-KN over event context | 1.8 | Larger context window |
| 4 | Full UM with learned ESs | <1.5 | Complete interpretable model |
Why the RNN loses: (1) Only 128 hidden units (limited capacity). (2) BPTT-50 limits context. (3) Catastrophic forgetting after 110M bytes. (4) Gradient noise vs exact counts. The n-gram sees exact statistics; the RNN's advantage is compact generalization. The UM combines both.
6. Weight Construction Methods (Wy)
| Method | Train bpc | Test bpc | Notes |
|---|---|---|---|
| Per-neuron log-ratio | 2.78 | 3.09 | scale = 1/16 |
| Per-byte PI | 4.91 | 4.73 | pseudo-inverse |
| Byte lookup (nonlinear) | 14.75 | 14.93 | 166/256 decode (hash collisions) |
| PI + SGD-500 | 0.065 | 3.39 | Overfits (matches sat-rnn 0.079) |
The PI-SGD gap (1.89 → 0.59 bpc) has two sources. (1) Hash collisions: 90 of 256 byte values are lost (35% of the information). (2) Shared-offset non-independence: the Bayesian combination amplifies the d=1 factor exponentially. SGD learns to compensate for both; the closed-form constructions cannot.
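The per-byte PI row is a closed-form least-squares readout. A shape-level sketch, using synthetic hidden states and targets rather than the trained sat-rnn's:

```python
import numpy as np

# Closed-form least-squares readout: fit Wy so that H @ Wy ≈ Y, where Y
# is one-hot next-byte targets. Shapes mirror the sat-rnn (128 hidden
# units, 256 bytes); the data is random, not the network's real states.
rng = np.random.default_rng(1)
T, H_DIM, VOCAB = 5000, 128, 256
H = np.tanh(rng.standard_normal((T, H_DIM)))
targets = rng.integers(0, VOCAB, T)
Y = np.eye(VOCAB)[targets]

Wy = np.linalg.pinv(H) @ Y  # pseudo-inverse fit of the readout

def bpc(logits, targets):
    """Cross-entropy in bits per symbol from raw readout scores."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean() / np.log(2)

print(round(bpc(H @ Wy, targets), 2))
```

With random targets this stays near the 8-bpc uniform baseline; the point is the mechanics, not the number. SGD on the same readout can keep reducing train bpc far past the least-squares solution, which is the overfitting seen in the PI + SGD-500 row.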