A Mathematical Review of CMP

Formalizing u = (e, t, p, f, ω), assessing the framework against twenty papers of empirical results.

- 5 components in the Universal Model
- 4/4 core claims confirmed empirically
- 3 extensions beyond original CMP
- 5 open questions identified
- 20 papers in the Feb 7-11 empirical program

1. The Five-Tuple: u = (e, t, p, f, ω)

CMP defines the Universal Model as a five-tuple drawn from a product of finite sets U = E × T × P × F × Ω.

The five components: e (Event), t (Thought), p (Pattern), f (Update), ω (Learning).

e ∈ E — Event (current state of the world)

E = ∏_{i=1}^{k} E_i     I(E) = ∑_{i=1}^{k} log|E_i|

An event space E_i is a finite set of mutually exclusive events, exactly one of which is true at any given time. The total event space is a product, so choosing a factorization is choosing a coordinate system for the state space.

Empirical confirmation: The sat-rnn's 128 hidden neurons map to 128 binary event spaces (|E_i| = 2). The doubled-E construction matches to within 0.00% bpc difference.

Key insight: "Epistemic precommitments to divide reality into distinguishable parts." Philosophically loaded but mathematically precise — the factorization is not unique, and the choice of factorization is the central analytical decision.

t ∈ T — Thought (model's current belief)

T ≅ E   —   t ∈ {0,1}^{|E|} (discrete)  or  t ∈ (0,1)^{|E|} (continuous)

A total thought assigns a belief value to each event. In the concrete SN representation, strengths are in [0, 255] where 0 = "certainly false" and 255 = "certainly true".

SN strength = log₂ Ω, where Ω = number of microstates (dataset positions) supporting the event. This is the Shannon-Boltzmann identity pinned down by the Feb 11 archive.

The isomorphism T ≅ E is key: a thought is an assignment of belief to each possible event.

Discrete is primary: Feb 11 results show sign bits carry 99.7% of compression. The mantissa is noise — the continuous form is an artifact of float representation (for tanh-saturated RNNs).

p ∈ P — Pattern (knowledge structure)

p: T → T   —   Atomic pattern: (e_a, e_b) with strength

A total pattern maps thoughts to thoughts. An atomic pattern (e_a, e_b) with strength s means "when e_a is observed, e_b becomes more likely."

Standard update (tropical): (f_p(t))_j = max_i min(t_i, p_{ij}) — conjunction as min, disjunction as max. A (max, min) tropical semiring computation.

The neural network update h' = tanh(Wh + b) is a different (non-tropical) realization. For the sat-rnn, tanh saturation makes it approximately Boolean: tanh(z) is within 10⁻⁶ of ±1 for 98.9% of neuron-steps (mean margin = 60.5).
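The tropical update can be made concrete. Below is a minimal sketch of (f_p(t))_j = max_i min(t_i, p_{ij}) with beliefs in [0, 1]; the function name and the toy 3-to-2 pattern matrix are illustrative, not the sat-rnn's actual weights.

```python
# Sketch of the standard tropical update (f_p(t))_j = max_i min(t_i, p_ij).
# Conjunction as min, disjunction as max: a (max, min) semiring computation.

def tropical_update(t, p):
    """Apply a total pattern p (rows p[i][j]) to a thought vector t."""
    n_out = len(p[0])
    return [max(min(t[i], p[i][j]) for i in range(len(t)))
            for j in range(n_out)]

# With Boolean inputs this reduces to ordinary OR-of-ANDs.
t = [1.0, 0.0, 1.0]
p = [[1.0, 0.0],   # e_1 supports output event 1
     [0.0, 1.0],   # e_2 supports output event 2
     [0.0, 0.0]]   # e_3 supports neither
print(tropical_update(t, p))  # [1.0, 0.0]
```

With graded (non-Boolean) beliefs the same code computes a fuzzy OR-of-ANDs, which is what distinguishes it from the tanh realization discussed above.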

f ∈ F — Update Function (P × T → T)

f: P × T → T   —   Apply patterns to current thought

The update function takes a pattern and a thought and produces a new thought. CMP defines F as a union of binary strings up to fixed length bounds, with decoding maps into realizable functions.

Open question: The relationship between the tropical update (max-min) and the neural update (matrix multiply + nonlinearity) is established empirically for sat-rnn but not proved in general.

Finiteness convention: By encoding F as bounded binary strings, everything stays finite and computable. |U| < ∞ and all information quantities are well-defined.

ω ∈ Ω — Learning Function (T × E → P)

ω: T × E → P   —   Update patterns from observations

The standard learning function ω₀ maintains a log contingency table via log-stochastic counting: with probability 2^{-s}, set s → s+1. This keeps E[2^s] ≈ count (exactly count + 1 for a counter started at s = 0).

Key property: The log contingency table is a sufficient statistic for the function I → O given the data. Everything statistically learnable from observations is captured by this matrix.
Symmetry: The same matrix serves I → O (forward) and O → I (backward) by transposition. This is the bidirectionality confirmed by the temporal bi-embedding paper.
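The counting rule can be sanity-checked numerically. A minimal sketch, assuming a Morris-style approximate counter as described; names are illustrative.

```python
import random

# Log-stochastic counting: on each observation, increment the stored
# strength s with probability 2^-s, so 2^s tracks the count in expectation.

def log_count(n_observations, rng):
    s = 0
    for _ in range(n_observations):
        if rng.random() < 2.0 ** -s:
            s += 1
    return s

rng = random.Random(0)
true_count = 1000
trials = [2 ** log_count(true_count, rng) for _ in range(2000)]
mean_est = sum(trials) / len(trials)
print(round(mean_est))  # near 1000 (E[2^s] = count + 1)
```

The stored strength s only needs ~log₂(count) bits, which is the point of the log contingency table.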

2. Information Decomposition

Since U is a product, total information is additive:

I(U) = log|E| + log|T| + log|P| + log|F| + log|Ω|

Information Content by Component (sat-rnn, 128 hidden)

- E: 128 bits (128 binary ESes)
- T: 128 bits (T ≅ E, isomorphic)
- P: 49,408 params (W_h 128×128 + W_x 256×128)
- F: ~0 free bits (tanh, fixed)
- ω: 33,024 params (W_y 128×256 + bias + SGD)

Approximate information content of each UM component for the sat-rnn. Total: ~82k parameters (656k bits at f32).

3. Event Spaces and Encodings

CMP defines event spaces as finite sets with product structure. Two concrete encodings map E into the natural numbers.


Coprime Encoding (Chinese Remainder Theorem)

n(e1, ..., ek) = ∑i eij<i |Ej|

The formula above is a mixed-radix positional encoding, decoded by successive division: e_1 = n mod |E_1|, e_2 = ⌊n/|E_1|⌋ mod |E_2|, and so on. When the cardinalities |E_i| are pairwise coprime, the CRT additionally guarantees a residue encoding with direct recovery e_i = n mod |E_i|.

Example: E_1 = {0,1,2} (3 values), E_2 = {0,1,2,3,4} (5 values), E_3 = {0,1,2,3,4,5,6} (7 values). Total |E| = 105. State (2, 3, 4) encodes to n = 2 + 3×3 + 4×15 = 71. Recovery by successive division: 71 mod 3 = 2, ⌊71/3⌋ mod 5 = 3, ⌊71/15⌋ = 4. (Note that 71 mod 5 = 1 ≠ 3: this is the mixed-radix code, not a direct CRT residue code.)
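The mixed-radix round trip can be sketched in a few lines; `encode`/`decode` are illustrative names, not from the archive.

```python
# Mixed-radix encode/decode over event-space cardinalities (bases).

def encode(state, bases):
    n, place = 0, 1
    for e, b in zip(state, bases):
        n += e * place   # each component weighted by the product of
        place *= b       # all earlier cardinalities
    return n

def decode(n, bases):
    state = []
    for b in bases:
        state.append(n % b)  # peel off the lowest-radix digit
        n //= b
    return state

bases = [3, 5, 7]          # |E_1|, |E_2|, |E_3|
print(encode([2, 3, 4], bases))  # 71
print(decode(71, bases))         # [2, 3, 4]
```

Unique recovery holds for any bases here; coprimality only matters for the residue-style CRT variant.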

Prime-Power Encoding (Feb 10 Extension)

N(σ) = ∏_{i=1}^{k} p_i^{e_i(σ)}

For equal-dimension factors (e.g., all binary), assign one prime per event space. The fundamental theorem of arithmetic guarantees unique factorization.

Binary ESes: N(σ) = ∏_{i∈S} p_i where S = set of "on" bits. Square-free products. Every composite N has non-trivial internal structure — factorization IS interpretability.

Example: 4 binary event spaces, assigned primes 2, 3, 5, 7. With all spaces off, N = 1 (the empty product); turning spaces 1 and 3 on gives N = 2 × 5 = 10, and factoring N recovers exactly which spaces are on.

Product Structure of Event Spaces

E = E_1 × E_2 × ... × E_k   ⇒   I(E) = I(E_1) + I(E_2) + ... + I(E_k)

Information decomposes additively because E is a product. The quotient E/E_1 identifies events differing only in E_1, giving ∏_{i≠1} E_i.

Factored vs. unfactored information: k binary event spaces give 2^k total states but only k bits of information content.
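The factored-vs-unfactored point is a one-liner to verify (k = 10 here for readability):

```python
import math

# k binary event spaces: the product space has 2^k joint states, yet
# the information content I(E) = sum_i log2|E_i| is only k bits.

k = 10
sizes = [2] * k
n_states = math.prod(sizes)                   # 2^k joint states
info_bits = sum(math.log2(s) for s in sizes)  # k bits
print(n_states, info_bits)  # 1024 10.0
```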

4. The E → N → Q Chain

The quotient chain traces computation through the universal model, from raw events to the luck of predictions.


E → N: Encoding

Events embed into natural numbers via the prime-power encoding. Each factor E_i corresponds to a prime p_i, and the macrostate integer N = ∏_i p_i^{e_i} is uniquely recoverable.

128 binary ESes → products of up to 128 distinct primes. Hamming weight w gives w prime factors = w independent pieces of interpretive content.

N → Q: Quotient

The quotient Q = λ traces the luck of events. Q = 1/p = inverse probability of the observed sequence under the model. The quotient presupposes the factored perspective.

Q = λ: quotient over dataset positions = luck of events. The quotient chain E → N → Q traces computation at every layer.

5. Multiplication: The Universal Combining Operation

CMP identifies multiplication as universal across all five components.

Factor | Multiplication | Additive form | Example
E | Cartesian product E_1 × E_2 | I(E) = I(E_1) + I(E_2) | coin × weather
T | Independent beliefs t = (t_1, t_2) | log p = log p_1 + log p_2 | joint probability
P | Layer composition P_1 · P_2 | matrix multiplication | multi-layer net
F | Function composition f_1 ∘ f_2 | chained updates | forward pass
ω | Learning composition | chained updates | multi-epoch training

The quotient operation is multiplication's inverse: E/E_1 = ∏_{i≠1} E_i.

6. Three Extensions Beyond CMP

The February 11 archive extends the theory in three directions not present in the original paper.


Extension 1: Equal-Dimension Factor Permutation

Z ≅ Z_0^k  ⇒  S_k acts by permuting coordinates  ⇒  φ: S_k × Z_0^k → T

When all event spaces have the same dimension (as in the 128 binary ESes of the sat-rnn), the symmetric group S_k acts by permuting coordinates. CMP does not discuss this symmetry.

Consequence: The alignment problem between architecture-natural and domain-natural factorizations becomes a permutation search. The low-dimensional inner product structure (d ≈ 20 functional features + 108 gauge dimensions) makes this search tractable.

This gives CMP's abstract φ: H_arch → H_domain a precise algebraic form.
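As a hedged sketch of the permutation-search idea: the feature vectors and the greedy inner-product matcher below are illustrative assumptions, not the archive's actual alignment procedure.

```python
# Align an architecture-natural factorization with a domain-natural one
# by searching over permutations of same-dimension factors.

def greedy_align(arch_feats, domain_feats):
    """Greedily match each arch factor to the unused domain factor with
    the largest inner product; returns the permutation as index list."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    unused = set(range(len(domain_feats)))
    perm = []
    for u in arch_feats:
        best = max(unused, key=lambda j: dot(u, domain_feats[j]))
        unused.remove(best)
        perm.append(best)
    return perm

# Toy case: arch features are a shuffled copy of the domain features.
domain = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
arch = [domain[2], domain[0], domain[1]]
print(greedy_align(arch, domain))  # [2, 0, 1]
```

Greedy matching is O(k²) and can miss the optimum on noisy features; an optimal-assignment solver would do the same job exactly, but the low effective dimension cited above is what makes either search tractable.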

Extension 2: Thermodynamic Identification (Shannon = Boltzmann)

The Feb 11 microstate-macrostate paper identifies the binary-ES softmax with the Boltzmann distribution at β = ln 2. CMP does not make this connection.

CMP | Thermodynamics
Event space E_i | Degree of freedom
SN strength s | log₂ Ω (log microstates)
Pattern strength | Free energy difference
Update f | Partition function evaluation
Quotient Q | Luck = 1/p = inverse probability

Shannon-Boltzmann identity: H = log₂ N − ⟨S_B⟩ bridges information theory and statistical mechanics through the event formalism. Arguably the most important theoretical extension.
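The identity H = log₂ N − ⟨S_B⟩ can be verified numerically; the toy dataset below is an illustration, not the sat-rnn corpus.

```python
import math
from collections import Counter

# For a dataset of N positions where event e is supported by Omega_e
# microstates (positions), the Shannon entropy of the event distribution
# equals log2 N minus the average Boltzmann entropy S_B = log2 Omega.

data = list("aaabbc")   # N = 6 positions
N = len(data)
omega = Counter(data)   # microstate counts per event: a:3, b:2, c:1

H = -sum((w / N) * math.log2(w / N) for w in omega.values())
avg_SB = sum((w / N) * math.log2(w) for w in omega.values())

print(abs(H - (math.log2(N) - avg_SB)) < 1e-12)  # True
```

The identity is exact, not approximate: it is just log(w/N) = log w − log N averaged over the event distribution.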

Extension 3: Bidirectional Construction (Data ↔ Weights)

CMP defines ω₀ (log-stochastic counting) for the unfactored case. The archive applies it to a factored system and shows the result is constructive.

Forward (data → weights): Hebbian covariance, skip-bigram log-ratios, and shift-register design give all 82k parameters from data statistics. Achieves 1.89 bpc with zero gradient descent.
Backward (weights → interpretation): Factor map, backward attribution chains, and Boolean automaton analysis recover the data's statistical structure from trained weights.
Analytic (data-derived) construction vs. SGD training. The analytic approach is 39,800x cheaper and achieves better generalization.

7. The Equivalence Thesis

Central Claim: Interpretability and efficiency in learning systems are identical problems, both resolved by recovering the correct factorization of the event space.

Efficiency ⇒ Interpretability

An efficient model (sparse P, small |E|) must factor E into small event spaces with sparse patterns. If the factorization matches domain structure, it is automatically interpretable: event spaces have domain-natural names, patterns express domain-natural relationships.

Interpretability ⇒ Efficiency

An interpretable model has named event spaces and explicit patterns. This naming IS a factorization, and an interpreted factorization is always at least as efficient as an unfactored one: I(E) = ∑ I(Ei), and patterns between small Ei are sparser.

The failure mode: Architecture-natural ≠ domain-natural. Deep learning provides a factorization (layers, neurons, attention heads) efficient for gradient-based learning but opaque because it doesn't match the domain's natural decomposition. Interpretability seeks the refactoring map φ: H_arch → H_domain.
The equivalence in practice: analytic construction is both 39,800x cheaper AND fully interpretable (every parameter has a data-statistical meaning).

8. CMP Claims vs. Empirical Evidence

Scorecard assessing CMP's claims against the February 7-11 empirical program.

What CMP Gets Right

The five-tuple is exhaustive
Every component of the RNN maps cleanly to one of (e, t, p, f, ω). No sixth component is needed.
CONFIRMED
Factorization is the right abstraction
The entire interpretability program reduces to finding the correct factorization of E (128 binary ESes) and P (sparse patterns between them).
CONFIRMED
The standard learning function works
Log-stochastic counting is exactly what Hebbian construction computes. 50% blend improves trained model by 0.66 bpc.
CONFIRMED
The equivalence thesis holds
Analytic construction: both more interpretable (every param has meaning) and more efficient (39,800x cheaper, 1.89 vs 4.97 bpc trained).
CONFIRMED

What CMP Leaves Open

?
How to find the factorization
CMP proves the correct factorization exists but gives no algorithm. Feb 11 provides one empirical recipe (Boolean analysis → factor map → backward attribution) but generalization unclear.
OPEN
?
Update function beyond standard case
Tropical (max-min) vs. neural (matrix multiply + nonlinearity) equivalence is empirical for sat-rnn (98.9% Boolean) but not proved in general.
OPEN
?
Continuous vs. discrete
Sign bits carry 99.7% for tanh RNNs, but transformers with softmax attention may genuinely use continuous dynamics.
OPEN
?
The role of depth
Dominant patterns involve deep temporal offsets (d = 18-25). CMP accommodates depth via composition (P = P_L ⋯ P_1) but doesn't single it out as primary.
OPEN
?
Scaling beyond H=128
All results concern 128-hidden single-layer RNN on 262k bytes. CMP's claims are universal. Cost analysis (gap widens with H) is encouraging but untested.
OPEN

Extensions from Feb 11 Archive

+
Equal-dimension factor permutation
S_k symmetry on Z_0^k gives φ a precise algebraic form. Low-dimensional (d ≈ 20) inner product structure makes permutation search tractable.
EXTENDED
+
Thermodynamic identification
Shannon = Boltzmann at β = ln 2. Every CMP quantity gets a physical meaning. H = log₂ N − ⟨S_B⟩.
EXTENDED
+
Bidirectional construction
Forward: all 82k params from data stats (1.89 bpc, zero SGD). Backward: factor map recovers data structure from weights. ω0 applied to factored systems is constructive.
EXTENDED

9. Summary Assessment

Distribution of CMP claims across categories: confirmed by empirical evidence, extended by Feb 11 archive, or remaining open questions.
Conclusion: CMP's framework is mathematically precise where it needs to be (event spaces, factorization, the standard learning function) and deliberately open where it should be (the update function, the learning function beyond the standard case). The central claim — that interpretability and efficiency are the same problem, both resolved by correct factorization — is confirmed by the empirical results.
Most significant contribution: Identifying factorization as the primary analytical tool, unifying interpretability, efficiency, information theory, and statistical mechanics under a single algebraic structure. The concrete representation (E → N via prime powers) makes this unification computational, and the Feb 11 archive makes it empirical.


Claude and MJC — February 12, 2026 — A Mathematical Review of CMP