
The Entropy Bridge

H = log2 N − ⟨S_B⟩ — Shannon meets Boltzmann on the same objects

1. The Entropy Bridge

Shannon entropy and Boltzmann entropy are the same mathematics applied to the same objects. Dataset positions are microstates. Event classes are macrostates. Counts are multiplicities. And entropy is log-counts.

H = log2 N − ⟨S_B⟩
Residual Shannon entropy = Total information − Expected Boltzmann entropy
N = 1024 positions (microstates)
log2 N = 10.0 bits (total information)
52 distinct chars (unigram macrostates)
0.079 BPC, sat-rnn (residual H)
"The Shannon-Boltzmann bridge is not an analogy. It is the same mathematics applied to the same objects: microstates (positions), macrostates (event classes), multiplicities (counts), and entropy (log-counts)."
— entropy-bridge.tex
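The bridge identity can be checked numerically. A minimal sketch in Python: the 16-character toy string and the `bridge` helper are invented for illustration (the page's own numbers use N = 1024), but the identity it verifies is exactly the one above.

```python
import math
from collections import Counter

def bridge(labels):
    """For a partition of N positions (microstates) into label classes
    (macrostates), return (H, log2(N) - <S_B>), where H is the Shannon
    entropy of the class distribution, W(m) is the multiplicity (count)
    of macrostate m, and <S_B> = sum_m p(m) * log2(W(m)) is the expected
    Boltzmann entropy."""
    n = len(labels)
    w = Counter(labels)                                   # multiplicities W(m)
    h = -sum((c / n) * math.log2(c / n) for c in w.values())
    sb = sum((c / n) * math.log2(c) for c in w.values())  # <S_B>
    return h, math.log2(n) - sb

# Toy dataset: 16 positions, 4 macrostates with W = 8, 4, 2, 2.
h, residual = bridge(list("aaaaaaaabbbbccdd"))
print(h)                           # 1.75 bits
assert abs(h - residual) < 1e-12   # H = log2 N - <S_B>, exactly
```

The assertion passes for any labeling, because both sides are the same sum rearranged: counts in, log-counts out.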

2. The Factoring Hierarchy

Every model defines a factoring of the dataset into macrostates. Each step adds event spaces to the macrostate description, absorbing more of log2 N into the Boltzmann term and leaving less residual Shannon entropy (bpc). The stacked chart below shows this tradeoff.

The Factoring Staircase: Absorbed vs Residual Entropy
For each factoring level, the purple bar shows ⟨S_B⟩ (Boltzmann entropy absorbed by the macrostates) and the colored bar shows H (residual Shannon entropy / bpc). Together they always sum to approximately 10 bits (= log2 1024).

Level Details (interactive panel): macrostates, ⟨S_B⟩ (absorbed), H (residual bpc), and % absorbed for the selected level.
Skip-8 UM achieves lower bpc than sat-rnn despite having fewer macrostates (834 vs ~1000) because its macrostates are more informative — each skip-pattern captures more of the data's structure per macrostate.

3. Microstates and Macrostates

At the most basic level, a microstate is a single position t in the dataset, carrying its full event description. A macrostate is an equivalence class formed by ignoring some event spaces — grouping positions that share a common property.

Microstate: a single position t with its full event description; each of the 1024 is unique.
Macrostate: an equivalence class formed by ignoring some event spaces; W positions share the class.
Example: "input = space" has W = 127, so S_B = log2(127) = 6.99 bits, absorbing 70% of log2 N.
Quotient: Q = N/W measures how surprising a macrostate is; for "space", Q = 1024/127 ≈ 8.1.

Top 10 Input Characters: Multiplicity W and log2 Q
Blue bars show multiplicity W (number of positions with that input character). Gold line shows log2 Q = log2(N/W) — the bits of surprise. Common characters have large W and low surprise; rare characters have small W and high surprise.
W = 127 for "space" (largest macrostate)
Q = 8.1 for "space" (N/W = 1024/127)
S_B = 6.99 bits for "space" (log2 127)
log2 Q = 3.01 bits of surprise for "space"
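All four of these values follow from counts alone. A minimal sketch, assuming only the page's figures N = 1024 and W = 127 for "input = space":

```python
import math

N = 1024   # positions (microstates)
W = 127    # multiplicity of the macrostate "input = space"

S_B = math.log2(W)       # Boltzmann entropy of the macrostate (bits)
Q = N / W                # quotient: equal-sized classes needed to tile the data
surprise = math.log2(Q)  # bits of surprise; note S_B + surprise = log2 N

print(round(S_B, 2), round(Q, 1), round(surprise, 2))  # 6.99 8.1 3.01
```

The split 6.99 + 3.01 = 10 bits is the bridge identity applied to a single macrostate: what the class absorbs plus what remains to be paid.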

4. The Partition Function

The UM softmax IS the Boltzmann distribution with β = ln 2. Each neuron's activation probability follows the same formula that statistical mechanics uses for thermal equilibrium. The forward pass is a chain of partition function calculations.

P(h_j^+) = σ(D_j ln 2) = 1/(1 + 2^(−D_j))
Boltzmann probability of neuron j being in the "on" state
β = ln 2 ≈ 0.693 (inverse temperature)
128 binary event spaces (neurons)
984 distinct states (out of 1024 positions)
Q_H ≈ 1.04 state quotient (near-unique states)
Boltzmann Probability vs Pre-Activation Dj
The sigmoid curve P(h^+) = 1/(1 + 2^(−D)) maps pre-activation to Boltzmann probability. Most neurons are deeply saturated (near 0 or 1), acting as deterministic Boolean gates. The few near the threshold carry the marginal predictive signal.
The forward pass is a chain of partition function calculations — the same computation that statistical mechanics uses to determine the equilibrium distribution of particles across energy levels.
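The β = ln 2 claim is a one-line algebraic fact: σ(D ln 2) = 1/(1 + e^(−D ln 2)) = 1/(1 + 2^(−D)). A minimal sketch checking it, with invented pre-activation values and a hypothetical helper name `boltzmann_on_prob`:

```python
import math

def boltzmann_on_prob(D):
    """P(h+) = 1 / (1 + 2**(-D)): a two-state Boltzmann distribution with
    inverse temperature beta = ln 2, where the pre-activation D plays the
    role of an energy gap measured in bits."""
    return 1.0 / (1.0 + 2.0 ** (-D))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for D in (-8.0, -1.0, 0.0, 1.0, 8.0):                    # invented values
    p = boltzmann_on_prob(D)
    assert abs(p - sigmoid(D * math.log(2))) < 1e-12     # same distribution
    print(f"D={D:+5.1f}  P(h+)={p:.4f}")                 # saturates toward 0 or 1
```

At D = ±8 the probability is already within 0.4% of 0 or 1, which is the saturation behavior the chart describes.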

5. Q = λ: The Quotient is the Luck

For any macrostate m, the quotient Q(m) = N/W(m) counts how many macrostates of equal size would tile the dataset. But Q(m) = N/W(m) = 1/p(m) = λ(m), the luck: how surprised the model is when event m occurs.

Q(m) = N/W(m) = 1/p(m) = λ(m)
Quotient = inverse probability = luck

The bpc is the expected log-luck: bpc = E[log2 Q]. Compression improves bpc by increasing W(m) for the correct macrostates — by finding factorings that group more microstates into fewer, larger macrostates.

Quotient Q Across Factoring Levels
Each factoring level has a different average Q. Better factorings produce larger macrostates (higher W), which means lower Q, lower surprise, and lower bpc. The horizontal dashed line at Q = 1 represents the theoretical floor (zero surprise).
Each step down the factoring hierarchy is a step toward lower average luck.
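Both claims, bpc = E[log2 Q] and "larger W means lower bpc", can be checked on a toy factoring. A minimal sketch; the 16-position labelings and the helper name `expected_log_luck` are invented for illustration:

```python
import math
from collections import Counter

def expected_log_luck(labels):
    """bpc = E[log2 Q], where Q(m) = N / W(m) = 1/p(m) is the luck."""
    n = len(labels)
    w = Counter(labels)
    # Average the per-position surprise log2(N / W(m_t)) over all positions t.
    return sum(math.log2(n / w[m]) for m in labels) / n

labels = list("aaaaaaaabbbbccdd")       # toy factoring: W = 8, 4, 2, 2
bpc = expected_log_luck(labels)
print(round(bpc, 2))                    # 1.75 = residual Shannon entropy H

coarser = list("aaaaaaaabbbbxxxx")      # merge the two rare classes: W grows
assert expected_log_luck(coarser) < bpc # larger W, lower Q, lower bpc
```

The expected log-luck of a factoring is just its residual Shannon entropy written position by position, which is why growing the multiplicities of the classes that actually occur is the same thing as compressing.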

6. Compression is Factoring

The unfactored event space E with |E| = 2^128 is a lookup table: it can represent any function from histories to outputs, achieving zero entropy. But it requires 2^128 parameters — one per possible state.

Factoring E into 128 binary event spaces gives the same expressiveness (2^128 possible states) but in only 128 bits. This makes the structure learnable: instead of memorizing 2^128 entries, the model learns 128 binary decision boundaries.

2^128 unfactored params (lookup table)
128 factored dimensions (binary event spaces)
2^128 expressiveness (same either way)
0.079 achieved bpc (sat-rnn)
The Factoring Tradeoff: Parameters vs Learnability
The unfactored lookup table (left) has zero residual entropy but 2^128 parameters. The 128-neuron factoring (right) achieves 0.079 bpc with 82k learnable parameters. The factoring sacrifices a tiny amount of expressiveness for massive gains in learnability.
"The UM's learning problem is: find the factoring that minimizes conditional entropy (bpc) subject to the constraint that the factoring must be expressible as patterns over the event spaces."
— entropy-bridge.tex
Compression is factoring. The hierarchy from unfactored (8.0 bpc, 1 macrostate) through unigram, bigram, RNN, to full specification (0.0 bpc, 1024 macrostates) is the compression curve itself. Each level finds a better factoring of the same underlying event space.
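The parameter arithmetic behind the tradeoff is easy to verify. A minimal sketch (variable names are invented; the 82k learnable-parameter figure is the page's own number and is not derived here):

```python
# Expressiveness vs parameter count for a factored event space.
# A lookup table over k binary event spaces needs 2**k entries (one per
# joint state); factoring into k binary spaces names any of the same
# 2**k states with just k bits.
k = 128
lookup_table_params = 2 ** k     # one parameter per joint state
factored_dims = k                # one binary event space per bit

assert lookup_table_params == 2 ** factored_dims   # identical state space
print(f"states: 2^{k} = {lookup_table_params:.3e}")
print(f"bits to name one state: {factored_dims}")
```

The state space is identical either way; what changes is that 2^128 independent table entries become 128 decision boundaries that gradient descent can actually fit.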

Related Experiment Pages