
The Entropy Bridge

H = log2 N − ⟨S_B⟩ — Shannon meets Boltzmann on the same objects

1. The Entropy Bridge

Shannon entropy and Boltzmann entropy are the same mathematics applied to the same objects. Dataset positions are microstates. Event classes are macrostates. Counts are multiplicities. And entropy is log-counts.

H = log2 N − ⟨S_B⟩
Residual Shannon entropy = Total information − Expected Boltzmann entropy
N = 1024 positions (microstates)
log2 N = 10.0 bits (total information)
52 distinct chars (unigram macrostates)
0.079 BPC, sat-rnn (residual H)
"The Shannon-Boltzmann bridge is not an analogy. It is the same mathematics applied to the same objects: microstates (positions), macrostates (event classes), multiplicities (counts), and entropy (log-counts)."
— entropy-bridge.tex
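The bridge identity can be checked numerically. A minimal sketch in Python: the 16-character toy string and the `bridge` helper are invented for illustration (the page's own numbers use N = 1024), but the identity it verifies is exactly the one above.

```python
import math
from collections import Counter

def bridge(labels):
    """For a partition of N positions (microstates) into label classes
    (macrostates), return (H, log2(N) - <S_B>), where H is the Shannon
    entropy of the class distribution, W(m) is the multiplicity (count)
    of macrostate m, and <S_B> = sum_m p(m) * log2(W(m)) is the expected
    Boltzmann entropy."""
    n = len(labels)
    w = Counter(labels)                                   # multiplicities W(m)
    h = -sum((c / n) * math.log2(c / n) for c in w.values())
    sb = sum((c / n) * math.log2(c) for c in w.values())  # <S_B>
    return h, math.log2(n) - sb

# Toy dataset: 16 positions, 4 macrostates with W = 8, 4, 2, 2.
h, residual = bridge(list("aaaaaaaabbbbccdd"))
print(h)                           # 1.75 bits
assert abs(h - residual) < 1e-12   # H = log2 N - <S_B>, exactly
```

The assertion passes for any labeling, because both sides are the same sum rearranged: counts in, log-counts out.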

2. The Factoring Hierarchy

Every model defines a factoring of the dataset into macrostates. Each step adds event spaces to the macrostate description, absorbing more of log2 N into the Boltzmann term and leaving less residual Shannon entropy (bpc). The stacked chart below shows this tradeoff.

The Factoring Staircase: Absorbed vs Residual Entropy
For each factoring level, the purple bar shows ⟨S_B⟩ (Boltzmann entropy absorbed by the macrostates) and the colored bar shows H (residual Shannon entropy / bpc). Together they always sum to approximately 10 bits (= log2 1024).

Level Details (interactive panel): macrostates, ⟨S_B⟩ (absorbed), H (residual bpc), and % absorbed for the selected level.
Skip-8 UM achieves lower bpc than sat-rnn despite having fewer macrostates (834 vs ~1000) because its macrostates are more informative — each skip-pattern captures more of the data's structure per macrostate.

3. Microstates and Macrostates

At the most basic level, a microstate is a single position t in the dataset, carrying its full event description. A macrostate is an equivalence class formed by ignoring some event spaces — grouping positions that share a common property.

Microstate: a single position t with its full event description; each of the 1024 is unique.
Macrostate: an equivalence class formed by ignoring some event spaces; W positions share the class.
Example: "input = space" has W = 127, so S_B = log2(127) = 6.99 bits, absorbing 70% of log2 N.
Quotient: Q = N/W measures how surprising a macrostate is; for "space", Q = 1024/127 ≈ 8.1.

Top 10 Input Characters: Multiplicity W and log2 Q
Blue bars show multiplicity W (number of positions with that input character). Gold line shows log2 Q = log2(N/W) — the bits of surprise. Common characters have large W and low surprise; rare characters have small W and high surprise.
W = 127 for "space" (largest macrostate)
Q = 8.1 for "space" (N/W = 1024/127)
S_B = 6.99 bits for "space" (log2 127)
log2 Q = 3.01 bits of surprise for "space"
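All four of these values follow from counts alone. A minimal sketch, assuming only the page's figures N = 1024 and W = 127 for "input = space":

```python
import math

N = 1024   # positions (microstates)
W = 127    # multiplicity of the macrostate "input = space"

S_B = math.log2(W)       # Boltzmann entropy of the macrostate (bits)
Q = N / W                # quotient: equal-sized classes needed to tile the data
surprise = math.log2(Q)  # bits of surprise; note S_B + surprise = log2 N

print(round(S_B, 2), round(Q, 1), round(surprise, 2))  # 6.99 8.1 3.01
```

The split 6.99 + 3.01 = 10 bits is the bridge identity applied to a single macrostate: what the class absorbs plus what remains to be paid.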

4. The Partition Function

The UM softmax IS the Boltzmann distribution with β = ln 2. Each neuron's activation probability follows the same formula that statistical mechanics uses for thermal equilibrium. The forward pass is a chain of partition function calculations.

P(h_j^+) = σ(D_j ln 2) = 1/(1 + 2^(−D_j))
Boltzmann probability of neuron j being in the "on" state
β = ln 2 ≈ 0.693 (inverse temperature)
128 binary event spaces (neurons)
984 distinct states (out of 1024 positions)
Q_H ≈ 1.04 state quotient (near-unique states)
Boltzmann Probability vs Pre-Activation Dj
The sigmoid curve P(h^+) = 1/(1 + 2^(−D)) maps pre-activation to Boltzmann probability. Most neurons are deeply saturated (near 0 or 1), acting as deterministic Boolean gates. The few near the threshold carry the marginal predictive signal.
The forward pass is a chain of partition function calculations — the same computation that statistical mechanics uses to determine the equilibrium distribution of particles across energy levels.
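The β = ln 2 claim is a one-line algebraic fact: σ(D ln 2) = 1/(1 + e^(−D ln 2)) = 1/(1 + 2^(−D)). A minimal sketch checking it, with invented pre-activation values and a hypothetical helper name `boltzmann_on_prob`:

```python
import math

def boltzmann_on_prob(D):
    """P(h+) = 1 / (1 + 2**(-D)): a two-state Boltzmann distribution with
    inverse temperature beta = ln 2, where the pre-activation D plays the
    role of an energy gap measured in bits."""
    return 1.0 / (1.0 + 2.0 ** (-D))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for D in (-8.0, -1.0, 0.0, 1.0, 8.0):                    # invented values
    p = boltzmann_on_prob(D)
    assert abs(p - sigmoid(D * math.log(2))) < 1e-12     # same distribution
    print(f"D={D:+5.1f}  P(h+)={p:.4f}")                 # saturates toward 0 or 1
```

At D = ±8 the probability is already within 0.4% of 0 or 1, which is the saturation behavior the chart describes.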

5. Q = λ: The Quotient is the Luck

For any macrostate m, the quotient Q(m) = N/W(m) counts how many macrostates of equal size would tile the dataset. But Q(m) = N/W(m) = 1/p(m) = λ(m), the luck: how surprised the model is when event m occurs.

Q(m) = N/W(m) = 1/p(m) = λ(m)
Quotient = inverse probability = luck

The bpc is the expected log-luck: bpc = E[log2 Q]. Compression improves bpc by increasing W(m) for the correct macrostates — by finding factorings that group more microstates into fewer, larger macrostates.

Quotient Q Across Factoring Levels
Each factoring level has a different average Q. Better factorings produce larger macrostates (higher W), which means lower Q, lower surprise, and lower bpc. The horizontal dashed line at Q = 1 represents the theoretical floor (zero surprise).
Each step down the factoring hierarchy is a step toward lower average luck.
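Both claims, bpc = E[log2 Q] and "larger W means lower bpc", can be checked on a toy factoring. A minimal sketch; the 16-position labelings and the helper name `expected_log_luck` are invented for illustration:

```python
import math
from collections import Counter

def expected_log_luck(labels):
    """bpc = E[log2 Q], where Q(m) = N / W(m) = 1/p(m) is the luck."""
    n = len(labels)
    w = Counter(labels)
    # Average the per-position surprise log2(N / W(m_t)) over all positions t.
    return sum(math.log2(n / w[m]) for m in labels) / n

labels = list("aaaaaaaabbbbccdd")       # toy factoring: W = 8, 4, 2, 2
bpc = expected_log_luck(labels)
print(round(bpc, 2))                    # 1.75 = residual Shannon entropy H

coarser = list("aaaaaaaabbbbxxxx")      # merge the two rare classes: W grows
assert expected_log_luck(coarser) < bpc # larger W, lower Q, lower bpc
```

The expected log-luck of a factoring is just its residual Shannon entropy written position by position, which is why growing the multiplicities of the classes that actually occur is the same thing as compressing.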

6. Compression is Factoring

The unfactored event space E with |E| = 2^128 is a lookup table: it can represent any function from histories to outputs, achieving zero entropy. But it requires 2^128 parameters — one per possible state.

Factoring E into 128 binary event spaces gives the same expressiveness (2^128 possible states) but in only 128 bits. This makes the structure learnable: instead of memorizing 2^128 entries, the model learns 128 binary decision boundaries.

2^128 unfactored params (lookup table)
128 factored dimensions (binary event spaces)
2^128 expressiveness (same either way)
0.079 achieved bpc (sat-rnn)
The Factoring Tradeoff: Parameters vs Learnability
The unfactored lookup table (left) has zero residual entropy but 2^128 parameters. The 128-neuron factoring (right) achieves 0.079 bpc with 82k learnable parameters. The factoring sacrifices a tiny amount of expressiveness for massive gains in learnability.
"The UM's learning problem is: find the factoring that minimizes conditional entropy (bpc) subject to the constraint that the factoring must be expressible as patterns over the event spaces."
— entropy-bridge.tex
Compression is factoring. The hierarchy from unfactored (8.0 bpc, 1 macrostate) through unigram, bigram, RNN, to full specification (0.0 bpc, 1024 macrostates) is the compression curve itself. Each level finds a better factoring of the same underlying event space.
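The parameter arithmetic behind the tradeoff is easy to verify. A minimal sketch (variable names are invented; the 82k learnable-parameter figure is the page's own number and is not derived here):

```python
# Expressiveness vs parameter count for a factored event space.
# A lookup table over k binary event spaces needs 2**k entries (one per
# joint state); factoring into k binary spaces names any of the same
# 2**k states with just k bits.
k = 128
lookup_table_params = 2 ** k     # one parameter per joint state
factored_dims = k                # one binary event space per bit

assert lookup_table_params == 2 ** factored_dims   # identical state space
print(f"states: 2^{k} = {lookup_table_params:.3e}")
print(f"bits to name one state: {factored_dims}")
```

The state space is identical either way; what changes is that 2^128 independent table entries become 128 decision boundaries that gradient descent can actually fit.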

Related Experiment Pages