The Entropy Bridge
H = log2 N − ⟨SB⟩ — Shannon meets Boltzmann on the same objects
1. The Entropy Bridge
Shannon entropy and Boltzmann entropy are the same mathematics applied to
the same objects. Dataset positions are microstates. Event classes are
macrostates. Counts are multiplicities. And entropy is log-counts.
H = log2 N − ⟨SB⟩
Residual Shannon entropy = Total information − Expected Boltzmann entropy
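The identity can be checked directly. This is a minimal sketch: the multiplicities below are hypothetical, chosen only so they sum to N = 1024; it verifies that the Shannon entropy of the macrostate distribution equals log2 N minus the expected Boltzmann entropy E[log2 W].

```python
import math

# Hypothetical multiplicities W(m) of four macrostates; they sum to N = 1024
counts = [512, 256, 128, 128]
N = sum(counts)

# Residual Shannon entropy of the macrostate distribution, p(m) = W(m) / N
H = -sum((w / N) * math.log2(w / N) for w in counts)

# Expected Boltzmann entropy <S_B> = E[log2 W(m)]
S_B = sum((w / N) * math.log2(w) for w in counts)

# The bridge: H = log2 N - <S_B>
assert abs(H - (math.log2(N) - S_B)) < 1e-12
print(H, S_B)  # 1.75 8.25
```

With these counts, 8.25 of the 10 available bits are absorbed by the macrostate description, leaving 1.75 bits of residual Shannon entropy.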
- N = 1024 positions (microstates)
- log2 N = 10.0 bits (total information)
- 52 distinct chars (unigram macrostates)
- 0.079 bpc for sat-rnn (residual H)
"The Shannon-Boltzmann bridge is not an analogy. It is the same mathematics
applied to the same objects: microstates (positions), macrostates (event classes),
multiplicities (counts), and entropy (log-counts)."
— entropy-bridge.tex
2. The Factoring Hierarchy
Every model defines a factoring of the dataset into macrostates. Each step
adds event spaces to the macrostate description, absorbing more of
log2 N into the Boltzmann term and leaving less residual
Shannon entropy (bpc). The stacked chart below shows this tradeoff.
The Factoring Staircase: Absorbed vs Residual Entropy
For each factoring level, the purple bar shows ⟨SB⟩ (Boltzmann entropy absorbed by the macrostates) and the colored bar shows H (residual Shannon entropy / bpc). Together they always sum to approximately 10 bits (= log2 1024).
Skip-8 UM achieves lower bpc than sat-rnn despite having fewer
macrostates (834 vs ~1000) because its macrostates are more informative
— each skip-pattern captures more of the data's structure per macrostate.
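The absorbed/residual split in the staircase is a single subtraction. A quick sketch using the two figures the text gives (log2 1024 = 10 bits of total information, 0.079 bpc residual for sat-rnn):

```python
import math

N = 1024
total_bits = math.log2(N)   # 10.0 bits of information per position
bpc_sat_rnn = 0.079         # residual Shannon entropy for sat-rnn (from the text)

# <S_B> absorbed by sat-rnn's macrostates; absorbed + residual = log2 N
absorbed = total_bits - bpc_sat_rnn
print(round(absorbed, 3))  # 9.921
```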
3. Microstates and Macrostates
At the most basic level, a microstate is a single position
t in the dataset, carrying its full event description. A macrostate
is an equivalence class formed by ignoring some event spaces — grouping
positions that share a common property.
- Microstate: a single position t with its full event description. Each of the 1024 is unique.
- Macrostate: an equivalence class formed by ignoring some event spaces. W positions share the class.
- Example ("input = space"): W = 127, so SB = log2(127) = 6.99 bits, absorbing 70% of log2 N.
- Quotient: Q = N/W. For "space", Q = 1024/127 = 8.1, a measure of how surprising the macrostate is.
Top 10 Input Characters: Multiplicity W and log2 Q
Blue bars show multiplicity W (number of positions with that input character).
Gold line shows log2 Q = log2(N/W) — the bits of surprise.
Common characters have large W and low surprise; rare characters have small W and high surprise.
- W = 127 for "space" (largest macrostate)
- Q = 1024/127 = 8.1 for "space"
- SB = log2(127) = 6.99 bits for "space"
- log2 Q = 3.01 bits of surprise
4. The Partition Function
The UM softmax IS the Boltzmann distribution with β = ln 2.
Each neuron's activation probability follows the same formula that
statistical mechanics uses for thermal equilibrium. The forward pass
is a chain of partition function calculations.
P(hj+) = σ(Dj ln 2) = 1/(1 + 2^(−Dj))
Boltzmann probability of neuron j being in the "on" state, at inverse temperature β = ln 2 ≈ 0.693
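A minimal sketch of the identity σ(D ln 2) = 1/(1 + 2^−D), i.e. the logistic sigmoid evaluated at inverse temperature β = ln 2 (the function name is mine):

```python
import math

def p_on(D: float) -> float:
    """Boltzmann probability of the 'on' state at beta = ln 2."""
    return 1.0 / (1.0 + 2.0 ** (-D))

# Identical to the logistic sigmoid evaluated at D * ln 2
for D in (-8.0, -1.0, 0.0, 1.0, 8.0):
    sigma = 1.0 / (1.0 + math.exp(-D * math.log(2)))
    assert abs(p_on(D) - sigma) < 1e-12

print(p_on(0.0))            # 0.5: a neuron exactly at threshold
print(round(p_on(8.0), 4))  # 0.9961: deeply saturated, effectively a Boolean gate
```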
- 128 binary event spaces (neurons)
- 984 distinct states (out of 1024 positions)
- QH ≈ 1.04 state quotient (near-unique states)
Boltzmann Probability vs Pre-Activation Dj
The sigmoid curve P(h+) = 1/(1 + 2^−D) maps pre-activation
to Boltzmann probability. Most neurons are deeply saturated (near 0 or 1), acting as
deterministic Boolean gates. The few near the threshold carry the marginal predictive signal.
The forward pass is a chain of partition function calculations
— the same computation that statistical mechanics uses to determine
the equilibrium distribution of particles across energy levels.
5. Q = λ: The Quotient is the Luck
For any macrostate m, the quotient Q(m) = N/W(m) counts how many macrostates
of equal size would tile the dataset. But Q(m) = N/W(m) = 1/p(m) = λ(m),
the luck: how surprised the model is when event m occurs.
Q(m) = N/W(m) = 1/p(m) = λ(m)
Quotient = inverse probability = luck
The bpc is the expected log-luck: bpc = E[log2 Q].
Compression improves bpc by increasing W(m) for the correct macrostates —
by finding factorings that group more microstates into fewer, larger macrostates.
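A sketch with hypothetical multiplicities: a coarse factoring (few large macrostates) has a lower expected log-luck E[log2 Q] than a fine one (many small macrostates).

```python
import math

def bpc(multiplicities):
    """Expected log-luck E[log2 Q] over macrostates with the given W(m)."""
    N = sum(multiplicities)
    return sum((w / N) * math.log2(N / w) for w in multiplicities)

coarse = [512, 512]    # two large macrostates: Q = 2 everywhere
fine = [4] * 256       # 256 small macrostates: Q = 256 everywhere

print(bpc(coarse), bpc(fine))  # 1.0 8.0
```

Same N = 1024 positions in both cases; grouping them into larger macrostates drops the expected surprise from 8 bits to 1.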
Quotient Q Across Factoring Levels
Each factoring level has a different average Q. Better factorings produce larger
macrostates (higher W), which means lower Q, lower surprise, and lower bpc.
The horizontal dashed line at Q = 1 represents the theoretical floor (zero surprise).
Each step down the factoring hierarchy is a step toward lower average luck.
6. Compression is Factoring
The unfactored event space E with |E| = 2^128 is a lookup table:
it can represent any function from histories to outputs, achieving zero entropy.
But it requires 2^128 parameters — one per possible state.
Factoring E into 128 binary event spaces gives the same expressiveness
(2^128 possible states) but in only 128 bits. This makes the
structure learnable: instead of memorizing 2^128 entries,
the model learns 128 binary decision boundaries.
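A sketch of the parameter-count arithmetic: 128 binary event spaces index exactly the 2^128 states that an unfactored lookup table would need one entry for each.

```python
n = 128

unfactored_params = 2 ** n   # one lookup-table entry per possible state
factored_dims = n            # one binary event space (neuron) per dimension
states = 2 ** n              # expressiveness is the same either way

assert unfactored_params == states
print(len(str(unfactored_params)))  # 2**128 has 39 decimal digits
```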
- 2^128 unfactored params (lookup table)
- 128 factored dimensions (binary event spaces)
- 2^128 expressiveness (same either way)
- 0.079 achieved bpc (sat-rnn)
The Factoring Tradeoff: Parameters vs Learnability
The unfactored lookup table (left) has zero residual entropy but 2^128
parameters. The 128-neuron factoring (right) achieves 0.079 bpc with 82k learnable
parameters. The factoring sacrifices a tiny amount of expressiveness for massive
gains in learnability.
"The UM's learning problem is: find the factoring that minimizes conditional
entropy (bpc) subject to the constraint that the factoring must be expressible
as patterns over the event spaces."
— entropy-bridge.tex
Compression is factoring. The hierarchy from unfactored (8.0 bpc,
1 macrostate) through unigram, bigram, RNN, to full specification (0.0 bpc,
1024 macrostates) is the compression curve itself. Each level finds a better
factoring of the same underlying event space.