What This Experiment Shows
Each of the 128 neurons holds a 32-bit floating-point number, giving a theoretical state space of 2⁴⁰⁹⁶. But how many of those bits actually matter? This experiment sets H = 2³² and treats the IEEE 754 decomposition (sign/exponent/mantissa) as just one of many possible factorizations of the 32-bit space.
The answer: the sign bit is the computation, the exponent is routing, and the mantissa is noise. The effective state is 128 bits — one sign per neuron. Removing the mantissa makes the model better.
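The sign/exponent/mantissa fields referenced throughout can be pulled out of a float32 bit pattern with a few shifts. A minimal sketch, following the IEEE 754 single-precision layout (bit 31 sign, bits 23–30 exponent, bits 0–22 mantissa):

```python
import struct

def f32_fields(x: float) -> tuple[int, int, int]:
    """Split a float32 bit pattern into (sign, exponent, mantissa) fields."""
    w = struct.unpack("<I", struct.pack("<f", x))[0]  # raw 32-bit pattern
    sign = (w >> 31) & 0x1        # bit 31: topology
    exponent = (w >> 23) & 0xFF   # bits 23-30: magnitude
    mantissa = w & 0x7FFFFF       # bits 0-22: precision
    return sign, exponent, mantissa

# A saturated neuron (h = -1.0 exactly): exponent field 127, mantissa 0
assert f32_fields(-1.0) == (1, 127, 0)
# An unsaturated neuron with |h| in [0.5, 1): exponent field 126
assert f32_fields(0.855)[:2] == (0, 126)
```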
"The tanh activation and f32 mantissa exist for training (gradient flow through the saturation gate); at inference, the Boolean function encoded in the weight signs and magnitudes is the entire computation."
— h32.tex
The Numbers at a Glance
496
effective state bits at t=42 (not 4096)
300:52:1
per-bit leverage (sign:exp:mant)
31.6
sign changes per step (25% of neurons)
−0.139
bpc improvement from zeroing mantissa
Two Factorizations of 32 Bits
Every f32 value has 32 individually addressable bits. There are two ways to interpret them:
Hardware Factorization (IEEE 754)
Dynamical Factorization (Measured)
Flip each bit and measure KL divergence on the output:
| Bit range | Channel | Mean KL (bits) | Dynamical role |
|---|---|---|---|
| 0–4 | mantissa (low) | < 10⁻⁶ | dead |
| 5–14 | mantissa (mid) | < 10⁻⁴ | dormant |
| 15–22 | mantissa (high) | 0.00044/bit | active memory |
| 23–29 | exponent | 0.0080/bit | importance |
| 31 | sign | 0.046 | topology |
Per-Bit KL Leverage (log scale)
300:52:1 hierarchy: Sign is 300× more important per bit than mantissa, exponent is 52×. The hardware factorization aligns with the dynamical factorization at single-step leverage: sign is topology, exponent is importance, mantissa is noise.
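The single-bit-flip probe behind this table is easy to reproduce in miniature. The KL numbers require the trained model, but even the raw magnitude of the perturbation shows the same hierarchy; a toy illustration:

```python
import struct

def flip_bit(x: float, i: int) -> float:
    """Return x with bit i of its float32 pattern flipped."""
    w = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", w ^ (1 << i)))[0]

h = 0.855  # an unsaturated neuron state
print(flip_bit(h, 31))  # bit 31 (sign):      -0.855, negates the state
print(flip_bit(h, 23))  # bit 23 (exp LSB):   1.71, doubles the state
print(flip_bit(h, 0))   # bit 0 (mant LSB):   changes the state by ~6e-8
```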
State Snapshot: t = 42
112
saturated (E=127, |h|=1.0 exactly)
16
unsaturated (E=126, |h| ∈ [0.5,1))
17
unique f32 bit patterns
68/60
positive / negative signs
Effective State Decomposition
Saturated neurons carry exactly 1 bit each (the sign). Unsaturated neurons carry ~24 bits each (1 sign + 23 mantissa, exponent fixed at 126). Total: 112 × 1 + 16 × 24 = 496 bits — not 4,096 and not 128.
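The 496-bit figure is simple accounting over the snapshot; as a check:

```python
# Effective state at t = 42: saturated neurons contribute only their sign bit;
# unsaturated neurons contribute sign + 23 mantissa bits (exponent pinned at 126).
saturated, unsaturated = 112, 16
effective_bits = saturated * 1 + unsaturated * (1 + 23)
assert effective_bits == 496  # not 4096 (all bits), not 128 (signs only)
```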
But for prediction, it is just 128 bits. The sign-only readout achieves 99.7% of the compression gap, meaning the analog bits contribute only 0.007 bpc × 1023 positions ≈ 7 bits total. The effective state for prediction is one sign bit per neuron.
Mantissa Ablation: Dynamics vs Readout
Five ways to run the RNN, varying where the mantissa is used:
| Mode | bpc | Δ from full | Description |
|---|---|---|---|
| full f32 | 5.721 | 0 | baseline |
| sign readout only | 5.728 | +0.007 | full dynamics, snap h → ±1 for Wy |
| sign dynamics | 5.690 | −0.031 | snap h → ±1 after each step |
| zero-mant readout | 5.637 | −0.084 | full dynamics, zero mantissa for Wy |
| zero-mant dynamics | 5.582 | −0.139 | zero mantissa after each step |
Mantissa Ablation: Lower is Better
The mantissa is noise: Every variant that removes the mantissa from dynamics outperforms full f32. Zero-mantissa dynamics is 0.139 bpc better. Sign-only dynamics (a pure 128-bit Boolean automaton) is 0.031 bpc better. The mantissa actively degrades both prediction and dynamics.
A different trajectory, a better result: Sign-only dynamics has 52.2% sign agreement with full f32 after 100 steps — barely above chance. It's on a completely different trajectory through state space. Yet it compresses better. The weights encode a good Boolean function that is obscured by mantissa noise.
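Both dynamics-level ablations amount to a one-line state transform applied after each recurrent step. A sketch with NumPy; the function names are illustrative, not the experiment code in h32.tex:

```python
import numpy as np

def zero_mantissa(h: np.ndarray) -> np.ndarray:
    """Clear mantissa bits 0-22, keeping sign and exponent (zero-mant dynamics)."""
    w = h.astype(np.float32).view(np.uint32)
    return (w & np.uint32(0xFF800000)).view(np.float32)

def sign_snap(h: np.ndarray) -> np.ndarray:
    """Snap every state to ±1 (the pure 128-bit Boolean automaton)."""
    return np.where(h >= 0, 1.0, -1.0).astype(np.float32)

h = np.array([0.855, -1.0, 0.501], dtype=np.float32)
print(zero_mantissa(h))  # [ 0.5 -1.   0.5]
print(sign_snap(h))      # [ 1. -1.  1.]
```

`0xFF800000` is the mask that keeps bit 31 (sign) and bits 23–30 (exponent), so a saturated ±1.0 passes through unchanged while any value in [0.5, 1) collapses to 0.5.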
Mantissa Bit Sweep
Quantize the mantissa to 0–23 bits during dynamics:
bpc vs Mantissa Precision
Every level except 8 bits improves on full f32 (dashed line). Fewer bits → better bpc. The mantissa is not a graded resource — it is interference.
| Mantissa bits | bpc | Δ from full |
|---|---|---|
| 0 (zero-mant) | 5.582 | −0.139 |
| 1 | 5.592 | −0.129 |
| 2 | 5.696 | −0.025 |
| 4 | 5.636 | −0.085 |
| 8 | 5.740 | +0.019 |
| 12 | 5.621 | −0.100 |
| 16 | 5.682 | −0.039 |
| 20 | 5.614 | −0.107 |
| 23 (full) | 5.721 | 0 |
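The sweep corresponds to masking the low mantissa bits after each step. A sketch assuming truncation toward zero (h32.tex may round instead; `quantize_mantissa` is an illustrative name):

```python
import numpy as np

def quantize_mantissa(h: np.ndarray, bits: int) -> np.ndarray:
    """Keep the top `bits` mantissa bits (0 <= bits <= 23), zeroing the rest."""
    mask = np.uint32((0xFFFFFFFF << (23 - bits)) & 0xFFFFFFFF)
    w = h.astype(np.float32).view(np.uint32)
    return (w & mask).view(np.float32)

h = np.array([0.855], dtype=np.float32)
print(quantize_mantissa(h, 0))   # [0.5]  -> identical to zero-mant
print(quantize_mantissa(h, 1))   # [0.75] -> one mantissa bit survives
print(quantize_mantissa(h, 23))  #        -> full f32, unchanged
```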
Bit Propagation Through Time
Flip one bit of unsaturated neuron h8 (|h| = 0.855) at t = 42 and measure downstream KL:
KL Divergence After Bit Flip
Sign bit amplifies, mantissa decays: Flipping the sign bit causes 0.125 bits of KL at t=43, growing to 0.729 at t=46, a 6× amplification. It is a global perturbation: it changes the sign of the Wh[·, j] contribution to all 128 neurons. The mantissa MSB causes 0.0004 bits of KL at t=43, then decays. The mantissa LSB causes zero KL at all future times.
State Statistics Across All Positions
| Metric | Value |
|---|---|
| Mean saturated neurons (\|h\| ≥ 0.999) | 123.0 / 128 |
| Mean unsaturated | 5.0 / 128 |
| Mean sign changes per step | 31.6 |
| Mean bpc (full f32) | 5.721 |
| Mean bpc (sign-only readout) | 5.728 |
| Mean bpc (zero-mantissa readout) | 5.637 |
"The sign bits carry the long-range memory (which patterns have been observed). The mantissa bits carry the short-range precision (how recently, how strongly). The sign bits change rarely. The mantissa bits change continuously."
— h32.tex
The Revised Picture
The sat-rnn is a 128-bit Boolean automaton with a 2-value exponent channel:
128 signs
the computation
Long-range memory. Which patterns have been seen. 99.7% of compression.
~5 exponents
the routing
Which neurons are unsaturated ("open gates"). Changes every step.
23×128 mantissa
noise (needed for training)
Enables gradient flow during BPTT. Harmful at inference. Removing it improves bpc by 0.14.
"The weight matrices encode the Boolean transition function σ_{t+1} = f(σ_t, x_t) where σ is the sign vector and x is the input byte. The mantissa is the price paid for differentiable training of a Boolean function."
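Under this view, inference reduces to a sign-vector automaton. A minimal sketch with random stand-in weights (Wh, Wx, and the per-byte input column are placeholders, not the trained sat_model.bin parameters):

```python
import numpy as np

N = 128
rng = np.random.default_rng(0)
Wh = rng.standard_normal((N, N)).astype(np.float32)    # stand-in recurrent weights
Wx = rng.standard_normal((N, 256)).astype(np.float32)  # stand-in per-byte input weights

def step(sigma: np.ndarray, byte: int) -> np.ndarray:
    """sigma_{t+1} = f(sigma_t, x_t): the sign of the pre-activation replaces tanh."""
    pre = Wh @ sigma + Wx[:, byte]
    return np.where(pre >= 0, 1.0, -1.0).astype(np.float32)

sigma = np.ones(N, dtype=np.float32)  # all-saturated start state
for byte in b"enwik":                 # drive with a few input bytes
    sigma = step(sigma, byte)
print(int((sigma > 0).sum()), "positive signs of", N)
```

The state never leaves {−1, +1}ᴺ, so each step is a pure Boolean update; the f32 arithmetic inside `step` only computes which side of zero each pre-activation lands on.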
— h32.tex
Tool: experiments in h32.tex · Model: sat_model.bin from archive/20260209 · Data: first 1024 bytes of enwik9