What This Experiment Shows
Each of the 128 neurons holds a 32-bit floating-point number, giving a theoretical state space of 2⁴⁰⁹⁶. But how many of those bits actually matter? This experiment sets H = 2³² and treats the IEEE 754 decomposition (sign/exponent/mantissa) as just one of many possible factorizations of the 32-bit space.
The answer: the sign bit is the computation, the exponent is routing, and the mantissa is noise. The effective state is 128 bits — one sign per neuron. Removing the mantissa makes the model better.
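The sign/exponent/mantissa fields referenced throughout can be pulled out of a float32 bit pattern with a few shifts. A minimal sketch, following the IEEE 754 single-precision layout (bit 31 sign, bits 23–30 exponent, bits 0–22 mantissa):

```python
import struct

def f32_fields(x: float) -> tuple[int, int, int]:
    """Split a float32 bit pattern into (sign, exponent, mantissa) fields."""
    w = struct.unpack("<I", struct.pack("<f", x))[0]  # raw 32-bit pattern
    sign = (w >> 31) & 0x1        # bit 31: topology
    exponent = (w >> 23) & 0xFF   # bits 23-30: magnitude
    mantissa = w & 0x7FFFFF       # bits 0-22: precision
    return sign, exponent, mantissa

# A saturated neuron (h = -1.0 exactly): exponent field 127, mantissa 0
assert f32_fields(-1.0) == (1, 127, 0)
# An unsaturated neuron with |h| in [0.5, 1): exponent field 126
assert f32_fields(0.855)[:2] == (0, 126)
```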
"The tanh activation and f32 mantissa exist for training (gradient flow through the saturation gate); at inference, the Boolean function encoded in the weight signs and magnitudes is the entire computation."
— h32.tex
The Numbers at a Glance
496
effective state bits at t=42 (not 4096)
300:52:1
per-bit leverage (sign:exp:mant)
31.6
sign changes per step (25% of neurons)
−0.139
bpc improvement from zeroing mantissa
Two Factorizations of 32 Bits
Every f32 value has 32 individually addressable bits. There are two ways to interpret them:
Hardware Factorization (IEEE 754)
Dynamical Factorization (Measured)
Flip each bit and measure KL divergence on the output:
| Bit range | Channel | Mean KL (bits) | Dynamical role |
|---|---|---|---|
| 0–4 | mantissa (low) | < 10⁻⁶ | dead |
| 5–14 | mantissa (mid) | < 10⁻⁴ | dormant |
| 15–22 | mantissa (high) | 0.00044/bit | active memory |
| 23–29 | exponent | 0.0080/bit | importance |
| 31 | sign | 0.046 | topology |
Per-Bit KL Leverage (log scale)
300:52:1 hierarchy: Sign is 300× more important per bit than mantissa, exponent is 52×. The hardware factorization aligns with the dynamical factorization at single-step leverage: sign is topology, exponent is importance, mantissa is noise.
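The single-bit-flip probe behind this table is easy to reproduce in miniature. The KL numbers require the trained model, but even the raw magnitude of the perturbation shows the same hierarchy; a toy illustration:

```python
import struct

def flip_bit(x: float, i: int) -> float:
    """Return x with bit i of its float32 pattern flipped."""
    w = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", w ^ (1 << i)))[0]

h = 0.855  # an unsaturated neuron state
print(flip_bit(h, 31))  # bit 31 (sign):      -0.855, negates the state
print(flip_bit(h, 23))  # bit 23 (exp LSB):   1.71, doubles the state
print(flip_bit(h, 0))   # bit 0 (mant LSB):   changes the state by ~6e-8
```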
State Snapshot: t = 42
112
saturated (E=127, |h|=1.0 exactly)
16
unsaturated (E=126, |h| ∈ [0.5,1))
17
unique f32 bit patterns
68/60
positive / negative signs
Effective State Decomposition
Saturated neurons carry exactly 1 bit each (the sign). Unsaturated neurons carry ~24 bits each (1 sign + 23 mantissa, exponent fixed at 126). Total: 112 × 1 + 16 × 24 = 496 bits — not 4,096 and not 128.
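The 496-bit figure is simple accounting over the snapshot; as a check:

```python
# Effective state at t = 42: saturated neurons contribute only their sign bit;
# unsaturated neurons contribute sign + 23 mantissa bits (exponent pinned at 126).
saturated, unsaturated = 112, 16
effective_bits = saturated * 1 + unsaturated * (1 + 23)
assert effective_bits == 496  # not 4096 (all bits), not 128 (signs only)
```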
But for prediction, it is just 128 bits. The sign-only readout achieves 99.7% of the compression gap, meaning the analog bits contribute only 0.007 bpc × 1023 positions ≈ 7 bits total. The effective state for prediction is one sign bit per neuron.
Mantissa Ablation: Dynamics vs Readout
Five ways to run the RNN, varying where the mantissa is used:
| Mode | bpc | Δ from full | Description |
|---|---|---|---|
| full f32 | 5.721 | 0 | baseline |
| sign readout only | 5.728 | +0.007 | full dynamics, snap h → ±1 for Wy |
| sign dynamics | 5.690 | −0.031 | snap h → ±1 after each step |
| zero-mant readout | 5.637 | −0.084 | full dynamics, zero mantissa for Wy |
| zero-mant dynamics | 5.582 | −0.139 | zero mantissa after each step |
Mantissa Ablation: Lower is Better
The mantissa is noise: Every variant that removes the mantissa from dynamics outperforms full f32. Zero-mantissa dynamics is 0.139 bpc better. Sign-only dynamics (a pure 128-bit Boolean automaton) is 0.031 bpc better. The mantissa actively degrades both prediction and dynamics.
A different trajectory, a better result: Sign-only dynamics has 52.2% sign agreement with full f32 after 100 steps — barely above chance. It's on a completely different trajectory through state space. Yet it compresses better. The weights encode a good Boolean function that is obscured by mantissa noise.
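Both dynamics-level ablations amount to a one-line state transform applied after each recurrent step. A sketch with NumPy; the function names are illustrative, not the experiment code in h32.tex:

```python
import numpy as np

def zero_mantissa(h: np.ndarray) -> np.ndarray:
    """Clear mantissa bits 0-22, keeping sign and exponent (zero-mant dynamics)."""
    w = h.astype(np.float32).view(np.uint32)
    return (w & np.uint32(0xFF800000)).view(np.float32)

def sign_snap(h: np.ndarray) -> np.ndarray:
    """Snap every state to ±1 (the pure 128-bit Boolean automaton)."""
    return np.where(h >= 0, 1.0, -1.0).astype(np.float32)

h = np.array([0.855, -1.0, 0.501], dtype=np.float32)
print(zero_mantissa(h))  # [ 0.5 -1.   0.5]
print(sign_snap(h))      # [ 1. -1.  1.]
```

`0xFF800000` is the mask that keeps bit 31 (sign) and bits 23–30 (exponent), so a saturated ±1.0 passes through unchanged while any value in [0.5, 1) collapses to 0.5.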
Mantissa Bit Sweep
Quantize the mantissa to 0–23 bits during dynamics:
bpc vs Mantissa Precision
Every level except 8 bits improves on full f32 (dashed line). Fewer bits → better bpc. The mantissa is not a graded resource — it is interference.
| Mantissa bits | bpc | Δ from full |
|---|---|---|
| 0 (zero-mant) | 5.582 | −0.139 |
| 1 | 5.592 | −0.129 |
| 2 | 5.696 | −0.025 |
| 4 | 5.636 | −0.085 |
| 8 | 5.740 | +0.019 |
| 12 | 5.621 | −0.100 |
| 16 | 5.682 | −0.039 |
| 20 | 5.614 | −0.107 |
| 23 (full) | 5.721 | 0 |
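The sweep corresponds to masking the low mantissa bits after each step. A sketch assuming truncation toward zero (h32.tex may round instead; `quantize_mantissa` is an illustrative name):

```python
import numpy as np

def quantize_mantissa(h: np.ndarray, bits: int) -> np.ndarray:
    """Keep the top `bits` mantissa bits (0 <= bits <= 23), zeroing the rest."""
    mask = np.uint32((0xFFFFFFFF << (23 - bits)) & 0xFFFFFFFF)
    w = h.astype(np.float32).view(np.uint32)
    return (w & mask).view(np.float32)

h = np.array([0.855], dtype=np.float32)
print(quantize_mantissa(h, 0))   # [0.5]  -> identical to zero-mant
print(quantize_mantissa(h, 1))   # [0.75] -> one mantissa bit survives
print(quantize_mantissa(h, 23))  #        -> full f32, unchanged
```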
Bit Propagation Through Time
Flip one bit of unsaturated neuron h8 (|h| = 0.855) at t = 42 and measure downstream KL:
KL Divergence After Bit Flip
Sign bit amplifies, mantissa decays: Flipping the sign bit causes 0.125 bits of KL at t=43, growing to 0.729 at t=46, a 6× amplification. It is a global perturbation: it changes the sign of the Wh[·, j] contribution to all 128 neurons. The mantissa MSB causes 0.0004 bits of KL at t=43, then decays. The mantissa LSB causes zero KL at all future times.
State Statistics Across All Positions
| Metric | Value |
|---|---|
| Mean saturated neurons (\|h\| ≥ 0.999) | 123.0 / 128 |
| Mean unsaturated | 5.0 / 128 |
| Mean sign changes per step | 31.6 |
| Mean bpc (full f32) | 5.721 |
| Mean bpc (sign-only readout) | 5.728 |
| Mean bpc (zero-mantissa readout) | 5.637 |
"The sign bits carry the long-range memory (which patterns have been observed). The mantissa bits carry the short-range precision (how recently, how strongly). The sign bits change rarely. The mantissa bits change continuously."
— h32.tex
The Revised Picture
The sat-rnn is a 128-bit Boolean automaton with a 2-value exponent channel:
128 signs
the computation
Long-range memory. Which patterns have been seen. 99.7% of compression.
~5 exponents
the routing
Which neurons are unsaturated ("open gates"). Changes every step.
23×128 mantissa
noise (needed for training)
Enables gradient flow during BPTT. Harmful at inference. Removing it improves bpc by 0.14.
"The weight matrices encode the Boolean transition function σ_{t+1} = f(σ_t, x_t) where σ is the sign vector and x is the input byte. The mantissa is the price paid for differentiable training of a Boolean function."
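Under this view, inference reduces to a sign-vector automaton. A minimal sketch with random stand-in weights (Wh, Wx, and the per-byte input column are placeholders, not the trained sat_model.bin parameters):

```python
import numpy as np

N = 128
rng = np.random.default_rng(0)
Wh = rng.standard_normal((N, N)).astype(np.float32)    # stand-in recurrent weights
Wx = rng.standard_normal((N, 256)).astype(np.float32)  # stand-in per-byte input weights

def step(sigma: np.ndarray, byte: int) -> np.ndarray:
    """sigma_{t+1} = f(sigma_t, x_t): the sign of the pre-activation replaces tanh."""
    pre = Wh @ sigma + Wx[:, byte]
    return np.where(pre >= 0, 1.0, -1.0).astype(np.float32)

sigma = np.ones(N, dtype=np.float32)  # all-saturated start state
for byte in b"enwik":                 # drive with a few input bytes
    sigma = step(sigma, byte)
print(int((sigma > 0).sum()), "positive signs of", N)
```

The state never leaves {−1, +1}ᴺ, so each step is a pure Boolean update; the f32 arithmetic inside `step` only computes which side of zero each pre-activation lands on.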
— h32.tex
Tool: experiments in h32.tex · Model: sat_model.bin from archive/20260209 · Data: first 1024 bytes of enwik9