Q1 Exact: f32 vs MPFR-256

Programs 1–6 from q1_exact_p1–p6, q1_lyapunov, q1_bit_sample — 2026-02-11
"The f32 error vector aligns with the dominant backward direction. It is wrong about 'how much' but right about 'what matters.'"

The Headline Numbers

We ran the sat-rnn (128 hidden, 0.079 bpc) forward and backward in both f32 and MPFR-256 exact arithmetic. The forward pass is well-behaved. The backward pass decorrelates in a single step. Yet pattern rankings are perfectly preserved.

128/128   Forward sign agreement at t=42
21.5      Mantissa bits (of 23), forward
4.8       Mantissa bits in gradient
0.7       Mantissa bits at BPTT d=1
3.44      Lyapunov error ratio (transient)
1.000     Pattern rank ρ at d≥11
"At depth, f32 is wrong about 'how much' but right about 'what matters.'"
— q1-exact-results.tex, Observation 8

Program 1: Forward Pass (t = 42)

Hidden state h42 after 42 exact timesteps vs f32. The forward pass is remarkably stable.

Metric                Value
Sign agreement        128 / 128
Exponent agreement    124 / 128
Mean mantissa bits    21.5 / 23
Max absolute error    4.74 × 10⁻¹
Mean absolute error   8.30 × 10⁻³
Near-zero neurons     0

Saturation Distribution

125   |h| ≥ 0.95 (saturated)
1     |h| ∈ [0.5, 0.75)
1     |h| ∈ [0.25, 0.5)
112   Exponent = 127 (|h| ≈ 1.0)
Forward pass preserves structure. After 42 steps, sign and exponent are essentially preserved (128/128 sign, 124/128 exponent). The 4 exponent disagreements come from neurons near the |h| = 1.0 boundary where f32(tanh) = 1.0 but exact(tanh) = 0.999...
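The boundary effect described above is easy to reproduce without the model. A minimal stdlib sketch (f64 stands in for MPFR-256 here, and the pre-activation value is made up): a saturated tanh output that is strictly below 1 in higher precision rounds to exactly 1.0 in float32, which bumps the stored exponent to 127.

```python
import math
import struct

def to_f32(x):
    """Round a Python float (f64) to the nearest IEEE-754 float32."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Hypothetical pre-activation deep in the saturated regime.
x = 10.0
exact = math.tanh(x)    # f64 stand-in for the exact value: 0.999...
approx = to_f32(exact)  # f32 rounds it up to exactly 1.0

print(exact < 1.0)    # True: exact tanh never reaches 1
print(approx == 1.0)  # True: the f32 exponent field jumps to 127
```

The gap 1 − tanh(10) ≈ 4 × 10⁻⁹ is smaller than half a float32 ulp near 1.0 (≈ 3 × 10⁻⁸), so the round trip lands on 1.0 exactly.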

Program 2: Output Gradient

The gradient g42 = ∂ log P(y) / ∂h is where f32 error first becomes significant.

Metric               Value
Sign agreement       128 / 128
Exponent agreement   111 / 128
Mean mantissa bits   4.8 / 23
||g_f32||            1.878
||g_exact||          1.886
Relative error       3.7%
Softmax destroys mantissa. Hidden state has 21.5 mantissa bits of agreement; gradient has 4.8 bits. The softmax (256-term sum, subtraction for the gradient) destroys ~17 mantissa bits in a single operation. Sign is still perfect (128/128), overall relative error only 3.7%, but per-neuron mantissa is devastated.
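The cancellation mechanism can be sketched in a few stdlib lines. The probability value and the bits-of-agreement definition below are illustrative assumptions, not the paper's code; f64 stands in for the exact value. The gradient of log P with respect to the true-class logit is 1 − p, and that subtraction cancels exactly the leading bits that the f32 and exact probabilities share.

```python
import math
import struct

def to_f32(x):
    # round a Python float (f64) to the nearest IEEE-754 float32
    return struct.unpack('<f', struct.pack('<f', x))[0]

def mantissa_bits_agree(a, b):
    # bits of agreement ~ -log2(relative error), clipped to f32's 23 bits
    # (an assumed metric, analogous to the one reported above)
    if a == b:
        return 23.0
    rel = abs(a - b) / max(abs(a), abs(b))
    return max(0.0, min(23.0, -math.log2(rel)))

# Hypothetical softmax probability close to 1.
p_exact = 0.9999871
p_f32 = to_f32(p_exact)

# d log P / d z for the true class is 1 - p: the subtraction strips
# the shared leading bits, leaving only the rounding noise.
g_exact = 1.0 - p_exact
g_f32 = 1.0 - p_f32

print(mantissa_bits_agree(p_f32, p_exact))  # ~23 bits before the subtraction
print(mantissa_bits_agree(g_f32, g_exact))  # far fewer bits after it
```

The absolute rounding error of p (a few 10⁻⁸) is unchanged by the subtraction, but the result 1 − p is four orders of magnitude smaller, so the same absolute error becomes a large relative error.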

Programs 3+4: BPTT Backward Trace

The backward Jacobian trace from d = 0 to d = 42. Watch the mantissa die at d=1, the sign die by d≈10, and the error ratio lock onto 3.44.

d    Sign   Exp   Mant bits   ||g_f32||   ||g_ex||   ||Δ||/||g_ex||   Phase
0    128    111   4.8         1.88        1.89       0.037            Agreement
1     90     23   0.7         4.73        7.02       0.80             Transition
2    105     30   1.1         8.82        15.5       0.58             Transition
3    128      1   0.8         39.0        90.1       0.57             Transition
5     59     30   0.6         51.6        37.1       1.66             Transition
7     19     21   0.2         116         54.1       3.07             Decorrelation
10     1      3   0.0         596         248        3.41             Decorrelation
15     0      0   0.0         8.7×10⁵     3.6×10⁵    3.43             Decorrelation
20     0      0   0.0         3.6×10⁶     1.5×10⁶    3.44             Decorrelation
30     0      0   0.0         1.9×10³     7.9×10²    3.44             Decorrelation
42     0      0   0.0         2.0×10⁵     8.5×10⁴    3.41             Decorrelation
Error ratio converges to a dynamical constant. The relative error converges to ≈3.44 by d≈7 and stays there through d=42. This is not a random walk — it is a fixed-point property of the Jacobian dynamics. The f32 error vector has aligned with the dominant backward direction.
Mantissa is zero from d=1. Mean mantissa bits of agreement: 0.7 at d=1, 0.0 from d=8 onward. The mantissa decorrelates in a single BPTT step. The 128-term dot product in the Jacobian immediately mixes mantissa noise into all channels.
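Why does a 3.7% error at d=0 become an 80% error after one backward step? A toy 2-D sketch (pure Python; the Jacobian and numbers are made up, not fitted to the model) shows the mechanism: when the backward Jacobian has a dominant direction that the signal mostly avoids but the rounding error does not, a single matvec multiplies the noise-to-signal ratio by the gain gap.

```python
import math

# Toy backward Jacobian with a dominant direction (gain 8 along axis 0)
# and a weak direction (gain 0.5 along axis 1). Values are illustrative.
J = [[8.0, 0.0],
     [0.0, 0.5]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

g_exact = [0.0, 1.0]   # signal lives in the weak direction
err     = [0.03, 0.0]  # small error with a component in the dominant one
g_noisy = [g + e for g, e in zip(g_exact, err)]

rel0 = norm(err) / norm(g_exact)
d_exact = matvec(J, g_exact)
d_noisy = matvec(J, g_noisy)
rel1 = norm([a - b for a, b in zip(d_noisy, d_exact)]) / norm(d_exact)

print(rel0)  # 0.03 before the backward step
print(rel1)  # 0.48 after one step: a 16x jump in the error ratio
```

The error is amplified by 8 while the signal is amplified by 0.5, so the ratio jumps by 16 in one step; iterating locks the error onto the dominant backward direction, which is the fixed-point behavior described above.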

Bit-Level Sampling

Flipping each bit in each neuron at t=42 and measuring the KL divergence of the output distribution. The hierarchy is stark: sign : exponent : mantissa = 300 : 52 : 1 per bit.

Per-Bit KL Divergence (log scale)

Bits    Channel           Mean KL (bits)   Per bit   Flips > 0.1 bpc
0–4     mantissa (low)    < 10⁻⁶           < 10⁻⁷    0
5–14    mantissa (mid)    < 10⁻⁴           < 10⁻⁵    0
15–22   mantissa (high)   0.0035           0.0004    46
23–30   exponent          0.064            0.0080    ~90
31      sign              0.046            0.046     110
300×   Sign vs mantissa, per bit
52×    Exponent vs mantissa, per bit
2×     Each mantissa bit vs the one below
Per-bit leverage: 300 : 52 : 1 hierarchy. Each exponent bit carries 52× the leverage of each mantissa bit. The sign bit carries 5.7× the leverage of each exponent bit. The mantissa has perfect geometric scaling: each bit is exactly 2× the one below.
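The bit-flip probe itself is a few lines of stdlib Python. This sketch has no model attached, so it shows perturbation sizes rather than KL divergences; the hidden-state value is made up. The same hierarchy appears: flipping the sign or an exponent bit moves the value orders of magnitude more than any mantissa bit, and each mantissa bit moves it exactly 2× the one below.

```python
import struct

def f32_bits(x):
    """Bit pattern of x as an IEEE-754 float32 (uint32)."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(u):
    """Reinterpret a uint32 bit pattern as a float32 value."""
    return struct.unpack('<f', struct.pack('<I', u))[0]

h = bits_f32(f32_bits(0.987654))   # made-up hidden-state value, rounded to f32
for bit in (0, 12, 22, 23, 31):    # low/mid/high mantissa, exponent LSB, sign
    delta = abs(bits_f32(f32_bits(h) ^ (1 << bit)) - h)
    print(bit, delta)
```

Flipping bit 31 negates the value (perturbation 2|h|), flipping the exponent LSB doubles or halves it, and flipping mantissa bit k changes it by 2^(k−24) for a value in [0.5, 1).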

Lyapunov Structure Across Positions

The 3.44 ratio at t=42 is a transient. Running the same analysis across all positions reveals three regimes.

t      h mant   g0 mant   ratio d=0   ratio d=10   cos(g_f32, g_ex)   ||g_f32||/||g_ex||   Phase
10     22.9     22.5      0.00        0.01         +1.000             0.99                 Agreement
20     23.0     22.2      0.00        0.00         +1.000             1.00                 Agreement
35     22.0     10.2      0.00        0.04         +1.000             0.96                 Agreement
45     21.7     5.7       0.02        1.08         −0.694             0.11                 Transition
60     18.0     1.9       0.32        1.00         −0.082             0.03                 Decorrelation
100    6.1      1.6       0.42        1.00         +0.050             0.00                 Decorrelation
200    10.4     1.1       0.56        1.00         −0.229             0.00                 Decorrelation
500    14.0     2.1       0.29        1.00         −0.131             0.00                 Decorrelation
1000   14.5     2.2       0.21        0.99         +0.197             0.04                 Decorrelation
Three regimes. (1) Agreement (t < 40): cos = +1.0, f32 and exact track perfectly. (2) Transition (t ≈ 40–60): cosine drops from +1 to ∼0; the 3.44 ratio is a transient here. (3) Decorrelation (t > 60): cos ≈ 0, ratio ≈ 1.0. f32 and exact gradients at d=10 are uncorrelated.
BPTT gradients are f32 noise after t ≈ 60. At depth d=10, the f32 and exact gradients are uncorrelated beyond t≈60 (cos ≈ 0). Yet the model learns (0.079 bpc). Either the output gradient at d=0 contains enough signal, or Adam extracts signal from the statistical structure of the f32 noise across positions.
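The three-regime split can be written as a small classifier. The thresholds below are my assumptions read off the table above, not values from the paper; they reproduce the phase labels for every row shown.

```python
def phase(cos_sim, ratio_d10):
    """Classify a position by gradient cosine and the d=10 error ratio.

    Thresholds are assumed from the table above, not the paper's code.
    """
    if cos_sim > 0.99 and ratio_d10 < 0.5:
        return "Agreement"      # f32 and exact gradients track perfectly
    if abs(cos_sim) < 0.3 and ratio_d10 > 0.9:
        return "Decorrelation"  # uncorrelated directions, ratio locked near 1
    return "Transition"

print(phase(+1.000, 0.01))  # t = 10 row -> Agreement
print(phase(-0.694, 1.08))  # t = 45 row -> Transition
print(phase(+0.050, 1.00))  # t = 100 row -> Decorrelation
```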

Program 5: Pattern Attribution Rankings

The punchline: even though gradient magnitudes diverge by 2.44×, the ranking of which patterns matter is perfectly preserved.

Wy (output layer)

ρ = 0.997   Spearman rank correlation
125         Agree at τ = 0.01
0           f32-only phantoms

Wx (input layer, by BPTT depth)

d    ρ       Agree   f32-only   exact-only   Mean ratio
1    +0.52   126     0          2            0.69
2    +0.74   128     0          0            0.61
5    +0.50   128     0          0            1.30
8    +0.91   128     0          0            2.53
10   +0.99   128     0          0            2.40
11   +1.00   128     0          0            2.44
15   +1.00   128     0          0            2.43
20   +1.00   128     0          0            2.44
Pattern ranking is perfect at depth. At d≥11, Spearman correlation between f32 and exact attributions is exactly 1.000. Zero phantom patterns at any depth. The f32 quotient preserves the topology of the pattern space while distorting magnitudes by ~2.44×.
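The rank-agreement check is a standard Spearman correlation, sketched here in pure Python with made-up attribution vectors (not the model's). A uniform 2.44× magnitude distortion leaves the ranking, and hence ρ, unchanged at exactly 1.0.

```python
def spearman(a, b):
    """Spearman rank correlation for distinct-valued sequences."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical attribution scores: f32 scales magnitudes, order intact.
exact_attr = [0.9, 0.1, 2.4, 0.02, 1.3]
f32_attr = [2.44 * x for x in exact_attr]
print(spearman(exact_attr, f32_attr))  # -> 1.0
```

Any strictly monotone distortion of the magnitudes (not just a uniform scale) would give the same ρ = 1.0, which is why ranking survives even though norms diverge.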

Program 6: The Cost of f32

Full forward pass in f32 and MPFR-256 over all 1023 positions. f32 costs exactly 0.071 bpc.

Phase     bpc (f32)   bpc (exact)   Diff     n
t < 50    6.612       6.642         −0.031   50
50–200    6.805       6.652         +0.153   150
200–500   5.469       5.351         +0.118   300
500+      5.469       5.439         +0.031   523
Overall   5.721       5.650         +0.071   1023
f32 is BETTER early. For t < 50, f32 is 0.031 bpc better than exact (6.612 vs 6.642). The model was trained in f32, so it optimized for f32 dynamics. The exact computation is a different dynamical system the model was not trained on.
Total cost: 0.071 bpc. Hidden states diverge completely after t≈45 (full sign flips), yet bpc cost is modest because the sign vector is the dominant information channel and many sign bits happen to agree.
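The bpc bookkeeping behind the table is just a mean of per-position code lengths. A sketch with made-up log-probabilities (not the paper's data): bits per character is the mean of −log₂ P(byte), so the cost of f32 is the mean gap between the two streams.

```python
def bpc(neg_log2_probs):
    """Bits per character: mean of -log2 P(byte) over positions."""
    return sum(neg_log2_probs) / len(neg_log2_probs)

# Hypothetical -log2 P per position; f32 slightly worse on average.
exact_stream = [5.6, 5.7, 5.6, 5.8]
f32_stream   = [5.7, 5.75, 5.65, 5.9]

cost = bpc(f32_stream) - bpc(exact_stream)
print(round(cost, 3))  # -> 0.075
```

A per-phase breakdown like the table's is the same computation restricted to position ranges, weighted by n when recombined into the overall figure.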

Predictions vs Results

How did the pre-experiment predictions hold up?

Prediction               Expected      Actual
P1: Sign agreement       128/128       128/128 ✓
P1: Exponent agreement   128/128       124/128 (close)
P1: Mantissa bits        ~18/23        21.5/23 (better)
P2: Gradient sign        128/128       128/128 ✓
P2: Gradient mantissa    15–18 bits    4.8 bits (much worse)
P3: Gates killed         60–80         114 (more)
P4: Sign dies at         d ≈ 10–15     d ≈ 7 (faster)
P4: Mantissa decay       logarithmic   instant (0.7 bits at d=1)
P4: Error ratio          growing       3.44 (transient), 1.0 (steady)
"The predictions underestimated how fast the backward pass decorrelates. The key surprise is not a constant ratio but a phase transition: before t≈40, f32 BPTT is exact; after t≈60, it is pure noise. The transition is sharp (20 positions)."
— q1-exact-results.tex

Source & Reproducibility

Paper: q1-exact-results.pdf (source)
Programs: q1-exact.pdf (protocol), p1, p2, p3, p5, p6, lyapunov, lyapunov2, bit_sample
Related: Protocol B, Protocol C, Q1 Sparsity, Implementation Notes
Model: sat_model.bin (128 hidden, 0.079 bpc). Data: first 1024 bytes of enwik9.