Q1 Exact: f32 vs MPFR-256

Programs 1–6 from q1_exact_p1–p6, q1_lyapunov, q1_bit_sample — 2026-02-11
"The f32 error vector aligns with the dominant backward direction. It is wrong about 'how much' but right about 'what matters.'"

The Headline Numbers

We ran the sat-rnn (128 hidden, 0.079 bpc) forward and backward in both f32 and MPFR-256 exact arithmetic. The forward pass is well-behaved. The backward pass decorrelates in a single step. Yet pattern rankings are perfectly preserved.

128/128   Forward sign agreement at t=42
21.5      Mantissa bits (of 23), forward
4.8       Mantissa bits in gradient
0.7       Mantissa bits at BPTT d=1
3.44      Lyapunov error ratio (transient)
1.000     Pattern rank ρ at d≥11
"At depth, f32 is wrong about 'how much' but right about 'what matters.'"
— q1-exact-results.tex, Observation 8

Program 1: Forward Pass (t = 42)

Hidden state h42 after 42 exact timesteps vs f32. The forward pass is remarkably stable.

Metric                Value
Sign agreement        128 / 128
Exponent agreement    124 / 128
Mean mantissa bits    21.5 / 23
Max absolute error    4.74 × 10⁻¹
Mean absolute error   8.30 × 10⁻³
Near-zero neurons     0

Saturation Distribution

125   |h| ≥ 0.95 (saturated)
1     |h| ∈ [0.5, 0.75)
1     |h| ∈ [0.25, 0.5)
112   Exponent = 127 (|h| ≈ 1.0)
Forward pass preserves structure. After 42 steps, sign and exponent are essentially preserved (128/128 sign, 124/128 exponent). The 4 exponent disagreements come from neurons near the |h| = 1.0 boundary where f32(tanh) = 1.0 but exact(tanh) = 0.999...
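The boundary effect described above is easy to reproduce without the model. A minimal stdlib sketch (f64 stands in for MPFR-256 here, and the pre-activation value is made up): a saturated tanh output that is strictly below 1 in higher precision rounds to exactly 1.0 in float32, which bumps the stored exponent to 127.

```python
import math
import struct

def to_f32(x):
    """Round a Python float (f64) to the nearest IEEE-754 float32."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Hypothetical pre-activation deep in the saturated regime.
x = 10.0
exact = math.tanh(x)    # f64 stand-in for the exact value: 0.999...
approx = to_f32(exact)  # f32 rounds it up to exactly 1.0

print(exact < 1.0)    # True: exact tanh never reaches 1
print(approx == 1.0)  # True: the f32 exponent field jumps to 127
```

The gap 1 − tanh(10) ≈ 4 × 10⁻⁹ is smaller than half a float32 ulp near 1.0 (≈ 3 × 10⁻⁸), so the round trip lands on 1.0 exactly.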

Program 2: Output Gradient

The gradient g42 = ∂ log P(y) / ∂h is where f32 error first becomes significant.

Metric               Value
Sign agreement       128 / 128
Exponent agreement   111 / 128
Mean mantissa bits   4.8 / 23
||g_f32||            1.878
||g_exact||          1.886
Relative error       3.7%
Softmax destroys mantissa. Hidden state has 21.5 mantissa bits of agreement; gradient has 4.8 bits. The softmax (256-term sum, subtraction for the gradient) destroys ~17 mantissa bits in a single operation. Sign is still perfect (128/128), overall relative error only 3.7%, but per-neuron mantissa is devastated.
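The cancellation mechanism can be sketched in a few stdlib lines. The probability value and the bits-of-agreement definition below are illustrative assumptions, not the paper's code; f64 stands in for the exact value. The gradient of log P with respect to the true-class logit is 1 − p, and that subtraction cancels exactly the leading bits that the f32 and exact probabilities share.

```python
import math
import struct

def to_f32(x):
    # round a Python float (f64) to the nearest IEEE-754 float32
    return struct.unpack('<f', struct.pack('<f', x))[0]

def mantissa_bits_agree(a, b):
    # bits of agreement ~ -log2(relative error), clipped to f32's 23 bits
    # (an assumed metric, analogous to the one reported above)
    if a == b:
        return 23.0
    rel = abs(a - b) / max(abs(a), abs(b))
    return max(0.0, min(23.0, -math.log2(rel)))

# Hypothetical softmax probability close to 1.
p_exact = 0.9999871
p_f32 = to_f32(p_exact)

# d log P / d z for the true class is 1 - p: the subtraction strips
# the shared leading bits, leaving only the rounding noise.
g_exact = 1.0 - p_exact
g_f32 = 1.0 - p_f32

print(mantissa_bits_agree(p_f32, p_exact))  # ~23 bits before the subtraction
print(mantissa_bits_agree(g_f32, g_exact))  # far fewer bits after it
```

The absolute rounding error of p (a few 10⁻⁸) is unchanged by the subtraction, but the result 1 − p is four orders of magnitude smaller, so the same absolute error becomes a large relative error.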

Programs 3+4: BPTT Backward Trace

The backward Jacobian trace from d = 0 to d = 42. Watch the mantissa die at d=1, the sign die by d≈10, and the error ratio lock onto 3.44.

d    Sign   Exp   Mant bits   ||g_f32||   ||g_ex||   ||Δ||/||g_ex||   Phase
0    128    111   4.8         1.88        1.89       0.037            Agreement
1     90     23   0.7         4.73        7.02       0.80             Transition
2    105     30   1.1         8.82        15.5       0.58             Transition
3    128      1   0.8         39.0        90.1       0.57             Transition
5     59     30   0.6         51.6        37.1       1.66             Transition
7     19     21   0.2         116         54.1       3.07             Decorrelation
10     1      3   0.0         596         248        3.41             Decorrelation
15     0      0   0.0         8.7×10⁵     3.6×10⁵    3.43             Decorrelation
20     0      0   0.0         3.6×10⁶     1.5×10⁶    3.44             Decorrelation
30     0      0   0.0         1.9×10³     7.9×10²    3.44             Decorrelation
42     0      0   0.0         2.0×10⁵     8.5×10⁴    3.41             Decorrelation
Error ratio converges to a dynamical constant. The relative error converges to ≈3.44 by d≈7 and stays there through d=42. This is not a random walk — it is a fixed-point property of the Jacobian dynamics. The f32 error vector has aligned with the dominant backward direction.
Mantissa is zero from d=1. Mean mantissa bits of agreement: 0.7 at d=1, 0.0 from d=8 onward. The mantissa decorrelates in a single BPTT step. The 128-term dot product in the Jacobian immediately mixes mantissa noise into all channels.
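Why does a 3.7% error at d=0 become an 80% error after one backward step? A toy 2-D sketch (pure Python; the Jacobian and numbers are made up, not fitted to the model) shows the mechanism: when the backward Jacobian has a dominant direction that the signal mostly avoids but the rounding error does not, a single matvec multiplies the noise-to-signal ratio by the gain gap.

```python
import math

# Toy backward Jacobian with a dominant direction (gain 8 along axis 0)
# and a weak direction (gain 0.5 along axis 1). Values are illustrative.
J = [[8.0, 0.0],
     [0.0, 0.5]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

g_exact = [0.0, 1.0]   # signal lives in the weak direction
err     = [0.03, 0.0]  # small error with a component in the dominant one
g_noisy = [g + e for g, e in zip(g_exact, err)]

rel0 = norm(err) / norm(g_exact)
d_exact = matvec(J, g_exact)
d_noisy = matvec(J, g_noisy)
rel1 = norm([a - b for a, b in zip(d_noisy, d_exact)]) / norm(d_exact)

print(rel0)  # 0.03 before the backward step
print(rel1)  # 0.48 after one step: a 16x jump in the error ratio
```

The error is amplified by 8 while the signal is amplified by 0.5, so the ratio jumps by 16 in one step; iterating locks the error onto the dominant backward direction, which is the fixed-point behavior described above.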

Bit-Level Sampling

Flipping each bit in each neuron at t=42 and measuring the KL divergence of the output distribution. The hierarchy is stark: sign : exponent : mantissa = 300 : 52 : 1 per bit.

Per-Bit KL Divergence (log scale)

Bits    Channel           Mean KL (bits)   Per bit   Flips > 0.1 bpc
0–4     mantissa (low)    < 10⁻⁶           < 10⁻⁷    0
5–14    mantissa (mid)    < 10⁻⁴           < 10⁻⁵    0
15–22   mantissa (high)   0.0035           0.0004    46
23–30   exponent          0.064            0.0080    ~90
31      sign              0.046            0.046     110
300×   Sign vs mantissa, per bit
52×    Exponent vs mantissa, per bit
2×     Each mantissa bit vs the one below
Per-bit leverage: 300 : 52 : 1 hierarchy. Each exponent bit carries 52× the leverage of each mantissa bit. The sign bit carries 5.7× the leverage of each exponent bit. The mantissa has perfect geometric scaling: each bit is exactly 2× the one below.
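The bit-flip probe itself is a few lines of stdlib Python. This sketch has no model attached, so it shows perturbation sizes rather than KL divergences; the hidden-state value is made up. The same hierarchy appears: flipping the sign or an exponent bit moves the value orders of magnitude more than any mantissa bit, and each mantissa bit moves it exactly 2× the one below.

```python
import struct

def f32_bits(x):
    """Bit pattern of x as an IEEE-754 float32 (uint32)."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(u):
    """Reinterpret a uint32 bit pattern as a float32 value."""
    return struct.unpack('<f', struct.pack('<I', u))[0]

h = bits_f32(f32_bits(0.987654))   # made-up hidden-state value, rounded to f32
for bit in (0, 12, 22, 23, 31):    # low/mid/high mantissa, exponent LSB, sign
    delta = abs(bits_f32(f32_bits(h) ^ (1 << bit)) - h)
    print(bit, delta)
```

Flipping bit 31 negates the value (perturbation 2|h|), flipping the exponent LSB doubles or halves it, and flipping mantissa bit k changes it by 2^(k−24) for a value in [0.5, 1).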

Lyapunov Structure Across Positions

The 3.44 ratio at t=42 is a transient. Running the same analysis across all positions reveals three regimes.

t      h mant   g0 mant   ratio d=0   ratio d=10   cos(g_f32, g_ex)   ||g_f32||/||g_ex||   Phase
10     22.9     22.5      0.00        0.01         +1.000             0.99                 Agreement
20     23.0     22.2      0.00        0.00         +1.000             1.00                 Agreement
35     22.0     10.2      0.00        0.04         +1.000             0.96                 Agreement
45     21.7     5.7       0.02        1.08         −0.694             0.11                 Transition
60     18.0     1.9       0.32        1.00         −0.082             0.03                 Decorrelation
100    6.1      1.6       0.42        1.00         +0.050             0.00                 Decorrelation
200    10.4     1.1       0.56        1.00         −0.229             0.00                 Decorrelation
500    14.0     2.1       0.29        1.00         −0.131             0.00                 Decorrelation
1000   14.5     2.2       0.21        0.99         +0.197             0.04                 Decorrelation
Three regimes. (1) Agreement (t < 40): cos = +1.0, f32 and exact track perfectly. (2) Transition (t ≈ 40–60): cosine drops from +1 to ∼0; the 3.44 ratio is a transient here. (3) Decorrelation (t > 60): cos ≈ 0, ratio ≈ 1.0. f32 and exact gradients at d=10 are uncorrelated.
BPTT gradients are f32 noise after t ≈ 60. At depth d=10, the f32 and exact gradients are uncorrelated beyond t≈60 (cos ≈ 0). Yet the model learns (0.079 bpc). Either the output gradient at d=0 contains enough signal, or Adam extracts signal from the statistical structure of the f32 noise across positions.
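The three-regime split can be written as a small classifier. The thresholds below are my assumptions read off the table above, not values from the paper; they reproduce the phase labels for every row shown.

```python
def phase(cos_sim, ratio_d10):
    """Classify a position by gradient cosine and the d=10 error ratio.

    Thresholds are assumed from the table above, not the paper's code.
    """
    if cos_sim > 0.99 and ratio_d10 < 0.5:
        return "Agreement"      # f32 and exact gradients track perfectly
    if abs(cos_sim) < 0.3 and ratio_d10 > 0.9:
        return "Decorrelation"  # uncorrelated directions, ratio locked near 1
    return "Transition"

print(phase(+1.000, 0.01))  # t = 10 row -> Agreement
print(phase(-0.694, 1.08))  # t = 45 row -> Transition
print(phase(+0.050, 1.00))  # t = 100 row -> Decorrelation
```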

Program 5: Pattern Attribution Rankings

The punchline: even though gradient magnitudes diverge by 2.44×, the ranking of which patterns matter is perfectly preserved.

Wy (output layer)

ρ = 0.997   Spearman rank correlation
125         Agree at τ = 0.01
0           f32-only phantoms

Wx (input layer, by BPTT depth)

d    ρ       Agree   f32-only   exact-only   Mean ratio
1    +0.52   126     0          2            0.69
2    +0.74   128     0          0            0.61
5    +0.50   128     0          0            1.30
8    +0.91   128     0          0            2.53
10   +0.99   128     0          0            2.40
11   +1.00   128     0          0            2.44
15   +1.00   128     0          0            2.43
20   +1.00   128     0          0            2.44
Pattern ranking is perfect at depth. At d≥11, Spearman correlation between f32 and exact attributions is exactly 1.000. Zero phantom patterns at any depth. The f32 quotient preserves the topology of the pattern space while distorting magnitudes by ~2.44×.
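The rank-agreement check is a standard Spearman correlation, sketched here in pure Python with made-up attribution vectors (not the model's). A uniform 2.44× magnitude distortion leaves the ranking, and hence ρ, unchanged at exactly 1.0.

```python
def spearman(a, b):
    """Spearman rank correlation for distinct-valued sequences."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical attribution scores: f32 scales magnitudes, order intact.
exact_attr = [0.9, 0.1, 2.4, 0.02, 1.3]
f32_attr = [2.44 * x for x in exact_attr]
print(spearman(exact_attr, f32_attr))  # -> 1.0
```

Any strictly monotone distortion of the magnitudes (not just a uniform scale) would give the same ρ = 1.0, which is why ranking survives even though norms diverge.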

Program 6: The Cost of f32

Full forward pass in f32 and MPFR-256 over all 1023 positions. f32 costs exactly 0.071 bpc.

Phase     bpc (f32)   bpc (exact)   Diff     n
t < 50    6.612       6.642         −0.031   50
50–200    6.805       6.652         +0.153   150
200–500   5.469       5.351         +0.118   300
500+      5.469       5.439         +0.031   523
Overall   5.721       5.650         +0.071   1023
f32 is BETTER early. For t < 50, f32 is 0.031 bpc better than exact (6.612 vs 6.642). The model was trained in f32, so it optimized for f32 dynamics. The exact computation is a different dynamical system the model was not trained on.
Total cost: 0.071 bpc. Hidden states diverge completely after t≈45 (full sign flips), yet bpc cost is modest because the sign vector is the dominant information channel and many sign bits happen to agree.
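The bpc bookkeeping behind the table is just a mean of per-position code lengths. A sketch with made-up log-probabilities (not the paper's data): bits per character is the mean of −log₂ P(byte), so the cost of f32 is the mean gap between the two streams.

```python
def bpc(neg_log2_probs):
    """Bits per character: mean of -log2 P(byte) over positions."""
    return sum(neg_log2_probs) / len(neg_log2_probs)

# Hypothetical -log2 P per position; f32 slightly worse on average.
exact_stream = [5.6, 5.7, 5.6, 5.8]
f32_stream   = [5.7, 5.75, 5.65, 5.9]

cost = bpc(f32_stream) - bpc(exact_stream)
print(round(cost, 3))  # -> 0.075
```

A per-phase breakdown like the table's is the same computation restricted to position ranges, weighted by n when recombined into the overall figure.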

Predictions vs Results

How did the pre-experiment predictions hold up?

Prediction               Expected      Actual
P1: Sign agreement       128/128       128/128 ✓
P1: Exponent agreement   128/128       124/128 (close)
P1: Mantissa bits        ~18/23        21.5/23 (better)
P2: Gradient sign        128/128       128/128 ✓
P2: Gradient mantissa    15–18 bits    4.8 bits (much worse)
P3: Gates killed         60–80         114 (more)
P4: Sign dies at         d ≈ 10–15     d ≈ 7 (faster)
P4: Mantissa decay       logarithmic   instant (0.7 bits at d=1)
P4: Error ratio          growing       3.44 (transient), 1.0 (steady)
"The predictions underestimated how fast the backward pass decorrelates. The key surprise is not a constant ratio but a phase transition: before t≈40, f32 BPTT is exact; after t≈60, it is pure noise. The transition is sharp (20 positions)."
— q1-exact-results.tex

Source & Reproducibility

Paper: q1-exact-results.pdf (source)
Programs: q1-exact.pdf (protocol), p1, p2, p3, p5, p6, lyapunov, lyapunov2, bit_sample
Related: Protocol B, Protocol C, Q1 Sparsity, Implementation Notes
Model: sat_model.bin (128 hidden, 0.079 bpc). Data: first 1024 bytes of enwik9.