Archive 2026-02-11
Total Interpretation & Weight Construction for a Small RNN
Papers
- e-onto-n.pdf — E onto N: The Bi-Embedding of Events and Numbers. The UM forward pass is a thermodynamic partition function. The RNN learned a bi-embedding: E→N (counting events → natural numbers → weights) and N→E (weights → event predictions). Shannon entropy = Boltzmann entropy via the identity: SN strength = log Ω (microstates per macrostate). Quotient chain E→N→Q traces every operation. Forward/backward temporal patterns are the two directions. (13 pages)
(source)
- quotient-chain.pdf — The Quotient Chain: E → N → Q at Every Operation. Full E→N→Q derivation at every layer of the UM/RNN forward pass. Input embedding, W_x, W_h recurrence, bias, binary-ES softmax, W_y output. Quotient decomposition theorem: Q_total = Π Q_layer. Amplification explains the export gap. Training as quotient alignment. Q as universal currency. (10 pages)
(source)
- temporal-biembedding.pdf — The Temporal Bi-Embedding: Forward Patterns, Backward Attribution. Forward map ET→N (skip-k-grams as temporal counting) and backward map N→ET (attribution chains via Jacobian). The bi-embedded function the RNN learned, explicitly characterized at W_h and W_y level. Hebbian r=0.56, PMI alignment 74%. Category-theoretic functor formulation. (10 pages)
(source)
- microstate-macrostate.pdf — Microstates, Macrostates, and the Partition Function. Full thermodynamic derivation: microstates = dataset positions × event tuples, macrostates = factored event classes. Binary-ES softmax IS the Boltzmann distribution with β = ln 2. Partition function at every layer, free energy gap = bpc. Second law: factoring increases macrostate entropy. Bidirectional factoring and phase transitions at saturation. (10 pages)
(source)
- narrative.pdf — From Counting to Construction: The Complete Arc. Connects all twelve days of experiments: Training → UM isomorphism → Pattern discovery → Total interpretation → Writing the weights in. Hebbian covariance predicts W_h at r=0.56; W_y blend improves trained model by 0.66 bpc. Fully analytic: 1.89 bpc, zero optimization. (6 pages)
(source)
- synthesis.pdf — Synthesis: Total Interpretation of a 128-Hidden RNN. All seven questions answered. Boolean automaton; 20 neurons + 36% W_h = 0.15 bpc better. The mantissa was the ladder. (4 pages)
(source)
- boolean-automaton.pdf — The Sat-RNN as a 128-Bit Boolean Automaton. Margins (mean 60.5, 98.9% > 1.0), sparse influence (3.5 out-degree), no attractors, mantissa noise mechanism. (6 pages)
(source)
- q234-results.pdf — Q2–Q4 Results: Offsets, Neurons, Saturation. d=18-25, 1 neuron = 99.7%, all 128 volatile. (5 pages)
(source)
- q6-justifications.pdf — Q6: Human-Readable Justifications. Backward chains, h54 dominates 7/12, routing backbone h54←h121←h78. (4 pages)
(source)
- total-interp.pdf — Toward Total Interpretation. Formalizes backward attribution chains, superset UM, seven key questions.
(source)
- h32.pdf — H = 2³²: The f32 State Space. 496-bit effective state, mixed Boolean-analog dynamics.
(source)
- q1-exact-results.pdf — Q1 Exact Results. f32 vs MPFR-256, 300:52:1 hierarchy, gradient decorrelation, Lyapunov 3.44.
(source)
- q1-exact.pdf — Q1 Exact: Six programs comparing f32 vs GMP.
(source)
- q1-protocol-c.pdf — Q1 Protocol C: The f32 Quotient.
(source)
- q1-protocol-b.pdf — Q1 Protocol B: Exact pattern census.
(source)
- q1-implementation.pdf — Q1 Implementation Notes.
(source)
- q1-sparsity.pdf — Q1: How Sparse Is the Explanation?
(source)
- cost.pdf — Computational Cost of Analytic Weight Construction vs. Gradient Descent Training. Precise FLOP analysis: analytic = 149 MFLOP ($0.000001), SGD = 5.94 TFLOP ($0.013). Ratio: 39,800× (4.6 OoM). Gap widens to 7.2 OoM at H=4096 because SGD scales as H², analytic is H-independent. (6 pages)
(source)
- es-isomorphism.pdf — The Event Space Isomorphism: Arch-Native and Human-Native Partitions. SVD sign-bit partition vs. human semantic byte classes. V-side: up to 85.7% accuracy, 0.661 NMI. U-side: arch-native refines human partition (discovers vowel/consonant/position within “lowercase”). The E→N map made concrete. (5 pages)
(source, tool)
- entropy-bridge.pdf — The Entropy Bridge: Microstates, Macrostates, and Event Factoring. Shannon = Boltzmann through the event formalism. Microstates are positions, macrostates are factored event classes. H = log₂N − ⟨S_B⟩. Connects Q = λ, pattern chains, and E onto N. (10 pages)
(source)
Goal: Complete interpretability of the sat-rnn (128 hidden, 0.079 bpc), then write the weights back from the interpretation. Close the loop: data → UM → RNN → interpretation → data.
Key Results: Writing the Weights In
Hebbian covariance predicts W_h at r = 0.56
cov(h_j(t), h_i(t+1)) correlates with trained W_h at r = 0.40 (all), r = 0.56 (important entries |w| ≥ 3.0, R² = 31%). Sign prediction accuracy: 72.7% for important weights. Optimal scale factor: 3.94.
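A minimal C sketch of the covariance rule, assuming hidden states have been recorded from a forward pass; the function name, the small dimensions, and the copy-unit demo are hypothetical (the real experiment is write_weights3.c):

```c
#define H 8    /* hidden size (128 in the real model; 8 for the demo) */
#define T 64   /* recorded timesteps */

/* Hebbian estimate of the recurrent matrix:
   W_h[i][j] ~ scale * cov(h_j(t), h_i(t+1)).
   Per the text this correlates with the trained W_h at r = 0.56 for
   important entries, with optimal scale factor ~3.94. */
void hebbian_wh(const float h[T][H], float scale, float wh[H][H]) {
    float mean[H] = {0};
    for (int t = 0; t < T; t++)
        for (int i = 0; i < H; i++)
            mean[i] += h[t][i] / T;
    for (int i = 0; i < H; i++)        /* post-synaptic unit at t+1 */
        for (int j = 0; j < H; j++) {  /* pre-synaptic unit at t    */
            double c = 0;
            for (int t = 0; t + 1 < T; t++)
                c += (h[t][j] - mean[j]) * (h[t+1][i] - mean[i]);
            wh[i][j] = (float)(scale * c / (T - 1));
        }
}
```

If unit 0 simply copies unit 1 with one step of delay, the rule recovers a strong positive weight wh[0][1], which is the intuition behind the r = 0.56 correlation.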
Replacing b_h costs nothing (it actually improves by 0.011 bpc)
Sign log-odds of the positive fraction perfectly reconstructs the bias. Trained + Hebbian b_h: 4.954 bpc vs trained: 4.965 bpc.
50% Hebbian W_y blend IMPROVES the model by 0.66 bpc
Mixing Hebbian W_y with trained W_y: 4.307 bpc (vs trained 4.965). The trained readout is over-optimized for mantissa dynamics; the Hebbian correction pushes toward the true conditional distribution.
Constructed dynamics + optimized readout: 2.80–3.96 bpc
All W_x, W_h, b_h from data covariance, only W_y optimized: 3.96 bpc. Sign-conditioned dynamics + optimized W_y: 2.80 bpc. (Both overfit to 520 bytes.)
FULLY ANALYTIC: 1.89 bpc with ZERO optimization (beats trained 4.97 by 3.08)
Shift-register dynamics + analytic W_y from skip-bigram log-ratios. ALL 82k parameters from data statistics. Zero gradient descent, zero BPTT. Generalizes comparably: test 4.88 vs trained 5.08 (within 0.2 bpc). The loop is closed.
Optimized readout: 0.59 bpc all data, 0.40 bpc test
Shift-register (16 groups of 8, hash encoding) + gradient-optimized W_y. No trained model needed for dynamics. Perfect 16-step memory vs trained model's chaotic info destruction.
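A sketch of the shift-register dynamics (16 groups of 8), assuming one group per lag; the particular 8-bit hash is a hypothetical stand-in for the encoding used in write_weights4.c:

```c
#include <stdint.h>
#include <string.h>

#define GROUPS 16   /* one group per lag 0..15 */
#define BITS    8   /* neurons per group */

/* Hypothetical byte hash; any injective-enough 8-bit code works here. */
static uint8_t byte_hash(uint8_t b) { return (uint8_t)(b * 167u + 13u); }

/* One step: shift every group down one lag, then write the hash of the
   incoming byte into group 0 as ±1 neurons. No learned W_h is needed —
   this is a perfect 16-step memory, in contrast to the trained model's
   chaotic information destruction. */
void step(int8_t h[GROUPS][BITS], uint8_t byte) {
    memmove(h[1], h[0], (GROUPS - 1) * BITS);   /* lags 0..14 → 1..15 */
    uint8_t x = byte_hash(byte);
    for (int k = 0; k < BITS; k++)
        h[0][k] = (x >> k & 1) ? 1 : -1;
}
```

After t steps, group g holds an exact code for the byte seen g steps ago (for g < 16), which is what the optimized readout exploits.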
Sparse 26k-param architecture cannot train from scratch
Redux architecture (20 neurons, sparse W_h) from random init: 7.74 bpc after 50 epochs. Full 82k architecture: 5.16 bpc. The dense W_h is scaffolding for gradient flow.
Key Results: Boolean Automaton (Q1–Q7)
The mantissa is noise, not memory
Sign-only dynamics: 5.690 bpc — BETTER than full f32 (5.721). Zero-mantissa: 5.582 bpc. Sign carries 99.7% of compression. 31.6 sign changes/step. Mantissa degrades by 0.095 bpc.
Margins prove the Boolean function IS the computation
Mean margin: 60.5. Max mantissa perturbation: 4.7×10⁻⁵. 98.9% of neuron-steps have |z| > 1.0. Safety factor 10⁶×.
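The margin statistic above is just the magnitude of the pre-tanh activation; a minimal sketch, with hypothetical names:

```c
#include <math.h>

/* Margin of one neuron-step: |z|, where z is the pre-tanh activation.
   With |z| > 1, tanh(z) is saturated and the output sign survives any
   mantissa-scale perturbation (max 4.7e-5 per the text), so only the
   Boolean sign function matters. */
double margin(int n, const float *wh_row, const float *h, float xb) {
    double z = xb;                       /* W_x·x + b_h contribution */
    for (int j = 0; j < n; j++) z += wh_row[j] * h[j];
    return fabs(z);
}
```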
1 neuron = 99.7%, top 15 = 102% (Q3)
h28 alone captures 99.7% of compression gap. Top 15 beat full model. 113 neurons are noise for readout. 20 neurons + 36% W_h = 4.81 bpc (0.15 better than full).
All 128 volatile, deep offsets d=18-25 (Q2, Q4)
Zero frozen neurons. Mean dwell: 3.3 steps. Co-flip pairs (Jaccard > 0.5). MI-greedy captures 9.4%. 23% of neurons dominated by d=25.
Routing backbone: h54 ← h121 ← h78 (Q6)
Each prediction: ~15 weights (0.1% of W_h). h54 dominates 7/12 predictions (smallest margin 26.7, most volatile 234 flips).
74% RNN-PMI alignment (Q7)
Shallow offsets (d=1-4): 88%. Deep offsets (d>15): 24-37%. The RNN develops higher-order patterns at depth.
Key Results: f32/GMP Experiments (Q1)
Bit leverage 300:52:1, gradient decorrelates at d=1
Per-bit KL: sign 0.046, exponent 0.008/bit, mantissa 0.00015/bit. Mantissa bits 0-14 have zero effect. Pattern ranking ρ = 1.000 at d ≥ 11. Error ratio 3.44 (Lyapunov).
Interactive Experiment Viewers
- The Evidence — Master evidence viewer: 12 charts covering the full arc from Boolean automaton to weight construction. Every key result with real experiment data. The case that this RNN is completely understood.
- Seven Questions (Q1–Q7) — Total interpretation synthesis: one section per question with interactive charts. Bit leverage, neuron knockout, saturation dynamics, offset analysis, redux, justifications, algebraic structure. Real data from all Q1–Q7 experiments.
- The Boolean Automaton — Margin analysis, mantissa ablation, bit leverage 300:52:1, influence graph, no attractors. Real data from q1_boolean, q1_margins, q1_bool_attractor.
- Neuron Knockout & The Minimal Model — h8 is king (+0.035 bpc). Top 15 neurons beat full 128. 113 neurons are noise. 82k params → 26k needed. Real data from q3_neurons, q5_redux.
- Saturation Dynamics — All 128 neurons volatile. Mean dwell 3.3 steps. Co-flip clusters. The snapshot is not the dynamics. Real data from q4_saturation.
- Offset Analysis (Q2) — Dominant offset d=14 (19 neurons). MI-greedy captures only 10.3%. Deep memory, not shallow. Real data from q2_offsets.
- Per-Prediction Justifications (Q6) — Worked examples at t=10, t=42, t=50. ~15 weights per prediction. Routing backbone through h8. Real data from q6_justify.
- Weight Construction — The closed loop: ALL 82k parameters from data statistics. Optimization continuum: log-ratio 1.89 → pseudo-inverse 1.56 → 20 Newton steps 0.98 → SGD 0.59. All beat trained 4.97. Hash diversity > optimality. 12 experiments from PMI to IRLS.
- Q1 Exact: f32 vs MPFR-256 — Forward pass, BPTT trace, bit-level sampling, Lyapunov structure. Sign dies at d~7, error ratio 3.44, pattern ranking ρ = 1.000 at d ≥ 11. Data from q1-exact-results.tex.
- Q1 Sparsity: How Sparse Is the Explanation? — 44,794 patterns, median 15 active at τ=0.01. Wh dominates (87%). Gradient does not vanish: peak mass at d≈21 exceeds d=0. Data from q1-sparsity.tex.
- Q2–Q4: Offsets, Neurons, Saturation — Three questions answered. d=25 dominant, h28 = 99.7%, all 128 volatile, dwell 3.3 steps. Co-flip clusters and routing backbone. Data from q234-results.tex.
- E onto N: The Bi-Embedding — Events and numbers, entropy identity, Hebbian covariance, analytic construction. Data from e-onto-n.tex.
- Event Space Viewer — Interactive SVD of skip-bigram tables: 16 offsets × 8 singular vectors. Byte grid colored by event assignment (sign bits of top 3 SVs). Input/output loadings, cross-offset event map, auto-generated SV names. Data from 262k enwik8.
- ES Isomorphism Viewer — Arch-native (SVD) ↔ human-native (semantic) byte partitions. Optimal permutation, Sankey diagram, confusion matrix, inner product mass, bidirectional projections across offsets. Client-side brute-force 8! search.
- The Entropy Bridge — Shannon = Boltzmann through event factoring. Hierarchical factoring table. Data from entropy-bridge.tex.
- Quotient Chain — Quotient chain decomposition, factor map, temporal patterns.
- H = 2³²: The f32 State Space — 496-bit effective state, IEEE 754 decomposition, 300:52:1 leverage, mantissa ablation (removing mantissa improves bpc by 0.139), bit propagation. Data from h32.tex.
- Experiment Dashboard — Combined dashboard view of all experiments with tabbed navigation.
Experimental Tools
- Weight construction: write_weights12.c (optimization continuum: PI→Newton→SGD), write_weights11.c (pseudo-inverse W_y, 1.56 bpc), write_weights10.c (cross-offset MI, L2 regularization), write_weights9.c (optimal hash design), write_weights8.c (comprehensive comparison), write_weights7.c (perfect hash + generalization), write_weights6.c (proper hash, 1.89 bpc analytic), write_weights5.c (fully analytic, zero optimization), write_weights4.c (shift-register), write_weights3.c (Hebbian covariance), write_weights2.c (sign-conditioned), write_weights.c (PMI-based)
- Boolean automaton: q1_boolean.c, q1_bool_attractor.c, q1_margins.c
- Q2-Q4: q2_offsets.c, q3_neurons.c, q3_neuron_roles.c, q3_decode_neurons.c, q4_saturation.c
- Q5-Q7: q5_redux.c, q5_redux_train.c, q6_justify.c, q7_algebraic.c, q7_higher_order.c
- GMP/exact: q1_exact_p1.c–q1_exact_p6.c, q1_lyapunov.c, q1_lyapunov2.c, q1_bit_sample.c
Navigation
← Previous: 20260210
Narrative paper, event arithmetic, prime encoding, viewer v8.
Next: 20260211_2 →
Scaling to full enwik9: experimental design, seven predictions.