Weight Construction

Can you derive all 82,304 parameters from data statistics alone?

The sat-rnn has 82,304 trainable parameters found by BPTT-50 + Adam over ~2 million gradient steps. This page presents twelve experiments that ask: how much of each weight matrix is predicted by data statistics? What happens when you replace trained weights with data-derived versions? And can you build a working model from scratch, with zero gradient descent?

"The weights are not arbitrary parameters found by stochastic optimization. They are a noisy encoding of the data's covariance structure, filtered through the f32 quotient and the BPTT optimization landscape."
— narrative.tex

1. The 82,304 Parameters

The sat-rnn has three weight matrices and two bias vectors. Each serves a different role in the computation, and each has a different relationship to the data's statistical structure.

Parameters  Tensor          Role
32,768      Wx (128 × 256)  Input encoding
16,384      Wh (128 × 128)  Recurrent dynamics
128         bh (128)        Hidden bias
32,768      Wy (256 × 128)  Output readout
256         by (256)        Output bias

2. Weight Prediction from Data Statistics

If the RNN's weights encode data statistics, we should be able to predict them from the data alone. For each matrix, we compute a data-derived estimate and measure its correlation with the trained values.

Methods

Hebbian covariance: Ŵh[i][j] = scale · cov(hj(t), hi(t+1)) over data positions.
Sign-conditioned log-ratio: For each neuron j and byte x, compute log P(x | hj > 0) − log P(x | hj < 0).
Boolean influence: Track which neurons actually flip other neurons' signs across time steps.
Sign log-odds: log(fraction of positions where hj > 0) − log(fraction where hj < 0).
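The Hebbian covariance estimator above can be sketched in a few lines of C. This is a toy illustration under assumed conventions: the sizes, the scale handling, and the name hebbian_wh are mine; the real estimator in write_weights.c runs over 128 neurons and the full data stream.

```c
#include <assert.h>
#include <math.h>

/* Toy sketch of the Hebbian covariance estimate
 *   What_h[i][j] = scale * cov(h_j(t), h_i(t+1))
 * over T data positions. T and N are toy sizes; the real model uses N = 128. */
#define T 4
#define N 2

static void hebbian_wh(const double H[T][N], double scale, double What[N][N]) {
    double mean[N] = {0};
    for (int t = 0; t < T; t++)
        for (int j = 0; j < N; j++)
            mean[j] += H[t][j] / T;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double c = 0.0;                 /* cov(pre h_j(t), post h_i(t+1)) */
            for (int t = 0; t + 1 < T; t++)
                c += (H[t][j] - mean[j]) * (H[t + 1][i] - mean[i]);
            What[i][j] = scale * c / (T - 1);
        }
}
```

Because the real neurons are saturated (|hj| ≈ 1), this lag-1 covariance is dominated by sign flips, which is why it tracks the Boolean transition structure.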
Weight Prediction: Pearson Correlation with Trained Values
Higher bars = more variance explained by data statistics. Wh for dynamically important entries (|w| ≥ 3.0) reaches r = 0.56 (R² = 31%). These are the entries that actually determine the Boolean transition function. Data: write_weights.c
Matrix           Method               Entries  r     R²
Wh (all)         Hebbian covariance   16,384   0.40  16%
Wh (|w| ≥ 3.0)   Hebbian covariance   5,887    0.56  31%
Wh (all)         Boolean influence    16,384   0.56  32%
Wy (all)         Sign-split log-prob  32,768   0.54  29%
Wh (all)         Sign correlation     16,384   0.44  19%
Wx (all)         PMI-based            32,768   0.31  10%
Wx (all)         Sign-conditioned     32,768   0.25  6%
bh               Sign log-odds        128      0.58  34%
"The recurrent weights encode the temporal covariance structure of the hidden states. Since neurons are strongly saturated (|hj| ≈ 1), their covariance is dominated by the sign-flip structure, which is precisely what the Boolean interpretation captures."
— narrative.tex
"The input encoding has a symmetry that covariance cannot capture: Wx maps 256 bytes to 128-dimensional space, and permuting the byte labels doesn't change the covariance. The actual Wx breaks this symmetry in ways determined by the interaction between Wx and Wh during BPTT optimization."
— narrative.tex (on why Wx is harder to predict)

Per-Neuron Prediction Quality

Wy Prediction: Top 10 Per-Neuron Correlations (Sign-Split Log-Prob)
Each bar shows how well the data-derived Wy column matches the trained column for one neuron. Top neurons (h117, h59, h94) reach r > 0.85. Mean per-neuron |r| = 0.675. Data: write_weights.c
Wx Prediction: Top 10 Per-Neuron Correlations (Sign-Conditioned)
The important neurons (h8, h68, h52) have the best-predicted input encoding (r > 0.75). These are the same neurons that dominate the readout (see Neuron Knockout). Mean per-neuron |r| = 0.283. Data: write_weights.c

3. Substitution: Replacing Trained Weights

Correlation measures prediction quality. The stronger test is substitution: what happens when you actually replace trained weights with data-derived versions and run the model?

Effect of Replacing Trained Weights with Data-Derived Versions
Green bars: improvement over the trained model (4.965 bpc). Red bars: worsening. Replacing bh and blending Wy both improve the trained model. Data: write_weights3.c
Configuration               BPC    Δ from trained  Direction
Trained model (baseline)    4.965  n/a             n/a
50% Hebbian Wy blend        4.307  −0.658          improves
Trained + Hebbian bh        4.954  −0.011          improves
80/20 trained/Hebbian Wh    4.919  −0.046          improves
Replace Wh entirely         5.219  +0.254          worsens
Replace Wx entirely         5.460  +0.496          worsens
Hebbian all matrices        6.954  +1.989          worsens
Uniform (no model)          8.000  +3.035          baseline
"Mixing 50% Hebbian Wy with 50% trained Wy yields 4.31 bpc, 0.66 better than the trained model. The trained Wy is over-optimized for the training dynamics and slightly miscalibrated for the Boolean dynamics that actually matter. The Hebbian correction pushes the readout toward the data's true conditional distribution."
— narrative.tex, Observation 1
Three substitutions improve the trained model: replacing bh is essentially free (improves by 0.011), blending 20% Hebbian Wh improves by 0.046, and blending 50% Hebbian Wy improves by 0.658. The data-derived readout is better than what BPTT found.
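The blend experiments are plain convex combinations of trained and data-derived weights. A minimal sketch (the flat-array layout and the name blend_weights are mine):

```c
#include <assert.h>
#include <math.h>

/* Convex blend of trained and Hebbian weights over a flat array of n
 * entries: alpha = 0.5 gives the 50/50 Wy blend, alpha = 0.2 the
 * 80/20 trained/Hebbian Wh blend. */
static void blend_weights(const double *w_trained, const double *w_hebb,
                          double alpha, double *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (1.0 - alpha) * w_trained[i] + alpha * w_hebb[i];
}
```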

4. Construction from Scratch

The substitution experiments replace individual matrices. The next step: build all the dynamics (Wx, Wh, bh) from data, then either derive or optimize Wy.

Shift-Register Architecture

128 neurons partitioned into 16 groups of 8. Group 0 encodes the current input byte via a deterministic hash in Wx. Group g (g ≥ 1) carries the hash of g steps ago via a shift-register Wh (diagonal copy, weight 5.0). After 16 steps of warmup, each group encodes the exact identity of a past input byte. 100% encoding accuracy verified.

This gives the model access to offsets 0–15 with zero information loss.

Constructed vs Trained: Three Readout Methods
All three constructions use shift-register dynamics (Wx, Wh, bh from data). The analytic construction uses zero optimization. The optimized construction uses SGD on Wy only. Both are compared against the full BPTT-50 trained model. Data: write_weights5.c, write_weights6.c
Construction              All data (bpc)  Train half  Test half  Optimization
Uniform baseline          8.000           n/a         n/a        None
Analytic Wy (log-ratio)   1.890           3.72        4.88       Zero
Optimized Wy (SGD)        0.587           n/a         0.40       1,000 epochs
Trained model (BPTT-50)   4.965           4.82        5.08       ~2M steps
1.890 bpc: analytic construction, zero optimization
0.587 bpc: optimized Wy only, 1,000 epochs SGD
4.965 bpc: trained model, ~2M gradient steps
82,304 parameters: all from data statistics
"The fully analytic construction beats the trained model by 3.08 bpc on the full data with zero optimization. All 82,304 parameters are determined by data statistics: Wx and Wh by construction (hash + shift register), Wy by skip-bigram log-ratios, and the output bias by from byte marginals."
— narrative.tex

Generalization

Train vs Test Performance: Analytic, Optimized, and Trained
The analytic model generalizes to unseen data within 0.2 bpc of the trained model (4.88 vs 5.08 on test half). The optimized Wy overfits to 520 bytes but the underlying structure transfers (test 0.40 bpc). Data: write_weights6.c
"On the test half (unseen during Wy construction), the analytic model achieves 4.88 bpc vs the trained model's 5.08 bpc — within 0.2 bpc. The analytic model captures the same statistical structure that BPTT discovers, but directly from data counts rather than through gradient optimization."
— narrative.tex, Observation 2

5. The Optimization Continuum

The gap between closed-form Wy (1.89 bpc) and SGD-optimized Wy (0.37 bpc) can be bridged incrementally. Each step adds a small amount of numerical optimization on top of the data-derived solution.

From Closed-Form to Optimized: The Continuum
Every point on this curve uses the same shift-register dynamics (Wx, Wh, bh from data). Only Wy varies, from fully analytic (0 iterations) through pseudo-inverse and Newton steps to full SGD convergence. The trained model (dashed red line) uses ~2M gradient steps on all 82k parameters. Data: write_weights12.c
Method                             BPC    Iterations        What it captures
Per-offset log-ratio               1.890  0 (closed form)   Independent per-offset statistics
Pseudo-inverse (residual targets)  1.557  0 (matrix solve)  Cross-offset interactions
PI + 20 Newton steps               0.967  20                Loss surface curvature
SGD from PI initialization         0.476  500               Fine nonlinear structure
SGD from zero                      0.374  1,000             Full Wy optimization
Trained model (BPTT-50)            4.965  ~2,000,000        All params, chaotic dynamics
"The pseudo-inverse captures cross-offset interactions that the per-offset log-ratio misses (improving from 1.89 to 1.56 bpc). Each additional Newton step further adapts the readout to the cross-entropy loss surface. Even 20 steps suffice to reach 0.97 bpc — beating the trained model by 5×."
— narrative.tex
Every point on the continuum beats the trained model. From zero optimization (1.89) through 20 Newton steps (0.97) to full SGD (0.37), the shift-register construction outperforms 2 million gradient steps on the full architecture.
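The pseudo-inverse step is a linear least-squares solve. A toy sketch of the normal-equations form for one output column, with 2-dimensional sizes and naive Gaussian elimination standing in for the real solver (write_weights12.c additionally uses residual targets, which this sketch omits):

```c
#include <assert.h>
#include <math.h>

/* Solve the normal equations (H^T H) w = H^T y for one readout column,
 * H being the TT x NN matrix of hidden states. Toy sizes. */
#define TT 4
#define NN 2

static void solve2(double A[NN][NN], double b[NN], double w[NN]) {
    /* Gaussian elimination with partial pivoting, NN = 2 */
    if (fabs(A[1][0]) > fabs(A[0][0])) {
        for (int j = 0; j < NN; j++) { double t = A[0][j]; A[0][j] = A[1][j]; A[1][j] = t; }
        double t = b[0]; b[0] = b[1]; b[1] = t;
    }
    double m = A[1][0] / A[0][0];
    A[1][1] -= m * A[0][1];
    b[1]    -= m * b[0];
    w[1] = b[1] / A[1][1];
    w[0] = (b[0] - A[0][1] * w[1]) / A[0][0];
}

static void pinv_column(const double H[TT][NN], const double y[TT], double w[NN]) {
    double A[NN][NN] = {{0}}, b[NN] = {0};
    for (int i = 0; i < NN; i++) {
        for (int j = 0; j < NN; j++)
            for (int t = 0; t < TT; t++)
                A[i][j] += H[t][i] * H[t][j];     /* H^T H */
        for (int t = 0; t < TT; t++)
            b[i] += H[t][i] * y[t];               /* H^T y */
    }
    solve2(A, b, w);
}
```

Unlike the per-offset log-ratio, the H^T H term couples all features, which is how this solve captures the cross-offset interactions the table credits it with.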

6. Hash Design: Diversity Beats Optimality

The hash function that encodes input bytes into neuron signs has a dramatic effect on analytic performance. This is a design choice, not a learned parameter.

Hash Function Design: BPC by Approach
A random mixed hash (170 distinct patterns out of 256) achieves 1.89 bpc. A perfect bit-extraction hash (256/256 distinct) achieves only 3.60 bpc. Data: write_weights7.c, write_weights9.c
"Random half-splits provide diverse binary features; bit extraction creates correlated features. This is the ensemble-methods principle: diverse weak learners outperform repeated strong learners. Hash collisions (86 out of 256 bytes share a pattern with another) provide implicit regularization by pooling similar bytes."
— narrative.tex
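A random half-split feature can be generated by shuffling the 256 byte values and giving one half the bit. This is a hedged sketch of the idea, not the actual splits (those live in write_weights7.c); the seeding and names are mine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static unsigned char hash_pat[256];   /* 8 hash bits per byte value */

/* Each bit assigns 1 to a uniformly random half of the 256 byte values
 * (a "random half-split"). Bit extraction would instead use bit k of the
 * byte itself: perfectly distinct patterns, but highly correlated features. */
static void build_halfsplit_hash(unsigned int seed) {
    unsigned char perm[256];
    memset(hash_pat, 0, sizeof hash_pat);
    srand(seed);
    for (int bit = 0; bit < 8; bit++) {
        for (int i = 0; i < 256; i++) perm[i] = (unsigned char)i;
        for (int i = 255; i > 0; i--) {            /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            unsigned char t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (int i = 0; i < 128; i++)              /* random half gets the bit */
            hash_pat[perm[i]] |= (unsigned char)(1u << bit);
    }
}
```

Independent half-splits occasionally give two bytes the same 8-bit pattern; the text argues these collisions act as implicit regularization by pooling similar bytes.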

6b. Full-Bandwidth Carrier Signal

What happens when we increase the carrier bandwidth? With 256 hidden units (16 groups × 16 hash bits), the shift-register encodes nearly every byte distinctly — and the closed-form PI drops to 0.44 bpc with zero SGD.

Architecture Comparison: PI vs SGD across Carrier Designs
Four carrier architectures compared on closed-form pseudo-inverse (green) and SGD-optimized Wy (blue). Doubling hidden size to 256 drops PI from 1.41 to 0.44 bpc. The hybrid (semantic + hash features) achieves the best PI at 0.40 bpc. Data: write_weights14.c
Architecture              Hidden          Distinct patterns  PI (closed form)  SGD (1,000 ep)
Hash baseline             128 (16×8)      170 / 256          1.409             0.068
Full bandwidth            256 (16×16)     253 / 256          0.436             0.075
Semantic features         128 (16×8)      11 / 256           1.493             0.189
Hybrid (semantic + hash)  256 (16×(8+8))  202 / 256          0.396             0.077
Near-perfect byte discrimination drops PI by 3×. With 253/256 distinct hash patterns (vs 170/256 at baseline), the closed-form PI drops from 1.41 to 0.44 bpc. The hybrid architecture achieves 0.40 bpc PI — approaching SGD territory without any optimization. SGD converges to ~0.07 bpc regardless of architecture, suggesting the remaining gap is in Wy's nonlinear structure.

7. Why the Trained Model Underperforms

The shift-register construction has a structural advantage over the trained model: perfect memory. The trained model's chaotic dynamics destroy information at depth.

16 steps: shift-register memory (perfect)
3.44: trained model's Lyapunov exponent (> 1 = chaos)
0%: shift-register information loss
100%: trained model's gradient decorrelation at d = 1
"The shift-register construction has perfect 16-step memory with zero information loss. The trained model's chaotic dynamics (Wh has Lyapunov exponent > 1) actively destroy information at depth. The Boolean automaton is an inefficient encoding of the data's skip-k-gram structure."
— narrative.tex, Observation 3

8. The Scaffolding Paradox

The 82k-parameter architecture is needed for training but not for inference. Can you train the minimal 26k-parameter architecture directly?

Training from Scratch: Full vs Sparse Architecture
The sparse 26k-parameter redux architecture cannot train from random initialization (7.74 bpc after 50 epochs, barely below uniform 8.0). The full 82k architecture reaches 5.16 bpc. Gradient flow requires the dense Wh as scaffolding. Meanwhile, the constructed model reaches 1.89 bpc with zero training. Data: q5_redux_train.c
"The extra 56k parameters are the ladder: needed for gradient-based navigation of the loss surface, discarded once the function is found."
— narrative.tex
"Training gave us too much. The full model has 82,304 parameters. Inference needs ~26,000 (the redux). The remaining 56,000 parameters were scaffolding for gradient flow — needed to navigate the optimization landscape but pure overhead once the good map is found."
— synthesis.tex
Training needs 82k parameters. Inference needs 26k. Construction needs 0. The shift-register construction bypasses the optimization landscape entirely, building the right function directly from data statistics.

9. The Full Comparison

All Configurations on a Single Scale
Every configuration on a single axis. The dashed line shows the trained model (4.965 bpc). Constructed models span from 6.954 (Hebbian all) down to 0.374 (optimized Wy). Three substitution experiments (bh, Wh 80/20, Wy 50/50) improve the trained model.

Constructed Model Evaluation Table

Configuration                        BPC    Notes
Uniform baseline                     8.000  No model
Sparse redux (trained from scratch)  7.74   26k params, gradient flow fails
Hebbian all matrices                 6.954  Covariance-only, no optimization
Replace Wx                           5.460  +0.496 from trained
Replace Wh                           5.219  +0.254 from trained
Trained model                        4.965  Full f32, BPTT-50
Trained + Hebbian bh                 4.954  Improves by 0.011
80/20 trained/Hebbian Wh             4.919  Improves by 0.046
50% Hebbian Wy blend                 4.307  Improves by 0.658
Hebbian dynamics + optimized Wy      3.961  500 epochs SGD
Bit-extraction hash + analytic Wy    3.600  Perfect hash, worse diversity
Sign-cond. dynamics + optimized Wy   2.800  500 epochs SGD
Analytic Wy (zero optimization)      1.890  ALL params from data, ZERO optimization
Pseudo-inverse Wy                    1.557  0 iterations (matrix solve)
Bool readout + optimized Wy          1.005  Overfits to 520 bytes
PI + 20 Newton steps                 0.967  20 iterations
SGD from PI init                     0.476  500 epochs
SGD from zero                        0.374  1,000 epochs

10. The Twelve-Day Arc

This page is the culmination of twelve days of experiments. The arc:

  1. Training (Jan 31): 82,304 opaque parameters, 0.079 bpc
  2. UM isomorphism (Feb 1–4): Every prediction has a pattern witness
  3. Pattern discovery (Feb 7–8): 6,180 patterns, skip-k-grams
  4. Boolean automaton (Feb 9–11): Sign carries 99.7%, mantissa is noise (experiment page)
  5. Minimal model: 20 neurons + 36% Wh = 0.15 bpc better (experiment page)
  6. Attribution chains: ~15 weights per prediction (experiment page)
  7. Weight construction (this page): All 82k parameters from data
"The Hebbian rule Δw ∝ cov(pre, post) is the first-order Taylor expansion of gradient descent on cross-entropy loss. This is why Hebbian covariance predicts Wh: the gradient updates that found the trained weights are dominated by the same covariance structure that the UM counts."
— narrative.tex
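One hedged way to unpack that claim (my reconstruction, not from the source): the truncated-BPTT gradient for a single recurrent weight has the standard outer-product form, and in the saturated regime its accumulation over data resembles the lag-1 covariance the UM counts.

```latex
% Standard BPTT gradient for one recurrent weight:
\frac{\partial \mathcal{L}}{\partial W_h[i][j]} \;=\; \sum_t \delta_i(t+1)\, h_j(t)
% If, in the saturated sign-driven regime, the backpropagated error
% \delta_i is roughly proportional to the neuron's own activation,
% the accumulated update behaves like
\Delta W_h[i][j] \;\propto\; \sum_t h_i(t+1)\, h_j(t)
\;\approx\; \mathrm{cov}\!\left(h_j(t),\, h_i(t+1)\right)
% (for near-mean-zero saturated units the second moment and the
% covariance coincide), matching the Hebbian estimate.
```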
"The mantissa was the ladder. The result is counting."
— narrative.tex, final line

Related Experiment Pages