Weight Construction

Can you derive all 82,304 parameters from data statistics alone?

The sat-rnn has 82,304 trainable parameters found by BPTT-50 + Adam over ~2 million gradient steps. This page presents twelve experiments that ask: how much of each weight matrix is predicted by data statistics? What happens when you replace trained weights with data-derived versions? And can you build a working model from scratch, with zero gradient descent?

"The weights are not arbitrary parameters found by stochastic optimization. They are a noisy encoding of the data's covariance structure, filtered through the f32 quotient and the BPTT optimization landscape."
— narrative.tex

1. The 82,304 Parameters

The sat-rnn has three weight matrices and two bias vectors. Each serves a different role in the computation, and each has a different relationship to the data's statistical structure.

Parameters  Tensor          Role
32,768      Wx (128 × 256)  Input encoding
16,384      Wh (128 × 128)  Recurrent dynamics
128         bh (128)        Hidden bias
32,768      Wy (256 × 128)  Output readout
256         by (256)        Output bias

2. Weight Prediction from Data Statistics

If the RNN's weights encode data statistics, we should be able to predict them from the data alone. For each matrix, we compute a data-derived estimate and measure its correlation with the trained values.

Methods

Hebbian covariance: Ŵh[i][j] = scale · cov(hj(t), hi(t+1)) over data positions.
Sign-conditioned log-ratio: For each neuron j and byte x, compute log P(x | hj > 0) − log P(x | hj < 0).
Boolean influence: Track which neurons actually flip other neurons' signs across time steps.
Sign log-odds: log(fraction of positions where hj > 0) − log(fraction where hj < 0).
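The Hebbian covariance estimator above can be sketched in a few lines of C. This is a toy illustration under assumed conventions: the sizes, the scale handling, and the name hebbian_wh are mine; the real estimator in write_weights.c runs over 128 neurons and the full data stream.

```c
#include <assert.h>
#include <math.h>

/* Toy sketch of the Hebbian covariance estimate
 *   What_h[i][j] = scale * cov(h_j(t), h_i(t+1))
 * over T data positions. T and N are toy sizes; the real model uses N = 128. */
#define T 4
#define N 2

static void hebbian_wh(const double H[T][N], double scale, double What[N][N]) {
    double mean[N] = {0};
    for (int t = 0; t < T; t++)
        for (int j = 0; j < N; j++)
            mean[j] += H[t][j] / T;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double c = 0.0;                 /* cov(pre h_j(t), post h_i(t+1)) */
            for (int t = 0; t + 1 < T; t++)
                c += (H[t][j] - mean[j]) * (H[t + 1][i] - mean[i]);
            What[i][j] = scale * c / (T - 1);
        }
}
```

Because the real neurons are saturated (|hj| ≈ 1), this lag-1 covariance is dominated by sign flips, which is why it tracks the Boolean transition structure.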
Weight Prediction: Pearson Correlation with Trained Values
Higher bars = more variance explained by data statistics. Wh for dynamically important entries (|w| ≥ 3.0) reaches r = 0.56 (R² = 31%). These are the entries that actually determine the Boolean transition function. Data: write_weights.c
Matrix           Method               Entries  r     R²
Wh (all)         Hebbian covariance   16,384   0.40  16%
Wh (|w| ≥ 3.0)   Hebbian covariance   5,887    0.56  31%
Wh (all)         Boolean influence    16,384   0.56  32%
Wy (all)         Sign-split log-prob  32,768   0.54  29%
Wh (all)         Sign correlation     16,384   0.44  19%
Wx (all)         PMI-based            32,768   0.31  10%
Wx (all)         Sign-conditioned     32,768   0.25  6%
bh               Sign log-odds        128      0.58  34%
"The recurrent weights encode the temporal covariance structure of the hidden states. Since neurons are strongly saturated (|hj| ≈ 1), their covariance is dominated by the sign-flip structure, which is precisely what the Boolean interpretation captures."
— narrative.tex
"The input encoding has a symmetry that covariance cannot capture: Wx maps 256 bytes to 128-dimensional space, and permuting the byte labels doesn't change the covariance. The actual Wx breaks this symmetry in ways determined by the interaction between Wx and Wh during BPTT optimization."
— narrative.tex (on why Wx is harder to predict)

Per-Neuron Prediction Quality

Wy Prediction: Top 10 Per-Neuron Correlations (Sign-Split Log-Prob)
Each bar shows how well the data-derived Wy column matches the trained column for one neuron. Top neurons (h117, h59, h94) reach r > 0.85. Mean per-neuron |r| = 0.675. Data: write_weights.c
Wx Prediction: Top 10 Per-Neuron Correlations (Sign-Conditioned)
The important neurons (h8, h68, h52) have the best-predicted input encoding (r > 0.75). These are the same neurons that dominate the readout (see Neuron Knockout). Mean per-neuron |r| = 0.283. Data: write_weights.c

3. Substitution: Replacing Trained Weights

Correlation measures prediction quality. The stronger test is substitution: what happens when you actually replace trained weights with data-derived versions and run the model?

Effect of Replacing Trained Weights with Data-Derived Versions
Green bars: improvement over the trained model (4.965 bpc). Red bars: worsening. Replacing bh and blending Wy both improve the trained model. Data: write_weights3.c
Configuration               BPC    Δ from trained  Direction
Trained model (baseline)    4.965  n/a             n/a
50% Hebbian Wy blend        4.307  −0.658          improves
Trained + Hebbian bh        4.954  −0.011          improves
80/20 trained/Hebbian Wh    4.919  −0.046          improves
Replace Wh entirely         5.219  +0.254          worsens
Replace Wx entirely         5.460  +0.496          worsens
Hebbian all matrices        6.954  +1.989          worsens
Uniform (no model)          8.000  +3.035          baseline
"Mixing 50% Hebbian Wy with 50% trained Wy yields 4.31 bpc, 0.66 better than the trained model. The trained Wy is over-optimized for the training dynamics and slightly miscalibrated for the Boolean dynamics that actually matter. The Hebbian correction pushes the readout toward the data's true conditional distribution."
— narrative.tex, Observation 1
Three substitutions improve the trained model: replacing bh is essentially free (improves by 0.011), blending 20% Hebbian Wh improves by 0.046, and blending 50% Hebbian Wy improves by 0.658. The data-derived readout is better than what BPTT found.
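The blend experiments are plain convex combinations of trained and data-derived weights. A minimal sketch (the flat-array layout and the name blend_weights are mine):

```c
#include <assert.h>
#include <math.h>

/* Convex blend of trained and Hebbian weights over a flat array of n
 * entries: alpha = 0.5 gives the 50/50 Wy blend, alpha = 0.2 the
 * 80/20 trained/Hebbian Wh blend. */
static void blend_weights(const double *w_trained, const double *w_hebb,
                          double alpha, double *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (1.0 - alpha) * w_trained[i] + alpha * w_hebb[i];
}
```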

4. Construction from Scratch

The substitution experiments replace individual matrices. The next step: build all the dynamics (Wx, Wh, bh) from data, then either derive or optimize Wy.

Shift-Register Architecture

128 neurons partitioned into 16 groups of 8. Group 0 encodes the current input byte via a deterministic hash in Wx. Group g (g ≥ 1) carries the hash of g steps ago via a shift-register Wh (diagonal copy, weight 5.0). After 16 steps of warmup, each group encodes the exact identity of a past input byte. 100% encoding accuracy verified.

This gives the model access to offsets 0–15 with zero information loss.

Constructed vs Trained: Three Readout Methods
All three constructions use shift-register dynamics (Wx, Wh, bh from data). The analytic construction uses zero optimization. The optimized construction uses SGD on Wy only. Both are compared against the full BPTT-50 trained model. Data: write_weights5.c, write_weights6.c
Construction              All data (bpc)  Train half  Test half  Optimization
Uniform baseline          8.000           n/a         n/a        None
Analytic Wy (log-ratio)   1.890           3.72        4.88       Zero
Optimized Wy (SGD)        0.587           n/a         0.40       1,000 epochs
Trained model (BPTT-50)   4.965           4.82        5.08       ~2M steps
1.890 bpc: analytic construction, zero optimization
0.587 bpc: optimized Wy only, 1,000 epochs SGD
4.965 bpc: trained model, ~2M gradient steps
82,304 parameters: all from data statistics
"The fully analytic construction beats the trained model by 3.08 bpc on the full data with zero optimization. All 82,304 parameters are determined by data statistics: Wx and Wh by construction (hash + shift register), Wy by skip-bigram log-ratios, and the output bias by from byte marginals."
— narrative.tex

Generalization

Train vs Test Performance: Analytic, Optimized, and Trained
The analytic model generalizes to unseen data within 0.2 bpc of the trained model (4.88 vs 5.08 on test half). The optimized Wy overfits to 520 bytes but the underlying structure transfers (test 0.40 bpc). Data: write_weights6.c
"On the test half (unseen during Wy construction), the analytic model achieves 4.88 bpc vs the trained model's 5.08 bpc — within 0.2 bpc. The analytic model captures the same statistical structure that BPTT discovers, but directly from data counts rather than through gradient optimization."
— narrative.tex, Observation 2

5. The Optimization Continuum

The gap between closed-form Wy (1.89 bpc) and SGD-optimized Wy (0.37 bpc) can be bridged incrementally. Each step adds a small amount of numerical optimization on top of the data-derived solution.

From Closed-Form to Optimized: The Continuum
Every point on this curve uses the same shift-register dynamics (Wx, Wh, bh from data). Only Wy varies, from fully analytic (0 iterations) through pseudo-inverse and Newton steps to full SGD convergence. The trained model (dashed red line) uses ~2M gradient steps on all 82k parameters. Data: write_weights12.c
Method                             BPC    Iterations        What it captures
Per-offset log-ratio               1.890  0 (closed form)   Independent per-offset statistics
Pseudo-inverse (residual targets)  1.557  0 (matrix solve)  Cross-offset interactions
PI + 20 Newton steps               0.967  20                Loss surface curvature
SGD from PI initialization         0.476  500               Fine nonlinear structure
SGD from zero                      0.374  1,000             Full Wy optimization
Trained model (BPTT-50)            4.965  ~2,000,000        All params, chaotic dynamics
"The pseudo-inverse captures cross-offset interactions that the per-offset log-ratio misses (improving from 1.89 to 1.56 bpc). Each additional Newton step further adapts the readout to the cross-entropy loss surface. Even 20 steps suffice to reach 0.97 bpc — beating the trained model by 5×."
— narrative.tex
Every point on the continuum beats the trained model. From zero optimization (1.89) through 20 Newton steps (0.97) to full SGD (0.37), the shift-register construction outperforms 2 million gradient steps on the full architecture.
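The pseudo-inverse step is a linear least-squares solve. A toy sketch of the normal-equations form for one output column, with 2-dimensional sizes and naive Gaussian elimination standing in for the real solver (write_weights12.c additionally uses residual targets, which this sketch omits):

```c
#include <assert.h>
#include <math.h>

/* Solve the normal equations (H^T H) w = H^T y for one readout column,
 * H being the TT x NN matrix of hidden states. Toy sizes. */
#define TT 4
#define NN 2

static void solve2(double A[NN][NN], double b[NN], double w[NN]) {
    /* Gaussian elimination with partial pivoting, NN = 2 */
    if (fabs(A[1][0]) > fabs(A[0][0])) {
        for (int j = 0; j < NN; j++) { double t = A[0][j]; A[0][j] = A[1][j]; A[1][j] = t; }
        double t = b[0]; b[0] = b[1]; b[1] = t;
    }
    double m = A[1][0] / A[0][0];
    A[1][1] -= m * A[0][1];
    b[1]    -= m * b[0];
    w[1] = b[1] / A[1][1];
    w[0] = (b[0] - A[0][1] * w[1]) / A[0][0];
}

static void pinv_column(const double H[TT][NN], const double y[TT], double w[NN]) {
    double A[NN][NN] = {{0}}, b[NN] = {0};
    for (int i = 0; i < NN; i++) {
        for (int j = 0; j < NN; j++)
            for (int t = 0; t < TT; t++)
                A[i][j] += H[t][i] * H[t][j];     /* H^T H */
        for (int t = 0; t < TT; t++)
            b[i] += H[t][i] * y[t];               /* H^T y */
    }
    solve2(A, b, w);
}
```

Unlike the per-offset log-ratio, the H^T H term couples all features, which is how this solve captures the cross-offset interactions the table credits it with.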

6. Hash Design: Diversity Beats Optimality

The hash function that encodes input bytes into neuron signs has a dramatic effect on analytic performance. This is a design choice, not a learned parameter.

Hash Function Design: BPC by Approach
A random mixed hash (170 distinct patterns out of 256) achieves 1.89 bpc. A perfect bit-extraction hash (256/256 distinct) achieves only 3.60 bpc. Data: write_weights7.c, write_weights9.c
"Random half-splits provide diverse binary features; bit extraction creates correlated features. This is the ensemble-methods principle: diverse weak learners outperform repeated strong learners. Hash collisions (86 out of 256 bytes share a pattern with another) provide implicit regularization by pooling similar bytes."
— narrative.tex
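A random half-split feature can be generated by shuffling the 256 byte values and giving one half the bit. This is a hedged sketch of the idea, not the actual splits (those live in write_weights7.c); the seeding and names are mine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static unsigned char hash_pat[256];   /* 8 hash bits per byte value */

/* Each bit assigns 1 to a uniformly random half of the 256 byte values
 * (a "random half-split"). Bit extraction would instead use bit k of the
 * byte itself: perfectly distinct patterns, but highly correlated features. */
static void build_halfsplit_hash(unsigned int seed) {
    unsigned char perm[256];
    memset(hash_pat, 0, sizeof hash_pat);
    srand(seed);
    for (int bit = 0; bit < 8; bit++) {
        for (int i = 0; i < 256; i++) perm[i] = (unsigned char)i;
        for (int i = 255; i > 0; i--) {            /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            unsigned char t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (int i = 0; i < 128; i++)              /* random half gets the bit */
            hash_pat[perm[i]] |= (unsigned char)(1u << bit);
    }
}
```

Independent half-splits occasionally give two bytes the same 8-bit pattern; the text argues these collisions act as implicit regularization by pooling similar bytes.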

6b. Full-Bandwidth Carrier Signal

What happens when we increase the carrier bandwidth? With 256 hidden units (16 groups × 16 hash bits), the shift-register encodes nearly every byte distinctly — and the closed-form PI drops to 0.44 bpc with zero SGD.

Architecture Comparison: PI vs SGD across Carrier Designs
Four carrier architectures compared on closed-form pseudo-inverse (green) and SGD-optimized Wy (blue). Doubling hidden size to 256 drops PI from 1.41 to 0.44 bpc. The hybrid (semantic + hash features) achieves the best PI at 0.40 bpc. Data: write_weights14.c
Architecture              Hidden          Distinct patterns  PI (closed form)  SGD (1,000 ep)
Hash baseline             128 (16×8)      170 / 256          1.409             0.068
Full bandwidth            256 (16×16)     253 / 256          0.436             0.075
Semantic features         128 (16×8)      11 / 256           1.493             0.189
Hybrid (semantic + hash)  256 (16×(8+8))  202 / 256          0.396             0.077
Near-perfect byte discrimination drops PI by 3×. With 253/256 distinct hash patterns (vs 170/256 at baseline), the closed-form PI drops from 1.41 to 0.44 bpc. The hybrid architecture achieves 0.40 bpc PI — approaching SGD territory without any optimization. SGD converges to ~0.07 bpc regardless of architecture, suggesting the remaining gap is in Wy's nonlinear structure.

7. Why the Trained Model Underperforms

The shift-register construction has a structural advantage over the trained model: perfect memory. The trained model's chaotic dynamics destroy information at depth.

16 steps: shift-register memory (perfect)
3.44: trained model's Lyapunov exponent (> 1 = chaos)
0%: shift-register information loss
100%: trained model's gradient decorrelation at d = 1
"The shift-register construction has perfect 16-step memory with zero information loss. The trained model's chaotic dynamics (Wh has Lyapunov exponent > 1) actively destroy information at depth. The Boolean automaton is an inefficient encoding of the data's skip-k-gram structure."
— narrative.tex, Observation 3

8. The Scaffolding Paradox

The 82k-parameter architecture is needed for training but not for inference. Can you train the minimal 26k-parameter architecture directly?

Training from Scratch: Full vs Sparse Architecture
The sparse 26k-parameter redux architecture cannot train from random initialization (7.74 bpc after 50 epochs, barely below uniform 8.0). The full 82k architecture reaches 5.16 bpc. Gradient flow requires the dense Wh as scaffolding. Meanwhile, the constructed model reaches 1.89 bpc with zero training. Data: q5_redux_train.c
"The extra 56k parameters are the ladder: needed for gradient-based navigation of the loss surface, discarded once the function is found."
— narrative.tex
"Training gave us too much. The full model has 82,304 parameters. Inference needs ~26,000 (the redux). The remaining 56,000 parameters were scaffolding for gradient flow — needed to navigate the optimization landscape but pure overhead once the good map is found."
— synthesis.tex
Training needs 82k parameters. Inference needs 26k. Construction needs 0. The shift-register construction bypasses the optimization landscape entirely, building the right function directly from data statistics.

9. The Full Comparison

All Configurations on a Single Scale
Every configuration on a single axis. The dashed line shows the trained model (4.965 bpc). Constructed models span from 6.954 (Hebbian all) down to 0.374 (optimized Wy). Three substitution experiments (bh, Wh 80/20, Wy 50/50) improve the trained model.

Constructed Model Evaluation Table

Configuration                        BPC    Notes
Uniform baseline                     8.000  No model
Sparse redux (trained from scratch)  7.74   26k params, gradient flow fails
Hebbian all matrices                 6.954  Covariance-only, no optimization
Replace Wx                           5.460  +0.496 from trained
Replace Wh                           5.219  +0.254 from trained
Trained model                        4.965  Full f32, BPTT-50
Trained + Hebbian bh                 4.954  Improves by 0.011
80/20 trained/Hebbian Wh             4.919  Improves by 0.046
50% Hebbian Wy blend                 4.307  Improves by 0.658
Hebbian dynamics + optimized Wy      3.961  500 epochs SGD
Bit-extraction hash + analytic Wy    3.600  Perfect hash, worse diversity
Sign-cond. dynamics + optimized Wy   2.800  500 epochs SGD
Analytic Wy (zero optimization)      1.890  ALL params from data, ZERO optimization
Pseudo-inverse Wy                    1.557  0 iterations (matrix solve)
Bool readout + optimized Wy          1.005  Overfits to 520 bytes
PI + 20 Newton steps                 0.967  20 iterations
SGD from PI init                     0.476  500 epochs
SGD from zero                        0.374  1,000 epochs

10. The Twelve-Day Arc

This page is the culmination of twelve days of experiments. The arc:

  1. Training (Jan 31): 82,304 opaque parameters, 0.079 bpc
  2. UM isomorphism (Feb 1–4): Every prediction has a pattern witness
  3. Pattern discovery (Feb 7–8): 6,180 patterns, skip-k-grams
  4. Boolean automaton (Feb 9–11): Sign carries 99.7%, mantissa is noise (experiment page)
  5. Minimal model: 20 neurons + 36% Wh = 0.15 bpc better (experiment page)
  6. Attribution chains: ~15 weights per prediction (experiment page)
  7. Weight construction (this page): All 82k parameters from data
"The Hebbian rule Δw ∝ cov(pre, post) is the first-order Taylor expansion of gradient descent on cross-entropy loss. This is why Hebbian covariance predicts Wh: the gradient updates that found the trained weights are dominated by the same covariance structure that the UM counts."
— narrative.tex
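One hedged way to unpack that claim (my reconstruction, not from the source): the truncated-BPTT gradient for a single recurrent weight has the standard outer-product form, and in the saturated regime its accumulation over data resembles the lag-1 covariance the UM counts.

```latex
% Standard BPTT gradient for one recurrent weight:
\frac{\partial \mathcal{L}}{\partial W_h[i][j]} \;=\; \sum_t \delta_i(t+1)\, h_j(t)
% If, in the saturated sign-driven regime, the backpropagated error
% \delta_i is roughly proportional to the neuron's own activation,
% the accumulated update behaves like
\Delta W_h[i][j] \;\propto\; \sum_t h_i(t+1)\, h_j(t)
\;\approx\; \mathrm{cov}\!\left(h_j(t),\, h_i(t+1)\right)
% (for near-mean-zero saturated units the second moment and the
% covariance coincide), matching the Hebbian estimate.
```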
"The mantissa was the ladder. The result is counting."
— narrative.tex, final line

Related Experiment Pages