Weight Construction
Can you derive all 82,304 parameters from data statistics alone?
The sat-rnn has 82,304 trainable parameters found by BPTT-50 + Adam over ~2 million gradient steps.
This page presents twelve experiments that ask: how much of each weight matrix is predicted by
data statistics? What happens when you replace trained weights with data-derived versions?
And can you build a working model from scratch, with zero gradient descent?
"The weights are not arbitrary parameters found by stochastic optimization.
They are a noisy encoding of the data's covariance structure, filtered
through the f32 quotient and the BPTT optimization landscape."
— narrative.tex
1. The 82,304 Parameters
The sat-rnn has three weight matrices and two bias vectors (the counts below sum to 82,304).
Each serves a different role in the computation, and each has a different
relationship to the data's statistical structure.
- Wx (128 × 256), input encoding: 32,768 parameters
- Wh (128 × 128), recurrent dynamics: 16,384 parameters
- Wy (256 × 128), output readout: 32,768 parameters
- bh (128) and by (256), biases: 384 parameters
2. Weight Prediction from Data Statistics
If the RNN's weights encode data statistics, we should be able to predict them
from the data alone. For each matrix, we compute a data-derived estimate and
measure its correlation with the trained values.
Methods
Hebbian covariance: Ŵh[i][j] = scale · cov(hj(t), hi(t+1)) over data positions.
Sign-conditioned log-ratio: For each neuron j and byte x, compute log P(x | hj > 0) − log P(x | hj < 0).
Boolean influence: Track which neurons actually flip other neurons' signs across time steps.
Sign log-odds: log(fraction of positions where hj > 0) − log(fraction where hj < 0).
Weight Prediction: Pearson Correlation with Trained Values
Higher bars = more variance explained by data statistics.
Wh for dynamically important entries (|w| ≥ 3.0) reaches r = 0.56 (R² = 31%).
These are the entries that actually determine the Boolean transition function.
Data:
write_weights.c
| Matrix | Method | Entries | Correlation (r) | R² |
|---|---|---|---|---|
| Wh (all) | Hebbian covariance | 16,384 | 0.40 | 16% |
| Wh (\|w\| ≥ 3.0) | Hebbian covariance | 5,887 | 0.56 | 31% |
| Wh (all) | Boolean influence | 16,384 | 0.56 | 32% |
| Wy (all) | Sign-split log-prob | 32,768 | 0.54 | 29% |
| Wh (all) | Sign correlation | 16,384 | 0.44 | 19% |
| Wx (all) | PMI-based | 32,768 | 0.31 | 10% |
| Wx (all) | Sign-conditioned | 32,768 | 0.25 | 6% |
| bh | Sign log-odds | 128 | 0.58 | 34% |
"The recurrent weights encode the temporal covariance structure of the hidden states.
Since neurons are strongly saturated (|hj| ≈ 1), their covariance is
dominated by the sign-flip structure, which is precisely what the Boolean interpretation captures."
— narrative.tex
"The input encoding has a symmetry that covariance cannot capture: Wx maps
256 bytes to 128-dimensional space, and permuting the byte labels doesn't change the
covariance. The actual Wx breaks this symmetry in ways determined by the
interaction between Wx and Wh during BPTT optimization."
— narrative.tex (on why Wx is harder to predict)
Per-Neuron Prediction Quality
Wy Prediction: Top 10 Per-Neuron Correlations (Sign-Split Log-Prob)
Each bar shows how well the data-derived Wy column matches the trained column for one neuron.
Top neurons (h117, h59, h94) reach r > 0.85. Mean per-neuron |r| = 0.675.
Data:
write_weights.c
Wx Prediction: Top 10 Per-Neuron Correlations (Sign-Conditioned)
The important neurons (h8, h68, h52) have the best-predicted input encoding (r > 0.75).
These are the same neurons that dominate the readout (see
Neuron Knockout).
Mean per-neuron |r| = 0.283. Data:
write_weights.c
3. Substitution: Replacing Trained Weights
Correlation measures prediction quality. The stronger test is substitution:
what happens when you actually replace trained weights with data-derived versions
and run the model?
Effect of Replacing Trained Weights with Data-Derived Versions
Green bars: improvement over the trained model (4.965 bpc). Red bars: worsening.
Replacing bh and blending Wy both improve the trained model.
Data:
write_weights3.c
| Configuration | BPC | Δ from Trained | Direction |
|---|---|---|---|
| Trained model (baseline) | 4.965 | — | — |
| 50% Hebbian Wy blend | 4.307 | −0.658 | ↓ IMPROVES |
| Trained + Hebbian bh | 4.954 | −0.011 | ↓ IMPROVES |
| 80/20 trained/Hebbian Wh | 4.919 | −0.046 | ↓ IMPROVES |
| Replace Wh entirely | 5.219 | +0.254 | ↑ worsens |
| Replace Wx entirely | 5.460 | +0.496 | ↑ worsens |
| Hebbian all matrices | 6.954 | +1.989 | ↑ worsens |
| Uniform (no model) | 8.000 | +3.035 | baseline |
"Mixing 50% Hebbian Wy with 50% trained Wy yields 4.31 bpc,
0.66 better than the trained model. The trained Wy is
over-optimized for the training dynamics and slightly miscalibrated
for the Boolean dynamics that actually matter. The Hebbian correction
pushes the readout toward the data's true conditional distribution."
— narrative.tex, Observation 1
Three substitutions improve the model.
Replacing bh improves by 0.011, blending 20% Hebbian Wh improves by 0.046,
and blending 50% Hebbian Wy improves by 0.658: the data-derived readout is better than
what BPTT found.
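The blend itself is a one-liner. A hedged sketch, reading "50% blend" and "80/20" as entry-wise linear interpolation (the actual write_weights3.c may differ):

```c
#include <stddef.h>

/* Interpolate a trained matrix toward its data-derived (Hebbian) estimate.
   alpha = 0.5f reproduces the "50% Hebbian Wy blend" configuration,
   alpha = 0.2f the "80/20 trained/Hebbian Wh" configuration. */
static void blend_weights(float *out, const float *trained,
                          const float *hebbian, size_t n, float alpha)
{
    for (size_t k = 0; k < n; k++)
        out[k] = (1.0f - alpha) * trained[k] + alpha * hebbian[k];
}
```

With alpha = 0 the trained weights come back unchanged; with alpha = 1 the matrix is fully replaced, as in the "Replace Wh entirely" row.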
4. Construction from Scratch
The substitution experiments replace individual matrices. The next step: build
all the dynamics (Wx, Wh, bh) from data, then either derive or optimize Wy.
Shift-Register Architecture
128 neurons partitioned into 16 groups of 8. Group 0 encodes the current input byte
via a deterministic hash in Wx. Group g (g ≥ 1) carries the hash of g steps ago
via a shift-register Wh (diagonal copy, weight 5.0). After 16 steps of warmup, each
group encodes the exact identity of a past input byte. 100% encoding accuracy verified.
This gives the model access to offsets 0–15 with zero information loss.
Constructed vs Trained: Three Readout Methods
Both constructions use shift-register dynamics (Wx, Wh, bh from data).
The analytic construction uses zero optimization. The optimized construction uses SGD on Wy only.
Both are compared against the full BPTT-50 trained model.
Data:
write_weights5.c,
write_weights6.c
| Construction | All Data (bpc) | Train Half | Test Half | Optimization |
|---|---|---|---|---|
| Uniform baseline | 8.000 | — | — | None |
| Analytic Wy (log-ratio) | 1.890 | 3.72 | 4.88 | Zero |
| Optimized Wy (SGD) | 0.587 | — | 0.40 | 1,000 epochs |
| Trained model (BPTT-50) | 4.965 | 4.82 | 5.08 | ~2M steps |
- 1.890 bpc: analytic construction, zero optimization
- 0.587 bpc: optimized Wy only, 1,000 epochs SGD
- 4.965 bpc: trained model, ~2M gradient steps
- 82,304 parameters: all from data statistics
"The fully analytic construction beats the trained model by 3.08 bpc
on the full data with zero optimization. All 82,304 parameters are
determined by data statistics: Wx and Wh by construction (hash +
shift register), Wy by skip-bigram log-ratios, by by byte marginals."
— narrative.tex
Generalization
Train vs Test Performance: Analytic, Optimized, and Trained
The analytic model generalizes to unseen data within 0.2 bpc of the trained model
(4.88 vs 5.08 on the test half). The optimized Wy overfits to 520 bytes but the
underlying structure transfers (test 0.40 bpc).
Data:
write_weights6.c
"On the test half (unseen during Wy construction), the analytic model
achieves 4.88 bpc vs the trained model's 5.08 bpc — within 0.2 bpc.
The analytic model captures the same statistical structure that BPTT
discovers, but directly from data counts rather than through gradient optimization."
— narrative.tex, Observation 2
5. The Optimization Continuum
The gap between closed-form Wy (1.89 bpc) and SGD-optimized Wy (0.37 bpc)
can be bridged incrementally. Each step adds a small amount of numerical
optimization on top of the data-derived solution.
From Closed-Form to Optimized: The Continuum
Every point on this curve uses the same shift-register dynamics (Wx, Wh, bh from data).
Only Wy varies, from fully analytic (0 iterations) through pseudo-inverse and Newton steps
to full SGD convergence. The trained model (dashed red line) uses ~2M gradient steps on all 82k parameters.
Data:
write_weights12.c
| Method | BPC | Iterations | What It Captures |
|---|---|---|---|
| Per-offset log-ratio | 1.890 | 0 (closed form) | Independent per-offset statistics |
| Pseudo-inverse (residual targets) | 1.557 | 0 (matrix solve) | Cross-offset interactions |
| PI + 20 Newton steps | 0.967 | 20 | Loss surface curvature |
| SGD from PI initialization | 0.476 | 500 | Fine nonlinear structure |
| SGD from zero | 0.374 | 1,000 | Full Wy optimization |
| Trained model (BPTT-50) | 4.965 | ~2,000,000 | All params, chaotic dynamics |
"The pseudo-inverse captures cross-offset interactions that the per-offset
log-ratio misses (improving from 1.89 to 1.56 bpc). Each additional
Newton step further adapts the readout to the cross-entropy loss surface.
Even 20 steps suffice to reach 0.97 bpc — beating the trained model by 5×."
— narrative.tex
Every point on the continuum beats the trained model.
From zero optimization (1.89) through 20 Newton steps (0.97) to full SGD (0.37),
the shift-register construction outperforms 2 million gradient steps on the full architecture.
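The "0 iterations (matrix solve)" step can be sketched as ridge-regularized normal equations over the hidden features, solved by Gauss-Jordan elimination. The ridge term and the dense solver are assumptions here; write_weights12.c may factor the problem differently.

```c
#include <stddef.h>
#include <math.h>

#define DMAX 64   /* sketch limit on feature dimension */

/* Solve (H^T H + lambda*I) w = H^T r for one readout column, where
   Hf is T x d (row-major hidden features) and r holds the residual
   targets. Returns 0 on success, -1 on a singular/oversized system. */
static int ridge_solve(const double *Hf, const double *r,
                       size_t T, int d, double lambda, double *w)
{
    if (d < 1 || d > DMAX) return -1;
    double A[DMAX][DMAX + 1];
    for (int i = 0; i < d; i++) {            /* build augmented system */
        for (int j = 0; j < d; j++) {
            double s = 0.0;
            for (size_t t = 0; t < T; t++)
                s += Hf[t * d + i] * Hf[t * d + j];
            A[i][j] = s + (i == j ? lambda : 0.0);
        }
        double s = 0.0;
        for (size_t t = 0; t < T; t++)
            s += Hf[t * d + i] * r[t];
        A[i][d] = s;
    }
    for (int c = 0; c < d; c++) {            /* Gauss-Jordan, partial pivot */
        int p = c;
        for (int i = c + 1; i < d; i++)
            if (fabs(A[i][c]) > fabs(A[p][c])) p = i;
        if (fabs(A[p][c]) < 1e-12) return -1;
        for (int j = 0; j <= d; j++) {
            double tmp = A[c][j]; A[c][j] = A[p][j]; A[p][j] = tmp;
        }
        double piv = A[c][c];
        for (int j = 0; j <= d; j++) A[c][j] /= piv;
        for (int i = 0; i < d; i++) {
            if (i == c) continue;
            double f = A[i][c];
            for (int j = 0; j <= d; j++) A[i][j] -= f * A[c][j];
        }
    }
    for (int i = 0; i < d; i++) w[i] = A[i][d];
    return 0;
}
```

Newton steps on the cross-entropy loss then refine this least-squares solution, which is what closes the 1.557 → 0.967 gap in the table.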
6. Hash Design: Diversity Beats Optimality
The hash function that encodes input bytes into neuron signs has a dramatic
effect on analytic performance. This is a design choice, not a learned parameter.
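The diversity claim is easy to probe directly: count how many of the 256 byte values receive a distinct 8-bit sign pattern under a given hash. The mixing function below is illustrative, not the one in write_weights7.c (which reports 170/256 distinct patterns).

```c
#include <stdint.h>

/* Count distinct 8-bit patterns over the 256 possible byte values. */
static int distinct_patterns(uint8_t (*hash)(uint8_t))
{
    int seen[256] = {0}, count = 0;
    for (int b = 0; b < 256; b++) {
        uint8_t p = hash((uint8_t)b);
        if (!seen[p]) { seen[p] = 1; count++; }
    }
    return count;
}

/* Bit extraction: pattern = the byte itself, trivially 256/256 distinct,
   but the 8 features are maximally correlated with the byte's bit layout. */
static uint8_t bit_extract(uint8_t b) { return b; }

/* A mixing hash: multiplicative scramble then fold; patterns may collide,
   which the source argues acts as implicit regularization. */
static uint8_t mixed(uint8_t b)
{
    uint32_t v = (uint32_t)b * 2654435761u;
    return (uint8_t)((v >> 13) ^ (v >> 27));
}
```

The surprise of this section is that maximizing distinctness (bit extraction, 256/256) hurts: diverse half-splits, even with collisions, make better features.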
Hash Function Design: BPC by Approach
A random mixed hash (170 distinct patterns out of 256) achieves 1.89 bpc.
A perfect bit-extraction hash (256/256 distinct) achieves only 3.60 bpc.
Data:
write_weights7.c,
write_weights9.c
"Random half-splits provide diverse binary features; bit extraction creates
correlated features. This is the ensemble-methods principle: diverse weak
learners outperform repeated strong learners. Hash collisions (86 out of 256
bytes share a pattern with another) provide implicit regularization by
pooling similar bytes."
— narrative.tex
6b. Full-Bandwidth Carrier Signal
What happens when we increase the carrier bandwidth? With 256 hidden units (16 groups × 16 hash bits),
the shift-register encodes nearly every byte distinctly — and the closed-form PI drops to
0.44 bpc with zero SGD.
Architecture Comparison: PI vs SGD across Carrier Designs
Four carrier architectures compared on closed-form pseudo-inverse (green) and
SGD-optimized Wy (blue). Doubling hidden size to 256 drops PI from 1.41 to 0.44 bpc.
The hybrid (semantic + hash features) achieves the best PI at 0.40 bpc.
Data:
write_weights14.c
| Architecture | Hidden | Distinct Patterns | PI (closed form) | SGD (1000 ep) |
|---|---|---|---|---|
| Hash baseline | 128 (16×8) | 170 / 256 | 1.409 | 0.068 |
| Full bandwidth | 256 (16×16) | 253 / 256 | 0.436 | 0.075 |
| Semantic features | 128 (16×8) | 11 / 256 | 1.493 | 0.189 |
| Hybrid (semantic + hash) | 256 (16×(8+8)) | 202 / 256 | 0.396 | 0.077 |
Near-perfect byte discrimination drops PI by 3×.
With 253/256 distinct hash patterns (vs 170/256 at baseline), the closed-form PI drops from
1.41 to 0.44 bpc. The hybrid architecture achieves 0.40 bpc PI — approaching SGD territory
without any optimization. SGD converges to ~0.07 bpc regardless of architecture, suggesting
the remaining gap is in Wy's nonlinear structure.
7. Why the Trained Model Underperforms
The shift-register construction has a structural advantage over the trained model:
perfect memory. The trained model's chaotic dynamics destroy information at depth.
- 16 steps: shift-register memory (perfect)
- 3.44: trained model's Lyapunov exponent (> 1 = chaos)
- 0%: shift-register information loss
- 100%: trained model's gradient decorrelation at d = 1
"The shift-register construction has perfect 16-step memory with zero
information loss. The trained model's chaotic dynamics (Wh has
Lyapunov exponent > 1) actively destroy information at depth.
The Boolean automaton is an inefficient encoding of the data's
skip-k-gram structure."
— narrative.tex, Observation 3
8. The Scaffolding Paradox
The 82k-parameter architecture is needed for training but not for inference.
Can you train the minimal 26k-parameter architecture directly?
Training from Scratch: Full vs Sparse Architecture
The sparse 26k-parameter redux architecture cannot train from random initialization
(7.74 bpc after 50 epochs, barely below uniform 8.0). The full 82k architecture
reaches 5.16 bpc. Gradient flow requires the dense Wh as scaffolding.
Meanwhile, the constructed model reaches 1.89 bpc with zero training.
Data:
q5_redux_train.c
"The extra 56k parameters are the ladder: needed for gradient-based navigation
of the loss surface, discarded once the function is found."
— narrative.tex
"Training gave us too much. The full model has 82,304 parameters.
Inference needs ~26,000 (the redux). The remaining 56,000 parameters
were scaffolding for gradient flow — needed to navigate the
optimization landscape but pure overhead once the good map is found."
— synthesis.tex
Training needs 82k parameters. Inference needs 26k. Construction needs 0.
The shift-register construction bypasses the optimization landscape entirely,
building the right function directly from data statistics.
9. The Full Comparison
All Configurations on a Single Scale
Every configuration on a single axis. The dashed line shows the trained model (4.965 bpc).
Constructed models span from 6.954 (Hebbian all) down to 0.374 (optimized Wy).
Three substitution experiments (bh, Wh 80/20, Wy 50/50) improve the trained model.
Constructed Model Evaluation Table
| Configuration | BPC | Notes |
|---|---|---|
| Uniform baseline | 8.000 | No model |
| Sparse redux (trained from scratch) | 7.74 | 26k params, gradient flow fails |
| Hebbian all matrices | 6.954 | Covariance-only, no optimization |
| Replace Wx | 5.460 | +0.496 from trained |
| Replace Wh | 5.219 | +0.254 from trained |
| Trained model | 4.965 | Full f32, BPTT-50 |
| Trained + Hebbian bh | 4.954 | Improves by 0.011 |
| 80/20 trained/Hebbian Wh | 4.919 | Improves by 0.046 |
| 50% Hebbian Wy blend | 4.307 | Improves by 0.658 |
| Hebbian dynamics + optimized Wy | 3.961 | 500 epochs SGD |
| Bit-extraction hash + analytic Wy | 3.600 | Perfect hash, worse diversity |
| Sign-cond. dynamics + optimized Wy | 2.800 | 500 epochs SGD |
| Analytic Wy (zero optimization) | 1.890 | ALL params from data, ZERO optimization |
| Pseudo-inverse Wy | 1.557 | 0 iterations (matrix solve) |
| Bool readout + optimized Wy | 1.005 | Overfits to 520 bytes |
| PI + 20 Newton steps | 0.967 | 20 iterations |
| SGD from PI init | 0.476 | 500 epochs |
| SGD from zero | 0.374 | 1,000 epochs |
10. The Twelve-Day Arc
This page is the culmination of twelve days of experiments. The arc:
- Training (Jan 31): 82,304 opaque parameters, 0.079 bpc
- UM isomorphism (Feb 1–4): Every prediction has a pattern witness
- Pattern discovery (Feb 7–8): 6,180 patterns, skip-k-grams
- Boolean automaton (Feb 9–11): Sign carries 99.7%, mantissa is noise (experiment page)
- Minimal model: 20 neurons + 36% Wh = 0.15 bpc better (experiment page)
- Attribution chains: ~15 weights per prediction (experiment page)
- Weight construction (this page): All 82k parameters from data
"The Hebbian rule Δw ∝ cov(pre, post) is the first-order Taylor expansion
of gradient descent on cross-entropy loss. This is why Hebbian covariance predicts Wh:
the gradient updates that found the trained weights are dominated by the same covariance
structure that the UM counts."
— narrative.tex
"The mantissa was the ladder. The result is counting."
— narrative.tex, final line