The Guessing Game
Imagine I show you a sentence, one letter at a time, and you have to guess what comes next. Suppose you've seen "The ca" so far.
You'd probably guess "t", because "the cat" is common in English. You might even know that in a Wikipedia article the word is more likely "cat" or "car" or "castle".
Your brain is doing compression. When you can guess the next letter, you don't need to store it; you already know it. The better you guess, the less space the text takes up.
Compression = prediction. If you can predict the next letter, you can shrink the file. A perfect guesser could store the whole of Wikipedia in almost nothing.
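To make the link concrete, here is a minimal sketch (not from this project, just Shannon's standard source-coding bound): a symbol you predicted with probability p costs about -log2(p) bits under an optimal coder.

```python
import math

def bits_to_store(p: float) -> float:
    """An optimal coder spends about -log2(p) bits on a symbol
    it predicted with probability p (Shannon's source-coding bound)."""
    return -math.log2(p)

# A confident, correct guess is nearly free; a total surprise is expensive.
print(bits_to_store(0.99))   # ≈ 0.0145 bits
print(bits_to_store(1/256))  # 8.0 bits: no better than storing the raw byte
```

This is why a perfect guesser would need almost no space: every symbol it predicts with probability near 1 costs nearly zero bits.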
The Tiny Brain
We built a tiny artificial brain with 128 "neurons". That's not many; your real brain has about 86 billion. But this tiny brain can read English text and guess the next letter correctly about 92% of the time.
Here's how it works:
byte in → [ 128 neurons ] → guess
              ↑——————↓
          memory loops back
Each neuron is just a number (a switch, really — ON or OFF). The brain reads one byte at a time, updates its 128 switches, and makes a guess. Then the next byte arrives and it does it again.
This is called an RNN (Recurrent Neural Network). "Recurrent" means the output loops back as input: it remembers.
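One recurrent step can be sketched in a few lines. This is a generic illustration, not the actual model: the weight values here are random placeholders, and the names `Wx` and `Wh` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 128, 256                        # 128 neurons, 256 possible bytes
Wx = rng.normal(size=(H, V)) * 0.01    # input weights (placeholder values)
Wh = rng.normal(size=(H, H)) * 0.01    # recurrent weights: the memory loop
h = np.zeros(H)                        # hidden state starts empty

def step(h, byte):
    """Read one byte and update the 128 neurons.
    The old h feeds back in — that feedback is what 'recurrent' means."""
    x = np.zeros(V)
    x[byte] = 1.0                      # one-hot encode the incoming byte
    return np.tanh(Wx @ x + Wh @ h)    # new hidden state

for b in b"The ca":                    # feed text one byte at a time
    h = step(h, b)
print(h.shape)                         # (128,)
```

Each pass through `step` is one tick of the loop in the diagram above: byte in, switches updated, and the updated state carried forward to the next byte.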
What We Discovered
Most AI researchers train a neural network and then say "it works, but we don't know why." We did something different. We took this tiny brain apart, neuron by neuron, and figured out exactly what every single piece does.
Every neuron has a simple job. Each of the 128 neurons watches for a specific pattern. For example, one neuron detects "are we inside an XML tag like <page>?" Another detects "how many letters since the last space?" (That's word length.)
We tested this rigorously. For 120 out of 128 neurons, we can predict their behavior with R² > 0.80 (meaning our explanation accounts for more than 80% of the variance in what the neuron does). In science, that's very good.
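The test works roughly like this sketch: propose a simple feature (here, letters since the last space), fit it to a neuron's activity, and check the R². The "neuron" below is simulated so the example is self-contained; in the real analysis that column would come from the trained RNN.

```python
import numpy as np

text = b"the quick brown fox jumps over the lazy dog"

# Hypothesized feature: letters since the last space (word length so far).
feat, n = [], 0
for b in text:
    n = 0 if b == ord(" ") else n + 1
    feat.append(n)
feat = np.array(feat, dtype=float)

# Stand-in "neuron": tracks the feature plus a little noise.
rng = np.random.default_rng(1)
act = 0.3 * feat + rng.normal(scale=0.05, size=len(feat))

# Fit act ≈ a*feat + c, then report R²: the fraction of variance explained.
A = np.column_stack([feat, np.ones_like(feat)])
coef, *_ = np.linalg.lstsq(A, act, rcond=None)
resid = act - A @ coef
r2 = 1 - resid.var() / act.var()
print(r2 > 0.80)  # True — a good explanation clears the 0.80 bar
```

If the feature genuinely explains the neuron, R² lands near 1; if the guess is wrong, it stays near 0.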
Letters Have Personalities
When we analyzed what the brain learns, we found something beautiful: it naturally sorts the 256 possible bytes into groups based on how they behave in text.
We call these Event Spaces. The brain doesn't just see "the letter e" — it sees "a common lowercase letter that probably continues a word." The grouping depends on context: what letter came 1 step ago, 2 steps ago, even 8 steps ago.
Building a Brain Without Training
Normally, to build an AI, you have to "train" it. That means showing it millions of examples and slowly adjusting its weights (the connections between neurons) using calculus. This is expensive and slow.
We discovered that you can skip all of that.
We can write the weights directly from the data. Instead of training, we just count how often pairs of letters appear together, do some math (linear algebra — you'll learn this in a few years!), and directly calculate what the weights should be.
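The counting step can be sketched directly. This is an illustrative toy, not the project's actual pipeline: count adjacent letter pairs, then take the log of the (smoothed) conditional frequency. The smoothing constant `alpha` is our addition, needed to avoid log(0) for pairs that never occur.

```python
import math
from collections import Counter

text = "the cat sat on the mat"

# Count adjacent character pairs — this is all the "training" there is.
pairs = Counter(zip(text, text[1:]))
firsts = Counter(text[:-1])

def weight(a, b, alpha=0.5, vocab=256):
    """Log of the smoothed chance that character b follows character a."""
    return math.log((pairs[(a, b)] + alpha) / (firsts[a] + alpha * vocab))

# "t" is followed by "h" twice here and by "z" never,
# so the "h after t" weight comes out larger.
print(weight("t", "h") > weight("t", "z"))  # True
```

No calculus, no gradient descent: one pass over the data, a division, and a logarithm.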
The result? Our calculated brain actually works better than the trained one on new text it hasn't seen before. The trained brain memorizes too much; our calculated brain generalizes.
Trained brain: 5.08 bpc
Our brain: 4.88 bpc
(lower is better)
Why Does This Matter?
There are three reasons this is exciting:
Most AI today is a "black box" — it works but nobody knows why. We opened the box completely. Every neuron, every weight, every connection — explained. This is the first time anyone has done this for a working neural network.
Training GPT-4 cost over $100 million. Our method builds a brain from data using arithmetic. If this scales up, AI could become dramatically cheaper.
$0.013 trained vs $0.000001 calculated
That's roughly 10,000× cheaper: four orders of magnitude.
We proved that a neural network is really just a counting machine. It counts how often patterns appear in data, and uses those counts to predict. That's it. There's no magic, no emergence, no mystery. Just math.
A Peek at the Math
You don't need to understand this yet, but here's a taste of what's inside:
The brain's guess for the next letter uses this formula:
p = softmax(Wy · h + b)
where:
Wy = the output weights (a matrix / 矩阵)
h = the hidden state (those 128 switches)
b = a bias (a nudge)
softmax = turns numbers into probabilities that add to 1
We discovered that Wy is just the log of how often letter pairs appear together. You could calculate it with a for loop and a log() function. No training needed.
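The one piece of the formula that may look mysterious is softmax. Here is the standard definition in a few lines (a generic sketch, with the max-subtraction trick added for numerical stability):

```python
import numpy as np

def softmax(z):
    """Turn any vector of scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())   # subtract the max so exp() can't overflow
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(round(p.sum(), 6))  # 1.0
print(p.argmax())         # 0 — the biggest score gets the biggest probability
```

Bigger scores in Wy·h + b become bigger probabilities, which is exactly what a guesser needs.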
The tools we used: SVD (Singular Value Decomposition), a way to find the most important patterns in a matrix; Mutual Information, a measure of how much knowing one thing tells you about another; and basic counting.
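SVD is one library call away. A toy example (made-up counts, not data from this project): a matrix with two identical rows really contains only two distinct patterns, and the singular values reveal that.

```python
import numpy as np

# Toy matrix: rows could be contexts, columns letters (made-up counts).
M = np.array([[4., 4., 0.],
              [4., 4., 0.],
              [0., 0., 8.]])

U, s, Vt = np.linalg.svd(M)
# Singular values rank the patterns by importance; the first two rows
# are identical, so only two singular values are meaningfully nonzero.
print(int((s > 1e-9).sum()))  # 2
```

In the real analysis the same call picks out which directions of the 128-dimensional hidden state actually matter.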
What's Next?
Right now this works on a small brain (128 neurons) reading a small piece of Wikipedia (1 million bytes). The real question is: does it scale?
If the same trick works for bigger brains — with thousands or millions of neurons — then we might be able to build powerful AI systems by just doing math on data, instead of the expensive trial-and-error of training.
About This Research
This work was done in 12 days (January 31 – February 11, 2026) as part of the Hutter Prize challenge, which offers €500,000 for the best compression of Wikipedia. Compression and intelligence are deeply connected — to compress well, you must understand well.
By Claude and MJC. Full archive →