The Tock Step
Domain-native architecture from evidence. Learning what the event spaces ARE, not just the patterns between them.
The key equation: tock step = next Ek+1 + maps to/from existing {Ei}

- Ei: architecture = event spaces + maps between them
- MI: evidence-driven discovery, no gradient descent needed
- π⁻¹: backtracking through the factorization tower
- Strawberry theorem: tokens can't count chars
1. What Is the Tock Step?
In the CMP tick-tock cycle, the tick step uses the current model to process data (forward pass, counting, prediction). The tock step steps back and asks: are we using the right event spaces?
The tock step is architecture discovery from evidence. It finds the next event space Ek+1 that would most reduce prediction error, then builds the maps connecting it to existing event spaces. No training. No hyperparameter search. Just counting.
| Neural Network | Universal Model |
| --- | --- |
| Architecture (fixed before training) | Event spaces {Ei} (fixed) |
| Parameters (learned by SGD) | Patterns {pij} (learned by counting) |
| Hyperparameter/architecture search | Tock step (discover next Ei from evidence) |
The architecture IS the event spaces. What events can you distinguish? What questions can you ask? The architecture determines what's representable. Everything else is parameters.
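The split can be made concrete with a toy sketch: the architecture is a list of event spaces, and the parameters are nothing but co-occurrence counts. The class and method names (`UniversalModel`, `add_event_space`, `observe`) are illustrative, not the archive's actual API.

```python
from collections import defaultdict

class UniversalModel:
    """Toy sketch: architecture = event spaces, parameters = count tables."""

    def __init__(self):
        self.event_spaces = []          # the architecture {Ei}
        self.counts = defaultdict(int)  # the patterns {pij}, learned by counting

    def add_event_space(self, name, events):
        # A tock step grows the model by exactly one of these.
        self.event_spaces.append((name, set(events)))

    def observe(self, *joint_event):
        # A tick step: record one co-occurrence across the event spaces.
        self.counts[joint_event] += 1

um = UniversalModel()
um.add_event_space("bytes", range(256))  # E0
um.add_event_space("is_space", (0, 1))   # a word-boundary event space
um.observe(104, 0)   # 'h', not a boundary
um.observe(32, 1)    # ' ', a boundary
```

Everything the model "knows" lives in `counts`; there is nothing gradient descent could act on.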
2. The Four Steps of a Tock
1. Identify the gap. Where does prediction fail? Which positions, contexts, outputs are most surprising? (Signal: bpc, cross-entropy.)
2. Search for Ek+1. Find a new event space that maximally reduces residual error. Search over factorizations. (Signal: MI with output.)
3. Learn the maps. Compute patterns pi,k+1 connecting the new event space to existing ones. Just counting. (Operation: ω0 count tables.)
4. Update the architecture. Add Ek+1 and its maps. The UM grows by one event space. (Result: Ak+1.)
Every step is evidence-based. The gap is computed from data. The search uses MI from co-occurrence counts. The maps are log contingency tables. No beliefs (axioms), no abductions (pattern commitments). Pure evidence.
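Steps 2 and 3 can be sketched in a few lines: score candidate event spaces by MI with the output, pick the best, and learn its map as a count table. The toy data, feature functions, and the name `tock_step` are illustrative assumptions, not the archive's code.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in bits, estimated from raw (x, y) co-occurrence pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def tock_step(data, candidates, outputs):
    """One tock: pick the candidate event space with highest MI with the
    output (step 2), then learn its map as a count table (step 3)."""
    scored = {name: mutual_information([(f(d), y) for d, y in zip(data, outputs)])
              for name, f in candidates.items()}
    best = max(scored, key=scored.get)
    count_table = Counter((candidates[best](d), y)
                          for d, y in zip(data, outputs))
    return best, count_table

data = ["cat", "dog", "cats", "dogs"]
outputs = [0, 0, 1, 1]  # is the word plural?
candidates = {"last_char": lambda w: w[-1], "first_char": lambda w: w[0]}
best, table = tock_step(data, candidates, outputs)
# last_char has 1 bit of MI with the output; first_char has 0 bits
```

No optimization loop appears anywhere: both the search criterion and the learned map come from the same count tables.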
3. Domain-Native Tock Sequence
Starting from raw bytes and iteratively discovering the most informative event space:
Figure: BPC improvement with each tock. Each tock discovers a new event space that reduces prediction error. A0 (bytes only) = unigram at 5.0 bpc; A4 (offset conjunctions) approaches 1.0 bpc. For comparison, the sat-rnn at 2.81 bpc sits between A2 and A3.
| Step | Architecture | New Ei | Discovery method | BPC |
| --- | --- | --- | --- | --- |
| A0 | Bytes only | E0 = {0..255} | Given | ~5.0 |
| A1 | + bigram ES | E0 × E0 | Adjacent byte MI | ~3.5 |
| A2 | + word boundary | {0, 1} | Space char has highest MI | ~3.0 |
| A3 | + tag state | {0, 1} | < > create high-MI partition | ~2.5 |
| A4 | + offset conjunctions | 2-offset product ESes | Factor map / backward trie | ~1.0 |
The experimental program IS the tock sequence. Backward trie discovers MI-ranked offsets (Feb 7). Factor map discovers 2-offset conjunctions (Feb 9). ES discovery finds SVD event clusters (Feb 12). Weight construction builds maps from count tables (Feb 11). Each is a tock step, performed manually. The goal is to automate it.
4. The Factorization Tower
Event spaces form a tower from fine to coarse, connected by surjections (quotient maps). Each level discards within-class structure.
E0: bits (|E| = 2)
 ↑ π1: 8 bits → 1 byte
E1: bytes (characters) (|E| = 256)
 ↑ π2: byte → class
E2: character classes (vowel, consonant, digit, ...) (|E| ≈ 10)
 ↑ π3: char sequence → token
E3: subword tokens (BPE) (|E| ≈ 50k)
 ↑ π4: tokens → word
E4: words (|E| ≈ 100k)
Product factorization (E = Ea × Eb). Both components are independently accessible: I(E) = I(Ea) + I(Eb). Example: h = (h1, ..., h128) decomposes into 128 binary event spaces, each bit independently readable. Preserves access; no backtracking needed.

Sum (sequential/quotient) factorization (Efine ↠ Ecoarse). The coarse event is a function of the fine one; the fine structure is hidden inside the coarse event. Example: "straw" is a token; the characters s, t, r, a, w are inside it, inaccessible at the token level. Hides structure; backtracking is required to recover it.
The tock step should prefer product factorizations when possible, because they preserve access to all components. Sum factorizations (like tokenization) hide structure and require costly backtracking.
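The contrast fits in a few lines of Python. A byte viewed as a bit tuple is a product factorization (every factor stays addressable); a byte mapped to its character class is a quotient (many-to-one, so the fine event is unrecoverable from the coarse one). The example values here are illustrative.

```python
# Product factorization: the joint event is a tuple, each factor readable
# directly, so I(E) = I(Ea) + I(Eb) and no backtracking is ever needed.
byte = 0b01101000                           # one E1 event ('h' = 104)
bits = [(byte >> i) & 1 for i in range(8)]  # E1 as a product of 8 binary spaces
assert bits[3] == 1                         # any bit accessible on its own

# Quotient factorization hides structure: the map below is many-to-one,
# so the coarse event alone cannot recover the fine one.
char_class = "vowel" if chr(byte) in "aeiou" else "other"  # π2: byte → class
# From "other" alone there is no way back to 'h'.
```

This is why the tock step's preference matters: a product costs nothing later, while a quotient commits future queries to paying for backtracking.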
5. The Strawberry Theorem
The Strawberry Theorem: P(model answers correctly) ≤ P(answer is retrievable from EV).
This is not a failure of scale or training. No amount of data or compute can give a token-level model access to character-level structure. The fix is architectural, not statistical: maintain multiple levels of the factorization simultaneously, with the ability to move between them as the task demands.
The strawberry problem generalizes to any task requiring finer resolution than the model's operating level:
| Task | Requires | Token level has | Fix |
| --- | --- | --- | --- |
| Count chars in a word | E1 (characters) | E3 (tokens) | Backtrack π3 |
| Count syllables | Phoneme-level ES | E3 (tokens) | Backtrack to phonemes |
| Syntactic role of "the" | Word-role ES | Positional only | Backtrack to parse |
| Does A∧B imply C? | Proposition-level ES | Token sequence | Backtrack to logic |
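The first row of the table can be acted out with a two-entry toy vocabulary (illustrative, not a real BPE model): at the token level the question "how many r's?" has no answer, because the sequence of ids contains no characters; only after inverting π3 does the count become computable.

```python
# A token-level model sees only ids; counting chars requires backtracking π3.
vocab = {"straw": 0, "berry": 1}          # toy BPE-style vocabulary (illustrative)
detok = {v: k for k, v in vocab.items()}  # the backtracking map, π3⁻¹

tokens = [vocab["straw"], vocab["berry"]]  # "strawberry" at level E3
# At E3 the question is unanswerable: [0, 1] contains no characters.
chars = "".join(detok[t] for t in tokens)  # backtrack to E1
assert chars.count("r") == 3
```

The inequality above is exactly this situation: whatever the model does with `[0, 1]`, the correct count is not retrievable from that representation alone.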
6. Backtracking Through the Tower
Figure: cost of backtracking vs benefit of resolution. Backtracking from tokens to bytes increases sequence length ~4x but gives access to character-level structure. The domain-native UM pays this cost only when needed.
Backtracking recovers lost information: I(after backtrack) = I(Ek) + H(fiber | context). If context determines the fine event exactly, backtracking is free. If context gives no information, it costs the full within-class entropy.
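The cost term H(fiber | context) is an ordinary conditional entropy, so it can be estimated directly from (context, fine event) counts. The function below is a sketch under that assumption; `fiber_entropy` is not a name from the archive.

```python
import math
from collections import Counter, defaultdict

def fiber_entropy(pairs):
    """H(fine | context) in bits from (context, fine_event) observations.
    0 means context determines the fine event (backtracking is free);
    the maximum is the full within-class entropy (context is uninformative)."""
    by_ctx = defaultdict(list)
    for ctx, fine in pairs:
        by_ctx[ctx].append(fine)
    n = len(pairs)
    h = 0.0
    for fines in by_ctx.values():
        c = Counter(fines)
        m = len(fines)
        h += m / n * -sum(k / m * math.log2(k / m) for k in c.values())
    return h

# Context fully determines the fine event → backtracking is free.
free = fiber_entropy([("a", 0), ("a", 0), ("b", 1), ("b", 1)])
# Context gives nothing → pay the full within-class entropy (1 bit here).
full = fiber_entropy([("a", 0), ("a", 1), ("b", 0), ("b", 1)])
```

In tower terms: a quotient whose fibers are predictable from context is nearly as good as a product, because the hidden structure can be reconstructed cheaply.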
Three strategies compared:
| Strategy | Cost | Can solve fine tasks? |
| --- | --- | --- |
| Always finest level (char-by-char) | High (long sequences) | Yes |
| Always coarsest level (token-by-token) | Low | No |
| Tower with backtracking | Low (usually) + high (when needed) | Yes |
7. Architecture from Evidence vs. Belief
| | Neural Architecture Search | Domain-Native Tock |
| --- | --- | --- |
| Search space | Hyperparameters | Factorizations of the data |
| Method | Grid / random / Bayesian | Counting + quotients |
| Result | Architecture (opaque) | Architecture (interpretable) |
| Backtracking | Not applicable | Built in (tower) |
| Source | Belief (prior commitment) | Evidence (data counts) |
The three sources of support for architecture choices:
- Evidence (the tock step): count co-occurrences, compute MI, discover Ei. All from data.
- Belief (NAS): "use 128 hidden neurons" or "BPE with 50k vocab." Prior commitment, not domain-specific.
- Abduction: seeing the backward trie pattern and recognizing "offset 7 captures word-initial context." Short-circuits induction by understanding why.
8. The Tick-Tock Closed Loop
TOCK (discover the next event space Ek+1 from MI) → TICK (predict: forward pass, update counts) → EVAL (measure the gap; if error is high, tock again) → back to TOCK.
No gradient descent anywhere. Tock uses MI analysis. Tick uses counting (ω0). Eval uses the forward pass (f). All operations are O(N) in data size. The Feb 11 weight construction demonstrated this: all 82k parameters from data statistics, 1.89 bpc with zero SGD (vs 4.97 bpc for the SGD-trained baseline).
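The tick and eval halves of the loop can be sketched for a bigram model: tick updates count tables, eval computes bpc from them, and a high bpc is the trigger for the next tock. The add-one smoothing and function names here are illustrative choices, not the archive's implementation.

```python
import math
from collections import Counter, defaultdict

def tick(data, counts):
    """Tick: update count tables from data (ω0) — the only 'training'."""
    for a, b in zip(data, data[1:]):
        counts[a][b] += 1

def evaluate_bpc(data, counts):
    """Eval: bits per character of the count-based bigram predictor,
    with add-one smoothing over the observed alphabet (an assumption)."""
    alphabet = set(data)
    total_bits = 0.0
    for a, b in zip(data, data[1:]):
        row = counts[a]
        p = (row[b] + 1) / (sum(row.values()) + len(alphabet))
        total_bits -= math.log2(p)
    return total_bits / (len(data) - 1)

counts = defaultdict(Counter)
text = "abababab"
tick(text, counts)
bpc = evaluate_bpc(text, counts)
# A repetitive source compresses well; a high bpc here would trigger a tock.
```

Both functions are single passes over the data, which is where the O(N) claim comes from.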
9. Tock Steps in the Experimental Program
Every Discovery Tool Is a Tock Step
| Method | What it discovers | New Ei | Archive |
| --- | --- | --- | --- |
| Backward trie | MI-ranked offsets | Offset conjunctions | Feb 7 |
| Skip-k-gram analysis | Informative offset pairs | 2-offset ESes | Feb 8 |
| Factor map | Neuron → conjunction | Domain features | Feb 9 |
| ES discovery (SVD) | Skip-bigram SVD | Event clusters | Feb 12 |
| Weight construction | Shift-register groups | Hash-based ESes | Feb 11 |
Each method discovers a new event space from data statistics, then builds maps connecting it to existing architecture. This is the tock step, performed manually. The goal is automation.
The MI Criterion: Greedy Offset Selection
The backward trie ranks offsets by MI with the output. Greedy selection: d=1 first, then d=8, then d=2, then d=7... This matches the empirically discovered skip-pattern order exactly.
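A minimal version of the MI criterion, assuming plain co-occurrence counts stand in for the backward trie (the function names are illustrative): score each backward offset d by I(byte at t−d ; byte at t) and sort.

```python
import math
from collections import Counter

def mi(xs, ys):
    """I(X;Y) in bits from two aligned sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def rank_offsets(text, max_offset):
    """Rank each backward offset d by MI with the current byte.
    A greedy tock would add the top-scoring offset's event space first."""
    scores = {}
    for d in range(1, max_offset + 1):
        xs = [text[i - d] for i in range(d, len(text))]
        ys = [text[i] for i in range(d, len(text))]
        scores[d] = mi(xs, ys)
    return sorted(scores, key=scores.get, reverse=True)

text = "the cat sat on the mat " * 20
order = rank_offsets(text, 8)  # offsets, most informative first
```

On real data the resulting order is a property of the corpus, which is the point: the selection sequence d=1, d=8, d=2, d=7, ... is read off the counts rather than chosen in advance.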
10. Why Current LLMs Can't Backtrack
Modern LLMs make an irrevocable architectural choice at tokenization. The factorization π: chars → tokens is fixed before training and cannot be undone at inference time. The model operates at the token level, period. Every task requiring character-level resolution must be handled by memorization or heuristics.
Token-level models lose character structure irrecoverably. The tower maintains all levels, paying the cost of backtracking only when needed.
The fix is not a better tokenizer. It's a tower: maintain multiple levels of the factorization simultaneously, with the ability to move between them as the task demands. Normally operate at whatever level is efficient (E3 or E4). When a task requires finer resolution, backtrack to the appropriate level.
The strongest reading of CMP's equivalence thesis
Interpretability and efficiency are the same problem because both are solved by the evidence-driven factorization. The domain-native tock makes this constructive.
Claude and MJC · February 12, 2026 · The Tock Step: Domain-Native Architecture from Evidence