
Arithmetic Coding ↔ RNN Memory

A side-by-side comparison of how arithmetic coding (AC) and an RNN hidden state carry context through time.

Arithmetic Coding

State: [low, high) ⊂ [0, 1)
Update: narrow interval by p(symbol)
Width shrinks: width *= p
Bits used: -log₂(width)
Precision limit: 32-64 bits
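
A minimal sketch of that update in Python, assuming a fixed three-symbol distribution (illustrative, not the page's data) and omitting the renormalization a real coder needs:

```python
import math

# Illustrative fixed distribution over three symbols (an assumption, not the page's model).
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

def cumulative_slices(probs):
    """Assign each symbol its [cum_low, cum_high) slice of [0, 1)."""
    slices, acc = {}, 0.0
    for sym, p in probs.items():
        slices[sym] = (acc, acc + p)
        acc += p
    return slices

def encode_step(low, high, symbol, slices):
    """Narrow [low, high) to the symbol's slice: width shrinks by p(symbol)."""
    width = high - low
    c_low, c_high = slices[symbol]
    return low + width * c_low, low + width * c_high

slices = cumulative_slices(probs)
low, high = 0.0, 1.0
for sym in "abac":
    low, high = encode_step(low, high, sym, slices)
    width = high - low
    print(f"{sym}: width = {width:.6f}, bits so far = {-math.log2(width):.3f}")
```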

RNN Hidden State

State: h ∈ ℝ¹²⁸ (within [-1,1] via tanh)
Update: h' = tanh(W·x + U·h + b)
Capacity shrinks: ||h|| changes
Bits used: -log₂ p(output)
Precision limit: float32 ≈ 24 significand bits (23 stored + 1 implicit)
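
A matching sketch of the recurrence h' = tanh(W·x + U·h + b), with illustrative sizes (128 hidden units, byte-valued one-hot inputs) and random untrained weights standing in for a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 128, 256                      # illustrative sizes, not the page's model

W = rng.normal(0.0, 0.1, (HIDDEN, VOCAB))     # input-to-hidden weights
U = rng.normal(0.0, 0.1, (HIDDEN, HIDDEN))    # hidden-to-hidden (recurrent) weights
b = np.zeros(HIDDEN)

def rnn_step(h, x):
    """One update: h' = tanh(W·x + U·h + b), so every component stays in [-1, 1]."""
    return np.tanh(W @ x + U @ h + b)

h = np.zeros(HIDDEN)
for byte in b"abac":
    x = np.zeros(VOCAB)
    x[byte] = 1.0                             # one-hot encode the input symbol
    h = rnn_step(h, x)
    print(f"{chr(byte)}: ||h|| = {np.linalg.norm(h):.3f}")
```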
Example run: 100 steps, avg entropy 7.0 bits/symbol, avg surprisal 6.4 bits/symbol, 643.7 total AC bits.

[Per-step detail panel: for a selected character, shows AC state [low, high) and RNN ‖h‖, interval width vs. hidden-state capacity, bits accumulated so far, entropy of the shared prediction, and the surprisal of that symbol.]

[Charts: AC interval [low, high) over time; RNN hidden state ‖h‖ over time; AC cumulative bits = -log₂(interval width); entropy & surprisal of the shared prediction]

Key Insight: Both systems narrow down the space of possibilities over time. AC does it explicitly (the interval shrinks); the RNN does it implicitly (the hidden state encodes context). Both hit precision limits: AC after ~32-64 bits without renormalization, the RNN after ~24 bits × 128 dims ≈ 3000 bits (though not all of that is usable, since the dimensions are correlated).
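
One way to see the AC side of that limit (an illustrative experiment, not the page's simulation): keep narrowing the interval toward 1.0 with p = 0.5 and watch the naive float64 width hit zero once roughly a significand's worth of bits has accumulated. Real coders avoid this by renormalizing, i.e. emitting settled leading bits and rescaling the interval.

```python
# Without renormalization, a naive float64 interval collapses once the
# accumulated bits exceed the ~53-bit double-precision significand.
p = 0.5                        # every symbol carries exactly 1 bit
low, high = 0.0, 1.0
for step in range(1, 200):
    width = high - low
    low = high - width * p     # always take the upper half: low creeps toward 1.0
    if high - low == 0.0:
        print(f"width collapsed to 0 after {step} steps (~{step} bits accumulated)")
        break
```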

Bayesian Interpretation (Q = λ)

Each prediction step is a Bayesian update. The surprisal shown above is the log-luck:

Λ = -log₂ p(symbol) = log₂ λ    where λ = 1/p is the "luck"

The cumulative surprisal (643.7 bits) equals the compressed message length.
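
The identity holds because per-symbol surprisals add while interval widths multiply: Σ(-log₂ p) = -log₂(Π p) = -log₂(final width). A small check using the same assumed toy distribution as above:

```python
import math

probs = {"a": 0.5, "b": 0.3, "c": 0.2}    # illustrative distribution, as before
message = "abacab"

width, total_surprisal = 1.0, 0.0
for sym in message:
    p = probs[sym]
    width *= p                            # AC view: interval shrinks by p
    total_surprisal += -math.log2(p)      # Bayes view: log-luck Λ = -log₂ p

print(f"sum of surprisals = {total_surprisal:.4f} bits")
print(f"-log2(width)      = {-math.log2(width):.4f} bits")   # same number
```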

Unification (Q = λ): The quotient Q = |prior|/|posterior| equals the luck λ = 1/p.

  • Bayes: luck λ = 1/p, log-luck Λ = -log p
  • Thermo: microstates shrink by factor λ
  • AC interval: width shrinks by factor p (= 1/λ)
  • RNN: h encodes accumulated context, outputs p

All four views describe the same operation: narrowing possibilities by observing symbols.
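
A worked numeric instance, with assumed counts (8 equiprobable microstates consistent with the prior, 2 still consistent after the observation), showing the four bookkeeping views give the same number:

```python
import math

prior_states, posterior_states = 8, 2       # illustrative microstate counts

p      = posterior_states / prior_states    # probability assigned to the observation: 0.25
Q      = prior_states / posterior_states    # quotient |prior| / |posterior|          : 4
lam    = 1 / p                              # luck λ                                  : 4
Lambda = -math.log2(p)                      # log-luck / surprisal                    : 2 bits
shrink = p                                  # AC interval width factor                : 0.25

print(Q == lam, Lambda == math.log2(lam), shrink == 1 / lam)   # True True True
```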
