Memory-Surprisal Trade-off Framework #
@cite{futrell-2019} @cite{hahn-degen-futrell-2021} @cite{zaslavsky-hu-levy-2020}
Core formalization of @cite{hahn-degen-futrell-2021} "Modeling Word and Morpheme Order as an Efficient Trade-Off of Memory and Surprisal", Psychological Review 128(4):726–756.
Key Idea #
Bounded memory forces a trade-off between memory usage (H_M) and processing difficulty (surprisal S_M). At each time step, the processor stores a lossy encoding of the past. Better encoding reduces surprisal but costs more memory. The optimal trade-off forms a convex curve in (H_M, S_M) space.
Information Locality #
The curve's shape is determined by the mutual information profile I_t: how much mutual information exists between the current word and the word t steps back. Languages that concentrate predictive information locally (information locality) achieve steeper, more efficient trade-off curves.
Mathematical Structure #
The central mathematical insight is the marginal rate of substitution: at distance t, each bit of surprisal reduction costs exactly (t+1) bits of memory. This is why information locality matters — short-distance information is cheap (1 bit of memory per bit of surprisal at t=0), while long-distance information is expensive ((t+1) bits per bit at distance t). The increasing marginal rate makes the bound curve convex.
§3 proves the marginal rate theorem and its consequences directly from the definitions, without requiring measure theory. The bound itself (Theorem 1, §4) — that the I_t profile determines the achievable region of (H_M, S_M) pairs — is stated via comprehension postulates that assume the data processing inequality and the chain rule for stationary processes.
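The marginal-rate arithmetic can be checked concretely. A minimal Python sketch (the I_t profile values here are invented for illustration, not taken from the paper):

```python
from math import isclose

# Hypothetical mutual information profile I_t in bits, for t = 0, 1, 2, 3.
I = [0.50, 0.25, 0.12, 0.06]

def memory_cost(T):
    # Memory needed to store T steps of context: sum of (t+1) * I_t for t < T.
    return sum((t + 1) * I[t] for t in range(T))

def surplus(T):
    # Predictive information lost when only T steps are remembered.
    return sum(I[T:])

# Stepping from capacity T to T+1 buys I_T bits of surprisal reduction
# at a price of (T+1) * I_T bits of memory: the marginal rate is T + 1.
for T in range(len(I)):
    rate = (memory_cost(T + 1) - memory_cost(T)) / (surplus(T) - surplus(T + 1))
    assert isclose(rate, T + 1)
```

The loop verifies the increasing cost ratio: 1 bit of memory per bit of surprisal at T = 0, rising to 4 at T = 3, which is exactly the convexity of the bound curve.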
Connection to DLM #
Information locality generalizes dependency length minimization: DLM minimizes the structural distance between syntactically related words, while information locality minimizes the information-theoretic distance at which predictive information concentrates.
Sections #
- §1: Memory-surprisal framework (MemoryEncoding, averageSurprisal, memoryEntropy)
- §2: Mutual information profile (MutualInfoProfile, surplusSurprisal, memoryCost)
- §3: Marginal analysis (surplus_step, memoryCost_step, marginal_rate)
- §4: Information locality bound (comprehension postulates, Theorem 1)
- §5: Trade-off curve (TradeoffPoint, TradeoffCurve, AUC)
- §6: Concrete profiles and efficiency comparison
- §7: Bridges (rate-distortion, processing model, dependency locality)
Memory Encoding #
A memory encoding maps the history of observed words to a finite memory state. At each time step t, the processor sees word w_t and updates its memory state m_t = encode(m_{t-1}, w_t). The memory's entropy H_M measures how much information the processor retains about the past.
A memory encoding compresses the past into a finite memory state.
W is the word type, Mem is the memory state type.
The encoding function takes a memory state and a new word and returns
the updated memory state.
- encode : Mem → W → Mem
Update memory state given a new word
- initial : Mem
Initial memory state (before any words are seen)
Instances For
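As a concrete illustration (a Python sketch, not the Lean `MemoryEncoding` structure), the simplest lossy encoding truncates the history to the last k words:

```python
def make_truncation_encoding(k):
    """A memory encoding that remembers only the last k words."""
    initial = ()                     # memory state before any words are seen
    def encode(mem, word):
        return (mem + (word,))[-k:]  # append the new word, drop anything beyond k
    return initial, encode

initial, encode = make_truncation_encoding(2)
m = initial
for w in ["the", "dog", "barked"]:
    m = encode(m, w)
print(m)  # ('dog', 'barked')
```

Larger k retains more of the past (higher H_M) and predicts the next word better (lower S_M); the trade-off below quantifies exactly this exchange.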
Average surprisal under a memory encoding.
This is the conditional entropy of the next word given the current memory state: S_M = H(W_t | M_t).
Lower memory → less information about the past → higher surprisal. When memory is unlimited, S_M = S_∞ (the entropy rate of the language).
Uses conditionalEntropy from Core.InformationTheory.
Equations
- Processing.MemorySurprisal.averageSurprisal joint marginalMem = Core.InformationTheory.conditionalEntropy joint marginalMem
Instances For
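Numerically, averageSurprisal is the conditional entropy of the next word given the memory state. A sketch on a made-up joint distribution (all numbers illustrative):

```python
from math import log2

# joint[m][w] = P(M_t = m, W_t = w), a toy two-state, two-word distribution
joint = {
    "m0": {"a": 0.25, "b": 0.25},
    "m1": {"a": 0.40, "b": 0.10},
}

def average_surprisal(joint):
    """S_M = H(W_t | M_t) = -sum over (m, w) of P(m, w) * log2 P(w | m)."""
    s = 0.0
    for row in joint.values():
        p_m = sum(row.values())          # marginal P(M_t = m)
        for p_mw in row.values():
            if p_mw > 0:
                s -= p_mw * log2(p_mw / p_m)
    return s

print(round(average_surprisal(joint), 3))  # 0.861 bits
```

State m0 leaves the next word uncertain (1 bit), m1 is more informative (about 0.722 bits); the average is their mixture.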
Memory entropy: the entropy of the memory state distribution.
H_M = H(M_t), measuring how many bits the processor uses.
Uses entropy from Core.InformationTheory.
Equations
- Processing.MemorySurprisal.memoryEntropy memDist = Core.InformationTheory.entropy memDist
Instances For
Mutual Information Profile #
The mutual information profile I_t measures how much information the word at position n provides about the word at position n + t (at distance t). This determines the shape of the memory-surprisal trade-off curve.
Key insight: I_t = S_t - S_{t+1}, where S_t is the surprisal when the processor remembers only t steps of context. So I_t measures the marginal value of remembering one more step.
The profile connects to Core.InformationTheory.mutualInformation:
each I_t is I(W_n; W_{n+t}), the mutual information between words
at distance t in the stationary process.
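For intuition, the profile can be computed exactly for a toy two-state Markov chain (transition matrix invented for illustration); I_t decays geometrically with distance, the signature of information locality:

```python
from math import log2

P = [[0.9, 0.1], [0.2, 0.8]]  # illustrative transition matrix
pi = [2 / 3, 1 / 3]           # its stationary distribution (pi @ P == pi)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def profile(t):
    """I_t = I(W_n; W_{n+t}) for the stationary chain."""
    Pt = P
    for _ in range(t - 1):
        Pt = matmul(Pt, P)    # t-step transition matrix
    # I(X; Y) = sum over (i, j) of P(i, j) * log2(P(j | i) / P(j))
    return sum(pi[i] * Pt[i][j] * log2(Pt[i][j] / pi[j])
               for i in range(2) for j in range(2))

assert profile(1) > profile(2) > profile(3) > 0  # rapid decay with distance
```

The decay rate is governed by the chain's second eigenvalue (0.7 here); a slower-mixing chain would spread I_t over longer distances.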
Mutual information at distance t between words.
I_t represents how much mutual information exists between a word and the word t steps back. Higher I_t at small t means information is concentrated locally (good for memory-efficient processing).
Stored as I_t × 1000 for decidable computation.
- name : String
Name for this profile (e.g., "English", "Japanese", "Baseline")
I_t × 1000 for distances t = 0, 1, 2,...
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Sum of I_t values (total predictive information × 1000).
Instances For
Surplus surprisal when the processor remembers T steps of context: the predictive information lost beyond the memory window. Equals Σ_{t≥T} I_t (the tail sum of the profile).
Equations
- p.surplusSurprisal T = List.foldl (fun (x1 x2 : ℕ) => x1 + x2) 0 (List.drop T p.values)
Instances For
Weighted prefix sum: Σ_{i<|l|} (t₀+i+1)·l[i].
Non-accumulator formulation for easier proofs. Structurally recursive on the list, so properties follow directly by induction.
Equations
- Processing.MemorySurprisal.weightedPrefixSum [] t₀ = 0
- Processing.MemorySurprisal.weightedPrefixSum (it :: rest) t₀ = (t₀ + 1) * it + Processing.MemorySurprisal.weightedPrefixSum rest (t₀ + 1)
Instances For
Memory cost of remembering T steps of context: Σ_{t=0}^{T-1} (t+1)·I_t. This is the prefix of the weighted sum up to distance T.
Equations
Instances For
Weighted sum Σ (t+1)·I_t (total memory cost × 1000).
This is the memory cost of storing the entire profile: memoryCost p p.values.length.
Languages with lower weighted sums concentrate information
at shorter distances.
Equations
Instances For
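The two sums can be mirrored directly in Python (a sketch; the profile values are hypothetical, stored ×1000 as in the formalization):

```python
values = [500, 250, 120, 30]  # I_t * 1000 for t = 0, 1, 2, 3 (illustrative)

def surplus_surprisal(values, T):
    """Tail sum: predictive information lost beyond a T-step window."""
    return sum(values[T:])

def memory_cost(values, T):
    """Weighted prefix sum: (t+1) * I_t summed over t < T."""
    return sum((t + 1) * v for t, v in enumerate(values[:T]))

print(surplus_surprisal(values, 0))      # 900: total predictive information
print(memory_cost(values, len(values)))  # 1480: the full weighted sum
```

As T grows, surplus_surprisal falls by I_T while memory_cost rises by (T+1)·I_T, which is exactly the marginal analysis of the next section.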
Marginal Analysis #
The trade-off bound maps memory capacity T to a point (memoryCost(T), surplusSurprisal(T)). As T increases by 1:
- Surplus surprisal decreases by I_T (one more distance is remembered)
- Memory cost increases by (T+1)·I_T (the weight of that distance)
The ratio — (T+1) bits of memory per bit of surprisal reduction — is the marginal rate of substitution. It increases with T, which is why the bound curve is convex: short-distance information is cheap, long-distance information is expensive.
This is the mathematical core of the information locality argument. Languages that concentrate I_t at small t exploit the "cheap" region of the curve, achieving low surprisal without high memory cost.
Surplus step: stepping from capacity T to T+1 reduces surplus surprisal by exactly I_T.
surplusSurprisal(T) = I_T + surplusSurprisal(T+1)
Memory cost step: stepping from capacity T to T+1 increases memory cost by exactly (T+1)·I_T.
memoryCost(T+1) = memoryCost(T) + (T+1)·I_T
Marginal rate of substitution: at distance T, each bit of surprisal reduction costs exactly (T+1) bits of memory.
This is the deep content of the information locality argument.
The marginal memory cost is (T+1) * I_T, and the marginal
surprisal reduction is I_T, so the cost ratio is T+1.
Short-distance information (small T) is cheap: 1 bit of memory per bit of surprisal at T=0. Long-distance information is expensive: (T+1) bits of memory per bit of surprisal at distance T.
This increasing cost is why information locality matters: languages that concentrate I_t at small t exploit the cheap end of the curve.
Surplus at capacity 0 equals total information: when the processor remembers nothing, all predictive information is lost.
Surplus at full capacity is zero: when the processor remembers everything, no predictive information is lost.
Memory cost at capacity 0 is zero: remembering nothing costs nothing.
Memory cost at full capacity equals the weighted sum.
The marginal rate of substitution (T+1) is strictly increasing: each additional step of memory is more expensive per unit of surprisal reduction than the last. This makes the bound curve convex.
Information Locality Bound #
Theorem 1 of @cite{hahn-degen-futrell-2021} establishes that the mutual information profile I_t determines a lower bound on the achievable memory-surprisal trade-off. Specifically, for any memory encoding and any capacity T:
If H_M ≤ Σ_{t<T} (t+1)·I_t (the memory cost of storing T steps of context), then S_M ≥ S_∞ + Σ_{t≥T} I_t (the surplus surprisal from forgetting the rest).
The proof requires three comprehension postulates about the underlying stationary ergodic process, which we cannot derive without measure theory:
Data Processing Inequality: Processing (lossy compression) cannot increase mutual information. If X → M → Y is a Markov chain, then I(X;Y) ≤ I(X;M). This bounds how much information the memory state can preserve about the past.
Chain Rule for Mutual Information: I(X; Y₁, Y₂) = I(X; Y₁) + I(X; Y₂|Y₁). This decomposes the total information into distance-specific contributions I_t, enabling the per-distance accounting that underlies the marginal rate theorem.
Memory Entropy Bound: H(M_t) ≥ I(M_t; W_{t-1}, ..., W_{t-T}). The memory must have enough entropy to carry the mutual information it preserves.
Given these postulates, the bound follows: the memory cost of storing T distances of context is Σ_{t<T} (t+1)·I_t (from the chain rule applied iteratively), and the surplus surprisal from forgetting distances ≥ T is Σ_{t≥T} I_t (from the data processing inequality). The marginal analysis in §3 derives the consequences of this bound — the increasing cost ratio (T+1) — purely from the definitions, without needing the postulates.
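Read operationally, the bound assigns each memory budget a surprisal floor. A sketch with hypothetical numbers (both the profile and the entropy rate are invented for illustration):

```python
values = [500, 250, 120, 30]  # I_t * 1000, illustrative profile
entropy_rate = 7000           # S_inf * 1000, illustrative

def memory_cost(T):
    return sum((t + 1) * v for t, v in enumerate(values[:T]))

def surprisal_floor(h_m):
    """Tightest floor on S_M: over every T with h_m <= memoryCost(T),
    the bound forces S_M >= S_inf + surplusSurprisal(T); take the max."""
    floor = entropy_rate
    for T in range(len(values) + 1):
        if h_m <= memory_cost(T):
            floor = max(floor, entropy_rate + sum(values[T:]))
    return floor

print(surprisal_floor(0))     # 7900: no memory, the full 900 of surplus
print(surprisal_floor(1480))  # 7000: full memory, zero surplus
```

Every achievable (H_M, S_M) pair lies on or above this staircase, which is the geometric content of the bound predicate below.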
TODO: Full derivation requires formalizing stationary ergodic processes, the data processing inequality, and the chain rule for mutual information. These are available in principle via Mathlib's measure theory, but the connection to discrete stochastic processes is substantial.
Comprehension postulates for the information locality bound.
Witnesses that a mutual information profile correctly bounds the achievable (memory, surprisal) region for a language process.
The Achievable predicate abstracts over "the pair (H_M, S_M) can be
realized by some memory encoding of the underlying process." The bound
states: any achievable pair with memory at most memoryCost(T) must
have surprisal at least S_∞ + surplusSurprisal(T). Geometrically,
all achievable pairs lie above the bound curve parameterized by T.
- profile : MutualInfoProfile
The mutual information profile derived from the process
- entropyRate1000 : ℕ
Entropy rate of the process (S_∞ × 1000)
Predicate: (H_M, S_M) is achievable by some memory encoding
- bound (T H_M S_M : ℕ) : self.Achievable H_M S_M → H_M ≤ self.profile.memoryCost T → S_M ≥ self.entropyRate1000 + self.profile.surplusSurprisal T
The bound: any achievable pair lies above the trade-off curve. If memory ≤ memoryCost(T), then surprisal ≥ S_∞ + surplusSurprisal(T). Derived from DPI (bounding surprisal from below) and chain rule (decomposing memory cost into per-distance contributions).
Instances For
Trade-off Curve #
The memory-surprisal trade-off curve plots (H_M, S_M) pairs achievable by different memory encodings. Languages with more efficient word orders have curves closer to the origin (lower AUC).
Values are stored as Nat × 1000 for decidable computation.
A single point on the memory-surprisal trade-off curve.
memoryBits1000 = H_M × 1000 (memory in millibits)
surprisal1000 = S_M × 1000 (surprisal in millibits)
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
Instances For
A memory-surprisal trade-off curve for a language or baseline.
- name : String
- points : List TradeoffPoint
Points ordered by increasing memory (decreasing surprisal)
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
Instances For
Approximate area under the trade-off curve via trapezoidal rule.
AUC × 1000000 (millibits²). Lower AUC = more efficient trade-off. The curve should have points ordered by increasing memory.
Equations
- One or more equations did not get rendered due to their size.
Instances For
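The trapezoidal rule used here is the standard one. A Python sketch on made-up points (millibit units, as in this file):

```python
# (H_M, S_M) pairs in millibits, ordered by increasing memory (illustrative)
points = [(0, 900), (500, 400), (1000, 150), (1480, 0)]

def auc(points):
    """Trapezoidal area under the trade-off curve."""
    return sum((m1 - m0) * (s0 + s1) / 2
               for (m0, s0), (m1, s1) in zip(points, points[1:]))

print(auc(points))  # 498500.0: lower means a more efficient trade-off
```

A curve hugging the origin (surprisal dropping quickly as memory grows) yields a smaller area, so AUC summarizes efficiency in a single number.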
The efficient trade-off hypothesis: a real language's AUC is smaller than its random baseline's AUC.
Equations
Instances For
The trade-off bound curve implied by the mutual information profile (Theorem 1). Point T maps to (memoryCost(T), surplusSurprisal(T)) for T = 0, 1, ..., n.
Equations
- One or more equations did not get rendered due to their size.
Instances For
Concrete profiles #
Illustrative profiles demonstrating the information locality effect. The "efficient" profile concentrates I_t at short distances (rapid decay), while the "inefficient" profile spreads I_t across many distances (slow decay). Both have comparable total information, but the efficient profile achieves a strictly lower AUC on the trade-off bound.
A language with high information locality: I_t decays rapidly. Predictive information concentrated at short distances. Represents an efficient language like English or Japanese.
Equations
Instances For
A language with low information locality: I_t decays slowly. Predictive information spread across many distances. Represents a less efficient baseline.
Equations
Instances For
The efficient profile has lower weighted sum (less memory for same info).
Both profiles have comparable total information.
Information locality determines trade-off efficiency.
Profiles with faster I_t decay (more information locality) produce strictly lower trade-off bound AUC. This is the deep content of @cite{hahn-degen-futrell-2021}: the memory-surprisal trade-off is determined by the mutual information profile, and languages whose word orders concentrate predictive information at short distances achieve more efficient bounds.
The two profiles have comparable total predictive information
(profiles_similar_total_info) but the efficient profile concentrates
it at shorter distances (efficient_lower_weighted_sum), producing a
steeper trade-off curve and lower AUC.
Verify the marginal rate theorem on the efficient profile: at T=0, memory cost step = 1 × I_0 = 500, surplus step = I_0 = 500. Cost ratio = 1.
At T=3, memory cost step = 4 × I_3 = 120, surplus step = I_3 = 30. Cost ratio = 4: four bits of memory per bit of surprisal.
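The comparison can be replayed numerically. Only I_0 = 500 and I_3 = 30 of the efficient profile are fixed by the figures above; the remaining values, and the inefficient profile, are invented for illustration:

```python
efficient = [500, 250, 120, 30]     # rapid decay (I_0, I_3 match the text)
inefficient = [230, 230, 220, 220]  # slow decay, same total information

def weighted_sum(values):
    """Total memory cost: (t+1) * I_t summed over the whole profile."""
    return sum((t + 1) * v for t, v in enumerate(values))

assert sum(efficient) == sum(inefficient) == 900    # comparable totals
assert weighted_sum(efficient) < weighted_sum(inefficient)  # 1480 < 2230
```

Same information, different placement: concentrating it at short distances cuts the memory bill by a third in this toy case.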
Bridge: Memory-Surprisal ↔ Rate-Distortion #
The memory-surprisal trade-off is structurally analogous to rate-distortion theory:
| Memory-Surprisal | Rate-Distortion |
|---|---|
| Memory H_M | Rate R |
| Surprisal S_M | Distortion D |
| Trade-off curve | RD curve |
| Info locality | Structural constraint |
Both characterize optimal lossy compression: the memory encoding compresses the past (lossy), and the trade-off curve is the achievable region boundary.
TODO: Formal proof requires showing that the memory-surprisal trade-off curve equals the rate-distortion function for the appropriate source and distortion measure. See @cite{hahn-degen-futrell-2021} SI §1.3.
Bridge: Memory ↔ Processing Locality #
The memory dimension H_M connects to the locality dimension of
ProcessingModel.ProcessingProfile: higher memory means the processor
can track longer-distance dependencies, which is equivalent to tolerating
higher locality costs.
Map a memory capacity (in bits × 1000) to a processing profile.
Higher memory capacity → processor can handle longer dependencies
→ maps to locality (the maximum dependency distance the processor
can comfortably handle).
Equations
- Processing.MemorySurprisal.memoryToLocality memoryBits1000 = { locality := memoryBits1000 / 100, boundaries := 0, referentialLoad := 0, ease := 0 }
Instances For
More memory → the processor can handle longer-distance dependencies (higher locality).
Bridge: Information Locality ↔ Dependency Locality #
Information locality (I_t concentrated at small t) generalizes dependency locality (short structural distances between related words). When syntactic dependencies are short, the words that carry predictive information about each other are close together, making I_t decay rapidly.
TODO: Formal proof requires connecting the structural notion of dependency length to the information-theoretic notion of mutual information at distance t. See @cite{futrell-2019} and @cite{hahn-degen-futrell-2021} §2.3 for the argument that DLM is a special case of information locality optimization.
Bridge: Memory-Surprisal ↔ Generalised Surprisal #
The memory-surprisal trade-off operates at the standard surprisal configuration: negLog warping, indicator scoring, horizon 1, predictive level. The trade-off curve varies memory capacity while holding the prediction resolution fixed.
@cite{giulianelli-etal-2026} generalizes this by also varying the resolution parameters (forecast horizon h and representational level l), showing that different psycholinguistic measures are best predicted at different resolutions. The memory-surprisal trade-off is the special case where prediction resolution is held constant and only the memory budget varies.
The generalised surprisal configuration used by the memory-surprisal trade-off: standard surprisal (negLog × indicator × h=1 × predictive).
The trade-off curve parametrizes over memory encodings while holding this resolution fixed. IAS extends this by also parametrizing over the prediction resolution (horizon and representational level).