Documentation

Linglib.Theories.Syntax.DependencyGrammar.Formal.MemorySurprisal.Basic

Memory-Surprisal Trade-off Framework #

@cite{futrell-2019} @cite{hahn-degen-futrell-2021} @cite{zaslavsky-hu-levy-2020}

Core formalization of @cite{hahn-degen-futrell-2021} "Modeling Word and Morpheme Order as an Efficient Trade-Off of Memory and Surprisal", Psychological Review 128(4):726–756.

Key Idea #

Bounded memory forces a trade-off between memory usage (H_M) and processing difficulty (surprisal S_M). At each time step, the processor stores a lossy encoding of the past. Better encoding reduces surprisal but costs more memory. The optimal trade-off forms a convex curve in (H_M, S_M) space.

Information Locality #

The curve's shape is determined by the mutual information profile I_t: how much mutual information exists between the current word and the word t steps back. Languages that concentrate predictive information locally (information locality) achieve steeper, more efficient trade-off curves.

Connection to DLM #

Information locality generalizes dependency length minimization: DLM minimizes the structural distance between syntactically related words, while information locality minimizes the information-theoretic distance at which predictive information concentrates.

Sections #

§1: Memory-surprisal framework (MemoryEncoding, averageSurprisal, memoryEntropy)
§2: Mutual information profile and information locality bound (Theorem 1)
§3: Trade-off curve (TradeoffPoint, TradeoffCurve, AUC)
§4: Bridges (rate-distortion, processing model, dependency locality)

Memory Encoding #

A memory encoding maps the history of observed words to a finite memory state. At each time step t, the processor sees word w_t and updates its memory state m_t = encode(m_{t-1}, w_t). The memory's entropy H_M measures how much information the processor retains about the past.

structure DepGrammar.MemorySurprisal.MemoryEncoding (W Mem : Type) :

A memory encoding compresses the past into a finite memory state.

W is the word type, Mem is the memory state type. The encoding function takes a memory state and a new word and returns the updated memory state.

encode : Mem → W → Mem
Update memory state given a new word
initial : Mem
Initial memory state (before any words are seen)

Instances For

def DepGrammar.MemorySurprisal.averageSurprisal {W Mem : Type} [BEq W] [BEq Mem] (joint : List ((Mem × W) × ℚ)) (marginalMem : List (Mem × ℚ)) :

Average surprisal under a memory encoding.

This is the conditional entropy of the next word given the current memory state: S_M = H(W_t | M_t).

Lower memory → less information about the past → higher surprisal. When memory is unlimited, S_M = S_∞ (the entropy rate of the language).

We reuse conditionalEntropy from InformationTheory/Basic.lean.

Equations

DepGrammar.MemorySurprisal.averageSurprisal joint marginalMem = Core.InformationTheory.conditionalEntropy joint marginalMem

Instances For

def DepGrammar.MemorySurprisal.memoryEntropy {Mem : Type} [BEq Mem] (memDist : List (Mem × ℚ)) :

Memory entropy: the entropy of the memory state distribution.

H_M = H(M_t), measuring how many bits the processor uses.

We reuse entropy from InformationTheory/Basic.lean.

Equations

DepGrammar.MemorySurprisal.memoryEntropy memDist = Core.InformationTheory.entropy memDist

Instances For

Mutual Information Profile #

The mutual information profile I_t measures how much information the word at position t provides about the word at position t + d (at distance d). This determines the shape of the memory-surprisal trade-off curve.

Key insight: I_t = S_t - S_{t+1}, where S_t is the surprisal when the processor remembers only t steps of context. So I_t measures the marginal value of remembering one more step.

structure DepGrammar.MemorySurprisal.MutualInfoProfile :

Mutual information at distance t between words.

I_t represents how much mutual information exists between a word and the word t steps back. Higher I_t at small t means information is concentrated locally (good for memory-efficient processing).

Stored as I_t × 1000 for decidable computation.

name : String
Name for this profile (e.g., "English", "Japanese", "Baseline")
values : List ℕ
I_t × 1000 for distances t = 1, 2, 3,...

Instances For

instance DepGrammar.MemorySurprisal.instReprMutualInfoProfile :

Repr MutualInfoProfile

Equations

DepGrammar.MemorySurprisal.instReprMutualInfoProfile = { reprPrec := DepGrammar.MemorySurprisal.instReprMutualInfoProfile.repr }

def DepGrammar.MemorySurprisal.instReprMutualInfoProfile.repr :

MutualInfoProfile → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

instance DepGrammar.MemorySurprisal.instDecidableEqMutualInfoProfile :

DecidableEq MutualInfoProfile

Equations

DepGrammar.MemorySurprisal.instDecidableEqMutualInfoProfile = DepGrammar.MemorySurprisal.instDecidableEqMutualInfoProfile.decEq

def DepGrammar.MemorySurprisal.instDecidableEqMutualInfoProfile.decEq (x✝ x✝¹ : MutualInfoProfile) :

Decidable (x✝ = x✝¹)

Equations

One or more equations did not get rendered due to their size.

Instances For

def DepGrammar.MemorySurprisal.instBEqMutualInfoProfile.beq :

MutualInfoProfile → MutualInfoProfile → Bool

Equations

DepGrammar.MemorySurprisal.instBEqMutualInfoProfile.beq { name := a, values := a_1 } { name := b, values := b_1 } = (a == b && a_1 == b_1)
DepGrammar.MemorySurprisal.instBEqMutualInfoProfile.beq x✝¹ x✝ = false

Instances For

instance DepGrammar.MemorySurprisal.instBEqMutualInfoProfile :

BEq MutualInfoProfile

Equations

DepGrammar.MemorySurprisal.instBEqMutualInfoProfile = { beq := DepGrammar.MemorySurprisal.instBEqMutualInfoProfile.beq }

def DepGrammar.MemorySurprisal.MutualInfoProfile.totalInfo (p : MutualInfoProfile) :

Sum of I_t values (total predictive information × 1000).

Equations

p.totalInfo = List.foldl (fun (x1 x2 : ℕ) => x1 + x2) 0 p.values

Instances For

def DepGrammar.MemorySurprisal.MutualInfoProfile.weightedSum (p : MutualInfoProfile) :

Weighted sum Σ t·I_t (memory cost × 1000).

This is the key quantity in the information locality bound: the memory needed to store T steps of context is bounded by Σ_{t≤T} t·I_t. Languages with lower weighted sums concentrate information at shorter distances.

Equations

p.weightedSum = DepGrammar.MemorySurprisal.MutualInfoProfile.weightedSum.go p.values 0 0

Instances For

def DepGrammar.MemorySurprisal.MutualInfoProfile.weightedSum.go :

List ℕ → ℕ → ℕ → ℕ

Equations

DepGrammar.MemorySurprisal.MutualInfoProfile.weightedSum.go [] x✝¹ x✝ = x✝
DepGrammar.MemorySurprisal.MutualInfoProfile.weightedSum.go (it :: rest) x✝¹ x✝ = DepGrammar.MemorySurprisal.MutualInfoProfile.weightedSum.go rest (x✝¹ + 1) (x✝ + (x✝¹ + 1) * it)

Instances For

theorem DepGrammar.MemorySurprisal.information_locality_bound (p : MutualInfoProfile) (memoryCapacity : ℕ) (_h : memoryCapacity ≤ p.values.length) :

Information Locality Bound (Theorem 1, eq. 4)

The memory required to achieve surprisal S_M is bounded below by: H_M ≥ Σ_{t : I_t > threshold} t · I_t

where the threshold is determined by S_M.

More precisely: for a processor with memory capacity T steps, H_M ≤ Σ_{t≤T} t · I_t (memory cost of storing T steps) S_M ≥ S_∞ + Σ_{t>T} I_t (surplus surprisal from forgetting)

Languages with steeper I_t decay (more information locality) achieve better trade-offs: less memory for the same surprisal.

TODO: Full proof requires formalizing stationary ergodic processes and the chain rule for mutual information. See paper SI §1.2.

def DepGrammar.MemorySurprisal.efficientProfile :

MutualInfoProfile

A language with high information locality: I_t decays rapidly. Predictive information concentrated at short distances. Represents an efficient language like English or Japanese.

Equations

DepGrammar.MemorySurprisal.efficientProfile = { name := "Efficient", values := [500, 200, 80, 30, 10, 5, 2, 1] }

Instances For

def DepGrammar.MemorySurprisal.inefficientProfile :

MutualInfoProfile

A language with low information locality: I_t decays slowly. Predictive information spread across many distances. Represents a less efficient baseline.

Equations

DepGrammar.MemorySurprisal.inefficientProfile = { name := "Inefficient", values := [200, 180, 150, 120, 90, 60, 30, 10] }

Instances For

theorem DepGrammar.MemorySurprisal.efficient_lower_weighted_sum :

efficientProfile.weightedSum < inefficientProfile.weightedSum

The efficient profile has lower weighted sum (less memory for same info).

theorem DepGrammar.MemorySurprisal.profiles_similar_total_info :

efficientProfile.totalInfo > 700 ∧ inefficientProfile.totalInfo > 700

Both profiles have comparable total information.

Trade-off Curve #

The memory-surprisal trade-off curve plots (H_M, S_M) pairs achievable by different memory encodings. Languages with more efficient word orders have curves closer to the origin (lower AUC).

Values are stored as Nat × 1000 for decidable computation.

structure DepGrammar.MemorySurprisal.TradeoffPoint :

A single point on the memory-surprisal trade-off curve.

memoryBits1000 = H_M × 1000 (memory in millibits) surprisal1000 = S_M × 1000 (surprisal in millibits)

memoryBits1000 : ℕ
surprisal1000 : ℕ

Instances For

def DepGrammar.MemorySurprisal.instReprTradeoffPoint.repr :

TradeoffPoint → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

instance DepGrammar.MemorySurprisal.instReprTradeoffPoint :

Repr TradeoffPoint

Equations

DepGrammar.MemorySurprisal.instReprTradeoffPoint = { reprPrec := DepGrammar.MemorySurprisal.instReprTradeoffPoint.repr }

instance DepGrammar.MemorySurprisal.instDecidableEqTradeoffPoint :

DecidableEq TradeoffPoint

Equations

DepGrammar.MemorySurprisal.instDecidableEqTradeoffPoint = DepGrammar.MemorySurprisal.instDecidableEqTradeoffPoint.decEq

def DepGrammar.MemorySurprisal.instDecidableEqTradeoffPoint.decEq (x✝ x✝¹ : TradeoffPoint) :

Decidable (x✝ = x✝¹)

Equations

One or more equations did not get rendered due to their size.

Instances For

instance DepGrammar.MemorySurprisal.instBEqTradeoffPoint :

BEq TradeoffPoint

Equations

DepGrammar.MemorySurprisal.instBEqTradeoffPoint = { beq := DepGrammar.MemorySurprisal.instBEqTradeoffPoint.beq }

def DepGrammar.MemorySurprisal.instBEqTradeoffPoint.beq :

TradeoffPoint → TradeoffPoint → Bool

Equations

DepGrammar.MemorySurprisal.instBEqTradeoffPoint.beq { memoryBits1000 := a, surprisal1000 := a_1 } { memoryBits1000 := b, surprisal1000 := b_1 } = (a == b && a_1 == b_1)
DepGrammar.MemorySurprisal.instBEqTradeoffPoint.beq x✝¹ x✝ = false

Instances For

structure DepGrammar.MemorySurprisal.TradeoffCurve :

A memory-surprisal trade-off curve for a language or baseline.

name : String
points : List TradeoffPoint
Points ordered by increasing memory (decreasing surprisal)

Instances For

instance DepGrammar.MemorySurprisal.instReprTradeoffCurve :

Repr TradeoffCurve

Equations

DepGrammar.MemorySurprisal.instReprTradeoffCurve = { reprPrec := DepGrammar.MemorySurprisal.instReprTradeoffCurve.repr }

def DepGrammar.MemorySurprisal.instReprTradeoffCurve.repr :

TradeoffCurve → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

instance DepGrammar.MemorySurprisal.instDecidableEqTradeoffCurve :

DecidableEq TradeoffCurve

Equations

DepGrammar.MemorySurprisal.instDecidableEqTradeoffCurve = DepGrammar.MemorySurprisal.instDecidableEqTradeoffCurve.decEq

def DepGrammar.MemorySurprisal.instDecidableEqTradeoffCurve.decEq (x✝ x✝¹ : TradeoffCurve) :

Decidable (x✝ = x✝¹)

Equations

One or more equations did not get rendered due to their size.

Instances For

instance DepGrammar.MemorySurprisal.instBEqTradeoffCurve :

BEq TradeoffCurve

Equations

DepGrammar.MemorySurprisal.instBEqTradeoffCurve = { beq := DepGrammar.MemorySurprisal.instBEqTradeoffCurve.beq }

def DepGrammar.MemorySurprisal.instBEqTradeoffCurve.beq :

TradeoffCurve → TradeoffCurve → Bool

Equations

DepGrammar.MemorySurprisal.instBEqTradeoffCurve.beq { name := a, points := a_1 } { name := b, points := b_1 } = (a == b && a_1 == b_1)
DepGrammar.MemorySurprisal.instBEqTradeoffCurve.beq x✝¹ x✝ = false

Instances For

def DepGrammar.MemorySurprisal.TradeoffCurve.auc (c : TradeoffCurve) :

Approximate area under the trade-off curve via trapezoidal rule.

AUC × 1000000 (millibits²). Lower AUC = more efficient trade-off. The curve should have points ordered by increasing memory.

Equations

One or more equations did not get rendered due to their size.

Instances For

def DepGrammar.MemorySurprisal.efficientTradeoffHypothesis (real baseline : TradeoffCurve) :

The efficient trade-off hypothesis: a real language's AUC is smaller than its random baseline's AUC.

Equations

DepGrammar.MemorySurprisal.efficientTradeoffHypothesis real baseline = decide (real.auc < baseline.auc)

Instances For

Bridge: Memory-Surprisal ↔ Rate-Distortion #

The memory-surprisal trade-off is structurally analogous to rate-distortion theory:

Memory-Surprisal	Rate-Distortion
Memory H_M	Rate R
Surprisal S_M	Distortion D
Trade-off curve	RD curve
Info locality	Structural constraint

Both characterize optimal lossy compression: the memory encoding compresses the past (lossy), and the trade-off curve is the achievable region boundary.

TODO: Formal proof requires showing that the memory-surprisal trade-off curve equals the rate-distortion function for the appropriate source and distortion measure. See @cite{hahn-degen-futrell-2021} SI §1.3.

theorem DepGrammar.MemorySurprisal.memory_surprisal_rate_distortion_correspondence :

Bridge: Memory ↔ Processing Locality #

The memory dimension H_M connects to the locality dimension of ProcessingModel.ProcessingProfile: higher memory means the processor can track longer-distance dependencies, which is equivalent to tolerating higher locality costs.

def DepGrammar.MemorySurprisal.memoryToLocality (memoryBits1000 : ℕ) :

ProcessingModel.ProcessingProfile

Map a memory capacity (in bits × 1000) to a processing profile.

Higher memory capacity → processor can handle longer dependencies → maps to locality (the maximum dependency distance the processor can comfortably handle).

Equations

DepGrammar.MemorySurprisal.memoryToLocality memoryBits1000 = { locality := memoryBits1000 / 100, boundaries := 0, referentialLoad := 0, ease := 0 }

Instances For

theorem DepGrammar.MemorySurprisal.more_memory_more_locality (m1 m2 : ℕ) (h : m1 < m2) :

(memoryToLocality m1).locality ≤ (memoryToLocality m2).locality

More memory → can handle longer locality.

Bridge: Information Locality ↔ Dependency Locality #

Information locality (I_t concentrated at small t) generalizes dependency locality (short structural distances between related words). When syntactic dependencies are short, the words that carry predictive information about each other are close together, making I_t decay rapidly.

TODO: Formal proof requires connecting the structural notion of dependency length to the information-theoretic notion of mutual information at distance t. See @cite{futrell-2019} and @cite{hahn-degen-futrell-2021} §2.3 for the argument that DLM is a special case of information locality optimization.

theorem DepGrammar.MemorySurprisal.information_locality_generalizes_dep_locality :

Bridge: Memory-Surprisal ↔ Generalised Surprisal #

The memory-surprisal trade-off operates at the standard surprisal configuration: negLog warping, indicator scoring, horizon 1, predictive level. The trade-off curve varies memory capacity while holding the prediction resolution fixed.

@cite{giulianelli-etal-2026} generalizes this by also varying the resolution parameters (forecast horizon h and representational level l), showing that different psycholinguistic measures are best predicted at different resolutions. The memory-surprisal trade-off is the special case where prediction resolution is held constant and only the memory budget varies.

def DepGrammar.MemorySurprisal.memorySurprisalConfig :

Core.GeneralisedSurprisal.SurprisalConfig

The generalised surprisal configuration used by the memory-surprisal trade-off: standard surprisal (negLog × indicator × h=1 × predictive).

The trade-off curve parametrizes over memory encodings while holding this resolution fixed. IAS extends this by also parametrizing over the prediction resolution (horizon and representational level).

Equations

DepGrammar.MemorySurprisal.memorySurprisalConfig = Core.GeneralisedSurprisal.standardSurprisal

Instances For