Lexicalization: Efficient Encoding of Emerging Concepts
@cite{xu-etal-2024}
Inaugural module of Theories/Diachronic/: formal theories of language change.
Xu et al. (2024) unify word reuse and combination (compounding) under a single information-theoretic account. Both strategies for encoding novel concepts are shaped by the same tradeoff between minimizing speaker effort (word length) and minimizing information loss (listener confusion). Attested encodings in English, French, and Finnish sit near the Pareto frontier of this tradeoff.
Architecture
The model extends RSA's speaker-listener framework to lexicon evolution:
- The listener uses prototype-based categorization (Eq. 1): an L0 whose meaning function is exp(-γ · d(c, q_w)).
- The speaker trades off informativity against word-length cost: an S1 with beliefAction scoring, where cost(w) = β · l(w).
- The encoding E* is the set of (concept, form) pairs for emerging concepts; its efficiency is measured against the Pareto frontier of possible encodings (Core.Efficiency).
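The L0 meaning function can be sketched directly in Lean (a sketch, not the module's actual definition; `Concept`, `Form`, the metric `d`, and the prototype map `q` are assumed names):

```lean
abbrev Concept := String  -- assumption: concepts keyed by string labels
abbrev Form := String     -- assumption: word forms as strings

/-- Prototype-based L0 meaning (Eq. 1, unnormalized):
    m̂_{w,L}(c) ∝ exp(−γ · d(c, q_w)). -/
def protoMeaning (γ : Float) (d : Concept → Concept → Float)
    (q : Form → Concept) (c : Concept) (w : Form) : Float :=
  Float.exp (-(γ * d c (q w)))
```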
Connection to LCEC
The LCEC (Morphology.WP.LCEC) states that I-complexity (conditional entropy
across paradigm cells) is uniformly low despite high E-complexity. The analog
here: lexicons maintain low information loss despite high polysemy, because
reuse and compounding are informationally efficient. Both are instances of a
general principle: natural languages achieve low integrative complexity despite
high enumerative complexity, via structured redundancy.
Strategy by which a novel concept enters the lexicon. @cite{xu-etal-2024} Table 1: reuse items (R) vs. compounds (C).
- reuse : Strategy
Reuse an existing word for a new meaning. E.g., "mouse" (rodent → peripheral), "dish" (plate → antenna).
- combination : Strategy
Combine existing words into a compound. E.g., "birthday card", "spreadsheet", "urban renewal".
Instances For
Equations
- Diachronic.Lexicalization.instBEqStrategy.beq x✝ y✝ = (x✝.ctorIdx == y✝.ctorIdx)
Instances For
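The constructor listing above corresponds to an inductive declaration along these lines (a reconstruction from the docs; the deriving clause is inferred from the `BEq` instance shown):

```lean
/-- Strategy by which a novel concept enters the lexicon. -/
inductive Strategy where
  | reuse        -- extend an existing word to a new sense ("mouse", "dish")
  | combination  -- compound existing words ("birthday card", "spreadsheet")
  deriving BEq, DecidableEq, Repr
```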
Literality of the form-meaning relationship. Literal items are semantically transparent and tend to be more communicatively efficient (@cite{xu-etal-2024} §Item-Level Variation).
- literal : Literality
Literal: form directly relates to the intended concept.
- Reuse: intended meaning is a hyponym of the existing sense.
- Combination: endocentric compound (head = superordinate).
- nonliteral : Literality
Nonliteral: metaphorical or metonymic relationship.
- Reuse: e.g., "mouse" for computer peripheral.
- Combination: exocentric, e.g., "boîte noire" = flight recorder.
Instances For
Equations
- Diachronic.Lexicalization.instBEqLiterality.beq x✝ y✝ = (x✝.ctorIdx == y✝.ctorIdx)
Instances For
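As with `Strategy`, the `Literality` type is plausibly a two-constructor inductive (a sketch reconstructed from the constructor docs above):

```lean
/-- Literality of the form-meaning relationship. -/
inductive Literality where
  | literal     -- hyponymic reuse or endocentric compound
  | nonliteral  -- metaphorical reuse or exocentric compound
  deriving BEq, DecidableEq, Repr
```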
A form-concept pair in an emerging encoding (one entry in E*).
- form : String
- concept : String
- strategy : Strategy
- literality : Literality
- formLength : ℕ
Instances For
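The field listing above corresponds to a structure declaration of this shape (a sketch; the deriving clause is an assumption):

```lean
/-- A form-concept pair in an emerging encoding (one entry in E*). -/
structure EncodingPair where
  form       : String
  concept    : String
  strategy   : Strategy
  literality : Literality
  formLength : Nat
  deriving BEq, Repr
```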
Communicative costs of an encoding, parameterized by a listener model.
listenerScore concept form is the probability that the listener
recovers concept from form. In @cite{xu-etal-2024}, this is the
prototype-based categorization model (Eq. 1):
m̂_{w,L}(c) ∝ exp{-γ · d(c, q_w)}.
Effort (Eq. 2) is the expected word length; information loss (Eq. 3) is the expected surprisal under the listener distribution. Both expectations are weighted by the concept need probability.
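A plausible Lean reconstruction of Eqs. 2–3 as computed over an encoding (a sketch only; the module's actual `encodingCosts` signature, return type, and field names are assumptions):

```lean
/-- Sketch: effort (Eq. 2) and information loss (Eq. 3) of an encoding,
    each weighted by the concept need probability. -/
def encodingCostsSketch (pairs : List EncodingPair)
    (needProb : String → Float)
    (listenerScore : String → String → Float) : Float × Float :=
  -- effort: need-weighted expected word length
  let effort := pairs.foldl
    (fun acc p => acc + needProb p.concept * p.formLength.toFloat) 0.0
  -- info loss: need-weighted surprisal −log p(c | w)
  let infoLoss := pairs.foldl
    (fun acc p =>
      acc + needProb p.concept * (-Float.log (listenerScore p.concept p.form))) 0.0
  (effort, infoLoss)
```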
Unified objective (Eq. 5): L_β = info_loss + β · effort. Parameterizes the Pareto frontier.
Equations
- Diachronic.Lexicalization.unifiedObjective pairs needProb listenerScore β = Core.Efficiency.weightedCost (Diachronic.Lexicalization.encodingCosts pairs needProb listenerScore) β
Instances For
The prototype-based listener IS an RSA L0, and the unified objective
IS an S1 with beliefAction scoring.
To instantiate, set RSAConfigData with:
- U := Form
- W := Concept
- meaning _ c w := exp(-γ · d(c, prototype(w)))
- s1Spec := .beliefAction (fun w => β * length(w))
This function constructs the corresponding S1 scoring rule.
Equations
- Diachronic.Lexicalization.asS1ScoreSpec β length = RSA.S1ScoreSpec.beliefAction fun (w : String) => β * length w
Instances For
Efficiency Claim (Figs. 2–3): attested encodings are closer to the Pareto frontier than baseline encodings (random or near-synonym).
Equations
- Diachronic.Lexicalization.moreEfficientThan attested baseline optimalAt βs = (Core.Efficiency.efficiencyLoss attested optimalAt βs < Core.Efficiency.efficiencyLoss baseline optimalAt βs)
Instances For
Strategy Tradeoff (§Strategy Comparison): reuse items tend to be shorter, while compounds tend to be more informative. The two strategies occupy complementary regions of the effort-informativity space.
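The tradeoff claim can be stated as a proposition along these lines (a sketch; the `avg` helper is hypothetical and the module's actual statement may differ):

```lean
/-- Mean of a list of Floats (0 for the empty list). Hypothetical helper. -/
def avg (xs : List Float) : Float :=
  if xs.isEmpty then 0.0 else xs.foldl (· + ·) 0.0 / xs.length.toFloat

/-- Sketch of the strategy tradeoff: reuse items are shorter on average,
    while compounds score higher with the listener. -/
def strategyTradeoffSketch (pairs : List EncodingPair)
    (listenerScore : String → String → Float) : Prop :=
  let reuse := pairs.filter (·.strategy == .reuse)
  let comb  := pairs.filter (·.strategy == .combination)
  avg (reuse.map (·.formLength.toFloat)) ≤ avg (comb.map (·.formLength.toFloat)) ∧
  avg (reuse.map (fun p => listenerScore p.concept p.form)) ≤
    avg (comb.map (fun p => listenerScore p.concept p.form))
```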
Literal Advantage (§Item-Level Variation): literal items (hyponymic reuse, endocentric compounds) are more efficient than nonliteral ones, because semantic transparency reduces information loss.
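The literal-advantage claim plausibly has this shape (a sketch; the hypothetical `effLoss` parameter stands in for `Core.Efficiency.efficiencyLoss`, whose exact signature is taken on trust from `moreEfficientThan` above):

```lean
/-- Sketch: the literal sub-encoding incurs no more efficiency loss than
    the nonliteral one, for each tradeoff parameter β in `βs`. -/
def literalAdvantageSketch (pairs : List EncodingPair)
    (effLoss : List EncodingPair → Float → Float) (βs : List Float) : Prop :=
  ∀ β ∈ βs,
    effLoss (pairs.filter (·.literality == .literal)) β ≤
      effLoss (pairs.filter (·.literality == .nonliteral)) β
```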