
Linglib.Theories.Diachronic.Lexicalization

Lexicalization: Efficient Encoding of Emerging Concepts

@cite{xu-etal-2024}

Inaugural module of Theories/Diachronic/: formal theories of language change.

Xu et al. (2024) unify word reuse and combination (compounding) under a single information-theoretic account. Both strategies for encoding novel concepts are shaped by the same tradeoff between minimizing speaker effort (word length) and minimizing information loss (listener confusion). Attested encodings in English, French, and Finnish sit near the Pareto frontier of this tradeoff.

Architecture

The model extends RSA's speaker-listener framework to lexicon evolution.

Connection to LCEC

The LCEC (Morphology.WP.LCEC) states that I-complexity (conditional entropy across paradigm cells) is uniformly low despite high E-complexity. The analog here: lexicons maintain low information loss despite high polysemy, because reuse and compounding are informationally efficient. Both are instances of a general principle: natural languages achieve low integrative complexity despite high enumerative complexity, via structured redundancy.

Strategy by which a novel concept enters the lexicon. @cite{xu-etal-2024} Table 1: reuse items (R) vs. compounds (C).

  • reuse : Strategy

    Reuse an existing word for a new meaning. E.g., "mouse" (rodent → peripheral), "dish" (plate → antenna).

  • combination : Strategy

    Combine existing words into a compound. E.g., "birthday card", "spreadsheet", "urban renewal".


Literality of the form-meaning relationship. Literal items are semantically transparent and tend to be more communicatively efficient (@cite{xu-etal-2024} §Item-Level Variation).

  • literal : Literality

    Literal: the form directly relates to the intended concept.

    • Reuse: the intended meaning is a hyponym of the existing sense.
    • Combination: endocentric compound (head = superordinate).

  • nonliteral : Literality

    Nonliteral: a metaphorical or metonymic relationship.

    • Reuse: e.g., "mouse" for the computer peripheral.
    • Combination: exocentric, e.g., French "boîte noire" ('black box') = flight recorder.
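The two classifications above can be rendered as Lean inductives; a minimal sketch mirroring the constructor lists (derived instances and docstrings omitted):

```lean
inductive Strategy where
  | reuse        -- extend an existing word to a new sense ("mouse", "dish")
  | combination  -- compound existing words ("birthday card", "spreadsheet")

inductive Literality where
  | literal      -- transparent: hyponymic reuse, endocentric compound
  | nonliteral   -- metaphorical/metonymic reuse, exocentric compound
```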

A form-concept pair in an emerging encoding (one entry in E*).


Communicative costs of an encoding, parameterized by a listener model.

listenerScore concept form is the probability that the listener recovers concept from form. In @cite{xu-etal-2024}, this is the prototype-based categorization model (Eq. 1): m̂_{w,L}(c) ∝ exp(−γ · d(c, q_w)).

Effort (Eq. 2) is the expected word length; information loss (Eq. 3) is the expected surprisal under the listener distribution. Both aggregates are weighted by the concept need probability.
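Written out in the notation of Eq. 1 (a sketch reconstructed from the description above, with p(c) the need probability and w_c the form encoding concept c):

  effort(E*) = Σ_c p(c) · len(w_c)              (Eq. 2)
  loss(E*)   = −Σ_c p(c) · log m̂_{w_c,L}(c)     (Eq. 3)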

def Diachronic.Lexicalization.unifiedObjective (pairs : List FormConceptPair) (needProb : String → ℝ) (listenerScore : String → String → ℝ) (β : ℝ) : ℝ

Unified objective (Eq. 5): L_β = info_loss + β · effort. Varying β traces out the Pareto frontier: small β prioritizes low information loss, large β prioritizes short forms.
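A minimal executable sketch of this objective (assumptions: Float stands in for ℝ, and PairSketch is a local stand-in for FormConceptPair; this is not the library's definition):

```lean
structure PairSketch where
  form    : String
  concept : String

/-- Sketch of Eq. 5: L_β = info_loss + β · effort, both weighted by need probability. -/
def unifiedObjectiveSketch (pairs : List PairSketch)
    (needProb : String → Float)                -- p(c): concept need probability
    (listenerScore : String → String → Float)  -- m̂_{w,L}(c): recovery probability
    (β : Float) : Float :=
  let loss := pairs.foldl
    (fun acc p => acc - needProb p.concept * (listenerScore p.concept p.form).log) 0.0
  let effort := pairs.foldl
    (fun acc p => acc + needProb p.concept * Float.ofNat p.form.length) 0.0
  loss + β * effort
```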


The prototype-based listener is precisely an RSA L0, and the unified objective is precisely an S1 with beliefAction scoring.

To instantiate, set RSAConfigData with:

• U := Form, W := Concept
• meaning _ c w := exp(-γ · d(c, prototype(w)))
• s1Spec := .beliefAction (fun w => β * length(w))

This function constructs the corresponding S1 scoring rule.
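One way to see the correspondence (a sketch, assuming a standard softmax S1 with rationality α = 1):

  S1(w | c) ∝ exp(log L0(c | w) − β · len(w)) = L0(c | w) · exp(−β · len(w))

so −log S1(w | c) = −log L0(c | w) + β · len(w) + const, which is exactly the per-concept summand of L_β (listener surprisal plus β-weighted effort). Maximizing the S1 score and minimizing the unified objective therefore favor the same encodings.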


Efficiency Claim (Figs. 2–3): attested encodings are closer to the Pareto frontier than baseline encodings (random or near-synonym).
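The frontier comparison rests on the usual dominance order over cost pairs; a sketch (assuming, hypothetically, that Core.Efficiency.CostPair bundles effort and information loss as reals — CostPairSketch here is a local stand-in):

```lean
structure CostPairSketch where
  effort   : Float
  infoLoss : Float

/-- `a` weakly dominates `b`: no worse on either dimension. -/
def dominates (a b : CostPairSketch) : Prop :=
  a.effort ≤ b.effort ∧ a.infoLoss ≤ b.infoLoss
```

A point lies on the Pareto frontier when no other achievable encoding strictly dominates it; "closer to the frontier" compares an encoding's cost pair against the optimal pair at each β.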


Strategy Tradeoff (§Strategy Comparison): reuse items tend to be shorter, while compounds tend to be more informative. The two strategies occupy complementary regions of the effort-informativity space.

def Diachronic.Lexicalization.literalAdvantage (literalCosts nonliteralCosts : Core.Efficiency.CostPair) (optimalAt : Core.Efficiency.CostPair) (βs : List ℝ) : Prop

Literal Advantage (§Item-Level Variation): literal items (hyponymic reuse, endocentric compounds) are more efficient than nonliteral ones, because semantic transparency reduces information loss.
