
Linglib.Phenomena.Dialogue.Studies.Anderson2021

Anderson (2021): Conversation Update for RSA

@cite{anderson-2021}

A system for multi-turn conversation update in the Rational Speech Acts framework. The core contributions:

  1. Common ground as distribution: the CG is a probability distribution over worlds, substituted directly for the RSA world prior
  2. Learning-rate update: CG evolves via convex combination with Pragmatic Listener posteriors
  3. Shared vs. approximate CG: two models differing in whether participants share a single CG representation
  4. Observation sampling: weighted, thresholded, and difference-based strategies for cooperative speaker behavior

Key Connections

The CG update rule CG'(w) = (1-lr)·CG(w) + lr·post(w) is algebraically identical to @cite{luce-1959}'s linear learning rule with retention rate 1-lr and reinforcement target post. This connects RSA pragmatics to learning theory: multi-turn conversation IS iterated learning over distributions.
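
The algebraic identity can be checked in a few lines of Lean. This is a self-contained sketch with hypothetical names (`cgUpdate`, `luceUpdate`), not the library's own definitions, which additionally carry non-negativity proofs:

```lean
import Mathlib

variable {W : Type*}

/-- Anderson-style CG update (hypothetical standalone name). -/
noncomputable def cgUpdate (cg post : W → ℝ) (lr : ℝ) : W → ℝ :=
  fun w => (1 - lr) * cg w + lr * post w

/-- Luce's linear learning rule with retention rate `α` and target `r`. -/
noncomputable def luceUpdate (v r : W → ℝ) (α : ℝ) : W → ℝ :=
  fun a => α * v a + (1 - α) * r a

/-- Setting `α := 1 - lr` makes the two rules coincide. -/
example (cg post : W → ℝ) (lr : ℝ) :
    cgUpdate cg post lr = luceUpdate cg post (1 - lr) := by
  funext w
  simp only [cgUpdate, luceUpdate]
  ring
```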

The distributional CG refines @cite{stalnaker-2002}'s classical context set: worlds with zero weight are excluded from the context, recovering set-intersection update as a special case.

BToM Connection

Anderson's distributional CG has the type expected by BToMModel.sharedUpdate (Shared → Action → World → Shared). Setting Shared := DistributionalCG W instantiates BToM's discourse dynamics for the first time in linglib — the shared state is a distribution that evolves after each utterance via the learning-rate update.
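
The claimed type fit can be sketched directly; `sharedUpdateSketch` below is a hypothetical standalone name, with a bare weight function `W → ℝ` standing in for `DistributionalCG W`:

```lean
import Mathlib

/-- A distribution-valued shared-state update with the
`Shared → Action → World → Shared` shape. The `World` argument is
ignored, as in the text: the update depends on the posterior derived
from the utterance, not on the true world. -/
noncomputable def sharedUpdateSketch {W U : Type*}
    (posteriorFn : U → W → ℝ) (lr : ℝ) :
    (W → ℝ) → U → W → (W → ℝ) :=
  fun cg u _trueWorld w => (1 - lr) * cg w + lr * posteriorFn u w
```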

MutualFriends Domain

The paper illustrates predictions using the MutualFriends dataset, where worlds are individuals characterized by features (major, location) and utterances describe those features.

Worlds in the MutualFriends domain: four individuals with different feature combinations.


            Utterances available to speakers. Includes a null utterance for passing when the speaker has no confident observation to share.


                Truth-conditional semantics for MutualFriends utterances.


                  A distributional common ground: a non-negative weight function over worlds.

                  This is the probabilistic counterpart of @cite{stalnaker-2002}'s context set. Instead of a sharp membership predicate (W → Prop), a distributional CG assigns graded plausibility (W → ℝ).

                  • weight : W → ℝ
                  • weight_nonneg (w : W) : 0 ≤ self.weight w
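
As a self-contained sketch (the library's actual structure may differ in detail), the type pairs a weight function with a non-negativity proof, and the classical bridge is positivity of weight:

```lean
import Mathlib

/-- Hypothetical standalone sketch of a distributional common ground. -/
structure DistributionalCGSketch (W : Type*) where
  weight : W → ℝ
  weight_nonneg : ∀ w, 0 ≤ weight w

/-- Bridge to the classical context set: positive weight = membership. -/
def DistributionalCGSketch.toContextSet {W : Type*}
    (cg : DistributionalCGSketch W) : W → Prop :=
  fun w => 0 < cg.weight w
```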

                    Uniform distributional CG: all worlds equally plausible (empty CG).


                      Bridge to classical context set: a world is "in the context" iff its weight is positive. This recovers @cite{stalnaker-2002}'s set-membership view from the distributional representation.


                        A world with zero weight is excluded from the classical context set.

                        noncomputable def Phenomena.Dialogue.Studies.Anderson2021.updateCG {W : Type u_1} (cg : DistributionalCG W) (posterior : W → ℝ) (post_nonneg : ∀ (w : W), 0 ≤ posterior w) (lr : ℝ) (hlr : 0 ≤ lr) (hlr1 : lr ≤ 1) : DistributionalCG W

                        Convex combination update for distributional common ground:

                        CG'(w) = (1 - lr) · CG(w) + lr · posterior(w)
                        

                        The learning rate lr ∈ [0,1] controls how much weight is given to new information vs. the existing CG.

                          theorem Phenomena.Dialogue.Studies.Anderson2021.updateCG_eq {W : Type u_1} (cg : DistributionalCG W) (posterior : W → ℝ) (hn : ∀ (w : W), 0 ≤ posterior w) (lr : ℝ) (h0 : 0 ≤ lr) (h1 : lr ≤ 1) (w : W) :
                          (updateCG cg posterior hn lr h0 h1).weight w = (1 - lr) * cg.weight w + lr * posterior w

                          The CG update formula is a convex combination — definitionally equal to (1 - lr) · CG(w) + lr · posterior(w).

                          theorem Phenomena.Dialogue.Studies.Anderson2021.updateCG_matches_linear_learning {W : Type u_1} (cg : DistributionalCG W) (posterior : W → ℝ) (hn : ∀ (w : W), 0 ≤ posterior w) (lr : ℝ) (h0 : 0 ≤ lr) (h1 : lr ≤ 1) (w : W) :
                          (updateCG cg posterior hn lr h0 h1).weight w = (1 - lr) * cg.weight w + (1 - (1 - lr)) * posterior w

                          Bridge to @cite{luce-1959} linear learning: the CG update has the same algebraic form as LinearLearner.update with retention rate 1 - lr and reinforcement target posterior:

                          CG'(w) = (1 - lr) · CG(w) + lr · posterior(w)     [Anderson]
                          v'(a)  = α · v(a) + (1 - α) · r(a)                [Luce §4.C]
                          

                          Setting α = 1 - lr and r = posterior makes the formulas identical. Multi-turn conversation IS iterated learning over distributions.

                          theorem Phenomena.Dialogue.Studies.Anderson2021.updateCG_lr_zero {W : Type u_1} (cg : DistributionalCG W) (posterior : W → ℝ) (hn : ∀ (w : W), 0 ≤ posterior w) (w : W) :
                          (updateCG cg posterior hn 0 ⋯ ⋯).weight w = cg.weight w

                          With learning rate 0, the CG is unchanged (full retention).

                          theorem Phenomena.Dialogue.Studies.Anderson2021.updateCG_lr_one {W : Type u_1} (cg : DistributionalCG W) (posterior : W → ℝ) (hn : ∀ (w : W), 0 ≤ posterior w) (w : W) :
                          (updateCG cg posterior hn 1 ⋯ ⋯).weight w = posterior w

                          With learning rate 1, the CG is replaced by the posterior.

                          The state of a two-participant conversation (Figure 2).

                          Tracks the common ground (distributional), each participant's private beliefs, and the learning rate for updates. In the shared CG model (§5.1, Figure 4), both participants access the same cg. In the approximate CG model (§5.2, Figure 6), each maintains a separate approximation (not yet formalized).

                          The distributional CG serves as both RSAConfig.meaning (L0 prior) and RSAConfig.worldPrior (L1 prior) — the CG enters the RSA model at two levels (Figure 4). Between turns, the CG evolves via updateCG, and the RSA model is reconstructed via mfRSAAt.


                            Initial conversation state: uniform CG, specified beliefs, A speaks first.


                              Weighted sampling: sample a world proportional to the speaker's belief. Biased toward truth (zero-probability worlds are never sampled) but can lead to flip-flopping when the speaker has no strong beliefs.

                                noncomputable def Phenomena.Dialogue.Studies.Anderson2021.thresholdedSample {W : Type u_1} (bel : DistributionalCG W) (θ : ℝ) :
                                W → ℝ

                                Thresholded sampling: filter out worlds below a confidence threshold. If no world exceeds the threshold, the speaker produces the null utterance (passes). Prevents noncommittal speakers from making random assertions.


                                  Difference-based sampling: weight worlds by the positive difference between the speaker's belief and the current CG. Worlds already established in the CG get downweighted, favoring informative (non-redundant) contributions.

                                  weight(w) = max(0, Bel(w) - CG(w))
                                  

                                  This implements a pressure toward Gricean Quantity: don't repeat what's already in the common ground.
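
A minimal sketch of the weighting (hypothetical standalone name `diffWeight`, not the library's definition), with the degenerate case checked: worlds already at least as established in the CG contribute nothing.

```lean
import Mathlib

variable {W : Type*}

/-- Difference-based weight: only belief mass exceeding the CG counts. -/
noncomputable def diffWeight (bel cg : W → ℝ) : W → ℝ :=
  fun w => max 0 (bel w - cg w)

/-- If the CG already carries at least the speaker's belief in `w`,
the world gets zero sampling weight — nothing new to say about it. -/
example (bel cg : W → ℝ) (w : W) (h : bel w ≤ cg w) :
    diffWeight bel cg w = 0 := by
  have hx : bel w - cg w ≤ 0 := sub_nonpos.mpr h
  simp [diffWeight, max_eq_left hx]
```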

                                    noncomputable def Phenomena.Dialogue.Studies.Anderson2021.toBToMSharedUpdate {W : Type u_1} {U : Type u_2} (posteriorFn : U → W → ℝ) (post_nonneg : ∀ (u : U) (w : W), 0 ≤ posteriorFn u w) (lr : ℝ) (hlr : 0 ≤ lr) (hlr1 : lr ≤ 1) : DistributionalCG W → U → W → DistributionalCG W

                                    Anderson's CG update expressed as a BToM shared-state update.

                                    Given a fixed posterior-computation function (from RSA inference), the CG update has the type required by BToMModel.sharedUpdate:

                                    Shared → Action → World → Shared
                                    

                                    with Shared := DistributionalCG W and Action := U.

                                    The World parameter is unused: the listener doesn't know the true world, so the CG update depends on the posterior (derived from the utterance), not the true world directly.


                                      A believes the person is Nancy: weight 3 on Nancy, 1 on others.


                                        B believes the person is Katie: weight 3 on Katie, 1 on others.


                                          Under difference-based sampling, A initially prioritizes Nancy (highest positive difference from uniform CG).


                                            RSA model for the MutualFriends domain at turn 1.

                                            Anderson's Shared CG model (Figure 4) uses the distributional CG at BOTH L0 and L1:

                                            L0(w|u) ∝ ⟦u⟧(w) · CG(w)     -- CG enters L0 via `meaning`
                                            S1(u|w) ∝ L0(w|u)              -- speaker optimizes CG-weighted informativity
                                            L1(w|u) ∝ S1(u|w) · CG(w)     -- CG enters L1 via `worldPrior`
                                            

                                            At turn 1, CG is uniform (weight 1 everywhere), so the CG factor drops out of L0 and the meaning reduces to Boolean semantics: ⟦u⟧(w) · 1 = ⟦u⟧(w). The general CG-weighted pattern is visible in mfRSA_turn2.


                                              A speaker who knows the person is Nancy prefers "studyHumanity" over "studyScience". Nancy studies German (a humanity), so "studyScience" has L0(nancy|studyScience) = 0, while "studyHumanity" has L0(nancy|studyHumanity) = 1/2.

                                              A speaker who knows it's Nancy prefers "likeOutdoors" over "likeIndoors". Nancy likes being outdoors.

                                              A speaker who knows it's Ina prefers "studyScience" over "studyHumanity". Ina studies Astronomy (a science).

                                              A speaker who knows it's Ina is indifferent between "studyScience" and "likeIndoors": both are true of exactly 2 worlds, giving equal L0 posteriors. This tests rsa_predict on equality goals.

                                              The null utterance is always suboptimal: a speaker who knows it's Nancy strictly prefers any true specific utterance over saying nothing. Null is true of all 4 worlds (L0 = 1/4), while "studyHumanity" is true of only 2 (L0 = 1/2).

                                              Symmetry: S1(studyHumanity|nancy) = S1(likeOutdoors|nancy). Both utterances partition the 4 worlds into 2 true + 2 false, so L0(nancy|studyHumanity) = L0(nancy|likeOutdoors) = 1/2, hence equal S1.

                                              False utterances get zero S1 probability. "studyScience" is false of Nancy (she studies German), so S1 = 0. Tests rsa_predict on negation of strict inequality.

                                              After hearing "studyHumanity", L1 assigns higher probability to Nancy than to Ina. Nancy studies a humanity; Ina studies a science.

                                              After hearing "likeOutdoors", L1 favors Nancy over Sally. Nancy likes outdoors; Sally likes indoors.

                                              After hearing "studyHumanity", L1 assigns equal probability to Nancy and Sally — both study a humanity, and S1 scores are symmetric.

                                              After hearing "studyScience", L1 assigns equal probability to Ina and Katie — both study a science.

                                              The null utterance conveys no information: L1 assigns equal probability to all worlds. Every world has S1(null|w) = 1/5 by the domain's symmetry (each world makes exactly 2 non-null utterances true).

                                              CG weights after hearing "studyHumanity" at turn 1.

                                              After L1 processes "studyHumanity" with uniform prior, the posterior concentrates on nancy and sally (the German-studying worlds): L1(studyHumanity) = [0, 0, 1/2, 1/2]. Updating the normalized uniform CG [1/4, 1/4, 1/4, 1/4] via updateCG with lr=0.2 (footnote 9) gives:

                                              CG'(w) = 0.8 · 1/4 + 0.2 · L1(w)
                                              CG' = [1/5, 1/5, 3/10, 3/10]
                                              

                                              The weights [2, 2, 3, 3] are proportional to [1/5, 1/5, 3/10, 3/10], which is the exact post-update CG from the paper's Figure 5, panel 1A. Since RSA normalizes, proportional weights give identical predictions.
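
The arithmetic with lr = 1/5 checks out exactly. These are standalone verifications in exact rationals, not the library's own lemmas:

```lean
import Mathlib

-- ina, katie: posterior 0, so CG'(w) = 0.8 · 1/4
example : (1 - 1/5 : ℚ) * (1/4) + (1/5) * 0 = 1/5 := by norm_num

-- nancy, sally: posterior 1/2, so CG'(w) = 0.8 · 1/4 + 0.2 · 1/2
example : (1 - 1/5 : ℚ) * (1/4) + (1/5) * (1/2) = 3/10 := by norm_num

-- the integer weights [2, 2, 3, 3] are 10 × [1/5, 1/5, 3/10, 3/10]
example : (10 : ℚ) * (1/5) = 2 := by norm_num
example : (10 : ℚ) * (3/10) = 3 := by norm_num
```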


                                                RSA model after hearing "studyHumanity" at turn 1.

                                                The updated CG enters BOTH L0 and L1 (Figure 4), matching Anderson's Shared CG model:

                                                L0(w|u) ∝ ⟦u⟧(w) · CG'(w)     -- CG in L0 via `meaning`
                                                L1(w|u) ∝ S1(u|w) · CG'(w)     -- CG in L1 via `worldPrior`
                                                

                                                This means S1 adapts to the CG: the speaker reasons about informativity relative to the current common ground. After "studyHumanity" shifts the CG toward nancy/sally (weights [2,2,3,3] ∝ [1/5, 1/5, 3/10, 3/10]), utterances that disambiguate within that subspace (e.g., "likeOutdoors") become more informative than utterances that merely re-assert the major dimension (e.g., "studyHumanity").


                                                  After CG update from "studyHumanity", L1("likeOutdoors") now favors Nancy over Katie. In turn 1, they were symmetric (both like outdoors). The updated prior (3 vs 1) breaks the tie — Nancy's higher CG weight makes her more probable. This is the key multi-turn prediction.

                                                  After CG update, "likeIndoors" favors Sally over Ina. Both like indoors, but Sally has higher prior (3 vs 1) from the CG shift.

                                                  After CG update, "studyScience" still treats Ina and Katie equally: both study a science and both have equal prior weight (1).

                                                  After CG update, "studyHumanity" still treats Nancy and Sally equally: both study a humanity and both have equal updated prior (3).

                                                  CG update breaks turn-1 symmetry: in turn 1, L1("likeOutdoors") assigned equal weight to Nancy and Katie. After the CG shift, Nancy is favored. Multi-turn conversation enriches inference.

                                                  With CG entering L0 (Figure 4), the speaker at turn 2 adapts to what's already in the common ground. After "studyHumanity" establishes the major dimension, utterances that disambiguate within the high-CG subspace become more informative. This section verifies that the CG-weighted S1 captures Anderson's cooperative contribution mechanism.

                                                  CG-adapted informativity: At turn 2, the speaker who knows it's Nancy prefers "likeOutdoors" over "studyHumanity". At turn 1, these were equal (both partition the 4-world space into 2+2). After the CG shifts toward nancy/sally (weights [2,2,3,3]), "likeOutdoors" discriminates within the high-weight subspace (L0(nancy|likeOutdoors) = 3/5) while "studyHumanity" merely re-asserts what's already established (L0(nancy|studyHumanity) = 1/2).

                                                  This is Anderson's key insight: the CG-weighted L0 makes speakers prefer new information over redundant information.

                                                  CG adaptation works differently for low-CG worlds: at turn 2, Ina (weight 2) prefers "studyScience" over "likeIndoors" because sally (weight 3) dominates the indoor partition, making L0(ina|likeIndoors) = 2/5 < L0(ina|studyScience) = 1/2. The CG shift makes the major dimension MORE informative for low-CG worlds, the opposite of the high-CG case (nancy, §12b above).
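
Both turn-2 contrasts reduce to small CG-weighted fractions (weights ina = 2, katie = 2, nancy = 3, sally = 3). Standalone arithmetic checks, not the library's proofs:

```lean
import Mathlib

-- nancy (weight 3): likeOutdoors discriminates within the high-CG
-- subspace; studyHumanity merely re-asserts what is established
example : (3 : ℚ) / (3 + 2) = 3/5 := by norm_num  -- L0(nancy|likeOutdoors)
example : (3 : ℚ) / (3 + 3) = 1/2 := by norm_num  -- L0(nancy|studyHumanity)

-- ina (weight 2): the pattern reverses, since sally dominates indoors
example : (2 : ℚ) / (2 + 3) = 2/5 := by norm_num  -- L0(ina|likeIndoors)
example : (2 : ℚ) / (2 + 2) = 1/2 := by norm_num  -- L0(ina|studyScience)
```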

                                                  S2 endorsement: given world Nancy, the pragmatic speaker endorses "studyHumanity" over "studyScience". S2(u|w) ∝ L1(w|u), and L1(nancy|studyHumanity) > 0 = L1(nancy|studyScience).

                                                  S2 endorsement: given world Nancy, "studyHumanity" and "likeOutdoors" are equally endorsed (symmetric L1 posteriors).

                                                  RSA model for MutualFriends at an arbitrary CG (Figure 4).

                                                  This is the general form of Anderson's Shared CG model: the CG enters as the L0 meaning weight (via meaning) AND as the L1 prior (via worldPrior). One-shot RSA is the special case with uniform CG.

                                                  Used by conversationStep to construct the RSA model at each turn.


                                                    One step of the Shared CG conversation loop (Figure 2).

                                                    Given the current CG and an utterance:

                                                    1. Build the RSA model at the current CG (mfRSAAt)
                                                    2. Compute L1 posteriors: the pragmatic listener's world beliefs
                                                    3. Update the CG via convex combination with the posteriors

                                                    This closes the loop: RSA inference → CG update → new RSA model. The returned CG serves as the world prior for the next turn's model.

                                                    Normalization note: The L1 posterior is normalized (sums to 1), so the CG weights should also be normalized for the convex combination to preserve total weight. With normalized CG, updateCG is a true convex combination and preserves sum-to-1. Starting from DistributionalCG.uniform (weight=1 per world) gives correct RSA predictions but produces CG updates with a different scale than the paper's normalized distributions.
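
The preservation claim is easy to verify in the two-world case; a hedged standalone check, not the library's lemma:

```lean
import Mathlib

/-- If both the CG and the posterior sum to 1, so does their convex
combination — normalization is preserved (two-world case). -/
example (a b c d lr : ℝ) (hcg : a + b = 1) (hpost : c + d = 1) :
    ((1 - lr) * a + lr * c) + ((1 - lr) * b + lr * d) = 1 := by
  calc ((1 - lr) * a + lr * c) + ((1 - lr) * b + lr * d)
      = (1 - lr) * (a + b) + lr * (c + d) := by ring
    _ = 1 := by rw [hcg, hpost]; ring
```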

                                                      theorem Phenomena.Dialogue.Studies.Anderson2021.conversationStep_nonneg (cg : DistributionalCG MFWorld) (u : MFUtterance) (lr : ℝ) (hlr : 0 ≤ lr) (hlr1 : lr ≤ 1) (w : MFWorld) :
                                                      0 ≤ (conversationStep cg u lr hlr hlr1).weight w

                                                      The conversation step preserves CG non-negativity.

                                                      With lr = 0, the conversation step leaves the CG unchanged.

                                                      The RSA predictions above verify properties from Phenomena.Dialogue.Basic:

                                                      1. Contributions informative: S1 prefers specific utterances over null (§9, s1_null_suboptimal), matching contributionsInformative = true.

                                                      2. Uncertainty decreases: L1 concentrates probability after hearing an informative utterance (this section).

                                                      3. CG-adapted contributions: At turn 2, S1 adapts to what's already in the CG, preferring non-redundant information (§12b).

                                                      L1 concentrates probability after an informative utterance: L1(nancy|studyHumanity) > L1(nancy|null). The null utterance gives uniform L1 (= 1/4), while "studyHumanity" concentrates on the 2 German-studying worlds (= 1/2). This matches Phenomena.Dialogue.successfulInfoSharing.uncertaintyDecreases.

                                                      Informed speakers are informative: S1 assigns higher probability to a true specific utterance than to null. This matches Phenomena.Dialogue.successfulInfoSharing.contributionsInformative.

                                                      theorem Phenomena.Dialogue.Studies.Anderson2021.lr_one_excludes_false_worlds (cg : DistributionalCG MFWorld) (posterior : MFWorld → ℝ) (hn : ∀ (w : MFWorld), 0 ≤ posterior w) (w : MFWorld) (h : posterior w = 0) :
                                                      ¬(updateCG cg posterior hn 1 ⋯ ⋯).toContextSet w

                                                      Anderson's distributional CG update subsumes @cite{stalnaker-2002}'s set-intersection update as a special case: with learning rate 1 and a posterior that assigns zero weight to worlds where the utterance is false, the updated CG excludes exactly those worlds — recovering ContextSet.update.

                                                      This bridge connects the probabilistic conversation model to the classical assertion framework in Core.Semantics.CommonGround.

                                                      Exact rational values for the turn-1 RSA computations underlying Figure 5, panel 1A. At turn 1 with uniform CG, the domain's 2×2 feature symmetry gives clean fractions: each non-null utterance is true of exactly 2 worlds (L0 = 1/2), null is true of all 4 (L0 = 1/4), and each world makes exactly 2 non-null utterances true, giving S1(true u|w) = (1/2)/(5/4) = 2/5 and S1(null|w) = (1/4)/(5/4) = 1/5.
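
The S1 fractions follow from the 5/4 normalizer: each world makes two specific utterances true at L0 = 1/2 each, plus null at L0 = 1/4. Standalone checks of the stated values:

```lean
import Mathlib

example : (1/2 + 1/2 + 1/4 : ℚ) = 5/4 := by norm_num  -- S1 normalizer
example : ((1:ℚ)/2) / (5/4) = 2/5 := by norm_num      -- S1(true u | w)
example : ((1:ℚ)/4) / (5/4) = 1/5 := by norm_num      -- S1(null | w)
```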

                                                      Null gives uniform L1: every world has the same S1(null|w) by the domain's symmetry, so L1(w|null) = CG(w)/Σ CG = 1/4.

                                                      The Approximate Common Ground model relaxes the shared-CG assumption: each participant maintains their own CG approximation. The speaker uses CG_S in production; the listener uses CG_L in comprehension with their private beliefs B_L as the L1 world prior.

                                                      Key difference from shared CG (Figure 4): the listener's L1 world prior is their private beliefs B_L rather than the shared CG.

                                                      This models realistic divergence: participants with different priors hear the same utterance but reach different posteriors, causing their CG approximations to drift apart (Figure 7).

                                                      State for the Approximate CG model (§5.2, Figure 6).

                                                          noncomputable def Phenomena.Dialogue.Studies.Anderson2021.approxComprehensionRSA (cgL : MFWorld → ℝ) (hcgL : ∀ (w : MFWorld), 0 ≤ cgL w) (belL : MFWorld → ℝ) (hbelL : ∀ (w : MFWorld), 0 ≤ belL w) :

                                                          Approximate comprehension RSA (Figure 6): L0 uses CG_L, but L1 uses B_L (listener's private beliefs) as the world prior.


                                                            When beliefs equal the CG, the approximate model reduces to the shared CG model — the split is only meaningful when they diverge.

                                                            The belief update model extends the conversation system by also updating participants' private beliefs. After comprehension, the listener updates their beliefs via the same linear rule as CG update:

                                                            Bel'(w) = (1 - lr_bel) · Bel(w) + lr_bel · posterior(w)
                                                            

                                                            The speaker does not update beliefs (they already knew the information). Separate learning rates for the CG and for private beliefs allow the public record and private convictions to evolve at different speeds.

                                                            State for the belief update model (Figure 8). Extends approximate CG with separate learning rates for CG and beliefs.

                                                                theorem Phenomena.Dialogue.Studies.Anderson2021.belief_update_is_linear_learning {W : Type u_1} (bel : DistributionalCG W) (posterior : W → ℝ) (hn : ∀ (w : W), 0 ≤ posterior w) (lr : ℝ) (h0 : 0 ≤ lr) (h1 : lr ≤ 1) (w : W) :
                                                                (updateCG bel posterior hn lr h0 h1).weight w = (1 - lr) * bel.weight w + lr * posterior w

                                                                Belief update is algebraically identical to CG update — both are instances of @cite{luce-1959}'s linear learning rule. The only difference is the learning rate parameter and the interpretation (private vs shared).

                                                                A speaker with uniform beliefs (no private information) assigns equal weight to all worlds under weighted sampling. Since no observation is more probable than any other, the speaker makes random assertions about worlds they don't know, causing the CG to flip-flop (Figure 12).

                                                                Solutions:

                                                                1. Threshold sampling (§7.1.1): filter out low-confidence worlds; a noncommittal speaker passes (null utterance) instead of guessing.
                                                                2. Uncertainty-based lr (§6.3): scale the CG update rate by the listener's uncertainty, so confident listeners resist random input.

                                                                Uniform beliefs assign equal weight to all worlds under weighted sampling — a noncommittal speaker has no basis for choosing.

                                                                Threshold sampling filters out all worlds when beliefs don't exceed the threshold. For uniform beliefs (weight 1), any θ > 1 produces zero weight everywhere — the speaker passes (Figure 13).

                                                                Threshold preserves confident worlds: weights above θ pass through.
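
A minimal sketch of thresholding (hypothetical standalone name `thresholdWeight`), checking the pass case for uniform beliefs from the text:

```lean
import Mathlib

variable {W : Type*}

/-- Keep a world's weight only if it clears the confidence threshold. -/
noncomputable def thresholdWeight (bel : W → ℝ) (θ : ℝ) : W → ℝ :=
  fun w => if θ ≤ bel w then bel w else 0

/-- Uniform beliefs (weight 1) with θ > 1: every world is filtered
out, so the speaker passes (null utterance). -/
example (θ : ℝ) (h : 1 < θ) (w : W) :
    thresholdWeight (fun _ => (1 : ℝ)) θ w = 0 := by
  simp [thresholdWeight, not_le.mpr h]
```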

                                                                Under weighted sampling, a speaker whose beliefs match the CG keeps repeating already-shared information (Figure 14). Difference-based sampling fixes this by weighting worlds by max(0, Bel(w) - CG(w)): worlds already established in the CG get zero weight.

                                                                Combined with thresholding, thresholded difference-based sampling gives the best behavior (Figure 15): informed speakers contribute new information, noncommittal speakers pass.

                                                                When beliefs match the CG exactly, difference sampling assigns zero weight to all worlds — nothing new to contribute.

                                                                Difference sampling assigns positive weight when belief exceeds CG — these worlds carry new information not yet in the common ground.

                                                                A's initial difference from uniform CG: Nancy has the highest positive difference (belief 3 vs CG 1), so A's first contribution should describe Nancy — matching the stochastic trace in Figure 5.