Documentation

Linglib.Core.Agent.Learning

Learning Models #

@cite{luce-1959} @cite{bush-mosteller-1955} @cite{rescorla-wagner-1972}

Three families of learning models:

§1–6. Luce/Bush-Mosteller (ratio-scale and probability-space) #

@cite{luce-1959} extends the choice axiom to dynamic settings where choice probabilities change over trials via learning and reinforcement. The central question: if the Luce choice rule (IIA) holds at trial n, does it hold at trial n+1 after a learning update?

The key result (§4.B) is that the Luce rule is closed under positive linear transformations of the ratio scale values. Since the linear learning model (§4.A) is exactly such a transformation, IIA is preserved across learning trials. This is structurally guaranteed in our formalization: LinearLearner.update produces a new RationalAction, and RationalAction is the Luce choice rule.

We formalize:

  1. Alpha model (§4.C): v_{n+1}(a) = α·v_n(a) + (1-α)·r(a) — the linear learning rule
  2. Axiom 1 preservation (§4.B, Theorem 16): update yields a RationalAction — IIA by construction
  3. Beta model (§4.D): probability-space learning P_{n+1}(a) = (1-β)·P_n(a) + β·δ(a, chosen)
  4. Convergence (§4.G): under constant reinforcement, v_n(a) → r(a) as n → ∞

§7. Stochastic Gradient Ascent #

The Gradual Learning Algorithm (@cite{boersma-1998}) is SGA with single-sample estimates of the gradient. Because the MaxEnt objective is concave, SGA converges to its global maximum.

§8. Rescorla-Wagner (associative learning) #

@cite{rescorla-wagner-1972} models cue-outcome association via error-driven learning: ΔV_c = α_c · β · (λ − ΣV). Unlike Luce, R-W has a prediction error term (λ − ΣV) that generates blocking: when cue A already fully predicts the outcome, a co-present cue C learns nothing. @cite{ellis-2006} uses R-W to explain why L2 learners fail to acquire low-salience grammatical morphemes — they are blocked by higher-salience cues (word order, lexical content, temporal adverbs) that already predict the same meanings.

structure Core.LinearLearner (A : Type u_1) :
Type u_1

A linear learner for the Luce ratio-scale model (@cite{luce-1959}, §4.B–C).

The learning rule is a positive linear operator on ratio-scale values: v_{n+1}(a) = α · v_n(a) + (1-α) · r(a) where α ∈ (0,1) is the retention rate and r : A → ℝ is the reinforcement function. The key property is that this is a positive linear transformation — it maps non-negative scores to non-negative scores — which is exactly what Luce's Axiom 1 preservation (§4.B) requires.

  • learnRate : ℝ

    Retention rate: weight on prior score. Must be in (0,1).

  • reinforcement : A → ℝ

    Reinforcement: the asymptotic target score for each action.

  • learnRate_pos : 0 < self.learnRate

    Retention rate is strictly positive.

  • learnRate_lt_one : self.learnRate < 1

    Retention rate is strictly less than 1.

  • reinforcement_nonneg (a : A) : 0 ≤ self.reinforcement a

    Reinforcement values are non-negative.

Instances For

    The complement 1 - α, which serves as the weight on reinforcement.

    Equations
    Instances For
      def Core.LinearLearner.update {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) :
      RationalAction S A

      One step of linear learning (@cite{luce-1959}, §4.C, Alpha Model): v_{n+1}(s, a) = α · v_n(s, a) + (1 - α) · r(a).

      The result is a new RationalAction, so the Luce choice rule (IIA) holds at trial n+1 by construction — this is the content of §4.B.

      Equations
      Instances For
        theorem Core.linear_update_nonneg {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :
        0 ≤ (ll.update ra).score s a

        Updated scores are non-negative — a direct consequence of the positivity of α, (1-α), the original scores, and the reinforcement values. This is what makes the linear learning model a positive linear operator (@cite{luce-1959}, §4.B).

        theorem Core.linear_learning_preserves_axiom1 {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :
        (ll.update ra).policy s a = if (ll.update ra).totalScore s = 0 then 0 else (ll.update ra).score s a / (ll.update ra).totalScore s

        Axiom 1 preservation (@cite{luce-1959}, §4.B, Theorem 16):

        If the Luce choice rule P(a) = v(a)/Σv(b) holds at trial n, then after a positive linear update v' = α·v + (1-α)·r, the updated rule P'(a) = v'(a)/Σv'(b) is again a Luce choice rule.

        In our formalization this is true by construction: LinearLearner.update produces a RationalAction, and RationalAction.policy is the Luce rule. We state this as an explicit equational theorem for clarity.

        structure Core.BetaModel (A : Type u_3) :

        The β-model (@cite{luce-1959}, §4.D; @cite{bush-mosteller-1955}):

        An alternative learning model operating directly on probabilities rather than ratio-scale values: P_{n+1}(a) = (1-β) · P_n(a) + β · δ(a, chosen_n)

        where δ(a, chosen_n) = 1 if a was chosen on trial n, 0 otherwise. Unlike the linear model, this operates in probability space, so it does not generally preserve the Luce ratio-scale structure.
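A numeric sketch of one β-model step (illustrative Python, not the formalization) shows that the update stays in probability space: non-negativity and sum-to-one are preserved, mirroring update_nonneg and update_sum_one below.

```python
# Illustrative beta-model step (Luce 1959, §4.D); names are ours.
def beta_update(P, chosen, beta):
    """P'(a) = (1 - beta) * P(a) + beta * [a == chosen]."""
    return {a: (1 - beta) * P[a] + (beta if a == chosen else 0.0) for a in P}

P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = beta_update(P0, "a", beta=0.2)  # chosen action gains mass; others shrink
```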

        • beta : ℝ

          Learning rate in (0,1).

        • beta_pos : 0 < self.beta

          β is strictly positive.

        • beta_lt_one : self.beta < 1

          β is strictly less than 1.

        Instances For
          def Core.BetaModel.update {A : Type u_2} [DecidableEq A] (bm : BetaModel A) (P : A → ℝ) (chosen : A) :
          A → ℝ

          One step of β-model update (@cite{luce-1959}, §4.D): P'(a) = (1-β) · P(a) + β · δ(a, chosen).

          Takes a probability function and the chosen action, returns updated probabilities.

          Equations
          Instances For
            theorem Core.BetaModel.update_nonneg {A : Type u_2} [DecidableEq A] (bm : BetaModel A) (P : A → ℝ) (chosen : A) (hP : ∀ (a : A), 0 ≤ P a) (a : A) :
            0 ≤ bm.update P chosen a

            β-model update preserves non-negativity of probabilities.

            theorem Core.BetaModel.update_sum_one {A : Type u_2} [Fintype A] [DecidableEq A] (bm : BetaModel A) (P : A → ℝ) (chosen : A) (h_chosen : chosen ∈ Finset.univ) (h_sum : ∑ a : A, P a = 1) :
            ∑ a : A, bm.update P chosen a = 1

            β-model update preserves sum-to-one when the input sums to 1.

            def Core.iterate_linear {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (n : ℕ) :
            RationalAction S A

            n-step iteration of linear learning (@cite{luce-1959}, §4.C):

            Apply the linear update rule n times starting from an initial RationalAction. At each step, Axiom 1 is preserved because update produces a RationalAction.

            Equations
            Instances For
              theorem Core.iterate_linear_nonneg {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (n : ℕ) (s : S) (a : A) :
              0 ≤ (iterate_linear ll ra n).score s a

              After n iterations, scores are still non-negative. Follows by induction since each update step preserves non-negativity.

              theorem Core.iterate_linear_closed_form {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (n : ℕ) (s : S) (a : A) :
              (iterate_linear ll ra n).score s a = ll.learnRate ^ n * ra.score s a + (1 - ll.learnRate ^ n) * ll.reinforcement a

              Closed-form for iterated linear learning (@cite{luce-1959}, §4.C):

              After n iterations with retention rate α and reinforcement r: v_n(s, a) = α^n · v_0(s, a) + (1 - α^n) · r(a)

              This is the geometric-series solution to the linear recurrence.
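The closed form can be checked numerically (an illustrative Python sketch, not the formalization): direct iteration of the recurrence agrees with α^n · v_0 + (1 − α^n) · r, and the iterate approaches r as α^n vanishes.

```python
# Compare n-fold iteration of the alpha model with its geometric closed form.
def iterate_alpha(v0, r, alpha, n):
    v = v0
    for _ in range(n):
        v = alpha * v + (1 - alpha) * r
    return v

alpha, v0, r, n = 0.7, 5.0, 2.0, 20
iterated = iterate_alpha(v0, r, alpha, n)
closed = alpha**n * v0 + (1 - alpha**n) * r   # closed-form solution
```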

              theorem Core.linear_convergence {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :
              Filter.Tendsto (fun (n : ℕ) => (iterate_linear ll ra n).score s a) Filter.atTop (nhds (ll.reinforcement a))

              Convergence of linear learning (@cite{luce-1959}, §4.G):

              Under constant reinforcement, the ratio-scale values converge to the reinforcement values: v_n(s, a) → r(a) as n → ∞.

              This follows from the closed form: v_n = α^n · v_0 + (1 - α^n) · r, and α^n → 0 since 0 < α < 1.

              structure Core.TrialRecord (A : Type u_3) :
              Type u_3

              A record of a learning trial: what was available, what was chosen, what was the reinforcement. Used for analyzing sequences of choices over trials (@cite{luce-1959}, §4.C–G).

              • chosen : A

                The action chosen on this trial.

              • reinforcement : A → ℝ

                The reinforcement received (may differ from the learner's default).

              Instances For
                structure Core.ExpectedChoiceSequence (S : Type u_3) (A : Type u_4) [Fintype A] :
                Type (max u_3 u_4)

                An expected choice sequence: the initial agent, a learner, and a history of trials. This structure supports analyzing how choice probabilities evolve over a sequence of reinforced trials (@cite{luce-1959}, §4.C–G).

                Instances For

                  The agent at trial n, after applying n steps of learning.

                  Equations
                  Instances For
                    noncomputable def Core.ExpectedChoiceSequence.choiceProb {S : Type u_1} {A : Type u_2} [Fintype A] (ecs : ExpectedChoiceSequence S A) (n : ℕ) (s : S) (a : A) :
                    ℝ

                    The probability of choosing action a in state s at trial n.

                    Equations
                    Instances For
                      theorem Core.ExpectedChoiceSequence.choiceProb_sum_one {S : Type u_1} {A : Type u_2} [Fintype A] (ecs : ExpectedChoiceSequence S A) (n : ℕ) (s : S) (h : (ecs.agentAt n).totalScore s ≠ 0) :
                      ∑ a : A, ecs.choiceProb n s a = 1

                      Choice probabilities at any trial sum to 1 (when totalScore is nonzero), because each trial's agent is a RationalAction.

                      def Core.sgaUpdate (r_j η observed_j expected_j : ℝ) :
                      ℝ

                      Stochastic Gradient Ascent update for a single weight of a log-linear model.

                      For MaxEnt phonology, observed_j is the violation count of constraint j on the observed output, and expected_j is the expected violation count under the current model distribution.

                      The batch gradient is E_emp[c_j] − E_r̄[c_j] (see hasDerivAt_logConditional in RationalAction). The stochastic version replaces both expectations with single-sample estimates.

                      Equations
                      Instances For
                        def Core.glaUpdate (r_j η : ℝ) (c_j_observed c_j_hypothesis : ℝ) :
                        ℝ

                        The Gradual Learning Algorithm (@cite{boersma-1998}) update for a single weight: adjust by the signed difference between the observed violation count and the violation count of a hypothesis sampled from the current grammar.

                        Equations
                        • Core.glaUpdate r_j η c_j_observed c_j_hypothesis = r_j + η * (c_j_observed - c_j_hypothesis)
                        Instances For
                          theorem Core.gla_eq_sga (r_j η : ℝ) (obs hyp : ℝ) :
                          glaUpdate r_j η obs hyp = sgaUpdate r_j η obs hyp

                          GLA = SGA: the Gradual Learning Algorithm is Stochastic Gradient Ascent with single-sample estimates of both observed and expected feature values. The update rules are identical by definition.
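The definitional identity can be made concrete with a Python sketch (illustrative names, mirroring but not reproducing the Lean definitions): both rules add η times the difference between the observed and sampled violation counts.

```python
# GLA and single-sample SGA share the same update expression,
# mirroring the gla_eq_sga theorem.
def sga_update(r_j, eta, observed_j, expected_j):
    """SGA step with single-sample estimates of both expectations."""
    return r_j + eta * (observed_j - expected_j)

def gla_update(r_j, eta, c_j_observed, c_j_hypothesis):
    """GLA step: signed difference of observed vs. hypothesis violations."""
    return r_j + eta * (c_j_observed - c_j_hypothesis)
```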

                          theorem Core.sga_uses_correct_gradient {ι : Type u_3} [Fintype ι] [Nonempty ι] (s r : ι → ℝ) (y : ι) (wⱼ : ℝ) :
                          HasDerivAt (fun (w : ℝ) => w * s y + r y - logSumExpOffset s r w) (s y - ∑ i : ι, softmaxOffset s r wⱼ i * s i) wⱼ

                          SGA converges to the global maximum of a concave objective.

                          For MaxEnt log-likelihood, the per-weight objective is concave (logConditional_concaveOn in RationalAction), so SGA is guaranteed to converge. This is the key advantage of MaxEnt over Stochastic OT, where convergence of the GLA is not generally proved.

                          structure Core.RescorlaWagner (C : Type u_3) :
                          Type u_3

                          The Rescorla-Wagner learning model (@cite{rescorla-wagner-1972}).

                          Models how organisms learn cue-outcome associations. On each trial, all present cues have their associative strengths updated by:

                          ΔV_c = α_c · β · (λ − ΣV)
                          

                          where α_c is cue salience, β is the learning rate (outcome importance), λ is the maximum conditioning supported by the outcome, and ΣV is the sum of associative strengths of all cues present on the trial.

                          The prediction error (λ − ΣV) is the key innovation: it makes learning competitive across cues. When the outcome is already fully predicted (ΣV = λ), no further learning occurs for any co-present cue — this generates blocking (@cite{ellis-2006}, p. 177).

                          • salience : C → ℝ

                            Cue salience: how noticeable each cue is. Must be in [0, 1].

                          • learnRate : ℝ

                            Learning rate (outcome importance). Must be in (0, 1].

                          • maxCond : ℝ

                            Maximum conditioning supported by the outcome.

                          • salience_nonneg (c : C) : 0 ≤ self.salience c

                            Salience is non-negative.

                          • salience_le_one (c : C) : self.salience c ≤ 1

                            Salience is at most 1.

                          • learnRate_pos : 0 < self.learnRate

                            Learning rate is strictly positive.

                          • learnRate_le_one : self.learnRate ≤ 1

                            Learning rate is at most 1.

                          • maxCond_nonneg : 0 ≤ self.maxCond

                            Maximum conditioning is non-negative.

                          Instances For
                            def Core.RescorlaWagner.predictionError {C : Type u_3} (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) :
                            ℝ

                            Prediction error on a trial: λ − ΣV for cues present.

                            When positive, the outcome is under-predicted (surprise → learning). When zero, the outcome is fully predicted (no learning). When negative, the outcome is over-predicted (extinction).

                            Equations
                            Instances For
                              def Core.RescorlaWagner.update {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) (c : C) :
                              ℝ

                              One trial of Rescorla-Wagner learning (@cite{rescorla-wagner-1972}).

                              For each cue c:

                              • If c is present: V'(c) = V(c) + α_c · β · (λ − ΣV)
                              • If c is absent: V'(c) = V(c) (no change)
                              Equations
                              Instances For
                                theorem Core.RescorlaWagner.update_absent {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) (c : C) (hc : c ∉ present) :
                                rw.update present V c = V c

                                Absent cues are not updated.

                                theorem Core.RescorlaWagner.blocking {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) (B : C) (hB : B ∈ present) (h_total : ∑ c ∈ present, V c = rw.maxCond) :
                                rw.update present V B = V B

                                Blocking theorem (@cite{rescorla-wagner-1972}; @cite{ellis-2006}, p. 177):

                                When cue A already fully predicts the outcome (V(A) = λ) and is the only cue with nonzero strength among those present, adding a novel cue B to the compound produces zero learning for B: V'(B) = V(B).

                                This is the formal basis of @cite{ellis-2006}'s account of L2 fossilization: L1 cues (temporal adverbs, word order) that already predict temporal/number meanings block acquisition of L2 grammatical morphemes (tense inflection, plural -s) that encode the same meanings.
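A two-phase simulation makes the blocking effect concrete (an illustrative Python sketch with invented names, not the formalization): after cue A is pretrained to asymptote, the prediction error on compound trials is essentially zero, so the added cue C acquires virtually no strength.

```python
# Blocking sketch: pretrain A alone, then present the compound {A, C}.
def rw_update(V, present, salience, beta, lam):
    """One R-W trial: each present cue moves by salience * beta * (lam - sum V)."""
    total = sum(V[c] for c in present)
    error = lam - total
    return {c: V[c] + (salience[c] * beta * error if c in present else 0.0)
            for c in V}

salience = {"A": 0.5, "C": 0.5}
beta, lam = 0.3, 1.0
V = {"A": 0.0, "C": 0.0}
for _ in range(200):              # Phase 1: A alone -> V(A) approaches lam
    V = rw_update(V, {"A"}, salience, beta, lam)
for _ in range(50):               # Phase 2: compound AC -> error ~ 0
    V = rw_update(V, {"A", "C"}, salience, beta, lam)
# C learns almost nothing: A already predicts the outcome.
```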

                                theorem Core.RescorlaWagner.no_learning_at_equilibrium {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) (h_total : ∑ c ∈ present, V c = rw.maxCond) (c : C) :
                                rw.update present V c = V c

                                When the outcome is fully predicted, no present cue learns anything.

                                def Core.RescorlaWagner.iterate {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (trials : List (Finset C)) (V₀ : C → ℝ) :
                                C → ℝ

                                Iterated R-W learning over a sequence of trials. Each trial specifies which cues are present.

                                Equations
                                Instances For
                                  def Core.RescorlaWagner.iterateConst {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V₀ : C → ℝ) :
                                  ℕ → C → ℝ

                                  Constant-trial iteration: the same cue set is presented on every trial.

                                  Equations
                                  Instances For
                                    theorem Core.RescorlaWagner.totalStrength_recurrence {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V : C → ℝ) :
                                    ∑ c ∈ S, rw.update S V c = ∑ c ∈ S, V c + rw.learnRate * rw.predictionError S V * ∑ c ∈ S, rw.salience c

                                    Total strength recurrence: the sum of associative strengths across present cues follows a linear recurrence after each trial.

                                    ΣV' = ΣV + β · (Σ_{c∈S} α_c) · (λ − ΣV)
                                        = (1 − β·A) · ΣV + β·A·λ       where A = Σ_{c∈S} α_c
                                    

                                    This has the same form as the Luce α-model (§2) with retention rate 1 − β·A. The prediction error (λ − ΣV) is shared across all present cues, so the total strength update factors into a scalar recurrence.

                                    Proof: Finset.sum_congr eliminates the if (every c ∈ S takes the then branch), then Finset.sum_add_distrib + Finset.sum_mul.

                                    theorem Core.RescorlaWagner.totalStrength_convergence {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V₀ : C → ℝ) (h_pos : 0 < rw.learnRate * ∑ c ∈ S, rw.salience c) (h_stable : rw.learnRate * ∑ c ∈ S, rw.salience c < 1) :
                                    Filter.Tendsto (fun (n : ℕ) => ∑ c ∈ S, rw.iterateConst S V₀ n c) Filter.atTop (nhds rw.maxCond)

                                    Total strength convergence: under repeated identical trials with the same cue set S, the total associative strength converges to λ.

                                    Since the total strength follows a linear recurrence with retention rate r = 1 − β·A (see totalStrength_recurrence), convergence follows from the closed form T_n = r^n · T_0 + (1 − r^n) · λ and r^n → 0 (since 0 ≤ r < 1 from h_pos and h_stable). This is the same proof pattern as linear_convergence (§5) for the Luce α-model.
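The scalar recurrence can be run directly (an illustrative Python sketch, not the formalization): iterating T' = (1 − β·A)·T + β·A·λ drives the total strength to λ geometrically.

```python
# Scalar total-strength recurrence: T' = (1 - beta*A) * T + beta*A*lam,
# where A is the summed salience of the cues presented on every trial.
def rw_total_step(T, beta, A, lam):
    return (1 - beta * A) * T + beta * A * lam

beta, lam = 0.2, 1.0
A = 0.3 + 0.5                 # two cues with saliences 0.3 and 0.5
T = 0.0
for _ in range(100):
    T = rw_total_step(T, beta, A, lam)
# T approaches lam geometrically, with retention rate 1 - beta*A = 0.84.
```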

                                    theorem Core.RescorlaWagner.proportional_partition {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (n : ℕ) :
                                    ∃ (K : ℝ), ∀ c ∈ S, rw.iterateConst S (fun (x : C) => 0) n c = rw.salience c * K

                                    Proportional partition (@cite{rescorla-wagner-1972}):

                                    When all cues start with zero associative strength and the same cue set is presented on every trial, each cue's strength at every step is proportional to its salience: there exists K_n such that V_n(c) = α_c · K_n for all present cues c.

                                    The proof is by induction on n. The base case is trivial (K = 0). For the step, the prediction error (λ − ΣV) is shared across all present cues, so ΔV(c) = α_c · β · (λ − ΣV) preserves proportionality. The inductive hypothesis rewrites ΣV_n = K · Σα, giving K_{n+1} = K + β · (λ − K · Σα).

                                    At convergence (ΣV = λ), this gives V_∞(c) = (α_c / Σα) · λ: each cue's equilibrium strength is its share of the total salience, scaled by λ.
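A simulation from zero initial strengths illustrates both proportional partition and the overshadowing theorem below (an illustrative Python sketch, not the formalization): at every step the cues' strengths stand in the ratio of their saliences, so the more salient cue ends with the larger share of λ.

```python
# From V = 0 with a fixed cue set, each cue's strength stays proportional
# to its salience: V_n(c) = salience(c) * K_n.
def rw_step(V, salience, beta, lam):
    error = lam - sum(V.values())
    return {c: V[c] + salience[c] * beta * error for c in V}

salience = {"A": 0.6, "B": 0.2}
beta, lam = 0.25, 1.0
V = {c: 0.0 for c in salience}
for _ in range(500):
    V = rw_step(V, salience, beta, lam)
# Equilibrium shares: V(c) = (salience(c) / total salience) * lam,
# so the more salient cue A overshadows B at every step.
```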

                                    theorem Core.RescorlaWagner.overshadowing {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (A B : C) (n : ℕ) (hA : A ∈ S) (hB : B ∈ S) (_hne : A ≠ B) (h_salience : rw.salience B < rw.salience A) (h_pos : 0 < ∑ c ∈ S, rw.iterateConst S (fun (x : C) => 0) n c) :
                                    rw.iterateConst S (fun (x : C) => 0) n B < rw.iterateConst S (fun (x : C) => 0) n A

                                    Overshadowing (@cite{rescorla-wagner-1972}; @cite{ellis-2006}):

                                    When two cues are both present on every trial, the more salient cue captures more associative strength at every step (and hence at equilibrium). This follows directly from proportional_partition: V_n(c) = α_c · K, so V_n(A) = α_A · K > α_B · K = V_n(B) whenever K > 0.

                                    @cite{ellis-2006} (p. 177, 179–180) uses overshadowing to explain why low-salience grammatical morphemes (tense inflection, plural -s) are acquired late: high-salience cues (word order, lexical content) that co-occur with the same outcomes capture most of the associative strength.

                                    theorem Core.RescorlaWagner.blocked_cue_zero {C : Type u_3} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V : C → ℝ) (C' : C) (hC : C' ∈ S) (hV : V C' = 0) (h_total : ∑ c ∈ S, V c = rw.maxCond) :
                                    rw.update S V C' = 0

                                    ΔP–R-W correspondence (@cite{ellis-2006}; @cite{cheng-holyoak-1995}):

                                    For the blocking experiment (two phases: A→X then AC→X vs B→∅ then BD→X), the R-W equilibrium strengths predict the same pattern as ΔP:

                                    • Blocked cue C: R-W gives V(C) ≈ 0 at equilibrium; ΔP(C→X) = 0
                                    • Unblocked cue D: R-W gives V(D) > 0 at equilibrium; ΔP(D→X) = 1

                                    This theorem states the blocked direction: if cue A already captures all the associative strength and C starts at zero, then C remains at zero after compound AC→X trials. The ΔP = 0 counterpart is deltaPCounts with a = c (cue equally predicts outcome whether C is present or not).

                                    See Phenomena.Acquisition.Studies.Ellis2006 for the concrete Chapman & Robbins (1990) experiment verifying both measures.