Learning Models #
@cite{luce-1959} @cite{bush-mosteller-1955} @cite{rescorla-wagner-1972}
Three families of learning models:
§1–6. Luce/Bush-Mosteller (ratio-scale and probability-space) #
@cite{luce-1959} extends the choice axiom to dynamic settings where choice probabilities change over trials via learning and reinforcement. The central question: if the Luce choice rule (IIA) holds at trial n, does it hold at trial n+1 after a learning update?
The key result (§4.B) is that the Luce rule is closed under positive linear
transformations of the ratio scale values. Since the linear learning model
(§4.A) is exactly such a transformation, IIA is preserved across learning
trials. This is structurally guaranteed in our formalization: LinearLearner.update
produces a new RationalAction, and RationalAction.policy is the Luce choice rule.
We formalize:
- Alpha model (§4.C): `v_{n+1}(a) = α·v_n(a) + (1-α)·r(a)` — the linear learning rule
- Axiom 1 preservation (§4.B, Theorem 16): `update` yields a `RationalAction` — IIA by construction
- Beta model (§4.D): probability-space learning `P_{n+1}(a) = (1-β)·P_n(a) + β·δ(a, chosen)`
- Convergence (§4.G): under constant reinforcement, `v_n(a) → r(a)` as `n → ∞`
§7. Stochastic Gradient Ascent #
The Gradual Learning Algorithm (@cite{boersma-1998}) is SGA with single-sample estimates. It converges to the global maximum of a concave objective (the MaxEnt log-likelihood).
§8. Rescorla-Wagner (associative learning) #
@cite{rescorla-wagner-1972} models cue-outcome association via error-driven
learning: ΔV_c = α_c · β · (λ − ΣV). Unlike Luce, R-W has a prediction
error term (λ − ΣV) that generates blocking: when cue A already
fully predicts the outcome, a co-present cue C learns nothing.
@cite{ellis-2006} uses R-W to explain why L2 learners fail to acquire
low-salience grammatical morphemes — they are blocked by higher-salience
cues (word order, lexical content, temporal adverbs) that already predict
the same meanings.
A linear learner for the Luce ratio-scale model (@cite{luce-1959}, §4.B–C).
The learning rule is a positive linear operator on ratio-scale values:
v_{n+1}(a) = α · v_n(a) + (1-α) · r(a) where α ∈ (0,1) is the retention
rate and r : A → ℝ is the reinforcement function. The key property is that
this is a positive linear transformation — it maps non-negative scores to
non-negative scores — which is exactly what Luce's Axiom 1 preservation
(§4.B) requires.
- learnRate : ℝ
Retention rate: weight on prior score. Must be in (0,1).
- reinforcement : A → ℝ
Reinforcement: the asymptotic target score for each action.
Retention rate is strictly positive.
Retention rate is strictly less than 1.
Reinforcement values are non-negative.
The complement 1 - α, which serves as the weight on reinforcement.
Equations
- ll.complementRate = 1 - ll.learnRate
One step of linear learning (@cite{luce-1959}, §4.C, Alpha Model):
v_{n+1}(s, a) = α · v_n(s, a) + (1 - α) · r(a).
The result is a new RationalAction, so the Luce choice rule (IIA) holds at
trial n+1 by construction — this is the content of §4.B.
Updated scores are non-negative — a direct consequence of the positivity of α, (1-α), the original scores, and the reinforcement values. This is what makes the linear learning model a positive linear operator (@cite{luce-1959}, §4.B).
Axiom 1 preservation (@cite{luce-1959}, §4.B, Theorem 16):
If the Luce choice rule P(a) = v(a)/Σv(b) holds at trial n, then after a positive linear update v' = α·v + (1-α)·r, the updated rule P'(a) = v'(a)/Σv'(b) is again a Luce choice rule.
In our formalization this is true by construction: LinearLearner.update
produces a RationalAction, and RationalAction.policy is the Luce rule.
We state this as an explicit equational theorem for clarity.
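As a numerical sketch of this preservation property (illustrative Python, not the Lean formalization; the names `alpha_update` and `luce_policy` are ours for the §4.C update rule and the Luce choice rule):

```python
# Illustrative sketch: one alpha-model step, then the Luce rule.
# A positive linear update keeps scores non-negative, so the
# ratio-rule probabilities at trial n+1 are again a Luce rule.

def alpha_update(v, r, alpha):
    """One linear-learning step: v'(a) = alpha * v(a) + (1 - alpha) * r(a)."""
    return {a: alpha * v[a] + (1 - alpha) * r[a] for a in v}

def luce_policy(v):
    """Luce choice rule: P(a) = v(a) / sum over b of v(b)."""
    total = sum(v.values())
    return {a: v[a] / total for a in v}

v0 = {"a": 4.0, "b": 1.0}   # non-negative ratio-scale values at trial n
r = {"a": 1.0, "b": 1.0}    # reinforcement targets
v1 = alpha_update(v0, r, alpha=0.7)
p1 = luce_policy(v1)        # the updated rule is again a Luce rule
```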
The β-model (@cite{luce-1959}, §4.D; @cite{bush-mosteller-1955}):
An alternative learning model operating directly on probabilities rather than
ratio-scale values:
P_{n+1}(a) = (1-β) · P_n(a) + β · δ(a, chosen_n)
where δ(a, chosen_n) = 1 if a was chosen on trial n, 0 otherwise.
Unlike the linear model, this operates in probability space, so it does not
generally preserve the Luce ratio-scale structure.
- beta : ℝ
Learning rate in (0,1).
β is strictly positive.
β is strictly less than 1.
One step of β-model update (@cite{luce-1959}, §4.D):
P'(a) = (1-β) · P(a) + β · δ(a, chosen).
Takes a probability function and the chosen action, returns updated probabilities.
β-model update preserves non-negativity of probabilities.
β-model update preserves sum-to-one when the input sums to 1.
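Both preservation facts can be checked on a concrete example (illustrative Python sketch; `beta_update` is our name, not the Lean definition):

```python
# Beta-model step: move probability mass toward the chosen action.
# The result is a convex combination of two distributions, so
# non-negativity and sum-to-one are preserved.

def beta_update(P, chosen, beta):
    """P'(a) = (1 - beta) * P(a) + beta * [a == chosen]."""
    return {a: (1 - beta) * P[a] + (beta if a == chosen else 0.0) for a in P}

P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = beta_update(P0, chosen="a", beta=0.2)
```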
n-step iteration of linear learning (@cite{luce-1959}, §4.C):
Apply the linear update rule n times starting from an initial RationalAction.
At each step, Axiom 1 is preserved because update produces a RationalAction.
Equations
- Core.iterate_linear ll ra 0 = ra
- Core.iterate_linear ll ra n.succ = ll.update (Core.iterate_linear ll ra n)
After n iterations, scores are still non-negative. Follows by induction
since each update step preserves non-negativity.
Closed-form for iterated linear learning (@cite{luce-1959}, §4.C):
After n iterations with retention rate α and reinforcement r:
v_n(s, a) = α^n · v_0(s, a) + (1 - α^n) · r(a)
This is the geometric-series solution to the linear recurrence.
Convergence of linear learning (@cite{luce-1959}, §4.G):
Under constant reinforcement, the ratio-scale values converge to the
reinforcement values: v_n(s, a) → r(a) as n → ∞.
This follows from the closed form: v_n = α^n · v_0 + (1 - α^n) · r, and
α^n → 0 since 0 < α < 1.
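A scalar sketch (illustrative Python; `iterate_linear` here is our own simplification, not the Lean definition) confirms the closed form and the convergence claim:

```python
# Direct iteration of v <- alpha * v + (1 - alpha) * r versus the
# geometric-series closed form v_n = alpha^n * v_0 + (1 - alpha^n) * r.

def iterate_linear(v0, r, alpha, n):
    """Apply the linear update n times, starting from v0 (scalar sketch)."""
    v = v0
    for _ in range(n):
        v = alpha * v + (1 - alpha) * r
    return v

v0, r, alpha = 10.0, 2.0, 0.5
v5 = iterate_linear(v0, r, alpha, 5)
closed5 = alpha ** 5 * v0 + (1 - alpha ** 5) * r   # closed form at n = 5
```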
A record of a learning trial: what was available, what was chosen, what was the reinforcement. Used for analyzing sequences of choices over trials (@cite{luce-1959}, §4.C–G).
- chosen : A
The action chosen on this trial.
- reinforcement : A → ℝ
The reinforcement received (may differ from the learner's default).
An expected choice sequence: the initial agent, a learner, and a history of trials. This structure supports analyzing how choice probabilities evolve over a sequence of reinforced trials (@cite{luce-1959}, §4.C–G).
- learner : LinearLearner A
The learning model governing updates.
- initial : RationalAction S A
The initial rational agent (trial 0).
- history : List (TrialRecord A)
History of trials (choices and reinforcements).
The agent at trial n, after applying n steps of learning.
Equations
- ecs.agentAt n = Core.iterate_linear ecs.learner ecs.initial n
The probability of choosing action a in state s at trial n.
Equations
- ecs.choiceProb n s a = (ecs.agentAt n).policy s a
Choice probabilities at any trial sum to 1 (when totalScore is nonzero),
because each trial's agent is a RationalAction.
Stochastic Gradient Ascent update for a single weight of a log-linear model.
For MaxEnt phonology, observed_j is the violation count of constraint j on
the observed output, and expected_j is the expected violation count under the
current model distribution.
The batch gradient is E_emp[c_j] − E_r̄[c_j] (see hasDerivAt_logConditional
in RationalAction). The stochastic version replaces both expectations with
single-sample estimates.
Equations
- Core.sgaUpdate r_j η observed_j expected_j = r_j + η * (observed_j - expected_j)
The Gradual Learning Algorithm (@cite{boersma-1998}) update for a single weight: adjust by the signed difference between the observed violation count and the violation count of a hypothesis sampled from the current grammar.
Equations
- Core.glaUpdate r_j η c_j_observed c_j_hypothesis = r_j + η * (↑c_j_observed - ↑c_j_hypothesis)
GLA = SGA: the Gradual Learning Algorithm is Stochastic Gradient Ascent with single-sample estimates of both observed and expected feature values. The update rules are identical by definition.
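The identity is visible when the two rules are written side by side (illustrative Python; `sga_update` and `gla_update` echo the Lean definitions above):

```python
# SGA and GLA single-weight updates. GLA plugs integer violation
# counts into the same rule, so the two agree by definition.

def sga_update(r_j, eta, observed_j, expected_j):
    """SGA: move the weight along a single-sample gradient estimate."""
    return r_j + eta * (observed_j - expected_j)

def gla_update(r_j, eta, c_observed, c_hypothesis):
    """GLA: the same rule, with violation counts cast to floats."""
    return r_j + eta * (float(c_observed) - float(c_hypothesis))
```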
SGA converges to the global maximum of a concave objective.
For MaxEnt log-likelihood, the per-weight objective is concave
(logConditional_concaveOn in RationalAction), so SGA is guaranteed
to converge. This is the key advantage of MaxEnt over Stochastic OT,
where convergence of the GLA is not generally proved.
The Rescorla-Wagner learning model (@cite{rescorla-wagner-1972}).
Models how organisms learn cue-outcome associations. On each trial, all present cues have their associative strengths updated by:
ΔV_c = α_c · β · (λ − ΣV)
where α_c is cue salience, β is the learning rate (outcome importance),
λ is the maximum conditioning supported by the outcome, and ΣV is the
sum of associative strengths of all cues present on the trial.
The prediction error (λ − ΣV) is the key innovation: it makes learning
competitive across cues. When the outcome is already fully predicted
(ΣV = λ), no further learning occurs for any co-present cue — this
generates blocking (@cite{ellis-2006}, p. 177).
- salience : C → ℝ
Cue salience: how noticeable each cue is. Must be in [0, 1].
- learnRate : ℝ
Learning rate (outcome importance). Must be in (0, 1].
- maxCond : ℝ
Maximum conditioning supported by the outcome.
Salience is non-negative.
Salience is at most 1.
Learning rate is strictly positive.
Learning rate is at most 1.
Maximum conditioning is non-negative.
Prediction error on a trial: λ − ΣV for cues present.
When positive, the outcome is under-predicted (surprise → learning). When zero, the outcome is fully predicted (no learning). When negative, the outcome is over-predicted (extinction).
Equations
- rw.predictionError present V = rw.maxCond - ∑ c ∈ present, V c
One trial of Rescorla-Wagner learning (@cite{rescorla-wagner-1972}).
For each cue c:
- If c is present: `V'(c) = V(c) + α_c · β · (λ − ΣV)`
- If c is absent: `V'(c) = V(c)` (no change)
Absent cues are not updated.
Blocking theorem (@cite{rescorla-wagner-1972}; @cite{ellis-2006}, p. 177):
When cue A already fully predicts the outcome (V(A) = λ) and is the only
cue with nonzero strength among those present, adding a novel cue B to the
compound produces zero learning for B: V'(B) = V(B).
This is the formal basis of @cite{ellis-2006}'s account of L2 fossilization: L1 cues (temporal adverbs, word order) that already predict temporal/number meanings block acquisition of L2 grammatical morphemes (tense inflection, plural -s) that encode the same meanings.
When the outcome is fully predicted, no present cue learns anything.
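Blocking can be reproduced numerically (illustrative Python sketch; `rw_update` is our name for the trial update, not the Lean definition):

```python
# One Rescorla-Wagner trial. When cue A already fully predicts the
# outcome, the prediction error is zero, so a novel co-present cue C
# learns nothing: blocking.

def rw_update(V, present, salience, beta, lam):
    """Present cues move by salience[c] * beta * (lam - sum of present V)."""
    error = lam - sum(V[c] for c in present)   # prediction error
    return {c: V[c] + salience[c] * beta * error if c in present else V[c]
            for c in V}

salience = {"A": 0.9, "C": 0.5}
# A alone fully predicts the outcome (V(A) = lam); C is novel.
V = {"A": 1.0, "C": 0.0}
V_next = rw_update(V, present={"A", "C"}, salience=salience, beta=0.3, lam=1.0)
```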
Iterated R-W learning over a sequence of trials. Each trial specifies which cues are present.
Constant-trial iteration: the same cue set is presented on every trial.
Equations
- rw.iterateConst S V₀ 0 = V₀
- rw.iterateConst S V₀ n.succ = fun (c : C) => rw.update S (rw.iterateConst S V₀ n) c
Total strength recurrence: the sum of associative strengths across present cues follows a linear recurrence after each trial.
ΣV' = ΣV + β · (Σ_{c∈S} α_c) · (λ − ΣV)
= (1 − β·A) · ΣV + β·A·λ where A = Σ_{c∈S} α_c
This has the same form as the Luce α-model (§2) with retention rate 1 − β·A.
The prediction error (λ − ΣV) is shared across all present cues, so the
total strength update factors into a scalar recurrence.
Proof: Finset.sum_congr eliminates the if (every c ∈ S takes
the then branch), then Finset.sum_add_distrib + Finset.sum_mul.
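The scalar recurrence can be verified on a concrete trial (illustrative Python; the names here are ours):

```python
# Check: total strength after one trial equals
# (1 - beta*A) * total + beta*A*lam, where A is the summed salience
# of the present cues.

def rw_update(V, present, salience, beta, lam):
    """One R-W trial over all cues (present cues share the error term)."""
    error = lam - sum(V[c] for c in present)
    return {c: V[c] + salience[c] * beta * error if c in present else V[c]
            for c in V}

salience = {"A": 0.6, "B": 0.3}
present, beta, lam = {"A", "B"}, 0.2, 1.0
A_sum = sum(salience[c] for c in present)   # A = total present salience

V = {"A": 0.1, "B": 0.4}
total = sum(V.values())
total_next = sum(rw_update(V, present, salience, beta, lam).values())
```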
Total strength convergence: under repeated identical trials with the same cue set S, the total associative strength converges to λ.
Since the total strength follows a linear recurrence with retention rate
r = 1 − β·A (see totalStrength_recurrence), convergence follows from
the closed form T_n = r^n · T_0 + (1 − r^n) · λ and r^n → 0 (since
0 ≤ r < 1 from h_pos and h_stable). This is the same proof pattern
as linear_convergence (§5) for the Luce α-model.
Proportional partition (@cite{rescorla-wagner-1972}):
When all cues start with zero associative strength and the same cue set is presented on every trial, each cue's strength at every step is proportional to its salience: there exists K_n such that V_n(c) = α_c · K_n for all present cues c.
The proof is by induction on n. The base case is trivial (K = 0). For the
step, the prediction error (λ − ΣV) is shared across all present cues,
so ΔV(c) = α_c · β · (λ − ΣV) preserves proportionality. The inductive
hypothesis rewrites ΣV_n = K · Σα, giving K_{n+1} = K + β · (λ − K · Σα).
At convergence (ΣV = λ), this gives V_∞(c) = (α_c / Σα) · λ: each cue's equilibrium strength is its share of the total salience, scaled by λ.
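Proportional partition and the equilibrium shares can be observed numerically (illustrative Python sketch; names are ours, not the Lean definitions):

```python
# Starting from zero strengths, each cue's strength stays proportional
# to its salience, and at equilibrium V(c) = (salience[c] / total) * lam.

def rw_update(V, present, salience, beta, lam):
    """One R-W trial: present cues share the prediction error."""
    error = lam - sum(V[c] for c in present)
    return {c: V[c] + salience[c] * beta * error if c in present else V[c]
            for c in V}

salience = {"A": 0.8, "B": 0.2}
present, beta, lam = {"A", "B"}, 0.25, 1.0
V = {c: 0.0 for c in salience}              # all cues start at zero
for _ in range(200):
    V = rw_update(V, present, salience, beta, lam)
K = V["A"] / salience["A"]                  # shared factor: V(c) = salience[c] * K
```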
Overshadowing (@cite{rescorla-wagner-1972}; @cite{ellis-2006}):
When two cues are both present on every trial, the more salient cue captures
more associative strength at every step (and hence at equilibrium). This
follows directly from proportional_partition: V_n(c) = α_c · K, so
V_n(A) = α_A · K > α_B · K = V_n(B) whenever K > 0.
@cite{ellis-2006} (p. 177, 179–180) uses overshadowing to explain why low-salience grammatical morphemes (tense inflection, plural -s) are acquired late: high-salience cues (word order, lexical content) that co-occur with the same outcomes capture most of the associative strength.
ΔP–R-W correspondence (@cite{ellis-2006}; @cite{cheng-holyoak-1995}):
For the blocking experiment (two phases: A→X then AC→X vs B→∅ then BD→X), the R-W equilibrium strengths predict the same pattern as ΔP:
- Blocked cue C: R-W gives V(C) ≈ 0 at equilibrium; ΔP(C→X) = 0
- Unblocked cue D: R-W gives V(D) > 0 at equilibrium; ΔP(D→X) = 1
This theorem states the blocked direction: if cue A already captures all
the associative strength and C starts at zero, then C remains at zero
after compound AC→X trials. The ΔP = 0 counterpart is deltaPCounts
with a = c (cue equally predicts outcome whether C is present or not).
See Phenomena.Acquisition.Studies.Ellis2006 for the concrete
Chapman & Robbins (1990) experiment verifying both measures.
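The ΔP side of the correspondence is a simple contingency computation (illustrative Python; `delta_p` is our helper, echoing the Lean `deltaPCounts`):

```python
# Delta-P from a 2x2 contingency table:
#   a = cue present, outcome occurs     b = cue present, no outcome
#   c = cue absent, outcome occurs      d = cue absent, no outcome
# Delta-P = P(outcome | cue) - P(outcome | no cue).

def delta_p(a, b, c, d):
    """Contingency-based predictiveness of a cue for an outcome."""
    return a / (a + b) - c / (c + d)

blocked = delta_p(10, 0, 10, 0)     # outcome equally likely with or without C
unblocked = delta_p(10, 0, 0, 10)   # outcome occurs only when D is present
```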