@cite{jaeger-2007}: Maximum Entropy Models and Stochastic Optimality Theory
@cite{jaeger-2007} demonstrates that @cite{boersma-1998}'s Gradual Learning Algorithm (GLA) for Stochastic OT is mathematically identical to Stochastic Gradient Ascent (SGA) for Maximum Entropy models. This unifies two traditions:
- StOT (Boersma): adds Gaussian noise to constraint ranks, learns via the GLA (online, cognitively plausible)
- MaxEnt (@cite{goldwater-johnson-2003}): log-linear model over constraint violations, learns via batch gradient ascent or SGA
Key contributions formalized here
- GLA = SGA (§4): The GLA update rule is SGA with single-sample estimates. Both adjust each weight by η · (observed − expected). This is gla_eq_sga from Core.Agent.Learning.
- Correct gradient (§4, eq (2)): The per-weight gradient of MaxEnt log-likelihood is E_emp[cⱼ] − E_r̄[cⱼ], that is, observed minus expected feature value. This is hasDerivAt_logConditional from Core.Agent.RationalAction, instantiated as gradient in @cite{goldwater-johnson-2003}.
- Convergence guarantee (§4): SGA converges to the global maximum because log-likelihood is concave (logConditional_concaveOn). StOT's GLA has no such guarantee; this is the main formal advantage of MaxEnt.
- Ganging-up (§3): Both MaxEnt and StOT admit ganging-up effects (multiple weak constraints overriding a strong one), unlike classical OT. This is Ganging from OTLimit.lean.
- Dutch syllable acquisition (§5): Replication of Boersma & Levelt (2000) with MaxEnt+SGA produces the same acquisition order as the GLA, consistent with child language data.
The main theorem of @cite{jaeger-2007}: the Gradual Learning Algorithm is Stochastic Gradient Ascent by definition.
Both update weight j by η · (c_j(observed) − c_j(hypothesis)).
For MaxEnt, this is an unbiased estimate of the log-likelihood gradient
E_emp[c_j] − E_r̄[c_j] (see sga_uses_correct_gradient).
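The shared update rule can be illustrated with a minimal Python sketch. It is an illustration only, not part of the Lean formalization: the function name gla_sga_update and the toy feature vectors are hypothetical, and the sketch assumes that each feature value c_j is the negative violation count of constraint j, so that a constraint violated only by the learner's hypothesis is promoted.

```python
def gla_sga_update(weights, c_observed, c_hypothesis, eta):
    """Return the weights after one error-driven step.

    Each weight j moves by eta * (c_j(observed) - c_j(hypothesis)).
    """
    return [
        w + eta * (c_obs - c_hyp)
        for w, c_obs, c_hyp in zip(weights, c_observed, c_hypothesis)
    ]


# Hypothetical toy case with two constraints.
# Assumption: features are negative violation counts.
weights = [0.0, 0.0]
c_observed = [-1, 0]    # observed form violates constraint 0 once
c_hypothesis = [0, -1]  # learner's own output violates constraint 1 once

# Constraint 0 is demoted and constraint 1 is promoted.
updated = gla_sga_update(weights, c_observed, c_hypothesis, eta=0.1)

# When the hypothesis matches the observation, nothing changes.
unchanged = gla_sga_update(weights, c_observed, c_observed, eta=0.1)
```

Under this reading, a single step needs only one observed form and one sampled hypothesis, which is what makes the rule usable online.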
MaxEnt convergence guarantee: the per-weight log-likelihood is concave, so every local maximum is global and gradient-based learning converges to the global maximum.
@cite{jaeger-2007} §4: "The log-likelihood has the desirable property of being convex [sic — concave], which means that it does not have local maxima. Gradient Ascent is thus guaranteed to find the global maximum."
StOT's GLA lacks this guarantee — no proof of convergence exists for the general case (@cite{jaeger-2007} §2, fn. 1).
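A small numerical illustration of the MaxEnt side of this contrast, sketched in Python under stated assumptions. The candidate set, feature values, empirical distribution, step size, and function names are all hypothetical, and the sketch uses plain batch gradient ascent with the exact gradient E_emp[c_j] − E_r̄[c_j] rather than the stochastic variant, so it shows the concave objective at work and nothing about SGA step-size conditions.

```python
import math


def model_probs(weights, features):
    """Log-linear distribution: p(y) is proportional to exp(sum_j w_j * c_j(y))."""
    scores = [sum(w * c for w, c in zip(weights, f)) for f in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def log_likelihood(weights, features, empirical):
    probs = model_probs(weights, features)
    return sum(q * math.log(p) for q, p in zip(empirical, probs) if q > 0)


def gradient(weights, features, empirical):
    """Per-weight gradient: empirical minus model expectation of each feature."""
    probs = model_probs(weights, features)
    return [
        sum(q * f[j] for q, f in zip(empirical, features))
        - sum(p * f[j] for p, f in zip(probs, features))
        for j in range(len(weights))
    ]


# Hypothetical toy data: three candidates, two features
# (assumed to be negative violation counts).
features = [(0, -1), (-1, 0), (-1, -1)]
empirical = [0.7, 0.2, 0.1]

weights = [0.0, 0.0]
history = [log_likelihood(weights, features, empirical)]
for _ in range(2000):
    g = gradient(weights, features, empirical)
    weights = [w + 0.5 * gj for w, gj in zip(weights, g)]
    history.append(log_likelihood(weights, features, empirical))

final_gradient = gradient(weights, features, empirical)
final_probs = model_probs(weights, features)
```

In this toy case the log-likelihood never decreases along the run, the gradient vanishes at the end, and the fitted distribution matches the empirical one, as expected for a concave objective with a finite optimum.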
Both MaxEnt and StOT admit ganging-up: two weak constraints can jointly
override a strong one. Classical OT's strict ranking precludes this, and a weighted
grammar behaves the same way when its weights are exponentially separated
(exponential_separation_precludes_ganging).
@cite{jaeger-2007}: "Both StOT and ME diverge from classical OT in admitting ganging-up effects."
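The arithmetic behind ganging-up can be sketched in Python. The constraint names, weights, and candidates below are hypothetical and chosen only for illustration. The sketch compares harmonies (negative weighted violation sums) deterministically; in a MaxEnt grammar the candidate with the higher harmony is the more probable one rather than a categorical winner.

```python
def harmony(weights, violations):
    """Negative weighted sum of violation counts."""
    return -sum(weights[c] * violations.get(c, 0) for c in weights)


def winner(weights, candidates):
    """Name of the candidate with the highest harmony."""
    return max(candidates, key=lambda name: harmony(weights, candidates[name]))


# Two hypothetical candidates: "a" violates the strong constraint once,
# "b" violates each of the two weak constraints once.
candidates = {
    "a": {"strong": 1},
    "b": {"weak1": 1, "weak2": 1},
}

# Close weights: the weak constraints together outweigh the strong one (2 + 2 > 3).
additive = {"strong": 3, "weak1": 2, "weak2": 2}

# Exponentially separated weights: each exceeds the sum of all lower ones (4 > 2 + 1).
separated = {"strong": 4, "weak1": 2, "weak2": 1}

ganging_winner = winner(additive, candidates)   # "a": the weak pair gangs up against "b"
strict_winner = winner(separated, candidates)   # "b": the strong constraint alone decides
```

With the separated weights the outcome matches what a strict ranking strong ≫ weak1 ≫ weak2 would select.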
Dutch syllable types from @cite{jaeger-2007} Table 1 (Boersma & Levelt 2000, data from Joost van de Weijer).
- CV : DutchSyllable
- CVC : DutchSyllable
- VC : DutchSyllable
- V : DutchSyllable
- CVCC : DutchSyllable
- CCVC : DutchSyllable
- CCV : DutchSyllable
- VCC : DutchSyllable
- CCVCC : DutchSyllable
Frequency of Dutch syllable types in child-directed speech (%).
Equations
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.CV = 4481 / 100
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.CVC = 3205 / 100
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.VC = 1199 / 100
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.V = 385 / 100
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.CVCC = 325 / 100
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.CCVC = 198 / 100
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.CCV = 138 / 100
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.VCC = 42 / 100
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable.CCVCC = 26 / 100
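As a reading aid, the table above can be mirrored in Python with exact rationals. This is an illustrative sketch rather than part of the formalization; the dictionary name syllable_freq is hypothetical, while the values are copied from the equations above. Each value n / 100 is a percentage.

```python
from fractions import Fraction

# Percentages of Dutch syllable types in child-directed speech,
# copied from the equations above and kept as exact rationals.
syllable_freq = {
    "CV": Fraction(4481, 100),
    "CVC": Fraction(3205, 100),
    "VC": Fraction(1199, 100),
    "V": Fraction(385, 100),
    "CVCC": Fraction(325, 100),
    "CCVC": Fraction(198, 100),
    "CCV": Fraction(138, 100),
    "VCC": Fraction(42, 100),
    "CCVCC": Fraction(26, 100),
}

# Total percentage mass covered by the nine types.
total = sum(syllable_freq.values())

# Syllable types from most to least frequent.
by_frequency = sorted(syllable_freq, key=syllable_freq.get, reverse=True)
```

The nine values sum to 9999 / 100, that is 99.99%, and the frequency order coincides with the order in which the syllable types are listed.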
Constraints for Dutch syllable structure (§5).
- starCoda : SyllableConstraint
- onset : SyllableConstraint
- starComplexCoda : SyllableConstraint
- starComplexOnset : SyllableConstraint
- faith : SyllableConstraint
Violation count: how many times each constraint is violated by each syllable type (assuming faithful mapping, so FAITH = 0).
Equations
- One or more equations did not get rendered due to their size.
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.violations Phenomena.PhonologicalAlternation.Studies.Jaeger2007.SyllableConstraint.faith x✝ = 0
CV violates no markedness constraints.
CCVCC violates three markedness constraints (*Coda, *ComplexCoda, *ComplexOnset).
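Since most of the violation equations are not rendered on this page, the following Python sketch reconstructs a plausible violation function. It is an assumption, not the formalized definition: it treats *Coda as violated by any coda, *ComplexCoda by a coda of two or more consonants, *ComplexOnset by an onset of two or more consonants, and Onset by a missing onset. These choices are consistent with the two facts stated above (CV violates nothing, CCVCC violates three markedness constraints).

```python
MARKEDNESS = ("starCoda", "onset", "starComplexCoda", "starComplexOnset")
SYLLABLES = ("CV", "CVC", "VC", "V", "CVCC", "CCVC", "CCV", "VCC", "CCVCC")


def violations(constraint, syllable):
    """Assumed violation count of `constraint` for a faithfully mapped syllable."""
    onset, _, coda = syllable.partition("V")  # consonants before and after the vowel
    if constraint == "starCoda":
        return int(len(coda) >= 1)
    if constraint == "onset":
        return int(len(onset) == 0)
    if constraint == "starComplexCoda":
        return int(len(coda) >= 2)
    if constraint == "starComplexOnset":
        return int(len(onset) >= 2)
    if constraint == "faith":
        return 0  # faithful mapping, as stated above
    raise ValueError(f"unknown constraint: {constraint}")


def markedness_violations(syllable):
    """Total number of markedness violations of a syllable type."""
    return sum(violations(c, syllable) for c in MARKEDNESS)


cv_total = markedness_violations("CV")
ccvcc_total = markedness_violations("CCVCC")
```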
The converged constraint ranking from §5: FAITH ≫ *COMPLEXONSET ≫ *COMPLEXCODA ≫ ONSET ≫ *CODA
We represent this as learned weights (higher weight = higher priority). The exact values are from Jäger's simulation (Fig. 1).
Equations
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.convergedWeights Phenomena.PhonologicalAlternation.Studies.Jaeger2007.SyllableConstraint.faith = 13
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.convergedWeights Phenomena.PhonologicalAlternation.Studies.Jaeger2007.SyllableConstraint.starComplexOnset = 8
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.convergedWeights Phenomena.PhonologicalAlternation.Studies.Jaeger2007.SyllableConstraint.starComplexCoda = 7
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.convergedWeights Phenomena.PhonologicalAlternation.Studies.Jaeger2007.SyllableConstraint.onset = 5
- Phenomena.PhonologicalAlternation.Studies.Jaeger2007.convergedWeights Phenomena.PhonologicalAlternation.Studies.Jaeger2007.SyllableConstraint.starCoda = 0
FAITH outranks all markedness constraints at convergence.
The markedness constraints are ranked in the predicted order.
Simpler syllables (fewer violations) are acquired first because they have higher harmony. CV has harmony 0 (no violations), while CCVCC has the lowest harmony (3 violations).
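The harmony comparison can be worked through in Python using the converged weights listed above. The sketch is illustrative: harmony is taken to be the negative weighted sum of violations, and the violation profile is an assumption (any coda violates *Coda, a coda or onset of two or more consonants violates the corresponding *Complex constraint, a missing onset violates Onset, and FAITH is never violated), since the formalized violation equations are not rendered on this page.

```python
# Converged weights, copied from the equations above.
converged_weights = {
    "faith": 13,
    "starComplexOnset": 8,
    "starComplexCoda": 7,
    "onset": 5,
    "starCoda": 0,
}

SYLLABLES = ("CV", "CVC", "VC", "V", "CVCC", "CCVC", "CCV", "VCC", "CCVCC")


def profile(syllable):
    """Assumed violation counts for a faithfully mapped syllable type."""
    onset, _, coda = syllable.partition("V")
    return {
        "faith": 0,
        "starComplexOnset": int(len(onset) >= 2),
        "starComplexCoda": int(len(coda) >= 2),
        "onset": int(len(onset) == 0),
        "starCoda": int(len(coda) >= 1),
    }


def harmony(syllable):
    """Negative weighted sum of violations under the converged weights."""
    counts = profile(syllable)
    return -sum(converged_weights[c] * counts[c] for c in converged_weights)


# Constraints from highest to lowest weight.
ranking = sorted(converged_weights, key=converged_weights.get, reverse=True)

harmonies = {s: harmony(s) for s in SYLLABLES}
lowest = min(harmonies, key=harmonies.get)
```

Under these assumptions CV has harmony 0 and CCVCC has harmony −15, the lowest of the nine types. CVC also comes out at 0, because *CODA has weight 0 at convergence.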