@cite{jaeger-2007}: Maximum Entropy Models and Stochastic Optimality Theory #

@cite{jaeger-2007}

@cite{jaeger-2007} demonstrates that @cite{boersma-1998}'s Gradual Learning Algorithm (GLA) for Stochastic OT is mathematically identical to Stochastic Gradient Ascent (SGA) for Maximum Entropy models. This unifies two traditions:

StOT (Boersma): adds Gaussian noise to constraint ranks, learns via the GLA (online, cognitively plausible)
MaxEnt (@cite{goldwater-johnson-2003}): log-linear model over constraint violations, learns via batch gradient ascent or SGA

Key contributions formalized here #

GLA = SGA (§4): The GLA update rule is SGA with single-sample estimates. Both adjust each weight by η · (observed − expected). This is gla_eq_sga from Core.Agent.Learning.
Correct gradient (§4, eq (2)): The per-weight gradient of MaxEnt log-likelihood is E_emp[cⱼ] − E_r̄[cⱼ] — observed minus expected feature value. This is hasDerivAt_logConditional from Core.Agent.RationalAction, instantiated as gradient in @cite{goldwater-johnson-2003}.
Convergence guarantee (§4): SGA converges to the global maximum because log-likelihood is concave (logConditional_concaveOn). StOT's GLA has no such guarantee — this is the main formal advantage of MaxEnt.
Ganging-up (§3): Both MaxEnt and StOT admit ganging-up effects (multiple weak constraints overriding a strong one), unlike classical OT. This is Ganging from OTLimit.lean.
Dutch syllable acquisition (§5): Replication of Boersma & Levelt (2000) with MaxEnt+SGA produces the same acquisition order as the GLA, consistent with child language data.

source

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.gla_is_sga (r_j η : ℝ) (obs hyp : ℕ) :

Core.glaUpdate r_j η obs hyp = Core.sgaUpdate r_j η ↑obs ↑hyp

The main theorem of @cite{jaeger-2007}: the Gradual Learning Algorithm is Stochastic Gradient Ascent by definition.

Both update weight j by η · (c_j(observed) − c_j(hypothesis)). For MaxEnt, this is an unbiased estimate of the log-likelihood gradient E_emp[c_j] − E_r̄[c_j] (see sga_uses_correct_gradient).

source

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.maxent_convergence_guarantee {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι → ℝ) (y : ι) :

ConcaveOn ℝ Set.univ fun (wⱼ : ℝ) => wⱼ * s y + r y - Core.logSumExpOffset s r wⱼ

MaxEnt convergence guarantee: the per-weight log-likelihood is concave, so gradient-based learning converges to the unique global maximum.

@cite{jaeger-2007} §4: "The log-likelihood has the desirable property of being convex [sic — concave], which means that it does not have local maxima. Gradient Ascent is thus guaranteed to find the global maximum."

StOT's GLA lacks this guarantee — no proof of convergence exists for the general case (@cite{jaeger-2007} §2, fn. 1).

source

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.ganging_possible_without_separation :

Theories.Phonology.HarmonicGrammar.Ganging 1 1 (3 / 2)

Both MaxEnt and StOT admit ganging-up: two weak constraints can jointly override a strong one. Classical OT precludes this when weights are exponentially separated (exponential_separation_precludes_ganging).

@cite{jaeger-2007}: "Both StOT and ME diverge from classical OT in admitting ganging-up effects."

source

inductive Phenomena.PhonologicalAlternation.Studies.Jaeger2007.DutchSyllable :

Type

Dutch syllable types from @cite{jaeger-2007} Table 1 (Boersma & Levelt 2000, data from Joost van de Weijer).

Instances For

source

instance Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instDecidableEqDutchSyllable :

DecidableEq DutchSyllable

Equations

Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instDecidableEqDutchSyllable x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

source

def Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instReprDutchSyllable.repr :

DutchSyllable → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

source

instance Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instReprDutchSyllable :

Repr DutchSyllable

Equations

Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instReprDutchSyllable = { reprPrec := Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instReprDutchSyllable.repr }

source

def Phenomena.PhonologicalAlternation.Studies.Jaeger2007.syllableFreq :

DutchSyllable → ℚ

Frequency of Dutch syllable types in child-directed speech (%).

Equations

Instances For

source

inductive Phenomena.PhonologicalAlternation.Studies.Jaeger2007.SyllableConstraint :

Type

Constraints for Dutch syllable structure (§5).

starCoda : SyllableConstraint
onset : SyllableConstraint
starComplexCoda : SyllableConstraint
starComplexOnset : SyllableConstraint
faith : SyllableConstraint

Instances For

source

instance Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instDecidableEqSyllableConstraint :

DecidableEq SyllableConstraint

Equations

Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instDecidableEqSyllableConstraint x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

source

instance Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instReprSyllableConstraint :

Repr SyllableConstraint

Equations

Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instReprSyllableConstraint = { reprPrec := Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instReprSyllableConstraint.repr }

source

def Phenomena.PhonologicalAlternation.Studies.Jaeger2007.instReprSyllableConstraint.repr :

SyllableConstraint → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

source

def Phenomena.PhonologicalAlternation.Studies.Jaeger2007.violations :

SyllableConstraint → DutchSyllable → ℕ

Violation count: how many times each constraint is violated by each syllable type (assuming faithful mapping, so FAITH = 0).

Equations

One or more equations did not get rendered due to their size.
Phenomena.PhonologicalAlternation.Studies.Jaeger2007.violations Phenomena.PhonologicalAlternation.Studies.Jaeger2007.SyllableConstraint.faith x✝ = 0

Instances For

source

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.cv_no_violations (c : SyllableConstraint) :

violations c DutchSyllable.CV = 0

CV violates no markedness constraints.

source

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.ccvcc_three_violations :

violations SyllableConstraint.starCoda DutchSyllable.CCVCC + violations SyllableConstraint.starComplexCoda DutchSyllable.CCVCC + violations SyllableConstraint.starComplexOnset DutchSyllable.CCVCC = 3

CCVCC violates three markedness constraints (*Coda, *ComplexCoda, *ComplexOnset).

source

def Phenomena.PhonologicalAlternation.Studies.Jaeger2007.convergedWeights :

SyllableConstraint → ℚ

The converged constraint ranking from §5: FAITH ≫ *COMPLEXONSET ≫ *COMPLEXCODA ≫ ONSET ≫ *CODA

We represent this as learned weights (higher weight = higher priority). The exact values are from Jäger's simulation (Fig. 1).

Equations

Instances For

source

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.faith_highest (c : SyllableConstraint) :

c ≠ SyllableConstraint.faith → convergedWeights c < convergedWeights SyllableConstraint.faith

FAITH outranks all markedness constraints at convergence.

source

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.markedness_ranking :

convergedWeights SyllableConstraint.starCoda < convergedWeights SyllableConstraint.onset ∧ convergedWeights SyllableConstraint.onset < convergedWeights SyllableConstraint.starComplexCoda ∧ convergedWeights SyllableConstraint.starComplexCoda < convergedWeights SyllableConstraint.starComplexOnset

The markedness constraints are ranked in the predicted order.

source

Simpler syllables (fewer violations) are acquired first because they have higher harmony. CV has harmony 0 (no violations), while CCVCC has the lowest harmony (3 violations).

Documentation

Linglib.Phenomena.PhonologicalAlternation.Studies.Jaeger2007

@cite{jaeger-2007}: Maximum Entropy Models and Stochastic Optimality Theory #

Key contributions formalized here #