
Linglib.Phenomena.PhonologicalAlternation.Studies.Jaeger2007

@cite{jaeger-2007}: Maximum Entropy Models and Stochastic Optimality Theory


@cite{jaeger-2007} demonstrates that @cite{boersma-1998}'s Gradual Learning Algorithm (GLA) for Stochastic OT is mathematically identical to Stochastic Gradient Ascent (SGA) for Maximum Entropy models, unifying the Stochastic OT and Maximum Entropy traditions.

Key contributions formalized here

  1. GLA = SGA (§4): The GLA update rule is SGA with single-sample estimates. Both adjust each weight by η · (observed − expected). This is gla_eq_sga from Core.Agent.Learning.

  2. Correct gradient (§4, eq (2)): The per-weight gradient of MaxEnt log-likelihood is E_emp[cⱼ] − E_r̄[cⱼ] — observed minus expected feature value. This is hasDerivAt_logConditional from Core.Agent.RationalAction, instantiated as gradient in @cite{goldwater-johnson-2003}.

  3. Convergence guarantee (§4): SGA converges to the global maximum because log-likelihood is concave (logConditional_concaveOn). StOT's GLA has no such guarantee — this is the main formal advantage of MaxEnt.

  4. Ganging-up (§3): Both MaxEnt and StOT admit ganging-up effects (multiple weak constraints overriding a strong one), unlike classical OT. This is Ganging from OTLimit.lean.

  5. Dutch syllable acquisition (§5): Replication of Boersma & Levelt (2000) with MaxEnt+SGA produces the same acquisition order as the GLA, consistent with child language data.
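Contributions 1 and 2 refer to the MaxEnt model of candidate choice. A minimal standalone sketch of that model (the library's own definitions live in Core.Agent.RationalAction; the names `harmony` and `maxentProb` here are illustrative, not the library's):

```lean
-- Weighted penalty (negative harmony) of a candidate: Σⱼ wⱼ · cⱼ,
-- where cⱼ is the candidate's violation count for constraint j.
def harmony (w c : List Float) : Float :=
  (List.zipWith (· * ·) w c).foldl (· + ·) 0.0

/-- MaxEnt probability of a candidate with violation profile `c` among
    `cands`: exp(harmony) normalized over all candidates. -/
def maxentProb (w : List Float) (cands : List (List Float)) (c : List Float) : Float :=
  Float.exp (harmony w c) /
    (cands.map (fun c' => Float.exp (harmony w c'))).foldl (· + ·) 0.0
```

The gradient in contribution 2 is the derivative of the log of this probability with respect to a single weight wⱼ, holding the others fixed.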

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.gla_is_sga (r_j η : ℝ) (obs hyp : ℝ) :
Core.glaUpdate r_j η obs hyp = Core.sgaUpdate r_j η obs hyp

The main theorem of @cite{jaeger-2007}: the Gradual Learning Algorithm is Stochastic Gradient Ascent by definition.

Both update weight j by η · (c_j(observed) − c_j(hypothesis)). For MaxEnt, this is an unbiased estimate of the log-likelihood gradient E_emp[c_j] − E_r̄[c_j] (see sga_uses_correct_gradient).
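The identity is visible at the level of the update rules themselves. A toy standalone sketch (the library's `Core.glaUpdate` and `Core.sgaUpdate` may carry different signatures; these versions just make the shared shape explicit):

```lean
-- Both rules move weight j by η · (obs − hyp): the GLA reads this as
-- promotion/demotion, SGA as a single-sample gradient step.
def glaUpdate (w η obs hyp : Float) : Float := w + η * (obs - hyp)
def sgaUpdate (w η obs hyp : Float) : Float := w + η * (obs - hyp)

-- The two definitions are equal by rfl, mirroring gla_is_sga.
example (w η obs hyp : Float) : glaUpdate w η obs hyp = sgaUpdate w η obs hyp := rfl
```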

theorem Phenomena.PhonologicalAlternation.Studies.Jaeger2007.maxent_convergence_guarantee {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι → ℝ) (y : ι) :
ConcaveOn ℝ Set.univ fun (wⱼ : ℝ) => wⱼ * s y + r y - Core.logSumExpOffset s r wⱼ

MaxEnt convergence guarantee: the per-weight log-likelihood is concave, so gradient-based learning converges to the unique global maximum.

@cite{jaeger-2007} §4: "The log-likelihood has the desirable property of being convex [sic — concave], which means that it does not have local maxima. Gradient Ascent is thus guaranteed to find the global maximum."

StOT's GLA lacks this guarantee — no proof of convergence exists for the general case (@cite{jaeger-2007} §2, fn. 1).
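A one-line calculus sketch of why the per-weight objective is concave, in the notation of the theorem statement above (s y is the feature value of the observed candidate, and the log-sum-exp term normalizes over all candidates y'):

```latex
\ell(w_j) \;=\; w_j\, s(y) + r(y) \;-\; \log \sum_{y'} \exp\bigl(w_j\, s(y') + r(y')\bigr),
\qquad
\ell''(w_j) \;=\; -\,\mathrm{Var}_{p_{w}}\bigl[s(Y)\bigr] \;\le\; 0.
```

The second derivative of the log-sum-exp term is the variance of the feature under the model distribution, so ℓ'' is a negated variance and hence never positive: ℓ is concave in each weight, which is the shape that logConditional_concaveOn certifies.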

Both MaxEnt and StOT admit ganging-up: two weak constraints can jointly override a strong one. Classical OT precludes this when weights are exponentially separated (exponential_separation_precludes_ganging).

@cite{jaeger-2007}: "Both StOT and ME diverge from classical OT in admitting ganging-up effects."
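A toy numerical check of the ganging-up effect, with hypothetical weights (not data from the paper). A candidate violating one strong constraint beats a candidate violating two weak ones whenever the weak weights sum past the strong one:

```lean
-- Weighted penalty of a candidate given per-constraint violation counts.
def penalty (w viol : List Float) : Float :=
  (List.zipWith (· * ·) w viol).foldl (· + ·) 0.0

-- Weights [3, 2, 2]: the two weak constraints (2 + 2 = 4) gang up on the
-- strong one (3), so the strong-violating candidate incurs less penalty.
#eval penalty [3, 2, 2] [1, 0, 0] < penalty [3, 2, 2] [0, 1, 1]

-- Exponentially separated weights [4, 2, 1] (4 > 2 + 1): the strong
-- violation always costs more, so ganging is blocked, mirroring
-- exponential_separation_precludes_ganging.
#eval penalty [4, 2, 1] [1, 0, 0] < penalty [4, 2, 1] [0, 1, 1]
```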

Dutch syllable types from @cite{jaeger-2007} Table 1 (Boersma & Levelt 2000, data from Joost van de Weijer).

Frequency of Dutch syllable types in child-directed speech (%).

Constraints for Dutch syllable structure (§5).

Violation count: how many times each constraint is violated by each syllable type (assuming faithful mapping, so FAITH = 0).