Linglib.Phenomena.WordOrder.Gradience

Gradient Word-Order Measures

@cite{levshina-stoynova-2023}

Levshina, Namboodiripad, Allassonnière-Tang, Kramer, Talamo, Verkerk, Wilmoth, Garrido Rodriguez, Gupton, Kidd, Liu, Naccarato, Nordlinger, et al. (@cite{levshina-stoynova-2023}), "Why we need a gradient approach to word order" (Linguistics 61(4):825–883), argue that word-order typology should use continuous measures (proportions, Shannon entropy, mutual information) rather than categorical labels (SVO, SOV, "rigid", "flexible").

Key Claims

  1. SO proportion is continuous across languages (Figure 1)
  2. Head-finality is continuous (Figure 2: 123 SUD corpora)
  3. Case marking MI correlates with SO entropy (Figure 3: ~30 languages)
  4. Register affects word-order proportions (Figure 7: Russian VO varies by register)
  5. Flexibility scores form a continuum (Figure 8: Avar–English)

Data Sources

Proportion of harmonic languages × 1000 (integer permille).

Proportion of head-initial languages × 1000 (hihi + hihf cells).

Harmonic proportion decreases with construction complexity: adposition > subordinator > relative clause.

Levshina et al.'s point: not all constructions are equally categorical. Even the "best" universal (VO ↔ preposition) is only 94.3% harmonic.

Gradient and categorical measures agree: harmonicProportion1000 > 500 ↔ harmonicDominant. The gradient measure refines the binary one rather than contradicting it.
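The relationship between the permille measure and the binary label can be sketched in Lean as follows. This is a minimal illustration, not the module's actual definitions: the `CrossTab` structure, the `hfhi`/`hfhf` cell names, the definition of `harmonicDominant` as "harmonic cells outnumber disharmonic cells", and all counts are assumptions made for the example.

```lean
namespace HarmonySketch

/-- Hypothetical 2×2 cross-tabulation of two head-direction parameters.
    `hihi` and `hihf` follow the cell labels mentioned in the text;
    `hfhi` and `hfhf` are assumed by analogy. -/
structure CrossTab where
  hihi : Nat  -- head-initial × head-initial (harmonic)
  hihf : Nat  -- head-initial × head-final (disharmonic)
  hfhi : Nat  -- head-final × head-initial (disharmonic)
  hfhf : Nat  -- head-final × head-final (harmonic)

def CrossTab.total (t : CrossTab) : Nat :=
  t.hihi + t.hihf + t.hfhi + t.hfhf

/-- Harmonic share in permille, rounded down by `Nat` division. -/
def CrossTab.harmonicProportion1000 (t : CrossTab) : Nat :=
  (t.hihi + t.hfhf) * 1000 / t.total

/-- Head-initial share in permille (hihi + hihf cells). -/
def CrossTab.headInitialProportion1000 (t : CrossTab) : Nat :=
  (t.hihi + t.hihf) * 1000 / t.total

/-- Assumed categorical measure: harmonic cells outnumber disharmonic ones. -/
def CrossTab.harmonicDominant (t : CrossTab) : Bool :=
  decide (t.hihi + t.hfhf > t.hihf + t.hfhi)

/-- Made-up counts for illustration, not WALS data. -/
def demo : CrossTab := { hihi := 400, hihf := 20, hfhi := 10, hfhf := 370 }

example : demo.harmonicProportion1000 = 962 := by decide
example : demo.headInitialProportion1000 = 525 := by decide
example : demo.harmonicDominant = true := by decide

end HarmonySketch
```

Under these assumed definitions, a permille value above 500 always implies the binary label, but because `Nat` division rounds down, the converse can fail for tables that are very close to an even split. It holds for a given table only if checked on that table's counts.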

Per-language gradient word-order data from the Levshina et al. @cite{levshina-stoynova-2023} OSF datasets. SO proportion comes from Dataset1.txt; entropy and case MI come from Dataset3.txt. All values are × 1000, rounded to the nearest integer.

• name : String
• isoCode : String
• soProportion1000 :

  Proportion of SO (subject before object) orders × 1000 (from Dataset1.txt)

• soEntropy1000 :

  Shannon entropy of S-O order × 1000 (0 = deterministic, 1000 = maximal; Dataset3.txt)

• caseMI1000 :

  Mutual information between case markers and grammatical role × 1000 (Dataset3.txt)
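A record with these fields might look like the sketch below. The field names come from the list above; the `Nat` types are an assumption (the rendered page does not show them), and the `placeholder` entry uses invented values, not a row from the OSF data.

```lean
namespace ProfileSketch

/-- Field names are taken from the documentation; the `Nat` types are
    assumed, since the rendered page does not show them. -/
structure GradientProfile where
  name : String
  isoCode : String
  soProportion1000 : Nat  -- SO proportion × 1000
  soEntropy1000 : Nat     -- Shannon entropy of S-O order × 1000
  caseMI1000 : Nat        -- case/role mutual information × 1000
  deriving Repr

/-- Placeholder entry with invented values, not a row from the datasets. -/
def placeholder : GradientProfile :=
  { name := "Placeholder", isoCode := "xxx",
    soProportion1000 := 900, soEntropy1000 := 469, caseMI1000 := 250 }

/-- Convert a permille field back to a fraction for display. -/
def permilleToFloat (n : Nat) : Float := n.toFloat / 1000.0

#eval permilleToFloat placeholder.soProportion1000

-- Integer fields keep simple bounds checkable by `decide`.
example : placeholder.soProportion1000 ≤ 1000 := by decide

end ProfileSketch
```

Storing values as integers × 1000 trades a little precision for decidable comparisons: statements about thresholds can be checked by evaluation rather than by reasoning about real or floating-point numbers.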

All 30 gradient word-order profiles from OSF Dataset1.txt + Dataset3.txt.

Languages with near-deterministic SO order (proportion > 960) have low SO entropy (< 300). Thirteen languages meet the condition: Bulgarian, Danish, Dutch, English, French, Indonesian, Italian, Korean, Portuguese, Romanian, Spanish, Swedish, Vietnamese.
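The link between the two thresholds can be illustrated with the binary Shannon entropy function. The sketch below assumes that SO entropy is the entropy, in bits, of a two-outcome distribution (SO vs. OS); the dataset's entropy values may come from different counts than the proportions, so this is an illustration of the arithmetic rather than a derivation of the theorem. `binaryEntropy` is a hypothetical helper.

```lean
namespace EntropySketch

/-- Binary Shannon entropy in bits; `p` is the SO proportion as a fraction.
    The degenerate cases p ≤ 0 and p ≥ 1 are mapped to 0. -/
def binaryEntropy (p : Float) : Float :=
  if p ≤ 0.0 ∨ 1.0 ≤ p then 0.0
  else -(p * Float.log2 p + (1.0 - p) * Float.log2 (1.0 - p))

#eval binaryEntropy 0.96  -- at the 960 permille threshold
#eval binaryEntropy 0.5   -- maximal uncertainty

end EntropySketch
```

At p = 0.96 the entropy is roughly 0.242 bits, and it falls further as p approaches 1, so under this assumption any proportion above 960 permille corresponds to an entropy comfortably below the 300 bound.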

Among European languages with case morphology, high SO entropy (> 700) implies high case MI (> 400). Czech, Hungarian, Latvian, and Lithuanian (Hungarian is Uralic; the other three are Indo-European) all use case marking to compensate for word-order freedom.

Tamil is a counterexample to the simple "flexibility requires case marking" story: high SO entropy (824) but low case MI (59). Tamil uses verb agreement and animacy rather than case morphology for role disambiguation.

This is where the gradient approach is especially valuable: it reveals that the case–flexibility correlation is a tendency with principled exceptions, not a law.
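The Tamil counterexample can be stated as a small decidable check. The sketch uses only the two Tamil values quoted above and the thresholds from the preceding generalization (entropy > 700, case MI > 400); the `EntropyCase` structure and `satisfiesNaiveRule` are hypothetical names introduced for the example.

```lean
namespace TamilSketch

/-- Pared-down profile holding only the two values quoted in the text. -/
structure EntropyCase where
  name : String
  soEntropy1000 : Nat
  caseMI1000 : Nat

def tamil : EntropyCase :=
  { name := "Tamil", soEntropy1000 := 824, caseMI1000 := 59 }

/-- The naive rule: flexible order (entropy > 700) requires informative
    case marking (MI > 400). Written as "not flexible, or high MI". -/
def satisfiesNaiveRule (p : EntropyCase) : Bool :=
  decide (p.soEntropy1000 ≤ 700) || decide (p.caseMI1000 > 400)

-- Tamil is flexible (824 > 700) but has low case MI (59), so the rule fails.
example : satisfiesNaiveRule tamil = false := by decide

end TamilSketch
```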

Case MI correlates with SO entropy: languages with caseMI > 300 have higher mean SO entropy than languages with caseMI ≤ 300.
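A comparison of group means over integer data can be phrased without division by cross-multiplying sums and group sizes. The sketch below shows that idea on invented (caseMI1000, soEntropy1000) pairs; the values and helper names are assumptions, and the module's own formulation may differ.

```lean
namespace MeanSketch

/-- (caseMI1000, soEntropy1000) pairs. Invented values for illustration only. -/
def sample : List (Nat × Nat) :=
  [(450, 800), (350, 700), (100, 200), (50, 300), (200, 250)]

/-- Sum of the SO-entropy components. -/
def entropySum (xs : List (Nat × Nat)) : Nat :=
  xs.foldl (fun acc x => acc + x.2) 0

def highMI : List (Nat × Nat) := sample.filter (fun x => decide (x.1 > 300))
def lowMI : List (Nat × Nat) := sample.filter (fun x => decide (x.1 ≤ 300))

/-- mean(high) > mean(low), stated as sumHigh * nLow > sumLow * nHigh,
    so no division (and no rounding) is involved. -/
example :
    entropySum highMI * lowMI.length > entropySum lowMI * highMI.length := by
  decide

end MeanSketch
```

The cross-multiplied form is equivalent to comparing means only when both groups are non-empty, so a real statement would also need to establish that.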

SO proportion spans a wide range, from Lithuanian (608) to Indonesian (999). This 391-point spread (39.1 percentage points) cannot be captured by any simple binary classification.
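The arithmetic behind the spread, and the reason a binary label hides it, can be checked directly. The sketch uses only the two values quoted above; the constant names are hypothetical.

```lean
namespace SpreadSketch

/-- The two extremes quoted in the text, in permille. -/
def lithuanianSO : Nat := 608
def indonesianSO : Nat := 999

example : indonesianSO - lithuanianSO = 391 := by decide

-- Both values lie on the SO-majority side of 500, so a binary
-- "SO-dominant" label would assign them to the same class.
example : lithuanianSO > 500 ∧ indonesianSO > 500 := by decide

end SpreadSketch
```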

For all three WALS tables, harmonicProportion1000 > 500 → harmonicDominant = true.

Shared languages between our gradient profiles and the 54-language Hahn et al. dataset.

Languages with high SO entropy (> 600) in Levshina et al. that also appear in Hahn et al. all have high branching-direction entropy (> 250) in Hahn et al.

Two independent measures of word-order freedom (SO entropy from corpus counts, branching-direction entropy from dependency trees) converge.
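A cross-dataset check of this kind amounts to joining two tables on a language code and testing an implication on the shared rows. The sketch below assumes ISO codes are the join key and uses invented rows with placeholder codes; the structure names, field names, and values are all hypothetical and do not reproduce either dataset.

```lean
namespace JoinSketch

structure EntropyRow where
  isoCode : String
  soEntropy1000 : Nat

structure BranchRow where
  isoCode : String
  branchEntropy1000 : Nat

/-- Invented rows with placeholder codes, not values from either dataset. -/
def levshinaRows : List EntropyRow :=
  [⟨"aaa", 820⟩, ⟨"bbb", 150⟩, ⟨"ccc", 700⟩]

def hahnRows : List BranchRow :=
  [⟨"aaa", 400⟩, ⟨"bbb", 100⟩, ⟨"ddd", 500⟩]

/-- Pair up rows that share a code; unmatched rows are dropped. -/
def shared : List (EntropyRow × BranchRow) :=
  levshinaRows.filterMap fun l =>
    (hahnRows.find? fun h => h.isoCode == l.isoCode).map fun h => (l, h)

/-- Every shared language with SO entropy > 600 has branching entropy > 250. -/
def converges : Bool :=
  shared.all fun (l, h) =>
    decide (l.soEntropy1000 ≤ 600) || decide (h.branchEntropy1000 > 250)

#eval shared.length  -- 2 ("aaa" and "bbb" appear in both lists)
#eval converges      -- true

end JoinSketch
```

Note that the implication is only tested on the intersection: a flexible language missing from the second table (here the placeholder "ccc") neither supports nor undermines the claim.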

Shared languages between gradient profiles and Futrell et al.'s DLM dataset.

Languages with high propHeadFinal (> 700) in Futrell et al. have high soProportion (> 700) in Levshina et al.: head-final ≈ SOV ≈ high SO proportion.

Languages with soEntropy > 600 in Levshina et al. include at least one of the Hahn et al. exception languages. Latvian appears in both: it has high SO entropy in Levshina et al. and is a memory-surprisal optimization exception in Hahn et al. This is consistent with flexible word order weakening memory-surprisal optimization.

Russian VO probability by register, from OSF Dataset6.txt (100 clauses per register). Demonstrates within-language variation that a categorical label obscures.

Russian conversation has lower VO probability than fiction: spoken language permits more OV orders.

The register variation is large: fiction - conversation > 400 permille (the observed spread is 44 percentage points). A single categorical label cannot capture this within-language variation.
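The two register statements can be sketched as decidable facts over a small enumeration. Only the two registers named above are included (the dataset may contain more), and the permille values are placeholders chosen so that the difference matches the quoted 44 percentage points; they are not the Dataset6.txt figures. `Register` and `voProbability1000` are hypothetical names.

```lean
namespace RegisterSketch

/-- Only the two registers named in the text; the dataset may contain more. -/
inductive Register where
  | conversation
  | fiction
  deriving DecidableEq, Repr

/-- VO probability in permille. Placeholder values chosen only so that the
    spread equals 440; they are NOT the Dataset6.txt figures. -/
def voProbability1000 : Register → Nat
  | .conversation => 450
  | .fiction => 890

-- Conversation has lower VO probability than fiction.
example : voProbability1000 .conversation < voProbability1000 .fiction := by
  decide

-- A 44 percentage-point spread is 440 permille, which exceeds the 400 bound.
example : voProbability1000 .fiction - voProbability1000 .conversation = 440 := by
  decide
example : voProbability1000 .fiction - voProbability1000 .conversation > 400 := by
  decide

end RegisterSketch
```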