Gradient Word-Order Measures #
@cite{levshina-stoynova-2023}
Levshina, Namboodiripad, Allassonnière-Tang, Kramer, Talamo, Verkerk, Wilmoth, Garrido Rodriguez, Gupton, Kidd, Liu, Naccarato, Nordlinger, et al., "Why we need a gradient approach to word order" (Linguistics 61(4): 825–883), argue that word-order typology should use continuous measures (proportions, Shannon entropy, mutual information) rather than categorical labels (SVO, SOV, "rigid", "flexible").
Key Claims #
- SO proportion is continuous across languages (Figure 1)
- Head-finality is continuous (Figure 2: 123 SUD corpora)
- Case marking MI correlates with SO entropy (Figure 3: ~30 languages)
- Register affects word-order proportions (Figure 7: Russian VO varies by register)
- Flexibility scores form a continuum (Figure 8: Avar–English)
Data Sources #
- OSF Dataset1.txt: per-language SO proportion (31 languages)
- OSF Dataset3.txt: per-language SO entropy, case MI (30 languages)
- OSF Dataset6.txt: Russian register variation (conversation/fiction/news)
- https://osf.io/w9u6v/
Proportion of harmonic languages × 1000 (integer permille).
Equations
Instances For
Proportion of disharmonic languages × 1000.
Equations
Instances For
Proportion of head-initial languages × 1000 (hihi + hihf cells).
Equations
- Phenomena.WordOrder.Gradience.hiProportion1000 t = (t.hihi.count + t.hihf.count) * 1000 / t.totalCount
Instances For
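The harmonic and disharmonic counterparts of `hiProportion1000` were not rendered above. A minimal reconstruction, assuming a 2×2 table whose cells carry counts — the cell names (`hihi`, `hihf`, `hfhi`, `hfhf`), `count`, and `totalCount` follow the rendered `hiProportion1000` equation, but the rest is a sketch, not the module's literal code:

```lean
-- Hypothetical reconstruction; only the names appearing in the rendered
-- hiProportion1000 equation are taken from the module, the rest is assumed.
structure HarmonyCell where
  count : Nat

structure HarmonyTable where
  hihi : HarmonyCell  -- head-initial × head-initial (harmonic)
  hihf : HarmonyCell  -- head-initial × head-final (disharmonic)
  hfhi : HarmonyCell  -- head-final × head-initial (disharmonic)
  hfhf : HarmonyCell  -- head-final × head-final (harmonic)

def HarmonyTable.totalCount (t : HarmonyTable) : Nat :=
  t.hihi.count + t.hihf.count + t.hfhi.count + t.hfhf.count

/-- Proportion of harmonic languages × 1000 (integer permille). -/
def harmonicProportion1000 (t : HarmonyTable) : Nat :=
  (t.hihi.count + t.hfhf.count) * 1000 / t.totalCount

/-- Proportion of disharmonic languages × 1000. -/
def disharmonicProportion1000 (t : HarmonyTable) : Nat :=
  (t.hihf.count + t.hfhi.count) * 1000 / t.totalCount
```

By construction `harmonicProportion1000 t + disharmonicProportion1000 t` is within rounding of 1000, matching the permille convention used throughout.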
Table 1: 94.3% harmonic (VO × Adposition).
Table 2: 86.1% harmonic (VO × Subordinator).
Table 3: 82.2% harmonic (VO × Relative clause).
Harmonic proportion decreases with construction complexity: adposition > subordinator > relative clause.
Levshina et al.'s point: not all constructions are equally categorical. Even the "best" universal (VO ↔ preposition) is only 94.3% harmonic.
Gradient and categorical measures agree: harmonicProportion1000 > 500 ↔ harmonicDominant. The gradient measure refines the binary one rather than contradicting it.
Per-language gradient word-order data from the Levshina et al. @cite{levshina-stoynova-2023} OSF datasets. SO proportion from Dataset1.txt; entropy and case MI from Dataset3.txt. All values × 1000, rounded to the nearest integer.
- name : String
- isoCode : String
- soProportion1000 : ℕ
Proportion of SO (subject before object) orders × 1000 (from Dataset1.txt)
- soEntropy1000 : ℕ
Shannon entropy of S-O order × 1000 (0 = deterministic, 1000 = maximal; Dataset3.txt)
- caseMI1000 : ℕ
Mutual information between case markers and grammatical role × 1000 (Dataset3.txt)
Instances For
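Read as Lean, the field list above corresponds to a record along these lines — a sketch; the `deriving` clause is an assumption inferred from the `BEq` instance shown below:

```lean
-- Sketch of the profile record from the field list above; deriving assumed.
structure GradientWOProfile where
  name : String
  isoCode : String
  /-- Proportion of SO (subject before object) orders × 1000 (Dataset1.txt). -/
  soProportion1000 : Nat
  /-- Shannon entropy of S–O order × 1000 (0 = deterministic, 1000 = maximal; Dataset3.txt). -/
  soEntropy1000 : Nat
  /-- Mutual information between case markers and grammatical role × 1000 (Dataset3.txt). -/
  caseMI1000 : Nat
  deriving Repr, BEq
```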
Equations
- One or more equations did not get rendered due to their size.
Instances For
Instances For
Equations
- Phenomena.WordOrder.Gradience.instBEqGradientWOProfile.beq x✝¹ x✝ = false
Instances For
Equations
- Phenomena.WordOrder.Gradience.arabic = { name := "Arabic", isoCode := "ar", soProportion1000 := 933, soEntropy1000 := 345, caseMI1000 := 36 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.bulgarian = { name := "Bulgarian", isoCode := "bg", soProportion1000 := 965, soEntropy1000 := 218, caseMI1000 := 28 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.croatian = { name := "Croatian", isoCode := "hr", soProportion1000 := 856, soEntropy1000 := 586, caseMI1000 := 415 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.czech = { name := "Czech", isoCode := "cs", soProportion1000 := 781, soEntropy1000 := 760, caseMI1000 := 525 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.danish = { name := "Danish", isoCode := "da", soProportion1000 := 989, soEntropy1000 := 74, caseMI1000 := 0 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.dutch = { name := "Dutch", isoCode := "nl", soProportion1000 := 970, soEntropy1000 := 183, caseMI1000 := 0 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.english = { name := "English", isoCode := "en", soProportion1000 := 994, soEntropy1000 := 47, caseMI1000 := 0 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.estonian = { name := "Estonian", isoCode := "et", soProportion1000 := 842, soEntropy1000 := 634, caseMI1000 := 692 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.finnish = { name := "Finnish", isoCode := "fi", soProportion1000 := 912, soEntropy1000 := 426, caseMI1000 := 314 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.french = { name := "French", isoCode := "fr", soProportion1000 := 995, soEntropy1000 := 42, caseMI1000 := 5 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.german = { name := "German", isoCode := "de", soProportion1000 := 916, soEntropy1000 := 386, caseMI1000 := 288 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.greek = { name := "Greek", isoCode := "el", soProportion1000 := 896, soEntropy1000 := 490, caseMI1000 := 70 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.hindi = { name := "Hindi", isoCode := "hi", soProportion1000 := 874, soEntropy1000 := 509, caseMI1000 := 334 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.hungarian = { name := "Hungarian", isoCode := "hu", soProportion1000 := 727, soEntropy1000 := 858, caseMI1000 := 738 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.indonesian = { name := "Indonesian", isoCode := "id", soProportion1000 := 999, soEntropy1000 := 12, caseMI1000 := 0 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.italian = { name := "Italian", isoCode := "it", soProportion1000 := 969, soEntropy1000 := 192, caseMI1000 := 6 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.japanese = { name := "Japanese", isoCode := "ja", soProportion1000 := 953, soEntropy1000 := 246, caseMI1000 := 582 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.korean = { name := "Korean", isoCode := "ko", soProportion1000 := 978, soEntropy1000 := 146, caseMI1000 := 357 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.latvian = { name := "Latvian", isoCode := "lv", soProportion1000 := 767, soEntropy1000 := 784, caseMI1000 := 726 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.lithuanian = { name := "Lithuanian", isoCode := "lt", soProportion1000 := 608, soEntropy1000 := 968, caseMI1000 := 788 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.persian = { name := "Persian", isoCode := "fa", soProportion1000 := 924, soEntropy1000 := 315, caseMI1000 := 219 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.portuguese = { name := "Portuguese", isoCode := "pt", soProportion1000 := 986, soEntropy1000 := 102, caseMI1000 := 14 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.romanian = { name := "Romanian", isoCode := "ro", soProportion1000 := 966, soEntropy1000 := 216, caseMI1000 := 7 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.russian = { name := "Russian", isoCode := "ru", soProportion1000 := 861, soEntropy1000 := 580, caseMI1000 := 335 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.slovene = { name := "Slovene", isoCode := "sl", soProportion1000 := 873, soEntropy1000 := 536, caseMI1000 := 478 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.spanish = { name := "Spanish", isoCode := "es", soProportion1000 := 978, soEntropy1000 := 143, caseMI1000 := 21 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.swedish = { name := "Swedish", isoCode := "sv", soProportion1000 := 988, soEntropy1000 := 86, caseMI1000 := 0 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.tamil = { name := "Tamil", isoCode := "ta", soProportion1000 := 715, soEntropy1000 := 824, caseMI1000 := 59 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.turkish = { name := "Turkish", isoCode := "tr", soProportion1000 := 922, soEntropy1000 := 353, caseMI1000 := 167 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.vietnamese = { name := "Vietnamese", isoCode := "vi", soProportion1000 := 981, soEntropy1000 := 105, caseMI1000 := 0 }
Instances For
All 30 gradient word-order profiles from OSF Dataset1.txt + Dataset3.txt.
Instances For
Languages with near-deterministic SO order (proportion > 960) have low SO entropy (< 300). 13 languages: Bulgarian, Danish, Dutch, English, French, Indonesian, Italian, Korean, Portuguese, Romanian, Spanish, Swedish, Vietnamese.
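This claim can be spot-checked against the profiles listed above. A sketch — the pair list copies `(soProportion1000, soEntropy1000)` for the 13 named languages, in the order given:

```lean
-- (soProportion1000, soEntropy1000) for the 13 near-deterministic languages,
-- copied from the profiles above (Bulgarian … Vietnamese).
def rigidSO : List (Nat × Nat) :=
  [(965, 218), (989, 74), (970, 183), (994, 47), (995, 42), (999, 12), (969, 192),
   (978, 146), (986, 102), (966, 216), (978, 143), (988, 86), (981, 105)]

-- Every listed language has SO proportion > 960 and SO entropy < 300.
example : rigidSO.all (fun p => p.1 > 960 && p.2 < 300) = true := by decide
```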
Among Indo-European languages with case morphology, high SO entropy (> 700) implies high case MI (> 400). Czech, Hungarian, Latvian, and Lithuanian all use case marking to compensate for word-order freedom.
Tamil is a counterexample to the simple "flexibility requires case marking" story: high SO entropy (824) but low case MI (59). Tamil uses verb agreement and animacy rather than case morphology for role disambiguation.
This makes the gradient approach especially valuable — it reveals that the case–flexibility correlation is a tendency with principled exceptions, not a law.
Case MI correlates with SO entropy: languages with caseMI > 300 have higher mean SO entropy than languages with caseMI ≤ 300.
Instances For
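The split behind this theorem can be sketched numerically from the profiles above. The grouping by caseMI1000 > 300 (done by hand from the listed values) gives 12 versus 18 languages:

```lean
-- soEntropy1000 for the 12 languages with caseMI1000 > 300
-- (Croatian, Czech, Estonian, Finnish, Hungarian, Hindi, Japanese,
--  Korean, Latvian, Lithuanian, Russian, Slovene) …
def highCaseMIEntropy : List Nat :=
  [586, 760, 634, 426, 858, 509, 246, 146, 784, 968, 580, 536]

-- … and for the remaining 18 languages (caseMI1000 ≤ 300).
def lowCaseMIEntropy : List Nat :=
  [345, 218, 74, 183, 47, 42, 386, 490, 12, 192, 315, 102, 216, 143, 86, 824, 353, 105]

def mean1000 (xs : List Nat) : Nat := xs.foldl (· + ·) 0 / xs.length

-- High-MI languages have the higher mean SO entropy.
example : mean1000 highCaseMIEntropy > mean1000 lowCaseMIEntropy := by decide
```

With floor division the two means come out to 586 and 229 permille, matching the stated direction of the correlation.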
SO proportion spans a wide range: from Lithuanian (608) to Indonesian (999). This 391-point spread refutes any simple binary classification.
For all three WALS tables, harmonicProportion1000 > 500 → harmonicDominant = true.
The three tables have different harmonic proportions (943 vs 861 vs 822), showing harmony is a matter of degree, not a binary universal.
Languages with high SO entropy (> 600) in Levshina that also appear in Hahn et al. all have high branching direction entropy (> 250) in Hahn et al.
Two independent measures of word-order freedom (SO entropy from corpus counts, branching direction entropy from dependency trees) converge.
Languages with high propHeadFinal (> 700) in Futrell have high soProportion (> 700) in Levshina: head-final ≈ SOV ≈ high SO proportion.
Languages with soEntropy > 600 in Levshina include at least one of the Hahn et al. exception languages. Latvian appears in both: high SO entropy in Levshina, and a memory-surprisal optimization exception in Hahn et al. Flexible word order weakens memory-surprisal optimization.
Russian VO probability by register, from OSF Dataset6.txt (100 clauses per register). Demonstrates within-language variation that a categorical label obscures.
Instances For
Instances For
Instances For
Equations
- Phenomena.WordOrder.Gradience.russianConversation = { register := "conversation", voProbability1000 := 390 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.russianFiction = { register := "fiction", voProbability1000 := 830 }
Instances For
Equations
- Phenomena.WordOrder.Gradience.russianNews = { register := "news", voProbability1000 := 830 }
Instances For
Russian conversation has lower VO probability than fiction: spoken language permits more OV orders.
The register variation is large: fiction - conversation > 400 (a 44 percentage-point spread). A single categorical label cannot capture this within-language variation.
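The register figures above can be checked with a small sketch. The structure here is a stand-in for this module's register-profile type; the field names follow the rendered `russianConversation` and `russianFiction` equations:

```lean
-- Stand-in for the module's register-profile type; values copied from above.
structure RegisterVOProfile where
  register : String
  voProbability1000 : Nat

def conversation : RegisterVOProfile := { register := "conversation", voProbability1000 := 390 }
def fiction : RegisterVOProfile := { register := "fiction", voProbability1000 := 830 }

-- fiction − conversation = 440 permille, i.e. the 44 percentage-point spread > 400.
example : fiction.voProbability1000 - conversation.voProbability1000 > 400 := by decide
```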