Linglib.Phenomena.WordOrder.Studies.HahnDegenFutrell2021

Study 2: 54-Language Word-Order Efficiency

@cite{hahn-degen-futrell-2021}

Tests the Efficient Trade-off Hypothesis: the ordering regularities of natural language optimize the memory-surprisal trade-off, serving the communicative interest of the hearer. The study measures 54 languages from Universal Dependencies corpora against grammar-preserving random baselines. 50 of the 54 languages have significantly more efficient trade-offs than their baselines; the 4 exceptions (Latvian, North Sami, Polish, Slovak) all have high word-order freedom (high branching direction entropy).

Key empirical finding (Figure 13): branching direction entropy is negatively correlated with optimization strength (Spearman ρ ≈ −.58, p < .0001). Languages with freer word order show weaker optimization, plausibly because free-order languages use word order to encode information structure rather than minimize processing cost.

Values

Efficiency data for a single language from Study 2.

  • name : String
  • isoCode : String
  • family : String
  • moreEfficient : Bool

    Whether the real language's trade-off AUC is significantly lower than baseline AUCs (Hochberg-corrected p < .01). This is the empirical instantiation of Processing.MemorySurprisal.efficientTradeoffHypothesis from the theory module.

  • gMean1000 : ℕ

    Bootstrapped mean G × 1000 (from SI Figure 2). 1000 = fully optimized.

  • branchDirEntropy1000 : Option ℕ

    Branching direction entropy × 1000 (higher = more word-order freedom). none when the value is unavailable in the published data.
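
The field list above can be read as a plain Lean structure. The following is a minimal sketch, not the module's actual source: it assumes the numeric fields are natural numbers (written `Nat` here so the block does not need Mathlib), and the namespace `RecordSketch` and the value `estonian` are hypothetical names introduced for illustration. The entropy 435 and G = 0.80 are the Estonian figures quoted elsewhere in this module; the ISO code and family label are assumptions.

```lean
namespace RecordSketch

/-- Per-language record with the six fields listed above. -/
structure LanguageEfficiency where
  name : String
  isoCode : String
  family : String
  moreEfficient : Bool
  gMean1000 : Nat
  branchDirEntropy1000 : Option Nat
  deriving Repr

/-- Illustrative entry; ISO code and family label are assumed, not taken from the data. -/
def estonian : LanguageEfficiency :=
  { name := "Estonian", isoCode := "et", family := "Uralic",
    moreEfficient := true, gMean1000 := 800,
    branchDirEntropy1000 := some 435 }

-- The flag and the scaled G value agree with the G ≥ 500 threshold.
example : estonian.moreEfficient = true ∧ estonian.gMean1000 ≥ 500 := by decide

end RecordSketch
```

Scaling G and entropy by 1000 keeps every comparison in decidable `Nat` arithmetic, so claims about the data can be closed by `decide`.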

Efficient languages (50)

G ≥ 0.5 in the LSTM estimator (main paper). Most have G = 1.0.

Exception languages (4)

G < 0.5 in the LSTM estimator (main paper, Figure 13; SI Figure 2). All have high branching direction entropy (free word order).

All 54 languages from Study 2 (SI Table 2).

The 50 efficient languages.

The 4 exception languages.

50 out of 54 languages have more efficient word orders than baselines.

All 4 exceptions have high branching direction entropy (> 300 × 10⁻³).

This supports the paper's explanation: languages with very free word order have weaker optimization pressure because many orderings are nearly equally acceptable, reducing the signal of optimization.

Entropy values from branching_entropy.tsv at https://github.com/m-hahn/memory-surprisal

All 4 exceptions have G < 500 (below the optimization threshold).

The moreEfficient flag is consistent with a G ≥ 500 threshold across all 54 languages. This cross-checks two independently encoded fields: moreEfficient (from the binomial test) and gMean1000 (from SI Figure 2's bootstrapped fraction).
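
A cross-check of this kind can be phrased as a Boolean test over the language list and discharged by `decide`. The sketch below shows the shape of such a check under stated assumptions: the structure `Lang`, the list `toy`, and the theorem name are hypothetical, and the three rows carry made-up G values rather than the published data.

```lean
namespace ThresholdSketch

structure Lang where
  name : String
  moreEfficient : Bool
  gMean1000 : Nat

/-- Toy rows with made-up G values; not the published data. -/
def toy : List Lang :=
  [ { name := "A", moreEfficient := true,  gMean1000 := 1000 },
    { name := "B", moreEfficient := true,  gMean1000 := 800 },
    { name := "C", moreEfficient := false, gMean1000 := 120 } ]

/-- On every row, the flag is `true` exactly when G × 1000 ≥ 500. -/
theorem flag_matches_threshold :
    toy.all (fun l => l.moreEfficient == decide (l.gMean1000 ≥ 500)) = true := by
  decide

end ThresholdSketch
```

Because the two fields are entered independently, a transcription error in either one would make a check of this form fail to compile.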

The 4 exceptions form a contiguous block at the bottom of the G ranking: no efficient language has G below any exception's G.

Japanese has the lowest branching direction entropy among languages with known entropy data (most rigid word order). Korean is excluded because its entropy is not available in the published data.

Estonian has the highest entropy among efficient languages (435) but is still efficient (G = 0.80), showing that word-order freedom is necessary but not sufficient for being an exception.

Mean branching direction entropy is higher for exceptions than efficient languages (computed over languages with known entropy).
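
One way to compute a mean "over languages with known entropy" is to drop the `none` entries before averaging. The following sketch assumes integer (truncating) division and uses hypothetical names (`meanKnown`, `efficientToy`, `exceptionToy`) with made-up entropy values; it is not the module's own helper.

```lean
namespace MeanEntropySketch

/-- Integer mean of the known (`some`) values; 0 when nothing is known. -/
def meanKnown (xs : List (Option Nat)) : Nat :=
  let known := xs.filterMap id
  if known.isEmpty then 0 else known.foldl (· + ·) 0 / known.length

-- Made-up entropy values (× 1000); `none` marks a missing entry.
def efficientToy : List (Option Nat) := [some 40, none, some 357, some 435]
def exceptionToy : List (Option Nat) := [some 520, some 610]

-- 1130 / 2 = 565 exceeds 832 / 3 = 277 on the toy values.
example : meanKnown exceptionToy > meanKnown efficientToy := by decide

end MeanEntropySketch
```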

Slovak has the lowest G value (least evidence for optimization).

42 out of 50 efficient languages have G = 1.0 (fully optimized: the real language beats every sampled baseline grammar).

ISO codes appearing in @cite{futrell-gibson-2020}'s 32-language dataset.

ISO codes appearing in this study's 54-language dataset.

Languages in both datasets (by ISO code).
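
The overlap of two code lists can be computed with a filter. This is a small sketch under stated assumptions: `datasetA`, `datasetB`, and `shared` are hypothetical names, and the short lists are illustrative stand-ins, not the real 32- and 54-language inventories.

```lean
namespace SharedIsoSketch

-- Short illustrative code lists, not the real inventories.
def datasetA : List String := ["en", "ja", "pl", "tr"]
def datasetB : List String := ["ja", "pl", "fi", "sk"]

/-- Codes from `xs` that also occur in `ys`, in `xs` order. -/
def shared (xs ys : List String) : List String :=
  xs.filter (fun c => ys.contains c)

#eval shared datasetA datasetB  -- ["ja", "pl"]

end SharedIsoSketch
```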

Polish is the only shared language that is an exception.

Negative correlation between word-order freedom and optimization

Figure 13 of @cite{hahn-degen-futrell-2021} shows that branching direction entropy (x-axis) is negatively correlated with the surprisal difference between real and baseline orders (y-axis). Spearman ρ ≈ −.58, p < .0001.

We cannot compute a Spearman correlation in Lean without a ranking function, but we can verify the key structural claims that drive the correlation:

Languages with known low branching entropy (< 300) are all efficient. This is the left side of Figure 13: rigid-order languages cluster at high surprisal difference (strong optimization).

All 4 exceptions have entropy ≥ 315. This is the lower-right of Figure 13: exceptions cluster at high entropy.

Not all high-entropy languages are exceptions: word-order freedom is necessary but not sufficient for being an exception. Estonian (entropy 435) and Finnish (357) are efficient despite high entropy.
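
The "necessary but not sufficient" pattern can be checked as two separate Boolean statements. The sketch below is illustrative only: `Lang`, `toy`, and `highEntropy` are hypothetical names, the Estonian and Finnish entropies are the ones quoted above, and the row "X" is a made-up exception.

```lean
namespace FreedomSketch

structure Lang where
  name : String
  moreEfficient : Bool
  entropy1000 : Option Nat

-- Estonian and Finnish entropies are quoted from the text; "X" is made up.
def toy : List Lang :=
  [ ⟨"Estonian", true, some 435⟩, ⟨"Finnish", true, some 357⟩, ⟨"X", false, some 500⟩ ]

/-- Known entropy at or above the 315 cut-off; unknown entropy counts as not high. -/
def highEntropy (l : Lang) : Bool :=
  match l.entropy1000 with
  | some e => decide (e ≥ 315)
  | none => false

-- Necessity on the toy rows: every exception is high-entropy.
example : (toy.filter (fun l => !l.moreEfficient)).all highEntropy = true := by decide

-- No sufficiency: some high-entropy language is still efficient.
example : toy.any (fun l => highEntropy l && l.moreEfficient) = true := by decide

end FreedomSketch
```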

theorem Phenomena.WordOrder.Studies.HahnDegenFutrell2021.low_entropy_higher_mean_g :
  have lowEntropy := List.filter (fun (l : LanguageEfficiency) => match l.branchDirEntropy1000 with | some e => decide (e < 250) | none => false) allLanguages;
  have highEntropy := List.filter (fun (l : LanguageEfficiency) => match l.branchDirEntropy1000 with | some e => decide (e ≥ 250) | none => false) allLanguages;
  List.foldl (fun (x1 x2 : ℕ) => x1 + x2) 0 (List.map (fun (x : LanguageEfficiency) => x.gMean1000) lowEntropy) / lowEntropy.length >
    List.foldl (fun (x1 x2 : ℕ) => x1 + x2) 0 (List.map (fun (x : LanguageEfficiency) => x.gMean1000) highEntropy) / highEntropy.length

The mean G value decreases as entropy increases: partition languages into low-entropy (< 250) and high-entropy (≥ 250) groups. The low-entropy group has higher mean G, consistent with the negative correlation.
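
The same partition-and-compare argument can be tried on a small list. The following sketch mirrors the shape of the theorem statement but is not the module's proof: `Lang`, `withEntropy`, `meanG`, and `toy` are hypothetical names, the rows are made up, and the means use truncating `Nat` division.

```lean
namespace PartitionSketch

structure Lang where
  gMean1000 : Nat
  branchDirEntropy1000 : Option Nat

/-- Keep rows whose known entropy satisfies `p`; rows with `none` are dropped. -/
def withEntropy (p : Nat → Bool) (ls : List Lang) : List Lang :=
  ls.filter fun l =>
    match l.branchDirEntropy1000 with
    | some e => p e
    | none => false

/-- Integer mean of `gMean1000` (`Nat` division truncates, and `n / 0 = 0`). -/
def meanG (ls : List Lang) : Nat :=
  (ls.map (·.gMean1000)).foldl (· + ·) 0 / ls.length

-- Made-up rows, not the published values.
def toy : List Lang :=
  [ ⟨1000, some 50⟩, ⟨950, some 200⟩, ⟨1000, none⟩,
    ⟨800, some 435⟩, ⟨300, some 520⟩ ]

-- Low-entropy mean is 1950 / 2 = 975; high-entropy mean is 1100 / 2 = 550.
example :
    meanG (withEntropy (fun e => decide (e < 250)) toy) >
    meanG (withEntropy (fun e => decide (e ≥ 250)) toy) := by
  decide

end PartitionSketch
```

Rows with unknown entropy fall into neither group, which matches the `none => false` branch in both filters of the theorem statement.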

Information locality generalizes dependency locality

@cite{hahn-degen-futrell-2021} argue (§"Other Kinds of Memory Bottlenecks" and Discussion) that information locality generalizes dependency length minimization: DLM minimizes structural distance between related words, while information locality minimizes the information-theoretic distance at which predictive information concentrates.

The HarmonicOrder module proves that consistent head direction achieves shorter dependency chains (harmonic_always_shorter). The present study shows that languages with shorter dependencies (lower branching entropy, more consistent direction) achieve better memory-surprisal trade-offs (rigid_order_languages_efficient). Together, these two results establish the chain: harmonic order → short dependencies → information locality → efficient trade-off.

The DLM harmonic order prediction holds: consistent head direction produces shorter total dependency length (from HarmonicOrder.lean).

The full chain: all languages with low entropy (consistent direction, short dependencies) are efficient, and the DLM prediction holds. This connects the structural argument (HarmonicOrder) to the information-theoretic result (memory-surprisal efficiency).

WALS Language Validation

The study uses ISO 639-1 codes (2-letter) from Universal Dependencies. WALS uses ISO 639-3 codes (3-letter). This mapping connects them, enabling family classification cross-checks against WALS v2020.4.

Coverage: 51 of 54 languages have WALS entries (missing: Buryat, Croatian, Serbian). Of 51, 42 have identical family names; 9 differ due to terminology (Turkic/Altaic, Japonic/Japanese, Kra-Dai/Tai-Kadai, etc.).

ISO 639-1 codes that coincide with ISO 639-3 pass through directly. For macrolanguages (Arabic, Chinese, Persian, Estonian), the mapping points to the specific ISO 639-3 variety used in WALS.

ISO 639-1 (study) → ISO 639-3 (WALS) mapping for the 54 languages.
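
A code mapping of this kind can be written as a partial lookup that returns `none` for unmapped codes. The fragment below is a hypothetical sketch, not the module's table: the function name `toIso6393` is invented, and the choice of `ekk`, `cmn`, and `pes` as the WALS-side varieties for Estonian, Chinese, and Persian is an assumption (they are valid ISO 639-3 codes for Standard Estonian, Mandarin, and Iranian Persian). The `sme` row illustrates a code that passes through unchanged.

```lean
namespace IsoSketch

/-- Hypothetical fragment of a study-code to ISO 639-3 lookup. -/
def toIso6393 : String → Option String
  | "et"  => some "ekk"  -- macrolanguage: assumed specific variety
  | "zh"  => some "cmn"  -- macrolanguage: assumed specific variety
  | "fa"  => some "pes"  -- macrolanguage: assumed specific variety
  | "sme" => some "sme"  -- already a 3-letter code: passes through
  | _     => none        -- codes outside this fragment

#eval toIso6393 "et"  -- some "ekk"

end IsoSketch
```

Returning `Option String` leaves room for study languages that have no WALS counterpart.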

Look up a study language's WALS entry via its ISO code.

Languages with WALS entries (51 of 54).

The 3 languages without WALS entries are Buryat, Croatian, and Serbian.

For all 42 languages where the family names agree, the study family matches the WALS family exactly.

The 9 family-name divergences (all terminological, not errors):

  • Basque: study "Isolate" vs WALS "Basque"
  • Japanese: "Japonic" vs "Japanese"
  • Kazakh/Turkish/Uyghur: "Turkic" vs "Altaic" (Altaic hypothesis disputed)
  • Korean: "Koreanic" vs "Korean"
  • Naija: "Creole" vs "other"
  • Thai: "Kra-Dai" vs "Tai-Kadai"
  • Vietnamese: "Austroasiatic" vs "Austro-Asiatic" (hyphenation)