U.S. Department of Energy
Office of Scientific and Technical Information

Studying language evolution in the age of big data

Journal Article · Journal of Language Evolution
DOI: https://doi.org/10.1093/jole/lzy004 · OSTI ID: 1484654
 [1];  [2];  [3];  [4];  [5];  [6];  [4];  [7];  [8];  [9];  [10];  [11]
  1. Santa Fe Inst. (SFI), Santa Fe, NM (United States); Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  2. Univ. of Leipzig (Germany); Max Planck Inst. for Mathematics in the Sciences, Leipzig (Germany)
  3. Univ. of Zurich (Switzerland); Max Planck Inst. for the Science of Human History, Jena (Germany)
  4. Univ. of New Mexico, Albuquerque, NM (United States)
  5. Philipps Univ. Marburg (Germany)
  6. Arizona State Univ., Tempe, AZ (United States)
  7. Univ. of Leipzig (Germany)
  8. Santa Fe Inst. (SFI), Santa Fe, NM (United States); Georgia Inst. of Technology, Atlanta, GA (United States); Tokyo Inst. of Technology (Japan)
  9. Santa Fe Inst. (SFI), Santa Fe, NM (United States); Tokyo Inst. of Technology (Japan); Univ. of Leipzig (Germany)
  10. Santa Fe Inst. (SFI), Santa Fe, NM (United States); Russian State Univ. for the Humanities, Moscow (Russian Federation); Higher School of Economics, Moscow (Russian Federation)
  11. Northwestern Inst. on Complex Systems, Evanston, IL (United States); Northwestern Univ., Evanston, IL (United States). Dept. of Chemistry
The increasing availability of large digital corpora of cross-linguistic data is revolutionizing many branches of linguistics. Overall, it has triggered a shift of attention from detailed questions about individual features to more global patterns amenable to rigorous, but statistical, analyses. This engenders an approach based on successive approximations where models with simplified assumptions result in frameworks that can then be systematically refined, always keeping explicit the methodological commitments and the assumed prior knowledge. Therefore, they can resolve disputes between competing frameworks quantitatively by separating the support provided by the data from the underlying assumptions. These methods, though, often appear as a ‘black box’ to traditional practitioners. In fact, the switch to a statistical view complicates comparison of the results from these newer methods with traditional understanding, sometimes leading to misinterpretation and overly broad claims. We describe here this evolving methodological shift, attributed to the advent of big, but often incomplete and poorly curated, data, emphasizing the underlying similarity of the newer quantitative to the traditional comparative methods and discussing when and to what extent the former have advantages over the latter. In this review, we cover briefly both randomization tests for detecting patterns in a largely model-independent fashion and phylolinguistic methods for a more model-based analysis of these patterns. We foresee a fruitful division of labor between the ability to computationally process large volumes of data and the trained linguistic insight declaring worthy prior commitments and interesting hypotheses in need of comparison.
Research Organization:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
89233218CNA000001
OSTI ID:
1484654
Report Number(s):
LA-UR--18-24872
Journal Information:
Journal of Language Evolution, Vol. 3, Issue 2; ISSN 2058-4571
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English


Cited By (1)

Phylogenetics beyond biology journal June 2018
