
Title: Predicting Foreign Language Usage from English-Only Social Media Posts

Conference · DOI: https://doi.org/10.18653/v1/N18-2096 · OSTI ID: 1440628

Social media is known for its multicultural and multilingual interactions, a natural product of which is code-mixing. Multilingual speakers mix languages when they tweet to address a different audience, express certain feelings, or attract attention. This paper presents a large-scale analysis of 6 million tweets produced by 23 thousand multilingual users who speak 11 languages besides English. We rely on this multilingual corpus to build predictive models for a novel task: inferring, exclusively from users' English tweets, the non-English languages that they speak. We contrast the predictive power of different linguistic signals and report that the lexical content and syntactic structure of English tweets are the most predictive of the non-English languages that users speak on Twitter. By analyzing cross-lingual transfer (the influence of non-English languages on various levels of linguistic performance in English), we present novel findings on stylistic and syntactic variations across speakers of 11 languages.

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1440628
Report Number(s):
PNNL-SA-124151; 453040300
Resource Relation:
Conference: Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), June 1-6, 2018, New Orleans, Louisiana, 608-614
Country of Publication:
United States
Language:
English