Active Learning for Language Modeling

Kemp, Emily; Compton, Jonathan; McKenzie, Darrien

doi:10.2172/1890039

Technical Report · Thu Sep 01 04:00:00 EDT 2022

DOI: https://doi.org/10.2172/1890039 · OSTI ID: 1890039

Kemp, Emily ^[1]; Compton, Jonathan ^[1]; McKenzie, Darrien ^[1]

Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)

Foreign disinformation campaigns undermine national security. Supervised language modeling techniques from natural language processing (NLP) can help analysts understand and dismantle these campaigns, but they depend on large labeled datasets, which are often labeled by humans. This work addresses that dependence with an active learning (AL) framework that generates labeled datasets and incorporates human input for detecting disinformation. The framework uses task-adaptive pretraining to exploit the unlabeled data and improve the performance of the classifier used for labeling. A disinformation rhetoric metric was developed to measure the presence in text of common rhetorical techniques meant to deceive; both the classifier and the human annotator can use it when identifying disinformation. This metric was combined with an uncertainty criterion to create a hybrid acquisition method for AL, which was tested alongside other acquisition functions. A robust stopping strategy was also developed to signal when the AL process should terminate, so that human time is not spent on iterations that would not significantly benefit classifier performance.
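The abstract does not give formulas for the hybrid acquisition method or the stopping strategy, so the following is only an illustrative sketch of the general idea, not the report's implementation. It assumes the uncertainty criterion is normalized predictive entropy, that the rhetoric metric yields a per-example score in [0, 1], and that the two are blended with a hypothetical weight `alpha`. The stopping check is a simple patience rule on a validation metric, standing in for whatever strategy the report actually develops. All function and parameter names are invented for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def hybrid_scores(class_probs, rhetoric_scores, alpha=0.5):
    """Blend normalized uncertainty with a rhetoric score for each example.

    class_probs: list of per-example class probability lists (>= 2 classes).
    rhetoric_scores: hypothetical per-example rhetoric scores in [0, 1].
    alpha: assumed mixing weight; the report does not specify one.
    """
    n_classes = len(class_probs[0])
    max_entropy = math.log(n_classes)  # entropy of the uniform distribution
    return [
        alpha * entropy(p) / max_entropy + (1.0 - alpha) * r
        for p, r in zip(class_probs, rhetoric_scores)
    ]


def select_batch(class_probs, rhetoric_scores, batch_size, alpha=0.5):
    """Return indices of the highest-scoring unlabeled examples."""
    scores = hybrid_scores(class_probs, rhetoric_scores, alpha)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


def should_stop(metric_history, patience=3, min_delta=0.005):
    """Assumed patience rule: stop when none of the last `patience` rounds
    beat the best earlier validation metric by at least `min_delta`."""
    if len(metric_history) <= patience:
        return False
    best_before = max(metric_history[:-patience])
    return all(m < best_before + min_delta for m in metric_history[-patience:])


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]]
    rhetoric = [0.1, 0.9, 0.2]
    # Indices of the two examples to send to the human annotator next.
    print(select_batch(probs, rhetoric, batch_size=2))
    # Validation metric per AL round; gains have flattened out.
    print(should_stop([0.70, 0.80, 0.801, 0.802, 0.803]))
```

With `alpha=1.0` the selection reduces to pure uncertainty sampling, and with `alpha=0.0` it ranks by the rhetoric score alone, which makes it easy to compare the hybrid against either criterion on its own.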

Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

DOE Contract Number: NA0003525

OSTI ID: 1890039

Report Number(s): SAND2022-13312; 710260

Country of Publication: United States

Language: English

Similar Records

KEBLM: Knowledge-Enhanced Biomedical Language Models

Journal Article · Thu May 18 20:00:00 EDT 2023 · Journal of Biomedical Informatics · OSTI ID:2420838

Identifying Disinformation Using Rhetorical Devices in Natural Language Models

Technical Report · Thu Sep 01 00:00:00 EDT 2022 · OSTI ID:1891194

Related Subjects

99 GENERAL AND MISCELLANEOUS
