Active Learning for Language Modeling
- Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
Foreign disinformation campaigns undermine national security. Supervised language modeling techniques in NLP can help to understand and dismantle these campaigns, but they rely heavily on large labeled datasets, which are often labeled by humans. This work addresses that problem with an active learning (AL) framework that generates labeled datasets and leverages human input for detecting disinformation. The AL framework uses task-adaptive pretraining to fully leverage the unlabeled data and boost the performance of the classifier used for labeling. A disinformation rhetoric metric was developed to measure the presence of common rhetorical techniques intended to deceive, for use by both the classifier and the human annotator in identifying disinformation. This metric was combined with an uncertainty criterion to create a hybrid acquisition method for AL, which was tested alongside other acquisition functions. A robust stopping strategy was developed to signal when the AL process should terminate, so that human time is not spent on iterations that would not significantly benefit classifier performance.
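The abstract does not say how the rhetoric metric and the uncertainty criterion are combined, so the following is only a minimal sketch of one plausible hybrid acquisition function, not the report's method. It assumes uncertainty is measured as normalized predictive entropy, that the rhetoric metric is already scaled to [0, 1], and that the two are blended with a weighted sum. The names `predictive_entropy`, `hybrid_scores`, `select_batch`, and the `weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def hybrid_scores(
    class_probs: Sequence[Sequence[float]],
    rhetoric_scores: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Blend classifier uncertainty with a rhetoric score per unlabeled item.

    ASSUMPTION: a weighted sum is used here for illustration only; the
    report's actual combination rule is not given in the abstract.
    `rhetoric_scores` are assumed to lie in [0, 1].
    """
    if len(class_probs) != len(rhetoric_scores):
        raise ValueError("one rhetoric score is needed per prediction")
    scores = []
    for probs, rhetoric in zip(class_probs, rhetoric_scores):
        if len(probs) < 2:
            raise ValueError("each prediction needs at least two classes")
        # Divide by log(K) so uncertainty also lies in [0, 1].
        uncertainty = predictive_entropy(probs) / math.log(len(probs))
        scores.append(weight * uncertainty + (1.0 - weight) * rhetoric)
    return scores


def select_batch(scores: Sequence[float], batch_size: int) -> List[int]:
    """Return indices of the highest-scoring items to send to the annotator."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


if __name__ == "__main__":
    # Toy predictions for three unlabeled texts (binary classifier).
    probs = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]
    rhetoric = [0.0, 1.0, 0.0]
    print(select_batch(hybrid_scores(probs, rhetoric), batch_size=2))
```

Setting `weight=1.0` reduces this sketch to plain uncertainty sampling, and `weight=0.0` ranks items by the rhetoric score alone, which makes it easy to compare the hybrid against either criterion on its own.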
- Research Organization:
- Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number:
- NA0003525
- OSTI ID:
- 1890039
- Report Number(s):
- SAND2022-13312; 710260
- Country of Publication:
- United States
- Language:
- English
Similar Records
KEBLM: Knowledge-Enhanced Biomedical Language Models
Journal Article · Thu May 18 20:00:00 EDT 2023 · Journal of Biomedical Informatics · OSTI ID:2420838
Identifying Disinformation Using Rhetorical Devices in Natural Language Models
Technical Report · Thu Sep 01 00:00:00 EDT 2022 · OSTI ID:1891194