OSTI.GOV U.S. Department of Energy
Office of Scientific and Technical Information

Title: Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort

Abstract

Scientific Named Entity Referent Extraction is often more complicated than traditional Named Entity Recognition (NER). For example, in polymer science, chemical structure may be encoded in a variety of nonstandard naming conventions, and authors may refer to polymers with conventional names, commonly used names, labels (in lieu of longer names), synonyms, and acronyms. As a result, accurate scientific NER methods are often based on task-specific rules, which are difficult to develop and maintain, and are not easily generalized to other tasks and fields. Machine learning models require substantial expert-annotated data for training. Here we propose polyNER: a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based word vector classifier. Evaluation on materials science publications shows that the polyNER approach enables improved precision or recall relative to a state-of-the-art chemical entity extraction system at a dramatically lower cost: it required just two hours of expert time, rather than extensive and expensive rule engineering, to achieve that result. This result highlights the potential for human-computer partnership for constructing domain-specific scientific NER systems.

Authors:
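Tchoua, Roselyne B.; Ajith, Aswathy; Hong, Zhi; Ward, Logan T.; Chard, Kyle; Belikov, Alexander; Audus, Debra J.; Patel, Shrayesh; de Pablo, Juan J.; Foster, Ian T.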
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
National Institute of Standards and Technology (NIST); USDOE Office of Science (SC)
OSTI Identifier:
1558659
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Journal Volume: 11536; Conference: 2019 International Conference on Computational Science, 06/12/19 - 06/14/19, Faro, PT
Country of Publication:
United States
Language:
English
Subject:
Crowdsourcing; Natural Language Processing; Polymers; Scientific Named Entities; Word Embedding

Citation Formats

Tchoua, Roselyne B., Ajith, Aswathy, Hong, Zhi, Ward, Logan T., Chard, Kyle, Belikov, Alexander, Audus, Debra J., Patel, Shrayesh, de Pablo, Juan J., and Foster, Ian T. Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort. United States: N. p., 2019. Web. doi:10.1007/978-3-030-22734-0_29.
Tchoua, Roselyne B., Ajith, Aswathy, Hong, Zhi, Ward, Logan T., Chard, Kyle, Belikov, Alexander, Audus, Debra J., Patel, Shrayesh, de Pablo, Juan J., & Foster, Ian T. Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort. United States. doi:10.1007/978-3-030-22734-0_29.
Tchoua, Roselyne B., Ajith, Aswathy, Hong, Zhi, Ward, Logan T., Chard, Kyle, Belikov, Alexander, Audus, Debra J., Patel, Shrayesh, de Pablo, Juan J., and Foster, Ian T. Tue . "Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort". United States. doi:10.1007/978-3-030-22734-0_29. https://www.osti.gov/servlets/purl/1558659.
@article{osti_1558659,
title = {Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort},
author = {Tchoua, Roselyne B. and Ajith, Aswathy and Hong, Zhi and Ward, Logan T. and Chard, Kyle and Belikov, Alexander and Audus, Debra J. and Patel, Shrayesh and de Pablo, Juan J. and Foster, Ian T.},
abstractNote = {Scientific Named Entity Referent Extraction is often more complicated than traditional Named Entity Recognition (NER). For example, in polymer science, chemical structure may be encoded in a variety of nonstandard naming conventions, and authors may refer to polymers with conventional names, commonly used names, labels (in lieu of longer names), synonyms, and acronyms. As a result, accurate scientific NER methods are often based on task-specific rules, which are difficult to develop and maintain, and are not easily generalized to other tasks and fields. Machine learning models require substantial expert-annotated data for training. Here we propose polyNER: a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based word vector classifier. Evaluation on materials science publications shows that the polyNER approach enables improved precision or recall relative to a state-of-the-art chemical entity extraction system at a dramatically lower cost: it required just two hours of expert time, rather than extensive and expensive rule engineering, to achieve that result. This result highlights the potential for human-computer partnership for constructing domain-specific scientific NER systems.},
doi = {10.1007/978-3-030-22734-0_29},
journal = {},
issn = {0302-9743},
number = {},
volume = 11536,
place = {United States},
year = {2019},
month = {1}
}
