Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
The number of published materials science articles has increased manyfold over the past few decades. Now, a major bottleneck in the materials discovery pipeline arises in connecting new results with the previously established literature. A potential solution to this problem is to map the unstructured raw text of published articles onto structured database entries that allow for programmatic querying. To this end, we apply text mining with named entity recognition (NER) for large-scale information extraction from the published materials science literature. The NER model is trained to extract summary-level information from materials science documents, including inorganic material mentions, sample descriptors, phase labels, material properties and applications, as well as any synthesis and characterization methods used. Our classifier achieves an F1 score of 87% and is applied to information extraction from 3.27 million materials science abstracts. We extract more than 80 million materials-science-related named entities, and the content of each abstract is represented as a database entry in a structured format. We demonstrate that simple database queries can be used to answer complex "meta-questions" about the published literature that would previously have required laborious, manual literature searches to answer. Finally, all of our data and functionality have been made freely available on our GitHub ( https://github.com/materialsintelligence/matscholar ) and website ( http://matscholar.com ), and we expect these results to accelerate the pace of future materials science discovery.
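To make the idea of answering a "meta-question" with a simple query concrete, the following is a minimal sketch, not the authors' schema or API. It assumes each abstract has already been reduced to a dictionary of extracted entities grouped by category. The field names (`materials`, `applications`, and so on), the helper `count_field`, and the three toy entries are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical entries: one dict per abstract, keyed by entity category.
# Field names and values are illustrative toy data, not the authors' schema
# and not results extracted from the real corpus.
ENTRIES = [
    {
        "doc_id": "toy-1",
        "materials": ["LiFePO4"],
        "properties": ["ionic conductivity"],
        "applications": ["cathode"],
        "synthesis_methods": ["sol-gel"],
        "characterization_methods": ["x-ray diffraction"],
    },
    {
        "doc_id": "toy-2",
        "materials": ["TiO2"],
        "properties": ["band gap"],
        "applications": ["photocatalysis"],
        "synthesis_methods": ["hydrothermal"],
        "characterization_methods": ["x-ray diffraction", "raman spectroscopy"],
    },
    {
        "doc_id": "toy-3",
        "materials": ["LiCoO2", "LiFePO4"],
        "properties": ["capacity"],
        "applications": ["cathode"],
        "synthesis_methods": ["solid-state reaction"],
        "characterization_methods": ["scanning electron microscopy"],
    },
]


def count_field(entries, count_key, **filters):
    """Count values of `count_key` over entries matching every filter.

    Each filter is `field=value`; an entry matches when `value` appears
    in that entry's list for `field`.
    """
    counts = Counter()
    for entry in entries:
        if all(value in entry.get(field, []) for field, value in filters.items()):
            counts.update(entry.get(count_key, []))
    return counts


# Meta-question: which materials are mentioned alongside the application "cathode"?
cathode_materials = count_field(ENTRIES, "materials", applications="cathode")
print(cathode_materials.most_common())  # [('LiFePO4', 2), ('LiCoO2', 1)]
```

Once the text is structured this way, a question that would otherwise need a manual literature search becomes a filter plus a count. A real deployment would presumably use a database with indexes rather than an in-memory list.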
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- Grant/Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1581363
- Journal Information:
- Journal of Chemical Information and Modeling, Vol. 59, Issue 9; ISSN 1549-9596
- Publisher:
- American Chemical Society
- Country of Publication:
- United States
- Language:
- English
MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature | journal | May 2021 |
Machine-Learning Rationalization and Prediction of Solid-State Synthesis Conditions | journal | August 2022 |
Additional file 1 of Application of machine reading comprehension techniques for named entity recognition in materials science | dataset | January 2024 |
Similar Records
Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort
Deep learning of electrochemical CO2 conversion literature reveals research trends and directions