
Title: Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature

Journal Article · Journal of Chemical Information and Modeling

The number of published materials science articles has increased manyfold over the past few decades. Now, a major bottleneck in the materials discovery pipeline arises in connecting new results with the previously established literature. A potential solution to this problem is to map the unstructured raw text of published articles onto structured database entries that allow for programmatic querying. To this end, we apply text mining with named entity recognition (NER) for large-scale information extraction from the published materials science literature. The NER model is trained to extract summary-level information from materials science documents, including inorganic material mentions, sample descriptors, phase labels, material properties and applications, as well as any synthesis and characterization methods used. Our classifier achieves an accuracy (F1) of 87%, and is applied to information extraction from 3.27 million materials science abstracts. We extract more than 80 million materials-science-related named entities, and the content of each abstract is represented as a database entry in a structured format. We demonstrate that simple database queries can be used to answer complex "meta-questions" of the published literature that would have previously required laborious, manual literature searches to answer. Finally, all of our data and functionality have been made freely available on our GitHub (https://github.com/materialsintelligence/matscholar) and website (http://matscholar.com), and we expect these results to accelerate the pace of future materials science discovery.
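As a rough illustration of what "structured database entries that allow for programmatic querying" could look like, the following minimal Python sketch stores a few abstracts as dictionaries keyed by entity type and filters them. It is not the Matscholar API or schema: the field names, the helpers `find_entries` and `count_field`, and the three toy records are all hypothetical and invented purely for illustration.

```python
from collections import Counter

# Hypothetical toy records (NOT real extracted data). Each abstract is
# represented as one entry whose fields hold the entities found in it.
ENTRIES = [
    {"doc_id": "toy-1", "materials": ["PbTe"],
     "applications": ["thermoelectric"],
     "characterization": ["x-ray diffraction"]},
    {"doc_id": "toy-2", "materials": ["BiFeO3"],
     "applications": ["multiferroic"],
     "characterization": ["x-ray diffraction"]},
    {"doc_id": "toy-3", "materials": ["PbTe", "SnTe"],
     "applications": ["thermoelectric"],
     "characterization": []},
]


def find_entries(entries, **criteria):
    """Return entries in which every named field contains the given value."""
    return [
        entry for entry in entries
        if all(value in entry.get(field, []) for field, value in criteria.items())
    ]


def count_field(entries, field):
    """Count how often each entity appears in one field across the entries."""
    return Counter(value for entry in entries for value in entry.get(field, []))


if __name__ == "__main__":
    # A simple "meta-question": which materials co-occur with a given application?
    thermoelectric = find_entries(ENTRIES, applications="thermoelectric")
    print(count_field(thermoelectric, "materials").most_common())
```

A real system would presumably back such queries with a database index rather than a linear scan, but the filtering logic follows the same idea.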

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1581363
Journal Information:
Journal of Chemical Information and Modeling, Vol. 59, Issue 9; ISSN 1549-9596
Publisher:
American Chemical Society
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 68 works
Citation information provided by
Web of Science

Cited By (3)

Progress and prospects for accelerating materials science with automated and autonomous workflows journal January 2019
MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature journal May 2021
Machine-Learning Rationalization and Prediction of Solid-State Synthesis Conditions journal August 2022
