DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A rule-free workflow for the automated generation of databases from scientific literature

Journal Article · npj Computational Materials

Abstract: In recent times, transformer networks have achieved state-of-the-art performance in a wide range of natural language processing tasks. Here we present a workflow based on the fine-tuning of BERT models for different downstream tasks, which results in the automated extraction of structured information from unstructured natural language in scientific literature. Contrary to existing methods for the automated extraction of structured compound-property relations from similar sources, our workflow does not rely on the definition of intricate grammar rules. Hence, it can be adapted to a new task without requiring extensive implementation effort or specialist knowledge. We test our data-extraction workflow by automatically generating a database for Curie temperatures and one for band gaps. These are then compared with manually curated datasets and with those obtained with a state-of-the-art rule-based method. Furthermore, in order to showcase the practical utility of the automatically extracted data in a material-design workflow, we employ them to construct machine-learning models to predict Curie temperatures and band gaps. In general, we find that, although noisier, automatically extracted datasets can grow fast in volume and that such volume partially compensates for the inaccuracy in downstream tasks.

Research Organization:
Ames Laboratory (AMES), Ames, IA (United States). Critical Materials Institute (CMI)
Sponsoring Organization:
Advance Laureate Award; Science Foundation Ireland AMBER center; USDOE; USDOE Advanced Research Projects Agency - Energy (ARPA-E); USDOE Office of Energy Efficiency and Renewable Energy (EERE), Energy Efficiency Office. Advanced Materials & Manufacturing Technologies Office (AMMTO)
Grant/Contract Number:
AC02-07CH11358
OSTI ID:
2229779
Journal Information:
Journal Name: npj Computational Materials; Journal Volume: 9; Journal Issue: 1; ISSN 2057-3960
Publisher:
Nature Publishing Group
Country of Publication:
United Kingdom
Language:
English
