DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Structured information extraction from scientific text with large language models

Journal Article · Nature Communications

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
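To illustrate the "list of JSON objects" output format mentioned in the abstract, the following is a minimal Python sketch of how a model completion could be turned into dopant-host pairs. The helper `parse_records`, the field names `host` and `dopants`, and the example completion string are all assumptions made for illustration; they are not taken from the article and may differ from the schema the authors used.

```python
import json


def parse_records(completion: str) -> list[dict]:
    """Parse a completion that is expected to contain a JSON list of objects.

    Hypothetical helper: malformed output is treated as "no records"
    rather than raising, since generated text is not guaranteed to be valid JSON.
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):
        # Tolerate a single object instead of a list.
        data = [data]
    if not isinstance(data, list):
        return []
    # Keep only entries that are JSON objects.
    return [record for record in data if isinstance(record, dict)]


# Hypothetical completion for a sentence about a doped material
# (assumed schema: one host material, a list of dopants).
completion = '[{"host": "ZnO", "dopants": ["Al", "Ga"]}]'

records = parse_records(completion)

# Flatten each record into (host, dopant) pairs for storage in a table.
pairs = [
    (record.get("host"), dopant)
    for record in records
    for dopant in record.get("dopants", [])
]
print(pairs)
```

Returning an empty list on malformed output is one possible design choice; a real pipeline might instead log or retry such completions.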

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Sponsoring Organization:
USDOE Office of Science (SC), Basic Energy Sciences (BES). Materials Sciences & Engineering Division (MSE)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
2433806
Journal Information:
Nature Communications, Vol. 15, Issue 1; ISSN 2041-1723
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English

