U.S. Department of Energy
Office of Scientific and Technical Information

Text Mining for Process–Structure–Properties Relationships in Metals

Journal Article · Integrating Materials and Manufacturing Innovation
With the advent of large language models (LLMs), the vast unstructured text within millions of academic papers is increasingly accessible for materials discovery, although significant challenges remain. While LLMs offer promising few- and zero-shot learning capabilities, particularly valuable in the materials domain where expert annotations are scarce, general-purpose LLMs often fail to address key materials-specific queries without further adaptation. To bridge this gap, fine-tuning LLMs on human-labeled data is essential for effective structured knowledge extraction (Liu in The Importance of Human-Labeled Data in the Era of LLMs, 2023). In this study, we introduce a novel annotation schema designed to extract generic process–structure–properties relationships from scientific literature. We demonstrate the utility of this approach using a dataset of 128 abstracts, with annotations drawn from two distinct domains: high-temperature materials (Domain I) and uncertainty quantification in simulating materials microstructure (Domain II). Initially, we developed a conditional random field (CRF) model based on MatBERT—a domain-specific BERT variant—and evaluated its performance on Domain I. Subsequently, we compared this model with a fine-tuned LLM (GPT-4o from OpenAI) under identical conditions. Our results indicate that fine-tuning LLMs can significantly improve entity extraction performance over the BERT-CRF baseline on Domain I. However, when additional examples from Domain II were incorporated, the performance of the BERT-CRF model became comparable to that of the GPT-4o model. These findings underscore the potential of our schema for structured knowledge extraction and highlight the complementary strengths of both modeling approaches.
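Both modeling approaches described in the abstract ultimately reduce to labeling tokens and decoding contiguous entity spans. A minimal sketch of BIO-tag decoding is shown below; the entity type names (PROCESS, STRUCTURE) and the BIO labeling convention are illustrative assumptions, not the paper's exact annotation schema.

```python
# Illustrative sketch: decoding per-token BIO labels into entity spans,
# as a token-classification model (BERT-CRF or a fine-tuned LLM emitting
# structured output) might produce for a process-structure-properties schema.
# Tag names here are hypothetical examples, not the paper's actual label set.

def decode_bio(tokens, tags):
    """Convert parallel lists of tokens and BIO tags into (type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # begin a new entity
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)        # continue the open entity
        else:                               # "O" or an inconsistent "I-" tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["annealing", "at", "900", "C", "refines", "the", "grain", "structure"]
tags   = ["B-PROCESS", "I-PROCESS", "I-PROCESS", "I-PROCESS",
          "O", "O", "B-STRUCTURE", "I-STRUCTURE"]
print(decode_bio(tokens, tags))
# → [('PROCESS', 'annealing at 900 C'), ('STRUCTURE', 'grain structure')]
```

Entity-level F1 scores of the kind compared in the study are then computed by matching these decoded spans against gold-standard annotations.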
Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
US Army Research Laboratory (USARL); USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
3014104
Report Number(s):
LLNL-JRNL-2011378
Journal Information:
Integrating Materials and Manufacturing Innovation, Vol. 14, Issue 4; ISSN 2193-9764; ISSN 2193-9772
Publisher:
Springer
Country of Publication:
United States
Language:
English

References (24)

Tackling Structured Knowledge Extraction from Polymer Nanocomposite Literature as an NER/RE Task with seq2seq journal July 2024
Opportunities and challenges of text mining in materials research journal March 2021
Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science journal April 2022
Similarity of Precursors in Solid-State Synthesis as Text-Mined from Scientific Literature journal August 2020
Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning journal October 2017
Nanomaterial Synthesis Insights from Machine Learning of Scientific Articles by Extracting, Structuring, and Visualizing Knowledge journal April 2020
BatteryBERT: A Pretrained Language Model for Battery Database Enhancement journal May 2022
Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature journal July 2019
Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks journal January 2020
Structured information extraction from scientific text with large language models journal February 2024
Extracting accurate materials data from research papers with conversational language models and prompt engineering journal February 2024
Text-mined dataset of inorganic materials synthesis recipes journal October 2019
Agent-based learning of materials datasets from the scientific literature journal January 2024
Data-driven materials research enabled by natural language processing and information extraction journal December 2020
ChemSpot: a hybrid system for chemical named entity recognition journal April 2012
BioBERT: a pre-trained biomedical language representation model for biomedical text mining journal September 2019
Error bounds for convolutional codes and an asymptotically optimum decoding algorithm journal April 1967
Long Short-Term Memory journal November 1997
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages conference January 2020
A Frustratingly Easy Approach for Entity and Relation Extraction conference January 2021
SciBERT: A Pretrained Language Model for Scientific Text conference January 2019
  • Beltagy, Iz; Lo, Kyle; Cohan, Arman
  • Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) https://doi.org/10.18653/v1/D19-1371
Deep Contextualized Word Representations conference January 2018
  • Peters, Matthew; Neumann, Mark; Iyyer, Mohit
  • Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) https://doi.org/10.18653/v1/N18-1202
The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures conference January 2019
The Importance of Human-Labeled Data in the Era of LLMs conference August 2023

Similar Records

Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases
Journal Article · October 2025 · EBioMedicine · OSTI ID: 3014511

MechBERT: Language Models for Extracting Chemical and Property Relationships about Mechanical Stress and Strain
Journal Article · January 2025 · Journal of Chemical Information and Modeling · OSTI ID: 2510512