U.S. Department of Energy
Office of Scientific and Technical Information

Text Mining for Process–Structure–Properties Relationships in Metals

Journal Article · Integrating Materials and Manufacturing Innovation
With the advent of large language models (LLMs), the vast unstructured text within millions of academic papers is increasingly accessible for materials discovery, although significant challenges remain. While LLMs offer promising few- and zero-shot learning capabilities, particularly valuable in the materials domain where expert annotations are scarce, general-purpose LLMs often fail to address key materials-specific queries without further adaptation. To bridge this gap, fine-tuning LLMs on human-labeled data is essential for effective structured knowledge extraction (Liu in The Importance of Human-Labeled Data in the Era of LLMs, 2023). In this study, we introduce a novel annotation schema designed to extract generic process–structure–properties relationships from scientific literature. We demonstrate the utility of this approach using a dataset of 128 abstracts, with annotations drawn from two distinct domains: high-temperature materials (Domain I) and uncertainty quantification in simulating materials microstructure (Domain II). Initially, we developed a conditional random field (CRF) model based on MatBERT—a domain-specific BERT variant—and evaluated its performance on Domain I. Subsequently, we compared this model with a fine-tuned LLM (GPT-4o from OpenAI) under identical conditions. Our results indicate that fine-tuning LLMs can significantly improve entity extraction performance over the BERT-CRF baseline on Domain I. However, when additional examples from Domain II were incorporated, the performance of the BERT-CRF model became comparable to that of the GPT-4o model. These findings underscore the potential of our schema for structured knowledge extraction and highlight the complementary strengths of both modeling approaches.
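Both modeling approaches described in the abstract ultimately reduce to labeling tokens and decoding contiguous entity spans. A minimal sketch of BIO-tag decoding is shown below; the entity type names (PROCESS, STRUCTURE) and the BIO labeling convention are illustrative assumptions, not the paper's exact annotation schema.

```python
# Illustrative sketch: decoding per-token BIO labels into entity spans,
# as a token-classification model (BERT-CRF or a fine-tuned LLM emitting
# structured output) might produce for a process-structure-properties schema.
# Tag names here are hypothetical examples, not the paper's actual label set.

def decode_bio(tokens, tags):
    """Convert parallel lists of tokens and BIO tags into (type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # begin a new entity
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)        # continue the open entity
        else:                               # "O" or an inconsistent "I-" tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["annealing", "at", "900", "C", "refines", "the", "grain", "structure"]
tags   = ["B-PROCESS", "I-PROCESS", "I-PROCESS", "I-PROCESS",
          "O", "O", "B-STRUCTURE", "I-STRUCTURE"]
print(decode_bio(tokens, tags))
# → [('PROCESS', 'annealing at 900 C'), ('STRUCTURE', 'grain structure')]
```

Entity-level F1 scores of the kind compared in the study are then computed by matching these decoded spans against gold-standard annotations.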
Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
US Army Research Laboratory (USARL); USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
3014104
Report Number(s):
LLNL-JRNL-2011378
Journal Information:
Integrating Materials and Manufacturing Innovation, Vol. 14, Issue 4; ISSN 2193-9764; ISSN 2193-9772
Publisher:
Springer
Country of Publication:
United States
Language:
English

References (24)

Tackling Structured Knowledge Extraction from Polymer Nanocomposite Literature as an NER/RE Task with seq2seq journal July 2024
Opportunities and challenges of text mining in materials research journal March 2021
Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science journal April 2022
Similarity of Precursors in Solid-State Synthesis as Text-Mined from Scientific Literature journal August 2020
Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning journal October 2017
Nanomaterial Synthesis Insights from Machine Learning of Scientific Articles by Extracting, Structuring, and Visualizing Knowledge journal April 2020
BatteryBERT: A Pretrained Language Model for Battery Database Enhancement journal May 2022
Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature journal July 2019
Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks journal January 2020
Structured information extraction from scientific text with large language models journal February 2024
Extracting accurate materials data from research papers with conversational language models and prompt engineering journal February 2024
Text-mined dataset of inorganic materials synthesis recipes journal October 2019
Agent-based learning of materials datasets from the scientific literature journal January 2024
Data-driven materials research enabled by natural language processing and information extraction journal December 2020
ChemSpot: a hybrid system for chemical named entity recognition journal April 2012
BioBERT: a pre-trained biomedical language representation model for biomedical text mining journal September 2019
Error bounds for convolutional codes and an asymptotically optimum decoding algorithm journal April 1967
Long Short-Term Memory journal November 1997
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages conference January 2020
A Frustratingly Easy Approach for Entity and Relation Extraction conference January 2021
SciBERT: A Pretrained Language Model for Scientific Text conference January 2019
  • Beltagy, Iz; Lo, Kyle; Cohan, Arman
  • Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) https://doi.org/10.18653/v1/D19-1371
Deep Contextualized Word Representations conference January 2018
  • Peters, Matthew; Neumann, Mark; Iyyer, Mohit
  • Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) https://doi.org/10.18653/v1/N18-1202
The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures conference January 2019
The Importance of Human-Labeled Data in the Era of LLMs conference August 2023

Similar Records

Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases
Journal Article · October 2025 · EBioMedicine · OSTI ID: 3014511

MechBERT: Language Models for Extracting Chemical and Property Relationships about Mechanical Stress and Strain
Journal Article · January 2025 · Journal of Chemical Information and Modeling · OSTI ID: 2510512