DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: MechBERT: Language Models for Extracting Chemical and Property Relationships about Mechanical Stress and Strain

Journal Article · Journal of Chemical Information and Modeling
 [1]; [2]; [1]
  1. University of Cambridge (United Kingdom); STFC Rutherford Appleton Laboratory, Didcot (United Kingdom)
  2. STFC Rutherford Appleton Laboratory, Didcot (United Kingdom); Neutron Sciences Directorate, Oak Ridge, TN (United States)

Language models are transforming materials-aware natural-language processing by enabling the extraction of dynamic, context-rich information from unstructured text, thus moving beyond the limitations of traditional information-extraction methods. Moreover, small language models are on the rise because some of them can outperform large language models (LLMs) on domain-specific question-answering tasks, especially in application areas that rely on a highly specialized vernacular, such as materials science. We therefore present MechBERT, a new class of language models for understanding mechanical stress and strain in materials, built on the Bidirectional Encoder Representations from Transformers (BERT) architecture. We showcase four MechBERT models, all of which were pretrained on a corpus of documents that are textually rich in chemicals and their stress–strain properties and fine-tuned on question-answering tasks. We evaluated the performance of our models on domain-specific as well as general English-language question-answering tasks and also explored the influence of the size and type of BERT architecture on model performance. We find that our MechBERT models outperform BERT-based models of the same size and maintain relevancy better than much larger BERT-based models on domain-specific question answering within the stress–strain engineering sector. These small language models also process text much faster and require a much smaller fraction of pretraining data, affording them greater operational efficiency and energy sustainability than LLMs.
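The question-answering fine-tuning described above is extractive: a BERT-style model scores every token of the context as a possible answer start or end, and the best-scoring valid span is returned as the answer. A minimal sketch of that span-selection step, using made-up logits in place of real model outputs (a trained MechBERT/BERT model would produce them from the question and context tokens):

```python
# Illustration of extractive QA span selection, as used when fine-tuning
# BERT-style models for question answering: the model emits per-token
# start and end logits, and decoding picks the highest-scoring valid span.
# The tokens and logits below are made up for demonstration purposes.

def best_span(start_logits, end_logits, max_len=5):
    """Return the (start, end) pair maximizing start + end score, with
    start <= end and span length capped at max_len tokens."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hypothetical context tokens and model outputs for the question
# "What is the yield strength of mild steel?"
tokens = ["The", "yield", "strength", "of", "mild", "steel", "is", "250", "MPa"]
start = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.2, 0.5]
end   = [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 3.0]

s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # prints "250 MPa"
```

In a real fine-tuned model the start/end logits come from a linear layer over BERT's final hidden states; the decoding step shown here is the same.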

Research Organization:
University of Cambridge (United Kingdom)
Sponsoring Organization:
USDOE Office of Science (SC), Basic Energy Sciences (BES), Scientific User Facilities (SUF)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
2510512
Journal Information:
Journal of Chemical Information and Modeling, Vol. 65, Issue 4; ISSN 1549-9596
Publisher:
American Chemical Society
Country of Publication:
United States
Language:
English
