U.S. Department of Energy
Office of Scientific and Technical Information

Autogenerating a Domain-Specific Question-Answering Data Set from a Thermoelectric Materials Database to Enable High-Performing BERT Models

Journal Article · Journal of Chemical Information and Modeling
Affiliations:
  1. Univ. of Cambridge (United Kingdom). Cavendish Lab.
  2. Univ. of Cambridge (United Kingdom). Cavendish Lab.; Science and Technology Facilities Council (STFC), Oxford (United Kingdom). Rutherford Appleton Lab. (RAL)
We present a method for autogenerating a large domain-specific question-answering (QA) dataset from a thermoelectric materials database. We show that a small language model, BERT, once fine-tuned on this automatically generated dataset of 99,757 QA pairs about thermoelectric materials, outperforms on thermoelectric-materials questions a BERT model fine-tuned on the generic English-language QA dataset SQuAD-v2. We further show that mixing the two datasets (ours and SQuAD-v2), which have significantly different syntactic and semantic scopes, improves the BERT model's performance still further. The best-performing BERT model, fine-tuned on the mixed dataset, outperforms the models fine-tuned on either dataset alone, scoring an exact match of 67.93% and an F1 score of 72.29% when evaluated on our test dataset. This demonstrates that high-performing small language models can be realized with modest computational resources when empowered by domain-specific materials datasets generated according to our method.
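The exact-match and F1 figures quoted in the abstract are the standard SQuAD-style extractive-QA evaluation metrics: exact match checks whether the normalized predicted answer span equals the normalized gold answer, while F1 measures token overlap between the two. A minimal sketch of how these metrics are conventionally computed (the function names here are illustrative, not taken from the paper):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles (a/an/the), and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answer spans."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores (like the 67.93% EM and 72.29% F1 reported above) are obtained by averaging these per-example values over the test set; when a question has multiple gold answers, the maximum score over the gold answers is typically taken.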
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
Engineering and Physical Sciences Research Council; Science and Technology Facilities Council; USDOE Office of Science (SC), Basic Energy Sciences (BES)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
2586792
Journal Information:
Journal of Chemical Information and Modeling, Vol. 65, Issue 16; ISSN 1549-9596; ISSN 1549-960X
Publisher:
American Chemical Society (ACS)
Country of Publication:
United States
Language:
English

References (42)

Temperature dependent solubility of Yb in Yb–CoSb3 skutterudite and its effect on preparation, optimization and lifetime of thermoelectrics journal March 2015
Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science journal April 2022
Synergistically optimizing electrical and thermal transport properties of n -type PbSe journal June 2018
ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science journal September 2021
BatteryBERT: A Pretrained Language Model for Battery Database Enhancement journal May 2022
OpticalBERT and OpticalTable-SQA: Text- and Table-Based Language Models for the Optical-Materials Domain journal March 2023
How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting? journal March 2024
ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature journal October 2016
Structured information extraction from scientific text with large language models journal February 2024
Extracting accurate materials data from research papers with conversational language models and prompt engineering journal February 2024
MatSciBERT: A materials domain language model for text mining and information extraction journal May 2022
A database of battery materials auto-generated using ChemDataExtractor journal August 2020
Auto-generated database of semiconductor band gaps using ChemDataExtractor journal May 2022
A database of refractive indices and dielectric constants auto-generated using ChemDataExtractor journal May 2022
Auto-generating databases of Yield Strength and Grain Size using ChemDataExtractor journal June 2022
A thermoelectric materials database auto-generated from the scientific literature using ChemDataExtractor journal October 2022
Automated Construction of a Photocatalysis Dataset for Water-Splitting Applications journal September 2023
A database of thermally activated delayed fluorescent molecules auto-generated from scientific literature with ChemDataExtractor journal January 2024
Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction journal June 2018
14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon journal January 2023
Machine learning based feature engineering for thermoelectric materials by design journal January 2024
MaScQA: investigating materials science knowledge of large language models journal January 2024
Harnessing GPT-3.5 for text parsing in solid-state synthesis – case study of ternary chalcogenides journal January 2024
Auto-generating question-answering datasets with domain-specific knowledge for language models in scientific tasks journal January 2025
Machine-learning guided prediction of thermoelectric properties of topological insulator Bi2Te3−xSex journal January 2024
An extensive review of tools for manual annotation of documents journal December 2019
BioBERT: a pre-trained biomedical language representation model for biomedical text mining journal September 2019
A Learning Algorithm for Continually Running Fully Recurrent Neural Networks journal June 1989
Domain Adaptation with BERT-based Domain Classification and Data Selection conference January 2019
Is Attention Interpretable? conference January 2019
Training Tips for the Transformer Model journal April 2018
A Photovoltaic Technology Review: History, Fundamentals and Applications journal March 2022
SQuAD: 100,000+ Questions for Machine Comprehension of Text preprint January 2016
Mixed Precision Training preprint January 2017
Universal Language Model Fine-tuning for Text Classification preprint January 2018
Know What You Don't Know: Unanswerable Questions for SQuAD preprint January 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding preprint January 2018
Attention is not Explanation preprint January 2019
SciBERT: A Pretrained Language Model for Scientific Text text January 2019
Entity-Relation Extraction as Multi-Turn Question Answering preprint January 2019
HuggingFace's Transformers: State-of-the-art Natural Language Processing preprint January 2019
AIMS-EREA -- A framework for AI-accelerated Innovation of Materials for Sustainability -- for Environmental Remediation and Energy Applications preprint January 2023

Similar Records

Language Models for the Prediction of SARS-CoV-2 Inhibitors
Conference · October 2022 · International Journal of High Performance Computing Applications · OSTI ID: 1892426

Language models for the prediction of SARS-CoV-2 inhibitors
Journal Article · October 2022 · International Journal of High Performance Computing Applications · OSTI ID: 1891374