Autogenerating a Domain-Specific Question-Answering Data Set from a Thermoelectric Materials Database to Enable High-Performing BERT Models
Journal Article · Journal of Chemical Information and Modeling
- Univ. of Cambridge (United Kingdom). Cavendish Lab.
- Univ. of Cambridge (United Kingdom). Cavendish Lab.; Science and Technology Facilities Council (STFC), Oxford (United Kingdom). Rutherford Appleton Lab. (RAL)
We present a method for autogenerating a large domain-specific question-answering (QA) data set from a thermoelectric materials database. We show that a small language model, BERT, once fine-tuned on this automatically generated data set of 99,757 QA pairs about thermoelectric materials, performs better in the field of thermoelectric materials than a BERT model fine-tuned on the generic English-language QA data set SQuAD-v2. We further show that mixing the two data sets (ours and SQuAD-v2), which have significantly different syntactic and semantic scopes, allows the BERT model to achieve even better performance. The best-performing BERT model, fine-tuned on the mixed data set, outperforms the models fine-tuned on the other two data sets, scoring an exact match of 67.93% and an F1 score of 72.29% when evaluated on our test data set. This has important implications, as it demonstrates the ability to realize high-performing small language models, with modest computational resources, empowered by domain-specific materials data sets that can be generated according to our method.
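The exact-match and F1 figures quoted in the abstract follow the standard SQuAD-style convention for evaluating extractive QA. The sketch below is illustrative only (it is not the authors' evaluation script): it implements the usual normalization (lowercasing, stripping punctuation and English articles) and the token-overlap F1 used by the SQuAD benchmarks.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    drop the articles a/an/the, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted answer of "bismuth telluride alloy" against a reference of "bismuth telluride" is not an exact match, but scores F1 = 0.8 (precision 2/3, recall 1). Corpus-level scores like those in the abstract are the averages of these per-question values over the test set.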
- Research Organization:
- Argonne National Laboratory (ANL), Argonne, IL (United States)
- Sponsoring Organization:
- Engineering and Physical Sciences Research Council; Science and Technology Facilities Council; USDOE Office of Science (SC), Basic Energy Sciences (BES)
- Grant/Contract Number:
- AC02-06CH11357
- OSTI ID:
- 2586792
- Journal Information:
- Journal of Chemical Information and Modeling, Vol. 65, Issue 16; ISSN 1549-9596; ISSN 1549-960X
- Publisher:
- American Chemical Society (ACS)
- Country of Publication:
- United States
- Language:
- English
Similar Records
Language Models for the Prediction of SARS-CoV-2 Inhibitors
Conference
·
October 2022
· International Journal of High Performance Computing Applications
·
OSTI ID: 1892426
Language models for the prediction of SARS-CoV-2 inhibitors
Journal Article
·
October 2022
· International Journal of High Performance Computing Applications
·
OSTI ID: 1891374