How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting?
- Affiliations:
- Univ. of Cambridge (United Kingdom)
- Univ. of Cambridge (United Kingdom); Science and Technology Facilities Council (STFC), Rutherford Appleton Lab., ISIS Neutron Source, Oxford (United Kingdom)
Language models trained on domain-specific corpora have been employed to improve performance on specialized tasks. However, little previous work has examined how specific a "domain-specific" corpus should be. Here, we test a number of language models trained on corpora of varying specificity by employing them in the task of extracting information from the literature on photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers about photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.
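As a rough illustration of the "pretrained from scratch" setting the abstract describes, the sketch below shows masked-language-model pretraining of a randomly initialized BERT on a narrow domain corpus using Hugging Face Transformers. The corpus file name, the reuse of the stock BERT tokenizer, and all hyperparameters are illustrative assumptions, not the authors' actual PhotocatalysisBERT configuration.

```python
# Minimal sketch: pretraining a BERT-style masked language model from
# scratch on a domain-specific corpus. Hypothetical corpus file and
# hyperparameters; not the paper's exact setup.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed input: one plain-text file of photocatalysis papers/abstracts,
# one document per line.
dataset = load_dataset("text", data_files={"train": "photocatalysis_corpus.txt"})

# For brevity we reuse the stock WordPiece vocabulary; a true from-scratch
# setup would also train the tokenizer on the domain corpus.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialized weights, i.e., pretraining "from scratch".
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Standard dynamic 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="photocatalysis-bert",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The resulting checkpoint could then be fine-tuned for the downstream extraction task (e.g., token classification to tag photocatalysts and activity values) and scored with the precision/recall metrics reported above.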
- Research Organization:
- Argonne National Laboratory (ANL), Argonne, IL (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)
- Grant/Contract Number:
- AC02-06CH11357
- OSTI ID:
- 2469488
- Journal Information:
- Journal of Chemical Information and Modeling, Vol. 64, Issue 8; ISSN 1549-9596
- Publisher:
- American Chemical Society
- Country of Publication:
- United States
- Language:
- English
Similar Records
MechBERT: Language Models for Extracting Chemical and Property Relationships about Mechanical Stress and Strain
Automatic Labeling for Entity Extraction in Cyber Security