DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting?

Journal Article · Journal of Chemical Information and Modeling
Authors: [1]; [2]
  1. Univ. of Cambridge (United Kingdom)
  2. Univ. of Cambridge (United Kingdom); Science and Technology Facilities Council (STFC), Oxford (United Kingdom). Rutherford Appleton Lab., ISIS Neutron Source

Language models trained on domain-specific corpora have been employed to improve performance on specialized tasks. However, little prior work has examined how specific a “domain-specific” corpus should be. Here, we test a number of language models trained on corpora of varying specificity by employing them to extract information from the literature on photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers about photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.
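The pretraining-from-scratch step the abstract describes can be illustrated with a short sketch. The following is a minimal Hugging Face-style masked-language-modeling setup, not the authors' actual code: the corpus file name (photocatalysis_corpus.txt), the choice of the bert-base-cased tokenizer, and all hyperparameters are assumptions for illustration only.

    from datasets import load_dataset
    from transformers import (
        BertConfig,
        BertForMaskedLM,
        BertTokenizerFast,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Hypothetical one-document-per-line corpus of papers on photocatalytic
    # water splitting; the study's actual corpus is not reproduced here.
    corpus = load_dataset("text", data_files={"train": "photocatalysis_corpus.txt"})

    # Tokenizer choice is an assumption; a domain study may train its own vocabulary.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

    # Randomly initialized BERT-base weights: this is what "pretrained from
    # scratch" means, as opposed to continued pretraining from a checkpoint.
    model = BertForMaskedLM(BertConfig())

    # Standard masked-language-modeling objective (15% of tokens masked).
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="photocatalysisbert", num_train_epochs=3),
        train_dataset=tokenized["train"],
        data_collator=collator,
    )
    trainer.train()

Starting from a randomly initialized BertConfig, rather than a general-purpose checkpoint such as BERT or a scientific one such as MatSciBERT, is what distinguishes the from-scratch regime the abstract reports for PhotocatalysisBERT.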

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Basic Energy Sciences (BES), Scientific User Facilities (SUF)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
2469488
Journal Information:
Journal of Chemical Information and Modeling, Vol. 64, Issue 8; ISSN 1549-9596
Publisher:
American Chemical Society
Country of Publication:
United States
Language:
English
