U.S. Department of Energy
Office of Scientific and Technical Information

PhysBERT: A text embedding model for physics scientific literature

Journal Article · APL Machine Learning
DOI: https://doi.org/10.1063/5.0238090 · OSTI ID: 2564799

The specialized language and complex concepts of physics pose significant challenges for information extraction through Natural Language Processing (NLP). Central to effective NLP applications is the text embedding model, which converts text into dense vector representations for efficient information retrieval and semantic analysis. In this work, we introduce PhysBERT, the first physics-specific text embedding model. Pre-trained on a curated corpus of 1.2 × 10⁶ arXiv physics papers and fine-tuned with supervised data, PhysBERT outperforms leading general-purpose models on physics-specific tasks, including when further fine-tuned for specific physics subdomains.
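To illustrate the retrieval use case the abstract describes, the sketch below ranks a small corpus by cosine similarity between embedding vectors. The four-dimensional vectors and corpus entries are hypothetical placeholders, not PhysBERT output; a real embedding model produces vectors with hundreds of dimensions, but the ranking logic is the same.

```python
import math

# Hypothetical 4-dimensional embeddings for illustration only; a real model
# such as PhysBERT maps each text to a much higher-dimensional vector.
corpus = {
    "superconductivity in cuprates": [0.9, 0.1, 0.0, 0.2],
    "beam dynamics in accelerators": [0.1, 0.8, 0.3, 0.0],
    "gauge symmetry breaking":       [0.2, 0.1, 0.9, 0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: a.b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, corpus, top_k=1):
    """Rank corpus entries by cosine similarity to the query embedding."""
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return ranked[:top_k]

# A query embedding that happens to lie close to the first corpus entry.
query = [0.85, 0.15, 0.05, 0.25]
print(retrieve(query, corpus)[0][0])  # prints "superconductivity in cuprates"
```

In practice the query and corpus vectors would come from the embedding model itself, and large corpora would use an approximate nearest-neighbor index rather than an exhaustive sort.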

Sponsoring Organization:
USDOE
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
2564799
Journal Information:
APL Machine Learning, Vol. 2, Issue 4; ISSN 2770-9019
Publisher:
American Institute of Physics
Country of Publication:
United States
Language:
English


Similar Records

Domain-specific text embedding model for accelerator physics
Journal Article · April 2025 · Physical Review Accelerators and Beams · OSTI ID:2556923

Teaching AI when to care about gender
Journal Article · August 2022 · Code4Lib Journal · OSTI ID:1885750

Related Subjects