PhysBERT: A text embedding model for physics scientific literature
The specialized language and complex concepts in physics pose significant challenges for information extraction through Natural Language Processing (NLP). Central to effective NLP applications is the text embedding model, which converts text into dense vector representations for efficient information retrieval and semantic analysis. In this work, we introduce PhysBERT, the first physics-specific text embedding model. Pre-trained on a curated corpus of 1.2 × 10⁶ arXiv physics papers and fine-tuned with supervised data, PhysBERT outperforms leading general-purpose models on physics-specific tasks, and it can be fine-tuned effectively for specific physics subdomains.
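The retrieval use case described in the abstract can be sketched as follows: each text is mapped to a dense vector, and documents are ranked by cosine similarity to the query vector. This is a minimal illustration only; the document titles and low-dimensional vectors below are made-up placeholders, not PhysBERT outputs (a real model would produce vectors with hundreds of dimensions).

```python
import numpy as np

# Hypothetical dense embeddings (dimension 4 for illustration); a real
# embedding model such as PhysBERT maps each text to a high-dimensional
# vector so that semantically related texts end up close together.
corpus = {
    "superconducting cavity design": np.array([0.9, 0.1, 0.0, 0.2]),
    "quark-gluon plasma dynamics":   np.array([0.1, 0.8, 0.3, 0.0]),
    "beam diagnostics overview":     np.array([0.6, 0.0, 0.5, 0.1]),
}
query = np.array([0.85, 0.05, 0.05, 0.25])  # embedding of a user query

def cosine(u, v):
    # Cosine similarity: dot product of the vectors over the product
    # of their Euclidean norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank documents by similarity to the query embedding, best match first.
ranked = sorted(corpus, key=lambda k: cosine(query, corpus[k]), reverse=True)
print(ranked[0])
```

In practice the embeddings would come from the model itself rather than being hard-coded, and large corpora would use an approximate nearest-neighbor index instead of an exhaustive sort.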
- Sponsoring Organization:
- USDOE
- Grant/Contract Number:
- AC02-05CH11231
- OSTI ID:
- 2564799
- Journal Information:
- APL Machine Learning, Vol. 2, Issue 4; ISSN 2770-9019
- Publisher:
- American Institute of Physics
- Country of Publication:
- United States
- Language:
- English
Similar Records
- Domain-specific text embedding model for accelerator physics — Journal Article · Physical Review Accelerators and Beams (2025) · OSTI ID: 2556923
- Teaching AI when to care about gender — Journal Article · Code4Lib Journal (2022) · OSTI ID: 1885750