DOE Data Explorer
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Tokenized Data for FORGE Foundation Models

Abstract

This dataset comprises a corpus of 257 billion tokens, together with the corresponding vocabulary file, used in the pre-training of the FORGE foundation models. The corpus consists of scientific documents drawn from diverse sources and tokenized with the Hugging Face BPE tokenizer. Further details about this research can be found in the publication "FORGE: Pre-Training Open Foundation Models for Science" by Junqi Yin, Sajal Dash, Feiyi Wang, and Mallikarjun (Arjun) Shankar, presented at SC'23. The data tokenization pipeline and resulting artifacts use CORE data [Ref: Knoth, P., & Zdrahal, Z. (2012). CORE: Three access levels to underpin open access. D-Lib Magazine, 18(11/12)]. For any use of these data sets, please follow the guidelines at https://core.ac.uk/terms.
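As a rough illustration of the tokenization step described above, the following Python sketch trains and applies a byte-level BPE tokenizer with the Hugging Face tokenizers library. This is not the authors' released pipeline; the vocabulary size, special tokens, and file names are placeholder assumptions.

# Minimal sketch of byte-level BPE tokenization with the Hugging Face
# "tokenizers" library; NOT the FORGE pipeline. Vocabulary size, special
# tokens, and file names below are placeholder assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Build a BPE model with a byte-level pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = ByteLevel()

# Train a vocabulary on a plain-text corpus (one document per line).
trainer = BpeTrainer(vocab_size=50000, special_tokens=["<unk>", "<eos>"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical input file
tokenizer.save("forge_vocab.json")                      # hypothetical output name

# Encode a document into token IDs for pre-training.
ids = tokenizer.encode("Scientific text to tokenize.").ids
print(len(ids), ids[:10])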

Authors:
Yin, Junqi [1]; Dash, Sajal [1]; Wang, Feiyi [1]; Shankar, Mallikarjun [1]
  1. ORNL-OLCF
Publication Date:
October 18, 2023
DOE Contract Number:  
AC05-00OR22725
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
Office of Science (SC)
OSTI Identifier:
2004951
DOI:
https://doi.org/10.13139/OLCF/2004951

Citation Formats

Yin, Junqi, Dash, Sajal, Wang, Feiyi, and Shankar, Mallikarjun. Tokenized Data for FORGE Foundation Models. United States: N. p., 2023. Web. doi:10.13139/OLCF/2004951.
Yin, Junqi, Dash, Sajal, Wang, Feiyi, & Shankar, Mallikarjun. Tokenized Data for FORGE Foundation Models. United States. https://doi.org/10.13139/OLCF/2004951
Yin, Junqi, Dash, Sajal, Wang, Feiyi, and Shankar, Mallikarjun. 2023. "Tokenized Data for FORGE Foundation Models". United States. https://doi.org/10.13139/OLCF/2004951. https://www.osti.gov/servlets/purl/2004951.
@article{osti_2004951,
title = {Tokenized Data for FORGE Foundation Models},
author = {Yin, Junqi and Dash, Sajal and Wang, Feiyi and Shankar, Mallikarjun},
abstractNote = {This dataset comprises a corpus of 257 billion tokens, together with the corresponding vocabulary file, used in the pre-training of the FORGE foundation models. The corpus consists of scientific documents drawn from diverse sources and tokenized with the Hugging Face BPE tokenizer. Further details about this research can be found in the publication "FORGE: Pre-Training Open Foundation Models for Science" by Junqi Yin, Sajal Dash, Feiyi Wang, and Mallikarjun (Arjun) Shankar, presented at SC'23. The data tokenization pipeline and resulting artifacts use CORE data [Ref: Knoth, P., & Zdrahal, Z. (2012). CORE: Three access levels to underpin open access. D-Lib Magazine, 18(11/12)]. For any use of these data sets, please follow the guidelines at https://core.ac.uk/terms.},
doi = {10.13139/OLCF/2004951},
url = {https://www.osti.gov/servlets/purl/2004951},
place = {United States},
year = {2023},
month = {10}
}