Integration of Titan supercomputer at OLCF with ATLAS Production System

Barreiro Megino, F.; De, K.; Jha, S.; Klimentov, A.; Maeno, T.; Nilsson, P.; Oleynik, D.; Padolski, S.; Panitkin, S.; Wells, J.; Wenaus, T.

doi:10.1088/1742-6596/898/9/092002


Abstract

The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundreds of petabytes of data, and the rate of data processing already exceeds an exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide. Additional computing and storage resources are required. ATLAS is therefore engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. In this paper we describe a project aimed at integrating the ATLAS Production System with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA Pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-node workloads in parallel on Titan's multi-core worker nodes. It allows standard ATLAS production jobs to run on otherwise unused resources (backfill) on Titan. The system has already allowed ATLAS to collect millions of core-hours per month on Titan and to execute hundreds of thousands of jobs, while simultaneously improving Titan's utilization efficiency. We discuss the details of the implementation, current experience with running the system, and future plans aimed at improving scalability and efficiency.
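The "lightweight MPI wrapper" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the third-party mpi4py package as the MPI binding, and every function name here (`get_rank_and_size`, `payload_for_rank`, `run_payload`, `main`) is hypothetical. Each MPI rank picks one independent serial payload and runs it in its own working directory, so a single batch submission fans out many single-node jobs.

```python
import os
import subprocess
import sys
import tempfile


def get_rank_and_size():
    """Return (rank, size) from mpi4py if installed, else (0, 1) for a serial dry run."""
    try:
        from mpi4py import MPI  # assumption: mpi4py is the MPI binding in use
    except ImportError:
        return 0, 1
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


def payload_for_rank(rank, payloads):
    """Pick the command for this rank, or None when ranks outnumber payloads."""
    return payloads[rank] if rank < len(payloads) else None


def run_payload(rank, command, base_dir):
    """Run one payload in a per-rank working directory and return its exit code."""
    workdir = os.path.join(base_dir, f"rank_{rank}")
    os.makedirs(workdir, exist_ok=True)
    # Each rank logs to its own file so the independent payloads never share output.
    with open(os.path.join(workdir, "payload.log"), "w") as log:
        return subprocess.call(command, cwd=workdir, stdout=log,
                               stderr=subprocess.STDOUT)


def main(payloads, base_dir):
    """Entry point executed by every rank; ranks without a payload exit cleanly."""
    rank, _size = get_rank_and_size()
    command = payload_for_rank(rank, payloads)
    if command is None:
        return 0
    return run_payload(rank, command, base_dir)
```

Under an MPI launcher, every rank would call `main(payloads, base_dir)` with the same payload list; without mpi4py installed the sketch degrades to running only the first payload, which is convenient for a local dry run.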

Authors: Barreiro Megino, F.; De, K.; Jha, S.; Klimentov, A.; Maeno, T.; Nilsson, P.; Oleynik, D.; Padolski, S.; Panitkin, S.; Wells, J.; Wenaus, T.
Publication Date: 2017-10-01
Research Org.: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.: USDOE Office of Science (SC)
Contributing Org.: ATLAS Collaboration
OSTI Identifier: 1567554
Resource Type: Accepted Manuscript
Journal Name: Journal of Physics. Conference Series
Additional Journal Information: Journal Volume: 898; Journal ID: ISSN 1742-6588
Publisher: IOP Publishing
Country of Publication: United States
Language: English
Subject: 97 MATHEMATICS AND COMPUTING

Citation Formats


Barreiro Megino, F., De, K., Jha, S., Klimentov, A., Maeno, T., Nilsson, P., Oleynik, D., Padolski, S., Panitkin, S., Wells, J., and Wenaus, T. Integration of Titan supercomputer at OLCF with ATLAS Production System. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/9/092002.



Barreiro Megino, F., De, K., Jha, S., Klimentov, A., Maeno, T., Nilsson, P., Oleynik, D., Padolski, S., Panitkin, S., Wells, J., & Wenaus, T. Integration of Titan supercomputer at OLCF with ATLAS Production System. United States. https://doi.org/10.1088/1742-6596/898/9/092002



Barreiro Megino, F., De, K., Jha, S., Klimentov, A., Maeno, T., Nilsson, P., Oleynik, D., Padolski, S., Panitkin, S., Wells, J., and Wenaus, T. 2017. "Integration of Titan supercomputer at OLCF with ATLAS Production System". United States. https://doi.org/10.1088/1742-6596/898/9/092002. https://www.osti.gov/servlets/purl/1567554.



                    
@article{osti_1567554,
  title        = {Integration of Titan supercomputer at OLCF with ATLAS Production System},
  author       = {Barreiro Megino, F. and De, K. and Jha, S. and Klimentov, A. and Maeno, T. and Nilsson, P. and Oleynik, D. and Padolski, S. and Panitkin, S. and Wells, J. and Wenaus, T.},
  abstractNote = {The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundreds of petabytes of data, and the rate of data processing already exceeds an exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide. Additional computing and storage resources are required. ATLAS is therefore engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. In this paper we describe a project aimed at integrating the ATLAS Production System with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA Pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-node workloads in parallel on Titan's multi-core worker nodes. It allows standard ATLAS production jobs to run on otherwise unused resources (backfill) on Titan. The system has already allowed ATLAS to collect millions of core-hours per month on Titan and to execute hundreds of thousands of jobs, while simultaneously improving Titan's utilization efficiency. We discuss the details of the implementation, current experience with running the system, and future plans aimed at improving scalability and efficiency.},
  doi          = {10.1088/1742-6596/898/9/092002},
  journal      = {Journal of Physics. Conference Series},
  volume       = {898},
  place        = {United States},
  year         = {2017},
  month        = oct
}




Works referenced in this record:

The ATLAS PanDA Pilot in Operation
journal, December 2011

Nilsson, P.; Caballero, J.; De, K.
Journal of Physics: Conference Series, Vol. 331, Issue 6
DOI: 10.1088/1742-6596/331/6/062040

Scaling up ATLAS production system for the LHC Run 2 and beyond: project ProdSys2
journal, December 2015

Borodin, M.; De, K.; Garcia, J.
Journal of Physics: Conference Series, Vol. 664, Issue 6
DOI: 10.1088/1742-6596/664/6/062005

Overview of ATLAS PanDA Workload Management
journal, December 2011

Maeno, T.; De, K.; Wenaus, T.
Journal of Physics: Conference Series, Vol. 331, Issue 7
DOI: 10.1088/1742-6596/331/7/072024

Similar Records in DOE PAGES and OSTI.GOV collections:

Integration of Titan supercomputer at OLCF with ATLAS Production System

Conference: Barreiro Megino, F.; De, Kaushik; Jha, Shantenu; ...

The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA managed resources are distributed worldwide, on hundreds of computing sites, with thousands of physicists accessing hundreds of Petabytes of data and the rate of data processing already exceeds Exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data taking runs will require more resources than Grid computing can possibly provide. Additional computing and storage resources are required. Therefore ATLAS is engaged in an ambitious program ...
Full Text Available

Integration of PanDA workload management system with Titan supercomputer at OLCF

Journal Article: De, K.; Klimentov, A.; Oleynik, D.; ... - Journal of Physics. Conference Series

The PanDA (Production and Distributed Analysis) workload management system (WMS) was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. While PanDA currently distributes jobs to more than 100,000 cores at well over 100 Grid sites, the future LHC data taking runs will require more resources than Grid computing can possibly provide. To alleviate these challenges, ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources such as the opportunistic use of supercomputers. We will describe a project aimed at integration of PanDA WMS with Titan supercomputer ...
Cited by 3
https://doi.org/10.1088/1742-6596/664/9/092020
Full Text Available

INTEGRATION OF PANDA WORKLOAD MANAGEMENT SYSTEM WITH SUPERCOMPUTERS

Conference: De, K; Jha, S; Maeno, T; ...

The Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. Experiments at the LHC explore the fundamental nature of matter and the basic forces that shape our universe, and were recently credited for the discovery of a Higgs boson. ATLAS, one of the largest collaborations ever assembled in the sciences, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment is relying on a heterogeneous distributed computational infrastructure. The ATLAS experiment uses PanDA (Production and Data ...
https://doi.org/10.1134/S1547477116050150

Integration Of PanDA Workload Management System With Supercomputers for ATLAS and Data Intensive Science

Conference: De, K; Jha, S; Klimentov, A; ...

The Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. Experiments at the LHC explore the fundamental nature of matter and the basic forces that shape our universe, and were recently credited for the discovery of a Higgs boson. ATLAS, one of the largest collaborations ever assembled in the sciences, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment is relying on a heterogeneous distributed computational infrastructure. The ATLAS experiment uses PanDA (Production and Data Analysis) Workload Management ...
Full Text Available

Integration Of PanDA Workload Management System With Supercomputers for ATLAS and Data Intensive Science

Journal Article: Klimentov, A.; De, K.; Jha, S.; ... - Journal of Physics. Conference Series

The LHC, operating at CERN, is leading Big Data driven scientific explorations. Experiments at the LHC explore the fundamental nature of matter and the basic forces that shape our universe. ATLAS, one of the largest collaborations ever assembled in the sciences, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment is relying on a heterogeneous distributed computational infrastructure. The ATLAS experiment uses PanDA (Production and Data Analysis) Workload Management System for managing the workflow for all data processing on over 150 data centers. Through PanDA, ATLAS physicists see a single ...
https://doi.org/10.1088/1742-6596/762/1/012021
Full Text Available
