DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Integration of Titan supercomputer at OLCF with ATLAS Production System

Abstract

The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundreds of petabytes of data, and the rate of data processing already exceeds an exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide. Additional computing and storage resources are required. ATLAS is therefore engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. In this paper we describe a project aimed at integrating the ATLAS Production System with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA Pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-node workloads in parallel on Titan's multi-core worker nodes. It allows standard ATLAS production jobs to run on otherwise unused resources (backfill) on Titan. The system has already allowed ATLAS to collect millions of core-hours per month on Titan and to execute hundreds of thousands of jobs, while simultaneously improving Titan's utilization efficiency. We discuss the details of the implementation, current experience with running the system, and future plans aimed at improving scalability and efficiency.
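The MPI-wrapper idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the project's actual wrapper: the function names (get_rank_and_size, run_payload), the per-rank directory layout, and the use of mpi4py are all assumptions made for illustration. The point it shows is that each MPI rank launches one independent, single-node payload in its own working directory, so that many unrelated jobs can share one batch submission.

```python
import os
import subprocess
import sys
import tempfile


def get_rank_and_size():
    """Return (rank, size) from MPI if mpi4py is installed, else (0, 1).

    mpi4py is an assumed dependency here; the source does not name the
    MPI binding used by the real wrapper.
    """
    try:
        from mpi4py import MPI
    except ImportError:
        return 0, 1
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


def run_payload(rank, payloads, work_root):
    """Run the payload assigned to this rank in a rank-private directory.

    payloads is a list of argv lists, one per rank (hypothetical layout).
    Returns the payload's exit code, or None if this rank has no payload.
    """
    if rank >= len(payloads):
        return None
    workdir = os.path.join(work_root, f"rank_{rank:05d}")
    os.makedirs(workdir, exist_ok=True)
    # Capture stdout and stderr per rank so payload logs do not interleave.
    with open(os.path.join(workdir, "payload.log"), "w") as log:
        result = subprocess.run(
            payloads[rank], cwd=workdir, stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode


if __name__ == "__main__":
    rank, size = get_rank_and_size()
    # Trivial stand-in commands; a real payload would be a full job script.
    payloads = [[sys.executable, "-c", f"print('payload {i}')"] for i in range(size)]
    code = run_payload(rank, payloads, tempfile.mkdtemp(prefix="wrapper_"))
    print(f"rank {rank} of {size} finished with exit code {code}")
```

Launched under an MPI launcher with one rank per node, each rank would pick up a different payload; run directly without MPI, the sketch falls back to a single rank. The payloads never communicate with each other, which is what makes a thin wrapper of this kind sufficient for single-node workloads.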

Authors:
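Barreiro Megino, F.; De, K.; Jha, S.; Klimentov, A.; Maeno, T.; Nilsson, P.; Oleynik, D.; Padolski, S.; Panitkin, S.; Wells, J.; Wenaus, T.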
Publication Date:
2017-10-01
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
Contributing Org.:
ATLAS Collaboration
OSTI Identifier:
1567554
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 898; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Barreiro Megino, F., De, K., Jha, S., Klimentov, A., Maeno, T., Nilsson, P., Oleynik, D., Padolski, S., Panitkin, S., Wells, J., and Wenaus, T. Integration of Titan supercomputer at OLCF with ATLAS Production System. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/9/092002.
Barreiro Megino, F., De, K., Jha, S., Klimentov, A., Maeno, T., Nilsson, P., Oleynik, D., Padolski, S., Panitkin, S., Wells, J., & Wenaus, T. (2017). Integration of Titan supercomputer at OLCF with ATLAS Production System. United States. https://doi.org/10.1088/1742-6596/898/9/092002
Barreiro Megino, F., De, K., Jha, S., Klimentov, A., Maeno, T., Nilsson, P., Oleynik, D., Padolski, S., Panitkin, S., Wells, J., and Wenaus, T. 2017. "Integration of Titan supercomputer at OLCF with ATLAS Production System". United States. https://doi.org/10.1088/1742-6596/898/9/092002. https://www.osti.gov/servlets/purl/1567554.
@article{osti_1567554,
title = {Integration of Titan supercomputer at OLCF with ATLAS Production System},
author = {Barreiro Megino, F. and De, K. and Jha, S. and Klimentov, A. and Maeno, T. and Nilsson, P. and Oleynik, D. and Padolski, S. and Panitkin, S. and Wells, J. and Wenaus, T.},
abstractNote = {The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundreds of petabytes of data, and the rate of data processing already exceeds an exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide. Additional computing and storage resources are required. ATLAS is therefore engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. In this paper we describe a project aimed at integrating the ATLAS Production System with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA Pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-node workloads in parallel on Titan's multi-core worker nodes. It allows standard ATLAS production jobs to run on otherwise unused resources (backfill) on Titan. The system has already allowed ATLAS to collect millions of core-hours per month on Titan and to execute hundreds of thousands of jobs, while simultaneously improving Titan's utilization efficiency. We discuss the details of the implementation, current experience with running the system, and future plans aimed at improving scalability and efficiency.},
doi = {10.1088/1742-6596/898/9/092002},
journal = {Journal of Physics. Conference Series},
volume = {898},
place = {United States},
year = {2017},
month = {10}
}

Works referenced in this record:

The ATLAS PanDA Pilot in Operation
journal, December 2011


Scaling up ATLAS production system for the LHC Run 2 and beyond: project ProdSys2
journal, December 2015


Overview of ATLAS PanDA Workload Management
journal, December 2011