OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Integration of PanDA workload management system with Titan supercomputer at OLCF

Abstract

The PanDA (Production and Distributed Analysis) workload management system (WMS) was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. While PanDA currently distributes jobs to more than 100,000 cores at well over 100 Grid sites, the future LHC data taking runs will require more resources than Grid computing can possibly provide. To alleviate these challenges, ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources such as the opportunistic use of supercomputers. We will describe a project aimed at integration of PanDA WMS with Titan supercomputer at Oak Ridge Leadership Computing Facility (OLCF). The current approach utilizes a modified PanDA pilot framework for job submission to Titan’s batch queues and local data management, with light-weight MPI wrappers to run single threaded workloads in parallel on Titan’s multi-core worker nodes. It also gives PanDA new capability to collect, in real time, information about unused worker nodes on Titan, which allows precise definition of the size and duration of jobs submitted to Titan according to available free resources. This capability significantly reduces PanDA job wait time while improving Titan’s utilization efficiency. This implementation was tested with a variety of Monte-Carlo workloads on Titan and is being tested on several other supercomputing platforms.
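
The abstract's idea of sizing a job to the currently free resources can be illustrated with a minimal sketch. This is not the PanDA pilot's code: the type names, the function `shape_job`, the selection rule (most node-minutes), and every numeric limit are assumptions made for illustration only, and the list of idle-node slots is taken as given rather than queried from a real scheduler.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BackfillSlot:
    """Hypothetical record: a block of idle nodes and how long it should stay idle."""
    free_nodes: int
    minutes_available: int


@dataclass(frozen=True)
class JobShape:
    """Hypothetical record: the size and duration to request from the batch queue."""
    nodes: int
    walltime_minutes: int


def shape_job(
    slots: Iterable[BackfillSlot],
    min_walltime: int = 30,   # assumed: shortest run worth submitting
    max_nodes: int = 300,     # assumed: cap on nodes per submission
    safety_margin: int = 5,   # assumed: minutes left unused at the end of a slot
) -> Optional[JobShape]:
    """Pick the job shape that captures the most node-minutes, or None if nothing fits."""
    best: Optional[JobShape] = None
    for slot in slots:
        walltime = slot.minutes_available - safety_margin
        if slot.free_nodes < 1 or walltime < min_walltime:
            continue  # slot is too short or empty to be useful
        candidate = JobShape(min(slot.free_nodes, max_nodes), walltime)
        if best is None or (
            candidate.nodes * candidate.walltime_minutes
            > best.nodes * best.walltime_minutes
        ):
            best = candidate
    return best


if __name__ == "__main__":
    example = [BackfillSlot(100, 20), BackfillSlot(50, 125), BackfillSlot(500, 65)]
    print(shape_job(example))
```

A request shaped this way fits inside resources that are idle anyway, which is the mechanism the abstract credits for shorter wait times and better machine utilization.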

Authors:
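De, K. [1]; Klimentov, A. [2]; Oleynik, D. [1]; Panitkin, S. [2]; Petrosyan, A. [1]; Schovancova, J. [2]; Vaniachine, A. [3]; Wenaus, T. [2]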
  1. Univ. of Texas, Arlington, TX (United States). Dept. of Physics
  2. Brookhaven National Lab. (BNL), Upton, NY (United States)
  3. Argonne National Lab. (ANL), Argonne, IL (United States)
Publication Date:
2015-12-23
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1567387
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 664; Journal Issue: 9; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Physics

Citation Formats

De, K., Klimentov, A., Oleynik, D., Panitkin, S., Petrosyan, A., Schovancova, J., Vaniachine, A., and Wenaus, T. Integration of PanDA workload management system with Titan supercomputer at OLCF. United States: N. p., 2015. Web. doi:10.1088/1742-6596/664/9/092020.
De, K., Klimentov, A., Oleynik, D., Panitkin, S., Petrosyan, A., Schovancova, J., Vaniachine, A., & Wenaus, T. Integration of PanDA workload management system with Titan supercomputer at OLCF. United States. https://doi.org/10.1088/1742-6596/664/9/092020
De, K., Klimentov, A., Oleynik, D., Panitkin, S., Petrosyan, A., Schovancova, J., Vaniachine, A., and Wenaus, T. 2015. "Integration of PanDA workload management system with Titan supercomputer at OLCF". United States. https://doi.org/10.1088/1742-6596/664/9/092020. https://www.osti.gov/servlets/purl/1567387.
@article{osti_1567387,
title = {Integration of PanDA workload management system with Titan supercomputer at OLCF},
author = {De, K. and Klimentov, A. and Oleynik, D. and Panitkin, S. and Petrosyan, A. and Schovancova, J. and Vaniachine, A. and Wenaus, T.},
abstractNote = {The PanDA (Production and Distributed Analysis) workload management system (WMS) was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. While PanDA currently distributes jobs to more than 100,000 cores at well over 100 Grid sites, the future LHC data taking runs will require more resources than Grid computing can possibly provide. To alleviate these challenges, ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources such as the opportunistic use of supercomputers. We will describe a project aimed at integration of PanDA WMS with Titan supercomputer at Oak Ridge Leadership Computing Facility (OLCF). The current approach utilizes a modified PanDA pilot framework for job submission to Titan’s batch queues and local data management, with light-weight MPI wrappers to run single threaded workloads in parallel on Titan’s multi-core worker nodes. It also gives PanDA new capability to collect, in real time, information about unused worker nodes on Titan, which allows precise definition of the size and duration of jobs submitted to Titan according to available free resources. This capability significantly reduces PanDA job wait time while improving Titan’s utilization efficiency. This implementation was tested with a variety of Monte-Carlo workloads on Titan and is being tested on several other supercomputing platforms.},
doi = {10.1088/1742-6596/664/9/092020},
url = {https://www.osti.gov/biblio/1567387},
journal = {Journal of Physics. Conference Series},
issn = {1742-6588},
number = 9,
volume = 664,
place = {United States},
year = {2015},
month = {12}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 3 works
Citation information provided by
Web of Science

Works referenced in this record:

Distributing LHC application software and conditions databases using the CernVM file system
journal, December 2011


Overview of ATLAS PanDA Workload Management
journal, December 2011