OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Use of DAGMan in CRAB3 to improve the splitting of CMS user jobs

Abstract

CRAB3 is a workload management tool used by CMS physicists to analyze data acquired by the Compact Muon Solenoid (CMS) detector at the CERN Large Hadron Collider (LHC). Research in high energy physics often requires the analysis of large collections of files, referred to as datasets. The task is divided into jobs that are distributed among a large collection of worker nodes throughout the Worldwide LHC Computing Grid (WLCG). Splitting a large analysis task into optimally sized jobs is critical to efficient use of distributed computing resources. Jobs that are too big will have excessive runtimes and will not distribute the work across all of the available nodes. However, splitting the project into a large number of very small jobs is also inefficient, as each job creates additional overhead which increases load on infrastructure resources. Currently this splitting is done manually, using parameters provided by the user. However, the resources needed for each job are difficult to predict because of frequent variations in the performance of the user code and the content of the input dataset. As a result, dividing a task into jobs by hand is difficult and often suboptimal. In this work we present a new feature called “automatic splitting” which removes the need for users to manually specify job splitting parameters. We discuss how HTCondor DAGMan can be used to build dynamic Directed Acyclic Graphs (DAGs) to optimize the performance of large CMS analysis jobs on the Grid. We use DAGMan to dynamically generate interconnected DAGs that estimate the processing time the user code will require to analyze each event. This is used to calculate an estimate of the total processing time per job, and a set of analysis jobs is run using this estimate as a specified time limit. Some jobs may not finish within the allotted time; they are terminated at the time limit, and the unfinished data is regrouped into smaller jobs and resubmitted.
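
The splitting strategy described in the abstract can be summarized in a short sketch: probe measurements yield an estimated per-event processing time, the task is partitioned so each job fits a target runtime, and events left unfinished when a job hits its time limit are regrouped into smaller jobs. The Python below is a minimal illustration under those assumptions; the function names, data layout, and the eight-hour target runtime are hypothetical, not the actual CRAB3 implementation.

# Minimal sketch of "automatic splitting" (illustrative only; names,
# data layout, and the target runtime are assumptions, not CRAB3 code).

def estimate_seconds_per_event(probe_results):
    """Average per-event time measured by a few short probe jobs."""
    total_time = sum(r["wall_seconds"] for r in probe_results)
    total_events = sum(r["events_processed"] for r in probe_results)
    return total_time / total_events

def split_into_jobs(total_events, seconds_per_event, target_runtime=8 * 3600):
    """Partition events so each job's estimated runtime is near the target."""
    events_per_job = max(1, int(target_runtime / seconds_per_event))
    return [(start, min(start + events_per_job, total_events))
            for start in range(0, total_events, events_per_job)]

def resplit_unfinished(unfinished_ranges, seconds_per_event, shrink=0.5):
    """Regroup events from jobs terminated at the time limit into
    smaller 'tail' jobs with a reduced target runtime."""
    tail_jobs = []
    for start, end in unfinished_ranges:
        for lo, hi in split_into_jobs(end - start, seconds_per_event,
                                      target_runtime=shrink * 8 * 3600):
            tail_jobs.append((start + lo, start + hi))
    return tail_jobs

# Example: two probes averaging ~3.2 s/event give jobs of ~9000 events,
# so a 1M-event dataset splits into roughly 110 jobs.
probes = [{"wall_seconds": 1200, "events_processed": 400},
          {"wall_seconds": 1500, "events_processed": 450}]
rate = estimate_seconds_per_event(probes)
jobs = split_into_jobs(1_000_000, rate)

A real implementation would also track which input file each event range belongs to; the point here is only the feedback loop from probe estimate to job size to tail resubmission.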

Authors:
 Wolf, M. [1]; Mascheroni, M. [2]; Woodard, A. [1]; Belforte, S. [3]; Bockelman, B. [4]; Hernandez, J. M. [5]; Vaandering, E. [2]
  1. Univ. of Notre Dame, IN (United States)
  2. Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
  3. Istituto Nazionale di Fisica Nucleare (INFN), Trieste (Italy)
  4. Univ. of Nebraska, Lincoln, NE (United States)
  5. Research Centre for Energy, Environment and Technology (CIEMAT), Madrid (Spain)
Publication Date:
November 22, 2017
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP)
OSTI Identifier:
1420914
Report Number(s):
FERMILAB-CONF-16-753-CD
Journal ID: ISSN 1742-6588; 1638491
Grant/Contract Number:  
AC02-07CH11359
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 898; Journal Issue: 5; Conference: 22nd International Conference on Computing in High Energy and Nuclear Physics, San Francisco, CA, October 10-14, 2016; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English
Subject:
72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING

Citation Formats

Wolf, M., Mascheroni, M., Woodard, A., Belforte, S., Bockelman, B., Hernandez, J. M., and Vaandering, E. Use of DAGMan in CRAB3 to improve the splitting of CMS user jobs. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/5/052035.
Wolf, M., Mascheroni, M., Woodard, A., Belforte, S., Bockelman, B., Hernandez, J. M., & Vaandering, E. Use of DAGMan in CRAB3 to improve the splitting of CMS user jobs. United States. https://doi.org/10.1088/1742-6596/898/5/052035
Wolf, M., Mascheroni, M., Woodard, A., Belforte, S., Bockelman, B., Hernandez, J. M., and Vaandering, E. 2017. "Use of DAGMan in CRAB3 to improve the splitting of CMS user jobs". United States. https://doi.org/10.1088/1742-6596/898/5/052035. https://www.osti.gov/servlets/purl/1420914.
@article{osti_1420914,
title = {Use of DAGMan in CRAB3 to improve the splitting of CMS user jobs},
author = {Wolf, M. and Mascheroni, M. and Woodard, A. and Belforte, S. and Bockelman, B. and Hernandez, J. M. and Vaandering, E.},
abstractNote = {CRAB3 is a workload management tool used by CMS physicists to analyze data acquired by the Compact Muon Solenoid (CMS) detector at the CERN Large Hadron Collider (LHC). Research in high energy physics often requires the analysis of large collections of files, referred to as datasets. The task is divided into jobs that are distributed among a large collection of worker nodes throughout the Worldwide LHC Computing Grid (WLCG). Splitting a large analysis task into optimally sized jobs is critical to efficient use of distributed computing resources. Jobs that are too big will have excessive runtimes and will not distribute the work across all of the available nodes. However, splitting the project into a large number of very small jobs is also inefficient, as each job creates additional overhead which increases load on infrastructure resources. Currently this splitting is done manually, using parameters provided by the user. However, the resources needed for each job are difficult to predict because of frequent variations in the performance of the user code and the content of the input dataset. As a result, dividing a task into jobs by hand is difficult and often suboptimal. In this work we present a new feature called “automatic splitting” which removes the need for users to manually specify job splitting parameters. We discuss how HTCondor DAGMan can be used to build dynamic Directed Acyclic Graphs (DAGs) to optimize the performance of large CMS analysis jobs on the Grid. We use DAGMan to dynamically generate interconnected DAGs that estimate the processing time the user code will require to analyze each event. This is used to calculate an estimate of the total processing time per job, and a set of analysis jobs is run using this estimate as a specified time limit. Some jobs may not finish within the allotted time; they are terminated at the time limit, and the unfinished data is regrouped into smaller jobs and resubmitted.},
doi = {10.1088/1742-6596/898/5/052035},
url = {https://www.osti.gov/biblio/1420914},
journal = {Journal of Physics. Conference Series},
issn = {1742-6588},
number = 5,
volume = 898,
place = {United States},
year = {2017},
month = {11}
}

Journal Article: Free Publicly Available Full Text (Publisher's Version of Record)

Figures / Tables:

Figure 1: Workflow for a user submission of a task to the CRAB3 infrastructure. Tasks are accepted by a frontend and processed in several steps. First, metadata about the dataset the user wants to run over is acquired, which is used to split the task into jobs. Using this partitioning, a DAG is created to describe the job execution and submitted to an HTCondor scheduler, where DAGMan will execute pre-jobs, jobs, and post-jobs as specified in the DAG. Also shown, dashed, is the dry-run splitting estimation, which runs on the user machine and resumes the workflow with the submission to the scheduler after the splitting has been confirmed by the user.
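
The pre-job/job/post-job structure shown in the figure can be made concrete with a small sketch that emits an HTCondor DAG description using standard DAGMan directives (JOB, SCRIPT PRE/POST, SUBDAG EXTERNAL, PARENT/CHILD). The node names, scripts, and submit files below are hypothetical, not the actual files CRAB3 generates.

# Sketch: write a DAGMan file with a probe stage, a set of analysis jobs,
# and a SUBDAG node for the dynamically generated "tail" resubmission.
# All file and script names here are hypothetical.

def write_task_dag(path, n_jobs):
    lines = ["JOB Probe probe.submit",
             # The probe's post-script would compute the per-event estimate.
             "SCRIPT POST Probe estimate_runtime.sh"]
    for i in range(n_jobs):
        name = f"Job{i}"
        lines.append(f"JOB {name} analysis.submit")
        lines.append(f"SCRIPT PRE {name} stage_in.sh {i}")   # prepare inputs
        lines.append(f"SCRIPT POST {name} collect.sh {i}")   # gather status
        lines.append(f"PARENT Probe CHILD {name}")
    # tail.dag is generated at run time, once the unfinished event
    # ranges are known, and executed as its own DAG.
    lines.append("SUBDAG EXTERNAL Tail tail.dag")
    lines.append("PARENT " + " ".join(f"Job{i}" for i in range(n_jobs))
                 + " CHILD Tail")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

write_task_dag("task.dag", n_jobs=3)  # then submit with condor_submit_dag

Handing the resulting file to condor_submit_dag lets DAGMan enforce the probe-then-analysis-then-tail ordering, which is the mechanism the paper uses to defer the final splitting decision until the runtime estimate is available.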
