Distributed Machine Learning Workflow with PanDA and iDDS in LHC ATLAS
Journal Article · EPJ Web of Conferences (Online)
- Brookhaven National Laboratory (BNL), Upton, NY (United States)
- Univ. of Wisconsin, Madison, WI (United States)
- Univ. of Texas, Arlington, TX (United States)
- Univ. of Pittsburgh, PA (United States)
Machine Learning (ML) has become an important tool for High Energy Physics analysis. As datasets at the Large Hadron Collider (LHC) grow and search spaces expand to exploit the physics potential, ML tasks require ever more computing resources. In addition, advanced ML workflows are being developed in which one task may depend on the results of previous tasks. How to use the vast distributed CPUs and GPUs of the Worldwide LHC Computing Grid (WLCG) for these large, complex ML tasks has become an active research area. In this paper, we present our efforts to enable the execution of distributed ML workflows on the Production and Distributed Analysis (PanDA) system and the intelligent Data Delivery Service (iDDS). First, we describe how PanDA and iDDS handle large-scale ML workflows, including the implementation that processes workloads on diverse and geographically distributed computing resources. Next, we report real-world use cases, such as hyperparameter optimization, Monte Carlo toy confidence limit calculation, and active learning. Finally, we conclude with future plans.
- Research Organization:
- Brookhaven National Laboratory (BNL), Upton, NY (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), High Energy Physics (HEP)
- Grant/Contract Number:
- SC0012704
- OSTI ID:
- 2428916
- Report Number(s):
- BNL--225899-2024-JAAM
- Journal Information:
- EPJ Web of Conferences (Online), Vol. 295; ISSN 2100-014X
- Publisher:
- EDP Sciences
- Country of Publication:
- United States
- Language:
- English