DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Accelerating science: The usage of commercial clouds in ATLAS Distributed Computing

Journal Article · EPJ Web of Conferences (Online)
 [1];  [2];  [1];  [3];  [4];  [5];  [6];  [3];  [4];  [1];  [3];  [7];  [8];  [3];  [9];  [3];  [10];  [11]
  1. Univ. of Texas, Arlington, TX (United States)
  2. Univ. of Iowa, Iowa City, IA (United States)
  3. Brookhaven National Laboratory (BNL), Upton, NY (United States)
  4. European Organization for Nuclear Research (CERN), Geneva (Switzerland)
  5. Ludwig Maximilian Univ. of Munich, Munich (Germany)
  6. Max Planck Institute for Physics, Munich (Germany)
  7. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
  8. Port d’Informació Científica, Barcelona (Spain)
  9. Univ. of Massachusetts, Amherst, MA (United States)
  10. Deutsches Elektronen-Synchrotron (DESY), Hamburg (Germany)
  11. California State University, Fresno, CA (United States)

The ATLAS experiment at CERN is one of the largest scientific machines built to date, and its computing needs will keep growing as the Large Hadron Collider collects ever larger volumes of data over the next 20 years. ATLAS is conducting R&D projects on Amazon Web Services and Google Cloud as complementary resources for distributed computing, focusing on key features of commercial clouds: lightweight operation, elasticity and the availability of multiple chip architectures. The proof-of-concept phases concluded with a cloud-native, vendor-agnostic integration with the experiment’s data and workload management frameworks. Google Cloud has been used to evaluate elastic batch computing, ramping up ephemeral clusters of up to O(100k) cores to process tasks requiring quick turnaround. Amazon Web Services has been used for the successful physics validation of the Athena simulation software on ARM processors. We have also set up an interactive facility for physics analysis that allows end users to spin up private, on-demand clusters for parallel computing with up to 4 000 cores, or to run GPU-enabled notebooks and jobs for machine learning applications. The success of the proof-of-concept phases has led to an extension of the Google Cloud project, in which ATLAS will study the total cost of ownership of a production cloud site running 10k cores on average over 15 months, fully integrated with distributed grid computing resources, while continuing the R&D projects.
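To put the scales quoted in the abstract in perspective, here is a minimal back-of-the-envelope sketch. It is not taken from the paper. It assumes an average month of 730 hours (365 × 24 / 12) and idealized, perfectly parallel work with no cluster ramp-up or scheduling overhead. The 1,000,000 core-hour task is a purely hypothetical workload chosen to illustrate why elastic O(100k)-core bursts shorten turnaround.

```python
# Back-of-the-envelope arithmetic for the cloud-site scales quoted in the abstract.
# Assumptions (not from the paper): 730 h per average month; ideal linear scaling.

HOURS_PER_MONTH = 365 * 24 / 12  # average month length in hours (730.0), an assumption


def core_hours(avg_cores: float, months: float) -> float:
    """Total core-hours delivered by a site running `avg_cores` on average for `months`."""
    return avg_cores * months * HOURS_PER_MONTH


def ideal_wall_hours(work_core_hours: float, cores: int) -> float:
    """Idealized wall-clock hours for an embarrassingly parallel workload.

    Ignores cluster ramp-up, scheduling overhead and straggler jobs.
    """
    return work_core_hours / cores


if __name__ == "__main__":
    # Production-site study: ~10k cores on average over 15 months.
    print(f"{core_hours(10_000, 15):,.0f} core-hours")

    # Hypothetical 1M core-hour task: steady 10k-core site vs. a 100k-core burst.
    work = 1_000_000
    for cores in (10_000, 100_000):
        print(f"{cores:>7,} cores -> {ideal_wall_hours(work, cores):.0f} h")
```

Under these assumptions, the 15-month study corresponds to roughly 109.5 million core-hours. The hypothetical task drops from about 100 h at 10k cores to about 10 h at 100k cores. Real turnaround would be longer once start-up and tail effects are included.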

Research Organization:
Brookhaven National Laboratory (BNL), Upton, NY (United States)
Sponsoring Organization:
USDOE Office of Science (SC), High Energy Physics (HEP)
Grant/Contract Number:
SC0012704
OSTI ID:
2429542
Report Number(s):
BNL--225906-2024-JAAM
Journal Information:
Journal Name: EPJ Web of Conferences (Online), Vol. 295; ISSN 2100-014X
Publisher:
EDP Sciences
Country of Publication:
United States
Language:
English