DOE PAGES: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Virtual machine provisioning, code management, and data movement design for the Fermilab HEPCloud Facility

Abstract

The Fermilab HEPCloud Facility Project aims to extend the current Fermilab facility interface to provide transparent access to disparate resources, including commercial and community clouds, grid federations, and HPC centers. This facility enables experiments to perform the full spectrum of computing tasks, including data-intensive simulation and reconstruction. We have evaluated the use of the commercial cloud to provide elasticity in response to peaks of demand without overprovisioning local resources. Full-scale data-intensive workflows have been completed successfully on Amazon Web Services for two high-energy physics experiments, CMS and NOνA, at the scale of 58,000 simultaneous cores. This paper describes the significant improvements that were made to the virtual machine provisioning system, code caching system, and data movement system to accomplish this work. The virtual image provisioning and contextualization service was extended to multiple AWS regions and to support experiment-specific data configurations. A prototype Decision Engine was written to determine the optimal availability zone and instance type to run on, minimizing cost and job interruptions. We have deployed a scalable on-demand caching service to deliver code and database information to jobs running on the commercial cloud. It uses the frontier-squid server and CERN VM File System (CVMFS) clients on EC2 instances and utilizes various services provided by AWS to build the infrastructure (stack). We discuss the architecture and load-testing benchmarks of the squid servers. We also describe the various approaches that were evaluated to transport experimental data to and from the cloud, and the optimal solutions that were used for the bulk of the data transport. Finally, we summarize lessons learned from this scale test and our future plans to expand and improve the Fermilab HEPCloud Facility.
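
A minimal sketch may help make the Decision Engine concrete: in Python with boto3, a prototype could survey recent AWS spot price history and pick the cheapest availability zone and instance type. The candidate regions, instance types, and lowest-recent-price rule below are illustrative assumptions, not the project's actual implementation.

```python
"""Hypothetical Decision Engine step: choose an AWS availability zone and
instance type from recent spot price history. Region and instance-type
pools are placeholders, not taken from the paper."""
from datetime import datetime, timedelta, timezone

import boto3  # AWS SDK for Python

REGIONS = ["us-east-1", "us-west-2"]           # assumed candidate regions
INSTANCE_TYPES = ["m4.2xlarge", "c4.4xlarge"]  # assumed candidate types

def cheapest_spot_choice(regions, instance_types, lookback_hours=1):
    """Return (region, zone, instance type, $/hr) with the lowest spot
    price seen in the lookback window across all candidate pools."""
    best = None
    start = datetime.now(timezone.utc) - timedelta(hours=lookback_hours)
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        # The first page of history is enough for a sketch; a production
        # engine would paginate and also weigh interruption risk.
        history = ec2.describe_spot_price_history(
            InstanceTypes=instance_types,
            ProductDescriptions=["Linux/UNIX"],
            StartTime=start,
        )
        for record in history["SpotPriceHistory"]:
            price = float(record["SpotPrice"])
            if best is None or price < best[3]:
                best = (region, record["AvailabilityZone"],
                        record["InstanceType"], price)
    return best

if __name__ == "__main__":
    region, zone, itype, price = cheapest_spot_choice(REGIONS, INSTANCE_TYPES)
    print(f"Provision {itype} in {zone} ({region}) at ${price:.4f}/hr")
```

The code-caching side can be sketched the same way: a VM contextualization step might write the CVMFS client configuration that routes repository traffic through the on-demand frontier-squid caches. The proxy hostnames and repository names below are placeholders, not values from the paper.

```python
"""Hypothetical contextualization step: point CVMFS clients on an EC2
instance at frontier-squid proxies. Hostnames and repositories are
illustrative."""
from pathlib import Path

SQUID_PROXIES = ["http://squid-a.internal:3128",
                 "http://squid-b.internal:3128"]  # assumed cache endpoints
REPOSITORIES = "cms.cern.ch,nova.opensciencegrid.org"

def write_cvmfs_config(path="/etc/cvmfs/default.local"):
    # In CVMFS proxy syntax, "|" load-balances across proxies in a group
    # and ";" separates failover groups.
    proxy_string = "|".join(SQUID_PROXIES)
    Path(path).write_text(
        f"CVMFS_REPOSITORIES={REPOSITORIES}\n"
        f'CVMFS_HTTP_PROXY="{proxy_string}"\n'
    )

if __name__ == "__main__":
    write_cvmfs_config()
```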

Authors:
Timm, S. [1]; Cooper, G. [1]; Fuess, S. [1]; Garzoglio, G. [1]; Holzman, B. [1]; Kennedy, R. [1]; Grassano, D. [1]; Tiradani, A. [1]; Krishnamurthy, R. [2]; Vinayagam, S. [2]; Raicu, I. [2]; Wu, H. [2]; Ren, S. [2]; Noh, S-Y. [3]
  1. Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
  2. Illinois Inst. of Technology, Chicago, IL (United States)
  3. Korea Inst. of Science and Technology, Daejeon (Korea, Republic of)
Publication Date:
October 2017
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
OSTI Identifier:
1423235
Report Number(s):
FERMILAB-CONF-17-641-CD
Journal ID: ISSN 1742-6588; 1638496
Grant/Contract Number:  
AC02-07CH11359
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 898; Journal Issue: 5; Conference: 22nd International Conference on Computing in High Energy and Nuclear Physics (CHEP 2016), San Francisco, CA, 10-14 October 2016; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English
Subject:
72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING

Citation Formats

Timm, S., Cooper, G., Fuess, S., Garzoglio, G., Holzman, B., Kennedy, R., Grassano, D., Tiradani, A., Krishnamurthy, R., Vinayagam, S., Raicu, I., Wu, H., Ren, S., and Noh, S-Y. Virtual machine provisioning, code management, and data movement design for the Fermilab HEPCloud Facility. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/5/052041.
Timm, S., Cooper, G., Fuess, S., Garzoglio, G., Holzman, B., Kennedy, R., Grassano, D., Tiradani, A., Krishnamurthy, R., Vinayagam, S., Raicu, I., Wu, H., Ren, S., & Noh, S-Y. Virtual machine provisioning, code management, and data movement design for the Fermilab HEPCloud Facility. United States. doi:10.1088/1742-6596/898/5/052041.
Timm, S., Cooper, G., Fuess, S., Garzoglio, G., Holzman, B., Kennedy, R., Grassano, D., Tiradani, A., Krishnamurthy, R., Vinayagam, S., Raicu, I., Wu, H., Ren, S., and Noh, S-Y. 2017. "Virtual machine provisioning, code management, and data movement design for the Fermilab HEPCloud Facility". United States. doi:10.1088/1742-6596/898/5/052041. https://www.osti.gov/servlets/purl/1423235.
@article{osti_1423235,
title = {Virtual machine provisioning, code management, and data movement design for the Fermilab HEPCloud Facility},
author = {Timm, S. and Cooper, G. and Fuess, S. and Garzoglio, G. and Holzman, B. and Kennedy, R. and Grassano, D. and Tiradani, A. and Krishnamurthy, R. and Vinayagam, S. and Raicu, I. and Wu, H. and Ren, S. and Noh, S-Y},
abstractNote = {The Fermilab HEPCloud Facility Project aims to extend the current Fermilab facility interface to provide transparent access to disparate resources, including commercial and community clouds, grid federations, and HPC centers. This facility enables experiments to perform the full spectrum of computing tasks, including data-intensive simulation and reconstruction. We have evaluated the use of the commercial cloud to provide elasticity in response to peaks of demand without overprovisioning local resources. Full-scale data-intensive workflows have been completed successfully on Amazon Web Services for two high-energy physics experiments, CMS and NOνA, at the scale of 58,000 simultaneous cores. This paper describes the significant improvements that were made to the virtual machine provisioning system, code caching system, and data movement system to accomplish this work. The virtual image provisioning and contextualization service was extended to multiple AWS regions and to support experiment-specific data configurations. A prototype Decision Engine was written to determine the optimal availability zone and instance type to run on, minimizing cost and job interruptions. We have deployed a scalable on-demand caching service to deliver code and database information to jobs running on the commercial cloud. It uses the frontier-squid server and CERN VM File System (CVMFS) clients on EC2 instances and utilizes various services provided by AWS to build the infrastructure (stack). We discuss the architecture and load-testing benchmarks of the squid servers. We also describe the various approaches that were evaluated to transport experimental data to and from the cloud, and the optimal solutions that were used for the bulk of the data transport. Finally, we summarize lessons learned from this scale test and our future plans to expand and improve the Fermilab HEPCloud Facility.},
doi = {10.1088/1742-6596/898/5/052041},
journal = {Journal of Physics. Conference Series},
number = {5},
volume = {898},
place = {United States},
year = {2017},
month = {10}
}

Journal Article:
Free publicly available full text (publisher's version of record)

Figures / Tables:

Figure 1: Network throughput of the squid server
