DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2

Abstract

The use of clouds to execute high-performance computing (HPC) applications has greatly increased recently. Clouds provide several potential advantages over traditional supercomputers and in-house clusters. The most popular cloud is currently Amazon EC2, which provides fixed-cost and variable-cost, auction-based options. The auction market trades lower cost for potential interruptions that necessitate checkpointing; if the market price exceeds the bid price, a node is taken away from the user without warning. We explore techniques to maximize performance per dollar given a time constraint within which an application must complete. Specifically, we design and implement multiple techniques to reduce expected cost by exploiting redundancy in the EC2 auction market. We then design an adaptive algorithm that selects a scheduling algorithm and determines the bid price. We show that our adaptive algorithm executes programs up to seven times cheaper than using the on-demand market and up to 44 percent cheaper than the best non-redundant, auction-market algorithm. We extend our adaptive algorithm to incorporate application scalability characteristics for further cost savings. In conclusion, we show that the adaptive algorithm informed with scalability characteristics of applications achieves up to 56 percent cost savings compared to the expected cost for the base adaptive algorithm run at a fixed, user-defined scale.

Authors:
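 Marathe, Aniruddha P. [1]; Harris, Rachel A. [2]; Lowenthal, David K. [3]; de Supinski, Bronis R. [1]; Rountree, Barry L. [1]; Schulz, Martin [1]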
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Google Corp., Mountain View, CA (United States)
  3. The Univ. of Arizona, Tucson, AZ (United States)
Publication Date:
2015-12-17
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1399726
Report Number(s):
LLNL-JRNL-676899
Journal ID: ISSN 1045-9219; TRN: US1702972
Grant/Contract Number:  
AC52-07NA27344
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 27; Journal Issue: 9; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; fault tolerance; reliability; cloud computing; resource provisioning; cost optimization

Citation Formats

Marathe, Aniruddha P., Harris, Rachel A., Lowenthal, David K., de Supinski, Bronis R., Rountree, Barry L., and Schulz, Martin. Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2. United States: N. p., 2015. Web. doi:10.1109/TPDS.2015.2508457.
Marathe, Aniruddha P., Harris, Rachel A., Lowenthal, David K., de Supinski, Bronis R., Rountree, Barry L., & Schulz, Martin. Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2. United States. https://doi.org/10.1109/TPDS.2015.2508457
Marathe, Aniruddha P., Harris, Rachel A., Lowenthal, David K., de Supinski, Bronis R., Rountree, Barry L., and Schulz, Martin. 2015. "Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2". United States. https://doi.org/10.1109/TPDS.2015.2508457. https://www.osti.gov/servlets/purl/1399726.
@article{osti_1399726,
title = {Exploiting Redundancy and Application Scalability for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2},
author = {Marathe, Aniruddha P. and Harris, Rachel A. and Lowenthal, David K. and de Supinski, Bronis R. and Rountree, Barry L. and Schulz, Martin},
abstractNote = {The use of clouds to execute high-performance computing (HPC) applications has greatly increased recently. Clouds provide several potential advantages over traditional supercomputers and in-house clusters. The most popular cloud is currently Amazon EC2, which provides fixed-cost and variable-cost, auction-based options. The auction market trades lower cost for potential interruptions that necessitate checkpointing; if the market price exceeds the bid price, a node is taken away from the user without warning. We explore techniques to maximize performance per dollar given a time constraint within which an application must complete. Specifically, we design and implement multiple techniques to reduce expected cost by exploiting redundancy in the EC2 auction market. We then design an adaptive algorithm that selects a scheduling algorithm and determines the bid price. We show that our adaptive algorithm executes programs up to seven times cheaper than using the on-demand market and up to 44 percent cheaper than the best non-redundant, auction-market algorithm. We extend our adaptive algorithm to incorporate application scalability characteristics for further cost savings. In conclusion, we show that the adaptive algorithm informed with scalability characteristics of applications achieves up to 56 percent cost savings compared to the expected cost for the base adaptive algorithm run at a fixed, user-defined scale.},
doi = {10.1109/TPDS.2015.2508457},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 9,
volume = 27,
place = {United States},
year = {2015},
month = {12}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 6 works
Citation information provided by
Web of Science
