DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting

Abstract

Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall capacity during peak utilization) can be a cost-effective way to deal with the increasing complexity of big data analytics, especially for iterative applications. However, the low throughput, high latency network link between the on-premise and off-premise resources (“weak link”) makes maintaining scalability difficult. While several data locality techniques have been designed for big data bursting on hybrid clouds, their effectiveness is difficult to estimate in advance. Yet such estimations are critical, because they help users decide whether the extra pay-as-you-go cost incurred by using the off-premise resources justifies the runtime speed-up. To this end, the current paper presents a performance model and methodology to estimate the runtime of iterative MapReduce applications in a hybrid cloud-bursting scenario. The paper focuses on the overhead incurred by the weak link at fine granularity, for both the map and the reduce phases. This approach enables high estimation accuracy, as demonstrated by extensive experiments at scale using a mix of real-world iterative MapReduce applications from standard big data benchmarking suites that cover a broad spectrum of data patterns. As a result, not only are the produced estimations accurate in absolute terms compared with experimental results, but they are also up to an order of magnitude more accurate than applying state-of-art estimation approaches originally designed for single-site MapReduce deployments.

Authors:
Clemente-Castello, Francisco J. [1]; Nicolae, Bogdan [2]; Mayo, Rafael [1]; Fernandez, Juan Carlos [1]
  1. Univ. Jaume I, Castellon (Spain)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
Publication Date:
2018-02-06
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
National Science Foundation (NSF); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1466407
Grant/Contract Number:  
AC02-06CH11357
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 29; Journal Issue: 8; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Big Data Analytics; Hybrid Cloud; Iterative Applications; MapReduce; Performance Prediction; Runtime Estimation

Citation Formats

Clemente-Castello, Francisco J., Nicolae, Bogdan, Mayo, Rafael, and Fernandez, Juan Carlos. Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting. United States: N. p., 2018. Web. doi:10.1109/TPDS.2018.2802932.
Clemente-Castello, Francisco J., Nicolae, Bogdan, Mayo, Rafael, & Fernandez, Juan Carlos. Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting. United States. https://doi.org/10.1109/TPDS.2018.2802932
Clemente-Castello, Francisco J., Nicolae, Bogdan, Mayo, Rafael, and Fernandez, Juan Carlos. 2018. "Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting". United States. https://doi.org/10.1109/TPDS.2018.2802932. https://www.osti.gov/servlets/purl/1466407.
@article{osti_1466407,
title = {Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting},
author = {Clemente-Castello, Francisco J. and Nicolae, Bogdan and Mayo, Rafael and Fernandez, Juan Carlos},
abstractNote = {Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall capacity during peak utilization) can be a cost-effective way to deal with the increasing complexity of big data analytics, especially for iterative applications. However, the low throughput, high latency network link between the on-premise and off-premise resources (“weak link”) makes maintaining scalability difficult. While several data locality techniques have been designed for big data bursting on hybrid clouds, their effectiveness is difficult to estimate in advance. Yet such estimations are critical, because they help users decide whether the extra pay-as-you-go cost incurred by using the off-premise resources justifies the runtime speed-up. To this end, the current paper presents a performance model and methodology to estimate the runtime of iterative MapReduce applications in a hybrid cloud-bursting scenario. The paper focuses on the overhead incurred by the weak link at fine granularity, for both the map and the reduce phases. This approach enables high estimation accuracy, as demonstrated by extensive experiments at scale using a mix of real-world iterative MapReduce applications from standard big data benchmarking suites that cover a broad spectrum of data patterns. As a result, not only are the produced estimations accurate in absolute terms compared with experimental results, but they are also up to an order of magnitude more accurate than applying state-of-art estimation approaches originally designed for single-site MapReduce deployments.},
doi = {10.1109/TPDS.2018.2802932},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 8,
volume = 29,
place = {United States},
year = {2018},
month = {2}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 9 works
Citation information provided by
Web of Science

Works referencing / citing this record:

A survey and taxonomy on workload scheduling and resource provisioning in hybrid clouds
journal, February 2020