Summary: A Study of Skew in MapReduce Applications
YongChul Kwon, Magdalena Balazinska, Bill Howe
University of Washington, USA
Email: {yongchul,magda,billhowe}@cs.washington.edu
Jerome Rolia
HP Labs
Email: jerry.rolia@hp.com
Abstract--This paper presents a study of skew -- highly variable
task runtimes -- in MapReduce applications. We describe
various causes and manifestations of skew as observed in
real-world Hadoop applications. Task runtime distributions from
these applications demonstrate the presence of skew and its
negative impact on performance. We discuss best practices
recommended for avoiding skew and the limitations of those practices.
I. INTRODUCTION
MapReduce [1] has proven itself as a powerful and
cost-effective approach for massively parallel analytics [2]. A
MapReduce job runs in two main phases: a map phase and a
reduce phase. In each phase, a subset of the input data is
processed by distributed tasks in a cluster of computers. When