Title: Automatic Parallelization and Transparent Fault Tolerance

Authors:
 Davis, Marion Kei [1]; Prichard, Dean A. [1]; Ringo, David Matteson [1]; Anderson, Loren [2]; Marks, Jacob [3]
  1. Los Alamos National Laboratory
  2. University of North Dakota
  3. New Mexico Tech
Publication Date:
2016-06-16
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1258367
Report Number(s):
LA-UR-16-24056
DOE Contract Number:
AC52-06NA25396
Resource Type:
Conference
Resource Relation:
Conference: 17th Symposium on Trends in Functional Programming ; 2016-06-08 - 2016-06-10 ; College Park, Maryland, United States
Country of Publication:
United States
Language:
English
Subject:
Computer Science

Citation Formats

Davis, Marion Kei, Prichard, Dean A., Ringo, David Matteson, Anderson, Loren, and Marks, Jacob. Automatic Parallelization and Transparent Fault Tolerance. United States: N. p., 2016. Web.
Davis, Marion Kei, Prichard, Dean A., Ringo, David Matteson, Anderson, Loren, & Marks, Jacob (2016). Automatic Parallelization and Transparent Fault Tolerance. United States.
Davis, Marion Kei, Prichard, Dean A., Ringo, David Matteson, Anderson, Loren, and Marks, Jacob. 2016. "Automatic Parallelization and Transparent Fault Tolerance". United States. https://www.osti.gov/servlets/purl/1258367.
@article{osti_1258367,
title = {Automatic Parallelization and Transparent Fault Tolerance},
author = {Davis, Marion Kei and Prichard, Dean A. and Ringo, David Matteson and Anderson, Loren and Marks, Jacob},
place = {United States},
year = {2016},
month = {6}
}

  • Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean time to failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling, and a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet, our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead, resulting in net savings of our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, i.e., the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.
  • As the core counts of HPC machines continue to grow, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.
  • A straight-line code, which consists of assignment, addition, and multiplication statements, is an abstraction of a serial computer program to compute a function with n inputs. Given a serial straight-line code with N statements, the authors derive an algorithm that automatically evaluates not only the function but also its first-order derivatives with respect to the n inputs on a parallel computer. The basic idea of the algorithm is to marry automatic computation of derivatives with automatic parallelization of serial programs. The algorithm requires O(M_N log N) scalar operations, where O(M_N) is the time complexity of a parallel multiplication of two dense N x N matrices and represents a measure of the complexity of the straight-line code. Although it can be exponential in N in the worst case, it tends to be only polynomial in N for many important problems.
  • Automatic parallelization of sequential applications using OpenMP as a target has been attracting significant attention recently because of the popularity of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has focused only on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions such as STL containers are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions of STL. In this paper, we use ROSE, a multiple-language source-to-source compiler infrastructure, to build a parallelizer that can recognize such high-level semantics and parallelize C++ applications using certain STL containers. The idea of our work is to automatically insert OpenMP constructs using extended conventional dependence analysis and the known domain-specific semantics of high-level abstractions, with optional assistance from source code annotations. In addition, the parallelizer is followed by an OpenMP translator to translate the generated OpenMP programs into multi-threaded code targeted to a popular OpenMP runtime library. Our work extends the applicability of automatic parallelization and provides another way to take advantage of multicore processors.
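
The straight-line code abstraction mentioned in the third record above (a sequence of assignment, addition, and multiplication statements whose value and first-order derivatives are computed together) can be illustrated with a small sketch. This is not the parallel algorithm from that record: it is an ordinary serial forward-mode evaluation, shown only to make the abstraction concrete. The statement encoding `(dest, op, lhs, rhs)` and the function name `evaluate_with_gradient` are hypothetical choices made for this example.

```python
# Assumption: a straight-line code is encoded as a list of statements
# (dest, op, lhs, rhs), where op is "assign", "add", or "mul" and the
# operands are variable names or numeric constants. This encoding is
# hypothetical and chosen only for illustration.

def evaluate_with_gradient(program, inputs, output):
    """Evaluate a straight-line code and its first-order derivatives.

    Serial forward-mode differentiation: every variable carries its value
    and a gradient vector with respect to the n inputs.
    """
    names = list(inputs)
    n = len(names)
    # Seed the inputs: d(input_i)/d(input_j) is 1 when i == j, else 0.
    env = {
        name: (float(inputs[name]), [1.0 if i == j else 0.0 for j in range(n)])
        for i, name in enumerate(names)
    }

    def lookup(operand):
        if isinstance(operand, str):
            return env[operand]
        return float(operand), [0.0] * n  # constants have a zero gradient

    for dest, op, lhs, rhs in program:
        a, da = lookup(lhs)
        if op == "assign":
            env[dest] = (a, list(da))
            continue
        b, db = lookup(rhs)
        if op == "add":
            # Sum rule: gradients add component-wise.
            env[dest] = (a + b, [x + y for x, y in zip(da, db)])
        elif op == "mul":
            # Product rule: d(a*b) = da*b + a*db.
            env[dest] = (a * b, [x * b + a * y for x, y in zip(da, db)])
        else:
            raise ValueError(f"unsupported statement type: {op}")
    return env[output]


# f(x, y) = (x + y) * x, written as three straight-line statements.
program = [
    ("t1", "add", "x", "y"),
    ("t2", "mul", "t1", "x"),
    ("f", "assign", "t2", None),
]
value, gradient = evaluate_with_gradient(program, {"x": 3.0, "y": 2.0}, "f")
print(value, gradient)  # 15.0 [8.0, 3.0]
```

At x = 3, y = 2 the function value is 15, with partial derivatives 2x + y = 8 and x = 3, which matches the printed result. Each statement here is processed in order, so the sketch does nothing to expose the parallelism that the record describes.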