Title: Addressing Failures in Exascale Computing

Abstract

We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.

Authors:
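 Snir, Marc [1]; Wisniewski, Robert [2]; Abraham, Jacob [3]; Adve, Sarita [4]; Bagchi, Saurabh [5]; Balaji, Pavan [1]; Belak, J. [6]; Bose, Pradip [7]; Cappello, Franck [1]; Carlson, Bill [3]; Chien, Andrew [8]; Coteus, Paul [7]; DeBardeleben, Nathan [9]; Diniz, Pedro [10]; Engelmann, Christian [11]; Erez, Mattan [12]; Fazzari, Saverio [13]; Geist, Al [11]; Gupta, Rinku [1]; Johnson, Fred [14]; Krishnamoorthy, Sriram [15]; Leyffer, Sven [1]; Liberty, Dean [16]; Mitra, Subhasish [17]; Munson, Todd [1]; Schreiber, Rob [18]; Stearley, Jon [19]; Van Hensbergen, Eric [20]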
  1. Argonne National Laboratory (ANL)
  2. Intel Corporation
  3. unknown
  4. University of Illinois, Urbana-Champaign
  5. Purdue University
  6. Lawrence Livermore National Laboratory (LLNL)
  7. IBM T. J. Watson Research Center
  8. University of Chicago
  9. Los Alamos National Laboratory (LANL)
  10. University of Southern California
  11. ORNL
  12. University of Texas at Austin
  13. Booz Allen Hamilton
  14. Science Applications International Corporation (SAIC), Oak Ridge, TN
  15. Pacific Northwest National Laboratory (PNNL)
  16. AMD
  17. Stanford University
  18. HP Labs
  19. Sandia National Laboratories (SNL)
  20. ARM
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1128984
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Journal Article
Resource Relation:
Journal Name: International Journal of High Performance Computing Applications
Country of Publication:
United States
Language:
English

Citation Formats

Snir, Marc, Wisniewski, Robert, Abraham, Jacob, Adve, Sarita, Bagchi, Saurabh, Balaji, Pavan, Belak, J., Bose, Pradip, Cappello, Franck, Carlson, Bill, Chien, Andrew, Coteus, Paul, DeBardeleben, Nathan, Diniz, Pedro, Engelmann, Christian, Erez, Mattan, Fazzari, Saverio, Geist, Al, Gupta, Rinku, Johnson, Fred, Krishnamoorthy, Sriram, Leyffer, Sven, Liberty, Dean, Mitra, Subhasish, Munson, Todd, Schreiber, Rob, Stearley, Jon, and Van Hensbergen, Eric. Addressing Failures in Exascale Computing. United States: N. p., 2014. Web. doi:10.1177/1094342014522573.
Snir, Marc, Wisniewski, Robert, Abraham, Jacob, Adve, Sarita, Bagchi, Saurabh, Balaji, Pavan, Belak, J., Bose, Pradip, Cappello, Franck, Carlson, Bill, Chien, Andrew, Coteus, Paul, DeBardeleben, Nathan, Diniz, Pedro, Engelmann, Christian, Erez, Mattan, Fazzari, Saverio, Geist, Al, Gupta, Rinku, Johnson, Fred, Krishnamoorthy, Sriram, Leyffer, Sven, Liberty, Dean, Mitra, Subhasish, Munson, Todd, Schreiber, Rob, Stearley, Jon, & Van Hensbergen, Eric. Addressing Failures in Exascale Computing. United States. doi:10.1177/1094342014522573.
Snir, Marc, Wisniewski, Robert, Abraham, Jacob, Adve, Sarita, Bagchi, Saurabh, Balaji, Pavan, Belak, J., Bose, Pradip, Cappello, Franck, Carlson, Bill, Chien, Andrew, Coteus, Paul, DeBardeleben, Nathan, Diniz, Pedro, Engelmann, Christian, Erez, Mattan, Fazzari, Saverio, Geist, Al, Gupta, Rinku, Johnson, Fred, Krishnamoorthy, Sriram, Leyffer, Sven, Liberty, Dean, Mitra, Subhasish, Munson, Todd, Schreiber, Rob, Stearley, Jon, and Van Hensbergen, Eric. 2014. "Addressing Failures in Exascale Computing". United States. doi:10.1177/1094342014522573.
@article{osti_1128984,
title = {Addressing Failures in Exascale Computing},
author = {Snir, Marc and Wisniewski, Robert and Abraham, Jacob and Adve, Sarita and Bagchi, Saurabh and Balaji, Pavan and Belak, J. and Bose, Pradip and Cappello, Franck and Carlson, Bill and Chien, Andrew and Coteus, Paul and DeBardeleben, Nathan and Diniz, Pedro and Engelmann, Christian and Erez, Mattan and Fazzari, Saverio and Geist, Al and Gupta, Rinku and Johnson, Fred and Krishnamoorthy, Sriram and Leyffer, Sven and Liberty, Dean and Mitra, Subhasish and Munson, Todd and Schreiber, Rob and Stearley, Jon and Van Hensbergen, Eric},
abstractNote = {We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.},
doi = {10.1177/1094342014522573},
journal = {International Journal of High Performance Computing Applications},
place = {United States},
year = {2014},
month = {jan}
}