Title: Initial Proposal for MPI 3.0 Error Handling

Technical Report · DOI: https://doi.org/10.2172/945669 · OSTI ID: 945669

The MPI 2 spec contains error handling and notification mechanisms that have a number of limitations from the point of view of application fault tolerance:

(1) The specification makes no demands on MPI to survive failures. Although MPI implementers are encouraged to 'circumscribe the impact of an error, so that normal processing can continue after an error handler was invoked', nothing more is specified in the standard. In particular, the defined MPI error classes are used only to clarify to the user the source of the error and do not describe the MPI functionality that is not available as a result of the error.

(2) All errors must somehow be associated with some specific MPI call. As such, (A) it is difficult for MPI to notify users of failures in asynchronous calls, such as an MPI_Rsend call, which may return immediately after the message data is sent along the wire but before it is successfully delivered; (B) there is no provision for asynchronous error notification regarding errors that will affect future calls, such as notifying process p of the failure of process q before p tries to communicate with q.

(3) There is no description of when error notification will happen relative to the occurrence of the error. In particular, the specification does not state whether an error that would cause MPI functions to return an error code under the MPI_ERRORS_RETURN error handler would cause a user-defined error handler to be called during the same MPI function or at some earlier or later point in time.

(4) Although MPI makes it possible for libraries to define their own error classes and invoke application error handlers, it is not possible for the application to define new error notification patterns either within or across processes. This means that it is not possible for one application process to ask to be informed of errors on other processes or for the application to be informed of specific classes of errors.
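Limitation (2)(B) can be illustrated with a small toy model. The sketch below is not MPI and does not use any MPI binding: `ToyComm`, `fail`, `send`, `record`, and the `ERR_PEER_FAILED` code are all hypothetical names invented for illustration. It assumes a model in which an error can only surface inside a call that touches the failed peer, and it arbitrarily invokes the handler during that same call, a timing choice that, per point (3), the specification leaves open.

```python
"""Toy model of call-associated error notification (NOT real MPI)."""

ERRORS_RETURN = "errors_return"  # hypothetical stand-in for MPI_ERRORS_RETURN


class ToyComm:
    """Hypothetical communicator whose errors surface only inside a call."""

    def __init__(self, size, handler=ERRORS_RETURN):
        self.alive = [True] * size   # liveness of each simulated rank
        self.handler = handler       # ERRORS_RETURN or a callable(comm, code)
        self.log = []                # what the application has been told so far

    def fail(self, rank):
        # The failure occurs here, but nothing notifies the application yet.
        self.alive[rank] = False

    def send(self, dest, payload):
        # Notification is tied to this specific call (limitation 2).
        if not self.alive[dest]:
            code = ("ERR_PEER_FAILED", dest)   # hypothetical error code
            if callable(self.handler):
                # Assumption: the handler runs during the same call.
                self.handler(self, code)
            return code
        return None  # success


def record(comm, code):
    """User-defined handler for the toy model: just log the error."""
    comm.log.append(code)


comm = ToyComm(size=3, handler=record)
comm.fail(2)                          # rank 2 fails
seen_before_send = len(comm.log)      # application has learned nothing yet
status = comm.send(2, b"data")        # the error surfaces only at this call
seen_after_send = len(comm.log)
ok_status = comm.send(1, b"data")     # a live peer is unaffected
```

In this model the application has no way to learn that rank 2 failed until it attempts to communicate with it, which is the gap the abstract describes.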

Research Organization:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
W-7405-ENG-48
OSTI ID:
945669
Report Number(s):
LLNL-TR-405242; TRN: US200904%%120
Country of Publication:
United States
Language:
English

Similar Records

Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications
Technical Report · May 23, 2016 · OSTI ID:945669

MPI as a coordination layer for communicating HPF tasks
Conference · December 31, 1996 · OSTI ID:945669

A portable method for finding user errors in the usage of MPI collective operations.
Journal Article · July 1, 2007 · Int. J. High Perform. Comput. Appl. · OSTI ID:945669