DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Methods, apparatus and system for selective duplication of subtasks

Abstract

A method for selective duplication of subtasks in a high-performance computing system includes: monitoring a health status of one or more nodes in a high-performance computing system, where one or more subtasks of a parallel task execute on the one or more nodes; identifying one or more nodes as having a likelihood of failure which exceeds a first prescribed threshold; selectively duplicating the one or more subtasks that execute on the one or more nodes having a likelihood of failure which exceeds the first prescribed threshold; and notifying a messaging library that one or more subtasks were duplicated.
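The four steps in the abstract (monitor, identify, duplicate, notify) can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the `Node` and `MessagingLibrary` classes, the `selectively_duplicate` function, the numeric likelihood scale, and the policy of placing each replica on the healthiest remaining node are all assumptions introduced here, since the abstract does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a compute node as seen by a health monitor."""
    name: str
    failure_likelihood: float  # assumed scale: 0.0 (healthy) to 1.0 (failing)
    subtasks: list = field(default_factory=list)   # primary subtasks
    replicas: list = field(default_factory=list)   # duplicated subtasks hosted here


class MessagingLibrary:
    """Hypothetical stand-in for the messaging library that gets notified."""

    def __init__(self):
        self.replica_location = {}

    def notify_duplicated(self, subtask, replica_node):
        # Record where the duplicate of each subtask now runs.
        self.replica_location[subtask] = replica_node


def selectively_duplicate(nodes, threshold, messaging):
    """Duplicate only the subtasks running on nodes whose failure
    likelihood exceeds the threshold; return the duplicated subtasks."""
    at_risk = [n for n in nodes if n.failure_likelihood > threshold]
    healthy = [n for n in nodes if n.failure_likelihood <= threshold]
    if not healthy:
        return []  # nowhere to place replicas in this simplified model

    # Assumed placement policy: use the node least likely to fail.
    target = min(healthy, key=lambda n: n.failure_likelihood)

    duplicated = []
    for node in at_risk:
        for subtask in node.subtasks:
            target.replicas.append(subtask)
            messaging.notify_duplicated(subtask, target.name)
            duplicated.append(subtask)
    return duplicated
```

Subtasks on nodes at or below the threshold are left alone, which is the "selective" part: replication cost is paid only where failure is considered likely.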

Inventors:
Andrade Costa, Carlos H.; Cher, Chen-Yong; Park, Yoonho; Rosenburg, Bryan S.; Ryu, Kyung D.
Issue Date:
2018 Sep 11
Research Org.:
International Business Machines Corporation, Armonk, New York (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1478646
Patent Number(s):
10073739
Application Number:
14/957,584
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Classifications (CPCs):
G - PHYSICS; G06 - COMPUTING; G06F - ELECTRIC DIGITAL DATA PROCESSING
H - ELECTRICITY; H04 - ELECTRIC COMMUNICATION TECHNIQUE; H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
DOE Contract Number:  
B599858
Resource Type:
Patent
Resource Relation:
Patent File Date: 2015 Dec 02
Country of Publication:
United States
Language:
English

Citation Formats

Andrade Costa, Carlos H., Cher, Chen-Yong, Park, Yoonho, Rosenburg, Bryan S., and Ryu, Kyung D. Methods, apparatus and system for selective duplication of subtasks. United States: N. p., 2018. Web.
Andrade Costa, Carlos H., Cher, Chen-Yong, Park, Yoonho, Rosenburg, Bryan S., & Ryu, Kyung D. (2018). Methods, apparatus and system for selective duplication of subtasks. United States.
Andrade Costa, Carlos H., Cher, Chen-Yong, Park, Yoonho, Rosenburg, Bryan S., and Ryu, Kyung D. 2018. "Methods, apparatus and system for selective duplication of subtasks". United States. https://www.osti.gov/servlets/purl/1478646.
@article{osti_1478646,
title = {Methods, apparatus and system for selective duplication of subtasks},
author = {Andrade Costa, Carlos H. and Cher, Chen-Yong and Park, Yoonho and Rosenburg, Bryan S. and Ryu, Kyung D.},
abstractNote = {A method for selective duplication of subtasks in a high-performance computing system includes: monitoring a health status of one or more nodes in a high-performance computing system, where one or more subtasks of a parallel task execute on the one or more nodes; identifying one or more nodes as having a likelihood of failure which exceeds a first prescribed threshold; selectively duplicating the one or more subtasks that execute on the one or more nodes having a likelihood of failure which exceeds the first prescribed threshold; and notifying a messaging library that one or more subtasks were duplicated.},
url = {https://www.osti.gov/servlets/purl/1478646},
place = {United States},
year = {2018},
month = {9}
}
