OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Methods, apparatus and system for selective duplication of subtasks

Patent
OSTI ID:1478646

A method for selective duplication of subtasks in a high-performance computing system includes: monitoring a health status of one or more nodes in a high-performance computing system, where one or more subtasks of a parallel task execute on the one or more nodes; identifying one or more nodes as having a likelihood of failure which exceeds a first prescribed threshold; selectively duplicating the one or more subtasks that execute on the one or more nodes having a likelihood of failure which exceeds the first prescribed threshold; and notifying a messaging library that one or more subtasks were duplicated.
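
The four steps in the abstract (monitor, identify, duplicate, notify) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the `Node` type, the `failure_likelihood` score, the pool of spare nodes, the round-robin placement, and the `notify` callback standing in for the messaging library are all assumptions introduced here, since the record does not describe how any of these are realized.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical node record; the record does not define a data model."""
    name: str
    failure_likelihood: float  # assumed health score in [0, 1] from monitoring
    subtasks: List[str] = field(default_factory=list)


def select_and_duplicate(
    nodes: List[Node],
    spares: List[Node],
    threshold: float,
    notify: Callable[[Dict[str, str]], None],
) -> Dict[str, str]:
    """Duplicate subtasks of at-risk nodes onto spare nodes.

    Returns a mapping from each duplicated subtask to the spare that now
    holds its copy. Placement is round-robin, which is an assumption.
    """
    placements: Dict[str, str] = {}
    if not spares:
        return placements  # nowhere to place duplicates

    next_spare = 0
    for node in nodes:
        # Only nodes whose likelihood of failure exceeds the threshold qualify.
        if node.failure_likelihood > threshold:
            for subtask in node.subtasks:
                spare = spares[next_spare % len(spares)]
                spare.subtasks.append(subtask)  # original keeps running too
                placements[subtask] = spare.name
                next_spare += 1

    # Tell the (stand-in) messaging library which subtasks now have duplicates.
    if placements:
        notify(placements)
    return placements
```

In this sketch the original subtask stays on its node while the copy runs elsewhere, and healthy nodes are left alone, which is what makes the duplication selective rather than wholesale.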

Research Organization:
International Business Machines Corporation, Armonk, New York (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
B599858
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Number(s):
10,073,739
Application Number:
14/957,584
OSTI ID:
1478646
Resource Relation:
Patent File Date: 2015 Dec 02
Country of Publication:
United States
Language:
English
