Methods, apparatus and system for selective duplication of subtasks
Abstract
A method for selective duplication of subtasks in a high-performance computing system includes: monitoring a health status of one or more nodes in a high-performance computing system, where one or more subtasks of a parallel task execute on the one or more nodes; identifying one or more nodes as having a likelihood of failure which exceeds a first prescribed threshold; selectively duplicating the one or more subtasks that execute on the one or more nodes having a likelihood of failure which exceeds the first prescribed threshold; and notifying a messaging library that one or more subtasks were duplicated.
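The four steps in the abstract (monitor, identify, duplicate, notify) can be illustrated with a minimal sketch. This is not the patented implementation. The `Node` and `MessagingLibrary` classes, the `failure_likelihood` field, the `notify_duplicated` method, and the least-loaded placement policy are all hypothetical names and choices introduced here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node record; failure_likelihood is an assumed health metric in [0, 1].
    name: str
    failure_likelihood: float
    subtasks: list = field(default_factory=list)
    replicas: list = field(default_factory=list)


class MessagingLibrary:
    """Hypothetical stand-in for the messaging library mentioned in the abstract."""

    def __init__(self):
        self.duplicated = {}

    def notify_duplicated(self, subtask, replica_node):
        # Record where the duplicate of each subtask now runs.
        self.duplicated[subtask] = replica_node


def selectively_duplicate(nodes, threshold, messaging):
    """Duplicate only the subtasks running on nodes whose failure likelihood exceeds threshold."""
    at_risk = [n for n in nodes if n.failure_likelihood > threshold]
    healthy = [n for n in nodes if n.failure_likelihood <= threshold]
    placements = {}
    if not healthy:
        return placements  # nowhere safe to place a duplicate
    for node in at_risk:
        for subtask in node.subtasks:
            # Assumed policy: place the duplicate on the least-loaded healthy node.
            target = min(healthy, key=lambda n: len(n.subtasks) + len(n.replicas))
            target.replicas.append(subtask)
            placements[subtask] = target.name
            messaging.notify_duplicated(subtask, target.name)
    return placements
```

Subtasks on nodes at or below the threshold are left alone, which is what makes the duplication selective rather than a full replication of the parallel task.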
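- Inventors:
- Andrade Costa, Carlos H.; Cher, Chen-Yong; Park, Yoonho; Rosenburg, Bryan S.; Ryu, Kyung D.
- Issue Date:
- 2018 Sep 11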
- Research Org.:
- International Business Machines Corporation, Armonk, New York (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1478646
- Patent Number(s):
- 10073739
- Application Number:
- 14/957,584
- Assignee:
- International Business Machines Corporation (Armonk, NY)
- Patent Classifications (CPCs):
- G - PHYSICS; G06 - COMPUTING; G06F - ELECTRIC DIGITAL DATA PROCESSING
- H - ELECTRICITY; H04 - ELECTRIC COMMUNICATION TECHNIQUE; H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- DOE Contract Number:
- B599858
- Resource Type:
- Patent
- Resource Relation:
- Patent File Date: 2015 Dec 02
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Andrade Costa, Carlos H., Cher, Chen-Yong, Park, Yoonho, Rosenburg, Bryan S., and Ryu, Kyung D. Methods, apparatus and system for selective duplication of subtasks. United States: N. p., 2018. Web.
Andrade Costa, Carlos H., Cher, Chen-Yong, Park, Yoonho, Rosenburg, Bryan S., & Ryu, Kyung D. Methods, apparatus and system for selective duplication of subtasks. United States.
Andrade Costa, Carlos H., Cher, Chen-Yong, Park, Yoonho, Rosenburg, Bryan S., and Ryu, Kyung D. 2018. "Methods, apparatus and system for selective duplication of subtasks". United States. https://www.osti.gov/servlets/purl/1478646.
@article{osti_1478646,
title = {Methods, apparatus and system for selective duplication of subtasks},
author = {Andrade Costa, Carlos H. and Cher, Chen-Yong and Park, Yoonho and Rosenburg, Bryan S. and Ryu, Kyung D.},
abstractNote = {A method for selective duplication of subtasks in a high-performance computing system includes: monitoring a health status of one or more nodes in a high-performance computing system, where one or more subtasks of a parallel task execute on the one or more nodes; identifying one or more nodes as having a likelihood of failure which exceeds a first prescribed threshold; selectively duplicating the one or more subtasks that execute on the one or more nodes having a likelihood of failure which exceeds the first prescribed threshold; and notifying a messaging library that one or more subtasks were duplicated.},
doi = {},
journal = {},
place = {United States},
year = {2018},
month = {9}
}
Works referenced in this record:
Fanout Connectivity Structure for Use in Facilitating Processing Within a Parallel Computing Environment
patent-application, August 2010
- Coppinger, Richard J.; Fagiano, Christophe; Lombard, Christophe
- US Patent Application 12/362660; 20100199128
Anticipatory Protection of Critical Jobs in a Computing System
patent-application, April 2015
- Alshunnawi, Shareef F.; Cudak, Gary D.; Suffern, Edward S.
- US Patent Application 14/048868; 20150100816
Reducing application downtime in a cluster using user-defined rules for proactive failover
patent, January 2008
- Vellore, Prabhakar Krishnamurthy; Sharma, Mukund Hari; Liu, Peng
- US Patent Document 7,321,992
Method of migrating processes between networks and network system thereof
patent-application, January 2008
- Imai, Tetsuo
- US Patent Application 10/590355; 20080019316
Method of Predicting Availability of a System
patent-application, July 2008
- Narayan, Ranjani; Varadarajan, Keshavan; Natanasabapathy, Gautham
- US Patent Application 11/885475; 20080168314
Method and System for Providing High Availability to Distributed Computer Applications
patent-application, October 2015
- Havemose, Allan; Ngan, Ching-Yuk Paul
- US Patent Application 14/749760; 20150293819
Hybrid method for event prediction and system control
patent-application, May 2005
- Gupta, Manish; Moreira, Jose E.; Oliner, Adam J.
- US Patent Application 10/720300; 20050114739
Job Migration in Response to Loss or Degradation of Semi-Redundant Component
patent-application, March 2012
- Bower III, Fred A.; Piper, Scott A.; Pruett, Gregory B.
- US Patent Application 12/886299; 20120072765
Distributed computing system clustering model providing soft real-time responsiveness and continuous availability
patent, May 2004
- Rostowfske, Bruce D.; Buscher, Thomas H.; Peck, Andrew W.
- US Patent Document 6,735,717
System and Method for State-Based Execution and Recovery in a Payment System
patent-application, December 2007
- Hoyos, Carlos Antonio Lorenzo; Perazolo, Marcelo; Srikanth, Viswanath
- US Patent Application 11/420040; 20070288365
Techniques for maintaining fault tolerance for software programs in a clustered computer system
patent, September 2002
- D'Souza, Roy Peter
- US Patent Document 6,446,218
Enhancing throughput and fault-tolerance in a parallel-processing system
patent-application, September 2007
- Gross, Kenny C.; Wood, Alan Paul
- US Patent Application 11/371998; 20070214394
Method for monitoring and recovery of subsystems in a distributed/clustered system
patent, September 1998
- Dias, Daniel Manuel; King, Richard P.; Leff, Avraham
- US Patent Document 5,805,785
Mechanism For Process Migration On A Massively Parallel Computer
patent-application, March 2009
- Archer, Charles; Darrington, David; McCarthy, Patrick
- US Patent Application 11/853927; 20090067334
System and Method for Preventing Multiple Charges for a Transaction in a Payment System
patent-application, November 2007
- Hoyos, Carlos Antonio Lorenzo; Perazolo, Marcelo; Peters, Mark E.
- US Patent Application 11/456189; 20070276766