OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Fault tolerance and dynamic partitioning in large-scale parallel systems

Abstract

Fault-tolerance and dynamic partitioning are two important issues in the design of large-scale parallel systems. Most previous work in the fault-tolerant design of multistage interconnection networks (MINs) has been based on improving the reliabilities of MINs themselves. This study is to investigate the possibility of adding redundancy to MINs, as well as to other subsystems, to enhance the overall system reliability, and to analyze the improvement that can be obtained. The Dynamic Redundancy (DR) network presented provides the full capability of a Generalized Cube and can tolerate network faults and support a system to tolerate processing element faults without degradation in performance. It is shown that no matter how much redundancy is added into an MIN, the system reliability cannot exceed a certain bound; however, using the DR and spare PEs, this bound can be exceeded. Incorporating the DR network and spare PEs into the basic PASM structure is examined. The problem of partitioning parallel systems is also discussed. Many parallel systems can be partitioned into independent subsystems of different sizes, each subsystem having the characteristics of the complete system with the same size. A parallel system can be partitioned to simultaneously execute tasks with various sizes and computation structures. Inappropriate partitioning strategies may create many resource fragments, like the fragmentation problem in paging memory, and may cause the loss of computation power. Dynamic partitioning can alleviate the resource fragmentation problem. It is studied based on a lattice model, a special partial ordering relation on a set. Procedures to manage resources in partitionable systems are presented. These procedures can be applied each time a subsystem changes its status.
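
The reliability bound mentioned in the abstract can be illustrated with a small sketch. This is not the model used in the thesis; it assumes a simplified series system in which the network and every processing element (PE) fail independently, and it assumes ideal switching to spares. The function names and the example numbers are hypothetical. Under these assumptions, a perfect network still leaves the system limited by the PEs alone, while spare PEs can lift the system above that limit.

```python
from math import comb


def series_reliability(r_net: float, r_pe: float, n_pe: int) -> float:
    """Assumed series model: the system works only if the network
    and all n_pe processing elements work (independent failures)."""
    return r_net * r_pe ** n_pe


def k_of_n(r: float, k: int, n: int) -> float:
    """Probability that at least k of n independent units,
    each with reliability r, are working."""
    return sum(comb(n, i) * r ** i * (1 - r) ** (n - i) for i in range(k, n + 1))


def spared_reliability(r_net: float, r_pe: float, n_pe: int, spares: int) -> float:
    """Assumed model with spares: the system needs any n_pe working PEs
    out of n_pe + spares, with ideal (failure-free) reconfiguration."""
    return r_net * k_of_n(r_pe, n_pe, n_pe + spares)


if __name__ == "__main__":
    # Hypothetical figures chosen only for illustration.
    n_pe, r_pe = 16, 0.99
    bound = series_reliability(1.0, r_pe, n_pe)  # perfect network, no spares
    print("bound with a perfect network:", bound)
    print("imperfect network, no spares:", series_reliability(0.999, r_pe, n_pe))
    print("imperfect network, 2 spares: ", spared_reliability(0.98, r_pe, n_pe, 2))
```

In this toy model the bound is simply `r_pe ** n_pe`, the value reached when the network reliability is 1. Adding network redundancy only moves `r_net` toward 1, whereas spares change the PE term itself, which is why the last figure can exceed the bound.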

Authors:
Jeng, M J
Publication Date:
1987
Research Org.:
Purdue Univ., Indianapolis, IN (United States)
OSTI Identifier:
5533826
Resource Type:
Miscellaneous
Resource Relation:
Other Information: Thesis (Ph. D.)
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTER NETWORKS; FAULT TOLERANT COMPUTERS; DESIGN; PARALLEL PROCESSING; PERFORMANCE; RELIABILITY; COMPUTERS; DIGITAL COMPUTERS; PROGRAMMING; 990200* - Mathematics & Computers

Citation Formats

Jeng, M J. Fault tolerance and dynamic partitioning in large-scale parallel systems. United States: N. p., 1987. Web.
Jeng, M J. (1987). Fault tolerance and dynamic partitioning in large-scale parallel systems. United States.
Jeng, M J. 1987. "Fault tolerance and dynamic partitioning in large-scale parallel systems". United States.
@phdthesis{osti_5533826,
title = {Fault tolerance and dynamic partitioning in large-scale parallel systems},
author = {Jeng, M J},
abstractNote = {Fault-tolerance and dynamic partitioning are two important issues in the design of large-scale parallel systems. Most previous work in the fault-tolerant design of multistage interconnection networks (MINs) has been based on improving the reliabilities of MINs themselves. This study is to investigate the possibility of adding redundancy to MINs, as well as to other subsystems, to enhance the overall system reliability, and to analyze the improvement that can be obtained. The Dynamic Redundancy (DR) network presented provides the full capability of a Generalized Cube and can tolerate network faults and support a system to tolerate processing element faults without degradation in performance. It is shown that no matter how much redundancy is added into an MIN, the system reliability cannot exceed a certain bound; however, using the DR and spare PEs, this bound can be exceeded. Incorporating the DR network and spare PEs into the basic PASM structure is examined. The problem of partitioning parallel systems is also discussed. Many parallel systems can be partitioned into independent subsystems of different sizes, each subsystem having the characteristics of the complete system with the same size. A parallel system can be partitioned to simultaneously execute tasks with various sizes and computation structures. Inappropriate partitioning strategies may create many resource fragments, like the fragmentation problem in paging memory, and may cause the loss of computation power. Dynamic partitioning can alleviate the resource fragmentation problem. It is studied based on a lattice model, a special partial ordering relation on a set. Procedures to manage resources in partitionable systems are presented. These procedures can be applied each time a subsystem changes its status.},
school = {Purdue Univ., Indianapolis, IN (United States)},
place = {United States},
year = {1987},
month = {1}
}

