Scalable resource management in high performance computers.

Frachtenberg, E; Petrini, F; Fernandez Peinador, J; Coll, S

doi:10.1109/CLUSTR.2002.1137759

Conference · Tue Jan 01 00:00:00 EST 2002

DOI: https://doi.org/10.1109/CLUSTR.2002.1137759 · OSTI ID: 976130

Frachtenberg, Eitan; Petrini, Fabrizio; Fernandez Peinador, Juan; Coll, Salvador

Clusters of workstations have emerged as an important platform for building cost-effective, scalable and highly available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource management tool designed to provide scalability, low overhead and the flexibility necessary to efficiently support and analyze a wide range of job scheduling algorithms. STORM achieves these goals by closely integrating the management daemons with the low-level features that are common in state-of-the-art high-performance system area networks. The architecture of STORM is based on three main technical innovations. First, a sizable part of the scheduler runs in the thread processor located on the network interface. Second, we use highly scalable hardware collectives both to implement control heartbeats and to distribute the binary of a parallel job in near-constant time, irrespective of job and machine sizes. Third, we use an I/O bypass protocol that allows fast data movement from the file system to the communication buffers in the network interface and vice versa. The experimental results show that STORM can launch a job with a 12 MB binary on a 64-processor/32-node cluster in less than 0.25 sec on an empty network, in less than 0.45 sec when all the processors are busy computing other jobs, and in less than 0.65 sec when the network is flooded with background traffic. This paper provides experimental and analytical evidence that these results scale to a much larger number of nodes. To the best of our knowledge, STORM is at least two orders of magnitude faster than existing production schedulers in launching jobs, performing resource management tasks and gang scheduling.
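As a rough illustration of why hardware collectives can keep launch time nearly flat as the machine grows, the sketch below counts idealized communication rounds needed to deliver one binary to every node under three strategies. It is not STORM's code: the function names are hypothetical, the model ignores bandwidth, contention and file system cost, and it reports rounds, not seconds.

```python
# Idealized model: count communication rounds needed to deliver one job
# binary to every node. All names here are hypothetical, for illustration.

def serial_steps(nodes: int) -> int:
    """A single sender copies the binary to each other node in turn."""
    return max(nodes - 1, 0)

def binomial_tree_steps(nodes: int) -> int:
    """Software tree: every node that holds the binary forwards it each
    round, so the number of holders doubles. Equals ceil(log2(nodes))."""
    return (nodes - 1).bit_length() if nodes > 1 else 0

def hardware_multicast_steps(nodes: int) -> int:
    """Assumption: the network replicates the data in its switches, so a
    single send reaches all destinations regardless of machine size."""
    return 1 if nodes > 1 else 0

if __name__ == "__main__":
    for n in (32, 1024, 65536):
        print(n, serial_steps(n), binomial_tree_steps(n),
              hardware_multicast_steps(n))
```

Under these assumptions the serial count grows linearly with the node count, the software tree grows logarithmically, and the multicast count stays constant, which is the qualitative behavior the abstract describes as "near-constant time".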

Research Organization: Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Sponsoring Organization: USDOE

OSTI ID: 976130

Report Number(s): LA-UR-02-1672; TRN: US201009%%561

Resource Relation: Conference: Submitted to: International Supercomputing Conference, New York, June, 2002

Country of Publication: United States

Language: English

Similar Records

Flexible CoScheduling : mitigating load imbalance and improving utilization of heterogeneous resources

Conference · Tue Jan 01 00:00:00 EST 2002 · OSTI ID:976130

Frachtenberg, E; Feitelson, Dror G; Petrini, F; +1 more

Adaptive Parallel Job Scheduling with Flexible CoScheduling

Journal Article · Tue Nov 01 00:00:00 EST 2005 · IEEE Transactions on Parallel and Distributed Systems, 16(11):1066-1077 · OSTI ID:976130

Frachtenberg, Eitan; Feitelson, Dror; Petrini, Fabrizio; +1 more

A New coscheduling technique for a cluster of symmetric multiprocessors

Conference · Mon Apr 17 00:00:00 EDT 2000 · OSTI ID:976130

Yoo, A B; Jette, M A

Related Subjects

99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
ALGORITHMS
ARCHITECTURE
BUFFERS
COMMUNICATIONS
COMPUTERS
EVALUATION
FLEXIBILITY
LAUNCHING
MANAGEMENT
PERFORMANCE
PRODUCTION
RESOURCE MANAGEMENT
STORMS
