Summary: Scalable Resource Management in High Performance Computers
Eitan Frachtenberg, Fabrizio Petrini, Juan Fernandez, and Salvador Coll
CCS-3 Modeling, Algorithms, and Informatics Group
Computer and Computational Sciences (CCS) Division
Los Alamos National Laboratory (LANL)
{eitanf,fabrizio,juanf,scoll}@lanl.gov
Abstract
Clusters of workstations have emerged as an important
platform for building cost-effective, scalable, and highly
available computers. Although many hardware solutions
are available today, the largest challenge in making
large-scale clusters usable lies in the system software. In this
paper we present STORM, a resource management tool
designed to provide scalability, low overhead, and the
flexibility necessary to efficiently support and analyze a wide
range of job-scheduling algorithms. STORM achieves these
feats by using a small set of primitive mechanisms that are
common in modern high-performance interconnects. The
architecture of STORM is based on three main technical in-