Petascale system management experiences.
Conference · OSTI ID: 1049677
Petascale High-Performance Computing (HPC) systems are among the largest systems in the world. Intrepid, one such system, is a 40,000-node, 556-teraflop Blue Gene/P system that has been deployed at Argonne National Laboratory. In this paper, we provide some background about the system and our administration experiences. In particular, due to the scale of the system, we have faced a variety of issues, some surprising to us, that are not common in the commodity world. We discuss our expectations, these issues, and the approaches we have used to address them.

HPC systems are a bellwether for computing systems at large, in multiple regards. HPC users are motivated by the need for absolute performance, and this results in two important pushes. HPC users are frequently early adopters of new technologies and techniques; successful technologies, like InfiniBand, prove their value in HPC before gaining wider adoption. Unfortunately, this early adoption alone is not sufficient to achieve the levels of performance required by HPC users; parallelism must also be harnessed. Over the last 15 years, Beowulf clustering has provided amazing accessibility to non-HPC-savvy and even non-technical audiences. During this time, substantial adoption of clustering has occurred in many market segments unrelated to computational science. A simple trend has emerged: the scale and performance of high-end HPC systems are uncommon at first but become commonplace over the course of 3-5 years. For example, in early 2003, several systems on the Top500 list consisted of either 1024 nodes or 4096-8192 cores. In 2008, such systems are commonplace.

The most recent generation of high-end HPC systems, so-called petascale systems, is the culmination of years of research and development in industry and academia. Three such systems have been deployed thus far. In addition to the 556 TF Intrepid system at Argonne National Laboratory, a 596 TF Blue Gene/L-based system has been deployed at Lawrence Livermore National Laboratory, and a 504 TF Opteron-based system has been deployed at the Texas Advanced Computing Center (TACC). Intrepid comprises 40,960 nodes with a total of 163,840 cores. While systems like these are uncommon now, we expect them to become more widespread in the coming years.

The scale of these large systems imposes several requirements on system architecture. The need for scalability is obvious; however, power efficiency and density constraints have become increasingly important in recent years. At the same time, because the size of the administrative staff cannot grow linearly with system size, more efficient system management techniques are needed.

In this paper we will describe our experiences administering Intrepid. Over the last year, we have encountered a number of interesting challenges in this endeavor. Our initial expectation was for scalability to be the dominant system issue; this expectation was not accurate. Several issues expected to have minor impact have played a much greater role in system operations. Debugging has become a much more difficult endeavor because of the large number of components involved in scalable system operations. The system has sophisticated monitoring; however, analysis of the resulting data has been problematic. These issues are not specific to HPC workloads in any way, so we expect them to be of general interest.

This paper consists of three major parts. First, we will provide a detailed overview of several important aspects of Intrepid's hardware and software, highlighting those that have featured prominently in our system management experiences. Next, we will describe our administration experiences in detail. Finally, we will draw some conclusions based on these experiences. In particular, we will discuss the implications for the non-HPC world, system managers, and system software developers.
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: SC
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1049677
- Report Number(s): ANL/MCS/CP-62546
- Country of Publication: United States
- Language: English