OSTI.GOV | U.S. Department of Energy
Office of Scientific and Technical Information

Title: Petascale system management experiences.

Abstract

Petascale High-Performance Computing (HPC) systems are among the largest systems in the world. Intrepid, one such system, is a 40,000-node, 556-teraflop Blue Gene/P system that has been deployed at Argonne National Laboratory. In this paper, we provide some background about the system and our administration experiences. In particular, due to the scale of the system, we have faced a variety of issues, some surprising to us, that are not common in the commodity world. We discuss our expectations, these issues, and the approaches we have used to address them.

HPC systems are a bellwether for computing systems at large, in multiple regards. HPC users are motivated by the need for absolute performance; this results in two important pushes. HPC users are frequently early adopters of new technologies and techniques; successful technologies, like InfiniBand, prove their value in HPC before gaining wider adoption. Unfortunately, this early adoption alone is not sufficient to achieve the levels of performance required by HPC users; parallelism must also be harnessed. Over the last 15 years, Beowulf clustering has provided amazing accessibility to non-HPC-savvy and even non-technical audiences. During this time, substantial adoption of clustering has occurred in many market segments unrelated to computational science. A simple trend has emerged: the scale and performance of high-end HPC systems are uncommon at first, but become commonplace over the course of 3-5 years. For example, in early 2003, several systems on the Top500 list consisted of either 1024 nodes or 4096-8192 cores. In 2008, such systems are commonplace.

The most recent generation of high-end HPC systems, so-called petascale systems, is the culmination of years of research and development in research labs and academia. Three such systems have been deployed thus far. In addition to the 556 TF Intrepid system at Argonne National Laboratory, a 596 TF Blue Gene/L-based system has been deployed at Lawrence Livermore National Laboratory, and a 504 TF Opteron-based system has been deployed at the Texas Advanced Computing Center (TACC). Intrepid comprises 40,960 nodes with a total of 163,840 cores. While systems like these are uncommon now, we expect them to become more widespread in the coming years. The scale of these large systems imposes several requirements upon system architecture. The need for scalability is obvious; however, power efficiency and density constraints have become increasingly important in recent years. At the same time, because the size of the administrative staff cannot grow linearly with the system size, more efficient system management techniques are needed.

In this paper, we will describe our experiences administering Intrepid. Over the last year, we have experienced a number of interesting challenges in this endeavor. Our initial expectation was for scalability to be the dominant system issue; this expectation was not accurate. Several issues expected to have minor impact have played a much greater role in system operations. Debugging, due to the large number of components used in scalable system operations, has become a much more difficult endeavor. The system has a sophisticated monitoring system; however, the analysis of its data has been problematic. These issues are not specific to HPC workloads in any way, so we expect them to be of general interest.

This paper consists of three major parts. First, we will provide a detailed overview of several important aspects of Intrepid's hardware and software. In doing so, we will highlight the aspects that have featured prominently in our system management experiences. Next, we will describe our administration experiences in detail. Finally, we will draw some conclusions based on these experiences. In particular, we will discuss the implications for the non-HPC world, system managers, and system software developers.
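
As a quick sanity check on the figures quoted above (not part of the paper), the node, core, and peak-performance numbers are consistent with the commonly cited Blue Gene/P configuration of 4 cores per node at 850 MHz with 4 floating-point operations per cycle per core. The short Python sketch below reproduces them under those assumptions; the per-node core count, clock rate, and flops-per-cycle values are assumptions, not taken from this record.

# Back-of-envelope check of the Intrepid figures quoted in the abstract.
# Assumptions (not from the paper): each Blue Gene/P compute node has
# 4 PowerPC 450 cores at 850 MHz, each retiring 4 double-precision
# floating-point operations per cycle.
NODES = 40960
CORES_PER_NODE = 4        # assumed quad-core compute nodes
CLOCK_HZ = 850e6          # assumed 850 MHz clock
FLOPS_PER_CYCLE = 4       # assumed 4 flops/cycle per core

cores = NODES * CORES_PER_NODE
peak_tflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12

print(f"cores       = {cores:,}")          # 163,840, matching the abstract
print(f"peak TFLOPS = {peak_tflops:.0f}")  # about 557, close to the quoted 556 TF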

Authors:
Desai, N; Bradshaw, R; Lueninghoener, C; Cherry, A; Coghlan, S; Scullin, W [1]
  1. LCF
Publication Date:
2008
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1049677
Report Number(s):
ANL/MCS/CP-62546
TRN: US201218%%144
DOE Contract Number:  
DE-AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 22nd Large Installation System Administration Conference (LISA 2008); Nov. 9, 2008 - Nov. 14, 2008; San Diego, CA
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ANL; ARCHITECTURE; EFFICIENCY; LAWRENCE LIVERMORE NATIONAL LABORATORY; MANAGEMENT; MARKET; MONITORING; PERFORMANCE

Citation Formats

Desai, N, Bradshaw, R, Lueninghoener, C, Cherry, A, Coghlan, S, and Scullin, W. Petascale system management experiences. United States: N. p., 2008. Web.
Desai, N, Bradshaw, R, Lueninghoener, C, Cherry, A, Coghlan, S, & Scullin, W. Petascale system management experiences. United States.
Desai, N, Bradshaw, R, Lueninghoener, C, Cherry, A, Coghlan, S, and Scullin, W. 2008. "Petascale system management experiences". United States.
@article{osti_1049677,
title = {Petascale system management experiences},
author = {Desai, N and Bradshaw, R and Lueninghoener, C and Cherry, A and Coghlan, S and Scullin, W},
url = {https://www.osti.gov/biblio/1049677},
place = {United States},
year = {2008}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
