OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors

Abstract

The HPC-Colony Project, a collaboration with Lawrence Livermore National Laboratory, the University of Illinois at Urbana-Champaign and IBM, is focused on services and interfaces for very large numbers of processors. Advances in parallel systems in the last decade have delivered phenomenal progress in the overall capability available to a single parallel application. Several systems with peak capability of over 100TF are already available and systems are expected to exceed 1PF within a few years. Despite these impressive advances in peak performance capability, the sustained performance of these systems continues to fall as a percentage of the peak capability. Initial analysis suggests that key architectural bottlenecks (in hardware and software) are responsible for the lower sustained performance and some architectural change of direction may be necessary to address the declining sustained performance. In this proposal we focus on addressing software architectural bottlenecks, in the areas of operating system and runtime systems. While the trend towards larger processor counts benefits application developers through more processing power, it also challenges application developers to harness ever-increasing numbers of processors for productive work. Much of the burden falls to operating systems and runtime systems that were originally designed for much smaller processor counts. Under the Colony project, we are researching and developing system software to enable general purpose operating and runtime systems for tens of thousands of processors. Difficulties in achieving a balanced partitioning and dynamically scheduling workloads can limit scaling for complex problems on large machines. Scientific simulations that span components of large machines require common operating system services, such as process scheduling, event notification, and job management to scale to large machines. Today, application programmers must explicitly manage these resources.
We address scaling issues and porting issues by delegating resource management tasks to a sophisticated parallel OS. Our definition of "managing resources" includes balancing CPU time, network utilization, and memory usage across the entire machine. We believe a consistent environment that provides newly necessary technology (such as fault tolerance) will also provide important efficiencies in system administration. The primary objective of the Colony Project is to develop technologies that enable application scientists to easily scale applications to computing platforms comprising tens of thousands to hundreds of thousands of compute cores. This will be accomplished by addressing several problem areas that are known to be key factors when scaling applications to tens of thousands of processors. First, by providing a smart runtime system to quickly and dynamically make CPU, memory, and interconnect resource management adjustments, we remove the burden of achieving applications that are highly tuned and load-balanced for a particular execution instance (i.e., a particular combination of input dataset and machine platform). Second, by providing a full complement of system services including the entire Linux system call set, we ease the challenge of developing portable applications since lightweight kernels frequently incorporate only a small subset of the POSIX calls prevalent in typical large scientific applications. Third, by providing fundamental changes to the Linux kernel that reduce variability in context switch times and provide for parallel-aware scheduling across the entire machine, we remove the negative impact of synchronizing collectives on bulk-synchronous applications. Fourth, by providing fault tolerance mechanisms that utilize our unique migration abilities in conjunction with in-memory techniques for minimal overhead, we eliminate the necessity for costly frequent application-driven checkpoints.
Our research utilizes full implementations of these technologies on systems consisting of tens of thousands of processors.

Authors:
Jones, T; Kale, L; Moreira, J; Mendes, C; Chakravorty, S; Tauferner, A; Inglett, T
Publication Date:
2007-01-31
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
902273
Report Number(s):
UCRL-TR-227723
TRN: US200717%%516
DOE Contract Number:
W-7405-ENG-48
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; KERNELS; LAWRENCE LIVERMORE NATIONAL LABORATORY; MANAGEMENT; PERFORMANCE; PROCESSING; RESOURCE MANAGEMENT; TOLERANCE

Citation Formats

Jones, T, Kale, L, Moreira, J, Mendes, C, Chakravorty, S, Tauferner, A, and Inglett, T. HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors. United States: N. p., 2007. Web. doi:10.2172/902273.
Jones, T, Kale, L, Moreira, J, Mendes, C, Chakravorty, S, Tauferner, A, & Inglett, T. (2007). HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors. United States. doi:10.2172/902273.
Jones, T, Kale, L, Moreira, J, Mendes, C, Chakravorty, S, Tauferner, A, and Inglett, T. 2007. "HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors". United States. doi:10.2172/902273. https://www.osti.gov/servlets/purl/902273.
@techreport{osti_902273,
title = {HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors},
author = {Jones, T and Kale, L and Moreira, J and Mendes, C and Chakravorty, S and Tauferner, A and Inglett, T},
abstractNote = {The HPC-Colony Project, a collaboration with Lawrence Livermore National Laboratory, the University of Illinois at Urbana-Champaign and IBM, is focused on services and interfaces for very large numbers of processors. Advances in parallel systems in the last decade have delivered phenomenal progress in the overall capability available to a single parallel application. Several systems with peak capability of over 100TF are already available and systems are expected to exceed 1PF within a few years. Despite these impressive advances in peak performance capability, the sustained performance of these systems continues to fall as a percentage of the peak capability. Initial analysis suggests that key architectural bottlenecks (in hardware and software) are responsible for the lower sustained performance and some architectural change of direction may be necessary to address the declining sustained performance. In this proposal we focus on addressing software architectural bottlenecks, in the areas of operating system and runtime systems. While the trend towards larger processor counts benefits application developers through more processing power, it also challenges application developers to harness ever-increasing numbers of processors for productive work. Much of the burden falls to operating systems and runtime systems that were originally designed for much smaller processor counts. Under the Colony project, we are researching and developing system software to enable general purpose operating and runtime systems for tens of thousands of processors. Difficulties in achieving a balanced partitioning and dynamically scheduling workloads can limit scaling for complex problems on large machines. Scientific simulations that span components of large machines require common operating system services, such as process scheduling, event notification, and job management to scale to large machines. 
Today, application programmers must explicitly manage these resources. We address scaling issues and porting issues by delegating resource management tasks to a sophisticated parallel OS. Our definition of ``managing resources'' includes balancing CPU time, network utilization, and memory usage across the entire machine. We believe a consistent environment that provides newly necessary technology (such as fault tolerance) will also provide important efficiencies in system administration. The primary objective of the Colony Project is to develop technologies that enable application scientists to easily scale applications to computing platforms comprising tens of thousands to hundreds of thousands of compute cores. This will be accomplished by addressing several problem areas that are known to be key factors when scaling applications to tens of thousands of processors. First, by providing a smart runtime system to quickly and dynamically make CPU, memory, and interconnect resource management adjustments, we remove the burden of achieving applications that are highly tuned and load-balanced for a particular execution instance (i.e., a particular combination of input dataset and machine platform). Second, by providing a full complement of system services including the entire Linux system call set, we ease the challenge of developing portable applications since lightweight kernels frequently incorporate only a small subset of the POSIX calls prevalent in typical large scientific applications. Third, by providing fundamental changes to the Linux kernel that reduce variability in context switch times and provide for parallel-aware scheduling across the entire machine, we remove the negative impact of synchronizing collectives on bulk-synchronous applications.
Fourth, by providing fault tolerance mechanisms that utilize our unique migration abilities in conjunction with in-memory techniques for minimal overhead, we eliminate the necessity for costly frequent application-driven checkpoints. Our research utilizes full implementations of these technologies on systems consisting of tens of thousands of processors.},
doi = {10.2172/902273},
institution = {Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)},
number = {UCRL-TR-227723},
address = {United States},
year = {2007},
month = jan
}
