In the OSTI Collections: High-Performance Computing
- Computing efficiently
- Programming efficiently
- Correcting mistakes, avoiding failures
- Research Organizations
- Reports Available through OSTI's SciTech Connect
- Reports Available through OSTI's DOepatents
- Additional Reference
What's happening in one current research field can be guessed from these recent report title excerpts:
- “Global Simulation of Plasma Microturbulence at the Petascale & Beyond”
- “Petascale, Adaptive CFD” (i.e., Computational Fluid Dynamics)
- “Multiscale Molecular Simulations at the Petascale”
- “The Cielo Petascale Capability Supercomputer: Providing Large-Scale Computing for Stockpile Stewardship”
- “Community Petascale Project for Accelerator Science and Simulation”
The reports, which deal with a variety of physical phenomena, have an underlying commonality: simulating phenomena by computer processing of large amounts of data, at the “petascale”, meaning a scale on the order of quadrillions of operations per second, or quadrillions of bytes of data, or both—something only a few hundred machines on Earth are presently capable of.[Wikipedia]
Quadrillion-to-one ratios can be hard to picture one-dimensionally, since a quadrillion of even small objects would still stretch a long way if laid end to end—for example, a quadrillion centimeters equals ten billion kilometers, a span wider than the orbit of Neptune.[Wikipedia] But picturing numbers in three dimensions can make even huge numbers like quadrillions easier (if not entirely easy) to imagine. A cubic kilometer contains one quadrillion cubic centimeters. By contrast, a mere million cubic centimeters would occupy one cubic meter.
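The length and volume comparisons above are plain unit conversions, and a few lines of exact integer arithmetic can confirm them. This is only a checking sketch; the constant and variable names are illustrative choices, not taken from any report.

```python
# Unit conversions behind the length and volume comparisons (exact integers).
CM_PER_M = 100
M_PER_KM = 1_000
CM_PER_KM = CM_PER_M * M_PER_KM      # centimeters in one kilometer

QUADRILLION = 10**15

# One dimension: a quadrillion centimeters laid end to end, in kilometers.
length_km = QUADRILLION // CM_PER_KM

# Three dimensions: cubic centimeters in a cubic kilometer and in a cubic meter.
cm3_per_km3 = CM_PER_KM**3
cm3_per_m3 = CM_PER_M**3

print(f"{length_km:,} km")
print(f"{cm3_per_km3:,} cubic cm per cubic km")
print(f"{cm3_per_m3:,} cubic cm per cubic m")
```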
The larger the scale of computation, the more detailed a computation can be. The Stony Brook University report “Interoperable Technologies for Advanced Petascale Simulations”[SciTech Connect] illustrates the kind of advances that today’s largest-scale computing makes possible in one type of simulation. The various physical processes to be simulated all involve fluid flow—for example, flows in which fluids deposit, dissolve, and precipitate materials, fluid flows through porous solids, fluid mixing, and fluid interactions with magnetic fields. These processes occur in situations as varied as the driving of windmills, the use of parachutes, the fusion of light atoms by inertial confinement[Wikipedia] to release energy, and the separation and reprocessing of spent nuclear fuel. The petascale simulations described are advanced both by improvements in how the simulation program represents solutions of the equations for the laws of motion as applied to fluids, and by providing the program with a user-friendly interface and a detailed manual.
While such large-scale computations address problems on the current frontiers of research, the data-processing scale itself also constitutes a research frontier. Even today’s fastest processors can’t execute anything like quadrillions of operations on floating-point[Wikipedia] numbers in one second; current supercomputing systems accomplish this by dividing the calculations among a few thousand to a few million processors running in parallel. Recent reports available from OSTI’s SciTech Connect and DOepatents describe ways to use high-performance computers efficiently, make programmers more effective by clarifying the high-performance systems’ workings, and avoid different modes of system failure. Other reports anticipate challenges of constructing computers of even higher performance to solve problems beyond today’s highest-performing systems.
Computing efficiently
Many mathematical problems can be solved by more than one method of computation, but in general some methods are more efficient than others for any given problem. If the computer you use efficiently solves problems of one general type, but your problems gradually change from being all of that type to being partly of that type and partly of a second type, you may find your computer serving you less and less efficiently. And if you upgrade to a new computer design that is only optimized for solving problems of your original type, your problem-solving may become more efficient, but not as efficient as it could be.
The type of problem most commonly addressed with earlier high-performance computers was solved most efficiently by both accessing data and calculating with it at a fairly steady rate; more recent problems are still partly of this type but increasingly require irregular data accesses and much less calculation per datum. Unfortunately, high-performance computers have continued to be designed to get high efficiency ratings for solving only the former type of problem. This situation is addressed by researchers at the University of Tennessee and Sandia National Laboratories in the report “Toward a New Metric for Ranking High Performance Computing Systems”[SciTech Connect], which describes a new benchmark that represents current computational problems more accurately than the present standard benchmark does.
Whatever benchmark a particular high-performance computer is designed around, it will generally execute different algorithms for solving the same problem with different efficiencies. One very common type of computation involves sets of quantities known as matrices.[Wikipedia] While equations that involve sums or products of individual numbers (e.g., “E=mc²”) describe some significant relations between quantities, sets of kindred quantities expressed as matrices can themselves be “added” and “multiplied” to find other sum and product matrices. The more quantities each matrix represents, the more steps a computer requires to calculate matrix sums, and even more so matrix products[Wikipedia]; the simplest cases, sums and products of matrices that contain only one quantity each, are basically the ordinary sums and products of the single quantities. A matrix may consist of any number of quantities, even an infinite number.
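To make the step counts concrete, here is a minimal sketch of the schoolbook definitions of matrix sum and matrix product, with a counter for the scalar operations each one performs. The function names are illustrative, and production codes use far more refined routines than these loops.

```python
def mat_add(a, b):
    """Add two equally sized matrices; return (sum, number of scalar additions)."""
    rows, cols = len(a), len(a[0])
    total = [[a[i][j] + b[i][j] for j in range(cols)] for i in range(rows)]
    return total, rows * cols


def mat_mul(a, b):
    """Schoolbook product of an n-by-m matrix and an m-by-p matrix.

    Returns (product, number of scalar multiplications).
    """
    n, m, p = len(a), len(b), len(b[0])
    product = [[0] * p for _ in range(n)]
    multiplications = 0
    for i in range(n):
        for j in range(p):
            for k in range(m):
                product[i][j] += a[i][k] * b[k][j]
                multiplications += 1
    return product, multiplications


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(mat_add(a, b))          # ([[6, 8], [10, 12]], 4)
print(mat_mul(a, b))          # ([[19, 22], [43, 50]], 8)
print(mat_mul([[3]], [[4]]))  # ([[12]], 1): the one-quantity case is ordinary multiplication
```

For two n-by-n matrices the sum takes n² additions while this product takes n³ multiplications, which is why products dominate the cost as matrices grow.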
The problems commonly solved with high-performance computers often require calculating products of matrices that are large enough to tax the computers’ resources. Matrix multiplication methods that compute their results through an efficient sequence of steps[Wikipedia], for example by minimizing the use of higher-cost physical operations (which, for current high-performance computers, include transferring matrices’ individual quantities between nodes of the machine[Wikipedia]), can save a great deal of time, energy, and money. The IBM patent “Matrix multiplication operations with data pre-conditioning in a high performance computing architecture”[DOepatents] describes essential features of one such method.
Minimization of data transfer costs was also a key consideration in the algorithm-design project whose accomplishments are detailed in the report “Towards Optimal Petascale Simulations”[SciTech Connect] from the University of California, Berkeley. The optimization sought in this project was to maximize the efficiency of algorithms that adapt to use evolving hardware resources, in view of existing trends of hardware development: computer hardware’s speed of operations on floating-point numbers is being improved “exponentially faster” than information transfer rates (or bandwidth[Wikipedia, Wikipedia]), which are themselves improving “exponentially faster” than the latency times[Wikipedia, Wikipedia] between when data is sent from one point and when it’s decoded elsewhere within the computer. Thus the algorithm design concentrated on minimizing data-communication requirements. The project’s accomplishments include identifying lower bounds on these requirements, analyzing existing algorithms which generally don’t attain these lower bounds, and identifying or inventing new algorithms that do attain them and evaluating the speed improvements, “which can be quite large.”
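One textbook way to see how reorganizing a computation reduces data movement is to multiply matrices block by block. The sketch below is not the method of the patent or of the Berkeley report. It assumes a simplified two-level memory in which a small "fast" memory holds exactly three square blocks, and it counts the matrix entries copied to or from "slow" memory. The function name and the memory model are assumptions made for illustration.

```python
def tiled_matmul(a, b, tile):
    """Multiply two n-by-n matrices block by block.

    Returns (product, words_moved). words_moved counts matrix entries copied
    between slow and fast memory under the assumed model: one block each of
    a, b, and the output fit in fast memory at the same time.
    """
    n = len(a)
    assert n % tile == 0, "sketch assumes the tile size divides n"
    c = [[0] * n for _ in range(n)]
    words_moved = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # The output block stays in fast memory while it accumulates.
            for k0 in range(0, n, tile):
                words_moved += 2 * tile * tile   # fetch one block of a, one of b
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        for k in range(k0, k0 + tile):
                            c[i][j] += a[i][k] * b[k][j]
            words_moved += tile * tile           # write the finished output block
    return c, words_moved


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]

reference, _ = tiled_matmul(a, b, 1)
for tile in (1, 2, 4, 8):
    product, words = tiled_matmul(a, b, tile)
    assert product == reference      # same answer, different traffic
    print(f"tile {tile}: {words} words moved")
```

In this model the traffic is 2n³/tile + n² words, so larger blocks perform the same arithmetic with less data movement, up to whatever block size the fast memory can actually hold.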
Other recent IBM patents address various problems of data transfer among parallel processors. Among these patents, two in particular describe processor-use coordination methods to solve the problems. The summary of “Internode data communications in a parallel computer”[DOepatents] begins by describing the problem one invention is meant to solve:
Conventionally in distributed processing systems running parallel applications, all parallel processes must be initialized together. This requires that all processes are available from the very beginning—at startup of the parallel application. As such, processes may not be initialized later to expand processing resources on demand. Further, in many parallel computers, a process must establish a data communications reception buffer upon initialization. Here, two problems are created. First, no other process may send data to an uninitialized process because no reception buffer exists. Second, an asynchronously initialized sending process has to check that a receiving process is initialized before sending data to the process. Such a check is expensive, in terms of time and execution cycles, and does not scale well.
The problem to be solved by the second patent’s invention is clearer from the patent title, “Processing communications events in parallel active messaging interface by awakening thread from wait state”[DOepatents], in which the word “thread” refers to one of a computer program’s independently manageable instruction sequences[Wikipedia].
In the trend toward more powerful processors and decreasing message latency, it becomes desirable to implement an interrupt-oriented way of waiting on data communications resources. In this way, a thread that needs to wait on a communications resource can do so without blocking on an instruction or spinning its processor. The thread can move to a wait state, grant possession of the processor to another thread which can do useful work while the waiting thread is off the processor. When an interrupt notifies the waiting thread of resource availability or arrival of a new instruction or message, the waiting thread awakens, regains possession of the processor, and carries on with its data communications processing.
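The general idea of waiting without spinning can be illustrated with ordinary operating-system threads. The sketch below uses the condition variable from Python's standard `threading` module as an analogy only; it is not the patent's interrupt mechanism or its messaging interface, and the `Mailbox` class and the message text are hypothetical.

```python
import threading
from collections import deque


class Mailbox:
    """Hypothetical message queue whose reader sleeps until a message arrives."""

    def __init__(self):
        self._messages = deque()
        self._ready = threading.Condition()

    def send(self, message):
        with self._ready:
            self._messages.append(message)
            self._ready.notify()          # wake one waiting reader

    def receive(self):
        with self._ready:
            # wait() releases the lock and suspends this thread, so no
            # processor time is spent polling; the loop rechecks the queue
            # after every wakeup.
            while not self._messages:
                self._ready.wait()
            return self._messages.popleft()


mailbox = Mailbox()
received = []

reader = threading.Thread(target=lambda: received.append(mailbox.receive()))
reader.start()                      # the reader may go to sleep in wait()
mailbox.send("halo exchange data")  # the notification wakes it up
reader.join(timeout=5)
print(received)
```

While the reader is suspended, other threads are free to use the processor, which is the benefit the passage describes.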
A standard system for interprocessor data transfer in high-performance computers is called “Message Passing Interface” or “MPI”.[Wikipedia] While standard, it still offers room for improvement, particularly with respect to adapting it for use with the highest-performing computers now operating or planned for the near future. Two routes of improvement explored in one five-year project are discussed in “Final Report for Enhancing the MPI Programming Model for PetaScale Systems”[SciTech Connect] from the University of Illinois at Urbana-Champaign: improvement of MPI algorithms and implementation strategies (already accomplished), and extensions to the standard to better support larger-scale computing systems.
Programming efficiently
Regardless of how well any device, large or small, is optimized internally, its use won’t yield optimal results unless the complete system that it’s part of (device + user) is made as effective as possible. A user who doesn’t easily grasp the device either mentally or physically is not likely to get the full advantage of its capabilities.
Getting a mental grasp of high-performance computers has been especially difficult. “Users of High Performance Computing (HPC) centers … have historically required expert knowledge of nearly every aspect of the computer center operation in order to take full advantage of the immense resources available. This knowledge takes months or more to develop, and can make entering the world of HPC intimidating for new users.”
One approach to solving this problem is described in the report from which the preceding quote was taken and suggested by the report’s title (“Lorenz: Using the Web to Make HPC Easier”[SciTech Connect]). The Lorenz project at Lawrence Livermore National Laboratory employs web technology to make information about the computing system easy to determine at a glance, manage computing jobs, and provide a high-level interface for defining simulation input, all through an ordinary standards-based web browser. The report notes that “[p]roductivity is improved by increasing access to information and simplifying tedious or difficult tasks. New HPC users are less intimidated because the Lorenz web application suite helps coordinate workflow and facilitate the proper and optimal use of HPC resources.” As a result, “[t]he Lorenz tools allow HPC users to focus on their simulations and science, rather than on the computing center and how to use it. These tools reduce the time it takes to learn how to use resources effectively, simplify tedious and repetitive tasks, and make sure that users have access to critical information when they need it.”
Early versions of a new tool often turn out later to look like rough drafts of something even better. New tools like those from the Lorenz project make their users more productive, but actual use also reveals that the new tools present sticking points of their own, which leads to their users recommending enhancements. The report goes on to list some of the enhancements that users would already like to see in the Lorenz suite:
- A center-wide alert system that allows users to subscribe to specific types of events (batch job completion, host downtime, news items).
- A calendar that tracks system modifications and other significant events on all computer clusters.
- Enhancements to the job management tool to enable more options during job launch and more capabilities for interacting with running jobs.
- Additions to the application portal that help users choose the appropriate resources, such as cluster and file system, given the requirements of a job.
A different approach to solving the user-grasp problem is described in the Oak Ridge National Laboratory report “ALCC Allocation Final Report: HPC Colony II”[SciTech Connect] and the “Project Final Report: HPC-Colony II”[SciTech Connect] by researchers at Oak Ridge, the University of Illinois at Urbana-Champaign, and IBM. Whereas interfaces like Lawrence Livermore’s Lorenz tools would clarify the workings of individual but very different high-performance computers, the HPC Colony II project examined the use of adaptive system software to automate the adaptation of users’ programs to the architecture of whatever high-performance computer they happen to be running on. Users would thus be able to focus on their work without having to acquire detailed firsthand knowledge of the computer.
Correcting mistakes, avoiding failures
Like any type of composition, computer programs beyond the very simplest seldom work as intended in their original form. Indeed, if one has appropriate means for examining a program, the need for some minor revisions can become evident (and acted upon) while its composition is in progress. For this reason, programmers employ software that facilitates checking their programs for bugs and debugging them.
Common debugging tools work well on programs that have relatively few threads, but not on the kind of massively multithreaded programs that high-performance computers often run in parallel on huge numbers of processors. The IBM patent “Debugging a high performance computing program”[DOepatents] addresses this problem. Noting that ordinary debuggers “are not aware that the threads of a high performance computing program often perform similar operations”, and thus “require a developer to manually sort through individual threads of execution to identify the defective threads” among, say, about 100,000 threads (“a near impossible task”), the patent describes a debugging method that involves grouping large numbers of threads by the locations in the computer’s memory that contain the functions[Wikipedia] which call for each thread’s execution. Displaying these groups of threads, along with what the debugger infers to be the names of their calling functions, can help the programmer identify which threads are defective.
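The grouping step can be pictured with a small sketch. Everything in it is hypothetical: the thread snapshot, the addresses, the symbol table, and the choice to list the smallest groups first are assumptions for illustration, not details from the patent.

```python
from collections import defaultdict

# Hypothetical snapshot: thread id -> address of the function that called
# the code the thread is currently executing.
thread_call_sites = {
    0: 0x4005D0, 1: 0x4005D0, 2: 0x4005D0, 3: 0x4007A4,
    4: 0x4005D0, 5: 0x4005D0, 6: 0x4005D0, 7: 0x4005D0,
}

# Hypothetical symbol table mapping function addresses to names.
symbols = {0x4005D0: "exchange_halo", 0x4007A4: "handle_timeout"}


def group_threads(call_sites):
    """Return {calling-function address: sorted list of thread ids}."""
    groups = defaultdict(list)
    for thread_id, address in call_sites.items():
        groups[address].append(thread_id)
    return {address: sorted(ids) for address, ids in groups.items()}


groups = group_threads(thread_call_sites)

# List the smallest groups first; a thread that is doing something different
# from nearly all the others is a natural place to start looking.
for address, ids in sorted(groups.items(), key=lambda item: len(item[1])):
    name = symbols.get(address, "<unknown>")
    print(f"{name} @ {address:#x}: {len(ids)} thread(s) {ids}")
```

With 100,000 threads the same grouping would still produce only a handful of lines to read, one per distinct calling function.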
Designing practical methods of statistical software debugging[Stack Overflow], in which clues to bugs’ sources are found in the way execution failures correlate with the software’s different responses to different inputs, presents special problems when the software runs on a high-performance computer with its typical intensive computing and processor intercommunication. Not only can statistical debugging methods that are adequate for ordinary software impose unsuitable overhead for debugging high-performance software, but such methods can even fail outright because they don’t account for the processor interdependence that generally occurs with massively parallel processing. Lawrence Livermore National Lab’s “Final Report on Statistical Debugging for Petascale Environments”[SciTech Connect], by one of statistical debugging’s inventors at the University of Wisconsin—Madison, describes these problems along with techniques recently developed to deal with them.
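As a rough illustration of the statistical idea only, the sketch below scores each observed program condition by how much more often runs fail when that condition holds than they fail overall. This is a simplified scoring rule chosen for clarity, not the method of the report, and the conditions and run data are invented.

```python
def failure_correlation(runs):
    """Score each predicate by how much more often runs fail when it is true.

    runs: list of (set_of_true_predicates, failed) pairs.
    Returns {predicate: P(fail | predicate true) - P(fail)}.
    """
    overall = sum(failed for _, failed in runs) / len(runs)
    predicates = set().union(*(preds for preds, _ in runs))
    scores = {}
    for predicate in predicates:
        outcomes = [failed for preds, failed in runs if predicate in preds]
        scores[predicate] = sum(outcomes) / len(outcomes) - overall
    return scores


# Hypothetical observations from six runs of an instrumented program.
runs = [
    ({"n > 0", "buffer is None"}, True),
    ({"n > 0"}, False),
    ({"n > 0", "buffer is None"}, True),
    ({"n > 0", "rank == 0"}, False),
    ({"rank == 0"}, False),
    ({"n > 0", "buffer is None", "rank == 0"}, True),
]

scores = failure_correlation(runs)
for predicate, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{predicate!r}: {score:+.2f}")
```

Here the condition `buffer is None` stands out because every run in which it held also failed. Collecting such observations cheaply, and across many interdependent processors, is where the difficulties described above arise.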
While the software running on any computer, along with its bugs, may change with the work assigned to it, the hardware doesn’t change so often. Hardware failure can thus be worked around almost as if it were a constant phenomenon, with the times between a system’s hardware failures having an average and standard deviation that don’t vary rapidly and depend largely on the reliability of the individual hardware components and the system’s size/number of components.
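The dependence on component count can be sketched under a common simplifying assumption: components fail independently at the same constant rate, and the system counts as failed whenever any one of them fails. Under that assumption the failure rates add, so the system's mean time between failures is the component figure divided by the number of components. The five-year component figure below is hypothetical.

```python
def system_mtbf_hours(component_mtbf_hours, component_count):
    """Mean time between failures of a system that fails when any part fails.

    Assumes independent components with identical, constant failure rates,
    so the system failure rate is the sum of the component rates.
    """
    return component_mtbf_hours / component_count


HOURS_PER_YEAR = 24 * 365
component_mtbf = 5 * HOURS_PER_YEAR   # hypothetical: five years per component

for count in (1_000, 100_000, 1_000_000):
    hours = system_mtbf_hours(component_mtbf, count)
    print(f"{count:>9,} components: about {hours:.3f} hours between failures")
```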
While “petascale” aptly describes the size of today’s highest-rated computers, it doesn’t represent the upper limit either of designers’ plans for future machines or of theoretical possibility. Indeed, the fastest-rated computer in the world[Top 500 …] is, in order-of-magnitude terms, halfway to exascale, or one quintillion floating-point operations per second. (Think of the number of cubic centimeters in a 10-km × 10-km × 10-km cube.) Yet, like anything else whose features are more than a few times bigger or smaller than the corresponding features of an otherwise similar thing, computers whose memory sizes or processor speeds so greatly exceed those of today’s petascale machines would differ from them in kind as well as in size. Furthermore, because of how different parameters of any entity are related, changing one parameter by a given amount generally won’t change all the other parameters in the same proportion. This means that the faster, larger-memory machines being contemplated for future tasks will present new modes of possible failure to be designed against that didn’t need to be considered when today’s machines were designed.
The lack of proportionality in how various computer parameters change between petascale and exascale machines, and its implications for how exascale machines can (and can’t) be made fault-resilient, are noted in the Argonne National Laboratory report “Addressing Failures in Exascale Computing”[SciTech Connect]. The report notes that the present method of making petascale machines resilient, in which checkpoints are set to detect when a system problem has interrupted a calculation so it can be restarted from the previous checkpoint, depends crucially on
- the average time between system failures being much longer than both the time programs run between checkpoints and the time it takes to restore the system to a consistent state and restart calculations after a failure;
- errors that could corrupt the checkpointed state itself being detected and corrected;
- output data being correct before it is used.
Yet, increasing computational speeds mean that the average time between failures is decreasing faster than disk checkpoint times and recovery times, “especially recovery from global system failures”; silent data corruptions may become too frequent; and erroneous results may not be detected before the results are used. The authors note that overcoming any one of these obstacles would provide a particular approach to exascale-computing resilience, but choosing one approach now would require knowledge, currently lacking, of the cause of failures and the frequency of silent data corruptions. The authors thus recommend
- performing experiments to estimate current systems’ rates of silent data-corruption,
- refining estimates of the cost to keep these rates low,
- understanding the market opportunities for low-power, high-resilience technology,
- aiming for an early decision on hardware-provided error detection and correction levels,
- investing first in research and development for technologies that will be required or beneficial for any design,
- focusing primarily on application-level error handling that would apply to at least the large majority of workloads, with solutions that address specific codes being a second priority.
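The first dependency noted above, that failures be much rarer than checkpoints and recoveries, can be made concrete with a rough model. The sketch below uses Young's well-known first-order approximation for the checkpoint interval together with a simple estimate of wasted time. Neither is taken from the Argonne report, and the five-minute checkpoint and restart costs are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the compute time between checkpoints."""
    return math.sqrt(2 * checkpoint_cost * mtbf)


def wasted_fraction(interval, checkpoint_cost, restart_cost, mtbf):
    """Rough fraction of machine time not spent on useful work.

    checkpoint_cost / interval: time spent writing checkpoints.
    (interval / 2 + restart_cost) / mtbf: expected rework plus restart
    per failure. Only meaningful while the result is well below 1.
    """
    return checkpoint_cost / interval + (interval / 2 + restart_cost) / mtbf


CHECKPOINT_MIN = 5.0   # hypothetical cost of writing one checkpoint, in minutes
RESTART_MIN = 5.0      # hypothetical cost of restoring and restarting, in minutes

for mtbf_min in (1440.0, 240.0, 60.0):   # one day, four hours, one hour
    interval = young_interval(CHECKPOINT_MIN, mtbf_min)
    waste = wasted_fraction(interval, CHECKPOINT_MIN, RESTART_MIN, mtbf_min)
    print(f"MTBF {mtbf_min:6.0f} min: checkpoint every {interval:5.1f} min, "
          f"about {waste:.0%} of time wasted")
```

Even in this crude model, shrinking the time between failures from a day to an hour pushes the estimated waste from under a tenth of the machine's time to roughly half, which is the pressure on checkpoint-based resilience that the report describes.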
Another report deals with the same problem from a somewhat different perspective. Consider a machine that had enough components for exascale computation, but no reduction in the components’ failure frequency below that of today’s petascale machines. Such a computer would simply fail more often—so often that the time taken to recover from errors would largely eliminate the advantages of its components’ higher speed, making the exascale machine no better overall than the petascale machines it was meant to improve on. The report “Investigating an API for resilient exascale computing”[SciTech Connect], by researchers at Sandia National Laboratories and the University of New Mexico, starts from the premise that hardware component reliability may not improve fast enough for current resilience techniques to suffice on exascale machines in the next 8-10 years. The report proposes building more fault tolerance into system and application software through a Resilience Application Program Interface (API)[Wikipedia], describes an initial investigation of the fault types and of APIs to mitigate them, presents proof-of-concept interfaces for “the frequent and important cases of memory errors and node failures”, and proposes a strategy for file-system failures. The authors note that, while a single API for fault-handling among hardware, operating system, and application “remains elusive”, the investigation “increased our understanding of both the mountainous challenges and the promising trailheads”.
Failures of hardware and software are design failures internal to a computer system. Other failures can be deliberately caused from outside. The best measures for securing a computer system from attack depend largely on the system’s nature, which as we’ve seen is largely a function of its size. The differences between appropriate security measures for desktop computers, servers, and high-performance computers are one of the main concerns discussed in the Los Alamos report “Continuous Monitoring And Cyber Security For High Performance Computing”[SciTech Connect].
As mandatory security measures are changed from periodic tests to continuous monitoring, the new measures will need to properly take the systems’ differing sizes, and consequent usage and maintenance patterns, into account to be effective. For instance, while security measures for desktop systems involve monitoring the systems for unauthorized and unmanaged hardware and software, which are likely to be vulnerable to attack, the hardware and software of large high-performance computers are both managed and so present less of a security problem. Also, high-performance computers undergo frequent hardware replacements, and users install and run application (non-system) software at their discretion, so monitoring systems that produce alerts for every such change would go off counterproductively often. These considerations among others indicate that desktop-based security requirements may be costly and ineffective for high-performance computers. According to the report:
It may be well worth the effort to work at the bleeding edge of continuous monitoring for its potential to tangibly improve HPC technical security and streamline HPC compliance. … While the national focus is on monitoring desktop systems, HPC sites have an opportunity to influence national decision-makers to account for differences between desktop computing and servers, and between [commercial off-the-shelf] computing and HPC and other iterations of scientific computing systems.
Exascale machines would differ from petascale machines in more than their fault-tolerance techniques. Authors from Sandia National Laboratories and the University of Notre Dame forecast “the potential characteristics of high-end systems across a spectrum of architectures … and with enough lower level characteristics to allow non-trivial extrapolations against future benchmarks” in “Yearly update: exascale projections for 2013”[SciTech Connect]. The report focuses mainly on the interaction of technology and architecture, with less attention to such issues as programming models and resiliency. A prime consideration in this forecast is the effect of certain technical trends having reached limits around 2004, one of which involved power dissipation becoming a “first class design constraint”, so that further increases in computer power came from substituting multicore processors[Wikipedia] for the previously developed single-core processors. This change resulted in new computer architectures, five types of which are described, that have “made the job of selecting the ‘most efficient’ ones to pursue as we move towards Exascale systems much more difficult.” The report benchmarks past and present systems and architectures in two ways (compare “Toward a New Metric for Ranking High Performance Computing Systems”[SciTech Connect], mentioned above) and projects these architectures into the future.
Much as the previously mentioned Sandia/University of New Mexico report “Investigating an API for resilient exascale computing” considered that computer components’ reliability might not increase much if at all, another report from a different Sandia/University of New Mexico team considers that current components’ power-density and information-transfer limits might not be surpassable by any practical means. Their proposal is to replace the present technology with a different one, which would be informed by new understanding of a natural exascale computer.
The fact that computers designed to perform efficiently according to certain benchmarks may be less efficient at processes that the benchmark doesn’t typify is particularly apparent when comparing high-performance computers to the human brain. We use computers to perform calculations that the brain either can’t do alone or can do only very inefficiently. Yet the brain’s own processes execute significantly more computations per unit of power, unit of volume, and unit of time than even the highest-performing machines do today. The project described in “A comprehensive approach to decipher biological computation to achieve next generation high-performance exascale computing”[SciTech Connect] aims to quantitatively characterize the brain’s information processing to, among other things, inform the design and fabrication of hardware that mimics both neuronal structures and their function; each of these has previously been mimicked independently, but not together so that the brain’s efficiency, compactness, and speed could all be duplicated. The report documents the authors’ achievements in characterizing neurons and neural tissue from the brain, as well as in using multiferroic[Wikipedia] devices and memristors[Wikipedia] for data storage and processing. These devices might eventually be made small enough to circumvent the obstacles to exascale computing set by present devices’ physical limits.
- Petascale computing
- Inertial confinement fusion
- Rayleigh-Taylor instability
- Floating-point numbers
- Matrix (mathematics)
- Matrix multiplication
- Matrix product (two matrices)
- Algorithms for efficient matrix multiplication
- Communication-avoiding and distributed algorithms
- Rotation matrix
- Network performance
- Performance measures
- Bandwidth (computing)
- Latency (engineering)
- Thread (computing)
- Message Passing Interface
- Function (computer science) [aka Subroutine]
- Application programming interface
- Multi-core processor
- Moore’s law
Research Organizations
- Argonne Leadership Computing Facility, Argonne National Laboratory
- Princeton Plasma Physics Laboratory
- Princeton University
- Lawrence Berkeley National Laboratory
- Pennsylvania State University
- University of Colorado, Boulder
- University of Chicago
- Los Alamos National Laboratory
- Sandia National Laboratories
- University of California, Los Angeles
- Lawrence Livermore National Laboratory
- Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory
- State University of New York at Stony Brook
- University of Tennessee
- University of California, Berkeley
- University of Illinois at Urbana-Champaign
- University of Wisconsin—Madison
- University of New Mexico
- University of Notre Dame
Reports Available through OSTI's SciTech Connect
- “Global Simulation of Plasma Microturbulence at the Petascale & Beyond (Optimizing the GTC Code for Blue Gene/Q): ALCF-2 Early Science Program Technical Report” [Metadata and full text available through OSTI’s SciTech Connect]
- “Petascale, Adaptive CFD (ALCF ESP Technical Report): ALCF-2 Early Science Program Technical Report” [Metadata and full text available through OSTI’s SciTech Connect]
- “Multiscale Molecular Simulations at the Petascale (Parallelization of Reactive Force Field Model for Blue Gene/Q): ALCF-2 Early Science Program Technical Report” [Metadata and full text available through OSTI’s SciTech Connect]
- “The Cielo Petascale Capability Supercomputer: Providing Large-Scale Computing for Stockpile Stewardship” [Metadata and full text available through OSTI’s SciTech Connect]
- “Community Petascale Project for Accelerator Science and Simulation” [Metadata and full text available through OSTI’s SciTech Connect]
- “Interoperable Technologies for Advanced Petascale Simulations” [Metadata and full text available through OSTI’s SciTech Connect]
- “Toward a New Metric for Ranking High Performance Computing Systems” [Metadata and full text available through OSTI’s SciTech Connect]
- “Towards Optimal Petascale Simulations” [Metadata and full text available through OSTI’s SciTech Connect]
- “Final Report for Enhancing the MPI Programming Model for PetaScale Systems” [Metadata and full text available through OSTI’s SciTech Connect] (MPI = Message Passing Interface[Wikipedia])
- “Lorenz: Using the Web to Make HPC Easier” [Metadata and full text available through OSTI’s SciTech Connect]
- “ALCC Allocation Final Report: HPC Colony II” [Metadata and full text available through OSTI’s SciTech Connect]
- “Project Final Report: HPC-Colony II” [Metadata and full text available through OSTI’s SciTech Connect]
- “Final Report on Statistical Debugging for Petascale Environments” [Metadata and full text available through OSTI’s SciTech Connect]
- “Addressing Failures in Exascale Computing” [Metadata and full text available through OSTI’s SciTech Connect]
- “Investigating an API for resilient exascale computing” [Metadata and full text available through OSTI’s SciTech Connect]
- “Continuous Monitoring And Cyber Security For High Performance Computing” [Metadata and full text available through OSTI’s SciTech Connect]
- “A comprehensive approach to decipher biological computation to achieve next generation high-performance exascale computing” [Metadata and full text available through OSTI’s SciTech Connect]
- “Yearly update: exascale projections for 2013” [Metadata and full text available through OSTI’s SciTech Connect]
Reports Available through OSTI's DOepatents
- “Matrix multiplication operations with data pre-conditioning in a high performance computing architecture” [Metadata and full text available through OSTI’s DOepatents]
- “Internode data communications in a parallel computer” [Metadata and full text available through OSTI’s DOepatents]
- “Processing communications events in parallel active messaging interface by awakening thread from wait state” [Metadata and full text available through OSTI’s DOepatents]
- “Debugging a high performance computing program” [Metadata and full text available through OSTI’s DOepatents]
Prepared by Dr. William N. Watson, Physicist
DoE Office of Scientific and Technical Information