skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Petascale Computing Enabling Technologies Project Final Report

Technical Report ·
DOI:https://doi.org/10.2172/972422· OSTI ID:972422

The Petascale Computing Enabling Technologies (PCET) project addressed challenges arising from current trends in computer architecture that will lead to large-scale systems with many more nodes, each of which uses multicore chips. These factors will soon lead to systems that have over one million processors. Also, the use of multicore chips will lead to less memory and less memory bandwidth per core. We need fundamentally new algorithmic approaches to cope with these memory constraints and the huge number of processors. Further, correct, efficient code development is difficult even with the number of processors in current systems; more processors will only make it harder. The goal of PCET was to overcome these challenges by developing the computer science and mathematical underpinnings needed to realize the full potential of our future large-scale systems. Our research results will significantly increase the scientific output obtained from LLNL large-scale computing resources by improving application scientist productivity and system utilization. Our successes include scalable mathematical algorithms that adapt to these emerging architecture trends and through code correctness and performance methodologies that automate critical aspects of application development as well as the foundations for application-level fault tolerance techniques. PCET's scope encompassed several research thrusts in computer science and mathematics: code correctness and performance methodologies, scalable mathematics algorithms appropriate for multicore systems, and application-level fault tolerance techniques. Due to funding limitations, we focused primarily on the first three thrusts although our work also lays the foundation for the needed advances in fault tolerance. In the area of scalable mathematics algorithms, our preliminary work established that OpenMP performance of the AMG linear solver benchmark and important individual kernels on Atlas did not match the predictions of our simple initial model. Our investigations demonstrated that a poor default memory allocation mechanism degraded performance. We developed a prototype NUMA library to provide generic mechanisms to overcome these issues, resulting in significantly improved OpenMP performance. After additional testing, we will make this library available to all users, providing a simple means to improve threading on LLNL's production Linux platforms. We also made progress on developing new scalable algorithms that target multicore nodes. We designed and implemented a new AMG interpolation operator with improved convergence properties for very low complexity coarsening schemes. This implementation will also soon be available to LLNL's application teams as part of the hypre library. We presented results for both topics in an invited plenary talk entitled 'Efficient Sparse Linear Solvers for Multi-Core Architectures' at the 2009 HPCMP Institutes Annual Meeting/CREATE Annual All-Hands Meeting. The interpolation work was summarized in a talk entitled 'Improving Interpolation for Aggressive Coarsening' at the 14th Copper Mountain Conference on Multigrid Methods and in a research paper that will appear in Numerical Linear Algebra with Applications. In the area of code correctness, we significantly extended our behavior equivalence class identification mechanism. Specifically, we not only demonstrated it works well at very large scales but we also added the ability to classify MPI tasks not only by function call traces, but also by specific call sites (source code line numbers) being executed by tasks. More importantly, we developed a new technique to determine relative logical execution progress of tasks in the equivalence classes by combining static analysis with our original dynamic approach. We applied this technique to a correctness issue that arose at 4096 tasks during the development of the new AMG interpolation operator discussed above. This scale isat the limit of effectiveness of production tools, but our technique quickly located the erroneous source code, demonstrating the power of understanding relationships between behavioral equivalence classes. This work is the subject of a paper recently accepted to SC09, as well as a presentation entitled 'Providing Order to Extreme Scale Debugging Chaos' given at the ParaDyn Week annual conference in College Park, MD. In addition to this theoretical extension, we have made significant progress in developing a front end for this tool set, and the front-end is now available on several of LLNL's largescale computing resources. In addition, we explored mechanisms to identify exact locations of erroneous MPI usage in application source code. In this work, we developed a new model that led to a highly efficient algorithm for detecting deadlock during dynamic software testing. This work was the subject of a well-received paper at ICS 2009 [4].

Research Organization:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
W-7405-ENG-48
OSTI ID:
972422
Report Number(s):
LLNL-TR-423702; TRN: US201005%%309
Country of Publication:
United States
Language:
English