OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automated Cache Performance Analysis And Optimization

Technical Report · DOI: https://doi.org/10.2172/1113233 · OSTI ID: 1113233
 [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

While there is no lack of performance counter tools for coarse-grained measurement of cache activity, there is a critical lack of tools for relating data layout to cache behavior and, in turn, to application performance. Generally, nontrivial optimizations are either not done at all or are done "by hand", requiring significant time and expertise. To the best of our knowledge, no tool available to users measures the latency of memory reference instructions for particular addresses and makes this information available in an easy-to-use and intuitive way. In this project, we worked to enable the Open|SpeedShop performance analysis tool to gather memory reference latency information for specific instructions and memory addresses, and to display this information in an easy-to-use and intuitive way to aid performance analysts in identifying problematic data structures in their codes. This tool was primarily designed for use in the supercomputer domain as well as grid, cluster, cloud-based parallel e-commerce, and engineering systems and middleware. Ultimately, we envision a tool to automate optimization of application cache layout and utilization in the Open|SpeedShop performance analysis tool. To commercialize this software, we worked to develop core capabilities for gathering enhanced memory usage performance data from applications and to create and apply novel methods for automatic data structure layout optimization, tailoring the overall approach to support existing supercomputer and cluster programming models and constraints.
In this Phase I project, we focused on the infrastructure necessary to gather performance data and present it to users in an intuitive way. With the advent of enhanced Precise Event-Based Sampling (PEBS) counters on recent Intel processor architectures and equivalent technology on AMD processors, we are now in a position to access memory reference information for particular addresses.
Prior to the introduction of PEBS counters, cache behavior could only be measured reliably in the aggregate across tens or hundreds of thousands of instructions. With the newest iteration of PEBS technology, cache events can be tied to a tuple of instruction pointer, target address (for both loads and stores), memory hierarchy level, and observed latency. With this information we can now begin asking questions about the efficiency not only of regions of code, but also of how these regions interact with particular data structures and how these interactions evolve over time. In the short term, this information will be vital for performance analysts seeking to understand and optimize the behavior of their codes for the memory hierarchy. In the future, we can begin to ask how data layouts might be changed to improve performance and, for a particular application, what the theoretical optimal performance might be.
The overall benefit to be produced by this effort was a commercial-quality, easy-to-use, and scalable performance tool that will allow both beginner and experienced parallel programmers to automatically tune their applications for optimal cache usage. Effective use of such a tool can save weeks of performance tuning effort.
Easy to use. With the proposed innovations, finding and fixing memory performance issues would be more automated and would hide most or all of the required performance-engineering expertise "under the hood" of the Open|SpeedShop performance tool. One of the biggest public benefits of the proposed innovations is that they make performance analysis usable by a larger group of application developers.
Intuitive reporting of results. The Open|SpeedShop performance analysis tool has a rich set of intuitive, yet detailed reports for presenting performance results to application developers. Our goal was to leverage this existing technology to present the results from our memory performance addition to Open|SpeedShop.
Suitable for experts as well as novices.
Application performance is getting more difficult to measure as the hardware platforms that applications run on become more complicated. This makes life difficult for application developers, who need to know more about the hardware platform, including the memory system hierarchy, in order to understand the performance of their applications. Some application developers are comfortable in that scenario, while others want to do their scientific research without having to understand all the nuances of the hardware platform they are running on. Our proposed innovations were aimed at supporting both expert and novice performance analysts.
Useful in many markets. The enhancement to Open|SpeedShop would appeal to a broader market space, as it would be useful in scientific, commercial, and cloud computing environments. Our goal was to use technology developed initially at the and Lawrence Livermore National Laboratory combined with the development and commercial software experience of Argo Navis Technologies, LLC (ANT) to form a powerful combination to deliver these objectives.
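
To illustrate the kind of analysis the abstract describes, the sketch below attributes sampled memory references to data structures by address range. It is a minimal illustration under stated assumptions, not the Open|SpeedShop implementation: the `MemSample` record mirrors the tuple named in the abstract (instruction pointer, target address, memory hierarchy level, observed latency), while `Region`, `attribute`, `demo`, and all addresses and latencies are hypothetical names and made-up values.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemSample:
    """One sampled memory reference (hypothetical record layout)."""
    ip: int        # instruction pointer of the load or store
    addr: int      # target data address
    level: str     # memory hierarchy level that served it, e.g. "L1", "DRAM"
    latency: int   # observed latency in cycles


@dataclass(frozen=True)
class Region:
    """A named address range, e.g. one allocated array (hypothetical)."""
    name: str
    start: int
    size: int


def attribute(samples, regions):
    """Group samples by the data structure whose address range contains them.

    Returns {region name: (sample count, mean latency, {level: count})}.
    Samples outside every region are reported under "<unknown>".
    Assumes the regions do not overlap.
    """
    ordered = sorted(regions, key=lambda r: r.start)
    starts = [r.start for r in ordered]
    acc = defaultdict(lambda: [0, 0, defaultdict(int)])
    for s in samples:
        # Find the last region starting at or below the sampled address.
        i = bisect_right(starts, s.addr) - 1
        hit = i >= 0 and s.addr < ordered[i].start + ordered[i].size
        name = ordered[i].name if hit else "<unknown>"
        entry = acc[name]
        entry[0] += 1
        entry[1] += s.latency
        entry[2][s.level] += 1
    return {n: (c, total / c, dict(lv)) for n, (c, total, lv) in acc.items()}


def demo():
    # Made-up regions and samples, for illustration only.
    regions = [Region("grid", 0x1000, 0x1000), Region("index", 0x4000, 0x800)]
    samples = [
        MemSample(0x400A10, 0x1008, "L1", 4),
        MemSample(0x400A10, 0x1F00, "DRAM", 220),
        MemSample(0x400B24, 0x4010, "L3", 40),
        MemSample(0x400B24, 0x9000, "L1", 4),
    ]
    return attribute(samples, regions)
```

Calling `demo()` returns a per-structure summary in which a high mean latency, or many samples served from a distant memory level, flags a data structure worth inspecting. A real tool would also need to map addresses to allocations and symbols, which this sketch takes as given.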

Research Organization:
ARGO NAVIS TECHNOLOGIES, LLC
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
SC0009671
OSTI ID:
1113233
Report Number(s):
DOE-ARGO-9671
Country of Publication:
United States
Language:
English

Similar Records

Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems
Technical Report · Wed Nov 26 00:00:00 EST 2014 · OSTI ID:1113233

Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)
Technical Report · Fri Nov 29 00:00:00 EST 2019 · OSTI ID:1113233

Center for Technology for Advanced Scientific Component Software (TASCS)
Technical Report · Sun Oct 31 00:00:00 EDT 2010 · OSTI ID:1113233
