OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Scalable Tools Communication Infrastructure

Abstract

The Scalable Tools Communication Infrastructure (STCI) is an open source collaborative effort intended to provide high-performance, scalable, resilient, and portable communications and process control services for a wide variety of user and system tools. STCI is aimed specifically at tools for ultrascale computing and uses a component architecture to simplify tailoring the infrastructure to a wide range of scenarios. This paper describes STCI's design philosophy, the various components that will be used to provide an STCI implementation for a range of ultrascale platforms, and a range of tool types. These include tools supporting parallel run-time environments, such as MPI, parallel application correctness tools and performance analysis tools, as well as system monitoring and management tools.
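
As a rough illustration of the kind of services the abstract describes (a tool front-end attaching to a parallel job, broadcasting control commands, and gathering replies through a scalable infrastructure), the following minimal C sketch mocks that pattern. The stci_* names, the job identifier, and the stub behavior are hypothetical stand-ins invented for illustration; they are not the actual STCI API.

#include <stdio.h>

/* Hypothetical handle for a tool session attached to a parallel job. */
typedef struct {
    const char *job_id;   /* identifier of the target parallel job */
    int num_procs;        /* number of processes the tool is attached to */
} stci_session_t;

/* Stub standing in for attaching to per-node tool daemons. */
static int stci_session_attach(const char *job_id, stci_session_t *s) {
    s->job_id = job_id;
    s->num_procs = 4;     /* pretend the job runs four ranks */
    return 0;
}

/* Stub standing in for broadcasting a control command to every target
 * process; here it only prints what would be sent. */
static int stci_bcast(const stci_session_t *s, const char *msg) {
    printf("[%s] broadcast '%s' to %d processes\n", s->job_id, msg, s->num_procs);
    return 0;
}

/* Stub standing in for gathering per-process replies (a scalable
 * implementation would reduce them along a tree); here it fabricates
 * a one-line summary. */
static int stci_gather(const stci_session_t *s, char *buf, size_t len) {
    snprintf(buf, len, "%d replies from %s", s->num_procs, s->job_id);
    return 0;
}

int main(void) {
    stci_session_t session;
    char results[128];

    stci_session_attach("job-42", &session);        /* attach tool to a job    */
    stci_bcast(&session, "sample-counters");        /* send a control command  */
    stci_gather(&session, results, sizeof results); /* collect aggregated data */
    printf("tool front-end received: %s\n", results);
    return 0;
}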

Authors:
 Buntinas, Darius [1];  Bosilca, George [2];  Graham, Richard L. [3];  Vallee, Geoffroy R. [3];  Watson, Gregory R. [4]
  1. Argonne National Laboratory (ANL)
  2. University of Tennessee, Knoxville (UTK)
  3. Oak Ridge National Laboratory (ORNL)
  4. IBM T. J. Watson Research Center
Publication Date:
2008
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Center for Computational Sciences
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1087452
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 6th Annual Symposium on OSCAR and HPC Cluster Systems (OSCAR2008), Quebec City, Quebec, Canada, June 9-11, 2008
Country of Publication:
United States
Language:
English

Citation Formats

Buntinas, Darius, Bosilca, George, Graham, Richard L., Vallee, Geoffroy R., and Watson, Gregory R. A Scalable Tools Communication Infrastructure. United States: N. p., 2008. Web.
Buntinas, Darius, Bosilca, George, Graham, Richard L., Vallee, Geoffroy R., & Watson, Gregory R. A Scalable Tools Communication Infrastructure. United States.
Buntinas, Darius, Bosilca, George, Graham, Richard L., Vallee, Geoffroy R., and Watson, Gregory R. 2008. "A Scalable Tools Communication Infrastructure". United States.
@article{osti_1087452,
title = {A Scalable Tools Communication Infrastructure},
author = {Buntinas, Darius and Bosilca, George and Graham, Richard L. and Vallee, Geoffroy R. and Watson, Gregory R.},
abstractNote = {The Scalable Tools Communication Infrastructure (STCI) is an open source collaborative effort intended to provide high-performance, scalable, resilient, and portable communications and process control services for a wide variety of user and system tools. STCI is aimed specifically at tools for ultrascale computing and uses a component architecture to simplify tailoring the infrastructure to a wide range of scenarios. This paper describes STCI's design philosophy, the various components that will be used to provide an STCI implementation for a range of ultrascale platforms, and a range of tool types. These include tools supporting parallel run-time environments, such as MPI, parallel application correctness tools and performance analysis tools, as well as system monitoring and management tools.},
place = {United States},
year = 2008,
month = 1
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar Records:
  • In this project we created a community tool infrastructure for program development tools targeting petascale-class machines and beyond. This includes tools for performance analysis, debugging, and correctness checking, as well as tuning and optimization frameworks. The developed infrastructure provides a comprehensive and extensible set of individual tool building components. We started with the basic elements necessary across all tools in such an infrastructure, followed by a set of generic core modules that allow comprehensive performance analysis at scale. Further, we developed a methodology and workflow that allows others to add or replace modules, to integrate parts into their own tools, or to customize existing solutions. In order to form the core modules, we built on the existing Open|SpeedShop infrastructure and decomposed it into individual modules that match the necessary tool components. At the same time, we addressed the challenges found in performance tools for petascale systems in each module. When assembled, this instantiation of the community tool infrastructure provides an enhanced version of Open|SpeedShop, which, while completely different in its architecture, provides scalable performance analysis for petascale applications through a familiar interface. This project also built upon and enhanced the capabilities and reusability of project partner components as specified in the original project proposal. The overall project team’s work over the project funding cycle was focused on several areas of research, which are described in the following sections. The remainder of this report also highlights related work as well as preliminary work that supported the project. In addition to the project partners funded by the Office of Science under this grant, the project team included several collaborators who contributed to the overall design of the envisioned tool infrastructure. In particular, the project team worked closely with the other two DOE NNSA laboratories, Los Alamos and Sandia, leveraging co-funding for Krell by ASC’s Common Computing Environment (CCE) program as laid out in the original proposal. The ASC CCE co-funding, coordinated through LLNL, was for 50% of the total project funding, with the ASC CCE portion going entirely to Krell, while the ASCR funding itself was split between Krell and the funded partners. This report covers the entire project from both funding sources. Additionally, the team leveraged the expertise of software engineering researchers from Carnegie Mellon University, who specialize in software framework design, in order to achieve a broadly acceptable component framework. The Component Based Tool Framework (CBTF) software has been released to the community. Information related to the project and the released software can be found on the CBTF wiki page at: http://sourceforge.net/p/cbtf/wiki/Home.
  • Peta-scale computing environments pose significant challenges for both system and application developers, and addressing them requires more than simply scaling up existing tera-scale solutions. Performance analysis tools play an important role in gaining the needed understanding, but previous monolithic tools with fixed feature sets have not sufficed. Instead, this project worked on the design, implementation, and evaluation of a general, flexible tool infrastructure supporting the construction of performance tools as “pipelines” of high-quality tool building blocks. These tool building blocks provide common performance tool functionality, and are designed for scalability, lightweight data acquisition and analysis, and interoperability. For this project, we built on Open|SpeedShop, a modular and extensible open source performance analysis tool set. The design and implementation of such a general and reusable infrastructure targeted at petascale systems required us to address several challenging research issues. All components needed to be designed for scale, a task made more difficult by the need to provide general modules. The infrastructure needed to support online data aggregation to cope with the large amounts of performance and debugging data. We needed to be able to map any combination of tool components to each target architecture. And we needed to design interoperable tool APIs and workflows that were concrete enough to support the required functionality, yet provide the necessary flexibility to address a wide range of tools. A major result of this project is the ability to use this scalable infrastructure to quickly create tools that match a machine architecture and a performance problem that needs to be understood. Another benefit is the ability for application engineers to use the highly scalable, interoperable version of Open|SpeedShop, which is reassembled from the tool building blocks into a flexible set of tools with multiple user interfaces. This set of tools is targeted at Office of Science Leadership Class computer systems and selected Office of Science application codes. We describe the contributions made by the team at the University of Wisconsin. The project built on the efforts in Open|SpeedShop funded by DOE/NNSA and the DOE/NNSA Tri-Lab community, extended Open|SpeedShop to the Office of Science Leadership Class Computing Facilities, and addressed new challenges found on these cutting-edge systems. Work done under this project at Wisconsin can be divided into two categories: new algorithms and techniques for debugging, and foundation infrastructure work on our Dyninst binary analysis and instrumentation toolkits and MRNet scalability infrastructure.
  • As the flood of data associated with leading edge computational science continues to escalate, the challenge of supporting the distributed collaborations that are now characteristic of it becomes increasingly daunting. The chief obstacles to progress on this front lie less in the synchronous elements of collaboration, which have been reasonably well addressed by new global high performance networks, than in the asynchronous elements, where appropriate shared storage infrastructure seems to be lacking. The recent report from the Department of Energy on the emerging 'data management challenge' captures the multidimensional nature of this problem succinctly: Data inevitably needs to be buffered, for periods ranging from seconds to weeks, in order to be controlled as it moves through the distributed and collaborative research process. To meet the diverse and changing set of application needs that different research communities have, large amounts of non-archival storage are required for transitory buffering, and it needs to be widely dispersed, easily available, and configured to maximize flexibility of use. In today's grid fabric, however, massive storage is mostly concentrated in data centers, available only to those with user accounts and membership in the appropriate virtual organizations, allocated as if its usage were non-transitory, and encapsulated behind legacy interfaces that inhibit the flexibility of use and scheduling. This situation severely restricts the ability of application communities to access and schedule usable storage where and when they need to in order to make their workflow more productive. (p.69f) One possible strategy to deal with this problem lies in creating a storage infrastructure that can be universally shared because it provides only the most generic of asynchronous services. Different user communities then define higher level services as necessary to meet their needs. One model of such a service is a Storage Network, analogous to those used within computation centers, but designed to operate on a global scale. Building on a basic storage service that is as primitive as possible, such a Global Storage Network would define a framework within which higher level services can be created. If this framework enabled a variety of more specialized middleware and supported a wide array of applications, then interoperability and collaboration could occur based on that common framework. The research in Logistical Networking (LN) carried out under the DOE's SciDAC program tested the value of this approach within the context of several SciDAC application communities. Below we briefly describe the basic design of the LN storage network and some of the results that the Logistical Networking community has achieved.