OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems

Abstract

MOLAR is a multi-institutional research effort that concentrates on adaptive, reliable, and efficient operating and runtime system (OS/R) solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. This research addresses the challenges outlined in FAST-OS (forum to address scalable technology for runtime and operating systems) and HECRTF (high-end computing revitalization task force) activities by exploring the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions, and by advancing computer reliability, availability and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues. This paper describes recent research of the MOLAR team in advancing RAS for high-end computing OS/Rs.

Authors:
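 Engelmann, Christian [1]; Scott, Steven L [1]; Bernholdt, David E [1]; Gottumukkala, Narasimha R. [2]; Chokchai, Leangsuksun [2]; Varma, Jyothish S. [3]; Wang, Chao [3]; Mueller, Frank [3]; Shet, Aniruddha G. [4]; Sadayappan, Ponnuswamy [4]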
  1. ORNL
  2. Louisiana Tech University
  3. North Carolina State University
  4. Ohio State University
Publication Date:
2006
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
978167
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Journal Article
Resource Relation:
Journal Name: ACM SIGOPS Operating Systems Review; Journal Volume: 40; Journal Issue: 2
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; AVAILABILITY; COMPUTERS; MANAGEMENT; MONITORING; PERFORMANCE; RELIABILITY; SUPERCOMPUTERS

Citation Formats

Engelmann, Christian, Scott, Steven L, Bernholdt, David E, Gottumukkala, Narasimha R., Chokchai, Leangsuksun, Varma, Jyothish S., Wang, Chao, Mueller, Frank, Shet, Aniruddha G., and Sadayappan, Ponnuswamy. MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems. United States: N. p., 2006. Web. doi:10.1145/1131322.1131337.
Engelmann, Christian, Scott, Steven L, Bernholdt, David E, Gottumukkala, Narasimha R., Chokchai, Leangsuksun, Varma, Jyothish S., Wang, Chao, Mueller, Frank, Shet, Aniruddha G., & Sadayappan, Ponnuswamy. MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems. United States. doi:10.1145/1131322.1131337.
Engelmann, Christian, Scott, Steven L, Bernholdt, David E, Gottumukkala, Narasimha R., Chokchai, Leangsuksun, Varma, Jyothish S., Wang, Chao, Mueller, Frank, Shet, Aniruddha G., and Sadayappan, Ponnuswamy. 2006. "MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems". United States. doi:10.1145/1131322.1131337.
@article{osti_978167,
title = {MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems},
author = {Engelmann, Christian and Scott, Steven L and Bernholdt, David E and Gottumukkala, Narasimha R. and Chokchai, Leangsuksun and Varma, Jyothish S. and Wang, Chao and Mueller, Frank and Shet, Aniruddha G. and Sadayappan, Ponnuswamy},
abstractNote = {MOLAR is a multi-institutional research effort that concentrates on adaptive, reliable, and efficient operating and runtime system (OS/R) solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. This research addresses the challenges outlined in FAST-OS (forum to address scalable technology for runtime and operating systems) and HECRTF (high-end computing revitalization task force) activities by exploring the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions, and by advancing computer reliability, availability and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues. This paper describes recent research of the MOLAR team in advancing RAS for high-end computing OS/Rs.},
doi = {10.1145/1131322.1131337},
journal = {ACM SIGOPS Operating Systems Review},
number = 2,
volume = 40,
place = {United States},
year = {2006},
month = jan
}
  • We present a new software-based clock synchronization scheme that provides high precision time agreement among distributed memory nodes. The technique is designed to minimize variance from a reference chimer during runtime and with minimal time-request latency. Our scheme permits initial unbounded variations in time and corrects both slow and fast chimers (clock skew). An implementation developed within the context of the MPI message passing interface is described, and time coordination measurements are presented. Among our results, the mean time variance for a set of nodes improved from 20.0 milliseconds under standard Network Time Protocol (NTP) down to 2.29 microseconds under our scheme.
  • MOLAR is a multi-institution research effort that concentrates on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. This research addresses the challenges outlined by the FAST-OS (forum to address scalable technology for runtime and operating systems) and HECRTF (high-end computing revitalization task force) activities by providing a modular Linux and adaptable runtime support for high-end computing operating and runtime systems. The MOLAR research has the following goals to address these issues. (1) Create a modular and configurable Linux system that allows customized changes based on the requirements of the applications, runtime systems, and cluster management software. (2) Build runtime systems that leverage the OS modularity and configurability to improve efficiency, reliability, scalability, ease-of-use, and provide support to legacy and promising programming models. (3) Advance computer reliability, availability and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues. (4) Explore the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions. The overall goal of the research conducted at NCSU is to develop scalable algorithms for high-availability without single points of failure and without single points of control.
  • We present a new software-based clock synchronization scheme designed to provide high precision time agreement among distributed memory nodes. The technique is designed to minimize variance from a reference chimer during runtime and with minimal time-request latency. Our scheme permits initial unbounded variations in time and corrects both slow and fast chimers (clock skew). An implementation developed within the context of the MPI message passing interface is described and time coordination measurements are presented. Among our results, the mean time variance among a set of nodes improved from 20.0 milliseconds under standard Network Time Protocol (NTP) to 2.29 microseconds under our scheme.
  • In 2003, the High End Computing Revitalization Task Force designated file systems and I/O as an area in need of national focus. The purpose of the High End Computing Interagency Working Group (HECIWG) is to coordinate government spending on File Systems and I/O (FSIO) R&D by all the government agencies that are involved in High End Computing. The HECIWG tasked a smaller advisory group to list, categorize, and prioritize HEC I/O and File Systems R&D needs. In 2005, leaders in FSIO from academia, industry and government agencies collaborated to list and prioritize areas of research in HEC FSIO. This led to a very successful High End Computing University Research Activity (HECURA) call from NSF in 2006 and has prompted a new HECURA call from NSF in 2009. This paper serves as both a review of the 2008 HEC FSIO identified research gaps as well as a preview of this forthcoming HECURA call.