skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: End-to-end experiment management in HPC

Conference ·
OSTI ID:1003804

Experiment management in any domain is challenging. There is a perpetual feedback loop cycling through planning, execution, measurement, and analysis. The lifetime of a particular experiment can be limited to a single cycle although many require myriad more cycles before definite results can be obtained. Within each cycle, a large number of subexperiments may be executed in order to measure the effects of one or more independent variables. Experiment management in high performance computing (HPC) follows this general pattern but also has three unique characteristics. One, computational science applications running on large supercomputers must deal with frequent platform failures which can interrupt, perturb, or terminate running experiments. Two, these applications typically integrate in parallel using MPI as their communication medium. Three, there is typically a scheduling system (e.g. Condor, Moab, SGE, etc.) acting as a gate-keeper for the HPC resources. In this paper, we introduce LANL Experiment Management (LEM), an experimental management framework simplifying all four phases of experiment management. LEM simplifies experiment planning by allowing the user to describe their experimental goals without having to fully construct the individual parameters for each task. To simplify execution, LEM dispatches the subexperiments itself thereby freeing the user from remembering the often arcane methods for interacting with the various scheduling systems. LEM provides transducers for experiments that automatically measure and record important information about each subexperiment; these transducers can easily be extended to collect additional measurements specific to each experiment. Finally, experiment analysis is simplified by providing a general database visualization framework that allows users to quickly and easily interact with their measured data.

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC52-06NA25396
OSTI ID:
1003804
Report Number(s):
LA-UR-10-02150; LA-UR-10-2150; TRN: US201103%%96
Resource Relation:
Conference: Supercomputing 2010 ; November 13, 2010 ; New Orleans, LA
Country of Publication:
United States
Language:
English