A Generic Scheduling Simulator for High Performance Parallel Computers
It is well known that efficient job scheduling plays a crucial role in achieving high system utilization in large-scale high performance computing environments. A good scheduling algorithm should schedule jobs to achieve high system utilization while satisfying various user demands in an equitable fashion. Designing such a scheduling algorithm is a non-trivial task even in a static environment. In practice, the computing environment and workload are constantly changing. There are several reasons for this. First, the computing platforms constantly evolve as the technology advances. For example, the availability of relatively powerful commodity off-the-shelf (COTS) components at steadily diminishing prices has made it feasible to construct ever larger massively parallel computers in recent years [1, 4]. Second, the workload imposed on the system also changes constantly. The rapidly increasing compute resources have given many application developers the opportunity to radically alter program characteristics and take advantage of these additional resources. New developments in software technology may also trigger changes in user applications. Finally, changes in the political climate may alter user priorities or the mission of the organization. System designers in such dynamic environments must be able to accurately forecast the effect of changes in the hardware, software, and/or policies under consideration. If the environmental changes are significant, one must also reassess scheduling algorithms. Simulation has frequently been relied upon for this analysis, because other methods such as analytical modeling or actual measurements are usually too difficult or costly. A drawback of the simulation approach, however, is that developing a simulator is a time-consuming process. Furthermore, an existing simulator cannot be easily adapted to a new environment.
In this research, we attempt to develop a generic job-scheduling simulator that facilitates the evaluation of different scheduling algorithms in various computing environments. Our design objectives for this generic simulator are as follows. (1) Accept descriptions of varied workloads for a wide range of computing environments. (2) Provide an easy-to-use interface for describing the scheduling policies being evaluated. (3) Accurately calculate the overhead induced by various scheduling algorithms. (4) Accurately model a variety of machine architectures. In summary, we have developed a generic scheduling simulator for high performance parallel computers. This generic simulator supports standard and user-defined job attributes and generates the job attribute values from different input sources, allowing users to model a wide range of workloads, and produces performance parameters with reliability measures. All overheads caused by scheduling algorithms are considered in measuring the performance parameters. The simulator simulates a queuing network to which users can bind a specific scheduling algorithm written as a C function. A set of APIs is provided to help users describe the scheduling algorithms. With these features, this simulator can accurately simulate any scheduling algorithm under various workloads and computing platforms. The simulator does not yet closely model dynamic events such as message passing between tasks, but we plan to add this crucial functionality to our simulator in the future.
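To make the idea of binding a scheduling algorithm written as a C function more concrete, the following is a minimal sketch and not the simulator's actual interface, which the abstract does not describe. Every name here (`Job`, `SchedulerFn`, `fcfs`, `first_fit`, `simulate`, `MAX_JOBS`) is hypothetical. The sketch assumes a single pool of identical processors, jobs sorted by arrival time, and zero scheduling overhead, whereas the real simulator is stated to account for overhead and to model a queuing network.

```c
#include <assert.h>

#define MAX_JOBS 64  /* hypothetical limit for this sketch */

/* Hypothetical job record; field names are illustrative, not the simulator's. */
typedef struct {
    double arrival;  /* submission time */
    double runtime;  /* execution time once started */
    int    procs;    /* processors requested */
    double start;    /* set by simulate(); negative until the job starts */
} Job;

/* A policy returns the queue position of the job to start next, or -1. */
typedef int (*SchedulerFn)(const Job *jobs, const int *queue, int qlen,
                           int free_procs);

/* First-come-first-served: only the head of the queue may start. */
int fcfs(const Job *jobs, const int *queue, int qlen, int free_procs) {
    return (qlen > 0 && jobs[queue[0]].procs <= free_procs) ? 0 : -1;
}

/* First-fit: start the earliest queued job that fits the free processors. */
int first_fit(const Job *jobs, const int *queue, int qlen, int free_procs) {
    for (int i = 0; i < qlen; i++)
        if (jobs[queue[i]].procs <= free_procs) return i;
    return -1;
}

/* Event-driven loop; jobs must be sorted by arrival. Returns mean wait time. */
double simulate(Job *jobs, int n, int total_procs, SchedulerFn policy) {
    int queue[MAX_JOBS], finished[MAX_JOBS] = {0};
    int qlen = 0, next = 0, done = 0, free_procs = total_procs;
    double now = 0.0, total_wait = 0.0;

    assert(n > 0 && n <= MAX_JOBS);
    for (int i = 0; i < n; i++) {
        assert(jobs[i].procs > 0 && jobs[i].procs <= total_procs);
        jobs[i].start = -1.0;
    }
    while (done < n) {
        /* Admit every job that has arrived by the current time. */
        while (next < n && jobs[next].arrival <= now) queue[qlen++] = next++;
        /* Release processors held by jobs that have completed. */
        for (int i = 0; i < n; i++)
            if (jobs[i].start >= 0 && !finished[i] &&
                jobs[i].start + jobs[i].runtime <= now) {
                finished[i] = 1;
                free_procs += jobs[i].procs;
                done++;
            }
        /* Ask the bound policy for jobs to start until it declines. */
        int pick;
        while ((pick = policy(jobs, queue, qlen, free_procs)) >= 0) {
            assert(pick < qlen && jobs[queue[pick]].procs <= free_procs);
            Job *j = &jobs[queue[pick]];
            j->start = now;
            total_wait += now - j->arrival;
            free_procs -= j->procs;
            for (int k = pick; k < qlen - 1; k++) queue[k] = queue[k + 1];
            qlen--;
        }
        /* Advance the clock to the next arrival or completion. */
        double t = -1.0;
        if (next < n) t = jobs[next].arrival;
        for (int i = 0; i < n; i++)
            if (jobs[i].start >= 0 && !finished[i]) {
                double end = jobs[i].start + jobs[i].runtime;
                if (t < 0 || end < t) t = end;
            }
        if (t < 0) break;  /* no pending events remain */
        now = t;
    }
    return total_wait / n;
}
```

Calling `simulate(jobs, n, total_procs, fcfs)` and then `simulate(jobs, n, total_procs, first_fit)` on the same workload compares the two policies without touching the event loop. That separation between the simulation engine and the policy function is the property the abstract attributes to the generic simulator.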
- Research Organization: Lawrence Livermore National Lab., CA (US)
- Sponsoring Organization: US Department of Energy (US)
- DOE Contract Number: W-7405-ENG-48
- OSTI ID: 15006309
- Report Number(s): UCRL-JC-144818
- Country of Publication: United States
- Language: English
Similar Records
RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning
Characteristics of workload on ASCI Blue-Pacific at Lawrence Livermore National Laboratory