Summary: Data-Driven Batch Scheduling
A dissertation submitted
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
University of Wisconsin - Madison
In this thesis, we present a data-driven batch scheduling system. Current CPU-centric batch schedulers ignore
the data needs of workloads and execute them by linking them transparently and directly to the data they need.
When workloads are scheduled on remote computational resources, this elegant solution of direct data access can
incur an order-of-magnitude performance penalty for data-intensive workloads.
To motivate this problem concretely, we provide a detailed analysis of six current data-intensive scientific
batch workloads. From this analysis, we derive quantitative bounds on expected scalability and demonstrate the
infeasibility of scheduling these workloads using current CPU-centric systems that lack data-awareness.