OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows

Journal Article · Future Generations Computer Systems
Authors: [1]; [2]; [1]; [3]; [3]; [3]
  1. Univ. of Southern California, Marina del Rey, CA (United States)
  2. British Geological Survey, Edinburgh (United Kingdom); Univ. of Edinburgh, Scotland (United Kingdom)
  3. Univ. of Edinburgh, Scotland (United Kingdom)

Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. Although the scientific community has addressed this challenge through both theoretical and practical approaches, failure prediction, detection, and recovery still raise many research questions. In this paper, we propose an approach inspired by the control theory developed as part of autonomic computing to predict failures before they happen and mitigate them when possible. The proposed approach builds on the proportional–integral–derivative (PID) controller, a control loop mechanism widely used in industrial control systems, in which the controller reacts by adjusting its output to mitigate faults. PID controllers aim to detect the possibility of a non-steady state far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of large-scale data-intensive workflows: data storage overload and memory overflow. We developed a simulator that implements and evaluates simple standalone PID-inspired controllers to autonomously manage the data and memory usage of a data-intensive bioinformatics workflow that consumes and produces over 4.4 TB of data and requires over 24 TB of memory to run all tasks concurrently. Experimental results obtained via simulation indicate that workflow executions may significantly benefit from the controller-inspired approach, in particular under online and unknown conditions. Simulation results show that near-optimal executions (slowdown of 1.01) can be attained with the proposed method, and that faults are detected and mitigated well in advance of their occurrence.
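The control law the abstract refers to is the standard PID loop, whose output is a weighted sum of the current error, its accumulated integral, and its rate of change. The sketch below is illustrative only and is not the authors' simulator or controller design: it applies a minimal discrete PID controller (Python) to a hypothetical disk-utilization signal, and the class name, gains, setpoint, sample values, and the "trigger cleanup" action are all assumptions made for this example.

# Minimal sketch (an assumption, not the paper's implementation) of a discrete
# PID controller of the kind described in the abstract, applied to a
# hypothetical disk-utilization signal.

class PIDController:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target utilization, e.g. 0.80 = 80%
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, measurement, dt):
        # PID law: u = Kp*e + Ki*sum(e*dt) + Kd*de/dt, with the error defined
        # as measurement - setpoint, so a positive output signals over-use.
        # The integral term accumulates the history of past errors.
        error = measurement - self.setpoint
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical monitoring loop: sampled disk utilization drives a preventive
# action (here labeled "trigger cleanup") before storage overload occurs.
controller = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=0.80)
for utilization in [0.60, 0.72, 0.81, 0.88]:   # simulated monitoring samples
    u = controller.step(utilization, dt=1.0)
    action = "trigger cleanup" if u > 0 else "steady state"
    print(f"utilization={utilization:.2f}  control output={u:+.3f}  -> {action}")

The gains and the cleanup decision above are placeholders; in the paper, such controllers are used to keep the data footprint and memory usage of workflow tasks within the available capacity.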

Research Organization:
Univ. of Southern California, Los Angeles, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
SC0012636; DE-SC0012636
OSTI ID:
1611913
Alternate ID(s):
OSTI ID: 1562940
Journal Information:
Future Generations Computer Systems, Vol. 95, Issue C; ISSN 0167-739X
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 14 works
Citation information provided by Web of Science

Cited By (1)

Developing accurate and scalable simulators of production workflow management systems with WRENCH journal November 2020
