OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows

Journal Article · Future Generations Computer Systems
Authors: [1]; [2]; [1]; [3]; [3]; [3]
  1. Univ. of Southern California, Marina del Rey, CA (United States)
  2. British Geological Survey, Edinburgh (United Kingdom); Univ. of Edinburgh, Scotland (United Kingdom)
  3. Univ. of Edinburgh, Scotland (United Kingdom)

Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. Although the scientific community has addressed this challenge through both theoretical and practical approaches, failure prediction, detection, and recovery still raise many research questions. In this paper, we propose an approach inspired by the control theory developed as part of autonomic computing to predict failures before they happen and mitigate them when possible. The proposed approach builds on the proportional–integral–derivative (PID) controller, a control loop mechanism widely used in industrial control systems, in which the controller reacts by adjusting its output to mitigate faults. PID controllers aim to detect the possibility of a non-steady state far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of large-scale data-intensive workflows: data storage overload and memory overflow. We developed a simulator that implements and evaluates simple standalone PID-inspired controllers to autonomously manage the data and memory usage of a data-intensive bioinformatics workflow that consumes and produces over 4.4 TB of data and requires over 24 TB of memory to run all tasks concurrently. Experimental results obtained via simulation indicate that workflow executions may significantly benefit from the controller-inspired approach, in particular under online and unknown conditions. Simulation results show that near-optimal executions (slowdown of 1.01) can be attained with the proposed method, and that faults are detected and mitigated well in advance of their occurrence.
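The control law the abstract refers to is the standard PID loop, whose output is a weighted sum of the current error, its accumulated integral, and its rate of change. The sketch below is illustrative only and is not the authors' simulator or controller design: it applies a minimal discrete PID controller (Python) to a hypothetical disk-utilization signal, and the class name, gains, setpoint, sample values, and the "trigger cleanup" action are all assumptions made for this example.

# Minimal sketch (an assumption, not the paper's implementation) of a discrete
# PID controller of the kind described in the abstract, applied to a
# hypothetical disk-utilization signal.

class PIDController:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target utilization, e.g. 0.80 = 80%
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, measurement, dt):
        # PID law: u = Kp*e + Ki*sum(e*dt) + Kd*de/dt, with the error defined
        # as measurement - setpoint, so a positive output signals over-use.
        # The integral term accumulates the history of past errors.
        error = measurement - self.setpoint
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical monitoring loop: sampled disk utilization drives a preventive
# action (here labeled "trigger cleanup") before storage overload occurs.
controller = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=0.80)
for utilization in [0.60, 0.72, 0.81, 0.88]:   # simulated monitoring samples
    u = controller.step(utilization, dt=1.0)
    action = "trigger cleanup" if u > 0 else "steady state"
    print(f"utilization={utilization:.2f}  control output={u:+.3f}  -> {action}")

The gains and the cleanup decision above are placeholders; in the paper, such controllers are used to keep the data footprint and memory usage of workflow tasks within the available capacity.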

Research Organization:
Univ. of Southern California, Los Angeles, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
SC0012636; DE-SC0012636
OSTI ID:
1611913
Alternate ID(s):
OSTI ID: 1562940
Journal Information:
Future Generations Computer Systems, Vol. 95, Issue C; ISSN 0167-739X
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 14 works
Citation information provided by Web of Science

Cited By (1)

Developing accurate and scalable simulators of production workflow management systems with WRENCH journal November 2020
