U.S. Department of Energy
Office of Scientific and Technical Information

Deep Learning for Enhancing Fault Tolerant Capabilities of Scientific Workflows

Conference ·

Alok Singh, Ilkay Altintas
San Diego Supercomputer Center, UCSD, La Jolla, CA, USA
{a1singh, ialtintas}@ucsd.edu

Malachi Schram, Nathan Tallent
Pacific Northwest National Laboratory, Richland, WA, USA
{malachi.schram, nathan.tallent}@pnnl.gov

Abstract: In the history of Computer Science, the act of 'delegation' has been the greatest multiplier of society's problem-solving ability. A scientist working on detecting anomalies in a phenomenon does not need to re-invent matrix multiplication techniques to solve her problem. Scientific workflows provide the ultimate 'delegation' mechanism, where a domain scientist can completely forget the specifics of 'how' her program will execute on a large cluster in an efficient and cost-effective manner and can instead focus on the mathematical formulation and theoretical robustness of her solution. We present here an approach that directly aims to make the execution of scientific workflows more reliable, robust and efficient. We hope that the work presented in this paper will propel the larger effort, from the scientific workflow community, of making scientific workflow execution as simple, efficient and robust as a JOIN operation in a modern database. Specifically, we apply Deep Learning techniques to develop a mechanism that forecasts the final state (success or failure) of a dynamic job in a large-scale particle physics experiment, with minimal data gathering, and as early as possible in the job's life cycle. The key advantage of having a predictive mechanism to identify and anticipate failure-prone jobs is the potential for designing intelligent Fault Tolerance mechanisms to handle anomalous events.

We achieve a 14% improvement in computational resource utilization, and an overall classification accuracy of 85% on real tasks executed in a High Energy Physics computing workflow. To the best of our knowledge, this is the most exhaustive and first-of-its-kind study of neural network architectures in the context of a real dataset profiled from a large-scale scientific workflow.

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1525776
Report Number(s):
PNNL-SA-143406
Country of Publication:
United States
Language:
English

Similar Records

Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows
Journal Article · Sun Jan 27 23:00:00 EST 2019 · Future Generations Computer Systems · OSTI ID:1611913

Scientist-Centered Workflow Abstractions via Generic Actors, Workflow Templates, and Context-Awareness for Groundwater Modeling and Analysis
Conference · Mon Jul 04 00:00:00 EDT 2011 · OSTI ID:1024536

Accelerating Scientific Workflows on HPC Platforms with In Situ Processing
Conference · Fri Dec 31 23:00:00 EST 2021 · OSTI ID:1888792
