OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Deep Learning for Enhancing Fault Tolerant Capabilities of Scientific Workflows

Conference

Alok Singh, Ilkay Altintas
San Diego Supercomputer Center, UCSD
La Jolla, CA, USA
{a1singh, ialtintas}@ucsd.edu

Malachi Schram, Nathan Tallent
Pacific Northwest National Laboratory
Richland, WA, USA
{malachi.schram, nathan.tallent}@pnnl.gov

Abstract: In the history of Computer Science, the act of ‘delegation’ has been the greatest multiplier of society’s problem-solving ability. A scientist working on detecting anomalies in a phenomenon does not need to re-invent matrix multiplication techniques to solve her problem. Scientific workflows provide the ultimate ‘delegation’ mechanism, where a domain scientist can completely forget the specifics of ‘how’ her program will execute on a large cluster in an efficient and cost-effective manner and can instead focus on the mathematical formulation and theoretical robustness of her solution. We present here an approach that directly aims to make the execution of scientific workflows more reliable, robust and efficient. We hope that the work presented in this paper will propel the larger effort, from the scientific workflow community, of making scientific workflow execution as simple, efficient and robust as a JOIN operation in a modern database. Specifically, we apply Deep Learning techniques to develop a mechanism that forecasts the final state (success or failure) of a dynamic job in a large-scale particle physics experiment, with minimal data gathering, and as early as possible in the job’s life cycle. The key advantage of having a predictive mechanism to identify and anticipate failure-prone jobs is the potential for designing intelligent Fault Tolerance mechanisms to handle anomalous events.
We achieve a 14% improvement in computational resource utilization, and an overall classification accuracy of 85% on real tasks executed in a High Energy Physics computing workflow. To the best of our knowledge, this is the first study of its kind, and the most exhaustive, of neural network architectures in the context of a real dataset profiled from a large-scale scientific workflow.

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1525776
Report Number(s):
PNNL-SA-143406
Resource Relation:
Conference: Proceedings of the IEEE International Conference on Big Data, (Big Data 2018), December 10-13, 2018, Seattle, WA
Country of Publication:
United States
Language:
English

Similar Records

Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows
Journal Article · Mon Jan 28 00:00:00 EST 2019 · Future Generations Computer Systems · OSTI ID:1525776

HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors
Technical Report · Wed Jan 31 00:00:00 EST 2007 · OSTI ID:1525776

Enabling machine learning-ready HPC ensembles with Merlin
Journal Article · Fri Feb 04 00:00:00 EST 2022 · Future Generations Computer Systems · OSTI ID:1525776
