Deep Learning for Enhancing Fault Tolerant Capabilities of Scientific Workflows
- University of California, San Diego
- BATTELLE (PACIFIC NW LAB)
Alok Singh, Ilkay Altintas
San Diego Supercomputer Center, UCSD, La Jolla, CA, USA
{a1singh, ialtintas}@ucsd.edu

Malachi Schram, Nathan Tallent
Pacific Northwest National Laboratory, Richland, WA, USA
{malachi.schram, nathan.tallent}@pnnl.gov

Abstract: In the history of Computer Science, the act of ‘delegation’ has been the greatest multiplier of society’s problem-solving ability. A scientist working on detecting anomalies in a phenomenon does not need to re-invent matrix multiplication techniques to solve her problem. Scientific workflows provide the ultimate ‘delegation’ mechanism, in which a domain scientist can completely forget the specifics of ‘how’ her program will execute on a large cluster in an efficient and cost-effective manner and can instead focus on the mathematical formulation and theoretical robustness of her solution. We present here an approach that directly aims to make the execution of scientific workflows more reliable, robust and efficient. We hope that the work presented in this paper will propel the larger effort of the scientific workflow community to make scientific workflow execution as simple, efficient and robust as a JOIN operation in a modern database. Specifically, we apply Deep Learning techniques to develop a mechanism that forecasts the final state (success or failure) of a dynamic job in a large-scale particle physics experiment, with minimal data gathering, and as early as possible in the job’s life cycle. The key advantage of having a predictive mechanism to identify and anticipate failure-prone jobs is the potential for designing intelligent Fault Tolerance mechanisms to handle anomalous events.
We achieve a 14% improvement in computational resource utilization and an overall classification accuracy of 85% on real tasks executed in a High Energy Physics computing workflow. To the best of our knowledge, this is the most exhaustive and first-of-its-kind study of neural network architectures in the context of a real dataset profiled from a large-scale scientific workflow.
- Research Organization:
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-76RL01830
- OSTI ID:
- 1525776
- Report Number(s):
- PNNL-SA-143406
- Resource Relation:
- Conference: Proceedings of the IEEE International Conference on Big Data, (Big Data 2018), December 10-13, 2018, Seattle, WA
- Country of Publication:
- United States
- Language:
- English
Similar Records
HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors
Enabling machine learning-ready HPC ensembles with Merlin