OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Performance analysis and optimization for scalable deployment of deep learning models for country-scale settlement mapping on Titan supercomputer

Abstract

Here, we present a scalable object detection workflow for detecting objects, such as settlements, from remotely sensed (RS) imagery. We have successfully deployed this workflow on Titan supercomputer and utilized it for the task of mapping human settlement at a country scale. The performance of various stages in the workflow was analyzed before making it operational. The workflow implemented various strategies to address issues such as suboptimal resource utilization and long-tail effects due to unbalanced image workload, data loss due to runtime failures, and maximum wall-time constraints imposed by Titan's job scheduling policy. A mean shift clustering–based static load balancing strategy was implemented, which partitions the image load such that each partition contained similar-sized images. Furthermore, a checkpoint-restart strategy was added in the workflow as a fault-tolerance mechanism to prevent the data losses due to unforeseen runtime failures. The performance of the above-mentioned strategies was observed in various scenarios, such as node failure, exceeding wall time, and successful completion. Using this workflow, we have examined an RS data set that has a spatial resolution of 0.31 m and is comprised of 685 675 km² of area of the Republic of Zambia in under six hours using 5426 nodes of the Titan supercomputer.
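
The static load balancing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the clustering feature is a single number per image (here, a hypothetical file size in megabytes), uses a simple flat-kernel mean shift written from scratch, and the names `mean_shift_1d`, `partition_by_size`, and the `bandwidth` value are all made up for illustration.

```python
from collections import defaultdict


def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift in one dimension.

    Returns, for each input value, the mode it converges to.
    """
    modes = []
    for v in values:
        x = float(v)
        for _ in range(max_iter):
            # Mean of all points inside the window centred on x.
            window = [p for p in values if abs(p - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes


def partition_by_size(image_sizes, bandwidth):
    """Group image names so each partition holds similar-sized images."""
    names = list(image_sizes)
    modes = mean_shift_1d([image_sizes[n] for n in names], bandwidth)
    centres = []
    partitions = defaultdict(list)
    for name, mode in zip(names, modes):
        # Reuse an existing cluster if the mode is close to its centre.
        for i, c in enumerate(centres):
            if abs(mode - c) <= bandwidth / 2:
                partitions[i].append(name)
                break
        else:
            centres.append(mode)
            partitions[len(centres) - 1].append(name)
    return [partitions[i] for i in range(len(centres))]


# Hypothetical image sizes in megabytes (illustrative data only).
sizes = {"a.tif": 100, "b.tif": 105, "c.tif": 98,
         "d.tif": 900, "e.tif": 910, "f.tif": 450}
parts = partition_by_size(sizes, bandwidth=50)
print(parts)
```

Each resulting partition could then be submitted as its own job, so that the tasks within a job take roughly similar time and long-tail stragglers are less likely; how the paper maps partitions to nodes is not stated in this record.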

Authors:
Kurte, Kuldeep [1]; Sanyal, Jibonananda [1]; Berres, Anne [1]; Lunga, Dalton [1]; Coletti, Mark [1]; Yang, Hsiuhan Lexie [1]; Graves, Daniel [1]; Liebersohn, Benjamin [1]; Rose, Amy [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Basic Energy Sciences (BES) (SC-22). Scientific User Facilities Division
OSTI Identifier:
1511944
Alternate Identifier(s):
OSTI ID: 1511755
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Concurrency and Computation. Practice and Experience
Additional Journal Information:
Journal Volume: TBD; Journal Issue: TBD; Journal ID: ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; convolutional neural network; deep learning; fault tolerance; HPC; human settlement mapping; load balancing

Citation Formats

Kurte, Kuldeep, Sanyal, Jibonananda, Berres, Anne, Lunga, Dalton, Coletti, Mark, Yang, Hsiuhan Lexie, Graves, Daniel, Liebersohn, Benjamin, and Rose, Amy. Performance analysis and optimization for scalable deployment of deep learning models for country-scale settlement mapping on Titan supercomputer. United States: N. p., 2019. Web. doi:10.1002/cpe.5305.
Kurte, Kuldeep, Sanyal, Jibonananda, Berres, Anne, Lunga, Dalton, Coletti, Mark, Yang, Hsiuhan Lexie, Graves, Daniel, Liebersohn, Benjamin, & Rose, Amy. Performance analysis and optimization for scalable deployment of deep learning models for country-scale settlement mapping on Titan supercomputer. United States. doi:10.1002/cpe.5305.
Kurte, Kuldeep, Sanyal, Jibonananda, Berres, Anne, Lunga, Dalton, Coletti, Mark, Yang, Hsiuhan Lexie, Graves, Daniel, Liebersohn, Benjamin, and Rose, Amy. Wed . "Performance analysis and optimization for scalable deployment of deep learning models for country-scale settlement mapping on Titan supercomputer". United States. doi:10.1002/cpe.5305.
@article{osti_1511944,
title = {Performance analysis and optimization for scalable deployment of deep learning models for country-scale settlement mapping on Titan supercomputer},
author = {Kurte, Kuldeep and Sanyal, Jibonananda and Berres, Anne and Lunga, Dalton and Coletti, Mark and Yang, Hsiuhan Lexie and Graves, Daniel and Liebersohn, Benjamin and Rose, Amy},
abstractNote = {Here, we present a scalable object detection workflow for detecting objects, such as settlements, from remotely sensed (RS) imagery. We have successfully deployed this workflow on Titan supercomputer and utilized it for the task of mapping human settlement at a country scale. The performance of various stages in the workflow was analyzed before making it operational. The workflow implemented various strategies to address issues such as suboptimal resource utilization and long-tail effects due to unbalanced image workload, data loss due to runtime failures, and maximum wall-time constraints imposed by Titan's job scheduling policy. A mean shift clustering–based static load balancing strategy was implemented, which partitions the image load such that each partition contained similar-sized images. Furthermore, a checkpoint-restart strategy was added in the workflow as a fault-tolerance mechanism to prevent the data losses due to unforeseen runtime failures. The performance of the above-mentioned strategies was observed in various scenarios, such as node failure, exceeding wall time, and successful completion. Using this workflow, we have examined an RS data set that has a spatial resolution of 0.31 m and is comprised of 685 675 km$^2$ of area of the Republic of Zambia in under six hours using 5426 nodes of the Titan supercomputer.},
doi = {10.1002/cpe.5305},
journal = {Concurrency and Computation. Practice and Experience},
issn = {1532-0626},
number = {TBD},
volume = {TBD},
place = {United States},
year = {2019},
month = {5}
}

Journal Article:
Free Publicly Available Full Text
This content will become publicly available on May 8, 2020
Publisher's Version of Record

Works referenced in this record:

A higher order estimate of the optimum checkpoint interval for restart dumps
journal, February 2006