OSTI.GOV — U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Scalable Pipeline for Gigapixel Whole Slide Imaging Analysis on Leadership Class HPC Systems

Abstract

Whole Slide Imaging (WSI) captures microscopic details of a patient's histopathological features at multiple resolutions organized across different levels. Images produced by WSI are gigapixel-sized; holding a single image in memory requires a few gigabytes, which is scarce since a complicated model occupies tens of gigabytes. Performing even a simple metric operation on these large images is expensive. High-performance computing (HPC) can help us quickly analyze such large images through distributed training of complex deep learning models. One popular approach to analyzing these images is to divide a WSI image into smaller tiles (patches) and then train a simpler model with this large number of reduced-size patches. However, pursuing this patch-based approach requires solving three pre-processing challenges efficiently. 1) Creating small patches from a high-resolution image can yield hundreds of thousands of patches per image; storing and processing them is challenging due to the large number of I/O and arithmetic operations, so an optimal balance between the size and number of patches must be found to reduce I/O and memory accesses. 2) WSI images may have tiny annotated regions of cancer tissue and a significant portion of normal and fatty tissue; correct patch sampling should avoid dataset imbalance. 3) Storing and retrieving many patches to and from disk storage can incur I/O latency while training a deep learning model; an efficient distributed data loader should reduce this latency during the training and inference steps. This paper explores these three challenges and provides empirical and algorithmic solutions deployed on the Summit supercomputer hosted at the Oak Ridge Leadership Computing Facility.
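The patch-based pre-processing the abstract describes (tiling a gigapixel slide into patches, then sampling patches so tumor and normal classes stay balanced) can be sketched roughly as below. This is an illustrative NumPy sketch, not the authors' pipeline; the function names and parameters (`tile_image`, `balanced_sample`, `patch`, `stride`) are hypothetical.

```python
import numpy as np

def tile_image(img, patch=256, stride=256):
    """Split an H x W (or H x W x C) array into patch x patch tiles.

    A larger patch or stride yields fewer tiles per slide, which is one
    way to trade patch count against I/O and memory cost (challenge 1).
    """
    h, w = img.shape[:2]
    return [img[y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, stride)
            for x in range(0, w - patch + 1, stride)]

def balanced_sample(patches, labels, seed=0):
    """Undersample majority classes so every label is equally represented,
    e.g. tiny tumor regions vs. large normal/fatty regions (challenge 2)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    groups = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    n = min(len(g) for g in groups)  # size of the rarest class
    keep = np.concatenate([rng.choice(g, n, replace=False) for g in groups])
    return [patches[i] for i in keep], labels[keep]
```

For example, a 1024 x 1024 region tiled with `patch=256, stride=256` yields 16 non-overlapping patches; `balanced_sample` then keeps equal numbers of tumor-labeled and normal-labeled patches.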

Authors:
Dash, Sajal [1]; Hernandez Arreguin, Benjamin [1]; Tsaris, Aristeidis [1]; Alamudun, Folami [1]; Yoon, Hong-Jun [1]; Wang, Feiyi [1]
  1. ORNL
Publication Date:
May 2022
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1885343
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: ExSAIS 2022: Workshop on Extreme Scaling of AI for Science, virtual (Tennessee, United States of America), May 30 to June 3, 2022
Country of Publication:
United States
Language:
English

Citation Formats

Dash, Sajal, Hernandez Arreguin, Benjamin, Tsaris, Aristeidis, Alamudun, Folami, Yoon, Hong-Jun, and Wang, Feiyi. A Scalable Pipeline for Gigapixel Whole Slide Imaging Analysis on Leadership Class HPC Systems. United States: N. p., 2022. Web. doi:10.1109/IPDPSW55747.2022.00223.
Dash, Sajal, Hernandez Arreguin, Benjamin, Tsaris, Aristeidis, Alamudun, Folami, Yoon, Hong-Jun, & Wang, Feiyi. A Scalable Pipeline for Gigapixel Whole Slide Imaging Analysis on Leadership Class HPC Systems. United States. https://doi.org/10.1109/IPDPSW55747.2022.00223
Dash, Sajal, Hernandez Arreguin, Benjamin, Tsaris, Aristeidis, Alamudun, Folami, Yoon, Hong-Jun, and Wang, Feiyi. 2022. "A Scalable Pipeline for Gigapixel Whole Slide Imaging Analysis on Leadership Class HPC Systems". United States. https://doi.org/10.1109/IPDPSW55747.2022.00223. https://www.osti.gov/servlets/purl/1885343.
@article{osti_1885343,
title = {A Scalable Pipeline for Gigapixel Whole Slide Imaging Analysis on Leadership Class HPC Systems},
author = {Dash, Sajal and Hernandez Arreguin, Benjamin and Tsaris, Aristeidis and Alamudun, Folami and Yoon, Hong-Jun and Wang, Feiyi},
abstractNote = {Whole Slide Imaging (WSI) captures microscopic details of a patient's histopathological features at multiple resolutions organized across different levels. Images produced by WSI are gigapixel-sized; holding a single image in memory requires a few gigabytes, which is scarce since a complicated model occupies tens of gigabytes. Performing even a simple metric operation on these large images is expensive. High-performance computing (HPC) can help us quickly analyze such large images through distributed training of complex deep learning models. One popular approach to analyzing these images is to divide a WSI image into smaller tiles (patches) and then train a simpler model with this large number of reduced-size patches. However, pursuing this patch-based approach requires solving three pre-processing challenges efficiently. 1) Creating small patches from a high-resolution image can yield hundreds of thousands of patches per image; storing and processing them is challenging due to the large number of I/O and arithmetic operations, so an optimal balance between the size and number of patches must be found to reduce I/O and memory accesses. 2) WSI images may have tiny annotated regions of cancer tissue and a significant portion of normal and fatty tissue; correct patch sampling should avoid dataset imbalance. 3) Storing and retrieving many patches to and from disk storage can incur I/O latency while training a deep learning model; an efficient distributed data loader should reduce this latency during the training and inference steps. This paper explores these three challenges and provides empirical and algorithmic solutions deployed on the Summit supercomputer hosted at the Oak Ridge Leadership Computing Facility.},
doi = {10.1109/IPDPSW55747.2022.00223},
url = {https://www.osti.gov/biblio/1885343},
place = {United States},
year = {2022},
month = {5}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
