OSTI.GOV – U.S. Department of Energy
Office of Scientific and Technical Information

Title: Canopus: Enabling Extreme-Scale Data Analytics on Big HPC Storage via Progressive Refactoring

Authors:
Suchyta, Eric D. [1]; Lu, Tao [2]; Choi, Jong Youl [1]; Podhorszki, Norbert [1]; Liu, Qing Gary [1]; Pugmire, Dave [1]; Wolf, Matthew D. [1]; Ainsworth, Mark [1]
  1. Oak Ridge National Laboratory (ORNL)
  2. New Jersey Institute of Technology
Publication Date:
July 1, 2017
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1399449
DOE Contract Number:
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage '17), Santa Clara, California, United States of America, July 10, 2017
Country of Publication:
United States
Language:
English

Citation Formats

Suchyta, Eric D., Lu, Tao, Choi, Jong Youl, Podhorszki, Norbert, Liu, Qing Gary, Pugmire, Dave, Wolf, Matthew D., and Ainsworth, Mark. Canopus: Enabling Extreme-Scale Data Analytics on Big HPC Storage via Progressive Refactoring. United States: N. p., 2017. Web.
Suchyta, Eric D., Lu, Tao, Choi, Jong Youl, Podhorszki, Norbert, Liu, Qing Gary, Pugmire, Dave, Wolf, Matthew D., & Ainsworth, Mark. Canopus: Enabling Extreme-Scale Data Analytics on Big HPC Storage via Progressive Refactoring. United States.
Suchyta, Eric D., Lu, Tao, Choi, Jong Youl, Podhorszki, Norbert, Liu, Qing Gary, Pugmire, Dave, Wolf, Matthew D., and Ainsworth, Mark. 2017. "Canopus: Enabling Extreme-Scale Data Analytics on Big HPC Storage via Progressive Refactoring". United States. https://www.osti.gov/servlets/purl/1399449.
@inproceedings{osti_1399449,
  title = {Canopus: Enabling Extreme-Scale Data Analytics on Big {HPC} Storage via Progressive Refactoring},
  author = {Suchyta, Eric D. and Lu, Tao and Choi, Jong Youl and Podhorszki, Norbert and Liu, Qing Gary and Pugmire, Dave and Wolf, Matthew D. and Ainsworth, Mark},
  booktitle = {9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17)},
  place = {United States},
  year = {2017},
  month = {7},
  url = {https://www.osti.gov/servlets/purl/1399449}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

  • The increasingly large data sets processed on HPC platforms raise major challenges for the underlying storage layer. A promising alternative to POSIX-IO-compliant file systems are simpler blobs (binary large objects), or object storage systems. Such systems offer lower overhead and better performance at the cost of largely unused features such as file hierarchies or permissions. Similarly, blobs are increasingly considered for replacing distributed file systems for big data analytics, or as a base for storage abstractions such as key-value stores or time-series databases. This growing interest in object storage on HPC and big data platforms raises the question: are blobs the right level of abstraction to enable storage-based convergence between HPC and Big Data? In this paper we study the impact of blob-based storage for real-world applications in HPC and cloud environments. The results show that blob-based storage convergence is possible, leading to a significant performance improvement on both platforms.
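    The blob abstraction contrasted with POSIX in the abstract above can be sketched minimally: a flat namespace of opaque binary objects addressed by key, with no directory hierarchy or permission model. This is an illustrative in-memory toy, not the API of any real object store (systems such as S3 or RADOS differ in detail).

    ```python
    class BlobStore:
        """Hypothetical minimal blob store: flat key -> bytes mapping,
        no directories, no permission bits, no POSIX metadata."""

        def __init__(self):
            self._objects = {}

        def put(self, key: str, data: bytes) -> None:
            # A single flat write; no path resolution or permission checks,
            # which is where the lower overhead of blobs comes from.
            self._objects[key] = data

        def get(self, key: str) -> bytes:
            return self._objects[key]

    store = BlobStore()
    # Slashes in the key are just characters, not directories.
    store.put("sim/step0042/temperature", b"\x00\x01\x02\x03")
    print(len(store.get("sim/step0042/temperature")))  # 4
    ```

    Any hierarchy an application needs (e.g. per-timestep grouping) is encoded in key naming conventions rather than enforced by the store.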
  • Petascale simulations compute at resolutions ranging into billions of cells and write terabytes of data for visualization and analysis. Interactive visualization of this time series is a desired step before starting a new run. The I/O subsystem and associated network are often a significant impediment to interactive visualization of time-varying data, as they are not configured or provisioned to provide the necessary I/O read rates. In this paper, we propose a new I/O library for visualization applications: VisIO. Visualization applications commonly use N-to-N reads within their parallel-enabled readers, which provides an incentive for a shared-nothing approach to I/O, similar to other data-intensive approaches such as Hadoop. However, unlike other data-intensive applications, visualization requires: (1) interactive performance for large data volumes, (2) compatibility with MPI and POSIX file system semantics for compatibility with existing infrastructure, and (3) use of existing file formats and their stipulated data partitioning rules. VisIO provides a mechanism for using a non-POSIX distributed file system to provide linear scaling of I/O bandwidth. In addition, we introduce a novel scheduling algorithm that helps to co-locate visualization processes on nodes with the requested data. Testing was conducted with VisIO integrated into ParaView, using the Hadoop Distributed File System (HDFS) on TACC's Longhorn cluster. A representative dataset, VPIC, across 128 nodes showed a 64.4% read performance improvement compared to the provided Lustre installation. Also tested was a dataset representing a global ocean salinity simulation, which showed a 51.4% improvement in read performance over Lustre when using our VisIO system. VisIO provides powerful high-performance I/O services to visualization applications, allowing for interactive performance with ultra-scale, time-series data.