ARCHIE: Data Analysis Acceleration with Array Caching in Hierarchical Storage
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
© 2018 IEEE. Scientific data analysis typically involves reading massive amounts of data generated by simulations, experiments, and observations. Reading such large volumes of data from disk-based file systems is often slow because of the mechanical components in the disks. Recent supercomputing systems are adding non-volatile storage layers in a hierarchy to close the performance gap between fast main memory and slow disk-based storage. Software libraries for managing this hierarchy need not only to read data efficiently but also to reduce user involvement in cross-layer data movement. Furthermore, because scientific data is often organized in array-based data structures, these libraries need to support array data access patterns in hierarchical storage management. Existing software typically manages individual storage layers, requiring significant manual effort to move data among them. In this paper, we introduce a new array caching approach in hierarchical storage (ARCHIE) to accelerate array data analysis in a seamless fashion. ARCHIE evaluates array access patterns and prefetches data with array semantics between storage layers. Our evaluation shows that ARCHIE outperforms state-of-the-art file systems, namely Lustre and DataWarp, on a production supercomputing system by up to 5.8× when scientific analysis applications access data.
- Research Organization: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: AC02-05CH11231
- OSTI ID: 1602833
- Resource Relation: Conference: 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA (United States), December 10-13, 2018
- Country of Publication: United States
- Language: English