U.S. Department of Energy
Office of Scientific and Technical Information

Lessons Learned in Deploying the World's Largest Scale Lustre File System

Conference
OSTI ID:1016043
The Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) is the world's largest-scale Lustre parallel file system. Envisioned as a shared parallel file system capable of meeting both the bandwidth and capacity requirements of the OLCF's diverse computational environment, the project had a number of ambitious goals. To support the workloads of the OLCF's diverse computational platforms, the aggregate performance and storage capacity of Spider exceed those of our previously deployed systems by factors of 6 (240 GB/sec) and 17 (10 petabytes), respectively. Furthermore, Spider supports over 26,000 clients concurrently accessing the file system, nearly 4 times as many as our previously deployed systems. In addition to these scalability challenges, moving to a center-wide shared file system required dramatically improved resiliency and fault-tolerance mechanisms. This paper details our efforts in designing, deploying, and operating Spider. Through a phased approach of research and development, prototyping, deployment, and transition to operations, this work has resulted in a number of insights into large-scale parallel file system architectures, from both the design and the operational perspectives. We present our solutions to issues such as network congestion, performance baselining and evaluation, file system journaling overheads, and high availability in a system with tens of thousands of components. We also discuss areas of continued challenge, such as stressed metadata performance and the need for file system quality of service, along with our efforts to address them. Finally, we discuss operational aspects of managing a system of this scale, along with real-world data and observations.
Research Organization:
Oak Ridge National Laboratory (ORNL); Center for Computational Sciences
Sponsoring Organization:
SC USDOE - Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
Country of Publication:
United States
Language:
English

Similar Records

A Next-Generation Parallel File System Environment for the OLCF
Conference · Sat Dec 31 23:00:00 EST 2011 · OSTI ID:1039646

The Spider Center Wide File System; From Concept to Reality
Conference · Wed Dec 31 23:00:00 EST 2008 · OSTI ID:1016038

Oak Ridge Leadership Computing Facility Position Paper
Conference · Fri Dec 31 23:00:00 EST 2010 · OSTI ID:1024315