Failing in place for low-serviceability storage infrastructure using high-parity GPU-based RAID.

Curry, Matthew L; Ward, H Lee; Skjellum, Anthony

Conference · May 1, 2010

OSTI ID: 1020441

Curry, Matthew L; Ward, H Lee; Skjellum, Anthony ^[1]

University of Alabama at Birmingham

To provide large quantities of high-reliability disk-based storage, it has become necessary to aggregate disks into fault-tolerant groups based on the RAID methodology. Most RAID levels provide some fault tolerance, but certain classes of applications require increased levels of fault tolerance within an array. These applications include embedded systems in harsh environments that have a low level of serviceability, and uninhabited data centers serving cloud computing.

When describing RAID reliability, Mean Time To Data Loss (MTTDL) calculations often assume that the time to replace a failed disk is relatively low, or even negligible compared to rebuild time. For platforms in remote areas that collect and process data, it may be impossible to access the system to perform maintenance for long periods. A disk may fail early in a platform's life but remain unreplaceable for far longer than is typical for RAID arrays. Service periods may be scheduled at intervals on the order of months, or the platform may not be serviced until the end of a mission in progress. Further, the platform may be subject to extreme conditions that accelerate wear and tear on a disk, requiring even more protection from failures.

We have created a high-parity RAID implementation that uses a Graphics Processing Unit (GPU) to compute more than two blocks of parity information per stripe, allowing extra parity to eliminate or reduce the requirement for rebuilding data between service periods. While this type of controller is highly effective for RAID 6 systems, an important benefit is the ability to incorporate more parity into a RAID storage system. Such RAID levels, as yet unnamed, can tolerate the failure of three or more disks (depending on configuration) without data loss.

While this RAID system certainly has applications in embedded systems running applications in the field, similar benefits can be obtained for servers that are engineered for storage density, with less regard for serviceability or maintainability. A storage brick can be designed to have an MTTDL that extends well beyond the useful lifetime of the hardware used, allowing the disk subsystem to require less service throughout the lifetime of a compute resource. This approach is similar to the Xiotech ISE. Such a design can be deliberately placed remotely (without frequent support) to provide colocation or to meet cost goals. For workloads where reliability is key but conditions are sub-optimal for routine serviceability, a high-parity RAID can provide extra reliability in extraordinary situations. For example, for installations with a very high Mean Time To Repair, the extra parity can eliminate certain problems with maintaining hot spares, increasing overall reliability. Furthermore, in situations where disk reliability is reduced because of harsh conditions, extra parity can guard against early data loss due to lowered Mean Time To Failure. If used through an iSCSI interface with a streaming workload, it is possible to gain all of these benefits without impacting performance.
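
The fail-in-place argument above can be illustrated with a rough calculation. The sketch below is not the authors' MTTDL model; it assumes independent disks with exponentially distributed lifetimes and no replacement or rebuild during a service interval, and it estimates the chance that more disks fail than the array has parity blocks per stripe. The function names, the array size, the service interval, and the Mean Time To Failure value are all hypothetical and chosen only for illustration.

```python
import math


def p_disk_fails(interval_hours: float, mttf_hours: float) -> float:
    """Probability that one disk fails within the interval.

    Assumes an exponential lifetime with the given MTTF (an assumption,
    not a figure from the abstract).
    """
    return 1.0 - math.exp(-interval_hours / mttf_hours)


def p_data_loss(n_disks: int, parity: int,
                interval_hours: float, mttf_hours: float) -> float:
    """Probability that more than `parity` of `n_disks` fail before service.

    Models failing in place: no disk is replaced during the interval, so
    data is lost only when the failure count exceeds the parity count.
    """
    p = p_disk_fails(interval_hours, mttf_hours)
    # Binomial probability of 0..parity failures, all of which are survivable.
    survivable = sum(
        math.comb(n_disks, k) * p**k * (1.0 - p) ** (n_disks - k)
        for k in range(parity + 1)
    )
    return 1.0 - survivable


if __name__ == "__main__":
    # Hypothetical scenario: 16 disks, six months between service visits,
    # and an MTTF lowered to 100,000 hours by harsh conditions.
    interval = 6 * 730.0
    for parity in range(1, 5):
        loss = p_data_loss(16, parity, interval, 100_000.0)
        print(f"parity={parity}: P(data loss before service) = {loss:.2e}")
```

Under these assumptions each additional parity block lowers the probability of data loss within the interval, which is the effect the abstract relies on when it suggests that extra parity can stand in for timely disk replacement.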

Research Organization: Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC04-94AL85000

Report Number(s): SAND2010-3579C; TRN: US1103751

Resource Relation: Conference: Proposed for presentation at the High Performance Embedded Computing Workshop held September 15-16, 2010 in Burlington, MA.

Country of Publication: United States

Language: English

Related Subjects

72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS
BRICKS
CLOUDS
CONFIGURATION
DESIGN
IMPLEMENTATION
LIFETIME
MAINTENANCE
PARITY
PERFORMANCE
PROCESSING
RELIABILITY
REMOTE AREAS
REPAIR
STORAGE
TOLERANCE
