The Perfect Storm

Posted in: Backup, General, RAID, SAN, SAS, Author: yobitech (February 25, 2012)

As you may remember, when SATA drive technology came around several years ago, it was a very exciting time. This new low-cost, high-capacity, commodity disk drive revolutionized data storage for home computers.

This fueled the age of the digital explosion. Digital photos and media quickly, and affordably, filled hard drives around the world. This digital explosion propelled companies like Apple and Google into the hundreds of billions in revenue. It also propelled explosive data growth in the enterprise.

The SAN industry scrambled to meet this demand. SAN vendors such as EMC, NetApp and others saw the opportunity to move into a new market using these same affordable high-capacity drives to quench the thirst for storage.

The concept of using SATA drives in a SAN went mainstream. Companies that once could not afford a SAN could now buy one with larger capacities for a fraction of the cost of a traditional SAN. This was so popular that companies bought SATA-based SANs in bulk, often in multiple batches at a time.

As time progressed, these drives started failing. SATA drives were known for their low MTBF (mean time between failures) ratings. SATA SANs employed RAID 5 at first, which protects against a single drive failure but not against a dual drive failure.

Companies then started to employ RAID 6, which adds dual drive failure protection, so that losing two drives would not result in data loss.
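To make the RAID 5 versus RAID 6 trade-off concrete, here is a minimal sketch of the capacity each level gives up for its protection. The function name `raid_summary` and the 12-drive, 2TB group are illustrative assumptions, not any particular vendor's layout.

```python
def raid_summary(level: int, drives: int, drive_tb: float) -> dict:
    """Usable capacity and tolerated drive failures for one RAID 5 or RAID 6 group."""
    parity = {5: 1, 6: 2}[level]      # drives' worth of capacity spent on parity
    if drives < parity + 2:           # RAID 5 needs at least 3 drives, RAID 6 at least 4
        raise ValueError("not enough drives for this RAID level")
    return {
        "usable_tb": (drives - parity) * drive_tb,
        "tolerated_failures": parity,
    }


if __name__ == "__main__":
    # Hypothetical group: 12 drives of 2TB each.
    for level in (5, 6):
        print(level, raid_summary(level, 12, 2.0))
```

RAID 6 spends one more drive on parity than RAID 5 and, in exchange, survives a second failure in the same group.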

The “Perfect Storm” even with RAID 6 protection looks like this…

– Higher Capacity Drives = longer rebuild times: The industry has released 3TB drives. Rebuild times vary by SAN vendor; I have seen a rebuild of a 2TB drive take 6 days.

– Denser Array Footprint = increased heat and vibration: Both dramatically reduce MTBF.

– Outsourced drive manufacturing to third world countries = increased rate of drive failures, particularly in batches or series: Quality control and management are lacking in outsourced facilities, resulting in mass defects.

– Common MTBF in Mass Numbers = drives will fail around the same time: This is a statistical game. For example, a 3% failure rate for a SAN in a datacenter is acceptable, but when there are mass quantities of these drives, 3% will approach and/or exceed the fault tolerance of RAID.

– Virtualized Storage = complexity in recovery: Most SAN vendors now have virtualized storage, but recovery will vary depending on how they do their virtualization.

– Media Errors on Drives = failure to successfully rebuild RAID volumes: The larger the drive, the greater the chance of media errors. Media errors are errors on the drive that render small bits of data unreadable. A rebuild of a RAID volume may be compromised or fail outright because of these errors.
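The statistical side of the list above can be roughed out in a few lines. This is only a sketch under stated assumptions: drives fail independently at a constant rate (the batch failures described above would make real odds worse), the 3% annual failure rate and 6-day rebuild come from the figures in this post, and the 1-in-10^14 unrecoverable read error rate is an assumed datasheet-style number, not a measurement. All function names are made up for illustration.

```python
import math


def drive_failure_prob(annual_failure_rate: float, window_days: float) -> float:
    """Chance that one drive fails within the window, assuming a constant failure rate."""
    return 1.0 - (1.0 - annual_failure_rate) ** (window_days / 365.0)


def prob_at_least(n: int, p: float, k: int) -> float:
    """Binomial tail: chance that k or more of n independent drives fail."""
    below = sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def prob_read_error(tb_read: float, ure_per_bit: float) -> float:
    """Chance of hitting at least one unreadable bit while reading tb_read terabytes."""
    bits = tb_read * 1e12 * 8
    # 1 - (1 - ure)^bits, written with log1p/expm1 to stay accurate for tiny rates.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    # Assumed scenario: a 12-drive RAID 6 group with one drive already rebuilding.
    p = drive_failure_prob(annual_failure_rate=0.03, window_days=6)
    # Two more failures among the 11 survivors would exceed RAID 6 protection.
    loss_one_group = prob_at_least(n=11, p=p, k=2)
    print(f"one group, data loss during rebuild: {loss_one_group:.2e}")

    # "Mass numbers": chance that at least one of many rebuilding groups is lost.
    for groups in (10, 100, 1000):
        print(groups, f"{1.0 - (1.0 - loss_one_group) ** groups:.2e}")

    # Media errors: reading 10 surviving 2TB drives end to end during a rebuild.
    print(f"read error during rebuild: {prob_read_error(20.0, 1e-14):.2f}")
```

A read error on its own is recoverable while RAID 6 still has a parity to spare; it becomes a rebuild failure once two drives in the group are already gone, which is why long rebuild windows and media errors compound each other.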

Don’t be fooled into a false sense of security by having just RAID 6. Employ good backups and data replication as an extension of a good business continuity or disaster recovery plan.

As the industry moves to different technologies, other new and interesting anomalies will develop.

In technology, there is never a dull moment.