Large hard drives in RAID 5 a problem?
December 2nd, 2008 | by Vladimir Stajic
A colleague recently brought to my attention Robin Harris’ blog post “Why RAID 5 stops working in 2009.” According to the post, we’re nearing the limit of usefulness of RAID 5. Harris’ thesis rests on the fact that hard drive manufacturers currently specify roughly one unrecoverable read error (URE) per 100,000,000,000,000 (10^14) bits read. This translates into roughly 12.5 TB, or about 11.4 TiB.
In other words: if we fill a 1 TB disk with data, we could expect to copy all of that data off the disk only about 11 or 12 times before running into an error. By then, there would be a good chance that at least one of the sectors on the drive could not be read, and we would lose any file residing on that sector. (If it’s only a single file, that could be a bother, but generally not a huge deal.)
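The arithmetic behind these figures is easy to check. Here is a minimal Python sketch, assuming the 10^14-bit figure is an average rate rather than a hard limit; the variable and function names are made up for illustration. It also shows why the result differs depending on whether a terabyte is counted in decimal (10^12 bytes) or binary (2^40 bytes) units.

```python
# Assumed spec: on average one unrecoverable read error per 1e14 bits read.
URE_INTERVAL_BITS = 10**14

bytes_per_ure = URE_INTERVAL_BITS / 8      # bits -> bytes
tb_per_ure = bytes_per_ure / 10**12        # decimal terabytes
tib_per_ure = bytes_per_ure / 2**40        # binary tebibytes


def full_reads_per_ure(disk_bytes):
    """Average number of complete reads of a disk per expected URE."""
    return bytes_per_ure / disk_bytes


print(f"{tb_per_ure:.1f} TB or {tib_per_ure:.2f} TiB read per expected URE")
print(f"{full_reads_per_ure(10**12):.1f} full reads of a 1 TB (decimal) disk")
print(f"{full_reads_per_ure(2**40):.1f} full reads of a 1 TiB (binary) disk")
```

Either way, the answer lands between 11 and 13 full passes over a drive of about one terabyte.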
However, in RAID 5 environments, the story could be a bit different. If the read error were to occur during the reconstruction process, after one of the disks had already failed, we would lose all of our data. Why? Because during reconstruction every sector on every surviving drive has to be read, so a single unreadable sector is as damaging as a whole failed hard drive. The RAID controller has no knowledge or understanding of the underlying file system, and if it cannot rebuild all the data, it will have to scrap the process.
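To get a feel for how likely such a failed rebuild is, here is a rough Python sketch. It assumes that every bit read has an independent 10^-14 chance of triggering a URE, which is a simplification of the manufacturers' spec and not a measured failure model; the function name and the two example arrays are hypothetical.

```python
import math

# Assumed per-bit probability of an unrecoverable read error (from the 1e14 spec).
URE_PER_BIT = 1e-14


def rebuild_failure_probability(surviving_disks, disk_bytes, p_bit=URE_PER_BIT):
    """Chance of at least one URE while reading every surviving disk in full."""
    bits_to_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)^n, written with log1p/expm1 so the tiny p does not lose precision.
    return -math.expm1(bits_to_read * math.log1p(-p_bit))


# Hypothetical arrays: 4 x 1 TB and 7 x 2 TB RAID 5, each with one disk failed.
small = rebuild_failure_probability(surviving_disks=3, disk_bytes=10**12)
large = rebuild_failure_probability(surviving_disks=6, disk_bytes=2 * 10**12)
print(f"4 x 1 TB array: {small:.0%} chance of a URE during rebuild")
print(f"7 x 2 TB array: {large:.0%} chance of a URE during rebuild")
```

Under these assumptions the smaller array hits a URE in roughly one rebuild out of five and the larger one in roughly three out of five. Real drives may well do better than their spec sheet, so treat the numbers as an upper-bound illustration.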
Bummer.
But there is an upside, which was not touched on by Mr. Harris in his post. Business-grade RAID controllers check the integrity of individual drives by continually running read tests in the background (often called patrol reads or scrubbing). This happens even when all the drives seem to be completely problem-free. Now and then a URE will occur and some piece of data will not be retrievable by a normal read. In such an event, the RAID controller will do its magic: it reconstructs the lost piece of information from the remaining drives and parity, while at the same time marking the problematic sector as bad and unusable for further storage. Again, this operation is completely transparent.
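The “magic” is plain XOR parity. As a minimal Python sketch (a toy model with made-up four-byte blocks, not how any particular controller is implemented), one stripe of a RAID 5 array and the recovery of a single unreadable block look like this:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data disks, plus the parity block on a fourth disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose the block on the second disk hits a URE. XOR-ing the blocks that are
# still readable with the parity block yields the missing block again.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])
```

This only works while exactly one block per stripe is missing, which is why finding and repairing a bad sector before a whole drive fails matters so much.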
So, while the risk of a URE is real and larger drives might mean greater exposure to it, that same risk is mitigated by the RAID controllers. Of course, people running software RAID without this kind of regular background checking might need to rethink their backup strategies.
And on a brighter note: hard drives today don’t seem to fail with any greater frequency than their cousins from 5 or 10 years ago. If anything, my personal experience says that such failures are slightly less common than they were back then. Why? As disk capacities increase, so do the accompanying reliability figures, such as mean time to failure and URE ratings.
