4TB Hard Drives Bring RAID 6 To Its Limits


4TB hard drives are becoming increasingly common and affordable, but many administrators doubt whether they should really use them. The problem is that a hard drive's rebuild time increases linearly with its capacity.

It can take two to four days for a new 4TB drive in a RAID array to be filled with data from the failed disk. The danger is that two further hard drives might fail in this time. RAID 6 can tolerate two simultaneous hard drive failures, but as soon as a third one fails, data is lost.

What at first glance seems unlikely – the failure of three disks within a short period of time – is statistically more likely with an increasing volume of data, higher disk capacity and longer rebuild times. If you believe the hard drive manufacturers, an individual disk should fail every hundred years, or even more infrequently.

But where hundreds or thousands of hard drives are in use, this hardware defect is an everyday occurrence and the Mean Time to Data Loss (MTTDL) is shorter. The risk of failure in 4TB drives is at least eight times as high as in 500GB drives right from the outset due to the rebuild time.
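The relationship between rebuild time and MTTDL can be sketched with the classic textbook approximation, which assumes independent drive failures at a constant rate. The function name `mttdl_hours`, the 11-drive array and the rebuild times of 9 and 72 hours are illustrative assumptions, not figures from a real system.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def mttdl_hours(n_drives, parity, mttf_hours, rebuild_hours):
    """Approximate MTTDL for an array that survives `parity` concurrent failures.

    Textbook approximation (independent failures, constant failure rate):
        MTTF^(p+1) / (N * (N-1) * ... * (N-p) * MTTR^p)
    """
    failure_orderings = prod(n_drives - i for i in range(parity + 1))
    return mttf_hours ** (parity + 1) / (failure_orderings * rebuild_hours ** parity)


# Manufacturer-style claim: one failure per drive every hundred years.
mttf = 100 * HOURS_PER_YEAR

# Hypothetical rebuild times: 9 h for a 500GB drive, 72 h for a 4TB drive.
raid6_fast = mttdl_hours(11, parity=2, mttf_hours=mttf, rebuild_hours=9)
raid6_slow = mttdl_hours(11, parity=2, mttf_hours=mttf, rebuild_hours=72)
z3_slow = mttdl_hours(11, parity=3, mttf_hours=mttf, rebuild_hours=72)

print(f"RAID 6, 9 h rebuild:      {raid6_fast / HOURS_PER_YEAR:.3g} years")
print(f"RAID 6, 72 h rebuild:     {raid6_slow / HOURS_PER_YEAR:.3g} years")
print(f"Triple parity, 72 h:      {z3_slow / HOURS_PER_YEAR:.3g} years")
```

Under this simplified model, the MTTDL of a double-parity array falls with the square of the rebuild time, so an eightfold longer rebuild is a lower bound on the added risk rather than the whole story. Real arrays deviate from the model because failures within one batch of drives are often correlated.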

The rebuild time is determined not only by the write speed of the replacement drive but also, primarily, by the read speed of the other drives. The crux of the matter is that the system remains live during the rebuild and that, in practice, the hard drives are already 90 percent busy.

This means only around 10 percent of the read capacity is available for the rebuild. Now you have a choice: either you speed up the rebuild times at a cost of performance while the system is live or you learn to live with the long rebuild times.
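A back-of-the-envelope calculation shows where the multi-day rebuild times come from. The sustained throughput of 150 MB/s is an assumed figure for illustration, and `rebuild_days` is a hypothetical helper name.

```python
def rebuild_days(capacity_tb, drive_mb_per_s, rebuild_share):
    """Days needed to rebuild one drive when only `rebuild_share` of the
    drive throughput is available for the rebuild (decimal TB and MB)."""
    capacity_bytes = capacity_tb * 1e12
    effective_bytes_per_s = drive_mb_per_s * 1e6 * rebuild_share
    seconds = capacity_bytes / effective_bytes_per_s
    return seconds / 86400


# Assumption: 150 MB/s sustained throughput, 10 percent left for the rebuild.
for capacity_tb in (0.5, 3, 4):
    days = rebuild_days(capacity_tb, drive_mb_per_s=150, rebuild_share=0.10)
    print(f"{capacity_tb} TB: {days:.1f} days")
```

With these assumptions a 4TB drive takes about 3.1 days, which falls inside the two-to-four-day range mentioned above. Doubling `rebuild_share` halves the rebuild time, at the cost of throughput for live workloads.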

Thus, it is clear that increasing hard drive size increases the probability of data loss. But there is a cure: triple parity. A RAID array with three parity bits can even withstand the simultaneous failure of three hard drives. This puts the MTTDL back in the desired regions.

Parity is fundamentally different from simply storing multiple copies of the same data (as mirroring does in RAID 1). Parity bits are computed from the live data bits in such a way that data lost in a RAID system can always be recalculated from the remaining data, and can therefore be recovered.

Simple parity, like in RAID 5, is calculated using XOR, so is based on the fact that digital data consists only of ones and zeros. The second parity, like in RAID 6, corresponds to orthogonal calculation. Triple parity, like in RAID Z3, is far more complex.
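The XOR-based single parity can be illustrated in a few lines. This is a minimal sketch with made-up block contents; it covers only the single-parity case and says nothing about how the second and third parities are computed.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe (illustrative contents).
data = [b"RAID", b"5 is", b" XOR"]
parity = xor_blocks(data)

# The drive holding data[1] fails: XOR the survivors with the parity block.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)

print(recovered == data[1])
```

Because every value XORed with itself cancels out, combining the surviving blocks with the parity block leaves exactly the missing block. One parity block yields one equation, which is why a single parity can recover only one lost block per stripe.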

This means that, should three hard drives fail simultaneously, every piece of information can be recovered by solving a system of equations with three unknowns. The fact that this is possible at all has to do with the fact that three is the smallest prime factor of 255 (the largest value of an unsigned byte).

RAID Z3 is currently the only system that allows triple parity – unless someone decides to implement quadrupled data storage (a “4-way mirror”). However, RAID Z3 offers more benefits than just data security in the event of a third hard drive failure. The Z stands for the ZFS file system, and this adds further considerable advantages in terms of data security.

This means that the Unrecoverable Bit Error Rate (UBER) goes down to zero. The UBER describes the frequency with which data blocks become unreadable as they age, for example because ionising radiation or magnetic fields have altered them. RAID Z3 solves this problem of creeping data corruption. It does not rely on the reliability of the hard drive, but rather guarantees consistency of all data at all times using a checksum tree, similar to a database.

UBER is rarely considered in the MTTDL calculation. An Unrecoverable Bit Error is, however, more common than a complete hard drive crash and can also lead to rebuilds, which are particularly time-consuming and risky in the case of large hard drives. In these cases, RAID Z3 simply restores the bad blocks without going through a complete rebuild.
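The idea of repairing a single bad block instead of rebuilding a whole drive can be sketched as follows. This is a toy model, not ZFS code: it uses SHA-256 and single XOR parity for brevity, and the names `read_block`, `stripe` and `expected` are hypothetical.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block):
    return hashlib.sha256(block).digest()


# One stripe with parity; the checksums are stored apart from the data.
stripe = [b"alpha---", b"bravo---", b"charlie-"]
parity = xor_blocks(stripe)
expected = [checksum(block) for block in stripe]

# Silent corruption: a single bit flips in block 1 without any drive error.
damaged = bytearray(stripe[1])
damaged[0] ^= 0x01
stripe[1] = bytes(damaged)


def read_block(index):
    """Verify a block on read; repair it from parity if the checksum fails."""
    block = stripe[index]
    if checksum(block) == expected[index]:
        return block
    others = [b for i, b in enumerate(stripe) if i != index]
    repaired = xor_blocks(others + [parity])
    if checksum(repaired) != expected[index]:
        raise IOError("block cannot be repaired from parity")
    stripe[index] = repaired  # rewrite only the bad block
    return repaired


print(read_block(1))
```

Only the damaged block is recalculated and rewritten; the rest of the drive is left alone. The checksum is what makes this possible, because it tells the system which block is wrong rather than merely that the stripe is inconsistent.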

Thanks to the copy-on-write process, RAID Z3 is innately protected against write holes. These occur in standard RAID systems when the system fails between data being changed and the parity data being recalculated. The result is false parity data. Should the original data ever be restored from this parity data, the result is completely unusable data.

RAID Z3 avoids this problem from the outset because it never overwrites the original data with new data. Rather, the changed data is reallocated and the entire checksum tree is recalculated. This method keeps data in ZFS consistent at all times.
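The write hole and the copy-on-write remedy can be reproduced with a toy single-parity stripe. This sketch is a deliberate simplification: the `committed` flag stands in for the atomic switch to the new block tree, and all block contents are invented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


old_stripe = [b"AAAA", b"BBBB", b"CCCC"]
old_parity = xor_blocks(old_stripe)

# In-place update: block 1 is overwritten, then the system crashes
# before the parity block has been recalculated.
on_disk = [b"AAAA", b"XXXX", b"CCCC"]
stale_parity = old_parity

# Later the drive holding block 0 fails and is rebuilt from stale parity.
rebuilt_in_place = xor_blocks([on_disk[1], on_disk[2], stale_parity])
print("in-place rebuild correct:", rebuilt_in_place == b"AAAA")

# Copy-on-write: the new stripe and its parity go to free space, and the
# old stripe stays valid until the reference is switched.
new_stripe = [b"AAAA", b"XXXX", b"CCCC"]
new_parity = xor_blocks(new_stripe)
committed = False  # crash happens before the switch

active, active_parity = (new_stripe, new_parity) if committed else (old_stripe, old_parity)
rebuilt_cow = xor_blocks([active[1], active[2], active_parity])
print("copy-on-write rebuild correct:", rebuilt_cow == b"AAAA")
```

In the in-place case the rebuilt block is garbage even though it was never written to. In the copy-on-write case the crash merely loses the unfinished update, and data and parity still match.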

Because RAID Z3 is implemented in software rather than in a controller, as in other RAID systems, it can distinguish between used and free data blocks and restore only the occupied blocks during a rebuild. Hard drives in storage systems tend to be between 50 and 85 percent occupied in practice. This alone reduces the rebuild time in RAID Z3 by 15 to 50 percent and increases the MTTDL accordingly.

It makes perfect sense for storage administrators to optimise the capacity and cost efficiency of their storage systems with 4TB hard drives. Even those already using 3TB drives with RAID 6 can increase their live storage by using 4TB drives and RAID Z3.

Let us assume an array with 12 slots for hard drives. A typical set-up today would be RAID 6 with 9+2+1, i.e. the equivalent of nine hard drives is used for live storage and the equivalent of two hard drives is used for parity information (in reality, data and parity are distributed across all eleven hard drives).

One hard drive is working as a “hot spare”, so rebuilding can start automatically and immediately in the event of a defect. This means 3TB drives provide live storage of 27TB (9 x 3TB). In RAID Z3 the equation is 8+3+1, thus with 4TB drives we come to 32TB (8 x 4TB) of live storage.
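The capacity comparison above is simple arithmetic and can be checked directly. The helper name `live_storage_tb` is hypothetical.

```python
def live_storage_tb(slots, parity_drives, hot_spares, drive_tb):
    """Live storage of an array, ignoring file system overhead."""
    data_drives = slots - parity_drives - hot_spares
    return data_drives * drive_tb


raid6_3tb = live_storage_tb(slots=12, parity_drives=2, hot_spares=1, drive_tb=3)
raidz3_4tb = live_storage_tb(slots=12, parity_drives=3, hot_spares=1, drive_tb=4)

print(f"RAID 6,  9+2+1 with 3TB drives: {raid6_3tb} TB")
print(f"RAID Z3, 8+3+1 with 4TB drives: {raidz3_4tb} TB")
```

Despite giving up one more drive to parity, the RAID Z3 layout ends up with 5TB more live storage in the same 12 slots.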

The data security concerns surrounding 4TB hard drives are justified because rebuild times become incredibly long. The problems can be resolved, but RAID 6 is then brought to its limits. The solution is triple parity, preferably with the software-controlled RAID Z3 in the ZFS file system. This increases the cost efficiency of the storage and allows a greater level of data security than anything before.

Heiko Wüst

Heiko Wüst is an expert in UNIX, Linux and Open Source with many years of experience in networking, storage and virtualisation environments. His previous posts include ADIVA and Motorola. Today the qualified electrical engineer is working as a sales engineer at Nexenta Systems.

  • Donald Pearson

    This article is misguided at best.
    All ZFS implementations use Copy On Write, this is not Raidz3 exclusive.
    All ZFS pools made of multi-disk vdevs protect against bit rot by checksumming every read against the expected checksum, this is not Raidz3 exclusive.

    The only difference between Raidz3 and Raidz2 or Raidz1 is that parity includes a 3rd disk. There are no other benefits, and you fail to mention that there is a performance hit cost of 3-drive parity.

    Raidz3 is only one solution. Another is to use more vdevs with fewer drives and continue using Raidz2 or Raidz. Even more resilient is to use many 2-drive mirrored vdevs, which will substantially increase the IOPS and resiliency of the pool as a whole.

    However at the end of the day, Raid is never the answer to data security, backups are. Raid is there to keep your system live and operational through inevitable drive failures.