Technical Reference

How RAID Parity Actually Works

Q: How does RAID 5 reconstruct data from a failed drive?

RAID 5 uses XOR parity distributed across all drives in the array. When one drive fails, the controller reads the corresponding data blocks and parity block from the surviving drives for each stripe. XOR is reversible: if A XOR B XOR C = P, then any one missing value can be reconstructed by XORing the remaining three. The controller performs this calculation for every stripe across the entire array to rebuild the missing drive's data. This works only for a single drive failure; a second failure during rebuild results in data loss.

Q: What is the difference between RAID 5 and RAID 6?

RAID 5 uses one parity block per stripe, calculated with XOR. It can survive one drive failure. RAID 6 uses two independent parity blocks per stripe: one is standard XOR parity (P) and the other uses a different mathematical function, typically based on Galois field arithmetic (Q). This allows RAID 6 to survive two simultaneous drive failures. RAID 6 requires a minimum of four drives and has slightly lower write performance because two parity blocks must be calculated and written for every data update.

Q: What happens if a RAID hits an Unrecoverable Read Error during rebuild?

Consumer SATA drives have a typical Unrecoverable Bit Error Rate (UBER) of 1 error per 10^14 bits read, which works out to roughly 1 error per 12.5 TB of data. Enterprise SAS and nearline drives are typically rated 1 in 10^15 (about 125 TB). During a RAID 5 rebuild, the controller must read every sector on every surviving drive. With modern 16-20 TB drives, the probability of hitting a latent URE during a full-array rebuild is high enough that a RAID 5 rebuild can fail partway through. Enterprise controllers like Dell PERC can puncture the affected stripe to continue the rebuild, but the data in that stripe is lost. This is the primary engineering argument for RAID 6 on arrays with large-capacity drives.

Q: Can RAID parity protect against simultaneous SSD firmware failures?

No. XOR parity protects against individual drive failures, not correlated firmware panics that affect multiple drives at the same time. SSDs using identical controllers and firmware revisions can fail simultaneously if they share a firmware bug triggered by a specific write-cycle count or power-on hour threshold. The HPE SAS SSD 40,000-hour bug (firmware prior to HPD7) caused all drives in affected arrays to lock up at the same power-on interval. When two or more drives drop from a RAID 5 simultaneously, parity cannot reconstruct the missing data.

Q: What parity rotation layout does Linux md use by default, and why does it matter for recovery?

The Linux md driver defaults to left-symmetric parity layout, where the parity block moves one position toward drive 0 in each successive stripe and data blocks wrap sequentially. This layout maximizes large sequential read throughput by spreading data evenly across spindles. For data recovery, the layout determines the exact byte order of reassembled data. Data Extractor Express RAID Edition and R-Studio detect the layout heuristically by looking for filesystem signatures and parity-test patterns, but proprietary controller layouts (HP SmartArray delayed parity, Promise wide-pace Q rotation) require the recovery tool to deduce additional parameters such as the parity delay interval before reassembly will produce a mountable image.

Q: What is the RAID 5 write hole and how is it mitigated?

The write hole is silent corruption caused by the non-atomicity of partial-stripe updates. If a RAID 5 host is interrupted (power loss, kernel panic) after writing new data but before writing the matching parity, the parity block on disk no longer matches the data. If a drive subsequently fails and the array enters degraded mode, the rebuild reads the stale parity and reconstructs garbage. Hardware controllers close this hole with battery-backed or NV-DIMM write cache that replays uncommitted writes after a crash. Linux md offers three mitigations: write-intent bitmaps that speed resync but do not fully seal the hole, an external journal device (mdadm --write-journal) that acts as a write-ahead log, and the Partial Parity Log (PPL) which records the XOR of the stripe's unmodified chunks into the parity drive's metadata so the pre-write or post-write state can be reconstructed deterministically.

Q: How does ZFS RAID-Z avoid the write hole?

RAID-Z is parity computed at the filesystem layer rather than the block layer. ZFS uses copy-on-write: a write never overwrites existing data; it allocates new sectors, writes data plus parity for that specific transaction, then atomically updates the Uberblock to point at the new tree. If a power loss occurs mid-write, the Uberblock has not advanced, so the filesystem still references the prior consistent tree. RAID-Z also uses variable-width stripes sized to the logical record being written, which means every write is inherently a full-stripe write and the read-modify-write cycle that creates the write hole in block-layer RAID 5 does not occur.

Written by

Louis Rossmann

Founder & Chief Technician

Published March 8, 2026

Updated May 10, 2026

RAID parity is a mathematical technique that allows an array to survive drive failures without losing data. The core operation is XOR (exclusive OR): a bitwise function that compares bits from multiple data blocks and produces a parity block. If any one input is lost, it can be recalculated from the remaining inputs and the parity. RAID 5 uses single parity (one drive failure tolerance). RAID 6 uses dual parity (two drive failure tolerance). The math is straightforward, but the implementation details of stripe layout, parity distribution, and write handling determine how the array performs and how it fails.

How does XOR parity math work in RAID 5?

XOR operates on individual bits: if the input bits are the same, the output is 0; if they differ, the output is 1. XOR is its own inverse, so if you lose any single value, XORing the remaining values including the parity reproduces the missing one. RAID 5 uses this property to rebuild a failed drive's data from the survivors.

Bit A	Bit B	A XOR B
0	0	0
0	1	1
1	0	1
1	1	0

XOR is associative and commutative, which means it scales to any number of inputs: A XOR B XOR C XOR D = P. That property is what makes the parity block recalculable from the surviving data blocks across an arbitrarily wide stripe.

In a four-drive RAID 5 array, each stripe has three data blocks (D1, D2, D3) and one parity block (P). The parity block stores D1 XOR D2 XOR D3. If drive 2 fails, the controller reconstructs D2 by computing D1 XOR D3 XOR P. This calculation happens for every stripe across the entire array during a rebuild or during degraded-mode reads.

A concrete example with bytes: if D1 = 10110010, D2 = 01101001, and D3 = 11001100, then P = 10110010 XOR 01101001 XOR 11001100 = 00010111. If D2 is lost, D1 XOR D3 XOR P = 10110010 XOR 11001100 XOR 00010111 = 01101001. The original D2 is recovered exactly.
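
The byte example above can be checked with a few lines of Python. This is a minimal sketch: `xor_blocks` is a hypothetical helper name, and each "block" is a single byte for readability, whereas a real controller applies the same operation across whole chunks.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

d1 = bytes([0b10110010])
d2 = bytes([0b01101001])
d3 = bytes([0b11001100])

# Parity is the XOR of all data blocks in the stripe.
parity = xor_blocks(d1, d2, d3)

# Pretend the drive holding D2 failed: XOR the survivors with the parity.
rebuilt_d2 = xor_blocks(d1, d3, parity)

print(f"P  = {parity[0]:08b}")      # P  = 00010111
print(f"D2 = {rebuilt_d2[0]:08b}")  # D2 = 01101001
```

The same function rebuilds any single missing block, data or parity, because XOR does not care which operand is absent.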

What is the difference between distributed and dedicated parity?

RAID 3 and RAID 4 use a dedicated parity drive where one specific drive stores all parity blocks. Every write to any data drive requires a corresponding parity update on the parity drive, creating a bottleneck. RAID 5 solves this by distributing parity blocks across all drives in a rotating pattern.

In a four-drive RAID 5 array, the parity block rotates to a different drive for each stripe:

Stripe	Drive 0	Drive 1	Drive 2	Drive 3
0	D0	D1	D2	P
1	D4	D5	P	D3
2	D8	P	D6	D7
3	P	D9	D10	D11

The parity block rotates to a different drive in each stripe (left-symmetric layout shown above). This distributes write I/O evenly: no single drive is a bottleneck. The specific rotation pattern (left-symmetric, left-asymmetric, right-symmetric, right-asymmetric) varies by controller manufacturer and affects the order in which data and parity are laid out. During recovery, knowing the exact layout algorithm is necessary to reassemble the array correctly.

How do parity rotation algorithms and controller defaults differ?

The four standard RAID 5 layouts differ in two axes: the direction the parity block walks across stripes (left toward drive 0, or right toward drive N-1) and whether the data blocks restart at drive 0 each stripe (asymmetric) or wrap continuously around the parity block (symmetric).

Layout	Parity Direction	Data Block Order	Default Used By
Left-Symmetric	Walks toward drive 0	Wraps around parity	Linux md, most LSI/Adaptec
Left-Asymmetric	Walks toward drive 0	Restarts at drive 0	Some SNIA DDF controllers
Right-Symmetric	Walks toward drive N-1	Wraps around parity	Less common; some legacy units
Right-Asymmetric	Walks toward drive N-1	Restarts at drive 0	Less common; some legacy units

The Linux md (multiple device) driver defaults to left-symmetric because it produces the best large-sequential-read throughput by spreading the read load evenly across spindles. Most LSI MegaRAID, Adaptec, and 3ware controllers conform to the SNIA Common RAID Disk Drive Format (DDF), which standardizes layout descriptors so a degraded array can be imported into another DDF-compliant controller for recovery.
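
The two axes can be expressed as a small mapping function. The sketch below is illustrative rather than any controller's actual code: `raid5_row` and its parameters are hypothetical names, stripe and drive numbering start at 0, and the rules simply encode the "walks toward drive 0 / drive N-1" and "restarts / wraps" descriptions given here.

```python
def raid5_row(stripe: int, n_drives: int, direction: str, symmetric: bool) -> list[str]:
    """Return one stripe as a list of labels ('D<k>' or 'P'), indexed by drive."""
    data_per_stripe = n_drives - 1

    # Parity direction: "left" walks toward drive 0, "right" toward drive N-1.
    if direction == "left":
        parity = (n_drives - 1) - (stripe % n_drives)
    else:
        parity = stripe % n_drives

    row = ["P"] * n_drives
    for i in range(data_per_stripe):
        if symmetric:
            # Data continues on the drive after parity and wraps around.
            drive = (parity + 1 + i) % n_drives
        else:
            # Data restarts at drive 0 and skips over the parity drive.
            drive = i if i < parity else i + 1
        row[drive] = f"D{stripe * data_per_stripe + i}"
    return row

for stripe in range(4):
    print(stripe, raid5_row(stripe, 4, "left", symmetric=True))
```

A recovery tool that guesses the wrong combination still finds parity in plausible places for some layouts, which is why filesystem-level validation of the assembled image matters more than a parity check alone.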

Several enterprise controllers ship proprietary rotation patterns that DDF-aware tools will not import cleanly. HP SmartArray controllers use a "delayed parity" layout where the parity block does not advance every stripe; instead it stays on the same drive for a configurable run of consecutive stripes (commonly 16 or 32) before moving. Recovery tools must deduce the delay interval and the first-delay offset before they can reassemble a coherent image. Promise controllers running RAID 6 use a "wide pace" layout for the Q syndrome that shifts more than one column per stripe.

The practical consequence: when an array arrives for RAID data recovery without surviving controller metadata, the recovery technician must determine drive order, chunk size, parity direction, parity rotation pattern, and (for HP/Promise hardware) the delay or pace parameters before the data can be reassembled. Data Extractor Express RAID Edition and R-Studio do this heuristically by scanning each drive for filesystem signatures, parity-test patterns, and known constants like NTFS MFT entries, then iterating layout permutations until the assembled image yields valid filesystem metadata.

How does RAID 6 dual parity work?

RAID 6 adds a second parity block per stripe, labeled Q. The P block uses standard XOR, identical to RAID 5. The Q block uses Galois field arithmetic, GF(2^8), where each data block is multiplied by a different coefficient, making P and Q mathematically independent. Two simultaneous failed drives can be solved using two independent equations.

This two-failure tolerance matters increasingly with large-capacity drives (8 TB, 16 TB, 20 TB+) because the probability of an unrecoverable read error (URE) during rebuild is high enough that a second failure during a RAID 5 rebuild is a realistic scenario, not a theoretical one.

RAID 6 requires a minimum of four drives (two data, two parity). Usable capacity is (N-2) drives. Write performance is lower than RAID 5 because every data write requires updating both P and Q parity blocks. Hardware RAID controllers with dedicated XOR engines and battery-backed cache mitigate this penalty.

Q Syndrome Reed-Solomon Math Inside RAID 6

The P syndrome is plain XOR across all data blocks in a stripe. The Q syndrome is a Reed-Solomon code computed inside the Galois field GF(2⁸).

Linux mdadm and the in-kernel lib/raid6 module use the irreducible generator polynomial x⁸ + x⁴ + x³ + x² + 1, written in hexadecimal as 0x11D (a nine-bit value whose low byte is 0x1D), to keep the arithmetic confined to 8 bits. The primitive element g is the byte 0x02. Multiplying any byte by g is a one-bit left shift; if the high bit was set before the shift, the nine-bit result is XORed with 0x11D (equivalently, the truncated byte is XORed with 0x1D) to fold it back into the field.

The Q value for a stripe is the XOR of each data block multiplied by a successive power of g: Q = (g⁰ · D₀) XOR (g¹ · D₁) XOR (g² · D₂) XOR ... XOR (g^(N-1) · D_(N-1)), where N is the number of data blocks in the stripe. The exponent is the drive index, which is what makes P and Q mathematically independent and gives RAID 6 its two-failure tolerance. Because each data block is multiplied by a distinct coefficient, two simultaneous unknowns produce a two-equation, two-unknown system that can always be solved by Galois field matrix inversion.
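
The P and Q formulas can be sketched in plain Python using the polynomial 0x11D and generator 0x02 described above. The function names (`gf_mul2`, `gf_mul`, `gf_pow2`, `pq_syndromes`) are hypothetical, and one byte stands in for each data block; production code uses lookup tables or SIMD rather than this bit-by-bit loop.

```python
def gf_mul2(x: int) -> int:
    """Multiply a GF(2^8) element by g = 0x02, reducing by 0x11D on overflow."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = gf_mul2(a)
        b >>= 1
    return result

def gf_pow2(exponent: int) -> int:
    """Return g**exponent for g = 0x02."""
    value = 1
    for _ in range(exponent):
        value = gf_mul2(value)
    return value

def pq_syndromes(data: list[int]) -> tuple[int, int]:
    """Return (P, Q) for one byte position across the data drives."""
    p = q = 0
    for index, byte in enumerate(data):
        p ^= byte                           # P: plain XOR
        q ^= gf_mul(gf_pow2(index), byte)   # Q: XOR of g^index * D_index
    return p, q

p, q = pq_syndromes([0xB2, 0x69, 0xCC])
print(f"P = 0x{p:02X}, Q = 0x{q:02X}")  # P = 0x17, Q = 0x77
```

The three input bytes are the same D1, D2, D3 values used in the RAID 5 example, so P matches the earlier result and only Q is new.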

Hardware RAID controllers offload this arithmetic to ASIC linear-feedback shift registers or precomputed multiplication tables. On modern x86, software implementations vectorize the field multiplications with AVX2/AVX-512, processing 32 or 64 bytes at a time: the Linux kernel's lib/raid6 ships its own SIMD routines, and userspace libraries such as Intel ISA-L provide comparable ones. The ARM equivalent uses NEON. Without these accelerations, software RAID 6 parity calculation would be roughly an order of magnitude slower than RAID 5.

Two-Failure Recovery Cases in RAID 6

RAID 6 handles four distinct failure topologies: (a) two data drives lost, (b) one data drive plus the P drive lost, (c) one data drive plus the Q drive lost, or (d) both the P and Q drives lost. Case (d) is trivial: the surviving data is intact and P and Q are recalculated from scratch. Case (c) is also straightforward: the surviving data plus P rebuild the missing data drive via XOR, then Q is recalculated.

Case (b) requires Galois field division, since P is gone and the missing data drive must be solved from the Q equation by multiplying through by the inverse of g raised to that drive's index. Case (a) is the hardest: both P and Q equations are needed, and the two missing data values are extracted by solving a 2x2 linear system over GF(2⁸).
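
Cases (a) and (b) can be sketched as follows, again with one byte per block. All function names are hypothetical, the inverse is computed by brute-force exponentiation for clarity, and the drive indices are assumed to be the same ones used as exponents when Q was generated.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements, reducing by the polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_pow2(exponent: int) -> int:
    """Return g**exponent for the generator g = 0x02."""
    value = 1
    for _ in range(exponent):
        value = gf_mul(value, 2)
    return value

def gf_inv(a: int) -> int:
    """Multiplicative inverse: a**254, because a**255 == 1 for nonzero a."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def syndromes(data):
    """Return (P, Q) over a list of data bytes."""
    p = q = 0
    for i, byte in enumerate(data):
        p ^= byte
        q ^= gf_mul(gf_pow2(i), byte)
    return p, q

def recover_two_data(data, x, y, p, q):
    """Case (a): data[x] and data[y] are unknown; P and Q survive."""
    known = [0 if i in (x, y) else b for i, b in enumerate(data)]
    p_known, q_known = syndromes(known)
    p_xy = p ^ p_known   # equals Dx ^ Dy
    q_xy = q ^ q_known   # equals g^x*Dx ^ g^y*Dy
    gx, gy = gf_pow2(x), gf_pow2(y)
    # Eliminate Dy: g^y*p_xy ^ q_xy = (g^x ^ g^y) * Dx
    dx = gf_mul(gf_mul(gy, p_xy) ^ q_xy, gf_inv(gx ^ gy))
    return dx, p_xy ^ dx

def recover_data_without_p(data, x, q):
    """Case (b): data[x] and P are lost; solve the Q equation alone."""
    known = [0 if i == x else b for i, b in enumerate(data)]
    _, q_known = syndromes(known)
    return gf_mul(q ^ q_known, gf_inv(gf_pow2(x)))

original = [0xB2, 0x69, 0xCC, 0x3E]
p, q = syndromes(original)
print(recover_two_data([0xB2, None, 0xCC, None], 1, 3, p, q))  # (105, 62)
```

The printed tuple is (0x69, 0x3E) in decimal, the two bytes that were blanked out.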

How do stripe size and chunk size affect a RAID array?

The chunk size (also called strip size) is the amount of contiguous data written to a single drive before moving to the next drive in the array. Common values are 64 KB, 128 KB, 256 KB, and 512 KB. A stripe is the set of chunks across all drives at the same address offset, including the parity chunk(s).

Chunk size affects performance. Small chunks (64 KB) spread each I/O across more drives, improving throughput for large sequential reads. Large chunks (512 KB) keep individual I/O operations on a single drive, improving random I/O performance by reducing cross-drive coordination.

During RAID recovery, the chunk size must be known exactly. If a recovery tool assembles the array with the wrong chunk size, the data interleaving is incorrect and the resulting image will be garbled. Data Extractor Express RAID Edition analyzes the raw data on each drive to detect the correct chunk size, parity rotation direction, and drive order automatically when the RAID controller metadata is damaged or unavailable.

What is the RAID write penalty?

Every small (partial-stripe) data write in a parity RAID requires reading the old data, reading the old parity, calculating new parity, writing new data, and writing new parity. This is the read-modify-write cycle. RAID 5 has a write penalty of 4 (four I/O operations per logical write). RAID 6 has a write penalty of 6, since two parity blocks must be updated instead of one.

RAID Level	Write Penalty	Drive Failures Tolerated	Usable Capacity
RAID 0	1	0	N drives
RAID 1	2	1 (per mirror pair)	N/2 drives
RAID 5	4	1	N-1 drives
RAID 6	6	2	N-2 drives
RAID 10	2	1 (per mirror pair)	N/2 drives

Hardware RAID controllers with battery-backed write cache (BBU/BBM) absorb the write penalty by caching writes in DRAM and flushing them to drives in optimized batches. If the BBU fails or the cache policy is set to write-through, the full write penalty applies and write latency rises sharply. Dell PERC controllers, HP SmartArray, and LSI MegaRAID all implement this caching strategy.

The penalty figures above describe a partial-stripe write: the operating system updates a single block inside an existing stripe, so the controller must read the old data and old parity, calculate new parity, then write both back. Counted as I/O operations, that is two reads plus two writes for RAID 5 (penalty 4), and three reads plus three writes for RAID 6 (penalty 6, since the controller has to read the old data, old P, and old Q before writing the new data, new P, and new Q).
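
The reason the controller can update parity from only the old data and old parity, without touching the other drives, is that XOR cancels: new P = old P XOR old data XOR new data. A minimal sketch for RAID 5, with the hypothetical helper name `rmw_parity` and one-byte blocks:

```python
def rmw_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-modify-write parity update: remove old data from P, add new data."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

d1, d2, d3 = b"\xb2", b"\x69", b"\xcc"
parity = bytes(a ^ b ^ c for a, b, c in zip(d1, d2, d3))

# Overwrite D2 only. Two reads (old D2, old P) and two writes (new D2, new P).
new_d2 = b"\x0f"
new_parity = rmw_parity(parity, d2, new_d2)

# Same answer as recomputing parity from every data block in the stripe.
full_recompute = bytes(a ^ b ^ c for a, b, c in zip(d1, new_d2, d3))
print(new_parity == full_recompute)  # True
```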

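When the operating system writes enough contiguous data to fill an entire stripe, the controller skips the preliminary reads entirely, computes the parity from the new data already in cache, and writes the complete stripe in a single transaction. This optimization is called a full-stripe write, or stripe coalescing when the controller assembles the stripe from several smaller writes. It is related to, but not the same as, a reconstruct write, in which the controller reads the stripe's unmodified data chunks (rather than the old data and old parity) and recomputes parity from scratch.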
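
Whether a given write qualifies depends on its offset and length relative to the stripe's data width. The sketch below is a simplified model with hypothetical names (`split_write` and its parameters); it assumes byte offsets into the array's logical address space and ignores any caching that might merge neighbouring writes.

```python
def split_write(offset: int, length: int, n_drives: int, chunk: int,
                parity_chunks: int = 1) -> tuple[int, int]:
    """Return (full_stripes, partial_bytes) covered by one logical write."""
    stripe_data = (n_drives - parity_chunks) * chunk  # data bytes per stripe
    end = offset + length
    first_full = -(-offset // stripe_data)  # first stripe boundary at or after offset
    last_full = end // stripe_data          # last stripe boundary at or before end
    full = max(0, last_full - first_full)
    return full, length - full * stripe_data

KIB = 1024
# Four-drive RAID 5, 64 KiB chunks: 192 KiB of data per stripe.
print(split_write(0, 384 * KIB, n_drives=4, chunk=64 * KIB))        # (2, 0)
print(split_write(4 * KIB, 192 * KIB, n_drives=4, chunk=64 * KIB))  # (0, 196608)
```

The second call writes a full stripe's worth of data but starts 4 KiB past a stripe boundary, so it lands as two partial-stripe updates and pays the read-modify-write penalty on both.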

Linux md Stripe Cache and Write Coalescing

Linux software RAID exposes the buffer that holds in-flight stripes at /sys/block/mdX/md/stripe_cache_size. The default is 256 pages per disk, sized for low memory consumption rather than throughput. Sequential write workloads (LUKS-encrypted volumes, large file transfers, database checkpoints) frequently arrive in chunks too small to fill a stripe individually but large enough in aggregate to coalesce. With a small stripe cache, md flushes them as partial-stripe writes and pays the read-modify-write penalty on every flush; with a larger cache, md holds the partial writes in RAM long enough to assemble full stripes and bypass the penalty.

Storage administrators commonly raise stripe_cache_size to 4096, 8192, or 32768 pages on parity arrays, trading several hundred megabytes of RAM per array for multi-x throughput improvements on bursty sequential writes. The setting is per-array and survives reboots only if written through a startup hook. There is no equivalent knob on hardware RAID controllers; the controller's own DRAM cache plays the same role and its policy (write-back vs write-through, BBU charge state) governs whether coalescing happens at all.
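
The RAM cost can be estimated from the formula in the kernel's md documentation: page size times number of member disks times stripe_cache_size. The helper below is a sketch with a hypothetical name, and it assumes a 4 KiB page size and a six-disk array purely for illustration.

```python
def stripe_cache_bytes(stripe_cache_size: int, n_disks: int, page_size: int = 4096) -> int:
    """Approximate RAM used by the md stripe cache for one array."""
    return stripe_cache_size * n_disks * page_size

for size in (256, 4096, 8192, 32768):
    mib = stripe_cache_bytes(size, n_disks=6) // 2**20
    print(f"stripe_cache_size={size:>5} -> {mib} MiB")
# stripe_cache_size=  256 -> 6 MiB
# stripe_cache_size= 4096 -> 96 MiB
# stripe_cache_size= 8192 -> 192 MiB
# stripe_cache_size=32768 -> 768 MiB
```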

Parity protects against drive failure, not data corruption.

RAID parity recalculates missing data from failed drives, but it does not detect or correct silent data corruption. If a drive returns incorrect data without reporting an error (a bit flip in DRAM, a firmware bug, or a media defect below the drive's error threshold), the parity system will incorporate the corrupted data into parity calculations without warning. Only checksumming filesystems like ZFS or Btrfs detect this type of corruption.

Why do unrecoverable read errors cause RAID rebuild failures?

Consumer SATA drives carry a worst-case Unrecoverable Bit Error Rate of 1 error per 10¹⁴ bits read, roughly one bad sector per 12.5 TB of data. Enterprise SAS and nearline drives are rated 1 in 10¹⁵, about 125 TB per expected URE. During a degraded RAID rebuild, the controller reads every sector on every surviving drive; on modern 16-20 TB drives, hitting a latent URE before the rebuild completes is a practical concern.
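
To see why the rating matters, the spec can be turned into a rough probability. The sketch below makes a strong simplifying assumption: it treats the UBER figure as an actual, independent per-bit error rate, when it is a worst-case rating and real errors cluster. The function name is hypothetical, and the result is an upper-bound illustration, not a prediction.

```python
import math

def p_at_least_one_ure(bytes_read: float, uber: float) -> float:
    """P(at least one URE) = 1 - (1 - uber)^bits, computed stably."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-uber))

TB = 1e12
# Four-drive RAID 5 of 16 TB drives: a rebuild reads the three survivors.
survivor_bytes = 3 * 16 * TB

print(f"1e-14 rating: {p_at_least_one_ure(survivor_bytes, 1e-14):.1%}")
print(f"1e-15 rating: {p_at_least_one_ure(survivor_bytes, 1e-15):.1%}")
```

Under this model the rebuild reads 3.84 × 10¹⁴ bits, several times the 10¹⁴-bit rating, so the worst-case figure comes out well above 90%; drives in the field usually do better than their rating.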

What happens on a URE depends on the controller. Dell PERC and LSI/Broadcom MegaRAID "puncture" the affected stripe: they mark that specific LBA range as unrecoverable and continue rebuilding the rest of the array. The data in the punctured stripe is lost, but the server comes back online. Linux md records the unreadable LBA in its Bad Block Log and continues, the md analog of hardware puncture, so only the affected blocks are lost rather than the whole array. Legacy block-level and low-end consumer controllers, and HP/HPE Smart Array P-series and E-series, instead abort the rebuild and drop the volume offline (HP flags POST Error 1784 or 1786), which is the failed state that requires professional RAID data recovery.

This is the engineering reason RAID 6 matters on large-capacity arrays. A second parity block doesn't just protect against a second drive dying; it provides a mathematical fallback when a surviving drive can't deliver clean reads during reconstruction.

What is the RAID 5 write hole?

The write hole is silent corruption created by the non-atomicity of partial-stripe updates. If the host loses power between the data write and the parity write, the parity block on disk no longer matches the data block. A subsequent drive failure causes the rebuild to read stale parity and reconstruct garbage where the missing data block used to be.

A logical write to a single block forces the controller to perform a sequence: read old data, read old parity, compute new parity, write new data, write new parity. The array is consistent enough to serve reads in normal operation, because the controller reads from data drives directly. The corruption only surfaces later, when a degraded array has to rely on that stale parity for reconstruction and bakes the wrong bytes permanently into the rebuilt drive.
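
The sequence can be simulated in a few lines. This is a toy model with four-byte chunks and a hypothetical `xor_blocks` helper; it shows that the block damaged by the write hole is not the one that was being written, but an unrelated block that later has to be rebuilt.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A consistent three-data-chunk stripe and its parity.
chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*chunks)

# Torn write: new data for chunk 1 reaches disk, then power is lost
# before the matching parity write. Parity on disk is now stale.
chunks[1] = b"bbbb"

# Normal reads still look fine because they never consult parity.
# Later the drive holding chunk 0 fails and is rebuilt from the survivors.
rebuilt_chunk0 = xor_blocks(chunks[1], chunks[2], parity)

print(rebuilt_chunk0)  # b'aaaa'
```

Chunk 0 was never written to, yet the rebuild returns b'aaaa' instead of b'AAAA', with no error reported.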

Hardware RAID controllers close this hole with a battery-backed (BBU/BBM) or NV-DIMM-backed write cache. The intended writes are logged in non-volatile cache before the disks ever see them; if a crash occurs partway through, the controller replays the log on restart and finishes the stripe atomically. When the BBU has discharged or is in a learning cycle, enterprise controllers automatically downgrade the cache policy to write-through, which stops the controller from acknowledging writes held only in unprotected DRAM, at the cost of full-penalty write latency.

Linux md Mitigations: Bitmap, Journal, and Partial Parity Log

Linux software RAID has no battery, so the kernel implements three increasingly thorough mitigations:

Write-intent bitmap. A small bitmap records which stripes have writes in flight. After a crash, only the dirty stripes need to be resynced instead of the entire array. The bitmap accelerates recovery but does not prevent the underlying torn-stripe condition; it just narrows the search space for the resync.
Journal device (mdadm --write-journal). An external fast device, typically an NVMe SSD, receives a write-ahead log of every parity update. After a crash the kernel replays the journal before bringing the array online. This closes the hole but requires dedicating a fast, reliable device to the array.
Partial Parity Log (PPL). Introduced for the case where a separate journal device is impractical. Before a partial-stripe write, md computes the XOR of the stripe's unmodified chunks and stamps that partial parity into a reserved metadata region on the parity drive. If a crash interrupts the write, recovery uses the logged partial parity to deterministically reconstruct either the pre-write or post-write state without leaving stale parity on disk.
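
The partial parity idea from the last item can be illustrated with the same toy stripe model. This is a simplified sketch of the principle only, not md's on-disk format or recovery code: the helper name is hypothetical, chunks are four bytes, and the "log" is just a Python variable.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

chunks = [b"AAAA", b"BBBB", b"CCCC"]
modified = {1}  # the write will touch chunk 1 only

# Before writing, log the XOR of the chunks this write does NOT modify.
partial_parity = xor_blocks(*(c for i, c in enumerate(chunks) if i not in modified))

# Crash: the new data lands on chunk 1, the parity write is lost.
chunks[1] = b"bbbb"

# Recovery: partial parity XOR whatever is now on the modified chunks
# yields parity that matches the stripe as it actually exists on disk.
repaired_parity = xor_blocks(partial_parity, *(chunks[i] for i in sorted(modified)))

# If the drive holding chunk 0 fails afterwards, the rebuild is correct.
print(xor_blocks(chunks[1], chunks[2], repaired_parity))  # b'AAAA'
```

The repaired parity is valid whether or not the new data reached the disk before the crash, because it is derived from the chunks' current contents rather than from the interrupted parity write.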

How does ZFS RAID-Z handle parity at the filesystem layer?

Hardware controllers and Linux md operate at the block layer and see logical block addresses, not files. ZFS integrates the volume manager and the filesystem in one stack, and ZFS RAID-Z computes parity at the filesystem layer instead. Z1, Z2, and Z3 designate one, two, or three parity blocks per stripe, respectively.

Two structural differences fall out of that:

Variable-width stripes. Block-layer RAID 5 has fixed stripe geometry: if the stripe width is 4 drives and the chunk size is 64 KB, every stripe is 192 KB of data plus a 64 KB parity chunk regardless of the size of the logical write. ZFS allocates sectors per transaction sized to the actual record being written (governed by recordsize and the disk's ashift). A 4 KB write with RAID-Z1 uses one 4 KB data sector plus one 4 KB parity sector and leaves the rest of the stripe alone. Every write is a full-stripe write by definition, so the read-modify-write cycle that creates the partial-write penalty in block-layer RAID does not exist.

Copy-on-write closes the write hole. ZFS never overwrites live data. A write allocates new sectors elsewhere on the pool, commits data and parity atomically as a new transaction, then advances the Uberblock pointer to the new tree. If a power loss occurs after the write but before the Uberblock advance, the pool still references the previous consistent tree on the next mount. There is no stripe to be torn between data and parity because parity is always written as part of the same transaction as its data, and the visibility of the entire transaction is gated by a single atomic pointer flip.

The trade-offs are not free. RAID-Z requires the full ZFS stack with its memory and CPU overhead, the filesystem cannot be migrated off ZFS without destroying the pool, and a rebuild (called a resilver) reads only allocated blocks rather than the entire device, which is faster on lightly used pools but offers no advantage on a pool that is mostly full. For recovery, RAID-Z pools require ZFS-aware tools that walk the pool's block pointer tree from the Uberblock; generic block-layer RAID parity calculators cannot reassemble RAID-Z because the stripe geometry is variable per record.

Frequently Asked Questions

How does RAID 5 reconstruct data from a failed drive?

The controller reads data blocks and parity blocks from the surviving drives for each stripe. Because XOR is reversible, any single missing value can be recalculated by XORing the remaining values. This works for one failed drive; a second failure during rebuild causes data loss.

What is the difference between RAID 5 and RAID 6?

RAID 5 uses one parity block per stripe (XOR) and survives one drive failure. RAID 6 uses two independent parity blocks (XOR + Galois field arithmetic) and survives two simultaneous failures. RAID 6 requires a minimum of four drives and has higher write overhead.

What happens if a RAID hits an Unrecoverable Read Error during rebuild?

Consumer SATA drives carry a worst-case UBER of 1 error per 10¹⁴ bits (roughly 12.5 TB), a warranty floor rather than a schedule. On a full-array rebuild of 16-20 TB drives, that worst-case bound raises the probability of hitting a latent bad sector as array size and drive age climb. Dell PERC and LSI/Broadcom MegaRAID can puncture the affected stripe and continue rebuilding, and Linux md logs the bad LBA in its Bad Block Log and continues; the data in that stripe or block is lost but the rest of the array survives. Legacy and low-end controllers, and HP/HPE Smart Array P-series and E-series, abort the rebuild instead, leaving the array in a degraded or failed state requiring professional recovery.

Can RAID parity protect against simultaneous SSD firmware failures?

No. Parity guards against individual drive failures, not correlated firmware panics across multiple drives. SSDs sharing identical controllers and firmware revisions can fail simultaneously when a firmware bug triggers at a specific power-on hour or write-cycle threshold. The HPE SAS SSD 40,000-hour bug (firmware before HPD7) locked up all drives in affected arrays at the same interval. When two or more drives drop from a RAID 5 at once, XOR parity can't reconstruct the missing data; the array needs RAID data recovery.

What parity rotation layout does Linux md use by default, and why does it matter for recovery?

Linux md defaults to left-symmetric: parity walks one drive toward drive 0 each stripe and data wraps continuously around the parity block. The layout maximizes large-sequential-read throughput. For recovery, the layout determines the byte order when the array is reassembled. Data Extractor Express RAID Edition and R-Studio detect it heuristically by scanning each drive for filesystem signatures and parity-test patterns, but proprietary layouts like HP SmartArray delayed parity or Promise wide-pace require deducing additional parameters (delay interval, first-delay offset) before reassembly produces a mountable image.

What is the RAID 5 write hole and how is it mitigated?

The write hole is silent corruption from the non-atomicity of partial-stripe updates. If power is lost between writing new data and writing matching parity, the on-disk parity is stale; a subsequent drive failure will rebuild garbage. Hardware RAID closes the hole with battery-backed or NV-DIMM cache that replays uncommitted writes after a crash. Linux md offers three mitigations: write-intent bitmaps (speed resync, do not fully seal the hole), an external journal device via mdadm --write-journal, and the Partial Parity Log (PPL) that records the XOR of unmodified stripe chunks into the parity drive's metadata so the pre-write or post-write state can be reconstructed deterministically.

How does ZFS RAID-Z avoid the write hole?

RAID-Z computes parity at the filesystem layer rather than the block layer, and ZFS uses copy-on-write: a write never overwrites existing data. ZFS allocates new sectors, commits data plus parity together as a new transaction, then atomically advances the Uberblock pointer to the new tree. A power loss before the Uberblock advance leaves the pool referencing the prior consistent tree. RAID-Z also uses variable-width stripes sized to the logical record, so every write is inherently a full-stripe write and the read-modify-write cycle behind the block-layer write hole does not occur.
