Technical Reference

How RAID Parity Actually Works

Written by
Louis Rossmann
Founder & Chief Technician
Published March 8, 2026
Updated May 10, 2026

RAID parity is a mathematical technique that allows an array to survive drive failures without losing data. The core operation is XOR (exclusive OR): a bitwise function that compares bits from multiple data blocks and produces a parity block. If any one input is lost, it can be recalculated from the remaining inputs and the parity. RAID 5 uses single parity (one drive failure tolerance). RAID 6 uses dual parity (two drive failure tolerance). The math is straightforward, but the implementation details of stripe layout, parity distribution, and write handling determine how the array performs and how it fails.

XOR Parity Math Behind RAID 5

XOR operates on individual bits. The rule: if the input bits are the same, the output is 0. If they differ, the output is 1.

| Bit A | Bit B | A XOR B |
|-------|-------|---------|
| 0     | 0     | 0       |
| 0     | 1     | 1       |
| 1     | 0     | 1       |
| 1     | 1     | 0       |

XOR is associative and commutative, which means it scales to any number of inputs: A XOR B XOR C XOR D = P. More importantly, XOR is its own inverse. If you lose any single value, XORing the remaining values (including the parity) reproduces the missing one.

In a four-drive RAID 5 array, each stripe has three data blocks (D1, D2, D3) and one parity block (P). The parity block stores D1 XOR D2 XOR D3. If drive 2 fails, the controller reconstructs D2 by computing D1 XOR D3 XOR P. This calculation happens for every stripe across the entire array during a rebuild or during degraded-mode reads.

A concrete example with bytes: if D1 = 10110010, D2 = 01101001, and D3 = 11001100, then P = 10110010 XOR 01101001 XOR 11001100 = 00010111. If D2 is lost, D1 XOR D3 XOR P = 10110010 XOR 11001100 XOR 00010111 = 01101001. The original D2 is recovered exactly.
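The byte example above can be checked with a few lines of Python. This is a minimal sketch; the helper name `xor_blocks` is made up for illustration and is not part of any RAID tool.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# The three data bytes from the worked example.
d1 = bytes([0b10110010])
d2 = bytes([0b01101001])
d3 = bytes([0b11001100])

# Parity is the XOR of all data blocks in the stripe.
parity = xor_blocks(d1, d2, d3)

# Pretend D2 is lost: XOR the survivors with the parity to get it back.
recovered_d2 = xor_blocks(d1, d3, parity)

print(f"P  = {parity[0]:08b}")        # P  = 00010111
print(f"D2 = {recovered_d2[0]:08b}")  # D2 = 01101001
```

The same function works unchanged on full-size chunks, because XOR treats every byte position independently.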

Distributed Parity vs Dedicated Parity

RAID 3 and RAID 4 use a dedicated parity drive: one specific drive stores all parity blocks. Every write to any data drive requires a corresponding parity update on the parity drive. This creates a bottleneck: the parity drive handles write I/O for every data operation across the array, limiting write throughput.

RAID 5 solves this by distributing parity blocks across all drives in a rotating pattern. In a four-drive array:

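| Stripe | Drive 0 | Drive 1 | Drive 2 | Drive 3 |
|--------|---------|---------|---------|---------|
| 0      | D0      | D1      | D2      | P       |
| 1      | D3      | D4      | P       | D5      |
| 2      | D6      | P       | D7      | D8      |
| 3      | P       | D9      | D10     | D11     |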

The parity block rotates to a different drive in each stripe. The table above shows the left-asymmetric layout: parity walks toward drive 0, and the data blocks fill the remaining drives in order, starting again at drive 0 in every stripe. This distributes write I/O evenly: no single drive is a bottleneck. The specific rotation pattern (left-symmetric, left-asymmetric, right-symmetric, right-asymmetric) varies by controller manufacturer and affects the order in which data and parity are laid out. During recovery, knowing the exact layout algorithm is necessary to reassemble the array correctly.

Parity Rotation Algorithms and Controller Defaults

The four standard RAID 5 layouts differ in two axes: the direction the parity block walks across stripes (left toward drive 0, or right toward drive N-1) and whether the data blocks restart at drive 0 each stripe (asymmetric) or wrap continuously around the parity block (symmetric).

| Layout           | Parity Direction       | Data Block Order    | Default Used By                |
|------------------|------------------------|---------------------|--------------------------------|
| Left-Symmetric   | Walks toward drive 0   | Wraps around parity | Linux md, most LSI/Adaptec     |
| Left-Asymmetric  | Walks toward drive 0   | Restarts at drive 0 | Some SNIA DDF controllers      |
| Right-Symmetric  | Walks toward drive N-1 | Wraps around parity | Less common; some legacy units |
| Right-Asymmetric | Walks toward drive N-1 | Restarts at drive 0 | Less common; some legacy units |
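The two axes described above (parity direction and data block order) can be expressed as a small mapping function. The sketch below is illustrative only: the function name and the layout strings are hypothetical, and it models just the four standard rotations, not proprietary ones.

```python
def stripe_layout(stripe: int, n_drives: int, layout: str) -> list[str]:
    """Return the block label stored on each drive for one stripe.

    layout is one of: left-symmetric, left-asymmetric,
    right-symmetric, right-asymmetric.
    """
    direction, order = layout.split("-")
    if direction not in ("left", "right") or order not in ("symmetric", "asymmetric"):
        raise ValueError(f"unknown layout: {layout}")

    # Parity direction: "left" starts on the last drive and walks toward drive 0,
    # "right" starts on drive 0 and walks toward drive N-1.
    position = stripe % n_drives
    parity_drive = (n_drives - 1 - position) if direction == "left" else position

    row = [""] * n_drives
    row[parity_drive] = "P"
    first_block = stripe * (n_drives - 1)
    for k in range(n_drives - 1):
        if order == "asymmetric":
            # Data restarts at drive 0 and simply skips the parity drive.
            drive = k if k < parity_drive else k + 1
        else:
            # Data starts on the drive after parity and wraps around.
            drive = (parity_drive + 1 + k) % n_drives
        row[drive] = f"D{first_block + k}"
    return row

for name in ("left-asymmetric", "left-symmetric"):
    print(name)
    for s in range(4):
        print("  ", stripe_layout(s, 4, name))
```

Comparing the two printed layouts shows why a wrong guess matters during reassembly: parity sits on the same drive in both, but the data blocks within each stripe land on different drives.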

The Linux md (multiple device) driver defaults to left-symmetric because it produces the best large-sequential-read throughput by spreading the read load evenly across spindles. Most LSI MegaRAID, Adaptec, and 3ware controllers conform to the SNIA Common RAID Disk Drive Format (DDF), which standardizes layout descriptors so a degraded array can be imported into another DDF-compliant controller for recovery.

Several enterprise controllers ship proprietary rotation patterns that DDF-aware tools will not import cleanly. HP SmartArray controllers use a "delayed parity" layout where the parity block does not advance every stripe; instead it stays on the same drive for a configurable run of consecutive stripes (commonly 16 or 32) before moving. Recovery tools must deduce the delay interval and the first-delay offset before they can reassemble a coherent image. Promise controllers running RAID 6 use a "wide pace" layout for the Q syndrome that shifts more than one column per stripe.

The practical consequence: when an array arrives for RAID data recovery without surviving controller metadata, the recovery technician must determine drive order, chunk size, parity direction, parity rotation pattern, and (for HP/Promise hardware) the delay or pace parameters before the data can be reassembled. PC-3000 RAID Edition and R-Studio do this heuristically by scanning each drive for filesystem signatures, parity-test patterns, and known constants like NTFS MFT entries, then iterating layout permutations until the assembled image yields valid filesystem metadata.
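One simple form of parity test follows directly from the XOR math: in a healthy RAID 5 set, the bytes at the same offset on every member XOR to zero, whichever drive holds parity for that stripe. The sketch below is not how PC-3000 or R-Studio work internally. It is a hypothetical helper that assumes all images start at the same data offset.

```python
import os

def parity_mismatch_ratio(drives: list[bytes], sector: int = 512) -> float:
    """Fraction of sectors whose XOR across all member images is not all-zero."""
    length = min(len(d) for d in drives)
    total = length // sector
    if total == 0:
        raise ValueError("images are shorter than one sector")
    mismatches = 0
    for start in range(0, total * sector, sector):
        acc = 0
        for d in drives:
            # Treat each sector as one big integer so XOR is a single operation.
            acc ^= int.from_bytes(d[start:start + sector], "big")
        if acc:
            mismatches += 1
    return mismatches / total

# Toy three-member set: two random "data" images plus their XOR as parity.
a = os.urandom(4096)
b = os.urandom(4096)
p = bytes(x ^ y for x, y in zip(a, b))

print(parity_mismatch_ratio([a, b, p]))  # 0.0 for a consistent set
```

A ratio near zero suggests the images belong to the same single-parity set; a ratio near one suggests a wrong member, a wrong start offset, or a non-RAID-5 level.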

RAID 6 Dual Parity

RAID 6 adds a second parity block to each stripe, labeled Q. The P block is calculated with standard XOR, identical to RAID 5. The Q block uses a different mathematical function based on Galois field arithmetic (GF(2^8)). Each data block is multiplied by a different coefficient in the Galois field before being XORed together. This makes P and Q mathematically independent: two simultaneous unknowns (two failed drives) can be solved using two independent equations.

The practical effect: RAID 6 tolerates two simultaneous drive failures. This matters increasingly with large-capacity drives (8 TB, 16 TB, 20 TB+) because the probability of an unrecoverable read error (URE) during rebuild is high enough that a second failure during a RAID 5 rebuild is a realistic scenario, not a theoretical one.

RAID 6 requires a minimum of four drives (two data, two parity). Usable capacity is (N-2) drives. Write performance is lower than RAID 5 because every data write requires updating both P and Q parity blocks. Hardware RAID controllers with dedicated XOR engines and battery-backed cache mitigate this penalty.

Q Syndrome Reed-Solomon Math Inside RAID 6

The P syndrome is plain XOR across all data blocks in a stripe. The Q syndrome is a Reed-Solomon code computed inside the Galois field GF(2^8). Linux mdadm and the in-kernel lib/raid6 module use the irreducible polynomial x^8 + x^4 + x^3 + x^2 + 1, written as the 9-bit value 0x11D, to keep the arithmetic confined to 8 bits. The primitive element g is the byte 0x02. Multiplying any byte by g is equivalent to a bitwise left shift by one; if the high bit was set before the shift, the result is XORed with 0x11D to fold it back into the field.
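The shift-and-fold rule is short enough to write out. This is a minimal, unoptimized sketch with made-up function names; real implementations use lookup tables or SIMD instead of a bit loop.

```python
def gf_mul2(x: int) -> int:
    """Multiply a byte by g = 0x02 in GF(2^8) with polynomial 0x11D."""
    x <<= 1
    if x & 0x100:      # the high bit was set before the shift
        x ^= 0x11D     # fold the overflow back into 8 bits
    return x

def gf_mul(a: int, b: int) -> int:
    """General GF(2^8) multiply built from repeated multiply-by-g steps."""
    result = 0
    while b:
        if b & 1:
            result ^= a    # addition in GF(2^8) is XOR
        a = gf_mul2(a)
        b >>= 1
    return result

print(hex(gf_mul2(0x01)))  # 0x2
print(hex(gf_mul2(0x80)))  # 0x1d
print(hex(gf_mul(3, 7)))   # 0x9
```

Starting from 1 and applying `gf_mul2` repeatedly visits all 255 nonzero bytes before returning to 1, which is what makes g usable as a source of distinct coefficients.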

The Q value for a stripe is the XOR of each data block multiplied by a successive power of g: Q = (g^0 · D_0) XOR (g^1 · D_1) XOR (g^2 · D_2) XOR ... XOR (g^(N-1) · D_(N-1)). The exponent is the drive index, which is what makes P and Q mathematically independent and gives RAID 6 its two-failure tolerance. Because each data block is multiplied by a distinct coefficient, two simultaneous unknowns produce a two-equation, two-unknown system that can always be solved by Galois field matrix inversion.
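Written as code, the two syndromes are accumulated side by side. The sketch below follows the formula above with g = 0x02 and polynomial 0x11D; the function names are illustrative, not the kernel's.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) reduced by 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_pow(base: int, exponent: int) -> int:
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result

def pq_syndromes(data_blocks: list[bytes]) -> tuple[bytes, bytes]:
    """Return (P, Q) for one stripe of equal-length data blocks."""
    length = len(data_blocks[0])
    p = bytearray(length)
    q = bytearray(length)
    for index, block in enumerate(data_blocks):
        coeff = gf_pow(2, index)            # g^index for this drive
        for i, byte in enumerate(block):
            p[i] ^= byte                    # P: plain XOR
            q[i] ^= gf_mul(coeff, byte)     # Q: XOR of weighted bytes
    return bytes(p), bytes(q)

p, q = pq_syndromes([b"\x01", b"\x02", b"\x03"])
print(p.hex(), q.hex())  # 00 09
```

In the tiny example, P is 0x01 XOR 0x02 XOR 0x03 = 0x00, while Q is 1·0x01 XOR 2·0x02 XOR 4·0x03 = 0x01 XOR 0x04 XOR 0x0C = 0x09, so the two syndromes carry different information about the same data.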

Hardware RAID controllers offload this arithmetic to ASIC linear-feedback shift registers or precomputed multiplication tables. Modern x86 software implementations, such as the Linux kernel's lib/raid6 routines and Intel's ISA-L library, use AVX2/AVX-512 to vectorize the Galois field multiplications across 32 or 64 bytes at a time; ARM builds use NEON. Without these accelerations, software RAID 6 parity calculation would be roughly an order of magnitude slower than RAID 5.

Two-Failure Recovery Cases in RAID 6

RAID 6 handles four distinct failure topologies: (a) two data drives lost, (b) one data drive plus the P drive lost, (c) one data drive plus the Q drive lost, or (d) both the P and Q drives lost. Case (d) is trivial: the surviving data is intact and P and Q are recalculated from scratch. Case (c) is also straightforward: the surviving data plus P rebuild the missing data drive via XOR, then Q is recalculated. Case (b) requires Galois field division, since P is gone and the missing data drive must be solved from the Q equation by multiplying through by the inverse of g raised to that drive's index. Case (a) is the hardest: both P and Q equations are needed, and the two missing data values are extracted by solving a 2x2 linear system over GF(2^8).
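Case (a) can be sketched in a few dozen lines. With missing data drives x and y, removing the surviving blocks from P and Q leaves Pxy = Dx XOR Dy and Qxy = g^x·Dx XOR g^y·Dy, so Dx = (Qxy XOR g^y·Pxy) / (g^x XOR g^y) and Dy = Pxy XOR Dx. The code below is an unoptimized illustration with hypothetical function names, assuming g = 0x02 and polynomial 0x11D.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) reduced by 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_pow(base: int, exponent: int) -> int:
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result

def gf_inv(a: int) -> int:
    """Multiplicative inverse: a^254, because a^255 == 1 for nonzero a."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse in GF(2^8)")
    return gf_pow(a, 254)

def syndromes(blocks: list) -> tuple[bytes, bytes]:
    """Accumulate P and Q over the blocks that are present (None = missing)."""
    length = len(next(b for b in blocks if b is not None))
    p, q = bytearray(length), bytearray(length)
    for index, block in enumerate(blocks):
        if block is None:
            continue
        coeff = gf_pow(2, index)
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

def recover_two_data(blocks: list, p: bytes, q: bytes, x: int, y: int) -> tuple[bytes, bytes]:
    """Solve for missing data blocks x and y from the survivors plus P and Q."""
    partial_p, partial_q = syndromes(blocks)
    pxy = bytes(a ^ b for a, b in zip(p, partial_p))   # Dx ^ Dy
    qxy = bytes(a ^ b for a, b in zip(q, partial_q))   # g^x*Dx ^ g^y*Dy
    gy = gf_pow(2, y)
    scale = gf_inv(gf_pow(2, x) ^ gy)                  # 1 / (g^x ^ g^y)
    dx = bytes(gf_mul(scale, qb ^ gf_mul(gy, pb)) for pb, qb in zip(pxy, qxy))
    dy = bytes(a ^ b for a, b in zip(pxy, dx))
    return dx, dy

stripe = [b"RAID", b"six!", b"P+Q.", b"test"]
p, q = syndromes(stripe)
damaged = [stripe[0], None, stripe[2], None]   # drives 1 and 3 are gone
d1, d3 = recover_two_data(damaged, p, q, 1, 3)
print(d1, d3)  # b'six!' b'test'
```

The division never fails for two distinct drive indices, because distinct powers of g are distinct bytes and their XOR is therefore nonzero.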

Stripe Size, Chunk Size, and Layout

The chunk size (also called strip size) is the amount of contiguous data written to a single drive before moving to the next drive in the array. Common values are 64 KB, 128 KB, 256 KB, and 512 KB. A stripe is the set of chunks across all drives at the same address offset, including the parity chunk(s).

Chunk size affects performance. Small chunks (64 KB) spread each I/O across more drives, improving throughput for large sequential reads. Large chunks (512 KB) keep individual I/O operations on a single drive, improving random I/O performance by reducing cross-drive coordination.

During RAID recovery, the chunk size must be known exactly. If a recovery tool assembles the array with the wrong chunk size, the data interleaving is incorrect and the resulting image will be garbled. PC-3000 RAID analyzes the raw data on each drive to detect the correct chunk size, parity rotation direction, and drive order automatically when the RAID controller metadata is damaged or unavailable.

RAID Write Penalty

Every small (partial-stripe) write in a parity RAID requires reading the old data, reading the old parity, calculating new parity, writing new data, and writing new parity. This is the read-modify-write cycle. RAID 5 has a write penalty of 4 (four I/O operations per logical write). RAID 6 has a write penalty of 6 (two parity blocks to update instead of one).
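The reason only the old data and old parity need to be read is that XOR lets the controller cancel the old block out of the parity and fold the new one in: new P = old P XOR old D XOR new D. A minimal sketch of that shortcut, with hypothetical helper names:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def rmw_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """New parity after overwriting one data chunk, without reading the others."""
    return xor_bytes(old_parity, xor_bytes(old_data, new_data))

# A stripe with three data chunks and its parity.
d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
old_parity = xor_bytes(xor_bytes(d0, d1), d2)

# Overwrite d1 using the read-modify-write shortcut.
new_d1 = b"\xAA\xBB"
shortcut = rmw_parity(d1, new_d1, old_parity)

# Recomputing parity from the whole stripe gives the same answer.
recomputed = xor_bytes(xor_bytes(d0, new_d1), d2)
print(shortcut == recomputed)  # True
```

The shortcut touches two drives regardless of how wide the array is, which is why the RAID 5 penalty stays at four I/Os even on large arrays.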

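| RAID Level | Write Penalty | Drive Failures Tolerated | Usable Capacity |
|------------|---------------|--------------------------|-----------------|
| RAID 0     | 1             | 0                        | N drives        |
| RAID 1     | 2             | 1 (per mirror pair)      | N/2 drives      |
| RAID 5     | 4             | 1                        | N-1 drives      |
| RAID 6     | 6             | 2                        | N-2 drives      |
| RAID 10    | 2             | 1 per mirror pair        | N/2 drives      |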

Hardware RAID controllers with battery-backed write cache (BBU/BBM) absorb the write penalty by caching writes in DRAM and flushing them to drives in optimized batches. If the BBU fails or the cache policy is set to write-through, the full write penalty applies and write latency increases by 5-10x. Dell PERC controllers, HP SmartArray, and LSI MegaRAID all implement this caching strategy.

The penalty figures above describe a partial-stripe write: the operating system updates a single block inside an existing stripe, so the controller must read the old data and old parity, calculate new parity, then write both back. Counted as I/O operations, that is two reads plus two writes for RAID 5 (penalty 4), and three reads plus three writes for RAID 6 (penalty 6, since the controller has to read the old data, old P, and old Q before writing the new data, new P, and new Q). When the operating system writes enough contiguous data to fill an entire stripe, the controller skips the preliminary reads entirely, computes the parity from the new data already in cache, and writes the complete stripe in a single transaction. This optimization is called a full-stripe write; accumulating smaller writes in cache until they fill a stripe is often called stripe coalescing.
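The counting above generalizes to any number of chunks written within one stripe. The function below is a simplified model with a hypothetical name: it assumes read-modify-write for every partial update and ignores caching and any alternative strategy a controller might choose when most of a stripe is being rewritten.

```python
def write_ios(chunks_written: int, data_chunks: int, parity_chunks: int) -> tuple[int, int]:
    """Return (reads, writes) for updating one stripe.

    data_chunks is the number of data chunks per stripe; parity_chunks is
    1 for RAID 5 and 2 for RAID 6.
    """
    if not 1 <= chunks_written <= data_chunks:
        raise ValueError("chunks_written must be between 1 and data_chunks")
    if chunks_written == data_chunks:
        # Full-stripe write: parity comes from the new data, nothing is read.
        return 0, data_chunks + parity_chunks
    # Partial-stripe write: read, then rewrite, each touched data and parity chunk.
    touched = chunks_written + parity_chunks
    return touched, touched

print(write_ios(1, 3, 1))  # (2, 2)  RAID 5, single chunk: penalty 4
print(write_ios(1, 4, 2))  # (3, 3)  RAID 6, single chunk: penalty 6
print(write_ios(3, 3, 1))  # (0, 4)  RAID 5, full stripe: no reads
```

In this model a full-stripe write on a four-drive RAID 5 delivers three chunks of data for four I/Os, against four I/Os for a single chunk in the partial case.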

Linux md Stripe Cache and Write Coalescing

Linux software RAID exposes the buffer that holds in-flight stripes at /sys/block/mdX/md/stripe_cache_size. The default is 256 pages per disk, sized for low memory consumption rather than throughput. Sequential write workloads (LUKS-encrypted volumes, large file transfers, database checkpoints) frequently arrive in chunks too small to fill a stripe individually but large enough in aggregate to coalesce. With a small stripe cache, md flushes them as partial-stripe writes and pays the read-modify-write penalty on every flush; with a larger cache, md holds the partial writes in RAM long enough to assemble full stripes and bypass the penalty.

Storage administrators commonly raise stripe_cache_size to 4096, 8192, or 32768 pages on parity arrays, trading several hundred megabytes of RAM per array for multi-x throughput improvements on bursty sequential writes. The setting is per-array and survives reboots only if written through a startup hook. There is no equivalent knob on hardware RAID controllers; the controller's own DRAM cache plays the same role and its policy (write-back vs write-through, BBU charge state) governs whether coalescing happens at all.
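The RAM cost of a given setting can be estimated before changing it. The sketch below uses the relationship given in the kernel's md documentation (page size × number of member disks × stripe_cache_size); the 4 KiB page size and the four-disk array are assumptions for the example.

```python
def stripe_cache_bytes(stripe_cache_size: int, nr_disks: int, page_size: int = 4096) -> int:
    """Approximate memory consumed by the md stripe cache for one array."""
    return stripe_cache_size * nr_disks * page_size

for size in (256, 4096, 8192, 32768):
    mib = stripe_cache_bytes(size, nr_disks=4) / (1024 * 1024)
    print(f"stripe_cache_size={size:>5} -> {mib:6.0f} MiB on a 4-disk array")
```

On a four-disk array the default works out to 4 MiB and the 32768 setting to 512 MiB, and the figure grows linearly with the number of member disks.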

Parity protects against drive failure, not data corruption.

RAID parity recalculates missing data from failed drives, but it does not detect or correct silent data corruption. If a drive returns incorrect data without reporting an error (a bit flip in DRAM, a firmware bug, or a media defect below the drive's error threshold), the parity system will incorporate the corrupted data into parity calculations without warning. Only checksumming filesystems like ZFS or Btrfs detect this type of corruption.

Unrecoverable Read Errors During Rebuild

Consumer SATA drives carry a typical Unrecoverable Bit Error Rate (UBER) of 1 error per 10^14 bits read. That's roughly one bad sector per 12 TB of data. Enterprise SAS and nearline drives are typically rated at 1 in 10^15, about 125 TB per expected URE; the published spec sheets for Seagate Exos, WD Ultrastar, and Toshiba MG series all carry that figure. During a degraded RAID rebuild, the controller reads every sector on every surviving drive to recalculate the failed drive's contents. With modern 16-20 TB drives, the probability of hitting a latent URE before the rebuild completes is high enough to be a practical concern, not a theoretical one.
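A back-of-the-envelope estimate makes the scale concrete. The model below is deliberately crude and is an assumption, not a vendor formula: it treats the rated UBER as an independent per-bit probability, whereas real drives often perform better than their rated worst case and real errors tend to cluster.

```python
import math

def ure_probability(bytes_read: float, uber_exponent: int) -> float:
    """Chance of at least one URE while reading bytes_read bytes.

    uber_exponent is 14 for a drive rated at 1 error per 10^14 bits.
    """
    bits = bytes_read * 8
    per_bit = 10.0 ** -uber_exponent
    # 1 - (1 - per_bit)^bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-per_bit))

# Rebuilding a four-drive RAID 5 of 16 TB drives reads three surviving drives.
rebuild_bytes = 3 * 16e12
print(f"10^14-rated drives: {ure_probability(rebuild_bytes, 14):.1%}")
print(f"10^15-rated drives: {ure_probability(rebuild_bytes, 15):.1%}")
```

Under this model the rated figure alone makes an error during such a rebuild more likely than not on 10^14-class drives, and still a sizeable risk on 10^15-class drives.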

Enterprise hardware controllers handle this differently than software RAID. Dell PERC controllers can "puncture" the affected stripe: they mark that specific LBA range as unrecoverable and continue rebuilding the rest of the array. The data in the punctured stripe is lost, but the server comes back online. Software RAID implementations (Linux md, Windows Storage Spaces) typically abort the rebuild entirely on a URE, leaving the array in a failed state that requires professional RAID data recovery.

This is the engineering reason RAID 6 matters on large-capacity arrays. A second parity block doesn't just protect against a second drive dying; it provides a mathematical fallback when a surviving drive can't deliver clean reads during reconstruction.

RAID 5 Write Hole and Crash Consistency

The write hole is silent corruption created by the non-atomicity of partial-stripe updates. A logical write to a single block forces the controller to perform a sequence: read old data, read old parity, compute new parity, write new data, write new parity. If the host loses power or kernel-panics between the data write and the parity write, the parity block on disk no longer matches the data block. The array is consistent enough to serve reads in normal operation, because the controller reads from data drives directly. The corruption only surfaces later: if a drive subsequently fails and the array enters degraded mode, the rebuild reads the now-stale parity, computes garbage where the missing data block used to be, and writes it back during the rebuild as if it were correct.
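The sequence is easy to reproduce on a toy stripe. The sketch below simulates a crash by simply not performing the parity write; the block contents and the helper name are made up for illustration.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# A consistent stripe: three data chunks and their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*data)

# Partial-stripe write to chunk 0: the data write lands on disk...
data[0] = b"ZZZZ"
# ...and the host loses power here, so the matching parity write never happens.

# Normal reads still look fine because they come straight from the data drives.
print(data[1])  # b'BBBB'

# Later, drive 1 fails and its chunk is rebuilt from the stale parity.
rebuilt_chunk_1 = xor_blocks(data[0], data[2], parity)
print(rebuilt_chunk_1)  # b'YYYY'
```

The damaged chunk is not the one that was being written: chunk 1 was never touched by the interrupted update, yet it is the one that comes back wrong after the rebuild.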

Hardware RAID controllers close this hole with a battery-backed (BBU/BBM) or NV-DIMM-backed write cache. The intended writes are logged in non-volatile cache before the disks ever see them; if a crash occurs partway through, the controller replays the log on restart and finishes the stripe atomically. When the BBU has discharged or is in a learning cycle, enterprise controllers automatically downgrade the cache policy to write-through, which closes the hole at the cost of full-penalty write latency.

Linux md Mitigations: Bitmap, Journal, and Partial Parity Log

Linux software RAID has no battery, so the kernel implements three increasingly thorough mitigations:

  • Write-intent bitmap. A small bitmap records which stripes have writes in flight. After a crash, only the dirty stripes need to be resynced instead of the entire array. The bitmap accelerates recovery but does not prevent the underlying torn-stripe condition; it just narrows the search space for the resync.
  • Journal device (mdadm --write-journal). An external fast device, typically an NVMe SSD, receives a write-ahead log of every parity update. After a crash the kernel replays the journal before bringing the array online. This closes the hole but requires dedicating a fast, reliable device to the array.
  • Partial Parity Log (PPL). Introduced for the case where a separate journal device is impractical. Before a partial-stripe write, md computes the XOR of the stripe's unmodified chunks and stamps that partial parity into a reserved metadata region on the parity drive. If a crash interrupts the write, recovery uses the logged partial parity to deterministically reconstruct either the pre-write or post-write state without leaving stale parity on disk.
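The Partial Parity Log idea from the last item can be shown on the same kind of toy stripe. This is a simplified model of the concept, not md's on-disk format: the dictionary stands in for the member drives and the variable `partial_parity_log` stands in for the reserved metadata region.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Toy stripe: three data chunks plus parity, all consistent.
disk = {"d0": b"AAAA", "d1": b"BBBB", "d2": b"CCCC"}
disk["p"] = xor_blocks(disk["d0"], disk["d1"], disk["d2"])

# A partial-stripe write will modify only d0. Before touching the data,
# log the XOR of the chunks that are NOT being modified.
partial_parity_log = xor_blocks(disk["d1"], disk["d2"])

# The data write lands, then the machine crashes before parity is updated.
disk["d0"] = b"ZZZZ"

# Recovery: combine the logged partial parity with whatever d0 now holds.
disk["p"] = xor_blocks(partial_parity_log, disk["d0"])

# Parity is consistent again, so losing d1 later still rebuilds it correctly.
rebuilt_d1 = xor_blocks(disk["d0"], disk["d2"], disk["p"])
print(rebuilt_d1)  # b'BBBB'
```

The recovery step gives a consistent parity whether d0 holds the old or the new contents at crash time, which is why the log only needs the unmodified chunks.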

ZFS RAID-Z: Parity at the Filesystem Layer

Hardware controllers and Linux md operate at the block layer. They see logical block addresses, not files. ZFS integrates the volume manager and the filesystem in one stack, and ZFS RAID-Z (Z1, Z2, Z3 for one, two, or three parity blocks per stripe) computes parity at the filesystem layer instead. Two structural differences fall out of that:

Variable-width stripes. Block-layer RAID 5 has fixed stripe geometry: if the stripe width is 4 drives and the chunk size is 64 KB, every stripe is 192 KB of data plus a 64 KB parity chunk regardless of the size of the logical write. ZFS allocates sectors per transaction sized to the actual record being written (governed by recordsize and the disk's ashift). A 4 KB write with RAID-Z1 uses one 4 KB data sector plus one 4 KB parity sector and leaves the rest of the stripe alone. Every write is a full-stripe write by definition, so the read-modify-write cycle that creates the partial-write penalty in block-layer RAID does not exist.
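The sizing of a variable-width write can be estimated with a little integer arithmetic. The sketch below is an approximation under stated assumptions: the function name is hypothetical, and the padding step (rounding the allocation up to a multiple of parity count plus one) reflects commonly described OpenZFS behavior rather than anything stated above.

```python
def raidz_sectors(write_bytes: int, n_disks: int, n_parity: int, ashift: int = 12) -> dict[str, int]:
    """Estimate sectors used by one RAID-Z write (sector size = 2**ashift)."""
    sector = 1 << ashift
    data = -(-write_bytes // sector)                       # ceiling division
    # One set of parity sectors for each row of up to (n_disks - n_parity) data sectors.
    parity = n_parity * -(-data // (n_disks - n_parity))
    # Assumed padding rule: round the total up to a multiple of (n_parity + 1).
    padding = -(data + parity) % (n_parity + 1)
    return {"data": data, "parity": parity, "padding": padding}

print(raidz_sectors(4 * 1024, n_disks=4, n_parity=1))
# {'data': 1, 'parity': 1, 'padding': 0}
print(raidz_sectors(128 * 1024, n_disks=4, n_parity=1))
# {'data': 32, 'parity': 11, 'padding': 1}
```

The first result matches the 4 KB example in the paragraph above: one data sector and one parity sector, with the other two drives untouched by that write.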

Copy-on-write closes the write hole. ZFS never overwrites live data. A write allocates new sectors elsewhere on the pool, commits data and parity atomically as a new transaction, then advances the Uberblock pointer to the new tree. If a power loss occurs after the write but before the Uberblock advance, the pool still references the previous consistent tree on the next mount. There is no stripe to be torn between data and parity because parity is always written as part of the same transaction as its data, and the visibility of the entire transaction is gated by a single atomic pointer flip.

The trade-offs are not free. RAID-Z requires the full ZFS stack with its memory and CPU overhead, the filesystem cannot be migrated off ZFS without destroying the pool, and a rebuild (called a resilver) reads only allocated blocks rather than the entire device, which is faster on lightly used pools but offers no advantage on a pool that is mostly full. For recovery, RAID-Z pools require ZFS-aware tools that walk the pool's block pointer tree from the Uberblock; generic block-layer RAID parity calculators cannot reassemble RAID-Z because the stripe geometry is variable per record.

Frequently Asked Questions

How does RAID 5 reconstruct data from a failed drive?

The controller reads data blocks and parity blocks from the surviving drives for each stripe. Because XOR is reversible, any single missing value can be recalculated by XORing the remaining values. This works for one failed drive; a second failure during rebuild causes data loss.

What is the difference between RAID 5 and RAID 6?

RAID 5 uses one parity block per stripe (XOR) and survives one drive failure. RAID 6 uses two independent parity blocks (XOR + Galois field arithmetic) and survives two simultaneous failures. RAID 6 requires a minimum of four drives and has higher write overhead.

What happens if a RAID hits an Unrecoverable Read Error during rebuild?

Consumer SATA drives have a UBER of 1 error per 10^14 bits (roughly 12 TB). On a full-array rebuild of 16-20 TB drives, the odds of hitting a latent bad sector are high. Enterprise controllers like Dell PERC can puncture the affected stripe and continue rebuilding; the data in that stripe is lost but the rest of the array survives. Software RAID typically aborts the entire rebuild, leaving the array in a degraded or failed state requiring professional recovery.

Can RAID parity protect against simultaneous SSD firmware failures?

No. Parity guards against individual drive failures, not correlated firmware panics across multiple drives. SSDs sharing identical controllers and firmware revisions can fail simultaneously when a firmware bug triggers at a specific power-on hour or write-cycle threshold. The HPE SAS SSD 40,000-hour bug (firmware before HPD7) locked up all drives in affected arrays at the same interval. When two or more drives drop from a RAID 5 at once, XOR parity can't reconstruct the missing data; the array needs RAID data recovery.

What parity rotation layout does Linux md use by default, and why does it matter for recovery?

Linux md defaults to left-symmetric: parity walks one drive toward drive 0 each stripe and data wraps continuously around the parity block. The layout maximizes large-sequential-read throughput. For recovery, the layout determines the byte order when the array is reassembled. PC-3000 RAID Edition and R-Studio detect it heuristically by scanning each drive for filesystem signatures and parity-test patterns, but proprietary layouts like HP SmartArray delayed parity or Promise wide-pace require deducing additional parameters (delay interval, first-delay offset) before reassembly produces a mountable image.

What is the RAID 5 write hole and how is it mitigated?

The write hole is silent corruption from the non-atomicity of partial-stripe updates. If power is lost between writing new data and writing matching parity, the on-disk parity is stale; a subsequent drive failure will rebuild garbage. Hardware RAID closes the hole with battery-backed or NV-DIMM cache that replays uncommitted writes after a crash. Linux md offers three mitigations: write-intent bitmaps (speed resync, do not fully seal the hole), an external journal device via mdadm --write-journal, and the Partial Parity Log (PPL) that records the XOR of unmodified stripe chunks into the parity drive's metadata so the pre-write or post-write state can be reconstructed deterministically.

How does ZFS RAID-Z avoid the write hole?

RAID-Z computes parity at the filesystem layer rather than the block layer, and ZFS uses copy-on-write: a write never overwrites existing data. ZFS allocates new sectors, commits data plus parity together as a new transaction, then atomically advances the Uberblock pointer to the new tree. A power loss before the Uberblock advance leaves the pool referencing the prior consistent tree. RAID-Z also uses variable-width stripes sized to the logical record, so every write is inherently a full-stripe write and the read-modify-write cycle behind the block-layer write hole does not occur.
