
RAID Rebuild Failed: What to Do Next

Your RAID rebuild did not complete. The controller has marked the array as failed, and the volume is offline. This guide covers why rebuilds fail across RAID 1, 5, 6, and 10, which post-failure actions make things worse, and how to assess the array before taking any further steps.

The data is still on the drives. The goal now is to avoid overwriting it.

Written by Louis Rossmann, Founder & Chief Technician
Updated March 2026

Why RAID Rebuilds Fail

In parity-based arrays (RAID 5, RAID 6), a rebuild reads every sector of every surviving drive to recalculate the missing data onto a replacement. Mirrored arrays (RAID 1, RAID 10) read only the surviving mirror partner, making their rebuilds faster and less stressful. In either case, three categories of failure account for nearly all rebuild aborts: latent sector errors, second mechanical failures, and controller-level errors.

  • 1. Latent sector errors (UREs): Sectors that became unreadable at some point but were never accessed, so the error went undetected. The rebuild forces a full sequential read that surfaces every latent error. On high-capacity drives, the probability of hitting at least one URE increases with total bytes read.
  • 2. Second drive failure: Drives purchased together accumulate similar wear. If one has failed, the remaining drives have experienced identical power-on hours and thermal cycles. The sustained sequential I/O of a rebuild accelerates failure in drives already near the end of their service life.
  • 3. Drive and controller timeout mismatch: RAID arrays depend on drives responding within strict time limits. Enterprise and NAS drives set their internal error-recovery timeout (ERC/TLER) to approximately 7 seconds, ensuring they either return data or report failure quickly. The RAID controller imposes its own command timeout on top of this, typically 8 to 20 seconds depending on the vendor. Consumer desktop drives, which lack ERC configuration, may spend 30 seconds to over 2 minutes retrying bad sectors internally. This mismatch is the root cause of "phantom" drive drops: the drive is still working, but the controller's patience runs out first, and it marks the drive as failed.
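
The timeout mismatch above can be reduced to a one-line comparison. A minimal sketch; all timeout values here are hypothetical examples in the ranges quoted above, not vendor specifications:

```python
# Illustrative sketch of the ERC/controller timeout mismatch. A drive stays
# in the array only if it answers (with data or an error) before the
# controller's own command timeout expires.

def drive_survives(drive_recovery_s: float, controller_timeout_s: float) -> bool:
    """True if the drive responds before the controller gives up on it."""
    return drive_recovery_s <= controller_timeout_s

CONTROLLER_TIMEOUT_S = 10   # typical hardware RAID window (8-20 s range)
NAS_DRIVE_ERC_S = 7         # ERC/TLER caps internal retries at ~7 s
CONSUMER_RETRY_S = 90       # deep internal retry on a drive with no ERC

print(drive_survives(NAS_DRIVE_ERC_S, CONTROLLER_TIMEOUT_S))   # True: reports in time
print(drive_survives(CONSUMER_RETRY_S, CONTROLLER_TIMEOUT_S))  # False: "phantom" drop
```

The consumer drive in the second case is still healthy; it is simply retrying longer than the controller is willing to wait.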

For RAID 5 rebuild failures specifically, the risk is highest because RAID 5 has zero remaining fault tolerance once degraded. RAID 6 and RAID 10 have additional margin, but the same physical failure mechanisms apply. If the failure occurred during a RAID reshape or NAS migration, the array has both a missing member and a split geometry, which requires a different reconstruction approach.

Example: A 4-drive RAID 5 with 12TB WD Red drives. Drive 3 fails. The rebuild starts on a hot spare. Partway through, drive 1 encounters a URE on a sector that was never read during normal operation. The controller cannot compute the XOR for that stripe because two sources are now unavailable. The rebuild aborts. The array transitions from degraded to failed.
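
The stripe math in that example can be reproduced in a few lines. This is a simplified single-stripe sketch of RAID 5 XOR parity, not a controller implementation:

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR across equal-length blocks: the RAID 5 parity rule."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe on a 4-drive RAID 5: three data blocks plus one parity block.
d1, d2, d3 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
parity = xor_blocks([d1, d2, d3])

# Drive 3 fails: its block is recomputable from the survivors plus parity.
assert xor_blocks([d1, d2, parity]) == d3

# But if a URE makes d1 unreadable in the same stripe, the equation
# d3 = d1 ^ d2 ^ parity has two unknowns. That stripe is unrecoverable,
# and the controller aborts the rebuild exactly as in the example above.
print("stripe recovered with one missing source")
```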

SMR Drive Write Amplification During Rebuilds

Drive-Managed Shingled Magnetic Recording (DM-SMR) adds a fourth failure mode that didn't exist before 2018. SMR drives write data in overlapping tracks to increase platter density. Reads are unaffected, but sustained sequential writes force the drive to rewrite entire shingled zones when its small CMR write cache fills up. A RAID rebuild is one continuous sequential write operation.

WD Red models WD20EFAX, WD40EFAX, & WD60EFAX ship as DM-SMR without clear labeling. When a rebuild hits the cache limit, the drive stalls for several seconds while it rewrites zones internally. Hardware RAID controllers with 8-20 second command timeouts interpret the stall as a drive failure & drop it from the array. The rebuild aborts. Independent testing showed DM-SMR drives extending a standard 15-hour rebuild to over 9 days in ZFS environments, with hardware RAID controllers failing outright. WD's CMR-labeled models (WD Red Plus, WD Red Pro) don't have this problem. If you're running a parity-based array where rebuild survival matters, verify that every member drive is CMR before the first failure happens.

SSD Cache & NVMe RAID: Firmware Panics During Rebuild

SSD-based RAID arrays and NAS SSD caches introduce a fifth failure mode that HDD-focused guides overlook. The sustained sequential read load of a rebuild can push consumer SSDs with aging NAND past their failure threshold. SATA SSDs using the Phison S11 controller (PS3111, found in budget drives like the Kingston A400, Patriot Burst, and Silicon Power S55) are prone to a firmware lockout when TLC NAND cells degrade beyond the ECC correction threshold. The controller enters a protective state, the drive drops offline, and re-identifies in the BIOS as "SATAFIRM S11" instead of the original model name. The rebuild's sustained read load does not cause the NAND degradation, but it surfaces latent cell failures that normal desktop workloads would not trigger. NVMe SSDs with Phison E12 controllers experience similar FTL corruption but drop off the PCIe bus or report hardware initialization failures instead. Silicon Motion SM2259XT controllers exhibit a different symptom: firmware corruption (typically from power loss during garbage collection or cache flush) causes the drive to report 0 bytes capacity or appear as unallocated in disk management.

Both failures corrupt the Flash Translation Layer (FTL), the firmware mapping table that tracks which logical block lives on which physical NAND page. Consumer SSD recovery tools can't access a panicked controller. Recovery requires placing the SSD into Technological Mode using PC-3000 SSD to access the raw NAND and reconstruct the block mapping directly from it. For arrays mixing SSDs and HDDs, the panicked SSD is priced at the firmware-level SSD tier ($600–$900) while healthy HDDs image at the standard $100 rate. If a member SSD fails this way during a rebuild, the same imaging-first approach used in professional RAID data recovery applies: image every drive before attempting any reconstruction.

URE Probability on Large-Capacity Drives

The math works against RAID 5 as drive capacities increase. Consumer HDDs carry an unrecoverable read error (URE) rating of 1 error per 10^14 bits read. That's roughly 12.5 TB of data before you statistically expect one unreadable sector.

A degraded 4-drive RAID 5 array using 16 TB drives forces the controller to read 48 TB across the three surviving members to rebuild the replacement. At consumer URE rates, that's 3-4 expected unreadable sectors per rebuild pass. How the controller responds depends on the stack. Enterprise controllers (Dell PERC, LSI MegaRAID) "puncture" the affected stripe, marking it as unrecoverable, and continue the rebuild; the array goes online with known-bad stripes. Consumer hardware RAID (Intel RST, budget SATA cards) aborts the rebuild outright. Linux mdadm may fault the drive after exceeding its read-error threshold, double-degrading the array. ZFS continues the resilver but marks the affected blocks as permanently errored; the data at those locations is lost. The outcome varies by stack, but none of them are good. Enterprise SAS drives are rated at 10^15 bits (125 TB per URE), which cuts the probability by a factor of 10. This is why enterprise drives are specified for RAID arrays that need to survive a rebuild.

The implication is simple: RAID 5 with consumer drives over 4 TB is a rebuild failure waiting to happen. RAID 6 adds a second parity block (Reed-Solomon encoding alongside standard XOR), so a single URE during rebuild doesn't kill the process. RAID 10 avoids the problem entirely because rebuilds only read one mirror partner, not the entire array.

The numbers: 4 x 8 TB drives (RAID 5) = 24 TB rebuild read = ~1.9 expected UREs. 4 x 16 TB drives = 48 TB = ~3.8 expected UREs. 4 x 20 TB drives = 60 TB = ~4.8 expected UREs. Each drive capacity doubling roughly doubles the rebuild failure probability. For hard drive data recovery from a failed RAID rebuild, we image every drive with PC-3000 & DeepSpar through write-blocked connections before any reconstruction attempt.
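
Those expected-URE figures follow directly from the rating. A short worked version using the consumer 1-per-10^14-bit rate quoted above; the Poisson step for the hit probability is our added assumption, treating each bit as an independent trial:

```python
import math

URE_PER_BIT = 1e-14  # consumer HDD rating: 1 URE per 10^14 bits read

def rebuild_read_tb(drives: int, capacity_tb: float) -> float:
    """A degraded rebuild reads every surviving member end to end."""
    return (drives - 1) * capacity_tb

def expected_ures(read_tb: float) -> float:
    return read_tb * 1e12 * 8 * URE_PER_BIT   # TB -> bits -> expected errors

def p_at_least_one_ure(read_tb: float) -> float:
    # Poisson approximation over a huge number of tiny per-bit failure odds
    return 1 - math.exp(-expected_ures(read_tb))

for cap in (8, 16, 20):
    tb = rebuild_read_tb(4, cap)
    print(f"4 x {cap} TB RAID 5: read {tb} TB, "
          f"~{expected_ures(tb):.1f} expected UREs, "
          f"P(at least one) = {p_at_least_one_ure(tb):.0%}")
```

At 48 TB read, the chance of completing the pass without hitting a single URE is only about 2%, which is why a retry almost always fails the same way.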

What Not to Do After a Rebuild Failure

After a rebuild failure, the most common instinct is to retry. Each of the following actions risks overwriting the data you are trying to recover.

  • 1. Do not retry the rebuild. A second attempt repeats the same full-disk read on parity-based arrays (or the mirror-partner read on RAID 1/10), placing the same sustained I/O load on drives that just demonstrated a failure. If the first rebuild found a URE, the second will find it again or trigger a new one.
  • 2. Do not force the array online. Controller utilities like "Force Online," "Force Import," or "Set Foreign Config Good" assemble the array using whatever metadata is available. If the rebuild wrote partial parity updates before failing, the forced assembly mixes pre-rebuild and post-rebuild parity states. The resulting volume may mount, but stripes with mixed parity are silently corrupted.
  • 3. Do not run filesystem repair tools. fsck, chkdsk, xfs_repair, and btrfs check assume the underlying block device is consistent. On a broken RAID array, they interpret parity corruption as filesystem damage and may delete valid directory entries or truncate files.
  • 4. Do not swap drives between slots. Moving drives between bays can trigger an automatic rebuild, cause metadata writes, or create confusion during offline recovery. Leave all drives in their original positions.
  • 5. Do not initialize or delete the virtual disk. Some controller BIOSes offer "Initialize" or "Delete Virtual Disk." Both destroy the RAID metadata that defines the array configuration (stripe size, drive order, parity rotation).

If the controller wrote partial parity updates during the failed rebuild, the pre-rebuild degraded state has been partially overwritten. The damage increases with each additional operation. Power down and image every drive before taking further action.

Assessing the Array State

Before deciding on a course of action, gather information about the array state without modifying anything on disk. The goal is to determine whether the failure was transient (cable, timeout) or physical (media degradation, mechanical fault).

  • 1. Record the controller error. The exact message narrows the diagnosis. "Media error on PD 2 at LBA X" points to a specific drive and sector. "PD 3 not responding" suggests a mechanical or connection failure. Note the rebuild percentage at failure.
  • 2. Check SMART data on all drives. Use smartctl -a /dev/sdX (Linux) or the controller's management utility. Key attributes: Reallocated_Sector_Ct (sectors already moved to spare areas), Current_Pending_Sector (sectors queued for reallocation), and Offline_Uncorrectable (sectors that failed offline scan). Non-zero values on any of these indicate degraded media.
  • 3. Document the RAID configuration. Record the controller model, firmware version, RAID level, stripe size, write policy (write-back vs write-through), and number of drives. This information is required for offline reconstruction if controller metadata is damaged.
  • 4. Label every drive. Mark each drive with its physical slot number using tape or a marker on the drive itself (not just the tray). If drives are removed for imaging, the slot mapping must be preserved.
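
The SMART check in step 2 can be mechanized. A rough sketch that flags the three attributes named above from smartctl's attribute table; the sample rows are hypothetical values, but the column layout follows smartctl's standard output:

```python
# Flag nonzero raw values on the media-health SMART attributes from the
# text: reallocated, pending, and offline-uncorrectable sector counts.

WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def degraded_media_attrs(smartctl_text: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # smartctl -A rows have 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] in WATCH:
            raw = int(fields[-1])
            if raw > 0:
                flagged[fields[1]] = raw
    return flagged

sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
"""
print(degraded_media_attrs(sample))
# {'Current_Pending_Sector': 8, 'Offline_Uncorrectable': 3}
```

Any drive this flags should be imaged before it is asked to survive another rebuild pass.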

For detailed guidance on reading controller logs across Dell PERC, HP SmartArray, LSI MegaRAID, and Linux mdadm, see the degraded RAID troubleshooting guide.

When You Can Fix This Yourself

Not every failed rebuild requires professional recovery. The following scenarios can often be resolved by the administrator.

  • 1. The rebuild failed due to a transient error. If the controller dropped a drive because of a timeout (not a URE or mechanical failure) and SMART data on all drives is clean, the issue may be a loose SATA/SAS cable, a failing backplane connector, or a controller port problem. Reseat cables, test on a different port, and attempt the rebuild again. Image the drives first as a precaution.
  • 2. You have recent, verified backups. If backup integrity has been confirmed (not just backup job completion), restore from the backup. This is the correct answer for any array containing replaceable data.
  • 3. Software RAID (mdadm) with a single-sector URE. If the rebuild is mdadm-based and the error is a single-sector URE, you can use ddrescue to image the affected drive (skipping the bad sector), then reassemble the array from images.
  • 4. RAID 6 or RAID 10 after a non-fatal rebuild failure. If a RAID 6 rebuild failed due to a non-fatal error (such as a URE on a single stripe) rather than a complete second drive failure, the array may still be accessible in degraded mode. The array is in a mixed parity state, not a clean single-failure degradation; rebuilt stripes carry updated parity while unrebuilt stripes retain the original layout. If a RAID 10 rebuild failed within one mirror pair, the other pairs remain intact. Check controller status. If the volume is still mounted, copy data off immediately.

Example: An mdadm RAID 5 on an Ubuntu server. The rebuild failed because drive 3 returned a read error on one sector. The admin uses ddrescue to image all four drives (drive 3's image has one unreadable sector, filled with zeros by ddrescue). The admin reassembles the array from images on a separate machine and copies the data off. The one bad sector affected a single file block; everything else is intact.
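
Before reassembling from images as in this example, it is worth totaling how much of each member actually failed to read. GNU ddrescue records this in its mapfile; a minimal sketch of parsing it (the sample mapfile is abbreviated and hypothetical, but the pos/size/status data-line format is ddrescue's own):

```python
# Total the bytes ddrescue could NOT rescue from a member drive.
# Mapfile data lines are "pos size status": '+' means rescued; any other
# status ('-', '*', '/', '?') is bad or still untried.

def unrecovered_bytes(mapfile_text: str) -> int:
    total = 0
    lines = [l for l in mapfile_text.splitlines()
             if l.strip() and not l.startswith("#")]
    for line in lines[1:]:   # first non-comment line is the current-status line
        pos, size, status = line.split()[:3]
        if status != "+":
            total += int(size, 0)   # sizes are hex, e.g. 0x0200
    return total

sample_map = """\
# Mapfile. Created by GNU ddrescue version 1.27
# current_pos  current_status  current_pass
0x1D1C1110000     +               1
#      pos        size  status
0x00000000  0x1D1C1110000  +
0x1D1C1110000  0x0200  -
0x1D1C1110200  0x2EE0  +
"""
print(unrecovered_bytes(sample_map))   # 512: one unreadable sector remains
```

A zero result on every image means the reassembly is working from complete copies; a nonzero result tells you exactly how much data the bad sectors cost.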

When Professional Imaging Is the Right Call

Some rebuild failure scenarios leave the array in a state that cannot be safely resolved with standard administrator tools.

  • 1. Multiple physical drive failures. If two or more drives have mechanical problems (clicking, not spinning, SMART reporting thousands of reallocated sectors), the drives need to be imaged with hardware that can manage bad sectors, weak heads, and firmware faults at a level ddrescue cannot.
  • 2. Partial rebuild corrupted parity data. If the controller wrote partial parity updates before the rebuild failed, the array cannot be reassembled using either the pre-rebuild or post-rebuild state without analyzing which stripes were modified. This requires forensic RAID reconstruction that compares parity states across drives.
  • 3. Controller metadata is damaged or missing. If the controller BIOS no longer shows the virtual disk, or shows it as "Foreign" or "Missing," the metadata defining stripe size, drive order, and parity rotation may be corrupted. Reconstruction requires scanning the raw drives to detect RAID parameters from data patterns.
  • 4. Post-failure operations already modified the drives. If someone has run force-online, fsck, or reinitialized the virtual disk, the on-disk state has been modified. Recovery is still possible in many cases, but the window narrows with each modification.

For RAID data recovery involving physical drive faults, we image each drive with PC-3000 and DeepSpar Disk Imager through write-blocked connections, then reconstruct the array offline in software. The original drives are never written to. For RAID 5 arrays with partial rebuild corruption, we analyze stripe-level parity to determine which sections use pre-rebuild vs post-rebuild data.

Hardware Controller vs Software RAID Rebuild Behavior

How a rebuild fails depends on whether the array runs on a hardware controller or software RAID. The recovery approach differs for each.

Hardware controllers store array configuration in NVRAM and on-disk metadata. Dell PERC and LSI/Broadcom MegaRAID use the SNIA Disk Data Format (DDF) written to the end of each physical disk. HP SmartArray uses a proprietary format (RAID Information Sector) stored at the beginning of each drive. When a rebuild fails, the controller updates this metadata to mark drives as failed or foreign. A "Foreign Configuration" error on a Dell PERC means the controller's NVRAM has lost sync with the DDF metadata on disk. Professional recovery bypasses the controller hardware entirely, using PC-3000 Data Extractor to parse DDF headers directly from raw disk images & reconstruct the array offline.

Linux mdadm stores its superblock at a known offset on each member drive. If a rebuild fails, the superblock records the event, but it doesn't lock you out the way a hardware controller does. Synology Hybrid RAID (SHR) is more complex: it layers Linux LVM over multiple mdadm slices to accommodate mixed drive sizes. A failed SHR rebuild leaves fragmented LVM Physical Volumes scattered across different parity sets, which requires aligning LVM metadata before the Btrfs or ext4 filesystem can be accessed. For NAS data recovery involving Synology SHR failures, we reconstruct the LVM layer from imaged drives rather than relying on the NAS firmware to reassemble it. If a second drive failed during the NAS rebuild process, see data loss during a NAS rebuild for that specific failure scenario.
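
When the mdadm superblocks are readable from the images, mdadm --examine on each member reports the geometry needed for offline reassembly. A minimal parser for the key fields, assuming mdadm's v1.2 examine text layout (the sample output is abbreviated and the values are hypothetical):

```python
# Pull the reassembly-critical fields (level, member count, chunk size,
# this drive's slot) out of mdadm --examine output for one member.

def parse_examine(text: str) -> dict:
    wanted = ("Raid Level", "Raid Devices", "Chunk Size", "Device Role")
    info = {}
    for line in text.splitlines():
        key, sep, val = line.partition(":")
        if sep and key.strip() in wanted:
            info[key.strip()] = val.strip()
    return info

sample = """\
          Magic : a92b4efc
        Version : 1.2
     Raid Level : raid5
   Raid Devices : 4
     Chunk Size : 512K
    Device Role : Active device 2
"""
print(parse_examine(sample))
# {'Raid Level': 'raid5', 'Raid Devices': '4', 'Chunk Size': '512K',
#  'Device Role': 'Active device 2'}
```

Running this against every imaged member and diffing the results also exposes stale superblocks: a drive whose event count or role disagrees with the others was likely the first one dropped.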

Software RAID Rebuild Loops (Intel RST)

Intel Rapid Storage Technology (RST) and the Intel Optane Memory and Storage Management app have a documented bug where a RAID 5 rebuild reaches 100% completion, crashes the application, and restarts the rebuild from 0%. This has been reported across multiple RST versions, including ICH10R-era controllers through modern chipsets. RST versions 19.5, 20.0, and 20.1 had a separate critical bug that caused RAID 1/5/10 array failures and data corruption (fixed in RST 20.2). Each loop pass forces a full sequential read of all surviving members and rewrites the entire replacement drive from scratch. Repeated passes place identical sustained I/O load on drives that have already been read end-to-end, increasing the chance of a second mechanical failure. If the loop also triggers a consistency check (as some RST versions do), parity on the surviving drives may be recalculated and overwritten, compounding the damage.

If the rebuild loops: power off the system. Do not let it restart. Image all member drives with write-blocked connections before interacting with the Intel RST software again. This is the same principle behind why rebuilding a degraded array risks permanent data loss: every additional pass compounds the damage. For arrays stuck in this loop, recovering a degraded RAID array requires forensic alignment of the stripe geometry from imaged copies, not another software retry.

Data Recovery Cost for Failed RAID Rebuilds

Recovery pricing is based on the physical condition of each individual drive, not the RAID level or array size. There is no flat "RAID recovery fee."

Each member drive is assessed independently. A drive that reads cleanly on a write-blocked connection costs $100 for a sector-level image. A drive with firmware corruption falls in the $600–$900 range. Drives requiring a head swap with donor matching cost $1,200–$1,500, plus the cost of a compatible donor drive. For a 4-drive RAID 5 where one drive has firmware corruption & the other three image cleanly, total recovery might run $600–$900 plus 3 x $100. Compare that to competitors who quote $6,000+ for the same 4-drive NAS without disclosing what the per-drive cost is.

We don't charge diagnostic fees. If we can't recover the data, you don't pay. That's the no-data, no-fee guarantee. For professional RAID data recovery, we image every drive with PC-3000 & DeepSpar through write-blocked connections, then reconstruct the array offline. The original drives are never modified. A full breakdown of per-drive pricing tiers is published on our site. Rush service is available for an additional $100 per drive to move to the front of the queue.

Frequently Asked Questions

Can data be recovered after a failed RAID rebuild?
In most cases, yes. A failed rebuild means the controller could not complete the parity regeneration, but the original data remains on the surviving drives. Recovery involves imaging each drive individually with write-blocked connections and reconstructing the RAID array offline in software. Success depends on how many drives have physical faults and whether any post-failure operations (force-online, fsck, reinitialization) modified the on-disk state.
Should I retry the rebuild with the same replacement drive?
Not until you understand why the first rebuild failed. If the failure was caused by a URE on a surviving drive, retrying reads the same sectors again and encounters the same error. If the failure was a loose cable or controller timeout, fixing the root cause and retrying may work. In either case, image all drives before the second attempt so you have a fallback if the retry triggers a cascading failure.
Does the RAID level affect recovery chances after a rebuild failure?
Yes. RAID 6 and RAID 10 arrays have better recovery prospects than RAID 5 because they provide additional redundancy. However, after a partially completed rebuild, a RAID 6 array is in a mixed parity state: rebuilt stripes have updated parity while unrebuilt stripes still rely on the original parity layout. The actual remaining tolerance depends on why the rebuild failed. If a second drive caused the failure, the array may have zero remaining margin. RAID 5 has zero margin after the first failure, so any rebuild error is fatal to the array. RAID 10 tolerance depends on which mirror pair was affected.
Why do RAID 5 rebuilds fail more often on larger drives?
Consumer HDDs are rated for 1 unrecoverable read error (URE) per 10^14 bits read, which equals roughly 12.5 TB. A degraded 4-drive RAID 5 using 16 TB drives forces the controller to read 48 TB across the surviving members. At consumer URE rates, you'd statistically expect to hit 3-4 unreadable sectors during a single rebuild pass. Enterprise SAS drives are rated at 10^15 bits (125 TB per URE), which is why they're specified for RAID use. RAID 6 survives a single URE during rebuild because its second parity block (Reed-Solomon encoding) can reconstruct the missing data without the failed sector.
Why do ZFS resilvers succeed when hardware RAID 5 rebuilds fail?
Hardware RAID controllers are filesystem-blind. They rebuild by reading every sector on every surviving drive, including empty space. A 16 TB drive that's only 30% full still forces the controller through all 16 TB. ZFS is filesystem-aware; its resilver only reads allocated data blocks. If the pool is 30% full, ZFS reads roughly 30% of the disk surface, cutting the URE exposure by 70%. This is why TrueNAS and FreeNAS arrays using ZFS mirror or RAIDZ tolerate larger drives with fewer rebuild failures than equivalent hardware RAID 5 arrays.
How long does a RAID 5 or RAID 6 rebuild take?
An 8 TB RAID 5 array on 7200 RPM CMR SATA drives takes 15 to 20 hours under ideal conditions with no production I/O competing for disk bandwidth. RAID 6 takes longer because the controller recalculates two parity blocks (XOR plus Reed-Solomon) per stripe instead of one. A 4-drive array with 16 TB drives can take 40+ hours. Every hour the array spends rebuilding is an hour where a second drive failure collapses the entire volume. Drive-Managed SMR (shingled) drives can extend rebuild times from hours to days because their CMR write cache fills up and forces zone rewrites, stalling the controller.
Why did my WD Red drives fail during a RAID rebuild?
Certain WD Red models (WD20EFAX, WD40EFAX, WD60EFAX) use Drive-Managed Shingled Magnetic Recording (DM-SMR). During the sustained sequential writes of a rebuild, the drive's CMR cache fills up and forces the drive to rewrite overlapping shingled zones. This zone-rewrite process stalls the drive for seconds at a time, exceeding the hardware RAID controller's command timeout. The controller interprets the stall as a drive failure, drops the drive, and aborts the rebuild. WD did not originally disclose the SMR status of these models, and many NAS arrays were built with them unknowingly.
How much does data recovery cost after a RAID rebuild fails?
Recovery cost depends on the physical condition of the individual member drives, not the array size or RAID level. Each drive is priced independently based on the work required: $100 for a simple logical copy, up to $2,000 for drives with surface damage requiring platter work. A 4-drive RAID 5 where one drive has a head failure and the other three are logically intact might cost $600–$900 for the mechanical drive plus $100 each for imaging the healthy members. We don't charge flat "RAID recovery" fees or diagnostic fees. No data recovered, no fee charged.
What should I do if my Intel RST RAID rebuild reaches 100% and starts over?
This is a known bug with the Intel Optane Memory and Storage Management app affecting certain Intel RST software RAID configurations. Power down the system immediately. Do not let the rebuild loop continue; repeated parity regeneration on a degraded disk overwrites valid stripe data with each pass. The drives must be cloned sector-by-sector using write-blocked connections before any further software interaction. For details on why repeated rebuilds compound the damage, see our guide on why rebuilding a degraded array risks permanent data loss.
How much does data recovery cost if an SSD in my RAID fails with a SATAFIRM S11 error?
When a SATA SSD drops offline and re-identifies as SATAFIRM S11, the Phison controller has experienced a firmware panic. Recovering the failed SSD requires Flash Translation Layer (FTL) reconstruction via PC-3000 SSD, which falls in our $600–$900 firmware-level tier. The remaining healthy drives in the array still need sector-level imaging at $100 each. A 4-drive RAID 5 with one SATAFIRM S11 SSD and three healthy HDDs would run $600–$900 plus 3 x $100. No diagnostic fee. No data, no fee.

RAID rebuild failed and data is irreplaceable?

Free evaluation. Write-blocked drive imaging. Offline array reconstruction. No data, no fee.

(512) 212-9111
Mon-Fri 10am-6pm CT
No diagnostic fee
No data, no fee
4.9 stars, 1,837+ reviews