Why Your RAID Rebuild Failed
A RAID rebuild fails when the controller hits an unrecoverable read error on a surviving drive, when an SMR member stalls long enough for the controller to drop it, or when a drive without TLER enters a 60 to 120 second deep-recovery loop the controller refuses to wait for. The math on consumer drives makes failure a statistical expectation rather than a rare event. We recover failed rebuilds by imaging every member offline and assembling the array virtually on PC-3000 RAID Edition. The original chassis is never written to. Free evaluation. No data recovered means no charge.

What Do RAID Controller Status Codes Mean?
Before troubleshooting a failed rebuild, identify the exact status the controller is reporting. The terminology is consistent across LSI MegaRAID, Dell PERC, HPE SmartArray, and the Linux mdadm stack that Synology and QNAP wrap in their UIs.
- Degraded
- The array is running with one or more members missing or marked failed. Reads still succeed because the surviving members plus parity can reconstruct the missing data on the fly. Writes succeed but generate no parity protection for the missing member.
- Rebuilding
- The controller is actively copying reconstructed stripes onto a replacement member. Every sector of every surviving member must be read during this phase. Multi-TB rebuilds take many hours and stress every drive in the chassis.
- Failed
- The array has lost more members than its parity level can tolerate. RAID 5 lost two members; RAID 6 lost three. The controller stops servicing reads and writes. The data is not erased, but normal access through the controller is impossible until the array is reassembled.
- Offline
- The controller has taken the virtual disk out of service, typically because a rebuild was attempted, halted, or encountered an unrecoverable error mid-stream. Offline is the state most failed rebuilds settle into.
- Foreign Configuration
- The controller detected DDF metadata it did not write itself, usually because the drives were moved from another controller or because the array configuration was cleared on this controller. Importing or clearing a Foreign Configuration rewrites metadata on the member drives. Do neither without an image first.
- Unconfigured-Good
- The drive is detected by the controller, passes its self-test, and is available to be assigned to an array. After a failed rebuild, drives that were dropped due to a TLER timeout sometimes return to Unconfigured-Good on the next power cycle even though their media has marginal sectors.
- Unconfigured-Bad
- The controller flagged the drive as having excessive media errors. The drive is not necessarily mechanically dead; the controller has simply decided not to use it again. These drives are recoverable on PC-3000 Express with adaptive read retry.
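The Degraded, Rebuilding, and Failed states above reduce to a comparison between the number of missing members and the parity level's tolerance. A minimal sketch of that rule, assuming only RAID 5 and RAID 6 and using "Optimal" as a placeholder label for a fully healthy array (the function and table names are illustrative, not any controller's API):

```python
# Fault tolerance per parity level, as described above: RAID 5 survives one
# missing member, RAID 6 survives two. Other levels are left out on purpose.
PARITY_TOLERANCE = {5: 1, 6: 2}


def array_state(raid_level: int, missing_members: int, rebuilding: bool = False) -> str:
    """Return the state label implied by the number of missing members."""
    tolerance = PARITY_TOLERANCE[raid_level]
    if missing_members > tolerance:
        return "Failed"          # more losses than parity can cover
    if rebuilding:
        return "Rebuilding"      # replacement is being filled from parity
    if missing_members > 0:
        return "Degraded"        # still readable, reduced or no redundancy
    return "Optimal"             # placeholder label for a healthy array
```

Real firmware layers Offline, Foreign Configuration, and the Unconfigured states on top of this, so treat the function as a reading aid rather than a model of controller behavior.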
Why Do RAID 5 Rebuilds Fail at 90%?
Hard drives are sold with an unrecoverable read error (URE) specification, also called the bit error rate. Consumer SATA drives such as the WD Blue, Seagate Barracuda, and Toshiba P300 spec one URE per 10^14 bits read, which works out to about one unreadable sector per 12.5 TB of sequential reads. Enterprise drives such as the WD Ultrastar and Seagate Exos spec one URE per 10^15 bits, ten times better, or roughly one unreadable sector per 125 TB.
During a RAID 5 rebuild the controller must read every sector of every surviving member in order to XOR them together and reconstruct the missing data. A four-member RAID 5 with 8 TB drives, after losing one member, has three surviving members of 8 TB each. The controller must read 24 TB sequentially under sustained load. That is 1.92 times the URE budget of a single consumer drive. The expected number of UREs across the full rebuild pass is approximately 1.92 on consumer media and 0.19 on enterprise media.
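The arithmetic above can be reproduced in a few lines. The sketch below also converts the expected URE count into a probability of hitting at least one URE, using a Poisson approximation. That approximation is an assumption on our part: it treats the datasheet figure as a true average rate with independent errors, which real drives do not strictly follow.

```python
import math


def expected_ures(surviving_members: int, drive_tb: float, bits_per_ure: float) -> float:
    """Expected unreadable sectors when every surviving member is read once."""
    bits_read = surviving_members * drive_tb * 1e12 * 8  # decimal TB -> bits
    return bits_read / bits_per_ure


def p_at_least_one_ure(expected: float) -> float:
    """Poisson approximation: P(at least one URE) = 1 - e^(-expected)."""
    return 1.0 - math.exp(-expected)


# Four-member RAID 5 of 8 TB drives with one member lost: three survivors.
consumer = expected_ures(3, 8, 1e14)
enterprise = expected_ures(3, 8, 1e15)
print(f"consumer:   {consumer:.2f} expected UREs, "
      f"P(>=1) = {p_at_least_one_ure(consumer):.2f}")
print(f"enterprise: {enterprise:.2f} expected UREs, "
      f"P(>=1) = {p_at_least_one_ure(enterprise):.2f}")
```

Under this model the consumer array has roughly an 85 percent chance of meeting at least one URE during the pass, against roughly 17 percent for the enterprise array.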
When the controller hits a sector it cannot read, it cannot compute the missing XOR for that stripe. Different controllers handle this differently. Some halt the rebuild entirely and mark the array failed. Others write garbage parity for the affected stripe and continue, which produces a rebuild that completes but leaves silent corruption across an unknown subset of files.
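A toy stripe shows why one unreadable sector blocks reconstruction. This is a sketch of the XOR relationship only, with two-byte blocks and hypothetical helper names. It does not model any controller's firmware.

```python
from functools import reduce
from typing import Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(survivors: Sequence[Optional[bytes]]) -> Optional[bytes]:
    """Rebuild the missing block of one stripe from the surviving blocks.

    A survivor of None models a sector that returned an unrecoverable read
    error. With any survivor unreadable, the XOR cannot be computed.
    """
    if any(block is None for block in survivors):
        return None
    return xor_blocks(survivors)


# Four-member stripe: three data blocks plus their parity.
d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
parity = xor_blocks([d0, d1, d2])

# The member holding d1 is lost; the survivors are d0, d2, and parity.
assert reconstruct_missing([d0, d2, parity]) == d1

# A URE on d2 during the rebuild leaves this stripe unrecoverable.
assert reconstruct_missing([d0, None, parity]) is None
```

A controller that continues past the second case has to write something for that stripe, which is where the silent corruption described above comes from.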
The "stuck at 90 percent" plateau is the URE math playing out on a specific drive. A surviving member has a localized band of marginal sectors at a particular LBA range. The rebuild reads cleanly up to that LBA, hits the band, retries inside the TLER window, fails to recover the sectors in time, and drops the drive. The percentage where the rebuild stalls is just the offset of the bad-LBA band. Arrays with damage near the end of the LBA range plateau at 90 to 99 percent; arrays with damage near the start plateau much earlier.
SMR Drives Trigger Controller Timeout Mid-Rebuild
Drive-managed Shingled Magnetic Recording (SMR) drives use a small persistent conventional-recording (CMR) cache zone for incoming writes, then reorganize that data onto overlapping shingled tracks during idle periods. A RAID rebuild is the opposite of idle. It forces continuous sequential writes to the replacement drive while the surviving members are read at sustained sequential rates.
Once the CMR cache fills, the SMR drive must pause to flush its accumulated writes into the shingled zones. That flush can stall the drive for seconds at a time while tracks are rewritten in band order. Hardware RAID controllers expect responses inside a Time-Limited Error Recovery (TLER) or Error Recovery Control (ERC) window that defaults to 7 to 14 seconds on enterprise controllers and sits at the 7-second end of that range on some consumer cards. When the SMR pause exceeds that budget, the controller interprets the silence as drive death and drops the SMR member from the array.
Specific models known to ship as drive-managed SMR include the WD Red EFAX series (2 TB through 6 TB capacities), the Seagate Barracuda ST2000DM008 and ST4000DM004, and the Toshiba L200 and P300 families. None of these belong inside a parity RAID array. If one was placed as the replacement during a rebuild, the rebuild does not just slow down; it actively converts the array from single-fault degraded to double-fault failed.
TLER and ERC Mismatch Drops Healthy Drives
Desktop drives without TLER or ERC firmware (WD Green, WD Blue, Seagate Barracuda non-RAID variants) handle a marginal sector by entering a deep internal recovery cycle. The drive will reread the sector with adjusted parameters for up to 120 seconds before giving up. In a single-drive desktop the operating system tolerates this delay because nothing else depends on the drive responding within a fixed window.
Hardware RAID controllers do not tolerate that delay. The controller TLER or ERC window is 7 to 14 seconds on most enterprise cards. After that the controller declares the drive missing and drops it from the array. The drive itself may have eventually returned the sector successfully, but by the time it did, the controller had already moved on. This is how arrays full of perfectly functional drives end up double-faulted during a rebuild.
NAS and enterprise drives (WD Red Plus, WD Ultrastar, Seagate IronWolf and Exos) ship with a short TLER window (typically 7 seconds) so the drive returns a read failure inside the controller budget. The controller can then reconstruct the missing sector from parity in the normal way without dropping the drive. The difference between a desktop drive and a NAS or enterprise drive of similar physical specifications is almost entirely this firmware behavior.
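The timing mismatch can be expressed as a small decision function. This is a simplified sketch under stated assumptions: the drive either recovers the sector after `needed_s` seconds or gives up at its own limit, and the controller drops any drive that stays silent past its window. The names and the single-sector model are illustrative.

```python
from typing import Optional

DESKTOP_MAX_RECOVERY_S = 120.0  # upper bound quoted above for non-TLER drives


def controller_outcome(needed_s: float,
                       erc_limit_s: Optional[float],
                       controller_window_s: float) -> str:
    """Classify what happens when a drive meets one marginal sector.

    needed_s            seconds the drive would need to recover the sector
    erc_limit_s         the drive's TLER/ERC limit, or None for a desktop drive
    controller_window_s how long the controller waits before dropping the drive
    """
    give_up_s = erc_limit_s if erc_limit_s is not None else DESKTOP_MAX_RECOVERY_S
    response_s = min(needed_s, give_up_s)   # when the drive finally answers
    if response_s > controller_window_s:
        return "drive dropped"
    if needed_s <= give_up_s:
        return "sector returned"
    return "read error reported; sector rebuilt from parity"
```

For a sector that needs 30 seconds of retries and a controller window of 7 seconds, a desktop drive (`erc_limit_s=None`) yields "drive dropped", while a drive with a 7-second limit yields a reported read error that parity can absorb.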
Write-Hole Parity Corruption from Unclean Shutdown
A RAID 5 or RAID 6 write spans every member of a stripe. The controller writes new data blocks and a recalculated parity block. If power is lost between the data write and the parity write, the stripe ends up with new data and old parity, or new parity and old data. The controller has no way to know which blocks are current and which are stale.
On healthy hardware this is mitigated by a battery-backed write cache (BBWC) or flash-backed write cache (FBWC) on the controller. The cache survives the power loss and the controller replays the pending writes on the next boot. Consumer NAS units (Synology, QNAP, Buffalo) ship without battery-backed caches; their "write hole" protection depends on filesystem-layer journaling (ext4, XFS, Btrfs), which does not cover parity blocks.
When a write-hole stripe is later read during a rebuild, the parity does not match the data. The controller computes a missing block from inconsistent inputs and produces garbage. The garbage gets written to the replacement member as part of the rebuild. The rebuild completes successfully and reports no errors, but specific files stored across affected stripes are silently corrupt.
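The effect is easy to reproduce on a toy three-member stripe. The sketch below uses four-byte blocks and made-up contents. It shows that the block damaged by the rebuild is not the one that was being written when power was lost.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Consistent stripe on a three-member RAID 5: two data blocks plus parity.
d0_old, d1 = b"AAAA", b"BBBB"
parity_old = xor_blocks(d0_old, d1)

# Power is lost after the new d0 reaches disk but before parity is updated.
d0_new = b"CCCC"
on_disk = {"d0": d0_new, "d1": d1, "parity": parity_old}

# Later the member holding d1 fails and is rebuilt from d0 and parity.
rebuilt_d1 = xor_blocks(on_disk["d0"], on_disk["parity"])

# The rebuild "succeeds", yet d1 no longer matches what was stored.
assert rebuilt_d1 != d1
```

Nothing in the rebuild path raises an error here: the XOR always produces a block, so the corruption is only visible when someone opens the affected file.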
Reinstalling the Original Drive After a Spare Resynced
The most common destructive operator action is to reinstall the original failed drive after a hot spare has already started resyncing. The original drive contains data from before the failure and a lower RAID event count. Some controllers compare event counts and treat the higher count as authoritative. Others rely on the order in which drives respond at power-on and silently sync from whichever member they decide is current.
If the controller treats the stale drive as current, the resync overwrites the working spare and any other surviving members with the older data from the reintroduced drive. The array reports "resync complete" and the volume looks mounted normally, but every block that was modified between the original failure and the reintroduction is now lost.
If a drive was reinstalled and a resync started, power the chassis down immediately. Do not let the resync complete. The original data is still recoverable from the drives that have not yet been overwritten, but only if the resync is interrupted before it walks the full LBA range.
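The event-count comparison described above is simple to check by hand once the counters have been read from images of each member. A minimal sketch, assuming the counts are already extracted (the `Member` class and the numbers are hypothetical):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    slot: int
    events: int  # event counter read from this member's metadata


def stale_members(members: List[Member]) -> List[int]:
    """Return the slots whose event count is behind the newest member."""
    newest = max(m.events for m in members)
    return [m.slot for m in members if m.events < newest]


# Hypothetical array: slot 2 dropped out earlier and stopped counting.
array = [Member(0, 1520), Member(1, 1520), Member(2, 1377), Member(3, 1520)]
print(stale_members(array))
```

A member flagged here holds data from before the failure and should never be treated as the sync source.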
Commands That Destroy Your Array
If your rebuild failed, power down the chassis and stop. The commands below are the ones most often recommended on forums and in vendor knowledge bases for "recovering" a failed rebuild. Every one of them writes to the member drives and rules out a clean forensic recovery.
- megacli -PDMakeGood -PhysDrv [E:S] -aALL
- What it does: changes the DDF state of an Unconfigured-Bad drive to Unconfigured-Good and frequently triggers an immediate background initialization. Why it destroys data: the initialization overwrites the existing metadata that records which stripes belong to which array.
- MegaCli -CfgForeign -Clear -aALL
- What it does: tells the LSI controller to discard the Foreign Configuration metadata it found on the drives. Why it destroys data: the array geometry is in that metadata. Clearing it leaves the drives with valid user data but no record of how to assemble it.
- mdadm --create --assume-clean --level=5 --raid-devices=N ...
- What it does: creates a new mdadm superblock on every member and assumes parity is already consistent. Why it destroys data: the v1.2 superblock at offset 4 KiB is rewritten with new UUIDs and a new event count; the array geometry from the original create call (chunk size, layout, member order) is lost unless it happens to be identical, and silent corruption follows on the next write.
- mdadm --re-add /dev/sdX1 /dev/mdN on a heavily degraded drive
- What it does: tells mdadm the drive is current and only needs to replay the write-intent bitmap. Why it destroys data: if the drive was actually behind, mdadm marks it in-sync without resyncing, and every read from those stripes returns stale data.
- Synology DSM Storage Manager "Repair" button on a crashed volume
- What it does: runs a Synology-authored script that calls mdadm and lvm with parameters intended to bring the array back online. Why it destroys data: the script can overwrite md superblocks and LVM metadata on partition 3 of the surviving members. Read-only inspection on a separate Linux workstation is the safe alternative.
- "Force Online" or "Make Optimal" in LSI or PERC BIOS (F2) menus
- What it does: overrides the controller's decision that the array is offline. Why it destroys data: writes pending in the cache flush to the drives even though parity and data are inconsistent.
- QNAP Recovery Wizard "initialize" prompts
- What it does: formats the QNAP system partitions and rewrites the storage pool metadata. Why it destroys data: QTS stores its storage pool configuration database on partition 1 of the member drives. Initializing rewrites that database; the user data on the data partitions is still present but no longer addressable through QTS without manual LVM and Btrfs extraction.
- Online Capacity Expansion (OCE) or RAID level migration during a degraded state
- What it does: rewrites stripe geometry across the array while parity reconstruction is in progress. Why it destroys data: if the process halts, the array exists in a hybrid state. Sectors before the failure point use the new geometry; sectors after use the old. No standard tool can assemble the split-geometry volume without manual analysis.
How Different RAID Controllers Fail During Rebuild
The on-disk metadata format and the firmware behavior during a failed rebuild differ by controller family. The recovery posture is the same in every case (image first, assemble offline), but understanding what the controller did to the metadata before it gave up is what determines how fast the virtual assembly converges.
LSI MegaRAID and Broadcom
LSI and Broadcom controllers write Disk Data Format (DDF) metadata to the trailing sectors of every member drive. When a drive drops during a rebuild due to a TLER timeout, the controller marks the drive as Unconfigured-Bad or Offline. A removed and reinserted drive shows up as Foreign Configuration on the next boot. The MegaCLI tool can report all of this without writing to the drives, but the -PDMakeGood and -CfgForeign -Clear commands both modify DDF and should never run before forensic imaging.
Dell PERC H700 through H965
Dell PERC is built on LSI silicon with Dell-specific firmware. The PERC family is DDF-conformant but adds copyback behavior: if a global hot spare resynced into a failed slot and the original slot is later populated, the firmware automatically copies the spare back. A drive failure during copyback degrades the array a second time. The H965 has a documented Online Capacity Expansion bug that halts the transformation queue with a "drive count exceeded" error; arrays caught in that state must be imaged member-by-member before any attempt to clear the transformation queue.
HPE SmartArray P-series
HPE SmartArray controllers (P-series, Gen8 through Gen10) use a proprietary metadata layout rather than DDF, written to RAID Information Sectors (RIS) at the beginning of each member drive rather than the trailing sectors used by LSI and PERC. Advanced Data Guarding (ADG), the HPE name for RAID 6, runs through a transformation queue that interleaves rebuild operations with expansion operations. A URE during a transformation halts the queue and writes a log event. Do not change Rebuild Priority or Expand Priority in Smart Storage Administrator once a transformation halts; the change can flush the proprietary metadata in the RAID Information Sectors and complicate offline parsing.
Adaptec and Areca
Adaptec controllers running the aacraid driver maintain a configuration database on the member drives and a battery-backed (BBWC) or zero-maintenance (ZMCP) cache on the card. If the rebuild halts due to a power anomaly, pending parity calculations may still be sitting in NVRAM. Issuing arcconf task start commits cached writes to the drives, including any inconsistent parity computed mid-rebuild. Disconnect the drives, capture metadata offline, and assemble virtually with the array's detected rotation, which is often distinct from Linux mdadm defaults.
Linux mdadm
The mdadm v1.2 superblock sits 4 KiB into the member device. mdadm tracks an event count and a write-intent bitmap; when a member drops, its event count stops advancing while the array keeps writing. The next time mdadm sees that drive, it compares event counts and either marks the drive stale (requires full resync via --add) or replays the bitmap (--re-add). Using --re-add on a drive that should have required a full resync silently introduces stale blocks. Read-only assembly with --assemble --readonly against cloned images is the safe inspection path.
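For inspection without mdadm at all, the superblock fields can be read straight from a cloned image. The sketch below is an illustration, not a replacement for mdadm --examine. The field offsets (level at 72, chunk size at 88, device number at 160, event count at 200) follow the kernel's version-1 superblock structure as we understand it and should be verified against the kernel's md_p.h before being relied on. The synthetic image at the end exists only to exercise the parser.

```python
import io
import struct
from typing import BinaryIO, Optional

MD_SB_OFFSET = 4096        # v1.2 superblock sits 4 KiB into the member
MD_SB_MAGIC = 0xA92B4EFC   # md superblock magic, stored little-endian


def read_md_v12_summary(image: BinaryIO) -> Optional[dict]:
    """Read a few geometry fields from a v1.2 superblock; None if absent."""
    image.seek(MD_SB_OFFSET)
    sb = image.read(256)
    if len(sb) < 256:
        return None
    magic, major = struct.unpack_from("<II", sb, 0)
    if magic != MD_SB_MAGIC or major != 1:
        return None
    level, layout = struct.unpack_from("<iI", sb, 72)
    chunk_sectors, raid_disks = struct.unpack_from("<II", sb, 88)
    (dev_number,) = struct.unpack_from("<I", sb, 160)
    (events,) = struct.unpack_from("<Q", sb, 200)
    return {
        "level": level,
        "layout": layout,
        "chunk_kib": chunk_sectors * 512 // 1024,  # chunk size is in sectors
        "raid_disks": raid_disks,
        "dev_number": dev_number,
        "events": events,
    }


def synthetic_member(events: int, dev_number: int) -> io.BytesIO:
    """Build an in-memory stand-in for a cloned member image (test data)."""
    sb = bytearray(256)
    struct.pack_into("<II", sb, 0, MD_SB_MAGIC, 1)
    struct.pack_into("<iI", sb, 72, 5, 2)       # RAID 5, layout code 2
    struct.pack_into("<II", sb, 88, 1024, 4)    # 1024 sectors, 4 members
    struct.pack_into("<I", sb, 160, dev_number)
    struct.pack_into("<Q", sb, 200, events)
    return io.BytesIO(bytes(MD_SB_OFFSET) + bytes(sb))


summary = read_md_v12_summary(synthetic_member(events=1520, dev_number=2))
print(summary)
```

Against a real clone, open the image file in "rb" mode and pass the handle in; nothing in the function writes to it.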
Synology DSM and QNAP QTS
Synology DSM wraps Linux mdadm and LVM. Each member drive carries a small system partition (md raid1), a swap partition, and a data partition that joins the mdadm RAID, which is then exposed through LVM as a Btrfs or ext4 filesystem. A Volume Crashed alert in Storage Manager is mdadm halting on partition 3. QNAP QTS and QuTS hero use a similar layout but store a proprietary configuration database on partition 1; a hard reboot mid-rebuild can corrupt that database, after which the QNAP Recovery Wizard offers to "initialize" the drives. Decline. The user data on the data partitions is still readable through a Linux workstation with mdadm, LVM, and Btrfs or ZFS userspace tools.
Our Image-First, No Live Rebuild Process
- Free evaluation and documentation. Record the controller model, RAID level, member count, filesystem (ext4, XFS, Btrfs, ZFS, NTFS, VMFS), and every prior rebuild or repair attempt and the commands run. This step is free and informs which metadata layer is still intact.
- Label every drive bay. Each drive is marked with its physical slot number before removal and bagged individually. Slot order is required to validate stripe layout during virtual assembly.
- Capture RAID metadata from each member. Metadata location varies by controller family: LSI MegaRAID and Dell PERC store DDF in the trailing sectors of the member drives; HPE SmartArray writes its proprietary RAID Information Sectors (RIS) at the beginning of the drive; Adaptec aacraid uses DDF on modern controllers and proprietary structures on legacy models. For Linux software RAID, the mdadm v1.2 superblock sits at offset 4 KiB. Metadata capture runs against cloned images, not the originals.
- Write-blocked forensic imaging. Each member is connected through a hardware write-blocker to PC-3000 Express or DeepSpar Disk Imager. Adaptive retry and head-map analysis pull marginal sectors that the failed-rebuild controller had given up on inside its TLER window. Mechanical members (clicking, not spinning, head crash) receive donor head transplants on the 0.02 micron ULPA-filtered laminar-flow clean bench before imaging.
- Offline virtual assembly. PC-3000 RAID Edition loads the cloned images and assembles the array virtually using the captured metadata. The stripe size, parity rotation, and member order are read from the on-disk metadata rather than guessed.
- Parity recalculation and filesystem extraction. Stripes with missing data are reconstructed from parity. The assembled volume is mounted read-only. R-Studio and UFS Explorer handle filesystem-level recovery if the filesystem itself sustained damage during the failed rebuild.
- Delivery and secure purge. Recovered data is copied to your target media. After you confirm receipt, working copies are securely purged on request.
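Why member order and parity rotation matter during virtual assembly becomes clear once the stripe map is written out. The sketch below assumes the left-symmetric rotation that Linux md uses by default for RAID 5. Hardware controllers often rotate differently, so this is one possible layout, not the layout of any particular array. The function name is illustrative.

```python
from typing import Tuple


def locate_chunk(logical_chunk: int, members: int) -> Tuple[int, int, int]:
    """Map a logical chunk number to (stripe, data_member, parity_member).

    Left-symmetric RAID 5: parity starts on the last member and moves one
    member to the left on each stripe; data starts right after parity and
    wraps around.
    """
    data_per_stripe = members - 1
    stripe, index_in_stripe = divmod(logical_chunk, data_per_stripe)
    parity_member = data_per_stripe - (stripe % members)
    data_member = (parity_member + 1 + index_in_stripe) % members
    return stripe, data_member, parity_member


# Print the first four stripes of a four-member array.
for chunk in range(12):
    stripe, data_member, parity_member = locate_chunk(chunk, 4)
    print(f"chunk {chunk:2d}: stripe {stripe}, member {data_member}, "
          f"parity on member {parity_member}")
```

Swapping two members or picking the wrong rotation scrambles this map for every stripe, which is why the parameters are taken from the captured metadata instead of being guessed.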
How Much Does RAID Rebuild Failure Recovery Cost?
Per-Member Imaging
- Logical or firmware-level issues: $250 to $900 per drive. Covers filesystem corruption on the array, firmware module damage that prevents normal reads, and SMART threshold failures.
- Mechanical failures (head swap, motor seizure): $1,200 to $1,500 per drive with a 50% deposit. Donor parts are consumed during the transplant. Head swaps are performed on a validated laminar-flow clean bench before write-blocked cloning.
Array Reconstruction
- $400 to $800 depending on member count, filesystem type, and whether RAID parameters must be detected from raw data versus captured from surviving DDF or mdadm superblocks.
- PC-3000 RAID Edition performs parameter detection and virtual assembly from cloned member images. R-Studio and UFS Explorer handle filesystem-level extraction after reconstruction.
No Data = No Charge: if we recover nothing from your array, you owe $0. Free evaluation, no obligation.
Example: a four-member array with one mechanically failed member and three healthy members costs approximately $1,200 (head swap) + 3 × $250 (logical imaging) + $400 to $800 (reconstruction) = $2,350 to $2,750.
+$100 rush fee to move to the front of the queue. Full HDD pricing is published at our HDD recovery service page.
RAID Rebuild Failure Recovery Questions
Why does my RAID 5 rebuild fail at 90%?
What is the URE rate that causes RAID 5 rebuild failures?
Can I retry a failed RAID rebuild?
Does forcing a RAID drive online destroy data?
Why does the controller drop a drive during rebuild?
What is the difference between a failed rebuild and a failed drive?
Can SMR drives cause RAID rebuild failures?
How do I recover from a failed RAID 5 rebuild?
Data Recovery Standards & Verification
Our Austin lab operates on a transparency-first model. We use industry-standard recovery tools, including PC-3000 and DeepSpar, combined with strict environmental controls to make sure your hard drive is handled safely and properly. This approach allows us to serve clients nationwide with consistent technical standards.
Open-drive work is performed in a ULPA-filtered laminar-flow bench, validated to 0.02 µm particle count, verified using TSI P-Trak instrumentation.
Transparent History
Serving clients nationwide via mail-in service since 2008. Our lead engineer holds PC-3000 and HEX Akademia certifications for hard drive firmware repair and mechanical recovery.
Media Coverage
Our repair work has been covered by The Wall Street Journal and Business Insider, with CBC News reporting on our pricing transparency. Louis Rossmann has testified in Right to Repair hearings in multiple states and founded the Repair Preservation Group.
Aligned Incentives
Our "No Data, No Charge" policy means we assume the risk of the recovery attempt, not the client.
Technical Oversight
Louis Rossmann
Louis Rossmann's well-trained staff review our lab protocols to ensure technical accuracy and honest service. Since 2008, his focus has been on clear technical communication and accurate diagnostics rather than sales-driven explanations.
We believe in proving standards rather than just stating them. We use TSI P-Trak instrumentation to verify that clean-air benchmarks are met before any drive is opened.
See our clean bench validation data and particle test video
Your rebuild failed. Power down before doing anything else.
Free evaluation. No data = no charge. Mail-in from anywhere in the U.S.