Why Your RAID Rebuild Failed

Stop now and power the array off. Do not start another rebuild, do not click Repair, and do not run --assume-clean or mdadm --create. Forcing the array back together with a re-create or initialize command, or shoving a stale member back online, writes new parity across the surviving drives and permanently overwrites the data that is still intact.

Leave the chassis off and route the drives to our Mail-In RAID Recovery Service. All work is performed in-house at our Austin, TX lab, with nationwide service by mail-in. Free evaluation. No data recovered means no charge.

A RAID rebuild fails when the controller hits an unrecoverable read error on a surviving drive, when an SMR member stalls long enough for the controller to drop it, or when a drive without TLER enters a 60 to 120 second deep-recovery loop the controller refuses to wait for. On large consumer drives, worst-case spec math plus correlated mechanical wear makes a failed rebuild a realistic risk rather than a rare event. We recover failed rebuilds by imaging every member offline and assembling the array virtually on Data Extractor Express RAID Edition. The original chassis is never written to. A failed rebuild is one of the array states our RAID data recovery service handles.

Written by Louis Rossmann, Founder & Chief Technician. Updated July 2026. 21 min read.

Your Rebuild Failed. What Is the Safe Next Step?

Power the chassis down and stop. After a failed rebuild, clone each member sector-by-sector through a hardware write-blocker with ddrescue, PC-3000 Express, or DeepSpar Disk Imager, then assemble the array virtually with Data Extractor Express RAID Edition. Do this before any further rebuild, repair, --assume-clean, or PDMakeGood runs against the original drives.

Every command that "fixes" a degraded array in place writes to the member drives. A live rebuild pins all surviving members at close to 100% sustained read for 18 to 48 hours, which is the load that finishes off a marginal same-batch survivor and converts a single-fault degraded array into a double-fault failure.

Clone first, reconstruct from the clones, and the original platters keep their geometry intact no matter what the reconstruction attempt does. This is the image-first imperative, and it is the difference between a recoverable array and a destroyed one.

You do not need the original controller card to get the data back. Dell PERC and LSI MegaRAID write their DDF geometry to the trailing sectors of each member; HPE SmartArray writes a proprietary RAID Information Sector at the start of each drive; mdadm keeps its superblock on the member itself. The array geometry lives on the platters, so a dead or foreign controller does not destroy it.

RAID buys you availability, not a backup: a failed rebuild, a controller fault, or a correlated batch failure can take the whole array down at once, and that is exactly the array state our Mail-In RAID Recovery Service is built to handle.

Rebuild stuck at 90 to 99 percent, then dropped a drive: The sequential read pass hit a marginal-LBA band on a surviving member and the controller timed out inside its TLER or ERC window. Power the chassis down now. Do not retry the rebuild.
Controller reports not enough devices to start the array, or array failed: A second member dropped during the rebuild, which is a double fault on RAID 5. The drives still hold the data. Power down and do not initialize or clear the configuration.
You already clicked Repair, ran mdadm --create or --assume-clean, or re-added the original drive: Those commands overwrite metadata or resync stale blocks. Power down immediately to stop any running resync before it walks the full LBA range.
SMR member stalled and the controller ejected a healthy drive: The CMR cache overflowed and the multi-second flush exceeded the controller timeout. The drive is not dead. Power down and image it offline.

Every symptom above takes the same safe path: leave the chassis off and ship the labeled drives to the Austin, TX lab for per-member offline imaging on DeepSpar Disk Imager or PC-3000 Express, followed by virtual array reconstruction in software with Data Extractor Express RAID Edition. The original chassis is never written to. Free evaluation. No data recovered means no charge.

What Do RAID Controller Status Codes Mean?

Before troubleshooting a failed rebuild, identify the exact status the controller is reporting. The terminology is consistent across LSI MegaRAID, Dell PERC, HPE SmartArray, and the Linux mdadm stack that Synology and QNAP wrap in their UIs.

Degraded: The array is running with one or more members missing or marked failed. Reads still succeed because the surviving members plus parity can reconstruct the missing data on the fly. Writes still succeed, but the array has lost some or all of its redundancy until the missing member is rebuilt.
Rebuilding: The controller is actively copying reconstructed stripes onto a replacement member. Every sector of every surviving member must be read during this phase. Multi-TB rebuilds take many hours and stress every drive in the chassis.
Failed: The array has lost more members than its parity level can tolerate. For RAID 5 that means two members lost; for RAID 6, three. The controller stops servicing reads and writes. The data is not erased, but normal access through the controller is impossible until the array is reassembled.
Offline: The controller has taken the virtual disk out of service, typically because a rebuild was attempted, halted, or encountered an unrecoverable error mid-stream. Offline is the state most failed rebuilds settle into.
Foreign Configuration: The controller detected DDF metadata it did not write itself, usually because the drives were moved from another controller or because the array configuration was cleared on this controller. Importing or clearing a Foreign Configuration rewrites metadata on the member drives. Do neither without an image first.
Unconfigured-Good: The drive is detected by the controller, passes its self-test, and is available to be assigned to an array. After a failed rebuild, drives that were dropped due to a TLER timeout sometimes return to Unconfigured-Good on the next power cycle even though their media has marginal sectors.
Unconfigured-Bad: The controller flagged the drive as having excessive media errors. The drive is not necessarily mechanically dead; the controller has simply decided not to use it again. These drives are recoverable on PC-3000 Express with adaptive read retry.

Why Do RAID 5 Rebuilds Fail at 90%?

Rebuilds halt at a specific percentage when the sequential read pass hits a localized band of unreadable LBAs on a surviving drive. The controller times out inside its TLER window, drops the drive, and converts a single-fault rebuild into a double-fault failure. On large consumer arrays the dominant cause is mechanical, a marginal same-batch survivor failing under sustained read load, not a clean per-byte bit error.

Hard drives carry a worst-case unrecoverable read error (URE) specification, usually published as a non-recoverable read error rate and loosely called the bit error rate. Consumer SATA drives such as the WD Blue, Seagate Barracuda, and Toshiba P300 spec one URE per 10^14 bits read, which works out to about one unreadable sector per 12.5 TB of sequential reads. Enterprise drives such as the WD Ultrastar and Seagate Exos spec one URE per 10^15 bits, ten times better, or roughly one unreadable sector per 125 TB.

That number is a warranty floor, not a schedule: field studies (USENIX FAST latent-sector-error work; Backblaze fleet data) show the large majority of drives read far past 12.5 TB without a single URE, and read errors cluster on aging or marginal drives rather than striking on an independent per-byte basis.

During a RAID 5 rebuild the controller must read every sector of every surviving member in order to XOR them together and reconstruct the missing data. A four-member RAID 5 with 8 TB drives, after losing one member, has three surviving members of 8 TB each. The controller must read 24 TB sequentially under sustained load. Against the worst-case 10^14 consumer spec that read pass raises the probability of hitting a latent unreadable sector, but it does not guarantee one, and the dominant real-world failure driver is mechanical, not statistical.
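
The worst-case arithmetic for that 24 TB read pass can be sketched as follows. This is a minimal illustration that assumes every bit fails independently at exactly the published spec rate, which is the simplification the field data above argues against; the function name and the decimal-terabyte constant are illustrative choices, not part of any vendor tool.

```python
import math

def p_at_least_one_ure(bytes_read: float, bits_per_ure: float) -> float:
    """Probability of at least one URE, assuming independent per-bit errors."""
    bits = bytes_read * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 so the tiny per-bit rate is not rounded away
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_ure))

TB = 1e12                      # decimal terabyte, as drive vendors count capacity
surviving_read = 3 * 8 * TB    # three surviving 8 TB members

consumer = p_at_least_one_ure(surviving_read, 1e14)
enterprise = p_at_least_one_ure(surviving_read, 1e15)
print(f"consumer 1e14 spec:   {consumer:.1%}")
print(f"enterprise 1e15 spec: {enterprise:.1%}")
```

Under this idealized model the consumer spec gives roughly an 85% chance of at least one unreadable sector and the enterprise spec roughly 17%. Treat both as upper-bound bookkeeping against a warranty floor, not as a prediction for any particular array.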

Single-parity RAID 5 on large drives runs on razor-thin margins, which is why dual parity (RAID 6, RAIDZ2, SHR-2) is the sane minimum above roughly 12 TB of usable capacity.

What happens when a URE does land during a degraded rebuild depends on the controller, and the data inside that one stripe is lost in every case. Legacy block-level hardware RAID and low-end consumer controllers such as Intel RST hard-abort the rebuild and drop the volume offline. HP and HPE Smart Array P-series and E-series also abort: they drop the logical drive offline to protect integrity and flag POST Error 1784 or 1786.

Modern Dell PERC and LSI/Broadcom MegaRAID puncture instead, writing a bad-block placeholder over the stripe, completing the rebuild, and keeping the volume online with only the punctured stripe permanently lost. Linux mdadm records the unreadable LBA in its Bad Block Log and continues. ZFS RAIDZ finishes the resilver and names the exact corrupted file in zpool status -v. One URE does not take a modern array down, but the controllers that do abort produce the stuck rebuild that brings these arrays to the bench.

On the bench, degraded arrays rarely die from an independent per-byte bit error. They die because a full-surface parity rebuild pins every surviving member at close to 100% sustained read for 18 to 48 hours or more, pushing a marginal head, preamp, or scored platter into a secondary mechanical failure mid-rebuild.

Members also share a manufacturing batch, age, and thermal environment, so a second failure inside the rebuild window is positively correlated, not independent: a drive with one prior scan error is far more likely to fail within the next 60 days. The clean binomial coin-flip overstates random bit-rot while understating this correlated mechanical risk.

The "stuck at 90 percent" plateau is the URE math playing out on a specific drive. A surviving member has a localized band of marginal sectors at a particular LBA range. The rebuild reads cleanly up to that LBA, hits the band, retries inside the TLER window, fails to recover the sectors in time, and drops the drive. The percentage where the rebuild stalls is just the offset of the bad-LBA band.

Arrays with damage near the end of the LBA range plateau at 90 to 99 percent; arrays with damage near the start plateau much earlier.
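
The relationship between the bad-LBA band and the stall percentage is simple proportion. The sketch below assumes the controller walks each member linearly from LBA 0 and reports progress as the fraction of LBAs processed; real firmware may report progress differently, and the 8 TB member with 512-byte sectors is a hypothetical example.

```python
def rebuild_stall_percent(bad_lba: int, total_lbas: int) -> float:
    """Percent of a linear rebuild completed when the read pass reaches bad_lba."""
    if not 0 <= bad_lba < total_lbas:
        raise ValueError("bad_lba must lie inside the member's LBA range")
    return 100.0 * bad_lba / total_lbas

# Hypothetical 8 TB member with 512-byte logical sectors
total = 8 * 10**12 // 512
band_start = int(total * 0.93)   # assumed location of the marginal band
print(f"rebuild would stall near {rebuild_stall_percent(band_start, total):.1f}%")
```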

SMR Drives Trigger Controller Timeout Mid-Rebuild

Drive-managed SMR drives absorb incoming writes in a small conventional-recording cache, then pause to fold them into shingled tracks. A rebuild fills that cache and forces a multi-second flush stall, which the controller reads as a dead drive and drops the healthy member.

Drive-managed Shingled Magnetic Recording (SMR) drives use a small persistent conventional-recording (CMR) cache zone for incoming writes, then reorganize that data onto overlapping shingled tracks during idle periods. A RAID rebuild is the opposite of idle. It forces continuous sequential writes to the replacement drive while the surviving members are read at sustained sequential rates.

Once the CMR cache fills, the SMR drive must pause to flush its accumulated writes into the shingled zones. That flush can stall the drive for several seconds while tracks are rewritten in band order.

Hardware RAID controllers expect a response within a command timeout of roughly 7 to 14 seconds, the budget that drive-side Time-Limited Error Recovery (TLER) or Error Recovery Control (ERC) is designed to fit inside; some consumer cards sit at the 7-second end of that range. When the SMR pause exceeds that budget, the controller interprets the silence as drive death and drops the SMR member from the array.

Specific models known to ship as drive-managed SMR include the WD Red EFAX series (2 TB through 6 TB capacities), the Seagate Barracuda ST2000DM008 and ST4000DM004, and the Toshiba L200 and P300 families. None of these belong inside a parity RAID array. If one was placed as the replacement during a rebuild, the rebuild does not just slow down; it actively converts the array from single-fault degraded to double-fault failed.

TLER and ERC Mismatch Drops Healthy Drives

Desktop drives without TLER firmware enter a deep recovery loop of up to 120 seconds on a marginal sector. Hardware RAID controllers wait only 7 to 14 seconds, then declare the drive missing and drop a physically healthy member mid-rebuild.

Desktop drives without TLER or ERC firmware (WD Green, WD Blue, Seagate Barracuda non-RAID variants) handle a marginal sector by entering a deep internal recovery cycle. The drive will reread the sector with adjusted parameters for up to 120 seconds before giving up. In a single-drive desktop the operating system tolerates this delay because nothing else depends on the drive responding within a fixed window.

Hardware RAID controllers do not tolerate that delay. Most enterprise cards wait only 7 to 14 seconds for a response. After that the controller declares the drive missing and drops it from the array. The drive itself may have eventually returned the sector successfully, but by the time it did, the controller had already moved on.

This is how arrays full of perfectly functional drives end up double-faulted during a rebuild.

NAS and enterprise drives (WD Red Plus, WD Ultrastar, Seagate IronWolf and Exos) ship with a short TLER window (typically 7 seconds) so the drive returns a read failure inside the controller budget. The controller can then reconstruct the missing sector from parity in the normal way without dropping the drive. The difference between a desktop drive and a NAS or enterprise drive of similar physical specifications is almost entirely this firmware behavior.

SSD Cache and NVMe Members: Firmware Panics During Rebuild

SSD-based RAID arrays and NAS SSD caches introduce another failure mode that HDD-focused guides overlook. The sustained sequential read load of a rebuild can push consumer SSDs with aging NAND past their failure threshold.

SATA SSDs using the Phison S11 controller (PS3111, found in budget drives like the Kingston A400, Patriot Burst, and Silicon Power S55) are prone to a firmware lockout when TLC NAND cells degrade beyond the ECC correction threshold. The controller enters a protective state, the drive drops offline, and re-identifies in the BIOS as "SATAFIRM S11" instead of the original model name. The rebuild's sustained read load does not cause the NAND degradation, but it surfaces latent cell failures that normal desktop workloads would not trigger.

NVMe SSDs with Phison E12 controllers experience similar FTL corruption but drop off the PCIe bus or report hardware initialization failures instead. Silicon Motion SM2259XT controllers exhibit a different symptom: firmware corruption (typically from power loss during garbage collection or cache flush) causes the drive to report 0 bytes capacity or appear as unallocated in disk management.

These failures corrupt the Flash Translation Layer (FTL), the firmware mapping table that tracks which logical block lives on which physical NAND page. Consumer SSD recovery tools can't access a panicked controller. Recovery requires placing the SSD into Technological Mode using PC-3000 SSD to read the raw NAND and reconstruct the block mapping directly from it.

For arrays mixing SSDs and HDDs, the panicked SSD is priced at the firmware-level SSD tier ($600–$900) while healthy HDDs image at the standard $100 rate. If a member SSD fails this way during a rebuild, the same imaging-first approach applies: image every drive before attempting any reconstruction.

Write-Hole Parity Corruption from Unclean Shutdown

If power is lost between a stripe's data write and its parity write, the stripe is left with stale parity. A later rebuild recomputes missing blocks from that stale parity and writes silent corruption. Consumer NAS units lack the battery-backed cache that prevents it.

A RAID 5 or RAID 6 write spans every member of a stripe. The controller writes new data blocks and a recalculated parity block. If power is lost between the data write and the parity write, the stripe ends up with new data and old parity, or new parity and old data. The controller has no way to know which blocks are current and which are stale.

On healthy hardware this is mitigated by a battery-backed write cache (BBWC) or flash-backed write cache (FBWC) on the controller. The cache survives the power loss and the controller replays the pending writes on the next boot. Consumer NAS units (Synology, QNAP, Buffalo) ship without battery-backed caches; their "write hole" protection depends on filesystem-layer journaling or copy-on-write (ext4, XFS, Btrfs), which does not cover parity blocks.

When a write-hole stripe is later read during a rebuild, the parity does not match the data. The controller computes a missing block from inconsistent inputs and produces garbage. The garbage gets written to the replacement member as part of the rebuild. The rebuild completes successfully and reports no errors, but specific files stored across affected stripes are silently corrupt.
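
The mechanism above can be reproduced in a few lines. This is a toy sketch, assuming a three-member RAID 5 stripe with 4-byte blocks and plain XOR parity; the block contents and variable names are made up for illustration and do not model any specific controller.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks, which is how RAID 5 parity is computed."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe of a three-member RAID 5: two data blocks plus parity.
d0_old, d1 = b"AAAA", b"BBBB"
parity_old = xor_blocks(d0_old, d1)

# A write replaces d0, but power is lost before the parity block is rewritten.
d0_new = b"CCCC"
parity_on_disk = parity_old          # stale parity left on the platters

# Later the member holding d1 fails and the rebuild recomputes it from what remains.
d1_rebuilt = xor_blocks(d0_new, parity_on_disk)

print("rebuilt block matches original:", d1_rebuilt == d1)
```

The XOR itself raises no error, which is why the rebuild reports success: the controller has no checksum telling it that the inputs disagreed.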

Reinstalling the Original Drive After a Spare Resynced

Reinstalling the original failed drive after a hot spare has resynced is the most common destructive operator action. A controller that treats the lower-event-count drive as current resyncs stale sectors over good data, and every block changed since the failure is lost.

The original drive contains data from before the failure and a lower RAID event count than the members that kept running. Some controllers compare event counts and treat the higher count as authoritative. Others rely on the order in which drives respond at power-on and silently sync from whichever member they decide is current.

If the controller treats the stale drive as current, the resync overwrites the working spare and any other surviving members with the older data from the reintroduced drive. The array reports "resync complete" and the volume looks mounted normally, but every block that was modified between the original failure and the reintroduction is now lost.

If a drive was reinstalled and a resync started, power the chassis down immediately. Do not let the resync complete. The original data is still recoverable from the drives that have not yet been overwritten, but only if the resync is interrupted before it walks the full LBA range.

Commands That Destroy Your Array

If your rebuild failed: power down the chassis and stop. The commands below are the ones most often recommended on forums and in vendor knowledge bases for "recovering" a failed rebuild. Every one of them writes to the member drives and forecloses on a clean forensic recovery.

MegaCli -PDMakeGood -PhysDrv [E:S] -aALL
What it does: changes the firmware state of an Unconfigured-Bad drive to Unconfigured-Good so the controller can address it again. Why it destroys data: if auto-rebuild is enabled, the controller may immediately begin rebuilding the degraded array onto this drive, overwriting the original pre-failure data with recomputed parity.
MegaCli -CfgForeign -Clear -aALL
What it does: tells the LSI controller to discard the Foreign Configuration metadata it found on the drives. Why it destroys data: the array geometry is in that metadata. Clearing it leaves the drives with valid user data but no record of how to assemble it.
mdadm --create --assume-clean --level=5 --raid-devices=N ...
What it does: creates a new mdadm superblock on every member and assumes parity is already consistent. Why it destroys data: the v1.2 superblock at offset 4 KiB is rewritten with new UUIDs and a new event count; the array geometry from the original create call (chunk size, layout, member order) is lost unless it happens to be identical, and silent corruption follows on the next write.
mdadm /dev/mdN --re-add /dev/sdX1 on a stale member
What it does: tells mdadm the drive is current and only needs to replay the write-intent bitmap. Why it destroys data: if the drive was actually behind, mdadm marks it in-sync without resyncing, and every read from those stripes returns stale data.
Synology DSM Storage Manager "Repair" button on a crashed volume
What it does: runs a Synology-authored script that calls mdadm and lvm with parameters intended to bring the array back online. Why it destroys data: the script can overwrite md superblocks and LVM metadata on partition 3 of the surviving members. Read-only inspection on a separate Linux workstation is the safe alternative.
"Force Online" or "Make Optimal" in LSI or PERC BIOS (F2) menus
What it does: overrides the controller's decision that the array is offline. Why it destroys data: writes pending in the cache flush to the drives even though parity and data are inconsistent.
QNAP Recovery Wizard "initialize" prompts
What it does: formats the QNAP system partitions and rewrites the storage pool metadata. Why it destroys data: QTS stores its storage pool configuration database on partition 1 of the member drives. Initializing rewrites that database; the user data on the data partitions is still present but no longer addressable through QTS without manual LVM and Btrfs extraction.
Online Capacity Expansion (OCE) or RAID level migration during a degraded state
What it does: rewrites stripe geometry across the array while parity reconstruction is in progress. Why it destroys data: if the process halts, the array exists in a hybrid state. Sectors before the failure point use the new geometry; sectors after use the old. No standard tool can assemble the split-geometry volume without manual analysis.

Assessing the Array State

Before deciding on a course of action, gather information about the array state without modifying anything on disk. The goal is to determine whether the failure was transient (cable, timeout) or physical (media degradation, mechanical fault).

Record the controller error. The exact message narrows the diagnosis. "Media error on PD 2 at LBA X" points to a specific drive and sector. "PD 3 not responding" suggests a mechanical or connection failure. Note the rebuild percentage at failure.
Check SMART data on all drives. Use smartctl -a /dev/sdX (Linux) or the controller's management utility. Key attributes: Reallocated_Sector_Ct (sectors already moved to spare areas), Current_Pending_Sector (sectors queued for reallocation), and Offline_Uncorrectable (sectors that failed offline scan). Non-zero values on any of these indicate degraded media.
Document the RAID configuration. Record the controller model, firmware version, RAID level, stripe size, write policy (write-back vs write-through), and number of drives. This information is required for offline reconstruction if controller metadata is damaged.
Label every drive. Mark each drive with its physical slot number using tape or a marker on the drive itself (not just the tray). If drives are removed for imaging, the slot mapping must be preserved.
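
The SMART check in the list above can be scripted against saved smartctl output so that nothing beyond the read-only query touches the drives. The sketch below assumes the classic ATA attribute table layout printed by smartctl -A (ten whitespace-separated columns ending in RAW_VALUE); the sample rows and their values are invented for illustration, and NVMe or SAS drives report health in a different format.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def degraded_attributes(smartctl_text: str) -> dict[str, int]:
    """Return the watched SMART attributes whose raw value is non-zero."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Row layout: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                flagged[fields[1]] = int(raw)
    return flagged

# Illustrative sample only; these numbers are not from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       24
"""

print(degraded_attributes(SAMPLE))
```

An empty result does not prove a drive is healthy; it only means these three counters are still at zero.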

For detailed guidance on reading controller logs across Dell PERC, HPE SmartArray, LSI MegaRAID, and Linux mdadm, see the degraded RAID troubleshooting guide.

When You Can Fix This Yourself

Not every failed rebuild requires professional recovery. The following scenarios can often be resolved by the administrator.

The rebuild failed due to a transient error. If the controller dropped a drive because of a timeout (not a URE or mechanical failure) and SMART data on all drives is clean, the issue may be a loose SATA/SAS cable, a failing backplane connector, or a controller port problem. Image the drives first as a precaution, then reseat cables, test on a different port, and attempt the rebuild again.
You have recent, verified backups. If backup integrity has been confirmed (not just backup job completion), restore from the backup. This is the correct answer for any array containing replaceable data.
Software RAID (mdadm) with a single-sector URE. If the rebuild is mdadm-based and the error is a single-sector URE, you can use ddrescue to image the affected drive (skipping the bad sector), then reassemble the array from images.
RAID 6 or RAID 10 after a non-fatal rebuild failure. If a RAID 6 rebuild failed due to a non-fatal error (such as a URE on a single stripe) rather than a complete second drive failure, the array may still be accessible in degraded mode. The array is in a mixed parity state, not a clean single-failure degradation; rebuilt stripes carry updated parity while unrebuilt stripes retain the original layout. If a RAID 10 rebuild failed within one mirror pair, the other pairs remain intact. Check controller status. If the volume is still mounted, copy data off immediately.

Example: In a software mdadm RAID 5 array, if a rebuild fails because a member drive returns a read error on a single sector, it is often possible to use ddrescue to image all drives. By assembling the array offline using cloned images, the data can be extracted. The single unreadable sector typically only affects the specific file block mapped to that physical location, leaving the rest of the filesystem intact.
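
Why one unreadable sector costs only one stripe can be shown with a toy offline rebuild. This is a sketch under stated assumptions: the member images are small in-memory byte strings, parity is plain XOR, and the set of unreadable chunks stands in for the gaps a ddrescue map file would report. The helper names are hypothetical and this is not how any particular recovery tool is implemented.

```python
from functools import reduce

def xor_blocks(blocks) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(images, missing: int, chunk: int, bad: set):
    """Recompute the failed member chunk by chunk from the surviving images.

    bad holds (member_index, chunk_index) pairs the imager could not read.
    Returns (rebuilt_bytes, indexes_of_lost_chunks).
    """
    survivors = [i for i in range(len(images)) if i != missing]
    n_chunks = len(images[survivors[0]]) // chunk
    out, lost = bytearray(), []
    for c in range(n_chunks):
        if any((m, c) in bad for m in survivors):
            out += bytes(chunk)      # zero-fill: this stripe cannot be recomputed
            lost.append(c)
            continue
        # In RAID 5 any one member equals the XOR of the others, whatever the parity rotation.
        out += xor_blocks([images[m][c * chunk:(c + 1) * chunk] for m in survivors])
    return bytes(out), lost

chunk = 4
m0 = b"aaaabbbbcccc"
m1 = b"ddddeeeeffff"
m2 = xor_blocks([m0, m1])            # stands in for the member that later failed

rebuilt, lost = rebuild_missing([m0, m1, None], missing=2, chunk=chunk, bad={(1, 1)})
print("lost stripes:", lost)
print("other stripes intact:", rebuilt[:4] == m2[:4] and rebuilt[8:] == m2[8:])
```

Working from images means the lost-stripe list can be mapped back to specific files afterwards, without any further reads from the original drives.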

When Professional Imaging Is the Right Call

Some rebuild failure scenarios leave the array in a state that cannot be safely resolved with standard administrator tools.

Multiple physical drive failures. If two or more drives have mechanical problems (clicking, not spinning, SMART reporting thousands of reallocated sectors), the drives need to be imaged with hardware that can manage bad sectors, weak heads, and firmware faults at a level ddrescue cannot.
Partial rebuild corrupted parity data. If the controller wrote partial parity updates before the rebuild failed, the array cannot be reassembled using either the pre-rebuild or post-rebuild state without analyzing which stripes were modified. This requires forensic RAID reconstruction that compares parity states across drives.
Controller metadata is damaged or missing. If the controller BIOS no longer shows the virtual disk, or shows it as "Foreign" or "Missing," the metadata defining stripe size, drive order, and parity rotation may be corrupted. Reconstruction requires scanning the raw drives to detect RAID parameters from data patterns.
Post-failure operations already modified the drives. If someone has run force-online, fsck, or reinitialized the virtual disk, the on-disk state has been modified. Recovery is still possible in many cases, but the window narrows with each modification.

For RAID data recovery involving physical drive faults, we image each drive with PC-3000 and DeepSpar Disk Imager through write-blocked connections, then reconstruct the array offline in software. The original drives are never written to. For RAID 5 arrays with partial rebuild corruption, we analyze stripe-level parity to determine which sections use pre-rebuild vs post-rebuild data.

How Different RAID Controllers Fail During Rebuild

On a rebuild read error the controller family decides the outcome. Dell PERC and LSI MegaRAID puncture the stripe and continue; Linux mdadm logs the unreadable LBA to its Bad Block Log and continues; legacy and HP Smart Array controllers hard-abort and drop the volume offline.

The on-disk metadata format and the firmware behavior during a failed rebuild differ by controller family. The recovery posture is the same in every case (image first, assemble offline), but understanding what the controller did to the metadata before it gave up is what determines how fast the virtual assembly converges.

LSI MegaRAID and Broadcom

LSI and Broadcom controllers write Disk Data Format (DDF) metadata to the trailing sectors of every member drive. When a drive drops during a rebuild due to a TLER timeout, the controller marks the drive as Unconfigured-Bad or Offline. A removed and reinserted drive shows up as Foreign Configuration on the next boot. The MegaCLI tool can report all of this without writing to the drives, but the -PDMakeGood and -CfgForeign -Clear commands both modify DDF and should never run before forensic imaging.

Dell PERC H700 through H965

Dell PERC is built on LSI silicon with Dell-specific firmware. The PERC family is DDF-conformant but adds copyback behavior: if a global hot spare resynced into a failed slot and the original slot is later populated, the firmware automatically copies the spare back. A drive failure during copyback degrades the array a second time. The H965 has a documented Online Capacity Expansion bug that halts the transformation queue with a "drive count exceeded" error; arrays caught in that state must be imaged member-by-member before any attempt to clear the transformation queue.

HPE SmartArray P-series

HPE SmartArray controllers (P-series, Gen8 through Gen10) use a proprietary metadata layout rather than DDF, written to RAID Information Sectors (RIS) at the beginning of each member drive rather than the trailing sectors used by LSI and PERC. Advanced Data Guarding (ADG), the HPE name for RAID 6, runs through a transformation queue that interleaves rebuild operations with expansion operations. A URE during a transformation halts the queue and writes a log event. Do not change Rebuild Priority or Expand Priority in Smart Storage Administrator once a transformation halts; the change can flush the proprietary metadata in the RAID Information Sectors and complicate offline parsing.

Adaptec and Areca

Adaptec controllers running the aacraid driver maintain a configuration database on the member drives and a battery-backed (BBWC) or zero-maintenance (ZMCP) cache on the card. If the rebuild halts due to a power anomaly, pending parity calculations may still be sitting in NVRAM. Issuing arcconf task start commits cached writes to the drives, including any inconsistent parity computed mid-rebuild. Disconnect the drives, capture metadata offline, and assemble virtually with the array's detected rotation, which is often distinct from Linux mdadm defaults.

Linux mdadm

The mdadm v1.2 superblock sits 4 KiB into the member device. mdadm tracks an event count and a write-intent bitmap; when a member drops, its event count stops advancing while the array keeps writing. The next time mdadm sees that drive, it compares event counts and either marks the drive stale (requires full resync via --add) or replays the bitmap (--re-add). Using --re-add on a drive that should have required a full resync silently introduces stale blocks. Read-only assembly with --assemble --readonly against cloned images is the safe inspection path.
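
The event-count comparison can be done by hand before anything is assembled. The sketch below parses saved text from mdadm --examine and lists the members that are behind the newest count. It assumes the "Events : N" line that mdadm prints for v1.x superblocks; the device names and sample text are hypothetical and would normally be loop devices attached to cloned images, not the original drives.

```python
import re

def event_count(examine_text: str) -> int:
    """Extract the event count from saved `mdadm --examine` output for one member."""
    match = re.search(r"^\s*Events\s*:\s*(\d+)\s*$", examine_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Events line found")
    return int(match.group(1))

def stale_members(examine_by_device: dict[str, str]) -> list[str]:
    """Return devices whose event count is lower than the newest one seen."""
    counts = {dev: event_count(text) for dev, text in examine_by_device.items()}
    newest = max(counts.values())
    return sorted(dev for dev, n in counts.items() if n < newest)

# Hypothetical excerpts; real output contains many more fields.
saved = {
    "/dev/loop0": "     Raid Level : raid5\n         Events : 18432\n",
    "/dev/loop1": "     Raid Level : raid5\n         Events : 18432\n",
    "/dev/loop2": "     Raid Level : raid5\n         Events : 17990\n",
}
print(stale_members(saved))
```

A member flagged here is the one that should not be passed to --re-add without understanding how far behind it is.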

Intel RST Rebuild Loops

Intel Rapid Storage Technology (RST) and the Intel Optane Memory and Storage Management app have a documented bug where a RAID 5 rebuild reaches 100% completion, crashes the application, and restarts the rebuild from 0%. This has been reported across multiple RST versions, from ICH10R-era controllers through modern chipsets. Each loop pass forces a full sequential read of all surviving members and rewrites the entire replacement drive from scratch, and if the loop also triggers a consistency check, parity on the surviving drives may be recalculated and overwritten. If the rebuild loops: power off the system, do not let it restart, and image all member drives through write-blocked connections before interacting with the RST software again.

Synology DSM and QNAP QTS

Synology DSM wraps Linux mdadm and LVM. Each member drive carries a small system partition (md raid1), a swap partition, and a data partition that joins the mdadm RAID, which is then exposed through LVM as a Btrfs or ext4 filesystem. A Volume Crashed alert in Storage Manager usually means mdadm has halted the data array on partition 3. QNAP QTS and QuTS hero use a similar layout but store a proprietary configuration database on partition 1; a hard reboot mid-rebuild can corrupt that database, after which the QNAP Recovery Wizard offers to "initialize" the drives. Decline. The user data on the data partitions is still readable on a Linux workstation with mdadm, LVM, and Btrfs or ZFS userspace tools.

Process14/16

Our Image-First, No Live Rebuild Process

We image every member of a failed-rebuild array through hardware write-blockers, extract the RAID metadata from the cloned images, and assemble the array virtually on Data Extractor Express RAID Edition. The original chassis is never written to and no live rebuild is ever attempted on the customer's drives.

Free evaluation and documentation. Record the controller model, RAID level, member count, filesystem (ext4, XFS, Btrfs, ZFS, NTFS, VMFS), and every prior rebuild or repair attempt and the commands run. This step is free and informs which metadata layer is still intact.
Label every drive bay. Each drive is marked with its physical slot number before removal and bagged individually. Slot order is required to validate stripe layout during virtual assembly.
Capture RAID metadata from each member. Metadata location varies by controller family: LSI MegaRAID and Dell PERC store DDF in the trailing sectors of the member drives; HPE SmartArray writes its proprietary RAID Information Sectors (RIS) at the beginning of the drive; Adaptec aacraid uses DDF on modern controllers and proprietary structures on legacy models. For Linux software RAID, the mdadm v1.2 superblock sits at offset 4 KiB. Metadata is read from the cloned images produced by the imaging step, never from the original drives.
Write-blocked forensic imaging. Each member is connected through a hardware write-blocker to PC-3000 Express or DeepSpar Disk Imager. Adaptive retry and head-map analysis pull marginal sectors that the failed-rebuild controller had given up on inside its TLER window. Mechanical members (clicking, not spinning, head crash) receive donor head transplants on the 0.02 micron ULPA-filtered laminar-flow clean bench before imaging.
Offline virtual assembly. Data Extractor Express RAID Edition loads the cloned images and assembles the array virtually using the captured metadata. The stripe size, parity rotation, and member order are read from the on-disk metadata rather than guessed.
Parity recalculation and filesystem extraction. Stripes with missing data are reconstructed from parity. The assembled volume is mounted read-only. R-Studio and UFS Explorer handle filesystem-level recovery if the filesystem itself sustained damage during the failed rebuild.
Delivery and secure purge. Recovered data is copied to your target media. After you confirm receipt, working copies are securely purged on request.
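The parity reconstruction mentioned in the steps above rests on a simple property of RAID 5: the parity block is the XOR of the data blocks in a stripe, so any one missing block equals the XOR of the survivors. The sketch below is a toy illustration only. The block contents and the xor_blocks helper are hypothetical, and it does not represent how Data Extractor or any controller implements this.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))

# Toy stripe on a four-member RAID 5: three data blocks plus parity.
d0, d1, d2 = b"ABCD", b"EFGH", b"IJKL"
parity = xor_blocks([d0, d1, d2])

# Treat the member holding d1 as unreadable and rebuild its block
# from the two surviving data blocks and the parity block.
rebuilt_d1 = xor_blocks([d0, d2, parity])
print(rebuilt_d1 == d1)
```

The same property explains why a second unreadable block in the same stripe cannot be recovered from single parity: one equation cannot solve for two unknowns.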

If a rebuild is currently running: power the chassis down. An in-progress rebuild on stressed members makes things worse, not better. Powered-off drives take no further rebuild wear while you arrange evaluation.

Pricing15/16

How Much Does RAID Rebuild Failure Recovery Cost?

Pricing is per member drive based on the failure type of each drive, plus a flat array reconstruction fee of $400 to $800. The reconstruction fee covers offline virtual assembly with Data Extractor Express RAID Edition, parity validation, and filesystem extraction.

Per-Member Imaging

Logical or firmware-level issues: $250 to $900 per drive. Covers filesystem corruption on the array, firmware module damage that prevents normal reads, and SMART threshold failures.
Mechanical failures (head swap, motor seizure): $1,200 to $1,500 per drive with a 50% deposit. Donor parts are consumed during the transplant. Head swaps are performed on a validated laminar-flow clean bench before write-blocked cloning.

Array Reconstruction

$400 to $800 depending on member count, filesystem type, and whether RAID parameters must be detected from raw data versus captured from surviving DDF or mdadm superblocks.
Data Extractor Express RAID Edition performs parameter detection and virtual assembly from cloned member images. R-Studio and UFS Explorer handle filesystem-level extraction after reconstruction.

No Data = No Charge: if we recover nothing from your array, you owe $0. Free evaluation, no obligation.

Example: a four-member array with one mechanically failed member and three healthy members costs approximately $1,200 (head swap) + 3 × $250 (logical imaging) + $400 to $800 (reconstruction) = $2,350 to $2,750.
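The arithmetic in that example can be reproduced with a short sketch. The function name and its inputs are hypothetical. The per-drive figures are the low-end prices used in the example, and an actual quote can fall anywhere inside the published ranges.

```python
def estimate_total(member_fees: list[int],
                   reconstruction_range: tuple[int, int] = (400, 800)
                   ) -> tuple[int, int]:
    """Return the (low, high) total in dollars: the sum of per-member
    imaging fees plus the flat array reconstruction fee range."""
    imaging = sum(member_fees)
    low, high = reconstruction_range
    return imaging + low, imaging + high

# One head swap at $1,200 plus three logical-level members at $250 each.
print(estimate_total([1200, 250, 250, 250]))  # (2350, 2750)
```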

+$100 rush fee to move to the front of the queue. Full HDD pricing is published at our HDD recovery service page.

Faq16/16

RAID Rebuild Failure Recovery Questions

Why does my RAID 5 rebuild fail at 90%?

Rebuilds halt at a specific percentage when the sequential read pass hits a localized band of unreadable LBAs on one of the surviving members. The controller retries the bad sectors within its TLER or ERC window, the drive fails to recover them in time, and the controller drops the surviving drive. A rebuild that was reconstructing a single missing member becomes a double-fault failure. The percentage is just the offset of the bad-LBA band; arrays with damage near the end of the LBA range plateau at 90 to 99 percent, arrays with damage near the start plateau much earlier.
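As a rough illustration, the stall percentage can be converted to an approximate sector offset. This sketch assumes the controller reports progress linearly across the LBA range and that the member exposes 512-byte logical sectors. Both are assumptions, and the function name is hypothetical.

```python
def stall_lba(percent_complete: float, capacity_bytes: int,
              sector_size: int = 512) -> int:
    """Approximate LBA where a rebuild stalled, assuming progress is
    reported linearly across the member's sector range."""
    total_sectors = capacity_bytes // sector_size
    return int(total_sectors * percent_complete / 100)

# An 8 TB member (decimal terabytes) whose rebuild halted at 90%.
print(stall_lba(90, 8_000_000_000_000))  # 14062500000
```

The result is only a starting point for locating the damaged band during offline imaging; the imager's own sector map is the authoritative record.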

What is the URE rate that causes RAID 5 rebuild failures?

Consumer SATA drives (WD Blue, Seagate Barracuda, Toshiba P300) spec one unrecoverable read error per 10^14 bits, which works out to roughly one bad sector per 12.5 TB of sequential reads. Enterprise drives (WD Ultrastar, Seagate Exos) spec one per 10^15 bits, about 125 TB. That spec is a worst-case warranty floor, not a schedule: field studies (USENIX FAST, Backblaze) show most drives read far past it clean. A four-member RAID 5 rebuild on 8 TB consumer drives forces 24 TB of reads across the three survivors, so the roughly two UREs the spec implies are a worst-case upper bound that scales with array size and drive age, not a count of errors the rebuild will hit. The more common real driver of a failed rebuild is mechanical: 18 to 48 hours of sustained read finishing off a marginal same-batch survivor. When a URE does land, modern Dell PERC and LSI/Broadcom MegaRAID controllers puncture the affected stripe and continue (losing only that stripe); Linux mdadm logs the LBA to its Bad Block Log and continues; legacy and low-end controllers abort the rebuild.
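The spec-sheet arithmetic can be worked through in a few lines. The sketch below assumes a drive that performs exactly at its published URE rate with independent errors (a Poisson approximation), which is the pessimistic bound described above rather than observed field behavior. The function name is hypothetical.

```python
import math

def ure_exposure(bytes_read: float, bits_per_error: float) -> tuple[float, float]:
    """Return (expected UREs, probability of at least one URE) for a
    drive performing exactly at its spec-sheet rate. Errors are assumed
    independent, so this is a worst-case bound, not a forecast."""
    expected = bytes_read * 8 / bits_per_error
    p_at_least_one = 1 - math.exp(-expected)
    return expected, p_at_least_one

TB = 1e12  # decimal terabyte, in bytes

# Four-member RAID 5 on 8 TB drives: three survivors are read in full.
for label, rate in (("consumer 10^14", 1e14), ("enterprise 10^15", 1e15)):
    expected, p = ure_exposure(3 * 8 * TB, rate)
    print(f"{label}: {expected:.2f} expected UREs, {p:.0%} chance of at least one")
```

At the consumer spec the 24 TB read implies about 1.92 expected errors; at the enterprise spec, about 0.19. Real drives usually do better than either figure.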

Can I retry a failed RAID rebuild?

Retrying compounds the problem. Each rebuild pass forces another full sequential read across already-stressed members. Drives with marginal heads progress from occasional read errors to complete head failure across multiple rebuild attempts. If the surviving drives have any thermal, mechanical, or media weakness, the second and third rebuild cycles convert those weaknesses into hard failures. Power down the chassis, label the drives by slot, and image each member with hardware write-blockers before any further rebuild attempt.

Does forcing a RAID drive online destroy data?

Yes. Operations such as MegaCli -CfgForeign -Clear, mdadm --create --assume-clean, and the Synology Storage Manager Repair button all overwrite the metadata that records which member holds which stripe and which event count was current at the moment of failure. After these operations run, the on-disk geometry no longer matches the original array, and a virtual assembly from cloned member images can no longer produce a coherent volume without manual stripe-by-stripe analysis. The drives themselves still hold the data, but the recovery cost increases substantially.

Why does the controller drop a drive during rebuild?

Two physical causes dominate. First, the drive hits a sector that requires deep recovery, enters a 60 to 120 second internal retry loop, and the RAID controller times out long before the drive finishes (controller TLER or ERC windows default to 7 to 14 seconds). Second, the drive is an SMR (Shingled Magnetic Recording) member whose CMR cache zone fills under sustained sequential writes; the drive stalls for several seconds while reorganizing shingled tracks, and the controller interprets the stall as drive death. Neither failure mode means the drive is dead; both mean the drive cannot stay synchronous with the controller's response budget.

What is the difference between a failed rebuild and a failed drive?

A failed drive is a physical fault on one specific member: a head crash, a stuck spindle, a PCB short, or a firmware lockout. A failed rebuild is the controller halting the reconstruction process because something went wrong while it was reading the surviving members or writing the replacement member. The surviving drives may be fully readable when imaged offline through a write-blocker, even though the controller marked the rebuild as failed. Most failed rebuilds we receive are not failed drives; they are TLER timeouts, SMR ejections, or write-hole parity inconsistencies that the controller cannot reconcile in place.

Can SMR drives cause RAID rebuild failures?

Yes, and they are one of the most common modern causes. Drive-managed SMR drives reorganize data into overlapping shingled tracks during idle periods. A RAID rebuild is the opposite of idle; it forces continuous sequential writes that fill the CMR cache, then force a pause while the drive flushes into shingled zones. The pause lasts several seconds and the RAID controller drops the drive on TLER timeout. Models notorious for this conflict include the WD Red EFAX series (2 TB to 6 TB), Seagate Barracuda ST2000DM008 and ST4000DM004, and Toshiba L200 and P300 families. None of those drives should ever be used as parity RAID members.

How do I recover from a failed RAID 5 rebuild?

Power the chassis down. Do not retry the rebuild, do not click Repair in the NAS UI, do not run --assume-clean or PDMakeGood. Label every drive bay with its physical slot number. Ship the drives, or bring them in, for evaluation. We image each member through hardware write-blockers on PC-3000 Express or DeepSpar Disk Imager, extract the RAID metadata from the cloned images, and assemble the volume virtually with Data Extractor Express RAID Edition. The original chassis is never written to. +$100 rush fee to move to the front of the queue.

Does the RAID level affect recovery chances after a rebuild failure?

Yes. RAID 6 and RAID 10 arrays have better recovery prospects than RAID 5 because they provide additional redundancy. However, after a partially completed rebuild, a RAID 6 array is in a mixed parity state: rebuilt stripes have updated parity while unrebuilt stripes still rely on the original parity layout. The actual remaining tolerance depends on why the rebuild failed. If a second drive caused the failure, the array may have zero remaining margin. RAID 5 has zero parity margin after the first failure, so a read error on a surviving member during rebuild loses the data in that stripe; whether the whole array then drops offline depends on the controller. RAID 10 tolerance depends on which mirror pair was affected.

Why do ZFS resilvers succeed when hardware RAID 5 rebuilds fail?

Hardware RAID controllers are filesystem-blind. They rebuild by reading every sector on every surviving drive, including empty space. A 16 TB drive that's only 30% full still forces the controller through all 16 TB. ZFS is filesystem-aware; its resilver only reads allocated data blocks. If the pool is 30% full, ZFS reads roughly 30% of the disk surface, cutting the URE exposure by 70%. This is why TrueNAS and FreeNAS arrays using ZFS mirror or RAIDZ tolerate larger drives with fewer rebuild failures than equivalent hardware RAID 5 arrays.
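The difference in read volume is straightforward to estimate. This sketch is a simplification: it assumes three surviving 16 TB members and treats the resilver as reading only the allocated fraction of each one. The function names and the member count are hypothetical.

```python
def hardware_rebuild_read_tb(drive_tb: float, surviving_members: int) -> float:
    """A filesystem-blind rebuild reads every sector on every survivor."""
    return drive_tb * surviving_members

def zfs_resilver_read_tb(drive_tb: float, surviving_members: int,
                         fill_fraction: float) -> float:
    """A resilver reads only allocated blocks (simplified model)."""
    return drive_tb * surviving_members * fill_fraction

full = hardware_rebuild_read_tb(16, 3)
resilver = zfs_resilver_read_tb(16, 3, 0.30)
print(f"hardware RAID reads {full:.1f} TB, ZFS resilver reads {resilver:.1f} TB")
print(f"read exposure reduced by {1 - resilver / full:.0%}")
```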

What does "not enough devices to start the array" mean after a rebuild?

It means a second member dropped during the rebuild, so the controller or mdadm no longer has enough present members to assemble the array. On RAID 5 that is a double fault: the array tolerates one missing member, and the rebuild pushed a marginal surviving member past a TLER or ERC timeout until it dropped too. The data is still on the platters. Do not initialize the array, clear the configuration, or force it online, because those actions overwrite the metadata that records member order and event counts. Power the chassis down, label each drive by physical slot, and image every member offline before any reassembly.

How long does a RAID 5 or RAID 6 rebuild take?

A RAID 5 array of 8 TB, 7200 RPM CMR SATA drives takes 15 to 20 hours to rebuild under ideal conditions, with no production I/O competing for disk bandwidth. RAID 6 takes longer because the controller recalculates two parity blocks (XOR plus Reed-Solomon) per stripe instead of one. A 4-drive array with 16 TB drives can take 40+ hours. Every hour the array spends rebuilding is an hour in which one more drive failure than the remaining redundancy can absorb collapses the entire volume; on RAID 5, that is any second failure. Drive-managed SMR (shingled) drives can extend rebuild times from hours to days because their CMR write cache fills up and forces zone rewrites, stalling the controller.
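A back-of-envelope check on those figures: rebuild time is bounded below by how long it takes to write one full replacement drive. The sketch below assumes the per-drive capacities quoted above and sustained throughputs of 110 to 150 MB/s. The throughput values are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def rebuild_hours(drive_tb: float, sustained_mb_per_s: float) -> float:
    """Lower-bound rebuild time: one full pass over a member drive at
    the given sustained throughput, with no competing I/O."""
    seconds = drive_tb * 1e12 / (sustained_mb_per_s * 1e6)
    return seconds / 3600

for drive_tb in (8, 16):
    fast = rebuild_hours(drive_tb, 150)
    slow = rebuild_hours(drive_tb, 110)
    print(f"{drive_tb} TB member: {fast:.1f} to {slow:.1f} hours")
```

Production load, controller rebuild priority, and SMR stalls all push the real figure above this floor.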

Data Recovery Standards & Verification

Our Austin lab operates on a transparency-first model. We use industry-standard recovery tools, including PC-3000 and DeepSpar, combined with strict environmental controls to maintain drive integrity. This approach allows us to serve clients nationwide with consistent technical standards.

Validated Clean Zone

Open-drive work is performed in a ULPA-filtered laminar-flow bench, validated to 0.02 µm particle count, verified using TSI P-Trak instrumentation.

Transparent History

Serving clients nationwide via mail-in service since 2008. Our lead engineer holds PC-3000 and HEX Akademia certifications for hard drive firmware repair and mechanical recovery.

Media Coverage

Our repair work has been covered by The Wall Street Journal and Business Insider, with CBC News reporting on our pricing transparency. Louis Rossmann has testified in Right to Repair hearings in multiple states and founded the Repair Preservation Group.

Aligned Incentives

Our "No Data, No Charge" policy means we assume the risk of the recovery attempt, not the client.

Technical Oversight

Louis Rossmann

Our engineers review all lab protocols to maintain technical accuracy and honest service. Since 2008, Louis Rossmann's focus has been on clear technical communication and accurate diagnostics rather than sales-driven explanations.

We believe in proving standards rather than just stating them. We use TSI P-Trak instrumentation to verify that clean-air benchmarks are met before any drive is opened.

See our clean bench validation data and particle test video

No Data, No Fee Guarantee

2.49M+ Subscribers

4.9 rating from 1,837+ Google Reviews

Established 2008

Repairs on Video: Full Transparency

As Featured In