
Dead NVMe & PCIe Lane Fault Recovery

An NVMe drive that goes dark on the PCIe bus has usually lost its power tree or its host link, not its data. A collapsed PMIC, a shorted decoupling cap, a lost 100 MHz reference clock, or a severed differential pair leaves the NAND intact while the system sees nothing. We diagnose the fault with pre-power diode checks and FLIR thermal imaging, repair the board with Hakko FM-2032 microsoldering, and image the data through the original controller at our Austin, TX lab.

Written by Louis Rossmann, Founder & Chief Technician
Updated 2026-06-15

If your NVMe drive is dead or not detected, stop applying power to it. If the controller is partly alive but the host link is broken, repeated power-ups risk the controller completing queued TRIM or UNMAP (Deallocate) commands, and background garbage collection can then erase those NAND cells. Power cycling also pushes more current through any shorted component. Do not run recovery software; the drive isn't visible to the OS when the power or signal path is broken. Call (512) 212-9111 for a free evaluation.

Call (512) 212-9111. No data, no recovery fee. Free evaluation, no diagnostic fees.

What Is an NVMe PCIe Lane Fault?

An NVMe PCIe lane fault is an electronic break in the power or signal path that connects the drive's controller to the CPU over PCIe. A collapsed PMIC, a shorted decoupling cap, a lost 100 MHz reference clock, or a severed TX/RX differential pair all leave the NAND intact while the drive stays invisible to the system. Board-level repair restores the rail or the link so the original controller boots and images.

Unlike a SATA SSD, an NVMe drive connects to the CPU's PCIe bus directly. There is no SATA cable and no host adapter translating commands. The drive has to train its differential lanes against the CPU root complex and pass a PCIe configuration sequence before the NVMe protocol can even start. Break the power feeding the controller, or break one of those lanes, and the drive never reaches the host. This is different from firmware corruption, where the controller powers on but its mapping table is scrambled. Recovery software can't see a drive that never enumerates on the bus.
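
What "enumerates on the bus" means can be made concrete on a Linux host. The sketch below is an illustration only. It is not part of our bench process, and it is not something to run against a failed drive that should stay unpowered. It scans the PCI sysfs tree for functions whose class code starts with 0x0108 (mass storage, non-volatile memory controller). The function name and the `sysfs_root` parameter are our own choices for the example.

```python
from pathlib import Path

# PCI base class 01h (mass storage), subclass 08h (non-volatile memory).
NVME_CLASS_PREFIX = "0x0108"


def find_nvme_functions(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses whose class code marks an NVM controller."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    found = []
    for dev in sorted(root.iterdir()):
        try:
            class_code = (dev / "class").read_text().strip().lower()
        except OSError:
            continue  # entry without a readable class file
        if class_code.startswith(NVME_CLASS_PREFIX):
            found.append(dev.name)
    return found
```

An empty list for a drive that is physically installed is the situation this page describes: the operating system has nothing to hand to recovery software, because the drive never appeared on the bus.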


Why Is NVMe Not Just a Faster SATA SSD?

SATA SSDs inherited a stack built for spinning disks: a SATA cable, a host bus adapter, and the AHCI command set. NVMe threw that out. The controller sits on the PCIe bus and answers the CPU root complex with no intermediary layer, which is why the failure modes are different even though the NAND inside looks the same.

  • Direct PCIe link to the CPU. The drive trains high-speed differential lanes against the processor's root complex. A SATA drive only had to drive a single-lane interface through a cable and a host adapter.
  • No SATA/AHCI abstraction layer. NVMe commands ride straight on PCIe. There is no host adapter to fall back on if the link will not train, so a signal fault on a lane has nowhere to hide.
  • A shared 100 MHz reference clock. PCIe link training depends on a clean reference clock from the slot or board. Lose it and the link never reaches the trained state, even though the controller and NAND are fine.
  • A link-training state machine. The PCIe physical layer walks through a defined sequence of states before data can move. A fault can stall that sequence partway, leaving the drive present but never ready.

The upside is throughput. The downside, for recovery, is that an NVMe drive has more ways to go fully dark than a SATA SSD ever did. A lost clock or a severed lane is invisible to every consumer tool, because the tool needs an enumerated drive and the drive never gets that far.


How Much Does Dead NVMe Recovery Cost?

NVMe board repair for an electronic fault costs $600–$900. If the controller is destroyed and the NAND has to move to a donor PCB, the cost is $1,200–$2,500 plus donor drive cost. Every case starts with a free evaluation and a firm quote before paid work. No data recovered means no charge.

Board repair covers the PMIC, decoupling caps, rail shorts, reference-clock path, and differential-pair faults described below. BGA reflow or reball, used when thermal cycling has fractured the controller joints, has its own price tier in the table below. NAND transplant applies only when the controller die itself is gone. A donor drive is a matching SSD used for its circuit board. Typical donor cost is $40–$100 for common models and $150–$300 for discontinued or rare controllers. A $100 rush fee moves a case to the front of the queue.

Fault class | NVMe price | Typical timeline
Shorted MLCC cap / rail-to-ground short / TVS lift | $600–$900 | 3-6 weeks
PMIC collapse / reference-clock / differential-pair repair | $600–$900 | 3-6 weeks
Controller BGA reflow / reball (thermal cracking) | $900–$1,200 | 3-6 weeks
NAND transplant to donor PCB (controller destroyed) | $1,200–$2,500 | 4-8 weeks

NAND transplant requires a 50% deposit. Donor drive cost is additional. SATA SSD board repair runs $450–$600 for comparison. All prices exclude tax and the target drive.
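
The way the ranges stack can be shown with a few lines of arithmetic. This is a rough sketch using only the figures quoted above. The dictionary keys and the `estimate_range` function are hypothetical names, the result excludes tax and the target drive, and the firm quote from the free evaluation is what actually applies.

```python
# Published ranges in US dollars, copied from the pricing table above.
PRICE_RANGES = {
    "board_repair": (600, 900),
    "bga_reflow": (900, 1200),
    "nand_transplant": (1200, 2500),
}
DONOR_RANGES = {"common": (40, 100), "rare": (150, 300)}
RUSH_FEE = 100


def estimate_range(fault_class, donor=None, rush=False):
    """Return (low, high) for a fault class, optional donor drive, and rush fee."""
    low, high = PRICE_RANGES[fault_class]
    if donor is not None:
        donor_low, donor_high = DONOR_RANGES[donor]
        low, high = low + donor_low, high + donor_high
    if rush:
        low, high = low + RUSH_FEE, high + RUSH_FEE
    return low, high
```

For example, `estimate_range("nand_transplant", donor="common", rush=True)` returns `(1340, 2700)`.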


How Do We Recover Data from a Dead NVMe Drive?

The goal is to restore the power tree and the PCIe link so the original controller boots and decrypts the NAND. We fix the support circuitry around the controller, not the controller itself. With clean rails and a trained link, the original controller initializes and serves data through its own translator.

  1. Pre-Power Diode & Thermal Diagnosis

    Before any rail power is applied, we measure the input rails in diode mode to find shorts, then run a FLIR thermal sweep on a current-limited bench supply. A shorted MLCC cap or a rail-to-ground short heats first and the camera localizes it to one package. Finding the short before the drive sees unrestricted power keeps a 0402 cap fault from becoming a burned trace.

  2. Rail & Link Verification

    The PMIC steps the 3.3V M.2 input down to the controller core rails (commonly 1.8V, 1.2V, and 0.9V). We confirm each rail comes up at the correct voltage, then check the 100 MHz reference clock and the TX/RX differential pairs that carry the PCIe link. A missing rail, a dead clock, or an open pair each points to a different repair.

  3. Component-Level Repair

    Using Hakko FM-2032 microsoldering irons and Atten 862 hot air rework, we lift the shorted cap, replace the failed PMIC, or repair the broken differential pair. When thermal cycling has cracked the controller BGA, the Zhuo Mao precision BGA rework station reflows or reballs the package on a controlled thermal profile.

  4. Controller Boot & Data Imaging

    With the rails clean and the link trained, the original controller enumerates on the PCIe bus and decrypts the NAND on its own silicon. We connect the drive to PC-3000 SSD to image sector by sector. If the controller enumerates but stalls at the NVMe handshake, we reconstruct the corrupted system area before imaging.
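
The rail check in step 2 amounts to comparing each measured voltage with its nominal value. Below is a minimal sketch of that comparison. The rail labels, the 5% tolerance, and the 0.1V "missing" cutoff are assumptions made for the example, not figures from a datasheet or from our bench procedure.

```python
# Nominal values from the text: 3.3V M.2 input and common core rails.
NOMINAL_RAILS_V = {"3V3_IN": 3.3, "1V8": 1.8, "1V2": 1.2, "0V9": 0.9}


def check_rails(measured_v, tolerance=0.05, missing_below_v=0.1):
    """Classify each rail as ok, missing, sagging, high, or not measured.

    tolerance and missing_below_v are illustrative assumptions.
    """
    report = {}
    for name, nominal in NOMINAL_RAILS_V.items():
        value = measured_v.get(name)
        if value is None:
            report[name] = "not measured"
        elif value < missing_below_v:
            report[name] = "missing"
        elif abs(value - nominal) <= nominal * tolerance:
            report[name] = "ok"
        elif value < nominal:
            report[name] = "sagging"
        else:
            report[name] = "high"
    return report
```

A "missing" or "sagging" result on one core rail with a healthy 3.3V input points at that regulator stage or at a short downstream of it, which is the distinction step 2 is after.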


Which Electronic Faults Kill NVMe Detection?

NVMe failures that produce a dead or undetected drive split into power-tree faults and host-link faults. Power-tree faults starve the controller; host-link faults stop the PCIe lanes from training even when the controller has power. Both leave the NAND untouched.

PMIC collapse and power-delivery failure
The Power Management IC steps the 3.3V M.2 input down to the controller core rails (commonly 1.8V, 1.2V, and 0.9V). A shorted MLCC decoupling cap, a damaged copper trace, or an internally failed PMIC starves the controller, so it never boots. This is the most common cause of a fully dark drive.
Shorted MLCC decoupling caps and rail-to-ground shorts
Multilayer ceramic capacitors filter each rail. A cracked or shorted MLCC pulls its rail to ground and the regulator refuses to start. A 3.3V rail-to-ground short behind the input clamp does the same to the whole drive. These are 0402 and 0603 packages; the FLIR sweep localizes which one is shorted before any iron touches the board.
Lost 100 MHz PCIe reference clock
PCIe link training needs a clean 100 MHz reference clock. If the clock path is broken (a cracked trace, a damaged clock buffer, or a bad slot contact), the link never trains and the drive never appears, even though the controller has power and the NAND is fine.
Severed TX/RX differential pairs
Each PCIe lane is two differential pairs: a transmit pair and a receive pair. A cracked trace, a corroded via from a liquid event, or M.2 connector misalignment can sever one pair, which prevents the host from enumerating the drive. Signal-integrity loss from a poorly seated connector produces the same symptom.
Cracked BGA from thermal cycling or PCB flex
The controller connects to the PCB through hundreds of solder balls in a BGA package. Repeated heat-cool cycles fracture those joints over time. PCB flex from forcing an M.2 2280 board into a 2260 standoff also cracks joints and traces. The controller loses contact on specific pins, producing intermittent detection or complete failure. The Zhuo Mao BGA rework station reflows or reballs the package.
Host interface decoder corruption (NVMe handshake failure)
Sometimes the lanes train and the drive enumerates over PCIe, but the NVMe handshake fails: the host sets CC.EN and waits for Controller Ready, and the controller never asserts it. The host interface decoder or internal translator is corrupted, so the controller cannot map NAND addresses to logical block addresses. This is a firmware-side fault, addressed by system-area reconstruction rather than soldering.
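
The handshake described in the last entry can be sketched as a register poll. The offsets and bit positions below (CAP at 0x00, CC at 0x14, CSTS at 0x1C, CC.EN and CSTS.RDY at bit 0, CSTS.CFS at bit 1, CAP.TO in bits 31:24 in 500 ms units) follow the NVMe register layout. The `read_reg` and `write_reg` callables are hypothetical accessors for the controller's register space, and this is a simplified illustration rather than driver code or our imaging workflow.

```python
import time

CAP_OFFSET, CC_OFFSET, CSTS_OFFSET = 0x00, 0x14, 0x1C  # NVMe controller registers
CC_EN = 1 << 0     # CC.EN: host enables the controller
CSTS_RDY = 1 << 0  # CSTS.RDY: controller reports ready
CSTS_CFS = 1 << 1  # CSTS.CFS: controller fatal status


def wait_for_ready(read_reg, write_reg, sleep=time.sleep, poll_s=0.01):
    """Set CC.EN, then poll CSTS.RDY until the CAP.TO timeout expires."""
    cap = read_reg(CAP_OFFSET)
    timeout_s = ((cap >> 24) & 0xFF) * 0.5  # CAP.TO is in 500 ms units
    # A real driver programs the admin queue registers before this write.
    write_reg(CC_OFFSET, read_reg(CC_OFFSET) | CC_EN)
    waited = 0.0
    while waited <= timeout_s:
        csts = read_reg(CSTS_OFFSET)
        if csts & CSTS_CFS:
            return "fatal"
        if csts & CSTS_RDY:
            return "ready"
        sleep(poll_s)
        waited += poll_s
    return "timeout"
```

A "timeout" result on a link that trained cleanly is the signature of the firmware-side fault above: the lanes carried the register accesses, but the controller never finished initializing.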

Before any NVMe command moves, the PCIe physical layer runs the Link Training and Status State Machine (LTSSM). It walks the link from electrical idle through a defined sequence of states until both ends agree on lane count, polarity, and speed. An electronic fault stalls that walk at a specific state, and where it stalls tells you what is broken.

LTSSM stall point | What it means | Likely fault
Stuck in Detect | The link never sees a receiver on the far end of a lane | Severed TX or RX differential pair, dead PCIe PHY, or the controller has no power because the PMIC collapsed
Stuck in Polling | Detect passed but the link cannot lock bit and symbol timing | Lost or noisy 100 MHz reference clock, degraded PCIe PHY, or signal-integrity loss on a marginal pair
Stuck in Configuration | Timing locked but the two ends cannot agree on lane width and polarity | Lane polarity inversion, a partially severed pair, or one lane of a x4 link open so width negotiation fails
Link trains, NVMe handshake fails | LTSSM reached L0 but CC.EN never produces Controller Ready | Host interface decoder or system-area corruption; the lanes are healthy, the controller firmware is not

The distinction matters because the repair changes with the stall point. A Detect, Polling, or Configuration stall is a hardware problem on the board: a rail, a clock, or a pair to repair with an iron.

A handshake failure after L0 is a firmware problem inside the controller, addressed by reconstructing the system area through PC-3000 SSD once the link itself is solid. Reading the stall point first keeps the work pointed at the real fault rather than guesswork.

No software product substitutes for this. Recovery utilities need an enumerated drive. A drive stalled in Detect is invisible to the host root complex, which means it is invisible to every consumer tool ever shipped. The repair has to happen at the lane or the rail before any firmware-level work becomes possible.
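
The stall-point reasoning above fits in a small lookup. The sketch below restates the table as data. The `Stall` enum, the `TRIAGE` mapping, and the "board" and "firmware" labels are names chosen for the example, not terms from the PCIe specification or from any tool.

```python
from enum import Enum


class Stall(Enum):
    DETECT = "Detect"
    POLLING = "Polling"
    CONFIGURATION = "Configuration"
    HANDSHAKE = "Link trains, NVMe handshake fails"


# Stall point -> (where the repair happens, suspects listed in the table).
TRIAGE = {
    Stall.DETECT: ("board", ["severed TX or RX pair", "dead PCIe PHY", "collapsed PMIC"]),
    Stall.POLLING: ("board", ["lost or noisy reference clock", "degraded PCIe PHY",
                              "marginal pair"]),
    Stall.CONFIGURATION: ("board", ["lane polarity inversion", "partially severed pair",
                                    "open lane on a x4 link"]),
    Stall.HANDSHAKE: ("firmware", ["host interface decoder corruption",
                                   "system-area corruption"]),
}


def triage(stall):
    """Return (repair domain, list of suspects) for an observed stall point."""
    return TRIAGE[stall]
```

Only the last entry routes to firmware work. Every stall before L0 routes back to the board.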


What Is the Pre-Power Bench Sequence for a Dead NVMe Drive?

The sequence is built to find the fault before the drive ever sees unrestricted rail power, so a shorted rail does not take out adjacent parts during diagnosis. Every step runs on an ESD-safe bench with the drive disconnected from any host.

  1. Visual inspection under magnification. Check the M.2 board for cracked components, flux residue from liquid damage, a burn ring near the input clamp, or hairline trace fractures along the differential pairs. Document everything before powering.
  2. Pre-power diode-mode sweep. With the drive unpowered, measure each rail to ground in diode mode. A reading near zero on the 3.3V input or on a controller core rail flags a shorted MLCC cap or a rail-to-ground short on that rail.
  3. FLIR thermal sweep on a current-limited supply. Apply the 3.3V input through a bench supply set to a low current limit. The shorted component draws the full limit and heats first. FLIR localizes the hotspot to one 0402 or 0603 package within seconds, before any healthy part is stressed.
  4. PMIC output rail verification. Once the short is cleared, probe each PMIC output. Confirm the 1.8V, 1.2V, and 0.9V core rails come up at the correct voltage. A missing or sagging rail points to a failed regulator stage or a downstream short still present.
  5. Reference clock and differential-pair check. Verify the 100 MHz reference clock reaches the controller and that the TX/RX pairs are continuous. An open pair or a dead clock is the fault when the rails are clean but the drive still will not enumerate.
  6. PC-3000 SSD controller communication test. With the rails clean and the link continuous, connect the drive to PC-3000 SSD and confirm the controller responds. PC-3000 SSD NVMe coverage spans Silicon Motion, Phison, and Marvell; the supported examples on this page are the Phison E18 and E26 and the Silicon Motion SM2262EN, SM2263XT, and SM2264.
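
The order of the electrical checks can be written as a function that names the first check to fail. This is a simplified sketch: it skips the visual inspection, the field names are hypothetical, and the 0.05V cutoff for a "near zero" diode-mode reading is an assumption for illustration.

```python
from dataclasses import dataclass

SHORT_THRESHOLD_V = 0.05  # assumed cutoff for a "near zero" diode-mode reading


@dataclass
class BenchReadings:
    diode_v: dict                      # rail name -> diode-mode reading in volts
    rails_ok: bool = False             # PMIC outputs at the expected voltages
    clock_present: bool = False        # 100 MHz reference clock reaches the controller
    pairs_continuous: bool = False     # TX/RX pairs show continuity
    controller_responds: bool = False  # controller answers the imaging tool


def first_fault(r):
    """Walk the checks in bench order and name the first one that fails."""
    shorted = sorted(name for name, v in r.diode_v.items() if v < SHORT_THRESHOLD_V)
    if shorted:
        return "rail short: " + ", ".join(shorted)
    if not r.rails_ok:
        return "PMIC output missing or sagging"
    if not r.clock_present:
        return "reference clock missing"
    if not r.pairs_continuous:
        return "open differential pair"
    if not r.controller_responds:
        return "controller not responding"
    return "ready to image"
```

Checking in this order mirrors the list: a short is cleared before the rails are judged, and the rails are clean before the clock and the pairs are blamed.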

How Do Severed Differential Pairs and a Lost Reference Clock Break the Link?

A consumer NVMe drive runs a x4 PCIe link: four lanes, each a transmit differential pair and a receive differential pair. The link only trains if the pairs are continuous, the polarity is correct, and the reference clock is clean. A single physical break can collapse the whole link.

Severed pairs come from a few causes: a cracked trace from PCB flex, a corroded via from a liquid event that wicked under the connector, or M.2 connector misalignment that never made solid contact on one pad.

The result is the same: the host cannot enumerate the drive because one of the pairs it needs is open. Where the link cannot settle on a reduced width, the LTSSM stalls in Configuration because width negotiation cannot complete with a lane missing.

The reference clock is the other single point of failure. PCIe needs a clean 100 MHz source to lock bit and symbol timing during the Polling phase.

A cracked clock trace, a damaged clock buffer, or a slot contact that lost the clock pin all stall training at Polling. The controller has power, the NAND is intact, and the drive still never appears, because the physical layer never finished locking timing.

Repair is trace-level work under the microscope. A severed pair is rebuilt with fine magnet wire to bridge the break, matched in length so the differential pair stays balanced. A damaged clock buffer is replaced with the Hakko FM-2032 and Atten 862. The continuity and clock checks run again before the drive is allowed onto the host, so the link is confirmed solid before any imaging attempt.
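
Once a repaired drive does enumerate, Linux exposes the negotiated link width next to the width the device supports, which gives a quick sanity check on the lanes. The sketch below reads the `current_link_width` and `max_link_width` sysfs attributes of one PCI device. The function name is ours, and a narrower result is only a hint: a slot or adapter wired for fewer lanes produces the same numbers as an open lane.

```python
from pathlib import Path


def link_width_status(device_dir):
    """Compare the negotiated PCIe link width with the device's maximum.

    device_dir is a PCI device directory such as
    /sys/bus/pci/devices/0000:01:00.0 (the address is an example).
    """
    dev = Path(device_dir)
    current = int((dev / "current_link_width").read_text().strip())
    maximum = int((dev / "max_link_width").read_text().strip())
    return {"current": current, "max": maximum, "downtrained": current < maximum}
```

A x4 drive reporting a current width of 4 after the repair is consistent with all four lanes carrying traffic.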


When Does Board Repair Work and When Do We Switch to Chip-Off?

Board-level repair keeps the original controller on the PCB. Chip-off removes the NAND packages and reads them on PC-3000 with a bus interposer that talks to the raw flash. The decision is not preference; it is whether the controller silicon is still alive.

Failure signature | Repair path | Why
Shorted MLCC or rail-to-ground short, controller intact | Board repair | Lift the shorted part, the rail restores, and the controller boots on its own key
PMIC collapse, controller intact | Board repair with donor PMIC | Replacing the PMIC restores clean rails to the original controller
Severed differential pair or lost reference clock | Board repair, trace-level | Rebuilding the pair or clock path restores the link without touching the controller
Cracked BGA joints, controller die not cracked | Board repair with BGA reflow or reball | Reseating the controller restores its connections without replacing the die
Controller silicon cracked or burned through, no hardware encryption | Chip-off NAND on PC-3000 | The controller is gone, so raw flash reads are imaged through a bus interposer and the translation is rebuilt
Controller destroyed on a hardware-encrypted NVMe drive | Unrecoverable | The key was bound to the controller silicon and cannot be reconstructed from chip-off reads. We tell you this at the free evaluation rather than bill for a run that cannot succeed

The PCIe lane and power faults on this page sit in the first four rows. The chip-off NAND page covers the fifth row and the bus-interposer imaging path, and the hardware encryption page covers the key-binding limit in the sixth.
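
The table reduces to two questions asked in order: is the controller silicon alive, and if not, is the drive hardware-encrypted? Below is a minimal sketch of that decision. The fault keys and the `choose_path` function are hypothetical names for the example.

```python
# Board-level faults from the first four table rows -> repair path.
BOARD_REPAIRS = {
    "cap_or_rail_short": "board repair",
    "pmic_collapse": "board repair with donor PMIC",
    "pair_or_clock": "board repair, trace-level",
    "cracked_bga": "board repair with BGA reflow or reball",
}


def choose_path(fault, controller_alive, hardware_encrypted):
    """Pick the repair path from the controller state and encryption status."""
    if controller_alive:
        # The original controller and its key survive, so fix what surrounds it.
        return BOARD_REPAIRS[fault]
    if hardware_encrypted:
        # The key was bound to the destroyed silicon.
        return "unrecoverable"
    return "chip-off NAND on PC-3000"
```

The `fault` argument only matters while the controller is alive. Once the die is gone, encryption alone decides between chip-off and no recovery.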


How Does Chip-Off NAND Imaging Work When the Controller Is Gone?

Chip-off is the path of last resort, used only when the controller die itself is destroyed and the drive does not use hardware encryption. The NAND packages are desoldered with the Atten 862 hot air station, cleaned, and read on PC-3000 through a bus interposer that connects the raw flash to the imaging complex.

Reading the raw NAND is only the first step. The bytes come back as physical pages in the controller's own layout, scrambled by the controller's data-randomizing logic and protected by error-correcting code.

PC-3000 reverses the scrambling, applies error correction, and reconstructs the translation that maps physical pages back to logical blocks. Only then does a usable file system appear.
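
Why scrambling is reversible while encryption is not comes down to where the keystream comes from. The toy below XORs a page with a pseudo-random stream seeded only by the page index, so anyone who knows the generator can undo it. It is an illustration only: the 16-bit LFSR, its taps, and the seeding rule are invented for the example and do not represent any vendor's randomizer or what PC-3000 does internally, and the error-correction stage is left out entirely.

```python
def lfsr_stream(seed, length):
    """Generate `length` keystream bytes from a toy 16-bit LFSR."""
    state = (seed & 0xFFFF) or 0xACE1  # an all-zero state would never change
    out = bytearray()
    for _ in range(length):
        byte = 0
        for _ in range(8):
            bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
            byte = (byte << 1) | (state & 1)
        out.append(byte)
    return bytes(out)


def scramble(page, page_index):
    """XOR a page with the keystream for its index; applying it twice undoes it."""
    keystream = lfsr_stream(page_index + 1, len(page))
    return bytes(a ^ b for a, b in zip(page, keystream))
```

Because the seed depends on nothing secret, `scramble(scramble(page, 7), 7)` returns the original page. An encrypted drive differs in exactly that respect: its keystream depends on a key that lived only inside the controller.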

The hard limit is encryption. On a hardware-encrypted NVMe drive the media-encryption key is generated inside the controller and wrapped by a key tied to the controller's hardware-unique root, so it never leaves the original silicon in plaintext.

With the controller destroyed, raw NAND reads come back as ciphertext and no donor controller can unwrap them. That is why board-level repair to revive the original controller is the preferred path, and why chip-off is reserved for older non-encrypted parts.


Why Does the Original Controller Have to Be Revived?

Many modern NVMe drives encrypt the NAND with hardware AES. The media-encryption key is generated on the controller and wrapped by a key tied to that controller's hardware-unique root, so it never leaves the original silicon in plaintext.

Swap in a donor controller of the identical part number and its root is different, so it cannot unwrap the original key. The NAND reads back as ciphertext.

Board-level repair preserves the original controller and its key. We replace the support components around the controller (caps, the PMIC, the clock path, the differential pairs), not the controller itself. When the original controller boots with clean rails and a trained link, it decrypts the NAND on its own silicon through the normal translator. The hardware encryption page has the full key-binding breakdown.
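
The key-binding argument can be shown with a toy model. In the sketch below a media key is wrapped under a key derived from a per-chip root, and a donor chip with a different root fails to unwrap it. Everything here is an assumption made for illustration: real controllers use vendor-specific hardware key stores and AES-based wrapping, not the SHA-256 and HMAC stand-ins used here, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def _kek(root):
    """Derive a key-encryption key from a chip's hardware-unique root (toy)."""
    return hashlib.sha256(b"kek" + root).digest()


def wrap(media_key, root):
    """Toy wrap of a 32-byte media key: XOR with a derived pad, plus an HMAC tag."""
    kek = _kek(root)
    pad = hashlib.sha256(b"pad" + kek).digest()
    blob = bytes(a ^ b for a, b in zip(media_key, pad))
    return blob + hmac.new(kek, blob, hashlib.sha256).digest()


def unwrap(wrapped, root):
    """Return the media key, or None when the root does not match."""
    kek = _kek(root)
    blob, tag = wrapped[:-32], wrapped[-32:]
    expected = hmac.new(kek, blob, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    pad = hashlib.sha256(b"pad" + kek).digest()
    return bytes(a ^ b for a, b in zip(blob, pad))


original_root = secrets.token_bytes(32)  # stands in for the original controller
donor_root = secrets.token_bytes(32)     # same part number, different silicon
media_key = secrets.token_bytes(32)
wrapped = wrap(media_key, original_root)
```

Here `unwrap(wrapped, original_root)` returns the media key, while `unwrap(wrapped, donor_root)` returns `None`. That asymmetry is the reason the repair targets the parts around the original controller and leaves the controller in place.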

Not every consumer NVMe drive runs hardware AES. Many budget DRAM-less drives leave it off, in which case the chip-off barrier is the controller's data scrambling and error-correcting code rather than encryption. We confirm which case applies during the free evaluation, so the quote reflects the real recovery path rather than an assumption.

Which NVMe Controllers Are Covered by PC-3000 SSD?

PC-3000 SSD NVMe imaging coverage is limited to three vendor families: Silicon Motion, Phison, and Marvell. The board-level diagnostic and repair work on this page applies to any NVMe drive, because restoring the power tree and the PCIe link is electronics work that does not depend on the controller vendor. The difference is which drives can then be imaged through PC-3000 SSD vendor-specific commands.

Controller family | PC-3000 SSD imaging | Supported examples
Phison | Covered | E18, E26
Silicon Motion | Covered | SM2262EN, SM2263XT, SM2264
Marvell | Covered | NVMe controller families
Samsung in-house (Elpis and similar) | Not covered | Board repair only; no PC-3000 SSD imaging
Realtek, Innogrit | Not covered | Board repair only; no PC-3000 SSD imaging

Rossmann does not currently offer in-lab recovery for Samsung Elpis, Realtek, or Innogrit NVMe controllers. Where those families are named here, the bench power-tree and PCIe-link repair still applies so the native controller can boot on its own, but PC-3000 SSD vendor-specific imaging coverage for those families is not claimed. The Phison E18/E26 and the Silicon Motion SM2262EN/SM2263XT/SM2264 are covered by PC-3000 SSD per the ACELab support matrix.

The full PC-3000 toolchain and what it does at each stage live on the PC-3000 data recovery tool page. Controller coverage is confirmed at the free evaluation before any quote, so the path we quote is the path the drive can actually take.


Why Should You Stop Powering a Dead NVMe Drive?

Stop applying power to a dead or undetected NVMe drive. If the controller is partly alive and TRIM or UNMAP (Deallocate) commands were queued, the controller can unmap those blocks and background garbage collection can erase the NAND cells. Once a block is unmapped and erased, no lab can recover it. Pull the drive and have it diagnosed on a current-limited bench.

TRIM is a logical deallocate command, not an instant physical erase. The operating system tells the controller which blocks are no longer needed; the controller unmaps them from its translation table and returns zeros when those addresses are read, then garbage collection erases the physical cells afterward. On a drive that is failing on the host link, those queued operations can still run when power is reapplied, which is how a recoverable drive becomes an unrecoverable one.
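
The gap between "unmapped" and "erased" is easier to see in a toy model of a flash translation layer. The class below is a deliberately simplified sketch with hypothetical names: it maps logical blocks to physical pages one to one, and it ignores erase-block granularity, wear leveling, and everything else a real controller does.

```python
class ToyFTL:
    """Minimal translation layer: logical blocks map to physical pages."""

    PAGE_SIZE = 4

    def __init__(self):
        self.l2p = {}       # logical block address -> physical page number
        self.nand = {}      # physical page number -> stored bytes
        self.stale = set()  # physical pages waiting for garbage collection
        self._next_page = 0

    def write(self, lba, data):
        if lba in self.l2p:
            self.stale.add(self.l2p[lba])  # the old copy becomes stale
        self.nand[self._next_page] = data
        self.l2p[lba] = self._next_page
        self._next_page += 1

    def read(self, lba):
        page = self.l2p.get(lba)
        if page is None:
            return bytes(self.PAGE_SIZE)  # unmapped addresses read as zeros
        return self.nand[page]

    def trim(self, lba):
        """Logical deallocate: drop the mapping, leave the cells alone."""
        page = self.l2p.pop(lba, None)
        if page is not None:
            self.stale.add(page)

    def garbage_collect(self):
        """Physical erase of every stale page."""
        for page in self.stale:
            self.nand.pop(page, None)
        self.stale.clear()
```

After `trim`, a read through the controller already returns zeros, but the bytes are still in `nand` until `garbage_collect` runs. On a real drive that second step is what reapplied power can trigger.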

Power cycling also pushes more current through any shorted component on the board, so each retry risks spreading the damage from a single 0402 cap to the surrounding copper. The safe move on a dead NVMe drive is to stop, pull it, and bring it to a bench where the first power applied is current-limited and watched on FLIR.


Frequently Asked Questions

Can data be recovered from a dead NVMe drive that isn't detected?
Often, yes. An NVMe drive that won't enumerate on the PCIe bus usually has a broken power path or a broken signal path, not destroyed NAND. A collapsed PMIC, a shorted MLCC decoupling cap, a 3.3V rail-to-ground short, a lost 100 MHz reference clock, or a severed TX/RX differential pair all leave the data intact while the host sees nothing. Board-level microsoldering restores the power rails and the PCIe link so the original controller boots, decrypts the NAND, and images normally. NVMe board repair costs $600–$900. Free evaluation, no data no fee.
Why isn't NVMe just a faster version of a SATA SSD?
NVMe drives talk to the CPU over the PCIe bus directly, with no SATA cable, no AHCI host adapter, and no SATA/AHCI command abstraction in the middle. The drive trains a set of high-speed differential lanes against the CPU root complex, exchanges configuration over PCIe, then runs the NVMe protocol on top of that link. That removes the legacy abstraction layer SATA depended on. It also adds failure modes SATA never had: PCIe link-training stalls, lost reference clock, and severed differential pairs can each leave a physically healthy drive completely invisible to the system.
My NVMe drive shows no LED and isn't detected. What's wrong?
A drive that is fully dark on the PCIe bus has lost either its power tree or its host link. The most common electronic faults are a collapsed PMIC that no longer steps the 3.3V M.2 input down to the controller core rails, a shorted MLCC decoupling capacitor pulling a rail to ground, a 3.3V rail-to-ground short behind the input clamp, a degraded PCIe PHY, or a cracked BGA joint under the controller from thermal cycling. The diagnostic step is a pre-power diode-mode sweep, followed by a FLIR thermal sweep on a current-limited supply, to localize the short before the drive sees unrestricted rail power, so a 0402 cap fault does not become a burned PCB trace.
How much does dead NVMe / PCIe lane fault recovery cost?
NVMe circuit board repair for electronic faults costs $600–$900. If the controller itself is destroyed and the NAND has to be transplanted to a donor PCB, the cost is $1,200–$2,500 plus donor drive cost. A donor drive is a matching SSD used for its circuit board. Typical donor cost is $40–$100 for common models and $150–$300 for discontinued or rare controllers. For comparison, a SATA SSD board repair runs $450–$600. Every case starts with a free evaluation and a firm quote before paid work. No data recovered means no charge. A $100 rush fee moves a case to the front of the queue.
Can recovery software fix an NVMe drive that won't power on?
No. Recovery software needs the operating system to see the drive, which needs the NVMe controller to enumerate on the PCIe bus, which needs working power delivery and a trained PCIe link. If the PMIC has collapsed or a differential pair is severed, the controller never reaches the host and the drive is invisible. Software has no path to the NAND. The drive needs physical board repair before any imaging tool can touch it.
Should I keep trying to plug in a dead NVMe drive to see if it works?
No. If the controller is still partly alive but the host link is broken, repeatedly powering the drive risks the controller completing queued TRIM or UNMAP (Deallocate) operations against blocks the operating system marked for deletion, and background garbage collection can erase those NAND cells. Power cycling also drives more current through a shorted component, spreading the damage. Stop applying power, pull the drive, and have it diagnosed on a current-limited bench before anything else happens.
What does it mean when an NVMe drive enumerates but fails the NVMe handshake?
It means the PCIe link trained successfully but the NVMe controller cannot finish initialization. The host writes CC.EN to enable the controller and waits for the Controller Ready (CSTS.RDY) bit. If the controller never asserts ready, its host interface decoder or internal address translator is corrupted and it cannot map NAND pages to logical block addresses. This is a controller-side fault rather than a lane fault: the lanes are fine, but the controller's firmware or system area needs rebuilding before it will serve data.
Why does the original controller have to be repaired instead of swapped?
On a hardware-encrypted NVMe drive the media-encryption key is generated inside the controller and wrapped by a key tied to that controller's hardware-unique root, so it never leaves the original silicon in plaintext. Pull the NAND off the board and the bytes read back as ciphertext. Bond them to a donor controller of the identical part number and the donor still cannot unwrap the key, because its root is different. The only path to readable data is to revive the original PCB so the original controller boots on its own silicon and decrypts its own NAND through the normal translator.
Which NVMe controllers can you recover in-lab?
PC-3000 SSD NVMe coverage is limited to Silicon Motion, Phison, and Marvell controllers. Supported examples include the Phison E18 and E26 and the Silicon Motion SM2262EN, SM2263XT, and SM2264. Samsung in-house NVMe controllers (Elpis, Pascal, Polaris, Phoenix), Realtek, and Innogrit are outside that coverage. Rossmann does not currently offer in-lab recovery for Samsung Elpis, Realtek, or Innogrit NVMe controllers. On those families the bench power-tree diagnostic and board-level repair still apply, so the native controller can boot and image on its own, but PC-3000 SSD vendor-specific imaging is not claimed. We confirm controller coverage during the free evaluation.

NVMe drive dead or not detected?

Free evaluation. NVMe board repair: $600–$900. Pre-power diode and FLIR diagnosis before any rail power. No data, no fee.

(512) 212-9111
Mon-Fri 10am-6pm CT
No diagnostic fee
No data, no fee
4.9 stars, 1,837+ reviews