
LVM2 On-Disk Architecture
LVM2 operates through the Linux Device Mapper (DM) framework, mapping physical disk sectors to logical extents via three abstraction layers: Physical Volumes (PV), Volume Groups (VG), and Logical Volumes (LV). Understanding where each metadata structure resides on disk is the prerequisite for any recovery.
| Layer | On-Disk Location | Recovery Implications |
|---|---|---|
| PV Label | Sector 1 (bytes 512-1023) | Contains the LABELONE magic string, PV UUID, and pointer to the VGDA. Overwriting sector 1 (via accidental pvcreate, fdisk, or mkfs) destroys the entry point to the entire LVM tree. The user data area is unaffected. |
| VGDA (Ring Buffer) | Typically sectors 8+ | Human-readable ASCII text describing the VG name, UUID, PE size (default 4MB), and the complete LV-to-PE mapping. Stored in a circular ring buffer; each VG modification writes a new copy and advances the pointer. Older copies survive in the buffer until overwritten by subsequent changes. |
| User Data Area | After PE alignment boundary | Physical extents (default 4MB each) containing the actual filesystem data. This region is untouched by PV label or VGDA corruption. The data is recoverable if the extent mapping can be reconstructed from any surviving VGDA copy. |
Key detail: The VGDA is plain ASCII text, not binary. A volume group descriptor for a 4-drive, 3-LV configuration is approximately 2-4KB of readable text. You can extract it with strings or hexdump -C and read the PV UUIDs, LV names, segment mappings, and extent allocations directly. This makes LVM metadata one of the more recoverable structures in the Linux storage stack, provided the ring buffer region has not been fully overwritten.
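As a concrete illustration, the sketch below writes a hand-made descriptor fragment into a scratch image file and then extracts it the same way you would from a real PV image. Every name, offset, and descriptor value here is fabricated for demonstration; on real media, always work against a write-blocked clone.

```shell
# Build a scratch "PV image" with fake descriptor text at sector 8,
# then recover it with dd + strings. All names and values are illustrative.
img=pv-demo.img
dd if=/dev/zero of="$img" bs=512 count=64 2>/dev/null
printf 'vg_data {\nid = "Qc2nMr-demo"\nseqno = 7\nextent_size = 8192\n}\n' |
  dd of="$img" bs=512 seek=8 conv=notrunc 2>/dev/null
# Extraction step: scan from sector 8 for readable descriptor text.
dd if="$img" bs=512 skip=8 count=8 2>/dev/null | strings -n 4
```

The same `dd ... | strings` pipeline run against a cloned PV is often the fastest way to confirm whether any descriptor copy survives in the ring buffer before reaching for LVM tools.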
How LVM Metadata Gets Destroyed
LVM metadata corruption is rarely spontaneous on healthy hardware. It results from administrative errors, hardware failures in the metadata region, or interactions between stacked storage layers.
1. Accidental pvcreate on the wrong device. pvcreate writes a new PV label at sector 1 and a blank VGDA. If the target device already contained an LVM physical volume, the original VG descriptor is overwritten. The user data extents remain intact; only the mapping metadata is lost.
2. Partition table overwrites. Running fdisk, parted, or a NAS initialization wizard on a PV member rewrites the start of the disk. GPT places its primary header at LBA 1 and a backup table at the end of the disk, so writing a GPT label to a whole-disk PV lands directly on the PV label and destroys it. MBR partition tables occupy only sector 0 and leave the LVM label intact, but some tools write additional metadata beyond sector 0.
3. Bad sectors in the metadata region. On HDDs, media degradation in sectors 1-16 (where the PV label and VGDA reside) produces I/O errors when LVM attempts to read the descriptor. The drive SMART log shows reallocated or pending sectors. The user data area may be fully intact while the small metadata region is physically unreadable.
4. NVMe FTL corruption. On NVMe SSDs, the Flash Translation Layer maps logical block addresses to physical NAND pages. A power loss during FTL journal commit can cause the controller to return UNC errors or stale data for the sectors containing the LVM metadata. The OS reports the same "volume group not found" error, but the root cause is hardware, not logical. Running vgcfgrestore on a drive with FTL corruption may write to an unmapped or mis-mapped NAND page.
5. Stacked layer desynchronization. On systems where LVM sits on top of mdadm software RAID, a RAID rebuild or resync can alter the byte layout that LVM expects. Synology SHR and QNAP QTS both layer LVM over mdadm. An interrupted RAID reshape or a failed drive replacement that triggers a partial resync can leave the mdadm array intact while the LVM metadata region contains stale or torn writes.
6. USB enclosure sector translation. External drives connected via USB bridge boards (ASMedia ASM1153E, JMicron JMS578) can perform 512-byte to 4K-sector translation. The PV label at logical sector 1 may shift to a different physical offset when the drive is connected directly via SATA. This produces a false "metadata missing" error. The metadata is intact; the block addressing has changed.
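The addressing shift behind the USB translation failure is simple arithmetic. This sketch shows why a label written under 512-byte addressing "disappears" when the same drive is read through a bridge reporting 4096-byte sectors (pure shell arithmetic; no devices are touched):

```shell
# PV label lives at LBA 1 under 512-byte logical sectors = byte offset 512.
label_byte=$((1 * 512))
# Read through a 4Kn-reporting bridge, byte 512 falls inside LBA 0,
# so a scanner that checks "sector 1" (now byte 4096) finds nothing.
lba_4kn=$((label_byte / 4096))
echo "label byte offset: $label_byte -> 4Kn LBA: $lba_4kn"
```

Comparing `blockdev --getss` output for the drive inside the enclosure versus on direct SATA confirms whether translation is the culprit before any metadata repair is attempted.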
Commands That Destroy LVM Volumes During Recovery
The standard sysadmin response to a missing volume group is to attempt restoration via LVM tools. Several commonly suggested commands convert a recoverable metadata loss into permanent data destruction.
- ✕ pvcreate on an existing PV member. pvcreate generates a new PV UUID and writes a blank VGDA. The original volume group descriptor (including all LV segment mappings) is overwritten. If the ring buffer was small enough that pvcreate's blank descriptor spans the entire VGDA region, no prior copy survives for hex-level extraction.
- ✕ vgcfgrestore with an incorrect .vg file. Restoring a VG descriptor from the wrong timestamp (e.g., before an lvextend operation) maps logical extents to physical extents that no longer match the on-disk layout. The filesystem will mount with missing or corrupt files because the extent map points to the wrong physical locations.
- ✕ mkfs on the raw PV device. Running mkfs.ext4 or mkfs.xfs directly on the physical volume device writes filesystem superblocks and inode tables across the entire block device, overwriting both the LVM metadata and the user data extents. This is irreversible for the overwritten regions.
- ✕ lvremove on an SSD with issue_discards enabled. If issue_discards = 1 is set in /etc/lvm/lvm.conf, lvremove sends TRIM/UNMAP commands to the SSD controller. The controller invalidates the FTL mapping for those blocks. Subsequent reads return zeroes. The physical NAND charge may still exist, but the controller will not serve it. Recovery is not possible after TRIM executes.
Before running any LVM command on a missing volume group: Image every physical volume member to a separate storage target using write-blocked connections. All recovery attempts must operate on images, not original media. If any underlying drive has physical faults, the imaging step captures recoverable sectors before the drive condition worsens.
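Before any lvremove on an SSD-backed VG, it is worth confirming the discard setting first. The sketch below greps a sample config file standing in for /etc/lvm/lvm.conf (the filename and contents are fabricated for the demo); on a live system, `lvmconfig devices/issue_discards` reports the effective value.

```shell
# Sample config fragment standing in for /etc/lvm/lvm.conf (assumption).
conf=lvm-sample.conf
printf 'devices {\n    issue_discards = 1\n}\n' > "$conf"

# Pull the value: line reads "issue_discards = 1", so field 3 is the value.
val=$(awk '/issue_discards/ {print $3}' "$conf")
if [ "$val" = "1" ]; then
    echo "WARNING: lvremove will TRIM freed extents -- image the drive first"
fi
```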
Standard LVM vs. LVM-Thin Provisioning
Standard LVM and LVM-thin provisioning use fundamentally different metadata structures. The recovery approach for each is distinct, and applying the wrong technique will fail silently or cause data loss.
| Property | Standard LVM | LVM-Thin (dm-thin) |
|---|---|---|
| Metadata Format | ASCII text in VGDA ring buffer | Binary B-tree in a hidden metadata LV |
| Recovery Tool | vgcfgrestore | thin_dump / thin_repair / thin_restore |
| Allocation | Static: PEs mapped to LEs at creation time | Dynamic: virtual blocks allocated on first write |
| Snapshot Metadata | COW exception table (simple, linear) | Shared B-tree with reference counts per block |
| Corruption Impact | LV mapping lost; data extents survive | Entire thin pool offline; all thin LVs inaccessible |
| Common Platforms | RHEL, CentOS, Synology SHR, QNAP QTS | Proxmox VE, oVirt, enterprise KVM hypervisors |
vgcfgrestore does not fix thin pool metadata. The thin pool's binary B-tree is stored inside a hidden logical volume, not in the VGDA. Restoring the VG descriptor via vgcfgrestore re-creates the thin pool LV definitions but does not touch the B-tree contents. If the B-tree is corrupted, the thin pool remains in needs_check state after VG restoration. You must use thin_dump to export the B-tree to XML, repair it, and thin_restore to write it back.
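For reference, thin_dump's XML output looks roughly like the fragment below. This is a hand-written, simplified sample for illustration, not output from a real pool; the superblock's transaction attribute is the anchor that repair tooling rolls back to.

```shell
# Hand-written thin_dump-style XML fragment (illustrative only).
cat > meta-demo.xml <<'EOF'
<superblock uuid="" time="3" transaction="12" data_block_size="128" nr_data_blocks="16384">
  <device dev_id="1" mapped_blocks="100" transaction="10" creation_time="0" snap_time="3">
    <range_mapping origin_begin="0" data_begin="0" length="100" time="1"/>
  </device>
</superblock>
EOF
# The superblock transaction id anchors any rollback repair:
grep -o 'transaction="[0-9]*"' meta-demo.xml | head -n 1
```

Because the dump is plain XML, inconsistencies such as devices referencing a newer transaction than the superblock can be spotted by eye before thin_restore writes anything back.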
Proxmox VE LVM-Thin Metadata Corruption
Proxmox VE uses LVM-thin provisioning as its default local storage backend. Each VM disk is a thin logical volume. The thin pool metadata LV contains the block-level allocation map for every VM on the host. When this metadata LV corrupts, every VM on the storage pool becomes inaccessible simultaneously.
1. Power loss during metadata commit. The dm-thin target writes metadata in transactions. A power failure mid-transaction leaves the B-tree in an inconsistent state. Proxmox boots with the thin pool in read-only mode or refuses to activate it. The lvs -a output shows the thin pool with the "needs_check" attribute.
2. Thin pool metadata overflow. When the metadata LV runs out of space (common when snapshots accumulate without pruning), dm-thin cannot allocate new mapping entries. Writes to any thin LV begin failing with I/O errors. If the host is not shut down cleanly at this point, the metadata B-tree can record partial allocations that leave orphaned blocks.
3. Array expansion truncating the metadata LV. Expanding the underlying physical storage (e.g., replacing drives in a hardware RAID and growing the PV) can inadvertently truncate the thin metadata LV if the LVM tools recalculate extent boundaries. The binary B-tree is severed mid-structure, and standard thin_check reports fatal errors.
Recovery approach: Image the entire block device. Extract the thin metadata LV contents by calculating its PE range from the VG descriptor. Run thin_dump to export the B-tree to XML. Identify broken transaction IDs or orphaned block references in the XML. Use thin_repair to rebuild the B-tree from the last consistent transaction. Restore via thin_restore to the metadata device on the imaged clone. If thin_dump itself fails with I/O errors, the metadata LV contains physical bad blocks and requires sector-level imaging via PC-3000 before any logical repair can proceed.
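The "calculate its PE range" step in this approach reduces to extent arithmetic on values read straight out of the VG descriptor. The numbers below come from a hypothetical descriptor and are assumptions for illustration:

```shell
# Hypothetical values taken from a VG descriptor (assumptions):
pe_start_sectors=2048               # pe_start field: data area offset in 512-byte sectors
data_start=$((pe_start_sectors * 512))
extent_size=$((4 * 1024 * 1024))    # 4 MiB physical extents (the default)
first_pe=262                        # first extent of the hidden pool_tmeta LV

# Byte offset of the metadata LV within the PV image:
byte_offset=$((data_start + first_pe * extent_size))
echo "metadata LV starts at byte $byte_offset"
# A carve might then look like (placeholder names):
#   dd if=pv-clone.img of=tmeta.img bs=4M skip=<extents> count=<length>
```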
LVM Snapshot Chain Corruption
LVM snapshots (both standard COW and thin snapshots) add another metadata dependency layer. When snapshot metadata corrupts, the corruption can cascade to the origin volume.
1. Standard LVM snapshots (COW exception table): Standard snapshots store changed blocks in a COW (Copy-on-Write) exception table. When the snapshot volume fills completely, the kernel invalidates the snapshot and marks it as "Snapshot invalid." The origin volume remains accessible, but any blocks that were in transit during the overflow event may contain inconsistent data if a write was interrupted mid-COW.
2. Thin snapshots (shared B-tree references): Thin snapshots share physical blocks with the origin via reference counting in the thin pool B-tree. A corrupted reference count can cause the thin pool to report impossible block states (negative references or blocks claimed by nonexistent thin LVs). When thin_check detects these inconsistencies, it places the pool in read-only mode. Deleting the corrupt snapshot without first repairing the B-tree can decrement reference counts for blocks still used by the origin, causing silent data loss on the primary volume.
Do not delete corrupt snapshots without verifying B-tree consistency. On thin pools, use thin_dump to export the B-tree to XML and verify reference counts before removing any snapshot. On standard LVM, invalidated snapshots are safe to remove (the exception table is already discarded by the kernel), but the origin volume should be checked for filesystem consistency before resuming writes.
How We Recover LVM Volumes
Professional LVM recovery separates the physical imaging step from the logical metadata reconstruction. We do not run LVM tools on original media under any circumstances.
1. Image all physical volume members. Each drive is connected via write-blocked interface and cloned sector-by-sector using PC-3000 or DeepSpar Disk Imager. On HDDs with bad sectors in the metadata region (sectors 1-16), PC-3000 selective head imaging and adaptive read parameter adjustment extract the metadata sectors from degraded platters. On NVMe SSDs with FTL corruption, we stabilize the controller via PCIe link negotiation before extracting the LBA range containing the VGDA.
2. Locate the VGDA on each PV image. Scan the imaged block device for the LABELONE magic string at sector 1 to find the PV label. If the label is intact, it points directly to the VGDA offset. If the label is destroyed, we scan outward from sector 8 for ASCII text matching the VG descriptor pattern (VG name, PV UUID, extent mappings). The ring buffer stores multiple historical copies; we extract all of them and compare timestamps to identify the most recent valid version.
3. Cross-reference /etc/lvm/archive if the OS drive is accessible. The archive directory contains timestamped .vg files from every VG metadata change. We compare the archive descriptor to the hex-extracted VGDA to verify consistency. If the archive is more recent, we use it. If the hex-extracted copy is more recent (the archive is on a different disk that was last booted before the corruption event), we use the extracted copy.
4. Reconstruct the VG on the imaged clone. Write the validated descriptor back to the PV image using vgcfgrestore. Activate the VG, scan for logical volumes, and mount each LV read-only. For thin pools, run thin_dump on the metadata LV, repair via thin_repair, and thin_restore before activating the thin LVs.
5. Verify filesystem integrity and extract data. Run filesystem-appropriate, read-only consistency checks (e2fsck -n for ext4, xfs_repair -n for XFS) against each LV, then extract recovered data to external storage. The original drives are returned to the client unmodified; all work was performed on images.
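The archive-versus-extracted comparison in step 3 usually comes down to the seqno field each descriptor carries: LVM increments it on every metadata change, so the higher seqno is the newer layout. A minimal sketch with two fabricated descriptor fragments:

```shell
# Two fabricated descriptor fragments (assumptions for illustration).
printf 'vg_data {\nseqno = 11\n}\n' > archive-copy.vg
printf 'vg_data {\nseqno = 12\n}\n' > extracted-copy.vg

# Higher seqno = more recent metadata; here the hex-extracted copy wins.
for f in archive-copy.vg extracted-copy.vg; do
    echo "$f: seqno $(awk '/seqno/ {print $3}' "$f")"
done
```

A seqno mismatch between the archive and the on-disk ring buffer is also a quick sanity check that an lvextend or snapshot operation happened after the last archived backup.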
Thin pool repair risk: Running thin_repair on a production metadata device can corrupt the B-tree further if the underlying storage has uncorrectable read errors. The repair tool writes corrected metadata back to the same device; if those writes land on failing sectors, the metadata damage worsens. Safe recovery images all pool member drives first, rebuilds any hardware RAID virtually, extracts the thin metadata LV from its PE range, and runs thin_dump on the cloned image to produce an XML map of all provisioned blocks. Orphaned transaction IDs are corrected in the dump before thin_restore rebuilds the metadata on the clone.
When LVM Corruption Masks a Hardware Failure
The OS reports the same error message ("Volume group not found") regardless of whether the metadata region has a logical overwrite or a physical read failure. Determining which layer has failed is the first diagnostic step.
HDD: Bad Sectors in the Metadata Region
The PV label occupies a single 512-byte sector. If that sector develops a media defect, the SMART log records a reallocated sector count increment. The firmware may silently redirect reads to a spare sector, or it may return an I/O error if the spare pool is exhausted. On drives with shingled magnetic recording (SMR), the metadata region sits within the conventional recording zone (CMR cache), so metadata corruption on SMR drives is typically a cache flush failure, not a shingle zone problem. PC-3000 reads the defect map to determine whether sector 1 has been remapped or is physically unreadable.
NVMe SSD: FTL Mapping Failure
NVMe controllers use monolithic BGA packages that abstract all NAND access behind the Flash Translation Layer. A corrupted FTL journal means the controller cannot resolve LBA-to-NAND mappings for the sectors containing LVM metadata. The drive may report the correct capacity but return UNC errors for specific LBA ranges. Standard Linux tools (pvck, pvs) will report missing metadata. Running vgcfgrestore writes new data to the drive, but the controller may route that write to an incorrect NAND page due to the damaged FTL. Stabilizing the controller firmware and extracting the raw NAND data requires PC-3000 SSD.
SATA SSD: Firmware Lock (SATAfirm S11, etc.)
Certain Phison-based SATA SSDs (PS3111, PS3110/S10) are prone to a firmware lock that makes the drive report its model string as "SATAfirm S11." The drive appears in BIOS but shows 0 bytes capacity. The OS cannot read any sector, including the LVM metadata region. This is not LVM corruption; it is a controller firmware crash. Recovery requires PC-3000 SSD to reload the firmware tables and rebuild the FTL mapping before any LVM data becomes accessible.
LVM Recovery Pricing
LVM metadata recovery pricing depends on whether the underlying storage is physically healthy or requires hardware-level intervention.
| Scenario | Price Range | What's Involved |
|---|---|---|
| Logical metadata overwrite (healthy hardware) | $250+ | VGDA extraction from ring buffer, vgcfgrestore on cloned image, LV mount and data extraction. Single drive, no physical damage. |
| Multi-drive VG with mdadm RAID substrate | $600 - $900 | Image all array members, reconstruct mdadm layer, then reconstruct LVM layer on top. Synology SHR and QNAP QTS configurations. |
| LVM-thin pool metadata repair | $900 - $1,200 | Binary B-tree extraction, thin_dump to XML, transaction ID repair, thin_restore. Proxmox VE and enterprise hypervisor configurations. |
| Hardware failure + LVM reconstruction | $1,200 - $1,500 | PC-3000 hardware imaging (head swap, firmware repair, or FTL stabilization), followed by logical LVM reconstruction on the extracted image. |
All prices subject to evaluation. No diagnostic fee. No data, no recovery fee. Multi-drive configurations priced per the total number of drives requiring imaging and the complexity of the stacked storage layers.
Frequently Asked Questions
Is vgcfgrestore safe to run on a failing drive?
I accidentally ran pvcreate on a drive that already had LVM data. Is recovery possible?
My Proxmox thin pool shows 'metadata needs_check'. Can I fix it myself?
My NAS uses both mdadm and LVM. Which layer is broken?
I deleted a logical volume on an SSD. Can the data be recovered?
Where does Linux store LVM metadata backups?
Related Recovery Services
Full RAID recovery for all levels and controllers
Linux software RAID superblock recovery
VM disk recovery from Proxmox storage
DSM volume crash and SHR failures
Missing vdevs, corrupted ZIL, import errors
Safe approach to degraded arrays
RAID pool degradation on consumer NAS devices
SAN LUN metadata and provisioning recovery
LVM volume group missing?
Free evaluation. Write-blocked imaging. VGDA ring buffer extraction and thin pool metadata repair. No data, no fee.