How ZFS Differs from Hardware RAID

Summary: ZFS versus hardware RAID
ZFS checksums every block and self-heals from redundant copies during normal reads; hardware RAID uses parity only and cannot detect silent corruption. ZFS copy-on-write keeps the on-disk tree intrinsically consistent, while hardware RAID relies on a battery-backed write cache to survive torn writes. ZFS needs direct disk access via HBA passthrough or JBOD, never behind a hardware RAID controller.
ZFS and hardware RAID both provide data redundancy across multiple drives, but they operate at different layers of the storage stack. Hardware RAID controllers (Dell PERC, HP SmartArray, LSI MegaRAID, Adaptec) manage redundancy below the filesystem, presenting a single virtual volume to the operating system.
ZFS manages redundancy within the filesystem itself, combining volume management and filesystem operations into a single integrated layer. This architectural difference affects data integrity, failure handling, and recovery.
How does copy-on-write differ from in-place update?
Traditional filesystems (NTFS, ext4, XFS) and hardware RAID controllers use in-place updates: when data is modified, the new version overwrites the old version at the same physical location. If power is lost during the write, the block may contain a mix of old and new data (a torn write). Hardware RAID controllers mitigate this with battery-backed write cache (BBU/BBM) that preserves pending writes across power loss. Filesystems use journaling to record write intent before committing changes.
ZFS uses copy-on-write (COW): modified data is always written to a new location. The old data remains intact until the new write completes and the metadata tree is updated to point to the new location. The metadata tree itself is also written copy-on-write, all the way up to the root (the "uberblock"). Only after the entire tree of changes is written does ZFS atomically update the uberblock pointer.
The effect: ZFS never overwrites live data. A power loss at any point during a write leaves the filesystem tree in a consistent state, either reflecting the old data or the new data, never a torn mix.
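The ordering described above can be illustrated with a toy model. This is a minimal sketch, not ZFS code: the `CowStore` class, its `root` field (standing in for the uberblock pointer), and the `crash_before_root_update` flag are all hypothetical names invented for the example.

```python
class CowStore:
    """Toy copy-on-write store: existing blocks are never overwritten."""

    def __init__(self):
        self.blocks = {}     # address -> stored object; addresses are never reused
        self.next_addr = 0
        self.root = None     # stands in for the uberblock pointer (assumption)

    def _write_new(self, payload):
        """Write a payload to a fresh address and return that address."""
        addr = self.next_addr
        self.blocks[addr] = payload
        self.next_addr += 1
        return addr

    def commit(self, files, crash_before_root_update=False):
        # 1. Write every data block to a new location.
        pointers = {name: self._write_new(data) for name, data in files.items()}
        # 2. Write a new metadata block that points at the new data.
        meta_addr = self._write_new(pointers)
        if crash_before_root_update:
            return  # simulated power loss: the old root still points at old data
        # 3. Only now switch the root pointer, in a single step.
        self.root = meta_addr

    def read_all(self):
        """Follow the current root and return the files it references."""
        if self.root is None:
            return {}
        return {name: self.blocks[addr] for name, addr in self.blocks[self.root].items()}


store = CowStore()
store.commit({"report.txt": b"old contents"})
store.commit({"report.txt": b"new contents"}, crash_before_root_update=True)
print(store.read_all())  # still the old version, never a mix of old and new
```

The interrupted commit leaves unreferenced blocks behind, but the tree reachable from the root is unchanged, which is the property the text relies on.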
On mount, ZFS opens the most recent valid uberblock. If acknowledged synchronous writes (fsync, O_SYNC, NFS COMMIT) were logged to the ZFS Intent Log (ZIL) but not yet committed in a transaction group, ZFS replays the ZIL to recover them. Async writes that never reached stable storage are lost, same as on any filesystem.
How do checksumming and self-healing reads work?
Hardware RAID has no mechanism to detect silent data corruption. If a drive returns incorrect data without reporting a read error, the RAID controller accepts it as valid and may even incorporate it into parity calculations. This is called silent data corruption or bit rot.
ZFS checksums every block of data and metadata and stores the result in a 256-bit field (fletcher4 by default; SHA-256 is the usual choice when deduplication is enabled). The checksum is stored in the block's parent pointer, not alongside the data. This separation means a corruption event that damages a block does not also damage the checksum used to verify it.
When ZFS reads a block, it verifies the checksum before returning the data. If the checksum does not match, ZFS knows the data is corrupt.
In a redundant configuration (mirror or RAIDZ), ZFS automatically reads the block from a different copy or reconstructs it from parity. If the alternate copy is valid, ZFS overwrites the corrupted copy with correct data. This is self-healing: corruption is detected and repaired transparently during normal reads.
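The read path described above can be sketched for a two-way mirror. This is an illustration under stated assumptions, not the OpenZFS implementation: `MirrorVdev` and its methods are hypothetical, each disk is modelled as a plain dictionary, and SHA-256 from the standard library stands in for whichever checksum the pool is configured to use.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 stands in for the pool's configured checksum (assumption).
    return hashlib.sha256(data).digest()


class MirrorVdev:
    """Toy mirror: every disk holds a full copy of each block."""

    def __init__(self, n_disks: int = 2):
        self.disks = [dict() for _ in range(n_disks)]  # one addr -> bytes map per disk

    def write(self, addr: int, data: bytes):
        for disk in self.disks:
            disk[addr] = data
        # The returned "block pointer" is kept by the parent, away from the data.
        return (addr, checksum(data))

    def read(self, block_pointer):
        addr, expected = block_pointer
        failed = []
        for disk in self.disks:
            data = disk.get(addr)
            if data is not None and checksum(data) == expected:
                # Self-heal: rewrite every copy that failed verification so far.
                for bad_disk in failed:
                    bad_disk[addr] = data
                return data
            failed.append(disk)
        raise IOError("no copy of the block passed checksum verification")


mirror = MirrorVdev()
bp = mirror.write(0, b"customer database page")
mirror.disks[0][0] = b"customer databaze page"   # silent corruption on disk 0
assert mirror.read(bp) == b"customer database page"
assert mirror.disks[0][0] == b"customer database page"  # repaired during the read
```

Because the expected checksum travels with the parent pointer, the reader can tell which copy is wrong. A controller that only compares data against parity can see that the two disagree but not which side is correct.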
| Feature | Hardware RAID | ZFS |
|---|---|---|
| Silent corruption detection | No (trusts drive-reported data) | Yes (every block checksummed) |
| Self-healing reads | No | Yes (with redundancy) |
| Write consistency | Requires BBU + journal | Copy-on-write (inherent) |
| Disk visibility | Controller hides individual disks | Filesystem manages individual disks |
| Expansion | Replace drives or add expansion unit | Add new vdevs (RAIDZ vdev expansion added in OpenZFS 2.3+) |
| Cache safety | BBU-backed write cache | ZIL on separate SLOG device (optional) |
How does ZFS scrub differ from a hardware RAID rebuild?
Hardware RAID has no checksum-based mechanism to proactively verify data integrity. Some enterprise controllers support "patrol reads" that scan drives in the background, but these only detect drive-reported errors, not silent corruption.
ZFS scrub reads every allocated block on every drive and verifies its checksum. If a block fails verification, ZFS repairs it from redundant copies (mirror or parity). A scrub is non-destructive and can run on a live, mounted filesystem. Running regular scrubs (weekly or monthly) catches corruption early, before it accumulates across multiple blocks.
When a drive fails, ZFS resilvering (the equivalent of a RAID rebuild) only reads and writes the blocks that are actually allocated. If a 16 TB drive is 40% full, ZFS resilvers approximately 6.4 TB of data, not 16 TB. Hardware RAID rebuilds always process the entire drive capacity because the controller operates below the filesystem and does not know which blocks contain data.
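The arithmetic in that example is worked through below. The function names are hypothetical, and `hours_at` assumes a constant sustained throughput that the caller supplies; real resilver and rebuild rates vary with fragmentation and load, so treat it as a rough comparison only.

```python
def resilver_tb(capacity_tb: float, fraction_allocated: float) -> float:
    """Data a ZFS resilver has to process: only the allocated share of the drive."""
    if not 0.0 <= fraction_allocated <= 1.0:
        raise ValueError("fraction_allocated must be between 0 and 1")
    return capacity_tb * fraction_allocated


def full_rebuild_tb(capacity_tb: float) -> float:
    """Data a hardware RAID rebuild has to process: the whole drive."""
    return capacity_tb


def hours_at(tb: float, mb_per_s: float) -> float:
    """Rough duration at a constant rate (assumption), using 1 TB = 1,000,000 MB."""
    return tb * 1_000_000 / mb_per_s / 3600


zfs_tb = resilver_tb(16, 0.4)
raid_tb = full_rebuild_tb(16)
print(f"ZFS resilver: {zfs_tb:.1f} TB, hardware rebuild: {raid_tb:.1f} TB")
# prints "ZFS resilver: 6.4 TB, hardware rebuild: 16.0 TB"
```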
What causes ZFS pools to fail, and what does recovery look like?
ZFS is not immune to failure. The most common scenarios:
- Too many drive failures. RAIDZ1 (single parity) tolerates one drive failure. RAIDZ2 (double parity) tolerates two. RAIDZ3 tolerates three. Exceeding the redundancy level makes the pool unimportable.
- Pool metadata corruption. The uberblock, MOS (Meta Object Set), and space maps are critical metadata structures. If these corrupt on all copies (possible with firmware bugs or controller errors during a multi-drive event), the pool cannot mount.
- RAIDZ expansion complications. ZFS traditionally did not allow adding drives to an existing RAIDZ vdev (OpenZFS 2.3+ added this feature). Misconfigured pool expansions or interrupted vdev additions can leave the pool in an inconsistent state.
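- Accidental pool destruction. The command "zpool destroy" takes effect immediately and does not ask for confirmation. It marks the pool as destroyed in the labels on the member drives; "zpool import -D" can sometimes bring such a pool back, but only while those drives have not been reused or overwritten.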
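The redundancy limits in the first scenario reduce to a simple check, sketched here. The `PARITY` table and both function names are hypothetical. The sketch models only RAIDZ vdevs and assumes a pool striped across its top-level vdevs, so losing any one of them loses the pool.

```python
# Failed drives each RAIDZ level tolerates within a single vdev.
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def vdev_survives(level: str, failed_drives: int) -> bool:
    """A RAIDZ vdev survives while failures do not exceed its parity count."""
    return failed_drives <= PARITY[level]


def pool_importable(vdevs) -> bool:
    """vdevs is a list of (level, failed_drives) pairs, one per top-level vdev."""
    return all(vdev_survives(level, failed) for level, failed in vdevs)


print(pool_importable([("raidz2", 2), ("raidz2", 0)]))  # True: within tolerance
print(pool_importable([("raidz1", 2), ("raidz2", 0)]))  # False: one vdev is lost
```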
Recovery from a failed ZFS pool involves imaging all drives individually and using ZFS-aware recovery tools to parse the on-disk structures. Because ZFS stores metadata in a Merkle tree (every block pointer includes the checksum of the block it points to), recovery tools can validate data integrity during reconstruction.
Damaged metadata blocks can sometimes be reconstructed from the multiple copies ZFS maintains: each vdev carries four copies of its label, every label holds its own uberblock array, and metadata blocks are written as additional "ditto" copies beyond what the "copies" property sets for data. When pool metadata is corrupted on every copy and a standard "zpool import" fails, our RAID data recovery service reads the vdev labels to identify pool membership, then uses ZFS-aware tools to walk the surviving metadata tree from the individual drive images and extract the readable datasets.
ZFS should never sit behind a hardware RAID controller.
ZFS needs direct access to individual disks to perform checksumming, self-healing, and copy-on-write operations. A hardware RAID controller hides individual disks behind a virtual volume, preventing ZFS from detecting which disk returned bad data. The controller's write cache can also interfere with ZFS's write ordering guarantees. Use an HBA (Host Bus Adapter) in passthrough or JBOD mode instead.
Why does a deduplicated ZFS pool refuse to import?
A deduplicated ZFS pool can become practically unimportable when its Deduplication Table (DDT) massively exceeds available physical RAM. ZFS pages the on-disk DDT into the Adaptive Replacement Cache (ARC) on demand, and the random I/O needed to map a large DDT during a memory-heavy import can drive thrashing high enough to stall or crash the import on that hardware.
The general engineering rule is roughly 5 GB of RAM per 1 TB of deduplicated data. That ratio compounds quietly: an array that imported fine on its original host stops importing once the unique deduplicated dataset outgrows the RAM that mapped its DDT. Dedup is never free storage. Every deduplicated block carries a standing RAM cost, and import behavior degrades severely when that cost is not met.
Under severe memory pressure from a large DDT, the import does not degrade gracefully. zpool import hangs, the Linux OOM killer intervenes, or the kernel panics. The pool stays unimportable on that hardware until the RAM is upgraded or the pool is imported on a higher-RAM recovery rig that can map the full DDT into ARC.
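The rule of thumb above turns into a quick sizing check before an import is attempted. This is a rough sketch: the constant comes from the 5 GB per 1 TB figure in the text, the function names are hypothetical, and it assumes all host RAM is available to ARC, which overstates what a real system can use.

```python
GB_RAM_PER_TB_DEDUP = 5  # rule of thumb quoted in the text, not a measured value


def ddt_ram_gb(unique_dedup_tb: float) -> float:
    """Approximate RAM needed to hold the DDT for this much deduplicated data."""
    return unique_dedup_tb * GB_RAM_PER_TB_DEDUP


def ram_shortfall_gb(unique_dedup_tb: float, host_ram_gb: float) -> float:
    """How far the host falls short of the estimate (0 if it has enough)."""
    return max(0.0, ddt_ram_gb(unique_dedup_tb) - host_ram_gb)


# A 20 TB deduplicated dataset on a 64 GB host (illustrative numbers).
print(ddt_ram_gb(20))            # 100
print(ram_shortfall_gb(20, 64))  # 36.0
```

A non-zero shortfall suggests the import belongs on a higher-RAM recovery rig rather than the original host.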
Forensic remediation for a dedup RAM hang.
Image every member to write-blocked targets first, then import the pool read-only on a recovery rig with enough physical RAM to hold the entire DDT in ARC. Do not rebuild the pool. Rebuilding initializes blank structures and destroys the forensic state. At our Austin, TX lab the member drives are imaged with the PC-3000 Portable III and DeepSpar Disk Imager, and the read-only import runs against those images on a high-RAM rig, never against the live members.
A hardware RAID controller has no dedup-RAM dependency at import. It presents the virtual volume regardless of host RAM, so the failure class in this section is specific to ZFS deduplication and has no hardware-RAID equivalent.
How does uberblock rollback recover a pool that a corrupted transaction group broke?
Uberblock rollback recovers the pool by importing it at an earlier transaction group (TXG) whose metadata is still consistent, bypassing the corrupted most-recent state. ZFS structures the pool as a Merkle tree rooted in uberblocks, and each vdev label carries a ring buffer of up to roughly 128 uberblocks that rotate per TXG.
Writes batch into transaction groups, and the default txg_timeout is about 5 seconds. Pool import normally walks back from the most recent valid uberblock. When that most recent TXG is corrupted, an engineer steps the import back to an earlier uberblock so the tree resolves against metadata that was committed before the corruption.
ZFS is copy-on-write, so this is not an in-place overwrite. Import replays the ZFS Intent Log (ZIL) and advances or rolls back the TXG pointer; it does not overwrite parity or live data in place. Rewinding the TXG permanently orphans the newer transaction groups, which is exactly why the rollback must run against bit-for-bit member images rather than the live members.
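The selection logic behind a rollback can be sketched with a toy ring buffer. This is a simplified model, not the on-disk format: the `Uberblock` class, the `valid` flag (standing in for "the checksum verifies and the tree below resolves"), and the slot rule `txg % RING_SLOTS` are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

RING_SLOTS = 128  # upper bound quoted in the text; real labels may hold fewer


@dataclass
class Uberblock:
    txg: int
    valid: bool  # assumption: summarizes checksum and metadata-tree consistency


def write_uberblock(ring: List[Optional[Uberblock]], ub: Uberblock) -> None:
    """Each new TXG overwrites the oldest slot, so history is bounded."""
    ring[ub.txg % RING_SLOTS] = ub


def pick_uberblock(ring, max_txg: Optional[int] = None) -> Optional[Uberblock]:
    """Newest valid uberblock, optionally capped at an engineer-chosen TXG."""
    candidates = [
        ub for ub in ring
        if ub is not None and ub.valid and (max_txg is None or ub.txg <= max_txg)
    ]
    return max(candidates, key=lambda ub: ub.txg, default=None)


ring: List[Optional[Uberblock]] = [None] * RING_SLOTS
for txg in range(1000, 1201):
    # Pretend the last two transaction groups were damaged.
    write_uberblock(ring, Uberblock(txg=txg, valid=txg < 1199))

print(pick_uberblock(ring).txg)                # 1198: newest consistent state
print(pick_uberblock(ring, max_txg=1150).txg)  # 1150: explicit earlier target
print(pick_uberblock(ring, max_txg=1000))      # None: that slot was overwritten
```

The last call shows the limit of the technique: once the ring has rotated past a TXG, that state can no longer be selected, which is one more reason to image the members before attempting any import.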
The forensic command sequence:
- "zdb -e -u" and "zdb -e -ul" enumerate the available uberblocks and their TXG numbers from the imaged members.
- "zpool import -F" automatically rolls back transaction groups to the last consistent state.
- "zpool import -T <txg>" targets a specific earlier transaction group identified from the zdb output.
A hardware RAID controller commits data in place behind a battery-backed cache and keeps no historical metadata ring. There is no equivalent atomic rollback to an earlier consistent state, which is the structural reason ZFS can recover from a single bad transaction group where a hardware array cannot.
What happens when a ZFS native encryption key is lost?
When a ZFS native encryption key is lost and no key export was saved, the dataset is mathematically unrecoverable regardless of drive health. ZFS native encryption operates per dataset using AES in GCM or CCM mode (for example aes-256-gcm or aes-128-ccm). A per-dataset master key protects the data and is itself wrapped by the user's wrapping key, which is supplied directly as a key file or derived from a passphrase.
If that wrapping key, passphrase, or key file is gone, imaging the drives yields only ciphertext. Brute-forcing an AES key is computationally infeasible. This is a different failure class from drive failure: a healthy pool with a lost key is unrecoverable, while a degraded pool with intact key material can be imaged and reconstructed at our Austin, TX lab.
Hardware RAID and self-encrypting drives behave differently here. An SED uses a media-encryption key generated on and wrapped by the controller, so controller-held key material can sometimes let key recovery proceed. ZFS native encryption is host-held and has no such controller fallback, so the only path back to the plaintext is the original wrapping key, passphrase, or key file.
Frequently Asked Questions
- Does ZFS make data recovery unnecessary?
- No. ZFS provides stronger data integrity guarantees than hardware RAID through checksumming, copy-on-write, and self-healing reads. However, ZFS cannot protect against all failure modes. If enough drives in a vdev fail to exceed the redundancy level (e.g., two drives in a single-parity RAIDZ1), the pool becomes unimportable. If the pool metadata on all drives is corrupted (possible with firmware bugs, controller errors, or multiple simultaneous failures), the pool cannot mount. ZFS also cannot protect against user error (deleting files without snapshots) or NAND degradation in SSD-based pools.
- Why should I not put ZFS behind a hardware RAID controller?
- ZFS needs direct access to individual disks to perform its checksumming, copy-on-write, and self-healing operations. A hardware RAID controller presents a single virtual volume to the operating system, hiding the individual disks. ZFS cannot checksum individual disk sectors, cannot detect which disk returned bad data, and cannot perform self-healing reads from redundant copies. The RAID controller's write cache also interferes with ZFS's own write ordering guarantees, potentially corrupting the pool during power loss if the controller's battery-backed cache fails. ZFS should be connected to an HBA (Host Bus Adapter) in passthrough/JBOD mode, not a RAID controller.