ZFS (Zettabyte File System)
**ZFS** (Zettabyte File System) is a combined filesystem, volume manager, and RAID system originally developed at Sun Microsystems in 2005 for Solaris. After Sun's acquisition by Oracle and the resulting license complications, the open-source community continued development as **OpenZFS**, which is now the reference implementation across Linux, FreeBSD, illumos, and macOS.

## Architecture

ZFS collapses what Linux traditionally does in three separate layers (ext4 filesystem + LVM volume manager + mdadm software RAID) into a single integrated system. This unification is the source of most of ZFS's advantages — the filesystem knows about disks, RAID geometry, and checksums simultaneously.

## Key features

### Copy-on-write (CoW)

ZFS **never overwrites data in place**. Writes always go to new blocks; old blocks are freed only after the new write succeeds. Consequences:

- **No torn writes mid-crash** — either the old or the new state is intact, never a mixed state.
- **Instant snapshots** — a snapshot is just a reference to the current block tree; subsequent writes go to new blocks, and the old tree remains accessible.
- **Near-zero-space snapshots** — only changed blocks consume space.

### Block-level checksums with self-healing

Every block has a checksum (fletcher4 by default, or SHA-256) stored in its parent block pointer, separate from the data itself. On read, ZFS verifies the checksum:

- **Detects silent bit-rot** that other filesystems miss entirely.
- **With redundancy** (mirror, RAID-Z), ZFS automatically reads from a good copy and rewrites the corrupted block — **self-healing**.
- **Scrubs** can be scheduled to proactively verify every block in the pool.

### RAID-Z (RAID without the write hole)

Traditional RAID-5/6 has a **write hole**: if power fails during a write that spans multiple disks, parity can be left inconsistent with the data, leading to silent corruption. Hardware RAID cards work around this with battery-backed cache.
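The self-healing behavior is easy to demonstrate with a throwaway pool built on sparse files instead of real disks. This is an illustrative sketch, not a production recipe: it requires root and an installed OpenZFS, and the pool name `demo` and file paths are arbitrary.

```shell
# Requires root and OpenZFS installed. File-backed vdevs are for
# demonstration only; real pools should use whole disks.
truncate -s 256M /tmp/disk0 /tmp/disk1
zpool create demo mirror /tmp/disk0 /tmp/disk1

# Write some data, then take the pool offline.
dd if=/dev/urandom of=/demo/data bs=1M count=64
zpool export demo

# Simulate bit-rot on one half of the mirror. Writing at a 16M offset
# avoids the vdev labels at the start of the file, so the pool still imports.
dd if=/dev/urandom of=/tmp/disk0 bs=1M count=16 seek=16 conv=notrunc

zpool import -d /tmp demo
zpool scrub demo     # reads every block and verifies checksums
zpool status demo    # shows scrub progress and bytes repaired on /tmp/disk0

# Clean up.
zpool destroy demo
rm /tmp/disk0 /tmp/disk1
```

Because the other half of the mirror still holds a copy with a valid checksum, the scrub rewrites the damaged blocks in place rather than just reporting them.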
**RAID-Z1** (single parity, like RAID-5), **RAID-Z2** (double parity, like RAID-6), and **RAID-Z3** (triple parity) use copy-on-write plus variable-width stripes to **eliminate the write hole entirely**. No hardware cache required.

### Pooled storage

- Disks go into a **zpool** — a single storage pool.
- **Datasets** (analogous to subdirectories, but with independent properties) are carved from the pool.
- Each dataset can have its own compression, encryption, quota, reservation, checksum algorithm, and record size.
- A pool can be expanded by adding disks or mirror pairs; as of OpenZFS 2.3 (2025), existing RAID-Z vdevs can be expanded too.

### Transparent compression

- LZ4 is the default: near-free CPU cost, and often net-faster reads (less data from disk = faster).
- Zstd offers a higher ratio at higher CPU cost.
- Gzip is still available for archival data.

### Native encryption

- Optional AES-256-GCM, configured per dataset.
- Encryption is separate from the pool structure — encrypted snapshots can be replicated raw (`zfs send --raw`) to untrusted destinations.

### Send/receive

- **zfs send** and **zfs receive**, typically piped over SSH, provide incremental snapshot replication.
- Powers most TrueNAS and Proxmox backup workflows.

### Deduplication

- Block-level dedup is possible but costs roughly **5 GB of RAM per TB of deduped data** — usually not worth it unless the data is heavily redundant (VM disk images, specific workloads).

## Tradeoffs

- **RAM-hungry**: roughly 1 GB of RAM per TB of storage is recommended for reasonable performance.
- **ECC RAM strongly recommended** — ZFS protects against disk corruption, but RAM errors can still propagate to disk.
- **Not in the mainline Linux kernel**: the CDDL license is considered incompatible with the GPL, so ZFS ships as a separate module (OpenZFS, formerly "ZFS on Linux"). Ubuntu ships it as a prebuilt kernel module; other distros require a manual (typically DKMS) install.
- **Learning curve**: many more knobs and concepts than ext4.
- **Cannot shrink pools easily**: you can remove mirror pairs, but not arbitrary disks from a RAID-Z vdev.

## Typical users

- **TrueNAS** (core product built on ZFS; TrueNAS SCALE uses ZFS + Linux).
- **Proxmox** (virtualization platform with native ZFS support).
- **FreeBSD** (ZFS in the base system since 7.0).
- **Ubuntu** (ships the ZFS kernel module since 16.04; root-on-ZFS installs since 19.10).
- Home labs running NAS/storage servers.
- Enterprise storage where data integrity is critical (finance, healthcare, research).

## Historical notes

- Apple shipped read-only ZFS support in Mac OS X 10.5 and planned full adoption, but backed out over licensing (Sun's legal terms were incompatible with Apple's needs).
- Apple later built **APFS** (2017), which borrowed several ideas — copy-on-write, snapshots, cloning — but it checksums only metadata, not user data, and has no native RAID.
- Btrfs is the Linux-native copy-on-write filesystem with similar feature goals; it has historically been less reliable than ZFS in production, though it is improving.
- bcachefs (merged into Linux kernel 6.7 in 2024) is another Linux-native CoW filesystem aiming at ZFS's feature set without the license complications.

## Typical home-server stack

Combined with NVMe (Non-Volatile Memory Express) drives: NVMe + ZFS is the standard high-performance home NAS / storage-server configuration in 2026. ZFS's block checksums catch NVMe bit-rot (a real issue at high P/E-cycle counts or elevated temperatures), and NVMe's high IOPS suit ZFS's copy-on-write write patterns.

## Common beginner mistakes

- **Running ZFS on hardware RAID**: hides the individual disks and disables ZFS's self-healing. Always use an HBA / IT-mode controller so ZFS sees raw disks.
- **Wrong ashift**: set `ashift=12` for modern 4K-sector drives. ZFS autodetects from the drive's reported sector size, but many drives report 512 B; ashift=9 on a 4K-native drive causes severe write amplification.
- **Enabling dedup for no reason**: a huge RAM cost for marginal benefit in most cases.
- **Forgetting scrubs**: schedule regular `zpool scrub` runs to catch bit-rot proactively (monthly is standard).
- **NVMe thermal neglect**: NVMe drives without heatsinks or airflow can overheat and throttle dramatically under sustained write load.
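Putting the recommendations above together, a typical home-server pool setup might look like the following sketch. The pool name `tank`, the device names, and the dataset layout are all hypothetical; adjust for your hardware. Requires root and OpenZFS installed.

```shell
# RAID-Z2 over six disks; force 4K sectors and enable LZ4 pool-wide.
# -o sets pool properties, -O sets filesystem properties inherited by datasets.
zpool create -o ashift=12 \
    -O compression=lz4 -O atime=off \
    tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Datasets with independent tuning, carved from the same pool.
zfs create -o recordsize=1M tank/media   # large sequential files
zfs create -o recordsize=16K tank/vms    # VM disk images
zfs set quota=500G tank/vms

# Monthly scrub on the 1st at 03:00 (cron.d format includes a user field).
echo '0 3 1 * * root zpool scrub tank' > /etc/cron.d/zfs-scrub
```

Setting `ashift` and compression at pool creation matters because `ashift` cannot be changed later, and compression only applies to blocks written after it is enabled.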
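The send/receive replication workflow mentioned earlier can be sketched as follows. The pool name `tank`, the host `backuphost`, the destination `backup/tank`, and the snapshot names are hypothetical.

```shell
# Initial full replication: snapshot everything recursively, stream it
# over SSH, and force-create the destination dataset tree.
zfs snapshot -r tank@snap1
zfs send -R tank@snap1 | ssh backuphost zfs receive -F backup/tank

# Later: take a new snapshot and send only the blocks changed since snap1
# (-I includes all intermediate snapshots between the two).
zfs snapshot -r tank@snap2
zfs send -R -I tank@snap1 tank@snap2 | ssh backuphost zfs receive backup/tank
```

Because snapshots are just block-tree references, the incremental stream contains only blocks written between the two snapshots, which is what makes this efficient enough to run on a tight schedule.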