ZFS (Zettabyte File System)

ZFS is an integrated filesystem, volume manager, and RAID system (originally Sun Microsystems, 2005; now OpenZFS) that collapses Linux's traditional ext4 + LVM + mdadm stack into one system. Key features: copy-on-write, block-level checksums with self-healing, instant snapshots, RAID-Z without the RAID-5/6 write hole, transparent compression, and native encryption.

**ZFS** (Zettabyte File System) is a combined filesystem, volume manager, and RAID system originally developed at Sun Microsystems and released in 2005 for Solaris. After Sun's acquisition by Oracle and the resulting license complications, the open-source community continued development as **OpenZFS**, which is now the reference implementation across Linux, FreeBSD, illumos, and macOS.

## Architecture

ZFS collapses what Linux traditionally does in three separate layers (ext4 filesystem + LVM volume manager + mdadm software RAID) into a single integrated system. This unification is the source of most of ZFS's advantages: the filesystem knows about disks, RAID geometry, and checksums simultaneously.

## Key features

### Copy-on-write (CoW)

ZFS **never overwrites data in place**. Writes always go to new blocks; old blocks are freed only after the new write succeeds. Consequences:

- **No torn writes mid-crash**: either the old or the new state is intact, never a mix of the two.
- **Instant snapshots**: a snapshot is just a reference to the current block tree; subsequent writes go to new blocks, and the old tree remains accessible.
- **Near-zero-space snapshots**: only changed blocks consume space.

### Block-level checksums with self-healing

Every block has a checksum (Fletcher4 by default, SHA-256 optionally) stored separately from the data. On read, ZFS verifies the checksum:

- **Detects silent bit-rot** that other filesystems miss entirely.
- **With redundancy** (mirror, RAID-Z), ZFS automatically reads from a good copy and rewrites the corrupted block: **self-healing**.
- **Scrubs** can be scheduled to proactively verify every block in the pool.

### RAID-Z (RAID without the write hole)

Traditional RAID-5/6 has a **write hole**: if power fails during a write that spans multiple disks, parity can end up inconsistent with data, leading to silent corruption. Hardware RAID cards work around this with battery-backed cache.
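The copy-on-write snapshot and self-healing behaviour described above maps onto a handful of commands. A minimal sketch (the pool name `tank`, dataset `tank/data`, and device names are illustrative):

```shell
# Two-way mirror from raw disks; ZFS manages redundancy itself
zpool create tank mirror /dev/sda /dev/sdb
zfs create tank/data

# Snapshots are instant: just a named reference to the current block tree
zfs snapshot tank/data@before-upgrade

# If something goes wrong, roll back to the snapshot
zfs rollback tank/data@before-upgrade

# Walk every block and verify checksums; with the mirror, corrupted
# copies are rewritten from the good side (self-healing)
zpool scrub tank
zpool status tank   # scrub progress, plus any checksum errors repaired
```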
**RAID-Z1** (single parity, like RAID-5), **RAID-Z2** (double parity, like RAID-6), and **RAID-Z3** (triple parity) use copy-on-write plus variable-width stripes to **eliminate the write hole entirely**. No hardware cache required.

### Pooled storage

- Disks go into a **zpool**: a single storage pool.
- **Datasets** (analogous to subdirectories, but with independent properties) are carved from the pool.
- Each dataset can have its own compression, encryption, quota, reservation, checksum algorithm, and record size.
- A pool can be expanded by adding disks or mirror pairs; as of OpenZFS 2.3 (2025), expanding an existing RAID-Z vdev is supported too.

### Transparent compression

- LZ4 is the default: near-free CPU cost, and often net-faster reads (less data read from disk).
- Zstd offers a higher compression ratio at higher CPU cost.
- Gzip is still available for archival use.

### Native encryption

- Optional AES-256-GCM, configured per dataset.
- Encryption is separate from filesystem structure, so encrypted snapshots can be replicated to untrusted destinations.

### Send/receive

- **zfs send** + **zfs receive** (typically piped over SSH) provide incremental snapshot replication.
- This powers most TrueNAS and Proxmox backup workflows.

### Deduplication

- Block-level dedup is possible but costs roughly **5 GB of RAM per TB of deduplicated data**; it is usually not worth it unless the data is heavily redundant (VM disk images, certain specific workloads).

## Tradeoffs

- **RAM-hungry**: roughly 1 GB of RAM per TB of storage is recommended for reasonable performance.
- **ECC RAM strongly recommended**: ZFS protects against disk corruption, but RAM errors can still propagate into what it writes.
- **Not in the mainline Linux kernel**: CDDL vs. GPL license incompatibility. It is distributed as a separate module (ZFS on Linux / OpenZFS). Distros like Ubuntu ship it as a kernel module; others require manual installation.
- **Learning curve**: more knobs and concepts than ext4.
- **Cannot shrink pools easily**: you can remove mirror pairs, but not arbitrary disks from a RAID-Z.

## Typical users

- **TrueNAS** (core product built on ZFS; TrueNAS SCALE uses ZFS + Linux).
- **Proxmox** (virtualization platform with native ZFS support).
- **FreeBSD** (ZFS in the base system since 7.0).
- **Ubuntu Server** (official ZFS support since 19.10).
- Home labs running NAS/storage servers.
- Enterprise storage where data integrity is critical (finance, healthcare, research).

## Historical notes

- Apple almost adopted ZFS for macOS 10.5 in 2008 but backed out over licensing (Sun's legal terms were incompatible with Apple's needs).
- Apple later built **APFS** (2017), which borrowed several ideas (copy-on-write, snapshots, cloning) but checksums only metadata, not user data, and has no native RAID.
- Btrfs is the Linux-native copy-on-write filesystem with similar feature goals; it has historically been less reliable than ZFS in production, though it is improving.
- bcachefs (merged into Linux kernel 6.7 in 2024) is another Linux-native CoW filesystem aiming at ZFS's feature set without the license complications.

## Typical home-server stack

Combined with NVMe (Non-Volatile Memory Express) drives, NVMe + ZFS is the standard high-performance home NAS / storage-server configuration in 2026. ZFS's block checksums catch NVMe bit-rot (a real issue at high P/E-cycle counts or elevated temperatures), and NVMe's IOPS suit ZFS's copy-on-write write patterns.

## Common beginner mistakes

- **Running ZFS on hardware RAID**: hides the raw disks and disables ZFS's self-healing. Always use an HBA / IT-mode controller so ZFS sees raw disks.
- **Wrong ashift**: set `ashift=12` for modern 4K-sector drives; an ashift of 9 (512-byte sectors, which ZFS may auto-detect on drives that misreport their sector size) produces bad write amplification.
- **Enabling dedup for no reason**: huge RAM cost for marginal benefit in most cases.
- **Forgetting scrubs**: schedule regular `zpool scrub` runs to catch bit-rot proactively (monthly is standard).
- **NVMe thermal neglect**: without heatsinks or airflow, NVMe drives can overheat and throttle dramatically under sustained write load.
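The pooled-storage, per-dataset properties, and send/receive features described earlier can be sketched with standard commands (pool, dataset, and host names are illustrative, and `sda`–`sdf` assume six raw disks):

```shell
# Six-disk RAID-Z2: any two disks can fail; ashift=12 for 4K sectors
zpool create -o ashift=12 tank raidz2 \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Properties are set per dataset, not per pool
zfs create -o compression=lz4 -o quota=500G tank/media
zfs create -o encryption=on -o keyformat=passphrase tank/private

# Incremental replication: send only the blocks changed since @sunday
zfs snapshot tank/media@monday
zfs send -i tank/media@sunday tank/media@monday | \
    ssh backup-host zfs receive tank-backup/media
```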
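The ashift, scrub, and thermal advice above translates into a short maintenance checklist (the pool name `tank` is illustrative; `smartctl` requires the smartmontools package):

```shell
# Confirm the pool was created with the right sector alignment
zpool get ashift tank

# Monthly scrub; on many distros a cron entry or systemd timer like
# this is standard:
#   0 2 1 * * root /sbin/zpool scrub tank
zpool scrub tank

# Watch for checksum errors and scrub results
zpool status -v tank

# Keep an eye on NVMe temperatures to catch thermal throttling early
smartctl -a /dev/nvme0 | grep -i temperature
```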


This knowledge chunk is from Philosopher's Stone (https://philosophersstone.ee), an open knowledge commons.