Btrfs (B-tree file system, variously pronounced "Butter F S", "Better F S", or "B-tree F S") is a GPL-licensed copy-on-write file system for Linux. Development began at Oracle Corporation in 2007. It is still under heavy development and is marked as unstable. In particular, when the file system becomes full, no-space conditions can arise that make it difficult even to delete files.
Btrfs is intended to address the lack of pooling, snapshots, checksums and integral multi-device spanning in Linux file systems, these features being crucial as Linux use scales upward into larger storage configurations common in the enterprise. Chris Mason, the principal Btrfs author, has stated that its goal was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable."
In 2008, the principal developer of the ext3 and ext4 file systems, Theodore Ts'o, stated that although ext4 has improved features, it is not a major advance, it uses old technology, and is a stop-gap; Ts'o believes that Btrfs is the better direction because "it offers improvements in scalability, reliability, and ease of management". Btrfs also has "a number of the same design ideas that reiser3/4 had".
The core data structure of Btrfs, the copy-on-write B-tree, was originally proposed by IBM researcher Ohad Rodeh at a presentation at USENIX 2007. Rodeh suggested adding reference counts and certain relaxations to the balancing algorithms of standard B-trees that would make them suitable for a high-performance object store with copy-on-write snapshots, yet maintain good concurrency.
Chris Mason, an engineer working on ReiserFS for SUSE at the time, joined Oracle later that year and began work on a new file system using such B-trees almost exclusively: not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy-on-write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.
Btrfs 1.0 (with a finalized on-disk format) was originally slated for a late-2008 release, and was finally accepted into the mainline kernel with version 2.6.29 in 2009. Several Linux distributions began offering Btrfs as an experimental choice of root file system during installation, including Arch Linux, openSUSE 11.3, SLES 11 SP1, Ubuntu 10.10, Sabayon Linux, Red Hat Enterprise Linux 6, Fedora 15, Oracle Enterprise Linux 6.1, MeeGo, Debian, and Slackware 13.37.
In 2011, de-fragmentation features were announced for the Linux 3.0 kernel version. Besides Mason at Oracle, a developer at Fujitsu contributed performance changes.
As of Linux 3.2 (released January 4, 2012), btrfs implements:
- Online defragmentation
- Online volume growth and shrinking
- Online block device addition and removal
- Online balancing (movement of objects between block devices to balance load)
- RAID0, RAID1, and RAID10
- Subvolumes (one or more separately-mountable filesystem roots within each physical partition)
- Transparent compression (zlib and LZO)
- Snapshots (read-only or copy-on-write clones of subvolumes)
- File cloning (copy-on-write on individual files, or byte ranges thereof)
- Checksums on data and metadata (CRC-32C)
- In-place conversion (with rollback) from ext3/4 to Btrfs
- File system seeding (Btrfs on read-only storage used as a copy-on-write backing for a writeable Btrfs)
- Block discard support (reclaims space on some virtualized setups and improves wear leveling on SSDs with TRIM)
In 2009, Btrfs was expected to offer a feature set comparable to ZFS, developed by Sun Microsystems. After Oracle's acquisition of Sun in 2009, Mason still planned to develop Btrfs.
Btrfs provides a clone operation that atomically creates a copy-on-write snapshot of a file; support for this operation was added to GNU coreutils 7.5 via the --reflink option to cp. Cloning from byte ranges in one file to another is also supported, allowing large files to be manipulated more efficiently, in the manner of standard rope data structures.
Subvolumes and snapshots
Subvolumes effectively allow a single instance of Btrfs to have multiple root directories, all using the file system as a pooled store. These roots are labeled and can be mounted separately by label. (There is always a "default" root which is mounted by default.) Subvolumes can also be nested inside other subvolumes, where they appear as subdirectories. Subvolumes are always created empty.
Snapshots are writeable copy-on-write clones of whole subvolumes; snapshotting itself is an atomic operation. Snapshots of individual subdirectories are not possible.
In-place ext3/4 conversion
Btrfs can warp to fit unusual spatial layouts because it has very little metadata anchored in fixed locations. The btrfs-convert tool exploits this property to perform an in-place conversion of any ext3 file system to Btrfs by nesting equivalent Btrfs metadata for all its files in the unallocated space of the ext3 file system. The result is a hybrid file system that, when mounted as Btrfs, makes the ext3 files accessible in a writeable snapshot. The original ext3 file system itself appears as a large sparse file, which can be mounted as a read-only disk image. Deleting the image file commits the conversion; as long as it is kept, the conversion can be rolled back to the original ext3 file system.
Btrfs supports a very limited form of transaction without ACID semantics: rollback is not possible, only one transaction may run at a time, and transactions are not atomic with respect to storage. They are analogous not to transactions in databases, but to the I/O "transactions" in ext3's JBD layer.
An ioctl interface, however, is provided so that user processes can start transactions to make temporary reservations of disk space. Once started, all file-system I/O is then bound to the transaction until it closes, at which point the reservation is released and I/O is flushed to storage. To prevent trivial denial-of-service attacks, transactions are only available to root-privileged processes.
Though Chris Mason said in his interview in 2009 that encryption was planned for btrfs, this is unlikely to be implemented for some time, if ever, due to the complexity of implementation and pre-existing tested and peer-reviewed solutions. The current recommendation for encryption with btrfs is to use a full-disk encryption mechanism such as dm-crypt or LUKS on the underlying devices, and to create the btrfs filesystem on top of that layer (and that if a RAID is to be used with encryption, encrypting a dm-raid device or a hardware-RAID device gives much faster disk performance than dm-crypt overlaid by btrfs's own filesystem-level RAID features).
Checking and recovery
Unix systems traditionally rely on "fsck" programs to check and repair filesystems, but no "btrfsck" program was released until March 2012 (as part of Unbreakable Enterprise Kernel version 2). Without such a program, a btrfs filesystem could become corrupt and lose all its files if a machine crashed or lost power on disks that didn't handle flush requests correctly.
Btrfs is structured as several layers of trees, all using the same B-tree implementation to store their various data types as generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object id. The middle 8 bits are an item type field; its use is hardwired into code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are used in type-specific ways. Therefore items for the same object end up adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values, objects can further put items of the same type in a particular order.
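The key ordering described above can be sketched in Python by letting lexicographic tuple sorting stand in for the on-disk (object id, type, offset) comparison. This is a conceptual model only; the type constants below are illustrative stand-ins, not necessarily the real on-disk type codes.

```python
# Toy model of the 136-bit Btrfs item key: 64-bit object id,
# 8-bit item type, 64-bit type-specific right-hand value.
# Tuples sort lexicographically, mirroring on-disk tree order.

# Hypothetical item-type constants, for illustration only.
INODE_ITEM, EXTENT_DATA = 1, 108

def make_key(objectid, item_type, offset):
    """Pack a key as a tuple; sorting the tuples yields tree order."""
    assert 0 <= objectid < 2**64
    assert 0 <= item_type < 2**8
    assert 0 <= offset < 2**64
    return (objectid, item_type, offset)

items = [
    make_key(257, EXTENT_DATA, 4096),
    make_key(256, INODE_ITEM, 0),
    make_key(257, INODE_ITEM, 0),
    make_key(257, EXTENT_DATA, 0),
]
items.sort()
# After sorting, all items for object 257 sit adjacent, grouped by
# type and then ordered by the type-specific right-hand value.
```

The right-hand 64 bits thus act as a secondary sort key within each (object, type) group, which is exactly what later sections exploit for directory indexes and extent offsets.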
Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.
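A toy model (not the real on-disk layout) of that front/back packing: item headers grow upward from the start of the leaf while item data grows downward from the end, and the leaf is full when the two regions meet. The header size here is a hypothetical constant.

```python
LEAF_SIZE = 4096
HEADER_SIZE = 16  # hypothetical per-item header size, for illustration

class Leaf:
    """Sketch of a Btrfs leaf: headers packed at the front,
    item data packed at the back, growing toward each other."""
    def __init__(self):
        self.items = []          # (key, data_offset, data_len) headers
        self.data_end = LEAF_SIZE

    def free_space(self):
        # Gap between the last header and the start of the data region.
        return self.data_end - len(self.items) * HEADER_SIZE

    def insert(self, key, data):
        if len(data) + HEADER_SIZE > self.free_space():
            return False  # leaf full; a real tree would split it
        self.data_end -= len(data)                 # data grows downward
        self.items.append((key, self.data_end, len(data)))  # headers grow upward
        return True
```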
Btrfs stores all links to an inode from a single directory in a fixed-size area. Consequently, the maximum number of hard links to a single file within a single directory is quite limited: around 300 to 600, depending on the length of the file names.
This limit is not usually an issue, but it does cause problems for applications such as git, GNUS, GMame and BackupPC. The principal author of the file system has said that this problem will be fixed in a future update.
Every tree appears as an object in the root tree (or tree of tree roots). Some trees, such as file system trees and log trees, have a variable number of instances, each of which is given its own object id. Trees which are singletons (the data relocation, extent and chunk trees) are assigned special, fixed object ids of 256 or less. The root tree appears in itself as a tree with object id 1.
Trees refer to each other by object id. They may also refer to individual nodes in other trees as a triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such references are independent of where the tree is actually stored.
File system tree
User-visible files and directories all live in a file system tree. File and directory objects all have inode items. Extended attributes and ACL entries are stored alongside in separate items. Files and directories also have a reference item whose right-hand key value is the object id of their parent directory. This allows upward traversal through the directory hierarchy. Hard-linked files have multiple such back-references.
Within each directory, directory entries appear as directory items, whose right-hand key values are a CRC-32C hash of their filename. Directory items thus collectively act as an index for path lookups, but they are not useful for iteration because they are sorted in hash order: user applications iterating over and opening files in a large directory would thus generate many more seeks between non-adjacent files, a notable performance drain in other file systems with hash-ordered directories such as ReiserFS, ext3 (with HTree indexes enabled) and ext4, all of which have TEA-hashed filenames. To avoid this, a separate directory index item is created for each new directory entry. The right-hand value of the item is set to a counter that is incremented on every new file. Iteration over these index items thus returns entries in roughly the same order as they are stored on disk.
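The hash-versus-index ordering can be illustrated with CRC-32C (the hash named above). The bitwise implementation below is a sketch for clarity, not the kernel's table-driven version; it uses the standard reflected Castagnoli polynomial.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli); reflected polynomial 0x82F63B78.
    The standard check value for b"123456789" is 0xE3069283."""
    crc = ~crc & 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return ~crc & 0xFFFFFFFF

names = ["a.txt", "b.txt", "c.txt", "d.txt"]
# Directory items: sorted by filename hash, scattering related files.
hash_order = sorted(names, key=lambda n: crc32c(n.encode()))
# Directory index items: the counter preserves creation order,
# which roughly matches on-disk layout.
index_order = names
```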
There is one file system tree per subvolume. Subvolumes can nest, in which case they appear as a directory item whose data is a reference to the nested subvolume's file system tree.
File data are kept outside the tree in extents, which are contiguous runs of disk blocks. Extent blocks default to 4 KiB in size, have no headers and contain only (possibly compressed) file data. In compressed extents, individual blocks are not compressed separately; rather, the compression stream spans the entire extent.
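A small experiment with zlib (one of the compression algorithms Btrfs supports) shows why one stream spanning the whole extent helps: a single stream can exploit redundancy between blocks, which per-block compression cannot. The extent contents here are synthetic.

```python
import os
import zlib

BLOCK = 4096
# A 16 KiB extent made of four identical (but individually random,
# hence individually incompressible) 4 KiB blocks.
random_block = os.urandom(BLOCK)
extent = random_block * 4

# Per-block compression: each block is its own stream and cannot
# reference data in the other blocks.
per_block = sum(len(zlib.compress(extent[i:i + BLOCK]))
                for i in range(0, len(extent), BLOCK))

# Whole-extent stream: blocks 2-4 compress to back-references
# into the first block, as the stream spans the extent.
whole = len(zlib.compress(extent))
```

With this input, the single spanning stream is far smaller than the sum of the four per-block streams, because only the first copy of the random block has to be stored literally.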
Each extent is tracked in-tree by an extent data item. The item's right-hand key value is the starting byte offset of the extent. This makes for efficient seeks in large files with many extents, because the correct extent for any given file offset can be computed with just one tree lookup.
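The single-lookup seek can be sketched with a sorted list and a binary search standing in for the B-tree. The extent map below is hypothetical; each entry is (starting file offset, length, disk address), keyed by the starting offset as in the extent data item's key.

```python
import bisect

# Hypothetical extent map for one file, sorted by starting file offset.
extents = [
    (0,     8192,  1000),   # bytes 0..8191 at disk address 1000
    (8192,  4096,  5000),   # bytes 8192..12287 at disk address 5000
    (12288, 16384, 9000),   # bytes 12288..28671 at disk address 9000
]
starts = [e[0] for e in extents]

def find_extent(file_offset):
    """One lookup: the rightmost extent whose start <= file_offset."""
    i = bisect.bisect_right(starts, file_offset) - 1
    start, length, disk = extents[i]
    assert start <= file_offset < start + length
    return disk + (file_offset - start)   # disk address of that byte
```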
Snapshots and cloned files share extents. When a small part of such a large extent is overwritten, the resulting copy-on-write may create three new extents: a small one containing the overwritten data, and two large ones with unmodified data on either side of the overwrite. To avoid having to rewrite unmodified data, the copy-on-write may instead create bookend extents: extents which are simply slices of existing extents. Extent data items allow for this by including an offset into the extent they track: items for bookends are those with non-zero offsets.
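A toy sketch of bookend creation, with each extent data item represented as a tuple of (disk extent, offset into that extent, length); the names and layout are illustrative only.

```python
def overwrite(extent_len, off, length, new_disk):
    """Overwrite bytes [off, off+length) of a shared extent of size
    extent_len. Returns the resulting extent data items: a new extent
    for the written data, plus bookends that slice the original extent
    via non-zero offsets instead of re-writing unmodified data."""
    items = []
    if off > 0:
        # Left bookend: start of the original extent, offset 0.
        items.append(("orig", 0, off))
    # Newly written data lives in its own new disk extent.
    items.append((new_disk, 0, length))
    if off + length < extent_len:
        # Right bookend: tail of the original extent, non-zero offset.
        items.append(("orig", off + length, extent_len - (off + length)))
    return items
```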
If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored inline in the extent data item. Each tree node is stored in its own tree block, a single uncompressed block with a header.
Extent allocation tree
An extent allocation tree is used to track space usage by extents, which are zoned into block groups. Block groups are variable-sized allocation regions which alternate successively between preferring metadata extents (tree nodes) and data extents (file contents). The default ratio of data to metadata block groups is 1:2. Inode items include a reference to their current block group. Together, these work like the Orlov block allocator and block groups in ext3 in allocating related files together and resisting fragmentation by leaving allocation gaps between groups. (ext3 block groups, however, have fixed locations computed from the size of the file system, whereas those in Btrfs are dynamic and are created as needed.)
Items in the extent allocation tree do not have object ids, but instead use their byte offsets as the left-hand 64 bits of the key. A traditional free-space bitmap is not used, since the allocation tree essentially acts as a B-tree version of a BSP tree. (The current implementation of Btrfs, however, does keep an in-memory red-black tree of page-sized bitmaps to speed up allocations.)
Extent items contain a back-reference to the tree node or file occupying that extent. There may be multiple back-references if the extent is shared between snapshots. If there are too many back-references to fit in the item, they spill out into individual extent data reference items. Tree nodes, in turn, have back-references to their containing trees. This makes it possible to find which extents or tree nodes are in any region of space by doing a B-tree range lookup on a pair of offsets bracketing that region, then following the back-references. For relocating data, this allows an efficient upwards traversal from the relocated blocks to quickly find and fix all downwards references to those blocks, without having to walk the entire file system. This, in turn, allows the file system to efficiently shrink, migrate and defragment its storage online.
The extent tree, as with all other trees in the file system, is copy-on-write. Writes to the file system may thus cause a cascade whereby changed tree nodes and file data result in new extents being allocated, causing the extent tree to itself change. To avoid creating a feedback loop, extent tree nodes which are still in memory but not yet committed to disk may be updated in-place to reflect new copy-on-written extents.
CRC-32C checksums are computed for both data and metadata and stored as checksum items in a checksum tree. There is one checksum item per contiguous run of allocated blocks, with per-block checksums packed end-to-end into the item data. If there are more checksums than can fit, they spill rightwards over into another checksum item in a new leaf.
An fsync is a request to commit modified data immediately to stable storage. fsync-heavy workloads (such as databases) could potentially generate a great deal of redundant write I/O by forcing the file system to repeatedly copy-on-write and flush frequently-modified parts of trees to storage. To avoid this, a temporary per-subvolume log tree is created to journal fsync-triggered copy-on-writes. Log trees are self-contained, tracking their own extents and keeping their own checksum items. Their items are replayed and deleted at the next full tree commit or (if there was a system crash) at the next remount.
Chunk and device trees
Block devices are divided into chunks of 256 MB or more. Chunks may be mirrored or striped across multiple devices. The mirroring/striping arrangement is transparent to the rest of the file system, which simply sees the single, logical address space that chunks are mapped into.
This is all tracked by the chunk tree, where each device is represented as a device item and each mapping from a logical chunk to its underlying physical chunks is stored in a chunk map item. The device tree is the inverse of the chunk tree, and contains device extent items which map byte ranges of block devices back to individual chunks. As in the extent allocation tree, this allows Btrfs to efficiently shrink or remove devices from volumes by locating the chunks they contain (and relocating their contents).
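The logical-to-physical translation can be sketched as follows, assuming a hypothetical in-memory chunk map with one mirrored chunk and one single-copy chunk; device names and addresses are invented for illustration.

```python
CHUNK = 256 * 1024 * 1024   # 256 MiB, the minimum chunk size

# Hypothetical chunk map: logical chunk start -> list of
# (device, physical start) copies. A mirrored chunk maps to
# two devices; the file system only ever sees logical addresses.
chunk_map = {
    0:     [("devA", 0), ("devB", 0)],   # mirrored chunk
    CHUNK: [("devA", CHUNK)],            # single-copy chunk
}

def resolve(logical):
    """Translate a logical address into its physical copies."""
    base = (logical // CHUNK) * CHUNK    # containing chunk's start
    off = logical - base                 # offset within the chunk
    return [(dev, phys + off) for dev, phys in chunk_map[base]]
```

The device tree described above supplies the inverse mapping, so that shrinking a device reduces to enumerating its device extents and relocating the chunks they back.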
The file system, chunks and devices are all assigned a Universally Unique Identifier (UUID). The header of every tree node contains both the UUID of its containing chunk and the UUID of the file system. The chunks containing the chunk tree, the root tree, device tree and extent tree are always mirrored even on single-device volumes. These are all intended to improve the odds of successful data salvage in the event of media errors.
Data relocation tree
The data relocation tree serves as scratch space for extents and their temporary copies during rebalancing or defragmentation. It is exempt from copy-on-write.
All of the file system's trees, including the chunk tree itself, are stored in chunks, creating a potential chicken-and-egg problem when mounting the file system. To bootstrap into a mount, a list of physical addresses of chunks belonging to the chunk and root trees must therefore be stored in the superblock.
Superblock mirrors are kept at fixed locations: 64 KiB into every block device, with additional copies at 64 MiB, 256 GiB and 1 PiB. When a superblock mirror is updated, its generation number is incremented. At mount time, the copy with the highest generation number is used. All superblock mirrors are updated in tandem, except in SSD mode which alternates updates among mirrors to provide some wear levelling.
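The fixed mirror offsets and the generation-based selection at mount time can be sketched as follows; the dictionary of generation numbers is hypothetical.

```python
# Fixed superblock mirror offsets: 64 KiB, 64 MiB, 256 GiB, 1 PiB.
SUPERBLOCK_OFFSETS = [64 * 2**10, 64 * 2**20, 256 * 2**30, 2**50]

def pick_superblock(mirrors):
    """mirrors: {byte offset: generation number} for the readable
    copies. At mount time, the copy with the highest generation
    (i.e. the most recently committed one) wins."""
    return max(mirrors, key=mirrors.get)
```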
- Comparison of file systems
- List of file systems
- Btrfs wiki
- Conference presentation by Oracle engineer Avi Miller.