
BtrFS: A fault tolerant file system

Case Study

Kumar Amit Mehta


Department of Computer Engineering
Tallinn University of Technology
Tallinn, Estonia
gmate.amit@gmail.com

Abstract— A file system plays a crucial role in the storage fault tolerance, data recovery, scalability and performance of a system. Over the years several file system implementations, such as ext3, ext4, JFS, ZFS and XFS, have been deployed on servers running Linux. Different file systems offer different features and have different design goals. The design goals of BtrFS are scalability and storage fault tolerance.

This paper describes the challenges associated with file system design, discusses the fault tolerant features of BtrFS, and compares its performance with existing solutions, followed by a conclusion.

Keywords: Filesystems, Storage, BtrFS, Fault tolerance

I. INTRODUCTION

A file system is a storage abstraction that controls how data is stored on and retrieved from a storage device. Without a file system, data is just a bunch of bits without any information about the data itself. Essentially, a file system is the set of methods and data structures that an operating system uses to keep track of files on a disk or partition. A file system is therefore a bare minimum requirement. However, due to varying workloads, the heterogeneous nature of storage hardware (SSD, HDD, JBOD, SATA, SAN, NAS, etc.), and strict requirements on data consistency and data recovery on massively parallel systems with very high uptime requirements, it typically takes about 7 to 8 years of development and test cycles before a new file system gains trust and is put into production in enterprise class data centers. Data unavailability, data loss or, even worse, data corruption can have catastrophic effects, and unfortunately faults, errors and failures (both hardware and software) are inevitable. Studies [2, 3] on data unavailability have found that data unavailability due to storage failures occurs quite frequently. Hence the fault tolerance aspect of file system design is given a lot of consideration and is perhaps the most important feature of a file system. Currently more than 70 file systems are supported by Linux, each with different design goals. BtrFS, whose development started in 2007 for Linux based hosts, has fault tolerance and scalability as its main design principles. Though still under heavy development, it is poised to become the next generation file system for enterprise class infrastructure. It has borrowed ideas from other commercial grade file systems and has integrated many features that its predecessors were lacking. A fault tolerant design can impose severe performance penalties, rendering a system unscalable, but BtrFS overcomes those challenges very well [7, 8]. This paper is a case study on the shortcomings of current storage fault tolerance designs, the novel ideas in BtrFS, its fault tolerant features, and a comparison with existing solutions.

II. BACKGROUND

A. Historical perspective

BtrFS is an open source file system that has seen extensive development since its inception in 2007. It is jointly developed by Fujitsu, Fusion-IO, Intel, Oracle, Red Hat, Strato, SUSE, and many others. It is slated to become the next major Linux file system: it was merged into the mainline Linux kernel in the beginning of 2009 and debuted in the Linux 2.6.29 release. BtrFS is not a successor to the default Ext4 file system used in most Linux distributions, but it is expected to replace Ext4 in the future, offering better scalability and reliability. It is a copy-on-write file system intended to address various weaknesses in current Linux file systems. It is expected to work well on systems as small as a smartphone and as large as an enterprise production server, and as such it must work well on a wide range of hardware. Its main features are:

1. CRCs maintained for all metadata and data
2. Efficient writeable snapshots
3. Multi-device support
4. Online resize and defragmentation
5. Compression
6. Efficient storage for small files
7. SSD optimizations and TRIM support
8. Built-in RAID functionality

B. Challenges

B.1 Ubiquitous B-trees for organization and maintenance of large ordered indexes
The file system's on-disk layout is a forest of B-trees [1] with copy-on-write (COW) as the update method. Copy-on-write is an optimization strategy used in computer programming; it has the added advantage that a crash during an update does not impact the original data. Normally, the data is loaded from disk to memory, modified, and then written elsewhere: the idea is not to update the original location in place and thereby risk a partial update on power failure. However, B-trees in their native form are very incompatible with the COW technique that enables the shadowing and cloning (writable snapshot) features of BtrFS. Shadowing is a powerful mechanism that has been used to implement snapshots, crash recovery, write batching and RAID. The basic scheme is to look at the file system as a large tree made up of fixed-sized pages. When a page is shadowed, its location on disk changes, and this creates a need to update (and shadow) the immediate ancestor of the page with the new address. Shadowing thus propagates up to the file system root.
Fig. 1. Shadowing a leaf (Ohad)

Figure 1 shows an initial file system with root A that contains seven nodes. After leaf node C is modified, the complete path to the root is shadowed, creating a new tree rooted at A'. Nodes A, B, and C become unreachable and will later be deallocated.

The B-tree variant typically used by file systems is the b+-tree. In a b+-tree, leaf nodes contain data values and internal nodes contain only indexing keys. Files are typically represented by b-trees that hold disk extents in their leaves, and leaves are chained together for rebalancing purposes. However, such a scheme becomes extremely inefficient for a copy-on-write style file system.

Fig. 2. A tree whose leaves are chained together (Ohad)

Figure 2 shows a tree whose rightmost leaf node is C and whose leaves are linked from left to right. If C is updated, the entire tree needs to be shadowed; without leaf pointers, only C, B, and A would require shadowing. In such B-trees, shadowing forces each change to be propagated up to the root. Hence the major challenge is to achieve the benefits of the shadowing mechanism while retaining the ubiquitous B-tree for organization and maintenance of large ordered indexes.
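To make the shadowing scheme concrete, the following minimal sketch (an illustrative model with invented node and function names, not BtrFS code) updates a single leaf by path copying: every page on the path from the root to the modified leaf is written as a new copy, while the original pages are left untouched and still form a consistent tree.

# Minimal sketch of shadowing (path copying) over fixed-size pages.
# Illustrative model only; names are invented and this is not BtrFS code.

class Node:
    def __init__(self, keys, children=None, value=None):
        self.keys = keys            # routing keys (internal) or key (leaf)
        self.children = children    # list of Node, or None for a leaf
        self.value = value          # payload stored in a leaf

def shadow_update(node, key, new_value):
    """Return a shadowed copy of `node` with `key` updated; the original
    node and its descendants are never modified in place."""
    if node.children is None:                       # leaf page
        return Node(node.keys, value=new_value)     # new shadowed leaf
    # route to the child covering `key` (simplified)
    idx = next((i for i, k in enumerate(node.keys) if key <= k),
               len(node.keys) - 1)
    new_children = list(node.children)              # copy the pointer array
    new_children[idx] = shadow_update(node.children[idx], key, new_value)
    return Node(list(node.keys), children=new_children)

leaf_b = Node([10], value="B")
leaf_c = Node([20], value="C")
root_a = Node([10, 20], children=[leaf_b, leaf_c])

root_a_prime = shadow_update(root_a, 20, "C'")      # shadow leaf C and root A
assert leaf_c.value == "C"                          # old pages untouched
assert root_a.children[1] is leaf_c                 # old tree still intact
assert root_a_prime.children[0] is leaf_b           # unmodified pages shared
assert root_a_prime.children[1].value == "C'"       # new version visible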
B.2 Varying workloads

Workloads affect the ability of file systems to provide high performance to users. Because of the increasing gap between processor speed and disk latency, file system performance is largely determined by its disk behavior. Storage workloads are also changing: more file storage and less block storage, and larger file sizes. Enterprise, engineering, email, backup and many other types of workload [4] add to the complexity and affect the performance of a file system.

B.3 Heterogeneous storage

The heterogeneous nature of storage devices brings further complexity. The read and write policies, latency, aging and reliability of various storage devices are substantially different. A rotating disk typically does not have wear leveling issues, while an SSD does not have to deal with the wear and tear of a rotating platter.

C. Storage faults

C.1 Mean time to failure of disks

Component failure in large-scale IT installations is becoming an ever-larger problem as the number of components in a single cluster approaches a million. Studies [2, 3] have shown that the mean time-to-failure (MTTF) of storage drives with SCSI, FC and SATA interfaces observed in the field falls well short of the ranges specified in their datasheets (1,000,000 to 1,500,000 hours). In the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. It is often assumed that disk failures follow a simple fail-stop model; in reality, disk failures are much more complex. For example, disk drives can experience latent sector faults or transient performance problems, and it is often hard to correctly attribute the root cause of a problem to a particular hardware component. Table I shows the relative frequency of hardware component replacements for the ten most frequently replaced components in a high performance computing system (HPC1) and in cluster systems (COM1, COM2) at a large internet service provider.

TABLE I. RELATIVE FREQUENCY OF HARDWARE COMPONENT REPLACEMENTS (SCHROEDER AND GIBSON [3])
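The gap between datasheet figures and field observations is easy to quantify. The short calculation below (illustrative only, and assuming a constant failure rate, i.e. an exponential lifetime model) converts the datasheet MTTF range quoted above into a nominal annualized failure rate, which lands well below the 2-4% replacement rates reported in the field.

# Nominal annualized failure rate (AFR) implied by a datasheet MTTF,
# assuming a constant failure rate (exponential lifetime model).
import math

HOURS_PER_YEAR = 8760

def nominal_afr(mttf_hours):
    # P(failure within one year) = 1 - exp(-hours_per_year / MTTF)
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)

for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")

# Prints roughly 0.87% and 0.58%, i.e. well below the 2-4% annual
# replacement rates observed in the field [2, 3].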
C.2 Bitrot

Bitrot [9] is the silent corruption of data on disk or tape. One at a time, year by year, a random bit here or there gets flipped. The worst part is that backups will not save the user, since backups are completely oblivious to bitrot. Conventional RAID does not help either: a RAID5 array can rebuild data from parity, but that only works if a drive fails completely and cleanly. If the drive instead starts spewing corrupted data, the array may or may not notice the corruption. Even if it does notice, all the array knows is that something in the stripe is bad; it has no way of knowing which drive returned bad data, and therefore which one to rebuild from parity (or whether the parity block itself was corrupt).

Fig. 3. Original image          Fig. 4. After flipping one bit

Figures 3 and 4 show the effect of bitrot on a JPEG image. As can be seen, even a small hardware error can corrupt the data significantly.
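A single flipped bit is enough to change a block's checksum, which is what lets a checksumming file system notice bitrot that conventional RAID cannot attribute to a specific copy. The sketch below is illustrative only and uses CRC-32 from Python's zlib as a stand-in for the CRC32C that BtrFS uses; it flips one bit in a 4 KB block and shows that the stored checksum no longer matches.

# Flip a single bit in a data block and observe the checksum mismatch.
# zlib.crc32 (CRC-32) stands in here for the CRC32C used by BtrFS.
import os
import zlib

block = bytearray(os.urandom(4096))     # a 4 KB data block
stored_csum = zlib.crc32(block)         # checksum recorded at write time

block[1000] ^= 0x04                     # bitrot: one bit flips on the media

if zlib.crc32(block) != stored_csum:
    print("checksum mismatch: block is corrupt, fall back to a good copy")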
C.3 Journaling is not enough

Most file systems today use a technique called journaling. A journal is a special log, stored on persistent storage, in which the file system records all its actions before actually performing them. If the system crashes while the file system is performing an action, the file system can complete the pending action(s) on the next boot by reading the journal and replaying the log. Journaling file systems have a big limitation: they typically provide only metadata integrity and consistency. Keeping both data and metadata changes inside the journal introduces an unacceptable performance overhead, so file systems end up logging only metadata updates. A fault tolerant file system should provide both metadata and data integrity and consistency without much overhead.
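The write-ahead principle behind journaling can be sketched as follows. This is a toy model, not the ext4/JBD2 implementation; the file name and record format are invented. The intent record is made durable before the real update is applied, so that a replay after a crash can complete any pending action.

# Toy write-ahead journal for metadata updates: log the intent first,
# apply it second, and replay logged intents after a crash.
# Illustration of the principle only, not how ext4/JBD2 works.
import json
import os

JOURNAL = "journal.log"

def journal_write(action):
    with open(JOURNAL, "a") as j:
        j.write(json.dumps(action) + "\n")
        j.flush()
        os.fsync(j.fileno())        # the intent is durable before we act on it

def apply(action, metadata):
    metadata[action["inode"]] = action["size"]   # the "real" metadata update

def replay(metadata):
    """Run at mount time: re-apply every logged action (idempotent here)."""
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as j:
        for line in j:
            apply(json.loads(line), metadata)

metadata = {}
update = {"inode": 257, "size": 4096}
journal_write(update)       # step 1: durable intent
apply(update, metadata)     # step 2: actual update (may be interrupted)
replay(metadata)            # after a crash, replay completes the update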
III. BTRFS DISCUSSION

A. COW friendly B-tree

B-trees in their normal form impose a big performance penalty on a copy-on-write based update method. In 2007, Ohad Rodeh published a paper [5] on COW friendly B-trees which introduced novel ideas on shadowing and clones, and those ideas were adopted in the design of BtrFS [6]. The main idea is to use a standard b+-tree construction but to (1) employ a top-down update procedure, (2) remove leaf chaining, and (3) use lazy reference counting for space management.

Fig. 5. (a) A basic b-tree (b) Inserting key 19, and creating a path of modified pages. (Ohad)

Figure 5(a) shows an initial tree with two levels. Unmodified pages are colored yellow, and COWed pages are colored green. Figure 5(b) shows the insertion of a new key, 19, into the rightmost leaf. A path is traversed down the tree, and all modified pages are written to new locations, without modifying the old pages.

In order to remove a key, copy-on-write is used as well: remove operations do not modify pages in place. For example, Figure 6 shows how key 6 is removed from a tree. Modifications are written off to the side, creating a new version of the tree.

Fig. 6. (a) A basic tree (b) Deleting key 6.

In order to clone a tree, its root node is copied and all the child pointers are duplicated. For example, Figure 7 shows a tree Tp that is cloned to tree Tq. Tree nodes are denoted by symbols. As modifications are applied to Tq, sharing will be lost between the trees, and each tree will have its own view of the data.

Fig. 7. Cloning tree Tp. A new root Q is created, initially pointing to the same blocks as the original root P. As modifications are applied, the trees will diverge. (Ohad)

Since tree nodes can be reached from multiple roots, garbage collection is needed for space reclamation. As the trees are acyclic, reference counting is used to track pointers to tree nodes; once a counter reaches zero, the block can be reused. In order to keep track of ref-counts, the copy-on-write mechanism is modified: whenever a node is COWed, the ref-count of the original is decremented and the ref-counts of its children are incremented. BtrFS runs this space reclamation process in the background.
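The modified COW step can be sketched as follows (an illustrative model with invented names, not BtrFS code): copying a node bumps the reference counts of its children, the original version loses a reference, and any block whose count drops to zero becomes reusable.

# Sketch of COW with lazy reference counting: shadowing a node shares its
# children, and blocks are reclaimed once their ref-count reaches zero.
# Illustrative model only.

class Block:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.refcount = 1

free_list = []

def put(block):
    """Drop one reference; free the block (and cascade) when it hits zero."""
    block.refcount -= 1
    if block.refcount == 0:
        free_list.append(block.name)
        for child in block.children:
            put(child)

def cow(block):
    """Copy a node: the copy points at the same children, now shared."""
    for child in block.children:
        child.refcount += 1
    copy = Block(block.name + "'", block.children)
    put(block)                      # the old version gives up its reference
    return copy

leaf = Block("C")
root = Block("A", [leaf])

new_root = cow(root)                # shadow the root
assert new_root.children[0] is leaf # the leaf is shared, not copied
assert leaf.refcount == 1           # +1 from cow, -1 when the old root is freed
assert free_list == ["A"]           # the old root's block can be reused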
B. Data and metadata checksum

CRC32C checksums are computed for both data and metadata and stored as checksum items in a checksum tree. There is one checksum item per contiguous run of allocated blocks, with per-block checksums packed end-to-end into the item data. When the file system detects a checksum mismatch while reading a block, it first tries to obtain (or create) a good copy of the block from another device, if internal mirroring or RAID techniques are in use. Even on a non-RAID file system, BtrFS usually keeps two copies of metadata, both checksummed; data blocks are not duplicated unless RAID1 or higher is used, but they are checksummed. A scrub [10] will therefore know if metadata is corrupted and will typically correct it on its own. It can also tell if data blocks got corrupted, automatically fix them if the RAID level allows it, or otherwise report them in syslog.
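A minimal sketch of this read path is shown below. It is illustrative only: per-block CRC-32 from Python's zlib stands in for CRC32C, and a dictionary keyed by logical block number stands in for the checksum tree. On a mismatch the read falls back to the mirror copy and repairs the bad one, which is the behavior exercised in the experiment described next.

# Sketch of verify-on-read with repair from a mirror, in the spirit of
# BtrFS data checksums. CRC-32 (zlib) stands in for CRC32C, and a dict
# keyed by logical block number stands in for the checksum tree.
import zlib

BLOCK = 4096
mirror_a = {0: b"A" * BLOCK}                # copy on device 1
mirror_b = {0: b"A" * BLOCK}                # copy on device 2 (RAID1)
csum_tree = {0: zlib.crc32(mirror_a[0])}    # checksum item for block 0

def read_block(blockno):
    data = mirror_a[blockno]
    if zlib.crc32(data) == csum_tree[blockno]:
        return data
    # first copy is bad: try the mirror, then repair the bad copy
    alt = mirror_b[blockno]
    if zlib.crc32(alt) != csum_tree[blockno]:
        raise IOError("both copies are corrupt")
    mirror_a[blockno] = alt                 # "read error corrected"
    return alt

mirror_a[0] = b"A" * (BLOCK - 1) + b"B"     # silent corruption of one copy
assert read_block(0) == b"A" * BLOCK        # mismatch detected and repaired
assert mirror_a[0] == mirror_b[0]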
To test this feature, a virtual machine was created and the latest stable release of Linux at the time (kernel version 3.17.4) was installed. Two virtual disks were added to the virtual machine and formatted with BtrFS, using the built-in software RAID level 1 for both data and metadata. Data corruption was then performed on the disk itself, underneath the file system, so that the file system had no knowledge of it. When an attempt was made to read the same data through the file system layer, BtrFS not only detected the error but also corrected the data. The following messages were seen in the system logs:

BTRFS info (device sdb): csum failed ino 257 off 0 csum 2566472073 expected csum 3681334314
BTRFS info (device sdb): csum failed ino 257 off 0 csum 2566472073 expected csum 3681334314
BTRFS: read error corrected: ino 257 off 0 (dev /dev/sdc sector 449512)

C. Multi-device support

The device mapper [11] subsystem in Linux manages storage devices; LVM and mdadm are examples. These are software modules whose primary function is to take raw disks, merge them into a virtually contiguous block address space, and export that abstraction to higher level kernel layers. They support mirroring, striping, and RAID5/6, but they do not support checksums. This causes a problem for BtrFS, which maintains a checksum for each block. Consider a case where data is stored in RAID-1 form on disk, so each 4KB block has an additional copy. If the file system detects a checksum error on one copy, it needs to recover from the other copy. The device mapper hides that information behind the virtual address space abstraction and returns one of the copies. Since the file system still only sees a single copy of each block, it can only say that the checksum is broken; it cannot recover.

Therefore BtrFS does its own device management. It calculates checksums, stores them in a separate tree, and is then better positioned to recover data when media errors occur. BtrFS splits each device into large chunks. A chunk tree maintains a mapping of logical to physical chunks, and a device tree maintains the reverse mapping. The rest of the file system sees logical chunks. Physical chunks are divided into groups according to the required RAID level of the logical chunk. For mirroring, chunks are divided into pairs. Table II presents an example with three disks and groups of two; for example, logical chunk L1 is made up of physical chunks C11 and C21.

TABLE II. TO SUPPORT RAID1 LOGICAL CHUNKS, PHYSICAL CHUNKS ARE DIVIDED INTO PAIRS. HERE THERE ARE THREE DISKS, EACH WITH TWO PHYSICAL CHUNKS, PROVIDING THREE LOGICAL CHUNKS. LOGICAL CHUNK L1 IS BUILT OUT OF PHYSICAL CHUNKS C11 AND C21.

Logical chunks   Disk 1   Disk 2   Disk 3
L1               C11      C21
L2                        C22      C31
L3               C12               C32

For striping (RAID0), groups of n chunks are used, where each physical chunk is on a different disk. For example, Table III shows a stripe width of four (n = 4), with four disks and three logical chunks.

TABLE III. STRIPING WITH FOUR DISKS, STRIPE WIDTH n = 4. THREE LOGICAL CHUNKS ARE EACH MADE UP OF FOUR PHYSICAL CHUNKS.

Logical chunks   Disk 1   Disk 2   Disk 3   Disk 4
L1               C11      C21      C31      C41
L2               C12      C22      C32      C42
L3               C13      C23      C33      C43
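The chunk tree idea can be sketched as a lookup table (illustrative only; the names, chunk size and stripe size are invented, and real BtrFS chunk items carry far more state): a logical offset resolves to one or more (device, physical chunk) pairs depending on the chunk's RAID profile.

# Sketch of logical-to-physical chunk mapping for RAID1 and RAID0 profiles,
# in the spirit of the BtrFS chunk tree. Names and sizes are illustrative.

CHUNK_SIZE = 1 << 30                 # 1 GiB logical chunks (example value)

# chunk tree: logical chunk id -> RAID profile and (disk, physical chunk) list
chunk_tree = {
    0: {"profile": "raid1", "stripes": [("disk1", "C11"), ("disk2", "C21")]},
    1: {"profile": "raid1", "stripes": [("disk2", "C22"), ("disk3", "C31")]},
    2: {"profile": "raid0", "stripes": [("disk1", "C12"), ("disk2", "C23"),
                                        ("disk3", "C32"), ("disk4", "C42")]},
}

def resolve(logical_offset):
    """Return the (disk, physical chunk) pair(s) backing a logical offset."""
    chunk = chunk_tree[logical_offset // CHUNK_SIZE]
    if chunk["profile"] == "raid1":
        # every mirror holds a full copy: read from any, write to all
        return chunk["stripes"]
    # raid0: data is striped, so a given 64 KiB stripe lives on one disk only
    stripe_len = 64 * 1024
    idx = ((logical_offset % CHUNK_SIZE) // stripe_len) % len(chunk["stripes"])
    return [chunk["stripes"][idx]]

print(resolve(4096))                 # both mirrors of logical chunk 0
print(resolve(2 * CHUNK_SIZE))       # a single stripe of the RAID0 chunk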
D. Subvolumes

Subvolumes provide an alternative, restricted view of the file system. Each subvolume can be treated as its own file system, mounted separately and exposed as needed; exposing only a part of the file system limits the damage that can be done to the file system as a whole. A subvolume [12] in BtrFS has its own hierarchy and relations to other subvolumes. A subvolume can be accessed in two ways: (i) from the parent subvolume, where it can be used just like a directory and can have child subvolumes and its own files and directories; or (ii) as a separately mounted file system. From the outside, subvolumes look like an ordinary directory structure: one can copy things into that directory (which puts them into that subvolume), create other directories under it, and even create other subvolumes under it. In reality, however, they are not ordinary directories; an attempt to create hard links across subvolumes will not work. Subvolumes are also extremely easy to manage when taking a snapshot, and snapshots of a subvolume can be read-only. Since BtrFS is a copy-on-write based file system, a snapshot initially consumes no additional disk space and will only start to use space when its files are modified or new files are created.

E. Snapshot

One of the requirements for a mission critical system is the ability to recover from failures, and snapshots are one such mechanism. A snapshot is the state of a system (in this case, its data) at a particular point in time; using a snapshot, one can go back to a particular time in history and recover data. Snapshots are built into BtrFS and cost little in performance, especially compared to LVM (Logical Volume Manager). In BtrFS, a snapshot is a cheap atomic copy of a subvolume, stored on the same file system as the original. A snapshot looks similar to a full backup taken at a particular point in time. For example, consider a file of size 10GB; it takes up 10GB of space. At this point (say at time 't') a snapshot is taken, and the file and the snapshot between them still take up 10GB of space. Later, 1GB of the file is modified, and now the file and the snapshot take up 11GB of space: 9GB is unmodified and still shared, and only the remaining 1GB exists in two different versions. This approach yields tremendous space savings. Read-only snapshots can also be sent to another file system or machine using send/receive, removing a single point of failure. A snapshot in BtrFS is a special type of subvolume, one which contains a copy of the current state of some other subvolume. Snapshots clearly have a useful backup function. If, for example, one has a Linux system using BtrFS, one can create a snapshot prior to installing a set of distribution updates. If the updates go well, the snapshot can simply be deleted. Should the update go badly, the snapshot can instead be made the default subvolume and, after a reboot, everything is as it was before.

(a) Initial state          (b) After snapshot

Fig. 8. Snapshot of subvolume 'A' [13]

Figure 8 shows the BtrFS implementation of a snapshot. A snapshot of subvolume A is taken; since BtrFS is a copy-on-write file system, only the reference counts of the immediate children, B and D, are updated, and hence taking a snapshot with BtrFS is exceptionally fast.
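The space accounting in the 10GB example above can be reproduced with a small block-level model (illustrative only): blocks are shared between the file and its snapshot until they are rewritten, and only the rewritten blocks are duplicated.

# COW space accounting for a snapshot: shared blocks are counted once,
# rewritten blocks get a private copy. Block size is 1 GB for readability.

file_blocks = {i: f"v0_{i}" for i in range(10)}    # a 10 GB file
snapshot_blocks = dict(file_blocks)                # snapshot shares every block

def used_space(*views):
    return len({block for view in views for block in view.values()})

assert used_space(file_blocks, snapshot_blocks) == 10   # nothing duplicated yet

file_blocks[3] = "v1_3"                                  # rewrite 1 GB of the file
assert used_space(file_blocks, snapshot_blocks) == 11    # 9 GB shared, 1 GB in two versions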
IV. TEST RESULTS

There are no agreed upon standards for testing file systems. While there are industry benchmarks for the NFS and CIFS protocols, they cover only a small percentage of the workloads seen in the field. At the end of the day, what matters to a user is the robustness and performance for their particular application.

A surprise power failure test of more than 1000 power cycles [7] showed that Ext4 metadata was corrupted, while BtrFS worked without any problem. The power failure test was done on a Freescale TWR-VF65GS10 board with 1GB of DDR3 memory and a 16GB Micro SD card, running Linux kernel 3.15-rc7. The board was periodically turned on and off while a file writing application continuously created 4KB files.

TABLE IV. POWER FAILURE TEST RESULTS

        Number of power failures   Result
BtrFS   1000+                      No abnormal situation occurred
Ext4    1000+                      Corrupted inodes increased up to 32,000 and the file system finally fell into an abnormal "disk full" state

Table IV shows the robustness test results. It is perhaps the copy-on-write feature of BtrFS that sustained such abrupt power failures.

Performance test results [6, 7] show that despite supporting new features such as snapshots, data checksums and multiple device support, BtrFS provides reasonable performance under most workloads.

Fig. 9. Kernel compilation; all file systems exhibit similar performance. (Ohad)

The first test was a Linux kernel make, starting from a clean tree of source files. Tests were run on a single socket 3.2 GHz quad core x86 processor with 8 gigabytes of RAM and a single SATA drive with a 6 Gb/s link. A block trace was collected, starting with the make -j8 command, which starts eight parallel threads that perform C compilation and linking with gcc. Figure 9 compares the throughput, seek count and IOPS of the three file systems. Ext4 had higher throughput than BtrFS and XFS, at times averaging a little less than twice as much.
Another performance test was basic file I/O: throughput with a single instance of FIO, and throughput under high load with multiple instances of FIO.

Fig. 10. Read operation with a single FIO instance

Fig. 11. Write operation with a single FIO instance

The evaluation environment was an Intel Desktop Board D510MO with 1GB of DDR2-667 PC2-5300 memory, a 32GB Intel X25-E e-SATA SSD for storage, and a 64-bit Linux kernel v3.15.1. FIO, a software tool for generating different types of I/O, was used for benchmarking. With a single instance of FIO, BtrFS performed slightly better than Ext4 for sequential reads; for random reads the results were reversed. Ext4 was almost twice as fast as BtrFS for write operations. BtrFS performance degradation under high load was more graceful than Ext4's, and the BtrFS kernel threads used CPU resources effectively. There are other use cases as well, and phoronix.com's benchmark results [8] show that BtrFS was the overall winner.

V. CONCLUSION

Storage systems are not perfect. Faults are inevitable, yet expectations for data availability in a mission critical system are extremely high, and silent data corruption can be even more catastrophic. With high volumes of data and ubiquitous computing, a file system plays a crucial role in the fault tolerance and performance of storage systems. File systems in the past used a journaling mechanism for metadata, but journaling alone cannot provide data integrity and consistency. The copy-on-write mechanism can solve many of the challenges posed by storage faults, but the ubiquitous B-tree based file system in its native form faced severe performance penalties. This problem was solved by using COW friendly B-trees, which form the basis of the BtrFS core design. BtrFS is a relatively young Linux file system. It is based on copy-on-write, and it supports efficient snapshots and strong data integrity. As a general purpose file system, BtrFS has to work well on a wide range of hardware, and it is still under heavy development. It is recommended to use traditional backup techniques alongside a BtrFS deployment; no single solution can provide all benefits, hence a combination of data recovery mechanisms should be employed for storage fault tolerance.

REFERENCES

[1] R. Bayer and E. McCreight. 1972. Organization and maintenance of large ordered indices. Acta Informatica 1, 173-189.
[2] D. Ford, F. Labelle, F. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, and S. Quinlan. 2010. Availability in globally distributed storage systems. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), October 2010.
[3] Bianca Schroeder and Garth A. Gibson. 2007. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, Article 1.
[4] Andrew W. Leung, Shankar Pasupathy, Garth Goodson, and Ethan L. Miller. 2008. Measurement and analysis of large-scale network file system workloads. In USENIX 2008 Annual Technical Conference (ATC '08). USENIX Association, Berkeley, CA, USA, 213-226.
[5] Ohad Rodeh. 2008. B-trees, shadowing, and clones. ACM Transactions on Storage 3, 4, Article 2 (February 2008), 27 pages. DOI=10.1145/1326542.1326544 http://doi.acm.org/10.1145/1326542.1326544
[6] Ohad Rodeh, Josef Bacik, and Chris Mason. 2013. BTRFS: The Linux B-Tree Filesystem. ACM Transactions on Storage 9, 3, Article 9 (August 2013), 32 pages. DOI=10.1145/2501620.2501623 http://doi.acm.org/10.1145/2501620.2501623
[7] Linux file system analysis for IVI systems. events.linuxfoundation.jp/sites/events/files/slides/linux_file_system_analysis_for_IVI_systems.pdf
[8] http://www.phoronix.com/scan.php?page=article&item=linux_315_hddfs&num=3
[9] Bitrot and atomic COWs: Inside next-gen filesystems. http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/1/
[10] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-scrub
[11] http://en.wikipedia.org/wiki/Device_mapper
[12] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-subvolume
[13] http://www.ibm.com/developerworks/cn/linux/l-cn-btrfs/
