
WHITEPAPER

A Comparison of Journaling and Transactional File Systems

by Datalight Staff
Executive Summary
As end-user expectations rise and embedded devices grow more complex, reliable file management is rapidly becoming a commonplace requirement. The file systems available in most embedded operating systems were not specifically designed with the needs of the embedded marketplace in mind, but instead evolved out of solutions developed for desktop and server environments. These file systems have significant shortcomings in embedded devices:
1) They were not designed for use in an environment where power may be lost.
2) Their error recovery processes are slow, which is typically not acceptable when instant-on is the user's expectation.
3) Desktop and server file systems were not designed for use in an environment with limited resources, such as is often found in embedded devices.
This white paper looks at the limitations of desktop/server file systems in embedded devices, and then reviews the operation of two different solutions, journaling and transactional file systems, and the key differences between them.
Note: The paper assumes familiarity with file system basics. File system concepts are covered in Appendix A for those new to the topic.

Contents

Executive Summary
Introduction
Challenges with Traditional File Systems
Challenges in Building a Reliable File System
Journaling Versus Transactional File Systems
  User-data Integrity
  Performance
  Disk Space Efficiency
  Programmability
Overview of Journaling File Systems
  Examples of Journaling File Systems
Overview of Transactional File Systems
  Examples of Transactional File Systems
Appendix A
  File System Basics
Bibliography


Introduction
In the past, most embedded devices did not require file management systems. But data storage
needs in the embedded marketplace have increased dramatically over the last 10 years, putting
greater demands on embedded file systems.
The most popular file system for embedded devices today, FAT, originated in the desktop environment. Other embedded file systems, such as ext3, originated in the server environment.
The problem is that desktop and server environments provide controlled startup and shutdown
procedures for file systems, while in the embedded world many devices operate in environments
where power may be unexpectedly lost or interrupted.
Developers are currently exploring alternatives to traditional file systems like FAT and ext3, which have proven to be inadequate for today's embedded devices. To address file system reliability in embedded devices, developers first tried journaling file systems. File systems in this category include ext3, JFS, ReiserFS, and XFS. Originally developed for use in Linux server environments, these file systems were adopted to address the problems of power loss and system crashes seen in embedded devices.
Journaling file systems are reliable. However, there is another category of file systems called
transactional file systems that are not only more reliable but, unlike journaling file systems, were
specifically designed for small, resource-constrained embedded devices. Datalight Reliance is an
example of such a file system.
Transactional file systems also offer better performance. The combination of reliability and performance is attractive to embedded developers who are being challenged to produce not only
more reliable devices, but also devices that provide end users with a problem-free, high-performance data storage experience.

Challenges with Traditional File Systems


File systems originally created for use in desktop environments were not designed to accommodate the loss of power while writing to the disk.
When writing to any given file in any file system, several different areas of the disk must be updated. For example, in a typical FAT file system, appending data to the end of a file entails writing
to four different areas on the disk:
1) Allocating a new block and marking it as used in the first file allocation table.
2) Marking the same block as used in the second file allocation table.
3) Updating the file's directory entry to record the new length and timestamp.
4) Writing the actual data to the disk.
The first three steps involve updating the file system metadata, while the fourth step is updating the user data. Metadata is simply the logical information about the file system structure,
while the user data is the actual contents of a file.
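The four steps above can be sketched as a toy simulation (the on-disk layout here is invented for illustration; this is not real FAT code). Cutting power partway through leaves the metadata regions out of sync with each other:

```python
# Toy simulation of the four-step FAT append described above.
def append_block(disk, block_no, payload, fail_after=None):
    """Run the four writes in order; stop early to simulate power loss."""
    steps = [
        lambda: disk["fat1"].append(block_no),                         # 1) FAT #1
        lambda: disk["fat2"].append(block_no),                         # 2) FAT #2
        lambda: disk["dirent"].update(len=disk["dirent"]["len"] + 1),  # 3) dir entry
        lambda: disk["data"].update({block_no: payload}),              # 4) user data
    ]
    for i, step in enumerate(steps, start=1):
        if fail_after is not None and i > fail_after:
            return          # power lost: remaining writes never happen
        step()

disk = {"fat1": [], "fat2": [], "dirent": {"len": 0}, "data": {}}
append_block(disk, 7, b"hello", fail_after=2)   # power fails after step 2
# Both FATs claim block 7 is in use, but the directory entry and the
# user data were never written:
print(disk["fat1"], disk["dirent"]["len"], 7 in disk["data"])  # [7] 0 False
```

After the simulated power loss, the allocation tables disagree with the directory entry, which is exactly the inconsistency a scanner like chkdsk has to repair.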
If power is interrupted at any point in this process, the file system becomes unstable because the metadata is not in sync. A reliable file system must ensure that all four steps are completed in an atomic fashion; that is, they are either all completed in their entirety, or none of them is performed. This is referred to as atomicity, and it is the foundational concept for both journaling and transactional file systems.
Created with the idea that they would be used in power-stable environments, traditional file systems were not designed to provide atomicity. To deal with situations where the file system's metadata structures do become corrupted, utilities such as chkdsk, scandisk, or fsck were created to scan the entire file system (usually at system startup time) for problems. In addition to being time consuming, the scanning process provides no guarantee that the actual file data got written, only that the metadata structures are fixed.

Challenges in Building a Reliable File System


Journaling and transactional file systems were developed to address the limitations of traditional
desktop file systems. The journaling approach was primarily developed to address the needs of
the server market, while the transactional approach was developed to address the needs of the
embedded market.
Both journaling file systems and transactional file systems use the concept of a transaction. A
transaction is defined to be a file system event that is atomic. For example, a text file is either
saved or not saved to a disk. By keeping file system events atomic, the state of the file system is
always known, and therefore, the file system is perceived by the user to be reliable.
Always knowing the state of the file system is critical if power is lost or the system crashes. It
allows the system to power back up with the file system intact. The user experiences minimal
lost data.
Every file system requires writes to different areas of the disk when writing to any given file. Implementing these writes as a transaction requires that data be written in such a fashion that it can be determined which operations are valid and which are invalid should the process be interrupted at an inopportune moment. This requires that the on-media format be specifically designed for it, which is why it is impossible to write a FAT-compatible file system that is completely reliable: doing so would require media format changes, and these would make the system incompatible with FAT.
There are two aspects of modern computer hardware that make it very challenging to implement
a transaction:
1) Modern disk controllers may write data to the disk in a different order than that
requested by the file system. For example, writes are often queued up inside the disk
controller hardware. Depending on where the disk heads are in relation to the media
when a write request is issued, a given sector may be written before other sectors that
were already queued up.
2) Depending on the media and the block device driver design, individual sector writes
may not be atomic. In most cases, physical sector writes are atomic (either completely
written, or not modified at all). A truly reliable file system, however, cannot count on
this.
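The first hazard can be sketched as a toy model (not real driver code): a controller may complete queued writes in a different order than issued, so after a power cut an arbitrary subset of some reordering has reached the media.

```python
import random

# Toy model of a disk controller that reorders its write queue.
def flush_with_power_loss(queue, completed_before_loss, seed=0):
    rng = random.Random(seed)
    order = queue[:]
    rng.shuffle(order)                    # controller picks its own order
    return order[:completed_before_loss]  # only these reached the media

queue = ["data@sector9", "data@sector10", "commit-record"]
on_disk = flush_with_power_loss(queue, completed_before_loss=1)
# Which single write survived is arbitrary -- it may even be the commit
# record, describing data that was never written:
print(on_disk)
```

This is why a reliable on-media format must let the mount code decide validity from what is actually on disk, rather than assuming writes landed in the order they were issued.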


Journaling Versus Transactional File Systems


Journaling file systems use a logging approach to manage transactions. Disk operations are recorded as transactions in a circular buffer known as the journal. Transactional file systems use
a dual-state set of data structures to manage transactions. The two states are referred to as the
committed-state and the working-state.
As a result of these two different approaches, the key differences between journaling file systems and transactional file systems fall into the following areas:

User-data Integrity
The primary focus of journaling file systems is the preservation of the file system metadata, whereas a transactional file system ensures the integrity of both the metadata and the user's data. One notable exception is ext3, which does have modes for preserving user data as well.
These options are discussed in more detail later.
A second aspect of user-data integrity is how blocks are written to the disk. In a transactional file
system, user data belonging to the committed state is never overwritten. File operations that
would overwrite existing data are instead written to free blocks. Should the power be lost at an
inopportune moment, the committed disk state remains unchanged.
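This copy-on-write rule can be sketched as follows (names and structure are hypothetical, not the Reliance implementation): committed blocks are treated as read-only, and updates go to free blocks, so a crash before commit leaves the old data readable.

```python
# Hypothetical copy-on-write store: committed data is never overwritten.
class CowStore:
    def __init__(self, nblocks):
        self.blocks = [None] * nblocks
        self.committed = {}        # filename -> block holding committed data

    def free_block(self):
        used = set(self.committed.values())
        return next(i for i in range(len(self.blocks)) if i not in used)

    def write(self, name, data):
        """Write new data to a free block; never touch the committed block."""
        blk = self.free_block()
        self.blocks[blk] = data
        return blk                 # becomes live only at the commit below

    def commit(self, name, blk):
        self.committed[name] = blk

store = CowStore(4)
b0 = store.write("cfg", b"v1"); store.commit("cfg", b0)
b1 = store.write("cfg", b"v2")   # power lost here? "cfg" still reads b"v1"
print(store.blocks[store.committed["cfg"]])  # b'v1'
store.commit("cfg", b1)
print(store.blocks[store.committed["cfg"]])  # b'v2'
```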
In a journaling file system, user data may be overwritten during the normal course of operations. Upon startup after a power loss, the journal will be replayed to fix any metadata problems. The user data, however, is in an unknown state, and it becomes the responsibility of the application to determine the state of the data. This is a difficult problem to resolve for a variety of reasons, not the least of which is that the hard disk may write the data out of order, as previously described.

Performance
Performance issues fall into two categories:
1) Operational performance. A journaling file system is typically slower than a transactional file system because metadata changes must be written twice: once to the journal and once to the actual disk. A transactional file system writes metadata changes only a single time.
2) System startup performance. In the event of a power loss, a journaling file system
must open the journal and replay the events to ensure the file system integrity. This
will take a variable amount of time depending on the number of events in the journal.
A transactional file system needs only to perform a simple checksum on two logical disk
blocks to determine which one points to the valid disk state.
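The startup difference can be sketched as follows; the metaroot layout and checksum choice here are illustrative assumptions, not any particular product's on-media format.

```python
import zlib

# Journaling mount: work proportional to the number of logged events.
def journaling_mount(journal_entries, apply):
    for entry in journal_entries:      # O(number of logged events)
        apply(entry)

# Transactional mount: checksum two fixed blocks, pick the newest valid one.
def transactional_mount(metaroot_a, metaroot_b):
    def valid(m):                      # checksum stored alongside the block
        return m["crc"] == zlib.crc32(m["payload"])
    candidates = [m for m in (metaroot_a, metaroot_b) if valid(m)]
    return max(candidates, key=lambda m: m["seq"])

a = {"payload": b"state-41", "seq": 41, "crc": zlib.crc32(b"state-41")}
b = {"payload": b"state-42", "seq": 42,
     "crc": zlib.crc32(b"state-42") ^ 1}   # torn write: checksum mismatch
print(transactional_mount(a, b)["seq"])    # 41 -- falls back to last good state
```

The transactional path does the same small amount of work no matter how much activity preceded the power loss, which is what makes mount time consistent.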

Disk Space Efficiency


Journaling file systems require that a fixed amount of disk space be set aside for the journal itself; this is in addition to the standard file system metadata. The journal size must be large enough to contain the maximum number of events the system could ever need, and the size is determined at format time.
Transactional file systems have no such requirement, and in fact, the space needed to record the
dual-state information is smaller than the overhead required by most FAT implementations.

Programmability
Journaling file systems typically operate in a completely automated fashion, about which the running applications have no specific knowledge. Automated operation is ideal for legacy programs that won't be modified for use in the embedded system.
Transactional file systems can run in a similar automated fashion, or can be specifically controlled
by an application. Many programs used in embedded devices are specifically designed for that
environment and can benefit greatly by using a transactional file system that allows the application to control how transactions are committed to disk.
For example, it is not uncommon for an application to need to update several files on disk in an atomic fashion. This is a difficult problem to solve if a power interruption occurs, because the application has to contain logic to recover from the interrupted operation. With programmable transactions, this is easily accommodated.
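A hypothetical transaction-control API (names invented; not the actual Reliance interface) shows how grouping writes behind a single transaction point makes a multi-file update atomic:

```python
# Hypothetical transactional file system with an explicit transaction point.
class TxnFs:
    def __init__(self):
        self.committed, self.pending = {}, {}

    def write(self, name, data):
        self.pending[name] = data             # staged in the working state

    def transaction_point(self):
        self.committed.update(self.pending)   # all pending changes at once
        self.pending.clear()

    def crash(self):
        self.pending.clear()   # working state lost, committed state survives

fs = TxnFs()
fs.write("app.cfg", b"mode=new")
fs.write("app.idx", b"index-for-new-mode")
fs.crash()                     # power lost before the transaction point
print(fs.committed)            # {} -- neither file is half-updated
fs.write("app.cfg", b"mode=new")
fs.write("app.idx", b"index-for-new-mode")
fs.transaction_point()
print(sorted(fs.committed))    # ['app.cfg', 'app.idx']
```

Either both files reach the committed state or neither does; the application never sees a configuration file updated without its matching index.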

Overview of Journaling File Systems


Journaling file systems were developed by the database community [1] to solve the problem of data loss due to a system crash or power loss in server environments. Journaling file systems ensure that file system operations are atomic. This approach enables the file system to be structurally sound at all times. The user perceives that her data is reliable.
Strengths of Journaling File Systems

- Protects file system structures
- Atomic operations

Weaknesses of Journaling File Systems

- Protection for user data negatively impacts performance and overhead
- Slows down as disk utilization increases
- Mount time grows proportionally with the size of the journal

The most basic unit of journaling is called a transaction. In the context of a journaling file system, a transaction can be considered to be a single file operation. For example, a transaction could be to create file A or to delete file B.
Each transaction consists of a record of a sequence of changes made to separate disk sectors during a file operation. When the last modification within a transaction is complete, the contents of
the transaction are written to a log.
A log is a fixed-size, contiguous area on the disk that the journaling code uses as a circular buffer. The log is written only during normal operation, and when old transactions complete, their space in the log is reclaimed.
The key to journaling is that the disk blocks modified during a transaction are not written until
after the entire transaction is successfully written to the log.
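A minimal sketch of this write-ahead rule (structure invented for illustration): a transaction's block updates are only applied in place after the log entry carrying them gains a commit marker, and replay applies committed entries only.

```python
# Minimal write-ahead journal replay: partial log entries are ignored.
def replay(log, disk):
    for txn in log:
        if txn.get("committed"):               # no commit marker -> skip
            for sector, data in txn["writes"]:
                disk[sector] = data

disk = {1: "old-dirent", 2: "old-fat"}
log = [
    {"writes": [(1, "new-dirent"), (2, "new-fat")], "committed": True},
    {"writes": [(1, "newer-dirent")]},         # crashed before commit marker
]
replay(log, disk)
print(disk[1], disk[2])  # new-dirent new-fat
```

The second, uncommitted entry is simply discarded at mount time, which is how the journal avoids applying a half-written transaction.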
By buffering the transaction in memory until it is complete, journaling avoids partially written transactions. If the system crashes before successfully writing the journal, the entry is not considered valid. If the system goes down after the journal is written, then when the device reboots, it examines the log and replays outstanding transactions.

[1] Practical File System Design with the BE File System, Dominic Giampaolo, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1999, page 112.
Two different approaches to journaling are used by journaling file systems [2]. The difference relates to what information is written to the log:
1) Journaling file systems that log changes to metadata.
2) Journaling file systems that can log changes to both metadata and user data.
With either approach, logging changes to file system metadata is what guarantees the integrity
of a journaling file system. After a system crash, the structure of files, directories, and the file system can be made consistent by re-executing any pending changes that are completely described
in the log.
Journaling file systems that support the logging of user data are rarely implemented. In addition
to being very slow, another shortcoming is that the log must be much larger due to the need to
record both user data and metadata.

Examples of Journaling File Systems


This section describes the operation [3] of several journaling file systems that were developed for use in a server environment and later adapted for use in embedded devices.

Linux Ext3
Ext3 is a Linux-based journaling file system. Ext3 users can specify whether they want to log all
changes to both file data and metadata or whether they want to log only metadata changes.
Selecting between logging all data and metadata changes (the ext3 journaled mode) or simply logging metadata changes (the ext3 writeback mode) is done through mount options supplied when an ext3 file system is mounted.
Logging changes to both data and metadata is both more robust and substantially slower than logging metadata changes only. It is more robust because it includes a complete record of changes to all file system data in the log; it is slower because each committed file system update actually causes two sets of writes: the first set to the log when all the pending changes are logged, and the second set when those changes are actually made to the file system.
The ext3 file system's third logging mode, ordered logging, provides most of the guarantees of
fully journaled data mode without the performance penalties inherent in that mode. It does this
by flushing all data associated with a transaction to the disk before the transaction itself commits.
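For reference, these three logging modes are selected at mount time with the data= option; the device and mount point below are placeholders:

```shell
# Selecting an ext3 logging mode at mount time (illustrative paths):
mount -t ext3 -o data=journal   /dev/sda1 /mnt  # log metadata and user data
mount -t ext3 -o data=ordered   /dev/sda1 /mnt  # flush data before commit (default)
mount -t ext3 -o data=writeback /dev/sda1 /mnt  # log metadata only
```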

IBM JFS
JFS is IBM's full 64-bit journaling file system. It logs information about changes to the file system metadata as atomic transactions. If an embedded device is restarted without cleanly unmounting a JFS fileset, any transactions in the log that are not marked as having been completed on disk are replayed when the file system is checked for consistency before it is mounted.
This restores the consistency of the file system but not the contents of the files in the file system.
Files being edited when the system went down will not reflect any updates not successfully written to the disk.

[2] Linux File Systems, William von Hagen, Sams Publishing, 2002.
[3] Ibid.

ReiserFS
ReiserFS is built into every version of Linux running a 2.4.1 or greater kernel. ReiserFS journals file
system metadata updates rather than both data changes and metadata updates. ReiserFS uses
some clever strategies to maximize metadata consistency, even in the event of a sudden system
failure. For example, when updating file system metadata, ReiserFS does not overwrite the existing metadata but instead writes it to a new location as close as possible to the existing metadata.

SGI XFS
XFS was developed by SGI for its UNIX multimedia workstations. XFS provides high
throughput for streaming video and audio, support for huge files, and the ability to store large
amounts of data. XFS file systems are full 64-bit file systems composed of three areas: the data
section, the log, and an optional real-time section. The log includes only file system metadata.

Overview of Transactional File Systems


Transactional file systems were developed to solve the problem of data loss in embedded systems. Rather than ensuring
that each file system operation is atomic, a transactional file
system ensures that all file system operations between two
points in time are atomic. The advantages of this approach are
higher reliability, better performance, and improved disk space
efficiency.

Strengths of Transactional File Systems

- Protects file system structures
- Protects user data
- Live data never overwritten
- Atomic operations
- Consistent mount times regardless of disk size

Weaknesses of Transactional File Systems

- For optimal performance, transaction settings should be adapted for specific use cases

These file system operations can be file additions, changes, or deletions, or directory additions or deletions. The transactional file system writes data to the disk, but the data is defined to be committed or live data only when the transaction point occurs.

Examples of Transactional File Systems


Datalight Reliance
Committed and Working States
The Reliance file system has two distinct states: the committed (on-disk) state and the working state. The committed state is the state found on the media at initialization time. The committed state is also the state of the file system as written to the media at the last completed transaction point.
Reliance uses a concept of a metaroot to refer to the start of the disk state. The committed
state and the working state each have their own metaroot. The metaroot is the base of a hierarchical structure of file metadata that represents everything stored on the disk.
The working state is essentially a set of deltas since the last transaction point. On a freshly formatted disk, and immediately after a transaction point, the working state is empty. As changes
are made, the working state is built by updating the metaroot with the changes made to the
committed state.
The working state consists of the file system's logical structures in memory, data in the disk cache, and blocks on disk that have been written from the disk cache. A critical point to understand is that even though the working state will consist of some data that is already written to disk, the committed disk state is not modified. The working state only writes to blocks that are considered free by the committed state.
Executing a Transaction Point
When a transaction point is performed, the disk cache is flushed so that all the user data and metadata from the working state (except the metaroot) is written to disk. Once this is complete, the updated metaroot is written as an atomic operation. Once the working state's metaroot is successfully written to disk, it becomes the committed state, and the previous committed state becomes the new working state.
At all times during the course of building the working state, the committed state on the media remains unchanged. The working state never writes anything to disk that coincides with blocks that the committed state is using, but rather only writes to free blocks. At any time during this process, the power may be lost without losing anything from the committed state. Only operations from the interrupted working state will be lost.
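The discipline above can be sketched as follows; the metaroot structure and checksum are invented for illustration, not the actual Reliance on-media format. Updates land only in blocks the committed state regards as free, and the single atomic step is publishing a new metaroot.

```python
import zlib

# Illustrative metaroot: a sequence number, a file-to-block map, a checksum.
def make_metaroot(seq, block_map):
    payload = repr(sorted(block_map.items())).encode()
    return {"seq": seq, "map": dict(block_map), "crc": zlib.crc32(payload)}

committed = make_metaroot(1, {"file.txt": 10})  # committed: block 10 holds v1
working = dict(committed["map"])
working["file.txt"] = 11                        # v2 written to FREE block 11
# Power lost anywhere up to here? Block 10 was never touched, and the
# on-media metaroot still points at it.
committed = make_metaroot(2, working)           # the one atomic publish step
print(committed["map"]["file.txt"])  # 11
```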
System Startup Logic
At system startup time, Reliance examines the two metaroot blocks to determine which one represents the valid committed state. This involves doing a simple checksum on the two metaroot blocks. If both metaroots happen to be valid, then the most recent metaroot is used as the committed state. This is a very quick operation because the two metaroots are single logical disk blocks. Unlike other file systems, which must replay a journal or scan the entire media with utilities such as chkdsk, scandisk, or fsck to repair problems, Reliance requires no such utilities.
Transaction Models
A key to using Reliance is understanding and using a transactional model that is effective for the
type of device and applications being used. Using very frequent transaction points will adversely
affect performance. Using infrequent transaction points will improve performance, but place
more data at risk in the event of power loss. Because embedded devices are often running custom-written applications, one approach is to use a customized transaction model that lets the
developer precisely control how and when transaction points are done.
Reliance supports three different types of transaction models, which may be used together as well as under application control.
Automatic transactions are done in conjunction with prescribed file system operations. The developer can tell Reliance to do a transaction point every time a file is closed, for example. Transaction points can be configured for virtually all standard file system operations, as desired by the
system integrator or application programmer.
Timed transactions can be used to force a transaction to occur regularly at a given frequency.
Explicit transactions can be done under application control. Automatic and timed transactions are ideal for legacy applications that contain no Reliance-specific knowledge, while explicit transactions are ideal for embedded applications that have specific needs.
Typical embedded applications will often have different areas of program functionality that have different needs with regard to how transactions are performed. For example, while a program is doing data logging, timed transactions might be perfect. However, when the program is updating its configuration data files, explicit control is often useful. Applications often need to update a group of files in an atomic fashion: either they all get updated, or none of them do. This is a difficult problem to solve in an unstable power environment.
With Datalight Reliance, the application developer can set the default model to automatic or
timed transactions, and then programmatically disable that mode, perform operations on a
whole group of files, perform an explicit transaction point, and then re-enable the default transaction mode.

Datalight Reliance Nitro


Reliance Nitro is the next-generation file system from Datalight that improves upon Reliance. It shares the same transaction models and data protection features and adds a more sophisticated metadata structure to improve performance, particularly for systems that have many small files or complex directory structures. For more about the advantages gained from the Reliance Nitro architecture, see our whitepaper: Achieving Breakthrough Performance From Tree-based File Systems.

About Datalight

Datalight is the market leader in software technologies that manage data reliably in embedded devices. For more than 30 years, our focus on portable, flexible solutions has enabled customers to save money, reduce development time, and get to market faster. Our customers have discovered that Datalight solutions result in unparalleled interoperability and increased customer satisfaction. These accomplishments have earned Datalight a reputation as a provider of reliable and cost-effective software solutions that are backed by a commitment to customer service and satisfaction. For more information, visit www.Datalight.com or call 425.951.8086 ext 100.

Datalight, Inc.
22118 20th Avenue SE, Suite 135
Bothell, WA 98021 USA
1-800-221-6630
www.Datalight.com

Copyright 2013 Datalight, Inc. All rights reserved. DATALIGHT, Datalight, the Datalight Logo, FlashFX, FlashFX Pro, FlashFX Tera, FlashFXe, Reliance, Reliance Nitro, ROM-DOS, and Sockets are trademarks or registered trademarks of Datalight, Inc. All other product names are trademarks of their respective holders. Specification and price change privileges reserved.

Appendix A
File System Basics
This white paper assumes some familiarity with file system basics. Basic file system definitions
and concepts are outlined in this section.
A file system is a way to organize, store, retrieve, and manage information on a permanent storage medium such as a hard disk or flash memory. [4]
Each file system has a block size. The block size is defined to be the smallest unit that a file system
can write. Everything a file system does is composed of operations done on blocks. Basic file system operations include creating a file, opening a file, writing to a file, and so on.
A file system block is a logical unit rather than a physical unit. The logical block size of a file system is either the same size or a multiple of the sector size of the underlying storage medium.
Selecting the right logical block size is a compromise between minimizing wasted disk space and minimizing the number of blocks that must be allocated to store a file.
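The arithmetic behind that compromise is easy to see with a quick calculation: bigger blocks mean fewer blocks to manage per file, but more space lost to internal fragmentation in each file's final block.

```python
# Blocks needed and bytes wasted when storing a file at a given block size.
def usage(file_size, block_size):
    blocks = -(-file_size // block_size)      # ceiling division
    wasted = blocks * block_size - file_size  # slack in the last block
    return blocks, wasted

for bs in (512, 4096, 32768):
    blocks, wasted = usage(10_000, bs)
    print(f"{bs:6d}-byte blocks: {blocks:3d} blocks, {wasted:5d} bytes wasted")
# A 10,000-byte file needs 20 blocks (240 wasted) at 512 bytes,
# 3 blocks (2288 wasted) at 4096, and 1 block (22768 wasted) at 32768.
```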
User data is the named piece of information contained in a file. This piece of information may
be any of the following: text such as a letter, text such as program source code, a database, or a
graphic image.
Metadata is a piece of information about a file. Metadata may include the name of a file as well
as other information such as its owner, creation time, size, and date of last modification.
An i-node is a location where a file system stores metadata about a file. The i-node also provides
a pointer to the contents of the file on disk.
The volume of a file system refers to the embedded disk or disks that have been initialized with a file system. The term volume may refer to all the blocks on a disk, a portion of the blocks, or even a span of blocks across several disks.
The superblock of a file system is an area where a file system stores its critical volume-wide information. A superblock contains information such as the name and size of a volume. In some
systems the superblock may be referred to as the master block or the boot record.
Sector size or block size is the minimum unit that the storage medium can read or write. The block or sector size of most modern hard disks is 512 bytes. Flash memory management software manages flash memory so that it appears as a hard drive with 512-byte sectors, even though the native block sizes of flash memory are typically much larger.
A file system directory is a way to name and organize multiple files. The main purpose of a directory is to manage a list of files and to connect the names in the directory with the associated files.

[4] Practical File System Design with the BE File System, Dominic Giampaolo, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1999, page 7.
Basic file system operations include initialize, mount, unmount, create a file, open a file, read a file, write to a file, create a directory, delete files, rename files, open directories, and read directories. These operations are fairly self-explanatory with the exception of initialization, mounting, and unmounting, which are defined below.
Initialization of a file system occurs when an operating system creates an empty file system on
a given volume. The file system uses the volume size and any other user-specified options to determine the size and placement of its internal data structures.
Mounting a file system consists of several tasks: accessing a raw storage device, reading the superblock and other file system metadata, and then preparing the file system for access to a volume. Part of this preparation is verifying that the file system is valid. An alternate term for a valid file system is a clean or consistent file system, meaning that the metadata and the user data are consistent with each other. Full verification of a file system can take a long time, especially if the superblock indicates that the volume is dirty, usually the result of an unexpected power loss.
Unmounting a file system involves flushing out to disk all in-memory state associated with the volume. Once all the in-memory data (data in RAM or other volatile memory) is written to the volume, the volume is said to be clean. The last operation of unmounting is to mark the superblock to indicate that a normal shutdown occurred.
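The mount/unmount handshake can be sketched as a small simulation (structure invented for illustration): unmount flushes everything and marks the superblock clean, and mount checks that flag to decide whether a full consistency check is needed.

```python
# Toy volume showing the clean-shutdown flag in the superblock.
class Volume:
    def __init__(self):
        self.superblock = {"clean": True}
        self.in_memory, self.on_disk = {}, {}

    def mount(self):
        needs_fsck = not self.superblock["clean"]
        self.superblock["clean"] = False      # dirty while mounted
        return needs_fsck

    def unmount(self):
        self.on_disk.update(self.in_memory)   # flush cached state
        self.in_memory.clear()
        self.superblock["clean"] = True       # mark normal shutdown

vol = Volume()
print(vol.mount())   # False -- previous shutdown was clean
vol.unmount()
print(vol.mount())   # False -- clean unmount, no check needed
# Simulate a crash: no unmount before the next boot.
print(vol.mount())   # True  -- volume was left dirty, run fsck
```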

Bibliography

UNIX Filesystems: Evolution, Design, and Implementation, Steve D. Pate, Wiley, 2003.
Linux File Systems, William von Hagen, Sams Publishing, 2002.
Practical File System Design with the BE File System, Dominic Giampaolo, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1999.

