ZFS was first publicly released in the 6/2006 distribution of Solaris 10. Previous
versions of Solaris 10 did not include ZFS.
Pool Management
Members of a storage pool may be either hard drives or slices at least 128MB in
size.
Although virtual volumes (such as those from DiskSuite or VxVM) can be used as
base devices, it is not recommended for performance reasons.
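As a minimal sketch (pool and device names here are hypothetical), a pool can be
built from whole disks or from a suitably sized slice:
zpool create example-pool c1t0d0 c1t1d0
zpool create example-pool2 c1t2d0s4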
Filesystem Management
RAID Levels
ZFS filesystems automatically stripe across all top-level disk devices. (Mirrors and
RAID-Z devices are considered to be top-level devices.) It is not recommended that
RAID types be mixed in a pool. (zpool tries to prevent this, but it can be forced with
the -f flag.)
• RAID-0 (striping)
• RAID-1 (mirror)
• RAID-Z (similar to RAID 5, but with variable-width stripes to avoid the RAID
5 write hole)
• RAID-Z2
The zfs man page recommends 3-9 disks for RAID-Z pools.
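As hypothetical examples (pool and device names are placeholders), each RAID level
corresponds to a zpool create form:
zpool create example-pool c1t0d0 c1t1d0                         # striped (RAID-0)
zpool create example-pool mirror c1t0d0 c1t1d0                  # mirror (RAID-1)
zpool create example-pool raidz c1t0d0 c1t1d0 c1t2d0            # RAID-Z
zpool create example-pool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0    # RAID-Z2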
Performance Monitoring
ZFS performance management is handled differently than with older-generation file
systems. In ZFS, I/Os are scheduled similarly to how jobs are scheduled on CPUs. The
ZFS I/O scheduler tracks a priority and a deadline for each I/O. Within each deadline
group, the I/Os are scheduled in order of logical block address.
Writes are assigned lower priorities than reads, which can help to avoid traffic jams
where reads are unable to be serviced because they are queued behind writes. (If a
read is issued for a write that is still underway, the read will be executed against the
in-memory image and will not hit the hard drive.)
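One way to observe pool-level I/O activity is with zpool iostat; for example (pool
name and interval are placeholders):
zpool iostat -v example-pool 5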
Snapshots and Clones
To create a snapshot:
zfs snapshot pool-name/filesystem-name@snapshot-name
To clone a snapshot:
zfs clone snapshot-name filesystem-name
To roll back to a snapshot:
zfs rollback pool-name/filesystem-name@snapshot-name
The difference between a snapshot and a clone is that a clone is a writable, mountable
copy of the file system. This capability allows us to store multiple copies of mostly-
shared data in a very space-efficient way.
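As a hypothetical end-to-end example (pool, filesystem, and snapshot names are
placeholders):
zfs snapshot example-pool/home@monday                        # take the snapshot
zfs clone example-pool/home@monday example-pool/home-copy    # writable copy of the snapshot
zfs rollback example-pool/home@monday                        # discard changes made since the snapshot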
Zones
If the filesystem is created in the global zone and added to the local zone via zonecfg,
it may be assigned to more than one zone unless the mountpoint is set to legacy:
zfs set mountpoint=legacy pool-name/filesystem-name
zonecfg -z zone-name
add fs
set dir=mount-point
set special=pool-name/filesystem-name
set type=zfs
end
verify
commit
exit
Administrative rights for a filesystem can be granted to a local zone:
zonecfg -z zone-name
add dataset
set name=pool-name/filesystem-name
end
commit
exit
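Once the dataset has been delegated, the zone administrator can manage it from
within the zone; for example (names are placeholders):
zfs create pool-name/filesystem-name/data
zfs snapshot pool-name/filesystem-name/data@backup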
Data Protection
Checksums are used to validate data during reads and writes. The checksum algorithm
is user-selectable. Checksumming and data recovery are done at the filesystem level;
they are not visible to applications. If a block becomes corrupted on a pool protected
by mirroring or RAID-Z, ZFS will identify the correct data value and fix the corrupted
value.
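The algorithm is selected through the checksum property; as a sketch (the filesystem
name is a placeholder):
zfs get checksum example-pool/home
zfs set checksum=sha256 example-pool/home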
The scrubbing operation walks through the pool metadata to read each copy of each
block. Each copy is validated against its checksum and corrected if it has become
corrupted.
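A scrub can be started manually and its progress followed with the status command
(pool name is a placeholder):
zpool scrub example-pool
zpool status -v example-pool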
Hardware Maintenance
Once the drive has been physically replaced, run the replace command against the
device:
zpool replace pool-name device-name
After an offlined drive has been replaced, it can be brought back online:
zpool online pool-name disk-name
Firmware upgrades may cause the disk device ID to change. ZFS should be able to
update the device ID automatically, assuming that the disk was not physically moved
during the update. If necessary, the pool can be exported and re-imported to update
the device IDs.
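For example (pool name is a placeholder):
zpool export example-pool
zpool import example-pool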
Troubleshooting ZFS
It is important to check for all three categories of errors. One type of problem is often
connected to a problem from a different family. Fixing a single problem is usually not
sufficient.
The status command also reports on recovery suggestions for any errors it finds.
These are reported in the action section. To diagnose a problem, use the output of
the status command and the fmd messages in /var/adm/messages.
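For example, zpool status -x summarizes only the pools with problems, and -v adds
detail about persistent errors (pool name is a placeholder):
zpool status -x
zpool status -v example-pool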
The config section of the zpool status output reports the state of each device. The
state can be:
• ONLINE: Normal
• FAULTED: Missing, damaged, or mis-seated device
• DEGRADED: Device being resilvered
• UNAVAILABLE: Device cannot be opened
• OFFLINE: Administrative action
Once the problems have been fixed, transient errors should be cleared:
zpool clear pool-name
In the event of a panic-reboot loop caused by a ZFS software bug, the system can be
instructed to boot without the ZFS filesystems:
boot -m milestone=none
When the system is up, remount / as read-write and remove the file
/etc/zfs/zpool.cache. The remainder of the boot can proceed with the
svcadm milestone all command. At that point, import the good pools. The damaged
pools may need to be re-initialized.
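A rough sketch of that sequence, assuming a UFS root filesystem and placeholder pool
names:
mount -o remount,rw /          # make the root filesystem writable
rm /etc/zfs/zpool.cache        # discard the cached pool configuration
svcadm milestone all           # let the rest of the boot proceed
zpool import example-pool      # re-import each known-good pool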
Scalability
ZFS Recommendations
Because ZFS uses kernel addressable memory, we need to make sure to allow enough
system resources to take advantage of its capabilities. We should run on a system with
a 64-bit kernel, at least 1GB of physical memory, and adequate swap space.
While slices are supported for creating storage pools, their performance will not be
adequate for production use.
When workloads have different latency or other performance requirements, it makes
sense to separate them onto different pools with distinct hard drives. For example,
database log files should be kept in a separate pool from the data files.
Root pools are not yet supported in the Solaris 10 6/2006 release, though they are
anticipated in a future release. When root pools become available, it is best to keep
them separate from the pools holding other filesystems.
On filesystems with many file creations and deletions, utilization should be kept under
80% to protect performance.
The ZFS Evil Tuning Guide contains a number of tuning methods that may or may
not be appropriate to a particular installation. As the document itself suggests, these
mechanisms should be used carefully and only where they fit the installation.
For example, the Evil Tuning Guide provides instructions for:
• Turning off file system checksums to reduce CPU usage. This is done on a per-
file system basis:
zfs set checksum=off filesystem
ZFS can be used as a failover-only file system with Sun Cluster installations.
If ZFS is deployed on disks that are also used by Sun Cluster, do not place it on any
Sun Cluster quorum disks. (A ZFS-owned disk may be promoted to be a quorum disk on
current Sun Cluster versions, but adding a disk to a ZFS pool may result in quorum
keys being overwritten.)
ZFS Internals
Max Bruning wrote an excellent paper on how to examine the internals of ZFS data
structures. (Look for the article on the ZFS On-Disk Data Walk.) The structures are
defined in the ZFS On-Disk Specification.
Some key structures:
• uberblock_t: The starting point when examining a ZFS file system; a 128KB
array of 1KB uberblock_t structures, starting at offset 0x20000 within a vdev
label. Defined in uts/common/fs/zfs/sys/uberblock_impl.h. Only one uberblock is
active at a time; the active uberblock can be found with
zdb -uuu zpool-name
• blkptr_t: Locates, describes, and verifies blocks on a disk. Defined in
uts/common/fs/zfs/sys/spa.h.
• dnode_phys_t: Describes an object. Defined by uts/common/fs/zfs/sys/dmu.h.
• objset_phys_t: Describes a group of objects. Defined by
uts/common/fs/zfs/sys/dmu_objset.h.
• ZAP Objects: Blocks containing name/value pair attributes. ZAP stands for
ZFS Attribute Processor. Defined by uts/common/fs/zfs/sys/zap_leaf.h.
• Bonus Buffer Objects:
o dsl_dir_phys_t: Contained in a DSL directory dnode_phys_t; contains the
object ID for a DSL dataset dnode_phys_t.
o dsl_dataset_phys_t: Contained in a DSL dataset dnode_phys_t; contains
a blkptr_t pointing indirectly at a second array of dnode_phys_t for
objects within a ZFS file system.
o znode_phys_t: In the bonus buffer of dnode_phys_t structures for files
and directories; contains attributes of the file or directory. Similar to a
UFS inode in a ZFS context.
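These structures can be explored with zdb; as a rough sketch (pool and device names
are placeholders, and output details vary by release):
zdb -l /dev/dsk/c1t0d0s0       # dump the vdev labels for a device
zdb -uuu example-pool          # show the active uberblock
zdb -dddd example-pool         # dump dnode and object details for the pool's datasets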