ZFS was first publicly released in the 6/2006 distribution of Solaris 10. Previous
versions of Solaris 10 did not include ZFS.
Pool Management
Members of a storage pool may be either hard drives or slices at least 128MB in
size.
Although virtual volumes (such as those from DiskSuite or VxVM) can be used as
base devices, it is not recommended for performance reasons.
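As a minimal sketch (pool and device names here are hypothetical), a pool can be
built from whole disks or from a suitably sized slice:
zpool create example-pool c1t0d0 c1t1d0
zpool create example-pool2 c1t2d0s4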
Filesystem Management
RAID Levels
ZFS filesystems automatically stripe across all top-level disk devices. (Mirrors and
RAID-Z devices are considered to be top-level devices.) It is not recommended that
RAID types be mixed in a pool. (zpool tries to prevent this, but it can be forced with
the -f flag.)
• RAID-0 (striping)
• RAID-1 (mirror)
• RAID-Z (similar to RAID 5, but with variable-width stripes to avoid the RAID
5 write hole)
• RAID-Z2
The zfs man page recommends 3-9 disks for RAID-Z pools.
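As hypothetical examples (pool and device names are placeholders), each RAID level
corresponds to a zpool create form:
zpool create example-pool c1t0d0 c1t1d0                         # striped (RAID-0)
zpool create example-pool mirror c1t0d0 c1t1d0                  # mirror (RAID-1)
zpool create example-pool raidz c1t0d0 c1t1d0 c1t2d0            # RAID-Z
zpool create example-pool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0    # RAID-Z2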
Performance Monitoring
ZFS performance management is handled differently than with older-generation file
systems. In ZFS, I/Os are scheduled similarly to how jobs are scheduled on CPUs. The
ZFS I/O scheduler tracks a priority and a deadline for each I/O. Within each deadline
group, the I/Os are scheduled in order of logical block address.
Writes are assigned lower priorities than reads, which can help to avoid traffic jams
where reads are unable to be serviced because they are queued behind writes. (If a
read is issued for a write that is still underway, the read will be executed against the
in-memory image and will not hit the hard drive.)
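One way to observe pool-level I/O activity is with zpool iostat; for example (pool
name and interval are placeholders):
zpool iostat -v example-pool 5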
Snapshots and Clones
To create a snapshot:
zfs snapshot pool-name/filesystem-name@snapshot-name
To clone a snapshot:
zfs clone snapshot-name filesystem-name
To roll back to a snapshot:
zfs rollback pool-name/filesystem-name@snapshot-name
The difference between a snapshot and a clone is that a clone is a writable, mountable
copy of the file system. This capability allows us to store multiple copies of mostly-
shared data in a very space-efficient way.
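As a hypothetical end-to-end example (pool, filesystem, and snapshot names are
placeholders):
zfs snapshot example-pool/home@monday                        # take the snapshot
zfs clone example-pool/home@monday example-pool/home-copy    # writable copy of the snapshot
zfs rollback example-pool/home@monday                        # discard changes made since the snapshot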
Zones
If the filesystem is created in the global zone and added to the local zone via zonecfg,
it may be assigned to more than one zone unless the mountpoint is set to legacy:
zfs set mountpoint=legacy pool-name/filesystem-name
zonecfg -z zone-name
add fs
set dir=mount-point
set special=pool-name/filesystem-name
set type=zfs
end
verify
commit
exit
Administrative rights for a filesystem can be granted to a local zone:
zonecfg -z zone-name
add dataset
set name=pool-name/filesystem-name
end
commit
exit
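Once the dataset has been delegated, the zone administrator can manage it from
within the zone; for example (names are placeholders):
zfs create pool-name/filesystem-name/data
zfs snapshot pool-name/filesystem-name/data@backup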
Data Protection
Checksums are used to validate data during reads and writes. The checksum algorithm
is user-selectable. Checksumming and data recovery are done at the filesystem level;
they are not visible to applications. If a block becomes corrupted on a pool protected
by mirroring or RAID-Z, ZFS will identify the correct data value and fix the corrupted
value.
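The algorithm is selected through the checksum property; as a sketch (the filesystem
name is a placeholder):
zfs get checksum example-pool/home
zfs set checksum=sha256 example-pool/home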
The scrubbing operation walks through the pool metadata to read each copy of each
block. Each copy is validated against its checksum and corrected if it has become
corrupted.
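A scrub can be started manually and its progress followed with the status command
(pool name is a placeholder):
zpool scrub example-pool
zpool status -v example-pool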
Hardware Maintenance
Once the drive has been physically replaced, run the replace command against the
device:
zpool replace pool-name device-name
After an offlined drive has been replaced, it can be brought back online:
zpool online pool-name disk-name
Firmware upgrades may cause the disk device ID to change. ZFS should be able to
update the device ID automatically, assuming that the disk was not physically moved
during the update. If necessary, the pool can be exported and re-imported to update
the device IDs.
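For example (pool name is a placeholder):
zpool export example-pool
zpool import example-pool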
Troubleshooting ZFS
It is important to check for all three categories of errors. One type of problem is often
connected to a problem from a different family. Fixing a single problem is usually not
sufficient.
The status command also reports on recovery suggestions for any errors it finds.
These are reported in the action section. To diagnose a problem, use the output of
the status command and the fmd messages in /var/adm/messages.
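For example, zpool status -x summarizes only the pools with problems, and -v adds
detail about persistent errors (pool name is a placeholder):
zpool status -x
zpool status -v example-pool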
The config section of the zpool status output reports the state of each device. The
state can be:
• ONLINE: Normal
• FAULTED: Missing, damaged, or mis-seated device
• DEGRADED: Device being resilvered
• UNAVAILABLE: Device cannot be opened
• OFFLINE: Administrative action
Once the problems have been fixed, transient errors should be cleared:
zpool clear pool-name
In the event of a panic-reboot loop caused by a ZFS software bug, the system can be
instructed to boot without the ZFS filesystems:
boot -m milestone=none
When the system is up, remount / as read-write and remove the file
/etc/zfs/zpool.cache. The remainder of the boot can proceed with the
svcadm milestone all command. At that point, import the good pools. The damaged
pools may need to be re-initialized.
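A rough sketch of that sequence, assuming a UFS root filesystem and placeholder pool
names:
mount -o remount,rw /          # make the root filesystem writable
rm /etc/zfs/zpool.cache        # discard the cached pool configuration
svcadm milestone all           # let the rest of the boot proceed
zpool import example-pool      # re-import each known-good pool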
Scalability
ZFS Recommendations
Because ZFS uses kernel addressable memory, we need to make sure to allow enough
system resources to take advantage of its capabilities. We should run on a system with
a 64-bit kernel, at least 1GB of physical memory, and adequate swap space.
While slices are supported for creating storage pools, their performance will not be
adequate for production use.
When workloads have different latency or other performance requirements, it makes
sense to separate them onto different pools with distinct hard drives. For example,
database log files should be kept in a separate pool from the data files.
Root pools are not yet supported in the Solaris 10 6/2006 release, though they are
anticipated in a future release. When root pools become available, it is best to keep
them separate from the pools holding other filesystems.
On filesystems with many file creations and deletions, utilization should be kept under
80% to protect performance.
The ZFS Evil Tuning Guide contains a number of tuning methods that may or may
not be appropriate to a particular installation. As the document itself suggests, these
mechanisms should be used carefully and only where they fit the installation.
For example, the Evil Tuning Guide provides instructions for:
• Turning off file system checksums to reduce CPU usage. This is done on a per-
file system basis:
zfs set checksum=off filesystem
ZFS can be used as a failover-only file system with Sun Cluster installations.
If ZFS is deployed on disks that are also used by Sun Cluster, do not place it on any
Sun Cluster quorum disks. (A ZFS-owned disk may be promoted to be a quorum disk on
current Sun Cluster versions, but adding a disk to a ZFS pool may result in quorum
keys being overwritten.)
ZFS Internals
Max Bruning wrote an excellent paper on how to examine the internals of ZFS data
structures. (Look for the article on the ZFS On-Disk Data Walk.) The structures are
defined in the ZFS On-Disk Specification.
Some key structures:
• uberblock_t: The starting point when examining a ZFS file system; a 128KB
array of 1KB uberblock_t structures, starting at offset 0x20000 within a vdev
label. Defined in uts/common/fs/zfs/sys/uberblock_impl.h. Only one uberblock is
active at a time; the active uberblock can be found with
zdb -uuu zpool-name
• blkptr_t: Locates, describes, and verifies blocks on a disk. Defined in
uts/common/fs/zfs/sys/spa.h.
• dnode_phys_t: Describes an object. Defined by uts/common/fs/zfs/sys/dmu.h.
• objset_phys_t: Describes a group of objects. Defined by
uts/common/fs/zfs/sys/dmu_objset.h.
• ZAP Objects: Blocks containing name/value pair attributes. ZAP stands for
ZFS Attribute Processor. Defined by uts/common/fs/zfs/sys/zap_leaf.h.
• Bonus Buffer Objects:
o dsl_dir_phys_t: Contained in a DSL directory dnode_phys_t; contains the
object ID for a DSL dataset dnode_phys_t.
o dsl_dataset_phys_t: Contained in a DSL dataset dnode_phys_t; contains
a blkptr_t pointing indirectly at a second array of dnode_phys_t for
objects within a ZFS file system.
o znode_phys_t: In the bonus buffer of dnode_phys_t structures for files
and directories; contains attributes of the file or directory. Similar to a
UFS inode in a ZFS context.
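These structures can be explored with zdb; as a rough sketch (pool and device names
are placeholders, and output details vary by release):
zdb -l /dev/dsk/c1t0d0s0       # dump the vdev labels for a device
zdb -uuu example-pool          # show the active uberblock
zdb -dddd example-pool         # dump dnode and object details for the pool's datasets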