Copyright © 2017 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, and other trademarks are trademarks
of Dell Inc. or its subsidiaries. Other trademarks may be the property of their respective owners. Published in the
USA.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.
Use, copying, and distribution of any DELL EMC software described in this publication requires an applicable software license. The trademarks, logos, and service marks
(collectively "Trademarks") appearing in this publication are the property of DELL EMC Corporation and other parties. Nothing contained in this publication should be construed
as granting any license or right to use any Trademark without the prior written permission of the party that owns the Trademark.
AccessAnywhere, Access Logix, AdvantEdge, AlphaStor, AppSync, ApplicationXtender, ArchiveXtender, Atmos, Authentica, Authentic Problems, Automated Resource Manager,
AutoStart, AutoSwap, AVALONidm, Avamar, Aveksa, Bus-Tech, Captiva, Catalog Solution, C-Clip, Celerra, Celerra Replicator, Centera, CenterStage, CentraStar, EMC
CertTracker, CIO Connect, ClaimPack, ClaimsEditor, Claralert, CLARiiON, ClientPak, CloudArray, Codebook Correlation Technology, Common Information Model, Compuset,
Compute Anywhere, Configuration Intelligence, Configuresoft, Connectrix, Constellation Computing, CoprHD, EMC ControlCenter, CopyCross, CopyPoint, CX, DataBridge,
Data Protection Suite, Data Protection Advisor, DBClassify, DD Boost, Dantz, DatabaseXtender, Data Domain, Direct Matrix Architecture, DiskXtender, DiskXtender 2000, DLS
ECO, Document Sciences, Documentum, DR Anywhere, DSSD, ECS, eInput, E-Lab, Elastic Cloud Storage, EmailXaminer, EmailXtender, EMC Centera, EMC ControlCenter,
EMC LifeLine, EMCTV, Enginuity, EPFM, eRoom, Event Explorer, FAST, FarPoint, FirstPass, FLARE, FormWare, Geosynchrony, Global File Virtualization, Graphic
Visualization, Greenplum, HighRoad, HomeBase, Illuminator, InfoArchive, InfoMover, Infoscape, Infra, InputAccel, InputAccel Express, Invista, Ionix, Isilon, ISIS, Kazeon, EMC
LifeLine, Mainframe Appliance for Storage, Mainframe Data Library, Max Retriever, MCx, MediaStor, Metro, MetroPoint, MirrorView, Mozy, Multi-Band
Deduplication, Navisphere, Netstorage, NetWitness, NetWorker, EMC OnCourse, OnRack, OpenScale, Petrocloud, PixTools, Powerlink, PowerPath, PowerSnap, ProSphere,
ProtectEverywhere, ProtectPoint, EMC Proven, EMC Proven Professional, QuickScan, RAPIDPath, EMC RecoverPoint, Rainfinity, RepliCare, RepliStor, ResourcePak,
Retrospect, RSA, the RSA logo, SafeLine, SAN Advisor, SAN Copy, SAN Manager, ScaleIO, Smarts, Silver Trail, EMC Snap, SnapImage, SnapSure, SnapView, SourceOne,
SRDF, EMC Storage Administrator, StorageScope, SupportMate, SymmAPI, SymmEnabler, Symmetrix, Symmetrix DMX, Symmetrix VMAX, TimeFinder, TwinStrata, UltraFlex,
UltraPoint, UltraScale, Unisphere, Universal Data Consistency, Vblock, VCE, Velocity, Viewlets, ViPR, Virtual Matrix, Virtual Matrix Architecture, Virtual Provisioning, Virtualize
Everything, Compromise Nothing, Virtuent, VMAX, VMAXe, VNX, VNXe, Voyence, VPLEX, VSAM-Assist, VSAM I/O PLUS, VSET, VSPEX, Watch4net, WebXtender, xPression,
xPresso, Xtrem, XtremCache, XtremSF, XtremSW, XtremIO, YottaYotta, Zero-Friction Enterprise Storage.
A Data Domain system can connect to your network via Ethernet or Fibre Channel connections.
Data Domain systems consist of three components: a controller, disk drives, and enclosures to hold the
disk drives.
Data Domain systems use Serial Advanced Technology Attachment (SATA) disk drives and Serial
Attached SCSI (SAS) drives.
By reducing backup storage requirements by 10 to 30x and archive storage requirements by up to 5x, Data
Domain systems can significantly shrink the storage footprint, from small enterprise/ROBO (Remote
Office/Branch Office) environments all the way up to large enterprise environments.
Also available are the ES30 and DS60 expansion shelves that can be added to most Data Domain
systems for additional storage capacity.
The Data Domain system (controller and any additional expansion shelves) is connected to storage
applications by means of VTL via Fibre Channel, or CIFS or NFS via Ethernet.
In the exploded view diagram, the Data Domain controller sits at the center of the topology implemented
through additional connectivity and system configuration, including:
• Expansion shelves for additional storage, depending on the model and site requirements
• Media server Virtual Tape Library storage via Fibre Channel
• LAN environments providing connectivity for Ethernet-based data storage, basic data interactions,
and Ethernet-based system management
For both active and retention tiers, DD OS 5.2 and later releases support ES30 shelves. DD OS 5.7 and
later support DS60 shelves.
The FS15 SSD shelf uses the same form factor as the earlier ES30 expansion shelves and offers different
quantities of 800 GB SAS solid-state drives depending on the capacity of the active tier.
With a DD9800, the FS15 can be configured as required with either 8 or 15 disks.
When configured for high availability, the DD6800 requires 2 or 5 disks, and DD9300 models require 5 or 8
disks.
The FS15 SSD shelf always counts against the ES30 shelf maximum, but because it holds only metadata,
it does not affect capacity.
The SSD shelf for metadata is not supported for Extended Retention (ER) and Cloud Tier use cases.
Just like a traditional Data Domain appliance, DD VE is a data protection appliance, with one primary
difference: it has no Data Domain hardware tied to it. DD VE is a software-only virtual deduplication
appliance that provides data protection in an enterprise environment. It is intended as a cost-effective
solution for customer remote and branch offices.
DD VE 3.0 is supported on Microsoft Hyper-V and VMware ESXi versions 5.1, 5.5, and 6.0.
• Data Domain Secure Multi-tenancy (SMT) is the simultaneous hosting, by an internal IT department or
an external provider, of an IT infrastructure for more than one consumer or workload (business unit,
department, or Tenant).
• SMT provides the ability to securely isolate many users and workloads in a shared infrastructure, so
that the activities of one Tenant are not apparent or visible to the other Tenants.
• Conformance with IT governance and regulatory compliance standards for archived data
This is all hidden from users and applications. When the data is read, the original data is provided to the
application or user.
Deduplication performance depends on the amount of data, bandwidth, disk speed, and the CPU and
memory of the hosts and devices performing the deduplication.
When processing data, deduplication recognizes data that is identical to previously stored data. When it
encounters such data, deduplication creates a reference to the previously stored data, thus avoiding
storing duplicate data.
Hashing algorithms yield a unique value based on the content of the data being hashed. This value is
called the hash or fingerprint, and is much smaller in size than the original data.
Different data contents yield different hashes; each hash can be checked against previously stored
hashes.
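As a minimal sketch (not the actual Data Domain implementation), this fingerprint check can be illustrated in Python, here using SHA-256 as the hashing algorithm; the class and method names are illustrative only:

```python
import hashlib

def fingerprint(segment: bytes) -> str:
    # The fingerprint is a fixed-size digest, far smaller than the segment.
    return hashlib.sha256(segment).hexdigest()

class FingerprintIndex:
    """Toy index mapping fingerprints of stored segments to their data."""
    def __init__(self):
        self.store = {}

    def add(self, segment: bytes) -> bool:
        """Store a segment unless an identical one is already present.
        Returns True only when the segment was new (actually written)."""
        fp = fingerprint(segment)
        if fp in self.store:
            return False          # duplicate: only a reference is kept
        self.store[fp] = segment
        return True

idx = FingerprintIndex()
print(idx.add(b"segment-one"))    # True  (new data, stored)
print(idx.add(b"segment-one"))    # False (duplicate, referenced)
```

Different contents yield different fingerprints, so a dictionary lookup on the fingerprint is enough to decide whether a segment was seen before.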
• Fixed-length and variable-length deduplication are the other two methods; both are segment-based.
File-based deduplication enables storage savings and can be combined with compression (a way to
represent the same data in fewer bits) for additional savings. It is popular for desktop backups and can be
more efficient for data restores because it does not need to re-assemble files from segments. It can be
built into backup software, so an organization does not have to depend on a vendor disk appliance.
File-based deduplication results are often not as great as with other types of deduplication (such as block-
and segment-based deduplication). The most important disadvantage is that a modified file does not
deduplicate against previously backed-up versions of itself.
File-based deduplication stores an original version of a file and creates a digital signature for it (such as
a SHA-1 hash). Subsequent exact copies of the file are replaced with pointers to the original via its
signature rather than being stored again.
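A toy model of this scheme (hypothetical class, not a real backup API) keeps one copy per unique content plus a name-to-signature pointer table; it also shows the drawback noted above, where an edited file is stored again in full:

```python
import hashlib

class FileDedupStore:
    """Toy file-based deduplication: one stored copy per unique content,
    plus a name -> signature table acting as the pointers."""
    def __init__(self):
        self.blobs = {}     # signature -> file content
        self.files = {}     # file name -> signature (the "pointer")

    def backup(self, name: str, content: bytes) -> bool:
        sig = hashlib.sha1(content).hexdigest()   # SHA-1, as in the text
        new = sig not in self.blobs
        if new:
            self.blobs[sig] = content             # first copy: stored
        self.files[name] = sig                    # every backup: pointer
        return new

store = FileDedupStore()
store.backup("report.doc", b"quarterly numbers")
store.backup("copy-of-report.doc", b"quarterly numbers")   # pointer only
store.backup("report.doc", b"quarterly numbers, edited")   # modified: stored in full
print(len(store.blobs))   # 2 unique contents for 3 backups
```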
Fixed-length segment deduplication reads data and divides it into fixed-size segments. These segments
are compared to other segments already processed and stored. If the segment is identical to a previous
segment, a pointer is used to point to that previous segment.
For data that is identical (does not change), fixed-length segment deduplication reduces storage
requirements.
When data is altered the segments shift, causing more segments to be stored. For example, when you
add a slide to a Microsoft PowerPoint deck, all subsequent blocks in the file are rewritten and are likely to
be considered as different from those in the original file, so the deduplication effect is less significant.
Smaller segments deduplicate better than large ones, but deduplicating them consumes more resources.
In backup applications, the backup stream consists of many files. Backup streams are rarely entirely
identical, even when they are successive backups of the same file system. A single addition, deletion, or
change to any file changes the number of bytes in the new backup stream. Even if no file has changed,
adding a new file to the backup stream shifts the rest of the stream. Fixed-length segment deduplication
therefore stores large numbers of new segments, because the boundaries between segments have
moved.
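The boundary-shift problem can be demonstrated in a few lines of Python: inserting a single byte at the front of the data changes every fixed-length segment.

```python
def fixed_chunks(data: bytes, size: int = 4):
    """Split data into fixed-size segments at position-based boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"ABCDEFGHIJKL"
shifted  = b"XABCDEFGHIJKL"   # one byte inserted at the front

a = set(fixed_chunks(original))   # {b"ABCD", b"EFGH", b"IJKL"}
b = set(fixed_chunks(shifted))    # {b"XABC", b"DEFG", b"HIJK", b"L"}
print(a & b)   # empty set: no segment survives, nothing deduplicates
```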
Unlike fixed-length segment deduplication, variable-length segment deduplication uses the content of the
data stream itself to decide where segment boundaries fall. In this example, byte A is added to the
beginning of the data. Only one new segment needs to be stored, because the content that defines the
boundaries between the remaining segments was not altered.
Eventually variable-length segment deduplication finds the segments that have not changed and backs
up fewer segments than fixed-length segment deduplication. Even for storing individual files, variable-
length segments have an advantage: many files are very similar to, but not identical to, other versions of
the same file. Variable-length segments isolate the changes, find more identical segments, and store
fewer segments than fixed-length deduplication.
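A hedged sketch of content-defined chunking: real systems typically use a rolling hash (such as a Rabin fingerprint) to pick boundaries, but here a simple anchor byte stands in for the hash condition, which is enough to show why the boundaries survive an insertion.

```python
def variable_chunks(data: bytes, anchor: bytes = b" ") -> list:
    """Toy content-defined chunking: a chunk ends right after each anchor
    byte. Boundaries depend on content, not on byte position."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if bytes([byte]) == anchor:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

original = b"the quick brown fox"
shifted  = b"A " + original        # bytes inserted at the front

a = set(variable_chunks(original))  # {b"the ", b"quick ", b"brown ", b"fox"}
b = set(variable_chunks(shifted))
print(sorted(b - a))   # only one new chunk, b"A ", needs to be stored
```

Because the boundaries follow the content, every original chunk reappears unchanged in the shifted stream and deduplicates, unlike the fixed-length case above.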
Inline deduplication requires less disk space than post-process deduplication. With post-process
deduplication, files are written to disk first and only then scanned and deduplicated.
There is less administration for an inline deduplication process, as the administrator does not need to
define and monitor the staging space.
Inline deduplication analyzes the data in RAM, and reduces disk seek times to determine if the new data
must be stored. Writes from RAM to disk are done in full-stripe batches to use the disk more efficiently,
reducing disk access.
Source-based deduplication
• Occurs where data is created.
• Uses a host-resident agent, or API, that reduces data at the server source and sends just changed
data over the network.
• Reduces the data stream prior to transmission, thereby reducing bandwidth usage.
• DD Boost is designed to offload part of the Data Domain deduplication process to a backup server
or application client, thus using source-based deduplication.
Target-based deduplication
• Occurs where the data is stored.
• Is controlled by a storage system, rather than a host.
• Provides an excellent fit for a virtual tape library (VTL) without substantial disruption to existing
backup software infrastructure and processes.
• Works best for high change-rate environments.
SISL is used to implement Dell EMC Data Domain inline deduplication. SISL uses fingerprints and RAM to
identify segments already on disk.
SISL architecture provides fast and efficient deduplication by avoiding excessive disk reads to check if a
segment is on disk:
• 99% of duplicate data segments are identified inline in RAM before they are stored to disk.
• Scales with Data Domain systems using newer and faster CPUs and RAM.
• Increases new-data processing throughput-rate.
Local compression compresses segments before writing them to disk. It uses common, industry-standard
algorithms (for example, lz, gz, and gzfast). The default compression algorithm used by Data Domain
systems is lz.
Local compression is similar to zipping a file to reduce the file size. Zip is a file format used for data
compression and archiving. A zip file contains one or more files that have been compressed, to reduce file
size, or stored as is. The zip file format permits a number of compression algorithms. Local compression
can be turned off.
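As a rough analogy in Python, zlib (DEFLATE, an LZ-family algorithm) can stand in for the lz/gz/gzfast choices; this is an illustration, not the actual DD OS compression code:

```python
import zlib

# Local compression applies a standard algorithm to each segment before
# it is written to disk. Highly repetitive backup data compresses well.
segment = b"backup backup backup backup backup backup backup backup"
compressed = zlib.compress(segment, level=6)

print(len(segment), len(compressed))           # compressed is smaller
assert zlib.decompress(compressed) == segment  # lossless round trip
```

Like the zip analogy in the text, the compression is lossless: decompressing returns the original segment byte for byte.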
If the checksum read back does not match the checksum written to disk, the system attempts to
reconstruct the data. If the data cannot be successfully reconstructed, the backup fails and an alert is
issued.
Since every component of a storage system can introduce errors, an end-to-end test is the simplest way to
ensure data integrity. End-to-end verification means reading data after it is written and comparing it to
what was sent to disk, proving that it is reachable through the file system to disk, and proving that it is
not corrupted.
This ensures that the data on the disks is readable and correct and that the file system metadata
structures used to find the data are also readable and correct. This confirms that the data is correct and
recoverable from every level of the system. If there are problems anywhere, for example if a bit flips on a
disk drive, it is caught. Most problems are corrected through self-healing. If a problem cannot be corrected,
it is reported immediately, and the backup is repeated while the data is still valid on the primary store.
1. New data never overwrites existing data. (The system never puts existing data at risk.)
Traditional file systems often overwrite blocks when data changes, and then use the old block address.
The Data Domain file system writes only to new blocks. This isolates any incorrect overwrite (a software
bug problem) to only the newest backup data. Older versions remain safe.
As shown in this slide, the container log never overwrites or updates existing data. New data is written to
new containers. Old containers and references remain in place and safe even when software bugs or
hardware faults occur when new backups are stored.
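The no-overwrite container log can be modeled minimally (hypothetical class, for illustration only): the only operation is appending a new container, so older backups are immutable by construction.

```python
class ContainerLog:
    """Toy append-only container log: new data goes into new containers;
    existing containers are never overwritten or updated."""
    def __init__(self):
        self.containers = []

    def append(self, data: bytes) -> int:
        self.containers.append(data)
        return len(self.containers) - 1   # container ID used by references

    # Deliberately no update/overwrite method: old backups stay immutable.

log = ContainerLog()
first = log.append(b"backup generation 1")
second = log.append(b"backup generation 2")
print(log.containers[first])   # older data is untouched by newer writes
```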
2. In a traditional file system, there are many data structures (for example, free-block bitmaps and
reference counts) that support fast block updates. In a backup application, the workload is primarily
sequential writes of new data, so a Data Domain system is simpler and requires fewer supporting data
structures. New writes never overwrite old data. This design simplicity greatly reduces the chance of
software errors that could lead to data corruption.
The system includes a non-volatile RAM (NVRAM) write buffer into which it puts all data not yet safely on
disk. The file system leverages the security of this write buffer to implement a fast, safe restart capability.
The file system includes many internal logic and data structure integrity checks. If a problem is found by
one of these checks, the file system restarts. The checks and restarts provide early detection and recovery
from the kinds of bugs that can corrupt data. As it restarts, the Data Domain file system verifies the
integrity of the data in the NVRAM buffer before applying it to the file system and thus ensures that no data
is lost due to a power outage.
For example, in a power outage, the old data could be lost and a recovery attempt could fail. For this
reason, Data Domain systems never update just one block in a stripe. Following the no-overwrite policy,
all new writes go to new RAID stripes, and those new RAID stripes are written in their entirety. The
verification-after-write ensures that the new stripe is consistent (there are no partial stripe writes). New
writes never put existing backups at risk.
RAID 6:
– Protects against two disk failures.
– Protects against disk read errors during reconstruction.
– Protects against the operator pulling the wrong disk.
– Guarantees RAID stripe consistency even during power failure, without reliance on NVRAM or
an uninterruptible power supply (UPS).
– Verifies data integrity and stripe coherency after writes.
By comparison, after a single disk fails in other RAID architectures, any further simultaneous
disk errors cause data loss. A system whose focus is data protection must include the extra
level of protection that RAID 6 provides.
Continuous error detection works well for data being read, but it does not address issues with data that
may be unread for weeks or months before being needed for a recovery. For this reason, Data Domain
systems actively re-verify the integrity of all data every week in an ongoing background process. This
scrub process finds and repairs defects on the disk before they can become a problem.
If a Data Domain system does have a problem, DIA file system recovery ensures that the system is
brought back online quickly.
In a traditional file system, consistency is typically not verified after each write. Data Domain systems
verify each backup as it is written, ensuring consistency for all new writes. The usable size of a traditional
file system is often limited by the time it takes to recover the file system in the event of some sort of
corruption.
Imagine running fsck on a traditional file system with more than 80 TB of data. The reason the checking
process can take so long is the file system needs to sort out the locations of the free blocks so new writes
do not accidentally overwrite existing data. Typically, this entails checking all references to rebuild free
block maps and reference counts. The more data in the system, the longer this takes.
In contrast, since the Data Domain file system never overwrites existing data and doesn’t have block
maps and reference counts to rebuild, it has to verify only the location of the head of the log (usually the
start of the last completed write) to safely bring the system back online and restore critical data.
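A toy model of why log-head recovery is fast: restart only has to discard a trailing incomplete record, regardless of how much data the system holds. The commit-marker scheme here is illustrative, not the actual DD OS on-disk format.

```python
COMMIT = b"\xff"   # marker byte written last, present only on complete records

def find_head(log: list) -> int:
    """Return the index just past the last completely written record.
    Only the tail of the log is examined; no full scan, no free-block
    maps or reference counts to rebuild (unlike fsck)."""
    head = len(log)
    while head > 0 and not log[head - 1].endswith(COMMIT):
        head -= 1          # discard the trailing partial (torn) write, if any
    return head

log = [b"rec1" + COMMIT, b"rec2" + COMMIT, b"rec3"]  # rec3: torn write
head = find_head(log)
print(head)   # 2 -> rec3 is discarded; rec1 and rec2 remain valid
```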
The ddvar file structure keeps administrative files separate from storage files.
You cannot rename or delete /ddvar, nor can you access all of its sub-directories.
MTrees provide more granular space management and reporting. This allows for finer management of
replication, snapshots, and retention locking. These operations can be performed on a specific MTree
rather than on the entire file system. For example, you can configure directory export levels to separate
and organize backup files.
You can add subdirectories to MTree directories. You cannot add anything directly to the /data directory,
and /col1 cannot be changed; however, MTrees can be added under it. The backup MTree
(/data/col1/backup) cannot be deleted or renamed. MTrees that you add can be renamed and
deleted. You can replicate directories under /backup.
• Network File System (NFS) clients can have access to the system directories or MTrees on the Data
Domain system.
• Common Internet File System (CIFS) clients also have access to the system directories on the Data
Domain system.
• Dell EMC Data Domain Virtual Tape Library (VTL) is a disk-based backup system that emulates the
use of physical tapes. It enables backup applications to connect to and manage DD system storage
using functionality almost identical to a physical tape library. VTL (Virtual Tape Library) is a licensed
feature, and you must use NDMP (Network Data Management Protocol) over IP (Internet Protocol) or
VTL directly over FC (Fibre Channel).
• Data Domain Boost (DD Boost) software provides advanced integration with backup and enterprise
applications for increased performance and ease of use. DD Boost distributes parts of the deduplication
process to the backup server or application clients, enabling client-side deduplication for faster, more
efficient backup and recovery. DD Boost software is an optional product that requires a separate
license to operate on the Data Domain system.
Data Domain data paths include NFS, CIFS, DD Boost, NDMP, and VTL, running over Ethernet or Fibre
Channel.
Data Domain systems integrate non-intrusively into typical backup environments and reduce the amount
of storage needed to back up large amounts of data by performing deduplication and compression on data
before writing it to disk. The data footprint is reduced, making it possible for tapes to be partially or
completely replaced.
An organization can replicate and vault duplicate copies of data when two Data Domain systems have the
Data Domain Replicator software option enabled.
An Ethernet data path supports the NFS, CIFS, NDMP, and DD Boost protocols that a Data Domain
system uses to move data.
In the data path over Ethernet, backup and archive servers send data from clients to Data Domain
systems on the network via TCP/IP (or UDP/IP).
You can also use a direct connection between a dedicated port on the backup or archive server and a
dedicated port on the Data Domain system. The connection between the backup (or archive) server and
the Data Domain system can be Ethernet or Fibre Channel, or both if needed. This slide shows the
Ethernet connection.
VTL requires a Fibre Channel data path. DD Boost uses either a Fibre Channel or Ethernet data path.
To initially access the Data Domain system, the default administrator’s username and password will be
used. The default administrator name is sysadmin. The initial password for the sysadmin user is the
system serial number.
After the initial configuration, use the SSH or Telnet (if enabled) utilities to access the system remotely and
open the CLI.
The DD OS Command Reference Guide provides information for using the commands to accomplish
specific administration tasks. Each command also has an online help page that gives the complete
command syntax. Help pages are available at the CLI using the help command. Any Data Domain system
command that accepts a list (such as a list of IP addresses) accepts entries separated by commas, by
spaces, or both.
DD System Manager provides a single, consolidated management interface that allows for configuration
and monitoring of many system features and system settings. Note the Management options. As we
progress through the course we will use some of the Management options.
Also notice the information contained in the Footer: DDSM – OS – Model – User – Role.
Multiple DD Systems are now managed with Data Domain Management Center.
On Apple OS X: Mozilla Firefox 30 and higher, and Google Chrome.
The Data Domain Management Center can monitor all Data Domain platforms. The Data Domain
Management Center can monitor systems running DD OS version 5.1 and later.
The Data Domain Management Center includes an embedded version of the System Manager that can be
launched, providing convenient access to a managed Data Domain system for further investigation of an
issue or to perform configuration.
Deduplication improves storage efficiency and is performed inline. It looks for redundancy in large
sequences of bytes: sequences identical to those previously encountered and stored are replaced with
references to the previously stored data.
SISL gives Data Domain deduplication speed. 99% of duplicate data segments are identified inline in
RAM before they are stored to disk. This scales with Data Domain systems using newer and faster
CPUs and RAM.
There are three ways to interface with Data Domain administration: the command-line interface (CLI),
the System Manager GUI, and the Data Domain Management Center.