Evaluating Grid Storage for Enterprise Backup, DR and Archiving: NEC HYDRAstor
September 2008

It's hard to believe that just fifteen years ago, most enterprises were challenged to manage hundreds of gigabytes, and possibly several terabytes, of data. Today, most large enterprises are already managing at least one hundred terabytes of data, if not more, and are concerned that the monolithic storage architectures that were good at managing the scale of data required in the past have run out of gas. A number of new market drivers, including evolving regulatory mandates, increasingly stringent e-discovery requirements, new data sources such as Web 2.0, and data growth rates in the 50% to 60% range for at least the next five years, mean that most large enterprises will be called upon to manage multiple petabytes of data in the foreseeable future, and to do so cost-effectively. Decisions about replacement storage platforms need to take this scale issue into account.

Monolithic storage architectures have predominated in enterprise environments, but the new scale requirements are not well served by this architecture. When a monolithic array is outgrown, it requires a disruptive, fork-lift upgrade to move to next-generation technologies. Ratios between processing performance and storage capacity cannot be very flexibly configured, and routine storage management tasks such as provisioning are not cost-effectively scalable into the petabyte range. Growing workloads cannot be seamlessly spread across physical array boundaries. Burgeoning requirements are calling for a new, dense storage architecture that has very different performance, capacity, configuration flexibility, and $/GB requirements than those offered by monolithic storage architectures.

In this Product Profile, we'll take a look at the requirements for secondary storage platforms designed to handle the data explosion cost-effectively, and then show how NEC's HYDRAstor, a scale-out NAS solution that embodies a number of compelling technologies, meets those requirements. HYDRAstor is already enjoying strong traction in the industry among large enterprises because of its massive scalability, rich feature set, and aggressive pricing (usable capacity under $1/GB at scale).

Copyright The TANEJA Group, Inc. 2008. All Rights Reserved.
87 Elm Street, Suite 900, Hopkinton, MA 01748 | Tel: 508-435-5040 | Fax: 508-435-1530 | www.tanejagroup.com

Re-Evaluating the $/GB Costs of Conventional Storage Environments


To manage overall storage capacity more cost-effectively, Taneja Group has been encouraging end users to consider their data access and usage patterns, understand how that translates into storage performance (particularly access latency) requirements, know the costs associated with different storage tiers, and then use that information to tier storage appropriately to achieve the lowest overall $/GB to meet specific requirements. Armed with this data, large enterprises can make strategic decisions about where to deploy disk in their secondary storage environments, allowing them to actually improve their ability to meet service level agreements while managing to a lower overall $/GB cost for storage capacity (across both primary and secondary storage). In our experience, most large enterprises are keeping 70% to 90% more data on primary storage than is necessary to meet their actual requirements, and as a result are paying a significantly higher $/GB cost for their overall storage capacity than they have to. This fact, combined with the very aggressive price points for disk-based secondary storage that are being offered by scale-out NAS platforms built around grid architectures, will be a flashpoint for large enterprises looking to take the plunge.

It's clear that conventional storage architectures do not meet the scalability, performance, and manageability requirements very cost-effectively in this new era of exploding data growth, but it may not be clear what the right successor architecture is. In our opinion, the intelligent application of storage tiering concepts will become increasingly important, with grid-based storage as the heir apparent in the secondary storage arena, offering compelling economic advantages over both tape and existing monolithic storage architectures.

An Ideal Storage Tiering Model


In an ideal world, Taneja Group posits a four-tier storage architecture. Legacy environments often had only two tiers: disk for primary storage, and tape for everything else. Disk offered low latency access, high availability and reliability, and could support high duty cycles with enterprise-class Fibre Channel (FC) and SCSI disk drives, but has historically cost two orders of magnitude or more than tape on a $/GB basis. Tape offered nearly limitless capacity and very low acquisition costs on a $/GB basis, but did not meet even medium duty cycle requirements very reliably. Accordingly, disk was used for primary storage and tape for backup and, if they were being done at all, DR and archiving.

Primary storage requirements include low latency, high availability, high data reliability, and a high duty cycle (since data residing in primary storage environments is expected to be accessed and modified frequently). As data ages, frequency of access tends to decline, although access frequency declines at different rates for different data.
At some point, data no longer requires low latency access and can be moved to lower cost secondary storage. Within secondary storage environments, there can be multiple tiers, each with different requirements. Secondary storage generally hosts applications like backup, disaster recovery (DR), and archiving. It's interesting to note that, for most organizations, a given amount of primary storage generates at least five to ten times as much secondary storage, if not more. Based on studies done by the Taneja Group in 1H08, enterprise-class primary storage generally costs five to eight times as much as secondary storage, so it is important to ensure that the only data that is maintained on primary storage actually has to be there. But with a given amount of primary storage generating as much as 10x or more secondary storage capacity, it is also important to ensure that the secondary tier(s) meets the performance, availability, and reliability requirements with an aggressive $/GB.
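
As a rough illustration of the economics behind this tiering argument, the sketch below computes a blended $/GB for a storage estate before and after moving infrequently accessed data to secondary storage. The prices and the estate size are assumptions chosen for illustration; the paper only states that primary storage generally costs five to eight times as much as secondary storage.

    # Illustrative blended $/GB under tiering. Prices and estate size are assumed
    # placeholders; the source only cites a 5x-8x primary-to-secondary cost ratio.
    PRIMARY_COST_PER_GB = 15.00     # assumed
    SECONDARY_COST_PER_GB = 2.50    # assumed (6:1 ratio, inside the cited 5x-8x range)
    ESTATE_GB = 100 * 1000          # assumed 100TB estate

    def blended_cost_per_gb(fraction_on_secondary: float) -> float:
        primary_gb = ESTATE_GB * (1 - fraction_on_secondary)
        secondary_gb = ESTATE_GB * fraction_on_secondary
        total = primary_gb * PRIMARY_COST_PER_GB + secondary_gb * SECONDARY_COST_PER_GB
        return total / ESTATE_GB

    print(f"All data on primary tier:  ${blended_cost_per_gb(0.0):.2f}/GB")
    print(f"80% moved to secondary:    ${blended_cost_per_gb(0.8):.2f}/GB")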

Figure 1. A tiered storage architecture in an ideal world, showing the key requirements for each tier.

Secondary storage requirements are no longer well met using only tape. With the rapid growth of primary data, backup windows have become a problem. 7x24 requirements for many applications leave little time to back up data sets that are now larger than ever, and growing faster than ever before. Stringent recovery point objective (RPO) and recovery time objective (RTO) requirements are not well met with tape, particularly for the most frequent types of restore requests (object level restores). When tape is used to regularly handle the types of duty cycle requirements demanded by most backup and restore activities, its reliability suffers; from our discussions with end users, it is not uncommon for them to see tape reliability issues affecting as many as 20% or more of attempted restores.
Disk offered a solution for these problems, but its high cost had historically limited its use in secondary storage applications. The emergence of new disk technologies, such as SATA and in particular storage capacity optimization (data reduction technologies like data de-duplication), has now dropped the cost of usable, disk-based secondary storage capacity. Grid architectures provide a storage platform that can leverage disk effectively, can scale to petabytes of capacity, can leverage inherent redundancies to support high levels of availability and resiliency, and, when used in conjunction with storage capacity optimization technologies, can offer $/GB costs under $1.

DR requires low cost, long term storage, off-site location, and may carry its own set of RPO/RTO requirements. If disk is used in DR environments, this allows replication to be used to send data to a remote location, avoiding the security and other issues associated with physical tape transport. The use of storage capacity optimization technology can significantly reduce the amount of data that has to be sent across a wide area network (WAN), making replication a viable and more cost-effective option than ever before. The use of replication to create and maintain one or more DR sites generally lets a company significantly improve the RPO of their DR capability, since getting the data off of backup clients and to remote locations usually takes only hours instead of days.

As e-discovery grows in importance, archiving platform requirements are changing. Tapes are not searchable on-line, and when used as a medium for archiving result in very high discovery costs. Given the prevalence of lawsuits (most large companies are involved in at least one lawsuit at all times), an archive that is searchable on-line can now provide significant cost savings. Fortune 1000 companies frequently assume a minimum of $500,000 in discovery costs against a tape-based archive per lawsuit, whereas e-discovery costs against a disk-based archive for the same lawsuit could easily be one tenth of that. These issues together mean that it is time for large enterprises to re-evaluate the use of disk for secondary storage applications like backup, DR, and archiving.

Criteria For The New Secondary Storage Platform


Offerings in this space should be platform-based products that offer a set of specific capabilities that can be leveraged across a variety of secondary applications. The assumption is that they will be used in conjunction with existing secondary storage application software like enterprise backup or archiving, and it is the combination of the two (e.g. an enterprise backup application and a new secondary storage platform) that provides the overall solution.

Massive scalability and performance. In the new environment, secondary storage requires a scalable architecture that can take a customer from entry level configurations in the terabyte range to tens of petabytes (PBs) of data and beyond, providing the necessary throughput as configurations grow.

Modular configurability. Business and regulatory mandates require that the secondary storage platform be independently scalable and configurable along two key metrics: capacity and performance. This configuration flexibility should allow the creation of extremely throughput-intensive or extremely capacity-intensive configurations by adding the appropriate resources as needed, without incurring disruptive fork-lift upgrades.

High availability. To provide the service necessary in the 24x7 world of IT operations, the secondary storage platform should have no single point of failure and allow for on-line maintenance, upgrades, and data migration. This must include a transparent ability to rebalance workloads as resources are added or subtracted from a system, as well as an ability to easily establish replicated configurations for DR purposes.

Platform data management tools. Data must be stored securely and redundantly in a manner that can sustain multiple concurrent failures; data must be both reliable and persistent. The platform must include storage capacity optimization technologies designed to minimize the amount of raw storage capacity required to store a given amount of information. Capacity optimization ratios will vary based on data type, application, and company-specific policies. Despite all of disk's advantages in secondary storage applications, the administrative overhead associated with disk provisioning has been a significant disadvantage, one which becomes more onerous as storage capacities increase. Look for self-managing capabilities that can help to significantly increase the administrative span of control by automating provisioning tasks as much as possible.

Ability to integrate with existing processes. Most large enterprises already have a set of processes established around secondary storage applications like backup and archive that likely leverage some combination of disk and tape. Millions of dollars have often been invested in these processes, and any changes or additions to secondary storage infrastructure can only realistically be done in an evolutionary way which requires minimal or no disruption to existing processes. While storage targets may offer different characteristics, it is important that they support the different access mechanisms and management paradigms of widely deployed backup and archive management software.

Industry standards-based. The platform must be built to outlast vendor-specific technology, such as underlying hardware and APIs, which will undoubtedly change over the long time periods likely to be required for most secondary storage applications. Use of commodity hardware, standardized access protocols, and support for a wide variety of popular secondary storage applications will provide a future-proof solution.

Affordability. Tape has been the media of choice for most secondary storage applications. Tape acquisition and associated power consumption costs are low, but given evolving data protection and archiving requirements, tape's disadvantages are becoming more impactful and more costly.
Tape fares poorly against disk for discovery purposes, and tape also does not offer the option to winnow down the amount of data retained on primary storage that no longer requires low latency access but still requires on-line access. When looking at a secondary storage platform to accommodate future growth, we strongly urge end users to take issues such as these into account, not just acquisition costs. In certain application environments, disk-based platforms may offer total costs of ownership that are very close to tape.

With these criteria, it is important to note that they define a set of features that can be used across multiple secondary storage applications simultaneously. Platforms in this class in fact offer this as a specific cost-savings advantage, providing an ability to safely co-locate certain secondary storage applications simultaneously on a single, massively scalable storage platform. Virtual platforms, dedicated to backup, archive, or other secondary storage applications, can be established within the larger platform, and configured and managed accordingly. This allows the storage for multiple applications to be centrally managed, while any application-specific management requirements are met by the secondary storage application software.

Enter NEC HYDRAstor

Figure 2. HYDRAstor layers into an existing environment, presenting a NAS based storage target for existing secondary storage applications like backup and archive.

NEC Corporation is a $41B international company that has been selling storage products for over 50 years. NEC Corporation of America, the North American subsidiary of NEC Corporation headquartered in Irving, Texas, unveiled its HYDRAstor unified disk storage platform in 2007, basing it around a grid storage architecture that is scalable into the petabyte range today.
HYDRAstor supports two types of nodes: Accelerator and Storage. Accelerator Nodes scale performance, while Storage Nodes scale disk capacity (and the performance necessary to address the added capacity). Both node types can be added non-disruptively. NEC initially targeted HYDRAstor as a backup and archival storage platform for large enterprise environments, but as it continues to gain traction and prove itself in the market, it's clear that NEC could broaden its positioning to include use as a primary storage platform for consolidated file services.

In September 2008, NEC unveiled the HYDRAstor HS8-2000, which includes more powerful nodes based on newer technologies that can be easily integrated into existing HYDRAstor configurations built around the older HS8-1000 nodes. NEC has also enhanced RepliGrid, HYDRAstor's replication software, to include support for N-to-1 replicated configurations, supporting more flexibility in setting up DR solutions.

As the first grid storage solution introduced by a Fortune 200 company, HYDRAstor caused quite a stir. Since its introduction less than a year ago, NEC has deployed HYDRAstor in production in over 50 large enterprise environments across a number of different verticals, including financial services, health care, telecommunications, entertainment, manufacturing, and education. Common business drivers for HYDRAstor acquisition cited by NEC customers include its ability to support a disk-based secondary storage platform that is massively scalable, cost-effective, and exhibits very high data reliability. Most customers were already sold on the concept of using disk for backup, but were concerned about disk's affordability and manageability. HYDRAstor's aggressive $/GB costs for usable storage capacity, strongly aided by its support for storage capacity optimization technology, combined with its self-managing, self-healing capabilities to address these concerns.

How Well Does HYDRAstor Meet The Criteria?


The detailed discussion of how HYDRAstor meets our established criteria will rely heavily on an understanding of what the Accelerator and Storage Nodes do and how they are configured, so we'll address that issue first. Accelerator Nodes are based around an industry standards-based 2U server with Intel-compatible CPUs, main memory, and 2 internal drives. In the HS8-2000 Accelerator Node models just announced this month, NEC is using quad core, 3GHz CPUs, 8GB of RAM, and 147GB SAS drives. Each Accelerator Node supports 300MB/sec of throughput and offers multiple Ethernet (1Gb or 10Gb) connection options. Accelerator Nodes manage the NFS or CIFS interface, perform the preliminary chunking and fingerprinting of data as it comes into the system, and handle the routing for HYDRAstor's unique distributed hash table. Chunking is performed at the sub-file level and uses variable length windows - two factors supporting higher data reduction ratios. All Accelerator Nodes run DynamicStor, which can be thought of as an operating environment that includes a coherent, distributed file system that leverages the inherent redundancy of grid architectures and an automatic failover capability to ensure that no node is a single point of failure. The distributed hash table is one of HYDRAstor's standout technologies, as will be explained later.
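
The paper does not disclose NEC's chunking algorithm beyond saying that it works at the sub-file level with variable length windows, so the sketch below shows the general class of technique (content-defined chunking driven by a rolling checksum) simply to illustrate why variable-length chunks let de-duplication survive insertions that would shift every fixed-size block. The checksum, boundary mask, and size limits are arbitrary choices for this sketch, not NEC's values.

    # Minimal content-defined chunking sketch; the toy checksum below is a
    # stand-in for a real rolling hash (e.g., Rabin fingerprinting) and is NOT
    # NEC's algorithm.
    import hashlib

    def chunk_boundaries(data: bytes, mask=0x1FFF, min_chunk=2048, max_chunk=65536):
        """Yield chunk end offsets where the checksum hits a boundary pattern."""
        checksum, start = 0, 0
        for i, byte in enumerate(data):
            checksum = ((checksum << 1) ^ byte) & 0xFFFFFFFF
            length = i - start + 1
            if (checksum & mask) == 0 and length >= min_chunk or length >= max_chunk:
                yield i + 1
                start, checksum = i + 1, 0
        if start < len(data):
            yield len(data)

    def chunks_with_fingerprints(data: bytes):
        prev = 0
        for end in chunk_boundaries(data):
            chunk = data[prev:end]
            yield hashlib.sha1(chunk).hexdigest(), chunk   # fingerprint plus payload
            prev = end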

Storage Nodes are also based around an industry standards-based 2U server that supports quad core, 3 GHz CPUs, 24GB of RAM, and 12 1TB SATA disk drives. DynamicStor virtualizes storage within and across all Storage Nodes in a grid to create one logical pool of capacity accessible by all Accelerator Nodes, and handles all the storage provisioning tasks automatically as resources are added, subtracted, or moved to different locations. For performance and scalability reasons, each Storage Node supports only a part of the distributed hash table. After an Accelerator Node performs the preliminary storage capacity optimization work, it sends the data directly to the Storage Node responsible for that part of the hash table. Because Storage Nodes do not have to process the entire hash table, they support much higher performance than conventional methods where the entire hash table must be processed for each request. This distributed hashing approach also overcomes other issues common with conventional hashing designs, such as an inability to go beyond a hash table of a certain size (due to main memory or other limitations) and poor performance due to paging (because the hash table is too big to fit in main memory).
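
The paper describes each Storage Node owning only a slice of the distributed hash table, with Accelerator Nodes routing each fingerprint directly to the owning node rather than consulting one central table. The sketch below shows that style of routing under an assumed rule (owner derived from the leading bits of the fingerprint); the actual HYDRAstor partitioning scheme is not documented here.

    # Hash-partitioned routing sketch; the owner-selection rule is an assumption
    # made for illustration, not the documented HYDRAstor scheme.
    import hashlib

    class StorageNode:
        def __init__(self, name: str):
            self.name = name
            self.hash_partition = {}   # fingerprint -> stored chunk (this node's table slice)

        def store(self, fingerprint: str, chunk: bytes):
            self.hash_partition.setdefault(fingerprint, chunk)

    def route(fingerprint: str, nodes: list) -> StorageNode:
        # The owning node is derived from the fingerprint itself, so no node ever
        # needs to hold or search the full hash table on the data path.
        return nodes[int(fingerprint[:8], 16) % len(nodes)]

    nodes = [StorageNode(f"SN{i:02d}") for i in range(12)]
    chunk = b"example chunk payload"
    fingerprint = hashlib.sha1(chunk).hexdigest()
    route(fingerprint, nodes).store(fingerprint, chunk)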

Figure 3. HYDRAstor systems can be created from any combination of Accelerator and Storage Nodes, supporting tens of thousands of MB/sec of bandwidth and tens of petabytes, to meet a wide variety of performance and capacity requirements.

HYDRAstor systems can be very flexibly configured by adding Accelerator and Storage Nodes independently as needed. Backup applications are generally deployed in ratios of 1:2 (Accelerator Nodes to Storage Nodes), while archive applications generally require less bandwidth due to usage patterns, and are deployed in ratios of 1:4 (Accelerator Nodes to Storage Nodes). HYDRAstor can, however, support any combination of Accelerator and Storage Nodes.

Massive scalability and performance. Entry-level HYDRAstor configurations start with what NEC calls a 1x2 configuration that includes 1 Accelerator Node and 2 Storage Nodes, providing 300MB/sec of throughput and 24TB of raw capacity (12TB per Storage Node). Assuming a 20:1 data de-duplication ratio and the 25% data protection overhead inherent in HYDRAstor's Distributed Resilient Data (DRD) logical data redundancy default setting, this entry level configuration would provide a customer with 360TB of usable capacity. The largest pre-configured HYDRAstor model available from NEC has 55 Accelerator Nodes, 110 Storage Nodes, supports 16.5GB/sec of bandwidth, and offers 1.3PB of raw storage capacity. This system is actually being built in NEC's Redmond Technology Center in Washington state, and will be used for internal testing and demo purposes. HYDRAstor includes DataRedux, NEC's patent-pending storage capacity optimization technology, which brings usable capacity easily up into the tens of petabytes for large configurations, depending upon the level of data redundancy configured and the data reduction ratios achievable across various workloads. Existing customers using HYDRAstor as a backup platform are achieving data reduction ratios of 20:1 or more over time.

HYDRAstor's architecture puts it into an elite category of platforms that are able to provide well-balanced, high performance even as the system is scaled out to its maximum capacity. To understand how HYDRAstor achieves this linear scalability, we need at least a high level understanding of how HYDRAstor handles the data. Each Accelerator Node includes at least one share (i.e. file system). When a file is written to this share, it is broken into chunks. A chunk is then broken into multiple fragments and stored across as many Storage Nodes as possible. This approach spreads the processing associated with any disk read or write across many disk spindles, each of which has dedicated processing power within its own Storage Node. As Storage Nodes are added, so is the additional processing power necessary to keep their disks performing optimally. As Accelerator Nodes are added, they also provide additional processing power, as well as an additional 300MB/sec of throughput, whose usage is spread in a balanced manner across all existing Storage Nodes by DynamicStor. It is this balanced addition of resources, the usage of which is spread evenly across all nodes in the system, which supports HYDRAstor's linear scalability.
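
The entry-level sizing arithmetic quoted above can be restated in a few lines; the 20:1 de-duplication ratio is the paper's assumption and real ratios will vary by workload.

    # Reproducing the 1x2 entry-level sizing arithmetic quoted above.
    raw_tb = 24.0            # two Storage Nodes at 12TB raw each
    drd_overhead = 0.25      # default DRD resiliency level 3: 3 of 12 fragments are parity
    dedup_ratio = 20.0       # assumed, per the paper; actual ratios vary by workload

    base_tb = raw_tb * (1 - drd_overhead)      # capacity left after DRD protection
    effective_tb = base_tb * dedup_ratio       # logical data the system can hold

    print(f"Base capacity:      {base_tb:.0f} TB")       # 18 TB
    print(f"Effective capacity: {effective_tb:.0f} TB")  # 360 TB, matching the figure above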

HYDRAstor's scale-out architecture allows resources to be added over time and even across technology generations (e.g. Accelerator Nodes incorporating newer processor technology and higher density memory, Storage Nodes incorporating larger capacity and/or different disk types) without incurring a fork-lift upgrade. This extends the usable life of a platform like HYDRAstor from the standard 3-4 years associated with monolithic architectures to one that could be decades in length, significantly changing the economics associated with total cost of ownership (TCO) calculations.

Modular configurability. With HYDRAstor, Accelerator and Storage Nodes can be added as needed, with no constraints. As shown earlier in Figure 3, if more throughput is required, more Accelerator Nodes can be added, up to a maximum configuration supporting 16.5GB/sec. If more storage capacity is needed, more Storage Nodes can be added, supporting tens of petabytes of usable capacity. Unlike monolithic storage architectures, performance does not degrade as capacity is added (on the contrary, it increases) and there is no constraint which limits the maximum amount of throughput associated with any particular amount of storage capacity. Note also that, by defining share names and associating those with particular secondary storage applications, users can effectively create multiple virtual systems which have the same scalability characteristics as the overall HYDRAstor configuration. A particular share or set of shares can be designated as a backup platform, while a separate set of shares is dedicated for use as an archive platform. These virtual systems are all managed centrally through HYDRAstor's web-based management GUI.

High availability. DynamicStor takes advantage of the inherent redundancy in grid architectures to maintain active-active configurations for all Accelerator and Storage Nodes. This is one of the reasons that NEC recommends that HYDRAstor configurations include multiple nodes of each type. If an Accelerator Node fails, then another Accelerator Node can take over and serve the shares that the failed Accelerator Node was providing. Overall throughput, of course, drops when an Accelerator Node fails, but there is no service disruption or loss of data. In larger HYDRAstor configurations, the load of the failed Accelerator Node can be spread across all remaining Accelerator Nodes to minimize any perceived performance impact. In its default setting, HYDRAstor's patent-pending DRD technology ensures that the failure of three disks is tolerated without any loss of either data or access to that data. In 2x4 configurations (2 Accelerator Nodes, 4 Storage Nodes) or larger, even an entire Accelerator or Storage Node can be lost without any impact to the data or the access to it.

DRD can be simplistically defined as an enhanced RAID capability that allows the user to dial in the level of protection (e.g. the number of simultaneous device failures that can occur without impacting data availability) desired, although it really goes far beyond that because it can provide much greater resiliency with less overhead than conventional RAID implementations. It is NEC's position that RAID 5 and 6 implementations are not sufficient to address data availability and reliability issues in storage environments where storage capacity optimization technologies are in use.

RAID 5 was designed to ride through any single drive failure, but incurred a pretty substantial write penalty and a 20% cost overhead. Data on a failed disk would be transparently recovered in the background (called a rebuild) while data continued to be available to end users. When disks were relatively small in capacity, rebuild times were not as much of a complicating factor for RAID 5 configurations, but as disk capacities grew, so did the amount of data that had to be transferred to return drive configurations to normal operation. With the size of today's 500GB and larger drives, rebuild times became so long that the bit error rate started to pose significant risks. The typical bit error rate of a 500GB SATA drive is 1 in 10^14, which means that a read error can be expected for every 12.5TB of data read. In a standard RAID 5 configuration with 5 disks, the likelihood of encountering an uncorrectable read error during a rebuild is fairly high. This is in fact what led to the development of RAID 6, a RAID configuration which protects against dual concurrent drive failures by introducing a second parity drive. While RAID 6 was an improvement over RAID 5 in terms of resilience, the increased capacity overhead of RAID 6 increases the likelihood of read errors during rebuilds, increases rebuild times, and introduces additional performance overhead. As disk drives get denser, these problems with RAID 6 will only get worse.

To address these issues, NEC developed DRD. In its default configuration of resiliency level 3, DRD protects data from any three concurrent disk failures without the RAID write penalty, without incurring performance degradation during rebuilds, without long rebuild times, and with less storage overhead than RAID 6.

Briefly, here's how DRD works. DRD initially takes each chunk of data that a write is broken down into, and breaks it into multiple data fragments, with the number of fragments depending upon the resiliency level defined by the administrator (it will always be 12 minus the resiliency level). DRD then computes the required number of parity fragments (a resiliency level of 3 creates 3 parity fragments), with all parity fragment calculations based solely on the data fragments; this is all done in main memory, and no data must ever be read from disk a second time for DRD to complete any operations, including rebuilds. A total of 12 fragments are always written; at the default setting of 3, DRD writes 9 data fragments and 3 parity fragments, while with a setting of 4 DRD would write 8 data fragments and 4 parity fragments. These 12 fragments are then written across as many Storage Nodes as possible in a configuration. If there are more than 12 Storage Nodes in a system, then the data fragments are written across 12 nodes on each write, while if there are fewer than 12 Storage Nodes in a system, DRD will distribute the fragments on different disks within a node to maximize resilience, but at any rate always distributes fragments to achieve maximum data resiliency given the available hardware. At resiliency level 3, a chunk of data can be re-created using any 9 out of the 12 fragments, allowing the system to simultaneously lose as many as 3 fragments without impacting data availability. At resiliency level 3, DRD incurs 25% overhead (compared to RAID 5's 20% and RAID 6's 33%) but offers 200% more resiliency than RAID 5 and 50% more than RAID 6.
DRD calculations are done by the Storage Nodes, with each node performing the DRD calculations only for the data that it writes.
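
The fragment arithmetic just described reduces to a few lines, shown in the sketch below alongside the unrecoverable-read-error arithmetic for a 1-in-10^14 bit error rate. The RAID group sizes in the comments are inferred from the overhead figures the paper cites (a 5-disk RAID 5 group and a 6-disk RAID 6 group), and the rebuild workload is an assumption for illustration.

    # DRD fragment arithmetic as described above: 12 total fragments, data
    # fragments = 12 - resiliency level, recoverable from any 12 - level fragments.
    import math

    TOTAL_FRAGMENTS = 12

    def drd_layout(resiliency_level: int):
        data = TOTAL_FRAGMENTS - resiliency_level
        parity = resiliency_level
        overhead = parity / TOTAL_FRAGMENTS      # share of written capacity used for parity
        return data, parity, overhead

    for level in (3, 4):
        data, parity, overhead = drd_layout(level)
        print(f"DRD level {level}: {data} data + {parity} parity fragments, "
              f"{overhead:.0%} overhead, tolerates {parity} concurrent failures")
    # Comparison points cited above: RAID 5 (5-disk group) = 20% overhead, 1 failure;
    # RAID 6 (6-disk group, inferred from the 33% figure) = 33% overhead, 2 failures.

    # Unrecoverable-read-error arithmetic for a 1-in-10^14 bit error rate: assume a
    # 5-disk RAID 5 rebuild that re-reads the four surviving 500GB drives (2TB read).
    bits_read = 2.0 * 8e12                       # 2TB expressed in bits
    expected_errors = bits_read * 1e-14          # ~0.16 expected read errors
    print(f"Chance of a read error during that rebuild: {1 - math.exp(-expected_errors):.0%}")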

NEC's claim that DRD does not suffer performance degradation during the rebuild process is based on the way data is retrieved. When an Accelerator Node requests a chunk, it requests it from one of the many Storage Nodes on which its associated fragments reside. With the default DRD setting, the Storage Node takes the first nine fragments it has been provided to recreate the chunk; if any of these fragments are parity fragments, then the actual fragment is reconstructed in RAM before being passed back to the Accelerator Node. This avoids the many re-reads necessary to reconstruct data in conventional RAID implementations, and is how HYDRAstor is able to maintain I/O performance even when one or more disks or Storage Nodes are being rebuilt.

HYDRAstor's distributed hash tables also support high availability. Centralized hash tables can represent a single point of failure which can make data unrecoverable. HYDRAstor distributes its hash tables across all Storage Nodes - even though each node is only responsible for actively handling a small portion of the hash table during regular operation - so that even the failure of multiple nodes does not result in the loss of any hash information.

HYDRAstor also supports its own replication facility called RepliGrid. RepliGrid is licensed on particular Accelerator Nodes, with those nodes handling all replication tasks for the entire HYDRAstor. RepliGrid uses asynchronous, IP-based replication and minimizes bandwidth requirements by only sending unique, capacity optimized data across the WAN. RepliGrid can be configured to replicate at either the Accelerator Node level or at the file system level. Support for N-to-1 configurations allows replication to occur either from multiple Accelerator Nodes to one Accelerator Node or from multiple grids to one grid.

Combined, these features support a solid high availability story. Nodes of either kind can be added, removed, replaced, or shut down for maintenance without impacting HYDRAstor's ability to service applications and keep data available. Because of the way it lays data down across independent components, DRD maintains the same high level of performance for data access even during a drive rebuild. HYDRAstor itself is built to sustain system-level availability in the fault tolerant class (99.999%).

Platform data management tools. HYDRAstor offers tools to manage data for both persistence and reliability, offers storage capacity optimization technologies to minimize the amount of raw capacity required to store a given amount of data, and employs a self-managing approach which obviates the need for administrators to perform storage provisioning tasks as system configurations change. We've already talked about DRD and how it provides high availability, and it should be clear how DRD's capabilities support data persistence in the face of multiple failures. DynamicStor includes built-in data verification capabilities that also support data reliability. As chunks are created by the Accelerator Nodes, a hash is calculated for each chunk. HYDRAstor uses a very powerful algorithm for its hashing calculations; the odds that a given hash value is not unique are less than 1 in 2^160.

DataRedux performs data de-duplication at the sub-file level, using variable length windows, and then compresses the data using modified Lempel-Ziv algorithms prior to storing it. Both the Accelerator and Storage Nodes participate in this process. As data comes into the Accelerator Nodes, it is broken into chunks and unique hash values are created. If the hash value is one that HYDRAstor has already seen, then the new data chunk is discarded and a reference pointer is stored instead. If the chunk turns out to be one that HYDRAstor has not seen before, then it compresses it and writes it directly to disk (using the DRD algorithm). DataRedux effectively establishes a global de-duplication repository that can be used against any data coming in from any of the Accelerator Nodes, regardless of whether they have been configured into one or more virtual systems. With this single de-duplication repository scaling up to 16.5GB/sec of throughput and potentially tens of petabytes of usable capacity, it is arguably the most scalable storage capacity optimization engine in the industry.

One of the most onerous tasks associated with storage system administration is provisioning. In conventional systems, there are RAID controllers, volume managers, and RAID groups to configure. LUN, volume, and file system sizes must be pre-determined, and monitored to ensure that sufficient capacity has been allocated to meet storage requirements. Although it may be hard to believe, to initialize a HYDRAstor system you just need to connect the Accelerator Nodes to your Ethernet (rack mounted models of the HYDRAstor are already pre-connected into the HYDRAstor grid), power them up, set up an administrator account, and create at least one share on each of the Accelerator Nodes - it's not necessary to associate the shares with any volumes, real or otherwise. All other tasks are handled by DynamicStor. The Storage Nodes automatically discover each other, detect the DRD resiliency level, and start setting up what needs to be created for a new storage system. As resources are added, DynamicStor discovers them, determines their type, and adds them to the resource pool.

Applications will interact with specific Accelerator Nodes based on where shares reside; in this way, specific application workloads can be associated with specific Accelerator Nodes, although all shares on all Accelerator Nodes are running data de-duplication against a global repository. DynamicStor automatically recovers any shares on a failed Accelerator Node elsewhere in the system without any knowledge on the part of the application or end users that a recovery has transparently occurred in the background. The total capacity of all the Storage Nodes is combined into a single large pool of virtual storage that is equally accessible by any and all Accelerator Nodes. This storage pool can be increased or decreased in size on-line without any impact to applications and without administrators having to do anything other than physically plug in the new resources. It's clear that the provisioning models that have been historically associated with monolithic storage architectures were not going to be cost-effective in environments approaching 100TB and beyond, and the type of intelligent self-management demonstrated by HYDRAstor is what's needed.
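
To make the DataRedux write path described above concrete, here is a minimal sketch of an inline de-duplication flow: chunks are fingerprinted, looked up in a global repository, stored as a reference on a hit, and compressed and written on a miss. SHA-1 and zlib are stand-ins chosen for this sketch, since the paper only specifies a 160-bit hash and modified Lempel-Ziv compression without naming the exact algorithms, and the DRD write itself is elided.

    # Inline de-duplication sketch; SHA-1 and zlib are illustrative stand-ins.
    import hashlib
    import zlib

    global_repository = {}    # fingerprint -> compressed chunk (the global dedup store)
    file_recipes = {}         # file name -> ordered list of fingerprints

    def ingest(name: str, chunks):
        recipe = []
        for chunk in chunks:
            fingerprint = hashlib.sha1(chunk).hexdigest()
            if fingerprint not in global_repository:        # chunk never seen before
                # compress, then write (via DRD in the real system)
                global_repository[fingerprint] = zlib.compress(chunk)
            recipe.append(fingerprint)                       # duplicates only add a reference
        file_recipes[name] = recipe

    def restore(name: str) -> bytes:
        return b"".join(zlib.decompress(global_repository[fp]) for fp in file_recipes[name])

    ingest("backup-monday.img", [b"A" * 4096, b"B" * 4096])
    ingest("backup-tuesday.img", [b"A" * 4096, b"C" * 4096])   # the repeated chunk is deduplicated
    assert restore("backup-monday.img") == b"A" * 4096 + b"B" * 4096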

Ability to integrate with existing processes. Backup is probably the area in which this is most an issue, although archive and DR need to be considered as well as other common secondary storage applications. Existing backup processes are likely to be built around tape-based architectures, using enterprise backup software like Symantec NetBackup, IBM Tivoli Storage Manager, EMC NetWorker, or any of a number of other enterprise backup software products. HYDRAstor supports NFS and CIFS interfaces, allowing it to be configured as a NAS-based disk-as-disk target for all major backup software products. To set HYDRAstor up as a disk-based backup target, you would need to enable and configure the disk target capability of your existing backup software, then define one or more shares as backup targets. The big disadvantage of working with disk-as-disk backup targets has been the additional provisioning work that is not required with tape, but as pointed out earlier, HYDRAstor's unique self-managing capabilities do not require any of this provisioning either. Once the backup targets have been defined, all of the other backup policies, including schedules, can be leveraged as is, without change. This change to a disk-based backup target brings with it a number of key advantages over tape-based backup in the areas of backup window, RPO, RTO, and recovery reliability.

And given HYDRAstor's replication capability, it now also makes it much easier to establish and maintain a DR facility, since backups can be easily replicated to remote locations. HYDRAstor's DataRedux minimizes the amount of bandwidth necessary to replicate data, helping to keep any additional network investments required to enable replication to a minimum. Using replication to distribute data to remote locations offers significant advantages over physical tape transport, including faster distribution to maintain more aggressive RPOs at the remote site, better security, and less administrative overhead (i.e. loading, labeling, packing, and shipping of tapes). Data can be migrated from HYDRAstor to physical tape at any point in the process where a customer may want to do so, allowing HYDRAstor to co-exist with existing tape infrastructure or entirely replace it, depending on customer requirements.

Many enterprises are still using tape for archiving purposes, but evolving e-discovery requirements are making active (disk-based) archiving more and more compelling. Grid architecture platforms like HYDRAstor that leverage SATA-based disk and storage capacity optimization technologies like DataRedux can realistically offer usable capacity at under $1/GB, and data protection technologies like DRD provide the kind of data reliability required for active archiving. Within a single physical HYDRAstor platform, virtual systems can be defined, although they will perform all data de-duplication against the same global repository. Separate backup and archive systems could be established within a single HYDRAstor grid, and then replication could be used to easily create and maintain a remote-site DR strategy for both systems.

Industry standards-based. Accelerator and Storage Nodes are based on commodity hardware platforms, running Intel-compatible processors, standard SAS or SATA disks, and the Linux operating system. NEC based HYDRAstor nodes on industry standards like Intel, Linux, NFS, and CIFS as a conscious choice to keep costs down, make it easy to ride the performance and capacity curves for CPU and disk, and minimize integration costs. While the grid architecture is innovative and very germane to addressing enterprise backup, archiving, and DR issues, the crown jewel of the solution is clearly the DynamicStor software environment. NEC's product strategy will allow them to focus their development resources on DynamicStor, which is the key to maintaining their differentiation, while leveraging the huge development resources of Intel, Seagate, and other commodity hardware providers to stay competitive on the hardware side.

Basing the platform around industry standards has another positive implication for end users as well. The potential longevity of scale-out NAS platforms like HYDRAstor forms a strong part of the value proposition, since enterprise backup, archive, and DR require long term solutions. HYDRAstor's ability to accommodate multiple technology generations without fork-lift upgrades provides a form of future proofing that will keep expansion costs low over long time horizons. Since customers can increase both performance and capacity independently, there is little risk that the right mix cannot be achieved to meet a variety of storage platform requirements cost-effectively, a claim that cannot be made for monolithic architectures. Providing access to the HYDRAstor platform through industry standard interfaces like NFS and CIFS, instead of the proprietary interfaces used by some vendors, particularly in the archive space, requires no custom coding for integration and entails less risk for customers since they enjoy broad industry support.

Affordability. The HYDRAstor features that support affordability have already been discussed, but a key metric to evaluate here is the effective $/GB for usable capacity. Enterprise-class tape libraries today support a $/GB of between $.20 and $.30 (for configurations in the 100TB+ range and assuming 2:1 compression), while at list price HYDRAstor comes in at under $.50/GB (assuming the entry level 1x2 with 24TB of raw capacity, the default DRD setting of 3, and a 20:1 storage capacity optimization ratio). While a detailed cost comparison is beyond the scope of this paper, we can provide some guidance on how to evaluate costs in your particular situation:

TAPE COSTS
- Tape acquisition costs (including tape drives and media replacement costs over time)
- Savings due to compression (assume 2:1, although it may be lower in your environment)
- 7x24 support costs (for most products these are in the 18-22% of list cost range per year)
- Energy costs (cost per kilowatt hour varies across the nation between roughly $.05 and $.15)
- Administrative overhead (time spent loading, labeling, packing, shipping, and retrieving tapes)
- Archive (may or may not apply depending on whether an archive is in use)

HYDRAstor COSTS
- Storage platform acquisition costs for base capacity (raw capacity plus DRD data protection)
- Savings due to storage capacity optimization (ratios vary between 10:1 and 20:1)
- 7x24 support costs (12% of list price)
- Energy costs (cost per kilowatt hour varies depending on location)
- Administrative overhead (ongoing management)
- Archive (may or may not apply depending on whether an archive is in use)

In general, tape will have a lower acquisition cost for raw storage capacity than disk and may have energy costs that are 5% - 15% of the costs of disk; administrative costs will be greater than disk but may not represent more than 10% of the overall TCO over a 5 year period; for archiving, we're seeing Fortune 1000 companies frequently assume a minimum of $500K for discovery costs against tape for a single large lawsuit, and $1,500 every time a tape has to be searched.

For HYDRAstor, initial platform costs for the HS8-2000 (assuming the entry level configuration) provide a $/GB for raw capacity of $7.50 at list ($180K for 24TB). We also have to factor in the 25% overhead imposed by the default level of logical data redundancy for DRD, putting base capacity (raw capacity after logical data redundancy has been accounted for) at $9.38/GB. Storage capacity optimization ratios vary, but our research shows that for backup, most enterprises where storage capacity optimization is in use are achieving data reduction ratios between 10:1 and 20:1. A worst case assumption of 10:1 would put HYDRAstor usable capacity at $.94/GB, while a 20:1 assumption would put it at $.47/GB. Energy costs for HYDRAstor may be 20 times higher than for tape, but note that over a 5 year period this may be a difference of only one hundred thousand dollars or so for configurations that are supporting 100TB - 200TB of data, an amount that pales in comparison to acquisition costs for these types of configurations.

It's difficult to put a monetary value on backup window, RPO, RTO, and recovery reliability advantages that would apply across the board, so we'll leave that to the end user. If archive is being used as well though, economics can shift heavily in favor of disk. Using the $500K guideline estimate for tape-based discovery costs for a single large lawsuit, we know that e-discovery costs against an active archive are roughly 1/10 of that. This means that the savings associated with a single lawsuit for a large enterprise can easily outweigh tape's energy savings over a 5 year period for archives in the multi hundred terabyte range. How many lawsuits is your company involved with in one year? You may argue that you don't intend to use disk for archive so these savings may not apply, but we would counter: with savings like these achievable, why wouldn't you consider an active archive?

There's one other key point to consider with active archives. We made the point earlier that 70% to 90% of the data that most customers keep on primary storage does not require primary storage's high performance.
But if you are not going to keep this data on primary storage, where would you put it? It's not true archive data in the sense of data that would very rarely be accessed, so you clearly would not want to put it on tape. Much of this data still needs to be on-line, but does not require the low latency access provided by primary storage. If an archive is disk-based, however, it is easy to make a case to move this data. Archiving software can ensure that the data is still referenceable by on-line applications, and an active archive will make it easily retrievable, albeit at rates slightly slower than primary storage. Certain data, like medical images, retained legal documents, and expense reports, may be born archive and can be stored immediately on an active archive. If you could migrate even 50% of the data sitting on primary storage right now to an active archive, how much could you save? You would lower not only primary storage costs, but backup costs as well, since you'd be backing up less data on a regular basis. Any primary storage cost savings would be calculated at primary storage's much higher $/GB rates. Realistically, this could mean that in the first year or two after you create an active archive, you could need to buy very little or possibly no primary storage to accommodate growth, and then you would buy primary storage in successive years at a much slower rate.
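
The sketch below reproduces the $/GB arithmetic quoted above, following the paper's own figures, and then puts a rough number on the 50% migration question using prices we have assumed for illustration rather than quoted figures.

    # $/GB arithmetic, following the paper's own figures for the entry-level HS8-2000.
    list_price = 180_000.0                       # $180K list for 24TB raw
    raw_per_gb = list_price / 24_000.0           # $7.50/GB raw
    base_per_gb = raw_per_gb * 1.25              # the paper's 25% DRD adjustment -> $9.38/GB
    for ratio in (10, 20):
        print(f"{ratio}:1 optimization -> ${base_per_gb / ratio:.2f}/GB usable")   # $0.94 and $0.47

    # Rough answer to the 50% migration question; the primary $/GB and estate size
    # below are assumptions for illustration, not figures from the paper.
    primary_gb = 100 * 1000                      # assumed 100TB of primary storage
    primary_cost_per_gb = 15.00                  # assumed primary storage $/GB
    archive_cost_per_gb = base_per_gb / 10       # worst-case usable $/GB from above
    migrated_gb = primary_gb * 0.50
    saving = migrated_gb * (primary_cost_per_gb - archive_cost_per_gb)
    print(f"First-order saving from migrating 50%: ${saving:,.0f}")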

Key HYDRAstor Differentiators


Data de-duplication appliances targeted at secondary storage environments are one way to address storage capacity optimization needs, but they can suffer from limited scalability and performance relative to HYDRAstor. When dealing with large data stores, requirements like high single stream performance that scales across multiple nodes, a global de-duplication repository, an ability to support tens of petabytes of usable capacity, and a graceful upgrade path that can accommodate the addition of new generations of technology without downtime are much better met by scale-out solutions like HYDRAstor than by standalone appliances.

Scale-out NAS platforms based on grid architectures and targeted specifically as secondary storage are available from several vendors, and they all support a highly available storage platform scalable into the petabyte range, but HYDRAstor is differentiated from these competitors in ways that are meaningful to end users. Key HYDRAstor differentiators include:

- A grid architecture platform that not only supports high end performance and scalability but includes an integrated storage capacity optimization capability (DataRedux) that relies on sub-file, variable length window data de-duplication to bring costs down well under $1/GB for usable capacity

- Rated at 16.5GB/sec, HYDRAstor's DataRedux boasts the industry's highest throughput storage capacity optimization engine - the fastest in-line data de-duplication vendors today can offer roughly 1GB/sec of single stream throughput, while the fastest post-processing vendors claim up to 9.6GB/sec; because HYDRAstor uses an in-line approach, it minimizes capacity requirements and gets the data into capacity optimized form, ready for replication to remote sites, faster than post-processing approaches can
- A data protection scheme (DRD) that is unique in the industry in allowing customers to dial in the level of data resiliency desired, but that in its default configuration provides 50% better protection than RAID 6 with less overhead for solid, cost-effective data reliability - and this number can be much higher if the issues associated with bit error rates and drive rebuilds are taken into account; one of the standout features of DRD is its ability to maintain high performance data access even during disk rebuilds, a feature unique to the way DRD lays data out on disk

- A unique distributed hash table that supports linear scalability even as HYDRAstor approaches its maximum configurations, and unlike competitive products actually increases performance as more storage capacity is added

Taneja Group Opinion


Taneja Group is a proponent of leveraging disk for secondary storage applications where it makes sense. In backup, disk is a clear win over tape for near term backup requirements, addressing operational issues such as backup window, RPO, RTO, and recovery reliability in ways that tape just cannot. For large enterprises, evolving e-discovery requirements are quickly turning economics in active archiving's favor. In many environments, tape will continue to play a role for many years to come, but it is increasingly being relegated to specific roles such as a second backup tier (for older data that is not accessed very frequently yet is still too recent to archive) or a final DR repository (export to tape after data has been replicated to a remote site). For large enterprises that are dealing with lawsuits on a regular basis, tape is becoming a less and less viable medium for archiving purposes on a pure economic basis, not to mention a speed of response one.

Massively scalable storage platforms based on grid architectures provide a compelling disk-based alternative for secondary storage applications. For these platforms to meet the functional and cost requirements, though, there are clearly certain features they must provide:

- A high performance platform scalable into the petabyte range
- Linear scalability combined with modular configuration flexibility that will allow the right mix of performance and capacity to be achieved for any secondary storage application
- High availability that rivals that of fault tolerant platforms (99.999%)
- A data protection scheme that overcomes the deficiencies of conventional RAID and meets data reliability requirements with high performance
- Storage capacity optimization technologies that significantly reduce the amount of raw storage capacity required to store data
- Intelligent self-management that obviates the need for provisioning as configurations evolve over time

These solutions must integrate easily into existing environments, and must do so cost-effectively; a target metric to shoot for here is to provide usable storage capacity under $1/GB.

HYDRAstor allows customers to incorporate disk into backup, archive, and DR environments at very aggressive price points and provides them the option to either integrate with existing tape-based infrastructure or to replace it. Boasting a number of unique, patent-pending technologies such as DRD, DataRedux, and a highly scalable, distributed approach to hashing, HYDRAstor has found a strong early following among large enterprises seeking to incorporate disk into their secondary storage environments. It is already setting the functional bar that other scale-out NAS platforms need to emulate to meet enterprise requirements for backup, archive, and DR, and we expect to see other vendors evolve their products along the path being set by NEC. NEC's status as a Fortune 200 company offers other large enterprises the sense of security many of them look for in their storage suppliers, giving it an additional edge over many other scale-out NAS offerings. If you've thought about how you might integrate disk into your secondary storage environment but have been put off by costs, HYDRAstor's aggressive features and price points may be what's necessary to put you over the top.

NOTICE: The information and product recommendations made by the TANEJA GROUP are based upon public information and sources and may also include personal opinions both of the TANEJA GROUP and others, all of which we believe to be accurate and reliable. However, as market conditions change and not within our control, the information and recommendations are made without warranty of any kind. All product names used and mentioned herein are the trademarks of their respective owners. The TANEJA GROUP, Inc. assumes no responsibility or liability for any damages whatsoever (including incidental, consequential or otherwise), caused by your use of, or reliance upon, the information and recommendations presented herein, nor for any inadvertent errors which may appear in this document.

