By
Enterprise Services
Wipro Infotech Delhi
Confidentiality
This document is being submitted to Adobe Pvt. Ltd. by Wipro Infotech, with the explicit understanding that the
contents would not be divulged to any third party without prior written consent from Wipro Infotech.
Wipro Confidential
An Overview of VCS
VCS is an architecture-independent, availability management solution focused on proactive management of service
groups, or application services. It is equally applicable in simple shared disk, shared nothing, or SAN configurations
of up to 32 nodes and compatible with single node, parallel, and distributed applications. Cascading and multidirectional application failover is supported, and application services can also be manually migrated to alternate
nodes for maintenance purposes. VCS provides a comprehensive availability management solution designed to
minimize both planned and unplanned downtime.
Designed with a modular and extensible architecture to make it easy to install, configure, and modify, VCS can be
used to enhance the availability of any application service with its fully automated, application-level fault detection,
isolation and recovery. All fault monitors, implemented in software, are themselves monitored and can be
automatically restarted in the event of a monitor process failure. Monitored service groups and resources can either
be restarted locally or migrated to another node and restarted. A service group may include an unlimited number of
resources. Various off-the-shelf agents are available from VERITAS to monitor specific applications such as file
services, RDBMS and enterprise resource planning, or the product can be customized to monitor any hardware
component or software-based service. An SNMP agent allows VCS to generate SNMP traps so that resource state
changes can be communicated to any SNMP-based management tool such as HP OpenView, CA Unicenter, Tivoli
TME, and others. Although applicable to any application service that requires higher availability, VCS is most often
deployed in mission-critical enterprise environments such as file serving, database, and enterprise resource planning
(ERP).
Overview of SPFSHA
SPFSHA extends VERITAS File System and VERITAS Volume Manager to support shared data in a SAN
environment. Using SANPoint Foundation Suite HA, multiple servers can access shared storage and files,
transparently to the applications and concurrently with each other. SANPoint Foundation Suite HA incorporates
VERITAS Cluster Server to provide cluster failover capabilities as well as internode communications across the
servers.
Course Overview
System availability continues to receive wide attention as many organizations grow their critical business
applications on Local Area Networks (LANs). The primary reason to address availability issues is the cost of
downtime. You can establish an annual cost of downtime for every system and measure the benefits obtained by
solving the problems that cause a system to fail. You can then select among the various available options to improve
server uptime, based upon a reasonable cost and effort as well as a reasonable return on your investment.
Course Objectives
The overall goal of this learning experience is to provide a basic understanding of the concepts related to HA. This
course will build the foundation on which to base more advanced courses on VERITAS HA products. During this
course you will:
Define the general concept of high availability.
Identify HA storage management solutions at the disk level, such as hardware Redundant Array of
Independent Disks (RAID) and volume management software.
Describe the concept of clustering and investigate common clustering configurations.
Identify HA methods at the network level, such as redundant network connections and redundant networks.
Describe VERITAS HA products.
Lessons
Defining High Availability
What is High Availability?
Describe the concept of high availability.
The Need for High Availability
Identify the need for increased data availability in today's computer environments.
Types of Faults and Failures
Identify different types of faults and failures that can occur.
High Availability vs. Disaster Planning
Differentiate between the goals and functions of high availability and disaster planning.
High Availability vs. Fault Tolerance
Differentiate between the goals and functions of high availability and fault tolerant availability methods.
High Availability Planning
Identify guidelines to consider when planning a high availability solution.
The Layered Approach to Availability
Describe the concept of the layered availability approach.
Online Storage Management
General RAID Levels
Describe the various RAID levels.
Software RAID vs. Hardware RAID
Identify the advantages and disadvantages of hardware and software RAID.
Defining a Volume
Describe volumes and identify the advantages of using them.
VERITAS Volume Management: Virtual Objects
Describe the relationships between the virtual objects in VERITAS Volume Manager.
VERITAS Volume Management: Volume Layouts
Identify the volume layouts that are available in VERITAS Volume Manager.
VERITAS Volume Management: Hot Relocation
Describe the hot relocation process.
High Availability Clustering
Fault Resilient Clustering Concepts
Describe the general characteristics of fault resilient HA solutions.
Asymmetric 1 to 1 Configurations
Describe an asymmetric 1 to 1 configuration.
Defining HA
HA is defined as the ability of a system to perform its function without interruption for an extended period of time.
HA can be accomplished through special HA software and the implementation of redundant system and network
hardware components. In a properly designed HA system, all of the possible failure modes for critical applications,
network connections, and data storage have been identified and the recovery times have been analyzed. Therefore,
you can determine how long the system will be down for any given failure. You can scale an HA system to an
appropriate level so that in the event of a fault or failure, the system can recover to a known, consistent state in an
acceptable period of time.
Availability Statistics
System availability is expressed as a measure of the period of time that the system is functioning normally. This
involves the determination of the various component failures to factor into the overall rate of system failure. It is
important to note that there is a distinction between component failure statistics and system failure statistics. The
basic availability equation is used to determine the availability of a specific system component:
Availability = MTBF / (MTBF + MTTR)
Where MTBF is the mean time between failures and MTTR is the mean time to repair.
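The basic availability equation can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the sample figures (10,000 hours MTBF, 4 hours MTTR) are assumptions chosen for the example and are not taken from any vendor data.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return component availability as a fraction between 0 and 1.

    Implements Availability = MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails on average every 10,000 hours
# and takes 4 hours to repair.
example = availability(10_000, 4)
print(f"Availability: {example:.4%}")
```

Note that shortening the MTTR improves availability just as lengthening the MTBF does, which is why recovery time receives so much attention in HA design.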
MTBF
MTBF = Total actual operating time / Total number of failures
The MTBF is an expected future performance based on the past performance of a system component. If the
component is new, there is no historical data to base the MTBF upon. When determining the MTBF of new
hardware components, you should obtain these statistics from the particular vendor. However, these statistics may be
inflated or have been calculated using a high standard deviation.
MTTR
The MTTR is an average amount of time that it takes to repair a component, based upon actual statistical data. When
calculating the MTTR, you can consider only the amount of on-site time that it takes to recover the component from
the time when it failed. You can also calculate the MTTR including factors such as unavailability, response time, and
travel time, in addition to on-site repair time. Many aspects of MTTRs are out of your control. For example, you
may need to replace a specific part of a server. If this part is not currently in your stock, you will have to purchase
the replacement component from the vendor or some other source and rely solely upon their ability to deliver the
part in a short amount of time.
System Availability
As stated earlier, to calculate the availability of a system, you must take into account the availability of the
individual system components such as servers, disks, I/O cards, etc.
The more hardware components a system contains, the more likely it is that one of them will fail. This is where having
many instances of a single type of component affects the availability of the system.
For example, suppose a new disk has a quoted manufacturer's MTBF of 600,000 hours, which indicates that a disk
would be expected to fail once in about 70 years. This MTBF is calculated rather than based on actual failures. In
addition, this MTBF value considers only the disk mechanism itself. If you factor in the power supply, controller,
and fans, the MTBF becomes about 150,000 hours or about 17 years. If your system utilizes 500 disks, the individual
failure rates add up, and the MTBF for 500 disks is 150,000 hours divided by 500, or only 300 hours. This means
that the system would fail about 30 times a year due to disk failure. The best way to reduce the frequency and
duration of failures that affect the system is to employ a properly designed HA solution.
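The disk arithmetic above can be reproduced with a short Python sketch. It assumes that all components are identical and fail independently at a constant rate, which is the simplification implied by dividing the MTBF by the number of disks. The function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def fleet_mtbf(component_mtbf_hours: float, count: int) -> float:
    """MTBF of a group of identical, independent components.

    Failure rates (1 / MTBF) add up, so the combined MTBF is the
    single-component MTBF divided by the number of components.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return component_mtbf_hours / count


def expected_failures_per_year(mtbf_hours: float) -> float:
    """Average number of failures in one year of continuous operation."""
    return HOURS_PER_YEAR / mtbf_hours


disk_group_mtbf = fleet_mtbf(150_000, 500)                   # 300 hours
failures = expected_failures_per_year(disk_group_mtbf)       # about 29.2
print(f"MTBF for 500 disks: {disk_group_mtbf:.0f} hours")
print(f"Expected disk failures per year: {failures:.1f}")
```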
Availability          Downtime    Downtime per year    Downtime per week
99.999% (5 nines)     0.001%      5.25 minutes         6 seconds
99.9999% (6 nines)    0.0001%     31.5 seconds         0.6 seconds
For most environments, 99% availability is adequate. This level of availability results in less than 2 hours of
downtime a week. It is important to consider when this downtime is taking place. For example, if a typical business
system is down on a Sunday between 3 A.M. and 4:30 A.M., this is more acceptable than if the system is down
during a Tuesday afternoon between 2 P.M. and 3:30 P.M. It is also important to consider when 100% availability is
required. For example, suppose that a brokerage house performs all stock transactions between 9 A.M. and 4 P.M. on
weekdays. If the system is designed for 99% availability, it is crucial that you ensure that no system downtime
occurs during the most critical business hours.
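The downtime figures quoted for each availability level follow directly from the percentage. The following Python sketch shows the conversion; the function and constant names are illustrative, and a 365-day year is assumed.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600      # 604,800 seconds
SECONDS_PER_YEAR = 365 * 24 * 3600    # 31,536,000 seconds


def downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Return the downtime allowed in a period at a given availability level."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_percent) / 100.0 * period_seconds


# 99% availability: about 1.68 hours of downtime per week.
weekly_hours = downtime_seconds(99.0, SECONDS_PER_WEEK) / 3600
# 99.999% availability: about 5.26 minutes of downtime per year.
yearly_minutes = downtime_seconds(99.999, SECONDS_PER_YEAR) / 60
print(f"99% -> {weekly_hours:.2f} hours/week")
print(f"99.999% -> {yearly_minutes:.2f} minutes/year")
```

The calculation says nothing about when the downtime occurs, so it should be read together with the business-hours considerations described above.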
HA Requirements
There is a trade-off in costs and benefits for various degrees of availability. When designing a system with HA
requirements, the initial requirements often include:
System availability at all times with no perceived loss of service
No loss of data at any time
Maintenance and upgrade activities do not interfere with operational service
Without being properly informed of the total costs and consequences of implementing a system that satisfies these
requirements, it is natural to want an HA solution to satisfy these lofty goals. 100% data availability is an ideal
concept, but the implementation of this solution results in very high monetary, performance, and complexity costs.
As you move from lower to higher degrees of availability, the costs can increase dramatically. In most environments,
a step from one level to the next (for example, from 99% to 99.9%), increases costs 5 to 10 times.
It is ultimately the responsibility of an HA system designer to determine:
The degree of availability that is actually required by the users, as opposed to what they might like to have
The technological alternatives that can be used to meet these requirements
All the costs
Not only monetary, but also performance degradation and system complexity.
Time to Recovery
Most enterprise environments feature a wide range of systems ranging from on-line e-commerce systems to less-critical
human resources (HR) systems. It is important to analyze the required recovery times of the various systems
in your enterprise by performing a business impact analysis. Currently, there is a lot of work being done in this area
by organizations in the analyst community such as the Gartner Group, META Group, and International Data
Corporation (IDC), among others. Typically, you can break the systems in an enterprise down into five basic levels
based upon the time to recovery requirements:
Safety critical
Mission critical
Levels of Availability
It may be acceptable for a task-critical system to have a recovery time in terms of days or tens of hours. For
these systems, basic availability, such as a traditional offline tape backup, is sufficient. If you lose your HR system,
you can simply recover it from a secondary copy of the data from tape and bring the system back online in a number
of hours. If the recovery process takes a day or two, the downtime will not significantly impact users.
For business and mission critical systems, you should use a different availability approach. For example, rather than
restoring from an offline copy, you can recover from an online copy of the data. You can utilize technology such as
replication, snapshots, and mirroring to reduce the time to recovery to tens of minutes up to a couple of hours. For
even more critical systems, you can reduce the recovery time to minutes or seconds by using clustering.
There is a wide range of data availability possible. However, this range can be divided into four common levels of
availability:
Basic Availability
Increased Availability
Continuous Availability
The most advanced level of availability is continuous availability (CA). CA is defined as an environment
explicitly designed to eliminate all computer downtime, both unplanned and planned. Today, CA environments
approach 99.999% availability, or just over 5 minutes of downtime per year. However, it is important to note that the
costs for CA systems can range into the millions of dollars. Examples of industries that most often utilize continuous
availability solutions include air-traffic control and stock-floor trading systems.
Advanced CA architectures usually feature proprietary, large, hardware-based fault tolerant host machines. In a fault
tolerant system, hardware is designed to perform self-checking diagnostics and all of the main hardware components
are physically duplicated. Self-checking resides on each major hardware component and detects and isolates failures
instantly. This ensures that erroneous data cannot corrupt other system areas. In fact, some diagnostics built into
specific CA architectures often automatically detect problems before they lead to failures, and initiate service
instantaneously should a component fail. Component duplication enables normal processing to continue even in the
event of a hardware failure, with no performance degradation. Safety critical systems would require a CA solution.
Simplified Management
In a typical data center environment, you may have a number of servers that have different operating systems:
Solaris, HP-UX, Windows NT, and Windows 2000. The system might feature a number of network connections as well,
such as traditional Ethernet or SCSI connections, fibre-type connections, or storage area networking (SAN). There
are also various types of storage devices in the system.
Today's enterprise is a very heterogeneous environment. In addition, almost every environment is growing at
tremendous speeds. This requires more disk storage, different types of storage, more systems, applications,
networks, etc. How do you manage all of this? The second part of the high availability equation is simplifying
management.
It is important for the enterprise to feature an infrastructure that enables scalability required by future demands. In
addition, you need to implement a solution that enables you to perform automated tasks, virtualization, and
consolidation across all systems in the enterprise, no matter the platform or operating system.
Data must be available round-the-clock. Regular business hours do not exist in our contemporary global
marketplace. For example, an Internet service organization must account for customers arriving at their site at any
hour of any day.
In addition, most modern organizations depend on networking technologies. More and more business-critical data is
available through networks. Access to corporate information and shared knowledge has significantly improved
productivity and communication. However, this reliance on network solutions has also helped to create a need for an
HA solution to ensure that the network is resilient to failures.
These new requirements are creating greater demands on the corporate information technology (IT) infrastructure. In
the past, it was acceptable to expect 99% system availability. This would equate to about 3.65 days of downtime per
year. However, the growth of e-commerce, greater demands for customer service, an increased dependence on
network solutions, and a competitive global market have contributed to a need for high availability. When you
consider the new costs of downtime, 99% system availability is no longer acceptable.
Defining a Failure
A failure is a deviation from the expected behavior of the system. In other words, if the system is specified to
exhibit a certain functionality, and in the process of execution the system produces a discernibly different
functionality, a failure has occurred. Functionality is typically delivered from the system by running a procedure to
execute the logic contained in software that runs in a hardware environment containing client and server machines,
networks, data storage, and other peripherals. Failures can occur in any of these software procedures or the hardware
in a system.
Failures can be classified as either:
Reproducible
A prescribed set of actions leads to the observance of the failure in a predictable manner.
Hard reproducible failures occur identically on every execution with the same input.
Soft reproducible failures might occur with a certain probability on identical executions.
Nonreproducible
The appearance of the failure is random, or is linked to a root cause outside of the environment for which
the system was engineered.
HA solutions are useful in dealing with soft reproducible and non-reproducible failures, but less effective with hard
reproducible failures.
Some contemporary computer systems have the ability to reconfigure a failed component without
requiring a reboot of the system. This capability helps increase data availability in the event of CPU or
memory failures.
Backplane failure
Backplanes, or motherboards, are the large circuit boards that contain sockets for expansion cards and
provide the general pathway for all data in a computer system. These components rarely fail, but they can
fail in some circumstances.
In addition to the expansion sockets, active backplanes also contain logical circuitry that performs CPU
operations.
Passive backplanes contain almost no computing circuitry. Usually, the CPU is inserted on an additional
card in the passive backplane. Passive backplanes enable you to repair failed components or upgrade to
new components easily.
Disk failure
Disks are very prone to failures because of the high rotation speed, low tolerances, and
possible problems with the controller boards or cables.
Tape devices have similar characteristics to disks, such as high speeds and low
tolerances, and are also failure-prone. In addition, tape devices are repeatedly stopping
and starting. These actions may strain or overheat the motor and lead to motor failure.
Fan failure
Fans can also fail. If the cooling system fails, the effects may not be immediately visible, but over time
excessive heat can cause a system to act unpredictably or fail at an undesirable point in the future.
Power supplies often have the worst MTBF of all components in a system. They can fail instantly or over time. The
gradual failure of a power supply can cause intermittent failures or unpredictable behavior in other components.
Failures in power supplies are caused by excessive switching, varying voltage levels, or other stress-inducing
factors.
NICs are expansion boards inserted into a computer so the computer can be connected to
a network. If a NIC fails, network connectivity is lost. It may be difficult to detect a NIC
failure. A simple method used to detect these failures is to initiate some network traffic,
and then use a command to display the packet count. If the packet count does not
increase, it is likely that the NIC has failed. Redundant NICs should be used to avoid any
loss of network connectivity due to the failure of a single NIC.
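The packet-count check described above can be sketched in Python. This is an illustration under stated assumptions: the counter is read from the Linux sysfs file /sys/class/net/<interface>/statistics/rx_packets, and the caller supplies its own way of generating traffic (for example, pinging a known host). The function names are hypothetical and not part of any VERITAS product.

```python
from typing import Callable


def read_rx_packets(interface: str) -> int:
    """Read the received-packet counter for an interface (Linux sysfs only)."""
    path = f"/sys/class/net/{interface}/statistics/rx_packets"
    with open(path) as counter_file:
        return int(counter_file.read().strip())


def nic_appears_failed(read_count: Callable[[], int],
                       generate_traffic: Callable[[], None]) -> bool:
    """Return True if the packet count does not increase after traffic is sent.

    read_count returns the current packet counter; generate_traffic
    triggers some network activity that should produce incoming packets.
    """
    before = read_count()
    generate_traffic()
    after = read_count()
    # No increase suggests, but does not prove, that the NIC has failed.
    return after <= before
```

On a Linux host, you could call nic_appears_failed(lambda: read_rx_packets("eth0"), send_ping), where "eth0" and send_ping are placeholders for your interface name and traffic generator. A single check can give a false alarm on a quiet network, so a practical monitor would repeat it several times before declaring a failure.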
Environmental Failures
Failures can be caused not only by internal system components, but also by environmental forces beyond your
control. Such environmental failures include:
Power fluctuations or outages
The most common external source of system failures is power outages. Things to consider in determining
the probability of power outages should include, but not be limited to, the history of local utility companies
providing uninterrupted service, the history of brownouts due to high temperatures in your area, and your
proximity to major power sources.
Cooling system failure
The environmental cooling system can fail. This would cause massive overheating of some of your crucial
system components. You should analyze your facilities' environmental control system for the likelihood of
failure.
Structural failure
Structural failures can range from the complete collapse of the building's support structure, to the structural
failure of a single computer rack or cabinet.
Natural disasters
Natural disasters are occurrences such as fires, floods, earthquakes, typhoons, or hurricanes. Considerations
when identifying your organization's susceptibility to natural disasters can include geographic location, the
topography of the land, or the history of natural disasters in the local area.
Disaster Recovery
The ability to recover from a natural disaster, such as fire, flood, or earthquake, in a short time is called disaster
recovery. The results of these disasters include physical damage to systems, and loss of data, telecommunication,
power and work space. Recovery time might be as short as minutes or hours, or as long as days or weeks.
Frequently, recovery time is directly related to how quickly a system can be accessed, the data and applications
loaded, and telecommunications restored. Redundancy is usually provided by a duplicate system at a different,
geographically remote site.
The need for disaster recovery solutions and services is increasing rapidly. The costs after a disaster become quite
large, and the need to restore access to systems and applications becomes very important. Two important issues
associated with disaster recovery are the replication of data and the currency of the data. The replication of data to
an alternate site is affected by distance and speed of the links. The slower the replication method, the more data will
be lost in case of disaster. The impact of a disaster on the organization must be assessed along with the cost of
providing for disaster recovery.
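The relationship between link speed and potential data loss can be illustrated with a deliberately simple Python model. It assumes asynchronous replication with a constant write rate at the primary site and a constant link throughput, which real environments rarely have. The function name and the sample figures are hypothetical.

```python
def unreplicated_backlog_mb(write_rate_mb_s: float,
                            link_rate_mb_s: float,
                            duration_s: float) -> float:
    """Estimate the data (in MB) not yet copied to the remote site.

    If the application writes faster than the link can replicate,
    the difference accumulates as a backlog. That backlog is the
    data at risk if a disaster strikes at the end of the period.
    """
    if min(write_rate_mb_s, link_rate_mb_s, duration_s) < 0:
        raise ValueError("rates and duration must not be negative")
    return max(0.0, (write_rate_mb_s - link_rate_mb_s) * duration_s)


# Hypothetical peak hour: 10 MB/s of writes over a link that sustains 8 MB/s.
at_risk = unreplicated_backlog_mb(10.0, 8.0, 3600)
print(f"Data at risk after one hour: {at_risk:.0f} MB")
```

In this model a faster link, or a lower write rate, reduces the backlog to zero, which matches the observation that slower replication methods expose more data to loss.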
HA and DR Together
When defining a disaster recovery plan, your top priority is your mission critical applications. Mission critical
applications are required to be available at all times. While backup and recovery technology ensures data protection,
recovery methods are often not fast enough to handle the recovery of data used by mission critical applications. HA
methods such as replication and clustering can help to ensure immediate recovery whenever a disaster strikes.
This example illustrates a plan that addresses HA and disaster planning. By implementing a configuration with
cluster management and replication concerns, you can effectively maintain and protect your end-users and
information. You can manage clusters and move applications running at a primary site to a secondary site, while
maintaining access to critical information through the continuous replication of data between sites. Clustering and
replication are covered in more detail later in this course.
Does not use off-the-shelf hardware and software. The hardware usually has very specific software hooks,
and applications need to be written to a specific API of the operating system.
Requires a specially modified operating environment.
Features inherent redundancy management.
A SPOF is any system component that will cause downtime if it fails. It is important to investigate the path
of execution in your system and identify all the weak links in the chain. If one link breaks, the whole
system fails no matter how well constructed the rest of the system. You should walk through the whole
process from your servers and disk storage, to the applications, through the network, and to the client
systems. Common SPOFs are:
the computer system
Clustering software can be used to link several systems that can each run each other's applications
in time of failure of the primary system.
disks
Disk mirroring or disk array technology can be used to protect data.
host adapters and cables
Host adapter failures can be protected against with operating system features and redundant host
adapters.
networks
Networking has many hardware components; each could be a SPOF. The key to eliminating
failures within the network is understanding the topologies being used, understanding the failure
points within those topologies, and removing these failure points from the network. There are
many hardware and software products which provide increased network availability.
electrical power
Uninterruptible Power Supplies (UPSs) and/or multiple power sources can protect against
electrical power failures.
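The walk-through of common SPOFs above can be approximated with a simple inventory check. The following Python sketch assumes you have recorded how many independent instances of each component exist along the path from the servers to the clients; any component with fewer than two instances is reported as a SPOF. The inventory contents and the function name are hypothetical.

```python
from typing import Dict, List


def find_spofs(component_counts: Dict[str, int]) -> List[str]:
    """Return the components that have no redundant counterpart."""
    return sorted(name for name, count in component_counts.items() if count < 2)


# Hypothetical inventory for one application path.
inventory = {
    "computer system": 2,   # two clustered nodes
    "disks": 2,             # mirrored
    "host adapters": 1,
    "network paths": 1,
    "power sources": 2,     # utility power plus UPS
}

for component in find_spofs(inventory):
    print(f"Single point of failure: {component}")
```

A count of two or more only shows that a spare exists. You still need to confirm that the system actually fails over to the spare, which is one reason regular testing is recommended.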
Ensure the security of the system.
Prevent data corruption and unauthorized access to your system. Security is an issue that is often
overlooked in discussions of HA management, because it does not immediately reduce the impact of
failure. However, it is important to any HA solution. The management center must be secured, so that only
authorized personnel have access to it. The management systems, or applications, also need to support
some type of user authentication, such as userIDs and passwords. Secure transactions between the
applications and the system components are available through Remote Procedure Calls (RPC), or some
other protocol. Secure communications should be implemented whenever possible in an HA configuration.
Centralize similar applications and services on large servers.
This is not a hard-and-fast rule; sometimes many small machines running single instances
of databases or single applications can be a more appropriate configuration. In general, by consolidating
similar applications and services on centralized large servers, you can significantly reduce the complexity
of your system, the number of backups that are required, and the number of components that can fail.
Automate repetitive tasks.
You can significantly reduce the number of hours required for hands-on operations by automating the tasks
that are standard and repetitive. In addition, automation reduces the number of possible faults due to human
error, such as mis-typed commands or accidental file deletion. You can also update and maintain consistent
policies and procedures in a single centralized location.
Perform a thorough test initially and perform additional tests on a regular basis once the system is up
and running.
Before you deploy your HA solution, you should perform a thorough test that investigates every level in
your system, from hardware component faults to network failures. The testing environment should mimic
the eventual system environment as closely as possible: the same hardware, software, services, networks,
configurations, loads on the system, and users.
It is also important to perform tests on a regular basis once the system is up and running. Systems and
environments are constantly changing. The only way to ensure that the system can react to failures
appropriately at any given point in time is to test the system throughout its life cycle.
There is a direct correlation between the reliability of your hardware and your overall system reliability.
It is important that you obtain appropriate reliability data from hardware vendors, such as mean time
between failure figures that are proven and realistic. There are several other hardware considerations in
addition to reliability, such as ease of repair, ease of access, cost, compatibility, and storage capacity. It is
also a good idea to purchase spare hardware for components that may be more prone to fail than others.
In a study published by the IEEE (Institute of Electrical and Electronics Engineers), hardware
failures are the cause of only 10% of total system downtime. As much as 30% of all downtime is prescheduled, and most of this time is required due to the inability of system tools to permit online
administration of systems. Another 40% of downtime is due to software errors. Some of these errors are as
simple as a database running out of space on disk and stopping its operations as a result. Any
comprehensive HA solution has to be able to deliver application and information availability in the event of
any cause of downtime.
Examples of planned downtime include those times when the system is shut down to add additional
hardware, upgrade the operating system, rearrange or repartition disk space, or clean up logfiles and
memory. If you implement an effective HA strategy, you can significantly reduce the amount of planned
downtime. For example, you can provide for backups, maintenance, and upgrades while the system is up
and running. You can also reduce the time required to perform the tasks that can only be done while the
system is down.
To simplify the management of a complicated system, you can break the system down to four basic layers:
Application layer
Storage management layer
Storage network infrastructure layer
Data storage layer
In order to reduce the time of recovery, you need to determine the level of service that each layer must deliver to the
others. You can also simplify management by logically organizing the resources in each layer.
Application Layer
The application layer is the direct interface between the system and the client machines, such as a database, e-mail,
or a custom application. HA solutions feature functionality that provides continuous service or access to
applications in the event of a fault or failure in a transparent manner. Throughout this course, it is important to view
your system from an application-based viewpoint. In other words, no matter what components, structure, policies,
and procedures are implemented in your HA solution, the most important consideration at any time is to minimize
the impact of a fault or failure on the users' ability to access data through the application or service. HA issues
involved in this layer include clustering, application-level failovers, simplified management of large server farms,
common availability management, and replication of data to multiple sites.
The storage management layer refers to the method by which the server manages the storage devices or disks. This
management is performed by the building blocks of an HA solution: volume management and a journaling
filesystem.
Volume Management
Often, the first step taken towards increasing a system's availability is to enable software-based redundancy of disks,
or software RAID. Software RAID defines a logical volume. A volume is a logical object on which filesystems are
written or to which databases write their data. Software RAID is often packaged with volume management software.
Journaling Filesystem
A file system is a collection of directories organized into a structure that enables you to locate and store files. All
information processed is eventually stored in a file system. When a system or server fails, the file system can be
damaged or lost, and a tape backup is then required to restore it. A journaling filesystem
journals the changes to the file system structure (and occasionally data). If the system crashes and is rebooted, the
journal is replayed to ensure the correctness of the file system structure. Data recovery is dependent upon the
specific application. For example, recovery of an Oracle database would require the use of Oracle log files.
This layer refers to storage network connectivity. This layer is becoming more and more of a concern to the modern
enterprise. Originally, most environments simply connected a server to a storage device through a SCSI connection.
Now, organizations are using other more advanced network connection technology such as Fibre Channel
technology and storage area networking (SAN). Rather than viewing this layer simply as a server connecting to a
piece of storage, you should consider multiple paths between servers and storage. You need to investigate the
possibility of implementing some sort of network redundancy to ensure that if you lose an access route between the
system and storage, there is another access path available.
In addition to application availability, managing storage effectively, and ensuring that you maintain network
connectivity, there are data availability concerns in the storage pool itself. In this layer, you can enable online,
dynamic reconfiguration of the storage pool. You need to account for growth and scalability. No matter how many disk
arrays you have, you will inevitably require more in the future. You should also consider the capacity management
aspects of your storage devices and determine how to optimize storage space across common disk hardware.
must be determined based on data access patterns and determining an appropriate trade-off between cost and
performance.
RAID-0 (striping)
This RAID level features disk striping, but no redundancy of data. In this configuration,
a collection of data is divided into small chunks, each of which is written to a separate disk in the
array. This RAID level supplies performance acceleration at no increased storage cost,
because individual disks can perform concurrent write operations. RAID-0 offers no
increase in data availability. In fact, if implemented by itself, RAID-0 decreases overall
data availability. This is because for the stripe to function, every disk in the array
must be functioning. Any failure of an individual disk in the stripe will result in
the inability to perform any read or write operations in the entire stripe. RAID-0 would be
an option for applications requiring high bandwidth such as video production and editing,
image editing, or pre-press applications.
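The effect of striping on availability can be sketched with a short calculation. The example below assumes that disks fail independently and uses a hypothetical per-disk availability of 99.9%; neither assumption comes from this document.

```python
def stripe_availability(disk_availability: float, disk_count: int) -> float:
    """Availability of a RAID-0 stripe, where every disk must be up at once.

    Assumes independent disk failures, which is a simplification.
    """
    if not 0.0 <= disk_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if disk_count < 1:
        raise ValueError("a stripe needs at least one disk")
    return disk_availability ** disk_count


if __name__ == "__main__":
    # 0.999 is an illustrative per-disk figure, not a measured value.
    for disks in (1, 2, 4, 8):
        print(disks, "disk(s):", round(stripe_availability(0.999, disks), 6))
```

Each disk added to the stripe multiplies in another chance of failure, so the stripe as a whole is always less available than any single member disk.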
RAID-1 (mirroring)
RAID-1 requires at least double the disk capacity of RAID-0. In RAID-1, the data is
replicated on a separate disk, or multiple disks. No disk striping occurs. Every byte on
one disk is copied block-for-block on a separate disk that acts as a peer and is completely
in sync with the original disk. In the event of an individual disk failure, the other disk
maintains operation without any service interruption. RAID-1 provides the highest
performance for redundant storage, because it does not require read-modify-write cycles
to update data, and because multiple copies of data can be used to accelerate read-intensive
applications. However, resyncing or creating a new RAID-1 copy requires time
and a significant amount of I/O. Therefore, a disadvantage to RAID-1 is the fact that
write performance may suffer. RAID-1 requires 100% additional disk capacity for each
mirror copy. Therefore, another major disadvantage is cost. This RAID level would be
recommended for applications requiring increased availability such as accounting,
payroll, or other financial applications.
RAID-2 features disk striping. This RAID level detects errors that occur and determines
which part is in error by using error checking and correcting (ECC) information. RAID-2
detects 2-bit errors and corrects 1-bit errors on the fly. Each data disk has its Hamming
Code ECC information recorded on ECC disks. On read operations, the ECC code
verifies data or corrects single disk errors. You need a high ratio of ECC disks to data
disks with smaller word sizes. It has no clear advantages over RAID-3, and is not used in
practice.
RAID-3 uses disk striping in a parallel fashion with each virtual disk block distributed
across all the disks in the array except for one that stores the parity check. The parity disk
permits the regeneration and rebuilding of data in the event of a disk failure. In RAID-3,
the stripe depth of an N+1 array is equal to 1/N virtual blocks and each disk drive must be
on its own separate I/O channel. For example, if the virtual block size for a 4+1 set is
512 bytes, then the stripe depth is 128 bytes (512/4). The RAID volume can only process
one disk I/O at a time. All I/O operations access all disks, because the bytes are
distributed across multiple disks (parallel transfer). For this reason, RAID-3 is best for
applications that are single stream bandwidth-oriented. This would not be a good choice
for a database server, because databases tend to read and write smaller blocks. RAID-3 is
likely to perform significantly better in a controller-based implementation.
RAID-4 uses large stripes, and dedicates one drive to storing parity information. RAID-4
is very similar to RAID-3. The major difference is that, whereas in a RAID-3 array the
stripe and logical block sizes are equal, RAID-4 arrays implement variable stripe sizes. In
RAID-4, the stripe depth is an integer multiple of the virtual block size. This means that
multiple virtual blocks can be placed within a single stripe in the RAID-4 array. You can
read records from any single drive. This enables you to take advantage of overlapped I/O
for read operations. Since all write operations have to update the parity drive, no I/O
overlapping is possible for writes. RAID-4 offers no advantage over RAID-5. As with RAID-3, a
RAID-4 implementation is ideal for systems performing large file transfers. It does not
perform well when used in applications that require small file writes at high I/O rates.
RAID-5 removes a possible bottleneck on the parity drive by rotating parity across all
drives in the set. RAID-5 requires at least three and usually five disks for the array. All
read and write operations can be overlapped. RAID-5 stores parity information but not
redundant data. Recovery from a RAID-5 disk failure requires a complete read of all the
disks in the stripe. The recovery process can be time-consuming and system performance
will suffer during recovery. This is the most complex and versatile of the basic RAID
architectures. RAID-5 is best suited for file and application servers, database servers in a
data warehousing environment, Web servers, and e-mail servers.
The performance overhead for writes can be substantial in a RAID-5 configuration,
because a write can involve much more than simply writing to a data block. A write can
involve reading the old data and parity, computing the new parity, and writing the new
data and parity.
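The read-modify-write sequence described above can be sketched in a few lines of Python. This illustrates XOR parity in general, not the code of any particular RAID product; the function names and the 4-byte blocks are assumptions made for the example.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Parity after updating one data block in a stripe.

    The old data's contribution is removed from the parity and the
    new data's contribution is added; both steps are an XOR.
    """
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


if __name__ == "__main__":
    # Hypothetical stripe: three data blocks plus one parity block.
    data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
    parity = xor_blocks(xor_blocks(data[0], data[1]), data[2])

    new_block = b"\xff\x00\xff\x00"
    # Two reads (old data, old parity) are needed before anything is written.
    new_parity = small_write_parity(data[1], parity, new_block)
    # Two writes follow: the new data block and the new parity block.
    data[1] = new_block

    recomputed = xor_blocks(xor_blocks(data[0], data[1]), data[2])
    print("parity matches a full recompute:", new_parity == recomputed)
```

A single logical write therefore costs two reads and two writes, which is the overhead the paragraph above refers to.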
RAID Level Variations
RAID-6
RAID-6 is similar to RAID-5, but with additional independently computed check data. It includes a second
parity scheme that is distributed across different drives and offers very high fault-tolerance. Currently, there
are very few commercial examples of RAID-6.
RAID-7
RAID-7 includes a real-time embedded operating system as a controller, caching data through a high-speed
bus, and other characteristics of a stand-alone computer. This RAID level is not common.
RAID Combinations
RAID-01 is a mirrored RAID-1 pair made from two RAID-0 stripe sets. It is configured
by creating two RAID-0 sets and adding RAID-1. If you lose a drive on one side of a
RAID-01 array, then lose another drive on the other side of that array before the first side
is recovered, you will suffer complete data loss. It is also important to note that in the
event of a single disk failure, all drives in the surviving mirror are involved in rebuilding
the entire damaged stripe set. Performance is severely degraded during
recovery unless the RAID subsystem allows adjusting the priority of recovery. However,
shifting the priority toward production will lengthen recovery time and increase the risk
of the kind of catastrophic data loss mentioned earlier.
Example of RAID01 Failure
In this example, if Disks A and D fail, all the disks are unavailable.
RAID-10 (striped mirrors)
RAID-10 is a stripe set made up from a number of mirrored pairs. Only the loss of both drives in the same
mirrored pair can result in any data loss and the loss of that particular drive is 1/Nth as likely as the loss of
some drive on the opposite mirror in RAID-01. Recovery only involves the replacement drive and its
mirror so the rest of the array performs at 100% capacity during recovery. Since only the single drive needs
recovery bandwidth, requirements during recovery are lower and recovery takes far less time, reducing the
risk of catastrophic data loss. The performance of RAID-10 and RAID-01 is identical, but they have
different levels of data integrity.
Example of RAID10 Failure
In this example, Disk A fails first and all the other disks remain available. If Disk D then fails, only the data on Disks A and D
is offline.
RAID-53
RAID-53 offers an array of stripes in which each stripe is a RAID-3 array of disks. This offers higher performance
than RAID-3, but at much higher cost, and requires at least 5 drives.
In hardware RAID, the management operations required to implement the RAID disk array occur within the disk
array itself. The host system does not perform the operations, but an interface program runs on the host system that
enables you to monitor the disk management operations. A hardware RAID operation creates a logical unit (LUN)
that can be monitored regardless of the operating system of the host system. It is often a safe assumption that the
disks are managed properly, no matter what the RAID level, within a hardware RAID configuration. A hardware
RAID system is basically a specialized, single-purpose system that features a controller that does nothing but
aggregate storage disks, stripe and mirror data across these disks, and calculate parity.
The advantages of using hardware RAID over software RAID:
Increased performance on the host system
Performance is increased because the disk management operations are off-loaded onto the disk array. For
example, in a mirrored controller-based configuration, the host would need to pass only one write request
through the disk driver and across the I/O bus, where the controller would decompose it into two separate
writes.
Enhanced features
Hardware RAID manufacturers often add enhanced functionality to their hardware. Such enhancements
include additional internal memory in the disk array, and the abilities to replicate data over a WAN, share
specific disks between multiple host systems, and lock out other hosts while a single host is accessing a
disk. Enterprise class hardware RAID systems also often include redundant power supplies and cooling
fans.
Efficiency
Hardware RAID systems tend to be very efficient because they feature hardware that is only concerned
with performing RAID operations. The RAID controller does not have to concern itself with graphical user
interfaces (GUIs) and other aspects of a general purpose operating system.
The disadvantages of using hardware RAID over software RAID:
Dependence on one RAID hardware vendor
Every RAID manufacturer uses a different management interface, and once you have familiarized yourself with one,
it will be difficult to switch to a different vendor.
Inability to combine disks from different arrays into a single array
This will create another single point of failure (SPOF) in the system.
Hard to resize LUNs
In most cases, once a LUN is full, you cannot simply increase the size of the LUN to accommodate new
data. You have to destroy the original LUN, create another larger LUN, and then restore the original data to
the new LUN.
Hardware limits on the number and size of LUNs
Often, RAID vendors will enforce some hardware limits that might limit your ability to configure your
system for optimal performance.
Cost
Hardware RAID is more expensive than software RAID.
No inter-box protection
A specific RAID controller has no visibility to other RAID boxes or storage devices.
A duplexed RAID-1 array can sometimes be implemented in software RAID, but not in hardware RAID,
depending on the controller. Building redundant layouts using disks with separate connections to the host
can enhance availability, eliminating the single points of failure introduced by non-redundant host
connections.
coupled with the configuration flexibility added by the inclusion of software-based RAID. Combining hardware and
software RAID solutions offers several key benefits:
Increased availability
Many hardware RAID solutions retain single points of failure (SPOFs), allowing data to become
unavailable if a non-disk component of the array fails. When software RAID is used to build configurations
that incorporate hardware RAID units in separate arrays, many of these vulnerabilities can be eliminated.
Increased performance
A single hardware RAID controller may present a bottleneck to data access because of limited array bus
and host-to-array bandwidth, as well as CPU cycles needed for parity calculations. Efficient controller-based algorithms can be combined with multiple host connections and supplementary software RAID
processing to increase bandwidth and throughput.
Improved manageability
The limited set of configuration options and the static configuration utilities for hardware RAID
subsystems may make initial setup seem simpler than setting up a software RAID configuration. However,
after running the system, the configuration may need to be modified to reflect the actual I/O pattern of the
applications. With a controller-based setup, this is usually achieved by backing up the data, reconfiguring
the array, and reloading the data. This requires interruption of data access. The on-line reconfiguration
capabilities of most software RAID solutions can be used to enhance the performance monitoring, tuning,
and reconfiguration of hardware RAID, simplifying administration while increasing uptime and
performance.
Defining a Volume
The basis for any volume management solution is a volume. This topic defines a volume and identifies the
advantages of using volumes to manage storage.
What Is a Volume?
Volumes enable an application to view a number of disks as a single logical unit, no matter the physical location of
the disks. This volume has the performance, reliability, and other attributes of its individual components. Each
volume records and retrieves data from one or more physical disks. Volumes are accessed by file systems, databases,
or other applications in the same way that physical disks are accessed. Volumes are also composed of other virtual
objects that are used to change the volume configuration. Volumes and their virtual components are called virtual
objects. Volumes can be used to perform administrative tasks on disks without interrupting applications and users.
Advantages of Volumes
There are several advantages to using volumes:
Ability to combine RAID levels
Volumes enable you to combine any number of different RAID levels. For example, if the important
consideration is cost, maybe you would implement a RAID-5 solution. Alternatively, if you require very
high performance, then you might use striped mirrors.
Scalability
Virtual volumes also offer the flexibility to grow the storage capacity without disrupting the system. Instead
of taking the server off-line or physically moving data from point A to point B, you can simply add more
storage to the volume.
Increased performance and failure tolerance
You can combine enterprise RAID and JBOD and your system will feature the advantages of both. You can
take advantage of a hardware controller and the flexibility of host-based volume management.
Overview of VxVM
VxVM provides easy-to-use online disk storage management for computing environments. Traditional disk storage
management often requires that systems be taken offline at a major inconvenience to users. VxVM provides the
tools to improve performance and ensure data availability and integrity. VxVM also enables you to dynamically
configure disk storage while the system is active.
The connection between physical objects and VxVM objects is made when you place a physical disk under VxVM
control. VxVM creates virtual objects and makes logical connections between the objects. The virtual objects are
then used by VxVM to perform storage management tasks. VxVM objects include:
VxVM disks
When you place a physical disk under VxVM control, a VxVM disk is assigned to the
physical disk. Each VxVM disk corresponds to at least one physical disk. A VxVM disk
typically includes a public region where user data is stored, and a private region where
VxVM internal configuration information is stored.
Disk groups
A disk group is a collection of VxVM disks. You group disks into disk groups for
management purposes, such as to hold the data for a specific application or set of
applications. For example, data for accounting applications can be organized in a disk
group called "acctdg".
A disk group configuration is a set of records with detailed information about related
VxVM objects, their attributes, and their connections. Disk groups are configured by the
system administrator and represent management and configuration boundaries. You can
create additional disk groups as necessary. Disk groups allow you to group disks into
logical collections. Disk groups enable high availability, because a disk group and its
components can be moved as a unit from one host system to another. Disk drives can be
shared by two or more hosts, but accessed by only one host at a time. If one host crashes,
the other host can take over the failed host's disk drives, as well as its disk groups.
Subdisks
A subdisk is a set of contiguous disk blocks. VxVM allocates disk space by dividing a
VxVM disk into one or more subdisks. Each subdisk represents a specific portion of a
VxVM disk, which is mapped to a specific region of a physical disk. A VxVM disk can
contain multiple subdisks, but subdisks cannot overlap or share the same portions of a
VxVM disk.
Plexes (mirrors)
VxVM uses subdisks to build virtual objects called plexes (or mirrors). A plex consists of
one or more subdisks located on one or more physical disks. To organize data on the
subdisks to form a plex, use the following methods:
Concatenation
Striping (RAID-0)
Mirroring (RAID-1)
Striping with parity (RAID-5)
Volumes
A volume consists of one or more plexes, each holding a copy of the data in the volume.
Due to its virtual nature, a volume is not restricted to a particular disk or a specific area of
a disk. The configuration of a volume can be changed by using the VxVM user interfaces.
Configuration changes can be done without causing disruption to applications or file
systems that are using the volume. For example, a volume can be mirrored on separate
disks or moved to use different disk storage. A volume can consist of up to 32 plexes,
each of which contains one or more subdisks. A volume must have at least one associated
plex that has a complete copy of the data in the volume with at least one associated
subdisk.
VxVM Object Relationships
VxVM virtual objects are combined to build volumes. The virtual objects contained in volumes are:
VxVM disks
Disk groups
Subdisks
Plexes
Volume Manager objects have the following connections:
VxVM disks are grouped into disk groups.
One or more subdisks (each representing a specific region of a disk) are combined to form plexes.
A volume is composed of one or more plexes.
In this example, a disk group has two VxVM disks. One disk has a volume with one plex and two subdisks. The
other disk has a volume with one plex and a single subdisk.
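The object relationships listed above can be modelled as a small set of Python data classes. This is only an illustrative sketch: the class names, the object names (`disk01`, `vol01`, and so on) and the block counts are assumptions made for the example, and the code does not reflect how VxVM stores its configuration. Only the disk group name `acctdg` is taken from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Subdisk:
    name: str
    disk: str     # VxVM disk the subdisk is carved from
    offset: int   # first block on that disk
    length: int   # number of contiguous blocks

    def overlaps(self, other: "Subdisk") -> bool:
        """True if both subdisks claim some of the same blocks on one disk."""
        if self.disk != other.disk:
            return False
        return (self.offset < other.offset + other.length
                and other.offset < self.offset + self.length)


@dataclass
class Plex:
    name: str
    subdisks: List[Subdisk] = field(default_factory=list)


@dataclass
class Volume:
    name: str
    plexes: List[Plex] = field(default_factory=list)


@dataclass
class DiskGroup:
    name: str
    disks: List[str] = field(default_factory=list)
    volumes: List[Volume] = field(default_factory=list)

    def all_subdisks(self) -> List[Subdisk]:
        return [sd for v in self.volumes for p in v.plexes for sd in p.subdisks]

    def validate(self) -> None:
        """Check that subdisks sit on member disks and never overlap."""
        subdisks = self.all_subdisks()
        for sd in subdisks:
            if sd.disk not in self.disks:
                raise ValueError(f"{sd.name} uses a disk outside {self.name}")
        for i, first in enumerate(subdisks):
            for second in subdisks[i + 1:]:
                if first.overlaps(second):
                    raise ValueError(f"{first.name} overlaps {second.name}")


def example_group() -> DiskGroup:
    """The two-disk example from the text, with made-up names and sizes."""
    vol1 = Volume("vol01", [Plex("vol01-01", [
        Subdisk("disk01-01", "disk01", 0, 1000),
        Subdisk("disk01-02", "disk01", 1000, 500),
    ])])
    vol2 = Volume("vol02", [Plex("vol02-01", [
        Subdisk("disk02-01", "disk02", 0, 2000),
    ])])
    return DiskGroup("acctdg", ["disk01", "disk02"], [vol1, vol2])


if __name__ == "__main__":
    group = example_group()
    group.validate()
    print(group.name, "holds", len(group.all_subdisks()), "subdisks")
```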
Concatenated Layout
A concatenated volume layout maps data in a linear manner onto one or more subdisks in a plex. Subdisks do not
have to be physically contiguous and can belong to more than one VxVM disk. Storage is allocated completely from
one subdisk before using the next subdisk in the span. Data is accessed in the remaining subdisks sequentially until
the end of the last subdisk. For example, if you have 14GB of data, then a concatenated volume can logically map
the volume address space across subdisks on different disks. The addresses 0GB to 8GB of volume address space
map to the first 8-gigabyte subdisk, and addresses 8GB to 14GB map to the second 6-gigabyte subdisk. An address
offset of 12GB therefore maps to an address offset of 4GB in the second subdisk.
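The address arithmetic in this example can be written as a small lookup. The sketch below is a generic illustration of linear (concatenated) mapping, not VxVM code; the function name and the use of whole gigabytes as the unit are assumptions for the example.

```python
from typing import List, Tuple


def map_concatenated(offset_gb: float, subdisk_sizes_gb: List[float]) -> Tuple[int, float]:
    """Map a volume offset to (subdisk index, offset inside that subdisk).

    Subdisks are filled one after another, so the volume offset is
    compared against the running total of the subdisk sizes.
    """
    if offset_gb < 0:
        raise ValueError("offset must not be negative")
    start = 0.0
    for index, size in enumerate(subdisk_sizes_gb):
        if offset_gb < start + size:
            return index, offset_gb - start
        start += size
    raise ValueError("offset is beyond the end of the volume")


if __name__ == "__main__":
    # The 14GB volume from the text: an 8GB subdisk followed by a 6GB subdisk.
    index, inner = map_concatenated(12, [8, 6])
    print("subdisk number", index + 1, "at offset", inner, "GB")
```

Index 0 is the first subdisk and index 1 the second, so an offset of 12GB resolves to the second subdisk at 4GB, matching the figure in the text.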
Concatenation removes the restriction on size of storage devices imposed by physical disk size. It also enables
better utilization of free space on disks by providing for the ordering of available discrete disk space on multiple
disks into a single addressable volume. In addition, large file systems can be created to reduce overall system
administration complexity. However, concatenation does not protect against disk failure. A single disk failure may
result in the failure of the entire volume.
Striped Layout
A striped volume layout maps data so that the data is interleaved, or allocated in stripes, among two or more
subdisks on two or more physical disks. Data is allocated alternately and evenly to the subdisks of a striped plex.
The subdisks are grouped into "columns". Each column contains one or more subdisks and can be derived from one
or more physical disks. To obtain the performance benefits of striping, each column within a striped volume should
not be allocated space from any disk used by any other column within that volume.
All columns must be the same size. The size of a column should equal the size of the volume divided by the number
of columns.
Data is allocated in equal-sized units, called stripe units, that are interleaved between the columns. Each stripe unit is
a set of contiguous blocks on a disk. The stripe unit size can be in units of sectors, kilobytes, megabytes, or
gigabytes. The default stripe unit size is 128 sectors (64K), which provides adequate performance for most general
purpose volumes. Performance of an individual volume may be improved by matching the stripe unit size to the I/O
characteristics of the application using the volume.
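The way stripe units are interleaved between columns can be sketched as follows. The code shows a plain round-robin mapping, which is an assumption about the general technique rather than a statement of VxVM's exact on-disk layout; the function name is hypothetical, and the default of 128 sectors is the stripe unit size quoted in the text.

```python
from typing import Tuple


def map_striped(volume_sector: int, columns: int, stripe_unit: int = 128) -> Tuple[int, int]:
    """Map a volume sector to (column index, sector offset within the column).

    Stripe units are handed out to the columns in turn: unit 0 goes to
    column 0, unit 1 to column 1, and so on, wrapping around afterwards.
    """
    if volume_sector < 0:
        raise ValueError("sector must not be negative")
    if columns < 2:
        raise ValueError("a striped layout needs at least two columns")
    if stripe_unit < 1:
        raise ValueError("stripe unit must be at least one sector")
    unit, within_unit = divmod(volume_sector, stripe_unit)
    row, column = divmod(unit, columns)
    return column, row * stripe_unit + within_unit


if __name__ == "__main__":
    # Hypothetical three-column volume with the default stripe unit size.
    for sector in (0, 127, 128, 256, 384, 390):
        print("volume sector", sector, "->", map_striped(sector, columns=3))
```

Because consecutive stripe units land on different columns, a large sequential transfer keeps all the columns busy at once, which is where the performance benefit of striping comes from.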
Mirrored Layout
By adding a mirror to a concatenated or striped volume, you create a mirrored layout. A mirrored volume layout
consists of more than one plex that is a duplicate of the information contained in a volume. Each plex in a mirrored
layout contains an identical copy of the volume data. If a physical disk fails and the plex on the failed
disk becomes unavailable, the system can continue to operate using the unaffected mirrors.
Although a volume can have a single plex, at least two plexes are required to provide redundancy of data. Each of
these plexes must contain disk space from different disks to achieve redundancy.
Volume Manager uses true mirrors, which means that all copies of the data are the same at all times. When a write
occurs to a volume, all plexes must receive the write before the write is considered complete.
Each plex in a mirrored configuration can have a different layout. For example, one plex can be concatenated and the
other plex can be striped. You should distribute mirrors across all types of hardware to prevent the loss of more than
one copy of the data in case of a single point of failure.
RAID-5 Layout
A RAID-5 volume layout has the same attributes as a striped plex, but includes one additional column of data that is
used for parity. Parity provides redundancy.
Parity is a calculated value used to reconstruct data after a failure. While data is being written to a RAID-5 volume,
parity is calculated by doing an exclusive OR (XOR) procedure on the data. The resulting parity is then written to
the volume. If a portion of a RAID-5 volume fails, the data that was on that portion of the failed volume can be
recreated from the remaining data and parity information.
RAID-5 volumes keep a copy of the data and calculated parity in a plex that is striped across multiple disks. Parity is
spread equally across disks. Given a 5-column RAID-5 where each column is 1G in size, the RAID-5 volume size
is 4G.
One column of space is devoted to parity, and the remaining four 1G columns are used for data.
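Both the capacity figure and the way lost data is recreated from parity can be checked with a short sketch. The code illustrates XOR parity in general and is not taken from VxVM; the function names and the 2-byte blocks are assumptions for the example.

```python
from functools import reduce
from typing import List


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(data_blocks: List[bytes]) -> bytes:
    """Parity of one stripe: the XOR of all its data blocks."""
    return reduce(xor_blocks, data_blocks)


def rebuild_block(surviving_blocks: List[bytes], parity: bytes) -> bytes:
    """Recreate the one missing data block from the survivors and the parity."""
    return reduce(xor_blocks, surviving_blocks, parity)


def raid5_usable_size(columns: int, column_size: int) -> int:
    """Usable size when one column's worth of space holds parity."""
    if columns < 3:
        raise ValueError("RAID-5 needs at least three columns")
    return (columns - 1) * column_size


if __name__ == "__main__":
    # Five 1G columns, as in the text.
    print("usable size:", raid5_usable_size(5, 1), "G")

    stripe = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]
    parity = compute_parity(stripe)
    lost = stripe[2]
    survivors = stripe[:2] + stripe[3:]
    print("rebuilt correctly:", rebuild_block(survivors, parity) == lost)
```

The rebuild works because XOR is its own inverse: combining the parity with every surviving block cancels their contributions and leaves only the missing block.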
The default stripe unit size for a RAID-5 volume is 32 sectors (16K). Each column must be the same length but may
be made from multiple subdisks of variable length. Subdisks used in different columns must not be located on the
same physical disk.
RAID-5 requires a minimum of three disks for data and parity. When implemented as recommended, an additional
disk is required for the RAID-5 log.
RAID-5 cannot be mirrored.
Stripe-Mirror (RAID-10)
This example illustrates a layered volume layout called a stripe-mirror layout. In this layout, VxVM creates
underlying volumes that mirror each subdisk. Each of these underlying volumes is used as a subvolume to create a
top-level volume that contains a striped plex of the data.
If two drives fail, the volume survives 4 out of 6 (2/3) times. In other words, the use of layered volumes reduces the
risk of failure by 50%.
If a disk fails in a stripe-mirror layout, only the failing subdisk must be detached, and only that portion of the
volume loses redundancy. When the disk is replaced, only a portion of the volume needs to be recovered, which
takes less time.
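The 4-out-of-6 figure can be reproduced by listing every possible two-drive failure. The sketch below assumes a four-disk configuration with hypothetical disk names A to D, which yields six two-drive combinations; the disk count and the grouping of the disks are assumptions, since the original illustration is not reproduced here.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical layout: disks A and B form one group, C and D the other.
GROUPS: List[FrozenSet[str]] = [frozenset("AB"), frozenset("CD")]
DISKS: List[str] = sorted(set().union(*GROUPS))


def stripe_mirror_survives(failed: FrozenSet[str]) -> bool:
    """Stripe-mirror: each group is a mirrored pair under a striped plex.

    The volume survives while every pair still has at least one disk.
    """
    return all(group - failed for group in GROUPS)


def mirror_stripe_survives(failed: FrozenSet[str]) -> bool:
    """Mirror-stripe: each group is a striped plex, and the plexes are mirrors.

    The volume survives while at least one striped plex has no failed disk.
    """
    return any(not (group & failed) for group in GROUPS)


def survival_count(survives: Callable[[FrozenSet[str]], bool]) -> Tuple[int, int]:
    """Return (surviving cases, total cases) over all two-disk failures."""
    outcomes = [survives(frozenset(pair)) for pair in combinations(DISKS, 2)]
    return sum(outcomes), len(outcomes)


if __name__ == "__main__":
    print("stripe-mirror:", survival_count(stripe_mirror_survives))
    print("mirror-stripe:", survival_count(mirror_stripe_survives))
```

Under these assumptions the stripe-mirror layout survives 4 of the 6 cases and the mirror-stripe layout survives 2 of the 6, so the number of fatal combinations drops from four to two.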
Mirror-Stripe (RAID-01)
This layout mirrors data across striped plexes. The striped plexes can be made up of different numbers of subdisks.
In the example, plexes are mirrors of each other; each plex is striped across the same number of subdisks. Each
striped plex can have different numbers of columns and different stripe unit sizes. One plex could also be
concatenated.
When you create a volume that is less than one gigabyte in size, a nonlayered mirrored volume is created by default.
Nonlayered, mirrored layouts are recommended if you are using less than 1GB of space, or using a single drive for
each copy of the data.
How Do Layered Volumes Work?
In a regular mirrored volume, subdisks originate from the disk media. In a layered volume, the subdisks originate
from underlying volumes. These subdisks are also called subvolumes. Subvolumes and subdisks are equivalent
objects in terms of constructing a volume. In a layered volume, only the top-level volume is accessible as a device
for use by applications.
Layered volumes tolerate disk failure better than non-layered volumes and provide improved data redundancy. If a
disk in a layered volume fails, a smaller portion of the redundancy is lost and recovery and resynchronization time is
usually quicker than it would be for a nonlayered volume that spans multiple drives.
Attribute                        Stripe-Mirror Volume             Mirror-Stripe Volume
Recovery of a single subdisk     Only the lower plex, not the     The entire plex (full volume
failure requires                 top-level plex.                  contents) that contains the
resynchronization of:                                             subdisk.
For example, at 10 MB per        75 seconds (both subvolumes      150 seconds.
second, the time it will take    can be synchronized at the
to resynchronize the mirror is:  same time).
Layered volumes consist of more VxVM objects than nonlayered volumes. Therefore, layered volumes may fill up
the disk group configuration database sooner than nonlayered volumes. When the configuration database is full, you
cannot create more volumes in the disk group.
Disk Failures
Disk failures can be classified into two general categories:
Permanent disk failure
When a disk is corrupted and no longer usable, the disk must be logically and physically removed, and
then replaced with a new disk. With permanent disk failure, data on the disk is lost.
Temporary disk failure
When communication to a disk is interrupted, but the disk is not damaged, the disk can be logically
removed, then reattached as the replacement disk. With temporary (or intermittent) disk failure, data still
exists on the disk.
Hot relocation is a feature of VxVM that enables a system to automatically react to I/O failures on redundant
VxVM objects and restore redundancy and access to those objects. VxVM detects I/O failures on objects and
relocates the affected subdisks. The subdisks are relocated to disks designated as spare disks or to free space within
the disk group. VxVM then reconstructs the objects that existed before the failure and makes them redundant and
accessible again.
Partial Disk Failure
A partial disk failure is a failure that affects only some subdisks on a disk. When a partial disk failure occurs,
redundant data on the failed portion of the disk is relocated. Existing volumes on the unaffected portions of the disk
remain accessible. With partial disk failure, the disk is not removed from VxVM control. Before removing a failing
disk for replacement, you must evacuate any remaining volumes on the disk.
A fault resilient cluster features at least one machine that is configured to assume responsibility for a failed server.
When one machine in the pair fails, its services are moved to the second server. This is called failover. Failover is
defined as the migration of services from one server to another. In a fault resilient cluster, a significant outage of
your primary server will have little impact on your users. Software can be added to hardware clustering solutions to
provide 99.99% data availability. This accounts for only 53 minutes of downtime per year. In most instances, only
seconds or minutes are lost. This topic describes the general characteristics of fault resilient HA clusters.
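The relationship between an availability percentage and yearly downtime is simple arithmetic. The sketch below assumes a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% availability -> {minutes:.2f} minutes per year")
```

At 99.99% the result is roughly 52.6 minutes, which the text rounds to 53 minutes.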
A well-tested and robust FMS, such as VERITAS Cluster Server, supports all common networks, databases,
and applications, and features many advantages over other options for monitoring and managing failovers
in a fault resilient pair or cluster.
Support for planned maintenance
Fault resilient solutions support planned maintenance of OS software, applications, or hardware. When a
system is brought offline for maintenance, other systems can immediately take over services to ensure that
the failover is completely transparent to users.
Minimal effects of failover on users
The effect of a fault or failure should be almost completely transparent to your users. The most intrusive
effect that a failover can have on a user in a fault resilient system is a simple reboot of the client machine.
In most cases, even this much of an intrusion is not acceptable. After a failover, the user should not have to
perform any actions to return to work, once the services have been restored by another server in the cluster.
Very quick failover times
Ideally, the failover time in a fault resilient system will be less than 2 minutes. You should always have the
backup server running and have as many system processes active as possible to enable minimal failover
time. The takeover server should never require a reboot in the event of a failover. If this happens, the
failover time can increase to almost an hour in some cases. It is a good idea to create a failover time
expectation for your users.
Minimal hands-on interaction
Ideally, the failover process should never require any sort of human intervention.
Data integrity
To guarantee data integrity, it is required that the servers in a fault resilient cluster share the same storage
disks. After a failover, the user must see the same consistent data that was available to the original server.
These shared disks are critical and should feature some sort of mirrored RAID protection.
Communication networks
Each server in a cluster must continuously monitor the state of the other servers in the cluster. This
is accomplished through a pair of heartbeat networks that run independent of one another.
Another network is required to communicate with the clients or users. This is called the public, or
service, network.
It is not strictly required, but the servers in a fault resilient pair or cluster should also
maintain communication with system administrators. This can be accomplished by a separate
administrative network.
To simplify configuration and administration, all the servers in a fault resilient pair or cluster should be completely
identical. This means that they have the same processor type, identical memory, and they are running the same
version of operating system with identical patches. Many system vendors manufacture models that have subtle
differences. It is important that you avoid any incompatibility issues by using identical servers. If you do utilize
different system models, you should use combinations that are proven to be compatible and are well-tested in cluster
environments.
Networks
A fault resilient pair or cluster has three separate levels of network communication:
1. Public network
The public network is the means by which the server pair or cluster communicates with the end users. In
many systems, the network is the least available component. You can determine methods to increase
availability by breaking the public network down into three basic components:
The servers link to the network through a network interface card (NIC). You should implement
some sort of redundancy at the NIC level to ensure that the servers can connect to the network
even if there is a fault or failure in the NIC. At each cluster node, you should allow for at least two
parallel, independent networking access points. If message traffic is heavy, you may need
additional access points to support message traffic during system failover.
2. Heartbeat networks
Heartbeat networks are the channel through which the servers in a pair or cluster communicate and monitor each
other. When the heartbeat stops, connectivity is lost.
3. Administrative network
Disks
Private disks
The private disks are unshared, independent disks that contain not only the operating system, but also any
software required for the failover process.
Public disks
Public disks are the shared storage disks that are accessed by the end user. After a failover, the user should
see the same consistent data that was available to the original server. Public disks are critical and should
feature some sort of mirrored RAID protection.
Stages of Failover
There are three basic stages of failover:
Discovery
First, a hardware or software fault triggers the failover process. This fault can be a part of one system, an
entire system, or a group of systems. Next, the system recognizes that there has been a downgrade in status.
Some subsystems, such as RAIDs, may have built-in automatic recovery capabilities. If not, then the
failover process begins.
Notification
In this stage, the system is made aware of the failure. In fault tolerant systems, subassemblies may be
configured to notify their parent assemblies that they have failed. A driver must be written for notification
to take place. In a cluster, once a loss of a resource has been detected, in order to compensate, all systems
are made aware of the loss. This notification must occur even if the network shared by the servers and users
fails. Therefore, a separate private network must be available for inter-server communication. Systems must
have redundant communication methods available. It is important to note that some servers may
continuously monitor each other's ability to communicate. If one server is unable to communicate with
the others, the others will assume that the server's resources and services are offline. The servers will notify
each other and fail over its services to other servers in the configuration automatically.
Recovery
Once the cluster has responded to the loss of a resource, operators can repair the resource. The cluster should
then be able to restore operations to the state before the failure in a way that is virtually transparent to client
processes.
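The discovery step that relies on redundant heartbeat links can be sketched as follows. This is an illustration only, not how VCS implements it: the class name, the link names, and the timeout value are all assumptions made for the example. A peer is declared failed only when every one of its heartbeat links has been silent for longer than the timeout.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer on each link (hypothetical model)."""

    timeout_seconds: float  # assumed threshold; real products use their own tunables
    last_seen: dict = field(default_factory=dict)  # (peer, link) -> timestamp

    def record(self, peer: str, link: str, now: float) -> None:
        """Note that a heartbeat from `peer` arrived on `link` at time `now`."""
        self.last_seen[(peer, link)] = now

    def peer_is_alive(self, peer: str, now: float) -> bool:
        """A peer counts as alive if ANY of its links was heard from recently."""
        return any(
            now - seen <= self.timeout_seconds
            for (name, _link), seen in self.last_seen.items()
            if name == peer
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=5.0)
    monitor.record("node2", "hb0", now=0.0)
    monitor.record("node2", "hb1", now=8.0)
    # At t=10 the first link is stale but the second is fresh, so no failover.
    print(monitor.peer_is_alive("node2", now=10.0))
    # At t=20 both links are stale, so the peer is considered failed.
    print(monitor.peer_is_alive("node2", now=20.0))
```

Using two independent links in this way means a single failed cable or NIC does not trigger an unnecessary failover.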
In a shared nothing model, each storage device is connected to exactly one node in the cluster. Storage device
ownership may pass from server to server, but a server must relinquish ownership before another can claim a device.
In the shared nothing cluster model, applications running on different servers cannot access the same file systems
concurrently.
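The ownership rule of the shared nothing model (one owner at a time, and the owner must relinquish the device before another server can claim it) can be expressed as a small sketch. The class and node names are hypothetical and only illustrate the rule.

```python
class SharedNothingDevice:
    """A storage device that at most one cluster node may own at any time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.owner = None  # no node owns the device initially

    def claim(self, node: str) -> None:
        """Take ownership; fails if another node has not relinquished the device."""
        if self.owner is not None and self.owner != node:
            raise RuntimeError(f"{self.name} is still owned by {self.owner}")
        self.owner = node

    def release(self, node: str) -> None:
        """Relinquish ownership; only the current owner may do this."""
        if self.owner != node:
            raise RuntimeError(f"{node} does not own {self.name}")
        self.owner = None


if __name__ == "__main__":
    disk = SharedNothingDevice("disk0")
    disk.claim("node1")
    disk.release("node1")   # node1 relinquishes the device ...
    disk.claim("node2")     # ... and only then can node2 claim it
    print(disk.owner)
```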
Shared nothing clusters enhance the availability of an application. If an application or the server on which it is
executing fails, a failover server takes control of the application's storage devices, and restarts the application
service.
Shared nothing clusters also enable read-only applications to scale beyond the capacity of a single server. Prior to
the Internet explosion, read-only applications were of limited utility. Currently, however, most commercial web
servers are heavily loaded with read-only data. Multiple instances of a read-only web application can run on shared
nothing clustered servers, each accessing its own copy of served web pages. As long as access is read only, there is
no need to synchronize copies of the web pages.
The storage in a shared nothing cluster is not dual-ported. This storage is often mirrored or uses fault-tolerant,
hardware arrays with redundant controllers. This cluster configuration is relevant only for an application which
features a shared-nothing parallel database architecture. Clusters providing highly available data services, such as
Oracle Parallel Server, require physical connections from all nodes to all storage devices, and cannot be configured
in a shared-nothing manner.
Shared data clusters enhance application availability, and in addition, enable any partitionable application to scale
beyond the capacity of a single server. Shared data clusters provide read-write access to a single copy of data to
multiple application instances executing on different servers. Since all applications access the same copy, all
applications have instant access to all data updates. There are two different access modes in a shared data cluster:
Shared parallel access
In this shared data model, storage devices can be accessed by more than one server at the same time. In
the simplest variation of this shared data model, servers share access to storage devices on which they
create private, logical volumes.
Shared disk clusters feature a common I/O bus for disk access. Because all nodes can
write to or cache data from the centralized disks at the same time, a synchronization
mechanism must be used to preserve the coherence of the system. Some sort of lock
manager serves this purpose in a shared disk cluster configuration.
A sophisticated shared data model, such as VERITAS SANPoint Foundation Suite HA,
supports concurrent access to file system data by all servers in a cluster.
Asymmetric 1 to 1 Configurations
This topic describes fault-resilient, asymmetric, 1 to 1 cluster configurations.
In this example, a file server application is failed over from the master server to the backup server. Notice that the IP
address used by the client systems moves as well. This is extremely important; otherwise all clients would have to
be updated on each server failover process.
Each node has multiple applications running when all of the nodes are functioning properly. If Node1 fails, App1
fails over to Node2, and App2 and App3 to Node3. App4 and App5 on Node1 are discarded. All local applications
on Node3 will also be discarded to make room for App2 and App3.
Symmetric 1 to 1 Configurations
This topic describes fault resilient, symmetric, 1 to 1 configurations.
In the event of a service failure, the other server would take over and run both applications.
The IP address moves to the host that is running the service. When a failover occurs, the service is failed over to the
alternate node, and that node is configured with the service's IP address in addition to its own address. This way, client-side
applications do not require reconfiguration to be able to locate the recovered version of the application. Of course,
any TCP connections that were open with the old instance of the service will be terminated by the failover, and new
TCP connections will need to be established. In many cases, the restoration of the TCP connection is transparent to
the user.
Each node has multiple applications running when all of the systems are functioning properly. If Node1 fails, App4
is transferred to Node2, App3 to Node3, and App6 is discarded. Node2 must have enough available capacity during
normal operations to accommodate App4 in the event of a Node1 failure. Similarly, Node3 must have enough
available capacity for App3.
Note that in symmetric failover, the hosts are generally configured with more processing and I/O power than is
needed to run their individual applications. The effect of running both sets of applications on one host must be
considered. If both are running at capacity and one fails, the performance of the remaining one will be poor.
On the surface, it would appear the symmetrical configuration is a far more beneficial configuration in terms of
hardware utilization. Many organizations dislike the concept of a valuable system sitting idle. There is a flaw in this
line of reasoning, however. In asymmetrical failover, the takeover server would need only as much processor power
as its peer. On failover, performance would remain the same. In symmetrical failover, the takeover server would
need sufficient processor power to not only run the existing application, but also enough for the new application it
takes over. If a single application needs one processor to run properly, an asymmetric configuration would need two
single processor systems. To run identical applications on each server, a symmetrical configuration would require
two dual processor systems.
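The sizing argument above can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: a two-node cluster, every application needing the same number of processors, and no loss of performance allowed after a failover. The function name is illustrative.

```python
def processors_needed(processors_per_app: int, symmetric: bool) -> dict:
    """Size a two-node cluster so performance is unchanged after a failover."""
    # Asymmetric: one server runs the application, the standby is sized like it.
    # Symmetric: each server runs its own application and must also be able
    # to absorb the application of its peer.
    per_server = processors_per_app * (2 if symmetric else 1)
    return {
        "per_server": per_server,
        "total": 2 * per_server,
        "apps_in_service": 2 if symmetric else 1,
    }


if __name__ == "__main__":
    print("asymmetric:", processors_needed(1, symmetric=False))
    print("symmetric: ", processors_needed(1, symmetric=True))
```

With one processor per application, the asymmetric pair needs two single-processor systems and the symmetric pair needs two dual-processor systems, as stated in the text; the symmetric pair does, however, keep two applications in service.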
N to 1 Clustering
This topic describes a traditional N to 1 networked cluster configuration.
N to 1 Cluster Scalability
One important consideration in clustering is scalability. Most HA packages can scale to eight or more nodes. It is
important to note that attaching more than two hosts to a single SCSI storage device becomes problematic, as
specialized cabling must be used. In most cases, scaling beyond four hosts is not practical, as it severely limits the
actual number of SCSI disks that can be placed on the bus.
Example of a 4 to 1 Cluster
This example illustrates the inherent complexities of a 4 to 1 cluster. Each of the four primary servers is connected
to a set of two disks. All the disks are connected to a fifth server that acts as the backup server. This could be
an asymmetric or a symmetric cluster. The major difference is in the functionality of the backup server:
In a 4 to 1 asymmetric configuration the fifth server would simply act as the standby server. The four
primary servers act independently of one another. In the event of a single server failure, its services would
be failed over to the standby server.
In a 4 to 1 symmetric configuration, the fifth server would act as the standby server and also run
applications.
This diagram shows how a cluster of systems might share a group of disks. Notice that each of the HBAs on the bus
must have a high-priority SCSI ID that differs from the IDs of the other HBAs. Special cables must be used to attach
more than two hosts to the bus.
Disadvantages of this configuration include:
The potential for duplicate IDs
Complicated termination issues that can result in the loss of data
Compatibility between controllers (For example, you must have differential SCSI devices if you have a
differential controller.)
A typical SCSI bus has one SCSI initiator for the controller or HBA, and one or more SCSI targets for the drives. To
configure a dual hosted SCSI configuration, one SCSI initiator ID must be set to a value different than its peer. The
SCSI target IDs must be chosen so they do not conflict with the ID for any drive that is installed or an initiator ID.
Setting the SCSI Initiator ID
The method of setting SCSI initiator IDs is dependent on the system manufacturer. For example, Sun
Microsystems provides two methods to set SCSI initiator IDs:
Changing the scsi-initiator-id value
This affects all SCSI controllers in the system, including the internal controller for the system disk and
CD-ROM. Be careful when choosing a new controller ID to not conflict with the boot disk, floppy drive, or
CD-ROM.
Editing the SCSI driver control file
This file is in the /kernel/drv area. This will set the SCSI initiator ID on a per controller basis. NT and Intel
systems are typically set on a per controller basis with a utility package provided by the SCSI controller
manufacturer. You should refer to your system documentation for details.
The problem may manifest itself during simultaneous commands from both initiators. A
controller could issue a command, and see a response from a drive and assume all is well.
This command may actually have been from the peer system. The original command may
have not executed successfully. Carefully examine systems attached to shared SCSI and
make certain that the controller ID is different.
Configuring Dual Hosted SCSI: Example
The following is an example of how to set up a typical dual hosted SCSI configuration:
1. Attach the storage to one system.
2. Terminate the SCSI bus at the array.
3. Power up the host system and array.
4. Verify all drives can be seen with the operating system by using available commands such as the format
command.
5. Identify the SCSI drive IDs that are used in the array and the internal SCSI drives if they are present.
6. Identify the SCSI controller ID.
7. Identify a suitable ID for the controller on the second system.
This ID must not conflict with any drive in the array or the peer controller. If you plan to set all controllers
to a new ID, ensure that the controller ID chosen on the second system does not conflict with internal SCSI
devices.
8. Set the new SCSI controller ID on the second system.
9. Power down both systems and the external array.
10. Disconnect the SCSI terminator and connect the array to the second system.
11. Power up the array and both systems.
Depending on hardware platform, you may be able to check for array connectivity before the OS is brought
up. Boot console messages such as "unexpected SCSI reset" are a normal occurrence during the boot
sequence of a system connected to a shared array. Most SCSI adapters will perform a bus reset during
initialization. The error message is generated when it sees a reset that was initiated by the peer.
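Step 7 of the procedure above (choosing an ID for the second controller that conflicts with neither the drives nor the peer controller) can be sketched as a small helper. The sketch assumes a narrow 8-ID SCSI bus, on which ID 7 has the highest priority and priority falls toward ID 0; the function name and the example ID values are hypothetical.

```python
def pick_initiator_id(drive_ids, peer_initiator_id, bus_ids=8):
    """Return the highest-priority free SCSI ID for the second controller.

    Assumes a narrow bus with IDs 0..bus_ids-1, where a higher ID means a
    higher priority. IDs used by drives or by the peer controller are skipped.
    """
    taken = set(drive_ids) | {peer_initiator_id}
    for candidate in range(bus_ids - 1, -1, -1):
        if candidate not in taken:
            return candidate
    raise ValueError("no free SCSI ID on this bus")


if __name__ == "__main__":
    # Drives in the array use IDs 0-3 and the first host keeps the usual ID 7.
    print(pick_initiator_id(drive_ids=[0, 1, 2, 3], peer_initiator_id=7))
```

If the new ID is applied to all controllers in the second system, the internal devices of that system should be included in `drive_ids` as well, for the reason given in step 7.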
N to 1 SAN Clustering
This topic describes the implementation of an N to 1 clustering design in a Storage Area Network (SAN)
environment. SANs are specialized high-speed networks that enable fast, reliable access among computers and
independent storage resources. In a SAN, all networked servers share storage devices as peer resources. In other
words, they are not the exclusive property of any one server. You can use a SAN to connect servers to storage,
servers to each other, and storage to storage through hubs, switches, and routers.
Defining SAN
SANs are defined as specialized, high-speed networks that are specifically dedicated to storage. SANs provide fast,
reliable access among systems and storage resources. The Storage Networking Industry Association (SNIA) defines
a SAN as:
"A network whose primary purpose is the transfer of data between computer systems and
storage elements and among storage elements. Abbreviated SAN. A SAN consists of a
communication infrastructure, which provides physical connections, and a management
layer, which organizes the connections, storage elements, and computer systems so that
data transfer is secure and robust."
Fibre Channel
Although the definition of a SAN does not specifically mention Fibre Channel technology, the Fibre Channel
protocol was the foundation for the development of SAN technology. With the emergence in the mid-1990s of Fibre
Channel-based networking devices, such as Fibre Channel switches, companies began to create networked
environments for storage in which servers and storage were connected in an any-to-any fashion, supported by a
highly reliable, high-performance fabric network. Fibre Channel, for the first time, enabled companies to virtualize
storage and provide high-speed access to information from any storage device to any server.
SAN Benefits
Attaching more than two hosts to a traditional, single SCSI storage device becomes problematic. SANs enable you
to connect a large number of hosts to a nearly unlimited amount of storage. This allows much larger clusters to be
constructed relatively easily.
A SAN carries only I/O traffic between servers and storage devices. It does not carry general-purpose traffic such as
email or other end user applications. Therefore, it avoids the compromises inherent in using a single network for all
applications. With this shared capacity, organizations can acquire, deploy, and use storage devices more
cost-effectively. Ultimately, on a SAN, any data at any network location is accessible, often through multiple paths, by
any nodes, applications, or users on the network. Storage on a SAN is shared, resulting in centralized management,
better utilization of disk and tape resources, and enhanced enterprise-wide data management and protection.
SANs are designed to replace today's point-to-point access methods with a new any-to-any architecture. In the
traditional model, if disks are logically shared, this sharing occurs at LAN speeds, such as 100 megabits/second, or
is limited to the small number of nodes which can be directly attached to a given disk array. Through the addition of
a high-speed switch, clients can access any disk from any node on the SAN at channel speeds, such as 100MB/sec.
This allows a much larger number of nodes much faster access to a much larger centralized data store.
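The two example speeds quoted above use different units (megabits versus megabytes), so the gap is larger than the identical numbers suggest. The sketch below converts both to megabytes per second and compares the time needed to move a sample amount of data; the 1000 MB transfer size is an arbitrary illustration, and protocol overhead is ignored.

```python
BITS_PER_BYTE = 8


def megabits_to_megabytes(megabits_per_second: float) -> float:
    """Convert a rate in megabits/second to megabytes/second."""
    return megabits_per_second / BITS_PER_BYTE


def transfer_seconds(size_megabytes: float, rate_megabytes_per_second: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return size_megabytes / rate_megabytes_per_second


if __name__ == "__main__":
    lan_rate = megabits_to_megabytes(100)  # 100 megabits/second LAN
    channel_rate = 100.0                   # 100 MB/sec channel speed
    print(f"LAN: {lan_rate} MB/s, channel: {channel_rate} MB/s")
    print(f"channel is {channel_rate / lan_rate:.0f}x faster")
    print(f"1000 MB over LAN:     {transfer_seconds(1000, lan_rate):.0f} s")
    print(f"1000 MB over channel: {transfer_seconds(1000, channel_rate):.0f} s")
```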
Redundancy is easily added to a SAN through the incorporation of a second switch or redundant switching
components to support high availability data access. Additional nodes and disk arrays can be easily added to these
configurations with minimal disruption by plugging new components into the switch, providing a much simpler and
more scalable growth path than traditional architectures. Finally, any node in the SAN may potentially back up any
other node. One or two dedicated nodes can now back up a much greater number of nodes, thereby significantly
reducing the hardware costs associated with cluster configurations.
Application Services
An application service is the service the end user perceives when accessing a particular network address. An
application service is typically composed of multiple resources, some hardware and some software based, all
cooperating together to produce a single service.
The second node fails. On recovery, the application load of the failed server is balanced across the other two nodes.
If another server fails, all of the applications would fail over to the remaining server.
In a database environment, the monitoring application can connect to the database server
and perform SQL commands and verify read and write to the database. It is important
that data written for subsequent read-back is changed each time to prevent caching from
hiding underlying problems.
In both cases, end-to-end monitoring is a far more robust check of application health. The
closer a test comes to exactly what a user does, the better the test is in discovering
problems. This does come at a price. End-to-end monitoring increases system load and
may increase system response time. From a design perspective, the level of monitoring
implemented should be a careful balance between assuring the application is up and
minimizing monitor overhead.
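An end-to-end database check of the kind described above can be sketched as follows. The example uses Python's built-in sqlite3 module purely as a stand-in for the real database server, and the probe table name ha_probe is hypothetical. A fresh random value is written on every check so that a cached result cannot mask a failure.

```python
import sqlite3
import uuid


def end_to_end_check(conn: sqlite3.Connection) -> bool:
    """Write a fresh value, read it back, and report whether they match."""
    token = uuid.uuid4().hex  # changes on every call, so caching cannot hide a fault
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ha_probe (id INTEGER PRIMARY KEY, token TEXT)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO ha_probe (id, token) VALUES (1, ?)", (token,)
        )
        conn.commit()
        row = conn.execute("SELECT token FROM ha_probe WHERE id = 1").fetchone()
    except sqlite3.Error:
        # Any database error means the service is not healthy.
        return False
    return row is not None and row[0] == token


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print("healthy:", end_to_end_check(connection))
    connection.close()
    print("after close:", end_to_end_check(connection))
```

A lighter check, such as only looking for the server process, costs less but would not notice a database that is running yet unable to accept writes.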
The application must be capable of storing all required data on shared disks.
This may require specific setup options or even soft links. For example, the VERITAS NetBackup product
is designed to install in /usr/openv directory only. This requires either linking /usr/openv to a file
system mounted from the shared storage device or actually mounting a file system from the shared device on
/usr/openv. Similarly, the application must store data to disk, rather than maintaining it in memory. The
takeover system must be capable of accessing all required information.
The application must be capable of being restarted to a known state.
This is the most important application requirement. On a switchover, the application is brought down under
controlled conditions and started on another node. The application must close out all tasks, store data
properly on shared disk, and exit. At this time, the peer system can start up from a clean state. A problem
arises when one server crashes and another must take over. The application must be written in such a way
that data is not stored in memory, but regularly written to disk.
A commercial database, such as Oracle, is the perfect example of a well-written, crash-tolerant application.
On any given client SQL request, the client is responsible for holding the request until it receives an
acknowledgement from the server. When the server receives a request, it is placed in a special log file, or
"redo" file. This data is confirmed as being written to stable disk storage before acknowledging the client.
At a later time, Oracle de-stages the data from the redo log to the actual tablespace. After a server crash,
Oracle can recover to the last known committed state by mounting the data tables and applying the redo
logs. This in effect brings the database to the exact point of time of the crash. The client resubmits any
outstanding client requests not acknowledged by the server; all others are contained in the redo logs.
One key factor to note is the cooperation between client application and server. This must be factored in
when assessing the overall cluster compatibility of an application.
The application must be capable of running on all servers designated as potential hosts.
This means there are no license issues, host name dependencies, or other such problems. Prior to attempting
to bring an application under cluster control, it is highly advised the application be test run on all systems in
the proposed cluster that may be configured to host the application.
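The restart-to-a-known-state requirement, and the redo log behaviour described for the database example, can be illustrated with a toy model. This is a sketch of the general log-first idea only and does not reflect Oracle's actual implementation; the class and method names are hypothetical, and Python lists and dictionaries stand in for stable disk storage.

```python
class TinyStore:
    """Toy store that logs every write before acknowledging it."""

    def __init__(self) -> None:
        self.redo_log = []  # stands in for the redo file on stable storage
        self.table = {}     # stands in for the de-staged table data
        self.applied = 0    # number of log records already de-staged

    def write(self, key, value) -> str:
        """Record the request in the log first, then acknowledge the client."""
        self.redo_log.append((key, value))
        return "ack"

    def destage(self) -> None:
        """Apply log records that have not yet reached the table."""
        for key, value in self.redo_log[self.applied:]:
            self.table[key] = value
        self.applied = len(self.redo_log)

    def recover(self) -> dict:
        """After a crash, replay the outstanding log records onto the table."""
        self.destage()
        return self.table


if __name__ == "__main__":
    store = TinyStore()
    store.write("a", 1)
    store.destage()
    store.write("a", 2)  # acknowledged but not yet de-staged
    store.write("b", 3)
    print("before recovery:", store.table)
    print("after recovery: ", store.recover())
```

Because every acknowledged write is in the log, the takeover server can rebuild the last committed state; only requests that were never acknowledged must be resubmitted by the client.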
A parallel service group can be fully or partially online on both servers at the same time. For
example, OPS is configured as a parallel service group.
26. In VERITAS Cluster Server, dependencies between the resources have to be created. The
dependencies between the resources specify the order in which the resources within a
service group are brought online and taken offline.
For example, suppose a service group called abc is created that has a disk group and volumes as
resources. When the service group is brought online, the disk group is brought online
first and then the volume, so a dependency between the disk group and the volume has to be
created. In the same way, a dependency between the volume and the mount point has to be
created.
Since the disk group comes up first, it is called the child and the volume is called the parent.
In the same way, between the volume and the mount point, the volume is the child and the mount
point is the parent. The same holds true between a NIC and an IP address.
27. Before starting the VCS GUI, the configuration has to be made read/write. Use the following
commands:
# haconf -makerw -- to set the configuration to read/write mode
# hauser -add username -- to add another user
# haconf -dump -makero -- to reset the configuration to read-only
# xhost +
# hagui &
This opens a console. Log in as admin or as the user you just created.
28. To create dependencies:
a. Create a service group called bsnl. Include the disk group resource in it by
selecting the Add Resource tab. The disk group "bsnldg" has already been created using VERITAS
Volume Manager.
Click on the properties of the disk group resource and enter its properties, such as the disk group
name.
b. Create a resource called volume by selecting Add Resource. Include all 10
volumes that were created using VERITAS Volume Manager. In the properties
tab of each volume resource, enter the volume name.
(In the same way, create 10 mount points after creating a resource called mount. From the properties
tab, specify the mount point name, the physical device to which the mount point corresponds, and
the volume name to which the mount point corresponds.)
c. Create the dependency between the disk group and the volume. The volume will be
the parent and the disk group will be the child. To create
the dependency, a link has to be created between the volume and the
disk group (by dragging the mouse).
For admin, the password will be password. Also enter the cluster name.
The screen looks as follows:
As shown, Exch_NIC represents the NIC resource and Exch_IP represents
the IP resource. Exch_NIC is the child, whereas Exch_IP is the parent. In the same way,
Exch_DiskRes is the child and Exch_MountX is the parent.
To create a link (between two blue boxes), drag the mouse from one object to
the other. It will ask whether Exch_DiskRes is the child and Exch_MountX is the parent.
In the same way, create links or dependencies between all the objects.
In the above diagram, VCSNT5 and VCSNT6 are the two systems in the cluster.
Once these dependencies are created, start the cluster services on the primary
server. All the volumes in the shared array will then get mounted.
On issuing the haswitch command, all the volumes will get mounted on the secondary
server.
To check which commands are being executed in the background, click the "command
center" icon.
f. On running the format command for the c3t5d0 disk, you will observe that 2 MB of space has been
created in slice s7.
g. Now re-add the disk under Volume Manager control using the "remove the disk
after replacement" option of vxdiskadm.
(PLEASE DO NOT REINITIALIZE THE DISK)
After the dependencies for the resources are created, try to bring the group online using the following
procedure.
1) Select the group you want to bring online, take offline, or switch in "Services group".
2) Right-click on the group. (If the group is offline, you will get tabs for bringing it
online; if the group is already online, tabs for offline and switch to are available.)
3) Check the online and offline operations on the local system as well as the
remote system.
If everything works up to this step, we are ready for switchover of the group.
4) Right-click on the group in "Services group", click switch to, and click
on the remote system name in the cluster. (This will take the group offline on the local
system and bring it online on the remote system. Before the switch, the group
should be online on the local system.)
5) Using the ifconfig -a and df -k commands on both systems, check
whether the mount points and the IP address have been transferred from the local to the remote system.
6) If all the mount points and the IP address configured in VCS are switched over and
come up online successfully on the remote system, go ahead and directly
switch off the system on which the group is presently online.
ORACLE AGENT FOR ORACLE DATABASE :
The Oracle agent monitors the Oracle service and the SQLnet listener process.
1. The Oracle agent works in three modes:
a. ONLINE: uses the svrmgrl command to open the database
b. OFFLINE: uses the svrmgrl command to close the database (shutdown immediate)
c. MONITOR: scans the process table for ora_pmon, ora_smon, and ora_lgwr
2. The SQLnet listener agent works as follows:
a. ONLINE: uses lsnrctl start to start the listener process
b. OFFLINE: uses lsnrctl stop to stop the listener process
c. MONITOR: scans the process table for tnslsnr $LISTENER
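The MONITOR step amounts to checking that the expected background processes appear in the process table. The following is a simplified sketch of that idea and not the agent's actual code: it assumes the process listing is passed in as text lines whose last field is the process name, and that Oracle background processes carry the instance name as a suffix (for example ora_pmon_PROD, where PROD is a made-up SID and ora_lgwr is the log writer).

```python
REQUIRED_PREFIXES = ("ora_pmon", "ora_smon", "ora_lgwr")


def oracle_instance_running(ps_lines, sid: str) -> bool:
    """Return True if every required background process for `sid` is listed.

    `ps_lines` is an iterable of process-table lines; the process name is
    assumed to be the last whitespace-separated field of each line.
    """
    names = {line.split()[-1] for line in ps_lines if line.strip()}
    return all(f"{prefix}_{sid}" in names for prefix in REQUIRED_PREFIXES)


if __name__ == "__main__":
    sample = [
        "oracle  101  1  0  ?  0:00 ora_pmon_PROD",
        "oracle  103  1  0  ?  0:00 ora_smon_PROD",
        "oracle  105  1  0  ?  0:00 ora_lgwr_PROD",
    ]
    print(oracle_instance_running(sample, "PROD"))
    print(oracle_instance_running(sample[:2], "PROD"))  # log writer missing
```

A process-table scan like this is cheap but shallow; it confirms that the processes exist, not that the database answers queries.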
Requirements for Oracle Agent :
When the Oracle server application ($ORACLE_HOME) is installed on a shared disk, each
cluster system must have the same mount point directory for the shared file system.
To install the Oracle agent:
# cd /cdrom/cdrom0
# pkgadd -d .
Now start the Cluster Manager GUI and import the OracleTypes.cf file into the VCS engine
using the following method:
a. Start the Cluster Manager GUI.
b. Click on the File menu and select import files.
c. In the import files dialog box, select the file
/etc/VRTSvcs/conf/sample_Oracle/OracleTypes.cf
d. Save the configuration using the file save option.
This makes Oracle available as a resource in the cluster.
(By default, disk group, mount points, volumes, NIC, IP, and disks are present as
resources. Since Oracle is not present by default, this method has to be used to make
Oracle available as a resource for the cluster.)
After installing the Oracle agent, when you open the Cluster Manager GUI,
"Oracle" will be present as a resource. When you create a new resource of type Oracle, it
will ask for the following information:
a. sid
b. owner
c. $ORACLE_HOME path
d. Pfile: name of the startup profile, for example $ORACLE_HOME/dbs/initSID.ora
Similarly, a resource of type SQLnet will be available. Add a resource of type
SQLnet and enter the following information:
a. owner
b. $ORACLE_HOME path to Oracle binaries
c. name of listener (default is LISTENER)
d. $TNS_ADMIN path to the directory in which the listener configuration file (listener.ora) resides
Now create the dependencies between these two resources (Oracle and SQLnet).
Also assign a demo IP address that will float from one system to another in case of
system failover. The IP will be the parent and the public NIC can be the child. This
demo IP will in turn act as the child of the Oracle agent resource.
So Oracle will be the child and SQLnet will be the parent.
The final dependency looks this way
(from left to right, i.e., from child to parent):
(The Oracle agent is also the parent of the demo IP, which is the parent of the NIC.)
diskgroup - volumes - mountpoints - Oracle agent - Sqlnet
                                         |
                                      demo IP
                                         |
                                  NIC card (hme0:1)
So in case the system fails, the Sqlnet services will stop first (the parent goes offline first),
then Oracle will shut down, the mount points will be unmounted,
the volumes will go offline, and the disk group will automatically be deported.
The demo IP will also go offline.
Since the child comes online first, on the other system the demo IP will come up,
the disk group will automatically be imported, the volumes will then come online, the mount points
will be mounted, the database will come up, and finally the listener service will start
successfully.
Now you can switch off the machine and check whether the Oracle database comes up on
the other system in the cluster.
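The online and offline ordering implied by the parent/child links can be derived mechanically from the dependency graph. The sketch below is an illustration of that ordering rule, not VCS code: the resource names are shorthand for the resources in this example, and the graph shape is taken from the dependency description above. Children are brought online before their parents, and the offline order is simply the reverse.

```python
# parent -> list of children it depends on (names are shorthand for this example)
DEPENDS_ON = {
    "Sqlnet": ["Oracle"],
    "Oracle": ["mountpoints", "demo_ip"],
    "mountpoints": ["volumes"],
    "volumes": ["diskgroup"],
    "diskgroup": [],
    "demo_ip": ["nic"],
    "nic": [],
}


def online_order(depends_on: dict) -> list:
    """Return resources in an order where every child precedes its parent."""
    order, done = [], set()

    def visit(resource, path=()):
        if resource in done:
            return
        if resource in path:
            raise ValueError(f"dependency cycle at {resource}")
        for child in depends_on[resource]:
            visit(child, path + (resource,))
        done.add(resource)
        order.append(resource)

    for resource in depends_on:
        visit(resource)
    return order


def offline_order(depends_on: dict) -> list:
    """Parents go offline before their children: the reverse of online order."""
    return list(reversed(online_order(depends_on)))


if __name__ == "__main__":
    print("online: ", " -> ".join(online_order(DEPENDS_ON)))
    print("offline:", " -> ".join(offline_order(DEPENDS_ON)))
```

The listener (Sqlnet) is always last to come online and first to go offline, while the disk group and the NIC, which have no children, can start first.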
IMPORTANT COMMANDS
1. hastart -- start the VCS engine
2. hagrp -display -- display service groups
3. hastatus -summary -- summary of cluster information
You can also check the cluster that you configured through the GUI in the main.cf file located
in /etc/VRTSvcs/conf/config