Technical White Paper on
Reliability (Cloud Data Center)
Issue 1.0
Date 2015-04-15
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective
holders.
Notice
The purchased products, services and features are stipulated by the contract made between Huawei and
the customer. All or part of the products, services and features described in this document may not be
within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements,
information, and recommendations in this document are provided "AS IS" without warranties, guarantees
or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the
preparation of this document to ensure accuracy of the contents, but all statements, information, and
recommendations in this document do not constitute a warranty of any kind, express or implied.
Website: http://e.huawei.com
Purpose
This document describes the system reliability of FusionSphere.
Intended Audience
This document is intended for:
Marketing engineers
Sales engineers
Distributors
Symbol Conventions
The symbols that may be found in this document are defined as follows:
Change History
Changes between document issues are cumulative. The latest document issue contains all the
changes in earlier issues.
Issue 01 (2015-04-15)
This issue is the first official release.
Contents
2 System Reliability.........................................................................................................................2
2.1 System Reliability Requirements...................................................................................................................................2
2.2 OpenStack HA................................................................................................................................................................2
2.3 Service Deployment Redundancy..................................................................................................................................3
2.4 Traffic Control................................................................................................................................................................4
2.5 Data Consistency Audit..................................................................................................................................................5
2.6 Black Box.......................................................................................................................................................................5
2.7 Protection from Zombies................................................................................................................................................5
2.8 Redundant Network Paths..............................................................................................................................................5
2.9 Plane-based Network Communication...........................................................................................................................6
2.10 Management Data Backup...........................................................................................................................................7
2.11 Global Time Synchronization.......................................................................................................................................8
3 FusionCompute Reliability.........................................................................................................9
3.1 VM Live Migration........................................................................................................................................................9
3.2 Storage Cold and Live Migration.................................................................................................................................10
3.3 VM HA.........................................................................................................................................................................11
3.4 VM Fault Isolation.......................................................................................................................................................12
3.5 Virtualized Deployment of Management Nodes..........................................................................................................13
3.6 Host Fault Recovery.....................................................................................................................................................13
4 FusionStorage Reliability..........................................................................................................14
4.1 Data Store Redundancy Design....................................................................................................................................14
4.2 Multi-Failure Domain Design......................................................................................................................................15
4.3 Data Security Design....................................................................................................................................................15
4.4 Strong Data Consistency..............................................................................................................................................16
4.5 NVDIMM Power Failure Protection............................................................................................................................17
4.6 I/O Traffic Control........................................................................................................................................................17
4.7 Disk Reliability.............................................................................................................................................................18
4.8 Metadata Reliability.....................................................................................................................................................19
5 FusionManager Reliability........................................................................................................20
5.1 Active and Standby Management Nodes Architecture.................................................................................................20
5.2 System Monitoring.......................................................................................................................................................21
5.3 Data Consistency Between the Active and Standby Nodes..........................................................................................21
6 Network Reliability....................................................................................................................22
6.1 Multipathing Storage Access........................................................................................................................................22
6.2 NIC Load Balancing.....................................................................................................................................................23
6.3 Switch Stacking............................................................................................................................................................23
6.4 VRRP............................................................................................................................................................................23
2 System Reliability
2.2 OpenStack HA
OpenStack reliability is determined by the reliability of the following services:
Representational State Transfer (REST) API service reliability: provides continuous API
services for users.
Database service reliability: ensures user configuration data security and integrity and
service continuity.
Communication service reliability: ensures that different components properly interact
with one another.
All OpenStack services are deployed in active/active or active/standby mode for redundancy.
Figure 1.2 illustrates an example of OpenStack HA deployment.
A CPS client is deployed on each server to monitor services deployed on the server.
If the CPS client detects that a service is faulty, it automatically restarts the service.
If the CPS server does not receive a heartbeat message from a CPS client within a specified
time period, it regards the server where the client runs as faulty. The loss of heartbeat
then automatically triggers the VM HA mechanism to restart all affected VMs on other
servers.
CPS servers are deployed in cluster mode, usually on three to seven servers. The ZooKeeper
mechanism is used to elect the active CPS server. In addition to monitoring server running
status, the CPS server also provides the active/standby arbitration service.
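The heartbeat mechanism described above can be sketched in a few lines of Python. The `CpsMonitor` name, the 30-second default timeout, and the method names are illustrative assumptions, not actual FusionSphere interfaces:

```python
import time

class CpsMonitor:
    """Sketch of CPS-server heartbeat monitoring: a host whose client
    has not sent a heartbeat within `timeout` seconds is marked faulty,
    which in the real system would trigger VM HA for that host."""
    def __init__(self, timeout=30):
        self.timeout = timeout
        self.last_seen = {}   # host name -> timestamp of last heartbeat
        self.faulty = set()

    def heartbeat(self, host, now=None):
        self.last_seen[host] = time.time() if now is None else now
        self.faulty.discard(host)  # a fresh heartbeat clears the faulty mark

    def check(self, now=None):
        """Return the sorted list of hosts currently considered faulty."""
        now = time.time() if now is None else now
        for host, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.faulty.add(host)
        return sorted(self.faulty)
```

In a real deployment the `check` step would run on the elected active CPS server, so that a single arbiter decides which hosts have failed.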
Physical NICs of servers can be bound and categorized to allow the management, service, and
storage planes to use different logical NICs and to connect to different access switches,
thereby implementing physical network isolation.
3 FusionCompute Reliability
3.1 VM Live Migration
Live migration allows VMs scattered across different servers to be consolidated onto a few
servers, or even one server, when traffic is light. Idle servers can then be powered off,
reducing costs for customers and saving energy.
VM live migration also ensures high reliability of the customer system. If a fault occurs on a
running physical machine, its services can be migrated to properly running machines
before the fault worsens.
Hardware can be upgraded online without interrupting services. Before upgrading a physical
machine, users can migrate all its VMs to other machines. After the upgrade is complete,
users can migrate the VMs back. In this way, services are not interrupted.
Typical application scenarios for VM live migration are as follows:
Manual VM migration to any idle physical server as required
Batch VM migration to any idle physical server based on the resource utilization
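The batch-migration scenario above can be sketched as a simple placement plan. The utilization threshold and round-robin placement policy below are illustrative assumptions, not FusionSphere's actual scheduling algorithm:

```python
from itertools import cycle

def plan_consolidation(hosts, threshold=0.2):
    """Plan live migrations that empty lightly loaded hosts so they can
    be powered off. `hosts` maps host name -> (cpu_utilization, vm_list).
    Returns a list of (vm, source_host, destination_host) tuples."""
    light = sorted(h for h, (util, _) in hosts.items() if util < threshold)
    busy = sorted(h for h in hosts if h not in light)
    if not busy:
        return []  # nowhere to consolidate onto
    dst = cycle(busy)  # spread VMs round-robin to avoid overloading one host
    return [(vm, src, next(dst)) for src in light for vm in hosts[src][1]]
```

A production scheduler would also weigh destination memory headroom and network traffic, as the VM HA section below notes.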
3.3 VM HA
If a physical CNA server breaks down or restarts abnormally, the system restarts its
high-availability (HA) VMs on other computing servers, ensuring that the VMs are rapidly
restored.
A cluster can house thousands of VMs. Therefore, if a computing server breaks down, the
system migrates VMs to different servers based on the network traffic and destination server
load to prevent network congestion and overload on the destination server.
Figure 1.10 VM HA
After detecting that a computing node or VM becomes faulty, the VRM restarts the VM on a
properly running computing node based on the recorded VM information.
VM HA is triggered if the heartbeat connection between the VRM and a CNA is lost for
30 seconds, or if a VM suddenly runs abnormally.
The lockout mechanism at the storage layer prevents one VM instance from being started
concurrently on multiple CNAs.
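The storage-layer lockout can be illustrated with POSIX file locks: whichever node takes the exclusive lock on a VM's lock file first may start the instance, and any later attempt fails. This is a generic sketch of the mechanism, not FusionSphere's implementation:

```python
import fcntl
import os

def try_start_instance(lock_path):
    """Return a held lock fd if this caller may start the VM instance,
    or None if the VM's lock is already held elsewhere."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # exclusive, non-blocking
        return fd      # lock held: safe to start the VM here
    except BlockingIOError:
        os.close(fd)
        return None    # VM instance already started by another holder
```

Keeping the lock file on shared storage is what makes the exclusion effective across CNAs rather than within a single host.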
CNA nodes can recover from power-off failures: service processes resume automatically after
the restart, and the VMs that were running on the CNAs are migrated to other computing
nodes.
In all cases, operations performed on a VM have no impact on other VMs on the same
physical server or on the virtualization platform.
4 FusionStorage Reliability
4.1 Data Store Redundancy Design
In the dual-copy scenario, if one disk in a FusionStorage resource pool becomes faulty, no
system data is lost and services continue to run properly.
In the three-copy scenario, if two disks in a FusionStorage resource pool become faulty
concurrently, no system data is lost and services continue to run properly.
System data persistence reaches 99.99% in the dual-copy scenario and 99.99999% in the
three-copy scenario.
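As a rough illustration of why each extra copy raises persistence, a toy model treats disk losses as independent events. The quoted figures come from Huawei's own analysis; this simplified model (no rebuild time, no correlated failures, and an assumed per-copy loss probability) is for illustration only:

```python
def durability(n_copies, p_loss):
    """Toy model: data survives if at least one of n independent copies
    survives, where each copy is lost with probability p_loss over the
    period considered. Real durability analysis also models rebuild
    time and correlated failures."""
    return 1.0 - p_loss ** n_copies
```

For example, with an illustrative per-copy loss probability of 1%, two copies give 1 - 0.01^2 = 99.99% and three copies give 1 - 0.01^3 = 99.9999%.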
Rack-based security: redundant data copies are distributed only on nodes in other racks, so
a blade or disk fault within one rack does not cause data loss in the system, ensuring
service continuity. Figure 1.15 shows the copy distribution by rack.
4.4 Strong Data Consistency
FusionStorage returns a write success only after all data copies have been written. In most
cases, FusionStorage ensures that data read from any copy is the same. If a disk fails
temporarily, FusionStorage stops writing to the copies on that disk until the fault is
rectified; it then restores those copies and resumes writing to them. If a disk can no
longer be used, FusionStorage removes it from the cluster and places the affected copies on
another available disk. Using the rebalance mechanism, FusionStorage balances data
distribution among all disks.
Figure 1.16 outlines how strong data consistency is implemented.
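The write path described above can be sketched with in-memory stand-ins for the copies. The dict-based replicas and the `None`-marks-a-faulty-disk convention are purely illustrative:

```python
def replicated_write(replicas, key, value):
    """Apply a write to every healthy copy; the write is reported
    successful only once all healthy copies have applied it.
    `replicas` is a list of dicts standing in for per-disk copies;
    None marks a temporarily faulty disk whose copy must be repaired
    later. Returns the indices of copies still needing repair (an
    empty list means every copy holds the new data)."""
    pending_repair = []
    for i, copy in enumerate(replicas):
        if copy is None:
            pending_repair.append(i)  # restore this copy once the disk recovers
        else:
            copy[key] = value
    return pending_repair
```

Because every healthy copy applies the write before success is reported, a subsequent read from any copy returns the same data, which is the strong-consistency property the section describes.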
4.6 I/O Traffic Control
When the system is overloaded, FusionStorage throttles services with low priorities based on
the traffic control algorithm and policies. Based on the volume of overloaded traffic, the
FusionStorage traffic control module adjusts the amount of I/O traffic processed each time
so that system workloads stay within the normal range.
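The throttling behavior can be sketched as a per-tick I/O budget that admits requests in priority order. The budget value and the strict-priority policy are illustrative assumptions, not FusionStorage's actual algorithm:

```python
def admit_io(requests, budget):
    """Admit I/O requests in priority order (lower number = higher
    priority) until the per-tick budget is spent; the remaining
    low-priority requests are throttled until the next tick.
    `requests` is a list of (priority, size) tuples."""
    admitted, used = [], 0
    for prio, size in sorted(requests, key=lambda r: r[0]):
        if used + size > budget:
            break  # everything at lower priority waits for the next tick
        admitted.append((prio, size))
        used += size
    return admitted
```

Adjusting `budget` from tick to tick, based on how overloaded the system is, corresponds to the traffic control module adjusting the amount of I/O processed each time.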
5 FusionManager Reliability
5.1 Active and Standby Management Nodes Architecture
Management nodes work in active/standby mode to manage services in the system. If both the
active and standby management nodes become faulty, new service requests are adversely
affected; for example, VMs cannot be created or deleted. However, VMs already running in
the system remain usable, and VM users are not aware of the failure.
6 Network Reliability
6.1 Multipathing Storage Access
Figure 1.18 shows the multipathing access process used when computing nodes communicate
with storage nodes. Each virtual volume attached to a VM is reachable over at least two
paths. Multipathing software controls path selection and performs service switchover upon
failures, preventing service interruptions caused by single points of failure and ensuring
system reliability.
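A minimal sketch of failover across redundant paths, assuming a simple ordered-path policy rather than the load-balancing policies real multipathing software typically offers:

```python
def issue_io(paths):
    """Try each path to the storage volume in order and return the
    name of the path that carried the I/O, or None if every path is
    down. `paths` is a list of (name, healthy) tuples; with at least
    two paths per volume, one path failure does not interrupt service."""
    for name, healthy in paths:
        if healthy:
            return name
    return None
```

The key point is that the caller sees an uninterrupted service as long as any one path survives; only when all paths fail does the I/O error surface.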
6.4 VRRP
The Virtual Router Redundancy Protocol (VRRP) is a fault tolerance protocol. With this
protocol, several routers can be grouped into one virtual router. If the next-hop switch of
a host becomes faulty, another switch in the group rapidly takes over its services, ensuring
service continuity and reliability.
VRRP combines a group of routers in a LAN as a VRRP backup group, which is equivalent to
a virtual router.
A virtual IP address is then configured for the virtual router. Hosts in the LAN need to
know only this virtual IP address, not the IP address of each specific router. Once a
host's default gateway is set to the virtual IP address, the host communicates with
external networks through the virtual gateway.
VRRP dynamically associates the virtual router with a physical device that transmits service
packets. If the physical device is faulty, VRRP selects a new device to take over service
transmission.
The entire process is invisible to users, and VRRP enables continuous communication
between internal and external networks.
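Master selection within a VRRP backup group can be sketched as follows. The tuple layout is an illustrative stand-in, while the priority comparison and the higher-IP-address tie-break follow VRRP's election rules:

```python
def elect_master(routers):
    """Pick the VRRP master: the highest priority wins; on a tie, the
    router with the higher primary IP address wins. Priority 0 means
    the router is relinquishing the master role. `routers` is a list
    of (name, priority, ip_as_int) tuples."""
    candidates = [r for r in routers if r[1] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r[1], r[2]))[0]
```

When the current master fails, the remaining routers rerun this election and the new master begins forwarding for the virtual IP address, which is why the switchover is invisible to hosts on the LAN.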