Contents

Introduction
Overview of Virtualization Technologies
Oracle VM Server for SPARC Overview
Introducing Root Domains
    I/O Domain Models
    Root Domains
Operational Model With Solaris Zones
    Zone Workloads
Operational Use Cases
    SPARC T4-4 (4 Socket)
    SPARC T4-4 (2 Socket)
    SPARC T4-2
General Domaining Considerations
    One Domain
    Two Domains
    Four Domains
    Three Domains
    Hybrid Domains
    Populating PCIe Slots
Building Root Domains: How-To
    Preparing the Control Domain
    Building Root Domains
    Building Root Domains with virtual I/O for boot and networking
Introduction
This paper describes how to implement an increasingly useful virtualization feature known as root domains. It also describes an operational model in which root domains are used in conjunction with Solaris Zones for maximum performance and a high degree of flexibility.

Root domains are a type of logical domain characterized by the fact that they own one or more PCIe root complexes and therefore do not share I/O components with other logical domains. System resources are simply divided among the domains, rather than being virtualized and shared. This type of virtualization is unique to the SPARC systems supported by Oracle VM Server for SPARC and offers a number of benefits, specifically bare-metal performance (i.e., without virtualization overhead) and lower interdependence among domains.

Zones are a feature of Oracle Solaris, and this model describes how to use them in a complementary, layered fashion together with root domains: root domains form the lower layer, and Zones the second layer above the domains. The combination allows highly flexible and effective virtualization, resource management, workload isolation and mobility.

The main objective of this paper is to introduce the root domain model as one of several architectures that the Oracle VM Server for SPARC product supports. It has a number of benefits as well as a number of restrictions. The architectural decision of how best to deploy Oracle VM Server for SPARC is driven by business needs and requirements, and may involve a combination or hybrid approach.
Type 1 Hypervisors

A hypervisor layer sits between the hardware and general-purpose operating system guests. It provides some level of platform virtualization and abstraction to these guests; it is generally said that the guests run on top of the hypervisor. This type of compute virtualization is primarily designed for servers. The way the hypervisor is implemented has a measurable impact on performance, as discussed in more detail in the next section. Oracle VM Server for SPARC is a Type 1 hypervisor.
Type 2 Hypervisors
The hypervisor runs within a conventional operating system, and guest operating systems then run within this hypervisor. This type of compute virtualization is primarily designed for desktop users and is beyond the scope of this paper. The Oracle VM VirtualBox product is a Type 2 hypervisor.
OS-based Virtualization
In this type of virtualization the general-purpose operating system itself provides a form of workload isolation within its own kernel. Only one operating system is actually running, even though it may appear that multiple instances exist simultaneously. A hypervisor is not required to implement OS-based virtualization. Solaris Zones is a mature, market-leading form of OS-based virtualization. In this paper we explain how to utilize Solaris Zones in conjunction with Oracle VM Server for SPARC in order to provide a highly performant and flexible virtualization environment.
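As a minimal sketch of what OS-based virtualization looks like in practice, the following creates and boots a Solaris 11 zone. The zone name and zonepath are illustrative:

```shell
# Define a zone named "appzone" (the name and path are illustrative)
zonecfg -z appzone "create; set zonepath=/zones/appzone"
# Install and boot the zone, then log in to it
zoneadm -z appzone install
zoneadm -z appzone boot
zlogin appzone
```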
In a traditional, monolithic (thick) hypervisor design, the hypervisor sits directly on bare metal and owns all the hardware resources: CPU, memory and I/O. The guest domains running the workloads must involve the hypervisor for access to all resources.
The architecture of Oracle VM Server for SPARC, on the other hand, is modular and consists of two main components:
1. A firmware-based hypervisor, which allocates resources (CPU, memory and optionally I/O devices) directly to logical domains.
2. Software-based I/O virtualization, which runs in a service domain and provides virtual I/O services to the guest domains. This is an optional component.
The hypervisor is a feature of the SPARC T-series architecture and is controlled from the primary domain. A service domain owns all the I/O and is allocated CPU and memory for its own purposes. The guest domains running the workloads have direct access to the CPU and memory allocated to them, and I/O is provided as virtualized I/O via the service domain.

Figure 2. Oracle VM Server for SPARC Service Domain Model
Unlike a traditional hypervisor, CPU threads and memory are allocated directly to the logical domains. The guest OS (Solaris) running within that logical domain runs on bare metal and can perform the aforementioned privileged operations directly, without the need to context switch into the hypervisor. This key differentiator illustrates why this logical domain model does not carry a virtualization tax from a CPU and memory overhead perspective as the conventional model does. Furthermore, in Oracle VM Server for SPARC using virtualized I/O is optional. It is possible to assign the I/O directly to the logical domains themselves, creating what we call the root domain model.
PCIe devices or an entire PCIe root complex can also be allocated directly to the guest domains. The guest domains running the workloads now have direct access to CPU, Memory and I/O and run with bare-metal performance.
In this model the hypervisor is simply performing the task of resource division and isolation. The hypervisor creates the logical domains and allocates resources to them but is not involved in the execution of instructions, in the allocation and access of memory, nor in performing I/O operations. It sets up the boundaries and gets out of the way. This further eliminates the virtualization tax from an I/O perspective.
In summary, Oracle VM Server for SPARC differs from traditional thick hypervisor designs in the following ways:
- Memory and CPU resources are assigned directly to logical domains and are neither virtualized nor oversubscribed (and therefore do not suffer the traditional inefficiencies of having to pass through a virtualization layer).
- Many I/O virtualization functions are offloaded to domains (known as service domains).
- I/O resources can also be assigned directly to guest domains, thus avoiding the need to virtualize I/O and the associated performance overhead.
This permits a simpler hypervisor design, which enhances reliability and security. It also reduces single points of failure by assigning responsibilities to multiple system components, which further improves reliability and security. In this architecture, management and I/O functionality are provided within domains. Oracle VM Server for SPARC does this by defining the domain's role:
- Control domain - the management control point for virtualization of the server, used to configure domains and manage resources. It is the first domain to boot on power-up, is an I/O domain, and is usually a service domain as well. There can only be one control domain.
- I/O domain - a domain that has been assigned physical I/O devices: a PCIe root complex, a PCIe device, or an SR-IOV (Single Root I/O Virtualization) function. It has native performance and functionality for the devices it owns, unmediated by any virtualization layer. There can be multiple I/O domains.
- Service domain - provides virtual network and disk devices to guest domains. There can be multiple service domains, although in practice there is usually one, sometimes two or more for redundancy. A service domain is always an I/O domain, as it must own physical I/O resources in order to virtualize them for guest domains.¹
- Guest domain - a domain whose devices are all virtual rather than physical: virtual network and disk devices provided by one or more service domains. In common practice, this is where applications are run. There are usually multiple guest domains in a single system.
Domain roles may be combined; for example, a control domain can also be an I/O domain and a service domain.
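The roles described above can be inspected from the control domain with the `ldm` command; a minimal sketch:

```shell
# List all domains with their state and assigned CPU/memory
ldm list
# Show which physical I/O devices (root complexes, PCIe slots)
# each domain owns
ldm list-io
```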
Historical Deployment Model
The typical deployment pattern has focused primarily on using a single service domain, which owns all of the physical I/O devices and provides virtual devices to multiple guest domains where the applications are run. This single service domain is also an I/O domain, and in practice it is usually also the control domain. A variant of this model also exists where a second service domain can be added to provide redundant access to disk and network services for the guest domains. This allows guests to survive an outage (planned or unplanned) of one of the service domains, and can be used for non-disruptive rolling upgrades of service domains.

This model offers many benefits: a high degree of flexibility, the ability to run an arbitrary number of guest domains (limited by CPU and memory resources), the ability to easily clone domains (when their virtual boot disks are stored on ZFS) and, when appropriately configured, the ability to live migrate domains to another system. These benefits come at the expense of having to allocate CPU and memory resources to the service domains, thus reducing the amount of CPU and memory available to applications, plus the performance overhead associated with virtualized I/O. This expense is justified by the operational flexibility of using service domains. This configuration is considered the default or baseline model due to its operational flexibility. The achieved performance is usually acceptable, and remains appropriate for many deployments. This model is well documented in numerous Oracle publications.

Hybrid models also exist where individual PCIe cards (Direct I/O) or virtual functions on a shared PCIe card (via SR-IOV technology) can be assigned to guest domains. These are beyond the scope of this paper.

1. There are exceptions: a service domain with no physical I/O could be used to provide a virtual switch for internal networking purposes, or could be configured to run the virtual console service.
2. Certain resources, such as console interfaces or management network interfaces, can still be virtualized via a small service domain function. However, these are not on the data path for performance.
In the Direct I/O and SR-IOV models the guest will have near-native I/O performance; however, its availability depends on the I/O domain that actually owns the PCIe root complex in which the PCIe device physically resides. For the remainder of this paper we focus exclusively on a very specific implementation of the domain model that we refer to as root domains.
Root Domains
So far we've discussed I/O domains primarily in the context of them also being service domains, where the sole purpose of owning I/O is to virtualize it for guest domains. However, the focus of this paper is the concept of an I/O domain hosting one or more applications directly, without relying on a service domain. Specifically, domain I/O boundaries are defined exactly by the scope of one or more PCIe root complexes. The other two methods (Direct I/O and SR-IOV) are not utilized in this model. This offers a number of key advantages over the other models available in the Oracle VM Server for SPARC technology, and in particular a distinct advantage over all other hypervisors that use the traditional thick model of providing all services to guest VMs through software-based virtualization (at the expense of performance):
- Performance: All I/O is native (i.e., bare metal) with no virtualization overhead.
- Simplicity: The guest domain and associated guest operating system own the entire PCIe root complex. There is no need to virtualize any I/O. Configuring this type of domain is significantly simpler than with the service domain model.
- I/O fault isolation: A root domain does not share I/O with any other domain. Therefore the failure of a PCIe card (i.e., NIC or HBA) impacts only that domain. This is in contrast to the service domain, Direct I/O or SR-IOV models, where all domains that share those components are impacted by their failure.
- Improved security: There are fewer shared components or management points.
The number of root domains is limited by the number of PCIe root complexes available in the platform. With the SPARC T4 processor there is one PCIe root complex per SPARC T4 socket, so a SPARC T4-2 system has two PCIe root complexes and a SPARC T4-4 system has four. Thus the maximum number of root domains possible on a SPARC T4-2 and a SPARC T4-4 system is two and four, respectively.

It is important to note that root domains are not the ideal solution for all workloads and use cases. In practice, a solution that comprises some systems with service domains as well as some systems with root domains may be appropriate. In fact, the same physical server could consist of two root domains running applications and two service domains providing services to a number of fully virtualized guest domains. This is discussed in more detail in the operational model section. Root domains have a number of restrictions that should be carefully considered:
- They are less flexible. With the guest domain model the number of domains can be changed dynamically, usually without impacting running workloads. Changing the I/O allocation of a root domain requires downtime; however, CPU and memory can still be dynamically allocated.
- The number of domains available is limited by the number of PCIe root complexes. On a T4-4 this means one, two, three or four root domains. Additional workload isolation can be performed using Solaris Zones, as discussed later in the operational model section.
- They cannot be live migrated. In general, any domain with direct access to physical I/O cannot be live migrated.
- They require more planning, particularly during the purchase phase, to ensure there are enough NICs and HBAs, as these components are not shared across domains. Some of the root domains may not have access to local disks or networks, which will need to be provided via PCIe cards or by configuring virtual I/O devices.
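While I/O changes require downtime, CPU and memory allocations for a running root domain can be changed dynamically with the `ldm` command. A minimal sketch (the domain name `ldom1` and the resource sizes are illustrative):

```shell
# Change the number of virtual CPUs assigned to a running domain
ldm set-vcpu 32 ldom1
# Change the memory allocation (requires dynamic reconfiguration
# support in the guest OS)
ldm set-memory 128G ldom1
```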
While root domains provide a level of fault isolation, they do not provide full isolation against all failure modes. Indeed, the primary objective in creating root domains is performance, with a secondary benefit of improved isolation. Application or service availability should always be architected at a higher level. Best practices for high availability, such as clustering across compute nodes and remote replication for disaster recovery, should always be applied to the degree the business requirements warrant. For example, HA should not be implemented with both nodes of the cluster located within the same physical system. However, multiple tiers (i.e., web, app, db) can be placed in different domains and then replicated on other nodes using common clustering technology or horizontal redundancy.

Furthermore, the lack of live migration capability with root domains should not be viewed as an impediment to availability. Live migration does not solve availability problems: a domain that has failed cannot be live migrated. It's not live anymore. Indeed, many of the objectives seen as desirable features of live migration (i.e., workload migration for service windows) can be obtained by architecting availability at a higher level, so that taking one node down does not interrupt the actual service. Cold workload migration can then be implemented by booting domains from a SAN, or by the use of technologies such as Solaris Cluster, Oracle WebLogic Server, or Oracle RAC that provide high availability.

In the proposed operational model, Zones become the migration entities (boundaries). Zone shutdown and restart time is significantly faster than for the main Solaris instance. This further mitigates the requirement for live migration within this scenario. Oracle has published a number of papers that discuss these concepts further and specific to individual workloads; refer to the Maximum Availability Architecture and Optimized Solutions sections of OTN (the Oracle Technology Network).
Operational Model With Solaris Zones

Zone Workloads
Using Solaris Zones as workload containers inside root domains provides a number of benefits:
- Workload isolation. Applications can each have their own virtual Solaris environment.
- Administrative isolation. Each Zone can have a different administrator.
- Security isolation. Each Zone has its own security context, and compromising one Zone does not imply that other Zones are also compromised.
- Resource control. Solaris has very robust resource management capabilities that fit well with Zones. This includes CPU capping (including for Oracle license boundaries), the Fair Share Scheduler, memory capping, and network bandwidth monitoring and management.
- Workload mobility. Zones can be easily copied, cloned and moved, usually without the need to modify the hosted application. More importantly, Zone startup and shutdown is significantly quicker than traditional VM migrations.
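The resource controls mentioned above can be sketched with `zonecfg`; the zone name and the limits below are illustrative, and memory capping assumes the resource capping daemon is enabled:

```shell
# Cap a zone at 4 CPUs' worth of compute and 16 GB of physical memory
# (zone name "appzone" and values are illustrative)
zonecfg -z appzone <<'EOF'
add capped-cpu
set ncpus=4
end
add capped-memory
set physical=16G
end
EOF
# Reboot the zone so the new limits take effect
zoneadm -z appzone reboot
```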
The use of zone workloads on top of root domains is considered best practice, even if you only intend to run a single workload. There is no performance or financial cost to running zones in the domain, and doing so provides operational flexibility: the workload can be dynamically resized or migrated to another platform. However, applications can be installed directly within the root domains if so desired.
For development systems it is often more desirable to implement domains using the service domain model. This affords maximum flexibility where performance is generally not a concern. For maximum flexibility, place the application or workload within a Zone inside the domain, as this becomes the entity that is promoted through the development and test lifecycle. Functional test can be implemented the same way: Zones can be copied, cloned and/or moved from development to test, and can further be cloned multiple times to test in parallel if desired.
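Cloning a zone for a parallel test instance can be sketched as follows (zone names are illustrative; on Solaris 11, with the zonepath on ZFS, the clone is a near-instant ZFS snapshot/clone):

```shell
# The source zone must be halted before cloning
zoneadm -z appzone halt
# Define the new zone by exporting and editing the source configuration
zonecfg -z appzone export | sed -e 's/appzone/testzone/g' | zonecfg -z testzone
# Clone the installed files from the source zone and boot the copy
zoneadm -z testzone clone appzone
zoneadm -z testzone boot
```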
Migrating Zones to the Production Environment
When maximum performance is desired, production and pre-production can be implemented with root domains. Ideally, pre-production will be on hardware identical to the production system and with an identical root domain configuration. Workloads can be migrated from development or functional test guest domains to the root domains that exist in the production and pre-production environments.
Operational Use Cases

SPARC T4-4 (4 Socket)

The four possible root complexes are defined by the top-level PCI path, and are as follows:
1. RC0: pci@400 (green)
2. RC1: pci@500 (blue)
3. RC2: pci@600 (purple)
4. RC3: pci@700 (orange)
The colors correspond to the slot diagrams.
It should be noted that the allocation of disks and on-board 1GbE ports is not spread evenly across all the root complexes: RC0 has half the disks and two 1GbE ports, RC2 has two 1GbE ports, and RC3 has the other half of the disks. This needs to be taken into consideration when creating root domains, as neither RC1 nor RC2 has access to local boot disks, and neither RC1 nor RC3 has access to local 1GbE ports. This is easily overcome by providing PCIe cards with this functionality, or by having the other domains provide virtual I/O services.
SPARC T4-4 (2 Socket)

When a T4-4 has only 2 CPUs, the number of root complexes is also reduced to two. The two possible root complexes are defined by the top-level PCI path and are as follows:
1. RC0: pci@400 (green)
2. RC1: pci@500 (blue)
As in the 4 socket case, there is some asymmetry here as well: only RC0 has access to the on-board 1GbE ports, but the disks are split between the two root complexes.
SPARC T4-2
The T4-2 has two root complexes, as shown:
1. RC0: pci@400 (green)
2. RC1: pci@500 (blue)
In the case of the T4-2, while both root complexes have access to the onboard 1GbE ports, only the RC0 domain has visibility of the on-board disks.
General Domaining Considerations

One Domain
It is of course possible to use any of the T4 platforms with a single domain, and not make use of any of the Oracle VM Server for SPARC functionality. In this case, the domain simply has access to all attached I/O. The diagrams above can be used to ensure that the I/O cards are balanced evenly throughout the system.
Two Domains
This model creates two independent root domains by assigning one or more root complexes to a second domain, and leaving the remaining root complexes under the control of the control domain.
SPARC T4-2
In the simple case of the SPARC T4-2 servers there are only two root complexes available: RC0 will be attached to the control domain, and RC1 will be assigned to the new second root domain. As can be seen from the diagrams above, the second T4-2 domain will not have access to any on-board disks, and therefore needs to be given access to disk. This can be achieved either via an HBA in one of the odd-numbered slots, or by providing virtual disk access from the primary root domain. It is preferable to use directly attached storage for this domain, as the virtual disk route creates a dependency on the control domain, which is undesirable in this model.
3. When performance is critical, it is recommended that CPU be allocated in a minimum of one-core increments, and for optimal performance on full socket boundaries.
SPARC T4-4 (2 Sockets)

Similarly, the 2 socket configuration of the T4-4 only has two root complexes: RC0 will be attached to the control domain, and RC1 assigned to the new second root domain. The second domain, while it has access to disk, does not have access to the on-board 1GbE ports. Again, in this case the on-board 10GbE ports can be used, or an appropriate network card can be installed into one of the EM slots 8-15.
SPARC T4-4 (4 Sockets)
With four root complexes to deal with, there are a number of options to consider. In the simplest case, assigning RC0 and RC1 to the control domain and RC2 and RC3 to the second domain creates two fully independent domains, each with access to both internal disk and on-board 1GbE ports. If an asymmetric I/O configuration is required, it is also possible to create a pair of domains with RC0 assigned to the control domain and RC1, RC2 and RC3 assigned to the second domain. This still allows both domains access to the internal disks and 1GbE networks. Any other combination will leave one of the domains without access to either the on-board disks or the on-board 1GbE ports, and should therefore be avoided.
Four Domains
The four root domain model is only applicable to the SPARC T4-4 (4 socket) configuration. Each of the four PCIe root complexes will be assigned to a domain: RC0 will be assigned to the control domain, and RC1, RC2 and RC3 will be assigned to the remaining three domains. In this configuration, two of the domains will not have access to the internal disks and must be provided with an alternative way of booting. The simplest way of dealing with this is, as before, placing an HBA into the relevant EM slot and booting from there. It is also possible to provide a pair of virtual disks from the RC0 and RC3 domains. This provides redundant boot disk access, and the dependent domain will only require one or the other of the service domains to be active for uninterrupted operation. This model does add operational complexity, and native access to boot disks is preferred. Similarly, two of the domains will not have access to the 1GbE ports on the server. In general this is not viewed as an issue, as all domains still have native on-board 10GbE access; but if 1GbE network access is required, this can be done either via an express module card or, if necessary, by using virtual networking from the other domains.
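The redundant virtual boot disk arrangement can be sketched as follows. All domain, service, and device names are illustrative, and the sketch assumes the RC0 (primary) and RC3 domains can both see the same backend LUN:

```shell
# In the control domain: create a virtual disk service in each serving domain
ldm add-vds primary-vds0 primary
ldm add-vds rc3-vds0 rc3dom
# Export the same backend LUN through both services
ldm add-vdsdev /dev/dsk/c0t5000CCA01234d0s2 bootdisk@primary-vds0
ldm add-vdsdev /dev/dsk/c0t5000CCA01234d0s2 bootdisk@rc3-vds0
# Give the dependent domain one virtual disk via each path
ldm add-vdisk bootdisk0 bootdisk@primary-vds0 ldom2
ldm add-vdisk bootdisk1 bootdisk@rc3-vds0 ldom2
```

Alternatively, the `mpgroup` property of `ldm add-vdsdev` can present both paths to the guest as a single multipathed virtual disk.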
Three Domains
The three domain case is similar to the four domain model, except that the domain that would have been assigned RC1 is not created, and RC1 is instead assigned to one of the other domains. RC1 is chosen because it is the only root complex with access to neither the internal disks nor the internal 1GbE networks, which allows this model to be treated just like the four domain model.
Hybrid Domains
In any of the above models, nothing precludes you from using one or two of the domains as control and possibly I/O domains for the provision of normal guest domains, while leaving the remaining domains to act as pure root domains.
Building Root Domains: How-To

See the Oracle VM Server for SPARC documentation: http://www.oracle.com/technetwork/documentation/vm-sparc-194287.html
Preparing the Control Domain
The first domain on a SPARC T-series server is always the control, or primary, domain. In the context of the root domain model, its main purpose is to host the LDoms⁵ manager. Under the root domain model, the control domain does not typically provide any virtual services other than the virtual console service. For this reason, it can usually be treated as a domain that can be used to run applications, provided the security concerns of access to the ldm commands have been satisfied. The easiest way to accomplish this is to run applications within zones on the control domain.

These instructions work on both Solaris 10 and Solaris 11 control domains, although the use of Solaris 11 is preferred. Ensure that you are running the latest SPARC T4 firmware, and the latest updates/patches for the version of Solaris you are running. In the case of a Solaris 10 domain, you will need to install the Oracle VM Server for SPARC software.

The control domain will always be left with the pci@400 root complex at a minimum. Remove the root complexes that will be used for the other root domains as part of the initial configuration.
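Releasing the root complexes destined for the other root domains can be sketched as follows. The root complex names follow the T4-4 examples above and the configuration name is illustrative; a reboot of the control domain is required for the delayed reconfiguration to take effect:

```shell
# In the control domain: start a delayed reconfiguration and release
# the root complexes that the new root domains will own
ldm start-reconf primary
ldm remove-io pci@500 primary
ldm remove-io pci@600 primary
ldm remove-io pci@700 primary
# Save the configuration to the service processor, then reboot to apply
ldm add-config initial-root
shutdown -i6 -g0 -y
```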
Console access to the domains is provided via the control domain, which should be connected to the management network. It may also be worth considering attaching the root domains to the management network as well, to allow ssh access and for OS installation and updates.

5. LDoms stands for Logical Domains and was the product name superseded by Oracle VM Server for SPARC. However, the naming convention of the commands and processes has not changed.
Building Root Domains

The following steps are examples. The values for CPU and memory allocation and the root complexes may need to be changed to reflect the actual configuration.
# ldm create ldom1
# ldm set-vcpu 64 ldom1
# ldm set-memory 256G ldom1
# ldm add-io pci@500 ldom1
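A sketch of completing and saving the new domain, assuming the steps above (the domain and configuration names are illustrative, and the console port shown is vntsd's usual starting port):

```shell
# Bind the resources and start the new root domain
ldm bind-domain ldom1
ldm start-domain ldom1
# Connect to its console (the actual port is shown by "ldm list")
telnet localhost 5000
# Save the final configuration to the service processor
ldm add-config final-root
```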
Building Root Domains with virtual I/O for boot and networking
It is possible to configure the root domains so that they use virtual disks and networks provided by other root domains that have direct access to internal disks and networks. In this case, it is simply a matter of configuring virtual disk and network services and adding them to the logical domain configuration above.

For more information about Oracle's virtualization solutions, visit oracle.com/virtualization.
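Configuring such virtual disk and network services in a serving root domain can be sketched as follows (device paths and service, switch, and domain names are illustrative):

```shell
# In the control domain: create virtual disk and network services
# backed by devices owned by the primary (RC0) root domain
ldm add-vds primary-vds0 primary
ldm add-vsw net-dev=net0 primary-vsw0 primary
# Export an internal disk and attach it to the dependent domain
ldm add-vdsdev /dev/dsk/c0t1d0s2 vdisk1@primary-vds0
ldm add-vdisk vdisk1 vdisk1@primary-vds0 ldom1
# Give the dependent domain a virtual network interface on the switch
ldm add-vnet vnet0 primary-vsw0 ldom1
```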
Implementing Root Domains in Oracle VM Server for SPARC
February 2013
Authors: Mikel Manitius, Michael Ramchand, Jeff Savit
Contributing Author: Benoit Chaffanjon

Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065, U.S.A.

Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
oracle.com
Copyright 2013, Oracle and/or its affiliates. All rights reserved. This document is provided for information purposes only, and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our prior written permission. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group. 0113