
Dolphin Express for MySQL

Installation and Reference Guide

Dolphin Interconnect Solutions ASA

Dolphin Express for MySQL: Installation and Reference Guide


by Dolphin Interconnect Solutions ASA

This document describes the Dolphin Express software stack version 3.3.0.
Published November 13th, 2007
Copyright © 2007 Dolphin Interconnect Solutions ASA
Published under the GNU General Public License v2

Table of Contents

Abstract
1. Introduction & Overview
    1. Who needs Dolphin Express and SuperSockets?
    2. How do Dolphin Express and SuperSockets work?
    3. Terminology
    4. Contact & Feedback: Dolphin Support
2. Requirements & Planning
    1. Supported Platforms
        1.1. Hardware
            1.1.1. Supported Platforms
            1.1.2. Recommended Node Hardware
            1.1.3. Recommended Frontend Hardware
        1.2. Software
            1.2.1. Linux
            1.2.2. Windows
            1.2.3. Solaris
            1.2.4. Others
    2. Interconnect Planning
        2.1. Nodes to Equip with Dolphin Express Interconnect
            2.1.1. MySQL Cluster
        2.2. Interconnect Topology
        2.3. Physical Node Placement
3. Initial Installation
    1. Overview
    2. Installation Requirements
        2.1. Live Installation
        2.2. Non-GUI Installation
            2.2.1. No X / GUI on Frontend
            2.2.2. No X / GUI Anywhere
    3. Adapter Card Installation
    4. Software and Cable Installation
        4.1. Overview
        4.2. Starting the Software Installation
        4.3. Working with the dishostseditor
            4.3.1. Cluster Edit
            4.3.2. Node Arrangement
            4.3.3. Cabling Instructions
        4.4. Cluster Cabling
            4.4.1. Connecting the cables
            4.4.2. Verifying the Cabling
        4.5. Finalising the Software Installation
            4.5.1. Static SCI Connectivity Test
            4.5.2. SuperSockets Configuration Test
            4.5.3. SuperSockets Performance Test
        4.6. Handling Installation Problems
        4.7. Interconnect Validation with sciadmin
            4.7.1. Installing sciadmin
            4.7.2. Starting sciadmin
            4.7.3. Cluster Overview
            4.7.4. Cabling Correctness Test
            4.7.5. Fabric Quality Test
        4.8. Making Cluster Application use Dolphin Express
            4.8.1. Generic Socket Applications
            4.8.2. Native SCI Applications
            4.8.3. Kernel Socket Services
4. Update Installation
    1. Complete Update
    2. Rolling Update
5. Manual Installation
    1. Installation under Load
    2. Installation of a Heterogeneous Cluster
    3. Manual RPM Installation
        3.1. RPM Package Structure
        3.2. RPM Build and Installation
    4. Unpackaged Installation
6. Interconnect and Software Maintenance
    1. Verifying Functionality and Performance
        1.1. Low-level Functionality and Performance
            1.1.1. Availability of Drivers and Services
            1.1.2. Cable Connection Test
            1.1.3. Static Interconnect Test
            1.1.4. Interconnect Load Test
            1.1.5. Interconnect Performance Test
        1.2. SuperSockets Functionality and Performance
            1.2.1. SuperSockets Status
            1.2.2. SuperSockets Functionality
        1.3. SuperSockets Utilization
    2. Replacing SCI Cables
    3. Replacing a PCI-SCI Adapter
    4. Physically Moving Nodes
    5. Replacing a Node
    6. Adding Nodes
    7. Removing Nodes
7. MySQL Operation
    1. MySQL Cluster
        1.1. SuperSockets Poll Optimization
        1.2. NDBD Deadlock Timeout
        1.3. SCI Transporter
    2. MySQL Replication
8. Advanced Topics
    1. Notification on Interconnect Status Changes
        1.1. Interconnect Status
        1.2. Notification Interface
        1.3. Setting Up and Controlling Notification
            1.3.1. Configure Notification via the dishostseditor
            1.3.2. Configure Notification Manually
            1.3.3. Verifying Notification
            1.3.4. Disabling and Enabling Notification Temporarily
    2. Managing IRM Resources
        2.1. Updates with Modified IRM Configuration
9. FAQ
    1. Hardware
    2. Software
A. Self-Installing Archive (SIA) Reference
    1. SIA Operating Modes
        1.1. Full Cluster Installation
        1.2. Node Installation
        1.3. Frontend Installation
        1.4. Installation of Configuration File Editor
        1.5. Building RPM Packages Only
        1.6. Extraction of Source Archive
    2. SIA Options
        2.1. Node Specification
        2.2. Installation Path Specification
        2.3. Installing from Binary RPMs
        2.4. Preallocation of SCI Memory
        2.5. Enforce Installation
        2.6. Configuration File Specification
        2.7. Batch Mode
        2.8. Non-GUI Build Mode
        2.9. Software Removal
B. sciadmin Reference
    1. Startup
    2. Interconnect Status View
        2.1. Icons
        2.2. Operation
            2.2.1. Cluster Status
            2.2.2. Node Status
    3. Node and Interconnect Control
        3.1. Admin Menu
        3.2. Cluster Menu
        3.3. Node Menu
        3.4. Cluster Settings
        3.5. Adapter Settings
    4. Interconnect Testing & Diagnosis
        4.1. Cable Test
        4.2. Traffic Test
C. Configuration Files
    1. Cluster Configuration
        1.1. dishosts.conf
            1.1.1. Basic settings
            1.1.2. SuperSockets settings
            1.1.3. Miscellaneous Notes
        1.2. networkmanager.conf
        1.3. cluster.conf
    2. SuperSockets Configuration
        2.1. supersockets_profiles.conf
        2.2. supersockets_ports.conf
    3. Driver Configuration
        3.1. dis_irm.conf
            3.1.1. Resource Limitations
            3.1.2. Memory Preallocation
            3.1.3. Logging and Messages
        3.2. dis_ssocks.conf
D. Platform Issues and Software Limitations
    1. Platforms with Known Problems
    2. IRM
    3. SuperSockets

Abstract
This document describes the installation of the Dolphin Interconnect Solutions (DIS) Dolphin Express interconnect hardware and the DIS software stack, including SuperSockets, on single machines or on a cluster of machines. This software stack is needed to use Dolphin's Dolphin Express high-performance interconnect products and consists of drivers (kernel modules), user space libraries and applications, an SDK, documentation and more. SuperSockets drastically accelerate generic socket communication as used by clustered applications.


Chapter 1. Introduction & Overview


1. Who needs Dolphin Express and SuperSockets?
Clustered applications running on multiple machines that communicate via an Ethernet-based network often suffer from the delays that occur when data needs to be exchanged between processes running on different machines. These delays, caused by the communication time, make processes wait for data when they could otherwise perform useful work.

Dolphin Express is a combination of high-performance interconnect hardware that replaces the Ethernet network and a highly optimized software stack. One part of this software stack is SuperSockets, which implements a bypass of the TCP/UDP/IP protocol stack for standard socket-based inter-process communication. This bypass moves data directly via the high-performance interconnect and thereby reduces the minimal latency, typically by a factor of 10 or more, with 100% binary application compatibility.

Using this combined software/hardware approach with MySQL Cluster, throughput improvements of 300% and more for the TPC-C-like DBT2 benchmark have already been measured on small clusters. For larger clusters, this advantage continues to increase as the communication fraction of the processing time increases.

2. How do Dolphin Express and SuperSockets work?


The Dolphin Express hardware provides means for a process on one machine to write data directly into the address space of a process running on a remote machine. This can be done using either direct store operations of the CPU (for lowest latency) or the DMA engine of the Dolphin Express interconnect adapter (for lowest CPU utilization).

SuperSockets consist of both kernel modules and a user-space library. The implementation on kernel level ensures that the SuperSockets socket implementation is fully compatible with the TCP/UDP/IP-based sockets provided by the operating system. By being explicitly preloaded, the user-space library operates between the unmodified binary of the application and the operating system and intercepts all socket-related function calls. Based on the system configuration and an optional user-provided configuration, the library makes a first decision whether a function call will be processed by SuperSockets or by the standard socket implementation and redirects it accordingly. The SuperSockets kernel module then performs the operation on the Dolphin Express interconnect. If necessary, it can transparently fall back to Ethernet, and later switch forward to the interconnect again, even while the socket is under load.
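As an illustration of the preloading mechanism, an unmodified socket application can be started with the SuperSockets user-space library preloaded. The library path below is a placeholder rather than the actual file name; see Section 4.8.1, Generic Socket Applications for the supported way to launch applications over SuperSockets.

    # Illustration only: preload a socket-intercepting library so that an
    # unmodified binary has its socket calls redirected. The library path is
    # a placeholder; see Section 4.8.1 for the supported method.
    LD_PRELOAD=<SuperSockets preload library> ./my_socket_application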

3. Terminology
We define some terms that will be used throughout this document.

adapter: A PCI-to-SCI (D33x series), PCI-Express-to-SCI (D35x series) or PCI-Express fabric (DXH series) adapter. This is the Dolphin Express hardware installed in the cluster nodes.

node: A computer which is part of the Dolphin Express interconnect, which means it has an adapter installed. All nodes together constitute the cluster.

CPU architecture: The CPU architecture relevant in this guide is characterized by the addressing width of the CPU (32 or 64 bit) and the instruction set (x86, Sparc, etc.). If these two characteristics are identical, the CPU architecture is identical for the scope of this guide.

link: A directed point-to-point connection in the SCI interconnect. Physically, a link is the cable leading from the output of one adapter to the input of another adapter.

ringlet: For an SCI interconnect configured in a torus topology, the links are connected as multiple closed rings. For a two-dimensional torus topology, as when using D352 adapters, these rings can be considered to be the columns and rows. These rings are called ringlets.

frontend: The single computer that runs the software that monitors and controls the nodes in the cluster. For increased fault tolerance, the frontend should not be part of the Dolphin Express interconnect it controls, although this is possible. Instead, the frontend should communicate with the nodes out-of-band, which means via Ethernet.

installation machine: The installation script is typically executed on the frontend, but can also be executed on another machine that is neither a node nor the frontend, but has network (ssh) access to all nodes and the frontend. This machine is the installation machine.

kernel build machine: The interconnect drivers are kernel modules and thus need to be built for the exact kernel running on the nodes (otherwise, the kernel will refuse to load them). To build kernel modules on a machine, the kernel-specific include files and kernel configuration have to be installed; these are not installed by default on most distributions. You will need to have one kernel build machine available which has these files installed (contained in the kernel-devel RPM that matches the installed kernel version) and that runs the exact same kernel version as the nodes. Typically, the kernel build machine is one of the nodes itself, but you can choose to build the kernel modules on any other machine that fulfills the requirements listed above.

cluster: All nodes constitute the cluster.

network manager: The network manager is a daemon process named dis_networkmgr running on the frontend. It is part of the Dolphin software stack and manages and controls the cluster using the node managers running on all nodes. The network manager knows the interconnect status of all nodes.

node manager: The node manager is a daemon process that runs on each node and provides remote access to the interconnect driver and other node status to the network manager. It reports status and performs actions like configuring the installed adapter or changing the interconnect routing table if necessary.

self-installing archive (SIA): A single executable shell command file (for Linux and Solaris) that is used to compile and install the Dolphin software stack in all required variants. It largely simplifies the deployment and management of a Dolphin Express-based cluster.

Scalable Coherent Interface (SCI): One of the interconnect implementations that can be used with the Dolphin Express software, like SuperSockets and SISCI. SCI is an IEEE standard; the implementations offered by Dolphin are the D33x and D35x series of adapter cards.

SISCI: SISCI (Software Infrastructure for SCI) is the user-level API to create applications that make direct use of the Dolphin Express interconnect capabilities. Despite its inherited name, it also supports other interconnect implementations offered by Dolphin, like DSX.

4. Contact & Feedback: Dolphin Support


If you have any problems with the procedures described in this document, or have suggestions for improvement, please don't hesitate to contact Dolphin's support team via <support@dolphinics.com>. For updated versions of the software stack and this document, please check the download section at http://www.dolphinics.com.

Chapter 2. Requirements & Planning


Before you deploy a Dolphin Express solution, either by adding it to an existing system or by planning it into a new system, some considerations regarding the selection of products and the physical setup are necessary.

1. Supported Platforms
The Dolphin Express software stack is designed to run on all current cluster hardware and software platforms, and also supports and adapts to platforms that are several years old to ensure long-term support. Generally, Dolphin strives to support every platform that can run any version of Windows, Linux or Solaris and offers a PCI (Express) slot. In addition to this general approach, we qualify certain platforms together with our partners; these platforms are then guaranteed to run the qualified application and to perform optimally. We also test platforms internally and externally for general functionality and performance. For details, please see Appendix D, Platform Issues and Software Limitations.

1.1. Hardware
1.1.1. Supported Platforms
The Dolphin Express hardware (interconnect adapters) complies with the PCI industry standard (either PCI 2.2 64bit/66MHz or PCI-Express 1.0a) and will thus operate in any machine that offers compliant slots. Supported CPU architectures are x86 (32 and 64 bit), PowerPC and PowerPC64, Sparc and IA-64. However, some combinations of CPU and chipset implementations offer sub-optimal performance, which should be considered when planning a new system. A few cases are documented in which bugs in the chipset have shown up with our interconnect, as it puts a lot of load onto the related components. For the hardware platforms qualified or tested with Dolphin Express, please see Appendix D, Platform Issues and Software Limitations. If you have questions about your specific hardware platform, please contact Dolphin support.

1.1.2. Recommended Node Hardware


The hardware platform for the nodes should be chosen from the supported platforms as described above. In addition to the Dolphin Express-specific requirements, you need to consult your MySQL Cluster expert / consultant on the recommended configuration for your application. You need to make sure that each node / machine has one full-height, half-length PCI/PCI-X/PCI-Express slot available. The power consumption of Dolphin Express adapters is between 5W and 15W (consult the separate data sheets for details).

Note
Half-height slots can be used with the Dolphin DXH series of adapters. A half-height version of the SCI adapters will be available soon; please contact Dolphin support for availability.

The Dolphin Express interconnect is fully inter-operable between all supported hardware platforms, even across different PCI or CPU architectures. As usual, care must be taken by the applications if data with different endianness is communicated.

1.1.3. Recommended Frontend Hardware


The frontend only runs a lightweight network manager service, which does not impose special hardware requirements. However, the frontend should not be fully loaded, to ensure fast operation of the network manager service. The frontend requires a reliable Ethernet connection to all nodes.

1.2. Software
Dolphin Express supports a variety of operating systems that are listed below.


1.2.1. Linux
The Dolphin Express software stack can be compiled for all 2.6 kernel versions and most 2.4 kernel versions. A few extra packages (like the kernel include files and configuration) need to be installed for the compilation. Dolphin only provides source-based distributions, which are compiled for the exact kernel and hardware version you are using. Software stacks operating on different kernel versions are of course fully inter-operable for inter-node communication.

Dolphin Express fully supports native 32-bit and 64-bit platforms. On 64-bit platforms offering both 32-bit and 64-bit runtime environments, SuperSockets will support 32-bit applications if the compilation environment for 32-bit is also installed. Otherwise, only the native 64-bit runtime environment is supported. For more information, please refer to the FAQ chapter, Q: 2.1.6.

Please refer to the release notes of the software stack version you are about to install for the current list of tested Linux distributions and kernel versions. Installation and operation on Linux distributions and kernel versions that are not in this list will usually work as well, but especially the most recent Linux versions may cause problems if they have not yet been qualified by Dolphin.
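As a quick check before the installation, you can verify on the kernel build machine that the kernel headers matching the running kernel are installed. This is a sketch for an RPM-based distribution; package naming may differ on your system.

    # Show the running kernel version
    uname -r

    # Check that a matching kernel-devel (or kernel-source) package is installed
    rpm -q kernel-devel-$(uname -r)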

1.2.2. Windows
The Dolphin Express software stack operates on 32- and 64-bit versions of Windows NT 4.0, Windows 2000, Windows 2003 Server and Windows XP. We provide MSI binary installer packages for each Windows version. TBA: More information on the available components under Windows.

1.2.3. Solaris
Solaris 2.6 through 9 on Sparc is supported (excluding SuperSockets). MySQL Cluster can be used via the SCI Transporter provided by MySQL, using the SISCI interface provided by Dolphin.

Note
Support for Solaris 10 on Sparc and AMD64 (x86_64) including SuperSockets is under development. Ask Dolphin support for the current status.

1.2.4. Others
The Dolphin Express software stack, excluding SuperSockets, also runs on VxWorks, LynxOS and HP-UX. Contact Dolphin support with your requirements.

2. Interconnect Planning
This section discusses the decisions that are necessary when planning to install a Dolphin Express Interconnect.

2.1. Nodes to Equip with Dolphin Express Interconnect


Depending on the application that will run on the cluster, the choice of which machines to equip with the Dolphin Express interconnect differs.

2.1.1. MySQL Cluster


For best performance, all machines that run either NDB or MySQL server processes should be interconnected with the Dolphin Express interconnect. Although it is possible to equip only a subset of the machines with Dolphin Express, doing so will introduce new bottlenecks. Machines that serve as application servers (clients sending queries to MySQL server processes) typically benefit little from being part of the interconnect. Please analyze the individual scenario for a definitive recommendation.


Machines that serve as MySQL frontends (e.g. machines that run the MySQL Cluster management daemon ndb_mgmd) do not benefit from the Dolphin Express interconnect.

Note
The machine that runs the Dolphin network manager should not be equipped with the Dolphin Express interconnect, as this would reduce the level of fault tolerance.

2.2. Interconnect Topology


For small clusters of just two nodes, a number of possible approaches exist:

Lowest cost: Connect the two nodes with one single-channel D351 adapter in each node.

Highest performance: Connect the two nodes with one dual-channel D350 adapter in each node. This increases the bandwidth and adds redundancy at the same time.

Best scalability: Use D352 adapters to connect the two nodes. The second dimension of this 2D adapter will not be used, but it is possible to expand this cluster in a fault-tolerant way.

When going to 4 nodes or more, the topology has to be a multi-dimensional torus, typically a 2D torus built with D352 PCI-Express-to-SCI adapters. With D352, you can build clusters of any size up to a recommended maximum of 256 nodes. This is the typical topology for database clusters. For large clusters, it is possible to use a 3D torus topology. For 3- and 4-node clusters, special topologies without any fail-over delays can be built.

It is possible to install more than one interconnect fabric in a cluster by installing two adapters into each node. These interconnect fabrics work independently from each other and very efficiently and transparently increase the bandwidth and throughput (a real factor of two for two fabrics) while adding redundancy at the same time.

Note
For some chipsets, PCI performance does not scale well, reducing the performance improvement of a second fabric. If this feature is important to you, contact Dolphin support to make sure that you choose a chipset for which the full performance will be delivered.

2.3. Physical Node Placement


Generally, nodes that are to be equipped with Dolphin Express interconnect adapters should be placed close to each other to keep cable lengths short. This reduces costs and allows for better routing of the cables. For larger clusters, it makes sense to arrange the nodes in analogy to the regular 2D-torus topology of the interconnect.

The maximum cable length is 10m for copper cables. For situations where nodes need to be placed at a significant distance, it is possible to use fiber instead of copper for the interconnect. Please ask Dolphin support for details.

The minimal cable bend radius is 25mm, which allows nodes to be placed 5cm apart. Connecting nodes or blades that are less than 5cm apart, like 1U nodes in a 19" rack that are a single rack unit apart, typically causes no problem as long as the effective bend radius is 25mm or more. Please contact Dolphin support for other cabling scenarios.

Once you have decided on the physical node placement, the cables should be ordered in the appropriate lengths. The Dolphin sales engineer will assist you in selecting the right cable lengths.

Chapter 3. Initial Installation


This chapter guides you through the initial hardware and software installation of Dolphin Express and SuperSockets. This means that no Dolphin software is installed on the nodes or the frontend prior to these instructions. To update an existing installation, please refer to Chapter 4, Update Installation; to add new nodes to a cluster, please refer to Adding Nodes. The recommended installation procedure, which is described in this chapter, uses the Self-Installing Archive (SIA) distribution of the Dolphin Express software stack, which can be used for all Linux versions that support the RPM package format. Installation on other Linux platforms is covered in Section 4, Unpackaged Installation.

1. Overview
The initial installation of Dolphin Express hardware and software follows these steps, which are described in detail in the following sections:

1. Verification of the installation requirements: study Section 2, Installation Requirements. The setup script itself will also verify that these requirements are met and indicate what is missing.

2. Installation of the interconnect adapters in the nodes (see Section 3, Adapter Card Installation).

Note
The cables should not be installed in this step!

3. Installation of software and cables. This step is refined in Section 4, Software and Cable Installation.

2. Installation Requirements
For the SIA-based installation of the full cluster and the frontend, the following requirements have to be met:

Homogeneous cluster nodes: All nodes of the cluster must be of the same CPU architecture and run the same kernel version. The frontend machine may be of a different CPU architecture and kernel version!

Note
The installation of the Dolphin Express software on a system that does not satisfy this requirement is described in Section 2, Installation of a Heterogeneous Cluster.

RPM support: The Linux distribution on the nodes, the frontend and the installation machine needs to support RPM packages. Both major distributions from Red Hat and Novell (SuSE) use RPM packages.

Note
On platforms that do not support RPM packages, it is also possible to install the Dolphin Express software. Please see Section 4, Unpackaged Installation for instructions.

Installed RPM packages: To build the Dolphin Express software stack, a few RPM packages that are often not installed by default are required: qt and qt-devel (> version 3.0.5), glibc-devel and libgcc (32- and 64-bit, depending on which binary formats should be supported), rpm-build, and the kernel header files and configuration (typically a kernel-devel or kernel-source RPM that exactly(!) matches the version of the installed kernel).
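On a yum-based distribution, these prerequisites can also be installed manually before running the SIA. The following is a sketch; package names may differ slightly between distributions, and the kernel-devel package must match the running kernel exactly.

    # Install the build prerequisites (names may vary by distribution)
    yum install qt qt-devel glibc-devel libgcc rpm-build

    # Install the kernel headers/configuration matching the running kernel
    yum install kernel-devel-$(uname -r)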


Note
The SIA will check for these packages, report which packages might be missing, and will offer to install them if the yum RPM management system is supported on the affected machine. All required RPM packages are within the standard set of RPM packages offered for your Linux distribution, but may not be installed by default. If the qt RPMs are not available, the Dolphin Express software stack can still be built, but the GUI applications to configure and manage the cluster will not be available. Please see below (Section 2.2, Non-GUI Installation) on how to install the software stack in this case.

GUI support: For the initial installation, the installation machine should be able to run GUI applications via X.

Note
If the required configuration files are already available prior to the installation, a GUI is not required (see Section 2.2, Non-GUI Installation).

Disk space: To build the RPM packages, about 500MB of free disk space in the system's temporary directory (typically /tmp on Linux) is required on the kernel build machine and the frontend.

Note
It is possible to make the SIA use a specific temporary directory for building by means of the --build-root option.
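As an example, assuming the option takes the directory as its argument (see Appendix A, Self-Installing Archive (SIA) Reference for the exact syntax), an invocation with a custom build directory could look like this:

    # Build in /scratch/dis-build instead of the default temporary directory
    sh DIS_install_3.3.0.sh --build-root /scratch/dis-build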

2.1. Live Installation


Dolphin Express can be installed into a cluster which is currently in operation without stopping the cluster application from working. This requires that the application running on the cluster can cope with single nodes going down. It is only necessary to turn off each node once to install the adapter. The software installation can be performed under load, although minor performance impacts are possible. For a description of this installation type, please proceed as described in Section 1, Installation under Load.

2.2. Non-GUI Installation


The Dolphin software includes two GUI tools:

dishostseditor is used to create the interconnect configuration file /etc/dis/dishosts.conf and the network manager configuration file /etc/dis/networkmanager.conf. It is needed once during the initial cluster installation, and each time nodes are added to or removed from the cluster.

sciadmin is used to monitor and control the cluster interconnect.

2.2.1. No X / GUI on Frontend


If the frontend does not support running GUI applications, but another machine in the network does, it is possible to run the installation on this machine. The only requirement is ssh access from the installation machine to the frontend and all nodes. This installation mode can be chosen by executing the SIA on the installation machine and specifying the frontend name when asked for it. In this scenario, the dishostseditor will be compiled, installed and executed on the installation machine, and the generated configuration files will be transferred to the frontend by the installer.

2.2.2. No X / GUI Anywhere


If no machine in the network has the capability to run GUI applications, you can still use the SIA-based installation. In this case, it is necessary to create the correct configuration files on another machine and store them in /etc/dis on the frontend before executing the SIA on the frontend (not on another machine).


In this scenario, no GUI application is run at all during the installation. To create the configuration files on another machine, you can either run the SIA with the --install-editor option if it is a Linux machine (see the example after the list below), or install a binary version of the dishostseditor if it is a Windows-based machine. Alternatively, you can send the necessary information to create the configuration files to Dolphin support, who will then provide you with the matching configuration files and the cabling instructions. This information includes:

- external hostnames (or IP addresses) of all nodes
- adapter type and number of fabrics (1 or 2)
- hostnames (or IP addresses/subnet) which should be accelerated with SuperSockets (default is the list of hostnames provided above)
- planned interconnect topology (default is derived from the number of nodes and the adapter type)
- a description of how the nodes are physically located (to avoid cabling problems)
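For example, to install only the configuration file editor on a Linux machine with X support, the SIA can be invoked with the option mentioned above (a sketch; see Appendix A, Self-Installing Archive (SIA) Reference for details):

    # Install only the dishostseditor on this machine
    sh DIS_install_3.3.0.sh --install-editor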

3. Adapter Card Installation


To install an adapter into a node, the node needs to be powered down. Insert the adapter into a free PCI slot that matches the adapter:

- any type of PCI slot for D33x adapters
- a 4x, 8x or 16x PCI-Express slot for D351 and D352 adapters
- an 8x or 16x PCI-Express slot for D350 adapters

Make sure you are properly grounded to avoid static discharges that may destroy the hardware. When the card is properly inserted and fixed, close the enclosure and power up the node again. The LED(s) on the adapter slot cover have to light up orange. Proceed this way with all nodes. It is recommended to set up all nodes identically, i.e. to use the same slot for the adapter (if the node hardware is identical).

You do not yet need to connect the cables at this point: detailed cabling instructions customized for the specific cluster will be created during the software installation, which will guide you through the cabling. Generally, the cables can safely be connected and disconnected with the nodes powered up, as SCI is a fully hot-plug capable interconnect.

Note
If you know how to connect the cables, you can do so now. Please be advised to inspect and verify the cabling for correctness as described in the remainder of this chapter, though.
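After powering a node back up, you can verify that the operating system detects the adapter even before any Dolphin software is installed. The sketch below assumes the Dolphin PCI vendor ID 11c8; the exact device description string may vary with the adapter model and lspci version.

    # Look for the Dolphin adapter in the list of PCI devices
    lspci | grep -i dolphin

    # Alternatively, match on the assumed Dolphin PCI vendor ID
    lspci -d 11c8: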

4. Software and Cable Installation


After the adapters are installed, the software has to be installed next. On the nodes, the hardware driver and additional kernel modules, user space libraries and the node manager have to be installed. On the frontend, the network manager and the cluster administration tool will be installed. An additional RPM for SISCI development (SISCI-devel) will be created for both frontend and nodes, but will not be installed. It can be installed as needed in case SISCI-based applications or libraries (like NMPI) need to be compiled from source.
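If the SISCI development package is needed later, it can be installed from the binary RPM that the installation leaves behind. The file name below corresponds to the 3.3.0 x86_64 build shown later in this chapter; adjust the path and version to your build.

    # Install the SISCI development package on a node when needed
    rpm -ivh /tmp/node_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm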

4.1. Overview
The integrated cluster and frontend installation is the default operation of SIA, but can be specified explicitly with the --install-all option. It works as follows:


The SIA is executed on the installation machine with root permissions. The installation machine is typically the machine that will serve as frontend, but can be any other machine if necessary (see Section 2.2.1, No X / GUI on Frontend). The SIA controls the building, installation and test operations on the remote nodes via ssh. Therefore, password-less ssh access to all remote nodes is required. If password-less ssh access is not set up between the installation machine, frontend and nodes, the SIA offers to set this up during the installation. The root passwords of all machines are required for this.

The installation then proceeds as follows:

- The binary RPMs for the nodes and the frontend are built on the kernel build machine and the frontend, respectively. The kernel build machine needs to have the kernel headers and configuration installed, while the frontend and the installation machine only compile user-space applications.
- The node RPMs with the kernel modules are installed on all nodes, the kernel modules are loaded and the node manager is started. At this stage, the interconnect is not yet configured.
- On an initial installation, the dishostseditor is installed and executed on the installation machine to create the cluster configuration files. This requires user interaction.
- The cluster configuration files are transferred to the frontend, and the network manager is installed and started on the frontend. It will in turn configure all nodes according to the configuration files. The cluster is now ready to utilize the Dolphin Express interconnect.
- A number of tests are executed to verify that the cluster is functional and to get basic performance numbers.

For other operation modes, such as installing specific components on the local machine only, please refer to Appendix A, Self-Installing Archive (SIA) Reference.

4.2. Starting the Software Installation


Log into the chosen installation machine, become root and make sure that the SIA file is stored in a directory with write access (/tmp is fine). Execute the script:
# sh DIS_install_<version>.sh

The script will ask questions to retrieve information for the installation. You will notice that all questions are Yes/no questions, and that the default answer is marked by a capital letter, which can be chosen by just pressing Enter. A typical installation looks like this:
[root@scimple tmp]# sh DIS_install_3.3.0.sh
Verifying archive integrity... All good.
Uncompressing Dolphin DIS 3.3.0
#* Logfile is /tmp/DIS_install.log_140 on tiger-0
#*
#+ Dolphin ICS - Software installation (version: 1.52 $ of: 2007/11/09 16:31:32 $)
#+
#* Installing a full cluster (nodes and frontend) .
#* This script will install Dolphin Express drivers, tools and services
#+ on all nodes of the cluster and on the frontend node.
#+
#+ All available options of this script are shown with option '--help'
# >>> OK to proceed with cluster installation? [Y/n]y
# >>> Will the local machine <tiger-0> serve as frontend? [Y/n]y

The default choice is to use the local machine as frontend. If you answer n, the installer will ask you for the hostname of the designated frontend machine. Each cluster needs its own frontend machine. Please note that the complete installation is logged to a file which is shown at the very top (here: /tmp/DIS_install.log_140). In case of installation problems, this file is very useful to Dolphin support.


#* NOTE: Cluster configuration files can be specified now, or be generated #+ ..... during the installation. # >>> Do you have a 'dishosts.conf' file that you want to use for installation? [y/N]n

Because this is the initial installation, no installed configuration files could be found. If you have prepared or received configuration files, they can be specified now by answering y. In this case, no GUI application needs to run during the installation, allowing for a shell-only installation. For the default answer, the hostnames of the nodes need to be specified (see below), and the cluster configuration is created later on using the GUI application dishostseditor.
#* NOTE:
#+ No cluster configuration file (dishosts.conf) available.
#+ You can now specify the nodes that are attached to the Dolphin
#+ Express interconnect. The necessary configuration files can then
#+ be created based on this list of nodes.
#+
#+ Please enter hostname or IP addresses of the nodes one per line.
#* When done, enter a single colon ('.').
#+ (proposed hostname is given in [brackets])
# >>> node hostname/IP address <colon '.' when done> []tiger-1
# >>> node hostname/IP address <colon '.' when done> [tiger-2] -> tiger-2
# >>> node hostname/IP address <colon '.' when done> [tiger-3] -> tiger-3
# >>> node hostname/IP address <colon '.' when done> [tiger-4] -> tiger-4
# >>> node hostname/IP address <colon '.' when done> [tiger-5] -> tiger-5
# >>> node hostname/IP address <colon '.' when done> [tiger-6] -> tiger-6
# >>> node hostname/IP address <colon '.' when done> [tiger-7] -> tiger-7
# >>> node hostname/IP address <colon '.' when done> [tiger-8] -> tiger-8
# >>> node hostname/IP address <colon '.' when done> [tiger-9] -> tiger-9
# >>> node hostname/IP address <colon '.' when done> [tiger-10].

The hostnames or IP addresses of all nodes need to be entered. The installer suggests hostnames in brackets where possible. To accept a suggestion, just press Enter; otherwise, enter the hostname or IP address. Each entry is verified to represent an accessible hostname. If a node has multiple IP addresses / hostnames, make sure you specify the one that is visible to the installation machine and the frontend. When all hostnames have been entered, enter a single '.' to finish.
#* NOTE: #+ The kernel modules need to be built on a machine with the same kernel #* version and architecture of the interconnect node. By default, the first #* given interconnect node is used for this. You can specify another build #* machine now. # >>> Build kernel modules on node tiger-1 ? [Y/n]y

If you answer n at this point, you can enter the hostname of another machine on which the kernel modules will be built. Make sure it matches the nodes in CPU architecture and kernel version.
# >>> Can you access all machines (local and remote) via password-less ssh? [Y/n]y

The installer will later verify that the password-less ssh access actually works. If you answer n, the installer will set up password-less ssh for you on all nodes and the frontend. You will need to enter the root password once for each node and for the frontend. The password-less ssh access remains active after the installation. To disable it again, remove the file /root/.ssh/authorized_keys from all nodes and the frontend.
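If you prefer to set up password-less ssh access yourself before running the installer, a minimal sketch (run as root on the installation machine; the hostnames are examples) looks like this:

    # Generate a key pair if none exists yet (accept an empty passphrase)
    ssh-keygen -t rsa

    # Copy the public key to the frontend and to every node
    for host in tiger-0 tiger-1 tiger-2; do
        ssh-copy-id root@$host
    done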
#* NOTE:
#+ It is recommnended that interconnect nodes are rebooted after the
#+ initial driver installation to ensure that large memory allocations will succeed.
#+ You can omitt this reboot, or do it anytime later if necesary.
# >>> Reboot all interconnect nodes (tiger-1 tiger-2 tiger-3 tiger-4 tiger-5 tiger-6 tiger-7 tiger-8 tiger

For optimal performance, the low-level driver needs to allocate some amount of kernel memory. This allocation can fail on a system that has been under load for a long time. If you are not installing on a live system, rebooting the nodes is therefore offered here. You can perform the reboot manually later on to achieve the same effect. If chosen, the reboot will be performed by the installer without interrupting the installation procedure.
#* NOTE:
#+ About to INSTALL Dolphin Express interconnect drivers on these nodes:
... tiger-1
... tiger-2
... tiger-3
... tiger-4
... tiger-5
... tiger-6
... tiger-7
... tiger-8
... tiger-9
#+ About to BUILD Dolphin Express interconnect drivers on this node:
... tiger-1
#+ About to install management and control services on the frontend machine:
... tiger-0
#* Installing to default target path /opt/DIS on all machines ..
(or the current installation path if this is an update installation).
# >>> OK to proceed? [Y/n]y

The installer presents an installation summary and asks for confirmation. If you answer n at this point, the installer will exit and the installation needs to be restarted.
#* NOTE:
#+ Testing ssh-access to all cluster nodes and gathering configuration.
#+
#+ If you are asked for a password, the ssh access to this node without
#+ password is not working. In this case, you need to interrupt with CTRL-c
#+ and restart the script answering 'no' to the intial question about ssh.
... testing ssh to tiger-1
... testing ssh to tiger-2
... testing ssh to tiger-3
... testing ssh to tiger-4
... testing ssh to tiger-5
... testing ssh to tiger-6
... testing ssh to tiger-7
... testing ssh to tiger-8
... testing ssh to tiger-9
#+ OK: ssh access is working
#+ OK: nodes are homogenous
#* OK: found 1 interconnect fabric(s).
#* Testing ssh to other nodes
... testing ssh to tiger-1
... testing ssh to tiger-0
... testing ssh to tiger-0
#* OK.

The ssh access is tested, and some basic information is gathered from the nodes to verify that the nodes are homogeneous, equipped with at least one Dolphin Express adapter, and meet the other requirements. If a required RPM package were missing, it would be indicated here with the option to install it (if yum can be used), or to fix the problem manually and retry. If the test for homogeneous nodes fails, please refer to Section 2, Installation of a Heterogeneous Cluster for information on how to install the software stack.
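You can perform the same homogeneity check yourself by comparing kernel version and architecture across the nodes; a sketch (hostnames are examples):

    # Print kernel release and machine architecture for each node
    for host in tiger-1 tiger-2 tiger-3; do
        echo -n "$host: "
        ssh root@$host 'uname -r -m'
    done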
#* Building node RPM packages on tiger-1 in /tmp/tmp.AEgiO27908
#+ This will take some minutes...
#* Logfile is /tmp/DIS_install.log_983 on tiger-1
#* OK, node RPMs have been built.
#* Building frontend RPM packages on scimple in /tmp/tmp.dQdwS17511
#+ This will take some minutes...
#* Logfile is /tmp/DIS_install.log_607 on scimple
#* OK, frontend RPMs have been built.
#* Copying RPMs that have been built:
/tmp/frontend_RPMS/Dolphin-NetworkAdmin-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-NetworkHosts-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-NetworkManager-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SISCI-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SCI-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SuperSockets-3.3.0-1.x86_64.rpm

The binary RPM packages matching the nodes and frontend are built and copied to the directory from where the installer was invoked. They are placed into the subdirectories node_RPMS and frontend_RPMS for later use (see the SIA option --use-rpms).
#* To install/update the Dolphin Express services like SuperSockets, all running #+ Dolphin Express services needs to be stopped. This requires that all user #+ applications using SuperSockets (if any) need to be stopped NOW. # >>> Stop all DolpinExpress services (SuperSockets) NOW? [Y/n]y #* OK: all Dolphin Express services (if any) stopped for upgrade.

On an initial installation, there will be no user applications using SuperSockets, so you can safely answer y right away.
#* Installing node tiger-1
#* OK.
#* Installing node tiger-2
#* OK.
#* Installing node tiger-3
#* OK.
#* Installing node tiger-4
#* OK.
#* Installing node tiger-5
#* OK.
#* Installing node tiger-6
#* OK.
#* Installing node tiger-7
#* OK.
#* Installing node tiger-8
#* OK.
#* Installing node tiger-9
#* OK.
#* Installing machine scimple as frontend.
#* NOTE:
#+ You need to create the cluster configuration files 'dishosts.conf'
#+ and 'networkmanager.conf' using the graphical tool 'dishostseditor'
#+ which will be launched now.
#+ If the interconnect cables are not yet installed, you can create
#+ detailed cabling instruction within this tool (File -> Get Cabling
#+ Instructions).
#+ Then install the cables while this script is waiting.

# >>> Are all cables connected, and do all LEDs on the SCI adapters light green? [Y/n]

The nodes get installed, and the drivers and the node manager are started. Then, the basic packages are installed on the frontend, and the dishostseditor application is launched to create the required configuration files /etc/dis/dishosts.conf and /etc/dis/networkmanager.conf if they do not already exist. The script will wait at this point until the configuration files have been created with dishostseditor, and until you confirm that all cables have been connected according to the cabling instructions. This is described in the next section. For typical problems at this point of the installation, please refer to Chapter 9, FAQ.

4.3. Working with the dishostseditor


dishostseditor is a GUI tool that helps you gather the cluster configuration; it is used to create the cluster configuration file /etc/dis/dishosts.conf and the network manager configuration file /etc/dis/networkmanager.conf. A few global interconnect properties need to be set, and the position of each node within the interconnect topology needs to be specified.

4.3.1. Cluster Edit


When dishostseditor is launched, it first displays a dialog box where the global interconnect properties need to be specified (see Figure 3.1, Cluster Edit dialog of dishostseditor).

Figure 3.1. Cluster Edit dialog of dishostseditor

4.3.1.1. Topology

The dialog will let you enter the selected topology information (number of nodes in X-, Y- and Z-dimension) according to the topology type you selected. The product of the node counts in all dimensions needs to match the total number of nodes (for regular topologies); for irregular topology variants, the total number of nodes may be smaller than this product. The number of fabrics needs to be set to the minimum number of adapters in every node.

The topology settings should already be correct by default if dishostseditor is launched by the installation script. If the cables are not yet mounted (which is the recommended way of doing it), you simply choose the settings that match the way you plan to install them. However, if the cables are already in place and you install a cluster with a 2D- or 3D-torus interconnect topology, it is critical to verify that the actual cable installation matches the dimensions shown here. E.g., a 12-node cluster can be set up as 3 by 4, 4 by 3, or even 2 by 6; the setup script cannot verify that the cabling matches the dimensions that you selected. Remember that link 0 on the adapter boards (the one where the plug is right on the PCB of the adapter board) is mapped to the X-dimension, and link 1 on the adapter board (the one where the plug is on the piggy-back board) is mapped to the Y-dimension.

4.3.1.2. SuperSockets Network Address

If your cluster operates within its own subnet and you want all nodes within this subnet to use SuperSockets (having Dolphin Express installed), you can simplify the configuration by specifying the address of this subnet in this dialog. To do so, activate the Network Address field and enter the cluster IP subnet address including the mask. E.g., if all your nodes communicate via an IP interface with the address 192.168.4.*, you would enter 192.168.4.0/8 here. If the cluster has its own subnet, this option is recommended.

SuperSockets will try to use Dolphin Express for any node in this subnet when it connects to another node of this subnet. If using Dolphin Express is not possible, e.g. because one or both nodes are only equipped with an Ethernet interface, SuperSockets will automatically fall back to Ethernet. Also, if a node gets assigned a new IP address within this subnet, you don't need to change the SuperSockets configuration.

Assigning more than one subnet to SuperSockets is also possible, but this type of configuration is not yet supported by dishostseditor. See Section 1.1, dishosts.conf on how to edit dishosts.conf accordingly. If this type of configuration is not possible in your environment, you need to configure SuperSockets for each node as described in the following section.

4.3.1.3. Status Notification

In case you want to be informed about any change of the interconnect status (e.g. an interconnect link was disabled due to errors, or a node has gone down and the interconnect traffic was rerouted), activate the checkbox Alert target and enter the alert target and the alert script to be executed. The default alert script is alert.sh and will send an e-mail to the address specified as alert target. Other alert scripts can be created and used, which may require another type of alert target (e.g. a cell phone number to send an SMS). For more information on using status notification, please refer to Section 1, Notification on Interconnect Status Changes.

4.3.2. Node Arrangement


In the next step, the main pane of the dishostseditor will present the nodes in the cluster arranged in the topology that was selected in the previous dialog. To change this topology and other general interconnect settings, you can always click Edit in the Cluster Configuration area, which will bring up the Cluster Edit dialog again.

If the font settings of your X server cause dishostseditor to print unreadable characters, you can change the font size and type with the drop-down box at the top of the window, next to the floppy disk icon.


Figure 3.2. Main dialog of dishostseditor

At this point, you need to arrange the nodes (marked by their hostnames) such that the placement of each node in the torus as shown by dishostseditor matches its placement in the physical torus. You do this by assigning the correct hostname for each node by double-clicking its node icon which will open the configuration dialog of this node. In this dialog, select the correct machine name, which is the hostname as seen from the frontend, from the drop-down list. You can also type a hostname if a hostname that you specified during the installation was wrong.


Figure 3.3. Node dialog of dishostseditor

After you have assigned the correct hostname to this machine, you may need to configure SuperSockets on this node. If you selected the Network Address in the cluster configuration dialog (see above), then SuperSockets will use this subnet address and will not allow editing this property on the nodes. Otherwise, you can choose between 3 different options for each of the currently supported 2 SuperSockets-accelerated IP interfaces per node:

disable
Do not use SuperSockets. If you set this option for both fields, SuperSockets cannot be used with this node, although the related kernel modules will still be loaded.

static
Enter the hostname or IP address for which SuperSockets should be used. This hostname or IP address will be statically assigned to this physical node (its Dolphin Express interconnect adapter). Choosing a static socket means that the mapping between the node (its adapters) and the specified hostname/IP address is static and is specified within the configuration file dishosts.conf. All nodes will use this identical file (which is automatically distributed from the frontend to the nodes by the network manager) to perform this mapping. This option works fine if the nodes in your cluster don't change their IP addresses over time.

dynamic
Enter the hostname or IP address for which SuperSockets should be used. This hostname or IP address will be dynamically resolved to the Dolphin Express interconnect adapter that is installed in the machine with this hostname/IP address. SuperSockets will therefore resolve the mapping between adapters and hostnames/IP addresses dynamically. This incurs a certain initial overhead when the first connection is set up, but as the mapping is cached, this is not really relevant. This option is similar to using a subnet, but resolves only the explicitly specified IP addresses and not all IP addresses of a subnet. Use this option if nodes change their IP addresses or node identities move between physical machines, e.g. in a fail-over setup.

4.3.3. Cabling Instructions


You should now generate the cabling instructions for your cluster. Please do this even if the cables are already installed: you really want to verify that the actual cable setup matches the topology you just specified. To create the cabling instructions, choose the menu item File -> Create Cabling Instructions. You can save and/or print the instructions. It is a good idea to print the instructions so you can take them with you to the cluster.


4.4. Cluster Cabling


If the cables are already connected, please proceed with Section 4.4.2, Verifying the Cabling.

Note
In order to achieve trouble-free operation of your cluster, setting up the cables correctly is critical. Please take your time to perform this task properly. The cables can be installed while the nodes are powered up. The setup script will wait at the following question until you continue:
# >>> Are all cables connected, and do all LEDs on the SCI adapters light green? [Y/n]

4.4.1. Connecting the cables


Please proceed by connecting the nodes as described by the cabling instructions generated by the dishostseditor. The cabling instructions refer to link 0 and link 1 if you are using D352 adapters (for 2D-torus topology), and to channel A and channel B in case of D350 adapters being used (for dual-channel operation). Each of the two links/channels will form an independent ring with its adjacent adapters, and thus has an IN and an OUT connector to connect to these adjacent adapters.

It is critical that you correctly locate the different links/channels on the back of the card, and the IN and OUT connectors. For D352 (D350), link 0 (channel A) is formed by the connectors that are directly connected to the PCB (printed circuit board) of the adapter, while the connectors for link 1 (channel B) are located on the piggy-back board. This is illustrated in Figure 3.4, Location of link 0 and link 1 on D352 adapter and Figure 3.5, Location of channel A and channel B on D350 adapter, respectively. For both links (channels), the IN connectors are located at the lower end of the adapter, and the OUT connectors at the top of the adapter. The D351 adapter has only a single link (0), with the same location of the IN and OUT connectors.

Figure 3.4. Location of link 0 and link 1 on D352 adapter

Figure 3.5. Location of channel A and channel B on D350 adapter

Please consider the hints below for connecting the cables:

Never apply force: The plugs of the cable will move into the sockets easily. Make sure the orientation is correct.

Mind the minimum bend diameter: The cables have a minimum bend diameter of 5cm.


Note
This specification applies to the black All-Best cables (part number D706), but not to the grey CHH cables (part number D707). With the CHH cables, the minimum bend diameter is 10cm.

Fasten evenly: When fastening the screws of the plugs, make sure you fasten both lightly before tightening them. Do not tighten only one screw of the plug, and then the other one, as this is likely to tilt the plug within the connector.

Fasten gently: Use a torque screwdriver if possible, and apply a maximum of 0.4 Nm. As a rule of thumb: do not apply more torque with the screwdriver than you possibly could using only your fingers (if there was enough space to grip the screw).

Observe the LEDs: When an adapter has both input and output of a link connected to its neighboring adapter, the LED should turn green and emit a steady light (not blinking).

Don't mix up links: When using a 2D-torus topology, it is important not to connect link 0 of one adapter with link 1 of another adapter. As described above, link 0 is the left pair of connectors on the Dolphin Express SCI interconnect adapter when the adapter is placed in a vertical position. To determine which side is left, hold the adapter in a vertical position: the blue "O" (indicating the OUT port) should be located at the top, the LEDs are also placed at the top of the adapter, and the PCI/PCI-X/PCI-Express bus connector is at the lower side of the adapter. The left pair of connectors is then link 0; link 1 is the right pair of connectors when the adapter is held in this vertical position.

Note
If the links have been mixed up, the LED will still turn green, but packet routing will fail. The cabling test of sciadmin will reveal such cabling errors.

4.4.2. Verifying the Cabling

Important
A green link LED indicates that the link between the output plug and the input plug could be established and synchronized. It does not assure that the cable is actually placed correctly! It is therefore important to verify once more that the cables are plugged according to the cabling instructions generated by the dishostseditor!

If a pair of LEDs does not turn green, please perform the following steps:
1. Disconnect the cables. Make sure you connect an Output with an Input plug. Re-insert and fasten the plug according to the guidelines above.
2. If the LEDs still do not turn green, use a different cable.
3. If the LEDs still do not turn green, swap the cable of the problematic connection with a working one and observe if the problem moves with the cable.
4. Power-cycle the nodes with the orange LEDs according to Q: 1.1.1.
5. Contact Dolphin support if you cannot make the LEDs turn green after trying all proposed measures.


When you are done connecting the cables, all LEDs have turned green, and you have verified the connections, you can answer "Yes" to the question "Are all cables connected, and do all LEDs on the SCI adapters light green?" and proceed with the next section to finalize the software installation.

4.5. Finalising the Software Installation


Once the cables are connected, no more user interaction is required. Please confirm that all cables are connected and all LEDs shine green, and the installation will proceed. The network manager will be started on the frontend, configuring all cluster nodes according to the configuration specified in dishosts.conf. After this, a number of tests are run on the cluster to verify that the SCI interconnect was set up correctly and delivers the expected performance. You will see output like this:
#* NOTE: checking for cluster configuration to take effect:
 ... node tiger-1:
 ... node tiger-2:
 ... node tiger-3:
 ... node tiger-4:
 ... node tiger-5:
 ... node tiger-6:
 ... node tiger-7:
 ... node tiger-8:
 ... node tiger-9:
#* OK.

#* Installing remaining frontend packages
#* NOTE:
#+ To compile SISCI applications (like NMPI), the SISCI-devel RPM needs to be
#+ installed. It is located in the frontend_RPMS and node_RPMS directories.
#* OK.

If no problems are reported (like in the example above), you are done with the installation and can start to use your Dolphin Express accelerated cluster. Otherwise, refer to the next subsections and Section 4.7, Interconnect Validation with sciadmin to learn about the individual tests and how to fix problems reported by each test.

4.5.1. Static SCI Connectivity Test


The Static SCI Connectivity Test verifies that links are up and all nodes can see each other via the SCI interconnect. Success in this test means that all adapters have been configured correctly, and that the cables are inserted properly. It should report TEST RESULT: *PASSED* for all nodes:
#* NOTE: Testing static interconnect connectivity between nodes.
 ... node tiger-1: TEST RESULT: *PASSED*
 ... node tiger-2: TEST RESULT: *PASSED*
 ... node tiger-3: TEST RESULT: *PASSED*
 ... node tiger-4: TEST RESULT: *PASSED*
 ... node tiger-5: TEST RESULT: *PASSED*
 ... node tiger-6: TEST RESULT: *PASSED*
 ... node tiger-7: TEST RESULT: *PASSED*
 ... node tiger-8: TEST RESULT: *PASSED*
 ... node tiger-9: TEST RESULT: *PASSED*

If this test reports errors or warnings, you are offered the option to re-run dishostseditor to validate and possibly fix the interconnect configuration. If the problems persist, you should let the installer continue and analyse the problems using sciadmin after the installation finishes (see Section 4.7, Interconnect Validation with sciadmin).


4.5.2. SuperSockets Configuration Test


The SuperSockets Configuration Test verifies that all nodes have the same valid SuperSockets configuration (as shown by /proc/net/af_sci/socket_maps).
#* NOTE: Verifying SuperSockets configuration on all nodes.
#+ No SuperSocket configuration problems found.

Success in this test means that the SuperSockets service dis_supersockets is running and is configured identically on all nodes. If a failure is reported, it means that the interconnect configuration did not propagate correctly to this node. You should check if the dis_nodemgr service is running on this node. If not, start it, wait for a minute, and then configure SuperSockets by calling dis_ssocks_cfg.
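A minimal sketch of this recovery on an affected node (assuming a Red Hat style service command, as shown elsewhere in this chapter, and that dis_ssocks_cfg is found via the PATH of the DIS installation, /opt/DIS by default):

# service dis_nodemgr status
# service dis_nodemgr start
# sleep 60
# dis_ssocks_cfg
# cat /proc/net/af_sci/socket_maps

Start the node manager if it is not running, give the network manager about a minute to push the configuration, re-run the SuperSockets configuration, and finally check that the socket map is now populated.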

4.5.3. SuperSockets Performance Test


The SuperSockets Performance Test runs a simple socket benchmark between two of the nodes. The benchmark is run once via Ethernet and once via SuperSockets, and performance is reported for both cases.
#* NOTE:
#+ Verifying SuperSockets performance for tiger-2 (testing via tiger-1).
#+ Checking Ethernet performance
 ... single-byte latency: 56.63 us
#+ Checking Dolphin Express SuperSockets performance
 ... single-byte latency: 3.00 us
 ... Latency rating: Very good. SuperSockets are working well.
#+ SuperSockets performance tests done.

The SuperSockets latency is rated based on our platform validation experience. If the rating indicates that SuperSockets are not performing as expected, or if it shows that a fallback to Ethernet has occurred, please contact Dolphin Support. In this case, it is important that you supply the installation log (see above). The installation finishes with the option to start the administration GUI tool sciadmin, a hint to use LD_PRELOAD to make use of SuperSockets and a pointer to the binary RPMs that have been used for the installation.
#* OK: Cluster installation completed.
#+ Remember to use LD_PRELOAD=libksupersockets.so for all applications that
#+ should use Dolphin Express SuperSockets.
# >>> Do you want to start the GUI tool for interconnect administration (sciadmin)? [y/N]n
#* RPM packages that were used for installation are stored in
#+ /tmp/node_RPMS and /tmp/frontend_RPMS.

4.6. Handling Installation Problems


If for some reason the installation was not successful, you can easily and safely repeat it by simply invoking the SIA again. Please consider:

By default, existing RPM packages of the same or even a more recent version will not be replaced. To enforce re-installation with the version provided by the SIA, you need to specify --enforce.

To avoid that the binary RPMs are built again, use the option --use-rpms, or simply run the SIA in the same directory as before, where it can find the RPMs in the node_RPMS and frontend_RPMS subdirectories.

To start an installation from scratch, you can run the SIA on each node and the frontend using the option --wipe to remove all traces of the Dolphin Express software stack and start again. Example invocations are sketched below.

If you still fail to install the software successfully, you should contact Dolphin support. Please provide all installation logfiles. Every installation attempt creates a differently named logfile; its name is printed at the very beginning of the installation. Please also include the configuration files that can be found in /etc/dis on the frontend.
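Two typical retry invocations, for illustration only (assuming the binary RPMs from the first attempt are in /tmp/node_RPMS and /tmp/frontend_RPMS):

# sh DIS_install_<version>.sh --install-all --enforce --use-rpms /tmp
# sh DIS_install_<version>.sh --wipe

The first command re-runs the full installation, enforcing re-installation and re-using the existing RPMs. The second, run on each node and on the frontend, removes all traces of the software stack before you start from scratch.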

4.7. Interconnect Validation with sciadmin


Dolphin provides a graphical tool named sciadmin. sciadmin serves as a single point of control to manage the Dolphin Express interconnect in your cluster. It shows an overview of the status of all adapters and links of a cluster and allows you to perform detailed status queries. It also provides means to manually control the interconnect, inspect and set options, and perform interconnect tests.

For a complete description of sciadmin, please refer to Appendix B, sciadmin Reference. Here, we will only describe how to use sciadmin to verify the newly installed Dolphin Express interconnect.

4.7.1. Installing sciadmin


sciadmin is installed on the frontend machine by the SIA if this machine is capable of running X applications and has the Qt toolkit installed. If the frontend does not have these capabilities, you can install it on any other machine that has them, either by using the SIA with the --install-frontend option or by using the Dolphin-NetworkAdmin RPM package from the frontend_RPMS directory (this RPM will only be there if it could be built for the frontend). It is also possible to download a binary version for Windows that runs without the need for extra compilation or installation.

You can use sciadmin on any machine that can connect to the network manager on the frontend via a standard TCP/IP socket. You have to make sure that connections towards the frontend using the ports 3445 (sciadmin), 3444 (network manager) and 3443 (node manager) are possible (potentially, firewall settings need to be changed).
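How you open these ports depends on the firewall in use. As an illustration only, on a frontend protected by a plain iptables setup you might allow them like this (adjust to the firewall management tool of your distribution):

# iptables -A INPUT -p tcp --dport 3443:3445 -j ACCEPT

Remember to make such a rule persistent according to your distribution's conventions, otherwise it is lost on the next reboot.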

4.7.2. Starting sciadmin


sciadmin will be installed in the sbin directory of the installation path (default: /opt/DIS/sbin/sciadmin). It will be within the PATH after you login as root, but can also be run by non-root users. After it has been started, you will need to connect to the network manager controlling your cluster. Click the Connect button in the tool bar and enter the appropriate hostname or IP address of the network manager. sciadmin will present you a graphical representation of the cluster nodes and the interconnect links between them.

4.7.3. Cluster Overview


Normally, all nodes and interconnect links should be shown green, meaning that their status is OK. This is a requirement for a correctly installed and configured cluster, and you may proceed to Section 4.7.4, Cabling Correctness Test.

If a node is shown in red, it means that the network manager cannot connect to the node manager on this node. To solve this problem:
1. Make sure that the node is powered and has booted the operating system.
2. Verify that the node manager service is running. On Red Hat:
# service dis_nodemgr status

On other Linux variants:


# /etc/init.d/dis_nodemgr status

should tell you that the node manager is running. If this is not the case: a. Try to start the node manager: On Red Hat:
# service dis_nodemgr start

On other Linux variants:


# /etc/init.d/dis_nodemgr start

should tell you that the node manager has started successfully. b. If the node manager fails to start, please see /var/log/dis_nodemgr.log


c.

Make sure that the service is configured to start in the correct runlevel (Dolphin installation makes sure this is the case). On Red Hat:
# chkconfig --level 2345 dis_nodemgr on

On other Linux variants, please refer to the system documentation to determine the required steps

4.7.4. Cabling Correctness Test


sciadmin can validate that all cables are connected according to the configuration that was specified in the dishostseditor, and which is now stored in /etc/dis/dishosts.conf on all nodes and the frontend. To perform the cable test, select Cluster -> Test Cable Connections. This Cabling Correctness Test runs for only a few seconds and will verify that the nodes are cabled according to the configuration provided by the dishostseditor.

Warning
Running this test will stop the normal traffic over the interconnect as the routing needs to be changed. If you run this test while your cluster is in production, you might experience communication timeouts. SuperSockets in operation will fall back to Ethernet during this test, which also leads to increased communication delays.

If the test detects a problem, it will inform you that node A cannot communicate with node B although they are supposed to be within the same ringlet. You will typically get more than one error message in case of a cabling problem, as such a problem does in most cases affect more than one pair of nodes. Please proceed as follows:
1. Try to fix the first reported problem by tracing the cable connections from node A to node B:
   a. Verify that the cable connections are placed within one ringlet:
      i. Look up the path of cable connections between node A and node B in the Cabling Instructions that you created (or still can create at this point) using dishostseditor.
      ii. When you arrive at node B, do the same check for the path back from node B to node A.
   b. Along the path, make sure:
      i. That each cable plug is securely fitted into the socket of the adapter.
      ii. That each cable plug is connected to the right link (0 or 1) as indicated by the cabling instructions.
2. If you can't find a cause for the first reported problem, verify the cable connections for all following pairs of nodes reported bad.
3. After the first change, re-run the cable test to verify if this change solves all problems. If this is not the case, start over with this verification loop.

4.7.5. Fabric Quality Test


The Cabling Correctness Test performs only minimal communication between two nodes to determine the functionality of the fabric between them. To verify the actual signal quality of the interconnect fabric, a more intense test is required. Such a Fabric Quality Test can be started for each installed interconnect fabric (0 or 1) from within sciadmin via Cluster -> Fabric * -> Test Traffic.

Warning
Running this test will stop the normal traffic over the interconnect as the routing needs to be changed. If you run this test while your cluster is in production, you might experience communication timeouts. SuperSockets in operation will fall back to a second fabric (if installed) or to Ethernet during this test, which also leads to increased communication delays.

This test will run for a few minutes, depending on the size of your cluster, as it tests communication for about 20 seconds between each pair of nodes within the same ring. This means, for a 4 by 4 2D-torus cluster which features 8 rings with 4 nodes each, it will take 8 * (3 + 2 + 1) * 20 seconds = 16 minutes. It will then report if any CRC errors or other problems have occurred between any pairs of nodes.

Note
Any communication errors reported here are either corrected automatically by retrying a data transfer (like for CRC errors), or are reported. Thus, a communication error does not mean data might get lost. However, every communication error reduces the performance, and an optimally set up Dolphin Express interconnect should not show any communication errors. A small number of communication errors is acceptable, though. Please contact Dolphin support if in doubt.

If the test reports communication errors, please proceed as follows:
1. If errors are reported between multiple pairs of nodes, locate the pair of nodes which are located most closely to each other (have the smallest number of cable connections between them). Normally, if any errors are reported, a pair of nodes located next to each other will show up.
2. Check the cable connection on the shortest path between these two nodes (a single cable, if the nodes are located next to each other) for being properly mounted:
   a. No excessive stress on the cable, like bending it too sharply or too much force on the plugs.
   b. Cable plugs need to be placed in the connectors on the adapters evenly (not tilted) and securely fastened. If in doubt, unplug the cable and re-fasten it.
3. Perform the previous check for all other node pairs; then re-run the test.
4. If communication errors persist, change cables to locate a possibly damaged cable:
   a. Exchange the cables between the closest pair of nodes one-by-one with a cable of a connection for which no errors have been reported. Remember (note down) which cables you exchanged.
   b. Run the Fabric Quality Test after each cable exchange.
      i. If the communication errors move with the cable you just exchanged, then this cable might be damaged. Please contact your sales representative for exchange.
      ii. If the communication errors remain unchanged, the problem might be with one of the adapters. Please contact Dolphin support for further analysis.

4.8. Making Cluster Application use Dolphin Express


After the Dolphin Express hard- and software has been installed and tested, you will want your cluster application to make use of the increased performance.

4.8.1. Generic Socket Applications


All applications that use generic BSD sockets for communication will be accelerated by SuperSockets. No configuration change is required for the application as the same host names/IP v4 addresses can be used. All relevant socket types are supported by SuperSockets: TCP stream sockets as well as UDP and RDS datagram sockets. SuperSockets will use the Dolphin Express interconnect for low-latency, high-bandwidth communication inside the cluster, and will transparently fall back to Ethernet when connecting to nodes outside the cluster. To make an application use SuperSockets, you need to preload a dynamic library on application start. This can be achieved by two means as described in the next two sections.


4.8.1.1. Launch via Wrapper Script

To let generic socket applications use SuperSockets, you just need to run them via the wrapper script dis_ssocks_run, which sets the LD_PRELOAD environment variable accordingly. This script is installed to the bin directory of the installation (default is /opt/DIS/bin), which is added to the default PATH environment variable. To have e.g. the socket benchmark netperf run via SuperSockets, start the server process (netserver) on node server_name like
dis_ssocks_run netserver

and the client process on any other node in the cluster like
dis_ssocks_run netperf -H server_name

4.8.1.2. Launch with LD_PRELOAD

As an alternative to using this wrapper script, you can also make sure to set LD_PRELOAD yourself to preload the SuperSockets library, e.g. for sh-style shells such as bash:
export LD_PRELOAD=libksupersockets.so
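For example, to start a MySQL server with SuperSockets preloaded this way (equivalent to launching it via the dis_ssocks_run wrapper script; mysqld_safe stands for whatever server start script you use):

$ export LD_PRELOAD=libksupersockets.so
$ mysqld_safe &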

4.8.1.3. Troubleshooting

If the applications you are using do not show increased performance, please verify that they use SuperSockets as follows:
1. To verify that the preloading works, use the ldd command on any executable, e.g. the netperf binary mentioned above:
$ export LD_PRELOAD=libksupersockets.so
$ ldd netperf
        libksupersockets.so => /opt/DIS/lib64/libksupersockets.so (0x0000002a95577000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x00000033ed300000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x00000033ec800000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00000033ecb00000)
        /lib64/ld-linux-x86-64.so.2 (0x00000033ec600000)

The library libksupersockets.so has to be listed at the top position. If this is not the case, make sure the library file actually exists. The default locations are /opt/DIS/lib/libksupersockets.so and /opt/DIS/lib64/libksupersockets.so on 64-bit platforms, and libksupersockets.so actually is a symbolic link to a library with the same name and a version suffix:
$ ls -lR /opt/DIS/lib*/*ksupersockets* -rw-r--r-- 1 root root 29498 Nov 14 12:43 -rw-r--r-- 1 root root 901 Nov 14 12:43 lrwxrwxrwx 1 root root 25 Nov 14 12:50 lrwxrwxrwx 1 root root 25 Nov 14 12:50 -rw-r--r-- 1 root root 65160 Nov 14 12:43 -rw-r--r-- 1 root root 19746 Nov 14 12:43 -rw-r--r-- 1 root root 899 Nov 14 12:43 lrwxrwxrwx 1 root root 25 Nov 14 12:50 lrwxrwxrwx 1 root root 25 Nov 14 12:50 -rw-r--r-- 1 root root 48731 Nov 14 12:43

/opt/DIS/lib64/libksupersockets.a /opt/DIS/lib64/libksupersockets.la /opt/DIS/lib64/libksupersockets.so -> libksupersockets.so.3 /opt/DIS/lib64/libksupersockets.so.3 -> libksupersockets.so /opt/DIS/lib64/libksupersockets.so.3.3.0 /opt/DIS/lib/libksupersockets.a /opt/DIS/lib/libksupersockets.la /opt/DIS/lib/libksupersockets.so -> libksupersockets.so.3.3 /opt/DIS/lib/libksupersockets.so.3 -> libksupersockets.so.3 /opt/DIS/lib/libksupersockets.so.3.3.0

Also, make sure that the dynamic linker is configured to find it in this place. The dynamic linker is configured accordingly on installation of the RPM; if you did not install via RPM, you need to configure the dynamic linker manually. To verify that the dynamic linking is the problem, set LD_LIBRARY_PATH to include the path to libksupersockets.so and verify again with ldd:
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/DIS/lib:/opt/DIS/lib64
$ echo $LD_PRELOAD
libksupersockets.so
$ ldd netperf
....

A better solution than setting LD_LIBRARY_PATH is to configure the dynamic linker ld to include these directories in its search path. Use man ldconfig to learn how to achieve this.
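A minimal sketch of that configuration, assuming a distribution whose dynamic linker reads drop-in files from /etc/ld.so.conf.d (otherwise, append the paths to /etc/ld.so.conf directly):

# echo /opt/DIS/lib > /etc/ld.so.conf.d/dis.conf
# echo /opt/DIS/lib64 >> /etc/ld.so.conf.d/dis.conf
# ldconfig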


2.

You need to make sure that the preloading of the SuperSockets library described above is effective on both nodes, for both applications that should communicate via SuperSockets.

3.

Make sure that the SuperSockets kernel module (and the kernel modules it depends on) are loaded and configured correctly on both nodes.
1. Check the status of all Dolphin kernel modules via the dis_services script (default location /opt/DIS/sbin):
# dis_services status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 ( November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) running.

At least the services dis_irm and dis_supersockets need to be running, and you should not see a message about SuperSockets not being configured.
2. Verify the configuration of SuperSockets to make sure that all cluster nodes will connect and communicate via SuperSockets. The active configuration is shown in /proc/net/af_sci/socket_maps:
# cat /proc/net/af_sci/socket_maps
IP/net           Adapter   NodeId   List
-----------------------------------------------
172.16.5.1/32    0x0000         4   0 0
172.16.5.2/32    0x0000         8   0 0
172.16.5.3/32    0x0000        68   0 0
172.16.5.4/32    0x0000        72   0 0

Depending on the configuration variant you used to set up SuperSockets, the content of this file may look different, but it must never be empty and should be identical on all nodes (a quick way to compare it across all nodes is sketched at the end of this list). The example above shows a four-node cluster with a single fabric and a static SuperSockets configuration, which will accelerate one socket interface per node. For more information on the configuration of SuperSockets, please refer to Section 1.1, dishosts.conf.
3. Make sure that the host names/IP addresses used effectively by the application are the ones that are configured for SuperSockets, especially if the nodes have multiple Ethernet interfaces configured.

4.

Check the system log for messages of the SuperSockets kernel module. It will report all problems, e.g. when it is running out of resources.
# cat /var/log/messages | grep dis_ssocks

It is a good idea to monitor the system log while you try to connect to a remote node if you suspect problems being reported there:
# tail -f /var/log/messages

For an explanation of typical error messages, please refer to Section 2, Software.

5.

Don't forget to check if the port numbers used by this application, or the application itself, have been explicitly excluded from using SuperSockets. By default, only the system port numbers below 1024 are excluded from using SuperSockets, but you should verify the current configuration (see Section 2, SuperSockets Configuration).

6.

If you can't solve the problem, please contact Dolphin Support.
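As mentioned in the list above, the SuperSockets configuration should be identical on all nodes. A quick way to check this from the frontend is to compare a checksum of the socket map on each node. A minimal sketch, assuming password-less ssh to the nodes (as set up for the installation) and the example hostnames used in this chapter (list all your nodes here):

# for node in tiger-1 tiger-2 tiger-3 tiger-4; do echo -n "$node: "; ssh $node md5sum /proc/net/af_sci/socket_maps; done

All nodes should print the same checksum; a node with a different checksum (or an empty socket map) has not received the current configuration.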

4.8.2. Native SCI Applications


Native SCI applications use the SISCI API to access Dolphin Express hardware features like transparent remote memory access, DMA transfers or remote interrupts. The SISCI library libsisci.so is installed on the nodes by default.


Note
The SISCI library is only available in the native bit width of a machine. This implies that on 64-bit machines, only 64-bit SISCI applications can be created and executed, as there is no 32-bit version of the SISCI library on 64-bit machines.

To compile and link SISCI applications like the MPI implementation NMPI, the SISCI-devel RPM needs to be installed on the respective machine. This RPM is built during installation and placed in the node_RPMS and frontend_RPMS directories, respectively.
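For example, to install the development package on a node from the RPMs built during the initial installation (the file name below is taken from the example installation output earlier in this chapter and will differ for other versions and architectures):

# rpm -Uhv /tmp/node_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm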

4.8.3. Kernel Socket Services


SuperSockets can also be used to accelerate kernel services that communicate via sockets. However, such services need to be adapted to actually use SuperSockets (a minor modification to make them use a different address family when opening new sockets). If you are interested in accelerated kernel services like iSCSI, GNBD or others, please contact Dolphin Support.


Chapter 4. Update Installation


This chapter describes how an existing Dolphin Express software stack is to be updated to a new version using the SIA. Dolphin Express software supports "rollling upgrades" between all release unless explicitly noted otherwise in the release notes.

1. Complete Update
As opposed to the initial installation, the update installation can be performed in a fully automatic manner without manual intervention. Therefore, this convenient update method is recommended if you can afford some downtime of the whole cluster. Typically, the update of a 16-node cluster takes about 30 minutes.

A complete update is also required in case of protocol incompatibilities between the installed version and the version to be installed. Such incompatibilities are rare and will be described in the release notes. If this applies, a rolling update is not possible; you will need to update the system completely in one operation. This will make Dolphin Express functionality unavailable for the duration of the update.

Proceed as follows to perform the complete update installation:
1. Stop the applications using Dolphin Express on all nodes. This step can be omitted if you choose the --reboot option below.
2. Become superuser on the frontend.
3. Run the SIA on the frontend with any combination of the following options:

--install-all

This is the default installation variant and will update all nodes and the frontend. You can specify --install-node or --install-frontend here to update only the current node or the frontend (you need to execute the SIA on the respective node in these cases!)

--batch

Using this option, the script will run without any user interaction, assuming the default answers to all questions which would otherwise be posed to the user. This option can safely be used if no configuration changes are needed, and if you know that all services/applications using Dolphin Express are stopped on the nodes.

--reboot

Rebooting the nodes in the course of the installation will avoid any problems when loading the updated drivers. Such problems can occur because the drivers are currently in use, or due to resource problems. This option is recommended.

--enforce

By default, packages on a node or the frontend will only be updated if the new package has a more recent version than the installed package. This option will enforce the uninstallation of the installed package, followed by the installation of the new package. This option is recommended if you are unsure about the state of the installation.

As an example, the complete, non-interactive and enforced installation of a specific driver version (provided via the SIA) with a reboot of all nodes will be invoked as follows:
# sh DIS_install_<version>.sh --install-all --batch --reboot --enforce

4.

Wait for the SIA to complete. The updated Dolphin Express services will be running on the nodes and the frontend.

2. Rolling Update
A rolling update will keep your cluster and all its services available on all but one node. This kind of update needs to be performed node by node. It requires that you stop all applications which use the Dolphin Express software stack (like a database server using SuperSockets) on the node you intend to update. This means your system needs to tolerate applications going down on a single node.

Before performing a rolling update, please refer to the release notes of the new version to be installed to check whether it supports a rolling update from the version currently installed. If this is not the case, you need to perform a complete update (see the previous section).

Note
It is possible to install the updated files while the applications are still using Dolphin Express services. However, in this case the updated Dolphin Express services will not become active until you restart them (or reboot the machine).

Perform the following steps on each node:
1. Log into the node and become superuser (root).
2. Build the new binary RPM packages for this node:
# sh DIS_install_<version>.sh --build-rpm

The created binary RPM packages will be stored in the subdirectories node_RPMS and frontend_RPMS which will be created in the current working directory.

Tip
To save a lot of time, you can use the binary RPM packages built on the first node that is updated on all other nodes (if they have the same CPU architecture and Linux version). Please see Section 2.3, Installing from Binary RPMs for more information.

3. Stop all applications on this node that use Dolphin Express services, like a MySQL server or NDB process.
4. Stop all Dolphin Express services on this node using the dis_services command:
# dis_services stop
Stopping Dolphin SuperSockets drivers                               [  OK  ]
Stopping Dolphin SISCI driver                                       [  OK  ]
Stopping Dolphin Node Manager                                       [  OK  ]
Stopping Dolphin IRM driver                                         [  OK  ]

If you run sciadmin, you will notice that this node will show up as disabled (not active).

Note
The SIA will also try to stop all services when doing an update installation. Performing this step explicitly will just assure that the services can be stopped, and that the applications are shut down properly. If the services cannot be stopped for some reason, you can still update the node, but you have to reboot it to enable the updated services. See the --reboot option in the next step.

5. Run the SIA with the --install-node --use-rpms <path> options to install the updated RPM packages and start the updated drivers and services. The <path> parameter to the --use-rpms option has to point to the directory where the binary RPM packages have been built (see step 2). If you had run the SIA in /tmp in step 2, you would issue the following command:
# sh DIS_install_<version>.sh --install-node --use-rpms /tmp

Adding the option --reboot will reboot the node after the installation has been successful. A reboot is not required if the services were shut down successfully in step 4, but it is recommended to allow the low-level driver to allocate sufficient memory resources for remote-memory access communication.


Important
If the services could not be stopped in step 4, a reboot is required to allow the updated drivers to be loaded. Otherwise, the new drivers will only be installed on disk, but will not be loaded and used.

If for some reason you want to re-install the same version, or even an older version of the Dolphin Express software stack than is currently installed, you need to use the --enforce option.

6. The updated services will be started by the installation and are available for use by the applications. Make sure that the node has shown up as active (green) in sciadmin again before updating the next node. If the services failed to start, a reboot of the node will fix the problem; such failures can be caused by situations where the memory is too fragmented for the low-level driver (see above).


Chapter 5. Manual Installation


This chapter explains how to manually install the different software packages on the nodes and the frontend, and how to install the software if the native package format of a platform is not supported by the Dolphin installer SIA.

1. Installation under Load


This section describes how to perform the initial Dolphin Express installation on a cluster in operation, without the requirement to stop the whole cluster from operating. This type of installation does not require more than one node at a time being offline. Because MySQL Cluster is fault-tolerant by default, this will not stop your cluster application; however, performance may suffer to some degree. It will be necessary to power off single nodes in the course of this installation (unless your machines support PCI hotplug - in this case, please contact Dolphin support).

1. Installing the drivers on the nodes

On all nodes, run the SIA with the option --install-node. This is a local operation which will build and install the drivers on the local machine only. As the Dolphin Express hardware is not yet installed, this operation will report errors which can be ignored. Do not reboot the nodes now!

Tip
You can speed up this node installation by re-using binary RPMs that have been built on another node with the same kernel version and the same CPU architecture. To do so, proceed as follows:
1. After the first installation on a node, the binary RPMs are located in the directories node_RPMS and frontend_RPMS, located in the directory where you launched the SIA. Copy these sub-directories to a path that is accessible from the other nodes.
2. When installing on another node with the same Linux kernel version and CPU architecture, use the --use-rpms option to tell the SIA where it can find matching RPMs for this node, so it does not have to build them once more.

2.

Installing the Dolphin Express hardware

For an installation under load, perform the following steps for each node, one by one:
1. Shut down your application processes on the current node.
2. Power off the node, and install the Dolphin Express adapter (see Section 3, Adapter Card Installation). Do not yet connect any cables!
3. Power on the node and boot it up. The Dolphin Express drivers should load successfully now, although the SuperSockets service will not be configured. Verify this via dis_services:

# dis_services status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 ( November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) loaded, but not configured.

4.

Stop the SuperSockets service:


# service dis_supersockets stop
Stopping Dolphin SuperSockets drivers                               [  OK  ]

5. Start all your own applications on the current node and make sure the whole cluster operates normally.
6. Proceed with the next node until all nodes have the Dolphin Express hardware and software installed.


3.

Creating the cluster configuration files

If you have a Linux machine with X available which can run GUI applications, run the SIA with the --install-editor option to install the tool dishostseditor. Ideally, this step is performed on the frontend. If this is the case, you should create the directory /etc/dis and make it writable for root:
# mkdir /etc/dis # chmod 755 /etc/dis

After the SIA has completed the installation, start the tool dishostseditor (default installation location is /opt/DIS/sbin):
# /opt/DIS/sbin/dishostseditor

Information on how to work with this tool can be found in Section 4.3, Working with the dishostseditor. Make sure you create the cabling instructions needed in the next step.

If the dishostseditor was run as root on the frontend, proceed with the next step. Otherwise, copy the configuration files dishosts.conf and networkmanager.conf which you have just created to the frontend and place them there under /etc/dis (you may need to create this directory, see above).

4. Cable Installation

Using the cabling instructions created by dishostseditor in the previous step, the interconnect cables should now be connected (see Section 4.4, Cluster Cabling).

5. On the frontend machine, run the SIA with the --install-frontend option. This will start the network manager, which will then configure the whole cluster according to the configuration files created in the previous steps.

6. Start all services on all the nodes:
# dis_services start
Starting Dolphin IRM 3.3.0 ( November 13th 2007 )                   [  OK  ]
Starting Dolphin Node Manager                                       [  OK  ]
Starting Dolphin SISCI 3.3.0 ( November 13th 2007 )                 [  OK  ]
Starting Dolphin SuperSockets drivers                               [  OK  ]

7. Verify the functionality and performance according to Section 1, Verifying Functionality and Performance.

8. At this point, Dolphin Express and SuperSockets are ready to use, but your application is still running on Ethernet. To make your application use SuperSockets, you need to perform the following steps on each node, one by one:
1. Shut down your application processes on the current node.
2. Refer to Section 4.8, Making Cluster Application use Dolphin Express to determine the best way to have your application use SuperSockets. Typically, this can be achieved by simply starting the process via the dis_ssocks_run wrapper script (located in /opt/DIS/bin by default), like:
$ dis_ssocks_run mysqld_safe

3.

Start all your own applications on the current node and make sure the whole cluster operates normally. Because SuperSockets fall back to Ethernet transparently, your applications will start up normally, independently of whether applications on the other nodes are already using SuperSockets or not.

After you have performed these steps on all nodes, all applications that have been started accordingly will now communicate via SuperSockets.

Note
This single-node installation mode will not adapt the driver configuration dis_irm.conf to optimally fit your cluster. This might be necessary for clusters with more than 4 nodes. Please refer to Section 3.1, dis_irm.conf to perform recommended changes, or contact Dolphin support.



2. Installation of a Heterogeneous Cluster


This section describes how to perform the initial Dolphin Express installation on a cluster with heterogeneous nodes (different CPU architectures, different operating system versions (e.g. different Linux kernel versions), or even different operating systems).

Note
This single-node installation mode will not adapt the driver configuration dis_irm.conf to optimally fit your cluster. This might be necessary for clusters with more than 4 nodes. Please refer to Section 3.1, dis_irm.conf to perform recommended changes, or contact Dolphin support.

1. Installing the Dolphin Express hardware

Power off all nodes, and install the Dolphin Express adapter (see Section 3, Adapter Card Installation). Do not yet connect any cables! Then, power up all nodes again.

2. Installing the drivers on the nodes

1. On all nodes, run the SIA with the option --install-node. This is a local operation which will build and install the drivers on the local machine only.

Tip
You can speed up this node installation by re-using binary RPMs that have been built on another node with the same kernel version and the same CPU architecture. To do so, proceed as follows:
1. After the first installation on a node, the binary RPMs are located in the directories node_RPMS and frontend_RPMS, located in the directory where you launched the SIA. Copy these sub-directories to a path that is accessible from the other nodes.
2. When installing on another node with the same Linux kernel version and CPU architecture, use the --use-rpms option to tell the SIA where it can find matching RPMs for this node, so it does not have to build them once more.

2.

The Dolphin Express drivers should load successfully now, although the SuperSockets service will not be configured. Verify this via dis_services:

# dis_services status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 ( November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) loaded, but not configured.

3.

Stop the SuperSockets service:


# service dis_supersockets stop
Stopping Dolphin SuperSockets drivers                               [  OK  ]

3.

Creating the cluster configuration files

If you have a Linux machine with X available which can run GUI applications, run the SIA with the --install-editor option to install the tool dishostseditor. Ideally, this step is performed on the frontend. If this is the case, you should create the directory /etc/dis and make it writable for root:
# mkdir /etc/dis # chmod 755 /etc/dis

32

Manual Installation

After the SIA has completed the installation, start the tool dishostseditor (default installation location is /opt/DIS/sbin):
# /opt/DIS/sbin/dishostseditor

Information on how to work with this tool can be found in Section 4.3, Working with the dishostseditor. Make sure you create the cabling instructions needed in the next step.

If the dishostseditor was run as root on the frontend, proceed with the next step. Otherwise, copy the configuration files dishosts.conf and networkmanager.conf which you have just created to the frontend and place them there under /etc/dis (you may need to create this directory).

4. Cable Installation

Using the cabling instructions created by dishostseditor in the previous step, the interconnect cables should now be connected (see Section 4.4, Cluster Cabling).

5. On the frontend machine, run the SIA with the --install-frontend option. This will start the network manager, which will then configure the whole cluster according to the configuration files created in the previous steps.

6. Start all services on all the nodes:
# dis_services start
Starting Dolphin IRM 3.3.0 ( November 13th 2007 )                   [  OK  ]
Starting Dolphin Node Manager                                       [  OK  ]
Starting Dolphin SISCI 3.3.0 ( November 13th 2007 )                 [  OK  ]
Starting Dolphin SuperSockets drivers                               [  OK  ]

7. Verify the functionality and performance according to Section 1, Verifying Functionality and Performance.

3. Manual RPM Installation


It is of course possible to manually install the RPM packages on the nodes and the frontend. This section describes how to do this if it should be necessary.

3.1. RPM Package Structure


The Dolphin Express software stack is organized into a number of RPM packages. Some of these packages have inter-dependencies. Dolphin-SCI Low-level hardware driver for the adapter. Installs the dis_irm kernel module, the node manager daemon and the dis_irm and dis_nodemgr services on a node. To be installed on all nodes. Dolphin-SISCI User-level access to the adapter capabilites via the SISCI API. Installs the dis_sisci kernel module and the dis_sisci service, plus the run-time I libary and header files for the SISCI API on a node. To be installed on all nodes. Depends on Dolphin-SCI. Dolphin-SuperSockets Standard Berkeley sockets with low latency and high bandwidth. Installs the dis_mbox, dis_msq and dis_ssocks kernel modules and the dis_supersockets service, and the redirection library for preloading on a node. To be installed on all nodes. Depends on Dolphin-SCI. Dolphin-NetworkHosts Installs the GUI application dishostseditor for creating the cluster configuration files on the frontend. Also installs some template configuration files for manual editing.

33

Manual Installation

To be installed on the frontend (and addtionally other machines that should run dishostseditor). Dolphin-NetworkManager Contains the network manager on the frontend, which talks to all node managers on the nodes. Installs the service dis_networkmgr. To be installed on the frontend. Depends on Dolphin-NetworkHosts. Dolphin-NetworkAdmin Contains the GUI application sciadmin for managing and monitoring the interconnect. sciadmin talks to the network manager and can be installed on any machine that has connection to the frontend. To be installed on the frontend (or any other machine). Dolphin-SISCI-devel To compile and link applications that use the SISCI API on other machines than the nodes, this RPM installs the header files and library plus examples and documentation on any machine. To be installed on the frontend, or any other machine on which SISCI applications should be compiled and linked.

3.2. RPM Build and Installation


On each machine, the matching binary RPM packages need to be built by calling the SIA with the --build-rpm option. This will take some minutes, and the resulting RPMs are stored in three directories:

node_RPMS
Contains the binary RPM packages for the driver and kernel modules to be installed on each node. These RPM packages can be installed on every node with the same kernel version.

frontend_RPMS
Contains the binary RPM packages for the user-level managing software to be installed on the frontend. The Dolphin-SISCI-devel and Dolphin-NetworkAdmin packages can also be installed on other machines (not being the frontend or any of the nodes) for development and administration, respectively.

source_RPMS
The source RPM packages contained in this directory can be used to build binary RPMs on other machines using the standard rpmbuild command.

To install the packages from one directory, just enter the directory and install them all with a single call of the rpm command, like:
# cd node_RPMS
# rpm -Uhv *.rpm
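To verify afterwards which Dolphin packages (and versions) are installed on a machine, the standard rpm query commands can be used, for example:

# rpm -qa | grep Dolphin
# rpm -qi Dolphin-SuperSockets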

4. Unpackaged Installation
Not all target operating systems are supported with native software packages. In this case, a non-package based installation via a tar archive is supported. This type of installation builds all software for both node and frontend, and installs it to a path that you specify. From there, you have to perform the actual driver and service installation using scripts provided with the installation.

This type of installation installs the complete software into a directory on the local machine. Depending on whether this machine will be a node or the frontend, you have to install different drivers or services from there. To install using this method, please proceed as follows:

1. Become superuser:
$ su
#

2. Create the tar archive from the SIA, and unpack it:



# sh DIS_install_<version>.sh --get-tarball
#* Logfile is /tmp/DIS_install.log_260 on node1
#*
#+ Dolphin ICS - Software installation (version: 1.31 $ of: 2007/09/27 15:05:05 $)
#+
#* Generating tarball distribution of the source code
#* NOTE: source tarball is /tmp/DIS.tar.gz
# tar xzf DIS.tar.gz

3. Enter the created directory and configure the build system, specifying the target path <install_path> for the installation. We recommend that you use the standard path /opt/DIS, but you can use any other path. The installation procedure will create subdirectories (like bin, sbin, lib, lib64, doc, man, etc.) relative to this path and install into them.
# cd DIS
# ./configure --prefix=/opt/DIS

4. Build the software stack using make. Check the output when the command returns to see whether the build operation was successful.
# make
...
# make supersockets
...

5. If the build operations were successful, install the software:


# make install
...
# make supersockets-install
...

Tip
You can speed up the installation on multiple nodes if you copy the installation directory over to the other nodes, provided they feature the same Linux kernel version and CPU architecture. The best way is to create a tar archive:
# cd /opt
# tar czf DIS_binary.tar.gz DIS

Transfer this file to /opt on all nodes and unpack it there:


# cd /opt
# tar xzf DIS_binary.tar.gz

6. Install the drivers and services depending on whether the local machine should be a node or the frontend. It is recommended to first install all nodes, then the frontend, and then configure and test the cluster from the frontend.

   For a node, install the necessary drivers and services as follows:

   1. Change to the sbin directory in your installation path:
# cd /opt/DIS/sbin

   2. Invoke the scripts for driver installation using the option -i. The option --start will start the service after a successful installation:
# ./irm_setup -i --start
# ./nodemgr_setup -i --start
# ./sisci_setup -i --start
# ./ssocks_setup -i



Note
      Please make sure that SuperSockets are not started yet (do not provide option --start to the setup script).

   You can remove the driver from the system by calling the script with the option -e. Help is available via -h.

   Repeat this procedure for each node.

   For the frontend, install the necessary services and perform the cluster configuration and test as follows:

   1. Change to the sbin directory in your installation path:
# cd /opt/DIS/sbin

   2. Configure the cluster via the GUI tool dishostseditor:


# ./dishostseditor

      For more information on using dishostseditor, please refer to Section 4.3, Working with the dishostseditor.

   3. Invoke the script for service installation using the option -i:
# ./networkmgr_setup -i --start

      You can remove the service from the system by calling the script with the option -e.

   4. Test the cluster via the GUI tool sciadmin:
# ./sciadmin

      For more information on using sciadmin to test your cluster installation, please refer to Appendix B, sciadmin Reference and Section 1, Verifying Functionality and Performance.

   5. Enable all services, including SuperSockets, on all nodes.
# dis_services start
Starting Dolphin IRM 3.3.0 ( November 13th 2007 )             [  OK  ]
Starting Dolphin Node Manager                                 [  OK  ]
Starting Dolphin SISCI 3.3.0 ( November 13th 2007 )           [  OK  ]
Starting Dolphin SuperSockets drivers                         [  OK  ]

Note
This command has to be executed on the nodes, not only on the frontend!


Chapter 6. Interconnect and Software Maintenance


This chapter describes how to perform a number of typical tasks related to the Dolphin Express interconnect.

1. Verifying Functionality and Performance


When installing the Dolphin Express software stack (which includes SuperSockets) via the SIA, the basic functionality and performance is verified at the end of the installation process by some of the same tests that are described in the following sections. This means that if the tests performed by the SIA did not report any errors, it is very likely that both the software and the hardware work correctly. Nevertheless, the following sections describe the tests that allow you to verify the functionality and performance of your Dolphin Express interconnect and software stack. The tests go from the most low-level functionality up to running socket applications via SuperSockets.

1.1. Low-level Functionality and Performance


The following sections describe how to verify that the interconnect is set up correctly, which means that all nodes can communicate with all other nodes via the Dolphin Express interconnect by sending low-level control packets and performing remote memory access.

1.1.1. Availability of Drivers and Services


Without the required drivers and services running on all nodes and the frontend, the cluster will fail to operate. On the nodes, the kernel services dis_irm (low-level hardware driver), dis_sisci (upper-level hardware services) and dis_ssocks (SuperSockets) need to be running. Next to these kernel drivers, the user-space service dis_nodemgr (node manager, which talks to the central network manager) needs to be active for configuration and monitoring. On the frontend, only the user-space service dis_networkmgr (the central network manager) needs to be running. Because the drivers also appear as services, you can query their status with the usual tools of the installed operating system distribution. For example, on Red Hat-based Linux distributions, you can do
# service dis_irm status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.

Dolphin provides a script dis_services that performs this task for all Dolphin services installed on a machine. It is used in the same way as the individual service command provided by the distribution:
# dis_services status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 ( November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) running.

If any of the required services is not running, you will find more information on the problem that may have occurred in the system log facilities. Call dmesg to inspect the kernel messages, or check /var/log/messages for related messages.
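For example, the driver-related messages can be filtered as follows (a sketch; the exact message prefixes may vary between driver versions):

# dmesg | grep SCI
# grep 'SCI Driver' /var/log/messages | tail -n 20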

1.1.2. Cable Connection Test


To ensure that the cluster is cabled correctly, please perform the cable connection test as described in Section 4.7.4, Cabling Correctness Test.

1.1.3. Static Interconnect Test


The static interconnect test makes sure that all adapters are working correctly by performing a self-test, and determines if the setup of the routing in the adapters is correct (matches the actual hardware topology). It will also check if all cables are plugged in to the adapters, but this has already been done in the Cable Connection Test. The tool to perform this test is scidiag (default location /opt/DIS/sbin/scidiag).


Running scidiag on a node will perform a self test on the local adapter(s) and list all remote adapters that this adapter can see via the Dolphin Express interconnect. This means, to perform the static interconnect test on a full cluster, you will basically need to run scidiag on each node and see if any problems with the adapter are reported, and if the adapters in each node can see all remote adapters installed in the other nodes. An example output of scidiag for a node which is part of a 9-node cluster configured in a 3 by 3 2D-torus, and using one adapter per node, looks like this:
===========================================================================
 SCI diagnostic tool -- SciDiag version 3.2.6d ( September 6th 2007 )
===========================================================================

******************** VARIOUS INFORMATION ********************

Scidiag compiled in 64 bit mode
Driver : Dolphin IRM 3.2.6d ( September 6th 2007 )
Date   : Thu Oct 4 14:20:45 CEST 2007
System : Linux tiger-9 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux

Number of configured local adapters found: 1
Hostbridge : NVIDIA nForce 570 - MCP55 , 0x37610de

Local adapter 0 >
 Type                : D352
 NodeId(log)         : 140
 NodeId(phys)        : 0x2204
 SerialNum           : 200284
 PSB Version         : 0x0d66706d
 LC Version          : 0x1066606d
 PLD Firmware        : 0x0001
 SCI Link frequency  : 166 MHz
 B-Link frequency    : 80 MHz
 Card Revision       : CD
 Switch Type         : not present
 Topology Type       : 2D Torus
 Topology Autodetect : No

OK: Psb chip alive in adapter 0.
SCI Link 0 - uptime 11356 seconds
SCI Link 0 - downtime 0 seconds
SCI Link 1 - uptime 11356 seconds
SCI Link 1 - downtime 0 seconds
OK: Cable insertion ok.
OK: Probe of local node ok.
OK: Link alive in adapter 0.
OK: SRAM test ok for Adapter 0
OK: LC-3 chip accessible from blink in adapter 0.
==> Local adapter 0 ok.

******************** TOPOLOGY SEEN FROM ADAPTER 0 ********************
Adapters found: 9
Switch ports found: 0
----- List of all ranges (rings) found:
In range 0: 0004 0008 0012
In range 1: 0068 0072 0076
In range 2: 0132 0136 0140

REMOTE NODE INFO SEEN FROM ADAPTER 0
   Log  |   Phys  |   resp   |   resp  | resp | resp |   req   |
 nodeId |  nodeId | conflict | address | type | data | timeout | TOTAL
      4 |  0x0004 |        0 |       0 |    0 |    0 |       4 |     4
      8 |  0x0104 |        0 |       0 |    0 |    0 |       1 |     1
     12 |  0x0204 |        0 |       0 |    0 |    0 |       0 |     0
     68 |  0x1004 |        0 |       0 |    0 |    0 |       2 |     2
     72 |  0x1104 |        0 |       0 |    0 |    0 |       0 |     0
     76 |  0x1204 |        0 |       0 |    0 |    0 |       0 |     0
    132 |  0x2004 |        0 |       0 |    0 |    0 |       1 |     1
    136 |  0x2104 |        0 |       0 |    0 |    0 |       1 |     1
    140 |  0x2200 |        0 |       0 |    0 |    0 |       0 |     0
----------------------------------




scidiag discovered 0 note(s).
scidiag discovered 0 warning(s).
scidiag discovered 0 error(s).

TEST RESULT: *PASSED*

The static interconnect test passes if scidiag delivers TEST RESULT: *PASSED* and reports the same topology (remote adapters) on all nodes. More information on running scidiag is provided in ???, where you will also find hints on what to do if scidiag reports warnings or errors, or reports a different topology on different nodes.
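To run the test conveniently on every node from a single machine, a simple loop can be used. This is only a sketch: it assumes password-less ssh access and a plain file named nodes containing one hostname per line, neither of which is part of the Dolphin software.

for n in $(cat nodes); do
    echo "=== $n ==="
    ssh $n /opt/DIS/sbin/scidiag | grep -E 'TEST RESULT|error'
done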

1.1.4. Interconnect Load Test


While the static interconnect test sends only a few packets over the links to probe remote adapters, the Interconnect Load Test puts significant stress on the interconnect and observes if any data transmissions have to be retried due to link errors. This can happen if cables are not correctly connected, e.g. plugged in without the screws being tightened. Before running this test, make sure your cluster is cabled and configured correctly by running the tests described in the previous sections.

1.1.4.1. Test Execution from sciadmin GUI

This test can be performed from within the sciadmin GUI tool. Please refer to Appendix B, sciadmin Reference for details.

1.1.4.2. Test Execution from Command Line

To run this test from the command line, simply invoke sciconntest (default location /opt/DIS/bin/sciconntest) on all nodes.

Note
It is recommended to run this test from the sciadmin GUI (see previous section) because it will perform a more controlled variant of this test and give more helpful results.

All instances of sciconntest will connect and start to exchange data, which can take up to 30 seconds. The output of sciconntest on one node which is part of a 9-node cluster looks like this:
/opt/DIS/bin/sciconntest compiled Oct 2 2007 : 22:29:09
----------------------------
Local node-id     : 76
Local adapter no. : 0
Segment size      : 8192
MinSize           : 4
Time to run (sec) : 10
Idelay            : 0
No Write          : 0
Loopdelay         : 0
Delay             : 0
Bad               : 0
Check             : 0
Mcheck            : 0
Max nodes         : 256
rnl               : 0
Callbacks         : Yes
----------------------------
Probing all nodes
Response from remote node 4
Response from remote node 8
Response from remote node 12
Response from remote node 68
Response from remote node 72
Response from remote node 132
Response from remote node 136
Response from remote node 140
Local segment (id=4, size=8192) is created.
Local segment (id=4, size=8192) is shared.
Local segment (id=8, size=8192) is created.
Local segment (id=8, size=8192) is shared.




Local segment (id=12, size=8192) is created.
Local segment (id=12, size=8192) is shared.
Local segment (id=68, size=8192) is created.
Local segment (id=68, size=8192) is shared.
Local segment (id=72, size=8192) is created.
Local segment (id=72, size=8192) is shared.
Local segment (id=132, size=8192) is created.
Local segment (id=132, size=8192) is shared.
Local segment (id=136, size=8192) is created.
Local segment (id=136, size=8192) is shared.
Local segment (id=140, size=8192) is created.
Local segment (id=140, size=8192) is shared.
Connecting to 8 nodes
Connect to remote segment, node 4
Remote segment on node 4 is connected.
Connect to remote segment, node 8
Remote segment on node 8 is connected.
Connect to remote segment, node 12
Remote segment on node 12 is connected.
Connect to remote segment, node 68
Remote segment on node 68 is connected.
Connect to remote segment, node 72
Remote segment on node 72 is connected.
Connect to remote segment, node 132
Remote segment on node 132 is connected.
Connect to remote segment, node 136
Remote segment on node 136 is connected.
Connect to remote segment, node 140
Remote segment on node 140 is connected.
SCICONNTEST_REPORT
NUM_TESTLOOPS_EXECUTED 1
NUM_NODES_FOUND 8
NUM_ERRORS_DETECTED 0
node 4   : Found
node 4   : Number of failiures : 0
node 4   : Longest failiure : 0.00 (ms)
node 8   : Found
node 8   : Number of failiures : 0
node 8   : Longest failiure : 0.00 (ms)
node 12  : Found
node 12  : Number of failiures : 0
node 12  : Longest failiure : 0.00 (ms)
node 68  : Found
node 68  : Number of failiures : 0
node 68  : Longest failiure : 0.00 (ms)
node 72  : Found
node 72  : Number of failiures : 0
node 72  : Longest failiure : 0.00 (ms)
node 132 : Found
node 132 : Number of failiures : 0
node 132 : Longest failiure : 0.00 (ms)
node 136 : Found
node 136 : Number of failiures : 0
node 136 : Longest failiure : 0.00 (ms)
node 140 : Found
node 140 : Number of failiures : 0
node 140 : Longest failiure : 0.00 (ms)
SCICONNTEST_REPORT_END
SCI_CB_DISCONNECT:Segment removed on the other node disconnecting.....

The test passes if all nodes report 0 failures for all remote nodes. If the test reports failures, you can determine the closest pair(s) of nodes for which these failures are reported and check the cable connection between them. The numerical node identifiers shown in this output are the node ID numbers of the adapters (which identify an adapter in the Dolphin Express interconnect). This test can be run while a system is in production, but you have to take into account that the performance of the productive applications will be reduced significantly while the test is running. If links actually show problems, they might be temporarily disabled, stopping all communication until rerouting takes place.



1.1.5. Interconnect Performance Test


Once the correct installation and setup and the basic functionality of the interconnect have been verified, it is possible to perform a set of low-level benchmarks to determine the base-line performance of the interconnect without any additional software layers. The tests that are relevant for this are scibench2 (streaming remote memory PIO access performance), scipp (request-response remote memory PIO write performance), dma_bench (streaming remote memory DMA access performance) and intr_bench (remote interrupt performance). All these tests need to run on two nodes (A and B) and are started in the same manner:

1. Determine the Dolphin Express node id of both nodes using the query command (default path /opt/DIS/bin/query). The Dolphin Express node id is reported as "Local node-id".

2. On node A, start the server-side benchmark with the options -server and -rn <node id of B>, like:
$ scibench2 -server -rn 8

3. On node B, start the client-side benchmark with the options -client and -rn <node id of A>, like:
$ scibench2 -client -rn 4

4. The test results are reported by the client.

To simply gather all relevant low-level performance data, the script sisci_benchmarks.sh can be called in the same way. It will run all of the described tests (see the example invocation after the result listings below). For the D33x and D35x series of Dolphin Express adapters, the following results can be expected for each test using a single adapter:

scibench2
    Minimal latency to write 4 bytes to remote memory: 0.2 µs. Maximal bandwidth for streaming writes to remote memory: 340 MB/s.
---------------------------------------------------------------
Segment Size:     Average Send Latency:     Throughput:
---------------------------------------------------------------
       4              0.20 us                19.72 MBytes/s
       8              0.20 us                40.44 MBytes/s
      16              0.20 us                80.89 MBytes/s
      32              0.39 us                81.09 MBytes/s
      64              0.25 us               254.22 MBytes/s
     128              0.37 us               348.17 MBytes/s
     256              0.74 us               344.89 MBytes/s
     512              1.49 us               343.05 MBytes/s
    1024              3.00 us               341.90 MBytes/s
    2048              6.00 us               341.39 MBytes/s
    4096             12.00 us               341.45 MBytes/s
    8192             24.00 us               341.32 MBytes/s
   16384             48.04 us               341.03 MBytes/s
   32768             96.03 us               341.24 MBytes/s
   65536            192.56 us               340.33 MBytes/s

scipp
    The minimal round-trip latency for writing to remote memory should be below 4 µs. The average number of retries is not a performance metric and can vary from run to run.
Ping Pong round trip latency for    0 bytes, average retries= 1292     3.69 us
Ping Pong round trip latency for    4 bytes, average retries=  365     3.94 us
Ping Pong round trip latency for    8 bytes, average retries=  359     3.98 us
Ping Pong round trip latency for   16 bytes, average retries=  357     4.01 us
Ping Pong round trip latency for   32 bytes, average retries=    4     4.58 us
Ping Pong round trip latency for   64 bytes, average retries=  346     4.30 us
Ping Pong round trip latency for  128 bytes, average retries=  871     6.26 us
Ping Pong round trip latency for  256 bytes, average retries=  832     6.49 us
Ping Pong round trip latency for  512 bytes, average retries= 1072     7.99 us
Ping Pong round trip latency for 1024 bytes, average retries= 1643    10.99 us
Ping Pong round trip latency for 2048 bytes, average retries= 2738    17.00 us




Ping Pong round trip latency for 4096 bytes, average retries= 4974    29.00 us
Ping Pong round trip latency for 8192 bytes, average retries= 9401    53.06 us

intr_bench
    The interrupt latency is the only performance metric of these tests that is affected by the operating system (which always handles the interrupts) and can therefore vary. The following numbers have been measured with RHEL 4 (Linux kernel 2.6.9):
Average unidirectional interrupt time :  7.665 us.
Average round trip interrupt time     : 15.330 us.

dma_bench
    The typical DMA bandwidth achieved for 64 kB transfers is 240 MB/s, while the maximum bandwidth (for larger blocks) is about 250 MB/s:
   64     19.63 us      3.26 MBytes/s
  128     19.69 us      6.50 MBytes/s
  256     20.36 us     12.57 MBytes/s
  512     21.08 us     24.29 MBytes/s
 1024     23.25 us     44.05 MBytes/s
 2048     26.80 us     76.42 MBytes/s
 4096     34.60 us    118.40 MBytes/s
 8192     50.30 us    162.85 MBytes/s
16384     81.74 us    200.43 MBytes/s
32768    144.73 us    226.41 MBytes/s
65536    270.82 us    241.99 MBytes/s
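As mentioned above, sisci_benchmarks.sh runs all of these tests in one go. A sketch of the invocation, reusing the node IDs 4 and 8 from the example above and assuming the script is installed alongside the other benchmarks in /opt/DIS/bin:

On node A:  $ /opt/DIS/bin/sisci_benchmarks.sh -server -rn 8
On node B:  $ /opt/DIS/bin/sisci_benchmarks.sh -client -rn 4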

1.2. SuperSockets Functionality and Performance


This section describes how to verify that SuperSockets are working correctly on a cluster. The relevant tools and files are dis_ssocks_cfg, sockperf and the entries under /proc/net/af_sci.

1.2.1. SuperSockets Status


The general status of SuperSockets can be retrieved via the SuperSockets init script that controls the service dis_supersockets. On Red Hat systems, this can be done like this:
# service dis_supersockets status

which should show a status of running. If the status shown here is loaded, but not configured, it means that the SuperSockets configuration failed for some reason. Typically, it means that a configuration file could not be parsed correctly. The configuration can be performed manually like this:
# /opt/DIS/sbin/dis_ssocks_cfg

If this indicates that a configuration file is corrupted, you can verify it according to the reference in Section 2, SuperSockets Configuration. At any time, you can re-create dishosts.conf using the dishostseditor and restore modified SuperSockets configuration files (supersockets_ports.conf and supersockets_profiles.conf) from the default versions that have been installed in /opt/DIS/etc/dis. Once the status of SuperSockets is running, you can verify their actual configuration via the files in /proc/net/af_sci. Here, the file socket_maps shows which IP addresses (or network masks) the local node's SuperSockets know about. This file should be non-empty and identical on all nodes in the cluster.
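A quick check of this file could look as follows (a sketch; the actual addresses shown depend on your cluster configuration):

$ cat /proc/net/af_sci/socket_maps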

1.2.2. SuperSockets Functionality


A benchmark that can be used to validate the functionality and performance of SuperSockets is installed as /opt/DIS/bin/socket/sockperf. The basic usage requires two machines (n1 and n2). Start the server process on node n1 without any parameters:
$ sockperf

On node n2, run the client side of the benchmark like:


$ sockperf -h n1


The output for a working setup should look like this:
# sockperf 1.35 - test stream socket performance and system impact
# LD_PRELOAD: libksupersockets.so
# address family: 2
# client node: n2
# server nodes: n1
# sockets per process: 1 - pattern: sequential
# wait for data: blocking recv()
# send mode: blocking
# client/server pairs: 1 (running on 2 cores)
# socket options: nodelay 1
# communication pattern: PINGPONG (back-and-forth)
 bytes  loops  avg_RTT/2[us]  min_RTT/2[us]  max_RTT/2[us]    msg/s     MB/s
     1   1000           4.26           3.67          18.67   117247     0.12
     4   1000           4.16           3.87          11.32   120177     0.48
     8   1000           4.31           4.17          11.81   115889     0.93
    12   1000           4.29           4.17           9.08   116537     1.40
    16   1000           4.29           4.16          10.17   116468     1.86
    24   1000           4.30           4.18           7.16   116251     2.79
    32   1000           4.38           4.21          44.20   114233     3.66
    48   1000           4.50           4.24         102.91   111112     5.33
    64   1000           5.28           5.16           7.54    94687     6.06
    80   1000           5.37           5.20          11.08    93170     7.45
    96   1000           5.41           5.20          11.29    92473     8.88
   112   1000           5.53           5.27          11.04    90400    10.12
   128   1000           5.74           5.59          11.96    87033    11.14
   160   1000           5.85           5.68          10.65    85411    13.67
   192   1000           6.30           6.01          11.24    79383    15.24
   224   1000           6.47           6.20          80.47    77291    17.31
   256   1000           6.82           6.55          17.41    73314    18.77
   512   1000           8.37           8.05          14.52    59766    30.60
  1024   1000          11.69          11.38          17.66    42764    43.79
  2048   1000          15.25          14.90          59.72    32792    67.16
  4096   1000          22.40          22.03          33.08    22318    91.41
  8192    512          47.19          46.39          52.45    10596    86.80
 16384    256          72.87          72.20          78.05     6862   112.43
 32768    128         124.56         123.52         132.97     4014   131.54
 65536     64         225.73         224.68         230.26     2215   145.17

The latency in this example starts around 4 µs. Recent machines deliver latencies below 3 µs, and on older machines, the latency may be higher. Latencies above 10 µs indicate a problem; typical Ethernet latencies start at 20 µs and more. In case of latencies being too high, please verify that SuperSockets are running and configured as described in the previous section. Also, verify that the environment variable LD_PRELOAD is set to libksupersockets.so. This is reported for the client in the second line of the output (see above), but LD_PRELOAD also needs to be set correctly on the server side. See Section 4.8, Making Cluster Application use Dolphin Express for more information on how to make generic socket applications (like sockperf) use SuperSockets.
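A quick way to confirm that the preload is effective on a machine is the ldd approach also described in the FAQ (a sketch; the sockperf path is the documented default install location):

$ export LD_PRELOAD=libksupersockets.so
$ ldd /opt/DIS/bin/socket/sockperf | head -n 2

libksupersockets.so should appear at the very top of the listing.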

1.3. SuperSockets Utilization


To verify if and how SuperSockets are used on a node in operation, the file /proc/net/af_sci/stats can be used:
$ cat /proc/net/af_sci/stats
STREAM sockets: 0   DGRAM sockets: 0
TX connections: 0   RX connections: 0
Extended statistics are disabled.

The first line shows the number of open TCP (STREAM) and UDP (DGRAM) sockets that are using SuperSockets. For more detailed information, the extended statistics need to be enabled. Only the root user can do this:
# echo enable >/proc/net/af_sci/stats


With enabled statistics, /proc/net/af_sci/stats will display a message size histogram (next to some internal information). When looking at this histogram, please keep in mind that the listed receive sizes (RX) may be incorrect, as they refer to the maximal number of bytes that a process wanted to recv when calling the related socket function. Many applications use larger buffers than actually required. Thus, only the send (TX) values are reliable. To observe the current throughput on all SuperSockets-driven sockets, the tool dis_ssocks_stat can be used. Supported options are:
-d    Delay in seconds between measurements. This will cause dis_ssocks_stat to loop until interrupted.
-t    Print time stamp next to measurement point.
-w    Print all output to a single line.
-h    Show available options.

Example:
# dis_ssocks_stat -d 1 -t
(1 s) RX: 162.82 MB/s TX: 165.43 MB/s ( 0 B/s 0 B/s ) Mon Nov 12 17:59:33 CET 2007
(1 s) RX: 149.83 MB/s TX: 168.65 MB/s ( 0 B/s 0 B/s ) Mon Nov 12 17:59:34 CET 2007
...

The first two pairs show the receive (RX) and send (TX) throughput via Dolphin Express of all sockets. The number pair in parentheses shows the throughput of sockets that are operated by SuperSockets, but are currently in fallback (Ethernet) mode. Typically, there will be no fallback traffic.

2. Replacing SCI Cables


The Dolphin Express interconnect is fully hot-pluggable. If one or more cables need to be replaced (or just need to be disconnected and reconnected again for some reason), you can do this while all nodes are up and running. However, if the cluster is in production, you should proceed cable by cable, and not disconnect all affected cables at once, to ensure continued operation without significant performance degradation. To replace a single SCI cable, proceed as follows:

1. Disconnect the cable at both ends. The LEDs on the affected adapters will turn yellow for this link, and the link will show up as disabled in the sciadmin GUI.

2. Properly (re)connect the (replacement) cable. Observe that the LEDs on the adapters within the ringlet light green again after the cable is connected. The link will show up as enabled in the sciadmin GUI.

3. To verify that the new cable is working properly, two alternative procedures can be performed using the sciadmin GUI:

   Run the Cluster Test from within sciadmin. Note that this test will stop all other communication on the interconnect while it is running.

   If running the Cluster Test is not an option, you can check for potential transfer errors as follows:

   a. Scidiag Clear to reset the statistics of both nodes connected to the cable that has been replaced.
   b. Operate the nodes under normal load for some minutes.
   c. Perform Scidiag -V 1 on both nodes and verify if any error counters have increased, especially the CRC error counter.

If any of the verifications did report errors, make sure that the cable is plugged in cleanly, and that the screws are secured. If the error should persist, swap the position of the cable with another one that is known to be working, and observe if the problem moves with the cable. If it does, the cable is likely to be bad. Otherwise, one of the adapters might have a problem.


3. Replacing a PCI-SCI Adapter


In case an adapter needs to be replaced, proceed as follows:

1. Power down the node. The network manager will automatically reroute the interconnect traffic. When you run sciadmin on the frontend, you will see the icon of the node turn red within the GUI representation of the cluster nodes.

2. Unplug all cables from the adapter. Remember (or mark the cables) into which plug on the adapter each cable belongs.

3. Remove the old adapter to be replaced, and insert the new one into the node.

4. Power up the node; then connect the SCI cables in the same way they had been connected before.

5. Make sure that all LEDs on all adapters in the affected ringlets light green again. To verify the installation of the new adapter:

   1. The icon of the node in the sciadmin GUI must have turned green again.
   2. The output of the dis_services script should list all services as running.
   3. Run Scidiag -V 1 for this adapter from sciadmin. No errors should be reported, and all nodes should be visible to this adapter, as illustrated in this example for a node in a 4-node cluster:

      Remote

6. Perform the cable test from within sciadmin to ensure that the cabling is correct (see Section 4.7.4, Cabling Correctness Test).

Warning
Running the cable test will stop other traffic on the interconnect for the time the test is running, which can be up to a minute. If this is not an option, please use scidiag from the commandline to verify the functionality of the interconnect (see Section 1.1.3, Static Interconnect Test).

Communication between all other nodes will continue uninterrupted during this procedure.

4. Physically Moving Nodes


If you move a node physically without changing the cabling, it is obvious that no configuration changes are necessary. If, however, you have to exchange two nodes, or for other reasons place a node which has been part of the cluster at another position within the interconnect that requires the node to be connected to other nodes than before, then this change has to be reflected in the cluster configuration as well. This can easily be done using either dishostseditor or a plain text editor. Please proceed as follows:

1. Power down all nodes that are to be moved. The network manager will automatically reroute the interconnect traffic. When you run sciadmin on the frontend, you will see the icon of the node turn red within the GUI representation of the cluster nodes.

Warning
Powering down more than one node will make other nodes inaccessible via SCI if the powered-down nodes are not located within one ringlet.

2. Move the nodes to the new location and connect the cables to the adapters. Do not yet power them up!


3. Update the cluster configuration file /etc/dis/dishosts.conf on the frontend by either using dishostseditor or a plain text editor:

   If using dishostseditor, load the original configuration (by running dishostseditor on the frontend, it will be loaded automatically) and change the positions of the nodes within the torus. Save the new configuration.

   When using a plain text editor, exchange the hostnames of the nodes in this file. You can also change the adapter and socket names accordingly (which typically contain the hostnames), but this will not affect functionality.

4. Restart the network manager on the frontend to make the changed configuration effective.

5. Power up the nodes. Once they come up, their configuration will be changed by the network manager to reflect their new position within the interconnect.

5. Replacing a Node
In case a node needs to be replaced, proceed as follows concerning the SCI interconnect:

1. Power down the node. The network manager will automatically reroute the interconnect traffic. When you run sciadmin on the frontend, you will see the icon of the node turn red within the GUI representation of the cluster nodes.

2. Unplug all cables from the adapter. Remember (or mark the cables) into which plug on the adapter each cable belongs.

3. Unmount the adapter from the node to be replaced, and insert it into the new node.

4. Power up the node; then connect the SCI cables in the same way they had been connected before. Make sure that all LEDs on all adapters in the affected ringlets light green again.

5. Run the SIA with the option --install-node. To verify the installation after the SIA has finished:

   1. The icon of the node in the sciadmin GUI must have turned green again.
   2. The output of the dis_services script should list all services as running.

6. Perform the cable test from within sciadmin to ensure that the cabling is correct (see Section 4.7.4, Cabling Correctness Test).

Warning
Running the cable test will stop other traffic on the interconnect for the time the test is running, which can be up to a minute. If this is not an option, please use scidiag from the commandline to verify the functionality of the interconnect (see Section 1.1.3, Static Interconnect Test).

In case more than one node needs to be replaced, please consider the following advice: to ensure that all nodes that are not to be replaced can continue to communicate via the SCI interconnect while other nodes are replaced, you should replace nodes in a ring-by-ring manner: power down nodes within one ringlet only. Bring this group of nodes back to operation before powering down the next group of nodes. Communication between all other nodes will continue uninterrupted during this procedure.

6. Adding Nodes
To add new nodes to the cluster, please proceed as follows:

1. Install the adapters in the nodes to be added and power the nodes up.



Important
Do not yet connect the SCI cables, but make sure that Ethernet communication towards the frontend is working.

2. Install the DIS software stack on all nodes via the --install-node option of the SIA.

3. Change the cluster configuration using dishostseditor:

   1. Load the existing configuration.
   2. In the cluster settings, change the topology to match the topology with all new nodes added.
   3. Change the hostnames of the newly added nodes via the node settings of each node. Also make sure that the socket configuration matches that of the existing nodes.
   4. Save the new cluster configuration. If desired, create and save or print the cabling instructions for the extended cluster.
   5. If you are not running dishostseditor on the frontend, transfer the saved files dishosts.conf and cluster.conf to the directory /etc/dis on the frontend.

4. Restart the network manager on the frontend. If you run sciadmin, the new nodes should show up as red icons. All other nodes should continue to stay green.

5. Cable the new nodes:

   1. First create the cable connections between the new nodes.
   2. Then connect the new nodes to the cluster. Proceed cable by cable, which means you disconnect a cable at an "old" node and immediately connect it to a new node (without disconnecting another cable first). This will ensure continued operation of all "old" nodes.
   3. When you are done, all LEDs on all adapters should light green, and all node icons in sciadmin should also light green.

6. Perform the cable test from within sciadmin to ensure that the cabling is correct (see Section 4.7.4, Cabling Correctness Test).

Warning
Running the cable test will stop other traffic on the interconnect for the time the test is running, which can be up to a minute. If this is not an option, please use scidiag from the commandline to verify the functionality of the interconnect (see Section 1.1.3, Static Interconnect Test).

7. Removing Nodes
To permanently remove nodes from the cluster, please proceed as follows:

1. Change the cluster configuration using dishostseditor:

   1. Load the existing configuration.
   2. In the cluster settings, change the topology to match the topology with all nodes removed.
   3. Because the topology change might cut off nodes from the cluster at the "wrong" end, you have to make sure that the hostnames and the placement within the new topology are correct for the remaining nodes. To do this, change the hostnames of nodes by double-clicking their icon and changing the hostname in the displayed dialog box. If the SuperSockets configuration is based on the hostnames (not on the subnet addresses), make sure that the name of the socket interface matches a modified hostname.


   4. Save the new cluster configuration. If desired, create and save or print the cabling instructions for the reduced cluster.
   5. If you are not running dishostseditor on the frontend, transfer the saved files dishosts.conf and cluster.conf to the directory /etc/dis on the frontend.

2. Restart the network manager on the frontend. If you run sciadmin, the removed nodes should no longer show up. All other nodes should continue to stay green.

3. Uncable the nodes to be removed one by one, making sure that the remaining nodes are cabled according to the cabling instructions generated above.

4. On the nodes that have been removed from the cluster, the Dolphin Express software can easily be removed using the SIA option --wipe, like:
# sh DIS_install_<version>.sh --wipe

   This will remove all Dolphin software packages, services and configuration data from the node. If no SIA is available, the same effect can be achieved by manually uninstalling all packages that start with Dolphin-, removing potentially remaining installation directories (like /opt/DIS), and removing the configuration directory /etc/dis.

5. Perform the cable test from within sciadmin to ensure that the cabling is correct (see Section 4.7.4, Cabling Correctness Test).

Warning
Running the cable test will stop other traffic on the interconnect for the time the test is running, which can be up to a minute. If this is not an option, please use scidiag from the commandline to verify the functionality of the interconnect (see Section 1.1.3, Static Interconnect Test).


Chapter 7. MySQL Operation


This chapter covers topics that are specific to operating MySQL and MySQL Cluster with SuperSockets.

1. MySQL Cluster
Using MySQL Cluster with Dolphin Express and SuperSockets will significantly increase throughput and reduce the response time. Generally, SuperSockets operate fully transparently and no change in the MySQL Cluster configuration is necessary for a working setup. Please read below for some hints on performance tuning and trouble-shooting specific to SuperSockets with MySQL Cluster.

1.1. SuperSockets Poll Optimization


SuperSockets offer an optimization of the functions used by processes to detect new data or other status changes on sockets (poll() and select()). This optimization typically helps to increase performance. It has, however, been shown that for certain MySQL Cluster setups and loads, this optimization can have a negative impact on performance. It is therefore recommended that you evaluate which setting of this optimization delivers the best performance for your setup and load. This optimization can be controlled on a per-process basis and with a node-global default. Explicit per-process settings override the global default. For a per-process setting, you need to set the environment variable SSOCKS_SYSTEM_POLL to 0 to enable the optimization and to 1 to disable the optimization and use the functions provided by the operating system. To set the node-global default, the setting of the variable system_poll needs to be changed in /etc/dis/supersockets_profiles.conf. Please refer to Section 2, SuperSockets Configuration for more details.
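As a minimal sketch of the per-process setting, an ndbd process could be started with the poll optimization disabled like this (the path to the ndbd binary is only an example and depends on your MySQL Cluster installation):

$ LD_PRELOAD=libksupersockets.so SSOCKS_SYSTEM_POLL=1 /usr/sbin/ndbd

Setting SSOCKS_SYSTEM_POLL=0 instead would explicitly enable the optimization for that process, overriding the node-global default.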

1.2. NDBD Deadlock Timeout


In case you experience timeout errors between the ndbd processes on the cluster nodes, you will see error messages like
ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction

Such timeout problems can have different reasons, like dead or overloaded nodes. One other possible reason could be that the time for a socket failover between Dolphin Express and Ethernet exceeded the current timeout, which by default is 1200 ms. This should rarely happen, but to solve this problem, please proceed as follows:

1. Increase the value of the NDBD configuration parameter TransactionDeadlockDetectionTimeout (see MySQL reference manual, section 16.4.4.5). As the default value is 1200 ms, increasing it to 5000 ms might be a good start. You will need to add this line to the NDBD default section in your cluster configuration file:
[NDBD DEFAULT]
TransactionDeadlockDetectionTimeout: 5000

2. Verify the state of the Dolphin Express interconnect by checking the logfile of the network manager (/var/log/scinetworkmanager.log) to see if any interconnect events have been logged. If there are repeated logged error events for which no clear reason (such as manual intervention or node shutdown) can be determined, you should test the interconnect using sciadmin (see Section 4.2, Traffic Test).

3. If no events have been logged, it is very unlikely that the interconnect or SuperSockets are the cause of the problem. Instead, you should try to verify that no node is overloaded or stuck.

1.3. SCI Transporter


Prior to SuperSockets, MySQL Cluster could use SCI for communication by means of the SCI Transporter, offered by MySQL. The SCI Transporter is a SISCI-based data communication channel. However, performance with SuperSockets is significantly better than with the SCI Transporter, which is no longer maintained. Platforms that have SuperSockets available should use those instead of SCI Transporter.



2. MySQL Replication
SuperSockets also significantly increase the performance in replication setups (speedups of up to a factor of 3 have been reported). All that is necessary is to make sure that all MySQL server processes involved run with the LD_PRELOAD variable set accordingly. No MySQL configuration changes are necessary.
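As a sketch, enabling SuperSockets for a manually started MySQL server could look like this (the path to mysqld_safe is only an example and varies between distributions and MySQL packages; see also Section 4.8 on preloading):

$ LD_PRELOAD=libksupersockets.so /usr/bin/mysqld_safe &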


Chapter 8. Advanced Topics


This chapter deals with techniques like performance analysis and tuning, irregular topologies and debug tools. For most installations, the content of this chapter is not relevant.

1. Notification on Interconnect Status Changes


The network manager provides a mechanism to trigger actions when the state of the interconnect changes. The action to be triggered is a user-definable script or executable that is run by the network manager when the interconnect status changes.

1.1. Interconnect Status


The interconnect can be in any of these externally visible states:

UP
    All nodes and interconnect links are functional.

DEGRADED
    All nodes are up, but one or more interconnect links have been disabled. Disabling links can either happen manually via sciadmin, or through the network manager because of problems reported by the node managers. In status DEGRADED, all nodes can still communicate via Dolphin Express, but the overall performance of the interconnect may be reduced.

FAILED
    One or more nodes are down (the node manager is not reachable via Ethernet), and/or a high number of links has been disabled, which isolates one or more nodes from the interconnect. These nodes can not communicate via Dolphin Express, but SuperSockets, for example, will fall back to communicating via Ethernet if it is available.

UNSTABLE
    UNSTABLE is a state which is only visible externally. If the interconnect is changing states frequently (e.g. because nodes are rebooted one after the other), the interconnect will enter the state UNSTABLE. After a certain period of less frequent internal status changes (which are continuously recorded by the network manager), the external state will again be set to either UP, DEGRADED or FAILED. While in status UNSTABLE, the network manager will enable verbose logging (to /var/log/scinetworkmanager.log) to make sure that no internal events are lost.

1.2. Notification Interface


When the network manager invokes the specified script or executable, it hands over a number of parameters by setting environment variables. The content of these variables can be evaluated by the script or executable. The following variables are set:

DIS_FABRIC
    The number of the fabric for which this notification is generated. Can be 0, 1 or 2.

DIS_STATE
    The new state of the fabric. Can be either UP, DEGRADED, FAILED or UNSTABLE.

DIS_OLDSTATE
    The previous state of the fabric. Can be either UP, DEGRADED, FAILED or UNSTABLE.

DIS_ALERT_TARGET
    This variable contains the target address for the notification. This target address is provided by the user when the notification is enabled (see below), and the user needs to make sure that the content of this variable is useful for the chosen alert script. E.g., if the alert script should send an email, the content of this variable needs to be an email address.

DIS_ALERT_VERSION
    The version number of this interface (currently 1). It will be increased if incompatible changes to the interface need to be introduced, which could be a change in the possible content of an existing environment variable, or the removal of an environment variable. This is unlikely and does not necessarily make an alert script fail, but a script that relies on this interface in a way where this matters needs to verify the content of this variable.
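As a minimal sketch of a custom alert script (this is not the alert.sh shipped by Dolphin; the availability of the mail command is an assumption), these variables could be evaluated like this:

#!/bin/sh
# Hypothetical example alert script: the network manager passes all
# information via environment variables.
SUBJECT="Fabric ${DIS_FABRIC}: state changed from ${DIS_OLDSTATE} to ${DIS_STATE}"
# Only send a mail when the fabric leaves the UP state.
if [ "${DIS_STATE}" != "UP" ]; then
    # For this script, DIS_ALERT_TARGET is expected to be an email address.
    echo "${SUBJECT}" | mail -s "${SUBJECT}" "${DIS_ALERT_TARGET}"
fi
exit 0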

1.3. Setting Up and Controlling Notification


1.3.1. Configure Notification via the dishostseditor
Notification on interconnect status changes is done via the dishostseditor. In the Cluster Edit dialog, tick the check box above Alert target as shown in the screenshot below.

Then enter the alert target and choose the alert script by pressing the button and selecting the script in the file dialog. Dolphin provides an alert script /opt/DIS/etc/dis/alert.sh (for the default installation path) which sends out an email to the specified alert target. Any other executable can be specified here. Please consider that this script will be executed in the context of the user running the network manager (typically root), so the permissions to change this file should be set accordingly. To make the changes done in this dialog effective, you need to save the configuration files (to /etc/dis on the frontend) and then restart the network manager:
# service dis_networkmgr restart

1.3.2. Configure Notification Manually


If the dishostseditor can not be used, it is also possible to configure the notification by editing /etc/dis/networkmanager.conf. Notification is controlled by two options in this file:

-alert_script <file>
    This parameter specifies the alert script <file> to be executed.

-alert_target <target>
    This parameter specifies the alert target <target> which is passed to the chosen alert script.
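A corresponding fragment of networkmanager.conf could look like this (the email address is only an example; the alert.sh path assumes the default installation path):

-alert_script /opt/DIS/etc/dis/alert.sh
-alert_target admin@example.com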

To disable notification, these lines can be commented out (precede them with a #). After the file has been edited, the network manager needs to be restarted to make the changes effective:



# service dis_networkmgr restart

1.3.3. Verifying Notification


To verify that notification is actually working, you should provoke an interconnect status change manually. This can easily be done from sciadmin by disabling any link via the Node Settings dialog of any node.

1.3.4. Disabling and Enabling Notification Temporarily


Once the notification has been configured, it can be controlled via sciadmin. This is useful if the alerts should be stopped for some time. To disable alerts, open the Cluster Settings dialog and switch the setting next to Alert script as needed.

This is a per-session setting and will be lost if the network manager is restarted.

Warning
Make sure that the messages are enabled again before you quit sciadmin. Otherwise, no notifications will be sent for interconnect status changes until the network manager is restarted.

2. Managing IRM Resources


A number of resources in the low-level driver IRM (service dis_irm) are run-time limited by parameters in the driver configuration file /opt/DIS/lib/modules/<kernel version>/dis_irm.conf (for the default installation path). This file contains numerous parameter settings; for those parameters that are relevant for changes by the user, please refer to Section 3.1, dis_irm.conf. Generally, to change (increase) default limits, the dis_irm.conf file needs to be changed on each node. Typically, you should edit and test the changes on one node, and then copy the file over to all other nodes. To make changes in the configuration file effective, you need to restart the dis_irm driver. Because all other drivers depend on it, it is necessary to restart the complete software stack on the nodes:
# dis_services restart
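To propagate a tested dis_irm.conf to all other nodes and restart their software stacks, a loop like the following could be used. This is only a sketch: it assumes password-less ssh access, a plain file named nodes with one hostname per line, and that <kernel version> is replaced by the actual kernel version of the nodes.

for n in $(cat nodes); do
    scp /opt/DIS/lib/modules/<kernel version>/dis_irm.conf \
        $n:/opt/DIS/lib/modules/<kernel version>/
    ssh $n dis_services restart
done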

2.1. Updates with Modified IRM Configuration


You need to be careful when updating RPMs on the nodes with a modified dis_irm.conf. If you directly use RPMs to update the existing Dolphin-SCI RPM, e.g. using rpm -U, the existing and modified dis_irm.conf will



be moved to dis_irm.conf.rpmsave, and the default dis_irm.conf will replace the previously modified version. You will need to undo this file renaming. If you update your system with the SIA as described in Chapter 4, Update Installation, the SIA will take care that the existing dis_irm.conf is preserved and stays effective.


Chapter 9. FAQ
This chapter lists problems that have been reported by customers and the proposed solutions.

1. Hardware
1.1.1. Although I have properly installed the adapter in a node and its LEDs light orange, I am told (i.e. during the installation) that this node does not contain an SCI adapter!

The SCI adapter might not have been recognized by the node during the power-up initialization after power was applied again. The specification requires that a node needs to be powered down for at least 5 seconds before being powered up again. To make the adapter be recognized again, you will need to power down the node (restarting or resetting is not sufficient!), wait for at least 5 seconds, and power it up again. If this does not fix the problem, please contact Dolphin support.

1.1.2. All cables are connected, and all LEDs shine green on the adapter boards, all required services and drivers are running on all nodes. However, some nodes can not see some other nodes via the SCI interconnect. Between some other pairs of nodes, the communication works fine.

These symptoms indicate that the cabling is not correct, i.e. the links 0 and 1 (x- and y-direction in a 2D-torus) are exchanged. To resolve the problem, proceed as follows:

1. Run the cable test from sciadmin (Server → Test Cable Connections). If no problem is reported, please contact Dolphin Support.
2. To fix the cable problem, create a cabling description via dishostseditor (File → Get Cabling Instructions) and correct the cabling between the nodes that have been reported in the cable test.
3. Repeat steps 1 and 2 until no more problems are reported.

1.1.3. The SCI driver dis_irm refuses to load, or the driver install never completes. Running dmesg shows that the syslog contains the line Out of vmalloc space. What's wrong?

The problem is that the SCI adapter requires more virtual PCI address space than is supported by the installed kernel. This problem has so far only been observed on 32-bit operating systems. There are two alternative solutions:

1. If you are building a small cluster, you may be able to run your application with less SCI address space. You can change the SCI address space size for the adapter card by using sciconfig with the command set-prefetch-mem-size. A value of 64 or 16 will most likely overcome the problem. This operation can also be performed from the command line using the options -c to specify the card number (1 or 2) and -spms to specify the prefetch memory size in Megabytes:
# sciconfig -c 1 -spms 64
Card 1 - Prefetch space memory size is set to 64 MB
A reboot of the machine is required to make the changes take effect.

   After rebooting the machine, the problem should be solved.

2. If reducing the prefetch memory size is not desired, the related resources in the kernel have to be increased. For x86-based machines, this is achieved by passing the kernel option vmalloc=256m and the parameter uppermem=524288 at boot time. This is done by editing /boot/grub/grub.conf as shown in the following example:
title CentOS-4 i386 (2.6.9-11.ELsmp)
        root (hd0,0)
        uppermem 524288
        kernel /i386/vmlinuz-2.6.9-11.ELsmp ro root=/dev/sda6 rhgb quiet vmalloc=256m
        initrd /i386/initrd-2.6.9-11.ELsmp.img



2. Software
2.1.1. The service dis_irm (for the low-level driver of the same name) fails to load after it has been installed for the first time.

Please follow the procedure below to determine the cause of the problem.

1. Verify that the adapter card has been recognized by the machine. This can be done as follows:
[root@n1 ~]# lspci -v | grep Dolphin
03:0c.0 Bridge: Dolphin Interconnect Solutions AS PSB66 SCI-Adapter D33x
        Subsystem: Dolphin Interconnect Solutions AS: Unknown device 2200

   If this command does not show any output similar to the example above, the adapter card has not been recognized. Please try to power-cycle the system according to FAQ Q: 1.1.1. If this does not solve the issue, a hardware failure is possible. Please contact Dolphin support in such a case.

2. Check the syslog for relevant messages. This can be done as follows:
# dmesg | grep SCI

   Depending on which messages you see, proceed as described below:

   SCI Driver : Preallocation failed
       The driver failed to preallocate memory which will be used to export memory to remote nodes. Rebooting the node is the simplest solution to defragment the physical memory space. If this is not possible, or if the message appears even after a reboot, you need to adapt the preallocation settings (see Section 3.1, dis_irm.conf).

   SCI Driver: Out of vmalloc space
       See FAQ Q: 1.1.3.

3. If the driver still fails to load, please contact Dolphin support and provide the driver's syslog messages:
# dmesg > /tmp/syslog_messages.txt

2.1.2. Although the Network Manager is running on the frontend, and all nodes run the Node Manager, configuration changes are not applied to the adapters. E.g., the node ID is not changed according to what is specified in /etc/dis/dishosts.conf on the frontend.

The adapters in a node can only be re-configured when they are not in use. This means that no adapter resources may be allocated via the dis_irm kernel module. To achieve this, make sure that upper-layer services that use dis_irm (like dis_sisci and dis_supersockets) are stopped. On most Linux installations, this can be achieved like this (dis_services is a convenience script that comes with the Dolphin software stack):
# dis_services stop
...
# service dis_irm start
...
# service dis_nodemgr start
...

2.1.3. The Network Manager on the frontend refuses to start.

In most cases, the interconnect configuration /etc/dis/dishosts.conf is corrupted. This can be verified with the command testdishosts. It will report problems in this configuration file, as in the example below:
# testdishosts
socket member node-1_0 does not represent a physical adapter in dishosts.conf



DISHOSTS: signed32 dishostsAdapternameExists() failed

In this case, the adapter name in the socket definition was misspelled. If testdishosts reports a problem, you can either try to fix /etc/dis/dishosts.conf manually, or re-create it with dishostseditor. If this does not solve the problem, please check /var/log/scinetworkmanager.log for error messages. If you can not fix the problem reported in this logfile, please contact Dolphin support providing the content of the logfile. 2.1.4. After a node has booted, or after I restarted the Dolphin drivers on a node, the first connection to a remote node using SuperSockets does only deliver Ethernet performance. Retrying the connection then delivers the expected SuperSockets performance. Why does this happen? Make sure you run the node manager on all nodes of the cluster, and the network manager on the frontend being correctly set up to include all nodes in the configuration (/etc/dis/dishosts.conf). The option Automatic Create Session must be enabled. This will ensure that the low-level "sessions" (Dolphin-internal) are set up between all nodes of the cluster, and a SuperSockets connection will immediately succeed. Otherwise, the set-up of the sessions will not be done until the first connection between two nodes is tried, but this is too late for the first connection to be established via SuperSockets. 2.1.5. Socket benchmarks show that SuperSockets are not active as the minimal latency is much more than 10us. The half roundtrip latency (ping-pong latency) with SuperSockets typically starts between 3 and 4us for very small messages. Any value above 7us for the minimal latency indicates a problem with the SuperSockets configuration, benchmark methodology of something else. Please proceed as follows to determine the reason: 1. Is the SuperSockets service running on both nodes? /etc/init.d/dis_supersockets status should report the status running. If the status is stopped, try to start the SuperSockets service with /etc/init.d/ dis_supersockets start. 2. Is LD_PRELOAD=libksupersockets.so set on both nodes? You can check using the ldd command. Assuming the benchmark you want to run is named sockperf, do ldd sockperf. The libksupersockets.so should appear at the very top of the listing. 3. Are the SuperSockets configured for the interface you are using? This is a possible problem if you have multiple Ethernet interfaces in your nodes with the nodes having different hostnames for each interface. SuperSockets may be configured to accelerate not all of the available interfaces. To verify this, check which IP addresses (or subnet mask) are accelerated by SuperSockets by looking at /proc/net/af_sci/socket_maps (Linux) and use those IP addresses (or related hostnames) that are listed in this file. 4. If the SuperSockets service refuses to start, or only starts into the mode running, but not configured, you probably have a corrupted configuration file /etc/dis/dishosts.conf: verify that this file is identical to the same file on the frontend. If not, make sure that the Network Manager is running on the frontend (/etc/init.d/dis_networkmgr start). 5. If the dishosts.conf files are identical on frontend and node, they could still be corrupted. Please run the dishostseditor on the frontend to have it load /etc/dis/dishosts.conf; then save it again (dishostseditor will always create syntactically correct files). 6. Please check the system log using the dmesg command. Any output there from either dis_ssocks or af_sci should be noted and reported to <support@dolphinics.com>. 2.1.6. 
2.1.6. I am running a mixed 32/64-bit platform, and while the benchmarks latency_bench and sockperf from the DIS installation show good performance of SuperSockets, other applications only show Ethernet performance for socket communication.


Please use the file command to verify whether the applications that fail to use SuperSockets are 32-bit applications. If they are, please verify that the 32-bit SuperSockets library can be found as /opt/DIS/lib/libksupersockets.so (this is a link). If this file is not found, it could not be built due to a missing or incomplete 32-bit compilation environment on your build machine. This problem is indicated by the SIA message #* WARNING: 32-bit applications may not be able to use SuperSockets. If 32-bit libraries can not be built on a 64-bit platform, the RPM packages will still be built successfully (without 32-bit libraries included), as many users of 64-bit platforms only run 64-bit applications. To fix this problem, make sure that the 32-bit versions of the glibc and libgcc-devel packages are installed on your build machine, and re-build the binary RPM packages using the SIA option --build-rpm, making sure that the warning message shown above does not appear. Then, replace the existing RPM package DolphinSuperSockets with the one you have just built. Alternatively, you can perform a complete re-installation.

2.1.7. I have added the statement export LD_PRELOAD=libksupersockets.so to my shell profile to enable the use of SuperSockets. This works well on some machines, but on other machines, I get the error message
ERROR: ld.so object 'libksupersockets.so' from LD_PRELOAD cannot be preloaded : ignore

whenever I log in. How can this be fixed? This error message is generated on machines that do not have SuperSockets installed. On these machines, the linker can not find the libksupersockets.so library. This can be fixed by setting the LD_PRELOAD environment variable only if SuperSockets are running. For an sh-type shell such as bash, use the following statement in the shell profile ($HOME/.bashrc):
[ -d /proc/net/af_sci ] && export LD_PRELOAD=libksupersockets.so

2.1.8. How can I permanently enable the use of SuperSockets for a user? This can be achieved by setting the LD_PRELOAD environment variable in the user's shell profile (i.e. $HOME/.bashrc for the bash shell). This should be done conditionally by checking if SuperSockets are running on this machine:
[ -d /proc/net/af_sci ] && export LD_PRELOAD=libksupersockets.so

Of course, it is also possible to perform this setting globally (in /etc/profile).

2.1.9. I can not build SISCI applications that are able to run on my cluster because the frontend (where the SISCI-devel package was installed by the SIA) is a 32-bit machine, while my cluster nodes are 64-bit machines (or vice versa). I fail to build the SISCI applications on the nodes as the SISCI header files are missing. How can this deadlock be solved?

When the SIA installed the cluster, it stored the binary RPM packages in the directories node_RPMS, frontend_RPMS and source_RPMS. You will find a SISCI-devel RPM that can be installed on the nodes in the node_RPMS directory. If you can not find these RPM files, you can recreate them on one of the nodes using the SIA with the --build-rpm option. Once you have the Dolphin-SISCI-devel binary RPM, you need to install it on the nodes using the --force option of rpm because the library files conflict between the installed SISCI and the SISCI-devel RPM:
# rpm -i --force Dolphin-SISCI-devel.<arch>.<version>.rpm


Appendix A. Self-Installing Archive (SIA) Reference


Dolphin provides the complete software stack as a self-installing archive (SIA). This is a single file that contains the complete source code as well as a setup script that can perform various operations, i.e. compiling, installing and testing the required software on all nodes and the frontend. Short usage information is displayed when calling the SIA archive with the --help option.

1. SIA Operating Modes


This section explains the different operations that can be performed by the SIA.

1.1. Full Cluster Installation


The full cluster installation mode will install and test the full cluster in a wizard-like guided installation. All required information will be asked for interactively, and it will be tested if the requirements to perform the installation are met. This mode is the default mode, but can also be selected explicitly. Option: --install-all

1.2. Node Installation


Only build and install the kernel modules and the Node Manager service needed to run an interconnect node. Kernel header files and the kernel configuration are required (package kernel-devel), but no GUI packages (like qt, qt-devel, X packages). Option: --install-node

1.3. Frontend Installation


The frontend installation will only build and install those RPM packages on the local host that are needed to have this host run as a frontend. However, due to limitations of the current build system, the kernel headers and configuration are still required for Linux systems. Option: --install-frontend

1.4. Installation of Configuration File Editor


Build and install the GUI-based cluster configuration tool dishostseditor. This tool is used to define the topology of the interconnect and the placement of the nodes within this topology. With this information, it can create the detailed cabling instructions (useful to cable non-trivial cluster setups) and the cluster configuration files dishosts.conf and networkmanager.conf needed on the frontend by the Network Manager. Option: --install-editor

1.5. Building RPM Packages Only


Build all source and binary RPMs on the local machine. Both the kernel headers and configuration (kernel-devel) and the GUI development package (qt-devel) are needed on the local machine. Option: --build-rpm

1.6. Extraction of Source Archive


It is possible to extract the sources from the SIA as a tar archive DIS.tar.gz in the current directory. This is required to build and install on non-RPM platforms, or when you want source-code access in general.


Option: --get-tarball

2. SIA Options
In addition to the different operating modes, a number of options are available that influence the operation. Not all options have an impact on all operating modes.

2.1. Node Specification


If you want to specify the list of nodes on the command line rather than interactively, use the option --nodes together with a comma-separated list of hostnames and/or IP addresses. Example:
--nodes n01,n02,n03,n04

If this option is provided, existing configuration files like /etc/dis/dishosts.conf will not be considered.

2.2. Installation Path Specification


By default, the complete software stack will be installed to /opt/DIS. To change the installation path, use the --prefix option. Example:
--prefix /usr/dolphin

This will install into /usr/dolphin. It is recommended to install into a dedicated directory that is located on a local storage device (not mounted via the network). When doing a full cluster install (--install-all, or default operation), the same installation path will be used on all nodes, the frontend and potentially the installation machine (if different from the frontend).
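As an illustration only (the node names are placeholders and the SIA file name depends on the version you downloaded), a full cluster installation with an explicit node list and a non-default installation path could be started like this:

# ./DIS_install_<version> --install-all --nodes n01,n02,n03,n04 --prefix /usr/dolphin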

2.3. Installing from Binary RPMs


If you are re-running an installation for which the binary RPM packages have already been built, you can save time by not building these packages again, but using the existing ones. The packages have to be placed in two subdirectories node_RPMS and frontend_RPMS, just as the SIA does. Then, provide the name of the directory containing these two subdirectories to the installer using the --use-rpms option. Example:
--use-rpms $HOME/dolphin

The installer does not verify if the provided packages match the installation target, but the RPM installation itself will fail in this case.
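For example (the directory layout shown is illustrative, assuming the RPMs were saved from a previous SIA run), the installer can be pointed at the existing packages like this:

# ls $HOME/dolphin
frontend_RPMS  node_RPMS
# ./DIS_install_<version> --install-all --use-rpms $HOME/dolphin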

2.4. Preallocation of SCI Memory


It is possible to specify the number of Megabytes per node that the low-level interconnect driver dis_irm should allocate on startup for exportable memory segments. The amount of this memory determines, for example, how many SuperSockets-based sockets can be opened. The default setting was chosen to work well with clustered databases. Change this setting if you know you will need more memory to be exported, or will use a very high number of stream sockets per node (datagram sockets are multiplexed and thus need fewer resources). By default, the driver allocates 8 + N*MB Megabytes of memory, with N being the number of nodes in the cluster and MB = 4 by default. A maximum of 256MB will be allocated. The factor MB can be specified on installation using the --prealloc option. Example:


--prealloc 8

On a 16 node cluster, this will make the dis_irm allocate 8 + 16*8 = 136MB on each node.

Note
The operating system can not use preallocated memory for other purposes - it is effectively invisible. Setting MB to -1 will disable all modifications to this configuration option, and the fixed default of 16MB will be preallocated independently from the number of nodes. Setting MB to 0 is also valid (8 MB will be allocated). This option changes a value in the module configuration file dis_irm.conf. It is only effective on an initial installation. An existing configuration file dis_irm.conf will never be changed, i.e. when upgrading an existing installation.

2.5. Enforce Installation


If the installed packages should be replaced with the packages built from the SIA you are currently using, even if the installed packages are more recent (have a higher version number), use the option --enforce. This will enforce the installation of the same software version (the one delivered within this SIA) on all nodes and the frontend, no matter what might be installed on any of these machines. Example:
--enforce

2.6. Configuration File Specification


When doing a full cluster install, the installation script will automatically look for the cluster configuration files dishosts.conf and networkmanager.conf in the default path /etc/dis on the installation machine. If these files are not stored in the default path (i.e. because you have created them on another machine or received them from Dolphin and stored them someplace else), you can specify this path using the --config-dir option. Example:
--config-dir /tmp

The script will look for both configuration files in /tmp. If you need to specify the two configuration files being stored in different locations, use the options --dishostsconf <filename> and --networkmgr-conf <filename>, respectively, to specify where each of the configuration files can be found.

2.7. Batch Mode


In case you want to run the installation unattended, you can use the --batch option to have the script assume the default answer for every question that is asked. Additionally, you can suppress most of the console output (but still get the full logfile) by providing the option --quiet. This option can be very useful if you are upgrading an already installed cluster. For example, to enforce the installation of newly compiled RPM packages and reboot the nodes after the installation, you could issue the following command on the frontend:
# ./DIS_install_<version> --batch --reboot --enforce >install.log

After this command returns, your cluster is guaranteed to be freshly installed unless any error messages can be found in the file install.log.
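One simple way to verify an unattended run afterwards (illustrative only) is to scan the logfile for error messages:

# grep -i error install.log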

2.8. Non-GUI Build Mode


When building RPMs only (using the --build-rpm option), it is possible to specify that no GUI applications (sciadmin and dishostseditor) should be built. This is done by providing the --disable-gui option. This removes the dependency on the Qt libraries and header files for the build process. Example:
--disable-gui


2.9. Software Removal


To remove all software that has been installed via SIA, simply use the --uninstall option:
--uninstall

This will remove all packages from the node and stop all drivers (if they are not in use). A more thorough cleanup, including all configuration data and possible remnants of non-SIA installations, can be achieved with the --wipe option:
--wipe

This option is a superset of --uninstall.


Appendix B. sciadmin Reference


1. Startup
Check out sciadmin -h for startup options. In order to connect to the network manager you may either start sciadmin with the -cluster option, or choose the Connect button after the startup is complete. Type the hostname or IP-address of the machine running the Dolphin Express network manager.
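For illustration only (the frontend hostname is a placeholder, and the exact option syntax should be confirmed with sciadmin -h on your installation), a direct connection at startup could look like:

# sciadmin -cluster frontend-1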

Note
Only one sciadmin process can connect to the network manager at any time. If you should ever need to connect to the network manager while another sciadmin process is blocking the connection, you can restart the network manager to terminate this connection. Afterwards, you can connect to the network manager from your sciadmin process (which needs to be running on a different machine than the other sciadmin process).

2. Interconnect Status View


2.1. Icons
As a visual tool, sciadmin uses icons to let the user trigger actions and to display information by changing the icon shape or color. The icons with all possible states are listed in the tables below.

Table B.1. Node or Adapter State


Dolphin Network Manager has a valid connection to the Dolphin Node Manager on the node.

Dolphin Network Manager cannot reach the node using TCP/IP, the adapter is wrongly configured or broken, or the driver is in an invalid state.

The adapter has gone into a faulty state where it cannot read system interrupts and has been isolated by the Dolphin IRM driver.


Table B.2. Link State


Green pencil strokes indicate that the links are up.

Red pencil strokes indicate that a link is broken. Typically, a cable is unplugged, or not seated well.

Yellow pencil strokes indicate that links have been disabled. Links are typically disabled when there are broken cables somewhere else in the ringlet and automatic fail over recovery has been enabled.

A red dot (cranberry) indicates that the node has lost connectivity to other nodes in the cluster.

Blue pencil strokes indicate that an administrator has chosen to disable this link in sciadmin. This may be done if you want to debug the cluster.

2.2. Operation
2.2.1. Cluster Status
The area at the top right informs about the current cluster status and shows settings of sciadmin and the connected network manager. A number of settings can be changed in the Cluster Settings dialog that is shown when pressing the Settings button.

Fabric status shows the current status of the fabric: UP, DEGRADED, FAILED or UNSTABLE (see below).

Check Interval SCIAdmin shows the number of seconds between each time the Network Manager sends updates to the Dolphin Admin GUI.

Check Interval Network Manager shows the number of seconds between each time the Network Manager receives updates from the Node Managers.

Topology shows the current topology of the fabric.

Auto Rerouting shows the current automatic fail over recovery setting (On, Off or default).

Fabric is UP when all nodes are operational and all links are OK and therefore plotted in green.


Figure B.1. Fabric is UP

Fabric is DEGRADED when all nodes are operational and some links are broken, but we still have full connectivity. In the snapshot below, the input cable of link 0 of node tiger-1, which is the output cable at node tiger-3, is defunct (this typically means unplugged), and the link is therefore plotted in red. The other links on this ring have been disabled by the network manager and are therefore plotted in yellow. All other links are functional and plotted in green. To get more information on the interconnect status for node tiger-1, get its diagnostics via Node Diag -V 1.


Figure B.2. Fabric is DEGRADED

Fabric is in status FAILED if several links are broken in a way that breaks the full connectivity. In this case, the input cable of link 0 and the output cable of link 1 are defunct. Node tiger-1 can not communicate via SCI in this situation, and SuperSockets-driven sockets will have fallen back to Ethernet. Because the cluster is cabled in an interleaved pattern, the link 1 output cable of tiger-1 is the link 1 input cable of tiger-4, and not tiger-7 as would be the case for non-interleaved cabling.


Figure B.3. Fabric has FAILED due to loss of connectivity

The fabric status is also set to FAILED if one or more nodes are dead, as a dead node can not be reached via SCI. The reasons for a node being dead can be:

The node is not powered up. Solution: power up the node.

The node has crashed. Solution: reboot the node.

The IRM low-level driver is not running. Solution: start the IRM driver like
# service dis_irm start

The node manager is not running. Solution: start the node manager like
# service dis_nodemgr start

The adapter is in an invalid state or is missing. Please check the node, and also consider the related topic in the FAQ (Q: 1.1.1). A quick way to check the required services on a suspect node is sketched below.
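As a minimal sketch (assuming the init scripts installed by the SIA support the status action, which may vary between distributions), the driver and the node manager can be checked on the node before restarting anything:

# service dis_irm status
# service dis_nodemgr status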


Figure B.4. Fabric has FAILED due to dead nodes

2.2.2. Node Status


The status of a node icon tells if a node is up or dead and if a link is broken, disabled (either by the network manager or manually) or up. When selecting a node you will see details in the Node Status area:

Serial number: a unique serial number given in production.

Adapter Type: the Dolphin part number of the adapter.

Adapter number: the number of the selected adapter.

SCI Link 0: current status of link 0 (enabled or disabled).

SCI Link 1: current status of link 1 (enabled or disabled).

3. Node and Interconnect Control


3.1. Admin Menu
The items in the Admin menu specify information that is relevant for the Dolphin Admin GUI.


Figure B.5. Options in the Admin menu

Connect to the network manager running on the local or a remote machine.

Disconnect from the network manager.

Refresh Status of the node and interconnect (instead of waiting for the update interval to expire).

Switch to Debug Statistics View shows the value of selected counters of each adapter instead of the node icons, which is useful for debugging fabric problems.

3.2. Cluster Menu


The commands in the cluster menu are executed on all nodes in parallel and the results are displayed by sciadmin. When choosing one of the fabric options the command will be executed on all nodes in that fabric.

Figure B.6. Options in the Cluster menu

Each fabric in the cluster has a sub-menu Fabric <X>. Within this sub-menu, Diag (-V 0), Diag (-V 1) and Diag (-V 9) are diagnostic functions that can be used to get more detailed information about a fabric that shows problem symptoms. Diag (-V 0) prints only errors that have been found. Diag (-V 1) prints more verbose status information (verbosity level 1). Diag (-V 9) prints the full diagnostic information including all error counters (verbosity level 9).


Diag -clear clears all the error counters in the Dolphin Express interconnect adapters. This helps to observe whether error counters are changing.

Diag -prod prints production information about the Dolphin Express interconnect adapters (serial number, card type, firmware revision etc.).

The Test option is described in Section 4.2, Traffic Test.

The other commands in the Cluster menu are:

Settings displays the Cluster Settings dialog (see below).

Reboot cluster nodes reboots all cluster nodes after a confirmation.

Power down cluster nodes powers down all cluster nodes after a confirmation.

Toggle Network Manager Verbose Settings increases/decreases the amount of logging from the Dolphin Network Manager.

Select the Arrange Fabrics option to make sure that the different adapters in your hosts are connected to the same fabric. This option is only displayed for clusters with more than one fabric.

Test Cable Connections is described in Section 4.1, Cable Test.

3.3. Node Menu


The options in the Node menu are identical to the options in the Cluster and Cluster Fabrics <X> menus, except that commands are executed on the selected node only. The only additional option is Settings, which is described in Section 3.5, Adapter Settings.

Figure B.7. Options in the Node menu

3.4. Cluster Settings


The Dolphin Interconnect Manager provides you with several options on how to run the cluster.


Figure B.8. Cluster configuration in sciadmin

Check Interval Admin alters the number of seconds between each time the Network Manager sends updates to the SCIAdmin GUI.

Check Interval Network Manager alters the number of seconds between each time the Network Manager receives updates from the Node Managers.

Topology lets you select the topology of the cluster, while Topology found displays the auto-determined topology. Changes to the topology setting can be performed with dishostseditor.

Auto Rerouting lets you decide whether to enable automatic fail over recovery (On), freeze the routing in its current state (Off), or use the default routing tables in the driver (Default); the latter also means that no automatic rerouting will take place.

Nodes in X,Y,Z dimension shows how the interconnect is currently dimensioned. Changes to the dimension settings can be performed with dishostseditor.

Remove Session to dead nodes lets you decide whether to remove the session to nodes that are unavailable.

Wait before removing session defines the number of seconds to wait until removing sessions to a node that has died or became inaccessible by other means.

Automatic Create Sessions to new nodes lets you decide if the Network Manager shall create sessions to all available nodes.

Alert script lets you enable/disable the use of a script that may report the cluster status to an administrator.


3.5. Adapter Settings


The Advanced Settings button in the node menu allows you to retrieve more detailed information about an adapter and to disable/enable links of this adapter.

Figure B.9. Advanced settings for a node

Link Frequency sets the frequency of a link. It is not recommended to change the default setting.

Prefetch Memsize shows the maximum amount of remote memory that can be accessed by this node. A changed value will not become effective until the IRM driver is restarted on the node, which has to be done outside of sciadmin. Setting this value too high (> 512MB) can cause problems with some machines, especially on 32-bit platforms.

SCI LINK 0 / 1 / 2 allows you to set the way a link is controlled:

Automatic lets the network manager control the link, enabling and disabling it as required by the link and the interconnect status.

Disabled forces a link down. This is a per-session setting (the link will be under control of the network manager if it is restarted), and only required as a temporary measure for troubleshooting.

The disable link option can also be used as a temporary measure to disable an unstable adapter or ringlet so that it does not impose unnecessary noise on the other adapters. If such an unlikely event occurs, please contact Dolphin support. A manually disabled link is marked blue in the sciadmin interconnect display, as shown in the screenshot below.


Warning
Please note that when Auto Rerouting is enabled (default setting), disabling a link within a ringlet will disable the complete ringlet. Disabling too many links can thus isolate nodes from access to the Dolphin Express interconnect.

Figure B.10. Link disabled by administrator (Disabling the links on the machine with hostname tiger-5 takes down the corresponding links on the other machines that share the same ringlet.).

4. Interconnect Testing & Diagnosis


4.1. Cable Test
Test Cable Connections tests the cluster for faulty cabling by reading serial numbers and adapter numbers from the other nodes on individual rings only. Using this test, you can verify that the cables are connecting the right nodes via the right ports; it thus serves to ensure that the physical cabling matches the interconnect description in the dishosts.conf configuration file. This test is very useful after a fresh installation, but also every time you have worked on the cabling. It will only take a few seconds to complete and displays its results in an editor. This allows you to copy or print the test result to fix the described problems right at the cluster.


Warning
Please note that while this test is running, all traffic over the Dolphin Express interconnect will be blocked. Although this does not introduce any communication errors, only a delay, it is recommended to run the test on an idle cluster. SuperSockets will fall back to Ethernet while this test is running.

Figure B.11. Result of running cable test on a good cluster

Figure B.12. Result of cable test on a problematic cluster


4.2. Traffic Test


The Test option for each fabric of a cluster verifies the connection quality of the links that make up the fabric. It searches for bad connections by imposing the maximum amount of traffic on individual rings and observing the internal error counters of all adapters involved. A typical problem that can be found with this test is a badly seated cable, which causes CRC errors on the related link. Such CRC errors do not cause data corruption, as the corrupted packet will be detected and retransmitted, but the performance will decrease to some degree. A fabric should show no errors in this test. If errors are displayed, the cable connections between the affected nodes should be verified. For more information, please see Section 4.7.5, Fabric Quality Test.

Note
To perform this test, the SISCI RPM has to be installed on all nodes. This is the case if the installation was performed via SIA. If SISCI is not installed on a node, an error will be logged and displayed as shown below.

Warning
Please note that while this test is running, all traffic over the Dolphin Express interconnect will be blocked. Although this does not introduce any communication errors, only a delay, it is recommended to run the test on an idle cluster. SuperSockets will fall back to Ethernet while this test is running.


Figure B.13. Result of fabric test without installing all the necessary rpms


Figure B.14. Result of fabric test on a proper fabric


Appendix C. Configuration Files


1. Cluster Configuration
A cluster with Dolphin Express interconnect requires one combined configuration file dishosts.conf for the interconnect topology and the SuperSockets acceleration of existing Ethernet networks, and another file networkmanager.conf that contains the basic options for the mandatory Network Manager. Both of these files should be created using the GUI tool dishostseditor. Using this tool vastly reduces the risk of creating an incorrect configuration file.

1.1. dishosts.conf
The file dishosts.conf is used as a specification of the Dolphin Express interconnect (in a way, just like /etc/hosts specifies nodes on a plain IP-based network). It is a system-wide configuration file and should be located with its full path /etc/dis/dishosts.conf on all nodes. Templates of this file can be found in /opt/DIS/etc/dis/. A syntactic and semantic validation of dishosts.conf can be done with the tool testdishosts. The Dolphin network manager and diagnostic tools will always assume that the current file dishosts.conf is valid. If dynamic information read from the network contradicts the information read from the dishosts.conf file, the Dolphin network manager and diagnostic tools will assume that components are misconfigured, faulty or removed for repair.
dishosts.conf is by default automatically distributed to all nodes in the cluster when the Dolphin network management software is started. Therefore, edit and maintain this file on the frontend only. You should create and maintain dishosts.conf using the dishostseditor GUI (Unix: /opt/DIS/sbin/dishostseditor). Normally, there is no reason to edit this file manually. To make changes in dishosts.conf effective, the network manager needs to be restarted like
# service dis_networkmgr restart

In case that SuperSockets settings have been changed, dis_ssocks_cfg needs to be run on every node as SuperSockets are not controlled by the network manager. The following sections describe the keywords used.
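As an illustrative sketch only (the node names, the ssh loop and the assumed path of dis_ssocks_cfg under the DIS installation prefix are placeholders; any remote execution mechanism will do), applying a changed dishosts.conf could look like this when run on the frontend:

# service dis_networkmgr restart
# for n in n01 n02 n03 n04; do ssh $n /opt/DIS/sbin/dis_ssocks_cfg; done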

1.1.1. Basic settings


DISHOSTVERSION [ 0 | 1 | 2 ]

The version number of the dishosts.conf is specified after the keyword DISHOSTVERSION, which should be put on the first line of the dishosts file that is not a comment. DISHOSTVERSION 0 designates a very simple configuration (see Section 1.1.3, Miscellaneous Notes). DISHOSTVERSION > 0 maps hosts/IPs to adapters, which in turn are mapped to nodeIds and physical adapter numbers by means of the ADAPTER entries. DISHOSTVERSION 1 or higher is required for running with multiple adapter cards, transparent fail over etc. DISHOSTVERSION 2 provides support for dynamic IP-to-nodeId mappings (sometimes also referred to as virtual IPs).

HOSTNAME: <hostname/IP>

Each cluster node is assigned a unique dishostname, which has to be equal to its hostname. The hostname is typically the network name (as specified in /etc/hosts) or the node's IP address.

HOSTNAME: host1.dolphinics.no
HOSTNAME: 193.69.165.21

ADAPTER: <physical adaptername> <nodeId> <adapterNo>

A Dolphin network node may hold several physical adapters. Information about a node's physical adapters is listed right below the hostname. All nodes specified by a HOSTNAME need at least one physical adapter.


This physical adapter has to be specified on the next line after the HOSTNAME. The physical adapters are associated with the keyword ADAPTER.
#Keyword  name      nodeid  adapter
ADAPTER:  host1_a0  4       0
ADAPTER:  host1_a1  4       1

STRIPE: <virtual adaptername> <physical adaptername 1> <physical adaptername 2>

Defines a virtual striping adapter comprising two physical adapters, which will be used for automatic data striping (also referred to as channel bonding). Striping adapters will also be used as redundant adapters in the case of network failure.
STRIPE: host1_s host1_a0 host1_a1

REDUNDANT: <virtual adaptername> <physical adaptername 1> <physical adaptername 2>

Defines a virtual redundant adapter comprising two physical adapters, which will be used for automatic fail over in case one of the fabrics fails.
REDUNDANT: host1_r host1_a0 host1_a1

1.1.2. SuperSockets settings


The SuperSockets configuration is responsible for mapping certain IP addresses to Dolphin Express adapters. It defines which network interfaces are enabled for Dolphin SuperSockets.

SOCKET: <host/IP> <physical or virtual adaptername>

Enables the given <host/IP> for SuperSockets using the specified adapter. In the following example we assume host1 and host2 have two network interfaces each, designated host1pub, host1prv, host2pub and host2prv, but only the 'private' interfaces hostXprv are enabled for SuperSockets using a striping adapter:
SOCKET: host1prv host1_s
SOCKET: host2prv host2_s

Starting with DISHOSTVERSION 2, SuperSockets can handle dynamic IP-to-nodeId mappings, i.e. a certain IP address does not need to be bound to a fixed machine but can roam in a pool of machines. The address resolution is done at runtime. For such a configuration a new type of adapter must be specified:

SOCKETADAPTER: <socket adaptername> [ SINGLE | STRIPE | REDUNDANT ] <adapterNo> [ <adapterNo> ... ]

This keyword basically only defines an adapter number, which is not associated with any nodeId. Example:
SOCKETADAPTER: sockad_s STRIPE 0 1

Defines socket adapter "sockad_s" in striping mode using physical adapters 0 and 1. The resulting internal adapter number is 0x2003. Such socket adapters can now be used to define dynamic mappings, and, as an extension to DISHOSTVERSION 1, whole networks can be specified for dynamic mappings:

SOCKET: [ <ip> | <hostname> | <network/mask_bits> ] <socket adapter>

Enables the given address/network for SuperSockets and associates it with a socket adapter. It is possible to mix dynamic and static mappings, but there must be no conflicting entries. Example:
SOCKET: host1 sockad_s
SOCKET: host2 sockad_s
SOCKET: host3 host3_s
SOCKET: 192.168.10.0/24 sockad_s
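Putting the keywords above together, a minimal hand-written sketch of a dishosts.conf for two single-adapter nodes might look as follows. This is illustrative only; in practice the file should be generated with dishostseditor, and the exact layout it produces may differ:

DISHOSTVERSION 1

HOSTNAME: host1prv
ADAPTER: host1_a0 4 0

HOSTNAME: host2prv
ADAPTER: host2_a0 8 0

SOCKET: host1prv host1_a0
SOCKET: host2prv host2_a0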


1.1.3. Miscellaneous Notes


Using multiple nodeIds per node is supported. This can be used for some advanced high-availability switch configurations. A short version of dishosts.conf is supported for compatibility reasons, which corresponds to DISHOSTVERSION 0. Please note that neither virtual adapters nor dynamic socket mappings are supported by this format. Example:
#host/IP   nodeid
host1prv   4
host2prv   8

1.2. networkmanager.conf
The networkmanager.conf specifies the startup parameters for the Dolphin Network Manager. It is created by the dishostseditor.

1.3. cluster.conf
This file must not be edited by the user. It is a configuration file of the Network Manager that consists of the user-specified settings from networkmanager.conf and derived settings of the cluster (nodes). It is created by the Network Manager.

2. SuperSockets Configuration
The following sections describe the configuration files that specifically control the behaviour of Dolphin SuperSockets. Next to these files, SuperSockets retrieve important configuration information from dishosts.conf as well. To make changes in any of these files effective, you need to run dis_ssocks_cfg on every node. Changes do not apply to sockets that are already open.

2.1. supersockets_profiles.conf
This file defines system-wide settings for all SuperSockets applications using LD_PRELOAD. All settings can be overridden by environment variables named SSOCKS_<option> (like export SSOCKS_DISABLE_FALLBACK=1).

SYSTEM_POLL [ 0 | 1 ]
Usage of the poll/select optimization. Default is 0, which means that the SuperSockets optimization for the poll() and select() system calls is used. This optimization typically reduces the latency without increasing the CPU load. To only use the native system methods for poll() and select(), set this value to 1.

RX_POLL_TIME <int>
Receive poll time [s]. Default is 30. Increasing this value may reduce the latency as the CPU will spin longer waiting for new data before it blocks sleeping. Reducing this value will send the CPU to sleep earlier, but this may increase message latency.

TX_POLL_TIME <int>
Transmit poll time [s]. Default is 0, which means that the CPU does not spin at all when no buffers are available at the receiving side. Instead, it will immediately block until the receiver reads data from these buffers (which makes buffer space available again for sending). The situation of no available receive buffers rarely occurs, and increasing this value is not recommended.

MSQ_BUF_SIZE <int>
Message buffer size [byte]. Default is 128KB. This value determines how much data can be sent without the receiver reading it. It has no significant impact on bandwidth.

MIN_DMA_SIZE <int>
Minimum message size for DMA [byte]. Default is 0 (DMA disabled).

MAX_DMA_GATHER <int>
Maximum number of messages gathered into a single DMA transfer. Default is 1.

MIN_SHORT_SIZE <int>
Switch point [byte] from INLINE to SHORT protocol. Default depends on driver.

MIN_LONG_SIZE <int>
Switch point [byte] from SHORT to LONG protocol. Default depends on driver.

FAST_GTOD [ 0 | 1 ]
Usage of accelerated gettimeofday(). Default is 0, which disables this optimization. Set to 1 to enable it.

DISABLE_FALLBACK [ 0 | 1 ]
Control fallback from SuperSockets to native sockets. Default is 0, which means fallback (and fallforward) is enabled. To ensure that only SuperSockets are used (i.e. for benchmarking), set it to 1.

ASYNC_PIO [ 0 | 1 ]
Usage of fully asynchronous transfers. Default is 1, which means that the SHORT and LONG protocols are processed by a dedicated kernel thread. By this, the sending process is available immediately, and the actual data transfer is performed asynchronously. This generally increases throughput and reduces CPU load without affecting small message latency. To disable asynchronous transfers, set this option to 0; in this case, all data transfers are performed by the CPU that runs the process that called the send function.
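As an illustration only (assuming the file accepts plain KEYWORD value lines matching the keyword list above; check the template shipped with your installation), a profile that lets the receiver spin a little longer and disables fallback for benchmarking could look like this:

RX_POLL_TIME 60
DISABLE_FALLBACK 1

The same settings can also be applied per process without editing the file, e.g. export SSOCKS_DISABLE_FALLBACK=1 before starting a benchmark.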

2.2. supersockets_ports.conf
This file is used to configure the port filter for SuperSockets. If no such file exists all ports will be enabled by default. It is, however, recommended to exclude all system ports. A suitable port configuration file is part of the SuperSockets software package. You can adjust it to your specific needs.
# Default port configuration for Dolphin SuperSockets
# Ports specifically enabled or disabled to run over SuperSockets.
# Any socket not specifically covered, is handled by the default:
EnablePortsByDefault yes
# Recommended settings:
# Disable the privileged ports used by system services.
DisablePortRange tcp 1 1023
DisablePortRange udp 1 1023
# Disable Dolphin Interconnect Manager service ports.
DisablePortRange tcp 3443 3445

The following keywords are valid:

EnablePortsByDefault [ yes | no ]
Determines the policy for unspecified ports.

DisablePortRange [ tcp | udp ] <from> <to>
Explicitly disables the given port range for the given socket type.

EnablePortRange [ tcp | udp ] <from> <to>
Explicitly enables the given port range for the given socket type.
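For example (illustrative policy only; 3306 is the standard MySQL server port, and the 40000-40999 range is just a placeholder for application ports), a more restrictive configuration could disable all ports by default and enable only selected ones:

EnablePortsByDefault no
EnablePortRange tcp 3306 3306
EnablePortRange tcp 40000 40999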

3. Driver Configuration
The Dolphin drivers are designed to adapt to the environment they are operating in; therefore, manual configuration is rarely required. The upper limit for memory allocation of the low-level driver is the only setting that may need to be adapted for a cluster, but this is also done automatically during the installation.


Warning
Changing parameters in these files can affect reliability and performance of the Dolphin Express interconnect.

3.1. dis_irm.conf
dis_irm.conf is located in the lib/modules directory of the DIS installation (default /opt/DIS) and contains options for the hardware driver (dis_irm kernel module). Only a few options are to be modified by the user. These options deal with the memory pre-allocation in the driver.

Warning
Changing values in dis_irm.conf other than those described below may cause the interconnect to malfunction. Only do so if instructed by Dolphin support. Whenever a setting in this file is changed, the driver needs to be reloaded to make the new settings effective.

3.1.1. Resource Limitations


These parameters control memory allocations that are only performed on driver initialization.

dis_max_segment_size_megabytes
Description: Sets the maximum size of a memory segment that can be allocated for remote access. Some systems may lock up if too much memory is requested.
Unit: MiB
Valid values: integers > 0
Default value: 4

max-vc-number
Description: Maximum number of virtual channels (one virtual channel is needed per remote memory connection, i.e. 2 per SuperSockets connection).
Unit: n/a
Valid values: integers > 0. The upper limit is the consumed memory; values > 16384 are typically not necessary.
Default value: 1024

3.1.2. Memory Preallocation


Preallocation of memory is recommended on systems without IOMMU (like x86 and x86_64). The problem is memory fragmentation over time, which can make it difficult to allocate large segments of contiguous physical memory after the system has been running for some time. To overcome this situation, options have been added to let the IRM driver allocate blocks of memory upon initialization and to provide memory from this pool under certain conditions for the allocation of remotely accessible memory segments.

number-of-megabytes-preallocated
Description: Defines the number of MiB of memory the IRM shall try to allocate upon initialization.
Unit: MiB
Valid values: 0: disable preallocation; > 0: MiB to preallocate in as few blocks as possible
Default value: 16 (may be increased by the installer script)

use-sub-pools-for-preallocation
Description: If the IRM fails to allocate the amount of memory specified by number-of-megabytes-preallocated, it will by default repeatedly decrease the amount and retry until success. By enabling use-sub-pools-for-preallocation, the IRM will instead continue to allocate memory (possibly in small chunks) until the amount specified by number-of-megabytes-preallocated is reached.
Unit: n/a
Valid values: 0: disable sub-pools; 1: enable sub-pools
Default value: 1

block-size-of-preallocated-blocks
Description: To allocate not a single large block, but multiple blocks of the same size, this parameter has to be set to a value > 0. Pre-allocating memory this way is useful if the application to be run on the cluster uses many memory segments of the same (relatively small) size.
Unit: bytes
Valid values: 0: don't preallocate memory in this manner; > 0: size in bytes (will be aligned upwards to a page size boundary) of each memory block
Default value: 0

number-of-preallocated-blocks
Description: The number of blocks to be preallocated (see previous parameter).
Unit: n/a
Valid values: 0: don't preallocate memory in this manner; > 0: number of blocks
Default value: 0

minimum-size-to-allocate-from-preallocated-pool
Description: Sets a lower limit on the size of memory segments the IRM may try to allocate from the preallocated pool. The IRM will always request additional memory from the system rather than resolving memory requests smaller than this size from the pool. Due to the way the preallocation mechanism works, there is a "hard" lower limit of one SCI_PAGE (currently 8K). The minimum size is specified in 1K blocks.
Unit: KiB
Valid values: 0: always allocate from pre-allocated memory; > 0: try to allocate memory that is smaller than this value from non-preallocated memory
Default value: 0

try-first-to-allocate-from-preallocated-pool
Description: Directs the IRM when to try to use memory from the preallocated pool.
Unit: n/a
Valid values: 0: the preallocated memory pool becomes a backup solution, only to be used when the system can't honor a request for additional memory; 1: the IRM prefers to allocate memory from the preallocated pool when possible
Default value: 1

3.1.3. Logging and Messages


link-messages-enabled
Description: Control logging of non-critical link messages during operation.
Unit: n/a
Valid values: 0: no link messages; 1: show link messages
Default value: 0

notes-disabled
Description: Control logging of non-critical notices during operation.
Unit: n/a
Valid values: 0: show notice messages; 1: no notice messages
Default value: 0

warn-disabled
Description: Control logging of general warnings during operation.
Unit: n/a
Valid values: 0: show warning messages; 1: no warning messages
Default value: 0

dis_report_resource_outtages
Description: Control logging of out-of-resource messages during operation.
Unit: n/a
Valid values: 0: no messages; 1: show messages
Default value: 0

notes-on-log-file-only
Description: Control printing of driver messages to the system console.
Unit: n/a
Valid values: 0: also print to console; 1: only print to syslog
Default value: 0

3.2. dis_ssocks.conf
Configuration file for SuperSockets (dis_ssocks) kernel module. If a value different from the default is required, edit and uncomment the appropriate line.
#tx_poll_time=0;
#rx_poll_time=30;
#min_dma_size=0;
#min_short_size=1009;
#min_long_size=8192;
#address_family=27;
#rds_compat=0;

The following keywords are valid:

tx_poll_time
Transmit poll time [s]. Default is 0, which means that the CPU does not spin at all when no buffers are available at the receiving side. Instead, it will immediately block until the receiver reads data from these buffers (which makes buffer space available again for sending). The situation of no available receive buffers rarely occurs, and increasing this value is not recommended.

rx_poll_time
Receive poll time [s]. Default is 30. Increasing this value may reduce the latency as the CPU will spin longer waiting for new data before it blocks sleeping. Reducing this value will send the CPU to sleep earlier, but this may increase message latency.

min_dma_size
Minimum message size for using DMA (0 means no DMA). Default is 0.

min_short_size
Minimum message size for using the SHORT protocol. Default and maximum is 1009.

min_long_size
Minimum message size for using the LONG protocol. Default is 8192.

address_family
AF_SCI address family index. Default value is 27. If not set, the driver will automatically choose another index between 27 and 32 until it finds an unused index. The chosen index can be retrieved via the /proc file system like cat /proc/net/af_sci/family. If this value is set explicitly, this value will be chosen, and no search for unused values is performed. Generally, this value is only required if SuperSockets should be used explicitly without the preload library.

rds_compat
RDS compatibility level. Default is 0.


Appendix D. Platform Issues and Software Limitations


This chapter lists known issues of Dolphin Express with certain hardware platforms and limitations of the software stack. Some of these limitations can be overcome by changing default settings of runtime parameters to match your requirements.

1. Platforms with Known Problems


Intel Chipset 5000V
Due to a PCI-related bug in the Intel 5000V chipset, limitations on how this chipset can be used with SCI have to be considered. This only applies to very specific configurations; one example of such a configuration is the SuperMicro X7DVA-8 motherboard when the onboard SCSI HBA is used and the SCI adapter is placed in a specific slot. If you plan to use such a platform, please contact Dolphin support in advance.

Intel IA-64
The write performance to remote memory within the kernel is low on IA-64 machines, limiting the SuperSockets performance. We recommend using x86_64 platforms instead.

2. IRM
Resource Limitations
The IRM (Interconnect Resource Manager) manages the hardware and related software resources of the Dolphin Express interconnect. Some resources are allocated once when the IRM is loaded. The default settings are sufficient for typical cluster sizes and usage scenarios. However, if you hit a resource limit, it is possible to increase the corresponding limits via the dis_irm.conf settings (see the dis_irm.conf description in Appendix C, Configuration Files).

Maximum Number of Nodes

3. SuperSockets
UDP Broadcasts
SuperSockets do not support UDP broadcast operations.

Heterogeneous Cluster Operation (Endianness)
By default, SuperSockets are configured to operate in clusters where all nodes use the same endian representation (either little endian or big endian). This avoids costly endian conversions and works fine in typical clusters where all nodes use the same CPU architecture. Only if you mix nodes with Intel or AMD (x86 or x86_64) CPUs with nodes using PowerPC- or Sparc-based CPUs will this default setting not work. In this case, an internal flag needs to be set. Please contact Dolphin support if this situation applies to you.

Sending and Receiving Vectors
The vector length for the writev() and sendmsg() functions is limited to 16. For readv() and recvmsg(), the vector length is not limited.

Socket Options
The following socket options are supported by SuperSockets for communication over Dolphin Express: SO_DONTROUTE (implicit, as SuperSockets don't use IP packets for data transport and thus are never routable), TCP_NODELAY, SO_REUSEADDR and SO_TYPE. The following socket options are passed to the native (fallback) socket: SO_SENDBUF and SO_RECVBUF (the buffer size for SuperSockets is fixed). All other socket options are not supported (ignored).

Fallforward for Stream Sockets
While SuperSockets offer fully transparent fall-back and fall-forward between Dolphin Express-based communication and native (Ethernet) communication for any socket (TCP or UDP) while it is open and in use, there is currently a limitation on sockets when they connect: a socket that has been created via SuperSockets and connected to a remote socket while the Dolphin Express interconnect was not operational will not fall forward to Dolphin Express communication when the interconnect comes up again. Instead, it will continue to communicate via the native network (Ethernet). This is a rare condition that typically will not affect operation. If you suspect that one node does not perform up to expectations, you can either contact Dolphin support to help you diagnose the problem, or restart the application making sure that the Dolphin Express interconnect is up. Removal of this limitation as well as a simple way to diagnose the precise state of a SuperSockets-driven socket is scheduled for updated versions of SuperSockets.

Resource Limitations
SuperSockets allocate resources for the communication via Dolphin Express by means of the IRM. Therefore, the resource limitations listed for the IRM indirectly apply to SuperSockets as well. To resolve such limitations if they occur (i.e. when using a very large number of sockets per node), please refer to the relevant IRM section above. SuperSockets log messages to the syslog for two typical out-of-resource situations:

No more VCs available. The maximum number of virtual channels needs to be increased (see Section 2, IRM).

No more segment memory available. The amount of pre-allocated memory needs to be increased (see Section 2, IRM).
