Understanding Up/Down InfiniBand Routing Algorithm
Created by ophirmaor on Jan 13, 2016 4:56 PM. Last modified by ophirmaor on Feb 14,
2017 4:17 PM.
References
Overview
Configuration
References
opensm(8) - Linux man page
Understanding the GUID Routing Order File (SM Configuration)
Understanding the Root GUID File (SM Configuration)
HowTo Prevent InfiniBand Credit Loops
VPI Gateway Considerations
Overview
Several InfiniBand routing engines may be configured on a network, such as Min Hop, Up/Down,
Down/Up, Fat Tree, and more (see opensm(8)). Up/Down (UpDn) and Fat Tree are the most
commonly used InfiniBand routing algorithms for Clos/fat tree networks.
Note: This includes trees built using director switches and 1U switches. The two levels of physical
switch enclosure represent 3 tiers of switch ASICs, because each director switch contains 2 tiers
of ASICs.
Like most IB routing algorithms, UpDn uses the shortest path(s) available between any two
endpoints. It can route any collection of IB-connected switches and HCAs. Most importantly and
unlike MinHop, UpDn guarantees credit-loop free routing in the fabric. UpDn begins with a list of
https://community.mellanox.com/docs/DOC-2402 Page 1 of 7
Understanding Up/Down InfiniBand Routing Algorithm | Mellanox Interconnect Community 4/29/17, 14)51
the switch ASICs that form the root or top level of the fabric. This list is set with the Subnet
Manager (SM) flag --root_guid_file. It is a simple text file with a line for each globally unique ID
(GUID) of a root ASIC. Although UpDn has an option to auto-discover the root ASICs, it is strongly
recommended that a root GUID list be supplied. The root GUID list must be updated if a root
switch ASIC is replaced or if the topology is expanded, and every SM must have an identical copy
of the GUID list.
To begin routing the fabric, the UpDn algorithm starts with the root switch ASICs, which we will
refer to as Distance 0 (zero). The algorithm then finds every switch ASIC that is one hop (one link)
away from the roots. These ASICs can be thought of as Distance 1, because they are one hop away
from the root switches. The algorithm then discovers all switch ASICs that are two hops from the
root switches; these can be thought of as Distance 2. The process continues until every switch ASIC
has been assigned a distance from the roots. The following diagram shows an example 3-tier
fabric with the distances assigned.
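The distance assignment described above can be sketched as a multi-source breadth-first search. This is a minimal illustration, not OpenSM's implementation; the function name, the switch names, and the small 3-tier topology are all hypothetical.

```python
from collections import deque

def assign_distances(links, roots):
    """Return {switch: hops from the nearest root} using a multi-source BFS."""
    # Build an undirected adjacency map from the switch-to-switch links.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    distance = {root: 0 for root in roots}  # root ASICs are Distance 0
    queue = deque(roots)
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbors.get(switch, ())):
            if peer not in distance:  # the first visit gives the shortest distance
                distance[peer] = distance[switch] + 1
                queue.append(peer)
    return distance

# Hypothetical 3-tier fabric: 2 root ASICs, 2 middle-tier ASICs, 2 leaf ASICs.
LINKS = [
    ("root1", "mid1"), ("root1", "mid2"),
    ("root2", "mid1"), ("root2", "mid2"),
    ("mid1", "leaf1"), ("mid1", "leaf2"),
    ("mid2", "leaf1"), ("mid2", "leaf2"),
]
DISTANCES = assign_distances(LINKS, ["root1", "root2"])
print(DISTANCES)
```

Because every root starts in the queue at Distance 0, a switch that is reachable from several roots receives its distance from the nearest one.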
This process generates a Breadth-First Spanning Tree (BFSP), which is analogous to the approach
taken by the Spanning Tree Protocol (STP) in Ethernet. Unlike STP, UpDn allows multiple roots
and strives to provision as many paths as possible between each pair of end nodes. The UpDn
algorithm then finds all of the possible shortest paths between every pair of endpoints. Next,
UpDn discards any path that contains a hop from a Distance N ASIC to a Distance N+1 ASIC,
followed by a hop back to Distance N. That is, it discards any path that goes "down" (away from
the roots) and then "up" (toward the roots). Legal paths can go up, or down, or up and then down,
or stay at the same level, but never down and then up. By discarding these paths and not
provisioning them in the switches, UpDn guarantees that the routing contains no logical loops
and no credit loops, which could otherwise bring traffic to a halt.
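The "never down and then up" rule can be checked from the sequence of Distance values along a path. The sketch below is a simplified illustration under one stated assumption: hops between two ASICs at the same distance are treated as neutral. The function name is hypothetical and this is not OpenSM code.

```python
def is_legal_updn_path(distances):
    """distances: the Distance value of each switch ASIC along a path, in order."""
    gone_down = False
    for prev, cur in zip(distances, distances[1:]):
        if cur > prev:
            # Hop away from the roots ("down").
            gone_down = True
        elif cur < prev and gone_down:
            # Hop toward the roots ("up") after a "down" hop: not allowed.
            return False
    return True

# Leaf to root and back down to another leaf: up, up, down, down.
print(is_legal_updn_path([2, 1, 0, 1, 2]))
# Middle tier down to a leaf and back up: contains a DnUp segment.
print(is_legal_updn_path([1, 2, 1]))
```

The first call prints True and the second prints False.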
Note: The two potential paths between nodes E and F are both the same length (same number of
hops) but only one obeys the UpDn rule. The disallowed path contains a DnUp segment.
The credit-loop-free property of a UpDn (or Fat Tree) routed topology is critical for reliable
network operation.
However, since some potential paths are discarded, there are cases where a pair of end nodes can
become disconnected and unable to communicate with one another.
When the calculate_missing_routes option is set to TRUE (the default value) in the opensm
configuration file, opensm guarantees connectivity between all endpoints in the fabric in a
credit-loop-free manner with UpDn and Fat Tree routing.
For example, consider a different fabric that has nodes connected above the leaf switches (nodes
G, H, and J). Nodes connected to L1 switches (A, B, C, etc.) have legal UpDn paths to nodes G, H,
and J. There is a legal UpDn path between nodes G and H. However, there is no legal path between
G and J, and these nodes will not be able to communicate with each other. Setting
calculate_missing_routes to TRUE will provide credit-loop free routing between all endpoints.
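The disconnection described above can be reproduced on a small hypothetical two-tier fabric (not necessarily the one in the diagram): two root switches S1 and S2, two leaf switches L1 and L2, nodes A and B on the leaves, nodes G and H on S1, and node J on S2. The sketch searches for any path that never goes up after going down. All names are assumptions made for illustration, and it models only the plain UpDn rule, without calculate_missing_routes.

```python
from collections import deque

# Distance of each switch from the roots (S1 and S2 are the roots).
RANK = {"S1": 0, "S2": 0, "L1": 1, "L2": 1}
# Switch-to-switch links: every leaf connects to every root.
SWITCH_LINKS = {
    "S1": ["L1", "L2"], "S2": ["L1", "L2"],
    "L1": ["S1", "S2"], "L2": ["S1", "S2"],
}
# The switch each end node is cabled to.
ATTACHED_TO = {"A": "L1", "B": "L2", "G": "S1", "H": "S1", "J": "S2"}

def has_legal_path(src, dst):
    """True if some switch path from src to dst never goes up after going down."""
    start, goal = ATTACHED_TO[src], ATTACHED_TO[dst]
    seen = {(start, False)}  # state: (current switch, has the path gone down yet?)
    queue = deque(seen)
    while queue:
        switch, gone_down = queue.popleft()
        if switch == goal:
            return True
        for peer in SWITCH_LINKS[switch]:
            going_up = RANK[peer] < RANK[switch]
            if going_up and gone_down:
                continue  # a DnUp segment: UpDn discards this path
            state = (peer, gone_down or RANK[peer] > RANK[switch])
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False

for pair in [("A", "G"), ("G", "H"), ("G", "J")]:
    print(pair, has_legal_path(*pair))
```

In this toy fabric, A reaches G by going up, and G reaches H through their shared switch, but every route from G to J has to go down to a leaf and back up to S2, so no legal UpDn path exists.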
There may be cases where nodes do not need to communicate with each other (e.g. storage nodes
that do not communicate among themselves). However, this is rare. The best practice for a Clos-5
Note: The diagrams above apply equally well to two different cases: A fabric built from 3 tiers of
1U switches, and a fabric that uses two director switches with 1U switches below them. In the
latter case, nodes E, F, and G represent nodes cabled to the leaf modules of the director switches.
Scatter-Ports
When assigning logical paths to physical links, the UpDn algorithm tries to map the same number
of paths per link to maximize use of the available bandwidth. This balancing is done statically,
without knowledge of actual workloads and traffic patterns. Path balancing decisions are made
locally, at each switch, without assuming anything about the physical topology. The resulting path
assignments may not be optimal for typical Clos/Fat Tree workloads.
A routing option called scatter-ports is available for MinHop and UpDn routing engines. It
instructs the routing algorithm to randomize the local assignments of paths to links, which often
results in better link utilization. The scatter-ports option requires an integer argument, which is
the seed for the random number generator. It is recommended to use a prime number for the
seed; a seed of zero turns off randomization.
Note: The scatter-ports configuration is available only when the SM runs on a host (or UFM). It is
not supported when the SM runs on a switch.
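The difference between static balancing and seeded randomization can be illustrated with a toy model. This is only a sketch of the idea, under the assumption that a switch has several equal-cost ports toward a set of destinations; it is not how OpenSM implements scatter-ports, and the function name and port numbers are hypothetical.

```python
import random

def assign_paths(destinations, ports, seed=0):
    """Map each destination to one of several equal-cost ports.

    seed == 0: static balancing, always pick the least-loaded port.
    seed != 0: pick a port at random, using a generator seeded with `seed`.
    """
    if seed == 0:
        load = {port: 0 for port in ports}
        assignment = {}
        for dest in destinations:
            port = min(ports, key=lambda p: load[p])  # first port wins ties
            assignment[dest] = port
            load[port] += 1
        return assignment
    rng = random.Random(seed)  # same seed gives the same assignment every run
    return {dest: rng.choice(ports) for dest in destinations}

BALANCED = assign_paths(range(8), [1, 2, 3, 4])
SCATTERED = assign_paths(range(8), [1, 2, 3, 4], seed=7)
print(BALANCED)
print(SCATTERED)
```

With a seed of zero the model spreads destinations evenly but always in the same port order; with a nonzero seed the assignment is randomized yet reproducible, which is why the seed is passed as an argument.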
Configuration
1. The routing engine algorithm is configured with the flag --routing_engine of the opensm
command. The supported engines are: minhop, updn, dnup, file, ftree, lash, dor, torus-2QoS,
dfsssp, sssp, pqft, chain.
In case you are using SM running on an InfiniBand switch, run the following command on the
MLNX-OS CLI:
If there is an issue in the fabric, it is better to fall back to updn rather than to minhop. If
neither fat tree nor updn can converge, the SM falls back to minhop.
2. The list of roots for the UpDn routing algorithm is configured with the flag --root_guid_file of
the opensm command.
In case you are using SM running on an InfiniBand switch, use the following command to set the
list of root GUIDs.
Doing that will force the routing algorithm to use those specific switches as roots.
a. Run ibswitches on the network (from a switch or from a host) to get the list of switches and
their GUIDs. The GUIDs are marked in red below.
b. Filter the switches that are spine switches in the cluster, and collect their GUIDs.
For example, if you have 18 spines and 36 leaves, it is recommended to run this command 18
times, once for each spine GUID (on the SM switch).
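For an SM running on a host, the steps above can be sketched in a short script: extract the spine GUIDs from ibswitches output, produce the text of a root GUID file (one GUID per line), and build an opensm command line with the --routing_engine and --root_guid_file flags. Several things here are assumptions for illustration: the sample output lines and their GUIDs, the convention that spine switches have "spine" in their node description, the file path, and the helper names. The comma-separated engine list follows the opensm(8) description of trying engines in order.

```python
import re

# Matches lines such as:
#   Switch  : 0x0002c90200000001 ports 36 "MF0;spine-01:SX6036/U1" enhanced port 0 lid 1 lmc 0
SWITCH_LINE = re.compile(r'^Switch\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+\d+\s+"([^"]*)"')

def spine_guids(ibswitches_output, is_spine):
    """Return the GUIDs of switches whose node description satisfies is_spine."""
    guids = []
    for line in ibswitches_output.splitlines():
        match = SWITCH_LINE.match(line.strip())
        if match and is_spine(match.group(2)):
            guids.append(match.group(1))
    return guids

def root_guid_file_text(guids):
    """Format the root GUID file: one GUID per line."""
    return "".join(f"{guid}\n" for guid in guids)

def opensm_command(engines, root_guid_file):
    """Build an opensm argument list; engines are tried in the order given."""
    return ["opensm", "--routing_engine", ",".join(engines),
            "--root_guid_file", root_guid_file]

# Illustrative sample only; the GUIDs and names are made up.
SAMPLE = """\
Switch  : 0x0002c90200000001 ports 36 "MF0;spine-01:SX6036/U1" enhanced port 0 lid 1 lmc 0
Switch  : 0x0002c90200000002 ports 36 "MF0;spine-02:SX6036/U1" enhanced port 0 lid 2 lmc 0
Switch  : 0x0002c90200000003 ports 36 "MF0;leaf-01:SX6036/U1" enhanced port 0 lid 3 lmc 0
"""

GUIDS = spine_guids(SAMPLE, lambda name: "spine" in name)
print(root_guid_file_text(GUIDS), end="")
# Hypothetical file location; use whatever path your SM configuration expects.
print(" ".join(opensm_command(["ftree", "updn"], "/etc/opensm/root_guid.conf")))
```

Since every SM must have an identical copy of the GUID list, generating the file from one script and distributing the result is less error-prone than editing each copy by hand.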