Greg Bruno
Rocks Development Group
San Diego Supercomputer Center
HPC Cluster Architecture
© 2005 UC Regents 2
Cluster Pioneers
In the mid-1990s, the Network of Workstations (NOW) project at UC Berkeley and the
Beowulf Project at NASA asked the question: can commodity, off-the-shelf machines be
combined into high-performance computing systems?
© 2005 UC Regents 3
Clusters Dominate the Top500
© 2005 UC Regents 4
The Dark Side of Clusters
Clusters are phenomenal price/performance computational engines …
Can be hard to manage without experience
High-performance I/O is still unsolved
The effort to find where something has failed grows at least linearly with cluster size
Not cost-effective if every cluster “burns” a person just for care and feeding
The programming environment could be vastly improved
Technology is changing very rapidly, and scaling up is becoming commonplace (128-256 nodes)
© 2005 UC Regents 5
Two Critical Problems with Clusters
Software skew
When the software configuration on some nodes is different from that on others
Small differences (minor version numbers on libraries) can cripple a parallel program
Adequate job control of a parallel process
Signal propagation
Cleanup
© 2005 UC Regents 6
Rocks (open source clustering distribution)
www.rocksclusters.org
© 2005 UC Regents 7
Philosophy
Care and feeding of a system is not fun
System administrators cost more than clusters
A 1 TFLOPS cluster costs less than $200,000 (US)
Close to the actual cost of a full-time administrator
The system administrator is the weakest link in the cluster
Bad ones like to tinker
Good ones still make mistakes
© 2005 UC Regents 8
Philosophy
continued
All nodes are 100% automatically
configured
Zero “hand” configuration
This includes site-specific
configuration
Run on heterogeneous standard
high volume components
Use components that offer the
best price/performance
Software installation and
configuration must support
different hardware
Homogeneous clusters do not
exist
Disk imaging requires a homogeneous cluster
© 2005 UC Regents 9
Philosophy
continued
Optimize for installation
Get the system up quickly
In a consistent state
Build supercomputers in hours
not months
Manage through re-installation
Can re-install 128 nodes in under
20 minutes
No support for on-the-fly system
patching
Do not spend time trying to ensure system consistency
Just re-install
Can be batch driven (see the sketch below)
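A minimal sketch of what batch-driven re-installation looks like in practice, assuming the standard Rocks tools of this era (cluster-fork and the cluster-kickstart script on compute nodes; node names are examples):

# On the frontend, as root: ask every compute node to re-install itself.
# cluster-fork runs the given command on all compute nodes in turn;
# cluster-kickstart reboots a node into the RedHat/Rocks installer.
cluster-fork '/boot/kickstart/cluster-kickstart'
# Or re-install a single suspect node:
ssh compute-0-0 '/boot/kickstart/cluster-kickstart'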
© 2005 UC Regents 10
Rocks Basic Approach
Install a frontend
1. Insert Rocks Boot CD
2. Insert Roll CDs (optional components)
3. Answer 7 screens of configuration
data
4. Drink coffee (takes about 30 minutes
to install)
Install compute nodes:
1. Login to frontend
2. Execute “insert-ethers”
3. Boot compute node with Rocks Boot CD (or PXE)
4. Insert-ethers discovers nodes
5. Goto step 3
Add user accounts
Start computing

Optional Rolls:
Condor
Grid (based on NMI R4)
Intel (compilers)
Java
SCE (developed in Thailand)
Sun Grid Engine
PBS (developed in Norway)
Area51 (security monitoring tools)
Many Others …
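For concreteness, a minimal sketch of the compute node discovery step on the frontend (the appliance choice and generated node names are the Rocks defaults):

# As root on the frontend:
insert-ethers
# Choose "Compute" from the appliance menu, then power on each compute
# node set to boot from the Rocks Boot CD or PXE.  insert-ethers captures
# each node's DHCP request, adds it to the cluster database, and kicks off
# its installation, naming the nodes compute-0-0, compute-0-1, ...
# Quit insert-ethers once all nodes have been discovered.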
© 2005 UC Regents 11
Minimum Requirements
Frontend:
  2 Ethernet Ports
  CDROM
  18 GB Disk Drive
  512 MB RAM
Compute Nodes:
  1 Ethernet Port
  18 GB Disk Drive
  512 MB RAM
A Quick Overview
Goal of Rolls
Develop a method to reliably install software on
a frontend
“User-customizable” frontends
© 2005 UC Regents 14
Add-on Method
1. User responsible for installing and configuring base
software stack on a frontend
2. After the frontend installation, the user downloads
‘add-on’ packages
3. User installs and configures add-on packages
4. User installs compute nodes
© 2005 UC Regents 15
Rocks Method
To address the major problem with the add-on
method, we had the following idea:
All non-RedHat packages must be installed and
configured in a controlled environment
© 2005 UC Regents 16
Goal of Rolls
This led to modifying the standard RedHat installer in
order to accept new packages and configuration
A tricky proposition
A RedHat distribution is a monolithic entity
• It’s tightly-coupled
• A program called “genhdlist” creates binary files (hdlist and hdlist2) that contain metadata about every RPM in the distribution
To add/remove/change an RPM, you need to re-run genhdlist
Otherwise, the RedHat installer will not recognize the package
Or worse, it fails during package installation
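As a rough illustration of the problem, manually adding a single RPM to a RedHat distribution tree looks roughly like this (paths are examples; genhdlist ships with the anaconda-runtime package, and Rocks wraps this whole step in its rocks-dist tool):

# Drop the new package into the distribution tree
cp myapp-1.0-1.i386.rpm /path/to/distro/RedHat/RPMS/
# Regenerate the hdlist/hdlist2 metadata so the installer can see it
/usr/lib/anaconda-runtime/genhdlist /path/to/distro
# Skip this step and the installer ignores the RPM, or fails mid-install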
© 2005 UC Regents 17
Monolithic Software Stack
© 2005 UC Regents 18
Rolls
© 2005 UC Regents 21
Rocks
© 2005 UC Regents 24
Two Examples
Rockstar - SDSC
Tungsten2 - NCSA
Supercomputing 2003 Demo
We wanted to build a Top500 machine live at SC’03
From the ground up (hardware and software)
In under two hours
Show that anyone can build a supercomputer with:
Off-the-shelf software (Rocks)
Money
No army of system administrators required
HPC Wire Interview
HPCwire: What was the most
impressive thing you’ve seen at
SC2003?
Larry Smarr: I think, without
question, the most impressive thing
I’ve seen was Phil Papadopoulos’
demo with Sun Microsystems.
© 2005 UC Regents 26
Rockstar Cluster
129 Sun Fire V60x servers
1 Frontend Node
128 Compute Nodes
Gigabit Ethernet
$13,000 (US)
9 24-port switches
8 4-gigabit trunk uplinks
Built live at SC’03
In under two hours
Running applications
Top500 Ranking
11.2003: 201
06.2004: 433
49% of peak
© 2005 UC Regents 27
NCSA
National Center for Supercomputing Applications
Tungsten2
520 Node Cluster
Dell Hardware
Core Fabric: 6 72-port TS270 switches
Steve Jones
Institute for Computational and
Mathematical Engineering
Stanford University
Research Groups
Flow Physics and Computation
Aeronautics and Astronautics
Chemical Engineering
Center for Turbulence Research
Center for Integrated Turbulence Simulations
Academic
Institute for Computational and Mathematical Engineering
Scientific Computing and Computational Mathematics
Funding
Sponsored Research (AFOSR/ONR/DARPA/DURIP/ASC)
© 2005 UC Regents 30
Active collaboration with the Labs
Buoyancy-driven instabilities/mixing - CDP for modeling plumes (Stanford/SNL)
© 2005 UC Regents 31
Affiliates Program
© 2005 UC Regents 32
Bio-X
Iceberg
300 Node Cluster
• Dual 2.8GHz
Dell Hardware
Fast Ethernet
Deployed 11.2002
Top500 Ranking
6.2003: 319
Shared resource for Bio-X
© 2005 UC Regents 33
Iceberg
[Diagram: frontend server connected to the campus backbone]
© 2005 UC Regents 34
Top 500 Supercomputer
© 2005 UC Regents 35
CITS
Gfunk
Rocks 3.3
1 Frontend
82 node
• Dual 2.66GHz Xeon
• 2GB RAM
• Myrinet C card
1 NFS Appliance
• 1TB Storage
© 2005 UC Regents 36
Gfunk
[Diagram: frontend server on the campus backbone; Myrinet interconnect for the compute nodes]
© 2005 UC Regents 37
AFOSR
Nivation
Rocks 3.3
1 Frontend
82 Node
• Dual 3.0 GHz Xeon
• 4GB Ram
• Myrinet D card
4 Application Node
• Matlab
• Ansys ICEM/CFD
• Compilers
Panasas Storage Cluster
• 4TB Shelf
2 NFS Appliance
• 1TB per appliance (scratch)
© 2005 UC Regents 38
Nivation
[Diagram: campus backbone; two NFS appliances on the GigE network; Myrinet interconnect]
© 2005 UC Regents 39
DARPA/DURIP
Sintering
Rocks 3.3 x86_64
1 Frontend
1TB Storage
24 Node
48 AMD Opteron 242
2 GB RAM
PCI-X IB
Penguin Computing Hardware
Infinicon Infiniband
The easiest and fastest setup time of our clusters
Due in large part to the AMD Cluster Development Center
© 2005 UC Regents 40
Sintering
[Diagram: frontend server with 1TB disk array on the campus backbone; compute nodes on GigE and InfiniBand]
© 2005 UC Regents 41
DARPA
Ablation
Rocks 4.0 CentOS
1 Frontend
• 1TB storage
48 Node
• Dual 3.4 GHz EM64T
• 2GB RAM
• PCI-X HCA IB card
Dell Hardware
Topspin Infiniband
Panasas Storage Cluster
• 4TB Shelf
© 2005 UC Regents 42
Ablation
[Diagram: frontend server on the campus backbone; compute nodes on GigE and a Topspin 270 InfiniBand switch]
© 2005 UC Regents 43
ICME Teaching Cluster
Calving
10 Dell compute nodes
Dual 3.4 GHz EM64T
2GB RAM
10 Penguin compute nodes
Dual 1.0 GHz
1GB RAM
Gigabit Ethernet
Easy to change software
configurations each quarter
SUNetID Access Controlled
AFS Access
© 2005 UC Regents 44
Calving
[Diagram: frontend server on the campus backbone; “slow” and “fast” compute nodes on the GigE network]
© 2005 UC Regents 45
Extending Rocks
<post>
<!-- Insert your post installation script here. This
code will be executed on the destination node after the
packages have been installed. Typically configuration files
are built and services setup in this section. -->
# Add panfs to fstab
REALM=10.10.10.10
mount_flags="rw,noauto,panauto"
/bin/rm -f /etc/fstab.bak.panfs
/bin/rm -f /etc/fstab.panfs
/bin/cp /etc/fstab /etc/fstab.bak.panfs
/bin/grep -v "panfs://" /etc/fstab > /etc/fstab.panfs
/bin/echo "panfs://$REALM:global /panfs panfs $mount_flags 0 0" >> /etc/fstab.panfs
/bin/mv -f /etc/fstab.panfs /etc/fstab
/bin/sync
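A minimal sketch of how a <post> section like this gets applied, assuming it lives in a site file such as extend-compute.xml and a Rocks 3.x/4.x frontend (later versions use different commands):

# On the frontend: rebuild the distribution so the modified kickstart
# graph (including extend-compute.xml) is used for node installs
cd /home/install
rocks-dist dist
# Then re-install the compute nodes; the <post> script above runs on each
# node after its packages have been installed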
© 2005 UC Regents 48
Extending Rocks
© 2005 UC Regents 50
Portland Group Compilers (cont.)
Create/modify extend-compute.xml
<file name="/etc/profile.d/pgi-path.sh">
LM_LICENSE_FILE=/home/tools/pgi/license.dat
export LM_LICENSE_FILE
export PGI=/home/tools/pgi
if ! echo ${PATH} | grep -q /home/tools/pgi/linux86-64/6.0/bin ;
then
PATH=/home/tools/pgi/linux86-64/6.0/bin:${PATH}
fi
</file>
<file name="/etc/profile.d/pgi-path.csh">
setenv PGI /home/tools/pgi
setenv LM_LICENSE_FILE $PGI/license.dat
if ( /home/tools/pgi/linux86-64/6.0/bin !~ "${path}" ) then
set path = ( /home/tools/pgi/linux86-64/6.0/bin $path )
endif
</file>
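After rebuilding the distribution and re-installing a node, a quick sanity check that the profile scripts took effect (the node name is an example; pgcc is the PGI C compiler):

# Log in to a freshly re-installed compute node (a login shell sources
# /etc/profile.d) and confirm the compiler and license path resolve:
ssh compute-0-0
which pgcc                # expect /home/tools/pgi/linux86-64/6.0/bin/pgcc
echo $LM_LICENSE_FILE     # expect /home/tools/pgi/license.dat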
© 2005 UC Regents 51
Stanford Integration
© 2005 UC Regents 53
Kerberos and AFS
Changes…
# Point OpenAFS at the Stanford cell
mv /usr/vice/etc/stanford-CellServDB /usr/vice/etc/CellServDB
mv /usr/vice/etc/stanford-ThisCell /usr/vice/etc/ThisCell
/etc/init.d/afs start
# Replace the stock Kerberos client tools with the Stanford (pubsw) versions
mv /usr/kerberos/bin/kinit /usr/kerberos/bin/kinit.orig
cp /usr/pubsw/bin/kinit /usr/kerberos/bin/kinit
chmod 755 /usr/kerberos/bin/kinit
mv /usr/kerberos/bin/kdestroy /usr/kerberos/bin/kdestroy.orig
cp /usr/pubsw/bin/kdestroy /usr/kerberos/bin/kdestroy
chmod 755 /usr/kerberos/bin/kdestroy
mv /usr/kerberos/bin/klist /usr/kerberos/bin/klist.orig
cp /usr/pubsw/bin/klist /usr/kerberos/bin/klist
chmod 755 /usr/kerberos/bin/klist
# Use the site-wide (leland) Kerberos configuration files
mv /etc/krb5.conf /etc/krb5.conf.orig
ln -s /etc/leland/krb5.conf /etc/krb5.conf
mv /etc/krb.conf /etc/krb.conf.orig
ln -s /etc/leland/krb.conf /etc/krb.conf
© 2005 UC Regents 54
Kerberos and AFS
Changes…
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
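Purely as a hypothetical illustration, an authconfig-generated /etc/pam.d/system-auth auth stack with Kerberos enabled typically contained entries like these on RedHat systems of this era (module options vary by site):

auth        required      /lib/security/$ISA/pam_env.so
auth        sufficient    /lib/security/$ISA/pam_unix.so likeauth nullok
auth        sufficient    /lib/security/$ISA/pam_krb5.so use_first_pass
auth        required      /lib/security/$ISA/pam_deny.so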
© 2005 UC Regents 56
Kerberos and AFS
Next steps…
© 2005 UC Regents 57
Next steps for you?
Build your Rocks Clusters
© 2005 UC Regents 58
Questions?
Rocks Clusters
http://www.rocksclusters.org
npaci-rocks-discussion@sdsc.edu
ICME Clusters
http://www.hpcclusters.org
stevejones@stanford.edu
© 2005 UC Regents 59