
Overview of Rocks

Greg Bruno
Rocks Development Group
San Diego Supercomputer Center
HPC Cluster Architecture

© 2005 UC Regents 2
Cluster Pioneers
 In the mid-1990s, the Network of Workstations project (UC
Berkeley) and the Beowulf Project (NASA) asked the
question:

Can You Build a High Performance Machine From Commodity Components?

© 2005 UC Regents 3
Clusters Dominate the Top500

© 2005 UC Regents 4
The Dark Side of Clusters
 Clusters are phenomenal price/performance
computational engines …
 Can be hard to manage without experience
 High-performance I/O is still unsolved
 The effort to find where something has failed grows at least
linearly with cluster size
 Not cost-effective if every cluster “burns” a person just
for care and feeding
 Programming environment could be vastly improved
 Technology is changing very rapidly. Scaling up is
becoming commonplace (128-256 nodes)

© 2005 UC Regents 5
Two Critical Problems with
Clusters
 Software skew
 When software configuration on some nodes
is different than on others
 Small differences (minor version numbers on
libraries) can cripple a parallel program
 Adequate job control of a parallel process
 Signal propagation
 Cleanup
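
As a rough illustration of the cleanup problem, this is the kind of command an administrator ends up running by hand. A sketch only: cluster-fork ships with Rocks and appears later in this deck, but the user name and process name below are made up.

# Illustrative cleanup of a wedged parallel job across all compute nodes
# ("jsmith" and "my_mpi_app" are hypothetical)
cluster-fork 'pkill -9 -u jsmith my_mpi_app'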

© 2005 UC Regents 6
Rocks (open source clustering distribution)
www.rocksclusters.org

 Technology transfer of commodity clustering to application scientists


 “make clusters easy”
 Scientists can build their own supercomputers and migrate up to national centers as needed
 Rocks is a cluster on a CD
 RedHat Enterprise Linux (open source and free)
 Clustering software (PBS, SGE, Ganglia, NMI)
 Highly programmatic software configuration management
 Core software technology for several campus projects
 BIRN
 Center for Theoretical Biological Physics
 EOL
 GEON
 NBCR
 OptIPuter
 First Software release Nov, 2000
 Supports x86, Opteron/Nocona, and Itanium

© 2005 UC Regents 7
Philosophy
 Care and feeding of a
system is not fun
 System Administrators cost
more than clusters
 1 TFLOP cluster is less than
$200,000 (US)
 Close to the actual cost of a
full-time administrator
 The system administrator is
the weakest link in the cluster
 Bad ones like to tinker
 Good ones still make
mistakes

© 2005 UC Regents 8
Philosophy
continued
 All nodes are 100% automatically
configured
 Zero “hand” configuration
 This includes site-specific
configuration
 Run on heterogeneous standard
high volume components
 Use components that offer the
best price/performance
 Software installation and
configuration must support
different hardware
 Homogeneous clusters do not
exist
 Disk imaging requires a
homogeneous cluster

© 2005 UC Regents 9
Philosophy
continued
 Optimize for installation
 Get the system up quickly
 In a consistent state
 Build supercomputers in hours
not months
 Manage through re-installation
 Can re-install 128 nodes in under
20 minutes
 No support for on-the-fly system
patching
 Do not spend time trying to enforce
system consistency
 Just re-install
 Can be batch driven
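
The batch-driven re-install mentioned above boils down to a single command from the frontend; the same invocation appears on the Panasas and Kerberos slides later in this deck.

# Re-install every compute node from the frontend; each node reboots into
# the Rocks installer and comes back in a known, consistent state
cluster-fork '/boot/kickstart/cluster-kickstart'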

© 2005 UC Regents 10
Rocks Basic Approach
 Install a frontend
1. Insert Rocks Boot CD
2. Insert Roll CDs (optional components)
3. Answer 7 screens of configuration
data
4. Drink coffee (takes about 30 minutes
to install)
 Install compute nodes:
1. Login to frontend
2. Execute “insert-ethers”
3. Boot compute node with Rocks Boot CD (or PXE)
4. Insert-ethers discovers nodes
5. Goto step 3
 Add user accounts
 Start computing
Optional Rolls
 Condor
 Grid (based on NMI R4)
 Intel (compilers)
 Java
 SCE (developed in Thailand)
 Sun Grid Engine
 PBS (developed in Norway)
 Area51 (security monitoring tools)
 Many Others …
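
Going back to the compute-node steps above, a minimal sketch of what steps 2 through 5 look like in practice (the compute-0-N names follow the usual Rocks convention; the exact prompts vary by release):

# On the frontend:
insert-ethers                # choose "Compute" as the appliance type
# Power on each compute node with the Rocks Boot CD or via PXE.
# insert-ethers notices each node's DHCP request and registers it as
# compute-0-0, compute-0-1, ... while the node kickstarts itself.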

© 2005 UC Regents 11
Minimum Requirements
 Frontend
 2 Ethernet Ports
 CDROM
 18 GB Disk Drive
 512 MB RAM
 Compute Nodes
 1 Ethernet Port
 18 GB Disk Drive
 512 MB RAM

 Complete OS Installation on all Nodes


 No support for Diskless

 Not a Single System Image

 All Hardware must be supported by RHEL


© 2005 UC Regents 12
Rolls

A Quick Overview
Goal of Rolls
 Develop a method to reliably install software on
a frontend

 “User-customizable” frontends

 Two established approaches:


 Add-on method
 Rocks method

© 2005 UC Regents 14
Add-on Method
1. User responsible for installing and configuring base
software stack on a frontend
2. After the frontend installation, the user downloads
‘add-on’ packages
3. User installs and configures add-on packages
4. User installs compute nodes

Major issue with the add-on method


 The state of the frontend before the add-on packages
are added/configured is unknown

© 2005 UC Regents 15
Rocks Method
 To address the major problem with the add-on
method, we had the following idea:
 All non-RedHat packages must be installed and
configured in a controlled environment

 A controlled environment has a known state

 We chose the RedHat installation environment as the controlled environment

© 2005 UC Regents 16
Goal of Rolls
 This led us to modify the standard RedHat installer so it would
accept new packages and configuration
 A tricky proposition
 A RedHat distribution is a monolithic entity
• It’s tightly-coupled
• A program called “genhdlist” creates binary files (hdlist and
hdlist2) that contain metadata about every RPM in the distribution
 To add/remove/change an RPM, you need to re-run
genhdlist
 Else, the RedHat install will not recognize the package
 Or worse, it fails during package installation
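
Roughly what that manual step looks like on a stock RedHat tree. This is a sketch only: the genhdlist location and flags vary by RedHat release, and /path/to/distro is a placeholder.

# Add an RPM to a stock RedHat distribution tree...
cp newpackage-1.0-1.i386.rpm /path/to/distro/RedHat/RPMS/
# ...then regenerate the hdlist/hdlist2 metadata so anaconda can see it
/usr/lib/anaconda-runtime/genhdlist /path/to/distro
# Skip the second step and the installer either ignores the new RPM
# or fails partway through package installation.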

© 2005 UC Regents 17
Monolithic Software Stack

© 2005 UC Regents 18
Rolls

 Dissecting the monolithic software stack


© 2005 UC Regents 19
Rolls

 Think of a roll as a “package” on a car


© 2005 UC Regents 20
Rolls Function and Value

 Function: Rolls extend/modify stock RedHat


 Value: Third parties can extend/modify Rocks
 Because Rolls can be optional

© 2005 UC Regents 21
Rocks

Making Clusters Easy


HPCwire Reader’s Choice
Awards for 2004

 Rocks won in three categories:


 Most Important Software Innovation
(Reader’s Choice)
 Most Important Software Innovation
(Editor’s Choice)
 Most Innovative - Software
(Reader’s Choice)
© 2005 UC Regents 23
Registration Page
(optional)

© 2005 UC Regents 24
Two Examples

Rockstar - SDSC
Tungsten2 - NCSA
Super Computing 2003 Demo
 We wanted to build a Top500
machine live at SC’03
 From the ground up (hardware and
software)
 In under two hours
 Show that anyone can build a supercomputer with:
 Off-the-shelf software (Rocks)
 Money
 No army of system administrators
required
 HPC Wire Interview
 HPCwire: What was the most
impressive thing you’ve seen at
SC2003?
 Larry Smarr: I think, without
question, the most impressive thing
I’ve seen was Phil Papadopoulos’
demo with Sun Microsystems.

© 2005 UC Regents 26
Rockstar Cluster
 129 Sun Fire V60x servers
 1 Frontend Node
 128 Compute Nodes
 Gigabit Ethernet
 $13,000 (US)
 9 24-port switches
 8 4-gigabit trunk uplinks
 Built live at SC’03
 In under two hours
 Running applications
 Top500 Ranking
 11.2003: 201
 06.2004: 433
 49% of peak
© 2005 UC Regents 27
NCSA
National Center for Supercomputing Applications

 Tungsten2
 520 Node Cluster
 Dell Hardware
 Topspin Infiniband
 Deployed 11.2004
 Easily in top 100 of the 06.2005 top500 list
 “We went from PO to crunching code in 2 weeks. It only took another 1 week to shake out some math library conflicts, and we have been in production ever since.” -- Greg Keller, NCSA (Dell On-site Support Engineer)

[Fabric diagram: Core Fabric of 6 72-port TS270 switches; 174 uplink cables to an Edge Fabric of 29 24-port TS120 switches; 512 1m cables down to the compute nodes. Largest registered Rocks cluster; source: Topspin (via Google).]
© 2005 UC Regents 28
Rocks at Stanford

Steve Jones
Institute for Computational and
Mathematical Engineering
Stanford University
Research Groups
 Flow Physics and Computation
 Aeronautics and Astronautics
 Chemical Engineering
 Center for Turbulence Research
 Center for Integrated Turbulence Simulations

Academic
 Institute for Computational and Mathematical Engineering
 Scientific Computing and Computational Mathematics

Funding
 Sponsored Research (AFOSR/ONR/DARPA/DURIP/ASC)

© 2005 UC Regents 30
Active collaboration with the Labs
Buoyancy driven instabilities/mixing - CDP for modeling plumes (Stanford/SNL)

LES Technology - Complex Vehicle Aerodynamics using CDP (Stanford/LLNL)

Tsunami modeling - CDP for Canary Islands Tsunami Scenarios (Stanford/LANL)

Parallel I/O & Large-Scale Data Visualization - UDM integrated in CDP (Stanford/LANL)

Parallel Global Solvers - HyPre Library integrated in CDP (Stanford/LLNL)

Parallel Grid Generation - Cubit and related libraries (Stanford/SNL)

Merrimac - Streaming Supercomputer Prototype (Stanford/LLNL/LBNL/NASA)

© 2005 UC Regents 31
Affiliates Program

© 2005 UC Regents 32
Bio-X
 Iceberg
 300 Node Cluster
• Dual 2.8GHz
 Dell Hardware
 Fast Ethernet
 Deployed 11.2002
 Top500 Ranking
 6.2003: 319
 Shared resource for Bio-X

© 2005 UC Regents 33
Iceberg
[Diagram: Frontend server on the campus backbone; GigE network with an NFS appliance; compute nodes grouped behind Fast Ethernet switches.]
© 2005 UC Regents 34
Top 500 Supercomputer

© 2005 UC Regents 35
CITS
 Gfunk
 Rocks 3.3
 1 Frontend
 82 Nodes
• Dual 2.66GHz Xeon
• 2GB RAM
• Myrinet C card
 1 NFS Appliance
• 1TB Storage

© 2005 UC Regents 36
Gfunk

[Diagram: Frontend server on the campus backbone; GigE network with an NFS appliance; compute nodes interconnected by Myrinet.]

© 2005 UC Regents 37
AFOSR
 Nivation
 Rocks 3.3
 1 Frontend
 82 Nodes
• Dual 3.0 GHz Xeon
• 4GB RAM
• Myrinet D card
 4 Application Nodes
• Matlab
• Ansys ICEM/CFD
• Compilers
 Panasas Storage Cluster
• 4TB Shelf
 2 NFS Appliances
• 1TB per appliance (scratch)
© 2005 UC Regents 38
Nivation
[Diagram: Frontend server and four application nodes (Tools-1 through Tools-4) on the campus backbone; GigE network with two NFS appliances; compute nodes interconnected by Myrinet.]

© 2005 UC Regents 39
DARPA/DURIP
 Rocks 3.3 x86_64
 1 Frontend
 1TB Storage
 24 Nodes
 48 AMD Opteron 242
 2 GB RAM
 PCI-X IB
 Penguin Computing Hardware
 Infinicon Infiniband
 The easiest and fastest setup of all our clusters
 Due in large part to the AMD Cluster Development Center
© 2005 UC Regents 40
Sintering
[Diagram: Frontend server with a 1TB disk array on the campus backbone; GigE network; compute nodes interconnected by Infiniband.]

© 2005 UC Regents 41
DARPA
 Ablation
 Rocks 4.0 CentOS
 1 Frontend
• 1TB storage
 48 Nodes
• Dual 3.4 GHz EM64T
• 2GB RAM
• PCI-X HCA IB card
 Dell Hardware
 Topspin Infiniband
 Panasas Storage Cluster
• 4TB Shelf
© 2005 UC Regents 42
Ablation
[Diagram: Frontend server on the campus backbone; GigE network; compute nodes interconnected through a Topspin 270 Infiniband switch.]
© 2005 UC Regents 43
ICME Teaching Cluster
 Calving
 10 compute nodes/Dell
 Dual 3.4 GHz EM64T
 2GB RAM
 10 compute nodes/Penguin
 Dual 1.0 GHz
 1GB RAM
 Gigabit Ethernet
 Easy to change software
configurations each quarter
 SUNetID Access Controlled
 AFS Access
© 2005 UC Regents 44
Calving
[Diagram: Frontend server on the campus backbone; GigE network split into a “Slow” bank and a “Fast” bank of compute nodes.]
© 2005 UC Regents 45
Extending Rocks

Some examples of software distribution…
Installing software for Panasas
 Create/modify extend-compute.xml
<package> panfs-2.4.21-15.EL-mp </package>

<post>
<!-- Insert your post installation script here. This
code will be executed on the destination node after the
packages have been installed. Typically configuration files
are built and services setup in this section. -->
# Add panfs to fstab
REALM=10.10.10.10
mount_flags="rw,noauto,panauto"
/bin/rm -f /etc/fstab.bak.panfs
/bin/rm -f /etc/fstab.panfs
/bin/cp /etc/fstab /etc/fstab.bak.panfs
/bin/grep -v "panfs://" /etc/fstab > /etc/fstab.panfs
/bin/echo "panfs://$REALM:global /panfs panfs $mount_flags 0 0" >> /etc/fstab.panfs
/bin/mv -f /etc/fstab.panfs /etc/fstab
/bin/sync

/sbin/chkconfig --add panfs

/usr/local/sbin/check_panfs

LOCATECRON=/etc/cron.daily/slocate.cron
LOCATE=/etc/sysconfig/locate
LOCTEMP=/tmp/slocate.new

/bin/cat $LOCATECRON | sed "s/,proc,/,proc,panfs,/g" > $LOCTEMP
/bin/mv -f $LOCTEMP $LOCATECRON
/bin/cat $LOCATECRON | sed "s/\/afs,/\/afs,\/panfs,/g" > $LOCTEMP
/bin/mv -f $LOCTEMP $LOCATECRON
</post>
© 2005 UC Regents 47
Installing software for Panasas
(cont.)
 Copy the RPM to the contrib directory
 /home/install/contrib/~/RPMS
 Rebuild the distribution
 cd /home/install
 rocks-dist dist
 Distribute to the cluster
 cluster-fork ‘/boot/kickstart/cluster-kickstart’
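
Put together, the three steps above look roughly like this from the frontend. A sketch only: the contrib subdirectory (elided as “~” above) depends on the Rocks version and architecture, and the RPM filename is abbreviated.

# 1. Drop the panfs RPM into the contrib area of the distribution
cp panfs-2.4.21-15.EL-mp*.rpm /home/install/contrib/~/RPMS/
# 2. Rebuild the Rocks distribution so kickstart sees the new package
cd /home/install
rocks-dist dist
# 3. Push the change out by re-installing the compute nodes
cluster-fork '/boot/kickstart/cluster-kickstart'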

© 2005 UC Regents 48
Extending Rocks

What if our software isn’t available in RPM format?
Portland Group Compilers
 Create directory on NFS mount point
 mkdir /home/tools/pgi
 Install PGI Workstation
 ./install from source directory
 Use /home/tools/pgi for install directory
 Copy license file to /home/tools/pgi
 Create login scripts for frontend
 /etc/profile.d/pgi-path.sh
 /etc/profile.d/pgi-path.csh

© 2005 UC Regents 50
Portland Group Compilers (cont.)
 Create/modify extend-compute.xml
<file name="/etc/profile.d/pgi-path.sh">
LM_LICENSE_FILE=/home/tools/pgi/license.dat
export LM_LICENSE_FILE
export PGI=/home/tools/pgi
if ! echo ${PATH} | grep -q /home/tools/pgi/linux86-64/6.0/bin ;
then
PATH=/home/tools/pgi/linux86-64/6.0/bin:${PATH}
fi
</file>

<file name="/etc/profile.d/pgi-path.csh">
setenv PGI /home/tools/pgi
setenv LM_LICENSE_FILE $PGI/license.dat
if ( "${path}" !~ */home/tools/pgi/linux86-64/6.0/bin* ) then
set path = ( /home/tools/pgi/linux86-64/6.0/bin $path )
endif
</file>
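
A quick way to sanity-check the result on a re-installed compute node. The node name is hypothetical; pgcc -V prints the PGI compiler version. Non-interactive ssh shells do not read /etc/profile.d, so the script is sourced by hand.

# From the frontend, after the node has re-installed:
ssh compute-0-0 '. /etc/profile.d/pgi-path.sh; which pgcc; pgcc -V'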
© 2005 UC Regents 51
Stanford Integration

Kerberos and AFS


Kerberos and AFS
 Kernel upgrade
 kernel-smp-2.6.9-11.EL.x86_64.rpm
 kernel-smp-devel-2.6.9-11.EL.x86_64.rpm
 kernel-sourcecode-2.6.9-11.EL.noarch.rpm

 Verify bootloader options


 /boot/grub/grub.conf

 Reboot the system


 Apply packages
 openafs-1.3.82-3.SL.x86_64.rpm
 kernel-module-openafs-2.6.9-11.ELsmp-1.3.82-3.SL.x86_64.rpm
 openafs-client-1.3.82-3.SL.x86_64.rpm
 sulinux-afsconfig-11-1.noarch.rpm
 sulinux-kerberos-11-2.x86_64.rpm
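
A sketch of how those steps translate to commands on the frontend. The filenames are the ones listed above; the rpm flags are ordinary usage, adjust as needed.

# Kernel upgrade first
rpm -ivh kernel-smp-2.6.9-11.EL.x86_64.rpm \
         kernel-smp-devel-2.6.9-11.EL.x86_64.rpm \
         kernel-sourcecode-2.6.9-11.EL.noarch.rpm
# Verify the new kernel is the default in /boot/grub/grub.conf, then reboot
reboot
# After the reboot, apply the AFS and Kerberos packages
rpm -ivh openafs-1.3.82-3.SL.x86_64.rpm \
         kernel-module-openafs-2.6.9-11.ELsmp-1.3.82-3.SL.x86_64.rpm \
         openafs-client-1.3.82-3.SL.x86_64.rpm \
         sulinux-afsconfig-11-1.noarch.rpm \
         sulinux-kerberos-11-2.x86_64.rpm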

© 2005 UC Regents 53
Kerberos and AFS
 Changes…
mv /usr/vice/etc/stanford-CellServDB /usr/vice/etc/CellServDB
mv /usr/vice/etc/stanford-ThisCell /usr/vice/etc/ThisCell

/etc/init.d/afs start
mv /usr/kerberos/bin/kinit /usr/kerberos/bin/kinit.orig
cp /usr/pubsw/bin/kinit /usr/kerberos/bin/kinit
chmod 755 /usr/kerberos/bin/kinit

mv /usr/kerberos/bin/kdestroy /usr/kerberos/bin/kdestroy.orig
cp /usr/pubsw/bin/kdestroy /usr/kerberos/bin/kdestroy
chmod 755 /usr/kerberos/bin/kdestroy

mv /usr/kerberos/bin/klist /usr/kerberos/bin/klist.orig
cp /usr/pubsw/bin/klist /usr/kerberos/bin/klist
chmod 755 /usr/kerberos/bin/klist

mv /etc/krb5.conf /etc/krb5.conf.orig
ln -s /etc/leland/krb5.conf /etc/krb5.conf
mv /etc/krb.conf /etc/krb.conf.orig
ln -s /etc/leland/krb.conf /etc/krb.conf
© 2005 UC Regents 54
Kerberos and AFS
 Changes…
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.

auth sufficient /lib/security/$ISA/pam_krb5afs.so tokens


auth required /lib/security/$ISA/pam_env.so
auth sufficient /lib/security/$ISA/pam_unix.so likeauth nullok
auth required /lib/security/$ISA/pam_deny.so
account required /lib/security/$ISA/pam_unix.so broken_shadow
account sufficient /lib/security/$ISA/pam_succeed_if.so uid < 100 quiet
account [default=bad success=ok user_unknown=ignore] /lib/security/$ISA/pam_krb5afs.so
account required /lib/security/$ISA/pam_permit.so
password requisite /lib/security/$ISA/pam_cracklib.so retry=3
password sufficient /lib/security/$ISA/pam_krb5afs.so use_authtok
password sufficient /lib/security/$ISA/pam_unix.so nullok use_authtok md5 shadow
password required /lib/security/$ISA/pam_deny.so
session required /lib/security/$ISA/pam_limits.so
session required /lib/security/$ISA/pam_unix.so
session optional /lib/security/$ISA/pam_krb5afs.so
© 2005 UC Regents 55
Kerberos and AFS
 Compute node changes
 Modify extend-compute.xml
<package> kernel-smp </package>
<package> kernel-smp-devel </package>
<package> kernel-sourcecode </package>
<package> openafs </package>
<package> kernel-module-openafs </package>
<package> openafs-client </package>
<package> sulinux-afsconfig </package>
<package> sulinux-kerberos </package>

 Copy RPMS to contrib directory


 rocks-dist dist
 cluster-fork ‘/boot/kickstart/cluster-kickstart’
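
One way to spot-check a node after it re-installs (the node name is hypothetical):

# Confirm the packages landed and the AFS client is serving /afs
ssh compute-0-0 'rpm -q openafs-client sulinux-kerberos; ls /afs | head'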

© 2005 UC Regents 56
Kerberos and AFS
 Next steps…

Build the ‘Stanford Kerberos Roll’ for simple integration

© 2005 UC Regents 57
Next steps for you?
 Build your Rocks Clusters

 Register your clusters


 http://www.rocksclusters.org/rocks-register/

 DEMAND YOUR VENDORS SUPPORT ROCKS!
 http://www.rocksclusters.org

© 2005 UC Regents 58
Questions?
 Rocks Clusters
 http://www.rocksclusters.org

 npaci-rocks-discussion@sdsc.edu

 ICME Clusters
 http://www.hpcclusters.org

 stevejones@stanford.edu

© 2005 UC Regents 59
