1. INTRODUCTION
SISD (Single Instruction stream, Single Data stream) type computers. These are the conventional systems that contain one central processing unit (CPU) and hence accommodate one instruction stream that is executed serially. Nowadays many large mainframes may have more than one CPU, but each of these executes instruction streams that are unrelated. Therefore, such systems should still be regarded as a set of SISD machines acting on different data spaces. Examples of SISD machines are most workstations, like those of DEC, IBM, Hewlett-Packard, and Sun Microsystems, as well as most personal computers.
SIMD (Single Instruction stream, Multiple Data stream) type computers. Such systems often have a large number of processing units that all execute the same instruction on different data in lock-step. Thus, a single instruction manipulates many data items in parallel. Examples of SIMD machines are the CPP DAP Gamma II and the Alenia Quadrics.
MIMD (Multiple Instruction stream, Multiple Data stream) type computers. These machines execute several instruction streams in parallel on different data. The difference from the multi-processor SISD machines mentioned above lies in the fact that the instructions and data are related, because they represent different parts of the same task to be executed. So, MIMD systems may run many sub-tasks in parallel in order to shorten the time-to-solution for the main task. There is a large variety of MIMD systems, ranging from a four-processor NEC SX-5 to a thousand-processor SGI/Cray T3E supercomputer. Besides the classification above, another important distinction between classes of computing systems can be made according to the type of memory access (Figure 1).
Shared memory (SM) systems have multiple CPUs, all of which share the same address space. This means that knowledge of where data is stored is of no concern to the user, as there is only one memory, accessed by all CPUs on an equal basis. Shared memory systems can be either SIMD or MIMD. Single-CPU vector processors can be regarded as an example of the former, while the multi-CPU models of these machines are examples of the latter.
Distributed memory (DM) systems. In this case each CPU has its own associated
memory. The CPUs are connected by some network and may exchange data between
their respective memories when required. In contrast to shared memory machines the
user must be aware of the location of the data in the local memories and will have to
move or distribute these data explicitly when needed. Again, distributed memory
systems may be either SIMD or MIMD.
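As a miniature illustration of the shared-memory model described above, the following Python sketch (using the standard multiprocessing module on a single machine, not a real SM system) has several workers update one counter in a single shared address space; the worker function and counter are invented for the example:

```python
from multiprocessing import Process, Value

def increment(counter):
    # Each worker updates the same shared memory location; the lock
    # serializes access, much as hardware coherence would on an SM machine.
    with counter.get_lock():
        counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)  # one integer visible to all workers
    workers = [Process(target=increment, args=(counter,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)  # 4: all four increments land in the same memory
```

In a distributed-memory system, by contrast, each worker would hold its own copy of the counter and the partial results would have to be exchanged explicitly over the network.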
To better understand the current situation in the field of HPC systems and the place of cluster-type computers among them, a brief overview of supercomputer history is given below.
1.1 SUPERCOMPUTERS
An important breakthrough in the field of HPC systems came in the late 1970s, when the Cray-1 and soon the CDC Cyber 203/205 systems, both based on vector technology, were built. These supercomputers were able to achieve unprecedented performance for certain applications, being more than one order of magnitude faster than other available computing systems. In particular, the Cray-1 system boasted at that time a world-record speed of 160 million floating-point operations per second (MFlops). It was equipped with an 8 megabyte main memory and priced at $8.8 million.
The range of early supercomputer applications was typically limited to those having regular, easily vectorisable data structures and being very demanding in terms of floating-point performance. Some examples include mechanical engineering, fluid dynamics and cryptography tasks. The use of vector computers by a broader community was initially limited by the lack of programming tools and vectorising compilers, so that applications had to be hand-coded and optimized for a specific computer system. However, commercial software packages became available for vector computers in the 1980s, pushing up their industrial use. At this time, the first multiprocessor supercomputer, the Cray X-MP, was developed and achieved a performance of 500 MFlops.
Supercomputers are defined as the fastest, most powerful computers in terms of CPU power and I/O capabilities. Since computer technology is continually evolving, this is always a moving target: this year's supercomputer may well be next year's entry-level personal computer. In fact, today's commonly available personal computers deliver performance that easily bests the supercomputers that were on the market in the 1980s.
A strong limitation on the further scalability of vector computers was their shared-memory architecture. Therefore, massively parallel processing (MPP) systems using distributed memory were introduced by the end of the 1980s. The main advantage of such systems is the possibility of dividing a complex job into several parts, which are executed in parallel by several processors, each having dedicated memory (Figure 1). The communication between the parts of the main job occurs within the framework of the so-called message-passing paradigm, which was standardized in the Message Passing Interface (MPI). The message-passing paradigm is flexible enough to support a variety of applications and is also well adapted to the MPP architecture. In recent years, a tremendous improvement in the performance of standard workstation processors has led to their use in MPP supercomputers, resulting in significantly lower price/performance ratios.
A better understanding of applications and algorithms, as well as significant improvements in communication network technologies and processor speed, led to the emergence of a new class of systems, called clusters of SMPs or networks of workstations (NOW), which are able to compete in performance with MPPs and have excellent price/performance ratios for special application types. In practice, clustering technology can be used for any arbitrary group of computers, allowing one to build homogeneous or heterogeneous systems. Even greater performance can be achieved by combining groups of clusters into a HyperCluster or even a Grid-type system.
It is worth noting that, by the end of 2002, the most powerful existing HPC systems had performance in the range of 3 to 36 TFlops. The top five supercomputer systems included the Earth Simulator (35.86 TFlops, 5120 processors), installed by NEC in 2002; two ASCI Q systems (7.72 TFlops, 4096 processors), built by Hewlett-Packard in 2002 and based on the AlphaServer SC computer systems; ASCI White (7.23 TFlops, 8192 processors), installed by IBM in 2000 [4]; and, a pleasant surprise, the MCR Linux Cluster (5.69 TFlops, 2304 Xeon 2.4 GHz processors), built by Linux NetworX in 2002 for Lawrence Livermore National Laboratory (USA). According to the TOP500 Supercomputers List from November 2002, cluster-based systems represent 18.6% of all supercomputers, and most of them (about 60%) use Intel processors. Finally, one should note that the application range of modern supercomputers is very wide and addresses mainly industrial, research and academic fields. The covered areas are related to telecommunications, weather and climate research/forecasting, financial risk analysis, car crash analysis, databases and information services, manufacturing, geophysics, computational chemistry and biology, pharmaceutics, the aerospace industry, electronics and much more.
2. CLUSTERS
Extraordinary technological improvements over the past few years in areas such as microprocessors, memory, buses, networks, and software have made it possible to assemble groups of inexpensive personal computers and/or workstations into a cost-effective system that functions in concert and possesses tremendous processing power. Cluster computing is not new, but in company with other technical capabilities, particularly in the area of networking, this class of machines is becoming a high-performance platform for parallel and distributed applications. Scalable computing clusters, ranging from a cluster of (homogeneous or heterogeneous) PCs or workstations to SMPs (Symmetric Multi-Processors), are rapidly becoming the standard platforms for high-performance and large-scale computing.
A network is used to provide inter-processor communications. Applications that are distributed across the processors of the cluster use either message passing or network shared memory for communication. A cluster computing system is a compromise between a massively parallel processing system and a distributed system. An MPP (Massively Parallel Processor) system node typically cannot serve as a standalone computer; a cluster node usually contains its own disk and is equipped with a complete operating system, and therefore it can also handle interactive jobs. In a distributed system, each node functions only as an individual resource, while a cluster system presents itself as a single system to the user.
Figure 3. Architecture of cluster systems
Figure 4. Logical view of a cluster
The master node acts as a server for the Network File System (NFS) and as a gateway to the outside world. As an NFS server, the master node provides user file space and other common system software to the compute nodes via NFS. As a gateway, the master node allows users to gain access through it to the compute nodes. Usually, the master node is the only machine that is also connected to the outside world, using a second network interface card (NIC). The sole task of the compute nodes is to execute parallel jobs. In most cases, therefore, the compute nodes do not have keyboards, mice, video cards, or monitors. All access to the compute nodes is provided via remote connections from the master node. Because compute nodes do not need to access machines outside the cluster, nor do machines outside the cluster need to access compute nodes directly, compute nodes commonly use private IP addresses, such as the 10.0.0.0/8 or 192.168.0.0/16 address ranges.
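Python's standard ipaddress module can confirm that the address ranges mentioned above are indeed private (RFC 1918) ranges; the sample addresses below are invented for illustration:

```python
import ipaddress

# Typical compute-node addresses fall inside the RFC 1918 private ranges,
# while a public address does not.
for addr in ["10.0.0.12", "192.168.1.50", "8.8.8.8"]:
    ip = ipaddress.ip_address(addr)
    print(addr, "private" if ip.is_private else "public")
```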
From a user's perspective, a Beowulf cluster appears as a Massively Parallel Processor (MPP) system. The most common methods of using the system are to access the master node either directly or through Telnet or remote login from personal workstations. Once on the master node, users can prepare and compile their parallel applications, and also spawn jobs on a desired number of compute nodes in the cluster. Applications must be written in parallel style and use the message-passing programming model. Jobs of a parallel application are spawned on compute nodes, which work collaboratively until finishing the application. During the execution, compute nodes use standard message-passing middleware, such as the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM), to exchange information.
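The spawn-and-exchange pattern can be sketched in miniature with Python's standard multiprocessing module standing in for MPI or PVM; the worker function and the summing task are illustrative assumptions, not part of any Beowulf toolkit:

```python
from multiprocessing import Process, Pipe

def worker(conn):
    # The worker blocks until the master sends a task, computes, and replies.
    task = conn.recv()
    conn.send(sum(task))
    conn.close()

if __name__ == "__main__":
    master_end, worker_end = Pipe()
    p = Process(target=worker, args=(worker_end,))
    p.start()
    master_end.send([1, 2, 3, 4])   # explicit send, as with MPI_Send
    print(master_end.recv())        # explicit receive, as with MPI_Recv -> 10
    p.join()
```

On a real cluster the two endpoints would live on different machines and the middleware would carry the messages over the network rather than over a local pipe.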
3. WHY CLUSTERS
The question may arise why clusters are designed and built when perfectly good commercial supercomputers are available on the market. The answer is that the latter are expensive, while clusters are surprisingly powerful.
Commercial products have their place, and there are perfectly good reasons to buy a commercially produced supercomputer, if it is within our budget and our applications can keep the machine busy all the time. We will also need a data center to keep it in, and then there is the budget to keep up with the maintenance and upgrades required to keep our investment up to par. However, many who need to harness supercomputing power don't buy supercomputers because they can't afford them. Supercomputers are also very difficult to upgrade.
Clusters, on the other hand, are a cheap and easy way to take off-the-shelf components and combine them into a single supercomputer. In some areas of research clusters are actually faster than commercial supercomputers. Clusters also have the distinct advantage that they are simple to build using components available from hundreds of sources. We don't even have to use new equipment to build a cluster.
Today, open standards-based HPC systems are being used to solve problems from
high-end, floating-point intensive scientific and engineering problems to data-intensive
tasks in industry. Some of the reasons why HPC clusters outperform RISC-based systems
include:
Collaboration
Scalability
HPC clusters can grow in overall capacity because processors and nodes can be added as
demand increases.
Availability
Because single points of failure can be eliminated, if any one system component goes
down, the system as a whole or the solution (multiple systems) stay highly available.
Upgradability
Processors, memory, disk or operating system (OS) technology can be easily updated, and new processors and nodes can be added or upgraded as needed.
Total cost of ownership
Compared to proprietary systems, the total cost of ownership can be much lower. This includes service, support and training.
Vendor lock-in
The age-old problem of proprietary vs. open systems that use industry-accepted standards
is eliminated.
System manageability
Reusability of components
Commercial components can be reused, preserving the investment. For example, older
nodes can be deployed as file/print servers, web servers or other infrastructure servers.
Disaster recovery
Large SMPs are monolithic entities located in one facility. HPC systems can be co-located or geographically dispersed to make them less susceptible to disaster.
4. CLUSTERING CONCEPTS
Clusters are in fact quite simple. They are a bunch of computers tied together with a
network working on a large problem that has been broken down into smaller pieces.
There are a number of different strategies we can use to tie them together. There are
also a number of different software packages that can be used to make the software side
of things work.
4.1. PARALLELISM
On one level hardware parallelism deals with the CPU of an individual system
and how we can squeeze performance out of sub-components of the CPU that can speed
up our code. At another level there is the parallelism that is gained by having multiple
systems working on a computational problem in a distributed fashion. These systems are
known as ‘fine grained’ for parallelism inside the CPU or having to do with the multiple
CPUs in the same system, or ‘coarse grained’ for parallelism of a collection of separate
systems acting in concerts.
CPU-LEVEL PARALLELISM
Modern CPU architectures have an inherent ability to do more than one thing at once. The logic of the CPU chip divides the CPU into multiple execution units, which allow the CPU to attempt to process more than one instruction at a time. Two hardware features of modern CPUs support multiple execution units: the cache, a small, fast memory inside the CPU, and the pipeline, a small area of memory inside the CPU where instructions that are next in line to be executed are stored. Both the cache and the pipeline allow impressive increases in CPU performance. Exploiting them also requires a lot of intelligence on the part of the compiler to arrange the executable code in such a way that the CPU has a good chance of being able to execute multiple instructions simultaneously.
Parallelism can also be introduced at the system level. For example, if we decide that each node in our cluster will be a multi-CPU system, we will be introducing a fundamental degree of parallel processing at the node level. Having more than one network interface on each node introduces communication channels that may be used in parallel to communicate with other nodes in the cluster. Finally, if we use multiple disk drive controllers in each node, we create parallel data paths that can be used to increase the performance of the I/O subsystem.
Software parallelism is the ability to find well defined areas in a problem we want to
solve that can be broken down into self-contained parts. These parts are the program
elements that can be distributed and give us the speedup that we want to get out of a high
performance computing system.
Before we can run a program on a parallel cluster, we have to ensure that the problem we are trying to solve is amenable to being done in a parallel fashion. Almost any problem that can be decomposed into well-defined, self-contained sub-problems can be broken down and run across the nodes of a cluster.
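The idea of breaking a problem into self-contained parts can be sketched with a simple parallel sum in Python; the four-way split and the use of local processes (rather than cluster nodes) are arbitrary choices for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each self-contained sub-problem is summed independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1000))
    chunks = [data[i::4] for i in range(4)]  # four disjoint sub-problems
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)  # 499500, the same answer as the serial sum
```

On a cluster, each chunk would be shipped to a compute node instead of a local process, but the decomposition step is the same.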
5. NETWORKING CONCEPTS
NETWORK PROTOCOLS
Protocols are a set of standards that describe a common information format that computers use to communicate across a network. A network protocol is analogous to the way information is sent from one place to another via the postal system: it specifies how information must be packaged and how it is labeled in order to be delivered from one computer to another.
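As a toy illustration of packaging and labeling, the Python sketch below defines a hypothetical mini-protocol in which each message is prefixed with a 4-byte big-endian length header; it is an invented example, not a description of any real protocol:

```python
import struct

def pack_message(payload: bytes) -> bytes:
    # "Label" the payload with a fixed 4-byte length header.
    return struct.pack("!I", len(payload)) + payload

def unpack_message(packet: bytes) -> bytes:
    # Read the label first, then extract exactly that many payload bytes.
    (length,) = struct.unpack("!I", packet[:4])
    return packet[4:4 + length]

packet = pack_message(b"hello")
print(unpack_message(packet))  # b'hello'
```

Both ends must agree on this format in advance, which is exactly what a protocol standard provides.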
NETWORK INTERFACES
The network interface is a hardware device that takes the information packaged by a network protocol and puts it into a format that can be transmitted over some physical medium, such as Ethernet, a fiber-optic cable, or even the air using radio waves.
TRANSMISSION MEDIUM
BANDWIDTH
Bandwidth is the amount of information that can be transmitted over a given transmission medium in a given amount of time. It is usually expressed in some form of bits per second.
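As a back-of-the-envelope illustration of this definition (assuming ideal conditions with no protocol overhead or latency), the best-case transfer time follows directly from the data size and the bandwidth:

```python
def transfer_time_seconds(size_bytes: int, bandwidth_bits_per_second: int) -> float:
    # Ideal transfer time: bits to move divided by bits per second.
    return size_bytes * 8 / bandwidth_bits_per_second

# Moving a 1 GB file over 100 Mbit Ethernet takes at best:
print(transfer_time_seconds(10**9, 100 * 10**6))  # 80.0 seconds
```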
ROUTING
6. OPERATING SYSTEM
Linux is a robust, free and reliable POSIX compliant operating system. Several
companies have built businesses from packaging Linux software into organized
distributions; RedHat is an example of such a company. Linux provides the features
typically found in standard UNIX such as multi-user access, pre-emptive multi-tasking,
demand-paged virtual memory and SMP support. In addition to the Linux kernel, a large
amount of application and system software and tools are also freely available. This makes
Linux the preferred operating system for clusters. The idea of the Linux cluster is to
maximize the performance-to-cost ratio of computing by using low-cost commodity
components and free-source Linux and GNU software to assemble a parallel and
distributed computing system. Software support includes the standard Linux/GNU
environment, including compilers, debuggers, editors, and standard numerical libraries.
Coordination and communication among the processing nodes is a key requirement of parallel-processing clusters. To accommodate this coordination, developers have created software to carry out the coordination and hardware to send and receive the coordinating messages. Messaging architectures such as MPI (Message Passing Interface) and PVM (Parallel Virtual Machine) allow the programmer to ensure that control and data messages take place as needed during operation.
Message passing is the mechanism by which nodes communicate and cooperate with each other. The message-passing model of communication is typically used by programs running on a set of discrete computing systems (each with its own memory) which are linked together by means of a communication network. A cluster is such a loosely coupled distributed-memory system.
PVM, or Parallel Virtual Machine, started out as a project at Oak Ridge National Laboratory and was developed further at the University of Tennessee. PVM is a complete distributed computing system, allowing programs to span several machines across a network. PVM utilizes a message-passing model that allows developers to distribute programs across a variety of machine architectures and across several data formats. PVM essentially collects the network's workstations into a single virtual machine: it allows a network of heterogeneous computers to be used as a single computational resource called the parallel virtual machine. PVM is a very flexible parallel processing environment, and it therefore supports almost all models of parallel programming, including the commonly used all-peers and master-slave paradigms.
A typical PVM consists of a (possibly heterogeneous) mix of machines on the network, one being the master host and the rest being worker or slave hosts. These various hosts communicate by message passing. PVM is started at the command line of the master, which in turn can spawn workers to achieve the desired configuration of hosts for the PVM. This configuration can be established initially via a configuration file. Alternatively, the virtual machine can be configured from the PVM command line (the master's console) or at run time from within the application program. A solution to a large task, suitable for parallelization, is divided into modules to be spawned by the master and distributed as appropriate among the workers. PVM consists of two software components: a resident daemon (pvmd) and the PVM library (libpvm). These must be available on each machine that is part of the virtual machine. The first component, pvmd, is the message-passing interface between the application program on each local machine and the network connecting it to the rest of the PVM. The second component, libpvm, provides the local application program with the necessary message-passing functionality, so that it can communicate with the other hosts. These library calls trigger corresponding activity by the local pvmd, which deals with the details of transmitting the message. The message is intercepted by the local pvmd of the target node and made available to that machine's application module via the related library call from within that program.
MPI is a message-passing library standard that was published in May 1994. The MPI standard is based on the consensus of the participants in the MPI Forum, organized by over 40 organizations. Participants included vendors, researchers, academics, software library developers and users. MPI offers portability, standardization, performance, and functionality. The advantage for the user is that MPI is standardized on many levels. For example, since the syntax is standardized, you can rely on your MPI code to execute under any MPI implementation running on your architecture. Since the functional behavior of MPI calls is also standardized, your MPI calls should behave the same regardless of the implementation. This guarantees the portability of your parallel programs. Performance, however, may vary between different implementations. MPI includes point-to-point message passing and collective (global) operations, all scoped to a user-specified group of processes. MPI provides a substantial set of libraries for the writing, debugging, and performance testing of distributed programs. Our system currently uses LAM/MPI, a portable implementation of the MPI standard developed cooperatively at the University of Notre Dame. LAM (Local Area Multicomputer) is an MPI programming environment and development system, and includes a visualization tool that allows users to examine the state of the machine allocated to their job as well as a means of studying message flows between nodes.
7. DESIGN CONSIDERATIONS
Before attempting to build a cluster of any kind, think about the type of problems
you are trying to solve. Different kinds of applications will actually run at different levels
of performance on different kinds of clusters. Beyond the brute force characteristics of
memory speed, I/O bandwidth, disk seek/latency time and bus speed on the individual
nodes of your cluster, the way you connect your cluster together can have a great impact
on its efficiency.
There are many kinds of clusters that may be used for different applications.
HOMOGENEOUS CLUSTERS
HETEROGENEOUS CLUSTERS
Heterogeneous clusters come in two general forms. The first and most common is made from different kinds of computers: it does not matter what the actual hardware is, except that there are different makes and models. A cluster made from such machines raises several very important issues.
7.2.1 CLUSTER NETWORKING
If you are mixing hardware with different networking technologies, there will be large differences in the speed with which data is accessed and how individual nodes can communicate. If it is within your budget, make sure that all of the machines you want to include in your cluster have similar networking capabilities and, if at all possible, have network adapters from the same manufacturer.
7.2.2 CLUSTER SOFTWARE
You will have to build versions of clustering software for each kind of system you include in your cluster.
7.2.3 PROGRAMMING
Our code will have to be written to support the lowest common denominator for data types supported by the least powerful node in our cluster. With mixed machines, the more powerful machines will have capabilities that cannot be exploited, because the less powerful nodes cannot match them.
7.2.4 TIMING
The second kind of heterogeneous cluster is made from different machines in the same architectural family: e.g., a collection of Intel boxes where the machines are different generations, or machines of the same generation from different manufacturers. This can present issues with regard to driver versions, or little quirks that are exhibited by different pieces of hardware.
When considering what hardware to use for your cluster, there are three fundamental questions that you will want to answer before you start. You can build a cluster out of almost any kind of hardware; the factors that will probably play into your decision are a trade-off between time and money.
One of the ways to build a high-performance cluster is to have control over every piece of equipment that goes into the individual nodes. There are five major components of a cluster:
. system board
. CPUs
. disk storage
. network adapters
. enclosure (cases)
CODE SIZE
For large codes, a large portion of the run time is spent on the program logic rather than on processing the data. So it is best to avoid letting a program become an end unto itself, and to try to minimize the program logic so that the code is efficient and as small as possible. This way more of the data can fit into the CPU cache, and the program spends the majority of its time doing productive work.
DATA SIZE
With high-performance computers, the focus is on arranging data so that it can be processed in the most efficient manner possible. In looking at the problems we wish to solve with our parallel Linux cluster, we should look carefully at the data.
I/O
Lastly, we need to consider I/O. How much data do we have? Where is it coming from? Can it all fit into a flat file on a hard disk? These are the decisions that you should explore before spending a lot of money and time on hardware.
SPEED SELECTION
No matter what topology you choose for your cluster, you will want to get the fastest network that your budget allows. Fortunately, the availability of high-speed computers has also driven the development of high-speed networking systems. Examples are 10 Mbit Ethernet, 100 Mbit Ethernet, gigabit networking, channel bonding, and so on.
11. CLUSTER COMPUTING MODELS
This chart shows a simple arrangement of heterogeneous server tasks, all running on a single physical system (in different partitions, with different granularities of system resources allocated to them). One of the major benefits offered by this model is that of convenient and simple systems management: a single point of control. Additionally, this consolidation model offers the benefit of delivering a high quality of service (resources) in a cost-effective manner.
This cluster model expands on the simple load-balancing model shown in the previous chart. Not only does it provide load balancing, it also delivers high availability through redundancy of applications and data. This, of course, requires at least two nodes: a primary and a backup. In this model, the nodes can be active/passive or active/active. In the active/passive scenario, one server does most of the work while the second server spends most of its time on replication work. In the active/active scenario, both servers do primary work and both accomplish replication tasks, so that each server always "looks" just like the other. In both instances, instant failover is achievable should the primary node (or the primary node for a particular application) experience a system or application outage. As with the previous model, this model easily scales up (through application replication) as the overall volume of users and transactions goes up. The scale-up happens through simple application replication, requiring little or no application modification or alteration.
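A load-balancing algorithm of the kind used in these models can be as simple as round-robin. The Python sketch below is a hypothetical illustration (the class and node names are invented for the example, not taken from any product):

```python
import itertools

class RoundRobinBalancer:
    """Distributes incoming requests across application instances in turn."""

    def __init__(self, nodes):
        # Cycle endlessly through the configured nodes.
        self._cycle = itertools.cycle(nodes)

    def next_node(self):
        # Return the node that should handle the next request.
        return next(self._cycle)

lb = RoundRobinBalancer(["node1", "node2"])
print([lb.next_node() for _ in range(4)])  # ['node1', 'node2', 'node1', 'node2']
```

Real balancers also weigh node load and skip failed nodes, which is how this model connects to the failover behavior described above.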
High-performance Parallel Application Cluster (Technical Model)
In this clustering model, extreme vertical scalability is achievable for a single large
computing task. The logic shown here is essentially based on the message passing
interface (MPI) standard. This model would best be applied to scientific and technical
tasks, such as computing artificial intelligence data. In this high performance model, the
application is actually "decomposed" so that segments of its tasks can safely be run in
parallel.
With this clustering model, the number of users (or the number of transactions) can be allocated (via a load-balancing algorithm) across a number of application instances (here, we're showing Web application server (WAS) application instances) so as to increase transaction throughput. This model easily scales up as the overall volume of users and transactions goes up. The scale-up happens through simple application replication only, requiring little or no application modification or alteration.
This clustering model demonstrates the capacity to deliver extreme database scalability
within the commercial application arena. In this environment, "shared nothing" or "shared
disk" might be the requirement "of the day," and can be accommodated. You would
implement this model in commercial parallel database situations, such as DB2 UDB EEE,
Informix XPS or Oracle Parallel Server. As with the technical high-performance model
shown on the previous chart, this high-performance commercial clustering model requires
that the application be "decomposed" so that segments of its tasks can safely be run in
parallel.
12. FUTURE TRENDS - GRID COMPUTING
13. CONCLUSION
Scalable computing clusters, ranging from clusters of (homogeneous or heterogeneous) PCs or workstations to SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing. Message-passing programming is believed to be the most natural approach for helping programmers take advantage of the parallelism of clusters of symmetric multiprocessors (SMPs).
REFERENCES