8.1 Multiprocessors
8.2 Multicomputers
8.3 Distributed systems
Definition: A computer system in which two or more CPUs share full access to a common RAM.
Multiprocessor Hardware
• All multiprocessors have the property that every CPU can address all of memory, but:
– UMA (Uniform Memory Access) multiprocessors have the additional property that every memory word can be read as fast as every other word.
– NUMA (Nonuniform Memory Access) multiprocessors do not have this property.
Multiprocessor Hardware
• UMA multiprocessors using multistage switching networks can be built from 2x2 switches.
Omega Switching Networks
• As the message moves through the switching network, the bits at the left-hand end of the module number are no longer needed. They can be put to good use by recording the incoming line number there, so the reply can find its way back.
• The omega network is a blocking network: not every set of requests can be processed simultaneously (illustrated below).
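A minimal sketch of this destination-tag routing in C; hedged: the 8x8 size and the bit convention (one destination bit consumed per stage, most significant bit first) are assumptions for illustration.

#include <stdio.h>

#define LOGN 3            /* 3 stages of 2x2 switches -> 8 lines (assumed) */
#define N (1 << LOGN)

/* Follow one message from src to dst, printing the line it occupies
 * after each stage. Each stage does a perfect shuffle (rotate the
 * LOGN-bit line number left), then the 2x2 switch sets the low bit
 * to the next bit of the destination module number, MSB first. */
static void route(unsigned src, unsigned dst)
{
    unsigned line = src;
    printf("route %u -> %u:", src, dst);
    for (int stage = 0; stage < LOGN; stage++) {
        line = ((line << 1) | (line >> (LOGN - 1))) & (N - 1); /* shuffle */
        unsigned bit = (dst >> (LOGN - 1 - stage)) & 1;
        line = (line & ~1u) | bit;                  /* switch setting */
        printf(" stage%d=line%u", stage, line);
    }
    printf("\n");
}

int main(void)
{
    route(0, 1);   /* occupies line 0 after the first stage */
    route(4, 3);   /* also needs line 0 after the first stage: the two
                      requests conflict, so the network is blocking */
    return 0;
}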
NUMA Multiprocessors
• The most popular approach for building large CC-NUMA multiprocessors is the directory-based multiprocessor.
– The idea is to maintain a database telling where each cache line is and what its status is (a sketch of an entry follows).
– When a cache line is referenced, the database is queried to find out where it is and whether it is clean or dirty.
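A minimal sketch of what one directory entry might hold, assuming a presence bitmap plus an owner field; the field names and the 64-node limit are illustrative, not from any particular machine.

#include <stdint.h>

enum line_state { UNCACHED, CLEAN, DIRTY };

/* One entry per memory line: which nodes cache it and whether the
 * cached copy is clean or dirty. On a reference, the directory is
 * consulted; a DIRTY line must first be fetched back from 'owner'
 * before the request can be satisfied. */
struct dir_entry {
    enum line_state state;
    uint64_t sharers;   /* bit i set if node i holds a copy */
    uint8_t  owner;     /* node holding the dirty copy, if state == DIRTY */
};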
Multiprocessor OS Types
• Master-slave multiprocessors
Master-Slave Multiprocessors
• All system calls are redirected to CPU 1.
• There is a single data structure that keeps track of ready processes, so it can never happen that one CPU is idle while another is overloaded.
• Pages can be allocated among all the processes dynamically.
• There is only one buffer cache, so inconsistencies never occur.
• The disadvantage is that, with many CPUs, the master becomes a bottleneck.
Multiprocessor OS Types
• Symmetric multiprocessors (SMP)
SMP Model
• Balances processes and memory dynamically, since there is only one set of OS tables.
• If two or more CPUs run OS code at the same time, disaster will result.
• Mutexes may be associated with the critical regions of the OS.
• Each table that may be used by multiple critical regions needs its own mutex.
• Great care must be taken to avoid deadlocks: all the tables can be assigned integer values, and all critical regions acquire tables in increasing order, as in the sketch below.
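A hedged sketch of this lock-ordering rule with POSIX mutexes; the three tables and the example path are illustrative.

#include <pthread.h>

pthread_mutex_t table_lock[3] = {   /* one mutex per numbered table */
    PTHREAD_MUTEX_INITIALIZER,      /* 0: process table */
    PTHREAD_MUTEX_INITIALIZER,      /* 1: page tables   */
    PTHREAD_MUTEX_INITIALIZER,      /* 2: buffer cache  */
};

/* A critical region needing the process table and the buffer cache
 * takes lock 0 before lock 2. Since no path ever takes 2 before 0,
 * a cycle of CPUs waiting on each other cannot form. */
void example_region(void)
{
    pthread_mutex_lock(&table_lock[0]);
    pthread_mutex_lock(&table_lock[2]);
    /* ... update both tables ... */
    pthread_mutex_unlock(&table_lock[2]);
    pthread_mutex_unlock(&table_lock[0]);
}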
Multiprocessor Synchronization
• If we could get rid of all the TSL-induced writes on the requesting side, cache thrashing could be reduced.
– The requesting CPU first does a pure read to see if the lock is free; only if the lock appears to be free does it do a TSL to acquire it. Most of the polls are now reads instead of writes (see the sketch below).
– If the CPU holding the lock only reads the variables in the same cache block, each CPU can have a copy of the cache block in shared read-only mode, eliminating all the cache block transfers.
– When the lock is freed, the owner does a write, invalidating all the other copies.
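A sketch of this read-before-TSL idea in C11 atomics, with atomic_exchange standing in for TSL.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool lock_word = false;

void acquire(void)
{
    for (;;) {
        /* pure read: spins on the locally cached copy, no bus traffic */
        while (atomic_load_explicit(&lock_word, memory_order_relaxed))
            ;
        /* lock looks free: now do the atomic read-modify-write (TSL) */
        if (!atomic_exchange_explicit(&lock_word, true,
                                      memory_order_acquire))
            return;                      /* got it */
    }
}

void release(void)
{
    /* this write invalidates the other CPUs' shared copies */
    atomic_store_explicit(&lock_word, false, memory_order_release);
}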
Multiprocessor Synchronization
• Another way to reduce bus traffic is to use the Ethernet binary exponential backoff algorithm.
• An even better idea is to give each CPU wishing to acquire the mutex its own private lock variable to test (see the sketch below).
– A CPU that fails to acquire the lock allocates a lock variable and attaches itself to the end of a list of CPUs waiting for the lock.
– When the current lock holder exits the critical region, it frees the private lock that the first CPU on the list is testing. This scheme is starvation-free.
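A sketch of such a queue lock in C11 atomics (an MCS-style lock is assumed here; the slides describe the scheme generically). Each CPU spins only on its own node, and release wakes exactly the first waiter, giving FIFO order.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool locked;
};

_Atomic(struct qnode *) tail = NULL;

void q_acquire(struct qnode *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    struct qnode *prev = atomic_exchange(&tail, me); /* join the list */
    if (prev) {
        atomic_store(&prev->next, me);
        while (atomic_load(&me->locked))   /* spin on OUR OWN variable */
            ;
    }
}

void q_release(struct qnode *me)
{
    struct qnode *succ = atomic_load(&me->next);
    if (!succ) {
        struct qnode *expect = me;        /* no successor visible yet */
        if (atomic_compare_exchange_strong(&tail, &expect, NULL))
            return;                       /* list is now empty */
        while (!(succ = atomic_load(&me->next)))
            ;                             /* wait for it to link in */
    }
    atomic_store(&succ->locked, false);   /* free the first waiter only */
}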
Multiprocessor Scheduling
• Timesharing
– note use of single data structure for scheduling
Multiprocessor Scheduling
• Timesharing
– Provides automatic load balancing.
– A disadvantage is the potential contention for the scheduling data structure.
– Also, a context switch may happen while a process holds a spin lock; the other CPUs waiting on the spin lock then waste their time spinning until that process is scheduled again and releases the lock.
• Some systems use smart scheduling, in which a process acquiring a spin lock sets a process-wide flag to show that it currently holds a spin lock (sketched below). The scheduler sees the flag and gives the process more time to complete its critical region.
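A minimal sketch of that flag; the name and the scheduler's check are assumptions, not a specific system's API.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool holding_spinlock = false;  /* the process-wide hint flag */

void smart_acquire(atomic_bool *lk)
{
    atomic_store(&holding_spinlock, true);    /* set hint first */
    while (atomic_exchange(lk, true))
        ;
}

void smart_release(atomic_bool *lk)
{
    atomic_store(lk, false);
    atomic_store(&holding_spinlock, false);   /* safe to preempt again */
}

/* On a timer tick, the scheduler reads holding_spinlock and, if set,
 * grants the process extra time to finish its critical region. */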
Multiprocessor Scheduling
• Timesharing
– When process A has run for a long time on CPU k, CPU k's cache will be full of A's blocks, so if A gets to run again it may perform better on CPU k.
– Some multiprocessors therefore use affinity scheduling.
– To achieve this, two-level scheduling is used:
• When a process is created, it is assigned to a CPU.
• Each CPU uses its own scheduling algorithm and tries to maximize affinity.
Multiprocessor Scheduling
• Space sharing
– scheduling multiple related processes or threads at the same time across multiple CPUs
Multiprocessor Scheduling
• Space Sharing
– The scheduler checks whether there are as many free CPUs as there are related threads. If not, none of the threads is started until enough CPUs are available (see the sketch below).
– Each thread then holds onto its CPU until it terminates, even if it blocks on I/O.
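A toy admission check capturing this all-or-nothing rule; the counter and function names are illustrative.

#include <stdbool.h>

int free_cpus = 8;                /* illustrative CPU count */

bool try_start_gang(int nthreads)
{
    if (nthreads > free_cpus)
        return false;             /* start none; wait for enough CPUs */
    free_cpus -= nthreads;        /* each thread keeps its CPU until exit */
    return true;
}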
Multiprocessor Scheduling
• Space Sharing
– A different approach is for processes to actively manage the degree of parallelism.
– A central server keeps track of which processes are running and want to run, and what their minimum and maximum CPU requirements are.
– Periodically, each process polls the central server to ask how many CPUs it may use.
– It then adjusts the number of processes or threads up or down to match what is available.
Multiprocessor Scheduling
Gang Scheduling
Multicomputers
• Definition:
Tightly-coupled CPUs that do not share
memory
• Also known as
– cluster computers
– clusters of workstations (COWs)
• The basic node consists of a CPU, memory,
a network interface and sometimes a disk.
Multicomputer Hardware
• Interconnection topologies:
(a) single switch
(b) ring
(c) grid (mesh)
(d) double torus
(e) cube
(f) hypercube
Multicomputer Hardware
• Diameter is the distance between the two nodes that are farthest apart (the longest of the shortest paths).
– On a grid, it increases only as the square root of the number of nodes N; on a hypercube, it is log2 N. For example, with N = 64 nodes, an 8 x 8 grid has a diameter of 14 hops, while a hypercube has a diameter of only 6.
– But the fanout, and thus the number of links (and the cost), is much larger for the hypercube.
Multicomputer Hardware
• Switching
– Store-and-forward packet switching
• Packets must be copied many times.
– Circuit switching
• Bits flow in the circuit with no intermediate
buffering after the circuit is set up.
Multicomputer Hardware
• Switching scheme
– store-and-forward packet switching
Low-Level Communication Software
• Kernel copying may also occur.
– If the interface board is mapped into kernel virtual address space, the kernel may have to copy the packets to its own memory both on input and on output.
– To avoid this, the interface boards can be mapped directly into user space, but then a mechanism is needed to avoid race conditions between processes sharing the board, and a process holding the protecting mutex might never release it.
– Hence either there should be just one user process on each node, or special precautions must be taken.
Low-Level Communication Software
• If several processes running on a node need network access to send packets:
– map the interface board into all processes that need it, taking precautions against race conditions.
• If the kernel also needs access to the network:
– suppose a kernel packet arrives while the board is mapped into user space?
• Then use two network boards:
– one mapped into user space, one into the kernel.
Low-Level Communication Software
• How to get packets onto the interface board?
– Use the DMA chip to copy them in from RAM.
• The problem is that DMA uses physical, not virtual, addresses.
• Also, if the OS decides to replace a page while the DMA chip is copying a packet from it, the wrong data will be transmitted.
• Using system calls to pin and unpin pages, marking them as temporarily unpageable, is a solution (sketched below).
– But this is expensive for small packets.
– So using programmed I/O to and from the interface board is usually the safest course, since page faults are then handled in the usual way.
– Alternatively, programmed I/O can be used for small packets and DMA with pinning and unpinning for large ones.
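A hedged sketch of the pin-then-DMA sequence using the POSIX mlock()/munlock() calls; dma_start() is a placeholder, not a real API.

#include <stdio.h>
#include <sys/mman.h>

#define PKT_SIZE 4096   /* assumed large-packet buffer size */

int send_large_packet(void *buf)
{
    if (mlock(buf, PKT_SIZE) != 0) {  /* pin: pages become unpageable */
        perror("mlock");
        return -1;
    }
    /* dma_start(buf, PKT_SIZE);  program the DMA chip and wait for
       completion; the pages cannot be replaced during the transfer */
    munlock(buf, PKT_SIZE);           /* unpin when the DMA is done */
    return 0;
}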
User-Level Communication Software
• “Receive” can also be blocking or nonblocking.
• A blocking call is simple when the receiver has multiple threads.
• For a nonblocking call, an interrupt can be used to signal message arrival, but interrupts are difficult to program.
• Alternatively, the receiver can poll for incoming messages.
• Or the arrival of a message causes a “pop-up thread” to be created in the receiving process' address space.
• Or the receiver code runs directly in the interrupt handler; this is called active messages.
Remote Procedure Call
Send and receive are fundamentally engaged in I/O, and many believe that I/O is the wrong programming model. Hence RPC (Remote Procedure Call) was developed; a sketch of a client stub follows.
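A toy client stub, assuming a hypothetical message layer msg_send()/msg_recv(), to show how the I/O disappears from the caller's view.

struct req  { int opcode; int a, b; };
struct resp { int result; };

extern void msg_send(int server, const void *buf, int len); /* assumed */
extern void msg_recv(int server, void *buf, int len);       /* assumed */

/* The caller just writes result = remote_add(srv, 3, 4); the stub
 * marshals the parameters, does the send/receive, and unmarshals. */
int remote_add(int server, int a, int b)
{
    struct req r = { .opcode = 1 /* ADD */, .a = a, .b = b };
    struct resp ans;
    msg_send(server, &r, sizeof r);
    msg_recv(server, &ans, sizeof ans);   /* block until the reply */
    return ans.result;
}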
Distributed Shared Memory
• One improvement is to replicate read-only pages; replicating read-write pages requires special handling.
Replication: (a) pages distributed on 4 machines
• False sharing
– Process 1 makes heavy use of A while process 2 makes heavy use of B. The page containing both variables will travel back and forth between the two machines (illustrated below).
• The system must also achieve sequential consistency.
– Before a shared page can be written, a message is sent to all other CPUs holding a copy of the page, telling them to unmap and discard the page.
– Another way is to allow a process to acquire a lock on a portion of the virtual address space and then perform multiple reads and writes on the locked memory.
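A small pthreads illustration of false sharing: A and B are logically unrelated, yet they end up on the same page (or cache line), so a DSM system would ping-pong that page between the two machines. The padding (an assumed 4096-byte page) can be uncommented to separate them.

#include <pthread.h>
#include <stdio.h>

struct shared {
    long A;
    /* char pad[4096 - sizeof(long)];  uncomment to give B its own page */
    long B;
} data;

static void *worker_a(void *arg) { (void)arg;
    for (long i = 0; i < 100000000; i++) data.A++;
    return NULL;
}
static void *worker_b(void *arg) { (void)arg;
    for (long i = 0; i < 100000000; i++) data.B++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker_a, NULL);
    pthread_create(&t2, NULL, worker_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("A=%ld B=%ld\n", data.A, data.B);
    return 0;
}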
Multicomputer Scheduling
• Each node has its own memory and its own set of processes.
• However, when a new process is created, a choice can be made about where to place it, so as to balance the load.
• Each node can use any local scheduling algorithm.
• It is also possible to use gang scheduling.
– Some way to coordinate the start of the time slots is required.
Load Balancing
• Graph-theoretic deterministic algorithm
– Each vertex is a process; each arc represents the flow of messages between two processes. Arcs that go from one subgraph to another represent network traffic. The goal is to find the partitioning that minimizes the network traffic while meeting the constraints (a sketch of the traffic computation follows). In (a) above, the total network traffic is 30 units; in partitioning (b) it is 28 units.
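A toy version of the traffic computation: sum the weights of arcs whose endpoints are assigned to different machines. The 4-process traffic matrix and the two assignments below are made up for illustration, not the slides' example.

#include <stdio.h>

#define NPROC 4

int weight[NPROC][NPROC] = {   /* message traffic between process pairs */
    {0, 3, 2, 0},
    {3, 0, 0, 4},
    {2, 0, 0, 1},
    {0, 4, 1, 0},
};

int cut_traffic(const int machine[NPROC])
{
    int total = 0;
    for (int i = 0; i < NPROC; i++)
        for (int j = i + 1; j < NPROC; j++)
            if (machine[i] != machine[j])  /* arc crosses the network */
                total += weight[i][j];
    return total;
}

int main(void)
{
    int p1[NPROC] = {0, 0, 1, 1};   /* processes 0,1 vs 2,3: prints 6 */
    int p2[NPROC] = {0, 1, 0, 1};   /* processes 0,2 vs 1,3: prints 4 */
    printf("partitioning 1: %d units\n", cut_traffic(p1));
    printf("partitioning 2: %d units\n", cut_traffic(p2));
    return 0;
}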
Network Hardware
• Ethernet
(a) classic Ethernet
(b) switched Ethernet
• Collisions may result; Ethernet resolves them with the binary exponential backoff algorithm.
• Bridges connect multiple Ethernets
Network Hardware
• The Internet
– Routers extract the destination address of a packet and look it up in a table to find which outgoing line to send it on.
Document-Based Middleware
• The Web: a big directed graph of documents
• How the browser gets a page (a socket-level sketch follows the steps):
1. Asks DNS for IP address
2. DNS replies with IP address
3. Browser makes connection
4. Sends request for specified page
5. Server sends file
6. TCP connection released
7. Browser displays text
8. Browser fetches, displays images
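A condensed sketch of steps 1-6 with the BSD socket API; example.com is an illustrative host and error handling is abbreviated.

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;

    /* steps 1-2: ask DNS for the IP address, get the reply */
    if (getaddrinfo("example.com", "80", &hints, &res) != 0)
        return 1;

    /* step 3: browser makes the TCP connection */
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    /* step 4: send the request for the specified page */
    const char *req = "GET / HTTP/1.1\r\nHost: example.com\r\n"
                      "Connection: close\r\n\r\n";
    write(fd, req, strlen(req));

    /* step 5: server sends the file; read and display it */
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        fwrite(buf, 1, n, stdout);

    close(fd);            /* step 6: TCP connection released */
    freeaddrinfo(res);
    return 0;
}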
File System-Based Middleware
• Transfer models:
(a) upload/download model
(b) remote access model
File System-Based Middleware
Naming Transparency
(b) Clients have the same view of the file system
(c) Alternatively, clients have different views
File System-Based Middleware
• Remote file systems can be mounted onto
the local file hierarchy
• Naming Transparency
– Location transparency means that the path name gives no hint as to where the file is located. A path like /server1/dir1/dir2/x tells us that x is located on server1, but does not tell where server1 is located.
– A system in which files can be moved without their names changing is said to have location independence.
File System-Based Middleware
Client's view
• AFS – Andrew File System
– workstations grouped into cells
– note position of venus and vice
File System-Based Middleware
• AFS
– The /cmu directory contains the names of the shared remote cells, below which are their respective file systems.
– AFS comes close to session semantics: when a file is opened, it is fetched from the appropriate server and placed in /cache on the workstation's local disk; when the file is closed, it is uploaded back.
– However, when venus downloads a file into its cache, it tells vice whether or not it cares about subsequent opens. If it does, vice records the location of the cached file. If another process opens the file, vice sends a message to venus telling it to mark its cache entry as invalid and return the modified copy.
Shared Object-Based Middleware
Publish-Subscribe architecture