
ADVANCED COMPUTER ARCHITECTURE
PRESENTED BY KRISHNA

Contents
• Vector Architecture
• SIMD Instruction Set Extensions for Multimedia
• Graphics Processing Units (GPUs)
• Comparison between Vector Architectures and GPUs
• Comparison between Multimedia SIMD Computers and GPUs
• Loop-Level Parallelism
• Finding Dependences

Data parallelism
• Data parallelism is a form of parallelization across multiple processors in parallel computing environments.
• It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied to regular data structures such as arrays and matrices by working on each element in parallel, as in the sketch below.

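A minimal C sketch of data parallelism on an array, assuming a compiler with OpenMP support (the function and parameter names are illustrative):

    #include <omp.h>

    /* Scale every element of the array. OpenMP splits the iterations
       across the available cores, so each core works on its own chunk
       of the data in parallel. */
    void scale(float *data, int n, float factor) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            data[i] = data[i] * factor;
    }
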
SIMD
• Single Instruction, Multiple Data
• SIMD is a class of parallel computers in Flynn's taxonomy.
• It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

• Three variations of SIMD:
  • vector architectures
  • multimedia SIMD instruction set extensions
  • graphics processing units (GPUs)

• SIMD architectures can exploit significant data-level parallelism for:
  - matrix-oriented scientific computing
  - media-oriented image and sound processors
• Most modern CPU designs include SIMD instructions in order to improve the performance of multimedia applications.

Vector Architecture
• Vector architectures are easier to understand and to compile to than other SIMD variations, but they were considered too expensive for microprocessors until recently.

Vector Architecture
• Vector architectures grab sets of data
elements scattered about memory, place
them into large, sequential register files,
operate on data in those register files, and
then disperse the results back into memory.

Vector Architecture
• Registers are controlled by the compiler
• Register files act as compiler-controlled buffers
• They are used to hide memory latency and to leverage memory bandwidth

Basic structure of a vector architecture (figure)

Components of vector architecture
• Vector registers
  - Each register holds a 64-element vector, 64 bits per element
  - The register file has 16 read ports and 8 write ports
• Vector functional units
  - Fully pipelined
  - Data and control hazards are detected
• Vector load-store unit
  - Fully pipelined
  - Moves words between the vector registers and memory
  - One word per clock cycle after an initial latency
• Scalar registers
  - 32 general-purpose registers
  - 32 floating-point registers

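A toy C sketch of the register state listed above, assuming the usual VMIPS configuration of eight vector registers (the struct and field names are illustrative, not part of any real ISA definition):

    #define VLEN 64                    /* elements per vector register */

    typedef struct {
        double    vreg[8][VLEN];       /* 8 vector registers, 64 x 64-bit elements each */
        double    fpr[32];             /* 32 scalar floating-point registers */
        long long gpr[32];             /* 32 scalar general-purpose registers */
    } vmips_regfile;
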
How Vector Processors Work: An Example
• Let's take a typical vector problem, Y = a × X + Y
• X and Y are vectors, initially resident in memory, and a is a scalar.
• Here is the VMIPS code for DAXPY:

  L.D      F0,a        ; load scalar a
  LV       V1,Rx       ; load vector X
  MULVS.D  V2,V1,F0    ; vector-scalar multiply
  LV       V3,Ry       ; load vector Y
  ADDVV.D  V4,V2,V3    ; add
  SV       V4,Ry       ; store the result

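For comparison, a plain C sketch of the same DAXPY computation (the function name and the explicit length parameter n are illustrative; the VMIPS code above assumes 64-element vectors):

    /* Y = a*X + Y over n elements (DAXPY). */
    void daxpy(double a, const double *X, double *Y, int n) {
        for (int i = 0; i < n; i++)
            Y[i] = a * X[i] + Y[i];
    }
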
• In the scalar MIPS code: the ADD.D must wait for the MUL.D, and the S.D must wait for the ADD.D.
• In VMIPS: the pipeline stalls only for the first vector element; subsequent elements flow smoothly down the pipeline.
• A pipeline stall is required only once per vector instruction, rather than once per vector element.

Memory banks
• The memory system must be designed to support high bandwidth for vector loads and stores
• Spread accesses across multiple banks
  - Control bank addresses independently
  - Load or store non-sequential words
  - Support multiple vector processors sharing the same memory

• Example (worked out below):
  - 32 processors, each generating 4 loads and 2 stores per cycle
  - Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
  - How many memory banks are needed?

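A worked answer, using the figures given in the example:

    Memory references per cycle:  32 × (4 + 2) = 192
    Bank busy time per access:    15 ns / 2.167 ns ≈ 6.92, rounded up to 7 processor cycles
    Banks needed to keep up:      192 × 7 = 1344 memory banks
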
Vector Execution Time
• Execution time depends on three factors:
  - length of the operand vectors
  - structural hazards
  - data dependences
• VMIPS functional units consume one element per clock cycle
• Execution time is therefore approximately the vector length (in clock cycles)

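For example, with the 64-element vectors used above, a single ADDVV.D completes one element per clock once the pipeline is full, so it takes roughly 64 clock cycles plus the functional unit's start-up latency.
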
Vector Processor Advantages
• Each instruction generates a lot of work
• Reduces instruction fetch bandwidth
• Highly regular memory access pattern
• Interleaving multiple banks for higher memory
bandwidth
• No need to explicitly code loops
• Fewer branches in the instruction sequence

Vector Processor Limitations
• Memory bandwidth can easily become a bottleneck, especially if the compute-to-memory operation balance is not maintained

SIMD Instruction Set Extensions for Multimedia
• Media applications operate on data types narrower than the native word size
  - Graphics systems use 8 bits per primary color
  - Audio samples use 8-16 bits
• By partitioning a 256-bit adder, for example, the hardware can perform:
  - 16 simultaneous operations on 16-bit operands, or
  - 32 simultaneous operations on 8-bit operands (see the sketch below)

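A minimal C sketch of a multimedia SIMD operation, assuming an x86-64 CPU with AVX2 and the standard immintrin.h intrinsics (the function and variable names are illustrative): one 256-bit instruction adds 32 pairs of 8-bit values at once.

    #include <immintrin.h>
    #include <stdint.h>

    /* Add two arrays of 8-bit pixels, 32 elements per 256-bit instruction.
       For brevity, n is assumed to be a multiple of 32. */
    void add_pixels(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n) {
        for (int i = 0; i < n; i += 32) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            __m256i vs = _mm256_add_epi8(va, vb);   /* 32 simultaneous 8-bit adds */
            _mm256_storeu_si256((__m256i *)(dst + i), vs);
        }
    }
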
• In contrast to vector architectures, which offer an elegant instruction set intended to be the target of a vectorizing compiler, SIMD extensions have three major omissions:
  - Multimedia SIMD extensions fix the number of data operands in the opcode; there is no vector length register (see the strip-mining sketch below)
  - They offer no sophisticated addressing modes, such as strided or gather-scatter accesses
  - They offer no mask registers

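Because the opcode fixes how many elements one instruction processes, loops must be strip-mined by hand and finished with a scalar cleanup. A C sketch, again assuming AVX (immintrin.h) and illustrative names:

    #include <immintrin.h>

    /* dst[i] = a[i] + b[i]; 8 floats per 256-bit operation, scalar cleanup for the rest. */
    void add_arrays(float *dst, const float *a, const float *b, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; i++)          /* leftover elements, handled one at a time */
            dst[i] = a[i] + b[i];
    }

A vector architecture with a vector-length register could instead set the length for the final pass and avoid the scalar cleanup loop.
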
• The goal of these SIMD extensions has been to
accelerate carefully written libraries rather
than for the compiler to generate them

Why are multimedia SIMD extensions popular?
• They cost little to add to the standard arithmetic unit
• They are easy to implement
• They need less memory bandwidth than a vector architecture
• They transfer short, aligned blocks of data to and from memory
• They use much less register space
• They have fewer operands per instruction
• They need none of the sophisticated mechanisms of vector architectures

GPU
What is a GPU?
• It is a processor optimized for 2D/3D graphics, video, visual computing, and display.
• It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing.
• It provides real-time visual interaction with computed objects via graphics, images, and video.
• It serves as both a programmable graphics processor and a scalable parallel computing platform.
• Heterogeneous systems combine a GPU with a CPU.

GPU Evolution
• 1980s – No GPU; PCs used a VGA controller
• 1990s – More functions added to the VGA controller
• 1997 – 3D acceleration functions: hardware for triangle setup and rasterization, texture mapping, and shading
• 2000 – A single-chip graphics processor (the term "GPU" is coined)
• 2005 – Massively parallel programmable processors
• 2007 – CUDA (Compute Unified Device Architecture)

Similarities and Differences between Vector Architectures and GPUs
• Both architectures are designed to execute data-level parallel programs.
• The VMIPS register file holds entire vectors, that is, a contiguous block of 64 doubles. In contrast, a single vector in a GPU is distributed across the registers of all the SIMD lanes.
• Address calculations are implicit in a vector architecture (e.g., unit-stride loads and stores), but they are explicit in GPUs, where each SIMD lane computes its own addresses (gathers and scatters).

• With respect to conditional branch instructions, both architectures implement them using mask registers.
• GPUs hide memory latency using hardware multithreading, whereas vector architectures amortize it across a deeply pipelined vector memory access.
• The control processor in a vector architecture broadcasts operations to all the vector lanes; GPUs have no comparable control processor.

Similarities and Differences between
Multimedia SIMD Computers and
GPUs
• Both are multiprocessors whose processors
use multiple SIMD lanes, although GPUs have
more processors and many more lanes.
• Both use hardware multithreading to improve
processor utilization, although GPUs have
hardware support for many more threads

Similarities
• Both have similar performance ratios between single-precision and double-precision floating-point arithmetic.
• Both use caches, although GPUs use smaller streaming caches and multicore computers use large multilevel caches that try to contain whole working sets completely.
• Both use a 64-bit address space, although the physical main memory is much smaller in GPUs.

Loop Level Parallelism
• Loop-level parallelism is normally analyzed at
the source level or close to it, while most
analysis of ILP is done once instructions have
been generated by the compiler.
• Loop-level analysis involves determining what
dependences exist among the operands in a
loop across the iterations of that loop.

• The analysis of loop-level parallelism focuses
on determining whether data accesses in later
iterations are dependent on data values
produced in earlier iterations.
• Such dependence is called a loop-carried
dependence.

• Example:
  for (i = 1; i <= 1000; i++)
      x[i] = x[i] + s;
• The computation in each iteration is independent of the previous iterations, so the loop is parallel. The use of x[i] twice is within a single iteration.
• Thus the loop iterations are parallel (independent of each other).

for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
• S1 uses the value B[i] computed by S2 in the
previous iteration (loop-carried dependence)

This dependence is not circular, because S1 depends on S2 but S2 does not depend on S1.

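Because neither statement depends on itself across iterations, the loop can be restructured so that no iteration depends on another. A sketch of the standard transformation, which peels the first S1 and the last S2 out of the loop:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

The dependence between the two statements is now entirely within an iteration, so the 99 remaining loop iterations can run in parallel.
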
Finding Dependences
• Finding dependences in a program is very important for renaming and for executing instructions in parallel.
• Arrays and pointers make finding dependences very difficult.
• Assume array indices are affine, which means they have the form a × i + b, where a and b are constants.
• The GCD test can be used to detect dependences.

GCD test
• Suppose a loop stores into an array element indexed by a × i + b and loads from an element of the same array indexed by c × i + d.
• If a loop-carried dependence exists, then GCD(c, a) must divide (d − b).
• The converse does not hold: if GCD(c, a) does divide (d − b), a dependence may still not exist, because the test ignores the loop bounds.

Example
for (i = 1; i <= 100; i = i + 1) {
    x[2*i+3] = x[2*i] * 5.0;
}
a = 2, b = 3, c = 2, d = 0
GCD(a, c) = 2
d − b = −3
2 does not divide −3, so a dependence is not possible.

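A minimal C sketch of the GCD test, assuming affine indices as above (the function names are illustrative):

    #include <stdio.h>

    /* Euclid's algorithm for the greatest common divisor. */
    static int gcd(int x, int y) {
        while (y != 0) {
            int t = x % y;
            x = y;
            y = t;
        }
        return x;
    }

    /* Returns 1 if a loop-carried dependence is possible between a store to
       X[a*i + b] and a load from X[c*i + d]; 0 means no dependence is possible. */
    static int gcd_test(int a, int b, int c, int d) {
        return (d - b) % gcd(c, a) == 0;
    }

    int main(void) {
        /* The example above: a = 2, b = 3, c = 2, d = 0. */
        printf("dependence possible? %d\n", gcd_test(2, 3, 2, 0));   /* prints 0 */
        return 0;
    }
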
