
Introduction:

Modern computer architecture

The stored program computer and its inherent bottlenecks


Multi- and manycore chips and nodes
Introduction: Moore’s law

Intel Sandy Bridge EP: 2.3 billion


Nvidia Kepler: 7 billion
Intel Broadwell: 7.2 billion
Nvidia Pascal: 15 billion

1965: G. Moore claimed that the number of transistors on a “microchip” doubles every 12-24 months
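As a rough plausibility check of the doubling claim, the two Intel transistor counts quoted on this slide can be turned into an implied doubling period. This is only a sketch: the 48-month interval is an assumption based on the launch quarters (Q1/2012 and Q1/2016) listed for Sandy Bridge and Broadwell in the performance table later in this deck, and `doubling_period_months` is a hypothetical helper name.

```python
import math


def doubling_period_months(count_start, count_end, months_between):
    """Months per doubling implied by two transistor counts."""
    doublings = math.log2(count_end / count_start)
    return months_between / doublings


# Transistor counts from the slide; the 48-month gap is an assumption
# (Q1/2012 to Q1/2016, taken from the launch dates later in the deck).
sandy_bridge_ep = 2.3e9
broadwell = 7.2e9
period = doubling_period_months(sandy_bridge_ep, broadwell, 48)
print(f"implied doubling period: {period:.1f} months")
```

Under these assumptions the implied period lies somewhat above the 12-24 month range for this particular pair of chips.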

(c) RRZE 2016 Basic Architecture 2


Multi-core today: Intel Xeon 2600v3 (2014)

 Xeon E5-2600v3 “Haswell EP”:
    Up to 18 cores running at 2+ GHz (+ “Turbo Mode”: 3.5+ GHz)
    Simultaneous Multithreading → reports as 36-way chip
    5.7 billion transistors @ 22 nm
    Die size: 662 mm²
    Optional: “Cluster on Die” (CoD) mode

[Figure: 2-socket server]



A deeper dive into core and chip architecture
General-purpose cache-based microprocessor core

 Implements “Stored Program Computer” concept (Turing 1936)
 Similar designs on all modern systems
 (Still) multiple potential bottlenecks
 Flexible!

[Figure: stored-program computer and modern CPU core]



Basic resources on a stored program computer
Instruction execution and data movement

1. Instruction execution
This is the primary resource of the processor. All efforts in hardware design
are targeted towards increasing the instruction throughput.

Instructions are the concept of “work” as seen by processor designers.


Not all instructions count as “work” as seen by application developers!

Example: Adding two arrays A(:) and B(:)

do i=1, N
   A(i) = A(i) + B(i)
enddo

Processor work:
   LOAD r1 = A(i)
   LOAD r2 = B(i)
   ADD r1 = r1 + r2
   STORE A(i) = r1
   INCREMENT i
   BRANCH → top if i<N

User work:
   N Flops (ADDs)
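The gap between processor work and user work can be made explicit by counting. The sketch below uses the six-instruction loop body listed above; `work_summary` is a hypothetical helper, and real compilers may emit a different instruction mix.

```python
# Instructions issued per loop iteration, as listed on the slide.
instructions_per_iteration = ["LOAD", "LOAD", "ADD", "STORE", "INCREMENT", "BRANCH"]


def work_summary(n_iterations):
    """Return (processor work in instructions, user work in flops)."""
    total_instructions = n_iterations * len(instructions_per_iteration)
    # Only the ADD counts as a floating-point operation for the user.
    flops = n_iterations * instructions_per_iteration.count("ADD")
    return total_instructions, flops


instructions, flops = work_summary(1000)
print(instructions, flops, flops / instructions)
```

In this model only one in six executed instructions is "work" from the application developer's point of view.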



Basic resources on a stored program computer
Instruction execution and data movement

2. Data transfer
Data transfers are a consequence of instruction execution and therefore a
secondary resource. Maximum bandwidth is determined by the request rate
of executed instructions and technical limitations (bus width, speed).

Example: Adding two arrays A(:) and B(:)

do i=1, N
   A(i) = A(i) + B(i)
enddo

Data transfers (per iteration):
   8 byte: LOAD r1 = A(i)
   8 byte: LOAD r2 = B(i)
   8 byte: STORE A(i) = r1
   Sum: 24 byte

Crucial question: What is the bottleneck?


 Data transfer?
 Code execution?
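One way to approach the question is to compare two lower bounds on the loop runtime: the time to move 24 bytes per iteration and the time to execute one flop per iteration. The sketch below is only illustrative. The hardware numbers (62.5 GB/s memory bandwidth, 35.2 GFlop/s core peak) are assumptions borrowed from the Broadwell figures later in this deck, all data is assumed to come from main memory, and the non-floating-point instructions are ignored.

```python
def loop_time_estimates(n, bytes_per_iteration, flops_per_iteration,
                        bandwidth_bytes_per_s, peak_flops_per_s):
    """Return (transfer_time, execution_time) in seconds for n iterations."""
    transfer_time = n * bytes_per_iteration / bandwidth_bytes_per_s
    execution_time = n * flops_per_iteration / peak_flops_per_s
    return transfer_time, execution_time


def bottleneck(transfer_time, execution_time):
    """Name the resource with the larger time bound."""
    return "data transfer" if transfer_time > execution_time else "code execution"


# 24 bytes and 1 flop per iteration come from the slide. The hardware
# numbers are illustrative assumptions (see the lead-in above).
t_data, t_exec = loop_time_estimates(10**8, 24, 1, 62.5e9, 35.2e9)
print(bottleneck(t_data, t_exec))
```

With these assumed numbers the data transfer bound is more than ten times larger than the execution bound, so the loop would be limited by data transfer.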
From high level code to actual execution

sum=0.d0
do i=1, N
sum=sum + A(i)
enddo

Compiler (naive)

[Figure: generated assembly code, annotated as follows]
 Base address of A(i)
 sum in register xmm1
 ADDSD: Add 1st argument to 2nd argument and store result in 2nd argument
 Register increment
 Compare register content
 i in register rax
 N in register rdx
 Jump to label if loop continues



Microprocessors – Pipelining
Pipelining of arithmetic/functional units

 Idea:
    Split complex instruction into several simple / fast steps (stages)
    Each step takes the same amount of time, e.g., a single cycle
    Execute different steps on different instructions at the same time (in parallel)

 Allows for shorter cycle times (simpler logic circuits), e.g.:
    Floating point multiplication takes 5 cycles, but
    the processor can work on 5 different multiplications simultaneously
    One result per cycle after the pipeline is full

 Drawbacks:
    Pipeline must be filled: startup time (efficient only if #instructions >> pipeline stages)
    Efficient use of pipelines requires a large number of independent instructions → instruction level parallelism
    Requires complex instruction scheduling by compiler/hardware: software pipelining / out-of-order execution

 Pipelining is widely used in modern computer architectures
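The startup cost can be quantified with a simple cycle count. The sketch below assumes an idealized pipeline with fully independent operations and no stalls; both function names are hypothetical.

```python
def pipelined_cycles(n_operations, depth):
    """Cycles to push n independent operations through a pipeline of given depth."""
    if n_operations == 0:
        return 0
    # 'depth' cycles until the first result, then one result per cycle.
    return depth + (n_operations - 1)


def unpipelined_cycles(n_operations, depth):
    """Cycles if each operation must finish before the next one starts."""
    return n_operations * depth


for n in (1, 5, 100, 10000):
    speedup = unpipelined_cycles(n, 5) / pipelined_cycles(n, 5)
    print(n, pipelined_cycles(n, 5), round(speedup, 2))
```

For a 5-stage pipeline the speedup approaches 5 only when the number of operations is much larger than the number of stages, which is the condition stated in the drawbacks above.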



5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N

First result is available after 5 cycles (=latency of pipeline)!


Wind-up/-down phases: Empty pipeline stages
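The wind-up and wind-down phases can be visualized by listing which operand index occupies each stage in every cycle. This is a minimal sketch of an idealized pipeline; `pipeline_occupancy` is a hypothetical helper and the choice of N=7 is arbitrary.

```python
def pipeline_occupancy(n_operations, depth):
    """For each cycle, list which operand index i sits in each stage (None = empty)."""
    total_cycles = n_operations + depth - 1
    table = []
    for cycle in range(1, total_cycles + 1):
        row = []
        for stage in range(1, depth + 1):
            i = cycle - stage + 1  # operation i enters stage 1 in cycle i
            row.append(i if 1 <= i <= n_operations else None)
        table.append(row)
    return table


for cycle, row in enumerate(pipeline_occupancy(n_operations=7, depth=5), start=1):
    cells = " ".join(f"{'-' if i is None else i:>2}" for i in row)
    print(f"cycle {cycle:2d}: {cells}")
```

Rows with empty stages (printed as `-`) at the beginning and end correspond to the wind-up and wind-down phases; operation 1 reaches the last stage in cycle 5.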
Pipelining: The Instruction pipeline

 Besides arithmetic & functional units, instruction execution itself is also pipelined, e.g., one instruction performs at least 3 steps:
   Fetch instruction from L1I → Decode instruction → Execute instruction

Cycle (t)   Fetch from L1I   Decode          Execute
1           Instruction 1
2           Instruction 2    Instruction 1
3           Instruction 3    Instruction 2   Instruction 1
4           Instruction 4    Instruction 3   Instruction 2

 Branches can stall this pipeline! (Speculative Execution, Predication)
 Each unit is pipelined itself (e.g., Execute = Multiply Pipeline)



Microprocessors – Superscalarity and Simultaneous Multithreading
Superscalar Processors – Instruction Level Parallelism

 Multiple units enable use of Instruction Level Parallelism (ILP): the instruction stream is “parallelized” on the fly

4-way “superscalar”:

Cycle (t)   Fetch from L1I        Decode               Execute
1           Instructions 1-4
2           Instructions 5-8      Instructions 1-4
3           Instructions 9-12     Instructions 5-8     Instructions 1-4
4           Instructions 13-16    Instructions 9-12    Instructions 5-8

 Issuing m concurrent instructions per cycle: m-way superscalar

 Modern processors are 3- to 6-way superscalar and can perform 2 floating point instructions per cycle



Superscalar processors –
executing multiple instructions concurrently
Instruction execution units: LOAD (Latency: 4 cy), ADD (Latency: 3 cy), STORE (Latency: 2 cy)

for(int i=1; i<n; ++i)
    a[i] = a[i] + c;

Cycle      LOAD         ADD                  STORE
Cycle 1    load a[1]
Cycle 2    load a[2]
Cycle 3    load a[3]
Cycle 4    load a[4]
Cycle 5    load a[5]    add a[1]=c,a[1]
Cycle 6    load a[6]    add a[2]=c,a[2]
Cycle 7    load a[7]    add a[3]=c,a[3]
Cycle 8    load a[8]    add a[4]=c,a[4]      store a[1]
Cycle 9    load a[9]    add a[5]=c,a[5]      store a[2]
Cycle 10   load a[10]   add a[6]=c,a[6]      store a[3]
Cycle 11   load a[11]   add a[7]=c,a[7]      store a[4]
Cycle 12   load a[12]   add a[8]=c,a[8]      store a[5]
Cycle 13   load a[13]   add a[9]=c,a[9]      store a[6]
Cycle 14   load a[14]   add a[10]=c,a[10]    store a[7]
Cycle 15   load a[15]   add a[11]=c,a[11]    store a[8]
Cycle 16   load a[16]   add a[12]=c,a[12]    store a[9]
…          …            …                    …

“Steady state”: 3 instructions/cy (“3-way superscalar”)
Instructions Per Cycle: IPC=3
Cycles Per Instruction: CPI=0.33

Correct interleaving / reordering of the instruction streams: Out-Of-Order (OOO) execution
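The cycle table can be reproduced from the latencies alone. The sketch below assumes an idealized core that issues one load per cycle and starts each dependent instruction as soon as its input is available; `schedule` is a hypothetical helper, and the STORE latency does not affect the issue cycles shown here.

```python
LOAD_LATENCY = 4  # cycles, from the slide
ADD_LATENCY = 3   # cycles, from the slide


def schedule(n_elements):
    """Map cycle -> {unit: instruction} for an idealized 3-way superscalar core."""
    issued = {}
    for i in range(1, n_elements + 1):
        load_cycle = i                          # one load issued per cycle
        add_cycle = load_cycle + LOAD_LATENCY   # operand must have arrived
        store_cycle = add_cycle + ADD_LATENCY   # sum must be complete
        issued.setdefault(load_cycle, {})["LOAD"] = f"load a[{i}]"
        issued.setdefault(add_cycle, {})["ADD"] = f"add a[{i}]"
        issued.setdefault(store_cycle, {})["STORE"] = f"store a[{i}]"
    return issued


plan = schedule(16)
for cycle in range(1, 17):
    row = [plan[cycle].get(unit, "") for unit in ("LOAD", "ADD", "STORE")]
    print(f"Cycle {cycle:2d}  " + "  ".join(f"{cell:<12}" for cell in row))

steady_state_ipc = len(plan[8])  # all three units are busy from cycle 8 on
```

From cycle 8 onward every cycle issues one load, one add, and one store, which corresponds to the steady state of 3 instructions per cycle.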



Core details: Simultaneous multi-threading (SMT)

SMT principle (2-way example):

[Figure: standard core vs. 2-way SMT core]



Microprocessors – Single Instruction Multiple Data (SIMD) a.k.a. vectorization
Core details: SIMD processing

 Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on “wide” registers
 x86 SIMD instruction sets:
    SSE: register width = 128 Bit → 2 double precision floating point operands
    AVX: register width = 256 Bit → 4 double precision floating point operands
 Adding two registers holding double precision floating point operands:
    Scalar execution (64 Bit): ADD [R0,R1] → R2 computes C[0] = A[0] + B[0]
    SIMD execution (256 Bit): V64ADD [R0,R1] → R2 computes C[0..3] = A[0..3] + B[0..3]
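The effect of register width on the instruction count follows directly from the operand size. The sketch below assumes 64-bit double precision operands and counts only the ADD instructions; `lanes` and `add_instructions` are hypothetical helper names.

```python
import math

DOUBLE_BITS = 64


def lanes(register_bits, operand_bits=DOUBLE_BITS):
    """Number of operands that fit into one SIMD register."""
    return register_bits // operand_bits


def add_instructions(n_elements, register_bits):
    """ADD instructions needed to sum two arrays of n_elements doubles."""
    return math.ceil(n_elements / lanes(register_bits))


for name, bits in (("scalar", 64), ("SSE", 128), ("AVX", 256)):
    print(name, lanes(bits), add_instructions(1000, bits))
```

In this model AVX needs a quarter of the ADD instructions of scalar code for the same array length.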



Microprocessors – Memory Hierarchy
Von Neumann bottleneck reloaded: “DRAM gap”

DP peak performance and peak main memory bandwidth for a single Intel processor (chip)

[Figure: DP peak performance vs. peak main memory bandwidth; the ratio is approx. 10 F/B]

Main memory access speed not sufficient to keep CPU busy…

 Introduce fast on-chip caches, holding copies of recently used data items
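The order of magnitude of the flops-per-byte ratio can be checked with numbers that appear elsewhere in this deck. The sketch below is a rough estimate under stated assumptions: it uses the Broadwell column of the comparison table (18 cores, 73.6 GFlop/s SP per core, 62.5 GB/s per socket), assumes DP peak is half of SP peak, and uses the measured STREAM bandwidth rather than the peak bandwidth shown in the figure.

```python
def machine_balance(peak_flops_per_s, bandwidth_bytes_per_s):
    """Flops the chip could execute per byte delivered from main memory."""
    return peak_flops_per_s / bandwidth_bytes_per_s


# Assumed inputs from the Broadwell column later in this deck.
cores = 18
dp_peak_per_core = 73.6e9 / 2   # assumption: DP peak is half of SP peak
chip_peak = cores * dp_peak_per_core
balance = machine_balance(chip_peak, 62.5e9)
print(f"{balance:.1f} F/B")
```

A loop such as the array addition earlier, with 1 flop per 24 bytes, is far below this ratio, which is why caches matter.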



Registers and caches: Data transfers in a memory hierarchy

 Caches help with getting instructions and data to the CPU “fast”
 How does data travel from memory to the CPU and back?
 Remember: Caches are organized in cache lines (e.g., 64 bytes)
 Only complete cache lines are transferred between memory hierarchy levels (except registers)
 MISS: Load or store instruction does not find the data in a cache level → CL transfer required

 Example: Array copy A(:)=C(:)
    LD C(1): MISS → CL of C(:) is transferred from memory to cache
    ST A(1): MISS → write allocate: CL of A(:) is transferred from memory to cache
    LD C(2..Ncl), ST A(2..Ncl): HIT
    Modified CL of A(:) is evicted to memory (delayed)
    → 3 CL transfers between memory and cache
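The three cache line transfers translate into memory traffic per array element as follows. The sketch assumes 8-byte (double precision) elements, cache-line-aligned arrays, and a write-allocate cache as in the example; `copy_traffic_bytes` is a hypothetical helper.

```python
CACHE_LINE_BYTES = 64   # example size from the slide
ELEMENT_BYTES = 8       # double precision, an assumption


def copy_traffic_bytes(n_elements):
    """Memory traffic for A(:)=C(:) with write-allocate, counted in whole cache lines."""
    elements_per_line = CACHE_LINE_BYTES // ELEMENT_BYTES
    lines_per_array = -(-n_elements // elements_per_line)  # ceiling division
    # Per cache line of A: load C, write-allocate A, evict A.
    transfers = 3 * lines_per_array
    return transfers * CACHE_LINE_BYTES


n = 1_000_000
total = copy_traffic_bytes(n)
print(total / n, "bytes per element")
```

The copy therefore moves 24 bytes per element rather than the 16 bytes one might expect from one load and one store.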



From UMA to ccNUMA
Basic architecture of commodity multi-socket nodes
Yesterday (2006): Dual-socket Intel “Core2” node

Uniform Memory Architecture (UMA)
Flat memory; symmetric MPs
But: system “anisotropy”

Today: Dual-socket node

Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
HT / QPI provide scalable bandwidth at the price of ccNUMA architectures: Where does my data finally end up?

Intel Cluster on Die (CoD) → ccNUMA within a socket! (AMD has been there long ago…)
Cray XC30 “SandyBridge-EP” 8-core dual socket node

 8 cores per socket, 2.7 GHz (3.5 GHz @ turbo)
 DDR3 memory interface with 4 channels per chip
 Two-way SMT
 Two 256-bit SIMD FP units
    SSE4.2, AVX
 32 kB L1 data cache per core
 256 kB L2 cache per core
 20 MB L3 cache per chip



There is no single driving force for single core performance!

Maximum floating point (FP) performance:

$P_\mathrm{core} = n_\mathrm{super}^\mathrm{FP} \cdot n_\mathrm{FMA} \cdot n_\mathrm{SIMD} \cdot f$

with $n_\mathrm{super}^\mathrm{FP}$: superscalarity, $n_\mathrm{FMA}$: FMA factor, $n_\mathrm{SIMD}$: SIMD factor, $f$: clock speed

Typical           n_super^FP   n_FMA   n_SIMD        Launch    Code         f [GHz]   P_core
representatives   [inst./cy]           [ops/inst.]                                    [GF/s]
Nehalem           2            1       2             Q1/2009   X5570        2.93      11.7
Westmere          2            1       2             Q1/2010   X5650        2.66      10.6
Sandy Bridge      2            1       4             Q1/2012   E5-2680      2.7       21.6
Ivy Bridge        2            1       4             Q3/2013   E5-2660 v2   2.2       17.6
Haswell           2            2       4             Q3/2014   E5-2695 v3   2.3       36.8
Broadwell         2            2       4             Q1/2016   E5-2699 v4   2.2       35.2
IBM POWER8        2            2       2             Q2/2014   S822LC       2.93      23.4
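The last column of the table can be recomputed from the four factors. The sketch below simply applies the product formula to each row; `core_peak_gflops` is a hypothetical helper name.

```python
def core_peak_gflops(n_super, n_fma, n_simd, f_ghz):
    """P_core = n_super * n_FMA * n_SIMD * f, in GFlop/s."""
    return n_super * n_fma * n_simd * f_ghz


# (n_super, n_FMA, n_SIMD, f [GHz]) as listed in the table
table = {
    "Nehalem":      (2, 1, 2, 2.93),
    "Westmere":     (2, 1, 2, 2.66),
    "Sandy Bridge": (2, 1, 4, 2.7),
    "Ivy Bridge":   (2, 1, 4, 2.2),
    "Haswell":      (2, 2, 4, 2.3),
    "Broadwell":    (2, 2, 4, 2.2),
    "IBM POWER8":   (2, 2, 2, 2.93),
}
for name, factors in table.items():
    print(f"{name:13s} {core_peak_gflops(*factors):5.1f} GF/s")
```

The jump from Ivy Bridge to Haswell comes from the FMA factor alone, at nearly unchanged clock speed.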
Interlude: A glance at current accelerator technology

NVidia “Pascal” GP100 vs. Intel Xeon Phi “Knights Landing”
NVidia Pascal GP100 block diagram

Architecture
 15.3 B Transistors
 ~ 1.4 GHz clock speed
 Up to 60 “SM” units
    64 (SP) “cores” each
 5.7 TFlop/s DP peak
 4 MB L2 Cache
 4096-bit HBM2
 MemBW ~ 732 GB/s (theoretical)
 MemBW ~ 510 GB/s (measured)
 2:1 SP:DP performance

[Block diagram © NVIDIA Corp.]



Intel Xeon Phi “Knights Landing” block diagram
[Block diagram: max. 36 tiles (72 cores); each tile holds 2 cores (P) with 4 hardware threads (T), 2 VPUs and 32 KiB L1 each, plus a shared 1 MiB L2; MCDRAM and DDR4 memory attached]

Architecture
 8 B Transistors
 Up to 1.5 GHz clock speed
 Up to 2x36 cores (2D mesh)
    2x 512-bit SIMD units each
    4-way SMT
 3.5 TFlop/s DP peak (SP 2x)
 36 MiB L2 Cache
 16 GiB MCDRAM
    MemBW ~ 470 GB/s (measured)
 Large DDR4 main memory
    MemBW ~ 90 GB/s (measured)



Trading single thread performance for parallelism: GPGPUs vs. CPUs

GPU vs. CPU light speed estimate (per device):
   MemBW ~ 5-10x
   Peak ~ 6-15x

                      2x Intel Xeon E5-2697v4   Intel Xeon Phi 7250   NVidia Tesla P100
                      “Broadwell”               “Knights Landing”     “Pascal”
Cores@Clock           2 x 18 @ ≥2.3 GHz         68 @ 1.4 GHz          56 SMs @ ~1.3 GHz
SP Performance/core   ≥73.6 GFlop/s             89.6 GFlop/s          ~166 GFlop/s
Threads@STREAM        ~8                        ~40                   >8000?
SP peak               ≥2.6 TFlop/s              6.1 TFlop/s           ~9.3 TFlop/s
Stream BW (meas.)     2 x 62.5 GB/s             450 GB/s (HBM)        510 GB/s
Transistors / TDP     ~2x7 Billion / 2x145 W    8 Billion / 215 W     15.3 Billion / 300 W
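The SP peak row and the quoted per-device ratios can be cross-checked against the other rows. The sketch below only multiplies and divides numbers from the table; treating one CPU socket as half of the dual-socket column is an assumption, and `sp_peak_tflops` is a hypothetical helper.

```python
def sp_peak_tflops(units, gflops_per_unit):
    """Aggregate SP peak in TFlop/s from the per-core (or per-SM) figure."""
    return units * gflops_per_unit / 1000.0


# Values copied from the table: (cores or SMs, SP GFlop/s per unit, quoted SP peak in TFlop/s)
devices = {
    "2x Xeon E5-2697v4": (2 * 18, 73.6, 2.6),
    "Xeon Phi 7250": (68, 89.6, 6.1),
    "Tesla P100": (56, 166.0, 9.3),
}
for name, (units, per_unit, quoted) in devices.items():
    print(f"{name:18s} computed {sp_peak_tflops(units, per_unit):.2f}, quoted {quoted}")

# Per-device ratios: GPU vs. one CPU socket (half of the dual-socket column).
peak_ratio = 9.3 / (2.6 / 2)
bandwidth_ratio = 510 / 62.5
print(f"peak ratio {peak_ratio:.1f}x, bandwidth ratio {bandwidth_ratio:.1f}x")
```

Both ratios fall inside the ranges quoted in the light speed estimate (peak 6-15x, memory bandwidth 5-10x).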



Node topology and programming models
Parallelism in a modern compute node

 Parallel and shared resources within a shared-memory node

[Figure: compute node with GPU #1 and GPU #2 attached via PCIe links, plus other I/O; the numbers 1-10 mark the resources listed below]

Parallel resources:
 Execution/SIMD units (1)
 Cores (2)
 Inner cache levels (3)
 Sockets / ccNUMA domains (4)
 Multiple accelerators (5)

Shared resources:
 Outer cache level per socket (6)
 Memory bus per socket (7)
 Intersocket link (8)
 PCIe bus(es) (9)
 Other I/O resources (10)

How does your application react to all of those details?



Scalable and saturating behavior

 Clearly distinguish between “saturating” and “scalable” performance on the chip level
    Shared resources may show saturating performance
    Parallel resources show scalable performance
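The two behaviors can be contrasted with a very simple model. The sketch below assumes perfectly linear scaling up to a hard limit imposed by the shared resource; the per-core rate of 2.0 and the limit of 7.0 are arbitrary illustrative numbers, and real saturation curves are usually smoother.

```python
def scalable_performance(n_cores, per_core):
    """Parallel resource: performance grows linearly with the core count."""
    return n_cores * per_core


def saturating_performance(n_cores, per_core, shared_limit):
    """Shared resource: performance grows until the shared limit is reached."""
    return min(n_cores * per_core, shared_limit)


# Illustrative assumptions: each core alone sustains 2.0 units of work per
# second, and the shared resource (e.g., the memory bus) caps the chip at 7.0.
for n in range(1, 9):
    print(n, scalable_performance(n, 2.0), saturating_performance(n, 2.0, 7.0))
```

In this model, adding cores beyond the fourth brings no further gain for code bound by the shared resource.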



Parallel programming models on modern compute nodes

 Shared-memory (intra-node)
    Good old MPI
    OpenMP
    POSIX threads
    Intel Threading Building Blocks (TBB)
    Cilk+, OpenCL, StarSs,… you name it
 Distributed-memory (inter-node)
    MPI
    PVM (gone)
 Hybrid
    Pure MPI
    MPI+OpenMP
    MPI + any shared-memory model
    MPI (+OpenMP) + CUDA/OpenCL/…

All models require awareness of topology and affinity issues for getting best performance out of the machine!



Parallel programming models: Pure MPI

 Machine structure is invisible to user:
    → Very simple programming model
    → MPI “knows what to do”!?
 Performance issues
    Intranode vs. internode MPI
    Node/system topology



Parallel programming models: Pure threading on the node

 Machine structure is invisible to user
    → Very simple programming model
    Threading SW (OpenMP, pthreads, TBB,…) should know about the details
 Performance issues
    Synchronization overhead
    Memory access
    Node topology



Parallel programming models: Lots of choices
Hybrid MPI+OpenMP on a multicore multisocket cluster

One MPI process / node

One MPI process / socket: OpenMP threads on same socket (“blockwise”)

OpenMP threads pinned “round robin” across cores in node

Two MPI processes / socket: OpenMP threads on same socket
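The placement variants above differ only in which core IDs each process hands to its threads. The sketch below generates such core lists for a hypothetical node with 2 sockets of 8 cores each, assuming core IDs 0-7 belong to socket 0 and 8-15 to socket 1; actual core numbering differs between machines, and both function names are made up for illustration.

```python
def blockwise(n_sockets, cores_per_socket, processes_per_socket):
    """One core list per MPI process; threads of a process stay on one socket."""
    block = cores_per_socket // processes_per_socket
    lists = []
    for socket in range(n_sockets):
        for p in range(processes_per_socket):
            first = socket * cores_per_socket + p * block
            lists.append(list(range(first, first + block)))
    return lists


def round_robin(n_sockets, cores_per_socket):
    """Core list for one process per node; consecutive threads alternate sockets."""
    order = []
    for slot in range(cores_per_socket):
        for socket in range(n_sockets):
            order.append(socket * cores_per_socket + slot)
    return order


# Assumed node: 2 sockets x 8 cores, core IDs 0-7 on socket 0 and 8-15 on socket 1.
print(blockwise(2, 8, 1))   # one MPI process per socket
print(blockwise(2, 8, 2))   # two MPI processes per socket
print(round_robin(2, 8))    # one MPI process per node, threads spread out
```

Which variant performs best depends on the application; the lists only make explicit what "blockwise" and "round robin" mean in terms of core IDs.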



Conclusions about architecture

 Modern computer architecture has a rich “topology”

 Node-level hardware parallelism takes many forms
    Sockets/devices – CPU: 1-4 or more, GPGPU/Phi: 1-6 or more
    Cores – moderate (CPU: 4-24, GPGPU: 10-100, Phi: 64-72)
    SIMD – moderate (CPU: 2-8, Phi: 8-16) to massive (GPGPU: 10’s-100’s)
    Superscalarity (CPU/Phi: 2-6)

 Exploiting performance: parallelism + bottleneck awareness
    “High Performance Computing” == computing at a bottleneck

 Performance of programs is sensitive to architecture
    Topology/affinity influences overheads of popular programming models
    Standards do not contain (many) topology-aware features
       Things are starting to improve slowly (MPI 3.0, OpenMP 4.0)
    Apart from overheads, performance features are largely independent of the programming model
