Heterogeneous Parallel Programming

Lecture 1.1
Introduction to Heterogeneous
Parallel Computing
Wen-mei Hwu
University of Illinois at Urbana-Champaign

Heterogeneous Parallel Computing

Use the best match for the job (heterogeneity in mobile SOC)

[Figure labels: Cloud Services; Latency Cores; Throughput Cores; DSP Cores; Configurable Logic/Cores; HW IPs; On-chip Memories]

(c) Wen-mei Hwu, Cool Chips
11/28/2012

Blue Waters Supercomputer


Cray System & Storage cabinets: >300
Compute nodes: >25,000
Usable Storage Bandwidth: >1 TB/s
System Memory: >1.5 Petabytes
Memory per core module: 4 GB
Gemini Interconnect Topology: 3D Torus
Usable Storage: >25 Petabytes
Peak performance: >11.5 Petaflops
Number of AMD Interlagos processors: >49,000
Number of AMD x86 core modules: >380,000
Number of NVIDIA Kepler GPUs: >3,000
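
As a rough consistency check on the figures above, the sketch below multiplies the number of x86 core modules by the memory per core module and compares the core module count with the processor count. It treats the quoted lower bounds as exact values and ignores GPU memory, so the results are approximations, not official specifications.

```python
# Lower bounds quoted on the slide (each figure is prefixed with ">").
core_modules = 380_000        # AMD x86 core modules
memory_per_module_gb = 4      # GB per core module
processors = 49_000           # AMD Interlagos processors

# Memory attached to the x86 core modules, in petabytes (decimal: 1 PB = 1e6 GB).
x86_memory_pb = core_modules * memory_per_module_gb / 1_000_000

# Average number of core modules per processor.
modules_per_processor = core_modules / processors

print(f"x86 memory: about {x86_memory_pb:.2f} PB")                         # about 1.52 PB
print(f"core modules per processor: about {modules_per_processor:.1f}")   # about 7.8
```

Both results line up with the slide: about 1.52 PB agrees with the ">1.5 Petabytes" system memory figure, and the ratio comes out close to eight core modules per processor.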

CPUs and GPUs have very different design philosophies

CPU: Latency Oriented Cores
[Figure: chip made of cores; each core has a local cache, registers, a SIMD unit, and control]

GPU: Throughput Oriented Cores
[Figure: chip made of compute units; each compute unit has cache/local memory, registers, SIMD units, and threading]

CPUs: Latency Oriented Design

Large caches
    Convert long latency memory accesses to short latency cache accesses

Sophisticated control
    Branch prediction for reduced branch latency
    Data forwarding for reduced data latency

Powerful ALU
    Reduced operation latency

[Figure: CPU chip with Control, four ALUs, and Cache, backed by DRAM]

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012
ECE408/CS483, University of Illinois, Urbana-Champaign
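
To illustrate how a large cache converts long latency memory accesses into short latency ones, here is a minimal sketch of the average access latency for a single cache level. The function name and the cycle counts are hypothetical, chosen only for illustration; real hierarchies have several levels and overlapping accesses.

```python
def average_access_latency(hit_rate, cache_latency, dram_latency):
    """Average latency per memory access, in cycles, for a single cache level.

    A hit costs cache_latency; a miss pays cache_latency plus dram_latency.
    """
    miss_rate = 1.0 - hit_rate
    return cache_latency + miss_rate * dram_latency


# Hypothetical latencies in cycles, for illustration only.
CACHE_CYCLES = 4
DRAM_CYCLES = 200

for hit_rate in (0.0, 0.90, 0.99):
    cycles = average_access_latency(hit_rate, CACHE_CYCLES, DRAM_CYCLES)
    print(f"hit rate {hit_rate:.0%}: {cycles:.0f} cycles on average")
```

Under these assumed numbers the average drops from 204 cycles with no hits to about 24 cycles at a 90% hit rate and about 6 cycles at 99%, which is why a latency oriented design spends so much chip area on cache.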

GPUs: Throughput Oriented Design

Small caches
    To boost memory throughput

Simple control
    No branch prediction
    No data forwarding

Energy efficient ALUs
    Many, long latency but heavily pipelined for high throughput

Require massive number of threads to tolerate latencies

[Figure: GPU chip backed by DRAM]
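
Why a throughput oriented design needs a massive number of threads can be sketched with Little's law: to issue one operation every cycle while each operation takes many cycles to complete, that many operations must be in flight at once. The function name, its parameters, and the latencies below are hypothetical and only meant to show the shape of the argument.

```python
import math


def threads_to_hide_latency(latency_cycles, independent_ops_per_thread=1, issue_per_cycle=1):
    """Threads needed so that some thread is always ready to issue.

    To issue issue_per_cycle operations every cycle while each operation takes
    latency_cycles to complete, latency_cycles * issue_per_cycle operations
    must be in flight. Each thread supplies independent_ops_per_thread of them.
    """
    in_flight = latency_cycles * issue_per_cycle
    return math.ceil(in_flight / independent_ops_per_thread)


# Hypothetical latencies in cycles, for illustration only.
print(threads_to_hide_latency(20))    # long latency pipelined ALU operation: 20
print(threads_to_hide_latency(400))   # DRAM access: 400
print(threads_to_hide_latency(400, independent_ops_per_thread=4))  # 100
```

With simple control and small caches, the hardware does little to shorten each operation, so the only way to keep the pipelines full is to have enough independent threads to cover the latency.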

Winning Applications Use Both CPU and GPU

CPUs for sequential parts where latency matters
    CPUs can be 10+X faster than GPUs for sequential code

GPUs for parallel parts where throughput wins
    GPUs can be 10+X faster than CPUs for parallel code
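
The case for using both devices can be worked through with a small timing model. This is a sketch under stated assumptions: the "10+X" advantage is taken as exactly 10, the split of 10% sequential and 90% parallel work is hypothetical, and data transfer between CPU and GPU is ignored.

```python
def run_time(sequential_fraction, seq_speed, par_speed):
    """Time to run one unit of work split into a sequential and a parallel part.

    Speeds are relative: 1.0 is the slower device for that kind of code.
    """
    parallel_fraction = 1.0 - sequential_fraction
    return sequential_fraction / seq_speed + parallel_fraction / par_speed


ADVANTAGE = 10.0     # the slide's "10+X", taken as exactly 10 for illustration
SEQ_FRACTION = 0.1   # hypothetical: 10% of the work is sequential

# CPU only: fast on the sequential part, slow on the parallel part.
cpu_only = run_time(SEQ_FRACTION, seq_speed=ADVANTAGE, par_speed=1.0)
# GPU only: slow on the sequential part, fast on the parallel part.
gpu_only = run_time(SEQ_FRACTION, seq_speed=1.0, par_speed=ADVANTAGE)
# Both: each part runs on the device that suits it.
both = run_time(SEQ_FRACTION, seq_speed=ADVANTAGE, par_speed=ADVANTAGE)

print(f"CPU only: {cpu_only:.2f}")   # 0.91
print(f"GPU only: {gpu_only:.2f}")   # 0.19
print(f"CPU+GPU:  {both:.2f}")       # 0.10
```

Even with only 10% sequential work, running that part on the GPU roughly doubles the total time compared with the combined CPU and GPU version in this model.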

Heterogeneous parallel computing is catching on.

Financial Analysis
Scientific Simulation
Engineering Simulation
Data Intensive Analytics
Medical Imaging
Digital Audio Processing
Digital Video Processing
Computer Vision
Biomedical Informatics
Electronic Design Automation
Statistical Modeling
Ray Tracing Rendering
Interactive Physics
Numerical Methods

280 submissions to GPU Computing Gems
90 articles included in two volumes

GPU computing is catching on.

CANDE 2011

TO LEARN MORE, READ CHAPTER 1