GPU Programming
Dr. Farshad Khunjush
Department of Computer Science and Engineering, Shiraz University, Fall 2013
Introduction:
The Multicore Revolution (the third software crisis, its origins, and the roadmap to its solutions)
Acknowledgement: some slides are borrowed from presentations by Jim Demmel, Horst Simon, Mark D. Hill, David Patterson, Saman Amarasinghe, Richard Brunner, Luddy Harrison, Jack Dongarra, Katherine Yelick, and Matei Ripeanu.
The first software crisis and how it was solved:
- Needed abstraction and portability without losing performance
- High-level languages for von Neumann machines: FORTRAN and C
- These provided a "common machine language" for uniprocessors
- Common properties: a single flow of control and a single memory image
- Differences (hidden behind the abstraction): ISA (Instruction Set Architecture), register files, and functional units
The second software crisis:
- Problem: inability to build and maintain complex, robust applications requiring multi-million lines of code developed by hundreds of programmers
- Needed composability, malleability, and maintainability
How was it solved? Object Oriented Programming
- Also: better tools and better software engineering methodology
- The result is a solid boundary between hardware and software: programmers don't have to know anything about the processor
- Moore's law does not require programmers to know anything about the processors to get good speedups
The origins of the third crisis:
- Problem: sequential performance is left behind by Moore's law
- Needed: continuous and reasonable performance improvements
  - to support new features
  - to support larger data sets
- Needed: these improvements while sustaining portability and maintainability, without unduly increasing the complexity faced by the programmer
- Critical to keep up with the current rate of evolution in software
Moore's Law
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. The prediction became known as Moore's Law, and microprocessors have indeed become smaller, denser, and more powerful.
Slide source: Jack Dongarra
- Increased density = more transistors = more capacitance
- Can increase cores (2x) and performance (2x)
- Or increase cores (2x) but decrease frequency (1/2): the same performance at about 1/4 the power
- Additional benefit: small, simple cores give more predictable performance
Slide source: www.opensparc.net
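A back-of-the-envelope check of the 1/4-power claim (added here, not from the slides), using the standard dynamic-power model P ≈ C·V²·f and the assumed rule of thumb that supply voltage can scale down roughly with frequency: doubling the cores doubles the switched capacitance to 2C, and halving f lets us also halve V, so the new power is 2C·(V/2)²·(f/2) = (1/4)·C·V²·f, while throughput is unchanged (2 cores × f/2 = 1 core × f).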
Increasing clock frequency doesn't automatically improve performance anymore: a memory access takes hundreds of CPU cycles.
[Figure: uniprocessor performance, 1978-2006 (log scale, 1 to 1000): growth of roughly 25%/year until 1986, 52%/year from 1986 to about 2002, and far less (??%/year) since. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
Superscalar (SS) designs were the state of the art; many forms of parallelism were not visible to the programmer:
- multiple instruction issue
- dynamic scheduling: hardware discovers parallelism between instructions
- speculative execution: look past predicted branches
- non-blocking caches: multiple outstanding memory operations
You may have heard of these in an advanced computer architecture course, but you haven't needed to know about them to write software. Unfortunately, these sources of speedup have been used up.
There is little or no hidden parallelism (ILP) left to be found. Parallelism must now be exposed to and managed by software, as in the sketch below.
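To make that shift concrete, here is a minimal C/OpenMP sketch (an illustration added here, not from the slides): the programmer, not the hardware, marks the loop that can run across cores and names the reduction variable. Compile with something like gcc -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 1 << 20 };
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        /* Parallelism is explicit: the loop is split across threads
         * and the partial sums are combined by the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }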
- The number of cores per chip will double every two years
- Clock speed will not increase (and may decrease)
- We will need to deal with systems running large numbers of concurrent threads:
  - millions for HPC systems
  - thousands for servers
  - hundreds for workstations and notebooks
Why Parallelism?
- These arguments are no longer theoretical: all major processor vendors are producing multicore chips
  - Every machine will soon be a parallel machine
  - Will all programmers be parallel programmers???
- Want a new feature? Hide its cost by speeding up the code first
  - Will all programmers be performance programmers???
- Some parallelism may eventually be hidden in libraries, compilers, and high-level languages
- Big open questions:
  - What will be the killer apps for multicore machines?
  - How should the chips be designed, and how will they be programmed?
- Finding enough parallelism (Amdahl's Law)
- Granularity
- Locality
- Load balance
- Coordination and synchronization
- Performance modeling
All of these things make parallel programming even harder than sequential programming.
- Bit-level parallelism: within floating point operations, etc.
- Instruction-level parallelism: multiple instructions execute per clock cycle
- Memory system parallelism: overlap of memory operations with computation
- OS parallelism: multiple jobs run in parallel on commodity SMPs
- There are limits to all of these: for very high performance, the user must identify, schedule, and coordinate the parallel tasks
Amdahl's Law
Suppose a fraction F of a computation can be parallelized across N processors while the remaining (1 - F) stays sequential. Then

    Speedup(F, N) = 1 / ((1 - F) + F/N)

Implications:
- Attack the common case when introducing parallelization: when F is small, optimizations will have little effect.
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part: as N grows, the speedup approaches 1 / (1 - F).
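A small C program (an illustration added here, not part of the slides) that tabulates Amdahl's bound and its limit:

    #include <stdio.h>

    /* Amdahl's Law: speedup when a fraction f of the work
     * is parallelized perfectly across n processors. */
    static double amdahl(double f, int n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        const double fracs[] = {0.50, 0.90, 0.99};
        const int procs[] = {2, 4, 16, 256};

        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 4; j++)
                printf("F=%.2f N=%4d -> speedup %6.2f\n",
                       fracs[i], procs[j], amdahl(fracs[i], procs[j]));
            /* As N grows, the speedup saturates at 1/(1-F). */
            printf("F=%.2f limit  -> %6.2f\n\n",
                   fracs[i], 1.0 / (1.0 - fracs[i]));
        }
        return 0;
    }

Even with 256 processors, a program that is 90% parallel speeds up by less than 10x.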
Overhead of Parallelism
- Given enough parallel work, this is the biggest barrier to getting the desired speedup.
- Parallelism overheads include:
  - the cost of starting a thread or process (see the sketch below)
  - the cost of communicating shared data
  - the cost of synchronizing
  - extra (redundant) computation
- Each of these can be in the range of milliseconds (= millions of flops) on some systems.
- Tradeoff: an algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.
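To get a feel for the first overhead above, here is a hypothetical micro-benchmark (added for illustration, not from the slides) that times POSIX thread creation and joining; link with -lpthread:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static void *noop(void *arg) { return arg; }   /* the thread does no work */

    int main(void) {
        enum { ITERS = 1000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            pthread_t tid;
            pthread_create(&tid, NULL, noop, NULL);
            pthread_join(tid, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / 1e3 / ITERS;
        /* On a core retiring ~1e9 flops/s, even a few microseconds of
         * startup cost is thousands of foregone floating point operations. */
        printf("average create+join: %.1f us\n", us);
        return 0;
    }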
[Figure: the storage hierarchy of a parallel machine: several processors, each with its own cache levels down to an L3 cache, each backed by its own memory.]
- Large memories are slow; fast memories are small.
- Storage hierarchies are large and fast on average.
- Parallel processors, collectively, have a large, fast cache: the slow accesses to remote data are what we call communication. A sketch of the effect follows.
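A short C sketch (illustrative, not from the slides) of the hierarchy at work: summing the same matrix is far faster when the traversal matches the memory layout (unit stride) than when every access jumps a full row:

    #include <stdio.h>
    #include <time.h>

    enum { N = 4096 };
    static double m[N][N];              /* ~128 MB: far larger than any cache */

    static double seconds(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        struct timespec t0, t1;
        double sum = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)       /* row-major: unit stride,       */
            for (int j = 0; j < N; j++)   /* every cache line fully reused */
                sum += m[i][j];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double row_s = seconds(t0, t1);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int j = 0; j < N; j++)       /* column order: stride N doubles, */
            for (int i = 0; i < N; i++)   /* each access touches a new line  */
                sum += m[i][j];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double col_s = seconds(t0, t1);

        printf("row-major %.2fs, column-major %.2fs (sum=%g)\n",
               row_s, col_s, sum);
        return 0;
    }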
Course topics:
- The Multicore Revolution (the third software crisis, its origins, and the roadmap to its solutions)
- Different Parallel Architectures & Multicore Architectures (from pipelines to homogeneous & heterogeneous multi-core architectures)
- Introduction to GPGPU Architecture
- CUDA Programming Concepts
- Profiling CUDA Programs
- OpenCL Programming Concepts
Grading (tentative)
- Assignments (4): 5% each, 20% total
- Final Project: 35%