Académique Documents
Professionnel Documents
Culture Documents
Arvind
Computer Science and Artificial Intelligence Lab
M.I.T.
ISCA 2005, Madison, WI
June 6, 2005
1
Arvind - 2
K. Hiraki
Dynamic
Shown at
Supercomputing 96
Shown at
Manchester (81)
Supercomputing 91
M.I.T.
TTDA,
Monsoon
(88)
Greg Papadopoulos
S. Sakai
M.I.T./Motorola - Monsoon (91) (8 PEs, 8 IS)
ETL - SIGMA-1 (88) (128 PEs,128John
IS) Gurd
T. Shimada
ETL - EM4 (90) (80 PEs), EM-X (96) (80 PEs)
Sandia - EPS88, EPS-2
IBM - Empire
...
Chris
Andy Y. Kodama
Jack
Joerg
Boughton
Costanza
ISCA, Madison, WI, June 6, 2005
Related machines:
Burton Smiths
Denelcor HEP,
Monsoon
Horizon, Tera
Arvind - 3
Software Influences
Parallel Compilers
Intermediate representations: DFG, CDFG (SSA, ,...)
Software pipelining
Keshav Pingali, G. Gao, Bob Rao, ..
Synchronous dataflow
Lustre, Signal
Ed Lee @ Berkeley
ISCA, Madison, WI, June 6, 2005
Arvind - 4
Arvind - 5
Dataflow Graphs
{x = a + b;
y = b * 7
in
(x-y) * (x+y)}
< ip , p , v >
instruction ptr
port
ip = 3
p = L
data
token
*7
x
3
y
4
Arvind - 6
Dataflow Operators
A small set of dataflow operators can be used
to define a general programming language
Fork
Primitive Ops
Switch
T
Merge
T
Arvind - 7
Conditional
T
Before
After
Loop
Bounded
Loop
T
F
p
g
T
Needed for
resource
management
ISCA, Madison, WI, June 6, 2005
Arvind - 8
Outline
Dataflow graphs
A clean model of parallel computation
Arvind - 9
2
1
n
n
1
2
tio
tio
d
d
a
e
a
n
n
ra
ra
od stin
tin
c
e
e
s
p
p
p
De
O
De
O
O
1
2
3
4
5
3L
4L
-*
3R
5L
4R
+
*
*7
5R
out
y
4
Presence bits
Arvind - 10
Instruction Templates
1
2
.
.
.
FU
Send
Op
FU
FU
FU
p2
src2
FU
Arvind - 11
Static Dataflow:
Problems/Limitations
Mismatch between the model and the
implementation
The model requires unbounded FIFO token queues per
arc but the architecture provides storage for one token
per arc
The architecture does not ensure FIFO order in the
reuse of an operand slot
The merge operator has a unique firing rule
- No easy solution in
the static framework
- Dynamic dataflow
provided a framework
for solutions
Arvind - 12
Outline
Dataflow graphs
A clean model of parallel computation
Arvind - 13
instruction
pointer
Arvind - 14
3L, 4L
3R, 4R
5L
5R
out
Program
1
4
5
*7
x
3
2
3
y
4
7
5
Frame
Need to provide storage for only one operand/operator
ISCA, Madison, WI, June 6, 2005
Arvind - 15
Monsoon Processor
Greg Papadopoulos
op r d1,d2
ip
Instruction
Fetch
fp+r
Operand
Fetch
Code
Frames
Token
Queue
ALU
Form
Token
Network
ISCA, Madison, WI, June 6, 2005
Network
Arvind - 16
op r S1,S2
n sets of
registers
(n = pipeline
depth)
Code
Frames
Instruction
Fetch
Operand
Fetch
Registers
Registers evaporate
when an instruction
thread is broken
Token
Queue
ALU
Robert Iannucci
Form
Token
Network
Network
Arvind - 17
32
Instruction Fetch
Effective Address
Presence
bits
Frame
Memory
Presence Bit
Operation
72
Frame
Operation
72
2R, 2W
Registers
72
User
Queue
System
Queue
144
ALU
144
Network
Form Token
Arvind - 18
dest1
[dest2]
EA - effective address
WM
Easy to implement;
no hazard detection
FP + r: frame relative
r: absolute
IP + r: code relative (not supported)
- waiting matching
Register ops:
ALU:
VL X VR VL X VR , CC
Arvind - 19
a1
get frame
token in frame 0
token in frame 1
...
extract tag
change Tag 0
Like standard
call/return
but caller &
callee can be
active
simultaneously
an
change Tag 1
change Tag n
1:
n:
Fork
Graph for f
change Tag 0
ISCA, Madison, WI, June 6, 2005
change Tag 1
Arvind - 20
Outline
Dataflow graphs
A clean model of parallel computation
Arvind - 21
g:
Global Heap of
Shared Objects
f:
h:
active
threads
asynchronous
and parallel
at all levels
loop
ISCA, Madison, WI, June 6, 2005
Arvind - 22
Id World
implicit parallelism
Id
TTDA
Monsoon
*T
*T-Voyager
Arvind - 23
Id World people
Rishiyur Nikhil,
Keshav Pingali,
Vinod Kathail,
David Culler
Ken Traub
Steve Heller,
Richard Soley,
Dinart Mores
Jamey Hicks,
Alex Caro,
Andy Shaw,
Boon Ang
Shail Anditya
R Paul Johnson
Paul Barth
Jan Maessen
Christine Flood
Jonathan Young
Derek Chiou
Arun Iyangar
Zena Ariola
Mike Bekerle
K. Eknadham (IBM)
Wim Bohm (Colorado)
Joe Stoy (Oxford)
...
R.S. Nikhil
Ken Traub
Steve Heller
Arvind - 24
Memory
....
a
I-fetch
I-store
ISCA, Madison, WI, June 6, 2005
Arvind - 25
I-Structure Storage:
Split-phase operations & Presence bits
<s, fp, a >
s
I-Fetch
forwarding
address
s I-Fetch
split
phase
t
address to
be read
I-structure
Memory
a1
a2
a3
a4
v1
v2
fp.ip
Arvind - 26
Outline
Dataflow graphs
A clean parallel model of computation
Arvind - 27
Monsoon
Processor
64-bit
I-structure
100 MB/sec
4M 64-bit words
Unix Box
10M tokens/sec
16
Id World Software
Tony Dahbura
ISCA, Madison, WI, June 6, 2005
Arvind - 28
Hydrodynamics - SIMPLE
Global Circulation Model - GCM
Photon-Neutron Transport code -GAMTEB
N-body problem
Symbolic
Combinatorics - free tree matching,Paraffins
Id-in-Id compiler
System
I/O Library
Heap Storage Allocator on Monsoon
Arvind - 29
Arvind - 30
Single Processor
Monsoon Performance Evolution
One 64-bit processor (10 MHz) + 4M 64-bit I-structure
Feb. 91
Aug. 91
Mar. 92
Sep. 92
Matrix Multiply
4:04
3:58
3:55
1:46
500x500
Wavefront
5:00
500x500, 144 iters.
Paraffins
n = 19
n = 22
:50
5:00
:31
GAMTEB-9C
40K particles
1M particles
17:20
7:13:20
10:42
4:17:14
SIMPLE-100
1 iterations
1K iterations
:19
4:48:00
:15
hours:minutes:seconds
ISCA, Madison, WI, June 6, 2005
3:48
ne
i
h
ac
m
eal
r
a
is
d
h
e
t
e
:02.4
N do
:32
to
5:36
2:36:00
:10
5:36
2:22:00
:06
1:19:49
Arvind - 31
2pe 4pe
8pe
critical path
(millions of cycles)
1pe
2pe
4pe
8pe
Matrix Multiply
500 x 500
1057
531
271
137
Paraffins
n=22
322
162
82
44
GAMTEB-2C
40 K particles
590
303
155
80
SIMPLE-100
100 iters
747
September, 1992
ISCA, Madison, WI, June 6, 2005
Arvind - 32
Base Performance?
Id on Monsoon vs. C / F77 on
R3000
MIPS (R3000)
Monsoon (1pe)
(x 10e6 cycles)
Matrix Multiply
500 x 500
Paraffins
n=22
GAMTEB-9C
40 K particles
SIMPLE-100
100 iters
(x 10e6 cycles)
954 +
1058
102 +
322
265 *
590
1787 *
4682
8-way superscalar?
Unlikely to give 7 fold
speedup
Arvind - 33
Arvind - 34
Outline
Dataflow graphs
A clean parallel model of computation
Arvind - 35
Arvind - 36
Arvind - 37
A multithreaded/dataflow model is
needed to present the found parallelism
to the underlying hardware
Exploiting fine-grain parallelism is
necessary for many situations, e.g.,
producer-consumer parallelism
Arvind - 38
Arvind - 39
Parting thoughts
Dataflow research as conceived by most
researchers achieved its goals
The model of computation is beautiful and will be
resurrected whenever people want to exploit fine-grain
parallelism
Arvind - 40
Thank You!
and thanks to
R.S.Nikhil, Dan Rosenband,
James Hoe, Derek Chiou,
Larry Rudolph, Martin Rinard,
Keshav Pingali
for helping with this talk
41
DFGs vs CDFGs
Both Dataflow Graphs and Control DFGs had the
goal of structured, well-formed, compositional,
executable graphs
CDFG research (70s, 80s) approached this goal
starting with original sequential control-flow
graphs (flowcharts) and data-dependency arcs,
and gradually adding structure (e.g., functions)
Dataflow graphs approached this goal directly, by
construction
Schemata for basic blocks, conditionals, loops, procedures
Arvind - 42