1 Reduction
Due Feb. 10, 2016, 11:59PM
Adapted from: GPU Teaching Kit Accelerated Computing
Objective
Implement a kernel to perform a sum reduction on a 1D list. Your kernel should be able to handle input lists of arbitrary length. To simplify the lab, you can assume that the input list contains at most 2048 × 65,535 elements. This means that the computation can be performed using only one kernel launch. The boundary condition can be handled by filling the identity value (0 for sum) into the shared memory of the last block when the length is not a multiple of the thread block size.
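The boundary handling described above can be sketched as follows. This is a minimal illustration, not the lab's provided skeleton: the names `blockSum`, `sumOnDevice`, and `BLOCK_SIZE` are hypothetical, the block size of 1024 threads (each loading two elements) is an assumption, and adding up the per-block partial sums on the host is one possible way to finish the reduction after a single kernel launch. Error checking is omitted for brevity, and the input is assumed to be non-empty.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 1024  // assumed threads per block; each block covers 2 * BLOCK_SIZE inputs

// Each block reduces up to 2 * BLOCK_SIZE inputs and writes one partial sum
// to output[blockIdx.x]. (Hypothetical kernel name and signature.)
__global__ void blockSum(const float *input, float *output, unsigned int len) {
  __shared__ float partial[2 * BLOCK_SIZE];

  unsigned int t = threadIdx.x;
  unsigned int start = 2 * blockIdx.x * BLOCK_SIZE;

  // Boundary condition: slots past the end of the list get the identity value 0.
  partial[t] = (start + t < len) ? input[start + t] : 0.0f;
  partial[BLOCK_SIZE + t] =
      (start + BLOCK_SIZE + t < len) ? input[start + BLOCK_SIZE + t] : 0.0f;

  // Tree reduction in shared memory; the active threads stay contiguous.
  for (unsigned int stride = BLOCK_SIZE; stride >= 1; stride >>= 1) {
    __syncthreads();
    if (t < stride) partial[t] += partial[t + stride];
  }

  if (t == 0) output[blockIdx.x] = partial[0];
}

// Hypothetical host helper: one kernel launch, then a host-side sum of the
// per-block partial results. Assumes a non-empty input.
float sumOnDevice(const std::vector<float> &host) {
  unsigned int len = static_cast<unsigned int>(host.size());
  unsigned int numBlocks = (len + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);

  float *devIn = nullptr, *devOut = nullptr;
  cudaMalloc(&devIn, len * sizeof(float));
  cudaMalloc(&devOut, numBlocks * sizeof(float));
  cudaMemcpy(devIn, host.data(), len * sizeof(float), cudaMemcpyHostToDevice);

  blockSum<<<numBlocks, BLOCK_SIZE>>>(devIn, devOut, len);

  std::vector<float> partials(numBlocks);
  cudaMemcpy(partials.data(), devOut, numBlocks * sizeof(float),
             cudaMemcpyDeviceToHost);
  cudaFree(devIn);
  cudaFree(devOut);

  float total = 0.0f;
  for (float p : partials) total += p;
  return total;
}

int main() {
  // 5000 is deliberately not a multiple of 2 * BLOCK_SIZE, so the last block
  // exercises the zero-fill path.
  std::vector<float> data(5000, 1.0f);
  printf("sum = %.1f\n", sumOnDevice(data));
  return 0;
}
```

Under these assumptions each block covers 2048 elements, so a one-dimensional grid of up to 65,535 blocks would cover the stated maximum input length; that reading of the bound is an inference, not something the handout states.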
The goal of this assignment is to learn the tradeoff between parallelism and synchronization, the cost of different synchronization mechanisms, and some interaction with the memory hierarchy.
Instructions
Edit the provided code in the locations denoted by //@@ comments. You will perform the following:
The executable generated as a result of compiling the lab can be run using the following command, where the input and output files point to files in the Dataset directory:

    ./reduction -i <input.raw> -o <output.raw>
Grading
Your grade will be out of 10 points:
Performance: 2 points
What to Turn In
Turn in your driver.cu file that implements the multiple versions of your reduction computation. DO NOT modify the makefile or the way in which the program is invoked.
Also turn in a README file that summarizes the performance results you saw across the different input data sets. Each reported result should be the average of 10 runs, with clear outliers excluded from the average. Note that other users on the same desktop will affect performance results. Summarize performance for each input data set. Also, write a brief paragraph on what you learned from this assignment.
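One way to collect the 10-run averages is sketched below using CUDA events. Everything here is illustrative: `standInKernel` is a placeholder for whichever reduction version is being measured, `averageWithoutOutliers` is a hypothetical helper, and the rule for what counts as an outlier (a sample more than twice the median) is an assumption, since the handout does not define one. Error checking is omitted for brevity.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder workload; substitute the reduction kernel under test.
__global__ void standInKernel(float *data, unsigned int len) {
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < len) data[i] *= 2.0f;
}

// Average of the samples, excluding any sample larger than factor * median.
// The threshold rule is an assumption, not part of the assignment.
float averageWithoutOutliers(std::vector<float> samples, float factor = 2.0f) {
  if (samples.empty()) return 0.0f;
  std::sort(samples.begin(), samples.end());
  float median = samples[samples.size() / 2];
  float sum = 0.0f;
  int kept = 0;
  for (float s : samples) {
    if (s <= factor * median) {
      sum += s;
      ++kept;
    }
  }
  return kept > 0 ? sum / kept : 0.0f;
}

int main() {
  const unsigned int len = 1u << 20;
  float *dev = nullptr;
  cudaMalloc(&dev, len * sizeof(float));
  cudaMemset(dev, 0, len * sizeof(float));

  cudaEvent_t begin, end;
  cudaEventCreate(&begin);
  cudaEventCreate(&end);

  std::vector<float> timesMs;
  for (int run = 0; run < 10; ++run) {
    cudaEventRecord(begin);
    standInKernel<<<(len + 255) / 256, 256>>>(dev, len);
    cudaEventRecord(end);
    cudaEventSynchronize(end);  // wait until the kernel and the end event finish

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, begin, end);
    timesMs.push_back(elapsedMs);
  }

  printf("average kernel time: %f ms\n", averageWithoutOutliers(timesMs));

  cudaEventDestroy(begin);
  cudaEventDestroy(end);
  cudaFree(dev);
  return 0;
}
```

Whatever exclusion rule is used, stating it in the README alongside the number of runs that were dropped makes the reported averages easier to interpret.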
This project was adapted from work licensed by UIUC and NVIDIA (2015) under a Creative Commons Attribution-NonCommercial 4.0 License.