
Lab 1: Reduction

Due Feb. 10, 2016, 11:59 PM
Adapted from: GPU Teaching Kit, Accelerated Computing

Objective
Implement a kernel to perform a sum reduction on a 1D list. Your kernel should be
able to handle input lists of arbitrary length.
To simplify the lab, you may assume that the input list contains at most
2048 × 65,535 elements. This means that the computation can be performed
using only one kernel launch.
The boundary condition can be handled by filling the identity value (0 for sum) into
the shared memory of the last block when the length is not a multiple of the thread
block size.
The goal of this assignment is to learn the tradeoff between parallelism and
synchronization, the cost of different synchronization mechanisms, and some
interaction with the memory hierarchy.
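As a starting point, a tree-style shared-memory reduction kernel that pads out-of-range elements with the identity value might look like the following sketch. The kernel name reduce_sum and the BLOCK_SIZE constant are illustrative, not part of the provided template:

```cuda
#define BLOCK_SIZE 256

// Each block reduces 2 * BLOCK_SIZE input elements into one partial sum.
// Out-of-range elements are replaced by the identity value (0 for sum).
__global__ void reduce_sum(const float *input, float *partial, int n) {
    __shared__ float sdata[2 * BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    // Load two elements per thread, padding with the identity value.
    sdata[t] = (start + t < n) ? input[start + t] : 0.0f;
    sdata[blockDim.x + t] =
        (start + blockDim.x + t < n) ? input[start + blockDim.x + t] : 0.0f;

    // Tree reduction in shared memory.
    for (unsigned int stride = blockDim.x; stride >= 1; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            sdata[t] += sdata[t + stride];
    }

    // Thread 0 writes this block's partial sum; a second pass (or the
    // host) finishes the reduction over the per-block partial sums.
    if (t == 0)
        partial[blockIdx.x] = sdata[0];
}
```

With this layout each block consumes 2 × BLOCK_SIZE elements, which, at the maximum of 1024 threads per block and 65,535 blocks in one grid dimension, is consistent with performing the whole reduction in a single kernel launch.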

Instructions
Edit the provided code in the locations denoted by //@@ comments. You will
perform the following:

- allocate device memory
- copy host memory to device
- initialize thread block and kernel grid dimensions
- implement the CUDA kernel
  - try different strategies: use shared memory, use atomic operations
    (on global or shared data), vary the amount of sequential execution
    on the host
- invoke the CUDA kernel
- copy results from device to host
- deallocate device memory
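Taken together, the host-side steps above might be wired up as in this sketch. The kernel name reduce_sum, the BLOCK_SIZE constant, and finishing the reduction sequentially on the host are all illustrative choices; your driver.cu template will differ:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define BLOCK_SIZE 256

// Illustrative kernel; your own kernel from driver.cu goes here.
__global__ void reduce_sum(const float *input, float *partial, int n);

float host_reduce(const float *h_input, int n) {
    // Each block reduces 2 * BLOCK_SIZE elements into one partial sum.
    int numBlocks = (n + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);

    float *d_input, *d_partial;
    cudaMalloc(&d_input, n * sizeof(float));            // allocate device memory
    cudaMalloc(&d_partial, numBlocks * sizeof(float));
    cudaMemcpy(d_input, h_input, n * sizeof(float),
               cudaMemcpyHostToDevice);                 // copy host to device

    reduce_sum<<<numBlocks, BLOCK_SIZE>>>(d_input, d_partial, n);  // invoke kernel

    float *h_partial = (float *)malloc(numBlocks * sizeof(float));
    cudaMemcpy(h_partial, d_partial, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);                 // copy device to host

    // Finish the reduction sequentially on the host (one of the
    // parallelism-vs-sequential-work tradeoffs to experiment with).
    float sum = 0.0f;
    for (int i = 0; i < numBlocks; ++i)
        sum += h_partial[i];

    free(h_partial);
    cudaFree(d_input);                                  // deallocate device memory
    cudaFree(d_partial);
    return sum;
}
```

An alternative to the host-side finish is having thread 0 of each block call atomicAdd on a single global accumulator; note that floating-point atomics make the summation order, and thus the rounding, vary from run to run.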

Instructions for Compiling and Running


Download the file reduction_handout.tar from Canvas. Place this file in a code
directory and extract with

tar xvf reduction_handout.tar

In this directory are the following:

driver.cu: a template for your assignment

makefile: the makefile for compiling into an executable on gradlab machines

Dataset: a directory of 9 input data files and expected output

The executable generated as a result of compiling the lab can be run using the
following command, where the input and output files point to files in the Dataset
directory:
./reduction -i <input.raw> -o <output.raw>

Grading
Your grade will be out of 10 points:

One working implementation: 3 points

Three additional working implementations: 1 point each

Performance: 2 points

Documentation and comments: 2 points

What to Turn In
Turn in your driver.cu file that implements the multiple versions of your reduction
computation. DO NOT modify makefile or the way in which the program is
invoked.
Also turn in a README file that summarizes the performance results you saw across
the different input data sets. Each result provided should be the average of 10 runs,
with clear outliers excluded from the average. Note that other users on the
same desktop will affect performance results. Summarize performance for each
input data set. Also, write a brief paragraph on what you learned from this
assignment.
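One way to collect those timings is to measure just the kernel with CUDA events and average over the runs, roughly as sketched below. The kernel name reduce_sum, its arguments, and the launch configuration are placeholders for whatever your driver.cu actually launches:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float total_ms = 0.0f;
for (int run = 0; run < 10; ++run) {
    cudaEventRecord(start);
    reduce_sum<<<numBlocks, BLOCK_SIZE>>>(d_input, d_partial, n);  // your kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    total_ms += ms;  // keep the per-run values too if you want to drop outliers
}
printf("average kernel time: %f ms\n", total_ms / 10.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Recording events on the same stream as the kernel ensures only the kernel's execution is timed, not the host-to-device copies.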

This project was adapted from work licensed by UIUC and NVIDIA (2015) under a
Creative Commons Attribution-NonCommercial 4.0 License.
