
Introduction to Parallel Programming

Student Workbook with Instructor's Notes


Intel Software College

Legal Lines and Disclaimers

The information contained in this document is provided for informational purposes only and represents the current view of Intel Corporation ("Intel") and
its contributors ("Contributors") as of the date of publication. Intel and the Contributors make no commitment to update the information contained
in this document, and Intel reserves the right to make changes at any time, without notice.

DISCLAIMER. THIS DOCUMENT, IS PROVIDED "AS IS." NEITHER INTEL, NOR THE CONTRIBUTORS MAKE ANY REPRESENTATIONS OF ANY KIND WITH
RESPECT TO PRODUCTS REFERENCED HEREIN, WHETHER SUCH PRODUCTS ARE THOSE OF INTEL, THE CONTRIBUTORS, OR THIRD PARTIES. INTEL,
AND ITS CONTRIBUTORS EXPRESSLY DISCLAIM ANY AND ALL WARRANTIES, IMPLIED OR EXPRESS, INCLUDING WITHOUT LIMITATION, ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR ANY PARTICULAR PURPOSE, NON-INFRINGEMENT, AND ANY WARRANTY ARISING OUT OF THE
INFORMATION CONTAINED HEREIN, INCLUDING WITHOUT LIMITATION, ANY PRODUCTS, SPECIFICATIONS, OR OTHER MATERIALS REFERENCED
HEREIN. INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT THIS DOCUMENT IS FREE FROM ERRORS, OR THAT ANY PRODUCTS OR OTHER
TECHNOLOGY DEVELOPED IN CONFORMANCE WITH THIS DOCUMENT WILL PERFORM IN THE INTENDED MANNER, OR WILL BE FREE FROM
INFRINGEMENT OF THIRD PARTY PROPRIETARY RIGHTS, AND INTEL, AND ITS CONTRIBUTORS DISCLAIM ALL LIABILITY THEREFOR.
INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT ANY PRODUCT REFERENCED HEREIN OR ANY PRODUCT OR TECHNOLOGY DEVELOPED IN
RELIANCE UPON THIS DOCUMENT, IN WHOLE OR IN PART, WILL BE SUFFICIENT, ACCURATE, RELIABLE, COMPLETE, FREE FROM DEFECTS OR SAFE FOR
ITS INTENDED PURPOSE, AND HEREBY DISCLAIM ALL LIABILITIES THEREFOR. ANY PERSON MAKING, USING OR SELLING SUCH PRODUCT OR
TECHNOLOGY DOES SO AT HIS OR HER OWN RISK.
Licenses may be required. Intel, its contributors and others may have patents or pending patent applications, trademarks, copyrights or other
intellectual property rights covering subject matter contained or described in this document. No license, express or implied, by estoppel or otherwise,
to any intellectual property rights of Intel or any other party is granted herein. It is your responsibility to seek licenses for such intellectual property
rights from Intel and others where appropriate.
Limited License Grant. Intel hereby grants you a limited copyright license to copy this document for your use and internal distribution only. You may not
distribute this document externally, in whole or in part, to any other person or entity.
LIMITED LIABILITY. IN NO EVENT SHALL INTEL, OR ITS CONTRIBUTORS HAVE ANY LIABILITY TO YOU OR TO ANY OTHER THIRD PARTY, FOR ANY LOST
PROFITS, LOST DATA, LOSS OF USE OR COSTS OF PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, OR FOR ANY DIRECT, INDIRECT, SPECIAL OR
CONSEQUENTIAL DAMAGES ARISING OUT OF YOUR USE OF THIS DOCUMENT OR RELIANCE UPON THE INFORMATION CONTAINED HEREIN, UNDER ANY
CAUSE OF ACTION OR THEORY OF LIABILITY, AND IRRESPECTIVE OF WHETHER INTEL, OR ANY CONTRIBUTOR HAS ADVANCE NOTICE OF THE
POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS SHALL APPLY NOTWITHSTANDING THE FAILURE OF THE ESSENTIAL PURPOSE OF ANY LIMITED
REMEDY.
Intel and Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright 2007, Intel Corporation. All Rights Reserved.

Contents

Lab 1: Identifying Parallelism
Lab 2: Introducing Threads
Lab 3: Domain Decomposition with OpenMP
Lab 4: Critical Sections and Reductions with OpenMP
Lab 5: Implementing Task Decompositions
Lab 6: Analyzing Parallel Performance
Lab 7: Improving Parallel Performance
Lab 8: Choosing the Appropriate Thread Model
Instructor's Notes and Solutions


Lab 1: Identifying Parallelism


Time Required

Thirty minutes

Part A
For each of the following code segments, draw a dependence graph and determine whether the
computation is suitable for parallelization. If the computation is suitable for parallelization, decide
how it should be divided among three CPUs. You may assume that all functions are free of side
effects.

Example 1:
for (i = 0; i < 4; i++) {
   a[i] = 0.25 * i;
   b[i] = 4.0 / (a[i] * a[i]);
}

Example 2:
if (a < b) c = f(-1);
else if (a == b) c = f(0);
else c = f(1);

Example 3:
for (i = 0; i < 4; i++)
   for (j = 0; j < 3; j++)
      a[i][j] = f(a[i][j] * b[j]);

Example 4:
prime = 2;
do {
   first = prime * prime;
   for (i = first; i < 10; i += prime)
      marked[i] = 1;
   while (marked[++prime]);
} while (prime * prime < N);


Example 5:
switch (i) {
   case 0:
      a = f(x);
      b = g(y);
      break;
   case 1:
      a = g(x);
      b = f(y);
      break;
   case -1:
      a = f(y);
      b = f(x);
      break;
}
Example 6:
sum = 0.0;
for (i = 0; i < 9; i++)
   sum = sum + b[i];


Part B
Describe how parallelism could be used to reduce the time needed to perform each of the following
tasks.

Example 7:
A relational database table contains (among other things) student ID numbers and their cumulative
GPAs. Find out the percentage of students with a cumulative GPA greater than 3.5.

Example 8:
A ray-tracing program renders a realistic image by tracing one or more rays for each pixel of the
display window.

Example 9:
An operating system utility searches a disk and identifies every text file containing a particular
phrase specified by the user.

Example 10:
We want to improve a game similar to Civilization IV by reducing the amount of time the human
player must wait for the virtual world to be set up.


Lab 2: Introducing Threads


Time Required

Thirty minutes

For each of the following programs or program segments:

1. determine whether the best parallelization approach is a domain decomposition or a task
   decomposition;

2. decide whether the best thread model is the fork/join model or the general threads model;

3. determine fork/join points (in the case of the fork/join model) or thread creation points (in
   the case of the general threads model); and

4. decide which variables should be shared and which variables should be private.

Example 1:
/* Matrix multiplication */
int i, j, k;
double **a, **b, **c, tmp;
...
for (i = 0; i < m; i++)
   for (j = 0; j < n; j++) {
      tmp = 0.0;
      for (k = 0; k < p; k++)
         tmp += a[i][k] * b[k][j];
      c[i][j] = tmp;
   }

Example 2:
/* This program implements an Internet-based service that
responds to number-theoretic queries */
int main() {
request r;
...
while(1) {
next_request(&r);
acknowledge_request (r);
switch (r.type) {


case PRIME:

primality_test (r);
break;
case PERFECT: perfect_test (r);
break;
case WARING: find_waring_integer (r);
break;
}
}
...
}

Example 3:
double inner_product (double *x, double *y, int n)
{
int i;
double result;
result = 0.0;
for (i = 0; i < n; i++)
result += x[i] * y[i];
return result;
}
int main (int argc, char *argv[])
{
   double *d, *g, *t, w, x, y, z;
   int i, n;
   ...
   for (i = 0; i < n; i++)
      d[i] = -g[i] + (w/x) * d[i];
   y = inner_product (d, g, n);
   z = inner_product (d, t, n);
   ...
}

Example 4:
/* Finite difference method to solve string vibration
problem (from Michael J. Quinn, Parallel Programming
in C with MPI and OpenMP, p. 325) */
#include <stdio.h>
#include <math.h>


#define F(x) sin(3.14159*(x))
#define G(x) 0.0
#define a    1.0
#define c    2.0
#define m    2000
#define n    1000
#define T    1.0

int main (int argc, char *argv[])
{
   float h;
   int   i, j;
   float k;
   float L;
   float u[m+1][n+1];

   h = a / n;
   k = T / m;
   L = (k*c/h)*(k*c/h);
   for (j = 0; j <= m; j++) u[j][0] = u[j][n] = 0.0;
   for (i = 1; i < n; i++) u[0][i] = F(i*h);
   for (i = 1; i < n; i++)
      u[1][i] = (L/2.0)*(u[0][i+1] + u[0][i-1]) +
                (1.0 - L) * u[0][i] + k * G(i*h);
   for (j = 1; j < m; j++)
      for (i = 1; i < n; i++)
         u[j+1][i] = 2.0*(1.0 - L) * u[j][i] +
                     L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
   for (j = 0; j <= m; j++) {
      for (i = 0; i <= n; i++) printf ("%6.3f ", u[j][i]);
      putchar ('\n');
   }
   return 0;
}


Lab 3: Domain Decomposition with OpenMP


Time Required

Fifty minutes

For each of the programs below:

1. make the program parallel by adding the appropriate OpenMP pragmas;

2. compile the program;

3. execute the program for 1, 2, 3, and 4 threads; and

4. check the program outputs to verify they are the same.

Note: You will need to generate matrices for the matrix multiplication exercise; a utility
program gen.c is included in the lab folder for this purpose. Compile this code, and run it
to create files matrix_a and matrix_b; explicit usage is outlined in the code itself. Be sure
to generate a workload sufficiently large (e.g., matrix dimensions 1000 x 1000) to be
meaningful.

Program 1:
/*
 *   Matrix multiplication
 */

#include <stdio.h>
#include <stdlib.h>

/*
 *   Function 'rerror' is called when the program detects an
 *   error and wishes to print an appropriate message and exit.
 */

void rerror (char *s)
{
   printf ("%s\n", s);
   exit (-1);
}

/*
 *   Function 'allocate_matrix', passed the number of rows and columns,
 *   allocates a two-dimensional matrix of floats.
 */

void allocate_matrix (float ***subs, int rows, int cols)
{
   int   i;
   float *lptr, *rptr;
   float *storage;


   storage = (float *) malloc (rows * cols * sizeof(float));
   *subs = (float **) malloc (rows * sizeof(float *));
   for (i = 0; i < rows; i++)
      (*subs)[i] = &storage[i*cols];
   return;
}

/*
 *   Given the name of a file containing a matrix of floats, function
 *   'read_matrix' opens the file and reads its contents.
 */

void read_matrix (
   char  *s,        /* File name */
   float ***subs,   /* 2D submatrix indices */
   int   *m,        /* Number of rows in matrix */
   int   *n)        /* Number of columns in matrix */
{
   char  error_msg[80];
   FILE  *fptr;     /* Input file pointer */

   fptr = fopen (s, "r");
   if (fptr == NULL) {
      sprintf (error_msg, "Can't open file '%s'", s);
      rerror (error_msg);
   }
   fread (m, sizeof(int), 1, fptr);
   fread (n, sizeof(int), 1, fptr);
   allocate_matrix (subs, *m, *n);
   fread ((*subs)[0], sizeof(float), *m * *n, fptr);
   fclose (fptr);
   return;
}

/*
 *   Passed a pointer to a two-dimensional matrix of floats and
 *   the dimensions of the matrix, function 'print_matrix' prints
 *   the matrix elements to standard output. If the matrix has more
 *   than 10 columns, the output may not be easy to read.
 */

void print_matrix (float **a, int rows, int cols)
{
   int i, j;

   for (i = 0; i < rows; i++) {
      for (j = 0; j < cols; j++)
         printf ("%6.2f ", a[i][j]);
      putchar ('\n');
   }
   putchar ('\n');
   return;
}


/*
 *   Function 'matrix_multiply' multiplies two matrices containing
 *   floating-point numbers.
 */

void matrix_multiply (float **a, float **b, float **c,
                      int arows, int acols, int bcols)
{
   int   i, j, k;
   float tmp;

   for (i = 0; i < arows; i++)
      for (j = 0; j < bcols; j++) {
         tmp = 0.0;
         for (k = 0; k < acols; k++)
            tmp += a[i][k] * b[k][j];
         c[i][j] = tmp;
      }
   return;
}

int main (int argc, char *argv[])
{
   int   m1, n1;     /* Dimensions of matrix 'a' */
   int   m2, n2;     /* Dimensions of matrix 'b' */
   float **a, **b;   /* Two matrices being multiplied */
   float **c;        /* Product matrix */

   read_matrix ("matrix_a", &a, &m1, &n1);
   print_matrix (a, m1, n1);
   read_matrix ("matrix_b", &b, &m2, &n2);
   print_matrix (b, m2, n2);
   if (n1 != m2) rerror ("Incompatible matrix dimensions");
   allocate_matrix (&c, m1, n2);
   matrix_multiply (a, b, c, m1, n1, n2);
   print_matrix (c, m1, n2);
   return 0;
}

Program 2:
/*
 *   Polynomial Interpolation
 *
 *   This program demonstrates a function that performs polynomial
 *   interpolation. The function is taken from "Numerical Recipes
 *   in C", Second Edition, by William H. Press, Saul A. Teukolsky,
 *   William T. Vetterling, and Brian P. Flannery.
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 20        /* Number of function sample points */


#define X 14.5      /* Interpolate at this value of x */

/* Function 'vector' is used to allocate vectors with subscript
   range v[nl..nh] */
double *vector (long nl, long nh)
{
   double *v;

   v = (double *) malloc((nh-nl+2)*sizeof(double));
   return v-nl+1;
}
/* Function 'free_vector' is used to free up memory allocated
with function 'vector' */
void free_vector(double *v, long nl, long nh)
{
free ((char *) (v+nl-1));
}
/* Function 'polint' performs a polynomial interpolation */
void polint (double xa[], double ya[], int n, double x, double *y, double
*dy)
{
int i, m, ns=1;
double den,dif,dift,ho,hp,w;
double *c, *d;
dif = fabs(x-xa[1]);
c = vector(1,n);
d = vector(1,n);
for (i=1; i <= n; i++) {
dift = fabs (x - xa[i]);
if (dift < dif) {
ns = i;
dif = dift;
}
c[i] = ya[i];
d[i] = ya[i];
}
*y = ya[ns--];
for (m = 1; m < n; m++) {
for (i = 1; i <= n-m; i++) {
ho = xa[i] - x;
hp = xa[i+m] - x;
w = c[i+1] - d[i];
den = ho - hp;
den = w / den;
d[i] = hp * den;
c[i] = ho * den;
}
*y += (*dy=(2*ns < (n-m) ? c[ns+1] : d[ns--]));


}
free_vector (d, 1, n);
free_vector (c, 1, n);
}
/* Functions 'sign' and 'init' are used to initialize the
x and y vectors holding known values of the function.
*/
int sign (int j)
{
if (j % 2 == 0) return 1;
else return -1;
}
void init (int i, double *x, double *y)
{
int j;
*x = (double) i;
*y = sin(i);
}
/* Function 'main' demonstrates the polynomial interpolation function
by generating some test points and then calling 'polint' with a
value of x between two of the test points. */
int main (int argc, char *argv[])
{
double x, y, dy;
double *xa, *ya;
int i;
xa = vector (1, N);
ya = vector (1, N);
/* Initialize xa's and ya's */
for (i = 1; i <= N; i++) {
init (i, &xa[i], &ya[i]);
printf ("f(%4.2f) = %13.11f\n", xa[i], ya[i]);
}
/* Interpolate polynomial at X */
polint (xa, ya, N, X, &y, &dy);
printf ("\nf(%6.3f) = %13.11f with error bound %13.11f\n", X, y,
fabs(dy));
free_vector (xa, 1, N);
free_vector (ya, 1, N);
return 0;
}


Lab 4: Critical Sections and Reductions with OpenMP

Time Required

Twenty minutes

Exercise 1
Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the
program, execute it on 1 and 2 threads, and make sure the program output is the same as the
sequential program. Finally, compare the execution times of the sequential, single-threaded, and
double-threaded programs.

/*
 *   A small college is thinking of instituting a six-digit student ID
 *   number. It wants to know how many "acceptable" ID numbers there
 *   are. An ID number is "acceptable" if it has no two consecutive
 *   identical digits and the sum of the digits is not 7, 11, or 13.
 *   024332 is not acceptable because of the repeated 3s.
 *   204124 is not acceptable because the digits add up to 13.
 *   304530 is acceptable.
 */

#include <stdio.h>

/*
 *   Function "no_problem_with_digits" extracts the digits from
 *   the ID number from right to left, making sure that there are
 *   no repeated digits and that the sum of the digits is not 7,
 *   11, or 13.
 */

int no_problem_with_digits (int i)
{
   int j;
   int latest;   /* Digit currently being examined */
   int prior;    /* Digit to the right of "latest" */
   int sum;      /* Sum of the digits */

   prior = -1;
   sum = 0;
   for (j = 0; j < 6; j++) {
      latest = i % 10;
      if (latest == prior) return 0;
      sum += latest;
      prior = latest;
      i /= 10;
   }
   if ((sum == 7) || (sum == 11) || (sum == 13)) return 0;
   return 1;


}
/*
 *   Function "main" iterates through all possible six-digit ID
 *   numbers (integers from 0 to 999999), counting the ones that
 *   meet the college's definition of "acceptable."
 */

int main (void)
{
   int count;   /* Count of acceptable ID numbers */
   int i;

   count = 0;
   for (i = 0; i < 1000000; i++)
      if (no_problem_with_digits (i)) count++;
   printf ("There are %d acceptable ID numbers\n", count);
   return 0;
}

Exercise 2
Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the
program, execute it on 1 and 2 threads, and make sure the program output is the same as the
sequential program. Finally, compare the execution times of the sequential, single-threaded, and
double-threaded programs.

/*
 *   This program uses the Sieve of Eratosthenes to determine the
 *   number of prime numbers less than or equal to 'n'.
 *   Adapted from code appearing in Parallel Programming in C with
 *   MPI and OpenMP, by Michael J. Quinn, McGraw-Hill (2004).
 */

#include <stdio.h>
#include <stdlib.h>
#define MIN(a,b) ((a)<(b)?(a):(b))

int main (int argc, char *argv[])
{
   int  count;     /* Prime count */
   int  first;     /* Index of first multiple */
   int  i;
   int  index;     /* Index of current prime */
   char *marked;   /* Marks for 2,...,'n' */
   int  n;         /* Sieving from 2, ..., 'n' */
   int  prime;     /* Current prime */

   if (argc != 2) {
      printf ("Command line: %s <n>\n", argv[0]);
      exit (1);
   }


   n = atoi(argv[1]);
   marked = (char *) malloc (n-1);
   if (marked == NULL) {
      printf ("Cannot allocate enough memory\n");
      exit (1);
   }
   for (i = 0; i < n-1; i++) marked[i] = 1;
   index = 0;
   prime = 2;
   do {
      first = prime * prime - 2;
      for (i = first; i < n-1; i += prime) marked[i] = 0;
      while (!marked[++index]);
      prime = index + 2;
   } while (prime * prime <= n);
   count = 0;
   for (i = 0; i < n-1; i++)
      count += marked[i];
   printf ("There are %d primes less than or equal to %d\n", count, n);
   return 0;
}

Exercise 3

The Monte Carlo method refers to the use of statistical sampling to solve a problem. Some experts
say that more than half of all supercomputing cycles are devoted to Monte Carlo computations. A
Monte Carlo program can benefit from parallel processing in two ways. Parallel processing can be
used to reduce the time needed to find a solution of a particular resolution. The other use of parallel
processing is to find a more accurate solution in the same amount of time. This assignment is to
reduce the time needed to find a solution of a particular accuracy. The following C program uses the
Monte Carlo method to come up with an approximation to pi. Add OpenMP directives to make the
program suitable for execution on multiple threads. Divide the number of points to be generated
evenly among the threads. Compare the execution times of the sequential, single-threaded, and
double-threaded programs.

/*
 *   This program uses the Monte Carlo method to come up with an
 *   approximation to pi. Taken from Parallel Programming in C with
 *   MPI and OpenMP, by Michael J. Quinn, McGraw-Hill (2004).
 */

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
   int count;              /* Points inside unit circle */
   int i;
   int samples;            /* Number of points to generate */
   unsigned short xi[3];   /* Random number seed */


   double x, y;            /* Coordinates of point */

   /* Number of points and 3 random number seeds are command-line
      arguments. */
   if (argc != 5) {
      printf ("Command-line syntax: %s <samples> <seed> <seed> <seed>\n",
              argv[0]);
      exit (-1);
   }
   samples = atoi (argv[1]);
   count = 0;
   xi[0] = atoi(argv[2]);
   xi[1] = atoi(argv[3]);
   xi[2] = atoi(argv[4]);
   for (i = 0; i < samples; i++) {
      x = erand48(xi);
      y = erand48(xi);
      if (x*x + y*y <= 1.0) count++;
   }
   printf ("Estimate of pi: %7.5f\n", 4.0 * count / samples);
   return 0;
}


Lab 5: Implementing Task Decompositions


Time Required

Sixty minutes

Exercise 1
Make this quicksort program parallel by adding the appropriate OpenMP pragmas and clauses.
Compile the program, execute it on 1 and 2 threads, and make sure the program is still correctly
sorting the elements of array A. Finally, compare the execution times of the sequential, single-threaded, and double-threaded programs.

/*
 *   Stack-based Quicksort
 *
 *   The quicksort algorithm works by repeatedly dividing unsorted
 *   sub-arrays into two pieces: one piece containing the smaller
 *   elements and the other piece containing the larger elements.
 *   The splitter element, used to subdivide the unsorted sub-array,
 *   ends up in its sorted location. By repeating this process on
 *   smaller and smaller sub-arrays, the entire array gets sorted.
 *   The typical implementation of quicksort uses recursion. This
 *   implementation replaces recursion with iteration. It manages its
 *   own stack of unsorted sub-arrays. When the stack of unsorted
 *   sub-arrays is empty, the array is sorted.
 */

#include <stdio.h>
#include <stdlib.h>

#define MAX_UNFINISHED 1000    /* Maximum number of unsorted sub-arrays */

/* Global shared variables */

struct {
   int first;                  /* Low index of unsorted sub-array */
   int last;                   /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED];  /* Stack */

int unfinished_index;          /* Index of top of stack */

float *A;                      /* Array of elements to be sorted */
int   n;                       /* Number of elements in A */

/* Function 'swap' is called when we want to exchange two array elements */


void swap (float *x, float *y)
{


float tmp;
tmp = *x;
*x = *y;
*y = tmp;
}
/* Function 'partition' actually does the sorting by dividing an
Unsorted sub-array into two parts: those less than or equal to the
splitter, and those greater than the splitter. The splitter is the
last element in the unsorted sub-array. The splitter ends up in its
final, sorted location. The function returns the final location of
the splitter (its index). This code is an implementation of the
algorithm appearing in Introduction to Algorithms, Second Edition,
by Cormen, Leiserson, Rivest, and Stein (The MIT Press, 2001). */
int partition (int first, int last)
{
int i, j;
float x;
x = A[last];
i = first - 1;
for (j = first; j < last; j++)
if (A[j] <= x) {
i++;
swap (&A[i], &A[j]);
}
swap (&A[i+1], &A[last]);
return (i+1);
}
/* Function 'quicksort' repeatedly retrieves the indices of unsorted
sub-arrays from the stack and calls 'partition' to divide these
sub-arrays into two pieces. It keeps one of the pieces and puts the
other piece on the stack of unsorted sub-arrays. Eventually it ends
up with a piece that doesn't need to be sorted. At this point it
gets the indices of another unsorted sub-array from the stack. The
function continues until the stack is empty. */
void quicksort (void)
{
int first;
int last;
int my_index;
int q;
/* Split point in array */
while (unfinished_index >= 0) {
my_index = unfinished_index;
unfinished_index--;
first = unfinished[my_index].first;
last = unfinished[my_index].last;
while (first < last) {


/* Split unsorted array into two parts */


q = partition (first, last);
/* Put upper portion on stack of unsorted sub-arrays */
if ((unfinished_index+1) >= MAX_UNFINISHED) {
printf ("Stack overflow\n");
exit (-1);
}
unfinished_index++;
unfinished[unfinished_index].first = q+1;
unfinished[unfinished_index].last = last;
/* Keep lower portion for next iteration of loop */
last = q-1;
}
}
}
/* Function 'print_float_array', given the address and length of an
Array of floating-point values, prints the values to standard
output, one element per line. */
void print_float_array (float *A, int n)
{
int i;
printf ("Contents of array:\n");
for (i = 0; i < n; i++)
printf ("%6.4f\n", A[i]);
}
/* Function 'verify_sorted' returns 1 if the elements of array 'A'
are in monotonically increasing order; it returns 0 otherwise. */
int verify_sorted (float *A, int n)
{
int i;
for (i = 0; i < n-1; i++)
if (A[i] > A[i+1]) return 0;
return 1;
}
/* Function 'main' gets the array size and random number seed from
the command line, initializes the array, prints the unsorted array,
sorts the array, and prints the sorted array. */
int main (int argc, char *argv[])
{
int i;
int seed;               /* Seed component input by user */
unsigned short xi[3];   /* Random number seed */


if (argc != 3) {
printf ("Command-line syntax: %s <n> <seed>\n", argv[0]);
exit (-1);
}
seed = atoi (argv[2]);
xi[0] = xi[1] = xi[2] = seed;
n = atoi (argv[1]);
A = (float *) malloc (n * sizeof(float));
for (i = 0; i < n; i++)
A[i] = erand48(xi);
/*
print_float_array (A, n);
*/
unfinished[0].first = 0;
unfinished[0].last = n-1;
unfinished_index = 0;
quicksort ();
/*
print_float_array (A, n);
*/
if (verify_sorted (A, n)) printf ("Elements are sorted\n");
else printf ("ERROR: Elements are NOT sorted\n");
return 0;
}


Lab 6: Analyzing Parallel Performance


Time Required

Thirty-five minutes

Exercise 1
You are responsible for maintaining a library of core functions used by a wide variety of programs in
an application suite. Your supervisor has noted the availability of multi-core processors and wants to
know whether rewriting the library of functions using threads would significantly improve the
performance of the programs in the application suite. What do you need to do to provide a
meaningful answer?

Exercise 2
Somebody wrote an OpenMP program to solve the problem posed in Lab 5 and benchmarked its
performance sorting 25 million keys. Here are the run times of the program, as reported by the
command-line utility time:

Threads      Run Time (sec)
   1              8.535
   2             21.183
   3             22.184
   4             25.060

What is the efficiency of the multithreaded program for 2, 3, and 4 threads? What can you conclude
about the design of the parallel program? Can you offer any suggestions for improving the
performance of the program?

Exercise 3
A co-worker has been working on converting a sequential program into a multithreaded program. At
this point, only some of the functions of the program have been made parallel. On a key data set,
the multithreaded program exhibits these execution times:

Processors   Time (sec)
    1           5.34
    2           3.74
    3           3.31
    4           3.10

Is your co-worker on the right track? Would you advise your co-worker to continue the
parallelization effort?


Exercise 4
You've worked hard to convert a key application to multithreaded execution, and you've
benchmarked it on a quad-core processor. Here are the results:

Threads      Time (sec)
   1            24.3
   2            14.6
   3            11.7
   4            10.6

Suppose an 8-core version of the processor becomes available.


(a) Predict the execution time of this algorithm on an 8-core processor.
(b) Give a reason why the actual speedup may be lower than expected.
(c) Give two reasons why the actual speedup may be higher than expected.

Exercise 5
You have benchmarked your multithreaded application on a system with CPU A, and it exhibits this
performance:

Threads      Time (sec)
   1           14.20
   2            7.81
   3            5.87
   4            4.72

Next you benchmark the same application on an otherwise identical system that has been upgraded
with a newer processor, CPU B, and it exhibits this performance:

Threads      Time (sec)
   1           11.83
   2            7.01
   3            5.42
   4            4.59

CPU B is clearly faster than CPU A: the execution times are lower in every case. However, the
single-processor performance is improved by 20% with CPU B, while with four processors engaged
the parallel program is only about 3% faster. Explain how this can happen.


Exercise 6
Hard disk drives continue to improve in speed at a slower rate than microprocessors. What are the
implications of this trend for developers of multithreaded applications? What can be done about it?


Lab 7: Improving Parallel Performance


Time Required

Forty-five minutes

Exercise 1
Recall that the parallel quicksort program developed in Lab 5 exhibited poor performance because of
excessive contention among the tasks for access to the shared stack containing the indices of
unsorted sub-arrays. You can dramatically improve the performance by reducing the frequency at
which threads access the shared stack.
One way to reduce accesses to the shared stack is to switch to sequential quicksort for sub-arrays
smaller than a threshold size. In other words, when a thread encounters a sub-array smaller than
the threshold size and partitions it into two pieces, it does not put one piece on the stack and work
on the remaining piece. Instead, it sorts both pieces itself by recursively calling the sequential
quicksort function.
Use this strategy, and the sequential quicksort function given below, to improve the performance of
the parallel quicksort program you developed in Lab 5. Run some experiments to determine the best
threshold size for switching to sequential quicksort.

void seq_quicksort (int first, int last)
{
   int q;   /* Split point in array */

   if (first < last) {
      q = partition (first, last);
      seq_quicksort (first, q-1);
      seq_quicksort (q+1, last);
   }
}
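
One possible shape for the modified work loop is sketched below. It is only an illustration:
THRESHOLD is a tuning constant you must determine experimentally, and any bookkeeping your
Lab 5 solution performs (for example, a global count of sorted elements) still has to be done for
the pieces that are sorted locally.

   #define THRESHOLD 1000   /* illustrative value; tune experimentally */

   /* Inside the parallel quicksort's work loop, once a sub-array
      [first, last] has been taken from the shared stack: */
   while (first < last) {
      if (last - first + 1 < THRESHOLD) {
         /* Small sub-array: partition it and sort both pieces locally,
            without putting anything on the shared stack. */
         q = partition (first, last);
         seq_quicksort (first, q-1);
         seq_quicksort (q+1, last);
         break;
      }
      q = partition (first, last);
      /* As in Lab 5: push [q+1, last] on the shared stack inside a
         critical section, and keep [first, q-1] for the next iteration. */
      last = q - 1;
   }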

Exercise 2
The following C program counts the number of primes less than n. Use OpenMP pragmas and clauses
to enable it to run on a multiprocessor. Make as many changes as you can in the time allowed to
improve the performance of the program on the maximum available number of processors.

/*
 *   This C program counts the number of primes between 2 and n.
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

/* Passed a positive integer p, function is_prime returns 1 if
   p is prime and 0 if p is not prime. */


int is_prime (int p)
{
   int i;

   if (p < 2) return 0;
   i = 2;
   while (i*i <= p) {
      if (p % i == 0) return 0;
      i++;
   }
   return 1;
}
int main (int argc, char *argv[])
{
   int *a;      /* Keeps track of which numbers are primes */
   int count;   /* Number of primes between 2 and n */
   int i;
   int n;       /* We're finding primes up through this number */
   int t;       /* Desired number of concurrent threads */

   /* Get problem size and number of threads */
   if (argc != 3) {
      printf ("Command line syntax: %s <n> <threads>\n",
              argv[0]);
      exit (-1);
   }
   n = atoi(argv[1]);
   t = atoi(argv[2]);
   omp_set_num_threads (t);
   a = (int *) malloc (n * sizeof(int));
   for (i = 0; i < n; i++)
      a[i] = is_prime(i);
   count = 0;
   for (i = 0; i < n; i++)
      count += a[i];
   printf ("There are %d primes less than %d\n", count, n);
   return 0;
}


Lab 8: Choosing the Appropriate Thread Model

Time Required

Forty minutes

For each of the following problems, decide if it would be more suitable for a parallel program based
on OpenMP or a parallel program based on Win32/Java/POSIX threads.

Problem 1
You are working on a software package that will be able to evaluate a wide variety of complicated
integrals, such as:

   ∬ 4 tan(x) cos(y/x) dy dx.
There are a wide variety of techniques to evaluate integrals, and it is impossible to determine in
advance which technique will be most effective in solving a particular problem. Your parallel program
will simultaneously attempt many techniques, stopping as soon as one technique has successfully
evaluated the integral. (Example from Practical Parallel Programming by Gregory V. Wilson, The MIT
Press, 1995)

Problem 2
A radiation source emits neutrons that hit a homogeneous plate. The plate may reflect the neutron,
absorb it, or allow it to be transmitted. Your parallel program will use a Monte Carlo method to
estimate the probability of each of these three outcomes, based on the plate's make-up and
thickness. The program will simulate the paths of millions of particles, and it will keep track of how
many particles have each of the three possible outcomes. In the course of the simulation, some
particles are reflected immediately, while others bounce around for a long time in the plate before
finally being reflected, absorbed, or transmitted. Hence the variance in the amount of time needed
to simulate the path of a single particle is quite large. (Example from Parallel Programming in C with
MPI and OpenMP by Michael J. Quinn, McGraw-Hill, 2004)

Problem 3
You are on a team creating a parallel program that will give the user the ability to see satellite
images from many locations on Earth. The program will have to perform a wide variety of tasks,
including responding to keyboard presses and mouse clicks, translating an address into map

coordinates, retrieving relevant satellite images and map information from a database server,
displaying these images, and zooming and panning.

Problem 4
An existing sequential program converts a PostScript program into Portable Document Format (PDF).
A PostScript program consists of a prolog followed by a script. The prolog contains application-specific
function definitions. The script consists of a sequence of page descriptions. Each page
description is independent of every other page description, depending only on the definitions in the
prolog. A PDF document contains one or more pages. A page may contain text, graphics, and/or
images. Because the sequential program executes too slowly, your task is to design a parallel
program that reduces the conversion time.

Problem 5
Your team is developing a program supporting digital video editing. You are responsible for the
algorithm that divides raw video footage into scenes. The algorithm analyzes the entire video frame
by frame. A frame that is significantly different from the preceding frame should be marked as the
beginning of a new scene. To make the process as quick as possible, you've been asked to
implement a parallel version of the algorithm.


Instructor's Notes and Solutions


Lab 1: Identifying Parallelism


Part A
Example 1:
This is a straightforward loop amenable to a domain decomposition.

Example 2:
Since only one of the three branches of the if-else will ever execute, there is no point in using
more than one CPU to execute this code segment.

Example 3:
This code segment is amenable to domain decomposition. It is designed to raise the
question about whether it is better to divide the inner loop or the outer loop among the
CPUs. This is the sort of question we'll answer later in the short course.

Example 4:
The inner for loop is amenable to a domain partitioning. The iterations of the outer do-while loop cannot be executed in parallel.

Example 5:
It does no good to assign a different case to each CPU, since only one case will execute.
However, we could do a functional decomposition within each of the cases, keeping two
CPUs occupied. In other words, in case 0 one CPU could compute f(x) while another CPU
computed g(y); in case 1 one CPU could compute g(x) while another CPU computed
f(y); and in case -1 one CPU could compute f(y) while another CPU computed f(x).

Example 6:
As written, this code is not amenable to parallelization, since every addition involves the
variable sum. This ought to give the students pause, since one example of domain
decomposition in the slides was quite similar to this. We can transform this code into code
suitable for domain decomposition by giving each CPU a temporary memory location it can
use for accumulating the subtotal. This will be discussed in a subsequent lecture.
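
For reference, the transformation is typically expressed with an OpenMP reduction clause
(introduced in a later lab); the following is only a preview sketch, not part of the Lab 1 exercise:

   sum = 0.0;
   #pragma omp parallel for reduction(+:sum)
   for (i = 0; i < 9; i++)
      sum = sum + b[i];

Each thread accumulates into its own private copy of sum, and the private copies are added into
the shared sum when the loop completes.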


Part B
Example 7:

We could do a domain decomposition of the database table, giving each CPU its share of
the rows. After each CPU has come up with its subtotals (for number of students
considered and number of students with GPAs > 3.5), the CPUs could add these into two
global sums, at which point one CPU would compute the global fraction.

Example 8:

This problem sounds easier than it is. Ray-tracing is certainly amenable to domain
decomposition. The problem is that some pixels take much longer to trace than others. For
example, rays passing through glass are much more complicated than rays that strike no
objects. We want to motivate thinking about how work should be allocated to CPUs.

Example 9:

This application is amenable to a variety of different functional decompositions. One
function is to search the file system and find text files. Another function searches text files
for the phrase. Should a new process be created every time a text file is discovered?
Would it be better to have a central list of files yet to be searched and have the
text-file-scanning processes go to this list when they need work? The idea is to promote
creative thinking, not answer the question definitively.

Example 10:

This is another question designed to promote creative thinking. The world is initialized to a
random state. Is it okay if the world created by a two-CPU system is different from a world
created by a one-CPU system? Should each CPU be given an equal-sized region of the
virtual world? What if one region has more continents than another? How will we balance
the work among the CPUs? In what ways will the CPUs be able to work independently? In
what ways will the CPUs need to exchange information with each other?


Lab 2: Introducing Threads

Example 1:

The best parallelization approach is domain decomposition. The best loop to make parallel
is the outermost loop indexed by i, because then there is only a single fork/join operation,
minimizing overhead. Variables a, b, and c should be shared. Variables i, j, k, and tmp
should be private.
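
A sketch of what this looks like in OpenMP form (shown here for reference; Lab 3 develops the
pragma notation):

   #pragma omp parallel for private(j, k, tmp)
   for (i = 0; i < m; i++)
      for (j = 0; j < n; j++) {
         tmp = 0.0;
         for (k = 0; k < p; k++)
            tmp += a[i][k] * b[k][j];
         c[i][j] = tmp;
      }

The loop index i is private automatically; a, b, c, m, n, and p stay shared, and there is exactly
one fork before the i loop and one join after it.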

Example 2:

We can make this program parallel using the general threads model. Instead of the main
thread calling primality_test, perfect_test, or find_waring_integer, it forks another thread
to perform the function and return the answer to the machine originating the request.
Every created thread will have a private value of r.
Some people would call this an example of task parallelism, since different threads are
executing different functions. Others would call this domain parallelism, because the
number of threads created is proportional to the size of the data set. Since the parallelism
scales with the amount of data rather than the number of functions, the second
characterization is probably better.
The advantage of this strategy is that it enables the main thread to get quickly back to
function next_request, improving the responsiveness of the system; i.e., how quickly it
acknowledges a request. The risk is that if requests are coming in too quickly, we could
end up with far more threads than CPUs, and the time users must wait for their requests
to be handled could grow without bound. That leads to a nice discussion topic: How
could this problem be avoided?

Example 3:

Function inner_product is amenable to domain decomposition. The fork would happen at


the beginning of the for loop, and the join would happen at the end of the for loop.
Variable i would be private, and variables result, x, and y would be shared. Bright students
should realize that we could have a problem if multiple threads update result at the same
time. This is called a race condition. We'll discuss this problem in detail in the fourth
lecture.
Another approach would be to use a functional decomposition in the main function to
execute both calls to inner_product in parallel. That eliminates the problem described in
the previous paragraph, because only one thread executes each call to inner_product. If
we do a functional decomposition, variables i and result inside inner_product are private
variables.
The for loop in function main is amenable to a domain decomposition. Variable i is the only
private variable.
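
One way to express the functional decomposition of the two inner_product calls is with OpenMP
sections (a sketch for reference; a general thread library would work equally well):

   #pragma omp parallel sections
   {
      #pragma omp section
      y = inner_product (d, g, n);

      #pragma omp section
      z = inner_product (d, t, n);
   }

Each call executes on its own thread, so each call has its own copies of i and result and no race
on result can arise.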


Example 4:

This is a short but complete C program amenable to domain decomposition. The first
three for loops can be parallelized with domain decomposition. The last for loop is not
amenable to parallel execution, since we want to print the values in the correct order. The
bulk of the processing time, however, is spent in the doubly-nested loop. The execution
time of the other computational loops is trivial compared to the time spent in this loop. For
that reason, this loop ought to be made parallel first using domain decomposition.
Analyzing the doubly nested loop, we see that we can't compute row j+1 of u until we've
computed rows j and j-1. In other words, there is a data dependence from iteration j to
iteration j+1 (and from iteration j-1 to iteration j). Hence we cannot execute all the
iterations of the outer for loop in parallel. However, we can execute all the iterations of the
inner for loop indexed by i. So there should be a fork every time we get to the inner for
loop indexed by i and a join every time we finish that loop.
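
A sketch of that fork/join placement, written with an OpenMP pragma for concreteness (only the
position of the fork and join is the point here):

   for (j = 1; j < m; j++) {
      /* fork: for a fixed j, the iterations of the i loop are independent */
      #pragma omp parallel for
      for (i = 1; i < n; i++)
         u[j+1][i] = 2.0*(1.0 - L) * u[j][i] +
                     L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
      /* implicit join here, before moving on to the next value of j */
   }

Variables u, L, and j are shared; the inner loop index i is private to each thread.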


Lab 3: Domain Decomposition with OpenMP


At this point in the course, the variety of programs that the students can make parallel is
limited because they do not yet know about critical sections. That means we have to stick
with embarrassingly parallel applications in which the threads work completely
independently. Here are two programs that fit the bill. At the end I give a third alternative.

Program 1:
Matrix multiplication is a relatively easy program to make parallel. In function
matrix_multiply, the two best candidates are the outermost for loop indexed by i and the
middle for loop indexed by j. You may wish to talk with the students about the problems
with trying to make parallel the inner for loop indexed by k. To decide between the i and
the j loops, students should be thinking about grain size and how matrices are allocated in
C (row major order). Both of these considerations lead to making the outer loop the
parallel loop.
Also note that this is the standard matrix multiplication algorithm found in most textbooks.
Its weakness is that in order to compute a row of the product matrix, it references a row
of the first factor matrix and ALL of the second factor matrix. When matrices are large, the
second factor matrix is unlikely to fit completely in cache memory. This means that the
cache hit rate of this algorithm is poor. A block-oriented matrix multiplication program can
exhibit a much higher cache hit rate and significantly outperform this program.
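
If you want to demonstrate this point, a block-oriented version along the following lines can be
benchmarked against the supplied matrix_multiply. This is only a sketch; the function name
matrix_multiply_blocked and the block size BS are illustrative, and BS should be tuned for the
target processor's cache:

   #define BS 64   /* illustrative block size */

   void matrix_multiply_blocked (float **a, float **b, float **c,
                                 int arows, int acols, int bcols)
   {
      int i, j, k, ii, jj, kk;

      for (i = 0; i < arows; i++)
         for (j = 0; j < bcols; j++)
            c[i][j] = 0.0;
      /* Visit the matrices in BS x BS blocks so the pieces of a, b, and c
         being touched stay resident in cache. */
      for (ii = 0; ii < arows; ii += BS)
         for (kk = 0; kk < acols; kk += BS)
            for (jj = 0; jj < bcols; jj += BS)
               for (i = ii; i < ii+BS && i < arows; i++)
                  for (k = kk; k < kk+BS && k < acols; k++)
                     for (j = jj; j < jj+BS && j < bcols; j++)
                        c[i][j] += a[i][k] * b[k][j];
   }

The outermost ii loop can still be made parallel in the same way, with the inner indices made
private.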

Program 2:

This example is designed to give students a greater challenge. Ultimately, they should
figure out that the best candidate for parallelization is the for loop indexed by i inside the
for loop indexed by m inside function polint. In truth, unless n is very large, this program
is unlikely to benefit much from parallelization.

If these programs are unacceptable, another likely candidate would be a program to


compute the Mandelbrot set. This program has several advantages. It is short (if the
graphics routines are not counted), consumes a great deal of CPU time, produces a pretty
output, and can generate interesting questions about load balancing among threads. The
principal disadvantage of doing an exercise on the Mandelbrot set is that it has been done
so many times before.


Lab 4: Critical Sections and Reductions with OpenMP

Exercise 1

This should be an easy program for the students to parallelize using the parallel for
pragma with a reduction clause. They need to figure out that the loop to make parallel is
the one inside function main, not the one inside function no_problem_with_digits, and
they can easily rewrite the for loop to get a statement of the form count += v;
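
A sketch of the expected shape of the answer (the temporary variable v is an illustrative name):

   count = 0;
   #pragma omp parallel for reduction(+:count)
   for (i = 0; i < 1000000; i++) {
      int v = no_problem_with_digits (i);
      count += v;
   }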

Exercise 2

The final for loop is an example of a reduction. The for loop inside the do...while loop is
also amenable to parallelization. The initialization of array marked is another opportunity.
Students should test their programs for small values of n and benchmark their programs
for large values of n. There are 25 primes less than 100 and 168 primes less than 1000. A
good value of n for benchmarking is 10 million.
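
Sketches of the three opportunities mentioned above (fragments only, not a complete tuned
solution):

   /* Initialization of 'marked' */
   #pragma omp parallel for
   for (i = 0; i < n-1; i++) marked[i] = 1;

   /* Marking loop inside the do ... while ('first' and 'prime' are loop-invariant) */
   #pragma omp parallel for
   for (i = first; i < n-1; i += prime) marked[i] = 0;

   /* Final counting loop as a reduction */
   count = 0;
   #pragma omp parallel for reduction(+:count)
   for (i = 0; i < n-1; i++)
      count += marked[i];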

Exercise 3

Making this program parallel is more difficult than it may first appear. It won't work simply
to make the for loop parallel and add a reduction clause for the summation to count. The
reason is subtle: variable xi, containing the random number seeds, must be private to
each thread and must be initialized to different values for each thread. Otherwise, every
thread will generate the same sequence of random numbers.
How can each thread come up with its own random number seed? We need a way to get a
thread-dependent value for xi[0], xi[1], or xi[2]. The way to do this is through function
omp_get_thread_num(). The idea behind this exercise, then, is to help the students get to
the point where they realize they need a function like this, and then give them the
function name.
That means the entire block of code from the initialization of count to the end of the for
loop needs to be in a parallel block. So I'd suggest that, rather than using a reduction clause
inside the for loop, students create a private variable to keep track of the local count and then
simply have a critical section at the very end of the block to add the local count to the
global count.
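
A sketch of that structure; the way each thread derives its seed from the command-line seeds
(adding the thread number) and the name my_count are illustrative choices:

   count = 0;
   #pragma omp parallel private(i, x, y)
   {
      unsigned short my_xi[3];
      int my_count = 0;
      int id = omp_get_thread_num();   /* requires #include <omp.h> */

      /* Give every thread a different seed so the random streams differ. */
      my_xi[0] = xi[0] + id;
      my_xi[1] = xi[1] + id;
      my_xi[2] = xi[2] + id;

      #pragma omp for
      for (i = 0; i < samples; i++) {
         x = erand48(my_xi);
         y = erand48(my_xi);
         if (x*x + y*y <= 1.0) my_count++;
      }

      /* One critical section per thread, at the very end of the block. */
      #pragma omp critical
      count += my_count;
   }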


Lab 5: Implementing Task Decompositions

Exercise 1

This is a challenging exercise, but it can teach students a lot about the issues they must
confront when making a program parallel. When students get their parallel program
executing correctly, they should feel a real sense of satisfaction. In order to get the
program working correctly, students must clear three hurdles. The first one is easy; the
second and third are tougher.
First, students must ensure that all accesses to the shared stack of unsorted sub-arrays
occur inside critical sections. All of these accesses occur inside function quicksort. It's
easy for students to make the critical sections too small.
Second, students must realize that just because there is something on the stack when a
thread tests unfinished_index in the outer while loop's condition, that doesn't mean there
will still be something on the stack when the thread enters the while loop. In other words,
there is a race condition. So the code for getting something off the stack must be put
inside an if statement. If the stack index is < 0, then first and last must be given values
such that function partition doesn't get called.
Finally, students need to figure out that a thread can't simply stop when there is no work
to do. Otherwise, there is a good chance all but one of the threads will exit quicksort
while one thread does the very first partitioning step. The loop condition must be rewritten
so that threads keep looking for work as long as the array is unsorted. How do we know
when the array is sorted? We know that every time we call function partition, one more
array element is put in the right place. We also know that every time an unsorted
sub-array has exactly one element in it, that element is in the right place. So we need to
create a global counter that keeps track of how many elements are in their sorted
positions. Threads should exit quicksort only when this global counter reaches n.
Making the program parallel by adding a new global variable, new logic, and OpenMP
directives adds 30-40 lines to its length. With appropriate hints, students should be able to
complete this exercise in about an hour.
Solution:
/*
 *   Stack-based Quicksort
 *
 *   The quicksort algorithm works by repeatedly dividing unsorted
 *   sub-arrays into two pieces: one piece containing the smaller
 *   elements and the other piece containing the larger elements.
 *   The splitter element, used to subdivide the unsorted sub-array,
 *   ends up in its sorted location. By repeating this process on
 *   smaller and smaller sub-arrays, the entire array gets sorted.
 *   The typical implementation of quicksort uses recursion. This
 *   implementation replaces recursion with iteration. It manages its
 *   own stack of unsorted sub-arrays. When the stack of unsorted
 *   sub-arrays is empty, the array is sorted.
 */

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define MAX_UNFINISHED 1000    /* Maximum number of unsorted sub-arrays */

/* Global shared variables */

struct {
   int first;                  /* Low index of unsorted sub-array */
   int last;                   /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED];  /* Stack */

int unfinished_index;          /* Index of top of stack */

float *A;                      /* Array of elements to be sorted */
int   n;                       /* Elements in A */
int   num_sorted;              /* Sorted elements in A */

/* Function 'swap' is called when we want to exchange two array elements */


void swap (float *x, float *y)
{
float tmp;
tmp = *x;
*x = *y;
*y = tmp;
}
/* Function 'partition' actually does the sorting by dividing an
Unsorted sub-array into two parts: those less than or equal to the
splitter, and those greater than the splitter. The splitter is the
last element in the unsorted sub-array. The splitter ends up in its
final, sorted location. The function returns the final location of
the splitter (its index). */
int partition (int first, int last)
{
int i, j;
float x;
x = A[last];
i = first - 1;
for (j = first; j < last; j++)
if (A[j] <= x) {
i++;

Introduction to Parallel Programming


Student Workbook with Instructors Notes
46

Intel Software College


January 2007 Intel Corporation

Instructors Notes and Solutions

swap (&A[i], &A[j]);


}
swap (&A[i+1], &A[last]);
return (i+1);
}
/* Function 'quicksort' repeatedly retrieves the indices of unsorted
sub-arrays from the stack and calls 'partition' to divide these
sub-arrays into two pieces. It keeps one of the pieces and puts the
other piece on the stack of unsorted sub-arrays. Eventually it ends
up with a piece that doesn't need to be sorted. At this point it
gets the indices of another unsorted sub-array from the stack. The
function continues until the stack is empty. */
void quicksort (void)
{
int first;
int id;
int last;
int my_count;
int my_index;
int q;          /* Split point in array */
id = omp_get_thread_num();
printf ("Thread %d enters quicksort\n", id);
my_count = 0;
while (num_sorted < n) {
#pragma omp critical
{
if (unfinished_index >= 0) {
my_index = unfinished_index;
unfinished_index--;
first = unfinished[my_index].first;
last = unfinished[my_index].last;
} else {
first = 0;
last = -1;
}
}
while (first <= last) {
if (first == last) {
#pragma omp critical
num_sorted++;
my_count++;
last = first - 1;
} else {
/* Split unsorted array into two parts */
q = partition (first, last);
#pragma omp critical
num_sorted++;
my_count++;

Intel Software College


January 2007 Intel Corporation
-

Introduction to Parallel Programming


Student Workbook with Instructors Notes
47

Introduction to Parallel Programming

/* Put upper portion on stack of unsorted sub-arrays */


if ((unfinished_index+1) >= MAX_UNFINISHED) {
printf ("Stack overflow\n");
exit (-1);
}
#pragma omp critical
{
unfinished_index++;
unfinished[unfinished_index].first = q+1;
unfinished[unfinished_index].last = last;
}
/* Keep lower portion for next iteration of loop */
last = q-1;
}
}
}
printf ("Thread %d exits, having sorted %d\n", id, my_count);
}
/* Function 'print_float_array', given the address and length of an
array of floating-point values, prints the values to standard
output, one element per line. */
void print_float_array (float *A, int n)
{
int i;
printf ("Contents of array:\n");
for (i = 0; i < n; i++)
printf ("%6.4f\n", A[i]);
}
/* Function 'verify_sorted' returns 1 if the elements of array 'A'
are in monotonically increasing order; it returns 0 otherwise. */
int verify_sorted (float *A, int n)
{
int i;
for (i = 0; i < n-1; i++)
if (A[i] > A[i+1]) return 0;
return 1;
}
/* Function 'main' gets the array size and random number seed from
the command line, initializes the array, prints the unsorted array,
sorts the array, and prints the sorted array. */
int main (int argc, char *argv[])
{
int i;
int seed;                /* Seed component input by user */
unsigned short xi[3];    /* Random number seed */
int t;                   /* Number of threads */

if (argc != 4) {
printf ("Command-line syntax: %s <n> <threads> <seed>\n",
argv[0]);
exit (-1);
}
seed = atoi (argv[3]);
xi[0] = xi[1] = xi[2] = seed;
t = atoi(argv[2]);
omp_set_num_threads (t);
n = atoi (argv[1]);
A = (float *) malloc (n * sizeof(float));
for (i = 0; i < n; i++)
A[i] = erand48(xi);
/*
print_float_array (A, n);
*/
unfinished[0].first = 0;
unfinished[0].last = n-1;
unfinished_index = 0;
num_sorted = 0;
#pragma omp parallel
quicksort ();
/*
print_float_array (A, n);
*/
if (verify_sorted (A, n)) printf ("Elements are sorted\n");
else printf ("ERROR: Elements are NOT sorted\n");
return 0;
}


Lab 6: Analyzing Parallel Performance

Exercise 1

The first step should be to profile the programs in the application suite and see what
percentage of their execution time is spent inside the core functions. If not enough time is
spent in the core functions, making them execute faster will not significantly reduce the
overall execution time. We can use Amdahl's Law to estimate the maximum improvement.
For example, suppose 40% of an application's execution time is spent in core functions,
and we get all our core functions to execute twice as fast. Then the maximum speedup of
the application is

1 / (0.6 + 0.4/2) = 1 / 0.8 = 1.25
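To check such estimates quickly, the following small C program (illustrative only, not part of the lab code) evaluates Amdahl's Law for the example above and prints 1.25:

#include <stdio.h>

/* Amdahl's Law: maximum speedup when a fraction f of the execution time
   is accelerated by a factor s and the remaining (1 - f) is unchanged. */
double amdahl_speedup (double f, double s)
{
   return 1.0 / ((1.0 - f) + f / s);
}

int main (void)
{
   /* 40% of the time is in core functions; core functions become 2x faster */
   printf ("Maximum speedup: %.2f\n", amdahl_speedup (0.40, 2.0));
   return 0;
}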
Exercise 2

The complete table.


Threads    Run Time (sec)    Speedup    Efficiency
   1           8.535           1.00        1.00
   2          21.183           0.40        0.20
   3          22.184           0.38        0.13
   4          25.060           0.34        0.09

The efficiency drops tremendously as soon as more than one thread is executing the
program. You can see that even for 2 threads, each thread is doing productive work only
20% of the time. Since this program does not require I/O, we can conclude that the
threads are wasting a lot of time waiting for access to critical sections. In order to improve
the performance of the parallel program, we must increase the amount of useful work that
gets done between critical sections of code.
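
For reference, the speedup and efficiency columns above follow mechanically from the measured run times; this small C sketch (illustrative only, not part of the lab code) recomputes them:

#include <stdio.h>

int main (void)
{
   /* Measured run times from the table above, indexed by (threads - 1) */
   double time[] = { 8.535, 21.183, 22.184, 25.060 };
   int p;

   printf ("Threads  Speedup  Efficiency\n");
   for (p = 1; p <= 4; p++) {
      double speedup = time[0] / time[p-1];   /* T(1) / T(p) */
      double efficiency = speedup / p;        /* useful work per thread */
      printf ("%7d  %7.2f  %10.2f\n", p, speedup, efficiency);
   }
   return 0;
}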

Exercise 3

We should look at the Karp-Flatt metric and see how the experimentally determined serial
fraction is changing as threads are added.
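Recall that the Karp-Flatt experimentally determined serial fraction for p threads is computed from the measured speedup as e = (1/speedup - 1/p) / (1 - 1/p).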
Processors    Experimentally Determined Serial Fraction
    2                          0.40
    3                          0.43
    4                          0.44

The experimentally determined serial fraction is growing very slowly as threads are added;
this is very good news. The principal limiting factor to the speedup is not parallel
overhead. Instead, it's the large portion of time spent inside functions that have not yet


been made parallel. That's a good omen for continuing forward with the parallelization
effort.

Exercise 4

(a) To predict the execution time on 8 processors, we should use the Karp-Flatt metric.

Threads    Time (sec)    Speedup      e
   1          24.3         1.00       -
   2          14.6         1.66      0.20
   3          11.7         2.08      0.22
   4          10.6         2.29      0.25

The experimentally determined serial fraction e is growing about 0.025 per thread, so we
estimate its value will be 0.35 when there are 8 threads. Now we use the formula

p / (e(p - 1) + 1)

to predict a speedup of 2.32 on 8 threads.


(b) The actual speedup may be lower than this because the parallel overhead may grow at
a faster rate than the number of threads. For example, a critical section of code may
become a bottleneck, and it may do no good to add threads beyond a certain number.
(c) Speedup may be higher than expected if the additional cores bring the total amount of
cache memory up to a point where the cache hit rate rises significantly.
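
As a quick check of the arithmetic in part (a), here is a small C sketch (illustrative only, not part of the lab code) that computes e from the measured run times, uses the estimated value e = 0.35 for 8 threads, and applies the formula in part (a):

#include <stdio.h>

/* Karp-Flatt experimentally determined serial fraction */
double karp_flatt (double speedup, int p)
{
   return (1.0/speedup - 1.0/p) / (1.0 - 1.0/p);
}

int main (void)
{
   double time[] = { 24.3, 14.6, 11.7, 10.6 };  /* Run times for 1-4 threads */
   double e;
   int p;

   for (p = 2; p <= 4; p++) {
      double speedup = time[0] / time[p-1];
      printf ("p = %d  speedup = %.2f  e = %.2f\n",
              p, speedup, karp_flatt (speedup, p));
   }

   /* e grows about 0.025 per added thread, so estimate e = 0.35 at 8 threads */
   e = 0.35;
   printf ("Predicted speedup on 8 threads: %.2f\n", 8.0 / (e * (8 - 1) + 1.0));
   return 0;
}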

Exercise 5

Even though CPU B is faster than CPU A, the speed of other system components, such as
the I/O system, has not increased. Hence the fraction of program execution time devoted
to inherently sequential operations has increased, reducing the speedup that can be
achieved as processors are added.

Exercise 6

We're back to Amdahl's Law again. As the time required to execute the parallel portion of
the program shrinks, the sequential component becomes relatively more significant.
Processors and RAM are increasing in speed faster than hard disk drives. If we move a
parallel application that does some disk I/O from an older system to a newer system with
a much faster processor and a slightly faster hard disk drive, we would expect the
execution times to be lower, but we would expect the speedup curves to be lower as well.
One solution is to employ RAID technology to improve the throughput of the hard disk
drive system. Another solution is to adopt a significantly faster secondary storage scheme,
such as a solid state disk (also called solid state drive).


Lab 7: Improving Parallel Performance

Exercise 1

To complete this assignment, students must modify function quicksort in the original
parallel program to call seq_quicksort when the difference between indices last and first
is less than a particular size. Once they have their programs running, students must
experiment to determine the best threshold size. If the size is too small, there will be too
much contention for the stack. If the size is too large, there may be a significant load
imbalance among the tasks.
Here is one solution:
/*
 * Stack-based Quicksort
 *
 * The quicksort algorithm works by repeatedly dividing unsorted
 * sub-arrays into two pieces: one piece containing the smaller
 * elements and the other piece containing the larger elements.
 * The splitter element, used to subdivide the unsorted sub-array,
 * ends up in its sorted location. By repeating this process on
 * smaller and smaller sub-arrays, the entire array gets sorted.
 *
 * The typical implementation of quicksort uses recursion. This
 * implementation replaces recursion with iteration. It manages its
 * own stack of unsorted sub-arrays. When the stack of unsorted
 * sub-arrays is empty, the array is sorted.
 */

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define MAX_UNFINISHED 10000  /* Max number of unsorted sub-arrays */
#define CHUNK_SIZE 1

/* Global shared variables */

struct {
   int first;                 /* Low index of unsorted sub-array */
   int last;                  /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED]; /* Stack */

int unfinished_index;         /* Index of top of stack */

float *A;                     /* Array of elements to be sorted */
int n;                        /* Elements in A */
int num_sorted;               /* Sorted elements in A */

/* Function 'swap' is called when we want to exchange two array elements */


void swap (float *x, float *y)


{
float tmp;
tmp = *x;
*x = *y;
*y = tmp;
}
/* Function 'partition' actually does the sorting by dividing an
unsorted sub-array into two parts: those less than or equal to the
splitter, and those greater than the splitter. The splitter is the
last element in the unsorted sub-array. The splitter ends up in its
final, sorted location. The function returns the final location of
the splitter (its index). */
int partition (int first, int last)
{
int i, j;
float x;
x = A[last];
i = first - 1;
for (j = first; j < last; j++)
if (A[j] <= x) {
i++;
swap (&A[i], &A[j]);
}
swap (&A[i+1], &A[last]);
return (i+1);
}
/* Function 'seq_quicksort' implements the traditional, recursive
quicksort algorithm. It is called when we want a thread to be
responsible for sorting an entire sub-array. */
void seq_quicksort (int first, int last)
{
int q;                        /* Split point in array */
if (first < last) {
q = partition (first, last);
seq_quicksort (first, q-1);
seq_quicksort (q+1, last);
}
}
/* Function 'quicksort' repeatedly retrieves the indices of unsorted
sub-arrays from the stack and calls 'partition' to divide these
sub-arrays into two pieces. It keeps one of the pieces and puts the
other piece on the stack of unsorted sub-arrays. Eventually it ends
up with a piece that doesn't need to be sorted. At this point it

gets the indices of another unsorted sub-array from the stack. The
function continues until the stack is empty. */
void quicksort (void)
{
int first;
int id;
int last;
int my_count;
int my_index;
int q;                        /* Split point in array */
id = omp_get_thread_num();
printf ("Thread %d enters quicksort\n", id);
my_count = 0;
while (num_sorted < n) {
#pragma omp critical
{
if (unfinished_index >= 0) {
my_index = unfinished_index;
unfinished_index--;
first = unfinished[my_index].first;
last = unfinished[my_index].last;
} else {
first = 0;
last = -1;
}
}
/*
printf ("Thread %d has region %d-%d\n", id, first, last);
*/
while (first <= last) {
/*
printf ("Thread %d now partitioning %d-%d\n", id, first, last);
*/
if (first == last) {
#pragma omp critical
num_sorted++;
my_count++;
last = first - 1;
} else if ((last - first) < CHUNK_SIZE) {
seq_quicksort (first, last);
#pragma omp critical
num_sorted += (last - first + 1);
my_count += (last - first + 1);
last = first - 1;
} else {
/* Split unsorted array into two parts */
q = partition (first, last);
#pragma omp critical
num_sorted++;
my_count++;


/* Put upper portion on stack of unsorted sub-arrays */


if ((unfinished_index+1) >= MAX_UNFINISHED) {
printf ("Stack overflow\n");
exit (-1);
}
#pragma omp critical
{
unfinished_index++;
unfinished[unfinished_index].first = q+1;
unfinished[unfinished_index].last = last;
}
/* Keep lower portion for next iteration of loop */
last = q-1;
}
}
}
printf ("Thread %d exits, having sorted %d\n",
id, my_count);
}
/* Function 'print_float_array', given the address and length of an
array of floating-point values, prints the values to standard
output, one element per line. */
void print_float_array (float *A, int n)
{
int i;
printf ("Contents of array:\n");
for (i = 0; i < n; i++)
printf ("%6.4f\n", A[i]);
}
/* Function 'verify_sorted' returns 1 if the elements of array 'A'
are in monotonically increasing order; it returns 0 otherwise. */
int verify_sorted (float *A, int n)
{
int i;
for (i = 0; i < n-1; i++)
if (A[i] > A[i+1]) return 0;
return 1;
}
/* Function 'main' gets the array size and random number seed from
the command line, initializes the array, prints the unsorted array,
sorts the array, and prints the sorted array. */
int main (int argc, char *argv[])
{


int i;
int seed;                /* Seed component input by user */
unsigned short xi[3];    /* Random number seed */
int t;                   /* Number of threads */

if (argc != 4) {
printf ("Command-line syntax: %s <n> <threads> <seed>\n",
argv[0]);
exit (-1);
}
seed = atoi (argv[3]);
xi[0] = xi[1] = xi[2] = seed;
t = atoi(argv[2]);
omp_set_num_threads (t);
n = atoi (argv[1]);
A = (float *) malloc (n * sizeof(float));
for (i = 0; i < n; i++)
A[i] = erand48(xi);
/*
print_float_array (A, n);
*/
unfinished[0].first = 0;
unfinished[0].last = n-1;
unfinished_index = 0;
num_sorted = 0;
#pragma omp parallel
quicksort ();
/*
print_float_array (A, n);
*/
if (verify_sorted (A, n)) printf ("Elements are sorted\n");
else printf ("ERROR: Elements are NOT sorted\n");
return 0;
}


Exercise 2

To make the program parallel, all students have to do is put a pragma in front of the first
for loop in function main. The program will exhibit some speedup. Speedup will be
improved if students add a clause to the pragma indicating that the run-time system
should use guided self-scheduling to allocate loop iterations to threads. Making the second
for loop in main parallel does little to affect the execution time because it takes an
insignificant amount of time compared to the first loop.
Ambitious students will notice that function is_prime is very inefficient and will try to
make it faster. A good way to do this is to add a pre-processing function, called from
main, that finds all primes up to the square root of n, and puts them in a list. Function
is_prime runs dramatically faster if it simply refers to that list of primes, rather than
trying all positive integers greater than or equal to 2. This demonstrates that making the
sequential program faster should be the first thing attempted.
/*
 * This C/OpenMP program counts the number of primes between 2 and n.
 */

#include <stdio.h>
#include <stdlib.h>   /* needed for malloc, atoi, and exit */
#include <math.h>
#include <omp.h>
int *prime_list;       /* Contains primes up to sqrt(n) */
int prime_list_len;    /* Number of primes in prime_list */

/* Function sieve fills array prime_list with primes between
2 and sqrt(n). It uses a rather crude algorithm to do this.
It would be easy to make this function faster, but its execution
time is not too significant compared to the total execution time
of the program. */
void sieve (int n)
{
int i, j;
int s;                 /* Square root of n, rounded down */
s = (int) sqrt(n);
prime_list = (int *) malloc (s * sizeof(int));
for (i = 0; i < s; i++) {
if (i < 2) prime_list[i] = 0;
else {
prime_list[i] = i;
j = 2;
while (j*j <= i) {
if (i % j == 0) {
prime_list[i] = 0;
break;
}
j++;


}
}
}
i = 0;
for (j = 0; j < s; j++)
if (prime_list[j])
prime_list[i++] = prime_list[j];
prime_list_len = i;
prime_list[prime_list_len] = n;
}
/* Function is_prime returns 1 if p is prime and 0 if p is
not prime. */
int is_prime (int p)
{
int i;
if (p < 2) return 0;
i = 0;
/* Check all primes less than or equal to the square root of
p to see if they divide evenly into p. */
while (prime_list[i]*prime_list[i] <= p) {
if (p % prime_list[i] == 0) return 0;
i++;
}
return 1;
}
int main (int argc, char *argv[])
{
int *a;       /* ith element is 1 iff i is prime */
int count;    /* Number of primes between 2 and n */
int i;
int n;        /* Upper bound */
int t;        /* Desired number of threads */
/* Determine problem size and number of threads */
if (argc != 3) {
printf ("Command line syntax: %s <n> <threads>\n",
argv[0]);
exit (-1);
}
n = atoi(argv[1]);
t = atoi(argv[2]);
omp_set_num_threads (t);
/* Identify primes up to square root of n */
sieve (n);


a = (int *) malloc (n * sizeof(int));


#pragma omp parallel for schedule(guided)
for (i = 0; i < n; i++)
a[i] = is_prime(i);
count = 0;
for (i = 0; i < n; i++)
count += a[i];
printf ("There are %d primes less than %d\n", count, n);
return 0;
}


Lab 8: Choosing the Appropriate Thread Model


These answers are based on the assumption that OpenMP programs are easier to write,
debug, and maintain than programs based on Win32/Java/POSIX threads, and the
performance of well-written OpenMP programs is comparable to the performance
of programs written using one of the more general thread models. Hence we will choose
the OpenMP solution whenever it is feasible.
Ideally, the students will enter into in-depth discussions of possible solution strategies for
these problems, raising issues that go far beyond the bare-bones answers presented here.

Problem 1
We want to stop our search as soon as one technique has successfully evaluated the
integral, strongly suggesting that our parallel program should be based on
Win32/Java/POSIX threads. The fork/join model of OpenMP is better suited for situations
where we allow every thread to complete its work.
Problem 2
Even though the variance in the amount of time needed to simulate the path of a single
particle is quite large, the fact that we're simulating millions of particles means that if we
have one thread per processor and give every thread an equal share of the particles to
simulate, the threads will complete at roughly the same time (because of the law of large
numbers). So we should code up the application using OpenMP, and a static allocation of
iterations to threads would most likely be fine. If we're worried about threads finishing at
significantly different times, we can switch to guided self-scheduling, but this will probably
not be necessary.
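A minimal sketch of this approach, assuming a hypothetical per-particle routine (the real simulation code is not part of the problem statement):

#include <stdio.h>
#include <omp.h>

/* Dummy stand-in for the real per-particle physics; the actual routine
   would be far more expensive and much more variable in cost. */
static double simulate_particle (int i)
{
   return (double)(i % 7);   /* placeholder work */
}

int main (void)
{
   const int num_particles = 10000000;
   double total = 0.0;
   int i;

   /* Static scheduling gives each thread an equal, contiguous share of the
      particles; with millions of particles per thread, the total work per
      thread evens out (law of large numbers). */
   #pragma omp parallel for schedule(static) reduction(+:total)
   for (i = 0; i < num_particles; i++)
      total += simulate_particle (i);

   printf ("Accumulated result: %f\n", total);
   return 0;
}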

Problem 3
Different threads are going to be responsible for different tasks. Some tasks may be so
compute-intensive that they should be performed using multiple threads. For example, a
single thread is probably sufficient to catch the keyboard and mouse events, but some of
the image manipulations may benefit from parallel execution. There are going to be a wide
variety of specialized interactions among threads. It may even make sense to create
threads for short-term tasks. In all, the dynamic, asymmetric nature of the parallelism in
this program points toward the use of Win32/Java/POSIX threads.
Another idea is to create an explicit threading/OpenMP hybrid. The compute-intensive
image processing tasks can be parallelized more easily using OpenMP. Avoiding processor
oversubscription could be the most difficult part of this parallel implementation. Neither
explicit threading nor OpenMP provides a way to query the number of idle processors.
Even if they did, the dynamic nature of the application could make the query result
obsolete very quickly. On Windows, QueueUserWorkItem plus OpenMP in the compute-intensive tasks would minimize processor oversubscription. The OS maintains a thread
pool and would only allow queued tasks to run if there were idle processors.


Problem 4
A quick (linear-time) preprocessing step can put the prolog in shared memory and
determine the number of pages to be formatted. The time needed to format pages will
vary by quite a bit, depending upon what is on the page. In addition, some documents are
fairly short. Nevertheless, the same function will be called to process each page. With
these factors in mind, it seems reasonable to use OpenMP's parallel for construct to
convert the pages of the PostScript program to PDF. Because the number of pages may be
small, the parallel program should use guided self-scheduling or dynamic scheduling to
ensure all threads finish at roughly the same time.
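A minimal sketch of that structure, assuming a hypothetical format_page routine and page count (neither appears in the problem statement):

#include <stdio.h>
#include <omp.h>

/* Dummy stand-in for the real page formatter, which would convert one
   PostScript page to PDF and take a highly variable amount of time. */
static void format_page (int page)
{
   printf ("Thread %d formatted page %d\n", omp_get_thread_num (), page);
}

int main (void)
{
   const int num_pages = 24;   /* placeholder page count */
   int i;

   /* Dynamic scheduling hands out pages one at a time, so a thread that
      finishes a short page immediately picks up another; this helps when
      the number of pages is small and per-page cost varies widely. */
   #pragma omp parallel for schedule(dynamic)
   for (i = 0; i < num_pages; i++)
      format_page (i);

   return 0;
}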
Problem 5
This problem seems amenable to solution using OpenMP. The complete video probably
contains thousands of frames. Each thread would scan a contiguous segment of the raw
footage, marking the frames that represented new scenes. The segments scanned by the
threads would have to overlap by a frame, to prevent the error of forgetting to mark a
scene that began precisely at the beginning of a thread's allocated segment.
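
A minimal sketch under those assumptions (the scene-change test is a placeholder; the real detector would compare consecutive frames of the footage):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Dummy stand-in for the real detector: returns 1 if frame i starts a new
   scene relative to frame i-1. */
static int starts_new_scene (int i)
{
   return (i % 500) == 0;   /* placeholder decision */
}

int main (void)
{
   const int num_frames = 100000;
   int *new_scene = (int *) malloc (num_frames * sizeof(int));
   int i, count = 1;         /* frame 0 always starts a scene */

   /* Each iteration examines only frames i-1 and i, so a statically
      scheduled loop gives every thread a contiguous segment whose first
      boundary reaches back one frame into the neighboring segment, which
      is exactly the one-frame overlap described above. */
   #pragma omp parallel for schedule(static)
   for (i = 1; i < num_frames; i++)
      new_scene[i] = starts_new_scene (i);

   for (i = 1; i < num_frames; i++)
      count += new_scene[i];
   printf ("Found %d scene boundaries\n", count);

   free (new_scene);
   return 0;
}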


Additional Resources

For more information on Intel Software College, visit


www.intel.com/software/college .

For more information on software development products, services, tools, training and expert
advice, visit www.intel.com/software .

Put your knowledge to the ultimate test by solving coding problems in competitions for multi-threading on multi-core microprocessors, and win cash prizes! Intel Multi-Threading Competition Series at www.topcoder.com/intel.

For more information about the latest technologies for computer product developers and IT
professionals, look up Intel Press books at
http://www.intel.com/intelpress/ .

Maximize application performance using Intel Software Development Products:


www.intel.com/software/products/ .

Introduction to Parallel Programming


Student Workbook with Instructors Notes - Inner Back Cover

www.intel.com/software/college

Copyright 2006, Intel Corporation. All rights reserved.


Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
*Other brands and names are the property of their respective owners.
