Programming with OpenCL 1.2
By Gastón Hillar, August 26, 2014

printf-style debugging and the ability to partition computing devices into subdevices make
OpenCL 1.2 a very useful upgrade.

OpenCL 1.2 is the latest stable OpenCL release with available drivers for both CPU and GPU support.
It is an interesting and useful enhancement to OpenCL 1.1 that adds several useful features, some
of which were already available as extensions to earlier OpenCL versions. In this article, I cover
the addition of the simplest debug tool ever, which makes it easier to debug OpenCL kernels, and the
new "device fission," which enables you to partition a single compute device into many subdevices.

The Simplest Debug Tool: printf

OpenCL 1.2 adds a built-in printf function to the OpenCL C programming language. In previous
OpenCL versions, only a few specific vendors provided an equivalent function through an extension,
such as the cl_intel_printf extension found in the Intel drivers. The built-in printf function is
very similar to the printf defined in the C99 standard, but there are differences that you should
review by checking the documentation. For example, OpenCL C printf returns 0 for a successful
execution and -1 otherwise, while C99 printf returns the number of printed characters for a
successful execution. At the time of writing, it was possible to use printf with OpenCL 1.2 drivers
for CPU targets and also for AMD GPUs.
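
The return-value difference is easy to trip over when moving code between the host and a kernel. The host-side C sketch below shows the C99 behavior; the helper name print_result_c99 is made up for illustration, and the comment restates what the text above says about the OpenCL C built-in rather than demonstrating it.

```c
#include <stdio.h>

/* Host-side C99 printf: on success it returns the number of
 * characters written. Per the text above, the OpenCL C built-in
 * called inside a kernel returns 0 on success and -1 otherwise. */
int print_result_c99(int index, float value)
{
    return printf("result[%d] %f\n", index, value);
}
```

On the host, print_result_c99(3, 1.5f) writes "result[3] 1.500000" plus a newline and returns 19, so any code that compares the return value against 0 needs adjusting when it is ported in either direction.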

The printf function buffers output until the kernel execution completes, then transfers the output
back to the host. Thus, you have to be extremely careful about how much text you send to the
standard output when adding calls to printf in your kernels. As you might guess, calling this
function within a massively parallel execution has undesirable side effects, including a big impact
on performance and memory usage. You should use printf only for specific debugging purposes on
reduced data sets and then remove the calls to this function when you want to execute the kernels on
the entire data set.

The following lines show the code for a very simple OpenCL kernel that computes the product of a
matrix and a vector:

    __kernel void matrix_dot_vector(__global const float4 *matrix,
            __global const float4 *vector, __global float *result)
    {
        int gid = get_global_id(0);
        result[gid] = dot(matrix[gid], vector[0]);
    }

You can add the following line after the assignment to result[gid] to write data about the
generated result index and value to the standard output:

    printf("result[%d] %f \n", gid, result[gid]);

This way, you can easily gather data from the kernel with a function that is familiar.
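
One way to put that output to work on a reduced data set is to compare it with a plain C run on the host. The sketch below is such a serial reference and is not part of the original kernel code: float4_ref, dot4, and matrix_dot_vector_ref are hypothetical names, and the struct only stands in for the OpenCL C float4 type.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical host-side stand-in for the OpenCL C float4 type. */
typedef struct { float x, y, z, w; } float4_ref;

/* Four-component dot product, mirroring the OpenCL C dot() built-in. */
static float dot4(float4_ref a, float4_ref b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Serial reference for matrix_dot_vector: each loop iteration plays
 * the role of one work-item, with gid as its global ID. */
void matrix_dot_vector_ref(const float4_ref *matrix,
                           const float4_ref *vector,
                           float *result, size_t rows)
{
    for (size_t gid = 0; gid < rows; ++gid) {
        result[gid] = dot4(matrix[gid], vector[0]);
        /* Same format string as the printf call added to the kernel. */
        printf("result[%d] %f \n", (int)gid, result[gid]);
    }
}
```

The reference prints its lines in index order. The kernel makes no such promise across work-items, so it is safer to compare the two outputs line by line after sorting than to expect identical ordering.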

Working with Device Fission

One of the classic problems that arises when you target CPUs with OpenCL is that the execution of a
kernel uses all the available cores and doesn't leave any core free to execute other processes.
The cl_ext_device_fission extension provides an interface for subdividing an OpenCL device into
multiple subdevices. OpenCL 1.2 has incorporated device fission in the specification, and all the
functions that assigned work to devices now accept subdevices.

With the use of device fission, you can control which of the available compute units you want the
OpenCL runtime to use in order to execute kernels. For example, if you have an eight-core CPU, you
can use device fission to create a subdevice with six cores and leave two cores for the operating
system tasks and other processes that require CPU usage. As a developer, you will find device
fission extremely useful because you can continue working with your IDE and other development tools
while executing OpenCL kernels that target the CPU.
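
The eight-core example can be expressed as a partition-by-counts property list with a single count of six. The sketch below only builds that list and does not call OpenCL: build_reserved_core_list is a hypothetical helper, intptr_t stands in for cl_device_partition_property, and the two constants repeat the values from the OpenCL 1.2 CL/cl.h header so the snippet compiles without it.

```c
#include <stdint.h>

/* Values mirror the OpenCL 1.2 CL/cl.h header; the guards let the
 * real definitions win when that header is included first. */
#ifndef CL_DEVICE_PARTITION_BY_COUNTS
#define CL_DEVICE_PARTITION_BY_COUNTS          0x1087
#endif
#ifndef CL_DEVICE_PARTITION_BY_COUNTS_LIST_END
#define CL_DEVICE_PARTITION_BY_COUNTS_LIST_END 0x0
#endif

/* Fills props with { BY_COUNTS, total - reserved, LIST_END, 0 }.
 * Returns the number of compute units given to the single subdevice,
 * or -1 when the arguments would leave it with no compute units. */
int build_reserved_core_list(int total_units, int reserved_units,
                             intptr_t props[4])
{
    if (total_units <= 0 || reserved_units < 0 ||
        reserved_units >= total_units)
        return -1;

    props[0] = CL_DEVICE_PARTITION_BY_COUNTS;
    props[1] = total_units - reserved_units;   /* one subdevice */
    props[2] = CL_DEVICE_PARTITION_BY_COUNTS_LIST_END;
    props[3] = 0;                              /* end of property list */
    return total_units - reserved_units;
}
```

With total_units set to 8 and reserved_units set to 2, the list describes one six-unit subdevice; the two remaining compute units are simply not handed to OpenCL.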

You can also create sets composed of one or more subdevices with their own command queue. With
this feature, you can use device fission to control the queues and dispatch work to each set. The
feature is useful when you have algorithms that benefit from distributing work among many sets of
subdevices. It gives you many options for partitioning tasks and working with advanced
task-parallelism scenarios.

Obviously, the use of device fission requires that you have a good understanding of the details of the
underlying hardware. When you create subdevices and dispatch work to them, you must consider
many things that might affect performance (if you're writing OpenCL kernels, it means you want to
achieve the best performance). If you don't take into account the resources that different
subdevices share, such as a shared cache memory or Non-Uniform Memory Access (NUMA) nodes, you will
lose performance. Luckily, device fission in OpenCL 1.2 provides many predefined partitioning types
and options that make it easy to specify the ways in which you want to split a device, and many of
them let you take into account which compute units share a level of the cache hierarchy or a
NUMA node. These predefined partitioning types and options allow you to make efficient use of
shared hardware resources when generating subdevices.

Device fission also allows you to partition subdevices. Thus, once you create subdevices, they can be
further partitioned by creating new subdevices. The relationships of the different subdevices form
a tree in which each subdevice has a parent device or subdevice. It is possible to use different
partition types and options each time you request OpenCL to split the device or subdevices. For
example, you can partition the CPU device by affinity, then partition one of those subdevices
equally into eight subdevices. Obviously, the root device doesn't have a parent.
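
When planning a second-level split like this, note that the equal partition type takes the number of compute units per subdevice, not the number of subdevices, and that leftover compute units are not used. The sketch below does the arithmetic for an assumed machine in which the affinity split yields a 16-unit subdevice; equal_split_count and build_equal_split_list are hypothetical helpers, intptr_t stands in for cl_device_partition_property, and the constant repeats the CL/cl.h value.

```c
#include <stdint.h>

/* Value mirrors the OpenCL 1.2 CL/cl.h header. */
#ifndef CL_DEVICE_PARTITION_EQUALLY
#define CL_DEVICE_PARTITION_EQUALLY 0x1086
#endif

/* Number of subdevices an equal split produces when each subdevice
 * gets units_per_subdevice compute units. Units left over after the
 * integer division stay unused. Returns 0 for an impossible split. */
unsigned equal_split_count(unsigned parent_units,
                           unsigned units_per_subdevice)
{
    if (units_per_subdevice == 0 || units_per_subdevice > parent_units)
        return 0;
    return parent_units / units_per_subdevice;
}

/* Fills props with { CL_DEVICE_PARTITION_EQUALLY, n, 0 }. */
void build_equal_split_list(unsigned units_per_subdevice,
                            intptr_t props[3])
{
    props[0] = CL_DEVICE_PARTITION_EQUALLY;
    props[1] = (intptr_t)units_per_subdevice;
    props[2] = 0;   /* end of property list */
}
```

For the assumed 16-unit parent, equal_split_count(16, 2) is 8, so asking for two compute units per subdevice yields the eight subdevices mentioned above.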

When you work with device fission for CPUs, you need to take into account that each compute unit
is equivalent to a logical core or a hardware thread. So, when you work with Intel CPUs with
Hyper-Threading technology enabled, two logical cores or hardware threads share one physical core.

The clGetDeviceInfo function has new associated device property IDs to retrieve information that
allows you to plan the partitioning scheme to create subdevices. Before creating subdevices, you
can retrieve the following information using the specified device property IDs:

- The maximum number of subdevices that you can create for a device:
  CL_DEVICE_PARTITION_MAX_SUB_DEVICES. For example, if a CPU has eight logical cores, the value
  will be 8.
- The partition types that the device supports: CL_DEVICE_PARTITION_PROPERTIES. I'll cover
  partition types in detail later.
- The affinity domains for partitioning the device that the device supports:
  CL_DEVICE_PARTITION_AFFINITY_DOMAIN. When you specify a partitioning by affinity domain, you
  can use any of the affinity domains included in the returned list of supported values.

You can determine whether both the OpenCL implementation and the device support device fission
by checking the maximum number of subdevices that you can create for a device
(CL_DEVICE_PARTITION_MAX_SUB_DEVICES). Once you are sure that you can create subdevices, you
can check the supported partition types and the affinity domains in cases where you want to work
with partitioning by affinity domain. With all this information, you can write code that can take into
account different hardware architectures and use different partitioning schemes based on the
information gathered from the different clGetDeviceInfo calls.
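
The decision logic that follows those queries can be sketched in plain C. The code below assumes the three values have already been fetched with clGetDeviceInfo and does not call OpenCL itself: supports_partition_type and can_partition_by_numa are hypothetical helpers, treating a maximum of more than one subdevice as "fission is available" is an assumption of this sketch, and the constants repeat the CL/cl.h values.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirror the OpenCL 1.2 CL/cl.h header. */
#ifndef CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN
#define CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN 0x1088
#endif
#ifndef CL_DEVICE_AFFINITY_DOMAIN_NUMA
#define CL_DEVICE_AFFINITY_DOMAIN_NUMA (1 << 0)
#endif

/* Returns 1 when wanted appears in the list of partition types
 * reported for CL_DEVICE_PARTITION_PROPERTIES, 0 otherwise. */
int supports_partition_type(const intptr_t *types, size_t count,
                            intptr_t wanted)
{
    for (size_t i = 0; i < count; ++i)
        if (types[i] == wanted)
            return 1;
    return 0;
}

/* Combines the three queried values: the maximum number of
 * subdevices, the supported partition types, and the affinity
 * domain bitfield. The "> 1" threshold is this sketch's assumption. */
int can_partition_by_numa(unsigned max_sub_devices,
                          const intptr_t *types, size_t count,
                          uint64_t affinity_domains)
{
    return max_sub_devices > 1 &&
           supports_partition_type(types, count,
               CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN) &&
           (affinity_domains & CL_DEVICE_AFFINITY_DOMAIN_NUMA) != 0;
}
```

Code that targets several hardware architectures could run a check like this for each scheme it knows about and fall back to an unpartitioned device when none of them is supported.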

After you call clGetDeviceIDs and make the necessary calls to clGetDeviceInfo for the
selected device, you can start creating the subdevices with calls to the
new clCreateSubDevices function. Note that you must create the subdevices before you create the
OpenCL context. OpenCL 1.2 devices and subdevices have retain (clRetainDevice) and release
(clReleaseDevice) functions that allow you to increment and decrement the reference count as is
done on other OpenCL objects.

The following lines show the C declaration of the clCreateSubDevices function:

    cl_int clCreateSubDevices(cl_device_id in_device,
        const cl_device_partition_property *properties,
        cl_uint num_devices,
        cl_device_id *out_devices,
        cl_uint *num_devices_ret);

The function requires the following arguments:

- in_device: Indicates the ID of the device (cl_device_id) that you want to split into
  subdevices.
- properties: Provides a property list that starts with the desired partition type and then
  provides additional values required by the selected partition type. The property list must end
  with 0. When you partition by counts, the list of counts is itself terminated by
  CL_DEVICE_PARTITION_BY_COUNTS_LIST_END (also 0) before that final 0. As I explained before, in
  order to create subdevices successfully, you need to make sure that the partition type specified
  in this property list is supported by the device by making calls to clGetDeviceInfo.
- num_devices: Specifies the size of the out_devices array.
- out_devices: Provides a buffer for the generated subdevices with a number of elements
  specified by num_devices.
- num_devices_ret: Returns the number of subdevices that the device may be partitioned into
  considering the partition type and the other values specified in the property list (properties).
