
OpenCL Programming

OPEN COMPUTING LANGUAGE BY SWAPNIL ATILAYE

Overview
- Introduction to OpenCL
- Design Goals of OpenCL
- OpenCL Architecture
- OpenCL Framework
- Basic OpenCL Program Structure
- OpenCL Program: Example of Hello World
- OpenCL Program: Example of Element-wise Matrix Addition
- OpenCL Language Restrictions
- Future of OpenCL
- Conclusion

Motivation: Before OpenCL

Motivation: Promise of OpenCL

OpenCL Commercial Objective:


- Grow the market for parallel computing, for vendors of systems, silicon, middleware, tools and applications
- Open, royalty-free standard for heterogeneous parallel computing
- Unified programming model for CPUs, GPUs, Cell, DSPs and other processors in a system
- Cross-vendor software portability to a wide range of silicon and systems: HPC servers, desktop systems and handheld devices covered in one specification
- Support for a wide diversity of applications, from embedded and mobile software through consumer applications to HPC solutions
- Rapid deployment in the market: designed to run on the latest generations of GPU hardware

What is OpenCL:
OpenCL (Open Computing Language) is an open royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms.
OpenCL stands for Open Computing Language. It is supported by Apple and several other vendors, and is maintained by the Khronos Group, the same consortium behind OpenGL. It is a cross-platform parallel computing API with a C-like language for heterogeneous computing devices.

A single OpenCL kernel will likely not achieve peak performance on all device types

Heterogeneous Computing: OpenCL

OpenCL Working Group


- Apple initially proposed OpenCL and is very active in the working group, serving as specification editor
- Diverse industry participation: processor vendors, system OEMs, middleware vendors, application developers
- Many other companies also belong to the OpenCL working group (logos shown on the original slide)

Design Goals of OpenCL:


- Code is portable across various target devices: correctness is guaranteed, but performance of a given kernel is not guaranteed across differing target devices
- Targets a broader range of CPU-like and GPU-like devices than CUDA, and targets devices produced by multiple vendors
- Many features of OpenCL are optional and may not be supported on all devices
- OpenCL works on all kinds of GPGPUs (such as AMD and NVIDIA GPUs), all kinds of multi-core CPUs (such as x86 CPUs and Cell processors), and on handheld and mobile devices; OpenCL implementations also exist on FPGAs
- OpenCL code must be prepared to deal with much greater hardware diversity

OpenCL Platform Model:


- A host connected to one or more OpenCL devices
- A device can be divided into one or more compute units (CUs)
- A compute unit can be further divided into one or more processing elements (PEs)
- The computation is done on the processing elements
- The application sends commands from the host to the PEs
- PEs within a CU execute instructions as SIMD/SPMD units

OpenCL Hardware Abstraction

- OpenCL exposes CPUs, GPUs, and other accelerators as devices
- Each device contains one or more compute units, i.e. cores, SMs, etc.
- Each compute unit contains one or more SIMD processing elements

OpenCL Memory Model:


OpenCL memory is divided into four types: Global, Constant, Local, and Private.
- Private memory: per work-item; qualifier __private (e.g. __private char *px); corresponds to local memory in CUDA
- Local memory: shared within a work-group (16 KB); qualifier __local; corresponds to shared memory in CUDA
- Global/Constant memory: not synchronized; qualifiers __global (e.g. __global float4 *p) and __constant; correspond to global and constant memory in CUDA
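As a sketch only (the kernel name and arguments are hypothetical, not from the slides), the four address spaces look like this in OpenCL C:

__kernel void scale(__global float *data,       /* global: visible to all work-items */
                    __constant float *coeffs,   /* constant: read-only across the NDRange */
                    __local float *scratch)     /* local: shared within one work-group */
{
    __private int id = get_global_id(0);        /* private: per work-item (the default) */
    scratch[get_local_id(0)] = data[id];
    barrier(CLK_LOCAL_MEM_FENCE);               /* make local writes visible to the group */
    data[id] = scratch[get_local_id(0)] * coeffs[0];
}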

OpenCL Execution Model:


Execution in OpenCL happens in two parts: the kernel, which executes on the OpenCL (compute) devices, and the host program, which controls how the kernels execute.
- A work-item is the basic unit of work
- Kernel: the code for a work-item; executed on OpenCL devices; similar to C functions, CUDA kernels, etc.; data-parallel or task-parallel
- Host program: executed on the host; holds a collection of compute kernels and internal functions, analogous to a dynamic library

OpenCL Host Program:

An OpenCL program contains one or more kernels and any supporting routines that run on a target device.

An OpenCL kernel is the basic unit of parallel code that can be executed on a target device.

OpenCL Execution Model:


An integrated host + device application is a C program with serial or modestly parallel parts in host C code and highly parallel parts in device SPMD kernel C code:

Serial code (host)
Parallel kernel (device): KernelA<<< nBlk, nTid >>>(args);   (CUDA-style launch syntax shown for illustration)
Serial code (host)
Parallel kernel (device): KernelB<<< nBlk, nTid >>>(args);

OpenCL Kernels:
Kernel Execution:

- Code that actually executes on target devices
- The kernel body is instantiated once for each work-item
- An OpenCL work-item is equivalent to a CUDA thread
- Each OpenCL work-item gets a unique index

Example:

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] + b[id];
}

OpenCL Kernel Execution Launch:
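The original slide's launch diagram is not reproduced here; as a hedged sketch, assuming a kernel, command queue and buffers (a_buf, b_buf, result_buf) have already been created, launching the vadd kernel above looks roughly like this:

/* Bind the buffer arguments to the kernel. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_buf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &result_buf);

/* Enqueue over a 1D NDRange: 1024 work-items in work-groups of 64. */
size_t global_size = 1024;
size_t local_size  = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);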

OpenCL Execution Model:


Kernel Execution:

- The host program invokes a kernel over an index space called an NDRange
  - NDRange (N-Dimensional Range) can be a 1D, 2D, or 3D space
- A single kernel instance at a point in the index space is called a work-item
  - Work-items have unique global IDs from the index space (CUDA: thread IDs)
- Work-items are further grouped into work-groups
  - Work-groups have a unique work-group ID (CUDA: block IDs)
  - Work-items have a unique local ID within a work-group

Array of Parallel Work-Items:


- An OpenCL kernel is executed by an array of work-items
- All work-items run the same code (SPMD)
- Each work-item has an index that it uses to compute memory addresses and make control decisions

Work Group Scalable Operation:


- Divide the monolithic work-item array into work-groups
- Work-items within a work-group cooperate via shared memory, atomic operations and barrier synchronization
- Work-items in different work-groups cannot cooperate

OpenCL Execution Model : example 2D NDRange

- Total number of work-items = Gx * Gy
- Size of each work-group = Sx * Sy
- The global ID can be computed from the work-group ID and the local ID: gx = wx * Sx + sx and gy = wy * Sy + sy (assuming a zero global offset)
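A small kernel sketch (illustrative, not from the slides) confirming that relation with the built-in ID functions:

__kernel void show_ids(__global int *out)
{
    /* With a zero global offset: global ID = group ID * group size + local ID */
    size_t gx = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[get_global_id(0)] = (gx == get_global_id(0));   /* always writes 1 */
}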

OpenCL Programming Model: Data Parallel Model


- Define an N-dimensional computation domain; each independent element of execution in the N-D domain is called a work-item
- The N-D domain defines the total number of work-items that execute in parallel: the global work size
- Parallel work is submitted to devices by launching kernels
- Kernels run over global dimension index ranges (NDRange), broken up into work-groups and work-items
- Work-items executing within the same work-group can synchronize with each other with barriers or memory fences
- Work-items in different work-groups can't synchronize with each other, except by launching a new kernel

OpenCL ND Range Configuration:Global and Local Dimensions


- Global dimensions: 1024 x 1024 (whole problem space)
- Local dimensions: 128 x 128 (executed together)
- Choose the dimensions that are best for your algorithm

OpenCL Programming Model: Task Parallel Model


- Some compute devices, such as CPUs, can also execute task-parallel compute kernels
- A task-parallel kernel executes as a single work-item
- The main characteristic is that each processor executes different commands
- Such a kernel can be a compute kernel written in OpenCL or a native C/C++ function

OpenCL Programming Model: Synchronization


- Work-items in a single work-group: work-group barrier, similar to __syncthreads() in CUDA
- No mechanism for synchronization between work-groups
- Synchronization points between commands in command-queues, similar to multiple kernels in CUDA but more generalized:
  - Command-queue barrier
  - Waiting on an event
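A minimal sketch of a work-group barrier in use (kernel name and logic are illustrative): each work-item stages a value in local memory, and the barrier guarantees all stores are visible before any work-item reads its neighbour's slot.

__kernel void reverse_in_group(__global float *data, __local float *tmp)
{
    int lid  = get_local_id(0);
    int size = get_local_size(0);
    tmp[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);   /* every work-item in the group reaches this point */
    data[get_global_id(0)] = tmp[size - 1 - lid];   /* reverse within the work-group */
}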

OpenCL Compilation Model


- OpenCL uses a dynamic compilation model (like DirectX and OpenGL)
- Static compilation: the code is compiled from source to machine code at a specific point in the past
- Dynamic compilation (also known as runtime compilation):
  - Step 1: the code is compiled to an Intermediate Representation (IR), usually the assembler of a virtual machine
  - Step 2: the IR is compiled to machine code for execution; this step is much shorter
- In dynamic compilation, step 1 is usually done once and the IR is stored; the application loads the IR and performs step 2 during its runtime
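In the host API this two-step model looks roughly as follows (a sketch with error handling omitted; context, device, source_str, source_len and err are assumed to exist):

/* Step 1 at the app's runtime: compile the source for the devices in the context. */
cl_program program = clCreateProgramWithSource(context, 1, &source_str,
                                               &source_len, &err);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);

/* The built binary/IR can be queried and cached for later runs. */
size_t bin_size;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size),
                 &bin_size, NULL);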

Mapping Programming Model: OpenCL to CUDA


OpenCL Parallelism Concept    CUDA Parallelism Concept
Kernel                        Kernel
NDRange (index space)         Grid
Work-item                     Thread
Work-group                    Block

OpenCL Object:
Setup:
- Devices: GPU, CPU, Cell/B.E.
- Contexts: collection of devices
- Queues: submit work to the device
Memory:
- Buffers: blocks of memory
- Images: 2D or 3D formatted images
Execution:
- Programs: collections of kernels
- Kernels: argument/execution instances
Synchronization/profiling:
- Events

OpenCL Framework:
The OpenCL framework allows applications to use a host and one or more OpenCL devices as a single heterogeneous parallel computer system. The framework contains the following components:
- OpenCL Platform layer: allows the host program to discover OpenCL devices and their capabilities and to create contexts
- OpenCL Runtime: allows the host program to manipulate contexts once they have been created
- OpenCL Compiler: creates program executables that contain OpenCL kernels; the OpenCL C programming language implemented by the compiler supports a subset of the ISO C99 language with extensions for parallelism
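A brief sketch of the platform layer in use (assumes one platform and one GPU device; error checking omitted):

cl_platform_id platform;
cl_device_id device;
cl_int err;

/* Platform layer: discover a platform and a device, then create a context. */
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);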

Basic OpenCL Program Structure:

Main Flow of Host Code:


1. Get information about the platform and devices
2. Create an OpenCL context
3. Create a command queue
4. Create memory buffer objects
5. Create a program object: load the kernel source code and compile it
6. Create a kernel object
7. Set the kernel arguments
8. Execute the kernel
9. Copy memory from the device (GPU) back to the host (CPU)

(The Hello World example below follows these steps.)

OpenCL Context
- Contains one or more devices
- OpenCL memory objects are associated with a context, not a specific device
- clCreateBuffer() is the main data object allocation function; it is an error if an allocation is too large for any device in the context
- Each device needs its own work queue(s)
- Memory transfers are associated with a command queue (and thus a specific device)
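A sketch of that distinction (bytes and host_ptr are illustrative): the buffer belongs to the context, while the transfer is bound to one device through its command queue.

/* Allocated for the whole context, not for a specific device. */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);

/* The write is enqueued on a queue, and therefore on a specific device. */
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_ptr, 0, NULL, NULL);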

OpenCL Program: Example of Hello World


Kernel Code:
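The slide's code was an image that did not survive extraction; a plausible kernel in its spirit (not the author's original) has each work-item copy one character of the message:

__constant char msg[] = "Hello, World!";

__kernel void hello(__global char *out)
{
    int id = get_global_id(0);
    out[id] = msg[id];   /* one work-item per character */
}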

Host Code:
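Likewise a hedged host-code sketch (not the original slide's code) that follows the main flow listed earlier; error checking is omitted for brevity:

#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    const char *src =
        "__constant char msg[] = \"Hello, World!\";\n"
        "__kernel void hello(__global char *out) {\n"
        "    out[get_global_id(0)] = msg[get_global_id(0)];\n"
        "}\n";
    size_t n = strlen("Hello, World!");
    char result[32] = {0};
    cl_int err;

    /* 1. Platform and device */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* 2.-3. Context and command queue */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    /* 4. Memory buffer object */
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n, NULL, &err);

    /* 5. Program object: load the kernel source and compile it */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* 6.-7. Kernel object and its arguments */
    cl_kernel k = clCreateKernel(prog, "hello", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &out);

    /* 8. Execute the kernel: one work-item per character */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* 9. Copy memory from the device back to the host */
    clEnqueueReadBuffer(q, out, CL_TRUE, 0, n, result, 0, NULL, NULL);
    printf("%s\n", result);

    clReleaseKernel(k);
    clReleaseProgram(prog);
    clReleaseMemObject(out);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}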

CUDA Program: Example Element-wise Matrix Addition


#include <cstdlib>

/* Set grid size */
const int N = 1024;
const int blocksize = 16;

/* Compute kernel */
__global__ void add_matrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

int main()
{
    /* CPU memory allocation */
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for (int i = 0; i < N*N; ++i) {
        a[i] = 1.0f;
        b[i] = 3.5f;
    }

    /* GPU memory allocation */
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc((void**)&ad, size);
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);

    /* Copy data to GPU */
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    /* Execute kernel */
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);

    /* Copy result back to CPU */
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

    /* Clean up and return */
    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}
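For comparison, a hedged sketch of the same computation as an OpenCL kernel (launched over a 2D NDRange of N x N work-items, e.g. in 16 x 16 work-groups; the host-side setup follows the flow shown earlier):

__kernel void add_matrix(__global const float *a,
                         __global const float *b,
                         __global float *c,
                         int N)
{
    int i = get_global_id(0);   /* replaces blockIdx.x * blockDim.x + threadIdx.x */
    int j = get_global_id(1);   /* replaces blockIdx.y * blockDim.y + threadIdx.y */
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}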

OpenCL Language Restriction:


- Pointers to functions are not allowed
- Pointers to pointers are allowed within a kernel, but not as a kernel argument
- Bit-fields are not supported
- Variable-length arrays and structures are not supported
- Recursion is not supported
- Writes to pointers to types smaller than 32 bits are not supported
- Double types are not supported, but reserved
- 3D image writes are not supported
- Some restrictions are addressed through extensions

Future of OpenCL:
The future lies with OpenCL because it is an open standard, not restricted to one vendor or to specific hardware. Another reason is that AMD is going to release a new processor called Fusion. AMD Fusion is a new approach to processor design and software development, delivering powerful CPU and GPU capabilities for HD, 3D and data-intensive workloads in a single-die processor called an APU. APUs combine high-performance serial and parallel processing cores with other special-purpose hardware accelerators, enabling breakthroughs in visual computing, security, performance-per-watt and device form factor. This processor would be a perfect fit for OpenCL, which doesn't care what type of processor is available, as long as it can be used.

Conclusion:
OpenCL should attract HPC programmers because it is a long-term strategy for GPUs and other accelerators. It may be a complicated language for short applications, but it is very useful for more complex ones. There are some restrictions in OpenCL, but they do not affect the language's reliability. There will also be implementations built on top of OpenCL in higher-level languages that will be easier for ordinary programmers. In the end, you might find OpenCL very difficult, but once you master it, you will be a master of parallel computing. There are already some requests for OpenCL programmers from UK companies.

OpenCL Demo on AMD :

http://www.youtube.com/watch?v=MCaGb40Bz58&feature=related

http://www.youtube.com/watch?v=PJ1jydg8mLg

http://www.youtube.com/watch?v=mcU89Td53Gg

Thank You
