Overview
GPGPU vs. Data Parallel Computing
Introducing the Compute Shader
Advantages
Target Applications
Key Features
Examples
Image reduction, histogram, convolution
API Support
Scattered Writes
Can read/write arbitrary data structures
Enables new classes of algorithms
Integrated with Direct3D resources
Target Applications
Image/post-processing:
Image reduction, histogram, convolution, FFT
Effect physics
Particles, smoke, water, cloth, etc.
A-Buffer/OIT
Ray-tracing, radiosity, etc.
Gameplay physics, AI
[Diagram: Rasterizer, Pixel Shader, Output Merger, Scene Image, Compute Shader, Data Structure, Final Image]
Render scene
Write out scene image
Use Compute for image post-processing
Output final image
Sub Blocking
Not all threads in the call can/should share registers with each other
Sharing threads are broken down into subsets (groups) of threads
Thread indices are made available in the shader
sv_ThreadID
sv_ThreadGroupID
sv_ThreadIDinGroup
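A minimal Python sketch of how the three indices could relate for a 1D call. It assumes the global thread index is the group index times the group size plus the index within the group; the function name and parameters are hypothetical, not part of the API.

```python
from typing import Iterator, Tuple


def thread_indices(num_groups: int, group_size: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (thread_id, group_id, id_in_group) for a 1D call."""
    for group_id in range(num_groups):
        for id_in_group in range(group_size):
            # Assumption: global index = group offset + local index.
            thread_id = group_id * group_size + id_in_group
            yield thread_id, group_id, id_in_group


# 3 groups of 4 threads give 12 threads with global indices 0..11.
indices = list(thread_indices(num_groups=3, group_size=4))
```

Only threads with the same group index would share registers under this model.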
Atomic Intrinsics
Enable parallel operations on individual 32-bit memory locations without requiring full synchronization
Either video memory or shared registers
Atomic Intrinsics
Enables basic operations:
InterlockedAdd( rVar, val );
InterlockedMin( rVar, val );
InterlockedMax( rVar, val );
InterlockedOr( rVar, val );
InterlockedXOr( rVar, val );
InterlockedCompareWrite( rVar, val );
InterlockedCompareExchange( rVar, val );
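A minimal CPU-side sketch of what such operations guarantee, written in Python with a lock standing in for the hardware. The class and method names are hypothetical, and the compare-exchange here takes an explicit comparison value, which is an assumption about the final signature.

```python
import threading


class AtomicCell:
    """Hypothetical stand-in for one 32-bit memory location."""

    def __init__(self, value: int = 0) -> None:
        self._value = value & 0xFFFFFFFF
        self._lock = threading.Lock()

    def load(self) -> int:
        with self._lock:
            return self._value

    def interlocked_add(self, val: int) -> int:
        # Read-modify-write as one indivisible step; returns the old value.
        with self._lock:
            old = self._value
            self._value = (old + val) & 0xFFFFFFFF
            return old

    def interlocked_compare_exchange(self, compare: int, val: int) -> int:
        # Store val only if the cell still holds compare; returns the old value.
        with self._lock:
            old = self._value
            if old == compare:
                self._value = val & 0xFFFFFFFF
            return old


def worker(cell: AtomicCell, n: int) -> None:
    for _ in range(n):
        cell.interlocked_add(1)


counter = AtomicCell()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = counter.load()  # every increment is counted, none are lost
```

The threads never wait for each other as a whole; they only serialize on the single location they update.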
DXGI resources
Enables out-of-bounds memory checking
Returns 0 on reads
Writes are No-Ops
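The out-of-bounds behavior above can be modeled in a few lines of Python. This is a sketch of the semantics only; the class name is hypothetical.

```python
class BoundsCheckedBuffer:
    """Hypothetical model of a resource with out-of-bounds checking."""

    def __init__(self, size: int) -> None:
        self._data = [0] * size

    def load(self, index: int) -> int:
        # Out-of-bounds reads return 0 instead of faulting.
        if 0 <= index < len(self._data):
            return self._data[index]
        return 0

    def store(self, index: int, value: int) -> None:
        # Out-of-bounds writes are silently dropped (no-ops).
        if 0 <= index < len(self._data):
            self._data[index] = value


buf = BoundsCheckedBuffer(4)
buf.store(2, 7)    # in bounds: stored
buf.store(10, 99)  # out of bounds: ignored
```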
Unordered I/O
For fastest performance when ordering of records need not be preserved
Both reads and writes:
UnorderedLoad( ResourceVar, val);
UnorderedStore( ResourceVar, val);
Don't Forget
Texture sampling still works:
Object.Load( Loc, Offset, Samples );
Object.Gather( Sampler, Loc );
Object.Sample( Sampler, Loc );
Object.SampleLevel( Sampler, Loc, LoD );
Examples
Image Reduction
Image Histogram
FFT
Image Post-Processing
Significant fraction of frame time
10-20% for most games
50-70% for deferred shading-based engines
Image Reduction
Find the average intensity of an Image
E.g. for HDR exposure adjustment
Optimizes scene for viewing on SDR monitor
Algorithm breakdown:
Input: 1 million pixels
Compute: 1 MAD per pixel read
Output: 1 value
Million-to-1 reduction
[Diagram: Input, GPU, Output]
// Total so far
// Count added
// array of 32 totals
// array of 32 counts
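A minimal Python sketch of one way the comments above could fit together, assuming 32 partial totals and counts that are merged at the end. How pixels map to slots is an assumption (round-robin here), and on the GPU each update would be an atomic add.

```python
NUM_SLOTS = 32  # matches the "array of 32" comments; the slot mapping is assumed


def average_intensity(pixels):
    """Average a flat list of pixel intensities via 32 partial accumulators."""
    totals = [0.0] * NUM_SLOTS  # array of 32 totals
    counts = [0] * NUM_SLOTS    # array of 32 counts
    for i, intensity in enumerate(pixels):
        slot = i % NUM_SLOTS        # spread updates to reduce contention
        totals[slot] += intensity   # total so far
        counts[slot] += 1           # count added
    count = sum(counts)
    return sum(totals) / count if count else 0.0


avg = average_intensity([0.25, 0.5, 0.75, 1.0] * 64)
```

Spreading the updates over several slots means fewer threads contend for any one destination than with a single running total.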
Reduction Performance
Pyramid approaches work today
Some choice in reduction level per pass
Tradeoff is contention for destination
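A minimal Python sketch of a pyramid reduction with a selectable reduction level per pass. The function names and the `factor` parameter are hypothetical.

```python
def reduce_pass(values, factor):
    """One pass: each output is the sum of `factor` adjacent inputs."""
    return [sum(values[i:i + factor]) for i in range(0, len(values), factor)]


def pyramid_sum(values, factor=4):
    """Repeat passes until one value remains; returns (total, passes)."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    passes = 0
    while len(values) > 1:
        values = reduce_pass(values, factor)
        passes += 1
    return (values[0] if values else 0), passes


data = list(range(1024))
total_small, passes_small = pyramid_sum(data, factor=4)   # 5 passes
total_large, passes_large = pyramid_sum(data, factor=32)  # 2 passes
```

A larger factor needs fewer passes, but more inputs land on each destination in a pass, which is the contention tradeoff noted above.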
Histogram Generation
Similar to reduction problem
Reduce to 64-256 destinations at data-dependent (unpredictable) addresses
// array of 16
// update bin
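A minimal Python sketch of the scatter pattern, assuming intensities in the range [0, 1] and a hypothetical `num_bins` parameter. On the GPU the bin update would be an atomic add, since many threads may hit the same bin.

```python
def histogram(pixels, num_bins=64):
    """Count intensities (assumed to lie in [0, 1]) into num_bins bins."""
    bins = [0] * num_bins
    for intensity in pixels:
        # The destination address depends on the data itself.
        index = min(int(intensity * num_bins), num_bins - 1)
        bins[index] += 1  # update bin
    return bins


hist = histogram([0.0, 0.1, 0.1, 0.99, 1.0], num_bins=4)
```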
Histogram Performance
Recent work shows similar performance to reductions:
Direct3D takes ~2.4 ms per megapixel
On DirectX10 hardware
8x theoretically possible if purely read limited
Image Convolution
Fundamental operation for blurs:
HDR flares, depth-of-field, soft shadows, streaks
Convolution Performance
Massively variable depending on method
Direct3D does 5x5 kernel in 0.65ms/Mpix
Separable kernel
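A minimal Python sketch of why a separable kernel helps: a 5x5 blur done as a horizontal pass followed by a vertical pass gives the same result as the full 2D kernel, with 10 taps per pixel instead of 25. The binomial kernel and clamp-to-edge borders are assumptions for illustration.

```python
def clamp(i, n):
    return max(0, min(n - 1, i))


def blur_rows(image, kernel):
    """Convolve every row with a 1D kernel, clamping at the edges."""
    r = len(kernel) // 2
    w = len(image[0])
    return [[sum(k * row[clamp(x + j - r, w)] for j, k in enumerate(kernel))
             for x in range(w)] for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def separable_blur(image, kernel):
    """Horizontal pass, then vertical pass (via transpose)."""
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


def full_blur(image, kernel):
    """Reference: full 2D kernel built as the outer product of the 1D kernel."""
    r = len(kernel) // 2
    h, w = len(image), len(image[0])
    return [[sum(ki * kj * image[clamp(y + i - r, h)][clamp(x + j - r, w)]
                 for i, ki in enumerate(kernel) for j, kj in enumerate(kernel))
             for x in range(w)] for y in range(h)]


kernel = [v / 16.0 for v in (1, 4, 6, 4, 1)]  # 5-tap binomial, sums to 1
image = [[float((x * 7 + y * 3) % 11) for x in range(6)] for y in range(6)]
fast = separable_blur(image, kernel)
slow = full_blur(image, kernel)
```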
Scan (Prefix-sum)
Each number in the output sequence is the sum of all previous numbers in the input
Used to compute writes in irregular arrays
Foundation of Summed Area Tables
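A minimal sequential Python sketch of an exclusive scan and of how it yields write positions for an irregular output. The `compact` helper is a hypothetical example; a GPU version would compute the scan in parallel.

```python
from itertools import accumulate


def exclusive_scan(values):
    """out[i] = sum of values[0..i-1]; out[0] = 0."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def compact(items, keep):
    """Keep matching items, using scanned flags as write offsets."""
    flags = [1 if keep(x) else 0 for x in items]
    offsets = exclusive_scan(flags)
    size = offsets[-1] + flags[-1] if items else 0
    out = [None] * size
    for item, flag, offset in zip(items, flags, offsets):
        if flag:
            out[offset] = item  # each kept item knows its slot in advance
    return out


scanned = exclusive_scan([3, 1, 4, 1, 5])
kept = compact([5, 2, 8, 1, 9], lambda x: x > 4)
```

Because every element's destination is known before any write happens, the writes can all be issued independently.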
Scan (Prefix-sum)
We are looking at providing this in a library routine
Along with FFT, etc.
Direct3D FFT
Ping-pong between 2 R32G32F surfaces
R is Real, G is Imaginary
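A minimal Python sketch of the ping-pong idea, assuming a radix-2 Stockham FFT; the slides do not say which FFT variant is used. Two buffers stand in for the two surfaces, and each Python complex number stands in for one texel holding a real and an imaginary part.

```python
import cmath


def fft_pingpong(signal):
    """Radix-2 Stockham FFT; each pass reads one buffer and writes the other."""
    total = len(signal)
    if total == 0 or total & (total - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in signal]
    dst = [0j] * total
    n, stride = total, 1
    while n > 1:
        half = n // 2
        for p in range(half):
            twiddle = cmath.exp(-2j * cmath.pi * p / n)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * twiddle
        src, dst = dst, src  # swap roles: last output is the next input
        n, stride = half, stride * 2
    return src


spectrum = fft_pingpong([1, 0, 0, 0])  # an impulse has a flat spectrum
```

No pass reads and writes the same buffer, which is what makes the scheme map onto render targets. A 2D FFT would apply this along rows and then along columns.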
[Images: FFT Before, FFT After]
FFT Performance
Complex 1024x1024 2D FFT:
Implementation    Time     Throughput    Speedup
Software          42 ms    6 GFlops
Direct3D9         15 ms    17 GFlops     3x
Prototype DX11    6 ms     42 GFlops     6x
Latest chips      3 ms     100 GFlops
Order-Independent Translucency
Eliminates draw-order issues, and shimmer in moving scenes
Correct AA even of transparent objects
Any object is transparent if antialiased
e.g. alpha tested leaves in forests
A-Buffer Rendering
Currently prototyping using refrast
DirectX reference rasterizer running on CPU
Measuring memory access patterns/locality
Evaluating feasibility of hardware
Not really feasible with current Direct3D
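A minimal Python sketch of the A-buffer idea for a single pixel and a single color channel: keep every fragment, sort by depth at resolve time, then blend back to front. The fragment layout (depth, color, alpha) and the convention that larger depth is farther away are assumptions.

```python
from itertools import permutations


def resolve_pixel(fragments, background=0.0):
    """Blend (depth, color, alpha) fragments back to front over the background."""
    color = background
    # Sort farthest first so nearer fragments are composited on top.
    for _depth, frag_color, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = frag_color * alpha + color * (1.0 - alpha)
    return color


fragments = [(0.2, 1.0, 0.5), (0.8, 0.0, 0.25), (0.5, 0.6, 0.75)]
# Every submission order resolves to the same color.
results = {resolve_pixel(list(order)) for order in permutations(fragments)}
```

Since the sort happens per pixel at resolve time, the order in which objects were drawn no longer matters.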
Additional Algorithms
New rendering methods
Ray-tracing, collision detection, etc.
Rendering elements at different resolutions
Non-rendering algorithms
IK, physics, AI, simulation, fluid simulation, radiosity
Summary
Compute Shader is coming in Direct3D 11
GPU performance levels for more applications
Questions?
www.xnagamefest.com