
Direct3D 11 Compute Shader

More Generality for Advanced Techniques


Chas. Boyd
Architect
Windows Desktop & Graphics Technology
Microsoft

Overview
GPGPU vs. Data Parallel Computing
Introducing the Compute Shader
Advantages
Target Applications
Key Features
Examples
Image reduction, histogram, convolution

API Support

GPGPU = Data Parallel Computing


GPU Performance continues to grow
More algorithms want this performance

Apps can scale to massive parallelism


without tricky code changes
General recognition that this model is
applicable beyond just rendering
although that is our primary target

Deliver scalable performance


Code scales with core count with no changes

Introducing: Compute Shader


A new processing model for GPUs
Data-parallel programming for mass markets

Integrated with Direct3D


For efficient interoperability in client scenarios

Supports more general constructs:


Cross-thread data sharing
Unordered access I/O operations

Enables more general data structures


Irregular arrays, trees, etc.

Enables more general algorithms


Far beyond shading

Optimized for Client Scenarios


Simpler setup syntax
Balance between power and complexity

Real-time rendering of results


Working to reduce cost of transition from
compute mode to graphics mode

Better integration with media data types:


Pixels, samples, text, vs. only floats

Need consistency between implementations


Both across vendors and over time/generations

Compute Shader Features


Predictable Thread Invocation
Regular arrays of threads: 1D, 2D, 3D
Don't have to "draw a quad" anymore

Shared registers between threads


Reduces register pressure
Can eliminate redundant compute and I/O

Scattered Writes
Can read/write arbitrary data structures
Enables new classes of algorithms
Integrated with Direct3D resources

Target Applications
Image/post-processing:
Image reduction, histogram, convolution, FFT

Effect physics
Particles, smoke, water, cloth, etc.

A-Buffer/OIT
Ray-tracing, radiosity, etc.
Gameplay physics, AI

Integrated with Direct3D


Fully supports all Direct3D resources
Targets graphics/media data types
Evolution of DirectX HLSL
Graphics pipeline updated to emit general
data structures
which can then be manipulated by compute
shader
and then rendered by D3D again

Integration with Pipeline


[Diagram: Input Assembler -> Vertex Shader -> Tessellation -> Geometry Shader -> Rasterizer -> Pixel Shader -> Output Merger -> Scene Image (data structure) -> Compute Shader -> Final Image]

Render scene
Write out scene image
Use Compute for image post-processing
Output final image

Direct Thread Invocation


The ability to explicitly launch a known number of
threads onto the GPU
pD3D11Device->Dispatch( numThreads );

Analogous to graphics DrawPrimitive() calls


Enables algorithms to execute the optimal number of
threads
Not how many vertices are read, or pixels written

Current thread id is available to shader code:


sv_ThreadID.x

Analogous to sv_PrimitiveID system value

Enables predictable memory access and register usage



Shared Register Class


New register type/variable storage class
shared float sfFoo;

Multiple threads can access same memory


Enables uses like user-controlled cache

Maximum of 32 KB of registers can be shared in DirectX 11
8K floats or 2K float4s
vs. 64 KB of total temporary registers available
16K floats or 4K float4s

Sub Blocking
Not all threads in the call can/should share
registers with each other
Sharing threads are broken down into
subsets (groups) of threads
Thread indices are made available in
shader
sv_ThreadID
sv_ThreadGroupID
sv_ThreadIDinGroup
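
The relationship between the three index values can be illustrated with a small CPU-side emulation. This is a sketch in Python rather than shader code. The helper name `dispatch_1d` and its parameters are hypothetical, and a 1D layout is assumed, so that the global thread ID equals the group ID times the group size plus the ID within the group.

```python
def dispatch_1d(num_groups, group_size):
    """Enumerate (thread_id, group_id, id_in_group) for a 1D dispatch.

    Hypothetical helper: emulates on the CPU how a flat array of threads
    is broken into groups whose members may share registers.
    """
    ids = []
    for group_id in range(num_groups):
        for id_in_group in range(group_size):
            # Assumed 1D mapping between the global and per-group indices
            thread_id = group_id * group_size + id_in_group
            ids.append((thread_id, group_id, id_in_group))
    return ids


threads = dispatch_1d(num_groups=4, group_size=64)
print(len(threads))   # 256 threads launched in total
print(threads[130])   # (130, 2, 2)
```

Only threads that report the same group ID would be able to see each other's shared registers in this model.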

Atomic Intrinsics
Enable parallel operations on individual
32-bit memory locations without requiring
full synchronization
Either video memory or shared registers

Can be used to implement higher-level synch constructs
Semaphores, etc.

Not intended for heavy lifting


Support an immediate return argument
At some performance cost

Atomic Intrinsics
Enables basic operations:
InterlockedAdd( rVar, val );
InterlockedMin( rVar, val );
InterlockedMax( rVar, val );
InterlockedOr( rVar, val );
InterlockedXOr( rVar, val );
InterlockedCompareWrite( rVar, val );
InterlockedCompareExchange( rVar, val );
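
The semantics of these intrinsics (an indivisible read-modify-write that can hand back the previous value) can be sketched on the CPU. The `AtomicCell` class below is a hypothetical stand-in for one 32-bit memory location, written in Python with a lock. It illustrates the behaviour only and says nothing about how hardware implements it.

```python
import threading


class AtomicCell:
    """Hypothetical stand-in for a single 32-bit memory location."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def interlocked_add(self, val):
        # Read-modify-write as one indivisible step; returns the old value,
        # mirroring the "immediate return argument" mentioned on the slide.
        with self._lock:
            old = self._value
            self._value = (old + val) & 0xFFFFFFFF  # wrap like a 32-bit uint
            return old

    def interlocked_max(self, val):
        with self._lock:
            old = self._value
            self._value = max(old, val)
            return old

    def load(self):
        with self._lock:
            return self._value


counter = AtomicCell()


def worker():
    for _ in range(1000):
        counter.interlocked_add(1)


pool = [threading.Thread(target=worker) for _ in range(8)]
for t in pool:
    t.start()
for t in pool:
    t.join()
print(counter.load())  # 8000: no increments are lost
```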

Unordered Memory Accesses


HLSL resource variables
Declared in the language

DXGI resources
Enables out-of-bounds memory checking
Returns 0 on reads
Writes are No-Ops

Improves security, reliability of shipped code
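
The out-of-bounds rules above (reads return 0, writes are no-ops) can be modelled in a few lines. This is a Python sketch of the behaviour only; `CheckedBuffer` is a hypothetical name and not part of any Direct3D API.

```python
class CheckedBuffer:
    """Hypothetical model of a resource with out-of-bounds checking."""

    def __init__(self, size):
        self._data = [0] * size

    def load(self, index):
        # Out-of-bounds reads return 0 instead of touching other memory
        if 0 <= index < len(self._data):
            return self._data[index]
        return 0

    def store(self, index, value):
        # Out-of-bounds writes are silently dropped (no-op)
        if 0 <= index < len(self._data):
            self._data[index] = value


buf = CheckedBuffer(4)
buf.store(2, 7)
buf.store(10, 99)                  # dropped: index 10 is outside the buffer
print(buf.load(2), buf.load(10))   # 7 0
```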

Unordered I/O
For fastest performance when ordering of
records need not be preserved
Both reads and writes:
UnorderedLoad( ResourceVar, val);
UnorderedStore( ResourceVar, val);

Requires buffer allocated before-hand

Integration with Direct3D


Pixel shaders can also perform scattered
writes
Enables rendering output to data
structures more complex than a 2D array
Histogram, linked list, irregular array, tree,
etc.

Don't Forget
Texture sampling still works:
Object.Load( Loc, Offset, Samples );
Object.Gather( Sampler, Loc );
Object.Sample( Sampler, Loc );
Object.SampleLevel( Sampler, Loc, LoD );

No automatic trilinear LoD calculation


Other graphics features are not present:
Antialiasing, depth culling, alpha blending,
triangle rasterization

Examples
Image Reduction
Image Histogram
FFT

Image Post-Processing
Significant fraction of frame time
10-20% for most games
50-70% for deferred shading-based engines

Savings here means more time for 3D

Image Reduction
Find the average intensity of an Image
E.g. for HDR exposure adjustment
Optimizes scene for viewing on SDR monitor

Algorithm breakdown:
Input: 1 million pixels
Compute: 1 MAD per pixel read
Output: 1 value

Should this run at texture sample rate?
It does not, due to write contention

Million-to-1 reduction

GPU

Input

Output

Reduction Compute Code


Buffer<uint> Values;
OutputBuffer<uint> Result;
ImageAverage()
{
    groupshared uint Total;    // Total so far
    groupshared uint Count;    // Count added

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance*65536;    // fixed point, so integer atomics can be used
    InterlockedAdd( Count, 1 );
    InterlockedAdd( Total, value );
    SynchronizeThreadGroup();    // allow all threads in group to complete

Reduction Compute Code

    // Allow all threads in group to complete
    SynchronizeThreadGroup();
    // Compute the average and store it in our output buffer
    if (sv_ThreadID.x == 0)
    {
        float fAverage = (float)Total / Count;    // compute avg
        UnorderedStore( Result[0], fAverage );    // write it out
    }
}
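
As a reference for what the shader above computes, here is a sequential CPU sketch in Python. The weights in `LUM_VECTOR` are an assumption (Rec. 709 luma coefficients); the slides do not define the constant. The fixed-point scale of 65536 is taken from the slide code.

```python
LUM_VECTOR = (0.2126, 0.7152, 0.0722)  # assumed Rec. 709 weights, not from the slides
FIXED_SCALE = 65536                    # same fixed-point scale as the shader code


def image_average(pixels):
    """Average luminance of an iterable of (r, g, b) floats in [0, 1]."""
    total = 0   # plays the role of the shared Total
    count = 0   # plays the role of the shared Count
    for r, g, b in pixels:
        luminance = r * LUM_VECTOR[0] + g * LUM_VECTOR[1] + b * LUM_VECTOR[2]
        total += int(luminance * FIXED_SCALE)   # integer add, as InterlockedAdd would do
        count += 1
    # Undo the fixed-point scale to get back to the [0, 1] range
    return (total / count) / FIXED_SCALE


pixels = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)] * 8   # half black, half white
avg = image_average(pixels)
print(round(avg, 3))   # 0.5
```

Python integers do not overflow, but a 32-bit accumulator would: with a scale of 65536, about 65,536 full-brightness pixels are enough to wrap a `uint`, so a real implementation would need a smaller scale or per-group partial sums.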

Fast Reduction Compute Code


Buffer<uint> Values;
OutputBuffer<uint> Result;
ImageAverage()
{
    groupshared uint Total[32];    // array of 32 totals
    groupshared uint Count[32];    // array of 32 counts

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance*65536;
    // Spread threads over the 32 slots; the mask must be 31, since "& 32" yields only 0 or 32
    uint idx = (sv_ThreadID.x + sv_ThreadID.y + sv_ThreadID.z) & 31;
    // Threads that map to the same slot still need atomic adds
    InterlockedAdd( Total[idx], value );
    InterlockedAdd( Count[idx], 1 );

Fast Reduction Compute Code

    // Allow all threads in group to complete
    SynchronizeThreadGroup();
    // Compute the average and store it in our output buffer
    if (sv_ThreadIDinGroup.x == 0)
    {
        uint TheTotal = 0;
        uint TheCount = 0;
        for ( uint i = 0; i < 32; i++ )
        {
            TheTotal += Total[i];
            TheCount += Count[i];
        }
        float fAverage = (float)TheTotal / TheCount;    // compute avg
        UnorderedStore( Result[sv_ThreadGroupID.x], fAverage );    // write
    }
}
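
The idea behind the faster version, spreading the accumulation over 32 slots so that fewer threads collide on the same location, can be checked with a sequential Python sketch. The function name and the 8x8 group shape are assumptions made for illustration.

```python
NUM_SLOTS = 32   # number of replicated accumulators, as in the slide code


def grouped_average(thread_ids, values):
    """Accumulate into NUM_SLOTS partial sums, then combine them."""
    totals = [0] * NUM_SLOTS
    counts = [0] * NUM_SLOTS
    for (x, y, z), value in zip(thread_ids, values):
        idx = (x + y + z) & (NUM_SLOTS - 1)   # mask 31 keeps idx in 0..31
        totals[idx] += value
        counts[idx] += 1
    # One thread per group would perform this final combine
    return sum(totals) / sum(counts), counts


# Assumed 8x8x1 thread group; each thread contributes x + y
ids = [(x, y, 0) for y in range(8) for x in range(8)]
vals = [x + y for x, y, _ in ids]
average, counts = grouped_average(ids, vals)
print(average)       # 7.0
print(max(counts))   # 8: at most 8 of the 64 threads share one slot
```

Note that masking with 32 instead of 31 would select only slot 0 or slot 32, which defeats the replication and indexes past the end of a 32-entry array.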

Reduction Performance
Pyramid approaches work today
Some choice in reduction level per pass
Tradeoff is contention for destination

1M pixels takes ~0.4ms in Direct3D


Pass-count-limited at small end of pyramid

Ideally should run at texture read rate


< 0.1 ms in theory, or 4-10x faster

Compute shader features should help


Such as local read-write cache
Prototypes show ~2x speed boost so far
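
The trade-off between reduction level per pass and pass count can be made concrete with a small sketch. This is a sequential Python model of a pyramid reduction on a flat list; the name `pyramid_reduce` is hypothetical. For roughly 1M values (2^20), a factor of 4 per pass (2x2 in image terms) needs 10 passes and a factor of 16 (4x4) needs 5, since 4^10 = 16^5 = 2^20.

```python
def pyramid_reduce(values, factor):
    """Sum `values` by repeatedly collapsing `factor` neighbours per pass.

    Returns (total, number_of_passes). Each pass stands in for one
    render pass over a progressively smaller level of the pyramid.
    """
    passes = 0
    while len(values) > 1:
        values = [sum(values[i:i + factor])
                  for i in range(0, len(values), factor)]
        passes += 1
    return values[0], passes


data = list(range(4096))
print(pyramid_reduce(data, 4))    # (8386560, 6)
print(pyramid_reduce(data, 16))   # (8386560, 3)
```

A larger factor means fewer passes but more inputs contending for each destination, which is the tension the slide describes.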

Histogram Generation
Similar to reduction problem
Reduce to 64-256 destinations at data-dependent (unpredictable) addresses

Still suffers contention when multiple pixels increment same bin
So replicate bins, e.g. 16x
Increment bins using InterlockedAdd() math operations

Currently showing 2x speedup

Histogram Generation Code


Histogram()
{
    shared int Histograms[16][256];    // array of 16

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    int iBin = fLuminance*255.0f;    // compute bin to increment
    int iHist = sv_ThreadIDinGroup.x & 15;    // use thread index; mask 15 selects copies 0..15
    InterlockedAdd( Histograms[iHist][iBin], 1 );    // update bin
    SynchronizeThreadGroup();    // allow all threads in group to complete

Histogram Generation Code


    // Write register histograms out to memory:
    iBin = sv_ThreadIDinGroup.x;
    if ( sv_ThreadIDinGroup.x < 256 )
    {
        for ( iHist = 0; iHist < 16; iHist++ )
        {
            int2 destAddr = int2( iHist, iBin );
            OutputResource.add( destAddr, Histograms[iHist][iBin] );    // atomic
        }
    }
}
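
The replicated-bin scheme can be checked against a plain histogram with a sequential Python sketch. The function name is hypothetical, and the 16 copies are merged at the end here, whereas the slide code writes each copy out separately.

```python
NUM_COPIES = 16   # replicated histograms, as in the slide code
NUM_BINS = 256


def group_histogram(luminances):
    """Histogram of luminances in [0, 1] using NUM_COPIES replicated bin sets."""
    copies = [[0] * NUM_BINS for _ in range(NUM_COPIES)]
    for thread_in_group, lum in enumerate(luminances):
        i_bin = int(lum * 255.0)                       # bin to increment
        i_hist = thread_in_group & (NUM_COPIES - 1)    # copy chosen by thread index
        copies[i_hist][i_bin] += 1
    # Merge the copies into one histogram (assumed final step)
    return [sum(copy[b] for copy in copies) for b in range(NUM_BINS)]


hist = group_histogram([0.0] * 10 + [1.0] * 5 + [0.5] * 3)
print(hist[0], hist[127], hist[255])   # 10 3 5
```

Neighbouring threads that see the same luminance now land in different copies, which is what reduces contention on a popular bin.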

Histogram Performance
Recent work shows similar performance to
reductions:
Direct3D takes ~2.4 ms per megapixel
On DirectX10 hardware

2x speedup shown via prototypes


On same hardware but using shared registers

8x theoretically possible
if purely read limited

Image Convolution
Fundamental operation for blurs:
HDR flares, depth-of-field, soft shadows, streaks

Need fairly large kernels for these


100 wide is possible at high resolutions
(sparse sampling produces artifacts)

7-Tap Separable Kernel

7-Tap Separable Kernel
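
A separable kernel lets a 2D blur run as a horizontal 1D pass followed by a vertical 1D pass, so a 7-tap kernel costs 14 reads per pixel instead of 49. The following Python sketch shows the structure; the binomial weights and the clamp-to-edge addressing are assumptions, as the slides do not specify either.

```python
KERNEL_7 = [w / 64.0 for w in (1, 6, 15, 20, 15, 6, 1)]   # assumed binomial weights


def convolve_1d(row, kernel):
    """1D convolution with clamp-to-edge addressing."""
    radius = len(kernel) // 2
    last = len(row) - 1
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)   # clamp to the image edge
            acc += weight * row[j]
        out.append(acc)
    return out


def separable_blur(image, kernel):
    """Blur rows first, then columns (two 1D passes)."""
    rows = [convolve_1d(r, kernel) for r in image]
    cols = [convolve_1d(list(c), kernel) for c in zip(*rows)]   # transpose, blur
    return [list(r) for r in zip(*cols)]                        # transpose back


impulse = [[0.0] * 9 for _ in range(9)]
impulse[4][4] = 1.0
blurred = separable_blur(impulse, KERNEL_7)
print(blurred[4][4])   # 0.09765625, i.e. (20/64)**2
```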

Convolution Performance
Massively variable depending on method
Direct3D does 5x5 kernel in 0.65ms/Mpix
Separable kernel

Prototype does slightly better


Using shared register capability

Theoretical performance should be higher


Some opportunity remains

Need to evaluate relevant kernel sizes


Games need 100x100 effectively

Other Example Techniques


These are not used directly in game post-processing today, but are key foundations of other algorithms
Scan (prefix sum), and
FFT (fast Fourier transform)

Scan (Prefix-sum)
Each number in data sequence is sum of
all previous numbers
Used to compute writes in irregular arrays
Foundation of Summed Area Tables

Known GPU algorithms (Horn's method)
Pyramid scheme, so I/O bound

Sharing memory between threads results in ~2x speedup
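
A log-step scan of the kind used on GPUs can be sketched sequentially. The version below is a Hillis-Steele style inclusive scan in Python; treating this as representative of the pyramid scheme mentioned above is an assumption, and the function names are hypothetical. Each loop iteration corresponds to one pass in which every element adds the value `offset` positions to its left.

```python
def inclusive_scan(data):
    """Inclusive prefix sum in ceil(log2(n)) passes."""
    result = list(data)
    offset = 1
    while offset < len(result):
        prev = result   # read from the previous pass, write a new list (ping-pong)
        result = [prev[i] + prev[i - offset] if i >= offset else prev[i]
                  for i in range(len(prev))]
        offset *= 2
    return result


def exclusive_scan(data):
    """Sum of all previous elements, excluding the element itself."""
    return [0] + inclusive_scan(data)[:-1]


print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))   # [3, 4, 11, 11, 15, 16, 22, 25]
print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))   # [0, 3, 4, 11, 11, 15, 16, 22]
```

The exclusive form is the one that gives write offsets for compacting irregular arrays: element i learns how many items precede it.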

Scan (Prefix-sum)
We are looking at providing this in a
library routine
Along with FFT, etc.

Summed Area Table


2D equivalent of Scan
Each element of 2D array has sum of all
elements up/left of it

Enables box filter with performance independent of kernel size k, i.e. O(1) per pixel
Fast generation of
Shadow blur with distance
Depth-of-field
Area light integrals, etc.
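
Why the box filter cost does not depend on kernel size can be seen from a small sketch: once the table is built, the sum over any rectangle takes four lookups. This is a sequential Python illustration with hypothetical function names; the table is padded with a leading row and column of zeros to avoid edge cases.

```python
def build_sat(image):
    """Summed area table with one extra zero row and column at the top/left."""
    height, width = len(image), len(image[0])
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            sat[y + 1][x + 1] = (image[y][x] + sat[y][x + 1]
                                 + sat[y + 1][x] - sat[y][x])
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum of image[y][x] for x0 <= x < x1 and y0 <= y < y1, in four lookups."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


image = [[y * 4 + x for x in range(4)] for y in range(4)]
sat = build_sat(image)
print(box_sum(sat, 1, 1, 3, 3))   # 30 = 5 + 6 + 9 + 10
print(box_sum(sat, 0, 0, 4, 4))   # 120, the sum of the whole image
```

Dividing `box_sum` by the box area gives the box-filtered value, whatever the box size.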

Fast Fourier Transform


Converts image into frequency domain
Many operations are faster in frequency
domain than in spatial domain
e.g. convolution becomes a multiply
Trivial detection of periodic noise
Some application to motion estimation

Core algorithm similar to scan


Similar I/O patterns,
But more math-intensive inner loop
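
The claim that convolution becomes a multiply in the frequency domain can be verified on a small 1D example. The sketch below uses a textbook recursive radix-2 FFT in Python, not the pixel-shader formulation from the slides; it assumes power-of-two lengths and computes a circular convolution.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is left unscaled; the caller divides by n.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_convolve(a, b):
    """Circular convolution via pointwise multiplication of spectra."""
    product = [x * y for x, y in zip(fft(a), fft(b))]
    return [v.real / len(a) for v in fft(product, inverse=True)]


signal = [1, 2, 3, 4, 0, 0, 0, 0]
kernel = [1, 1, 0, 0, 0, 0, 0, 0]
result = circular_convolve(signal, kernel)
print([round(v) for v in result])   # [1, 3, 5, 7, 4, 0, 0, 0]
```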

Direct3D FFT
Ping-pong between 2 R32G32F surfaces
R is Real, G is Imaginary

Do LogN passes along rows then columns


Pixel shader only
Does not use blenders or iterators
Uses vPos.xy as array indices [i][j]

Inner loop is math intensive


20+ instructions including trig
Indexing math dominates unless DX10

FFT Before

FFT After


FFT Performance
Complex 1024x1024 2D FFT:

                    Time     Throughput    Speedup
Software            42 ms    6 GFlops
Direct3D9           15 ms    17 GFlops     3x
Prototype DX11      6 ms     42 GFlops     6x
Latest chips        3 ms     100 GFlops

Shared register space and random access writes enable ~2x speedups

Order-Independent Translucency
Eliminates draw-order issues, and shimmer
in moving scenes
Correct AA even of transparent objects
Any object is transparent if antialiased
e.g. alpha tested leaves in forests

Current methods require large sample counts
Alpha-To-Coverage
Depth Peeling with Occlusion Queries

The A-Buffer Method


A-buffer is a more accurate method
Accumulate object data in per-pixel list
Then sort each pixel into order
Collapse to final color and display

Brings visual quality to movie levels without requiring 256-sample MSAA
Something to keep an eye on for OIT
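
The three A-buffer steps (accumulate per-pixel fragments, sort, collapse) can be sketched for a single pixel. This Python illustration assumes that larger depth means farther from the camera and that fragments blend with the standard "over" operator; the function name and fragment layout are hypothetical.

```python
def resolve_pixel(fragments, background):
    """Collapse an unordered per-pixel fragment list to a final color.

    fragments: list of (depth, (r, g, b), alpha) in arbitrary draw order.
    """
    color = background
    # Sort far to near (assumes larger depth = farther), then blend each
    # fragment over what is already behind it.
    for _, frag_color, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = tuple(alpha * fc + (1.0 - alpha) * c
                      for fc, c in zip(frag_color, color))
    return color


red_far = (0.8, (1.0, 0.0, 0.0), 0.5)
blue_near = (0.2, (0.0, 0.0, 1.0), 0.5)
black = (0.0, 0.0, 0.0)

print(resolve_pixel([red_far, blue_near], black))   # (0.25, 0.0, 0.5)
print(resolve_pixel([blue_near, red_far], black))   # (0.25, 0.0, 0.5), same result
```

Because the sort happens per pixel at resolve time, the result no longer depends on the order in which objects were drawn.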

A-Buffer Rendering
Currently prototyping using refrast
DirectX reference rasterizer running on CPU
Measuring memory access patterns/locality
Evaluating feasibility of hardware
Not really feasible with current Direct3D

Compute shader features enable this


Such as indexed writes, counters, etc.
Rendering to structures beyond regular arrays

But performance is still largely unknown

Additional Algorithms
New rendering methods
Ray-tracing, collision detection, etc.
Rendering elements at different resolutions

Non-rendering algorithms
IK, physics, AI, simulation, fluid simulation,
radiosity

Need more general data structures


Quad/octrees, irregular arrays, sparse arrays

Need linear algebra

Summary
Compute Shader is coming in Direct3D 11
GPU performance levels for more applications

Scalable parallel processing model


Code should scale for several generations

Increased generality will enable both:


Improved performance on existing GPU tasks
More CPU tasks can switch to DP cores

Full cross-vendor support


Enables broadest possible installed base

Questions?

www.xnagamefest.com

© 2008 Microsoft Corporation. All rights reserved.


This presentation is for informational purposes only.
Microsoft makes no warranties, express or implied, in this summary.
