08 gpuSoftwareRasterLaineAndPantaleoni BPS2011

Software Rasterization on GPUs
Samuli Laine Jacopo Pantaleoni
NVIDIA Research
Outline
Rasterization
Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.
Voxelization
Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.
Rationale
Build a research platform
Elbow space for game developers
Enable new algorithms

Programmable ROP
Stochastic rasterization
Non-linear rasterization
Non-quad derivatives
Quad merging
Decoupled sampling
Compact after discard
etc.
Provoke hardware architects

Flexibility of software, performance of fixed-function hardware
Building a Pipeline
We implemented a full pixel pipeline using CUDA
From triangle setup to ROP
Obey fundamental requirements of gfx pipe

Maintain input order
Hole-free rasterizer with correct rasterization rules
Make it as fast as possible!
Design Considerations
Run everything in parallel
We need a lot of threads to fill the machine
Minimize amount of synchronization

Avoid excessive use of atomics
Focus on load balancing

Graphics workloads are wild
Programmable Shading
Pipeline Structure
Chunker-style pipeline with four stages
Triangle setup, bin raster, coarse raster, fine raster
Run data in large batches
Separate kernel launch for each stage
Keep data in input order all the time

No need to sort
Chunking to Bins and Tiles

Frame buffer is divided into bins
Bin: 16x16 tiles (128x128 px)
Tile: 8x8 px
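The bin and tile sizes above map directly to integer arithmetic. The following is a minimal sketch of that mapping, assuming square 8x8 px tiles and 16x16-tile bins as stated on the slide. The names `PixelLocation` and `locatePixel` are illustrative only and are not taken from the cudaraster code.

```cpp
#include <cassert>

// Sizes from the slide: a tile is 8x8 px, a bin is 16x16 tiles (128x128 px).
constexpr int kTileSizePx   = 8;
constexpr int kBinSizeTiles = 16;
constexpr int kBinSizePx    = kTileSizePx * kBinSizeTiles;  // 128

// Hypothetical helper type: where a pixel sits in the bin/tile hierarchy.
struct PixelLocation {
    int binX, binY;    // bin coordinates within the frame buffer
    int tileX, tileY;  // tile coordinates within the bin (0..15)
    int px, py;        // pixel coordinates within the tile (0..7)
};

inline PixelLocation locatePixel(int x, int y) {
    assert(x >= 0 && y >= 0);
    PixelLocation loc;
    loc.binX  = x / kBinSizePx;
    loc.binY  = y / kBinSizePx;
    loc.tileX = (x % kBinSizePx) / kTileSizePx;
    loc.tileY = (y % kBinSizePx) / kTileSizePx;
    loc.px    = x % kTileSizePx;
    loc.py    = y % kTileSizePx;
    return loc;
}
```

For example, pixel (300, 9) falls in bin (2, 0), tile (5, 1) of that bin, and pixel (4, 1) of that tile. A 2048x2048 frame buffer would hold a 16x16 grid of such bins.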
Triangle Setup
Input: vertex buffer (positions, attributes) and index buffer
Output: triangle data buffer (edge equations, u/v plane equations, zmin, etc.)
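To make the setup output concrete, here is a hedged sketch of how edge equations and zmin could be derived from three vertices. The record layout (`Vertex`, `Edge`, `TriangleData`) and the function names are assumptions made for illustration. The u/v plane equations are omitted, and the inside test uses a plain `>= 0` comparison, so it does not implement the tie-breaking fill rule that a correct rasterizer needs on shared edges.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical vertex layout: x and y are fixed-point screen coordinates.
struct Vertex { std::int32_t x, y; float z; };

// Edge function E(x, y) = a*x + b*y + c.
struct Edge { std::int64_t a, b, c; };

// Hypothetical per-triangle record written to the triangle data buffer.
struct TriangleData { Edge edge[3]; float zmin; };

// Edge through p and q; E >= 0 on the interior side when the signed area
// (v1 - v0) x (v2 - v0) of the triangle is positive.
inline Edge makeEdge(const Vertex& p, const Vertex& q) {
    Edge e;
    e.a = std::int64_t(p.y) - q.y;
    e.b = std::int64_t(q.x) - p.x;
    e.c = std::int64_t(p.x) * q.y - std::int64_t(p.y) * q.x;
    return e;
}

inline TriangleData setupTriangle(const Vertex& v0, const Vertex& v1,
                                  const Vertex& v2) {
    TriangleData t;
    t.edge[0] = makeEdge(v0, v1);
    t.edge[1] = makeEdge(v1, v2);
    t.edge[2] = makeEdge(v2, v0);
    t.zmin = std::min(v0.z, std::min(v1.z, v2.z));
    return t;
}

// True if (x, y) is on the interior side of all three edges (no fill rule).
inline bool covers(const TriangleData& t, std::int64_t x, std::int64_t y) {
    for (const Edge& e : t.edge)
        if (e.a * x + e.b * y + e.c < 0) return false;
    return true;
}
```

Integer edge coefficients keep the inside test exact, which is one way to meet the hole-free requirement; whether the original code stores its equations this way is not stated on the slide.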
Bin Raster
Input: triangle data buffer
Bin raster runs on SM 0 through SM 14
Output per bin: IDs of triangles that overlap the bin
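The binning step can be illustrated with a sequential sketch. It assumes a bounding-box overlap test, which is conservative; the slide does not say which test the actual bin rasterizer uses. `BBox` and `binTriangles` are hypothetical names. Because triangles are visited in input order, each per-bin list comes out sorted by triangle ID, which mirrors the "keep data in input order" requirement.

```cpp
#include <algorithm>
#include <vector>

// Inclusive pixel bounds of a triangle (hypothetical helper type).
struct BBox { int minX, minY, maxX, maxY; };

constexpr int kBinSizePx = 128;  // bin size from the slides

// Returns, for each bin (row-major, index = y * binsX + x), the IDs of the
// triangles whose bounding box overlaps that bin, in input order.
inline std::vector<std::vector<int>> binTriangles(
        const std::vector<BBox>& boxes, int binsX, int binsY) {
    std::vector<std::vector<int>> bins(binsX * binsY);
    for (int id = 0; id < static_cast<int>(boxes.size()); ++id) {
        const BBox& b = boxes[id];
        // Skip empty boxes and boxes entirely left of or above the screen.
        if (b.maxX < 0 || b.maxY < 0 || b.minX > b.maxX || b.minY > b.maxY)
            continue;
        int x0 = std::max(b.minX, 0) / kBinSizePx;
        int y0 = std::max(b.minY, 0) / kBinSizePx;
        int x1 = std::min(b.maxX / kBinSizePx, binsX - 1);
        int y1 = std::min(b.maxY / kBinSizePx, binsY - 1);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                bins[y * binsX + x].push_back(id);
    }
    return bins;
}
```

A parallel version has to produce the same ordered lists without a global sort, which is where the synchronization and load-balancing concerns from the design slides come in.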
Coarse Raster
Input per bin: IDs of triangles that overlap the bin
Each coarse raster SM has exclusive access to the bin it is processing
Output per tile: IDs of triangles that overlap the tile
Fine Raster
Input per tile: IDs of triangles that overlap the tile
Each fine raster warp has exclusive access to the tile it is processing
Read tile once from DRAM to shared memory
Write tile once to DRAM
Output: pixel data in FB
Tidbit 1: Coverage Calculation
Step along edge (Bresenham-like)
Use look-up tables to generate coverage masks
~50 instructions for 8x8 stamp, one edge
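As a reference for what a per-edge coverage mask contains, the sketch below builds a 64-bit mask for an 8x8 stamp by evaluating the edge function at every pixel. This is a brute-force substitute, not the Bresenham-like stepping or the look-up tables named on the slide. It also samples at integer pixel coordinates and applies no fill rule; both are simplifications. All names are hypothetical.

```cpp
#include <cstdint>

// Edge function E(x, y) = a*x + b*y + c; E >= 0 counts as covered here.
struct Edge { std::int64_t a, b, c; };

// Bit (y * 8 + x) is set if pixel (tileX0 + x, tileY0 + y) is covered.
inline std::uint64_t edgeCoverage8x8(const Edge& e, int tileX0, int tileY0) {
    std::uint64_t mask = 0;
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
            std::int64_t v = e.a * (tileX0 + x) + e.b * (tileY0 + y) + e.c;
            if (v >= 0) mask |= std::uint64_t(1) << (y * 8 + x);
        }
    }
    return mask;
}

// Triangle coverage is the intersection of the three per-edge masks.
inline std::uint64_t triangleCoverage8x8(const Edge (&edges)[3],
                                         int tileX0, int tileY0) {
    return edgeCoverage8x8(edges[0], tileX0, tileY0) &
           edgeCoverage8x8(edges[1], tileX0, tileY0) &
           edgeCoverage8x8(edges[2], tileX0, tileY0);
}
```

The brute-force loop evaluates the edge function 64 times per edge, which gives a sense of why the slide quotes roughly 50 instructions per edge for the table-driven approach as a selling point.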
Tidbit 2: Fragment Distribution
In the input phase, calculate coverage and store it in a list
In the shading phase, detect triangle changes and calculate the triangle index and the fragment index within the triangle
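One way to picture the two phases is with per-triangle fragment counts and a prefix sum. The sketch below is a sequential stand-in: the input phase records where each triangle's fragments start, and the shading phase maps a flat fragment index back to a triangle, a fragment index within that triangle, and a pixel within the tile. It uses a binary search where the slide speaks of detecting triangle changes, so it is an assumption about the bookkeeping, not the warp-level method of the actual code. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FragmentRef { int triangle; int fragmentInTriangle; int pixel; };

inline int popcount64(std::uint64_t m) {
    int n = 0;
    while (m) { m &= m - 1; ++n; }  // clear the lowest set bit each round
    return n;
}

// Bit position of the n-th set bit of m (n counted from 0).
inline int nthSetBit(std::uint64_t m, int n) {
    assert(n >= 0 && n < popcount64(m));
    while (n-- > 0) m &= m - 1;
    int pos = 0;
    while (!(m & 1)) { m >>= 1; ++pos; }
    return pos;
}

// Input phase: exclusive prefix sum of per-triangle fragment counts.
// starts[i] is the flat index of triangle i's first fragment.
inline std::vector<int> inputPhase(const std::vector<std::uint64_t>& masks) {
    std::vector<int> starts(masks.size() + 1, 0);
    for (std::size_t i = 0; i < masks.size(); ++i)
        starts[i + 1] = starts[i] + popcount64(masks[i]);
    return starts;
}

// Shading phase: map a flat fragment index to (triangle, fragment, pixel).
inline FragmentRef shadingPhase(const std::vector<std::uint64_t>& masks,
                                const std::vector<int>& starts, int fragment) {
    assert(fragment >= 0 && fragment < starts.back());
    auto it = std::upper_bound(starts.begin(), starts.end(), fragment);
    int tri = static_cast<int>(it - starts.begin()) - 1;
    int local = fragment - starts[tri];
    return {tri, local, nthSetBit(masks[tri], local)};
}
```

Triangles with empty coverage produce repeated entries in `starts` and are skipped automatically by the search, so every shading thread gets a real fragment to work on.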
Test Scenes
Call of Juarez scene courtesy of Techland
S.T.A.L.K.E.R.: Call of Pripyat scene courtesy of GSC Game World
Performance Results
Frame rendering time in ms (depth test + color, no MSAA, no blending)
Comparison to Hardware (1/3)

Resolution
Cannot match hardware in raster, z kill + compact
Currently support max 2K x 2K frame buffer, 4 subpixel bits
Attributes
Fetched when used, hence bad latency hiding
Expensive interpolation
Antialiasing
Hardware nearly oblivious to MSAA, we much less so
Comparison to Hardware (2/3)

Memory usage, buffering through DRAM
Performance implications of reduced buffering unknown
Streaming through on-chip memory would be much better
+ Shader complexity
Shader performance theoretically the same as in graphics pipe
+ Frame buffer bandwidth

Each pixel touched only once in DRAM
Comparison to Hardware (3/3)

+ Extensibility
Need one stage to do something extra?
Need a new stage altogether?
You can actually implement it
+ Specialization to individual applications

Rip out what you don't need, hard-code what you can
Exploration Potential
Shader performance boosters
Compact after discard, quad merging, decoupled sampling, ...
Things to do with programmable ROP

A-buffering, order-independent transparency, ...
Stochastic rasterization
Non-linear rasterization
(Your idea here)
The Code is Out There

The entire codebase is open-sourced and released
http://code.google.com/p/cudaraster/
VoxelPipe:
A Programmable Pipeline for 3D Voxelization
What is Voxelization?
Voxelization =
Finding all voxels overlapped by each triangle in a mesh
Why should we care?

Why is it useful?
Shape matching
Collision detection
Fluid / soft-body simulation
Stress analysis
Level of detail
Ray tracing
Why should we care?

Interactive Indirect Illumination and Ambient Occlusion using Voxel Cone Tracing
Cyril Crassin (I3D 2011)
Rationale
Building a full-featured pipeline for voxelization, analogous to OpenGL for 2D rasterization:
fully conservative and thin* rasterization
arbitrary frame-buffer types
many blending modes (additive, max, min, and, or, ...)
multiple render targets
vertex shaders
fragment shaders
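The blending modes in the list above can be expressed as small function objects applied to a voxel value. This is a sketch of the idea only: the functor names and `blendVoxel` are hypothetical and do not reflect the VoxelPipe API.

```cpp
#include <algorithm>
#include <cstdint>

// One function object per blending mode named on the slide.
struct AddBlend { template <class T> T operator()(T dst, T src) const { return dst + src; } };
struct MaxBlend { template <class T> T operator()(T dst, T src) const { return std::max(dst, src); } };
struct MinBlend { template <class T> T operator()(T dst, T src) const { return std::min(dst, src); } };
struct AndBlend { template <class T> T operator()(T dst, T src) const { return dst & src; } };
struct OrBlend  { template <class T> T operator()(T dst, T src) const { return dst | src; } };

// Combine an incoming fragment value with the value already in the voxel.
template <class T, class Blend>
void blendVoxel(T& voxel, T fragment, Blend blend) {
    voxel = blend(voxel, fragment);
}
```

Making the blend operation a template parameter lets the same code path serve different frame-buffer types, for example `float` with `MaxBlend` or a bit mask with `OrBlend`. The bitwise modes only compile for integer types.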
Rationale
Extended support for rendering modes:
conventional blending-based rasterization
A-buffer / bucketing
Challenges
Previous research mostly concerned with binary output => no shading, no ROP
State of the art had poor load balancing => huge performance hit for mixed triangle sizes
What is Rasterization?
[Figure: pipeline diagram with stages labelled Vertex Shading and Fragment Shading]
Highly variable expansion rate: source of most load balancing problems
Observations (1)
Rasterization = Sorting of Compressed Batches of Elements
[Figure: triangles 1..n, each expanding into fragments f1,1, f1,2, f1,3, ...; triangle 3 expands up to f3,1000000]
highly variable decompression rate
Observations (2)
Decompression and Sorting can be done Hierarchically
emit per-tile fragments, sort by tile
emit per-voxel fragments, sort by voxel, blend
[Figure: triangles 1..n, each expanding into per-tile fragments f1,1, f1,2, ...]
decompression rate is more regular
Pipeline Overview
coarse rasterizer: one tri per thread, persistent threads; emits (tile, tri) fragments ordered by tri id
tile sorting: radix sort of the (tile, tri) fragments into tile queues
fine rasterizer: 1 CTA per tile, one tri per thread, persistent threads; writes FB tiles
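The tile-sorting stage can be sketched sequentially as a stable sort on the tile key followed by a counting pass that finds each tile queue. `std::stable_sort` stands in for the radix sort named on the slide; stability matters because it keeps fragments within a tile in triangle order. `TileFragment` and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One (tile, tri) fragment as emitted by the coarse rasterizer.
struct TileFragment { std::uint32_t tile; std::uint32_t tri; };

// Sort by tile only; the stable sort preserves the triangle order per tile.
inline void sortByTile(std::vector<TileFragment>& frags) {
    std::stable_sort(frags.begin(), frags.end(),
                     [](const TileFragment& a, const TileFragment& b) {
                         return a.tile < b.tile;
                     });
}

// offsets[t] .. offsets[t + 1] is the range of tile t's queue in the
// sorted array. Assumes every fragment has tile < numTiles.
inline std::vector<std::size_t> tileQueueOffsets(
        const std::vector<TileFragment>& sorted, std::uint32_t numTiles) {
    std::vector<std::size_t> offsets(numTiles + 1, 0);
    for (const TileFragment& f : sorted) ++offsets[f.tile + 1];
    for (std::uint32_t t = 0; t < numTiles; ++t) offsets[t + 1] += offsets[t];
    return offsets;
}
```

After this step each tile queue can be handed to one fine-rasterizer work group, which is what makes the per-tile decompression rate more regular than the per-triangle one.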
Programmable Shading
simple C++ classes:

struct MyShader
{
    T eval(const Fragment frag) const;
private:
    ...
};

T can be any of the supported types!
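A concrete shader in this style might look like the sketch below. The `Fragment` layout, the `DensityShader` class, and the `shadeFragments` driver are all assumptions made for illustration; only the shape of the `eval` member follows the slide. The driver deduces the result type from `eval`, which is one way to let a shader return any supported type.

```cpp
#include <vector>

// Hypothetical fragment layout: voxel coordinates plus two interpolants.
struct Fragment { int x, y, z; float u, v; };

// Example shader following the slide's pattern: state is private,
// eval() is const and returns the value to be blended.
struct DensityShader {
    explicit DensityShader(float scale) : m_scale(scale) {}
    float eval(const Fragment frag) const { return m_scale * (frag.u + frag.v); }
private:
    float m_scale;
};

// Hypothetical driver: runs any shader over a list of fragments.
// Requires C++14 for the deduced return type.
template <class Shader>
auto shadeFragments(const Shader& shader, const std::vector<Fragment>& frags) {
    std::vector<decltype(shader.eval(frags.front()))> out;
    out.reserve(frags.size());
    for (const Fragment& f : frags) out.push_back(shader.eval(f));
    return out;
}
```

Because the shader is an ordinary class passed as a template argument, the compiler can inline `eval` into the rasterization loop, which is the usual motivation for this pattern over virtual dispatch.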
Performance Results
150-300 M tris/s
Example Application: Real-Time GI
Future Work
Sparse octrees
Tessellation / geometry shaders
Programmable ROP
The entire codebase will be open-sourced
http://code.google.com/p/voxelpipe/
Thank You
Questions
Further information:
Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.
Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.
