
Software Rasterization on GPUs

Samuli Laine, Jacopo Pantaleoni

NVIDIA Research

Outline
Rasterization
Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.

Voxelization
Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.

Rationale
Build a research platform
Elbow space for game developers

Enable new algorithms


Programmable ROP, stochastic rasterization, non-linear rasterization, non-quad derivatives, quad merging, decoupled sampling, compact after discard, etc.

Provoke hardware architects


Flexibility of software, performance of fixed-function hardware

Building a Pipeline
We implemented a full pixel pipeline using CUDA
From triangle setup to ROP

Obey fundamental requirements of gfx pipe


Maintain input order
Hole-free rasterizer with correct rasterization rules

Make it as fast as possible!

Design Considerations
Run everything in parallel
We need a lot of threads to fill the machine

Minimize amount of synchronization


Avoid excessive use of atomics

Focus on load balancing


Graphics workloads are wild

Programmable Shading

Pipeline Structure
Chunker-style pipeline with four stages
Triangle setup, bin raster, coarse raster, fine raster
Run data in large batches
Separate kernel launch for each stage

Keep data in input order all the time


No need to sort

Chunking to Bins and Tiles


Frame buffer is divided into bins
Bin: 16x16 tiles, 128x128 px
Tile: 8x8 px
Pixel
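
To make the bin and tile sizes above concrete, here is a minimal sketch of how a pixel coordinate could be decomposed into bin, tile, and in-tile indices. The sizes come from the slide; the `PixelLocation` struct and `locatePixel` function are hypothetical names for illustration and are not taken from the CudaRaster code base.

```cpp
#include <cassert>

// Sizes from the slide: a tile is 8x8 pixels and a bin is 16x16 tiles (128x128 pixels).
constexpr int kTileSizePx   = 8;
constexpr int kBinSizeTiles = 16;
constexpr int kBinSizePx    = kTileSizePx * kBinSizeTiles;

// Hypothetical record, used only for this illustration.
struct PixelLocation {
    int binX, binY;     // which bin of the frame buffer
    int tileX, tileY;   // which tile inside that bin (0..15)
    int pixelX, pixelY; // which pixel inside that tile (0..7)
};

// Decompose a frame-buffer pixel coordinate into bin, tile, and pixel indices.
inline PixelLocation locatePixel(int x, int y) {
    assert(x >= 0 && y >= 0);
    PixelLocation loc;
    loc.binX   = x / kBinSizePx;
    loc.binY   = y / kBinSizePx;
    loc.tileX  = (x % kBinSizePx) / kTileSizePx;
    loc.tileY  = (y % kBinSizePx) / kTileSizePx;
    loc.pixelX = x % kTileSizePx;
    loc.pixelY = y % kTileSizePx;
    return loc;
}
```

For example, pixel (300, 9) falls in bin (2, 0), tile (5, 1) of that bin, and pixel (4, 1) of that tile.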

Triangle Setup
Input: vertex buffer (positions, attributes), index buffer

Triangle setup computes edge equations, u/v plane equations (pleqs), zmin, etc.

Output: triangle data buffer
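
As an illustration of the edge equations mentioned for triangle setup, the following sketch builds three integer edge functions for a counter-clockwise triangle. It is a textbook formulation under stated assumptions (y axis pointing up, integer vertex coordinates, no fill-rule bias); the type and function names are hypothetical, and the actual setup stage also produces plane equations and depth bounds that are omitted here.

```cpp
#include <cassert>
#include <cstdint>

struct Vec2i  { std::int64_t x, y; };    // hypothetical integer vertex position
struct EdgeEq { std::int64_t a, b, c; }; // E(x, y) = a*x + b*y + c

// Edge function for the directed edge p0 -> p1.
// With the y axis pointing up, E > 0 for points to the left of the edge.
inline EdgeEq makeEdge(Vec2i p0, Vec2i p1) {
    EdgeEq e;
    e.a = p0.y - p1.y;
    e.b = p1.x - p0.x;
    e.c = p0.x * p1.y - p1.x * p0.y;
    return e;
}

inline std::int64_t evalEdge(const EdgeEq& e, std::int64_t x, std::int64_t y) {
    return e.a * x + e.b * y + e.c;
}

struct TriangleSetup { EdgeEq edge[3]; };

// Assumes a non-degenerate, counter-clockwise triangle.
inline TriangleSetup setupTriangle(Vec2i v0, Vec2i v1, Vec2i v2) {
    // Twice the signed area; positive for counter-clockwise winding.
    std::int64_t area2 = (v1.x - v0.x) * (v2.y - v0.y) - (v2.x - v0.x) * (v1.y - v0.y);
    assert(area2 > 0);
    TriangleSetup t;
    t.edge[0] = makeEdge(v0, v1);
    t.edge[1] = makeEdge(v1, v2);
    t.edge[2] = makeEdge(v2, v0);
    return t;
}

// A point is covered when it is on the non-negative side of all three edges.
// Points exactly on an edge count as inside here; a real rasterizer applies a fill rule.
inline bool inside(const TriangleSetup& t, std::int64_t x, std::int64_t y) {
    return evalEdge(t.edge[0], x, y) >= 0 &&
           evalEdge(t.edge[1], x, y) >= 0 &&
           evalEdge(t.edge[2], x, y) >= 0;
}
```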

Bin Raster
Input: triangle data buffer

Bin raster runs on each SM (SM 0, SM 1, ..., SM 14)

Output: IDs of triangles that overlap each bin

Coarse Raster
...

Coarse Raster SM n

One coarse raster SM has exclusive access to the bin it's processing

IDs of triangles that overlap tile

Fine Raster
IDs of triangles that overlap tile

Fine Raster warp n

One fine raster warp has exclusive access to the tile it's processing

Read tile once from DRAM to shared

Write tile once to DRAM

Pixel data in FB

Tidbit 1: Coverage Calculation

Step along edge (Bresenham-like)
Use look-up tables to generate coverage masks
~50 instructions for 8x8 stamp, one edge
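
To show what a coverage mask for an 8x8 stamp and one edge looks like, here is a brute-force sketch that evaluates the edge function at all 64 pixels. This is not the Bresenham-like stepping with look-up tables described on the slide; it is a simpler stand-in that yields the same kind of 64-bit mask. It assumes samples at integer pixel coordinates and treats ties as covered, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Coverage of one edge E(x, y) = a*x + b*y + c over the 8x8 tile whose
// lower-left pixel is (x0, y0). Bit (y*8 + x) is set when E >= 0 at
// pixel (x0 + x, y0 + y).
inline std::uint64_t edgeCoverage8x8(long long a, long long b, long long c, int x0, int y0) {
    assert(x0 % 8 == 0 && y0 % 8 == 0); // tile origin must be tile-aligned
    std::uint64_t mask = 0;
    long long rowValue = a * x0 + b * y0 + c;
    for (int y = 0; y < 8; ++y) {
        long long value = rowValue;
        for (int x = 0; x < 8; ++x) {
            if (value >= 0)
                mask |= std::uint64_t(1) << (y * 8 + x);
            value += a; // step one pixel in x
        }
        rowValue += b;  // step one pixel in y
    }
    return mask;
}
```

The coverage of a triangle in the tile is then the bitwise AND of the masks of its three edges.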

Tidbit 2: Fragment Distribution

Input Phase

Shading Phase

In input phase, calculate coverage and store in list


In shading phase, detect triangle changes and calculate triangle index and fragment in triangle
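
The two phases can be sketched serially on the CPU as follows. The input phase stores per-triangle coverage masks and running fragment counts; the shading phase maps a flat fragment index back to a triangle index and a pixel within that triangle's coverage. This sketch uses a binary search over the running counts, which is an assumption made for illustration and not necessarily how the GPU implementation detects triangle changes; all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct FragmentRef {
    int triangle; // index into the coverage list
    int pixel;    // bit index (0..63) inside the 8x8 coverage mask
};

inline int popcount64(std::uint64_t m) {
    int n = 0;
    while (m) { m &= m - 1; ++n; } // clear the lowest set bit
    return n;
}

// Input phase: ends[t] = total number of fragments in triangles 0..t.
inline std::vector<int> buildFragmentEnds(const std::vector<std::uint64_t>& masks) {
    std::vector<int> ends;
    ends.reserve(masks.size());
    int total = 0;
    for (std::uint64_t m : masks) {
        total += popcount64(m);
        ends.push_back(total);
    }
    return ends;
}

// Shading phase: find which triangle and which covered pixel a fragment index refers to.
inline FragmentRef locateFragment(const std::vector<std::uint64_t>& masks,
                                  const std::vector<int>& ends, int fragment) {
    assert(!ends.empty() && fragment >= 0 && fragment < ends.back());
    // First triangle whose running total exceeds the fragment index.
    int tri = int(std::upper_bound(ends.begin(), ends.end(), fragment) - ends.begin());
    int local = fragment - (tri > 0 ? ends[tri - 1] : 0);
    std::uint64_t m = masks[tri];
    for (int i = 0; i < local; ++i) m &= m - 1; // skip the first `local` covered pixels
    int pixel = 0;
    while (!(m & 1)) { m >>= 1; ++pixel; }      // position of the next covered pixel
    return FragmentRef{tri, pixel};
}
```

Triangles with empty coverage are skipped naturally, because their running total does not advance.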

Test Scenes

Call of Juarez scene courtesy of Techland
S.T.A.L.K.E.R.: Call of Pripyat scene courtesy of GSC Game World

Performance Results

Frame rendering time in ms (depth test + color, no MSAA, no blending)

Comparison to Hardware (1/3)


Resolution
Cannot match hardware in raster, z kill + compact
Currently support max 2K x 2K frame buffer, 4 subpixel bits

Attributes
Fetched when used, so latency hiding is poor
Expensive interpolation

Antialiasing
Hardware nearly oblivious to MSAA, we much less so

Comparison to Hardware (2/3)


Memory usage, buffering through DRAM
Performance implications of reduced buffering unknown
Streaming through on-chip memory would be much better

+ Shader complexity
Shader performance theoretically the same as in graphics pipe

+ Frame buffer bandwidth


Each pixel touched only once in DRAM

Comparison to Hardware (3/3)


+ Extensibility
Need one stage to do something extra?
Need a new stage altogether?
You can actually implement it

+ Specialization to individual applications


Rip out what you don't need, hard-code what you can

Exploration Potential
Shader performance boosters
Compact after discard, quad merging, decoupled sampling, ...

Things to do with programmable ROP


A-buffering, order-independent transparency, ...

Stochastic rasterization
Non-linear rasterization

(Your idea here)

The Code is Out There


The entire codebase is open-sourced and released

http://code.google.com/p/cudaraster/

VoxelPipe:
A Programmable Pipeline for 3D Voxelization

What is Voxelization?
Voxelization =
Finding all voxels overlapped by each triangle in a mesh
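
As a small illustration of this definition, the sketch below computes the range of voxels touched by a triangle's axis-aligned bounding box on a uniform grid. This is only a candidate superset: an exact triangle/voxel overlap test is still needed to reject voxels inside the box that the triangle misses. The types, the function name, and the grid convention (voxel i covers [i*size, (i+1)*size) along each axis) are assumptions made for illustration and are not VoxelPipe's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3f { float x, y, z; };

// Inclusive range of voxel indices along each axis.
struct VoxelRange { int lo[3]; int hi[3]; };

// Candidate voxels for a triangle: all voxels intersecting its bounding box.
inline VoxelRange candidateVoxels(Vec3f a, Vec3f b, Vec3f c, float voxelSize) {
    assert(voxelSize > 0.0f);
    const float v[3][3] = {
        {a.x, b.x, c.x},
        {a.y, b.y, c.y},
        {a.z, b.z, c.z},
    };
    VoxelRange r;
    for (int axis = 0; axis < 3; ++axis) {
        float mn = std::min({v[axis][0], v[axis][1], v[axis][2]});
        float mx = std::max({v[axis][0], v[axis][1], v[axis][2]});
        r.lo[axis] = int(std::floor(mn / voxelSize));
        r.hi[axis] = int(std::floor(mx / voxelSize));
    }
    return r;
}
```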

Why should we care?


Why is it useful?

Shape matching, collision detection, fluid / soft-body simulation, stress analysis, level of detail, ray tracing

Why should we care?


Interactive Indirect Illumination and Ambient Occlusion using Voxel Cone Tracing

Cyril Crassin (I3D 2011)

Rationale
Building a full-featured pipeline for voxelization, analogous to OpenGL for 2D rasterization:
fully conservative and thin* rasterization
arbitrary frame-buffer types
many blending modes (additive, max, min, and, or, ...)
multiple render targets
vertex shaders
fragment shaders

Rationale
Extended support for rendering modes:


conventional blending-based rasterization

A-buffer / bucketing

Challenges
Previous research mostly concerned with binary output
=> no shading, no ROP
State of the art had poor load balancing
=> huge performance hit for mixed triangle sizes

What is Rasterization?
Highly variable expansion rate: source of most load balancing problems
(Figure labels: Vertex Shading, Fragment Shading)

Observations (1)
Rasterization = Sorting of Compressed Batches of Elements
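(Figure: triangles 1 to 6 and n, each listed with its fragments: f1,1 f1,2 f1,3 for triangle 1; f2,1 for triangle 2; f3,1 f3,2 f3,3 up to f3,1000000 for triangle 3; f4,1 f4,2 for triangle 4; fn,1 fn,2 for triangle n)

highly variable decompression rate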

Observations (2)
Decompression and Sorting can be done Hierarchically
emit per-tile fragments, sort by tile

emit per-voxel fragments, sort by voxel, blend

(Figure: triangles 1, 2, 3, 4 and n, each listed with its fragments: f1,1 f1,2 f1,3 for triangle 1; f2,1 for triangle 2; f3,1 f3,2 f3,3 up to f3,1000000 for triangle 3; f4,1 f4,2 for triangle 4; fn,1 fn,2 for triangle n)

decompression rate is more regular

Pipeline Overview
coarse rasterizer (one tri per thread, persistent threads) -> (tile, tri) fragments by tri id
tile sorting (radix sort) -> tile queues
fine rasterizer (1 CTA per tile; one tri per thread, persistent threads) -> FB tiles
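
The tile-sorting step between the coarse and fine rasterizers can be sketched as below. The coarse stage emits (tile, tri) pairs in triangle order; a stable sort by tile groups them into per-tile queues while preserving triangle order within each tile. `std::stable_sort` stands in here for the GPU radix sort named on the slide, and the type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One coarse-rasterizer output record: triangle `tri` overlaps tile `tile`.
struct TileFragment { int tile; int tri; };

inline bool byTile(const TileFragment& a, const TileFragment& b) {
    return a.tile < b.tile;
}

// Group fragments by tile. Stability keeps the original triangle order inside each tile.
inline std::vector<TileFragment> sortByTile(std::vector<TileFragment> frags) {
    std::stable_sort(frags.begin(), frags.end(), byTile);
    return frags;
}

// Extract the triangle queue of one tile from the sorted list.
inline std::vector<int> tileQueue(const std::vector<TileFragment>& sorted, int tile) {
    assert(std::is_sorted(sorted.begin(), sorted.end(), byTile));
    std::vector<int> tris;
    for (const TileFragment& f : sorted)
        if (f.tile == tile) tris.push_back(f.tri);
    return tris;
}
```

Preserving triangle order within a tile matters when the blending mode is order-dependent.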

Programmable Shading
simple C++ classes:

struct MyShader
{
    T eval(const Fragment frag) const;
private:
    ...
};

T can be any of the supported types!
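
A minimal sketch of a shader in this style is shown below, together with a loop that blends its output additively. Only the `eval(const Fragment frag) const` shape comes from the slide; the `Fragment` fields, the `DensityShader` class, and the `shadeAndBlendAdd` helper are hypothetical stand-ins, not VoxelPipe's actual types or API.

```cpp
#include <cassert>

// Hypothetical stand-in for the pipeline's fragment type; the real one is not shown on the slide.
struct Fragment {
    int x, y, z; // voxel coordinates
};

// A shader with T = float that contributes a constant weight per fragment.
struct DensityShader {
    explicit DensityShader(float weight) : m_weight(weight) { assert(weight >= 0.0f); }

    float eval(const Fragment frag) const {
        (void)frag; // this simple shader ignores the fragment position
        return m_weight;
    }

private:
    float m_weight;
};

// Shade every fragment that landed in one voxel and combine the results with additive blending.
template <typename Shader>
float shadeAndBlendAdd(const Shader& shader, const Fragment* frags, int count) {
    float value = 0.0f;
    for (int i = 0; i < count; ++i)
        value += shader.eval(frags[i]);
    return value;
}
```

Swapping the `+=` for a max, min, or bitwise operation would give the other blending modes listed earlier, with the shader class left unchanged.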

Performance Results

150-300 M tris/s

Example Application: Real-Time GI

Future Work
Sparse octrees
Tessellation / geometry shaders
Programmable ROP

The entire codebase will be open-sourced

http://code.google.com/p/voxelpipe/

Thank You
Questions

Further information:
Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.
Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.
