
GPGPU-Sim Tutorial

Zhen Lin
North Carolina State University
Based on GPGPU-Sim Tutorial and Manual by UBC

Outline
GPGPU-Sim Overview
Demo1: Setup & Configuration
GPGPU-Sim Internals
Demo2: Scheduling Study

Outline
GPGPU-Sim Overview
Demo1: Setup & Configuration
GPGPU-Sim Internals
Demo2: Scheduling Study

GPGPU-Sim in a Nutshell
Microarchitecture timing model of contemporary GPUs
Runs unmodified CUDA/OpenCL applications

What GPGPU-Sim Simulates


Functional model
PTX
SASS

Timing model for the compute part of a GPU


Not the CPU or the PCIe bus
Only models microarchitecture timing relevant to compute

Functional model
PTX
A low-level, data-parallel virtual machine and instruction set architecture (ISA)
Between CUDA and hardware ISA (SASS)
Stable ISA that spans multiple GPU generations

SASS / PTXPlus
The hardware's native ISA
PTX -> Translate + Optimize -> SASS
More accurate, but not as well supported

CUDA tool chain

Functional Model (PTX)

Scalar ISA
SSA-like representation with unlimited virtual registers: register allocation is not done in PTX

Timing Model for GPU Micro-Architecture


GPGPU-Sim simulates the timing model of a GPU running each launched CUDA kernel
Reports stats (e.g. # cycles) for each kernel
Excludes any time spent on data transfer over the PCIe bus
The CPU is assumed to be idle while the GPU is working

Compilation Path

Outline
GPGPU-Sim Overview
Demo1: Setup & Configuration
GPGPU-Sim Internals
Demo2: Scheduling Study

Demo1
Setup
Stats
Configuration

Outline
GPGPU-Sim Overview
Demo1: Setup & Configuration
GPGPU-Sim Internals
Demo2: Scheduling Study

Overview of the Architecture

Inside a SIMT Core


Pipeline stages

Fetch
Decode
Issue
Read operand
Execution
Writeback

Fetch + Decode
Arbitrates the I-cache among warps
A cache miss is handled by fetching again later

The fetched instruction is decoded and then stored in the I-Buffer
1 or more entries per warp
Only warps with vacant entries are considered for fetch

Issue
Selects a warp with a ready instruction
Acquires the active mask from the TOS of the SIMT stack
Invalidates the corresponding I-Buffer entry

Scoreboard
Checks for RAW and WAW dependency hazards
Flags instructions with hazards as not ready in the I-Buffer
(masking them out from the scheduler)

An instruction reserves its destination registers at issue

Releases them at writeback
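The reserve-at-issue / release-at-writeback protocol above can be sketched as follows. This is an illustrative Python sketch, not GPGPU-Sim's actual (C++) implementation; the class and register names are hypothetical.

```python
# Sketch of a per-warp scoreboard: destination registers are reserved at
# issue and released at writeback; an instruction with a RAW or WAW hazard
# against a reserved register is flagged as not ready.

class Scoreboard:
    def __init__(self):
        self.reserved = set()  # destination registers of in-flight instructions

    def has_hazard(self, srcs, dst):
        # RAW: a source is still pending; WAW: the destination is pending
        return any(r in self.reserved for r in srcs) or dst in self.reserved

    def issue(self, dst):
        self.reserved.add(dst)      # reserve destination at issue

    def writeback(self, dst):
        self.reserved.discard(dst)  # release at writeback

sb = Scoreboard()
sb.issue("r1")                      # in-flight: r1 = ...
assert sb.has_hazard(["r1"], "r2")  # RAW on r1
assert sb.has_hazard(["r3"], "r1")  # WAW on r1
sb.writeback("r1")
assert not sb.has_hazard(["r1"], "r2")
```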

December 2012

GPGPU-Sim Tutorial (MICRO 2012) 4: Microarchitecture Model

4.17

Read Operand
The register file is split into banks; registers are striped across them:

Bank 0: R0, R4, R8
Bank 1: R1, R5, R9
Bank 2: R2, R6, R10
Bank 3: R3, R7, R11

add.s32 R3, R1, R2;  -> no conflict (R1 and R2 are in different banks)

mul.s32 R3, R0, R4;  -> conflict at bank 0 (R0 and R4 are both in bank 0)
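The bank-conflict check in the two examples above can be sketched as follows. A minimal Python sketch, assuming the striping shown on the slide (bank = register number mod 4); function names are hypothetical.

```python
# Registers are striped across NUM_BANKS banks; an instruction conflicts
# when two of its source operands map to the same bank.

NUM_BANKS = 4

def bank(reg):
    return reg % NUM_BANKS

def conflicting_banks(regs):
    seen, conflicts = set(), set()
    for r in regs:
        b = bank(r)
        if b in seen:
            conflicts.add(b)
        seen.add(b)
    return conflicts

# add.s32 R3, R1, R2 -> sources R1 (bank 1), R2 (bank 2): no conflict
assert conflicting_banks([1, 2]) == set()
# mul.s32 R3, R0, R4 -> R0 and R4 both map to bank 0: conflict
assert conflicting_banks([0, 4]) == {0}
```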

Operand Collector Architecture (US Patent: 7834881)


Interleaves operand fetches from different threads to achieve full utilization of the register-file banks


Operand Collector
Collector units receive instructions from the issue stage, gather their source operands from the register-file banks, and dispatch them to the execution units


Execution
ALU
Stream processor (SP)
Special function unit (SFU)

MEM

Shared memory
Local memory
Global memory
Texture memory
Constant memory

ALU Pipelines
SIMD Execution Unit
Fully Pipelined
Each pipe may execute a subset of instructions
Configurable bandwidth and latency (depending on the instruction)
Default: SP + SFU pipes


Memory Unit

Models timing for memory instructions
Supports half-warps (16 threads)
The unit is double clocked: each cycle services half the warp
Pipeline: address generation unit (AGU) -> shared-memory bank-conflict check / memory access coalescing -> data cache (with shared MSHRs), constant cache, and texture cache -> memory port
Has a private writeback path
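The coalescing step in the memory pipeline above can be sketched as follows. An illustrative Python sketch, assuming one transaction per aligned 128-byte segment; the real coalescing rules vary by GPU architecture, and the segment size here is an assumption.

```python
# Memory access coalescing: merge the per-thread addresses of a half-warp
# into the minimal set of aligned 128-byte memory transactions.

SEGMENT = 128  # bytes per memory transaction (assumed)

def coalesce(addresses):
    # one transaction per distinct 128-byte segment touched
    return sorted({addr // SEGMENT * SEGMENT for addr in addresses})

# 16 consecutive 4-byte accesses fall in one segment -> one transaction
half_warp = [4 * t for t in range(16)]
assert coalesce(half_warp) == [0]

# a stride of 128 bytes touches 16 segments -> 16 transactions
strided = [128 * t for t in range(16)]
assert len(coalesce(strided)) == 16
```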

Writeback
Writes the result to the register file
The scoreboard releases the reserved destination registers

Stack-Based Branch Divergence Hardware

When a branch diverges:

New entries are pushed onto the SIMT stack

The reconvergence PC (RPC) is set to the immediate post-dominator
The active mask indicates which threads are active
The PC is sent to the fetch unit

When the RPC is reached:

Pop the TOS
The PC of the new TOS is sent to the fetch unit
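The push/pop behavior above can be sketched as follows. An illustrative Python sketch of stack-based reconvergence, not GPGPU-Sim's actual implementation; the stack-entry layout (PC, active mask, RPC) and PCs are assumptions.

```python
# On a divergent branch: pop the TOS, push a reconvergence entry with the
# RPC (immediate post-dominator), then one entry per taken path with its
# active mask. Pop an entry when its threads reach the RPC.

def run_divergent_branch(stack, taken_mask, taken_pc, fallthru_pc, rpc):
    pc, mask, _ = stack.pop()            # current TOS entry diverges
    stack.append((rpc, mask, None))      # reconvergence entry: full mask
    fallthru_mask = [a and not t for a, t in zip(mask, taken_mask)]
    if any(fallthru_mask):
        stack.append((fallthru_pc, fallthru_mask, rpc))
    if any(taken_mask):
        stack.append((taken_pc, taken_mask, rpc))

# 4-thread warp, all active at PC 0x10; threads 0 and 1 take the branch
stack = [(0x10, [True] * 4, None)]
run_divergent_branch(stack, [True, True, False, False],
                     taken_pc=0x20, fallthru_pc=0x14, rpc=0x30)
pc, mask, rpc = stack[-1]
assert (pc, mask) == (0x20, [True, True, False, False])  # taken path on TOS
stack.pop()                              # taken path reaches the RPC
stack.pop()                              # fall-through path reaches the RPC
assert stack[-1][0] == 0x30              # reconverged: full mask at the RPC
assert stack[-1][1] == [True] * 4
```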

Outline
GPGPU-Sim Overview
Demo1: Setup & Configuration
GPGPU-Sim Internals
Demo2: Scheduling Study

Demo2
Software framework overview
Monitor the warp scheduling order
Compare different scheduling policies
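To make the comparison concrete, here is an illustrative Python sketch of two warp-scheduling policies commonly studied with GPGPU-Sim: loose round-robin (LRR) and greedy-then-oldest (GTO). The function names and 4-warp setup are assumptions, not the simulator's code.

```python
# LRR rotates to the next ready warp after the last one that issued;
# GTO keeps issuing from the same warp until it stalls, then falls back
# to the oldest (lowest-id) ready warp.

NUM_WARPS = 4  # warps per scheduler (assumed)

def lrr_pick(ready, last):
    for i in range(1, NUM_WARPS + 1):
        w = (last + i) % NUM_WARPS
        if w in ready:
            return w

def gto_pick(ready, greedy):
    if greedy in ready:
        return greedy        # stay greedy while the warp is ready
    return min(ready)        # otherwise pick the oldest ready warp

assert lrr_pick({0, 1, 2, 3}, last=0) == 1   # rotates to the next warp
assert gto_pick({0, 1, 2, 3}, greedy=0) == 0 # sticks with warp 0
assert gto_pick({1, 2, 3}, greedy=0) == 1    # warp 0 stalled: oldest ready
```

Monitoring which warp issues each cycle (as in the demo) makes the difference visible: LRR spreads issue slots evenly, while GTO concentrates them on one warp at a time.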

For More Information


http://www.gpgpu-sim.org/

Thanks & questions?
