Nava Risc

SIMD Pipelined Processor
Implemented on an FPGA
MS Comprehensive Exam
Benjamin Mar
July 2, 2007
Presentation Overview
Introduction
Thesis Statement & Contributions
Background
Processor Architecture
Results
Conclusions & Future Work
Outline
Introduction
Background
Results
Introduction
Modern processors range from:
General-purpose fixed processors that implement a complex instruction set
to soft-core processors that implement a reconfigurable, reduced instruction set
When designing a processor architecture:
What instruction set should be implemented?
How should the chip area be allocated?
What should the hardware be responsible for?
What should the software be responsible for?
How should the memory architecture be designed?
How should hazards be avoided?
Thesis Statement
There is a need to develop and implement a new processor
that can serve as an effective platform for education as well
as for future research in applications like image processing
(binary and convolution operations).
Education
Pipeline architecture
Simulation & Synthesis
Hazard control
Research
Single Instruction Multiple Data (SIMD) pipeline
Reconfigurability
Single cycle throughput
Contributions
A synthesizable VHDL description of a five stage
pipeline processor with hazard control.
Complete instruction set based on MIPS instructions
A synthesizable VHDL description of a SIMD version of
the five stage pipeline processor with hazard control.
Arithmetic, logical, and memory instructions
Instruction set chosen to be able to implement calculations such
as convolution and morphological operations
An analysis of the processor's maximum frequency and
area usage on the FPGA.
Efficient area usage
Background:
MIPS Processor (I/II)
Pipeline architecture
RISC processor
Single cycle execution
32-bit operations
Used as an example for teaching pipelines in many classes
Background:
MIPS Processor (II/II)
Argument against MIPS (CISC vs. RISC)
Replacing complex instructions with many
simple instructions
Lower speed
Response to CISC
Speed comes from pipeline
Complex instructions do not use area
efficiently
Background: Motivation
Build on the conceptual foundation of MIPS
Improve architecture
Hazard control
SIMD functionality
Reconfigurable
Background: SIMD
Method of improving performance in
applications that have highly repetitive
operations (as in signal and image
processing)
Values stored in SIMD vectors
Vectors use special set of CPU registers
Makes use of multiple CPU functional
units that execute concurrently
Background: Prior Work (I/II)
SIMD architectures use multiple processors to carry out
the multiple executions
Processing elements (PE) must be
connected in a network
Background: Prior Work (II/II)
Processor | Approach | Clock period | Data size per PE
CLIP 4 (1980) | Mesh network, 9216 PEs, bit-serial processor with 32 bits of memory per PE | 400 ns | 1 bit
GAPP (1984) | Mesh network, 72 bit-serial processors with 128 bits of memory per PE | 100 ns | 1 bit
MPP (1983) | Mesh network, 16000 PEs with 1024 bits of memory per PE | 100 ns | 1 bit
WPM (1989) | Pyramid network, deeply pipelined, 256 PEs, 8000 bits of memory off chip | 100 ns | 16 bits
VIP (1996) | FPGA, torus network, 28 PEs on 4 FPGAs, 1.5 MB external memory | 60 ns | 32 bits
MATRIX (1996) | FPGA, configurable network, 256 x 8-bit memory per PE | 10 ns (estimate) | 8 bits
Background: Motivation
Disadvantage of prior work
No single cycle throughput
What is provided in this implementation
Five-stage pipeline
Efficient instruction set (small area)
32-bit operations
Processor Architecture: SISD
Architecture
32-bit instruction set
Five stage pipeline
Harvard memory
Load-store
Hazard control
[Figure: SISD block diagram]
Stage labels: IF, IF/ID, ID, ID/EXE, EXE, MEM, EXE/WB
Blocks: PC register, instruction memory, data memory, controller, branch logic, next PC logic, register file, ALU controller, shifter, comparator, ALU, multiplier, forward MUXA, forward MUXM, register write MUX, instruction address MUX, register destination MUX
Inter-stage registers: TAKENEXTPC, NEXTPC, BREAKH
Inter-stage registers: A, B, M, ALUOP, SHAMT, FUNCT, WB, REGWRITEOUT, MEMTOREG, MEMWRITEH
Inter-stage registers: WB, TAKEMUL, REGWRITEOUT, MEMTOREG
Processor Architecture:
SISD Instruction Set
Arithmetic Instructions:
addu, addiu, subu, mul
Logical Instructions:
and, andi, or, ori, xor, xori
Shift Instructions:
sll, srl, sra, lui
Comparison Instructions:
slt, sltu, slti, sltiu
Memory Instructions:
lw, sw
Branch Instructions:
beq, bne, bgez, bltz
Jump Instructions:
j, jr, jal
Exception Instruction:
break
Implemented 28 MIPS32 instructions out of 189:
Chosen for completeness
Instruction Set Completeness
Instruction set can be used to implement:
any logical operation
convolutions using multiply and add
any memory transfer
any conditional/unconditional jump
instruction
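The claim that convolution needs only multiply and add can be illustrated with a short sketch. This is a software model, not the processor's code: the function name `convolve_mac` and the 32-bit wrap-around mask are assumptions made for illustration, standing in for repeated mul and addu steps.

```python
MASK32 = 0xFFFFFFFF  # assumption: results wrap to 32 bits, as in a 32-bit datapath


def convolve_mac(signal, kernel):
    """Full 1-D convolution built only from multiply and add steps."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            product = (s * k) & MASK32                     # one multiply
            out[i + j] = (out[i + j] + product) & MASK32   # one add
    return out


print(convolve_mac([1, 2, 3], [1, 1]))  # [1, 3, 5, 3]
```

Each inner iteration is one multiply followed by one accumulate, which is the pattern a multiply-and-add instruction pair would follow.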
Instruction Fetch (I/II)
Synchronous instruction memory (no IR register)
Address given in IF
Instruction read in ID
Program counter (PC) register
Instruction Fetch (II/II)
Directs the flow of the program
Branches: conditional jumps
Jumps: unconditional jumps
Breaks: stalls for debugging
Retrieves instructions from memory
Instruction Decode (I/II)
Implements branch delay slot
Compiler is responsible for placing an instruction to execute
after the branch
Contains the control unit of the processor
Instruction Decode (II/II)
Decodes each instruction
Sets control signals for instructions
Determines if branch is taken
Calculates branch/jump target address
Execution (I/II)
Carries out the operations of the instructions
Synchronous multiplier
Operands given in EXE
Result read in WB
Potential data hazard handled by hazard controller
Execution (II/II)
Sets controls for calculations
Calculates arithmetic and logical values
Compares values
Shifts bits
Processor Architecture: Memory
Data storage for processor
Synchronous data memory
Address given in MEM
Data read in WB
Potential data hazard handled by hazard controller
Processor Architecture: Write Back
Writes data from memory
Writes operation results
Processor Inter-Connections
Connects the stages together
Holds values in registers between stages for pipeline functionality
Watched by hazard detector
Hazard Control
Data forwarding
Register value needed in ID stage
Register written by an instruction still in the EXE or MEM stage
Data forwarded from the input of the inter-stage registers
Resolves read-after-write dependencies
Write-after-write and write-after-read dependencies are not an issue for this pipeline
jal address forwarded
Pipeline stall
Stall the pipeline when the forwarding data is not available during that clock cycle
Multiply
Load word
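The forward-or-stall decision described on this slide can be sketched in software. This is a simplified model under stated assumptions, not the VHDL hazard controller: the function `classify_hazards`, the tuple layout, and the choice to inspect only the immediately preceding instruction are all hypothetical, and register 0 is assumed to be hardwired to zero as in MIPS. The instruction sequence is the one used in the deck's hazard control example.

```python
# Producers whose result is not ready in time to be forwarded to the very
# next instruction (the slide lists multiply and load word as stall cases).
STALLING_PRODUCERS = {"mul", "lw"}


def classify_hazards(program):
    """Return (op, action) per instruction: 'none', 'forward' or 'stall'.

    program is a list of (op, dest_reg, source_regs) tuples. Only the
    immediately preceding instruction is checked, which is a simplification.
    """
    decisions = []
    for index, (op, _dest, sources) in enumerate(program):
        action = "none"
        if index > 0:
            prev_op, prev_dest, _ = program[index - 1]
            # Assumption: register 0 always reads as zero, so never forward it.
            if prev_dest not in (None, 0) and prev_dest in sources:
                action = "stall" if prev_op in STALLING_PRODUCERS else "forward"
        decisions.append((op, action))
    return decisions


example = [
    ("addu", 1, (2, 3)),
    ("or", 4, (1, 2)),
    ("mul", 5, (1, 4)),
    ("and", 6, (7, 5)),
    ("subu", 0, (0, 0)),
]
for op, action in classify_hazards(example):
    print(op, action)
```

In this model only the instruction that immediately follows the multiply is stalled; the other read-after-write dependencies are resolved by forwarding.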
Hazard Control Example
addu $1, $2, $3
or $4, $1, $2
mul $5, $1, $4
and $6, $7, $5
subu $0, $0, $0
Cycle | IF | ID | EXE | WB
Cycle 1 | addu | - | - | -
Cycle 2 | or | addu | - | -
Cycle 3 | mul | or | addu | -
Cycle 4 | and | mul | or | addu
Cycle 5 | subu | and | mul | or
Cycle 6 | subu | and | - | mul
Processor Inter-Connections
Instructions clear the pipeline in four clock cycles
Prevents data hazards and structural hazards
Does not need to kill instructions or flush the pipeline
From SISD to SIMD
Modify instruction set
SISD (1 execution unit)
add: addu $1, $2, $3
$1 ← $2 + $3
load word: lw $4, 0($0)
$4 ← Mem(0)
branch: beq $4, $3, target
SIMD (N execution units)
add: vaddu $1, $2, $3
$1a ← $2a + $3a, $1b ← $2b + $3b, $1c ← $2c + $3c, ...
load word: vlw $4, 0($0)
$4a ← Mema(0), $4b ← Memb(0), $4c ← Memc(0), ...
branch: beq $4, $3, target
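The step from addu to vaddu can be modeled as the same add applied to N independent register files. The sketch below is illustrative only: the helper names, the 32-register file size, and the list-of-lists representation are assumptions, not a description of the VHDL design.

```python
MASK32 = 0xFFFFFFFF  # 32-bit wrap-around for unsigned add


def make_register_files(n_units, n_regs=32):
    """One register file per SIMD unit (sizes are illustrative)."""
    return [[0] * n_regs for _ in range(n_units)]


def addu(regs, rd, rs, rt):
    """SISD add on a single register file."""
    regs[rd] = (regs[rs] + regs[rt]) & MASK32


def vaddu(simd_regs, rd, rs, rt):
    """SIMD add: every unit performs the same add on its own register file."""
    for regs in simd_regs:
        addu(regs, rd, rs, rt)


units = make_register_files(n_units=3)
for lane, regs in enumerate(units):
    regs[2] = 10 * (lane + 1)
    regs[3] = lane
vaddu(units, 1, 2, 3)
print([regs[1] for regs in units])  # [10, 21, 32]
```

A single vaddu call updates register 1 in every unit, which mirrors the $1a, $1b, $1c notation on the slide. In hardware the units would operate concurrently; the loop here only models the result.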
New Processor Architecture:
Extending to SIMD
Instructions based on SISD instructions
Multiple basic units added and connected
within the architecture
Number of units defined in generic map
Control logic updated to account for the
different datapaths
Memory designed as distributed memory
SIMD Instruction Set
Implemented 11 instructions that operate on
register sets of N registers:
Arithmetic Instructions:
vaddu, vaddiu, vmul
Logical Instructions:
vand, vandi, vor, vori, vxor, vxori
Memory Instructions:
vlw, vsw
Opcodes based on SISD opcodes
Processor Architecture: SIMD
[Figure: SIMD block diagram]
SISD datapath blocks: PC register, instruction memory, data memory, controller, branch logic, next PC logic, register file, ALU controller, shifter, comparator, ALU, multiplier, forward MUXA, forward MUXM, register write MUX, instruction address MUX, register destination MUX
SISD inter-stage registers: TAKENEXTPC, NEXTPC, BREAKH; A, B, M, ALUOP, SHAMT, FUNCT, WB, REGWRITEOUT, MEMTOREG, MEMWRITEH; WB, TAKEMUL, REGWRITEOUT, MEMTOREG
SIMD blocks, repeated for units 1, 2, ..., N: SIMD data memory, SIMD register file, SIMD ALU, SIMD multiplier, SIMD forward MUXA, SIMD forward MUXM, SIMD register write MUX
SIMD inter-stage registers, per unit: VA, VB, VM, VREGWRITEOUT, VMEMTOREG, VMEMWRITEH; VREGWRITEOUT, VMEMTOREG
SIMD Units
Number of units reconfigurable
Generic map
Data arrays
SIMD Control
SISD controller modified for SIMD
New signals for SIMD register files
Modified signals for SIMD execution units
SIMD Memory
Distributed memory
Every memory unit receives the same address
Memory can be pictured as rows: one access reads or writes a row across all units
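The row view of distributed memory can be sketched as one memory bank per SIMD unit, all indexed by the same address. The class below is a hypothetical software model: the name `DistributedMemory`, the bank size, and word-aligned byte addressing (address divided by 4) are assumptions chosen to match the word offsets used in the demonstration program.

```python
class DistributedMemory:
    """One data memory bank per SIMD unit; all banks share one address."""

    def __init__(self, n_units, words_per_unit):
        self.banks = [[0] * words_per_unit for _ in range(n_units)]

    def load_row(self, byte_address):
        """Model of a vector load: each unit reads its own bank at the same address."""
        index = byte_address // 4  # assumption: word-aligned byte addresses
        return [bank[index] for bank in self.banks]

    def store_row(self, byte_address, values):
        """Model of a vector store: each unit writes its own bank at the same address."""
        index = byte_address // 4
        for bank, value in zip(self.banks, values):
            bank[index] = value & 0xFFFFFFFF


mem = DistributedMemory(n_units=4, words_per_unit=16)
mem.store_row(8, [1, 2, 3, 4])
print(mem.load_row(8))  # [1, 2, 3, 4]
print(mem.load_row(0))  # [0, 0, 0, 0]
```

Under this model, a row of image data would be spread across the banks so that a single address selects one element per unit.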
Results: Testing
Tested each instruction for correct functionality
Arithmetic operation
Logical operation
Addiu/Vaddiu sign extend
Branch and jump delay slot
Load and store to memory
Data forwarding
Writing or forwarding register 0
Load word and multiply stall
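The sign-extension test for addiu/vaddiu can be illustrated with a small model of 16-bit immediate handling. The helper names are assumptions. The operand values are taken from the board demonstration tables in this deck, which list SIMD2 values of 0x00000002 and then 0xFFFF8002 at consecutive breaks; that appears consistent with adding a sign-extended 0x8000.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def addiu(value, imm):
    """Add a sign-extended 16-bit immediate, wrapping to 32 bits."""
    return (value + sign_extend16(imm)) & 0xFFFFFFFF


print(hex(addiu(0x00000002, 0x8000)))  # 0xffff8002
print(hex(addiu(0x00000000, 0x1000)))  # 0x1000
```

An immediate with bit 15 clear is added unchanged, while one with bit 15 set is treated as negative before the 32-bit wrap.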
Results: Verification
Simulated in ModelSim
Viewed wave file for correct signal behavior
Synthesized on XUP board with Xilinx ISE
Verified with two clock modes
On board clock (31.25 MHz)
Step clock
Results: Board Demonstration (I/IV)
Results: Board Demonstration (II/IV)
vlw $1, 0($0) # Load distributed memories
vlw $2, 4($0) # into SIMD register files
vlw $3, 8($0)
vlw $4, 12($0)
vlw $5, 16($0)
vlw $6, 20($0)
vlw $7, 24($0)
vlw $8, 28($0)
vlw $9, 32($0)
vlw $10, 36($0)
vlw $11, 40($0)
vlw $12, 44($0)
vlw $13, 48($0)
vlw $14, 52($0)
vlw $15, 56($0)
vlw $16, 60($0)
vlw $17, 64($0)
vlw $18, 68($0)
vlw $19, 72($0)
vaddu $23, $1, $2 # Test vaddu instruction
vaddu $24, $23, $3
vaddu $25, $24, $23
vsw $25, 0($0)
break 1 # read SIMD 23 registers
vlw $20, 0($0)
vaddiu $21, $20, 0x1111 # Test vlw hazard
vaddiu $22, $0, 0x1000
vsw $23, 0($22) # Test vsw and vlw
vsw $24, 8($22)
vsw $25, 24($22)
vlw $23, 8($22)
vlw $24, 24($22)
vlw $25, 0($22)
vandi $24, $8, 0x8888 # Test vandi instruction
vori $23, $24, 0x8888
vxor $23, $23, $9
vor $23, $23, $10
break 4
break 5
vxori $23, $23, 0x0765
vand $23, $18, $23
vaddiu $23, $23, 0x8000
break 7
break 8
vor $23, $16, $7
vmul $23, $23, $6 # Test vmul
vmul $23, $23, $17 # Test vmul hazard
break 10
break 11
Results: Board Demonstration (III/IV)
PC | Instruction | Inst value
0x00000000 | vlw $1, 0($0) | 0xCC010000
0x00000004 | vlw $2, 4($0) | 0xCC020004
0x00000008 | vlw $3, 8($0) | 0xCC030008
0x0000000c | vlw $4, 12($0) | 0xCC04000c
... | ... | ...
0x00000048 | vlw $19, 72($0) | 0xCC130048
0x0000004c | vaddu $23, $1, $2 | 0x4022B821
0x00000050 | vaddu $24, $23, $3 | 0x42E3C021
0x00000054 | vaddu $25, $24, $23 | 0x4317C821
0x00000058 | vsw $25, 0($0) | 0xEC190000
0x0000005C | break 1 | 0x0000004D
0x00000060 | vlw $20, 0($0) | 0xCC140000
0x00000064 | vaddiu $21, $20, 0x1111 | 0x66951111
0x00000068 | vaddiu $22, $0, 0x1000 | 0x64161000
0x0000006C | vsw $23, 0($22) | 0xEED70000
0x00000070 | vsw $24, 8($22) | 0xEED80008
0x00000074 | vsw $25, 24($22) | 0xEED90018
0x00000078 | vlw $23, 8($22) | 0xCED70008
0x0000007C | vlw $24, 24($22) | 0xCED80018
0x00000080 | vlw $25, 0($22) | 0xCED90000
0x00000084 | vandi $24, $8, 0x8888 | 0x7D188888
0x00000088 | break 2 | 0x0000008D
0x0000008C | vori $23, $24, 0x8888 | 0x77178888
0x00000090 | vxor $23, $23, $9 | 0x42E9B826
0x00000094 | vor $23, $23, $10 | 0x42EAB825
0x00000098 | break 3 | 0x000000CD
0x0000009C | break 4 | 0x0000010D
0x000000A0 | break 5 | 0x0000014D
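The instruction values listed on this slide can be checked by splitting a word into bit fields. The sketch assumes the standard MIPS32 field layout (opcode in bits 31..26, rs in 25..21, rt in 20..16, rd in 15..11, funct in 5..0, and a 20-bit break code in bits 25..6); that the SIMD encodings reuse this layout is an inference from the listed values, not something the slides state. The function names are hypothetical.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into MIPS-style bit fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "funct": word & 0x3F,
        "imm": word & 0xFFFF,
    }


def break_code(word):
    """Extract the 20-bit code field of a break instruction."""
    return (word >> 6) & 0xFFFFF


vaddu_word = decode_fields(0x4022B821)   # listed as vaddu $23, $1, $2
print(vaddu_word["rd"], vaddu_word["rs"], vaddu_word["rt"])  # 23 1 2

vaddiu_word = decode_fields(0x66951111)  # listed as vaddiu $21, $20, 0x1111
print(vaddiu_word["rt"], vaddiu_word["rs"], hex(vaddiu_word["imm"]))  # 21 20 0x1111

print(break_code(0x0000014D))  # 5
```

The decoded register numbers and immediates agree with the assembly shown next to each value, which supports the pairing of instruction values with instructions in the table.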
Results: Board Demonstration (IV/IV)
Results shown for the two clock modes (Step Clock, Board Clock)

Table indexed by break instruction:
Instruction | Inst value | SIMD 1 | SIMD2
break 1 | 0x0000004D | 0x33333333 | 0x00000003
break 2 | 0x0000008D | 0x66666666 | 0x00000006
break 3 | 0x000000CD | 0x00008888 | 0x00008888
break 4 | 0x0000010D | 0x99991111 | 0x00008881
break 5 | 0x0000014D | 0xBBBBBBB | 0x0000888B
break 6 | 0x0000018D | 0xBBBBBCDE | 0x0000076E
break 7 | 0x000001CD | 0x20202040 | 0x00000002
break 8 | 0x0000020D | 0x201FA040 | 0xFFFF8002
break 9 | 0x0000024D | 0x77777777 | 0x00000017
break 10 | 0x0000028D | 0x2FC9036A | 0x0000008A
break 11 | 0x000002CD | 0x001ED8BA | 0x0000092A

Table indexed by PC:
PC | SIMD1 | SIMD2
0x0000005C | 0x33333333 | 0x00000003
0x00000088 | 0x66666666 | 0x00000006
0x00000098 | 0xBBBBBBB | 0x0000000B
0x0000009C | 0xBBBBBBB | 0x0000000B
0x000000A0 | 0xBBBBBBB | 0x0000000B
0x000000B0 | 0x201FA040 | 0xFFFF8002
0x000000B4 | 0x201FA040 | 0xFFFF8002
0x000000B8 | 0x201FA040 | 0xFFFF8002
0x000000C8 | 0x001ED8BA | 0x0000092A
0x000000CC | 0x001ED8BA | 0x0000092A
0x000000D0 | 0x001ED8BA | 0x0000092A
Results: SISD Frequency (I/II)
Worst case path traverses branch logic
Instruction memory → register file → forward MUX →
branch logic → take next PC register
Routing delay about 70% of the total delay
Maximum board clock is 100 MHz
Configuration | Without Multiply, Branches, Jumps | One Unconditional Jump (j) | Without Multiply | All Instructions
Minimum Period | 8.940 ns | 8.437 ns | 22.116 ns | 20.933 ns
Maximum Frequency | 111.8 MHz | 118.5 MHz | 45.2 MHz | 47.8 MHz
Results: SIMD Frequency (II/II)
Configuration | Without Multiply, Branches, Jumps | One Unconditional Jump (j) | Without Multiply | All Instructions
Minimum Period | 7.784 ns | 9.432 ns | 19.903 ns | 21.222 ns
Maximum Frequency | 128.5 MHz | 106 MHz | 50.2 MHz | 47.1 MHz
Maximum frequency doubles without SISD
branch instructions
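The maximum frequencies on these slides follow from the minimum periods as the reciprocal, f = 1 / T. A short check of that arithmetic, using the SIMD minimum periods reported above (the function name is an assumption):

```python
def max_frequency_mhz(min_period_ns):
    """Convert a minimum clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / min_period_ns


for period_ns in (21.222, 19.903, 9.432, 7.784):
    print(f"{period_ns} ns -> {max_frequency_mhz(period_ns):.1f} MHz")
```

The ratio between the 21.222 ns and 9.432 ns cases is about 2.25, which is in line with the slide's remark that the maximum frequency roughly doubles once the SISD branch instructions are removed.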
Results: Area
4 units instantiated
Maximum number of units: 8 units
Processor can complete 8 sets of 32-bit data
results per cycle (like dual core Pentium)
Metric | SISD | SISD and SIMD
Number of Slices | 2033 | 7879
% of Slices Used | 14% | 57%
Minimum Period | 20.933 ns | 21.222 ns
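The stated maximum of 8 units is consistent with a rough linear estimate from the area figures. The sketch below assumes that each SIMD unit costs the same share of the FPGA and that the rounded percentages on the slide are accurate enough; both are assumptions, and the function name is hypothetical.

```python
def estimate_max_units(base_pct, total_pct, units_instantiated, capacity_pct=100):
    """Estimate how many SIMD units fit, assuming area grows linearly per unit."""
    simd_pct = total_pct - base_pct      # area used by the instantiated units
    available = capacity_pct - base_pct  # area left after the SISD core
    # Integer arithmetic: available / (simd_pct / units_instantiated), rounded down.
    return (available * units_instantiated) // simd_pct


print(estimate_max_units(base_pct=14, total_pct=57, units_instantiated=4))  # 8

# Slices per SIMD unit, from the slice counts on the slide.
print((7879 - 2033) / 4)  # 1461.5
```

With 14% for the SISD core and 43% for four units, each unit takes about 10.75% of the slices, so eight units would bring the estimate to roughly 100%.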
Conclusions (I/II)
Construction of a new SIMD-pipelined 32-bit
MIPS-based processor
Implementation of 39 instructions in five stage
pipeline with Harvard memory architecture
Processor function is self-contained, with a complete
instruction set
Addition of multiple SIMD units, with control logic
updated to match
Designation of SIMD memory as distributed memory,
allowing memory to be accessed as rows
Conclusions (II/II)
Can operate at the maximum speed of 100 MHz
(single-cycle throughput) and utilize all the area
of the FPGA
Verified in both simulation and synthesis
Compared to VIP (non-pipelined)
Single-cycle throughput on all instructions
Fit more SIMD units in same space with more
functionality
Compared to MATRIX (pipelined fetch only)
Single-cycle throughput on all instructions
Using 32 bits instead of MATRIX's 8 bits
Future Work
Modifications to research image processing
applications
binary operations to implement morphological
operations
multiply and add operations for a MAC to perform
convolution and filtering operations.
transfer image row data into SIMD memories
Development into a real-time reconfigurable
system
Further research to confirm if manual routing
could improve the maximum frequency
Thank You
