
Hardware Accelerators

Ingo Sander
ingo@imit.kth.se

2B1448 SoC Architectures, October 28, 2005

See also Wolf: Computers as Components, Ch. 7

How to improve the performance of a microprocessor system?

- Choose a faster version of your microprocessor.
- Add additional computational units that perform special functions:
  - Standard component (graphics processor)
  - Coprocessor (floating-point processor)
  - Additional microprocessor
  - Hardware accelerator

Hardware Accelerators

- If the overall performance of a uniprocessor system is too slow, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator!
- The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.

Accelerated System Architecture

[Figure: CPU, accelerator, memory and I/O attached to a shared bus; the interaction with the accelerator is (1) request, (2) data, (3) result.]

Amdahl's Law

- Amdahl's law states that the performance improvement of an improved unit is limited by the fraction of time the unit is in use!

  Speedup = ExecutionTimeOld / ExecutionTimeEnhanced
          = 1 / ((1 - FractionEnhanced) + FractionEnhanced / SpeedupEnhanced)

- FractionEnhanced denotes the fraction of time the enhancement can be used!

Example (Hennessy & Patterson)

- An application uses the floating-point square root 20% of the time and floating-point operations 50% of the time. Is it better to
  - implement a square root unit that speeds up this operation by a factor of 10, or to
  - improve the floating-point instructions in general so that they run 2 times faster?

Example (Hennessy & Patterson), cont'd

- Square root:
  Speedup = 1 / ((1 - 0.2) + 0.2/10) = 1/0.82 = 1.22
- Floating point:
  Speedup = 1 / ((1 - 0.5) + 0.5/2) = 1/0.75 = 1.33
- So improving the floating-point instructions in general gives the larger overall speedup.

Amdahl's Law: Lessons to be learned

- The maximum speedup that is possible is limited by the fraction!
- Assume an infinite speedup of the enhanced unit:

  Speedup = 1 / ((1 - F) + F/infinity) = 1 / (1 - F)

  F     Max. speedup
  0.1   1.11
  0.3   1.43
  0.5   2
  0.9   10

- Improve the common cases!
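The arithmetic above is easy to check in code. Below is a minimal C sketch (not part of the original slides) that evaluates Amdahl's law for the two alternatives of the Hennessy & Patterson example:

```c
#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction of the execution time
   is accelerated by a given factor. */
static double amdahl(double fraction, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction) + fraction / speedup_enhanced);
}

int main(void) {
    /* Option 1: square root (20% of the time) made 10x faster. */
    printf("square-root unit:   %.2f\n", amdahl(0.2, 10.0));  /* 1.22 */
    /* Option 2: all floating point (50% of the time) made 2x faster. */
    printf("general FP speedup: %.2f\n", amdahl(0.5, 2.0));   /* 1.33 */
    return 0;
}
```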

An Accelerator is not a Co-Processor

- A co-processor is connected to the CPU and executes special instructions.
  - Instructions are dispatched by the CPU.
- An accelerator appears as a device on the bus.

Design of a hardware accelerator

- Which functions shall be implemented in hardware and which functions in software?
- Hardware/software co-design: joint design of the hardware and software architectures.
- The hardware accelerator can be implemented in an
  - application-specific integrated circuit (ASIC), or a
  - field-programmable gate array (FPGA).

Hardware Software Co-Design

- Which functions shall go to HW and which to SW?
- Good estimates are needed for good partitioning.

[Figure: co-design flow — a system model (e.g. an original C program) is partitioned and mapped, with the help of an estimation library, into a HW model (VHDL) that goes through HW synthesis to a netlist, and a SW model (C/C++) that is compiled into an executable program; verification accompanies both paths.]

Hardware/Software Co-Design

- Hardware/software co-design covers the following problems:
  - Co-specification: the creation of specifications that describe both the hardware and the software of a system.
  - Co-synthesis: the automatic or semi-automatic design of hardware and software to meet a specification.
  - Co-simulation: the simultaneous simulation of hardware and software elements at different levels of abstraction.

Co-Synthesis

- Four tasks are included in co-synthesis:
  - Partitioning: the functionality of the system is divided into smaller, interacting computation units.
  - Allocation: the decision which computational resources are used to implement the functionality of the system.
  - Scheduling: if several system functions have to share the same resource, the usage of the resource must be scheduled in time.
  - Mapping: the selection of a particular allocated computational unit for each computation unit.
- All these tasks depend on each other!

Partitioning

- During partitioning the functionality of the system is partitioned into several parts (corresponding to the allocated/available components).
- Many possible partitions exist.
- Analysis is done by evaluating the costs of the different partitions.

[Figure: the same graph of computation units (A, C, D, ...) cut in two different ways to illustrate alternative partitions.]

Estimation

- In order to get a good partitioning, there is a need for good figures about
  - the performance of a function on the different components, and
  - the execution time of the communication.

Estimation Accuracy and Fidelity

- The accuracy of an estimate is a measure of how close the estimate is to the actual value on the real implementation.
- The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations.
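As a concrete illustration of that definition (not part of the original slides; the estimate and measurement values below are hypothetical), fidelity can be computed over all pairwise comparisons of design alternatives:

```c
#include <stdio.h>

/* Sketch: fidelity = percentage of pairwise comparisons (i vs. j) for which
   the estimate predicts the same ordering as the measured implementation. */
static double fidelity(const double est[], const double real[], int n) {
    int total = 0, correct = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            total++;
            int est_cmp  = (est[i]  > est[j])  - (est[i]  < est[j]);
            int real_cmp = (real[i] > real[j]) - (real[i] < real[j]);
            if (est_cmp == real_cmp) correct++;
        }
    return 100.0 * correct / total;
}

int main(void) {
    /* Hypothetical estimates and measurements for designs A, B, C. */
    double est[]  = { 10.0, 12.0, 11.0 };
    double real[] = { 10.5, 13.0, 12.0 };
    printf("fidelity: %.0f%%\n", fidelity(est, real, 3));  /* 100% here */
    return 0;
}
```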

Fidelity

[Figure: two sets of estimates plotted against the measured quality metric for designs A, B and C. Estimate (1) is close to the measured values but ranks the designs mostly wrongly (fidelity = 33%, only A > C predicted correctly); estimate (2) is further off but ranks them correctly (fidelity = 100%).]

- Though the accuracy is much higher in (1) than in (2), the estimates of (1) are not very useful for the partitioning process because of the low fidelity! This can cause bad design decisions!

Hardware/Software Co-Design

- Strategies:
  1. Start with an all-software configuration. While the constraints are not satisfied, move the SW function that gives the best improvement to HW (implemented in COSYMA [Ernst, Henkel, Benner 1993]).
  2. Start with an all-hardware configuration. While the constraints are satisfied, move the most costly HW component to SW (implemented in Vulcan [Gupta, De Micheli 1995]).
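The all-software strategy above is essentially a greedy loop. The following C sketch shows the idea only; it is not the actual COSYMA algorithm, and constraints_met and gain_if_moved_to_hw are hypothetical cost/performance estimators:

```c
#include <stdbool.h>

#define N_FUNCS 16

enum impl { SW, HW };

/* Hypothetical helpers backed by the estimation library. */
extern bool   constraints_met(const enum impl mapping[N_FUNCS]);
extern double gain_if_moved_to_hw(const enum impl mapping[N_FUNCS], int f);

/* Greedy partitioning in the spirit of strategy 1: start all-software and
   move the most profitable function to hardware until constraints hold. */
void partition(enum impl mapping[N_FUNCS]) {
    for (int f = 0; f < N_FUNCS; f++)
        mapping[f] = SW;                      /* all-software start */

    while (!constraints_met(mapping)) {
        int best = -1;
        double best_gain = 0.0;
        for (int f = 0; f < N_FUNCS; f++) {
            if (mapping[f] == HW) continue;
            double g = gain_if_moved_to_hw(mapping, f);
            if (g > best_gain) { best_gain = g; best = f; }
        }
        if (best < 0) break;                  /* nothing left worth moving */
        mapping[best] = HW;                   /* move best candidate to HW */
    }
}
```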

System design tasks

- Design a heterogeneous multiprocessor architecture.
  - Processing element (PE): CPU, accelerator, etc.
- Divide the tasks among the processing elements.
- Verify that
  - the functionality of the system is correct, and
  - the system meets the performance constraints.

Papers on HW/SW Co-Design

- R. Ernst et al. Hardware-software co-synthesis for microcontrollers. IEEE Design & Test of Computers, December 1993.
- R. K. Gupta and G. De Micheli. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers, December 1993.
- G. De Micheli and R. K. Gupta. Hardware/software co-design. Proceedings of the IEEE, March 1997.
- (and much, much more)

Electronic versions of these and other papers can be accessed via the KTH Library (www.lib.kth.se).

Why accelerators?

- Better cost/performance.
  - Custom logic may be able to perform an operation faster than a CPU of equivalent cost.
  - CPU cost is a non-linear function of performance, so improving performance by choosing a faster CPU may be very expensive!

[Figure: CPU cost as a non-linear function of performance.]

Why accelerators? (cont'd)

- Better real-time performance.
  - Put time-critical functions on less-loaded processing elements.
  - Remember the scheduling overhead (Chapter 6): extra CPU cycles must be reserved to meet deadlines.

[Figure: cost vs. performance, marking the performance needed for the deadline alone and for the deadline plus scheduling overhead.]

Why accelerators? (cont'd)

- Good for processing I/O in real time.
- May consume less energy.
- May be better at streaming data.
- May not be able to do all the work on even the largest single CPU.

Accelerated system design

- First, determine that the system really needs to be accelerated:
  - Which core function(s) shall be accelerated? (Partitioning)
  - How much faster is the accelerator on the core function?
  - How much is the data transfer overhead?
- Design tasks:
  - Design the accelerator itself.
  - Design the CPU interface to the accelerator.
  - Performance analysis; scheduling and allocation.

Performance analysis

- The critical parameter is the speedup: how much faster is the system with the accelerator?
- Must take into account:
  - Accelerator execution time.
  - Data transfer time.
  - Synchronization with the master CPU:
    - the accelerator needs to know when it can start its computation,
    - the CPU needs to know when the results are ready.

Single- vs. multi-threaded

- One critical factor is the available parallelism:
  - single-threaded/blocking: the CPU waits for the accelerator;
  - multi-threaded/non-blocking: the CPU continues to execute along with the accelerator.
- To multithread, the CPU must have useful work to do.
  - But the software must also support multithreading.
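To make the difference concrete, here is a minimal C sketch (not from the slides; accel_start, accel_done and do_other_work are hypothetical driver and application functions) contrasting a blocking call with a non-blocking one:

```c
#include <stdbool.h>

/* Hypothetical driver interface to the accelerator. */
extern void accel_start(const void *in, void *out);  /* kick off computation */
extern bool accel_done(void);                        /* poll completion flag  */
extern void do_other_work(void);                     /* independent CPU work  */

/* Single-threaded / blocking: the CPU simply busy-waits. */
void run_blocking(const void *in, void *out) {
    accel_start(in, out);
    while (!accel_done())
        ;                               /* CPU idles until the result is ready */
}

/* Multi-threaded / non-blocking: the CPU overlaps useful work
   with the accelerator and only synchronizes at the end. */
void run_non_blocking(const void *in, void *out) {
    accel_start(in, out);
    do_other_work();                    /* work that does not need the result  */
    while (!accel_done())
        ;                               /* join: wait for the accelerator      */
}
```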

Sources of parallelism

- Overlap I/O and accelerator computation.
  - Perform operations in batches: read in the second batch of data while computing on the first batch (see the double-buffering sketch after the next slide).
- Find other work to do on the CPU.
  - Operations may be rescheduled to move work after the accelerator initiation.

Total execution time

[Figure: single-threaded execution — the CPU runs P1, waits for the accelerator to run A1, then runs P2, P3 and P4 in sequence. Multi-threaded execution — after P1 the work is split: the accelerator runs A1 while the CPU runs P2 and P3, the results are joined, and the CPU finishes with P4.]
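The batching idea above is classic double buffering. Below is a minimal C sketch (not from the slides; dma_read_start, dma_read_wait and accel_process are hypothetical driver calls) that reads the next batch while the accelerator processes the current one:

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 4096

/* Hypothetical driver interface. */
extern void dma_read_start(uint8_t *dst, size_t len);  /* non-blocking input  */
extern void dma_read_wait(void);                       /* wait for input DMA  */
extern void accel_process(const uint8_t *in, size_t len);

void process_stream(size_t n_batches) {
    static uint8_t buf[2][BATCH_SIZE];
    int cur = 0;

    if (n_batches == 0)
        return;

    dma_read_start(buf[cur], BATCH_SIZE);               /* prefetch first batch   */
    for (size_t i = 0; i < n_batches; i++) {
        dma_read_wait();                                /* batch i is in buf[cur] */
        int next = 1 - cur;
        if (i + 1 < n_batches)
            dma_read_start(buf[next], BATCH_SIZE);      /* overlap I/O ...        */
        accel_process(buf[cur], BATCH_SIZE);            /* ... with computation   */
        cur = next;
    }
}
```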

Execution time analysis

- Single-threaded: count the execution time of all component processes.
- Multi-threaded: find the longest path through the execution.

[Figure: timelines for both cases — single-threaded, the total time is the sum of P1, A1 (with its input/compute/output phases), P2, P3 and P4; multi-threaded, A1 overlaps P2 and P3 and only the longest path counts.]

Accelerator execution time

- Total accelerator execution time:

  t_accel = t_in + t_x + t_out

  where t_in is the data input time, t_x the accelerated computation, and t_out the data output time; t_in and t_out are communication overhead.
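A small C sketch of how these terms enter the speedup estimate for a single accelerated function (illustrative only; all timing figures are hypothetical):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical timing figures, in clock cycles. */
    double t_in  = 500.0;    /* transfer input data to the accelerator   */
    double t_x   = 2000.0;   /* accelerated computation itself           */
    double t_out = 300.0;    /* transfer results back                    */
    double t_cpu = 12000.0;  /* same function executed purely on the CPU */

    /* Communication overhead (t_in + t_out) reduces the effective gain. */
    double t_accel = t_in + t_x + t_out;
    printf("t_accel = %.0f cycles, speedup of the function = %.1fx\n",
           t_accel, t_cpu / t_accel);
    return 0;
}
```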

Data input/output times

- Bus transactions include:
  - flushing register/cache values to main memory;
  - the time required for the CPU to set up the transaction;
  - the overhead of data transfers by bus packets, handshaking, etc.

Example of an Accelerator Architecture

[Figure: CPU, memory and a DMA unit on the system bus; the accelerator is attached through a bus interface and contains registers, a read unit, a write unit and the accelerator core.]

Accelerator/CPU interface

- Accelerator registers provide control registers for the CPU.
- Data registers can be used for small data objects.
- The accelerator may include special-purpose read/write logic.
  - Especially valuable for large data transfers.

Caching problems

- Main memory provides the primary data transfer mechanism to the accelerator.
- Programs must ensure that caching does not invalidate main memory data (assume a cache in the CPU).
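As an illustration of such an interface and of the required cache maintenance (a sketch only; the base address, register layout and cache_clean/cache_invalidate calls are hypothetical, not a real device or API), the CPU side might look like this:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical memory-mapped register block of the accelerator. */
#define ACCEL_BASE 0x40000000u
#define ACCEL_REG(off) (*(volatile uint32_t *)(ACCEL_BASE + (off)))
#define REG_SRC_ADDR  0x00   /* address of the input buffer  */
#define REG_DST_ADDR  0x04   /* address of the output buffer */
#define REG_LENGTH    0x08   /* number of bytes to process   */
#define REG_CONTROL   0x0C   /* bit 0: start                 */
#define REG_STATUS    0x10   /* bit 0: done                  */

/* Hypothetical cache-maintenance primitives provided by the platform. */
extern void cache_clean(const void *addr, size_t len);       /* write back to memory */
extern void cache_invalidate(const void *addr, size_t len);  /* drop stale lines     */

void accel_run(const uint8_t *src, uint8_t *dst, size_t len) {
    cache_clean(src, len);            /* make the input visible in main memory */

    ACCEL_REG(REG_SRC_ADDR) = (uint32_t)(uintptr_t)src;
    ACCEL_REG(REG_DST_ADDR) = (uint32_t)(uintptr_t)dst;
    ACCEL_REG(REG_LENGTH)   = (uint32_t)len;
    ACCEL_REG(REG_CONTROL)  = 1u;     /* start the accelerator */

    while ((ACCEL_REG(REG_STATUS) & 1u) == 0)
        ;                             /* wait until done */

    cache_invalidate(dst, len);       /* do not read stale cached output data */
}
```

A real driver would additionally handle virtual-to-physical address translation, memory barriers and interrupts instead of polling, which are omitted here for brevity.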

Possible Problems with Caches

- Example sequence (the CPU has a cache):
  1. CPU reads location S (S is now cached).
  2. Accelerator writes location S in main memory.
  3. CPU reads location S again and still sees its cached copy. Wrong value!

[Figure: CPU with cache, main memory and accelerator on the bus; the accelerator's write to S bypasses the CPU cache.]

Design for a multiprocessor environment

- If several processing entities exist, each entity can have several tasks.
- Divide the functional specification into units.
  - Map units onto PEs.
  - Units may become processes.
- Determine the proper level of parallelism: e.g. one unit computing f3(f1(), f2()) vs. separate units for f1(), f2() and f3().

Partitioning methodology

- Divide the CDFG into pieces, shuffle functions between pieces.
- Hierarchically decompose the CDFG to identify possible partitions.

Partitioning example

[Figure: a CDFG with conditions cond 1 and cond 2 and blocks 1-3, partitioned into the pieces P1-P5.]

Scheduling and allocation

- Must:
  - schedule operations in time;
  - allocate computations to processing elements.
- Scheduling and allocation interact, but separating them helps.
  - Alternatively: allocate first, then schedule.

Example Accelerator

- Data-flow graph: inputs x and y; compute f(x) and g(y), then h(f(x), g(y)).
- Architecture: a processor P, an accelerator A and a memory M connected by a bus.

[Figure: the data-flow graph and the P/A/M bus architecture.]

Execution Times

  Function   P   A
  f          5   2
  g          5   2
  h          5   -

- Both P and A have sufficient registers.
- P and A cannot access the bus simultaneously.
- A memory access (load or store) takes 1 time unit.

Single-Processor Solution

- Everything executes sequentially on P:

  Load x        1
  Load y        1
  f             5
  g             5
  h             5
  Store h(...)  1
  Total        18

Processor-Accelerator Solution I

- A computes f and g, P computes h; all intermediate results pass through memory:

  Load x (P)   1
  Load y (P)   1
  f (A)        2
  g (A)        2
  Store f (A)  1
  Store g (A)  1
  Load f (P)   1
  Load g (P)   1
  h (P)        5
  Store h (P)  1
  Total       16

- Still single-threaded!

Processor-Accelerator Solution II

- A computes f while P computes g in parallel:

  On A: Load x (1), f (2), Store f (1)
  On P: Load y (1), g (5), Load f (1), h (5), Store h (1)
  Total (longest path, through P): 13

- Exploitation of parallelism leads to a fast solution!
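As a quick sanity check of the slide arithmetic, the C sketch below adds up the three schedules; for Solution II it simply takes the longer of the two parallel tracks, a simplification that ignores the exact bus interleaving:

```c
#include <stdio.h>

int main(void) {
    /* Execution times from the table: f, g, h take 5 on P; f, g take 2 on A;
       every load or store takes 1 time unit. */

    /* Single-processor solution: everything sequential on P. */
    int t_single = 1 + 1 + 5 + 5 + 5 + 1;                     /* = 18 */

    /* Solution I (still single-threaded): A computes f and g,
       all intermediate results pass through memory, P computes h. */
    int t_sol1 = 1 + 1 + 2 + 2 + 1 + 1 + 1 + 1 + 5 + 1;       /* = 16 */

    /* Solution II: A computes f while P computes g, so the longer
       of the two tracks determines the total time. */
    int track_a = 1 + 2 + 1;              /* Load x, f, Store f             */
    int track_p = 1 + 5 + 1 + 5 + 1;      /* Load y, g, Load f, h, Store h  */
    int t_sol2  = (track_p > track_a) ? track_p : track_a;    /* = 13 */

    printf("single: %d, solution I: %d, solution II: %d\n",
           t_single, t_sol1, t_sol2);
    return 0;
}
```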

System integration and debugging

- Try to debug the CPU/accelerator interface separately from the accelerator core.
- Build equipment to test the accelerator.
- Hardware/software co-simulation can be useful.

Summary

- The use of a hardware accelerator can lead to a more efficient solution,
  - in particular when the parallelism in the functionality can be exploited.
- Hardware/software co-design techniques can be used for the design of an accelerator.
- You have to be aware of cache coherence problems if the processor or accelerator uses a cache.