- Choose a faster version of your microprocessor?
- Add additional computational units that perform special functions?
  - Standard component (graphics processor)
  - Coprocessor (floating-point processor)
  - Additional microprocessor
  - Hardware accelerator
2B1448 SoC Architectures 2
Hardware Accelerators
If the overall performance of a uniprocessor system is too low, additional hardware can be used to speed up the system. Such hardware is called a hardware accelerator. The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.
2B1448 SoC Architectures 3 October 28, 2005
[Figure: the CPU passes data to the accelerator and receives the result; memory and I/O are also attached.]
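Amdahl's Law
Amdahl's law states that the performance improvement gained from an improved unit is limited by the fraction of time the unit is in use.
$$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{old}}}{\text{ExecutionTime}_{\text{enhanced}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$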
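An application uses the floating-point square root 20% of the time and floating-point operations 50% of the time. Is it better to
- implement a square root unit that speeds up this operation by a factor of 10, or to
- improve the floating-point instructions in general so that they run 2 times faster?
Square Root: Speedup = 1 / ((1 - 0.2) + 0.2/10) = 1/0.82 ≈ 1.22
Floating-Point: Speedup = 1 / ((1 - 0.5) + 0.5/2) = 1/0.75 ≈ 1.33
Improving the floating-point instructions in general gives the larger overall speedup.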
Assume infinite speedup of the enhanced part:
Speedup = 1 / ((1 - F) + F/Infinity) = 1 / (1 - F)

F     Max. Speedup
0.1   1.11
0.3   1.43
0.5   2
0.9   10
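The calculations above can be checked with a few lines of Python. This is a minimal sketch; the function names `amdahl_speedup` and `max_speedup` are chosen here for illustration and do not come from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the execution time is sped up."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced: float) -> float:
    """Upper bound on the speedup: the enhanced part takes no time at all."""
    if not 0.0 <= fraction_enhanced < 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1)")
    return 1.0 / (1.0 - fraction_enhanced)


# The square root / floating-point question from the slide.
sqrt_unit = amdahl_speedup(0.2, 10)   # square root only, 10 times faster
fp_general = amdahl_speedup(0.5, 2)   # all floating point, 2 times faster
print(f"square root unit:      {sqrt_unit:.2f}")
print(f"faster floating point: {fp_general:.2f}")

# Maximum speedup for the fractions listed in the table.
for fraction in (0.1, 0.3, 0.5, 0.9):
    print(fraction, round(max_speedup(fraction), 2))
```

The comparison shows why the fraction of time matters more than the local speedup factor: a 2x improvement on 50% of the run time beats a 10x improvement on 20% of it.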
Which functions shall be implemented in hardware and which functions in software?
Hardware/software co-design: joint design of hardware and software architectures
The hardware accelerator can be implemented in
Hardware/Software Co-Design
[Figure: co-design flow from the original C program through verification to the executable program.]
- Co-Specification: The creation of specifications that describe both the hardware and software of a system
- Co-Synthesis: The automatic or semi-automatic design of hardware and software to meet a specification
- Co-Simulation: The simultaneous simulation of hardware and software elements on different levels of abstraction
Co-Synthesis
- Partitioning: The functionality of the system is divided into smaller, interacting computation units.
- Allocation: The decision about which computational resources are used to implement the functionality of the system.
- Scheduling: If several system functions have to share the same resource, the usage of the resource must be scheduled in time.
- Mapping: The selection of a particular allocated computational resource for each computation unit.
Partitioning
- During partitioning, the functionality of the system is divided into several parts (corresponding to the allocated/available components).
- Many possible partitions exist.
- Analysis is done by evaluating the costs of different partitions.
Estimation
In order to get a good partitioning, there is a need for good figures about
- the performance (execution time) of a function on different components
- the communication time
The accuracy of an estimate is a measure of how close the estimate is to the actual value on the real implementation.
The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations.
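The fidelity definition can be made concrete with a short sketch. It compares every pair of design implementations and counts how often the estimated ordering agrees with the measured ordering. The function name `fidelity` and all numbers below are hypothetical, chosen only to illustrate the definition.

```python
from itertools import combinations


def fidelity(estimated, measured):
    """Percentage of design pairs whose estimated ordering matches the measured one."""
    if len(estimated) != len(measured) or len(estimated) < 2:
        raise ValueError("need two equally long lists with at least two designs")

    def sign(value):
        return (value > 0) - (value < 0)

    pairs = list(combinations(range(len(estimated)), 2))
    correct = sum(
        1
        for i, j in pairs
        if sign(estimated[i] - estimated[j]) == sign(measured[i] - measured[j])
    )
    return 100.0 * correct / len(pairs)


# Hypothetical quality metric for three design alternatives.
measured = [10.0, 12.0, 14.0]
close_but_misordered = [11.5, 10.5, 13.5]  # small errors, first two ranked wrongly
far_but_ordered = [20.0, 25.0, 30.0]       # large errors, ranking preserved

print(fidelity(close_but_misordered, measured))
print(fidelity(far_but_ordered, measured))
```

In this made-up data set the less accurate estimator has the higher fidelity, which is the property that matters when estimates are only used to compare partitions.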
Fidelity
[Figure: two plots, (1) and (2), of a quality metric across design implementations; one plot is labeled Fidelity = 100%.]
Though accuracy is much higher in (1) than in (2), the estimates in (1) are not very useful for the partitioning process because of their low fidelity. This can cause bad design decisions.
Hardware/Software Co-Design
Strategies:
1. Start with an all-software configuration.
   While (constraints are not satisfied): move the SW function that gives the best improvement to HW.
   (implemented in COSYMA [Ernst, Henkel, Benner 1993])
2. Start with an all-hardware configuration.
   While (constraints are satisfied): move the most costly HW component to SW.
   (implemented in Vulcan [Gupta, DeMicheli 1995])
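The software-first strategy (start with everything in software and move functions to hardware until the constraints hold) can be sketched as a greedy loop. This is only an illustration of the strategy as worded on the slide, not the actual COSYMA implementation. It assumes a single timing constraint, ignores communication cost, and uses hypothetical execution times.

```python
def partition_software_first(functions, deadline):
    """Greedy sketch: start all-software, move functions to HW until the deadline holds.

    functions maps a name to (sw_time, hw_time); all values are hypothetical.
    Returns (set of functions moved to hardware, resulting total time).
    """
    in_hw = set()

    def total_time():
        return sum(hw if name in in_hw else sw for name, (sw, hw) in functions.items())

    while total_time() > deadline:
        candidates = [name for name in functions if name not in in_hw]
        if not candidates:
            raise ValueError("deadline cannot be met even with everything in hardware")
        # Move the software function that gives the best improvement (most time saved).
        best = max(candidates, key=lambda name: functions[name][0] - functions[name][1])
        in_hw.add(best)
    return in_hw, total_time()


functions = {"f": (5, 2), "g": (5, 2), "h": (5, 4), "io": (3, 3)}
hardware, total = partition_software_first(functions, deadline=13)
print(sorted(hardware), total)
```

The hardware-first strategy would run the same loop in the opposite direction, moving the most costly hardware component back to software for as long as the constraints remain satisfied.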
- R. Ernst et al. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers. December 1993.
- R. K. Gupta and G. De Micheli. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers. December 1993.
- G. De Micheli and R. K. Gupta. Hardware/software co-design. Proceedings of the IEEE. March 1997.
- (and much, much more)
Electronic versions of these and other papers can be accessed through the KTH Library (www.lib.kth.se)
Why accelerators?
Better cost/performance.
- Custom logic may be able to perform an operation faster than a CPU of equivalent cost.
- CPU cost is a non-linear function of performance.
- Put time-critical functions on less-loaded processing elements.
- Remember scheduling overhead (Chapter 6).
[Figure: plot labeled "deadline" and "performance".]
- Good for processing I/O in real time.
- May consume less energy.
- May be better at streaming data.
- May not be able to do all the work on even the largest single CPU.
Design Tasks
- Which core function(s) shall be accelerated? (Partitioning)
- How much faster is the accelerator on the core function?
- How much is the data transfer overhead?
- Performance analysis; scheduling and allocation.
Performance analysis
The critical parameter is speedup: how much faster is the system with the accelerator?
Must take into account:
- Accelerator execution time.
- Data transfer time.
- Synchronization with the master CPU.
  - The accelerator needs to know when it can start its computation.
  - The CPU needs to know when the results are ready.
- Single-threaded/blocking: CPU waits for accelerator.
- Multithreaded/non-blocking: CPU continues to execute along with accelerator.
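The difference between the blocking and the non-blocking model can be illustrated in software. In this minimal sketch a worker thread stands in for the accelerator; `accelerator` and `cpu_work` are hypothetical placeholder functions, and a real system would signal start and completion through registers or interrupts rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerator(data):
    """Stand-in for the accelerated core function (hypothetical)."""
    return sum(value * value for value in data)


def cpu_work(data):
    """Independent work the CPU could do in the meantime (hypothetical)."""
    return max(data)


def run_blocking(data):
    # Single-threaded: the CPU waits for the accelerator before doing anything else.
    result = accelerator(data)
    other = cpu_work(data)
    return result, other


def run_non_blocking(data):
    # Multithreaded: start the accelerator, keep computing, then synchronize.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(accelerator, data)  # accelerator is told to start
        other = cpu_work(data)                    # CPU continues to execute
        result = pending.result()                 # CPU waits until results are ready
    return result, other


samples = [1, 2, 3, 4]
print(run_blocking(samples), run_non_blocking(samples))
```

Both variants compute the same values; only the point at which the CPU waits for the result differs.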
Sources of parallelism
- Perform operations in batches: read in the second batch of data while computing on the first batch.
- May reschedule operations to move work after accelerator initiation.
[Figure: single-threaded vs. multi-threaded execution. Processes P1 to P4 run on the CPU and A1 runs on the accelerator; the multi-threaded version adds Split and Join points.]
Accelerator execution time $= t_{in} + t_x + t_{out}$, where $t_{in}$ is the data input time, $t_x$ is the accelerated computation time, and $t_{out}$ is the data output time.
Communication Overhead
- Flushing register/cache values to main memory.
- Time required for the CPU to set up the transaction.
- Overhead of data transfers by bus packets, handshaking, etc.
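A small calculation shows how the transfer times eat into the raw speedup of the accelerator core. This is a sketch under assumed numbers: the function names are invented here, and the values (5 time units on the CPU, 2 on the accelerator, 1 per transfer) are illustrative only.

```python
def accelerator_time(t_in, t_x, t_out):
    """Accelerator execution time: data input + accelerated computation + data output."""
    return t_in + t_x + t_out


def speedup(t_cpu, t_in, t_x, t_out):
    """Ratio of the CPU time to the accelerator time for one call of the core function."""
    return t_cpu / accelerator_time(t_in, t_x, t_out)


# Assumed values: 5 units on the CPU, 2 units on the accelerator, 1 unit per transfer.
raw = speedup(t_cpu=5, t_in=0, t_x=2, t_out=0)            # ignoring communication
with_transfers = speedup(t_cpu=5, t_in=1, t_x=2, t_out=1)
print(raw, with_transfers)
```

With these assumed numbers, two transfers of one time unit each halve the benefit, which is why the data transfer overhead has to be part of the performance analysis.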
[Figure: block diagram with CPU, DMA, bus interface, and an accelerator consisting of registers and a core.]
Accelerator/CPU interface
- Accelerator registers provide control registers for the CPU.
- Data registers can be used for small data objects.
- Accelerator may include special-purpose read/write logic.
Caching problems
- Main memory provides the primary data transfer mechanism to the accelerator.
- Programs must ensure that caching does not invalidate main memory data (assume a cache in the CPU).
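The caching problem can be reproduced with a toy model. The sketch below assumes a write-back cache in the CPU and an accelerator that reads and writes main memory directly; the class `WriteBackCache` and the function `accelerator_double` are hypothetical and only serve to show when a stale value appears.

```python
class WriteBackCache:
    """Toy write-back cache in front of a shared main memory (a dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}      # address -> cached value
        self.dirty = set()   # addresses modified but not yet written back

    def read(self, address):
        if address not in self.lines:
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.lines[address] = value
        self.dirty.add(address)

    def flush(self):
        """Write dirty lines back so that other bus masters can see them."""
        for address in self.dirty:
            self.memory[address] = self.lines[address]
        self.dirty.clear()

    def invalidate(self, address):
        """Drop a cached copy so that the next read fetches from main memory."""
        self.lines.pop(address, None)
        self.dirty.discard(address)


def accelerator_double(memory, source, destination):
    """Hypothetical accelerator: works on main memory only, bypassing the CPU cache."""
    memory[destination] = 2 * memory[source]


memory = {"x": 1, "y": 0}
cpu = WriteBackCache(memory)
cpu.read("y")                          # the CPU caches the old value of y
cpu.write("x", 21)                     # the new input exists only in the cache
accelerator_double(memory, "x", "y")   # accelerator uses the old x from memory
stale = cpu.read("y")                  # wrong value: the cached copy of y

cpu.flush()                            # make x = 21 visible in main memory
accelerator_double(memory, "x", "y")
cpu.invalidate("y")                    # discard the stale cached y
fresh = cpu.read("y")
print(stale, fresh)
```

The second run is correct only because the program flushes its input before starting the accelerator and invalidates the cached result location before reading it back.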
[Figure: caching problem between CPU, accelerator, and memory; step 3 yields a wrong value.]
- If several processing entities exist, each entity can have several tasks.
- Divide functional specification into units.
Partitioning methodology
- Divide the control/data flow graph (CDFG) into pieces, shuffle functions between pieces.
- Hierarchically decompose the CDFG to identify possible partitions.
Partitioning example
[Figure: CDFG with conditions cond 1 and cond 2, Blocks 1 to 3, and processes P1 to P5.]
Example Accelerator
Data-flow Graph: inputs x and y, output h(f(x),g(y))
Architecture: processor P, accelerator A, memory M
Execution Times

      P   A
f     5   2
g     5   2
h     5   -

- Both P and A have sufficient registers.
- P and A cannot access the bus simultaneously.
- A memory access (load or store) takes 1 time unit.

Single-Processor Solution
Data-flow Graph: h(f(x),g(y))

Unit  Operation     Time
P     Load x        1
P     Load y        1
P     f             5
P     g             5
P     h             5
P     Store h(...)  1
Total               18
Processor-Accelerator Solution I
Data-flow Graph: h(f(x),g(y))

Unit  Operation  Time
A     Load x     1
A     Load y     1
A     f          2
A     g          2
A     Store f    1
A     Store g    1
P     Load f     1
P     Load g     1
P     h          5
P     Store h    1
Total            16

Still single-threaded!

Processor-Accelerator Solution II
Data-flow Graph: h(f(x),g(y))

Unit  Operation  Time
P     Load y     1
P     g          5
A     Load x     1
A     f          2
A     Store f    1
P     Load f     1
P     h          5
P     Store h    1
Total            13

A loads x, computes f, and stores f while P is computing g, so the accelerator work is hidden and the total is 1 + 5 + 1 + 5 + 1 = 13.
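The three totals (18, 16, and 13) can be reproduced with a small schedule simulation. This is a sketch under the stated assumptions: each unit runs one operation at a time, bus transfers never overlap, and a memory access takes 1 time unit. The operation order used for Solutions I and II is one assumed reading of the slides that matches the stated totals, and the function `makespan` is a hypothetical helper that schedules operations strictly in list order.

```python
def makespan(steps):
    """Schedule steps in list order and return the finish time of the last one.

    Each step is (unit, name, duration, uses_bus, after). A unit runs one step
    at a time, bus transfers never overlap, and a step starts only after every
    step named in `after` has finished.
    """
    unit_free = {}
    finish = {}
    bus_free = 0
    for unit, name, duration, uses_bus, after in steps:
        start = max([unit_free.get(unit, 0)] + [finish[dep] for dep in after])
        if uses_bus:
            start = max(start, bus_free)
        end = start + duration
        if uses_bus:
            bus_free = end
        unit_free[unit] = end
        finish[name] = end
    return max(finish.values())


single = [
    ("P", "Load x", 1, True, ()), ("P", "Load y", 1, True, ()),
    ("P", "f", 5, False, ()), ("P", "g", 5, False, ()),
    ("P", "h", 5, False, ()), ("P", "Store h", 1, True, ()),
]
solution_1 = [
    ("A", "Load x", 1, True, ()), ("A", "Load y", 1, True, ()),
    ("A", "f", 2, False, ()), ("A", "g", 2, False, ()),
    ("A", "Store f", 1, True, ()), ("A", "Store g", 1, True, ()),
    ("P", "Load f", 1, True, ("Store f",)), ("P", "Load g", 1, True, ("Store g",)),
    ("P", "h", 5, False, ()), ("P", "Store h", 1, True, ()),
]
solution_2 = [
    ("P", "Load y", 1, True, ()), ("A", "Load x", 1, True, ()),
    ("P", "g", 5, False, ()), ("A", "f", 2, False, ()),
    ("A", "Store f", 1, True, ()), ("P", "Load f", 1, True, ("Store f",)),
    ("P", "h", 5, False, ()), ("P", "Store h", 1, True, ()),
]
print(makespan(single), makespan(solution_1), makespan(solution_2))
```

Solution I gains little because the processor idles while the accelerator works; the second schedule is faster mainly because f on the accelerator overlaps with g on the processor.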
- Try to debug the CPU/accelerator interface separately from the accelerator core.
- Build equipment to test the accelerator.
- Hardware/software co-simulation can be useful.
Summary
- Hardware/software co-design techniques can be used for the design of an accelerator.
- You have to be aware of cache coherence problems if the processor or accelerator uses a cache.