Lecture 6
Hardware Acceleration and Embedded Networks
Hardware Acceleration
Hardware acceleration is the use of computer hardware to perform some function faster than is possible in software running on the general-purpose CPU. Examples of hardware acceleration include acceleration functionality in graphics processing units (GPUs) and instructions for complex operations in CPUs. (From Wikipedia.)
....
Normally, processors are sequential, and instructions are executed one by one. Various techniques are used to improve performance; hardware acceleration is one of them. The main difference between hardware and software is concurrency, which allows hardware to be much faster than software. Hardware accelerators are designed for computationally intensive software code. Depending upon granularity, hardware acceleration can vary from a small functional unit to a large functional block.
....
The hardware that performs the acceleration, when in a separate unit from the CPU, is referred to as a hardware accelerator, or often more specifically as a graphics accelerator, floating-point accelerator, etc. Those terms, however, are older and have been replaced with less descriptive terms like video card or graphics card.
....
Many hardware accelerators are built on top of field-programmable gate array chips.
Accelerated Systems
Use an additional computational unit dedicated to some functions:
- hardwired logic
- extra CPU
- coprocessor
Applications
- graphics and multimedia (streaming data)
- encryption and compression
- communication devices (signal processing)
- supercomputing, numerical computations
Why Accelerators?
Better cost/performance.
Custom logic may be able to perform an operation faster than a CPU of equivalent cost. CPU cost is a non-linear function of performance.
First, determine that the system really needs to be accelerated.
- How much faster is the accelerator on the core function (speedup)?
- How much data transfer overhead is there?
- How much overhead is there for synchronization with the CPU?
Performance estimation must be done on all levels of abstraction:
- Simulation-based methods (only average case; need to find test patterns that stress the system)
- Analytic methods (only limited accuracy; quick; used mainly on system level)
Input/output time (bus transactions) includes:
- flushing register/cache values to main memory;
- time required for the CPU to set up the transaction;
- overhead of data transfers by bus packets, handshaking, etc.
Accelerator Gain
But in general:
- not all tasks can be executed on the accelerator
- the CPU also has other obligations
Possibilities:
- single-threaded/blocking: the CPU waits for the accelerator
- multithreaded/non-blocking: the CPU continues to execute along with the accelerator; the CPU must have useful work to do, and the software must support multithreading
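The questions above (core speedup, transfer overhead, synchronization overhead, blocking versus non-blocking operation) can be put into a small back-of-the-envelope model. This is only a sketch: the class `AcceleratorModel`, its field names and the numbers in the usage note are hypothetical, and the non-blocking case assumes that only the accelerator's execution time overlaps with other CPU work while the CPU still pays for transfers and synchronization.

```python
from dataclasses import dataclass


@dataclass
class AcceleratorModel:
    """Hypothetical timing model; all times use the same unit (e.g. microseconds)."""
    t_cpu: float   # core function executed in software on the CPU
    t_exec: float  # core function executed on the accelerator
    t_in: float    # input transfer overhead (cache flush, bus setup, packets)
    t_out: float   # output transfer overhead
    t_sync: float  # synchronization overhead (interrupt or register access)

    def accelerated_time(self) -> float:
        # Time for one accelerated call, including all overheads.
        return self.t_in + self.t_exec + self.t_out + self.t_sync

    def core_speedup(self) -> float:
        # Speedup on the core function alone, ignoring overheads.
        return self.t_cpu / self.t_exec

    def effective_speedup(self) -> float:
        # Speedup once transfers and synchronization are counted.
        return self.t_cpu / self.accelerated_time()

    def blocking_total(self, t_other: float) -> float:
        # Single-threaded: the CPU waits, then does its other work.
        return self.accelerated_time() + t_other

    def non_blocking_total(self, t_other: float) -> float:
        # Assumption: only the accelerator's execution overlaps with CPU work.
        return self.t_in + max(self.t_exec, t_other) + self.t_out + self.t_sync
```

With the made-up values `AcceleratorModel(t_cpu=100, t_exec=10, t_in=15, t_out=15, t_sync=10)`, the core speedup is 10 but the effective speedup is only 2, which is why the overheads have to be estimated before deciding to accelerate.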
Synchronization
- via interrupts
- via special data and control registers at the accelerator
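The two synchronization styles can be imitated in software. The sketch below is a toy simulation, not a driver: `SimulatedAccelerator`, its `status_done` flag (standing in for a status register) and the `on_done` callback (standing in for an interrupt handler) are all hypothetical names, and a Python thread plays the role of the accelerator.

```python
import threading


class SimulatedAccelerator:
    """Toy accelerator with a start command, a DONE status bit and a result register."""

    def __init__(self, on_done=None):
        self.status_done = False   # status register bit, polled by the CPU
        self.result = None         # data register holding the result
        self._on_done = on_done    # stands in for an interrupt handler
        self._thread = None

    def start(self, data):
        # Writing the "control register": launch the job and return at once.
        self.status_done = False
        self._thread = threading.Thread(target=self._run, args=(data,))
        self._thread.start()

    def _run(self, data):
        self.result = sum(x * x for x in data)  # the accelerated function
        self.status_done = True
        if self._on_done is not None:
            self._on_done(self.result)          # "raise the interrupt"

    def join(self):
        self._thread.join()


def run_with_polling(data):
    acc = SimulatedAccelerator()
    acc.start(data)
    useful_work = 0
    while not acc.status_done:   # CPU checks the status register between work items
        useful_work += 1
    acc.join()
    return acc.result


def run_with_interrupt(data):
    done = threading.Event()
    results = []

    def handler(value):          # runs when the accelerator signals completion
        results.append(value)
        done.set()

    acc = SimulatedAccelerator(on_done=handler)
    acc.start(data)
    done.wait()                  # a real CPU would execute other threads meanwhile
    acc.join()
    return results[0]
```

Both functions return the same value; the difference is whether the CPU finds out about completion by reading a register or by being notified.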
Hardware/software co-design
- Meeting system-level objectives by exploiting the synergism of hardware and software through their concurrent design.
- Joint design of hardware and software architectures.
- Co-design problems have a different flavor according to the application domain, implementation technology and design methodology.
Design a heterogeneous multiprocessor architecture:
- Communication (bus, network, ...)
- Memory architecture
- Interfaces and I/O
- Processing elements (CPU, application-specific integrated circuit (ASIC), field-programmable gate array (FPGA))
Program the system
Network elements
[Figure: distributed computing platform. Several processing elements (PEs) are connected to each other and to a sensor and an actuator.]
Why distributed?
- Higher performance at lower cost.
- Physically distributed activities: time constants may not allow transmission to a central site.
- Improved debugging: use one CPU in the network to debug others.
- May buy subsystems that have embedded processors.
Overheads for Computers as Components
Hardware architectures
Point-to-point networks
[Figure: PE 1 is connected to PE 2 by link 1; PE 2 is connected to PE 3 by link 2.]
Bus networks
[Figure: PE 1, PE 2, PE 3 and PE 4 attached to a shared bus.]
Packet format: header, address, data, ECC
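The packet format above (header, address, data, ECC) can be illustrated with a small encoder and decoder. This is a sketch under stated assumptions: the one-byte header, one-byte address and one-byte check field are invented sizes, and the XOR checksum only detects errors, so it is a stand-in for a real error-correcting code rather than an ECC itself.

```python
from functools import reduce


def checksum(payload: bytes) -> int:
    """XOR of all bytes; a simple stand-in for a real error-correcting code."""
    return reduce(lambda a, b: a ^ b, payload, 0)


def pack_packet(header: int, address: int, data: bytes) -> bytes:
    # Assumed layout: header (1 byte) | address (1 byte) | data (n bytes) | check (1 byte)
    body = bytes([header, address]) + data
    return body + bytes([checksum(body)])


def unpack_packet(packet: bytes):
    if len(packet) < 3:
        raise ValueError("packet too short")
    body, check = packet[:-1], packet[-1]
    if checksum(body) != check:
        raise ValueError("checksum mismatch: packet corrupted on the bus")
    # The data length is implied by the packet length in this sketch.
    return body[0], body[1], body[2:]
```

A receiving PE would compare the address field with its own address and ignore packets meant for other PEs.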
Bus arbitration
Fixed: Same order every time. Fair: every PE has the same access over long periods.
[Figure: grant order when A, B and C all request the bus in two successive rounds.]
fixed: first round A, B, C; second round A, B, C
round-robin: first round A, B, C; second round B, C, A
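The two arbitration policies can be sketched in a few lines. The function and class names are hypothetical, and the round-robin variant shown here simply rotates the starting PE by one position after every round, which is one of several possible ways to make arbitration fair.

```python
def fixed_priority(requests, order):
    """Grant pending requests in the same fixed order every round."""
    return [pe for pe in order if pe in requests]


class RoundRobinArbiter:
    """Rotate the highest-priority position by one PE after each round."""

    def __init__(self, order):
        self.order = list(order)
        self.start = 0  # index of the PE served first in the next round

    def grant_round(self, requests):
        n = len(self.order)
        rotated = [self.order[(self.start + i) % n] for i in range(n)]
        self.start = (self.start + 1) % n
        return [pe for pe in rotated if pe in requests]
```

With A, B and C all requesting in two successive rounds, `fixed_priority` grants A, B, C both times, while `RoundRobinArbiter` grants A, B, C and then B, C, A, so no PE is always served last.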
Multi-stage networks
Use several stages of switching elements. Often blocking. Often smaller than crossbar.
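To make the size comparison concrete, the sketch below counts crosspoints for a full crossbar and for one particular multi-stage design. It assumes an omega-style network built from 2x2 switching elements (log2 n stages of n/2 switches, 4 crosspoints per switch); the slide does not name a specific topology, so this choice and the function names are illustrative only.

```python
def crossbar_crosspoints(n: int) -> int:
    """A full crossbar needs one crosspoint per input/output pair."""
    return n * n


def omega_crosspoints(n: int) -> int:
    """Crosspoints in an omega-style network of 2x2 switches (n a power of two)."""
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1      # log2(n) stages
    switches_per_stage = n // 2
    return stages * switches_per_stage * 4
```

For 64 ports this gives 768 crosspoints against 4096 for the crossbar. The price is blocking: two connections may need the same internal link, so not every permutation can be routed at once.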
I2C bus
CAN bus
- Originally designed for automotive electronics
- Now used for other applications as well
- Bit-serial transmission, 500 Kb/s, over twisted pair, up to 40 m
- Synchronous: nodes synchronize themselves by listening to the bit transitions on the bus
- Arbitration by Carrier Sense Multiple Access with Arbitration on Message Priority (CSMA/AMP)
- For error handling, a special error frame and an overload frame are used, as well as acknowledgements
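Arbitration on message priority (CSMA/AMP), as used on the CAN bus, can be sketched as a bitwise contest. The sketch assumes 11-bit identifiers sent most significant bit first and a bus on which a 0 bit is dominant, so the lowest identifier has the highest priority; the function name and the example identifiers are made up for illustration.

```python
def arbitrate(identifiers, width=11):
    """Return the identifier that wins bitwise arbitration on a wired-AND bus."""
    contenders = set(identifiers)
    for bit in range(width - 1, -1, -1):      # most significant bit first
        sent = {ident: (ident >> bit) & 1 for ident in contenders}
        bus = min(sent.values())              # any dominant 0 pulls the bus to 0
        # A node that sent a recessive 1 but reads a dominant 0 stops transmitting.
        contenders = {ident for ident in contenders if sent[ident] == bus}
    (winner,) = contenders                    # distinct identifiers leave one winner
    return winner
```

Because losing nodes back off while the winner keeps transmitting, the winning message is not destroyed by the collision, and the losers retry once the bus is idle again.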