

General and special purpose DSP processors
Computer architectures for signal processing
General purpose fixed point DSP processors
Selecting DSP Processors
Implementation of DSP algorithms
Special purpose DSP processors



DSP processors are used to implement and execute DSP algorithms in real-time (often real-time implies 'as soon
as possible', but within specified time limits).
The main objectives of this section of the DSP course (lecture sessions and associated laboratory/course work)
are to provide an understanding of:
(1) Key issues underlying DSP processors and their hardware/software architectures.
(2) How DSP algorithms are implemented for real-time execution using fixed point DSP processors (digital
filtering will be used as a vehicle for this).
(3) Finite word length effects in fixed point DSP systems (again using digital filtering as a vehicle).


General and special purpose DSP processors

For convenience, DSP processors can be divided into two broad categories:

General purpose DSP processors: these are basically high-speed microprocessors with hardware and
instruction sets optimized for DSP operations. Examples of such processors include fixed-point devices
such as the Texas Instruments TMS320C54x and Motorola DSP563x processors, and floating-point
processors such as the Texas Instruments TMS320C4x and Analog Devices ADSP21xxx SHARC processors.

Special purpose DSP processors: these include (i) hardware designed for efficient execution of
specific DSP algorithms (sometimes called algorithm-specific hardware), e.g. for the FFT, and (ii)
hardware designed for specific applications (sometimes called application-specific processors), e.g. for
PCM in telecommunications or audio applications. Examples of special-purpose DSP processors are
Cirrus's processor for digital audio sampling rate converters (CS8420), Mitel's multi-channel telephony
voice echo canceller (MT9300), the FFT processor (PDSP16515A) and programmable FIR filters.


Computer architectures for signal processing

Standard microprocessors are based on the von Neumann concept, where operations are performed sequentially.
An increase in processor speed is then achieved only by making the individual units of the processor operate faster,
but there is a limit to this (see Figure 1). For real-time operation, DSP processors must have architectures optimised
for executing DSP operations. Figure 2 depicts a generic hardware architecture for DSP.

Figure 1 A simplified architecture for standard microprocessors

Figure 2 A simplified generic hardware architecture for DSP

The characteristic features of the architecture of Figure 2 include:

Multiple bus structure, with separate memory spaces for data and programs.
Arithmetic units for logical and arithmetic operations, including a hardware multiplier/accumulator.

Why is such an architecture necessary? Most DSP algorithms, e.g. digital filtering and the FFT, involve
repetitive arithmetic operations such as multiplications, additions and memory accesses, with heavy data flow
through the CPU.
The architecture of standard microprocessors is not suited to this type of activity. An important goal in DSP
hardware design is to optimise both the hardware architecture and the instruction set so as to increase speed
and make real-time execution possible whilst keeping quantization errors low. In DSP, this is achieved by
making extensive use of parallelism. In particular, the following techniques are used:

Harvard architecture
Fast, dedicated hardware multiplier/accumulator
Specialised instructions dedicated to DSP
On-chip memory/cache.
Extended parallelism: SIMD, VLIW and static superscalar processing.

We will examine some of the above techniques to gain more understanding of the architectural features of DSP
processors.

Harvard architecture

In a standard microprocessor, the program codes and the data are held in one memory space. Thus, the
fetching of the next instruction while the current one is executing is not allowed, because the fetch and
execution phases each require memory access (see Figure 3).

Figure 3 An illustration of instruction fetch, decode and execute in a non-Harvard architecture with
single memory space (a) instruction fetch from memory; (b) timing diagram
NB: The example illustrates reading a value op1 at address ADR1 in memory into the accumulator and
then storing it at two other addresses, ADR2 and ADR3. The instructions could be:

ADR1 Load the operand op1 into the accumulator from ADR1
ADR2 Store op1 in address ADR2
ADR3 Store op1 in address ADR3

Typically, an instruction in a microprocessor involves three distinct steps:

Instruction fetch
Instruction decode
Instruction execute.

The main feature of the Harvard architecture is that the program and data memories lie in two separate
spaces, see Figure 4. This permits a full overlap of instruction fetch and execution.

Figure 4 The basic Harvard architecture with separate data and program spaces;

Figure 5 An illustration of instruction overlap made possible by Harvard architecture.

In a Harvard architecture, since the program codes and data lie in separate memory spaces, the fetching of
the next instruction can overlap the execution of the current instruction. Normally, the program memory
holds the program codes, whilst the data memory stores variables such as the input data samples.



Pipelining

This is a technique used extensively in DSP to increase speed, as it allows two or more operations to
overlap during execution. In pipelining, a task is broken down into a number of distinct sub-tasks which
are then overlapped during execution.
A pipeline is akin to a typical production line in a factory, such as a car or TV assembly plant. As in the
production line, the task is broken down into small, independent sub-tasks called pipe stages which are
connected in series to form a pipe. Execution is sequential.

Figure 6 An illustration of the concepts of pipelining.

Figure 6 gives a timing diagram of a 3-stage pipeline. Typically, each step in the pipeline takes one
machine cycle to complete. Thus, during a given cycle up to three different instructions may be active at
the same time, although each will be at a different stage of completion.

The speed-up is given by:

speed-up = average instruction time (non-pipelined) / average instruction time (pipelined)


Example 1
In a non-pipelined processor, the instruction fetch, decode and execute take 35 ns, 25 ns and 40 ns,
respectively. Determine the increase in throughput if the instruction steps were pipelined. Assume a 5 ns
pipeline overhead at each stage, and ignore other delays.
In an ideal non-pipelined processor, the average instruction time is simply the sum of the times for
instruction fetch, decode and execute:
35 + 25 + 40 ns = 100 ns.
However, if we assume a fixed machine cycle then each instruction would take three machine cycles
to complete: 40 ns × 3 = 120 ns (the execute time, being the longest, determines the cycle time). This
corresponds to a throughput of 8.3 × 10^6 instructions per second.
In the pipelined processor, the clock speed is determined by the speed of the slowest stage plus overheads,
i.e. 40 + 5 = 45 ns. The throughput (when the pipeline is full) is 22.2 × 10^6 instructions per second.
speed-up = average instruction time (non-pipelined) / average instruction time (pipelined) = 120/45 = 2.67

Pipelining has a major impact on the system memory because it leads to an increased number of memory
accesses (typically by the number of stages). The use of Harvard architecture where data and instructions
lie in separate memory spaces promotes pipelining.
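The arithmetic in Example 1 can be checked with a short script. This is an illustrative sketch (the function name and the fixed-machine-cycle assumption follow the example, not any particular toolchain):

```python
def pipeline_speedup(stage_times_ns, overhead_ns):
    """Compare non-pipelined and pipelined average instruction times.

    Assumes a fixed machine cycle set by the slowest stage, as in Example 1.
    Returns (non-pipelined time, pipelined time, speed-up).
    """
    n_stages = len(stage_times_ns)
    cycle = max(stage_times_ns)
    non_pipelined = cycle * n_stages    # each instruction takes n_stages cycles
    pipelined = cycle + overhead_ns     # one result per (slowest stage + overhead)
    return non_pipelined, pipelined, non_pipelined / pipelined

# Example 1: fetch 35 ns, decode 25 ns, execute 40 ns, 5 ns overhead per stage
non_pipe, pipe, speedup = pipeline_speedup([35, 25, 40], 5)
print(non_pipe, pipe, round(speedup, 2))   # 120 45 2.67
```

The same function answers the exercise below by substituting the new stage times and overhead.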

Assuming the times in the above example are instead as follows:

instruction fetch 20 ns
instruction decode 25 ns
instruction execute 15 ns
pipeline overhead 1 ns

Determine the increase in throughput if the instructions were pipelined.


Example 2
Most DSP algorithms are characterised by multiply-and-accumulate operations, typified by the following equation:

y(n) = a0 x(n) + a1 x(n-1) + a2 x(n-2) + ... + a(N-1) x(n-(N-1))

Figure 7 shows a non-pipelined configuration for an arithmetic element for executing the above equation.
Assume transport delays of 200 ns, 100 ns and 100 ns, respectively, for the memory, multiplier and accumulator.
(1) What is the system throughput?
(2) Reconfigure the system with pipelining to give a speed increase of 2:1. Illustrate the operation of the
new configuration with a timing diagram.

Figure 7 Non-pipelined MAC configuration.


The coefficients, a(k), and the data arrays are stored in memory as shown in Figure 7. In the
non-pipelined mode, the coefficients and data are accessed sequentially and applied to the multiplier.
The products are summed in the accumulator. Successive MAC operations are performed once every 400
ns (200 + 100 + 100), that is, a throughput of 2.5 × 10^6 operations per second.


The arithmetic operations involved can be broken up into three distinct steps: memory read,
multiply, and accumulate. To improve speed these steps can be overlapped. A speed improvement

of 2:1 can be achieved by inserting pipeline registers between the memory and multiplier and
between the multiplier and accumulator as shown in Figure 8. The timing diagram for the pipeline
configuration is shown in Figure 9. As is evident in the timing diagram, the MAC is performed
once every 200 ns. The limiting factor is the basic transport delay through the slowest element, in
this case the memory. Pipeline overheads have been ignored.

Figure 8 Pipelined MAC configuration. The pipeline registers serve as temporary store for coefficient
and data sample pair. The product register also serves as a temporary store for the product.

Figure 9 Timing diagram for a pipelined MAC unit. When the pipeline is full, a MAC operation is
performed every clock cycle (200 ns).
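The throughput figures in Example 2 follow directly from the transport delays. A minimal sketch (delays as given in the example; pipeline overheads ignored, as in the text):

```python
def mac_throughput(memory_ns, multiplier_ns, accumulator_ns):
    """Throughput of a MAC unit with and without pipelining.

    Non-pipelined: one MAC per sum of all transport delays.
    Pipelined: one MAC per delay of the slowest element.
    Returns operations per second for each case.
    """
    total = memory_ns + multiplier_ns + accumulator_ns
    slowest = max(memory_ns, multiplier_ns, accumulator_ns)
    non_pipelined_ops = 1e9 / total      # delays are in nanoseconds
    pipelined_ops = 1e9 / slowest
    return non_pipelined_ops, pipelined_ops

non_pipe, pipe = mac_throughput(200, 100, 100)
print(non_pipe, pipe)   # 2500000.0 5000000.0 -> a 2:1 speed increase
```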

DSP algorithms are often repetitive but highly structured, making them well suited to multilevel
pipelining. Pipelining ensures a steady flow of instructions to the CPU and, in general, leads to a
significant increase in system throughput. However, on occasion pipelining may cause problems (e.g.
an unwanted instruction execution, especially near branch instructions).



Fast, dedicated hardware multiplier/accumulator

The basic numerical operations in DSP are multiplication and addition. Multiplication in software is time
consuming; additions are even worse if floating point arithmetic is used.
To make real-time DSP possible, a fast dedicated hardware MAC, using either fixed point or floating
point arithmetic, is mandatory. Characteristics of a typical fixed point MAC include:

16 x 16 bit 2's complement inputs

16 x 16 bit multiplier with 32-bit product in 25 ns

32/40 bit accumulator
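The behaviour of such a fixed-point MAC can be mimicked in software. This is a sketch under the assumptions in the list above: 16-bit two's-complement (Q15) inputs, 32-bit products, and a 40-bit wrapping accumulator:

```python
def q15_mac(coeffs, samples):
    """Multiply-accumulate 16-bit two's-complement values in a 40-bit accumulator."""
    mask = (1 << 40) - 1
    acc = 0
    for c, x in zip(coeffs, samples):
        assert -32768 <= c <= 32767 and -32768 <= x <= 32767
        acc += c * x          # 16 x 16 -> 32-bit product
        acc &= mask           # wrap to 40 bits, as the hardware register would
    # interpret the 40-bit result as a two's-complement number
    if acc >= 1 << 39:
        acc -= 1 << 40
    return acc

# 0.5 * 0.5 in Q15: 16384 * 16384 gives 268435456 (0.25 in Q30 format)
print(q15_mac([16384], [16384]))   # 268435456
```

The 8 guard bits of a 40-bit accumulator allow up to 256 full-scale products to be summed before overflow, which is why accumulators are wider than the 32-bit product.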


Special instructions

These are instructions optimised for DSP and lead to compact codes and increased speed of execution of
operations that are repeated. For example, digital filtering requires data shifts or delays to make room for
new data, followed by multiplication of the data samples by the filter coefficients, and then accumulation
of products. Recall that FIR filters are characterised by the following equation:
y(n) = Σ h(k) x(n-k),  summed over k = 0 to N-1, where N is the filter length.

In the TMS320C50, for example, the FIR equation can be efficiently implemented using the instruction
pair RPT and MACD.

The first instruction, RPT NM1, loads the filter length minus 1 (N-1) into the repeat instruction counter,
and causes the multiply-accumulate with data move (MACD) instruction following it to be repeated N
times. The MACD instruction performs a number of operations in one cycle:

multiplies the data sample, x(n-k), in the data memory by the coefficient, h(k), in the
program memory;
adds the previous product to the accumulator;
implements the unit delay, symbolized by z^-1, by shifting the data sample, x(n-k), up to update
the tapped delay line.
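In software the effect of the repeated MACD instruction can be pictured as follows. This is a sketch, not TMS320 code; the coefficient values are hypothetical, and the shift is done before the MAC loop rather than merged into it as the hardware does:

```python
def fir_step(h, delay_line, x_n):
    """Compute one FIR output y(n) and update the tapped delay line.

    Mimics N repetitions of multiply-accumulate with data move:
    each tap is multiplied by its coefficient, added to the
    accumulator, and the delay line is shifted for the next sample.
    """
    delay_line.insert(0, x_n)   # new sample x(n) enters the line
    delay_line.pop()            # oldest sample shifts out (the data move)
    acc = 0
    for k in range(len(h)):
        acc += h[k] * delay_line[k]   # h(k) * x(n-k)
    return acc

h = [1, 2, 3]        # hypothetical coefficients h(0)..h(2)
line = [0, 0, 0]     # delay line, initially cleared
print(fir_step(h, line, 1))   # y(0) = 1
print(fir_step(h, line, 0))   # y(1) = 2
```

The point of the MACD instruction is that all three steps in the loop body happen in a single cycle per tap.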

In the Motorola DSP56000 DSP processor family, as in the TMS320 family, the MAC instruction,
together with the repeat instruction (REP) may be used to implement an FIR filter efficiently:

REP  #N
MAC  X0,Y0,A  X:(R0)+,X0  Y:(R4)+,Y0

Here the repeat instruction is used with the MAC instruction to perform sustained multiplication and sums
of product operations. Again, notice the ability to perform multiple operations with one instruction, made
possible by having multiple data paths.


The contents of the registers X0 and Y0 are multiplied together and the product added to the accumulator.
At the same time, the next data sample and corresponding coefficient are fetched from the X and Y
memories for multiplication.
In most modern DSP processors, the concept of instruction repeat has been taken further by providing
instructions that allow a block of code, not just a single instruction, to be repeated a specified number of
times. In the TMS320 family (e.g. TMS320C50, TMS320C54 and TMS320C30), the format for repeat
execution of a block of instructions, with a zero-overhead loop, is:


RPTB loop
(last instruction)

Repeat instructions provided by some DSP processors have high-level language features. In the Motorola
DSP56000 and DSP56300 families, zero-overhead DO loops are provided which may also be nested. The
example below illustrates a nested DO loop in which the outer loop is executed N times and the inner loop
NM times.
LOOP2 (last instruction in the inner loop is placed here)
LOOP1 (last instruction in the outer loop is placed here)
Nested loops are useful for efficient implementation of DSP functions such as FFT algorithms and
two-dimensional signal processing.
Analog Devices DSP processors (e.g. ADSP-2115 and SHARC processors) also have nested-looping
capability. The ADSP-2115 supports up to 4 levels of nested loops. The format for looping is:
LOOP: (last instruction in the loop)
The loop is repeated until the counter expires. The loop can contain a large block of instructions, not just a
single instruction. The format for nested looping is essentially the same as for DSP56000 family.
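A nested zero-overhead loop maps naturally onto code such as block FIR filtering: the outer loop steps through N output samples and the inner loop performs the NM multiply-accumulates for each one. The sketch below shows the loop structure only (not processor code; the coefficient and data values are illustrative):

```python
def block_fir(h, x):
    """Filter a block of samples: outer loop over outputs, inner loop over taps."""
    NM = len(h)
    y = []
    for n in range(len(x)):          # outer loop: N output samples
        acc = 0
        for k in range(NM):          # inner loop: NM MAC operations
            if n - k >= 0:           # samples before x(0) are taken as zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

print(block_fir([1, 1], [1, 2, 3]))   # [1, 3, 5]
```

On a DSP56000-family device both loops would be hardware DO loops, so neither loop costs any cycles for counting or branching.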
Modern DSP processors also feature application-oriented instructions for applications such as speech
coding (e.g. those for codebook search), digital audio (e.g. those for surround sound) and
telecommunications (e.g. those for Viterbi decoding). Other application-oriented instructions include those
that support coefficient update for adaptive filters and bit-reversed addressing for FFTs (see later).
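Bit-reversed addressing, mentioned above for FFTs, simply re-orders sample indices by reversing their binary representation. In software it looks like this (a sketch; the hardware address mode does the same re-ordering at no cycle cost):

```python
def bit_reverse(index, num_bits):
    """Reverse the num_bits-wide binary representation of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)   # shift out the low bit...
        index >>= 1                            # ...into the high end of result
    return result

# Input re-ordering for an 8-point FFT (3 address bits)
print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```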



Extended parallelism - SIMD, VLIW and static superscalar processing

The trend in DSP processor architecture design is to increase both the number of instructions executed in
each cycle and the number of operations performed per instruction to enhance performance. In newer DSP
processor architectures, parallel processing techniques are extensively used to achieve increased
computational performance. The three techniques that are used, often in combination, are:

Single instruction, multiple data (SIMD) processing.

Very-long-instruction-word (VLIW) processing
Superscalar processing

Figure 10 An illustration of the use of SIMD processing and multiple data size capability to extend the
number of multiplier/accumulators (MACs) from one to four in a TigerSHARC DSP processor.

Note: SIMD processing is used to increase the number of operations performed per instruction. Typically, in DSP
processors with SIMD architectures the processor has multiple data paths and multiple execution units. Thus, a
single instruction may be issued to the multiple execution units to process blocks of data simultaneously and in this
way the number of operations performed in one cycle is increased.
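The idea can be sketched in NumPy, where one vectorised statement plays the role of a single instruction issued to several execution units at once (illustrative only; this assumes NumPy is available and is not tied to any DSP toolchain):

```python
import numpy as np

# Four coefficient/sample pairs, as might feed four MAC units
coeffs  = np.array([1, 2, 3, 4], dtype=np.int16)
samples = np.array([5, 6, 7, 8], dtype=np.int16)

# One multiply statement produces four products "simultaneously",
# as a single SIMD instruction would across four execution units.
products = coeffs.astype(np.int32) * samples   # 16 x 16 -> 32-bit products
acc = int(products.sum())
print(products.tolist(), acc)   # [5, 12, 21, 32] 70
```

Widening to 32 bits before the multiply mirrors the 16 x 16 -> 32-bit product of a fixed-point MAC, so the four products do not wrap in 16-bit arithmetic.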


Figure 11 Principles of very long instruction word (VLIW) architecture and data flow in the
advanced, fixed point DSP processor, TMS320C62x.

Note: Very-long-instruction-word (VLIW) processing is an important approach for substantially increasing the number of
instructions that are processed per cycle. A very long instruction word is essentially a concatenation of several short
instructions and requires multiple execution units, running in parallel, to carry out the instructions in a single cycle. In
the TMS320C62x, the CPU contains two data paths and eight independent execution units, organised in two sets
(L1, S1, M1 and D1) and (L2, S2, M2 and D2). In this case, each short instruction is 32 bits wide and eight of these
are linked together to form a very long instruction word packet which may be executed in parallel. The VLIW
processing starts when the CPU fetches an instruction packet (eight 32-bit instructions) from the on-chip program
memory. The eight instructions in the fetch packet are formed into an execute packet, if they can be executed in
parallel, and then dispatched to the eight execution units as appropriate. The next 256-bit instruction packet is
fetched from the program memory while the execute packet is decoded and executed. If the eight instructions in a
fetch packet are not executable in parallel, then several execute packets will be formed and dispatched to the
execution units, one at a time. A fetch packet is always 256-bit wide (eight instructions), but an execute packet may
vary between 1 and 8 instructions.


Figure 12 Principles of superscalar architecture and data flow in the

TigerSHARC DSP processor
Note: Superscalar processing is used to increase the instruction rate of a DSP processor by exploiting instruction-level parallelism. Traditionally, the term superscalar refers to computer architectures that enable multiple
instructions to be executed in one cycle. Such architectures are widely used in general purpose processors, such as
PowerPC and Pentium processors. In superscalar DSP processors, multiple execution units are provided and several
instructions may be issued to the units for concurrent execution. Extensive use is also made of pipelining techniques
to increase performance further. The TigerSHARC is described as a static superscalar DSP processor because
parallelism in the instructions is determined before run-time. In fact, the TigerSHARC processor combines SIMD,
VLIW and superscalar concepts. This advanced DSP processor has multiple data paths and two sets of independent
execution units, each with a multiplier, an ALU, a 64-bit shifter and a register file. TigerSHARC is a floating point
processor, but it supports fixed point arithmetic with multiple data types (8-, 16- and 32-bit numbers). The instruction
width is not fixed in the TigerSHARC processor. In each cycle, up to four 32-bit instructions are fetched from the
internal program memory and issued to the two sets of execution units in parallel. An instruction may be issued to
both units in parallel (SIMD instructions) or to each execution unit independently. Each execution unit (ALU,
multiplier or shifter) takes its inputs from and returns its results to the register file. The register files are connected to
the three data paths and so can simultaneously read two inputs and write an output to memory in a cycle. This
load/store architecture is suited to basic DSP operations, which often take two inputs and compute an output.
Because the processor can work on several data sizes, the execution units allow further levels of parallel
computation. Thus, in each cycle the TigerSHARC can execute up to eight addition/subtraction operations and eight
multiply-accumulate operations with 16-bit inputs, instead of two multiply-accumulate operations with 32-bit inputs.


ctors in selecting a given processor.