
Alignment Issue in Circular Buffering:

This document targets Texas Instruments microcontrollers for signal processing, specifically the TMS320VC5416; the issue is explained in detail in the following paragraphs. When implementing FIR filtering in C on the TMS320VC5416, some functions in DSPLIB require that the coefficients and the data be aligned in memory. The alignment of the data is understood, but the alignment required for the filter coefficients is not. This document explains why the filter coefficients must be aligned in memory for the DSPLIB functions to work. The question can be rephrased as: why was it chosen to put the filter coefficients in a circular buffer instead of a linear buffer, and is it more efficient?

First, what exactly do we mean by alignment? Alignment in memory here means that, for a filter of length nh, the coefficients are placed in memory at a starting address whose K lowest bits are zero, where K is the smallest integer such that 2^K > nh. For example, for a filter of length 5, K = 3 and the starting address of the filter coefficients must be of the form xxxxxxxxxxxxx000.

The issues are as follows:
1. Why do the filter coefficients need circular buffering for the DSPLIB functions to work?
2. Why do the filter coefficients need memory alignment for the DSPLIB functions to work?

The approach is to answer the following questions:
a) Why is circular buffering more efficient than linear buffering for the data?
   a.1) How is FIR filtering implemented?
   a.2) How is it implemented in hardware?
b) Why is it efficient for the filter coefficients?
c) Why is alignment needed for the DSPLIB functions to work?

a.1: I first explain FIR processing algorithms, and the flaws and problems that occur in practical implementations, since these shed light on our issues. For an FIR filter the impulse response is of the form:

$$h = [h_0, h_1, \ldots, h_M]$$

Here M is the filter order, and the length of the filter is nh = M + 1. The output is obtained from:

$$y(n) = \sum_{m=0}^{M} h_m \, x(n-m) \qquad \text{(Equation 1)}$$

where the length of the input is nx and the length of the output is ny (ny = nx + M for the full convolution). There are two types of methods to implement this equation:
1. Sample processing: the input samples are taken one at a time, and each new input sample produces one output sample y. Here the filter is implemented as a state system. Some sample processing techniques are:
   - Direct form 1
   - Direct form 2
   - Canonical form
   - Cascade form
2. Block processing: a block of input samples is taken and many output samples are produced. Some block processing methods are:
   - Convolution
   - Matrix form
   - LTI form
   - Overlap-add block convolution

We start with the sample processing methods, as they are much easier to use in real-time applications, and focus on direct form 1, whose structure is shown in fig 1.
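Equation 1 can also be evaluated directly on a whole block of input, which is the "Convolution" entry in the block-processing list. Below is a minimal C sketch, assuming double-precision samples, zero input outside the range 0 to nx - 1, and ny = nx + M outputs; the name conv and its parameter order are illustrative and are not a DSPLIB function.

```c
/* Block convolution, a direct evaluation of Equation 1:
   y(n) = sum_{m=0}^{M} h[m] * x(n - m),  n = 0 .. ny-1,  ny = nx + M.
   Input samples outside 0 .. nx-1 are treated as zero. */
void conv(int M, const double *h, int nx, const double *x, double *y)
{
    int ny = nx + M;                  /* length of the full output */
    for (int n = 0; n < ny; n++) {
        double acc = 0.0;
        for (int m = 0; m <= M; m++) {
            int k = n - m;            /* index of input sample x(n - m) */
            if (k >= 0 && k < nx)
                acc += h[m] * x[k];
        }
        y[n] = acc;
    }
}
```

For h = {1, 2, 3, 4} (M = 3) and x = {1, 1}, this produces the five outputs 1, 3, 5, 7, 4.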

fig 1: direct form 1

The structure above implements an Mth-order filter. To understand at a higher level what is happening, we first look at a C implementation of this structure and then at the processor-level implementation. The following code implements the structure:

```c
/* Usage: y = fir(M, h, w, x);
   h and w are (M+1)-element arrays, M is the filter order,
   w holds the internal states (the tapped delay line). */
double fir(int M, const double *h, double *w, double x)
{
    int i;
    double y = 0.0;                 /* output sample */

    w[0] = x;                       /* current input sample */

    for (i = 0; i <= M; i++)        /* implements Equation 1 */
        y += h[i] * w[i];

    for (i = M; i >= 1; i--)        /* tapped delay line: update the states */
        w[i] = w[i - 1];            /* in reverse order */

    return y;                       /* current output sample */
}
```

- The first loop requires a multiplier and an adder.
- The second loop just propagates the input samples through the delay line, in reverse order: for a filter of order 3, w3 = w2, then w2 = w1, then w1 = w0.

Fig 2 pictures this for a filter of order 3, chosen for simplicity; it generalizes to an Mth-order filter.
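To see that each call produces exactly one output sample, fir() can be driven one input sample at a time. This sketch repeats fir() so that it compiles on its own; the helper fir_run() is hypothetical, and the caller is assumed to zero the state array w before the first call.

```c
/* fir() as in the listing above, repeated so this sketch is self-contained. */
double fir(int M, const double *h, double *w, double x)
{
    int i;
    double y = 0.0;
    w[0] = x;                        /* current input sample */
    for (i = 0; i <= M; i++)         /* Equation 1 */
        y += h[i] * w[i];
    for (i = M; i >= 1; i--)         /* shift the delay line, reverse order */
        w[i] = w[i - 1];
    return y;
}

/* Hypothetical driver: one fir() call per input sample, so nx calls
   give nx output samples. w[0..M] must be zeroed before the first call. */
void fir_run(int M, const double *h, double *w,
             int nx, const double *x, double *y)
{
    for (int n = 0; n < nx; n++)
        y[n] = fir(M, h, w, x[n]);
}
```

Feeding x = {1, 1} padded with M = 3 zeros through h = {1, 2, 3, 4} flushes the delay line, so the five outputs 1, 3, 5, 7, 4 equal the full convolution of Equation 1 (ny = nx + M).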

fig 2: flip and slide form of convolution

Some comments about the above function:

1. One call of the function returns just one output sample: the first call gives y0, the second gives y1, and so on. With every call, the function must be given the correct input sample to produce the correct output.
2. The picture shows the input samples sitting still and the filter sliding along them. This implements Equation 1 with a change of variables. If we picture the opposite, i.e. the filter sitting still and the input samples sliding through, we replicate Equation 1 and the function exactly.

a.2: Now we look more closely at how this can be implemented in hardware (the controller). A hardware implementation is closely tied to the assembly language, i.e. the instruction set available on a particular processor; here we use the TMS320VC instruction set to understand the implementation. All the filter coefficients are located at a fixed place in memory. We assume a filter of order 3, but this generalizes to a filter of any order M. Say the memory is laid out as follows:

Memory layout (figure): filter coefficients h0, h1, h2, h3 at addresses 0000, 0001, 0002, 0003; input samples x0, x1, ..., xn starting at address 1000 (1000, 1001, 1002, 1003, ...); output samples y0, y1, ..., ym.

Above is a pictorial representation of the convolution. Procedure: every output sample y is the result of one fir call, and in assembly we can use the MAC and RPT instructions. In this approach we work on fixed memory locations, in our case 0000 to 0003 for the filter and 1000 to 1003 for the data, and perform the filtering over those locations, so the input data must be moved into those locations continuously. This is clearly an overhead. The following mechanism does the same thing more efficiently.

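Two pointers are used: p holds the address of x0 and p1 holds the address of h0. A pseudocode is as follows:

```
cfir:
    repeat for n = 1 to nh              (nh = length of the filter)
        acc += *p * *p1                 multiply the contents at the addresses held in p and p1
        dec p                           move to the previous (older) input sample
        inc p1                          move to the next coefficient
        if n == nh                      statement 1
            reset p                     statement 2: back to the sample this call started from
            inc p                       statement 3: the next call starts one sample later
            reset p1                    statement 4: make p1 point to h0 again
        end                             pointers are ready for the next call
```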
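The pseudocode can be sketched in C with all the pointer bookkeeping done in software. Array indices stand in for memory addresses, and the layout (NH - 1 zeros below x0, like addresses 997 to 999 in the text) is an assumption; the names cfir_lin, cfir_lin_init and cfir_lin_call are hypothetical.

```c
#define NH 4   /* filter length nh; order M = 3 as in the text */

/* Software model of the cfir pseudocode. Indices into mem[] play the role
   of addresses: mem[0 .. NH-2] hold zeros and x0 sits at mem[NH-1]. */
typedef struct {
    const double *mem;   /* static data memory; samples are never moved */
    const double *h;     /* h0 .. h(NH-1), stored in order */
    int start;           /* where p started on the current call */
    int p;               /* data pointer */
    int p1;              /* coefficient pointer */
} cfir_lin;

void cfir_lin_init(cfir_lin *s, const double *mem, const double *h)
{
    s->mem = mem;
    s->h = h;
    s->start = s->p = NH - 1;    /* p holds the address of x0 */
    s->p1 = 0;                   /* p1 holds the address of h0 */
}

/* One call returns one output sample. */
double cfir_lin_call(cfir_lin *s)
{
    double acc = 0.0;
    for (int n = 1; n <= NH; n++) {
        acc += s->mem[s->p] * s->h[s->p1];
        s->p--;                  /* dec p: previous (older) sample */
        s->p1++;                 /* inc p1: next coefficient */
        if (n == NH) {           /* statement 1 */
            s->p = s->start;     /* statement 2: reset p */
            s->p++;              /* statement 3: inc p */
            s->start = s->p;
            s->p1 = 0;           /* statement 4: reset p1 to h0 */
        }
    }
    return acc;
}
```

With h = {1, 2, 3, 4} and a step input stored after the zeros, successive calls return 1, 3, 6, 10, 10. Every call pays for the if test and the three pointer resets, which is the software overhead discussed next.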

fig 3: contents of the circular buffer at successive time instants (n = 0, n = 1, ...); the samples xn, ..., x1, x0 are followed by zeros, with x1 at address 1001, x0 at 1000, and 0 at 999, 998 and 997

fig 4: updating of pointer p during one cfir call (n = 0, n = 1, etc.) and across successive cfir calls

In this case the data are assumed to be located at a static location in memory. As shown in fig 3, pointer p points at the location of x0. For the first call of cfir (n = 0 in fig 3), p points at x0 (address 1000). The successive decrements move the pointer to addresses 999, 998 and 997, which hold zeros, so those products are 0. On the last pass through the repeat loop (loop counter equal to nh), the if condition becomes true: pointer p is first reset to the position it started from, address 1000 (x0), and then incremented, so that it now points to x1 at address 1001. On the next call (n = 1 in fig 3) the same thing happens: when the loop counter equals nh, p is reset to its starting position, x1 in this case, and then incremented. Thus the pointer wraps around, emulating a circular buffer. This is clearly more efficient than moving all the data every time a filtering operation is done. Putting the data in a circular buffer means that we do not need to implement statements 2 and 3. We still have to implement statement 1 because of the pointer p1. This clarifies all parts of a.
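A circular data buffer can be modelled in software by wrapping the data index modulo the buffer size, which is the job the text assigns to the TMS320 address hardware. In this sketch the wrap makes the reset of p (statement 2) unnecessary and turns the per-call advance (statement 3) into a single circular step, while p1 is still reset in software (statements 1 and 4). The buffer size NH = 4 and all names are assumptions.

```c
#define NH 4   /* filter length nh, also the circular buffer size here */

typedef struct {
    double w[NH];   /* circular data buffer; samples are never moved */
    int p;          /* data pointer: slot that receives the newest sample */
    int p1;         /* coefficient pointer, still managed in software */
} cfir_circ;

void cfir_circ_init(cfir_circ *s)
{
    for (int i = 0; i < NH; i++)
        s->w[i] = 0.0;
    s->p = 0;
    s->p1 = 0;
}

double cfir_circ_call(cfir_circ *s, const double h[NH], double x)
{
    double acc = 0.0;
    s->w[s->p] = x;                      /* newest sample overwrites the oldest */
    for (int n = 1; n <= NH; n++) {
        acc += s->w[s->p] * h[s->p1];
        s->p = (s->p + NH - 1) % NH;     /* circular dec p: after NH steps it is back */
        s->p1++;
        if (n == NH)                     /* statement 1: still needed for p1 */
            s->p1 = 0;                   /* statement 4: reset p1 to h0 */
    }
    s->p = (s->p + 1) % NH;              /* one circular step for the next sample */
    return acc;
}
```

For h = {1, 2, 3, 4} and a step input, successive calls give 1, 3, 6, 10, 10, the same as the linear version, but no sample is ever copied.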

Some comments:
1. Although we started with sample processing, calling the cfir function repeatedly effectively performs block processing.
2. On TMS320VCxxxx controllers, resetting the data pointer and incrementing it for the next filtering operation are done in hardware.
3. The function is called ny times to obtain all the output samples. After ny calls, the data pointer is reset to the first location.

This explains the need for circular buffering of the data samples. Now we look at part b.

b: In the above discussion we looked only at what happens to the data samples. Now we concentrate on the filter coefficients. At the start of a call, pointer p1 points at h0. During the successive iterations of that call it then points at h1, h2 and h3, which completes one filtering operation (one call of cfir). For the next call, i.e. the next filtering operation for the next output sample, p1 must point at h0 again. Putting the coefficients in a circular buffer therefore benefits them as well: here it means that we do not have to execute statement 4. So if we put both the data and the filter coefficients in circular buffers, we do not need to execute statements 1, 2, 3 and 4 in software.

This clarifies b.
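Making the coefficient pointer circular as well removes the remaining checks. In this hypothetical sketch both indices wrap modulo NH, so after nh steps each one is back at its starting position without an if test or an explicit reset; only the single circular step per new sample remains, which, per comment 2 above, the TMS320 hardware performs.

```c
#define NH 4   /* filter length nh */

/* Both buffers circular: every pointer step wraps modulo NH,
   the way circular addressing would. Names are hypothetical. */
typedef struct {
    double w[NH];   /* circular data buffer */
    int p;          /* data pointer */
    int p1;         /* coefficient pointer: never reset explicitly */
} cfir2_state;

static int wrap(int i) { return ((i % NH) + NH) % NH; }

void cfir2_init(cfir2_state *s)
{
    for (int i = 0; i < NH; i++)
        s->w[i] = 0.0;
    s->p = 0;
    s->p1 = 0;
}

double cfir2_call(cfir2_state *s, const double h[NH], double x)
{
    double acc = 0.0;
    s->w[s->p] = x;                   /* newest sample overwrites the oldest */
    for (int n = 0; n < NH; n++) {    /* no if test: statement 1 is gone */
        acc += s->w[s->p] * h[s->p1];
        s->p  = wrap(s->p - 1);       /* circular dec p */
        s->p1 = wrap(s->p1 + 1);      /* circular inc p1 */
    }
    /* After NH circular steps p and p1 are back where they started,
       so the resets (statements 2 and 4) are not needed. */
    s->p = wrap(s->p + 1);            /* one circular step per new sample */
    return acc;
}
```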

c: To understand why memory alignment is required specifically for the filter coefficients when circular buffering is used, we take a closer look at how circular buffering would be implemented in hardware. We assume that the filter coefficients are stored in order in contiguous memory locations. Before looking at the two possible ways, note what a successful hardware implementation of circular buffering for the filter coefficients requires:
- The filter buffer pointer must come back to its original position (i.e. the starting address, the location of the first filter coefficient) after it has gone through nh iterations.

The information available to the hardware is:
1. nh, i.e. the length of the filter.
2. The starting address of the filter coefficients.

There are two ways to meet the requirement stated above:
1. The starting address of the buffer is stored in a register, say R1, and is added to the length of the filter stored in R2; the result is stored in register R3. The contents of R1 are copied to R4, which acts as the pointer and is incremented in cfir. After every increment, the contents of R4 are compared with R3 (for example with an XOR, which gives zero on a match). If they match, the contents of R1 are copied back into R4. Thus a circular buffer is implemented.

2. For the second method we must ensure the following:
   a. The size of the buffer must be a power of two, with 2^n > nh; the filter length itself can be any size.
   b. The buffer must be aligned so that its starting address has its n least significant bits equal to zero.
   In this case a register L1 contains the length of the filter and is decremented after every iteration of the loop in cfir. As soon as it reaches zero, the n LSBs of the pointer are cleared. Because the starting address already has those n bits at zero, and the pointer has moved by fewer than 2^n locations, clearing them returns the pointer exactly to the start of the buffer; without alignment it would land somewhere else.
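Both wrap-around methods can be modelled in a few lines of C with 16-bit addresses. The register names R1 to R4 and L1 follow the text; the functions, the struct and the use of uint16_t are assumptions for illustration, not a description of how the TMS320 address unit is actually wired. The second model shows why alignment matters: clearing the n LSBs returns to the start only if the start address already has those bits at zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Method 1: compare the pointer R4 with the end address R3 = R1 + R2. */
uint16_t step_method1(uint16_t r4, uint16_t r1, uint16_t r2)
{
    uint16_t r3 = (uint16_t)(r1 + r2);   /* one past the last coefficient */
    r4++;                                /* advance the pointer */
    if ((r4 ^ r3) == 0)                  /* XOR is zero on a match */
        r4 = r1;                         /* copy the start address back */
    return r4;
}

/* Smallest n with 2^n > nh, the buffer-size condition of method 2. */
unsigned buffer_bits(unsigned nh)
{
    unsigned n = 0;
    while ((1u << n) <= nh)
        n++;
    return n;
}

bool is_aligned(uint16_t addr, unsigned n)
{
    return (addr & ((1u << n) - 1u)) == 0;   /* n LSBs all zero? */
}

/* Method 2: counter L1 counts down; at zero the n LSBs of the pointer are cleared. */
typedef struct {
    uint16_t ptr;   /* coefficient pointer */
    unsigned l1;    /* iterations left before the wrap */
    unsigned nh;    /* filter length */
    unsigned n;     /* number of LSBs cleared on wrap */
} circ2;

void step_method2(circ2 *c)
{
    c->ptr++;
    if (--c->l1 == 0) {
        c->ptr &= (uint16_t)~((1u << c->n) - 1u);   /* clear the n LSBs */
        c->l1 = c->nh;                               /* reload the counter */
    }
}
```

With nh = 5 and n = 3, a buffer starting at the aligned address 0x1008 wraps back to 0x1008 after five steps, whereas one starting at the misaligned address 0x1001 lands on 0x1000, which is not the start of the coefficients.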

The TMS320 processors implement circular buffering in this second way, which is why the filter coefficients need to be aligned in memory. With circular buffering we do not need to move the data in memory: the data stay in one place and only the pointer addressing them changes. This is definitely more efficient than the linear method. In the linear method the data sit in a linear buffer, and because one filtering operation needs only as many input samples as the length of the filter, every call to fir must first shift the input data by one memory location and then perform the filtering. In the circular method the data are put in a circular buffer and only the pointer moves.

This clarifies c.

Arthur Butz. Rushi Desai.
