
A Comparison of FPGA and DSP Development Environments and Performance for Acoustic Array Processing

Russ Duren, Jeremy Stevenson and Mike Thompson


Department of Electrical and Computer Engineering, Baylor University, Waco, TX 76798
E-mail: Russell_Duren@baylor.edu, Michael_W_Thompson@baylor.edu
Abstract: This paper compares the development effort and performance of a field programmable gate array (FPGA)-based implementation of a signal processing solution with that of a traditional digital signal processor (DSP) implementation. An acoustic array processing task was selected as a typical problem. A simple metric is proposed to compare the design effort.

I. INTRODUCTION

Advances in the capabilities of field programmable gate arrays (FPGAs) have generated interest in using FPGAs for computationally intensive applications, including many digital signal processing tasks. One problem limiting the widespread use of FPGAs has been the specialized knowledge required to develop FPGA-based solutions. Recently, design tools have become available that help shorten the development time required for implementing signal processing solutions using FPGAs. The primary goal of this paper is to compare the development effort and performance of an FPGA-based implementation of a signal processing solution with those of a traditional DSP processor implementation, using these newer tools.

A comparison of processing speed between an Altera FPGA and a TI DSP for a wavelet-based processing algorithm has been provided in [1]. That work mentions, but only in passing, that FPGAs require more design effort. Reference [2] compares a TI DSP to an older Xilinx 4000 series FPGA on a convolution task. That work spends more time discussing the differences in the development tools; however, as it is an older work, the authors used assembly language to program the DSP and apparently used schematic capture to develop the circuitry for the FPGA.

The signal processing system implemented for the comparison must process signals from an array of microphones to determine the direction of arrival of an impulsive acoustic signal. The FPGA development is accomplished using MATLAB's Simulink in conjunction with Xilinx's FPGA System Generator Blockset. It is implemented on an XtremeDSP board from Nallatech. The DSP processor development is accomplished with Code Composer Studio, with a TMS320C6711 DSK as the target.

It is widely held that there is a trade-off between the two approaches in terms of implementation complexity and processing performance. The use of design tools for bridging this gap is of interest to system developers [3], [4]. In this paper we develop a simple metric that allows us to compare the two implementations in terms of implementation complexity. Following this, we discuss the processing performance achieved by each solution. The paper concludes with a discussion of the results and of the pitfalls encountered in implementing each approach.

II. THE SIGNAL PROCESSING TASK

To arrive at a meaningful comparison between an FPGA and a DSP processor implementation, we considered an application that requires several traditional DSP functions found in many applications. Our application uses a fixed, planar array of microphones to capture impulsive acoustic signals from a point source. The sound source is assumed to be far enough away to justify a far-field assumption. Additionally, the sound source is assumed to lie in the same plane as the microphone array. Examples of impulsive acoustic signals include a gunshot or a handclap. The objective is to determine the direction of the (far-field) sound source. This is accomplished by estimating the relative time delays between the arrivals of the acoustic impulse at each microphone. It is important to note that many standard signal processing blocks, including offset removal, correlation, and trigonometric calculations, are primary components of the solution. A top-level block diagram of the system is shown in Fig. 1. Details of each of the blocks are provided below.

Figure 1. Top-level Block Diagram

1-4244-1176-9/07/$25.00 © 2007 IEEE.


One of the goals of the study was to implement the sound location signal processing task using realistic sound sources at known locations. For this study, handclaps were used to represent impulsive signals. Each clap was recorded in a quiet room using an array of two microphones placed 10 cm apart. The signals were captured at a sampling rate of 44.1 kHz. The sound sources were located along an arc of radius 10 m centered on the center of the microphone array. This distance is sufficient to justify a planar model for the incoming sound waves (the far-field assumption). Handclaps were recorded at angles of 70°, 75° and 110° relative to the array, where 0° was defined to be on a line passing through both microphones to the left. The first two angles were chosen to demonstrate the resolution of the system, while the last angle verified the ability of the system to distinguish an angle of arrival from a source located in the second quadrant.

III. ALGORITHM OVERVIEW

This section provides a more detailed description of the signal processing approach used for both the DSP and FPGA implementations. For both systems, a block processing approach is taken, as illustrated in Fig. 1. The focus of this study is to compare the block processing times for both implementations, and therefore the time required for signal acquisition is not included in the comparison.

The time-domain signature of an impulsive signal exhibits a short-duration, large-amplitude spike. A peak-detecting procedure is used to locate the time index that corresponds to the maximum amplitude of the reference microphone signal. Each signal is then trimmed to a block of 1000 samples by taking 500 samples before and after the time index located by the peak-detection process. At a 44.1 kHz sampling rate, a 1000-sample block spans approximately 22.7 ms, which was experimentally verified to be adequate for capturing the handclap signals.
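The peak detection and block extraction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `extract_block` is hypothetical, the peak is taken as the sample of largest absolute value, and the 1000-sample block is taken as the half-open range from 500 samples before the peak to 500 samples after it.

```python
def extract_block(reference, others, half_width=500):
    """Cut a block of 2 * half_width samples around the reference-channel peak.

    reference: list of samples from the reference microphone.
    others:    list of sample lists from the remaining microphones.
    Returns (peak_index, reference_block, list_of_other_blocks).
    """
    # Peak detection (assumption: largest absolute amplitude marks the impulse).
    peak = max(range(len(reference)), key=lambda i: abs(reference[i]))

    # Assumption: the block covers [peak - half_width, peak + half_width).
    start = peak - half_width
    stop = peak + half_width
    if start < 0 or stop > len(reference):
        raise ValueError("peak is too close to the edge of the recording")

    # The same index range is applied to every channel so that the
    # relative delays between channels are preserved.
    ref_block = reference[start:stop]
    other_blocks = [channel[start:stop] for channel in others]
    return peak, ref_block, other_blocks
```

Applying one common index range to all channels matters here: the later delay estimate depends on the channels staying time-aligned with the reference.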
Figure 2 illustrates that a sound wave from a given direction will arrive at each microphone in the array at a predictable delay time with respect to the reference microphone. For both implementations, the reference microphone signal is cross-correlated with each of the other microphone signals in the array in order to estimate the delay time with respect to the reference microphone. For zero-mean signals, a commonly used estimator for cross-correlation is given by

$$R_{XY}(n) = \sum_{k=-\infty}^{\infty} x(k)\, y(n+k), \qquad n \in (-\infty, \infty), \tag{1}$$

where x represents the reference microphone signal and y represents one of the remaining microphone signals. Before implementing the cross-correlation operation, the signals from each microphone require preprocessing. The cross-correlation estimate in (1) requires that the dc offset of each signal be removed. The sample mean of each signal block is calculated and subtracted on a sample-by-sample basis in the preprocessing stage of the algorithm. It is noted that in certain situations the preprocessing stage could also be used to isolate the impulsive signal from the existing background noise more effectively by employing a high-pass filter.

Based on the geometry of the microphone array, it is possible to calculate the maximum time delay between the reference microphone and each microphone in the array (the maximum delay occurs when the source is located at 0° or 180°). This allows one to limit the number of cross-correlation lags that need to be calculated in the correlation algorithm. For the experimental set-up used in this study, the number of lag calculations can be restricted to 40 lags. We note that it is also possible for the delay time between a given microphone and the reference microphone to be negative. A negative delay time simply indicates that the acoustic waveform reached a given microphone before it reached the reference microphone. To account for the possibility of a negative time delay, the cross-correlation for the remaining microphones in the array is calculated twice: once with the reference microphone signal leading each of the remaining microphone signals, and again with the reference microphone signal in a lagging position.

Fig. 2 shows how the angle of arrival calculation uses the time difference of the sound's arrival at the two microphones, as measured between the two peaks of the correlations, to calculate the approach angle of the incoming sound. The time difference, $t$, is multiplied by the speed of sound to get the distance traveled by the sound, $d_t$. This distance is used along with the distance between the microphones, $d_m$, to calculate the angle of arrival, $\theta$, as shown in (3) and (4).

$$r = \sqrt{d_m^2 - d_t^2} \tag{3}$$

$$\theta = \tan^{-1}\!\left(\frac{r}{d_t}\right) \tag{4}$$

Notice that the solution uses a square root and an arctangent calculation, neither of which is native to FPGA hardware.

Figure 2. Time delay, t, due to angle of arrival, θ
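The offset removal, lag-limited cross-correlation, and angle calculation described in this section can be sketched in Python as follows. This is a minimal illustration, not the paper's C or Verilog code. Several things are assumptions: all function names are hypothetical; the speed of sound is taken as 343 m/s (the paper does not state the value used); the "40 lags" limit is read as a search over lags from -40 to +40; a positive lag means the second signal arrives after the reference; and `math.atan2` is used in place of a plain arctangent so that negative delays map to angles above 90°.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not given in the paper


def remove_dc(block):
    """Subtract the sample mean from every sample in the block."""
    mean = sum(block) / len(block)
    return [s - mean for s in block]


def cross_correlation(x, y, lag):
    """Finite-block version of (1): sum of x[k] * y[k + lag] over valid k."""
    total = 0.0
    for k in range(max(0, -lag), min(len(x), len(y) - lag)):
        total += x[k] * y[k + lag]
    return total


def estimate_delay_samples(x, y, max_lag=40):
    """Return the lag in [-max_lag, max_lag] with the largest correlation."""
    x = remove_dc(x)
    y = remove_dc(y)
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(x, y, lag))


def angle_of_arrival_deg(delay_samples, fs=44100.0, mic_spacing=0.10,
                         c=SPEED_OF_SOUND):
    """Convert a delay in samples to an angle, following (3) and (4)."""
    d_t = c * delay_samples / fs  # path-length difference in metres
    # Clamp so that noise cannot push d_t beyond the microphone spacing.
    d_t = max(-mic_spacing, min(mic_spacing, d_t))
    r = math.sqrt(mic_spacing ** 2 - d_t ** 2)
    return math.degrees(math.atan2(r, d_t))
```

With these conventions a delay of zero samples gives 90° (broadside), and delays of +5 and -5 samples give angles that are mirror images about 90°. Which physical direction corresponds to 0° depends on which microphone is chosen as the reference, so the sign convention would need to be matched to the actual array.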


IV. HARDWARE IMPLEMENTATIONS

The FPGA implementation of the signal processing system was developed using MATLAB version 7.2.0.232 (R2006a) and Simulink version 6.4 in conjunction with Xilinx's FPGA System Generator for DSP version 8.1.01. It was implemented in a Xilinx Virtex-II XC2V3000-4FG676 FPGA using a Nallatech XtremeDSP Development Kit-II. System Generator for DSP provides Simulink blocks that are connected together in a manner similar to schematic design. Our implementation used a hierarchical arrangement. The top-level drawing contained three blocks, as shown in Fig. 1. The functionality of each of the top-level blocks was specified using a lower-level diagram containing System Generator blocks. The lower-level preprocessing block was designed exclusively using basic System Generator blocks. The angle of arrival block used CORDIC (Coordinate Rotation Digital Computer) square root and arctangent blocks from the System Generator block set.

The System Generator block set does not provide a correlation block. We were faced with the choice of implementing the correlation function either with an assembly of smaller blocks or using the black box feature of System Generator. We chose the latter. The black box feature allows the user to develop a custom block whose functionality is specified using a hardware description language (HDL), either Verilog or VHDL. We used Verilog.

The FPGA circuitry was designed using 16-bit fixed-point math. A very convenient feature of the System Generator block set was the GatewayIn block. This block took a double-precision floating-point value from MATLAB and converted it to a desired fixed-point format, in this case a signed 16-bit number with 15 bits to the right of the binary point. Similarly, the GatewayOut block converted the fixed-point results back to floating-point values for display and analysis using MATLAB.
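The fixed-point format described above (signed, 16 bits, 15 fractional bits) can be illustrated with a short Python sketch. The function names `to_q15` and `from_q15` are hypothetical, and the round-to-nearest and saturate-on-overflow behavior is an assumption; the actual quantization and overflow settings of the gateway blocks used in the design are not stated in the paper.

```python
Q15_SCALE = 1 << 15  # 2**15 steps per unit: 15 fractional bits


def to_q15(value):
    """Quantize a float to a signed 16-bit integer with 15 fractional bits.

    Assumptions: round to nearest, saturate at the ends of the range.
    """
    raw = int(round(value * Q15_SCALE))
    return max(-32768, min(32767, raw))


def from_q15(raw):
    """Convert a signed 16-bit fixed-point integer back to a float."""
    return raw / Q15_SCALE
```

In this format the representable range is -1.0 up to just under +1.0 with a step of 2^-15, so input samples would need to be scaled into that range before conversion.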
As will be discussed later, the use of 16-bit fixed-point math did not result in a noticeable change in the accuracy of the output. The maximum clock rate, as reported after placing and routing the design in the FPGA, was 40 MHz. System Generator does not provide an easy method for implementing parallelization techniques such as loop unrolling. As a result, the initial design of the FPGA system was developed without taking full advantage of the parallelization that gives the FPGA its major advantage over a DSP device. As shown in Table I, only 3 of the available 96 multipliers were used in this design. The extra hardware could easily be used to process more microphones. Initial work indicates that an array of eight microphones could be processed using the chosen FPGA without impacting the clock speed.
TABLE I. FPGA RESOURCE UTILIZATION

FPGA Resource    Number Used    Number Available
External IOB     110            484
MULT18X18        3              96
RAMB16           4              96
SLICE            3123           14336

The DSP implementation was accomplished using Texas Instruments' Code Composer 3.3 development software with a TMS320C6711 DSK as the target hardware. Note that the target processor has floating-point capability, a feature that reduces the development time for most DSP applications. Subroutines, programmed in C, were developed for the preprocessing, partial cross-correlation and angle calculation portions of the algorithm. In order to obtain an accurate analysis of the execution time of the algorithm, the C6711 Device Accurate Simulator (Little Endian) was used to obtain a cycle count for the overall algorithm. The input wave files were read into the development environment using the Probe Points feature of the software. However, the cycle count for reading the input signals into memory was not included in the overall performance figures.

V. MEASURING THE DEVELOPMENT COMPLEXITY

The major focus of this work was to determine how the available development tools affected the design development time and the design performance. For a comparison of the development time, a metric of equivalent lines of code was developed. Using lines of code provides a more objective comparison of development effort than a simple recording of the man-hours spent on the development, as the latter is highly dependent on the skill of the designer. (Certainly the designer's skill can influence the required lines of code, but this is considered to be a secondary effect.) Other researchers have developed metrics for characterizing the quality and design effort required for graphical languages, including National Instruments' LabVIEW [5], [6], UML [7], and Simulink [8], [9]. Our metric was chosen to estimate the design entry time in lines of code (LOC). Since this cannot be applied directly to graphical languages such as Simulink, we developed an equivalent LOC metric for Simulink code. The FPGA design tools use graphical blocks supplemented with HDL code instead of C code. In order to find an equivalent measurement for the total LOC in the FPGA design, an assumption needed to be made about the blocks used in the design. The blocks required the user to specify a minimum of three basic details to instantiate the block: the block function, the input(s), and the output(s). The inputs and outputs are specified by wires. The methodology counted one line of code for each block placed and one line for each wire routed. Many blocks also contained user-defined parameters that had to be set for each block instantiation. For every parameter needed to define the block, another line of code was added to the count. Finally, the FPGA design included some functions implemented as user-defined black boxes. The function of these black boxes was specified using text-based configuration files and Verilog source files.
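The counting rules of the equivalent LOC metric (one line per block, one per wire, one per user-set parameter, plus every line of the black-box text files) can be written down as a small Python function. The function name and argument layout are hypothetical, and the numbers in the example are made up purely to show the arithmetic; they are not the counts from the design in this paper.

```python
def equivalent_loc(parameters_per_block, wire_count, text_file_line_counts):
    """Equivalent lines of code for a graphical design.

    parameters_per_block:  one entry per placed block, giving the number
                           of user-defined parameters set on that block.
    wire_count:            number of wires routed between blocks.
    text_file_line_counts: line counts of the black-box configuration
                           files and HDL source files.
    """
    block_lines = len(parameters_per_block)      # one line per block placed
    parameter_lines = sum(parameters_per_block)  # one line per parameter set
    text_lines = sum(text_file_line_counts)      # HDL and config file lines
    return block_lines + wire_count + parameter_lines + text_lines


# Illustrative only: 3 blocks with 2, 0 and 1 parameters, 4 wires,
# and two text files of 10 and 5 lines.
example_total = equivalent_loc([2, 0, 1], 4, [10, 5])
print(example_total)  # 3 + 4 + 3 + 15 = 25
```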
The number of lines of code in each of these files was added to the count as well. This methodology brought the FPGA design to a total of 429 lines of code. By comparison, the DSP design contained 86 LOC. The final results indicate that the FPGA design required 4.6 times more lines of code. This corresponds fairly closely to the ratio of man-hours required to implement and debug the two designs, which was approximately 4 to 1.

VI. PERFORMANCE RESULTS

Table II summarizes the results of this study. As mentioned before, both implementations were tested using recorded handclaps at angles of 70°, 75° and 110° relative to the microphone array. The numerical accuracy of both implementations is similar, with the difference explained by the fact that the FPGA was a fixed-point implementation whereas the DSP used floating point. In terms of timing performance, the FPGA implementation is significantly faster than the DSP. The DSP took 25,725,060 clock cycles to produce a final answer running at a clock rate of 100 MHz. The FPGA took 23,005 clock cycles to produce a final answer running at a clock rate of 40 MHz. This resulted in operating times of 257.3 ms and 0.575 ms respectively, a speedup factor of 447. However, it should be noted that a closer examination of the DSP implementation results from Code Composer's Profiler revealed a large percentage of stall cycles. We believe that our DSP implementation did not make effective use of the processor's memory cache and that the number of stall cycles has the potential to be greatly reduced. Furthermore, the use of hand-optimized linear assembly code has the potential to further improve the DSP execution time.
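The execution times and speedup quoted above follow directly from the cycle counts and clock rates, as this short Python check shows. The helper name `execution_time_ms` is hypothetical; the cycle counts and clock rates are the ones reported in the text.

```python
def execution_time_ms(cycles, clock_hz):
    """Execution time in milliseconds for a cycle count at a clock rate."""
    return 1000.0 * cycles / clock_hz


dsp_ms = execution_time_ms(25_725_060, 100e6)  # DSP at 100 MHz
fpga_ms = execution_time_ms(23_005, 40e6)      # FPGA at 40 MHz
speedup = dsp_ms / fpga_ms

print(f"DSP:  {dsp_ms:.1f} ms")      # DSP:  257.3 ms
print(f"FPGA: {fpga_ms:.3f} ms")     # FPGA: 0.575 ms
print(f"Speedup: {speedup:.0f}x")    # Speedup: 447x
```

Because the FPGA runs at a lower clock rate than the DSP, the speedup in wall-clock time (about 447) is smaller than the ratio of cycle counts (about 1118).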
TABLE II. DSP VERSUS FPGA IMPLEMENTATION COMPARISON

System    LOC    Angle Calculation                  Exec. Time
                 (70°)      (75°)      (110°)
DSP       86     72.4°      77.8°      109.5°       257.3 ms
FPGA      429    72.7°      77.0°      111.3°       0.575 ms

VII. DISCUSSION

The software metric used to determine the equivalent LOC for the Simulink source code corresponded to the development time, at least as a rough estimate. This is encouraging, but it is by no means conclusive. Further work will be required to compare this metric on multiple designs. In addition, it should be compared with other metrics, such as those proposed in [5] through [9]. As suggested in [7], another important step is the development of theoretical comparisons to supplement the empirical comparisons.

The time required to develop the design using the graphical method was four to five times greater than that required to code the design in C. This is a disappointing result that was supported by our LOC comparison. However, it is often the case that DSP algorithms are originally developed and tested in a MATLAB or Simulink environment. If this is the case, some of the programming time may be saved by starting with the Simulink code used for algorithm development. In addition, much of the development time for the FPGA version can be attributed to the fact that System Generator does not provide a correlation block. When a design can be implemented using major functions that are described by standard blocks, e.g., FFTs, the development time can be greatly reduced and the quality of results can be increased. We note that the development time advantage of the DSP approach will be diminished significantly if portions of the code need significant tuning. Future plans for the DSP portion of the comparison include more effective use of cache memory and the use of an optimized subroutine for the cross-correlation calculation.

The speedup provided by the FPGA was dramatic, although somewhat misleading. Our goal of achieving a quick solution using a DSP processor resulted in a DSP implementation that did not fully utilize the capability of the processor. However, even if one were to take full advantage of the capabilities of today's advanced floating-point DSP processors (up to 700 MMAC/s), they are ultimately no match for a large FPGA that can employ massive parallelism. The DSP's advantage is the ease of programming and the fact that functionality can be extended by using a larger program (at the expense of speed). The FPGA's advantage is raw speed.

A more detailed preliminary version of this work is available in [10]. That work also contains details of the hardware used to capture the sounds and source code listings.

REFERENCES

[1] M. Montani, L. De Marchi, A. Marcianesi and N. Speciale, "Comparison of a programmable DSP and FPGA implementation for a wavelet-based denoising algorithm," Proc. 46th Midwest Symp. Circuits and Systems (MWSCAS 2003), Dec. 2003, pp. 602-605.
[2] D. Bilsby, R. Walke and R. Smith, "Comparison of a programmable DSP and a FPGA for real-time multiscale convolution," IEE Colloquium on High Performance Architectures for Real-Time Image Processing, Feb. 1998, pp. 4/1-4/6.
[3] M. Ownby and W. Mahmoud, "A design methodology for implementing DSP with Xilinx System Generator for MATLAB," Proc. 35th Southeastern Symp. Sys. Theory, Mar. 2003, pp. 404-408.
[4] Y. Yi and R. Woods, "FPGA-based system-level design framework based on the IRIS synthesis tool and System Generator," Proc. IEEE Int'l Conf. Field-Programmable Tech. (FPT), Dec. 2002, pp. 85-92.
[5] S. Bragg and C. Driskill, "Diagrammatic-graphical programming languages and DoD-STD-2167A," Proc. IEEE Autotestcon, Sept. 1994, pp. 211-220.
[6] D. Pittman and J. Miller, "Software metrics for non-textual programming languages," Proc. IEEE Autotestcon, Sept. 1997, pp. 198-203.
[7] M. Genero, M. Piattini, and C. Calero, "A survey of metrics for UML class diagrams," Jour. Object Tech., [Online], vol. 4, no. 9, Nov-Dec 2005, pp. 59-92. Available: http://www.jot.fm/issues/issue_2005_11/article1
[8] A. Hosagrahara and P. Smith, "Measuring productivity and quality in model-based design," MATLAB Digest, [Online], Mar. 2006. Available: http://www.mathworks.com/company/newsletters/digest/2006/mar/measuringprod.html
[9] G. Menkhaus and B. Andrich, "Metric suite for directing the failure mode analysis of embedded software systems," Proc. of the 7th Int'l Conf. on Enterprise Information Systems, May 2005. Available: http://www.softwareresearch.net/site/publications/C069.pdf
[10] J. Stevenson, "A comparison of field programmable gate arrays and digital signal processors in acoustic array processing," Master's thesis, Dept. Elec. Comp. Eng., Baylor Univ., Waco, TX, 2006.
