
CHAPTER 1 INTRODUCTION

In recent years there has been a growing trend to implement digital signal processing functions in Field Programmable Gate Arrays (FPGAs). Consequently, great effort must be put into designing efficient architectures for digital signal processing functions such as FIR filters, which are widely used in video and audio signal processing, telecommunications, and elsewhere. Traditionally, direct implementation of a K-tap FIR filter requires K multiply-and-accumulate (MAC) blocks, which are expensive to implement in an FPGA due to logic complexity and resource usage. To resolve this issue, we first present distributed arithmetic (DA), a multiplier-less architecture. Implementing multipliers in the logic fabric of an FPGA is costly, especially when the filter size is large. Modern FPGAs have dedicated DSP blocks that alleviate this problem; however, for very large filter sizes the challenge of reducing area and complexity remains.

An alternative to computing the multiplications is to decompose the MAC operations into a series of lookup table (LUT) accesses and summations. This approach is termed distributed arithmetic (DA), a bit-serial method of computing the inner product of two vectors in a fixed number of cycles. The original DA architecture stores all possible binary combinations of the coefficients w[k] of equation (1) in a memory or lookup table. Evidently, for large values of L, the memory containing the precomputed terms grows exponentially, becoming too large to be practical. The memory size can be reduced by dividing the single large memory (2^L words) into m smaller memories, each of size 2^k, where L = m × k. The memory size can be further reduced to 2^(L-1) and 2^(L-2) by applying offset binary coding and exploiting the resulting symmetries in the contents of the memories. This technique is based on the 2's-complement binary representation of the data, which allows the partial sums to be precomputed and stored in LUTs.

Because DA is an efficient solution especially suited to LUT-based FPGA architectures, many researchers have put great effort into using DA to implement FIR filters on FPGAs. Patrick Longa introduced the structure of a DA-based FIR filter and the functions of each part. Sangyun Hwang analyzed the power consumption of a DA-based filter. Heejong Yoo proposed a modified DA architecture that gradually replaces LUT requirements with multiplexer/adder pairs. The main problem of DA, however, is that the required LUT capacity increases exponentially with the order of the filter, since DA implementations need 2^K words (where K is the number of filter taps); if K is prime, the hardware resource cost is even higher. To overcome these problems, this thesis presents a hardware-efficient DA architecture that not only reduces the LUT size but also modifies the structure of the filter to achieve high-speed performance. The proposed filter has been designed and synthesized with ISE 9.1 and implemented on a 4VLX40FF668 FPGA device. Our results show that the proposed DA architecture can implement FIR filters with higher speed and smaller resource usage than the previous DA architecture.

1.1 Objective and goal of the thesis

Traditionally, direct implementation of a K-tap FIR filter requires K multiply-and-accumulate (MAC) blocks, which are expensive to implement in an FPGA due to logic complexity and resource usage. To resolve this issue, we first present DA, a multiplier-less architecture.

An alternative to computing the multiplications is to decompose the MAC operations into a series of lookup table (LUT) accesses and summations. This approach, distributed arithmetic (DA), is a bit-serial method of computing the inner product of two vectors in a fixed number of cycles, based on the 2's-complement binary representation of the data: partial sums of the coefficients are precomputed and stored in LUTs. Because DA is an efficient solution especially suited to LUT-based FPGA architectures, this thesis uses it to implement FIR filters efficiently on an FPGA.

1.2 Literature Survey

Finite Impulse Response (FIR) filters are one of the primary types of filters used in digital signal processing. An FIR filter is a filter structure that can be used to implement almost any sort of frequency response digitally, and it is usually realized as a series of delays, multipliers, and adders that create the filter's output. The characteristic equation can be expressed as a convolution of the coefficient sequence b_i with the input signal:

y[n] = Σ_{i=0}^{N} b_i x[n−i]
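The sum above can be evaluated directly. The following is a minimal pure-Python sketch of the direct-form computation; the coefficient values and inputs are arbitrary examples, not taken from the thesis:

```python
def fir_filter(b, x):
    """Direct-form FIR: y[n] = sum_{i=0}^{N} b[i] * x[n-i].

    b -- filter coefficients (N+1 taps)
    x -- input sample sequence
    Returns the output sequence, same length as x; samples before
    n = 0 are assumed to be zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for i, bi in enumerate(b):
            if n - i >= 0:
                acc += bi * x[n - i]   # one multiply-accumulate per tap
        y.append(acc)
    return y

# Example: a 2-tap filter applied to a short input
print(fir_filter([2, 1], [1, 2, 3]))   # -> [2, 5, 8]
```

Note that every output sample costs one multiplication per tap, which is exactly the cost that distributed arithmetic removes.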
That is, the filter output is a weighted sum of the current and a finite number of previous values of the input. Implementing this filter directly requires many multiplications, so an FPGA implementation following the traditional approach consumes a large amount of hardware resources. To decrease the hardware resource requirement, a method called distributed arithmetic (DA) is used.

Distributed arithmetic, along with modulo arithmetic, is a computation algorithm that performs multiplication with lookup-table-based schemes. Both stirred some interest over two decades ago but have languished since. DA specifically targets the sum-of-products (sometimes referred to as the vector dot product) computation, which covers many of the important DSP filtering and frequency-transforming functions. Distributed arithmetic is so named because the arithmetic operations that appear in signal processing are distributed in an often unrecognizable fashion: it is a bit-level rearrangement of the multiply-accumulate that hides the multiplications. The motivation for using distributed arithmetic is its computational efficiency: by careful design, one may reduce the total gate count in a signal processing arithmetic unit by a number seldom smaller than 50 percent and often as large as 80 percent. DA is generally used for sum-of-products and MAC (multiply-and-accumulate) computations. It takes advantage of technologies rich in memory elements, using lookup tables and adders to compute inner products. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate unit, well suited to FPGA designs, and it can be extended to other sum functions such as complex multiplies, Fourier transforms, and so on. The distributed computation algorithm is based on lookup tables.
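The LUT-and-adder scheme just described can be modelled behaviourally. The sketch below is plain Python, restricted to unsigned inputs for simplicity (the 2's-complement handling used in real DA hardware is omitted), with illustrative names: it builds the 2^K-entry table of coefficient partial sums, then consumes the inputs one bit per cycle with a shift-and-add, using no multiplier:

```python
def da_inner_product(coeffs, xs, nbits):
    """Distributed-arithmetic inner product sum_k coeffs[k] * xs[k].

    Precomputes a LUT holding every binary combination of coefficient
    sums (2^K entries for K taps), then processes the inputs
    bit-serially, MSB first, shifting and accumulating each cycle.
    xs are unsigned nbits-wide integers in this simplified sketch.
    """
    K = len(coeffs)
    # LUT entry `addr` holds the sum of coefficients whose bit is set
    lut = [sum(coeffs[k] for k in range(K) if (addr >> k) & 1)
           for addr in range(1 << K)]
    acc = 0
    for b in reversed(range(nbits)):       # one cycle per input bit
        addr = 0
        for k in range(K):                 # gather bit b of every input
            addr |= ((xs[k] >> b) & 1) << k
        acc = (acc << 1) + lut[addr]       # shift-and-add, no multiply
    return acc

coeffs, xs = [3, 5, 7], [2, 4, 6]
assert da_inner_product(coeffs, xs, 4) == 3*2 + 5*4 + 7*6   # = 68
```

The fixed cycle count (one per input bit) and the 2^K LUT growth visible here are exactly the properties discussed in the text.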

1.3 Comparison between FIR and IIR filters

1.3.1 Advantages of FIR filters (compared to IIR filters)

Compared to IIR filters, FIR filters offer the following advantages:

1. They can easily be designed to be "linear phase" (and usually are). Put simply, linear-phase filters delay the input signal but don't distort its phase.

2. They are simple to implement. On most DSP microprocessors, the FIR calculation can be done by looping a single instruction.

3. They are suited to multi-rate applications. By multi-rate, we mean either "decimation" (reducing the sampling rate), "interpolation" (increasing the sampling rate), or both. Whether decimating or interpolating, the use of FIR filters allows some of the calculations to be omitted, providing an important computational efficiency. In contrast, if IIR filters are used, each output must be individually calculated even if it will be discarded, because the feedback must be incorporated into the calculation.

4. They have desirable numeric properties. In practice, all DSP filters must be implemented using finite-precision arithmetic, that is, a limited number of bits. The use of finite-precision arithmetic in IIR filters can cause significant problems due to the feedback, but FIR filters, having no feedback, can usually be implemented using fewer bits, and the designer has fewer practical problems to solve related to non-ideal arithmetic.

5. They can be implemented using fractional arithmetic. Unlike IIR filters, it is always possible to implement an FIR filter using coefficients with magnitude less than 1.0. (The overall gain of the FIR filter can be adjusted at its output, if desired.) This is an important consideration when using fixed-point DSPs, because it makes the implementation much simpler.

1.4 Existing Systems

Existing systems for implementing FIR filters are multiplier-based designs and distributed-arithmetic-based designs. Multiplier-based designs require a large number of multiplications, so hardware utilization is high in FPGA solutions.

1.5 Implemented Systems

In the distributed arithmetic approach, the FIR filter is implemented using a LUT together with add-and-shift hardware, with no multipliers at all. The LUT stores all possible combinations of sums of the coefficients, and the add-and-shift hardware realizes the filter functionality from the LUT contents. This method is explained in detail in the chapter on FIR filter design using distributed arithmetic in this thesis.

1.6 Applications

Finite impulse response (FIR) filters using distributed arithmetic (DA) are widely used to implement pulse-shaping filters, digital phase-locked loop (PLL) frequency synchronizers, discrete cosine transform (DCT) cores, and so forth, in hand-held applications where low power consumption is required.

1.7 Advantages

1. The size of each LUT is fixed.
2. The LUTs readily available in FPGAs can be utilized efficiently.
3. Design performance increases.
4. Speed increases.

CHAPTER 2 FIR FILTER DESIGN

2.1 Finite impulse response filters

In signal processing there are many instances in which an input signal to a system contains extra unnecessary content or additional noise that can degrade the quality of the desired portion. In such cases we may remove, or filter out, the useless samples. For example, in the case of the telephone system, there is no reason to transmit very high frequencies, since most speech falls within the band of 400 to 3,400 Hz. Therefore, in this case, all frequencies above and below that band are filtered out. The frequency band between 400 and 3,400 Hz, which isn't filtered out, is known as the pass band, and the frequency band that is blocked is known as the stop band.

FIR (Finite Impulse Response) filters are one of the primary types of filters used in digital signal processing. FIR filters are said to be finite because they have no feedback: if you send an impulse (a single spike) through the system, the output invariably becomes zero once the impulse has run through the filter. A finite impulse response filter is a filter structure that can be used to implement almost any sort of frequency response digitally, and it is usually built using a series of delays, multipliers, and adders to create the filter's output. Figure 2.1 below shows the basic block diagram for an FIR filter of length N. The delays operate on prior input samples, and the h_k values are the coefficients used for multiplication, so that the output at time n is the sum of all the delayed samples multiplied by the appropriate coefficients. The difference equation that defines the output of an FIR filter in terms of its input is:

y[n] = b_0 x[n] + b_1 x[n−1] + … + b_N x[n−N] = Σ_{i=0}^{N} b_i x[n−i]

where x[n] is the input signal, y[n] is the output signal, b_i are the filter coefficients, and N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side, commonly referred to as taps.

Figure 2.1: The logical structure of an FIR filter

This equation can also be expressed as a convolution of the coefficient sequence with the input signal:

y[n] = Σ_{k=0}^{K−1} h_k x[n−k]
That is, the filter output is a weighted sum of the current and a finite number of previous values of the input.

The process of selecting the filter's length and coefficients is called filter design. The goal is to set those parameters such that certain desired stop-band and pass-band characteristics result from running the filter. Most engineers use a program such as MATLAB for filter design, but whatever tool is used, the results of the design effort should be the same: a frequency response plot verifying that the filter meets the desired specifications (including ripple and transition bandwidth), together with the filter's length and coefficients. The longer the filter (the more taps), the more finely the response can be tuned.

2.1.1 Impulse response

The impulse response h[n] can be calculated if we set

x[n] = δ[n] in the above relation, where δ[n] is the Kronecker delta impulse. The impulse response of an FIR filter is then simply the set of coefficients b_n, for n = 0 to N, as follows:

h[n] = Σ_{i=0}^{N} b_i δ[n−i] = b_n
The Z-transform of the impulse response yields the transfer function of the FIR filter

H ( z ) = Z {h[n]}

H(n)= h[n] z n
n= -

=
n= -

bn z

-n

FIR filters are clearly bounded-input bounded-output (BIBO) stable, since the output is a sum of a finite number of finite multiples of the input values, and so can be no greater than Σ|b_i| times the largest value appearing in the input.
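The identity h[n] = b_n can be checked numerically by driving a filter with a unit impulse. The following is a small self-contained sketch (the direct-form sum is re-implemented inline, and the coefficients are arbitrary examples):

```python
def fir(b, x):
    # Direct form: y[n] = sum_i b[i] * x[n-i], zero initial conditions
    return [sum(bi * x[n - i] for i, bi in enumerate(b) if n - i >= 0)
            for n in range(len(x))]

b = [0.5, 1.0, 0.25]          # arbitrary example coefficients
impulse = [1, 0, 0, 0, 0]     # Kronecker delta
h = fir(b, impulse)
# The impulse response reproduces the coefficients, then stays at zero,
# which is also why the output is trivially bounded (BIBO stability).
assert h[:3] == b and all(v == 0 for v in h[3:])
```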

2.2 Properties

An FIR filter has a number of useful properties which sometimes make it preferable to an infinite impulse response (IIR) filter. FIR filters:

1. Are inherently stable, since all the poles are located at the origin and thus lie within the unit circle.

2. Require no feedback. This means that any rounding errors are not compounded by summed iterations; the same relative error occurs in each calculation. This also makes implementation simpler.

3. Can be designed to be linear phase, which means the phase change is proportional to the frequency. This is usually desired for phase-sensitive applications, for example crossover filters and mastering, where transparent filtering is required.

The main disadvantage of FIR filters is that considerably more computation power is required compared with a similar IIR filter. This is especially true when low frequencies (relative to the sample rate) are to be affected by the filter.


2.3 Approach to designing an FIR filter

Filters are signal conditioners. Each functions by accepting an input signal, blocking prespecified frequency components, and passing the original signal minus those components to the output. For example, a typical phone line acts as a filter that limits frequencies to a range considerably smaller than the range of frequencies human beings can hear. That's why listening to CD-quality music over the phone is not as pleasing to the ear as listening to it directly. A digital filter takes a digital input, gives a digital output, and consists of digital components. In a typical digital filtering application, software running on a digital signal processor (DSP) reads input samples from an A/D converter, performs the mathematical manipulations dictated by theory for the required filter type, and outputs the result via a D/A converter. An analog filter, by contrast, operates directly on the analog inputs and is built entirely with analog components, such as resistors, capacitors, and inductors. There are many filter types, but the most common are low pass, high pass, band pass, and band stop. A low pass filter allows only low-frequency signals (below some specified cutoff) through to its output, so it can be used to eliminate high frequencies. A low pass filter is handy, in that regard, for limiting the uppermost range of frequencies in an audio signal; it's the type of filter that a phone line resembles. A high pass filter does just the opposite, by rejecting only frequency components below some threshold. An example high pass application is cutting out the audible 60 Hz AC power "hum", which can be picked up as noise accompanying almost any signal in the U.S. The designer of a cell phone or any other sort of wireless transmitter would typically place an analog band pass filter in its output RF stage, to ensure that only output signals within its narrow, government-authorized range of the frequency spectrum are transmitted.


Engineers can use band stop filters, which pass both low and high frequencies, to block a predefined range of frequencies in the middle.

2.4 Filter design

To design a filter means to select the coefficients such that the system has specific characteristics. The required characteristics are stated in filter specifications, which most of the time refer to the frequency response of the filter. There are different methods to find the coefficients from frequency specifications:

1. Window design method
2. Frequency sampling method

The Remez exchange algorithm is commonly used to find an optimal equiripple set of coefficients. Here the user specifies a desired frequency response, a weighting function for errors from this response, and a filter order N. The algorithm then finds the set of (N + 1) coefficients that minimizes the maximum deviation from the ideal. Intuitively, this finds the filter that is as close as possible to the desired response given that only (N + 1) coefficients can be used. This method is particularly easy in practice, since at least one text includes a program that takes the desired filter and N and returns the optimum coefficients. Software packages like MATLAB, GNU Octave, Scilab, and SciPy provide convenient ways to apply these different methods. Some of the time, the filter specifications refer to the time-domain shape of the input signal the filter is expected to "recognize". The optimum matched filter is obtained by sampling that shape and using those samples directly as the coefficients of the filter, giving the filter an impulse response that is the time-reverse of the expected input signal.
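As an illustration of the window design method mentioned above, the sketch below builds a low-pass FIR by multiplying the ideal sinc impulse response by a Hamming window. The function name, cutoff, and tap count are illustrative assumptions, not values from the thesis:

```python
import math

def lowpass_window_design(num_taps, cutoff):
    """Window-method low-pass FIR design.

    cutoff is the normalised cutoff frequency (fraction of the
    sampling rate, 0 < cutoff < 0.5). Returns num_taps coefficients:
    the ideal (infinite) sinc impulse response, shifted to be causal,
    truncated, and multiplied by a Hamming window.
    """
    m = num_taps - 1
    taps = []
    for n in range(num_taps):
        k = n - m / 2
        # Ideal low-pass impulse response sample: 2*fc*sinc(2*fc*k)
        if k == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / m)  # Hamming
        taps.append(ideal * window)
    return taps

b = lowpass_window_design(21, 0.2)
# The coefficients are symmetric, so the filter is linear phase
assert all(abs(b[i] - b[len(b) - 1 - i]) < 1e-12 for i in range(len(b)))
```

The symmetry check at the end verifies the linear-phase property discussed in section 1.3.1; the DC gain (sum of the taps) comes out close to 1, as expected for a low-pass filter.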


CHAPTER 3 FPGA ARCHITECTURES


3.1 Introduction

A Field-Programmable Gate Array (FPGA) is a semiconductor device that can be configured by the customer or designer after manufacturing, hence the name "field-programmable". FPGAs are programmed using a logic circuit diagram or source code in a hardware description language (HDL) to specify how the chip will work. They can be used to implement any logical function that an application-specific integrated circuit (ASIC) could perform, but the ability to update the functionality after shipping offers advantages for many applications. FPGAs contain programmable logic components called "logic blocks", and a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together", somewhat like a one-chip programmable breadboard. Logic blocks can be configured to perform complex combinational functions, or merely simple logic gates like AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory.

3.1.1 History

The FPGA industry sprouted from programmable read-only memory (PROM) and programmable logic devices (PLDs). PROMs and PLDs both had the option of being programmed in batches in a factory or in the field (field-programmable); however, the programmable logic was hard-wired between logic gates. Xilinx co-founders Ross Freeman and Bernard Vonderschmitt invented the first commercially viable field-programmable gate array in 1985: the XC2064. The XC2064 had programmable gates and programmable interconnects between gates, the beginnings of a new technology and market. It boasted a mere 64 configurable logic blocks (CLBs), each with two 3-input lookup tables (LUTs). More than 20 years later, Freeman was entered into the National Inventors Hall of Fame for his invention.


Some of the industry's foundational concepts and technologies for programmable logic arrays, gates, and logic blocks are founded in patents awarded to David W. Page and LuVerne R. Peterson in 1985. In the late 1980s the Naval Surface Warfare Department funded an experiment proposed by Steve Casselman to develop a computer that would implement 600,000 reprogrammable gates. Casselman was successful, and the system was awarded a patent in 1992. Xilinx continued unchallenged, growing quickly, from 1985 to the mid-1990s, when competitors sprouted up, eroding significant market share. By 1993, Actel was serving about 18 percent of the market. The 1990s were an explosive period for FPGAs, both in sophistication and in volume of production. In the early 1990s, FPGAs were primarily used in telecommunications and networking. By the end of the decade, FPGAs had found their way into consumer, automotive, and industrial applications. FPGAs got a glimpse of fame in 1997, when Adrian Thompson merged genetic algorithm technology and FPGAs to create a sound recognition device: Thompson's algorithm allowed an array of 64 x 64 cells in a Xilinx FPGA chip to decide the configuration needed to accomplish the sound recognition task.

3.2 Modern developments

A recent trend has been to take the coarse-grained architectural approach a step further by combining the logic blocks and interconnects of traditional FPGAs with embedded microprocessors and related peripherals to form a complete "system on a programmable chip". This work mirrors the architecture of Ron Perlof and Hana Potash of Burroughs Advanced Systems Group, which combined a reconfigurable CPU architecture on a single chip called the SB24; that work was done in 1982. Examples of such hybrid


technologies can be found in the Xilinx Virtex-II PRO and Virtex-4 devices, which include one or more PowerPC processors embedded within the FPGA's logic fabric. The Atmel FPSLIC is another such device, which uses an AVR processor in combination with Atmel's programmable logic architecture. An alternate approach to using hard-macro processors is to make use of "soft" processor cores that are implemented within the FPGA logic (see "Soft processors" below). As previously mentioned, many modern FPGAs can be reprogrammed at "run time", and this is leading to the idea of reconfigurable computing or reconfigurable systems: CPUs that reconfigure themselves to suit the task at hand. The Mitrion Virtual Processor from Mitrionics is an example of a reconfigurable soft processor implemented on FPGAs; however, it does not support dynamic reconfiguration at runtime, but instead adapts itself to a specific program. Additionally, new non-FPGA architectures are beginning to emerge. Software-configurable microprocessors such as the Stretch S5000 adopt a hybrid approach by providing an array of processor cores and FPGA-like programmable cores on the same chip.

3.2.1 Gates

1. 1987: 9,000 gates, Xilinx
2. 1992: 600,000 reprogrammable gates, Naval Surface Warfare Department
3. Early 2000s: millions

3.2.2 Market size

1. 1985: first commercial FPGA technology invented by Xilinx
2. 1987: $14 million
3. ~1993: >$385 million
4. 2005: $1.9 billion
5. 2010 estimates: $2.75 billion

3.3 FPGA Comparisons

Historically, FPGAs have been slower, less energy-efficient, and generally achieved less functionality than their fixed ASIC counterparts. A combination of volume, fabrication improvements, research and development, and the I/O capabilities of new supercomputers have largely closed the performance gap between ASICs and FPGAs. Advantages of FPGAs include a shorter time to market, the ability to re-program in the field to fix bugs, and lower non-recurring engineering costs. Vendors can also take a middle road by developing their hardware on ordinary FPGAs, but manufacturing the final version so it can no longer be modified after the design has been committed. Xilinx claims that several market and technology dynamics are changing the ASIC/FPGA paradigm:

1. IC costs are rising aggressively
2. ASIC complexity has bolstered development time and costs
3. R&D resources and headcount are decreasing
4. Revenue losses for slow time-to-market are increasing
5. Financial constraints in a poor economy are driving low-cost technologies

These trends make FPGAs a better alternative to ASICs for a growing number of higher-volume applications than they have historically been used for, which the company credits for the growing number of FPGA design starts (see History). The primary differences between CPLDs and FPGAs are architectural. A CPLD has a somewhat restrictive structure consisting of one or more programmable sum-of-products


logic arrays feeding a relatively small number of clocked registers. The result is less flexibility, with the advantage of more predictable timing delays and a higher logic-to-interconnect ratio. FPGA architectures, on the other hand, are dominated by interconnect. This makes them far more flexible (in terms of the range of designs that are practical for implementation within them) but also far more complex to design for. Another notable difference between CPLDs and FPGAs is the presence in most FPGAs of higher-level embedded functions (such as adders and multipliers) and embedded memories, as well as logic blocks that can implement decoders or mathematical functions. Some FPGAs have the capability of partial reconfiguration, which lets one portion of the device be reprogrammed while other portions continue running.

3.4 Applications

Applications of FPGAs include digital signal processing, software-defined radio, aerospace and defense systems, ASIC prototyping, medical imaging, computer vision, speech recognition, cryptography, bioinformatics, computer hardware emulation, radio astronomy, and a growing range of other areas. FPGAs originally began as competitors to CPLDs and competed in a similar space, that of glue logic for PCBs. As their size, capabilities, and speed increased, they began to take over larger and larger functions, to the point where some are now marketed as full systems on chips (SoCs). Particularly with the introduction of dedicated multipliers into FPGA architectures in the late 1990s, applications which had traditionally been the sole reserve of DSPs began to incorporate FPGAs instead. FPGAs especially find applications in any area or algorithm that can make use of the massive parallelism offered by their architecture. One such area is code breaking, in particular brute-force attacks on cryptographic algorithms.


FPGAs are increasingly used in conventional high-performance computing applications where computational kernels such as FFT or convolution are performed on the FPGA instead of a microprocessor. The inherent parallelism of the logic resources on an FPGA allows for considerable computational throughput even at low clock rates. The flexibility of the FPGA allows for even higher performance by trading off precision and range in the number format for an increased number of parallel arithmetic units. This has driven a new type of processing called reconfigurable computing, where time-intensive tasks are offloaded from software to FPGAs. The adoption of FPGAs in high-performance computing is currently limited by the complexity of FPGA design compared to conventional software and by the extremely long turn-around times of current design tools, where a wait of 4 to 8 hours is necessary after even minor changes to the source code. Traditionally, FPGAs have been reserved for specific vertical applications where the volume of production is small. For these low-volume applications, the premium that companies pay in hardware costs per unit for a programmable chip is more affordable than the development resources spent on creating an ASIC for a low-volume application. Today, new cost and performance dynamics have broadened the range of viable applications.

3.5 Architectures

The most common FPGA architecture consists of an array of configurable logic blocks (CLBs), I/O pads, and routing channels. Generally, all the routing channels have the same width (number of wires). Multiple I/O pads may fit into the height of one row or the width of one column in the array.


An application circuit must be mapped into an FPGA with adequate resources. While the number of CLBs and I/Os required is easily determined from the design, the number of routing tracks needed may vary considerably even among designs with the same amount of logic. Since unused routing tracks increase the cost (and decrease the performance) of the part without providing any benefit, FPGA manufacturers try to provide just enough tracks so that most designs that fit in terms of LUTs and I/Os can be routed. This is determined by estimates such as those derived from Rent's rule or by experiments with existing designs. The FPGA considered here is an array-style or island-style FPGA, consisting of an array of logic blocks and routing channels. Two I/O pads fit into the height of one row or the width of one column, as shown below. All the routing channels have the same width (number of wires).

3.5.1 FPGA structures

A classic FPGA logic block consists of a 4-input lookup table (LUT), and a flip-flop, as shown below. In recent years, manufacturers have started moving to 6-input LUTs in their high performance parts, claiming increased performance.


Figure 3.1: Typical logic block

There is only one output, which can be either the registered or the unregistered LUT output. The logic block has four inputs for the LUT and a clock input. Since clock signals (and often other high-fanout signals) are normally routed via special-purpose dedicated routing networks in commercial FPGAs, these signals are managed separately from the general routing. For this example architecture, the locations of the FPGA logic block pins are shown below.
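The block just described (a 4-input LUT feeding an optional flip-flop) can be mimicked by a toy software model. This is purely illustrative and not tied to any vendor's architecture; the class and method names are assumptions:

```python
class LogicBlock:
    """Toy model of a classic FPGA logic block: a 4-input LUT feeding
    an optional flip-flop. The configuration is the 16-entry LUT truth
    table plus a flag selecting the registered or combinational output."""

    def __init__(self, truth_table, registered):
        assert len(truth_table) == 16     # 2^4 entries for 4 inputs
        self.truth_table = truth_table
        self.registered = registered
        self.ff = 0                       # flip-flop state

    def clock(self, a, b, c, d):
        """Apply inputs and one rising clock edge; return the output
        observed just before the edge (registered mode therefore shows
        a one-cycle delay)."""
        addr = a | (b << 1) | (c << 2) | (d << 3)
        lut_out = self.truth_table[addr]
        out = self.ff if self.registered else lut_out
        self.ff = lut_out                 # FF captures the LUT output
        return out

# Configure the LUT as a 4-input AND gate, combinational output
and4 = LogicBlock([1 if addr == 0b1111 else 0 for addr in range(16)],
                  registered=False)
assert and4.clock(1, 1, 1, 1) == 1
assert and4.clock(1, 0, 1, 1) == 0
```

Selecting `registered=True` routes the flip-flop to the output instead, adding the one-cycle latency that pipelined FPGA designs rely on.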

Figure 3.2: Logic block pin locations

Each input is accessible from one side of the logic block, while the output pin can connect to routing wires in both the channel to the right and the channel below the logic block. Each logic block output pin can connect to any of the wiring segments in the channels adjacent to it.


Similarly, an I/O pad can connect to any one of the wiring segments in the channel adjacent to it. For example, an I/O pad at the top of the chip can connect to any of the W wires (where W is the channel width) in the horizontal channel immediately below it. Generally, the FPGA routing is unsegmented. That is, each wiring segment spans only one logic block before it terminates in a switch box. By turning on some of the programmable switches within a switch box, longer paths can be constructed. For higher speed interconnect, some FPGA architectures use longer routing lines that span multiple logic blocks. Whenever a vertical and a horizontal channel intersect, there is a switch box. In this architecture, when a wire enters a switch box, there are three programmable switches that allow it to connect to three other wires in adjacent channel segments. The pattern, or topology, of switches used in this architecture is the planar or domain-based switch box topology. In this switch box topology, a wire in track number one connects only to wires in track number one in adjacent channel segments, wires in track number 2 connect only to other wires in track number 2 and so on. The figure below illustrates the connections in a switch box.

Figure 3.3: Switch box topology
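The planar (disjoint) topology described above can be captured in a few lines. The helper below is hypothetical, written only to make the track-matching rule concrete: a wire entering on one side at track t may connect, via the three programmable switches, to track t on each of the other three sides:

```python
def planar_switch_box(side, track):
    """Connections offered by a planar (disjoint) switch box to a wire
    entering on `side` at track number `track`: the same track number
    on each of the other three sides (three programmable switches),
    never a different track."""
    sides = {"north", "south", "east", "west"}
    assert side in sides
    return sorted((s, track) for s in sides - {side})

# A wire entering from the west on track 1 can only reach track 1
assert planar_switch_box("west", 1) == [("east", 1), ("north", 1),
                                        ("south", 1)]
```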


Modern FPGA families expand upon the above capabilities to include higher-level functionality fixed into the silicon. Having these common functions embedded in the silicon reduces the area required and gives those functions increased speed compared to building them from primitives. Examples include multipliers, generic DSP blocks, embedded processors, high-speed I/O logic, and embedded memories. FPGAs are also widely used for systems validation, including pre-silicon validation, post-silicon validation, and firmware development. This allows chip companies to validate their design before the chip is produced in the factory, reducing the time to market.

3.6 FPGA Design and Programming


To define the behavior of the FPGA, the user provides a hardware description language (HDL) or a schematic design. The HDL form might be easier to work with when handling large structures because it's possible to just specify them numerically rather than having to draw every piece by hand. On the other hand, schematic entry can allow for easier visualisation of a design. Then, using an electronic design automation tool, a technology-mapped netlist is generated. The netlist can then be fitted to the actual FPGA architecture using a process called place-and-route, usually performed by the FPGA company's proprietary place-and-route software. The user will validate the map, place and route results via timing analysis, simulation, and other verification methodologies. Once the design and validation process is complete, the binary file generated (also using the FPGA company's proprietary software) is used to (re)configure the FPGA. Going from schematic/HDL source files to actual configuration: the source files are fed to a software suite from the FPGA/CPLD vendor that, through different steps, will


produce a file. This file is then transferred to the FPGA/CPLD via a serial interface (JTAG) or to an external memory device like an EEPROM. The most common HDLs are VHDL and Verilog, although in an attempt to reduce the complexity of designing in HDLs, which have been compared to the equivalent of assembly languages, there are moves to raise the abstraction level through the introduction of alternative languages. To simplify the design of complex systems in FPGAs, there exist libraries of predefined complex functions and circuits that have been tested and optimized to speed up the design process. These predefined circuits are commonly called IP cores, and are available from FPGA vendors and third-party IP suppliers (rarely free, and typically released under proprietary licenses). Other predefined circuits are available from developer communities such as OpenCores (typically free, and released under the GPL, BSD or similar license), and other sources. In a typical design flow, an FPGA application developer will simulate the design at multiple stages throughout the design process. Initially the RTL description in VHDL or Verilog is simulated by creating test benches to simulate the system and observe results. Then, after the synthesis engine has mapped the design to a netlist, the netlist is translated to a gate level description where simulation is repeated to confirm the synthesis proceeded without errors. Finally the design is laid out in the FPGA, at which point propagation delays can be added and the simulation run again with these values back-annotated onto the netlist.

3.6.1 Basic Process Technology Types
1. SRAM - Based on static memory technology. In-system programmable and reprogrammable. Requires external boot devices. CMOS.
2. Antifuse - One-time programmable. CMOS.
3. EPROM - Erasable Programmable Read-Only Memory technology. Usually one-time programmable in production because of plastic packaging. Windowed devices can be erased with ultraviolet (UV) light. CMOS.
4. EEPROM - Electrically Erasable Programmable Read-Only Memory technology. Can be erased, even in plastic packages. Some, but not all, EEPROM devices can be in-system programmed. CMOS.
5. Flash - Flash-erase EPROM technology. Can be erased, even in plastic packages. Some, but not all, flash devices can be in-system programmed. Usually, a flash cell is smaller than an equivalent EEPROM cell and is therefore less expensive to manufacture. CMOS.
6. Fuse - One-time programmable. Bipolar.

3.6.2 Major Manufacturers

Xilinx and Altera are the current FPGA market leaders and long-time industry rivals. Together, they control over 80 percent of the market, with Xilinx alone representing over 50 percent. Xilinx also provides free Windows and Linux design software, while Altera provides free Windows tools; the Solaris and Linux tools are only available via a rental scheme. Other competitors include Lattice Semiconductor (flash, SRAM), Actel (antifuse, flash-based, mixed-signal), SiliconBlue Technologies (low power), Achronix (RAM based, 1.5 GHz fabric speed), and QuickLogic (handheld-focused CSSPs rather than general-purpose FPGAs).

3.7 FPGA prototype


FPGA prototyping, sometimes also referred to as ASIC prototyping or SoC prototyping, is the method of prototyping SoC and ASIC designs on FPGAs for hardware verification and early software development. Prototyping SoC and ASIC designs on FPGA has become a mainstream method of verifying hardware designs and co-developing software and firmware early.

3.7.1 Need of Prototyping

1. Running a SoC design on an FPGA prototype is a reliable way to ensure that it is functionally correct, compared to relying only on software simulations to verify that the hardware design is sound; simulation speed and modeling accuracy limitations hinder that approach.
2. Due to time constraints, many projects cannot wait until the silicon is back from the foundry to start software tests. FPGA prototyping allows much more time for software development and testing at the software-hardware integration stage, exposing many unforeseen software bugs that appear due to today's array of operating systems, applications, and hardware.
3. Prototyping also allows the developer to ensure that all IP blocks in the system work well together beyond the simulation stage, in actual form.
4. Prototypes also serve as demo platforms for SoC clients, bringing in interest early. This speeds up the overall development cycle and allows for more enhancement or improvement to the chip features than would otherwise be possible.

3.8 Complex Programmable Logic Devices (CPLDs)

3.8.1 Introduction


Complex Programmable Logic Devices (CPLDs) are exactly what they claim to be. Essentially, they are designed to appear just like a large number of PALs in a single chip, connected to each other through a crosspoint switch. They use the same development tools and programmers, and are based on the same technologies, but they can handle much more complex logic, and more of it. While each manufacturer has a different variation of the internal architecture, in general they are all similar in that they consist of function blocks, an input/output block, and an interconnect matrix. The devices are programmed using programmable elements that, depending on the technology of the manufacturer, can be EPROM cells, EEPROM cells, or Flash EPROM cells.

3.8.2 Comparison between FPGAs and CPLDs

1. An FPGA needs a boot ROM while a CPLD does not; in some systems there may not be enough time to boot up the FPGA, so a CPLD is used alongside it (CPLD + FPGA).
2. Check the HDL coding style application notes from your CPLD vendor (Altera, Xilinx, etc.); otherwise the software should take care of it.
3. A CPLD is more efficient than an FPGA, but since a CPLD is much costlier than an FPGA there is a tradeoff between cost and efficiency.
4. Code written for a CPLD can also run on an FPGA.
5. You can use code written for either device on the other, provided the device has the capacity for the design; in effect you are describing your hardware.
6. A CPLD is much lower in logic density than an FPGA.


7. When writing your code, in the vendor software (Quartus, ISE Foundation) you can choose or change your target device at the end; you can use a CPLD or an FPGA with no difference in the code.

CHAPTER 4 DISTRIBUTED ARITHMETIC FIR FILTER USING FPGA ARCHITECTURES


4.1 Background

The traditional implementation of the finite impulse response (FIR) filter equation


y[n] = Σ_{k=0}^{K-1} h[k] x[n-k]

typically employs K multiply-accumulate (MAC) units. Implementing multipliers using the logic fabric of the FPGA is costly due to logic complexity and area usage, especially when the filter size is large. Modern FPGAs have dedicated DSP blocks that alleviate this problem; however, for very large filter sizes the challenge of reducing area and complexity still remains. An alternative to computing the multiplication is to decompose the MAC operations into a series of lookup table (LUT) accesses and summations. This approach is termed distributed arithmetic (DA).

4.2 Need of Distributed Arithmetic

1. It reduces the logic needed to implement the MAC operator to only adders and shifters.
2. It can be implemented on LUT (lookup table) based FPGAs without wasting resources.
3. It has several implementations based on different trade-offs.
4. Most FIR filters have constant coefficients, so using the DA algorithm they can be implemented with very compact custom circuits.
5. The advantage of a distributed arithmetic approach is its efficiency of mechanization.

4.3 Distributed Arithmetic FIR Filter Architecture

Distributed arithmetic is one of the most well-known methods of implementing FIR filters. DA solves the computation of the inner product equation when the coefficients are known in advance, as happens in FIR filters.


An FIR filter of length K is described as:

y = Σ_{k=0}^{K-1} h[k] x[n-k]    ... (1)

where h[k] are the filter coefficients and x[n] is the input data. For the convenience of analysis, the substitution x_k = x[n-k] is used to rewrite equation (1) as:

y = Σ_{k=0}^{K-1} h_k x_k    ... (2)

In this equation, the h_k are the fixed coefficients, K is the number of filter taps and the x_k are the input data words. These have a standard fixed-point format: a two's-complement fractional representation with x_k limited to the range -1 ≤ x_k < 1,

x_k = -x_{k0} + Σ_{n=1}^{N-1} x_{kn} 2^{-n}    ... (3)

where N is the number of bits of the data, x_{k0} is the sign bit and x_{kn} is bit n of x_k. Substituting (3) into (2), equation (2) can be rewritten as:

y = Σ_{k=0}^{K-1} h_k ( -x_{k0} + Σ_{n=1}^{N-1} x_{kn} 2^{-n} )

  = -Σ_{k=0}^{K-1} h_k x_{k0} + Σ_{n=1}^{N-1} ( Σ_{k=0}^{K-1} x_{kn} h_k ) 2^{-n}


The implementation of digital filters using this arithmetic is carried out with registers, memory resources and a scaling accumulator. The original LUT-based DA implementation of a 4-tap (K = 4) FIR filter is shown in Figure 4.1. The DA architecture includes three units: the shift register unit, the DA-LUT unit, and the adder/shifter unit.
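As a quick numerical check of this decomposition, the following Python sketch (illustrative only; the coefficient and input values are made up and are not from the thesis design) evaluates a 4-tap inner product both directly and bit-serially:

```python
# Behavioral sketch of the distributed-arithmetic decomposition:
# y = -sum_k h_k*x_k0 + sum_{n=1}^{N-1} (sum_k x_kn*h_k) * 2^-n
# Example values only; N = 8 bits per sample.

N = 8                                   # 1 sign bit + 7 fractional bits
h = [0.25, 0.5, -0.125, 0.375]          # example coefficients
x = [0.5, -0.25, 0.75, -0.5]            # example inputs in [-1, 1)

def to_bits(v, n):
    """Two's-complement fractional bits [sign, b1, ..., b_{n-1}] of v."""
    code = round(v * (1 << (n - 1))) & ((1 << n) - 1)
    return [(code >> (n - 1 - i)) & 1 for i in range(n)]

bits = [to_bits(v, N) for v in x]

# Bit-serial DA evaluation: one coefficient partial sum per bit position,
# with the sign-bit term subtracted.
y_da = -sum(hk * b[0] for hk, b in zip(h, bits))
for n in range(1, N):
    y_da += sum(hk * b[n] for hk, b in zip(h, bits)) * 2 ** -n

# Direct MAC evaluation on the quantized inputs for comparison.
xq = [round(v * (1 << (N - 1))) / (1 << (N - 1)) for v in x]
y_mac = sum(hk * xk for hk, xk in zip(h, xq))

assert abs(y_da - y_mac) < 1e-12
```

The two evaluations agree exactly, which is the identity the DA hardware exploits: the inner parenthesized sums are what the DA-LUT stores.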

4.4 System Block diagram


Figure 4.1: Original LUT-based DA implementation of a 4-tap FIR filter

4.5 Shift register


In digital circuits, a shift register is a cascade of flip-flops, sharing the same clock, in which the output of each flip-flop but the last is connected to the "data" input of the next one in the chain. The result is a circuit that shifts the one-dimensional "bit array" stored in it by one position, shifting in the data present at its input and shifting out the last bit in the array. Shift registers are a type of sequential logic circuit, mainly used for storage of digital data. They are a group of flip-flops connected in a chain so that the output from one flip-flop becomes the input of the next flip-flop. Most such registers possess no characteristic internal sequence of states. All flip-flops are driven by a common clock, and all are set or reset simultaneously.

Register: a set of n flip-flops. Each flip-flop stores one bit. It serves two basic functions: data storage and data movement.

Shift Register: a register that allows each of the flip-flops to pass the stored information to its adjacent neighbor.

Storage Capacity: the total number of bits (1 or 0) of digital data a register can retain. Each stage (flip-flop) in a shift register represents one bit of storage capacity; therefore the number of stages in a register determines its storage capacity.


Figure 4.2: Flip-flop

The serial in/serial out shift register accepts data serially, that is, one bit at a time on a single line. It produces the stored information on its output also in serial form.

Figure 4.3: Four-bit shift register

A basic four-bit shift register can be constructed using four D flip-flops, as shown in Figure 4.3. The operation of the circuit is as follows.

1. The register is first cleared, forcing all four outputs to zero.
2. The input data is then applied sequentially to the D input of the first flip-flop on the left (FF0).
3. During each clock pulse, one bit is transmitted from left to right.


Figure 4.4: Serial-in, serial-out shift register

Above we show a block diagram of a serial-in/serial-out shift register, which is 4 stages long. Data at the input will be delayed by four clock periods from the input to the output of the shift register. Data at "data in", above, will be present at the stage A output after the first clock pulse. After the second pulse, stage A data is transferred to the stage B output, and "data in" is transferred to the stage A output. After the third clock, stage C is replaced by stage B, stage B is replaced by stage A, and stage A is replaced by "data in". After the fourth clock, the data originally present at "data in" is at stage D, the "output". The "first in" data is "first out" as it is shifted from "data in" to "data out".

For a K-tap FIR filter, the present input and the past K-1 inputs must be available. For this FIR filter the input is supplied serially, bit-wise. The shift register block contains K shift registers which hold the K inputs needed by the FIR filter. On every clock, one bit of the next input reaches the shift register module, so to accommodate the new bit the shift register must be shifted right. Thus after every B clock cycles, where B is the number of bits in each input sample, the new input is stored in x[n] and the old x[n] is moved to x[n-1], x[n-1] to x[n-2], and so on.
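The four-clock delay described above can be sketched as a small behavioral model. This is an illustrative Python sketch, not the VHDL implementation; the class and variable names are invented:

```python
# Minimal behavioral model of a 4-stage serial-in/serial-out shift register.

class SerialShiftRegister:
    def __init__(self, stages=4):
        self.stages = [0] * stages          # stage A..D outputs, all cleared

    def clock(self, data_in):
        """One clock pulse: shift every stage one position to the right."""
        data_out = self.stages[-1]          # last stage drives the output
        self.stages = [data_in] + self.stages[:-1]
        return data_out

sr = SerialShiftRegister()
stream = [1, 0, 1, 1]                       # bits presented at "data in"
outputs = [sr.clock(b) for b in stream]

# Data is delayed by four clock periods: the register was cleared, so the
# first four output bits are zero, and after four pulses the first input
# bit has reached stage D.
assert outputs == [0, 0, 0, 0]
assert sr.stages == [1, 1, 0, 1]            # stage A holds the newest bit
```

Clocking four more times with zeros at the input emits the original stream, 1, 0, 1, 1, in order.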


4.6 Look up table We know that it is possible to store binary data within solid-state devices. Those storage "cells" within solid-state memory devices are easily addressed by driving the "address" lines of the device with the proper binary value(s). Suppose we had a ROM memory circuit written, or programmed, with certain data, such that the address lines of the ROM served as inputs and the data lines of the ROM served as outputs, generating the characteristic response of a particular logic function. Theoretically, we could program this ROM chip to emulate whatever logic function we wanted without having to alter any wire connections or gates. Consider the following example of a 4 x 2 bit ROM memory (a very small memory!) programmed with the functionality of a half adder:

Figure 4.5: Logic diagram of 4x2 ROM

Table 4.1 Truth table of Half Adder


If this ROM has been written with the above data (representing a half-adder's truth table), driving the A and B address inputs will cause the respective memory cells in the ROM chip to be enabled, thus outputting the corresponding data as the Sum and Cout bits. Unlike the half-adder circuit built of gates or relays, this device can be set up to perform any logic function at all with two inputs and two outputs, not just the half-adder function. To change the logic function, all we would need to do is write a different table of data to another ROM chip. We could even use an EPROM chip which could be re-written at will, giving the ultimate flexibility in function. It is vitally important to recognize the significance of this principle as applied to digital circuitry. Whereas the half-adder built from gates or relays processes the input bits to arrive at a specific output, the ROM simply remembers what the outputs should be for any given combination of inputs. This is not much different from the "times tables" memorized in grade school: rather than having to calculate the product of 5 times 6 (5 + 5 + 5 + 5 + 5 + 5 = 30), school-children are taught to remember that 5 x 6 = 30, and then expected to recall this product from memory as needed. Likewise, rather than the logic function depending on the functional arrangement of hard-wired gates or relays (hardware), it depends solely on the data written into the memory (software). Such a simple application, with definite outputs for every input, is called a look-up table, because the memory device simply "looks up" what the output(s) should be for any given combination of input states. This application of a memory device to perform logical functions is significant for several reasons:
1. Software is much easier to change than hardware.
2. Software can be archived on various kinds of memory media (disk, tape), thus providing an easy way to document and manipulate the function in a "virtual" form; hardware can only be "archived" abstractly in the form of some kind of graphical drawing.
3. Software can be copied from one memory device (such as the EPROM chip) to another, allowing the ability for one device to "learn" its function from another device.
4. Software such as the logic function example can be designed to perform functions that would be extremely difficult to emulate with discrete logic gates (or relays!).
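The ROM-as-logic idea is easy to sketch in software: the half-adder outputs are remembered rather than computed. The following is an illustrative Python model, not part of the thesis design; the stored words follow the half-adder truth table (Table 4.1):

```python
# A 4 x 2 bit ROM viewed as a lookup table: the 2-bit address {A, B}
# selects a stored 2-bit word {Sum, Cout}.

rom = {
    (0, 0): (0, 0),   # 0 + 0 = 0, carry 0
    (0, 1): (1, 0),   # 0 + 1 = 1, carry 0
    (1, 0): (1, 0),   # 1 + 0 = 1, carry 0
    (1, 1): (0, 1),   # 1 + 1 = 0, carry 1
}

def half_adder(a, b):
    """'Look up' the outputs instead of computing them with gates."""
    return rom[(a, b)]

# The stored table reproduces the gate-level behavior exactly.
for a in (0, 1):
    for b in (0, 1):
        s, cout = half_adder(a, b)
        assert s == a ^ b and cout == a & b
```

Changing the logic function amounts to writing different words into `rom`, which is exactly the flexibility the text describes.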

The usefulness of a look-up table becomes more and more evident with increasing complexity of function. Suppose we wanted to build a 4-bit adder circuit using a ROM. We'd require a ROM with 8 address lines (two 4-bit numbers to be added together), plus 4 data lines (for the signed output). In our design we use a look-up table to store all the possible combination sums of the filter coefficients. This look-up table is implemented as a ROM. In the reset state the coefficient sums are stored. The input for this LUT comes from the output of the shift register module. On every clock cycle the LSB bits of all K input samples are applied to the LUT. The LUT treats this input as an address and presents the data stored at that particular address on its output. The main advantage of the LUT implementation is that we can avoid the multiplications.

4.7 Adder/Subtractor

In digital circuits, an adder-subtractor is a circuit that is capable of adding or subtracting numbers (in particular, binary). Below is a circuit that adds or subtracts depending on a control signal. However, it is possible to construct a circuit that performs both addition and subtraction at the same time. To get the FIR filter response we need to add the LUT outputs. The present LUT output is added to the shifted (1 bit right) version of the previous LUT output. Finally, at the Bth clock cycle (B is the number of bits in each sample) we have to perform a subtraction. To perform addition or subtraction we implemented the adder/subtractor block, which performs either of them depending on the input select signal.
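The DA-LUT contents can be sketched as follows. This is an illustrative Python model with placeholder coefficients, not the thesis's actual filter:

```python
# Precompute the DA lookup table for a K-tap filter: the entry at address
# (b3 b2 b1 b0) holds b0*h0 + b1*h1 + b2*h2 + b3*h3.

K = 4
h = [1, 3, 5, 7]                 # example integer coefficients

lut = []
for addr in range(2 ** K):       # 2^K = 16 words for a 4-tap filter
    total = sum(h[k] for k in range(K) if (addr >> k) & 1)
    lut.append(total)

# Address 0b0000 selects no coefficients; 0b1111 selects them all.
assert lut[0b0000] == 0
assert lut[0b1111] == sum(h)
assert lut[0b0101] == h[0] + h[2]   # bits 0 and 2 set -> h0 + h2
```

The exponential growth mentioned earlier is visible here: each extra tap doubles the table, which is why the partitioned-memory and offset-binary-coding reductions matter for large K.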


4.8 Right Shifter

A right shifter shifts its data one bit toward the least significant position on each clock; it can be built from the same chain-of-flip-flops structure as the shift register described in Section 4.5. To realize the x/2 term in the FIR filter response we need to divide by 2, but to avoid multiplications we realize it through a 1-bit right-shift circuit.
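Putting the four blocks together (shift registers forming the LUT address, the coefficient-sum LUT, the adder/subtractor that subtracts on the sign-bit cycle, and the scaling shift), one output sample can be modeled behaviorally as below. This is an illustrative Python sketch with invented values; it weights partial sums by 2^n, which for integer inputs is mathematically equivalent to the fractional scaling accumulator of the architecture:

```python
# End-to-end behavioral sketch of one DA output sample for a 4-tap filter
# with B-bit two's-complement integer inputs. Values are examples only.

K, B = 4, 8
h = [3, -1, 4, 2]                # example coefficients
x = [17, -9, 30, -5]             # x[n], x[n-1], x[n-2], x[n-3]

# DA-LUT unit: 2^K precomputed coefficient sums.
lut = [sum(h[k] for k in range(K) if (a >> k) & 1) for a in range(2 ** K)]

def bit(v, n):
    """Bit n of v in B-bit two's complement (bit B-1 is the sign bit)."""
    return (v & ((1 << B) - 1)) >> n & 1

# Adder/shifter unit: process one bit-plane per clock, LSB first. Each new
# LUT word is weighted one bit higher (the 2^n counterpart of the 1-bit
# right shift); the final, sign-bit cycle is subtracted instead of added.
acc = 0
for n in range(B):
    addr = 0
    for k in range(K):           # shift-register unit supplies bit n of each x
        addr |= bit(x[k], n) << k
    term = lut[addr] << n
    acc += -term if n == B - 1 else term

# The bit-serial result matches the direct multiply-accumulate sum.
assert acc == sum(h[k] * x[k] for k in range(K))
```

After B clock cycles the accumulator holds the full inner product, with no multiplier anywhere in the datapath, which is the point of the architecture.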

4.9 Advantages

1. No multiplier units, so complexity is reduced
2. Low power consumption
3. Suitable for FPGA implementation


CHAPTER 5

RESULTS AND CONCLUSIONS


5.1 Shift Register

Algorithm:
1. Start.
2. Declare the port list.
3. If reset = 1 then there is no shift in the input samples, else go to step 4.
4. If clock = 0 to 1 (rising edge), each input sample is shifted right with the data in; then go to step 5.
5. The output is stored. Repeat step 3.
6. Stop.

Output


Fig 5.1: Result of shift register

5.2 Look-Up-Table

Algorithm:
1. Start.
2. Declare the port list.
3. If reset = 1, then the data is assigned with respect to the address, else go to step 4.
4. If clock = 0 to 1 (rising edge), go to step 5.
5. If read = 1, then the address is converted to an integer and the data at that address is read.
6. The output is stored. Repeat step 3.

7. Stop.
Output

Fig 5.2 Result of look up table


5.3 Adder/Subtractor

Algorithm:
1. Start.
2. Declare the port list.
3. If select = 0, then the data are added, else they are subtracted.
4. The output is stored. Repeat step 3.

5. Stop.
Output

Fig 5.3 Result of adder/subtractor


5.4 Right Shifter

Algorithm:
1. Start.
2. Declare the port list.
3. If reset = 1 then there is no shift in the input samples, else go to step 4.
4. If clock = 0 to 1 (rising edge), the data is shifted right with 0; then go to step 5.
5. The output is stored. Repeat step 3.
6. Stop.

Output

Fig 5.4 Result of right shifter


5.5 Four-Tap DA FIR Filter

Algorithm:
1. Start.
2. Declare the port list.
3. Declare the component of the shift register.
4. Declare the component of the look-up-table.
5. Declare the component of the adder/subtractor.
6. Declare the component of the right shift register.
7. Map the ports of each component mentioned above.
8. Stop.

Output

Fig.5.5 Result of 4-TAP DA FIR filter


For a 4-tap FIR filter the present input and the past 3 inputs must be available. For this FIR filter the input is supplied serially, bit-wise. The shift register block contains four shift registers which hold the 4 inputs needed by the FIR filter. On every clock, one bit of the next input reaches the shift register module, so to accommodate the new bit the shift register must be shifted right. Thus after every 4 clock cycles the new input is stored in x[n] and the old x[n] is moved to x[n-1], x[n-1] to x[n-2], and so on.

We use a lookup table to store all the possible combination sums of the filter coefficients. This look-up table is implemented as a ROM. In the reset state the coefficient sums are stored. The input for this LUT comes from the output of the shift register module. On every clock cycle the LSB bits of all 4 input samples are applied to the LUT. The LUT treats this input as an address and presents the data stored at that particular address on its output. The main advantage of the LUT implementation is that we can avoid the multiplications.

To get the FIR filter response we need to add the LUT outputs. The present LUT output is added to the shifted (1 bit right) version of the previous LUT output. Finally, at the 4th clock cycle we have to perform a subtraction. To perform addition or subtraction we implemented the adder/subtractor block, which performs either of them depending on the input select signal. To realize the x/2 term in the FIR filter response we need to divide by 2, but to avoid multiplications we realized it through a 1-bit right-shift circuit.

5.5 Summary of the work

In this project we implemented four different architectures, analyzing their advantages and disadvantages.


In recent years, there has been a growing trend to implement digital signal processing functions in Field Programmable Gate Arrays (FPGAs). FIR (Finite Impulse Response) filters are one of the primary types of filters used in digital signal processing. Traditionally, direct implementation of a K-tap FIR filter requires K multiply-and-accumulate (MAC) blocks, which are expensive to implement in FPGA due to logic complexity and resource usage. Modern FPGAs have dedicated DSP blocks that alleviate this problem; however, for very large filter sizes the challenge of reducing area and complexity still remains. An alternative to computing the multiplication is to decompose the MAC operations into a series of lookup table (LUT) accesses and summations. This approach is termed distributed arithmetic (DA), a bit-serial method of computing the inner product of two vectors with a fixed number of cycles. The advantage of this method is that the LUTs readily available in FPGAs can be utilized efficiently.

5.6 Conclusion

This project presented the proposed DA architecture for FIR filters, a multiplier-less architecture. The complexity is reduced, power consumption is low, and both performance and speed increase, while the LUTs readily available in FPGAs are utilized efficiently.

5.7 Future Scope

The future scope of this project is to improve the architecture of the distributed arithmetic FIR filter so that it uses the hardware resources of the latest FPGA families. In the Virtex-5 and Virtex-6 FPGA families, 6-input LUTs were introduced. Future work includes changing the architecture to use 6-input LUTs for storing coefficient sums and SRL (shift register logic) macros to implement the shift operations, so that the total number of slices used is reduced.


BIBLIOGRAPHY

References:
[1] Digital Signal Processing: Principles, Algorithms, and Applications by John G. Proakis, Dimitris G. Manolakis
[2] Digital Signal Processing by Ramesh Babu
[3] Digital Signal Processing by Nagoor Kani
[4] Switching Theory and Logic Design by R.P. Jain
[5] VHDL Primer by J. Bhasker
[6] Wang Sen, Tang Bin, Zhu Jun, "Distributed Arithmetic for FIR Filter Design on FPGA"

Websites:
1. www.wikipedia.org/wiki/FIR
2. www.wikipedia.org/wiki/daFIR
3. www. /ipcores/distributedarithmeticFIRd.cfm
4. www.daFIR.cfm


Appendix-A Implementing VHDL Designs Using Xilinx ISE


This tutorial shows how to create, implement, simulate, and synthesize VHDL designs for implementation in FPGA chips using Xilinx ISE 9.2i : Xilinx Edition III v6.2g.

1. Launch Xilinx ISE from either the shortcut on your desktop or from your start menu under Programs → Xilinx ISE 9.2i → Project Navigator.
2. Start a new project by clicking File → New Project.

3. In the resulting window, verify the Top-Level Source Type is VHDL. Change the Project Location to a suitable directory and give it whatever name you choose, e.g. lab3.


4. The next window shows the details of the project and the target chip. We will be synthesizing designs into real chips, so it is important to match the target chip with the particular board/chip you will be using. Beginning labs will be done in a Spartan 3E XC3S500E chip that comes in a FG320 package with a speed grade of 6, as shown below.


5. Since we are starting a new design, the next couple of pop-up windows aren't relevant; just click Next, Next, and Finish.

6. You should now be in the main Project Navigator window. Select Project → New Source from the menu.


7. In the resulting pop-up window specify a VHDL Module source and give the file a name. I tend to just use the same name as the project itself, e.g. Lab 3. Click Next.


8. The next pop-up window allows you to specify your inputs and outputs through the Wizard if you so desire. In this tutorial we will build a 2 x 1 multiplexer, so we can specify the inputs and outputs as shown below. Here, the default entity and architecture names have also been changed. Once all inputs and outputs are entered, click Next and then Finish.


9. The project will usually open with the design summary tab active in the right hand side of your window. We want to go to the VHDL code, so you need to click the *.vhd tab for your design.


10. You can see that the Wizard has used STD_LOGIC as the default type for your signals and also filled in the basic entity and architecture details for you.


11. Now you can fill in the rest of the code for your design. In this case, we can write the multiplexer as shown below. Make sure to save your code frequently.


12. Once the code is entered we can proceed with a simulation of the design, or we can synthesize the code for implementation and download it onto an FPGA. Let us proceed with the simulation first. In the upper left-hand side of the ISE environment there is a Sources sub window which has a drop-down box as shown below. Note that the drop-down box currently shows Synthesis/Implementation. Change this to Behavioral Simulation.

13. Highlight your *.vhd file in the Sources sub window and then expand the Simulator selection in the Processes sub window as shown below. Click on Simulate Behavioral Model to launch the simulator.


14. Right-click and set as the top module, and create a VHDL test bench.



15. Simulation

16. Double-click on the simulation output.


17. Now let's look at the flow for actually synthesizing and implementing the design in the FPGA prototyping boards. Close the ISE simulator and go back to the Xilinx ISE environment. In the Sources sub window change the selection in the drop-down box from Behavioral Simulation to Synthesis/Implementation.

18. To properly synthesize the design we need to specify which pins on the chip all the inputs and outputs should be assigned to. In general, of course, we could assign the signals just about any way we want. Since we will be using specific prototype boards, we need to make sure our pin assignments match the switches, buttons, and LEDs so we can test our design. We will be starting with Digilab 2E boards that are connected to Digilab DIO2 input/output boards. The I/O board has already been programmed and configured to have the following connections:


19. To assign specific pins, expand the User Constraints selection under the Process sub window and double-click on Assign Package Pins.


20. A new application called Xilinx PACE should be launched.


21. In the Design Object List sub window you should see a listing of all the input and output signals from our design.


Here is where we can specify which pin locations we want for each signal. Simply enter the pin numbers from the tables shown in Step 19 above, making sure to use a capital letter P in front of the pin specification. Let's assign our signals as:

A  → P163 (Switch 1)
I0 → P164 (Switch 2)
I1 → P166 (Switch 3)
Y  → P149 (LED 0)

22. Back in the Xilinx ISE environment window we can now tell the computer to synthesize our design. In the Process sub window double-click on the Synthesize - XST selection and wait for the process to complete. Then double-click on the Implement Design selection and wait for the process to complete. Then double-click on the Generate Programming File selection and wait for the process to complete. If all goes well, you should have green check marks for the whole design.


23. There is a lot of information you can obtain through all of the objects listed in the Processes sub window, but let us proceed to downloading the design onto the prototyping board for testing. First make sure the prototyping board is connected to the PC and has power on. Also make sure the slide switch on the FPGA board by the parallel port is set to JTAG (as opposed to Port). Then select Configure Device (iMPACT) underneath the Generate Programming File selection. You should see the following window:


24. Now you need to specify which bit stream file to use to configure the device. For this tutorial we want to select the mux.bit file and click Open.


You will probably get the message below. Just click Yes.

You will also get a warning message saying the JTAG clock was updated in the bitstream file (which is good), so just click OK. There is a way to correct for that in the original design flow, but Xilinx automatically catches it here so I don't usually bother.


25. You should now see the Spartan XC2S200E chip in the main window. Right-click on the chip to prepare for downloading the bit stream file.

Select Program on the resulting window.


26. Click OK.

If all goes well, you should get the Programming Succeeded message.


Now just test and verify your design on the actual FPGA board!

Appendix-B VHDL

Why (V)HDL?
1. Interoperability
2. Technology independence
3. Design reuse
4. Several levels of abstraction
5. Readability
6. Standard language
7. Widely supported

What is VHDL?
VHDL = VHSIC Hardware Description Language (VHSIC = Very High-Speed IC)


Design specification language
Design entry language
Design simulation language
Design documentation language
An alternative to schematics

Brief History:
1. VHDL was developed in the early 1980s for managing design problems that involved large circuits and multiple teams of engineers.
2. Funded by the U.S. Department of Defense.
3. The first publicly available version was released in 1985.
4. In 1986 the IEEE (Institute of Electrical and Electronics Engineers, Inc.) was presented with a proposal to standardize VHDL.
5. In 1987 it was standardized => IEEE 1076-1987.
6. An improved version of the language was released in 1994 => IEEE standard 1076-1993.

Related Standards:
1. IEEE 1076 doesn't support simulation conditions such as unknown and high-impedance.


2. Soon after IEEE 1076-1987 was released, simulator companies began using their own, non-standard types => VHDL was becoming non-standard.
3. The IEEE 1164 standard was developed by an IEEE working group. IEEE 1164 contains definitions for a nine-valued data type, std_logic.
4. IEEE 1076.3 (Numeric or Synthesis Standard) defines data types as they relate to actual hardware.
5. Defines, e.g., two numeric types: signed and unsigned.

VHDL Environment:

Design Units: Segments of VHDL code that can be compiled separately and stored in a library.


Entities:
1. A black box with interface definition.
2. Defines the inputs/outputs of a component (defines pins).
3. A way to represent modularity in VHDL.
4. Similar to a symbol in a schematic.
5. An entity declaration describes the entity.
Eg:
entity Comparator is
  port (A, B : in std_logic_vector(7 downto 0);
        EQ   : out std_logic);
end Comparator;

Ports:
1. Provide channels of communication between the component and its environment.
2. Each port must have a name, a direction and a type.


3. An entity may have NO port declaration.

Port directions:
1. in: The value of the port can be read inside the component, but cannot be assigned. Multiple reads of the port are allowed.
2. out: Assignments can be made to the port, but data from the port cannot be read. Multiple assignments are allowed.
3. inout: Bi-directional; assignments can be made and data can be read. Multiple assignments are allowed.
4. buffer: An out port with read capability. May have at most one assignment. (Not recommended.)

Architectures:
1. Every entity has at least one architecture.
2. One entity can have several architectures.
3. Architectures can describe the design using:
   a. Behaviour
   b. Structure
   c. Dataflow
4. Architectures can describe the design on many levels:
   a. Gate level
   b. RTL (Register Transfer Level)
   c. Behavioural level
5. A configuration declaration links an architecture to an entity.
Eg:
architecture Comparator1 of Comparator is
begin
  EQ <= '1' when (A = B) else '0';
end Comparator1;

Configurations:
1. Link an entity declaration and an architecture body together.
2. The concept of a default configuration is a bit messy in VHDL 87.
   a. Last architecture analyzed links to entity?


3. Can be used to change simulation behavior without re-analyzing the VHDL source.
4. Complex configuration declarations are ignored in synthesis.
5. Some entities can have, e.g., a gate-level architecture and a behavioral architecture.
6. Are always optional.
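As a sketch of the idea, a minimal configuration declaration that binds the Comparator entity shown earlier to its Comparator1 architecture might look as follows (the configuration name Comparator_cfg is illustrative):

```vhdl
-- Binds architecture Comparator1 to entity Comparator.
-- The name Comparator_cfg is a hypothetical example.
configuration Comparator_cfg of Comparator is
  for Comparator1
  end for;
end Comparator_cfg;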

Packages:
Packages contain information common to many design units.
1. Package declaration:
   constant declarations
   type and subtype declarations
   function and procedure declarations
   global signal declarations
   file declarations
   component declarations
2. Package body (not always needed):
   function bodies
   procedure bodies
Packages are meant for encapsulating data which can be shared globally among several design units. A package consists of a declaration part and an optional body part.
Package declaration can contain:
   type and subtype declarations
   subprograms


   constants
   alias declarations
   global signal declarations
   file declarations
   component declarations
Package body consists of:
   subprogram declarations and bodies
   type and subtype declarations
   deferred constants
   file declarations
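The declaration/body split can be sketched as below; the package name my_utils and its contents are hypothetical, chosen only to show a constant, a function declaration in the declaration part, and the matching function body in the body part:

```vhdl
-- Declaration part: visible to all design units that "use" the package.
package my_utils is
  constant BUS_WIDTH : integer := 8;
  function parity (v : bit_vector) return bit;  -- declaration only
end my_utils;

-- Optional body part: holds the subprogram bodies.
package body my_utils is
  function parity (v : bit_vector) return bit is
    variable p : bit := '0';
  begin
    for i in v'range loop
      p := p xor v(i);  -- XOR-reduce the vector
    end loop;
    return p;
  end parity;
end my_utils;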

Libraries:
A collection of VHDL design units (a database).
1. Packages:
   a) package declaration
   b) package body
2. Entities (entity declaration)
3. Architectures (architecture body)
4. Configurations (configuration declarations)
A library is usually a directory in the Unix file system, but can also be any other kind of database.

Levels of Abstraction:
VHDL supports many possible styles of design description, which differ primarily in how closely they relate to the HW.
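In source code, design units stored in a library are made visible with library and use clauses, for example (the user-package name is hypothetical):

```vhdl
library ieee;
use ieee.std_logic_1164.all;   -- std_logic, std_logic_vector (IEEE 1164)
-- use work.my_utils.all;      -- a user package from the working library
                               -- (hypothetical name, for illustration)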


It is possible to describe a circuit in a number of ways:
1. Structural
2. Dataflow
3. Behavioral

Structural VHDL description:
1. The circuit is described in terms of its components.
2. From a low-level description (e.g., a transistor-level description) to a high-level description (e.g., a block diagram).
3. For large circuits, low-level descriptions quickly become impractical.

Dataflow VHDL description:
1. The circuit is described in terms of how data moves through the system.
2. In the dataflow style you describe how information flows between registers in the system.
3. The combinational logic is described at a relatively high level, while the placement and operation of registers are specified quite precisely. This is a higher level of abstraction.



4. The behavior of the system over time is defined by registers.
5. There are no built-in registers in the VHDL language.
   a. Either a lower-level description
   b. or a behavioral description of sequential elements is needed.
6. The lower-level register descriptions must be created or obtained.
7. If there are no 3rd-party models for registers => you must write the behavioral description of the registers.
8. The behavioral description can be provided in the form of subprograms (functions or procedures).
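Since VHDL has no built-in registers, a register is typically written as a clocked process, as in this sketch of a D flip-flop with synchronous reset (entity and signal names are illustrative):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity dff is
  port (clk, rst, d : in  std_logic;
        q           : out std_logic);
end dff;

architecture rtl of dff is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        q <= '0';   -- synchronous reset
      else
        q <= d;     -- capture input on the rising clock edge
      end if;
    end if;
  end process;
end rtl;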

Behavioral VHDL Description:
1. The circuit is described in terms of its operation over time.
2. The representation might include, e.g., state diagrams, timing diagrams and algorithmic descriptions.
3. The concept of time may be expressed precisely using delays (e.g., A <= B after 10 ns).
4. If no actual delays are used, the order of sequential operations is defined.
5. In the lower levels of abstraction (e.g., RTL), synthesis tools ignore detailed timing specifications.
6. The actual timing results depend on the implementation technology and the efficiency of the synthesis tool.
7. There are a few tools for behavioral synthesis.

Concurrent vs. Sequential:


Processes:
1. The basic simulation concept in VHDL.
2. A VHDL description can always be broken up into interconnected processes.
3. Quite similar to a Unix process.
4. The process keyword in VHDL.
5. A process statement is a concurrent statement.


6. Statements inside a process statement are sequential statements.
7. A process must contain either a sensitivity list or wait statement(s), but NOT both.
8. The sensitivity list or wait statement(s) contain the signals which wake the process up.
General format:
process [(sensitivity_list)]
  process_declarative_part
begin
  process_statements
  [wait_statement]
end process;
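A concrete instance of the general format above is a 2-to-1 multiplexer process with a sensitivity list (the label and signal names a, b, sel, y are illustrative, assuming std_logic signals declared elsewhere):

```vhdl
-- Wakes up whenever a, b or sel changes; no wait statement needed.
mux_proc : process (a, b, sel)
begin
  if sel = '1' then
    y <= b;
  else
    y <= a;
  end if;
end process mux_proc;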

