548 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 54, NO. 1, FEBRUARY 2007

Comparative Study of VLSI Solutions to Independent Component Analysis
Hongtao Du, Student Member, IEEE, Hairong Qi, Senior Member, IEEE,
and Xiaoling Wang, Student Member, IEEE
Abstract—The advent of independent component analysis (ICA) has brought a paradigm shift to signal and image processing. ICA, which extracts independent source signals by searching for a linear or nonlinear transformation and minimizing the statistical dependence between components, has the promise of effective unsupervised signal separation capability. Due to the computational complexity of ICA and the commonly high-volume data sets used in signal and image processing, the ICA process, however, is very time-consuming. Very large scale integration (VLSI) solutions with optimal parallelism provide potentially faster and even real-time implementations for ICA algorithms. In this paper, the authors study these solutions and discuss their limits. Critical challenges are identified, and issues associated with the VLSI implementation of ICA algorithms are discussed. Design recommendations that have the potential to support complicated ICA algorithms at large throughput are provided.

Index Terms—Application-specific integrated circuit (ASIC), CMOS integrated circuits, field-programmable gate array (FPGA), independent component analysis (ICA), reconfigurable architectures, very large scale integration (VLSI).

Manuscript received November 17, 2004; revised April 9, 2005. Abstract published on the Internet September 15, 2006. This work was supported in part by the Office of Naval Research under Grant N00014-04-1-0797.
H. Du and H. Qi are with the Department of Electrical and Computer Engineering, University of Tennessee, Knoxville, TN 37996-2100 USA (e-mail: hdu1@utk.edu; hqi@utk.edu).
X. Wang is with the SONY Advanced Technology Center, San Jose, CA 95134 USA (e-mail: grace.wang@am.sony.com).
Digital Object Identifier 10.1109/TIE.2006.885491
0278-0046/$25.00 © 2007 IEEE

I. INTRODUCTION

INDEPENDENT component analysis (ICA) is the most popular technique developed to solve the blind source separation (BSS) problem [1]. Under the assumption of signal independence across sources, the task of BSS and ICA is to separate and recover independent source signals from linear or nonlinear mixed sensor observations, in which both sources and the unmixing matrix are unknown. ICA not only decorrelates the signals that are of second-order statistics using a minimum of a priori information but also reduces higher order statistical dependencies between reconstructed signals. This principle has been used in other applications, such as recognition [2] and hyperspectral image analysis [3]. In particular, ICA is very effective for unsupervised source estimations given only the observations of mixed signals.

Although ICA is powerful, the complicated arithmetic, the iterative computation with slow convergence rate, and the generally large volumes of raw and processed data make ICA algorithms time-consuming in software implementation. Hardware implementation provides not only an optimal parallelism but also potentially faster and real-time solutions. While software implementation is useful for investigating the capabilities of ICA algorithms and is sufficient for most applications, hardware implementation is essential to fully benefit from the parallel architecture and to facilitate high-speed processing. The major difference between hardware and software implementations lies in the fact that hardware subroutines are executed by integrated circuits (ICs) instead of a series of microinstructions. Hardware implementation also solves the insufficient memory problem encountered by software for large data sets and high dimensionality.

During the last decade, advances in very large scale integrated (VLSI) circuit technologies have allowed designers to implement some ICA algorithms on fully analog CMOS circuits, analog–digital (AD) mixed-signal ICs, digital application-specific ICs (ASICs), and general field-programmable gate arrays (FPGAs). Both analog CMOS circuits and mixed-signal ICs are fully customized by designers using either analog or AD mixed technologies, where the silicon is utilized in the most efficient manner but the development expense is incredibly high. The digital nonprogrammable ASICs such as standard-height library and mask gate arrays are also full-custom VLSI and are used to implement designs at high circuit density by specifying interconnections during the later stages of the IC manufacturing process. In addition, the large number of available standard libraries of basic logic cells makes the design expense much cheaper and the design process much faster. The FPGAs based on the reconfiguration technology are the most economic and efficient solutions to ICA algorithms since they allow end users to modify and configure their designs multiple times. Specifically, the recent rapid increase in the density of FPGAs has made it possible to implement large ICA designs with a completely hardware-driven approach.

This paper identifies critical challenges and design issues for hardware implementation of ICA algorithms, studies existing VLSI solutions, and discusses future directions. Section II briefly describes ICA principles and different software implementation algorithms and addresses several VLSI design challenges. Section III studies existing VLSI solutions to various ICA algorithms and demonstrates our ASIC and FPGA designs. Section IV presents some new VLSI technologies and discusses potential solutions to the design challenges. Finally, Section V concludes this paper.
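The Introduction's distinction between second-order decorrelation and higher order independence can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes two unit-variance uniform sources and a 45-degree rotation as a hypothetical mixing matrix, and the helper names `rotate_sources` and `corr` are illustrative only.

```python
import numpy as np

def rotate_sources(n_samples=20000, angle=np.pi / 4, seed=0):
    """Mix two independent unit-variance uniform sources with a rotation."""
    rng = np.random.default_rng(seed)
    # Uniform on [-sqrt(3), sqrt(3)] has zero mean and unit variance.
    s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n_samples))
    c, sn = np.cos(angle), np.sin(angle)
    a = np.array([[c, -sn], [sn, c]])  # hypothetical mixing matrix (assumption)
    return s, a @ s

def corr(u, v):
    """Sample correlation coefficient of two 1-D signals."""
    return float(np.corrcoef(u, v)[0, 1])

s, x = rotate_sources()
second_order = corr(x[0], x[1])            # mixtures are (nearly) uncorrelated
higher_order = corr(x[0] ** 2, x[1] ** 2)  # squared mixtures remain correlated
source_check = corr(s[0] ** 2, s[1] ** 2)  # squared sources are (nearly) uncorrelated
print(second_order, higher_order, source_check)
```

Under these assumptions the mixtures pass a second-order test while their squares are still clearly correlated, which is the residual dependence that ICA, unlike plain decorrelation, is meant to remove.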

II. ICA

ICA is a method of finding a linear nonorthogonal coordinate system in any multivariate data [4]. The directions of the axes of this coordinate system are determined by both the second-order and higher order statistics of the original data. The goal is to perform a linear transform that makes the resulting variables as statistically independent from each other as possible. Let $s_1, \ldots, s_m$ be the $m$ source signals that are statistically independent, and no more than one signal is Gaussian distributed; that is, none of $s_i$ gives any information on other signals. The $n$ observed signals $x_1, \ldots, x_n$ are unmixed by an $m \times n$ unmixing matrix or weight matrix $W$ to generate the source signals in the following ICA unmixing model:

$S = WX$    (1)

where $W = [w_1, \ldots, w_m]^T$ and $w_i = [w_{i1}, \ldots, w_{in}]$, $i = 1, \ldots, m$.

A. Principal Approaches

There are two different research communities that have considered the analysis of independent components [5], namely, mixed-source separation and unsupervised learning based on information theory. The study of separating mixed sources observed in an array of sensors has been a classical and difficult signal processing problem. Herault and Jutten were the first working on BSS who, in their seminal work [6], introduced an adaptive Herault–Jutten (H–J) algorithm in a simple feedback network topology $y = -Wy + x$, where $x$ is the observation, $y$ is the separated signal, and $W$ is the unmixing matrix with zero diagonal terms [$w_{ii} = 0$, $i = 1, \ldots, \min(m, n)$] that is able to separate several unknown independent sources. Their approach has been further developed by many other researchers. Comon was one of them, who elaborated the concept of ICA and proposed cost functions related to the approximate minimization of mutual information between sensors [7]. In parallel to BSS studies, unsupervised learning rules based on information theory were proposed whose goal was to maximize the mutual information between the inputs and outputs of a neural network [8]. Roth and Baram [9] and Bell and Sejnowski [10] independently derived stochastic gradient learning rules for this maximization and applied them, respectively, to forecasting, time-series analysis, and blind separation of sources. In their work, Bell and Sejnowski put the BSS problem into an information-theoretic framework and demonstrated its effectiveness in the separation and deconvolution of mixed sources [10].

Intuitively speaking, the key to estimating the ICA model is non-Gaussianity since the source signals are desired to contain the least Gaussian components. Hence, the definition and estimation of a contrast function that measures the non-Gaussianity of independent components is necessary for the identifiability of the model. There are many different representations of the contrast function, where high-order cumulants, mutual information, and negentropy are the most important ones.

The fourth-order cumulant or kurtosis is the classical measure of non-Gaussianity of signals. It is defined as $\mathrm{kurt}(u) = E\{u^4\} - 3(E\{u^2\})^2$ for a zero-mean random variable $u$ [1], [11]. If $u$ is a Gaussian random variable, its kurtosis would be zero. For those probability densities peaked at zero, their kurtosis is positive, and for those flatter probability densities, their kurtosis is negative. Hyvärinen and Oja derived an objective function in [1] to find $W$ that maximizes the kurtosis $\mathrm{kurt}[W^T x(t)]$, where $W$ is the corresponding weight matrix and $x(t)$ is the observation at time $t$.

Mutual information, which is inspired by information theory, is a natural measure of the dependence between random variables, making it a good candidate for finding non-Gaussianity of the signals. The mutual information $I$ between $m$ random variables $y_i$, $i = 1, \ldots, m$, is defined as $I(y_1, y_2, \ldots, y_m) = \sum_{i=1}^{m} H(y_i) - H(y)$, where $H(y)$ is the differential entropy of the random variable [12]. The minimization of mutual information corresponds to finding the most independent components. The information maximization (InfoMax) principle [10] is derived from the minimization of mutual information and, correspondingly, maximization of the output entropy.

Another very important measure of non-Gaussianity is given by negentropy, which is derived based on the information-theoretic quantity of differential entropy [11]. The negentropy of a random vector $y$ is defined as $J(y) = H(y_{gauss}) - H(y)$, where $H(y_{gauss})$ denotes the differential entropy of a Gaussian random variable $y_{gauss}$, which has the same covariance matrix as $y$. The advantage of using negentropy as a measure of non-Gaussianity is that it is well justified by statistical theory [11]. Because it is difficult to calculate negentropy, an approximation is usually given as $J(Y) \approx \{E[G(Y)] - E[G(Y_{gauss})]\}^2$, where $G(Y)$ is a nonquadratic function [1]. In ICA algorithms, $Y = w^T X$, where $w^T$ is the transpose of the weight vector $w$. The FastICA algorithm [1] is developed to find $w$ that maximizes the above objective function. It uses a fixed-point iteration scheme to find a direction such that the projection of weight vectors maximizes an approximation of negentropy.

In addition, the approach of output divergence minimization [13] uses Kullback–Leibler (K–L) divergence or relative entropy of the output signals as the objective function and minimizes the divergence with gradient descent learning. The maximum likelihood method [5] assumes that the input mixed signals are mutually independent and maximizes the likelihood of the input signals.

B. VLSI Implementation Challenges

As mentioned above, many ICA algorithms are slow processes in signal/image processing applications due to the complicated arithmetic as well as the time-consuming iterative computation. VLSI is an ideal algorithm implementation carrier and offers many features such as high processing speed, which is extremely desired in ICA implementations. The complicated arithmetic of ICA is one of the main barriers in ICA hardware implementation, especially in the synthesis procedure. Therefore, hierarchy and modularity techniques in VLSI design are essential for most ICA implementations to overcome the complexity of ICA algorithms. The hierarchy, or the divide and conquer technique, involves dividing an ICA process into subprocessing modules until the complexity of the bottom
level submodules becomes manageable. These submodules are independently developed, then integrated together and put into a design and development environment for performing tasks such as synthesis, optimization, placement, and routing. The use of modularity enables the parallelism of the design process and facilitates the development of generic modules in various designs. Recently, the concept of modularity has gained more attention because of a need to reduce the design cycle and the development cost. For example, Cho and Lee [14] implemented the InfoMax ICA on the analog CMOS circuit in 2001, and Cauwenberghs [15] implemented the same algorithm using the AD mixed-signal technology in 2003. Both implementations utilized modular architectures such that the designs could be easily expanded to larger chips and further integrated into multichip systems for a large number of input and output channels.

In our previous work [16], some ICA-related reconfigurable components (RCs) have been developed for purposes of reuse and retargeting. The design and use of RCs make system synthesis faster, easier, and more efficient and reduce time-to-market, which directly leads to higher revenues [17]. Compared to the effort of one-time use, it takes about 50% more time to verify code for reuse and reconfiguration, but other designers would benefit from a design time reduction of 70% [17] or more when using these RCs. As another advantage, the well-tested RCs also reduce design risks by avoiding verifications. RCs that are parameterized using generic maps so as to make them highly flexible for future instances have helped focus the synthesis work mainly on configuration and interconnection. During the implementation process, the necessary RCs are fetched from the library, as demonstrated in Fig. 1. The generics are configured according to application specifications, whereas input and output ports are interconnected to form processes or subprocesses. In addition, these ICA-related RCs can be modified, improved, and extended to some new RCs as necessary for specific categories of ICA applications.

Fig. 1. Design flow using RCs.

Another VLSI design challenge concerning the ICA implementation is the iterative computation that requires a large amount of processing time and RAM to store intermediate results. The best solution to this problem is to reduce the ICA algorithm at the early preprocessing stage of the VLSI design process. Computationally efficient ICA algorithms would contain fewer iterations and converge rapidly to the desired accuracy. Researchers have recently accelerated ICA processes by including nonlinear learning parameters. For example, the α-ICA algorithm [18] uses the α-logarithm to minimize the mutual information among source signals. The fuzzy inference system [19] uses a fuzzy-based adaptive learning rate to measure the second-order and higher order correlation coefficients. Additionally, under some assumptions and constraints, noniterative ICA algorithms, such as the algebra ICA [20] and Molgedey–Schuster ICA [21], can be developed to approximate the true solutions.

ICA algorithms are mostly applied in the signal and image processing field, which usually entails large volumes of raw or preprocessed data that are transferred in and out of the VLSI designs. A successful ICA hardware implementation therefore requires careful design of internal RAM and I/O bandwidth. If transferring data directly to the VLSI chip is not a feasible option, then data transfer can be accomplished through shared memory or external software programs depending on specific applications.

III. COMPARATIVE STUDY OF EXISTING SOLUTIONS TO ICA ALGORITHMS

In the past few years, many advantages found in VLSI designs have inspired the implementation of complex ICA algorithms.

A. Analog CMOS Integration

When ICA algorithms are implemented on analog CMOS circuits, end users synthesize progressive processing at the transistor level; that is, the low-level element transistors and necessary interconnections can be customized. The application specified by analog circuits utilizes the minimum number of transistors and the shortest interconnections to achieve high circuit density and, correspondingly, low circuit delay and power consumption. Within a limited chip area, designers can specify tens of millions of transistors to integrate computation for ICA algorithms. This huge number of transistors forms massively parallel analog CMOS circuits and makes high-speed processing possible. In the implementation process, a complex ICA algorithm is decomposed into several subprocesses, each of which is accomplished by a cluster of transistors to perform the corresponding function. The popularly used modular structure can then be extended to any size design easily. The final circuit is either connected to external peripherals, such as an AD converter/digital–analog (DA) converter, or integrated with a digital signal processor and an AD/DA converter within one chip.

TABLE I
COMPARISON OF ANALOG, MIXED-SIGNAL, AND ASIC SOLUTIONS

The low-power high-density analog CMOS circuit is one of the earliest solutions to ICA algorithms. In 1995, Cohen and Andreou [22] designed two analog CMOS chips to implement the H–J ICA algorithm for blind separation of mixed speech signals. The chip core size (transistor counts) varied according to the design scope. For example, the two-input × two-output chip used 76 transistors, and the five-input × five-output chip used 715 transistors. The core was integrated with analog I/O interfaces, weight coefficients, and adaptation blocks on one chip. The fabrication was targeted to a 2-µm, n-well, two-poly two-metal CMOS process, in which chips employed three power lines, namely Vdd (+2.5 V), Vss (−2.5 V), and Gnd. In the same year, Gharbi and Salam [23] also implemented the H–J algorithm on a 2.22 × 2.25 mm tiny chip using the 2.0-µm CMOS technology. More recently, Cho and Lee [14] designed a fully analog CMOS chip to implement the InfoMax theory-based ICA algorithm, in which analog multiplier, nonlinear function, and weight update circuits were used. The fabricated four-input × four-output ICA chip used a 0.6-µm, p-well, two-poly three-metal AMS CMOS process and had a 2.8 × 2.8 mm active area. Both the input and output voltages ranged between −2.5 and +2.5 V.
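The two 1995 chips implement the H–J feedback network $y = -Wy + x$ with zero-diagonal $W$ described in Section II-A. As a software point of reference, the sketch below computes the steady-state output of that network and applies one adaptation step. The update rule (odd nonlinearities $y^3$ and $\arctan y$), the learning rate, and the mixing coefficients are assumptions chosen for illustration; the paper does not specify them, and this is not a model of the analog circuits themselves.

```python
import numpy as np

def hj_output(w, x):
    """Steady-state output of the feedback network y = -W y + x,
    i.e. y = (I + W)^{-1} x, for a zero-diagonal W."""
    n = w.shape[0]
    return np.linalg.solve(np.eye(n) + w, x)

def hj_update(w, y, rate=0.01):
    """One adaptation step with the ASSUMED rule dw_ij = rate * f(y_i) * g(y_j),
    f(y) = y**3 and g(y) = arctan(y); the diagonal of W is kept at zero."""
    dw = rate * np.outer(y ** 3, np.arctan(y))
    np.fill_diagonal(dw, 0.0)
    return w + dw

# Hypothetical mixing: x1 = s1 + 0.6 s2, x2 = s2 + 0.4 s1.
rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(2, 1000))
m = np.array([[0.0, 0.6], [0.4, 0.0]])
x = (np.eye(2) + m) @ s

y_ideal = hj_output(m, x)                      # W equal to the cross-coupling recovers s
w_next = hj_update(np.zeros((2, 2)), x[:, 0])  # one step from W = 0 on the first sample
```

When $W$ equals the off-diagonal coupling of the mixture, the network output reproduces the sources exactly, which is the fixed point the adaptation is meant to approach.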
Although very efficient in circuit design, the analog CMOS integration suffers from the high expense of the workstation-based development system ($150 000) and slow turnaround time (usually eight weeks), which make it ill-suited to most ICA designs and fast implementations [17]. In addition, the fabricated analog CMOS circuits do not necessarily exhibit the same behavior as they do in the ideal simulations. This phenomenon is called the "transistor mismatch," which is caused by three identifiable factors, namely: 1) edge effects; 2) striation effects; and 3) random variations [24]. General solutions include increasing transistor size (W and L) to afford more current and using cocentric structure to allow matched transistors to share the same surrounding structures. However, it would take more time and cost more to solve these physical problems.

Fig. 2. Pilchard board.

B. AD Mixed-Signal Techniques

The AD mixed-signal techniques inherit all the advantages of analog CMOS circuits, but the architecture is digitally reconfigurable and is capable of implementing a class of similar applications.

In the neuromorphic autoadaptive systems project conducted at The Johns Hopkins University [15], Celik et al. [25] simplified and generalized the H–J, the InfoMax, the natural gradient [26], and the Cichocki–Unbehauen algorithm [27] into a common outer-product rule, which was approximated and mapped onto a feedforward system. A mixed-signal adaptive parallel VLSI architecture was then used to implement this feedforward network. The elements in the unmixing matrix were stored in 12-bit precision digital cells of the mixed-signal architecture, which was implemented using fully differential switched-capacitor sampled-data circuits. The chip was fabricated using a 0.5-µm two-poly three-metal CMOS technology, and the core size was 3 × 3 mm. The three-input × three-output chip could unmix three independent components. This ICA VLSI processor was used as a front end of the system integration, in which it separated the mixed analog acoustic inputs and then fed the digital outputs to a Xilinx FPGA for further classification.

Although the aforementioned research efforts offer possibly better solutions to some ICA applications compared to the analog CMOS integration, AD mixed-signal techniques do not improve much on the development expense and the prototype turnaround period.

C. ASIC Solutions

The previously discussed analog CMOS and AD mixed-signal technologies provide user full-custom solutions to ICA
algorithms. End users are generally required to have sufficient knowledge and focus more on detailed analog physical problems and basic component designs. Therefore, the application domains are comparatively limited, and development costs significant time and expense. Fortunately, the fast-blooming digital VLSI technologies like ASIC and FPGA allow end users to concentrate on the algorithm implementation itself because IC vendors provide enormous standard libraries. ASICs and FPGAs are therefore called user semicustom solutions.

TABLE II
COMPARISON OF FPGA SOLUTIONS

From the aspect of circuit density and efficiency, the nonprogrammable ASICs cover the lower end of analog CMOS and AD mixed-signal full-custom VLSIs and the higher end of reprogrammable FPGAs. Compared to reprogrammable FPGAs, nonprogrammable ASICs retain the benefits of compact circuit design and low power consumption. Although the nonprogrammable feature increases the design expense and risk, ASICs that typically contain ten million logic gates or more are the appropriate solutions to very complex ICA designs. For example, the standard-height library cell is a design technique for nonprogrammable ASICs, where the vendors develop standard-height library cells for the implementation of a large number of functions. When implementing an ICA algorithm, the end users only need to select necessary cells that are logic level components with constant height on chip and then specify interconnections between layers such as poly, metal1, and metal2 according to their designs.

To pursue potential solutions to the InfoMax-based ICA algorithms on higher density digital ICs, the Computational NeuroSystems Laboratory, Korea Advanced Institute of Science and Technology [28], is designing an ASIC chip for ICA and will use it as a front end to control noise in speech recognition.

In our previous work [29], we synthesized the FastICA-based parallel ICA (pICA) algorithm on ASIC using standard-height library cells. In our design, the necessary logic cells were selected or developed in very compact format. All the cells were filled horizontally in multiple standard cell rows with standard height of 66/76 λ. These rows overlapped with each other after inversion or flip-flop, so that they shared Vdd and Gnd to save space and reduce circuit delay. In addition, the space between rows for wiring could be adjusted as needed. Such standard-height architecture brought higher performance and more compact core area for hardware implementation, therefore providing a better solution to the ICA algorithms. Our standard-height cell-based ASIC synthesis aimed at the Taiwan Semiconductor Manufacturing Company 0.18-µm process, where λ was equal to 0.1 µm, and the CMOS fabrication process had six metal layers and one polylayer. The voltage of the target chip was 1.8 V for applications, and a thick oxide layer was used for 3.3-V transistors [30]. For a pICA process containing estimations of four weight vectors, the chip size was 1186.34 × 1184.49 µm. The aspect ratio of the core was set to 1.001 for the convenience of placement and routing, which optimally mapped the structural interconnection of standard-height cells expressed in a schematic view onto the physical architecture.

All the analog, mixed-signal, and ASIC solutions depend on compact designs from end users at the beginning stage and fabrication of application-specific chips from hardware companies at the final stage. Table I compares the most up-to-date analog CMOS, AD mixed-signal, and ASIC solutions to ICA algorithms. The chip size and I/O reflect the compactness of individual ICA designs, whereas the fabrication parameters and voltage reflect the trend of the VLSI technology development. Obviously, "low-voltage circuit," which directly results
in squared power conservation, and "small chip size," which requires compact circuit design, are the current trend.

Fig. 3. Capacity utilization of Xilinx VIRTEX V1000EHQ240-6 for different numbers of weight vectors in pICA. The dotted lines denote the maximum capacity of Xilinx VIRTEX V1000EHQ240-6. (a) Delay. (b) Slice. (c) Transistor. (d) Equivalent gate.

TABLE III
PERFORMANCE COMPARISON BETWEEN RECONFIGURABLE FPGA SYSTEM AND SINGLE FPGA IMPLEMENTATION

D. FPGA Solutions

Among all the VLSI technologies, FPGAs provide the most economic and efficient solutions to comparatively simple ICA algorithms and could provide a lower-cost substitute for nonprogrammable ASICs. Unlike nonprogrammable VLSI devices, FPGAs are standard and general-purpose products fabricated by hardware companies before end users implement specific ICA designs on them. FPGAs are developed based on reconfigurable technologies, in which end users are allowed to modify their designs multiple times and program the interconnections in a few hours instead of waiting several weeks for the final fabrication and metallization. These savings in the development expense and the turnaround time of prototyping directly lead to time-to-market reduction and profit increase.

Most FPGAs contain 2000 to 2 000 000 logic gates [31] and use architectures that support a balance between logic resources and routing resources [32]. Typical FPGAs are composed of a two-dimensional array of input/output blocks, interconnects, and configurable logic blocks (CLBs) that can be customized to implement logic functions. The programmable interconnects
between these CLBs allow end users to implement the multilevel logic functions [33]. FPGA vendors prefabricate rows of gates and programmable connections, whereas end users specify and interconnect the programmable CLBs to perform the desired ICA algorithms.

Fig. 4. Global run-time reconfiguration flow.

Fig. 5. FPGA developments versus Moore's law.

In recent years, FPGAs have become the most popularly used devices for various VLSI implementations of ICA algorithms. In 2001, Lim et al. [34] implemented two small-size independent component neural network (ICNN) prototypes that were based on mutual information (between input and output) maximization and output divergence minimization. The implementation was on Xilinx Virtex XCV 812E, which contains 0.25 million logic gates [35]. All the variables in the network were represented as fixed-point numbers. The input signals were stored in 1-Mb RAMs and iteratively read by the ICNNs until the weight-updating process converged. Two 7-neuron ICNN prototypes were implemented, and comparison was conducted in terms of cost and performance to evaluate which one was more suitable for hardware implementation.

In [36], Nordin et al. proposed a pipelined ICA architecture for potential FPGA implementation. The InfoMax algorithm programmed in MATLAB from Tony Bell [10] was first broken down into four modules, each of which was translated into C, and then into HDL to implement on the four-stage pipelined FPGA array. Since each FPGA in this pipeline does not have data dependence with others, all blocks could be further implemented and executed in parallel.

In 2002, Satter and Charayaphan [37] implemented an improved InfoMax BSS algorithm proposed by Torkkola [38] on a Xilinx Virtex E FPGA, which contains 0.6 million logic gates. They used the Xilinx CAD tools of System Generator V2.1 and ISE 4.1 for code generation, synthesis, and routing. However, this method is not optimal from the aspects of both synthesis architecture and circuit compactness. As a result, the capacity limit led to the maximum iteration number being prelimited to 50 and the buffer size to 2500 samples. Since the required memory size exceeded the capacity of the FPGA, a MATLAB program was used instead, which was executed on the external memory. Wei and Charoensak [39] used the same method in [37] to implement a noniterative algebra ICA algorithm developed by Yamaguchi and Itoh [20] for the purpose of speeding up motion detection in image sequences. The target FPGA was Xilinx Virtex E. Although the design consumed 90 200 of the 600 000 logic gates, the application only supported the unmixing of two independent components.

In 2003, Kim et al. [40] combined the InfoMax ICA-based BSS algorithm and the ICA adaptive noise canceling algorithm, which minimized entropy between signals and noise. An expandable modularized processor design for this ICA-based speech recognition system was elaborated and implemented on an Altera EP20K600EBC652-1 FPGA, which contains 0.6 million logic gates and supports embedded memory. The observed acoustic signals collected from analog microphones were represented as 12-bit integers, whereas the elements in the weight matrix W were represented as 30-bit floating-point data. In our previous work [16], the FastICA-based pICA algorithm was implemented on a Xilinx VIRTEX V1000EHQ240-6 FPGA with a capacity of one million logic gates. The target FPGA was embedded on the pilchard reconfigurable computing platform, which was plugged into a Sun workstation as shown in Fig. 2. The pilchard reconfigurable computing platform uses the 168-pin dual inline memory module RAM slot as an interface. It is compatible with the PC133 standard [41], thereby achieving a very high data transfer rate. In the synthesis process, the pICA algorithm was first simulated by ModelSim from Mentor Graphics, then synthesized by Synopsys FPGA Compiler2, and
finally placed and routed by Xilinx XVmake. After implementing the pICA on the Xilinx V1000E embedded on the pilchard board, we achieved the maximum frequency of 20.161 MHz (minimum period of 49.600 ns) and the maximum net delay of 13.119 ns. The pICA used 92% of the slices of the V1000E.

Fig. 6. Xilinx Virtex II-based platform with logic gates of eight million.

Fig. 7. Architecture of HPRC.

Since FPGAs are standard VLSI products, devices in the same series have similar internal structures. Chip utilization percentage and operation frequency are then very important parameters for evaluating the compactness and efficiency of individual designs. Table II compares recent FPGA solutions to various ICA algorithms.

Although the characteristics of reconfigurability and reusable life cycle bring FPGA more advantages than other VLSIs, they inversely result in lower circuit density and higher circuit delay, thereby bringing capacity limitation. During the placement and routing processes of the pICA synthesis, we observed that several capacity constraints prevented a single FPGA from implementing complex algorithms like pICA. Fig. 3 shows the relationship between the number of weight vectors that need to be estimated in the pICA process and the capacity utilization of the FPGA Xilinx VIRTEX V1000E, which is evaluated using metrics including delay, slice, transistor, and equivalent gate. We find that the circuit delay increases significantly after the number of weight vectors exceeds five, whereas the number of slices, transistors, and equivalent gates increases linearly. Overall, a single Xilinx VIRTEX V1000E can accommodate a pICA process with at most four weight vector estimations that takes 92% of the maximum capacity. Even one weight vector estimation occupies 29% of the capacity of this FPGA.

dynamically reconfigurable FPGA system that we presented in [42], we divided all processes of pICA into three groups: 1) submatrix; 2) external decorrelation; and 3) comparison. Each group was synthesized using Synopsys FPGA Compiler2 and then placed and routed by Xilinx XVmake. Individual group modules were downloaded on the FPGA in a serial mode. The submatrix module was first downloaded to configure the pilchard FPGA platform. After the submatrix module was executed and the task finished, the external decorrelation module was then downloaded to reconfigure the same FPGA. Since the immediate outputs from the preceding submatrix module were commonly used as inputs of the subsequent configuration of the external decorrelation module, the external memory was used to store these intermediate signals that were originally the internal variables in the single FPGA implementation.

To compare the performance of the reconfigurable FPGA system with that of the single FPGA implementation, we list the design and device utilization ratios for each group in Table III.

We take the estimation of a weight matrix of 20 weight vectors as an example. The reconfiguration process of the reconfigurable FPGA system is illustrated in Fig. 4, in which both the number of executions and the order of executions are predefined. In this example, the reconfigurable FPGA system estimates 20 weight vectors given the observation signals. Since the submatrix module can estimate and decorrelate four weight vectors each time, we need to execute the submatrix module five times after configuring the FPGA. To decorrelate these five submatrices, the external decorrelation module needs to be hierarchically executed four times. The comparison module
This capacity constraint of single FPGA is not only for ICA
executes one time. A shell script file is written to control the
algorithms but also a challenge for all complicated applications
on large data sets. whole reconfiguration flow.
E. Reconfigurable FPGA System IV. R ECENT D EVELOPMENTS AND P OTENTIAL S OLUTIONS

Although FPGAs are economic solutions to complex ICA We have presented several effective solutions to ICA algo-
algorithms, its capacity constraint remains a problem. On the rithms in the previous section. However, we also realize the
other hand, ASICs can provide sufficient capacity, but its de- potential problems associated with these solutions. The devel-
velopment process is too expensive, and the turnaround time is opment cost of full-custom design is prohibitively high and
too long. Hence, we took the advantage of the reconfigurability considered a subpotential solution for most ICA algorithms.
of FPGAs and constructed a dynamically reconfigurable FPGA The capacity constraint problem of FPGAs impedes some
system, in which the IC capacity limit was overcome by sacri- complicated ICA algorithms from hardware implementation.
ficing the overall processing time [42]. The reconfigurable FPGA system requires extra processing
In a general FPGA platform, all processes of an ICA al- time for reconfigurations and frequent data input/output, which
gorithm are integrated in one design and synthesized on one significantly slow down the whole process and decline the
FPGA, which can be executed for multiple times. In the benefit of hardware implementation. In this section, we discuss
two possible directions that could provide potentially better VLSI solutions to ICA algorithms.

A. FPGA/ASIC Technology Development

The required computational power of complicated algorithms or applications like ICA is the driving force for the fast development of FPGA/ASIC technologies. The current phenomenal growth of FPGA/ASIC technologies has far exceeded the Moore's law prediction of a doubling of the number of transistors per IC every 18 months [43]. A comparison between the FPGA capacity trend and Moore's law is shown in Fig. 5. Current FPGAs provide end users with powerful computing and processing capabilities. Taking the new Xilinx Virtex II Pro as an example, this FPGA is manufactured in a 0.15-µm eight-layer metal process with 0.12-µm high-speed transistors and allows end users to implement designs on 8 000 000 logic gates with a 420-MHz internal clock speed and 840-Mb/s I/O [44]. Fig. 6 shows a Xilinx Virtex II Pro-based FPGA platform, which is in compliance with the Peripheral Component Interconnect (PCI)-X 133-MHz, PCI 66-MHz, and PCI 33-MHz standards.

B. High-Performance Reconfigurable Computing (HPRC) Platform

HPRC provides a potential solution for high processing speed as well as the capacity limit by connecting the reconfigurable FPGA systems or reconfigurable computing nodes in a network. HPRC exploits the benefits of parallel processing from high-performance computing (HPC) in conjunction with adaptive hardware acceleration associated with reconfigurable FPGA systems. HPC seeks extreme computing power associated with supercomputers and parallel computing [45], whereas the reconfigurable FPGA system utilizes the reconfigurability of FPGAs.

As the architecture in Fig. 7 shows, an HPRC platform consists of several computing nodes (CPU) and reconfigurable FPGA platforms. For the purposes of data exchange and synchronization, all computing nodes are interconnected by the interconnection network (ICN), whereas RC boards are interconnected via a reconfigurable ICN. The individual computing nodes and RC boards are connected with each other through high-speed channels like the memory bus or the PCI bus.

HPRC can be implemented using multiple pilchard reconfigurable computing platforms, which are plugged into the memory bus of the Sun workstations. As shown in Fig. 8 [46], the Sun workstations (computing nodes) are interconnected through the Ethernet. The pilchard platforms (RC boards) communicate with each other through shared files. During the implementation of the pICA algorithm, the modules described in Section III-E are distributed across individual computing nodes. All modules are then configured and executed in the parallel mode and collaboratively perform the whole process.

Fig. 8. HPRC implementation.

With the fast development of current VLSI technologies, we can expect large-capacity, low-power FPGAs very soon. By networking these powerful FPGAs using the HPRC technology, we anticipate significant improvements in hardware solutions to complex ICA algorithms in the near future.

V. CONCLUSION

VLSI implementations of ICA algorithms require extremely efficient hardware designs and sufficient IC resources. In this paper, we summarized some design guidelines for general ICA implementations. We also studied several existing solutions in the categories of analog CMOS, AD mixed signal, ASICs, and FPGAs. Although each technology has its own characteristics, none of them can balance a high-density low-cost design with a short turnaround development period. New development in FPGAs, combined with the HPRC technology, shows great promise for a comprehensive hardware solution to complex ICA algorithms. We hope the discussions here can shed light on hardware solutions to other complex algorithms.
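
As a supplementary illustration of the reconfiguration flow described in Section III-E, the execution counts of the 20-weight-vector example can be reproduced with a short Python sketch. This is not the shell script used in [42]; the function name, the module labels, and the rule that k submatrices need k - 1 external decorrelation runs are assumptions chosen to match the counts reported in the text (five, four, and one).

```python
from math import ceil


def reconfiguration_schedule(num_weight_vectors, vectors_per_submatrix=4):
    """Return an ordered list of (module, run_index) steps.

    Hypothetical helper, not part of the published system.
    Assumptions:
      * each submatrix run estimates up to `vectors_per_submatrix` vectors;
      * decorrelating k submatrices takes k - 1 external decorrelation runs;
      * the comparison module runs once, after everything else.
    """
    submatrix_runs = ceil(num_weight_vectors / vectors_per_submatrix)
    decorrelation_runs = max(submatrix_runs - 1, 0)

    # Modules are loaded serially on the same FPGA, one group after another.
    steps = [("submatrix", i + 1) for i in range(submatrix_runs)]
    steps += [("external_decorrelation", i + 1) for i in range(decorrelation_runs)]
    steps.append(("comparison", 1))
    return steps


if __name__ == "__main__":
    for module, run in reconfiguration_schedule(20):
        print(f"{module} run {run}")
```

Under these assumptions, 20 weight vectors yield five submatrix runs, four external decorrelation runs, and one comparison run, which agrees with the example in Section III-E. The exact interleaving of runs in Fig. 4 is not reproduced here.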
REFERENCES

[1] A. Hyvärinen and E. Oja, “A fast fixed-point algorithm for independent component analysis,” Neural Comput., vol. 9, no. 7, pp. 1483–1492, Oct. 1997.
[2] M. Bartlett and T. Sejnowski, “Neural information processing systems-natural and synthetic,” in Viewpoint Invariant Face Recognition Using Independent Component Analysis and Attractor Networks. Cambridge, MA: MIT Press, 1997, ch. 9, pp. 817–823.
[3] M. Lennon, G. Mercier, M. C. Mouchot, and L. Hubert-Moy, “Independent component analysis as a tool for the dimensionality reduction and the representation of hyperspectral images,” in Proc. SPIE Remote Sens., Toulouse, France, Sep. 2001, vol. 4541, pp. 2893–2895.
[4] T. W. Lee, M. S. Lewicki, and T. J. Sejnowski, “ICA mixture models for unsupervised classification of non-gaussian classes and automatic context switching in blind signal separation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1078–1089, Oct. 2000.
[5] T. W. Lee, M. Girolami, A. J. Bell, and T. J. Sejnowski, “A unifying information-theoretic framework for independent component analysis,” Int. J. Math. Comput. Model., vol. 39, no. 11, pp. 1–21, 1998.
[6] J. Herault and J. Jutten, “Space or time adaptive signal processing by neural network models,” in Proc. AIP Conf. 151 Neural Netw. Comput., 1986, pp. 206–211.
[7] P. Comon, “Independent component analysis, a new concept,” Signal Process. (Special Issue on High-Order Statistics), vol. 36, no. 3, pp. 287–314, Apr. 1994.
[8] R. Linsker, “Local synaptic learning rules suffice to maximize mutual information in a linear network,” Neural Comput., vol. 4, no. 5, pp. 691–702, 1992.
[9] Z. Roth and Y. Baram, “Multidimensional density shaping by sigmoids,” IEEE Trans. Neural Netw., vol. 7, no. 5, pp. 1291–1298, Sep. 1996.
[10] A. J. Bell and T. J. Sejnowski, “An information maximization approach to blind separation and blind deconvolution,” Neural Comput., vol. 7, no. 6, pp. 1129–1159, Nov. 1995.
[11] A. Hyvärinen and E. Oja. (1999, Apr.). Independent Component Analysis: A Tutorial. [Online]. Available: http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory. Hoboken, NJ: Wiley, 1991.
[13] S. Amari, A. Cichocki, and H. Yang, “A new learning algorithm for blind signal separation,” in Advances in Neural Information Processing Systems, vol. 8. Cambridge, MA: MIT Press, 1996.
[14] K. S. Cho and S. Y. Lee, “Implementation of InfoMax ICA algorithm with analog CMOS circuits,” in Proc. Int. Workshop Independent Compon. Anal. and Blind Signal Separation, Vancouver, BC, Canada, Dec. 2001, pp. 70–73.
[15] G. Cauwenberghs. (2003). “Neuromorphic autoadaptive systems and independent component analysis,” Johns Hopkins Univ., Baltimore, MD. Tech. Rep. [Online]. Available: http://bach.ece.jhu.edu/gert/yip/
[16] H. Du and H. Qi, “An FPGA implementation of parallel ICA for dimensionality reduction in hyperspectral images,” in Proc. IEEE Int. Geosci. and Remote Sens. Symp., Sep. 2004, pp. 3257–3260.
[17] D. Bouldin, Developments in Design Reuse. Knoxville, TN: Univ. Tennessee, 2001. Tech. Rep.
[18] Y. Matsuyama, T. Nimoto, N. Katsumata, Y. Suzuki, and S. Furukawa, “α-EM algorithm and α-ICA learning based upon extended logarithmic information measures,” in Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural Netw., Jul. 24–27, 2000, vol. 3, pp. 351–356.
[19] S.-T. Lou and X.-D. Zhang, “Fuzzy-based learning rate determination for blind source separation,” IEEE Trans. Fuzzy Syst., vol. 11, no. 3, pp. 375–383, Jun. 2003.
[20] T. Yamaguchi and K. Itoh, “An algebra solution to independent component analysis,” Opt. Commun., vol. 178, no. 1–3, pp. 59–64, May 2000.
[21] L. Molgedey and G. Schuster, “Separation of a mixture of independent signals using time delayed correlations,” Phys. Rev. Lett., vol. 72, no. 23, pp. 3634–3637, Jun. 1994.
[22] M. H. Cohen and A. G. Andreou, “Analog CMOS integration and experimentation with an autoadaptive independent component analyzer,” IEEE Trans. Circuits Syst. II, Analog and Digit. Signal Process., vol. 42, no. 2, pp. 65–77, Feb. 1995.
[23] A. B. A. Gharbi and F. M. A. Salam, “Implementation and test results of a chip for the separation of mixed signals,” in Proc. IEEE ISCAS, May 1995, pp. 271–274.
[24] T. Serrano-Gotarredona and B. Linares-Barranco, “CMOS transistor mismatch model valid from weak to strong inversion,” in Proc. Conf. Eur. Solid-State Circuits, 2003, pp. 627–630.
[25] A. Celik, M. Stanacevic, and G. Cauwenberghs, “Mixed-signal real-time adaptive blind source separation,” in Proc. IEEE ISCAS, Vancouver, BC, Canada, May 2004, pp. V-760–V-763.
[26] A. Cichocki, R. Unbehauen, L. Moszczynski, and E. Rummert, “A new on-line adaptive learning algorithm for blind separation of sources,” in Proc. ISANN, Dec. 1994, pp. 406–411.
[27] A. Cichocki, R. Unbehauen, and E. Rummert, “Robust learning algorithm for blind separation of signals,” Electron. Lett., vol. 30, no. 17, pp. 1386–1387, Aug. 1994.
[28] Computational NeuroSystems Laboratory, Digital Implementation of Independent Component Analysis Algorithm, 2003, Daejeon, South Korea: Dept. Biosystems, Korea Advanced Inst. Sci. Technol. Tech. Rep. [Online]. Available: http://cnsl.kaist.ac.kr/Research/kscho/icachip.htm
[29] H. Du, H. Qi, and G. Peterson, “Parallel ICA and its hardware implementation in hyperspectral image analysis,” in Proc. SPIE Defense and Security Symp., Apr. 2004, pp. 74–83.
[30] MOSIS. (2003). TSMC 0.18 Micrometer Process, Marina del Rey, CA: MOSIS Company Tech. Rep. [Online]. Available: http://www.mosis.org/products/fab/vendors/tsmc/tsmc018/
[31] D. Bouldin, ECE 551: Designing Application-Specific Integrated Circuits, 2001.
[32] D. Bouldin, “Synthesis of FPGAs and testable ASICs,” in Design of Systems on a Chip. Norwell, MA: Kluwer, 2003.
[33] H. Youssef and S. M. Sait, VLSI Physical Design Automation, Theory and Practice. Singapore: World Scientific, Jun. 1999.
[34] A. B. Lim, J. C. Rajapakse, and A. R. Omondi, “Comparative study of implementing ICNNs on FPGAs,” in Proc. Int. Joint Conf. Neural Netw., Jul. 2001, vol. 1, pp. 177–182.
[35] Xilinx, Virtex-E 1.8 V Extended Memory Field Programmable Gate Arrays, 2002. [Online]. Available: http://direct.xilinx.com/bvdocs/publications/ds025-1.pdf
[36] A. Nordin, C. Hsu, and H. Szu, “Design of FPGA ICA for hyperspectral imaging processing,” in Proc. SPIE, Wavelet Appl. VIII, 2001, vol. 4391, pp. 444–454.
[37] F. Sattar and C. Charayaphan, “Low-cost design and implementation of an ICA-based blind source separation algorithm,” in Proc. 15th Annu. IEEE Int. ASIC/SOC Conf., 2002, pp. 15–19.
[38] K. Torkkola, “Blind separation of convolved sources based on information maximization,” in Proc. IEEE Workshop Neural Netw. Signal Process., Kyoto, Japan, Sep. 1996, pp. 423–432.
[39] Y. Wei and C. Charoensak, “FPGA implementation of non-iterative ICA for detecting motion in image sequences,” in Proc. 7th ICARCV, Dec. 2002, vol. 3, pp. 1332–1336.
[40] C. M. Kim et al., “FPGA implementation of ICA algorithm for blind signal separation and adaptive noise canceling,” IEEE Trans. Neural Netw., vol. 14, no. 5, pp. 1038–1046, Sep. 2003.
[41] P. H. W. Leong et al., Pilchard - A Reconfigurable Computing Platform With Memory Slot Interface. Hong Kong: Chinese Univ. Hong Kong.
[42] H. Du and H. Qi, “A reconfigurable FPGA system for parallel independent component analysis,” EURASIP J. Embedded Syst., submitted for publication. [Online]. Available: http://www.hindawi.com/RecentlyAcceptedArticlePDF.aspx?journal=ES&number=23025
[43] Silicon Moore’s Law. Intel Corporation, Santa Clara, CA, 2004. [Online]. Available: http://www.intel.com/research/silicon/mooreslaw.htm
[44] Xilinx, Virtex-II Platform FPGAs: Complete Data Sheet, Mar. 2004. [Online]. Available: http://direct.xilinx.com/bvdocs/publications/ds031.pdf
[45] M. Alexander, Reconfigurable Computing, Pullman, WA: Washington State Univ. [Online]. Available: http://www.eecs.wsu.edu/~reconfig/
[46] G. D. Peterson and M. C. Smith, “Programming high performance reconfigurable computers,” in Proc. Int. Conf. Adv. Infrastructure Electron. Business, Sci., and Education on the Internet, SSGRR, L’Aquila, Italy, Aug. 2001, pp. 60–68.

Hongtao Du (S’03) received the B.S. and M.S. degrees in electrical engineering from Northeastern University, Shenyang, China, in 1997 and 2000, respectively, and the M.S. degree in computer engineering from the University of Tennessee, Knoxville, in 2003. He is currently working toward the Ph.D. degree in electrical and computer engineering at the University of Tennessee.
His current research interests include architectures and design methods for low-power VLSIs, reconfigurable and virtual platforms, parallel/distributed image and signal processing, and high-performance computing.

Hairong Qi (S’97–M’99–SM’05) received the B.S. and M.S. degrees in computer science from Northern JiaoTong University, Beijing, China, in 1992 and 1995, respectively, and the Ph.D. degree in computer engineering from North Carolina State University, Raleigh, in 1999.
She is currently an Associate Professor with the Department of Electrical and Computer Engineering, University of Tennessee, Knoxville. She has published more than 80 technical papers in archival journals and refereed conference proceedings. She also coauthored a book on machine vision. Her current research interests include advanced imaging and collaborative processing in sensor networks, hyperspectral image analysis, and bioinformatics.
Dr. Qi serves on the Editorial Board of the Sensor Letters and is an Associate Editor for Computers in Biology and Medicine. She is a recipient of the National Science Foundation CAREER Award and the Chancellor’s Award for Professional Promise in Research and Creative Achievement.

Xiaoling Wang (S’02) received the B.S. and M.S. degrees in electrical engineering from Northeastern University, Shenyang, China, in 1997 and 2000, respectively, and the Ph.D. degree in electrical engineering from the University of Tennessee, Knoxville, in 2004.
She is a Senior Engineer with the SONY Advanced Technology Center, San Jose, CA. Her current research interests are electronic imaging, information processing in distributed sensor networks, distributed data fusion for target tracking and classification, and pattern recognition.

0278-0046/$25.00 © 2007 IEEE
