Abstract—The advent of independent component analysis (ICA) has brought a paradigm shift to signal and image processing. ICA, which extracts independent source signals by searching for a linear or nonlinear transformation that minimizes the statistical dependence between components, promises effective unsupervised signal separation. Due to the computational complexity of ICA and the commonly high-volume data sets used in signal and image processing, however, the ICA process is very time-consuming. Very large scale integration (VLSI) solutions with optimal parallelism provide potentially faster and even real-time implementations of ICA algorithms. In this paper, the authors study these solutions and discuss their limits. Critical challenges are identified, and issues associated with the VLSI implementation of ICA algorithms are discussed. Design recommendations that have the potential to support complicated ICA algorithms at high throughput are provided.

Index Terms—Application-specific integrated circuit (ASIC), CMOS integrated circuits, field-programmable gate array (FPGA), independent component analysis (ICA), reconfigurable architectures, very large scale integration (VLSI).

Manuscript received November 17, 2004; revised April 9, 2005. Abstract published on the Internet September 15, 2006. This work was supported in part by the Office of Naval Research under Grant N00014-04-1-0797.
H. Du and H. Qi are with the Department of Electrical and Computer Engineering, University of Tennessee, Knoxville, TN 37996-2100 USA (e-mail: hdu1@utk.edu; hqi@utk.edu).
X. Wang is with the SONY Advanced Technology Center, San Jose, CA 95134 USA (e-mail: grace.wang@am.sony.com).
Digital Object Identifier 10.1109/TIE.2006.885491

I. INTRODUCTION

INDEPENDENT component analysis (ICA) is the most popular technique developed to solve the blind source separation (BSS) problem [1]. Under the assumption of signal independence across sources, the task of BSS and ICA is to separate and recover independent source signals from linear or nonlinear mixed sensor observations, in which both the sources and the unmixing matrix are unknown. ICA not only decorrelates the signals at the level of second-order statistics using a minimum of a priori information but also reduces higher order statistical dependencies between the reconstructed signals. This principle has been used in other applications, such as recognition [2] and hyperspectral image analysis [3]. In particular, ICA is very effective for unsupervised source estimation given only the observations of mixed signals.

Although powerful, the complicated arithmetic, the iterative computation with slow convergence rates, and the generally large volumes of raw and processed data make ICA algorithms time-consuming processes in software implementations. Hardware implementation provides not only optimal parallelism but also potentially faster, real-time solutions. While software implementation is useful for investigating the capabilities of ICA algorithms and is sufficient for most applications, hardware implementation is essential to fully benefit from the parallel architecture and to facilitate high-speed processing. The major difference between hardware and software implementations lies in the fact that hardware subroutines are executed by integrated circuits (ICs) instead of a series of microinstructions. Hardware implementation also solves the insufficient-memory problem encountered by software for large data sets and high dimensionality.

During the last decade, advances in very large scale integration (VLSI) circuit technologies have allowed designers to implement some ICA algorithms on fully analog CMOS circuits, analog–digital (AD) mixed-signal ICs, digital application-specific ICs (ASICs), and general field-programmable gate arrays (FPGAs). Both analog CMOS circuits and mixed-signal ICs are fully customized by designers using either analog or AD mixed technologies, where the silicon is utilized in the most efficient manner but the development expense is extremely high. Digital nonprogrammable ASICs, such as standard-height library and mask gate arrays, are also full-custom VLSI and are used to implement designs at high circuit density by specifying interconnections during the later stages of the IC manufacturing process. In addition, the large number of available standard libraries of basic logic cells makes the design expense much lower and the design process much faster. FPGAs, which are based on reconfiguration technology, are the most economical and efficient solutions for ICA algorithms since they allow end users to modify and configure their designs multiple times. Specifically, the recent rapid increase in the density of FPGAs has made it possible to implement large ICA designs with a completely hardware-driven approach.

This paper identifies critical challenges and design issues for hardware implementation of ICA algorithms, studies existing VLSI solutions, and discusses future directions. Section II briefly describes ICA principles and different software implementation algorithms and addresses several VLSI design challenges. Section III studies existing VLSI solutions to various ICA algorithms and demonstrates our ASIC and FPGA designs. Section IV presents some new VLSI technologies and discusses potential solutions to the design challenges. Finally, Section V concludes this paper.
II. ICA

ICA is a method of finding a linear nonorthogonal coordinate system in any multivariate data [4]. The directions of the axes of this coordinate system are determined by both the second-order and the higher order statistics of the original data. The goal is to perform a linear transform that makes the resulting variables as statistically independent from each other as possible. Let s1, . . . , sm be the m source signals that are statistically independent, with no more than one signal being Gaussian distributed; that is, none of the si gives any information on the other signals. The n observed signals x1, . . . , xn are unmixed by an m × n unmixing matrix or weight matrix W to generate the source signals in the following ICA unmixing model:

S = WX    (1)

where W = [w1, . . . , wm]^T and wi = [wi1, . . . , win], i = 1, . . . , m.
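To make the unmixing model in (1) concrete, the following minimal NumPy sketch (added here for illustration and not taken from the paper; the synthetic sources, the mixing matrix, and the use of the exact matrix inverse for W are all hypothetical) mixes two synthetic sources and applies a known W to recover S = WX.

    import numpy as np

    # Hypothetical example: two synthetic, statistically independent sources.
    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 1000)
    s1 = np.sign(np.sin(7 * 2 * np.pi * t))      # square wave (sub-Gaussian)
    s2 = rng.laplace(size=t.size)                # spiky noise (super-Gaussian)
    S_true = np.vstack([s1, s2])                 # m x T matrix of sources

    A = np.array([[0.8, 0.3],                    # illustrative mixing matrix
                  [0.2, 0.7]])
    X = A @ S_true                               # n x T observed mixtures

    # In practice W must be estimated from X alone; the exact inverse of A is
    # used here only to illustrate the unmixing model S = WX of (1).
    W = np.linalg.inv(A)
    S_est = W @ X
    print(np.allclose(S_est, S_true))            # True: an ideal W recovers the sources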
A. Principal Approaches

There are two different research communities that have considered the analysis of independent components [5]: mixed-source separation and unsupervised learning based on information theory. The study of separating mixed sources observed in an array of sensors has been a classical and difficult signal processing problem. Herault and Jutten were the first to work on BSS; in their seminal work [6], they introduced an adaptive Herault–Jutten (H–J) algorithm in a simple feedback network topology y = −Wy + x, where x is the observation, y is the separated signal, and W is the unmixing matrix with zero diagonal terms [wii = 0, i = 1, . . . , min(m, n)] that is able to separate several unknown independent sources.
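At equilibrium, the feedback relation y = −Wy + x can be rearranged to (I + W)y = x, i.e., y = (I + W)^(−1)x. The short sketch below (an illustration with an arbitrarily chosen off-diagonal W, not the adaptive learning rule of [6]) simply evaluates that fixed point.

    import numpy as np

    # Illustrative 2 x 2 instance of the H-J feedback topology y = -W y + x.
    W = np.array([[0.0, 0.4],                    # zero diagonal; off-diagonal terms arbitrary
                  [0.3, 0.0]])
    x = np.array([1.0, 2.0])                     # one observation vector

    # Rearranging y = -W y + x gives (I + W) y = x, solved directly here.
    y = np.linalg.solve(np.eye(2) + W, x)
    print(np.allclose(-W @ y + x, y))            # True: y satisfies the feedback equation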
Their approach has been further developed by many other researchers. Comon was one of them; he elaborated the concept of ICA and proposed cost functions related to the approximate minimization of mutual information between sensors [7]. In parallel to the BSS studies, unsupervised learning rules based on information theory were proposed whose goal was to maximize the mutual information between the inputs and outputs of a neural network [8]. Roth and Baram [9] and Bell and Sejnowski [10] independently derived stochastic gradient learning rules for this maximization and applied them, respectively, to forecasting, time-series analysis, and blind separation of sources. In their work, Bell and Sejnowski put the BSS problem into an information-theoretic framework and demonstrated its effectiveness in the separation and deconvolution of mixed sources [10].

Intuitively speaking, the key to estimating the ICA model is non-Gaussianity, since the source signals are desired to contain the least Gaussian components. Hence, the definition and estimation of a contrast function that measures the non-Gaussianity of independent components is necessary for the identifiability of the model. There are many different representations of the contrast function, of which higher order cumulants, mutual information, and negentropy are the most important.

The fourth-order cumulant, or kurtosis, is the classical measure of the non-Gaussianity of signals. It is defined as kurt(u) = E{u^4} − 3(E{u^2})^2 for a zero-mean random variable u [1], [11]. If u is a Gaussian random variable, its kurtosis is zero. For probability densities peaked at zero, the kurtosis is positive, and for flatter probability densities, the kurtosis is negative. Hyvärinen and Oja derived an objective function in [1] to find the W that maximizes the kurtosis kurt[W^T x(t)], where W is the corresponding weight matrix and x(t) is the observation at time t.
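As a quick illustration of kurtosis as a contrast function (a sketch, not the paper's code; the sample distributions are made up), the estimate below follows the definition kurt(u) = E{u^4} − 3(E{u^2})^2 and shows the expected signs for super-Gaussian, Gaussian, and sub-Gaussian samples.

    import numpy as np

    def kurtosis(u):
        """Sample estimate of kurt(u) = E{u^4} - 3(E{u^2})^2 for zero-mean u."""
        u = u - u.mean()                         # enforce the zero-mean assumption
        return np.mean(u**4) - 3 * np.mean(u**2)**2

    rng = np.random.default_rng(0)
    n = 100_000
    print(kurtosis(rng.laplace(size=n)))         # > 0: peaked (super-Gaussian) density
    print(kurtosis(rng.normal(size=n)))          # ~ 0: Gaussian
    print(kurtosis(rng.uniform(-1, 1, size=n)))  # < 0: flat (sub-Gaussian) density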
Mutual information, which is inspired by information theory, is a natural measure of the dependence between random variables, making it a good candidate for measuring the non-Gaussianity of the signals. The mutual information I between m random variables yi, i = 1, . . . , m, is defined as I(y1, y2, . . . , ym) = Σ_{i=1}^{m} H(yi) − H(y), where H(y) is the differential entropy of the random variable [12]. The minimization of mutual information corresponds to finding the most independent components. The information maximization (InfoMax) principle [10] is derived from the minimization of mutual information and, correspondingly, the maximization of the output entropy.
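The paper does not spell out the learning rule itself; one widely used stochastic natural-gradient form of the InfoMax update for super-Gaussian sources (an illustrative choice, not necessarily the variant targeted by the hardware designs discussed later) is W ← W + η(I + (1 − 2g(y))y^T)W, with y = Wx and a logistic nonlinearity g. A minimal sketch, assuming hypothetical Laplace-distributed sources and an illustrative mixing matrix:

    import numpy as np

    def infomax_step(W, x, lr=0.01):
        """One illustrative natural-gradient InfoMax update (logistic nonlinearity,
        suitable for super-Gaussian sources); not the rule used by any specific
        hardware design in this paper."""
        y = W @ x                                # current output vector
        z = 1.0 / (1.0 + np.exp(-y))             # logistic nonlinearity g(y)
        return W + lr * (np.eye(len(y)) + np.outer(1.0 - 2.0 * z, y)) @ W

    rng = np.random.default_rng(1)
    A = np.array([[0.8, 0.3], [0.2, 0.7]])       # hypothetical mixing matrix
    X = A @ rng.laplace(size=(2, 5000))          # super-Gaussian mixtures
    W = np.eye(2)
    for x in X.T:                                # single stochastic pass over the samples
        W = infomax_step(W, x)
    print(W @ A)                                 # should tend toward a scaled permutation matrix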
Another very important measure of non-Gaussianity is given by negentropy, which is derived from the information-theoretic quantity of differential entropy [11]. The negentropy of a random vector y is defined as J(y) = H(y_gauss) − H(y), where H(y_gauss) denotes the differential entropy of a Gaussian random variable y_gauss that has the same covariance matrix as y. The advantage of using negentropy as a measure of non-Gaussianity is that it is well justified by statistical theory [11]. Because negentropy is difficult to calculate, an approximation is usually used: J(Y) ≈ {E[G(Y)] − E[G(Y_gauss)]}^2, where G(Y) is a nonquadratic function [1]. In ICA algorithms, Y = w^T X, where w^T is the transpose of the weight vector w. The FastICA algorithm [1] is developed to find the w that maximizes this objective function. It uses a fixed-point iteration scheme to find a direction such that the projection of the weight vector maximizes an approximation of negentropy.
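A minimal sketch of the one-unit fixed-point update follows (assuming centered, whitened data and the kurtosis-based nonlinearity g(u) = u^3, which gives the update w ← E{z(w^T z)^3} − 3w; this is only one common choice of nonlinearity and not the authors' hardware realization).

    import numpy as np

    def fastica_one_unit(Z, n_iter=100, seed=0):
        """One-unit FastICA fixed-point iteration on whitened data Z (m x T),
        using the kurtosis-based nonlinearity g(u) = u^3, i.e. the update
        w <- E{z (w^T z)^3} - 3w followed by renormalization."""
        rng = np.random.default_rng(seed)
        w = rng.normal(size=Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            wz = w @ Z                            # projections w^T z(t)
            w_new = (Z * wz**3).mean(axis=1) - 3 * w
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1.0) < 1e-9:  # converged (up to sign)
                return w_new
            w = w_new
        return w

    # Hypothetical demo: mix two Laplace sources, whiten, extract one component.
    rng = np.random.default_rng(1)
    X = np.array([[0.8, 0.3], [0.2, 0.7]]) @ rng.laplace(size=(2, 10000))
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E / np.sqrt(d)) @ E.T @ X                # whitening so that cov(Z) = I
    w = fastica_one_unit(Z)
    print(w)                                      # one estimated unmixing direction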
In addition, the approach of output divergence minimization [13] uses the Kullback–Leibler (K–L) divergence, or relative entropy, of the output signals as the objective function and minimizes the divergence with gradient descent learning. The maximum likelihood method [5] assumes that the input mixed signals are mutually independent and maximizes the likelihood of the input signals.

B. VLSI Implementation Challenges

As mentioned above, many ICA algorithms are slow processes in signal/image processing applications due to the complicated arithmetic as well as the time-consuming iterative computation. VLSI is an ideal implementation carrier for such algorithms and offers features such as high processing speed, which is extremely desirable in ICA implementations. The complicated arithmetic of ICA is one of the main barriers to ICA hardware implementation, especially in the synthesis procedure. Therefore, hierarchy and modularity techniques in VLSI design are essential for most ICA implementations to overcome the complexity of ICA algorithms. The hierarchy, or divide-and-conquer, technique involves dividing an ICA process into subprocessing modules until the complexity of the bottom-level modules becomes manageable.
TABLE I
COMPARISON OF ANALOG, MIXED-SIGNAL, AND ASIC SOLUTIONS
TABLE II
COMPARISON OF FPGA SOLUTIONS
algorithms. End users are generally required to have sufficient knowledge and to focus more on detailed analog physical problems and basic component designs. Therefore, the application domains are comparatively limited, and the development costs significant time and expense. Fortunately, fast-blooming digital VLSI technologies like ASICs and FPGAs allow end users to concentrate on the algorithm implementation itself because IC vendors provide enormous standard libraries. ASICs and FPGAs are therefore called user semicustom solutions.

From the aspect of circuit density and efficiency, nonprogrammable ASICs cover the lower end of analog CMOS and AD mixed-signal full-custom VLSIs and the higher end of reprogrammable FPGAs. Compared to reprogrammable FPGAs, nonprogrammable ASICs retain the benefits of compact circuit design and low power consumption. Although the nonprogrammable feature increases the design expense and risk, ASICs that typically contain ten million logic gates or more are the appropriate solutions to very complex ICA designs. For example, the standard-height library cell is a design technique for nonprogrammable ASICs, where the vendors develop standard-height library cells for the implementation of a large number of functions. When implementing an ICA algorithm, the end users only need to select the necessary cells, which are logic-level components with constant height on chip, and then specify interconnections between layers such as poly, metal1, and metal2 according to their designs.

To pursue potential solutions to the InfoMax-based ICA algorithms on higher density digital ICs, the Computational NeuroSystems Laboratory, Korea Advanced Institute of Science and Technology [28], is designing an ASIC chip for ICA and will use it as a front end to control noise in speech recognition.

In our previous work [29], we synthesized the FastICA-based parallel ICA (pICA) algorithm on an ASIC using standard-height library cells. In our design, the necessary logic cells were selected or developed in a very compact format. All the cells were filled horizontally in multiple standard cell rows with a standard height of 66/76 λ. These rows overlapped with each other after inversion or flip-flop so that they shared Vdd and Gnd to save space and reduce circuit delay. In addition, the space between rows for wiring could be adjusted as needed. Such a standard-height architecture brought higher performance and a more compact core area for hardware implementation, therefore providing a better solution to the ICA algorithms. Our standard-height cell-based ASIC synthesis targeted the Taiwan Semiconductor Manufacturing Company 0.18-µm process, where λ was equal to 0.1 µm, and the CMOS fabrication process had six metal layers and one polylayer. The voltage of the target chip was 1.8 V for applications, and a thick oxide layer was used for 3.3-V transistors [30]. For a pICA process containing estimations of four weight vectors, the chip size was 1186.34 × 1184.49 µm. The aspect ratio of the core was set to 1.001 for the convenience of placement and routing, which optimally mapped the structural interconnection of standard-height cells expressed in a schematic view onto the physical architecture.

All the analog, mixed-signal, and ASIC solutions depend on compact designs from end users at the beginning stage and fabrication of application-specific chips from hardware companies at the final stage. Table I compares the most up-to-date analog CMOS, AD mixed-signal, and ASIC solutions to ICA algorithms. The chip size and I/O reflect the compactness of individual ICA designs, whereas the fabrication parameters and voltage reflect the trend of VLSI technology development. Obviously, "low-voltage circuit," which directly results in squared power conservation, and "small chip size," which requires compact circuit design, are the current trend.

Fig. 3. Capacity utilization of Xilinx VIRTEX V1000EHQ240-6 for different numbers of weight vectors in pICA. The dotted lines denote the maximum capacity of Xilinx VIRTEX V1000EHQ240-6. (a) Delay. (b) Slice. (c) Transistor. (d) Equivalent gate.

TABLE III
PERFORMANCE COMPARISON BETWEEN RECONFIGURABLE FPGA SYSTEM AND SINGLE FPGA IMPLEMENTATION

D. FPGA Solutions

Among all the VLSI technologies, FPGAs provide the most economical and efficient solutions to comparatively simple ICA algorithms and could provide a lower cost substitute for nonprogrammable ASICs. Unlike nonprogrammable VLSI devices, FPGAs are standard and general-purpose products fabricated by hardware companies before end users implement specific ICA designs on them. FPGAs are developed based on reconfigurable technologies, in which end users are allowed to modify their designs multiple times and program the interconnections in a few hours instead of waiting several weeks for the final fabrication and metallization. These savings in the development expense and the turnaround time of prototyping directly lead to time-to-market reduction and profit increase. Most FPGAs contain 2000 to 2 000 000 logic gates [31] and use architectures that support a balance between logic resources and routing resources [32]. Typical FPGAs are composed of a two-dimensional array of input/output blocks, interconnects, and configurable logic blocks (CLBs) that can be customized to implement logic functions. The programmable interconnects route signals among the CLBs and the input/output blocks.
Fig. 6. Xilinx Virtex II Pro-based platform with eight million logic gates.
Fig. 7. Architecture of HPRC.
two possible directions that could provide potentially better VLSI solutions to ICA algorithms.

A. FPGA/ASIC Technology Development

The computational power required by complicated algorithms or applications like ICA is the driving force behind the fast development of FPGA/ASIC technologies. The current phenomenal growth of FPGA/ASIC technologies has gone far beyond Moore's law prediction of a doubling in the number of transistors per IC every 18 months [43]. A comparison between the FPGA capacity trend and Moore's law is demonstrated in Fig. 5. Current FPGAs provide end users with powerful computing and processing capabilities. Taking the new Xilinx Virtex II Pro as an example, this FPGA is manufactured with a 0.15-µm eight-layer metal process with 0.12-µm high-speed transistors and allows end users to implement designs on 8 000 000 logic gates with a 420-MHz internal clock speed and 840-Mb/s I/O [44]. Fig. 6 shows a Xilinx Virtex II Pro-based FPGA platform, which is in compliance with the Peripheral Component Interconnect (PCI)-X 133-MHz, PCI 66-MHz, and PCI 33-MHz standards.

FPGA platforms. For the purposes of data exchange and synchronization, all computing nodes are interconnected by the interconnection network (ICN), whereas the RC boards are interconnected via a reconfigurable ICN. The individual computing nodes and RC boards are connected with each other through high-speed channels such as a memory bus or a PCI bus.

HPRC can be implemented using multiple Pilchard reconfigurable computing platforms, which are plugged into the memory bus of Sun workstations. As shown in Fig. 8 [46], the Sun workstations (computing nodes) are interconnected through the Ethernet. The Pilchard platforms (RC boards) communicate with each other through shared files. During the implementation of the pICA algorithm, the modules described in Section III-E are distributed on individual computing nodes. All modules are then configured and executed in parallel mode and collaboratively perform the whole process.
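Purely as a software analogy (not the authors' HPRC implementation; the module decomposition, data, and FastICA-style estimation below are hypothetical), the sketch distributes independent weight-vector estimations across worker processes, mirroring how pICA sub-modules would run concurrently on separate computing nodes before their results are merged.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def estimate_weight_vector(args):
        """Hypothetical pICA sub-module: estimate one weight vector from whitened
        data Z with a FastICA-style one-unit iteration (illustration only)."""
        Z, seed = args
        rng = np.random.default_rng(seed)
        w = rng.normal(size=Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(100):
            wz = w @ Z
            w = (Z * wz**3).mean(axis=1) - 3 * w
            w /= np.linalg.norm(w)
        return w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        Z = rng.laplace(size=(4, 20000))          # stand-in for whitened observations
        Z = Z - Z.mean(axis=1, keepdims=True)     # zero mean per component
        Z /= Z.std(axis=1, keepdims=True)         # unit variance per component
        # Each worker process plays the role of one computing node estimating its
        # own weight vector; a real pICA run would then decorrelate and merge them.
        tasks = [(Z, seed) for seed in range(4)]
        with ProcessPoolExecutor(max_workers=4) as pool:
            W = np.vstack(list(pool.map(estimate_weight_vector, tasks)))
        print(W.shape)                            # (4, 4): four estimated weight vectors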
With the fast development of current VLSI technologies, we can expect large-capacity, low-power FPGAs very soon. By networking these powerful FPGAs using the HPRC technology, we anticipate significant improvements in hardware solutions to complex ICA algorithms in the near future.
Hairong Qi (S'97–M'99–SM'05) received the B.S. and M.S. degrees in computer science from Northern JiaoTong University, Beijing, China, in 1992 and 1995, respectively, and the Ph.D. degree in computer engineering from North Carolina State University, Raleigh, in 1999.
She is currently an Associate Professor with the Department of Electrical and Computer Engineering, University of Tennessee, Knoxville. She has published more than 80 technical papers in archival journals and refereed conference proceedings. She also coauthored a book on machine vision. Her current research interests include advanced imaging and collaborative processing in sensor networks, hyperspectral image analysis, and bioinformatics.
Dr. Qi serves on the Editorial Board of Sensor Letters and is an Associate Editor for Computers in Biology and Medicine. She is a recipient of the National Science Foundation CAREER Award and the Chancellor's Award for Professional Promise in Research and Creative Achievement.

Xiaoling Wang (S'02) received the B.S. and M.S. degrees in electrical engineering from Northeastern University, Shenyang, China, in 1997 and 2000, respectively, and the Ph.D. degree in electrical engineering from the University of Tennessee, Knoxville, in 2004.
She is a Senior Engineer with the SONY Advanced Technology Center, San Jose, CA. Her current research interests are electronic imaging, information processing in distributed sensor networks, distributed data fusion for target tracking and classification, and pattern recognition.