
EURASIP Journal on Embedded Systems

Design and Architectures for


Signal and Image Processing
Guest Editors: Markus Rupp, Ahmet T. Erdogan,
and Bertrand Granado
Copyright 2009 Hindawi Publishing Corporation. All rights reserved.
This is a special issue published in volume 2009 of EURASIP Journal on Embedded Systems. All articles are open access articles
distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Editor-in-Chief
Zoran Salcic, University of Auckland, New Zealand
Associate Editors
Sandro Bartolini, Italy
Neil Bergmann, Australia
Shuvra Bhattacharyya, USA
Ed Brinksma, The Netherlands
Paul Caspi, France
Liang-Gee Chen, Taiwan
Dietmar Dietrich, Austria
Stephen A. Edwards, USA
Alain Girault, France
Rajesh K. Gupta, USA
Thomas Kaiser, Germany
Bart Kienhuis, The Netherlands
Chong-Min Kyung, Korea
Miriam Leeser, USA
John McAllister, UK
Koji Nakano, Japan
Antonio Nunez, Spain
Sri Parameswaran, Australia
Zebo Peng, Sweden
Marco Platzner, Germany
Marc Pouzet, France
S. Ramesh, India
Partha S. Roop, New Zealand
Markus Rupp, Austria
Asim Smailagic, USA
Leonel Sousa, Portugal
Jarmo Henrik Takala, Finland
Jean-Pierre Talpin, France
Jürgen Teich, Germany
Dongsheng Wang, China
Contents
Design and Architectures for Signal and Image Processing, Markus Rupp, Ahmet T. Erdogan,
and Bertrand Granado
Volume 2009, Article ID 674308, 3 pages
Multicore Software-Defined Radio Architecture for GNSS Receiver Signal Processing, Heikki Hurskainen,
Jussi Raasakka, Tapani Ahonen, and Jari Nurmi
Volume 2009, Article ID 543720, 10 pages
An Open Framework for Rapid Prototyping of Signal Processing Applications, Maxime Pelcat,
Jonathan Piat, Matthieu Wipliez, Slaheddine Aridhi, and Jean-François Nezan
Volume 2009, Article ID 598529, 13 pages
Run-Time HW/SW Scheduling of Data Flow Applications on Reconfigurable Architectures,
Fakhreddine Ghaffari, Benoît Miramond, and François Verdier
Volume 2009, Article ID 976296, 13 pages
Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes,
Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci
Volume 2009, Article ID 723465, 15 pages
Comments on Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes,
Kiran K. Gunnam, Gwan S. Choi, and Mark B. Yeary
Volume 2009, Article ID 704174, 3 pages
Reply to Comments on Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC
Codes, Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci
Volume 2009, Article ID 635895, 2 pages
OLLAF: A Fine Grained Dynamically Reconfigurable Architecture for OS Support, Samuel Garcia and
Bertrand Granado
Volume 2009, Article ID 574716, 11 pages
Trade-O Exploration for Target Tracking Application in a Customized Multiprocessor Architecture,
Jehangir Khan, Smail Niar, Mazen A. R. Saghir, Yassin El-Hillali, and Atika Rivenq-Menhaj
Volume 2009, Article ID 175043, 21 pages
A Prototyping Virtual Socket System-On-Platform Architecture with a Novel ACQPPS Motion Estimator
for H.264 Video Encoding Applications, Yifeng Qiu and Wael Badawy
Volume 2009, Article ID 105979, 20 pages
FPSoC-Based Architecture for a Fast Motion Estimation Algorithm in H.264/AVC, Obianuju Ndili and
Tokunbo Ogunfunmi
Volume 2009, Article ID 893897, 16 pages
FPGA Accelerator for Wavelet-Based Automated Global Image Registration, Baofeng Li, Yong Dou,
Haifang Zhou, and Xingming Zhou
Volume 2009, Article ID 162078, 10 pages
A System for an Accurate 3D Reconstruction in Video Endoscopy Capsule, Anthony Kolar,
Olivier Romain, Jade Ayoub, David Faura, Sylvain Viateur, Bertrand Granado, and Tarik Graba
Volume 2009, Article ID 716317, 15 pages
Performance Evaluation of UML2-Modeled Embedded Streaming Applications with System-Level
Simulation, Tero Arpinen, Erno Salminen, Timo D. Hämäläinen, and Marko Hännikäinen
Volume 2009, Article ID 826296, 16 pages
Cascade Boosting-Based Object Detection from High-Level Description to Hardware Implementation,
K. Khattab, J. Dubois, and J. Miteran
Volume 2009, Article ID 235032, 12 pages
Very Low-Memory Wavelet Compression Architecture Using Strip-Based Processing for Implementation
in Wireless Sensor Networks, Li Wern Chew, Wai Chong Chia, Li-minn Ang, and Kah Phooi Seng
Volume 2009, Article ID 479281, 16 pages
Data Cache-Energy and Throughput Models: Design Exploration for Embedded Processors,
Muhammad Yasir Qadri and Klaus D. McDonald-Maier
Volume 2009, Article ID 725438, 7 pages
Hardware Architecture for Pattern Recognition in Gamma-Ray Experiment, Sonia Khatchadourian,
Jean-Christophe Prévotet, and Lounis Kessal
Volume 2009, Article ID 737689, 15 pages
Evaluation and Design Space Exploration of a Time-Division Multiplexed NoC on FPGA for Image
Analysis Applications, Linlin Zhang, Virginie Fresse, Mohammed Khalid, Dominique Houzet,
and Anne-Claire Legrand
Volume 2009, Article ID 542035, 15 pages
Efficient Processing of a Rainfall Simulation Watershed on an FPGA-Based Architecture with Fast Access
to Neighbourhood Pixels, Lee Seng Yeong, Christopher Wing Hong Ngau, Li-Minn Ang,
and Kah Phooi Seng
Volume 2009, Article ID 318654, 19 pages
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 674308, 3 pages
doi:10.1155/2009/674308
Editorial
Design and Architectures for Signal and Image Processing
Markus Rupp (EURASIP Member),1 Ahmet T. Erdogan,2 and Bertrand Granado3
1Institute of Communications and Radio-Frequency Engineering (INTHFT), Vienna University of Technology, 1040 Vienna, Austria
2The School of Engineering and Electronics, The University of Edinburgh, Edinburgh EH9 3JL, UK
3ENSEA, Cergy-Pontoise University, boulevard du Port, 95011 Cergy-Pontoise Cedex, France
Correspondence should be addressed to Markus Rupp, mrupp@nt.tuwien.ac.at
Received 8 December 2009; Accepted 8 December 2009
Copyright 2009 Markus Rupp et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This Special Issue of the EURASIP Journal on Embedded Systems is intended to present innovative methods, tools, design methodologies, and frameworks for the algorithm-architecture matching approach in the design flow, including system-level design and hardware/software codesign, RTOS, system modeling and rapid prototyping, system synthesis, design verification, and performance analysis and estimation.
Today, typical sequential design flows are in use and they are reaching their limits due to:
(i) The complexity of today's systems designed with the emerging submicron technologies for integrated circuit manufacturing
(ii) The intense pressure on the design cycle time in order to reach shorter time-to-market and reduce development and production costs
(iii) The strict performance constraints that have to be reached in the end, typically low and/or guaranteed application execution time, integrated circuit area, and overall system power dissipation.
Because in such a design methodology the system is seen as a whole, this special issue also covers the following topics:
(i) New and emerging architectures: SoC, MPSoC, configurable computing (ASIPs), (dynamically) reconfigurable systems using FPGAs
(ii) Smart sensors: audio and image sensors for high performance and energy efficiency
(iii) Applications: automotive, medical, multimedia, telecommunications, ambient intelligence, object recognition, and cryptography
(iv) Resource management techniques for real-time operating systems in a codesign framework
(v) Systems and architectures for real-time image processing
(vi) Formal models, transformations, and architectures for reliable embedded system design.
We received 30 submissions of which we eventually accepted
17 for publication.
The paper entitled Multicore software-defined radio architecture for GNSS receiver signal processing by H. Hurskainen et al. describes a multicore Software-Defined Radio (SDR) architecture for Global Navigation Satellite System (GNSS) receiver implementation. Three GNSS SDR architectures are discussed: (1) a hardware-based SDR that is feasible for embedded devices but relatively expensive, (2) a pure SDR approach that has a high level of flexibility and a low bill of materials, but is not yet suited for handheld applications, and (3) a novel architecture that uses a programmable array of multiple processing cores that exhibits both flexibility and potential for mobile devices.
The paper entitled An open framework for rapid proto-
typing of signal processing applications by M. Pelcat et al.
presents an open-source Eclipse-based framework which aims
to facilitate the exploration and development processes in
this context. The framework includes a generic graph editor
(Graphiti), a graph transformation library (SDF4J), and
an automatic mapper/scheduler tool with simulation and
code generation capabilities (PREESM). The input of the
framework is composed of a scenario description and two
graphs: one graph describes an algorithm and the second
graph describes an architecture. As an example, a prototype
for a 3GPP long-term evolution (LTE) algorithm on a multi-
core digital signal processor is built, illustrating both the
features and the capabilities of this framework.
The paper entitled Run-time HW/SW scheduling of data flow applications on reconfigurable architectures by F. Ghaffari et al. presents an efficient dynamic and run-time hardware/software scheduling approach. This scheduling heuristic consists in mapping online the different tasks of a highly dynamic application in such a way that the total execution time is minimized. The scheduling method is applied to several image processing applications. The presented experiments include simulation and synthesis results on a Virtex V-based platform. These results show better performance than existing methods.
The paper entitled Techniques and architectures for hazard-free semiparallel decoding of LDPC codes by M. Rovini et al. describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of idle cycles, to prevent the hazards of the pipeline mechanism in LDPC decoding. Alongside, different semiparallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.
The paper entitled OLLAF: a fine-grained dynamically reconfigurable architecture for OS support by S. Garcia and B. Granado presents OLLAF, a fine-grained dynamically reconfigurable architecture (FGDRA), specially designed to efficiently support an OS. The studies presented here show the contribution of this architecture in terms of hardware context management and preemption support, as well as the gain that can be obtained, by using OLLAF instead of a classical FPGA, in terms of context management and preemption overhead.
The paper entitled Trade-off exploration for target tracking application in a customized multiprocessor architecture by J. Khan et al. presents the design of an FPGA-based multiprocessor system-on-chip (MPSoC) architecture optimized for multiple target tracking (MTT) in automotive applications. The paper explains how the MTT application is designed and profiled to partition it among different processors. It also explains how different optimizations were applied to customize the individual processor cores to their assigned tasks and to assess their impact on performance and FPGA resource utilization, resulting in a complete MTT application running on an optimized MPSoC architecture that fits in a contemporary medium-sized FPGA and that meets the real-time constraints of the given application.
The paper entitled A prototyping virtual socket system-on-platform architecture with a novel ACQPPS motion estimator for H.264 video encoding applications by Y. Qiu and W. M. Badawy presents a novel adaptive crossed quarter polar pattern search (ACQPPS) algorithm, proposed to realize an enhanced inter prediction for H.264. Moreover, an efficient prototyping system-on-platform architecture is also presented, which can be utilized for a realization of an H.264 baseline profile encoder with the support of an integrated ACQPPS motion estimator and related video IP accelerators. The implementation results show that the ACQPPS motion estimator can achieve very high estimated image quality, comparable to that from the full search method in terms of peak signal-to-noise ratio (PSNR), while keeping the complexity at an extremely low level.
The paper entitled FPSoC-based architecture for a fast motion estimation algorithm in H.264/AVC by O. Ndili and T. Ogunfunmi presents an architecture based on a modified hybrid fast motion estimation (FME) algorithm. Presented results show that the modified hybrid FME algorithm outperforms previous state-of-the-art FME algorithms, while its losses compared with full search motion estimation (FSME), in terms of PSNR performance and computation time, are insignificant.
The paper entitled FPGA accelerator for wavelet-based automated global image registration by B. Li et al. presents an architecture for wavelet-based automated global image registration (WAGIR), which is fundamental to most remote sensing image processing algorithms and extremely computation intensive. They propose a block wavelet-based automated global image registration (BWAGIR) architecture based on a block resampling scheme. The architecture with 1 processing unit outperforms the CL cluster system with 1 node by at least 7.4X, and the MPM massively parallel machine with 1 node by at least 3.4X. The BWAGIR with 5 units achieves a speedup of about 3X against the CL with 16 nodes, and a comparable speed with the MPM with 30 nodes.
The paper entitled A system for an accurate 3D recon-
struction in video endoscopy capsule by A. Kolar et al.
presents the hardware and software development of a wire-
less multispectral vision sensor which allows transmitting a
3D reconstruction of a scene in real time. The paper also
presents a method to acquire the images at a 25 frames/s
video rate with a discrimination between the texture and the
projected pattern. This method uses an energetic approach,
a pulsed projector, and an original 64 × 64 CMOS image sensor with programmable integration time. Multiple images are taken with different integration times to obtain an image
of the pattern which is more energetic than the background
texture. Also presented is a 3D reconstruction processing
that allows a precise and real-time reconstruction. This
processing which is specically designed for an integrated
sensor and its integration in an FPGA-like device has a low
power consumption compatible with a VCE examination.
The paper presents experimental results with the realization
of a large-scale demonstrator using an SOPC prototyping
board.
The paper entitled Performance evaluation of UML2-
modeled embedded streaming applications with system-level
simulation by T. Arpinen et al. presents an efficient method
to capture abstract performance model of a streaming data
real-time embedded system (RTES). This method uses an
MDA (model driven architecture) approach. The goal of the
performance modeling and simulation is to achieve early
estimates on PE, memory, and on-chip network utilization,
task response times, among other information that is used
for design-space exploration. UML2 is used for performance
model specication. The application workload modeling
is carried out using UML2 activity diagrams. The platform is
described with structural UML2 diagrams and model ele-
ments annotated with performance values. The focus here is
on modeling streaming data applications. It is characteristic
to streaming applications that a long sequence of data items
flows through a stable set of computation steps (tasks) with
only occasional control messaging and branching.
The paper entitled Cascade boosting-based object detec-
tion from high-level description to hardware implemen-
tation by K. Khattab et al. presents an implementation
of boosting-based object detection algorithms that are
considered the fastest accurate object detection algorithms
today, but their implementation in a real-time solution is
still a challenge. A new parallel architecture, which exploits
the parallelism and the pipelining in these algorithms, is
proposed. The method to develop this architecture was based
on a high-level SystemC description. SystemC enables PC
simulation that allows simple and fast testing and leaves
the structure open to any kind of hardware or software
implementation since SystemC is independent of all
platforms.
The paper entitled Very low memory wavelet compres-
sion architecture using strip-based processing for imple-
mentation in wireless sensor networks by L. W. Chew
et al. presents a hardware architecture for strip-based image
compression using the SPIHT algorithm. The lifting-based
5/3 DWT which supports a lossless transformation is used
in the proposed work. The wavelet coefficients output from the DWT module are stored in a strip buffer at a predefined location using a new 1D addressing method for SPIHT coding. In addition, a proposed modification to the traditional SPIHT algorithm is also presented. In order
to improve the coding performance, a degree-0 zerotree
coding methodology is applied during the implementation
of SPIHT coding. To facilitate the hardware implementation,
the proposed SPIHT coding eliminates the use of lists in
its set-partitioning approach and is implemented in two
passes. The proposed modication reduces both the memory
requirement and complexity of the hardware coder.
The paper entitled Data cache-energy and throughput models: design exploration for embedded processors by M. Y. Qadri and K. D. McDonald-Maier proposes cache-energy models. These models strive to provide a complete application-based analysis. As a result they could facilitate the tuning of a cache and an application for a given power budget. The models presented in this paper are an improved extension of energy and throughput models for a data cache in terms of the leakage energy, which is indicated for the entire processor rather than simply the cache on its own. The energy model covers the per-cycle energy consumption of the processor. The leakage energy statistics of the processor in the data sheet cover the cache and all peripherals of the chip. The models are also improved in terms of refinement of the miss rate, which has been split into two terms: a read miss rate and a write miss rate. This was done as the read energy and write energy components correspond to the respective miss rate contribution of the cache. The model-based approach presented was used to predict the processor's performance with sufficient accuracy. An example application for design exploration that could facilitate the identification of an optimal cache configuration and code profile for a target application was discussed.
The paper entitled Hardware architecture for pattern
recognition in gamma-ray experiment by S. Khatcha-
dourian et al. presents an intelligent way of triggering
data in the HESS (high energy stereoscopic system) phase
II experiment. The system relies on the utilization of
image processing algorithms in order to increase the trigger
efficiency. The proposed trigger scheme is based on a neural
system that extracts the interesting features of the incoming
images and rejects the background more efficiently than
classical solutions. The paper presents the basic principles of
the algorithms as well as their hardware implementation in
FPGAs.
The paper entitled Evaluation and design space explo-
ration of a time-division multiplexed NoC on FPGA for
image analysis applications by L. Zhang et al. presents an
adaptable fat tree NoC architecture for field-programmable gate array (FPGA), designed for image analysis applications. The authors propose a dedicated communication architecture for image analysis algorithms. This communication mechanism is a generic NoC infrastructure dedicated to dataflow image processing applications, mixing circuit-switching and packet-switching communications. The complete architecture integrates two dedicated communication architectures and reusable IP blocks. Communications are based on the NoC concept to support the high bandwidth required for a large number and type of data. For data communication inside the architecture, an efficient time-division multiplexed (TDM) architecture is proposed. This NoC uses a fat tree (FT) topology with virtual channels (VC) and packet switching with fixed routes. Two versions of the NoC are presented in this paper. The results of their implementations and their design space exploration (DSE) on Altera Stratix II are analyzed and compared with a point-to-point communication and illustrated with a multispectral image application.
The paper entitled Efficient processing of a rainfall
simulation watershed on an FPGA-based architecture with
fast access to neighborhood pixels by L. S. Yeong et al.
describes a hardware architecture to implement the water-
shed algorithm using rainfall simulation. The speed of the
architecture is increased by utilizing a multiple memory bank
approach to allow parallel access to the neighborhood pixel
values. In a single read cycle, the architecture is able to
obtain all ve values of the center and four neighbors for a
4 connectivity watershed transform. The proposed rainfall
watershed architecture consists of two parts. The first part performs the arrowing operation and the second part assigns
each pixel to its associated catchment basin.
Markus Rupp
Ahmet T. Erdogan
Bertrand Granado
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 543720, 10 pages
doi:10.1155/2009/543720
Research Article
Multicore Software-Defined Radio Architecture for
GNSS Receiver Signal Processing
Heikki Hurskainen, Jussi Raasakka, Tapani Ahonen, and Jari Nurmi
Department of Computer Systems, Tampere University of Technology, P. O. Box 553, 33101 Tampere, Finland
Correspondence should be addressed to Heikki Hurskainen, heikki.hurskainen@tut.fi
Received 27 February 2009; Revised 22 May 2009; Accepted 30 June 2009
Recommended by Markus Rupp
We describe a multicore Software-Defined Radio (SDR) architecture for Global Navigation Satellite System (GNSS) receiver implementation. A GNSS receiver picks up very low power signals from multiple satellites and then uses dedicated processing to demodulate and measure the exact timing of these signals, from which the user's position, velocity, and time (PVT) can be estimated. Three GNSS SDR architectures are discussed. (1) A hardware-based SDR that is feasible for embedded devices but relatively expensive, (2) a pure SDR approach that has a high level of flexibility and a low bill of materials, but is not yet suited for handheld applications, and (3) a novel architecture that uses a programmable array of multiple processing cores that exhibits both flexibility and potential for mobile devices. We present the CRISP project where the multicore architecture will be realized, along with numerical analysis of application requirements of the platform's processing cores and network payload.
Copyright 2009 Heikki Hurskainen et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
Global navigation has been a challenge to mankind for
centuries. However, in the modern world it has become
easier with help from Global Navigation Satellite Systems (GNSSs). NAVSTAR Global Positioning System (GPS) [1]
has been the most famous implementation of GNSS and
the only fully operational system available for civilian users,
although this situation is changing.
Galileo [2] is emerging as a competitor and complement for GPS, as both are satellite navigation systems based on Code Division Multiple Access (CDMA) techniques. CDMA is a technique that allows multiple transmitters to use the same carrier simultaneously by multiplying pseudorandom noise (PRN) codes onto the transmitted signal. The PRN code rate is higher than the data symbol rate, which spreads the energy of a data symbol over a wider bandwidth. The PRN codes used are unique to each transmitter, and thus a transmitter can be identified at reception when the received signal is correlated with a replica of the used PRN code.
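The despreading idea described above can be sketched numerically: correlating a received sum of spread signals with one transmitter's PRN replica recovers that transmitter's data symbol, while the other transmitters' contributions average out. The following pure-Python toy (random ±1 codes standing in for real Gold codes, no noise, carrier, or code-phase offset) is only an illustration of the principle, not receiver code from the paper:

```python
import random

def correlate(received, replica):
    """Normalized correlation of received samples with a PRN replica."""
    return sum(r * c for r, c in zip(received, replica)) / len(replica)

random.seed(1)
n_chips = 1023  # one GPS C/A-like code period
# Two transmitters with independent random +/-1 chip sequences (toy
# stand-ins for real PRN codes, which are chosen for low cross-correlation).
code_a = [random.choice((-1, 1)) for _ in range(n_chips)]
code_b = [random.choice((-1, 1)) for _ in range(n_chips)]

symbol_a, symbol_b = +1, -1  # one data symbol per transmitter
received = [symbol_a * a + symbol_b * b for a, b in zip(code_a, code_b)]

# Correlation with the matching replica recovers that transmitter's symbol;
# the other transmitter contributes only a small cross-correlation term.
print(round(correlate(received, code_a)))  # -> 1
print(round(correlate(received, code_b)))  # -> -1
```

With 1023 chips the residual cross-correlation between two random codes is on the order of 1/√1023 ≈ 0.03, which is why the rounded correlator outputs equal the transmitted symbols.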
The Russian GLONASS system, originally based on
Frequency Division Multiple Access (FDMA), is adding a
CDMA feature to the system with GLONASS-K satellites
[3]. China has also shown interest in implementing its
own system, called Compass, during the following decade
[4]. The GPS modernization program [5] introduces addi-
tional signals with new codes and modulation. Realiza-
tion of the new navigation systems and modernization of
GPS produce updates and upgrades to system specifications.
Besides changing specifications, GNSS is also facing challenges from an environmental point of view. Multipath effects make it more difficult to determine the exact signal timing crucial for navigation algorithms. Research on multipath mitigation algorithms is active, since accurate navigation capability in environments with heavy multipath is desired. Along with interference issues, multipath mitigation is also one of the biggest drivers for the introduction of new GNSS signal modulations.
Designing a true GNSS receiver is not a trivial task. A true GNSS receiver should be reconfigurable and flexible in design so that the possibilities of new specifications and algorithms can be exploited, and the price should be low enough to enable mass market penetration.
2. GNSS Principles and Challenges
2.1. Navigation and Signal Processing. Navigation can be
performed when four or more satellites are visible to
the receiver. The pseudoranges from receiver to satellites
and navigation data (containing ephemeris parameters) are
needed [1, 6, 7].
When pseudoranges (ρ_i) are measured by the receiver, they can be used to solve the unknowns, the user's location (x, y, z)_u and clock bias b_u, with known positions of the satellites (x, y, z)_i. The relation between pseudorange, satellite position, and user position is illustrated in

ρ_i = √((x_i − x_u)² + (y_i − y_u)² + (z_i − z_u)²) + b_u. (1)
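Equation (1) can be inverted numerically for the four unknowns. The sketch below applies Gauss-Newton iteration to four noiseless synthetic pseudoranges; the satellite coordinates and clock bias are invented for illustration, and a real receiver would additionally handle measurement noise, Earth rotation, and atmospheric corrections:

```python
import math

def gauss_solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def solve_pvt(sats, rhos, iters=20):
    """Recover (x, y, z)_u and b_u from four pseudoranges, equation (1),
    by Gauss-Newton iteration starting from the Earth's center."""
    ux, uy, uz, bu = 0.0, 0.0, 0.0, 0.0
    for _ in range(iters):
        H, dz = [], []
        for (sx, sy, sz), rho in zip(sats, rhos):
            r = math.dist((sx, sy, sz), (ux, uy, uz))
            # Row of the Jacobian: partials of rho_i w.r.t. (x_u, y_u, z_u, b_u)
            H.append([(ux - sx) / r, (uy - sy) / r, (uz - sz) / r, 1.0])
            dz.append(rho - (r + bu))  # measured minus predicted pseudorange
        dx = gauss_solve(H, dz)
        ux, uy, uz, bu = ux + dx[0], uy + dx[1], uz + dx[2], bu + dx[3]
    return (ux, uy, uz), bu

# Synthetic scenario, all values in meters and purely illustrative.
sats = [(26560e3, 0, 0), (0, 26560e3, 0), (0, 0, 26560e3),
        (15000e3, 15000e3, 15000e3)]
true_user, true_bias = (2000e3, 3000e3, 5000e3), 100.0
rhos = [math.dist(s, true_user) + true_bias for s in sats]

est, bias = solve_pvt(sats, rhos)
```

On noiseless data the iteration recovers the user position and the clock bias essentially exactly, which is why at least four satellites are needed: three range unknowns plus the receiver clock bias.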
The transmitted signal contains low rate navigation
data (50 Hz for GPS Standard Positioning Service (SPS)),
repeating PRN code sequence (1023 chips at 1.023 MHz for
GPS SPS) and a high rate carrier (GPS SPS is transmitted at
L1 band which is centered at 1575.42 MHz) [1]. For Galileo
E1 Open Service (OS) and future GPS L1C it also contains
a Multiplexed Binary Offset Carrier (MBOC) modulation
[8, 9]. These signal components are illustrated in Figure 1.
The signal processing for GNSS can be divided into analog and digital parts. Since the carrier frequencies of the GNSS are high (>1 GHz), it is impossible to perform digital signal processing on them directly. In the analog part of the receiver, which is called the radio front-end, the received signal is amplified, filtered, downconverted, and finally quantized and sampled to digital format.
The digital signal processing part (i.e., baseband process-
ing) has two major tasks. First, the Doppler frequencies and
code phases of the satellites need to be acquired. The details
of the acquisition process are well explained in literature, for
example, [1, 7]. There are a number of ways to implement
acquisition, with parallel methods being faster than serial
ones, but at the cost of consuming more resources. The
parallel methods can be applied either as convolution in the
time domain (matched filters) or as multiplication in the
frequency domain (using FFT and IFFT).
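The code-phase search at the heart of acquisition can be illustrated with a brute-force circular correlation over all possible shifts; the peak marks the code phase at which the local replica aligns with the received signal. This toy (a random ±1 code, no noise or Doppler, invented shift value) shows the principle behind the matched-filter and FFT-based methods mentioned above, not an efficient implementation:

```python
import random

def circular_correlation(received, code):
    """Correlate the received samples against every cyclic shift of the
    local PRN replica; the peak reveals the code phase."""
    n = len(code)
    return [sum(received[i] * code[(i + k) % n] for i in range(n))
            for k in range(n)]

random.seed(7)
n = 127
code = [random.choice((-1, 1)) for _ in range(n)]  # toy stand-in for a PRN code
true_phase = 42
received = [code[(i + true_phase) % n] for i in range(n)]  # shifted, noiseless

peaks = circular_correlation(received, code)
estimated_phase = max(range(n), key=peaks.__getitem__)
print(estimated_phase)  # peak index equals true_phase
```

The aligned shift yields a correlation of exactly n (every chip matches), while misaligned shifts stay near zero; FFT/IFFT methods compute the same circular correlation in O(n log n) instead of O(n²).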
Second, after successful acquisition the signals found
are being tracked. In tracking, the frequency and phase of
the receiver are continuously fine-tuned to keep receiving
the acquired signals. Also, the GNSS data is demodulated
and the precise timing is formed from the signal phase
measurements. A detailed description of the tracking process
can be found, for example, in [1, 7]. The principles for data
demodulation are also illustrated in Figure 1.
2.2. Design Challenges of GNSS. The environment we are
living in is constantly changing in topographic, geometric,
economic, and political ways. These changes are driving the
GNSS evolution.
Besides new systems (e.g., Galileo, Compass), the existing ones (i.e., GPS, GLONASS) are being modernized. This leads to a constantly evolving field of specifications, which may increase frustration and uncertainty among receiver designers and manufacturers.
The signal spectrum of future GNSS signals is growing
with the new systems. Currently the GPS L1 (centered at
1575.42 MHz) is the only commercially exploited GNSS
frequency band. The Galileo system's E1 OS signal will be sharing
the same band. Another common band of future GPS and
Galileo signals will be centered at 1176.45 MHz (GPS L5 and
Galileo E5a).
The GPS modernization program is also activating the
L2 frequency band (centered at 1227.60 MHz) to civilian
use by implementing the L2C (L2 Civil) signal [10]. This band
has already been assigned for navigation use, but only for
authorized users via GPS Precise Positioning Service (PPS)
[1].
To improve signal code tracking and multipath performance, a new Binary Offset Carrier (BOC) modulation was originally introduced as the baseline for Galileo and modern GPS L1 signal development [11]. A later agreement between European and US GNSS authorities further specified the usage of Multiplexed BOC (MBOC) modulation in both systems. In MBOC modulation two different binary subcarriers are added to the signal, either in time-multiplexed mode (TMBOC) or summed together with predefined weighting factors as Composite BOC (CBOC) [8, 9, 12].
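The subcarriers involved are square waves multiplied onto the PRN chips. As a schematic sketch of how CBOC combines them: the weights below follow the commonly cited 10/11 vs. 1/11 MBOC power split, and the sign convention on the pilot channel is our assumption here; the authoritative definitions are in [8, 9, 12], and the function names are ours:

```python
import math

def boc_subcarrier(t, m):
    """Sine-phased BOC square-wave subcarrier sampled at time t
    (t in units of one code chip; m = subcarrier cycles per chip)."""
    return 1.0 if math.sin(2 * math.pi * m * t) >= 0 else -1.0

def cboc(t, pilot=False):
    """Composite BOC: weighted sum of BOC(1,1) and BOC(6,1) subcarriers.
    Weights reflect the assumed 10/11 / 1/11 MBOC power split; the
    BOC(6,1) part is taken with opposite sign on the pilot channel."""
    w1, w6 = math.sqrt(10 / 11), math.sqrt(1 / 11)
    sign = -1.0 if pilot else 1.0
    return w1 * boc_subcarrier(t, 1) + sign * w6 * boc_subcarrier(t, 6)

# BOC(1,1): subcarrier is +1 over the first half of a chip, -1 over the second.
print(boc_subcarrier(0.25, 1), boc_subcarrier(0.75, 1))  # -> 1.0 -1.0
```

Because the subcarrier flips sign within each chip, the resulting spectrum is split away from the band center, which is what improves code tracking and multipath resistance relative to plain BPSK spreading.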
Like any other wireless communication, satellite navigation also suffers from multipath in environments prone to it (e.g., urban canyons, indoors). The problem caused by multipath is even bigger in navigation than in communication since precise timing also needs to be resolved. The field of multipath mitigation is actively researched and new algorithms and architectures are presented frequently, for example, in [13–15].
Besides GNSS there are also other wireless commu-
nication technologies that are developing rapidly and the
direction of development is driven towards multipurpose
low cost receivers (user handsets) with enhanced capabilities
[16].
3. Overview of SDR GNSS Architectures
In this section we present three architectures for a Software-Defined Radio (SDR) GNSS receiver. A simplified definition of SDR is given in [17]: "Radio in which some or all of the physical layer functions are software defined."
The root SDR architecture was presented in [18]. Figure 2
illustrates an example of GNSS receiver functions mapped
on to this canonical architecture. Only the reception part of
the architecture is presented since current GNSS receivers are
not transmitting. Radio Frequency (RF) conversion handles
the signal processing before digitalization. The Intermediate
Frequency (IF) processing block transfers the frequency of
the received signal from IF to baseband and may also take
care of Doppler removal in GNSS. The baseband processing
segment handles the accurate timing and demodulation, thus
enabling the construction of the navigation data bits. The
division into IF and baseband sections can vary depending
on the chosen solution since the complex envelope of the
received signal can be handled in baseband also. Desired
navigation output (Position, Velocity, and Time (PVT)) is
solved in the last steps of the GNSS receiver chain.
EURASIP Journal on Embedded Systems 3
Figure 1: Principles for GNSS signal modulation in transmission and demodulation in reception. (Satellite side: the carrier, PRN code, and binary subcarrier modulate the transmitted signal; receiver side: replica carrier, replica subcarrier, and replica PRN recover the navigation data.)
Figure 2: Canonical SDR architecture adapted to GNSS, modified from [18]. (Receiver chain from antenna to output: LNA, downconversion, AGC, and A/D conversion in the radio; carrier wipeoff and code correlation with carrier and code NCOs in IF/baseband processing; then navigation processing and the user interface.)
Current state-of-the-art mass market receivers are based on a chipset or single-chip receiver [19]. The chipset or single-chip receiver is usually implemented as an Application Specific Integrated Circuit (ASIC). ASICs have high Nonrecurring Engineering (NRE) costs, but when produced in high volumes they have a very low price per unit. ASICs can also be optimized for small size and low power consumption. Both of these features are desired in handheld, battery-operated devices. On the other hand, ASICs are fixed solutions and impossible to reconfigure. Modifications to a design are also very expensive to realize with ASIC technology.
This approach has proven successful in mass market receivers because of its price and power consumption advantages, although it may not hold its position with the growing demand for flexibility and shortened time to market.
3.1. Hardware Accelerated SDR Receiver Architecture. The first SDR receiver architecture discussed in this paper is the approach where the most demanding parts of the receiver are implemented on a reconfigurable hardware platform, usually in the form of a Field Programmable Gate Array (FPGA) programmed with a Hardware Description Language (HDL). This architecture, comprising a radio front-end circuit, reconfigurable baseband hardware, and navigation software, is well known and presented in numerous publications, for example, [16, 20–22]. FPGAs have proved to be suitable for performing GNSS signal processing functions [23]. The building blocks for hardware accelerated SDR receivers are illustrated in Figure 3.
In this architecture the RF conversion is performed by an analog radio. The last step of the conversion transforms the signal from analog to digital format. IF processing and baseband functionalities are performed in accelerating hardware. The source, PVT in the GNSS case, is constructed in navigation processing.
The big advantage of reconfigurable FPGAs in comparison to ASIC technologies is the savings in design, NRE, and mask costs due to a shorter development cycle. The risk is also smaller with FPGAs, since possible bugs in the design can be fixed by upgrades later on. On the other hand, FPGAs are much higher in unit price and power consumption.
A true GNSS receiver poses some implementation challenges. The specifications are designed to be compatible (i.e., the systems do not interfere with each other too much), but true interoperability is reached at the receiver level. One example of an interoperative design challenge is the selection of the number of correlators and their spacing for tracking, since different modulations have different requirements for the correlator structure.
3.1.1. Challenges with the Radio Front End. Although the focus of this paper is mainly on baseband functions, the radio should not be forgotten. The block diagram of a GNSS single-frequency radio front end is given on the left-hand side of Figure 3. In the radio, the received signal is first amplified with the Low Noise Amplifier (LNA) and then, after the necessary filtering, downconverted to a low IF, for example, 4 MHz [24]. The signal is converted to a digital format after downconversion.
Figure 3: Hardware accelerated baseband architecture. From left to right: analog radio part (LNA, local oscillator and frequency synthesis, downconverter, automatic gain control (AGC), and A/D conversion in the radio front-end ASIC), reconfigurable baseband hardware (acquisition engine and tracking channels 1 to N on the FPGA), and navigation software running on a GPP with the user interface.
The challenges for GNSS radio design come from the increasing number of frequency bands. To call a receiver a true GNSS receiver, and also to get the best performance, more than one frequency band should be processed by the radio front end. Dual- and/or multifrequency receivers are likely choices for future receivers, and thus it is important to study potential architectures [25].
Another challenge comes from the increased bandwidth of the new signals. With increased bandwidth the radio becomes more vulnerable to interference. For mass market consumer products, the radio design should also meet certain price and power consumption requirements. Only solutions with reasonable price and power consumption will survive.
3.1.2. Baseband Processing. The fundamental signal processing for GNSS was presented in Figure 1. The carrier and code removal processes are illustrated in more detail in Figure 4. The incoming signal is divided into in-phase and quadrature-phase components by multiplying it with locally generated sine and cosine waves. Both phases are then correlated in identical branches with several closely delayed versions (for GPS: early, prompt, and late) of the locally generated PRN code [1]. The results are then integrated and fed to discriminator computation and a feedback filter. Numerically Controlled Oscillators (NCOs) are used to steer the local replicas.
An example of the different needs of the new GNSS signals is the addition of two correlator fingers (the bump-jumping algorithm) due to Galileo BOC modulation [26]. In Figure 4 the additional correlator components needed for Galileo tracking are marked with a darker shade. For the most part, the GPS and Galileo signals in the L1 band use the same components. The main difference is that, due to the BOC family modulation, Galileo needs additional correlators, very-early (VE) and very-late (VL), to remove the uncertainty in main peak location estimation [27]. The increased number of correlators translates into increased complexity, measured by the number of transistors in the final design [13].
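The tracking-channel structure of Figure 4 can be sketched in a few lines: carrier wipeoff with local sine/cosine replicas, multiplication by delayed PRN replicas at each correlator finger (VE, E, P, L, VL for Galileo; E, P, L for GPS), and integrate-and-dump. The Python toy model below is our own simplification, not the CRISP implementation:

```python
import math

def correlate_fingers(samples, fs, prn, code_rate, carrier_freq,
                      spacings=(-0.5, -0.25, 0.0, 0.25, 0.5)):
    # One integrate-and-dump pass of a tracking channel.
    # `spacings` are finger offsets in chips: VE, E, P, L, VL here;
    # a plain GPS channel would use only E, P, L.
    acc = [0j] * len(spacings)
    for n, s in enumerate(samples):
        t = n / fs
        # Carrier wipeoff: mix with the local complex carrier replica.
        lo = complex(math.cos(2 * math.pi * carrier_freq * t),
                     -math.sin(2 * math.pi * carrier_freq * t))
        base = s * lo
        chip_pos = t * code_rate  # local code phase in chips
        for k, d in enumerate(spacings):
            c = prn[int(chip_pos + d) % len(prn)]  # delayed PRN replica
            acc[k] += base * c
    return acc  # complex (I + jQ) accumulator per finger

# Toy run: a zero-IF signal equal to the PRN itself, 4 samples/chip.
prn = [1, -1, 1, 1, -1, -1, 1, -1]
samples = [prn[(n // 4) % len(prn)] for n in range(32)]
acc = correlate_fingers(samples, fs=8.0, prn=prn, code_rate=2.0,
                        carrier_freq=0.0)  # prompt finger peaks
```

In a hardware channel the same inner loop becomes a bank of multipliers and accumulators clocked at the sample rate, which is exactly the multiply-accumulate workload discussed below.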
The level of hardware acceleration depends on the selected algorithms. Acquisition is needed rarely compared to tracking, and thus it is more suitable for software implementation. FFT-based algorithms are also easier for a designer to implement in software, since hardware description languages usually lack direct support for floating-point arithmetic. Tracking, on the other hand, is a process consisting mostly of multiplication and accumulation using relatively small word lengths. What makes it more suitable for hardware implementation is that the number of these relatively simple computations is high, with a real-time deadline.
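FFT-based acquisition evaluates all code delays of one Doppler bin at once via circular correlation, corr = IFFT(FFT(x) * conj(FFT(replica))). The direct O(N^2) Python sketch below (toy-sized, our own illustration) computes the same search grid and picks the strongest (delay, Doppler) cell:

```python
import math

def acquire(samples, prn, fs, doppler_bins):
    # Search the (code delay, Doppler) grid by circular correlation.
    # Real receivers do the inner delay search with FFTs; this
    # direct O(N^2) version performs the same operation.
    n = len(samples)
    best = (0.0, 0, 0.0)  # (power, delay, doppler)
    for fd in doppler_bins:
        # Doppler wipeoff for this frequency bin.
        x = [s * complex(math.cos(2 * math.pi * fd * k / fs),
                         -math.sin(2 * math.pi * fd * k / fs))
             for k, s in enumerate(samples)]
        for delay in range(n):
            acc = sum(x[k] * prn[(k - delay) % n] for k in range(n))
            power = abs(acc) ** 2
            if power > best[0]:
                best = (power, delay, fd)
    return best

# Toy run: the received code is the local PRN rotated by 3 samples,
# so the peak should appear at delay 5 (= -3 mod 8), Doppler 0.
prn = [1, -1, 1, 1, -1, -1, 1, -1]
samples = [prn[(k + 3) % 8] for k in range(8)]
peak = acquire(samples, prn, fs=8.0, doppler_bins=[-1.0, 0.0, 1.0])
```

The FFT variant replaces the inner delay loop with three N-point transforms per Doppler bin, which is where the floating-point arithmetic mentioned above comes in.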
3.2. Ideal SDR GNSS Receiver Architecture. The ideal SDR is characterized by assigning all functions after the analog radio to a single processor [18]. In the ideal case all hardware problems are turned into software problems.
A fundamental block diagram of a software receiver is illustrated in Figure 5 [28]. The architecture of the radio front end is the same as that illustrated in Figure 3. After the radio, the digitized signals are fed to buffers for software usage. Then all of the digital signal processing, acquisition, and tracking functions are performed by software.
In the literature, for example, [28, 29], the justification and reasoning for SDR GNSS is strongly attributed to the well-known Moore's law, which states that the capacity of integrated circuits doubles every 18–24 months [30]. Ideal SDR solutions should become feasible if and when the available processing power increases. Currently reported SDR GPS receiver implementations work in real time only if the clock speed of the processor is from 900 MHz [31] to 3 GHz [29], which is too high for mobile devices but not, for example, for a laptop PC.
In recent years the availability of GNSS radio front ends with USB has improved, making the implementation of a pure software receiver on a PC platform quite straightforward. The area where pure software receivers have already made a breakthrough is postprocessing applications. Postprocessing with software receivers allows fast algorithm
Figure 4: GPS/Galileo tracking channel. (In-phase branch shown; the quadrature branch is identical. The carrier NCO drives the sin/cos carrier wipeoff; the code NCO and code generator drive the VE, E, P, L, and VL correlator fingers, whose integrate-and-dump outputs feed discriminator computation and filtering.)
Figure 5: Software receiver architecture. On the left-hand side: analog radio part (GNSS radio front-end ASIC); on the right-hand side: baseband (buffers and buffer control, acquisition engine, tracking channels 1 to N) and navigation implemented as software running on a GPP.
prototyping and signal analysis. Typical postprocessing applications are ionospheric monitoring, geodetic applications, and other scientific applications [21, 32].
Software is definitely more flexible than hardware when compared in terms of time to market, bill of materials, and reconfigurable implementation. But with a required clock frequency of around 1 GHz or more, the generated heat and battery life will be an issue for small handheld devices.
3.3. SDR with Multiple Cores. What about having an array of reconfigurable cores for baseband processing? In a multicore architecture, baseband processing is divided among multiple processing cores. This reduces the needed clock frequency to a range achievable by embedded devices and provides an increased level of parallelism, which also eases the workload per processing unit.
An example of a GNSS receiver architecture with the reconfigurable baseband approach is illustrated in Figure 6. In this example one of the four cores acts as an acquisition engine and the remaining three perform the tracking functions. A fixed role for each core is not desirable, since the need for acquisition and tracking varies over time. For example, when the receiver is turned on, all cores should be performing acquisition to guarantee the fastest possible Time To First Fix (TTFF). After satellites have been found, more of the acquisition cores are moved to the tracking task.
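A minimal sketch of such a reallocation policy, assuming interchangeable cores and a hypothetical `assign_cores` helper of our own (the actual CRISP mapping logic is not specified here):

```python
def assign_cores(num_cores, sats_tracked, min_acquirers=1):
    # Toy reallocation policy for interchangeable cores: at cold
    # start every core searches for satellites; each satellite
    # found converts one acquisition core into a tracking core,
    # keeping at least `min_acquirers` core(s) still searching.
    trackers = min(sats_tracked, num_cores - min_acquirers)
    return {"acquisition": num_cores - trackers, "tracking": trackers}

cold_start = assign_cores(4, sats_tracked=0)   # all 4 cores acquire
steady = assign_cores(4, sats_tracked=10)      # 1 acquires, 3 track
```

A real scheduler would also weigh signal strength, TTFF targets, and the dependability testing discussed later, but the cold-start-to-steady-state transition follows this shape.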
If (and when) manufactured in large volumes, the (properly scaled) array of processing cores can eventually be implemented in an ASIC circuit. This lowers the per-unit price and makes the solution more appealing for mass markets, while still being reconfigurable and having a high degree of flexibility.
In the next section we present one future realization of this architecture.
4. CRISP Platform
Cutting edge Reconfigurable ICs for Stream Processing (CRISP) [33] is a project in the Framework Programme 7 (FP7) of the European Union (EU). The objectives of CRISP are to research the optimal utilization, efficient programming, and dependability of a reconfigurable multiprocessor platform for streaming applications. The CRISP consortium is a good mixture of academic and industrial know-how, with partners Recore (NL), University of Twente (NL), Atmel (DE), Thales Netherlands (NL),
Figure 6: Software reconfigurable baseband receiver architecture. From left: analog radio part (GNSS radio front-end ASIC), baseband implemented on an array of reconfigurable cores (acquisition engine and tracking channels on the reconfigurable platform), and navigation software with user interface running on a GPP.
Tampere University of Technology (FI), and NXP (NL). The three-year project started at the beginning of 2008.
The reconfigurable CRISP platform designed and implemented within the project, also called the General Streaming Processor (GSP), will consist of two separate devices: the General Purpose Device (GPD) and the Reconfigurable Fabric Device (RFD). The GPD contains an off-the-shelf General Purpose Processor (GPP) with memories and peripheral connections, whereas the RFD consists of 9 reconfigurable cores. The array of reconfigurable cores is illustrated in Figure 7 [34], with R depicting a router.
The reconfigurable cores are Montium cores. (It was recently decided to use the Xentium processing tile as the reconfigurable core in the CRISP GSP. The Xentium has at least similar performance to the Montium with respect to cycle count, but is designed for better programmability, e.g., hardware supporting optimal software pipelining.) Montium [35] is a reconfigurable processing core. It has five Arithmetic and Logical Units (ALUs), each having two memories, resulting in a total of 10 internal memories. The cores communicate via a Network-on-Chip (NoC) which includes two global memories. The device interfaces to other devices and the outer world via standard interfaces.
Within the CRISP project, the GNSS receiver is one of the two applications designed as a proof of concept for the platform. The other is a radar beamforming application, which has much higher computational demands than a standalone GNSS receiver.
4.1. Specifying the GNSS Receiver for the Multicore Platform. In the CRISP project our intention is to specify, implement, and integrate a GNSS receiver application supporting GPS and Galileo L1 Open Service (OS) signals on the multicore platform. In this case the restriction to L1 band usage comes from the selected radio [24], but in principle the multicore approach can be extended to multifrequency receivers if a suitable radio front end is used.
4.1.1. Requirements for the Tile Processor. The requirements of the GNSS L1 application have been studied in [36]. The
Table 1: Estimation of GNSS baseband process complexity for a Montium Tile Processor running at 200 MHz, max performance of 1 GMAC/s [36].

Process               | Usage (MMAC/s) | Usage of TP (%)
Acquisition (GPS)     | 43.66          | 4.4
Acquisition (Galileo) | 196.15         | 19.6
Tracking (GPS)        | 163.67         | 16.4
Tracking (Galileo)    | 229.14         | 22.9
results, restated in Table 1, indicated that a single Montium core running at a 200 MHz clock speed is barely capable of executing the minimum required amount of acquisition and tracking processes. This analysis did not take into account the processing power needed for the baseband-to-navigation handover, nor navigation processing itself. From this it is evident that an array of cores (more than one) is needed for GNSS L1 purposes. The estimates given in Table 1 are based on the reported [35] performance of the Montium core. The acquisition figures are computed for a search speed of one satellite per second, and the tracking figures are for a single channel.
The results presented in Table 1 reflect the complexity of the processes when the input stream is sampled at 16.368 MHz, which is the output frequency of the radio front end selected for the CRISP platform [24]. This is approximately 16 times the navigation signal fundamental frequency of 1.023 MHz.
The GNSS application can also be used with a lower-rate input stream without a significant loss in application performance. For this paper, we analyzed the effect of input stream decimation on the complexity of the main baseband processes. The other parameters, such as the acquisition time, the number of frequency bins for acquisition, and the number of active correlators per channel for tracking, remained the same as in [36].
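Since both processes are dominated by per-sample multiply-accumulate work, a first-order model scales the MMAC/s figures of Table 1 linearly with the sample rate. The sketch below (our own back-of-the-envelope check, not the analysis of [36]) reproduces the tracking rows of Table 2 from Table 1 under this assumption; the acquisition rows deviate from the purely linear model:

```python
def tile_utilization_percent(mmacs, peak_gmacs=1.0):
    # Utilization of a tile with `peak_gmacs` GMAC/s peak rate.
    return 100.0 * mmacs / (peak_gmacs * 1000.0)

def decimated_load(mmacs_full_rate, factor):
    # First-order model: per-sample MAC work scales linearly with
    # the input sample rate, so decimation divides the load.
    return mmacs_full_rate / factor

# Tracking rows of Table 1 (16.368 MHz input) scaled by factor 4:
gps_tracking = decimated_load(163.67, 4)      # 40.92 MMAC/s
galileo_tracking = decimated_load(229.14, 4)  # ~57.28 MMAC/s
```

Both scaled values match the tracking rows reported for the decimated stream, which supports the linear-in-sample-rate reading of the tracking workload.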
Figures 8 and 9 illustrate the effect of decimation by factors 1, 2, 4, 8, and 16 on the utilization of the Montium Tile Processor. Decimation factor 1 equates to the case where no
Figure 7: Array of 9 reconfigurable cores [34] with an example mapping of the GNSS application illustrated (one acquisition core and tracking channels 0 to 5); the selection of cores is random. R depicts a router and IF an interface. The RFD also contains two smart memories, serial and parallel interfaces, a JTAG/test interface, and chip interfaces for RF front-end data in and channel data out.
decimation is applied, that is, the results shown in Table 1. The presented figures show how the complexity of both processes, measured as Montium Tile Processor utilization percentage, decreases exponentially as the decimation factor increases. The behavior is the same for GPS and Galileo signals, except that the utilization with Galileo signals is somewhat larger than with GPS in all studied cases.
To ease the computational load of the Tile Processor, decimation of the input stream seems to be a feasible choice. The amount of decimation should be sufficient to effect meaningful savings in TP utilization without significantly degrading the performance of the application. For the current GPS SPS signal, decimation by a factor of 4 (4.092 MHz) is feasible without a significant loss in receiver performance. A factor of 8 (2.046 MHz) is equal to the Nyquist rate for 1.023 MHz, which is the PRN code rate used in the GPS SPS signal.
In the Galileo case, a decimation factor of 4 is the maximum. This is because with a sampling frequency of approximately 4 MHz the BOC(1,1) component of the Galileo E1 OS signal can still be received with a maximum loss of only 0.9 dB, when compared with the reception of the whole MBOC bandwidth [12]. (This also applies to the modern GPS L1C signals, but they are not specified in our application [36].)
Table 2: Estimation of GNSS baseband process complexity with a decimated (by factor 4) input stream. Montium Tile Processor running at 200 MHz, max performance of 1 GMAC/s.

Process               | Usage (MMAC/s) | Usage of TP (%)
Acquisition (GPS)     | 9.57           | 0.96
Acquisition (Galileo) | 43.66          | 4.37
Tracking (GPS)        | 40.92          | 4.09
Tracking (Galileo)    | 57.28          | 5.73
In the ideal case, the decimation of the input stream would change with the receiver mode (GPS/Galileo). Since in CRISP the decimation of the radio stream will be implemented as hardware in the FPGA that connects the radio to the parallel interface of the final CRISP prototype platform, run-time configuration of the decimation factor is not feasible. For this reason, in the rest of the paper we focus on the scenario where a fixed decimation factor of 4 is used, resulting in a stream sample rate of 4.092 MHz.
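A behavioral sketch of such a fixed-factor decimator (our own illustration; the real one is an FPGA filter in front of the parallel interface, whose filter design is not detailed here):

```python
def decimate_by(samples, factor):
    # Behavioral model of the front-end decimator: average
    # `factor` consecutive samples (a crude anti-alias low-pass)
    # and keep one output value per group.  A hardware decimator
    # would use a proper low-pass filter before downsampling.
    out = []
    for i in range(0, len(samples) - factor + 1, factor):
        out.append(sum(samples[i:i + factor]) / factor)
    return out

# 32 toy samples standing in for the 16.368 MHz stream; decimation
# by 4 yields the 4.092 MHz rate used in the rest of the paper.
stream = [1.0, 1.0, -1.0, -1.0] * 8
reduced = decimate_by(stream, 4)
```

Note how the toy stream, a square wave at half the input rate, averages to zero after decimation: energy above the new Nyquist frequency is (crudely) suppressed rather than aliased.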
Table 2 shows the baseband complexity estimates for the case when the input stream is decimated by a factor of four. When compared with the original complexity figures shown in Table 1, it can be seen that the utilization of the TP is over four times smaller.
Figure 8: Acquisition process utilization of Montium Tile Processor resources as a function of the decimation factor of the input stream (GPS and Galileo curves, decimation factors 2 to 16, utilization 0–20%).
Figure 9: Tracking process utilization of Montium Tile Processor resources as a function of the decimation factor of the input stream (GPS and Galileo curves, decimation factors 2 to 16, utilization 0–25%).
4.1.2. Requirements for the Network-on-Chip. To analyze the multicore GNSS receiver application we built a functional software receiver in the C++ language, running on a PC. A detailed analysis of the software receiver is given in a subsequent paper [37].
In our SW receiver each process was implemented as a software thread. By approximating one process per core, this approach enabled us to estimate the link payload by logging the communication between the threads.
We estimated a scenario where one core was allocated to the acquisition process and six cores were mapped to the tracking process. This scenario is illustrated in Figure 7. Digitized RF front-end data is input to the NoC via an interface.
Figure 10: Link payloads for the GPS acquisition process (a) and the average payload of the GPS tracking processes (b), in bytes/ms over the 5-second simulation.
A specific chip interface is used to connect the RFD to the GPD, and it is used to forward channel data (channel phase measurement data related to pseudorange measurements, and navigation data) to the GPD. The selected mapping is a compromise between the minimal operative setup (one acquisition and four tracking cores) and the needs of the dependability testing processes, where individual cores may be taken offline for testing purposes.
The scenario was simulated with a prerecorded set of real GPS signals. Since signal sources for Galileo navigation were not available, the Galileo case was not tested. The link payloads caused by the cores communicating while the software ran for 5 seconds are illustrated in Figure 10.
The results show that, in GPS mode, our GNSS application causes a payload on each link/processing core with a constant baseline of 4096 bytes/millisecond. This is caused by the radio front-end input, that is, the incoming signal. In this scenario we used real GPS front-end data sampled at 4.092 MHz, each byte representing one sample. This sampling rate is also equal to the potential decimation scenario discussed earlier.
With a higher sampling rate the link payload baseline would be raised, but on the other hand one byte can be preprocessed to contain more than one sample, decreasing the traffic caused by the radio front-end input.
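For example, with 2-bit quantization four samples fit into one byte, so a 16.368 MHz stream would load the links no more than the 4.092 MHz one-byte-per-sample stream used in the simulation. A hypothetical packing routine (ours, for illustration):

```python
def pack_2bit(samples):
    # Pack four 2-bit sample codes (values 0..3) into each byte,
    # least significant pair first.
    out = bytearray()
    for i in range(0, len(samples), 4):
        byte = 0
        for j, s in enumerate(samples[i:i + 4]):
            byte |= (s & 0x3) << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack_2bit(packed, count):
    # Inverse of pack_2bit: recover `count` sample codes.
    return [(packed[i // 4] >> (2 * (i % 4))) & 0x3
            for i in range(count)]

codes = [0, 1, 2, 3, 3, 2, 1, 0]      # eight 2-bit samples
packed = pack_2bit(codes)             # two bytes instead of eight
```

The receiving core pays a small unpacking cost per sample, which trades NoC bandwidth for compute, the usual choice when the front-end stream dominates the traffic.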
The first peak in Figure 10(a) is caused by the acquisition process output. When the GNSS application starts, FFT-based acquisition is started and the results are ready after 60 milliseconds, after which they are transmitted to the tracking channels. This peak is also the largest individual payload event caused by the GNSS application.
After a short initialization period the tracking processes start to produce channel output. The average of the simulated GPS tracking link/processing core payloads is illustrated in Figure 10(b). Every 20 milliseconds a navigation data symbol (the data rate is 50 Hz in GPS) is transmitted, and once a second a higher transmission peak is caused by the loop
phase measurement data, which is transmitted to the GPD for pseudorange estimation.
In Galileo mode, the payload caused by the incoming signal will be equal, since the same radio input is used for both GPS and Galileo. However, the transmission of data symbols will cause a bigger payload, since the data rate of the Galileo E1 signals is 250 symbols per second [8]. The Galileo phase measurement rate will remain the same as in GPS mode.
From the results it can be seen that the link payload caused by the incoming RF signal is the largest one in both operating modes; if the link payload needs to be optimized, its reduction is the first thing to study. The results also indicate that when the GNSS application is running smoothly, the link payloads it causes are predictable.
Note that this estimate does not contain any overheads caused by the network protocol or any data other than navigation-related data (dependability, real-time mapping of the processes). These issues will be studied in our future work.
4.2. Open Issues. Besides the additional network load caused by traffic other than the GNSS application itself, some other issues remain open. There may be challenges in designing software for a multicore environment. Power consumption, as well as the final bill of materials (BOM) (i.e., the final price of the multicore product), remains an open issue at the time of this writing. In the future these issues will be studied and suitable optimizations performed after the prototyping and proofs of concept have been completed successfully.
5. Conclusions
In this paper we discussed three Software-Defined Radio (SDR) architectures for a Global Navigation Satellite System (GNSS) receiver. The usage of flexible architectures in GNSS receivers was justified by the need to implement support for upcoming navigation systems and newly developed algorithms, especially for multipath mitigation. The hardware accelerated SDR architecture is quite close to the current mass market solutions; there the ASIC is replaced with a reconfigurable piece of hardware, usually an FPGA. The second architecture, the ideal (or pure) SDR receiver, uses a single processor to realize all necessary signal processing functions. Real-time receivers remain a challenge, but postprocessing applications are already taking advantage of this architecture.
The third architecture, SDR with multiple cores, is a novel approach for GNSS receivers. This approach benefits from both a high degree of flexibility and, when properly designed and scaled, a reasonably low unit price in high-volume production. In this paper we also presented the CRISP project, where such a multicore architecture will be realized, along with an analysis of the GNSS application requirements for the multicore platform.
We extended the previously published analysis of processing tile utilization to cover the effect of input stream decimation. Decimation by a factor of four seems to offer a good compromise between core utilization and application performance.
We implemented a software GNSS receiver with its processes implemented as threads and used it to analyze the GNSS application communication payload on individual links. This analysis indicated that the incoming signal represents the largest part of the communication in the network between processing cores.
Acknowledgments
The authors want to thank Stephen T. Burgess from Tampere University of Technology for his useful comments on the manuscript. This work was supported in part by the FUGAT project funded by the Finnish Funding Agency for Technology and Innovation (TEKES). Parts of this research were conducted within the FP7 Cutting edge Reconfigurable ICs for Stream Processing (CRISP) project (ICT-215881) supported by the European Commission.
References
[1] E. D. Kaplan and C. J. Hegarty, Eds., Understanding GPS: Principles and Applications, Artech House, Boston, Mass, USA, 2nd edition, 2006.
[2] J. Benedicto, S. E. Dinwiddy, G. Gatti, R. Lucas, and M. Lugert, GALILEO: Satellite System Design and Technology Developments, European Space Agency, November 2000.
[3] S. Revnivykh, "GLONASS status and progress," December 2008, http://www.oosa.unvienna.org/pdf/icg/2008/icg3/04.pdf.
[4] G. Gibbons, "International system providers meeting (ICG-3) reflects GNSS's competing interests, cooperative objectives," Inside GNSS, December 2008.
[5] U.S. Air Force, "GPS Modernization Fact Sheet," 2006, http://pnt.gov/public/docs/2006/modernization.pdf.
[6] M. S. Braasch and A. J. van Dierendonck, "GPS receiver architectures and measurements," Proceedings of the IEEE, vol. 87, no. 1, pp. 48–64, 1999.
[7] K. Borre, D. M. Akos, N. Bertelsen, P. Rinder, and S. H. Jensen, A Software-Defined GPS and Galileo Receiver: A Single-Frequency Approach, Birkhäuser, Boston, Mass, USA, 2007.
[8] Galileo Open Service, "Signal in space interface control document (OS SIS ICD)," Draft 1, February 2008.
[9] "Interface Specification: Navstar GPS Space Segment/User Segment L1C Interfaces," IS-GPS-800, August 2007.
[10] R. D. Fontana, W. Cheung, and T. Stansell, "The modernized L2C signal: leaping forward into the 21st century," GPS World, pp. 28–34, September 2001.
[11] Galileo Joint Undertaking, Galileo Open Service, "Signal in space interface control document (OS SIS ICD)," GJU, May 2006.
[12] G. W. Hein, J.-A. Avila-Rodriguez, S. Wallner, et al., "MBOC: the new optimized spreading modulation recommended for GALILEO L1 OS and GPS L1C," in Proceedings of the IEEE/ION Position, Location, and Navigation Symposium (PLANS '06), pp. 883–892, San Diego, Calif, USA, April 2006.
[13] H. Hurskainen, E. S. Lohan, X. Hu, J. Raasakka, and J. Nurmi, "Multiple gate delay tracking structures for GNSS signals and their evaluation with Simulink, SystemC, and VHDL," International Journal of Navigation and Observation, vol. 2008, Article ID 785695, 17 pages, 2008.
[14] S. Kim, S. Yoo, S. Yoon, and S. Y. Kim, "A novel unambiguous multipath mitigation scheme for BOC(kn, n) tracking in GNSS," in Proceedings of the International Symposium on Applications and the Internet Workshops, p. 57, 2007.
[15] F. Dovis, M. Pini, and P. Mulassano, "Multiple DLL architecture for multipath recovery in navigation receivers," in Proceedings of the 59th IEEE Vehicular Technology Conference (VTC '04), vol. 5, pp. 2848–2851, May 2004.
[16] F. Dovis, A. Gramazio, and P. Mulassano, "SDR technology applied to Galileo receivers," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GPS '02), Portland, Ore, USA, September 2002.
[17] SDR Forum, January 2009, http://www.sdrforum.org.
[18] J. Mitola, "The software radio architecture," IEEE Communications Magazine, 1995.
[19] P. G. Mattos, "A single-chip GPS receiver," GPS World, October 2005.
[20] P. J. Mumford, K. Parkinson, and A. G. Dempster, "The Namuru open GNSS research receiver," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '06), vol. 5, pp. 2847–2855, Fort Worth, Tex, USA, September 2006.
[21] S. Ganguly, A. Jovancevic, D. A. Saxena, B. Sirpatil, and S. Zigic, "Open architecture real time development system for GPS and Galileo," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '04), pp. 2655–2666, Long Beach, Calif, USA, September 2004.
[22] H. Hurskainen, T. Paakki, Z. Liu, J. Raasakka, and J. Nurmi, "GNSS receiver reference design," in Proceedings of the 4th Advanced Satellite Mobile Systems Conference (ASMS '08), pp. 204–209, Bologna, Italy, August 2008.
[23] J. Hill, "Navigation signal processing with FPGAs," in Proceedings of the National Technical Meeting of the Institute of Navigation, pp. 420–427, June 2004.
[24] Atmel, "GPS Front End IC ATR0603," Datasheet, 2006.
[25] M. Detratti, E. Lopez, E. Perez, and R. Palacio, "Dual-frequency RF front-end solution for hybrid Galileo/GPS mass market receivers," in Proceedings of the IEEE Consumer Communications and Networking Conference (CCNC '08), pp. 603–607, Las Vegas, Nev, USA, January 2008.
[26] P. Fine and W. Wilson, "Tracking algorithms for GPS offset carrier signals," in Proceedings of the ION National Technical Meeting (NTM '99), San Diego, Calif, USA, January 1999.
[27] H. Hurskainen and J. Nurmi, "SystemC model of an interoperative GPS/Galileo code correlator channel," in Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS '06), pp. 327–332, Banff, Canada, October 2006.
[28] D. M. Akos, "The role of Global Navigation Satellite System (GNSS) software radios in embedded systems," GPS Solutions, May 2003.
[29] C. Dionisio, L. Cucchi, and R. Marracci, "SOFTREC G3: software receiver and signal analysis for GNSS bands," in Proceedings of the 10th IEEE International Symposium on Spread Spectrum Techniques and Applications (ISSSTA '08), Bologna, Italy, August 2008.
[30] G. E. Moore, "Cramming more components onto integrated circuits," Proceedings of the IEEE, vol. 86, no. 1, pp. 82–85, 1998.
[31] S. Söderholm, T. Jokitalo, K. Kaisti, H. Kuusniemi, and H. Naukkarinen, "Smart positioning with Fastrax's software GPS receiver solution," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '08), pp. 1193–1200, Savannah, Ga, USA, September 2008.
[32] J. H. Won, T. Pany, and G. W. Hein, "GNSS software defined radio: real receiver or just a tool for experts?" Inside GNSS, pp. 48–56, July-August 2006.
[33] CRISP Project, December 2008, http://www.crisp-project.eu.
[34] P. Heysters, "CRISP Project Presentation," June 2008, http://www.crisp-project.eu/images/publications/D6.1 CRISP project presentation 080622.pdf.
[35] P. M. Heysters, G. K. Rauwerda, and L. T. Smit, "A flexible, low power, high performance DSP IP core for programmable systems-on-chip," in Proceedings of IP/SoC, Grenoble, France, December 2005.
[36] H. Hurskainen, J. Raasakka, and J. Nurmi, "Specification of GNSS application for multiprocessor platform," in Proceedings of the International Symposium on System-on-Chip (SOC '08), pp. 128–133, Tampere, Finland, November 2008.
[37] J. Raasakka, H. Hurskainen, and J. Nurmi, "Modeling multicore software GNSS receiver with real time SW receiver," in Proceedings of the International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS '09), Savannah, Ga, USA, September 2009.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 598529, 13 pages
doi:10.1155/2009/598529
Research Article
An Open Framework for Rapid Prototyping of
Signal Processing Applications
Maxime Pelcat,1 Jonathan Piat,1 Matthieu Wipliez,1 Slaheddine Aridhi,2 and Jean-François Nezan1
1IETR/Image and Remote Sensing Group, CNRS UMR 6164/INSA Rennes, 20, avenue des Buttes de Coesmes, 35043 Rennes Cedex, France
2HPMP Division, Texas Instruments, 06271 Villeneuve Loubet, France
Correspondence should be addressed to Maxime Pelcat, mpelcat@insa-rennes.fr
Received 27 February 2009; Revised 7 July 2009; Accepted 14 September 2009
Recommended by Markus Rupp
Embedded real-time applications in communication systems have significant timing constraints, thus requiring multiple
computation units. Manually exploring the potential parallelism of an application deployed on multicore architectures is greatly
time-consuming. This paper presents an open-source Eclipse-based framework which aims to facilitate the exploration and
development processes in this context. The framework includes a generic graph editor (Graphiti), a graph transformation library
(SDF4J) and an automatic mapper/scheduler tool with simulation and code generation capabilities (PREESM). The input of the
framework is composed of a scenario description and two graphs: one graph describes an algorithm and the second graph describes
an architecture. The rapid prototyping results of a 3GPP Long-Term Evolution (LTE) algorithm on a multicore digital signal
processor illustrate both the features and the capabilities of this framework.
Copyright © 2009 Maxime Pelcat et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
The recent evolution of digital communication systems
(voice, data, and video) has been dramatic. Over the last two
decades, low data-rate systems (such as dial-up modems, first
and second generation cellular systems, 802.11 Wireless local
area networks) have been replaced or augmented by systems
capable of data rates of several Mbps, supporting multimedia
applications (such as DSL, cable modems, 802.11b/a/g/n
wireless local area networks, 3G, WiMax and ultra-wideband
personal area networks).
As communication systems have evolved, the resulting
increase in data rates has necessitated a higher system algo-
rithmic complexity. A more complex system requires greater
flexibility in order to function with different protocols in
different environments. Additionally, there is an increased
need for the system to support multiple interfaces and
multicomponent devices. Consequently, this requires the
optimization of device parameters over varying constraints
such as performance, area, and power. Achieving this
device optimization requires a good understanding of the
application complexity and the choice of an appropriate
architecture to support this application.
An embedded system commonly contains several pro-
cessor cores in addition to hardware coprocessors. The
embedded system designer needs to distribute a set of signal
processing functions onto a given hardware with predefined
features. The functions are then executed as software code
on the target architecture; this action will be called a deployment
in this paper. A common approach to implement a parallel
algorithm is the creation of a program containing several
synchronized threads in which execution is driven by the
scheduler of an operating system. Such an implementation
does not meet the hard timing constraints required by real-
time applications and the memory consumption constraints
required by embedded systems [1]. One-time manual
scheduling developed for single-processor applications is
also not suitable for multiprocessor architectures: manual
data transfers and synchronizations quickly become very
complex, leading to wasted time and potential deadlocks.
Furthermore, the task of finding an optimal deployment of
an algorithm mapped onto a multicomponent architecture
is not straightforward. When performed manually, the result
is inevitably a suboptimal solution. These issues raise the
need for new methodologies, which allow the exploration of
several solutions, to achieve a more optimal result.
Several features must be provided by a fast prototyping
process: description of the system (hardware and software),
automatic mapping/scheduling, simulation of the execu-
tion, and automatic code generation. This paper draws on
previously presented works [2–4] in order to generate a
more complete rapid prototyping framework. This complete
framework is composed of three complementary tools based
on Eclipse [5] that provide a full environment for the
rapid prototyping of real-time embedded systems: Parallel
and Real-time Embedded Executives Scheduling Method
(PREESM), Graphiti and Synchronous Data Flow for Java
(SDF4J). This framework implements the methodology
Algorithm-Architecture Matching (AAM), which was previ-
ously called Algorithm-Architecture Adequation (AAA) [6].
The focus of this rapid prototyping activity is currently
static code mapping/scheduling but dynamic extensions are
planned for future generations of the tool.
From the graph descriptions of an algorithm and of
an architecture, PREESM can find the right deployment,
provide simulation information, and generate a framework
code for the processor cores [2]. These rapid prototyping
tasks can be combined and parameterized in a workflow.
In PREESM, a workflow is defined as an oriented graph
representing the list of rapid prototyping tasks to execute
on the input algorithm and architecture graphs in order
to determine and simulate a given deployment. A rapid
prototyping process in PREESM consists of a succession of
transformations. These transformations are associated in a
data flow graph representing a workflow that can be edited in
a Graphiti generic graph editor. The PREESM input graphs
may also be edited using Graphiti. The PREESM algorithm
models are handled by the SDF4J library. The framework can
be extended by modifying the workflows or by connecting
new plug-ins (for compilation, graph analyses, and so on).
In this paper, the differences between the proposed
framework and related works are explained in Section 2.
The framework structure is described in Section 3. Section 4
details the features of PREESM that can be combined by
users in workows. The use of the framework is illustrated by
the deployment of a wireless communication algorithm from
the 3rd Generation Partnership Project (3GPP) Long-Term
Evolution (LTE) standard in Section 5. Finally, conclusions
are given in Section 6.
2. State of the Art of Rapid Prototyping and
Multicore Programming
There exist numerous solutions to partition algorithms
onto multicore architectures. If the target architecture is
homogeneous, several solutions exist which generate mul-
ticore code from C with additional information (OpenMP
[7], CILK [8]). In the case of heterogeneous architectures,
languages such as OpenCL [9] and the Multicore Association
Application Programming Interface (MCAPI [10]) define
ways to express parallel properties of a code. However,
they are not currently linked to efficient compilers and
runtime environments. Moreover, compilers for such lan-
guages would have difficulty in extracting and solving the
bottlenecks of the implementation that appear inherently in
graph descriptions of the architecture and the algorithm.
The Poly-Mapper tool from PolyCore Software [11]
offers functionalities similar to PREESM but, in contrast
to PREESM, its mapping/scheduling is manual. Ptolemy II
[12] is a simulation tool that supports many models of
computation. However, it also has no automatic mapping
and currently its code generation for embedded systems
focuses on single-core targets. Another family of frameworks
existing for data ow based programming is based on
CAL [13] language and it includes OpenDF [14]. OpenDF
employs a more dynamic model than PREESM but its
related code generation does not currently support multicore
embedded systems.
Closer to PREESM are the Model Integrated Computing
(MIC [15]), the Open Tool Integration Environment (OTIE
[16]), the Synchronous Distributed Executives (SynDEx
[17]), the Dataow Interchange Format (DIF [18]), and
SDF for Free (SDF3 [19]). Neither MIC nor OTIE can
be accessed online. According to the literature, MIC focuses
on the transformation between algorithm domain-specific
models and metamodels, while OTIE defines a single system
description that can be used during the whole signal
processing design cycle.
DIF is designed as an extensible repository of repre-
sentation, analysis, transformation, and scheduling of data
flow languages. DIF is a Java library which allows the user
to go from graph specification using the DIF language to
C code generation. However, the hierarchical Synchronous
Data Flow (SDF) model used in the SDF4J library and
PREESM is not available in DIF.
SDF3 is an open-source tool implementing some data
flow models and providing analysis, transformation, visu-
alization, and manual scheduling as a C++ library. SDF3
implements the Scenario Aware Data Flow (SADF [20]), and
provides a Multiprocessor System-on-Chip (MP-SoC) bind-
ing/scheduling algorithm to output MP-SoC configuration
files.
SynDEx and PREESM are both based on the AAM
methodology [6] but the tools do not provide the same
features. SynDEx is not open source; it has its own model
of computation that does not support schedulability analysis,
and code generation is possible but not provided with the
tool. Moreover, the architecture model of SynDEx is at too
high a level to account for the bus contentions and DMA used
in modern chips (multicore processors or MP-SoCs) in the
mapping/scheduling.
The features that differentiate PREESM from the related
works and similar tools are
(i) the tool is open source and accessible online;
(ii) the algorithm description is based on a single well-
known and predictable model of computation;
Figure 1: An Eclipse-based Rapid Prototyping Framework: the Graphiti generic graph editor Eclipse plug-in, the SDF4J data flow graph transformation library, and the PREESM rapid prototyping Eclipse plug-ins (core, graph transformation, scheduler, and code generator), all hosted in the Eclipse framework.
(iii) the mapping and the scheduling are totally auto-
matic;
(iv) the functional code for heterogeneous multicore
embedded systems can be generated automatically;
(v) the algorithm model provides a helpful hierar-
chical encapsulation thus simplifying the map-
ping/scheduling [3].
The PREESM framework structure is detailed in the next
section.
3. An Open-Source Eclipse-Based Rapid
Prototyping Framework
3.1. The Framework Structure. The framework structure is
presented in Figure 1. It is composed of several tools to
increase reusability in several contexts.
The first step of the process is to describe both the target
algorithm and the target architecture graphs. A graphical
editor reduces the development time required to create,
modify and edit those graphs. The role of Graphiti [21] is
to support the creation of algorithm and architecture graphs
for the proposed framework. Graphiti can also be quickly
configured to support any type of file format used for
generic graph descriptions.
The algorithm is currently described as a Synchronous
Data Flow (SDF [22]) Graph. The SDF model is a good
solution to describe algorithms with static behavior. The
SDF4J [23] is an open-source library providing usual
transformations of SDF graphs in the Java programming
language. The extensive use of SDF and its derivatives in
the programming model community led to the development
of SDF4J as an external tool. Due to the greater specificity
of the architecture description compared to the algorithm
description, it was decided to perform the architecture
transformation inside the PREESM plug-ins.
The PREESM project [24] involves the development of a
tool that performs the rapid prototyping tasks. The PREESM
tool uses the Graphiti tool and SDF4J library to design
algorithm and architecture graphs and to generate their
transformations. The PREESM core is an Eclipse plug-in that
executes sequences of rapid prototyping tasks or workflows.
The tasks of a workflow are delegated to PREESM plug-
ins. There are currently three PREESM plug-ins: the graph
transformation plug-in, the scheduler plug-in, and the code-
generation plug-in.
The three tools of the framework are detailed in the next
sections.
3.2. Graphiti: A Generic Graph Editor for Editing Architectures,
Algorithms, and Workflows. Graphiti is an open-source plug-
in for the Eclipse environment that provides a generic graph
editor. It is written using the Graphical Editor Framework
(GEF). The editor is generic in the sense that any type
of graph may be represented and edited. Graphiti is used
routinely with the following graph types and associated file
formats: CAL networks [13, 25], a subset of IP-XACT [26],
GraphML [27], and PREESM workflows [28].
3.2.1. Overview of Graphiti. A type of graph is registered
within the editor by a configuration. A configuration is an
XML (Extensible Markup Language [29]) file that describes
(1) the abstract syntax of the graph (types of vertices
and edges, and attributes allowed for objects of each
type);
(2) the visual syntax of the graph (colors, shapes, etc.);
(3) transformations from the file format in which the
graph is defined to Graphiti's XML file format G, and
vice versa (Figure 2);
Two kinds of input transformations are supported, from
XML to XML and from text to XML (Figure 2). XML is
transformed to XML with Extensible Stylesheet Language
Transformation (XSLT [30]), and text is parsed to its Con-
crete Syntax Tree (CST) represented in XML according to an
LL(k) grammar by the Grammatica [31] parser. Similarly,
two kinds of output transformations are supported, from
XML to XML and from XML to text.
Graphiti handles attributed graphs [32]. An attributed
graph is defined as a directed multigraph G = (V, E, τ) with
V the set of vertices and E the multiset of edges (there can be
more than one edge between any two vertices). τ is a function
τ : ({G} ∪ V ∪ E) × A → U that associates instances with
attributes from the attribute name set A and values from U,
the set of possible attribute values. A built-in type attribute
is defined so that each instance i ∈ {G} ∪ V ∪ E has a type
t = τ(i, type), and only admits attributes from a set A_t ⊆ A
Figure 2: Input/output with Graphiti's XML format G: (a) input transformations, from XML through XSLT transformations, or from text parsed to an XML CST, to G; (b) output transformations, from G through XSLT transformations to XML or text.
Figure 3: A sample graph: vertices produce, do something, and consume, connected through the ports out, acc, and in.
given by A_t = λ(t). Additionally, a type t has a visual syntax
σ(t) that defines its color, shape, and size.
To edit a graph, the user selects a file and the matching
configuration is computed based on the file extension. The
transformations defined in the configuration file are then
applied to the input file and result in a graph defined in
Graphiti's XML format G, as shown in Figure 2. The editor
uses the visual syntax defined by σ in the configuration to
draw the graph, vertices, and edges. For each instance of type
t, the user can edit the relevant attributes allowed by λ(t)
as defined in the configuration. Saving a graph consists of
writing the graph in G and transforming it back to the input
file's native format.
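The type-restricted attribute function τ described above can be sketched in a few lines of Java (purely illustrative code under assumed names; Graphiti's actual classes differ):

```java
import java.util.*;

// Illustrative sketch of an attributed graph with a type-restricted
// attribute function (hypothetical code, not Graphiti's implementation).
class AttributedGraph {
    // A_t: the attribute names admitted for each type t.
    private final Map<String, Set<String>> allowedAttributes = new HashMap<>();
    // tau: instance id -> (attribute name -> value).
    private final Map<String, Map<String, Object>> tau = new HashMap<>();

    void declareType(String type, Set<String> attributeNames) {
        allowedAttributes.put(type, attributeNames);
    }

    void addInstance(String id, String type) {
        Map<String, Object> attrs = new HashMap<>();
        attrs.put("type", type); // the built-in "type" attribute
        tau.put(id, attrs);
    }

    void setAttribute(String id, String name, Object value) {
        String type = (String) tau.get(id).get("type");
        if (!allowedAttributes.get(type).contains(name)) {
            throw new IllegalArgumentException(name + " is not allowed for type " + type);
        }
        tau.get(id).put(name, value);
    }

    Object getAttribute(String id, String name) {
        return tau.get(id).get(name);
    }
}
```

In this sketch, a vertex of type node would admit only the attributes declared for that type in its configuration.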
3.2.2. Editing a Configuration for a Graph Type. To create a
configuration for the graph represented in Figure 3, a node (a
single type of vertex) must be defined. A node has a unique
identifier called id, and accepts a list of values initially equal to
[0] (Figure 4). Additionally, ports need to be specified on the
edges, so the configuration describes an edgeType element
(Figure 5) that carries sourcePort and targetPort parameters
to store an edge's source and target ports, respectively, such
as acc, in, and out in Figure 3.
Graphiti is a stand-alone tool, totally independent of
PREESM. However, Graphiti generates workflow graphs,
IP-XACT and GraphML files that are the main inputs of
PREESM. The GraphML files contain the algorithm model.
These inputs are loaded and stored in PREESM by the SDF4J
library. This library, discussed in the next section, executes
the graph transformations.
3.3. SDF4J: A Java Library for Algorithm Data Flow Graph
Transformations. SDF4J is a library defining several Data
Flow oriented graph models such as SDF and Directed
Acyclic Graph (DAG [33]). It provides the user with several
classic SDF transformations, such as hierarchy flattening and
<vertexType name="node">
    <attributes>
        <color red="163" green="0" blue="85"/>
        <shape name="roundedBox"/>
        <size width="40" height="40"/>
    </attributes>
    <parameters>
        <parameter name="id" type="java.lang.String" default=""/>
        <parameter name="values" type="java.util.List">
            <element value="0"/>
        </parameter>
    </parameters>
</vertexType>
Figure 4: The type of vertices of the graph shown in Figure 3.
<edgeType name="edge">
    <attributes>
        <directed value="true"/>
    </attributes>
    <parameters>
        <parameter name="sourcePort" type="java.lang.String" default=""/>
        <parameter name="targetPort" type="java.lang.String" default=""/>
    </parameters>
</edgeType>
Figure 5: The type of edges of the graph shown in Figure 3.
SDF to Homogeneous SDF (HSDF [34]) transformations
and some clustering algorithms. This library also gives the
possibility to expand optimization templates. It defines its
own graph representation based on the GraphML standard
and provides the associated parser and exporter class. SDF4J
is freely available (GPL license) for download.
3.3.1. SDF4J SDF Graph Model. An SDF graph is used
to simplify the application specifications. It allows the
representation of the application behavior at a coarse grain
level. This data flow representation models the application
operations and specifies the data dependencies between these
operations.
An SDF graph is a finite directed, weighted graph G =
⟨V, E, d, p, c⟩ where:
(i) V is the set of nodes. A node computes an input data
stream and outputs the result;
(ii) E ⊆ V × V is the edge set, representing channels
which carry data streams;
(iii) d : E → ℕ ∪ {0} is a function with d(e) the number
of initial tokens on an edge e;
(iv) p : E → ℕ is a function with p(e) representing the
number of data tokens produced at e's source to be
carried by e;
Figure 6: An SDF graph: four operations op1 to op4 with the production and consumption rates annotated on the edges.
(v) c : E → ℕ is a function with c(e) representing the
number of data tokens consumed from e by e's sink
node.
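The rates p and c determine how many times each node must fire in one iteration of the graph: the basic repetition vector q is the smallest positive integer solution of the balance equations p(e) · q(src(e)) = c(e) · q(snk(e)) for every edge e. A minimal solver can be sketched as follows (hypothetical illustrative code, not the SDF4J API; it assumes a connected, consistent graph):

```java
import java.util.*;

// Balance-equation solver sketch for a connected, consistent SDF graph
// (illustrative code, not SDF4J's API).
class Edge {
    final int src, snk, p, c; // source node, sink node, production rate, consumption rate
    Edge(int src, int snk, int p, int c) { this.src = src; this.snk = snk; this.p = p; this.c = c; }
}

class RepetitionVector {
    // Smallest positive integer q with p(e) * q[src] == c(e) * q[snk] for all edges.
    static int[] compute(int nodes, List<Edge> edges) {
        // Propagate rational firing rates num/den from node 0 over the graph.
        long[] num = new long[nodes], den = new long[nodes];
        num[0] = 1; den[0] = 1;
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Edge e : edges) {
                if (den[e.src] != 0 && den[e.snk] == 0) {
                    num[e.snk] = num[e.src] * e.p; den[e.snk] = den[e.src] * e.c; changed = true;
                } else if (den[e.snk] != 0 && den[e.src] == 0) {
                    num[e.src] = num[e.snk] * e.c; den[e.src] = den[e.snk] * e.p; changed = true;
                }
            }
        }
        // Scale by the LCM of the denominators, then normalize by the GCD.
        long lcm = 1;
        for (long d : den) lcm = lcm / gcd(lcm, d) * d;
        long g = 0;
        long[] scaled = new long[nodes];
        for (int i = 0; i < nodes; i++) { scaled[i] = num[i] * (lcm / den[i]); g = gcd(g, scaled[i]); }
        int[] q = new int[nodes];
        for (int i = 0; i < nodes; i++) q[i] = (int) (scaled[i] / g);
        return q;
    }

    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
}
```

For a two-node graph whose single edge produces three tokens and consumes one, as in the SDF part of Figure 7, the solver returns q = [1, 3].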
This model offers strong compile-time predictability
properties, but has limited expressive capability. The SDF
implementation enabled by SDF4J supports the hierarchy
defined in [3], which increases the model expressiveness. This
specific implementation is straightforward for the program-
mer and allows user-defined structural optimizations. This
model is also intended to lead to a better code generation
using common C patterns like loop and function calls. It is
highly expandable as the user can associate any properties
to the graph components (edge, vertex) to produce a
customized model.
3.3.2. SDF4J SDF Graph Transformations. SDF4J implements
several algorithms intended to transform the base model or
to optimize the application behavior at different levels.
(i) The hierarchy flattening transformation aims to flatten
the hierarchy (remove hierarchy levels) at the chosen
depth in order to later extract as much parallelism as
possible from the designer's hierarchical descrip-
tion.
(ii) The HSDF transformation (Figure 7) transforms the
SDF model to an HSDF model in which the number
of tokens exchanged on each edge is homogeneous
(production = consumption). This model reveals
all the potential parallelism in the application but
dramatically increases the number of vertices in the
graph.
(iii) The internalization transformation based on [35]
is an efficient clustering method minimizing the
number of vertices in the graph without decreasing
the potential parallelism in the application.
(iv) The SDF to DAG transformation converts the SDF or
HSDF model to the DAG model which is commonly
used by scheduling methods [33].
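To make the growth caused by the HSDF transformation concrete: each vertex v is duplicated q(v) times (q being the basic repetition vector) and every edge is split into unit-rate edges, one per produced token. The resulting graph size can be sketched as follows (self-contained illustrative code, not the SDF4J API):

```java
// Size of the HSDF expansion of an SDF graph (illustrative sketch, not SDF4J).
// q[v] is the basic repetition vector; p[e] and src[e] give each edge's
// production rate and source vertex index.
class HsdfSize {
    static int vertexCount(int[] q) {
        int total = 0;
        for (int repetitions : q) total += repetitions; // vertex v appears q[v] times
        return total;
    }

    static int edgeCount(int[] p, int[] src, int[] q) {
        int total = 0;
        for (int e = 0; e < p.length; e++)
            total += p[e] * q[src[e]]; // one unit-rate edge per token produced
        return total;
    }
}
```

For the graph of Figure 7 (q = [1, 3], one edge producing three tokens), this yields four vertices and three unit-rate edges.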
3.4. PREESM: A Complete Framework for Hardware and Soft-
ware Codesign. In the framework, the role of the PREESM
tool is to perform the rapid prototyping tasks. Figure 8
depicts an example of a classic workflow which can be
executed in the PREESM tool. As seen in Section 3.3, the
data flow model chosen to describe applications in PREESM
is the SDF model. This model, described in [22], has the
great advantage of enabling the formal verification of static
schedulability. The typical number of vertices to schedule in
Figure 7: An SDF graph and its HSDF transformation: op2 is repeated so that every edge exchanges a single token per firing (production = consumption).
PREESM is between one hundred and several thousand. The
architecture is described using IP-XACT language, an IEEE
standard from the SPIRIT consortium [26]. The typical size
of an architecture representation in PREESM is between a
few cores and several dozen cores. A scenario is defined as a
set of parameters and constraints that specify the conditions
under which the deployment will run.
As can be seen in Figure 8, prior to entering the
scheduling phase, the algorithm goes through three trans-
formation steps: the hierarchy flattening transformation,
the HSDF transformation, and the DAG transformation
(see Section 3.3.2). These transformations prepare the graph
for the static scheduling and are provided by the Graph
Transformation Module (see Section 4.1). Subsequently, the
DAG-converted SDF graph is processed by the scheduler
[36]. As a result of the deployment by the scheduler, a
code is generated and a Gantt chart of the execution is
displayed. The generated code consists of scheduled function
calls, synchronizations, and data transfers between cores. The
functions themselves are handwritten.
The plug-ins of the PREESM tool implement the rapid
prototyping tasks that a user can add to the workflows. These
plug-ins are detailed in the next section.
4. The Current Features of PREESM
4.1. The Graph Transformation Module. In order to generate
an efficient schedule for a given algorithm description, the
application defined by the designer must be transformed.
The purpose of this transformation is to reveal the potential
parallelism of the algorithm and simplify the work of the
task scheduler. To provide the user with flexibility while
optimizing the design, the entire graph transformation
provided by the SDF4J library can be instantiated in a
workflow with parameters allowing the user to control each
of the three transformations. For example, the hierarchical
flattening transformation can be configured to flatten a
given number of hierarchy levels (depth) in order to keep
some of the user hierarchical construction and to maintain
the number of vertices to schedule at a reasonable level.
The HSDF transformation provides the scheduler with a
graph of high potential parallelism as all the vertices of the
SDF graph are repeated according to the SDF graph's basic
repetition vector. Consequently, the number of vertices to
schedule is larger than in the original graph. The clustering
transformation prepares the algorithm for the scheduling
process by grouping vertices according to criteria such as
strong connectivity or strong data dependency between
Figure 8: Example of a workflow graph: from SDF and IP-XACT descriptions to the generated code. The Graphiti editor provides the architecture (IP-XACT), algorithm, and scenario inputs; within the PREESM framework, the hierarchical SDF graph undergoes hierarchy flattening, the HSDF transformation, and the SDF to DAG transformation before mapping/scheduling, whose DAG plus implementation information yields a Gantt chart and drives the code generation.
vertices. The grouped vertices are then transformed into a
hierarchical vertex which is then treated as a single vertex
in the scheduling process. This vertex grouping reduces the
number of vertices to schedule, speeding up the scheduling
process. The user can freely use available transformations in
his workflow in order to control the criteria for optimizing
the targeted application and architecture.
As can be seen in the workflow displayed in Figure 8,
the graph transformation steps are followed by the static
scheduling step.
4.2. The PREESM Static Scheduler. Scheduling consists of
statically distributing the tasks that constitute an application
between available cores in a multicore architecture and
minimizing parameters such as final latency. This problem
has been proven to be NP-complete [37]. A static scheduling
algorithm is usually described as a monolithic process, and
carries out two distinct functionalities: choosing the core to
execute a specific function and evaluating the cost of the
generated solutions.
The PREESM scheduler splits these functionalities into
three submodules [4] which share minimal interfaces: the
task scheduling, the edge scheduling, and the Architecture
Benchmark Computer (ABC) submodules. The task schedul-
ing submodule produces a scheduling solution for the
application tasks mapped onto the architecture cores and
then queries the ABC submodule to evaluate the cost of the
proposed solution. The advantage of this approach is that any
task scheduling heuristic may be combined with any ABC
model, leading to many different scheduling possibilities. For
instance, an ABC minimizing the deployment memory or
energy consumption can be implemented without modifying
the task scheduling heuristics.
The interface offered by the ABC to the task scheduling
submodule is minimal. The ABC gives the number of avail-
able cores, receives a deployment description, and returns
costs to the task scheduling (infinite if the deployment is
impossible). The time keeper calculates and stores timings
for the tasks and the transfers when necessary for the ABC.
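This narrow contract can be pictured as a small Java interface (hypothetical names; the actual PREESM classes differ):

```java
// Sketch of the minimal scheduler/ABC interface described in the text
// (illustrative, not PREESM's actual API).
class Deployment {
    // Maps each task name to the core chosen for it.
    final java.util.Map<String, Integer> taskToCore = new java.util.HashMap<>();
}

interface ArchitectureBenchmarkComputer {
    int availableCoreCount();

    // Cost of a (possibly partial) deployment; Long.MAX_VALUE stands for
    // an infinite cost, meaning the deployment is impossible.
    long evaluateCost(Deployment deployment);
}
```

Any task scheduling heuristic can then be paired with any cost model behind such an interface, which is the point of the submodule split.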
The ABC needs to schedule the edges in order to calculate
the deployment cost. However, it is not designed to make
any deployment choices; this task is delegated to the edge
scheduling submodule. The router in the edge scheduling
submodule finds potential routes between the available cores.
The choice of module structure was motivated by
the behavioral commonality of the majority of scheduling
algorithms (see Figure 9).
4.2.1. Scheduling Heuristics. Three algorithms are currently
coded, and are modified versions of the algorithms described
in [38].
(i) A list scheduling algorithm schedules tasks in the
order dictated by a list constructed from estimating
a critical path. Once a mapping choice has been
made, it will never be modified. This algorithm is
fast but has limitations due to this last property.
List scheduling is used as a starting point for other
refinement algorithms.
(ii) The FAST algorithm is a refinement of the list
scheduling solution which uses probabilistic hops. It
changes the mapping choices of randomly chosen
tasks; that is, it associates these tasks to another
processing unit. It runs until stopped by the user
and keeps the best latency found. The algorithm is
multithreaded to exploit the multicore parallelism of
a host computer.
(iii) A genetic algorithm is coded as a refinement of the
FAST algorithm. The n best solutions of FAST are
used as the base population for the genetic algorithm.
The user can stop the processing at any time while
retaining the last best solution. This algorithm is also
multithreaded.
The FAST algorithm has been developed to solve complex
deployment problems. In the original heuristic, the final
order of tasks to schedule, as defined by the list scheduling
algorithm, was not modified by the FAST algorithm. The
FAST algorithm only modifies the mapping choices of the
tasks. In large-scale applications, the initial order of the
tasks performed by the list scheduling algorithm becomes
occasionally suboptimal. In the modified version of the FAST
scheduling algorithm, the ABC recalculates the final order of
a task when the heuristic maps a task to a new core. The task
switcher algorithm used to recalculate the order simply looks
for the earliest appropriately sized hole in the core schedule
for the mapped task (see Figure 10).
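The hole search used by the task switcher can be sketched as follows (illustrative code, not the actual PREESM implementation): given the busy intervals already scheduled on a core, sorted by start time, it returns the earliest start at which a task of the given duration fits.

```java
// Earliest-hole search sketch for the task switcher (illustrative, not PREESM code).
class EarliestHole {
    // busyStart/busyEnd: intervals already occupied on the core, sorted by start time.
    static long find(long[] busyStart, long[] busyEnd, long duration) {
        long candidate = 0;
        for (int i = 0; i < busyStart.length; i++) {
            if (busyStart[i] - candidate >= duration) {
                return candidate;          // the hole before interval i is large enough
            }
            candidate = Math.max(candidate, busyEnd[i]);
        }
        return candidate;                  // otherwise append after the last interval
    }
}
```

The same search can be reused by the switching edge scheduler of Section 4.2.4, applied to a medium schedule instead of a core schedule.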
4.2.2. Scheduling Architecture Model. The current architec-
ture representation was driven by the need to accurately
model multicore architectures and hardware coprocessors
with intercore message-passing communication. This com-
munication is handled in parallel to the computation using
Direct Memory Access (DMA) modules. This model is
currently used to closely simulate the Texas Instruments
TMS320TCI6487 processor (see Section 5.3.2). The model
will soon be extended to shared memory communications
and more complex interconnections. The term operator
represents either a processor core or a hardware coprocessor.
Operators are linked by media, each medium representing a
bus and the associated DMA. The architectures can be either
homogeneous (with all operators and media identical) or
heterogeneous. For each medium, the user defines a DMA
set up time and a bus data rate. As shown in Figure 9,
the architecture model is only processed in the scheduler
by the ABC and not by the heuristic and edge scheduling
submodules.
4.2.3. Architecture Benchmark Computer. Scheduling often
requires much time. Testing intermediate solutions with
precision is an especially time-consuming operation. The
ABC submodule was created by reusing the useful concept
of time scalability introduced in SystemC Transaction Level
Figure 9: Scheduler module structure: from the DAG, the IP-XACT architecture, and the scenario, the task scheduling submodule exchanges task schedules and costs with the Architecture Benchmark Computer (ABC) and its time keeper; the ABC delegates edge schedules to the router of the edge scheduling submodule.
Modeling (TLM) [39]. This language defines several levels of
system temporal simulation, from untimed to cycle-accurate
precision. This concept motivated the development of several
ABC latency models with different timing precisions. Three
ABC latency models are currently coded (see Figure 11).
(i) The loosely-timed model takes into account task and
transfer times but no transfer contention.
(ii) The approximately-timed model associates each inter-
core communication medium with its constant rate
and simulates contentions.
(iii) The accurately-timed model adds set up times which
simulate the duration necessary to initialize a parallel
transfer controller like Texas Instruments Enhanced
Direct Memory Access (EDMA [40]). This set up
time is scheduled in the core which sends the transfer.
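With the medium parameters of the architecture model (a bus data rate and a DMA set up time per medium), the transfer component of these models can be approximated as below (a sketch under those assumptions; the formulas are illustrative, not PREESM's exact cost functions):

```java
// Transfer-time sketch for the ABC latency models (illustrative formulas).
class TransferCost {
    // Approximately-timed: the transfer runs at the medium's constant rate.
    static double approximatelyTimed(long bytes, double bytesPerCycle) {
        return bytes / bytesPerCycle;
    }

    // Accurately-timed: adds the DMA set up time, which is scheduled
    // on the core that sends the transfer.
    static double accuratelyTimed(long bytes, double bytesPerCycle, double setUpCycles) {
        return setUpCycles + approximatelyTimed(bytes, bytesPerCycle);
    }
}
```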
The task and architecture properties feeding the ABC
submodule are evaluated experimentally, and include media
data rate, setup times, and task timings. ABC models
evaluating parameters other than latency are planned in
order to minimize memory size, memory accesses, cadence
(i.e., average runtime), and so on. Currently, only latency
is minimized due to the limitations of the list scheduling
algorithms: these costs cannot be evaluated on partial
deployments.
4.2.4. Edge Scheduling Submodule. When a data block is
transferred from one operator to another, transfer tasks are
added and then mapped to the corresponding medium. A
route is associated with each edge carrying data from one
operator to another, which possibly may go through several
other operators. The edge scheduling submodule routes the
edges and schedules their route steps. The existing routing
process is basic and will be developed further once the
architecture model has been extended. Edge scheduling can
be executed with different algorithms of varying complexity,
which results in another level of scalability. Currently, two
algorithms are implemented:
(i) the simple edge scheduler follows the scheduling order
given by the task list provided by the list scheduling
algorithm;
Figure 10: Switchable scheduling heuristics (task scheduling can
use list scheduling, genetic algorithms, or the FAST algorithm,
driven by latency, cadence, or memory; edge scheduling is
currently only latency-driven; the choices trade speed against
accuracy).
(ii) the switching edge scheduler reuses the task switcher
algorithm discussed in Section 4.2.1 for edge scheduling.
When a new communication edge needs to be
scheduled, the algorithm looks for the earliest hole of
appropriate size in the medium schedule.
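The earliest-hole search can be sketched as follows (an illustrative reimplementation, not the PREESM source):

```python
def earliest_hole(busy_intervals, duration, t0=0):
    """Return the earliest start time at which a transfer of `duration`
    fits on the medium, given its busy (start, end) intervals."""
    t = t0
    for start, end in sorted(busy_intervals):
        if start - t >= duration:   # the hole before this interval is big enough
            return t
        t = max(t, end)
    return t                        # otherwise, append after the last interval
```

For example, with the medium busy on [0, 10) and [15, 30), a 5-cycle transfer fits in the hole starting at time 10.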
The scheduler framework enables the comparison of
different edge scheduling algorithms using the same task
scheduling submodule and architecture model description.
The main advantage of the scheduler structure is the
independence of scheduling algorithms from cost type and
benchmark complexity.
4.3. Generating a Code from a Static Schedule. Using the
AAM methodology from [6], a code can be generated from
the static scheduling of the input algorithm on the input
architecture (see workflow in Figure 8). This code consists
of an initialization phase and a loop endlessly repeating the
algorithm graph. From the deployment generated by the
scheduler, the code generation module generates a generic
representation of the code in XML. The specific code for
the target is then obtained after an XSLT transformation.
The code generation flow for a Texas Instruments tri-core
processor TMS320TCI6487 (see Section 5.3.2) is illustrated
by Figure 12.
PREESM currently supports the C64x and C64x+ based
processors from Texas Instruments with DSP-BIOS Oper-
ating System [41] and the x86 processors with Windows
Operating System. The supported intercore communication
schemes include TCP/IP with sockets, Texas Instruments
EDMA3 [42], and RapidIO link [43].
An actor is a task with no hierarchy. A function must
be associated with each actor and the prototype of the
function must be defined to add the right parameters in the
right order. A CORBA Interface Definition Language (IDL)
file is associated with each actor in PREESM. An example
of an IDL file is shown in Figure 13. This file gives the
generic prototypes of the initialization and loop function
calls associated with a task. IDL was chosen because it is a
language-independent way to express an interface.
Figure 11: Switchable ABC models (ordered from fast to accurate:
loosely-timed with unscheduled communication, approximately-
timed adding bus contention, and accurately-timed adding bus
contention plus setup times; memory and cadence are further cost
dimensions).
Depending on the type of medium between the operators
in the PREESM architecture model, the XSLT transformation
generates calls to the appropriate predefined communication
library. Specific code libraries have been developed to
manage the communications and synchronizations between
the target cores [2].
5. Rapid Prototyping of a Signal Processing
Algorithm from the 3GPP LTE Standard
The framework functionalities detailed in the previous
sections are now applied to the rapid prototyping of a
signal processing application from the 3GPP LTE radio access
network physical layer.
5.1. The 3GPP LTE Standard. The 3GPP [44] is a group
formed by telecommunication organizations to standardize
the third generation (3G) mobile phone system specification.
This group is currently developing a new standard: the Long-
Term Evolution (LTE) of the 3G. The aim of this standard is
to bring data rates of tens of megabits per second to wireless
devices. The communication between the User Equipment
(UE) and the evolved base station (eNodeB) starts when the
user equipment (UE) requests a connection to the eNodeB
via random access preamble (Figure 14). The eNodeB then
allocates radio resources to the user for the rest of the random
access procedure and sends a response. The UE answers
with an L2/L3 message containing an identification number.
Finally, the eNodeB sends back the identification number
of the connected UE. If several UEs send the same random
access preamble at the same time, only one connection
is granted and the other UEs must send a new
random access preamble. After the random access procedure,
the eNodeB allocates resources to the UE and uplink and
downlink logical channels are created to exchange data
continuously. The decoding algorithm, at the eNodeB, of
the UE random access preamble is studied in this section.
This algorithm is known as the Random Access CHannel
Preamble Detection (RACH-PD).
Figure 12: Code generation (the scheduler maps the algorithm
onto an architecture model of three c64x+ operators connected by
a medium, producing a deployment; code generation emits one
XML file per core (Proc1.xml, Proc2.xml, Proc3.xml), which an
XSL transformation (C64x+.xsl) combines with the IDL prototypes,
the communication libraries, and the actors' code to produce C
files (Proc1.c, Proc2.c, Proc3.c), compiled by the TI Code Composer
compiler into one executable per core).
module antenna_delay {
typedef long cplx;
typedef short param;
interface antenna_delay {
void init(in cplx antIn);
void loop(in cplx antIn,
out char waitOut, in param antSize);
};
};
Figure 13: Example of an IDL prototype.
Figure 14: Random access procedure (UE to eNodeB: random
access preamble; eNodeB to UE: random access response; UE to
eNodeB: L2/L3 message; eNodeB to UE: message for early
contention resolution).
5.2. The RACH Preamble Detection. The RACH is a
contention-based uplink channel used mainly in the initial
transmission requests from the UE to the eNodeB for
connection to the network. The UE, seeking connection
with a base station, sends its signature in a RACH preamble
dedicated time and frequency window in accordance with a
predefined preamble format. Signatures have special auto-
correlation and intercorrelation properties that maximize the
ability of the eNodeB to distinguish between different UEs.
The RACH preamble procedure implemented in the LTE
eNodeB can detect and identify each user's signature and is
dependent on the cell size and the system bandwidth. Assume
that the eNodeB has the capacity to handle the processing of
this RACH preamble detection every millisecond in a worst-
case scenario.

Figure 15: The random access slot structure (a RACH burst of
n ms within the preamble bandwidth, containing a twice-repeated
N-sample preamble bounded by the guard periods GP1 and GP2).
The preamble is sent over a specified time-frequency
resource, denoted as a slot, available with a certain cycle
period and a fixed bandwidth. Within each slot, a Guard
Period (GP) is reserved at each end to maintain time
orthogonality between adjacent slots [45]. This preamble-
based random access slot structure is shown in Figure 15.
The case study in this article assumes a RACH-PD for
a cell size of 115 km. This is the largest cell size supported
by LTE and is also the case requiring the most processing
power. According to [46], preamble format no. 3 is used
with 21,012 complex samples as a cyclic prefix for GP1,
followed by a preamble of 24,576 samples followed by the
same 24,576 samples repeated. In this case the slot duration
is 3 ms which gives a GP2 of 21,996 samples. As per Figure 16,
the algorithm for the RACH preamble detection can be
summarized in the following steps [45].
(1) After the cyclic prefix removal, the preprocessing
(Preproc) function isolates the RACH bandwidth by
shifting the data in frequency and filtering it with
downsampling. It then transforms the data into the
frequency domain.
(2) Next, the circular correlation (CirCorr) function
correlates data with several prestored preamble root
sequences (or signatures) in order to discriminate
between simultaneous messages from several users. It
also applies an IFFT to return to the temporal domain
and calculates the energy of each root sequence
correlation.
Figure 16: Random Access Channel Preamble Detection (RACH-
PD) algorithm. For each antenna (#1 to N) and preamble
repetition (#1 to P), the RACH preprocessing chain comprises the
antenna interface, frequency shift, FIR (bandpass filter), DFT, and
subcarrier demapping; the RACH circular correlation chain, applied
per root sequence (#1 to R), comprises ZC root sequence
multiplication, zero padding, IFFT, power computation, and power
accumulation, followed by noise floor estimation and peak search.
(3) Then, the noise floor threshold (NoiseFloorThr)
function collects these energies and estimates the
noise level for each root sequence.
(4) Finally, the peak search (PeakSearch) function detects
all signatures sent by the users in the current time
window. It additionally evaluates the transmission
timing advance corresponding to the approximate
user distance.
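The four steps above can be summarized by the following skeleton; the stage functions are passed in as parameters because their real implementations (FIR filtering, DFT, IFFT, energy computation, etc.) are beyond the scope of this sketch, and the names are ours:

```python
def rach_pd(antenna_streams, root_sequences,
            preprocess, correlate_energy, noise_floor, peak_search):
    """Schematic RACH-PD pipeline: the four detection steps described above.
    The stage callables are placeholders for the real signal processing."""
    # (1) Preproc: isolate the RACH bandwidth, go to the frequency domain
    spectra = [preprocess(s) for s in antenna_streams]
    # (2) CirCorr: correlation energy per prestored root sequence (signature)
    energies = {r: correlate_energy(spectra, r) for r in root_sequences}
    # (3) NoiseFloorThr: noise level estimate per root sequence
    thresholds = {r: noise_floor(e) for r, e in energies.items()}
    # (4) PeakSearch: detected signatures (timing advance omitted here)
    return peak_search(energies, thresholds)
```

In the 115 km case studied below, this skeleton would be instantiated with 4 antenna streams, 64 root sequences, and 2 preamble repetitions.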
In general, depending on the cell size, three parameters
of RACH may be varied: the number of receive antennas,
the number of root sequences, and the number of times the
same preamble is repeated. The 115 km cell case implies 4
antennas, 64 root sequences, and 2 repetitions.
5.3. Architecture Exploration
5.3.1. Algorithm Model. The goal of this exploration is to
determine through simulation the architecture best suited
to the 115 km cell RACH-PD algorithm. The RACH-PD
algorithm behavior is described as a SDF graph in PREESM.
A static deployment enables static memory allocation, thus
removing the need for runtime memory administration. The
algorithm can easily be adapted to different configurations
by tuning the HSDF parameters. Using the same approach as
in [47], a valid schedule derived from the representation in
Figure 16 can be described by the compact expression:
(8 Preproc)(4(64(InitPower
(2((SingleZCProc)(PowAcc))))PowAcc))
(64 NoiseFloorThreshold)PeakSearch
We can separate the preamble detection algorithm into 4
steps:
(1) preprocessing step: (8 Preproc),
(2) circular correlation step: (4(64(InitPower
(2((SingleZCProc)(PowAcc))))PowAcc)),
(3) noise floor threshold step: (64 NoiseFloorThreshold),
(4) peak search step: PeakSearch.
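Expanding the repetition counts in this expression reproduces the figure of 1,357 operations quoted later in this section (communication operations excluded). The function name and its parameter mapping are ours:

```python
def rach_pd_operation_count(preproc=8, antennas=4, roots=64, repetitions=2):
    # (8 Preproc)
    preprocessing = preproc
    # (4(64(InitPower(2((SingleZCProc)(PowAcc))))PowAcc)): per antenna,
    # 64 x (1 InitPower + 2 x (SingleZCProc + PowAcc)) plus 1 final PowAcc
    circular_correlation = antennas * (roots * (1 + repetitions * 2) + 1)
    # (64 NoiseFloorThreshold)
    noise_floor = roots
    # PeakSearch
    peak_search = 1
    return preprocessing + circular_correlation + noise_floor + peak_search
```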
Each of these steps is mapped onto the available cores
and will appear in the exploration results detailed in
Section 5.3.4.

Figure 17: The four architectures explored (one to four
homogeneous C64x+ cores; the multicore variants communicate
through the EDMA).

The given description generates 1,357 operations; this
does not include the communication operations
necessary in the case of multicore architectures. Placing
these operations by hand onto the different cores would
be extremely time-consuming. As seen in Section 4.2, the
rapid prototyping PREESM tool offers automatic scheduling,
avoiding the problem of manual placement.
5.3.2. Architecture Exploration. The four architectures
explored are shown in Figure 17. The cores are all
homogeneous Texas Instruments TMS320C64x+ Digital
Signal Processors (DSPs) running at 1 GHz [48]. The
connections are made via DMA links. The first architecture
is a single-core DSP such as the TMS320TCI6482. The
second architecture is dual-core, with each core similar to
that of the TMS320TCI6482. The third is a tri-core and
is equivalent to the new TMS320TCI6487 [40]. Finally,
the fourth architecture, a quad-core, is a theoretical
architecture for exploration only. The exploration goal
is to determine the number of cores required to run the
RACH-PD algorithm in a 115 km cell and how best to
distribute the operations on the given cores.
5.3.3. Architecture Model. To solve the deployment problem,
each operation is assigned an experimental timing (in
terms of CPU cycles). These timings are measured with
deployments of the actors on a single C64x+.

Figure 18: Timings of the RACH-PD algorithm schedule on target
architectures (1 core, and 2, 3, and 4 cores + EDMA, each
simulated with the loosely timed, approximately timed, and
accurately timed models, against the real-time limit of 4 ms).

Since the
C64x+ is a 32-bit fixed-point DSP core, the algorithms must
be converted from floating-point to fixed-point prior to
these deployments. The EDMA is modelled as a nonblocking
medium (see Section 4.2.2) transferring data at a constant
rate and with a given setup time. Assuming the EDMA has
the same performance from the L2 internal memory to the
L2 internal memory as the EDMA3 of the TMS320TCI6482
(see [42]), the transfer of N bytes via EDMA should take
approximately transfer(N) = 135 + (N/3.375) cycles.
Consequently, in the PREESM model, the average data rate
used for simulation is 3.375 GBytes/s and the EDMA setup
time is 135 cycles.
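In code, this cost model reads:

```python
def edma_transfer_cycles(nbytes, setup_cycles=135, bytes_per_cycle=3.375):
    """EDMA transfer model used for simulation: a fixed 135-cycle setup plus
    N bytes at 3.375 bytes per cycle (3.375 GBytes/s on a 1 GHz core)."""
    return setup_cycles + nbytes / bytes_per_cycle
```

For instance, a 3,375-byte transfer costs 135 + 1,000 = 1,135 cycles, so the setup overhead dominates only for small transfers.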
5.3.4. Architecture Choice. The PREESM automatic schedul-
ing process is applied for each architecture. The workflow
used is close to that of Figure 8. The simulation results
obtained are shown in Figure 18. The list scheduling heuris-
tic is used with loosely-timed, approximately-timed, and
accurately-timed ABCs. Due to the 115 km cell constraints,
preamble detection must be processed in less than 4 ms.
The experimental timings were measured on code exe-
cutions using a TMS320TCI6487. The timings feeding the
simulation are measured in loops, each calling a single
function with L1 cache activated. For more details about
C64x+ cache, see [48]. This represents the application
behavior when local data access is ideal and will lead to
an optimistic simulation. The RACH application is well
suited for a parallel architecture, as the addition of one core
reduces the latency dramatically. Two cores can process the
algorithm within a time frame close to the real-time deadline
with the loosely- and approximately-timed models, but high
data transfer contention and the high number of transfers
disqualify it when the accurately-timed model is used.
The 3-core solution is clearly the best one: its CPU loads
(less than 86% with accurately-timed ABC) are satisfactory
and do not justify the use of a fourth core, as can be seen
in Figure 18. The high data contention in this case study
justifies the use of several ABC models: simple models for
fast results, and more complex models to correctly dimension
the system.

Figure 19: TMS320TCI6487 architecture (a chip with three
subsystems GEM 0, GEM 1, and GEM 2, each containing a C64x+
core with local L2 memory, connected through switched central
resources (SCR) to the EDMA3, inter-core interruptions, hardware
semaphores, and the DDR2 external memory).
5.4. Code Generation. Code libraries developed for the
TMS320TCI6487 and code automatically generated by
PREESM (see Section 4.3) were used in this experiment.
Details of the code libraries and code optimizations are
given in [2]. The architecture of the TMS320TCI6487 is
shown in Figure 19. The communication between the cores
is performed by copying data with the EDMA3 from one
core's local L2 memory to another core's L2 memory. The cores
are synchronized using intercore interruptions. Two modes
are available for memory sharing: in symmetric mode,
each CPU has 1 MByte of L2 memory, while in asymmetric
mode, core-0 has 1.5 MByte, core-1 has 1 MByte, and core-2
has 0.5 MByte.
From the PREESM generated code, the sizes of the
statically allocated buffers are 1.65 MBytes for one core,
1.25 MBytes for a second core, and 200 kBytes for a third
core. The asymmetric mode is chosen to fit this memory
distribution. As the necessary memory is larger than the
internal L2, some buffers are manually chosen to go in
external memory and the L2 cache [40] is activated. A
memory minimization ABC in PREESM would help this
process, targeting memory objectives while mapping
the actors on the cores.
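A quick arithmetic check of the figures above (with a helper name of our own) shows why external memory is needed: the buffers for the first two cores exceed even the asymmetric L2 split.

```python
def l2_deficits(buffer_mbytes, l2_mbytes):
    """Per-core shortfall of local L2 memory, in MBytes (0 when buffers fit)."""
    return [max(0.0, need - have) for need, have in zip(buffer_mbytes, l2_mbytes)]

# Generated buffers: 1.65, 1.25, 0.2 MBytes; asymmetric L2: 1.5, 1.0, 0.5 MBytes
deficits = l2_deficits([1.65, 1.25, 0.2], [1.5, 1.0, 0.5])
```

Only the third core's buffers fit entirely in its local L2; cores 0 and 1 must spill some buffers to DDR2.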
Modeling the RACH-PD algorithm in PREESM while
varying the architectures (1, 2, 3, and 4 cores) enabled
the exploration of multiple solutions under the criterion
of meeting the stringent latency requirement. Once the
target architecture is chosen, PREESM can be set up to
generate a framework code for the simulated solution. As
highlighted and explained in the previous paragraph, the
buffers statically allocated by the generated code were larger
than the physical memory of the target architecture. This
necessitated manually moving some of the noncritical buffers
to external memory.

Figure 20: Execution of the RACH-PD algorithm on a
TMS320TCI6487 (the preprocessing tasks and the circular
correlations of 32 signatures each are distributed over CPU 0,
CPU 1, and CPU 2, with noise floor estimation and peak search
completing each 4 ms period at maximal cadence).

This generated code, representing a
priori a good deployment solution, when executed on the
target had an average load of 78% per core while meeting
the real-time deadline. Hence, the goal of decoding a RACH-
PD every 4 ms on the TMS320TCI6487 is successfully
accomplished. A simplified view of the code execution is
shown in Figure 20. The execution of the generated code
led to a realistic assessment of a deployment very close to
that predicted with the accurately-timed ABC, where the
simulation had shown an average load per core of around
80%. These results show that prototyping the application
with PREESM allows different solutions to be assessed by
simulation and gives the designer a realistic picture of the
multicore solution before solving complex mapping problems.
This global result needs to be tempered: one week's effort of
manual memory optimization and some manual constraints
were necessary to obtain such a fast deployment. New ABCs
computing the costs of semaphores for synchronizations
and the memory balance between the cores will reduce this
manual optimization time.
6. Conclusions
The intent of this paper was to detail the functionalities
of a rapid prototyping framework comprising the Graphiti,
SDF4J, and PREESM tools. The main features of the frame-
work are the generic graph editor, the graph transformation
module, the automatic static scheduler, and the code genera-
tor. With this framework, a user can describe and simulate
the deployment, choose the most suitable architecture for
the algorithm, and generate an efficient framework code.
The framework has been successfully tested on the RACH-PD
algorithm from the 3GPP LTE standard. The RACH-PD
algorithm with 1,357 operations was deployed on a tri-core
DSP and the simulation was validated by the generated code
execution. In the near future, an increasing number of CPUs
will be available in complex Systems on Chips. Developing
methodologies and tools to efficiently partition code on these
architectures is thus an increasingly important objective.
References
[1] E. A. Lee, The problem with threads, Computer, vol. 39, no.
5, pp. 33–42, 2006.
[2] M. Pelcat, S. Aridhi, and J. F. Nezan, Optimization of
automatically generated multi-core code for the LTE RACH-
PD algorithm, in Proceedings of the Conference on Design
and Architectures for Signal and Image Processing (DASIP 08),
Bruxelles, Belgium, November 2008.
[3] J. Piat, S. S. Bhattacharyya, M. Pelcat, and M. Raulet, Multi-
core code generation from interface based hierarchy, in
Proceedings of the Conference on Design and Architectures for
Signal and Image Processing (DASIP 09), Sophia Antipolis,
France, September 2009.
[4] M. Pelcat, P. Menuet, S. Aridhi, and J.-F. Nezan, Scalable
compile-time scheduler for multi-core architectures, in Pro-
ceedings of the Conference on Design and Architectures for Signal
and Image Processing (DASIP 09), Sophia Antipolis, France,
September 2009.
[5] Eclipse Open Source IDE, http://www.eclipse.org/downloads.
[6] T. Grandpierre and Y. Sorel, From algorithm and architecture
specifications to automatic generation of distributed real-time
executives: a seamless flow of graphs transformations, in
Proceedings of the 1st ACM and IEEE International Conference
on Formal Methods and Models for Co-Design (MEMOCODE
03), pp. 123–132, 2003.
[7] OpenMP, http://openmp.org/wp.
[8] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson,
K. H. Randall, and Y. Zhou, Cilk: an efficient multithreaded
runtime system, Journal of Parallel and Distributed Comput-
ing, vol. 37, no. 1, pp. 55–69, 1996.
[9] OpenCL, http://www.khronos.org/opencl.
[10] The Multicore Association, http://www.multicore-association.org/home.php.
[11] PolyCore Software Poly-Mapper tool, http://www.polycoresoftware.com/products3.php.
[12] E. A. Lee, Overview of the Ptolemy project, Technical
Memorandum UCB/ERL M01/11, University of California,
Berkeley, Calif, USA, 2001.
[13] J. Eker and J. W. Janneck, CAL language report, Tech.
Rep. ERL Technical Memo UCB/ERL M03/48, University of
California, Berkeley, Calif, USA, December 2003.
[14] S. S. Bhattacharyya, G. Brebner, J. Janneck, et al., OpenDF:
a dataflow toolset for reconfigurable hardware and multicore
systems, ACM SIGARCH Computer Architecture News, vol. 36,
no. 5, pp. 29–35, 2008.
[15] G. Karsai, J. Sztipanovits, A. Ledeczi, and T. Bapty, Model-
integrated development of embedded software, Proceedings of
the IEEE, vol. 91, no. 1, pp. 145–164, 2003.
[16] P. Belanovic, An open tool integration environment for efficient
design of embedded systems in wireless communications, Ph.D.
thesis, Technische Universität Wien, Wien, Austria, 2006.
[17] T. Grandpierre, C. Lavarenne, and Y. Sorel, Optimized rapid
prototyping for real-time embedded heterogeneous multipro-
cessors, in Proceedings of the 7th International Workshop on
Hardware/Software Codesign (CODES 99), pp. 74–78, 1999.
[18] C.-J. Hsu, F. Keceli, M.-Y. Ko, S. Shahparnia, and S. S.
Bhattacharyya, DIF: an interchange format for dataflow-
based design tools, in Proceedings of the 3rd and 4th
International Workshops on Computer Systems: Architectures,
Modeling, and Simulation (SAMOS 04), vol. 3133 of Lecture
Notes in Computer Science, pp. 423–432, 2004.
[19] S. Stuijk, Predictable mapping of streaming applications on mul-
tiprocessors, Ph.D. thesis, Technische Universiteit Eindhoven,
Eindhoven, The Netherlands, 2007.
[20] B. D. Theelen, A performance analysis tool for scenario-aware
streaming applications, in Proceedings of the 4th International
Conference on the Quantitative Evaluation of Systems (QEST
07), pp. 269–270, 2007.
[21] Graphiti Editor, http://sourceforge.net/projects/graphiti-
editor.
[22] E. A. Lee and D. G. Messerschmitt, Synchronous data flow,
Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.
[23] SDF4J, http://sourceforge.net/projects/sdf4j.
[24] PREESM, http://sourceforge.net/projects/preesm.
[25] J. W. Janneck, NL – a network language, Tech. Rep., ASTG
Technical Memo, Programmable Solutions Group, Xilinx, July
2007.
[26] SPIRIT Schema Working Group, IP-XACT v1.4: a specifica-
tion for XML meta-data and tool interfaces, Tech. Rep., The
SPIRIT Consortium, March 2008.
[27] U. Brandes, M. Eiglsperger, I. Herman, M. Himsolt, and
M. S. Marshall, GraphML progress report, structural layer
proposal, in Proceedings of the 9th International Symposium on
Graph Drawing (GD 01), P. Mutzel, M. Jünger, and S. Leipert,
Eds., pp. 501–512, Springer, Vienna, Austria, 2001.
[28] J. Piat, M. Raulet, M. Pelcat, P. Mu, and O. Déforges, An
extensible framework for fast prototyping of multiprocessor
dataflow applications, in Proceedings of the 3rd International
Design and Test Workshop (IDT 08), pp. 215–220, Monastir,
Tunisia, December 2008.
[29] w3c XML standard, http://www.w3.org/XML.
[30] w3c XSLT standard, http://www.w3.org/Style/XSL.
[31] Grammatica parser generator, http://grammatica.percederberg.net.
[32] J. W. Janneck and R. Esser, A predicate-based approach
to defining visual language syntax, in Proceedings of IEEE
Symposium on Human-Centric Computing (HCC 01), pp. 40–
47, Stresa, Italy, 2001.
[33] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee, A hierar-
chical multiprocessor scheduling framework for synchronous
dataflow graphs, Tech. Rep., University of California, Berke-
ley, Calif, USA, 1995.
[34] S. Sriram and S. S. Bhattacharyya, Embedded Multiprocessors:
Scheduling and Synchronization, CRC Press, Boca Raton, Fla,
USA, 1st edition, 2000.
[35] V. Sarkar, Partitioning and scheduling parallel programs for
execution on multiprocessors, Ph.D. thesis, Stanford University,
Palo Alto, Calif, USA, 1987.
[36] O. Sinnen and L. A. Sousa, Communication contention in
task scheduling, IEEE Transactions on Parallel and Distributed
Systems, vol. 16, no. 6, pp. 503–515, 2005.
[37] M. R. Garey and D. S. Johnson, Computers and Intractability:
A Guide to the Theory of NP-Completeness, W. H. Freeman, San
Francisco, Calif, USA, 1990.
[38] Y.-K. Kwok, High-performance algorithms of compile-time
scheduling of parallel processors, Ph.D. thesis, Hong Kong
University of Science and Technology, Hong Kong, 1997.
[39] F. Ghenassia, Transaction-Level Modeling with SystemC: TLM
Concepts and Applications for Embedded Systems, Springer,
New York, NY, USA, 2006.
[40] TMS320TCI6487 DSP platform, Texas Instruments product
bulletin (SPRT405).
[41] TMS320 DSP/BIOS user's guide (SPRU423F).
[42] B. Feng and R. Salman, TMS320TCI6482 EDMA3 perfor-
mance, Technical Document SPRAAG8, Texas Instruments,
November 2006.
[43] RapidIO, http://www.rapidio.org/home.
[44] The 3rd Generation Partnership Project, http://www.3gpp.org.
[45] J. Jiang, T. Muharemovic, and P. Bertrand, Random access
preamble detection for long term evolution wireless net-
works, US patent no. 20090040918.
[46] 3GPP technical specification group radio access network;
evolved universal terrestrial radio access (E-UTRA) (Release 8),
3GPP, TS36.211 (V 8.1.0).
[47] S. S. Bhattacharyya and E. A. Lee, Memory management
for dataflow programming of multirate signal processing
algorithms, IEEE Transactions on Signal Processing, vol. 42, no.
5, pp. 1190–1201, 1994.
[48] TMS320C64x/C64x+ DSP CPU and instruction set, Refer-
ence Guide SPRU732G, Texas Instruments, February 2008.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 976296, 13 pages
doi:10.1155/2009/976296
Research Article
Run-Time HW/SW Scheduling of Data Flow Applications on
Reconfigurable Architectures
Fakhreddine Ghaffari, Benoit Miramond, and Francois Verdier
ETIS Laboratory, UMR 8051, ENSEA, University of Cergy Pontoise, CNRS, 6 avenue Du Ponceau, BP 44,
95014 Cergy-Pontoise Cedex, France
Correspondence should be addressed to Fakhreddine Ghaffari, fakhreddine.ghaffari@ensea.fr
Received 1 March 2009; Revised 22 July 2009; Accepted 7 October 2009
Recommended by Markus Rupp
This paper presents an efficient dynamic and run-time Hardware/Software scheduling approach. This scheduling heuristic consists
in mapping online the different tasks of a highly dynamic application in such a way that the total execution time is minimized. We
consider soft real-time data flow graph oriented applications for which the execution time is a function of the input data nature. The
target architecture is composed of two processors connected to a dynamically reconfigurable hardware accelerator. Our approach
takes advantage of the reconfiguration property of the considered architecture to adapt the treatment to the system dynamics. We
compare our heuristic with another similar approach. We present the results of our scheduling method on several image processing
applications. Our experiments include simulation and synthesis results on a Virtex-V-based platform. These results show better
performance than existing methods.
Copyright © 2009 Fakhreddine Ghaffari et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
One of the main steps of the HW/SW codesign of a mixed
electronic system (Software and Hardware) is the scheduling
of the application tasks on the processing elements (PEs) of
the platform. The scheduling of an application formed by
N tasks on M target processing units consists in finding the
realizable partitioning in which the N tasks are launched onto
their corresponding M units and an ordering on each PE
for which the total execution time of the application meets
the real-time constraints. This problem of multiprocessor
scheduling is known to be NP-hard [1, 2], which is why we
propose a heuristic approach.
Many applications, in particular in image processing
(e.g., an intelligent embedded camera), have data-dependent
execution times according to the nature of the input to be
processed. In this kind of application, the implementation
is often stressed by real-time constraints, which demand
adaptive computation capabilities. In this case, according
to the nature of the input data, the system must adapt its
behaviour to the dynamics of the evolution of the data and
continue to meet the variable needs of required calculation
(in quantity and/or in type). Examples of applications where
the processing needs change in quantity (the computation
load is variable) come from intelligent image processing,
where the duration of the treatments can depend on the
number of objects in the image (motion detection,
tracking, etc.) or on the number of interest areas (contour
detection, labelling, etc.).
We can also quote the run-time use of different filters
according to the texture of the processed image (here it is
the type of processing which is variable). Another example
of dynamic applications is video encoding, where the run-
length encoding (RLE) of frames depends on the information
within frames.
For these dynamic applications, many implementation
ways are possible. In this paper we consider an intelligent
embedded camera for which we propose a new design
approach compared to classical worst-case implementations.
Our method consists in evaluating online the application
context and adapting its implementation onto the different
targeted processing units by launching a run-time partition-
ing algorithm. The online modification of the partitioning
result can also be a solution for fault tolerance, by reassigning
at run time the tasks of the faulty target unit to other
operational targets [3]. This also requires revising the
scheduling strategy. More precisely, the result of the latter
must change at run time in two cases.
(i) Firstly, to evaluate the partitioning result: after each
modification of the task implementations we need to
know the new total execution time, and this is only
possible by rescheduling all the tasks.
(ii) Secondly, by modifying the scheduling result we
can obtain a better total execution time which
meets the real-time constraint without modifying the
partitioning. This is because the characteristics of the
tasks (mainly execution time) are modified according
to the nature of the input data.
In that context, the choice of the implementation of the
scheduler is of major importance and depends on the
heuristic complexity. Indeed, with our method the decisions
taken online by our scheduler can be very time-consuming.
A software implementation of the proposed scheduling
strategies will then delay the application tasks. For this
reason, we propose in this work a hardware implementation
for our scheduling heuristic.
With this implementation, the scheduler takes only few
clock cycles. So we can easily call the scheduler at run
time without penalty on the total execution time of the
application.
The primary contribution of our work is the concept
of an efficient online scheduling heuristic for heterogeneous
multiprocessor platforms. This heuristic provides good
results for both hardware tasks (on the FPGA) and software
tasks (on the targeted general-purpose processors), as well
as an extensive speedup through the hardware implementation
of this scheduling heuristic. Finally, the implementation
of our scheduler allows the system to adapt itself to the
application context in real time. We have simulated and
synthesized our scheduler targeting an FPGA (Xilinx Virtex
5) platform. We have tested the scheduling technique on
several image processing applications implemented on a
heterogeneous target architecture composed of two processors
coupled with a configurable logic unit (FPGA).
The remainder of this paper is organized as follows.
Section 2 presents related work on hardware/software
scheduling approaches. Section 3 introduces the framework
of our scheduling problem. Section 4 presents the proposed
approach. Section 5 shows the experimental results, and
finally Section 6 concludes this paper.
2. Related Work
The field of study which tries to find an execution order for
a set of tasks that meets system design objectives (e.g., minimizing
the total application execution time) has been widely
covered in the literature. In [4-6] the problem of HW/SW
scheduling for system-on-chip platforms with dynamically
reconfigurable logic architecture is exhaustively studied.
Moreover, several works deal with scheduling algorithms
implemented in hardware [7-9]. Scheduling in such systems
is based on priorities; therefore, an obvious solution is to
implement priority queues. Many hardware architectures
for such queues have been proposed: binary tree comparators,
FIFO queues plus a priority encoder, and a systolic array
priority queue [7]. Nevertheless, all these approaches are
based on a fixed-priority static scheduling technique. Moreover,
most of the proposed hardware approaches address
the implementation of only one scheduling algorithm (e.g.,
Earliest Deadline First) [9, 10]. Hence they are inefficient and
not appropriate for systems where the required scheduling
behavior changes during run time. Also, system performance
for tasks with data-dependent execution times can be
improved by using dynamic schedulers instead of static (compile-time)
scheduling techniques [11, 12].
In our work, we propose a new hardware-implemented
approach which computes task priorities at run time based
on the characteristics of each task (execution time, graph
dependencies, etc.). Our approach is dynamic in the sense
that the execution order is decided at run time, and it supports
a heterogeneous (HW/SW) multiprocessor architecture.
The idea of dynamic partitioning/scheduling is based
on the dynamic reconfiguration of the target architecture.
FPGAs [13, 14] increasingly offer very attractive reconfiguration
capabilities: partial or total, static or dynamic.
The reconfiguration latency of dynamically reconfigurable
devices represents a major problem that must not
be neglected. Several references address temporal partitioning
for reconfiguration latency minimization [15]. Moreover,
configuration prefetching techniques are used to minimize the
reconfiguration overhead. A similar technique to lighten this
overhead is developed in [16] and is integrated into an existing
design environment: a prefetch and replacement unit modifies
the schedule and significantly reduces the latency even for
highly dynamic tasks.
In fact, there are two different approaches in the literature:
the first approach reduces the reconfiguration overhead by
modifying scheduling results.
The second one distinguishes between scheduling and
reconfiguration: the reconfiguration occurs only if the
HW/SW partitioning step requires it, and the scheduling
algorithm is needed only to validate the partitioning result.
After partitioning, the implementation of each task
is unchanged and reconfiguration is no longer necessary.
Scheduling aims at finding the best execution time for a given
implementation strategy. Since scheduling does not change
the partitioning decision, it does not take reconfiguration time
into account.
In this paper, we focus only on the scheduling strategy of
the second case. We assume that the reconfiguration aspects
are taken into account during the HW/SW partitioning
step (the decision of task implementation). Furthermore, we
addressed this latter step in our previous work [17].
3. Problem Definition
3.1. Target Architecture. The target architecture is depicted in
Figure 1. It is a heterogeneous architecture, which contains
two software processing units: a Master Processor and a Slave
Processor.

Figure 1: The target architecture (the Master and Slave CPUs, a DMA, a shared contexts RAM, and the RCU, interconnected by buses).

The platform also contains a hardware processing
unit, the Reconfigurable Computing Unit (RCU), and shared
memory resources. The software processing units are von
Neumann monoprocessing systems and execute only a single
task at a time.
Each hardware task (implemented on the RCU) occupies
a tile on the reconfigurable area [18]. The size of the tile
is the same for all tasks, to facilitate the placement and
routing of the RCU. We choose, for example, the tile size of
the task which uses the most resources on the RCU
(by resource we mean here the logic elements used by
the RCU to map any task).
The RCU unit can be reconfigured partially or totally.
Each hardware task is represented by a partial bitstream.
All bitstreams are stored in the contexts memory (the
memory shared between the processors and the RCU in
Figure 1). These bitstreams are loaded into the RCU
before scheduling, to reconfigure the FPGA according to the
run-time partitioning results [17]. The HW/SW partitioning
result can change at run time according to the temporal
characteristics of tasks [6]. In [17] we proposed an HW/SW
partitioning approach based on HW-to-SW and SW-to-HW
task migrations. The idea of task migration consists in
accelerating the task(s) which become critical by moving
their implementations from software units to hardware units,
and decelerating the tasks which become non-critical by
returning them to the software units.
After each new HW/SW partitioning result, the scheduler
must evaluate this solution by providing the
corresponding total execution time. It thus faces a real-time
constraint, since it is launched at run time. With
this approach of dynamic partitioning/scheduling the target
architecture is very flexible: it can adapt itself even to
very dynamic applications.
3.2. Application Model. The considered applications are data-flow-oriented
applications such as image processing, audio
processing, or video processing. To model this kind of
application we consider a Data Flow Graph (DFG) (an
example is depicted in Figure 2), a directed acyclic
graph where nodes are processing functions and edges
describe communication between tasks (data dependencies
between tasks).

Figure 2: An example of a DFG with 6 tasks (MS: Master processor; SL: Slave processor; HW: FPGA; each node carries its implementation and execution time, each edge its communication time).

The size of the DFG depends on the
functional partitioning of the application, and thus on the
number of tasks and edges. We can notice that the structure
of the DFG has a great effect on the execution time of the
scheduling operations. A low-granularity DFG makes the
system more predictable because task execution times
do not vary considerably, thus limiting timing-constraint
violations. On the other hand, for a very low granularity,
the number of tasks in the DFG explodes, and the
communications between tasks become unmanageable.
Each node of the DFG represents a specific task in the
application. For each task there can be up to three different
implementations: a hardware implementation (HW) placed
in the FPGA, a software implementation running on the master
processor (MS), and another software implementation
running on the slave processor (SL).
Each node of Figure 2 is annotated with two data: the
implementation (MS, SL, or HW) and the execution
time of the task. Similarly, each edge is annotated with the
communication time between the two nodes (two tasks).
Each task of the DFG is characterized by the following
four parameters:
(a) Texe (execution time),
(b) Impl (implementation on the RCU, on the master
processor, or on the slave processor),
(c) Nbpred (number of predecessor tasks),
(d) Nbsucc (number of successor tasks).
All the tasks of a DFG are thus modeled identically, and
the only real-time constraint is on the total execution time.
At each scheduler invocation, this total execution time
corresponds to the longest path in the mapped task graph.
It then depends both on the application partitioning and on
the chosen order of execution on processors.
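As a minimal software sketch of this task model (field and property names are illustrative, not taken from the paper, whose scheduler is implemented in hardware), each DFG node can be represented as:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    """One DFG node: Texe, Impl, and the predecessor/successor
    lists from which Nbpred and Nbsucc are derived."""
    name: str
    texe: int                 # execution time for the chosen implementation
    impl: str                 # "MS" (master), "SL" (slave), or "HW" (RCU)
    preds: List[str] = field(default_factory=list)
    succs: List[str] = field(default_factory=list)

    @property
    def nbpred(self) -> int:
        return len(self.preds)

    @property
    def nbsucc(self) -> int:
        return len(self.succs)

# The 6-task DFG of Figure 2:
dfg = {
    "A": Task("A", 5, "MS", [], ["B", "C"]),
    "B": Task("B", 3, "SL", ["A"], ["D"]),
    "C": Task("C", 7, "SL", ["A"], ["E", "F"]),
    "D": Task("D", 18, "HW", ["B"], []),
    "E": Task("E", 2, "HW", ["C"], []),
    "F": Task("F", 13, "MS", ["C"], []),
}
```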
4. Proposed Approach
The applications are periodic. In one period, all the tasks
of the DFG must be executed. In image processing, for
instance, the period is the execution time needed to
For all software tasks do
{
Compute ASAP
The task with minimum ASAP will be chosen
If (equality of ASAP)
Compute urgency
The task with maximum urgency will be chosen
If (equality of urgency)
Compare execution times
The task with maximum execution time will be chosen
}
Algorithm 1: Principle of our scheduling policy.
process one image. The scheduling must occur online, at the
end of the execution of all the tasks and when a violation
of the real-time constraints is predicted. Hence the result of
partitioning/scheduling is applied to the next period
(the next image, for image processing applications).
Our run-time scheduling policy is dynamic since the
execution order of application tasks is decided at run time.
For the tasks implemented on the RCU, we assume that the
hardware resources are sufficient to execute in parallel all the
hardware tasks chosen by the partitioning step. Therefore
the only condition for launching their execution is the
satisfaction of all data dependencies; that is to say, a task
may begin execution only after all the tasks on its incoming
edges have completed.
For the tasks implemented on the software processors,
the conditions for launching are the following:
(1) the satisfaction of all data dependencies,
(2) the availability of the software unit.
A task can thus be in one of four states:
(i) waiting,
(ii) running,
(iii) ready,
(iv) stopped.
A task is in the waiting state when it waits for the end
of execution of one or several predecessor tasks. When a
software processing unit has finished the execution of a
task, new tasks may become ready for execution, provided
of course that all their dependencies have been completed.
A task can be stopped in the case of preemption or after
finishing its execution.
The states of the processing units (MS, SL, and HW) in
our target architecture are: execution state, reconfiguration
state, or idle state.
In the following, we explain the principle of our
approach as well as a hardware implementation of the
proposed HW/SW scheduler.
As explained in Algorithm 1, the basic idea of our
scheduling heuristic is to decide task priorities
according to three criteria.
The first criterion is the As Soon As Possible (ASAP) time:
the task with the earliest ASAP date is launched
first.
The second criterion is the urgency: the task with
the maximum urgency has priority to be launched before
the others. This new criterion is based on the nature of
the successors of the task. The urgency criterion is employed
only if at least two tasks are tied on the first criterion.
If there is still a tie on this second criterion, we compare
the last criterion, the execution time of the tasks, and choose
the task with the larger execution time to launch first.
We use these criteria to choose among two or more
software tasks (on the Master or on the Slave) that are ready to run.
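The three tie-broken criteria can be condensed into a single selection function. The sketch below is an illustrative software simulation of Algorithm 1, not the paper's VHDL implementation, and all names are assumptions:

```python
def pick_next_software_task(ready, asap, urgency, texe):
    """Select one ready software task: minimum ASAP date first,
    then maximum urgency, then maximum execution time.
    `ready` is a list of task names; the other arguments are
    dicts keyed by task name."""
    # min() compares the key tuples lexicographically, which encodes
    # the priority order; negating urgency and texe turns their
    # "maximum wins" rules into "minimum wins".
    return min(ready, key=lambda t: (asap[t], -urgency[t], -texe[t]))

# As in Figure 3(c): B and C are ready at the same ASAP date, but B
# is more urgent (Urg[B] = 8 > Urg[C] = 2), so B is launched first.
chosen = pick_next_software_task(
    ready=["B", "C"],
    asap={"B": 5, "C": 5},
    urgency={"B": 8, "C": 2},
    texe={"B": 3, "C": 7},
)
```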
4.0.1. The Urgency Criterion. The urgency criterion is based
on the implementation of tasks and the implementations of
their successors. A task is considered urgent when it is
implemented on a software unit (Master or Slave) and has
one or more successor tasks implemented on different
units (a hardware unit or the other software unit).
Figure 3 shows three examples of DFGs. In Figure 3(a), task
C is implemented on the Slave processor and is followed
by task D, which is implemented on the RCU. Thus the
urgency (Urg) of task C is the execution time of its successor
(Urg(C) = 13). In example (b) it is task B which is
followed by a task D implemented on a different unit (the
Master processor). In the last example (c), both tasks B
and C are urgent, but task B is more urgent than task C since
its successor has a larger execution time than the successor
of task C.
When a task has several successors with different implementations,
the urgency is the maximum of the execution times
of those successors.
In the general case, when the direct successor of a task A
has the same implementation as A but itself has a successor
with a different implementation, this latter successor feeds its
urgency back to task A.
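This urgency rule, including the feedback through chains of same-unit successors, can be sketched recursively (an illustrative Python model with an assumed task record; the paper computes this in the hardware IPs):

```python
from collections import namedtuple

Task = namedtuple("Task", "texe impl succs")

def urgency(name, dfg):
    """Urgency of a task: the maximum execution time among
    successors mapped to a different unit; a same-unit successor
    feeds its own urgency back. Hardware tasks are never urgent."""
    task = dfg[name]
    if task.impl == "HW":
        return 0
    u = 0
    for s in task.succs:
        if dfg[s].impl != task.impl:
            u = max(u, dfg[s].texe)
        else:
            # feedback through a chain of same-unit successors
            u = max(u, urgency(s, dfg))
    return u

# Figure 3(a): C (slave) is followed by D on the RCU, so Urg(C) = 13,
# while B has no successors and is not urgent.
dfg = {
    "A": Task(5, "MS", ["B", "C"]),
    "B": Task(3, "SL", []),
    "C": Task(7, "SL", ["D"]),
    "D": Task(13, "HW", []),
}
```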
We show the scheduling result for case (a) when the
urgency criterion is respected in Figure 3(d), and when it is not
in Figure 3(e). We can notice for all the DFG examples
in Figure 3 that the urgency criterion leads to the best choice for
obtaining a minimum total execution time. The third criterion
(the execution time) is an arbitrary tie-breaker and very rarely
has an impact on the total execution time.
We can also notice that our scheduler supports the
dynamic creation and deletion of tasks. These online services
are only possible when a fixed structure of the DFG is kept
throughout the execution; in that case the dependencies between
tasks are known a priori. Dynamic deletion is then possible
by assigning a null execution time to the tasks which are
not active, and dynamic creation by assigning their execution
time when they become active.
This scheduling strategy needs an online computation of
several criteria for all software tasks in the DFG.
We first tried to implement this new scheduling policy
on a processor. Figure 4 shows the computation time of our
scheduling method when implemented on an Intel Core 2
Figure 3: Case examples of urgency computing. (a) Urg[C] = 13; (b) Urg[B] = 5; (c) Urg[B] = 8, Urg[C] = 2; (d) case of DFG (a) with task B before task C; (e) case of DFG (a) with task C before task B.
Figure 4: Execution time of the software implementation of the scheduler (scheduling execution time per image on an Intel Core 2 Duo CPU, 2.8 GHz, 4 GB RAM; measured values include 12.68, 14.15, 16.18, 20.69, and 33.50 ms).
Duo CPU with a frequency of 2.8 GHz and 4 GB of RAM.
We can notice that the average computation time of the
scheduler is about 12 milliseconds per image. These
experiments were done on an image processing application
(the DFG depicted in Figure 12) whose processing period
per image is 19 milliseconds. The scheduling (with this
software implementation) thus takes about 63% of the
processing time of one image on a desktop computer.
We can conclude that, in an embedded context, a
software implementation of this strategy is incompatible
with real-time constraints.
We describe in the following an optimized hardware
implementation of our scheduler.
4.1. Hardware Scheduler Architecture. In this section, we
describe the proposed architecture of our scheduler. This
architecture is shown in Figure 5 for a DFG example of three
tasks. It is divided into four main parts.
(1) The DFG IP Sched (the middle part surrounded by a
dashed line in the figure).
(2) The DFG Update (DFG Up in the figure).
(3) The MS Manager (SWTM).
(4) The Slave Manager (SLTM).
The basic idea of this hardware architecture is to parallelize
the scheduling of the processing tasks as much as possible:
in the best case, we can schedule all the tasks of
the DFG in parallel, given an architecture with infinite resources.
We associate with the application DFG a modified graph
with the same structure, composed of IP nodes (each IP
represents a task). Therefore, in the best case where all tasks are
independent, we could schedule all the tasks of the DFG in
only one clock cycle.
To also parallelize the management of the software
execution times, we associate with each software unit a
hardware module:
(i) the Master Task Manager (SWTM in Figure 5),
(ii) the Slave Task Manager (SLTM in Figure 5).
These two modules manage the order of the task executions
and compute the processor execution time for each one.
The input signals of this scheduler architecture are the
following.
(i) A pointer in memory to the implementations of all
the tasks. We have three kinds of implementation
(RCU, Master, and Slave); with the signals SW and
HW we can encode these three possibilities.
(ii) The measured execution time of each task (Texe).
(iii) The clock signal and the reset.
Figure 5: An example of the scheduler architecture for a DFG of three tasks (inputs: SW, HW, Texe, CLK, Reset; blocks: SWTM, SLTM, IP1-IP3, DFG UP; outputs: Texe Total, All Done, Scheduled DFG, Nb Task, Nb Task Slave).
The output signals are the following.
(i) The total execution time after scheduling all tasks
(Texe Total).
(ii) The signal All Done, which indicates the end of the
scheduling.
(iii) Scheduled DFG, a pointer to the scheduling result
matrix to be sent to the operating system (or any
simple executive).
(iv) Nb Task and Nb Task Slave, the number
of tasks scheduled on the Master and the number of
tasks scheduled on the Slave, respectively. These two
signals were added solely for the purpose of simulation
in ModelSim (to check the scheduling result); in the
real case we do not need them, since this information
comes from the partitioning block.
The last part is the DFG Update block, which updates the
result matrix after each task is scheduled.
In the following paragraphs, we detail each part of
this architecture.
4.1.1. The DFG IP Sched Block. This block contains N
components (N is the number of tasks in the application).
With each task we associate an IP component which computes
the intrinsic characteristics of this task (urgency, ASAP,
ready state, etc.). It also computes the total execution time
for the entire graph.
The proposed architecture of this IP is shown in Figure 6
(in the appendix).
For each task the implementation (PE) and the execution
time are fixed, so the role of this IP is to calculate the start
time of the task and to define its state. This is done by taking
into account the state of the corresponding target (master,
slave, or RCU). The block then iterates along the DFG structure
to determine a total execution ordering and to assign the start
times.
This IP also calculates the urgency criterion of critical
tasks according to the implementation and the execution
time of their successors.
If a task is implemented on the RCU, it is launched
as soon as all its predecessors are done. So the scheduling
time of hardware tasks depends on the number of tasks that
can run in parallel: the IP can schedule
all hardware tasks that can run in parallel in a single clock
cycle.
For the software tasks (on the master or on the slave),
the scheduling takes one clock cycle per task. Thus the
computing time of the hardware scheduler only depends on
the result of the HW/SW partitioning.
4.1.2. The DFG Update Block. When a DFG is scheduled,
the result modifies the DFG into a new structure. The
DFG Update block (Figure 7 in the appendix) generates
new edges (dependencies between tasks) after scheduling,
with the objective of imposing a total execution order on each
computing unit according to the scheduling results.
We represent the dependencies between tasks in the DFG by
a matrix where the rows represent the predecessors and the
columns the successors. For example, Figure 8
depicts the matrix of dependencies corresponding to the
DFG of Figure 2. After scheduling, the resulting matrix is
an update of the original one: it contains more dependencies
than the latter. This is the role of the DFG Update
block.
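The effect of the update can be sketched on the adjacency matrix. This is a software illustration of the role the paper assigns to the DFG Update block; the index convention and function name are assumptions:

```python
def impose_total_order(matrix, unit_order):
    """Add an edge between each pair of consecutive tasks scheduled
    on the same processing unit, so the updated matrix carries a
    total execution order for that unit. matrix[i][j] == 1 means
    task i precedes task j; `unit_order` lists task indices in their
    chosen execution order on one unit."""
    for a, b in zip(unit_order, unit_order[1:]):
        matrix[a][b] = 1
    return matrix

# Figure 8, tasks A..F as indices 0..5: scheduling B before C on the
# slave adds the edge B -> C to the original dependency matrix.
before = [
    [0, 1, 1, 0, 0, 0],   # A -> B, A -> C
    [0, 0, 0, 1, 0, 0],   # B -> D
    [0, 0, 0, 0, 1, 1],   # C -> E, C -> F
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
after = impose_total_order([row[:] for row in before], [1, 2])
```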
4.1.3. The MS Manager Block. The objective of this module
is to schedule the software tasks according to the algorithm
given above. Figure 9 in the appendix presents the architecture
of the Master Manager block. The input signal ASAP SW
represents the ASAP times of all the tasks. The Urgency Time
signal represents the urgency of each task of the application.
The SW Ready signal represents the Ready signals of all the
software tasks.
The signal MIN ASAP TASKS represents all the tasks
that are Ready and have the same minimum ASAP time.
The signal MAX CT TASKS represents all the tasks
that are Ready and have the same maximum urgency. The
tasks which satisfy the two preceding criteria are represented
by the Tasks Ready signal. The Task Scheduled
signal determines the single software task which will be
scheduled. With this signal, it is possible to choose the right
value of the TEXE SW signal and then to update
the SW Total Time signal. A single clock cycle is
necessary to schedule a single software task.
By analogy, the Slave Manager block has the same role as
the MS Manager block: from the scheduling point of view there
is no difference between the two processors.
4.2. HW/SW Scheduler Outputs. In this section, we describe
how the results of our scheduler are processed by a target
module such as an executive or a Real-Time Operating
System (RTOS). As depicted in Figure 8, the output of
our run-time HW/SW scheduler is an n x n matrix, where n
is the total number of tasks in the DFG. Figure 10 shows
the scheduling result of the DFG depicted in Figure 12.
This matrix is used by a centralized Operating System
(OS) to fill its task queues for the three computing
units.
The table shown in Figure 11 is a compilation of the
results of both the partitioning and scheduling operations.
Figure 6: An IP representing one task (combinational logic computing SW_Ready, SL_Ready, and HW_Ready from the implementation and state signals, plus registers, multiplexers, and adders computing the ASAP date, finishing time, and critical time).
Figure 7: The DFG updating architecture (the original DFG registers matrix is combined through OR/XOR logic with the scheduling signals to produce the new successors registers matrix, Scheduled_DFG).
The OS browses the matrix row by row. Whenever it finds
a 1, it puts the task whose number corresponds to the
column into the waiting state. At the end of a task execution,
the corresponding waiting tasks on each unit become
either Ready or Running.
A task is in the Ready state only when all its
dependencies are done but the target unit is busy; thus
there is no Ready state for the hardware tasks.
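The OS-side bookkeeping just described can be sketched as follows. The state names follow the paper; everything else (function name, arguments, index convention) is an assumption for illustration:

```python
def task_states(matrix, done, impl, busy):
    """Compute the state of every unfinished task from the
    scheduled matrix: Waiting while some predecessor is
    unfinished, Ready when all dependencies are done but the
    software unit is busy, otherwise Running (hardware tasks
    are never Ready, since the RCU runs its ready tasks at once)."""
    n = len(matrix)
    states = {}
    for j in range(n):
        if j in done:
            continue
        # column scan: task i precedes task j when matrix[i][j] == 1
        deps_met = all(matrix[i][j] == 0 or i in done for i in range(n))
        if not deps_met:
            states[j] = "Waiting"
        elif impl[j] != "HW" and busy.get(impl[j], False):
            states[j] = "Ready"
        else:
            states[j] = "Running"
    return states

# Task 0 is done; task 1 (master, unit busy) waits in the Ready
# state, while task 2 (hardware) starts immediately.
states = task_states(
    [[0, 1, 1], [0, 0, 0], [0, 0, 0]],
    done={0},
    impl={0: "MS", 1: "MS", 2: "HW"},
    busy={"MS": True},
)
```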
It should be noted that if the OS runs on the Master
processor, for example, the latter will be interrupted each
time the OS has to execute.
5. Experiments and Results
With the idea of covering a wide range of data-flow applications,
we conducted experiments on real and artificial applications. In
the context of this paper we present a summary of the
results obtained on three case studies in the domain of real-time
image processing:
(i) a motion detection application,
(ii) an artificial extension of this detection application,
(iii) a robotic vision application.
(a) Before scheduling (rows = tasks A-F, columns = successors A-F):
A: 0 1 1 0 0 0
B: 0 0 0 1 0 0
C: 0 0 0 0 1 1
D: 0 0 0 0 0 0
E: 0 0 0 0 0 0
F: 0 0 0 0 0 0
(b) After scheduling (the added 1s impose a total execution order on each processor, here B before C and F before E):
A: 0 1 1 0 0 0
B: 0 0 1 1 0 0
C: 0 0 0 0 1 1
D: 0 0 0 0 0 0
E: 0 0 0 0 0 0
F: 0 0 0 0 1 0
Figure 8: Matrix result after scheduling.
The second case study is a complex DFG which contains
different classical structures (fork, join, sequential). This
DFG is depicted in Figure 12. It contains twenty tasks. Each
task can be implemented on a software computation unit
(Master or Slave processor) or on the reconfigurable RCU.
The original DFG is the model of an image processing
application: motion detection on a fixed image background.
This application is composed of 10 sequential tasks (from
ID 1 to ID 10 in Figure 12). We added 10 other virtual
tasks to obtain a complex DFG containing the different
possible parallel structures. This type of parallel program
paradigm (fork, join, etc.) arises in many application
areas.
In order to test the presented scheduling approach, we
have performed a large number of experiments in which several
scenarios of HW/SW partitioning results were analyzed.
As an example, Figure 12 presents the scheduling result
when tasks 3, 4, 7, 8, 11, 12, 17, 18, and 20 are implemented
in hardware. As explained in Section 4.1, new dependencies
(dotted lines) are added in the original graph to impose a
total order on each processor. In this figure all the execution
times are in milliseconds (ms).
We also conducted experiments on a more dynamic
application from the robotic vision domain [19]. It consists of
a subset of a cognitive system allowing a robot equipped
with a CCD camera to navigate and to perceive objects. The
global architecture in which the visual system is integrated is
biologically inspired and based on the interactions between
the processing of the visual flow and the robot movements. In
order to learn its environment, the system identifies keypoints
in the landscape.
Keypoints are detected in a sampled scale space based on
an image pyramid, as presented in Figure 13. The application
is dynamic in the sense that the number of keypoints depends
on the scene observed by the camera; the execution
times of the Search and Extract tasks in the graph therefore
change dynamically (see [19] for more details about this application).
5.1. Comparison Results. Throughout our experiments, we
compared the result of our scheduler with the one given
by the HCP (Heterogeneous Critical Path) algorithm developed
by Bjorn-Jorgensen and Madsen [20]. This algorithm
represents an approach to scheduling on a heterogeneous
multiprocessor architecture. It starts with the calculation of
priorities for each task associated with a processor. A task is
chosen depending on the length of its critical path (CPL);
the task which has the largest minimum CPL has the
highest priority. We compared against this method because it
has been shown to outperform several other approaches (MD, MCP, PC,
etc.) [21].
The summary of the conducted experiments is presented
in Figure 14. Each column gives average values for one of
the three presented applications with different partitioning
strategies.
Comparing the first and second rows shows that our scheduling
method provides consistent results:
the quality of the scheduling solutions found by our
method and by the HCP method is similar. Moreover, our
method obtains better results for the complex Icam application:
the HCP method returns an average total execution
time of 69 milliseconds whereas our method returns
only 58 milliseconds for the same DFG. For the Icam simple
application, the DFG is completely sequential, so whatever
the scheduling method the result is always the same. For the
robotic vision application, we find the same total execution
time with the two methods because of the existence of
a critical path in the DFG which always sets the overall
execution time. We also measured the execution overhead of
the proposed scheduling algorithm when it is implemented
in software (third row of Figure 14) and in hardware (fourth
row).
Since the scheduling overhead depends on the number of
tasks in the application, we only indicate the average values
in Figure 14. As an example, Figure 15 presents the execution
time of the hardware scheduler (in cycles) according to the
number of software tasks.
From Figure 15, it may be concluded that when the result
of partitioning changes at run time, the computation
time needed for our scheduler to schedule all the DFG
tasks depends heavily on the modified task
implementations. So:
(1) the partitioning result and the DFG structure have a
great impact on the scheduler computation time;
(2) the maximum schedule computation time corresponds
to the longest sequential sequence of tasks, i.e., the case
where all tasks are in software (each task then takes
one clock cycle);
Figure 9: The module of the MS Manager (inputs ASAP_SW, Urgency_Time, SW_Ready, TEXE_SW, CLK, Reset; Min/Max comparators producing MIN_ASAP_TASKS, MAX_CT_TASKS, Tasks_Ready, and Task_Scheduled; output SW_Total_Time).
Table 1: Device utilization summary after synthesis.
Used logic utilization: Icam simple | Icam complex | Robotic vision
Number of slice registers: <1% | <1% | <1%
Number of slice LUTs: 3% | 6% | 9%
Number of fully used bit slices: 5% | 5% | 6%
Number of bonded IOBs: 24% | 64% | 100%
Number of BUFG/BUFGCTRLs: 3% | 3% | 3%
Scheduler frequency: 23.94 MHz | 19.54 MHz | 17.44 MHz
0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
(Rows and columns are indexed 1 to 20.)
Figure 10: Scheduling result.
(3) the minimum schedule computation time depends
on the DFG structure: it is given by the longest
sequential sequence of tasks when all tasks are in
hardware (each task of the chain takes one clock cycle).
In our case (Figure 12) this sequence is formed by 10
tasks, so the minimum schedule computation time is equal
to 10 clock cycles for this application.
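These observations suggest a simple cycle-count model, derived from the per-task and per-level costs stated in Section 4.1.1. This is a back-of-the-envelope sketch with assumed names, not a timing guarantee for the synthesized IP:

```python
def scheduler_cycles(levels, impl):
    """Estimate the hardware scheduler's computation time: one
    cycle per software task, and one cycle per dependency level
    that contains hardware tasks (all parallel hardware tasks of
    a level are scheduled together). `levels` groups task names
    level by level along the DFG."""
    cycles = 0
    for level in levels:
        cycles += sum(1 for t in level if impl[t] != "HW")  # SW: 1 cycle each
        if any(impl[t] == "HW" for t in level):             # HW: 1 cycle per level
            cycles += 1
    return cycles

# A fully sequential 10-task chain mapped entirely to hardware
# needs 10 cycles, the minimum reported for this application.
chain = [[f"T{i}"] for i in range(10)]
all_hw = scheduler_cycles(chain, {f"T{i}": "HW" for i in range(10)})
```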
The results confirm that a software implementation
of our scheduling algorithm is incompatible with online
scheduling. Instead, the hardware implementation proposed
in this paper brings determinism, better scalability, and a
speedup of about 20000x.
5.2. Synthesis Results. We have synthesized our scheduler
architecture for an FPGA target platform (Virtex 5, device
XC5VLX330 -2 1760) [22] for the RCU of Figure 1.
Table 1 shows the device utilization summary for the three
considered applications when we choose a bus size
of 16 bits. We notice that for the presented complex
DFG, our scheduler uses only 6% of the device slice LUTs,
which is reasonable. These results are obtained with a design
frequency of about 19.54 MHz. The Virtex 5 XC5VLX330 device
provides 207360 slice registers, 207360 slice LUTs, 9687 fully
used bit slices, 1200 IOBs, and 32 BUFG/BUFGCTRLs.
These results are conrmed in Figure 16, where we
synthesize the same Scheduler for the three applications, but
with 10 bits bus size as explained in the following paragraph.
5.3. Accuracy of Execution Times Values. The accuracy of
the execution time values is dened by the size of the bus
which must convey the information from module to another.
The size of this bus is a very determinant parameter in the
scheduler synthesis.
Table 2: Behaviour of the scheduler in dynamic situations.

Situation                                          | Total execution time of application | Scheduling time | Size of the scheduler IP | Need to resynthesize
Variation of execution time of tasks               | Impacted     | Not impacted | Not impacted | No
Variation of partitioning results                  | Impacted     | Impacted     | Not impacted | No
Variation of the DFG structure (fork, join, etc.)  | Impacted     | Impacted     | Not impacted | Yes
Variation of the application                       | Impacted     | Impacted     | Impacted     | Yes
Variation of the execution time precision          | Not impacted | Impacted     | Impacted     | Yes
[Figure omitted: step-by-step management lists (Wait, Ready, Run, Done states) of the tasks on the Master, Slave, and RCU units, for scheduling steps S1-S20.]
Figure 11: Lists of management for a centralised OS.
As shown in Figure 17, when the size of the bus increases, the number of hardware resources used also increases and the frequency of the circuit decreases. But for our scheduler, even with a 32-bit bus, the IP keeps a relatively small size compared with the total number of available resources (20% of slice LUTs). This is another advantage of our scheduler architecture.
In the general case, the designer has to make a trade-off between the accuracy of the performance measures (in this case, the execution time) and the cost in number of hardware resources and maximum frequency of the circuit.
5.4. Summary. Through the various results of synthesis, we confirm the effectiveness of the hardware architecture for our proposed scheduling method. With these three applications, we have swept most of the existing DFG structures: the sequential one in the
[Figure omitted: 20-node DFG of the icam application; tasks ID1-ID20 (Averaging, Subtraction, Threshold, Ero/Dilat, Reconstruction, Dilatation, Labeling, Covering (envelope), Motion test, Updating background, plus ten virtual tasks), each annotated with its SW/HW/SL mapping and execution time.]
Figure 12: DFG application. The interrupted lines represent the scheduling results.
application icam simple, the fork and join in the application icam complex, and the parallelism in the robotic vision application.
This scheduling heuristic gives better results than the HCP method. Moreover, the proposed hardware architecture is very efficient in terms of resource utilization and scheduling latency.
[Figure omitted: 30-node DFG organised in low-, medium-, and high-frequency branches; tasks (Gauss 1, Gauss 2, Subsample, Gradient, Oversampling, DoG, Search, Extract) annotated with their SW/HW/SL mappings and execution times ranging from 1 ms to 4700 ms.]
Figure 13: DFG graph of the robotic vision application.
Application                | Our total execution time (ms) | HCP total execution time (ms) | SW scheduling time (ms) | HW scheduling time (ms)
Icam simple (10 tasks)     | 47    | 47    | 7.5 | 0.00041764
Icam complex (20 tasks)    | 58    | 69    | 12  | 0.00056274
Robotic vision (30 tasks)  | 10943 | 10943 | 21  | 0.000916976
Figure 14: Execution time for 3 applications.
These features allow our scheduling IP to run online and meet the needs of the dynamic nature of most of today's applications.
6. Conclusions
In this paper, we presented a complete run-time hardware-software scheduling approach. Results of our experiments show the efficiency of the adaptation of the scheduling to a dynamic change of the partitioning, which can be due to
[Figure omitted: computing time (clock cycles, 0-25) versus number of software tasks (1-19).]
Figure 15: Variation of scheduling computation time according to task implementations.
a new mode of a dynamic application or to fault detection. As developed in this paper, a dynamic HW/SW scheduling approach has many advantages over traditional static approaches. In addition, the efficiency of our hardware implementation gives our scheduler a minimal overhead in an online execution context.
In conclusion, Table 2 summarizes the behavior of our scheduling approach in different situations of dynamicity. We show through this table in which cases it is necessary to modify the scheduler IP and resynthesize it, and in which cases the IP can adapt to the dynamic system.
[Figure omitted: design frequency (MHz) and FPGA resource usage (% of slice LUTs) for the three benchmarks (Icam simple, 10 tasks; Icam complex, 20 tasks; robotic vision, 30 tasks), synthesized with a 10-bit bus.]
Figure 16: Scalability of the method according to application complexity.
[Figure omitted: design frequency (MHz) and slice LUT usage (%) versus bus size (10, 16, and 32 bits).]
Figure 17: Impact of bus size on the scheduler synthesis results for the robotic vision application.
Our future work consists in integrating our scheduling approach among the services of an RTOS for dynamically reconfigurable systems.
Appendix
Block Diagrams of the Hardware Scheduler
See Figures 6, 7, and 9.
References
[1] M. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, San Francisco, Calif, USA, 1979.
[2] Z. A. Mann and A. Orbán, "Optimization problems in system-level synthesis," in Proceedings of the 3rd Hungarian-Japanese Symposium on Discrete Mathematics and Its Applications, Tokyo, Japan, 2003.
[3] C. Haubelt, D. Koch, and J. Teich, "Basic OS support for distributed reconfigurable hardware," in Proceedings of the 3rd and 4th International Workshops on Computer Systems: Architectures, Modeling, and Simulation (SAMOS '04), vol. 3133, pp. 30-38, Samos, Greece, July 2004.
[4] J. Noguera and R. M. Badia, "Dynamic run-time HW/SW scheduling techniques for reconfigurable architectures," in Proceedings of the 10th International Symposium on Hardware/Software Codesign (CODES '02), pp. 205-210, ACM Press, New York, NY, USA, 2002.
[5] Y. Lu, T. Marconi, K. Bertels, and G. Gaydadjiev, "Online task scheduling for the FPGA-based partially reconfigurable systems," in Proceedings of the 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications (ARC '09), pp. 216-230, Karlsruhe, Germany, March 2009.
[6] R. Pellizzoni and M. Caccamo, "Real-time management of hardware and software tasks for FPGA-based embedded systems," IEEE Transactions on Computers, vol. 56, no. 12, pp. 1666-1680, 2007.
[7] S.-W. Moon, J. Rexford, and K. G. Shin, "Scalable hardware priority queue architectures for high-speed packet switches," IEEE Transactions on Computers, vol. 49, no. 11, pp. 1215-1227, 2000.
[8] D. Picker and R. Fellman, "A VLSI priority packet queue with inheritance and overwrite," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 2, pp. 245-253, 1995.
[9] B. K. Kim and K. G. Shin, "Scalable hardware earliest-deadline-first scheduler for ATM switching networks," in Proceedings of the 18th IEEE Real-Time Systems Symposium, pp. 210-218, San Francisco, Calif, USA, December 1997.
[10] T. Pop, P. Pop, P. Eles, and Z. Peng, "Analysis and optimization of hierarchically scheduled multiprocessor embedded systems," International Journal of Parallel Programming, vol. 36, no. 1, pp. 37-67, 2008.
[11] S. Fekete, J. van der Veen, J. Angermeier, D. Göhringer, M. Majer, and J. Teich, "Scheduling and communication-aware mapping of HW-SW modules for dynamically and partially reconfigurable SoC architectures," in Proceedings of the Dynamically Reconfigurable Systems Workshop (DRS '07), Zürich, Switzerland, March 2007.
[12] B. Miramond and J.-M. Delosme, "Design space exploration for dynamically reconfigurable architectures," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '05), pp. 366-371, 2005.
[13] Altera Corp., http://www.altera.com/.
[14] Xilinx Corp., http://www.xilinx.com/.
[15] K. M. G. Purna and D. Bhatia, "Temporal partitioning and scheduling data flow graphs for reconfigurable computers," IEEE Transactions on Computers, vol. 48, no. 6, pp. 579-590, 1999.
[16] J. Resano, D. Mozos, D. Verkest, F. Catthoor, and S. Vernalde, "Specific scheduling support to minimize the reconfiguration overhead of dynamically reconfigurable hardware," in Proceedings of the 41st Annual Conference on Design Automation (DAC '04), pp. 119-124, ACM Press, San Diego, Calif, USA, June 2004.
[17] F. Ghaffari, M. Auguin, M. Abid, and M. B. Jemaa, "Dynamic and on-line design space exploration for reconfigurable architectures," in Transactions on High-Performance Embedded Architectures and Compilers I, P. Stenström, Ed., vol. 4050 of Lecture Notes in Computer Science, pp. 179-193, Springer, Berlin, Germany, 2007.
[18] J.-Y. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins, "Infrastructure for design and management of relocatable tasks in a heterogeneous reconfigurable system-on-chip," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '03), Messe, Munich, Germany, March 2003.
[19] F. Verdier, B. Miramond, M. Maillard, E. Huck, and T. Lefebvre, "Using high-level RTOS models for HW/SW embedded architecture exploration: case study on mobile robotic vision," EURASIP Journal on Embedded Systems, vol. 2008, no. 1, Article ID 349465, 2008.
[20] P. Bjorn-Jorgensen and J. Madsen, "Critical path driven cosynthesis for heterogeneous target architectures," in Proceedings of the 5th International Workshop on Hardware/Software Codesign (CODES/CASHE '97), pp. 15-19, Braunschweig, Germany, March 1997.
[21] Y.-K. Kwok and I. Ahmad, "Static scheduling algorithms for allocating directed task graphs to multiprocessors," ACM Computing Surveys, vol. 31, no. 4, pp. 406-471, 1999.
[22] Virtex II Pro, Xilinx Corp., http://www.xilinx.com/.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 723465, 15 pages
doi:10.1155/2009/723465
Research Article
Techniques and Architectures for Hazard-Free Semi-Parallel
Decoding of LDPC Codes
Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci
Department of Information Engineering, University of Pisa, Via G. Caruso 16, 56122 Pisa, Italy
Correspondence should be addressed to Massimo Rovini, massimo.rovini@gmail.com
Received 4 March 2009; Revised 18 May 2009; Accepted 27 July 2009
Recommended by Markus Rupp
The layered decoding algorithm has recently been proposed as an efficient means for the decoding of low-density parity-check (LDPC) codes, thanks to the remarkable improvement in the convergence speed (2x) of the decoding process. However, pipelined semi-parallel decoders suffer from violations or hazards between consecutive updates, which not only violate the layered principle but also enforce the loops in the code, thus spoiling the error-correction performance. This paper describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of idle cycles, to prevent the hazards of the pipeline mechanism. Also, different semi-parallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Then, taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.
Copyright © 2009 Massimo Rovini et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Improving the reliability of data transmission over noisy channels is the key issue of modern communication systems, and particularly of wireless systems, whose spatial coverage and data rate are increasing steadily.
In this context, low-density parity-check (LDPC) codes have gained the momentum of the scientific community, and they have recently been adopted as forward error correction (FEC) codes by several communication standards, such as the second-generation digital video broadcasting (DVB-S2, [1]), the wireless metropolitan area networks (WMANs, IEEE 802.16e, [2]), the wireless local area networks (WLANs, IEEE 802.11n, [3]), and the 10 Gbit Ethernet (10GBase-T, IEEE 802.3an).
LDPC codes were first discovered by Gallager back in the 1960s [4] but were long put aside until MacKay and Neal, sustained by the advances in very large-scale integration (VLSI) technology, rediscovered them in the early 1990s [5]. The renewed interest in, and the success of, LDPC codes is due to (i) the remarkable error-correction performance, even at low signal-to-noise ratios (SNRs) and for small block-lengths, (ii) the flexibility in the design of the code parameters, (iii) the decoding algorithm, very suitable for hardware parallelization, and, last but not least, (iv) the advent of structured or architecture-aware (AA) codes [6]. AA-LDPC codes reduce the decoder area and power consumption and improve the scalability of its architecture, and so allow the full exploitation of the complexity/throughput design trade-offs. Furthermore, AA-codes perform so close to random codes [6] that they are the common choice of all the latest LDPC-based standards.
Nowadays, data services and user applications impose severe low-complexity and low-power constraints and demand very high throughput of the design of practical decoders. The adoption of a fully parallel decoder architecture leads to impressive throughput but unfortunately is also so complex in terms of both area and routing [7] that a semi-parallel implementation is usually preferred (see [6, 8]).
So, to counteract the reduced throughput, designers can act at two levels: at the algorithmic level, by efficiently rescheduling the message-passing algorithm to improve its convergence rate, and at the architectural level, with the pipelining of the decoding process, to shorten the iteration time. The first matter can be solved with the turbo-decoding message-passing (TDMP) [6] or the layered decoding
[Figure omitted: Tanner graph with check nodes CN0-CN2 and variable nodes VN0-VN3 (a priori LLRs λ0-λ3, soft outputs y0-y3), the 3 x 4 base-matrix H_B = [1 0 1 0; 1 0 1 1; 0 1 0 1], and the expansion of one base-matrix entry into a 5 x 5 circulant (identity cyclically shifted by two).]
Figure 1: Tanner graph of a simple 3 x 4 base-matrix and principle of vectorization.
algorithm [9], while pipelined architectures are mandatory especially when the decoder employs serial processing units.
However, the pipeline mechanism may dramatically corrupt the error-correction performance of a layered decoder by letting the processing units not always work on the most updated messages. This issue, known as pipeline hazard, arises when the dependence between the elaborations is violated. The idea is then to reschedule the sequence of updates and to delay the decoding process with idle cycles until newer data are available.
As an improvement on similar state-of-the-art works [10-13], this paper proposes three systematic techniques to optimally reschedule the decoding process in a way that minimizes the number of idle cycles and achieves the maximum throughput. Also, this paper discusses different semi-parallel architectures, based on serial processing units and all supporting the reordering strategies, so as to attain the best trade-off between complexity and throughput for every LDPC code.
Semi-parallel architectures of LDPC decoders have recently been addressed in several papers, although none of them formally solves the issue of pipeline hazards and decoding idling. Gunnam et al. describe in [10] a pipelined semi-parallel decoder for WLAN LDPC codes, but the authors do not mention the issue of pipeline hazards; only the need of properly scrambling the sequence of data in order to clear some memory conflicts is described.
Boutillon et al. consider in [13] methods and architectures for layered decoding; the authors mention the problem of pipeline hazards ("cut-edge conflict") and of using an output order different from the natural one in the processing units; nonetheless, the issue is not investigated further, and they simply modify the decoding algorithm to compute partial updates as in [14]. Although this approach allows the decoder to operate in full pipeline with no idle cycles, it is actually suboptimal in terms of both performance and complexity.
Similarly, Bhatt et al. propose in [11] a pipelined block-serial decoder architecture based on partial updates, but again, they do not investigate the dependence between elaborations.
In [12], Fewer et al. implement a semi-parallel TDMP decoder, but the authors boost the throughput by decoding two codewords in parallel and not by means of pipelining.
This paper is organised as follows. Section 2 recalls the basics of LDPC and of AA-LDPC codes, and Section 3 summarizes the layered decoding algorithm. Section 4 introduces three different techniques to reduce the dependence between consecutive updates and analytically derives the related number of idle cycles. After this, Section 5 describes the VLSI architectures of a pipelined block-serial LDPC-layered decoder. Section 6 briefly reviews the WLAN codes used as a case study, while the performance of the related decoder is analysed in Section 7. Then, the results of the logic synthesis on a 65 nm low-power CMOS technology are discussed in Section 8, along with a comparison with similar state-of-the-art implementations. Finally, conclusions are drawn in Section 9.
2. Architecture-Aware Block-LDPC Codes
LDPC codes are linear block-codes described by a parity-check matrix H establishing a certain number of (even) parity constraints on the bits of a codeword. Figure 1 shows the parity-check matrix of a very simple LDPC code with length N = 4 bits and with M = 3 parity constraints. LDPC codes are also effectively described in a graphical way through a Tanner graph [15], where each bit in the codeword is represented with a circle, known as a variable-node (VN), and each parity-check constraint with a square, known as a check-node (CN).
Recently, the joint design of code and decoder has blossomed in many works (see [8, 16]), and several principles have been established for the design of implementation-oriented AA-codes [6]. These can be summarized into (i) the arrangement of the parity-check matrix in squared subblocks, and (ii) the use of deterministic patterns within the subblocks. Accordingly, AA-LDPC codes are also referred to as block-LDPC codes [8].
The pattern used within blocks is the vital facet for a low-cost implementation of the interconnection network of the decoder, and can be based either on permutations, as in [6] and for the class of π-rotation codes [17], or on circulants, that is, cyclic shifts of the identity matrix, as in [8] and in every recent standard [1-3].
AA-LDPC codes are defined by the number of block-columns n_c, the number of block-rows n_r, and the block-size B, which is the size of the component submatrices. Their parity-check matrix H can be conveniently viewed as H = P(H_B), that is, as the expansion of a base-matrix H_B with size n_r x n_c. The expansion is accomplished by replacing the 1s in H_B with permutations or circulants, and the 0s with null subblocks. Thus, the block-size B is also referred to as the expansion-factor, for a codeword length of the resulting LDPC code equal to N = B·n_c and a code rate r = 1 - n_r/n_c.
A simple example of expansion or vectorization of a base-matrix is shown in Figure 1. The size, number, and location of the nonnull blocks in the code are the key parameters to get good error-correction performance and low complexity of the related decoder.
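As a rough illustration of this expansion (a Python sketch under our own conventions, not code from the paper; the shift values are hypothetical), each 1 of H_B is replaced by a cyclically shifted B x B identity and each 0 by a null block:

```python
def expand_base_matrix(HB, shifts, B):
    """Expand a base matrix H_B into the full parity-check matrix H.

    Each 1 of HB becomes a B x B identity cyclically shifted (to the
    right) by the matching entry of `shifts`; each 0 becomes a B x B
    null block.
    """
    nr, nc = len(HB), len(HB[0])
    H = [[0] * (nc * B) for _ in range(nr * B)]
    for i in range(nr):
        for j in range(nc):
            if HB[i][j]:
                s = shifts[i][j]
                for r in range(B):
                    # row r of the shifted identity has its 1 at column (r + s) mod B
                    H[i * B + r][j * B + (r + s) % B] = 1
    return H

# The 3 x 4 base matrix of Figure 1, with hypothetical shift values
HB = [[1, 0, 1, 0],
      [1, 0, 1, 1],
      [0, 1, 0, 1]]
shifts = [[2, 0, 1, 0],
          [0, 0, 3, 1],
          [0, 4, 0, 2]]
H = expand_base_matrix(HB, shifts, B=5)
print(len(H), len(H[0]))  # 15 20
```

With B = 5 the expanded code has N = B·n_c = 20 bits, matching the formula above.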
3. Decoding of LDPC Codes
LDPC codes are decoded with the belief propagation (BP) or message-passing (MP) algorithm, which belongs to the broader class of maximum a posteriori (MAP) algorithms. The BP algorithm has been proved to be optimal if the graph of the code does not contain cycles, but it can still be used and considered as a reference for practical codes with cycles. In the latter case, the sequence of the elaborations, also referred to as the schedule, considerably affects the achievable performance.
The most common schedule for BP is the so-called two-phase or flooding schedule (FS) [18], where all parity-check nodes first, followed by all variable nodes then, are updated in sequence.
A different approach, taking the distribution of closed paths and girths in the code into account, has been described by Xiao and Banihashemi in [19]. Although probabilistic schedules are shown to outperform deterministic schedules, the random activation strategy of the processing nodes is not very suitable for HW implementation and adds significant complexity overheads.
The most attractive schedule is the shuffled or layered decoding [6, 9, 18, 20]. Compared to the FS, the layered schedule almost doubles the decoding convergence speed, both for codes with cycles and cycle-free [20]. This is achieved by looking at the code as a connection of smaller supercodes [6] or layers [9], exchanging intermediate reliability messages. Specifically, a posteriori messages are made available to the next layers immediately after computation, and not at the next iteration as in a conventional flooding schedule.
Layers can be any set of either CNs or VNs, and, accordingly, CN-centric (or horizontal) and VN-centric (or vertical) algorithms have been analyzed in [18, 20]. However, CN-centric solutions are preferable since they can exploit serial, flexible, and low-complexity CN processors.
The horizontal layered decoding (HLD) is summarized in Algorithm 1 and consists in the exchange of probabilistic reliability messages around the edges of the Tanner graph (see Figure 1) in the form of logarithms of likelihood ratios (LLRs); given the random variable x, its LLR is defined as
(LLRs); given the random variable x, its LLR is dened as
LLR(x) = log
Pr(x = 1)
Pr(x = 0)
. (1)
In Algorithm 1, λ_n is the nth a priori LLR of the received bits, with n = 0, 1, ..., N - 1 and N the length of the codeword; M is the overall number of parity-check constraints, and N_it the number of decoding iterations. Also, N(m) is the set of VNs connected to the mth CN, Λ^(q)_{m,n} represents the check-to-variable (c2v) reliability message sent from CN m to VN n at iteration q, and y_n is the total information or soft-output (SO) of the nth bit in the codeword (see Figure 1).
For the sake of an easier notation, it is assumed here that a layer corresponds to a single row of the parity-check matrix. Before being used by the next CN or layer, the SOs are refined with the involved c2v message, as shown in line 13, and thanks to this mechanism a faster convergence is achieved.
input: a priori LLRs λ_n, n = 0, 1, ..., N - 1
output: a posteriori hard-decisions ŷ_n = sign(y_n)
(1) // Messages initialization
(2) q = 0; y_n = λ_n; Λ^(0)_{m,n} = 0, n = 0, ..., N - 1, m = 0, ..., M - 1;
(3) while (q < N_it & !Convergence) do
(4)   // Loop on all layers
(5)   for m ← 0 to M - 1 do
(6)     // Check-node update
(7)     forall n ∈ N(m) do
(8)       // Sign update
(9)       sign(Λ^(q+1)_{m,n}) = ∏_{j∈N(m)\n} sign(y_j - Λ^(q)_{m,j});
(10)      // Magnitude update
(11)      |Λ^(q+1)_{m,n}| = M-min*_{j∈N(m)\n} (|y_j - Λ^(q)_{m,j}|);
(12)      // Soft-output update
(13)      y_n = y_n - Λ^(q)_{m,n} + Λ^(q+1)_{m,n}
(14)    end
(15)  end
(16)  q++;
(17) end
Algorithm 1: Horizontal layered decoding.
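To make the data flow of Algorithm 1 concrete, the following sketch runs the same layered schedule with a plain min-sum magnitude update in place of the M-min* operator (an illustrative Python model, not the decoder architecture of the paper; the hard-decision mapping assumes the sign convention of definition (1), where a positive LLR favours bit 1):

```python
def layered_decode(H_rows, llr, n_iter=10):
    """Horizontal layered decoding with a min-sum magnitude update.

    H_rows[m] lists the variable nodes N(m) of check node m (one layer
    per row); llr holds the a priori LLRs lambda_n.
    """
    y = list(llr)                                    # soft outputs y_n
    c2v = {(m, n): 0.0 for m, row in enumerate(H_rows) for n in row}
    for _ in range(n_iter):
        for m, row in enumerate(H_rows):             # loop on layers
            # variable-to-check messages of this layer
            v2c = {n: y[n] - c2v[(m, n)] for n in row}
            for n in row:
                others = [v2c[j] for j in row if j != n]
                sign = -1.0 if sum(t < 0 for t in others) % 2 else 1.0
                mag = min(abs(t) for t in others)    # min-sum, not M-min*
                new = sign * mag
                y[n] = v2c[n] + new                  # soft-output update
                c2v[(m, n)] = new
    # positive LLR favours bit 1 under definition (1)
    return [1 if v > 0 else 0 for v in y]

# Toy code with two checks over bits {0, 1, 2} and {1, 2, 3}
bits = layered_decode([[0, 1, 2], [1, 2, 3]], [2.5, -0.5, 3.0, 1.0])
```

Note how the soft outputs refined by one layer are consumed by the next layer within the same iteration, which is the source of the doubled convergence speed.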
Magnitudes are updated with the M-min* binary operator [21], defined as M-min*(a, b) = min(a, b) + log((1 + e^-(a+b))/(1 + e^-|a-b|)) for a, b ≥ 0. Following an approach similar to Jones et al. [22], the updating rule of the magnitudes is further simplified with the method described in [23], which proved to yield very good performance. Here, only two values are computed and propagated for the magnitude of the c2v messages; specifically, if we define

j_min = arg min_{j∈N(m)} |y_j - Λ^(q)_{m,j}|   (2)

as the index of the smallest variable-to-check (v2c) message entering CN m, then a dedicated c2v message is computed in response to VN j_min:

|Λ^(q+1)_{m,j_min}| = M-min*_{j∈N(m), j≠j_min} (|y_j - Λ^(q)_{m,j}|) = Λ'_m   (3)

while all the remaining VNs receive one common, non-marginalized value for the magnitude, given by

|Λ^(q+1)_{m,n}|, n ≠ j_min = M-min*(Λ'_m, |y_{j_min} - Λ^(q)_{m,j_min}|) = Λ''_m.   (4)
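A small numeric model of the two-output update of (2)-(4) follows (illustrative Python; the pairwise correction term implements the standard min* formulation, which is our assumption on the exact M-min* definition of [21]):

```python
import math

def m_min_star(a, b):
    """Pairwise min* on magnitudes a, b >= 0, correction terms included."""
    return (min(a, b) + math.log1p(math.exp(-(a + b)))
            - math.log1p(math.exp(-abs(a - b))))

def two_output_magnitudes(v2c_mags):
    """c2v magnitudes per equations (2)-(4).

    Returns (j_min, lam1, lam2): lam1 is sent to VN j_min only, and
    lam2 is the common value sent to all the other VNs.
    """
    j_min = min(range(len(v2c_mags)), key=lambda j: v2c_mags[j])
    rest = [m for j, m in enumerate(v2c_mags) if j != j_min]
    lam1 = rest[0]
    for m in rest[1:]:             # fold over all inputs but the minimum
        lam1 = m_min_star(lam1, m)
    lam2 = m_min_star(lam1, v2c_mags[j_min])
    return j_min, lam1, lam2

# Hypothetical v2c magnitudes entering one check node
j, l1, l2 = two_output_magnitudes([1.8, 0.4, 2.5, 0.9])
```

Only the pair (Λ'_m, Λ''_m) and the index j_min need to be stored per check node, which is the complexity saving behind the method.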
4. Decoding Pipelining and Idling
The data-flow of a pipelined decoder with serial processing units is sketched in Figure 2. A centralized memory unit keeps the updated soft-outputs, computed by the node processors (NPs) according to Algorithm 1. If we denote by d_k the number of nonnull blocks in layer k, that is, the degree of layer k, then the processor takes d_k clock cycles to serially load its inputs. Then, refined values are written back in memory (after scrambling or permutation) with the
[Figure omitted: loop formed by a serial NP, SO buffer, permutation network, and synchronous SO memory, with input and output SO streams.]
Figure 2: Outline of the flow of soft-outputs in an LDPC-layered decoder with serial processing units.
latency of L_SO clock cycles, and this operation takes again d_k clock cycles. Overall, the processing time of layer k is then 2·d_k + L_SO clock cycles, as shown in Figure 3(a).
If the decoder works in pipeline, time is saved by overlapping the phases of elaboration, writing-out, and reading, so that data are continuously read from and written into memory, and a new layer is processed every d_k clock cycles (see Figure 3(b)).
Although highly desirable, the pipeline mechanism is particularly challenging in a layered LDPC decoder, since the soft-outputs retrieved from memory and used for the current elaboration may not always be up-to-date: newer values could still be in the pipeline. This issue, known as pipeline hazard, prevents the use, and so the propagation, of always up-to-date messages and spoils the error-correction performance of the decoding algorithm.
The solution investigated in this paper is to insert null or idle cycles between consecutive updates, so that a node processor is suspended to wait for newer data. The number of idle cycles must be kept as small as possible, since it affects the iteration time and so the decoding throughput. Its value depends on the actual sequence of layers updated by the decoder, as well as on the order followed to update the messages within a layer.
Three different strategies are described in this section to reduce the dependence between consecutive updates in the HLD algorithm and, accordingly, the number of idle cycles. They differ in the order followed for the acquisition and writing-out of the decoding messages, and constitute a powerful tool for the design of layered, hazard-free LDPC codes.
4.1. System Notation. Without any lack of generality, let us identify a layer with one single parity-check node and, focusing on the set S_k of soft-outputs participating to layer k, let us define the following subsets:

(i) A_k = S_k ∩ S_{k-1}, the set of SOs in common with layer k - 1;
(ii) B_k = (S_k ∩ S_{k+1}) \ S_{k-1}, the set of SOs in common with layer k + 1 and not in A_k;
(iii) C_k = S_{k-1} ∩ S_k ∩ S_{k+1}, the set of SOs in common with both layers k - 1 and k + 1;
(iv) E_k = (S_k ∩ S_{k-2}) \ (S_{k-1} ∪ S_{k+1}), the set of SOs in common with layer k - 2 and not in A_k or B_k;
(v) F_k = (S_k ∩ S_{k+2}) \ (S_{k-2} ∪ S_{k-1} ∪ S_{k+1}), the set of SOs in common with layer k + 2 but not in E_k, A_k, or B_k;
(vi) G_k = (S_{k-2} ∩ S_k ∩ S_{k+2}) \ (S_{k-1} ∪ S_{k+1}), the set of SOs in common with both layers k - 2 and k + 2, but not in A_k or B_k;
(vii) R_k, the set of the remaining SOs.

In the definitions above, the notation A \ B means the relative complement of B in A, that is, the set-theoretic difference of A and B. Let us also define the following cardinalities: d_k = |S_k| (the degree of layer k), α_k = |A_k|, β_k = |B_k|, γ_k = |C_k|, ε_k = |E_k|, φ_k = |F_k|, η_k = |G_k|, and ρ_k = |R_k|.
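These subsets can be computed mechanically from the per-layer sets S_k; a small Python sketch (illustrative only; the layer sets are hypothetical, and layers outside the schedule are treated as empty sets) makes the definitions concrete:

```python
def layer_subsets(S, k):
    """Subsets of the soft-outputs of layer k (notation of Section 4.1).

    S is the list of per-layer SO index sets; out-of-range layers are
    treated as empty. Note that C_k is a subset of A_k, and G_k a
    subset of E_k.
    """
    get = lambda i: S[i] if 0 <= i < len(S) else set()
    Sk = S[k]
    A = Sk & get(k - 1)
    B = (Sk & get(k + 1)) - get(k - 1)
    C = get(k - 1) & Sk & get(k + 1)
    E = (Sk & get(k - 2)) - (get(k - 1) | get(k + 1))
    F = (Sk & get(k + 2)) - (get(k - 2) | get(k - 1) | get(k + 1))
    G = (get(k - 2) & Sk & get(k + 2)) - (get(k - 1) | get(k + 1))
    R = Sk - (A | B | E | F)       # remaining SOs
    return A, B, C, E, F, G, R

# Hypothetical 4-layer schedule of SO indices
S = [{0, 1, 2}, {1, 2, 3}, {2, 3, 4}, {0, 2, 4, 5}]
A, B, C, E, F, G, R = layer_subsets(S, 1)
print(A, B, C)  # {1, 2} {3} {2}
```

The cardinalities α_k, β_k, γ_k, and so on, are then just the lengths of the returned sets.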
4.2. Equal Output Processing. First, let us consider a very straightforward and implementation-friendly architecture of the node processor that updates (and so delivers) the soft-output messages with the same order used to take them in. In such a case it would be desirable to (i) postpone the acquisition of the messages updated by the previous layer, that is, the messages in A_k, and (ii) output the messages in B_k as soon as possible, to let the next layer start earlier. Actually, the last constraint only holds when A_k does not include any message common to layer k + 1, that is, when C_k = ∅; otherwise, the set B_k could be acquired at any time before A_k.

Figure 4 shows the I/O data streams of an equal output processing (EOP) unit. Here, L_SO is the latency of the SO data-path, including the elaboration in the NP, the scrambling, and the two memory accesses (reading and writing). Focusing on layer k + 1, the set C_{k+1} cannot be assigned to any specific position within A_{k+1}, since the whole A_{k+1} is acquired according to the same order used by layer k to output (and so also acquire) the sets B_k and C_k. For this reason, the situation plotted in Figure 4 is only for the sake of a clearer drawing.

With reference to Figure 4, pipeline hazards are cleared if I_k idle cycles are spent between layers k and k + 1, so that

I_k + |S_{k+1} \ A_{k+1}| ≥ L_SO + |S_k \ (A_k ∪ B_k)| · u(|C_k|)   (5)

with u(x) = 1 for x > 0 and u(x) = 0 otherwise. This means that if C_k is empty, then the messages in S_k \ (A_k ∪ B_k) do not need to be waited for. The solution of (5) with minimum latency is

I_k = L_SO - (d_{k+1} - α_{k+1}) + (d_k - α_k - β_k) · u(γ_k).   (6)

Note that (5) and (6) only hold under the hypothesis of C_k leading within A_k. If this is not the case, up to |A_k \ C_k| extra idle cycles could be added if C_k is output last within A_k.

So far, we have only focused on the interaction between two consecutive layers; however, violations could also arise between layers k and k + 2. Despite this possibility, this issue is not treated here, as it is typically mitigated by the same idle cycles already inserted between layers k and k + 1 and between layers k + 1 and k + 2.
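Equation (6) is easy to evaluate per layer; the following helper (illustrative Python; clamping negative results to zero is our assumption, a negative value simply meaning that no idling is needed) computes I_k for a given layer profile:

```python
def eop_idle_cycles(d, alpha, beta, gamma, k, L_SO):
    """Minimum idle cycles I_k between layers k and k+1 (EOP, eq. (6))."""
    u = lambda x: 1 if x > 0 else 0
    need = (L_SO - (d[k + 1] - alpha[k + 1])
            + (d[k] - alpha[k] - beta[k]) * u(gamma[k]))
    return max(need, 0)            # negative value: no idling required

# Hypothetical profile: two layers of degree 8, pipeline latency 6
print(eop_idle_cycles(d=[8, 8], alpha=[3, 4], beta=[2, 1],
                      gamma=[1, 0], k=0, L_SO=6))  # 5
```

Summing this quantity over all layers of one iteration gives the total idling overhead of an EOP schedule, which is what the reordering techniques aim to minimize.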
4.3. Reversed Output Processing. Depending on the particular
structure of the parity-check matrix H, it may occur that the
[Figure omitted: (a) not pipelined: each layer k performs read (d_k cycles), elaboration, and write (d_k cycles, after latency L_SO) in sequence; (b) pipelined: the read, elaboration, and write phases of layers k, k + 1, and k + 2 overlap.]
Figure 3: Pipelined and not pipelined data-flow.
[Figure omitted: SO input and output streams of layers k and k + 1 under EOP; each stream is ordered as B_k, C_k, A_k (cardinalities β_k, γ_k, α_k), with layer degrees d_k, d_{k+1}, output latency L_SO, and idle periods I_k, I_{k+1}.]
Figure 4: Input and output data streams in an NP with EOP.
most of the messages of layer k in common with layer k − 1 are also shared with layer k + 1, that is, A_k = C_k and B_k = ∅. If this condition holds, as for the WLAN LDPC codes (see Figure 11), it can be worth reversing the output order of SOs so that the messages in A_k can be both acquired last and output first.

Figure 5(a) shows the I/O streams of a reversed output processing (ROP) unit. Exploiting the reversal mechanism, the set B_k is acquired second-last, just before A_k, so that it is available earlier for layer k + 1.
Following a reasoning similar to EOP, the situation sketched in Figure 5(a), where C_k is delivered first within A_k, is just for an easier representation, and the condition for hazard-free layered decoding is now

    I^{2l}_k + |S_{k+1} \ A_{k+1}| ≥ L_SO + |A_k \ C_k| · u(|B_k|).    (7)
Indeed, when B_k = ∅, one could output C_k first within A_k, and so get rid of the term |A_k \ C_k|. However, since C_k is actually left floating within A_k, (7) represents again a best-case scenario, and up to |A_k \ C_k| extra idle cycles could be required. From (7), the minimum-latency solution is

    I^{2l}_k = L_SO − (d_{k+1} − |A_{k+1}|) + |A_k \ C_k| · u(|B_k|).    (8)
Similarly to EOP, the ROP strategy also suffers from pipeline hazards between three consecutive layers, and because of the reversed output order, the issue is more relevant now. This situation is sketched in Figure 5(b), where the sets E_k, F_k, and G_k are managed similarly to A_k, B_k, and C_k. The ROP strategy is then instructed to acquire the set E_k later and to output F_k earlier. However, the situation is complicated by the fact that the set F_{k−1} ∪ G_{k−1} may not entirely coincide with E_{k+1}; rather it is E_{k+1} ⊆ (F_{k−1} ∪ G_{k−1}), since some of the messages in F_{k−1} ∪ G_{k−1} can be found in B_{k+1}. This is highlighted in Figure 5(b), where those messages of F_{k−1} and G_{k−1} not delivered to E_{k+1} are shown in dark grey.
To clear the hazards between three layers, additional idle cycles are added in the number of

    I^{3l}_k = max(ACQ_{k+1} − WR_{k−1}, 0),    (9)
where ACQ_{k+1} is the acquisition margin on layer k + 1, and WR_{k−1} is the writing-out margin on layer k − 1. These can be computed under the assumption of no hazard between layers k − 1 and k (i.e., C_k ∪ A_k is aligned with C_{k−1} ∪ B_{k−1} thanks to I^{2l}_k, as shown in Figure 5(b)) and are given by
    ACQ_{k+1} = I^{2l}_k + d_{k+1} − (|A_{k+1}| + |B_{k+1}| + |E_{k+1}|),
    WR_{k−1} = (d_{k−1} − |A_{k−1}| + |G_{k−1} \ E_{k+1}|) · u(|F_{k−1}|).    (10)
The margin WR_{k−1} is actually nonnull only if F_{k−1} ≠ ∅; otherwise, WR_{k−1} = 0 under the hypothesis that (i) the set G_{k−1} is output first within E_{k−1}, and (ii) within G_{k−1}, the messages not in E_{k+1} are output last.
Figure 5: Organization of the input and output data stream in an NP with ROP. (a) Pipeline hazards in the update of two consecutive layers. (b) Pipeline hazards in the update of three consecutive layers; messages of G_{k−1} and F_{k−1} not in E_{k+1} are shown in dark grey.
Figure 6: Input and output data streams in an NP with UOP.
Overall, the number of idle cycles of ROP is given by

    I_k = I^{2l}_k + I^{3l}_k.    (11)
4.4. Unconstrained Output Processing. Fewer idle cycles are expected if the orders used for input and output are not constrained to each other. This implies that layer k can still delay the acquisition of the messages updated by layer k − 1 (i.e., messages in A_k) as usual, but at the same time the messages common to layer k + 1 (i.e., in B_k ∪ C_k) can also be delivered earlier.

The input and output data streams of an unconstrained output processing (UOP) unit are shown in Figure 6. Now, hazard-free layered decoding is achieved when

    I_k + |S_{k+1} \ A_{k+1}| ≥ L_SO,    (12)

which yields

    I_k = L_SO − (d_{k+1} − |A_{k+1}|).    (13)
Figure 7: Layered decoder architecture with variable-to-check buffer.
Regarding the interaction between three consecutive layers, if the messages common to layer k + 2 (i.e., in F_k ∪ G_k) are output just after B_k ∪ C_k, and if on layer k + 2 the set E_{k+2} is taken just before A_{k+2}, then there is no risk of pipeline hazard between layers k and k + 2.
4.5. Decoding of Irregular Codes. A serial processor cannot process consecutive layers with decreasing degrees, d_{k+1} < d_k, as the pipeline of the internal elaborations would be corrupted and the output messages of the two layers would overlap in time. This is but another kind of pipeline hazard, and again, it can be solved by delaying the update of the second layer with Δd_k = d_k − d_{k+1} idle cycles.

Since this type of hazard is independent of that seen above, the same idle cycles may help to solve both issues. For this reason, the overall number of idle cycles becomes

    I′_k = max(I_k, Δd_k, 0)    (14)

with I_k being computed according to (6), (11), or (13).
4.6. Optimal Sequence of Layers. For a given reordering strategy, the overall number of idle cycles per decoding iteration is a function of the actual sequence of layers used for the decoding. For a code with n_r layers, the optimal sequence of layers p* minimizing the time spent in idle is given by

    p* = arg min_{p ∈ P} Σ_{k=0}^{n_r−1} I′_k(p),    (15)

where I′_k(p) is the number of idle cycles between layers k and k + 1 for the generic permutation p and is given by (14), and P is the set of the possible permutations of layers.

The minimization problem in (15) can be solved by means of a brute-force computer search and results in the definition of a permuted parity-check matrix H̃, whose layers are scrambled according to the optimal permutation p*. Then, within each layer of H̃, the order to update the nonnull subblocks is given by the strategy in use among EOP, ROP, and UOP.
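The brute-force search of (15) can be sketched in a few lines. Here idle_between is a hypothetical callback returning I′_k for an ordered pair of layers, and the sequence is treated as cyclic since it repeats across iterations; this is an illustration, not the paper's search tool:

```python
from itertools import permutations

def best_layer_order(layers, idle_between):
    # Eq. (15): exhaustive search of the layer permutation that
    # minimizes the idle cycles summed over one decoding iteration.
    best, best_cost = None, float("inf")
    for p in permutations(layers):
        cost = sum(idle_between(p[k], p[(k + 1) % len(p)])
                   for k in range(len(p)))
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost

# Toy cost model: idle cycles grow with the "distance" between layers.
order, cost = best_layer_order(range(4), lambda a, b: abs(a - b))
print(cost)   # -> 6
```

With n_r ≤ 12 layers, as for the WLAN codes, the search space is at most 12! permutations, which is still manageable in an offline search.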
4.7. Summary and Results. The three methods proposed in this section are differently effective in minimizing the overall time spent in idle. Although UOP is expected to yield the smallest latency, the results strongly depend on the considered LDPC code, and ROP and EOP can be very close to UOP. As a case example, results will be shown in Section 7 for the WLAN LDPC codes.

However, the effectiveness of the individual methods must be weighed up in view of the requirements of the underlying decoder architecture and the costs of its hardware implementation, which is the objective of Section 5. Thus, UOP generally requires bigger complexity in hardware, and EOP or ROP can be preferred for particular codes.
5. Decoder Architectures

Low complexity and high throughput are key features demanded of every competitive LDPC decoder, and to this extent, semi-parallel architectures are widely recognised as the best design choice.

As shown in [6, 8, 12], to mention just a few, a semi-parallel architecture includes an array of processing elements with size usually equal to the expansion factor B of the base-matrix H_B. Therefore, the HLD algorithm described in Section 3 must be intended in a vectorized form as well, and in order to exploit the code structure, a layer counts B consecutive parity-check nodes. Layers (in the number of n_r = M/B) are updated in sequence by the B check-node units (CNUs), and an array of B SOs (y_n) and of B c2v messages (ε^{(q)}_{m,n}) are concurrently updated at every clock cycle. Since the parity-check equations in a layer are independent by construction, that is, they do not share SOs, the analysis of Section 4 still holds in a vectorized form.

The CNUs are designed to serially update the c2v magnitudes according to (3) and (4), and any arbitrary order of the c2v messages (and so of SOs, see line 13 of Algorithm 1) can be easily achieved by properly multiplexing between the two values, as also shown in [23]. It must be pointed out that the 2-output approximation described in Section 3 is pivotal to a low-complexity implementation of EOP, ROP, or UOP in the CNU. However, the same strategies could also be used with a different (or even no) approximation in the CNU, although the cost of the related implementation would probably be higher.

Three VLSI architectures of a layered decoder will be described, that differ in the management of the memory
Figure 8: Layered decoder with three-port SO and c2v memories.
units of both SO and c2v, and so result in different implementation costs in terms of memory (RAM and ROM) and logic.
5.1. Local Variable-to-Check Buffer. The most straightforward architecture of a vectorized layered decoder is shown in Figure 7. Here, the arrays of v2c messages μ^{(q)}_{m,n} entering the CNUs during the update of layer m = 0, 1, ..., n_r − 1 are computed on-the-fly as μ^{(q)}_{m,n} = y_n − ε^{(q)}_{m,n}, with n ∈ N(m), and both the arrays of c2v and SO messages are retrieved from memory.

Then, the updated c2v messages are used to refine every array of SOs belonging to layer m: according to line 13 of Algorithm 1, this is done by adding the new c2v array ε^{(q+1)}_{m,n} to the input v2c array μ^{(q)}_{m,n}. Since the CNUs work in pipeline, while the update of layer m is still in progress, the array of the v2c messages belonging to layer m + 1 is already being computed as μ^{(q)}_{m+1,n} = y_n − ε^{(q)}_{m+1,n}, with n ∈ N(m + 1). For this reason, μ^{(q)}_{m,n} needs to be temporarily stored in a local buffer as shown in Figure 7. The buffer is vectorized as well and stores B · d_{c,max} messages, with d_{c,max} the maximum CN degree in the code.

Before being stored back in memory, the array y_n is circularly shifted and made ready for its next use, by applying compound or incremental rotations [12]; this operation is carried out by the circular shifting network of Figure 7, and more details about its architecture are available in [24].
The v2c buffer is the key element that allows the architecture to work in pipeline. It has to sustain one reading and one writing access concurrently, and can be efficiently implemented with shift-register-based architectures for EOP (first-in first-out, FIFO, buffer) and ROP (last-in first-out, LIFO, buffer). On the contrary, UOP needs to map the buffer onto a dual-port memory bank, whose (reading) address is provided by an extra configuration memory (ROM).
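The per-layer dataflow of Arch. V-A can be summarized in scalar form (the real unit is vectorized over B lanes); in the sketch below, cnu is a stand-in for the check-node update of (3) and (4), and all names, as well as the toy min-magnitude CNU, are mine:

```python
def layered_update(y, eps, layer_cols, cnu):
    # One layer of HLD: for each block-column n of the layer, compute
    # the v2c message on the fly (mu = y - eps), run the check-node
    # update, then refine the soft outputs (Algorithm 1, line 13).
    mu = {n: y[n] - eps[n] for n in layer_cols}   # local v2c buffer
    new_eps = cnu([mu[n] for n in layer_cols])
    for i, n in enumerate(layer_cols):
        eps[n] = new_eps[i]          # store updated c2v
        y[n] = mu[n] + eps[n]        # SO update
    return y, eps

# Toy min-sum-like CNU: each output is the minimum magnitude of the
# other inputs (the sign rule is omitted for brevity).
def toy_cnu(mus):
    return [min(abs(m) for j, m in enumerate(mus) if j != i)
            for i in range(len(mus))]

y = {0: 3.0, 1: -2.0, 2: 5.0}
eps = {0: 1.0, 1: 0.0, 2: -1.0}
layered_update(y, eps, [0, 1, 2], toy_cnu)
print(y)   # -> {0: 4.0, 1: 0.0, 2: 8.0}
```

The dictionary mu plays exactly the role of the local v2c buffer of Figure 7: it is filled while reading and consumed while writing back.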
5.2. Double Memory Access. The buffer of Arch. V-A can be removed if the v2c messages are computed twice on-the-fly, as shown in Figure 8: the first time to feed the array of CNUs, and then to update the SOs. To this aim, a further reading is required to get the arrays y_n and ε^{(q)}_{m,n} from memory, and so recompute the array μ^{(q)}_{m,n} at the CNUs' output.

It follows that three-port memories are needed for both SO and c2v messages, since three concurrent accesses have to be supported: two readings (see ports r1 and r2 in Figure 8) and one writing. This memory can be implemented by distributing data on several banks of customary dual-port memory, in such a way that two readings always involve different banks. Actually, in a layered decoder a same memory location needs to be accessed several times per iteration and concurrently with several other data, so that resorting to only two memory banks would be unfeasible. On the other hand, the management of a higher number of banks would add a significant overhead to the complexity of the whole design.
The proposed solution is sketched in Figure 9 and is based on only two banks (A and B) but, to clear access conflicts, some data are redundantly stored in both banks (see elements C1 and C2 in the example of Figure 9).

The most trivial and expensive solution is achieved when both banks are a full copy or mirror of the original memory as in [11], which corresponds to 100% redundancy. Conversely, data can be selectively assigned to the two banks through computer search aiming at a minimum redundancy.
Roughly speaking, if we denote by ν_i the cardinality of the set of data (SO or c2v messages) read concurrently with the ith datum, for i = 0, 1, ..., N − 1, then the higher Σ_i ν_i is (for a given N), the higher is the expected redundancy. So, a small redundancy ρ_c2v is experienced by the c2v memory, since each c2v message can collide with at most two other data (i.e., max_i ν_i = 2), while a higher redundancy ρ_SO is associated with the SO memory, since every SO can face up to 2·d_{VN,n} conflicts, with d_{VN,n} being the degree of the nth variable node, typically greater than 1 (especially for low-rate codes).
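One possible flavour of the computer search mentioned above is a greedy 2-coloring of the read-conflict graph, duplicating any item that conflicts with items already placed in both banks. This is an illustrative sketch under my own assumptions, not the paper's actual algorithm:

```python
def partition_two_banks(n_items, concurrent_pairs):
    # Assign each memory item to bank 'A' or 'B' so that items read
    # concurrently sit in different banks; items whose constraints
    # cannot be met are duplicated in both banks (the redundancy of
    # Section 5.2).
    adj = {i: set() for i in range(n_items)}
    for a, b in concurrent_pairs:
        adj[a].add(b)
        adj[b].add(a)
    bank = {}           # item -> primary bank
    duplicated = set()  # items mirrored in both banks
    for i in range(n_items):
        # A duplicated neighbour imposes no constraint: its copy can
        # be read from either bank.
        used = {bank[j] for j in adj[i] if j in bank and j not in duplicated}
        if used == {"A", "B"}:
            duplicated.add(i)   # conflicts with both banks: replicate
            bank[i] = "A"       # primary copy; the other bank holds a mirror
        else:
            bank[i] = "A" if "A" not in used else "B"
    return bank, duplicated

# An odd cycle of conflicts is not 2-colorable: one item is duplicated.
bank, dup = partition_two_banks(3, [(0, 1), (1, 2), (2, 0)])
print(bank, dup)
```

The fraction len(duplicated)/n_items plays the role of the redundancy ratio ρ discussed above.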
Indeed, the issue of memory partitioning and the
reordering techniques described in Section 4 are linked to
each other: whenever the CNUs are in idle, only one reading
is performed. Therefore, an overall system optimization
aiming at minimizing the iteration latency and the amount
Figure 9: Three-port memory: data partitioning and architecture.
Figure 10: Layered decoder with v2c three-port memory.
of memory redundancy at the same time could be pursued;
however, due to the huge optimization space, this task is
almost unfeasible and is not considered in this work.
5.3. Storage of Variable-to-Check Messages. During the elaboration of a generic layer, a certain v2c message is needed twice, and a local buffer or multiple memory reading operations were implemented in Arch. V-A and Arch. V-B, respectively.

A third way of solving the problem is computing the array of v2c messages only once per iteration, like in Arch. V-A, but instead of using a local buffer, the v2c messages are precomputed and stored in the SO memory, ready for the next use, as sketched in Figure 10. A similar architecture is used in [10, 16], but the issue of the decoding pipeline is not clearly stated there.
In this way, the SO memory turns into a v2c memory with the following meaning: the array y_n updated by layer m is stored in memory after marginalization with the c2v message ε_{m′,n}, with m′ being the index of the next layer reusing the same array of SOs, y_n. In other words, the array of v2c messages involved in the next update of the same block-column n is precomputed. Therefore, the data stored in the v2c memory are used twice, first to feed the array of CNUs, and then for the SOs update.
Similarly to Arch. V-B, a three-port memory would be required because of the decoding pipeline; the same considerations of Section 5.2 still hold, and an optimum partitioning of the v2c memory onto two banks with some redundancy can be found. Note that, as opposed to Arch. V-B, a customary dual-port memory is enough for c2v messages.

As far as complexity is concerned, at first glance this solution seems preferable to Arch. V-B, since it needs only two stages of parallel adders while the c2v memory is not split. However, the management of the reading ports of the v2c memory introduces significant overheads, since after the update of the soft outputs y_n by layer m, the memory controller must be aware of which is the next layer m′ using the same soft outputs y_n. This information needs to be stored in a dedicated configuration memory, whose size and area can be significant, especially in a multilength, multirate decoder.
Figure 11: Parity-check base-matrix H_B of the block-LDPC code for IEEE 802.11n with codeword size N_2 = 1944 and rate r = 2/3. Black squares correspond to cyclic shifts s of the 81 × 81 identity matrix (0 ≤ s ≤ B − 1), also indicated in the square, while empty squares correspond to 81 × 81 all-zero submatrices.
6. A Case Study: The IEEE 802.11n LDPC Codes

6.1. LDPC Code Construction. The WLAN standard [3] defines AA-LDPC codes based on circulants of the identity matrix. Three different codeword lengths are supported, N_0 = 648, N_1 = 1296, and N_2 = 1944, each coming with four code rates, 1/2, 2/3, 3/4, and 5/6, for a total of 12 different codes. As a distinguishing feature, a different block-size is used for each codeword length, that is, B_0 = 27, B_1 = 54, and B_2 = 81, respectively; accordingly, every code counts n_c = N_i/B_i = 24 block-columns, while the block-rows (layers) are in the number of n_r = (1 − r)·n_c = 12, 8, 6, 4 for code rates 1/2, 2/3, 3/4, and 5/6, respectively.

An example of the base-matrix H_B for the code with length N_2 = 1944 and rate r = 2/3 is shown in Figure 11.
6.2. Multiframe Decoder Architecture. In order to attain an adequate throughput for every WLAN code, the decoder must include a number of CNUs at least equal to max_i{B_i} = 81. This means that two thirds of the processors would remain unused with the shortest codes.

In the latter case, the throughput can be increased thanks to a multiframe approach, where F_i = ⌊max_i{B_i}/B_i⌋ frames of the code with block-size B_i are decoded in parallel. A similar solution is described in [12], but in that case two different frames are decoded in time-division multiplexing by exploiting the 2 nonoverlapped phases of the flooding algorithm. Here, F_i frames are decoded concurrently, and more specifically, three different frames of the shortest code can be assigned to a cluster of 27 CNUs each.

Note that to work properly, the circular shifting network must support concurrent subrotations as described in [24].
7. Decoder Performance

To give a practical example of the reordering strategies described in Section 4, Figure 12 shows the data flow related to the update of layer 0 for the WLAN code of Figure 11. While 6 idle cycles are required following the original, natural order of updates (see Figure 12(a)), EOP needs 5 cycles (see Figure 12(b)), ROP reduces them to 1 (see Figure 12(c)), while no idle cycle is used by UOP (see Figure 12(d)). The subsets defined in Section 4.1 are also shown in Figure 12, along with the optimal sequence of layers followed for decoding.
7.1. Latency and Throughput. The latency of a pipelined LDPC decoder can be expressed as

    T_dec = t_clk · [N_it · (N_B + I_it) + L_pipe + 2·L_IO]    (16)

with t_clk = 1/f_clk being the clock period, N_it the number of iterations, N_B the number of nonnull blocks in the code, I_it = Σ_{k=0}^{n_r−1} I′_k the number of idle cycles per iteration, L_pipe the cycles needed to empty the decoder pipeline, and finally L_IO the cycles for the input/output interface. Among the parameters above, N_it is set for good error-correction performance, N_B is a code-dependent parameter, and L_IO is fixed by the I/O management; thus, for a minimum latency, the designer can only act on I_it, whose value can be optimised with the techniques of Section 4.
Focusing on the IEEE 802.11n codes, Table 1 shows the overall number of cycles for 12 iterations (L_dec = T_dec/t_clk), the number of idle cycles per iteration (I_it), the percentage of idle cycles with respect to the total (idling %), and the throughput at the clock frequency of 240 MHz.

The latter is expressed in information bits decoded per time unit and is also referred to as the net throughput:

    Γ_n = F_i · (r · N_i)/T_dec,    (17)

where F_i is the number of frames decoded in parallel. For this reason, the figures of Table 1 for the short codes are very similar to those for the long codes (N_0·F_0 = N_2·F_2); on the contrary, the middle codes do not benefit from the same mechanism (i.e., F_1 = 1) and their throughput is scaled down by a factor 2/3.
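Equations (16) and (17) can be checked against the entries of Table 1; a small sketch (variable names are mine):

```python
def net_throughput_mbps(N_i, r, F_i, L_dec, f_clk_mhz=240.0):
    # Eq. (17) with T_dec = L_dec / f_clk from Eq. (16):
    # information bits decoded per microsecond, i.e. Mbps.
    T_dec_us = L_dec / f_clk_mhz
    return F_i * r * N_i / T_dec_us

# Long rate-5/6 code with UOP (Table 1): L_dec = 1164 cycles.
print(round(net_throughput_mbps(1944, 5 / 6, 1, 1164)))   # -> 334

# Short rate-5/6 code with UOP: three frames in parallel (F_0 = 3).
print(round(net_throughput_mbps(648, 5 / 6, 3, 1380)))    # -> 282
```

Both values match the Γ_n column of Table 1 for the corresponding codes.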
The results of Table 1 are given for every technique of Section 4 as well as for the original codes before optimization. Although EOP clearly outperforms the original codes, better results are achieved with ROP and UOP for the WLAN case
Figure 12: An example of optimization of the base-matrix of the IEEE 802.11n LDPC code with N_2 = 1944 and r = 2/3 with EOP, ROP, and UOP. Critical propagations are highlighted in dark gray. (a) Original base-matrix (sequence of layers: 0,1,2,3,4,5,6,7). (b) EOP (optimised sequence of layers: 0,5,6,7,4,2,3,1). (c) ROP (optimised sequence of layers: 0,2,7,5,6,3,4,1). (d) UOP (optimised sequence of layers: 0,4,7,6,1,2,6,3).
example, where at most 14% and 11% of the decoding time are spent in idle, respectively. On average, the decoding time decreases from 7.6 to 6.7 μs with EOP, and even to 5.3 μs with ROP and 5.1 μs with UOP. This behaviour can be explained by considering that for the WLAN codes the term (d_k − |A_k ∪ B_k|)·u(|C_k|) found in (6) for EOP is significantly nonnull, while, comparing (8) to (13), ROP and UOP basically differ by the term |A_k \ C_k|·u(|B_k|), which is negligible for the WLAN codes.
7.2. Error-Correction Performance. Figure 13 compares the floating-point frame error rate (FER) after 12 decoding iterations of a pipelined decoder using EOP, ROP, and UOP with a reference curve obtained by simulating the original parity-check matrix before optimization, in a nonpipelined decoder. Two simulations were run for each strategy, one with the proper number of idle cycles (curves with full markers), and the other without idle cycles, referred to as full pipeline mode (curves with empty markers).

As expected, the three strategies reach the reference curve of the HLD algorithm when properly idled. Then, in case of full pipeline (I_k = 0, ∀k), the performance of EOP is spoiled, while ROP and UOP only pay about 0.6 and 0.3 dB, respectively. This means that the reordering has significantly reduced the dependence between layers and only few hazards arise without idle cycles.

Similarly to EOP, no received codeword is successfully decoded even at high SNRs (i.e., FER = 1) if the original code descriptors are simulated in full pipeline. This confirms
Table 1: Performance of an LDPC decoder for IEEE 802.11n with 12 iterations: L_SO = 5 and f_clk = 240 MHz.

Code length           N_0 = 648                  N_1 = 1296                 N_2 = 1944
Code rate             1/2   2/3   3/4   5/6      1/2   2/3   3/4   5/6      1/2   2/3   3/4   5/6
N_B                   88    88    88    88       86    88    88    85       86    88    85    79
L_IO                  72    72    72    72       48    48    48    48       72    72    72    72
Original
  L_dec               2299  1763  1779  1486     2106  1715  1886  1653     2107  1775  1752  1603
  I_it                91    46    47    22       81    46    60    43       77    47    48    41
  idling %            47%   31%   31%   17%      46%   32%   38%   31%      44%   31%   32%   30%
  Γ_n (Mbps)          101   176   197   262      74    121   124   157      111   175   200   243
EOP
  L_dec               1927  1691  1575  1462     1819  1643  1527  1377     1855  1691  1538  1352
  I_it                60    40    30    20       57    40    30    20       56    40    30    20
  idling %            37%   28%   23%   16%      37%   29%   23%   17%      36%   28%   23%   17%
  Γ_n (Mbps)          121   184   222   266      85    126   153   188      126   184   228   288
ROP
  L_dec               1308  1216  1290  1403     1223  1168  1239  1330     1283  1228  1243  1305
  I_it                8     0     6     15       7     0     6     16       8     1     5     16
  idling %            7.3%  0%    5.5%  13%      6.8%  0%    5.5%  14%      7.4%  1%    4.8%  14%
  Γ_n (Mbps)          178   256   271   277      127   178   188   195      182   253   282   298
UOP
  L_dec               1308  1216  1243  1380     1187  1168  1195  1260     1259  1216  1195  1164
  I_it                8     0     2     13       4     0     2     10       6     0     1     4
  idling %            7.3%  0%    1.9%  11%      4%    0%    2%    9.3%     5.6%  0%    0.9%  4%
  Γ_n (Mbps)          178   256   282   282      131   178   195   206      185   256   293   334
once more the importance of idle cycles in a pipelined HLD decoder and motivates the need for an optimization technique.

Considering the same scenario as Figure 13, Figure 14 shows the convergence speed, measured in average number of iterations, of the layered decoding algorithm. The curves confirm that HLD needs, on average, one half of the number of iterations of the flooding schedule, and show that the full pipeline mode is also penalized in terms of speed.
8. Implementation Results

The complexity of an LDPC decoder for the IEEE 802.11n codes was derived through logical synthesis on a low-power 65 nm CMOS technology targeting f_clk = 240 MHz. Every architecture of Section 5 was considered for implementation, each one supporting the three reordering strategies, for a total of 9 combinations. For good error-correction performance, input LLRs and c2v messages were represented on 5 bits, while internal SO and v2c messages on 7 bits.

Table 2 summarizes the complexity of the different designs in terms of logic, measured in equivalent Kgates, and number of RAM and ROM bits. Equivalent gates are counted by referring to the low-drive, 2-input NAND cell, whose area is 2.08 μm² for the target technology library. Arch. V-A needs the highest number of memory bits due to the local variable-to-check buffer, but its logic is smaller since it requires no additional hardware resources (adders) and fewer configuration bits.
Because of the partitioning of both the SO and the c2v memories, Arch. V-B needs more logic resources and more memory bits than Arch. V-C (both for data and configuration). The redundancy ratios ρ_SO and ρ_c2v of the SO and c2v memories in Arch. V-B, respectively, and ρ_v2c of the v2c memory in Arch. V-C, are also reported in Table 2.
As a matter of fact, the three architectures are very similar in complexity and performance, and, for a given set of LDPC codes, the designer can select the most suitable solution by trading off decoding latency and throughput at the system level with the requirements of logic and memory in terms of area, speed, and power consumption at the technology level.

Table 3 compares the design of a decoder for IEEE 802.11n based on Arch. V-C with UOP with similar state-of-the-art implementations: a parallel decoder by Blanksby and Howland [7], a 2048-bit rate-1/2 TDMP decoder by Mansour and Shanbhag [25], a similar design for WLAN by Gunnam et al. [10], and a decoder for WiMAX by Brack et al. [26]. Here, for a fair comparison, the throughput is expressed in channel bits decoded per time unit; that is, it is the channel throughput Γ_c = N_i/T_dec = Γ_n/r.
For the comparison, we focused on the architectural efficiency η_A, defined as

    η_A = (T_dec · f_clk)/(N_it · N_B) = (N · f_clk)/(Γ_c · N_it · N_B),    (18)
which represents the average number of clock cycles needed to update one block of H. In decoders based on serial functional units it is η_A ≥ 1, and the higher η_A is, the less efficient is the architecture. Actually, η_A can reach 1 only when the dependence between consecutive layers is solved at the code design level. This is the case of two WiMAX codes
Figure 13: Error-correction performance of the IEEE 802.11n, N_2 = 1944, rate-1/2 LDPC code after 12 decoding iterations.
Table 2: IEEE 802.11n LDPC decoder complexity analysis.

                        EOP       ROP       UOP
Arch. V-A
  logic (Kgates)        71.29     71.62     74.65
  RAM bits              61,722    61,722    61,722
  ROM bits              23,159    23,159    40,788
Arch. V-B
  logic (Kgates)        75.45     75.75     77.99
  RAM bits              53,622    54,837    57,024
  ρ_SO                  29.2%     29.2%     33.3%
  ρ_c2v                 1.1%      4.6%      9.1%
  ROM bits              36,582    36,582    51,849
Arch. V-C
  logic (Kgates)        71.83     72.14     74.60
  RAM bits              53,217    53,217    53,784
  ρ_v2c                 29.2%     29.2%     33.3%
  ROM bits              34,508    34,508    43,553
(specifically, the class 1/2 and class 2/3B codes) which are hazard-free (or layered) by construction, thus explaining the very low value of η_A achieved by [26]. However, [26] is as efficient as our design (η_A ≈ 1.3) on the remaining nonlayered WiMAX codes, but the authors do not perform layered decoding on such codes.

For decoders with parallel processing units (see [7, 25]) the architectural efficiency becomes a measure of the parallelization used in the processing units and it can be expressed as η_A = 1/d̄, with d̄ being the average check
Figure 14: IEEE 802.11n, N_2 = 1944, rate-1/2 LDPC code: average decoding speed for a maximum of 100 iterations.
node degree. Indeed, in a two-phase decoder, the number of blocks can be equivalently defined as the overall number of exchanged messages divided by the number of functional units. If E is the number of edges in the code, then N_B = 2E/(N + rN), which is an index of the parallelization used in the processors.
The different designs were also compared in terms of energy efficiency, defined as the energy spent per coded bit and per decoding iteration. This is computed as

    η_E = E_dec/(N · N_it) = P/(Γ_c · N_it)    (19)
with E_dec = P · T_dec being the decoding energy and P the power consumption. The latter was estimated with Synopsys Power Compiler, averaged out over three different SNRs (corresponding to different convergence speeds), and includes the power dissipated in the memory units (about 70% of the total). In terms of energy, our design is more efficient than [25] and gets close to the parallel decoder in [7].
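Both figures of merit are straightforward to reproduce from the table entries; for instance, the proposed decoder's best case can be recomputed as follows (a sketch, with names of my choosing):

```python
def eta_A(N, f_clk, gamma_c, N_it, N_B):
    # Eq. (18): average clock cycles spent per block of H.
    return N * f_clk / (gamma_c * N_it * N_B)

def eta_E_pj(P_watt, gamma_c, N_it):
    # Eq. (19): energy per coded bit per iteration, in picojoules.
    return P_watt / (gamma_c * N_it) * 1e12

# Proposed decoder at its peak channel throughput of 401 Mbps
# (N = 1944, N_B = 88, 12 iterations, 240 MHz clock, 0.162 W):
print(round(eta_A(1944, 240e6, 401e6, 12, 88), 2))   # about 1.10 cycles/block
print(round(eta_E_pj(0.162, 401e6, 12), 1))          # about 33.7 pJ/bit/iter
```

These values match the lower ends of the η_A and η_E ranges reported for this design in Table 3.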
Since the design in [10] is for the same WLAN LDPC codes and implements a similar layered decoding algorithm with the same number of processing units, a closer inspection is compulsory. Thanks to the idle optimization, our solution is more efficient in terms of throughput, the saving in efficiency ranging from 16% to 23%. Then, although our design saves about 70 mW in power consumption with respect to [10], the related energy efficiency has not been included in Table 3, since the reference scenario used to estimate the power consumption (238 mW) was not clearly defined. Finally, although curves for error-correction performance are not available in [10], penalties are expected in view of the smaller accuracy used to represent v2c (5 bits) and SO (6 bits) messages.
Table 3: State-of-the-art LDPC decoder implementations.

                        [this]           [7]              [10]             [25]             [26]
Technology              65 nm CMOS       0.16 μm CMOS     0.13 μm TSMC     0.18 μm 1.8 V    0.13 μm CMOS
                                         5-LM             CMOS             TSMC CMOS
Algorithm               layered          flooding         layered          TDMP             flooding/layered
CPU arch.               serial           parallel         serial           parallel         serial
Nb. of CPUs             81               1536             81               64               96
Msg. width (c2v + SO)   5 + 7            4 + 4            5 + 6            4 + 5            6
Clock freq. (MHz)       240              64               500              125              333
Rates                   1/2,2/3,3/4,5/6  1/2              1/2,2/3,3/4,5/6  1/2:1/16:7/8     1/2,2/3,3/4,5/6
Codeword length, N      648,1296,1944    1024             648,1296,1944    2048             576:96:2304
Codeword size, B        27,54,81         1                27,54,81         64               24:4:96
Nb. of blocks, N_B      79–88            4.33             79–88            96               76–88
Speed
  Iterations N_it       12               64               5                10               16
  Γ_c (Mbps)            262–401          1,024            541–1,618        640              177–999
Area
  Kgates (mm²)          100.7 (0.207)    1750 (52.5)      99.9 (1.85)      220 (14.3)       489.9 (2.964)
  RAM bits              56,376           —                55,344           51,680           NA
Power consumption (W)   0.162            0.69             0.238            0.787            NA
η_A (cycle/bit/iter)    1.103–1.306      0.231            1.361–1.521      0.417            1.01–1.31
η_E (pJ/bit/iter)       33.7–51.5        10.5             —                123              —
9. Conclusions
An effective method to counteract the pipeline hazards
typical of block-serial layered decoders of LDPC codes has
been presented in this paper. This method is based on
the rearrangement of the decoding elaborations in order
to minimize the number of idle cycles inserted between
updates, and it resulted in three different strategies named
equal, reversed, and unconstrained output (EOP, ROP, and
UOP) processing.
Then, different semi-parallel VLSI architectures of a layered
decoder for architecture-aware LDPC codes supporting
the methods above have been described and applied to the
design of a decoder for IEEE 802.11n LDPC codes.
The synthesis of the proposed decoder on a 65 nm low-power
CMOS technology reached a clock frequency of
240 MHz, which corresponds to a net throughput ranging
from 131 to 334 Mbps with UOP and 12 decoding iterations,
outperforming similar designs.
This work has proved that the layered decoding algorithm
can be extended, with no modifications or approximations,
to every LDPC code, regardless of the interconnections
of its parity-check matrix, provided that idle cycles are used
to maintain the dependencies between the updates of the
algorithm.
Also, the paradigm of code-decoder co-design has been
reinforced in this work: not only have the described
techniques shown to be very effective in counteracting
the pipeline hazards, but they also provide
useful guidelines for the design of good, hazard-free LDPC
codes. To this extent, the assumption that consecutive layers
must not share soft-outputs, as in the WiMAX class 1/2
and 2/3B codes, is overcome, thus leaving more room for
the optimization of the code performance at the level of
the code design.
References
[1] Satellite digital video broadcasting of second generation
(DVB-S2), ETSI Standard EN 302 307, February 2005.
[2] IEEE Computer Society, Air Interface for Fixed and Mobile
Broadband Wireless Access Systems, IEEE Std 802.16e-2005,
February 2006.
[3] IEEE P802.11n/D1.06, Draft amendment to Standard for
high throughput, 802.11 Working Group, November 2006.
[4] R. Gallager, Low-density parity-check codes, Ph.D. dissertation,
Massachusetts Institute of Technology, 1960.
[5] D. MacKay and R. Neal, Good codes based on very sparse
matrices, in Proceedings of the 5th IMA Conference on
Cryptography and Coding, 1995.
[6] M. M. Mansour and N. R. Shanbhag, High-throughput
LDPC decoders, IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 11, no. 6, pp. 976–996, 2003.
[7] A. Blanksby and C. Howland, A 690-mW 1-Gb/s 1024-b, rate-
1/2 low-density parity-check code decoder, IEEE Journal of
Solid-State Circuits, vol. 37, no. 3, pp. 404–412, 2002.
[8] H. Zhong and T. Zhang, Block-LDPC: a practical LDPC
coding system design approach, IEEE Transactions on Circuits
and Systems I, vol. 52, no. 4, pp. 766–775, 2005.
[9] D. E. Hocevar, A reduced complexity decoder architecture via
layered decoding of LDPC codes, in Proceedings of the IEEE
Workshop on Signal Processing Systems (SiPS '04), pp. 107–112,
2004.
[10] K. Gunnam, G. Choi, W. Wang, and M. Yeary, Multi-rate
layered decoder architecture for block LDPC codes of the
IEEE 802.11n wireless standard, in Proceedings of the IEEE
International Symposium on Circuits and Systems (ISCAS '07),
pp. 1645–1648, May 2007.
[11] T. Bhatt, V. Sundaramurthy, V. Stolpman, and D. McCain,
Pipelined block-serial decoder architecture for structured
LDPC codes, in Proceedings of the IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP '06),
vol. 4, pp. 225–228, April 2006.
[12] C. P. Fewer, M. F. Flanagan, and A. D. Fagan, A versatile
variable rate LDPC codec architecture, IEEE Transactions on
Circuits and Systems I, vol. 54, no. 10, pp. 2240–2251, 2007.
[13] E. Boutillon, J. Tousch, and F. Guilloud, LDPC decoder,
corresponding method, system and computer program, US
patent no. 7,174,495 B2, February 2007.
[14] M. Rovini, F. Rossi, P. Ciao, N. L'Insalata, and L. Fanucci,
Layered decoding of non-layered LDPC codes, in Proceedings
of the 9th Euromicro Conference on Digital System Design (DSD
'06), August-September 2006.
[15] R. Tanner, A recursive approach to low complexity codes,
IEEE Transactions on Information Theory, vol. 27, no. 5, pp.
533–547, 1981.
[16] H. Zhang, J. Zhu, H. Shi, and D. Wang, Layered approx-regular
LDPC: code construction and encoder/decoder
design, IEEE Transactions on Circuits and Systems I, vol. 55,
no. 2, pp. 572–585, 2008.
[17] R. Echard and S.-C. Chang, The π-rotation low-density
parity check codes, in Proceedings of the IEEE Global
Telecommunications Conference (GLOBECOM '01), pp. 980–984,
November 2001.
[18] F. Guilloud, E. Boutillon, J. Tousch, and J.-L. Danger, Generic
description and synthesis of LDPC decoders, IEEE Transactions
on Communications, vol. 55, no. 11, pp. 2084–2091, 2006.
[19] H. Xiao and A. H. Banihashemi, Graph-based message-passing
schedules for decoding LDPC codes, IEEE Transactions
on Communications, vol. 52, no. 12, pp. 2098–2105, 2004.
[20] E. Sharon, S. Litsyn, and J. Goldberger, Efficient serial
message-passing schedules for LDPC decoding, IEEE Transactions
on Information Theory, vol. 53, no. 11, pp. 4076–4091,
2007.
[21] F. Zarkeshvari and A. Banihashemi, On implementation of
min-sum algorithm for decoding low-density parity-check
(LDPC) codes, in Proceedings of the IEEE Global Telecommunications
Conference (GLOBECOM '02), vol. 2, pp. 1349–1353,
November 2002.
[22] C. Jones, E. Valles, M. Smith, and J. Villasenor, Approximate-MIN
constraint node updating for LDPC code decoding, in
Proceedings of the IEEE Military Communications Conference
(MILCOM '03), vol. 1, pp. 157–162, October 2003.
[23] M. Rovini, F. Rossi, N. L'Insalata, and L. Fanucci, High-precision
LDPC codes decoding at the lowest complexity, in
Proceedings of the 14th European Signal Processing Conference
(EUSIPCO '06), September 2006.
[24] M. Rovini, G. Gentile, and L. Fanucci, Multi-size circular
shifting networks for decoders of structured LDPC codes,
Electronics Letters, vol. 43, no. 17, pp. 938–940, 2007.
[25] M. M. Mansour and N. R. Shanbhag, A 640-Mb/s 2048-bit
programmable LDPC decoder chip, IEEE Journal of Solid-State
Circuits, vol. 41, no. 3, pp. 684–698, 2006.
[26] T. Brack, M. Alles, F. Kienle, and N. Wehn, A synthesizable IP
core for WiMax 802.16E LDPC code decoding, in Proceedings
of the 17th IEEE International Symposium on Personal, Indoor
and Mobile Radio Communications (PIMRC '06), pp. 1–5,
September 2006.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 704174, 3 pages
doi:10.1155/2009/704174
Letter to the Editor
Comments on Techniques and Architectures for Hazard-Free
Semi-Parallel Decoding of LDPC Codes
Kiran K. Gunnam,1,2 Gwan S. Choi,2 and Mark B. Yeary3
1 Channel Architecture, Storage Peripherals Group, LSI Corporation, Milpitas, CA 95035, USA
2 Department of ECE, Texas A&M University, College Station, TX 77843, USA
3 Department of ECE, University of Oklahoma, Norman, OK 73019, USA
Correspondence should be addressed to Kiran K. Gunnam, kgunnam@ieee.org
Received 7 December 2009; Accepted 7 December 2009
This is a comment article on the publication Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC
Codes by Rovini et al. (2009). We mention that there has been similar work reported in the literature before, and that the previous work
has not been cited correctly, for example Gunnam et al. (2006, 2007). This brief note serves to clarify these issues.
Copyright 2009 Kiran K. Gunnam et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
The recent work by Rovini and others in [1] states that
Gunnam et al. describe in [10] a pipelined semi-parallel
decoder for WLAN LDPC codes, but the authors do not
mention the issue of the pipeline hazards; only the need
of properly scrambling the sequence of data in order to
clear some memory conflicts is described. On the contrary,
we gave a detailed explanation of our decoder architecture
and of the concepts of out-of-order processing in [2–6]. The
approach Unconstrained Output Processing (UOP) proposed in [1] is
similar to our approach outlined in [2–6]. So we would like
to clarify this matter further.
We describe in [2–6] a pipelined semi-parallel decoder
for WLAN LDPC codes that used scheduling of layered
processing and out-of-order block processing to minimize
the pipeline hazards and memory stall cycles. The following
paragraph from [4, Page 1, Column 2] correctly describes
our work: This paper introduces the following concepts
to LDPC decoder implementation: block-serial scheduling
[5], value-reuse, scheduling of layered processing, out-of-order
block processing, master-slave router, dynamic state.
All these concepts are termed on-the-fly computation, as
the core of these concepts is based on minimizing memory
and re-computations by employing just-in-time scheduling.
More detailed explanations and illustrations can be found in
the presentations [5, 6], which were available online from
October 2006 and May 2007, respectively.
Also, [1] did not cite our work on layer reordering for
optimizing the pipeline and memory accesses. In [3, Page
4, Column 2], last paragraph, we clearly mention that It
is possible to do the decoding using a different sequence
of layers instead of processing the layers from 1 to j, which
is typically used to increase the parallelism such that it is
possible to process two block rows simultaneously [4]. In
this work, we use the concept of reordering of layers for
increased parallelism as well as for low complexity memory
implementation, and also for inserting additional pipeline
stages without incurring overhead.
Our proposal of out-of-order processing (OoP) for the
layered decoding [2–6] is to process the circulants in a layer
out of order (not necessarily sequentially) to remove the
pipeline and memory conflicts. This includes carrying out the
partial state processing and the other related steps (Rold message
generation, Q message generation, and CNU partial state
processing, that is, the processing step of finding Min1, Min2,
and the Min1 ID) in out-of-order fashion, as well as the processing
of Rnew messages in out-of-order fashion. For instance, while
processing layer 2, the blocks/circulants which depend on
layer 1 will be processed last to allow for the pipeline latency.
Also, Rnew selection is out-of-order (these messages will come
from the most recently updated connected block), so that it
can feed the data required for the PS processing of the second
layer. A dependent circulant, or connected circulant, is the
non-zero circulant that supplies the last updated information
of the P message to the specified nonzero circulant. The
dependent layer is the layer which contains the dependent
circulant. So circulants in the second layer will get the latest
P update based on the Rnew messages from different connected
circulants in different connected layers. Thus, OoP for PS
processing is across one layer (i.e., at any time the CNU
partial state processing is concerned with starting and
completing one layer; however, the circulants in the
layer are processed out of order to satisfy the pipeline and
memory constraints), while OoP for Rnew message generation
is across several layers. Also, the P update (Q + Rnew)
in [2, (9)] is computed on-the-fly, along with the reading of
the Q message of the last updated circulant in the same
block column from the Q memory and the Rnew message
generation, that is, at the precise moment when it is needed;
this avoids the use of a P memory and needs a single-port
read and single-port write Q memory whose storage capacity
is equal to the code length multiplied by the word length
of the Q message. The bandwidth of this memory, measured in
terms of the number of Q messages, is equal to the decoder
parallelization [2–6]. Other decoder hardware architectures
and implementations use both P memory and Q memory,
use mirror memories, or use more complicated multiported
memories. Illustrations of out-of-order processing were given
in [5, 6].
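As a toy illustration of this reordering (Python; the layer contents and the dependency set below are invented for the example, only the idea of pushing dependent circulants to the end of the schedule is taken from the text):

```python
def reorder_layer(circulants, prev_layer_cols):
    """Process circulants whose block column was updated by the previous
    layer last, so their data has time to clear the pipeline (OoP idea)."""
    independent = [c for c in circulants if c not in prev_layer_cols]
    dependent = [c for c in circulants if c in prev_layer_cols]
    return independent + dependent

# Hypothetical example: the current layer touches block columns 0..5,
# and the previous layer updated block columns {1, 4}.
order = reorder_layer([0, 1, 2, 3, 4, 5], {1, 4})
print(order)  # dependent circulants 1 and 4 are scheduled last
```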
We gave more explanation in [5]: The decoder hardware
architecture is proposed to support out-of-order processing
to remove pipeline and memory accesses or to satisfy any
other performance or hardware constraint. Remaining hardware
architectures will not support out-of-order processing
without involving further logic and memory. For the
above hardware decoder architecture, the optimization of the
decoder schedule belongs to the class of NP-complete problems.
So there are several classic optimization algorithms,
such as dynamic programming, that can be applied. We apply
the following classic approach of optimal substructure.
Step 1. We will try different layer schedules (j!, i.e., the factorial
of j, if there are j layers). For simplicity, we will try only
a subset of possible sequences so as to have more spread
between the original layers.
Step 2. Given a layer schedule or a re-ordered H matrix, we
will optimize the processing schedule of each layer. For this,
we use the classic approach of optimal substructure, that is,
the solution to a given optimization problem can be obtained
by the combination of optimal solutions to its subproblems.
So first we optimize the processing order to minimize the
pipeline conflicts. Then we optimize the resulting processing
order to minimize the memory conflicts. So for each layer
schedule, we are measuring the number of stall cycles (our
cost function).
Step 3. We choose a layer schedule which minimizes the cost
function, that is, meets the requirements with fewer stall cycles
due to pipeline conflicts and memory conflicts, and also
minimizes the memory accesses (such as FS memory accesses,
to minimize the number of ports needed, to save the
access power, and to minimize the muxing requirement
and any interface memory access requirements).
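The three steps above can be sketched as follows (Python; the stall-cycle cost model, the pipeline depth, and the example matrix are hypothetical stand-ins, only the structure of the search follows the description):

```python
from itertools import permutations

PIPE_DEPTH = 5  # assumed pipeline latency, in circulant slots

def stall_cycles(layer_order, layers):
    """Toy cost function: for each pair of consecutive layers, every shared
    block column processed within PIPE_DEPTH slots of the layer start is
    counted as one stall cycle."""
    stalls = 0
    for prev, cur in zip(layer_order, layer_order[1:]):
        shared = set(layers[prev]) & set(layers[cur])
        for pos, col in enumerate(layers[cur]):
            if col in shared and pos < PIPE_DEPTH:
                stalls += 1
    return stalls

def best_schedule(layers):
    # Step 1: try layer permutations (all of them here; a real search would
    # restrict to a subset). Step 2 would additionally reorder the circulants
    # inside each layer; Step 3: keep the minimum-cost schedule.
    return min(permutations(range(len(layers))),
               key=lambda order: stall_cycles(order, layers))

# Hypothetical 4-layer matrix: each layer lists its nonzero block columns.
layers = [[0, 1, 2], [1, 3, 4], [0, 2, 5], [3, 4, 5]]
print(best_schedule(layers))
```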
Also, we would like to mention how we calculate the
architecture efficiencies: we mention in [2], Here, all
calculations for the decoded throughput are based on an
average of 5 decoding iterations to achieve a frame error rate
of 10^-4, while it_max is set to 15. If we are considering
the actual system throughput, then we should consider
how many maximum iterations the system can run, and
what is the additional overhead from LLR/Q memory and
hard decision memory statistical buffering and loading and
unloading times. We have close to 1.5 iterations of overhead
due to statistical buffering. So mixing the average number
of iterations with the actual system throughput to calculate the
decoder core architecture efficiency is not a fair metric. In
our works [2–6], for the decoders based on one-circulant
processing, the number of clock cycles for the decoding of
each block/circulant is 1. Note that the [2] and [3] designs
are similar and have a pipeline depth of 5. The Clock
Cycles per Iteration (CCI) for most of the IEEE 802.11n
and IEEE 802.16e H matrices, after reordering of layers and
out-of-order processing and data forwarding and speculative
computations, is the number of blocks in the H matrix. The only
exception is the rate-5/6 matrix of IEEE 802.16e and IEEE
802.11n. For the 802.16e 5/6 matrix, the CCI is 87 clock cycles
to process 80 blocks. For the 802.11n 5/6 matrix, the CCI is 85
clock cycles to process 79 blocks. We gave the worst case for
CCI as the total number of blocks in the H matrix + 2 cycles of overhead
per each layer in [3, Page 6, Column 1, Lines 23–28]. So if
we were to report the Architecture Efficiency = CCI/Ideal
CCI, the architecture efficiency for our decoders would be
1 for all the codes except 2 cases. For the 802.16e 5/6 matrix,
this number would be 1.0875. For the 802.11n 5/6 matrix, the
number would be 1.0759. Even if we use the above worst-case
number reported in [3] for all the codes, even though it is
not necessary, then the architecture efficiency number would
vary from 1.0759 to 1.29 for 802.11n codes and from 1.0875 to
1.3158 for 802.16e codes.
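The arithmetic above is easy to check (Python; figures taken from the CCI values quoted in the text):

```python
def architecture_efficiency(cci, n_blocks):
    """Architecture Efficiency = CCI / ideal CCI, where the ideal CCI is
    one clock cycle per block/circulant of the H matrix."""
    return cci / n_blocks

# The two exceptional rate-5/6 matrices quoted above.
print(round(architecture_efficiency(87, 80), 4))  # 802.16e rate 5/6: 1.0875
print(round(architecture_efficiency(85, 79), 4))  # 802.11n rate 5/6: 1.0759
```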
Also, our work covers more aspects. We can apply OoP
for PS processing across multiple layers. While waiting for
the data from the currently processed layer 1, we can start
processing the independent circulants in the next layer 2 that
do not depend on the current layer 1, and also the circulants in layer
3 that do not depend on layer 1 and layer 2. In [5], we also
sequence the operations in a layer such that we first process the
block whose dependent data has been available for the longest
time. This naturally leads us to true out-of-order processing
across several layers. In practice we won't do out-of-order
partial state processing involving more than 2 layers.
References
[1] M. Rovini, G. Gentile, F. Rossi, and L. Fanucci, Techniques and
architectures for hazard-free semi-parallel decoding of LDPC
codes, EURASIP Journal on Embedded Systems, vol. 2009,
Article ID 723465, 15 pages, 2009.
[2] K. Gunnam, G. Choi, W. Wang, and M. Yeary, Multi-rate
layered decoder architecture for block LDPC codes of the
IEEE 802.11n wireless standard, in Proceedings of the IEEE
International Symposium on Circuits and Systems (ISCAS '07),
pp. 1645–1648, New Orleans, La, USA, May 2007.
[3] K. K. Gunnam, G. S. Choi, M. B. Yeary, and M. Atiquzzaman,
VLSI architectures for layered decoding for irregular LDPC
codes of WiMax, in Proceedings of the IEEE International
Conference on Communications (ICC '07), pp. 4542–4547,
Glasgow, UK, June 2007.
[4] K. K. Gunnam, G. S. Choi, W. Wang, E. Kim, and M. B.
Yeary, Decoding of quasi-cyclic LDPC codes using an on-the-fly
computation, in Proceedings of the 40th Asilomar Conference
on Signals, Systems and Computers, pp. 1192–1199, October-November
2006.
[5] K. Gunnam, Area and energy efficient VLSI architectures for
low density parity-check decoders using an on-the-fly computation,
Ph.D. presentation, Texas A&M University, College
Station, Tex, USA, October 2006, http://dropzone.tamu.edu/
kirang/10112006.pdf.
[6] K. Gunnam, G. Choi, W. Wang, and M. Yeary, Multi-rate
layered decoder architecture for block LDPC codes of the
IEEE 802.11n wireless standard, in Proceedings of the IEEE
International Symposium on Circuits and Systems (ISCAS '07),
pp. 1645–1648, New Orleans, La, USA, May 2007.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 635895, 2 pages
doi:10.1155/2009/635895
Letter to the Editor
Reply to Comments on Techniques and Architectures for
Hazard-Free Semi-Parallel Decoding of LDPC Codes
Massimo Rovini, Giuseppe Gentile, Francesco Rossi, and Luca Fanucci
Department of Information Engineering, University of Pisa, Via G. Caruso, 56122 Pisa, Italy
Correspondence should be addressed to Massimo Rovini, massimo.rovini@gmail.com
Received 7 December 2009; Accepted 7 December 2009
This is a reply to the comments by Gunnam et al., Comments on Techniques and architectures for hazard-free semi-parallel
decoding of LDPC codes, EURASIP Journal on Embedded Systems, vol. 2009, Article ID 704174, on our recent work Techniques
and architectures for hazard-free semi-parallel decoding of LDPC codes, EURASIP Journal on Embedded Systems, vol. 2009, Article
ID 723465.
Copyright 2009 Massimo Rovini et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
After a careful reading of the comments by Gunnam et al.
[1], we identified two main points to be further discussed
hereafter.
1.1. Point 1: Cited Papers. Gunnam et al. claim that we did
not cite their work [2] correctly and refer to four other
publications of their own to provide further explanation.
Actually, the introductory section of our work [3] aims at
providing an overview of the state-of-the-art architectures
on the subject. The five works by Gunnam et al. basically
propose the same LDPC architecture, where the description
of all the features is spread across the five publications. As
a matter of fact, to be fair and balanced with the other
state-of-the-art architectures, we decided to cite only
one of their works, and particularly the one providing the
most details regarding the architecture and the implementation
results [2]. Finally, the selected paper was correctly
cited in our work [3], with no misleading information or
wrong assertion regarding the architecture described by
Gunnam et al.
1.2. Point 2: Architectural Efficiency. In our paper [3], we
defined a metric to compare the efficiency of different
LDPC architectures in terms of the (average) number of clock
cycles per block and per iteration, with the term block
referring to a circulant of the parity-check matrix. We
applied this metric to our design as well as to other
available implementations including [2]; in this process, we
used the figures of throughput reported in each referenced
paper.
Gunnam et al. claim that this is not a fair metric because
it involves the average number of iterations. Actually, we
hardly understand the point raised. On one hand, it is common
practice to refer to the average number of iterations to
express the system throughput. On the other hand, Gunnam
et al. themselves use in [2] the average number of iterations
to evaluate their throughput figures. Moreover, Gunnam et
al. state that the overhead of the statistical buffering has not
been taken into account. Although there is no mention of the
statistical buffering within the cited paper [2], this does not
affect the system throughput but rather the decoding latency.
Summarizing, we are quite confident regarding the fairness
of the considered Architectural Efficiency metric and of the
data provided in our paper.
2. Conclusion
In this brief reply we have provided a detailed explanation
regarding the points raised in [1]. The comments by
Gunnam et al. are indeed very useful to better understand
their decoder architecture. So, in the future, we will cite [1]
as the most effective description of their decoder.
References
[1] K. K. Gunnam, G. S. Choi, and M. B. Yeary, Comments
on Techniques and architectures for hazard-free semi-parallel
decoding of LDPC codes, EURASIP Journal on Embedded
Systems, vol. 2009, Article ID 704174, 3 pages, 2009.
[2] K. Gunnam, G. Choi, W. Wang, and M. Yeary, Multi-rate
layered decoder architecture for block LDPC codes of the
IEEE 802.11n wireless standard, in Proceedings of the IEEE
International Symposium on Circuits and Systems (ISCAS '07),
pp. 1645–1648, New Orleans, La, USA, May 2007.
[3] M. Rovini, G. Gentile, F. Rossi, and L. Fanucci, Techniques and
architectures for hazard-free semi-parallel decoding of LDPC
codes, EURASIP Journal on Embedded Systems, vol. 2009,
Article ID 723465, 15 pages, 2009.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 574716, 11 pages
doi:10.1155/2009/574716
Research Article
OLLAF: A Fine Grained Dynamically Reconfigurable Architecture
for OS Support
Samuel Garcia and Bertrand Granado
ETIS Laboratory, CNRS UMR8051, University of Cergy-Pontoise, ENSEA 6, Avenue du Ponceau, F 95000 Cergy-Pontoise, France
Correspondence should be addressed to Samuel Garcia, samuel.garcia@ensea.fr
Received 15 March 2009; Revised 24 June 2009; Accepted 22 September 2009
Recommended by Markus Rupp
Fine Grained Dynamically Reconfigurable Architectures (FGDRAs) offer flexibility for embedded systems together with great
processing power efficiency, by exploiting optimization opportunities at the architectural level thanks to their fine configuration
granularity. But this increases design complexity, which should be abstracted by tools and an operating system. In order to have a
usable solution, a good inter-overlapping between tools, OS, and platform must exist. In this paper we present OLLAF, an FGDRA
specially designed to efficiently support an OS. The studies presented here show the contribution of this architecture in terms of
hardware context management and preemption support, and the gain that can be obtained, in terms of context management and
preemption overhead, by using OLLAF instead of a classical FPGA.
Copyright 2009 S. Garcia and B. Granado. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
Many modern applications, for example robot navigation,
have a dynamic behavior, but the hardware targets today are
still static and this dynamic behavior is managed in software.
This management lowers the computation performances
in terms of time and expressivity. To obtain the best performances
we need a dynamic computing paradigm. This
paradigm exists as DRA (Dynamically Reconfigurable Architecture),
and some DRA components are already functional.
A DRA component contains several types of resources: logic
cells, dedicated routing logic, and input/output resources.
The logic cells implement functions that may be described
by the designer. The routing logic connects the logic cells
between them and is also configured by the designer. The I/O
resources allow communication outside the reconfigurable
area.
Several types of configurable components exist. For
example, fine grain architectures such as FPGAs (Field
Programmable Gate Arrays) may adapt the functioning and
the routing at bit level. Other, coarse grain architectures
may be adapted by reconfiguring dedicated operators (e.g.,
multipliers, ALU units, etc.) at a coarser level (bit vectors).
In a DRA the functioning of the components may change
online, during a run. An FGDRA (Fine Grained Dynamically
Reconfigurable Architecture) could obtain very high performances
for a great number of algorithms because of its
bit level reconfiguration, but this level of reconfiguration
induces a great complexity. This complexity makes it hard
to use, even for an expert, and could be abstracted at some
level in two ways: at design time by providing design tools,
and at run time by providing an operating system. This
operating system, in order to handle dynamic applications
efficiently, has to be able to respond rapidly to events.
This can be achieved by providing dedicated services, like
hardware preemption, that lower configuration and context
transfer times. In our previous work [1], we demonstrated
that we need to adapt the operating system to an FGDRA,
but also that we need to modify an FGDRA to have an efficient
operating system support.
In this paper we present OLLAF, an FGDRA
specially designed to support dynamic applications and a
specific FGDRA operating system.
This paper is organized as follows. First, an
explanation of the problematics of this work is presented
in Section 2. Section 3 presents the OLLAF FGDRA architecture
and its particularities. In Section 4, an analysis of
preemption costs in OLLAF in comparison with other
existing platforms, including commercial FPGAs using several
preemption methods, is presented. Section 5 presents
application scenarios and compares context management
overhead using OLLAF competing with FPGAs, especially the
Virtex family. Conclusions are then drawn in Section 6, as
well as perspectives on this work.
2. Context and Problematics
Fine Grained Dynamically Reconfigurable Architectures
(FGDRAs), such as FPGAs, due to their fine reconfiguration
grain, make it possible to take better advantage of optimization
opportunities at the architectural level. This feature leads, in most
applications, to a better performance/consumption factor
compared with other classical architectures. Moreover, the
ability to dynamically reconfigure itself at run time allows an
FGDRA to reach a dynamicity very close to that encountered
using microprocessors.
The model used in microprocessor development gains
its efficiency from a great overlapping between platforms,
tools, and OS. First between OS and tools, as most mainframe
OSes offer specifically adapted tools to support their
API. Also between tools and platform; as an example, RISC
processors have an instruction set specifically adapted to the
output of most compilers. Finally, between platform and
OS, by integrating some OS related components into
hardware; the MMU is an example of such an overlapping. As
for microprocessors, for FGDRAs the key point to maximize the
efficiency of a design model is the inter-overlapping between
platforms, tools, and OS.
This article presents a study of our original FGDRA called
OLLAF, specifically designed to enhance the efficiency of the OS
services necessary to manage such an architecture. OLLAF
has a great inter-overlapping between OS and platform. This
particular study mainly focuses on the contribution of this
architecture in terms of configuration management overhead
compared to other existing FGDRA solutions.
2.1. Problematics. Several studies have been conducted around
FGDRA management that demonstrated the interest of using
an operating system to manage such a platform.
Few of them actually propose to bring some modifications
to the FGDRA itself in order to enhance the
efficiency of some particular services, such as fast reconfiguration
or task relocation. Most recent studies instead concentrate
on implementing an OS to manage an already existing,
commercially available FPGA, most often from the Virtex
family. This FPGA family is actually the only recent industrial
FPGA family to allow partial reconfiguration, thanks to an
interface called ICAP.
In a previous study, we presented a method allowing to
drastically decrease the preemption overhead of an FPGA based
task, using a Virtex FPGA [1]. In that previous work, as
in the one presented here, we made a distinction between
configuration, which relates to the configuration bitstream,
and context. Context is the data that have to be saved by
the operating system, prior to a preemption, in order to
be able to resume the task later without any data loss. In
that previous study, we thus proposed a method to manage
context, configuration being managed in a traditional way.
Conclusions of this study were encouraging but revealed that,
if we want to go further, we have to work at the architecture
level. That is why we proposed an architecture called OLLAF
[2], specially designed to answer the problematics related
to FGDRA management by an operating system. Among
those, we wanted to address problems such as context
management and task configuration loading speed, these two
features being of primary concern for an efficient preemptive
management of the system.
2.2. Related Works. Several researchs have been led in the
eld of OS for FGDRA [36]. All those studies present an OS
more or less customized to enable specic FGDRA related
services. Example of such services are partial reconguration
management, hardware task preemption, or hardware task
migration. They are all designed on top of a commercial
FPGA coupled with a microprocessor. This microprocessor
may be a softcore processor, an embedded hardwired core or
even an external processor.
Some works have also been published about the design of a specific architecture for dynamic reconfiguration. In [7] the authors discuss the first multicontext reconfigurable device. This concept has been implemented by NEC on the Dynamically Reconfigurable Logic Engine (DRLE) [8]. In the same period, the concept of Dynamically Programmable Gate Arrays (DPGA) was introduced; it was proposed in [9] to implement a DPGA in the same die as a classic microprocessor to form one of the first Systems on Chip (SoC) including dynamically reconfigurable logic. In 1995, Xilinx even filed a patent on a multicontext programmable device, proposed as an XC4000E FPGA with multiple configuration planes [10]. In [11], the authors study the use of a configuration cache, a feature provided to lower costly external transfers. This paper shows the advantages of coupling configuration caches, partial reconfiguration, and multiple configuration planes.
More recently, in [12], the authors propose to add special material to an FGDRA to support OS services; they worked on top of a classic FPGA. The work presented in this paper tries to take advantage of those previous works, both about hardware reconfigurable platforms and about OS for FGDRA.
Our previous work on OS for FGDRA was related to the preemption of hardware tasks on FPGA [1]. For that purpose we explored the use of a scanpath at task level. In order to accelerate the context transfer, we also explored the possibility of using multiple parallel scanpaths. We further provided the Context Management Unit, or CMU, which is a small IP that manages the whole process of saving and restoring task contexts.
In that study both the CMU and the scanpath were built to be implemented on top of any available FPGA. This approach showed a number of limitations that can be summarized in this way: implementing this kind of OS-related material on top of an existing FPGA introduces unacceptable overhead on both the tasks and the OS services. Said differently, most OS-related material should be hardwired inside the FGDRA as much as possible.
EURASIP Journal on Embedded Systems 3
3. OLLAF Architecture Overview
3.1. Specifications of an FGDRA with OS Support. We have designed an FGDRA with OS support following these specifications.
It should first address the problem of the configuration speed of a task. This is one of the primary concerns, because if the system spends more time configuring itself than actually running tasks, its efficiency will be poor. The configuration speed will thus have a big impact on the scheduling strategy.
In order to allow more choice of scheduling schemes, and to meet some real-time requirements, our FGDRA platform must also include preemption facilities. For the same reasons as for configuration, the speed of the context saving and restoring processes will be one of our primary concerns. On this particular point, the previous work discussed in Section 2 will be adapted and reused.
Scheduling on a classical microprocessor is just a matter of time: the problem is to distribute the computation time between different tasks. In the case of an FGDRA, the system must distribute both computation time and computation resources. Scheduling in such a system is then no longer a one-dimensional problem, but a three-dimensional one: one dimension is time and the two others represent the surface of reconfigurable resources. Performing an efficient scheduling at run time to minimize processing time is then a very hard problem that the FGDRA should help to come close to solving. The primary concern on this subject is to ensure easy task relocation. For that, the reconfigurable logic core should be split into several equivalent blocks. This will allow moving a task from one block to any other block, or from a group of blocks to another group of blocks of the same size and the same form factor, without any change to the configuration data. The size of those blocks is a tradeoff between flexibility and scheduling efficiency.
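As an illustration, the relocation property above reduces placement to finding a run of free equivalent blocks: a task occupying k blocks can be mapped onto any k consecutive free blocks without touching its configuration data. The following first-fit allocator is our own sketch of this idea, not part of OLLAF.

```python
def first_fit(free, k):
    """Return the start index of the first run of k consecutive free
    blocks, or None if no such run exists. 'free' is one boolean per
    block; all blocks are equivalent, so any run of k blocks works."""
    run = 0
    for i, f in enumerate(free):
        run = run + 1 if f else 0
        if run == k:
            return i - k + 1  # start index of the run
    return None

# A 4-block core with block 1 busy: a 2-block task lands on blocks 2-3.
print(first_fit([True, False, True, True], 2))  # → 2
```

Because every block offers the same services, the returned index is the only placement decision needed; no per-position bitstream rewriting is required.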
Another role of an operating system is to provide intertask communication services. In our case we distinguish two situations. The first is that of a task running on top of our FGDRA and communicating with another task running on a different computing unit. This case will not be covered here, as this problem concerns a whole heterogeneous platform, not only the FGDRA computing units. The second case is when two or more tasks running on top of the same FGDRA communicate together. This communication channel should remain the same wherever the task is placed on the FGDRA reconfigurable core and whatever the state of those tasks is (running, pending, waiting, . . .). That means that the FGDRA platform must provide a rationalized communication medium, including exchange memories.
The same arguments also apply to inputs/outputs. Here again two cases exist: first the case of I/O being a global resource of the whole platform; second the case of special I/O directly bound to the FGDRA.
3.2. Proposed Solutions. Figure 1 shows a global view of OLLAF, our original FGDRA designed to efficiently support OS services like preemption or configuration transfers.
In the center stands the reconfigurable logic core of the FGDRA. This core is a dual plane, an active plane
Figure 1: Global view of OLLAF. The figure shows the reconfigurable logic core in the center, organized in columns, each column with an HCM, an LCM, and a CMU; the HW Sup + HW RTK + CCR block on the right; the application communication medium on top; and the control bus.
and a hidden one, organized in columns. Each column can be reconfigured separately and offers the same set of services. A task is mapped onto an integer number of columns. This topology has been chosen for two reasons. First, using partial reconfiguration by column transforms the scheduling problem into a two-dimensional problem (time + 1D surface), which is easier to handle when minimizing the processing time. Secondly, as every column is the same and offers the same set of services, tasks can be moved from one column to another without any change to the configuration data.
In the figure, at the bottom of each column, you can notice two hardware blocks called CMU and HCM. The CMU is an IP able to automatically manage the saving and restoring of task contexts. The HCM, standing for Hardware Configuration Manager, is pretty much the same but handles the configuration data, also called the bitstream. More details about this controller can be found in [1]. On each column a local cache memory named LCM is added. This memory is a first level of cache memory that stores contexts and configurations close to the column where they will most probably be required. The internal architecture of the core provides adequate material to work with the CMU and HCM. More about this will be discussed in the next section.
On the right of the figure stands a big block called HW Sup + HW RTK + CCR. This block contains a hardware supervisor running a custom real-time kernel specially adapted to handle FGDRA-related OS services and platform-level communication services. In the first prototype presented here, this hardware supervisor is a classical 32-bit microprocessor. Along with this hardware supervisor, a central memory is provided for OS use only. Basically, this memory stores the configurations and contexts of every task that may run on the FGDRA. This supervisor communicates with all columns using a dedicated control bus. The hardware supervisor can initiate context transfers, from and to the hidden plane, by writing into the CMUs' and HCMs' registers through this control bus.
Finally, at the top of Figure 1, you can see the application communication medium. This communication medium provides a communication port to each column. Those communication ports are directly bound to the reconfigurable interconnection matrix of the core. If I/O had to
Figure 2: Functional, task-designer point of view of an LE: a LUT with inputs ABCD(3..0) and output X, feeding a D flip-flop with Clk, CE, and Rst signals; the LE outputs are LX and QX.
be bound to the FGDRA, they would be connected to this communication medium in the same way the reconfigurable columns are.
This architecture has been developed as a VHDL model in which the size and number of columns are generic parameters.
3.3. Logic Core Overview. The OLLAF logic core is functionally the same as the logic fabric found in any common FPGA. Each column is an array of Logic Elements surrounded by a programmable interconnect network. The basic functional architecture of an LE can be seen in Figure 2. It is composed of a LUT and a D flip-flop. Several multiplexors and/or programmable inverters can also be used.
All the material added to support the OS in the reconfigurable logic core concerns the configuration memories. That means that, from a user's point of view, designing for OLLAF is similar to designing for any common FPGA. This also means that if we improve the functionality of those LEs, the results presented here will not change.
Configuration data and context data (the flip-flops' contents) constitute two separate paths. A context swap can be performed without any change in configuration. This can be interesting for checkpointing or when running more than one instance of the same task.
3.4. Configuration, Preemption, and OS Interaction. In the previous sections an architectural view of our FGDRA has been exposed. In this section, we discuss the impact of this architecture on OS services. We consider here the three services most specifically related to the FGDRA:
(i) First, the configuration management service: on the hardware side, each column provides an HCM and an LCM. That means that configurations have to be prefetched into the LCM. The associated service running on the hardware supervisor will thus need to take that into account. This service must manage an intelligent cache to prefetch task configurations onto the columns where they will most probably be mapped.
(ii) Second, the preemption service: the same principles apply here as those applied for configuration management, except that contexts also have to be saved. The context management service must ensure that there never exists more than one valid context for each task in the entire FGDRA. Contexts must thus be transferred as soon as possible from the LCM to the centralized global memory of
Figure 3: Dual plane configuration memory: two flip-flops, DFF1 and DFF2, with the data path (D + clk), the context scan path (CSin + CSclk, CSout), and the output Q.
the hardware supervisor. This service will also have a big impact on the scheduling service, as the ability to perform preemption with a very low overhead allows the use of more flexible scheduling algorithms.
(iii) Finally, the scheduling service, and in particular the space management part of the scheduling: it takes advantage of the column topology and the centralized communication scheme. The reconfigurable resource can then be managed as a virtually infinite space containing an undetermined number of columns. The job is to dynamically map the virtual space onto the real space (the actual reconfigurable logic core of the FGDRA).
3.5. Context Management Scheme. In [1], we proposed a context management scheme based on a scanpath, a local context memory, and the CMU. The context management scheme in OLLAF differs in two ways. First, all context management related material is hardwired. Second, we added two more stages in order to further lower the preemption overhead and to ensure the consistency of the system.
As the context management material is added at the hardware level and no longer at the task level, it needed to be split differently. As the programmable logic core is column based, it was natural to implement context management at column level. A CMU and an LCM have thus been added to each column, and one scanpath is provided for each column's set of flip-flops.
In order to lower the preemption overhead, our reconfigurable logic core uses a dual plane: an active plane and a hidden plane. The flip-flops used in the logic elements are thus replaced with two flip-flops with switching material. The architecture of this dual-plane flip-flop can be seen in Figure 3. Run and scan are then no longer two working modes but two parallel planes that can be swapped. With this topology, the context of a task can be shifted in while the previous task is still running, and shifted out while the next one is already running. The effective task-switching overhead is then taken down to one clock cycle, as illustrated in Figure 5.
Contexts are transferred by the CMU into the LCM in the hidden plane through a scanpath. Because the context of every column can be transferred in parallel, the LCM is placed at column level. This is particularly useful when a task uses more than one column. In the first prototype, those memories can store 3 configurations and 3 contexts. The LCM optimizes access to a bigger memory called the Central Context Repository (CCR).
Figure 4: Context memory hierarchy.
Level:      Size (nb of contexts)  Speed
Dual plane: 1 (+1 active)          fixed, 1 clk
LCM:        10                     fixed, depending on column size (1 clk/logic element)
CCR:        >100                   random, bus access speed
Figure 5: Typical preemption scenario. The time axis shows execution, configuration scan, and context scan for planes 1 and 2: T2's configuration transfer and context restore proceed on the hidden plane while T1 executes, the planes are then swapped with a 1 Tclk overhead, and T1's context save follows.
The CCR is a large memory space storing the context of each task instance run by the system. The LCM should then store the contexts of the tasks that are most likely to be the next to run on the corresponding column.
After a preemption of the corresponding task, a context can be stored in more than one LCM in addition to the copy stored in the CCR. In such a situation, care must be taken to ensure the consistency of the task execution. For that purpose, each time a context saving is performed, the context is tagged by the CMU with a version number. The operating system keeps track of this version number and also increments it each time a context saving is performed. In this way the system can check the validity of a context before a context restoration. The system must also update the context copy in the CCR as soon as possible after a context saving is performed, with a write-through policy.
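The consistency rule above can be sketched in a few lines. The model below is purely illustrative: the names (ContextStore, save, is_valid) are ours, and in OLLAF the version tags would live in the CMU and the OS kernel rather than in a Python dictionary.

```python
class ContextStore:
    """Toy model of version-tagged context copies spread over several
    LCMs plus the CCR. A copy is only restorable if its version tag
    matches the latest version recorded by the OS for that task."""

    def __init__(self):
        self.latest = {}   # task -> latest version number (OS view)
        self.copies = {}   # (task, location) -> version of stored copy

    def save(self, task, location):
        # Each save increments the task's version and tags the new copy.
        self.latest[task] = self.latest.get(task, 0) + 1
        self.copies[(task, location)] = self.latest[task]
        # Write-through: the CCR copy is updated as soon as possible.
        self.copies[(task, "CCR")] = self.latest[task]

    def is_valid(self, task, location):
        return self.copies.get((task, location)) == self.latest.get(task)

store = ContextStore()
store.save("T1", "LCM0")             # version 1 in LCM0 and in the CCR
store.save("T1", "LCM2")             # version 2: the LCM0 copy is stale
print(store.is_valid("T1", "LCM0"))  # → False
print(store.is_valid("T1", "LCM2"))  # → True
```

The point of the tag is exactly what the two prints show: after a newer save elsewhere, a stale LCM copy is rejected before any restoration.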
The dual plane, the LCM, and the CCR form a complete memory hierarchy specially designed to minimize the preemption overhead, as seen in Figure 4. The same memory scheme is also used for configuration management, except that a configuration does not change during execution, so it does not need to be saved, and no versioning control is required. The programmable logic core uses a dual configuration plane equivalent to the dual plane used for contexts. Each column has an HCM, which is a simplified version of the CMU (without the saving mechanism). The LCM is designed to store an integer number of both contexts and configurations.
In the best case, the preemption overhead can then be bounded to one clock cycle.
A typical preemption scenario is presented in Figure 5. In this scenario we consider the case where the contexts and configurations of both tasks are already stored in the LCM. Let us consider that a task T1 is preempted to run another task T2; the task preemption scenario is then as follows:
(i) T1 is running and the scheduler decides to preempt it to run T2 instead,
(ii) T2's configuration and eventual context are shifted onto the hidden plane,
(iii) once the transfer is completed, the two configuration planes are switched,
(iv) now T2 is running and T1's context can be shifted out to be saved,
(v) T1's context is updated as soon as possible in the CCR.
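The five steps above can be summarized as a small behavioral model. This is an illustration of the dual-plane mechanism only, not OLLAF's hardware; the class and method names are ours.

```python
class DualPlaneColumn:
    """Toy model of one OLLAF column: an active plane and a hidden
    plane that can be swapped in a single step once the hidden plane
    holds the next task's configuration and context."""

    def __init__(self, running=None):
        self.active = running   # task currently executing
        self.hidden = None      # task being prepared in the background

    def preload(self, task):
        # Steps (i)-(ii): shift the next task's configuration and
        # context into the hidden plane while the active task runs.
        self.hidden = task

    def swap(self):
        # Step (iii): one-cycle plane switch. Afterwards (step (iv))
        # the old task sits in the hidden plane and can be scanned out.
        self.active, self.hidden = self.hidden, self.active
        return self.hidden      # context to save back to LCM/CCR

col = DualPlaneColumn(running="T1")
col.preload("T2")            # happens while T1 keeps executing
saved = col.swap()           # the only visible overhead: one cycle
print(col.active, saved)     # → T2 T1
```

Everything except swap() overlaps with execution, which is why only one clock cycle of overhead remains visible to the running application.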
4. Preemption Cost Analysis
4.1. OLLAF versus Other FPGA-Based Works. This section presents an analytic comparison of the preemption management efficiency of different solutions using a commercial FPGA platform and of our FGDRA OLLAF. The comparison was made on six different management methods to transfer the context and the configuration for the preemption, including the methods in use in OLLAF.
The six considered methods are
XIL: a solution based on the Xilinx XAPP290 [13] using the ICAP interface to transfer both context and configuration and using the readback bitstream for context extraction,
Scan: a solution using a simple scanpath for context transfer, as described in both [1, 14], and using the ICAP interface for configuration transfer,
PCS8: a solution similar to the Scan solution but using 8 parallel scanpaths, as described in [1], to transfer the context; the ICAP interface is still used for configuration transfer,
DPScan: a solution that uses a dual-plane scanpath similar to the one used in OLLAF for context transfer and ICAP for configuration transfer. This method is also studied in [14], where it is referred to as a shadow Scan Chain,
MM: a solution that uses ICAP for configuration transfer and the memory-mapped solution proposed in [14] for context transfer,
OLLAF: a solution that uses separate dual-plane scanpaths for configuration transfer and context transfer, as used in the FGDRA architecture proposed in this article.
We define the preemption overhead H as the cost of a preemption for the system in terms of time, expressed as a number of clock cycles, or tclk. In the same way, all transfer times are expressed and estimated in numbers of clock cycles, as we want to focus on the architectural view only. Task sizes will be parameterized by n, the number of flip-flops used.
The preemption overhead can be due to context transfers (two transfers: one from the previously running task to save its context and one to the next task to restore its context), configuration transfers (to configure the next task), and eventually context data extraction (if the context data are spread among other data, as in the XIL solution).
The five first solutions use the ICAP interface as the configuration transfer method. Using this method, transfers are made as configuration bitstreams. A configuration bitstream contains both a configuration and a context. In the same way, for the XIL solution, which also uses the ICAP interface for context saving, the readback bitstream contains both a configuration and a context. In this case only the context is useful, but we need to transfer both configuration and context and then to spend some extra time extracting the context.
According to [14], we can estimate that for an n flip-flop IP, and so an n-bit context, the configuration is 20n bits. That means a typical ICAP bitstream of 21n bits.
Analytic expressions of H for each case are estimated as follows.
XIL. Assuming that it uses a 32-bit-wide access bus, the ICAP interface can transfer 32 bits per clock cycle. A complete preemption process will require the transfer of two complete bitstreams at this rate. In [14], the authors estimate that it takes 20 clock cycles to extract each context bit from the readback bitstream. This time should then also be taken into account for the preemption overhead:
H = 21n/32 + 21n/32 + 20n ≈ 21.3n. (1)
Scan. Using a simple scanpath for context transfer requires 1 clock cycle per flip-flop for each context transfer. As we use the ICAP interface for configuration transfer, as mentioned earlier, this implies the effective transfer of a complete bitstream. That means that the context of the next task is transferred two times, even if only one of the transfers contains the real useful data:
H = 21n/32 + 2n ≈ 2.66n. (2)
PCS8. Using 8 parallel scanpaths requires 1 clock cycle for 8 flip-flops. The configuration transfer remains the same as for the previous solution:
H = 21n/32 + 2n/8 ≈ 0.9n. (3)
DPScan. Using a dual-plane scanpath, the context transfers can be hidden; the cost of those transfers is then always 1 clock cycle. The configuration transfer remains the same as for the previous solutions:
H = 21n/32 + 1 ≈ 0.66n + 1. (4)
MM. Using 32-bit memory access, this case is similar to PCS8 but using 32 parallel paths instead of 8. The configuration transfer remains the same as for the previous solutions:
H = 21n/32 + 2n/32 ≈ 0.69n. (5)
OLLAF. In OLLAF, both context and configuration transfers can be hidden, so the total cost of the preemption is always 1 clock cycle whatever the size of the task:
H = 1. (6)
As a point of comparison, considering a typical operating system clock tick of 10 ms and assuming a typical clock frequency of 100 MHz, the OS tick is 10^6 tclk.
To make our comparison, we consider two tasks T1 and T2. We consider a DES56 cryptographic IP that requires 862 flip-flops and a 16-tap FIR filter that requires 563 flip-flops. Both of those IPs can be found at www.opencores.org. To ease the computation we will consider two tasks using the average number of flip-flops of the two considered IPs. So for T1 and T2, we get n = (862 + 563)/2 ≈ 713. Table 1 shows the overhead H for each presented method.
Table 1: Comparison of task preemption overhead for a 713 flip-flop task.
Method:   XIL   | Scan | PCS8 | DPScan | MM  | OLLAF
H (tclk): 15188 | 1897 | 642  | 472    | 492 | 1
Table 2: Comparison of task preemption overhead for a whole 1M flip-flop FGDRA.
Method:   XIL       | Scan      | PCS8     | DPScan   | MM       | OLLAF
H (tclk): 21.3×10^6 | 2.66×10^6 | 900×10^3 | 660×10^3 | 690×10^3 | 1
Those results show that in this case, using our method leads to a preemption overhead around 500 times smaller than with the best other method.
If we now consider that not only one task is preempted but the whole FGDRA surface, assuming a 1 million LE logic core, the estimated overhead for each method is shown in Table 2. In the XIL case the preemption overhead is about 20 times the tick period, which is not acceptable. Those results show clearly the benefit of OLLAF over actual FPGAs concerning preemption. Using actual methods, the preemption overhead is linearly dependent on the size of the task. In OLLAF, this overhead does not depend on the size of the task and is always only one clock cycle.
In OLLAF, both context and configuration transfers are hidden due to the use of the dual plane. The latency L between the moment a preemption is requested and the moment the new task effectively begins to run can also be studied. This latency only depends on the size of the columns. In the worst case this latency will be far shorter than the OS tick period. The OS tick period being in any case the shortest time in which the system must respond to an event, we can consider that this latency will not affect the system at all.
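The expressions (1) to (6) are simple enough to check numerically. The sketch below recomputes H with the rounded per-flip-flop coefficients of the equations, for n = 713 (Table 1) and n = 10^6 (Table 2); the dictionary-based formulation is ours, and small rounding differences against the printed tables are expected.

```python
# Preemption overhead H(n), in clock cycles, for each method, using the
# rounded coefficients of equations (1) to (6).
H = {
    "XIL":    lambda n: 21.3 * n,      # two bitstreams + 20 clk/bit extraction
    "Scan":   lambda n: 2.66 * n,      # bitstream + two serial context scans
    "PCS8":   lambda n: 0.9 * n,       # bitstream + 8-wide context scans
    "DPScan": lambda n: 0.66 * n + 1,  # bitstream + hidden context swap
    "MM":     lambda n: 0.69 * n,      # bitstream + 32-wide memory-mapped context
    "OLLAF":  lambda n: 1,             # everything hidden: one plane swap
}

for n in (713, 10**6):
    print({method: round(f(n)) for method, f in H.items()})
```

Only OLLAF is constant in n; every other method grows linearly with the task size, which is exactly the scalability gap Tables 1 and 2 illustrate.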
5. Dynamic Application Case Studies
In this section, we consider a few application cases to demonstrate the contribution of the OLLAF architecture, especially for the implementation of dynamic applications. Applications are presented here as task dependency graphs, each task being characterized by its execution time, its size as a number of occupied columns, and eventually its periodicity.
In this study, we consider an OLLAF prototype with four columns. The study consists of comparing the execution of a particular application case using three different context transfer methods. The first considered context transfer method is the use of an ICAP-like interface; this will be the reference method, as it is the one considered in most of today's works on reconfigurable computing. The second considered method is the method used in the OLLAF architecture as presented earlier. We consider here an LCM with a size of 3 configurations and 3 contexts. Then, in order to study in more detail the contributions of the dual planes and of the LCM, we also consider a method consisting of an OLLAF-like architecture but using only one plane. As the use of a dual plane will have a major impact on the reconfigurable logic core's performance, this last case is of primary concern to justify this cost.
Figure 6: Memory view of the considered implementation of OLLAF. The CCR is the main memory, feeding the per-column LCMs (level T_M); the LCMs feed the configuration planes (level T_L1); and each configuration plane swaps with its execution plane (level T_L0).
Table 3: Transfer times and lengths in clock periods for each level.
Level:              T_M      | T_L1     | T_L0
Tr. length (#Tclk): 53760    | 16384    | 1
Transfer time:      537.6 μs | 16.38 μs | 10 ns
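For reference, the T_M figure can be rederived from the layout: a 1680 Kbit column image moved over a 32-bit bus takes 1680 × 1024 / 32 = 53760 cycles, that is, 537.6 μs at 100 MHz, while a plane swap (T_L0) costs a single 10 ns cycle. The snippet below only reruns this arithmetic; the variable names are ours.

```python
CLK_NS = 10                    # one clock period at 100 MHz, in ns
COLUMN_BITS = 1680 * 1024      # context + configuration of one column
BUS_BITS = 32                  # width of the bus shared with the CCR

t_m_cycles = COLUMN_BITS // BUS_BITS   # full column image over the bus
t_l0_cycles = 1                        # dual-plane swap: one cycle

print(t_m_cycles)                      # → 53760
print(t_m_cycles * CLK_NS / 1e3)       # → 537.6 (μs)
print(t_l0_cycles * CLK_NS)            # → 10 (ns)
```

The five-orders-of-magnitude gap between t_m_cycles and t_l0_cycles is what the caching hierarchy is built to exploit.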
Figure 6 shows a memory hierarchy view of OLLAF. The CCR is the main memory, the LCMs constitute the local column caches, and the dual plane is the highest and very fast level. T_L0, T_L1, and T_M represent the three transfer levels in the OLLAF architecture. The ICAP-like case implies only T_M, the OLLAF simple one implies T_M and T_L1, and finally the OLLAF case involves the three transfer levels. Each transfer level is characterized by the time necessary to transfer the whole context of one column. In this study we chose to use a reconfigurable logic core composed of four columns of 16384 Logic Elements each. Using this layout, the context and configuration of a column comprise 1680 Kbits.
Table 3 gives the transfer time for one column's context and configuration in clock periods, assuming a working frequency of 100 MHz. Those parameters will be useful, as the study now consists of counting the number of transfers at each level for every different application case and transfer method. We thus study the temporal cost of context transfers for a whole sequence of each application case. We have to distinguish two cases: the very first execution, where the caches
Figure 7: First case, a simple linear application: T1 (P = 40 ms, T = 40 ms, S = 1) → T2 (T = 10 ms, S = 3) → T3 (T = 15 ms, S = 2).
Figure 8: Second case, two dynamically chosen tasks: T1 (P = 40 ms, T = 40 ms, S = 1) → T2 (T = 10 ms, S = 3) → either T3 (T = 15 ms, S = 2) or T4 (T = 10 ms, S = 2).
are empty, and every later execution of the sequence, where the caches and planes already contain contexts and configurations.
The applications presented here each involve a first task T1 which has a periodicity of 40 ms; each time an execution of this task finishes, the remaining sequence begins (creation of task T2, . . .) and a new instance of T1 is run. This corresponds to a typical real-time imaging system: a task is in charge of capturing a picture, then each time a picture has been fully captured, this picture is processed by a set of other tasks, while the next picture is being captured.
5.1. Considered Cases. The first case, as seen in Figure 7, is an application composed of three linearly dependent tasks. It presents no particular dynamicity and thus serves as a reference case.
The second considered case, as seen in Figure 8, presents a dynamic branch. By that we mean that, depending on the result of task T2's processing, the system may run T3 or T4. As those two last tasks present different characteristics, the overall behavior of the system will differ depending on the input data. This is a typical example of a dynamic application; in such cases, the system management must be performed online. In order to study such a dynamic case, we gave a probability to each possible branch. Here we consider that the probability of task T3 is 20% while the probability of T4 is 80%. Those probabilities are drawn randomly in order to be able to perform a statistical study of this application. In a real case those probabilities may not be known in advance, as they depend on the input data; we could then consider having online profiling in order to improve the efficiency of the caching system, but this is beyond the scope of this article. One can note that the MPEG encoding algorithm is an example of an algorithm presenting this kind of dynamicity.
In the last considered case, in Figure 9, the dynamicity is not in which task will be executed next but in how many instances of the same task will be executed. This can be seen as dynamic thread creation. This kind of case can be found in some computer vision algorithms where a first task detects objects and then a particular treatment is applied to each detected object. As we cannot know in advance how many objects will be detected, the treatments applied to each of those objects must be dynamically created. In this particular case, we consider that the system can handle from 0 up to 4 objects in each scene. That means that, depending on the input data, from 0 up to 4 instances of the task Tdyn can be created and executed. The probabilities for each possible number of detected objects are shown in the probability graph in Figure 9; we chose a Gaussian-like probability figure, which is a typical realistic distribution.
This case is particularly interesting for many reasons. First, the loading condition of the task T2 dynamically depends on the previous iteration of the sequence. As an example, if no object has been detected in the previous scene, then no Tdyn has been created and thus T2 is still fully operational in the active plane; it may only eventually have to be reset. If now 3 or more objects have been detected and thus all three free columns have been used, then the full context of T2 has to be loaded from the second plane or in some cases from the local caches.
Another interesting aspect occurs when 4 objects are detected and so 4 Tdyn are created and must be executed. In that case, if the first three Tdyn are executed, one on each free column, and the fourth is then executed on one random column, a new image will arrive before the processing of the current one is finished; in other terms, the deadline is missed. However, by scheduling those four Tdyn instances using a simple round-robin algorithm with a quantum time of 5 ms, real-time treatment can be achieved. It should be noticed that this scheduling is only possible if preemption is allowed by the platform.
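A quick simulation illustrates the point. With 4 instances of 15 ms each on 3 columns, round-robin with a 5 ms quantum finishes all instances 20 ms after they start, so the whole sequence (10 ms for T2 plus 20 ms) fits the 40 ms period, whereas running the fourth instance sequentially after the first three needs 30 ms on top of T2. The simulator below is our own sketch, not OLLAF's scheduler, and it assumes the negligible switching overhead that the dual plane provides.

```python
def round_robin_makespan(n_tasks, work_ms, n_columns, quantum_ms):
    """Simulate round-robin over identical tasks on identical columns.
    Returns the time at which the last task completes."""
    remaining = [work_ms] * n_tasks
    t, nxt = 0, 0                      # elapsed time, next task to serve
    while any(r > 0 for r in remaining):
        served = 0
        i = nxt
        while served < n_columns:      # fill the columns for this quantum
            if remaining[i] > 0:
                remaining[i] -= quantum_ms
                served += 1
            i = (i + 1) % n_tasks
            if i == nxt:
                break                  # fewer runnable tasks than columns
        nxt = (nxt + 1) % n_tasks      # rotate priority between quanta
        t += quantum_ms
    return t

print(round_robin_makespan(4, 15, 3, 5))   # → 20 (ms, meets the deadline)
print(round_robin_makespan(4, 15, 3, 15))  # → 30 (ms, run-to-completion)
```

The second call models the non-preemptive schedule (quantum equal to the whole task), which is the case that misses the 40 ms period once T2's 10 ms are added.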
5.2. Results. Tables 4, 5, and 6 show the execution results for each presented application case in terms of transfer cost. For each case, we show the number of transfers that occur per sequence iteration at each possible stage, depending on the considered architecture. We also give the total time spent transferring contexts. Those results do not take into account transfers that are hidden by being parallelized with the execution of a task in the considered column, as those transfers do not present any temporal cost for the system. Concerning levels T_L1 and T_L0, multiple transfers can occur
Figure 9: Third case, dynamic creation of multiple instances of a task: T1 (P = 40 ms, T = 40 ms, S = 1) → T2 (T = 10 ms, S = 3) → create 0 to 4 instances of Tdyn (T = 15 ms, S = 1); the probability of each number of created instances follows a Gaussian-like distribution over 0 to 4.
Table 4: Results for case 1 execution.
First iteration: #T_M, #T_L1, #T_L0, Total time; Next iterations: #T_M, #T_L1, #T_L0, Total time
ICAP-like:    3, -, -, 1.61 ms; 4, -, -, 2.15 ms
OLLAF simple: 1, 2, -, 570 μs;  0, 1, -, 32.8 μs
OLLAF:        1, 1, 3, 554 μs;  0, 0, 2, 20 ns
Table 5: Results for case 2 execution.
First iteration: #T_M, #T_L1, #T_L0, Total time; Next iterations: #T_M, #T_L1, #T_L0, Total time
ICAP-like:    3, -, -, 1.61 ms; 4, -, -, 2.15 ms
OLLAF simple: 3, 2, -, 1.65 ms; 0, 1, -, 32.8 μs
OLLAF:        1, 2, 3, 570 μs;  0, 0.5, 2, 8.21 μs
Table 6: Results for last case execution.
First iteration: #T_M, #T_L1, #T_L0, Total time; Next iterations: #T_M, #T_L1, #T_L0, Total time
ICAP-like:    6.6, -, -, 3.55 ms;  5.5, -, -, 2.96 ms
OLLAF simple: 1, 3.2, -, 590 μs;   0, 3.1, -, 50.8 μs
OLLAF:        1, 1, 3.2, 554 μs;   0, 0, 2.1, 21 ns
in parallel (one on each column); in those cases only one transfer is counted, as the temporal cost is always that of one transfer at the considered stage.
The results using OLLAF for the first iteration of the sequence give information about the contribution of the dual planes, while the results for the next iterations using OLLAF simple give information about the contribution of the LCM only. If we now consider the results for the next iterations using OLLAF, we can see that a major gain is obtained by combining the LCM and a dual plane. In the cases considered here, this gain is a factor between 10^3 for case 2 and 10^6 for cases 1 and 3 compared to the ICAP solution.
We also have to consider the scalability of the proposed solutions. Transfers at level T_L0 depend on neither the column size nor the number of columns in the considered platform. The T_L1 transfer time depends on the size of each column but not on the number of columns in use. T_M transfers depend not only on the column size but also on the number of columns, as all transfers at this level share the same source memory (the CCR) and the same bus. We can see
Figure 10: Summary of total transfer cost per sequence. Total transfer time, in log10(ns), for the OLLAF, OLLAF simple, and ICAP-like methods in cases 1, 2, and 3.
that using the classical approach will face scalability issues, while OLLAF offers a far better scalability potential, as the transfer cost is far less dependent on the platform size.
Figure 10 gives a summarized view of the results. It presents the total transfer cost per sequence iteration in normal execution (i.e., not for the first execution). Results are presented here in nanoseconds using a decimal logarithmic scale. This figure reveals the contribution of the OLLAF architecture in terms of context transfer overhead reduction. In all three cases, OLLAF is the best solution. Case 3 shows that it is well adapted to dynamic applications.
Those results not only prove the benefit of the OLLAF architecture, but they also demonstrate that the use of the LCM allows taking better advantage of the dual planes.
6. Conclusion
In this paper we presented a Fine-Grained Dynamically Reconfigurable Architecture called OLLAF, specially designed to enhance the efficiency of the Operating System services necessary for its management.
A case study considering several typical applications with different degrees of dynamicity revealed that this architecture achieves far better efficiency for task loading and execution-context saving services than the actual FPGAs traditionally used as FGDRAs in most recent studies. In the best case, task switching can be achieved in just one clock cycle. A more realistic statistical analysis showed that, for every basic dynamic case considered, the OLLAF platform always outperforms commercially available solutions by a factor of around 10^3 to 10^6 in context-transfer costs. The analysis showed that this result is achieved thanks to the combination of the dual planes and the LCM.
This feature allows fast preemption and thus makes it possible to handle dynamic applications efficiently. It also opens the door to many different scheduling strategies that cannot be considered with a classical architecture.
Future work will address the development of an online scheduling service taking into account the new possibilities offered by OLLAF. We could include a prediction mechanism in this scheduler, performing smart configuration and context prefetches. Being able to predict, in most cases, the future task that will run in a particular column will make it possible to take even better advantage of the context and configuration management scheme proposed in OLLAF.
This work contributes to making FGDRAs a much more realistic option as a universal computing resource, and makes them one possible solution to keep the evolution of electronic systems going in the "More than Moore" fashion. For that purpose, we claim that a lot of effort must be put into building strong consistency between design tools, Operating Systems, and platforms.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 175043, 21 pages
doi:10.1155/2009/175043
Research Article
Trade-Off Exploration for Target Tracking Application in a Customized Multiprocessor Architecture
Jehangir Khan,¹ Smail Niar,¹ Mazen A. R. Saghir,² Yassin El-Hillali,¹ and Atika Rivenq-Menhaj¹
¹ Université de Valenciennes et du Hainaut-Cambrésis, ISTV2 - Le Mont Houy, 59313 Valenciennes Cedex 9, France
² Department of Electrical and Computer Engineering, Texas A&M University at Qatar, 23874 Doha, Qatar
Correspondence should be addressed to Smail Niar, smail.niar@univ-valenciennes.fr
Recommended by Markus Rupp
This paper presents the design of an FPGA-based multiprocessor-system-on-chip (MPSoC) architecture optimized for Multiple Target Tracking (MTT) in automotive applications. An MTT system uses an automotive radar to track the speed and relative position of all the vehicles (targets) within its field of view. As the number of targets increases, the computational needs of the MTT system also increase, making it difficult for a single processor to handle the load alone. Our implementation distributes the computational load among multiple soft processor cores optimized for executing specific computational tasks. The paper explains how we designed and profiled the MTT application to partition it among different processors. It also explains how we applied different optimizations to customize the individual processor cores to their assigned tasks and to assess their impact on performance and FPGA resource utilization. The result is a complete MTT application running on an optimized MPSoC architecture that fits in a contemporary medium-sized FPGA and that meets the application's real-time constraints.
Copyright © 2009 Jehangir Khan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Technological progress has certainly influenced every aspect of our lives, and the vehicles we drive today are no exception. Fuel economy, interior comfort, and entertainment features of these vehicles draw ample attention, but the most important objective is to aid the driver in avoiding accidents.
Road accidents are primarily caused by the driver's misjudgment of a delicate situation. The main reason behind the driver's inability to judge a potentially dangerous situation correctly is the mental and physical fatigue due to stressful driving conditions. In cases where visibility is low due to poor weather or night-time driving, the stress on the driver increases even further.
An automatic early-warning and collision-avoidance system onboard a vehicle can greatly reduce the pressure on the driver. In the literature, such systems are called Driver Assistance Systems (DASs). DASs not only automate safety mechanisms in a vehicle but also help drivers take correct and quick decisions in delicate situations. These systems provide the driver with a realistic assessment of the dynamic behavior of potential obstacles before it is too late to react and avoid a collision.
In the past few years, various types of DASs have been the subject of research studies [1-3]. Most of these works concentrate on visual aid to the driver by using a video camera. Cameras are usually used for recognizing road signs, lane-departure warnings, parking assistance, and so forth. Identification of potential obstacles and taking corrective action are still left to the driver. Moreover, cameras have limitations in bad weather and low-visibility conditions.
Our system uses a radar installed in a host vehicle to scan its field of view (FOV) for potential targets, and partitions the scanned data into sets of observations, or tracks [4]. Potentially dangerous obstacles are then singled out, and visual and audio alerts are generated for the driver so that a preventive action can be taken. The output signals generated by the system can also be routed to an automatic control system and safety mechanisms in the case of vehicles equipped for fully autonomous driving. We aim to use a low-cost automotive radar and complement it with an embedded tracking system to achieve higher performance. The objective is to reduce the cost of the system without sacrificing its accuracy and precision.
The principal contributions of this work are as follows:
(i) design and development of a new MTT system specially adapted to the requirements of automotive safety applications;
(ii) a feasible and scalable implementation of the system on a low-cost, configurable, and flexible platform (FPGA);
(iii) optimization of the system to meet the real-time performance requirements of the application and to reduce the hardware size to the minimum possible limit; this not only helps reduce the energy consumption but also creates room for adding more functionality into the system using the same low-cost platform.
We implement our system on an FPGA using a multiprocessor architecture, which is inherently flexible and adaptable. FPGAs are increasingly being used as the platforms of choice for implementing complex embedded systems due to their high performance, flexibility, and fast design times. Multiprocessor architectures have also become popular for several reasons. For example, monitoring processor properties over the last three decades shows that the performance of a single processor has leveled off in the last decade. Using multiple processors with a lower frequency results in performance comparable, in terms of instructions per second, to a single highly clocked processor, and reduces power consumption significantly [5]. A dedicated fully-hardware implementation may be useful for high-speed processing, but it does not offer the flexibility and programmability desired for system evolution. Fully hardware implementations also require longer design times and are inherently inflexible. Applications with low-power requirements and hardware size constraints are increasingly resorting to MPSoC architectures. The move to MPSoC design elegantly addresses the power issues faced on the hardware side while ensuring the required speed performance.
2. MTT Terminology and Building Blocks
2.1. Terminology. In the context of target tracking applications, a target represents an obstacle in the way of the host vehicle. Every obstacle has an associated state represented by a vector that contains the parameters defining the target's position and its dynamics in space (e.g., its distance, speed, azimuth or elevation, etc.).
A state vector with n elements is called an n-state vector. A concatenation of target states defining the target trajectory or movement history at discrete moments in time is called a track.
The behavior of a target can ideally be represented by its true state. The true state of a target is what characterizes the target's dynamic behavior and its position in space in a 100% correct and exact manner. A tracking system attempts to estimate the state of a target as close to this ideal state as possible. The closer a tracking system gets to the true state, the more precise and accurate it is. For achieving this goal, a tracking system deals with three types of states:
(i) The Observed State, or the Observation, corresponds to the measurement of a target's state by a sensor (a radar in our application) at discrete moments in time. It is one of the two representations of the true state of the target. The observed state is obtained through a measurement model, also termed the observation model (refer to Section 4.2). The measurement model mathematically relates the observed state to the true state, taking into account the sensor inaccuracies and the transmission channel noises. The sensor inaccuracies and the transmission noises are collectively called measurement noise.
(ii) The Predicted State, or the Prediction, is the second representation of the target's true state. Prediction is done for the next cycle, before the sensor sends the observations. It is a calculated guess of the target's true state before the observation arrives. The predicted state of a target is obtained through a process model (refer to Section 4.1). The process model mathematically relates the predicted state to the true state while taking into account the errors due to the approximations of the random variables involved in the prediction process. These errors are collectively termed process noise.
(iii) The Estimated State, or the Estimate, is the corrected state of the target, which depends on both the observation and the prediction. The correction is done after the observation is received from the sensor. The estimated state is calculated by taking into account the variances of the observation and the prediction. To get a state that is more accurate than both the observed and predicted states, the estimation process calculates a weighted average of the observed and predicted states, favoring the one with the lower variance over the one with the larger variance.
In this paper, the term scan refers to the periodic sweep of the radar field of view (FOV), giving observations of all the detected targets. The FOV is usually a conical region in space inside which an obstacle can be detected by the radar. The area of this region depends upon the radar range (distance) and its view angle in azimuth.
The radar Pulse Repetition Time (PRT) is the time interval between two successive radar scans. The PRT for the radar unit we are using is 25 ms. This is the time window within which the tracking system must complete the processing of the information received during a scan. After this interval, new observations are available for processing. As we shall see later, the PRT puts an upper limit on the latency of the slowest module in the application.
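As an illustration of this constraint, a feasibility check over hypothetical module latencies (the module names and the millisecond figures below are invented, not profiled values) might look like:

```python
# Real-time feasibility check: in a pipelined MTT implementation the
# throughput is limited by the slowest module, which must complete
# within one radar Pulse Repetition Time (PRT).
PRT_MS = 25.0

# Hypothetical per-module latencies in milliseconds (illustrative only).
module_latency_ms = {
    "gate_checker": 3.0,
    "cost_matrix": 4.5,
    "assignment_solver": 6.0,
    "kalman_filters": 5.0,
}

slowest = max(module_latency_ms, key=module_latency_ms.get)
feasible = module_latency_ms[slowest] <= PRT_MS
print(slowest, feasible)
```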
2.2. MTT Building Blocks. A generalized view of a Multiple Target Tracking (MTT) system is given in Figure 1. The system can broadly be divided into two main blocks, namely Data Association and Filtering & Prediction. The two blocks work in a closed loop. The Data Association block is
[Figure 1: A simplified view of MTT. Observations from the sensor enter the Data Association block (gate computation, observation-to-track assignment, track maintenance), which works in a closed loop with the Filtering & Prediction block to produce the estimated target states.]
further divided into three subblocks: Track Maintenance, Observation-to-Track Assignment, and Gate Computation.
Figure 1 represents a textbook view of the MTT system as presented in [6, 9]. Practical implementations and internal details may vary depending on the end use and the implementation technology. For example, the Filtering & Prediction module may be implemented choosing from a variety of algorithms such as the αβ filter [4, 6], the mean-shift algorithm [7], the Kalman filter [6, 8, 9], and so forth. Similarly, the Data Association module is usually modeled as an assignment problem. The assignment problem itself may be solved in a variety of ways, for example, by using the Auction algorithm [10] or the Hungarian/Munkres algorithm [11, 12].
The choice of algorithms for the subblocks is driven by factors like the application environment, the amount of available processing resources, the hardware size of the end product, the track precision, and the system response time.
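To make the assignment-problem formulation concrete, here is a brute-force solver over a tiny cost matrix. This is purely illustrative: it is not the Auction or Munkres algorithm mentioned above, which solve the same problem far more efficiently:

```python
from itertools import permutations

def brute_force_assignment(cost):
    """Return (best_total_cost, pairing) minimizing the summed cost of a
    one-to-one observation-to-prediction pairing, by trying every
    permutation. Exponential in N: usable only for tiny matrices; real
    systems use the Auction or Munkres algorithm instead."""
    n = len(cost)
    best = (float("inf"), None)
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best[0]:
            best = (total, perm)
    return best

# Toy 3x3 cost matrix: rows = observations, columns = predictions.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
print(brute_force_assignment(cost))
```

Observation i is assigned to prediction perm[i] in the returned pairing; the minimum here is 5 with the pairing (1, 0, 2).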
3. Hardware/Software Codesign Methodology
For designing our system, we followed the Y-chart codesign methodology depicted in Figure 2.
On the right-hand side, the software design considerations are taken into account. These include the choice of the programming language, the software development tools, and so forth. On the left-hand side, the hardware design tools, the choice of processors, the implementation platform, and the application programming interface (API) of the processors are defined. In the middle, the MPSoC hardware is generated and the software is mapped onto the processors.
After constructing the initial architecture, its performance is evaluated. If further performance improvement is needed, we track back to the initial steps and optimize various aspects of the software and/or the hardware to achieve the desired performance. The modalities of the track-back step are mostly dependent on the experience and expertise of the designer. For this work, we used a manual track-back approach based on the profiling statistics of the application. As a part of our ongoing work, we are formalizing the design approach to help the designer in
[Figure 2: The Y-chart flow for codesign. Hardware aspects (architectural platform) and software aspects (application design, application development environment) converge on a programming model; software-to-hardware mapping and the code generation process lead to performance analysis, which feeds back through improving the architecture, re-arranging the application, and improving the mapping strategies.]
choosing the right configuration parameters and mapping strategies.
Following the codesign methodology, we first developed our application, the details of which are described in the next section. After developing the application, we move on to the architectural aspects of the system, which are detailed in Section 6.
4. Application Design and Development: Our Approach
As stated above, the choice of algorithms for the MTT system and their internal details are driven by various factors. We designed the application for mapping onto a multiprocessor system. A multiprocessor architecture can be exploited very efficiently if the underlying application is divided into simpler modules which can run in parallel. Moreover, simple multiple modules can be managed and improved independently of one another as long as the interfaces among them remain unchanged.
For the purpose of a modular implementation, we organized our MTT application into submodules, as shown in Figure 3. The functioning of the system is explained as follows. Assuming recursive processing, as shown by the loop in Figure 1, tracks would have been formed on the previous radar scan. When new observations are received from the radar, the processing loop is executed.
In the first cycle of the loop, at most 20 of the incoming observations would simply pass through the Gate Checker, the Cost Matrix Generator, and the Assignment Solver on to the filters' inputs. A filter takes an observation as an inaccurate representation of the true state of the target, and the amount of inaccuracy of the observation depends on the measurement variance of the radar. The filter estimates the current state of the target and predicts its next state before
[Figure 3: The proposed MTT implementation. Radar observations enter the Data Association block, comprising Gate Computation, the Gate Checker, the Cost Matrix Generator, and the Assignment Solver, plus Track Maintenance (observationless gate identifier, new target identifier, track initiation/deletion); it feeds the Filtering & Prediction block (correction/measurement update and prediction/time update), which returns the estimates.]
the next observation is available. The estimation process, and the MTT application as a whole, rely on mathematical models. The mathematical models we used in our approach are detailed below.
4.1. Process Model. The process model mathematically projects the current state of a target into the future. This can be represented by a linear stochastic difference equation:

Y_k = A Y_{k-1} + B U_k + W_{k-1}.  (1)
In (1), Y_{k-1} and Y_k are n-dimensional state vectors that include the n quantities to be estimated. Vector Y_{k-1} represents the state at scan k-1, while Y_k represents the state at scan k.
The n × n matrix A in the difference equation (1) relates the state at scan k-1 to the state at scan k, in the absence of either a driving function or process noise. Matrix A is the assumed-known state transition matrix, which may be viewed as the coefficient of state transformation from scan k-1 to scan k in the absence of any driving signal and process noise. The n × l matrix B relates the optional control input U_k ∈ R^l to the state Y_k, whereas W_{k-1} is zero-mean additive white Gaussian process noise (AWGN) with assumed-known covariance Q. Matrix B is the assumed-known control matrix, and U_k is the deterministic input, such as the relative position change associated with the host-vehicle motion.
4.2. Measurement Model. To express the relationship between the true state and the observed state (measured state), a measurement model is formulated. It is described by the linear expression

Z_k = H Y_k + V_k.  (2)

Here Z_k is the measurement or observation vector, containing two elements: the distance d and the azimuth angle θ. The 2 × n observation matrix H in the measurement equation (2) relates the current state to the measurement (observation) vector Z_k. The term V_k in (2) is a random variable representing the measurement noise.
For the implementation, we chose the example case given in [8]. In the rest of the paper, the numerical values of all the matrix and vector elements are borrowed from this example. In this example, the matrices and vectors in equations (1) and (2) have the forms shown below:
Y_k = [y_11, y_21, y_31, y_41]^T,

A =
| 1  T  0  0 |
| 0  1  0  0 |
| 0  0  1  T |
| 0  0  0  1 |,

Z_k = [d, θ]^T.  (3)
Here y_11 is the target range or distance; y_21 is the range rate or speed; y_31 is the azimuth angle; y_41 is the angle rate or angular speed. In vector Z_k, the element d is the distance measurement and θ is the azimuth angle measurement. Matrix B and the control input U_k are ignored here because they are not necessary in our application.
The radar Pulse Repetition Time (PRT) is denoted by T, and it is 0.025 seconds for the specific radar unit we are using in our project.
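Under this constant-velocity model, one noise-free step of (1) and (2) is just two matrix-vector products. A minimal sketch using the A and H matrices above with T = 0.025 s (the state values are invented for illustration):

```python
T = 0.025  # radar PRT in seconds

# State transition matrix A for the state [range, range rate, azimuth, angle rate].
A = [[1, T, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, T],
     [0, 0, 0, 1]]

# Observation matrix H: the radar measures only range and azimuth.
H = [[1, 0, 0, 0],
     [0, 0, 1, 0]]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

y = [100.0, -8.0, 0.10, 0.01]   # illustrative current state Y_{k-1}
y_pred = mat_vec(A, y)          # Y_k = A Y_{k-1}, noise terms omitted
z_pred = mat_vec(H, y_pred)     # predicted measurement [d, theta]
print(y_pred, z_pred)
```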
Having devised the process and measurement models, we need an estimator which uses these models to estimate the true state. We use the Kalman filter, which is a recursive Least Square Estimator (LSE) considered to be the optimal estimator for linear systems with Additive White Gaussian Noise (AWGN) [9, 13].
4.3. Kalman Filter. The Filtering & Prediction block in Figure 3 is particularly important, as the number of filters employed in this block is the same as the maximum number of targets to be tracked. In our work, we fixed this number at
20, as the radar we are using can measure the coordinates of a maximum of 20 targets. Hence this block uses 20 similar filters running in parallel. If the number of detected targets is less than 20, the idle filters are switched off to conserve energy.
Given the process and the measurement models in (1) and (2), the Kalman filter equations are
Ŷ⁻_k = A Ŷ_{k-1} + B U_k,  (4)
P⁻_k = A P_{k-1} A^T + Q,  (5)
K = P⁻_k H^T (H P⁻_k H^T + R)^{-1},  (6)
Ŷ_k = Ŷ⁻_k + K (Z_k − H Ŷ⁻_k),  (7)
P_k = (I − K H) P⁻_k.  (8)
Here Ŷ⁻_k is the state prediction vector, Ŷ_{k-1} is the state estimation vector from the previous scan, K is the Kalman gain matrix, P⁻_k is the prediction error covariance matrix, P_k is the estimation error covariance matrix, and I is an identity matrix of the same dimensions as P_k. Matrix R represents the measurement noise covariance, and it depends on the characteristics of the radar.
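A pure-Python sketch of one predict/correct cycle of equations (4) through (8), using the A and H matrices of the model. The seed covariance P, the noise matrices R and Q, and the observation z below are illustrative placeholders, not the values of the cited example:

```python
# One Kalman predict/correct cycle, eqs (4)-(8). A and H are the model
# matrices (T = 0.025 s); P, R, Q, and z are illustrative placeholders.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def inv2(S):
    # Inverse of a 2x2 matrix; S = H P- H^T + R is 2x2 in this model.
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

T = 0.025
A = [[1, T, 0, 0], [0, 1, 0, 0], [0, 0, 1, T], [0, 0, 0, 1]]
H = [[1, 0, 0, 0], [0, 0, 1, 0]]
I4 = [[float(i == j) for j in range(4)] for i in range(4)]
Q = [[0.0] * 4 for _ in range(4)]     # process noise (placeholder)
R = [[0.01, 0.0], [0.0, 0.001]]       # measurement noise (placeholder)

def kalman_step(y_est, P, z):
    # Prediction (time update), eqs (4)-(5); B U_k is omitted as in the paper.
    y_pred = matmul(A, y_est)
    P_pred = add(matmul(matmul(A, P), transpose(A)), Q)
    # Correction (measurement update), eqs (6)-(8).
    S = add(matmul(matmul(H, P_pred), transpose(H)), R)
    K = matmul(matmul(P_pred, transpose(H)), inv2(S))
    resid = add(z, [[-r[0]] for r in matmul(H, y_pred)])  # Z_k - H Y-_k
    y_new = add(y_pred, matmul(K, resid))
    KH = matmul(K, H)
    P_new = matmul(add(I4, [[-v for v in row] for row in KH]), P_pred)
    return y_new, P_new

y = [[100.0], [-8.0], [0.10], [0.01]]  # column state [d, d', theta, theta']
P = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y, P = kalman_step(y, P, [[99.7], [0.102]])  # one observation [d, theta]
print(round(y[0][0], 3), P[0][0] < 1.0)
```

The updated range estimate lands between the prediction and the measurement, and the estimation error covariance shrinks, as expected from (7) and (8).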
The newly introduced vectors and matrices in (4) to (8) have the following forms:

Ŷ_k = [ŷ_11, ŷ_21, ŷ_31, ŷ_41]^T,   Ŷ⁻_k = [ŷ⁻_11, ŷ⁻_21, ŷ⁻_31, ŷ⁻_41]^T,

H =
| 1  0  0  0 |
| 0  0  1  0 |,

R =
| r_11  r_12 |
| r_21  r_22 |
=
| 10^{-6}   0          |
| 0         2.9×10^{-4} |,

Q =
| q_11  q_12  q_13  q_14 |
| q_21  q_22  q_23  q_24 |
| q_31  q_32  q_33  q_34 |
| q_41  q_42  q_43  q_44 |
=
| 0  0    0  0          |
| 0  330  0  0          |
| 0  0    0  0          |
| 0  0    0  1.3×10^{-8} |.  (9)
Here ŷ⁻_11 is the range prediction, ŷ⁻_21 is the speed prediction, ŷ⁻_31 is the azimuth angle prediction, and ŷ⁻_41 is the angular speed prediction; ŷ_11 is the range estimate, ŷ_21 the speed estimate, ŷ_31 the angle estimate, and lastly ŷ_41 the angular speed estimate, all for instant k.
Matrices K and P⁻_k have the following forms:

K =
| k_11  k_12 |
| k_21  k_22 |
| k_31  k_32 |
| k_41  k_42 |,

P⁻_k =
| p⁻_11  p⁻_12  p⁻_13  p⁻_14 |
| p⁻_21  p⁻_22  p⁻_23  p⁻_24 |
| p⁻_31  p⁻_32  p⁻_33  p⁻_34 |
| p⁻_41  p⁻_42  p⁻_43  p⁻_44 |.  (10)
[Figure 4: The Kalman filter. Seed values for Ŷ_{k-1} and P_{k-1} initialize the loop between Prediction (time update: state prediction, error covariance prediction) and Correction (measurement update: filter gain, state estimation, error covariance estimation).]
Matrix P_k is similar in form to P⁻_k, except for the superscript "−". The scan index k has been omitted from the
elements of these matrices and vectors for the sake of notational simplicity. The Kalman filter cycles through the prediction-correction loop shown pictorially in Figure 4. In the prediction step (also called the time update), the filter predicts the next state and the error covariance associated with the state prediction, using (4) and (5), respectively. In the correction step (also called the measurement update), it calculates the filter gain, estimates the current state, and computes the error covariance of the estimation, using (6) through (8), respectively.
Figure 5 shows the position of a target estimated by the Kalman filter against the true position and the observed position (measured by the radar). The efficacy of the filter can be appreciated by the fact that the estimated position follows the true position very closely, compared with the observed position, after the first 20 transitional iterations.
In the case of a system dedicated to tracking a single target, the estimated state given by the filter would be used to null the offset between the current pointing angle of the radar and the angle at which the target is currently situated. This operation would need a control loop and an actuator to correct the pointing angle of the radar. But since we are dealing with multiple targets at the same time, we have to identify which of the incoming observed states to associate with the predicted states to get the estimation for each target. This is the job of the data association function. The data association submodules are explained one by one in the following sections.
4.4. Gate Computation. The first step in data association is gate computation. The Gate Computation block receives the predicted states Ŷ⁻_k and the predicted error covariance P⁻_k from the Kalman filters for all the currently known targets. Using these two quantities, the Gate Computation block defines the probability gates which are used to verify whether an incoming observation can be associated with an existing target. The predicted states Ŷ⁻_k are located at the centers of the gates. The dimensions of the gates are proportional to the prediction error covariance P⁻_k. If the innovation Z_k − H Ŷ⁻_k (also called the residual) for an observation is greater than the gate dimensions, the observation fails the gate and hence
[Figure 5: Estimated target position. Plot of true, measured, and estimated distance (m) versus iteration, showing the estimate tracking the true position closely.]
it cannot be associated with the concerned prediction. If an observation passes a gate, then it may be associated with the prediction at the center of that gate. In fact, observations for more than one target may pass a particular gate; in such cases, all these observations are associated with that single prediction. The gating process may be viewed as the first level of screening out the unlikely prediction-observation associations. In the second level of screening, namely the assignment solver (discussed later in Section 4.7), a strictly one-to-one coupling is established between observations and predictions.
The gate computation model is summarized as follows. Define Ỹ to be the innovation or residual vector (Z_k − H Ŷ⁻_k). In general, for a track i, the residual vector is

Ỹ_i = Z_k − H Ŷ⁻_i.  (11)

Now define a rectangular region such that an observation vector Z_k (with elements z_kl) is said to satisfy the gate of a given track if all elements ỹ_il of the residual vector Ỹ_i satisfy the relationship

|z_kl − H ŷ⁻_il| = |ỹ_il| ≤ K_Gl σ_r.  (12)
In (11) and (12), i is the index of track i, G stands for gate, and l is replaced either by d or by θ, whichever is appropriate (see (17) and (18)). The term σ_r is the residual standard deviation and is defined in terms of the measurement variance σ²_z and the prediction variance σ²_{ŷ⁻_k}. A typical choice for the gating coefficient is K_Gl = 3.0. This large choice is typically made in order to compensate for the approximations involved in modeling the target dynamics through the Kalman filter covariance matrix [4]. This concept comes from the famous 3-sigma rule in statistics.
In its matrix form, for scan k and track i, (11) can be simplified down to

Ỹ_ik = [ỹ_ik11, ỹ_ik21]^T = [d_i − ŷ⁻_k11, θ_i − ŷ⁻_k31]^T.  (13)

Consequently, (12) gives

|[ỹ_ik11, ỹ_ik21]^T| ≤ K_Gl σ_r.  (14)
The residual standard deviations for the two state vector elements are defined as follows:

σ_rd = √(r_11 + p⁻_11),  (15)
σ_rθ = √(r_22 + p⁻_33).  (16)

From (14), (15), and (16), we get

|ỹ_ik11| = |ỹ_ikd| ≤ 3.0 √(r_11 + p⁻_11),  (17)
|ỹ_ik21| = |ỹ_ikθ| ≤ 3.0 √(r_22 + p⁻_33).  (18)
Equations (17) and (18) together put limits on the residuals ỹ_ikd and ỹ_ikθ. In other words, the difference between an incoming observation and the prediction for track i must comply with (17) and (18) for the observation to be assigned to track i. The Gate Checker subfunction, explained next, tests all the incoming observations for this compliance.
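The compliance test reduces to two scalar comparisons per observation. A sketch, where the arguments stand for the measurement variances and the prediction variances of the measured range and azimuth elements (the numeric values passed in are invented, not those of the cited example):

```python
import math

def passes_gate(obs, pred, r11, r22, p11, p33, k_g=3.0):
    """Gate test in the spirit of eqs (17)-(18): the range and azimuth
    residuals must both lie within k_g residual standard deviations.
    obs and pred are (d, theta) pairs; r11/r22 are measurement variances,
    p11/p33 prediction variances of the measured elements."""
    sigma_rd = math.sqrt(r11 + p11)       # residual std dev in range
    sigma_rtheta = math.sqrt(r22 + p33)   # residual std dev in azimuth
    res_d = abs(obs[0] - pred[0])
    res_theta = abs(obs[1] - pred[1])
    return res_d <= k_g * sigma_rd and res_theta <= k_g * sigma_rtheta

# Illustrative check: a close observation passes, a distant one fails.
print(passes_gate((100.4, 0.101), (100.0, 0.100), 0.01, 0.001, 0.03, 0.002))
```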
4.5. Gate Checker. The Gate Checker tests whether an incoming observation fulfills the conditions set in (17) and (18). Incoming observations are first considered by the Gate Checker for updating the states of the known targets. Gate checking determines which observation-to-prediction pairings are probable. At this stage, the pairings between the predictions and the observations are not done in a strictly one-to-one fashion. A single observation may be paired with several predictions, and vice versa, if (17) and (18) are complied with. In effect, the Gate Checker sets or resets the binary elements of an N × N matrix termed the Gate Mask matrix M, where N is the maximum number of targets to be tracked:
M =
| m_11  m_12  ...  m_1N |
| m_21  m_22  ...  m_2N |
| ...   ...   ...  ...  |
| m_N1  m_N2  ...  m_NN |

with rows indexed by observations and columns by predictions, and

m_ij = 1 if observation i obeys (17) and (18) for track j, and 0 otherwise.  (19)
If an observation i fulfills both conditions (17) and (18) for a prediction j, the corresponding element m_ij of matrix M is set to 1; otherwise it is reset to 0. Matrix M would typically have more than one '1' in a column or a row. The ultimate goal, for estimating the states of the targets, is to have only one '1' in each row and column, giving a one-to-one coupling of observations and predictions. To achieve this goal, the first step is to attach a cost to every possible coupling. This is done by the Cost Matrix Generator block, explained next.
4.6. Cost Matrix Generator. The Mask matrix is passed on to the Cost Matrix Generator, which attributes a cost to each pairing. The costs associated with all the pairings are put together in a matrix called the Cost Matrix C.

The cost c_{ij} for associating an observation i with a prediction j is the statistical distance d^2_{ij} between the observation and the prediction when m_{ij} is 1. The cost is an arbitrarily large number when m_{ij} is 0. The statistical distance d^2_{ij} is calculated as follows.
Define

S_{ij} = H P^-_k H^T + R.   (20)

Here i is the index for observation i and j is the index for prediction j in a scan; S_{ij} is the residual covariance matrix. The statistical distance d^2_{ij} is the norm of the residual vector,

d^2_{ij} = \tilde{Y}^T_{ij}\, S^{-1}_{ij}\, \tilde{Y}_{ij}.   (21)
C = \begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1N} \\
c_{21} & c_{22} & \cdots & c_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
c_{N1} & c_{N2} & \cdots & c_{NN}
\end{bmatrix}  (rows: observations, columns: predictions),

c_{ij} = \begin{cases}
\text{arbitrarily large number} & \text{if } m_{ij} \text{ is } 0, \\
d^2_{ij} & \text{if } m_{ij} \text{ is } 1.
\end{cases}   (22)
Equation (20) can be written in its matrix form and simplified down to

S_{ij} = \begin{bmatrix} p^-_{11} + r_{11} & p^-_{13} \\ p^-_{31} & p^-_{33} + r_{22} \end{bmatrix}.   (23)
Using (13), (21), and (23), d^2_{ij} is calculated as follows:

d^2_{ij} = \frac{ \begin{bmatrix} \tilde{y}_{ik11} & \tilde{y}_{ik21} \end{bmatrix}
                  \begin{bmatrix} p^-_{33} + r_{22} & -p^-_{13} \\ -p^-_{31} & p^-_{11} + r_{11} \end{bmatrix}
                  \begin{bmatrix} \tilde{y}_{ik11} \\ \tilde{y}_{ik21} \end{bmatrix} }
                { \left( p^-_{11} + r_{11} \right) \left( p^-_{33} + r_{22} \right) - p^-_{13}\, p^-_{31} }.   (24)
Recall here that \tilde{y}_{ik11} = \tilde{y}_{ikd} and \tilde{y}_{ik21} = \tilde{y}_{ik\theta}.
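Equations (23) and (24) reduce the general quadratic form (21) to a few scalar operations. A minimal C sketch of this computation follows; the variable names mirror the symbols in the text and are not the authors' code:

```c
#include <math.h>

/* Statistical distance d2 = Y^T S^-1 Y for the 2x2 case of (23)-(24).
   Arguments: residuals (yd, yth) and the entries of P^- and R used in (23). */
double statistical_distance(double yd, double yth,
                            double p11, double p13, double p31, double p33,
                            double r11, double r22)
{
    /* Residual covariance S from (23). */
    double s11 = p11 + r11;
    double s12 = p13;
    double s21 = p31;
    double s22 = p33 + r22;
    double det = s11 * s22 - s12 * s21;

    /* Expand Y^T S^-1 Y with the 2x2 matrix inverse, as in (24). */
    return (yd  * ( s22 * yd - s12 * yth)
          + yth * (-s21 * yd + s11 * yth)) / det;
}
```

For a diagonal S (p13 = p31 = 0) this collapses to the familiar sum of squared, variance-normalized residuals.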
The cost matrix demonstrates a conflict situation where several observations are candidates to be associated with a particular prediction and vice versa. A conflict situation is illustrated in Figure 6.

The three rectangles represent the gates constructed by the Gate Computation module. The predicted states are situated at the center of the gates. Certain parts of the three gates overlap one another. Some of the incoming observations would fall into these overlapping regions of the gates. In such cases all the predictions at the center of the concerned gates are eligible candidates for association with the observations falling in the overlapping regions.

[Figure 6: Conflict situation in data association — three overlapping gates with predictions at their centers and observations falling in the 2-way and 3-way overlap regions.]

The mask matrix M and the cost matrix C corresponding to this situation are shown below,
M = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix},
\qquad
C = \begin{bmatrix} \infty & d^2_{12} & \infty \\ d^2_{21} & \infty & \infty \\ d^2_{31} & d^2_{32} & d^2_{33} \end{bmatrix},   (25)

where rows correspond to observations, columns to predictions, and \infty denotes the arbitrarily large cost of a pairing rejected by the mask.
The prediction with the smallest statistical distance d^2_{ij} from the observation is the strongest candidate. To resolve these kinds of conflicts, the cost matrix is passed on to the Assignment Solver block, which treats it as the assignment problem [10, 12].

4.7. Assignment Solver. The Assignment Solver determines the finalized one-to-one pairing between predictions and observations. The pairings are made in a way that ensures minimum total cost for all the finalized pairings. The assignment problem is modeled as follows.
Given a cost matrix C with elements c_{ij}, find a matrix X = [x_{ij}] such that

\sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij}\, x_{ij} \text{ is minimized}   (26)

subject to

\sum_{i} x_{ij} = 1 \quad \forall j, \qquad \sum_{j} x_{ij} = 1 \quad \forall i.   (27)
Here x_{ij} is a binary variable used for ensuring that an observation is associated with one and only one prediction and a prediction is associated with one and only one observation. This requires x_{ij} to be either 0 or 1, that is, x_{ij} ∈ {0, 1}.

Matrix X can be found by using various algorithms. The most commonly used among them are the Munkres algorithm [12] and the Auction algorithm [10]. We use the former in our application due to its inherent modular structure.
Matrix X below shows a result of the Assignment Solver for a 4 × 4 cost matrix. It shows that observation 1 is to be paired with prediction 3, observation 2 with prediction 1, and so on:

X = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}  (rows: observations, columns: predictions).   (28)
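For intuition, the model (26)-(27) can be checked on small instances by brute force: every feasible X is a permutation of predictions, so enumerating all permutations and keeping the cheapest one yields the same pairing that Munkres finds in polynomial time. The C sketch below is a hypothetical illustration with a fixed 4 × 4 size, not the solver used in the paper:

```c
/* Brute-force solver for the assignment model (26)-(27), 4x4 only.
   Each full permutation is a feasible X; we keep the minimum-cost one.
   Munkres reaches the same answer in O(N^3); this is for illustration. */
#define BIG 1.0e9   /* stands in for the "arbitrarily large number" */

static void permute(double c[4][4], int pred[4], int used[4],
                    int row, double cost, double *best, int assign[4])
{
    if (row == 4) {                       /* one complete pairing built */
        if (cost < *best) {
            *best = cost;
            for (int i = 0; i < 4; i++) assign[i] = pred[i];
        }
        return;
    }
    for (int j = 0; j < 4; j++) {         /* try each unused prediction */
        if (used[j]) continue;
        used[j] = 1; pred[row] = j;
        permute(c, pred, used, row + 1, cost + c[row][j], best, assign);
        used[j] = 0;
    }
}

/* On return, assign[i] is the prediction paired with observation i (0-based). */
double solve_assignment(double c[4][4], int assign[4])
{
    int pred[4], used[4] = {0, 0, 0, 0};
    double best = 8.0 * BIG;
    permute(c, pred, used, 0, 0.0, &best, assign);
    return best;
}
```

Feeding it a cost matrix whose cheap entries sit at (1,3), (2,1), (3,4), (4,2) reproduces exactly the pairing shown in (28).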
The finalized observation-prediction pairs are passed on to the relevant Kalman filters to start a new cycle of the loop for estimating the current states of the targets, predicting their next states and the error covariances associated with these states.

All the steps of Sections 4.3 through 4.7 are repeated indefinitely in the loop in Figure 3. However, there are certain cases where some additional steps have to be taken too. Together these steps are called Track Maintenance. The Track Maintenance and the circumstances where it becomes relevant are explained in the next section.
4.8. Track Maintenance. The Track Maintenance block consists of three functions, namely the New Target Identifier, the Obs-less Gate Identifier and the Track Init/Del.

In real conditions there would be one or more targets detected in a scan which did not exist in the previous scans. On the other hand, there would be situations where one or more of the already known targets would no longer be in the radar range. In the first case we have to ensure that it is really a new target and not a false alarm. The New Target Identifier subblock takes care of such cases. In the latter case we have to ascertain that the target has really disappeared from the radar FOV. The Observation-less Gate Identifier subblock is responsible for dealing with such situations.

A new target is identified when its observation fails all the already established gates, that is, when all the elements of a row in the Gate Mask matrix M are zero. Such observations are candidates for initiating new tracks after confirmation. The confirmation strategies we use in our work are based on empirical results cited in [4]. In this work, 3 observations out of 5 scans for the same target initiate a new track. The New Target Identifier starts a counter for the newly identified target. If the counter reaches 3 in five scans, the target is confirmed and a new track is initiated for it. The counter is reset every five scans, thus effectively forming a sliding window.

The disappearance of a target means that no observations fall in the gate built around its predicted state. This is indicated when an entire column of the Mask matrix is filled with zeros. The tracks for such targets have to be deleted after confirmation of their disappearance. The disappearance is confirmed if the concerned gates go without an observation for 3 consecutive scans out of 5. The Obs-less Gate Identifier starts a counter when an empty gate is detected. If the counter reaches 3 in three consecutive scans out of 5, the disappearance of the target is confirmed and its track is deleted from the list. The counter is reset every five scans.

The Track Init/Del function prompts the system to initiate new tracks or to delete existing ones when needed.
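The 3-out-of-5 confirmation rule above can be sketched as a per-candidate counter; the structure and names below are illustrative assumptions, not the authors' code (the deletion rule for empty gates is symmetric):

```c
/* Sliding-window confirmation used by the New Target Identifier:
   a candidate is confirmed once 3 observations arrive within a
   5-scan window; the window counters reset every 5 scans. */
typedef struct {
    int hits;       /* observations seen in the current window    */
    int scans;      /* scans elapsed in the current window (0-4)  */
    int confirmed;  /* latched once the 3-of-5 rule is satisfied  */
} Candidate;

/* Call once per scan; 'seen' is 1 if an observation fell in the
   candidate's gate this scan.  Returns 1 once the track is confirmed. */
int update_candidate(Candidate *c, int seen)
{
    c->hits  += seen;
    c->scans += 1;
    if (c->hits >= 3)
        c->confirmed = 1;          /* 3 hits within the 5-scan window */
    if (c->scans == 5) {           /* sliding window: reset every 5 scans */
        c->scans = 0;
        c->hits  = 0;
    }
    return c->confirmed;
}
```

A hit pattern of 1,0,1,0,1 confirms on the fifth scan, while 1,0,0,0,1 never accumulates three hits inside one window and the candidate is discarded.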
5. Implementation Platform and the Tools

For the system under discussion we work with Altera's NiosII development kit, StratixII edition, as the implementation platform. The kit is built around Altera's StratixII EP2S60 FPGA.

5.1. Design Tools. The NiosII development kits are complemented with Altera's Embedded Design Suite (EDS). The EDS offers a user friendly interface for designing NiosII based multiprocessor systems. A library of ready-to-use peripherals and a customizable interconnect structure facilitate creating complex systems. The EDS also provides a comprehensive API for programming and debugging the system. The NiosII processor can easily be reinforced with custom hardware accelerators and/or custom instructions to improve its performance. The designer can choose from three different implementations of the NiosII processor and can add or remove features according to the requirements of the application.

The EDS consists of three tools, namely the QuartusII, the SOPC Builder and the NiosII IDE.

The system design starts with creating a QuartusII project. After creating a project, the user can invoke the SOPC Builder tool from within the QuartusII. The designer chooses processors, memory interfaces, peripherals, bus bridges, IP cores, interface cores, common microprocessor peripherals and other system components from the SOPC Builder IP library. The designer can add his/her own custom IP blocks and peripherals to the SOPC Builder component library. Using the SOPC Builder, the designer generates the Avalon switch fabric that contains all the decoders, arbiters, data paths, and timing logic necessary to bind the chosen processors, peripherals, memories, interfaces, and IP cores. Once the system integration is complete, RTL code is generated for the system. The generated RTL code is sent back into the QuartusII project directory, where it can be synthesized, placed and routed, and finally an FPGA can be configured with the system hardware.
After configuring the FPGA with NiosII based hardware, the next step is to develop and/or compile software applications for the processor(s) in the system. The NiosII IDE is used to manage the NiosII C/C++ application and system library or board support package (BSP) projects and makefiles. The C/C++ application contains the software application files developed by the user. The system library includes all the header files and drivers related to the system hardware components. The system library can be used to select project settings such as the choice of stdin, stdout, stderr devices, system clock timer, system time stamp timer, various memory locations, and so forth. Thus, using the system library, the designer can choose the optimum system configuration for an application.

[Figure 7: Munkres algorithm profile obtained through GProf — runtime (ms) for 10 iterations vs. number of targets (2-20), for the overall call to Munkres and its Steps 1-6.]

[Figure 8: Profile of the Munkres algorithm obtained through the Performance Counter — runtime (ms) for 10 iterations vs. number of targets (2-20), for the overall call to Munkres and its Steps 1-6.]
5.2. Application Profiling and the Profiling Tools. A NiosII application can be profiled in several ways, the most popular among them being the use of the GProf profiler tool and the Performance Counter peripheral.

5.2.1. GProf. The GProf profiler tool, called nios2-elf-gprof, can be used without making any hardware changes to the NiosII system. This tool directs the compiler to add calls to the profiler library functions into the application code.

The profiler provides an overview of the run-time behavior of the entire system and also reveals the dependencies among application modules. However, adding instructions to each function call for use by the GNU profiler affects the code's behavior in numerous ways. Each function becomes larger because of the additional function calls to collect profiling information. Collecting the profiling information increases the entry and exit time of each function. The profiling data is a sampling of the program counter taken at the resolution of the system timer tick. Therefore, it provides an estimation, not an exact representation, of the processor time spent in different functions [14].

5.2.2. Performance Counter. A performance counter peripheral is a block of counters in hardware that measure the execution time taken by user-specified sections of the application code. It can monitor as many as seven code sections. A pair of counters tracks each code section. A 64-bit time counter counts the number of clock ticks during which the code in the section is running, while a 32-bit event counter counts the number of times the code section runs. These counters accurately measure the execution time taken by designated sections of the C/C++ code. Simple, efficient and minimally intrusive macros are used to mark the start and end of the blocks of interest (the measured code sections) in the program [14].

Figure 7 shows the Munkres algorithm's profile obtained through GProf. The algorithm was executed on NiosII/s with a 100 MHz clock and a 4 KB instruction cache. The call to Munkres represents the processor time of the overall algorithm for up to 20 obstacles. Step 1 through Step 6 represent the behavior of the individual subfunctions which constitute the algorithm.

Figure 8 shows the profile of the same algorithm obtained through the performance counter for the same processor configuration. Clearly the two profiles have exactly the same form. The difference is that while GProf estimates that for 20 obstacles the algorithm takes around 4500 ms to find a solution, the performance counter calculates the execution time to be around 1500 ms. This huge difference is due to the overhead added by GProf when it calls its own library functions for profiling the code.

We profiled the application with both tools. GProf was used for identifying the dependencies and the performance counter for precisely measuring the latencies. All the performances cited in the rest of the paper are those obtained by using the performance counter.
6. System Architecture

We coded our application in ANSI C, following generally accepted efficient coding practices and using the -O3 compilation option. Before deciding to allocate processing resources to the application modules, we profiled the application to know the latencies, resource requirements and dependencies among the modules. Guided by the profiling results, we distributed the application over different processors as distinct functions communicating in a producer-consumer fashion, as shown in Figure 9. Similar considerations have been proposed in [2, 15, 16].

The proposed multiprocessor architecture includes different implementations of the NiosII processor and various peripherals as system building blocks.
[Figure 9: The proposed MPSoC architecture — twenty Kalman filter processors (#1-#20), a Gating Module processor (#21), an Assignment Solver processor (#22) and a Track Maintenance processor (#23), each with I-cache, D-cache and local on-chip memory; the processors are linked by FIFOs along the radar-to-gating, gating-to-KF, KF-to-assignment-solver and assignment-solver-to-track-maintenance paths, and share an off-chip memory over a shared memory interconnect.]
The NiosII is a configurable soft-core RISC processor that supports adding or removing features on a system-by-system basis to meet performance or cost goals. A NiosII based system consists of NiosII processor core(s), a set of on-chip peripherals, on-chip memory and interfaces to off-chip memory and peripherals, all implemented on a single FPGA device. Because NiosII processor systems are configurable, the memories and peripherals can vary from system to system.

The architecture hides the hardware details from the programmer, so programmers can develop NiosII applications without specific knowledge of the hardware implementation.

The NiosII architecture uses separate instruction and data buses, classifying it as a Harvard architecture. Both the instruction and data buses are implemented as Avalon-MM master ports that adhere to the Avalon-MM interface specification. The data master port connects to both memory and peripheral components, while the instruction master port connects only to memory components.
The Kalman filter, as mentioned earlier, is a recursive algorithm looping around prediction and correction steps. Both these steps involve matrix operations on floating point numbers. These operations demand heavy processing resources to complete in a timely way. This makes the filter a strong candidate for mapping onto a separate processor. Thus, for tracking 20 targets at a time, we need 20 identical processors executing Kalman filters.

The Gate Computation block regularly passes information to the Gate Checker which, in turn, is in constant communication with the Cost Matrix Generator. In view of these dependencies, we group these three blocks together, collectively call them the Gating Module, and map them onto a single processor to minimize interprocessor communication. Interprocessor communication would have required additional logic and would have added to the complexity of the system. Avoiding unnecessary interprocessor communication is also desirable for reducing power consumption.

The Assignment Solver is an algorithm consisting of six distinct iterative steps [12]. Looping through these steps demands a long execution time. Moreover, these steps have dependencies among them. Hence the Assignment Solver has to be kept together and cannot be combined with any of the other functions. So we allocated a separate processor to the Assignment Solver.

The three blocks of the Track Maintenance subfunction individually don't demand heavy computational resources, so we group them together for mapping onto a processor.
As can be seen in Figure 9, every processor has an I-cache, a D-cache and a local memory. Since the execution times of the individual functions, and their latencies to access a large shared memory, are not identical, exclusive dependence on a common system bus would become a bottleneck. Additionally, since the communication between the various functions is of a producer-consumer nature, complicated synchronization and arbitration protocols are not necessary. Hence we chose to have a small local memory for every processor and a large off-chip memory device as shared memory for non-critical sections of the application modules. As a result the individual processors have lower latencies for accessing their local memories containing the performance critical code. In Sections 7.2 and 7.4 we will demonstrate how to systematically determine the optimal sizes of these caches and the local memories.

Every processor communicates with its neighboring processors through buffers. These buffers are dual-port FIFOs with handshaking signals indicating when the buffers are full or empty, hence regulating the data transfer between the processors. This arrangement forms a system level pipeline among the processors. At the lower level, the processors themselves have a pipelined architecture (refer to Table 1). Thus the advantages of pipelined processing are exploited both at the system level and at the processor level. An additional advantage of this arrangement is that changes made to the functions running on different processors do not have any drastic effects on the overall system behavior as long as the interfaces remain unchanged. The buffers are flushed when they are full and the data transfer resumes after mutual consent of the concerned processors. The loss of information during this procedure does not affect the accuracy because the data sampling frequency, as set by the radar PRT, is high enough to compensate for this minor loss.
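The full/empty handshake described above amounts to head/tail comparisons on a ring buffer. The following single-producer/single-consumer sketch in C models the behavior; the depth and names are illustrative, and the real FIFOs are of course hardware blocks, not software:

```c
/* Minimal single-producer/single-consumer ring buffer modelling the
   dual-port FIFOs between processors.  Depth must be a power of two so
   the unsigned head/tail counters wrap consistently. */
#define FIFO_DEPTH 8

typedef struct {
    double buf[FIFO_DEPTH];
    unsigned head, tail;   /* head: next write slot, tail: next read slot */
} Fifo;

static int fifo_full(const Fifo *f)  { return f->head - f->tail == FIFO_DEPTH; }
static int fifo_empty(const Fifo *f) { return f->head == f->tail; }

int fifo_push(Fifo *f, double v)     /* producer side */
{
    if (fifo_full(f)) return 0;      /* handshake: producer must wait */
    f->buf[f->head++ % FIFO_DEPTH] = v;
    return 1;
}

int fifo_pop(Fifo *f, double *v)     /* consumer side */
{
    if (fifo_empty(f)) return 0;     /* handshake: consumer must wait */
    *v = f->buf[f->tail++ % FIFO_DEPTH];
    return 1;
}
```

A push into a full FIFO and a pop from an empty one both fail, which is exactly the condition the hardware handshake signals report to the producing and consuming processors.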
Access to the I/O devices is memory-mapped. Both data memory and peripherals are mapped into the address space of the data master port of the NiosII processors. The NiosII processor uses the Avalon switch fabric as the interface to its embedded peripherals. The switch fabric may be viewed as a partial cross-bar where masters and slaves are interconnected only if they communicate. The Avalon switch fabric, with its slave-side arbitration scheme, enables multiple masters to operate simultaneously [17]. The slave-side arbitration scheme minimizes the congestion problems characterizing the traditional bus.

In the traditional bus architectures, one or more bus masters and bus slaves connect to a shared bus. A single arbiter controls the bus, so that multiple bus masters do not simultaneously drive the bus. Each bus master requests the arbiter for control of the bus and the arbiter grants access to a single master at a time. Once a master has control of the bus, the master performs transfers with any bus slave. When multiple masters attempt to access the bus at the same time, the arbiter allocates the bus resources to a single master, forcing all other masters to wait.

The Avalon system interconnect fabric uses a multimaster architecture with slave-side arbitration. Multiple masters can be active at the same time, simultaneously transferring data to independent slaves. Arbitration is performed at the slave, where the arbiter decides which master gains access to the slave only if several masters initiate a transfer to the same slave in the same cycle. The arbiter logic multiplexes all address, data, and control signals from masters to a shared slave. The arbiter logic evaluates the address and control signals from each master and determines which master, if any, gains access to the slave next. If the slave is not shared, it is always available to its master and hence multiple masters can simultaneously communicate with their independent slaves without going through the arbiter.
6.1. System Software. When processors are used in a system, the use of system software or an operating system is inevitable. Many NiosII systems have simple requirements where a minimal operating system or small footprint system software, such as Altera's Hardware Abstraction Layer (HAL) or a third party real-time operating system, is sufficient. We use the former because the available third party real-time operating systems have large memory footprints, while one of our objectives is to minimize the memory requirements.

The HAL is a lightweight runtime environment that provides a simple device driver interface for programs to communicate with the underlying hardware. The HAL application programming interface (API) is integrated with the ANSI C standard library. The API facilitates access to devices and files using familiar C library functions.

The HAL device driver abstraction provides a clear distinction between application and device driver software. This driver abstraction promotes reusable application code that is independent of the underlying hardware. Changes in the hardware configuration automatically propagate to the HAL device driver configuration, preventing changes in the underlying hardware from creating bugs. In addition, the HAL standard makes it straightforward to write drivers for new hardware peripherals that are consistent with existing peripheral drivers [17].
6.2. Constraints. The main constraints that we have to comply with are as follows.

We need the overall response time of the system to be less than the radar PRT, which is 25 ms. This means that the slowest application module must have less than 25 ms of response time. Hence the first objective is to meet this deadline.

The FPGA (StratixII EP2S60) we are using for this system contains a total of 318 KB of configurable on-chip memory. This memory has to make up the processors' instruction and data caches, their internal registers, peripheral port buffers and locally connected dedicated RAM or ROM. Thus the second constraint is that the total on-chip memory utilization must not exceed this limit. We can use off-chip memory devices, but they are not only very slow in comparison to the on-chip memory but also have to be shared among the processors. Controlling access to shared memory needs arbitration circuitry, which adds to the complexity of the system and further increases the access time. On the other hand, we cannot totally eliminate the off-chip memory, for the reasons stated above. In fact we must balance our reliance on the off-chip and on-chip memory in such a way that neither the on-chip memory requirements exceed the available amount of memory nor the system becomes too slow to cope with the time constraints.

Another constraint is the amount of logic utilization. We must choose our hardware components carefully to minimize the use of the programmable logic on the FPGA. Excessive use of programmable logic not only complicates the design and consumes the FPGA resources but also increases power consumption. For these reasons we optimize the hardware features of the individual processors and leave out certain options when they are not absolutely essential for meeting the time constraints.

7. Optimization Strategies

To meet the constraints discussed above, we plan our optimization strategies as follows.
Table 1: Different NiosII implementations and their features.

                                      Nios II/f      Nios II/s      Nios II/e
                                      (Fast)         (Standard)     (Economy)
  Pipeline                            6 stages       5 stages       None
  HW multiplier and barrel shifter    1 cycle        3 cycles       Emulated in software
  Branch prediction                   Dynamic        Static         None
  Instr. cache                        Configurable   Configurable   None
  Data cache                          Configurable   None           None
  Logic elements                      1400-1800      1200-1400      600-700
(i) Select the appropriate processor type for each module to execute it in the most efficient way.

(ii) Identify the optimum cache configuration for each module and customize the concerned processor accordingly.

(iii) Explore the needs for custom instruction hardware for each module and implement the hardware where necessary.

(iv) Identify the performance critical sections in each module and map them onto the fast on-chip memory to improve the performance while keeping the on-chip memory requirements as low as possible.

(v) Look for redundancies in the code and remove them to improve the performance.

In the following sections we explain these strategies one by one.
7.1. Choice of NiosII Implementations. The NiosII processor comes in three customizable implementations. These implementations differ in the FPGA resources they require and their speeds. NiosII/e is the slowest and consumes the least amount of logic resources, while NiosII/f is the fastest and consumes the most logic resources. NiosII/s falls in between NiosII/e and NiosII/f with respect to logic resource requirements and speed.

Table 1 shows the salient features of the three implementations of the NiosII processor.

Note here that code written for one implementation of the processor will run on either of the other two, with a different execution speed. Hence changing from one processor implementation to another requires no modifications to the software code.

The choice of the right processor implementation depends on the speed requirements of a particular application module and the availability of sufficient FPGA logic resources. Optimization of the architecture trades off speed for resource saving or vice versa, depending on the requirements of the application.

A second criterion for selecting a particular implementation of the NiosII processor is the need (or lack thereof) for instruction and data cache. For example, if we can achieve the required performance for a module without any cache, the NiosII/e would be the right choice for running that module.
[Figure 10: Influence of I-cache and D-cache sizes on the Kalman filter with NiosII/f at 100 MHz — runtime (s) vs. D-cache size (0-64 KB) for I-cache sizes of 4, 8, 16, 32 and 64 KB.]
[Figure 11: Kalman filter performance with 4 KB I-cache and all memory sections off-chip — runtime (ms) for the overall Kalman filter and its matrix add, multiply, subtract, transpose and inverse kernels, with and without floating-point custom instructions.]
On the other hand, if a certain application module needs instruction and data cache to achieve a desired performance, NiosII/f would be chosen to run it. If only an instruction cache can enable the processor to run an application module with the desired performance, then we shall use NiosII/s for that module. The objective is to achieve the desired speed with the least possible amount of hardware.

7.2. I-Cache and D-Cache. The NiosII architecture supports cache memories on both the instruction master port (instruction cache) and the data master port (data cache). Cache memory resides on-chip as an integral part of the NiosII processor core. The cache memories can improve the average memory access time for NiosII processor systems that use slow off-chip memory such as SDRAM for program and data storage.
The cache memories are optional. The need for higher memory performance (and, by association, the need for cache memory) is application dependent.

[Figure 12: Influence of I-cache and D-cache sizes on the Gating Module with NiosII/f at 100 MHz — runtime (s) vs. D-cache size (0-64 KB) for I-cache sizes of 4, 8, 16, 32 and 64 KB.]

Many applications
require the smallest possible processor core, and can trade off performance for size. A NiosII processor core might include one, both, or neither of the cache memories. Furthermore, for cores that provide data and/or instruction cache, the sizes of the cache memories are user-configurable. The inclusion of cache memory does not affect the functionality of programs, but it does affect the speed at which the processor fetches instructions and reads/writes data.

Optimal cache configuration is application specific. For example, if a NiosII processor system includes only fast, on-chip memory (i.e., it never accesses the slow off-chip memory), an instruction or data cache is unlikely to offer any performance gain. As another example, if the critical loop of a program is 2 KB but the size of the instruction cache is 1 KB, this instruction cache will not improve execution speed. In fact, an instruction cache may degrade performance in this situation [17]. We must determine the optimum instruction and data cache sizes that are necessary for achieving the desired performance for each module.

Both the instruction and data cache sizes for NiosII/f can range from 0 KB to 64 KB in discrete steps of 0 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, and 64 KB. We experimented with various combinations of I-cache and D-cache sizes to determine the optimum cache sizes for each module. In the following sections we discuss the outcome of these experiments and the guidance that we took from them.
7.2.1. Kalman Filter Cache Requirements. Using the perfor-
mance counter with a NiosII/F processor, we measured the
performance of the Kalman lter with dierent instruction
and cache sizes. Figure 10 shows the inuence of I-cache
and D-cache sizes on the processor time of the Kalman lter
running on NiosII/f with 100 MHz clock using the o-chip
RAM.
Two very important conclusions can be drawn from
this gure. One, whatever the I-cache or D-cache size,
the processor time does not exceed 15 ms. Two, beyond
16 KB I-cache and 2 KB D-cache, the execution time is
mostly independent of the D-cache size. Based on these
observations, we can say that 16 KB is the optimum I-cache
size for the processors executing the Kalman lters. However,
as mentioned earlier, for tracking a maximum of 20 obstacles
we need 20 of these processors. Viewed in isolation, 16 KB
may not seem a large amount of memory but replicating it 20
times is practically not possible. To nd out the total amount
of memory required by this conguration, we compiled a
QuartusII project with a NiosII/f having 16 KB I-cache. The
total on-chip block memory used by a single processor,
accounted for 7% of the memory available on our FPGA
(StratixII EP2S60). Besides, we have to keep in mind that
the other processors in the system also have on-chip memory
requirements. Consequently we have to settle for a smaller I-
cache and hence lower speed to avoid this prohibitive on-chip
memory usage.
The good news here is that even with a 4 KB I-cache and no D-cache, the processor time is below the 25 ms threshold. Thus an I-cache of 4 KB is the right choice for the Kalman filters in these circumstances. Furthermore, since we do not use a D-cache, replacing NiosII/f by NiosII/s helps reduce the logic size from the 1400-1800 LEs range down to the 1200-1400 range, a sizeable gain considering the 20 processors for the filters.
Figure 11 shows the performance of the Kalman filter on NiosII/s with a 4 KB I-cache, no D-cache, a 100 MHz clock, and off-chip memory exclusively. Even with all memory sections kept in the off-chip device and no floating point custom instructions, the runtime is around 15 ms. Thus we can conserve the scarce on-chip memory by using only 4 KB of I-cache for the processors running Kalman filters without slowing the system beyond tolerable limits. The on-chip block memory usage in this case drops to only 3% of that exploitable on the FPGA, a drop of more than 50%. For 20 Kalman processors the total on-chip memory usage is 60% of that available to the user.
7.2.2. Gating Module Cache Requirements. The gating module's behavior with respect to the I-cache and D-cache is shown in Figure 12. A remarkable speedup is observed when the I-cache size changes from 4 KB to 8 KB, and again when it changes from 8 KB to 16 KB. Beyond 16 KB the speedup is insignificant.
The D-cache size does not matter much as long as it is more than zero. The overall processor runtime is at its minimum (70 ms) when the I-cache size is 16 KB and the D-cache size is 2 KB. Therefore, the right I-cache and D-cache sizes for the Gating module are 16 KB and 2 KB, respectively. The total on-chip memory usage for the processor with this configuration is 8% of that available on the FPGA. This also includes the memory used by the internal registers of the processor.
Using these cache sizes we charted the performance of the processor while varying the number of obstacles
Figure 13: Gating Module performance with 16 KB I-cache and 2 KB D-cache (NiosII/f, 100 MHz, off-chip SDRAM; time in ms versus number of obstacles for Cost Mat Gen, Gate Checker, Gate Mask Generator, Innov d calculator, and Innov a calculator).
from 2 through 20, as shown in Figure 13. The Innov d and Innov a calculators are two subroutines used by the Gate Mask Generator function to calculate the distance and angle innovations. The sum of the times taken by these two subroutines is roughly equal to the time taken by the Gate Mask Generator. The Gate Checker and the Gate Mask Generator functions are in turn called by Cost Mat Gen, which is the top level function of the Gating module. Cost Mat Gen represents the overall behavior of the whole Gating Module.
Although the overall runtime for 20 obstacles is at its minimum (70 ms) for the given configuration, it is still much higher than the 25 ms we are aiming for. In Sections 7.3.2 and 7.4.2 we discuss the techniques employed for further improving this execution time.
7.2.3. Munkres Algorithm Cache Requirements. Using a cost matrix with floating point elements and a range of instruction and data cache sizes, Munkres algorithm showed the behavior depicted in Figure 14.
The first observation here is that when the D-cache size is more than zero, the runtime decreases profoundly, whatever the I-cache size. Looking closely at the figure, we can eliminate 4 KB from the list of candidates for the I-cache size. An 8 KB I-cache along with a 16 KB D-cache results in the minimum execution time, that is, 71.07 ms. Hence this is the optimum I-cache/D-cache combination for this module. A NiosII system with these cache sizes uses 9% of the on-chip block memory available on the FPGA.
Figure 15 shows the performance of the algorithm using this system composition for the number of obstacles ranging from 2 to 20.
We notice here that the two main contributors to the total runtime are Step 4 and Step 6. This is because these two functions contain nested loops and are invoked multiple times during the solution finding process.
Figure 14: Cache performance for Munkres algorithm (NiosII/f, 100 MHz, 20 obstacles; time in seconds versus D-cache size for I-cache sizes of 4, 8, 16, 32, and 64 KB).
Figure 15: Munkres algorithm performance with 8 KB I-cache and 16 KB D-cache (NiosII/f, 100 MHz, off-chip SDRAM; time in ms versus number of obstacles for the call to Munkres and Steps 1 through 6).
The overall runtime for 20 obstacles is 71 ms, which is higher than the 25 ms bound. We need to further optimize the processor to decrease this runtime. In Sections 7.3.3, 7.4.3, and 7.5 we explain the steps taken to achieve this goal.
7.3. Floating Point Custom Instructions. The floating-point custom instructions, optionally available on the NiosII processor, implement single precision floating-point arithmetic operations in hardware. They accelerate floating-point operations in NiosII C/C++ applications. The basic set of floating-point custom instructions includes single precision floating-point addition, subtraction, and multiplication. Floating-point division is available as an extension to the basic instruction set.
The NiosII software development tools recognize C code that takes advantage of the floating-point instructions present in the processor core. When the floating-point custom instructions are present in the target hardware, the NiosII compiler generates code to use the custom instructions for floating-point operations, including addition, subtraction, multiplication, division, and the newlib math library [14].
The best choice for a hardware design depends on a balance among floating-point usage, hardware resource usage, and system performance. While the floating-point custom instructions speed up floating-point arithmetic, they add substantially to the size of the hardware design. When resource usage is an issue, it is advisable to rework the algorithms to minimize floating-point arithmetic (see Section 7.5).
We used the floating point custom instructions in the processors to assess the tradeoffs between performance and hardware size for each processor. Sections 7.3.1, 7.3.2, and 7.3.3 examine the outcome of this assessment and the recommendations based thereon.
7.3.1. Kalman Filter and Floating Point Custom Instructions. As mentioned earlier, the Kalman filter's runtime never exceeds 15 ms, so there is no need at the moment to accelerate it further at the cost of precious FPGA resources. Nevertheless, we tested the floating point custom instructions' impact on the Kalman filter's performance to better understand the trade-offs and to explore opportunities for eventual future optimization. Figure 11 shows the results of these tests.
An overall speedup of more than 50% is achieved in comparison to the scenario where no floating point custom instructions are used. The most significant improvement is witnessed in the case of the Mat Mul subfunction. This improvement can be attributed to two factors: first, Mat Mul relies heavily on floating point multiplication, and second, it is called 11 times in a single iteration of the filter algorithm. Floating point custom instructions are most effective in such situations, hence this remarkable improvement. The speedup comes at the cost of bulkier hardware: the hardware size increases by 8% when floating point custom instructions are used. We stick to our earlier decision of using a regular NiosII/s with 4 KB I-cache and no other add-ons for the Kalman filter in the present work. Since the use of floating point instructions reduces the execution time for the Kalman filter considerably, in our future work we will take this option to process more than one target per processor.
7.3.2. Gating Module and Floating Point Custom Instructions. In the case of the Gating Module, the use of floating point custom instructions is a necessity rather than an option. The reason is that even with the optimum cache size selection, the Gating Module takes 70 ms to execute. Moreover, the Gating module runs on only one processor, so we don't have to replicate the floating point custom instructions hardware. Figure 16 shows the performance of the Gating Module after the floating point custom instructions are added to the processor.
Floating point custom instructions with the NiosII processor improve the overall performance of the Gating Module by approximately 50%. If we compare Figure 16 with Figure 13, we notice two interesting differences between the two figures. The first and very obvious difference is the drop from 70 ms to 37 ms of the overall runtime for 20 obstacles.
The second difference is that the curve for the Gate Checker, which was earlier above Innov a and Innov d, is now below them. This change in behavior is due to the fact that, in addition to floating point multiplication and division, the Gate Checker uses the sqrt() function of the ANSI C math library. The sqrt() function itself relies on multiply and divide operations internally. Hence the floating point custom instructions improve the performance of the Gate Checker more than that of Innov a and Innov d, which do not use the sqrt() function.
Although by using the floating point custom instructions we managed to bring the execution time of the Gating Module from 70 ms down to 37 ms, we are still above our 25 ms target. In Section 7.4.2, we explore other possibilities for improving the performance of the Gating Module even further.
7.3.3. Munkres Algorithm and Floating Point Custom Instructions. Floating point custom instructions bring Munkres algorithm's execution time from 71 ms down to 47 ms for 20 obstacles, as shown in Figure 17.
Although this is a 33.8% improvement, 47 ms is still almost twice the time we aim to attain, that is, 25 ms. This motivates us to look for ways to decrease this time further. To arrive at this goal, we employ several techniques as explained in Sections 7.4.3 and 7.5.
7.4. On-Chip versus Off-Chip Memory Sections. The HAL-based systems are linked using an automatically generated linker script that is created and managed by the NiosII IDE. The linker script controls the mapping of the code and the data within the available memory sections. It creates standard code and data sections (.text, .data, and .bss), plus a section for every physical memory device in the system.
Typically, the .text section is reserved for the program instructions. The .data section is the part of the object module that contains initialized static data, for example, initialized static variables, string constants, and so forth. The .bss (Block Started by Symbol) section defines the space for uninitialized static data. The heap section is used for dynamic memory allocation, for example, when malloc() or new() is used in C or C++ code, respectively. The stack section is used for holding the return addresses (program counter) when function calls occur.
In general, the NiosII design flow automatically specifies a sensible default partitioning. However, we may wish to change the partitioning in certain situations. For example, to improve performance, we can place performance-critical
Figure 16: Gating Module performance with 16 KB I-cache, 2 KB D-cache, and floating point custom instructions (NiosII/f, 100 MHz; time in ms versus number of obstacles for Cost Mat Gen, Gate Checker, Gate Mask Generator, Innov d calculator, and Innov a calculator).
Figure 17: Munkres algorithm performance with 8 KB I-cache, 16 KB D-cache, and floating point custom instructions (NiosII/f, 100 MHz, off-chip SDRAM; time in ms versus number of obstacles for the call to Munkres and Steps 1 through 6).
code and data in the fast on-chip RAM. In these cases, we have to allocate the memory sections manually.
We can control the placement of the .text, .data, heap, and stack memory partitions by altering the NiosII system library or BSP settings. By default, the heap and the stack are placed in the same memory partition as the .rwdata section. We can place any of the memory sections in the on-chip RAM if needed, to achieve the desired performance.
Table 2: Memory requirements of various application modules.

Module / section                        Memory footprint
Kalman filter
  Whole code + initialized data         81 KB
  .text section alone                   69.6 KB
  .data section alone                   10.44 KB
  stack section alone                   approximately 2 KB
  heap section alone                    approximately 1 KB
Gating module
  Whole code + initialized data         63 KB
  .text section alone                   51.81 KB
  .data section alone                   8.61 KB
  stack section alone                   approximately 2 KB
  heap section alone                    approximately 1 KB
Munkres algorithm
  Whole code + initialized data         62 KB
  .text section alone                   52.34 KB
  .data section alone                   10.44 KB
  stack section alone                   approximately 2 KB
  heap section alone                    approximately 1 KB

Ideally we would put all the memory sections in the fast on-chip memory, but the amount of on-chip memory in the FPGA is limited. Hence we have to rely greatly on the
off-chip SDRAM or SSRAM. However, accessing the off-chip memory is inherently far slower than the on-chip memory. Moreover, different processors would have to go through the arbitration logic to access the shared off-chip memory device, which would increase the memory access time even further. Consequently, on the one hand, we cannot use the off-chip memory exclusively, since it would slow the system down beyond acceptable limits. On the other hand, we have to minimize our dependence on on-chip memory for each processor due to its scarcity. We therefore have to balance our reliance on dedicated on-chip memory and the shared off-chip memory without compromising the performance too much.
Compiling the application modules in the NiosII IDE gives us an estimate of the memory needs of these modules. We selected the appropriate compiler options to generate compact object code for each module. Table 2 summarizes the memory requirements of all the application modules.
We can see in this table that the memory requirements of the whole code and the .text sections for all the modules are too high to be accommodated in the on-chip memory. However, if a certain module uses malloc() or new() abundantly, placing the heap section in the on-chip memory can improve its speed by a large margin. Similarly, if a module makes frequent calls to other functions, putting the stack section in the on-chip memory can help achieve a higher execution speed for that module.
We performed experiments by placing the memory sections for the different modules in the off-chip and the on-chip memories and observed some interesting results. These results are discussed in the following sections.
7.4.1. Kalman Filter and Memory Sections. Although the Kalman filter takes 15 ms with only a 4 KB I-cache and no further optimization, we investigated the prospects of improving it further through on-chip memory placement. The outcome of this investigation is summarized in Figure 18. As before, Kalman represents the overall algorithm and the other bars are its constituent subfunctions. These results are obtained with a 4 KB I-cache and floating point instructions.
Even with all memory sections in the off-chip device, the runtime is 6.35 ms, while moving only the stack section to the on-chip memory reduces this time by more than 50%. Since the stack section requires only 2 KB, we can bring the time down to 3.2 ms by connecting 2 KB of dedicated on-chip memory to the processors for the stack and using a NiosII/s with a 4 KB I-cache and floating point custom instructions. One of our experiments showed that if we use a NiosII/f with 16 KB I-cache, 2 KB D-cache, and all the other optimizations implemented, we can reduce the processing time for the filter to 1 ms. This opens up a new avenue for our future work, where we shall route several targets into one Kalman filter to reduce the number of processors for the filters. With this arrangement, we have two options. We can reduce the number of processors for the filters from 20 to 2, thereby losing some of the flexibility. Alternatively, we can run all the 20 filters on separate processors and guarantee the flexibility of being able to switch to other types of filters instead of the Kalman filter depending on the target characteristics. At the moment we are using the latter option.
7.4.2. Gating Module and Memory Sections. The Gating Module's performance for different memory placement experiments is shown in Figure 19. Here we can deduce that a minimum execution time of 22 ms for 20 obstacles can be achieved by keeping all the memory sections in the on-chip memory. But this would require the on-chip RAM to be at least 61 KB, which, combined with the I-cache and D-cache, adds up to 79 KB. Obviously this is a very high requirement considering the limited amount of on-chip memory. The next best solution of 23 ms is obtained when we place the stack and the heap sections in the on-chip memory. There is a very small speed loss, but in this case only 3 KB of dedicated on-chip memory is sufficient to get this speedup. Clearly this is a considerable saving in on-chip memory compared to the earlier requirement of 79 KB. So the Gating Module can operate satisfactorily using a NiosII/f processor with 16 KB I-cache, 2 KB D-cache, 3 KB dedicated on-chip RAM, and floating point custom instructions.
The Innov d and Innov a calculators together account for more than half the execution time taken by the gating module (cf. Figure 16). Executing these two on separate processors in parallel will pave the way for scaling the system to more than 20 targets. As an alternative scaling solution we are currently experimenting with DSP-VLIW processors to exploit data level parallelism and hardware accelerators, since after the optimizations we have enough space available for adding more circuitry (see Section 10).
7.4.3. Munkres Algorithm and Memory Sections. Placing various memory sections on chip does not have a noteworthy influence on the Munkres algorithm's performance, although there was some improvement, as shown in Figure 20. We gain only 6 ms if all the memory sections are put on chip. The next best gain is achieved by putting the heap on chip; this is due to the use of a few malloc() statements in the code. While neither of these gains is enough to reduce the execution time below 25 ms, the former is not even feasible given the memory footprint of the algorithm. We have to look elsewhere for a practicable solution. The following section explains our approach to this issue.
7.5. Floating Point versus Integer Cost Matrix for Munkres Algorithm. Munkres algorithm operates on the cost matrix iteratively to find an optimum solution. It looks for the minimum value in every column and row of the cost matrix such that only one value in a row and a column is selected. It arrives at a solution when the sum of the selected elements of the cost matrix reaches its minimum. This procedure remains the same whether the elements of the cost matrix are floating point numbers or integers. We found that if we truncate the fractional part of the floating point elements of the cost matrix, the final solution is the same as in the case of the floating point cost matrix. Hence we can replace the floating point cost matrix by a representative integer cost matrix without sacrificing the accuracy of the final solution. This does not require all the elements of the cost matrix to be distinct; the algorithm still finds a unique solution even if all the elements of the cost matrix have the same numerical value. The advantage of this manipulation, however, is that with the integer cost matrix the mathematical operations become simpler and faster, reducing the runtime of the algorithm by a large margin. Additionally, using an integer cost matrix obviates the need for the floating point custom instruction hardware. Consequently the size of the processor is reduced by 8%.
We made the necessary modifications to the Munkres algorithm and the cost matrix generating function to incorporate this rearrangement. A glimpse of the advantage of these transformations can be seen in Figure 21, which shows the optimal cache configuration for the integer version of the Munkres algorithm.
While 8 KB I-cache and 16 KB D-cache are still the best choices, the point worth noting here is that with the integer cost matrix, the runtime for the overall algorithm drops to 24 ms as opposed to the 82 ms with the floating point cost matrix (refer to Figure 14). This drop takes place while all the memory sections are placed in the off-chip SDRAM.
The Munkres algorithm analysis shows the following facts: Step 4 and Step 6 are the most time consuming subfunctions of the algorithm, in that order (cf. Figure 17). Although, after optimization, a single processor executes the algorithm for 20 targets within the required time interval, to scale the system up for more than 20 targets these subfunctions can be executed on separate processors in
Figure 18: Effects of on-chip and off-chip memory sections on Kalman filter performance (NiosII with floating point custom instructions; execution time in ms for all memory sections off chip, .text on chip, .data on chip, heap on chip, stack on chip, and all sections on chip; bars for Mat inv, Mat trans, Mat sub, Mat Mul, and Mat add).
Figure 19: Effects of on-chip and off-chip memory sections on the Gating Module (NiosII/f, 16 KB I-cache, 2 KB D-cache, 100 MHz clock; time in ms versus number of obstacles for all sections off chip, heap on chip, heap and stack on chip, and all sections on chip).
parallel, to reduce the execution time of the algorithm. As with the Gating Module, we are alternatively experimenting with DSP-VLIW processors to exploit data level parallelism and hardware accelerators.
7.6. Track Maintenance. So far we have not mentioned the track maintenance block of the MTT application in the context of optimization. The reason for this deliberate omission is that a very short processing time is required for this block: a simple NiosII/e processor executes it in 8 ms. In the future we may even eliminate this processor and run the track maintenance block as a second task on one of the other processors.
Figure 20: Effects of on-chip and off-chip memory sections on Munkres algorithm performance for 20 obstacles (time in ms for all memory sections off chip, .text on chip, .data on chip, heap on chip, stack on chip, and all sections on chip).
Figure 21: Cache behavior for Munkres algorithm with integer cost matrix (NiosII/f, 100 MHz, 20 obstacles; time in seconds versus D-cache size for I-cache sizes of 4, 8, 16, 32, and 64 KB).
8. Discussion
After the success of the last optimization of the Munkres algorithm discussed in Section 7.5, we investigated the other modules for similar optimizations. We found that this technique cannot be extended to all the modules in the application, for the following reasons.
(i) The Kalman filter calculates the predicted states, prediction error covariances, estimated states, and estimation error covariances for the targets. The differences in the values of these quantities from one radar scan to another are very small and occur to the right of the decimal point. It takes hundreds of scans for these changes to flow over to the left of the decimal point. Hence the integer part of these floating point numbers remains unchanged for hundreds of scans. Nevertheless, these small differences play an important role not only in the filter itself but also in the Gating Module.
In the filter, the estimated state and estimation error covariance are fed back to the prediction stage of the filter. The prediction stage uses them as the basis of predictions for the next scan. If we used just the integer parts of these quantities, there would be no change in the estimated values for hundreds of scans. Obviously, this would introduce an error into the predictions. Due to the cyclic feedback between the prediction and correction stages of the filter, an avalanche of errors would be generated in a few seconds.
(ii) The predicted states and the prediction error covariances are also used by the Gating Module to locate the centers of the probability gates and to calculate the differences between the measured and the predicted target coordinates, that is, innov d and innov a, respectively. If we used only the integer parts of the predicted and measured coordinates, two catastrophic errors would be introduced into the system. First, because of the non-changing integer parts of the predicted coordinates, the gates would be centered at the same fixed locations for hundreds of scans. Second, for the same reason, innov d and innov a would remain zero for hundreds of cycles. Zero innovations mean that the predicted coordinates are exactly identical to the measured coordinates, which is practically impossible.
(iii) The Gating Module uses the prediction error covariance to calculate the dimensions of the probability gates. Using the constant integer part of the covariance would fix the gate dimensions to a constant size for hundreds of scans. This, again, is unrealistic and would inject even more error into the system.
(iv) For the Munkres algorithm (the Assignment Solver) the case is different. The Assignment Solver is the last step of the application loop. By the time the application reaches this step, most of the floating point operations have already been completed, resulting in the Cost Matrix. The output of the Assignment Solver is the matrix X, which has either 1s or 0s as its elements. The 1s in the matrix are used to identify the most probable observation-prediction pairs. No arithmetic operations are performed on the matrix X.
Altera provides a tool called the C2H compiler, which is intended to transform C code into a hardware accelerator. At one stage of our work we tried this tool, but it turned out to have some serious limitations. It can only be used for code operating on integers, so, for the reasons stated above, we could use it only for the Munkres algorithm. But again, the tool can accelerate only a single function, and that function must not involve complex computations. So we could not use it for Step 4 and Step 6 of the algorithm, where we needed it most. The tool simply stops working when we try to accelerate either of these two functions.
In the case of small functions (like Step 3) where it does work, the hardware size of the accelerator is almost half that of the processor to which it is attached, while the speedup is nominal. In brief, if this tool is improved to remove these limitations it can be very useful. In its current state it is far from its stated goals.
9. Related Work
To our knowledge, comprehensive literature about the implementation of a complete MTT system in an FPGA does not exist. Works about the application of MTT to DASs are even harder to find.
Some work has been done on different isolated components of the MTT system, but in different contexts. For example, an implementation of the Kalman filter only is proposed in [18]. It is not only limited to the filter but is also a fully hardware implementation. As mentioned earlier in the introduction, fully hardware designs lack the flexibility and programmability needed for the ever evolving modern day embedded applications. Moreover, the authors report two alternative implementations of the Kalman filter, namely the Scalar-Based Direct Algorithm Mapping (SBDAM) and the Matrix-Based Systolic Array Engineering (MBSAE). The former consumes 4564 logic cells whereas the latter consumes 8610 logic cells for a single filter. Apart from the large sizes, the internal components of both implementations are manually organized and reorganized to get the desired performance. This is obviously not scalable and repeatable in a complex system like ours, where the filter is not the only component to be optimized.
An attempt to implement an MTT system in hardware for a maritime application is documented in [19]. In addition to being a completely hardware implementation, the work presented there is inconclusive.
The data association aspect of MTT has been dealt with nicely in [11], but the physical implementation of the system is not a consideration in that work. Only Matlab simulations are reported for that part of the MTT.
Although the title of [20] sounds very close to our work, that work describes the theory of the Extended Kalman Filter (EKF) with a smoothing window. The paper discusses the velocity estimation of slow moving vehicles and emphasizes the necessity of reducing the linearization errors in the process. While the paper presents a viable solution to the problem of linearization errors in the EKF, the physical implementation of the EKF or the tracking system does not figure among the objectives of the work.
A systolic array based FPGA implementation of the Kalman filter only is reported in [21]. This work concentrates on the use of a matrix manipulation algorithm (Modified Faddeev) for reducing the complexity of the computation. This article, again, presents an interesting account of implementing the Kalman filter in an efficient way. In cases where very fast filtering is the main objective, this may be a good solution.
In fact, software forms of algorithms like the EKF [20] and the Modified Faddeev based implementation of the Kalman filter [21] can be easily integrated into our system. For example, the EKF is useful in situations where a target exhibits abrupt changes in its dynamic behavior, as in hilly regions. Similarly, other algorithms like [21] can be added if required. So the works discussed above can be considered complementary to our work rather than competitors.
Most of the available works treat the individual components of the MTT (mainly the Kalman filter) in isolation.
However, putting these and other components together to design a coherent MTT application and adapting it to automotive safety utilization is not a trivial task.
Our work is unique in several aspects. In contrast to the works mentioned above, we consider a complete MTT system implementation. Our reconfigurable MPSoC architecture of the system is inherently flexible, programmable, and scalable. Thus it can evolve very easily with advances in technology and with improvements in application algorithms. Moreover, the use of several concurrently running processors meets the overall real time deadlines. Several low frequency processors running concurrently consume less power compared to a single processor with a high clock frequency doing the same job [5]. The reconfigurability of the processors and other components in our design allows for customizing them according to application requirements while keeping the hardware size as small as possible. The system we propose is a complete plug-and-play solution that can be easily integrated with the existing electronic systems onboard a vehicle.
10. Summary and Conclusion
We presented the procedure we adopted for designing and
optimizing an application-specific MPSoC. The Multiple
Target Tracking (MTT) application is designed for Driver
Assistance Systems (DAS). These systems are used in vehicles
for collision avoidance and early warning to assist the
driver.
First, we described a general view of the MTT application
and then presented our own approach to the design
and development of the customized application for driver
assistance. We developed the mathematical models for the
application and coded the application in ANSI C. We divided
the application into easily manageable modules that can be
executed in parallel.
After developing the application, we profiled it to identify
performance bottlenecks and dependencies among the
application modules. This helped us in allocating processing
resources to the application modules and in laying out
our optimization strategies. Using three different hardware
implementations of the NiosII soft-core embedded processor
and other components, we devised a heterogeneous MPSoC
architecture for the system.
To formulate our optimization strategies we also identified
the constraints to be met. The constraints include the
25 ms time limit for the application execution, the limited
amount of available on-chip memory, and the size of the
system hardware.
To avoid overusing the on-chip memory we optimized
the I-cache and D-cache sizes for each application module.
Understanding the I-cache and D-cache requirements not
only helped us in accelerating the system but also in selecting
the right configuration of the NiosII processor for each
module.
The optimum cache configurations reduced the execution
times by at least 50%. The Gating Module and the
Assignment Solver needed further acceleration to arrive at
Table 3: Summary of the final system.

                          Kalman   Gating   Munkres   Track Maint.
Number of proc.             20        1        1          1
NiosII type                  S        F        F          E
I-cache (KB)                 4       16        8          0
D-cache (KB)                 0        2       16          0
Local mem. (KB)              0        3        3          0
Mem. used on FPGA (%)       60        8        9          1
FP custom instructions      No      Yes       No         No
Run time (ms)               15       23       24          8
Figure 22: System hardware size. FPGA resource usage in logic
elements (LEs) on the Stratix II EP2S60 (60,000 LEs total):
filters 26,000; FIFOs and interconnects 12,000; gating module
1,726; assignment solver 1,700; track maintenance 600;
unused 17,974.
the 25 ms cut-off set by the radar PRT. We incorporated the
floating-point custom instruction hardware in the relevant
processors to accelerate them further. Floating-point custom
instructions reduced the runtime from 70 ms to 37 ms (47%
speedup) for the Gating Module and from 71 ms to 47 ms
(34% speedup) for the Assignment Solver. To bring these
times below 25 ms, we needed to speed these modules up
even more.
Shifting the whole application to the fast on-chip memory
could greatly improve the speed; however, it is not feasible
due to the large memory footprint of the application and the
limited amount of on-chip memory. We experimented
with placing different memory sections, such as the stack and
the heap, in the fast on-chip RAM. Placing only the stack and
the heap memory sections on-chip for the Gating Module
brought the runtime down to 23 ms, which is below the 25 ms
cut-off, and hence we settled for it.
For the Assignment Solver (Munkres algorithm) we could
gain only 6 ms in runtime by putting the entire module in
the on-chip memory. This gain is neither enough to reach
our goal, nor can we afford to put the entire module on chip
in the finalized system.
Exploring the algorithm, we found that its final output
remains unchanged if we drop the fractional part of the
floating-point elements of the input cost matrix. This
manipulation of the input matrix reduced the runtime of the
algorithm to 24 ms without compromising the accuracy of
the final solution. It also allowed us to do away with the
floating-point custom instructions. Consequently, we use the
lighter NiosII/s instead of the heavier NiosII/f for the
Assignment Solver.
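As a minimal sketch of the truncation step described above (the matrix dimension `N_TRACKS` and the function name are illustrative assumptions, not the paper's code), the floating-point cost matrix is converted element-wise to integers before it is handed to the Munkres solver:

```c
#include <assert.h>

/* Illustrative sketch (not the paper's code): drop the fractional part
 * of each floating-point element of the Munkres input cost matrix.
 * For the radar cost data discussed above, the optimal assignment is
 * unchanged, and the floating-point custom instructions become
 * unnecessary. N_TRACKS is an assumed matrix dimension. */
#define N_TRACKS 4

void truncate_cost_matrix(float in[N_TRACKS][N_TRACKS],
                          int out[N_TRACKS][N_TRACKS])
{
    for (int i = 0; i < N_TRACKS; ++i)
        for (int j = 0; j < N_TRACKS; ++j)
            out[i][j] = (int)in[i][j];  /* C truncates toward zero */
}
```

Because the assignment depends only on the ordering of (sufficiently separated) costs, the integer matrix yields the same solution while letting a simpler integer-only core run the solver.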
Speed was not the only objective in choosing the system
components and the optimization strategies; we wanted to
keep the on-chip memory utilization and the hardware size
in check too. We traded speed for FPGA resource economy
where we could afford it, for example, in the case of the
Kalman filters.
Taking into account the results of the optimizations,
we finalized the components and their respective features.
Table 3 summarizes the salient features of the finalized
architecture with reference to Figure 9. The whole system fits
in a single StratixII EP2S60 FPGA. The design uses 42,000
of the 60,000 logic elements (LEs) available on the FPGA, as
shown in Figure 22, and it meets the runtime constraints of
the application.
References
[1] A. Techmer, "Application development of camera-based driver
assistance systems on a programmable multi-processor architecture,"
in Proceedings of the IEEE Intelligent Vehicles Symposium
(IV '07), pp. 1211–1216, Istanbul, Turkey, June 2007.
[2] M. Beekema and H. Broeders, "Computer Architectures
for Vision-Based Advanced Driver Assistance Systems,"
http://www.xs4all.nl/hc11/paper_ca_st.pdf.
[3] STMicroelectronics and Mobileye, "STMicroelectronics and
Mobileye deliver second generation system-on-chip for vision-based
driver assistance systems," press release, May 2008.
[4] S. Blackman and R. Popoli, Design and Analysis of Modern
Tracking Systems, Artech House, Boston, Mass, USA, 1999.
[5] F. Schirrmeister, Imperas, Inc., "Multi-core processors:
fundamentals, trends, and challenges," in Proceedings of the
Embedded Systems Conference, 2007, ESC351.
[6] E. Brookner, Tracking and Kalman Filtering Made Easy, John
Wiley & Sons, New York, NY, USA, 1998.
[7] V. Nedović, "Tracking moving video objects using mean-shift
algorithm," project report, http://staff.science.uva.nl/
~vnedovic/MMIR2004/vnedovicProjReport.pdf.
[8] Z. Salcic and C.-R. Lee, "FPGA-based adaptive tracking
estimation computer," IEEE Transactions on Aerospace and
Electronic Systems, vol. 37, no. 2, pp. 699–706, 2001.
[9] R. E. Kalman, "A new approach to linear filtering and
prediction problems," Journal of Basic Engineering, vol. 82, pp.
35–45, 1960.
[10] D. P. Bertsekas and D. A. Castañón, "A forward/reverse auction
algorithm for asymmetric assignment problems," http://
web.mit.edu/dimitrib/www/For_Rev_Asym_Auction.pdf.
[11] P. Konstantinova et al., "A study of target tracking algorithm
using global nearest neighbor approach," in Proceedings of the
International Conference on Computer Systems and Technologies
(CompSysTech '03), Sofia, Bulgaria, June 2003.
[12] "Munkres Assignment Algorithm, Modified for Rectangular
Matrices," http://csclab.murraystate.edu/~bob.pilgrim/445/
munkres.html.
[13] G. Welch and G. Bishop, "An introduction to the Kalman
filter," 2001, http://www.cs.unc.edu/~welch/kalman/.
[14] Altera Corporation, http://www.altera.com/literature/an/
an391.pdf.
[15] R. Joost and R. Salomon, "Advantages of FPGA-based multiprocessor
systems in industrial applications," in Proceedings of
the 31st Annual Conference of the IEEE Industrial Electronics
Society (IECON '05), pp. 445–450, Raleigh, NC, USA, November
2005.
[16] H. Penttinen, T. Koskinen, and M. Hännikäinen, "Leon3 MP
on Altera FPGA," final project report, August 2007, Altera
Innovate Nordic.
[17] Altera Corporation, NIOS II Processor Reference Handbook,
http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf.
[18] Z. Salcic and C.-R. Lee, "Scalar-based direct algorithm
mapping FPLD implementation of a Kalman filter," IEEE
Transactions on Aerospace and Electronic Systems, vol. 36, no.
3, part 1, pp. 879–888, 2000.
[19] Y. Boismenu, "Etude d'une carte de tracking radar," thèse de
doctorat, Université de Bourgogne, Dijon, France, 2000.
[20] A. Göransson and B. Sohlberg, "Tracking low velocity vehicles
from radar measurements," in Proceedings of the IASTED
International Conference on Circuits, Signals, and Systems (CSS
'03), pp. 51–55, Cancun, Mexico, May 2003.
[21] G. Chen and L. Guo, "The FPGA implementation of Kalman
filter," in Proceedings of the 5th WSEAS International Conference
on Signal Processing, Computational Geometry and
Artificial Vision, Malta, 2005.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 105979, 20 pages
doi:10.1155/2009/105979
Research Article
A Prototyping Virtual Socket System-On-Platform
Architecture with a Novel ACQPPS Motion Estimator for
H.264 Video Encoding Applications
Yifeng Qiu and Wael Badawy
Department of Electrical and Computer Engineering, University of Calgary, Alberta, Canada T2N 1N4
Correspondence should be addressed to Yifeng Qiu, yiqiu@ucalgary.ca
Received 25 February 2009; Revised 27 May 2009; Accepted 27 July 2009
Recommended by Markus Rupp
H.264 delivers streaming video in high quality for various applications. The coding tools involved in H.264, however, make
its video codec implementation very complicated, raising the need for algorithm optimization and hardware acceleration. In
this paper, a novel adaptive crossed quarter polar pattern search (ACQPPS) algorithm is proposed to realize an enhanced inter
prediction for H.264. Moreover, an efficient prototyping system-on-platform architecture is also presented, which can be utilized
for a realization of an H.264 baseline profile encoder with the support of an integrated ACQPPS motion estimator and related video
IP accelerators. The implementation results show that the ACQPPS motion estimator can achieve very high estimated image quality,
comparable to that from the full search method in terms of peak signal-to-noise ratio (PSNR), while keeping the complexity at an
extremely low level. With the integrated IP accelerators and optimized techniques, the proposed system-on-platform architecture
sufficiently supports H.264 real-time encoding at low cost.
Copyright 2009 Y. Qiu and W. Badawy. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
Digital video processing technology aims to improve the coding
validity and efficiency of digital video images [1]. It involves
the video standards and their realizations. With the joint
efforts of ITU-T VCEG and ISO/IEC MPEG, H.264/AVC
(MPEG-4 Part 10) has been established as the most advanced
standard so far, targeting very high
data compression. H.264 is able to provide good video
quality at bit rates which are substantially lower than what
previous standards need [2–4]. It can be applied to a wide
variety of applications with various bit rates and video
streaming resolutions, intending to cover practically all
aspects of audio and video coding processing within
its framework [5–7].
H.264 includes many profiles, levels, and feature definitions.
There are seven sets of capabilities, referred to as
profiles, targeting specific classes of applications: Baseline
Profile (BP) for low-cost applications with limited computing
resources, which is widely used in videoconferencing and
mobile communications; Main Profile (MP) for broadcasting
and storage applications; Extended Profile (XP) for streaming
video with relatively high compression capability; High
Profile (HiP) for high-definition television applications;
High 10 Profile (Hi10P) going beyond present mainstream
consumer product capabilities; High 4:2:2 Profile (Hi422P)
targeting professional applications using interlaced video;
High 4:4:4 Profile (Hi444P) supporting up to 12 bits per
sample, efficient lossless region coding, and an integer
residual color transform for RGB video. The levels in H.264
are defined as Levels 1 to 5, each of which specifies the bit,
frame, and macroblock (MB) rates to be realized in the different
profiles.
One of the primary issues with H.264 video applications
lies in how to realize the profiles, levels, tools, and algorithms
featured by the H.264/AVC draft. Thanks to the rapid development
of FPGA [8] techniques and embedded software system
design and verification tools, designers can utilize a
hardware-software (HW/SW) codesign environment which
is based on the reconfigurable and programmable FPGA
infrastructure as a dedicated solution for H.264 video
applications [9, 10].
The motion estimation (ME) scheme has a vital impact
on H.264 video streaming applications and is the main
function of a video encoder for achieving image compression.
The block-matching algorithm (BMA) is an important and
widely used technique to estimate the motion of a regular
block and generate the motion vector (MV), which is
the critical information for temporal redundancy reduction
in video encoding. Because of its simplicity and coding
efficiency, BMA has been adopted as the standard motion
estimation method in a variety of video standards, such as
MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264. Fast
and accurate block-based search techniques and hardware
acceleration are highly demanded to reduce the coding delay
and maintain satisfactory estimated video image quality. A novel
adaptive crossed quarter polar pattern search (ACQPPS)
algorithm and its hardware architecture are proposed in
this paper to provide an advanced motion estimation search
method with high performance and low computational
complexity.
Moreover, an integrated IP-accelerated codesign system,
constructed with an efficient hardware architecture,
is also proposed. With the integration of H.264 IP accelerators
into the system framework, a complete system-on-platform
solution can be set up to realize the H.264 video encoding
system. Through the codevelopment and coverification of the
system-on-platform, the architecture and IP cores developed
by designers can be easily reused and therefore transplanted
from one platform to others without significant modification
[11]. These factors make a system-on-platform solution
outperform a pure software solution and make it more flexible than
a fully dedicated hardware implementation for H.264 video
codec realizations.
The rest of the paper is organized as follows. Section 2
briefly analyzes the H.264 baseline profile and its applications.
Section 3 proposes the ACQPPS algorithm in detail,
while Section 4 describes the hardware architecture of
the proposed ACQPPS motion estimator. Furthermore,
the hardware architecture and host interface features of
the proposed system-on-platform solution are elaborated in
Section 5, and the related techniques for system optimization
are illustrated in Section 6. The complete experimental
results are presented and analyzed in Section 7.
Section 8 concludes the paper.
2. H.264 Baseline Profile
2.1. General Overview. The profiles and levels specify the
conformance points, which are designed to facilitate
interoperability between a variety of video applications of
the H.264 standard that have similar functional requirements.
A profile defines a set of coding tools or algorithms that
can be utilized in generating a compliant bitstream, whereas
a level places constraints on certain key parameters of the
bitstream.
The H.264 baseline profile was designed to minimize
computational complexity and provide high robustness and
flexibility for use over a broad range of network
environments and conditions. It is typically regarded as
the simplest profile in the standard; it includes all the
H.264 tools with the exception of the following: B-slices,
weighted prediction, field (interlaced) coding, picture/macroblock
adaptive switching between frame and
field coding (MB-AFF), context-adaptive binary arithmetic
coding (CABAC), SP/SI slices, and slice data partitioning.
This profile normally targets video applications
with low computational complexity and low delay requirements.
For example, in the field of mobile communications,
the H.264 baseline profile will play an important role because
its compression efficiency is doubled in comparison with
the coding schemes currently specified by H.263 Baseline,
H.263+, and MPEG-4 Simple Profile.
2.2. Baseline Profile Bitstream. For mobile and videoconferencing
applications, H.264 BP, MPEG-4 Visual Simple
Profile (VSP), H.263 BP, and H.263 Conversational High
Compression (CHC) are usually considered. In practice,
H.264 outperforms all other considered encoders for video
streaming encoding. H.264 BP allows an average bit rate
saving of about 40% compared to H.263 BP, 29% compared to
MPEG-4 VSP, and 27% compared to H.263 CHC, respectively [12].
2.3. Hardware Codec Complexity. The implementation complexity
of any video coding standard heavily depends on
the characteristics of the platform, for example, FPGA, DSP,
ASIC, or SoC, on which it is mapped. A basic analysis of
the H.264 BP hardware codec implementation
complexity can be found in [13, 14].
In general, the main bottleneck of H.264 video encoding
is the combination of multiple reference frames and large
search ranges.
Moreover, the H.264 video codec complexity ratio is on
the order of 10 for basic configurations and can grow up to
2 orders of magnitude for complex ones [15].
3. The Proposed ACQPPS Algorithm
3.1. Overview of the ME Methods. For motion estimation,
the full search algorithm (FS) of BMA exhaustively checks all
possible block pixels within the search window to find the
best matching block with the minimal matching error (MME).
It can usually produce a globally optimal solution to the
motion estimation, but demands a very high computational
complexity.
To reduce the required operations, many fast algorithms
have been developed, including the 2D logarithmic search
(LOGS) [16], the three-step search (TSS) [17], the new three-step
search (NTSS) [18], the novel four-step search (NFSS)
[19], the block-based gradient descent search (BBGDS)
[20], the diamond search (DS) [21], the hexagonal search
(HEX) [22], the unrestricted center-biased diamond search
(UCBDS) [23], and so forth. The basic idea behind these
multistep fast search algorithms is to check a few block
points at the current step, and restrict the search in the next step to
the neighborhood of the point that minimizes the block distortion
measure.
These algorithms, however, assume that the error surface
of the minimum absolute difference increases monotonically
as the search position moves away from the global minimum
on the error surface [16]. This assumption is
reasonable in a small region near the global minimum,
but not absolutely true for real video signals. To avoid
being trapped in an undesirable local minimum, some adaptive search
algorithms have been devised, intending to achieve the global
optimum or suboptimum with adaptive search patterns.
One of those algorithms is the adaptive rood pattern search
(ARPS) [24].
Recently, a few valuable algorithms have been developed
to further improve the search performance, such
as the Enhanced Predictive Zonal Search (EPZS) [25,
26] and the Unsymmetrical-Cross Multi-Hexagon-grid Search
(UMHexagonS) [27], which were even adopted by H.264 as
standard motion estimation algorithms. These schemes,
however, are not especially suitable for hardware implementation,
as their search principles are complicated.
If a hardware architecture is required for the
realization of an H.264 encoder, these algorithms are usually not
regarded as the efficient solution.
To improve the search performance and reduce the computational
complexity as well, an efficient and fast method,
the adaptive crossed quarter polar pattern search algorithm
(ACQPPS), is therefore proposed in this paper.
3.2. Algorithm Design Considerations. It is known that a small
search pattern with compactly spaced search points (SPs)
is more appropriate than a large search pattern containing
sparsely spaced search points for detecting small motions
[24]. On the contrary, the large search pattern has the
advantage of quickly detecting large motions, avoiding being
trapped into a local minimum along the search path, which leads
to unfavorable estimation, an issue that the small search
pattern encounters. It is therefore desirable to use different search
patterns, that is, adaptive search patterns, in view of the variety
of estimated motion behaviors.
Three main aspects are considered to improve or speed
up the matching procedure for adaptive search methods: (1)
the type of motion prediction; (2) the selection of the search
pattern shape and direction; (3) the adaptive length of the search
pattern. The first two aspects can reduce the number of
search points, and the last one gives a more accurate
search result for a large motion.
For the proposed ACQPPS algorithm under the H.264
encoding framework, a median type of predicted
motion vector, that is, the median vector predictor (MVP)
[28], is produced for determining the initial search range.
The shape and direction of the search pattern are adaptively
selected. The length (radius) of the search arm is
adjusted to improve the search. Two main search steps
are involved in the motion search: (1) the initial search stage;
(2) the refined search stage. In the initial search stage, some
initial search points are selected to obtain an initial MME
point. For the refined search, a unit-sized square pattern
is applied iteratively to obtain the final best motion vector.
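As a rough sketch of the MVP computation (in H.264 inter prediction the predictor is the component-wise median of the motion vectors of the left, top, and top-right neighboring blocks; the type and function names here are illustrative):

```c
#include <assert.h>

/* Sketch of the median vector predictor (MVP): each component of the
 * predictor is the median of the corresponding components of three
 * neighboring blocks' motion vectors (left, top, top-right), as in
 * H.264 inter prediction. Names are illustrative assumptions. */
typedef struct { int x, y; } MV;

static int median3(int a, int b, int c)
{
    int t;
    if (a > b) { t = a; a = b; b = t; }
    if (b > c) { t = b; b = c; c = t; }
    if (a > b) { t = a; a = b; b = t; }
    return b;  /* middle value after sorting the three */
}

MV median_mv_predictor(MV left, MV top, MV topright)
{
    MV p;
    p.x = median3(left.x, top.x, topright.x);
    p.y = median3(left.y, top.y, topright.y);
    return p;
}
```

The median makes the predictor robust to one outlier neighbor, which is why it is a good seed for the adaptive patterns described next.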
3.3. Shape of the Search Pattern. To determine the following
search step according to whether the current best matching
point is positioned at the center of the search range, a new search
pattern is devised to detect the potentially optimal search
points in the initial search stage. The basic concept is to pick
some initial points along a polar (circular) search
pattern. The center of the search circles is the current block
position.
Under the assumption that the matching error surface
is monotonically increasing or decreasing,
however, some redundant checking points may exist in the
initial search stage. It is obvious that some redundant points
need not be examined under the assumption of a
unimodal distortion surface. To reduce the number of initial
checking points while keeping the probability of finding optimal
matching points as high as possible, a fractional, or quarter,
polar search pattern is used accordingly.
Moreover, it is known that the accuracy of the motion
predictor is very important to the adaptive pattern search.
To improve the performance of the adaptive search, extra related
motion predictors can be used in addition to the initial MVP.
The extra motion predictors utilized by the ACQPPS algorithm
only require an extension and a contraction of the initial
MVP, which can be easily obtained. Therefore, at the crossing
of the quarter circle and the motion predictors, the search method
is equipped with adaptive crossed quarter polar patterns
for efficient motion search.
3.4. Adaptive Directions of the Search Pattern. The search
direction, which is defined by the direction of the quarter circle
contained in the pattern, comes from the MVP. Figure 1
shows the possible patterns designed, and Figure 2 depicts
how to determine the direction of a search pattern. The
patterns employ the directional information of a motion
predictor to increase the possibility of getting the best MME
point for the refined search. To determine an adaptive
direction of the search pattern, the following rules are obeyed.
(3.4.1) If the predicted MV (motion predictor) = 0, set up an
initial square search pattern with a pattern size = 1,
around the search center, as shown in Figure 2(a).
(3.4.2) If the predicted MV falls onto a coordinate axis,
that is, PredMVy = 0 or PredMVx = 0, the pattern
direction is chosen to be E, N, W, or S, as shown in
Figures 1(a), 1(c), 1(e), 1(g). In this case, the point
at the initial motion predictor overlaps with an
initial search point which is on the N, W, E, or S
coordinate axis.
(3.4.3) If the predicted MV does not fall onto any coordinate
axis, and Max{|PredMVy|, |PredMVx|} >
2 × Min{|PredMVy|, |PredMVx|}, the pattern direction
is chosen to be E, N, W, or S, as shown in
Figure 2(b).
(3.4.4) If the predicted MV does not fall onto any coordinate
axis, and Max{|PredMVy|, |PredMVx|} ≤
2 × Min{|PredMVy|, |PredMVx|}, the pattern direction
is chosen to be NE, NW, SW, or SE, as shown in
Figure 2(c).
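Rules (3.4.1)–(3.4.4) amount to a small classification of the predictor. A minimal sketch follows; the enum names are illustrative, and the Max > 2 × Min threshold is as read from the rules above:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the pattern-direction rules (3.4.1)-(3.4.4). Returns the
 * class of pattern: a unit square (zero predictor), an axis-aligned
 * E/N/W/S pattern, or a diagonal NE/NW/SW/SE pattern. Enum and
 * function names are illustrative, not from the paper. */
typedef enum { PATTERN_SQUARE, PATTERN_AXIS, PATTERN_DIAGONAL } PatternClass;

PatternClass select_pattern(int pred_x, int pred_y)
{
    int ax = abs(pred_x), ay = abs(pred_y);
    if (ax == 0 && ay == 0)
        return PATTERN_SQUARE;                  /* rule (3.4.1) */
    if (ax == 0 || ay == 0)
        return PATTERN_AXIS;                    /* rule (3.4.2) */
    {
        int max = ax > ay ? ax : ay;
        int min = ax > ay ? ay : ax;
        /* rules (3.4.3) and (3.4.4): near-axis predictors keep the
         * axis pattern, otherwise a diagonal pattern is used */
        return (max > 2 * min) ? PATTERN_AXIS : PATTERN_DIAGONAL;
    }
}
```

The actual compass direction (E vs. W, NE vs. SW, etc.) then follows from the signs of the predictor components.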
[Figure 1 shows the eight possible adaptive search patterns: (a) E, (b)
NE, (c) N, (d) NW, (e) W, (f) SW, (g) S, and (h) SE. Each pattern
plots the initial SPs along the quarter circle together with the points
given by the predicted MV and its extension.]
Figure 1: Possible adaptive search patterns designed.
3.5. Size of the Search Pattern. To simplify the selection of the
search pattern size, the horizontal and vertical components
of the motion predictor are again utilized. The size of the search
pattern, that is, the radius of the designed quarter polar search
pattern, is simply defined as

R = Max{|PredMVy|, |PredMVx|},   (1)

where R is the radius of the quarter circle, and PredMVy and
PredMVx are the vertical and horizontal components of the
motion predictor, respectively.
3.6. Initial Search Points. After the direction and size of
a search pattern are decided, some search points are
selected in the initial search stage. Each search point
represents a block to be checked with intensity matching. The
initial search points include (when the MVP is not zero):
(1) the predicted motion vector point;
(2) the center point of the search pattern, which represents the
candidate block in the current frame;
(3) some points on the directional axis;
Figure 2: (a) Square pattern of size 1 (initial SPs when PredMV = 0);
(b) N/W/E/S search pattern selected when Max{|PredMVy|, |PredMVx|} >
2 × Min{|PredMVy|, |PredMVx|}; (c) NW/NE/SW/SE search pattern selected
when Max{|PredMVy|, |PredMVx|} ≤ 2 × Min{|PredMVy|, |PredMVx|}.
Table 1: A look-up table for the definition of the vertical and horizontal
components of initial search points on the NW/NE/SW/SE axis.

R   |SPx|  |SPy|      R   |SPx|  |SPy|
0     0      0        6     4      4
1     1      1        7     5      5
2     2      2        8     6      6
3     2      2        9     6      6
4     3      3       10     7      7
5     4      4
(4) the extension predicted motion vector point (the point
with prolonged length of the motion predictor) and the
contraction predicted motion vector point (the point
with contracted length of the motion predictor).
Normally, if no overlapping exists, a total of seven
search points are selected in the initial search stage, in
order to find the point with the MME, which serves as
the basis for the refined search stage thereafter.
If a search point is on the NW, NE, SW, or SE axis,
the corresponding decomposed coordinates of that point
satisfy

R = √((SPx)² + (SPy)²),   (2)

where SPx and SPy are the horizontal and vertical components
of a search point on the NW, NE, SW, or SE axis.
Because |SPx| is equal to |SPy| in this case,

R = √2 |SPx| = √2 |SPy|.   (3)

Obviously, neither |SPx| nor |SPy| is then an integer, as R
is always an integer-based radius for block processing. To
simplify and reduce the computational complexity of
defining a search point on the NW, NE, SW, or SE axis,
a look-up table (LUT) is employed, as listed in Table 1.
The values of SPx and SPy are predefined according to the
radius R, and they are now integers. Figure 3 illustrates some
examples of initial search points defined with the look-up
table.
When the radius R > 20, the values of |SPx| and |SPy| can
be determined by

|SPx| = |SPy| = Round(R/√2).   (4)
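Combining Table 1 with (4), the component magnitude for a diagonal search point can be either looked up or computed. The sketch below replaces floating-point rounding with integer arithmetic (1/√2 ≈ 707107/1000000); note that the text specifies the LUT only up to R = 10 and (4) only for R > 20, so applying the same rounding for 10 < R ≤ 20 is an assumption:

```c
#include <assert.h>

/* Sketch: magnitude of each component (|SPx| = |SPy|) of a diagonal
 * (NW/NE/SW/SE) initial search point for radius R. Values for R <= 10
 * follow Table 1; larger radii use Round(R / sqrt(2)) as in (4),
 * computed here in integer arithmetic. The behavior for R = 11..20 is
 * an assumption (the text leaves that range unspecified). */
int diagonal_sp_component(int r)
{
    static const int lut[11] = { 0, 1, 2, 2, 3, 4, 4, 5, 6, 6, 7 };
    if (r <= 10)
        return lut[r];
    /* Round(r / sqrt(2)) ~= round(r * 0.707107) without <math.h> */
    return (int)(((long long)r * 707107 + 500000) / 1000000);
}
```

Keeping the small radii in a LUT matches the hardware realization, where a table avoids a square-root or divider unit in the estimator datapath.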
There are two initial search points related to the extended
motion predictors. One has a prolonged length of the motion
predictor (extension version), whereas the other has a
reduced length of the motion predictor (contraction version).
Two scale factors are adaptively defined according to the
radius R, so that the lengths of these two initial search points
can be easily derived from the original motion predictor, as
shown in Table 2. The scale factors are chosen so that the
initial search points related to the extension and contraction
of the motion predictor are distributed reasonably around
the motion predictor point to obtain the better motion
predictor points.
Figure 3: (a) An example of initial search points defined for the E
pattern using the look-up table; (b) an example of initial search points
defined for the NE pattern using the look-up table. In both cases SPx
and SPy are determined by the look-up table from the radius R of the
predicted MV.
Table 2: Definition of the scale factors for the initial search points
related to the motion predictor.

R       Scale factor for extension (SF_E)
0–2     3
3–5     2
6–10    1.5
>10     1.25

R       Scale factor for contraction (SF_C)
0–10    0.5
>10     0.75
Therefore, the initial search points related to the motion
predictor can be identified as

EMVP = SF_E · MVP,   (5)
CMVP = SF_C · MVP,   (6)

where MVP is the point representing the median vector predictor,
SF_E and SF_C are the scale factors for the extension and
contraction, respectively, and EMVP and CMVP are the initial
search points with the prolonged and contracted lengths of the
predicted motion vector, respectively. If the horizontal or
vertical component of EMVP or CMVP is not an integer
after scaling, the component value is truncated to
an integer for video block processing.
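Equations (5) and (6), the Table 2 factors, and the truncation rule can be sketched as follows (struct and function names are illustrative):

```c
#include <assert.h>

/* Sketch of (5)-(6): the extension point EMVP = SF_E * MVP and the
 * contraction point CMVP = SF_C * MVP, using the Table 2 scale
 * factors; non-integer components are truncated as stated in the
 * text. Names are illustrative assumptions. */
typedef struct { int x, y; } MV;

static double sf_ext(int r)   /* Table 2, extension column */
{
    return r <= 2 ? 3.0 : r <= 5 ? 2.0 : r <= 10 ? 1.5 : 1.25;
}

static double sf_con(int r)   /* Table 2, contraction column */
{
    return r <= 10 ? 0.5 : 0.75;
}

MV scale_mvp(MV mvp, double sf)
{
    MV p;
    p.x = (int)(sf * mvp.x);  /* truncate toward zero */
    p.y = (int)(sf * mvp.y);
    return p;
}
```

For example, with MVP = (4, 2) the radius is R = 4, so the extension point is (8, 4) and the contraction point is (2, 1).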
3.7. Algorithm Procedure
Step 1. Get a predicted motion vector (MVP) for the
candidate block in the current frame for the initial search stage.
Step 2. Find the adaptive direction of the search pattern by
rules (3.4.1)–(3.4.4), determine the pattern size R with
(1), and choose the initial SPs in the reference frame along the
quarter circle and the predicted MV using the look-up table, (5),
and (6).
Step 3. Check the initial search points with block pixel
intensity measurement, and take the MME point, which has the
minimum SAD, as the search center for the next search stage.
Step 4. Refine the local search by applying a unit-sized square
pattern to the MME point (search center), and check its
neighboring points with block pixel intensity measurement.
If, after the search, the MME point is still the search center,
stop searching and obtain the final motion vector for the
candidate block corresponding to the final best matching
point identified in this step. Otherwise, set the new MME
point as the search center, and apply the square pattern search to
that MME point again, until the stop condition is satisfied.
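Step 4 can be sketched as an iterative unit-square descent. The cost callback stands in for the block SAD at a candidate MV; `demo_cost` is a toy unimodal surface for illustration only, not part of the algorithm:

```c
#include <assert.h>

/* Sketch of the refined search stage (Step 4): apply a unit-sized
 * square pattern around the current MME point and move the centre to
 * the best neighbor, stopping when the centre itself remains best.
 * CostFn abstracts the block SAD of a candidate MV (x, y). */
typedef int (*CostFn)(int x, int y);

void refine_square(CostFn cost, int *mx, int *my)
{
    for (;;) {
        int bx = *mx, by = *my, best = cost(bx, by);
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int cx = *mx + dx, cy = *my + dy;
                int c = cost(cx, cy);
                if (c < best) { best = c; bx = cx; by = cy; }
            }
        if (bx == *mx && by == *my)
            break;                  /* centre still best: stop */
        *mx = bx;                   /* move centre and iterate */
        *my = by;
    }
}

/* Toy unimodal cost surface with its minimum at (3, -2), used only
 * to illustrate convergence of the square pattern. */
int demo_cost(int x, int y)
{
    return (x - 3) * (x - 3) + (y + 2) * (y + 2);
}
```

On a unimodal surface the centre walks step by step to the minimum, which mirrors the paper's stop condition.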
3.8. Algorithm Complexity. As ACQPPS is a predictive
and adaptive multistep algorithm for motion search, its
computational complexity depends exclusively on
the object motions contained in the video sequences and the
scenarios under estimation. The main overhead of the
ACQPPS algorithm lies in the block SAD computations.
Other algorithm overheads, such as the selection of the
adaptive search pattern direction and the determination of the
search arm and initial search points, are merely a combination
of if-condition judgments, and can thus be ignored when
compared with the block SAD calculations.
If large, quick, and complex object motions are present
in a video sequence, the number of search points
(NSP) will be correspondingly increased. On the contrary, if
small, slow, and simple object motions appear in the
sequence, the ACQPPS algorithm only requires a few
processing steps to finish the motion search, that is, the
number of search points is correspondingly reduced.
Unlike the ME algorithms with fixed search ranges,
for example, the full search algorithm, it is impractical
to precisely identify the number of computational steps
for ACQPPS. On average, however, an approximation
EURASIP Journal on Embedded Systems

Figure 4: A hardware architecture for the ACQPPS motion estimator: a look-up table and motion predictor storage feed the initial search processing unit; the refined search processing unit, the current and reference video frame storage (a 16 × 16 register array for current block data and an 18 × 18 register array for reference block data), a pipelined multilevel SAD calculator, and an SAD comparator produce the MME point, the generated MV, and the residual and reference data.
equation can be utilized to represent the computational complexity of the ACQPPS method. The worst case of motion search for a video sequence is the 4 × 4 block size, if a fixed block size is employed. In this case, the number of search points for ACQPPS motion estimation is usually around 12 to 16, according to practical motion search results. Therefore, the algorithm complexity can be expressed simply, in terms of image size and frame rate, as

C ≈ 16 × (block SAD computations) × (number of blocks in a video frame) × (frame rate), (7)

where the block size is 4 × 4 for the worst case of computations. For a standard software implementation, each 4 × 4 block SAD calculation actually requires 16 subtractions and 15 additions, that is, 31 arithmetic operations. Accordingly, the complexity of ACQPPS is approximately 14 and 60 times less than that required by the full search algorithm with the [−7, +7] and [−15, +15] search ranges, respectively. In practice, the ACQPPS complexity is roughly at the same level as that of the simple DS algorithm.
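The figures quoted above can be checked with a few lines of arithmetic. The CIF resolution used below is an illustrative choice; equation (7) itself only fixes the per-block constants.

```python
# Reproducing the complexity figures: 31 arithmetic operations per 4 x 4
# SAD, roughly 16 search points per 4 x 4 block in the worst case, and the
# ratios against exhaustive full search.

OPS_PER_4X4_SAD = 16 + 15              # 16 subtractions + 15 additions

def acqpps_ops_per_second(width, height, fps, nsp=16, block=4):
    """Equation (7): worst-case operations/second with 4 x 4 blocks."""
    blocks_per_frame = (width // block) * (height // block)
    return nsp * OPS_PER_4X4_SAD * blocks_per_frame * fps

def full_search_points(r):
    """Points checked by full search over the window [-r, +r]."""
    return (2 * r + 1) ** 2

ratio_7 = full_search_points(7) / 16    # vs FS [-7, +7]:  about 14x
ratio_15 = full_search_points(15) / 16  # vs FS [-15, +15]: about 60x
```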
4. Hardware Architecture of
ACQPPS Motion Estimator
ACQPPS is designed with low complexity, which makes it appropriate for implementation in a hardware architecture. The hardware architecture takes advantage of the pipelining and parallel operation of the adaptive search patterns, and utilizes a fully pipelined multilevel SAD calculator to improve the computational efficiency and, therefore, reduce the required clock frequency considerably.

As mentioned above, the computation of the motion vector for the smallest block shape, that is, the 4 × 4 block, is the worst case for calculation. The worst case refers to the percentage usage of the memory bandwidth. It is necessary that the computational efficiency be as high as possible in this worst case. All other block shapes can be constructed from 4 × 4 blocks, so that computing the distortion of the 4 × 4 partial solutions and adding the results can handle all other block shapes.
4.1. ACQPPS Hardware Architecture. An architecture for the ACQPPS motion estimator is shown in Figure 4. There are two main stages for the motion vector search, the initial and the refined search, indicated by a hardware semaphore. In the initial search stage, the architecture utilizes the previously calculated motion vectors to produce an MVP for the current block. Some initial search points are generated using the MVP and the LUT to define the search range of the adaptive patterns. After an MME point is found in this stage, the search refinement takes effect, applying the square pattern around the MME point iteratively to obtain a final best MME point, which indicates the final best MV for the current block. For motion estimation, the reference frames are stored in SRAM or DRAM, while the current frame and the produced MVs are stored in dual-port memory (BRAM). Meanwhile, the LUT also uses the BRAM to facilitate the generation of the initial search points.

Figure 5 illustrates the data search flow of the ACQPPS hardware IP for each block motion search. The initial search processing unit (ISPU) is used to generate the initial search points and then perform the initial motion search. To generate the initial search points, previously calculated MVs and an LUT are employed. The LUT contains the vertical and horizontal components of the initial search points defined in Table 1. Both the produced MVs and the LUT values are stored in BRAM, as they can then be accessed through two independent data ports in parallel to facilitate the processing. When the initial search stage is finished, the refined search processing unit (RSPU) is enabled. It employs the square pattern around the MME point derived in the initial search stage to refine the local motion search. The local refined search steps may be performed iteratively a few times, until the MME point remains at the search center
Figure 5: (a) A data search flow for individual block motion estimation when the MVP is not zero; (b) a data search flow for individual block motion estimation when the MVP is zero. In both cases the current block data are preloaded into the 16 × 16 register array, the reference data for each offset search point are loaded into the 18 × 18 register array to enable SAD calculation, and the refined search stage generates new offset SPs with the diamond or square pattern around the MME point, iterating until the MME point is the search center and yields the final MV for the current block. Note: the clock cycles shown for each task are not on an exact timing scale and are for illustration purposes only.
after a certain number of refinement steps. The search data flow of the ACQPPS IP architecture conforms to the algorithm steps defined in Section 3.7, with further improvement and optimization through the parallel and pipelining features of the hardware.
4.2. Fully Pipelined SAD Calculator. As the main ME operations are SAD calculations, which have a critical impact on the performance of a hardware-based motion estimator, a fully pipelined SAD calculator is designed to speed up the SAD computations. Figure 6 displays the basic architecture of the pipelined SAD calculator, with processing support for variable block sizes. According to the VBS indicated by the block shape and enable signals, the SAD calculator can employ the appropriate parallel and pipelined adder operations to generate the SAD result for a searched block. With the parallel calculations of the basic processing units (BPUs), it takes 4 clock cycles to finish the 4 × 4 block SAD computations (one BPU per 4 × 4 block SAD), and 8 clock cycles to produce a final SAD result for a 16 × 16 block.

To support the VBS feature, different block shapes can be processed based on the BPU prototype. In that case, a 16 × 16 macroblock is divided into 16 basic 4 × 4 blocks.
Figure 6: An architecture for the pipelined multilevel SAD calculator: four register data arrays feed BPU1-BPU4 through multiplexers under data selection control, and accumulators (ACC) combine the 4 × 4 SADs into the 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, and 16 × 16 SAD results.
The 4 × 4 blocks are numbered 0 to 15 in raster order within the 16 × 16 macroblock.

Organization of VBS using 4 × 4 blocks:
16 × 16: {0, 1, . . . , 14, 15}
16 × 8: {0, 1, . . . , 6, 7}, {8, 9, . . . , 14, 15}
8 × 16: {0, 1, 4, 5, 8, 9, 12, 13}, {2, 3, 6, 7, 10, 11, 14, 15}
8 × 8: {0, 1, 4, 5}, {2, 3, 6, 7}, {8, 9, 12, 13}, {10, 11, 14, 15}
8 × 4: {0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}, {10, 11}, {12, 13}, {14, 15}
4 × 8: {0, 4}, {1, 5}, {2, 6}, {3, 7}, {8, 12}, {9, 13}, {10, 14}, {11, 15}
4 × 4: {0}, {1}, . . . , {14}, {15}

Computing stages for VBS using 4 × 4 blocks:
16 × 16: Stage 1 {0, 1, 2, 3}; Stage 2 {4, 5, 6, 7}; Stage 3 {8, 9, 10, 11}; Stage 4 {12, 13, 14, 15}
16 × 8: Stage 1 {0, 1, 2, 3}/{8, 9, 10, 11}; Stage 2 {4, 5, 6, 7}/{12, 13, 14, 15}
8 × 16: Stage 1 {0, 1}/{2, 3}; Stage 2 {4, 5}/{6, 7}; Stage 3 {8, 9}/{10, 11}; Stage 4 {12, 13}/{14, 15}
8 × 8: Stage 1 {0, 1}/{2, 3}/{8, 9}/{10, 11}; Stage 2 {4, 5}/{6, 7}/{12, 13}/{14, 15}
8 × 4: Stage 1 {0, 1}/{2, 3}/{4, 5}/{6, 7}/{8, 9}/{10, 11}/{12, 13}/{14, 15}
4 × 8: Stage 1 {0}/{1}/{2}/{3}/{8}/{9}/{10}/{11}; Stage 2 {4}/{5}/{6}/{7}/{12}/{13}/{14}/{15}
4 × 4: Stage 1 {0}/{1}/ . . . /{14}/{15}
The other six block sizes in H.264, that is, 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, and 4 × 8, can be organized as combinations of basic 4 × 4 blocks, as shown in Figure 7, which also describes the computing stages for each variable-sized block constructed from the basic 4 × 4 blocks to obtain the VBS SAD results.

For instance, the largest 16 × 16 block requires 4 stages of parallel data loading from the register arrays into the SAD calculator to obtain a final block SAD result. In this case, the schedule of data loading is {0, 1, 2, 3} → {4, 5, 6, 7} → {8, 9, 10, 11} → {12, 13, 14, 15}, where { }
indicates each parallel pixel data input with the current and
reference block data.
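The groupings in Figure 7 translate directly into sums of the sixteen basic 4 × 4 SADs. The sketch below is a software model of that composition, not the register-level datapath.

```python
# Composing the SADs of the larger H.264 partitions from the sixteen basic
# 4 x 4 SADs of a macroblock, using the groupings of Figure 7 (blocks are
# numbered 0..15 in raster order).

VBS_GROUPS = {
    "16x16": [list(range(16))],
    "16x8":  [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]],
    "8x16":  [[0, 1, 4, 5, 8, 9, 12, 13], [2, 3, 6, 7, 10, 11, 14, 15]],
    "8x8":   [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]],
    "8x4":   [[2 * i, 2 * i + 1] for i in range(8)],
    "4x8":   [[0, 4], [1, 5], [2, 6], [3, 7],
              [8, 12], [9, 13], [10, 14], [11, 15]],
    "4x4":   [[i] for i in range(16)],
}

def vbs_sads(sad4x4):
    """Given sixteen 4 x 4 SADs, return the SADs of every partition."""
    return {shape: [sum(sad4x4[i] for i in grp) for grp in groups]
            for shape, groups in VBS_GROUPS.items()}
```

Since the partitions of each shape cover the whole macroblock, the per-shape SADs always sum to the 16 × 16 SAD, which is a convenient sanity check.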
4.3. Optimized Memory Structure. When the square pattern is used to refine the MV search results, the mapping of the memory architecture is important for speeding up performance. In our design, the memory architecture is mapped onto a 2D register space for the refined stage. The maximum size of this space is 18 × 18 at full pixel bit depth, that is, the mapped register memory can accommodate the largest 16 × 16 macroblock plus the edge redundancy needed for the rotated data shift and storage operations.

A simple combination of parallel register shifts and the related data fetches from SRAM reduces the memory bandwidth and facilitates the refinement processing, as many of the pixel data searched in this stage remain unchanged. For example, 87.89% and 93.75% of the pixel data stay unchanged when the (1,1) and (1,0) offset searches for the 16 × 16 block are executed, respectively.
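These reuse percentages follow from the geometry of the shifted window: for an offset (dx, dy) of a 16 × 16 block, (16 − |dx|)(16 − |dy|) of the 256 pixels are already held in the register space.

```python
# Fraction of register pixels reused when the square pattern shifts the
# 16 x 16 window by (dx, dy); only the remaining strip must be fetched.

def reuse_fraction(dx, dy, n=16):
    return (n - abs(dx)) * (n - abs(dy)) / (n * n)
```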
4.4. SAD Comparator. The SAD comparator compares the previously generated block SAD results to obtain the final estimated MV, which corresponds to the best MME point, that is, the one with the minimum SAD and hence the lowest block pixel intensity difference. To select and compare the proper block SAD results shown in Figure 6, the signals for the different block shapes and computing stages are employed to determine the appropriate minimum-SAD mode to be utilized.

For example, if the 16 × 16 block size is used for motion estimation, the 16 × 16 block data are loaded into the BPUs for SAD calculation. Each 16 × 16 block requires 4 computing stages to obtain a final block SAD result. In this case, the result mode of 16 × 8 or 16 × 16 SAD is selected first. Meanwhile, the computing-stage signal also indicates the valid input to the SAD comparator for retrieving the proper SAD results from the BPUs, so that the MME point with the minimum SAD for this block size is obtained.

The best MME point position obtained by the SAD comparator is further employed to produce the best matched reference block data and the residual data, which are important to other video encoding functions, such as the mathematical transforms and motion compensation.
5. Virtual Socket System-on-Platform Architecture

The bitstream and hardware complexity analysis derived in Section 2 helps guide both the architecture design for prototyping the IP accelerated system and the optimized implementation of an H.264 BP encoding system based on that architecture.
5.1. The Proposed System-on-Platform Architecture. The variety of options, switches, and modes required in a video bitstream results in increasing interactions between the different video tasks or function-specific IP blocks. Consequently, function-oriented and fully dedicated architectures become inefficient if high levels of flexibility are not provided in the individual IP modules. For the architectures to remain efficient, the hardware blocks need optimization to deal with the increasing complexity of visual object processing. Besides, the hardware must remain flexible enough to manage and allocate the various resources, memories, and computational video IP accelerators for the different encoding tasks. Given that programmable solutions are preferable for video codec applications, with programmable and reconfigurable processing cores the heterogeneous functionality and algorithms can be executed on the same hardware platform and upgraded flexibly in software.
To accelerate the performance of the processing cores, parallelization is required. Parallelization can take place at different levels, such as the task, data, and instruction levels. Furthermore, specific video processing algorithms performed by IP accelerators or processing cores can improve the execution efficiency significantly. The requirements of H.264 video applications are so demanding that multiple acceleration techniques may have to be combined to meet real-time conditions. Programmable, reconfigurable, heterogeneous processors are the preferable choice for an implementation of an H.264 BP video encoder. Architectures that support concurrent execution and hardware video IP accelerators are well suited to achieving the real-time requirement imposed by the H.264 standard.
Figure 8 shows the proposed extensible system-on-platform architecture. The architecture consists of a programmable and reconfigurable processing core built on an FPGA, and two extensible cores, a RISC and a DSP. The RISC takes charge of general sequence control and IP integration information, selects modes for video coding, and configures basic operations, while the DSP can be utilized to process particular or flexible computational tasks.

The processing cores are connected through the heterogeneous integrated on-platform memory spaces for the exchange of control information. The PCI/PCMCIA standard bus provides a data transfer solution for the host connected to the platform framework, and reconfigures and controls the platform in a flexible way. The desired video IP accelerators are integrated into the system platform architecture to improve the encoding performance for H.264 BP video applications.
5.2. Virtual Socket Management. The concept of the virtual socket is thus introduced into the proposed system-on-platform architecture. The virtual socket is a solution for the host-platform interface that maps a virtual memory space from the host environment to the physical storage on the architecture. It is an efficient mechanism for managing the virtual memory interface and the heterogeneous memory spaces of the system framework, and it enables a truly integrated, platform-independent environment for hardware-software codevelopment.
Figure 8: The proposed extensible system-on-platform hardware architecture: an FPGA holding the IP modules 1 to N, the BRAM, the IP memory interface, and the VS controller, connected to SRAM, DRAM, and the extensible RISC and DSP cores through the local bus mux interface and the PCI bus interface, with interrupt signaling.
Through the virtual socket interface, a few virtual socket application programming interface (API) function calls can be employed to make the generic hardware functional IP accelerators automatically map virtual memory addresses from the host system to the different memory spaces on the hardware platform. With this efficient virtual socket memory organization, the hardware abstraction layer provides the system architecture with simplified memory access, interrupt-based control, and shielded interactions between the platform framework and the host system. Through the integration of IP accelerators into the hardware architecture, the system performance is improved significantly.

The codesigned virtual socket host-platform interface management and the system-on-platform hardware architecture provide a practical embedded system approach for realizing an advanced H.264 video encoding system. The IP accelerators on the FPGA, together with the extensible DSP and RISC, construct an efficient programmable embedded solution for dedicated, real-time video processing tasks. Moreover, given the various video configurations for H.264 encoding, the physically implemented virtual socket interface and its APIs easily enable the encoder configurations, data manipulations, and communications between the host computer system and the hardware architecture, which in turn facilitates the system development for H.264 video encoders.
5.3. Integration of IP Accelerators. An IP accelerator here can be any H.264-compliant hardware block defined to handle a computationally intensive task for video applications, without a specific design for the interaction controls between the IP and the host. For encoding, the basic modules to be integrated include the Motion Estimator, Discrete Cosine Transform and Quantization (DCT/Q), Deblocking Filter, and Context Adaptive Variable Length Coding (CAVLC), with Inverse Discrete Cosine Transform and Inverse Quantization (IDCT/Q−1) and Motion Compensation (MC) for decoding. An IP memory interface is provided by the architecture to achieve the integration.

All IP modules are connected to the IP memory interface, which gives the accelerators a direct way to exchange data between the host and the memory spaces. Interrupt signals can be generated by the accelerators on demand. Moreover, to control the concurrent operation of the accelerators, an IP bus arbitrator is designed into the IP memory interface, so that the interface controller can allocate appropriate memory operation time to each IP module and avoid the memory access conflicts that heterogeneous IP operations could otherwise cause.

IP interface signals are configured to connect the IP modules to the IP memory interface. Each accelerator may well have its own interface requirements for interaction between the platform and the IP modules. To make the integration easy, certain common interface signals are defined to link the IP blocks and the memory interface together. With the IP interface signals, the accelerators can focus on their own computational tasks, and thus the architecture efficiency can be improved. In practice, the IP modules can be flexibly reused, extended, and migrated to other independent platforms very easily. Table 3 defines the necessary IP interface signals for the proposed architecture. IP modules only need to issue memory requests and access parameters to the IP memory interface; the rest of the work is done by the platform controllers. This feature is especially useful when the motion estimator is integrated into the system.
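The arbitration can be pictured as a small grant/release protocol. The model below is a simplifying assumption (a first-come-first-served queue named after the Mem request/ack signals of Table 3); the paper only specifies that the interface controller allocates memory operation time per module.

```python
# Toy model of the IP bus arbitrator: accelerators post memory requests,
# exactly one holds the bus at a time, and the next grant is issued only
# after the holder releases, so concurrent IP operations cannot collide.
from collections import deque

class IpBusArbiter:
    def __init__(self):
        self.pending = deque()   # queued Mem_*_Req, oldest first
        self.granted = None      # IP currently holding the bus

    def request(self, ip_id):
        if ip_id != self.granted and ip_id not in self.pending:
            self.pending.append(ip_id)

    def grant_next(self):
        """Issue Mem_*_Ack to the oldest requester once the bus is free."""
        if self.granted is None and self.pending:
            self.granted = self.pending.popleft()
        return self.granted

    def release(self, ip_id):
        if self.granted == ip_id:    # Mem_*_Release_Req acknowledged
            self.granted = None
```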
5.4. Host Interface and API Function Calls. The host interface provides the architecture with the data necessary for video processing. It can also control the video accelerators to operate in sequential or parallel mode, in accordance with the H.264 video codec specifications. The hardware-software partitioning is simplified so that the host interface can focus on data communication and flow control for the video tasks, while the hardware accelerators deal with local memory
Table 3: IP interface signals.

Clk, reset, start: platform signals for the IP.
Input Valid, Output Valid: valid strobes for IP memory access.
Data In, Data Out: input and output memory data for the IP.
Memory Read: IP request for memory read.
Mem HW Accel, offset, count: IP number, offset, and data count provided by the IP/host for memory read.
Mem HW Accel1, offset1, count1: IP number, offset, and data count provided by the IP/host for memory write.
Mem Read Req, Mem Write Req: IP bus requests for memory access.
Mem Read Release Req, Mem Write Release Req: IP bus release requests for memory access.
Mem Read Ack, Mem Write Ack: IP bus request grants for memory access.
Mem Read Release Ack, Mem Write Release Ack: IP bus release grants for memory access.
Done: IP interrupt signal.
accesses and the video codec functions. The software abstraction layer therefore covers data exchange and video task flow control for the hardware.

A set of related virtual socket API functions is defined to implement the host interface features. The virtual socket APIs are software function calls coded in C/C++ that perform data transfers and signal interactions between the host and the hardware system-on-platform. The virtual socket API, as a software infrastructure, can be utilized by a variety of video applications to control the defined hardware features. With the virtual socket APIs, video data in the local memories can be manipulated conveniently, so the efficiency of the hardware-software interaction is kept high.
6. System Optimizations
6.1. Memory Optimization. Owing to the significant memory access requirements of video encoding tasks, a large number of clock cycles is consumed by the processing core while waiting for data fetches from the local memory spaces. To reduce or avoid this memory access overhead, the storage of the video frame data is organized across multiple independent memory spaces (SRAM and DRAM) and dual-port memory (BRAM), enabling parallel and pipelined memory access during video encoding. This optimization provides the system architecture with multiport memory storage that reduces the data access bandwidth of each individual memory space.

Furthermore, with dual-port data access, DMA can be scheduled to transfer large amounts of video frame data through the PCI bus and the virtual socket interface in parallel with the encoding tasks, so that the processing core does not suffer from memory and encoding latency. In this case, the data control flow of the video encoding is managed so that the DMA transfers and the IP accelerator operations run in fully parallel and pipelined stages.
6.2. Architecture Optimization. As the main video encoding functions (such as ME, DCT/Q, IDCT/Q−1, MC, Deblocking Filter, and CAVLC) can be accelerated by IP modules, the interconnection between those video processing accelerators has an important impact on the overall system performance. To make the IP accelerators execute the main computational encoding routines in fully parallel and pipelined mode, the IP integration architecture has to be optimized. A few caches are inserted between the video IP accelerators to facilitate concurrent encoding. The caches can be organized as parallel dual-port memory (BRAM) or pipelined memory (FIFO). The interconnection control of the data streaming between IP modules is defined using those caches, with the aim of eliminating the extra overhead of the processing routines, so that the encoding functions can operate in fully parallel and pipelined stages.
6.3. Algorithm Optimization. The complexity of the encoding algorithms can be adjusted while the IP accelerators are being shaped. This optimization is applied after choosing the most appropriate modes, options, and configurations for the H.264 BP applications. It is known that the motion estimator accounts for the major share of the encoding computations. To reduce the complexity of motion estimation, the very efficient and fast ACQPPS algorithm and its corresponding hardware architecture have been realized, based on the reduction of spatio-temporal correlation redundancy. Other algorithm optimizations can also be applied. For example, a simple optimization targets the mathematical transform and quantization: as many blocks tend to have minimal residual data after motion compensation, the transform and quantization of motion-compensated blocks can be skipped whenever the SAD of such a block is lower than a prescribed threshold, which speeds up the processing.
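The transform/quantization skip can be sketched in a few lines. The threshold value below is a hypothetical illustration; the text only prescribes that a threshold exists.

```python
# Skip DCT/Q for motion-compensated blocks whose SAD is already below a
# prescribed threshold, treating their residual as negligible.

SKIP_THRESHOLD = 64   # hypothetical value, not specified by the paper

def encode_residual(block_sad, residual, dct_q):
    """Return quantized coefficients, or None when the block is skipped."""
    if block_sad < SKIP_THRESHOLD:
        return None               # bypass transform and quantization
    return dct_q(residual)
```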
The combined application of the memory, algorithm, and architecture optimizations in the system meets the major challenges in realizing the video encoding system. These optimization techniques reduce the encoding complexity and memory bandwidth, with a well-defined parallel and pipelined data streaming control flow, in order to implement a simplified H.264 BP encoder.
6.4. An IP Accelerated Model for Video Encoding. An optimized IP accelerated model for the realization of a simplified H.264 BP video encoder is presented in Figure 9. In this architecture, BRAM, SRAM, and DRAM are used as multiport memories to facilitate the video processing. The current video frame is transferred by DMA and stored in BRAM. Meanwhile, the IP accelerators fetch the data from BRAM and
Figure 9: An optimized architecture for a simplified H.264 BP video encoding system: the virtual socket controller and the IP memory interface connect SRAM, BRAM, and DRAM to the accelerator chain of ME, DCT/Q, IDCT/Q−1, Deblocking Filter, CAVLC, and MC, with the BRAM(1)-BRAM(4) and FIFO(1)-FIFO(2) caches between the modules.
Figure 10: The video task partitioning and data control flow for the optimized system architecture: the current frame is transferred to BRAM; the ME is triggered, fetches the data for each block, and generates MVs and residual data, saving them to BRAM(1) and BRAM(2); DCT/Q, IDCT/Q−1, Deblocking, MC, and CAVLC then run as parallel and pipelined processes through FIFO(1), FIFO(2), BRAM(3), BRAM(4), and SRAM/DRAM, with CAVLC producing the bitstreams, until the frame is finished and an interrupt ends the process.
start the video encoding routines. As BRAM is a dual-port memory, the overhead of the DMA transfer is eliminated by this dual-port cache.

This IP accelerated system model includes the memory, algorithm, and architecture optimization techniques that reduce or eliminate the overhead resulting from the heterogeneous video encoding tasks. The video encoding model provided in this architecture is compliant with the H.264 standard specifications.

A data control flow based on the video task partitioning is shown in Figure 10. From the data streaming it is evident that parallel and pipelined operations dominate the encoding tasks as a whole, which yields an efficient processing performance.
7. Implementations

The proposed ACQPPS algorithm is integrated and verified under the H.264 JM reference software [28], while the hardware architectures, including the ACQPPS motion estimator and the system-on-platform framework, are synthesized with Synplify Pro 8.6.2 and implemented using Xilinx ISE 8.1i SP3, targeting a Virtex-4 XC4VSX35FF668-10 on the WILDCARD-4 [29].

The system hardware architecture can readily process QCIF/SIF/CIF video frames with the support of the on-platform design resources. The Virtex-4 XC4VSX35 contains 3,456 Kb of BRAM [30], 192 XtremeDSP (DSP48) slices [31], and 15,360 logic slices, equivalent to almost 1 million logic gates. Moreover, the WILDCARD-4 integrates a large 8 MB SRAM and a 128 MB DRAM. With these design resources and this memory support, whole QCIF/SIF/CIF video frames can be stored directly in the on-platform memories for efficient hardware processing.

For example, if a CIF YUV (YCbCr) 4 : 2 : 0 video sequence is encoded with the optimized hardware architecture proposed in Figure 9, the total size of each current frame is 148.5 KB. Each current CIF frame can therefore be transferred from the host system and stored directly in BRAM for motion estimation and video encoding, whereas the generated reference frames are stored in SRAM or DRAM. The SRAM and DRAM can accommodate a maximum of 55 and 882 CIF reference frames, respectively, which is more than enough for the practical video encoding process.
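The storage figures above follow from simple YUV 4:2:0 arithmetic, where the two subsampled chroma planes add half the luma size again.

```python
# Frame-storage arithmetic for the on-platform memories (1 KB = 1024 bytes).

def yuv420_frame_bytes(width, height):
    return width * height * 3 // 2        # Y plane + quarter-size Cb, Cr

CIF_BYTES = yuv420_frame_bytes(352, 288)  # one CIF frame
frames_in_sram = (8 * 1024 * 1024) // CIF_BYTES     # 8 MB SRAM
frames_in_dram = (128 * 1024 * 1024) // CIF_BYTES   # 128 MB DRAM
```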
7.1. Performance of the ACQPPS Algorithm. A variety of video sequences containing different amounts of motion, listed in Table 4, is examined to verify the algorithm performance for real-time encoding (30 fps). All sequences are in YUV (YCbCr) 4 : 2 : 0 format, with the luminance component processed for ME. The frame size of the sequences varies from QCIF to SIF and CIF, which is the typical testing condition. The targeted bit rate ranges from 64 Kbps to 2 Mbps. SAD is used as the intensity matching criterion. The search window is [−15, +15] for FS. EPZS uses the extended diamond pattern and the PMVFAST pattern [32] for its primary and secondary refined search stages; it also enables the window-based, temporal, and spatial memory predictors to perform an advanced motion search. UMHexagonS utilizes search range prediction and a default scale factor optimized for the different image sizes. Encoded frames are produced in an IPP . . . PPP sequence, as H.264 BP encoding is employed. For the evaluation of reconstructed video quality, the frame-based average peak signal-to-noise ratio (PSNR) and the number of search points (NSP) per MB (16 × 16 pixels) are measured. Video encoding is configured with support for full-pel motion accuracy, a single reference frame, and VBS. As VBS is a complicated feature of H.264, to make the calculation of NSP over the different block sizes easy and practical, all search points for variable block estimation are normalized to search points per MB, so that the NSP results can be evaluated reasonably.
The implementation results in Tables 6 and 7 show
that the estimated image quality produced by ACQPPS, in
Table 4: Video sequences for experiment with real-time frame rate.
Sequence (bit rate Kbps) Size/frame rate No. of frames
Foreman (512) QCIF/30 fps 300
Carphone (256) QCIF/30 fps 382
News (128) QCIF/30 fps 300
Miss Am (64) QCIF/30 fps 150
Suzie (256) QCIF/30 fps 150
Highway (192) QCIF/30 fps 2000
Football (2048) SIF/30 fps 125
Table Tennis (1024) SIF/30 fps 112
Foreman (1024) CIF/30 fps 300
Mother Daughter (128) CIF/30 fps 300
Stefan (2048) CIF/30 fps 90
Highway (512) CIF/30 fps 2000
Table 5: Video sequences for experiment with low bit and frame rates.
Sequence (bit rate Kbps) Size/frame rate No. of frames
Foreman (90) QCIF/7.5 fps 75
Carphone (56) QCIF/7.5 fps 95
News (64) QCIF/15 fps 150
Miss Am (32) QCIF/15 fps 75
Suzie (90) QCIF/15 fps 75
Highway (64) QCIF/15 fps 1000
Football (256) SIF/10 fps 40
Table Tennis (150) SIF/10 fps 35
Foreman (150) CIF/10 fps 100
Mother Daughter (64) CIF/10 fps 100
Stefan (256) CIF/10 fps 30
Highway (150) CIF/10 fps 665
terms of PSNR, is very close to that from FS, while the
number of average search points is dramatically reduced.
The PSNR difference between ACQPPS and FS is in the
range of −0.13 dB to 0 dB. In most cases, the PSNR degradation
of ACQPPS relative to FS is less than 0.06 dB, and in
some cases the PSNR results of ACQPPS are approximately
equivalent or equal to those generated by FS. When
compared with the other fast search methods, that is, DS
(small pattern), UCBDS, TSS, FSS, and HEX, ACQPPS
consistently outperforms them, always yielding higher PSNR
than those fast algorithms. On the evaluated video sequences,
ACQPPS obtains an average PSNR of up to +0.56 dB higher
than those algorithms.
Besides, ACQPPS performance is comparable to that
of the complicated and advanced EPZS and UMHexagonS
algorithms: its average PSNR lies within −0.07 dB to +0.05 dB
of EPZS and within −0.04 dB to +0.08 dB of UMHexagonS.
In addition to the real-time video sequence encoding
with 30 fps, many other application cases, such as the mobile
scenario and videoconferencing, require video encoding
under the low bit and frame rate environment with less
EURASIP Journal on Embedded Systems 15
Table 6: Average PSNR performance for experiment with real-time frame rate.
Sequence FS DS UCBDS TSS FSS HEX EPZS UMHexagonS ACQPPS
Foreman (QCIF) 38.48 38.09 37.93 38.27 38.19 37.87 38.45 38.44 38.42
Carphone (QCIF) 36.43 36.23 36.16 36.30 36.24 36.04 36.42 36.37 36.37
News (QCIF) 37.44 37.26 37.35 37.28 37.29 37.25 37.43 37.35 37.43
Miss Am (QCIF) 39.07 39.01 39.01 39.00 38.94 39.01 38.98 39.01 39.03
Suzie (QCIF) 38.65 38.46 38.47 38.59 38.54 38.45 38.61 38.58 38.60
Highway (QCIF) 38.23 37.99 38.13 38.11 38.09 38.06 38.18 38.17 38.13
Football (SIF) 31.37 31.23 31.23 31.22 31.24 31.20 31.40 31.37 31.36
Table Tennis (SIF) 33.87 33.71 33.79 33.62 33.72 33.71 33.87 33.84 33.84
Foreman (CIF) 36.30 35.91 35.83 35.72 35.70 35.69 36.27 36.24 36.25
Mother Daughter (CIF) 36.26 36.16 36.24 36.21 36.21 36.22 36.26 36.23 36.26
Stefan (CIF) 33.87 33.46 33.36 33.45 33.39 33.30 33.89 33.82 33.82
Highway (CIF) 37.96 37.70 37.83 37.79 37.77 37.76 37.89 37.87 37.83
Table 7: Average number of search points per MB for experiment with real-time frame rate.
Sequence FS DS UCBDS TSS FSS HEX EPZS UMHexagonS ACQPPS
Foreman (QCIF) 2066.73 60.64 109.70 124.85 122.37 109.26 119.02 125.95 55.63
Carphone (QCIF) 1872.04 46.82 91.54 108.44 106.52 94.82 114.83 121.91 54.02
News (QCIF) 1719.92 33.72 74.48 88.28 90.30 81.12 81.36 79.73 41.32
Miss Am (QCIF) 1471.96 30.70 64.35 74.95 76.52 68.43 62.94 56.27 32.32
Suzie (QCIF) 1914.32 44.19 88.19 108.19 104.97 93.21 96.98 88.74 47.59
Highway (QCIF) 1791.86 40.27 85.14 101.49 100.94 90.27 85.04 84.24 46.12
Football (SIF) 2150.45 68.42 118.21 131.82 129.91 117.62 184.81 202.19 72.63
Table Tennis (SIF) 2031.72 55.66 105.56 120.09 121.27 108.79 128.36 124.95 54.25
Foreman (CIF) 1960.07 76.83 124.56 128.85 125.76 117.21 122.22 124.26 67.20
Mother Daughter (CIF) 1473.73 35.08 70.39 82.18 82.80 73.67 80.38 63.51 40.89
Stefan (CIF) 1954.21 69.32 116.72 118.23 122.37 113.50 137.59 149.80 58.91
Highway (CIF) 1730.90 45.63 90.81 104.90 103.98 93.22 78.82 75.98 47.57
than 30 fps. Accordingly, suitable settings for video
encoding are usually 7.5–15 fps for QCIF and 10–15 fps
for SIF/CIF with various low bit rates, for example,
90 Kbps for QCIF and 150 Kbps for SIF/CIF, to maximize
the perceived video quality [40, 41]. To further
evaluate the ME algorithms under low bit and frame rate
conditions, the video sequences in Table 5 are used, and
Tables 8 and 9 present the corresponding performance results.
The experiments show that the PSNR difference between
ACQPPS and FS is still small, lying in an acceptable
range of −0.49 dB to −0.02 dB. In most cases, there is
less than 0.2 dB PSNR discrepancy between them. Moreover,
ACQPPS still clearly outperforms DS, UCBDS, TSS,
FSS, and HEX. Mobile scenarios usually contain quick
and considerable motion displacements under low frame
rate video encoding; in such cases, ACQPPS is particularly
much better than those fast algorithms, and a PSNR gain
of up to +2.42 dB is achieved on the tested sequences. When
compared with EPZS and UMHexagonS, ACQPPS yields an
average PSNR within −0.36 dB to +0.06 dB and −0.15 dB to
+0.07 dB of them, respectively.
Normally, ACQPPS produces a favorable
PSNR not only for sequences with small object motions
but also for those with large amounts of motion. In particular,
if a sequence includes large object motions or a considerable
amount of motion, the advantage of the ACQPPS algorithm is
obvious, as ACQPPS can adaptively choose different shapes
and sizes for the search pattern, which suits efficient large
motion search.
Such a search advantage can be observed when ACQPPS is
compared with DS. It is known that DS uses a simple diamond
pattern for very low complexity motion search. For
video sequences containing slow and small motions,
for example, Miss Am (QCIF) and Mother Daughter (CIF)
at 30 fps, the PSNR performance of DS and ACQPPS is
relatively close, which indicates that DS performs well in
the case of simple motion search. When video images contain
complicated and large amounts of motion, however, DS is
unable to yield good PSNR, as its motion search is easily
trapped in an undesirable local minimum. For example, the
PSNR differences between DS and ACQPPS are 0.34 dB and
0.44 dB when Foreman (CIF) is tested with 1 Mbps at 30 fps
and 150 Kbps at 10 fps, respectively. Furthermore, ACQPPS
produces an average PSNR of +0.02 dB to +0.36 dB higher
than DS in the case of real-time video encoding, and +0.07 dB
to +1.94 dB higher in the low bit and frame rate environment.
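These DS gaps can be recomputed directly from the Foreman (CIF) rows of Tables 6 and 8:

```python
# PSNR gaps between ACQPPS and DS for Foreman (CIF), taken from
# Table 6 (1 Mbps, 30 fps) and Table 8 (150 Kbps, 10 fps).
ds     = {"1Mbps@30fps": 35.91, "150Kbps@10fps": 31.29}
acqpps = {"1Mbps@30fps": 36.25, "150Kbps@10fps": 31.73}

for case in ds:
    # Positive values mean ACQPPS is better; expect 0.34 dB and 0.44 dB.
    print(case, round(acqpps[case] - ds[case], 2))
```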
Table 8: Average PSNR performance for experiment with low bit and frame rates.
Sequence FS DS UCBDS TSS FSS HEX EPZS UMHexagonS ACQPPS
Foreman (QCIF) 34.88 34.44 34.19 34.40 34.42 33.99 34.85 34.80 34.80
Carphone (QCIF) 34.12 33.99 33.96 34.02 33.99 33.84 34.08 34.04 34.06
News (QCIF) 35.28 35.21 35.20 35.11 35.20 35.19 35.25 35.21 35.24
Miss Am (QCIF) 38.36 38.23 38.33 38.25 38.23 38.31 38.28 38.27 38.34
Suzie (QCIF) 36.54 36.40 36.34 36.44 36.39 36.26 36.52 36.50 36.50
Highway (QCIF) 36.19 35.80 36.01 35.96 35.94 35.90 36.13 36.09 35.98
Football (SIF) 25.11 24.82 24.92 24.79 24.84 24.89 25.08 25.10 25.01
Table Tennis (SIF) 27.57 26.65 26.95 26.85 26.84 26.86 27.57 27.60 27.45
Foreman (CIF) 31.95 31.29 31.32 31.16 31.25 31.06 31.90 31.79 31.73
Mother Daughter (CIF) 36.07 35.84 35.95 35.90 35.92 35.91 36.07 36.02 36.05
Stefan (CIF) 27.02 24.59 24.92 24.11 24.12 24.96 26.89 26.67 26.53
Highway (CIF) 37.21 36.88 36.98 36.94 36.92 36.94 37.12 37.09 37.01
Table 9: Average number of search points per MB for experiment with low bit and frame rates.
Sequence FS DS UCBDS TSS FSS HEX EPZS UMHexagonS ACQPPS
Foreman (QCIF) 2020.51 90.20 140.01 134.64 133.92 125.12 163.63 190.38 98.94
Carphone (QCIF) 1836.04 58.40 102.76 112.65 111.28 100.74 141.56 160.32 71.81
News (QCIF) 1680.68 34.74 74.11 87.22 88.92 79.64 96.40 102.30 52.61
Miss Am (QCIF) 1406.26 32.60 64.56 75.39 75.08 67.74 68.05 63.20 44.22
Suzie (QCIF) 1823.23 52.96 94.39 110.43 106.30 95.11 115.96 112.17 64.30
Highway (QCIF) 1710.86 42.06 84.42 97.99 97.34 87.36 98.22 97.77 58.13
Football (SIF) 1914.43 80.13 132.67 123.20 125.01 119.62 192.88 246.76 92.51
Table Tennis (SIF) 1731.44 50.10 98.45 97.71 100.73 93.82 159.45 182.19 64.39
Foreman (CIF) 1789.76 91.32 140.31 124.01 124.24 120.48 154.55 170.89 88.62
Mother Daughter (CIF) 1467.56 42.21 78.32 87.45 87.75 78.42 90.40 78.14 52.36
Stefan (CIF) 1663.89 65.44 110.17 100.53 102.36 103.71 153.97 194.64 78.69
Highway (CIF) 1715.63 52.26 97.20 109.24 107.49 96.94 91.74 92.27 64.45
The number of search points for each method, which
mainly represents the algorithm complexity, is also measured
to compare the search efficiency of the different approaches.
The NSP results show that the search efficiency of ACQPPS
is higher than that of the other algorithms, as ACQPPS produces
very good performance, in terms of PSNR, with a reasonably
small NSP. The NSP of ACQPPS is among the lowest of
all the methods.
When ACQPPS is compared with DS, it can be seen that
ACQPPS has a similar NSP to DS. The NSP
of ACQPPS is usually slightly higher than that of DS;
however, the increase in NSP is
limited and very reasonable, and in turn brings
ACQPPS much better PSNR for the encoded video.
Furthermore, for video sequences containing complex
and quick object motions, for example, Foreman (CIF) and
Stefan (CIF) at 30 fps, the NSP of ACQPPS can be even less
than that of DS, which verifies that ACQPPS achieves better
search efficiency than DS, owing to its highly adaptive
search patterns.
In general, ACQPPS combines very low complexity
with high search performance, which makes it especially
suitable for hardware architecture implementation.
7.2. Design Resources for ACQPPS Motion Estimator. As the
complexity and search points of ACQPPS have been greatly
reduced, design resources used by ACQPPS architecture
can be kept at a very low level. The main part of design
resources is for the SAD calculator. Each BPU requires one 32-
bit processing element (PE) to implement SAD calculations.
Every PE has two 8-bit pixel data inputs, one from the
current block and the other from the reference block. Besides,
every PE contains 16 subtractors, 8 three-input adders, and 1
latch register, and does not require extra interim registers or
accumulators. As a whole, a 32-bit 4-PE array is needed
to implement the pipelined multilevel SAD calculator, which
requires in total 64 subtractors, 32 three-input adders, and 4
latch registers. Other related design resources mainly include
an 18 × 18 register array, a 16 × 16 register array, a few
accumulators, subtractors, and comparators, which are
used to generate the block SAD results, residual data, and
final estimated MVs. Moreover, some other multiplexers,
registers, memory access, and data flow control logic gates
are also needed in the architecture. A comparison of design
resources between ACQPPS and other ME architectures [33–36]
is presented in Table 10. The results show that the proposed
ACQPPS architecture can utilize greatly reduced design
Table 10: Performance comparison between proposed ACQPPS and other motion estimation hardware architectures.
                                [33]         [34]         [35]                      [36]         Proposed architecture
Type                            ASIC         ASIC         ASIC                      ASIC         FPGA + DSP
Algorithm                       FS           FS           FS                        FS           ACQPPS
Search range                    [−16, +15]   [−32, +31]   [−16, +15]                [−16, +15]   Flexible
Gate count                      103 K        154 K        67 K                      108 K        35 K
Supported block sizes           All          All          8 × 8, 16 × 16, 32 × 32   All          All
Freq. [MHz]                     66.67        100          60                        100          75
Max fps of CIF                  102          60           30                        56           120
Min Freq. [MHz] for CIF 30 fps  19.56        50           60                        54           18.75
Table 11: Design resources for system-on-platform architecture.
Target FPGA Critical Path Gates DFFs/Latches
XC4VSX35FG668-10 5 ns 279,774 3,388
Target FPGA LUTs CLB Slices Resource
XC4VSX35FG668-10 3,161 3,885 25%
Table 12: DMA performance for video sequence transfer.
QCIF 4:2:0 YCrCb    DMA Write (ms)    DMA Read (ms)    DMA R/W (ms)
WildCard-4          0.556             0.491            0.515
CIF 4:2:0 YCrCb     DMA Write (ms)    DMA Read (ms)    DMA R/W (ms)
WildCard-4          2.224             1.963            2.059
resources to realize a high-performance motion estimator for
H.264 encoding.
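The cost metric these PEs compute is the plain sum of absolute differences (SAD). As a minimal software model of what one block comparison produces (the block contents below are arbitrary examples):

```python
def sad(cur, ref):
    """Sum of absolute differences between two equally sized pixel blocks
    (2-D lists of 8-bit luma samples). This is the per-candidate cost the
    PE array accumulates when matching a current block against a block of
    the reference search window."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(cur, ref)
               for c, r in zip(cur_row, ref_row))

# Tiny 2x2 example: |10-11| + |12-12| + |14-13| + |16-20| = 6
print(sad([[10, 12], [14, 16]], [[11, 12], [13, 20]]))
```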
7.3. Throughput of ACQPPS Motion Estimator. Unlike
FS, which has a fixed search range, the search points and search
range of ACQPPS depend on the video sequence. ACQPPS
search points increase if a video sequence contains
considerable or quick motions; on the contrary, search
points are reduced if a video sequence includes slow or
small amounts of motion.
The ME scheme with a fixed block size can typically
be applied to the throughput analysis. In that case, the worst
case is motion estimation using 4 × 4 blocks, which
is the most time consuming among the fixed block sizes.
Hence, the overall throughput result produced by the ACQPPS
architecture can be reasonably generalized and evaluated.
In general, if the clock frequency is 50 MHz and the
memory (SRAM, BRAM, and DRAM) structure is organized
as DWORD (32-bit) for each data access, the ACQPPS
hardware architecture needs an average of approximately
12.39 milliseconds for motion estimation in the worst
case of 4 × 4 blocks. For a real hardware architecture
implementation, the typical throughput in the worst case
of 4 × 4 blocks can represent the overall motion search ability
of this motion estimator architecture.
Therefore, the ACQPPS architecture can complete
motion estimation for more than four CIF (352 × 288) video
sequences, or one equivalent 4CIF (704 × 576) video sequence,
at a 75 MHz clock frequency within each 33.33-millisecond
time slot (30 fps) to meet the real-time encoding requirement
for a low design cost and low bit rate implementation.
The throughput of the ACQPPS architecture can be
compared with those of a variety of other recently developed
motion estimator hardware architectures, as illustrated in
Table 10. The comparison results show that the proposed
ACQPPS architecture achieves higher throughput than the
other hardware architectures at a reduced operational
clock frequency. Generally, it requires only a very low
clock frequency, that is, 18.75 MHz, to generate the motion
estimation results for CIF video sequences at 30 fps.
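The throughput figures quoted for this architecture in Table 10 are mutually consistent under a simple linear cycle-budget model; the sketch below assumes the required clock scales linearly with the number of CIF-sized streams processed per 30 fps frame slot:

```python
# Linear scaling check of the ACQPPS throughput numbers.
min_freq_cif_mhz = 18.75   # minimum clock for one CIF stream at 30 fps
clock_mhz = 75.0           # stated operating frequency

# CIF-sized streams sustainable at 30 fps (four CIF = one 4CIF frame),
# and the maximum CIF frame rate at the full clock.
print(clock_mhz / min_freq_cif_mhz)        # streams per 33.33 ms slot
print(clock_mhz / min_freq_cif_mhz * 30)   # max CIF fps
```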
7.4. Realization of System Architecture. Table 11 lists the
design resources utilized by the system-on-platform framework.
The implementation results indicate that the system architecture
uses approximately 25% of the FPGA design resources
when there is no hardware IP accelerator integrated in the
platform system. If video functions are needed, more design
resources are demanded in order to integrate and accommodate
the necessary IP modules. Table 12 gives the performance
results of the platform DMA video frame transfer feature.
Different DMA burst sizes result in different DMA
data transfer rates. In our case, the maximum DMA burst size
is defined to accommodate a whole CIF 4:2:0 video frame,
that is, 38,016 DWORDs for each DMA data transfer buffer.
Accordingly, the DMA transfer results verify that it only
takes an average of approximately 2 milliseconds to transfer
a whole CIF 4:2:0 video frame on the WildCard-4. This
transfer performance can sufficiently support up to the level 4
bitstream rate for the H.264 BP video encoding system.
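The 38,016-DWORD buffer size follows directly from the 4:2:0 frame geometry, as this small check shows:

```python
# One CIF 4:2:0 frame expressed in 32-bit DWORDs, matching the maximum
# DMA burst buffer of 38,016 DWORDs used on the WildCard-4.
w, h = 352, 288                          # CIF luma dimensions
luma_bytes = w * h                       # Y plane, 8 bits per sample
chroma_bytes = 2 * (w // 2) * (h // 2)   # Cb + Cr, each subsampled 2:1 x 2:1
frame_dwords = (luma_bytes + chroma_bytes) // 4
print(frame_dwords)
```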
7.5. Overall Encoding Performance. In view of the complexity
analysis of the H.264 video tasks described in Section 2, the most
time-consuming task is motion estimation; the other encoding
tasks have much less overhead. Therefore, the video tasks can
be scheduled to operate in parallel, pipelined stages, as
displayed in Figures 9 and 10, for the proposed architecture
Table 13: An overall performance comparison for H.264 BP video encoding systems.
Implementation [37] [38] [39] Proposed architecture
Architecture ASIC Codesign Codesign Codesign (extensible multiple processing cores)
ME Algorithm Full Search (FS) Full Search (FS) Hexagon (HEX) ACQPPS
Freq. [MHz] 144 100 81 75
Max fps of CIF 272.73 5.125 18.6 120
Min Freq. [MHz] for CIF 30 fps 15.84 585 130.65 18.75
Core Voltage Supply 1.2 V 1.2 V 1.2 V 1.2 V
I/O Voltage Supply 1.8/2.5/3.3 V 1.8/2.5/3.3 V 2.5/3.3 V 2.5/3.3 V
model. In this case, the overall encoding time for a video
sequence is approximately equal to the following:

Encoding time = Total motion estimation time
    + Processing time of DCT/Q for the last block
    + max{ Processing time of IDCT/Q⁻¹ + MC + Deblocking for the last block,
           Processing time of CAVLC for the last block }.        (8)
The processing time of DCT/Q, IDCT/Q⁻¹, MC,
Deblocking Filter, and CAVLC for a divided block directly
depends on the architecture design of each module. On
average, the overhead of those video tasks for encoding an
individual block is much less than that of motion estimation.
As a whole, the encoding time contributed by those video
tasks for the last block can even be ignored when compared
to the total processing time of the motion estimator for a
whole video sequence. Therefore, to simplify the overall
encoding performance analysis for the proposed architecture
model, the total encoding overhead derived from the system
architecture for a video sequence can be approximately
regarded as
Encoding time ≈ Total motion estimation time.        (9)
This simplified system encoding performance analysis is
valid as long as the video tasks operate in concurrent,
pipelined stages with efficient optimization techniques.
Accordingly, when the proposed ACQPPS motion estimator
is integrated into the system architecture to perform the
motion search, the overall encoding performance of the
proposed architecture model can be generalized.
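Equations (8) and (9) can be illustrated numerically; the per-block tail times below are hypothetical placeholders, chosen only to show why the tail is negligible next to the ME total:

```python
# Pipelined encoding-time model of equations (8)-(9): with the video tasks
# overlapped, only the last block's DCT/Q plus the slower of
# (IDCT/Q^-1 + MC + deblocking) and CAVLC extends the schedule.
def encoding_time_ms(me_total, dct_q, recon_path, cavlc):
    return me_total + dct_q + max(recon_path, cavlc)

me_total = 12.39  # ms, worst-case ME figure from Section 7.3
# Hypothetical per-block tail times (ms), for illustration only.
t = encoding_time_ms(me_total, dct_q=0.02, recon_path=0.05, cavlc=0.03)
print(round(t, 2))  # close to me_total alone, as equation (9) assumes
```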
A performance comparison is presented in Table 13,
where the proposed architecture is compared with some
other recently developed H.264 BP video encoding systems
[37–39], including both fully dedicated hardware and codesign
architectures. The results indicate that the proposed
system-on-platform architecture, when integrated with the
IP accelerators, can yield a very good performance which
is comparable to or even better than other H.264 video
encoding systems. In particular, compared with the other codesign
architectures, the proposed system has much higher
encoding throughput, about 30 and 6 times higher
than the processing ability of the architectures presented
in [38, 39], respectively. The high performance
of the proposed architecture derives directly from the
efficient ACQPPS motion estimation architecture and the
techniques employed for the system optimizations.
8. Conclusions
An integrated reconfigurable hardware-software codesign,
IP accelerated, system-on-platform architecture is proposed
in this paper. The efficient virtual socket interface and
optimization approaches for hardware realization have been
presented. The system architecture is flexible for the host
interface control and extensible with multiple cores, which
can construct a useful integrated and embedded
system approach for the dedicated functions.
An advanced application of this proposed architecture
is to facilitate the development of an H.264 video encoding
system. As motion estimation is the most complicated
and important task in a video encoder, a novel block-based
adaptive motion estimation search algorithm, ACQPPS,
and its hardware architecture are developed to reduce
the complexity to an extremely low level, while keeping the
encoding performance, in terms of PSNR and bit rate,
as high as possible. It is beneficial to integrate video IP
accelerators, especially the ACQPPS motion estimator, into the
architecture framework for improving the overall encoding
performance. The proposed system architecture is mapped
onto an integrated FPGA device, WildCard-4, toward an
implementation of a simplified H.264 BP video encoder.
In practice, with the proposed system architecture, the
realization of multistandard video codecs other than the H.264
video applications can be greatly facilitated and efficiently
verified. It can be expected that the advantages
of the proposed architecture will become more desirable for
prototyping future video encoding systems, as new video
standards emerge continually, for example, the coming
H.265 draft.
Acknowledgment
The authors would like to thank the support from Alberta
Informatics Circle of Research Excellence (iCore), Xilinx
Inc., Natural Science and Engineering Research Council of
Canada (NSERC), Canada Foundation for Innovation (CFI),
and the Department of Electrical and Computer Engineering
at the University of Calgary.
References
[1] M. Tekalp, Digital Video Processing, Signal Processing Series,
Prentice Hall, Englewood Cliffs, NJ, USA, 1995.
[2] Information technology – generic coding of moving pictures
and associated audio information: video, ISO/IEC 13818-2,
September 1995.
[3] Video Coding for Low Bit Rate Communication, ITU-T
Recommendation H.263, March 1996.
[4] Coding of audio-visual objects – part 2: visual, amendment
1: visual extensions, ISO/IEC 14496-2/AMD 1, April 1999.
[5] Joint Video Team of ITU-T and ISO/IEC JTC 1, Draft ITU-T
recommendation and final draft international standard
of joint video specification (ITU-T Rec. H.264 | ISO/IEC
14496-10 AVC), JVT-G050r1, May 2003; JVT-K050r1 (non-integrated
form) and JVT-K051r1 (integrated form), March
2004; Fidelity Range Extensions JVT-L047 (non-integrated
form) and JVT-L050 (integrated form), July 2004.
[6] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra,
Overview of the H.264/AVC video coding standard, IEEE
Transactions on Circuits and Systems for Video Technology, vol.
13, no. 7, pp. 560–576, 2003.
[7] S. Wenger, H.264/AVC over IP, IEEE Transactions on Circuits
and Systems for Video Technology, vol. 13, no. 7, pp. 645–656,
2003.
[8] B. Zeidman, Designing with FPGAs and CPLDs, Publishers
Group West, Berkeley, Calif, USA, 2002.
[9] S. Notebaert and J. D. Cock, Hardware/Software Co-design of
the H.264/AVC Standard, Ghent University, White Paper, 2004.
[10] W. Staehler and A. Susin, IP Core for an H.264 Decoder SoC,
Universidade Federal do Rio Grande do Sul (UFRGS), White
Paper, October 2008.
[11] R. Chandra, IP-Reuse and Platform Base Designs, STMicroelec-
tronics Inc., White Paper, February 2002.
[12] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J.
Sullivan, Rate-constrained coder control and comparison of
video coding standards, IEEE Transactions on Circuits and
Systems for Video Technology, vol. 13, no. 7, pp. 688–703, 2003.
[13] J. Ostermann, J. Bormans, P. List, et al., Video coding
with H.264/AVC: tools, performance, and complexity, IEEE
Circuits and Systems Magazine, vol. 4, no. 1, pp. 7–28, 2004.
[14] M. Horowitz, A. Joch, F. Kossentini, and A. Hallapuro,
H.264/AVC baseline profile decoder complexity analysis,
IEEE Transactions on Circuits and Systems for Video Technology,
vol. 13, no. 7, pp. 704–716, 2003.
[15] S. Saponara, C. Blanch, K. Denolf, and J. Bormans, The JVT
advanced video coding standard: complexity and performance
analysis on a tool-by-tool basis, in Proceedings of the Packet
Video Workshop (PV '03), Nantes, France, April 2003.
[16] J. R. Jain and A. K. Jain, Displacement measurement and its
application in interframe image coding, IEEE Transactions on
Communications, vol. 29, no. 12, pp. 1799–1808, 1981.
[17] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro,
Motion compensated interframe coding for video conferencing,
in Proceedings of the IEEE National Telecommunications
Conference (NTC '81), vol. 4, pp. 1–9, November 1981.
[18] R. Li, B. Zeng, and M. L. Liou, A new three-step search
algorithm for block motion estimation, IEEE Transactions on
Circuits and Systems for Video Technology, vol. 4, pp. 438–442,
1994.
[19] L.-M. Po and W.-C. Ma, A novel four-step search algorithm
for fast block motion estimation, IEEE Transactions on
Circuits and Systems for Video Technology, vol. 6, no. 3, pp.
313–317, 1996.
[20] L.-K. Liu and E. Feig, A block-based gradient descent search
algorithm for block motion estimation in video coding, IEEE
Transactions on Circuits and Systems for Video Technology, vol.
6, no. 4, pp. 419–421, 1996.
[21] S. Zhu and K. K. Ma, A new diamond search algorithm for
fast block-matching motion estimation, in Proceedings of the
International Conference on Information, Communications and
Signal Processing (ICICS '97), vol. 1, pp. 292–296, Singapore,
September 1997.
[22] C. Zhu, X. Lin, and L.-P. Chau, Hexagon-based search
pattern for fast block motion estimation, IEEE Transactions
on Circuits and Systems for Video Technology, vol. 12, no. 5, pp.
349–355, 2002.
[23] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim,
A novel unrestricted center-biased diamond search algorithm
for block motion estimation, IEEE Transactions on Circuits
and Systems for Video Technology, vol. 8, no. 4, pp. 369–377,
1998.
[24] Y. Nie and K.-K. Ma, Adaptive rood pattern search for fast
block-matching motion estimation, IEEE Transactions on
Image Processing, vol. 11, no. 12, pp. 1442–1449, 2002.
[25] H. C. Tourapis and A. M. Tourapis, Fast motion estimation
within the H.264 codec, in Proceedings of the IEEE International
Conference on Multimedia and Expo (ICME '03), vol. 3,
pp. 517–520, Baltimore, Md, USA, July 2003.
[26] A. M. Tourapis, Enhanced predictive zonal search for single
and multiple frame motion estimation, in Visual Communications
and Image Processing, vol. 4671 of Proceedings of SPIE,
pp. 1069–1079, January 2002.
[27] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG,
Fast integer pel and fractional pel motion estimation for
AVC, JVT-F016, December 2002.
[28] K. Sühring, H.264 JM Reference Software v.15.0, September
2008, http://iphome.hhi.de/suehring/tml/download.
[29] Annapolis Micro Systems, WildCard-4 Reference Manual,
12968-000 Revision 3.2, December 2005.
[30] Xilinx Inc., Virtex-4 User Guide, UG070 (v2.3), August
2007.
[31] Xilinx Inc., XtremeDSP for Virtex-4 FPGAs User Guide,
UG073(v2.1), December 2005.
[32] A. M. Tourapis, O. C. Au, and M. L. Liou, Predictive
motion vector field adaptive search technique (PMVFAST)
enhanced block based motion estimation, in Proceedings of
the IEEE Visual Communications and Image Processing (VCIP
'01), pp. 883–892, January 2001.
[33] Y.-W. Huang, T.-C. Wang, B.-Y. Hsieh, and L.-G. Chen,
Hardware architecture design for variable block size motion
estimation in MPEG-4 AVC/JVT/ITU-T H.264, in Proceedings
of the IEEE International Symposium on Circuits and
Systems (ISCAS '03), vol. 2, pp. 796–798, May 2003.
[34] M. Kim, I. Hwang, and S. Chae, A fast VLSI architecture for
full-search variable block size motion estimation in MPEG-4
AVC/H.264, in Proceedings of the IEEE Asia and South Pacific
Design Automation Conference, vol. 1, pp. 631–634, January
2005.
[35] J.-F. Shen, T.-C. Wang, and L.-G. Chen, A novel low-power
full-search block-matching motion-estimation design
for H.263+, IEEE Transactions on Circuits and Systems for
Video Technology, vol. 11, no. 7, pp. 890–897, 2001.
[36] S. Y. Yap and J. V. McCanny, A VLSI architecture for
advanced video coding motion estimation, in Proceedings
of the IEEE International Conference on Application-Specific
Systems, Architectures, and Processors (ASAP '03), vol. 1, pp.
293–301, June 2003.
[37] S. Mochizuki, T. Shibayama, M. Hase, et al., A 64 mW high
picture quality H.264/MPEG-4 video codec IP for HD mobile
applications in 90 nm CMOS, IEEE Journal of Solid-State
Circuits, vol. 43, no. 11, pp. 2354–2362, 2008.
[38] R. R. Colenbrander, A. S. Damstra, C. W. Korevaar, C. A.
Verhaar, and A. Molderink, Co-design and implementation
of the H.264/AVC motion estimation algorithm using co-simulation,
in Proceedings of the 11th IEEE EUROMICRO
Conference on Digital System Design Architectures, Methods and
Tools (DSD '08), pp. 210–215, September 2008.
[39] Z. Li, X. Zeng, Z. Yin, S. Hu, and L. Wang, The design and
optimization of H.264 encoder based on the nexperia platform,
in Proceedings of the 8th IEEE International Conference
on Software Engineering, Artificial Intelligence, Networking, and
Parallel/Distributed Computing (SNPD '07), vol. 1, pp. 216–219,
July 2007.
[40] S. Winkler and F. Dufaux, Video quality evaluation for
mobile applications, in Visual Communications and Image
Processing, vol. 5150 of Proceedings of SPIE, pp. 593–603,
Lugano, Switzerland, July 2003.
[41] M. Ries, O. Nemethova, and M. Rupp, Motion based
reference-free quality estimation for H.264/AVC video streaming,
in Proceedings of the 2nd International Symposium on
Wireless Pervasive Computing (ISWPC '07), pp. 355–359,
February 2007.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 893897, 16 pages
doi:10.1155/2009/893897
Research Article
FPSoC-Based Architecture for
a Fast Motion Estimation Algorithm in H.264/AVC
Obianuju Ndili and Tokunbo Ogunfunmi
Department of Electrical Engineering, Santa Clara University, Santa Clara, CA 95053, USA
Correspondence should be addressed to Tokunbo Ogunfunmi, togunfunmi@scu.edu
Received 21 March 2009; Revised 18 June 2009; Accepted 27 October 2009
Recommended by Ahmet T. Erdogan
There is an increasing need for high quality video on low power, portable devices. Possible target applications range from
entertainment and personal communications to security and health care. While H.264/AVC answers the need for high quality video
at lower bit rates, it is significantly more complex than previous coding standards and thus results in greater power consumption
in practical implementations. In particular, motion estimation (ME) in H.264/AVC consumes the largest power in an H.264/AVC
encoder. It is therefore critical to speed up integer ME in H.264/AVC via fast motion estimation (FME) algorithms and hardware
acceleration. In this paper, we present our hardware oriented modifications to a hybrid FME algorithm, our architecture based
on the modified algorithm, and our implementation and prototype on a PowerPC-based Field Programmable System on Chip
(FPSoC). Our results show that the modified hybrid FME algorithm on average outperforms previous state-of-the-art FME
algorithms, while its losses when compared with FSME, in terms of PSNR performance and computation time, are insignificant. We
show that although our implementation platform is FPGA-based, our implementation results compare favourably with previous
architectures implemented on ASICs. Finally we also show an improvement over some existing architectures implemented on
FPGAs.
Copyright 2009 O. Ndili and T. Ogunfunmi. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
Motion estimation (ME) is by far the most powerful
compression tool in the H.264/AVC standard [1, 2], and
it is generally carried out in two stages: integer-pel then
fractional pel as a refinement of the integer-pel search.
ME in H.264/AVC features variable block sizes, quarter-
pixel accuracy for the luma component (one-eighth pixel
accuracy for the chroma component), and multiple reference
pictures. However the power of ME in H.264/AVC comes at
the price of increased encoding time. Experimental results
[3, 4] have shown that ME can consume up to 80% of
the total encoding time of H.264/AVC, with integer ME
consuming a greater proportion. In order to meet real-
time and low power constraints, it is desirable to speed
up the ME process. Two approaches to ME speed-up
include designing fast ME algorithms and accelerating ME in
hardware.
Considering the algorithm approach, there are tradi-
tional, single search fast algorithms such as new three-step
search (NTSS) [5], four-step search (4SS) [6], and diamond
search (DS) [7]. However, these algorithms were developed
for fixed block sizes and cannot efficiently support variable
block size ME (VBSME) for H.264/AVC. In addition, while
these algorithms are good for small search ranges and low
resolution video, at higher definition for some high motion
sequences such as Stefan, these algorithms can drop into a
local minimum in the early stages of the search process [4].
In order to have more robust fast algorithms, some hybrid
fast algorithms that combine earlier single search techniques
have been proposed. One such algorithm was proposed by Yi et al.
[8, 9]. They proposed a fast ME algorithm known variously
as the Simplified Unified Multi-Hexagon (SUMH) search
or Simplified Fast Motion Estimation (SFME) algorithm.
SUMH is based on UMHexagonS [4], a hybrid fast motion
estimation algorithm. Yi et al. show in [8] that with similar or
even better rate-distortion performance, SUMH reduces ME
time by about 55% and 94% on average when compared with
UMHexagonS and Fast Full Search, respectively. In addition,
SUMH yields a bit rate reduction of up to 18% when compared
with Full Search in low complexity mode. Both SUMH
and UMHexagonS are nonnormative parts of the H.264/AVC
standard.
Considering ME speed-up via hardware acceleration, although there has been some previous work on VLSI architectures for VBSME in H.264/AVC, the overwhelming majority of these works have been based on the Full Search Motion Estimation (FSME) algorithm. This is because FSME presents a regular-patterned search window, which in turn provides good candidate-level data reuse (DR) with regular searching flows. Good candidate-level DR reduces data access power. Power consumption in an integer ME module comes mainly from two parts: the data access power needed to read reference pixels from local memories and the computational power consumed by the processing elements. For FSME, the data access power is reduced because the reference pixels of neighbouring candidates overlap considerably. On the other hand, because of the exhaustive search done in FSME, the computational complexity, and thus the power consumed by the processing elements, is large.
Several low-power integer ME architectures with corresponding fast algorithms were designed for standards prior to H.264/AVC [10-13]. However, these architectures do not support H.264/AVC. Additionally, because the irregular searching flows of fast algorithms usually lead to poor intercandidate DR, the power reduction at the algorithm level is usually constrained by the power reduction achievable at the architecture level. There is therefore an urgent need for architectures with hardware-oriented fast algorithms for portable systems implementing H.264/AVC [14]. Note also that because the data flow of FME is very similar to that of the fractional-pel search, some hardware reuse can be achieved [15].
For H.264/AVC, previous works on architectures for fast motion estimation (FME) [14-18] have been based on diverse FME algorithms.
Rahman and Badawy in [16] and Byeon et al. in [17] base their works on UMHexagonS. In [14], Chen et al. propose a parallel, content-adaptive, variable block size 4SS algorithm, upon which their architecture is based. In [15], Zhang and Gao base their architecture on the following search sequence: Diamond Search (DS), Cross Search (CS), and finally fractional-pel ME.
In this paper, we base our architecture on SUMH, which has been shown in [8] to outperform UMHexagonS. We present hardware-oriented modifications to SUMH. We show that the modified SUMH has a better PSNR performance than the parallel, content-adaptive, variable block size 4SS proposed in [14]. In addition, our results (see Section 2) show that for the modified SUMH, the average PSNR loss is 0.004 dB to 0.03 dB when compared with FSME, while when compared with SUMH, most of the sequences show an average improvement of up to 0.02 dB and two of the sequences show an average loss of 0.002 dB. Thus, in general, there is an improvement over SUMH. In terms of percentage computational time savings, SUMH saves 88.3% to 98.8% when compared with FSME, while the modified SUMH saves 60.0% to 91.7%. Finally, in terms of percentage bit rate increase, when compared with FSME the modified SUMH shows a bit rate improvement (decrease in bit rate) of 0.02% in the sequence "Coastguard"; the worst bit rate increase, 1.29%, occurs in "Foreman". When compared with SUMH, there is a bit rate improvement of 0.03% to 0.34%.
The rest of this paper is organized as follows. In Section 2 we summarize integer-pel motion estimation in SUMH and present the hardware-oriented SUMH along with simulation results. In Section 3 we briefly present our proposed architecture based on the modified SUMH. We also present our implementation results as well as comparisons with prior works. In Section 4 we present our prototyping efforts on the XUPV2P development board. This board contains an XC2VP30 Virtex-II Pro FPGA with two hardwired PowerPC 405 processors. Finally, our conclusions are presented in Section 5.
2. Motion Estimation Algorithm
2.1. Integer-Pel SUMH Algorithm. H.264/AVC uses block matching for motion vector search. Integer-pel motion estimation uses the sum of absolute differences (SAD) as its matching criterion. The mathematical expression for SAD is given in

SAD(dx, dy) = Σ_{x=0..X−1} Σ_{y=0..Y−1} |a(x, y) − b(x + dx, y + dy)|,   (1)

(MV_x, MV_y) = (dx, dy) | min SAD(dx, dy).   (2)

In (1), a(x, y) and b(x, y) are the pixels of the current and candidate blocks, respectively, (dx, dy) is the displacement of the candidate block within the search window, and X × Y is the size of the current block. In (2), (MV_x, MV_y) is the motion vector of the best matching candidate block.
H.264/AVC features seven interprediction block sizes, which are 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4. These are referred to as block modes 1 to 7. An up layer block is a block that contains sub-blocks. For example, mode 5 or 6 is the up layer of mode 7, and mode 4 is the up layer of mode 5 or 6.
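The SAD matching of (1) and the minimization of (2) can be sketched directly in code. The following is an illustrative sketch, not the JM implementation; the array layout, default 4 × 4 block size, and rectangular search window are assumptions made for the example.

```python
def sad(cur, ref, dx, dy, X=4, Y=4):
    """Sum of absolute differences of (1): X-by-Y current block `cur`
    against the candidate block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[y][x] - ref[y + dy][x + dx])
               for x in range(X) for y in range(Y))

def best_mv(cur, ref, search_range, X=4, Y=4):
    """Exhaustive version of (2): return the (MVx, MVy) minimising SAD
    over all displacements in [search_range[0], search_range[1]]."""
    candidates = [(dx, dy)
                  for dx in range(search_range[0], search_range[1] + 1)
                  for dy in range(search_range[0], search_range[1] + 1)]
    return min(candidates, key=lambda d: sad(cur, ref, d[0], d[1], X, Y))
```

A fast algorithm such as SUMH evaluates the same cost, but only at the candidate positions generated by its search steps instead of the full rectangle.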
SUMH [8] utilizes five key steps for the intensive search of integer-pel motion estimation: cross search, hexagon search, multi-big hexagon search, extended hexagon search, and extended diamond search. For motion vector (MV) prediction, SUMH uses the spatial median and up layer predictors, while for SAD prediction, the up layer predictor is used. In median MV prediction, the median value of the adjacent blocks on the left, top, and top-right (or top-left) of the current block is used to predict the MV of the current block. The complete flow chart of the integer-pel motion vector search in SUMH is shown in Figure 1.
The convergence and intensive search conditions are determined by empirical thresholds shifted by a blocktype shift factor. The blocktype shift factor specifies the number of bits to shift to the right in order to obtain the corresponding thresholds for different block sizes. There are 8 blocktype shift factors corresponding to 8 block modes: 1 dummy block mode and the 7 block modes in H.264/AVC. The 8 block modes are 16 × 16 (dummy), 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4. The array of 8 blocktype shift factors corresponding, respectively, to these 8 block modes is given in

blocktype_shift_factor = {0, 0, 1, 1, 2, 3, 3, 1}.   (3)

The convergence search condition is described in pseudocode in

min_mcost < (ConvergeThreshold >> blocktype_shift_factor[blocktype]),   (4)

where min_mcost is the minimum motion vector cost. The intensive search condition is described in pseudocode in

((blocktype == 1 && min_mcost > (CrossThreshold1 >> blocktype_shift_factor[blocktype]))
 || (min_mcost > (CrossThreshold2 >> blocktype_shift_factor[blocktype]))),   (5)

where the thresholds are empirically set as follows: ConvergeThreshold = 1000, CrossThreshold1 = 800, and CrossThreshold2 = 7000.
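The two threshold tests of (4) and (5) can be sketched as small predicate functions. This is an illustrative sketch using the shift factors and empirical threshold values quoted above; the function names are ours, not the JM software's.

```python
# Shift factors for the 8 block modes (dummy 16x16, then modes 1-7).
BLOCKTYPE_SHIFT_FACTOR = [0, 0, 1, 1, 2, 3, 3, 1]
CONVERGE_THRESHOLD = 1000
CROSS_THRESHOLD1 = 800
CROSS_THRESHOLD2 = 7000

def converged(min_mcost, blocktype):
    """Convergence search condition of (4)."""
    return min_mcost < (CONVERGE_THRESHOLD >> BLOCKTYPE_SHIFT_FACTOR[blocktype])

def needs_intensive_search(min_mcost, blocktype):
    """Intensive search condition of (5)."""
    return ((blocktype == 1 and
             min_mcost > (CROSS_THRESHOLD1 >> BLOCKTYPE_SHIFT_FACTOR[blocktype]))
            or min_mcost > (CROSS_THRESHOLD2 >> BLOCKTYPE_SHIFT_FACTOR[blocktype]))
```

The right shift halves a threshold once per shift-factor bit, so smaller block modes face proportionally tighter thresholds.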
2.2. Hardware-Oriented SUMH Algorithm. The goal of our hardware-oriented modification is to make SUMH less sequential without incurring performance losses or increases in computation time.
The sequential nature of SUMH arises from the fact that there are many data dependencies. The most severe data dependency arises during the up layer predictor search step. This dependency forces the algorithm to sequentially and individually conduct the search for the 41 possible SADs in a 16 × 16 macroblock. The sequence begins with the 16 × 16 macroblock and then computes the SADs of the sub-blocks in each quadrant of the 16 × 16 macroblock. Performing the algorithm in this manner consumes a lot of computational time and power, yet its rate-distortion benefits can still be obtained in a parallel implementation. In our modification, we skip this search step.
The decision control structures in SUMH are another feature that makes the algorithm unsuitable for hardware implementation. In a parallel and pipelined implementation, these structures would require that the pipeline be flushed at random times. This in turn wastes clock cycles and adds more overhead to the hardware's control circuit. In our modification, we always treat the convergence condition as not satisfied and the intensive search condition as satisfied. This removes the decision control structures that make SUMH unsuitable for parallel processing. Another effect of this modification is that we expect a better rate-distortion performance. On the other hand, the expected disadvantage of this modification is an increase in computation time. However, as shown by our complexity analysis and results, this increase is minimal and is also easily compensated for by hardware acceleration.
Further modifications we make to SUMH are the removal of the small local search steps and the convergence search step.
Our modifications to SUMH allow us to process in parallel all the candidate macroblocks (MBs) for one current macroblock (CMB). We use the so-called HF3V2 2-stitched zigzag scan proposed in [19] in order to satisfy the data dependencies between CMBs. These data dependencies arise because of the side information used to predict the MV of the CMB. Note that if we desire to process several CMBs in parallel, we need to set the value of the MV predictor to the zero-displacement MV, that is, MV = (0, 0). Experiments in [20-22], as well as our own experiments [23], show that when the search window is centered around MV = (0, 0), the average PSNR loss is less than 0.2 dB compared with when the median MV is also used. Figure 2 shows the complete flow chart of the modified integer-pel SUMH.
2.3. Complexity Analysis of the Motion Estimation Algorithms. We consider a search range s. The number of search points to be examined by the FSME algorithm is directly proportional to the square of the search range: there are (2s + 1)^2 search points. Thus the algorithm complexity of Full Search is O(s^2).
We obtain the algorithm complexity of the modified SUMH algorithm by considering the algorithm complexity of each of its search steps as follows.
(1) Cross search: there are s search points both horizontally and vertically, yielding a total of 2s search points. Thus the algorithm complexity of this search step is O(2s).
(2) Hexagon and extended hexagon search: there are 6 search points in each of these search steps, yielding a total of 12 search points. Thus the algorithm complexity of this search step is constant, O(1).
(3) Multi-big hexagon search: there are (1/4)s hexagons with 16 search points per hexagon. This yields a total of 4s search points. Thus the algorithm complexity of this search step is O(4s).
(4) Diamond search: there are 4 search points in this search step. Thus the algorithm complexity of this search step is constant, O(1).
Therefore, in total there are 1 + 2s + 12 + 4 + 4s search points in the modified SUMH, and its algorithm complexity is O(6s).
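The per-step counts above can be checked with a few lines of arithmetic. This sketch simply encodes the formulas derived in the text (cross: 2s, hexagon plus extended hexagon: 12, multi-big hexagon: 4s, diamond: 4, plus the start candidate); the function names are ours.

```python
def fsme_points(s):
    """Full-search candidates for search range s: (2s + 1)^2."""
    return (2 * s + 1) ** 2

def modified_sumh_points(s):
    """Start candidate + cross (2s) + hexagons (12) + diamond (4) + multi-big hexagon (4s)."""
    return 1 + 2 * s + 12 + 4 + 4 * s

def worst_case_sumh_points(s):
    """Adds 2 small local searches (4 each), 1 convergence search (4),
    and 2 worst-case up layer predictor points: 14 extra in total."""
    return modified_sumh_points(s) + 14
```

For s = 16 these give 1089, 113, and 127 search points, matching the counts used in Table 1.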
In order to obtain the algorithm complexity of SUMH, we consider its worst-case complexity, even though the algorithm may terminate much earlier. The worst-case complexity of SUMH is similar to that of the modified SUMH, except that it adds 14 more search points. This number is obtained by adding 4 search points each for the 2 small local searches and the 1 convergence search, and 2 search points for the worst-case up layer predictor search. Thus for the worst-case SUMH there are in total 14 + 1 + 2s + 12 + 4 + 4s search points, and its algorithm complexity is O(6s). Note that in the best case, SUMH has only 5 search points: 1 for the initial search candidate and 4 for the convergence search.

[Figure 1: Flow chart of integer-pel search in SUMH. Steps: start (check predictors); convergence test; small local search; intensive search test; cross search; hexagon search; multi-big hexagon search; up layer predictor search; small local search; extended hexagon search; convergence test; extended diamond search; convergence search; stop.]

[Figure 2: Flow chart of the modified integer-pel search. Steps: start (check center and median MV predictor); cross search; hexagon search; multi-big hexagon search; extended hexagon search; extended diamond search; stop.]

Table 1: Complexity of algorithms in million operations per second (MOPS).

Algorithm        | Search points (s = 16) | MOPS for CIF video at 30 Hz
FSME             | 1089                   | 17103
Best case SUMH   | 5                      | 78
Worst case SUMH  | 127                    | 1995
Median case SUMH | 66                     | 1037
Modified SUMH    | 113                    | 1775
Another way to define the complexity of each algorithm is in terms of the number of required operations. We can then express the complexity in Million Operations Per Second (MOPS). To compare the algorithms in terms of MOPS we assume the following.
(1) The macroblock size is 16 × 16.
(2) The SAD cost function requires 2 × 16 × 16 data loads, 16 × 16 = 256 subtraction operations, 256 absolute operations, 256 accumulate operations, 41 compare operations, and 1 data store operation. This yields a total of 1322 operations for one SAD computation.
(3) CIF resolution is 352 × 288 pixels = 396 macroblocks.
(4) The frame rate is 30 frames per second.
(5) The total number of operations required to encode CIF video in real time is 1322 × 396 × 30 × z_a, where z_a is the number of search points for each algorithm.
Thus there are 15.7 z_a MOPS per algorithm, where one OP (operation) is the amount of computation it takes to obtain one SAD value.
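The MOPS figures follow mechanically from assumptions (1)-(5). The sketch below reproduces the arithmetic; the constant names are ours.

```python
# Operations for one SAD: 2*16*16 loads, 256 subs, 256 abs, 256 accs,
# 41 compares, 1 store = 1322 operations.
OPS_PER_SAD = 2 * 16 * 16 + 256 + 256 + 256 + 41 + 1
MBS_PER_CIF_FRAME = (352 // 16) * (288 // 16)   # 22 * 18 = 396 macroblocks
FRAME_RATE = 30                                  # frames per second

def mops(search_points):
    """MOPS needed to encode CIF video in real time with `search_points`
    SAD evaluations per macroblock."""
    return OPS_PER_SAD * MBS_PER_CIF_FRAME * FRAME_RATE * search_points / 1e6
```

With 1089 search points (FSME) this yields about 17103 MOPS, and with 113 (modified SUMH) about 1775 MOPS, as in Table 1.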
In Table 1 we compare the computational complexities of the considered algorithms in terms of MOPS. As expected, FSME requires the largest number of MOPS. The number of MOPS required for the modified SUMH is about 10% less than that required for the worst-case SUMH and about 40% more than that required for the median-case SUMH.
2.4. Performance Results for the Modified SUMH Algorithm. Our experiments are done in JM 13.2 [24]. We use the following standard test sequences: "Stefan" (large motion), "Foreman" and "Coastguard" (large to moderate motion), and "Silent" (small motion). We chose these sequences because we consider them extreme cases in the spectrum of low bit-rate video applications. We also use the following
Table 2: Simulation conditions.

Sequence        | Quantization parameters | Search range | Frame size | No. of frames
Foreman         | 22, 25, 28, 31, 33, 35  | 32           | CIF        | 100
Mother-daughter | 22, 25, 28, 31, 33, 35  | 32           | CIF        | 150
Stefan          | 22, 25, 28, 31, 33, 35  | 16           | CIF        | 90
Flower          | 22, 25, 28, 31, 33, 35  | 16           | CIF        | 150
Coastguard      | 18, 22, 25, 28, 31, 33  | 32           | QCIF       | 220
Carphone        | 18, 22, 25, 28, 31, 33  | 32           | QCIF       | 220
Silent          | 18, 22, 25, 28, 31, 33  | 16           | QCIF       | 220
Table 3: Comparison of speed-up ratios with respect to Full Search. Each entry gives SUMH / modified SUMH for the given quantization parameter.

QP              | 18            | 22            | 25           | 28           | 31           | 33           | 35
Foreman         | N/A           | 48.55 / 8.16  | 41.55 / 6.86 | 32.68 / 5.66 | 25.87 / 4.77 | 21.68 / 4.23 | 19.11 / 3.74
Stefan          | N/A           | 15.35 / 4.62  | 13.16 / 4.21 | 12.20 / 3.93 | 10.67 / 3.50 | 10.05 / 3.23 | 8.96 / 3.06
Mother-daughter | N/A           | 16.63 / 2.49  | 19.31 / 2.72 | 21.56 / 3.01 | 28.63 / 3.47 | 35.43 / 4.20 | 43.90 / 5.08
Flower          | N/A           | 9.73 / 3.07   | 10.72 / 3.29 | 11.32 / 3.49 | 12.94 / 3.78 | 13.77 / 4.02 | 15.02 / 4.21
Coastguard      | 86.34 / 12.06 | 70.12 / 10.31 | 58.05 / 9.01 | 43.62 / 7.98 | 36.04 / 6.80 | 30.10 / 6.13 | N/A
Silent          | 21.86 / 3.54  | 16.74 / 3.18  | 13.17 / 2.99 | 11.90 / 2.82 | 9.29 / 2.66  | 8.56 / 2.64  | N/A
Carphone        | 24.67 / 4.14  | 29.44 / 4.62  | 37.12 / 5.38 | 46.97 / 6.02 | 53.97 / 7.07 | 64.07 / 8.82 | N/A
Table 4: Comparison of percentage time savings with respect to Full Search. Each entry gives SUMH / modified SUMH for the given quantization parameter.

QP              | 18            | 22            | 25            | 28            | 31            | 33            | 35
Foreman         | N/A           | 97.94 / 87.75 | 97.59 / 85.43 | 96.94 / 82.34 | 96.13 / 79.04 | 95.38 / 76.36 | 94.76 / 73.31
Stefan          | N/A           | 93.48 / 78.38 | 92.40 / 76.29 | 91.80 / 74.61 | 90.63 / 71.46 | 90.05 / 69.05 | 88.83 / 67.35
Mother-daughter | N/A           | 93.98 / 60.00 | 94.82 / 63.34 | 95.36 / 66.85 | 96.50 / 71.22 | 97.17 / 76.21 | 97.72 / 80.35
Flower          | N/A           | 89.72 / 67.45 | 90.67 / 69.62 | 91.16 / 71.37 | 92.27 / 73.56 | 92.71 / 75.14 | 93.34 / 76.27
Coastguard      | 98.84 / 91.71 | 98.57 / 90.30 | 98.27 / 88.91 | 97.70 / 87.47 | 97.22 / 85.29 | 96.67 / 83.70 | N/A
Silent          | 95.42 / 71.77 | 94.02 / 68.62 | 92.40 / 66.61 | 91.60 / 64.56 | 89.23 / 62.47 | 88.32 / 62.20 | N/A
Carphone        | 95.94 / 75.87 | 96.60 / 78.36 | 97.30 / 81.41 | 97.87 / 83.41 | 98.14 / 85.87 | 98.43 / 88.66 | N/A
sequences: "Mother-daughter" (small motion, talking head and shoulders), "Flower" (large motion with camera panning), and "Carphone" (large motion). The sequences are coded at 30 Hz. The picture sequence is IPPP with the I-frame refresh rate set at every 15 frames. We consider 1 reference frame. The rest of our simulation conditions are summarized in Table 2.
Figure 3 shows curves that compare the rate-distortion efficiencies of Full Search ME, SUMH, and the modified SUMH. Figure 4 shows curves that compare the rate-distortion efficiencies of Full Search ME and the single- and multiple-iteration parallel content-adaptive 4SS of [14]. In Tables 3 and 4 we show a comparison of the speed-up ratios of SUMH and the modified SUMH. Table 5 shows the average percentage bit rate increase of the modified SUMH when compared with Full Search ME and SUMH. Finally, Table 6 shows the average Y-PSNR loss of the modified SUMH when compared with Full Search ME and SUMH.
From Figures 3 and 4, we see that the modified SUMH has a better rate-distortion performance than the proposed parallel content-adaptive 4SS of [14], even under smaller search ranges. In Section 3 we will show comparisons of our supporting architecture with the supporting architecture
[Figure 3: Comparison of rate-distortion efficiencies (Y-PSNR in dB versus bit rate in kbps) of Full Search, SUMH, and the modified SUMH, for 1 reference frame and IPPP coding: (a) Stefan, CIF, SR = 16; (b) Foreman, CIF, SR = 32; (c) Silent, QCIF, SR = 16; (d) Coastguard, QCIF, SR = 32.]
proposed in [14]. Note, though, that the architecture in [14] is implemented on an ASIC (TSMC 0.18-μm 1P6M technology), while our architecture is implemented on an FPGA.
From Figure 3 and Table 6 we also observe that the largest PSNR losses occur in the "Foreman" sequence, while the smallest PSNR losses occur in "Silent". This is because the "Foreman" sequence has both high local object motion and greater high-frequency content. It therefore performs the worst under a given bit rate constraint. On the other hand, "Silent" is a low-motion sequence. It therefore performs much better under the same bit rate constraint.
Given the tested frames from Table 2 for each sequence, we observe additionally from Table 6 that Full Search performs better than the modified SUMH for sequences with larger local object (foreground) motion but little or no background motion. These sequences include "Foreman", "Carphone", "Mother-daughter", and "Silent". However, the rate-distortion performance of the modified SUMH improves for sequences with large foreground and background motions. Such sequences include "Flower", "Stefan", and "Coastguard". We therefore suggest that a yet greater improvement in the rate-distortion performance of
[Figure 4: Comparison of rate-distortion efficiencies (PSNR in dB versus bit rate in kbps) of Full Search (FS), the proposed content-adaptive parallel-VBS 4SS, and the single-iteration parallel-VBS 4SS of [25], for SR = 32, 1 reference frame, and IPPP coding: (a) Stefan, CIF; (b) Foreman, CIF; (c) Silent, CIF; (d) Coastguard, CIF. (Reproduced from [25].)]
the modified SUMH algorithm can be achieved by improving its local motion estimation.
For Table 3, we define the speed-up ratio as the ratio of the ME coding time of Full Search to the ME coding time of the algorithm under consideration. From Table 3 we see that the speed-up ratio increases as the quantization parameter (QP) decreases. This is because there are fewer skip-mode macroblocks as QP decreases. From our results in Table 3, we further calculate the percentage time savings t for ME calculation according to

t = (1 − 1/r) × 100,   (6)

where r is a data point in Table 3. The percentage time savings obtained are displayed in Table 4. From Table 4, we find that SUMH saves 88.3% to 98.8% in ME computation time compared with Full Search, while the modified SUMH saves 60.0% to 91.7%. Therefore, the modified SUMH does not incur much loss in terms of ME computation time.
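Equation (6) converts a speed-up ratio into a percentage time saving, which is how Table 4 is derived from Table 3. A one-line sketch (the function name is ours):

```python
def time_savings(r):
    """Percentage ME time saving of (6) from speed-up ratio r, where
    r = (full-search ME time) / (ME time of the algorithm under test)."""
    return (1 - 1 / r) * 100
```

For example, the Coastguard entries at QP 18 in Table 3 (86.34 for SUMH, 12.06 for the modified SUMH) map to the 98.84% and 91.71% savings reported in Table 4.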
In our experiments we set rate-distortion optimization to high-complexity mode (i.e., rate-distortion optimization is turned on), in order to ensure that all of the algorithms compared have a fair chance to yield their highest rate-distortion performance. From Table 5 we find that the
Table 5: Average percentage bit rate increase for the modified SUMH (negative values denote a decrease in bit rate).

Sequence        | vs. Full Search | vs. SUMH
Foreman         | 1.29            | −0.04
Stefan          | 0.40            | −0.34
Mother-daughter | 0.15            | −0.05
Flower          | 0.19            | −0.17
Coastguard      | −0.02           | −0.03
Silent          | 0.56            | −0.33
Carphone        | 0.27            | −0.06
average percentage bit rate increase of the modified SUMH is very low. When compared with Full Search, there is a bit rate improvement (decrease in bit rate) in "Coastguard" of 0.02%. The worst bit rate increase, 1.29%, occurs in "Foreman". When compared with SUMH, there is a bit rate improvement (decrease in bit rate) ranging from 0.03% (in "Coastguard") to 0.34% (in "Stefan").
From Table 6 we see that the average PSNR loss for the modified SUMH is very low. When compared with Full Search, the PSNR loss for the modified SUMH ranges from 0.006 dB to 0.03 dB. When compared with SUMH, most of the sequences show a PSNR improvement of up to 0.02 dB, while two of the sequences show a PSNR loss of 0.002 dB.
Thus, in general, the losses when compared with Full Search are insignificant, while there is an improvement when compared with SUMH. We therefore conclude that the modified SUMH can be used instead of Full Search, without much penalty, for ME in H.264/AVC.
3. Proposed Supporting Architecture
Our top-level architecture for fast integer VBSME is shown in Figure 5. The architecture is composed of a search window (SW) memory, a current MB memory, an address generation unit (AGU), a control unit, a block of processing units (PUs), an SAD combination tree, a comparison unit, and a register for storing the 41 minimum SADs and their associated motion vectors.
While the current and reference frames are stored off-chip in external memory, the current MB (CMB) data and the search window (SW) data are stored in on-chip, dual-port block RAMs (BRAMs). The SW memory has N 16 × 16 BRAMs that store N candidate MBs, where N is related to the search range s. N can be chosen to be any factor or multiple of |s| so as to achieve a tradeoff between speed and hardware costs. For example, if we consider a search range of s = 16, then we can choose N such that N ∈ {. . . , 32, 16, 8, 4, 2, 1}. The AGU generates addresses for the blocks being processed.
There are N PUs, each containing 16 processing elements (PEs) in a 1D array. A PU, shown in Figure 6, calculates 16 4 × 4 SADs for one candidate MB, while a PE, shown in Figure 8, calculates the absolute difference between two pixels, one each from the candidate MB and the current MB. As Figure 6 shows, groups of 4 PEs in the PU calculate 1 column of 4 × 4 SADs. These are stored, via demultiplexing, in registers D1-D4, which hold the inputs to the SAD combination tree, one of which is shown in Figure 7. For N PUs there are N SAD combination trees. Each SAD combination tree further combines the 16 4 × 4 output SADs from one PU to yield a total of 41 SADs per candidate MB. Figure 7 shows that the 16 4 × 4 SADs are combined such that registers D6 contain the 4 × 8 SADs, D7 the 8 × 8 SADs, D8 the 8 × 16 SADs, D9 the 16 × 8 SADs, D10 the 8 × 4 SADs, and finally D11 the 16 × 16 SAD. These SADs are compared appropriately in the comparison unit (CU). The CU consists of 41 N-input comparing elements (CEs). A CE is shown in Figure 9.
3.1. Address Generation Unit. For each of the N MBs being processed simultaneously, the AGU generates the addresses of the top row and the leftmost column of 4 × 4 sub-blocks. The address of each sub-block is the address of its top-left pixel. From the addresses of the top row and leftmost column of 4 × 4 sub-blocks, we obtain the addresses of all other block partitions in the MB.
The interface of the AGU is fixed, and we parameterize it by the address of the current MB, the search type, and the
Table 6: Average Y-PSNR loss for the modified SUMH.

Sequence        | vs. Full Search | vs. SUMH
Foreman         | 0.0290 dB       | 0.0065 dB
Stefan          | 0.0058 dB       | 0.0125 dB
Mother-daughter | 0.0187 dB       | 0.0020 dB
Flower          | 0.0042 dB       | 0.0002 dB
Coastguard      | 0.0078 dB       | 0.0018 dB
Silent          | 0.0098 dB       | 0.0018 dB
Carphone        | 0.0205 dB       | 0.0225 dB
Table 7: Search passes for the modified SUMH.

Pass  | Description
1-2   | Horizontal scans of the cross search; candidate MBs separated by 2 pixels
3-4   | Vertical scans of the cross search; candidate MBs separated by 2 pixels
5     | Hexagon search (6 search points)
6-13  | Multi-big hexagon search ((1/4)|s| hexagons, each containing 16 search points)
14    | Extended hexagon search (6 search points)
15    | Diamond search (4 search points)
search pass. The search type is the modified SUMH; however, we can expand our architecture to support other types of search, for example, Full Search. The search pass depends on the search step and the search range. We show, for instance, in Table 7 that there are 15 search passes for the modified SUMH for a search range s = 16. There is a separation of 2 pixels between adjacent search points in the cross search; therefore, address generation for search passes 1 to 4 in Table 7 is straightforward. For the remaining search passes 5-15, tables of constant offset values are obtained from the JM reference software [24]. These offset values are the separations in pixels between the minimum MV from the previous search pass and the candidate search point. In
general, the affine address equations can be represented by

AE_x = i C_x,   AE_y = i C_y,   (7)

where AE_x and AE_y are the horizontal and vertical addresses of the top-left pixel in the MB, i is a multiplier, and C_x and C_y are constants obtained from the JM reference software.
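The affine form of (7) can be sketched as follows. This is a hypothetical illustration: the JM software supplies the per-pass offset constants, and the base-address term (the position the offsets are taken from, e.g. the previous pass's minimum MV) and the example constants below are our assumptions.

```python
def candidate_address(base_x, base_y, i, c_x, c_y):
    """Top-left pixel address of the i-th candidate of a search pass:
    the affine offsets AE_x = i*C_x and AE_y = i*C_y of (7), applied
    relative to an assumed base address (base_x, base_y)."""
    return base_x + i * c_x, base_y + i * c_y

# Horizontal cross-search candidates are 2 pixels apart, so C = (2, 0):
# successive i values step the address right in 2-pixel strides.
```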
3.2. Memory. Figures 10 and 11 show the CMB and search window (SW) memory organization for N = 8 PUs. Both CMB and SW memories are synthesized into BRAMs. Considering a search range of s = 16, there are 15 search passes for the modified SUMH search flowchart shown in Figure 2. These search passes are shown in Table 7. In each search pass, 8 MBs are processed in parallel; hence the SW memory organization shown in Figure 11. The SW memory is 128 bytes wide, and the required memory size is 2048 bytes.
For the same search range s = 16, if FSME were used along with levels A and B data reuse, the SW size would be
[Figure 5: The proposed architecture for fast integer VBSME: an SW memory holding candidate MBs 1 to N, a current MB (CMB) memory, the AGU, a control unit, processing units PU 1 to PU N, the SAD combination tree, a comparison unit with comparing elements CE 1 to CE 41, and a register that stores the 41 minimum SADs and associated MVs, connected to external memory.]
[Figure 6: The architecture of a Processing Unit (PU): 16 processing elements (PE 1 to PE 16) whose outputs pass through adder stages and are demultiplexed, under control, into registers D1-D4.]
48 × 48 pixels, that is, 2304 bytes [25]. Thus, by using the modified SUMH, we achieve an 11% on-chip memory saving even without a data reuse scheme.
In each clock cycle, we load 64 bits of data. This means that it takes 256 cycles to load the data for one search pass and 3840 (256 × 15) cycles to load the data for one CMB. Under similar conditions, FSME would take 288 clock cycles to load the data for one CMB. Thus the ratio of the required memory bandwidth of the modified SUMH to that of FSME is 13.3. While this ratio is undesirably high, it is well mitigated by the fact that there are only 113 search locations for one CMB in the modified SUMH, compared with 1089 search locations for one CMB in FSME. In other words, the amount of computation for one CMB in the modified SUMH is approximately 0.1 times that of FSME. Thus there is an overall power saving in using the modified SUMH instead of FSME.
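The cycle counts and the bandwidth ratio above follow from the stated parameters: 64 bits (8 bytes) loaded per cycle, a 2048-byte search window per pass, 15 passes per CMB for the modified SUMH, and a 48 × 48-pixel window loaded once for FSME. A sketch of the arithmetic (variable names are ours):

```python
BYTES_PER_CYCLE = 8  # 64-bit loads

sumh_cycles_per_pass = 2048 // BYTES_PER_CYCLE            # 2048-byte SW per pass
sumh_cycles_per_cmb = sumh_cycles_per_pass * 15           # 15 search passes
fsme_cycles_per_cmb = 48 * 48 // BYTES_PER_CYCLE          # one 48x48-pixel window
bandwidth_ratio = sumh_cycles_per_cmb / fsme_cycles_per_cmb
```

This gives 256 cycles per pass, 3840 cycles per CMB, 288 cycles for FSME, and a bandwidth ratio of about 13.3, matching the figures in the text.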
3.3. Processing Unit. Table 8 shows the pixel data schedule for the search passes of the N PUs. In Table 8 we consider as an illustrative example the cross search and a search range s = 16, hence the given pixel coordinates.
[Figure 7: The SAD combination tree: adder stages combine the top and bottom 4 × 4 SADs held in registers D5 into registers D6-D11, producing the 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, and 16 × 16 SADs.]
Table 8: Data schedule for the processing units (PUs). Each entry gives the first and last rows of reference pixels loaded during the pass.

Clock  | PU1                                    | PU8                                    | Comments
1-16   | (−15, 0)-(0, 0) ... (−15, 15)-(0, 15)  | (−1, 0)-(14, 0) ... (−1, 15)-(14, 15)  | Search pass 1: left horizontal scan of cross search
17-32  | (1, 0)-(16, 0) ... (1, 15)-(16, 15)    | (15, 0)-(30, 0) ... (15, 15)-(30, 15)  | Search pass 2: right horizontal scan of cross search
33-48  | (0, −15)-(15, −15) ... (0, 0)-(15, 0)  | (0, −1)-(15, −1) ... (0, 14)-(15, 14)  | Search pass 3: top vertical scan of cross search
49-64  | (0, 1)-(15, 1) ... (0, 16)-(15, 16)    | (0, 15)-(15, 15) ... (0, 30)-(15, 30)  | Search pass 4: bottom vertical scan of cross search
...    | ...                                    | ...                                    | ...

Table 8 shows that it takes 16 cycles to output the 16 4 × 4 SADs from each PU.
3.4. SAD Combination Tree. The data schedule for the SAD combination is shown in Table 9. There are N SAD combination (SC) trees, each processing the 16 4 × 4 SADs that are output from one PU. It takes 5 cycles to combine the 16 4 × 4 SADs and output the 41 SADs for the 7 interprediction block sizes in H.264/AVC: 1 16 × 16 SAD, 2 16 × 8 SADs, 2 8 × 16 SADs, 4 8 × 8 SADs, 8 8 × 4 SADs, 8 4 × 8 SADs, and 16 4 × 4 SADs.
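The combination itself is pure addition: arranging the 16 4 × 4 SADs of one candidate MB as a 4 × 4 grid, every larger block's SAD is the sum of the 4 × 4 SADs it covers. The sketch below is a software model of this combination, not the RTL; the grid indexing convention is our assumption.

```python
def combine_sads(s44):
    """s44[r][c] is the 4x4 SAD at row r, column c of the 16x16 MB.
    Returns all 41 SADs for the 7 H.264/AVC block sizes."""
    s48 = [s44[r][c] + s44[r + 1][c] for r in (0, 2) for c in range(4)]   # 8 4x8
    s84 = [s44[r][c] + s44[r][c + 1] for r in range(4) for c in (0, 2)]   # 8 8x4
    s88 = [s44[r][c] + s44[r][c + 1] + s44[r + 1][c] + s44[r + 1][c + 1]
           for r in (0, 2) for c in (0, 2)]                               # 4 8x8
    s816 = [s88[0] + s88[2], s88[1] + s88[3]]                             # 2 8x16 (left, right)
    s168 = [s88[0] + s88[1], s88[2] + s88[3]]                             # 2 16x8 (top, bottom)
    s1616 = [s168[0] + s168[1]]                                           # 1 16x16
    flat44 = [v for row in s44 for v in row]                              # 16 4x4
    return flat44 + s48 + s84 + s88 + s816 + s168 + s1616                 # 41 SADs
```

The counts 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 match the seven block modes, and the nesting mirrors the D5-D11 register stages of Figure 7.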
[Figure 8: Processing element (PE): computes the absolute difference between a candidate MB pixel and a current MB pixel, accumulates it, and stores the result in a register under control signals.]
Table 9: Data schedule for the SAD combination (SC) units. The schedule is identical for SC1 through SC8.

Clock | Output of each SC tree
17    | 16 4 × 4 SADs
18    | 8 4 × 8 SADs and 8 8 × 4 SADs
19    | 4 8 × 8 SADs
20    | 2 8 × 16 SADs and 2 16 × 8 SADs
21    | 1 16 × 16 SAD
3.5. Comparison Unit. The data schedule for the CU is shown in Table 10. The CU consists of 41 CEs, each element processing the N SADs of the same interprediction block size from the N PUs. Each CE compares SADs in pairs. It therefore takes log2(N) + 1 cycles to output the 41 minimum SADs. Thus, given N = 8, the CU consumes 4 cycles.
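A pairwise comparator reduces N values in log2(N) stages, and the extra cycle accounts for registering the result. The following is a software model of one comparing element, under the assumption that N is a power of two (as in the N = 8 case here); the function names are ours.

```python
import math

def ce_cycles(n):
    """Cycles for one CE to emit its minimum: log2(n) tree stages + 1."""
    return int(math.log2(n)) + 1

def tree_min(values):
    """Pairwise (tournament) reduction, mirroring the comparator tree;
    assumes len(values) is a power of two."""
    while len(values) > 1:
        values = [min(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return values[0]
```

For N = 8 this gives the 4 cycles stated in the text, with one pairwise stage per level of the tree.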
3.6. Summary of Dataflow. The dataflow represented by the
data schedules described in Tables 8-10 may be
summarized by the algorithmic state machine (ASM) chart
shown in Figure 12. The ASM chart also represents the
mapping of the modified SUMH algorithm of Figure 2 onto
our proposed architecture of Figure 5.
In our ASM chart, there are 6 states and 2 decision
boxes. The states are labeled S1 to S6, while the decision
boxes are labeled Q1 and Q2. In each state box, we provide
a summary description of the state as well as its output
variables in italic font.
From Figure 12 we see that implementation of the
modified SUMH on our proposed architecture IP core starts
in state S1, when the motion vector (MV) predictors are
checked. This is done by the PowerPC processor which is
part of our SoC prototyping platform (see Section 4). The
MV predictors are stored in external memory and accessed
from there by the PowerPC processor. The output from state
S1 is the MV predictors. In the next state, S2, the minimum
MV cost is obtained and mode decision is done to obtain the
right blocktype. This is also done by the PowerPC processor,
and the outputs of this state are the minimum MV, its SAD
Figure 9: Comparing element (CE). An N-input comparator selects the minimum of N SADs; the CE takes control signals and the AGU input, and outputs the minimum SAD and its MV.
Figure 10: Data arrangement in current macroblock (CMB) memory: 16 × 16 pixels, stored as two 8-byte words per row.
Figure 11: Data arrangement in search window (SW) memory: 128 × 16 pixels, stored as four 8-byte words per row.
cost, its blocktype, and its address. The minimum MV cost is
obtained by minimizing the cost in

J_motion(m, REF | λ_motion) = SAD(dx, dy, REF, m) + λ_motion (R(m − p) + R(REF)),   (8)

where m = (m_x, m_y)^T is the current MV being considered,
REF denotes the reference picture, λ_motion is the Lagrangian
multiplier, SAD(dx, dy, REF, m) is the SAD cost obtained as
in (1), p = (p_x, p_y) is the MV used for the prediction,
R(m − p) represents the number of bits used for MV coding, and
R(REF) is the number of bits for coding REF.
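For illustration, the cost in (8) can be evaluated as below. The rate term uses the signed exp-Golomb code length (the H.264/AVC code for MV differences); the function names, the lambda value, and the example inputs are ours, not the paper's.

```python
def se_bits(v):
    """Length in bits of the signed exp-Golomb codeword for value v,
    as used for motion vector differences in H.264/AVC."""
    k = 2 * v - 1 if v > 0 else -2 * v   # map signed value -> codeNum
    return 2 * (k + 1).bit_length() - 1

def motion_cost(sad, mv, pred_mv, ref_bits, lam):
    """J_motion = SAD + lambda * (R(m - p) + R(REF)), per equation (8)."""
    rate = se_bits(mv[0] - pred_mv[0]) + se_bits(mv[1] - pred_mv[1]) + ref_bits
    return sad + lam * rate
```

For example, with SAD = 100, MV (2, 0), predictor (0, 0), 1 bit for REF, and lambda = 4, the rate is 5 + 1 + 1 = 7 bits, giving a cost of 128.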
In state S3, some of the outputs from state S2
are passed into our proposed architecture IP core. In state
S4, the AGU computes the addresses of candidate blocks,
using the address of the MV predictor as the base address,
and the control unit waits for the initialization of search
window data in the BRAMs. The output of state S4 is
Table 10: Data schedule for comparison unit (CU).

Clock | CE1-CE16  | CE17-CE32             | CE33-CE36 | CE37-CE40               | CE41
22    | 8 4×4 SAD | 8 4×8 SAD, 8 8×4 SAD  | 8 8×8 SAD | 8 8×16 SAD, 8 16×8 SAD  | 8 16×16 SAD
23    | 4 4×4 SAD | 4 4×8 SAD, 4 8×4 SAD  | 4 8×8 SAD | 4 8×16 SAD, 4 16×8 SAD  | 4 16×16 SAD
24    | 2 4×4 SAD | 2 4×8 SAD, 2 8×4 SAD  | 2 8×8 SAD | 2 8×16 SAD, 2 16×8 SAD  | 2 16×16 SAD
25    | 1 4×4 SAD | 1 4×8 SAD, 1 8×4 SAD  | 1 8×8 SAD | 1 8×16 SAD, 1 16×8 SAD  | 1 16×16 SAD
S1: check MV predictors in PowerPC. Outputs: MVs.
S2: obtain min MV cost and perform mode decision in PowerPC. Outputs: min MV, min SAD cost, blocktype, and base address.
S3: pass min MV, min SAD cost, and address into the IP core. Outputs: min MV, min SAD cost, and base address.
S4: AGU computes addresses of candidate blocks from the base address, and the control unit waits for initialization of BRAM data for the search pass. Outputs: addresses, BRAM initialization complete.
S5: PUs and SCs compute SADs for the search pass. Outputs: addresses, SADs.
S6: obtain min SAD for the search pass and update the base address from the addresses. Outputs: base address, 41 min SADs, 41 min MVs.
Q1: last search pass of step? No: back to S4. Yes: go to Q2.
Q2: last search pass of modified SUMH? No: back to S4 for the next search step. Yes: back to S1.
(States S3 to S6 and both decision boxes execute inside the IP core.)
Figure 12: Algorithmic state machine chart for the modified SUMH algorithm.
the addresses of the candidate blocks and a flag indicating
that BRAM initialization is complete. In state S5, the
processing units and SAD combination trees compute the
SADs of the candidate blocks. The output of S5 is the
computed SADs and the unchanged AGU addresses. In state
S6, the CU compares these SADs with previously computed
SADs and obtains the 41 minimum SADs. The outputs
of S6 are the 41 minimum SADs and their corresponding
addresses.
In decision box Q1, we check whether the current search
pass is the last search pass of a particular search step, for
example, the cross search step. If not, we continue with the other
passes of that search step. If so, we go to decision box Q2.
In Q2 we check whether it is the last search pass of the modified
SUMH algorithm. If not, we move on to the next search step,
for example, the hexagon search. If so, we check the MV
predictors of the next current macroblock, according to the
HF3V2 2-stitched zigzag scan proposed in [19].
Table 11: Synthesis results.

Process (µm): 0.13 (FPGA)
Number of slices: 11.4 K
Number of slice flip-flops: 16.4 K
Number of 4-input LUTs: 18.7 K
Total equivalent gate count: 388 K
Max frequency (MHz): 145.2
Algorithm: Modified SUMH
Video specifications: CIF 30 fps
Search range: 16
Block size: 16×16 to 4×4
Minimum required frequency (MHz): 24.1
Number of 16 × 8-bit dual-port RAMs: 129
Memory utilization (Kb): 398
Voltage (V): 1.5
Power consumed (mW): 25
3.7. Synthesis Results and Analysis. The proposed architecture
has been implemented in Verilog HDL. Simulation and
functional verification of the architecture were done using the
Mentor Graphics ModelSim tool [26]. We then synthesized
the architecture using the Xilinx synthesis tool (XST). XST
is part of the Xilinx integrated software environment (ISE)
[27]. After synthesis, place and route is done targeting the
Virtex-II Pro XC2VP30 Xilinx FPGA on our development
board. Finally, we obtain a power analysis of our design using
the XPower tool, which is also part of Xilinx ISE.
Our synthesis results are shown in Table 11. From
Table 11 we see that our architecture can achieve a maximum
frequency of 145.2 MHz. The FPGA power consumption of
our architecture is 25 mW, obtained using the Xilinx XPower
tool. The total equivalent gate count is 388 K.
Our simulations in ModelSim support the dataflow
described in Sections 3.1 to 3.6. We find that it takes 27
cycles to obtain the minimum SAD from each search pass,
after initialization. The 27 cycles comprise 1 cycle
for the AGU, 1 cycle to read data from on-chip memory, 16
cycles for the PU, 5 cycles for the SAD combination tree,
and 4 cycles for the comparison unit. Therefore, it takes
405 (15 × 27) cycles to complete the search for 1 CMB, 1
reference frame, and s = 16. For a CIF image (396 MBs) at
30 Hz and considering 5 reference frames, a minimum clock
frequency of approximately 24.1 (405 × 396 × 30 × 5) MHz
is required. Thus, with a maximum possible clock speed of
145.2 MHz, our architecture can process in real time CIF
sequences within a search range of 16 and using 5 reference
frames.
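The cycle and frequency figures above can be reproduced with a few lines of arithmetic (a sanity check using the counts stated in the text; the variable names are ours):

```python
# Per-pass latency: AGU + memory read + PU + SAD combination tree + comparison unit
cycles_per_pass = 1 + 1 + 16 + 5 + 4
assert cycles_per_pass == 27

# 15 search passes per current macroblock (CMB)
cycles_per_cmb = 15 * cycles_per_pass
assert cycles_per_cmb == 405

# CIF: 396 macroblocks per frame, 30 frames/s, 5 reference frames
f_min_hz = cycles_per_cmb * 396 * 30 * 5
print(f_min_hz / 1e6)   # about 24.06 MHz, quoted as 24.1 MHz in the text
```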
We provide Table 12, which compares our architecture
with previous state-of-the-art architectures implemented on
ASICs. Note that a direct comparison of our implementation
with implementations done in ASIC technology is impossible
because the platforms are different.
ASICs still provide the highest performance in terms of
area, power consumed, and maximum frequency. However,
we provide Table 12 not for direct comparisons, but to
show that our implementation achieves ASIC-like levels of
performance. This is desirable because it indicates that an
ASIC implementation of our architecture would yield even
better performance results. Our Verilog implementation
was kept portable in order to simplify FPGA-to-ASIC
migration.
From Table 12 we see that our architecture achieves
many desirable results. The most remarkable is that the
power consumption is very low despite the fact that our
implementation is done on an FPGA, which typically
consumes more power than an ASIC. Besides the low power
consumption of our architecture, another favorable result is
that the algorithm we use has better PSNR performance than
the algorithms used in the other works. We also note that
our architecture achieves the highest maximum frequency.
By extension, our architecture is the only one that can support
high definition (HD) 1080p sequences at 30 Hz, with a search
range s = 16 and 1 reference frame. This would need a
minimum frequency of approximately 85.9 MHz.
In the next section we discuss our prototyping efforts and
compare our results with similar works.
4. Architecture Prototype
The top-level prototype design of our architecture is shown
in Figure 13. It is based on the prototype design in [25], where
Canals et al. propose an FPSoC-based architecture for the
Full Search block matching algorithm; their implementation
is done on a Virtex-4 FPGA.
Our prototype is done on the XUPV2P development
board available from Digilent Inc. [28]. The board contains
a Virtex-II Pro XC2VP30 FPGA with 30,816 Logic Cells, 136
18-bit multipliers, 2,448 Kb of block RAM, and two PowerPC
Processors. There are several connectors which include a
serial RS-232 port for communication with a host personal
computer. The board also features JTAG programming via
on-board USB2 port as well as a DDR SDRAM DIMM that
can accept up to 2 Gbytes of RAM.
The embedded development tool used to design our
prototype is the Xilinx Platform Studio (XPS), in the Xilinx
Embedded Development Kit (EDK) [29]. The EDK makes
it relatively simple to integrate user Intellectual Property
(IP) cores as peripherals in an FPSoC. Hardware/software
cosimulation can then be done to test the user IP.
In our prototype design, as shown in Figure 13, we
employ a PowerPC hardcore embedded processor as our
controller. The processor sends stimuli to the motion
estimation IP core and reads results back for comparison.
The processor is connected to the other design modules via
a 64-bit processor local bus (PLB).
The boot program memory is a 64 kb BRAM. It contains
a bootloop program necessary to keep the processor in
a known state after we load the hardware and before we
load the software. The PLB connects to the user IP core
through an IP interface (IPIF). This interface exposes several
programmable interconnects. We use a slave-master FIFO
attachment that is 64 bits wide and 512 positions deep. The
status and control signals of the FIFO are available to the user
logic block. The user logic block contains logic for reading
Table 12: Comparison with other architectures implemented on ASICs.

                          | Chao et al. [11] | Miyakoshi et al. [12] | Lin [13] | Chen et al. [14] | This work
Process (µm)              | 0.35             | 0.18                  | 0.18     | 0.18             | 0.13 (FPGA)
Voltage (V)               | 3.3              | 1.0                   | 1.8      | 1.3              | 1.5
Transistor count          | 301 K            | 1000 K                | 546 K    | 708 K            | 388 K
Maximum frequency (MHz)   | 50               | 13.5                  | 48.67    | 66               | 145.2
Video spec.               | CIF 30 fps       | CIF 30 fps            | CIF 30 fps | CIF 30 fps     | CIF 30 fps
Operating frequency (MHz) | 50               | 13.5                  | 48.67    | 13.5             | 24.1
Algorithm                 | Diamond search   | Gradient descent      | 4SS      | Single-iteration parallel VBS 4SS w/ 1 ref. | Hardware oriented SUMH
Block size                | 16×16 and 8×8    | 16×16 and 8×8         | 16×16    | 16×16 to 4×4     | 16×16 to 4×4
Power (mW)                | 223.6            | 6.56                  | 8.46     | 2.13             | 25
Normalized power* (mW)    | 17.60            | 21.25                 | 8.46     | 4.08             | 69.02
Architecture              | 1D tree, no data reuse | 1D tree, no data reuse | 1D tree, Level A data reuse | 2D tree, Level B data reuse | 1D tree, no data reuse
Can support HD 1920×1080p | No               | No                    | No       | No               | Yes

*Normalized power (to 1.8 V, 0.18 µm) = Power × (0.18²/process²) × (1.8²/voltage²).
Figure 13: FPSoC prototype design of our architecture. The PowerPC processor and the boot program memory sit on a 64-bit PLB bus; the PLB IPIF provides read and write FIFOs with their control logic, and the user logic block contains the pixel data memory and the motion estimation IP core, exchanging status and control signals with the IPIF.
and writing to the FIFO and the Verilog implementation of
our architecture.
During operation, the PowerPC processor writes input
stimuli to the FIFO and sets status and control bits. The
Table 13: Comparison with other FPSoC architectures.

                         | Canals et al. [25] | This work
FPSoC FPGA               | Virtex-4           | Virtex-II Pro
Algorithm                | Full Search        | Hardware oriented SUMH
Video format             | QCIF               | QCIF
Search range             | 16                 | 16
Number of slices         | 12.5 K             | 11.4 K
Memory utilization (Kb)  | 784                | 398
Clock frequency (MHz)    | 100                | 100
user logic reads the status and control signals and, when
appropriate, reads data from the FIFO. The data passes into
the IP core and, when the ME computation is done, the
results are written back to the FIFO. The PowerPC reads the
results and compares them with expected results to verify the
accuracy of the IP. Intermediate results during the operation
are sent to a terminal on the host personal computer via the
RS-232 serial connection.
We target QCIF video for our prototype, in order to
compare our results with the results in [25]. Table 13 shows
this comparison. We see from Table 13 that our architecture
consumes fewer FPGA resources and has a lower memory
utilization. Again, we note that a direct comparison of both
architectures is complicated by the fact that different FPGAs
were used in the two prototyping platforms. The work in [25]
is based on a Virtex-4 FPGA, which uses 90-nm technology,
while our work is based on a Virtex-II Pro FPGA, which uses
130-nm technology.
5. Conclusion
In this paper we have presented our low-power, FPSoC-based
architecture for a fast ME algorithm in H.264/AVC. We
described our adopted fast ME algorithm, which is a hardware
oriented SUMH algorithm. We showed that the modified
SUMH has superior rate-distortion performance compared
to some existing state-of-the-art fast ME algorithms. We also
described our architecture for the hardware oriented SUMH.
We showed that the FPGA-based implementation of our
architecture yields ASIC-like levels of performance in terms
of speed, area, and power. Our results showed, in addition,
that our architecture has the potential to support HD 1080p,
unlike the other architectures we compared it with. Finally,
we discussed our prototyping efforts and compared
them with a similar prototyping effort. Our results showed
that our implementation uses fewer FPGA resources.
In summary, therefore, the modified SUMH is more
attractive than SUMH because it is hardware oriented. It
is also more attractive than Full Search because, although Full
Search is hardware oriented, it is much more complex than the
modified SUMH and thus requires more hardware area and
power, and a higher clock speed, for implementation.
We therefore conclude that for low-power handheld
devices, the modified SUMH can be used without much
penalty, instead of Full Search, for ME in H.264/AVC.
Acknowledgments
The authors acknowledge the support of Xilinx Inc.,
the Xilinx University Program, the Packard Foundation,
and the Department of Electrical Engineering, Santa Clara
University, California. The authors also thank the editor and
reviewers of this journal for their useful comments.
References
[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.
[2] G. J. Sullivan, P. Topiwala, and A. Luthra, "The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions," in Proceedings of the 27th Conference on Applications of Digital Image Processing, vol. 5558 of Proceedings of SPIE, pp. 454-474, August 2004.
[3] H.-C. Lin, Y.-J. Wang, K.-T. Cheng, et al., "Algorithms and DSP implementation of H.264/AVC," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '06), pp. 742-749, Yokohama, Japan, January 2006.
[4] Z. Chen, P. Zhou, and Y. He, "Fast integer pel and fractional pel motion estimation for JVT," in Proceedings of the 6th Meeting of the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Awaji Island, Japan, December 2002, JVT-F017.
[5] R. Li, B. Zeng, and M. L. Liou, "New three-step search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 438-442, 1994.
[6] L.-M. Po and W.-C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313-317, 1996.
[7] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 369-377, 1998.
[8] X. Yi, J. Zhang, N. Ling, and W. Shang, "Improved and simplified fast motion estimation for JM," in Proceedings of the 16th Meeting of the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Poznan, Poland, July 2005, JVT-P021.doc.
[9] X. Yi and N. Ling, "Improved normalized partial distortion search with dual-halfway-stop for rapid block motion estimation," IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 995-1003, 2007.
[10] C. De Vleeschouwer, T. Nilsson, K. Denolf, and J. Bormans, "Algorithmic and architectural co-design of a motion-estimation engine for low-power video devices," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1093-1105, 2002.
[11] W.-M. Chao, C.-W. Hsu, Y.-C. Chang, and L.-G. Chen, "A novel hybrid motion estimator supporting diamond search and fast full search," in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS '02), vol. 2, pp. 492-495, Phoenix, Ariz, USA, May 2002.
[12] J. Miyakoshi, Y. Kuroda, M. Miyama, K. Imamura, H. Hashimoto, and M. Yoshimoto, "A sub-mW MPEG-4 motion estimation processor core for mobile video application," in Proceedings of the Custom Integrated Circuits Conference (CICC '03), pp. 181-184, 2003.
[13] S.-S. Lin, Low-power motion estimation processors for mobile video application, M.S. thesis, Graduate Institute of Electronic Engineering, National Taiwan University, Taipei, Taiwan, 2004.
[14] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, S.-Y. Chien, and L.-G. Chen, "Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 5, pp. 568-576, 2007.
[15] L. Zhang and W. Gao, "Reusable architecture and complexity-controllable algorithm for the integer/fractional motion estimation of H.264," IEEE Transactions on Consumer Electronics, vol. 53, no. 2, pp. 749-756, 2007.
[16] C. A. Rahman and W. Badawy, "UMHexagonS algorithm based motion estimation architecture for H.264/AVC," in Proceedings of the 5th International Workshop on System-on-Chip for Real-Time Applications (IWSOC '05), pp. 207-210, Banff, Alberta, Canada, 2005.
[17] M.-S. Byeon, Y.-M. Shin, and Y.-B. Cho, "Hardware architecture for fast motion estimation in H.264/AVC video coding," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E89-A, no. 6, pp. 1744-1745, 2006.
[18] Y.-Y. Wang, Y.-T. Peng, and C.-J. Tsai, "VLSI architecture design of motion estimator and in-loop filter for MPEG-4 AVC/H.264 encoders," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '04), vol. 2, pp. 49-52, Vancouver, Canada, May 2004.
[19] C.-Y. Chen, C.-T. Huang, Y.-H. Chen, and L.-G. Chen, "Level C+ data reuse scheme for motion estimation with corresponding coding orders," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 4, pp. 553-558, 2006.
[20] S. Yalcin, H. F. Ates, and I. Hamzaoglu, "A high performance hardware architecture for an SAD reuse based hierarchical motion estimation algorithm for H.264 video coding," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '05), pp. 509-514, Tampere, Finland, August 2005.
[21] S.-J. Lee, C.-G. Kim, and S.-D. Kim, "A pipelined hardware architecture for motion estimation of H.264/AVC," in Proceedings of the 10th Asia-Pacific Conference on Advances in Computer Systems Architecture (ACSAC '05), vol. 3740 of Lecture Notes in Computer Science, pp. 79-89, Springer, Singapore, October 2005.
[22] C.-M. Ou, C.-F. Le, and W.-J. Hwang, "An efficient VLSI architecture for H.264 variable block size motion estimation," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1291-1299, 2005.
[23] O. Ndili and T. Ogunfunmi, "A hardware oriented integer pel fast motion estimation algorithm in H.264/AVC," in Proceedings of the IEEE/ECSI/EURASIP Conference on Design and Architectures for Signal and Image Processing (DASIP '08), Brussels, Belgium, November 2008.
[24] H.264/AVC Reference Software JM 13.2, 2009, http://iphome.hhi.de/suehring/tml/download.
[25] J. A. Canals, M. A. Martínez, F. J. Ballester, and A. Mora, "New FPSoC-based architecture for efficient FSBM motion estimation processing in video standards," in Proceedings of the International Society for Optical Engineering, vol. 6590 of Proceedings of SPIE, p. 65901N, 2007.
[26] Mentor Graphics ModelSim SE User's Manual, Software Version 6.2d, 2009, http://www.model.com/support.
[27] Xilinx ISE 9.1 In-Depth Tutorial, 2009, http://download.xilinx.com/direct/ise9_tutorials/ise9tut.pdf.
[28] Xilinx Virtex-II Pro Development System, 2009, http://www.digilentinc.com/Products/Detail.cfm?Prod=XUPV2P.
[29] Xilinx Platform Studio and Embedded Development Kit, 2009, http://www.xilinx.com/ise/embedded/edk_pstudio.htm.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 162078, 10 pages
doi:10.1155/2009/162078
Research Article
FPGA Accelerator for Wavelet-Based Automated
Global Image Registration
Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology,
Changsha 410073, China
Correspondence should be addressed to Baofeng Li, lbf@nudt.edu.cn
Received 14 February 2009; Accepted 30 June 2009
Recommended by Bertrand Granado
Wavelet-based automated global image registration (WAGIR) is fundamental for most remote sensing image processing algorithms
and extremely computation-intensive. With more and more algorithms migrating from ground computing to onboard computing,
an efficient dedicated architecture for WAGIR is desired. In this paper, a BWAGIR architecture is proposed based on a block
resampling scheme. BWAGIR achieves significant performance by pipelining computational logics, parallelizing the resampling
process and the calculation of the correlation coefficient, and parallelizing memory access. A proof-of-concept implementation with 1
BWAGIR processing unit performs at least 7.4X faster than the CL cluster system with 1 node, and at least 3.4X faster
than the MPM massively parallel machine with 1 node. Further speedup can be achieved by parallelizing multiple BWAGIR units.
The architecture with 5 units achieves a speedup of about 3X against the CL with 16 nodes and a comparative speed with the MPM
with 30 nodes. More importantly, the BWAGIR architecture can be deployed onboard economically.
Copyright © 2009 Baofeng Li et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
With the rapid innovations of remote sensing technology,
more and more remote sensing image processing algorithms
must be executed onboard instead of at a ground
station to meet the requirement of processing voluminous
remote sensing data in real time. Image registration [1, 2]
is the basis of many image processing operations, such
as image fusion, image mosaicking, and geographic navigation.
Considering the computation-intensive and memory-intensive
characteristics of remote sensing image registration
and the limited computing power of onboard
computers, implementing image registration efficiently and
effectively with a dedicated architecture is of great significance.
In the past twenty years, FPGA technology has
developed significantly. The capacity and performance of
FPGA chips have increased greatly to accommodate many
large-scale applications. Due to its excellent reconfigurability
and convenient design flow, the FPGA has become the
most popular choice for hardware designers implementing
application-specific architectures. Therefore,
implementing remote sensing image registration
efficiently in an FPGA is precisely the point of this paper. Though
Castro-Pareja et al. [3, 4] have proposed a fast automatic
image registration (FAIR) architecture for mutual
information-based 3D image registration in medical imaging
applications, few works addressing hardware acceleration
of remote sensing image registration have been
reported.
Many approaches have been proposed for remote
sensing image registration. As for hardware implementation,
only the automated algorithms are suitable, because
onboard computing demands that the algorithms be
accurate and robust and operate without manual
intervention. Proposed automated remote sensing image
registration algorithms can be classified into two categories:
CPs-based algorithms [5-12] and global algorithms
[13-22]. In the former, some matched control points
(CPs) are extracted from both images automatically to
decide the final mapping function. However, the problem
is that it is difficult to automatically determine efficient
CPs. The selected CPs need to be accurate, sufficient,
and evenly distributed. Missing or spurious
CPs make CPs-based algorithms unreliable and unstable
[23]. Hence, CPs-based algorithms are not in our
consideration.
Automated global registration, however, is an approach
that does not rely on point-to-point matching. The final
mapping function is computed globally over the images.
Therefore, the algorithms are stable and robust and easy
to automate. One of the disadvantages of
global registration is that it is computationally expensive.
Fortunately, wavelet decomposition helps to relieve this
situation because it provides a way to obtain the final
result progressively. A wavelet-based automated global image
registration (WAGIR) algorithm for remote sensing
applications proposed by Le Moigne et al. [13-15] has been
proved efficient and effective. In WAGIR, the lowest-resolution
wavelet subbands are first registered with a
rough accuracy and a wider search interval, and a local best
result is obtained. Next, this result is refined repeatedly through
iterative registrations on the higher-resolution subbands.
The final result is obtained at the highest-resolution
subbands, viz. the original images.
Many parallel schemes of WAGIR have been proposed in
previous works, such as the parameter-parallel (PP) scheme,
the image-parallel (IP) scheme, the hybrid-parallel (HP) scheme
which merges PP and IP, and the group-parallel (GP) scheme
[13, 24-27]; these are implemented targeting large, expensive
supercomputers, cluster systems, or grid systems that are
impractical to deploy onboard. In this paper, we propose a
block wavelet-based automated global image registration
(BWAGIR) architecture based on a block resampling scheme.
The architecture with 1 processing unit outperforms the CL
cluster system with 1 node by at least 7.4X, and the MPM
massively parallel machine with 1 node by at least 3.4X. And
the BWAGIR with 5 units achieves a speedup of about 3X
against the CL with 16 nodes and a comparable speed with
the MPM with 30 nodes. More importantly, our work
targets onboard computing.
The remainder of this paper is organized as follows. In
Section 2, the traditional WAGIR algorithm is reviewed and
analyzed based on the hierarchy architecture. The proposed
block resampling scheme is detailed in Section 3, and the
architecture of BWAGIR is presented in Section 4. Section 5
presents the proof-of-concept implementation and the
experimental results with comparison to several related
works. Finally, this paper is concluded in Section 6.
2. Wavelet-Based Automatic Global Image
Registration Algorithm
Image registration is the process that determines the most
accurate match between two images of the same scene
or object. In the global registration process, one image is
registered according to another known standard image. We
refer to the former as input image, the latter as reference
image, the best matching image as registered image, and the
image after each resampling process as resampled image.
2.1. Review of WAGIR Algorithm. WAGIR can be described
by the pseudocode in Algorithm 1. Here we assume that the
LL subbands form the feature space; 2D rotations and
translations are considered as the search space; the search
strategy follows the multiresolution approach provided by
wavelet decomposition; and the cross-correlation coefficient is
adopted as the similarity metric. Firstly, an N-level wavelet
decomposition decomposes the input image and the reference
image, each of size M × M, into nLLi and nLLr sequences, where n
denotes the corresponding decomposition level. Then NLLi
and NLLr, with the lowest resolution, are registered with an
accuracy of 2^N. A local best combination of rotation and
translations (bestθ, bestX, bestY) is obtained and used as
the search center when registering the next-level subbands,
(N − 1)LLi and (N − 1)LLr. Another combination with
accuracy 2^(N−1) is then obtained. This process iterates until the
overall best result with the expected accuracy is retrieved
after registering the original input image (0LLi) and reference
image (0LLr). Finally, a resampling process is carried out to
get the registered image.
At each level, the algorithm shown in Algorithm 2 is
employed to register nLLi and nLLr. The result of the previous
level (θC, XC, YC) is used as the search center. For each
combination of rotation and translations, the algorithm
shown in Algorithm 3 is performed to get a resampled
image of nLLr. Then a correlation coefficient is calculated
to measure the similarity between the resampled nLLr and
the nLLi. The combination corresponding to the maximal
correlation coefficient is the best result of the current level. The
resampling algorithm proceeds by sequentially selecting
one registered-image location at a time, calculating the
corresponding coordinate of the selected location in the reference
image, accessing the neighboring 4 × 4 pixel window in the
reference image, calculating the corresponding interpolation
weights according to the computed coordinate, and finally
calculating the pixel value of the selected location by the
cubic convolution interpolation method. The correlation
coefficient is calculated with (1):
coecient is calculated with (1):
C(A, B) =

M1
i=0

M1
j=0

A
i j
B
i j

1/M
2

M1
i=0

M1
j=0
A
i j

M1
i=0

M1
j=0
B
i j

M1
i=0

M1
j=0
A
2
i j
(1/M
2
)

M1
i=0

M1
j=0
A
i j

M1
i=0

M1
j=0
B
2
i j
(1/M
2
)

M1
i=0

M1
j=0
B
i j

.
(1)
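A direct software transcription of (1) looks as follows (our own sketch for illustration, not the authors' code; A and B are M × M images given as nested lists):

```python
from math import sqrt

def correlation(A, B):
    """Cross-correlation coefficient of two M x M images per equation (1)."""
    M = len(A)
    n = M * M
    sa = sum(v for row in A for v in row)            # sum of A
    sb = sum(v for row in B for v in row)            # sum of B
    sab = sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    saa = sum(v * v for row in A for v in row)       # sum of A^2
    sbb = sum(v * v for row in B for v in row)       # sum of B^2
    num = sab - sa * sb / n
    den = sqrt((saa - sa * sa / n) * (sbb - sb * sb / n))
    return num / den
```

Two identical non-constant images give a coefficient of 1.0; reversing one of them gives -1.0.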
Input: input image and reference image
Output: registered image
1 Initialize the registration process (wavelet level N; search scope: rotation
  angle (θscope_L, θscope_R), horizontal offset (Xscope_L, Xscope_R), and
  vertical offset (Yscope_L, Yscope_R));
2 Perform wavelet decomposition of the input image and reference image;
3 bestθ = 0; bestX = 0; bestY = 0;
4 stepθ = 2^N; stepX = 2^N; stepY = 2^N;
5 for (n = N; n >= 0; n--) do
    (width, height) = (image_width/2^n, image_height/2^n);
    // registering at the current wavelet level based on the results of the previous level
    Perform Register (nLLi, nLLr, bestθ, bestX, bestY, stepθ, stepX, stepY);
    (θscope_L, θscope_R) = (-stepθ, stepθ);
    (Xscope_L, Xscope_R) = (-stepX, stepX);
    (Yscope_L, Yscope_R) = (-stepY, stepY);
    stepθ /= 2; stepX /= 2; stepY /= 2;
    bestX *= 2; bestY *= 2; // size of the next wavelet subband is twice that of the current one
6 Resample (input image, bestθ, bestX, bestY, registered image);
  // last resample to obtain the result image.
7 Over.
Algorithm 1: Main WAGIR algorithm.
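The coarse-to-fine schedule of Algorithm 1 can be traced in one search dimension as follows (an illustrative sketch; the function and variable names are ours):

```python
def refinement_schedule(N, scope):
    """Return (level, step, candidates-per-dimension) for each wavelet level.
    Level N scans [-scope, scope] with step 2**N; every following level
    rescans +/- the previous step around the running best, with the step
    halved, down to step 1 at level 0 (the original image)."""
    levels = []
    step = 2 ** N
    lo, hi = -scope, scope
    for n in range(N, -1, -1):
        levels.append((n, step, (hi - lo) // step + 1))
        lo, hi = -step, step      # next level re-centres on the current best
        step //= 2
    return levels
```

For example, refinement_schedule(3, 16) gives [(3, 8, 5), (2, 4, 5), (1, 2, 5), (0, 1, 5)]: with these values, each level examines only 5 candidates per search dimension while the accuracy is refined from 8 pixels down to 1.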
Register (the registering algorithm)
Input: nLLi, nLLr, θcenter, Xcenter, Ycenter, stepθ, stepX, stepY
Output: local bestθ, bestX, and bestY
1 (angle, x, y) = (θscope_L, Xscope_L, Yscope_L); // control variables
2 max_co = -1; // record the maximum correlation
// the registration processing
3 while (angle <= θscope_R) do
    while (x <= Xscope_R) do
        while (y <= Yscope_R) do
            (θC, XC, YC) = (θcenter + angle, Xcenter + x, Ycenter + y);
            // resample nLLr with θC, XC, YC to get registered image image_out
            Perform Resample (nLLr, image_out, θC, XC, YC);
            // compute the correlation between image_out and nLLi
            corre = Correlation (image_out, nLLi);
            if (corre > max_co) then
                max_co = corre;
                (bestθ, bestX, bestY) = (θC, XC, YC);
            y = y + stepY;
        x = x + stepX;
        y = Yscope_L;
    angle = angle + stepθ;
    x = Xscope_L;
4 Over.
Algorithm 2: The registering algorithm.
2.2. Analysis of WAGIR Algorithm. All analyses are based on
a common assumption of the hierarchical architecture shown
in Figure 1. The off-chip external memory is used to store
the tremendously growing image data. The on-chip memory
serves as a buffer to bridge the speed gap between the external
memory and the accelerator.
In WAGIR, for each possible combination of rotations
and translations, a resampling process is performed, and a
correlation coefficient is calculated to decide which is the
best transformation between the input image and the reference
image. Runtime profiles from a software implementation of
WAGIR listed in Table 1 show that the resampling process
(without the time to compute the correlation coefficient) is
the most time-consuming, and the calculation of the correlation
coefficient is essential because each resampling process
corresponds to one calculation of the correlation coefficient, though
Resample (the resample algorithm)
Input: nLLr, transθ, transX, transY
Output: image_out
1 (w, h) = (width/2^n, height/2^n); // width and height of the input nLLr
// compute the cubic convolution weights for an xLen·tab × yLen·tab template
2 cubicTable(4, 4, tab);
3 for (t_y = 0; t_y < h; t_y++) do
    for (tx = 0; tx < w; tx++) do
        // the inverse mapping function
        x = cos(transθ) × (tx − transX − w/2) + sin(transθ) × (t_y − transY − h/2) + w/2;
        y = −sin(transθ) × (tx − transX − w/2) + cos(transθ) × (t_y − transY − h/2) + h/2;
        (x_int, y_int) = ((int)x, (int)y);
        (x_fra, y_fra) = (x − x_int, y − y_int);
        (f_x, f_y) = ((int)(x_fra × tab), (int)(y_fra × tab));
        // read the corresponding xLen × yLen weights from cubicTable to ct
        ct = cubicTable + (f_y × tab + f_x) × mem_size;
        if ((0 < x_int < (w − 2)) && (0 < y_int < (h − 2))) then
            // read the corresponding xLen × yLen coefficients from nLLr to st
            Read(nLLr, x_int − 1, y_int − 1, xLen, yLen, st);
            pixel_d = 0;
            for (ch = 0; ch < 4; ch++) do
                for (cw = 0; cw < 4; cw++) do
                    pixel_d += (*ct) × (*st);
                    ct++; st++;
            if (pixel_d > 255) then pixel_d = 255;
            if (pixel_d < 0) then pixel_d = 0;
            pixel = (char)pixel_d;
            *(image_out + t_y × w + tx) = pixel;
4 Over.
Algorithm 3: The resampling algorithm.
Table 1: Runtime profiles from a software implementation of WAGIR (three-level wavelet decomposition, grayscale image; profiling platform: Intel Celeron(R) 1.7 GHz CPU, 256 MB DDR266 SDRAM, Microsoft VC 6.0, and Windows XP Prof.).

Image size   Wavelet dec.   Resampling process   Correlation cal.
512 × 512    0.4            95.6                 0.0001
1K × 1K      0.6            94.2                 0.0001
2K × 2K      0.6            93.8                 0.0000
3K × 3K      0.7            93.7                 0.0000
it consumes only a little execution time. For example, to register
an input image and a reference image of size M × M
in the search space [θL, θR] × [xL, xR] × [yL, yR], (θR − θL) ×
(xR − xL) × (yR − yL) resampling processes and (θR −
θL) × (xR − xL) × (yR − yL) calculations of the correlation
coefficient are needed. For each resampling process, M × M resampling
operations are needed. Though wavelet decomposition can
relieve this situation, the computation requirement remains
significant. Therefore, accelerating WAGIR amounts to accelerating
the resampling process and the calculation of the correlation
coefficient.
In addition to the great computation requirement,
WAGIR also has a great memory requirement. In each resampling
process, for each location of the resampled image,
a neighboring 4 × 4 pixel window in the reference image
is needed. That means that 16M² memory accesses are
required for each resampling process. Meanwhile, a total
of 2M² accesses are also required for each calculation of the
correlation coefficient. Considering the great number of
resampling processes and calculations of the correlation coefficient,
the total amount of accesses becomes massive. Even worse,
the whole reference image may be needed to compute one row of pixels
of the resampled image when the rotation angle is π/4, as shown in Figure 2. This
demands that the on-chip memory be capable of holding
the whole image. If it is not, the amount of memory access
grows significantly. Also, each resampled image should
be buffered on-chip because it is needed in the calculation of the
correlation coefficient. It is infeasible to provide such a large on-chip
memory in a hardware implementation because of the
massive size of remote sensing images and the scarcity of on-chip
memory resources. Therefore, a good memory scheduling
strategy is imperative.
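As a back-of-the-envelope check of the figures above, the sketch below (ours, purely illustrative) counts the resampling processes and memory accesses for an M × M image over a given search space: one correlation calculation per resampling process, with 16M² reference-image accesses per resampling and 2M² accesses per correlation.

```python
def wagir_cost(M, theta_range, x_range, y_range):
    """Count resampling processes and memory accesses for one search space."""
    n_resample = (theta_range[1] - theta_range[0]) \
               * (x_range[1] - x_range[0]) \
               * (y_range[1] - y_range[0])        # one correlation per resampling
    per_resample = 16 * M * M                      # 4x4 window per output pixel
    per_correlation = 2 * M * M                    # read both images once
    total_access = n_resample * (per_resample + per_correlation)
    return n_resample, total_access

# 1K x 1K image, +/-16 search scope in angle and in both offsets
n, acc = wagir_cost(1024, (-16, 16), (-16, 16), (-16, 16))
print(n, acc)   # 32768 resampling processes
```

Even before wavelet decomposition shrinks the search, this comes to hundreds of billions of memory accesses, which is why the memory scheduling strategy matters as much as raw compute.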
3. Block Resampling Scheme
To accommodate the great computation and memory
requirements of WAGIR, a block resampling scheme is
employed. The foundation is to produce the resampled
image block by block, because the computations of different
locations are absolutely independent of one another. The
Block_Resample (the block resample algorithm)
Input: nLLr, transθ, transX, transY
Output: image_out
1 (w, h) = (width/2^n, height/2^n); // width and height of the input nLLr
// compute the cubic convolution weights for an xLen·tab × yLen·tab template
2 cubicTable(4, 4, tab);
3 for (s = 0; s < h/S; s++) do
    for (r = 0; r < w/S; r++) do
        for (t_y = 0; t_y < S; t_y++) do
            for (tx = 0; tx < S; tx++) do
                // the inverse mapping function
                x = cos(transθ) × (tx − transX − w/2) + sin(transθ) × (t_y − transY − h/2) + w/2;
                y = −sin(transθ) × (tx − transX − w/2) + cos(transθ) × (t_y − transY − h/2) + h/2;
                (x_int, y_int) = ((int)x, (int)y);
                (x_fra, y_fra) = (x − x_int, y − y_int);
                (f_x, f_y) = ((int)(x_fra × tab), (int)(y_fra × tab));
                // read the corresponding xLen × yLen weights from cubicTable to ct
                ct = cubicTable + (f_y × tab + f_x) × mem_size;
                if ((0 < x_int < (w − 2)) && (0 < y_int < (h − 2))) then
                    // read the corresponding xLen × yLen coefficients from nLLr to st
                    Read(nLLr, x_int − 1, y_int − 1, xLen, yLen, st);
                    pixel_d = 0;
                    for (ch = 0; ch < 4; ch++) do
                        for (cw = 0; cw < 4; cw++) do
                            pixel_d += (*ct) × (*st);
                            ct++; st++;
                    if (pixel_d > 255) then pixel_d = 255;
                    if (pixel_d < 0) then pixel_d = 0;
                    pixel = (char)pixel_d;
                    *(image_out + t_y × w + tx) = pixel;
4 Over.
Algorithm 4: Block resampling algorithm.
Figure 1: Assumption of hierarchy architecture (an FPGA containing the on-chip memory and the accelerating architecture, connected to an off-chip external memory).
pseudocode in Algorithm 4 describes the block resampling
scheme, in which the resampled image is computed sequentially
in consecutive S × S subblocks.
The reason for the great memory requirement
is that the resampled image is generated row by row in
the traditional resampling algorithm. This way of computation
results in a great scope of preloading of the reference image.
According to the mapping function (2), the scope of
reference image pixels required to compute one row of pixels of the
resampled image is [0, M] × [0, M] maximally, that is, the
whole reference image. But the scope to compute an S × S
subblock is just [((1 − √2)/2)S, ((1 + √2)/2)S] × [((1 − √2)/2)S, ((1 + √2)/2)S]. Because S ≪ M, the preloading
scope is decreased greatly. Accordingly, the size of the required
on-chip memory is reduced significantly:
x = cos(transθ) × (tx − transX − M/2) + sin(transθ) × (t_y − transY − M/2) + M/2,
y = −sin(transθ) × (tx − transX − M/2) + cos(transθ) × (t_y − transY − M/2) + M/2.
(2)
Another benefit gained from the block resampling
scheme is that the calculations of all the pixels within one block
Figure 2: Memory requirement of WAGIR (input image, reference image, and registered image; rotation angle θ and translations tx, t_y).
require only one preloading, so the amount of memory
access is decreased. In the traditional resampling algorithm,
the pixels of the reference image must be loaded from the
external memory again and again if there is not enough on-chip
memory to store the whole image. In the block
resampling scheme, by contrast, the block size is decided by the available
on-chip memory.
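The preloading-scope reduction can be quantified with a small sketch (our illustration): for a worst-case rotation, an S × S output block maps back to a reference region of side ((1 + √2)/2)S − ((1 − √2)/2)S = √2·S, versus the whole M × M image in the row-by-row case.

```python
import math

def block_preload_scope(S):
    """Reference-image interval needed to resample one S x S block (worst-case rotation)."""
    lo = (1 - math.sqrt(2)) / 2 * S
    hi = (1 + math.sqrt(2)) / 2 * S
    return lo, hi                      # side length is hi - lo = sqrt(2) * S

S, M = 64, 4096
lo, hi = block_preload_scope(S)
side = hi - lo
print(side, side / M)                  # ~90.5 pixels, a tiny fraction of M
```

For S = 64 and a 4K reference image, the preload window shrinks from 4096 pixels per side to about 91, which is what makes a small on-chip buffer sufficient.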
4. The BWAIR Architecture
As mentioned above, the resampling process and the cal-
culation of correlation coecient account for major of the
execution time. Therefore the BWAIR architecture aims to
accelerate WAGIR by accelerating the resampling algorithms
and the calculation of corresponding correlation coecient.
The BWAGIR architecture is detailed in Figure 3. The
coordinate calculation module computes the coordinate of
the pixel in reference image corresponding to each location
in the resampled image. The interpolation weights calculation
module is responsible to compute the 16 weights for the 44
interpolation window. The reference image RAM controller
loads the neighboring 4 4 window. The resampled pixel
calculation module is in charge of computing the values of
resampled pixels. The input image RAM controller loads the
input image pixels for calculation of correlation coecient.
The correlation calculation module computes the correlation
coecient. And the FIFOs are set to bridge the speed gap
among the modules mentioned above.
Proposed BWAGIR architecture optimizes the resam-
pling process and the calculation of correlation coecient
by means of parallelizing the resample process and corre-
sponding calculation of correlation coecient, pipelining all
calculation modules, and parallel memory access.
4.1. Parallelizing Resampling and Calculation of Correlation.
In a standard software implementation, the resampling
process and the calculation of the correlation coefficient are
performed sequentially. The calculation of the correlation coefficient
starts after all the pixels of the resampled image have been
produced. This means that the resampled image must be
written back into the external memory, or stored in extra on-chip
memory after the resampling, and then read back when
calculating the correlation coefficient. Extra memory volume
and memory accesses are evident.
In the BWAGIR architecture, we partition the correlation
calculation into two steps.
(1) Calculate the sum of the pixels of the input image ( Σ_{i=0}^{M−1} Σ_{j=0}^{M−1} A_{ij} ), the sum of the pixels of the resampled image ( Σ_{i=0}^{M−1} Σ_{j=0}^{M−1} B_{ij} ), the sum of the squares of the pixels of the input image ( Σ_{i=0}^{M−1} Σ_{j=0}^{M−1} A²_{ij} ), the sum of the squares of the pixels of the resampled image ( Σ_{i=0}^{M−1} Σ_{j=0}^{M−1} B²_{ij} ), and the sum of the products of the pixels of the input image with the corresponding pixels of the resampled image ( Σ_{i=0}^{M−1} Σ_{j=0}^{M−1} A_{ij} B_{ij} ).
(2) Calculate the final correlation coefficient according to (1).
This partition avoids the extra memory volume and
memory accesses. Once a pixel in the resampled image is
produced, it is consumed in step 1 and then discarded. Once
step 1 finishes the calculations over all pixels, the five sums
are sent to step 2 to finalize the calculation of the correlation
coefficient. Therefore, the resampling process runs in parallel with
the calculation of the correlation coefficient.
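The two-step partition can be sketched in Python (an illustration of the dataflow, not the RTL): step 1 accumulates the five sums while the resampled pixels stream by and are discarded; step 2 combines the sums into the coefficient of (1).

```python
import math

def streaming_correlation(pixel_pairs, M):
    """pixel_pairs yields (A_ij, B_ij); each pixel is discarded after accumulation."""
    sA = sB = sAA = sBB = sAB = 0
    for a, b in pixel_pairs:           # step 1: five running sums
        sA += a; sB += b
        sAA += a * a; sBB += b * b
        sAB += a * b
    n = M * M                          # step 2: finalize according to (1)
    num = sAB - sA * sB / n
    den = math.sqrt((sAA - sA * sA / n) * (sBB - sB * sB / n))
    return num / den

# identical images correlate perfectly
img = [1, 5, 7, 2]                     # a 2 x 2 "image", row-major
print(streaming_correlation(zip(img, img), 2))   # -> 1.0
```

Because only five scalar accumulators survive the loop, no resampled image ever needs to be written back to memory, which is exactly the saving claimed above.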
4.2. Pipelining. All the calculation modules are pipelined to
improve the system throughput and operating frequency.
As shown in Figure 3, BWAGIR is divided into four
macrostages according to the processing flow. The first stage
calculates the coordinate of the pixel in the reference image
corresponding to each location of the resampled image and
writes the integral and fractional components into the corresponding
FIFOs. At the second stage, the reference image RAM
controller reads the neighboring 4 × 4 reference image pixel
window, and at the same time, 16 interpolation weights are
produced by the interpolation weights calculation module.
Stage 3 calculates the value of each location in the resampled
image by multiplying the pixels with their corresponding
weights and adding these products together. Finally, the
correlation calculation module computes the correlation
coefficient by the means described in Section 4.1. With pipelining,
the total time of performing the resampling process and the
corresponding correlation calculation once becomes equal to
the product of the worst pipeline stage time and the number
of pixels in the resampled image.
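Under this model (our simplification), once the pipeline is full it retires one resampled pixel per iteration of the slowest stage, so one resampling-plus-correlation pass over M² pixels costs roughly worst_stage_cycles × M² cycles, plus a small fill latency:

```python
def pipeline_cycles(stage_cycles, num_pixels):
    """Total cycles for a multistage pipeline processing num_pixels items."""
    worst = max(stage_cycles)                      # throughput-limiting stage
    fill = sum(stage_cycles) - worst               # latency to fill the pipeline
    return fill + worst * num_pixels

# stage 2 needs 2 cycles per pixel (two 8-pixel bank reads, Section 4.3);
# the other three stages need 1 cycle each
cycles = pipeline_cycles([1, 2, 1, 1], 64 * 64)
print(cycles)                                      # ~2 cycles per output pixel
```

This is why Section 4.3 works so hard on the second stage: with single-pixel loads it would cost 16 cycles per output pixel and dominate the whole pipeline.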
Figure 3: The BWAGIR architecture. (A BWAGIR resampling processor on the FPGA: in stage 1, the coordinate calculation module writes integral and fractional parts into FIFOs; in stage 2, the reference image RAM controller loads the cubic-window pixels while the interpolation weights calculation module produces the weights, both buffered in FIFOs; in stage 3, the resampled pixel calculation module fills the resampled-pixel FIFO; in stage 4, the correlation calculation module combines it with the input-image FIFO and outputs the correlation coefficient. Preloading controllers move the reference image and the input image from two external SDRAMs into on-chip RAM memories, under common control logic.)
Table 2: Comparison of the registration time (milliseconds) with the CL cluster system.

Size        Scheme   CL: 1 node   2 nodes   5 nodes   15 nodes   16 nodes   BWAGIR: 1 unit   5 units
512 × 512   PP       103.11       51.80     20.82     7.64       6.86       8.7              1.8
            IP       102.98       61.00     38.54     27.50      27.44
            HP       103.22       51.60     20.83     7.94       8.82
            GP       103.22       51.79     20.85     7.25       6.80
1K × 1K     PP       345.93       173.77    69.86     25.62      23.02      39.4             7.9
            IP       345.94       187.43    88.72     44.66      43.31
            HP       345.90       172.22    69.84     25.01      24.56
            GP       345.95       172.32    69.86     24.21      22.90
3K × 3K     PP       2849.95      1445.13   575.63    213.55     192.41     385.7            75.5
            IP       2849.88      1440.52   626.68    231.41     218.20
            HP       2849.97      1442.60   575.62    207.36     191.83
            GP       2849.96      1439.75   576.02    204.50     191.87

(The BWAGIR timings apply per image size, independently of the parallel scheme.)
4.3. Parallel Memory Access. Parallelizing the resampling
process and the calculation of the correlation coefficient demands
parallel access to the input image and the reference image. But the
way of accessing the input image differs from that of accessing
the reference image: the input image is accessed sequentially,
while the reference image is accessed in 4 × 4 windows.
Therefore, two external memories are used to store the
input image and the reference image, respectively. And they
are preloaded into the respective on-chip RAMs block by
block.
As mentioned above, the performance of the pipeline is
decided by the worst stage calculation time. The four pipeline
stages differ in data source and operation. The worst case
is the second stage because a 4 × 4 neighborhood, that is,
16 pixels, is loaded from the on-chip reference image RAM. If
these pixels are loaded one at a time, it takes at least 16 cycles
to calculate each location in the resampled image. This
restricts the throughput of the pipeline significantly. As a
rule, a multibank memory organization can settle this problem
by distributing the sequential multiple accesses to different
Table 3: Comparison of the registration time (milliseconds) with the MPM parallel machine.

Size        Scheme   MPM: 1 node   5 nodes   15 nodes   16 nodes   30 nodes   BWAGIR: 1 unit   5 units
512 × 512   PP       39.052        7.917     3.116      2.635      1.714      8.7              1.8
            IP       39.051        10.414    5.350      5.022      4.189
            HP       39.050        7.917     2.850      2.916      1.564
            GP       39.055        7.917     2.750      2.633      1.450
1K × 1K     PP       145.625       29.317    11.318     9.567      6.184      39.4             7.9
            IP       145.629       32.188    12.841     12.142     8.683
            HP       145.633       29.067    10.117     9.816      5.415
            GP       145.633       29.183    10.082     9.665      5.271
3K × 3K     PP       1327.00       276.336   102.926    86.233     68.517     385.7            75.5
            IP       1326.24       292.485   105.821    100.466    75.243
            HP       1327.25       270.950   91.967     87.015     46.801
            GP       1328.00       267.150   88.350     87.717     45.516
memory banks with separate ports. Because the 16 pixels are
not consecutive, it is difficult to distribute them evenly over
16 banks. Therefore, we adopt a compromise strategy: the
on-chip memory for the reference image is divided into 8
banks, each of which has three ports, one for writing and the
other two for reading. This makes it convenient both to write a 64-bit
word, composed of 8 consecutive 8-bit pixels, into the
on-chip memory in parallel and to load 8 pixels (two lines of
the 4 × 4 window) in parallel. Thereby, it takes only 2 cycles
to load the 16 pixels within a window. Though this still cannot
match the speed of the calculation modules (one result per
cycle), the stage 2 calculation time is reduced by a factor of 8.
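One way to see why 8 banks with two read ports suffice is the sketch below. This is our reconstruction: the paper does not give the exact bank mapping, so the row-interleaved assignment (bank = row mod 8) is an assumption made for illustration. With it, the 4 rows of a 4 × 4 window fall in 4 distinct banks, each serving 2 pixels per cycle, so the 16 pixels come out in 2 cycles.

```python
from collections import defaultdict

NUM_BANKS, READ_PORTS = 8, 2

def fetch_cycles(window_coords):
    """Cycles to read a set of (x, y) pixels from row-interleaved banks."""
    per_bank = defaultdict(int)
    for x, y in window_coords:
        per_bank[y % NUM_BANKS] += 1   # assumed mapping: bank = row index mod 8
    # each cycle, every bank serves up to READ_PORTS requests in parallel
    return max(-(-count // READ_PORTS) for count in per_bank.values())

# a 4 x 4 window anchored at (x0, y0) = (10, 21)
window = [(10 + dx, 21 + dy) for dy in range(4) for dx in range(4)]
print(fetch_cycles(window))            # -> 2
```

The same model shows the failure mode the text warns about: 16 pixels falling into a single bank would need 8 cycles, which is why the distribution across banks matters.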
4.4. Parallelizing Multiple BWAGIR Processing Units. The
processing speed of the proposed architecture can be further
improved by parallelizing multiple BWAGIR processing
units. There are two ways to achieve this.
(i) Processing multiple blocks belonging to the same resampled
image. This way multiplies the preloading scope
and the on-chip memory volume. Another disadvantage
is that the data is not utilized sufficiently, because
each preloading only supports the calculation of one
resampled image.
(ii) Processing multiple blocks belonging to different resampled
images. This way enlarges the preloading scope
only a little, because the blocks at the same position of
different resampled images have almost the same
preloading scope. The on-chip memory volume is
still multiplied, because parallel processing demands
great data memory bandwidth, that is, each processor
requires an independent data memory. But this
parallelism decreases the memory accesses, because
each preloading supports the calculation of multiple
resampled images.
Therefore, the more economical way is to process multiple
blocks that are at the same position of different resampled
images.
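The trade-off between the two options can be expressed as a simple preload-count model (ours, an idealization): with P units, sharing a preload across P resampled images divides the number of reference-block preloads by roughly P, while sharing it within one image does not.

```python
def preload_count(num_images, blocks_per_image, P, share_across_images):
    """Reference-block preloads for num_images resampled images with P units."""
    if share_across_images:            # way (ii): one preload feeds P images
        return blocks_per_image * -(-num_images // P)   # ceil division
    return blocks_per_image * num_images                # way (i): no reuse

R, B, P = 32768, 256, 5                # e.g. 32768 candidate transformations
ratio = preload_count(R, B, P, False) / preload_count(R, B, P, True)
print(ratio)                           # close to P
```

This is the quantitative reason the paper prefers processing blocks at the same position of different resampled images.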
5. Implementation and Experimental Results
As a proof of concept, the BWAGIR architecture is modeled
in Verilog HDL, simulated with ModelSim SE 6.1d, synthesized
with Quartus II 6.0, and implemented on a
prototype board with an Altera EP2S130F1020C3 FPGA.
One BWAGIR unit occupies about 67% of the ALUTs (70557) and
35% of the memory bits (2345144) and can operate at a clock rate
of 100 MHz. Two Micron MT16LSDT12864AG-1 GB PC133
SDRAMs are used as the external memories for the input image
and the reference image, respectively. The on-chip memories
for the input image and the reference image are implemented with
internal memory blocks. The prototype board is connected
to a host computer with a USB cable. Only the registration
component of WAGIR is executed on the board; the
other components are all performed on the host. Table 2
lists a comparison of the timings between the BWAGIR
architecture and the CL machine, a cluster system
with 16 nodes; each node is equipped with a Pentium 4 1.7 GHz
CPU and 512 MB of local storage, and all nodes are connected
by 100 Mb/s Ethernet. Table 3 lists a comparison of the
timings between BWAGIR and the MPM machine, a
massively parallel computer with an MIMD architecture; it
has 32 processors with 1 GB of local storage per processor,
each MPM CPU is rated at 1.66 gigaflops/s, the network topology
is a fat tree, and the point-to-point bandwidth is
1.2 Gb/s. The images are processed with three-level wavelet
decomposition and registered within the search space of
θ ∈ (−16°, 16°), x ∈ (−16, 16), y ∈ (−16, 16).
The following can be concluded.
(i) The BWAGIR with 5 units performs more than
5X faster than that with 1 unit, because parallelizing
multiple processing units not only improves the
execution speed but also reduces the amount of
memory access.
(ii) The BWAGIR with 1 unit is at least 7.4X faster
than all the parallel schemes on the CL with 1
node, and at least 3.4X faster than the MPM with 1
node, because their pattern of memory access cannot fully
benefit from the traditional cache-based memory
architectures present in most modern computers.
(iii) The BWAGIR with 5 units achieves a speedup of
about 3X against the CL with 16 nodes, a speedup
of greater than 1X against the MPM with 16 nodes,
and a speed comparable to the MPM with 30
nodes. This is because the numerous communications
between nodes cut down the expected performance
improvement of the parallel schemes.
It should be noted that the timings of the BWAGIR with 1
unit are actually obtained on the prototype board, while the
timings of the BWAGIR with 5 units are simulation times,
because our board cannot support parallelizing multiple
units owing to the limited volume and number of available
FPGAs.
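The quoted speedups follow directly from Tables 2 and 3; the snippet below (ours) recomputes the worst-case ratios from the 3K × 3K rows, where the 1-node timings and the measured 1-unit BWAGIR time give the smallest advantage.

```python
# 1-node timings (ms) for the 3K x 3K image, per parallel scheme,
# taken from Tables 2 and 3, against the measured 1-unit BWAGIR time.
cl_1node = {"PP": 2849.95, "IP": 2849.88, "HP": 2849.97, "GP": 2849.96}
mpm_1node = {"PP": 1327.00, "IP": 1326.24, "HP": 1327.25, "GP": 1328.00}
bwagir_1unit = 385.7

cl_speedup = min(t / bwagir_1unit for t in cl_1node.values())
mpm_speedup = min(t / bwagir_1unit for t in mpm_1node.values())
print(round(cl_speedup, 1), round(mpm_speedup, 1))   # -> 7.4 3.4
```

The smaller image sizes give even larger ratios (up to about 11.8X against the CL with 1 node for 512 × 512), so the 7.4X and 3.4X figures are lower bounds.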
6. Conclusion
The WAGIR algorithm for remote sensing applications is
extremely computation-intensive and demands execution
times on the order of minutes, even hours, on modern
desktop computers. Therefore, a customized FPGA architecture
is proposed in this paper to accommodate the
great computational requirements of the algorithm and the
trend of migration from ground computing to onboard
computing.
To implement the algorithm in an FPGA efficiently, a
block resampling scheme is adopted to relieve the great
computation and memory requirements. Based on this
block scheme, the proposed BWAGIR architecture derives
its improvement from (1) pipelining all computational
logic, (2) parallelizing the resampling process and the calculation
of the correlation coefficient, and (3) parallel memory
access. A practical implementation with two standard PC133
SDRAMs, operating at 100 MHz, achieves at least a 7.4X speedup
over the CL cluster system with 1 node and about
3.4X over the MPM machine with 1 node. This speedup is
obtained using just one BWAGIR processing unit. For further
improvement, multiple units can be parallelized to implement
arrays of processing units using VLSI or FPGAs to perform
distributed image registration. Compared with the CL with
16 nodes, the BWAGIR architecture with 5 units achieves
about a 3X speedup, and it also achieves a speed comparable
to the MPM machine with 30 nodes. More importantly,
our architecture can meet the requirement of onboard
computing.
Acknowledgments
Our work is supported by the National Science Foundation
of China under contracts no. 60633050 and no. 60621003
and the National High Technology Research and Develop-
ment Program of China under contract no. 2007AA01Z106
and no. 2007AA12Z147.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 716317, 15 pages
doi:10.1155/2009/716317
Research Article
A System for an Accurate 3D Reconstruction in
Video Endoscopy Capsule
Anthony Kolar,¹ Olivier Romain,¹ Jade Ayoub,¹ David Faura,¹ Sylvain Viateur,¹ Bertrand Granado,² and Tarik Graba³
¹ Departement SOC-LIP6, Universite P&M Curie Paris VI, Equipe SYEL, 4 place Jussieu, 75252 Paris, France
² ETIS, CNRS/ENSEA/Universite de Cergy-Pontoise, 95000 Cergy, France
³ Electronique des systemes numeriques complexes, Telecom ParisTech, 46 rue Barrault, 75252 Paris, France
Correspondence should be addressed to Anthony Kolar, anthony.kolar@free.fr
Received 15 March 2009; Revised 9 July 2009; Accepted 12 October 2009
Recommended by Ahmet T. Erdogan
For several years, gastroenterological examinations have been performed with wireless video capsules. Although the images make
it possible to analyse some diseases, the diagnosis could be improved by the use of 3D imaging techniques implemented
in the video capsule. The work presented here is related to Cyclope, an embedded active vision system that is able to provide
in real time both 3D information and texture. The challenge is to realise this integrated sensor under the constraints on size,
consumption, and computational resources inherent to a video capsule. In this paper, we present the hardware
and software development of a wireless multispectral vision sensor which allows a 3D reconstruction of a scene to be transmitted
in real time. The multispectral acquisition grabs both texture and IR pattern images separately at 25 frames/s or more. The different
Intellectual Properties designed allow specific algorithms to be computed in real time while preserving computational accuracy. We present
experimental results with the realization of a large-scale demonstrator using an SOPC prototyping board.
Copyright 2009 Anthony Kolar et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Examination of the whole gastrointestinal tract represents a
challenge for endoscopists due to its length and inaccessibility
using natural orifices. Moreover, radiologic techniques
are relatively insensitive for diminutive, flat, infiltrative, or
inflammatory lesions of the small bowel. Since 1994, video
capsules (VCEs) [1, 2] have been developed to allow direct
examination of this inaccessible part of the gastrointestinal
tract and to help doctors find the cause of symptoms such
as stomach pain, Crohn's disease, diarrhoea, weight loss,
rectal bleeding, and anaemia.
The Pillcam video capsule designed by the Given Imaging
Company is the most popular of them. This autonomous
embedded system acquires about 50 000 images of the
gastrointestinal tract during more than twelve hours of an
examination. The off-line image processing and its interpretation
by the practitioner make it possible to determine the origin of the
disease. However, a recently published benchmark [3] shows
some limitations of this video capsule, such as the quality of the
images and the inaccuracy of the estimated size of the polyps. Accuracy
is a real need because the practitioner performs an ablation
of a polyp only if it exceeds a minimum size. Currently, the
polyp size is estimated from the practitioner's experience, with more
or less error from one practitioner to another. One of the
solutions could be to use 3D imaging techniques, either
directly in the video capsule or on a remote computer.
This latter solution is actually used in the Pillcam capsule
by using the 24 images that are taken per second and stored
wirelessly in a recorder that is worn around the waist. 3D
processing is performed off-line from the estimation of the
displacement of the capsule. However, the speed of the video
capsule is not constant; for example, in the oesophagus it
is 1.44 m/s, in the stomach it is almost null, and it is
0.6 m/s in the intestine. Consequently, by taking images at a
constant frequency, certain areas of the transit will not be
reconstructed. Moreover, the regular transmission of the images
through the body consumes too much energy and limits the
autonomy of the video capsules to 10 hours. Ideally, the
quantity of information to be transmitted must be reduced
to only the pertinent information, such as polyps or other 3D
objects. The first development necessary for the delivery
of such objects relies on the use of pattern recognition
algorithms on 3D information inside the video capsule.
The introduction of 3D reconstruction techniques inside
a video capsule requires defining a new system that takes into
account the hard constraints of size, low power consumption,
and processing time. The most common 3D reconstruction
techniques are those based on passive or active stereoscopic
vision methods, where image sensors are used to provide the
necessary information to retrieve the depth. The passive method
consists of taking at least two images of a scene from two
different points of view. Unfortunately, using this method,
only particular points, with a high gradient or high texture,
can be detected [4]. The active stereo-vision methods offer
an alternative approach when processing time is critical.
They consist in replacing one of the two cameras by a
projection system which delivers a pattern composed of a
set of structured rays. In this latter case, only an image of
the deformation of the pattern by the scene is necessary to
reconstruct a 3D image. Many implementations based on
active stereo-vision have been realised in the past [5, 6] and
provided significant results on desktop computers. Generally,
these implementations have been developed to reconstruct
large 3D objects such as buildings [7-14].
In our research work, we have focused on an integrated 3D active vision sensor: Cyclope. The concept of this sensor was first described in [4]. In this article we focus on the presentation of our first prototype, which includes the instrumentation and processing blocks. This sensor performs 3D reconstruction in real time while taking into account the size and power consumption constraints of embedded systems [15]. It can be used in wireless video capsules or wireless sensor networks. In the case of a video capsule, in order to be comfortable for the patient, the results could be stored in a recorder worn around the waist. The sensor is based on a multispectral acquisition that must facilitate the delivery of a 3D textured reconstruction in real time (25 images per second).
This paper is organised as follows. Section 2 briefly describes Cyclope and deals with the principles of the active stereo-vision system and the 3D reconstruction method. In Section 3 we present our original multispectral acquisition. In Section 4 we present the implementation of the optical correction developed to correct the lens distortion. Section 5 deals with the implementation of new thresholding and labelling methods. In Sections 6 and 7, we present the matching process used to give a 3D representation of the scene. Section 8 deals with wireless communication considerations. Finally, before the conclusion and perspectives of this work, we present, in Section 9, a first functional prototype and its performances, which attest the feasibility of this original approach.
2. Cyclope
2.1. Overview of the Architecture. Cyclope is an integrated wireless 3D vision system based on an active stereo-vision technique. It uses several different algorithms to increase accuracy and reduce processing time. For this purpose, the sensor is composed of three blocks (see Figure 1).

Figure 1: Cyclope diagram (instrumentation block: VCSEL projector and CMOS imager; processing block: FPGA and processor; RF block).
(i) Instrumentation block: it is composed of a CMOS camera and a structured-light projector in the IR band.

(ii) Processing block: it integrates a microprocessor core and a reconfigurable array. The microprocessor is used for sequential processing; the reconfigurable array is used to implement parallel algorithms.

(iii) RF block: it is dedicated to OTA (Over-The-Air) communications.

The feasibility of Cyclope was studied through an implementation on an SOPC (System On Programmable Chip) target. These three parts will be realised in different technologies: CMOS for the image sensor and the processing units, GaAs for the pattern projector, and RF CMOS for the communication unit. The development of such an integrated SIP (System In Package) is actually the best solution to overcome the technological constraints and realise a chip-scale package. This approach is used in several embedded sensors, such as the Human++ platform [16] or Smart Dust [17].
2.2. Principle of the 3D Reconstruction. The basic principle of 3D reconstruction is triangulation. Knowing the distance between two cameras (or the various positions of the same camera) and defining the lines of sight, one passing through the center of the camera and the other through the object, we can find the object distance.

Active 3D reconstruction is a method aiming to increase the accuracy of the 3D reconstruction by projecting a structured pattern onto the scene. The matching is largely simplified because the points of interest in the image needed for the reconstruction are obtained by extraction of the pattern; this also has the effect of increasing the processing speed.

The setup of the active stereo-vision system is represented in Figure 2. The distance between the camera and the laser projector is fixed. The projection of the laser beams on a plane gives an IR spot matrix.
Figure 2: Active stereo-vision system (camera and laser projector separated by a fixed baseline; rays P_1, ..., P_k).

Figure 3: Epipolar projection (image plane, projector center C_L, and the epipolar plane through P).
The 3D reconstruction is achieved through triangulation between the laser and the camera. Each point of the projected pattern on the scene represents the intersection of two lines (Figure 3):

(i) the line of sight, passing through the pattern point on the scene and its projection in the image plane,

(ii) the laser ray, starting from the projection center and passing through the chosen pattern point.

If we consider the active stereoscopic system as shown in Figure 3, where p is the projection of P in the image plane and e is the projection of the projector center C_L on the camera plane O_C, the projection of the light ray supporting the dot on the image plane is a straight line. This line is an epipolar line [18-20]. To rapidly identify a pattern point on an image we can limit the search to the epipolar lines.

For Cyclope the pattern is a regular mesh of points. For each point (j, k) of the pattern we can find the corresponding epipolar line:

v = a_jk · u + b_jk, (1)
Figure 4: Spot image movement versus depth (camera center C, focal length f, baseline B, planes Π_1 and Π_2 at depths z_1 and z_2).

where (u, v) are the image coordinates and the parameters (a_jk, b_jk) are estimated through an off-line calibration process.

In addition to the epipolar lines, we can establish the relation between the position of a laser spot in the image and its distance to the stereoscopic system.
In Figure 4, we consider a laser ray projected on two different planes Π_1 and Π_2 located, respectively, at z_1 and z_2; the trajectory d of the spot coordinates in the image is constrained to the epipolar line.

By considering the two triangles CPp_1 and CPp_2, we can express d as

d = B · ((z_1 − f)/z_1 − (z_2 − f)/z_2) = B · f · (z_1 − z_2)/(z_1 · z_2), (2)
where B is the stereoscopic baseline, f the focal length of the camera, and d the distance in pixels:

d = sqrt((u_1 − u_2)^2 + (v_1 − v_2)^2). (3)

Given the epipolar line, we can express d as a function of only one image coordinate:

d = sqrt(1 + a^2) · (u_1 − u_2). (4)

From (2) and (4), we can express, for each pattern point (j, k), the depth as a hyperbolic function:

z = 1/(α_jk · u + β_jk), (5)

where the α_jk and β_jk parameters are also estimated during the off-line calibration of the system [21].
To simplify the implementation, we compute the inverse of the depth z. Only two operations are then needed: an addition and a multiplication. The computation of the depth of each point is independent of the others, so all the laser spots can be processed separately, allowing the parallelisation of the architecture.
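As an illustration of that per-spot arithmetic (a minimal sketch; the (α_jk, β_jk) values below are made up, not calibrated ones), the depth of (5) costs exactly one multiply and one add in the inverse-depth domain:

```python
def inverse_depth(u, alpha, beta):
    """One multiply-add: 1/z = alpha*u + beta, as in (5)."""
    return alpha * u + beta

def depth(u, alpha, beta):
    """Recover z from the hyperbolic model z = 1/(alpha*u + beta)."""
    return 1.0 / inverse_depth(u, alpha, beta)

# Hypothetical calibration pair for one pattern point (j, k):
alpha_jk, beta_jk = 0.004, 0.01
z = depth(120.0, alpha_jk, beta_jk)  # spot abscissa u = 120 pixels
```

Because each spot has its own (α, β) pair and no spot depends on another, this multiply-add can be replicated once per spot in hardware, which is exactly what makes the architecture parallelisable.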
3. An Energetic Approach for Multispectral Acquisition

The main problem when designing 3D reconstruction processing for an integrated system is the limitation of
Figure 5: Acquisition and 3D reconstruction flow chart (multispectral image acquisition, distortion correction, thresholding, labeling, center detection, matching, 3D reconstruction, wireless communication).

Figure 6: Multispectral image sensor (texture image on the visible band, 400-700 nm; pattern image on the near-IR band).
the resources. However, we can obtain good accuracy under these hard constraints by using the following method, shown in Figure 5:

(1) the multispectral acquisition, which discriminates between the pattern and the texture by an energetic method;

(2) the correction of the coordinate errors due to the optical lens distortion;

(3) the processing before the 3D reconstruction: thresholding, segmentation, labelling, and the computation of the laser spot centers;

(4) the computation of the matching and of the third dimension;

(5) the transmission of the data with a processor core and an RF module.
The spectral response of silicon cuts off near 1100 nm and covers the UV to near-infrared domains. This important characteristic allows defining a multispectral acquisition by grabbing the colour texture image on the visible band and the depth information on the near-infrared band. Cyclope uses this original acquisition method, which permits direct access to the depth information independently of the texture image processing (Figure 6).
The combination of the acquisition of the projected pattern on the infrared band, the acquisition of the texture on the visible band, and the mathematical model of the active 3D sensor makes it possible to restore the 3D textured representation of the scene. This acquisition needs to separate texture and 3D data. For this purpose we have developed a multispectral acquisition [15]. Generally, filters are used to cut the spectral response. We use here an energetic method, which has the advantage of being generic for imagers.

Figure 7: 64 × 64 image sensor microphotograph.
To allow real-time acquisition of both pattern and texture, we have developed a first 64 × 64 pixel CMOS imager prototype in a 0.6 μm process, for a total surface of 20 mm² (Figure 7). This sensor has programmable light integration and shutter times to allow dynamic changes. It was designed to have a large response in the visible and near infrared. This first CMOS imager prototype, which is not the subject of this article, allowed the validation of our original energetic approach, but its small size needs to be increased to gather more information. So, in our demonstrator we have used a larger CCD sensor (CIF resolution, 352 × 288 pixels) to obtain normal-size images and validate the 3D processing architecture.
The projector periodically pulses an energetic IR pattern onto the scene. An image acquisition with a short integration time allows grabbing the image of the pattern with a background texture that appears negligible. A second image acquisition with a longer integration time allows grabbing the texture while the projector is off. Figure 8 shows the sequential scheduling of the image acquisitions. To reach a video rate of 25 images/s, this acquisition sequence must be done in less than 40 milliseconds. The global acquisition time is given in (6), where T_rst is the reset time, T_rd is the time needed to read the entire image, and T_intVI and T_intIR
Figure 8: Acquisition sequence (laser pulse, short IR integration T_intIR, read time T_rd, long visible integration T_intVI; total under 40 ms).
are, respectively, the integration times for the visible and IR images:

T_total = 2·T_rst + 2·T_rd + T_intVI + T_intIR. (6)

The typical values are

T_rst = 0.5 ms,  T_rd = 0.5 ms,  T_intVI = 15 ms,  T_intIR = 20 ms. (7)
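Plugging these typical values into (6) confirms that the sequence fits the 40 ms frame budget (a simple check using the values quoted above):

```python
def total_acquisition_time_ms(t_rst, t_rd, t_int_vi, t_int_ir):
    """Global acquisition time of (6): two resets, two readouts,
    plus one visible and one IR integration."""
    return 2 * t_rst + 2 * t_rd + t_int_vi + t_int_ir

# Typical values of (7), in milliseconds:
t_total = total_acquisition_time_ms(0.5, 0.5, 15.0, 20.0)
assert t_total < 40.0  # 37 ms: compatible with 25 images/s
```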
4. Optical Distortion Correction

Generally, the lenses used in VCE introduce large deformations in the acquired images because of their short focal length [22]. This distortion manifests as inadequate spatial relationships between pixels in the image and the corresponding points in the scene. Such a change in the shape of a captured object may have a critical influence in medical applications, where quantitative measurements in endoscopy depend on the position and orientation of the camera and on its model. The camera model used needs to be accurate. For this reason we introduce first the pinhole camera model and then the correction of geometric distortion added to enhance it. For practical purposes two different methods are studied to implement this correction, and it is up to researchers to choose their own model depending on their required accuracy level and computational cost.
The pinhole camera model (see Figure 9) is based on the principle of linear projection, where each point in the object space is projected by a straight line through the projection center into the image plane. This model can be used only as an approximation of the real camera, which is actually not perfect and suffers from a variety of aberrations [23]. So, the pinhole model is not valid when high accuracy is required, as in our expected applications (endoscopes, robotic surgery, etc.). In this case, a more comprehensive camera model must be used,
Figure 9: Pinhole camera model. (X_w, Y_w, Z_w): world coordinates; (O, X_c, Y_c, Z_c): camera coordinates; (O', u, v): image plane coordinates.
taking into account the corrections for the systematically distorted image coordinates. As a result of several types of imperfections in the design and assembly of the lenses composing the camera's optical system, the real projection of the point P in the image plane takes into account the error between the real observed image coordinates and the corresponding ideal (non-observable) image coordinates:

u' = u + δ_u(u, v),
v' = v + δ_v(u, v), (8)

where (u, v) are the ideal non-observable, distortion-free image coordinates, (u', v') are the corresponding real coordinates, and δ_u and δ_v are, respectively, the distortion along the u and v axes. Usually, the lens distortion consists of radially symmetric distortion, decentering distortion, affinity distortion, and non-orthogonality deformations. Several cases are presented in Figure 10.

The effective distortion can be modelled by

δ_u(u, v) = δ_ur + δ_ud + δ_up,
δ_v(u, v) = δ_vr + δ_vd + δ_vp, (9)
where δ_ur and δ_vr represent the radial distortion [24], δ_ud and δ_vd the decentering distortion, and δ_up and δ_vp the thin prism distortion. Assuming that the first- and second-order terms alone are sufficient to compensate the distortion, and that the terms of order higher than three are negligible, we obtain a fifth-order polynomial camera model (10), where (u_i, v_i) are the distorted image coordinates in pixels, and (ū_i, v̄_i) are the true (undistorted) coordinates:

u_i = D_u S_u (k_2 ū_i^5 + 2k_2 ū_i^3 v̄_i^2 + k_2 ū_i v̄_i^4 + k_1 ū_i^3 + k_1 ū_i v̄_i^2 + 3p_2 ū_i^2 + 2p_1 ū_i v̄_i + p_2 v̄_i^2 + ū_i) + u_0,

v_i = D_v (k_2 ū_i^4 v̄_i + 2k_2 ū_i^2 v̄_i^3 + k_1 ū_i^2 v̄_i + k_2 v̄_i^5 + k_1 v̄_i^3 + p_1 ū_i^2 + 2p_2 ū_i v̄_i + 3p_1 v̄_i^2 + v̄_i) + v_0. (10)
Figure 10: (a) The ideal undistorted grid. (b) Barrel distortion. (c) Pincushion distortion.
An approximation of the inverse model is given by (11):

ū_i = [ū'_i + ū'_i (a_1 r_i^2 + a_2 r_i^4) + 2a_3 ū'_i v̄'_i + a_4 (r_i^2 + 2ū'_i^2)] / [(a_5 r_i^2 + a_6 ū'_i + a_7 v̄'_i + a_8) r_i^2 + 1],

v̄_i = [v̄'_i + v̄'_i (a_1 r_i^2 + a_2 r_i^4) + 2a_4 ū'_i v̄'_i + a_3 (r_i^2 + 2v̄'_i^2)] / [(a_5 r_i^2 + a_6 ū'_i + a_7 v̄'_i + a_8) r_i^2 + 1], (11)

where

ū'_i = (u_i − u_0)/(D_u S_u),
v̄'_i = (v_i − v_0)/D_v,
r_i^2 = ū'_i^2 + v̄'_i^2. (12)
The unknown parameters a_1, ..., a_8 are solved using direct least mean-squares fitting [25] in the off-line calibration process.
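For illustration, the inverse model of (11)-(12) can be evaluated as below (a Python sketch, not the paper's fixed-point VHDL; the intrinsic values are loosely taken from Table 1, the normalisation by a single scale factor per axis is a simplification, and the a-parameters are zeroed to show the identity case):

```python
def undistort(u, v, u0, v0, du_su, dv, a):
    """Evaluate the inverse distortion model (11)-(12).
    a = (a1, ..., a8); returns corrected normalized coordinates."""
    a1, a2, a3, a4, a5, a6, a7, a8 = a
    up = (u - u0) / du_su              # normalized distorted coords (12)
    vp = (v - v0) / dv
    r2 = up * up + vp * vp
    radial = a1 * r2 + a2 * r2 * r2    # radial terms a1*r^2 + a2*r^4
    denom = (a5 * r2 + a6 * up + a7 * vp + a8) * r2 + 1.0
    u_c = (up + up * radial + 2 * a3 * up * vp + a4 * (r2 + 2 * up * up)) / denom
    v_c = (vp + vp * radial + 2 * a4 * up * vp + a3 * (r2 + 2 * vp * vp)) / denom
    return u_c, v_c

# With all a-parameters at zero the model reduces to the identity
# on the normalized coordinates:
u_c, v_c = undistort(200.0, 150.0, 178.04, 144.25, 444.99, 486.39, [0.0] * 8)
```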
4.1. Off-Line Lens Calibration. There are many proposed methods that can be used to estimate the intrinsic camera and lens distortion parameters, and there are also methods that produce only a subset of the parameter estimates. We chose a traditional calibration method based on observing a planar checkerboard in front of our system at different poses and positions (see Figure 11) to solve the equations of the unknown parameters (11). The results of the calibration procedure are presented in Table 1.
4.2. Hardware Implementation. After the computation of the parameters in (11) through an off-line calibration process, we use them to correct the distortion of each frame. With the input frame captured by the camera denoted as the source image and the corrected output as the target image, the task of correcting the distorted source image can be defined as follows: for every pixel location in the target image, compute its corresponding pixel location in the source image. Two implementation techniques of distortion correction have been compared.

Direct Computation. Calculate the image coordinates by evaluating the polynomials to determine the intensity values for each pixel.
Figure 11: Different checkerboard positions used for the calibration procedure.
Table 1: Calibration results.

Parameter            | Value  | Error
u_0 (pixels)         | 178.04 | 1.28
v_0 (pixels)         | 144.25 | 1.34
f·D_u S_u (pixels)   | 444.99 | 1.21
f·D_v (pixels)       | 486.39 | 1.37
a_1                  | 0.3091 | 0.0098
a_2                  | 0.0033 | 0.0031
a_3                  | 0.0004 | 0.0001
a_4                  | 0.0014 | 0.0004
a_5                  | 0.0021 | 0.0002
a_6                  | 0.0002 | 0.0001
a_7                  | 0.0024 | 0.0005
a_8                  | 0.0011 | 0.0002
Lookup Table. Calculate the image coordinates by evaluating the polynomial correction in advance and storing the results in a lookup table (LUT) which is referenced at run time. All parameters needed for LUT generation are known beforehand; therefore, for our system, the LUT is computed only once and off-line.
However, since the source pixel location can be a real
number, using it to compute the actual pixel values of
the target image requires some form of pixel interpolation.
For this purpose we have used the nearest neighbour
Figure 12: Block memory occupation versus image size for both the direct computation and LUT-based approaches.
Table 2: Area and clock characteristics of the two approaches.

Implementation     | Area (%) | Clock (MHz)
Direct Computation | 58       | 10
Lookup Table       | 6        | 24
interpolation approach, which means that the pixel value closest to the predicted coordinates is assigned to the target coordinates. This choice is reasonable because it is a simple and fast method to compute, and visible image artefacts are not an issue for our system.
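The LUT strategy can be sketched as follows (plain Python standing in for the fixed-point VHDL; the toy mapping lambda stands in for the inverse-distortion polynomials):

```python
def build_lut(width, height, src_coords):
    """Precompute, for every target pixel, the nearest-neighbour
    source pixel given a real-valued mapping src_coords(u, v)."""
    lut = {}
    for v in range(height):
        for u in range(width):
            su, sv = src_coords(u, v)
            # nearest-neighbour interpolation: round to the closest pixel
            lut[(u, v)] = (int(round(su)), int(round(sv)))
    return lut

def correct(image, lut, width, height):
    """Apply the LUT at run time: one table lookup per target pixel."""
    return [[image[lut[(u, v)][1]][lut[(u, v)][0]]
             for u in range(width)] for v in range(height)]

# Toy mapping standing in for the inverse-distortion polynomials:
lut = build_lut(4, 4, lambda u, v: (u * 0.9, v * 0.9))
img = [[10 * v + u for u in range(4)] for v in range(4)]
out = correct(img, lut, 4, 4)
```

All the polynomial work happens once in `build_lut`; the run-time cost per frame is a single lookup per pixel, which is consistent with the LUT version reaching a higher clock with far fewer slices.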
Performance results of these two techniques are presented in terms of (i) execution time and (ii) FPGA logic resource requirements.

The proposed architectures have been described in VHDL in a fixed-point fashion, implemented on a Xilinx Virtex-II FPGA device, and simulated using an industry-reference simulator (ModelSim). The pixel values of both the input distorted and the output corrected images use an 8-bit integer word length. The coordinates use an 18-bit word length.

The results are presented in Figures 12 and 13 and in Table 2.
The execution time of the direct computation implementation is comparatively very slow. This is due to the fact that the direct computation approach consumes a much greater amount of logic resources than the lookup table approach. Moreover, its slow clock (10 MHz) could be increased by splitting the complex arithmetic logic into several smaller stages. The significant difference between the two approaches is that the direct computation approach requires more computation time and arithmetic operations, while the LUT approach requires more memory accesses and more RAM block occupation. Regarding latency, both approaches can be executed with respect to the
Figure 13: Execution time versus image size for both the direct computation and LUT-based approaches.
real-time constraint of video cadence (25 frames per second). Depending on the application, the best compromise between time and resources must be chosen by the user. For our application, arithmetic operations are intensively needed in later stages of the preprocessing block, while memory blocks are available; so we chose the LUT approach to benefit in both time and resources.
5. Thresholding and Labelling

After lens distortion correction, the projected laser spots must be extracted from the grey-level image in order to deliver a 3D representation of the scene. Laser spots appear in the image with variable sizes (depending on the absorption of the surface and the projection angle). At this level, a preprocessing block has been developed and implemented in hardware to perform an adaptive thresholding, which produces a binary image, and a labelling, which classifies each laser spot so that its center can be computed later.
5.1. Thresholding Algorithm. Several methods exist, from a static threshold value defined by the user up to dynamic algorithms such as Otsu's method [26].

We have chosen to develop a new approach, less complex than Otsu's or other well-known dynamic methods, in order to reduce the processing time [27]. This simple method is described in Figure 14:

(i) build the histogram of the grey-level image,

(ii) find the first maximum of the Gaussian corresponding to the background, and compute its mean μ and standard deviation σ,

(iii) calculate the threshold value with (13):

Threshold = μ + ασ, (13)

where α is an arbitrary constant. A parallel processing architecture has been designed to compute the threshold and produce a binary image. Full features of this implementation are given in [28].

Figure 14: Method developed in Cyclope (background Gaussian with mean μ and deviation σ; the bias μ + ασ separates the background from the laser spots).
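A minimal sketch of this thresholding (the background statistics here are estimated over the whole image, assuming the dark background dominates the histogram; this is a simplification of the paper's first-peak search):

```python
import math

def adaptive_threshold(pixels, alpha=3.0):
    """Threshold = mu + alpha*sigma as in (13). mu and sigma are the
    background mean and deviation, here estimated over all pixels
    under the assumption that background pixels dominate."""
    n = len(pixels)
    mu = sum(pixels) / n
    sigma = math.sqrt(sum((p - mu) ** 2 for p in pixels) / n)
    return mu + alpha * sigma

def binarize(pixels, threshold):
    """Produce the binary image: 1 for laser-spot pixels, 0 otherwise."""
    return [1 if p > threshold else 0 for p in pixels]

# Mostly dark background with a few bright laser-spot pixels:
img = [10] * 95 + [250] * 5
mask = binarize(img, adaptive_threshold(img, alpha=3.0))
```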
5.2. Labelling. After this first stage of extraction of the laser spots from the background, it is necessary to classify each laser spot in order to compute its center separately. Several methods have been developed in the past. We chose to use a classical two-pass connected-component labelling algorithm with 8-connectivity, for which we designed a specific optimized Intellectual Property block in VHDL; it uses fixed-point numbers.
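The classical two-pass scheme can be sketched as follows (a Python stand-in for the VHDL IP; the first pass assigns provisional labels and records equivalences in a union-find table, the second pass resolves them):

```python
def label(binary, width, height):
    """Two-pass connected-component labelling with 8-connectivity.
    binary[v][u] is 0/1; returns a label image (0 = background)."""
    labels = [[0] * width for _ in range(height)]
    parent = {}                                   # union-find over labels

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    next_label = 1
    for v in range(height):
        for u in range(width):
            if not binary[v][u]:
                continue
            # already-visited 8-connected neighbours
            neigh = [labels[v + dv][u + du]
                     for dv, du in ((-1, -1), (-1, 0), (-1, 1), (0, -1))
                     if 0 <= v + dv and 0 <= u + du < width
                     and labels[v + dv][u + du]]
            if not neigh:                         # new provisional label
                parent[next_label] = next_label
                labels[v][u] = next_label
                next_label += 1
            else:                                 # merge equivalent labels
                m = min(find(n) for n in neigh)
                labels[v][u] = m
                for n in neigh:
                    parent[find(n)] = m
    for v in range(height):                       # second pass: resolve
        for u in range(width):
            if labels[v][u]:
                labels[v][u] = find(labels[v][u])
    return labels

# Two separate spots, one of them diagonally connected:
img = [[1, 1, 0, 0],
       [0, 1, 0, 1],
       [0, 0, 0, 1]]
lab = label(img, 4, 3)
```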
6. Computation of Spot Centers

The thresholding and labelling processes applied to the captured image allow us to determine the area of each spot (number of pixels). The coordinates of the center of these spots can be calculated as follows:

u_gI = (Σ_{i∈I} u_i)/N_I,
v_gI = (Σ_{i∈I} v_i)/N_I, (14)
where u_gI and v_gI are the abscissa and ordinate of the Ith spot center, u_i and v_i are the coordinates of the pixels constituting the spot, and N_I is the number of pixels of the Ith spot (its area in pixels).

To obtain an accurate 3D reconstruction, we need to compute the spot centers with the highest possible precision, without increasing the total computing time, to satisfy the real-time constraint. The hardest step in the center detection part is the division operation A/B in (14). Several methods exist to solve this problem.
6.1. Implementation of a Hardware Divider. The simplest method is the use of a hardware divider, but such dividers are computationally expensive and consume a considerable amount of resources. This is not acceptable for a real-time embedded system. Other techniques can be used to compute the center of the laser spots while avoiding hardware dividers.

Figure 15: Smallest rectangle containing the active pixels.
6.2. Approximation Method. Some studies suggest approximation methods to avoid the implementation of hardware dividers. Such methods, like the one implemented in [29], replace the active pixels by the smallest rectangle containing the region and then replace the usual division by a simple shift (division by 2):

u'_gI = (Max(u_i) + Min(u_i))/2,
v'_gI = (Max(v_i) + Min(v_i))/2. (15)
This approximation is expressed in (15), where (u_i, v_i) are the active pixel coordinates, and (u'_gI, v'_gI) are the approximated coordinates of the spot center.

The determination of the rectangle limits requires scanning the image twice, detecting at every scanning step, respectively, the minimum and maximum of the pixel coordinates. For each spot, we must compare the coordinates of every pixel against the last registered minimum and maximum to assign new values to U_m, U_M, V_m, and V_M (m: minimum; M: maximum). If N_p is the average area of the spots (in pixels), we can estimate the number of operations needed to calculate the center of each spot as 4N_p + 6; globally, N_op ≈ 25 · N · (4N_p + 6) operations are needed to calculate the centers of N spots (at a video cadence of 25 fps). Such an approximation is simple and easy to use but still needs considerable computation time. Besides, the error is not negligible: the average error of this method is nearly 0.22 pixel, and the maximum error is more than 0.5 pixel [29]. Taking the spot of Figure 15 as an example of the inaccuracy of such a method, the real center position of these pixels is (4.47; 6.51), but when applying this approximation method, the center position becomes (5; 6). This inaccuracy results in a mismatching problem that affects the measurement result when reconstructing the object.
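The gap between the two estimators is easy to reproduce (a sketch with a made-up asymmetric spot, not the pixels of Figure 15):

```python
def centroid(pixels):
    """Exact spot center of (14): mean of the pixel coordinates."""
    n = len(pixels)
    return (sum(u for u, v in pixels) / n, sum(v for u, v in pixels) / n)

def rect_center(pixels):
    """Approximation of (15): center of the smallest enclosing rectangle."""
    us = [u for u, v in pixels]
    vs = [v for u, v in pixels]
    return ((max(us) + min(us)) / 2, (max(vs) + min(vs)) / 2)

# An asymmetric spot: most of its mass lies on the left column,
# so the rectangle center drifts away from the true centroid.
spot = [(3, 6), (3, 7), (3, 8), (4, 7), (5, 7)]
exact = centroid(spot)       # (3.6, 7.0)
approx = rect_center(spot)   # (4.0, 7.0)
```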
Figure 16: 3D unit (per-epipolar-line estimation and comparison blocks fed by the (a, b) parameter memory, an encoder producing the line index, and the 1/z computation block).
6.3. Our Method. The area of each spot (number of pixels) is always a positive integer, and its value is limited to a predetermined interval [N_min, N_max], where N_min and N_max are, respectively, the minimum and maximum areas of a laser spot in the image. The spot areas depend on the object illumination, the distance between object and camera, and the angle of view of the scene. Our method consists in storing the values 1/N, where N represents the spot pixel count and can take values in [1, N_limit]; N_limit represents the maximum considered size, in pixels, of a spot.
In this case we only need to compute a multiplication, which is summarised here:

u_gI = (u_1 + u_2 + ⋯ + u_{N_I}) · (1/N_I),
v_gI = (v_1 + v_2 + ⋯ + v_{N_I}) · (1/N_I). (16)
The implementation of such a filter is very easy, given that most DSP functions are provided in recent FPGAs. For example, the Virtex-II architecture [30] provides an 18 × 18 bit multiplier with a latency of about 4.87 ns at 205 MHz, optimised for high-speed operations. Additionally, its power consumption is lower compared to a slice-based implementation of an 18-bit by 18-bit multiplier [31]. For N luminous spots, the number of operations needed to compute the center coordinates is N_op ≈ 25 · N · N_p, where N_p is the average area of the spots. When implementing our approach on a Virtex-II Pro FPGA (XC2VP30), it was clear that we gain in both execution time and size. A comparison of the different implementation approaches is described in the next section.
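A sketch of this divider-free centroid (floating point for clarity; the hardware table would store fixed-point reciprocals):

```python
N_LIMIT = 64  # assumed maximum spot area, in pixels
RECIP = [0.0] + [1.0 / n for n in range(1, N_LIMIT + 1)]  # memorised 1/N

def spot_center(pixels):
    """Centroid of (16): two sums and one multiply by the stored 1/N,
    replacing the division of (14)."""
    inv = RECIP[len(pixels)]        # table lookup instead of a divider
    u = sum(p[0] for p in pixels) * inv
    v = sum(p[1] for p in pixels) * inv
    return u, v

center = spot_center([(3, 6), (3, 7), (3, 8), (4, 7), (5, 7)])  # ~(3.6, 7.0)
```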
7. Matching Algorithm

The sets of parameters for the epipolar and depth models are used at run time to perform point matching (identifying the original position of a pattern point from its image) and to calculate the depth using the coordinates of each laser spot center.

For this purpose we have developed the parallel architecture shown in Figure 16, described in detail in [32].

Starting from the point abscissa (u), we calculate the estimated ordinate (v̂) it would have if it belonged to a given epipolar line, and compare this estimation with the true ordinate (v). These operations are performed for all the epipolar lines simultaneously. After thresholding, the encoder returns the index of the corresponding epipolar line.

The next step is to calculate the z coordinate from the u coordinate and the appropriate depth model parameters (α, β).

These computation blocks are synchronous and pipelined, thus allowing high processing rates.

7.1. Estimation Block. In this block the estimated ordinate is calculated as v̂ = a·u + b. The (a, b) parameters are loaded from memory.

7.2. Comparison Block. In this block the absolute value of the difference between the ordinate v and its estimation v̂ is calculated. This difference is then thresholded.

The thresholding avoids a resource-consuming sort stage. The threshold was chosen a priori as half the minimum distance between two consecutive epipolar lines. The threshold can be adjusted for each comparison block.

This block returns a 1 result if the distance is below the threshold.

7.3. Encoding Block. If the comparison blocks return a unique 1 result, the encoder returns the corresponding epipolar line index.

If no comparison block returns a true result, the point is irrelevant and considered as picture noise.
Figure 17: Wireless communication (Cyclope and the PC host each connected to an XBee module through UART data and flow-control lines: Di/Do, /CTS, /RTS).
If more than one comparison block returns 1, we consider that we have a correspondence error, and a flag is set.

The selected index is then carried to the next stage, where the z coordinate is calculated; the index selects the right parameters for the depth model. As said earlier, we compute 1/z rather than z in order to have a simpler computation unit. This computation block is then identical to the estimation block.
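The whole match-then-depth flow can be sketched sequentially (the hardware evaluates all comparison blocks in parallel; the epipolar and depth parameters below are invented for illustration):

```python
def match_and_depth(u, v, epipolar, depth_params, half_gap):
    """Identify the epipolar line of a spot center (u, v), then return
    (line_index, 1/z). epipolar[k] = (a, b) of line k; depth_params[k] =
    (alpha, beta) of the hyperbolic depth model (5)."""
    hits = [k for k, (a, b) in enumerate(epipolar)
            if abs(v - (a * u + b)) < half_gap]   # comparison blocks
    if len(hits) != 1:
        return None      # no hit: picture noise; several: flagged error
    k = hits[0]
    alpha, beta = depth_params[k]
    return k, alpha * u + beta                    # 1/z computation block

# Two invented epipolar lines and their depth models:
epi = [(0.01, 10.0), (0.01, 30.0)]
dep = [(0.004, 0.01), (0.005, 0.02)]
result = match_and_depth(100.0, 31.2, epi, dep, half_gap=5.0)
```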
8. Wireless Communication

Finally, after computation, the 3D coordinates of the laser dots, accompanied by the texture image, are sent to an external reader. Cyclope is therefore equipped with a wireless communication block which allows us to transmit the texture image and the 3D coordinates of the laser spot centers, and even to remotely reconfigure the digital processing architecture (Over-The-Air FPGA reconfiguration). Pending the IEEE 802.15 Body Area Network standard [33], the frequency assigned to implanted-device RF communication is around 403 MHz and referred to as the MICS (Medical Implant Communication System) band, for essentially three reasons:

(i) a small antenna;

(ii) a minimum-loss environment, which allows the design of a low-power transmitter;

(iii) a free band that does not cause interference to other users of the electromagnetic radio spectrum [34].
In order to rapidly build a wireless link for our prototype, we chose to use ZigBee modules at 2.45 GHz, available on the market, instead of MICS modules. We are aware that this latter frequency is not usable for the communication between the implant and an external reader, due to the electromagnetic losses of the human body. Two XBee-PRO modules from Digi have been used: one for the demonstrator and the second plugged into a PC host, where a human-machine interface has been designed to visualise in real time the 3D textured reconstruction of the scene.

Communication between the wireless module and the FPGA circuit is performed by a standard UART protocol; this principle is shown in Figure 17. To implement this communication we integrated a MicroBlaze softcore processor with UART functionality. The softcore recovers all the data stored in memory (texture and 3D coordinates) and sends them to the wireless module.
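As an illustration of what such a link carries (the frame layout below is invented for the example, not the paper's actual protocol), the spot coordinates can be serialised into a compact byte stream before being handed to the UART:

```python
import struct

def pack_spots(spots):
    """Serialise (u, v, 1/z) spot records into a little-endian frame:
    a 1-byte count followed by three float32 values per spot.
    This layout is hypothetical, chosen only for the example."""
    frame = struct.pack("<B", len(spots))
    for u, v, inv_z in spots:
        frame += struct.pack("<fff", u, v, inv_z)
    return frame

def unpack_spots(frame):
    """Inverse of pack_spots, as the PC-side interface would apply it."""
    (count,) = struct.unpack_from("<B", frame, 0)
    return [struct.unpack_from("<fff", frame, 1 + 12 * i)
            for i in range(count)]

payload = pack_spots([(100.0, 31.0, 0.52), (120.0, 55.5, 0.49)])  # 25 bytes
```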
Figure 18: Demonstrator (instrumentation block: CCD camera and laser; processing block: FPGA; RF block: ZigBee module).
9. Demonstrator, Testbench, and Results

9.1. Experimental Demonstrator. To demonstrate the feasibility of our system, a large-scale demonstrator has been realised. It uses an FPGA prototyping board based on a Xilinx Virtex-II Pro, a pulsed IR laser projector [35] coupled with a diffraction grating that generates a 49-dot pattern, and a CCD imager.

Figure 18 represents the experimental setup. It is composed of a standard 3 mm lens, the CCD camera with an external 8-bit ADC, an IR pattern projector, and a Virtex-II Pro prototyping board.

The FPGA mainly hosts the computation unit, but it also controls image acquisition, laser synchronisation, and analog-to-digital conversion, stores the images, and displays the result through a VGA interface.
Figure 19 shows the principal parts of the control and storage architecture as set in the FPGA. Five parts have been designed:

(i) a global sequencer to control the entire process,

(ii) a reset and integration-time configuration unit,

(iii) a VGA synchronisation interface,

(iv) a dual-port memory to store the images and to allow asynchronous acquisition and display operations,

(v) a wireless communication module based on the ZigBee protocol.

A separate pulsed IR projector has been added to the system to demonstrate the system functionality.
Figure 19: Implementation of the control and storage architecture (pulsed pattern projector, timing control, ADC control, dual-port RAMs, VGA controller, and display).
Figure 20: FPGA working frequency (F_max) versus number of parallel operations.
The computation unit was described in VHDL and implemented on a Xilinx Virtex-II Pro FPGA (XC2VP30) with 30816 logic cells and 136 hardware multipliers [31]. Synthesis and placement were achieved for 49 parallel processing elements. We use 28% of the LUTs and 50 hardware multipliers, for a working frequency of 148 MHz.

9.2. Architecture Performance. To estimate the evolution of the architecture performances, we used a generic description and repeated the synthesis and placement for different pattern sizes (numbers of parallel operations). Figure 20 shows that in every case our architecture mapped on an FPGA can work at almost 90 MHz at least, and can thus meet the real-time constraint of 40 milliseconds.
Table 3: Performance of the distortion correction.

Slices    1795 (13%)
Latency   11.43 ms
Error     < 0.01 pixels
Figure 21: (a) Checkerboard image before distortion correction. (b) Checkerboard image after correction.
9.3. Error Estimation of the Optical Correction. The implementation results of the distortion correction method are summarised in Table 3. Here the correction model is applied only to the active light spots. Figure 21 presents an image before and after our lens distortion correction.
Regarding size and latency, it is clear that the results are suitable for our application.
Comparing our method for computing the spot centers with two other methods (see Table 4), it is clear that our approach has higher accuracy and a smaller size than the approximation method, and while it has nearly the same accuracy as the method using a hardware divider, it uses fewer resources.
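As an illustration of the trade-off behind Table 4 (a sketch of ours, not the paper's VHDL implementation; the function names and the toy spot are invented), the exact intensity-weighted barycenter requires divisions, which are expensive in FPGA logic, whereas a bounding-box midpoint avoids them at the cost of accuracy:

```python
# Illustrative sketch (not the paper's exact pipeline): two ways to locate
# the center of a thresholded laser spot. The intensity-weighted barycenter
# needs divisions (costly in hardware); the bounding-box midpoint is the
# kind of cheap approximation that trades accuracy for area.

def barycenter(pixels):
    """Intensity-weighted center of mass; requires two divisions."""
    total = sum(w for (_, _, w) in pixels)
    x = sum(px * w for (px, _, w) in pixels) / total
    y = sum(py * w for (_, py, w) in pixels) / total
    return x, y

def bbox_center(pixels):
    """Division-free approximation: midpoint of the bounding box."""
    xs = [px for (px, _, _) in pixels]
    ys = [py for (_, py, _) in pixels]
    return (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2

# Asymmetric spot: the weighted center is pulled toward the bright pixel.
spot = [(10, 10, 1), (11, 10, 1), (12, 10, 6)]
print(barycenter(spot))
print(bbox_center(spot))
```

On this asymmetric toy spot the weighted barycenter lands at x = 11.625 while the division-free midpoint stays at x = 11.0; hardware schemes that approach the exact result without a full divider (for example, multiplying by a precomputed reciprocal) motivate the intermediate row of Table 4.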
Figure 22: Error comparison before and after applying distortion correction and centers recomputing (depth estimation error in % versus depth in cm; curves: distorted image, corrected image).
Regarding latency, all three approaches respect the real-time constraint of the video cadence (25 frames per second). Comparing many measurements of the depth estimation before and after the implementation of our improvements, the results indicate that the precision of the system increased: the residual error is reduced by about 33% (Figure 22).
These results were interpolated with a scale factor to estimate the lens error in the case of integration inside a video capsule; the results are shown in Figure 23. This scaling was calculated with a distance of 1 cm between the laser projector and the imager, the maximal distance that can be considered for endoscopy, as it corresponds to the diameter of the PillCam video capsule. This shows that correcting the distortion produced by the lens increases the accuracy of our sensor.
9.4. Error Estimation of the 3D Reconstruction. In order to validate our reconstruction architecture, we have compared the results obtained with the synthesised IP (Table 5) and those obtained from a floating-point mathematical model which was already validated by experimentation. As we can see, the calculation error margin is relatively small in comparison with the distance variations, which shows that our approach of translating a complex mathematical model into digital processing for an embedded system is valid.
Table 6 shows the reconstruction error for different distances and sizes of the stereoscopic base. We can see that for a base of 5 mm we are able to obtain a 3D reconstruction with an error below 4% at a distance of 10 cm. This precision is perfectly sufficient in the context of human body exploration, and integrating a stereoscopic base of such a size is relatively simple.
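The trend in Table 6 follows from the usual triangulation relation: depth z = f·b/d for focal length f, stereoscopic base b, and disparity d, so a fixed disparity uncertainty gives a relative depth error that grows with z and shrinks with b. The sketch below is our own first-order illustration; the 500-pixel focal length and 0.5-pixel disparity uncertainty are assumed values, not taken from the paper:

```python
# First-order triangulation error model (our illustration, not the paper's
# reconstruction model): z = f*b/d, and a disparity uncertainty eps gives a
# relative depth error |dz/z| ~= z*eps/(f*b), linear in depth and inversely
# proportional to the stereoscopic base.

def depth(f_pix, base_m, disparity_pix):
    """Depth from disparity: z = f*b/d."""
    return f_pix * base_m / disparity_pix

def rel_depth_error(f_pix, base_m, z_m, eps_pix=0.5):
    """First-order relative error |dz/z| ~= z*eps/(f*b)."""
    return z_m * eps_pix / (f_pix * base_m)

f = 500.0  # assumed focal length in pixels
for b in (0.005, 0.015):           # 0.5 cm and 1.5 cm bases
    for z in (0.05, 0.10, 0.50):   # depths in meters
        e = 100 * rel_depth_error(f, b, z)
        print(f"b={b*100:.1f} cm  z={z*100:.0f} cm  error={e:.2f}%")
```

With these assumed parameters the model reproduces the qualitative behaviour of Table 6: tripling the base divides the relative error by three, and doubling the depth roughly doubles it.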
Figure 23: Error comparison before and after applying distortion correction and centers recomputing, after scaling for integration (depth estimation error in % versus depth in cm; curves: distorted, corrected).
Table 4: Center computation performance.

Method             Slices   Latency (µs)   Error (pixel)
Approximation      287      4.7            0.21
Hardware divider   1804     1.77           0.0078
Our approach       272      2.34           0.015
Table 5: Results validation.

Coordinate couples          Model results   IP results
abscissa/ordinate (pixel)   (meter)         (meter)
401/450                     1.57044         1.57342
357/448                     1.57329         1.57349
402/404                     1.57223         1.57176
569/387                     1.22065         1.21734
446/419                     1.11946         1.11989
478/319                     1.07410         1.07623
424/315                     1.04655         1.04676
375/267                     1.03283         1.03297
420/177                     1.03316         1.03082
Table 6: Precision versus the size of the stereoscopic base.

Base of 0.5 cm              Base of 1.5 cm
Distance (cm)   Error (%)   Distance (cm)   Error (%)
5               1.8         5               0.61
10              3.54        10              1.21
50              15.52       50              5.77
100             26.87       100             10.91
Figure 24: Visualisation of the results by our application (texture, IR laser spots, 3D VRML rendering).
9.5. Example of Reconstruction. We have used the calibration results to reconstruct the volume of an object (a 20 cm diameter cylinder). The pattern was projected on the scene and snapshots were taken.
The pattern points were extracted and associated to laser beams using the epipolar constraint. The depth of each point was then calculated using the appropriate model. The texture image was mapped on the reconstructed object and rendered in a VRML player.
We have created an application written in C++ for visualising the Cyclope results (Figure 24). The application gets the textural information and the spatial positions of the barycenters of the 49 infrared laser spots from a wireless communication module. After receiving the results, it draws the texture and three binary maps representing the locations of the 49 barycenters in a 3D coordinate system (XY, ZX, and ZY).
The hardware requirements are recapitulated in Table 7. We can observe that the design is small; expressed as an equivalent number of logic gates, it could be integrated in a small-area chip like the IGLOO AGL1000 device from Actel. Such a device has a size of 10 × 10 mm² and its core can be integrated in a VCE which has a diameter of around 1 cm. At this moment, we have not made an implementation on this platform; this is a feasibility study, but the first results prove that this solution is valid if we consider the needed resources.
We also present an estimation of the energy consumption, which was realised with two tools. This estimation is visible in Table 8. The first tool is XPE (Xilinx Power Estimation) from Xilinx, used to evaluate the power consumption of a Virtex, and the second is the IGLOO power calculator from Actel, used to evaluate the power consumption of a low-power FPGA.
Table 7: Recapitulation of the performances.

Architecture           CLB slices   Latches   LUT         RAM
Camera                 309          337       618         4
Optical correction*    92/94        8/8       176/190     32/56
Thresholding           107          192       214         1
Labelling              114          102       227         0
Matching               1932         3025      3864        0
Communication          170          157       277         3
Total used*            2323/2325    3821      1555/1569   40/64
Total free             13693        29060     27392       136

* Direct computation/look-up table.
Table 8: Processing block power consumption estimation.

Device   Power consumption   Duration (1 battery)   Duration (3 batteries)
Virtex   1133 mW             29 min                 1 h 26 min
IGLOO    128.4 mW            4 hours                12 hours
These two tools use the processing frequency, the number of logic cells, the number of D flip-flops, and the amount of memory of the design to estimate the power consumption. To realise our estimation, we use the results summarised in Table 7. Our estimation is made with an activity rate of 50%, which is the worst case.
To validate the power consumption estimation in an embedded context, we consider a 3V-CR1220 battery (a 3-volt cell, 1.2 cm in diameter and 2 mm thick) which has a maximum capacity of 180 mAh, that is to say an ideal energy of 540 mWh. This battery is fully compatible with a VCE like the PillCam from Given Imaging.
As we can see, the integration of a Virtex in a VCE is impossible because its SRAM memory consumes too much energy. If we consider the IGLOO technology based on flash memory, we can observe that its power consumption is compatible with a VCE. Such technology permits four hours of autonomy with only one battery, and twelve hours of autonomy if we use three 3V-CR1220 batteries in the VCE. This result is encouraging because at this time the mean duration of an examination is ten hours.
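The durations in Table 8 follow directly from this battery model; the helper below reproduces the arithmetic (540 mWh of ideal energy divided by the estimated power draw of the processing block):

```python
# Back-of-the-envelope check of Table 8: a CR1220 coin cell holds 180 mAh
# at 3 V, i.e. an ideal energy of 540 mWh. Autonomy is simply stored energy
# divided by the estimated power draw of the processing block.

def autonomy_hours(capacity_mwh, power_mw, batteries=1):
    """Ideal autonomy in hours for a given number of batteries."""
    return batteries * capacity_mwh / power_mw

CAPACITY = 180 * 3.0  # mAh * V = 540 mWh

for device, power in (("Virtex", 1133.0), ("IGLOO", 128.4)):
    one = autonomy_hours(CAPACITY, power)
    three = autonomy_hours(CAPACITY, power, batteries=3)
    print(f"{device}: {one * 60:.0f} min with 1 battery, "
          f"{three:.1f} h with 3 batteries")
```

Running this reproduces the table's figures: about 29 minutes (86 minutes with three cells) for the Virtex and a little over four hours (about 12.6 hours with three cells) for the IGLOO.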
10. Conclusion and Perspectives
We have presented in this paper Cyclope, a sensor designed to be a 3D video capsule.
We have explained a method to acquire the images at a 25 frame/s video rate with a discrimination between the texture and the projected pattern. This method uses an energetic approach, a pulsed projector, and an original 64 × 64 CMOS image sensor with programmable integration time. Multiple images are taken with different integration times to obtain an image of the pattern which is more energetic than the background texture. Our CMOS imager validates this method.
We also present 3D reconstruction processing that allows a precise and real-time reconstruction. This processing is specifically designed for an integrated sensor, and its integration in an FPGA-like device has a low power consumption compatible with a VCE examination.
The method was tested on a large-scale demonstrator using an FPGA prototyping board and a 352 × 288 pixel CCD sensor. The results show that it is possible to integrate a stereoscopic base designed for an integrated sensor and to keep a good precision for human body exploration.
The next steps of this work are the chip-level integration of both the image sensor and the pattern projector, and the evaluation of the power consumption of the pulsed laser projector considering the optical efficiency of the diffraction head.
The presented version of Cyclope is the first step toward the final goal of the project. After this, the goal is to realise real-time pattern recognition with processing like support vector machines or neural networks. The final aim of Cyclope is to be a real smart sensor that can realise a part of a diagnosis inside the body and thus increase its ability.
References
[1] G. Iddan, G. Meron, A. Glukhovsky, and P. Swain, "Wireless capsule endoscopy," Nature, vol. 405, no. 6785, pp. 417–418, 2000.
[2] J.-F. Rey, K. Kuznetsov, and E. Vazquez-Ballesteros, "Olympus capsule endoscope for small and large bowel exploration," Gastrointestinal Endoscopy, vol. 63, no. 5, p. AB176, 2006.
[3] M. Gay, et al., "La vidéo capsule endoscopique: qu'en attendre?" CISMEF, http://www.churouen.fr/ssf/equip/capsules-videoendoscopiques.html.
[4] T. Graba, B. Granado, O. Romain, T. Ea, A. Pinna, and P. Garda, "Cyclope: an integrated real-time 3d image sensor," in Proceedings of the 19th International Conference on Design of Circuits and Integrated Systems, 2004.
[5] F. Marzani, Y. Voisin, L. L. Y. Voon, and A. Diou, "Active stereovision system: a fast and easy calibration method," in Proceedings of the 6th International Conference on Control, Automation, Robotics and Vision (ICARCV '00), 2000.
[6] W. Li, F. Boochs, F. Marzani, and Y. Voisin, "Iterative 3d surface reconstruction with adaptive pattern projection," in Proceedings of the 6th IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP '06), pp. 336–341, August 2006.
[7] P. Lavoie, D. Ionescu, and E. Petriu, "A high precision 3d object reconstruction method using a color coded grid and NURBS," in Proceedings of the International Conference on Image Analysis and Processing, 1999.
[8] Y. Oike, H. Shintaku, S. Takayama, M. Ikeda, and K. Asada, "Real-time and high resolution 3-d imaging system using light-section method and smart CMOS sensor," in Proceedings of the IEEE International Conference on Sensors (SENSORS '03), vol. 2, pp. 502–507, October 2003.
[9] A. Ullrich, N. Studnicka, J. Riegl, and S. Orlandini, "Long-range high-performance time-of-flight-based 3d imaging sensors," in Proceedings of the International Symposium on 3D Data Processing Visualization and Transmission, 2002.
[10] A. Mansouri, A. Lathuilière, F. S. Marzani, Y. Voisin, and P. Gouton, "Toward a 3d multispectral scanner: an application to multimedia," IEEE Multimedia, vol. 14, no. 1, pp. 40–47, 2007.
[11] F. Bernardini and H. Rushmeier, "The 3d model acquisition pipeline," Computer Graphics Forum, vol. 21, no. 2, pp. 149–172, 2002.
[12] S. Zhang, "Recent progresses on real-time 3d shape measurement using digital fringe projection techniques," Optics and Lasers in Engineering, vol. 48, no. 2, pp. 149–158, 2010.
[13] F. W. Depiero and M. M. Triverdi, "3d computer vision using structured light: design, calibration, and implementation issues," Journal of Advances in Computers, pp. 243–278, 1996.
[14] E. E. Hemayed, M. T. Ahmed, and A. A. Farag, "CardEye: a 3d trinocular active vision system," in Proceedings of the IEEE Conference on Intelligent Transportation Systems (ITSC '00), pp. 398–403, Dearborn, Mich, USA, October 2000.
[15] A. Kolar, T. Graba, A. Pinna, O. Romain, B. Granado, and E. Belhaire, "Smart bi-spectral image sensor for 3d vision," in Proceedings of the 6th IEEE Conference on Sensors (IEEE SENSORS '07), pp. 577–580, Atlanta, Ga, USA, October 2007.
[16] B. Gyselinckx, C. Van Hoof, J. Ryckaert, R. F. Yazicioglu, P. Fiorini, and V. Leonov, "Human++: autonomous wireless sensors for body area networks," in Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 12–18, 2005.
[17] B. Warneke, M. Last, B. Liebowitz, and K. S. J. Pister, "Smart dust: communicating with a cubic-millimeter computer," Computer, vol. 34, no. 1, pp. 44–51, 2001.
[18] R. Horaud and O. Monga, Vision par Ordinateur, chapter 5, Hermès, 1995.
[19] O. Faugeras, Three-Dimensional Computer Vision, a Geometric Viewpoint, MIT Press, Cambridge, Mass, USA, 1993.
[20] J. Batlle, E. Mouaddib, and J. Salvi, "Recent progress in coded structured light as a technique to solve the correspondence problem: a survey," Pattern Recognition, vol. 31, no. 7, pp. 963–982, 1998.
[21] S. Woo, A. Dipanda, F. Marzani, and Y. Voisin, "Determination of an optimal configuration for a direct correspondence in an active stereovision system," in Proceedings of the IASTED International Conference on Visualization, Imaging, and Image Processing, 2002.
[22] O.-Y. Mang, S.-W. Huang, Y.-L. Chen, H.-H. Lee, and P.-K. Weng, "Design of wide-angle lenses for wireless capsule endoscopes," Optical Engineering, vol. 46, October 2007.
[23] J. Heikkilä and O. Silvén, "A four-step camera calibration procedure with implicit image correction," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1106–1112, San Juan, Puerto Rico, USA, 1997.
[24] K. Hwang and M. G. Kang, "Correction of lens distortion using point correspondence," in Proceedings of the IEEE Region 10 Conference (TENCON '99), vol. 1, pp. 690–693, 1999.
[25] J. Heikkilä, "Accurate camera calibration and feature based 3-D reconstruction from monocular image sequences," Ph.D. dissertation, University of Oulu, Oulu, Finland, 1997.
[26] N. Otsu, "A threshold selection method from gray level histogram," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[27] J. N. Kapur, P. K. Sahoo, and A. K. C. Wong, "A new method for gray-level picture thresholding using the entropy of the histogram," Computer Vision, Graphics, & Image Processing, vol. 29, no. 3, pp. 273–285, 1985.
[28] D. Faura, T. Graba, S. Viateur, O. Romain, B. Granado, and P. Garda, "Seuillage dynamique temps réel dans un système embarqué," in Proceedings of the 21ème Colloque du Groupe de Recherche et d'Étude du Traitement du Signal et des Images (GRETSI '07), 2007.
[29] T. Graba, "Étude d'une architecture de traitement pour un capteur intégré de vision 3d," Ph.D. dissertation, Université Pierre et Marie Curie, 2006.
[30] M. Adhiwiyogo, "Optimal pipelining of the I/O ports of the Virtex-II multiplier," XAPP636, v1.4, June 2004.
[31] Xilinx, "Virtex-II Pro and Virtex-II Pro Platform FPGA: Complete Data Sheet," October 2005.
[32] A. Kolar, T. Graba, A. Pinna, O. Romain, B. Granado, and T. Ea, "A digital processing architecture for 3d reconstruction," in Proceedings of the International Workshop on Computer Architecture for Machine Perception and Sensing (CAMPS '06), pp. 172–176, Montreal, Canada, August 2006.
[33] IEEE 802, http://www.ieee802.org/15/pub/TG6.html.
[34] M. R. Yuce, S. W. P. Ng, N. L. Myo, J. Y. Khan, and W. Liu, "Wireless body sensor network using medical implant band," Journal of Medical Systems, vol. 31, no. 6, pp. 467–474, 2007.
[35] Laser2000, http://www.laser2000.fr/index.php?id=368949&L=2.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 826296, 16 pages
doi:10.1155/2009/826296
Research Article
Performance Evaluation of UML2-Modeled Embedded Streaming
Applications with System-Level Simulation
Tero Arpinen, Erno Salminen, Timo D. Hämäläinen, and Marko Hännikäinen
Department of Computer Systems, Tampere University of Technology, P.O. Box 553, 33101 Tampere, Finland
Correspondence should be addressed to Tero Arpinen, tero.arpinen@tut.fi
Received 27 February 2009; Accepted 21 July 2009
Recommended by Bertrand Granado
This article presents an efficient method to capture an abstract performance model of streaming data real-time embedded systems (RTESs). Unified Modeling Language version 2 (UML2) is used for the performance modeling and as a front-end for a tool framework that enables simulation-based performance evaluation and design-space exploration. The adopted application metamodel in UML resembles the Kahn Process Network (KPN) model and it is targeted at simulation-based performance evaluation. The application workload modeling is done using UML2 activity diagrams, and the platform is described with structural UML2 diagrams and model elements. These concepts are defined using a subset of the profile for Modeling and Analysis of Real-Time and Embedded (MARTE) systems from OMG and custom stereotype extensions. The goal of the performance modeling and simulation is to achieve early estimates on task response times, processing element, memory, and on-chip network utilizations, among other information that is used for design-space exploration. As a case study, a video codec application on multiple processors is modeled, evaluated, and explored. In comparison to related work, this is the first proposal that defines a transformation between UML activity diagrams and streaming data application workload metamodels and successfully adopts it for RTES performance evaluation.
Copyright © 2009 Tero Arpinen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Multiprocessor System-on-Chip (SoC) offers a high-performance, yet energy-efficient, programmable platform for modern embedded devices. However, parallelism and the increasing complexity of applications necessitate efficient and automated design methods. Model-driven development (MDD) aims to shorten the design time using abstraction, gradual refinement, and automated analysis with transformation of models. The key idea is to utilize models to highlight certain aspects of the system (behavior, structure, timing, power consumption models, etc.) without an implementation.
Unified Modeling Language version 2 (UML2) [1] is a standard language for MDD. In the embedded system domain, its adoption is seen as promising for several purposes: requirements specification, behavioral and architectural modeling, test bench generation, and IP integration [2]. However, it should be noted that UML2 has also received criticism on its suitability for MDD [3, 4]. UML2 offers a rich set of diagrams for modeling and also expansion and tailoring methods to derive domain-specific languages. For example, several UML profiles targeted at embedded system design have been developed [5–7].
SoC complexity requires efficient performance evaluation and design-space exploration methods. These methods are often utilized at the system level to make early design decisions. Such decisions include, for instance, choosing the number and type of processors, and determining the mapping and scheduling of application tasks. Design-space exploration seeks to find the optimum solution for a given application (domain) and boundary constraints. The design space, that is, the number of possible system configurations, is practically always so large that it becomes intractable not only for manual design but also for brute-force optimization. Hence, efficient methods are needed, for example, optimization heuristics, tool frameworks, and models [8].
This article presents an efficient method to capture an abstract performance model of a streaming data real-time embedded system (RTES). Figure 1 presents the overall methodology used in this work. The goal of the performance modeling and simulation is to achieve early estimates on
Figure 1: The methodology used in this work: application workload modeling (UML2 activities) and platform performance modeling (UML2 structural) feed system-level simulation (SystemC), execution monitoring (simulation results), and design-space exploration (models and simulation results).
PE, memory, and on-chip network utilization, and task response times, among other information that is used for design-space exploration. UML2 is used for performance model specification. The application workload modeling is carried out using UML2 activity diagrams. The platform is described with structural UML2 diagrams and model elements annotated with performance values.
Our focus is on modeling streaming data applications. It is characteristic of streaming applications that a long sequence of data items flows through a stable set of computation steps (tasks) with only occasional control messaging and branching. Each task waits for the data items, processes them, and outputs the results to the next task. The adopted application metamodel has been formulated based on this assumption and it resembles the Kahn Process Network (KPN) [9] model.
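A minimal sketch of such a streaming chain (our illustration, not the article's metamodel) can make the task structure concrete: each task consumes items from a FIFO channel, processes them, and emits results to the next task.

```python
# KPN-style linear pipeline sketch: tasks communicate only through FIFO
# channels; each task reads an item, processes it, and writes the result
# to the channel feeding the next task. Stage functions are toy examples.

from collections import deque

def run_pipeline(source, stages):
    """Push a finite input stream through a linear chain of tasks."""
    channel = deque(source)
    for stage in stages:
        out = deque()
        while channel:                     # consume in FIFO order
            out.append(stage(channel.popleft()))
        channel = out                      # output feeds the next task
    return list(channel)

# Toy two-stage chain: scale each data item, then quantize it.
stages = [lambda x: 2 * x, lambda x: x // 3]
print(run_pipeline([3, 6, 9], stages))
```

In a real KPN the tasks run concurrently and block on empty channels; the sequential loop above only mirrors the dataflow structure, which is what the workload metamodel abstracts.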
A proprietary UML2 profile for capturing the performance characteristics of an application and platform is defined. The profile definition is based on a well-defined metamodel and reuses suitable modeling concepts from the profile for Modeling and Analysis of Real-Time and Embedded systems (MARTE) [5]. MARTE is a standard profile promoted by the Object Management Group (OMG) and it is a promising extension for general-purpose embedded system modeling. It is intended to replace the UML Profile for Schedulability, Performance and Time (SPT) [10]. MARTE is methodology-independent and it offers a common set of standard notations and semantics for a designer to choose from while still allowing custom extensions. This means that the profile defined in this article is a specialized instance of the MARTE profile that is dedicated to our performance evaluation methodology.
It should be noted that the performance models defined in this work can be and have been used together with a custom UML profile for embedded systems, called TUT-Profile [7, 11]. However, this article illustrates the models using the concepts of MARTE because the adoption of standards promotes commonly known notations and semantics between designers and interoperability between tools.
Further, the article presents how performance values can be specified in UML models with expressions using the MARTE Value Specification Language (VSL). This allows effective parameterization of the system performance model
Figure 2: Design Y-chart: application functions (workload) and platform resources (processing, communication, and memory elements) are bound by mapping application workloads onto platform elements, followed by performance analysis through simulations.
representation according to application-specific variables and reduces the amount of time-consuming and error-prone manual work.
The presented modeling methods are utilized in a tool framework targeted at simulation-based design-space exploration and performance evaluation. The exploration is based on collecting performance statistics from simulation to optimize the platform and mapping according to a predefined cost function.
An execution-monitoring tool provides visualization and monitoring of the system performance during the simulation. As a case study, a video codec system is modeled with the presented modeling methods, and performance evaluation and exploration are carried out using the tool framework.
The rest of the article is organized as follows. Section 2 analyses the methods and concepts used in RTES performance evaluation. Section 3 presents the metamodel utilized in this work for system performance characterization. UML2 and MARTE for RTES modeling are discussed in Section 4. Section 5 presents the UML2 specification of the utilized performance metamodel. Section 6 presents our performance evaluation tool framework. The video codec case study is covered in Section 7. After a final discussion on our proposal in Section 8, Section 9 concludes the article.
2. Analysis of Methods and Concepts Used in RTES Performance Evaluation
In this section the methods and concepts used in RTES performance evaluation are covered. This comprises an introduction to the design Y-chart in RTES performance evaluation, the phases of a model-based RTES performance evaluation process, a discussion on modeling language and tool development, and a short introduction to RTES timing analysis concepts. Finally, the related work on UML in RTES performance evaluation is examined.
2.1. Design Y-Chart and RTES Modeling. A typical approach for RTES performance evaluation follows the design Y-chart [12] presented in Figure 2 by separating the application description from the underlying platform description. These two are bound in the mapping phase. This means that the communication and computation of application functionalities are committed onto certain platform resources.
There are several possible abstraction levels for describing the application and platform for performance evaluation. One possibility is to utilize abstract specifications. This means that the application workload and the performance of the platform resources are represented symbolically without needing detailed executable descriptions.
Application workload is a quantity which informs how much capacity is required from the underlying platform components to execute certain functionality. In model-based performance evaluation the workloads can be estimated based on, for example, standard specifications, prior experience from the application domain, or available processing capacity. Legacy application components, on the other hand, can be profiled, and the performance models of these components can be evaluated together with the models of components yet to be developed.
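As a hedged sketch of this abstract workload notion (our own illustration, not the article's metamodel), a task's workload can be given as cycles per data item together with a firing rate, from which a processing element's utilization follows without any executable application description:

```python
# Abstract workload sketch: each task mapped onto a processing element (PE)
# is characterized by (cycles_per_item, items_per_second). The PE's
# utilization is the total cycle demand divided by its clock frequency.
# All numbers below are invented for illustration.

def pe_utilization(tasks, freq_hz):
    """tasks: list of (cycles_per_item, items_per_second) tuples."""
    demand = sum(cycles * rate for cycles, rate in tasks)
    return demand / freq_hz

# Two tasks mapped on a 100 MHz PE, both firing at 25 items/s.
tasks = [(2_000_000, 25), (1_000_000, 25)]
u = pe_utilization(tasks, 100e6)
print(f"PE utilization: {u:.0%}")
```

Estimates of this kind are exactly what symbolic workload annotations enable: remapping a task to another PE only moves its (cycles, rate) pair into that PE's sum.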
In addition to computational demands, the communication demands between application parts must be considered. In practice, the communication is realized as data messages transmitted between real-time operating system (RTOS) threads or between processing elements over an on-chip communication network. Shared buses and Network-on-Chip (NoC) links and routers perform scheduling for transmitted data packets in an analogous way as PEs execute and schedule computational tasks. Moreover, inter-PE communication can alternatively be performed using a shared memory. The performance characteristics of memories as well as their utilization play a major role in the overall system performance. The impact of computation, communication, and storage activities should all be considered in system-level analysis to enable successful performance evaluation of a modern SoC.
2.2. Model-Based RTES Performance Evaluation Process. An RTES performance evaluation process must follow disciplined steps to be effective. From the SoC designer's perspective, a generic performance evaluation process consists of the following steps. Some of the concepts of this and the next subsection have been reused and modified from the work in [13]:
(1) selection of the evaluation techniques and tools,
(2) measuring, profiling, and estimating workload characteristics of the application, and determining platform performance characteristics by benchmarking, estimation, and so forth,
(3) constructing the system performance model,
(4) measuring, executing, or simulating the system performance models,
(5) interpreting, validating, monitoring, and back-annotating data received from the previous step.
The selection of the evaluation techniques and tools is the first and foremost step in the performance evaluation process. This phase includes considering the requirements of the performance analysis and the availability of tools. It determines the modeling methods used and the effort required to perform the evaluation. It also determines the abstraction level and accuracy used. All further steps in the process are dependent on this step.
The second step is performed if the system performance model requires initial data about application task workloads or platform performance. This is based on profiling, specifications, or estimation. The application as well as the platform may alternatively be described using executable behavioral models. In that case, such additional information may not be needed, as all performance data can be determined during system model execution.
The actual system model is constructed in the third step by a system architect according to the defined metamodel and model representation methods. The gathered initial performance data is annotated to the system model. The annotation of the profiling results can also be accelerated by combining the profiling and back-annotation with automation tools such as [14].
After system modeling, the actual analysis of the model is carried out. This may involve several model transformations, for example, from UML to SystemC. The analysis methods can be classified into dynamic and static methods [8]. Dynamic methods are based on executing the system model with simulations. Simulations can be categorized into cycle-accurate and system-level simulations. Cycle-accurate simulation means that the timing of system behavior is defined with the precision of a single clock cycle. Cycle-accuracy guarantees that at any given clock cycle, the state of the simulated system model is identical to the state of the real system. System-level simulation uses a higher abstraction level. The system is represented at the IP-block level, consisting of coarse-grained models of processing, memory, and communication elements. Moreover, the application functionality is presented by coarse-grained models such as interacting tasks.
Static (or analytic) methods are typically used in early design-space exploration to find different corner cases. Analytical models cannot take into consideration sporadic effects in the system behavior, such as aperiodic interrupts or other aperiodic external events. Static models are suited for performance evaluation when the deterministic behavior of the system is accurate enough for the analysis.
Static methods are faster and provide significantly larger coverage of the design space than dynamic methods. However, static methods are less accurate, as they cannot take into account the dynamic performance aspects of a multiprocessor system. Furthermore, dynamic methods are better suited for spotting delayed task response times due to blocking on shared resources.
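A classic example of such a static (analytic) method, given here only to illustrate the category (it is not part of the article's framework), is the Liu and Layland utilization bound for rate-monotonic scheduling: a fast, deterministic schedulability check that, as noted above, is blind to sporadic dynamic effects.

```python
# Liu-Layland sufficient schedulability test for rate-monotonic scheduling:
# n periodic tasks are schedulable if total utilization U <= n*(2^(1/n)-1).
# A purely static check: no simulation, but also no dynamic effects.

def rm_schedulable(tasks):
    """tasks: list of (execution_time, period); sufficient test only."""
    n = len(tasks)
    u = sum(c / t for c, t in tasks)
    return u <= n * (2 ** (1 / n) - 1)

print(rm_schedulable([(1, 4), (1, 8)]))   # light load: passes the bound
print(rm_schedulable([(3, 4), (3, 8)]))   # heavy load: fails the bound
```

For two tasks the bound is about 0.828; the first task set has utilization 0.375 and passes, the second has 1.125 and fails. Because the test is only sufficient, a failing set may still be schedulable, which is one reason simulation-based (dynamic) methods complement static ones.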
Analysing, measuring, and executing the system performance models usually produces a massive amount of data from the modeled system. The final step in the flow is to select, interpret, and exploit the relevant data. The selection and interpretation of the relevant data depend on the purpose of the analysis. The purpose can be early design-space exploration, for example. In that case, the flow is usually iterative, so that the results are used to optimize the system models, after which the analysis is performed again for the modified models. In dynamic methods, an effective way of analysing the system behavior is to visualize the results of simulation in the form of graphs. This helps the designer to efficiently spot changes in system behavior over time.
2.3. Modeling Language and Tool Development. SoC designers typically utilize
predefined modeling languages and tools to carry out the performance
evaluation process. On the other hand, language and tool developers have
their own steps to provide suitable evaluation techniques and tools for SoC
designers. In general, they are as follows:
(1) formulation of the metamodel,
(2) developing methods for model representation and capturing,
(3) developing analysis tools according to the selected modeling methods.
The formulation of the metamodel requires a very similar kind of
consideration of the objectives of the performance analysis as the selection
of the techniques and tools by SoC designers. The created metamodel
determines the effort required to perform the evaluation as well as the
abstraction level and accuracy used. In particular, it defines whether the
system performance model can be executed, simulated, or statically analysed.
The second step is to define how the model is captured by a designer. This
phase includes the selection or definition of the modeling language (such as
UML, SystemC, or a custom domain-specific language). The selection of
notations also requires transformation rules defined between the elements of
the metamodel and the elements of the selected description language. In the
case of UML2, the metamodel concepts are mapped to UML2 metaclasses,
stereotyped model elements, and diagrams.
We want to emphasize the importance of performing these first two steps
exactly in this order. The definition of the metamodel should be performed
independently from the utilized modeling language and with full concentration
on the primary objectives of the analysis. The selection of the modeling
language should not alter the metamodel nor bias its definition. Instead, the
modeling language and notations should be tailored for the selected
metamodel, for instance, by utilizing the extension mechanisms of UML2 or by
defining a completely new domain-specific language. The reason for this is
that model notations contribute only to presentational features. Model
semantics truly determine whether the model is usable for the analysis.
Nevertheless, presentational features determine the feasibility of the model
for a human designer.
The final step is the development of the tools. To provide efficient
evaluation techniques, the implementation of the tools should follow the
created metamodel and its original objectives. This means that the original
metamodel becomes the foundation of the internal metamodel of the tools. The
system modeling language and tools are linked together with model
transformations. These transformations are used to convert the notations of
the system modeling language to the format understood by the tools, while the
semantics of the model are maintained.
2.4. RTES Timing Analysis Concepts. A typical SoC contains heterogeneous
processing elements executing complex application tasks in parallel. The
timing analysis of such a system requires abstraction and parameterization of
the key concerns related to the resulting performance.
Hansson et al. define concepts for RTES timing analysis [15]. In the
following, a short introduction to these concepts is given.
Task execution time t_e is the time (in clock cycles or absolute time) in
which a set of sequential operations is executed undisturbed on a processing
element. It should be noted that the term task is here considered more
generally as a sequence of operations or actions related to single-threaded
execution, communication, or data storing. The term thread is used to denote
a typical schedulable object in an RTOS. Profiling the execution time does
not consider background activities in the system, such as RTOS thread
pre-emptions, interrupts, or delays caused by waiting for a blocked shared
resource. The purpose of the execution time is to determine how much
computing resource is required to execute the task. Task response time t_r,
on the other hand, is the actual time it takes from the beginning to the end
of the task in the system. It accounts for all interference from other system
parts and background activities.
Execution time and response time can be further classified into worst-case
(wc), best-case (bc), and average-case (ac) times. Worst-case execution time
t_wce is the worst possible time the task can take when not interfered with
by other system activities. On the other hand, worst-case response time t_wcr
is the worst possible time the task may take when considering the worst-case
scenario in which other system parts and activities interfere with its
execution. In multimedia applications that require streaming data processing,
the worst-case and average-case response times are usually the ones that need
to be analysed. However, in some hard real-time systems, such as a car air
bag controller, the best-case response time (t_bcr) may be as important as
t_wcr. The average-case response time is usually not so significant. Jitter
is a measure of time variability. For a single task, jitter in execution time
can be calculated as Δt_e = t_wce − t_bce. Respectively, jitter in response
time can be calculated as Δt_r = t_wcr − t_bcr.
It is assumed that the execution time is constant for a given task-PE pair.
It should be noted that in practice the execution time of a function may vary
depending on the processed data, for example. For these kinds of functions
the constant task execution time assumption is not valid. Instead, the
different execution times of such functions should be modeled by selecting a
suitable value to characterize them (e.g., worst or average case) or by
defining separate tasks for different execution scenarios. As opposed to the
execution time, the response time varies dynamically depending on the
surrounding system the task is executed on. The response time analysis must
be repeated if
(1) the mapping of application tasks is changed,
(2) new functionalities (tasks) are added to the application,
(3) the underlying execution platform is modified,
(4) the environment (stimuli from outside) changes.
In contrast, a single task execution time does not have to be profiled again
if the implementation of the task is not changed (e.g., due to optimization),
assuming that the PE on which the profiling was carried out is not changed.
If the executing PE is changed and the profiling uses absolute time units,
then re-profiling is needed. However, this can be avoided by utilizing
PE-neutral parameters, such as the number of operations, to characterize the
execution load of the task. Another possibility is to represent processing
element performances using a relative speed factor, as in [16].
In multiprocessor SoC performance evaluation, simulating the profiled or
estimated execution times (or numbers of operations) of tasks on abstract HW
resource models is an effective way of observing the combined effects of task
execution times, mapping, scheduling, and HW platform parameters on the
resulting task response times, response time jitters, and processing element
utilizations.
Timing requirements of SoC functions are compared against estimated,
simulated, or measured response times. It is typical that timing requirements
are given as combined response times of several individual tasks. This is
naturally completely dependent on the granularity used in identifying
individual tasks. For instance, a single WLAN data transmission task could be
decomposed into data processing, scheduling, and medium access tasks.
Examining whether the timing requirement of a single data transmission is met
then requires examining the response times of the composite tasks in an
additive manner.
2.5. On UML in Simulation-Based RTES Performance Evaluation. The related work
includes several static and dynamic methods for performance evaluation of
parallel computer systems. A comprehensive survey of methods and tools used
for design-space exploration is presented in [8]. Our focus is on dynamic
methods, and some of the research closest to our work is examined in the
following.
Erbas et al. [17] present a system-level modeling and simulation environment
called Sesame, which aims at efficient design-space exploration of embedded
multimedia system architectures. It uses KPN for modeling the application
performance with a high-level programming language. The code of each Kahn
process is instrumented with annotations describing the application's
computational actions, which allows capturing the computational behavior of
an application. The communication behavior of a process is represented by
reading from and writing to FIFO channels. The architecture model simulates
the performance consequences of the computation and communication events
generated by an application model. The timing of application events is
simulated by parameterizing each architecture model component with a table of
operation latencies. The simulation provides performance estimates of the
system under study together with statistical information such as the
utilization of architecture model components. Their performance metamodel and
approach have several similarities with ours. The biggest differences are in
the abstraction level of HW communication modeling and in the visualization
of the system models and performance results.
Balsamo and Marzolla [18] present how UML use case, activity, and deployment
diagrams can be used to derive performance models based on multichain and
multiclass Queuing Networks. The UML models are annotated according to the
UML Profile for Schedulability, Performance and Time Specification [10]. This
approach has been developed for SW architectures rather than for embedded
systems. No specific tool framework is presented.
Kreku et al. [19] propose a method for simulation-based RTES performance
evaluation. The method is based on capturing application workloads using UML2
state-machine descriptions. The platform model is constructed from SystemC
component models that are instantiated from a library. Simulation is enabled
with automatic C++ code generation from the UML2 description, which makes the
application and platform models executable in a SystemC simulator. The
platform description provides dedicated abstract services for the application
to project its computational and communicational loads on HW resources.
These functions are invoked from actions of the state-machines. The
utilization of UML2 state-machines enables efficiently capturing the control
structures of the application. This is a clear benefit in comparison to plain
data flow graphs. The platform services can be used to represent data
processing and memory accesses. Their method is well suited for
control-intensive applications, as UML state-machines are used as the basis
of modeling. Our method targets modeling embedded streaming data applications
with less modeling effort by using UML activity diagrams.
Madl et al. [20] present how distributed real-time embedded systems can be
represented as discrete event systems and propose an automated method for the
verification of dense time properties of such systems. The model of
computation (MoC) is based on tasks connected with channels. Tasks are mapped
onto machines that represent the computational resources of embedded HW.
Our performance evaluation method is based on an executable streaming data
application workload model specified as UML activity diagrams and an abstract
platform performance model specified in composite structure diagrams. In
comparison to related work, this is the first proposal that defines a
transformation between UML activity diagrams and streaming data application
workload models and successfully adopts it for embedded RTES performance
evaluation.
3. Performance Metamodel for Streaming Data Embedded Systems
The foundation of the performance metamodel defined in this work is based on
the earlier work on a Model of Computation (MoC) for architecture exploration
described in [21]. We introduce storage tasks, storage elements, and timing
constraints as new features. The metamodel definition is given using
mathematical equations and set theory. Another alternative would be to
utilize the Meta Object Facility (MOF) [22]. MOF is often used to define the
metamodels from which UML profiles are derived, as the model elements and
notations of MOF are a subset of the UML model elements. Next, a detailed
formulation of the performance metamodel is carried out.
3.1. Application Performance Metamodel. Application A is defined as a tuple

A = (T, Γ, E, TC), (1)

where T is a set of tasks, Γ is a set of channels, E is a set of external
events (or timers), and TC is a set of timing constraints. Tasks are further
categorized into sets of execution tasks T_e and storage tasks T_s, so that

T = {T_e ∪ T_s}. (2)
Channels combine tasks and carry tokens between them. A single channel is
defined as

γ = (τ_src, τ_end, E_buf), (3)

where τ_src ∈ T is the task that emits tokens to the channel, τ_end ∈ T is
the task that consumes tokens, and E_buf is the set of buffered tokens in the
channel. Tokens in channels represent the flow of control as well as the flow
of data in the application. A token carries a certain amount of data from one
task to another. This has two impacts: first, the load on the communication
medium for the duration of the transfer; second, the execution load when the
next task is triggered after reception. The latter enables
data-amount-dependent dynamic variations in the execution of application
tasks. Similar to the traditional KPN model, channels between tasks (or
processes) are uni-directional, unbounded FIFO buffers, and tasks use a
blocking read as a synchronization mechanism.
A task τ ∈ T is defined as

τ = (S, ec, F, Γ_!, Γ_?), (4)

where S ∈ {Run, Ready, Wait, Free} is the state of the task, ec ∈ {N+ ∪ {0}}
is the execution counter that is incremented by one each time the task is
fired, and F is a set of firing rules whose definition depends on the type of
the task. Γ_! is the set of incoming channels to the task and Γ_? is the set
of outgoing channels. The incoming channels of task τ are defined as

Γ_! = {γ | γ_end = τ}, (5)

whereas the outgoing channels have the definition

Γ_? = {γ | γ_src = τ}. (6)
A firing rule f_c ∈ F_c for a computational task is a tuple

f_c = (tc, O_int, O_float, O_mem, Γ_out), (7)

where tc is a task trigger condition. O_int, O_float, and O_mem represent the
computational complexity of the task in terms of the numbers of integer,
floating point, and memory operations required to be computed. The subset
Γ_out ⊆ Γ_? determines the set of outgoing channels to which tokens are
transmitted when the task is fired. A firing rule f_s ∈ F_s for a storage
task is a tuple

f_s = (tc, O_rd, O_wr, Γ_out), (8)

where O_rd and O_wr are the numbers of read and write operations associated
with a single storage task. Correspondingly to an execution task, tc is the
task trigger condition and Γ_out ⊆ Γ_? is the set of outgoing channels. A
task trigger condition is
defined as

tc = (Γ_in, depend, T_ec, φ_ec), (9)

where Γ_in ⊆ Γ_! is the set of incoming transitions required to trigger the
task and depend ∈ {Or, And} determines the dependency type on the incoming
transitions. T_ec is the execution count modulo period and φ_ec is the
execution count modulo phase. They can be used to restrict the firing of the
task to certain execution count values, so that the task is fired if

ec mod φ_ec = 0 when ec < T_ec,
ec mod (T_ec + φ_ec) = 0 when ec ≥ T_ec. (10)
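As a concrete reading of (10), the modulo firing test can be sketched as
follows (a hypothetical helper, not part of the metamodel itself):

```python
# Sketch of the execution-count firing test in (10). T_ec is the
# execution count modulo period and phi_ec the modulo phase.

def modulo_fire(ec, T_ec, phi_ec):
    """True if the task may fire at execution counter value ec."""
    if ec < T_ec:
        return ec % phi_ec == 0
    return ec % (T_ec + phi_ec) == 0

# With T_ec = 4 and phi_ec = 2 the task fires at ec = 0 and 2,
# and thereafter at every multiple of T_ec + phi_ec = 6.
fires_at = [ec for ec in range(13) if modulo_fire(ec, 4, 2)]
print(fires_at)  # [0, 2, 6, 12]
```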
3.2. External Events and Constraints. External events model the environment
of the application feeding input data to the task graph, such as packet
reception from a WLAN radio or image reception from an embedded camera. An
external event e ∈ E is a tuple

e = (type, t_per, γ_out), (11)

where type ∈ {Oneshot, Periodic} determines whether the event is fired once
or periodically, t_per is the absolute time or period at which the event is
triggered, and γ_out is the channel into which the events are fed.
A path p is a finite sequence of consecutive tasks. Thus, if n ∈ {N+ ∪ {0}}
is the total number of tasks in the path, then p is defined as the n-tuple

p = (x_1, x_2, x_3, . . . , x_n), x : x ∈ {T}. (12)
A timing constraint t_c ∈ TC is defined as

t_c = (p, t_wcr^req, t_bcr^req), (13)

in which p is a consecutive path of tasks and channels, and t_wcr^req and
t_bcr^req are the required worst-case response time and best-case response
time for p to be completed after the first element of p has been triggered.
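A possible check of such a constraint against simulated path timings could
look like the following sketch; the interpretation that the observed best
case must not undercut t_bcr^req is our reading of the air bag example in
Section 2.4, and all names are illustrative:

```python
# Sketch of evaluating a timing constraint t_c = (p, t_req_wcr, t_req_bcr)
# against the response times observed for path p in a simulation run.

def constraint_met(obs_wcr, obs_bcr, req_wcr, req_bcr):
    """The observed worst case must stay within the required worst case;
    the observed best case must not complete faster than the required
    best case."""
    return obs_wcr <= req_wcr and obs_bcr >= req_bcr

print(constraint_met(4.8, 1.2, 5.0, 1.0))  # True
print(constraint_met(5.3, 1.2, 5.0, 1.0))  # False (worst case too slow)
```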
3.3. Platform Performance Metamodel. The HW platform is a tuple

P_HW = (C, L), (14)

in which C is a set of platform components and L is a set of communication
links connecting the components. Components are further divided into sets of
processing elements PE and storage elements SE, and a single communication
element ce, in such a manner that

C = (PE ∪ SE ∪ ce). (15)

Links L connect the processing and storage elements to the communication
element ce. The ce carries out the required data exchange between the PEs and
SEs.
[Figure 3 content omitted: the diagram shows an application layer with
execution tasks τ_e0-τ_e4, a storage task s0, and external events e0 and e1;
mappings m0-m5; and a HW platform layer with processing elements pe0-pe2, a
storage element se0, and a communication element ce.]
Figure 3: Example performance model.
A processing element pe ∈ PE is defined as

pe = (f_op, P_int, P_float, P_mem), (16)

in which f_op is the operating frequency, and P_int, P_float, and P_mem
describe the performance indices of the PE in terms of executing integer,
floating point, and memory operations, respectively. If a task has
operational complexity O (of one of the three types) and the PE it is mapped
onto has the corresponding performance index P and frequency f_op, then the
task execution time can be calculated with

t_e = O / (P · f_op). (17)
A storage element se ∈ SE is defined as

se = (f_op, P_rd, P_wr), (18)

in which P_rd and P_wr are the performance indices for reading from and
writing to the storage element. The time it takes to read or write to the
storage is calculated in the same manner as in (17).
The communication element ce has the definition

ce = (f_op, P_tx), (19)

where P_tx is the performance index for transmitting data. If a token carries
n bits of data over the communication element, then the time of the transfer
can be calculated as

t_tx = n / (P_tx · f_op). (20)
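Equations (17) and (20) translate directly into code. The sketch below is
illustrative; the numeric values are our own examples, not taken from the
case study:

```python
# Sketch of the execution-time (17) and transfer-time (20) formulas.

def execution_time(ops, perf_index, f_op):
    """t_e = O / (P * f_op): operations, ops per cycle, Hz -> seconds."""
    return ops / (perf_index * f_op)

def transfer_time(bits, p_tx, f_op):
    """t_tx = n / (P_tx * f_op): bits, bits per cycle, Hz -> seconds."""
    return bits / (p_tx * f_op)

# A task with 56764 integer operations on a 1 op/cycle PE at 150 MHz
print(execution_time(56764, 1.0, 150e6))  # ~3.78e-4 s
# A 3072-bit token over a bus moving 32 bits per cycle at 100 MHz
print(transfer_time(3072, 32, 100e6))     # 9.6e-7 s
```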
3.4. Metamodel for Functionality Mapping. The mapping M binds the application
load characteristics (tasks and channels) to platform resources. It is
defined as

M = {M_e ∪ M_s}, (21)

where M_e = (m_e1, m_e2, m_e3, . . . , m_en) is the set of mappings of
execution tasks to processing elements and M_s = (m_s1, m_s2, m_s3, . . . ,
m_sn) is the set of mappings of storage tasks to storage elements. In
general, a mapping m ∈ M is defined as a 2-tuple (task, platform element).
For instance, an execution task mapping is defined as

m = (τ_e, pe), τ_e ∈ T_e ∧ pe ∈ PE. (22)

Each task is mapped onto only one platform element, and several tasks can be
mapped onto a single platform element. Events are not mapped to any platform
element. The mapping of channels onto the communication element is not
explicitly modeled. Instead, they are implicitly mapped onto the single
communication element that interconnects the processing and storage elements.
3.5. Example Model. Figure 3 visualizes the primary concepts of our metamodel
with a simple example. There are five execution tasks τ_e0-τ_e4 and a single
storage task τ_s0 combined together with six channels γ_0-γ_5. Two external
events e_0 and e_1 feed the task graph with tokens. The computation tasks are
mapped (m_0-m_3) onto three PEs and the single storage task is mapped (m_4)
onto the single storage element. All channels are implicitly mapped onto the
single communication element, and all inter-PE transfers are conducted by it.
4. UML2 and the MARTE Profile
UML has traditionally been used for specifying software-intensive systems,
but currently it is also seen as a promising language for developing embedded
systems. Natively, UML2 lacks some of the key concepts that are crucial for
embedded systems, such as a quantifiable notion of time, nonfunctional
properties, the embedded execution platform, and the mapping of
functionality. However, the language has extension mechanisms that can be
used for tailoring the language to desired domains. One such mechanism is to
use profiles that add custom semantics to the set of model elements offered
by the language itself. Profiles are defined with stereotype extensions, tag
definitions, and constraints. Stereotypes give new semantics to existing UML2
metaclasses. Tagged values are attributes of a stereotype that are used to
further specify the stereotyped model element. Constraints limit the
metamodel by defining how model elements can be used.
One model element can have multiple stereotypes. Consequently, it gets all
the properties, tagged values, and constraints of those stereotypes. For
example, a PE may have different stereotypes for defining its performance
characteristics and its power consumption characteristics. The separation of
concerns (one stereotype for one purpose) when defining profiles is
recommended to keep the set of model elements concise for a designer.
4.1. Utilized MARTE Architecture. In this work, a subset of the MARTE profile
is used as the foundation for creating our domain-specific modeling language
for performance modeling. The concepts of the created performance evaluation
metamodel are mapped to the stereotypes defined by MARTE. Thereafter, custom
stereotypes with associated tag definitions for the rest of the metamodel
concepts are defined.
[Figure 4 content omitted: the diagram shows the utilized MARTE subprofiles:
the foundations (NFPs, Alloc), the design model (HRM), the analysis model,
the annexes (VSL, MARTE_model library), and the custom extensions for
platform performance and application workload.]
Figure 4: Utilized subprofiles of the MARTE profile and extensions for
performance evaluation.
Figure 4 presents the subprofiles of MARTE that are utilized in this work
together with the additional subprofiles for our performance evaluation
concepts. The complete profile architecture of MARTE can be found in [5].
From the MARTE foundations, the stereotypes of the profiles for nonfunctional
properties (NFP) and allocation (Alloc) are used directly. The NFP profile is
used for defining different measurement types for the custom stereotype
extensions. The allocation subprofile contains suitable concepts for task
mapping.
From the MARTE design model, the HW resource modeling (HRM) profile is
adopted to identify and give semantics to different types of HW elements. It
should be noted that the HRM profile has dependencies on other profiles in
the foundations, such as the general resource modeling (GRM) profile, but
these are not included in the figure, since their stereotypes are not
directly adopted.
The MARTE analysis model contains predefined packages that are dedicated to
generic quantitative analysis modeling (GQAM), schedulability analysis
modeling (SAM), and performance analysis modeling (PAM). The MARTE profile
specification defines that this analysis model can be extended for other
domains as well, such as for power consumption. We do not utilize the
predefined analysis concepts but define our own extensions that implement the
metamodel defined in Section 3. This is because the MARTE analysis packages
have been defined according to their own metamodel, which differs from ours.
Although there are some similarities in the modeling concepts, we define
dedicated stereotype extensions to allow as straightforward a way of
capturing the performance models as possible.
5. Performance Model Specification in UML2
The extension of the modeling capabilities for our performance metamodel is
specified by refining the elements of UML and MARTE with additional
stereotypes. These stereotypes specify the performance characteristics of the
particular elements to which they are applied. The additional stereotypes are
designed so that they can be used with other profiles similar to MARTE. The
requirement for such a profile is that it supports embedded HW modeling and a
functionality mapping mechanism. As mentioned, the additional stereotypes
have also been successfully used with the TUT-Profile. The defined
stereotypes are, however, dependent on the nonfunctional property data types
and measurement units defined by the MARTE nonfunctional property and model
library packages. These data types are used in the tag definitions.
5.1. Application Workload Model Presentation. UML2 activity diagrams have
been selected as the view for application workload models. The reasons for
this are:
(i) activity diagrams are a natural view for presenting control and data flow
between the functional elements of the application,
(ii) activity diagrams have enough expression power to present the
application task network of the workload model,
(iii) reuse of activity diagrams created for describing task-level behaviour
becomes possible.
In the workload model, basic activities are used as the level of detail in
the activity diagrams. A UML2 basic activity is presented as a graph of
actions and the edges connecting them. Here, actions correspond to tasks T
and edges to channels Γ. Basic activities allow modeling of control and data
flow, but explicit forks and joins of control, as well as decisions and
merges, are not supported [23]. Still, the expression power is adequate for
our workload model.
Figure 5 presents the stereotype extensions for the application performance
model. Workloads of tasks T are presented as action nodes. In practice, these
actions refer to a certain UML2 behaviour, such as a state-machine, activity,
or function, that is mapped onto HW platform elements. The stereotypes
ExecutionWorkload and StorageWorkload are applied to actions that represent
execution tasks T_e and storage tasks T_s. The tag definitions for these
stereotypes define the other properties of the represented tasks, including
trigger conditions, computational workload indices, and sent data
[Figure 5 content omitted: the diagram defines the stereotypes
ExecutionWorkload (tags tc, intOps, floatOps, memOps, outChannels,
sendAmount, sendPropability), StorageWorkload (tags tc, rdOps, wrOps,
outPorts, sendAmount, sendPropability), WorkloadEvent (tags time, sendAmount,
sendPropability, eventKind), ResponseTiming (tags WCRT, BCRT), and
WorkloadModel, extending the Action and Activity metaclasses; the data type
TriggerCondition (inChannels, depend, ecModPhase, ecModPeriod); and the
enumerations DependKind (AND, OR) and EventKind (oneshot, periodic).]
Figure 5: Stereotype extensions for application workload model.
tokens. Each index of the tagged value lists represents an individual trigger
condition and its related actions (operations to be calculated, data to be
sent to the next tasks) when the trigger condition is satisfied.
Action nodes are connected together using activity edges. This notation is
used in our model presentation to represent a channel γ between two tasks.
The direction of the data flow in the channel is the same as the direction of
the activity edge. The names of the channels are directly referenced as
strings in trigger conditions as well as in tagged values indicating outgoing
channels.
An external event is presented as an action node stereotyped as
WorkloadEvent. Such an action always has a single outgoing channel that
carries tokens to the task network. The top-level activity which defines a
single complete workload model of the system is stereotyped as WorkloadModel.
Timing constraints are defined by applying the stereotype ResponseTiming to a
single action or a complete activity and defining the response timing
requirements in terms of worst- and best-case response times. The timing
requirement for an activity is defined as the time it takes to execute the
activity from its initial state to its exit state.
Figure 6 shows an example application workload model of our case study in an
activity diagram. There are ten execution tasks that are connected with edges
representing the channels between the tasks. Actions in the left column
(excluding the workload event) are tasks of the encoder, whereas actions in
the right column are tasks of the decoder. Tagged values indicating integer
operations and send amounts are shown for each task. Other tagged values have
been left out of the figure for simplicity. The trigger conditions for
PreProcessing and VLDecoding are defined so that they execute the operations
in a loop. For example, the PreProcessing task fires output tokens Xres ×
Yres / MBPixelSize times to the channels c2 and c11 when data arrives from
the incoming channel c1. This amount corresponds to the number of macroblocks
in a single frame. Consecutive processing of this task is triggered by the
incoming data token from the loop channel c11. The number of loop iterations
for a single frame is thus the same as the number of macroblocks in one frame
(Xres × Yres / MBPixelSize). The trigger conditions for the other tasks are
defined so that they process the operations and send data to the next process
when a data token arrives on their incoming channel. The send probability for
all tasks and trigger conditions is 1.0. In this case, the sent data amounts
are defined as expressions depending on the macroblock size, bits-per-pixel
(BPP) value, and image resolution. The operation counts are set as constant
values fixed for the utilized macroblock size. There is also a single
periodically triggered workload event that feeds the application workload
network. Global parameters used in the expressions are defined in the upper
right corner of the figure.
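With the parameter values of the case study, the loop count works out as
below (simple arithmetic, shown only to make the expression concrete):

```python
# Macroblocks per frame for the Figure 6 parameters:
# $Xres = 352, $Yres = 240, $MBPixelSize = 256.

Xres, Yres, MBPixelSize = 352, 240, 256
macroblocks_per_frame = Xres * Yres // MBPixelSize
print(macroblocks_per_frame)  # 330
```

PreProcessing therefore fires 330 tokens into c2 and c11 per frame.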
5.2. Platform Performance Model Presentation. The platform is modeled with
stereotyped UML2 classes and class
[Figure 6 content omitted: the activity diagram contains the encoder tasks
PreProcessing (intOps = 56764), MotionEstimation (intOps = 29231), DCT
(intOps = 13571), Quantization (intOps = 9694), and VLC (intOps = 11889), and
the decoder tasks VLDecoding (intOps = 61576), IDCT (intOps = 15184),
MotionCompensation (intOps = 4222), Rescaling (intOps = 4938), and MBtoFrame
(intOps = 5440), connected by channels c1-c12 and fed by the periodic
WorkloadEvent VideoInput (time = 1.0/fr). Most send amounts are MBPixelSize ×
BPP/8; VLC sends (Xres × Yres × BPP/8)/(qp × 3). Global parameters: $qp = 16
(quantization parameter, 1-32), $fr = 35 (frame rate, frames/s), $Xres = 352,
$Yres = 240 (image size), $BPP = 12 (bits per pixel), $MBPixelSize = 256.]
Figure 6: Example workload model in an activity diagram.
instances. Another alternative would be to use stereotyped UML nodes and node
instances. Nodes and devices in deployment diagrams are the native way in UML
to model a coarse-grained HW architecture that serves as the target for SW
artifacts. Memory and communication resource modeling are not natively
supported by UML2. Therefore, the MARTE hardware resource modeling (HRM)
package is utilized to classify the different types of HW elements.
The MARTE hardware resource modeling package offers several stereotypes for
modeling an embedded HW platform. The complete hardware resource model is
divided into logical and physical views. The logical view defines HW
resources according to their functional properties, whereas the physical view
defines their physical properties, such as area and power. The performance
modeling does not require considering physical properties, and thus only the
stereotypes related to the logical view are needed. Next, the stereotypes
utilized from MARTE HRM to categorize the different HW elements are discussed
in detail.
HW ComputingResource is a generic MARTE stereotype that is used to represent
elements in the HW platform which can execute application functionality. It
can be specialized
EURASIP Journal on Embedded Systems 11
Figure 7: Stereotype extensions for HW platform performance. (Three stereotypes extend the UML metaclass Element: PePerformance with tagged values intOpsPerCycle, floatOpsPerCycle, memOpsPerCycle: Real and opFreq: NFP_Frequency; CommPerformance with txOpsPerCycle: Real and opFreq: NFP_Frequency; and MemPerformance with rdOpsPerCycle, wrOpsPerCycle: Real and opFreq: NFP_Frequency.)
Figure 8: Execution platform performance model. (Three <<hwProcessor>> instances carrying the <<PePerformance>> and <<ep_allocated>> stereotypes, cpu1: ARM9 with opFreq = 150 MHz, cpu2: ARM9 with opFreq = 120 MHz, and cpu3: ARM9 with opFreq = 120 MHz, are connected through hibi ports to a <<hwBus>> bus: Hibi_segment.)
to, for example, HW Processor to indicate its properties as a
programmable computing resource. This stereotype or any
of its inherited stereotypes is used to represent a processing
element pe ∈ PE.
HW Memory is a generic MARTE stereotype for resources
that are capable of storing data. This stereotype
and its inherited stereotypes, such as HW RAM, are used to
represent a storage element se ∈ SE.
Finally, the generic MARTE stereotype HW CommunicationResource
and its inherited stereotypes, such as HW Bus,
are used to represent a communication element ce ∈ CE.
The performance-related characteristics are given with
three additional stereotypes presented in Figure 7.
PePerformance is applied to a processing resource, MemPerformance
to a memory resource, and CommPerformance to
a communication resource, respectively. The performance
characteristics are given for the elements as tagged values
of the stereotypes that define the performance indices and
operating frequency of the particular elements.
Figure 8 presents an example platform model in a UML
composite structure diagram with performance characteristics.
In the figure, there are three instances of HW processors
(UML parts) connected to a single bus segment with UML
ports and connectors. The shown tagged values indicate the
operating frequencies of the processors.
5.3. Mapping Model Presentation. The MARTE allocation package
is used to model the mapping of application tasks onto
platform resources. The MARTE allocation mechanism allows
hybrid allocation, in which application behavioral elements
are associated with structural platform resources. The hybrid
allocation is performed with two stereotypes, ApplicationAllocationEnd
and ExecutionPlatformAllocationEnd. In UML
diagrams they are written as app_allocated and ep_allocated
for conciseness. An application allocation end has a tagged
value that describes the platform resources to which the
particular application element is mapped. An execution platform
allocation end identifies the platform resources onto
which application elements can be mapped. A dependency
stereotyped Allocated is used to bind application behavioral
elements onto platform elements.
An example mapping with the MARTE allocation mechanism
is shown in Figure 9. In the figure, the tasks defined
in the workload model of Figure 6 are mapped onto the HW
processors defined in the HW platform model of Figure 8.
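Read as plain data, such an allocation is simply a task-to-processor table. The following Python sketch shows the kind of structure the mapping model encodes; the task and processor names come from Figures 6 and 8, but the particular placement below is illustrative, not read off Figure 9:

```python
# Task-to-processor allocation in the spirit of Figure 9. The placement
# shown here is an assumption for illustration only.
allocation = {
    "PreProcessing": "cpu1", "MotionEstimation": "cpu1", "DCT": "cpu1",
    "Quantization": "cpu2", "VLC": "cpu2", "VLDecoding": "cpu2",
    "IDCT": "cpu3", "MotionCompensation": "cpu3", "Rescaling": "cpu3",
    "MBtoFrame": "cpu3",
}
processors = {"cpu1", "cpu2", "cpu3"}

# Every application element must be mapped onto an existing platform resource.
assert set(allocation.values()) <= processors

# Group tasks per processor, e.g. to inspect load balance before simulating.
per_pe = {pe: sorted(t for t, p in allocation.items() if p == pe)
          for pe in processors}
print(per_pe["cpu1"])
```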
Figure 9: Mapping with the MARTE allocation mechanism. (The <<app_allocated>> tasks MotionCompensation, MotionEstimation, DCT, IDCT, PreProcessing, Rescaling, VLC, Quantization, VLDecoding, and MBtoFrame are bound by <<Allocated>> dependencies onto the <<ep_allocated>> processors cpu1: ARM9, cpu2: ARM9, and cpu3: ARM9.)
5.4. Parameterizing Models with MARTE VSL Expressions.
The MARTE value specification language (VSL) has been
developed to specify the values of constraints, properties,
and stereotype attributes, particularly for nonfunctional
properties. It is an extension of the Value Specification and
DataType concepts provided by UML. It can be used in any
UML-based specification for extending the base expression
infrastructure provided by UML. The VSL addresses how to
specify variables, constants, and expressions in textual form.
It also deals with time values and assertions, as well as how
to specify composite values such as collections, intervals, and
tuples in UML models.
In our approach the syntax of VSL is utilized to define
expressions in application workload models and platform
performance models. It is an efficient way of parameterizing
the workload models according to application-related values.
The top-right corner of Figure 6 shows an example of using
the VSL syntax to parameterize application workload models
according to video quality metrics that depend on the
application. In the example, the frame rate (fr) is set to 35 frames
per second, and this constant variable is utilized to determine
the time period of the VideoInput workload event when
a single image is fed to the process network. Further, the
macroblock size in pixels (MBPixelSize) and the image size (Xres
and Yres) are used to determine the data amounts transferred
between tasks.
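Since the VSL expressions in Figure 6 are ordinary arithmetic over these constants, they can be checked by evaluating them directly. In the Python sketch below, fr = 35 comes from the text, while BPP, qp, and the resolution are assumed values for illustration:

```python
# Video-quality constants as in Figure 6; fr comes from the text, while
# BPP, qp, and the resolution are assumptions for illustration.
fr = 35                  # frame rate [frames/s]
Xres, Yres = 320, 240    # image resolution (assumed)
MBPixelSize = 16 * 16    # pixels per macroblock (assumed 16 x 16)
BPP = 8                  # bits per pixel (assumed)
qp = 4                   # quantization parameter (assumed)

# Period of the VideoInput workload event: one image every 1.0/fr seconds.
video_input_period = 1.0 / fr

# sendAmount of most tasks: one macroblock of pixels, in bytes.
mb_send_amount = MBPixelSize * BPP / 8

# sendAmount of the VLC task: an estimate of the compressed frame size.
vlc_send_amount = (Xres * Yres * BPP / 8) / (qp * 3)

print(video_input_period, mb_send_amount, vlc_send_amount)
```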
6. Tool Framework for Model-Driven SoC
Performance Evaluation and Exploration
The presented performance evaluation models are used for
early analysis of data-intensive embedded systems. Figure 10
presents the tool framework in which the models are applied.
6.1. Performance Model Capture and System-Level Simulation.
The flow begins with capturing the system performance
models in UML2 using the presented model elements and
profiles. This is followed by a model parsing phase in which
the models are transformed into the XML system model (XSM)
[24, 25]. This is the corresponding XML representation of the
UML2 performance models. The XSM is a common format
between tools for exchanging information on the designed
system. The XSM can be modified by tools after its creation
during the design-space exploration iterations.
Figure 10: Tool framework for performance evaluation and exploration. (The UML2 performance model is parsed into the XML system model, which drives SystemC simulation with the transaction generator; performance results feed the execution monitor and the design-space exploration tool, whose output is written back to the UML2 models by the back-annotator.)
After model creation the XSM file is fed to the simulator.
The simulator is divided into two parts: computation and
communication. The computation part is in practice realized
with a configurable transaction generator (TG) [21]. The
computation part simulates the execution and scheduling
of tasks on processing and memory elements. It also
feeds the underlying communication part with the data tokens
transmitted between tasks which are mapped onto different
platform elements. The abstraction level of the computation
part is the same as that of the metamodel defined in Section 3.
Due to the high abstraction level of the computation part,
the executed tasks do not contain any specific functionality;
they only reserve the processing or memory element and
block it from other tasks for a certain amount of time. For
example, for execution tasks this time is derived with (17).
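Equation (17) itself lies outside this excerpt; as a hedged sketch, a blocking time consistent with the intOps annotations of Figure 6 and the PePerformance tagged values of Figure 7 could be computed as the required cycle count divided by the clock rate:

```python
def execution_time(int_ops, int_ops_per_cycle, op_freq_hz):
    """Time [s] an execution task blocks its PE: required cycles / clock rate.

    int_ops mirrors the intOps tagged value of a task (Figure 6);
    int_ops_per_cycle and op_freq_hz mirror the PePerformance tagged
    values intOpsPerCycle and opFreq (Figure 7). The exact form of (17)
    is not reproduced in this excerpt, so this is an assumption.
    """
    cycles = int_ops / int_ops_per_cycle
    return cycles / op_freq_hz

# IDCT (intOps = 15184) on cpu1 (150 MHz, Figure 8), assuming one
# integer operation per cycle:
t = execution_time(15184, 1.0, 150e6)
print(t)  # roughly 1.0e-4 s, i.e. on the order of 100 microseconds
```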
The computation part (TG) is configured automatically
based on the abstract task, processing, and storage resource
models defined in UML. The configuration is based on
generating corresponding SystemC code containing the same
tasks, processing, and memory elements. This is done by
instantiating generic task and HW element SystemC components
with the parameters (operation counts, performance
indices, etc.) defined in the UML models.
The computation and communication parts are interfaced
with Open Core Protocol (OCP) [26] TL2 compatible
Table 1: Summary of collected and monitored performance statistics.

Category: Values
Application specific: for example, frame rate, radio throughput
Application, task communication: signals in/out, avg./tot. communication cycles, communication % of execution time, intra-/inter-PE communication bytes and cycles, communication cycles/byte
Application, task general: execution count, avg./tot. execution cycles, execution % of thread/service total, signal queue, execution latency, response time
Mapping: task to thread/PE
Platform, PE: utilization, inter-PE communication bytes, avg./tot. execution cycles
Platform, network: utilization, efficiency
interfaces. This means that the communication part can
be changed to any SystemC-based network model that
implements OCP TL2 compatible interfaces for the interconnected
elements. This allows simulation of low abstraction
level models of communication (such as NoCs) with high
abstraction level models of computation. Currently, the
earlier presented simple performance model for the communication
element is not used in our framework. Instead,
a more accurate TLM model of the communication part,
defined in SystemC, is used in simulations.
6.2. Execution Monitoring. After simulation, the simulator
tool produces a performance result file. It is a detailed
description of the events of particular interest during simulation.
This file can be used as input to the Execution Monitor [27]
program, which can visualize the simulation in
a repeatable manner. The collected and monitored performance
statistics are summarized in Table 1. Monitoring
the simulation is efficient for spotting trends, correlations, and
anomalies in system performance over time. In addition, it
is efficient for understanding dynamic effects such as varying
delays (jitter) and race conditions due to contention and
scheduling.
Performance bottlenecks can be detected by observing
the number of tokens in signal queues and the utilization
of PEs. If the number of tokens in the incoming channel
of a task is increasing, it is usually an indication that this task
is the bottleneck in a chain of several tasks. On the other
hand, a bottleneck can be located when a single processor
has a considerably higher utilization than the other collaborating
processors.
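Both symptoms can be tested for mechanically over monitored samples such as those in Table 1. The following Python sketch is illustrative only; the sampling scheme and thresholds are assumptions, not part of the monitor's interface:

```python
def queue_is_growing(samples, min_slope=0.0):
    """True if a task's input signal queue trends upward over time.

    samples are queue lengths taken at regular intervals; a first-half
    vs. second-half mean comparison stands in for a proper regression.
    """
    half = len(samples) // 2
    early = sum(samples[:half]) / half
    late = sum(samples[half:]) / (len(samples) - half)
    return late - early > min_slope

def overloaded_pe(utilizations, margin=0.25):
    """Return the PE whose utilization exceeds the mean of the others by margin."""
    for pe, u in utilizations.items():
        others = [v for p, v in utilizations.items() if p != pe]
        if u - sum(others) / len(others) > margin:
            return pe
    return None

# Example: a saturated processor next to two underutilized ones.
print(queue_is_growing([3, 5, 9, 14, 22, 31]))                   # True
print(overloaded_pe({"cpu1": 1.0, "cpu2": 0.55, "cpu3": 0.50}))  # cpu1
```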
In practice, the modeled response time requirements are
validated by observing the maximum response time of a
task in different execution scenarios. Meeting throughput
requirements can also be observed in a similar manner.
Figure 11 presents the control view of the execution
monitor tool. In the figure, the control view shows a system
consisting of ten tasks mapped onto three processors. Each
processor column consists of the current task mapping on
top and an optional graph on the bottom. The graph can
present, for example, processor utilization, as in the figure.
6.3. Design-Space Exploration. After simulation and performance
monitoring, the performance simulation results and
the XSM are fed to the design-space exploration tool, which
tries to optimize the platform parameters and task mapping
so that a user-defined cost function is minimized. The cost
function can contain several nonfunctional properties such
as power, frequency, area, or the response time of an individual
task. The design-space exploration tool supports several mapping
heuristics: simulated annealing, group migration,
a hybrid of the previous two [28], optimal subset mapping
[29], a genetic algorithm, and random mapping. The design-space
exploration cycle continues by performing the simulation
after each remapping or modification of the execution
platform.
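A minimal sketch of one of these heuristics, simulated annealing over task mappings, is given below. In the real flow each candidate mapping is evaluated by a system-level simulation; here a stand-in cost function that penalizes load imbalance (using the intOps counts of Figure 6) takes its place:

```python
import math
import random

def anneal_mapping(tasks, pes, cost, steps=2000, t0=1.0, alpha=0.995, seed=1):
    """Simulated annealing over task-to-PE mappings.

    In the real flow, cost(mapping) would run the system-level simulation
    and evaluate the user-defined cost function; here it is an argument.
    """
    rng = random.Random(seed)
    mapping = {t: rng.choice(pes) for t in tasks}
    cur_cost = cost(mapping)
    best, best_cost, temp = dict(mapping), cur_cost, t0
    for _ in range(steps):
        task = rng.choice(tasks)
        old = mapping[task]
        mapping[task] = rng.choice(pes)        # neighbour move: remap one task
        new_cost = cost(mapping)
        if new_cost < cur_cost or rng.random() < math.exp((cur_cost - new_cost) / temp):
            cur_cost = new_cost                # accept (always accept improvements)
            if new_cost < best_cost:
                best, best_cost = dict(mapping), new_cost
        else:
            mapping[task] = old                # reject: undo the move
        temp *= alpha                          # geometric cooling
    return best, best_cost

# Stand-in cost: per-PE load spread, using intOps counts from Figure 6.
load = {"VLDecoding": 61576, "MotionEstimation": 29231, "IDCT": 15184,
        "DCT": 13571, "VLC": 11889, "Quantization": 9694}
pes = ["cpu1", "cpu2", "cpu3"]

def imbalance(mapping):
    per_pe = {p: 0 for p in pes}
    for task, pe in mapping.items():
        per_pe[pe] += load[task]
    return max(per_pe.values()) - min(per_pe.values())

best_mapping, best_cost = anneal_mapping(list(load), pes, imbalance)
print(best_cost)
```

The group migration and optimal subset mapping heuristics mentioned above would slot in as alternative search strategies around the same cost evaluation.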
After the design-space exploration cycle ends, the optimized
system description is again written to the XSM file.
The back-annotator tool is used to update the UML2 models
according to the results of the design-space exploration
(updated platform and mapping).
6.4. Governing the Tool Flow Execution. The execution of the
design flow is governed by a customizable Java-based tool
for configuring and executing SoC design flows. This tool
is called the Koski Graphical User Interface. The idea of this tool
is that a user selects the tools of the flow to be executed from
a library of tools. New tools can be imported to the library
in a plug-and-play fashion. Each tool includes a section of
XML which specifies the input and output tokens (files and
parameters) of that particular tool. Parameters of individual
tools can be set via the GUI. For example, the platform
constraints, such as the maximum and minimum number of PEs,
and the cost function of the design-space exploration tool
are such parameters. Due to its flexibility, this tool
has proven to be very effective in researching and evaluating
different methodologies and tool flow configurations.
7. Case Study: Performance Evaluation
and Exploration of a Video Codec on
Multiprocessor SoC
This section presents a case study that illustrates the applicability
of the modeling methods and the tool framework in
practice. The application is a video codec on a multiprocessor
platform. We used an approach in which new functionality
representing a web client was modeled and added to the
existing video codec system of Figure 6, and the system
was simulated and optimized based on the monitored
information.
7.1. Profiling and Modeling. All the functions were modeled
by their workload and simulated in SystemC using the TG. The
Figure 11: Control view in execution monitor. (Three processor columns are shown, each with its current task mapping on top and a processor utilization graph below, plotted on a 0–100% scale.)
workload model of the video codec was originally profiled
from a real FPGA execution trace, whereas the model of the
web client was only a single task with an early estimate
of its behavior.
The performance requirement of the video codec was
set to 35 frames per second (FPS). Thus, an external
event representing the camera triggered at a 35 Hz frequency.
The HW platform consisted of three processors connected
through a shared bus. The operating frequencies of the
processors were set to 150 MHz, 120 MHz, and 120 MHz.
The frequency of the bus was set to 100 MHz.
7.2. Simulating and Monitoring. When the original system
was simulated, it was observed that it met the FPS requirement.
Next, functionality for the web client was added to
run in parallel with the video codec. The web client was
mapped to cpu1 (see Figure 11) because it was observed that
the utilization of cpu1 was the lowest in the original system.
Simulations indicated that the performance of the video
codec decreased to 14 FPS. In addition, cpu1 became
fully utilized at all times, whereas the utilizations of the other
two processors decreased. This indicated a clear bottleneck
on cpu1, as it was not able to forward processed data fast
enough to the other processors. This could also be observed
from the signal queues of the tasks mapped onto cpu1. The
environment produced raw frames so fast that they started
accumulating at cpu1.
Thereafter, a remapping of the application tasks was
performed, since the workload of the processors was clearly
imbalanced. The mapping was done manually so that all the
encoder tasks were mapped to cpu1, the decoder tasks to
cpu2, and the web client functionality was isolated on cpu3.
During the simulation it was observed that this improved the
FPS to 22.
Because the manual mapping did not result in the
required performance, the next phase was automatic
exploration of the task mapping. The resulting mapping was
nonobvious, because the tasks of the encoder and decoder
were distributed among all the processors. Hence, it is
unlikely that we would have arrived at it by manual mapping.
The system became more balanced and the video codec
performance increased to 30 FPS, but it still did not meet the
required 35 FPS. Cpu1 was still the bottleneck, and the signal
queues of the tasks mapped to it kept increasing. However,
they were not increasing as fast as with the unoptimized
mapping, as presented in Figure 12. Figure 12(a) illustrates
the queue before the mapping exploration and Figure 12(b)
after the exploration. The signal queues are shown for the
time frame of 50 to 100 ms, and the scale of the y-axis is 0 to
150 signals.
Finally, automated exploration was performed for the
operating frequencies of the processors. The result of the
exploration was that the frequency of cpu1 was increased
by 40 MHz to 190 MHz, and the frequencies of the other
two processors were increased by 20 MHz to 140 MHz. The
simulation of this system model showed that the FPS
requirement should be met, and that the tasks could process all
the signals they received.
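As a rough plausibility check, and not the tool's actual model, one can assume that throughput scales linearly with the bottleneck processor's clock:

```python
# Crude linear-scaling estimate: assumes the codec is compute-bound on the
# bottleneck processor cpu1 and ignores the bus and the other processors.
fps_before = 30.0                            # FPS after mapping exploration
estimated_fps = fps_before * 190e6 / 150e6   # cpu1: 150 MHz -> 190 MHz
print(round(estimated_fps, 1))               # 38.0, above the 35 FPS target
```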
8. Discussion
In early performance evaluation, the key issue is the tradeoff
between accuracy and the development time of the model. The
best accuracy is achieved from cycle-accurate simulations
or from an actual implementation. However, constructing the
cycle-accurate model or integrating the system is very time
consuming in comparison to using system-level models
and simulations. Thus, utilization of abstract system-level
models allows the designer to explore the design space
more efficiently. The actual simulation time is also shorter
in system-level simulations in comparison to cycle-accurate
simulations.
Figure 12: Signal queues for task VLC before and after mapping exploration. ((a) Before mapping exploration; (b) after mapping exploration. Both graphs plot the VLC signal queue over the 50 to 100 ms time frame with the y-axis scaled 0–150 signals.)
In this work we concentrate on reducing the effort
of specifying and managing the performance models for
system-level simulations. This has been done by utilizing
graphical UML2 models. As a result, the readability
of the models is improved in comparison to a textual presentation.
The case study showed that the system model is easy
to construct, interpret, and modify with the presented UML
model elements. The case study models were constructed
in a few hours. Profiling and estimating operation counts for
workload tasks can be considered time-consuming and hard.
In our case, it was done by profiling a similar application
executing on an FPGA.
MARTE VSL was found useful for defining expressions. It
significantly simplified modifying the models with different
application-specific parameters in comparison to using
constant values.
In an earlier study [30] the average error in frame rate was
4.3%. This article uses the same metamodel. Hence, it can
be concluded that our method offers designer-friendly, rapid,
yet rather accurate performance evaluation for RTES.
9. Conclusions and Future Work
This article presented an efficient method to model and
evaluate streaming-data embedded system performance with
UML2 and system-level simulations. The modeling methods
were successfully utilized in a tool framework for early
performance evaluation and design-space exploration. The
case study showed that UML2, the presented modeling
methods, and the utilized performance evaluation tools
form a designer-friendly, rapid, yet rather accurate way of
modeling and evaluating RTES performance before actual
implementation. Future work consists of taking into account
the impact of the SW platform in the RTES performance
metamodel. This includes the workload of SW platform
services (such as file access and memory allocation) as well
as the scheduling of tasks with different policies.
References
[1] Object Management Group (OMG), "Unified Modeling Language (UML) Superstructure, V2.1.2," November 2007.
[2] G. Martin and W. Mueller, Eds., UML for SOC Design, Springer, 2005.
[3] K. Berkenkötter, "Using UML 2.0 in real-time development: a critical review," in International Workshop on SVERTS: Specification and Validation of UML Models for Real Time and Embedded Systems, October 2003.
[4] R. B. France, S. Ghosh, T. Dinh-Trong, and A. Solberg, "Model-driven development using UML 2.0: promises and pitfalls," IEEE Computer, vol. 39, no. 2, pp. 59–66, 2006.
[5] Object Management Group (OMG), "A UML profile for MARTE, beta 1 specification," August 2007.
[6] Object Management Group (OMG), "OMG systems modeling language (SysML) specification," September 2007.
[7] P. Kukkala, J. Riihimäki, M. Hännikäinen, T. D. Hämäläinen, and K. Kronlöf, "UML 2.0 profile for embedded system design," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '05), vol. 2, pp. 710–715, March 2005.
[8] M. Gries, "Methods for evaluating and covering the design space during early design development," Integration, the VLSI Journal, vol. 38, no. 2, pp. 131–183, 2004.
[9] G. Kahn, "The semantics of a simple language for parallel programming," in Proceedings of the IFIP Congress on Information Processing, August 1974.
[10] Object Management Group (OMG), "UML profile for schedulability, performance, and time specification (Version 1.1)," January 2005.
[11] T. Arpinen, M. Setälä, P. Kukkala, et al., "Modeling embedded software platforms with a UML profile," in Proceedings of the Forum on Specification & Design Languages (FDL '07), Barcelona, Spain, April 2007.
[12] K. Keutzer, S. Malik, R. Newton, et al., "System-level design: orthogonalization of concerns and platform-based design," IEEE Transactions on Computer-Aided Design, vol. 19, no. 12, pp. 1523–1543, 2000.
[13] G. Kotsis, Workload modeling for parallel processing systems, Ph.D. thesis, University of Vienna, Vienna, Austria, 1995.
[14] P. Kukkala, M. Hännikäinen, and T. D. Hämäläinen, "Performance modeling and reporting for the UML 2.0 design of embedded systems," in Proceedings of the International Symposium on System-on-Chip, pp. 50–53, November 2005.
[15] H. Hansson, M. Nolin, and T. Nolte, "Real-time in embedded systems," in Embedded Systems Handbook, chapter 2, CRC Press Taylor & Francis, 2004.
[16] F. Boutekkouk, S. Bilavarn, M. Auguin, and M. Benmohammed, "UML profile for estimating application worst case execution time on system-on-chip," in Proceedings of the International Symposium on System-on-Chip, pp. 1–6, November 2008.
[17] C. Erbas, A. D. Pimentel, M. Thompson, and S. Polstra, "A framework for system-level modeling and simulation of embedded systems architectures," EURASIP Journal on Embedded Systems, vol. 2007, Article ID 82123, 11 pages, 2007.
[18] S. Balsamo and M. Marzolla, "Performance evaluation of UML software architectures with multiclass queueing network models," in Proceedings of the 5th International Workshop on Software and Performance (WOSP '05), pp. 37–42, July 2005.
[19] J. Kreku, M. Hoppari, T. Kestilä, et al., "Combining UML2 application and SystemC platform modelling for performance evaluation of real-time embedded systems," EURASIP Journal on Embedded Systems, 2008.
[20] G. Madl, N. Dutt, and S. Abdelwahed, "Performance estimation of distributed real-time embedded systems by discrete event simulations," in Proceedings of the 7th ACM & IEEE International Conference on Embedded Software (EMSOFT '07), pp. 183–192, 2007.
[21] T. Kangas, Methods and implementations for automated system on chip architecture exploration, Ph.D. thesis, Tampere University of Technology, 2006.
[22] Object Management Group (OMG), "Meta object facility (MOF) specification (version 1.4)," April 2002.
[23] Object Management Group (OMG), "Unified modeling language (UML) superstructure specification, V2.1.2," November 2007.
[24] T. Kangas, J. Salminen, E. Kuusilinna, et al., "UML-based multiprocessor SoC design framework," ACM TECS, vol. 5, no. 2, pp. 281–320, 2006.
[25] E. Salminen, C. Grecu, T. D. Hämäläinen, and A. Ivanov, "Network-on-chip benchmarking specifications part I: application modeling and hardware description, v1.0," OCP-IP, April 2008.
[26] Open Core Protocol International Partnership (OCP-IP), "OCP specification 2.2," May 2008, http://www.ocpip.org.
[27] K. Holma, T. Arpinen, E. Salminen, M. Hännikäinen, and T. D. Hämäläinen, "Real-time execution monitoring on multiprocessor system-on-chip," in Proceedings of the International Symposium on System-on-Chip (SOC '08), pp. 1–6, November 2008.
[28] H. Orsila, T. Kangas, M. Hännikäinen, and T. D. Hämäläinen, "Hybrid algorithm for mapping static task graphs on multiprocessor SoCs," in Proceedings of the International Symposium on System-on-Chip, pp. 146–150, November 2005.
[29] H. Orsila, E. Salminen, M. Hännikäinen, and T. D. Hämäläinen, "Optimal subset mapping and convergence evaluation of mapping algorithms for distributing task graphs on multiprocessor SoC," in Proceedings of the International Symposium on System-on-Chip, November 2007.
[30] K. Holma, M. Setälä, E. Salminen, M. Hännikäinen, and T. D. Hämäläinen, "Evaluating the model accuracy in automated design space exploration," Microprocessors and Microsystems, vol. 32, no. 5-6, pp. 321–329, 2008.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 235032, 12 pages
doi:10.1155/2009/235032
Research Article
Cascade Boosting-Based Object Detection from High-Level
Description to Hardware Implementation
K. Khattab, J. Dubois, and J. Miteran
Le2i UMR CNRS 5158, Aile des Sciences de l'Ingénieur, Université de Bourgogne, BP 47870, 21078 Dijon Cedex, France
Correspondence should be addressed to J. Dubois, jdubois@u-bourgogne.fr
Received 28 February 2009; Accepted 30 June 2009
Recommended by Bertrand Granado
Object detection forms the first step of a larger setup for a wide variety of computer vision applications. The focus of this paper
is the implementation of a real-time embedded object detection system while relying on a high-level description language such
as SystemC. Boosting-based object detection algorithms are considered the fastest accurate object detection algorithms today.
However, the implementation of a real-time solution for such algorithms is still a challenge. A new parallel implementation, which
exploits the parallelism and the pipelining in these algorithms, is proposed. We show that using a SystemC description model
paired with a mainstream automatic synthesis tool can lead to an efficient embedded implementation. We also discuss some of the
tradeoffs and considerations for this implementation to be effective. This implementation proves capable of achieving 42 fps for
320 × 240 images as well as bringing regularity to the processing time.
Copyright © 2009 K. Khattab et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Object detection is the task of locating an object in an image
despite considerable variations in lighting, background, and
object appearance. The ability to detect objects in a scene
is critical to our everyday life activities, and lately it has
gathered an increasing amount of attention.
Motivated by a very active area of vision research, most
object detection methods focus on detecting frontal faces
(Figure 1). Face detection is considered an important
subtask in many computer vision application areas such
as security, surveillance, and content-based image retrieval.
The boosting-based method has led to state-of-the-art detection
systems. It was first introduced by Viola and Jones as
a successful application of AdaBoost [1] for face detection.
Then Li et al. extended this work to multiview faces, using
improved variant boosting algorithms [2, 3]. However, these
methods can be used to detect a plethora of objects, such
as vehicles, bikes, and pedestrians. Overall these methods
proved to be accurate and time-efficient.
Moreover, this family of detectors relies upon several
classifiers trained by a boosting algorithm [4–8]. These
algorithms help achieve a linear combination of weak
classifiers (often a single threshold), capable of real-time face
detection with high detection rates. Such a technique can
be divided into two phases: training and detection (through
the cascade). While the training phase can be done offline
and might take several days of processing, the final cascade
detector should enable real-time processing. The goal is to
run through a given image in order to find all the faces
regardless of their scales and locations. Therefore, the image
can be seen as a set of subwindows that have to be evaluated
by the detector, which selects those containing faces.
Most of the solutions deployed today are software running
on general-purpose processors. Furthermore, with the development of
faster camera sensors which allow higher image resolutions
at higher frame rates, these software solutions do not always
work in real time. Accelerating boosting-based detection can
be considered a key issue in pattern recognition, as much
as motion estimation is for MPEG-4.
Seeking some improvement over the software, several
attempts were made to implement object/face detection
on multi-FPGA boards and multiprocessor platforms
using programmable hardware [9–14], only to fall short in
frame rate and/or accuracy.
The first contribution of this paper is a new structure
that exploits the intrinsic parallelism of a boosting-based object
detection algorithm.
As for a second contribution, this paper shows that
a hardware implementation is possible using high-level
SystemC description models. SystemC enables PC simulation
that allows simple and fast testing and leaves our structure
open to any kind of hardware or software implementation,
since SystemC is independent of all platforms. Mainstream
synthesis tools, such as SystemCrafter [15], are
capable of generating automatic RTL VHDL out of SystemC
models, though there is a list of restrictions and constraints.
The simulation of the SystemC models has highlighted
the critical parts of the structure. Multiple refinements
were made to obtain a precise, compile-ready description.
Therefore, multiple synthesis results are shown. Note that
our fastest implementation is capable of achieving 42
frames per second for 320 × 240 images running at a 123 MHz
frequency.
The paper is structured as follows. In Section 2 the
boosting-based object detectors are reviewed, focusing
on accelerating the detection phase only. In Section 3 a
sequential implementation of the detector is given, showing
its real-time estimation and drawbacks. A new parallel
structure is proposed in Section 4; its benefits in masking
the irregularity of the detector and in speeding up the detection
are also discussed. In Section 5 a SystemC modelling of the
proposed architecture is shown using various abstraction
levels. Finally, the firmware implementation details as
well as the experimental results are presented in Section 6.
2. Review of Boosting-Based Object Detectors
Object detection is dened as the identication and the
localization of all image regions that contain a specic object
regardless of the objects position and size, in an uncontrolled
background and lightning. It is more dicult than object
localization where the number of objects and their size are
already known. The object can be anything from a vehicle,
human face (Figure 1), human hand, pedestrian, and so
forth. The majority of the boosting-based object detectors
work-to-date have primarily focused on developing novel
face detection since it is very useful for a large array of
applications. Moreover, this task is much trickier than other
object detection tasks, due to the typical variations of hair
style, facial hair, glasses, and other adornments. However,
a lot of previous works have proved that the same family
of detector can be used for dierent type of object, such
as hand detection, pedestrian [4, 10], and vehicles. Most of
these works achieved high detection accuracies; of course a
learning phase was essential for each case.
2.1. Theory of Boosting-Based Object Detectors
2.1.1. Cascade Detection. The structure of the cascade detec-
tor (introduced in face detection by Viola and Jones [1])
is that of a degenerated decision tree. It is constituted of
successively more complex stages of classifiers (Figure 2).
Figure 1: Example of face detection.
Figure 2: Cascade detector. (All subwindows enter stage 1; stages 1–4 each pass promising subwindows on ("object") or discard them ("no object") into the set of rejected subwindows; survivors of the last stage go to further processing.)
Figure 3: Rectangle features.
The objective is to increase the speed of the detector by focusing on the promising zones of the image. The first stage of the cascade looks over these promising zones and indicates which subwindows should be evaluated by the next stage. If a subwindow is labeled as non-face by the current classifier, it is rejected and the decision upon it is terminated. Otherwise it has to be evaluated by the next classifier. When a sub-window survives all the stages of the cascade, it is labeled as a face. The classifier complexity increases dramatically with each stage, but the number of sub-windows to be evaluated decreases even more dramatically. Over the cascade the overall detection rate should remain high while the false positive rate should decrease aggressively.
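The early-rejection logic described above can be sketched as follows (an illustrative C++ fragment, not the paper's implementation; the per-stage evaluation is abstracted behind a callable):

```cpp
#include <functional>
#include <vector>

// A stage returns true if the sub-window passes (may still contain the object).
using Stage = std::function<bool(int /*windowIndex*/)>;

// A sub-window is labeled as a face only if it survives every stage.
// Most windows are rejected early, so the expensive later stages run on
// a small fraction of the input.
inline bool classifyWindow(const std::vector<Stage>& cascade, int windowIndex) {
    for (const Stage& stage : cascade) {
        if (!stage(windowIndex)) {
            return false;  // rejected: the decision is terminated here
        }
    }
    return true;  // survived all stages
}
```

This is exactly the degenerated decision tree: there is no backtracking, only a pass/reject decision at each node.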
2.1.2. Features. To achieve a fast and robust implementation, boosting-based face detection algorithms use rectangular Haar-like features (shown in Figure 3) introduced by [16]: two-rectangle features (A and B), three-rectangle features (C), and four-rectangle features (D). They operate on grayscale images and their decisions depend on the thresholded difference between the sum of the luminance of the white region(s) and the sum of the luminance of the gray region(s).
Using a particular representation of the image, the so-called Integral Image (II), it is possible to compute very rapidly
Figure 4: The sum of pixels within rectangle D can be calculated using 4 array references: S_D = II[P4] − (II[P3] + II[P2] − II[P1]).
the features. The II is constructed from the initial image by simply taking the sum of the luminance values above and to the left of each pixel in the image:

ii(x, y) = Σ_{x′<x, y′<y} i(x′, y′),  (1)

where ii(x, y) is the integral image and i(x, y) is the original image's pixel value. Using the Integral Image, any sum of luminance within a rectangle can be calculated from II using four array references (Figure 4). After the II computation, the evaluation of each feature requires 6, 8, or 9 array references depending on its type.
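A compact sketch of the integral-image construction and the four-reference rectangle sum follows (helper names are illustrative; row-major pixel storage and a zero-padded first row/column are assumed):

```cpp
#include <vector>

// Build the integral image: ii(x, y) holds the sum of all pixels strictly
// above and to the left of (x, y), matching (1).
std::vector<long> makeIntegralImage(const std::vector<int>& img, int w, int h) {
    std::vector<long> ii((w + 1) * (h + 1), 0);  // extra zero row/column
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            ii[(y + 1) * (w + 1) + (x + 1)] =
                img[y * w + x]
                + ii[y * (w + 1) + (x + 1)]   // sum above
                + ii[(y + 1) * (w + 1) + x]   // sum to the left
                - ii[y * (w + 1) + x];        // double-counted corner
    return ii;
}

// Sum of pixels in the rectangle [x0, x1) x [y0, y1) from 4 array references,
// as in Figure 4: S = II[P4] - (II[P3] + II[P2] - II[P1]).
long rectSum(const std::vector<long>& ii, int w,
             int x0, int y0, int x1, int y1) {
    const int W = w + 1;
    return ii[y1 * W + x1] - (ii[y0 * W + x1] + ii[y1 * W + x0] - ii[y0 * W + x0]);
}
```

Any Haar-like feature value is then a signed combination of two to four such rectangle sums.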
However, assuming a 24 × 24 pixel sub-window size, the over-complete set of all possible features computed in this window contains 45 396 features [1]: it is clear that a feature selection is necessary in order to keep the computation time compatible with real-time constraints. This is one of the roles of the boosting training step.
2.1.3. Weak Classifiers and Boosting Training. A weak classifier h_j(x) consists of a feature f_j, a threshold θ_j, and a parity p_j indicating the direction of the inequality sign:

h_j(x) = 1, if p_j f_j(x) < p_j θ_j,
         0, otherwise.  (2)
Boosting algorithms (AdaBoost and variants) are able to construct a strong classifier as a linear combination of weak classifiers (here a single threshold) chosen from a given, finite or infinite, set, as shown in (3):

h(x) = 1, if Σ_{t=1}^{T} α_t h_t(x) > θ,
       0, otherwise,  (3)

where θ is the stage threshold, α_t is the weak classifier's weight, and T is the total number of weak classifiers (features).
This linear combination is trained in cascade in order to obtain better results.
Here, a variant of AdaBoost is used for learning object detection; it performs two important tasks: feature selection from the features defined above, and the construction of classifiers using the selected features.
The result of the training step is a set of parameters (array references for the features, constant coefficients of the linear combination of classifiers, and threshold values selected by AdaBoost). This set of feature parameters can easily be stored in a small local memory.
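Equations (2) and (3) map directly onto a few lines of code (a sketch; the struct layout and names are illustrative, and the feature values are assumed to have been computed from the integral image already):

```cpp
#include <cstddef>
#include <vector>

// One trained weak classifier of (2): h_j(x) = 1 iff p_j * f_j(x) < p_j * theta_j.
struct Weak {
    double theta;   // threshold selected by AdaBoost
    int    parity;  // +1 or -1, direction of the inequality
    double alpha;   // weight in the strong classifier of (3)
};

// Stage decision of (3): 1 iff sum_t alpha_t * h_t(x) > stage threshold.
// featureValues[t] holds f_t(x) for the sub-window under test.
bool stageDecision(const std::vector<Weak>& weaks,
                   const std::vector<double>& featureValues,
                   double stageThreshold) {
    double score = 0.0;
    for (std::size_t t = 0; t < weaks.size(); ++t) {
        const Weak& wk = weaks[t];
        int h = (wk.parity * featureValues[t] < wk.parity * wk.theta) ? 1 : 0;
        score += wk.alpha * h;
    }
    return score > stageThreshold;
}
```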
2.2. Previous Implementations. The state-of-the-art initial prototype of this method, also known as the Viola-Jones algorithm, was a software implementation based on classifiers trained using AdaBoost. This first implementation showed good potential by achieving good results in terms of speed and accuracy; the prototype can achieve 15 frames per second on a desktop computer for 320 × 240 images. Such an implementation on general-purpose processors offers a great deal of flexibility, and it can be optimized with little time and cost, thanks to the wide variety of well-established design tools for software development. However, such an implementation can occupy all the CPU's computational power for this task alone; nevertheless, face/object detection is considered a prerequisite step for major applications such as biometrics, content-based image retrieval systems, surveillance, and auto-navigation. Therefore, there is more and more interest in exploring implementations of accurate and efficient object detection on low-cost embedded technologies. The most common target technologies are embedded microprocessors such as DSPs, pure hardware systems such as ASICs, and configurable hardware such as FPGAs.
Many trade-offs can be mentioned when trying to compare these technologies. For instance, the use of an embedded processor can increase the level of parallelism of the application, but at the cost of high power consumption, all while tying the solution to a dedicated processor.
Using an ASIC can yield better frequency performance coupled with a high level of parallelism and low power consumption. Yet, in addition to the loss of flexibility, using this technology requires a large amount of development, optimization, and implementation time, which raises the cost and risk of the implementation.
FPGAs can offer a slightly better performance/cost trade-off than the previous two, since they permit a high level of parallelism coupled with some design flexibility. However, some restrictions in the design space, costly RAM connections, as well as lower frequencies compared to ASICs, can rule out their use for some memory-heavy applications.
To our knowledge, few attempts have been made to implement boosting-based face detection on embedded platforms; fewer still have attempted such an implementation for other object types, for example, whole-body detection [10].
Nevertheless, these proposed architectures were configurable hardware-based implementations, and most of them could not achieve a high detection frame rate while keeping the detection rate close to that of the original implementation. For instance, in order to achieve 15 frames per second for 120 × 120 images, Wei et al. [11] chose to raise the enlargement scale factor from 1.25 to 2. However, such a maneuver lowers the detection rate dramatically.
Theocharides et al. [12] have proposed a parallel archi-
tecture taking advantage of a grid array processor. This array
processor is used as memory to store the computation data
and as a data transfer unit, to aid in accessing the integral image in parallel. This implementation can achieve 52 frames per second at a 500 MHz frequency. However, details about the image resolution were not mentioned.
Another complex control scheme to meet hard real-time deadlines is proposed in [13]. It introduces a new hardware pipeline design for Haar-like feature calculation and a system design exploiting several levels of parallelism. But it sacrifices the detection rate and is better suited to portrait pictures.
More recently, an implementation with a NoC (Network-on-Chip) architecture was proposed in [14], using some of the same elements as [12]; this implementation achieves 40 frames per second for 320 × 240 images. However, its detection rate of 70% was well below that of the software implementation (82% to 92%), due to the use of only 44 features (instead of about 4600 in [1]).
3. Sequential Implementation
In the software implementation, the strategy used consists of processing one sub-window at a time. The processing of the next sub-window is not triggered until a final decision is taken upon the previous one, that is, after going through a set of features as a programmable list of coordinate rectangles.
When attempting to implement such a cascade algorithm, each stage is investigated alone. For instance, the first-stage classifier should be separated from the rest, since it requires processing all the possible subwindows in an image, while each of the others relies on the result of the previous stage and evaluates only the subwindows that passed through.
3.1. First Classification Stage. As mentioned earlier, this classifier must run over the whole image and reject the subwindows that do not fit the criteria (no face in the window). The detector is scanned across locations and scales, and subsequent locations are obtained by shifting the window some number of pixels k. Only positive results trigger the next classifier.
The addresses of the positive sub-windows are stored in a memory, so that the next classifier can evaluate them, and only them, in the next stage. Figure 5 shows the structure of such a classifier. The processing time of this first stage is stable and independent of the image content; the algorithm here is regular. The classifier complexity at this stage is usually very low (only one or two features are considered; the decision is made of one or two comparisons, two multiplications, and one addition).
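The shift-and-scale scan can be sketched as below (illustrative only; the exact rounding of the step and window size in the original implementation is not specified, so this enumeration approximates it):

```cpp
#include <vector>

struct Window { int x, y, size; };

// Enumerate all sub-window positions for one image: at each scale, slide the
// window by a step proportional to the scale, then enlarge by scaleFactor.
std::vector<Window> enumerateWindows(int imgW, int imgH, int baseSize,
                                     double scaleFactor, double baseStep) {
    std::vector<Window> out;
    for (double scale = 1.0;
         baseSize * scale <= imgW && baseSize * scale <= imgH;
         scale *= scaleFactor) {
        int size = static_cast<int>(baseSize * scale);
        int step = static_cast<int>(baseStep * scale + 0.5);
        if (step < 1) step = 1;
        for (int y = 0; y + size <= imgH; y += step)
            for (int x = 0; x + size <= imgW; x += step)
                out.push_back({x, y, size});
    }
    return out;
}
```

With a 320 × 240 image, a 24 × 24 base window, a 1.25 scale factor, and a 1.5 base step, an enumeration of this kind yields the roughly 10^5 sub-windows quoted in Section 3.3.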
3.2. Remaining Classification Stages. The next classification stages, shown in Figure 6, do not need to evaluate the whole image. Each classifier should examine only the positive results given by the previous stage, by reading their addresses in the memory, and then take a decision upon each one (reject or pass to the next classifier stage).
Each remaining classifier is expected to reject the majority of sub-windows and keep the rest to be evaluated later in the cascade. As a result, the processing time depends largely on the number of positive sub-windows resulting from the
Figure 5: First cascade stage.
Figure 6: nth stage classifier.
Figure 7: Sequential implementation.
previous stage. Moreover, the classifier complexity (number of comparisons, multiplications, and additions) increases with the stage level.
3.3. Full Sequential Implementation. The full sequential implementation of this cascade is proposed in Figure 7.
For a 320 × 240 image, scanned over 11 scales with a scaling factor of 1.25 and a step of 1.5, the total number of sub-windows to be investigated is 105 963. Based on tests done in [1], an average of 10 features is evaluated per sub-window. As a result, the estimated number of decisions made over the cascade, for a 320 × 240 image, is 1.3 million on average, and thereafter around 10 million memory accesses (since each decision needs 6, 8, or 9 array references to calculate the feature in play). Note that the computation time of the decision (a linear combination of constants) as well as the time needed to build the integral image is negligible compared to the overall memory access time.
Considering the speed of the memory to be 10 nanoseconds per access (100 MHz), the time needed to process a full image is around 100 milliseconds (about 10 images per second). However, this rate can vary with the image's content. Nevertheless, this measurement has been repeated several times [13] on a standard PC, and the obtained processing rate is around 10 images/s; such an implementation is still not well suited for embedded applications and does not use parallelism.
4. Possible Parallelism
As shown in Section 3, the boosting-based face detector has a few drawbacks: first, the implementation still needs to be accelerated in order to achieve real-time detection, and second, the processing time for an image depends on its content.
4.1. Algorithm Analysis and Parallel Model. A degraded cascade of 10 stages is presented in [1]. It contains fewer than 400 features and achieves a detection rate between 75% and 85%. Another widely used and more complex cascade can be found in OpenCV [17] (discussed later in Section 5.2). This cascade includes more than 2000 features spread over 22 stages and achieves higher detection rates than the degraded version (between 80% and 92%) with fewer false positive detections. Analyzing those two cascades, one notices that about 35% of the memory accesses take place in each of the first two classifiers, while 30% take place in all the remaining stages, which leads us to suggest a new structure (shown in Figure 8) of 3 parallel blocks that work simultaneously: in the first two blocks we implement, respectively, the first- and second-stage classifiers, and a final block is assigned to run over all remaining stages sequentially.
Unlike the state-of-the-art software implementation, the proposed structure runs each stage as a standalone block. Nevertheless, some intermediate memories between the stages must be added in order to store the positively-labeled windows' addresses.
The new structure proposed above can boost the speed of the detector under one condition: since the computational complexity is relatively small and the processing time depends heavily on memory accesses, an integral image memory should be available for each block in order to benefit from three simultaneous memory accesses. Figure 8 shows the proposed parallel structure. At the end of every full image processing cycle, the positive results from Block1 trigger the evaluation of Block2. The positive results from Block2 trigger the evaluation of Block3. And the positive results from Block3 are labeled as faces. It should be noted that the blocks cannot process the same image simultaneously; that is, if at a given moment Block2 is working on the current image (I1), then Block1 should be working on the next image (I2) and Block3 should be working on the previous image (I0). As mentioned in Section 3, the first classifier stage is slightly different from the others, since it should evaluate the whole image. Hence, a shift-and-scale model is needed. Its positive results are stored in a memory (mem.1) and copied into another memory (mem.2) in order to be used in the second stage. The second stage's positive results are stored in a memory (mem.3, duplicated in mem.4) in order to be used in the final block.
The final block is similar to the second, but it is designed to implement all the remaining stages. Once the processing of mem.4 is finished, Block3 works the same way as in the sequential implementation: the block runs back and forth through all the remaining stages, to finally give the addresses of the detected faces.
This can be translated into the model shown in Figure 9. A copy of the integral image is available to each block, and three pairs of logical memories work in ping-pong to accelerate the processing.
The given parallel model runs at the speed of its slowest block. As mentioned earlier, the first stage of the cascade requires more memory accesses, and therefore more processing time, than the second stage alone or all the remaining stages together. In the first classifier stage, all 105 963 sub-windows should be inspected using four features with eight array references each. Therefore, it requires about 3.4 million memory accesses per image. Using the same type of memory as in Section 3.3, an image needs roughly 34 milliseconds of processing time (29 images per second).
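The first-stage throughput estimate above can be checked in a few lines (all constants are the ones quoted in the text):

```cpp
// Block 1 workload from Section 4.1: every sub-window is inspected with
// four features of eight array references each, on a 10 ns (100 MHz) memory.
const long   kSubWindows = 105963;                  // 320x240, 11 scales
const long   kAccesses   = kSubWindows * 4 * 8;     // ~3.4 million per image
const double kFrameMs    = kAccesses * 10.0 / 1e6;  // ~34 ms per image
const double kFps        = 1000.0 / kFrameMs;       // ~29 images per second
```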
4.2. Parallel Model Discussion. Normally the proposed structure should stay the same even if the cascade structure changes, since most boosting cascade structures share the same properties, at least in the first two cascade stages.
One of the major issues surrounding boosting-based detection algorithms (especially when applied to object detection in an unconstrained scene) is the inconsistent and unpredictable processing time; for example, a white image will always take little processing time, since no sub-window should be capable of passing the first stage of the cascade. In contrast, an image of a thumbnail gallery will take much more time.
This structure does not only give a gain in speed: the first stage happens to be the only regular one in the cascade, with a fixed processing time per image. This means that we can mask the irregular part of the algorithm by fixing the detector's overall processing time.
As a result, the whole system will not work at 3 times the speed of the average sequential implementation, but a little less. However, theoretically both models should run at the same speed when encountering a homogeneous image (e.g., a white or black image). Further work in Section 5 will show that the embedded implementation can benefit from some system tweaks (pipelining and parallelism) within the computation that make the architecture even faster.
Due to this masking phenomenon in the parallel implementation, decreasing the number of weak classifiers can accelerate the implementation, but only if the first stage of the cascade is accelerated.
For this structure to be implemented effectively, its constraints must be taken into consideration. The memory, for instance, can be the greediest and most critical part;
Figure 8: Parallel structure. (Block 1: 100% of the sub-windows, 35% of the total memory accesses, 4 to 8 features. Block 2: <50% of the sub-windows, 35% of the total memory accesses, 8 to 20 features. Block 3: <15% of the sub-windows, 30% of the total memory accesses, up to 2000 features.)
Figure 9: Data flow.
the model requires multiple memory accesses to be done simultaneously.
It is obvious that a generic architecture (a processor, a global memory, and a cache) will not be enough to manage up to seven simultaneous memory accesses on top of the processing without hurting its performance.
5. Architecture Definition:
Modelling Using SystemC
Flexibility and target architecture are two major criteria for any implementation. First, a decision was taken to build our implementation using a high-level description model/language. Modelling at a high level of description leads to quicker simulation, better bandwidth estimation, and better functional validation, and above all it helps delay the system orientation and thereby the choice of hardware target.
5.1. SystemC Description. C++ implements object orientation on top of the C language. Many hardware engineers may consider the principles of object orientation fairly remote from the creation of hardware components. Nevertheless, object orientation grew out of design techniques used in hardware design. Data abstraction is the central aspect of object orientation, and it can be found in everyday hardware designs in the use of publicly visible ports and private internal signals. Moreover, component instantiation found in hardware designs is almost identical to the principle of composition used in C++ for creating hierarchical designs. Hardware components can be modelled in C++, and to some extent, the mechanisms used are
similar to those used in HDLs. Additionally, C++ provides inheritance as a way to complement the composition mechanism and promotes design reuse.

Figure 10: SystemC architecture implementation.
Nonetheless, C++ does not support concurrency, which is an essential aspect of systems modelling. Furthermore, timing and propagation delays cannot easily be expressed in C++.
SystemC [18] is a relatively new modeling language based on C++ for system-level design. It has been developed as a standardized modeling language for systems containing both hardware and software components.
The SystemC class library provides the necessary constructs to model system architecture, including reactive behaviour, scheduling policy, and hardware-like timing, none of which are available in standalone C/C++.
There are multiple advantages to using SystemC over classic hardware description languages such as VHDL and Verilog: flexibility, simplicity, simulation speed, and above all portability, to name a few.
5.2. SystemC Implementation for Functional Validation and Verification. The SystemC approach consists of a progressive refinement of specifications. Therefore, a first initial implementation was done using an abstract high-level timed functional representation.
In this implementation, we used the proposed parallel structure discussed in Section 4.
This modeling consists of high-level SystemC modules (TLM) communicating with each other using channels, signals, or even memory-block modules written in SystemC (Figure 10). Scheduling and timing were used but have not been explored for hardware-like purposes. The data types used in this modelling are strictly C++ data types.
As for the cascade/classifiers, we chose to use the database found in the Open Computer Vision Library [17] (OpenCV). OpenCV provides the most widely used trained cascade/classifier datasets and face-detection software (Haar-Detector) today, for the standard prototype of the Viola-Jones algorithm. The particular classifiers used in this library are those trained for a base detection window of 24 × 24 pixels, using AdaBoost. These classifiers were created and trained, by Lienhart et al. [19], for the detection of upright frontal faces. The detection rate of these classifiers is between 80% and 92%, depending on the image database.

Figure 11: SystemC functional validation flow.
The output of our implementation is the set of addresses of the sub-windows which contain, according to the detector, an object of the particular type (a face in our case). Functional validation is done by simulation (Figure 11). Multiple tests were then performed, including visual comparisons on a dataset of images, inspection of simulation signals, and other tests that consist of comparing the response of each classifier with its counterpart implemented in OpenCV's Haar-Detector software. All of these tests indicate that we achieve the same detection rate as the software provided by OpenCV. The images used in these tests were taken from the CMU+MIT face databases [20].
The choice of working with faces, instead of other object types, helps comparison with other recent works. However, using this structure for other object-type detection is very feasible, on the condition of having a trained dataset of classifiers for the specific object. This can be considered a simple task, since OpenCV also provides the training software for the cascade detector. Moreover, classifiers from other variants of boosting can be implemented easily, since the structure is written in a high-level language. As a result, changing the boosting variant is a minor modification, since the architecture of the cascade detector stays intact.
5.3. Modelling for Embedded Implementation. While the previous SystemC modelling is very useful for functional validation, more optimization must be carried out in order to achieve a hardware implementation. Indeed, the SystemC standard is a system-level modelling environment which allows the design of systems at various abstraction levels. The design cycle starts with an abstract high-level untimed or timed functional representation that is refined to a bus-cycle-accurate and then an RTL (Register Transfer Level) hardware model. SystemC provides several data types in addition to those of C++; these data types are mostly adapted to hardware specification.
Figure 12: The global architecture in SystemC modules.
Furthermore, a SystemC hardware model can be synthesized for various target technologies. Numerous behavioural synthesis tools are available on the market for SystemC (e.g., Synopsys Cocentric Compiler, Mentor Catapult, SystemCrafter, and AutoESL). It should be noted that, for all those available tools, it is necessary to refine the initial simulatable SystemC description in order to synthesize it into hardware. The reason is that the SystemC language is a superset of C++ designed for simulation.
Therefore, a new, improved, and above all more refined cycle-accurate RTL model version of the design implementation was created.
Our design is split into compilation units, each of which can be compiled separately. Alternatively, it is possible to use several tools for different parts of the design, or even to use the partitioning in order to exploit most of the possible parallelism and pipelining for a more efficient hardware implementation. Eventually, the main block modules of the design were split into groups of small modules that work in parallel and/or in pipeline. For instance, the module BLOCK1 contains three compilation units (modules): a Decision module, which contains the first stage's classifiers and is used for computation and decision on each sub-window; a Shift-and-Scale module, used for shifting and scaling the window in order to obtain all subsequent locations; and finally a Memory-Ctrl module, which manages the intermediate memory accesses.
The result is a SystemC model composed of 11 modules (Figure 12): three for BLOCK1, two for BLOCK2, two for BLOCK3, one for the integral image transformation, two for the SRAM simulation, and one for the SDRAM intermediate memory (discussed later in this section).
Other major refinements were done: divisions were simplified into powers-of-two divisions, the dataflow model was further refined into a SystemC/C++ combination of finite state machines and datapaths, loops were exploited, and timing and scheduling were taken into consideration. Note that, in most cases, parallelism and pipelining were forced manually. On the other hand, not all the modules were heavily refined; for example, the two SRAM modules were used in order to simulate a physical memory, which will never be synthesized no matter what the target platform is.
5.4. Intermediate Memory. One of the drawbacks of the proposed parallel structure (given in Section 4) is the use of additional intermediate memories (unnecessary in the software implementation). Logically, an inter-block memory unit is formed of two memories working in ping-pong.
A stored address should hold the position of a particular sub-window and its scale; there is no need for two-dimensional positioning, since the integral image is created as a mono-dimensional table for better RAM storage.
For a 320 × 240 image and an initial mask size of 24 × 24, a word of 20 bits is enough to store the concatenation of the position and the scale of each sub-window.
As for the capacity of the memories, a worst-case scenario occurs when half of the possible sub-windows manage to pass through the first block. That leads to around 50 000 (50% of the sub-windows) addresses to store. Using the same logic on the next block, the total number of addresses to store should not exceed 75 000. Eventually, a combined memory capacity of less than 192 Kbytes is needed.
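One possible packing of such an address word is sketched below (the 16/4 bit split is an assumption for illustration; the paper only states that 20 bits suffice for the position/scale concatenation):

```cpp
#include <cstdint>

// Hypothetical layout for the 20-bit stored address: the mono-dimensional
// sub-window position in the lower 16 bits, the scale index in the next 4.
constexpr std::uint32_t packAddress(std::uint32_t position, std::uint32_t scale) {
    return (scale << 16) | (position & 0xFFFFu);
}
constexpr std::uint32_t positionOf(std::uint32_t word) { return word & 0xFFFFu; }
constexpr std::uint32_t scaleOf(std::uint32_t word)    { return word >> 16; }
```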
Moreover, the simulation of our SystemC model shows that even when facing a case of consecutive positive decisions for a series of sub-windows, accesses to those memories will not occur more than once every 28 cycles (in the case of mem.1 and mem.2), or once every 64 cycles (in the case of mem.3 and mem.4).
Due to these facts, we propose a timesharing system (shown in Figure 13) using four memory banks, working as FIFO blocks, with only one physical memory. A typical hardware implementation of a 192-Kbyte SDRAM or DDRAM memory, running at a frequency of at least 4 times that of the FIFO banks, is enough to replace the four logical memories.
SystemC simulation shows that 4 Kbits is enough for each memory bank. The FIFOs are easily added using SystemC's own predefined sc_fifo module.
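The ping-pong behaviour of a pair of logical banks can be sketched in plain C++ (a stand-in for the SystemC sc_fifo modules; the class name and interface are illustrative):

```cpp
#include <cstdint>
#include <queue>

// Two logical FIFO banks working in ping-pong: the producer stage fills one
// bank while the consumer stage drains the other; roles swap once per image.
struct PingPong {
    std::queue<std::uint32_t> bank[2];
    int writeBank = 0;  // bank currently written by the producer

    void push(std::uint32_t addr) { bank[writeBank].push(addr); }

    bool pop(std::uint32_t& addr) {       // consumer reads the other bank
        std::queue<std::uint32_t>& q = bank[1 - writeBank];
        if (q.empty()) return false;
        addr = q.front();
        q.pop();
        return true;
    }

    void swap() { writeBank = 1 - writeBank; }  // end of image: swap roles
};
```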
6. Hardware Implementation and
Experimental Performances
6.1. Hardware Implementation. A SystemC hardware model can be synthesized for various target technologies. However, no synthesizer is capable of producing efficient hardware from a SystemC program written for simulation. Automatic synthesis tools can produce fast and efficient hardware only if the entry code accommodates certain difficult requirements, such as using hardware-like development methods. Therefore, the results of the synthesis depend heavily on the entry code, the tool itself, and the different levels of refinement applied. Figure 14 shows the two different kinds of refinements needed to achieve a successful
Figure 13: Intermediate memories structure.
Figure 14: SystemC to hardware implementation development flow.
fast implementation using a high-level description language. The first type of refinement is the one required by the tool itself. Without it, the tool is not capable of compiling the SystemC code to RTL level. Even so, those refinements do not lead directly to a good, proven implementation. Another type of refinement should take place in order to optimize the size, the speed, and sometimes (depending on the tool used) the power consumption.
For our design, several refinement versions were applied to different modules depending on their initial speed and usability.
The SystemC scheduler uses the same behavior for software simulation as for hardware simulation. This works to our advantage, since it gives the possibility of choosing which of the modules are to be synthesized, while the rest works as a SystemC test bench for the design.
Our synthesis phase was performed using an automatic tool, named SystemCrafter, which is a SystemC synthesis tool that targets Xilinx FPGAs.
Table 1: The synthesis results of the component implementations.

Integral Image:
- Number of occupied Slices: 913 of 10752 (8%)
- Number of Slice Flip Flops: 300 of 21504 (1%)
- Number of 4-input LUTs: 1761 of 21504 (8%)
- Number of DSPs: 2 of 48 (4%)
- Maximum frequency: 129 MHz

BLOCK 1:
- Number of occupied Slices: 1281 of 10752 (12%)
- Number of Slice Flip Flops: 626 of 21504 (3%)
- Number of 4-input LUTs: 2360 of 21504 (11%)
- Number of DSPs: 1 of 48 (2%)
- Maximum frequency: 47 MHz

BLOCK 2:
- Number of occupied Slices: 3624 of 10752 (34%)
- Number of Slice Flip Flops: 801 of 21504 (4%)
- Number of 4-input LUTs: 7042 of 21504 (33%)
- Number of DSPs: 3 of 48 (6%)
- Maximum frequency: 42 MHz
It should be noted that the SystemC entry code used can be described as VHDL-like synchronous and pipelined C code (bit accurate): most parallelism and pipelining within the design were created manually using different processes, threads, and state machines. SystemC data types were used in order to minimize the implementation size. Loops were exploited, and timing as well as variable lengths were always a big factor.
Using SystemCrafter, multiple VHDL components are generated and can be easily added to or merged with other VHDL components (notably the FIFO modules).
As for the testbench set, the description was kept in high-level abstraction SystemC for faster prototyping and simulation.
Basically, our implementation brings together three
major components: the integral image module, the first-stage
decision module, and the second-stage decision module
(block 3 of the structure is yet to be implemented). Other
components such as memory controllers and FIFO modules
are also implemented but are trifling when compared to
the other big three.
Each of these components was implemented separately in
order to analyze its performance. In each case, multiple
graphic simulations were carried out to verify that the
outputs of both descriptions (SystemC's and VHDL's) are
identical.
6.2. Performances. The Xilinx Virtex-4 XC4VLX25 was
selected as the target FPGA. The VHDL model was back-
annotated using the Xilinx ISE. The synthesis results of the
design implementation for each of the components are given
in Table 1.
The synthesis results of the design implementation for
the whole design (BLOCK1, BLOCK2 and integral image
combined) are given in Table 2.
10 EURASIP Journal on Embedded Systems
Table 2: The synthesis results of the entire design implementation.

Logic utilization              Used    Available    Utilization
Number of occupied Slices      5941        10752           55%
Number of Slice Flip Flops     1738        21504            8%
Number of 4-input LUTs        11418        21504           53%
Number of DSPs                    6           48           13%
Maximum frequency           42 MHz
Table 3: The synthesis results of the decision modules' implementation.

Logic utilization              Used    Available    Utilization
BLOCK 1 Decision
  Number of occupied Slices    1281        10752           12%
  Number of Slice Flip Flops    626        21504            3%
  Number of 4-input LUTs       2360        21504           11%
  Number of DSPs                  1           48            2%
  Maximum frequency          47 MHz
BLOCK 2 Decision
  Number of occupied Slices    3624        10752           34%
  Number of Slice Flip Flops    801        21504            4%
  Number of 4-input LUTs       7042        21504           33%
  Number of DSPs                  3           48            6%
  Maximum frequency          42 MHz
The clock rate of the design cannot exceed the rate of
its slowest component (BLOCK2). The design is capable of
running at a frequency of 42 MHz. In the first block, a
decision is taken on a sub-window every 28 clock cycles.
Hence, this system is capable of achieving only up to 15
frames per second on 320 × 240 images.
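The arithmetic behind this estimate can be checked directly; note that the sub-window count per 320 × 240 frame below is our assumption (roughly 10^5, inferred so as to match the paper's figures), not a number stated in the text:

```cpp
// Frames per second from the clock budget: one sub-window decision every
// 'cyclesPerWindow' cycles. 'windowsPerFrame' is an assumed value.
constexpr double framesPerSecond(double clockHz,
                                 double cyclesPerWindow,
                                 double windowsPerFrame) {
    return clockHz / (cyclesPerWindow * windowsPerFrame);
}
// 42 MHz, 28 cycles, ~1e5 sub-windows per frame -> 15 fps; at 123 MHz
// (Section 6.2) the same formula gives ~43.9, consistent with the
// reported 42 fps.
```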
Accelerating BLOCK1 and BLOCK2 is essential in order
to achieve a higher detection speed. BLOCK1 includes three
important modules: the Decision module, the Shift-and-Scale
module, and the Memory-ctrl module. As for BLOCK2, it
includes only the Decision module and the Memory-ctrl
module. The decision modules, however, use some division and
multiplication operators, which are costly in terms of clock
frequency. Therefore, the Decision module of each of these two
components was synthesized alone, and their synthesis results
are shown in Table 3.
As expected, the Decision modules in both BLOCK1
and BLOCK2 are holding the implementation to a low
frequency.
Analyzing the automatically generated VHDL code shows
that, despite all the refinement already done, the System-
Crafter synthesis tool still produces much more complex RTL
code than is essentially needed. In particular, when using arrays
in loops, the tool creates a register for each value and
then wires it into all possible outputs. Things get worse
when trying to update all the array elements within one
clock cycle, a scenario which occurs regularly in our
design, for example, when updating classifier parameters after a
shifting or a scaling. Simulation tests proved that these
manipulations can greatly slow down the design frequency.
Table 4: The synthesis results for the new improved decision modules.

Logic utilization              Used    Available    Utilization
BLOCK 1 Decision
  Number of occupied Slices     713        10752            7%
  Number of Slice Flip Flops    293        21504            1%
  Number of 4-input LUTs       1091        21504            5%
  Number of DSPs                  1           48            2%
  Maximum frequency         127 MHz
BLOCK 2 Decision
  Number of occupied Slices    2582        10752           24%
  Number of Slice Flip Flops    411        21504            2%
  Number of 4-input LUTs       5082        21504           24%
  Number of DSPs                  3           48            6%
  Maximum frequency         123 MHz
Table 5: The synthesis results of the refined implementation for the
entire design.

Logic utilization              Used    Available    Utilization
Number of occupied Slices      4611        10752           43%
Number of Slice Flip Flops     1069        21504            5%
Number of 4-input LUTs         8527        21504           40%
Number of DSPs                    6           48           13%
Maximum frequency          123 MHz
Therefore, further refinement was made to the Decision
SystemC modules. For instance, array updates were split
across clock cycles, in such a way that no additional
clock cycles are lost while updating a single array element per
cycle.
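The refinement can be pictured in software as replacing a whole-array update with a small state machine that touches one element per call; the following is only an illustrative C++ model of the idea (the names and the "one call = one clock cycle" convention are ours, not the paper's):

```cpp
#include <array>
#include <cstddef>

// Illustrative model: one step() call stands for one clock cycle of the
// synthesized state machine, updating a single classifier parameter.
struct ParamUpdater {
    std::array<int, 8> params{};  // e.g., classifier parameters
    std::size_t idx = 0;          // element updated on the current cycle
    bool busy = false;

    void start() { idx = 0; busy = true; }

    void step(int scale) {        // one clock cycle: update one element
        if (!busy) return;
        params[idx] *= scale;
        if (++idx == params.size()) busy = false;
    }
};
```

Synthesizing this form avoids the register-per-value fan-out that the crafter generates when every element must change within the same cycle.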
The synthesis results for the new, improved, and more refined
decision modules are shown in Table 4. The refinements
made allow a faster, lighter, and more efficient implementation
of the two modules. A new full-system implementation was
made by inserting the new Decision modules; its results
and performances are shown in Table 5. The FPGA can
operate at a clock speed of 123 MHz. Using the same logic
as before, a decision is taken on a sub-window every 28 clock
cycles; therefore the new design can achieve up to 42 frames
per second on 320 × 240 images.
The simulation tests, used in Section 5.2 for the func-
tional validation of the SystemC code, were carried out on
the VHDL code mixed with a high-level test bench (the same
SystemC test bench used for the SystemC validation model).
The outputs of the VHDL code were compared to the outputs
of OpenCV's implementation after the first two classifi-
cation stages. These tests prove that we were able to achieve
the same detection results as the software provided
by OpenCV. The design can run at an even faster pace if more
refinements and hardware considerations are applied. How-
ever, it should be noted that using different SystemC synthe-
sis tools can yield different results; after all, the amount and
effectiveness of the refinements depend largely on the tool
itself.
Other optimizations can be achieved by replacing some of
the auto-generated VHDL code from the crafter with manually
optimized code.
7. Conclusion
In this paper, we proposed a new architecture for an
embedded real-time object and face detector based on a fast
and robust family of methods, initiated by Viola and Jones
[1].
First, we built a sequential structure model, which proved
to be irregular in its processing time. As an estimate, the
sequential implementation of a degraded cascade detector
can achieve an average of 10 frames per second.
Then a new parallel structure model was introduced. This
structure proves to be at least 2.9 times faster than the
sequential one and provides regular processing time.
The design was validated using SystemC. Simulation and
hardware synthesis were carried out, showing that such an algo-
rithm can easily be fitted into an FPGA chip, while having the
ability to achieve state-of-the-art performance in both
frame rate and accuracy.
The hardware target, used for the validation, is an FPGA-
based board connected to the PC through a USB 2.0 port. The
use of a SystemC description enables the design to be easily
retargeted to different technologies. The implementation
of our SystemC model on a Xilinx Virtex-4 can achieve
a theoretical detection rate of 42 frames per second on 320 × 240
images.
We showed that a SystemC description is not only inter-
esting for exploring and validating a complex architecture; it can
also be very useful for detecting bottlenecks in the dataflow
and for accelerating the architecture by exploiting parallelism
and pipelining. Eventually, thanks to synthesis tools, it can lead
to an embedded implementation that achieves state-of-the-art
performance. More importantly, it helps in developing a
flexible design that can be migrated to a wide variety of
technologies.
Experiments have shown that refinements
made to the SystemC entry code add up to substantial
reductions in size and total execution time, even though
the extent and effectiveness of these optimizations are largely
attributable to the SystemC synthesis tool itself and to the
designer's hardware knowledge and experience. Therefore, one
very intriguing perspective is the exploration of this design using
other tools for comparison purposes.
Accelerating the first stage can lead directly to a whole-
system acceleration. In the future, our description could
be used as a part of a more complex process integrated
in a SoC. We are currently exploring the possibility of a
hardware/software solution by prototyping a platform based
on a Wildcard [21]. Recently, we have had successful experiences
implementing similar solutions to accel-
erate Fourier descriptors for object recognition using
SVM [22] and motion estimation for MPEG-4 coding [23].
For example, the integral image block as well as the first and
second stages can be executed in hardware on the Wildcard,
while the rest can be implemented in software on a dual-core
processor.
References
[1] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '01), vol. 1, pp. 511–518, 2001.
[2] S. Li, L. Zhu, Z. Q. Zhang, A. Blake, H. J. Zhang, and H. Shum, "Statistical learning of multi-view face detection," in Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, May 2002.
[3] J. Sochman and J. Matas, "AdaBoost with totally corrective updates for fast face detection," in Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 445–450, 2004.
[4] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in Proceedings of the IEEE International Conference on Computer Vision (ICCV '03), vol. 2, pp. 734–741, October 2003.
[5] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Proceedings of the European Conference on Computational Learning Theory (EuroCOLT '95), pp. 23–37, 1995.
[6] J. Kivinen and M. K. Warmuth, "Boosting as entropy projection," in Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 134–144, ACM, Santa Cruz, Calif, USA, July 1999.
[7] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, vol. 15, no. 11, pp. 1119–1125, 1994.
[8] J. Sochman and J. Matas, "WaldBoost: learning for time constrained sequential detection," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 2, San Diego, Calif, USA, June 2005.
[9] M. Reuvers, "Face detection on the INCA+ system," M.S. thesis, University of Amsterdam, 2004.
[10] V. Nair, P. O. Laprise, and J. J. Clark, "An FPGA-based people detection system," EURASIP Journal on Applied Signal Processing, no. 7, pp. 1047–1061, 2007.
[11] Y. Wei, X. Bing, and C. Chareonsak, "FPGA implementation of AdaBoost algorithm for detection of face biometrics," in Proceedings of the IEEE International Workshop on Biomedical Circuits and Systems, 2004.
[12] T. Theocharides, N. Vijaykrishnan, and M. J. Irwin, "A parallel architecture for hardware face detection," in Proceedings of the IEEE Computer Society Annual Symposium on Emerging Technologies and Architectures (VLSI '06), 2006.
[13] M. Yang, Y. Wu, J. Crenshaw, B. Augustine, and R. Mareachen, "Face detection for automatic exposure control in handheld camera," in Proceedings of the 4th IEEE International Conference on Computer Vision Systems (ICVS '06), 2006.
[14] H.-C. Lai, R. Marculescu, M. Savvides, and T. Chen, "Communication-aware face detection using NoC architecture," in Proceedings of the 6th International Conference on Computer Vision Systems (ICVS '08), vol. 5008 of Lecture Notes in Computer Science, pp. 181–189, 2008.
[15] SystemCrafter, http://www.systemcrafter.com/.
[16] C. Papageorgiou, M. Oren, and T. Poggio, "A general framework for object detection," in Proceedings of the International Conference on Computer Vision, 1998.
[17] Open Source Computer Vision Library, February 2009, http://sourceforge.net/projects/opencvlibrary/.
[18] S. Swan, An Introduction to System Level Modeling in SystemC 2.0, Cadence Design Systems, Inc., 2001.
[19] R. Lienhart, A. Kuranov, and V. Pisarevsky, "Empirical analysis of detection cascades of boosted classifiers for rapid object detection," in Proceedings of the 25th Pattern Recognition Symposium (DAGM '03), pp. 297–304, 2003.
[20] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 22–38, 1998.
[21] Annapolis Microsystems Inc., Annapolis WILDCARD System Reference Manual, Revision 2.6, 2003, http://www.annapmicro.com/.
[22] F. Smach, J. Miteran, M. Atri, J. Dubois, M. Abid, and J.-P. Gauthier, "An FPGA-based accelerator for Fourier descriptors computing for color object recognition using SVM," Journal of Real-Time Image Processing, vol. 2, no. 4, pp. 249–258, 2007.
[23] J. Dubois, M. Mattavelli, L. Pierrefeu, and J. Miteran, "Configurable motion-estimation hardware accelerator module for the MPEG-4 reference hardware description platform," in Proceedings of the International Conference on Image Processing (ICIP '05), vol. 3, Genova, Italy, 2005.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 479281, 16 pages
doi:10.1155/2009/479281
Research Article
Very Low-Memory Wavelet Compression Architecture
Using Strip-Based Processing for Implementation in
Wireless Sensor Networks
Li Wern Chew, Wai Chong Chia, Li-minn Ang, and Kah Phooi Seng
Department of Electrical and Electronic Engineering, The University of Nottingham, 43500 Selangor, Malaysia
Correspondence should be addressed to Li Wern Chew, eyx6clw@nottingham.edu.my
Received 4 March 2009; Accepted 9 September 2009
Recommended by Bertrand Granado
This paper presents a very low-memory wavelet compression architecture for implementation in severely constrained hardware
environments such as wireless sensor networks (WSNs). The approach employs a strip-based processing technique where an image
is partitioned into strips and each strip is encoded separately. To further reduce the memory requirements, the wavelet compression
uses a modified set-partitioning in hierarchical trees (SPIHT) algorithm based on a degree-0 zerotree coding scheme to give high
compression performance without the need for adaptive arithmetic coding, which would require additional storage for multiple
coding tables. A new one-dimension (1D) addressing method is proposed to store the wavelet coefficients into the strip buffer for
ease of coding. A softcore microprocessor-based hardware implementation on a field programmable gate array (FPGA) is presented
for verifying the strip-based wavelet compression architecture, and software simulations are presented to verify the performance of
the degree-0 zerotree coding scheme.
Copyright © 2009 Li Wern Chew et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
The capability of having multiple sensing devices that commu-
nicate over a wireless channel and perform data processing
and computation at the sensor nodes has brought wireless
sensor networks (WSNs) into a wide range of applications
such as environmental monitoring, habitat studies, object
tracking, video surveillance, satellite imaging, as well as
military applications [1–4]. For applications such as object
tracking and video surveillance, it is desirable to compress
image data captured by the sensor nodes before transmission
because of limitations in power supply, memory storage,
and transmission bandwidth in the WSN [1, 2]. For image
compression in WSNs, it is desirable to maintain a high
compression ratio while at the same time providing a low-
memory and low-complexity implementation of the image
coder.
Among the many image compression algorithms, wave-
let-based image compression based on set-partitioning in
hierarchical trees (SPIHT) [5] is a powerful, efficient, and
yet computationally simple image compression algorithm. It
provides a better performance than the embedded zerotrees
wavelet (EZW) algorithm [6]. Although the embedded block
coding with optimized truncation (EBCOT) algorithm [7],
which was adopted in the Joint Photographic Experts Group
2000 (JPEG 2000) standard, provides a higher compression
efficiency compared to SPIHT, its multilayer coding
procedures are very complex and computationally intensive.
Also, the need for multiple coding tables for adaptive
arithmetic coding requires extra memory allocation, which
makes the hardware implementation of the coder more
complex and expensive [7–9]. Thus, from the viewpoint of
hardware implementation, SPIHT is preferred over EBCOT
coding.
In the traditional SPIHT coding, a full wavelet-trans-
formed image has to be stored because all the zerotrees
are scanned in each pass for different magnitude intervals
during the set-partitioning operation [5, 10]. The memory
needed to store these wavelet coefficients increases as the
image resolution increases. This in turn increases the cost of
hardware image coders, as a large internal or external memory
bank is needed. This issue is also faced in the implementation
of on-board satellite image coders where the available
memory space is limited due to power constraints [11].
By adopting the modifications in the implementation
of the discrete wavelet transform (DWT), where the wavelet
transformation can be carried out without the need for full
image transformation [12], SPIHT image coding can also
be performed on a portion of the wavelet subbands. The
strip-based image coding technique proposed in [10], which
sequentially performs SPIHT coding on a portion of an
image, has contributed to significant improvements in low-
memory implementations for image compression. In strip-
based coding, a few image lines that are acquired in raster
scan format are first wavelet transformed. The computed
wavelet coefficients are then stored in a strip buffer for
SPIHT coding. At the end of the image coding, the strip
buffer is released and is ready for the next set of data lines.
In this paper, a hardware architecture for strip-based
image compression using the SPIHT algorithm is presented.
The lifting-based 5/3 DWT, which supports a lossless
transformation, is used in our proposed work. The wavelet
coefficients output from the DWT module are stored in
a strip buffer at predefined locations using a new one-
dimension (1D) addressing method for SPIHT coding. In
addition, a proposed modification of the traditional SPIHT
algorithm is also presented. In order to improve the coding
performance, a degree-0 zerotree coding methodology is
applied during the implementation of SPIHT coding. To
facilitate the hardware implementation, the proposed SPIHT
coding eliminates the use of lists in its set-partitioning
approach and is implemented in two passes. The proposed
modification reduces both the memory requirement and the
complexity of the hardware coder. Besides this, the fast
zerotree identifying technique proposed in [13] is also
incorporated in our proposed work. Since the recursion of
descendant information checking is no longer needed here,
the processing time of SPIHT coding is significantly reduced.
The remaining sections of this paper are organized as fol-
lows. Section 2 presents an overview of the strip-based cod-
ing, followed by a discussion on the lifting-based 5/3 DWT
and its hardware implementation in Section 3. Section 4
presents the proposed 1D addressing method for DWT
coefficients in a strip buffer and the new spatial orientation tree
structures which are incorporated into our proposed strip-
based coding. Next, proposed modifications to the tra-
ditional SPIHT coding to improve its compression efficiency
and the hardware implementation of the proposed algorithm
in strip-based coding are presented in Section 5. The pro-
posed work is implemented using our designed microproces-
sor without interlocked pipeline stages (MIPS) processor on
a Xilinx Spartan III field programmable gate array (FPGA)
device, and the results of software simulations are discussed
in Section 6. Finally, Section 7 concludes this paper.
2. Strip-Based Image Coding
Traditional wavelet-based image compression techniques
first apply a DWT on a full-resolution image. The computed
N-scale decomposition wavelet coefficients that provide a
compact multiresolution representation of the image are
then obtained and stored in a memory bank. Entropy coding
is subsequently carried out to achieve compression. This
type of coding technique, which requires the whole image to
be stored in a memory, is not suitable for processing large
images, especially in a hardware-constrained environment
where a limited amount of memory is available.

Figure 1: SPIHT coding is carried out on the part of the wavelet
coefficients that is stored in the strip buffer.
A low-memory wavelet transform that computes the
wavelet subband coefficients on a line-based basis has
been proposed in [12]. This method reduces the amount
of memory required for the wavelet transform process.
While the wavelet transformation of the image data can
be processed in a line-based manner, the computed full-
resolution wavelet coefficients still have to be stored for
SPIHT coding because all the zerotrees are scanned in each
pass for different magnitude intervals [5, 10].
The strip-based coding technique proposed in [10], which
adopts the line-based wavelet transform [12], has resulted
in great improvements in low-memory implementations for
SPIHT compression. In strip-based coding, SPIHT coding is
sequentially performed on a few lines of wavelet coefficients
that are stored in a strip buffer, as shown in Figure 1. Once
the coding is done for a strip buffer, it is released and is
then ready for the next set of data lines. Since only a portion
of the full wavelet decomposition subband is encoded at a
time, there is no need to wait for the full transformation
of the image. Coding can be performed once a strip is fully
buffered. This enables the coding to be carried out rapidly
and also significantly reduces the memory storage needed for
the SPIHT coding.
Figure 2 shows the block diagram of the strip-based
image compression that is presented in this paper. A few
lines of image data are first loaded into the DWT module
(DWT_MODULE) for wavelet transformation. The wavelet
coefficients are computed and then stored in a strip buffer
(STRIP_BUFFER) for SPIHT encoding (SPIHT_ENCODE).
At the end of encoding, the generated bit-stream is trans-
mitted as the output. In the next few sections, the detailed
function and the hardware architecture of each of these
blocks will be presented.
3. Discrete Wavelet Transform
The DWT is the mathematical core of the wavelet-based
image compression scheme. Traditional two-dimension (2D)
Figure 2: Block diagram of the proposed strip-based image compression: 16 × 512 image strips of the 512 × 512 original image pass through the DWT_MODULE into the STRIP_BUFFER, and SPIHT_ENCODE produces the output bit-stream.

Figure 3: 2D wavelet decomposition: (a) one-scale DWT decomposition into the LL, HL, LH, and HH subbands; (b) three-scale DWT decomposition, where a two-scale decomposition is carried out on the LL subband.
DWT first performs row filtering on an image, followed by
column filtering. This gives rise to four wavelet subband
decompositions, as shown in Figure 3(a). The Low-Low (LL)
subband contains the low-frequency content of an image in
both the horizontal and the vertical dimension. The High-
Low (HL) subband contains the high-frequency content of
an image in the horizontal and the low-frequency content of
an image in the vertical dimension. The Low-High (LH)
subband contains the low-frequency content of an image
in the horizontal and the high-frequency content of an image
in the vertical dimension, and finally, the High-High (HH)
subband contains the high-frequency content of an image in
both the horizontal and the vertical dimension.
Each of the wavelet coefficients in the LL, HL, LH,
and HH subbands represents a spatial area corresponding
to approximately a 2 × 2 area of the original image [6].
For an N-scale DWT decomposition, the coarsest subband
LL is further decomposed. Figure 3(b) shows the subbands
obtained for a three-scale wavelet decomposition. As a result,
each coefficient in the coarser scale represents a larger spatial
area of the image but a narrower band of frequency [6].
The two approaches that are used to perform the DWT
are the convolution-based filter bank method and the lifting-
based filtering method. Between the two methods, the lifting-
based DWT is preferred over the convolution-based DWT
for hardware implementation due to its simple and fast
lifting process. Besides this, it also requires a less complicated
inverse wavelet transform [14–18].
3.1. Lifting-Based 5/3 DWT. The reversible Le Gall 5/3 filter
is selected in our proposed work since it provides a lossless
transformation. In the implementation of the lifting-based 5/3
DWT [18–20], three computation operations are needed:
addition, subtraction, and shift. As shown in Figure 4,
the lifting process is built on the split, prediction, and
updating steps. The input sequence X[n] is first split into odd
and even components for the horizontal filtering process. In
the prediction phase, a high-pass filtering is applied to the
input signal, which results in the generation of the detailed
coefficient H[n]. In the updating phase, a low-pass filtering
is applied to the input signal, which leads to the generation of
the approximation coefficient L[n]. Likewise, for the vertical
filtering, the split, prediction, and updating processes are
repeated for both the H[n] and L[n] coefficients. Equations
(1) and (2) give the lifting implementation of the 5/3 DWT
filter used in JPEG 2000 [19]:

H[2n + 1] = X[2n + 1] − ⌊(X[2n] + X[2n + 2]) / 2⌋,  (1)

L[2n] = X[2n] + ⌊(H[2n − 1] + H[2n + 1] + 2) / 4⌋.  (2)
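As a software sketch, one level of this 1-D lifting step can be written as follows; the integer shifts implement the floor divisions of (1) and (2), and out-of-range indices are reflected as in the symmetric extension discussed in Section 3.3 (the function name and signature are ours, for illustration only):

```cpp
#include <vector>

// One level of the lifting-based 5/3 DWT on an even-length 1-D signal,
// implementing (1) and (2). Out-of-range indices are reflected
// (X[-1] -> X[1], X[len] -> X[len - 2]).
void lift53(const std::vector<int>& X,
            std::vector<int>& H, std::vector<int>& L) {
    const int len = static_cast<int>(X.size());
    auto x = [&](int i) {                      // reflected signal access
        if (i < 0) i = -i;
        if (i >= len) i = 2 * len - 2 - i;
        return X[i];
    };
    H.assign(len / 2, 0);
    L.assign(len / 2, 0);
    // Predict, eq. (1): high-pass coefficients at odd positions.
    for (int n = 0; n < len / 2; ++n)
        H[n] = x(2 * n + 1) - ((x(2 * n) + x(2 * n + 2)) >> 1);
    // Update, eq. (2): low-pass coefficients at even positions.
    for (int n = 0; n < len / 2; ++n) {
        int hm1 = (n == 0) ? H[0] : H[n - 1];  // position -1 mirrors to 1
        L[n] = x(2 * n) + ((hm1 + H[n] + 2) >> 2);
    }
}
```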
Figure 4: Implementation of the lifting-based 5/3 DWT filter: the input X[n] is split into odd and even samples for row and then column filtering; the predict step (subtractor and 1-bit shifter) produces the high-pass output H[2n + 1], and the update step (adders and 2-bit shifter) produces the low-pass output L[2n].
Figure 5: Architecture for the DWT_MODULE: image pixels are row filtered (HPF on odd, LPF on even samples) through the TEMP_BUFFER and then column filtered; a new address calculating unit maps each original pixel location to its location in the STRIP_BUFFER, which holds the HH, HL, LH, and LLn coefficients.
3.2. Architecture for DWT_MODULE. In our proposed work,
a four-scale DWT decomposition is applied on an image of size
512 × 512 pixels. For a four-scale DWT decomposition,
the number of lines that is generated at the first scale
when one line is generated at the fourth scale is equal to
eight. This means that the lowest memory requirement that
we can achieve at each subband with a four-scale wavelet
decomposition is eight lines. Since each wavelet coefficient
represents a spatial area corresponding to approximately a
2 × 2 area of the original image, the number of image rows that
needs to be fed into the DWT_MODULE is equal to 16 lines.
Equation (3) shows the relationship between the number of
image rows that are needed for strip-based coding, R_image,
and the level of DWT decomposition to be performed, N:

R_image = 2^N.  (3)
Figure 5 shows our proposed architecture for the
DWT_MODULE. In the initial stage (where N = 1),
image data are read into the DWT_MODULE in a row-by-
row order from an external memory where the image data
are kept. Row filtering is then performed on the image row,
and the filtered coefficients are stored in a temporary buffer
(TEMP_BUFFER). As soon as four lines of row-filtered
coefficients are available, column filtering is carried
out. The size of the TEMP_BUFFER is four lines multiplied by
the width of the image. The filtered DWT coefficients HH,
HL, LH, and LL are then stored in the STRIP_BUFFER.
For an N-scale DWT decomposition where N > 1, the LL
coefficients that were generated at stage (N − 1) are loaded
from the STRIP_BUFFER back into the TEMP_BUFFER. An
N-scale DWT decomposition is then performed on these
LL coefficients. Similarly, the DWT coefficients generated in
Figure 6: Symmetric extension in strip-based coding: (a) full image, where H[15] = X[15] − ⌊(X[14] + X[16])/2⌋; (b) strip image, where a symmetric extension using reflection gives H[15] = X[15] − ⌊(X[14] + X[14])/2⌋.
the Nth level are then stored back into the STRIP_BUFFER.
The location of each wavelet coefficient to be stored in the
STRIP_BUFFER is provided by the new address calculation
unit and will be discussed in Section 4.
3.3. Symmetric Extension in Strip-Based Coding. From (1)
and (2), it can be seen that to calculate the wavelet
coefficient at position (2n + 1), the coefficients at positions (2n)
and (2n + 2) are also needed. For example, to perform column
filtering at image row 15, image rows 14 and 16 are needed,
as shown in Figure 6(a). However, in our proposed strip-based
coding, only a strip image of 16 rows is available at a
time. Thus, during the implementation of the strip-based
5/3 transformation, a symmetric extension using the reflection
method is applied at the edges of the strip image data, as
shown in Figure 6(b). Compared to the traditional 2D DWT,
which performs the wavelet transformation on a full image,
the wavelet coefficient output from the DWT_MODULE
is expected to be slightly different due to the symmetric
extension carried out. Analysis from our study shows that the
percentage error in the wavelet coefficient values is not significant,
since only an average difference of 0.81% is observed.
It should be noted that strip-based filtering can also
support the traditional full DWT if the number of image lines
for strip-based filtering is increased from 16 lines to 24 lines.
This is because each wavelet coefficient at scale N would
require one extra line of wavelet coefficients at scale N − 1
for the 5/3 DWT. Thus, for a four-scale DWT, a total of eight
additional image lines are required. This approach is applied
in the strip-based coding proposed in [10], which uses the
line-based DWT implementation proposed in [12]. However,
in order to achieve a low-memory implementation of the image
coder, our proposed work described in this paper applies the
reflection method for its symmetric extension in the DWT
implementation.
4. Architecture for STRIP_BUFFER

The wavelet coefficients generated from the DWT_MODULE
are stored in the STRIP_BUFFER for SPIHT coding. The
number of memory lines needed in the STRIP_BUFFER is equal
to two times the lowest memory requirement that we can
achieve at each subband. Therefore, the size of the strip
buffer is equal to the number of image rows needed for strip-
based coding multiplied by the number of pixels in each row.
Equation (4) gives the size of a strip buffer:

Size of strip buffer = R_image × Width of image.  (4)
4.1. Memory Allocation of DWT Coefficients in
STRIP_BUFFER. To facilitate the SPIHT coding, the
DWT coefficients obtained from the DWT_MODULE are
stored in the strip-buffer in a predetermined location. Figure 7
shows a memory allocation example of the DWT coefficients
in the STRIP_BUFFER. The parent-children relationship
of the SPIHT spatial orientation tree (SOT) structure, using an
example of an 8 × 8 three-scale DWT decomposed image,
is shown in Figure 7(a). For hardware implementation,
the 2D data need to be stored in a 1D format as
shown in Figure 7(b). Researchers in [9] have introduced
an addressing method to rearrange the DWT coefficients in
a 1D array format for practical implementation. However,
their proposed method works only on the pyramidal
structure of DWT decomposition as shown in Figure 7(a).
In our proposed work, the initial collection of DWT
coefficients is a mixture of LL, HL, LH, and HH components
as shown in Figures 7(c)-7(e). In addition, to simplify the
proposed modified SPIHT coding which will be explained
in the next section, it is preferred that the DWT coefficients
in the strip-buffer are stored in a predetermined location as
shown in Figure 7(b). For these two reasons, a new address
calculating unit is needed in the DWT_MODULE.
Table 1 records the predefined rules to calculate the new
addresses of DWT coefficients in the STRIP_BUFFER. The
DWT coefficients in the STRIP_BUFFER are arranged in such
a manner that each parent node will have its four direct
offsprings in a consecutive order. Besides this, it can be
seen from Table 1 that the proposed new address calculation
circuit only requires address rewiring and therefore does not
cause an increase in hardware complexity in the implementation
of our proposed work.
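The exact wiring in Table 1 is specific to the 16 × 512 strip, but the principle can be sketched generically: a new address is produced purely by permuting (rewiring) the bits of the old address, so no adders are added to the datapath. The wiring list below is a hypothetical 4-bit example, not the one from Table 1:

```python
def rewire(addr, wiring):
    """Form a new address by pure bit rewiring: output bit d is
    driven by input bit wiring[d]. No arithmetic hardware is needed."""
    return sum(((addr >> s) & 1) << d for d, s in enumerate(wiring))

# Hypothetical 4-bit wiring that swaps the upper and lower bit pairs:
assert rewire(0b0110, [2, 3, 0, 1]) == 0b1001
```

In the sibling-consecutive layout this produces, once the address f of a node's first direct offspring is known, the other three offsprings sit at f+1, f+2, and f+3.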
6 EURASIP Journal on Embedded Systems
[Figure 7: Memory allocation of DWT coefficients in STRIP_BUFFER. (a) 2-D DWT arrangement (scale 3); (b) 1-D DWT arrangement in STRIP_BUFFER (final); (c1)/(c2) 2-D and 1-D DWT arrangements at scale 1; (d1)/(d2) 2-D and 1-D DWT arrangements at scale 2; (e1)/(e2) 2-D and 1-D DWT arrangements at scale 3.]
4.2. New Spatial Orientation Tree Structure. In our proposed
strip-based coding, a four-scale DWT decomposition is
performed on a strip image of 16 rows. Thus, at
the highest LL, HL, LH, and HH subbands, a single line of 32
DWT coefficients is available in each subband.
Since each node in the original SPIHT SOT has the 2 × 2
adjacent pixels of the same spatial orientation as its descendants,
the traditional SPIHT SOT is not suitable for application
in our proposed work. The strip-based SPIHT algorithm
proposed by [10] is implemented with zerotree roots starting
from the HL, LH, and HH subbands. Although this method
can be used in our proposed work, a lower performance of
the strip-based SPIHT coding is expected. This is because
when the number of SOTs is increased, many encoding bits
will be wasted, especially at low bit-rates where most of the
coefficients have significant numbers of zeros [10, 21].
In [21], new SOT structures which take the next four pixels
of the same row as children for certain subbands were
introduced. The proposed new tree structures are named
SOT-B, SOT-C, SOT-D, and so on, depending on the number
of scales for which the parent-children relationship has changed. In that
work, the virtual SPIHT (VSPIHT) algorithm [22]
is applied in conjunction with the new SOTs. In VSPIHT coding,
the real LL subband is replaced with zero-value coefficients and
the LL subband is further virtually decomposed by V levels.
The LL coefficients are then scalar quantized.
In our work presented in this paper, the SOT-C structure proposed
in [21] is applied together with a two-level virtual
decomposition of the LL subband. However, instead of
replacing the LL coefficients with zero values, our proposed
work treats these coefficients as the virtual HL, LH, and HH
coefficients as shown in Figure 8. The total number of root
nodes during the initialization stage is equal to eight, that
is, two roots without descendants and two roots for each
of the HL, LH, and HH subbands. With the modified SOT, a
longer tree structure is obtained. This means that the number
of zerotrees that needs to be coded at every pass is fewer.
As a result, the number of bits that are generated during
[Table 1: Predefined rules to calculate the new addresses of DWT coefficients in STRIP_BUFFER of size 16 × 512 pixels. For each decomposition scale N = 1, 2, 3, 4 the table lists, from MSB to LSB: the initial address bits of the image pixel, the initial address bits of the LL pixel, the new address bits of the DWT coefficient in STRIP_BUFFER (obtained by rewiring the initial bits), and the equivalent mathematical equation for the new address.]
[Figure 8: Proposed new spatial orientation tree structures. The LH, HL, and HH subbands of the four-scale DWT decomposition are shown over buffer addresses 0000 to 8191, together with a two-scale virtual decomposition of the LL subband into virtual LL, LH, HL, and HH subbands; roots without descendants, and a parent node in one subband linked to its four direct offsprings in the next, are marked.]
the early stage of the sorting pass is significantly reduced
[21, 22].
5. Set-Partitioning in Hierarchical Trees
In SPIHT coding, three sets of coordinates are encoded [5]:
the Type H set, which holds the coordinates of all SOT
roots; the Type A set, which holds the coordinates of all
descendants of node (i, j); and the Type B set, which holds the set
of coordinates of all grand descendants of node (i, j). The
order of the subsets which are tested for significance is stored
in three ordered lists: (i) the list of significant pixels (LSP), (ii)
the list of insignificant pixels (LIP), and (iii) the list of insignificant
sets (LIS). The LSP and LIP contain the coordinates of individual
pixels whereas the LIS contains either Type A or Type B sets.
SPIHT encoding starts with an initial threshold T_0, which
is normally equal to two raised to the power of K, where K is the number
of bits needed to represent the largest coefficient found in
the wavelet-transformed image. The LSP is set as an empty
list and all the nodes in the highest subband are put into the
LIP. The root nodes with descendants are put into the LIS.
A coefficient/set is encoded as significant if its value is larger
than or equal to the threshold T, or as insignificant if its value
is smaller than T. Two encoding passes, the sorting
pass and the refinement pass, are performed in the SPIHT
coder.
During the sorting pass, a significance test is performed
on the coefficients based on the order in which they are
stored in the LIP. Elements in the LIP that are found to be
significant with respect to the threshold are moved to the
[Figure 9: Two combinations in the modified SPIHT algorithm. (a) Combination 1: DESC(i, j) = 1 and GDESC(i, j) = 1; a test on SIG(k, l) and DESC(k, l) is performed for each offspring (k, l) ∈ O(i, j). (b) Combination 2: DESC(i, j) = 1 and GDESC(i, j) = 0; a test on SIG(k, l) only.]
LSP list. A significance test is then performed on the sets
in the LIS. Here, if a set in the LIS is found to be significant,
the set is removed from the list and is partitioned into four
single elements and a new subset. This new subset is added
back to the LIS and the four elements are then tested and moved
to the LSP or LIP depending on whether they are significant or
insignificant with respect to the threshold.
Refinement is then carried out on every coefficient that is
added to the LSP except for those that are just added during
the sorting pass. Each of the coefficients in the list is refined
to an additional bit of precision. Finally, the threshold is
halved and SPIHT coding is repeated until all the wavelet
coefficients are coded or until the target rate is met. This
coding methodology, which is carried out under a sequence
of thresholds T_0, T_1, T_2, T_3, ..., T_{K-1}, where T_i = T_{i-1}/2, is
referred to as bit-plane encoding.
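As a concrete illustration of the threshold sequence, the sketch below follows the common convention that the starting threshold is the value of the most significant bit of the largest coefficient magnitude (2^(K-1) for a K-bit magnitude); the starting value is an assumption of this sketch, not a quotation of the paper:

```python
def threshold_sequence(max_coeff):
    # K = number of bits of the largest magnitude; start at its MSB value
    k = max_coeff.bit_length()
    t = 1 << (k - 1)
    seq = []
    while t >= 1:
        seq.append(t)      # T_i = T_{i-1} / 2
        t >>= 1
    return seq

print(threshold_sequence(57))  # [32, 16, 8, 4, 2, 1]  (57 needs K = 6 bits)
```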
From the study of SPIHT coding, it can be seen that
besides the individual tree nodes, SPIHT also performs
significance tests on both the degree-1 zerotree and the degree-2
zerotree. Despite improving the coding performance by
providing more levels of descendant information for each
coefficient tested, as compared to EZW which only
performs significance tests on the individual tree nodes and
the degree-0 zerotree, the development of SPIHT coding
neglects the coding of the degree-0 zerotree.
Analysis from our study involving degree-0 to degree-2
zerotree coding found that the coding of the degree-0 zerotree,
which was removed during the development of SPIHT
coding, is important and can lead to a significant improvement
in zerotree coding efficiency. Thus, in the next subsection,
a proposed modification of the SPIHT algorithm which
reintroduces the degree-0 zerotree coding methodology will
be presented. It should be noted that in our proposed
modified SPIHT coding, significance tests performed on
individual tree nodes, Type A sets, and Type B sets are referred
to as SIG, DESC, and GDESC, respectively.
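The three significance tests can be stated recursively. A minimal Python sketch over a hypothetical tree, where children maps a node index to its offsprings and coeff holds coefficient values (both are toy data, not the paper's buffers):

```python
def SIG(i, T, coeff):
    return int(abs(coeff[i]) >= T)

def DESC(i, T, coeff, children):
    # Type A set: all descendants of node i
    return int(any(SIG(k, T, coeff) or DESC(k, T, coeff, children)
                   for k in children.get(i, [])))

def GDESC(i, T, coeff, children):
    # Type B set: all grand descendants of node i
    return int(any(DESC(k, T, coeff, children) for k in children.get(i, [])))

children = {0: [1, 2], 1: [3, 4]}        # toy two-level tree
coeff = {0: 3, 1: 0, 2: 2, 3: 9, 4: 0}
print(DESC(0, 8, coeff, children), GDESC(0, 8, coeff, children))  # 1 1
```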
5.1. Proposed SPIHT-ZTR Coding. In the traditional SPIHT
coding of the sets in the LIS, a significance test is first
performed on the Type A set. If the Type A set is found to
be significant, that is, DESC(i, j) = 1, its 2 × 2 offsprings
(k, l) ∈ O(i, j) are tested for significance and are moved
to the LSP or LIP, depending on whether they are significant,
that is, SIG(k, l) = 1, or insignificant, that is, SIG(k, l) =
0, with respect to the threshold. Node (i, j) is then added
back to the LIS as a Type B set. Subsequently, if the Type B set
is found to be significant, that is, GDESC(i, j) = 1, the
set is removed from the list and is partitioned into four
new Type A subsets, and these subsets are added back to
the LIS. Here, we are proposing a modification in the order in
which the DESC and GDESC bits are sent. In the modified
SPIHT algorithm, the GDESC(i, j) bit is sent immediately
when DESC(i, j) is found to be significant. As shown in
Figure 9, when DESC(i, j) = 1, four SIG(k, l) bits need to be
sent. However, whether the DESC(k, l) bits need to be sent
depends on the result of GDESC(i, j). Thus, there are two
possible combinations here: Combination 1: DESC(i, j) = 1
and GDESC(i, j) = 1; Combination 2: DESC(i, j) = 1 and
GDESC(i, j) = 0.
Combination 1: DESC(i, j) = 1 and GDESC(i, j) = 1. When
the significance test result of GDESC(i, j) equals 1, it indicates
that there must be at least one grand descendant node under
(i, j) that is significant with respect to the current threshold
T. Thus, in order to locate the significant node or nodes,
four DESC(k, l) bits need to be sent in addition to the four
SIG(k, l) bits, where (k, l) ∈ O(i, j). Table 2 shows the results
of an analysis carried out on six standard test images on
the percentage of occurrence of the possible outcomes of the
SIG(k, l) and DESC(k, l) bits.
As shown in Table 2, the percentage of occurrence of
the outcome SIG = 0 and DESC = 0 is much higher than
that of the remaining three outcomes. Thus, in our proposed
modified SPIHT coding, the Huffman coding concept is applied
to code these four possible outcomes of the SIG and DESC
bits. By allocating fewer bits to the most likely outcome
of SIG = 0 and DESC = 0, an improvement in the coding
gain of SPIHT is expected. It should be noted that this
outcome where SIG = 0 and DESC = 0 is also equivalent
to the significance test of the zerotree root (ZTR) in the EZW
algorithm. Therefore, by encoding the root node and
descendants of an SOT using a single symbol, the degree-0
zerotree coding methodology has been reintroduced into our
proposed modified SPIHT coding, which for convenience is
termed the SPIHT-ZTR coding scheme.
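The Combination 1 code of Table 2 can be checked numerically. Using the Lenna percentages as symbol probabilities (a back-of-the-envelope check, not a result from the paper), the variable-length code averages fewer than the 2 bits per offspring that a fixed SIG/DESC pair would cost:

```python
code = {(0, 0): "0", (0, 1): "10", (1, 0): "110", (1, 1): "111"}  # (SIG, DESC) -> bits

# Occurrence percentages for Lenna from Table 2, in the same outcome order.
prob = {(0, 0): 42.60, (0, 1): 32.67, (1, 0): 11.49, (1, 1): 13.24}

avg_bits = sum(prob[o] / 100 * len(code[o]) for o in code)
print(round(avg_bits, 4))  # 1.8213 bits versus 2 bits for fixed-length coding
```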
EURASIP Journal on Embedded Systems 9
Table 2: The percentage (%) of occurrence of possible outcomes of the SIG(k, l) and DESC(k, l) bits for various standard gray-scale test
images of size 512 × 512 pixels under Combination 1: DESC(i, j) = 1 and GDESC(i, j) = 1. Node (i, j) is the root node and (k, l) is the
offspring of (i, j).

Test Image | SIG(k,l)=0, DESC(k,l)=0 | SIG(k,l)=0, DESC(k,l)=1 | SIG(k,l)=1, DESC(k,l)=0 | SIG(k,l)=1, DESC(k,l)=1
Lenna 42.60 32.67 11.49 13.24
Barbara 42.14 35.47 10.70 11.69
Goldhill 44.76 28.13 14.07 13.04
Peppers 44.39 34.49 9.41 11.71
Airplane 44.01 25.22 16.51 14.26
Baboon 42.71 28.30 14.97 14.02
Equivalent symbol in EZW ZTR IZ POS/NEG POS/NEG
Bits assignment in the proposed work 0 10 110 111
Table 3: The percentage (%) of occurrence of possible outcomes of ABCD for various standard grayscale test images of size 512 × 512
pixels under Combination 2: DESC(i, j) = 1 and GDESC(i, j) = 0. ABCD refers to the significance of the four offsprings of node (i, j).

Outcome of ABCD | Lenna | Barbara | Goldhill | Peppers | Airplane | Baboon | Bits assignment in the proposed work
0001 | 15.40 | 14.66 | 15.25 | 15.15 | 15.27 | 14.70 | 1 + 00
0010 | 14.87 | 14.21 | 14.41 | 14.76 | 15.84 | 14.67 | 1 + 01
0100 | 14.79 | 13.66 | 15.72 | 15.23 | 15.96 | 14.78 | 1 + 10
1000 | 15.21 | 13.96 | 14.83 | 15.02 | 15.70 | 15.26 | 1 + 11
0011 | 4.81 | 5.93 | 5.21 | 5.20 | 5.34 | 5.48 | 0 + 0011
0101 | 5.48 | 5.51 | 5.38 | 4.98 | 4.92 | 4.95 | 0 + 0101
0110 | 4.60 | 4.41 | 4.25 | 4.24 | 3.96 | 4.54 | 0 + 0110
1001 | 4.34 | 4.38 | 4.15 | 4.39 | 3.96 | 4.56 | 0 + 1001
1010 | 5.33 | 5.58 | 5.12 | 5.06 | 5.21 | 4.86 | 0 + 1010
1100 | 4.84 | 5.24 | 5.32 | 5.37 | 5.26 | 5.25 | 0 + 1100
0111 | 2.27 | 2.69 | 2.34 | 2.31 | 1.86 | 2.36 | 0 + 0111
1011 | 2.26 | 2.51 | 2.12 | 2.37 | 1.85 | 2.31 | 0 + 1011
1101 | 2.16 | 2.56 | 2.21 | 2.20 | 1.95 | 2.47 | 0 + 1101
1110 | 2.28 | 2.43 | 2.37 | 2.32 | 1.84 | 2.40 | 0 + 1110
1111 | 1.36 | 2.27 | 1.32 | 1.40 | 1.08 | 1.41 | 0 + 1111
Combination 2: DESC(i, j) = 1 and GDESC(i, j) = 0. When
DESC(i, j) = 1 and GDESC(i, j) = 0, it indicates that the
SOT is a degree-2 zerotree where all the grand descendant
nodes under (i, j) are insignificant. It also indicates that at
least one of the four offsprings of node (i, j) is significant. In
this situation, four SIG(k, l) bits, where (k, l) ∈ O(i, j), need
to be sent. Let the significance of the four offsprings of node
(i, j) be referred to as ABCD. Here, a total of 15 possible
combinations of ABCD can be obtained, as shown in Table 3.
The percentage of occurrence of the possible outcomes of ABCD
is determined for various standard test images and the results
are recorded in Table 3.
From Table 3, it can be seen that the first four ABCD
outcomes of 0001, 0010, 0100, and 1000 occur more
frequently than the other remaining 11 possible
outcomes. As in Combination 1, the Huffman coding concept
is applied to encode all the outcomes of ABCD. The output
bits assignment for each of the 15 possible outcomes of
ABCD is shown in Table 3. Since fewer bits are needed to
encode the most likely outcomes of ABCD, that is, 0001,
0010, 0100, and 1000, an improved performance of the
SPIHT coding is anticipated.
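One way to realize Table 3's assignment as a prefix-free code is to read the leading bit as a flag marking the four single-significant-offspring outcomes; this reading of the "1 +"/"0 +" notation is an assumption of the sketch:

```python
SINGLE = {"0001": "00", "0010": "01", "0100": "10", "1000": "11"}

def encode_abcd(abcd):
    """abcd: 4-character significance string with at least one '1'."""
    if abcd in SINGLE:
        return "1" + SINGLE[abcd]   # 3 bits for the four most likely outcomes
    return "0" + abcd               # 5 bits for the remaining 11 outcomes

print(encode_abcd("0010"), encode_abcd("1101"))  # 101 01101
```

Under this reading the code is prefix-free, spending 3 bits on the roughly 60% of cases with a single significant offspring and 5 bits on the rest.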
It should be noted that in both Combinations 1 and 2, all
the wavelet coefficients that are found to be insignificant are
added to the LIP and those that are found to be significant
are added to the LSP. The sign bits of the significant
coefficients are also output to the decoder.
5.2. Listless SPIHT-ZTR for Strip-Based Implementation.
Although the proposed SPIHT-ZTR coding is expected to
provide an efficient compression performance, its implementation
in a hardware-constrained environment is difficult.
One of the major difficulties encountered is the use
of three lists to store the coordinates of the individual
coefficients and subset trees during the set-partitioning
operation. The use of these lists will increase the complexity
and implementation cost of the coder since memory management
is required and a large amount of storage is needed
[Figure 10: Listless SPIHT-ZTR for strip-based implementation. (a) Sorting pass and refinement pass: for each coefficient (i, j), the coder checks whether (i, j) is a root node and tests the stored flags SIG_PREV, DESC_PREV, and GDESC_PREV of (i, j) and of its parent; it outputs the SIG(i, j), DESC(i, j), and GDESC(i, j) bits (or a refinement bit) as required, updates the corresponding flags, and halves the threshold after each pass. (b) Combination 1, DESC(i, j) = 1 and GDESC(i, j) = 1: output 0 if SIG(i, j) = 0 and DESC(i, j) = 0; output 10 if SIG(i, j) = 0 and DESC(i, j) = 1; output 110 if SIG(i, j) = 1 and DESC(i, j) = 0; output 111 if SIG(i, j) = 1 and DESC(i, j) = 1; in each case set SIG_PREV(i, j) = SIG(i, j) and DESC_PREV(i, j) = DESC(i, j). (c) Combination 2, DESC(i, j) = 1 and GDESC(i, j) = 0: if (i, j) is the first direct offspring and the SIG_PREV of the next three offsprings is 0, output the bit assignment of Table 3, set SIG_PREV(x, y) = SIG(x, y) for the four direct offsprings (x, y), and skip coding for the next three coefficients; otherwise output SIG(i, j) and set SIG_PREV(i, j) = SIG(i, j).]
to maintain these lists [23, 24]. In this subsection, a listless
SPIHT-ZTR coding for strip-based implementation is proposed.
The proposed algorithm not only has all the advantages of
a listless coder but is also developed for the low-memory
strip-based implementation of SPIHT coding. The flowchart
of the proposed algorithm is shown in Figure 10.
In our proposed listless SPIHT-ZTR algorithm, three
significance maps known as SIG_PREV, DESC_PREV, and
GDESC_PREV are used to store the significance of the
coefficient, the significance of the descendants, and the significance
of the grand descendants, respectively. The SIG_PREV
information is stored in a one-bit array whose size equals
the size of the strip-buffer. In comparison, the array
size of DESC_PREV is only a quarter of that of SIG_PREV
since the leaf nodes have no descendants, and the array size
of GDESC_PREV is only one-sixteenth of that of SIG_PREV
since the nodes in the lowest two scales have no grand
descendants.
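For the 16 × 512 strip-buffer used in this paper, the three maps therefore cost 8192, 2048, and 512 bits; a quick sketch:

```python
def significance_map_bits(rows, width):
    sig = rows * width    # SIG_PREV: one bit per coefficient
    desc = sig // 4       # DESC_PREV: leaf nodes have no descendants
    gdesc = sig // 16     # GDESC_PREV: lowest two scales have no grand descendants
    return sig, desc, gdesc

print(significance_map_bits(16, 512))  # (8192, 2048, 512)
```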
In listless SPIHT-ZTR coding, the memory needed to
store the significance information during the entropy coding
is very small when compared to SPIHT and the listless zerotree
coder (LZC) [24]. In SPIHT, three lists are used, and in LZC,
the significance flags F_C and F_D are equal to the image size
and a quarter of the image size, respectively. In our proposed
coding scheme, the significance map storage is cleared and
released for the coding of the next image strip after the coding
of each image strip is done.
It should be noted that the peak signal-to-noise ratio
(PSNR) performance of our proposed listless SPIHT-ZTR
coding is similar to that obtained using the original SPIHT
algorithm at the end of every bit-plane. The number of
significant pixels of both algorithms after every bit-plane is
[Figure 11: Architecture for SPIHT_ENCODE. Significance data collection (upward scanning) fills DESC_BUFFER and GDESC_BUFFER and the SIG_PREV, DESC_PREV, and GDESC_PREV maps; SPIHT-ZTR coding (downward scanning) then runs, with the threshold halved after each pass.]
exactly the same, except that the sequence in which the bits
are produced is different.
Similar to other listless coders, the sorting and
refinement passes of the traditional SPIHT algorithm are
merged into one single pass in the proposed listless SPIHT-ZTR
algorithm. This makes the control flow of our proposed
coding simple and easy to implement in hardware [23, 24].
5.3. Architecture for SPIHT_ENCODE. Figure 11 shows our
proposed SPIHT_ENCODE architecture. Since the wavelet
[Figure 12: Significance information for each coefficient at each bit-plane is determined and stored in buffers when the SOT is scanned from the bottom to the top (upward scanning): the SIG bits come from STRIP_BUFFER, while the DESC bits (DESC_BUFFER) and GDESC bits (GDESC_BUFFER) are formed, in parallel for all bit-planes from the MSB down, by OR-ing the corresponding bits of the four offsprings.]
coefficients in the STRIP_BUFFER are arranged in a pyramidal
structure where the parent nodes are always on top
of their descendant nodes, the proposed listless SPIHT-ZTR
coding is implemented using a one-pass upward scanning
and a one/multipass downward scanning methodology, as
explained below.
One-Pass Upward Scanning: Significance Data Collection.
This scanning method starts from the leaf nodes up to the
roots, that is, from the bottom to the top of the STRIP_BUFFER.
While the SOT is being scanned, the DESC and GDESC
significance information for each coefficient at each bit-plane is
determined and stored in the temporary buffers DESC_BUFFER
and GDESC_BUFFER.
This significance data collection process is carried out
in parallel for all bit-planes, as shown in Figure 12. The SIG
information is obtained directly from the STRIP_BUFFER,
whereas the DESC and GDESC information for a coefficient
is obtained by OR-ing the SIG and DESC results of its four
offsprings, respectively. It should be noted that the proposed
significance data collection process is analogous to the fast
zerotree identifying technique proposed in [13]. With all
the significance information precomputed and stored, a fast
encoding process results, since the significance
information can be readily obtained from the buffers during
the SPIHT-ZTR coding.
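The upward pass can be modeled with per-node bit masks, one bit per bit-plane, so that all planes are collected in a single bottom-up sweep. The tree layout below is hypothetical, and, following the set definitions of Section 5, the DESC mask covers all descendants while the GDESC mask is the OR of the offsprings' DESC masks:

```python
def sig_mask(mag):
    # bit p is set iff the magnitude is significant at threshold 2**p
    return (1 << mag.bit_length()) - 1

def collect(i, mag, children, desc, gdesc):
    d = g = 0
    for k in children.get(i, []):
        collect(k, mag, children, desc, gdesc)
        g |= desc[k]                       # any grand descendant significant
        d |= sig_mask(mag[k]) | desc[k]    # any descendant significant
    desc[i], gdesc[i] = d, g

children = {0: [1, 2], 1: [3, 4]}          # toy SOT
mag = {0: 0, 1: 0, 2: 6, 3: 1, 4: 0}
desc, gdesc = {}, {}
collect(0, mag, children, desc, gdesc)
print((desc[0] >> 2) & 1, (gdesc[0] >> 2) & 1)  # 1 0  (at threshold 4)
```

During encoding, the significance of any set at any threshold is then a single bit test on the precomputed masks.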
One/Multi-Pass Downward Scanning: Listless SPIHT-ZTR
Coding. SPIHT-ZTR coding, as described in Figure 10,
is performed on the DWT coefficients stored in the
STRIP_BUFFER. Similar to the traditional SPIHT coding, a
bit-plane coding methodology can be applied here. Although
[Figure 13: Architecture of our modified MIPS processor: a four-stage pipeline with IF/ID, ID/EX, and EX/MEM pipeline registers, the forwarding register FWD, a comparator with an XOR gate for selecting between branch-equal and branch-not-equal (register CMP, signal PCSrc), ALU and ALU control, sign extension, instruction memory, register file, and data memory, with the control signals RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, MemtoReg, and RegWrite.]
a fully embedded bit-stream cannot be obtained because
only a portion of the image is encoded at a time, the
proposed strip-based image compression scheme has a
partially embedded property. Each SOT in the strip-buffer is
encoded in order of importance, that is, the coefficients
with a higher magnitude are encoded first. This allows
region-of-interest (ROI) coding since a higher number of
encoding passes can be set for a strip that contains the targeted
part of the image.
On the other hand, a non-embedded SPIHT-ZTR coding
can be performed using the one-pass downward scanning
methodology. Here, instead of scanning the SOT for different
magnitude intervals, each coefficient in the tree can be
scanned starting from its most significant bit (MSB) down to the
least significant bit (LSB). Since all the significance information
needed for all bit-planes is stored during the upward
scanning process, a full bit-plane encoding can be carried out
on one coefficient followed by the next coefficient.
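A sketch of the per-coefficient MSB-to-LSB scan used in the non-embedded mode; in practice the number of planes would follow from the initial threshold, but here it is simply passed in:

```python
def coeff_bitplanes(coeff, n_planes):
    """Emit one coefficient's magnitude bits from MSB down to LSB,
    preceded by its sign bit (0 = positive, 1 = negative)."""
    mag = abs(coeff)
    sign = 1 if coeff < 0 else 0
    return [sign] + [(mag >> p) & 1 for p in range(n_planes - 1, -1, -1)]

print(coeff_bitplanes(-5, 4))  # [1, 0, 1, 0, 1]
```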
Not only does the proposed listless SPIHT-ZTR coding
require less memory and reduce the complexity of the coder
implementation by eliminating the use of lists, but the
upward-downward scanning methodology also simplifies the
encoding process and allows a faster coding speed.
6. Microprocessor-Based Implementation and Simulation Results

The proposed strip-based SPIHT-ZTR architecture was
implemented using a softcore microprocessor-based approach
on a Xilinx Spartan III FPGA device. A customized
implementation of the MIPS processor architecture [25] was
adopted. Figure 13 shows the architecture of our proposed
MIPS processor, which is a modified version of the MIPS
architecture presented in [25], altered to simplify the
processor architecture and to facilitate the implementation
of strip-based image compression.
First, a simplified forwarding unit is incorporated into
our MIPS architecture. This unit allows the output of the
arithmetic logic unit (ALU) to be fed back to the ALU itself
for computation. The data forwarding operation is controlled
by the result derived from the AND operation, which
is stored in the register FWD. Instead of having to detect data
hazards as in the traditional MIPS architecture, a specific
register number (register 31) is used to inform the processor
to use the data directly from the previous ALU operation.
Next, the MIPS architecture is reduced from its original
five-stage pipeline implementation to a four-stage pipeline
implementation. This is achieved by shifting the data memory unit
and the branch instruction unit one stage forward.
In the traditional MIPS index addressing method, an offset
value is added to a pointer address to form a new memory
address. For example, the instruction lw $t2, 4($t0) will
load the word at memory address ($t0+4) into register $t2.
The value 4 gives an offset from the address stored in register
$t0. In our MIPS implementation, the addressing method
is simplified by removing the offset calculation because, most
of the time, the offset is equal to zero. For example, to
Table 4: MIPS machine language.

Category / Instruction / Format / Example / Meaning
Arithmetic:
  Add                  R   add $s1, $s2, $s3    $s3 = $s1 + $s2
  Subtract             R   sub $s1, $s2, $s3    $s3 = $s2 - $s1
  Add Immediate        I   addi $s1, $s2, 100   $s2 = $s1 + 100
Data Transfer:
  Load Word            I   lw $s1, $s2, X       $s2 = Memory[$s1]
  Store Word           I   sw $s1, $s2, X       Memory[$s1] = $s2
Logical:
  And                  R   and $s1, $s2, $s3    $s3 = $s1 & $s2
  Or                   R   or $s1, $s2, $s3     $s3 = $s1 | $s2
  Shift Left Logical   R   sll $s1, X, $s3      $s3 = $s1 << 1
  Shift Right Logical  R   srl $s1, X, $s3      $s3 = $s1 >> 1
Conditional Branch:
  Branch on Equal      I   beq $s1, $s2, B      If ($s1 = $s2) go to B
  Branch on Not Equal  I   bne $s1, $s2, B      If ($s1 != $s2) go to B
  Set on Less Than     R   slt $s1, $s2, $s3    If ($s2 > $s1) $s3 = 1; else $s3 = 0
DWT:
  Add Shift            R   as1 $s1, $s2, $s3    $s3 = ($s1 + $s2) / 2
  Add Shift Shift      R   as2 $s1, $s2, $s3    $s3 = ($s1 + $s2 + 2) / 4
  DWT-1                R   dwt1 $s1, X, $s3     $s3 = NewAddressCalculation($s1)
  DWT-2                R   dwt2 $s1, X, $s3     $s3 = NewAddressCalculation($s1)
  DWT-3                R   dwt3 $s1, X, $s3     $s3 = NewAddressCalculation($s1)
  DWT-4                R   dwt4 $s1, X, $s3     $s3 = NewAddressCalculation($s1)

$s1, $s2, $s3: registers; X: not used.
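The custom as1 and as2 instructions compute the rounded averaging terms that appear in the 5/3 lifting steps; a behavioral model, assuming non-negative operands so that the shift matches the division in the table:

```python
def as1(a, b):
    # Add Shift: $s3 = ($s1 + $s2) / 2
    return (a + b) >> 1

def as2(a, b):
    # Add Shift Shift: $s3 = ($s1 + $s2 + 2) / 4
    return (a + b + 2) >> 2

print(as1(3, 5), as2(3, 5))  # 4 2
```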
access the data stored in location ($t0+4), the address is first
obtained by adding 4 to the content of register $t0. Then,
an indirect addressing instruction lw $t2, $t0 is used to load
the word at the memory address contained in $t0 into $t2.
The register $t0 contains the new address ($t0+4), which is
available directly from the output of the ALU or from the
ID/EX pipeline register. Hence, the data memory unit can be
shifted one stage forward in the proposed MIPS architecture.
This allows the data forwarding hardware to be simplified.
The branch instruction unit is also shifted one stage
forward in our modified MIPS processor in order to reduce
the number of stall instructions that are required after a
branch instruction. In addition, our MIPS architecture
supports both the branch-not-equal and branch-equal
instructions. By incorporating a comparator followed by an
XOR operation, branch-not-equal or branch-equal
is selected based on the result stored in register CMP.
Table 4 shows the MIPS instruction set used in our strip-based
image processing implementation. As can be seen, a
few instructions are added for the DWT implementation
besides the standard instructions given in [25]. The as1 and
as2 instructions are used to speed up the processing of the
DWT, whereas the dwt1 to dwt4 instructions are used to
calculate the new memory addresses of the wavelet coefficients
in the strip-buffer. Table 5 shows the device utilization summary
for the proposed strip-based coding implementation.
The implementation uses 2366 slices, which is approximately
17% of the Xilinx Spartan III FPGA. The numbers of
MIPS instructions needed for the DWT_MODULE and
SPIHT_MODULE are 261 and 626, respectively.
Table 5: Device utilization summary for the strip-based SPIHT-ZTR architecture implementation.

Selected device: Xilinx Spartan III 3S1500L-4 FPGA
Number of occupied slices: 2366 out of 13312 (17%)
Number of slice flip-flops: 1272 out of 26624 (4%)
Number of 4-input LUTs: 3416 out of 26624 (12%)
Software simulations using MATLAB were carried out to
evaluate the performance of our proposed strip-based image
coding using the SPIHT-ZTR algorithm. The simulations were
conducted using the 5/3 DWT filter. All standard grey-scale
test images used are of size 512 × 512 pixels. In our proposed
work, a four-scale DWT decomposition and a five-scale SOT
decomposition were performed using the proposed SPIHT-ZTR
coding with an SOT-C structure. The performance
of the proposed coding scheme was compared with the
traditional SPIHT coding. Both the binary-uncoded (SPIHT-BU)
and arithmetic-coded (SPIHT-AC) SPIHT coding were
also implemented with a four-scale DWT and a five-scale
SOT decomposition using the traditional 2 × 2 SOT
structure.
Table 6 shows the PSNR at various bit-rates (bpp) for
test images Lenna, Barbara, Goldhill, Peppers, Airplane,
and Baboon. Figure 14 shows the performance comparison
plot for SPIHT-AC, SPIHT-BU, and SPIHT-ZTR in terms
of average PSNR versus the average number of bits sent.
From the simulation results shown in Table 6, it can be seen
Table 6: Performance of the proposed strip-based image coder using SPIHT-ZTR coding and the SOT-C structure compared to the traditional
binary-uncoded (SPIHT-BU) and arithmetic-encoded (SPIHT-AC) SPIHT coding in terms of peak signal-to-noise ratio (dB) versus bit-rate
(bpp) for various grey-scale test images of size 512 × 512 pixels.
Peak Signal-to-Noise Ratio, PSNR (dB)
Bit-rates (Bpp) SPIHT-AC SPIHT-ZTR SPIHT-BU Bit-rates (Bpp) SPIHT-AC SPIHT-ZTR SPIHT-BU
Lenna Barbara
0.25 33.35 32.98 32.91 0.25 26.50 26.16 26.14
0.50 36.56 36.17 36.07 0.50 30.01 29.65 29.60
0.80 38.74 38.46 38.34 0.80 33.35 32.95 32.86
1.00 39.75 39.49 39.31 1.00 34.99 34.46 34.29
Goldhill Peppers
0.25 30.30 29.84 29.91 0.25 34.42 34.04 33.99
0.50 32.82 32.33 32.33 0.50 36.87 36.50 36.48
0.80 34.90 34.62 34.41 0.80 38.35 38.12 37.95
1.00 36.29 35.77 35.66 1.00 39.12 38.85 38.71
Airplane Baboon
0.25 33.35 32.93 32.78 0.25 24.20 24.03 23.88
0.50 37.31 36.81 36.68 0.50 26.49 25.89 25.95
0.80 40.45 40.01 39.83 0.80 28.56 28.25 28.07
1.00 42.01 41.40 41.26 1.00 30.02 29.48 29.38
Table 7: Memory requirements for the strip-based implementation of the traditional SPIHT coding using the original 2 × 2 SOT structure
and our proposed SPIHT-ZTR using SOT-C.
Coding scheme: DWT scale | SOT scale | minimum memory lines needed at each subband (DWT/SOT) | type of spatial orientation tree (SOT) structure
SPIHT-BU / SPIHT-AC [5]: 4 | 5 | 8/32 | original 2 × 2 structure with roots at the LL subband
Strip-based SPIHT [10]: 4 | 5 | 8/8 | roots start from the highest LH, HL, and HH subbands
Our proposed strip-based SPIHT-ZTR: 4 | 4 | 8/8 | SOT-C with roots at the LL subband
Figure 14: Performance comparison of SPIHT-AC, SPIHT-BU and
SPIHT-ZTR in terms of peak signal-to-noise ratio (PSNR) versus
the number of bits sent (Kbits). (The comparison plots are in terms
of average PSNR values and average number of bits sent for all six
test images.)
that our proposed SPIHT-ZTR performs better than
SPIHT-BU. An average PSNR improvement of 0.14 dB is
obtained at 1.00 bpp using the proposed coding scheme. This
is because the number of bits required to encode the image
at each bit-plane is smaller in SPIHT-ZTR than in
SPIHT-BU. In comparison with SPIHT-AC, although
SPIHT-ZTR gives a slightly lower PSNR performance, its
implementation is much less complex since there is no
arithmetic coding in SPIHT-ZTR.
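As a quick plausibility check (our own recomputation, not part of the paper), the 0.14 dB average improvement can be recovered directly from the 1.00 bpp rows of Table 6:

```python
# PSNR (dB) at 1.00 bpp from Table 6: (SPIHT-ZTR, SPIHT-BU) per test image.
psnr_1bpp = {
    "Lenna":    (39.49, 39.31),
    "Barbara":  (34.46, 34.29),
    "Goldhill": (35.77, 35.66),
    "Peppers":  (38.85, 38.71),
    "Airplane": (41.40, 41.26),
    "Baboon":   (29.48, 29.38),
}

# Average PSNR gain of SPIHT-ZTR over SPIHT-BU across the six images.
gain = sum(ztr - bu for ztr, bu in psnr_1bpp.values()) / len(psnr_1bpp)
print(round(gain, 2))  # 0.14
```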
Table 7 shows the comparison in memory requirements
needed for the strip-based implementation of our proposed
SPIHT-ZTR and those needed in [5] and [10]. It
should be noted that in the traditional SPIHT [5] coding,
a six-scale DWT decomposition and a seven-scale SOT
decomposition were originally applied to an image of size
512 × 512 pixels. However, for our comparison to be mean-
ingful, the memory requirements recorded here all involve
a four-scale DWT and a five-scale SOT decomposition.
From Table 7, it can be seen that our proposed strip-based
SPIHT-ZTR using SOT-C reduces the memory requirement
by 75% as compared to the traditional SPIHT using the orig-
inal 2 × 2 SOT structure. Even though the strip-based SPIHT
coder proposed in [10] requires the same number of memory
lines as our proposed work, there is a significant degradation
in its performance since the number of zerotrees to be coded
is increased. This hypothesis has been shown in [10, 21].
Lastly, we have also verified that the output from
our proposed hardware strip-based coder matches the
software simulation results.
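The 75% figure follows directly from the SOT memory-line counts in Table 7; a one-line check (our own arithmetic, not from the paper):

```python
# SOT memory lines per subband: traditional SPIHT (2 x 2 SOT) vs. proposed SOT-C.
lines_traditional = 32
lines_sot_c = 8

reduction = (lines_traditional - lines_sot_c) / lines_traditional
print(f"{reduction:.0%}")  # 75%
```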
7. Conclusion
The proposed architecture for strip-based image coding
using the SPIHT-ZTR algorithm is able to reduce the complexity
of its hardware implementation considerably, and it requires
a much lower amount of memory for processing
and buffering compared to the traditional SPIHT coding,
making it suitable for implementation in severely con-
strained hardware environments such as WSNs. Using the
proposed new 1D addressing method, wavelet coefficients
generated from the DWT module are organized into the
strip-buffer at predetermined locations. This simplifies the
implementation of SPIHT-ZTR coding since the coding can
now be performed in two passes. Besides this, the proposed
modification of the SPIHT algorithm by reintroducing
degree-0 zerotree coding results in a significant improvement
in compression efficiency. The proposed architecture is suc-
cessfully implemented using our designed MIPS processor
and the results have been verified through simulations using
MATLAB.
References
[1] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, "Wireless multimedia sensor networks: a survey," IEEE Wireless Communications, vol. 14, no. 6, pp. 32–39, 2007.
[2] A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, and J. Anderson, "Wireless sensor networks for habitat monitoring," in Proceedings of the ACM International Workshop on Wireless Sensor Networks and Applications (WSNA '02), pp. 88–97, Atlanta, Ga, USA, September 2002.
[3] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "A survey on sensor networks," IEEE Communications Magazine, vol. 40, no. 8, pp. 102–105, 2002.
[4] E. Magli, M. Mancin, and L. Merello, "Low-complexity video compression for wireless sensor networks," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '03), vol. 3, pp. 585–588, Baltimore, Md, USA, July 2003.
[5] A. Said and W. A. Pearlman, "A new, fast, and efficient image codec based on set partitioning in hierarchical trees," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 243–250, 1996.
[6] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3445–3462, 1993.
[7] D. Taubman, "High performance scalable image compression with EBCOT," IEEE Transactions on Image Processing, vol. 9, no. 7, pp. 1158–1170, 2000.
[8] W.-B. Huang, W. Y. Su, and Y.-H. Kuo, "VLSI implementation of a modified efficient SPIHT encoder," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 89, no. 12, pp. 3613–3622, 2006.
[9] J. Jyotheswar and S. Mahapatra, "Efficient FPGA implementation of DWT and modified SPIHT for lossless image compression," Journal of Systems Architecture, vol. 53, no. 7, pp. 369–378, 2007.
[10] R. K. Bhattar, K. R. Ramakrishnan, and K. S. Dasgupta, "Strip based coding for large images using wavelets," Signal Processing: Image Communication, vol. 17, no. 6, pp. 441–456, 2002.
[11] C. Parisot, M. Antonini, M. Barlaud, C. Lambert-Nebout, C. Latry, and G. Moury, "On board strip-based wavelet image coding for future space remote sensing missions," in Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS '00), vol. 6, pp. 2651–2653, Honolulu, HI, USA, July 2000.
[12] C. Chrysafis and A. Ortega, "Line-based, reduced memory, wavelet image compression," IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 378–389, 2000.
[13] J. M. Shapiro, "A fast technique for identifying zerotrees in the EZW algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '96), vol. 3, pp. 1455–1458, 1996.
[14] A. Jensen and A. la Cour-Harbo, Ripples in Mathematics: The Discrete Wavelet Transform, Springer, Berlin, Germany, 2000.
[15] G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, Mass, USA, 2nd edition, 1996.
[16] M. Weeks, Digital Signal Processing Using MATLAB and Wavelets, Infinity Science Press LLC, Sudbury, Mass, USA, 2007.
[17] W. Sweldens, "The lifting scheme: a custom-design construction of biorthogonal wavelets," Applied and Computational Harmonic Analysis, vol. 3, no. 2, pp. 186–200, 1996.
[18] K.-C. B. Tan and T. Arslan, "Shift-accumulator ALU centric JPEG2000 5/3 lifting based discrete wavelet transform architecture," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '03), vol. 5, pp. V-161–V-164, 2003.
[19] T. Acharya and P.-S. Tsai, JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures, Wiley-Interscience, New York, NY, USA, 2004.
[20] M. E. Angelopoulou, K. Masselos, P. Y. K. Cheung, and Y. Andreopoulos, "Implementation and comparison of the 5/3 lifting 2D discrete wavelet transform computation schedules on FPGAs," Journal of Signal Processing Systems, vol. 51, no. 1, pp. 3–21, 2008.
[21] L. W. Chew, L.-M. Ang, and K. P. Seng, "New virtual SPIHT tree structures for very low memory strip-based image compression," IEEE Signal Processing Letters, vol. 15, pp. 389–392, 2008.
[22] E. Khan and M. Ghanbari, "Very low bit rate video coding using virtual SPIHT," Electronics Letters, vol. 37, no. 1, pp. 40–42, 2001.
[23] F. W. Wheeler and W. A. Pearlman, "SPIHT image compression without lists," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '00), vol. 4, pp. 2047–2050, Istanbul, Turkey, June 2000.
[24] W.-K. Lin and N. Burgess, "Listless zerotree coding for color images," in Proceedings of the 32nd Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 231–235, Monterey, Calif, USA, November 1998.
[25] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 1998.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 725438, 7 pages
doi:10.1155/2009/725438
Research Article
Data Cache-Energy and Throughput Models: Design Exploration
for Embedded Processors
Muhammad Yasir Qadri and Klaus D. McDonald-Maier
School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK
Correspondence should be addressed to Muhammad Yasir Qadri, yasirqadri@acm.org
Received 25 March 2009; Revised 19 June 2009; Accepted 15 October 2009
Recommended by Bertrand Granado
Most modern 16-bit and 32-bit embedded processors contain cache memories to further increase instruction throughput of the
device. Embedded processors that contain cache memories open an opportunity for the low-power research community to model
the impact of cache energy consumption and throughput gains. For optimal cache memory configuration, mathematical models
have been proposed in the past. Most of these models are complex enough to be adapted for modern applications like run-time
cache reconfiguration. This paper improves and validates previously proposed energy and throughput models for a data cache,
which could be used for overhead analysis for various cache types with a relatively small amount of inputs. These models analyze
the energy and throughput of a data cache on an application basis, thus providing the hardware and software designer with the
feedback vital to tune the cache or application for a given energy budget. The models are suitable for use at design time in the
cache optimization process for embedded processors considering time and energy overhead, or could be employed at runtime for
reconfigurable architectures.
Copyright © 2009 M. Y. Qadri and K. D. McDonald-Maier. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
1. Introduction
The popularity of embedded processors could be judged by
the fact that more than 10 billion embedded processors were
shipped in 2008, and this is expected to reach 10.76 billion
units in 2009 [1]. In the embedded market the number of
32-bit processors shipped has significantly surpassed that of
8-bit processors [2]. Modern 16-bit and 32-bit embedded
processors increasingly contain cache memories to further
instruction throughput and performance of the device. The
recent drive towards low-power processing has challenged
designers and researchers to optimize every component
of the processor. However, optimization for energy usually
comes with some sacrifice in throughput, which may
result in only a minor overall gain.
Figure 1 shows the operation of a typical battery powered
embedded system. Normally, in such devices, the processor
is placed in active mode only when required; otherwise it
remains in a sleep mode. An overall power saving (increased
throughput to energy ratio) could be achieved by increasing
the throughput (i.e., lowering the duty cycle), decreasing
the peak energy consumption, or by lowering the sleep
mode energy consumption. This phenomenon clearly shows
the interdependence of energy and throughput for overall
power saving. Keeping this in mind, a simplified approach
is proposed that is based on energy and throughput models
to analyze the impact of a cache structure in an embedded
processor on a per-application basis, which exemplifies the use
of the models for design space exploration and software
optimization.
The remainder of this paper is divided into five sections.
In the following two sections related work is discussed and
the energy and throughput models are introduced. In the
fourth section the experimental environment and results are
discussed, the fifth section describes an example application
for the mathematical models, and the final section forms the
conclusion.
2. Related Work
The cache energy consumption and throughput models
have been the focus of research for some time. Shiue and
Figure 1: Power consumption of a typical battery powered
processor (adapted from [3]). (The plot shows power over time,
distinguishing active mode power, average power, and sleep mode
power.)
Chakrabarti [4] present an algorithm to find the optimum
cache configuration based on cache size, the number of
processor cycles, and the energy consumption. Their work
is an extension of the work of Panda et al. [5, 6] on data
cache sizing and memory exploration. The energy model by
Shiue and Chakrabarti, though highly accurate, requires a
wide range of inputs such as the number of bit switches on the
address bus per instruction, the number of bit switches on the
data bus per instruction, the number of memory cells in a
word line and in a bit line, and so forth, which may not be
known to the model user in advance. Another example of a
detailed cache energy model was presented by Kamble and
Ghose [7]. These analytical models for conventional caches
were found to be accurate to within 2% error. However, they
over-predict the power dissipation of low-power caches by
as much as 30%. The low-power cache designs used by
Kamble and Ghose incorporated block buffering, data RAM
subbanking, and bus invert coding for evaluating the models.
The relative error in the models increased greatly when
subbanking and block buffering were applied simultaneously.
The major difference between the approach used by Kamble
and Ghose [7] and the one discussed in this paper is that
the former incorporated bit level models to evaluate the
energy consumption, which are in some cases inaccurate, as
the error in output address power was found (by Kamble
and Ghose) to be in the order of 200%, due to the fact that
data and instruction access addresses exhibit strong locality.
The approach presented here uses a standard cache modelling
tool, CACTI [8], for measuring bit level power consumption
in cache structures and provides a holistic approach to
energy and throughput on an application basis. In fact the
accuracy of these models is independent of any particular
cache configuration as standard cache energy and timing
tools are used to provide cache specific data. This approach
is discussed in detail in Section 4.
Simunic et al. [9] presented mathematical models for
energy estimation in embedded systems. The per cycle energy
model presented in their work comprises energy components
of the processor, memory, interconnects and pins, DC-to-DC
converters, and level two (L2) cache. The model was
validated using an ARM simulator [10] and the SmartBadge
[11] prototype based on the ARM-1100 processor. It was
found to be within 5% of the hardware measurements for
the same operating frequency. The models presented in
their work holistically analyze the embedded system power
and do not estimate energy consumption for individual
components of a processor, that is, level one (L1) cache, on-
chip memory, pipeline, and so forth. In work by Li and
Henkel [12] a detailed full system energy model comprising
cache, main memory, and software energy components was
presented. Their work includes the description of a framework
to assess and optimize energy dissipation of embedded systems.
Tiwari et al. [13] presented an instruction level energy model
estimating the energy consumed in individual pipeline stages.
The same methodology was applied in [14] by the authors
to observe the effects of cache enabling and disabling.
Wada et al. [15] presented a comprehensive circuit level
access time model for on-chip cache memory. On comparing
with SPICE results, the model gives 20% error for an 8
nanosecond access time cache memory. Taha and Wills [16]
presented an instruction throughput model for superscalar
processors. The main parameters of the model are the super-
scalar width of the processor, pipeline depth, instruction
fetch method, branch predictor, cache size and latency, and
so forth. The model results in errors of up to 5.5% as compared
to the SimpleScalar out-of-order simulator [17]. CACTI
(cache access and cycle time model) [8] is an open-source
modelling tool based on such detailed models to provide
thorough, near accurate memory access time and energy
estimates. However it is not a trace driven simulator, and so
the energy consumption resulting from the number of hits or
misses is not accounted for a particular application.
Apart from the mathematical models, substantial work
has been done on cache miss rate prediction and minimiza-
tion. Ding and Zhong in [18] have presented a framework for
data locality prediction, which can be used to profile a code
to reduce the miss rate. The framework is based on approximate
analysis of reuse distance, pattern recognition, and distance-
based sampling. Their results show an average of 94% accu-
racy when tested on a number of integer and floating point
programs from SPEC and other benchmark suites. Extending
their work, Zhong et al. in [19] introduce an interactive
visualization tool that uses a three-dimensional plot to show
miss rate changes across program data sizes and cache sizes.
Another very useful tool named RDVIS, a further extension
of the work previously stated, was presented by Beyls et al.
in [20, 21]. Based on cluster analysis of basic block vectors,
the tool gives hints on particular code segments for further
optimization. This in effect provides valuable feedback to
the programmer to improve the temporal locality of the data to
increase the hit rate for a cache configuration.
The following section presents the proposed cache energy
and throughput models, which can be used to identify an
early cache overhead estimate based on a limited set of input
data. These models are an extension of the models previously
proposed by Qadri and Maier in [22, 23].
3. The D-Cache Energy and Throughput Models
The cache energy and throughput models given below strive
to provide a complete application-based analysis. As a result
they could facilitate the tuning of a cache and an application
Table 1: Simulation platform parameters.
Processor: PowerPC 440GP
Execution mode: Turbo
Clock frequency (Hz): 1.00E+08
Time (s): 1.00E-08
CPI: 1
Technology: 0.18 um
Vdc (V): 1.8
Logic supply (V): 3.3
DDR SDRAM (V): 2.5
VDD (1.8 V) active operating current IDD (A): 9.15E-01
OVDD (3.3 V) active operating current IODD (A): 1.25E-01
Energy per cycle (J): 1.65E-08
Idle mode energy (J): 4.12E-09

Table 2: Cache simulator (CACTI) data.
Cache size: 32 Kbytes
Block size: 256 bytes
R/W ports: 0
Read ports: 1
Write ports: 1
Access time (s): 1.44E-09
Cycle time (s): 7.38E-10
Read energy (J): 2.24E-10
Write energy (J): 3.89E-11
Leakage read power (W): 2.96E-04
Leakage write power (W): 2.82E-04
according to a given power budget. The models presented
in this paper are an improved extension of the energy and
throughput models for a data cache previously presented
by the authors in [22, 23]. The major improvements in the
model are as follows: (1) The leakage energy (E_leak) is now
indicated for the entire processor rather than simply the
cache on its own. The energy model covers the per cycle
energy consumption of the processor. The leakage energy
statistics of the processor in the data sheet cover the cache
and all peripherals of the chip. (2) The miss rate in E_read and
E_write has been changed to read_mr (read miss rate) and write_mr
(write miss rate) as compared to the total miss rate (r_miss) that
was employed previously. This was done as the read energy
and write energy components correspond to the respective
miss rate contribution of the cache. (3) In the throughput
model stated in [23] a term t_mem (time saved from memory
operations) was subtracted from the total throughput of the
system, which was later found to be inaccurate. The overall
time taken to execute an instruction, denoted as T_total, is the
measure of the total time taken by the processor for running
an application using the cache. The time saved from memory-
only operations is already accounted for in T_total. However, a new
term t_ins was introduced to incorporate the time taken for the
execution of cache access instructions.
3.1. Energy Model. If E_read and E_write are the energy con-
sumed by cache read and write accesses, E_leak the leakage
energy of the processor, E_c-m the energy consumed by
cache to memory accesses, E_mp the energy miss penalty,
and E_misc the energy consumed by the instructions which
do not require data memory access, then the total energy
consumption of the code, E_total, in Joules (J) could be defined
as

E_total = E_read + E_write + E_c-m + E_mp + E_leak + E_misc.  (1)
Further defining the individual components,

E_read = n_read E_dyn.read (1 + read_mr/100),
E_write = n_write E_dyn.write (1 + write_mr/100),
E_c-m = E_m (n_read + n_write) (1 + total_mr/100),
E_mp = E_idle (n_read + n_write) P_miss (total_mr/100),  (2)
where n_read is the number of read accesses, n_write the number
of write accesses, E_dyn.read the total dynamic read energy for
all banks, E_dyn.write the total dynamic write energy for all
banks, E_m the energy consumed per memory access, E_idle the
per cycle idle mode energy consumption of the processor,
read_mr, write_mr, and total_mr the read, write, and total
miss ratios (in percent), and P_miss the miss penalty (in
number of stall cycles).
The idle mode leakage energy of the processor, E_leak, could
be calculated as

E_leak = P_leak t_idle,  (3)

where t_idle is the total time in seconds (s) for which the processor
was idle.
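Purely as an illustration, equations (1)-(3) transcribe directly into code. The per-access energies below are taken from Tables 1 and 2, the E_m value follows the half-per-cycle assumption of Section 4, and the access counts and miss rates are made-up placeholders:

```python
def total_energy(n_read, n_write, read_mr, write_mr, total_mr,
                 e_dyn_read, e_dyn_write, e_m, e_idle, p_miss,
                 p_leak, t_idle, e_misc=0.0):
    """Total code energy E_total (J) per equations (1)-(3); miss rates in percent."""
    e_read = n_read * e_dyn_read * (1 + read_mr / 100)
    e_write = n_write * e_dyn_write * (1 + write_mr / 100)
    e_c_m = e_m * (n_read + n_write) * (1 + total_mr / 100)
    e_mp = e_idle * (n_read + n_write) * p_miss * (total_mr / 100)
    e_leak = p_leak * t_idle                      # equation (3)
    return e_read + e_write + e_c_m + e_mp + e_leak + e_misc

# Hypothetical workload; energies from Tables 1 and 2, 5-cycle miss penalty.
e = total_energy(n_read=1_000_000, n_write=500_000, read_mr=5.0, write_mr=5.0,
                 total_mr=5.0, e_dyn_read=2.24e-10, e_dyn_write=3.89e-11,
                 e_m=1.65e-8 / 2,   # E_m assumed half the per cycle energy (Section 4)
                 e_idle=4.12e-9, p_miss=5, p_leak=0.0, t_idle=0.0)
print(f"{e:.3e} J")
```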
3.2. Throughput Model. Due to the concurrent nature of
the cache to memory access time and the cache access time, their
overlapping can be assumed. If t_cache is the time taken for
cache operations, t_ins the time taken in the execution of cache
access instructions (s), t_mp the time miss penalty, and t_misc
the time taken while executing other instructions which do
not require data memory access, then the total time taken by
an application with a data cache could be estimated as

T_total = t_cache + t_ins + t_mp + t_misc.  (4)
Figure 2: Energy consumption for write-through cache. (E_predicted
vs. E_simulated across associativities 1 to 16 and Random/LRU/Cyclic
replacement policies.)
Figure 3: Energy consumption for write-back cache. (E_predicted
vs. E_simulated across associativities 1 to 16 and Random/LRU/Cyclic
replacement policies.)
Furthermore,

t_cache = t_c (n_read + n_write) (1 + total_mr/100),
t_ins = (t_cycle - t_c) (n_read + n_write),
t_mp = t_cycle (n_read + n_write) P_miss (total_mr/100),  (5)

where t_c is the time taken per cache access and t_cycle is the
processor cycle time in seconds (s).
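Equations (4)-(5) can be sketched in the same illustrative way; t_c and t_cycle come from Tables 2 and 1, while the workload figures are again hypothetical:

```python
def total_time(n_read, n_write, total_mr, t_c, t_cycle, p_miss, t_misc=0.0):
    """Total execution time T_total (s) per equations (4)-(5); total_mr in percent."""
    accesses = n_read + n_write
    t_cache = t_c * accesses * (1 + total_mr / 100)
    t_ins = (t_cycle - t_c) * accesses
    t_mp = t_cycle * accesses * p_miss * (total_mr / 100)
    return t_cache + t_ins + t_mp + t_misc

# Hypothetical workload: 1.5 M cache accesses, 5% total miss rate, 5-cycle penalty.
t = total_time(n_read=1_000_000, n_write=500_000, total_mr=5.0,
               t_c=1.44e-9, t_cycle=1.0e-8, p_miss=5)
print(f"{t:.3e} s")
```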
4. The Experimental Environment and Results
To analyze and validate the aforementioned models, SIMICS
[25], a full system simulator, was used. An IBM/AMCC
PPC440GP [26] evaluation board model was used as the
target platform and a Montavista Linux 2.1 kernel was used
as the target application to evaluate the models. A generic 32-bit
data cache was included in the processor model, and results
were analyzed by varying associativity, write policy, and
replacement policy. The cache read and write miss penalty
was fixed at 5 cycles. The processor input parameters are
defined in Table 1.
Figure 4: Throughput for write-through cache. (T_predicted vs.
T_simulated across associativities and replacement policies.)
Figure 5: Throughput for write-back cache. (T_predicted vs.
T_simulated across associativities and replacement policies.)
Figure 6: Simulated and predicted energy consumption for
BasicMath, Qsort, and CRC32, varying cache size and block size
(see Table 3).
As SIMICS could only provide timing information of the
model, processor power consumption data such as the idle mode
energy (E_idle) and leakage power (P_leak) were taken from the
PPC440GP datasheet [26], and cache energy and timing
parameters such as the dynamic read and write energy per cache
access (E_dyn.read, E_dyn.write) and the cache access time (t_c) were
taken from the CACTI [8] cache simulator (see Table 2). For
other parameters such as the number of memory reads/writes
and the read/write/total miss rates (n_read, n_write, read_mr, write_mr,
total_mr), SIMICS cache profiler statistics were used. The
cache to memory access energy (E_m) was assumed to be half
the per cycle energy consumption of the processor. The
Table 3: Iteration definition for varying block size and cache size.
Block size \ Cache size: 1 KB | 2 KB | 4 KB | 8 KB | 16 KB | 32 KB
64 bytes: 1 | 2 | 3 | 4 | 5 | 6
128 bytes: 7 | 8 | 9 | 10 | 11 | 12
256 bytes: 13 | 14 | 15 | 16 | 17 | 18
512 bytes: 19 | 20 | 21 | 22 | 23 | 24
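The iteration numbering in Table 3 follows a simple row-major pattern over block size and cache size; a small helper reproducing it (our own reconstruction, not from the paper):

```python
from math import log2

def iteration(block_bytes, cache_bytes):
    """Iteration number from Table 3: rows are block sizes 64..512 bytes,
    columns are cache sizes 1 KB..32 KB, numbered row-major from 1."""
    row = int(log2(block_bytes // 64))    # 0..3 for 64, 128, 256, 512 bytes
    col = int(log2(cache_bytes // 1024))  # 0..5 for 1 KB .. 32 KB
    return 6 * row + col + 1

print(iteration(64, 1024))    # 1
print(iteration(256, 8192))   # 16
print(iteration(512, 32768))  # 24
```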
Table 4: Cache simulator (CACTI) data for the various iterations.
Iteration | Associativity | Block size (bytes) | Number of lines | Cache size (bytes) | Access time (ns) | Cycle time (ns) | Read energy (nJ) | Write energy (nJ)
1 0 64 16 1024 2.15 0.7782 0.160524 0.0918
2 0 128 8 1024 2.47 1.182 0.126 0.0695
3 0 256 4 1024 3.639 2.394 0.135 0.063
4 0 512 2 1024 8.185 6.955 0.171 0.068
5 0 64 32 2048 2.368 0.818 0.265 0.142
6 0 128 16 2048 2.58 1.206 0.186 0.095
7 0 256 8 2048 3.706 2.42 0.183 0.0755
8 0 512 4 2048 8.23 6.975 0.213 0.075
9 0 64 64 4096 2.2055 0.778 0.593 0.404
10 0 128 32 4096 2.802 1.25 0.307 0.145
11 0 256 16 4096 3.84 2.46 0.28 0.1
12 0 512 8 4096 8.316 7.016 0.298 0.087
13 0 64 128 8192 2.422 0.8175 0.96 0.5988
14 0 128 64 8192 2.633 1.206 0.619 0.407
15 0 256 32 8192 4.085 2.529 0.474 0.151
16 0 512 16 8192 8.48 7.09176 0.468 0.1125
17 0 64 256 16384 2.85 0.88 1.7 0.988
18 0 128 128 16384 2.8559 1.251 1.0049 0.602
19 0 256 64 16384 3.888 2.4557 0.834 0.413
20 0 512 32 16384 8.533 7.092 0.77 0.254
21 0 64 512 32768 3.783 0.985 3.177 1.7661
22 0 128 256 32768 3.3 1.33 1.776 0.991
23 0 256 128 32768 4.14 2.53 1.413 0.608
24 0 512 64 32768 8.534 7.092 1.263 0.4247
simulated energy consumption was obtained by multiplying
the per cycle energy consumption as per the datasheet specification
by the number of cycles executed in the target application.
The results for the energy and timing models are presented
in Figures 2, 3, 4, and 5. From the graphs, it could be inferred
that the average error of the energy model for the given
parameters is approximately 5% and that of the timing model
is approximately 4.8%. This is also reinforced by the specific
results for the benchmark applications, that is, BasicMath,
QuickSort, and CRC32 from the MiBench benchmark
suite [27], while varying cache size and block size using a
direct-mapped cache, as shown in Figures 6 and 7. The
definition of each iteration for the various cache and block sizes
is given in Table 3, and the cache simulator data are given in
Table 4.
5. Design Space Exploration
The validation of the models opens an opportunity to
employ these in a variety of applications. One such appli-
cation could be a design space exploration to find the optimal cache
Figure 7: Simulated and predicted throughput for BasicMath,
Qsort, and CRC32, varying cache size and block size (see Table 3).
Figure 8: Proposed design cycle for the optimization of cache and
application code. (The flow runs from C code through compilation,
cache miss rate analysis, and code profiling to the energy and
throughput models, iterating on the cache parameters until the
energy and throughput requirements are fulfilled.)
configuration for a given energy budget or timing
requirement. A typical approach for design exploration in
order to identify the optimal cache configuration and code
profile is shown in Figure 8. At first the miss rate prediction
is carried out on the compiled code and preliminary cache
parameters. Then several iterations may be performed to
fine-tune the software to reduce miss rates. Subsequently,
the tuned software goes through the profiling step. The
information from the cache modeller and the code profiler is
then fed to the energy and throughput models. If the given
energy budget along with the throughput requirements is
not satisfied, then the cache parameters are to be changed
and the same procedure is repeated. This strategy can be
adopted at design time to optimize the cache configuration
and decrease the miss rate of a particular application
code.
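The outer loop of this design cycle amounts to a search over candidate cache configurations; a schematic rendering (all callables and numbers below are toy stand-ins, not tools or data from the paper):

```python
from dataclasses import dataclass

@dataclass
class Budget:
    energy: float  # J
    time: float    # s

def explore(configs, estimate, budget):
    """Schematic of Figure 8's outer loop: walk candidate cache configurations
    and return the first whose modelled energy and time meet the budget.
    `estimate` stands in for the cache modeller, code profiler, and the
    energy/throughput models of Section 3."""
    for cfg in configs:
        energy, time = estimate(cfg)
        if energy <= budget.energy and time <= budget.time:
            return cfg
    return None  # no configuration satisfies the requirements

# Toy stand-in: pretend larger caches cost more energy but save time.
candidates = [(1024, 64), (8192, 256), (32768, 256)]  # (cache size, block size)

def toy_estimate(cfg):
    size, _ = cfg
    return 5e-3 + size * 1e-7, 0.02 - size * 2e-7  # made-up figures

best = explore(candidates, toy_estimate, Budget(energy=1e-2, time=1.5e-2))
print(best)
```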
6. Conclusion
In this paper straightforward mathematical models were
presented with a typical accuracy of 5% when compared to
SIMICS timing results and the per cycle energy consumption
of the PPC440GP processor. Therefore, the model-based
approach presented here is a valid tool to predict the pro-
cessor's performance with sufficient accuracy, which would
clearly facilitate executing these models in a system in order
to adapt its own configuration during the actual operation
of the processor. Furthermore, an example application for
design exploration was discussed that could facilitate the
identification of an optimal cache configuration and code
profile for a target application. In future work the presented
models are to be analyzed for multicore processors and
to be further extended to incorporate multilevel cache
systems.
Acknowledgment
The authors would like to thank the anonymous reviewers for
their very insightful feedback on earlier versions of this
manuscript.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 737689, 15 pages
doi:10.1155/2009/737689
Research Article
Hardware Architecture for Pattern Recognition in
Gamma-Ray Experiment
Sonia Khatchadourian,1 Jean-Christophe Prévotet,2 and Lounis Kessal1
1 ETIS, CNRS UMR 8051, ENSEA, University of Cergy-Pontoise, 6, Avenue du Ponceau, 95014 Cergy-Pontoise, France
2 IETR, CNRS UMR 6164, INSA de Rennes, 20, Avenue des buttes de Coesmes, 35043 Rennes, France
Correspondence should be addressed to Sonia Khatchadourian, sonia.khatchadourian@ensea.fr
Received 19 March 2009; Accepted 21 July 2009
Recommended by Ahmet T. Erdogan
The HESS project has been running successfully for seven years. In order to take into account the increased sensitivity of the entire
project in its second phase, a new trigger scheme is proposed. This trigger is based on a neural system that extracts the interesting
features of the incoming images and rejects the background more efficiently than classical solutions. In this article, we present the
basic principles of the algorithms as well as their hardware implementation in FPGAs (Field Programmable Gate Arrays).
Copyright 2009 Sonia Khatchadourian et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
For many years, the study of gamma photons has led scientists
to understand more deeply the complex processes that
occur in the Universe, for example, remnants of supernova
explosions, cosmic-ray interactions with interstellar gas,
and so forth. In the 1960s, it finally became possible to
develop efficient measuring instruments to detect gamma-ray
emissions, thus enabling the validation of the theoretical
concepts. Most of these instruments were built in order
to identify the direction of gamma rays. Since gamma
photons are not deflected by interstellar magnetic fields, it
becomes possible to determine the position of the source
accurately. In this context, Imaging Atmospheric Cherenkov
Telescopes constitute the most sensitive technique for the
observation of high-energy gamma rays. Such telescopes
provide a large effective collection area and achieve excellent
angular and energy resolution for detailed studies of cosmic
objects. The technique relies upon the Cherenkov light produced
by the secondary particles once the gamma ray interacts
with the atmosphere at an altitude of about 10 km. This results in
a shower of secondary particles that may also interact
with the atmosphere, producing other particles according
to well-known physical rules. By detecting shower particles
(electrons, muons, protons), it is then possible to reconstruct
the initial event and determine the precise location of a
source within the Universe.
In order to determine the nature of the shower, it is
important to analyze its composition, that is, to determine
the types of particles that have been produced during
the interaction with the atmosphere. This is performed
by studying the different images that are collected by the
telescopes and that are generally representative of the particle
type. For example, gamma-ray showers usually have thin,
high-density structures. On the other hand, proton showers are quite
broad with low density.
The major problem in these experiments is that the
number of images to be collected is generally huge and the
complete storage of all events is impossible. This is mainly
due to the fact that data-storage capacity is limited and that
it is impossible to keep track of all incoming images for off-line
analysis.
In order to circumvent this issue, a trigger system is
often used to select the events that are interesting (from a
physicist's point of view). This processing must be performed
in real time and is very tightly constrained in terms of latency
since it must be compatible with the data acquisition rate of the
cameras. The role of such a triggering system is to rapidly
decide whether an event is to be recorded for further studies
or rejected by the system.
The organization of this paper is as follows: the
context of our work is presented in Section 2. Section 3
describes the algorithms that are envisaged in order to build a
new trigger system. Considerations on hardware implementation
are then provided in Section 4, and Section 5 describes
the results in terms of timing and resource usage.
2. The HESS Project
The High-Energy Stereoscopic System (HESS) is a system
of imaging Cherenkov telescopes that strives to investigate
cosmic gamma rays in the 100 GeV to 100 TeV energy range
[1]. It is located in Namibia at an altitude of 1800 m, where
the optical quality is excellent. The Phase-I of this project
went into operation in Summer 2002 and consists of four
Large Cherenkov Telescopes (LCT), each with 107 m² of mirror
area, in order to provide good stereoscopic viewing of the
air showers. The telescopes are arranged on a square with 120 m
sides, thus optimizing the collection area.
The cameras of the four telescopes serve to capture and
record the Cherenkov images of air showers. They have
excellent resolution since the pixel size is very small: each
camera is equipped with 960 photomultiplier tubes (PMTs) that
are assimilated to pixels.
An efficient trigger scheme has also been designed in
order to reject background such as the light of the night
sky that interferes with measurements. The next sections describe
both phases of the project in terms of triggering issues.
2.1. Phase-I. The trigger system of the HESS Phase-I project
is devised in order to make use of the stereoscopic approach:
simultaneous observation of interesting images is
required in order to store a specific event [2]. This coincidence
requirement reduces the rate of background events,
that is, events that may be assimilated to night-sky noise. It is
composed of two separate levels (L1 and the central trigger).
At the first level, a basic threshold is applied to the signals
collected by the camera. A trigger occurs if the signals in
M pixels within a 64-pixel sector of the camera exceed a
value of N photoelectrons. This makes it possible to get rid of isolated
pixels and thus to eliminate the noise. The pixel signals are
sampled using 1 GHz Analogue Ring Samplers (ARSs) [3]
with a ring buffer depth of 128 cells. Following a camera
trigger, the ring buffer is stopped and its content is digitized,
summed, and written into an FPGA buffer. After read-out, the
camera is ready for the next event, and further processing
may be performed, including the transmission of data via
optical cable to the PC processor farm located in the control
building.
The Central Trigger System (CTS) implements
the coincidence between the four telescopes. It identifies
the status of the telescopes and writes this information, as well
as an absolute time (measured by a GPS), into a FIFO (First-In
First-Out) memory for each system coincidence. Once the
data have been written into this FIFO, the CTS is ready to
process new incoming events, about 330 nanoseconds after
the coincidence occurred. The FIFO memory has a depth of
16000 events and is read out asynchronously. A schematic
illustration of the HESS trigger system is depicted in Figure 1.

Figure 1: Schematic of the HESS Trigger System.

Figure 2: Schematic of the VLCT Trigger System (L1accept at ~100 kHz; L2accept/L2reject at ~3.5 kHz).
2.2. Phase-II. Since its inception in 2002, the HESS project
has kept delivering very significant results. In this very
promising context, researchers of the collaboration have
decided to improve the initial project by adding a new Very
Large Central Telescope (VLCT) in the middle of the four
existing ones. This new telescope should make it possible to increase
the sensitivity of the global system as well as to improve the
resolution for high-energy particles. It is composed of 2048
pixels which represent the energy of the incident event.
With this new approach, the quantity of data to be
collected would drastically increase, and it becomes necessary
to build a new trigger system in order to be compatible with
the new requirements of the project.
One of the most challenging objectives of the HESS
project is to detect particles whose energy is below 50 GeV.
In this energy range, it is not conceivable to use all telescopes
(since the smallest ones cannot trigger), and only the fifth
telescope may be used in a monoscopic mode.
The structure of the new triggering system is depicted in
Figure 2. Data coming from the VLCT camera consist of 2048
pixel values which are first stored in a Serial Analog Memory
(SAM). In parallel, data are also sent to a level 1 trigger (L1)
whose structure is described in Section 2.1. The L1 trigger
applies a basic analog threshold to the pixel values and
generates a binary signal indicating whether an event has to
be kept (L1accept) or rejected (L1reject). In the case where an
event is accepted, the entire image is converted into digital
patterns. These data are stored in FIFO memories until an
L2accept/L2reject signal coming from a second level trigger
(L2) is generated.
Figure 3: Gamma (1-4), muon (5-6), and proton (7-8) images of
different energies.
In parallel, data are sent to the PreL2 stage which
thresholds the incoming pixels according to 3 energy levels.
Each pixel value is coded into 2 bits corresponding to 3 states
of energy. These images are then sent to the L2 trigger.
L1 and L2 trigger decisions are expected at average rates of
100 kHz and 3.5 kHz, respectively. Examples of simulated
images are depicted in Figure 3.
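The PreL2 coding can be sketched as a simple 2-bit quantizer; the threshold values used below are illustrative assumptions, not the actual PreL2 settings:

```python
def prel2_quantize(pixel, t1, t2):
    """Map an analog pixel value to one of 3 energy states on 2 bits (sketch).

    t1 < t2 are illustrative thresholds separating the 3 energy levels.
    """
    if pixel < t1:
        return 0b00   # below threshold / no signal
    if pixel < t2:
        return 0b01   # medium energy
    return 0b10       # high energy
```

Only three of the four 2-bit codes are used, one per energy state.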
3. The HESS2 L2 Triggering System
In order to cope with the new performance requirements of the HESS
Phase-II system, an efficient L2 trigger scheme is currently
being built. Like all triggers, it aims to provide a decision
regarding the interest of a particular event.

Figure 4: Overview of Hillas moments.

In this context,
two parallel studies have been conducted in order to identify the best
algorithms to implement at that level. The first study relied
on the Hillas parameters, which are seen as a classical solution
in astrophysics pattern recognition. The second study
envisaged the use of pattern recognition tools such as
neural networks associated with an intelligent preprocessing.
Both approaches are described in the next sections.
3.1. The First Approach
3.1.1. Hillas Parameters. Hillas parametrization was
introduced in [4]. The retained method consists in isolating
image descriptors which are based on image shape parameters
such as length (L) and width (W) as well as an angle
(α). The angle α represents the angle of the image with respect to
the direction of the emitting source location (see Figure 4).
This approach globally considers that gamma signatures are
mainly elliptical in shape whereas other particles' signatures
are more irregular. This assumption often holds in
practice. Nevertheless, signatures strongly depend on the
distance between the impact point of the ray shower and
the telescope. This may lead to various types of images for
the same event nature and constitutes a real challenge for
identification (see Figure 3).
3.1.2. The Classifier. In this first approach, the classifier
consists in applying thresholds to the Hillas parameters
(or a combination of these parameters) computed on the
incoming images in order to distinguish gamma signatures
among all collected images. One of the best parameters that
have been identified as a good discriminator is the Center
of Gravity (CoG). This parameter represents the center of
gravity of all illuminated pixels within the ellipse.
In this case, the recognition of particles is performed
according to the following rule:
(a) if CoG < t, then the event is recognized as a gamma
particle;
(b) if CoG ≥ t and α < 20 deg, then the event is
recognized as a gamma particle;
(c) otherwise, the event is rejected.
t is a parameter which is adjusted according to the data
set.
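The cut-based rule above can be sketched as follows. This is a software illustration only; the angle argument stands for the orientation angle of the image with respect to the source direction, and the numeric values in the usage example are placeholders:

```python
def hillas_classify(cog, angle_deg, t):
    """Cut-based gamma selection sketch (illustrative names and values).

    cog: Hillas center-of-gravity parameter of the image
    angle_deg: orientation angle relative to the source direction, in degrees
    t: data-set-dependent threshold on the CoG
    """
    if cog < t:
        return "gamma"        # rule (a)
    if angle_deg < 20.0:
        return "gamma"        # rule (b): CoG >= t but well aligned
    return "rejected"         # rule (c)
```

For example, `hillas_classify(0.8, 30.0, 0.5)` falls through both gamma cuts and is rejected.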
The major drawback of such an approach is that the
considered thresholds are constant values. A
lack of flexibility is thus to be deplored. For example, it does not
allow taking into consideration the various conditions of the
experiment that may have a significant impact on the shape
of the signatures.
3.2. Intelligent Preprocessing. The second studied approach
aims to make use of algorithms that have already brought
significant results in terms of pattern recognition. Neural
networks are good candidates because they are a powerful
computational model. Moreover, their inherent
parallelism makes them suitable for a hardware implementation.
Used in different fields of physics, algorithms based on neural
networks have successfully been implemented and have already
proved their efficiency [5, 6]. Typical applications include particle
recognition in tracking systems, event classification problems, off-line
reconstruction of events, and online triggering in High-Energy
Physics.
From the assumption that neural networks may be
useful in such experiments, we have proposed a new
Level 2 (L2) trigger system enabling rather
complex processing of the incoming images. The major issue
with neural networks resides in the learning phase, which
strives to identify optimal parameters (weights) in order
to solve the given problem. This is true when considering
supervised learning, in which representative patterns have
to be iteratively presented to the network in a first learning
phase until the global error has reached a predefined
value.
One of the most important drawbacks of this type of
algorithm is that the number of weights strongly depends on
the dimensionality of the problem, which is often unknown
in practice. This requires finding the optimal structure of the
network (number of neurons, number of layers) in order to
solve the problem.
Moreover, the curse of dimensionality [7] constitutes
another challenge when dealing with neural networks. This
problem expresses a correlation between the size of the
network and the number of examples to furnish. This
relation is exponential; that is, if the network's size becomes
significant, the number of training examples may become
extremely large. This cannot be considered in practice.
In order to reduce the size of the network, it is possible
to simplify its task, that is, to reduce the dimensionality of the
problem. In this case, a preprocessing step aims at finding
correlations in the data and at applying basic transformations
in order to ease the resolution. In this study, we advise using
an intelligent preprocessing based on the extraction of the
intrinsic features of the incoming images.
The structure of the proposed L2 trigger is depicted in
Figure 5. It is divided into three stages. A rejection step aims
to eliminate isolated pixels and small images that cannot be
processed by the system. A second step consists in applying
a preprocessing to the incoming data. Finally, the classifier takes
the decision according to the nature of the event to identify.
These different steps are described in the following sections.
Figure 5: Schematic of the HESS Phase-II Trigger System (PreL2 data, rejection/denoising, Zernike preprocessing, neural network, L2accept/L2reject).
3.2.1. The Rejection Step. The rejection step has two significant
roles. First, it aims to remove isolated pixels that are
typically due to background. These pixels are eliminated by
applying a filtering mask on the entire image in order to
keep only the relevant information, that is, clusters of pixels.
This consists in testing the neighborhood of each pixel of the
image. As the image has a hexagonal mesh grid, a hexagonal
neighborhood is used. The direct neighborhood of each pixel
of the image is tested: if none of its 6 neighbors is activated,
the corresponding central pixel is considered isolated and is
deactivated. Second, the rejection step eliminates
particles that cannot be distinguished by the classifier. Very
small images (<4 pixels) are discarded since they contain
poor information that cannot be deciphered.
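The isolated-pixel filter and the small-image cut can be sketched as follows; the offset-grid neighbor layout used here is an illustrative assumption, not the actual camera pixel mapping:

```python
def remove_isolated_pixels(active):
    """Hexagonal denoising sketch: drop pixels with no active neighbor.

    `active` is a set of (row, col) indices on an offset hexagonal grid.
    The six neighbor offsets depend on the row parity (odd rows are
    shifted by half a column pitch); this layout is assumed for
    illustration only.
    """
    def neighbors(r, c):
        shift = 0 if r % 2 == 0 else 1
        return [(r, c - 1), (r, c + 1),
                (r - 1, c - 1 + shift), (r - 1, c + shift),
                (r + 1, c - 1 + shift), (r + 1, c + shift)]

    kept = {p for p in active if any(n in active for n in neighbors(*p))}
    # very small images (<4 pixels) carry too little information: reject
    return kept if len(kept) >= 4 else set()
```

A lone pixel far from any cluster is deactivated, and an image whose surviving cluster has fewer than 4 pixels is rejected entirely.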
3.2.2. The Preprocessing Step. The envisaged system is based
on a preprocessing step whose role consists in applying basic
transformations to the incoming images in order to isolate the
main characteristics of a given image. The most important
role of the preprocessing is to guarantee invariance in orientation
(rotation and translation) of the incoming images.
Since the signature of a particle within the image depends on the
impact point of the incident particle, the image may result in
a series of pixels located anywhere on the telescope. Without
a preprocessing stage based on orientation invariance,
the 2048 inputs of the classifier would completely differ from
one image to another although the basic shape of the particle
would remain the same.
The retained preprocessing is based on the use of Zernike
moments. These moments are mainly considered in shape
reconstruction [8] and can easily be made invariant to
changes in object orientation. They are defined as a set
of orthogonal functions based on complex polynomials
originally introduced in [9]. Zernike polynomials can be
expressed as

V_{pq}(r, \theta) = R_{pq}(r)\, e^{iq\theta},  (1)

where i = \sqrt{-1}; p is a nonnegative integer; q is a positive
integer such that p - q is even and q \le p; r is the length of a vector
from the origin to a point (x, y), normalized such that r \le 1, that is,
r = \sqrt{x^2 + y^2} / r_{\max} where r_{\max} = \max \sqrt{x^2 + y^2}; and \theta is the angle
between the x-axis and the vector extending from the origin
to the point (x, y).
R_{pq}(r) is the Zernike polynomial defined as

R_{pq}(r) = \sum_{k=q,\ |p-k|\ \text{even}}^{p} B_{pqk}\, r^{k}  (2)

with

B_{pqk} = (-1)^{(p-k)/2} \frac{((p+k)/2)!}{((p-k)/2)!\, ((k+|q|)/2)!\, ((k-|q|)/2)!}.
Zernike moments Z_{pq} are expressed according to

Z_{pq} = \frac{p+1}{\pi} \sum_{x} \sum_{y} I(x, y)\, V^{*}_{pq}(r, \theta),  (3)

where I(x, y) refers to the value of the pixel of coordinates (x, y).
The rotation invariance property of Zernike moments is
due to the intrinsic nature of such moments. In order to
guarantee translation invariance as well, it is necessary to
align the center of the object with the center of the unit circle.
This may be performed by replacing the coordinates x and y
of each processed point by x - x_0 and y - y_0,
where x_0 and y_0 refer to the center of the signature and may
be obtained by

x_0 = \frac{\sum_{x,y} x\, I(x, y)}{\sum_{x,y} I(x, y)}, \qquad
y_0 = \frac{\sum_{x,y} y\, I(x, y)}{\sum_{x,y} I(x, y)}.  (4)
In this case, r is expressed as follows:

r = \frac{\sqrt{(x - x_0)^2 + (y - y_0)^2}}{r_{\max}},  (5)

where r_{\max} = \max \sqrt{(x - x_0)^2 + (y - y_0)^2}.
In the context of our application, it has been found that
considering the Zernike moments up to order 8 was sufficient
to obtain the best performance. This implies computing 25
polynomials for each of the pixels within an image. Then,
it is necessary to accumulate these values in order to obtain
25 real values corresponding to all the Zernike moments of
the image. The modulus of these moments is then computed
and a normalization step is performed in order to scale the
obtained values within the -1 to 1 range. These values are
then provided to the neural network.
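The computation described above (centering, normalization, and moduli of the moments up to order 8, following Eqs. (1) to (5)) can be sketched as a floating-point reference model; the function names are illustrative, and this is not the FPGA implementation:

```python
import math

def zernike_modules(points, values, p_max=8):
    """|Z_pq| for all (p, q) with q >= 0, p - q even, q <= p <= p_max (sketch).

    points: list of (x, y) pixel coordinates; values: pixel intensities.
    Coordinates are centered on the intensity centroid and scaled so
    that r <= 1, as in Eqs. (4) and (5).
    """
    total = sum(values)
    x0 = sum(x * v for (x, _), v in zip(points, values)) / total
    y0 = sum(y * v for (_, y), v in zip(points, values)) / total
    centered = [(x - x0, y - y0) for x, y in points]
    rmax = max(math.hypot(x, y) for x, y in centered) or 1.0

    def B(p, q, k):
        # Coefficient B_pqk of the radial polynomial, Eq. (2)
        return ((-1) ** ((p - k) // 2) * math.factorial((p + k) // 2)
                / (math.factorial((p - k) // 2)
                   * math.factorial((k + q) // 2)
                   * math.factorial((k - q) // 2)))

    moments = {}
    for p in range(p_max + 1):
        for q in range(p % 2, p + 1, 2):     # p - q even
            z = 0j
            for (x, y), v in zip(centered, values):
                r = math.hypot(x, y) / rmax
                theta = math.atan2(y, x)
                radial = sum(B(p, q, k) * r ** k for k in range(q, p + 1, 2))
                z += v * radial * complex(math.cos(q * theta), math.sin(q * theta))
            moments[(p, q)] = abs(z) * (p + 1) / math.pi
    return moments
```

For p_max = 8 this yields exactly 25 moduli, matching the 25 inputs of the classifier; the moduli are rotation invariant because only the phase of Z_pq depends on the image orientation.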
3.2.3. The Neural Classifier. The envisaged classifier is a
feed-forward neural network named Multilayer Perceptron
(MLP) with one hidden layer and three outputs. The general
structure of such a network is depicted in Figure 6.
According to the nature of the network, the value of
output k may be computed according to (6):

y_k = g\left( \sum_{j=0}^{M} w^{(2)}_{kj}\, g\left( \sum_{i=0}^{d} w^{(1)}_{ji}\, x_i \right) \right).  (6)

In (6), y_k represents the output; w^{(2)}_{kj} and w^{(1)}_{ji}, respectively,
represent the weights connecting the output layer to the
hidden layer and the weights connecting the hidden layer to
the input nodes. M is the number of neurons in the hidden
layer and d is the number of inputs. x_i denotes the value of an
input node and g is an activation function. In our case, the
nonlinear function g has been used:

g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.  (7)
Figure 6: Structure of a 2-layer perceptron.
In the considered application, the output layer is composed
of three neurons (whose values range between -1
and 1) corresponding to the type of particle to identify.
Each output refers to a gamma, proton, or muon particle,
respectively. If the value of an output neuron is positive, it
may be assumed that the corresponding particle has been
identified by the network. In the case where more than one
output neuron is activated, the maximum value is taken into
account.
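A floating-point reference model of the classifier of (6)-(7) and of this output-selection rule might look as follows; the weight values in the usage example are arbitrary placeholders, not trained weights:

```python
import math

def mlp_forward(x, w1, w2):
    """Forward pass of the 2-layer MLP of Eq. (6) with tanh activations (sketch).

    w1[j][i]: weights from inputs to hidden neuron j;
    w2[k][j]: weights from hidden neurons to output neuron k.
    Bias terms can be folded in by fixing x[0] = 1.
    """
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [math.tanh(sum(w * h for w, h in zip(row, hidden))) for row in w2]

def classify(outputs, classes=("gamma", "proton", "muon")):
    """Pick the class of the most activated output, if any output is positive."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return classes[best] if outputs[best] > 0 else None
```

With tanh outputs, every decision value lies strictly between -1 and 1, and an event with no positive output is left unclassified.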
The learning phase has been performed off-line on a
set of 4500 patterns computed on simulated images, as the
HESS2 telescope is not yet installed. The simulated images
are generated by means of series of Monte Carlo simulations.
These patterns covered all ranges of energies and types of
particles; 1500 patterns were considered for each class of
particles. A previous study had determined the reliability
of the patterns in order to consider the most representative
patterns that may be collected by the telescope.
A classical backpropagation algorithm has been programmed
off-line in order to get the optimal values of the
weights. The training has been performed simultaneously
on two sets of patterns (learning and testing sets). Once the
error on the testing phase was at a minimum, the training was
stopped, ensuring that the weights had an optimal value.
The size of the input layer was determined according to
the type of preprocessing that was envisaged. In the case of a
Zernike preprocessing, this number has been set to 25 since
it corresponds to the number of outputs furnished by the
preprocessing step.
The number of hidden nodes (in the hidden layer) has
been evaluated regarding the results obtained on a specific
validation set of patterns. This precaution has been taken
in order to ensure that the neural network was able to
generalize to new data (i.e., that it has not learnt the training set explicitly).
3.3. Simulated Performances. The best performances that
have been obtained are summarized in Table 1. They correspond
to a trigger with a preprocessing based on the first
25 Zernike moments. Other results concerning different
preprocessings have also been described in [10].
Table 1: Performances according to both approaches.

                  Gamma   Muon   Proton
Hillas approach     60%    56%      37%
Neural approach     95%    58%      41%
According to Table 1, it may be seen that the neural
solution provides a significant improvement compared to
classical methods in terms of classification. This improvement
resides in the fact that a larger dimensionality of
the problem has been taken into account: whereas the Hillas
processing takes only five parameters into consideration, the
number of inputs in the case of the neural preprocessing is
set to 25. Moreover, as the Hillas approach only consists in
applying hard cuts on predefined parameters, the neural
approach is more flexible and guarantees nonlinear decision
boundaries. It may be assumed that the considered neural
network is capable of extracting the relevant information
and of discriminating between all images efficiently. The major
drawback of the neural approach is its relative complexity
in terms of computation and hardware implementation.
Although the Hillas algorithms may be implemented in software,
it is impossible to implement both the neural network
and the preprocessing step in the same manner. In this
context, dedicated circuits have to be designed in order to
be compliant with the strong timing constraints imposed by
the entire system. In our case, an L2 decision has to be taken
at a rate of 3.5 kHz, which corresponds to a timing constraint
of 285 microseconds.
4. Hardware Implementation
The complete L2 trigger system is currently being built,
making intensive use of reconfigurable technology. Components
such as FPGAs constitute an attractive alternative
to classical circuits such as ASICs (Application Specific
Integrated Circuits). This type of reconfigurable circuit
tends to be more and more efficient in terms of speed and
logic resources and is more and more often envisaged in deeply
constrained applications.
4.1. Hardware Implementation of Zernike Moments.
Although very efficient, Zernike moments are known
for their computational complexity. Many solutions have
been proposed for the fast implementation of Zernike
moments. Some algorithms are based on recursivity [11],
reuse of previous parts of the computation [12], or moment
generators [13].
Since using a moment generator allows a reduction
of the number of operations, we have decided to follow
this approach, that is, to compute Zernike moments from
accumulation moments.
4.1.1. Zernike Moments via Accumulation Moments.

Figure 7: Image topology in the L2-trigger of the HESS Phase-II project.

The mechanism of a moment generator [14] can be summarized
by the expression of the geometric moments, taken with respect
to the point of coordinates (N_x, N_y), as a function of the accumulation
moments:
m^{N_x, N_y}_{p,q} = \sum_{x=0}^{N_x} \sum_{y=0}^{N_y} (N_x - x)^{p} (N_y - y)^{q}\, I(x, y)
= \sum_{e=0}^{p} \sum_{f=0}^{q} S(p, e)\, S(q, f)\, \phi_{e, f}  (8)

with \phi_{e, f} being the accumulation moment of order (e, f ),
I(x, y) being the pixel values in the image, and

S = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & \cdots \\
-1 & 1 & 0 & 0 & 0 & 0 \\
1 & -3 & 2 & 0 & 0 & 0 \\
-1 & 7 & -12 & 6 & 0 & 0 \\
1 & -15 & 50 & -60 & 24 & 0 \\
-1 & 31 & -180 & 390 & -360 & 120 \\
\vdots & & & & & \ddots
\end{pmatrix}.  (9)
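As a numerical sanity check of (8), the following sketch compares a direct computation of the geometric moments with the accumulation-moment route. This is an illustration only, not the hardware design; the chaining of the accumulators and the alternating signs used in the S table below follow the standard moment-generator formulation and are assumptions of this sketch:

```python
# First rows of the S matrix of Eq. (8), with alternating signs assumed
S = [
    [1],
    [-1, 1],
    [1, -3, 2],
    [-1, 7, -12, 6],
]

def accumulate(seq, passes):
    """Run `passes` chained running-sum accumulators; return the final value."""
    for _ in range(passes):
        acc, out = 0, []
        for v in seq:
            acc += v
            out.append(acc)
        seq = out
    return seq[-1]

def accumulation_moment(img, e, f):
    """phi_{e,f}: (e+1) accumulation passes along x, then (f+1) along y."""
    per_row = [accumulate(row, e + 1) for row in img]
    return accumulate(per_row, f + 1)

def geometric_moment(img, p, q):
    """Direct evaluation of m_{p,q} with respect to the last point (Nx, Ny)."""
    ny, nx = len(img) - 1, len(img[0]) - 1
    return sum((nx - x) ** p * (ny - y) ** q * img[y][x]
               for y in range(ny + 1) for x in range(nx + 1))

def moment_from_accumulations(img, p, q):
    """m_{p,q} recovered from accumulation moments via Eq. (8)."""
    return sum(S[p][e] * S[q][f] * accumulation_moment(img, e, f)
               for e in range(p + 1) for f in range(q + 1))
```

The two routes agree exactly on integer images, which is what makes the accumulator chain attractive in hardware: only additions are needed until the final small combination step.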
According to (8), it is important to note that geometric
moments may be expressed as a function of accumulation
moments. In the context of our application, (10) to (23)
demonstrate how to calculate the Zernike moments from the
geometric moments and thus from the accumulation moments.
Note that, in the particular case of HESS Phase-II,
one of the issues is the image topology, which consists
of a hexagonal grid with empty corners (see Figure 7).
Since Zernike moments are continuous, they are particularly
suitable for this type of image. The following equations
aim to express the Zernike moments from the accumulation
moments in the particular context of HESS Phase-II.
We have seen that Zernike moments may be expressed as
follows:

Z_{pq} = \frac{p+1}{\pi} \sum_{x} \sum_{y} I(x, y) \sum_{k=q,\ p-k\ \text{even}}^{p} B_{pqk}\, r^{k}\, e^{iq\theta}.  (10)

The expressions of r and e^{iq\theta} can be rewritten as

r = \frac{\sqrt{(x - x_0)^2 + (y - y_0)^2}}{r_{\max}},  (11)

where (x_0, y_0) are the coordinates of the image center
computed as explained in (4), and

e^{iq\theta} = \frac{((x - x_0) + i(y - y_0))^{q}}{((x - x_0)^2 + (y - y_0)^2)^{q/2}}.  (12)
In order to simplify the equations, we note X = x - x_0
and Y = y - y_0. In this case,

Z_{pq} = \frac{p+1}{\pi} \sum_{x} \sum_{y} I(x, y) \sum_{k=q,\ p-k\ \text{even}}^{p} \frac{B_{pqk}}{r_{\max}^{k}}\, (X^2 + Y^2)^{(k-q)/2}\, (X + iY)^{q}.  (13)
According to the binomial theorem, the development of
a given polynomial can be expressed as follows:

(a + b)^{n} = \sum_{m=0}^{n} C_{n}^{m}\, a^{n-m}\, b^{m}.  (14)

It is then possible to expand the following expressions:

(X^2 + Y^2)^{(k-q)/2} = \sum_{\ell=0}^{(k-q)/2} C_{(k-q)/2}^{\ell}\, X^{k-q-2\ell}\, Y^{2\ell},
\qquad
(X + iY)^{q} = \sum_{m=0}^{q} C_{q}^{m}\, i^{m}\, X^{q-m}\, Y^{m}.  (15)
Thus, (13) can be reformulated as follows:

Z_{pq} = \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \text{even}}^{p} \frac{B_{pqk}}{r_{\max}^{k}} \sum_{m=0}^{q} \sum_{\ell=0}^{(k-q)/2} i^{m}\, C_{q}^{m}\, C_{(k-q)/2}^{\ell} \sum_{x} \sum_{y} X^{k-2\ell-m}\, Y^{2\ell+m}\, I(x, y)

= \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \text{even}}^{p} (-1)^{k}\, \frac{B_{pqk}}{r_{\max}^{k}} \sum_{m=0}^{q} \sum_{\ell=0}^{(k-q)/2} i^{m}\, C_{q}^{m}\, C_{(k-q)/2}^{\ell} \sum_{x} \sum_{y} (x_0 - x)^{k-2\ell-m}\, (y_0 - y)^{2\ell+m}\, I(x, y).  (16)
The next step consists in considering the last point
(N_x, N_y) in the equation of the Zernike moments with respect
to the center of the image:

Z_{pq} = \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \text{even}}^{p} (-1)^{k}\, \frac{B_{pqk}}{r_{\max}^{k}} \sum_{m=0}^{q} \sum_{\ell=0}^{(k-q)/2} i^{m}\, C_{q}^{m}\, C_{(k-q)/2}^{\ell} \sum_{x} \sum_{y} (x_0 - N_x + N_x - x)^{k-2\ell-m}\, (y_0 - N_y + N_y - y)^{2\ell+m}\, I(x, y)

= \frac{p+1}{\pi} \sum_{k=q,\ p-k\ \text{even}}^{p} (-1)^{k}\, \frac{B_{pqk}}{r_{\max}^{k}} \sum_{m=0}^{q} \sum_{\ell=0}^{(k-q)/2} i^{m}\, C_{q}^{m}\, C_{(k-q)/2}^{\ell} \sum_{a=0}^{k-2\ell-m} C_{k-2\ell-m}^{a}\, X_{c}^{k-2\ell-m-a} \sum_{b=0}^{2\ell+m} C_{2\ell+m}^{b}\, Y_{c}^{2\ell+m-b} \sum_{x} \sum_{y} (N_x - x)^{a}\, (N_y - y)^{b}\, I(x, y),  (17)

where X_c = x_0 - N_x and Y_c = y_0 - N_y.
Since the coordinates of the pixels in the image are expressed as real numbers, we need to express these coordinates with integers in order to formulate the Zernike moments as a function of the geometric moments. As can be seen in Figure 7, the even rows have to be distinguished from the odd rows. Therefore, the x coordinate is expressed in two different ways according to the type of row (even or odd). x and y may be expressed as
$$x = \begin{cases} (x_d - 0.5)\,p_x + \mathrm{offset}_x, & \text{if } y_d \bmod 2 = 1,\\[2pt] (x_d - 1)\,p_x + \mathrm{offset}_x, & \text{if } y_d \bmod 2 = 0,\end{cases}$$
$$y = (1 - y_d)\,p_y + \mathrm{offset}_y, \qquad (18)$$
where x_d and y_d are positive integers such that 1 ≤ x_d ≤ X_d and 1 ≤ y_d ≤ Y_d, with X_d = 48 corresponding to the number of columns and Y_d = 52 corresponding to the number of rows. p_x (resp., p_y) is the distance between two adjacent columns (resp., rows), and offset_x and offset_y correspond to the new position of the origin of the image in the upper left corner.
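A minimal sketch of the coordinate mapping (18), with p_x, p_y, and the offsets taken as free parameters (the actual camera pitch values are not restated here):

```python
def pixel_position(x_d, y_d, p_x, p_y, offset_x, offset_y):
    """Physical (x, y) of pixel (x_d, y_d) on the staggered grid of eq. (18).
    Odd rows (y_d % 2 == 1) are shifted by half a pixel pitch in x."""
    if y_d % 2 == 1:
        x = (x_d - 0.5) * p_x + offset_x
    else:
        x = (x_d - 1.0) * p_x + offset_x
    y = (1 - y_d) * p_y + offset_y
    return x, y
```

This reproduces the half-pitch stagger between odd and even rows that motivates the two accumulation grids used later.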
8 EURASIP Journal on Embedded Systems
In the following equations, since the first part of the expressions does not change, only the second part is developed:
$$\sum_{x}\sum_{y} (N_x - x)^{a}\, (N_y - y)^{b}\, I(x,y) = \sum_{x_d=1}^{X_d} \sum_{\substack{y_d=1\\ y_d \bmod 2 = 1}}^{Y_d} \Big(N_x - \big((x_d - 0.5)p_x + \mathrm{offset}_x\big)\Big)^{a} \Big(N_y - \big((1 - y_d)p_y + \mathrm{offset}_y\big)\Big)^{b} I(x_d, y_d)$$
$$+ \sum_{x_d=1}^{X_d} \sum_{\substack{y_d=1\\ y_d \bmod 2 = 0}}^{Y_d} \Big(N_x - \big((x_d - 1)p_x + \mathrm{offset}_x\big)\Big)^{a} \Big(N_y - \big((1 - y_d)p_y + \mathrm{offset}_y\big)\Big)^{b} I(x_d, y_d). \qquad (19)$$
Notice that N_x = p_x X_d and N_y = p_y Y_d:
$$\sum_{x}\sum_{y} (N_x - x)^{a}\, (N_y - y)^{b}\, I(x,y) = \sum_{x_d=1}^{X_d} \sum_{\substack{y_d=1\\ y_d \bmod 2 = 1}}^{Y_d} \big((X_d - x_d)p_x + 0.5p_x - \mathrm{offset}_x\big)^{a} \big((Y_d + y_d)p_y - p_y - \mathrm{offset}_y\big)^{b} I(x_d, y_d)$$
$$+ \sum_{x_d=1}^{X_d} \sum_{\substack{y_d=1\\ y_d \bmod 2 = 0}}^{Y_d} \big((X_d - x_d)p_x + p_x - \mathrm{offset}_x\big)^{a} \big((Y_d + y_d)p_y - p_y - \mathrm{offset}_y\big)^{b} I(x_d, y_d). \qquad (20)$$
In this case, we propose to separate the even and odd parts of the image: let y_d = 2y_ed if y_d mod 2 = 0 and y_d = 2y_od − 1 if y_d mod 2 = 1. The sums over y_d are then bounded by Y_d/2, and
$$\sum_{x}\sum_{y} (N_x - x)^{a}\, (N_y - y)^{b}\, I(x,y)$$
$$= \sum_{x_d=1}^{X_d}\sum_{y_{od}=1}^{Y_d/2} \Big((X_d - x_d)p_x + 0.5p_x - \mathrm{offset}_x\Big)^{a} \Big(2p_y\Big(\frac{Y_d}{2} - y_{od}\Big) - 2p_y - \mathrm{offset}_y\Big)^{b} I(x_d, y_{od})$$
$$+ \sum_{x_d=1}^{X_d}\sum_{y_{ed}=1}^{Y_d/2} \Big((X_d - x_d)p_x + p_x - \mathrm{offset}_x\Big)^{a} \Big(2p_y\Big(\frac{Y_d}{2} - y_{ed}\Big) - p_y - \mathrm{offset}_y\Big)^{b} I(x_d, y_{ed}) \qquad (21)$$
$$= \sum_{x_d=1}^{X_d}\sum_{y_{od}=1}^{Y_d/2} I(x_d, y_{od}) \sum_{c=0}^{a} C_{a}^{c}\, (0.5p_x - \mathrm{offset}_x)^{a-c}\, p_x^{c}\, (X_d - x_d)^{c} \sum_{d=0}^{b} C_{b}^{d}\, (-2p_y - \mathrm{offset}_y)^{b-d}\, (2p_y)^{d}\, \Big(\frac{Y_d}{2} - y_{od}\Big)^{d}$$
$$+ \sum_{x_d=1}^{X_d}\sum_{y_{ed}=1}^{Y_d/2} I(x_d, y_{ed}) \sum_{c=0}^{a} C_{a}^{c}\, (p_x - \mathrm{offset}_x)^{a-c}\, p_x^{c}\, (X_d - x_d)^{c} \sum_{d=0}^{b} C_{b}^{d}\, (-p_y - \mathrm{offset}_y)^{b-d}\, (2p_y)^{d}\, \Big(\frac{Y_d}{2} - y_{ed}\Big)^{d} \qquad (22)$$
$$= \sum_{c=0}^{a}\sum_{d=0}^{b} C_{a}^{c}\, M_{\mathrm{odd}}^{a-c}\, N^{c}\, C_{b}^{d}\, O_{\mathrm{odd}}^{b-d}\, P^{d} \sum_{x_d=1}^{X_d}\sum_{y_{od}=1}^{Y_d/2} (X_d - x_d)^{c}\, \Big(\frac{Y_d}{2} - y_{od}\Big)^{d} I(x_d, y_{od})$$
$$+ \sum_{c=0}^{a}\sum_{d=0}^{b} C_{a}^{c}\, M_{\mathrm{even}}^{a-c}\, N^{c}\, C_{b}^{d}\, O_{\mathrm{even}}^{b-d}\, P^{d} \sum_{x_d=1}^{X_d}\sum_{y_{ed}=1}^{Y_d/2} (X_d - x_d)^{c}\, \Big(\frac{Y_d}{2} - y_{ed}\Big)^{d} I(x_d, y_{ed}), \qquad (23)$$
where M_odd = 0.5p_x − offset_x, M_even = p_x − offset_x, N = p_x, O_odd = −2p_y − offset_y, O_even = −p_y − offset_y, and P = 2p_y.
Equation (23) shows that the Zernike moments can be computed from the geometric moments. If we consider two accumulation grids, the first computes the accumulation moments on the odd lines of the image and the second on the even lines. Since the computation is divided into two different parts, the image should be split into two components: the odd component of the image and the even one. Therefore, according to (8), the analogy gives an expression of the Zernike moments as a function of φ_odd (accumulation moments computed from the odd component of the image) and φ_even (accumulation moments computed from the even component of the image) by setting N_x = X_d and N_y = Y_d/2:
$$\sum_{x_d=1}^{X_d}\sum_{y_{od}=1}^{Y_d/2} (X_d - x_d)^{c}\, \Big(\frac{Y_d}{2} - y_{od}\Big)^{d}\, I(x_d, y_{od}) = \sum_{e=0}^{c}\sum_{f=0}^{d} S(c,e)\, S(d,f)\, \varphi_{\mathrm{odd},e,f},$$
$$\sum_{x_d=1}^{X_d}\sum_{y_{ed}=1}^{Y_d/2} (X_d - x_d)^{c}\, \Big(\frac{Y_d}{2} - y_{ed}\Big)^{d}\, I(x_d, y_{ed}) = \sum_{e=0}^{c}\sum_{f=0}^{d} S(c,e)\, S(d,f)\, \varphi_{\mathrm{even},e,f}. \qquad (24)$$
By reinjecting (24) into (23), the Zernike moments are reformulated as follows:
$$Z_{pq} = \frac{p+1}{\pi} \sum_{\substack{k=q\\ p-k\ \mathrm{even}}}^{p} \sum_{\ell=0}^{q} \sum_{j=0}^{(k-q)/2} i^{\ell}\, (-1)^{k}\, C_{q}^{\ell}\, C_{(k-q)/2}^{j}\, \frac{B_{pqk}}{r_{\max}^{k}} \sum_{a=0}^{k-2j-\ell} C_{k-2j-\ell}^{a}\, X_{c}^{k-2j-\ell-a} \sum_{b=0}^{2j+\ell} C_{2j+\ell}^{b}\, Y_{c}^{2j+\ell-b}$$
$$\times \Bigg[ \sum_{c=0}^{a}\sum_{d=0}^{b} C_{a}^{c}\, M_{\mathrm{odd}}^{a-c}\, N^{c}\, C_{b}^{d}\, O_{\mathrm{odd}}^{b-d}\, P^{d} \sum_{e=0}^{c}\sum_{f=0}^{d} S(c,e)\, S(d,f)\, \varphi_{\mathrm{odd},e,f} + \sum_{c=0}^{a}\sum_{d=0}^{b} C_{a}^{c}\, M_{\mathrm{even}}^{a-c}\, N^{c}\, C_{b}^{d}\, O_{\mathrm{even}}^{b-d}\, P^{d} \sum_{e=0}^{c}\sum_{f=0}^{d} S(c,e)\, S(d,f)\, \varphi_{\mathrm{even},e,f} \Bigg]. \qquad (25)$$
In an analogous way, the coordinates of the center of the image (x_0, y_0) can be computed from the accumulation moments:
$$x_0 = \frac{m_{01}}{m_{00}}, \qquad y_0 = \frac{m_{10}}{m_{00}}, \qquad (26)$$
where
$$m_{00} = \varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0},$$
$$m_{01} = (-1) \sum_{b=0}^{1} C_{1}^{b}\, (-N_y)^{1-b} \Bigg[ \sum_{d=0}^{b} C_{b}^{d}\, O_{\mathrm{odd}}^{b-d}\, P^{d} \sum_{f=0}^{d} S(d,f)\, \varphi_{\mathrm{odd},0,f} + \sum_{d=0}^{b} C_{b}^{d}\, O_{\mathrm{even}}^{b-d}\, P^{d} \sum_{f=0}^{d} S(d,f)\, \varphi_{\mathrm{even},0,f} \Bigg]$$
$$= (-1)\Big( (-N_y)\big(\varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0}\big) + \big( O_{\mathrm{odd}}\varphi_{\mathrm{odd},0,0} + O_{\mathrm{even}}\varphi_{\mathrm{even},0,0} + P\,S(1,0)\big(\varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0}\big) + P\,S(1,1)\big(\varphi_{\mathrm{odd},0,1} + \varphi_{\mathrm{even},0,1}\big)\big)\Big),$$
$$m_{10} = (-1) \sum_{a=0}^{1} C_{1}^{a}\, (-N_x)^{1-a} \Bigg[ \sum_{c=0}^{a} C_{a}^{c}\, O_{\mathrm{odd}}^{a-c}\, P^{c} \sum_{e=0}^{c} S(c,e)\, \varphi_{\mathrm{odd},e,0} + \sum_{c=0}^{a} C_{a}^{c}\, O_{\mathrm{even}}^{a-c}\, P^{c} \sum_{e=0}^{c} S(c,e)\, \varphi_{\mathrm{even},e,0} \Bigg]$$
$$= (-1)\Big( (-N_x)\big(\varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0}\big) + \big( O_{\mathrm{odd}}\varphi_{\mathrm{odd},0,0} + O_{\mathrm{even}}\varphi_{\mathrm{even},0,0} + P\,S(1,0)\big(\varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0}\big) + P\,S(1,1)\big(\varphi_{\mathrm{odd},1,0} + \varphi_{\mathrm{even},1,0}\big)\big)\Big). \qquad (27)$$
This yields
$$x_0 = N_y - P\,S(1,0) - \frac{O_{\mathrm{odd}}\,\varphi_{\mathrm{odd},0,0} + O_{\mathrm{even}}\,\varphi_{\mathrm{even},0,0}}{\varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0}} - \frac{P\,S(1,1)\big(\varphi_{\mathrm{odd},0,1} + \varphi_{\mathrm{even},0,1}\big)}{\varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0}},$$
$$y_0 = N_x - P\,S(1,0) - \frac{O_{\mathrm{odd}}\,\varphi_{\mathrm{odd},0,0} + O_{\mathrm{even}}\,\varphi_{\mathrm{even},0,0}}{\varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0}} - \frac{P\,S(1,1)\big(\varphi_{\mathrm{odd},1,0} + \varphi_{\mathrm{even},1,0}\big)}{\varphi_{\mathrm{odd},0,0} + \varphi_{\mathrm{even},0,0}}. \qquad (28)$$
(28)
We have developed here an algorithm enabling the
computation of Zernike moments based on the moment
generator using the accumulation moments. This algorithm
has the advantage to be used on images which have particular
topologies since their mesh grid is regular or semiregular by
the use of a second accumulation grid. The second advantage
of this algorithm is its simplicity to be implemented on
FPGA, for instance. The base of this algorithm relies on the
accumulation moments and is easily computed thanks to a
simple accumulation grid.
4.1.2. Architecture Description. To make the exploitation of (25) easier, we need to reorder the terms to get an expression of the Zernike moments such as
$$Z_{pq} = \sum_{e}\sum_{f} \beta^{p,q}_{\mathrm{odd},e,f}\, \varphi_{\mathrm{odd},e,f} + \beta^{p,q}_{\mathrm{even},e,f}\, \varphi_{\mathrm{even},e,f}, \qquad (29)$$
where
$$\beta^{p,q}_{\mathrm{odd},e,f} = \frac{p+1}{\pi} \sum_{k=q}^{p} t_{k,e}\, t_{k,f} \sum_{\ell=0}^{q} \sum_{j=0}^{(k-q)/2} \delta^{k}_{e,\ell,j}\, \delta^{k}_{f,\ell,j}\, i^{\ell}\, (-1)^{k}\, \frac{B_{pqk}}{r_{\max}^{k}}\, C_{q}^{\ell}\, C_{(k-q)/2}^{j} \sum_{a=0}^{k-2j-\ell} \sum_{b=0}^{2j+\ell} t_{a,e}\, t_{b,f}\, C_{k-2j-\ell}^{a}\, C_{2j+\ell}^{b}\, X_{c}^{k-2j-\ell-a}\, Y_{c}^{2j+\ell-b} \sum_{c=0}^{a}\sum_{d=0}^{b} C_{a}^{c}\, C_{b}^{d}\, M_{\mathrm{odd}}^{a-c}\, N^{c}\, O_{\mathrm{odd}}^{b-d}\, P^{d}\, S(c,e)\, S(d,f),$$
$$\beta^{p,q}_{\mathrm{even},e,f} = \frac{p+1}{\pi} \sum_{k=q}^{p} t_{k,e}\, t_{k,f} \sum_{\ell=0}^{q} \sum_{j=0}^{(k-q)/2} \delta^{k}_{e,\ell,j}\, \delta^{k}_{f,\ell,j}\, i^{\ell}\, (-1)^{k}\, \frac{B_{pqk}}{r_{\max}^{k}}\, C_{q}^{\ell}\, C_{(k-q)/2}^{j} \sum_{a=0}^{k-2j-\ell} \sum_{b=0}^{2j+\ell} t_{a,e}\, t_{b,f}\, C_{k-2j-\ell}^{a}\, C_{2j+\ell}^{b}\, X_{c}^{k-2j-\ell-a}\, Y_{c}^{2j+\ell-b} \sum_{c=0}^{a}\sum_{d=0}^{b} C_{a}^{c}\, C_{b}^{d}\, M_{\mathrm{even}}^{a-c}\, N^{c}\, O_{\mathrm{even}}^{b-d}\, P^{d}\, S(c,e)\, S(d,f) \qquad (30)$$
with
$$t_{g,h} = \begin{cases} 1, & \text{if } h \le g,\\ 0, & \text{else,}\end{cases} \qquad \delta^{k}_{e,\ell,j} = \begin{cases} 1, & \text{if } e \le k - 2j - \ell,\\ 0, & \text{else,}\end{cases} \qquad \delta^{k}_{f,\ell,j} = \begin{cases} 1, & \text{if } f \le 2j + \ell,\\ 0, & \text{else.}\end{cases} \qquad (31)$$
The general scheme of the Zernike moments architecture (see Figure 8) can be described as follows. (i) The image is first divided into two parts: the odd component, which only contains the odd rows of the image, and the even component, which contains the even rows. (ii) The accumulation moments are computed in parallel on two accumulation grids. (iii) On the one hand, the accumulation moments of orders (0,0), (0,1), and (1,0) reach the block that computes X_c^{k−2j−ℓ−a}, Y_c^{2j+ℓ−b}, and r_max^k. On the other hand, the accumulation moments are delivered to the Zernike computation block, waiting for the completion of these computations. (iv) As soon as X_c^{k−2j−ℓ−a}, Y_c^{2j+ℓ−b}, and r_max^k are computed, the coefficients β^{p,q}_{odd,e,f} and β^{p,q}_{even,e,f} can be computed. (v) The coefficients are transmitted to the final computation block in order to evaluate the
Figure 8: Zernike architecture general scheme (image → accumulation grids; (X_c, Y_c) and r_max computation; coefficients computation; Zernike computation → Z_pq).
Zernike moments according to (29). Their modulus is then computed.
The scheme of an accumulation grid of width 4 is given in Figure 9; it consists of a simple series of accumulators. They are arranged so that the accumulation is first computed on each row via an accumulation row (the Σ_m accumulators) and then performed on the columns (the set of Σ_mn accumulators). As soon as a row ends in a given accumulator Σ_m, the result of this accumulator is passed to its corresponding first column accumulator, and Σ_m0 and Σ_m are cleared. At the same time, all the corresponding column accumulators transmit their accumulation to the next one.
The registers used between the column accumulators are synchronized at the end of each row, so their clock enable depends on the image topology. In our case, corners have been filled with zeros before dividing the image. Therefore, the size of each image component is X_d × Y_d/2. In this case, the accumulation moment φ_{e,f} is computed in X_d(Y_d/2 + f) + e clock cycles from the moment when the first pixel arrives into the accumulation grid.
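The link between a chain of accumulators and moments in (N − index) can be illustrated in a few lines: passing a row through k + 1 cascaded accumulators leaves Σ_i C(N − i + k, k) · row[i], from which the ordinary powers (N − i)^c follow by small integer combinations (the role played by the matrix S above). A sketch, with our own variable names:

```python
from itertools import accumulate
from math import comb

def accumulation_moments(row, order):
    """phi[k] = value left in the (k+1)-th cascaded accumulator after the row
    has passed: phi[k] = sum_i C(N - i + k, k) * row[i-1], for i = 1..N."""
    phi = []
    data = list(row)
    for _ in range(order + 1):
        data = list(accumulate(data))  # one more accumulator in the chain
        phi.append(data[-1])
    return phi

row = [3, 1, 4, 1, 5]
N = len(row)
phi = accumulation_moments(row, 2)
# Ordinary moments of (N - i) for comparison.
m0 = sum(row)
m1 = sum((N - i) * v for i, v in enumerate(row, start=1))
m2 = sum((N - i) ** 2 * v for i, v in enumerate(row, start=1))
```

Here m0 = phi[0], m1 = phi[1] − phi[0], and m2 = 2·phi[2] − 3·phi[1] + phi[0]: a triangular (Stirling-like) change of basis, which is exactly what the S(c, e) coefficients perform in (24).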
One major point of the Zernike moments implementation is the computation of the coefficients. The main issue of this computation lies in the trade-off between the number of coefficients stored on the chip and the number of operations needed to compute these coefficients. Table 2 shows the number of operations that are necessary for the computation of the coefficients for Zernike moments up to order 8. Configuration 1 corresponds to the case where B_pqk (55 values), C^k_p (45 values), the matrix S (45 values), and the powers M^p_odd, M^p_even, N^p, O^p_odd, O^p_even, and P^p (9 × 6 = 54 values) are stored, that is, 199 stored values. The second configuration corresponds to storing the results of the products (−1)^k (B_pqk / r^k_max) C^ℓ_q C^j_{(k−q)/2}, C^a_{k−2j−ℓ} C^b_{2j+ℓ}, C^c_a M^{a−c}_odd N^c, C^c_a M^{a−c}_even N^c, C^d_b O^{b−d}_odd P^d, C^d_b O^{b−d}_even P^d, and S(c,e)S(d,f). Without any optimisation of the storage of these values, the occupied memory would be huge (4564 values), but by exploiting the redundancy in each group and the centralized storage of the value 1, the number of stored values may be reduced to 1203. Note that even if the number of values to store increases substantially, the number of multiplications is divided by two compared to the first configuration.
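The stored-value count of configuration 1 can be cross-checked with a few comprehensions (the grouping of the 54 constant values into 9 powers of 6 constants is our reading of the text):

```python
ORDER = 8

# B_pqk values: 0 <= q <= p <= ORDER with p - q even, and k = q, q+2, ..., p.
n_B = sum(1 for p in range(ORDER + 1)
            for q in range(p % 2, p + 1, 2)
            for k in range(q, p + 1, 2))

# Binomial coefficients C_p^k for 0 <= k <= p <= ORDER.
n_C = sum(p + 1 for p in range(ORDER + 1))

# Triangular matrix S: entries S(c, e) with 0 <= e <= c <= ORDER.
n_S = sum(c + 1 for c in range(ORDER + 1))

# Powers 0..ORDER of the six constants M_odd, M_even, N, O_odd, O_even, P.
n_const = (ORDER + 1) * 6

total = n_B + n_C + n_S + n_const
```

Running this gives 55 + 45 + 45 + 54 = 199, matching the counts quoted for configuration 1.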
Figure 10 shows the envisaged computation of the Zernike coefficients under the second configuration. The control block deals with the bounds of the sums. The look-up tables T1, T2, T3, T4, T5, T6, and T7 correspond, respectively, to C^c_a M^{a−c}_odd N^c, C^d_b O^{b−d}_odd P^d,
Figure 9: Example of accumulation grid of width 4.
Figure 10: Computation of Zernike coefficients (pipelined MAC levels; Level 1: 0 ≤ c ≤ a and 0 ≤ d ≤ b; Level 2: 0 ≤ a ≤ k−2j−ℓ and 0 ≤ b ≤ 2j+ℓ, if e ≤ a and f ≤ b; Level 3: 0 ≤ ℓ ≤ q and 0 ≤ j ≤ (k−q)/2, if e ≤ k−2j−ℓ and f ≤ 2j+ℓ; Level 4: q ≤ k ≤ p with (p−k) even, if e ≤ p and f ≤ p).
Table 2: Number of operations executed to compute the β^{p,q}_{e,f} coefficients.

p | Nb. accumulations | Nb. multiplications (config. 1) | Nb. multiplications (config. 2)
0 | 2      | 25      | 14
1 | 16     | 183     | 102
2 | 101    | 1225    | 672
3 | 349    | 5543    | 3000
4 | 1311   | 22987   | 12266
5 | 4267   | 77637   | 41010
6 | 13642  | 241767  | 126592
7 | 38860  | 660481  | 343560
8 | 104663 | 1692910 | 875720
C^c_a M^{a−c}_even N^c, C^d_b O^{b−d}_even P^d, S(c,e)S(d,f), C^a_{k−2j−ℓ} C^b_{2j+ℓ}, and (−1)^k (B_pqk / r^k_max) C^ℓ_q C^j_{(k−q)/2}. T1-2 (resp., T3-4) means that T1 and T2 (resp., T3 and T4) are first read and then the product of the read values is computed.
A Zernike computation block aims to compute the modulus of the Zernike moments from the accumulation moments provided by the grids and from the module that furnishes the coefficients (see Figure 8). This block consists in summing the different coefficients and in computing the modulus of each moment. In order to reduce the amount of logic resources, the computation of the square root is simplified according to the following approximation:
$$\sqrt{x^2 + y^2} \approx \max\Big(|x|,\ |y|,\ \frac{3}{4}\big(|x| + |y|\big)\Big). \qquad (32)$$
This approximation is often utilized in image processing and does not significantly impact the final results.
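A sketch of the approximation (32); the angle sweep below measures its relative error, which stays within roughly ±7%, small enough for the classification stage:

```python
import math

def mag_approx(x, y):
    """max(|x|, |y|, 3/4 * (|x| + |y|)): cheap, multiplier-free (shift-and-add)
    estimate of sqrt(x^2 + y^2), as in eq. (32)."""
    ax, ay = abs(x), abs(y)
    return max(ax, ay, 0.75 * (ax + ay))

# Worst-case relative error over a sweep of angles on the unit circle.
worst = max(abs(mag_approx(math.cos(t), math.sin(t)) - 1.0)
            for t in [i * math.pi / 1800 for i in range(3600)])
```

The factor 3/4 is convenient in hardware because 0.75·z = z/2 + z/4 needs only two shifts and one add.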
4.1.3. FPGA Implementation of Zernike Moments Computation. In order to compute the Zernike moments from the accumulation moments, we propose an original architecture, presented in Figure 10. This architecture is very regular, which simplifies the implementation on an FPGA target. Furthermore, the required hardware is simple to design for both the moments accumulation and the moments computation. In fact, the computations are based on a multiplier and an adder. These constitute the MAC (Multiply-ACcumulate) operator, which is widely available in current FPGA devices. In order to improve performance, MAC operators are integrated in some FPGA devices as hardwired components, like the DSP48 blocks in the Xilinx Virtex4.
Two implementation approaches are possible, in which either hardware or time optimization is considered.
Hardware Optimization. This approach partially reproduces the temporal model of processors. The computations are performed iteratively and the coefficients are read from the tables sequentially. The results can be temporarily stored in paged memory rather than registers. In this approach, the total number of iterations is directly proportional to the order of the desired moment and it remains relatively
Figure 11: Using a MAC operator in a data flow architecture.
Figure 12: Reduction of the calculation resources by reusing hardwired operators.
small (a few thousand only). Figure 11 depicts one of the two variants of realization: with or without pipelined computation. The pipelined organization increases the calculation frequency of the iterations.
Time Optimization. In that case, we consider that the amount of computation hardware is sufficient. Therefore, the architecture includes all necessary pipelined operators, as suggested in Figure 10. The intermediate results are stored in registers. This solution also offers the possibility of reducing the number of operators by reusing the same hardware resources, as shown in Figure 12.
Figure 13 describes the hardware implementation of the Zernike computation block. Its main objective is to generate the different Zernike moments from the accumulation moments calculated with the accumulation grids and from the coefficients computation module. It mainly consists of MAC blocks and of a module destined to compute the square root of the modulus according to (32). Only 75 slices are required to implement the entire block.
4.2. Hardware Implementation of the Neural Network. The parallel nature of neural networks makes them very suitable for hardware implementation. Several studies have been performed so far, allowing complex configurations to be implemented in reconfigurable circuits [15, 16].
The proposed architecture strives to reduce the amount of logic to be utilized. This is mainly due to the fact that the neural network has to be implemented with its associated
Figure 13: Computation of Zernike moments from the accumulation moments.
complex preprocessing that may require a lot of resources. An example of such an architecture is presented in Figure 14. In this example, the neural architecture implements a 5-input MLP with 7 hidden nodes and 3 outputs. These parameters are easily modifiable since the proposed circuit is scalable.
Input data are accepted sequentially and applied to the series of multipliers. A_j corresponds to the jth input of the present set, whereas B_j corresponds to the jth input of the next set. Data arrive at each clock cycle.
At each clock cycle, at any particular level of adder, apart from the addition between the multiplier output and the sum from the previous level, the multiplication of the next set of inputs is simultaneously performed at the adjacent multiplier. The sum thus ripples and accumulates through the central adders (48 bits) until it is fed to a barrel shifter that translates the data into a 16-bit address. The obtained sum addresses a sigmoid block memory (SIGMOID0) containing 65536 values of 18 bits.
This block feeds the outputs of the hidden layer sequentially to three MAC units for the output layer calculation. Finally, a multiplexer distributes serially the results of the output layer to another sigmoid block memory (SIGMOID1). After a study on data representation, it has been decided to code the incoming data on 18 bits. Weights are stored in ROMs (Read-Only Memories) containing 256 words of 18 bits. The control of the entire circuit is performed by a simple state machine that organizes the sequence of computations and the memory management.
The number of multipliers required for the network is I + O, where I is the number of inputs and O is the number of outputs. Considering that the number of hidden nodes may be large compared to the number of inputs and outputs, the adopted solution does not affect the number of multipliers, which is a significant saving. In this context, it is also important to note that the design is very easily scalable to accommodate more hidden, input, or output nodes. For example, adding a hidden node does not impact the number of resources but requires an additional cycle of computation. Adding an input may be accommodated by the addition of another ROM, multiplier, and adder set to the series of adders at the centre (part HL of the figure). Moreover, the
Figure 14: Example of the hardware implementation of a basic neural network.
Table 3: Summary of occupied resources.

Module               | Used logic slices/available | Used DSP blocks/available | Used memory bits/available (Kb)
Accumulation moments | 1786/49152                  | 0/96                      | 0/4320
Zernike moments      | 75/49152                    | 60/96                     | 21/4320
Neural network       | 477/49152                   | 28/96                     | 2808/4320
Total                | 2338/49152 (4%)             | 88/96 (92%)               | 2829/4320 (65.5%)
addition of an output node can be fulfilled by adding another ROM, MAC unit, and sigmoid block to the part OL of the figure.
Another advantage of the architecture is that a single activation function (sigmoid block) is required to compute the complete hidden layer. This block consists of a Look-Up Table (LUT) that stores 65536 values of the function.
In general, the time required to obtain the outputs after the arrival of the first input is fixed to I + H + 6 cycles, where I is the number of inputs and H is the number of hidden units. In every cycle, I + O multiplications are performed (O is the number of output units).
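The LUT-based activation can be sketched in software as follows; the input range mapped onto the 16-bit address is an assumption (the paper fixes only the table size, 65536 entries of 18 bits):

```python
import math

LUT_BITS = 16                 # 65536 entries, as in the SIGMOID0 block
X_MIN, X_MAX = -8.0, 8.0      # assumed input range covered by the table

# Precomputed sigmoid table, one entry per 16-bit address.
TABLE = [1.0 / (1.0 + math.exp(-(X_MIN + (X_MAX - X_MIN) * i / (2 ** LUT_BITS - 1))))
         for i in range(2 ** LUT_BITS)]

def sigmoid_lut(x):
    """Clamp x to the table range, quantize to a 16-bit address, look up."""
    x = min(max(x, X_MIN), X_MAX)
    addr = round((x - X_MIN) / (X_MAX - X_MIN) * (2 ** LUT_BITS - 1))
    return TABLE[addr]
```

In the hardware, the barrel shifter plays the role of the scaling step that produces the 16-bit address from the 48-bit accumulator sum.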
5. Performances
The complete architecture (preprocessing + neural network) has been implemented in a Xilinx Virtex4 (XC4VLX100) FPGA, which is the part that has been retained for the trigger implementation. This type of reconfigurable circuit exhibits a lot of dedicated resources, such as memory blocks or DSP blocks, that allow a MAC to be computed very efficiently.
5.1. Resources Requirements. The resources that are required to implement the global L2 trigger are given in Table 3.
The accumulation grid is essentially realized with logic resources. No DSP block is utilized at this level. The computation of Zernike moments from the accumulation moments makes intensive use of parallelism. Five computation stages enable 25 Zernike moments to be computed very rapidly and make use of 60 DSP blocks.
Concerning the hardware implementation of the neural network, it is important to notice that, independently of the configuration, the amount of used resources is very low. Nevertheless, one may deplore an important usage of memory blocks destined to store the values of the sigmoid functions. This issue may be circumvented in cases where hardware resources constitute an issue: a modified activation function sgn(x)(1 − 2^{−|x|}) could be used [17]. It has a shape quite similar to the sigmoid function and is very easy to implement in hardware with just a number of shifts and adds. This function can be evaluated with an error of less than 4.3%.
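One plausible reading of this approximation, rescaled to the (0, 1) range of the logistic sigmoid for comparison (the exact form used in [17] may differ), is sketched below; the measured deviation stays under the quoted 4.3%:

```python
import math

def fast_act(x):
    """sgn(x) * (1 - 2**(-|x|)): a shift-and-add friendly sigmoid-like curve."""
    s = 1.0 if x >= 0 else -1.0
    return s * (1.0 - 2.0 ** (-abs(x)))

def fast_sigmoid(x):
    """Rescaled to (0, 1) for comparison with the logistic function (our mapping)."""
    return 0.5 * (1.0 + fast_act(x))

# Maximum absolute deviation from the logistic sigmoid over [-8, 8].
err = max(abs(fast_sigmoid(x) - 1.0 / (1.0 + math.exp(-x)))
          for x in [i / 100.0 for i in range(-800, 801)])
```

In fixed point, 2^{−|x|} reduces to a right shift by the integer part of |x| plus a short correction for the fractional part, which is why the function suits FPGA logic.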
According to Table 3, it is clear that the entire system fits in an FPGA without consuming too much logic (4%). Moreover, the complete architecture has been devised in order to take full benefit of the intrinsic dedicated resources of the FPGA, that is, the DSP and memory blocks.
5.2. Timing. The computation time of the complete trigger is summarized in Table 4. According to this table, the timing constraints imposed by the HESS system have been met, since the mean decision frequency is fixed at 3.5 kHz, that is, 285 microseconds per event. The global latency of the proposed L2 trigger is 115.3 microseconds, which makes it possible to envisage further improvements.
It is important to note that most of the computation time
is monopolized by the computation of Zernike moments
from the accumulation moments. This is mainly due to
the fact that the number of accumulations to perform is
huge (104663 accumulations for an order-8) and that these
Table 4: Timing performances.

Module               | Processing time (μs)
Accumulation moments | 13.5
Zernike moments      | 101.4
Neural network       | 0.4
Total                | 115.3
computations are performed iteratively. Even if we have decided to parallelize the architecture in five stages, the number of iterations remains high (around 30 000). Work is currently in progress to optimize the computations in this block for further improvements.
The maximum clock frequency has been estimated at 120 MHz, and at 366 MHz for the DSP blocks.
6. Conclusion
In this article, we have presented an original solution that may be seen as an intelligent way of triggering data in the HESS Phase-II experiment. The system relies on the utilization of image processing algorithms in order to increase the trigger efficiency. The hardware implementation has represented a challenge because of the relatively strong timing constraints (285 microseconds to process all algorithms). This problem has been circumvented by taking advantage of the nature of the algorithms. All these concepts are implemented making intensive use of FPGA circuits, which are interesting for several reasons. First, the current advances in reconfigurable technology make FPGAs an attractive alternative compared to very powerful circuits such as ASICs. Moreover, their relatively small cost makes it possible to rapidly implement a prototype design without major developmental constraints. The reconfigurability also constitutes a major point: it allows the whole system to be configured according to the application needs, enabling flexibility and adaptivity. For example, in the context of the HESS project, it is conceivable to reconfigure the chip according to the surrounding noise or to deal with specific experimental conditions.
References
[1] J. A. Hinton, "The status of the HESS project," New Astronomy Reviews, vol. 48, no. 5-6, pp. 331–337, 2004.
[2] S. Funk, G. Hermann, J. Hinton, et al., "The trigger system of the HESS telescope array," Astroparticle Physics, vol. 22, no. 3-4, pp. 285–296, 2004.
[3] E. Delagnes, Y. Degerli, P. Goret, P. Nayman, F. Toussenel, and P. Vincent, "SAM: a new GHz sampling ASIC for the HESS-II front-end electronics," Nuclear Instruments and Methods in Physics Research, vol. 567, no. 1, pp. 21–26, 2006.
[4] A. M. Hillas, "Cerenkov light images of EAS produced by primary gamma rays and by nuclei," in Proceedings of the 19th International Cosmic Ray Conference (ICRC '85), San Diego, Calif, USA, August 1985.
[5] B. Denby, "Neural networks in high energy physics: a ten year perspective," Computer Physics Communications, vol. 119, no. 2, pp. 219–231, 1999.
[6] C. Kiesling, B. Denby, J. Fent, et al., "The H1 neural network trigger project," in Advanced Computing and Analysis Techniques in Physics Research, vol. 583, pp. 36–44, 2001.
[7] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.
[8] M. R. Teague, "Image analysis via the general theory of moments," Journal of the Optical Society of America, vol. 70, pp. 920–930, 1979.
[9] F. Zernike, "Beugungstheorie des Schneidenverfahrens und seiner verbesserten Form, der Phasenkontrastmethode," Physica, vol. 1, pp. 689–704, 1934.
[10] S. Khatchadourian, J.-C. Prévotet, and L. Kessal, "A neural solution for the level 2 trigger in gamma ray astronomy," in Proceedings of the 11th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT '07), Proceedings of Science, Nikhef, Amsterdam, The Netherlands, April 2007.
[11] S. O. Belkasim, M. Ahmadi, and M. Shridhar, "Efficient algorithm for fast computation of Zernike moments," Journal of the Franklin Institute, vol. 333, pp. 577–581, 1996.
[12] E. C. Kintner, "On the mathematical properties of the Zernike polynomials," Journal of Modern Optics, vol. 23, pp. 679–680, 1976.
[13] L. Kotoulas and I. Andreadis, "Real-time computation of Zernike moments," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 6, pp. 801–809, 2005.
[14] M. Hatamian, "A real-time two-dimensional moment generating algorithm and its single chip implementation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 546–553, 1986.
[15] J.-C. Prévotet, B. Denby, P. Garda, B. Granado, and C. Kiesling, "Moving NN triggers to Level-1 at LHC rates," Nuclear Instruments and Methods in Physics Research A, vol. 502, no. 2-3, pp. 511–512, 2003.
[16] A. R. Omondi and J. C. Rajapakse, FPGA Implementations of Neural Networks, Springer, 2006.
[17] M. Skrbek, "Fast neural network implementation," Neural Network World, vol. 9, pp. 375–391, 1999.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 542035, 15 pages
doi:10.1155/2009/542035
Research Article
Evaluation and Design Space Exploration of a Time-Division Multiplexed NoC on FPGA for Image Analysis Applications
Linlin Zhang,^{1,2,3} Virginie Fresse,^{1,2,3} Mohammed Khalid,^{4} Dominique Houzet,^{5} and Anne-Claire Legrand^{1,2,3}
^1 Université de Lyon, 42023 Saint-Etienne, France
^2 CNRS, UMR 5516, Laboratoire Hubert Curien, 42000 Saint-Etienne, France
^3 Université de Saint-Etienne, Jean-Monnet, 42000 Saint-Etienne, France
^4 RCIM, Department of Electrical & Computer Engineering, University of Windsor, Windsor, ON, Canada N9B 3P4
^5 GIPSA-Lab, Grenoble, France
Correspondence should be addressed to Linlin Zhang, linlinzhang0511@gmail.com
Received 1 March 2009; Revised 17 July 2009; Accepted 17 November 2009
Recommended by Ahmet T. Erdogan
The aim of this paper is to present an adaptable Fat Tree NoC architecture for Field Programmable Gate Array (FPGA) designed for image analysis applications. A traditional Network on Chip (NoC) is not optimal for dataflow applications with a large amount of data. On the opposite, point-to-point communications are designed from the algorithm requirements, but they are expensive in terms of resources and wires. We propose a dedicated communication architecture for image analysis algorithms. This communication mechanism is a generic NoC infrastructure dedicated to dataflow image processing applications, mixing circuit-switching and packet-switching communications. The complete architecture integrates two dedicated communication architectures and reusable IP blocks. Communications are based on the NoC concept to support the high bandwidth required for a large number and type of data. For data communication inside the architecture, an efficient time-division multiplexed (TDM) architecture is proposed. This NoC uses a Fat Tree (FT) topology with Virtual Channels (VCs) and packet switching with fixed routes. Two versions of the NoC are presented in this paper. The results of their implementations and their Design Space Exploration (DSE) on Altera Stratix II are analyzed and compared with a point-to-point communication and illustrated with a multispectral image application. Results show that a point-to-point communication scheme is not efficient for large amounts of multispectral image data communications. An NoC architecture uses only 10% of the memory blocks required for a point-to-point architecture but seven times more logic elements. This resource allocation is more adapted to image analysis algorithms, as memory elements are a critical point in embedded architectures. An FT NoC-based communication scheme for data transfers provides a more appropriate solution for resource allocation.
Copyright 2009 Linlin Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Image analysis applications consist of extracting some relevant parameters from one or several images or data. Embedded systems for real-time image analysis allow computers to take appropriate actions for processing images under hard real-time constraints and often in harsh environments. Current image analysis algorithms are resource intensive, so traditional PC- or DSP-based systems are unsuitable as they cannot achieve the required high performance.
An increase in chip density following Moore's law allows the implementation of ever larger systems on a single chip. Known as systems on chip (SoC), these systems usually contain several CPUs, memories, and custom hardware modules. Such SoCs can also be implemented on FPGA. For embedded real-time image processing algorithms, FPGA devices are widely used because they can achieve high-speed performance in a relatively small footprint with low power compared to GPU architectures [1]. Modern FPGAs integrate many heterogeneous resources on one single chip. The resources on an FPGA continue to increase at such a rate that one FPGA alone is capable of handling all processing operations, including the acquisition part. That means that incoming data from the sensor or any other acquisition device are directly processed by the FPGA. No other external resources are required for many applications (some algorithms might use more than one FPGA). Today, many designers of such systems choose to build their designs on Intellectual Property (IP) cores connected to traditional buses. Most IP cores are already predesigned and pretested, and they can be immediately reused [2–4]. Without reinventing the wheel, the existing IPs and buses are directly used and mapped to build the dedicated architecture. Although the benefits of using existing IPs are substantial, buses are now replaced by NoC communication architectures for a more systematic, predictive, and reliable architecture design. Network on Chip architectures are classified according to their switching technique and topology. Few NoC architectures for FPGA are proposed in the literature. Packet switching with wormhole routing is used by Hermes [5], IMEC [6], SoCIN [7], and Extended Mesh [8] NoCs. PNoC [9] and RMBoC [10] use only circuit switching, whereas the NoC of Lee [11] uses packet switching. For the topology, Hermes uses a 2D mesh, the NoC from IMEC uses a 2D torus, and SoCIN/RASoC can use a 2D mesh or a torus. RMBoC from [12] has a 1D or 2D mesh topology. An extended mesh is used for the Extended Mesh NoC. HIBI uses a hierarchical bus. PNoC and the NoC from Lee have a custom topology.
Existing NoC architectures for FPGA are not adapted to image analysis algorithms, as the amount of input data is high compared to the results and commands. A dedicated and optimized communication architecture is required, and it is most of the time designed from the algorithm requirements.
The Design Space Exploration (DSE) of an adaptable architecture for image analysis applications on FPGA with IP designs remains a difficult task. It is hard to predict the number and the type of the required IPs and buses from a set of existing IPs in a library.
In this paper, we present an adaptable communication architecture dedicated to image analysis applications. The architecture is based on a set of locally synchronous modules. The communication architecture is a double NoC architecture: one NoC structure dedicated to commands and results, the other one dedicated to internal data transfers. The data communication structure is designed to be adapted to the application requirements (number of tasks, required connections, size of transmitted data). Proposing an NoC paradigm helps the dimensioning and exploration of the communication between IPs as well as their integration in the final system.
The paper is organised into 5 further sections. Section 2 presents the global image analysis architecture and focuses on the data flow. Special communication units are set up to satisfy the application constraints. Section 3 presents two versions of the NoC for the data flow, which are built on these basic communication units. The NoC architectures are totally parameterized. Section 4 presents one image analysis application: a multispectral image authentication. A DSE method is used to find the best parameters for the NoC architecture according to the application. Section 5 gives the conclusion and perspectives.
2. Architecture Dedicated to
Image Analysis Algorithms

This architecture is designed for most image analysis
applications. Characteristics of such applications are used
to propose a parameterized and adaptable architecture for
FPGA.
2.1. Characteristics of Image Analysis Algorithms. Image
analysis consists of extracting some relevant parameters
from one or several images. Image analysis examples are
object segmentation, feature extraction, motion detection,
object tracking, and so forth [13, 14]. Any image analysis
application requires four types of operations:

(i) acquisition operations,
(ii) storage operations,
(iii) processing operations,
(iv) control operations.

A characteristic of image analysis applications is the unbalanced
data flow between the input and the output. The
input data flow corresponds to a high number of pixels
(images) whereas the output data flow represents little data
information (selective results). From these unbalanced flows,
two different communication topologies can be defined, each
one adapted to the speed and flow of data.
2.2. An Adaptable Structure for Image Analysis Algorithms.
The architecture presented here is designed from the
characteristics of image analysis applications. The structure
of the architecture contains four types of modules;
each one corresponds to one of the four types of operations.
All these modules are designed as several VHDL Intellectual
Property (IP) nodes. They are presented in detail in
[13].

(i) The Acquisition Module produces the data that are
processed by the system. The number of acquisition
modules depends on the application and the number
of types of required external interfaces.

(ii) The Storage Module stores incoming images or
any other data inside the architecture. Writing and
reading cycles are supervised by the control module.
Whenever possible, memory banks are FPGA-embedded
memories.

(iii) The Processing Module contains the logic that is
required to execute one task of the algorithm. The
number of processing modules depends on the
number of tasks of the application. Moreover, more
than one identical processing module can be used in
parallel to improve timing performance.
The number of these modules is only limited by
the size of the target FPGA. The control of the
system is not distributed among all modules but is fully
centralized in a single control module.
Figure 1: The proposed adaptable architecture dedicated to image
analysis applications (a multispectral camera feeds the acquisition
node; the storage, control, and processing nodes PN 1 to PN n
communicate through wrappers, with separate data and
result/command flows).
(iv) The Control Module performs decisions and scheduling
of operations and sends commands to the other
modules. All the commands are sent from this
module to the other modules. In the same way,
this module receives result data from the processing
modules.

The immediate reuse of all modules is possible, as
all modules are designed with an identical structure and
interface, given in Figure 1.

To run each node at its best frequency, the Globally
Asynchronous Locally Synchronous (GALS) concept is used in
this architecture. The frequency of each type of node
in the architecture depends on the system requirements and
the tasks of the application.
2.3. Structure of Modules/NoC for Command and Results. The
modular principle of the architecture can be seen at different
levels: one type of operation is implemented by means
of a module (acquisition, storage, processing, etc.). Each
module includes units that carry out a function (decoding,
control, correlation, data interface, etc.), and these units are
built from basic blocks (memory, comparator, etc.). Some
units can be found inside different modules. Figure 2 depicts
all levels inside a module.

Each module is designed in a synchronous way, having
its own frequency. Communications between modules are
asynchronous via a wrapper and use a single-rail data path
with a 4-phase handshake. Two serial flip-flops are used between
independent clock domains to reduce metastability [15,
16]. The wrapper includes two independent units: one
receives frames from the previous module while the other
sends frames to the following module at the same time.
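As a behavioral illustration (not the authors' RTL), the 4-phase handshake above can be modeled as four ordered signal transitions per transfer; the names req and ack are conventional and assumed here, not taken from the paper:

```python
# Behavioral sketch of the single-rail 4-phase handshake used by the
# asynchronous wrappers between clock domains. One transfer is:
# req up -> ack up (data latched) -> req down -> ack down (channel idle).

def four_phase_transfer(channel, data):
    """Perform one 4-phase handshake and return (received data, events)."""
    events = []
    channel["data"] = data
    channel["req"] = 1; events.append(("req", 1))  # sender asserts request
    channel["ack"] = 1; events.append(("ack", 1))  # receiver latches data, acks
    received = channel["data"]
    channel["req"] = 0; events.append(("req", 0))  # sender releases request
    channel["ack"] = 0; events.append(("ack", 0))  # receiver releases ack
    return received, events

channel = {"req": 0, "ack": 0, "data": None}
value, events = four_phase_transfer(channel, 0xAB)
print(hex(value))   # 0xab
print(len(events))  # 4 signal transitions per transfer
```

In hardware, each of these four transitions is resynchronized through the two serial flip-flops mentioned above before the other side reacts to it.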
An NoC is characterized by its topology, routing protocol,
and flow control. The communication architecture
consists of one NoC for commands and results and another NoC for
internal data. Topology, flow control, and type of packets
differ according to the targeted NoC.
2.4. NoC for Command and Results. Because the command
flow and the final results are significantly smaller than
the incoming data flow, they use an NoC architecture which
is linked to the IP wrappers. The topology for this communication
is a ring using a circuit-switching technique with
8-bit flits. Through the communication ring, the control
module sends 4-flit packets. Packets have one header flit and
3 other flits containing command flits or empty flits. The
control module sends packets to any other module; packets
are either command packets or empty packets. Command packets
sent by the control module to any other module contain
instructions to execute. Empty packets are used by any other
module to send data to the control module: any module can use them
to send results or any other information
back to the control module.
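The 4-flit ring packets can be sketched as follows; the exact header layout is not given in the paper, so placing the destination module identifier in the header flit is an assumption made for illustration:

```python
# Illustrative model of the command-ring packets: one 8-bit header flit
# plus three 8-bit payload flits, each either a command flit or an
# empty flit. Field placement in the header is assumed, not specified.

EMPTY = 0x00

def make_ring_packet(dest, commands):
    """Build a 4-flit packet: header flit + up to 3 command/empty flits."""
    assert 0 <= dest < 16 and len(commands) <= 3
    header = dest & 0x0F                       # assumed: destination id
    payload = list(commands) + [EMPTY] * (3 - len(commands))
    return [header] + payload

pkt = make_ring_packet(dest=2, commands=[0x11])
print(pkt)       # [2, 17, 0, 0] -> header, one command flit, two empty flits
print(len(pkt))  # 4
```

An empty packet is simply one with all three payload flits left empty, which a module on the ring can fill with result data addressed back to the control module.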
2.5. Communication Architecture for Data Transfers. The
NoC dedicated to data uses a Fat Tree topology which can
be customized according to the required communications of
the application. Here we use flit packet-switching/wormhole
routing with fixed routes and virtual channels. Flow control
deals with the allocation of channel and buffer resources to
a packet/data. For image analysis applications, the specifications
for the design of our NoC dedicated to data are the
following.

(i) Several types of data with different lengths at the
inputs. The size of the data must be parameterized
to support any algorithm characteristic.

(ii) Several output nodes; this number is defined according
to the application requirements.

(iii) Node/module frequencies are different.

According to the algorithms implemented, several data from
any input module can be sent to any output module at any
time.

In the following sections, we assume that the architecture
contains four input modules (the memory modules)
connected to four output modules (the processing modules).
This configuration will be used for the multispectral image
application illustrating the design space exploration in the
following sections.
2.5.1. The Topology. The topology chosen is a Fat Tree (FT)
topology, as depicted in Figure 3, as it can be adapted to
the algorithm requirements. Custom routers are used to
interconnect modules in this topology.
2.5.2. Virtual Channel (VC) Flow Control. VC flow control
is a well-known technique. A VC consists of a buffer that
can hold one or more flits of a packet, together with associated
state information. Several virtual channels may share the
bandwidth of a single physical channel [17]. This allows
minimization of the size of the router buffers, a significant
source of area and energy overhead [18, 19], while providing
flexibility and good channel utilization.

During the operation of the router, a VC can be in one of
the following states: idle, busy, empty, or ready.

Virtual channels are implemented using bi-synchronous
FIFOs.

Figure 2: The generic structure of modules with the asynchronous
wrapper for result and command (a synchronous FPGA module
with coding, memory, control, decoding, and special units, wrapped
by asynchronous receive and send units carrying the command/result
and data flows).

Figure 3: FT topology for the TDM NoC (coefficient, original, and
compared data enter Router 0, which serves processing nodes
PN 0 to PN 3; results flow back).
2.5.3. Packet/Flit Structure. Table 1 shows the structure of the
packet/flit used for the data transfers. The packet uses an
8-bit header flit, an 8-bit tail flit, and several 8-bit flits for
the data. In the header flit, Id is the IDentifier number
of the data type, P is the output Port number corresponding to the
number of the PN, and Int_l (INTeger Length) gives
the position of the fixed point in the data.

The tail flit is the constant 0xFF. One packet can be
split into several Flow control units (flits). The data
structure is dynamic in order to adapt to different types of
data. The length of packets and data, the number and size of flits,
and the depth of the VCs are all parameterized. The size of flits
can be 8, 16, 32, or 64 bits, but we keep a header and a tail of
8 bits, extended to the flit size.

Packet switching with wormhole routing and fixed routing
paths is used, each packet containing the target address
information as well as data, with Best Effort (BE) traffic.
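The parameterized packing described above can be sketched as follows; the bit split of the header fields is an assumption for illustration (the paper fixes only the field names, not their exact positions):

```python
# Hedged sketch of the Table 1 packet format: an 8-bit header flit
# (id, port, and length fields; exact bit layout assumed), N data bits
# cut into 8-bit flits, and a constant 0xFF tail flit.

TAIL = 0xFF

def build_packet(data_id, port, data, data_bits, flit_bits=8):
    """Cut a data word into flits and frame it with header and tail."""
    header = ((data_id & 0x3) << 6) | ((port & 0x3) << 4)  # assumed layout
    nb_flits = (data_bits + flit_bits - 1) // flit_bits
    mask = (1 << flit_bits) - 1
    flits = [(data >> (flit_bits * (nb_flits - 1 - i))) & mask
             for i in range(nb_flits)]                     # MSB-first flits
    return [header] + flits + [TAIL]

# A 48-bit "original image" word (id = 0b01) routed to processing node 2:
pkt = build_packet(data_id=0b01, port=2, data=0x0123456789AB, data_bits=48)
print(len(pkt))      # 8 flits: 1 header + 6 data + 1 tail
print(hex(pkt[-1]))  # 0xff
```

With wormhole routing, only the header flit carries routing information; the data and tail flits simply follow the path the header reserved.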
2.5.4. The Switch Structure. This NoC is based on routers
built with three blocks. One block, called the Central Coordination
Node (CCN), performs the coordination of the system.
The second block is the Arbitration Unit (AU), which detects
the states of the data paths. The last one is a mux (TDM-NA)
with data formatting. The switch structure is shown in
Figure 4.

Figure 4: Switch structure (input FIFOs feeding a TDM-NA mux,
controlled by the AU (arbitration unit) and the CCN (central
coordination node)).
The CCN manages the resources of the system and maps
all new incoming messages to the target channel. The switch
is based on a mux (crossbar) from several inputs to several
outputs. All the inputs are multiplexed using Time Division
Multiplexing (TDM).

For a high throughput, more than one switch can be
implemented in the communication architecture.

The AU is a Round Robin Arbiter (RRA) [20, 21] which
detects the states of all the VCs at the outputs. It determines
on a cycle-by-cycle basis which VC may advance. When the AU
Table 1: Data structure for the packets.

Header (8 bits): id | p | reserve
Data (N bits): 1st flit of data ... Nth flit of data
Tail (8 bits): constant 0xFF
Table 2: The 24-bit packet data structure for version 1.

Header (8 bits, bits 23-16): id | p | not used
Data (16 bits): 1st flit (bits 15-8) | 2nd flit (bits 7-0)
receives the destination information of the flit (P_enc), it
detects the states of the available paths connected to the target
output. This routing condition information is sent back
to the CCN in order to let the CCN perform the mapping of the
communication.
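A minimal model of the round-robin arbitration performed by the AU is sketched below; it illustrates the RRA idea (fair cyclic granting) rather than the authors' actual RTL:

```python
# Minimal round-robin arbiter model: on each cycle, grant the first
# requesting VC after the last granted one, so every VC gets a fair turn.

def rra_grant(requests, last_grant):
    """requests: one boolean per VC. Returns the granted index, or None."""
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if requests[idx]:
            return idx
    return None

grants = []
last = -1
for _ in range(4):                 # 4 cycles with all 4 VCs requesting
    last = rra_grant([True] * 4, last)
    grants.append(last)
print(grants)   # [0, 1, 2, 3] -> each VC advances once per round
```

The cycle-by-cycle behavior quoted above corresponds to calling such a grant function once per clock cycle with the current VC request vector.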
2.5.5. The Structure of the TDM-NA. The TDM-NA is the combination
of a MUX and a Network Adapter (NA). One specific NA is
proposed in Figure 5. The Network Adapter adapts any data
before it is sent to the communication architecture. The
Network Adapter contains 5 blocks.

(i) Adaptor_type verifies the type of the data and defines
the required number of flits (and the number of clock
cycles used to construct the flits).

(ii) Adaptor_tdm performs the time division multiplexing
for the process of cutting one packet into several
flits.

(iii) Adaptor_pack adds the header and tail flits to the initial
data.

(iv) Fifo_NA stores the completed packet.

(v) Adaptor_flit cuts the whole packet into several 8-bit
flits.

Adaptor_flit runs at a higher clock frequency in this NA
architecture because it needs time to cut one packet into
several flits. For different lengths of data, Adaptor_tdm
generates different frequencies, which depend on the number
of flits going out for one completed packet.
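The clock-ratio requirement on Adaptor_flit can be estimated with a back-of-the-envelope model (my framing, not a formula from the paper): a packet of `data_bits` plus an 8-bit header and an 8-bit tail leaves as 8-bit flits, so the flit clock must run `nb_flits` times faster than the packet arrival clock to keep up:

```python
# Flit count per packet = header + tail + ceil(data_bits / flit_bits).
# This also gives the minimum flit-clock / packet-clock ratio.

def flit_clock_ratio(data_bits, flit_bits=8):
    nb_flits = 2 + (data_bits + flit_bits - 1) // flit_bits  # +header +tail
    return nb_flits

# The three data sizes of the multispectral application (cf. Table 3):
for bits in (56, 48, 8):
    print(bits, "->", flit_clock_ratio(bits), "flits per packet")
# 56 -> 9, 48 -> 8, 8 -> 3
```

These counts match the 72-bit, 64-bit, and 24-bit total packet sizes listed later in Table 3.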
3. Two Versions of the TDM Parameterized NoCs

Two versions are proposed and presented in this paper. In
version 1, data are transferred as packets with a packet-switching
technique and a fixed link size. In version 2, data are transferred
as flits with a wormhole technique and a reduced link size.
The first version uses one main switch and 2 Virtual Channels
on the outputs. The second version contains 2 main switches
in parallel with 2 Virtual Channels on the inputs and on the
outputs. Both versions have four memory modules as input
modules and four processing modules as output modules.
Both versions are designed in VHDL.
3.1. Version 1 with ONE Main Switch. Version 1 is a TDM FT
NoC containing one main switch and 2 VCs, that is, 2 channels
for each output, as shown in Figure 6.

The data are sent as 24-bit packets. The width of the VCs in
version 1 is 24 bits. The simplified data structure of version 1
is shown in Table 2.
3.2. Version 2 with TWO Main Switches. Another switch
is added to the architecture to increase the throughput.
The structure of the switch is identical to the switch presented
in the previous section. These two main switches operate in
parallel, as depicted in Figure 7.

The width of all the VCs in this version depends on the
algorithm characteristics.
3.3. NoC Parameters for DSE. The proposed NoC is flexible
and adaptable to the requirements of image analysis applications.
Parameters of the communication architecture are
specified for the Design Space Exploration (DSE).

The parameters are the following.

(i) Number of switches: one main switch for version 1
and two main switches for version 2.

(ii) Size of the VCs: it corresponds to the different sizes of the
different types of data transferred.

(iii) Depth of the FIFOs in the VCs: limited by the maximum
storage resources of the FPGA.

Several synthesis tools are used for the architecture
implementation and DSE, as these synthesis tools give
different resource allocations on the FPGA.
4. Experiments and Results

The sizes of data, FIFOs, and virtual channels are extracted
from the implemented algorithm. A multispectral image
Figure 5: Data structure for the defined types of data (the NA blocks
Adaptor_type, Adaptor_tdm, Adaptor_pack, Fifo_NA, and
Adaptor_flit, with their clock, reset, id, port, and data signals).
Figure 6: The structure of Version 1 (virtual channel/packet-switching;
4 inputs, 4 destinations, 2 channels per output).
algorithm for image authentication is used here to validate
the communication architecture.

4.1. Multispectral Image Authentication. Multispectral image
analysis (Figure 8) has been used in space-based
image identification since the 1970s [22-26]. This technology
can capture light from a wide range of frequencies, which
allows the extraction of additional information that the human
eye fails to capture with its receptors for red, green, and blue.
Art authentication is one common application, widely used in
museums. In this field, an embedded authentication system
is required.

The multispectral images are optically acquired in more
than one spectral or wavelength interval. Each individual
image usually covers the same physical area and scale but a
different spectral band. Other applications are presented in
[27, 28].
The aim of the multispectral image correlation is to
compare two spectral images.

(i) Original image (OI): its spectrum is saved in the
Storage Module as the reference data.

(ii) Compared images (CIs): their spectra are acquired by
a multispectral camera.

For the art authentication process, the OI is the information
of the true picture, and the CIs are the other, similar
candidates. With the comparison process of the authentication
(Figure 9), the true picture can be found among the
false ones by calculating the distance between the multispectral
image data. For this process, certain algorithms require high-precision
operations which imply large amounts of different
types of data (e.g., floating-point, fixed-point, integer, BCD
encoding, etc.) and complex functions (e.g., square root
or other nonlinear functions). Several spectral projections
and distance algorithms can be used in the multispectral
authentication.
We can detail the process.

(i) First of all, the original data received from the
multispectral camera are the spectral values for every
pixel of the image. The whole image is separated
into several significant image regions. These region
values are transformed into average color values
by using different window sizes (e.g., 8 x 8 pixels
as the smallest window, 64 x 64 pixels as the biggest
window).

(ii) After this process, a color projection (e.g.,
RGB, L*a*b*, XYZ, etc.) transforms the average
color values into color space values. An example of RGB
color projection is shown in:

R_i = \sum_{\lambda=380}^{780} S(\lambda) R_c(\lambda),
G_i = \sum_{\lambda=380}^{780} S(\lambda) G_c(\lambda),
B_i = \sum_{\lambda=380}^{780} S(\lambda) B_c(\lambda),    (1)

where R_c, G_c, and B_c are the coefficients of the red,
green, and blue color space. S(\lambda) represents the spectral
value of the image corresponding to each scanned
wavelength \lambda. The multispectral camera used can
scan one picture from 380 nm to 780 nm with 0.5 nm
as the precision unit, so the number of spectral values N
can vary from 3 to 800. R_i, G_i, and B_i are the RGB
values of the processed image.
(iii) These color image data go through the comparison
process of the authentication. The color distance is just
the basic geometric distance. For example, for
the RGB color space, the calculated distance is shown
in:

\Delta E_{RGB} = \sqrt{(R_1 - R_2)^2 + (G_1 - G_2)^2 + (B_1 - B_2)^2}.    (2)

If the true picture can be found among the false
ones by calculating the color distance, the process is
finished; otherwise it goes to the next step.
(iv) Several multispectral algorithms (e.g., GFC, Mv) are
used to calculate the multispectral distance with the
original multispectral image data. Certain algorithms
require high-precision operations, which imply a large
amount of floating-point data and complex functions
(e.g., square root or other nonlinear functions) in this
process.
(v) After comparing all the significant regions of the
image, a ratio of similitude R_{s/d} is calculated,
as shown in (3). N_s represents the number of similar
regions and N_d represents the number of dissimilar
regions:

R_{s/d} = N_s / N_d.    (3)

Different thresholds are defined to give the final authentication
result for the different required precisions, finding the
true image which is most alike the original one. One of these
algorithms is presented in [29].
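The comparison pipeline above (window averaging, projection (1), distance (2), ratio (3)) can be modeled end-to-end in a few lines. The coefficients and the threshold below are invented for illustration; real R_c/G_c/B_c curves are CIE-like tables over hundreds of wavelengths:

```python
# Toy model of the authentication pipeline: average spectra per window,
# project to RGB (eq. (1)), compute the color distance (eq. (2)), and
# form the similar/dissimilar ratio (eq. (3)).

def window_average(pixels, w):
    """Average spectra over non-overlapping windows of w*w pixels."""
    return [sum(block) / len(block) for block in
            (pixels[i:i + w * w] for i in range(0, len(pixels), w * w))]

def project_rgb(spectrum, rc, gc, bc):
    """Eq. (1): weighted sums of the spectrum by the color coefficients."""
    r = sum(s * c for s, c in zip(spectrum, rc))
    g = sum(s * c for s, c in zip(spectrum, gc))
    b = sum(s * c for s, c in zip(spectrum, bc))
    return (r, g, b)

def delta_e(c1, c2):
    """Eq. (2): Euclidean distance between two RGB triples."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def similarity_ratio(distances, threshold):
    """Eq. (3): number of similar regions over dissimilar regions."""
    n_s = sum(1 for d in distances if d <= threshold)
    n_d = len(distances) - n_s
    return n_s / n_d if n_d else float("inf")

rc, gc, bc = [1, 0, 0], [0, 1, 0], [0, 0, 1]  # toy 3-band coefficients
oi = project_rgb([10, 20, 30], rc, gc, bc)
ci = project_rgb([10, 20, 34], rc, gc, bc)
print(delta_e(oi, ci))                                  # 4.0
print(similarity_ratio([1.0, 2.0, 9.0], threshold=5))   # 2.0
```

In the real system each of these four steps maps to one of the four processing modules, with the final thresholding done by the control module.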
The calculations are based on the spatial and spectral
data, which makes the memory accesses a bottleneck in the
communication. From the algorithm given in Figure 9, the
characteristics are:

(i) number of regions for every wavelength = 2000;

(ii) number of wavelengths = 992;

(iii) size of the window for the average processing = 2 x 2,
4 x 4, 8 x 8, 16 x 16, 32 x 32;

(iv) number of tasks: 4 (color projection, color distance,
multispectral projection, multispectral distance). The
multispectral authentication task is executed by the
control module. In this example, there is no task
parallelism. The sizes of the data are 72 bits, 64 bits, 64 bits,
and 24 bits, as shown in Table 3;

(v) number of modules: 4 processing modules, 4 storage
modules, 1 acquisition module, 1 control module;

(vi) bandwidth of the multispectral camera: 300 MB/s. The
NoC architecture is dimensioned to process and
exchange data at least at the same rate in order to
achieve real time.
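A rough dimensioning check of this 300 MB/s requirement (my arithmetic, not a computation from the paper): with 8-bit flits, the flit links must carry at least 300 Mflit/s of payload, inflated by the header and tail overhead of each packet:

```python
# Required aggregate flit rate to sustain the camera bandwidth, given
# the per-packet overhead of one header flit and one tail flit.

CAMERA_MBPS = 300   # MB/s, from the application requirements
FLIT_BYTES = 1      # 8-bit flits

def required_flit_rate(payload_bits, mbytes_per_s):
    data_flits = payload_bits // 8
    total_flits = data_flits + 2              # + header + tail
    return mbytes_per_s * total_flits / data_flits  # Mflit/s

rate = required_flit_rate(48, CAMERA_MBPS)
print(rate)       # 400.0 Mflit/s in total for 48-bit payloads
print(rate / 2)   # 200.0 Mflit/s per switch with version 2's two switches
```

This kind of estimate is what motivates version 2's second parallel switch when a single switch cannot be clocked fast enough.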
For the NoC architecture, four types of data are defined by
analyzing the multispectral image algorithms. Each data type has an
identification number (id).

(i) Coef: coefficient data, the normalized
values of the difference color space vector (56-bit, id 00).

(ii) Org: original image data, stored in the SN
(48-bit, id 01).

(iii) Com: compared image data, acquired by
the multispectral camera and received from the NA
(48-bit, id 10).

(iv) Res: result of the authentication process (8-bit, id 11).
4.2. Resources of the Modules in the Architecture. This parameterized
TDM architecture was designed in VHDL. Table 4
shows the resources of the modules in the architecture.
The FPGA is the Altera Stratix II EP2S15F484C3, which has
6240 ALMs/logic cells. The number of resources dedicated to
all the modules represents around 14% of the total logic cells.
Whatever the communication architecture, all these modules
remain unchanged, with the same number of resources.
4.3. The Point-to-Point Communication Architecture Dedicated
to Multispectral Image Algorithms. A classical point-to-point
communication architecture was designed for the
algorithm requirements presented previously; it is shown
in Figure 10. This traditional structure is used to compare
some significant results obtained with the proposed NoC. In
Figure 7: The structure of version 2 with 2 main switches in parallel
(reference, original, and compared data pass through adaptors and
paired input FIFOs into the two switches, which serve processing
modules 0 to 3; results return through the same fabric).
Table 3: Type and size of data for the multispectral algorithm.

(a) COEF: header 8 bits + coefficient data 56 bits + tail 8 bits = 72 bits
(header flit, 7 data flits Flit0-Flit6, tail constant 0xFF).

(b) ORG/COM: header 8 bits + original/compared data 48 bits + tail 8 bits = 64 bits
(header flit, 6 data flits Flit0-Flit5, tail constant 0xFF).

(c) RES: header 8 bits + result 8 bits (0-255) + tail 8 bits = 24 bits
(tail constant 0xFF).
Figure 8: Multispectral images. Multispectral images add information
compared to a color image: in this example, the artificial leaf can be
distinguished from the real ones at 775 nm. (Panels: (a) RGB image;
(b) real versus artificial spectra; (c) image at wavelength 500 nm;
(d) image at wavelength 775 nm.)
Table 4: Resources for the nodes in the GALS architecture (on a Stratix II 2S60).

Node          Frequency (MHz)   Logic cells   Registers   Memory bits
Control       150               278           265         32
Acquisition   76.923            315           226         2
Storage       100               280           424         320000
Processing    50                (depending on the algorithms)
the global communication architecture, any input data can
be transmitted to any processing module. 72-bit muxes are
inserted here between the FIFOs and the processing modules. This
point-to-point communication uses input FIFOs matching the
size of the data used. Their bandwidth is thus not tuned to fit
the bandwidth of the input streams. For the three versions
studied here, the input FIFO bandwidth is higher than the
specifications of multispectral cameras. If this were not the
case, the input FIFO size could be increased to respect the
constraint.
Figure 9: General comparison process of the authentication
(spectral average calculation, spectral projection, storage, and
spectral distance for the original (OI) and compared (CI) regions,
followed by the multispectral authentication and precision
evaluation that classify regions as similar or dissimilar).
R: result of each step of calculation. P: precision of each
multispectral distance.
Figure 10: The point-to-point communication dedicated to the
multispectral authentication algorithm (per processing node PN0-PN3:
Fifo_64bit/32 for Org and Com, Fifo_72bit/32 for Coef, and
Fifo_24bit/32 for Res, feeding 72-bit links).
4.4. Implementation of the Communication Architecture for
Data Transfers. The point-to-point architecture and both
versions of the NoC were designed in VHDL. The
FPGA used for the implementation is an Altera Stratix II
EP2S180F1508C3 FPGA. Two sizes of data
are used for version 1 of the NoC, 48 bits and 56 bits.
These sizes are similar to the sizes of data for the point-to-point
communication. Implementation results are given in
Table 5.
Concerning latency, version 2 uses 8-bit flits as the
transmission unit; thus the NA needs 8 cycles to cut a 64-bit
data packet into flits, plus 1 cycle for the header. Also, the
latency of the NoC itself is 1 cycle for storing in the first FIFO, 1
cycle for crossing the main switch, and 1 cycle for storing
in the second FIFO, that is, 3 cycles of latency due to the NoC.
Compared to the point-to-point communication, we pay the
packet serialization latency to gain much better flexibility.
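The latency accounting above can be sketched as a small model; the cycle counts follow the text, while the function framing is mine:

```python
# Version 2 latency = flit serialization in the NA (+1 header cycle)
# plus a fixed 3-cycle NoC traversal (input FIFO, switch, output FIFO).

def version2_latency(payload_bits, flit_bits=8):
    serialization = payload_bits // flit_bits  # cycles to cut data into flits
    header = 1                                 # one extra cycle for the header
    noc = 3                                    # FIFO + switch + FIFO
    return serialization + header + noc

print(version2_latency(64))   # 12 cycles for a 64-bit data packet
```

For the 64-bit Org/Com packets this gives 12 cycles end to end, of which only 3 are attributable to the NoC fabric itself.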
Concerning area resources, as depicted in Table 5, the
point-to-point communication needs fewer ALUTs but over
7 times more memory blocks. The switch requires 4 times
more logic (ALUTs) than the point-to-point architecture
(the other ALUTs are for the FIFOs of versions 1 and 2,
which use registers rather than memory blocks to implement
FIFOs). One reason is the structure of the switch, which
is more complex than the muxes used in the point-to-point
architecture. If we compare just the switch size with a simple
classical NoC like Hermes, we obtain similar sizes for a
switch based on a 4 x 4 crossbar, but a full NoC linking
4 memory nodes to 4 processing nodes would require at
least 8 switches, that is, almost 8 times more area and from
2 to 5 times more latency to cross the switches when
there is no contention, and even more with contention.
The advantage of a classical NoC approach is that it allows any
communication. This is of no use here, as our four memories
do not communicate with one another. We have here an oriented
dataflow application with specific communications. Our
dataflow NoC has the advantages of an NoC (systematic
design, predictability, and so on) and the advantages of point-to-point
communications (low latency and optimized,
well-sized links), to obtain the best performance/cost tradeoff
and to use fewer memory blocks, which is important for image
algorithms that store huge quantities of data inside the
chip.
Also, the number of pins for the point-to-point communication
is significantly higher compared to both NoC
versions, even with a simple communication pattern. This
indicates that the point-to-point communication requires
many more wires inside the chip when the complete system
is implemented. This can be a limitation for complex
Multiprocessor SoCs. Furthermore, the frequency of the point-to-point
communication is a bit lower than that of the NoC versions.

The resource allocations show the benefits of using an NoC
architecture for image analysis algorithms compared to traditional
point-to-point architectures. The following implementations
focus on the NoC architectures. Both versions are
analyzed and explored when implemented on FPGA.

The number of resources for version 1 is smaller than for
version 2, but with less bandwidth (a single switch compared
to 2 switches for version 2). The choice of one version of the
NoC is made from the tradeoff between timing and resources.
Optimizing the number of resources leads to the choice
of version 1, whereas version 2 is adapted to higher bandwidth
requirements.
4.5. DSE of the Proposed NoC. Knowledge of the design
space of architectures is important for the designer to make
design tradeoffs. Finding the architecture most adapted
to the algorithm requirements requires a Design Space
Exploration (DSE) of the NoC architectures. The exploration
presents the relationships between input parameter
values and performance results in order to make tradeoffs
in the implemented architecture. In DSE the parameter
values guide the designer in making choices and tradeoffs
while satisfying the required performances, without requiring
implementation processes.

(i) Input parameters:

(a) number of switches,
(b) number and width of the VCs,
(c) depth of the FIFOs/VCs.

(ii) Performances:

(a) logic device utilization,
(b) ALUTs,
(c) registers,
(d) latency,
(e) frequency.

The input parameters are explored to see their effect on the
performances. The performances are focused on the resources
first. The purpose of DSE is to find the most appropriate
parameter values to obtain the most suitable design.
Hence, we need to find the inverse transformation from the
performances back to the parameters, as represented by the
light bulb in Figure 11. The Y-chart environment is realized
by means of retargetable simulation. It links the parameter
values used to define an architecture instance to performance
numbers [30-32].

The depth of the FIFOs is 32 for both versions. The width
for version 1 is 24 bits, and 8 bits for version 2. Note that in
version 2, FIFOs do not only exist in the VCs, but also in the
NA at the input of the NoC (shown in Figure 5). The FPGA is an
Altera EP2S15F484C3 implemented with Quartus II 7.2, with
the DK Design Tool used for synthesis.
4.5.1. Results of Version 1 (Parameter: Depth of the FIFOs/VCs;
Performance: Device Utilization). Figure 12 presents the DSE
of the proposed communication architecture. The width of
the version 1 FIFOs is 24 bits. The depth of a FIFO is the number
of packets stored in the FIFO. This version corresponds to the
case where all the data have the same length.

Figure 12 shows that as the depth of the VCs increases,
the device utilization increases almost linearly.
At the maximum depth of 89, there are not enough Logic Array
Blocks (LABs) for the architecture implementation.
Table 5: Comparison of the resources: point-to-point versus NoC versions.

Resource                    Point-to-point   Version 1 (48-bit)   Version 1 (56-bit)   Version 2
Logic utilization           1%               2%                   3%                   3%
Combinational ALUTs         305              1842                 2118                 2521
Dedicated logic registers   1425             2347                 2739                 4217
Total pins                  512              344                  408                  230
Total block memory bits     29568            3384                 3960                 8652
Frequency (MHz)             165.73           264.34               282.41               292.31
Figure 11: The inverse transformation from the performance back
to the parameters (an application model and a VHDL architecture
model are mapped and evaluated by retargetable simulation to
produce performance numbers).
[Figure 12 plot: device utilization summary (%) versus depth of FIFOs/VCs (10 to 80), with curves for logic utilization, combinational ALUTs, dedicated logic registers, and block memory bits.]
Figure 12: The device utilization summary of Version 1 on Altera Stratix II.
memory block usage grows twice as fast as the total register usage.
When the DSE reaches the maximum depth of the FIFOs/VCs, the utilization of ALUTs is 80%, but the block memory is only 5% used; that is, the synthesis tool does not use memory blocks as the target for the implementation of the FIFOs.
4.5.2. Results of Version 1 (Parameter: Width of FIFOs/VCs, Performance: Device Utilization). In Figure 13, the DSE uses two depths for the FIFOs: an 8-data depth (depicted as a solid line) and a 32-data depth (depicted as a dotted line). As the data can be parameterized into flits in the NoC, the size of the packets ranges from 20 bits to 44 bits. For the data width, we go from 8-bit data (the minimal size for the data defined in the multispectral image analysis architecture), corresponding to 20-bit packets (adding a 12-bit header/tail), to 32-bit data, corresponding to 44-bit packets. The X-axis in Figure 13 represents the width of the FIFOs/VCs, which is the length of the packets in the transmission (we consider here that the packet size equals the FIFO width).
The results show that the number of resources depends on the width of the FIFOs. The limiting factors in the sizing of the FIFOs are the numbers of logic elements and registers. With a depth of 32, 20% of the registers and 40% of the logic are used. The use of logic grows more significantly with a depth of 32. All required resources can be derived from a linear equation extracted from the figure, so resource predictions can be made without requiring any implementation. The same observation holds for the memory blocks.
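The linear behaviour reported above can be exploited for prediction without running synthesis. The sketch below fits a line through two measured (width, resource) points; the ALUT values used in the example are illustrative numbers read off a plot, not exact figures from the paper.

```python
def linear_predict(p0, p1, x):
    """Linearly extrapolate a resource count from two measured design points.

    p0 and p1 are (parameter_value, resource_count) pairs taken from two
    synthesized instances; x is the parameter value to predict for.
    """
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Illustrative values only: ALUT counts assumed at widths 20 and 44 bits.
aluts_at_32 = linear_predict((20, 800), (44, 3455), 32)
```

Two synthesis runs at the extreme parameter values are thus enough to estimate resource usage at any intermediate width.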
4.5.3. Results of Version 2 (Parameter: Depth of the FIFOs/VCs, Performance: Device Utilization). To overcome the fixed width of Version 1, Version 2 uses a flit-based method: it handles different lengths of transmitted data, which leads to different widths for each input of the NoC communication. The total number of bits transmitted per data set is 224 bits (72 bits + 64 bits + 64 bits + 24 bits).
Figure 14 shows the resource utilization summary on the Stratix II for Version 2, which has characteristics similar to those of Version 1. In the data transmission of Version
[Figure 13 plots: device utilization summary (%) versus width of FIFOs/VCs (20 to 44 bits), for depths 8 (solid lines) and 32 (dotted lines); panels give (a) combinational ALUTs, (b) dedicated logic registers, and (c) total memory bits, with (d) curves for logic, ALUTs, registers, and memory at each depth.]
Figure 13: The device utilization summary with fixed depth of FIFOs/VCs but different widths on Altera Stratix II for Version 1.
[Figure 14 plot: device utilization summary (%) versus depth of FIFOs/VCs (10 to 120), with curves for logic utilization, combinational ALUTs, dedicated logic registers, and block memory bits.]
Figure 14: The device utilization summary on Altera Stratix II for Version 2.
2, all the data are divided into 8-bit flits by the NA, which reduces the overall width of the FIFOs in the VCs.
Comparing the two versions, Version 1 has fixed widths, which is suitable when all data have the same size. The structure of Version 1 is simpler, requires fewer resources, and has better latency. But for transmissions mixing widely different data lengths/sizes, Version 2 is better than Version 1 because it can adapt precisely to the data sizes to obtain an optimal solution.
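To make this trade-off concrete, the back-of-the-envelope sketch below compares padding overhead when the four Version 2 data widths (72, 64, 64, and 24 bits) are either forced into one fixed FIFO width (Version 1 style) or split into 8-bit flits (Version 2 style). Header/tail bits are ignored for simplicity, so this is a rough illustration rather than the paper's exact accounting.

```python
# Rough padding comparison between the two versions (header/tail bits ignored).
data_widths = [72, 64, 64, 24]            # bits per data item, as in Version 2
payload = sum(data_widths)                # 224 bits of useful data in total

# Version 1 style: one fixed FIFO width sized for the largest item, so
# shorter items are padded up to that width.
fixed_width = max(data_widths)
v1_bits = fixed_width * len(data_widths)

# Version 2 style: each item is split into 8-bit flits, so the stored width
# adapts to the data (rounded up to a whole number of flits).
FLIT = 8
v2_bits = sum(-(-w // FLIT) * FLIT for w in data_widths)   # ceil to flit size

v1_padding = v1_bits - payload            # 288 - 224 = 64 wasted bits
v2_padding = v2_bits - payload            # 224 - 224 = 0 wasted bits
```

With all widths being multiples of the flit size, Version 2 wastes no storage, while the fixed-width scheme pads every item up to 72 bits.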
4.5.4. Results of Version 2 (Parameter: Synthesis Tool, Performance: Device Utilization). Two different synthesis tools have been chosen to analyze the impact of the synthesis tool on the DSE: the DK Design Suite v5.0 SP5 from Mentor Graphics [33] and the Synopsys Design Compiler [34]. From a single VHDL description corresponding to Version 2, these two tools gave quite different synthesis results for the same Altera Stratix II EP2S15F484C3 (6240 ALMs), as depicted in Figure 15.
These two tools have been used as the synthesis tools with Quartus II 7.1 and 7.2. All the lines with markers ("o", "x", etc.) present the results synthesized by the DK Design Suite; all the lines without markers are the results synthesized by the Design Compiler. The maximum depths synthesized by the two tools are 63 for the Design Compiler and 120 for the DK Design Suite. For the memory block utilization, the two tools behave in the same way, but for the other implementation factors the results are quite different. At certain depths of the FIFOs/VCs, the resource utilizations are similar (e.g., 14, 30, 62). However, at most other depths, the results synthesized by the DK Design Suite are
[Figure 15 plot: device utilization summary (%) on Altera Stratix II versus depth of FIFOs (7 to 120), with curves for DK-logic utilization, DK-ALUTs, DK-registers, DK-memory bits, DC-logic utilization, DC-ALUTs, DC-registers, and DC-memory bits.]
Figure 15: The device utilization summary on Altera Stratix II for Version 2 with different synthesis tools: Design Compiler and DK Design Suite.
better than those of the Design Compiler, meaning that fewer FPGA resources are used. The growth of the resource utilization is quite different as well: with the Design Compiler, device utilization increases in several steps, but with the DK Design Suite it increases more smoothly and linearly. In this case, synthesis with the DK Design Suite makes resource utilization prediction much easier than with the Design Compiler.
5. Conclusion and Perspectives
The presented architecture is a parameterized architecture dedicated to image analysis applications on FPGA. All flows and data are analyzed to propose two generic NoC architectures: a ring for results and commands, and a dedicated FT NoC for data. Using both communication architectures, the designer inserts several modules; the number and type of modules depend on the algorithm requirements. The proposed NoC for data transfer is, more precisely, a parameterized TDM architecture which is fast, flexible, and adaptable to the type and size of data used by the given image analysis application. This NoC uses a fat-tree topology with VC packet switching and parameterized flits.
According to the implementation constraints, area, and speed, the designer chooses one version and can adapt the communication to optimize the area/bandwidth/latency trade-off. Adaptation consists in adding several switches in parallel or in series and in sizing the data (and flits), FIFOs, and virtual channels for each switch. Without any implementation, the designer can predict the resources used and required. This generic fat-tree topology allows us to generate and systematically explore a communication infrastructure in order to design efficiently any dataflow image analysis application.
Future work will focus on automating the exploration of the complete architecture and the analysis of the algorithm-architecture matching according to the different required data. From the evaluation of the NoC exploration, an automated tool can predict the most appropriate communication architecture for data transfer and the required resources. Power analysis will be added to complete the design space exploration of the NoC architecture. Power, area, and latency/bandwidth are the values which will guide the exploration process.
References
[1] P. Taylor, "Nvidia opens mobile GPU kimono: slideware shows higher performance, lower TDPs," The Inquirer, June 2009.
[2] Z. Yuhong, H. Lenian, X. Zhihan, Y. Xiaolang, and W. Leyu, "A system verification environment for mixed-signal SOC design based on IP bus," in Proceedings of the 5th International Conference on ASIC, vol. 1, pp. 278–281, 2003.
[3] U. Farooq, M. Saleem, and H. Jamal, "Parameterized FIR filtering IP cores for reusable SoC design," in Proceedings of the 3rd International Conference on Information Technology: New Generations (ITNG '06), pp. 554–559, 2006.
[4] S. H. Chang and S. D. Kim, "Reuse-based methodology in developing system-on-chip (SoC)," in Proceedings of the 4th International Conference on Software Engineering Research, Management and Applications (SERA '06), pp. 125–131, Seattle, Wash, USA, August 2006.
[5] F. Moraes, N. Calazans, A. Mello, L. Möller, and L. Ost, "HERMES: an infrastructure for low area overhead packet-switching networks on chip," Integration, the VLSI Journal, vol. 38, no. 1, pp. 69–93, 2004.
[6] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, and R. Lauwereins, "Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs," in Proceedings of the 12th International Conference on Field-Programmable Logic and Applications (FPL '02), pp. 795–805, 2002.
[7] C. A. Zeferino and A. A. Susin, "SoCIN: a parametric and scalable network-on-chip," in Proceedings of the 16th Symposium on Integrated Circuits and Systems Design (SBCCI '03), pp. 169–174, 2003.
[8] E. Salminen, A. Kulmala, and T. D. Hämäläinen, "HIBI-based multiprocessor SoC on FPGA," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 3351–3354, 2005.
[9] C. Hilton and B. Nelson, "PNoC: a flexible circuit-switched NoC for FPGA-based systems," IEE Proceedings: Computers and Digital Techniques, vol. 153, no. 3, pp. 181–188, 2006.
[10] C. Bobda and A. Ahmadinia, "Dynamic interconnection of reconfigurable modules on reconfigurable devices," IEEE Design & Test of Computers, vol. 22, no. 5, pp. 443–451, 2005.
[11] H. G. Lee, U. Y. Ogras, R. Marculescu, and N. Chang, "Design space exploration and prototyping for on-chip multimedia applications," in Proceedings of the 43rd Design Automation Conference, pp. 137–142, 2006.
[12] A. Ahmadinia, C. Bobda, J. Ding, et al., "A practical approach for circuit routing on dynamic reconfigurable devices," in Proceedings of the 16th International Workshop on Rapid System Prototyping (RSP '05), pp. 84–90, 2005.
[13] V. Fresse, A. Aubert, and N. Bochard, "A predictive NoC architecture for vision systems dedicated to image analysis," EURASIP Journal on Embedded Systems, vol. 2007, Article ID 97929, 13 pages, 2007.
[14] G. Schelle and D. Grunwald, "Exploring FPGA network on chip implementations across various application and network loads," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '08), pp. 41–46, September 2008.
[15] P. Wagener, "Metastability—a designer's viewpoint," in Proceedings of the 3rd Annual IEEE ASIC Seminar and Exhibit, pp. 14/7.1–14/7.5, 1990.
[16] E. Brunvand, "Implementing self-timed systems with FPGAs," in FPGAs, W. Moore and W. Luk, Eds., pp. 312–323, Abingdon EE&CS Books, Abingdon, UK, 1991.
[17] W. J. Dally, "Virtual-channel flow control," IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 2, pp. 194–205, 1992.
[18] E. Rijpkema, K. Goossens, A. Rădulescu, et al., "Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip," IEE Proceedings: Computers and Digital Techniques, vol. 150, no. 5, pp. 294–302, 2003.
[19] H.-S. Wang, L.-S. Peh, and S. Malik, "A power model for routers: modeling Alpha 21364 and InfiniBand routers," in Proceedings of the 10th High Performance Interconnects, pp. 21–27, 2002.
[20] P. Gupta and N. McKeown, "Designing and implementing a fast crossbar scheduler," IEEE Micro, vol. 19, no. 1, pp. 20–28, 1999.
[21] W. J. Dally and B. Towles, "Route packets, not wires: on-chip interconnection networks," in Proceedings of the Design Automation Conference (DAC '01), pp. 684–689, 2001.
[22] F. Koning and W. Praefcke, "Multispectral image encoding," in Proceedings of the International Conference on Image Processing (ICIP '99), vol. 3, pp. 45–49, October 1999.
[23] A. Kaarna, P. Zemcik, H. Kalviainen, and J. Parkkinen, "Multispectral image compression," in Proceedings of the 14th International Conference on Pattern Recognition, vol. 2, pp. 1264–1267, August 1998.
[24] D. Tretter and C. A. Bouman, "Optimal transforms for multispectral and multilayer image coding," IEEE Transactions on Image Processing, vol. 4, no. 3, pp. 296–308, 1995.
[25] P. Zemcik, M. Frydrych, H. Kalviainen, P. Toivanen, and J. Voracek, "Multispectral image colour encoding," in Proceedings of the 15th International Conference on Pattern Recognition, vol. 3, pp. 605–608, September 2000.
[26] A. Manduca, "Multispectral image visualization with nonlinear projections," IEEE Transactions on Image Processing, vol. 5, no. 10, pp. 1486–1490, 1996.
[27] D. Tzeng, "Spectral-based color separation algorithm development for multiple-ink color reproduction," Ph.D. thesis, R.I.T., Rochester, NY, USA, 1999.
[28] E. A. Day, "The effects of multi-channel spectrum imaging on perceived spatial image quality and color reproduction accuracy," M.S. thesis, R.I.T., Rochester, NY, USA, 2003.
[29] L. Zhang, A.-C. Legrand, V. Fresse, and V. Fischer, "Adaptive FPGA NoC-based architecture for multispectral image correlation," in Proceedings of the 4th European Conference on Colour in Graphics, Imaging, and Vision and the 10th International Symposium on Multispectral Colour Science (CGIV/MCS '08), pp. 451–456, Barcelona, Spain, June 2008.
[30] A. C. J. Kienhuis and Ir. E. F. Deprettere, "Design space exploration of stream-based dataflow architectures: methods and tools," Ph.D. thesis, Technische Universität Braunschweig, 1999.
[31] H. P. Peixoto and M. F. Jacome, "Algorithm and architecture-level design space exploration using hierarchical data flows," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 272–282, July 1997.
[32] V. Krishnan and S. Katkoori, "A genetic algorithm for the design space exploration of datapaths during high-level synthesis," IEEE Transactions on Evolutionary Computation, vol. 10, no. 3, pp. 213–229, 2006.
[33] Mentor Graphics, DK Design Suite Tool, http://www.agilityds.com/products/c_based_products/dk_design_suite/.
[34] RTL-to-Gates Synthesis using Synopsys Design Compiler, http://csg.csail.mit.edu/6.375/6_375_2008_www/handouts/tutorials/tut4-dc.pdf.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 318654, 19 pages
doi:10.1155/2009/318654
Research Article
Efficient Processing of a Rainfall Simulation Watershed on an FPGA-Based Architecture with Fast Access to Neighbourhood Pixels
Lee Seng Yeong, Christopher Wing Hong Ngau, Li-Minn Ang, and Kah Phooi Seng
School of Electrical and Electronics Engineering, The University of Nottingham, 43500 Selangor, Malaysia
Correspondence should be addressed to Lee Seng Yeong, yls@tm.net.my
Received 15 March 2009; Accepted 9 August 2009
Recommended by Ahmet T. Erdogan
This paper describes a hardware architecture to implement the watershed algorithm using rainfall simulation. The speed of the architecture is increased by utilizing a multiple memory bank approach to allow parallel access to the neighbourhood pixel values. In a single read cycle, the architecture is able to obtain all five values of the centre and four neighbours for a 4-connectivity watershed transform. The storage requirement of the multiple-bank implementation is the same as that of a single-bank implementation, thanks to a graph-based memory bank addressing scheme. The proposed rainfall watershed architecture consists of two parts. The first part performs the arrowing operation and the second part assigns each pixel to its associated catchment basin. The paper describes the architecture datapath and control logic in detail and concludes with an implementation on a Xilinx Spartan-3 FPGA.
Copyright 2009 Lee Seng Yeong et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Image segmentation is often used as one of the main stages in object-based image processing. For example, it is often used as a preceding stage in object classification [1–3] and object-based image compression [4–6]. In both these examples, image segmentation precedes the classification or compression stage and is used to obtain object boundaries. This leads to an important reason for using the watershed transform for segmentation, as it results in the detection of closed boundary regions. In contrast, boundary-based methods such as edge detection detect places where there is a difference in intensity. The disadvantage of this method is that there may be gaps in the boundary where the gradient intensity is weak. By using a gradient image as input to the watershed transform, the qualities of both the region-based and boundary-based methods can be obtained.
This paper describes a watershed transform implemented on an FPGA for image segmentation. The watershed algorithm chosen for implementation is based on the rainfall simulation method described in [7–9]. An implementation of a rainfall-based watershed algorithm on hardware is proposed in [10], using a combination of a DSP and an FPGA. Unfortunately, the authors do not give many details on the hardware part and their architecture. Other sources have implemented a watershed transform on reconfigurable hardware based on the immersion watershed techniques [11, 12]. There are two advantages of using a rainfall-based watershed algorithm over the immersion-based techniques. The first advantage is that the watershed lines are formed in-between the pixels (zero-width watershed). The second advantage is that every pixel belongs to a segmented region. In immersion-based watershed techniques, the pixels themselves form the watershed lines. A common problem that arises from this is that these watershed lines may have a width greater than one pixel (i.e., the minimum resolution in an image). Other than leading to inaccuracies in the image segmentation, pixels that form part of the watershed line and do not belong to a region also slow down the region merging process that usually follows the calculation of the watershed transform. Other researchers have proposed using a hill-climbing technique for their watershed architecture [13]. This technique is similar to rainfall simulation except that it starts from the minima and climbs by the steepest slope. With suitable modifications, the techniques proposed in this paper can also be applied for implementing a hill-climbing watershed transform.
This paper describes a hardware architecture to implement the watershed algorithm using rainfall simulation. The speed of the architecture is increased by utilizing a multiple memory bank approach to allow parallel access to the neighbourhood pixel values. This approach has the advantage of allowing the centre and neighbouring pixel values to be obtained in a single clock cycle without the need for storing multiple copies of the pixel values. Compared to the memory architecture proposed in [14], our proposed architecture is able to obtain all five values required for the watershed transform in a single read cycle. The method described in [14] requires two read cycles: one read cycle for the centre pixel value using the Centre Access Module (CAM) and another read cycle for the neighbouring pixels using the Neighbourhood Access Module (NAM).
The paper is structured as follows. Section 2 describes the implemented watershed algorithm. Section 3 describes a multiple-bank memory storage method based on graph analysis. This is used in the watershed architecture to increase processing speed by allowing multiple values (i.e., the centre and neighbouring values) to be read in a single clock cycle. This multiple-bank storage method has the same memory requirement as methods which store the pixel values in a single bank. The watershed architecture is described in two parts, each with their respective examples. The parts are split up based on their functions in the watershed transform, as shown in Figure 1. Section 4 describes the first part of the architecture, called Architecture-Arrowing, which is followed by an example of its operation in Section 5. Similarly, Section 6 describes the second part of the architecture, called Architecture-Labelling, which is followed by an example of its operation in Section 7. Section 8 describes the synthesis and implementation on a Xilinx Spartan-3 FPGA. Section 9 summarizes this paper.
2. The Watershed Algorithm Based on Rainfall Simulation
The watershed transformation is based on visualizing an image in three dimensions: two spatial coordinates versus grey levels. The watershed transform used is based on the rainfall simulation method proposed in [7]. This method simulates how falling rain water flows from higher-level regions called peaks to lower-level regions called valleys. The rain drops that fall over a point will flow along the path of the steepest descent until reaching a minimum point.
The general process involved in calculating the watershed transform is shown in Figure 1. Generally, a gradient image is used as input to the watershed algorithm. By using a gradient image, the catchment basins should correspond to the homogeneous grey-level regions of the image. A common problem with the watershed transform is that it tends to oversegment the image due to noise or local irregularities in the gradient image. This can be corrected using a region merging algorithm or by preprocessing the image prior to the application of the watershed transform.
[Figure 1 diagram: gradient image (edge detect) → watershed (region detect) → region merging; the watershed stage consists of arrowing (find the steepest descending path for each pixel and label accordingly) followed by labelling (label all pixels to their respective catchment basins).]
Figure 1: General preprocessing and postprocessing steps involved when using the watershed, together with the two main steps involved in the watershed transform. First, find the direction of the steepest descending path and label the pixels to point in that direction. Using the direction labels, the pixels are then relabelled to match the label of their corresponding catchment basin.
[Figure 2 diagrams: (a) priority numbering 1 to 4 of the four directions; (b) the corresponding direction labels.]
Figure 2: The steepest descending path direction priority and the naming convention used to label the direction of the steepest descending path. (a) shows the criterion used to determine the order of the steepest descending path when there is more than one possible path, that is, when the pixel has two or more lower neighbours with equivalent values. Paths are numbered in increasing priority from the left, moving in a clockwise direction towards the right and the bottom; the path with the highest priority is labelled 1 and the lowest priority is labelled 4. (b) shows the labels used to indicate the direction of the steepest descent path. The labels shown correspond with the direction of the arrows.
The watershed transform starts by labelling each input pixel to indicate the direction of the steepest descent. In other words, each pixel points to its neighbour with the smallest value. There are two neighbour-connectivity approaches that can be used. The first approach, called 8-connectivity, considers all eight neighbours surrounding the pixel; the second approach, called 4-connectivity, only considers the neighbours to its immediate north, south, east, and west. In this paper, we use the 4-connectivity approach. The direction labels are chosen to be negative values from −1 to −4 so that they will not overlap with the catchment basin labelling, which will start from 1. These direction labels are shown in Figure 2. There are four different possible direction labels for each pixel, for neighbours in the vertical and horizontal directions. This process of finding the steepest descending path is repeated for all pixels so that every pixel will point
[Figure 3 diagram: pixels are classified as nonplateau (no similar-valued neighbours) or plateau (a group of connected pixels with the same value). A nonplateau pixel with at least one lower neighbour is normal and is labelled to its lowest neighbour; one with no lower neighbour is labelled as a minimum. Plateau pixels with lower, similar, and/or higher-valued neighbours are iteratively classified as edge (dark grey) or inner (light grey) pixels; a plateau whose pixels are all of lesser value than their neighbours is labelled entirely as minima. A numeric example illustrates each class of pixel encountered during labelling of the steepest descending path.]
Figure 3: Various arrowing conditions that occur.
to the direction of steepest descent. If a pixel, or a group of connected similar-valued pixels, has no neighbours with a lower value, it becomes a regional minimum. Following the steepest descending path of each pixel will lead to a minimum (or regional minimum). All pixels along the steepest descending path will be assigned the label of that minimum to form a catchment basin. Catchment basins are formed by the minimum and all pixels leading to it. Using this method, the region boundary lines are formed by the edges of the pixels that separate the different catchment basins.
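The arrowing decision for a single pixel can be sketched as follows. The neighbour order encodes the tie-breaking priority of Figure 2; the exact (west, north, east, south) ordering and its mapping to labels −1 to −4 is an assumption made for illustration, not a statement of the paper's convention.

```python
# 4-connectivity neighbours as (dx, dy, direction_label); the order encodes
# the tie-breaking priority (assumed here: west, north, east, south).
NEIGHBOURS = ((-1, 0, -1), (0, -1, -2), (1, 0, -3), (0, 1, -4))

def arrow_label(img, x, y):
    """Return the direction label of the steepest descent from (x, y), or
    None if the pixel has no strictly lower neighbour (minimum or plateau)."""
    h, w = len(img), len(img[0])
    best_val, best_label = img[y][x], None
    for dx, dy, label in NEIGHBOURS:
        nx, ny = x + dx, y + dy
        # Strict '<' keeps the earlier, higher-priority direction on ties.
        if 0 <= nx < w and 0 <= ny < h and img[ny][nx] < best_val:
            best_val, best_label = img[ny][nx], label
    return best_label
```

Because the comparison is strict, a later neighbour with the same lowest value never displaces an earlier one, which is exactly how a fixed priority resolves multiple equally steep descending paths.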
The earlier description assumed that there will always be only one lower-valued neighbour or none at all. However, this is often not the case. There are two other conditions that can occur during the pixel labelling operation: (1) when there is more than one steepest descending path because two or more lowest-valued neighbours have the same value, and (2) when the current pixel value is the same as that of any of its neighbours. The second condition is called a plateau condition and increases the complexity of determining the steepest descending path.
These two conditions are handled as follows.
(1) If a pixel has more than one steepest descending path, the steepest descending path is simply selected based on a predefined priority criterion. In the proposed algorithm, the highest priority is given to paths going up from the left, and the priority decreases as we move to the right and down. The order of priority is shown in Figure 2.
(2) If the image has regions where the pixels have the same value and are not a regional minimum, they are called nonminima plateaus. The pixels of a nonminima plateau can be divided into two groups:
(i) descending edge pixels of the plateau: every pixel in the plateau which has a neighbour with a lower value; these pixels are simply labelled with the direction to their lower-valued neighbour;
(ii) inner pixels: every pixel whose neighbours all have values equal to or higher than its own.
Figure 3 shows a summary of the various arrowing conditions that may occur. Normally, the geodesic distances from the inner points to the descending edge are determined to obtain the shortest path. In our watershed transform, this step has been simplified by eliminating the need to explicitly calculate and store the geodesic distance. The method used can be thought of as a shrinking plateau. Once the edges of a plateau have been labelled with the direction of the steepest descent, the inner pixels neighbouring these edge pixels will point to those edges. These edges are then stripped, and the neighbouring inner pixels become the new edges. This is performed until all the pixels in the plateau have been labelled with the path of steepest descent (see Section 4.7 for more information).
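The shrinking-plateau idea amounts to a breadth-first wave propagating inward from the already-arrowed edge pixels. The queue-based formulation below is an illustrative software reconstruction under that reading, not the authors' hardware design; for simplicity, an inner pixel's direction is encoded as the coordinates of the edge pixel it points to.

```python
from collections import deque

def arrow_plateau(plateau, edge_arrows):
    """Shrinking-plateau arrowing: propagate directions inward from the edges.

    plateau: set of (x, y) pixels sharing one value; edge_arrows: dict mapping
    the already-labelled descending-edge pixels to their direction. Each inner
    pixel is made to point at a neighbouring (stripped) edge pixel.
    """
    arrows = dict(edge_arrows)
    frontier = deque(edge_arrows)                  # current edge layer
    remaining = set(plateau) - set(edge_arrows)    # inner pixels left to label
    while frontier:
        x, y = frontier.popleft()
        for nx, ny in ((x - 1, y), (x, y - 1), (x + 1, y), (x, y + 1)):
            if (nx, ny) in remaining:
                arrows[(nx, ny)] = (x, y)          # point toward the old edge
                remaining.discard((nx, ny))
                frontier.append((nx, ny))          # becomes part of the new edge
    return arrows
```

Each while-loop layer corresponds to stripping one ring of edge pixels, so no geodesic distance ever needs to be stored explicitly.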
[Figure 4 panels: (a) original input values, typically from a gradient image, with the row and column numbering conventions; (b) identification of catchment basins, formed by the local minima (circled) and plateaus (shaded), with the direction of the steepest path indicated by arrows; (c) labelling of the pixels based on the direction of the path of steepest descent, with the circled catchment basins given catchment basin labels (bold lettering in the circles); (d) region labelling, where all pixels which flow to a particular catchment basin assume that catchment basin's label. The labelling convention for the various paths is indicated by the negative values at the end of the direction arrows: the steepest descending paths are labelled from the left, moving in a clockwise direction with increasing priority, and this priority definition determines which steepest descending path to choose when there are two or more lowest-valued neighbours with the same value.]
Figure 4: Example of a four-connectivity watershed performed on 8 × 8 sample data. (a) shows the original gradient image values. (b) shows the direction of the steepest descending path for each pixel; minima are highlighted with circles. (c) shows the pixels where the steepest descending paths and minima have been labelled; the labels used for the direction of the steepest descending path are shown on the right side of the figure. (d) shows the 8 × 8 data fully labelled; the pixels have been assigned the label of their respective minima, forming catchment basins.
The final step, once all the pixels have been labelled with the direction of steepest descent, is to assign them labels that correspond to the label of their respective minimum/minima. This is done by scanning each pixel and following the path indicated by each pixel to the next pixel. This is performed repeatedly until a minimum is reached. All the pixels in the path are then assigned the label of that minimum. An example of all the algorithm steps is shown in Figure 4. The operational flowchart of the watershed algorithm is shown in Figure 5.
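The path-following labelling step can be sketched as below. This is an illustrative software rendering of the described scan, with one common refinement: each walked path is memoised with its basin label so later pixels stop as soon as they hit an already-labelled pixel. It assumes every arrow chain terminates in a minimum.

```python
def label_basins(arrows, minima_labels):
    """Assign every pixel the catchment-basin label its arrows lead to.

    arrows: dict mapping each non-minimum pixel to the next pixel along its
    steepest-descent direction; minima_labels: dict mapping each minimum
    pixel to its basin label (1, 2, ...).
    """
    labels = dict(minima_labels)
    for start in arrows:
        path, p = [], start
        while p not in labels:      # follow the arrows to a labelled pixel
            path.append(p)
            p = arrows[p]
        for q in path:              # memoise the label along the walked path
            labels[q] = labels[p]
    return labels
```

The memoisation means each pixel is walked over only a small number of times in total, which mirrors why the repeated scanning described above converges quickly.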
3. Graph-Based Memory Implementation
Before going into the details of our architecture, we discuss a multiple-bank memory storage scheme based on graph analysis. This is used to speed up operations by allowing all five pixel values required for the watershed transform to be read in a single clock cycle, with the same memory storage requirement as a single-bank implementation. A similar method has been proposed in [14]. However, their method requires twice the number of read cycles compared
[Figure 5 flowchart: for the current pixel location, find the neighbour locations and get the pixel and neighbour values. If the pixel value is not the smallest compared to its neighbours, it is labelled to its smallest-valued neighbour, or by direction priority when there are two or more lower descending paths. If the pixel value is the smallest and it has no similar-valued neighbours, it is labelled as a minimum. If it has a neighbour with the same value (a plateau), all connected pixels with the same value are found, their locations stored, and each pixel is read and classified; processing then moves to the next pixel location.]
Figure 5: Watershed algorithm flowchart.
to our proposed method. Their method requires two read cycles: one to obtain the centre value and another to obtain the neighbourhood values. This effectively doubles the number of clock cycles required for reading the pixel values.
To understand why this is important, recall that one of the main procedures of the watershed transform is to find the path of steepest descent. This requires the values of the current and neighbouring pixels. Traditionally, these values can be obtained using
(1) sequential reads: a single memory bank the size of the image is read five times, requiring five clock cycles;
(2) parallel reads: five replicated memory banks, each the size of the image, are read. This requires five times the memory needed to store a single image, but all required values can be obtained in a single clock cycle.
Using this multiple-bank method, we obtain the speed advantage of the parallel read with the nonreplicating storage of the sequential reading method. The advantages of the multiple-bank method are to
(1) reduce the memory space required for storing the image by up to five times,
(2) obtain all values for the current pixel and its neighbours in a single read cycle, eliminating the need for a five-clock-cycle read.
This multiple-bank memory storage stores the image in separate memory banks. This is not a straightforward division of the image pixels by the number of memory banks; a special arrangement is required that does not overlap and that supports access to five banks simultaneously to obtain the five pixel values (Centre, East, North, South, West). The problem now is to
(1) determine the number of banks required to store the image,
(2) fill the banks with the image data,
(3) access the data in these banks.
All of these steps are addressed in the following sections in the order listed above.
Figure 6: N4 connectivity graph. Two subgraphs combined to produce an 8-bank structure allowing five values to be obtained concurrently. (a) The neighbourhood graph for 4-neighbour connectivity: each pixel is represented by a vertex (node), and two distinct subgraphs arise, each fully connected (via edges) to all its neighbours; note that no vertex is connected to any of its four neighbours in the other subgraph. (b) The combined subgraph with nonoverlapping labels: each number is colour coded and corresponds to a single bank, and the nonoverlapping arrangement allows concurrent access to the centre pixel value and its associated neighbours. The complete image is stored in eight different banks.
3.1. Determining How Many Banks Are Needed. This section describes how the number of banks needed to allow simultaneous access is determined. This depends on (1) the neighbour connectivity and (2) the number of values to be obtained in one read cycle. Here, graph theory is used to determine the minimum number of databanks required to satisfy the following:
(1) none of the values that we want can come from the same bank;
(2) none of the image pixels are stored twice (i.e., no redundancy).
Satisfying these criteria results in the minimum number of banks required, with no additional memory needed compared to a standard single-bank storage scheme.
Imagine every pixel in an image as a region, with a vertex (node) added to each pixel. For 4-neighbour connectivity (N4), the connectivity graph is shown in Figure 6. Determining the number of banks for parallel access can be viewed as a graph colouration problem, whereby none of the parallel values may come from the same bank. We ensure that each node has neighbours of a different colour, or in our case number. Each of these colours (or numbers) corresponds to a different bank. The same method can be applied to different connectivity schemes such as 8-neighbour connectivity.
In our implementation of 4-neighbourhood connectivity and five concurrent memory accesses (for five concurrent values), we require eight banks. The discussion and examples to follow use these implementation criteria.
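To make the eight-bank requirement concrete, the sketch below tiles the 4 × 4 bank pattern of Figure 7 across the image and checks the colouration property: the centre pixel and its four neighbours always land in five distinct banks. This is an illustrative Python check, not the hardware logic.

```python
# 4x4 repeating bank pattern transcribed from Figure 7
PATTERN = [[6, 2, 4, 0],
           [1, 5, 3, 7],
           [4, 0, 6, 2],
           [3, 7, 1, 5]]

def bank(r, c):
    return PATTERN[r % 4][c % 4]   # the pattern tiles the whole image

# Colouring property: C, W, N, E, S never share a bank, so all five
# values can be read in a single cycle, one value per bank.
for r in range(1, 15):
    for c in range(1, 15):
        five = {bank(r, c), bank(r, c - 1), bank(r - 1, c),
                bank(r, c + 1), bank(r + 1, c)}
        assert len(five) == 5
```

Since the pattern repeats with period four in both directions, checking one tile's worth of positions covers the whole image.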
Figure 7: Block diagram of graph-based memory storage and retrieval. (a) Using cardinal directions, C, W, N, E, and S are the centre, west, north, east, and south values, respectively, corresponding to the current pixel and its left, top, right, and bottom neighbours. (b) Any filling order is possible; for any order, the bank and the address within the bank are determined by the same logic (see Figure 8). Using a traditional raster scan (top-left to bottom-right, one pixel at a time) as an example, the order of bank_select follows the bank numbers as they appear in the tiled pattern; a crossbar routes the eight bank outputs to the C, W, N, E, and S lines.
3.2. Filling the Banks. After determining how many banks are needed, we fill them by writing the individual values one at a time into the respective banks. During the determination of the number of required banks, a pattern emerges from the connectivity graph. An example of this pattern is highlighted with a detached bounding box in Figures 6 and 7.
The eight banks are filled with one value at a time, in any order. The bank number and bank address are calculated using the same logic that determines the bank and bank address during reading (see Section 3.3 for more details). For ease of explanation, we adopt a raster-scan sequence. Using this convention, the order of filling is simply the order of the bank numbers as they appear from top-left to bottom-right. An example of this is shown in Figure 7.
The group of banks replicates itself every four pixels in either direction (i.e., right and down). Hence, to determine how many times the pattern is replicated, the image size is simply divided by sixteen. Alternatively, either side length can be divided by four, since all images are square. This is important as the addressing for filling (and reading) the banks holds true for square images whose side lengths are powers of two (i.e., 2^2, 2^3, 2^4). Images which are not square are simply padded.
3.3. Accessing Data in the Banks. To access data in this multiple-bank scheme, we need to know (1) which bank and (2) the location within that bank. A simple addressing scheme based on the pixel location is used. A hardware unit called the Address Processor (AP) handles the memory addressing. Given a pixel location, the AP calculates the address needed to retrieve that pixel value, identifying both the bank and the location within that bank where the pixel value is stored.
To understand how the AP works, consider a pixel coordinate consisting of a row and column value with the origin at the upper-left corner. These two values are represented in binary, and the least significant bits of the column and row are used to determine the bank. The number of bits required depends on the total number of banks in the scheme. In our case of eight banks, three bits of the address are needed to determine which bank holds the value for a particular pixel location. These binary values go through the logic shown in Figure 8, or in equation form:
B[2] = r[0]c[0]′ + c[0]r[0]′,
B[1] = r[1]′r[0]′c[1] + r[1]r[0]′c[0]′ + r[1]′r[0]c[0]′ + r[1]r[0]c[0],
B[0] = r[0],
(1)
where B[0–2] are the three bits that determine the bank number (from 0–7), r[0] and r[1] are the first two bits of the row value in binary, and c[0] and c[1] are the first two bits of the column value in binary.
Now that we have determined which bank the value is in, the remainder of the bits is used to determine the location of the value within the bank. An example is given in Figure 8(a).
For an image of y rows and x columns, the number of bits required for addressing is simply the number of bits required to store the largest row and column values in binary, that is, no. of address bits = log2(x) + log2(y). This addressing scheme is shown in Figure 8. (Note that the steps described here assume an image with a minimum size of 4 × 4, increasing in powers of 2.)
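As a quick numeric check of the formula above, the sketch below computes the total address width and its split into bank-select and within-bank bits for an 8-bank scheme (the function name and return format are illustrative; power-of-two image sides are assumed, as in the text):

```python
import math

def address_widths(x, y, banks=8):
    """Total address bits = log2(x) + log2(y); with 8 banks, 3 of
    those bits select the bank and the rest address within it."""
    total = int(math.log2(x)) + int(math.log2(y))
    bank_bits = int(math.log2(banks))   # 3 bits for 8 banks
    return total, bank_bits, total - bank_bits

print(address_widths(8, 8))   # (6, 3, 3) for the 8 x 8 sample image
```

For the 8 × 8 sample data this gives 6 address bits: 3 select one of the eight banks and 3 address one of the eight entries within it.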
3.4. Sorting the Data from the Banks. After obtaining the five values from the banks, they need to be sorted according to the expected neighbour location output to ensure that the value for a particular direction is sent to the right output position. This sorting is handled by another hardware unit called the Crossbar (CB). In addition, the CB tags invalid values arising from invalid neighbour conditions, which occur at the corners and edges of the image. This tagging is part of the output multiplexer control.
The complete structure for reading from the banks is shown in Figure 9. In this figure, five pixel locations are fed into the AP, which generates five addresses for the centre and its four neighbours. These five addresses are fed into all eight banks, but only the address corresponding to the correct bank is chosen by add_sel_x, where x = 0–7. The addresses fed into the banks generate eight values; however, only five are chosen by the CB. These values are also sorted by the CB to ensure that the values corresponding to the centre pixel and each particular neighbour are output onto the correct data lines. The mux control, CB_sel_x, is driven by the same logic that selects add_sel_x.
4. Arrowing Architecture
This section provides the details of the part of the architecture that performs the arrowing function of the algorithm, describing how we get from Figure 4(a) to Figure 4(c) in hardware. As mentioned in the description of the algorithm, things are simple when every pixel has a lower neighbour and become more complicated under plateau conditions. Similarly, the plateau condition complicates the architecture. Adding to this complexity is the fact that all neighbour values are obtained simultaneously, so instead of processing one value at a time, we have to process five values: the centre and its four neighbours. The part of the architecture that performs the arrowing is shown in Figure 10.
When a pixel location is fed into the system, it enters the Centre and Neighbour Coordinates block. From this, the coordinates of the centre and its four neighbours are output and fed into the Multibank Memory block to obtain all the pixel values, and the pixel status (PS) is obtained from the Pixel Status block.
Assuming the normal state, the input pixel will have a
lower neighbour and no neighbours of the same value, that
is, inner = 0 and plat = 0. The pixel will just be arrowed to the
Figure 8: The addressing scheme for the multiple-bank graph-based memory storage. (a) Example of the location-to-address calculation: the row and column values in binary are split; their least significant bits feed the bank-address logic, while the remaining bits give the location within the bank. For pixel (3, 3), the logic yields address 2 of bank 3. In the case of 8 banks, 3 bits are needed to determine which bank the data is located in; for 4 and 16 banks, 2 and 4 bits are required, respectively. The convention is that the first pixel location is (0, 0), and bank and address counts start from 0 (the first bank is 0 and the last is 7; similarly, the first address location is 0 and the last is 7).
nearest neighbour. The Pixel Status (PS) for that pixel will be changed from 0 → 6 (see Figure 19).
However, if the pixel has a similar-valued neighbour, plat = 1 and plateau processing will start. Plateau processing starts off by finding all the current pixel's neighbours of similar value and writing them to Q1. Q1 is predefined to be the first
Figure 9: 8-bank memory architecture. Five pixel neighbour coordinates (c, r), (c−1, r), (c, r−1), (c+1, r), and (c, r+1) are fed to the Address Processor (AP-C, AP-W, AP-N, AP-E, AP-S), whose outputs are routed to the eight banks B0–B7 via add_sel_0 to add_sel_7; the Crossbar (CB), controlled by CB_sel_0 to CB_sel_4, sorts the bank outputs onto the C, W, N, E, and S lines and tags invalid values.
queue to be used. After writing to the queue, the PS of the pixels is changed from 0 → 1. This indicates which pixel locations have been written to the queue, avoiding duplicate entries. At the end of this process, all the pixel locations belonging to the plateau will have been written to Q1.
To keep track of the number of elements in Q1_WNES, two sets of memory counters are used: mc1–mc4 in one set and mc6–mc9 in the other. When writing to Q1_WNES, both sets of counters are incremented in parallel, but when reading from Q1_WNES to obtain the neighbouring plateau pixels, only mc1–4 is decremented while mc6–9 remains unchanged. This means that, at the end of the Stage 1 processing, mc1–4 = 0 and mc6–9 will contain the count of the number of pixel locations contained within Q1_WNES.
This is needed to handle the case of a complete lower minima (i.e., a plateau consisting entirely of inner pixels). When this type of plateau is encountered, mc5 = 0, and Q1_WNES will be read once again using mc6–9, this time not to obtain the same-valued neighbours but to label all the pixel locations within Q1_WNES with the current value stored in the minima register. Otherwise, mc5 > 0 and values will be read from Q1_C and subsequently from Q2_WNES and Q1_WNES until all the locations in the plateau have been visited and classified. The plateau processing steps and the associated conditions are shown in Figure 11.
There are other parts which are not shown in the main diagram but warrant discussion. These are
(1) memory counters, to determine the number of unprocessed elements in a queue;
(2) the priority encoder, to determine the controls for Q1_sel and Q2_sel.
The rest of the architecture consists of a few main parts shown in Figure 10:
(1) centre and neighbour coordinates, to obtain the centre and neighbour locations,
(2) multibank memory, to obtain the five required pixel values,
(3) smallest-valued neighbour, to determine which neighbour has the smallest value,
Figure 10: Watershed architecture based on rainfall simulation. Shown here is the arrowing architecture, which starts from pixel memory and ends with an arrow memory holding labels that indicate the steepest descending paths.
(4) plat/inner, to determine if the current pixel is part of a plateau and whether it is an edge or inner plateau pixel,
(5) arrowing, to determine the direction of steepest descent, which is written to the Arrow Memory,
(6) pixel status, to determine the status of the pixels, that is, whether they have been read before, put into a queue before, or have been labelled.
The next subsections describe the parts listed above in the same order.
Figure 11: Stages of plateau processing and their various conditions. Notes: (1) In stage 1 of the processing, mc6–9 is used as a secondary counter for Q1_WNES, incrementing as mc1–4 increments but not decrementing when mc1–4 is decremented. In stage 2, if mc5 = 0 (i.e., a complete lower minima), mc6–9 is used as the counter tracking the number of elements in Q1_WNES and is decremented when Q1_WNES is read from; however, if mc5 > 0, mc6–9 is reset and resumes the role of memory counter for Q2_WNES. (2) Q1_C is only ever used once, during stage 2 of the processing.
4.1. Memory Counter. The architecture is a three-state system whose state is determined by whether the queues Q1 and Q2 are empty or not. This is shown in Figure 12. These states in turn determine the control of the main multiplexer, in_ctrl, which controls the data input into the system.
Figure 12: State diagram of the arrowing architecture. The in_ctrl values equal the state numbers; E1 = 1 when Q1 is empty, and E2 = 1 when Q2 is empty.
Figure 13: Memory counters for queues C, W, N, E, and S. The memory counters are used to determine the number of elements in the various queues for the directions Centre, West, North, East, and South.
To determine the initial queue states, Memory Counters (MCs) keep track of how many elements are pending processing in each of the West, North, East, South, and Centre queues. There are five MCs for Q1 and another five for Q2, one counter for each queue direction. These MCs are named mc1–5 for Q1_W, Q1_N, Q1_E, Q1_S,
Table 1: Comparison of the number of clock cycles required for reading all five required values and the memory requirements for the three different methods.

              Sequential      Parallel        Graph-based
Clock cycles  5               1               1
Memory req.   1× image size   5× image size   1× image size
and Q1_C, respectively, and similarly mc6–10 for Q2_W, Q2_N, Q2_E, Q2_S, and Q2_C, respectively. This is shown in Figure 13.
The MCs increase by one count each time an element is written to the queue and decrease by one count each time an element is read from it. The increment is determined by tracking the write enables we_tx, where x = 1–10, while the decrement is determined by tracking the values of Q1_sel and Q2_sel.
A special case occurs during stage one of plateau processing, whereby mc6–9 is used to count the number of elements in Q1_W, Q1_N, Q1_E, and Q1_S, respectively. In this stage, mc6–9 is incremented when the queues are written to but is only decremented when Q1_WNES is read again in stage two for complete lower minima labelling.
The MC primarily consists of a register and a multiplexer which selects between a (+1) increment or a (−1) decrement of the current register value. Selecting between these two values and writing the result back to the register effectively counts up and down. The update of the MC register value is controlled by a write enable, which is the output of a 2-input XOR. This XOR gate ensures that the MC register is updated only when exactly one of its inputs is active.
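A behavioural sketch of one such counter (our own Python model, not the RTL): the XOR of the write and read strobes gates the update, and whichever strobe is active selects +1 or −1.

```python
class MemoryCounter:
    """Up/down counter gated by XOR(write, read), as described above."""
    def __init__(self):
        self.count = 0

    def tick(self, wrote, read):
        if wrote != read:                     # 2-input XOR write enable
            self.count += 1 if wrote else -1  # +1 on write, -1 on read

mc = MemoryCounter()
mc.tick(True, False)    # element queued
mc.tick(True, False)    # element queued
mc.tick(False, True)    # element dequeued
mc.tick(True, True)     # simultaneous write and read: no net change
print(mc.count)         # 1
```

The simultaneous write-and-read case leaves the count unchanged, which matches the XOR gating: the register is only updated when exactly one strobe is active.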
4.2. The Priority Encoder. The priority encoder determines the outputs Q1_sel and Q2_sel by comparing the outputs of the MCs to zero. It selects the output from the queues in the order they are stored, that is, from queue Qx_W to Qx_C, x = 1 or 2. Together with the state of in_ctrl, Q1_sel and Q2_sel determine the data input into the system. The logic for the control bits of Q1_sel and Q2_sel is shown in Figure 14.
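Behaviourally, the encoder picks the first non-empty queue in the fixed W, N, E, S, C order (a Python sketch of the selection rule, not the gate-level logic of Figure 14):

```python
def queue_select(counts):
    """counts: element counts of the [W, N, E, S, C] queues.
    Returns the index of the first non-empty queue (highest priority
    first), or None when all queues are empty (the disable case)."""
    for i, n in enumerate(counts):
        if n > 0:
            return i
    return None

print(queue_select([0, 0, 3, 1, 0]))   # 2: the East queue is served first
```

Returning None for the all-empty case mirrors the special disable condition mentioned in the figure, which keeps the selects from interfering with the memory counters.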
4.3. Centre and Neighbour Coordinate. The centre and neighbourhood block determines the coordinates of the pixel's neighbours and passes through the centre coordinate. These coordinates are used to address the various queues and the multibank memory. The block performs an addition and subtraction by one unit on both the row and column coordinates, with the results rearranged and grouped into their respective outputs. The outputs from the block are five pixel locations, corresponding to the centre pixel location and the four neighbours West (W), North (N), East (E), and South (S). This is shown in Figure 15.
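The block's behaviour amounts to four ±1 offsets, sketched below in Python (the (row, column) ordering is our choice for illustration):

```python
def neighbour_coordinates(r, c):
    """Centre plus 4-connected neighbours, as produced by the block."""
    return {"C": (r, c),
            "W": (r, c - 1),   # column - 1
            "N": (r - 1, c),   # row - 1
            "E": (r, c + 1),   # column + 1
            "S": (r + 1, c)}   # row + 1

print(neighbour_coordinates(3, 3)["W"])   # (3, 2)
```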
4.4. The Smallest-Valued Neighbour Block. This block determines the smallest-valued neighbour (SVN) and its position relative to the current pixel. This is used to determine if the current pixel has a lower minima and to find the steepest descending path to that minima (arrowing).
Figure 14: The priority encoder. (a) shows the controls for Q1_sel and Q2_sel using the priority encoders; the outputs of the memory counters (compared to zero, giving signals a–e for mc1–5 and f–j for mc6–10) determine the multiplexer controls of Q1_sel and Q2_sel. (b) shows the logic of the priority encoders used: Q1_sel[0] = a′ + abc′ + abcde′, Q1_sel[1] = ab′ + abc′, Q1_sel[2] = abcd′ + abcde′, and similarly for Q2_sel with inputs f–j. There is a special disable condition for the multiplexers of Q1 and Q2, used so that Q1_sel and Q2_sel can have an initial condition and will not interfere with the memory counters.
Figure 15: Inside the Pixel Neighbour Coordinate block. The row and column values are incremented and decremented by one to form the C, W, N, E, and S coordinate outputs.
To determine the smallest-valued pixel, the values of the neighbours are compared two at a time, and the result of each comparator is used to select the smaller of the two. The two winners are then compared once more, yielding the value of the smallest-valued neighbour. As for the direction of the SVN, the outputs from the three stages of comparison are compared against a truth table. This is shown in Figure 16. The output is passed to the arrowing block to determine the direction of steepest descent (when there is a lower neighbour).
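The comparison tree can be sketched as follows (plain Python; the tie-break favouring the earlier direction in W, N, E, S order is an assumption about the priority encoding, not taken from the paper's truth table):

```python
def smallest_valued_neighbour(w, n, e, s):
    """Two pairwise compares plus a final compare, as in Figure 16."""
    v1, d1 = (w, "W") if w <= n else (n, "N")   # first comparator
    v2, d2 = (e, "E") if e <= s else (s, "S")   # second comparator
    return (v1, d1) if v1 <= v2 else (v2, d2)   # final comparator

print(smallest_valued_neighbour(20, 10, 45, 14))   # (10, 'N')
```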
4.5. The Plateau-Inner Block. This block determines whether the current pixel is part of a plateau and, if so, which type of plateau pixel it is. The pixel type determines what is done to the pixel and its neighbours, that is, whether they are put back into a queue or otherwise. Essentially, together with the Pixel Status, it helps to determine if a pixel or one of its neighbours should be put back into the queues for further processing. When the system is in State 0 (i.e., processing pixel locations from the PC), the block determines if the current pixel is part of a plateau. The value of the current pixel is compared to all its neighbours; if any neighbour has the same value as the current pixel, the pixel is part of a plateau and plat = 1. The respective similar-valued neighbours are put into the different queue locations based on sv_W, sv_N, sv_E, and sv_S and the value of the pixel status. The logic for this is shown in Figure 17(a).
In any other state, this block is used to determine if the current pixel is an inner (i.e., equal to or smaller than all its neighbours). If the current pixel is an inner, inner = 1. This is shown in Figure 17(b). Whether the pixel is an inner or not determines the arrowing part of the system: if it is an inner, it will point to the nearest edge.
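The two flags can be modelled directly from their definitions (an illustrative Python sketch of Figure 17, not the comparator hardware):

```python
def plat_inner(c, w, n, e, s):
    """plat: some neighbour equals the centre value;
    inner: the centre is <= all of its neighbours."""
    neighbours = (w, n, e, s)
    plat = any(v == c for v in neighbours)
    inner = all(c <= v for v in neighbours)
    return plat, inner

print(plat_inner(20, 20, 37, 45, 16))   # (True, False): a plateau edge pixel
```

A plateau edge pixel (plat = 1, inner = 0) still has a strictly lower neighbour to arrow to; an inner pixel (inner = 1) must wait for the plateau processing described above.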
4.6. The Arrowing Block. This block determines the steepest-descending-path label for the Arrow Memory. The steepest path is calculated based on whether the pixel is an inner or otherwise. When processing non-inner pixels, the arrowing block generates a direction output based on the location of the lowest neighbour obtained from the Smallest-Valued Neighbour block. If the pixel is an inner, the arrow will simply point to the nearest edge. When there is more than one possible path to the nearest edge, a priority
Figure 16: Inside the Smallest-Valued Neighbour (SVN) block. (a) The smallest-valued neighbour is determined and selected using a set of comparators and multiplexers. (b) The location of the smallest-valued neighbour is determined by the selections of each multiplexer. This location information is used to determine the steepest descending path and is fed into the arrowing block.
Figure 17: Inside the Plateau-Inner Block. (a) sv_W = 1 when C = W_value, and similarly for sv_N, sv_E, and sv_S; Plat is asserted when any neighbour equals the centre value. (b) lv_W = 1 when C ≤ W_value, and similarly for lv_N, lv_E, and lv_S; Inner is asserted when the centre is equal to or smaller than all four neighbours.
encoder in the block is used to select the predefined direction of the highest priority. This is shown in Figure 18. When the system is in State 0, or in any other state where the pixel is not an inner, the arrowing block uses the information from the SVN block and passes it directly to its own main multiplexer, selecting the appropriate value to be written into the Arrow Memory.
If the current pixel is found to be an inner, the arrowing direction is towards the highest-priority neighbour of the same value which has been previously labelled. This is possible because we label the plateau pixels from the edge pixels going in, one pixel at a time, ensuring that the inners always point in the direction of the shortest geodesic distance.
4.7. Pixel Status. One of the most important parts of this system is the set of pixel status (PS) registers. Since seven states are used to flag a pixel, this register requires a 3-bit representation for each pixel location of the image. Thus the PS registers contain as many registers as there are pixels in the input image. In the system, values from the PS help determine what processes a particular pixel location has gone through and whether it has been successfully labelled into the Arrow Memory. The states and their transitions are shown in Figure 19. The states are as follows:
(i) 0: unvisited, nothing has been done to the pixel;
(ii) 1: queued (initial);
(iii) 2: queued in Q2;
(iv) 3: queued in Q1;
(v) 4: completed when plat = 0;
(vi) 5: completed when plat = 1 and reading from Q2;
(vii) 6: completed when plat = 1 and reading from Q1.
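For reference, the seven status codes can be written as a small enumeration (a Python sketch; the constant names are our own):

```python
from enum import IntEnum

class PixelStatus(IntEnum):
    UNVISITED    = 0   # nothing has been done to the pixel
    QUEUED_INIT  = 1   # queued: initial
    QUEUED_Q2    = 2   # queued in Q2
    QUEUED_Q1    = 3   # queued in Q1
    DONE         = 4   # completed when plat = 0
    DONE_PLAT_Q2 = 5   # completed when plat = 1, reading from Q2
    DONE_PLAT_Q1 = 6   # completed when plat = 1, reading from Q1

# all seven codes fit in the 3-bit PS register
assert all(0 <= ps <= 7 for ps in PixelStatus)
```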
To ease understanding of how the plateau conditions are handled and how the PS is used, we introduce the concepts of the Unlabelled Pixel (UP) and the Labelled Pixel (LP). The UP is defined as the outermost pixel which has yet to be labelled. Using this definition, the arrowing procedure for the plateau pixels is:
(1) arrow to the lower-valued neighbour (applicable only if inner = 0);
(2) arrow to the neighbour with PS = 5 according to the predefined arrowing priority.
With reference to Figure 20, the PS is used to determine which neighbours of the UPs have not been put into the other queue, which are UPs of the same label, and which are LPs.
5. Example for the Arrowing Architecture
This example illustrates the states and various controls of the watershed architecture for an 8 × 8 sample data set. It is the same sample data shown in Figures 6 and 7. A table with the various controls, statuses, and queues for the first 14 clock cycles is shown in Table 2.
Figure 18: Inside the arrowing block. PS_x are the values read from the pixel status registers for the centre (C) and respective neighbours (W, N, E, S); sv_x are the same-value conditions obtained from the plat/inner block, where x is one of the directions W, N, E, S.
The initial condition of the system is as follows. The Program Counter (PC) starts with the first pixel and generates a (0, 0) output representing the first pixel in (x, y) format.
With both the Q1 and Q2 queues empty, that is, mc1–mc10 = 0, the system is in State 0. This sets in_ctrl = 0, which controls mux1 to select the PC value (in this case (0, 0)). This value is incremented on the next clock cycle.
The First Few Steps. This PC coordinate is then fed into the
Pixel Neighbour Coordinate block. The outputs of this block
Table 2: Example of conditions with reference to the system clock cycle. Queue 2 has been omitted because, by the 14th clock cycle, it has not been used.
c
l
k
i
n
Q
x
s
e
l
L
o
c
p
l
a
t
i
n
n
e
r
s
v
x
P
i
x
e
l
s
t
a
t
u
s
w
e
t
x
w
e
a
m
d
i
r
m
i
n
Q
u
e
u
e
1
c
o
n
t
e
n
t
s
C
W
N
E
S
W
N
E
S
C
1
0
0
(
0
,
0
)
1
0
S
0

6
i
n
v
i
n
v
0
0

1
t
4
,
t
5
1

3
0
[
1
]

(
1
,
0
)
[
1
]
[
1
]
(
0
,
0
)
[
1
]
2
2
Q
1
=
4
(
1
,
0
)
1
0
N
1

6
i
n
v
6
0
0
t
5
1

4
0
[
1
]

(
1
,
0
)
[
0
]
[
1
]
(
0
,
0
)
[
1
]
(
1
,
0
)
[
2
]
3
2
Q
1
=
5
(
0
,
0
)
1
0
S
6
i
n
v
i
n
v
0
0

0
[
1
]

(
0
,
0
)
[
0
]
(
1
,
0
)
[
1
]
4
2
Q
1
=
5
(
1
,
0
)
1
0
N
6
i
n
v
6
0
0

0
[
1
]

(
0
,
0
)
[
0
]
(
1
,
0
)
[
0
]
5
0
0
(
0
,
1
)
0
0

6
6
i
n
v
0
0

3
0
[
1
]

6
0
0
(
0
,
2
)
0
0

6
6
i
n
v
0
0

1
1
1
[
1

2
]

7
0
0
(
0
,
3
)
0
0

6
6
i
n
v
0
0

4
0
[
2
]

8
0
0
(
0
,
4
)
1
1
E
,
S
0
6
i
n
v
0

1
0

1
t
3
,
t
4
0
0
0
[
2
]

(
0
,
5
)
[
1
]
[
1
]
(
1
,
4
)
[
1
]
[
1
]

9
2
Q
1
=
3
(
0
,
5
)
1
1
W
,
S
0
0

1
i
n
v
0
0

1
t
1
,
t
4
0
0
0
[
2
]
(
0
,
4
)
[
1
]
[
1
]

(
0
,
5
)
[
0
]
[
1
]
(
1
,
4
)
[
1
]
[
1
]

(
1
,
5
)
[
2
]
[
2
]

1
0
2
Q
1
=
1
(
0
,
4
)
1
1
E
,
S
1
1
i
n
v
1
1

0
0
0
[
2
]
(
0
,
4
)
[
0
]
[
1
]

(
0
,
5
)
[
0
]
[
1
]
(
1
,
4
)
[
1
]
[
1
]

(
1
,
5
)
[
2
]
[
2
]

1
1
2
Q
1
=
4
(
1
,
4
)
1
0
N
,
E
,
S
0

6
1
i
n
v
1
0

1
t
4
,
t
5
1

1
0
[
2
]
(
0
,
4
)
[
0
]
[
1
]

(
0
,
5
)
[
0
]
[
1
]
(
1
,
4
)
[
0
]
[
1
]
(
1
,
4
)
[
1
]
(
1
,
5
)
[
1
]
[
2
]

(
2
,
4
)
[
2
]
[
3
]

1
2
2
Q
1
=
4
(
1
,
5
)
1
1
W
,
N
,
S
1
6
1
0
0

1
t
4
,
t
5
1

1
0
[
2
]
(
0
,
4
)
[
0
]
[
1
]

(
0
,
5
)
[
0
]
[
1
]
(
1
,
4
)
[
0
]
[
1
]
(
1
,
4
)
[
1
]
(
1
,
5
)
[
0
]
[
2
]

(
2
,
4
)
[
1
]
[
3
]

(
2
,
5
)
[
2
]
[
4
]

1
3
2
Q
1
=
4
(
2
,
4
)
1
0
N
,
E
,
S
0

6
0
6
1
0

1
t
4
,
t
5
1

1
0
[
2
]
(
0
,
4
)
[
0
]
[
1
]

(
0
,
5
)
[
0
]
[
1
]
(
1
,
4
)
[
0
]
[
1
]
(
1
,
4
)
[
1
]
(
1
,
5
)
[
0
]
[
2
]
(
2
,
4
)
[
2
]
(
2
,
4
)
[
0
]
[
3
]

(
2
,
5
)
[
1
]
[
4
]

(
3
,
4
)
[
2
]
[
5
]

1
4
2
Q
1
=
4
(
2
,
5
)
1
1
W
,
N
,
S
1
6
1
0
0

1
t
4
0

0
[
2
]
(
0
,
4
)
[
0
]
[
1
]

(
0
,
5
)
[
0
]
[
1
]
(
1
,
4
)
[
0
]
[
1
]
(
1
,
4
)
[
1
]
(
1
,
5
)
[
0
]
[
2
]
(
2
,
4
)
[
2
]
(
2
,
4
)
[
0
]
[
3
]

(
2
,
5
)
[
0
]
[
4
]

(
3
,
4
)
[
1
]
[
5
]

(
3
,
5
)
[
2
]
[
6
]

The pixel status is a 3-bit register. It is used to tag
the status of pixels. The various tags are as follows:
0: Never visited
1: Queued-initial (all plat pixel locations into Q1)
2: Queued in Q2
3: Queued in Q1
4: Completed when plat = 0
5: Completed when plat = 1 and reading from Q2
6: Completed when plat = 1 and reading from Q1
(Figure 19 also shows the state transitions between these tags, with conditions on plat, inner, PS, Qx_sel, and in_ctrl; the controls assume that Q1 is the first queue to be used.)
Figure 19: The pixel status block is a set of 3-bit registers used to
store the state of the various pixels.
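For reference, the seven pixel-status tags can be captured as a small enumeration; this is only a notational convenience for following the examples, not part of the hardware:

```python
from enum import IntEnum

class PixelStatus(IntEnum):
    """The seven 3-bit pixel status tags used by the arrowing architecture."""
    NEVER_VISITED = 0
    QUEUED_INITIAL = 1      # all plat pixel locations into Q1
    QUEUED_Q2 = 2
    QUEUED_Q1 = 3
    DONE_NON_PLAT = 4       # completed when plat = 0
    DONE_PLAT_FROM_Q2 = 5   # completed when plat = 1, reading from Q2
    DONE_PLAT_FROM_Q1 = 6   # completed when plat = 1, reading from Q1

# All tags fit in a 3-bit register (range 0-7).
assert all(0 <= s <= 7 for s in PixelStatus)
```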
(the pixel locations) are (0, 0), (0, 1) E, (1, 0) W,
(−1, 0) INVALID, and (0, −1) INVALID. The valid
addresses are then used to obtain the current pixel value,
10(C), and the neighbour values, 9(W) and 10(S). The invalid
pixel locations are set to output an INVALID value through
the CB mux. This value has been predefined to be 255.
The pixel locations are also used to determine address
locations within the 3-bit pixel status registers. When read,
the values are (0, 0) = 0, (0, 1) = 0, and (1, 0) = 0. The
(Figure 20 walks an example plateau through the pixel-status mechanism on the sample image of 10- and 20-valued pixels, with the PS tag of each plateau pixel shown in square brackets. Its annotations read as follows.)
Starting condition with pixel status (PS) = 0 (shown in square brackets). Scan the entire plateau by continuously feeding the unvisited neighbours back into the system; each visited pixel is flagged by changing PS = 0 → 1, and scanning stops when there are no more unvisited neighbours.
During the initial scan of all the plateau pixels, all the pixels with a lower neighbour are arrowed, put into Q1_C, and their PS = 0 → 6. Then Q1_C is read and all the neighbours to these pixel locations are put into Q2_WNES; when put into Q2, PS = 1 → 2.
After going through the plateau once (before reading Q1_C), the pixels in Q1_C already have PS = 6, because they are labelled before they are put into Q1_C.
After reading Q1_C, and before reading from Q2_WNES: when reading from Q2_WNES, all the inner unlabelled pixels (UP) arrow to the labelled pixels (LP), which can be identified because their PS = 6 (completed). When Q2_WNES is read, all the neighbours to these pixel locations are put into Q1_WNES; when put into Q1, PS = 1 → 3.
After reading Q1_WNES (from the previous cycle), and before reading from Q2_WNES: when reading from Q1_WNES, all the UP arrow to the LP, and the neighbours of each read location are written into Q2_WNES (PS = 1 → 2). This continues until there are no more neighbours to write into the other queue.
Figure 20: An example of how Pixel Status is used in the system.
neighbours with a similar value are put into the queue. In
the example used, only the south neighbour has a similar
value and is put into queue Q1_S. Next, the pixel status for
(1, 0) is changed from 0 → 1. This tells the system that
the coordinate (1, 0) has been put into the queue and will
avoid an infinite loop once its similar-valued neighbour to
(Figure 21 shows the labelling datapath: the pixel coordinate register feeds, via a 2-input mux, either its own incremented value or the output of the Reverse Arrowing block into the Arrow Memory, Path Queue, Label Memory, and Pixel Status memories. A buffer locks the CBL, a PQ_counter tracks the Path Queue occupancy, and a comparator asserts b = 1 when its input a > 0 (a catchment basin label has been read). All memories have a built-in pixel-coordinate-to-memory-address decoder. The control signals are:

w_loc: memory write location
r_loc: memory read location
w_info: memory write data
we_pq: write enable for the path queue memory
we_label: write enable for the label memory and pixel status memory
we_pc: write enable for pixel coordinate incrementation
we_buf: write enable for the buffer; the value of the CBL is locked in the buffer and read from it until the queue read is completed
mux: data input selection)

Figure 21: The watershed architecture: labelling.
the north, (0, 0), finds (1, 0) again. The current pixel location
(0, 0), on the other hand, is written to Q1_C because it
is a plateau pixel but not an inner one (i.e., an edge) and is
immediately arrowed. The status for this location (0, 0) is
changed from 0 → 6. Q1_S will contain the pixel location
(1, 0). This is read back into the system and mc4 = 1 → 0,
indicating Q1_S to be empty. The pixel location (1, 0) is
arrowed and written into Q1_C. With mc1–mc4 = 0 and
mc5 > 0, the pixel locations (0, 0) and (1, 0) are reread into the
system, but nothing is performed because both their PSs equal
6 (i.e., completed).
6. Labelling Architecture
This second part of the architecture describes how we
get from Figure 4(c) to Figure 4(d) in hardware. Compared
to the arrowing architecture, the labelling architecture is
considerably simpler as there are no parallel memory reads.
In fact, everything runs in a fairly sequential manner. Part 2
of the architecture is shown in Figure 21.
The architecture for Part 2 is very similar to Part 1. Both
are three-state systems whose state depends on the condition
(Figure 22 shows the three states Normal, Fill Queue, and Read Queue. The system leaves Normal for Fill Queue when PQ_counter > 0 (mux = 0), moves from Fill Queue to Read Queue when b = 1 (catchment basin found), and returns to Normal when PQ_counter = 0 (mux = 1).)
Figure 22: The three states in the labelling architecture.
of the queues, and use pixel state memory and queues
for storing pixel locations. The difference is that the Part 2
architecture only requires a single queue and a single-bit pixel
status register. The three states for the system are shown in
Figure 22.
Values are initially read in from the pixel coordinate
register. Whether this pixel location has been processed
before is checked against the pixel status (PS) register. If it has
not been processed before (i.e., was never part of any steepest
descending path), it will be written to the Path Queue (PQ).
Once PQ is not empty, the system will process the next
pixel along the current steepest descending path. This is
calculated by the Reverse Arrowing Block (RAB) using the
current pixel location and direction information obtained
from the Arrow Memory. This process continues until a
non-negative value is read from Arrow Memory. This non-
negative value is called the Catchment Basin Label (CBL).
Reading a CBL tells the system that a minimum has been
reached, and all the pixel locations stored in PQ will be
labelled with that CBL and written to Label Memory. At
the same time, the pixel status for the corresponding pixel
locations will be updated accordingly from 0 → 1. Now that
PQ is empty, the next value will be obtained from the pixel
coordinate register.
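In software terms, the labelling flow just described amounts to following the arrows until a non-negative catchment basin label is found, then writing that label back over the whole path. The sketch below is a behavioural analogue of the PQ/CBL mechanism; the numeric direction-to-offset table is an illustrative assumption:

```python
# Behavioural sketch of the labelling pass (not the RTL). arrow_mem holds
# negative direction codes (-1..-4 for W, N, E, S) or a non-negative
# Catchment Basin Label (CBL); the offset table is an assumed encoding.
OFFSETS = {-1: (0, -1), -2: (-1, 0), -3: (0, 1), -4: (1, 0)}  # W, N, E, S

def label_image(arrow_mem):
    h, w = len(arrow_mem), len(arrow_mem[0])
    label = [[None] * w for _ in range(h)]
    ps = [[0] * w for _ in range(h)]           # 1-bit pixel status
    for r0 in range(h):
        for c0 in range(w):
            if ps[r0][c0]:
                continue                        # already on a labelled path
            pq, (r, c) = [], (r0, c0)
            while arrow_mem[r][c] < 0:          # follow steepest descent
                pq.append((r, c))
                dr, dc = OFFSETS[arrow_mem[r][c]]
                r, c = r + dr, c + dc
            cbl = arrow_mem[r][c]               # minimum reached: buffer CBL
            for (pr, pc) in pq + [(r, c)]:      # drain PQ, write labels
                label[pr][pc] = cbl
                ps[pr][pc] = 1
    return label
```

Unlike the hardware, which drains PQ one location per cycle in the Read Queue state, this sketch labels the whole path in one Python loop; the data dependencies are the same.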
6.1. The Reverse Arrowing Block. This block calculates the
neighbour pixel location in the path of the steepest descent
given the current location and arrowing label. In other
words, it simply finds the location of the pixel pointed to by
the current pixel.
The output of this block is a simple case of selecting
the appropriate neighbouring coordinate. First, the neigh-
bouring coordinates are calculated and fed into a 4-input
multiplexer. Invalid neighbours are automatically ignored as
they will never be selected; the values in Arrow Memory
only point to valid pixels. Hence, no special consideration is
required to handle these cases.
The bulk of the block's complexity lies in the control of
the multiplexer. The control is determined by translating the
value from the Arrow Memory into proper control logic.
Using a bank of four comparators, the value from Arrow
Memory is compared against the four possible
valid direction labels (i.e., −4 to −1). For each of these
values, only one of the comparators will produce a positive
outcome (see truth table in Figure 23). Any other values
outside the valid range are simply ignored.
The comparator output is then passed through some
logic that produces a 2-bit output corresponding to the
multiplexer control. If the value from Arrow Memory is
−1, the control logic will be (x = 0, y = 0), corresponding
to the West neighbour location. Similarly, if the value from
Arrow Memory is −2, −3, or −4, the control logic will
be (x = 0, y = 1), (x = 1, y = 0), or (x = 1, y =
1), corresponding to the North, East, or South neighbour
locations, respectively. This is shown in Figure 23.
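A direct software rendering of this decode (comparators → one-hot → 2-bit mux control) might look as follows; the neighbour offset per direction follows the W, N, E, S ordering above but is restated here as an assumption:

```python
def reverse_arrow(r, c, am):
    """Select the neighbour pointed to by arrow-memory value am (-1..-4)."""
    # Bank of four comparators producing a one-hot (a, b, c, d).
    a, b_, c_, d = (am == -1), (am == -2), (am == -3), (am == -4)
    if not (a or b_ or c_ or d):
        return None               # values outside the valid range are ignored
    # 2-bit mux control: (x, y) = (0,0) W, (0,1) N, (1,0) E, (1,1) S.
    x = int(c_ or d)
    y = int(b_ or d)
    neighbours = [(r, c - 1),     # mux 0: West
                  (r - 1, c),     # mux 1: North
                  (r, c + 1),     # mux 2: East
                  (r + 1, c)]     # mux 3: South
    return neighbours[(x << 1) | y]
```

Because the one-hot inputs are mutually exclusive, x and y reduce to simple ORs of the comparator outputs, which is why the hardware needs only a handful of gates after the comparator bank.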
7. Example for the Labelling Architecture
This example picks up where the previous example
stopped. In the previous part, the resulting output was
written to the Arrow Memory. It contains the directions of
the steepest descent (negative values from −1 to −4) and
numbered minima (positive values from 0 to the total number
(Figure 23 shows the internals of the reverse arrowing block. The four neighbour coordinates are formed from the current row r and column c as W = (r, c − 1), N = (r − 1, c), E = (r, c + 1), and S = (r + 1, c), and a bank of four comparators matches am, the value from Arrow Memory, against −1, −2, −3, and −4 to produce the one-hot signals a, b, c, d driving the multiplexer control:

a b c d | x y | mux_ctrl
1 0 0 0 | 0 0 | 0
0 1 0 0 | 0 1 | 1
0 0 1 0 | 1 0 | 2
0 0 0 1 | 1 1 | 3)

Figure 23: Inside the reverse arrowing block.
of minima) as seen in Figure 4(c). In this part, we will use the
information stored in Arrow Memory to label each pixel
with the label of its respective minimum. Once all the pixels
associated with a minimum have been labelled accordingly,
a catchment basin is formed.
The system starts off in the Normal state and the initial
conditions are as follows: PQ_counter = 0, mux = 1. In the
first clock cycle, the first pixel location (0, 0) is read from the
pixel location register. Once this has been read in, the pixel
location register will increment to the next pixel location
(0, 1). The PS for the first location (0, 0) is 0. This enables
the write enable for the PQ and the first location is written
to the queue. At the same time, the location (0, 0) and direction
−3 obtained from Arrow Memory are used to find the next
coordinate (0, 1) in the steepest descending path.
Since PQ is not empty, the system enters the Fill Queue
state and mux = 0. The next input into the system is the value
from the reverse arrowing block, (0, 1), and since PS = 0,
it is put into PQ. The next location processed is (0, 2). For
(0, 2), PS = 0 and it is also written to PQ. However, for this
location, the value obtained from Arrow Memory is 1. This
is a CBL and is buffered for the process of the next state. Once
a non-negative value from Arrow Memory is read (i.e.,
b = 1), the system enters the next state, which is the Read
Queue state. In this state, all the pixel locations stored in
PQ are read one at a time and the memory locations in Label
Memory corresponding to these locations are written with
the buffered CBL. At the same time, PS is also updated from
0 → 1 to reflect the changes made to Label Memory. This tells
the system that the locations from PQ have been processed
so that they will not be rewritten when encountered again.
Table 3: Results of the implemented architecture on a Xilinx
Spartan-3 FPGA (64 × 64 image size).

Arrowing
  Slice flip-flops: 423 out of 26,624 (1%)
  Occupied slices: 2,658 out of 13,312 (19%)
Labelling
  Slice flip-flops: 39 out of 26,624 (1%)
  Occupied slices: 37 out of 13,312 (1%)
With each read from PQ, PQ_counter is decremented. When
PQ is empty, PQ_counter = 0 and the system returns to
the Normal state.
In the next clock cycle, (0, 1) is read from the pixel
coordinate register. For (0, 1), PS = 1, so nothing gets
written to PQ and PQ_counter remains at 0. The same goes
for (0, 2). When the coordinate (0, 3) is read from the pixel
coordinate register, the whole process of filling up PQ,
reading from PQ, and writing to Label Memory starts again.
8. Synthesis and Implementation
The rainfall watershed architecture was designed in Handel-
C and implemented on a Celoxica RC10 board containing
a Xilinx Spartan-3 FPGA. Place and route were completed
to obtain a bitstream, which was downloaded into the FPGA
for testing. The watershed transform was computed by the
FPGA architecture, and the arrowing and labelling results
were verified to have the same values as software simulations
in Matlab. The Spartan-3 FPGA contains a total of 13,312
slices. The implementation results of the architecture are
given in Table 3 for an image size of 64 × 64 pixels. An
image resolution of 64 × 64 required 2,658 and 37 slices for
the arrowing and labelling architectures, respectively. This
represents about 20% of the chip area on the Spartan-3
FPGA.
9. Summary
This paper proposed a fast method of implementing the
watershed transform based on rainfall simulation with a
multiple-bank memory addressing scheme to allow parallel
access to the centre and neighbourhood pixel values. In a
single read cycle, the architecture is able to obtain all five
values of the centre and four neighbours for a 4-connectivity
watershed transform. This multiple-bank memory has the
same footprint as a single-bank design. The datapath
and control architecture for the arrowing and labelling
hardware have been described in detail, and an implemented
architecture on a Xilinx Spartan-3 FPGA has been reported.
The work can be extended to implement an 8-connectivity
watershed transform by increasing the number of memory
banks and working out their addressing. The multiple-bank
memory approach can also be applied to other watershed
architectures such as those proposed in [10–13, 15].
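One way to realise such conflict-free parallel access can be sketched with a modulo-5 interleaving; the specific assignment below, bank(x, y) = (x + 2y) mod 5, is illustrative and not necessarily the one used in this design, but it shows why five banks suffice for a 4-connected neighbourhood:

```python
# Illustrative 5-bank interleaving for conflict-free 4-neighbourhood access.
# bank(x, y) = (x + 2y) mod 5 places the centre and its W, N, E, S
# neighbours in five different banks for every pixel position, so one
# read cycle can fetch all five values in parallel.
def bank(x, y):
    return (x + 2 * y) % 5

def neighbourhood_banks(x, y):
    pts = [(x, y), (x - 1, y), (x, y - 1), (x + 1, y), (x, y + 1)]
    return [bank(px, py) for px, py in pts]

# Relative to the centre's bank b, the neighbours land in
# b-1, b-2, b+1, b+2 (mod 5), which are all distinct.
assert all(len(set(neighbourhood_banks(x, y))) == 5
           for x in range(1, 9) for y in range(1, 9))
```

Extending to 8-connectivity would need more banks, since nine pixels (centre plus eight neighbours) must land in distinct banks; working out that assignment is the extension mentioned above.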
References
[1] S. E. Hernandez and K. E. Barner, "Tactile imaging using
watershed-based image segmentation," in Proceedings of the
Annual Conference on Assistive Technologies (ASSETS '00), pp.
26-33, ACM, New York, NY, USA, 2000.
[2] M. Fussenegger, A. Opelt, A. Pinz, and P. Auer, "Object
recognition using segmentation for feature detection," in
Proceedings of the 17th International Conference on Pattern
Recognition (ICPR '04), vol. 3, pp. 41-44, IEEE Computer
Society, Washington, DC, USA, 2004.
[3] W. Zhang, H. Deng, T. G. Dietterich, and E. N. Mortensen,
"A hierarchical object recognition system based on multi-
scale principal curvature regions," in Proceedings of the 18th
International Conference on Pattern Recognition (ICPR '06),
vol. 1, pp. 778-782, IEEE Computer Society, Washington, DC,
USA, 2006.
[4] M. S. Schmalz, "Recent advances in object-based image com-
pression," in Proceedings of the Data Compression Conference
(DCC '05), p. 478, March 2005.
[5] S. Han and N. Vasconcelos, "Object-based regions of interest
for image compression," in Proceedings of the Data Compres-
sion Conference (DCC '08), pp. 132-141, 2008.
[6] T. Acharya and P.-S. Tsai, JPEG2000 Standard for Image
Compression: Concepts, Algorithms and VLSI Architectures,
John Wiley & Sons, New York, NY, USA, 2005.
[7] V. Osma-Ruiz, J. I. Godino-Llorente, N. Sáenz-Lechón, and
P. Gómez-Vilda, "An improved watershed algorithm based on
efficient computation of shortest paths," Pattern Recognition,
vol. 40, no. 3, pp. 1078-1090, 2007.
[8] A. Bieniek and A. Moga, "An efficient watershed algorithm
based on connected components," Pattern Recognition, vol. 33,
no. 6, pp. 907-916, 2000.
[9] H. Sun, J. Yang, and M. Ren, "A fast watershed algorithm based
on chain code and its application in image segmentation,"
Pattern Recognition Letters, vol. 26, no. 9, pp. 1266-1274, 2005.
[10] M. Neuenhahn, H. Blume, and T. G. Noll, "Pareto optimal
design of an FPGA-based real-time watershed image seg-
mentation," in Proceedings of the Conference on Program for
Research on Integrated Systems and Circuits (ProRISC '04),
2004.
[11] C. Rambabu and I. Chakrabarti, "An efficient immersion-
based watershed transform method and its prototype archi-
tecture," Journal of Systems Architecture, vol. 53, no. 4, pp. 210-
226, 2007.
[12] C. Rambabu, I. Chakrabarti, and A. Mahanta, "Flooding-
based watershed algorithm and its prototype hardware archi-
tecture," IEE Proceedings: Vision, Image and Signal Processing,
vol. 151, no. 3, pp. 224-234, 2004.
[13] C. Rambabu and I. Chakrabarti, "An efficient hillclimbing-
based watershed algorithm and its prototype hardware archi-
tecture," Journal of Signal Processing Systems, vol. 52, no. 3, pp.
281-295, 2008.
[14] D. Noguet and M. Ollivier, "New hardware memory manage-
ment architecture for fast neighborhood access based on graph
analysis," Journal of Electronic Imaging, vol. 11, no. 1, pp. 96-
103, 2002.
[15] C. J. Kuo, S. F. Odeh, and M. C. Huang, "Image segmentation
with improved watershed algorithm and its FPGA implemen-
tation," in Proceedings of the IEEE International Symposium on
Circuits and Systems (ISCAS '01), vol. 2, pp. 753-756, Sydney,
Australia, May 2001.