ADCOM 2009
Copyright © 2009 by the Advanced Computing and Communications Society
All rights reserved.
Steering Committee ii
Reviewers iii
GRID ARCHITECTURE
Parallel Implementation of Video Surveillance Algorithms on GPU 2
Architecture using NVIDIA CUDA
Sanyam Mehta, Ankush Mittal, Arindam Misra, Ayush Singhal, Praveen Kumar, and Kannappan
Palaniappan
Adapting Traditional Compilers onto Higher Architectures incorporating Energy 10
Optimization Methods for Sustained Performance
Prahlada Rao BB, Mangala N and Amit Chauhan
SERVER VIRTUALIZATION
Is I/O Virtualization ready for End-to-End Application Performance? 19
J. Lakshmi, S.K.Nandy
Eco-friendly Features of a Data Centre OS 26
S. Prakki
GRID SERVICES
OpenPEX: An Open Provisioning and EXecution System for Virtual Machines 45
Srikumar Venugopal, James Broberg and Rajkumar Buyya
Exploiting Grid Heterogeneity for Energy Gain 53
Saurabh Kumar Garg
Intelligent Data Analytics Console 60
Snehal Gaikwad, Aashish Jog and Mihir Kedia
COMPUTATIONAL BIOLOGY
Digital Processing of Biomedical Signals with Applications to Medicine 69
D. Narayan Dutt
Supervised Gene Clustering for Extraction of Discriminative Features from 75
Microarray Data
C. Das, P.Maji, S. Chattopadhyay
Modified Greedy Search Algorithm for Biclustering Gene Expression Data 83
S.Das, S.M. Idicula
AD-HOC NETWORKS
Solving Bounded Diameter Minimum Spanning Tree Problem Using Improved 90
Heuristics
Rajiv Saxena and Alok Singh
Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents 96
Santosh Kulkarni and Prathima Agrawal
A Scenario-based Performance Comparison Study of the Fish-eye State Routing and 83
Dynamic Source Routing Protocols for Mobile Ad hoc Networks
Natarajan Meghanathan and Ayomide Odunsi
NETWORK OPTIMIZATION
Optimal Network Partitioning for Distributed Computing Using Discrete 113
Optimization
Angeline Ezhilarasi G and Shanti Swarup K
An Efficient Algorithm to Reconstruct a Minimum Spanning Tree in an 118
Asynchronous Distributed Systems
Suman Kundu and Uttam Kumar Roy
A SAL Based Algorithm for Convex Optimization Problems 125
Amit Kumar Mishra
GRID SCHEDULING
Energy-efficient Scheduling of Grid Computing Clusters 153
Tapio Niemi, Jukka Kommeri and Ari-Pekka Hameri
Energy Efficient High Available System: An Intelligent Agent Based Approach 160
Ankit Kumar, Senthil Kumar R. K. and Bindhumadhava B. S
A Two-phase Bi-criteria Workflow Scheduling Algorithm in Grid Environments 168
Amit Agarwal and Padam Kumar
HUMAN COMPUTER INTERFACE - 2
Towards Geometrical Password for Mobile Phones 175
Mozaffar Afaq, Mohammed Qadeer, Najaf Zaidi and Sarosh Umar
Improving Performance of Speaker Identification System Using Complementary 182
Information Fusion
Md Sahidullah, Sandipan Chakroborty and Goutam Saha
Right Brain Testing - Applying Gestalt Psychology in Software Testing 188
Narayanan Palani
DISTRIBUTED SYSTEMS
Exploiting Multi-context in a Security Pattern Lattice for Facilitating User Navigation 215
Achyanta Kumar Sarmah, Smriti Kumar Sinha and Shyamanta Moni Hazarika
Trust in Mobile Ad Hoc Service GRID 223
Sundar Raman S and Varalakshmi P
Scheduling Light-trails on WDM Rings 227
Soumitra Pal and Abhiram Ranade
ADCOM 2009 STEERING COMMITTEE
Patron
P. Balaram, IISc, India
General Chairs
N. Balakrishnan, IISc, India
Patrick Dewilde, TU Munich
Organising Chairs
B.S. Bindhumadhava, CDAC, India
S. K. Sinha, CEDT, India
Industry Chairs
H. S. Jamadagni, IISc, India
Krithiwas Neelakantan, SUN, India
Lavanya Rastogi, Value-One, India
Saragur M. Srinidhi, Prometheus Consulting, India
Publications Chair
K. Rajan, IISc, India
Finance Chairs
G. N. Rathna, IISc, India
Advisory Committee
Harish Mysore, India
K. Subramanian, IGNOU, India
Ramanath Padmanabhan, INTEL, India
Sunil Sherlekar, CRL, India
Vittal Kini, Intel, India
N. Rama Murthy, CAIR, India
Ashok Das, Sun Moksha, India
Sridhar Mitta, India
Reviewers for ADCOM 2009
The following reviewers participated in the review process of ADCOM 2009. We gratefully acknowledge their contributions.
ADCOM 2009
GRID ARCHITECTURE
Session Papers:
1. Sanyam Mehta, Ankush Mittal, Arindam Misra, Ayush Singhal, Praveen Kumar, and
Kannappan Palaniappan, “Parallel Implementation of Video Surveillance Algorithms on GPU
Architecture using NVIDIA CUDA”.
2. Prahlada Rao BB, Mangala N and Amit Chauhan, “Adapting Traditional Compilers onto Higher Architectures incorporating Energy Optimization Methods for Sustained Performance”.
Parallel Implementation of Video Surveillance Algorithms on GPU
Architecture using NVIDIA CUDA
Sanyam Mehta‡, Arindam Misra‡, Ayush Singhal‡, Praveen Kumar†, Ankush Mittal‡, Kannappan Palaniappan†
‡ Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee, INDIA
† Department of Computer Science, University of Missouri-Columbia, USA
E-mail: san01uec@iitr.ernet.in, ari07uce@iitr.ernet.in, ayush488@gmail.com, praveen.kverma@gmail.com, ankumfec@iitr.ernet.in, palaniappank@missouri.edu
stream processors having a single precision floating point capability of 933 GFlops. CUDA enables new applications with a standard platform for extracting valuable information from vast quantities of raw data. It enables HPC on normal enterprise workstations and server environments for data-intensive applications, e.g. [12]. CUDA combines well with multi-core CPU systems to provide a flexible computing platform.

In this paper the parallel implementation of various video surveillance algorithms on the GPU architecture is presented. This work focuses on algorithms like (i) Gaussian mixture model for background modelling, (ii) morphological image operations for image noise removal, and (iii) connected component labelling for identifying the foreground objects. In each of these algorithms, the different memory types and thread configurations provided by the CUDA architecture have been adequately exploited. One of the key contributions of this work is a novel algorithmic modification for parallelization of the divide and conquer strategy for CCL. The speed-ups obtained with the GTX 280 (30 multiprocessors or 240 cores) were very significant, and the corresponding speed-ups on the 8400 GS (2 multiprocessors or 16 cores) were sufficient to process 640×480 surveillance video in real time. The scalability was tested by executing different frame sizes on both GPUs.

2. GPU Architecture and CUDA

NVIDIA’s CUDA [14] is a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems. The programmable GPU is a highly parallel, multi-threaded, many-core co-processor specialized for compute-intensive, highly parallel computation.

The three key abstractions of CUDA are the thread hierarchy, shared memories and barrier synchronization, which render it as only an extension of C. All the GPU threads run the same code; they are very lightweight and have a low creation overhead. A kernel can be executed by a one-dimensional or two-dimensional grid of multiple equally-shaped thread blocks. A thread block is a 3-, 2- or 1-dimensional group of threads, as shown in Fig. 1. Threads within a block can cooperate among themselves by sharing data through shared memory and synchronizing their execution to coordinate memory accesses. Threads in different blocks cannot cooperate, and each block can execute in any order relative to other blocks. The number of threads per block is therefore restricted by the limited memory resources of a processor core. On current GPUs, a thread block may contain up to 512 threads. The multiprocessor SIMT (Single Instruction Multiple Threads) unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.

The constant memory is useful only when the entire warp reads a single memory location. The shared memory is on chip, and its accesses are 100x-150x faster than accesses to local and global memory. The shared memory, for high bandwidth, is divided into equal-sized memory modules called banks, which can be accessed simultaneously. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The banks are organized such that successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per two clock cycles. For devices of compute capability 1.x, the warp size is 32 and the number of banks is 16. The texture memory space is cached, so a texture fetch costs one memory read from device memory only on a cache miss; otherwise it just costs one read from the texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve the best performance. The local and global memories are not cached and their access latencies are high. However, coalescing in global memory significantly reduces the access time and is an important consideration (for compute capability 1.3, global memory accesses are more easily coalesced than in earlier versions). Also, the CUDA 2.2 release provides page-locked host memory, which helps in increasing the overall bandwidth when the memory is required to be read or written exactly once. Also, it can be mapped to the device address space, so no explicit memory transfer is required.

Fig 1 Thread hierarchy in CUDA
Fig. 2 The device memory space in CUDA

3. Our approach for the Video Surveillance Workload

A typical Automated Video Surveillance (AVS) workload consists of various stages like background modelling, foreground/background detection, noise removal by morphological image operations, and object identification. Once the objects have been identified, other applications can be developed as per the security requirements. Fig. 3 shows the multistage algorithm for a typical AVS system. The different stages and our approach to each of them are described as follows:

Fig. 3 A typical video surveillance system

3.1 Gaussian Mixture Model

Many approaches for background modelling like [4][5] have been proposed. Here, the Gaussian Mixture model proposed by Stauffer and Grimson [3] is taken up, which assumes that the time series of observations at a given image pixel is independent of the observations at other image pixels. It is also assumed that these observations of the pixel can be modelled by a mixture of K Gaussians (currently, from 3 to 5 are used). Let x_t be a pixel value at time t. Thus, the probability that the pixel value x_t is observed at time t is given by:

P(x_t) = Σ_{k=1..K} w_{k,t} · η(x_t | μ_{k,t}, σ_{k,t})    (1)

where w_{k,t}, μ_{k,t} and σ_{k,t} are the weight, the mean, and the standard deviation, respectively, of the k-th Gaussian of the mixture associated with the signal at time t. At each time instant t the K Gaussians are ranked in descending order by their w/σ value (the most ranked components represent the “expected” signal, or the background) and only the first B distributions are used to model the background, where

B = argmin_b ( Σ_{k=1..b} w_k > T )    (2)

T is a threshold representing the minimum fraction of data used to model the background. As the parameters of each pixel change, determining which Gaussian would be produced by the background process depends on the most supporting evidence and least variance, since the variance for a new moving object that occludes the image is high, which can be easily checked from the value of σ.

GMM offers pixel-level data parallelism which can be easily exploited on the CUDA architecture, since the GPU consists of multiple cores which allow independent thread scheduling and execution, perfectly suitable for independent pixel computation. So, an image of size m × n requires m × n threads, implemented using appropriately sized blocks running on multiple cores. Besides this, the GPU architecture also provides shared memory, which is much faster than the local and global memory spaces. In fact, for all threads of a warp, accessing the shared memory is as fast as accessing a register as long as there are no bank conflicts [14] between the threads. In order to avoid too many global memory accesses, the shared memory was utilised to store the arrays of the various Gaussian parameters. Each block has its own shared memory (up to 16 KB) which is accessible (read/write) to all its threads simultaneously, which greatly improves the computation on each thread since memory access time is significantly reduced. The value for K (number of Gaussians) is selected as 4, which not only results in effective coalescing [14] but also reduces the bank conflicts. As shown in Table 1, the efficacy of coalescing is quite prominent.

The approach for GMM involves streaming (Fig. 4), i.e. processing the input frame using two streams, which
allows for the memory copies of one stream to overlap with the kernel execution of the other stream.

for i <= 2 do
    create stream i                                          // cudaStreamCreate
for each stream i do
    copy half the image from host to device in stream i      // cudaMemcpyAsync
for each stream i do
    kernel execution for each stream i (half image processed)  // gmm<<<...>>>
cudaThreadSynchronize();

Streaming resulted in a significant speed up in the case of the 8400 GS, where the time for memory copies closely matched the time for kernel execution, while in the case of the GTX 280 the speed up was not so significant, as the kernel execution took little time, being spread over 30 multiprocessors.

3.2 Morphological Image Operations

After the identification of the foreground pixels from the image, there are some noise elements (like salt and pepper noise) that creep into the foreground image. They essentially need to be removed in order to find the relevant objects by the connected component labelling method. This is achieved by the morphological image operations of erosion followed by dilation [6]. Each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbours, depending on the structuring element (Fig. 5) used. In case of dilation (denoted by ʘ) the value of the output pixel is the maximum value of all the pixels in the input pixel's neighbourhood. In a binary image, if any of the pixels in the neighbourhood corresponding to the structural element is set to the value 1, the output pixel is set to 1. With binary images, dilation connects areas that are separated by spaces smaller than the structuring element and adds pixels to the perimeter of each image object.

In erosion (denoted by Ɵ), the value of the output pixel is the minimum value of all the pixels in the input pixel's neighbourhood. In a binary image, if any of the pixels in the neighbourhood corresponding to the structural element is set to the value 0, the output pixel is set to 0. With binary images, erosion completely removes objects smaller than the structuring element and removes perimeter pixels from larger image objects. This is described mathematically as:

A ʘ B = { z | (B̂)_z ∩ A ≠ ∅ }    (3)
and
A Ɵ B = { z | (B)_z ⊆ A }    (4)

where B̂ is the reflection of set B and (B)_z is the translation of set B by point z, as per the set theoretic definition.

[Fig. 5]
0 0 1 0 0
0 1 0 1 0
1 0 0 0 1
0 1 0 1 0
0 0 1 0 0

As the texture cache is optimized for 2-dimensional spatial locality, the 2-dimensional texture memory is used to hold the input image; this has an advantage over reading pixels from the global memory when coalescing is not possible. Also, the problem of out-of-bound memory references at the edge pixels is avoided by the cudaAddressModeClamp addressing mode of the texture memory, in which out-of-range texture coordinates are clamped to a valid range. Thus the need to check out-of-bound memory references by conditional statements never arose, preventing the warps from becoming divergent and adding a significant overhead.

Fig. 6 Approach for erosion and dilation

As shown in Fig. 6, a single thread is used to process two pixels. A half warp (16 threads) has a bandwidth of 32 bytes/cycle, and hence 16 threads, each processing 2 pixels (2 bytes), use the full bandwidth
while writing back the noise-free image. This halves the total number of threads, thus reducing the execution time significantly. A structuring element of size 7×7 was used both in dilation and erosion. A straightforward convolution was done with one thread running on two neighbouring pixels. The execution times for the morphological image operations for the GTX 280 and the 8400 GS are shown in Table 2.

[Figure: divide and conquer labelling — Region 1: labels limited from 0 to 7; Region 2: labels limited from 8 to 15]
Table 1 CUDA Profiler output for 1024×768 image size
multiprocessor, giving an occupancy of 0.75. As a result of using K=4, all the global memory loads were coalesced, as can be seen from Table 1; also, there were fewer bank conflicts. The use of streaming reduced the memory copy overhead, but not to the extent anticipated, due to the efficient memory copying in the GTX 280 (compute capability 1.3). This approach, however, was of great help on the 8400 GS (compute capability 1.2).

The morphological image operations contribute a major portion of the computational expense of the AVS workload. In our approach we are able to drastically reduce their execution time. The speedup scales with the image size both on the GTX 280 and the 8400 GS; the comparison of the sequential code with the parallel implementation for the 1024×768 image size shows a significant speedup of 260X with a structuring element of 7×7. The time taken by the sequential implementation was 89.806 ms as compared to the 0.352 ms taken by the parallel implementation. For this image size we were able to unleash the full computational power of the GPU with an occupancy of 1 (i.e. neither the shared memory nor the registers per multiprocessor were the limiting factors) on the GTX 280, as indicated in Table 1. Moreover, the use of texture memory and address clamp modes reduced the percentage of divergent threads to <1%. On the 8400 GS also a significant speedup has been achieved.

can be 8, the total number of active threads per multiprocessor was 256 and hence an occupancy of 0.25. The optimal parallelization of the CCL algorithm was significant in itself, as the parallelization of CCL on CUDA has not been reported and was deemed very difficult. Apart from the code being parallelized, the use of shared memory and then texture memory to store appropriate data led to significant increases in speedup. The use of texture memory not only prevented any warps from diverging, by avoiding the conditional statements (due to clamped accesses in texture memory), but also led to speedup due to the spatial locality of references in CCL. However, the implementation of CCL is block size dependent, which still remains a bottleneck.

Table 3 Execution times for CCL

IMAGE SIZE    GTX 280 (ms)    8400 GS (ms)
160×120       0.106           0.522
320×240       0.220           1.34
640×480       1.256           4.5
1024×768      2.494           14.1
1600×1200     2.649           46.2

In each of the above kernels, page-locked host memory (a feature of CUDA 2.2) has been used whenever only one memory read and write were involved, which increased the memory throughput.

Architectures dedicated to video surveillance cost as much as lakhs of rupees, while the GeForce GTX 280 costs Rs. 17000 and the 8400 GS costs merely Rs. 4000. Even for an image size of 640×480, 30 frames per second could be processed on the 8400 GS; for an image size of 1024×768, close to 15 frames per second could be processed; and for images of smaller size, 30 frames could easily be processed, as shown in Fig. 10.
Fig. 9 Image output of various stages of AVS: (a) Input Image, (b) Foreground Image, (c) Image after noise removal, (d) Output
5. CONCLUSION AND FUTURE WORK

Through this paper, we describe the implementation of a typical AVS workload on the parallel architecture of NVIDIA GPUs to perform real-time AVS. The various algorithms, as described in the previous sections, are GMM for background modelling, morphological image operations for noise removal, and CCL for object identification. In our previous work [15] a detailed comparison has been done between the Cell BE and CUDA for these algorithms. During the implementation on the GPU architecture, major emphasis was given to selecting the thread configurations and the memory types for each kind of data, out of the numerous options available on the GPU architecture, so that the memory latency can be reduced and hidden. Much emphasis was given to memory coalescing and avoiding bank conflicts. Efficient usage of the different kinds of memories offered by the CUDA architecture, and subsequent experimental verification, resulted in the most optimal implementations. As a result, a significant overall speed-up was achieved.

Further testing and validation are ongoing. We have examined the performance of only the 8400 GS (2 multiprocessors) and the GTX 280 (30 multiprocessors) in this paper; hence a range of intermediate devices is yet to be explored. Our future work will include the implementation of the AVS workload on other GPU devices to examine scalability, as well as comparison with other parallel architectures to get an idea of their viability as compared to the GPU implementation.

6. REFERENCES

[1] S. Momcilovic and L. Sousa, "A parallel algorithm for advanced video motion estimation on multicore architectures," Int. Conf. Complex, Intelligent and Software Intensive Systems, pp. 831-836, 2008.
[2] M. D. McCool, "Data-Parallel Programming on the Cell BE and the GPU Using the RapidMind Development Platform," GSPx Multicore Applications Conference, 9 pages, 2006.
[3] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," In Proceedings CVPR, pp. 246-252, 1999.
[4] Z. Zivkovic, "Improved Adaptive Gaussian Mixture Model for Background Subtraction," In Proc. ICPR, vol. 2, pp. 28-31, 2004.
[5] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: principles and practice of background maintenance," Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1, pp. 255-261, 20-25 September 1999, Kerkyra, Corfu, Greece.
[6] H. Sugano and R. Miyamoto, "Parallel implementation of morphological processing on Cell/BE with OpenCV interface," Communications, Control and Signal Processing (ISCCSP 2008), pp. 578-583, 2008.
[7] J. M. Park, G. C. Looney, and H. C. Chen, "A Fast Connected Component Labeling Algorithm Using Divide and Conquer," CATA 2000 Conference on Computers and Their Applications, pp. 373-376, Dec. 2000.
[8] R. Fisher, S. Perkins, A. Walker and E. Wolfart, "Connected Component Labeling," 2003.
[9] K. P. Belkhale and P. Banerjee, "Parallel Algorithms for Geometric Connected Component Labeling on a Hypercube Multiprocessor," IEEE Transactions on Computers, vol. 41, no. 6, pp. 699-709, 1992.
[10] M. Manohar and H. K. Ramapriyan, "Connected component labeling of binary images on a mesh connected massively parallel processor," Computer Vision, Graphics, and Image Processing, 45(2):133-149, 1989.
[11] K. Dawson-Howe, "Active surveillance using dynamic background subtraction," Tech. Rep. TCD-CS-96-06, Trinity College, 1996.
[12] M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron, "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors," 2009.
[13] A. C. Sankaranarayanan, A. Veeraraghavan, and R. Chellappa, "Object detection, tracking and recognition for multiple smart cameras," Proceedings of the IEEE, 96(10):1606-1624, 2008.
[14] NVIDIA CUDA Programming Guide, Version 2.2, pp. 10, 27-35, 75-97, 2009.
[15] P. Kumar, K. Palaniappan, A. Mittal and G. Seetharaman, "Parallel Blob Extraction using Multicore Cell Processor," Advanced Concepts for Intelligent Vision Systems (ACIVS) 2009, LNCS 5807, pp. 320-332, 2009.
Adapting Traditional Compilers onto Higher Architectures
incorporating Energy Optimization Methods
for Sustained Performance
Prahlada Rao B B, Mangala N, Amit K S Chauhan
Centre for Development of Advanced Computing (CDAC),
#1, Old Madras Road, Byappanahalli, Bangalore-560038, India
email: {prahladab, mangala}@cdacb.ernet.in
and execution of the code, while gcc comes with more sophisticated and complex optimization procedures, such as inter-procedural analysis, for better performance. CDF90 takes the benefits of both approaches by converting the code into intermediate C code through the traditional approach, and later passing the intermediate C code to the gcc compiler to exploit the benefits offered by advanced techniques. Various compilation approaches are depicted in Figure 1.

[Figure 1: Various compilation approaches — Traditional, Portable and Retargetable — each taking Scientific Code through a Lexical Analyzer and a Syntax + Semantic Analyzer; the Traditional approach applies an Optimizer directly, the Portable approach applies a General Optimizer and a Translator to a commonly used language (C/C++) for portability, and the Retargetable approach applies an Optimizer and a Machine Independent Intermediate Code Generator.]

B. Lexical Analyzer
This module converts a sequence of characters into a sequence of tokens, which are given as input to the Parser for construction of an Abstract Syntax Tree. This Abstract Syntax Tree is used in later modules of the compiler for further processing.

C. Parser
This module receives input in the form of sequential source program instructions, interactive commands, markup tags, or some other defined interface, and breaks them up into parts that can then be managed by other compiler components. The Parser gets the tokens from the Lexer and constructs a data structure, usually an Abstract Syntax Tree.

D. Optimizer
This module provides a suite of traditional optimization methods to minimize the energy cost of a program by minimizing memory access instructions and execution time. Optimization techniques like Loop Unrolling, Loop Interchanging, Loop Merging, Loop Distribution and Function In-lining have been applied on the Parse Tree structure.

E. Translator
This module translates FORTRAN source code to correct, compilable and clean C source code. The Translator makes an in-order traversal of the full parse tree and replaces each FORTRAN construct by its corresponding C construct. The output of the translator module is a ‘.c’ file. This .c file is passed to a standard C compiler to produce the final executable. The I/O libraries of CDF90 are linked using the ‘-l’ option and passed to the linker ‘ld’.

F. I/O Library
This module contains methods that are invoked by the intermediate C code generated as an output of the translator module, with the help of the linker at link time.

[Figure: CDF90 compilation flow — a Fortran77/90 Application (.f or .f90) is parsed into an optimized AST; the Translator generates intermediate C code, which the C Compiler (gcc) compiles and links against the I/O Library.]

III. CONSIDERATIONS FOR MIGRATING CDF90 TO 64-BIT

A. Need for Migration of CDF90
64-bit architectures have been growing in popularity for a decade, with a promise of higher accuracy and speed through the use of 64-bit registers and 64-bit addressing. The advantages offered by 64-bit processors are:
- Large virtual address space
- Non-segmented memories
- 64-bit arithmetic
- Faster computations
- Removal of certain system limitations

A study was taken up to understand the feasibility, impact, and effort of migrating the existing Fortran90 compiler. Considering the extensive features offered by CDF90 and the advantages of enhancing it for higher-bit processors, it was decided to enhance the existing compiler to support the LP64 model; this would require reasonable changes mainly in the parser, translator and library modules.
64BitCDF90 needs to be validated for 64-bit architecture
compliance against the existing test cases along with the
B. Data Model Standards Followed
newly added test cases specific for 8-byte integer
For higher precision and better performance, newer support.
architectures have been designed to support various data
models. The three basic models that are supported by most of IV. EFFECT OF LP64 DATA MODEL ON CDF90
the major vendors on 64-bit platform are LP64, ILP64 and Some fundamental changes occur when moving from
LLP64 [5]. ILP32 data model to LP64 data model which are listed as
LP64 (also known as 4/8/8) denotes int as 4 bytes, long following
and pointer as 8 bytes each. long and pointers are no longer 32-bit size. Direct or
ILP64 (also known as 8/8/8) means int, long and indirect assignments of int to long or pointer value is no
pointers are 8 bytes each. more valid.
LLP64 (also known as 4/4/8) adds a new type (long For 64-bit compilation, CDF90 needs to use 64-bit
long) and pointer as 64 bit types library archives. Also need to supply 64-bit specific flags
Many 64-bit compilers support LP64 [5] data model to backend compiler so that it can operate in 64-bit
including gcc and xlc on AIX5.3 platform. mode.
CDF90 acts as a front-end and depends upon gcc/xlc for System derived types such as size_t, time_t, ptrdiff_t are
backend compilation. Since gcc/xlc follows LP64 data model 64-bit aligned in 64-bit compilation environments.
for 64-bit compilation on AIX5.3 platform so 64BitCDF90 Hence these values must not be contained or assigned to
also needs to follow the same data model. Hence LP64 data 32-bit variables.
models have been adopted for 64-bit compilation on 64-bit
AIX5.3/POWER5 platform. V. DESIGN/IMPLEMENTATION CHANGES FOR
C. Approach Followed for Migration 64-BIT ENHANCEMENTS
64BitCDF90 is able to perform 64-bit compilation
Adding 8-byte integer (I: 8) support to CDF90 compiler
correctly on 64-bit platform with I:8 support after carrying
was required to enjoy the benefit of faster computing
out the following tasks described in two phases as below:
with larger integer values. In order to implement this in
CDF90, various Implicit FORTAN Library functions A. Migration from 32-bit to 64-bit
needed to be modified and new functions written that
Major porting concerns can be summarized as below:
support 8-byte integer computations. Also various
Pointers size changes to 64-bit. All direct or implied
changes were required in different compiler modules
assignments or comparisons between “integer” and
which will be described in later part of the paper.
“pointer” values have been examined and removed.
Long size changes to 64-bit. All casts that allow the compiler to accept assignment and comparison between "long" and "integer" have been examined to ensure validity.
Code has been updated to use the new 64-bit APIs, and hence the executable generated after 64-bit compilation is 64-bit compliant.
Macros depending on a 32-bit layout have been adjusted for the 64-bit environment.
A variety of other issues, like data truncation, that can arise from sign extension, memory allocation sizes, shift counts, array offsets, and other factors have to be handled with extreme care.
The user has the option to select between the 32-bit and 64-bit APIs. If a 64-bit compiler flag is supplied (e.g. the '-maix64' flag for backend compilation with gcc 4.2), then the 64-bit API is linked and the 64-bit object file format is generated. If no 64-bit flag is supplied, then by default the 32-bit API is linked and the 32-bit object file format is generated. Code has been added in the compiler to select either the 32-bit or the 64-bit API depending on whether 64-bit specific flags were supplied.
FORTRAN applications passed to CDF90 are translated to C code, so the data model of the underlying C compiler also had to be considered. gcc 4.2 follows the LP64 data model for 64-bit compilation, while for 32-bit compilation it uses the ILP32 data model even on the 64-bit AIX platform. Hence the LP64 data model is the suitable choice in this situation.
Most 64-bit processors support both 32-bit and 64-bit execution modes [10]. Hence 64BitCDF90 also needs to provide both 32-bit and 64-bit compilation support, through separate 32-bit and 64-bit compilation libraries, even though the compiler executable itself may be 32-bit only. It was therefore identified that two different libraries need to be prepared, one for the 32-bit and one for the 64-bit compilation environment.
The Fortran 77/90 executable file format has to change to XCOFF64 when an application is compiled by 64BitCDF90 on the 64-bit platform using the 64-bit libraries. 64-bit CDF90 APIs need to be generated for 64-bit compilation of any application, even though the CDF90 executable itself may be 32-bit only. Note that the same approach is followed by gcc, which uses a 32-bit executable for 64-bit compilation of any application passed to it, through the use of 64-bit library files.
64-bit library files are generated by compiling the source code with gcc 4.2 in 64-bit mode, using the '-maix64' option, on the AIX 5.3/POWER5 platform to produce 64-bit XCOFF file formats. These 64-bit object files are passed to the 'ar' tool with the '-X64' flag to produce 64-bit library archives.
The compiler code itself, which is written in C, is compiled with gcc 4.2 on the 64-bit architecture (AIX 5.3/POWER5) without any 64-bit specific flags, which by default generates a 32-bit executable for the CDF90 compiler. This same CDF90 executable can be linked with the 64-bit library API to compile applications in 64-bit mode and generate object files in the 64-bit XCOFF format, or with the 32-bit API to compile applications in 32-bit mode and generate 32-bit XCOFF object files, on the same 64-bit platform.
32-bit library archives are generated by compiling with gcc 4.2 and then creating the archives with the 'ar' tool, without any 64-bit specific flags, on the AIX 5.3 platform with the PowerPC/POWER5 architecture.

B. Adding I:8 Support in 64BitCDF90

The 32-bit CDF90 compiler supports the following KIND values for integer data types in a Fortran 77/90 program.

TABLE 1
KIND VALUES FOR INTEGERS

KIND Value    Size in bytes
1             1
2             2
3             4

The compiler has been enhanced to allow KIND value 4, for which the size of the integer is 8 bytes. Modifications were made in the following modules to achieve this.

Lexer: Code has been added to identify INTEGER*8 as a valid token.

Parser: The parser code has been modified so that it correctly adds the symbol (I:8) to the AST generated after the parser phase. All other programming constructs dealing with 8-byte integers (functions, macros, data types, etc.) are likewise updated in the parser module so that they transform into the correct AST structure.

Translator: If an integer variable has KIND value 4 in an input Fortran 77/90 program, the translator must declare and translate the symbol into the corresponding C symbol; the corresponding code has been added to the translator module. Functionality has also been added to convert implicit FORTRAN library functions dealing with KIND-4 integer data types into their corresponding C library function names, based on conditional type checking for 8-byte integers. After successful conversion, the function call is emitted into an intermediate .c file. The C functions dealing with I:8 data types are called from the CDF90 libraries. For example, the translator is now able to internally translate the implicit matmul(a, b) function, where a and b are matrices with 8-byte integer elements, into _matmul_i8i8_22(long long **a, long long **b), an intermediate C library function that carries out the actual multiplication. This translation is performed on the basis of integer-size conditional checking. Most of these changes have been carefully debugged with the help of gdb 6.5.

I/O Library: Two libraries are supported by 64BitCDF90: one is used for 32-bit mode and the other is linked for 64-bit mode compilation. Most of the functions in the 32-bit CDF90 library have been modified, or new ones added, to handle 64-bit integer data types, thereby creating the 64-bit CDF90 library, which can thus handle 8-byte integers in most of the functions it contains. The translator module generates the C code that invokes these library functions. For example, the _matmul_i8i8_(long long **a, long long **b) library function has been added to the CDF90 library to compute matrix-matrix multiplications whose elements are 8-byte integers.

Building Libraries: The 32-bit library archives were created by compiling the source code in 32-bit mode, while the 64-bit library archives were prepared by compiling the source code in 64-bit mode on the 64-bit platform: specifically, by compiling with gcc 4.2 using the '-maix64' flag and passing the objects to the archive tool 'ar' with the '-X64' flag to create the 64-bit library archives for 64BitCDF90.

VI. CODE OPTIMIZATIONS: ENERGY EFFICIENT METHODS OFFERED BY CDF90

The energy cost of a program depends on its number of memory accesses [15]. A large number of memory accesses results in high energy dissipation. Energy-efficient compilers reduce memory accesses and thereby reduce the energy consumed by these accesses. One approach is to reduce LOAD and STORE instructions by keeping data in registers or cache memory, providing faster computation and significantly reducing CPU cycles.
According to Kremer [22], "Traditional compiler optimizations such as common subexpression elimination, partial redundancy elimination, strength reduction, or dead code elimination increase the performance of a program by reducing the work to be done during program execution". Several other compiler strategies for energy reduction are also being tried by researchers, such as compiler-assisted dynamic voltage scaling [20], instruction scheduling to reduce functional-unit utilization, memory bank allocation, and inter-program optimizations [19].
Reducing the power dissipation of processors is also being attempted at the hardware level and in the operating system. Examples include power-aware techniques for on-chip buses, transactional memory, memory banks with low-power modes [16][17], and workload adaptation [18][23]. However, energy reduction through the compiler promises certain advantages: no overhead at execution time, the ability to assess 'future' program behaviour through aggressive whole-program analysis, and the ability to identify optimizations and make code transformations for reduced energy usage [24].

A. Optimizations in CDF90

Programs compiled without any optimization generally run very slowly, take more CPU cycles, and hence consume more energy. A medium optimization level (-O2 on many machines) typically yields a speed-up by a factor of 2-3 without any significant increase in compilation time. The different optimizations performed by CDF90 to improve program performance are listed below.

i. Common Sub-expression Elimination
The compiler factors a common sub-expression out of several expressions and calculates it only once instead of several times:

t1=a+b-c
t2=a+b+c

The above two statements are reduced by an optimizing compiler to:

t=a+b
t1=t-c
t2=t+c

Though this may not exhibit any significant performance enhancement for smaller expressions, for bigger expressions it shows a measurable improvement.

ii. Strength Reduction
Strength reduction replaces an arithmetic expression by an equivalent expression that can be evaluated faster. A simple example is replacing 2*i by i+i, since integer addition is faster than integer multiplication.

iii. Loop-Invariant Code Motion
Code that does not depend on the loop iteration is moved out of the loop and calculated once, instead of being recalculated on each iteration. For example, the FORTRAN loop

do i=1,n
a(i)=m*n*b(i)
end do

is replaced by an optimizing compiler with

t=m*n
do i=1,n
a(i)=t*b(i)
end do

iv. Constant Value Propagation
An expression involving several constant values is calculated and replaced by a new constant value. For example,

x=2*y
z=3*x*s

is transformed by an optimizing compiler into

z=6*y*s

v. Register Allocation and Instruction Scheduling
This optimization is both the most difficult and the most important. Since CDF90 depends on gcc for backend compilation, it is left to the backend compiler.

B. Code Transformation Criteria and Applied Techniques

CDF90 offers the following loop optimization techniques.

i. Loop Interchange
Loop interchange is the process of exchanging the order of two iteration variables. It can often be used to enhance the performance of code on parallel or vector machines; determining when loops may be safely and profitably interchanged requires a study of the data dependences in the program.
A loop interchange mechanism has been implemented in CDF90. It contains a function which checks whether the loops can be interchanged: the loops must be perfectly nested, no loop bound may depend on the index variable of another loop, and the loops must be totally independent, since after interchange there should be no invalid computation. Based on the outcome of these checks, the interchange is performed by CDF90 support functions.
For example, consider the FORTRAN matrix addition below:

DO I = 1, N
DO J = 1, M
A(I, J) = B(I, J) + C(I, J)
ENDDO
ENDDO

This loop accesses the arrays A, B and C row by row, which in FORTRAN is very inefficient. Interchanging the I and J loops, as shown below, facilitates column-by-column access:

DO J = 1, M
DO I = 1, N
A(I, J) = B(I, J) + C(I, J)
ENDDO
ENDDO
ii. Loop Vectorization
Vectorization is the conversion of loops from a non-vectored form to a vectored form. A vectored form is one in which the same operation is applied to all members of a range (SIMD) without dependencies. Ideally the range should lie in contiguous memory, because this makes it easy for the processor to know where to apply the next computation; conceptually, though, it could be any set of data. The only built-in data types in C and FORTRAN with contiguous memory are arrays.

iii. Loop Merging
Loop merging is another technique implemented by CDF90 to reduce loop overhead. When two adjacent loops iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.

iv. Loop Unrolling
Loop unrolling duplicates the body of the loop multiple times in order to decrease the number of times the loop condition is tested and the number of jumps, which may degrade performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead (except multiple instruction fetches and increased program load time), but requires that the number of iterations be known at compile time. Loop unrolling is designed to unroll loops for parallelizing and optimizing compilers. To illustrate, consider the following loop:

for (i = 1; i <= 60; i++) a[i] = a[i] * b + c;

This loop can be transformed into the following equivalent loop consisting of multiple copies of the original loop body:

for (i = 1; i <= 60; i += 3) {
a[i] = a[i] * b + c;
a[i+1] = a[i+1] * b + c;
a[i+2] = a[i+2] * b + c;
}

v. Loop Distribution
Loop distribution is used for transforming a sequential program into a parallel one. It attempts to break a loop into multiple loops over the same index range, each taking only a part of the original loop's body. This can improve locality of reference, both for the data accessed in the loop and for the code in the loop's body.

vi. Function Inlining
Function inlining is a powerful high-level optimization which eliminates call cost and increases the chances of other optimizations taking effect, because the call boundaries are broken down. By declaring a function inline, you can direct CDF90 to integrate that function's code into the code of its callers. This makes execution faster by eliminating the function-call overhead. The effect on code size is less predictable; depending on the particular case, object code may be larger or smaller with function inlining.

VII. TESTING 64BitCDF90 ON 64-BIT ARCHITECTURE

Compilers are used to generate software for systems where correctness is important [4], and testing [2] is needed for quality control and error detection in compilers.

A. Challenges
One of the challenges encountered in the CDF90 project was the non-availability of a free (open source) standard test suite to test the 64-bit compiler (especially the I:8 checks). Hence a test suite exercising all the functionality of Fortran 77/90 was developed (DPTESTCASES). This was not an easy task, as the test suite developer needed to be aware of all the features of the language.
A test suite that covered the complete language constructs with 8-byte integer support was developed, called DPTESTCASES. These tests verify each compiler construct with 8-byte integer support, covering FORTRAN language features such as data types and declarations, specification statements, control statements, I/O statements, intrinsic functions, subroutine functions, and boundary checks for integer constants.
Shell scripts were written to compile all test cases, execute them, and check the compiler warnings and error messages against the expected results of DPTESTCASES. The test package has been modified in several ways, such as supplying large values in the relevant test cases for boundary value analysis and checking for correct results. This approach to testing proved helpful in verifying compiler capabilities.

C. Testing Approach
The approach followed for testing 64BitCDF90 can be described by the following main points.

i. Testing of 64BitCDF90 in 32-bit mode on a 64-bit machine
Execute each test case with the 64BitCDF90 compiler in 32-bit mode and check for correct results. This check is required to see whether the modifications performed in the compiler code for 64-bit compatibility have had any adverse side effects. In 32-bit mode the compiler internally uses the 32-bit library archives and generates 32-bit XCOFF on the AIX 5.3 platform. FCVS is used for testing in 32-bit mode.

ii. Testing of 64BitCDF90 in 64-bit mode on a 64-bit machine
Execute each test case with the enhanced CDF90 compiler in 64-bit mode and check for correct results. This check is required to see whether the enhanced compiler successfully compiles each test case in the 64-bit environment. Both FCVS and DPTESTCASES are used to test 64-bit compatibility and 8-byte integer support in the various FORTRAN language constructs. In 64-bit mode the compiler internally uses the 64-bit library archives and generates 64-bit XCOFF on the AIX 5.3/POWER5 platform.

v. Functional Testing
64BitCDF90 was tested alongside the original CDF90 on the same test cases with the intent of finding defects, demonstrating that defects are not present, verifying that each module performs its intended functions with integers of up to 8 bytes (Fortran supports integers of 1, 2, 4 and 8 bytes), and establishing confidence that the program does what it is supposed to do.

vi. Regression Testing
Similar in scope to a functional test, a regression test allows a consistent, repeatable validation of each new release of the compiler against new requirements. Such testing ensures that reported compiler defects have been corrected in each new release and that no new quality problems were introduced in the maintenance process. Regression testing can be performed manually.

vii. Static Analysis of the CDF90 Code
The code size being very large, we needed a suitable static analyser supporting the LP64 data model for 64-bit compilation on the 64-bit platform. We found lint [12] the most suitable; it offered the following advantages: it warns about incorrect, error-prone or nonstandard code that the compiler does not necessarily flag; it flags potential bugs and portability problems; and it assists in improving the source code's effectiveness, including reducing its size and required memory. Of the various options provided by lint, -errchk=longptr64 in particular proved very helpful in migrating CDF90 to the 64-bit platform.

D. Testing Statistics

i. 32-bit Compilation
FCVS was compiled by 64BitCDF90 in 32-bit mode on the 64-bit machine (AIX 5.3/POWER5); 198 out of the total 200 test cases were successful, and the two failing test cases are being worked on. DPTESTCASES, consisting of 210 test cases, compiled successfully with 64BitCDF90 and produced correct results.
The limitations of 32-bit systems, especially the 4 GB virtual memory ceiling, have spurred companies to consider migrating to 64-bit platforms.
This paper has presented some important issues faced while porting a compiler from 32-bit to 64-bit, such as adding a new language feature (the I:8 data type) to the existing compiler and ensuring that this feature is supported by the range of already existing library functions. The testing methodology was elaborated, and the performance optimizations contributing to the energy efficiency of the generated code were explained. The authors presented the traditional optimizations implemented in CDF90; other energy-saving optimizations are being explored and shall be presented in future reports.

REFERENCES

[1] A. V. Aho and J. D. Ullman, "Principles of Compiler Design", Tenth Indian Reprint, Pearson Education, 2003.
[2] Jiantao Pan, "Software Testing", 18-849b Dependable Embedded Systems, Carnegie Mellon University, Spring 1999. http://www.ece.cmu.edu/~koopman/des_s99/sw_testing/
[3] R. A. DeMillo, E. W. Krauser and A. P. Mathur, "An Overview of Compiler-integrated Testing", Australian Software Engineering Conference 1991: Engineering Safe Software; Proceedings, 1991.
[4] A. S. Kossatchev and M. A. Posypkin, "Survey of Compiler Testing Methods", Programming and Computer Software, Vol. 31, No. 1, 2005, pp. 10-19.
[5] Harsha S. Adiga, "Porting Linux applications to 64-bit systems", 12 Apr 2006. http://www.ibm.com/developerworks/library/l-port64.html
[6] http://cdac.in/html/ssdgblr/f90ide.asp
[7] http://www.cocolab.com/en/cocktail.html
[8] http://www.fortran2000.com/ArnaudRecipes/cvs21_f95.html
[9] Stephen C. Johnson, "Yacc: Yet Another Compiler-Compiler", July 31, 1978. http://www.cs.man.ac.uk/~pjj/cs211/yacc/yacc.html
[10] Cathleen Shamieh, "Understanding 64-bit PowerPC architecture", 19 Oct 2004. http://www.ibm.com/developerworks/library/pamicrodesign/
[11] Steven Nakamoto and Michael Wolfe, "Porting Compilers & Tools to 64 Bits", Dr. Dobb's Portal, August 01, 2005.
[12] "lint Source Code Checker", C User's Guide, Sun Studio 11, 819-3688-10.
[13] Andrey Karpov and Evgeniy Ryzhkov, "Traps detection during migration of C and C++ code to 64-bit Windows". http://www.viva64.com/content/articles/64-bit-development/?f=TrapsDetection.html=en&content=64-bit-development
[14] http://www.itl.nist.gov/div897/ctg/fortran_form.htm
[15] Stefan Goedecker and Adolfy Hoisie, "Performance Optimization of Numerically Intensive Codes", Society for Industrial and Applied Mathematics, 2001.
[16] Tali Moreshet, R. Iris Bahar and Maurice Herlihy, "Energy Reduction in Multiprocessor Systems Using Transactional Memory", ISLPED'05, USA.
[17] Y. Cao, T. Okuma and H. Yasuura, "Low-Energy Memory Allocation and Assignment Based on Variable Analysis for Application-Specific Systems", IEICE Technical Report, pp. 31-38, Japan, 2002.
[18] Changjiu Xian and Yung-Hsiang Lu, "Energy Reduction by Workload Adaptation in a Multi-Process Environment", Proceedings of the Conference on Design, Automation and Test in Europe, 2006.
[19] J. Hom and U. Kremer, "Inter-Program Optimizations for Conserving Disk Energy", International Symposium on Low Power Electronics and Design (ISLPED'05), San Diego, California, August 2005.
[20] C.-H. Hsu and U. Kremer, "Compiler-Directed Dynamic Voltage Scaling Based on Program Regions", Rutgers University Technical Report DCS-TR-461, November 2001.
[21] Wei Zhang, "Compiler-Directed Data Cache Leakage Reduction", IEEE Computer Society Annual Symposium on VLSI, ISVLSI'04.
[22] U. Kremer, "Low Power/Energy Compiler Optimizations", in Low-Power Electronics Design (C. Piguet, ed.), CRC Press, 2005.
[23] Majid Sarrafzadeh, Prithviraj Banerjee, Alok Choudhary and Andreas Moshovos, "PACT: Power Aware Compilation and Architectural Techniques", UCLA Department of Computer Science.
[24] U. Kremer, "Compilers for Power and Energy Management", Tutorial, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'03), San Diego, CA, June 2003.
ADCOM 2009
SERVER VIRTUALIZATION
Session Papers:
Is I/O virtualization ready for End-to-End
Application Performance?
J. Lakshmi, S. K. Nandy
Indian Institute of Science, Bangalore, India
{jlakshmi,nandy}@serc.iisc.ernet.in
Abstract: Workload consolidation using the system virtualization feature is key to many successful green initiatives in data centres. In order to exploit the available compute power, such systems warrant sharing of other hardware resources, like memory, caches, I/O devices and their associated access paths, among multiple threads of independent workloads. This mandates the need for ensuring end-to-end application performance. In this paper we explore the current practices for I/O virtualization, using sharing of the network interface card as the example, with the aim of studying the support for end-to-end application performance guarantees. To ensure end-to-end application performance and limit the interference caused by device sharing, we present an evaluation of a previously proposed end-to-end I/O virtualization architecture. The architecture is an extension of the PCI-SIG IOV specification of I/O hardware to support reconfigurable device partitions, and uses a VMM-bypass technique for device access by the virtual machines. Simulation results of the architecture for application quality-of-service guarantees demonstrate its flexibility and scalability.

Keywords – Multicore server virtualization, IO-virtualization architectures, QoS, Performance.

I. Introduction

Multi-core servers have brought tremendous computing capacity to commodity systems. They have not only prompted applications to use fine-grained parallelism to take advantage of the abundance of CPU cycles; they have also initiated the coalescing of multiple independent workloads onto a single server. Multicore servers combined with system virtualization have led to many successful green initiatives of data centre workload consolidation. This consolidation, however, needs to satisfy end-to-end application performance guarantees. Current virtualization technologies have evolved from the prevalent single-hardware single-OS model, which presumes the availability of all other hardware resources to the currently scheduled process. This causes performance interference among multiple independent workloads sharing an I/O device, based on the individual workloads. Major efforts towards consolidation have focussed on aggregating the CPU cycle requirements of the target workloads. But the I/O handling of these workloads on the consolidated servers results in sharing of the physical resources and their associated access paths. This sharing causes interference that depends on the consolidated workloads and makes application performance non-deterministic [1][2][3]. In such scenarios it is essential to have appropriate mechanisms to define, monitor and enforce resource sharing policies across the various contending workloads. Many applications, like real-time hybrid voice and data communication systems onboard aircraft and naval vessels, streaming and on-demand video delivery, and database and web-based services, need to support soft real-time application deadlines to ensure performance when consolidated onto virtualized servers.

Standard I/O devices are not virtualization aware, and hence their virtualization is achieved using a software layer that multiplexes device access to independent VMs. In such cases I/O device virtualization commonly follows one of two basic modes, namely para-virtualization and emulation [4]. In para-virtualization the physical device is accessed and controlled through a protected domain, which can be the virtual machine monitor (VMM) itself or an independent virtual machine (VM), also called the independent driver domain (IDD) as in Xen. The VMM or IDD performs the data transfer to and from the device into its own I/O address space using the device's native driver. From there the copy or transfer of the data to the VM's address space is done using what is commonly called the para-virtualized device driver. The para-virtualized driver is specifically written to support a specific mechanism of data transfer between the VMM/IDD and the VM, and needs a change in the OS of the VM (also called the GuestOS).

In emulation, the GuestOS of the VM installs a device driver for the emulated virtual device. All calls of this emulated device driver are trapped by the VMM and translated into native device driver calls. The advantage of emulation is that it allows the GuestOS to remain unmodified, and it is hence easier to adopt. However, para-virtualization has been found to perform much better than emulation.
This is because emulation incurs a translation on each instruction, whereas para-virtualization involves only page-address translation. But both these modes of device virtualization impose resource overheads when compared to non-virtualized servers, and these overheads translate into application performance loss.

The second drawback of the existing architectures is their lack of sufficient quality-of-service (QoS) controls to manage device usage on a per-VM basis. A desirable feature of such controls is that they should guarantee application performance with a specified QoS on the shared device, unaffected by the other workloads sharing the device. Another desirable feature is that unused device capacity should be available for use by the other VMs. Prevalent virtualization technologies like Xen and Vmware, and even standard Linux distributions, use a software layer within the network stack to implement NIC usage policies. Since these systems were built with the assumption of the single-hardware single-OS model, these features provide the required control over the outgoing traffic from the NIC of the server. The issue is with the incoming traffic. Since policy management is done above the physical layer, ingress traffic accepted by the device may later be dropped based on input stream policies. The respective application then does not receive the data, which perhaps satisfies the application QoS, but the device bandwidth wasted on the dropped traffic affects the delivered performance of all applications sharing the device. It also leads to non-deterministic performance that varies with the type of applications using the device. This model is insufficient for a virtualized server supporting sharing of the NIC across multiple VMs. In this paper we describe and evaluate an end-to-end I/O virtualization architecture that addresses these drawbacks.

The rest of the paper is organized as follows: section II presents experimental results on existing virtualization technologies, namely Xen and Vmware, that motivate this work; section III then describes an end-to-end I/O virtualization architecture to overcome the issues raised in section II; section IV details the evaluation of the architecture and presents the results; section V highlights the contributions of this work with respect to the existing literature; and section VI details the conclusions.

II. Motivation

Existing I/O virtualization architectures use extra CPU cycles to fulfil an equivalent I/O workload. These overheads reflect in the achievable application performance, as depicted by the graph of Figure 1. The data in this graph represents the achievable throughput of the httperf [5] benchmark hosted on a non-virtualized server and on virtualized Xen [6] and Vmware-ESXi [7] servers. In each case the http server was hosted on a Linux (FC6) OS, and for the virtualized servers the hypervisor, the IDD (Xen) [8] and the virtual machine were pinned to the same physical CPU. The server used was a dual-core Intel Core2Duo system with 2 GB RAM and a 10/100/1000 Mbps NIC. In the Xen hypervisor the virtual NIC used by the VM was configured to use a para-virtualized device driver implemented using the event channel mechanism and a software bridge for creating virtual NICs. In the case of the Vmware hypervisor the virtual NIC used inside the VM was configured using a software switch, with access to the device through emulation.

Figure 1: httperf benchmark throughput for a non-virtualized server and for Xen and Vmware-ESXi virtual machines hosting the http server on Linux (FC6). (Some data of the graph has been reused from [9][10].)

As can be observed from the graphs of Figure 1, the sustainable throughput of the benchmark drops considerably when the http server is moved from the non-virtualized to a virtualized server. The reason for this drop is answered by the CPU utilization graph depicted in Figure 2: moving the http server from the non-virtualized to the virtualized server significantly increases the %CPU utilization needed to support the same httperf workload, and this increase is substantial for the emulated mode of device virtualization. This increased CPU utilization is caused by the I/O device virtualization overheads.

Further, when the same server is consolidated with two VMs sharing the same NIC, each supporting one stream of an independent httperf benchmark, there is a further drop in achievable throughput per VM. This is explicable, since each VM now contends for the same NIC. The virtualization mechanisms share not only the device but also the device access paths. This sharing causes serialization, which leads to latencies and application performance loss that depends on the nature of the consolidated workloads. Also, the increased latency in supporting the same I/O workload on the virtualized platform causes a loss of usable device bandwidth, which further reduces the scalability of device sharing by multiple VMs.
Such an approach often results in under-utilized resources and limited consolidation ratios, particularly for I/O workloads. The remedy is to build I/O devices that are virtualization aware and to decouple device management from device access, i.e., to provide native access to the I/O device from within the VM and let the VMM manage concurrency issues rather than ownership issues.

Lack of fine-grained QoS controls on device sharing causes performance loss that depends on the workloads of the VMs sharing the device, and this limits the scalability of I/O device sharing. To address this, the I/O device should support QoS controls for both the incoming and the outgoing traffic.

To overcome the above drawbacks, we propose an I/O virtualization architecture. This architecture proposes an extension to the PCI-SIG IOV specification [11] for virtualization-enabled hardware I/O devices, with a VMM-bypass [12] mechanism for virtual device access.

III. End-to-End I/O Virtualization Architecture

We propose an end-to-end I/O virtualization architecture that enables direct or native access to the I/O device from within the VM, rather than access through the layer of the VMM or IDD. The PCI-SIG IOV specification proposes virtualized I/O devices that can support native device access by the VM, provided the hypervisor is built to support such architectures. IOV-specified hardware can support multiple virtual devices at the hardware level. The VMM needs to be built such that it can recognize and export each virtual device to an independent VM, as if the virtual device were an independent physical device. This allows native device access to the VM. When a packet hits the hardware-virtualized NIC, the VMM recognizes the destination VM of the incoming packet by the interrupt raised by the device and forwards it to the appropriate VM. The VM processes the packet as it would in the non-virtualized case. Here, device access and the scheduling of device communication are managed by the VM that is using the device. This eliminates the intermediary VMM/IDD on the device access path and reduces the I/O service time, which improves the usable device bandwidth and application throughput.

To support the idea of QoS based on device usage, we extend the IOV architecture specification by sizing each virtual device to the QoS requirement of the VM to which the virtual device is allocated. This, along with features like TCP offload, virtual device priority and bandwidth specification support at the device level, provides fine-grained QoS controls at the device while sharing it with other VMs, as is elaborated upon in the evaluation section.

Figure 5 gives a block schematic of the proposed I/O virtualization architecture. The picture depicts a NIC card that can be housed within a virtualized server. The card has a controller that manages the DMA transfers to and from the device memory. The standard device memory is replaced by a re-partitionable memory supported with n sets of device registers. A set of m memory partitions, where m ≤ n, with device registers forms the virtual NICs (vNICs). Ideally the device memory should be reconfigurable, i.e. dynamically partitionable, with the VM's QoS requirements driving the sizing of the memory partition. The advantage of a dynamically partitionable device memory is that unused memory can easily be extended into, or reclaimed from, a vNIC in order to match adaptive QoS specifications. The NIC identifies a vNIC request by generating message signaled interrupts (MSI). The number of interrupts supported by the controller restricts the number of vNICs that can be exported. Based on the QoS guarantees a VM needs to honour, judicious use of native and para-virtualized access to the vNICs can overcome this limitation: a VM that has to support stringent QoS guarantees can use native access to its vNIC, whereas VMs looking for best-effort NIC access can be given para-virtualized access.

Figure 5: NIC architecture supporting MSI interrupts with partitionable device memory, multiple device register sets and DMA channels enabling independent virtual NICs.
enabling reconfigurable memory on the I/O device.
2
For each of the virtual device defined on the This section is being reproduced from [10] to maintain
physical device, the device memory associated with continuity in the text. Complete architecture description with
performance statistics on achievable application throughput can
the virtual device is derived from the QoS be found in [10].
4
22
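The partitioning scheme described above (at most n vNICs, one per register set, each sized by the QoS requirement of the VM it is allocated to, with unused memory available for resizing) can be sketched in a few lines of Python. This is purely an illustrative model of the bookkeeping, not the hardware design; the class and method names are our own:

```python
class DeviceMemory:
    """Toy model of the re-partitionable NIC memory described in the text:
    at most n vNICs (one per device-register set), each partition sized by
    the QoS requirement of the VM it is allocated to."""

    def __init__(self, total_bytes, register_sets):
        self.total = total_bytes
        self.n = register_sets          # hardware limit on exportable vNICs
        self.partitions = {}            # vm_id -> allocated bytes

    def free_bytes(self):
        return self.total - sum(self.partitions.values())

    def allocate_vnic(self, vm_id, qos_bytes):
        # A vNIC needs a free register set and enough device memory.
        if len(self.partitions) >= self.n or qos_bytes > self.free_bytes():
            return False
        self.partitions[vm_id] = qos_bytes
        return True

    def resize_vnic(self, vm_id, new_bytes):
        # Unused memory can be extended into (or reclaimed from) a vNIC
        # to track adaptive QoS specifications.
        delta = new_bytes - self.partitions[vm_id]
        if delta > self.free_bytes():
            return False
        self.partitions[vm_id] = new_bytes
        return True

mem = DeviceMemory(total_bytes=64 * 1024, register_sets=4)
assert mem.allocate_vnic("VM1", 32 * 1024)
assert mem.allocate_vnic("VM2", 16 * 1024)
assert mem.resize_vnic("VM2", 24 * 1024)        # grow into unused memory
assert not mem.allocate_vnic("VM3", 16 * 1024)  # would exceed capacity
```

The m ≤ n constraint appears as the register-set check: even with memory to spare, a vNIC cannot be exported without a free register set and interrupt.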
The VMM can aid in setting up the appropriate hosting connections based on the requested QoS requirements.

The proposed architecture can be realized by the following modifications:

Virtual-NIC: In order to define a vNIC, the physical device should support time-sharing in hardware. For a NIC this can be achieved by using MSI and dynamically partitionable device memory. These form the basic constructs to define a virtual device on a physical device, as depicted in Figure 5. Each virtual device has a specific logical device address, like the MAC address in the case of NICs, based on which the MSI is routed. Dedicated DMA channels, a specific set of device registers and a partition of the device memory are part of the virtual device interface, which is exported to a VM when it is started. We call this virtual interface the virtual-NIC, or vNIC; it forms a restricted address space on the device for the VM to use, and it remains in the possession of the VM while the VM is active or until the VM relinquishes the device.

Accessing the virtual-NIC: For accessing the virtual-NIC, a native device driver is hosted inside the VM and is initialized, with the help of the VMM, when the VM is initialized. This device driver can only manipulate the restricted device address space that was exported through the vNIC interface by the VMM. With the vNIC, the VMM only identifies and forwards the device interrupts to the destination VM. The OS of the VM now handles the I/O access and thus can be accounted for the resource usage it incurs. This eliminates the performance interference due to the VMM/IDD handling multiple VMs' requests to/from a shared device. Also, because the I/O access is now directly done by the VM, the service time on the I/O access reduces, resulting in better bandwidth utilization. With the vNIC interface, data transfer is handled by the VM. While initializing the device driver for the virtual NIC, the VM sets up the Rx/Tx descriptor rings within its address space and makes a request to the VMM to initialize the I/O page translation table. The device driver uses this table and performs DMA transfers directly into the VM's address space.

QoS and the virtual-NIC: The device memory partition acts as a dedicated device buffer for each of the VMs, and with appropriate logic on the NIC card one can easily implement QoS-based SLAs on the device that translate to bandwidth restrictions and VM-specific priority. The key is being able to identify the incoming packet with the corresponding VM, which the NIC is now expected to do. While communicating, the NIC controller decides whether to accept or reject an incoming packet based on the bandwidth specification or the virtual device's available memory. This gives fine-grained control over the incoming traffic and helps reduce the interference effects. The outbound traffic can be controlled by the VM itself, using any of the mechanisms employed in existing architectures.

IV. Architecture Evaluation for QoS Controls

The proposed architecture was modeled using a Layered Queuing Network (LQN) model, and the service times for the various entries of the model were obtained by using runtime profilers on an actual Xen-based virtualized server. Complete model building and validation details are available in [9][10]. Here we present the results of the QoS evaluation carried out using the LQN model [13] of the proposed architecture. The QoS experiments were conducted along the same lines as described in the introduction section. The difference now is that the QoS control is applied on the ingress traffic of the constrained VM, namely VM2. The results obtained are depicted in Figure 6. The proposed architecture allows for achieving higher application throughput on the shared NIC, firstly because of the VMM-bypass [12]. Also, as can be observed from the graphs, the control of ingress traffic in the case of the httperf benchmark shows a highly improved performance benefit for the unconstrained VM, namely VM1.

Figure 6: httperf throughput sharing on a QoS-controlled, shared NIC between two VMs using the proposed architecture, with throughput constraints applied on the ingress traffic of VM2 at the NIC.
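The ingress accept/reject decision described under "QoS and the virtual-NIC" (admit a packet only if the vNIC's bandwidth specification and its partition memory allow it) can be illustrated with a small token-bucket sketch. This is an assumption-laden toy in Python, not the NIC controller's logic; the class name and all rates and sizes are ours:

```python
class VnicIngressFilter:
    """Toy accept/reject logic for one vNIC: a packet is admitted only if
    the bandwidth budget (token bucket) and the partition memory allow it."""

    def __init__(self, rate_bps, burst_bytes, partition_bytes):
        self.rate = rate_bps             # QoS bandwidth specification
        self.tokens = burst_bytes        # current bandwidth budget
        self.burst = burst_bytes
        self.free_mem = partition_bytes  # free device-memory partition

    def tick(self, seconds):
        # Replenish the bandwidth budget as time passes.
        self.tokens = min(self.burst, self.tokens + self.rate * seconds)

    def on_packet(self, size_bytes):
        # Reject on either bandwidth saturation or exhausted device memory.
        if size_bytes > self.tokens or size_bytes > self.free_mem:
            return "reject"
        self.tokens -= size_bytes
        self.free_mem -= size_bytes      # buffered until DMA'd to the VM
        return "accept"

    def dma_complete(self, size_bytes):
        # The VM drained the packet; the partition memory is free again.
        self.free_mem += size_bytes

f = VnicIngressFilter(rate_bps=1000, burst_bytes=1500, partition_bytes=3000)
assert f.on_packet(1500) == "accept"
assert f.on_packet(1500) == "reject"   # bandwidth budget exhausted
f.tick(1.5)                            # 1.5 s restores 1500 bytes of budget
assert f.on_packet(1500) == "accept"
assert f.on_packet(100) == "reject"    # bandwidth and memory both exhausted
```

Rejecting at the device, before any DMA, is what avoids the wasted downstream work discussed next.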
For request-response kinds of benchmarks like httperf, controlling the ingress bandwidth is beneficial because once a request is dropped due to saturation of the allocated bandwidth, there is no downstream activity associated with it, and wasteful resource utilization of the NIC and CPU is avoided. The QoS control at the device on the input stream of VM2, and the native access to the vNICs by the VMs, give the desired flexibility of making the unused bandwidth available to the unconstrained VM.

V. Related Work

In early implementations, I/O virtualization adopted dedicated I/O device assignment to a VM. This later evolved to device sharing across multiple VMs through virtualized software interfaces [14][4]. A dedicated software entity, called the I/O domain, is used to perform physical device management. The I/O domain is either part of the VMM or is by itself an independent domain, like the IDD of Xen [8][15]. With this intermediary software layer between the device and the VM, any application in a VM seeking access to the device has to route its request through it. This architecture still builds over the single-hardware single-OS model [16]-[21]. The consequence of such virtualization techniques is visible in the loss of application throughput and usable device bandwidth on virtualized servers, as discussed earlier. Because of the poor performance of these I/O virtualization architectures, a need to build concurrent access to shared I/O devices was felt, and recent publications on concurrent direct network access (CDNA) [22][19] and the scalable self-virtualizing network interface [23] describe such efforts. However, the scalable self-virtualizing interface [23] describes assigning a specific core for network I/O processing on the virtual interface, and exploits multiple cores on embedded network processors for this. The paper does not detail how the address translation issues are handled, particularly in the case of virtualized environments. The CDNA architecture is similar to the proposal in this paper in terms of allowing multiple VM-specific Rx and Tx device queues. But CDNA still builds over the VMM/IDD handling the data transfer to and from the device. Although the results of this work are exciting, the architecture still lacks the flexibility required to support fine-grained QoS. Moreover, the paper neither discusses the performance interference due to uncontrolled data reception by the device, nor does it highlight the need for addressing the QoS constraints at the device level. The architecture proposed in this paper addresses these issues, and also pushes the basic constructs for assigning QoS attributes, such as required bandwidth and priority, into the device, to get finer control on resource usage and to restrict performance interference.

The proposed architecture has its basis in the exokernel's [24] philosophy of separating device management from protection. In the exokernel, the idea was to extend native device access to applications, with the exokernel providing the protection. In our approach, native device access is extended to the VM, with the protection being managed by the VMM. A VM is assumed to be running a traditional OS. Further, the PCI-SIG community has realized the need for I/O device virtualization and has come out with the IOV specification to deal with it. The IOV specification, however, details device features that allow native access to virtual device interfaces, through the use of I/O page tables, virtual device identifiers and virtual-device-specific interrupts. The specification presumes that QoS is a software feature and does not address it. Many implementations adhering to the IOV specification are now being introduced in the market by Intel [25], Neterion [26], NetXen [27], Solarflare [28], etc. The CrossBow [29] suite from Sun Microsystems talks about this kind of resource provisioning, but it is a software stack over standard IOV-compliant hardware. The results published using any of these products are exciting in terms of the performance achieved, but almost all of them have ignored the control of reception at the device level. We believe that the lack of such a control on highly utilized devices will cause non-deterministic application performance loss and under-utilization of the device bandwidth.

VI. Conclusion

In this paper we described how the lack of virtualization awareness in I/O devices leads to latency overheads on the I/O path. Added to this, the intermixing of device management and data protection issues further increases the latency, thereby reducing the effective usable bandwidth of the device. Also, the lack of appropriate device-sharing control mechanisms at the device level leads to loss of bandwidth and to performance interference among the VMs sharing the device. To address these issues we proposed an I/O device virtualization architecture, as an extension to the PCI-SIG IOV specification, and demonstrated its benefit through simulation techniques. The results demonstrate that by moving the QoS controls to the shared device, the unused bandwidth is made available to the unconstrained VM, unlike the case in prevalent technologies. The proposed architecture also improves the scalability of VMs sharing the NIC because it eliminates the common software entity that regulates I/O device sharing. Another advantage is that, with this architecture, the resource utilization is now accounted for by the VM. Also, this architecture reduces the workload interference on sharing a device and simplifies the consolidation process.
References

[1] M. Welsh and D. Culler, "Virtualization Considered Harmful: OS Design Directions for Well-conditioned Services", 8th Workshop on Hot Topics in Operating Systems, 2001.
[2] K. J. Nesbit, J. E. Smith, M. Moreto, F. J. Cazorla, A. Ramirez, and M. Valero, "Multicore Resource Management", IEEE Micro, Vol. 28, Issue 3, pp. 6-16, 2008.
[3] K. J. Nesbit, M. Moreto, F. J. Cazorla, A. Ramirez, M. Valero, and J. E. Smith, "Virtual Private Machines: Hardware/Software Interactions in the Multicore Era", IEEE Micro special issue on Interaction of Computer Architecture and Operating System in the Manycore Era, May/June 2008.
[4] S. Rixner, "Breaking the Performance Barrier: Shared I/O in Virtualization Platforms Has Come a Long Way, but Performance Concerns Remain", ACM Queue – Virtualization, Jan/Feb 2008.
[5] D. Mosberger and T. Jin, "httperf: A Tool for Measuring Web Server Performance", ACM Workshop on Internet Server Performance, pp. 59-67, June 1998.
[6] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the Art of Virtualization", 19th ACM SOSP, Oct. 2003.
[7] "VMware ESX Server 2 – Architecture and Performance Implications", 2005, available at http://www.vmware.com/pdf/esx2_performance_implications.pdf
[8] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson, "Safe Hardware Access with the Xen Virtual Machine Monitor", 1st Workshop on OASIS, Oct. 2004.
[9] J. Lakshmi and S. K. Nandy, "Modeling Architecture-OS Interactions using Layered Queuing Network Models", Proceedings of HPC Asia, Taiwan, March 2009.
[10] J. Lakshmi and S. K. Nandy, "I/O Device Virtualization in the Multi-core Era, a QoS Perspective", Workshop on Grids, Clouds and Virtualization, International Conference on Grids and Pervasive Computing, Geneva, May 2009.
[11] PCI-SIG IOV Specification, available online at http://www.pcisig.com/specifications/iov
[12] J. Liu, W. Huang, B. Abali, and D. K. Panda, "High Performance VMM-bypass I/O in Virtual Machines", Proceedings of the USENIX Annual Technical Conference, June 2006.
[13] Layered Queueing Network Solver software package, http://www.sce.carleton.ca/rads/lqns/
[14] T. von Eicken and W. Vogels, "Evolution of the Virtual Interface Architecture", IEEE Computer, 31(11), 1998.
[15] J. Sugerman, G. Venkatachalam, and B. Lim, "Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor", Proceedings of the USENIX Annual Technical Conference, June 2001.
[16] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat, "Enforcing Performance Isolation Across Virtual Machines in Xen", in M. van Steen and M. Henning, editors, Middleware, volume 4290 of Lecture Notes in Computer Science, pp. 342-362, Springer, 2006.
[17] C. Weng, Z. Wang, M. Li, and X. Lu, "The Hybrid Scheduling Framework for Virtual Machine Systems", Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Washington, DC, USA, March 11-13, 2009.
[18] H. Kim, H. Lim, J. Jeong, H. Jo, and J. Lee, "Task-aware Virtual Machine Scheduling for I/O Performance", Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Washington, DC, USA, March 11-13, 2009.
[19] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel, "Diagnosing Performance Overheads in the Xen Virtual Machine Environment", Proceedings of the ACM/USENIX Conference on Virtual Execution Environments, June 2005.
[20] A. Menon, A. L. Cox, and W. Zwaenepoel, "Optimizing Network Virtualization in Xen", Proceedings of the USENIX Annual Technical Conference, June 2006.
[21] J. R. Santos, G. Janakiraman, Y. Turner, and I. Pratt, "Netchannel 2: Optimizing Network Performance", Xen Summit Talk, November 2007.
[22] P. Willmann, J. Shafer, D. Carr, A. Menon, S. Rixner, A. L. Cox, and W. Zwaenepoel, "Concurrent Direct Network Access for Virtual Machine Monitors", Proceedings of the International Symposium on High-Performance Computer Architecture, February 2007.
[23] H. Raj and K. Schwan, "Implementing a Scalable Self-virtualizing Network Interface on a Multicore Platform", Workshop on the Interaction between Operating Systems and Computer Architecture, Oct. 2005.
[24] M. Frans Kaashoek et al., "Application Performance and Flexibility on Exokernel Systems", 16th ACM SOSP, Oct. 1997.
[25] Intel Virtualization Technology for Directed I/O, www.intel.com/technology/itj/2006/v10i3/2-io/7-conclusion.htm
[26] Neterion, http://www.neterion.com/
[27] NetXen, http://www.netxen.com/
[28] Solarflare Communications, http://www.solarflare.com/
[29] CrossBow: Network Virtualization and Resource Control, http://www.opensolaris.org/os/community/networking/crossbow_sunlabs_ext.pdf
Eco-Friendly Features of a Data Center OS
Surya Prakki
Sun Microsystems
Bangalore, India
surya.prakki@sun.com
damage to the rest of the system. Another outcome of this is that even privileged processes in non-global zones are prevented from performing operations that can have system-wide impact.

Administration of individual zones can be delegated to others, knowing that any actions taken by them will not affect the rest of the system.

Zones do not present a new API or ABI to which applications need to be 'ported', i.e. existing OpenSolaris applications run inside zones without any changes or recompilation. A process running inside a zone runs natively on the CPU and hence doesn't incur any performance penalty.

A zone, or multiple zones, can be tied to a resource pool, which pulls together CPUs and a scheduler. The resources associated with a pool can be changed dynamically. This enables the global administrator to give more resources to a zone when it is nearing its peak demand. The Fair Share Scheduler (FSS) can be associated with a pool. Using FSS, a physical CPU assigned to multiple zones via a resource pool can be shared as per their entitlements, even in the face of the most demanding workloads, thus guaranteeing quality of service (QoS). The physical memory and swap memory are also configured per zone and can be changed dynamically to meet the varying demands of a zone.

Zones can be booted and halted independently of the underlying kernel. As a zone boot involves only setting up the virtual platform and mounting the configured file systems, it is a much faster operation than booting even the smallest physical system.

The software packages that need to be available in a zone are a function of the service it is hosting, and a zone administrator is free to pick and choose. This makes a zone a 'Just Enough OS' kind of slick environment, tailor-made for the services it hosts. Likewise, a zone administrator is also free to maintain the patch levels of the packages she installed.

To save on file system space, two types of zones are supported: sparse and whole-root. In a sparse zone, some file systems like /usr are loopback-mounted from the global zone so that multiple copies of the binaries are not present. In a whole-root zone, all components that make up the platform are installed, and it thus takes more disk space.

The zones technology has been extended to be able to run applications compiled for earlier versions of Solaris. This is referred to as branding a zone and is used to run Solaris 8 and Solaris 9 applications. In such a branded zone, the runtime environment for a Solaris 8 application is no different from what it was on a box running the Solaris 8 kernel. This branding feature helps customers replace old power-hungry systems with newer eco-friendly computers. The operating environment should also provide tools that help in making this transition smoother.

The OpenSolaris system has a rich set of observability tools, like DTrace(1M), kstat(1M), truss(1) and proc(4), and debugging tools like mdb(1), which can be used to study the behavior of, and debug, applications in a production environment. These tools can be run either from the global zone or from inside a non-global zone itself.

The extensibility of the zones framework can be gauged by the fact that there is an 'lx' brand using which Linux 2.6, 32-bit applications can be run unmodified on an OpenSolaris x86 system inside an lx-branded zone. Thus even Linux applications can be observed from the global zone using the earlier-mentioned tools.

A zone can be configured with an exclusive IP stack, such that it can have its own routing table, ARP table, IPSec policies and associations, IP filter rules and TCP/IP ndd variables. Each zone can continue to run services such as NFS on top of TCP/UDP/IP. This way a physical NIC can be set aside for the exclusive use of the zone. Zones can also be configured to have shared IP stacks on top of a single NIC.

Disks can be provided to a zone for its exclusive use, or a portion of a file system's space can be set aside for the zone, or, using the ZFS file system, space can be grown dynamically as the demand grows.

The following block diagram captures the above discussion:

Figure 1.

Thus, to summarize, zones virtualize the following facilities:

Processes
File Systems
Networking
Identity
Devices
Packaging
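The entitlement arithmetic behind the Fair Share Scheduler mentioned above can be sketched as follows. This is our own toy model, not the actual scheduler (FSS additionally tracks decayed CPU usage over time); when every zone in a pool is busy, each zone's fraction of the pooled CPU is simply its share count over the sum of the competing shares:

```python
def fss_entitlements(shares):
    """shares: mapping of zone name -> FSS share count.
    Returns each zone's fraction of the pooled CPU when all zones are
    busy; idle zones simply don't compete, so the rest split their shares."""
    total = sum(shares.values())
    return {zone: s / total for zone, s in shares.items()}

# Three busy zones in one resource pool with 100, 50 and 50 shares:
ent = fss_entitlements({"web": 100, "db": 50, "batch": 50})
assert ent["web"] == 0.5 and ent["db"] == 0.25
```

Because entitlements are relative rather than absolute, adding or halting a zone automatically redistributes the CPU among the remaining busy zones.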
1) Field Use Case:

www.blastwave.com offers open source packages for different versions of Solaris. As a result, they needed to continue to have physical systems running Solaris 8 and Solaris 9. When they made use of the branded zones feature, they were able to consolidate these legacy systems onto systems running Solaris 10, and they quantified the gain as follows:

o 65 percent reduction in rack space, saving tens of thousands of dollars in power, cooling, and hardware-maintenance costs

o Reduced setup time from hours to minutes.

Sun IT heavily uses zones to host quite a few services, to quote a few:

o A portal, running in a zone, through which requests to make source changes are made.

o A service to manage lab infrastructure.

o A Namefinder service which reports basic information about employees.

o A service to host software patches.

One web server is run in each of the above zones, and they all listen on port 80 without stepping on each other. All of the above services are critical to the running of the business. The first three cases do not see much change in workload and hence are easily virtualized using zones. In the case of the fourth, patches are released periodically and there are a lot of hits in the first 48 hours of patches being made available; during this time, additional CPUs and physical memory can be set aside for this zone using dynamic resource pools via a cron(1M) job. In consolidating these services, we replaced four physical systems with a single one.

Price et al. [11] report less than 4% performance degradation for time-sharing workloads, attributed to the loopback mounts in the case of sparse zones.

B. Crossbow

Crossbow provides the building blocks for network virtualization and resource control by virtualizing the network stack and NIC around any service (HTTP, HTTPS, FTP, NFS etc.), protocol or virtual machine. Crossbow does to the networking stack what zones did to OS services.

One of the main components of Crossbow is the ability to virtualize a physical NIC into multiple virtual NICs (VNICs). These VNICs can be assigned to either zones or any virtual machines sharing the physical NIC. Virtualization is implemented by the MAC layer and the VNIC pseudo driver of the OpenSolaris network stack. It allows physical NIC resources such as hardware rings and interrupts to be allocated to specific VNICs. This allows each VNIC to be scheduled independently, as per the load on the VNIC, and also allows the classification of packets between VNICs to be off-loaded to hardware.

Each VNIC is assigned its own MAC address and an optional VLAN id (VID). The resulting MAC+VID tuple is used to identify a VNIC on the network, physical or virtual.

Crossbow allows a bandwidth limit to be set on a VNIC. The bandwidth limit is enforced by the MAC layer transparently to the user of the VNIC. This mechanism allows the administrator to configure the link speed of VNICs that are assigned to zones or VMs; this way they can't use more bandwidth than their assigned share.

Crossbow provides virtual switching semantics between VNICs created on top of the same physical NIC. The virtual switching done by Crossbow is consistent with the behavior of a typical physical switch found on a physical network.

Crossbow VNICs and virtual switches can be combined to build a Crossbow Virtual Wire (vWire). A vWire can be a fully virtual network in a box. A vWire can be used to instantiate a layer-2 network, which can be used to run a distributed application spanning multiple virtual hosts.

Virtual Network Machines (VNMs) are pre-canned OpenSolaris zones which encapsulate a network function such as routing, load balancing, etc. VNMs are assigned their own VNIC(s), and can be deployed on a vWire to provide the network functions needed by the network applications. VNMs can come in pre-configured fashion, which helps in deploying new instances quickly.

III. PLATFORM LEVEL VIRTUALIZATION

There are some instances where OS level virtualization falls short:

o Any decent-sized data center will have heterogeneous workloads, with applications compiled for different platforms, and any effort to consolidate such workloads can't be fully achieved by OS level virtualization.

o A service or application needs a specific kernel module or driver to operate.

o Different applications need different kernel patch levels.

o Legacy applications which expect end-of-life (EOL) operating systems need to be consolidated.

In such scenarios, a platform virtualization solution can be used to consolidate the workloads. These solutions carve multiple Virtual Machines (VMs) out of a physical machine and are referred to as either Hypervisors or Virtual Machine Monitors (VMMs).

For the rest of the discussion let us look at the hypervisors LDOMs and Xen, which OpenSolaris supports on SPARC and x86 architectures respectively, and see how they address some of the challenges mentioned in the Introduction.
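The MAC+VID classification step described in the Crossbow section above can be sketched as a lookup table. This is an illustrative Python model of the steering decision, not the MAC-layer implementation; the class and method names are our own:

```python
class VirtualSwitch:
    """Toy model of Crossbow's classification step: frames arriving on the
    physical NIC are steered to a VNIC by their (MAC address, VLAN id) tuple."""

    def __init__(self):
        self.table = {}   # (mac, vid) -> vnic name

    def create_vnic(self, name, mac, vid=None):
        # Each VNIC registers its MAC address and optional VLAN id.
        self.table[(mac, vid)] = name

    def classify(self, dst_mac, vid=None):
        # Frames with no matching VNIC fall through (flooded or dropped
        # by the real switch); here we just report None.
        return self.table.get((dst_mac, vid))

vsw = VirtualSwitch()
vsw.create_vnic("vnic1", "0:1:2:3:4:5")
vsw.create_vnic("vnic2", "0:1:2:3:4:6", vid=100)
assert vsw.classify("0:1:2:3:4:5") == "vnic1"
assert vsw.classify("0:1:2:3:4:6", vid=100) == "vnic2"
assert vsw.classify("0:1:2:3:4:6") is None   # same MAC, wrong VLAN
```

Because the key is the full MAC+VID tuple, two VNICs may even share a MAC address as long as their VLAN ids differ, which is what lets a vWire carve several layer-2 networks out of one physical link.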
A. Logical Domains (LDOMs)

LDOMs can be viewed as a hypervisor implemented in firmware. The LDOMs hypervisor is shipped along with sun4v-based SPARC systems.

The hypervisor divides the physical SPARC machine into multiple virtual machines, called domains. Each domain can be dynamically configured with a subset of the machine resources and its own independent OS. Isolation and protection are provided by the LDOMs hypervisor by restricting access to registers and address space identifiers (ASIs).

The hypervisor takes care of the partitioning of resources. Hardware resources are assigned to logical domains using 'Machine Descriptions' (MDs). An MD is a graph describing the devices available, which includes CPU threads/strands, memory blocks and the PCI bus. Each LDOM has its own MD. This is used to build the Open Boot PROM (OBP) and, consequently, the OS device tree upon booting the guest. A fallout of this model is that a guest OS doesn't even see any hardware resources not present in its MD.

The first domain that comes up on a sun4v-based system acts as the control domain. The LDOMs software manager runs in the control domain. It helps us interact with the hypervisor and allocate and deallocate physical resources to guest domains. The management software consists of a daemon (ldmd(1M)), which interacts with the hypervisor, and the ldm(1M) CLI, which interacts with the daemon. ldmd(1M) can be run only in the control domain.

The hypervisor provides a fast point-to-point communication channel called the 'logical domain channel' (LDC) to enable communication between LDOMs, and between an LDOM and the hypervisor. LDC end points are defined in the MD. ldmd(1M) also uses LDC to interact with the hypervisor in managing domains.

Initially all the resources are given to the control domain. Using ldm(1M), the administrator needs to detach resources from the primary domain and pass them on to the newly created guest domains.

I/O for guest domains can be configured in two ways:

Direct I/O: A guest domain could be given direct access to a PCI device, and the guest domain could manage all the I/O devices connected to it using its own disk drivers.

Virtualized I/O: One of the domains, called a 'service domain', controls the hardware present on the system and presents 'virtual devices' to the other guest domains. The guest then uses a virtual block device driver which forwards the I/O to the service domain through LDC. The virtual device presented to the guest could be backed by a physical disk, a slice of it, or even a regular file.

In the case of network I/O, the hypervisor presents a layer-2 'virtual network switch' which enables domain-to-domain traffic. Any number of VLANs can be created by creating additional v-switches in the service domain. The switch talks to the device driver in the service domain to connect to the physical NIC for external network connection, and it can tag along the layer-3 features of the service domain kernel to do routing, iptable filtering, NAT and firewalling.

The hypervisor automatically powers off CPU cores that are not in use, i.e. not assigned to any domain.

The following block diagram captures the above discussion:

Figure 2.

B. x86 Platform Virtualization

In the x86 space, over the years, many virtualization technologies have come in, and they can be classified under two types:

o Type 1 – where the virtualization solution does not need a host OS to operate.

o Type 2 – where the virtualization solution needs a host OS to operate.

For the rest of the discussion, let us look at Xen (a Type 1 hypervisor) and VirtualBox (a Type 2 hypervisor).

1) Xen

Xen is an open source hypervisor technology developed at the University of Cambridge. This discussion is specific to Solaris running on x86-based systems. Each VM created can run a complete OS. Xen sits directly on top of the hardware and below the guest operating systems. It takes care of partitioning the available CPU(s) and physical memory resources across the VMs.

Unlike other contemporary virtualization solutions that exist for x86, Xen started off with a different design approach to minimise the overheads a guest OS incurs: the guest needs to be ported to the Xen architecture, which very closely resembles the underlying x86 architecture. This approach is referred to as 'Para-Virtualization (PV)'.

In this approach, a PV guest kernel is made to run in ring 1 of the x86 architecture, while the hypervisor runs in ring 0. This way the hypervisor protects itself against any malicious guest kernel.
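The protection boundary just described can be caricatured in a few lines of Python. This is our own illustrative model of the ring separation, not Xen code: the guest never touches privileged state directly; every privileged service goes through a single hypercall entry point that the hypervisor validates:

```python
class Hypervisor:
    """Runs (conceptually) in ring 0 and owns all privileged state."""

    def __init__(self, total_pages):
        self.free_pages = total_pages
        self.allocations = {}            # guest name -> pages granted

    def hypercall(self, guest, op, arg):
        # Single entry point, analogous to the hypercall gate: the
        # hypervisor checks every request before acting on it.
        if op == "grow_memory":
            if arg <= 0 or arg > self.free_pages:
                return "denied"
            self.free_pages -= arg
            self.allocations[guest] = self.allocations.get(guest, 0) + arg
            return "ok"
        return "denied"                  # unknown ops never touch ring-0 state

class PVGuest:
    """Runs (conceptually) in ring 1; it can only ask, never take."""

    def __init__(self, name, hv):
        self.name, self.hv = name, hv

    def request_memory(self, pages):
        return self.hv.hypercall(self.name, "grow_memory", pages)

hv = Hypervisor(total_pages=1024)
dom_u = PVGuest("domU1", hv)
assert dom_u.request_memory(512) == "ok"
assert dom_u.request_memory(4096) == "denied"   # exceeds free pages
assert hv.free_pages == 512
```

The same gate pattern covers the dynamic memory ballooning described below: growing or shrinking a guest is just another validated hypercall.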
The way an application makes a system call to get into the kernel, a PV kernel needs to make a hypercall to get into the hypervisor; hypercalls are needed to request any services from the hypervisor.

The first VM that comes up on boot is referred to as Dom0, and the rest of the VMs are referred to as DomUs. To keep the hypervisor thin, Xen runs the device drivers that control the peripherals on the system in Dom0. This approach makes a lot of code execute in ring 1 rather than ring 0 and thus improves the security of the system.

In PV mode, I/O between guests and their virtual devices is directed through a combination of Front End (FE) drivers that run in a guest and Back End (BE) drivers that run in a domain that is hosting the device. The communication between the front end and the back end happens via the hypervisor and hence is subject to privilege checks. The FE and BE drivers are class-specific, i.e. one set for all block devices and one set for network devices; thus the finer details associated with each specific device are completely avoided. This model of implementing the I/O performs far better than the emulated-devices approach.

Depending on the guest OS's needs, more physical memory can be passed on dynamically, and likewise, if there is a memory shortage in the system, the hypervisor can also take back memory from a guest.

Likewise, the number of virtual CPUs (vCPUs) associated with a guest can be dynamically increased if the guest needs more compute resources. Xen schedules these vCPUs onto the physical CPUs.

The management tools needed to configure and install guests also run in Dom0, thus making the hypervisor even thinner. This also improves the debuggability of the tools, as they run in user space.

Given the recent advances in CPUs, like VT-x of Intel and SVM of AMD, Xen allows unmodified guests to be run in a VM; these are referred to as HVM guests. There are a couple of major differences between how a PV guest is handled vis-a-vis how an HVM guest is handled:

Unlike a PV guest, an HVM guest expects to handle the

Extended Page Tables (EPT): With EPT, the VMM doesn't have to maintain shadow page tables. The way page tables convert virtual addresses to physical addresses, EPT converts guest-physical to host-physical addresses. This virtualizes CR3, which the guest continues to manipulate.

VT-d: This enables guests to access devices directly, so that the performance impact of emulated devices is cut out.

To get the best of both worlds [i.e. the PV approach and wanting to use the hardware advances], hybrid virtualization is picking up momentum, where we start off with a guest in HVM mode [so that there is no porting exercise] and then incrementally add PV drivers which bypass device emulation and thus reduce the virtualization overheads; in the case of OpenSolaris these PV drivers are already implemented for block and network devices.

The Xen tracing facility provides a way to record hypervisor events, like a vCPU getting on and off a CPU, a VM blocking, etc., and this data can help nail performance issues with virtual machines.

Xen allows live migration of guests to similar physical machines, which effectively brings in load balancing features.

Xen also allows suspend and resume of guests, which can be used to start services on demand. This feature, along with ZFS snapshots, can be used to configure a guest, take a snapshot of it, and move it to a different physical machine; such nodes could give a simple failover capability.

The following block diagram captures the above discussion:
devices directly by installing its own drivers – for this,
the hypervisor has to emulate different physical
devices and this emulation is done in the user land of
dom0. As can be inferred IO this way will be slower
than in PV approach.
Figure 3.
HVM guest tries to install its own page tables. Xen
uses shadow page tables which track guests 'page table To summarize, xen helps in consolidating heterogeneous
modifications'. This can slow down handling page work loads that are commonly seen in a data center.
faults in the guest.
2)Virtual Box (VB)
In both PV as well as HVM guests, the dom0 needs to be
ported to xen platform. The applications run without any
modifications inside the guest operating systems. VB is an open source virtualization solution for x86 based
systems. VB runs as a process in the host operating system and
x86 CPUs, off late, are seeing a steady stream of new supports various latest as well as legacy operating systems to
features, to help implement VMMs easier: run as guests. The power of VB lies in the fact that the guest is
completely virtualized and is being run as just a process on the
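The shadow-vs-EPT distinction above can be made concrete with a toy two-stage lookup. This is a sketch under simplifying assumptions (single-level dictionary "page tables", 4 KB pages); real EPT walks are multi-level hardware structures:

```python
# Toy model of two-stage address translation under EPT.
# Stage 1: the guest's own page table maps guest-virtual -> guest-physical
# (the guest keeps manipulating its CR3 and page tables as usual).
# Stage 2: the EPT, maintained by the VMM, maps guest-physical -> host-physical,
# replacing software-maintained shadow page tables.

guest_page_table = {0x1000: 0x5000, 0x2000: 0x6000}   # guest VA page -> guest PA page
ept              = {0x5000: 0x9000, 0x6000: 0xA000}   # guest PA page -> host PA page

def translate(guest_va):
    # Look up the 4 KB page in each stage, carrying the offset through.
    guest_pa = guest_page_table[guest_va & ~0xFFF] | (guest_va & 0xFFF)
    host_pa = ept[guest_pa & ~0xFFF] | (guest_pa & 0xFFF)
    return host_pa

print(hex(translate(0x1234)))  # -> 0x9234: two lookups, no shadow table needed
```

With shadow paging, the VMM would instead have to intercept every guest page-table update to keep a combined VA-to-host-PA table consistent; EPT removes that interception cost.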
host OS. VB follows the usual x86 virtualization technique of running guest ring-0 code in ring 1 of the host, and running guest ring-3 code natively.

Given its ease of use, VB can be used to consolidate desktop workloads – a new desktop can be configured and deployed in a few seconds' time.

VB supports cold migration of guests, where one can copy the Virtual Disk Image (VDI) to a different system along with the XML files (configuration details) and start the VM on the other system. Creating duplicate VMs is as easy as copying the VDI, removing the uuid from the image, and registering it with VB. VB emulates PCnet and Intel PRO network adapters, and supports different networking modes: NAT, HIF and internal networking.

C. Field Use Cases

The following are real-world cases where we used platform virtualization inside Sun:

• Once a supported release of Solaris is made, a separate build machine is spawned off which is used to host the sources and allows gatekeepers to create patches. Earlier, one physical system used to be set aside for each release. Now, with Xen and LDOMs, a new VM is created for each release and multiple such guests can be hosted on a single system. Patch windows for each of the releases are staggered so that multiple guests won't hit their peak load together. A complete build of Solaris takes at least 10% longer inside a guest, but this is still acceptable as it is not a time-critical job. The performance impact on the interactive workload which an engineer might see while pushing her code change is too small to be noticed.

• Engineers heavily use VB to test their changes on older releases of Solaris. Even though the performance degradation can be in the range of 5-20% depending on the workload, it is still acceptable as it is only functionality and sanity testing [after my kernel change, will the system boot?].

So even though there is an increase in the number of supported Solaris releases, there is not a corresponding increase in the number of physical systems in the data center, thus significantly saving on capital expenditure and carbon footprint.

IV. CPU MANAGEMENT

To address the second problem mentioned in the introduction, the operating system should support the various features provided by the hardware to reduce the idling power consumption of the system. For the rest of the discussion, let us look at how the OpenSolaris kernel supports various power-saving features of both the x86 and Sparc platforms.

A. Advanced Configuration and Power Interface (ACPI) Processor C-states

The ACPI C0 state refers to the normal functioning of the CPU, when it draws the rated voltage. The CPU enters the C1 state while running the idle thread, by issuing the halt instruction; in this state, other than the APIC and the bus, the rest of the units do not consume power. The CPU runs the idle thread when there are no threads in the runnable state.

The ACPI processor C3 state and beyond are referred to as Deep C-states, as their wake-up latency is higher than that of the earlier states. In the C3 state, even the APIC and bus units are stopped, and the caches lose state. OS support is needed for the C3 state because of this state loss, and OpenSolaris incorporates this support.

B. Power Aware Dispatcher (PAD)

This feature extends the existing topology-aware scheduling facility to bring 'power domain' awareness to the dispatcher. With this awareness in place, the dispatcher can implement a coalescence dispatching policy to consolidate utilization onto a smaller subset of CPU domains, freeing up other domains to be power managed. In addition to being domain aware, the dispatcher will also tend to prefer domains already running at lower C-states – this will increase the duration and extent to which domains can remain quiescent, improving the kernel's power-saving ability.

Because the dispatcher tracks power domain utilization along the way, it can drive active domain state changes in an event-driven fashion, eliminating the need for the power management subsystem to poll.

These current, conservative policies yield a 3.5% improvement in SPECpower on Nehalem and, more importantly, a 22.2% saving in idle power.

C. CPU Frequency Scaling

It is possible to reduce the power consumption of a system by running at a reduced clock frequency when it is observed that CPU utilization is low.

D. Memory Power Management

Like CPUs, memory can also be put into a power-saving state when the system is idle. OpenSolaris enables this on chipsets which support the feature.

E. Suspend to RAM

It is common for desktops and laptops to have extended periods of no activity, as the user could be away at lunch. To save power in such cases, OpenSolaris supports what is referred to as ACPI S3 – whereby the whole system is suspended to RAM and power is cut off to the CPU.

F. CPU Hotplug

This is supported on quite a few Sparc platforms and is achieved by 'Dynamic Reconfiguration (DR)' support in the kernel. A DRed-out board containing CPUs is effectively powered off and can be pulled out; but it can also be left in place, so that, depending on workload changes, the board can be DRed back in.

V. OBSERVABILITY TOOLS

There should be a mechanism by which an administrator can see how well a system is taking advantage of the power management features discussed above. For this, OpenSolaris provides a couple of tools:

A. Power TOP

powertop(1M) reports the activity that is making the CPU move to the lower C-states and thus increase power consumption. Addressing those causes will make a CPU stay longer in the higher C-states.

B. Kstat

kstat(1M) reports at what clock frequency the CPU is currently operating and what the supported frequencies are. The lower the frequency, the better from a power consumption point of view.
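The relationship these tools expose, that more residency in the shallow C-states means higher average draw, can be illustrated with a small model. The per-state wattages below are invented for illustration; they are not measured OpenSolaris or Nehalem figures:

```python
# Estimate average CPU power from C-state residency, as powertop-style data
# would let you do. Power numbers are illustrative assumptions, not measured.
POWER_W = {'C0': 25.0, 'C1': 10.0, 'C3': 2.0}  # hypothetical draw per state

def avg_power(residency):
    """residency: fraction of time spent in each C-state (fractions sum to 1)."""
    assert abs(sum(residency.values()) - 1.0) < 1e-9
    return sum(POWER_W[state] * frac for state, frac in residency.items())

chatty = avg_power({'C0': 0.30, 'C1': 0.50, 'C3': 0.20})  # frequent wakeups
quiet  = avg_power({'C0': 0.05, 'C1': 0.15, 'C3': 0.80})  # wakeup causes addressed
print(chatty, quiet)  # the quieter profile draws far less on average
```

Fixing the wakeup causes powertop reports shifts residency from the first profile toward the second, which is exactly the saving the deep C-state support is after.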
ADCOM 2009
HUMAN COMPUTER
INTERFACE -1
Session Papers:
1. Shankkar B, Roy Paily and Tarun Kumar , “Low Power Biometric Capacitive CMOS Fingerprint
Sensor System”
2. Raghavendra R, Bernadette Dorizzi, Ashok Rao and Hemantha Kumar, “Particle Swarm
Optimization for Feature Selection: An Application to Fusion of Palmprint and Face”
Low Power Biometric Capacitive CMOS
Fingerprint Sensor System
B. Shankkar, Tarun Kumar and Roy Paily
Abstract—A charge-sharing based sensor for obtaining fingerprints has been designed. The design uses sub-threshold operation of MOSFETs to achieve a low power sensor device working at 0.5 V. The interfacing circuitry and a fingerprint matching algorithm were also designed to support the sensor and complete a fingerprint verification system.

Index Terms—Fingerprint, sensor, charge sharing, sub-threshold, low power

I. INTRODUCTION

Biometrics is the automated method of recognizing a person based on a physiological or behavioral characteristic. Biometric recognition can be used in identification mode, where the biometric system identifies a person from the entire enrolled population by searching a database for a match based solely on the biometric. A biometric system can also be used in verification mode, where the biometric system authenticates a person's claimed identity from their previously enrolled pattern. Various biometrics used for such a purpose are signature verification, face recognition, fingerprints and iris recognition. Fingerprinting is one of the oldest ways of technically establishing a person's identity. Fingerprints have a distinct ridge and valley pattern on the tip of a finger for every individual. Every fingerprint can be divided into two structures: ridges and valleys.

As the world becomes more accustomed to devices reducing in size with each passing day, and dependence on mobile devices increases at an enormous rate, a fingerprint authentication and identification system that can be mounted on such a device is imperative. Any fingerprint authentication system requires a sensor and corresponding circuitry, an interface, and a fingerprint matching algorithm. This paper presents the results achieved in simulation for all these modules.

II. SENSOR AND CORRESPONDING CIRCUITRY

Figure 1 shows the basic principle of the charge-sharing scheme [1]. The finger is modeled as the upper metal plate of a capacitor, with its lower metal plate in the cell. These electrodes are separated by the passivation layer of the silicon chip, formed by the metal oxide layer on the chip. The series combination of the two capacitances is called Cs. The basic principle behind the working of capacitive fingerprint sensors is that Cs changes according to whether the part of the finger on that pixel is a ridge or a valley. If the part of the finger is a ridge then Cs is higher than when it encounters a valley, in which case the series combination of the two capacitances falls low due to the modeling of the capacitor between the metal plate and finger as a capacitor with an air medium in between. Cp1 and Cp2 are the internal parasitic capacitances of the nodes N1 and N2. In the pre-charge phase, the switches S1 and S3 are on and S2 is off. The capacitors Cp1 and Cp2 get charged up. During the evaluation phase, S2 is turned on. The voltage stored during the pre-charge phase is now divided between Cs, Cp1 and Cp2. The output voltage at N1 is easily seen to be the following expression:

V0 = VN1 = VN2 = (Cp1·V1 + Cp2·V2 + Cs·V1) / (Cp1 + Cp2 + Cs)    (1)

Fig. 1. Basic Charge Sharing Scheme [1]

As given in Figure 2, Cs differs for ridge and valley, thus the output voltage also differs according to the above expression. This difference in voltage, when passed to a comparator with an appropriate reference voltage, gives a binary output. The binary values from all the pixels in the chip then constitute the required fingerprint image. In the pre-charge phase (pch = 0), it can be seen that N1 and N3 are kept at Vdd by the PMOS transistors. During this phase, the capacitors Cp2 and Cp3 are shorted, with the voltages at both ends at ground and Vdd respectively. The capacitors Cp1 and Cs begin to charge up. They store charges of Cp1·Vdd and Cs·(Vdd − Vfin) respectively. This is the charge accumulation phase.

At the beginning of the evaluation phase (pch = 1), both the input and output voltages are equal to Vdd. Even when the voltage at N1 starts decreasing due to charge sharing between the capacitors, the unity-gain buffer ensures that the voltage at N3 is equal to the voltage at N1, thus effectively shorting the capacitor Cp3 and removing its effect. Meanwhile,
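Equation (1) can be evaluated numerically to see how the ridge/valley difference in Cs separates the two output levels. The capacitance and voltage values below are illustrative assumptions, not the paper's device parameters:

```python
# Charge-sharing output voltage, Eq. (1):
#   V0 = (Cp1*V1 + Cp2*V2 + Cs*V1) / (Cp1 + Cp2 + Cs)
def v_out(cs, cp1, cp2, v1, v2):
    return (cp1 * v1 + cp2 * v2 + cs * v1) / (cp1 + cp2 + cs)

CP1, CP2 = 50e-15, 50e-15          # assumed node parasitics (farads)
V1, V2 = 0.5, 0.0                  # assumed pre-charge voltages (V), Vdd = 0.5 V
ridge = v_out(120e-15, CP1, CP2, V1, V2)   # ridge: larger Cs
valley = v_out(30e-15, CP1, CP2, V1, V2)   # valley: smaller Cs
print(round(ridge, 3), round(valley, 3))   # -> 0.386 0.308
```

Placing the comparator's reference voltage between the two computed levels binarizes each pixel, which is how the per-pixel ridge/valley decision in the text is made.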
Fig. 2. The Sensor Circuitry

III. INTERFACE
• The supply voltage was first fixed at 0.5 V.

• The inverter was designed to switch from logic 0 to logic 1 at 0.25 V. The widths of the transistors were adjusted to get the desired characteristics. The circuit is shown in Figure 8 and the output characteristics are given in Figure 9. Results are presented in Figure 10.

The improvement resulted in a power requirement of 736 nW per pixel position. The improvements introduced in the MOSFET designs also improved the resolution further, and we obtained a resolution of around 350 mV for a power supply of 0.5 V.
Fig. 10. Results obtained with the low power Resolution Improvement Circuit

minutiae of the recently processed fingerprint. The minutiae of the fingerprint records were saved as separate templates. For every clock cycle, one of the templates was accessed. The minutiae set of the current candidate fingerprint was compared element by element to that of the accessed template. In case of 15 or more matches in minutiae locations, a match is announced and the loop is exited.

One sample of the actual 100 pixel × 50 pixel image that was input to the matching algorithm, and the corresponding thinned image with minutiae positions, is shown in Figure 11.
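The matching loop described above can be sketched as follows. The data structures (lists of (row, col) minutiae positions) and the exact-coincidence test are assumptions for illustration; the 15-match threshold comes from the text:

```python
# Minutiae matching as described: compare the candidate's minutiae positions
# element by element against each stored template; 15 or more coinciding
# locations declares a match and exits the loop.
MATCH_THRESHOLD = 15

def matches(candidate, template):
    # Count minutiae positions common to both sets.
    return len(set(candidate) & set(template))

def verify(candidate, templates):
    for tid, template in enumerate(templates):  # one template per "clock cycle"
        if matches(candidate, template) >= MATCH_THRESHOLD:
            return tid          # match announced, loop exited
    return None                 # no template matched

cand = [(i, 2 * i) for i in range(20)]
db = [[(i, 2 * i + 1) for i in range(20)],  # shifted grid: 0 coincident minutiae
      [(i, 2 * i) for i in range(16)]]      # 16 coincident minutiae
print(verify(cand, db))  # -> 1
```

A real matcher would tolerate small translation/rotation between prints; the exact-position test here only mirrors the element-by-element comparison the text describes.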
PARTICLE SWARM OPTIMIZATION FOR FEATURE SELECTION: AN APPLICATION TO FUSION OF PALMPRINT AND FACE

be performed at 4 different levels: Sensor level, Feature level,

[Figure 1: proposed block diagram — palmprint and face Log Gabor feature extraction, PSO-based feature selection, and accept/reject decision.]
Figure 1 shows the proposed block diagram of feature level fusion of palmprint and face in Log Gabor space. As observed from Figure 1, we first extract the texture features of face and palmprint separately using the Log Gabor transform. We use the Log Gabor transform as it is suitable for analyzing gradually changing data such as face, iris and palmprint [3], and it is also mentioned in [9] that the Log Gabor transform can reflect the frequency response of images more realistically than the usual Gabor transform. On the linear frequency scale, the transfer function of the Log Gabor transform has the form [9]:

G(ω) = exp( − log(ω/ωo)² / (2 × log(k/ωo)²) )    (1)

where ωo is the filter center frequency. To obtain a constant-shape filter, the ratio k/ωo must be held constant for varying ωo.

The Log Gabor transform used in our experiments has 4 different scales and 8 orientations. We fixed these values based on the results of different trials and also in conformity with the literature [4][3]. Thus, each image (of palmprint and face) is analyzed using 8 × 4 different Log Gabor filters, resulting in 32 different filtered images of resolution 60 × 60. To reduce the computation cost, we down-sample the image by a ratio equal to 6. Thus, the final size is reduced to 40 × 80. A similar analysis is also carried out for the palmprint modality. By concatenating the column vectors associated with each image we obtain the fused feature vector of size 6400 × 1. As the imaging conditions of face and palmprint are different, a feature vector normalization is carried out as mentioned in [3]. In order to reduce the size of each vector, we propose to perform feature selection through PSO, as explained in Section 2.1 and illustrated in Figure 2, where 'K' indicates the dimension of the feature space after concatenation and 'S' indicates the reduced dimension obtained by PSO. Then, we use KDDA to project the selected features onto the kernel discriminant space. We employ KDDA here because of its good performance as well as its high dimension-reduction ability. Finally, the accept/reject decision is carried out using NNC.

2.1. Particle Swarm Optimization (PSO)

PSO is a stochastic, population-based optimization technique aiming at finding a solution to an optimization problem in a search space. The PSO algorithm was first described by J. Kennedy and R.C. Eberhart in 1995 [8]. The main idea of PSO is to simulate the social behavior of birds flocking to describe an evolving system. Each candidate solution is therefore modeled by an individual bird, that is, a particle in a search space. Each particle adjusts its flight by making use of its individual memory and of the knowledge gained from its neighbors to find the best solution.

2.2. Principle of PSO

The main objective of PSO is to optimize a given function called the fitness function. PSO is initialized with a population of particles distributed randomly over the search space, which are evaluated to compute the fitness function together. Each particle is treated as a point in the N-dimensional space. The ith particle is represented as Xi = {x1, x2, ..., xN}. At every iteration, each particle is updated by two best values called pbest and gbest. pbest is the best position associated with the best fitness value of particle i obtained so far and is represented as pbesti = {pbesti1, pbesti2, ..., pbestiN} with fitness function f(pbesti). gbest is the best position among all the particles in the swarm. The rate of position change (velocity) for particle i is represented as Vi = {vi1, vi2, ..., viN}. The particle velocities are updated according to the following equations [8]:

Vid_new = w × Vid_old + C1 × rand1() × (pbestid − xid) + C2 × rand2() × (gbestd − xid)    (2)

xid = xid + Vid_new    (3)

where d = 1, 2, ..., N and w is the inertia weight. A suitable selection of inertia weights provides a balance between global and local exploration, and results in fewer iterations on average to find near-optimal results. C1 and C2 are the acceleration constants used to pull each particle towards pbest and gbest. Low values of C1 and C2 allow the particle to roam far from the target regions, while high values result in abrupt movements towards or past the target regions. rand1() and rand2() are random numbers in (0,1).

2.3. Binary PSO

The original PSO was introduced for continuous populations but was later extended by J. Kennedy and R.C. Eberhart [6] to discrete-valued populations. In binary PSO, the particles are represented by binary values (0 or 1). Each particle's position is updated according to the following equations:

S(Vid_new) = 1 / (1 + e^(−Vid_new))    (4)

if (rand < S(Vid_new)) then xid = 1; else xid = 0;    (5)

where Vid_new denotes the particle velocity obtained from Equation 2, the function S(Vid_new) is a sigmoid transformation, and rand is a random number drawn from the uniform distribution (0,1). If S(Vid_new) is larger than the random number then the position value is set to 1, else it is set to 0. Binary PSO is well adapted to the feature selection context [10][7]. In order to apply the idea of binary PSO to feature selection of face and palmprint features, we need to adapt the general binary PSO concept to this precise application. This will be the objective of the following subsections.
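Equations (2), (4) and (5) combine into a single binary-PSO update step. The sketch below is an illustrative plain-Python rendering; the parameter values for w, C1, C2 and the Vmax clamp are typical choices, not the paper's tuned values:

```python
import math
import random

def binary_pso_step(x, v, pbest, gbest, w=0.8, c1=2.0, c2=2.0, vmax=4.0):
    """One update of a particle's bit vector x and velocity v.
    Eq. (2): velocity update; Eq. (4): sigmoid; Eq. (5): bit sampling."""
    new_x, new_v = [], []
    for d in range(len(x)):
        vd = (w * v[d]
              + c1 * random.random() * (pbest[d] - x[d])
              + c2 * random.random() * (gbest[d] - x[d]))          # Eq. (2)
        vd = max(-vmax, min(vmax, vd))   # clamp velocity to [-Vmax, Vmax]
        s = 1.0 / (1.0 + math.exp(-vd))                            # Eq. (4)
        new_x.append(1 if random.random() < s else 0)              # Eq. (5)
        new_v.append(vd)
    return new_x, new_v

random.seed(0)
x, v = [0, 1, 0, 1], [0.0] * 4
pbest, gbest = [1, 1, 0, 0], [1, 0, 0, 1]
print(binary_pso_step(x, v, pbest, gbest)[0])  # new bit mask (stochastic)
```

In the feature-selection setting, each bit of x marks whether the corresponding fused-feature dimension is kept, and the fitness function would be the recognition performance of the selected subset.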
[Figure 2: PSO feature selection in the concatenated domain — face Log Gabor features (LGF1 ... LGFk) and palmprint Log Gabor features (LGP1 ... LGPk) are concatenated into a K-dimensional vector; a binary selection mask (e.g. 0 1 1 ... 1) reduces it to dimension S.]
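The selection step Figure 2 depicts is mechanically simple: the K-bit particle position keeps or discards each dimension of the fused vector, yielding the S-dimensional vector passed on to KDDA. A minimal sketch with toy dimensions:

```python
# Feature selection as a binary mask over the concatenated feature vector:
# each bit of the particle position keeps (1) or discards (0) one dimension,
# yielding the reduced S-dimensional vector fed to the projection stage.
def select_features(fused_vector, mask):
    assert len(fused_vector) == len(mask)
    return [f for f, keep in zip(fused_vector, mask) if keep]

fused = [0.12, -0.50, 0.33, 0.08, -0.21, 0.44]   # toy K = 6 fused vector
mask = [1, 0, 1, 1, 0, 1]                        # particle position, S = 4
print(select_features(fused, mask))  # -> [0.12, 0.33, 0.08, 0.44]
```

In the paper's setting K = 6400 (face and palmprint features concatenated) and S is whatever dimension the converged PSO mask retains.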
2.3.5. Population Size

In our present work, we experimentally varied the size of the population from 10 to 30 in steps of 5 and finally fixed the population size at 20, which was found to be an optimal value.

3. EXPERIMENTAL SETUP

This section describes the experimental setup that we have built in order to evaluate the proposed feature level fusion schemes. Because of the lack of a real multimodal database of face and palmprint data, experiments are carried out on a database of virtual persons, using face and palmprint data coming from two different databases. This procedure is valid since, for one person, face and palmprint can be considered as two independent modalities [1].

For the face modality we chose the FRGC face database [11], as it is a big database widely used for benchmarking. From this database, we chose 250 different users from 2 different sessions. The first session consists of 6 samples for each user taken from data collected during Fall 2003, and the second session consists of 6 samples for each user taken from data collected during Spring 2004. Of these 6 samples, the first 4 are taken under controlled conditions and the remaining 2 under uncontrolled conditions. For the palmprint modality, we selected a subset of 250 different palmprints from the PolyU database [12]; each of these users possesses 12 samples, such that 6 samples are taken from the first session and the next 6 from the second session. The average time interval between the first and second sessions is two months. In building our multimodal biometric database of face and palmprint, each virtual person is associated with 12 samples of face and palmprint produced randomly from the face and palmprint samples of 2 persons in the respective databases. Thus, the built virtual multimodal biometric database consists of 250 users such that each user has 12 samples.

3.1. Experimental Protocol

This section describes in detail the experimental protocol employed in our work. For learning the projection spaces, we use a subset of 100 users called LDB, such that each user has 6 samples (selected randomly out of 12). To validate the performance of all the algorithms, we divide the whole database of 250 users into two independent sets called Set-I and Set-II. Set-I consists of 200 users and Set-II consists of 50 users. Set-II is used as the validation set to fix the parameters of PSO (like Vmax, C1, C2, population size), of match score fusion, and also those of AdaBoost. Set-I is divided into two equal partitions providing 6 reference samples and 6 testing samples for each of the 200 persons. The reference and testing partitioning was repeated 'm' times (where m = 10) using Holdout cross-validation, and there is no overlap between these two subsets. Thus, in each of the 10 trials we have 1200 (= 200 × 6) reference samples and 1200 (= 200 × 6) testing samples, and hence we have 1500 genuine matching scores and 238800 (= 200 × 199 × 6) impostor matching scores, as for each user all other users are considered as impostors. In closed identification, we calculate the recognition rate using the 1200 reference samples and the 1200 testing samples, which gives 1200 × 1200 matching scores. Note that the persons are exactly the same in the reference and test sets; this is why we speak of closed identification. Finally, results are presented by taking the mean of all 10 trials, and we also present the statistical variation of the results with a 90% parametric confidence interval [13], which gives a better estimation of the deviation than the one obtainable through cross-validation alone.

4. RESULTS AND DISCUSSION

Fig. 3. ROC curves of the different verification systems

This section discusses the results of the proposed feature fusion and selection scheme in terms of performance and number of features selected. The proposed method is compared with three different feature selection schemes, namely AdaBoost (feature fusion-AdaBoost) [14], Genetic Algorithm (feature fusion-GA) [15] and Sequential Floating Forward Selection (feature fusion-SFFS) [16], in terms of number of features selected. Further, we present a comparative analysis of the feature level fusion and selection schemes against the feature fusion scheme using the complete set of features (feature fusion-LG). We also present a comparative analysis of feature level fusion against match score level fusion in terms of performance.

Figure 3 shows the ROC curves of the individual biometrics, feature fusion using the complete set of features (feature fusion-LG), the feature fusion and selection schemes, and match score level fusion. To perform the match score level fusion, we first obtain the match scores of face and palmprint independently using the combination of Log Gabor transform and KDDA. Note that this architecture corresponds to state-of-the-art systems in both face and palmprint. We therefore perform a
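The protocol's score bookkeeping can be checked with a quick computation (user and sample counts straight from the text):

```python
# Matching-score counts for one cross-validation trial of the protocol:
# 200 Set-I users, 6 reference and 6 test samples each.
users, refs_per_user, tests_per_user = 200, 6, 6

reference_samples = users * refs_per_user                # 1200
testing_samples = users * tests_per_user                 # 1200
# Verification: every test sample is also scored against all other users,
# each of whom acts as an impostor.
impostor_scores = users * (users - 1) * tests_per_user   # 200 * 199 * 6
# Closed-set identification: all-vs-all between reference and test sets.
closed_set_scores = reference_samples * testing_samples  # 1200 * 1200

print(reference_samples, impostor_scores, closed_set_scores)
# -> 1200 238800 1440000
```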
dicates the final dimension of the KDDA projection space.

Table 1. Comparative performance of the different feature selection schemes (Mean GAR at FAR = 0.01%) (Verification)

Methods | GAR at 0.01% FAR (%), with 90% confidence interval
Face Alone | 65.32 [63.06; 67.37]
Palmprint Alone | 74.62 [72.55; 76.08]
Match Score Fusion | 86.50 [84.89; 88.11]
Feature Fusion-LG | 92.51 [91.65; 93.75]
Feature Fusion-SFFS | 92.55 [91.32; 93.77]
Feature Fusion-AdaBoost | 92.88 [91.76; 94.00]
Feature Fusion-GA | 92.75 [91.52; 93.97]
Feature Fusion-PSO | 94.72 [93.85; 95.59]

From Table 2 we can observe that the PSO based feature selection scheme uses a smaller number of features than the three other feature selection schemes. Indeed, the proposed Feature Fusion-PSO scheme reduces the fused feature space by roughly 45%, while SFFS, AdaBoost and GA reduce the fused feature space dimension by around 17%, 36% and 39% respectively. These figures clearly indicate the efficacy of the proposed Feature Fusion-PSO. Further, it is also observed from our experiments that fusion at the feature level allows an improvement of 5% over match score level fusion.

5. CONCLUSION
Pattern Recognition, vol. 40, no. 3, pp. 3209–3224, 2007.

[5] Y. Yan and Y.J. Zhang, "Multimodal biometrics fusion using Correlation Filter Bank," in Proceedings of the International Conference on Pattern Recognition (ICPR 2008), 2008, pp. 1–4.

[6] J. Kennedy and R.C. Eberhart, "A discrete binary version of the particle swarm algorithm," in IEEE International Conference on Systems, Man and Cybernetics, 1997, pp. 4104–4108.

[7] X. Wang, J. Yang, X. Teng, W. Xia, and B. Jensen, "Feature selection based on rough sets and particle swarm optimization," Pattern Recognition Letters, vol. 28, pp. 459–471, 2007.

[8] J. Kennedy and R.C. Eberhart, "Particle swarm optimization," in IEEE International Conference on Neural Networks, 1995, pp. 1942–1948.

[9] X. Zhitao, G. Chengming, Y. Ming, and L. Qiang, "Research on log Gabor wavelet and its application in image edge detection," in Proceedings of the 6th International Conference on Signal Processing, 2002, pp. 592–595.

[10] M. Najjarzadeh and A. Ayatollahi, "A comparison between Genetic Algorithm and PSO for linear phase FIR digital filter design," in IEEE International Conference on Signal Processing (ICSP), 2008, pp. 2134–2137.

[11] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, "Overview of the face recognition grand challenge," in Proceedings of CVPR 2005, 2005, pp. 947–954.

[12] "PolyU Palmprint Database," www.comp.polyu.edu.hk/~biometrics/.

[13] R. M. Bolle, N. K. Ratha, and S. Pankanti, "An Evaluation of Error Confidence Interval Estimation Methods," in Proceedings of ICPR 2004, 2004, pp. 103–106.

[14] S. Shan, P. Yang, X. Chen, and W. Gao, "AdaBoost Gabor Fisher classifier for face recognition," in Proceedings of AFGR, 2005, pp. 278–291.

[15] G. Bebis, S. Uthiram, and M. Georgiopoulos, "Face detection and verification using genetic search," International Journal on Artificial Intelligence Tools, vol. 6, no. 2, pp. 225–246, 2000.

[16] F. Ferri, P. Pudil, M. Hatef, and J. Kittler, "Comparative study of techniques for large scale feature selection," Pattern Recognition in Practice IV, Elsevier Science, pp. 403–413, 1994.
ADCOM 2009
GRID SERVICES
Session Papers:
1. Srikumar Venugopal, James Broberg and Rajkumar Buyya, "OpenPEX: An Open Provisioning and EXecution System for Virtual Machines"
2. Saurabh Kumar Garg, "Exploiting Grid Heterogeneity for Energy Gain"
3. Snehal Gaikwad, Aashish Jog and Mihir Kedia, "Intelligent Data Analytics Console"
OpenPEX: An Open Provisioning and EXecution System for Virtual Machines
Srikumar Venugopal
School of Computer Science and Engineering,
University of New South Wales, Australia
Email:srikumarv@cse.unsw.edu.au
James Broberg and Rajkumar Buyya
Department of Computer Science and Software Engineering,
The University of Melbourne, Australia
Email:{brobergj, raj}@csse.unimelb.edu.au
Web Service or Web Portal. The resources themselves (such as compute and storage) are highly abstracted or virtualised.

However, IaaS service models are evolving and currently, most providers operate on a lease model – the users pay for the time the VM was active. Also, the choices available to the user in terms of specifying requirements are limited. Such models do not allow for more flexible strategies where the user can reduce both cost and risk by booking resources in advance. An advance reservation provides a guaranteed allocation of the resources at the needed time to the consumer, and helps the provider plan capacity requirements better. However, advance reservations induce new challenges in resource management and require new architectures for realisation. In this paper, we introduce OpenPEX, a utility-based virtual infrastructure manager that enables users to reserve VM instances in advance. OpenPEX also offers a bilateral negotiation protocol that allows users and providers to exchange offers and counter-offers, and come to an agreement that is mutually beneficial. In the next section, we distinguish the contributions of OpenPEX from the state-of-the-art. Section 3 discusses the design and implementation of OpenPEX at length. Section 4 discusses the Web Service interface to OpenPEX and, finally, we conclude the paper with details on our future plans for the system.

2 Related Work

Virtual Machine technology has become an essential enabling technology of Cloud Computing environments. Cloud Computing is a style of computing where resources can be obtained in a pay-per-use manner (no commitment, utility pricing). Such resources have elastic capacity, where they can be scaled up and down on demand. Resources are highly abstracted and virtualised, and can be obtained via a self-service interface.

VMs are highly attractive for managing resources in such environments as they improve utilisation by multiplexing many VMs on one physical host (consolidation), allow agile deployment and management of services, and provide on-demand cloning, (live) migration and checkpointing, which improves reliability. Furthermore, a VM can be a self-contained unit of execution and migration. As such, effective management of VMs and Virtual Machine infrastructure is critical for any Cloud Computing Infrastructure as a Service (IaaS) provider.

2.1 Public IaaS Cloud Services

Amazon Elastic Compute Cloud (EC2) is an IaaS service that provides resizable compute capacity in the cloud. These services can be leveraged via Web Services (SOAP or REST), a web-based AWS Management Console or the EC2 Command Line Tools. The Amazon service provides hundreds of pre-made AMIs (Amazon Machine Images), giving users a wide choice of operating systems (i.e. Windows or Linux) and pre-loaded software. Instances come in different sizes, from Standard Instances (S, L, XL), which have proportionally more RAM than CPU, to High CPU Instances (M, XL), which have proportionally more CPU than RAM. A user can deploy these instances in two different regions, US-East and EU-West, with EU instances costing more per hour than their US counterparts.

Amazon EC2 provides an alternative to its on-demand instances, known as a reserved instance. This facility offers a number of benefits over simply requesting instances on demand, as it provides a lower per-hour rate, and provides assurances that any reserved instance you launch is guaranteed to succeed (provided you have booked them in advance). That is, users of such instances should not be affected by any transient limitations in EC2 capacity.

2.2 Private IaaS Cloud Platforms

Many different platforms exist to assist with the deployment and management of virtual machines on a virtualised cluster (i.e. a cluster running Virtual Machine software). Such platforms are often referred to as ‘Private Clouds’, as they can bring the benefits of Cloud Computing (such as elasticity, dynamic provisioning and multiplexing workloads onto fewer machines) into local clusters.

Eucalyptus [9, 8] is an open-source (BSD-licensed) software infrastructure for implementing an Infrastructure as a Service (IaaS) Compute Cloud on commodity hardware. Eucalyptus is notable for offering a Web Service interface that is fully Amazon Web Services (AWS) API compliant. Specifically, it emulates Amazon’s Elastic Compute Cloud (EC2), Simple Storage Service (S3) and Elastic Block Store (EBS) services at the API level. However, as the implementation details of Amazon’s services are not published, Eucalyptus’ internal implementation would differ.

OpenNebula [11] is open-source Virtual Infrastructure management software that supports dynamic resizing, partitioning and scaling of computing resources. OpenNebula can be deployed in private, public or hybrid Cloud models. The OpenNebula software turns an existing cluster into a private cloud, which can be used privately or can expose its services to the public via XML-RPC Web Services. The integration of Cloud plugins (EC2, GoGrid) enables a hybrid model, where users can mix and match private and public resources. Haizea [11] has extended OpenNebula further, allowing resource providers to lease their resources using sophisticated leasing arrangements, instead of only providing on-demand VMs like most other IaaS services.
…utility/value of their jobs. Further information and a comparison of these utility computing environments are available in an extensive survey of these platforms [3].

[Figure: OpenPEX architecture. A PEX Portal and Web Service front the PEX Resource Manager (Reservation Manager, Allocator and Event Queue), which drives a VM Monitor, Node Monitor and Xen Dispatcher; a Xen Pool Manager controls the physical cluster nodes (leon, pris, deckard, roy, zhora).]

2.3 Issues with existing solutions

With the exception of OpenNebula (when used in conjunction with the Haizea extension) and Amazon EC2 (when used with Reserved Instances), none of the above public or private platforms offer the ability to perform an Advanced Reservation of computing resources; rather, they only supply on-demand capacity on a best-effort basis (i.e. if adequate resources are available). Furthermore, none of these platforms provide an alternate offer (that is, a modified offering that can satisfy a user’s request but may differ from their initial request) in the event that the system cannot satisfy a user’s specific request for resources.

Whilst Haizea supports a form of Advanced Reservation (which it denotes as advanced reservation leases), if the request cannot be satisfied there is no recourse – the request will be rejected. Under the same circumstances, the OpenPEX system enacts a bilateral negotiation protocol that allows users and providers to come to an agreement by exchanging offers and counter-offers, so a user’s advanced reservation request can be satisfied.

Amazon EC2 offers its own variation on the notion of Advanced Reservation with its Reserved Instances product. However, you need to purchase a Reserved Instance for every instance you wish to guarantee to be available at some point in the future. This essentially requires the end user to forecast exactly how many they will require in advance. Acquisition of a Reserved Instance is not instantaneous either; in the authors’ experience, a request for a Reserved Instance has taken more than an hour on previous occasions.
These requirements motivate a resource management […] users through the portal or the web service […]

[Figure: sequence diagram of the OpenPEX reservation negotiation between the User/Portal/Web Service, the PEX Resource Manager and the cluster node; the Resource Manager replies with ACCEPT, REJECT or COUNTER, with an [if REJECT] branch ending the exchange.]

[Figure 3: OpenPEX data model. A User has Reservations (1 : 0..M); a Reservation activates Instances (1 : 0..M) and maps to OpenPEX Nodes (1 : M); each Instance maps to one Node.]
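The counter-offer exchange in the sequence diagram can be sketched as a simple client-side loop. This is an illustrative sketch only; the `negotiate` function, `submit` callback, proposal dictionary and round limit are assumptions, not the OpenPEX API:

```python
# Illustrative sketch of the client side of an alternate-offers exchange:
# submit a proposal, and on COUNTER decide whether the provider's
# alternative is acceptable. All names here are assumptions.

def negotiate(submit, acceptable, proposal, max_rounds=5):
    """submit(proposal) -> (reply, counter_proposal_or_None)."""
    for _ in range(max_rounds):
        reply, counter = submit(proposal)
        if reply == "ACCEPT":
            return proposal          # provider agreed to our terms
        if reply == "REJECT" or counter is None:
            return None              # no agreement possible
        if acceptable(counter):
            return counter           # take the provider's counter-offer
        proposal = dict(counter)     # adjust the counter-offer and retry
        proposal["startTime"] = proposal["startTime"] + 3600

    return None

# Example: a provider that counters once with a later start time.
def fake_submit(p):
    if p["startTime"] >= 100:
        return "ACCEPT", None
    return "COUNTER", {"startTime": 100, "type": p["type"]}

agreed = negotiate(fake_submit, lambda c: c["startTime"] <= 50,
                   {"startTime": 0, "type": "XLARGE"})
```

Under this toy provider, the client rejects the counter-offer (start time 100 is too late for its `acceptable` predicate), pushes the start time back itself, and the second proposal is accepted.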
3.3 Web Portal Interface

OpenPEX is developed completely in Java and is deployed as a service in an application container on the cluster head node (or the storage node). It communicates with the pool manager using the Xen API and uses the Java Persistence API (JPA) for the persistence back-end. The database structure for PEX is depicted in Figure 3 and follows Object-Relational Mapping (ORM) for easy development and extensibility.

The OpenPEX system provides an easy-to-use Web Portal interface, enabling the user to access all the functionality of OpenPEX. Users can access the system via a web browser, register for an account and log in to the system. Upon logging in they will be greeted by a simple Welcome Screen, depicted in Figure 4, which shows what functions are available to the end user.

The user can choose to make a new reservation, where they can choose the size of the reservation they wish to make (from the choices listed in Table 1), the start and end time, the template (i.e. Operating System) they wish to use and the number of instances they require. Their request can be accepted, or they can enter into a negotiation until they come to an agreement with the OpenPEX cluster.

Once this process has occurred, they can view their existing reservations and activate any unclaimed reservations via the Reservations screen. Figure 5 shows the Reservations screen with three reservations. If a reservation has not yet been activated, a user can choose to delete it (if they no longer require it) or activate it, so the associated instances can start at the appropriate start time.

Figure 5. OpenPEX Reservations Screen.

Virtual Machine instances can be viewed and manipulated via the Instances screen depicted in Figure 6. Here the user can view salient information regarding their VM instance, such as its machine name, status (e.g. HALTED, RUNNING, PAUSED, SUSPENDED), start time, end time, and IP address. An instance can also be stopped early (i.e. before its designated end time) if desired.

4 RESTful Web Service Interface

It is essential to provide programmatic access to the functions and capabilities of an OpenPEX cluster, in order for users to be able to dynamically request reservations from the system (i.e. scaling out during periods of peak load), or even to integrate an OpenPEX system into a wider pool of computing resources. As such, the full functionality of the OpenPEX system is exposed via Web Services, which are implemented in a RESTful (REpresentational State Transfer) style [6].

The REST-style architecture provides a clear and clean delineation between the functions of the client and the
Table 2. OpenPEX RESTful Endpoints.

OpenPEX Operation | HTTP Endpoint | Parameters | Return type
Create reservation | POST /OpenPEX/reservations | JSON (Fig. 7) | JSON (Fig. 8, 9)
Update reservation | PUT /OpenPEX/reservations/requestId | JSON | JSON
Delete reservation | DELETE /OpenPEX/reservations/requestId | None | HTTP 200 (OK)
Activate reservation | PUT /OpenPEX/reservations/requestId/activate | None | HTTP 200 (OK)
Get reservation information | GET /OpenPEX/reservations/requestId | None | JSON
List reservations | GET /OpenPEX/reservations | None | JSON
Get instance information | GET /OpenPEX/instances/vm_id | None | JSON
List instances | GET /OpenPEX/instances | None | JSON
Stop instance | PUT /OpenPEX/instances/vm_id/stop | None | HTTP 200 (OK)
Reboot instance | PUT /OpenPEX/instances/vm_id/reboot | None | HTTP 200 (OK)
Delete instance | DELETE /OpenPEX/instances/vm_id | None | HTTP 200 (OK)

server. A client performs operations on resources (such as reservations and instances) which are identified through standard URIs. The server returns a JSON¹ representation of the resource back to the client to indicate the current state of that resource. Clients can modify and delete these resources by altering and returning their representations as required. A client could be a Java or Python program, or a Web Portal management interface for the OpenPEX system.

    {
      "duration": 3600000,
      "numInstancesFixed": 1,
      "numInstancesOption": 0,
      "startTime": "Mon, 17 Aug 2009 04:49:03 GMT",
      "template": "PEX Debian Etch 4.0 Template",
      "type": "XLARGE"
    }

Figure 7. Create reservation JSON body.

Table 2 lists the functions exposed by the Web Service interface, along with their corresponding HTTP methods and endpoints. Some calls require a JSON body whilst others trigger their functionality by simply being accessed. The calls typically return a JSON object or array, or simply an HTTP code denoting whether an operation was a success or failure. All the methods listed require HTTP basic authentication. From this table we can see that the full OpenPEX life-cycle is exposed via the Web Services interface. Customers can create a new reservation and engage in the bilateral negotiation (via the Alternate Offers protocol described earlier in this paper) via the /OpenPEX/reservations endpoint, and finally activate their reservation. Once a reservation has been activated, the corresponding Virtual Machine instances are started at their designated start time. A user has control over these instances via the /OpenPEX/instances endpoint, where they can stop, reboot or delete these instances.

Figure 7 depicts the JSON body for a new reservation call. A user specifies the duration of the reservation, the number of instances required, the start time, the desired template (e.g. Operating System) and the type (Table 1) of instance required. These preferences are expressed in the JSON body of the call. OpenPEX will then respond with a JSON reply that indicates the outcome of the request, which could be an acceptance of the proposed reservation (shown in Figure 8), a counter offer indicating an alternate reservation that could satisfy the user (shown in Figure 9), or an outright rejection of the proposed reservation.

    {
      "proposal": {
        "duration": 3600000,
        "id": "5D0FA0EB-90A8-F4E6-1DFF-61B2CEC6AD91",
        "numInstancesFixed": 1,
        "numInstancesOption": 0,
        "startTime": "Mon, 17 Aug 2009 04:49:03 GMT",
        "template": "PEX Debian Etch 4.0 Template",
        "type": "XLARGE",
        "userid": 1
      },
      "reply": "ACCEPT"
    }

Figure 8. Reply to reservation request.

Upon successfully obtaining a reservation in the system, a user can get the reservation record and activate the reservation. Once the reservation has been activated, the user can then operate on the instances themselves, obtaining the instance record, and control the state of the Virtual Machine instance itself by stopping, rebooting or deleting it.

¹ The application/json Media Type for JavaScript Object Notation (JSON) – http://tools.ietf.org/html/rfc4627
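The create-reservation call in Table 2 can be assembled with a few lines of client code. A minimal sketch, assuming a hypothetical host, port and credentials; only the endpoint, the HTTP basic authentication requirement and the Figure 7 body shape come from the paper:

```python
import base64
import json
import urllib.request

# Hypothetical deployment details -- not from the paper.
BASE = "http://openpex.example.org:8080/OpenPEX"

def auth_headers(user, password):
    # All Table 2 calls require HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}",
            "Content-Type": "application/json"}

def reservation_body(duration_ms, instances, start_time, template, size):
    # Mirrors the Figure 7 create-reservation body (duration in milliseconds).
    return {"duration": duration_ms,
            "numInstancesFixed": instances,
            "numInstancesOption": 0,
            "startTime": start_time,
            "template": template,
            "type": size}

def create_reservation_request(body, headers):
    # POST /OpenPEX/reservations (Table 2); pass the result to
    # urllib.request.urlopen and parse the JSON reply (Fig. 8/9).
    return urllib.request.Request(f"{BASE}/reservations",
                                  data=json.dumps(body).encode(),
                                  headers=headers, method="POST")

req = create_reservation_request(
    reservation_body(3600000, 1, "Mon, 17 Aug 2009 04:49:03 GMT",
                     "PEX Debian Etch 4.0 Template", "XLARGE"),
    auth_headers("alice", "secret"))
# urllib.request.urlopen(req) would return a Figure 8/9 style JSON reply.
```

On an ACCEPT or COUNTER reply, the same pattern with `method="PUT"` against `/reservations/requestId/activate` would claim the reservation.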
    {
      "proposal": {
        "duration": 3600000,
        "id": "F07640D4-32BC-DDB6-457E-32B5595BA066",
        "numInstancesFixed": 1,
        "numInstancesOption": 0,
        "startTime": "Mon, 17 Aug 2009 05:52:31 GMT",
        "template": "PEX Debian Etch 4.0 Template",
        "type": "XLARGE",
        "userid": 1
      },
      "reply": "COUNTER"
    }

Figure 9. Counter reply to reservation request.

5 Conclusion and Future Work

In this paper we introduced OpenPEX, a system that allows users to provision resources ahead of time through advance reservations, instead of being limited to on-demand, best-effort resource acquisition. OpenPEX also incorporates a novel bilateral negotiation protocol that allows users and providers to come to an agreement by exchanging offers and counter-offers, in the event that a user’s original request cannot be precisely satisfied.

The fundamental aim of OpenPEX was to harness virtual machines for adaptive provisioning of services on shared computing resources. Adaptive provisioning may involve a combination of: 1) creating new VMs to meet increases in demand; 2) migrating existing VMs to other available resources; and/or 3) suspending the execution of some VMs in order to increase the resource share available to others. These techniques are, however, governed by negotiated agreements between the users and the resource providers, and between providers, that encapsulate costs and guarantees for deployment and maintenance of VMs for services. Demand and supply for services in such an environment is, therefore, mediated by market-driven resource management mechanisms, thereby leading to a so-called utility computing environment.

As such, we are endeavouring to implement provisional market-based resource management techniques in the OpenPEX system to collect pricing and utilisation data, and introduce policies and strategies for managing virtual machines in a market-driven utility computing environment. We intend to achieve this by:

1. Soliciting a wide range of users (from other faculties and other collaborating Universities) to run VM-encapsulated workloads on the test-bed.

2. Measuring crucial pricing (using a simulated currency mechanism) and usage data from users of the system, which is difficult to obtain from commercial computing centres and largely absent from the literature.

3. Formulating market-driven policies for scheduling and migration in VM platforms based on the data collected.

4. Integrating market-driven scheduling and migration policies into OpenPEX.

5. Evaluating strategies for negotiating among multiple VM providers and users based on market conditions, in conjunction with the proposed market-driven policies.

References

[1] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 2–13, San Jose, California, USA, 2006. ACM.
[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. ACM SIGOPS Operating Systems Review, 37(5):164–177, 2003.
[3] J. Broberg, S. Venugopal, and R. Buyya. Market-oriented Grids and Utility Computing: The state-of-the-art and future directions. Journal of Grid Computing, 6(3):255–276, 2008.
[4] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6):599–616, June 2009.
[5] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, pages 273–286. USENIX Association, 2005.
[6] R. T. Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, 2000.
[7] M. Nelson, B. Lim, and G. Hutchins. Fast transparent migration for virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 25–25, Anaheim, CA, 2005. USENIX Association.
[8] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. Eucalyptus: A Technical Report on an Elastic Utility Computing Architecture Linking Your Programs to Useful Systems. UCSB Computer Science Technical Report 2008-10.
[9] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus Open-source Cloud-computing System. Proceedings of Cloud Computing and Its Applications, 2008.
[10] A. Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, January 1982.
[11] B. Sotomayor, R. Montero, I. Llorente, and I. Foster. Capacity Leasing in Cloud Systems using the OpenNebula Engine. Cloud Computing and Applications 2008, 2008.
[12] S. Venugopal, X. Chu, and R. Buyya. A negotiation mechanism for advance resource reservations using the alternate offers protocol. In Proceedings of the 16th International Workshop on Quality of Service (IWQoS 2008), pages 40–49. IEEE Computer Society Press, Los Alamitos, CA, USA, June 2008.
Exploiting Heterogeneity in Grid Computing for
Energy-Efficient Resource Allocation
Saurabh Kumar Garg and Rajkumar Buyya
The Cloud Computing and Distributed Systems Laboratory
Department of Computer Science and Software Engineering
The University of Melbourne, Australia
Email: {sgarg, raj}@csse.unimelb.edu.au
Abstract—The growing computing demand from industry and academia has led to excessive power consumption, which affects not only the long-term sustainability of grid-like infrastructures in terms of energy cost but also the environment. The problem can be addressed by replacing infrastructure with more energy-efficient alternatives, but the process of switching to new infrastructure is both costly and time consuming. A grid, consisting of several HPC centers under different administrative domains, makes the problem more difficult. Thus, to reduce energy consumption, we address the challenge by effectively distributing compute-intensive parallel applications on the grid. We present a meta-scheduling algorithm that exploits the heterogeneous nature of the grid to reduce energy consumption. Simulation results show that our algorithm, HAMA, can significantly improve the energy efficiency of global grids, typically by 23% and by as much as 50% in some cases, while meeting users’ QoS requirements.

I. INTRODUCTION

For many years, the global grid has served as a mainstream High Performance Computing (HPC) platform providing massive computational power to execute large-scale and compute-intensive scientific and technological applications. Enlarging the existing global grid infrastructure to meet the increasing demand from grid users can progressively speed up the advancement of science and technology. But the growing environmental and economic impact of the high energy consumption of HPC platforms has become a major bottleneck in the expansion of grid-like platforms.

In April 2007, Gartner estimated that the ICT industry is liable for 2% of global CO2 emissions annually, which is equal to the aviation industry [1][2]. In addition, high power consumption has not only led to a rapid increase in utility bills but also affects the reliability of servers due to highly concentrated heat loads. The power efficiency of an HPC center depends on a number of factors such as the processors’ power efficiency, the cooling and air conditioning system, infrastructure design and lighting/physical systems. A recent study [3] by Lawrence Berkeley National Laboratory shows that the cooling efficiency (the ratio of computer power to cooling power) of data centers varies drastically, from a low of 0.6 to a high of 3.5. Thus, sustainable and environmentally friendly solutions must be employed by the HPC community to increase the energy efficiency of HPC systems so that they make more effective use of electricity.

While a lot of research has been performed to increase the efficiency of individual clusters at various levels, such as the processor (CPU) level [4][5], in virtualization-based resource managers [6], and in cluster resource managers [7][8], research on improving the energy efficiency of global systems such as grids is still in its infancy. Most existing grid meta-schedulers, such as the Maui/Moab scheduling suite [9], Condor-G [10], and GridWay [11], focus on improving system-centric performance metrics such as utilization, average load and application turnaround time. Others, such as the Gridbus Broker [12], focus on deadline- and budget-constrained scheduling. Thus, this paper examines how a grid meta-scheduler can exploit the heterogeneity of the global grid infrastructure to reduce the energy consumption of the overall grid. In particular, we focus on designing a meta-scheduling policy that can be easily adopted by existing grid meta-schedulers without many changes to current grid infrastructure. This work is also relevant to the emerging cloud computing paradigm when scaling of applications across multiple clouds is considered [13]. The key contributions of this paper are:

1) It defines a novel Heterogeneity Aware Meta-scheduling Algorithm (HAMA) that considers various factors contributing to the high energy consumption of grids, including cooling system efficiency and CPU power efficiency.

2) It demonstrates, through extensive simulations using real workload traces, that the energy efficiency of global grids can be improved by as much as 23% with HAMA.

The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 defines the grid meta-scheduling model. Section 4 describes HAMA. Section 5 explains the evaluation methodology and simulation setup for comparing HAMA with existing meta-scheduling policies. In Section 6, the performance results of HAMA are analyzed. Section 7 concludes the paper and presents future work.

II. RELATED WORK

This section presents related work on energy-efficient/power-aware scheduling on grids. To the best of our knowledge, no previous work has proposed a meta-scheduler that explicitly addresses the energy efficiency of grids from a global perspective.

Currently, meta-schedulers in operation for global grids, such as GridWay [11], use heuristics such as First Come First
Serve (FCFS). Moab also has a FCFS batch scheduler with goals such as minimizing job completion time and achieving load balancing. The issue of energy consumption […] utilized [16], [17], [18], [7]; or by Dynamic Voltage Scaling (DVS) to slow down the speed of CPU processing [19], [20], [21], [22], [8], [23], [24], [7]. Hence, these efforts help reduce the energy consumption of one resource site, such as a cluster or server farm, but not across multiple resource sites distributed geographically.

[Fig. 1. Meta-scheduling protocol: (1) job request from users, with a deadline; (3) the meta-scheduler finds the most energy-efficient resource site; (4) the meta-scheduler sends jobs to the local scheduler for execution.]

Orgerie et al. [16] propose a prediction algorithm to reduce the power consumption in a large-scale computational grid such as Grid’5000 by aggregating the workload and switching off unused CPUs. They focus on reducing CPU power consumption to minimize the total energy consumption. As the power efficiency of grid sites can vary across the grid, reducing CPU power consumption by itself may not necessarily lead to a global reduction in the energy consumption of the entire grid. We focus on conserving the energy of grids from a global perspective.

Meisner et al. [19] show that in the case of a high and unpredictable workload, it is difficult to exploit the power on/off facility even though it is ideal to simply switch off idle systems. Thus, DVS-enabled CPUs will be much better at saving energy in this case. Therefore, in this work we use DVS to reduce the energy consumption of CPUs, since our main focus is on large-scale computational grid resource sites, which generally have unpredictable workloads.

III. GRID META-SCHEDULING MODEL

A. System Model

A grid meta-scheduler acts as an interface to grid resource sites and schedules jobs on behalf of users, as shown in Figure 1. It interprets and analyzes the service requirements of a submitted job and decides whether to accept or reject the job based on the availability of CPUs. Its objective is to schedule jobs so that the energy consumption of the grid can be reduced while the Quality of Service (QoS) requirements of the jobs are met. As grid resource sites are located in different geographical regions, they have different power efficiencies of CPUs and cooling systems. Each resource site is responsible for updating this information at the meta-scheduler for energy-efficient scheduling. The two participating parties, grid users and grid resource sites, are discussed below along with their objectives and constraints:

1) Grid Users: Grid users submit parallel jobs with QoS requirements to the grid meta-scheduler. Each job must be executed on an individual grid resource site and does not have preemptive priority. The reason for this requirement is that the synchronization among the various tasks of a parallel job can be affected by communication delays when jobs are executed across multiple resource sites. The user’s objective is to have his job completed by a deadline. Deadlines are hard, i.e., the user will benefit only if the job completes before its deadline [25]. To facilitate the comparison between the algorithms described in this work, the estimated execution time of a job provided by the user is considered to be accurate [26]. Several models, such as those proposed by Sanjay and Vadhiyar [27], can be applied to estimate the runtime of parallel jobs. In this work, a job’s execution time is inversely proportional to the CPU operating frequency.

2) Grid Resource Sites: Grid resource sites consist of clusters at different locations, such as the sites of the Distributed European Infrastructure for Supercomputing Applications (DEISA) [28], with resource sites located in various European countries, and the LHC Grid across the world [29]. Each resource site has a local scheduler that manages the execution of incoming jobs. Each local scheduler periodically supplies information about available time slots (ts, te, N) to the meta-scheduler, where ts and te are the start time and end time of the slot respectively and N is the number of CPUs available for the slot. To facilitate energy-efficient computing, each local scheduler also supplies information about the cooling system efficiency, the CPU power-frequency relationship, and the CPU operating frequencies of the grid resource site. All CPUs within a single resource site are homogeneous, but CPUs can be heterogeneous across resource sites.

B. Grid Resource Site Energy Model

The major contributors to total energy usage in a grid resource site are the computing devices (CPUs) and the cooling system,
which constitute about 80% of the total energy consumption. Other systems, such as lighting, are not considered due to their negligible contribution to the total energy cost.

The power consumption P of a CPU at a grid resource site is composed of dynamic and static power [21][7]. The static power includes the base power consumption of the CPU and the power consumption of all other components. Thus, the CPU power P is approximated by the following function (similar to previous work [21][7]): $P = \beta + \alpha f^3$, where $\beta$ is the static power consumed by the CPU, $\alpha$ is the proportionality constant, and $f$ is the frequency at which the CPU is operating. We consider that CPUs support the DVS facility and thus their frequency can be varied discretely from a minimum of $f^{min}$ to a maximum of $f^{max}$. Let $N_i$ be the number of CPUs at a resource site $i$. Thus, if CPU $j$ runs at frequency $f_j$ for time $t_j$, then the total energy consumption due to computation is given by:

$$E_{c,i} = \sum_{N_i} (\beta_i + \alpha_i f_j^3)\, t_j. \qquad (1)$$

The energy cost of a cooling system depends on the Coefficient Of Performance (COP) of the cooling system [30]. The COP is an indication of the efficiency of the cooling system, defined as the ratio of the amount of energy consumed by the CPUs to the energy consumed by the cooling system. The COP is, however, not constant and varies with the cooling air temperature. We assume that the COP remains constant during a scheduling cycle and that resource sites update the meta-scheduler whenever the COP changes. Thus, the total energy consumed by the cooling system is given by the computation energy scaled by the COP, i.e. $E_{c,i}/COP_i$.

[Table I: Parameters of a Grid Resource Site i.]

…job $j$, $n_j$ is the number of CPUs required for job execution, and $e_j$ is the job execution time when operating at the CPU frequency $f_{m,j}$. In addition, let $f_{ij}$ be the initial frequency at which the CPUs of a grid resource site $i$ operate while executing job $j$. HAMA then sorts the incoming jobs based on Earliest Deadline First (EDF) (Algorithm 1, Line 4). The grid resource sites are sorted in order of their power efficiency (Algorithm 1, Line 5), which is calculated as Cooling system efficiency × CPU efficiency, i.e., $(1 + \frac{1}{COP_i}) \times (\frac{\beta_i}{f_i^{max}} + \alpha_i (f_i^{max})^2)$. Then, the meta-scheduler assigns jobs to resource sites according to this ordering (Algorithm 1, Lines 7–11).

Algorithm 1: HAMA
1  while current time < next schedule time do
2      RecvResourcePublish(P)   // P contains information about grid resource sites
3      RecvJobQoS(Q)            // Q contains information about grid users
4      Sort jobs in ascending order of deadline
5      Sort resource sites in ascending order of $(1 + \frac{1}{COP_i}) \times (\frac{\beta_i}{f_i^{max}} + \alpha_i (f_i^{max})^2)$
6      foreach job j in RecvJobQoS do
7          foreach resource site i in RecvResourcePublish do
               // find a time slot for scheduling job j at resource site i
8              if FindTimeSlot(i, j) then
9                  Schedule job j on resource site i using DVS; update available time slots at resource site i
10                 break
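The ranking ingredients above can be sketched in a few lines: the CPU power model, the computation energy of Equation (1), and the power-efficiency key used on Line 5 of Algorithm 1. A minimal sketch; all numeric site parameters below are invented for illustration:

```python
# Sketch of HAMA's ranking ingredients: the CPU power model P = beta + alpha*f^3,
# the computation energy of Eq. (1), and the power-efficiency key used to order
# resource sites (Algorithm 1, Line 5). All numeric parameters are invented.

def cpu_power(beta, alpha, f):
    return beta + alpha * f ** 3          # static + dynamic power

def computation_energy(beta, alpha, runs):
    # runs: list of (frequency f_j, busy time t_j) pairs, one per CPU -- Eq. (1)
    return sum(cpu_power(beta, alpha, f) * t for f, t in runs)

def efficiency_key(site):
    # (1 + 1/COP_i) * (beta_i / f_i_max + alpha_i * f_i_max^2); lower is better
    return (1 + 1 / site["cop"]) * (
        site["beta"] / site["f_max"] + site["alpha"] * site["f_max"] ** 2)

sites = [
    {"name": "A", "cop": 0.6, "beta": 65.0, "alpha": 7.5e-9, "f_max": 3.0e9},
    {"name": "B", "cop": 3.5, "beta": 60.0, "alpha": 7.5e-9, "f_max": 2.4e9},
]
jobs = [{"id": 1, "deadline": 900}, {"id": 2, "deadline": 300}]

jobs.sort(key=lambda j: j["deadline"])    # EDF job ordering (Line 4)
sites.sort(key=efficiency_key)            # most power-efficient site first (Line 5)
```

With these made-up parameters, site B ranks first: its better cooling (higher COP) and lower maximum frequency outweigh site A despite similar static power, which is exactly the heterogeneity the ordering is meant to exploit.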
(i.e. with DVS). If the meta-scheduler fails to schedule the job on the resource site because no free slot is available, then the job is forwarded to the next most energy-efficient resource site for scheduling.

V. PERFORMANCE EVALUATION

We use workload traces from Feitelson's Parallel Workload Archive (PWA) [31] to model the global grid workload. Since this paper focuses on studying the application requirements of grid users, the PWA meets our objective by providing job traces that reflect the characteristics of real parallel applications. The experiments utilize the jobs in the first week of the LLNL Thunder trace (January 2007 to June 2007). The LLNL Thunder trace from the Lawrence Livermore National Laboratory (LLNL) in the USA is chosen due to its highest resource utilization of 87.6% among the available traces, which ideally models a heavy workload scenario. From this trace, we obtain the submit time, requested number of CPUs, and actual runtime of jobs. However, the trace does not contain the service requirement of jobs (i.e. deadline). Hence, we use a methodology proposed by Irwin et al. [32] to synthetically assign deadlines through two classes, namely Low Urgency (LU) and High Urgency (HU).

A job i in the LU class has a high ratio of deadline_i/runtime_i, so that its deadline is definitely longer than its required runtime. Conversely, a job i in the HU class has a low deadline ratio. Values are normally distributed within each of the high and low deadline parameters. The ratio of the deadline parameter's high-value mean to its low-value mean is known as the high:low ratio. In our experiments, the deadline high:low ratio is 3, while the low-value deadline mean and variance are 4 and 2 respectively. In other words, LU jobs have a high-value deadline mean of 12, which is 3 times longer than that of HU jobs with a low-value deadline mean of 4. The arrival sequence of jobs from the HU and LU classes is randomly distributed.

Provider Configuration: The grid modelled in our simulation contains 8 resource sites spread across five countries, derived from the European Data Grid (EGEE) testbed [29]. The configurations assigned to the resources in the testbed for the simulation are listed in Table II. The configuration of each resource site is decided so that the modelled testbed reflects the heterogeneity of platforms and capabilities that is normally characteristic of such installations. Power parameters (i.e. CPU power factors and frequency levels) of the CPUs at the different sites are derived from Wang and Lu's work [7]. Current commercial CPUs only support discrete frequency levels; the Intel Pentium M 1.6 GHz CPU, for example, supports 6 voltage levels. We consider discrete CPU frequencies with 5 levels in the range [f_i^min, f_i^max]. For the lowest frequency f_i^min, we use the same value used by Wang and Lu [7], i.e. f_i^min is 37.5% of f_i^max. Each local scheduler at a grid site uses Conservative Backfilling with advance reservation support, as used by Mu'alem and Feitelson [33]. The grid meta-scheduler schedules jobs periodically at a scheduling interval of 50 seconds, which is to ensure that the meta-scheduler can receive at least one job in every scheduling interval. The cooling system efficiency (COP) value of the resource sites is randomly generated using a uniform distribution between [0.5, 3.6], as indicated in the study conducted by Greenberg et al. [3].

Grid Meta-scheduling Algorithms: We examine the performance of HAMA in terms of job selection and resource allocation by the grid meta-scheduler. We compare our job selection algorithm with EDF-FQ, which prioritizes jobs based on deadline and submits jobs to the resource site with the earliest start time (FQ), i.e. the least waiting time. We also compare HAMA with another version of HAMA, HAMA-withoutDVS, to analyze the effect of the DVS facility on energy consumption.

Performance Metrics: We consider two metrics: average energy consumption and workload (i.e. the amount of workload executed). Average energy consumption shows the amount of energy saved by using HAMA in comparison to other grid meta-scheduling algorithms, whereas workload shows HAMA's effect on the workload executed successfully by the grid.

Experimental Scenarios: We run the experiments in two scenarios: 1) urgency class and 2) arrival rate of jobs. For the urgency class, we use various percentages (0%, 20%, 40%, 60%, 80%, and 100%) of HU jobs. For instance, if the percentage of HU jobs is 20%, then the percentage of LU jobs is the remaining 80%. For the arrival rate, we use various factors (10, 100, and 1000) of the submit time from the trace. For example, a factor of 10 means that a job with a submit time of 10s from the trace now has a simulated submit time of 1s. Hence, a higher factor represents a higher workload by shortening the submit time of jobs.

From Equation 3, we know that the performance of HAMA is highly dependent on the CPU efficiency and cooling system efficiency of the grid resource sites. We compare the performance of our algorithm in the worst-case scenario (HL), i.e., when the resource site with the highest CPU power efficiency has the lowest COP, and the best-case scenario (HH), i.e., when the resource site with the highest CPU power efficiency has the highest COP.

VI. PERFORMANCE RESULTS

A. Effect on Energy Consumption

This section compares the energy consumption of HAMA with that of other meta-scheduling algorithms for grid resource sites with HH and HL configurations. Figure 2 shows how energy consumption varies with deadline urgency and arrival rate of jobs. HAMA has clearly outperformed its competitor EDF-FQ, saving about 17%-23% energy in the worst case and about 52% in the best case.

The effect of job urgency on energy consumption can be clearly seen from Figures 2(a) and 2(b). As the percentage of HU jobs with more urgent (shorter) deadlines increases, the energy consumption (Figure 2(a) and 2(b)) also increases due to more urgent jobs running on resource sites with lower power efficiency and at the highest CPU frequency to avoid deadline violations. On the other hand, the effect of job arrival rate on
TABLE II
CHARACTERISTICS OF GRID RESOURCE SITES
(β and α are the CPU power factors; f_i^max is the maximum CPU frequency in GHz)

Location of Grid Site      β     α     f_i^max   No. of CPUs   MIPS Rating
RAL, UK                    65    7.5   1.8       2050          1140
Imperial College (UK)      75    5     1.8       2600          1200
NorduGrid (Norway)         60    60    2.4       650           1330
NIKHEF (Netherlands)       75    5.2   2.4       540           1176
LYON (France)              90    4.5   3.0       600           1166
Milano (Italy)             105   6.5   3.0       350           1320
Torino (Italy)             90    4.0   3.2       200           1000
Padova (Italy)             105   4.4   3.2       250           1330
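To make the Table II parameters concrete, here is a small sketch assuming the cubic frequency-to-power relation P(f) = β + α·f³ from Wang and Lu's model [7], to which the listed power factors appear to correspond; treat the formula as our reading, not an equation quoted from this section. The five discrete DVS levels span [0.375·f_max, f_max], as stated in the text.

```python
# Assumed power model P(f) = beta + alpha * f^3 (after Wang and Lu [7]).

def dvs_levels(fmax_ghz, n_levels=5):
    """Evenly spaced discrete CPU frequencies from f_min = 37.5% of f_max."""
    fmin = 0.375 * fmax_ghz
    step = (fmax_ghz - fmin) / (n_levels - 1)
    return [fmin + k * step for k in range(n_levels)]

def cpu_power(beta, alpha, f_ghz):
    """Assumed cubic power model: P(f) = beta + alpha * f^3."""
    return beta + alpha * f_ghz ** 3

# RAL, UK row of Table II: beta = 65, alpha = 7.5, f_max = 1.8 GHz
levels = dvs_levels(1.8)
print(len(levels), round(levels[0], 3), round(levels[-1], 3))  # 5 0.675 1.8
print(cpu_power(65, 7.5, levels[-1]))  # ~108.74 at the highest frequency
```

Under this model, running a job at a lower DVS level cuts the frequency-dependent α·f³ term sharply, which is why DVS saves energy whenever the deadline leaves slack.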
energy consumption (Figure 2(c) and 2(d)) is minimal, with a slight increase when more jobs arrive.

For grid resource sites without DVS, HAMA-withoutDVS can reduce the energy consumption by up to 15-21% (Figure 2(a)) in the HL configuration and by 28-50% (Figure 2(b)) in the HH configuration compared to EDF-FQ, which also does not consider the DVS facility while scheduling across the entire grid. This highlights the importance of the power efficiency factor in achieving energy-efficient meta-scheduling. In particular, HAMA can reduce energy consumption (Figure 2(a) and 2(b)) even more when there are more LU jobs with less urgent (longer) deadlines and the arrival rate is low.

When we compare HAMA and HAMA-withoutDVS, we observe that using DVS increases the energy saving by about 11% when the percentage of jobs with urgent deadlines and the job arrival rate are high. This is because, when the DVS facility is available, jobs can run at a lower CPU frequency to save energy.

B. Effect on Workload Executed

Figure 3 shows the total amount of workload successfully executed according to the user's QoS. The workload of a job refers to the multiplication of its execution time and the number of CPUs required. The effect of job urgency and arrival rate on the workload executed can be clearly seen from Figures 3(a) and 3(d). All meta-scheduling algorithms show a consistent decrease in workload execution, particularly in the job urgency scenario. The reason is the rejection of more jobs due to deadline misses when all jobs are of high urgency. The amount of workload executed by EDF-FQ is less than that of HAMA because, while scheduling using EDF-FQ, the local scheduler executes the jobs using conservative backfilling without any consideration of job deadlines, whereas in the case of HAMA, the meta-scheduler sends a job to a resource site only if a time slot is available to execute the job before its deadline.

VII. CONCLUSION

With the increasing demand for global grids, the energy consumption of grid infrastructure has escalated to the degree that grids are becoming a threat to society rather than an asset. The carbon footprint of grids may continue to increase unless the problem is addressed at every level, i.e., from local (within a single grid site) to global (across multiple grid sites). Moreover, an immediate and significant reduction in CO2 emissions is required for the future sustainability of global grids.

In this paper, we have addressed the energy efficiency of grids at the meta-scheduling level. We proposed the Heterogeneity Aware Meta-scheduling Algorithm (HAMA) to address the problem by scheduling more workload with urgent deadlines on resource sites which are more power-efficient. Thus, HAMA considers crucial information about global grid resource sites, such as cooling system efficiency (COP) and CPU power efficiency. HAMA addresses the problem in two steps: 1) allocating jobs to more energy-efficient resource sites and 2) scheduling using a DVS policy at the local resource site to further reduce energy consumption.

Results show that HAMA can reduce energy consumption by up to 23% in the worst case and up to 50% in the best case compared to other algorithms (EDF-FQ). Moreover, even if the DVS facility is not available, HAMA-withoutDVS can still result in a considerable amount of power savings of up to 21%. In particular, our HAMA algorithm works very well when the deadlines of jobs are less urgent and the arrival rate of jobs is not high. Thus, HAMA can also complement the efficiency of existing power-aware scheduling policies for clusters.

In future, we will investigate how HAMA can address the energy consumption problem in virtualized environments such as clouds, which are the emerging platform for hosting business applications. We will also integrate HAMA with existing grid meta-schedulers and conduct experiments on real grid and cloud resources. We will also extend our current meta-scheduling model to resources such as storage disks and switching devices.

ACKNOWLEDGEMENTS

We would like to thank Chee Shin Yeo for his constructive comments on this paper. This work is partially supported by research grants from the Australian Research Council (ARC) and the Australian Department of Innovation, Industry, Science and Research (DIISR).

REFERENCES

[1] Gartner, "Gartner Estimates ICT Industry Accounts for 2 Percent of Global CO2 Emissions," http://www.gartner.com/it/page.jsp?id=503867.
[2] J. G. Koomey, "Estimating total power consumption by servers in the US and the world," http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf.
Fig. 2. Average energy consumption of HAMA, HAMA-withoutDVS, and EDF-FQ: (a) HL: Energy Consumption vs. Job Urgency; (b) HH: Energy Consumption vs. Job Urgency; (c) HL: Energy Consumption vs. Job Arrival Rate (factors 10, 100, 1000); (d) HH: Energy Consumption vs. Job Arrival Rate.
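The percentages discussed around Figure 2, and the workload metric of Section VI.B, reduce to two one-line formulas; the sketch below (ours, with invented numbers) spells them out: a 23% saving means HAMA used 23% less energy than the baseline, and a job's workload is its execution time multiplied by the number of CPUs it requires.

```python
# Metric sketch; the energy and job figures are hypothetical.

def energy_saving_pct(e_hama, e_baseline):
    """Relative energy saving of HAMA versus a baseline such as EDF-FQ."""
    return 100.0 * (e_baseline - e_hama) / e_baseline

def job_workload(exec_time_s, num_cpus):
    """Workload of a job = execution time x number of CPUs required."""
    return exec_time_s * num_cpus

print(energy_saving_pct(770.0, 1000.0))  # 23.0
print(job_workload(120.0, 8))            # 960.0
```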
[3] S. Greenberg, E. Mills, B. Tschudi, P. Rumsey, and B. Myatt, "Best practices for data centers: Results from benchmarking 22 data centers," in Proc. of the 2006 ACEEE Summer Study on Energy Efficiency in Buildings, Pacific Grove, USA, 2006, http://eetd.lbl.gov/emills/PUBS/PDF/ACEEE-datacenters.pdf.
[4] V. Salapura et al., "Power and performance optimization at the system level," in Proc. of the 2nd Conference on Computing Frontiers, Ischia, Italy, 2005.
[5] A. Elyada, R. Ginosar, and U. Weiser, "Low-complexity policies for energy-performance tradeoff in chip-multi-processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 9, pp. 1243–1248, 2008.
[6] A. Verma, P. Ahuja, and A. Neogi, "pMapper: Power and migration cost aware application placement in virtualized systems," in Proc. of the 9th ACM/IFIP/USENIX International Conference on Middleware, Leuven, Belgium, 2008.
[7] L. Wang and Y. Lu, "Efficient Power Management of Heterogeneous Soft Real-Time Clusters," in Proc. of the 2008 Real-Time Systems Symposium, Barcelona, Spain, 2008.
[8] K. Kim, R. Buyya, and J. Kim, "Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters," in Proc. of the Seventh IEEE International Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, 2007.
[9] B. Bode et al., "The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters," in Proc. of the 4th Annual Linux Showcase and Conference, Atlanta, USA, 2000.
[10] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids," Cluster Computing, vol. 5, no. 3, pp. 237–246, 2002.
[11] E. Huedo, R. Montero, and I. Llorente, "A framework for adaptive execution in grids," Software: Practice and Experience, vol. 34, no. 7, pp. 631–651, 2004.
[12] S. Venugopal, K. Nadiminti, H. Gibbins, and R. Buyya, "Designing a resource broker for heterogeneous grids," Software: Practice and Experience, vol. 38, no. 8, pp. 793–825, 2008.
[13] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility," Future Generation Computer Systems, vol. 25, no. 6, pp. 599–616, 2009.
[14] Y. Etsion and D. Tsafrir, "A Short Survey of Commercial Cluster Batch Schedulers," Technical Report 2005-13, Hebrew University, May 2005.
[15] R. Raman, M. Livny, and M. Solomon, "Resource Management through Multilateral Matchmaking," in Proc. of the 9th IEEE Symposium on High Performance Distributed Computing, Pittsburgh, USA, 2000.
[16] A. Orgerie, L. Lefèvre, and J. Gelas, "Save Watts in Your Grid: Green Strategies for Energy-Aware Framework in Large Scale Distributed Systems," in Proc. of the 14th IEEE International Conference on Parallel and Distributed Systems, Melbourne, Australia, 2008.
[17] D. Bradley, R. Harper, and S. Hunter, "Workload-based power management for parallel computer systems," IBM Journal of Research and Development, vol. 47, no. 5, pp. 703–718, 2003.
Fig. 3. Workload executed (in millions) by HAMA, HAMA-withoutDVS, and EDF-FQ: (a) HL: Workload Execution vs. Job Urgency (% of High Urgency (HU) jobs); (b) HH: Workload Execution vs. Job Urgency; (c) HL: Workload Execution vs. Job Arrival Rate (factors 10, 100, 1000); (d) HH: Workload Execution vs. Job Arrival Rate.
[18] B. Lawson and E. Smirni, "Power-aware resource allocation in high-end systems via online simulation," in Proc. of the 19th Annual International Conference on Supercomputing, Cambridge, USA, 2005, pp. 229–238.
[19] D. Meisner, B. Gold, and T. Wenisch, "PowerNap: eliminating server idle power," in Proc. of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, Washington, USA, 2009.
[20] G. Tesauro et al., "Managing power consumption and performance of computing systems using reinforcement learning," in Proc. of the 21st Annual Conference on Neural Information Processing Systems, Vancouver, Canada, 2007.
[21] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam, "Managing server energy and operational costs in hosting centers," ACM SIGMETRICS Performance Evaluation Review, vol. 33, no. 1, pp. 303–314, 2005.
[22] A. Verma, P. Ahuja, and A. Neogi, "Power-aware dynamic placement of HPC applications," in Proc. of the 22nd Annual International Conference on Supercomputing, Athens, Greece, 2008, pp. 175–184.
[23] N. Kappiah, V. Freeh, and D. Lowenthal, "Just in time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs," in Proc. of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, USA, 2005.
[24] C. Hsu and W. Feng, "A power-aware run-time system for high-performance computing," in Proc. of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, USA, 2005.
[25] R. Porter, "Mechanism design for online real-time scheduling," in Proc. of the 5th ACM Conference on Electronic Commerce, New York, USA, 2004, pp. 61–70.
[26] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, "Theory and practice in parallel job scheduling," in Job Scheduling Strategies for Parallel Processing, London, UK, 1997, pp. 1–34.
[27] H. A. Sanjay and S. Vadhiyar, "Performance modeling of parallel applications for grid scheduling," Journal of Parallel and Distributed Computing, vol. 68, no. 8, pp. 1135–1145, 2008.
[28] "Distributed European Infrastructure for Supercomputing Applications (DEISA)," http://www.deisa.eu.
[29] Enabling Grids for E-sciencE, "EGEE project," http://www.eu-egee.org/, 2005.
[30] J. Moore, J. Chase, P. Ranganathan, and R. Sharma, "Making scheduling 'cool': temperature-aware workload placement in data centers," in Proc. of the 2005 USENIX Annual Technical Conference, Anaheim, CA, 2005.
[31] D. Feitelson, "Parallel workloads archive," http://www.cs.huji.ac.il/labs/parallel/workload.
[32] D. Irwin, L. Grit, and J. Chase, "Balancing risk and reward in a market-based task service," in Proc. of the 13th IEEE International Symposium on High Performance Distributed Computing, Honolulu, USA, 2004.
[33] A. W. Mu'alem and D. G. Feitelson, "Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 6, pp. 529–543, Jun. 2001.
Intelligent Data Analytics Console
Snehal Gaikwad, School of Computer Science, Carnegie Mellon University, Pittsburgh, 15217 USA, snehalgaikwad@cmu.edu
Mihir Kedia, IBM Global Services, IBM India Pvt Ltd, Delhi, 110020 INDIA, mihir.kedia@in.ibm.com
Aashish Jog, School of Management, IIT Bombay, Powai, 400076 INDIA, aashishjog@som.iitb.ac.in
Bhavana Tiple, Dept of Computer Science, MIT Pune, Pune, 411038 INDIA, bstiple@mitpune.com
Abstract—The problem of integrating data analysis, visualization and learning is arguably at the very core of the problem of data intelligence. In this paper, we review our research work in the area of data analytics and visualization, focusing on four interlinked directions of research - (1) data collection, (2) data analytics and visual evidence, (3) data visualization, and (4) intelligent user interfaces - that contribute to and complement each other. Our research resulted in the Intelligent Data Analytics Console (IDAC), an integration of the above four disciplines. The main objectives of IDAC are to - (1) provide a rapid development platform for analyzing data and generating components that can be used readily in software applications, (2) provide visual evidence after each data analytical operation that helps the user learn the behavior of the data, and (3) provide a user-centric platform for skilled data analytics experts as well as naïve users. The paper presents the development process of user-centric intelligent software equipped with effective visualization. This approach should help business organizations to develop better data analytics software using open source technologies.

Keywords- Human Factors, Intelligent User Interfaces, Machine Learning

I. INTRODUCTION

A huge amount of data is generated in various familiar processes and systems and the computer networks built around them. Voluminous data from electronic business, computer networks, financial trading systems, share markets, and weather forecasting comes in fast streams. In our day-to-day lives, manually analyzing and classifying such large amounts of data is not feasible.

Data analytics is the art and science of getting to know more about the behavior of complex real-time data by applying mathematical and statistical principles [30, 43]. Rapid data analytics operations performed on given data result in huge changes in the original data values. If we consider a dataset having about a thousand data points, tracking the changes after each data analytics operation is impossible. Therefore, to learn the patterns and trends of a given data set, it is necessary to represent the data values in an understandable format. Many data analytics tools have been proposed to overcome these problems; Weka [5, 36, 43], Yet Another Learning Environment (Yale, now known as RapidMiner) [15, 29], and Sumatra TT [4] are major examples. However, current software systems fail to provide a rapid analytics platform for naïve users. Sumatra TT, Weka, and Yale cannot support different types of data formats. Sumatra TT fails to provide a wide variety of analytics algorithms and operators. Less efficient drag-and-drop interfaces and a lack of graphical representations eventually result in major usability challenges for naïve users. As a result, the prediction and decision-making process becomes time consuming and frustrating for users.

Today, data analytics and visualization research has undergone fundamental changes in several approaches [6, 9, 25, 26]. A new research and development focus has emerged within visualization to address some of the fundamental problems associated with the new classes of data and their related analysis tasks [28]. This research and development focus is known as information visualization. Information visualization combines aspects of scientific visualization, human-computer interfaces, data mining, imaging, and graphics [28]. Visualization is perceived as a gateway to understanding voluminous datasets. It provides a healthy environment for data analytics experts and business executives for understanding and forecasting a huge amount of data in a quick span of time.

In this paper, we present the development and implementation of the Intelligent Data Analytics Console (IDAC), focusing on the integration of data collection, data analytics and visualization. We demonstrate an approach to develop a rapid platform for analyzing datasets and generating agents, which can be used readily in software applications. This process involves a library of analytics blocks and allows the user to drag and drop all the required functional blocks to form an 'analytics execution chain'. Our development is based on the open source libraries from Weka [5, 36, 43], Yale [15, 29], and JFreeChart [11]. The IDAC architecture proves how intelligent user interfaces can increase the usability of complex systems. Our research provides detailed guidelines for the research and business community to effectively set up a platform for data analysis problems. The rapid learning platform of IDAC captures the knowledge generated by data analytics experts; by finding typical analysis patterns it provides standard recipes as well as online recommendations on what to do next. An advisory system is an application of data analytics which analyzes the nature of the data as well as the results obtained at each step to provide recommendations. The paper demonstrates how this feature is immensely useful to users who do not have a deep understanding of analytics algorithms. Capturing the knowledge involves recording earlier successful executions of analytics chains used by users for particular types of analysis and 'training' the advisory system. Further, we illustrate how the visual evidence technique can act as an effective debugging aid for naïve users.
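The 'analytics execution chain' idea described above can be sketched in a few lines (our illustration, not IDAC code; the operator names are invented): operators run in the order the user wires them, and a hook records a before/after snapshot of the data for each operator, which is the raw material for visual evidence.

```python
# Toy execution chain with a visual-evidence hook; operators are made up.

from typing import Callable, List

Operator = Callable[[list], list]

def run_chain(data: list, chain: List[Operator], evidence: list) -> list:
    for op in chain:
        before = list(data)
        data = op(data)
        # one visual-evidence record per operator: (name, before, after)
        evidence.append((op.__name__, before, list(data)))
    return data

def drop_missing(rows):          # e.g. a cleaning filter
    return [r for r in rows if None not in r]

def normalize_first(rows):       # e.g. a preprocessing operator
    hi = max(r[0] for r in rows)
    return [(r[0] / hi,) + tuple(r[1:]) for r in rows]

evidence = []
out = run_chain([(2, 'a'), (None, 'b'), (4, 'c')],
                [drop_missing, normalize_first], evidence)
print(out)                                 # [(0.5, 'a'), (1.0, 'c')]
print([name for name, _, _ in evidence])   # ['drop_missing', 'normalize_first']
```

Because every operator leaves a (before, after) pair behind, a user can inspect exactly which step changed the data, which is the essence of the visual evidence objective stated above.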
The rest of the paper is organized as follows. In Section 2, we introduce the detailed system architecture of IDAC; Section 3 presents the intelligent user interface; usability evaluation and results are described in Section 4. Section 5 discusses future research; Section 6 provides concluding remarks.

II. SYSTEM ARCHITECTURE

In this section, we provide a detailed implementation of IDAC. The IDAC architecture is based on the iterative process model [34, 35]. The traditional software development cycle includes the widely used Waterfall Model, Prototyping Model or Spiral Model [10]. Most of them have barriers of specification, communication and optimization [10, 34]. Considering our research requirements, we decided to follow Incremental Process Model development. The main objective of applying the incremental development approach was to enhance system usability and constantly tune the computational performance of the software [22, 33, 34, 35].

The data collection module, the data analytics and visual evidence module, and the data visualization module are the three major components of the system. Fig. 1 shows the detailed architecture of IDAC. The data collection module is responsible for data preprocessing and cleaning [38]. The data analytics and visual evidence module focuses on data mining [43] and machine learning [30] operations and generates textual results. The visual evidence operators are created after each data analysis operator. The data visualization module is responsible for converting textual results to suitable graphical formats. In the following sections of the paper, we demonstrate the detailed architecture of IDAC.

A. Data Collection Module

The data collection module is responsible for rapid data preprocessing. Data preprocessing is also called data cleaning or data cleansing. The data cleaning process deals with detecting and removing errors and inconsistencies in order to improve the quality of the available data [2, 20, 38]. We have used three categories of filters for data cleaning: supervised filters, unsupervised filters, and streamable filters. The IDAC Java packages for analytics operators are based on the Yale [15] and Weka [5] open source coding hierarchy. The IDAC filter package is responsible for filtering operations.

A real-time data set contains a huge number of missing values, which affects the accuracy of prediction. For example, if an electronic business executive wants to launch a new product but the available datasets do not have enough labels for an attribute, then it is difficult for him to make accurate predictions. To overcome this problem and estimate unknown parameters, we have implemented the Expectation-Maximization (EM) algorithm. The EM algorithm is used for finding maximum likelihood estimates of parameters in probabilistic models where the model depends on unobserved latent variables [12]. The E-step of the algorithm finds the conditional expectation of the missing data given the observed data and the current estimated parameters, and substitutes these expectations for the missing data. The M-step updates the parameter estimates by maximizing the expected complete-data log likelihood [12]. We used Frank Dellaert's formulation to maximize the posterior probability of the unknown parameters [12]. If U is the given measurement data and J is a hidden variable, then the unknown parameter is estimated by the following formula [12]:

    Θ* = argmax_Θ Σ_{J ∈ J^n} P(Θ, J | U)        (1)

EM computes a distribution over the space J^n rather than finding the best J ∈ J^n [12]. We found that existing data analytics tools fail to support more than one data format [4]; to overcome this drawback, the system supports two major data file formats, ARFF and CSV. A data conversion operator helps the user convert files from one format into the other.
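For intuition, the E and M steps above can be sketched for the simplest possible case: a univariate Gaussian whose sample has some missing entries (our toy illustration, not the IDAC implementation). In the E-step each missing value contributes the current mean (and the current variance to the second moment); the M-step re-estimates the parameters from the expected sufficient statistics.

```python
# Toy EM for a univariate Gaussian with missing observations (None).

def em_gaussian(data, iters=50):
    obs = [x for x in data if x is not None]
    n_miss = len(data) - len(obs)
    mu, var = sum(obs) / len(obs), 1.0           # crude initialisation
    for _ in range(iters):
        # E-step: expected sufficient statistics of the completed data
        s1 = sum(obs) + n_miss * mu
        s2 = sum(x * x for x in obs) + n_miss * (var + mu * mu)
        # M-step: maximise the expected complete-data log likelihood
        mu = s1 / len(data)
        var = s2 / len(data) - mu * mu
    return mu, var

mu, var = em_gaussian([1.0, 2.0, 3.0, None, None])
print(round(mu, 4), round(var, 4))  # 2.0 0.6667
```

The estimates converge to a fixed point of the two steps; in this tiny example the mean stays at 2.0 while the variance settles at 2/3, the value that is self-consistent with imputing the missing entries by the mean.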
Fig. 2. An analytics execution chain in IDAC: IRIS → Cross Validation Fold Maker → Attribute Selection.
1) ARFF: ARFF is widely known as the Attribute Relation File Format. The structure of an ARFF file is divided into two major parts: a header section and a data section. The header section contains the name of the relation and the list of attributes and their types. The data section consists of data declaration lines and values. Table I shows the ARFF file for a customer banking record.

TABLE I. ARFF FOR CUSTOMER DATA

@relation bank
@attribute Service_type {Fund, Loan, CD, Bank_Account, Mortgage}
@attribute Customer {Student, Business, Other, Doctor, Professional}
@attribute Size {Small,Large,Medium}
@attribute Interest_rate real
@DATA
Fund,Business,Small,1
Loan,Other,Small,1
Mortgage,Business,Small,4
………………………………….

2) Comma Separated Value (CSV): CSV is also known as the comma-separated list. CSV files consist of data values separated by commas. Table II shows the CSV file for the weather forecasting data used by IDAC.

TABLE II. CSV FOR WEATHER FORECASTING DATA

Sunny, 89, false, no
Overcast, 78, true, yes
rainy, 75, true, yes

The data collection module sends the cleaned data to the data analytics and visual evidence module to perform advanced data analysis operations. In this module, IDAC allows a data analytics expert to perform several analysis operations in a certain order. Data probes send the cleaned data to the analytics and visual evidence module. This module involves a library of analytics blocks which allows the user to drag and drop all the required blocks and connect them to form an 'analytics execution chain' in the visual application. The chain is based on the end goal of the analysis as well as the nature of the dataset. Fig. 2 shows a chain of the Iris data set, cross validation, and attribute selection operators. Existing data analytics software provides visualization only at the end of the chain; as a result it becomes difficult for the user to understand the changes after each data analytics operator [5, 4, 15]. From an analytics perspective, tracking these changes is very important to learn the behavior of the data set. We propose the new concept of the Visualization Debugger (see Fig. 3).
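A minimal sketch of reading these two formats (our illustration, not IDAC's converter; it handles only simple files like those in Tables I and II, with CSV read via the standard library):

```python
# Tiny ARFF reader plus stdlib CSV reading; illustration only.

import csv
import io

def parse_arff(text):
    """Return (attribute names, data rows) from a simple ARFF document."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        low = line.lower()
        if low.startswith('@attribute'):
            attributes.append(line.split()[1])
        elif low.startswith('@data'):
            in_data = True
        elif in_data and not line.startswith('@'):
            rows.append(line.split(','))
    return attributes, rows

arff = """@relation bank
@attribute Service_type {Fund, Loan, CD, Bank_Account, Mortgage}
@attribute Customer {Student, Business, Other, Doctor, Professional}
@attribute Size {Small,Large,Medium}
@attribute Interest_rate real
@DATA
Fund,Business,Small,1
Loan,Other,Small,1"""

attrs, rows = parse_arff(arff)
print(attrs)    # ['Service_type', 'Customer', 'Size', 'Interest_rate']
print(rows[0])  # ['Fund', 'Business', 'Small', '1']

weather = list(csv.reader(io.StringIO("Sunny, 89, false, no\nOvercast, 78, true, yes")))
print(weather[1])  # ['Overcast', ' 78', ' true', ' yes']
```

Note that a format-conversion operator of the kind the text describes is then just a matter of re-serializing the parsed rows in the other format.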
Visualization Debugger helps the user to understand changing Java Class Algorithm Description
behavior of the dataset after each analytics operation. Fig. 4
FarthestFast Cluster data using FarthestFirst
represents the basic visual debugger used to generate the algorithm.
visual evidence from data analytics results. The visualization
operator displays the dataset values twice: a) before entering
DBScan Density-based algorithm for
into the data analytics operator and b) after resulting from the
data analytics operator. Comparing results from the both discovering clusters in large
spatial databases with noise.
visual operators the user can easily keep track of changes that
take place in the operational chain. This application presents
interface amenable to the user who would like to rapidly An Example: We illustrate the IDAC clustering with a
provide an analytics solution into suitable data processing simple example of the real world company. The main
architecture. IDAC advisory system analyzes the results objective of the company is to develop a new financial product
obtained at each step and provides recommendations to assist users in making decisions based on visual facts. Data analytics operations help users find patterns in huge datasets. We have used the open source Weka platform to develop a library of different data analysis operators. The following sections focus on the major data analytics operators implemented in IDAC.

B. IDAC Classification

We have analyzed the performance of existing classifiers [13, 21, 41] as well as the classification techniques used in Weka [43] and Sumatra TT [4]. We found that decision trees are the most popular and stable method among available data mining classifiers, mainly because a decision tree provides excellent scalability and highly understandable results. Decision trees also support sparse data. We also used the KNN model because it is applicable to most cases independent of the training data requirement, sparse to dense. KNN is an instance-based classifier that provides excellent quality of prediction and explanatory results. However, we found that KNN is not optimal for problems with a non-zero Bayes error rate, that is, for problems where even the best possible classifier has a non-zero classification error. We overcome this problem by incorporating discriminative and generative classifiers.

C. IDAC Clustering

Clustering deals with discovering classes of instances that belong together [23, 43]. Clusters are also known as self-organized maps. IDAC clustering operators are developed using high performance clustering algorithms [32]. An IDAC operator converts an input file into several clusters; it also provides the probability distribution for all attributes. A data set may sometimes exhibit a trend different from the regular one; such data is called an anomaly. Detecting these anomalies is essential for studying abnormal behavior of the dataset. Visual representation of clustering enables the user to find abnormal behaviors in the datasets and helps in decision-making. Table III lists the algorithms implemented in IDAC.

TABLE III. IDAC JAVA CLASSES AND ALGORITHMS FOR CLUSTERING

Java Class       Algorithm Description
Simple X Means   Cluster data using the X-Means algorithm.
Simple K Means   Cluster data using the K Means algorithm.

… for its customers. Table IV represents the data values of vacation, eCredit, salary and property for each customer. The goal of the operation is to find patterns in the customer behavior and develop new financial products. The main step of the process is finding the data belonging to a specific cluster. Using the data analytics library we apply the Simple K Means operator on the available data. Table V shows which customer fits in which cluster; for example, customer number nine, with salary 14.4, belongs to the first cluster. Table VI shows the mean value for each cluster. The visualization in Fig. 5 helps us understand the behavior of the data. The user can easily observe that groups 3 and 5 differentiate themselves from the rest: customers who belong to clusters 3 and 5 have maximum vacations, high salary, and low property.

TABLE IV. CUSTOMER DATA

Customer  Vacation  eCredit  Salary  Property
1         6         40       13.62   3.2804
2         11        21       15.32   2.0232
3         7         64       16.55   3.1202
4         3         47       15.71   3.4022
5         15        10       16.96   2.2825
6         6         80       14.66   3.7338
7         10        49       13.86   5.8639
8         10        84       15.64   3.187
9         9         74       14.4    2.3823
…
186       51        13       20.71   1.4585
187       49        17       19.18   2.4251

TABLE V. RESULTS AFTER CLUSTERING

Customer  Vacation  eCredit  Salary  Property  Fit Cluster
1         6         40       13.62   3.2804    2
2         11        21       15.32   2.0232    2
3         7         64       16.55   3.1202    1
4         3         47       15.71   3.4022    2
5         15        10       16.96   2.2825    2
6         6         80       14.66   3.7338    1
7         10        49       13.86   5.8639    3
8         10        84       15.64   3.187     1
9         9         74       14.4    2.3823    1
…
186       51        13       20.71   1.4585    5
187       49        17       19.18   2.4251    5

TABLE VI. MEAN VALUE FOR EACH CLUSTER

Group  Vacation  eCredit   Salary    Property  Total Number
1      14.51515  80.45455  21.90152  5.465142  33
2      8.75      14.80556  16.20771  1.423411  36
3      39.02128  55.08511  19.92681  3.72284   47
4      10.21277  224.0435  24.9087   11.61411  23
5      48.21277  15.42553  21.87149  2.057874  47
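The Simple K Means step of this case study can be sketched in a few lines. The following is an illustrative stand-in for the Weka-backed IDAC operator: a plain Lloyd's-algorithm k-means run on the first nine rows of Table IV. The feature standardization, the deterministic initialization, and the choice of k = 3 are our assumptions for the sketch, not IDAC's actual settings.

```python
# Minimal k-means sketch over the Table IV customer attributes
# (vacation, eCredit, salary, property). Illustrative only.

customers = [  # rows 1-9 of Table IV
    (6, 40, 13.62, 3.2804), (11, 21, 15.32, 2.0232), (7, 64, 16.55, 3.1202),
    (3, 47, 15.71, 3.4022), (15, 10, 16.96, 2.2825), (6, 80, 14.66, 3.7338),
    (10, 49, 13.86, 5.8639), (10, 84, 15.64, 3.187), (9, 74, 14.4, 2.3823),
]

def standardize(rows):
    """Scale each column to zero mean / unit variance so no attribute dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(r, means, stds)) for r in rows]

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with deterministic initialization (first k points)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean distance)
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # recompute centroids; an empty cluster keeps its old centroid
        new = []
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            new.append(tuple(sum(v) / len(v) for v in zip(*members))
                       if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return labels

labels = kmeans(standardize(customers), k=3)
print(labels)
```

In IDAC itself this step is delegated to the Simple K Means operator of Table III; the sketch only shows the underlying mechanics of assigning each customer a "fit cluster" as in Table V.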
Table VII shows the user centric visualization method. While representing voluminous data, we have considered the following principles [9, 14, 25, 27]:

[Figure: Example Source: Sonar Data File]
We have used the Semantic Layer to define the contents of the operator library. The Interaction Layer is responsible for defining the interaction sequences and determining the user experience. The Channel Layer is responsible for the actual presentation, which provides the user with a consistent experience independent of the access method. For example, all interactions offered through a screen follow the same logical steps and offer the same sequence of options. The structure of the data analytics library supports this consistent experience. When the user selects operations, the system compares them with standard rules; the algorithm returns Boolean results for chain validation. If the chain is valid, IDAC continues the data analysis operation. For invalid chains, IDAC finds the wrong operator and presents multiple suggestions to the user. Initially, the IDAC Intelligence Wizard screen allows the user to select the file format. Depending on the selected dataset, the system presents the next interface to the user. Fig. 7 shows the screen for several data analysis operations. The Intelligence Wizard asks the user what kind of data analysis he wants to perform. After considering the user's choices, the system automatically generates the chain of data analytics operators (see Fig. 8). From the user's inputs, IDAC generates and validates the operating chain. This approach understands human responses and significantly reduces usability barriers faced by naïve users.

IV. USABILITY EVALUATION AND RESULTS

In this section, we describe the software testing approaches used to measure the performance and usability of IDAC. During the usability evaluation phase, we focused on both automated and manual software testing approaches. IDAC automated testing is performed using WinRunner 7.0i software, which automates the testing process by storing a script of the actions that take place. We used Test Director to perform manual testing. The incremental model helped us achieve the desired results and performance. Initially we conducted usability research to understand the proportion of usability problems in Weka, Sumatra TT and Yale. Fig. 9 shows the results of Nielsen's heuristic analysis. We analyzed usability heuristics including match between the system and real world standards, user control and freedom, help and documentation, consistency and standards, flexibility and efficiency of use, and aesthetic and minimalistic design. The results show that most of the systems fail to achieve consistency and standards, and user control and freedom. Sumatra TT fails to provide help and documentation.
[Bar chart: errors (0-100) per Nielsen usability heuristic — match between system and real world standards, user control and freedom, visibility of system status, help and documentation, consistency and standards, flexibility and efficiency of use, aesthetic and minimalistic design — for Weka, Sumatra TT and Yale.]
Figure 9. Usability evaluation results of Weka, Sumatra TT and Yale using heuristic analysis
[Bar chart: errors (0-100) per Nielsen usability heuristic, as in Fig. 9, for Weka, Sumatra TT, Yale and IDAC.]
Figure 10. Usability evaluation results of IDAC, Weka, Sumatra TT and Yale using heuristic analysis
We overcame the drawbacks in the traditional software by focusing on the following criteria:

• Effectiveness: helps achieve the desired accuracy and completeness with which users can achieve their goals.

• Efficiency: reduces the computation time and resources spent in achieving the desired goals.

• Satisfaction: reduces user discomfort and increases pleasant interaction.

We adopted standard approaches in the testing phase of iterative design. Unit testing, regression testing, integration testing, alpha testing, and beta testing methods were used to test IDAC components. As each new operator was added to the library, we performed regression testing to see whether it had an adverse effect on previously created operators. New visual operators were integrated without any side effects [18]. The alpha testing was conducted in the presence of end users, and an initial contextual enquiry helped analyze the software usability. After getting feedback from the beta testing, further enhancements to the help and documentation standards were made [18]. The performance of IDAC was compared to other existing software. Results show that the Intelligence Wizard creates accurate analytics chains in about 93% of the cases. Fig. 10 shows the improved results and efficiency of IDAC as compared to the current software systems.

V. FUTURE RESEARCH

In this paper, we have demonstrated a solution that integrates data analytics, visualization and intelligence. We emphasized an open, cooperative and multidisciplinary approach to increase the usability of the system. Developing a personalized system for a wide variety of users and incorporating effective use of information visualization is one of the major challenges for future research. Solutions to these challenges are also rooted in an understanding of visual perception. Understanding the user's needs and selecting a suitable visual display is the biggest challenge. The immediate enhancement to data visualization would be the separation and representation of string attributes by taking frequency counts of their occurrences or by using other effective measures. Our current research focus is to solve multidisciplinary business problems using a similar approach. A major work item is the development of a Privacy Protection Architecture. Today, organizations are trying to incorporate customer-driven innovations; the aim is to provide better products and personalized services. This involves a learning process in which organizations continuously capture data and learn from user behavior. Unfortunately, neither organizations nor users are equipped to act effectively in deciding what information they should protect and what they should reveal. This situation is creating challenging problems regarding customer data security, privacy and trust. We are researching how independent IDAC agents can be effectively integrated with machine learning methods to classify and protect sensitive data. Use of the visualization module to characterize uncertainty, ambiguity, and behavioral biases in the privacy protection mechanism will be the major milestone of our research.

VI. CONCLUSION

In this paper, we presented the Intelligent Data Analytics Console, a user centric platform that effectively integrates data analytics, visualization and intelligence. The proposed architecture demonstrates an open, cooperative and multidisciplinary approach to developing data analytics software that acts as an assistant to users rather than a tool. Using the Visual Evidence technique we illuminate a demanding approach: providing recommendations based on the data and results obtained at a particular instant. As shown through a series of best practices and usability evaluations, IDAC substantially reduces usability barriers. With our approach, researchers and enterprises can generate data analytics components that can be used readily in software applications.
ACKNOWLEDGMENT

We would like to thank TRDDC scientists Prof. Harrick Vin, Mr. Subrojyoti Roy Chaudhury, Dr. Savita Angadi and Mr. Niranjan Pedanekar for their constant support and guidance. We appreciate the contribution of the WEKA developers to the open source community; your pioneering research motivated and guided us. This research is a part of the Systems Research Lab, TATA Research Development and Design Center (TRDDC), TATA Consultancy Services.

REFERENCES

[1] G. Eason, B. Noble, and I. N. Sneddon. "On certain integrals of Lipschitz-Hankel type involving products of Bessel functions." Phil. Trans. Roy. Soc. London, vol. A247, pp. 529-551, April 1955.
[2] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, D.C., 1993.
[3] David W. Aha. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. Int. J. Man-Mach. Stud., 36(2):267-287, 1992.
[4] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11-73, 1997.
[5] Petr Aubrecht, Filip Zelezny, Petr Miksovsky, and Olga Stepankova. Sumatra TT: Towards a universal data preprocessor.
[6] Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard Kirkby, Peter Reutemann, Alex Seewald, and David Scuse. Weka manual for version 3-4, 2007.
[7] Christopher J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.
[8] Tom Chau and Andrew K.C. Wong. Pattern discovery by residual analysis and recursive partitioning. IEEE Transactions on Knowledge and Data Engineering, 11(6):833-852, 1999.
[9] Kumar Chellapilla and Patrice Y. Simard. Using machine learning to break visual human interaction proofs (HIPs). In NIPS, 2004.
[10] W. S. Cleveland. Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A., 1993.
[11] Bill Curtis, Herb Krasner, and Neil Iscoe. A field study of the software design process for large systems. Commun. ACM, 31(11):1268-1287, 1988.
[12] Gilbert. JFreeChart. http://www.jfree.org/jfreechart/, 2006.
[13] Frank Dellaert. The expectation maximization algorithm. Technical Report GIT-GVU-02-20, February 2002.
[14] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. of the 13th IJCAI, pages 1022-1027, Chambery, France, 1993.
[15] Stephen Few. Show Me the Numbers: Designing Tables and Graphs to Enlighten. Analytics Press, September 2004.
[16] Simon Fischer, Ingo Mierswa, Ralf Klinkenberg, and Oliver Ritthoff. Developer tutorial: Yale - Yet Another Learning Environment, 2006.
[17] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, 1996.
[18] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.
[19] Snehal Gaikwad, Mihir Kedia, and Aashish Jog. Data analytics and visualization. Technical report, TATA Research Development and Design Center - TCS, 2007.
[20] Mark A. Hall and Lloyd A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, 1999.
[21] Mauricio A. Hernández and Salvatore Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2:9-37, 1998.
[22] Robert C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, pages 63-91, 1993.
[23] Clare-Marie Karat. Cost-benefit analysis of iterative usability testing. In INTERACT '90: Proceedings of the IFIP TC13 Third International Conference on Human-Computer Interaction, pages 351-356, Amsterdam, The Netherlands, 1990. North-Holland Publishing Co.
[24] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
[25] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In ML92: Proceedings of the Ninth International Workshop on Machine Learning, pages 249-256, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
[26] Michel Liquiere and Jean Sallantin. Structural machine learning with Galois lattice and graphs. In Proc. of the 1998 Int. Conf. on Machine Learning (ICML'98), pages 305-313. Morgan Kaufmann, 1998.
[27] Luís Torgo and João Gama. Regression using classification algorithms. Intelligent Data Analysis, 1:275-292, 1997.
[28] J. I. Maletic, A. Marcus, and M. L. Collard. A task oriented view of software visualization. Pages 32-40, 2002.
[29] A.I. McLeod and S.B. Provost. Multivariate data visualization, January 2001.
[30] Ingo Mierswa, Ralf Klinkenberg, Simon Fischer, and Oliver Ritthoff. A flexible platform for knowledge discovery experiments: Yale - Yet Another Learning Environment. Univ. of Dortmund, 2003.
[31] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[32] D. J. Murdoch and E. D. Chow. A graphical display of large correlation matrices. The American Statistician, 50(2):173-178, 1996.
[33] Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining, 1994.
[34] D. L. Parnas and P. C. Clements. A rational design process: How and why to fake it. IEEE Trans. Softw. Eng., 12(2):251-257, February 1986.
[35] Roger S. Pressman. Software Engineering: A Practitioner's Approach. McGraw-Hill Higher Education, 2000.
[36] Matthias Rauterberg. An iterative-cyclic software process model, 1992.
[37] R. Roberts. AI32 - Guide to Weka, March 2005. http://www.comp.leeds.ac.uk/andyr
[38] Ross J. Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, January 1993.
[39] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 2000.
[40] Kathy Walrath, Mary Campione, Alison Huml, and Sharon Zakhour. The JFC Swing Tutorial: A Guide to Constructing GUIs, Second Edition. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 2004.
[41] Edward J. Wegman. Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association, 85(411):664-675, 1990.
[42] S.M. Weiss and C.A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
[43] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999.
[44] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, June 2005.
[45] Norman Sadeh. Class notes, Mobile Pervasive Computing.
ADCOM 2009
COMPUTATIONAL
BIOLOGY
Session Papers:
3. S.Das, S.M. Idicula, "Modified Greedy Search Algorithm for Biclustering Gene Expression Data".
Digital Processing of Biomedical Signals with
Applications to Medicine
D. Narayana Dutt
Department of Electrical Communication Engineering
Indian Institute of Science, Bangalore-560012, India
dndutt@ece.iisc.ernet.in
Abstract— We consider here digital processing of Electroencephalogram (EEG) and Heart Rate Variability (HRV) signals with applications to psychiatry and cardiac care. First we consider the application to EEG. The conventional analysis of EEG signals cannot distinguish between schizophrenia patients and normal healthy individuals. In this work, a graph theoretic approach is used to derive various parameters from EEG. It is found that schizophrenia patients can not only be differentiated from healthy subjects but can also be grouped into various subgroups of schizophrenia. SVM based automation gives us a clear way to classify an individual into his particular group or subgroup, with an accuracy of more than 90%. Next we present a neural network based approach to classification of supine vs. standing postures and normal vs. abnormal cases using heart rate variability (HRV) data. We have chosen ten features for the network inputs. Four classification algorithms have been compared.

Index terms— EEG, Graph theory, Schizophrenia, Connectivity, Support vector machine (SVM), HRV, Neural network

I. INTRODUCTION

The brain consists of billions of neurons, the basic data processing units, which are closely interconnected via axons and dendrites to form a large network. A complex system like the brain can be described as a complex network of decision making nodes. In the brain, the network is composed of these neuronal units linked by synaptic connections. The activity of these neuronal units gives rise to dynamic states that are characterized by specific patterns of neuronal activation and co-activation. The characteristic patterns of temporal correlations emerge as a result of functional interactions within a structured neuronal network.

The brain is inherently a dynamic system in which the correlation between regions, during behavior or even at rest, is created and reshaped continuously. Many experimental studies suggest that perceptual and cognitive states are associated with specific patterns of functional connectivity. These connections are generated within and between large populations of neurons in the cerebral cortex. An important goal in computational neuroscience is to understand these spatiotemporal patterns of complex functional networks of correlated brain activity measured in terms of EEG recordings.

Similarities between EEG time-series are commonly quantified using linear techniques, in particular estimates of temporal correlation in the time domain or coherence in the frequency domain. Temporal correlations are the most often used to represent and quantify patterns in neuronal networks and are represented as a correlation matrix. Functional connectivity refers to the patterns of temporal correlations that exist between distinct neuronal units; such temporal correlations are often the result of neuronal interactions along anatomical or structural connections. Thus, the correlation matrix may be viewed as a representation of the functional connectivity of the brain system.

Schizophrenia is a worldwide prevalent disorder with a multi-factorial but highly genetic aetiology; it is a multiple gene disorder. Schizophrenia is a complex and widespread disorder giving rise to a great burden of suffering and impairment to both patients and their families. By looking at the EEG it is very difficult to ascertain and diagnose whether an individual is suffering from the disease. It is all the more difficult to find to which type of group a subject belongs without proper clinical analysis by a physician [1]. In this work the focus is on the use of graph theoretical methods, together with statistical methods and techniques, to identify individuals suffering from schizophrenia and also to classify them into the various groups of schizophrenia.

The Electrocardiogram describes the electrical activity of the heart, and the rhythmic behavior of the heart is studied by analyzing heart rate time series. Heart Rate Variability (HRV) has been accepted as a quantitative marker to study the regulatory mechanisms of the autonomic nervous system. Understanding the physiology of heart rate dynamics is very important in treating cardiac diseases that are chronic as well as life threatening, and allows clinicians to devise improved treatment methodologies. The importance of HRV was first discovered when unborn infants were attached to cardiac sensors in utero. The use of HRV in the study of cardiac disorders is very popular, and HRV analysis has gained importance since many studies [2] have shown low HRV to be a strong indicator for predicting cardiac mortality and sudden cardiac death. Several studies suggest that patients with anxiety and depression are at a higher risk of significant cardiovascular mortality and sudden death [3]. In view of these works, we have considered the application of neural networks to the classification of supine vs. standing postures and normal vs. abnormal cases using heart rate variability data. Four classification algorithms have been compared, viz., k-nearest neighbors, Radial Basis Function (RBF) networks, Support vector machines (SVM) and back propagation networks, for different sets of features.

II. MATERIALS AND METHODS

A. Data acquisition

The EEG signals were recorded from 53 subjects (25 control and 28 schizophrenic subjects, of whom 21 were female and 32 were male) using 32 scalp electrodes. The Neuroscan machine was used for the recording. The signal data were analog band pass filtered with cutoff frequencies 0.5 to 70 Hz and then sampled at the rate of 256 Hz with a resolution of 12 bits. All electrodes were referenced to A2. The subjects were instructed to close their eyes and rest for some time before the collection of data was carried out. Subjects chosen for this analysis were old established cases of schizophrenia. It was ensured that the subjects were not under the influence of any medication during the collection of data. The 28 channels were placed according to the 10-20 international standard of electrode placement. The data were band
pass filtered between 0.5 and 38 Hz, which included the lower gamma frequency range and eliminated 50 Hz line noise.

B. Formation of network graphs

The network connectivity can be studied using graph theory methods. A graph is represented by nodes (vertices) and connections (edges) between the nodes. In a graph, vertices are denoted by black dots and a line is drawn between the dots if the two vertices are connected. The complexity of a graph can be characterized by many measures, such as the cluster coefficient and the characteristic path length. The cluster coefficient is a measure of the local interconnectedness of the graph and the characteristic path length is an indicator of its overall connectedness. More literature on graph theory may be found elsewhere ([1, 2] deal with graph theory in relation to brain connectivity).

The temporal correlation between any two EEG time series x1(t) and x2(t) at two scalp electrodes ranges between 0 and 1. To allow mathematical analysis, we represent the neuronal network activity patterns as graphs as follows. In this study the vertices of the graph represent the positions of the electrode placements on the scalp. Let the number of electrodes used for EEG recording be N; hence there are N vertices.

The correlations between all pair-wise combinations of EEG channels are computed to form a square matrix R of size N, where N is the total number of EEG channels. Each entry rij in the matrix R, 1<=i,j<=N, is the correlation value for the channels i and j, with 0<=|rij|<=1 and |rij|=1 for i=j, where |.| denotes absolute value. The correlation matrix R is converted into a binary matrix, called the adjacency matrix, by applying a threshold. A network graph is constructed using this matrix, with N vertices and an edge between two nodes if the correlation between them exceeds the threshold. Thus, if the correlation rij between a pair of channels i and j exceeds the threshold value, an undirected edge is said to exist between the vertices i and j. All the edges are given the same cost, equal to unity.

C. Measures of a graph

Graph theory is one of the latest tools being used for the analysis of brain connectivity. The following are the measures of a graph used in this analysis, for which a brief description is essential.

1)Average connection density
The average connection density kden of an adjacency matrix A is the number of all its nonzero entries divided by the maximal possible number of connections. Thus, 0<=kden<=1. The sparser the graph, the lower its connection density [4,5].

2)Complexity
Complexity captures the extent to which a system is both functionally segregated and functionally integrated (large subsets tend to behave coherently). The statistical measure of neural complexity CN(X) takes into account the full spectrum of subsets. CN(X) can be derived either from the ensemble average of the mutual information between subset sizes or, equivalently, from the ensemble average of the mutual information between subsets of a given size (ranging from 1 to n/2) and their complement [6,7,8].

3)Characteristic path length
Within a digraph, a path is defined as any ordered sequence of distinct vertices and edges that links a source vertex j to a target vertex i. The distance matrix Dij describes the distance from vertex j to vertex i, that is, the length of the shortest valid path linking them. The average of all entries of Dij has been called the "characteristic path length" [8,9,10], denoted lpath.

4)Cluster Index
The clustering coefficient is a measure of the local interconnectedness of the graph, whereas the path length is an indicator of its overall connectedness. The clustering coefficient Ci for a vertex vi is the proportion of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. For an undirected graph the edge eij between two nodes i, j is considered identical to eji. Therefore, if a vertex vi has ki neighbors, ki(ki-1)/2 edges could exist among the vertices within the neighborhood. Using this information, the clustering coefficient for undirected graphs can be calculated. The clustering coefficient for the whole graph is given by Watts and Strogatz as the average of the clustering coefficients of all vertices.

D. Neural network methods

The neural network methods implemented in our work are explained below.

1)Multilayer Perceptrons: A back propagation network or multilayer perceptron (MLP) consists of at least three layers of units: an input layer, at least one intermediate hidden layer, and an output layer. The units are connected in a feed-forward fashion. With back propagation networks, learning occurs during a training phase. After a back propagation network has learned the correct classification for a set of inputs, it can be tested on a second set of inputs to see how well it classifies untrained patterns [11].

2)Radial-Basis Function Networks: Radial basis function networks are also feed forward, but have only one hidden layer. RBF hidden layer units have a receptive field with a centre, that is, a particular input value at which they have a maximal output. Their output tails off as the input moves away from this point. Generally, the hidden unit function is a Gaussian [11].

3)Support Vector Machines: The support vector machine is a popular technique for classification. Given a training set of instance-label pairs (xi, yi), i = 1, ..., l, where xi ∈ Rn and yi ∈ {1, -1}, the SVM requires the solution of an optimization problem. The training vectors xi are mapped into a higher dimensional space by a function φ. The SVM then finds a linear separating hyperplane with the maximal margin in this higher dimensional space. K(xi, xj) = φ(xi)Tφ(xj) is called the kernel function [11].

4)K-Nearest Neighbour Classifier: Among statistical approaches, a k-nearest neighbour classifier (KNNC) was selected because it does not assume any underlying distribution of the data. In the k-nearest neighbour rule, a test sample is assigned the class most frequently represented among the k nearest training samples [12].

E. Features Considered

This section describes the features considered for classifying HRV data.
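Before detailing the individual features: the time-domain statistics of equations (1)-(2) below and the Poincare measures of equations (3)-(4) reduce to a few lines of arithmetic on an RR-interval series. A minimal sketch follows; the `rr` series is illustrative, not data from this study.

```python
# Sketch of the time-domain and Poincare HRV features of equations (1)-(4).
# SDRR and SDSD are population standard deviations of the RR series and of
# the successive-difference series; SD1 and SD2 follow from them.

def sdrr(rr):
    """Eq. (1): sqrt(E[RR_n^2] - RR_mean^2), the std of the RR intervals."""
    mean = sum(rr) / len(rr)
    return (sum(x * x for x in rr) / len(rr) - mean * mean) ** 0.5

def sdsd(rr):
    """Eq. (2): the same statistic applied to dRR_n = RR_n - RR_{n+1}."""
    drr = [a - b for a, b in zip(rr, rr[1:])]
    return sdrr(drr)

def poincare_sd1_sd2(rr):
    """SD1 = SDSD / sqrt(2); SD2 from eq. (4): SD2^2 = 2*SDRR^2 - SDSD^2/2.
    Together they satisfy eq. (3): SD1^2 + SD2^2 = 2*SDRR^2."""
    s, d = sdrr(rr), sdsd(rr)
    sd1 = d / 2 ** 0.5
    sd2 = (2 * s * s - d * d / 2) ** 0.5
    return sd1, sd2

rr = [812, 797, 805, 821, 809, 798, 816, 803]  # RR intervals in ms (made up)
sd1, sd2 = poincare_sd1_sd2(rr)
```

The identity of equation (3) is a useful self-check: computing SD1 and SD2 this way always reproduces 2·SDRR² up to floating-point error.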
1)Fractal Dimension (FD): Katz's approach is used to calculate the FD of a waveform [13].

2)Complexity measure: The Lempel-Ziv complexity measure C(n) [14] is used in our work, since it is extremely well suited for characterizing the development of spatio-temporal activity patterns in high-dimensionality nonlinear systems.

3)Time-domain features of HRV: We assume that only a finite number of intervals are available. For 4-10 min segments, wide-sense stationarity may be assumed. The standard deviation of the RR intervals (SDRR) is defined as the square root of the variance of the RR intervals:

SDRR = √(E[RRn²] − RRmean²) ……(1)

The standard deviation of the successive differences of the RR intervals (SDSD) is defined as the square root of the variance of the sequence ∆RRn = RRn − RRn+1 (the ∆RR intervals):

SDSD = √(E[∆RRn²] − ∆RRmean²) ……(2)

4)Non-linear features: The two non-linear features, which are obtained from the Poincare plot [15], are given below:

SD1: The SD1 measure of Poincare width is equivalent to the standard deviation of the successive differences of the intervals, except that it is scaled by 1/√2.

SD2: The SD2 measure of the length of the Poincare cloud is related to the autocovariance function.

SD1² + SD2² = 2·SDRR² ……(3)

SD2² = 2·SDRR² − (1/2)·SDSD² ……(4)

It can be argued that SD2 reflects the long-term HRV. The width of the Poincare plot correlates extremely highly with other measures of short-term HRV.

5)Frequency-Domain features: The frequency-domain features considered are the powers in the ultra low frequency, very low frequency, low frequency and high frequency ranges.

III. RESULTS AND DISCUSSION

A. Results for EEG Data

The collection of data was done for a period of three minutes on average. For every subject, the connectivity matrices were found for every 2 second epoch over the entire duration of three minutes. Parameters were calculated for each such epoch for all the graphs generated. Once calculation of the parameters for all

In this study, statistical means of the parameters were used to come to the conclusions. For finding the parameters in the different bands of the EEG signals, the EEG was filtered according to the frequency band being investigated. An initial glance at the parameters will not reveal much information for classification, but on detailed analysis and study of various plots we can see that the subjects occupy different Euclidean spaces. This property can thus be used for classification of subjects.

The work involved a detailed study of the entire band of EEG signals (i.e. 0.5 to 38 Hz) and also a study of the different bands of EEG comprising Delta, Alpha-1, Alpha-2, Theta and Beta. In all six cases the parameters were extracted, tabulated, their statistical means generated and listed for each subject. The results were plotted in 2D and 3D Euclidean space. These plots were examined for use in classification.

The study was successful not only in identifying subjects suffering from schizophrenia but also in classifying them into various groups within schizophrenia. To further develop the work into a working model, a machine learning approach was used to see if the parameters could be used to classify automatically. In the present work a Support Vector Machine was used successfully to identify schizophrenia patients and also classify them with an accuracy of more than 90%.

Before concluding our work with support vector machines, the algorithm proposed for the successful identification of schizophrenia is the isolation of the normal healthy subjects, so that subjects are tested for positivity of schizophrenia. Once a person tests positive, it is required to find to which subgroup the person belongs. The proposed algorithm covers both steps and provides a way to detect the subgroups as well.

1)Detection of Non-Schizophrenic subjects
During the analysis of the full band and various sub bands of EEG for the study of parameters, it was found that the 3D plots of complexity, cluster index and characteristic path length could easily be used to separate the normal healthy subjects from the schizophrenic patients, as shown in figure 1(a). The alpha 1 and alpha 2 band plots are in figures 1(b) and 1(c) respectively. It can be seen that normal healthy subjects can be easily distinguished from schizophrenic subjects.

2)Identification of Sub groups in Schizophrenic subjects
After identification of a schizophrenic patient, classifying them into a particular subgroup is a very important and crucial task. Identification of mixed schizophrenic subjects: it was observed that by just plotting the connection density with complexity and characteristic path length, and the cluster index with characteristic path length and the other parameters, we could identify the mixed schizophrenic subjects without much
the subjects for 3 minutes duration was completed mean of each problems. This can be seen from plots as shown in figure 2.
parameter was taken for each subject.In this study the plots of
statistical means of parameters are used to come to conclusions. The plots show the three kinds of plots complexity Vs
The plots show the three kinds of plots like complexity Vs connection density, characteristic path length Vs connection
connecty etc.. Similar observations can also be seen in Delta density, characteristic path length Vs cluster index for full EEG
band as well. band and for alpha1, alpha2 and theta bands respectively.
Similar observations can also be seen in Delta band as well. The
71
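The graph-theoretic parameters used in this study (characteristic path length and the cluster index, i.e. the average clustering coefficient) can be computed directly from a thresholded binary connectivity matrix. A minimal self-contained sketch follows; the toy adjacency matrix is invented for illustration, whereas the paper derives its matrices from multichannel EEG epochs:

```python
from collections import deque

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected, distinct node pairs
    of an undirected binary graph given as an adjacency matrix."""
    n = len(adj)
    total = pairs = 0
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # breadth-first search from s
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(d for v, d in dist.items() if v != s)
        pairs += len(dist) - 1
    return total / pairs

def cluster_index(adj):
    """Average clustering coefficient: the fraction of a node's neighbour
    pairs that are themselves connected, averaged over all nodes."""
    n = len(adj)
    coeffs = []
    for u in range(n):
        nbrs = [v for v in range(n) if adj[u][v]]
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(adj[a][b] for i, a in enumerate(nbrs) for b in nbrs[i + 1:])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / n

# Toy 4-node connectivity matrix: a triangle (0,1,2) with node 3 attached to 0.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
print(characteristic_path_length(A))   # 16/12 ≈ 1.33
print(cluster_index(A))                # 7/12 ≈ 0.58
```

In practice the correlation matrix of each EEG epoch would be thresholded to obtain such a binary graph, and the measures averaged per band and per subject.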
The plots clearly show that the mixed schizophrenic subjects can be isolated from the other two subgroups. Though the figures are shown for four bands, even in the delta band the mixed schizophrenia subjects show distinct characteristics which can easily be distinguished from the other two subgroups.

Identification of the Positive and Negative Schizophrenic subjects: After segregation of the mixed schizophrenic subjects, we have two important subgroups left to be detected. For this we plot the 2D and 3D plots shown in figure 3. In these plots we can see that the positive and negative schizophrenic subjects form distinct clusters which can easily be identified. Thus the classification of schizophrenic patients is complete.

3) Proposed SVM based classification using multiband results
From the above plots it becomes amply clear that schizophrenia subjects can be classified using some type of classifier in higher dimensions. Thus, an SVM based classifier can be so chosen that it gives a classification accuracy of more than 90%. By using multiband classifiers with multiple kernels in tandem we can further reduce the errors in detection and classification. By testing various kernels it can be seen that, when multiple kernels are used, the accuracy approaches almost 100%. Thus the algorithm shown in Figure 4 is proposed for identification and classification of schizophrenia subjects. For the algorithm to function effectively, a database of established schizophrenic patients is required. The EEG of subjects from the three subgroups of schizophrenia could be taken and analyzed as explained in the previous sections. The SVM modules then run based on the parameters. We have calculated the errors committed while predicting, and the location of the errors in the data is given for the various bands. We have observed that the errors can be reduced further using multiple kernels, thus increasing the efficiency. Once a patient is classified, the result can be updated into the initial database used in the SVM blocks. Prediction of more than 95% could be achieved using different kernels.

The work presented here applies the graph theory approach to identify schizophrenia patients from normal healthy individuals and also to classify them into their subgroups. It is well known that schizophrenia patients are difficult to identify from the EEG itself. At present there is no way to identify the various subgroups of schizophrenia except for physicians' counseling, which consumes physicians' valuable time. Schizophrenia is not a problem of any single portion of the human brain but a problem of the entire brain. Hence a technique that uses the entire multichannel EEG data at a time for analysis, like the graph theory approach, should be used. Thus, the approach followed in this work is much better suited for analyzing the EEG and possibly identifying the patients suffering from schizophrenia.

During the study it became clear that there is a distinct difference between the connectivity of patients suffering from schizophrenia and that of normal healthy control individuals. The formation of clusters by the various groups and subgroups in 2D and 3D Euclidean space has been exploited by using SVM based classifiers. The study encourages us to see if graph theory parameters could be used further for identifying other brain disorders which are otherwise difficult to diagnose using existing forms of diagnosis. This study gives us a relatively cheap, noninvasive tool to preliminarily classify individuals who can later be clinically analyzed by a physician in detail. By far this is one of the first ways for successful identification and also classification of patients suffering from schizophrenia.
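The multi-kernel SVM voting scheme described in this section can be illustrated as follows. This is not the authors' implementation: the synthetic feature clusters, the kernel list and the majority-vote rule are all assumptions used only to show the idea of combining several kernels for the same classification task:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-band graph parameters (e.g. complexity,
# cluster index, characteristic path length) of two subject groups.
controls = rng.normal(0.0, 0.3, size=(40, 3))
patients = rng.normal(1.5, 0.3, size=(40, 3))
X = np.vstack([controls, patients])
y = np.array([0] * 40 + [1] * 40)

# One SVM per kernel; a simple majority vote combines their predictions.
kernels = ("linear", "rbf", "poly")
models = [SVC(kernel=k).fit(X, y) for k in kernels]
votes = np.stack([m.predict(X) for m in models])
pred = (votes.mean(axis=0) >= 0.5).astype(int)
print((pred == y).mean())   # well-separated toy clusters: accuracy near 1.0
```

With real data, each band's parameter vectors would replace the synthetic clusters, and accuracy would be estimated on held-out subjects rather than on the training set as done here for brevity.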
[Figures: 1(c) Alpha 2 band; 2(a)(i), 2(b)(i), 2(c)(i), 2(d)(i), 2(c)(ii), 2(d)(ii)]
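The HRV descriptors SDRR, SDSD, SD1 and SD2 defined in equations (1)-(4) can be computed directly from an RR-interval series. A minimal sketch with an invented series (illustrative only, not the paper's code):

```python
import numpy as np

def hrv_features(rr):
    """Time-domain and Poincare HRV descriptors of an RR-interval series."""
    rr = np.asarray(rr, dtype=float)
    sdrr = rr.std()                        # eq. (1): sqrt(E[RR^2] - RR_mean^2)
    drr = rr[:-1] - rr[1:]                 # successive differences ΔRRn
    sdsd = drr.std()                       # eq. (2)
    sd1 = sdsd / np.sqrt(2)                # Poincare width, scaled by 1/sqrt(2)
    sd2 = np.sqrt(max(2 * sdrr**2 - 0.5 * sdsd**2, 0.0))   # eq. (4)
    return sdrr, sdsd, sd1, sd2

rr = [0.80, 0.82, 0.79, 0.85, 0.83, 0.81, 0.84, 0.80]   # seconds, invented
sdrr, sdsd, sd1, sd2 = hrv_features(rr)
print(sd1**2 + sd2**2, 2 * sdrr**2)        # eq. (3): the two sides agree
```

Note that equation (3), SD1² + SD2² = 2SDRR², holds by construction once SD1 and SD2 are defined as above.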
Table 2. Testing accuracy (%) for HRV data (normal and abnormal cases)

Features Considered                                   KNNC    MLP     RBF     SVM
Mean, Variance                                        79.82   87.83   89.44   90.57
Fractal Dimension (FD), Complexity Measure (CM)       70.69   83.66   84.27   88.19
Mean, Variance, FD, CM                                79.82   87.83   89.44   90.57
Frequency Domain Features                             77.59   89.74   89.44   91.20
Mean, Variance, FD, CM & Frequency Domain Features    67.83   89.74   89.74   91.20

From the above Tables 1 and 2, we can observe the following:
(i) The obtained accuracies for the back propagation network, the RBF network units and the support vector machine with RBF kernel are higher than for k-nearest neighbours. This is because, in the case of the data used in this work, different units bring different amounts of information, whereas the k-nearest neighbours algorithm computes the Euclidean distances between the classified vector and some other vectors. Therefore, this algorithm cannot take into consideration the fact that different inputs bring different amounts of information, which is a natural feature of neural networks.
(ii) The features SDRR, SDSD, SD1 and SD2 play an important role in the classification of normal vs abnormal cases.
(iii) The features fractal dimension and complexity measure do not seem to play a significant role in the classification problems.
(iv) Heart rate variability is reduced in individuals with anxiety and depressive disorders, and hence we are able to get accuracy around 90% by using either the features SDSD, SDRR, SD1 and SD2 or the frequency features alone.
(v) In the case of classification of supine and standing postures, frequency features are significant. Spectral analysis of HR data shows a relative increase in low frequency power and a decrease in high frequency power from the supine to the standing posture. These changes are attributed to a predominance of sympathetic activity and vagal withdrawal in the standing posture.

The second part of this paper, on HRV, has presented a neural-network-based approach to classifying HRV data into normal and abnormal cases and supine and standing postures. We are able to correctly classify 106 out of 116 cases corresponding to normal and abnormal subjects, and in the case of supine and standing subjects we are able to classify 105 out of 116 correctly. We compared the conventional KNNC method with three kinds of neural networks (MLP, RBF and SVM). The obtained accuracies for the back propagation network, the RBF network units and the support vector machine with RBF kernel are higher than for k-nearest neighbours. Among the neural network methods, SVM gives better performance than the MLP and RBF networks. The RBF network, in general, gives better performance than the MLP network. In the case of classification of supine and standing postures, frequency features are significant, and heart rate variability is reduced in individuals with anxiety and depressive disorders. Improved classification can be achieved by taking a larger training set.

ACKNOWLEDGEMENTS
The author would like to thank Dr. John P. John, National Institute of Mental Health and Neuro Sciences (NIMHANS), Bangalore, for providing the necessary EEG data for this study. The author would also like to thank Maj. Kiran Kumar and Mr. Mutyalaraju for their help in the development of programs.

REFERENCES
[1] K. Sim, T.H. Chua, Y.H. Chan, R. Mahendran, S.A. Chong, "Psychiatric comorbidity in first episode schizophrenia: a 2 year, longitudinal outcome study", J. Psychiatr. Res., vol. 40(7), pp. 656-663, 2006.
[2] M. Galinier, S. Boveda, A. Pathak, J. Fourcade, B. Dongay, "Intra-individual analysis of instantaneous heart rate variability", Crit. Care Med., vol. 28(12), pp. 3939-3940, 2000.
[3] D.L. Musselman, D.L. Evans, C.B. Nemeroff, "The relationship of depression to cardiovascular disease", Arch. Gen. Psychiatry, vol. 55, pp. 580-592, 1998.
[4] A.R. McIntosh, M.N. Rajah, N.J. Lobaugh, "Interactions of prefrontal cortex in relation to awareness in sensory learning", Science, vol. 284, pp. 1531-1533, 1999.
[5] G. Tononi, O. Sporns, G.M. Edelman, "A measure for brain complexity: Relating functional segregation and integration in the nervous system", Proc. Natl. Acad. Sci. USA, vol. 91, pp. 5033-5037, 1994.
[6] O. Sporns, G. Tononi, G.M. Edelman, "Theoretical neuroanatomy: Relating anatomical and functional connectivity in graphs and cortical connection matrices", Cerebral Cortex, vol. 10, pp. 127-141, 2000.
[7] O. Sporns, G. Tononi, G. Edelman, "Connectivity and complexity: The relationship between neuroanatomy and brain dynamics", Neural Networks, vol. 13, pp. 909-922, 2000.
[8] O. Sporns, G. Tononi, "Classes of network connectivity and dynamics", Complexity, vol. 7, pp. 28-38, 2002.
[9] D.J. Watts, S.H. Strogatz, "Collective dynamics of 'small-world' networks", Nature, vol. 393, pp. 440-442, 1998.
[10] D.J. Watts, Small Worlds, Princeton, NJ: Princeton University Press, 1999.
[11] S. Haykin, Neural Networks, second edition, New York: Pearson Education, 2002.
[12] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, second edition, New York: John Wiley and Sons, 2001.
[13] M. Katz, "Fractals and the analysis of waveforms", Comput. Biol. Med., vol. 18, pp. 145-156, 1988.
[14] X.-S. Zhang, R.J. Roy and E.W. Jensen, "EEG complexity as a measure of depth of anaesthesia", IEEE Trans. Biomed. Eng., vol. 48, pp. 1424-1433, 2001.
[15] M. Brennan, M. Palaniswami and P. Kamen, "Do existing measures of Poincare plot geometry reflect nonlinear features of heart rate variability?", IEEE Trans. Biomed. Eng., vol. 48, no. 11, pp. 1342-1347, 2001.
Supervised Gene Clustering for Extraction of
Discriminative Features from Microarray Data
Chandra Das #1, Pradipta Maji *2, Samiran Chattopadhyay $3
# Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, 700 152, India
1 chandradas@hotmail.com
* Machine Intelligence Unit, Indian Statistical Institute, Kolkata, 700 108, India
2 pmaji@isical.ac.in
$ Department of Information Technology, Jadavpur University, Kolkata, 700 092, India
3 samiranc@it.jusl.ac.in
Abstract— Among the large number of genes presented in microarray data, only a small fraction of them are effective for performing a certain diagnostic test. However, it is very difficult to identify these genes for disease diagnosis. A clustering method is able to group a set of genes based on their interdependence and allows a small number of genes within or across the groups to be selected for analysis that may contain useful information for sample classification. In this regard, a new supervised gene clustering algorithm is proposed to cluster genes from microarray data. The proposed method directly incorporates the information of response variables in the grouping process for finding such groups of genes, yielding a supervised clustering algorithm for genes. Significant genes are then selected from each cluster depending on the response variables. The average expression of the selected genes from each cluster acts as a representative of that particular cluster. Some significant representatives are then taken to form the reduced feature set that can be used to build classifiers with very high classification accuracy. To compute the interdependence among the genes as well as the gene-class relevance, mutual information is used, which is shown to be successful. The performance of the proposed method is described based on the predictive accuracy of the naive Bayes classifier, K-nearest neighbor rule, and support vector machine. The proposed method attains 100% predictive accuracy for all data sets. The effectiveness of the proposed method, along with a comparison with existing methods, is demonstrated on three microarray data sets.

I. INTRODUCTION

Recent advancement of microarray technologies has made the experimental study of gene expression data faster and more efficient. Microarray techniques, such as the DNA chip and the high density oligonucleotide chip, are powerful biotechnologies because they are able to record the expression levels of thousands of genes simultaneously. The vast amount of gene expression data leads to statistical and analytical challenges, including the classification of the dataset into correct classes. So, an important application of gene expression data in functional genomics is to classify samples according to their gene expression profiles, such as to classify cancer versus normal samples or to classify different types of cancer [1], [2].

Both supervised and unsupervised classifiers have been used to build classification models from microarray data. This study addresses the supervised classification task where data samples belong to a known class. So, the outcomes are sample classes and the input features are measurements of genes for gene array-based sample classification. However, the major problem of microarray gene expression data-based sample classification is the huge number of genes compared to the limited number of samples. Most classification algorithms suffer from such a high dimensional input space. Furthermore, most of the genes in arrays are irrelevant to sample distinction. These genes may also introduce noise and decrease prediction accuracy. In addition, a biomedical concern for researchers is to identify the key "marker genes" which discriminate samples for class diagnoses. Therefore, gene selection is crucial for sample classification in medical diagnostics, as well as for understanding how the genome as a whole works.

One way to identify these marker genes is to use clustering methods [3], which partition the given genes into distinct subgroups, so that the genes (features) within a cluster are similar while those in different clusters are dissimilar. A small number of top-ranked genes from each cluster are then selected or extracted based on some evaluation criterion to constitute the resulting reduced subset. As a first approach, unsupervised clustering techniques have been widely applied to find groups of co-regulated genes in microarray data. The most prevalent approaches of unsupervised clustering include (i) hierarchical clustering [4], (ii) K-means clustering [5], and (iii) clustering through Self-Organizing Maps [6]. However, these algorithms usually fail to reveal functional groups of genes that are of special interest in sample classification. This is because genes are clustered by similarity only, without using any information about the experiment's response variable.

This problem is solved by supervised clustering. Supervised clustering is defined as the grouping of variables (genes), controlled by information about the class variables, for example the tumor types of the tissues. Previous work in this field encompasses tree harvesting [7], a two step method which consists first of generating numerous candidate groups by unsupervised hierarchical clustering. Then, the average expression profile of each cluster is considered as a potential
input variable for a response model, and the few gene groups that contain the most powerful information for sample discrimination are identified. In this work, only the second step makes the clustering supervised, as the selection process relies on external information about the class variable types. But it also fails to reveal functional groups of genes of special interest in class prediction, because genes are clustered by unsupervised information only.

A supervised clustering approach that directly incorporates the response variables or class variables in the grouping process is the partial least squares (PLS) procedure [8], [9]. The PLS has the drawback that the fitted components involve all (usually thousands of) genes, which makes them very difficult to interpret. Another new promising supervised method [10], like PLS, is a one step approach that directly incorporates the response variables into the grouping process. This supervised clustering algorithm is a combination of gene selection for cluster membership and formation of a new predictor by possible sign flipping and averaging of the gene expressions within a cluster. The cluster membership is determined with a forward and backward searching technique that optimizes the Wilcoxon test based predictive score and margin criteria defined in [10], which both involve the supervised response variables from the data. However, as both the predictive score and margin criteria depend on the actual gene expression values, they are very sensitive to noise or outliers in the data set.

In this paper, a new supervised gene clustering algorithm is proposed. It finds co-regulated clusters of genes whose collective expression is strongly associated with the sample categories or class labels. Mutual information is used here to measure the similarity between genes and the gene-class relevance. The proposed method uses this measure to reduce the redundancy among genes. It involves partitioning of the original gene set into some distinct subsets or clusters so that the genes within a cluster are highly co-regulated with strong association to the response variables or sample categories, while those in different clusters are as dissimilar as possible. A single gene from each cluster having the highest gene-class relevance value is first selected as the initial representative of that cluster. The representative of each cluster is then modified by averaging the initial representative with other genes of that cluster whose collective expression is strongly associated with the sample categories. In effect, the proposed algorithm yields clusters typically made up of a few genes, whose coherent average expression levels allow perfect discrimination of sample categories. After generating all clusters and their representatives, a few representatives are selected according to their class discrimination power and are passed through classification algorithms to classify samples.

To evaluate the performance of the proposed method, different cancer gene expression datasets are used. The performance of the proposed method is studied using the predictive accuracy of the support vector machine, the K-nearest neighbor rule, and the naive Bayes method. The classification accuracy of the proposed method is compared with those yielded by other gene-selection methods. The experimental results demonstrate that the proposed method is more effective.

The rest of this paper is organized as follows: In Section II, a novel feature extraction algorithm is presented. Section III briefly describes the concept of mutual information, which is used to measure the similarity between two genes and the relevance of a gene with respect to class variables. In Section IV, extensive experimental results are discussed, along with a comparison with other related methods. The paper is concluded with a summary in Section V.

II. PROPOSED FEATURE EXTRACTION METHOD

This section presents an algorithm for supervised learning of similarities and interactions among predictor variables for classification in very high dimensional spaces, and hence is predestined for searching functional groups of genes in microarray expression data.

A. Proposed Supervised Clustering Algorithm

The proposed basic stochastic model for microarray data equipped with a categorical response is given by a random pair (ξ, Y) with values in R^m × 𝒴, where ξ ∈ R^m denotes a log-transformed gene-expression profile of a tissue sample, standardized to mean zero and unit variance. Y is the associated response variable, taking numeric values in 𝒴 = {0, 1, ..., K−1}. Here K represents the number of classes. Suppose X represents the gene set, where X = {X1, X2, ..., Xm}.

To account for the fact that not all m genes on the chip, but rather a few functional gene subsets, determine nearly all of the outcome variation and thus the type of a tissue, the whole gene set is partitioned into z functional groups or clusters ({C1, ..., Cz} with z ≪ m). They form a disjoint and usually incomplete partition of the gene set: {∪_{i=1}^{z} Ci} ⊂ {1, ..., m} and Ci ∩ Cj = ∅, i ≠ j. Finally, a representative of every cluster is generated, and among them a few form the reduced feature set. Let ξ_Ci ∈ R denote a representative expression value of gene cluster Ci. There are many possibilities to determine such group values ξ_Ci, but as we would like to shape clusters that contain similar genes, a simple linear combination is an accurate choice:

ξ_Ci = (1/|Ci|) Σ_{g ∈ Ci} δg ξ̃g,  δg ∈ {−1, 1}    (1)

Here ξ̃g represents the expression value of gene Xg. Because of the use of log-transformed, mean-centered and standardized expression data, we, as a novel extension, allow the contribution of a particular gene g to the group value ξ_Ci also to be given by its 'sign-flipped' expression values −Xg. This means that we treat under- and over-expression symmetrically, and it prevents the differential expression of genes with different polarity from cancelling out when they are averaged. Now, we describe the partitioning process of the gene set X into subsets or clusters C = {C1, ..., Cz}.

The proposed clustering method has two phases: (1) a cluster generation phase and (2) a cluster refinement phase. In the cluster generation phase, first the class relevance value of every gene
is calculated. Then the gene with the highest class relevance value, say Xv, is selected as the member of the first cluster C1. Among the remaining genes, the genes whose similarities with Xv are greater than or equal to a user defined threshold α are selected as members of cluster C1. In this way cluster C1 is formed.

The next phase is the cluster refinement phase. In this phase, from cluster C1, at first the gene Xv with the highest class relevance value is chosen. Now, any gene which is present in cluster C1, suppose Xt, is taken and the average expression profile of Xv and Xt is calculated. If the class relevance value of the average expression profile is greater than the class relevance value of Xv, then Xt is selected. The whole process is then repeated: any gene present in cluster C1 except genes Xv and Xt, suppose Xr, is taken and the average expression profile of Xv, Xt and Xr is computed. If the class relevance value of this average expression profile (of genes Xv, Xt, Xr) is greater than the class relevance value of the previous average expression profile (of genes Xv and Xt), then Xr is selected. The whole process is repeated in cluster C1 until there is no unchecked gene. In this way some genes are selected in cluster C1 which increase the class relevance value; the average expression profile of these selected genes is taken, and this acts as the representative for cluster C1. After that, the gene with the highest class relevance value, i.e. Xv in this cluster, and all other genes in this cluster whose similarity with Xv is greater than or equal to β are discarded. After the completion of cluster C1, the gene which has the next highest class relevance value among the genes not selected in any previously created cluster is taken, and the whole clustering process is repeated z times. Here, z is a user defined parameter.

After generating all z clusters and their representatives, the best h cluster representatives are selected according to their class relevance value and are passed through classifier algorithms to measure classification accuracy.

• Proposed Feature (Gene) Extraction Algorithm:

Input: Given an m × n gene expression matrix T = {wij | i = 1, ..., m, j = 1, ..., n}, where wij is the measured expression level of gene Xi in the jth sample, m and n represent the total number of genes and samples respectively. Let X represent the set of genes. Then |X| = m and X = {X1, ..., Xi, ..., Xm}. Let CR(Xi) represent the class relevance value of gene Xi using mutual information. α and β are user defined thresholds. p is a variable which holds the number of current clusters. z is a user defined input variable which holds the maximum number of clusters generated by the proposed algorithm. Let C represent the set of clusters, C = {C1, ..., Cz}. Here S is a set which holds the genes used to generate the average gene expression profile of the current cluster.

Output: A set containing h cluster representatives.

1) Initialize: S ← ∅ and p ← 1.
2) At first, using mutual information, the class relevance value of every gene Xi, CR(Xi) for i = 1, ..., m, is calculated.
3) Now the gene, say Xv, with the highest class relevance value is selected. This is the first member of the cluster Cp.
4) Repeat steps 5 to 15 while p ≤ z and not all m genes have been checked:
5) Among the remaining genes in X, those whose similarities with Xv are ≥ α are selected for cluster Cp, and the cluster Cp is formed.
6) In cluster Cp the gene Xv is taken, and the initial cluster mean ξ_Cp is set to the expression vector (ξ̃_Xv) of the chosen gene; in effect Xv ∈ S.
7) Repeat the following two steps until all genes of cluster Cp are checked:
8) Average the current average cluster expression profile ξ_Cp with each individual gene Xi present in cluster Cp:

ξ_{Cp+Xi} = (1/(|S|+1)) ( ξ̃_Xi + Σ_{Xt ∈ S} ξ̃_Xt )

9) If the class relevance of ξ_{Cp+Xi} is greater than or equal to that of ξ_Cp, then Xi ∈ S and ξ_Cp = ξ_{Cp+Xi}.
10) The final average expression profile ξ_Cp of cluster Cp acts as the representative of Cp.
11) After generating the cluster representative, Xv and all genes present in cluster Cp that have similarities with Xv greater than or equal to β are discarded from the gene set X.
12) Now, p = p + 1 and S ← ∅.
13) Now the gene, say Xu, with the next highest class relevance value among the genes not selected in any previously created cluster is taken. This is the first member of the next cluster Cp; go to step 4.
14) If no gene is found, then p = p − 1 and go to step 15.
15) After generating all cluster representatives, the best h cluster representatives are selected according to their class relevance values.
16) End.

B. Time Complexity

In this algorithm the original number of features (genes) is m. From them the proposed method selects h cluster representatives. These cluster representatives actually form the reduced feature set. In the proposed method, at first the class relevance value of every gene is calculated. The time needed to calculate the class relevance value of every gene is n, as there are n samples. So, the time complexity of this phase is O(mn). In the next phase, the gene with the highest class relevance value is chosen as the member of the first cluster. Based on a user defined threshold, some of the other genes are selected for that cluster and the cluster is formed. Then the next gene, with the highest class relevance value among all remaining genes not already selected in any previously created cluster, is chosen, and the whole process is repeated until m genes are checked or we get z clusters. In every cluster,
the similarity of all genes is calculated with respect to the gene which has the maximum class relevance value in that cluster. The similarity calculation time between two genes is n, as there are n samples. The time complexity of selecting the genes in a cluster is therefore O(mn). As z clusters are created, the time complexity of this phase is O(zmn). At last, h cluster representatives are selected according to their class relevance value. The time complexity of this phase is O(h). As h ≪ m, the overall time complexity of the proposed method is O(zmn).

C. Choice of α and β

In this paper, we measure the similarity between any two genes using mutual information. If two genes are highly similar, then the mutual information between them is very high; otherwise the mutual information between them is low. When two genes are statistically independent, the mutual information between them is zero. Mutual information is maximum when we measure the similarity of a gene with itself. The parameter α is a threshold that measures the degree of similarity of two genes. For every dataset, we first measure the similarity of each gene with itself. Among these similarity values, the maximum value is the maximum mutual information value for that particular data set. So, we take the value of α between 0 and the maximum mutual information value for every dataset. On the other hand, the threshold β is used to decide whether a gene of the current cluster will be considered for the next cluster generation step or not. For every data set and for every cluster, β is set to 90% of the maximum similarity of the initial cluster representative.

III. EVALUATION CRITERIA FOR GENE SELECTION

The F-test value [11], [12], information gain, mutual information [11], [13], normalized mutual information [14], etc., are typically used to measure the gene-class relevance, and the same or a different metric, such as mutual information, the L1 distance, Euclidean distance, Pearson's correlation coefficient, etc. [11], [13], [15], is employed to calculate the gene-gene similarity or redundancy. However, as the F-test value, Euclidean distance, Pearson's correlation, etc., depend on the actual gene expression values of the microarray data, they are very sensitive to noise or outliers in the data set. On the other hand, as mutual information depends only on the probability distribution of a random variable rather than on its actual values, it is more effective for evaluating the gene-class

… be determined by the shared information between the gene and the rest, as well as the shared information between the gene and the class label [11], [13]. If a gene has expression values randomly or uniformly distributed in the different classes, its mutual information with these classes is zero. If a gene is strongly differentially expressed for different classes, it should have large mutual information. Thus, mutual information can be used as a measure of the relevance of genes. Similarly, mutual information may be used to measure the level of similarity between genes.

Entropy is a measure of the uncertainty of random variables. If a discrete random variable X has alphabet 𝒳 and the probability density function is p(x) = Pr{X = x}, x ∈ 𝒳, the entropy of X is defined as

H(X) = − Σ_{x ∈ 𝒳} p(x) log p(x).    (2)

Similarly, the joint entropy of two random variables X with alphabet 𝒳 and Y with alphabet 𝒴 is given by

H(X, Y) = − Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} p(x, y) log p(x, y)    (3)

where p(x, y) is the joint probability density function. The mutual information between X and Y is therefore given by

I(X, Y) = H(X) + H(Y) − H(X, Y).    (4)

B. Discretization

In microarray gene expression data sets, the class labels of samples are represented by discrete symbols, while the expression values of genes are continuous. Hence, to measure both the gene-class relevance of a gene with respect to class labels and the gene-gene redundancy between two genes using mutual information [11], [13], [17], the continuous expression values of a gene are usually divided into several discrete partitions. The a priori (marginal) probabilities and their joint probabilities are then calculated to compute both gene-class relevance and gene-gene redundancy using the definitions for discrete cases. In this paper, the discretization method reported in [11], [13], [17] is employed to discretize the continuous gene expression values. The expression values of a gene are discretized using the mean µ and standard deviation σ computed over the n expression values of that gene: any value larger than (µ + σ/2) is transformed to state 1; any value between
relevance as well as the gene-gene similarity [11], [13]. So, in (µ − σ/2) and (µ + σ/2) is transformed to state 0; any value
this paper to measure the gene class relevance and gene-gene smaller than (µ − σ/2) is transformed to state -1. These three
similarity mutual information is used. states correspond to the over-expression, baseline, and under-
expression of genes.
A. Mutual Information
In principle, mutual information is used to quantify the IV. E XPERIMENTAL R ESULTS AND D ISCUSSION
information shared by two objects. If two independent objects Organization of the experimental results is as follows:
do not share much information, the mutual information value First, the characteristics of the four microarray data sets are
between them is small. While two highly correlated objects discussed briefly. Then, the descriptions of different classifiers
will demonstrate a high mutual information value [16]. The (naive bayes classifier, K-nearest neighbor rule and support
objects can be the class label and the genes. The necessity for vector machine) used here are discussed to measure the perfor-
a gene to be an independent and informative can, therefore, mance of the proposed algorithm and finally, the performance
78
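The µ ± σ/2 discretization and the entropy-based definitions (2)-(4) can be sketched as follows. This is an illustrative Python sketch, not the authors' C implementation; the function names are ours, and base-2 logarithms are an assumption (the paper does not fix the base).

```python
import math
from collections import Counter

def discretize(values):
    # Three-state discretization around mu +/- sigma/2, as in Section III-B:
    # 1 = over-expression, 0 = baseline, -1 = under-expression.
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [1 if v > mu + sigma / 2 else -1 if v < mu - sigma / 2 else 0
            for v in values]

def entropy(xs):
    # H(X) = -sum p(x) log p(x), estimated from symbol frequencies (eq. 2).
    n = len(xs)
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    # H(X, Y) over the paired alphabet (eq. 3).
    return entropy(list(zip(xs, ys)))

def mutual_information(xs, ys):
    # I(X, Y) = H(X) + H(Y) - H(X, Y) (eq. 4).
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)
```

On discretized gene states, independent variables give mutual information near zero, and a gene paired with itself attains the maximum value (its own entropy), which is exactly the quantity the paper uses to bound α.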
of the proposed method is extensively compared with other existing methods that are given in [3].

In this paper, mutual information is applied to calculate both gene-class relevance and gene-gene redundancy. All methods are implemented in the C language and run in a Linux environment on a machine with a Pentium IV 3.2 GHz processor, 1 MB cache, and 1 GB RAM. To analyze the performance of the proposed method, experiments are conducted on different microarray gene expression data sets.

A. Gene Expression Data Sets

In this paper, different public cancer microarray data sets are used. Since binary classification is a typical and fundamental issue in diagnostic and prognostic prediction of cancer, the proposed method is compared with other existing methods using the following binary-class data sets.

1) Breast Cancer Data Set [18]: The breast cancer data set contains expression levels of 7129 genes in 49 breast tumor samples from [18]. The samples are classified according to their estrogen receptor (ER) status: 25 samples are ER positive while the other 24 samples are ER negative.

2) Leukemia Data Set [1]: This is an Affymetrix high-density oligonucleotide array that contains 7070 genes and 72 samples from two classes of leukemia: 47 acute lymphoblastic leukemia and 25 acute myeloid leukemia.

3) Colon Cancer Data Set [19]: The colon cancer data set contains expression levels of 40 tumor and 22 normal colon tissues. Only the 2000 genes with the highest minimal intensity were selected by [19].

B. Class Prediction Methods

The following three classifiers are used to evaluate the performance of the proposed clustering algorithm.

1) Naive Bayes Classifier [20]: The naive Bayes (NB) classifier is one of the oldest classifiers. It is obtained by using the Bayes rule and assuming that the features (variables) are independent of each other given the class. For the jth sample sj with m gene expression levels {w1j, ..., wij, ..., wmj} for the m genes, the posterior probability that sj belongs to class c is

    p(c | sj) ∝ Π_{i=1}^{m} p(wij | c)    (5)

where the p(wij | c) are conditional tables (or conditional densities) estimated from training examples. Despite the independence assumption, the NB has been shown to have good classification performance for many real data sets [20].

2) Support Vector Machine [21]: The support vector machine (SVM) is a relatively new and promising classification method. It is a margin classifier that draws an optimal hyperplane in the feature vector space; this defines a boundary that maximizes the margin between data samples in different classes, thereby leading to good generalization properties. A key factor in the SVM is the use of kernels to construct a nonlinear decision boundary. In the present work, linear kernels are used. The source code of the SVM is downloaded from http://www.csie.ntu.edu.tw/∼cjlin/libsvm.

3) K-Nearest Neighbor Rule [22]: The K-nearest neighbor (K-NN) rule is used for evaluating the effectiveness of the reduced gene set for classification. It is one of the simplest machine learning algorithms, classifying samples based on the closest training samples in the feature space. A sample is classified by a majority vote of its K neighbors, with the sample being assigned to the class most common amongst its K nearest neighbors. The value of K chosen for the K-NN is the square root of the number of samples in the training set.

C. Performance Analysis

In this section, the results obtained by applying the supervised clustering algorithm to the above mentioned data sets are briefly described. The experimental results on the different microarray data sets are presented in Tables I-IX. Subsequent discussions analyze the results with respect to the prediction accuracy of the NB, SVM, and K-NN classifiers. Tables I, IV, VII and Tables II, V, VIII provide the performance of the proposed method using the NB and SVM, respectively, while Tables III, VI, IX show the results using the K-NN. Extensive experiments have been carried out for different values of α: 0.0001, 0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.19, 0.20, 0.30, 0.40, and 0.50.

To compute the prediction accuracy of the NB, SVM, and K-NN, leave-one-out cross-validation is performed on each gene expression data set. In all experiments, the value of z is taken as 50, so at most 50 clusters are obtained, each with its own representative. All results for all data sets using the three classifiers are reported with respect to the best 5 representatives. Each data set is pre-processed by standardizing each sample to zero mean and unit variance.

TABLE I
PERFORMANCE ON BREAST CANCER DATA SET USING NB CLASSIFIER

Value    Number of Selected Genes
of α     1       2       3       4       5
0.0001   100     *       *       *       *
0.001    100     100     *       *       *
0.005    100     100     *       *       *
0.01     100     100     100     *       *
0.02     100     100     100     97.96   *
0.04     100     100     100     100     100
0.06     100     100     100     100     100
0.08     100     100     100     100     100
0.10     100     100     100     100     100
0.11     100     100     100     100     100
0.12     100     100     100     100     100
0.13     100     100     100     100     100
0.14     100     100     100     100     100
0.15     100     100     100     100     100
0.20     100     100     100     97.96   97.96
0.30     97.96   97.96   97.96   97.96   97.96
0.40     89.80   97.96   97.96   100     97.96
0.50     91.84   93.88   93.88   93.88   91.84

Tables I, IV, and VII show the classification accuracy on the different cancer microarray data sets using the NB classifier. Using
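Equation (5) amounts to multiplying per-gene conditional probabilities estimated from the training set. A minimal sketch on discretized gene states follows; this is illustrative Python rather than the authors' C code, the names are ours, and the Laplace smoothing is our addition (the paper does not state how zero counts are handled).

```python
from collections import Counter

STATES = (-1, 0, 1)  # discretized expression states from Section III-B

def train_nb(samples, labels):
    # Estimate the conditional tables p(w_ij | c) per gene and class,
    # with add-one (Laplace) smoothing over the three states (our assumption).
    model = {}
    for c in set(labels):
        class_samples = [s for s, l in zip(samples, labels) if l == c]
        total = len(class_samples)
        per_gene = []
        for i in range(len(samples[0])):
            counts = Counter(s[i] for s in class_samples)
            per_gene.append({w: (counts.get(w, 0) + 1) / (total + len(STATES))
                             for w in STATES})
        model[c] = per_gene
    return model

def classify_nb(model, sample):
    # p(c | s) is proportional to the product of p(w_i | c), eq. (5);
    # return the class with the largest product.
    best, best_p = None, -1.0
    for c, per_gene in model.items():
        p = 1.0
        for i, w in enumerate(sample):
            p *= per_gene[i][w]
        if p > best_p:
            best, best_p = c, p
    return best
```

A practical variant sums log-probabilities instead of multiplying, which avoids underflow when m runs into the thousands of genes.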
TABLE II
PERFORMANCE ON BREAST CANCER DATA SET USING SVM

Value    Number of Selected Genes
of α     1       2       3       4       5
0.0001   93.87   *       *       *       *
0.001    85.72   93.88   *       *       *
0.005    89.79   91.84   *       *       *
0.01     87.76   89.80   91.84   *       *
0.02     89.80   93.88   93.88   100     *
0.04     87.76   100     100     100     100
0.06     91.84   100     100     100     100
0.08     91.84   95.92   100     100     100
0.10     100     97.96   100     100     100
0.11     93.88   97.96   100     100     100
0.12     95.92   100     100     100     100
0.13     95.92   100     100     100     100
0.14     95.92   97.96   100     100     100
0.15     100     100     100     100     100
0.20     97.96   100     100     100     100
0.30     89.80   91.84   89.80   93.88   95.92
0.40     89.80   95.92   95.92   93.88   95.92
0.50     87.76   91.84   91.84   91.84   91.84

TABLE IV
PERFORMANCE ON LEUKEMIA DATA SET USING NB CLASSIFIER

Value    Number of Selected Genes
of α     1       2       3       4       5
0.0001   100     *       *       *       *
0.001    100     100     *       *       *
0.005    100     100     100     *       *
0.01     100     100     100     100     *
0.02     100     100     100     100     100
0.04     100     100     100     100     100
0.06     100     100     100     100     100
0.08     100     100     100     100     100
0.10     100     100     100     100     100
0.11     100     100     100     100     100
0.12     100     100     100     100     100
0.13     100     100     100     100     100
0.14     100     100     100     100     100
0.15     100     100     100     100     100
0.20     100     100     100     100     100
0.30     100     100     100     100     100
0.40     100     100     100     100     98.61
0.50     93.06   97.22   98.61   100     100

NB, the proposed method gives 100% accuracy for all the above mentioned data sets considering 1 or more gene cluster representatives. With the NB, a classification accuracy of 100% is obtained for the breast cancer data for α values ranging from 0.0001 to 0.20, and for the leukemia cancer data 100% accuracy is obtained for α values ranging from 0.0001 to 0.50. For the colon cancer data, 100% accuracy is obtained for a set of α values: 0.01, 0.02, 0.04, and 0.13.

The results reported in Tables II, V, and VIII are based on the predictive accuracy of the SVM. The results show that for the breast cancer data set, 100% accuracy is obtained for α values ranging from 0.02 to 0.20 using 1 or more gene cluster representatives. Using the SVM, 100% accuracy is obtained for the leukemia data for α values ranging from 0.001 to 0.40 using 1 or more cluster representatives. For the colon cancer data, 100% accuracy is obtained only for α = 0.02, considering 1 gene cluster representative.

For the breast cancer data set using the K-NN, 100% accuracy is obtained for α values ranging from 0.0005 to 0.20 using 1 or more gene cluster representatives. The K-NN also gives 100% accuracy for α values from 0.001 to 0.30 using 1 or more cluster representatives for the leukemia data set. For the colon cancer data set, it also gives 100% accuracy for α values 0.02 and 0.13 using 1 or more gene cluster representatives.

Analyzing all these results, we can say that for the NB classifier, 100% accuracy is obtained on the breast cancer data set for α values from 0.0001 to 0.20, while for the SVM and K-NN the best result is obtained for α values 0.10 and 0.15. So, the proposed method gives its best result for the breast cancer data for α values 0.10 and 0.15. For the colon cancer data using the NB classifier, 100% accuracy is obtained for the α values 0.01, 0.02, 0.04, and 0.13. For the SVM, the best result is obtained for α = 0.02, and using the K-NN the best results are obtained for α values 0.02 and 0.13. So when α is set to 0.02, the proposed method gives 100% accuracy with all three specified classifiers on the colon cancer data using
TABLE VI
PERFORMANCE ON LEUKEMIA DATA SET USING K-NN RULE

Value    Number of Selected Genes
of α     1       2       3       4       5
0.0001   98.61   *       *       *       *
0.001    97.22   100     *       *       *
0.005    95.83   97.22   97.22   *       *
0.01     98.61   100     100     100     *
0.02     98.61   98.61   100     100     100
0.04     97.22   98.61   98.61   100     100
0.06     100     100     100     100     100
0.08     98.61   100     100     100     100
0.10     100     100     98.61   100     100
0.11     95.83   100     100     100     100
0.12     98.61   98.61   98.61   98.61   100
0.13     100     100     100     100     100
0.14     98.61   100     100     100     100
0.15     100     100     100     100     100
0.20     98.61   98.61   98.61   100     100
0.30     100     100     100     100     100
0.40     97.22   100     95.83   94.44   94.44
0.50     94.44   91.67   91.67   93.06   94.44

TABLE VIII
PERFORMANCE ON COLON CANCER DATA SET USING SVM

Value    Number of Selected Genes
of α     1       2       3       4       5
0.0001   95.16   *       *       *       *
0.001    95.16   95.16   *       *       *
0.005    95.16   98.39   *       *       *
0.01     98.39   96.77   *       *       *
0.02     100     93.55   *       *       *
0.04     90.32   90.32   95.16   *       *
0.06     95.16   96.77   95.16   93.55   *
0.08     88.71   98.39   98.39   98.39   98.39
0.10     93.55   95.16   93.55   93.55   93.55
0.11     95.16   96.77   95.16   91.94   91.94
0.12     95.16   91.94   93.55   91.94   93.55
0.13     96.77   98.39   96.77   96.77   95.16
0.14     95.16   93.55   93.55   95.16   95.16
0.15     98.39   95.16   95.16   95.16   95.16
0.20     93.55   91.94   91.94   93.55   93.55
0.30     93.55   91.94   91.94   87.09   91.94
0.40     91.94   93.55   95.16   95.16   95.16
0.50     80.65   83.88   80.65   85.49   82.26

1 or more cluster representatives. So, the proposed method gives its best result for the colon cancer data set at α = 0.02. For the leukemia cancer data, we get 100% accuracy for α values from 0.0001 onwards using the NB classifier. Using the SVM, the proposed method gives its best results for α values ranging from 0.001 to 0.40, and for the K-NN for α values from 0.001 to 0.30. So, we can say that for the leukemia cancer data the method gives its best result when α is set to 0.13 or 0.30.

D. Comparative Performance Analysis

For comparison, we compare the proposed method with the results of the attribute clustering algorithm (ACA), t-value, k-means algorithm, minimum redundancy-maximum relevance (mRMR) algorithm, self-organizing map (SOM), biclustering algorithm, and radial basis function (RBF) network on the colon cancer and leukemia cancer data sets, as given in [3], using the NB and K-NN methods.

The experimental results in Table X show that the proposed method is superior to the other gene selection methods, selecting a smaller set of discriminative genes in the colon cancer and leukemia cancer data sets than the others, as reflected by the classification results. The proposed method outperforms the others in all cases. Although the ACA and t-value algorithms can find good discriminative genes for the K-NN method, t-value is unable to do so for the naive Bayes method. Using the naive Bayes classifier, ACA gives good results for leukemia cancer but is unable to do so for colon cancer. The k-means algorithm, SOM, biclustering algorithm, mRMR, and RBF fail to find good discriminative genes for these two data sets, as shown in the results.

V. CONCLUSION

This paper presents a new algorithm for supervised clustering of genes from microarray experiments. The proposed algorithm is potentially useful in the context of medical diagnostics, as it identifies groups of interacting genes that have high explanatory power for given tissue types, and which
in turn can accurately predict the class labels of new samples. At the same time, such gene clusters may reveal insights into biological processes and may be valuable for functional genomics.

In summary, the proposed algorithm tries to cluster genes such that the discrimination of different tissue types is as simple as possible. The performance of the proposed method is evaluated by the predictive accuracy of the naive Bayes classifier, K-nearest neighbor rule, and support vector machine. For all data sets, 100% classification accuracy is achieved by the proposed method. The results obtained on real data sets demonstrate that the proposed method can bring a remarkable improvement to the gene selection problem. So, the proposed method is capable of identifying discriminative genes that may contribute to revealing underlying class structures, providing a useful tool for the exploratory analysis of biological data.

TABLE X
COMPARATIVE PERFORMANCE ANALYSIS OF DIFFERENT METHODS

Classifier  Data Set          Method/Algorithm  Accuracy (%)  Number of Genes
K-NN        Colon Cancer      Proposed          100           1
                              ACA               83.9          7
                              t-value           80.6          7
                              k-means           69.4          14
                              SOM               59.7          14
                              Biclustering      69.4          7
                              mRMR              64.5          7
                              RBF               67.7          3
NB          Colon Cancer      Proposed          100           1
                              ACA               67.7          14
                              t-value           56.5          7
                              k-means           62.9          7
                              SOM               64.5          7
                              Biclustering      67.7          7
                              mRMR              64.5          7
                              RBF               64.5          3
K-NN        Leukemia Cancer   Proposed          100           1
                              ACA               91.2          7
                              t-value           88.2          14
                              k-means           50.0          7
                              SOM               50.0          7
                              Biclustering      58.8          21
                              mRMR              70.6          14
                              RBF               47.1          3
NB          Leukemia Cancer   Proposed          100           1
                              ACA               82.4          7
                              t-value           55.9          7
                              k-means           58.8          7
                              SOM               58.8          7
                              Biclustering      58.8          7
                              mRMR              67.6          7
                              RBF               58.8          3

REFERENCES

[1] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, pp. 531-537, 1999.
[2] E. Domany, "Cluster Analysis of Gene Expression Data," Journal of Statistical Physics, vol. 110, pp. 1117-1139, 2003.
[3] W. Au, K. C. C. Chan, A. K. C. Wong, and Y. Wang, "Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 83-101, 2005.
[4] M. B. Eisen, P. T. Spellman, and D. Botstein, "Cluster Analysis and Display of Genome-Wide Expression Patterns," Proc. Natl Acad. Sci. USA, vol. 95, pp. 14863-14868, 1998.
[5] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien, "Large-scale Clustering of cDNA-fingerprinting Data," Genome Res., vol. 9, pp. 1093-1105, 1999.
[6] P. Tamayo, D. Slonim, J. Mesirov, Q. Kitareewan, S. Dmitrovsky, E. S. Lander, and T. R. Golub, "Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation," Proc. Natl Acad. Sci. USA, vol. 96, pp. 2907-2912, 1999.
[7] T. Hastie, R. Tibshirani, D. Botstein, and P. Brown, "Supervised Harvesting of Expression Trees," Genome Biology, 2001.
[8] D. Nguyen and D. Rockie, "Tumor Classification by Partial Least Squares using Microarray Gene Expression Data," Bioinformatics, pp. 39-50, 2002.
[9] P. Geladi and B. Kowalski, "Partial Least Squares Regression: A Tutorial," Analytica Chimica Acta, 1986.
[10] M. Dettling and P. Buhlmann, "Supervised Clustering of Genes," Genome Biology, pp. 0069.1-0069.15, 2002.
[11] C. Ding and H. Peng, "Minimum Redundancy Feature Selection from Microarray Gene Expression Data," in Proceedings of the Computational Systems Bioinformatics, 2003, pp. 523-528.
[12] J. Li, H. Su, H. Chen, and B. W. Futscher, "Optimal Search-Based Gene Subset Selection for Gene Array Cancer Classification," IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 4, pp. 398-405, 2007.
[13] H. Peng, F. Long, and C. Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[14] X. Liu, A. Krishnan, and A. Mondry, "An Entropy Based Gene Selection Method for Cancer Classification Using Microarray Data," BMC Bioinformatics, vol. 6, no. 76, pp. 1-14, 2005.
[15] D. Jiang, C. Tang, and A. Zhang, "Cluster Analysis for Gene Expression Data: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386, 2004.
[16] C. Shannon and W. Weaver, The Mathematical Theory of Communication. Champaign, IL: Univ. Illinois Press, 1964.
[17] P. Maji, "f-Information Measures for Efficient Selection of Discriminative Genes from Microarray Data," IEEE Transactions on Biomedical Engineering, vol. 56, no. 4, pp. 1063-1069, 2009.
[18] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins, "Predicting the Clinical Status of Human Breast Cancer by Using Gene Expression Profiles," Proceedings of the National Academy of Science, USA, vol. 98, no. 20, pp. 11462-11467, 2001.
[19] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, "Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays," Proceedings of the National Academy of Science, USA, vol. 96, no. 12, pp. 6745-6750, 1999.
[20] T. Mitchell, Machine Learning. McGraw-Hill, 1997.
[21] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[22] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1999.
Modified Greedy Search Algorithm for
Biclustering Gene Expression Data
Shyama Das
Department of Computer Science
Cochin University of Science and Technology
Cochin, Kerala, India
shyamadas777@gmail.com

Sumam Mary Idicula
Department of Computer Science
Cochin University of Science and Technology
Cochin, Kerala, India
sumam@cusat.ac.in
Abstract— Biclustering refers to the simultaneous clustering of both rows and columns of a data matrix. Biclustering is a highly useful data mining technique in the analysis of gene expression data. The problem of identifying the most significant biclusters in gene expression data has been shown to be NP-complete. In this paper a greedy search algorithm is developed for biclustering gene expression data. This algorithm has two steps. In the first step, high quality bicluster seeds are generated using the K-Means clustering algorithm. Then these seeds are enlarged using the greedy search method. Here the node that results in the minimum Hscore value when combined with the bicluster is selected and added to the bicluster. This selection and addition continues till the Hscore value of the bicluster reaches the given threshold. Even though it is a greedy method, the results obtained are far better than those of many of the metaheuristic methods, which are generally considered superior to greedy approaches.

Keywords - Biclustering; Gene expression data; greedy search; data mining; K-Means clustering

correspond to different time points or different environmental conditions. The samples can be from different organs, from cancerous or healthy tissues, or even from different individuals. The gene expression data contains thousands of genes and hundreds of conditions. An element in the matrix refers to the expression level of a particular gene under a specific condition. The genes are co-regulated if the genes in a set display similar fluctuation under all conditions. By discovering the co-regulation, it is possible to infer the gene regulative network, which will lead to better understanding of how organisms develop and evolve. One of the objectives of gene expression data analysis is to group genes according to their expression under multiple conditions. Clustering is the most widely used data mining technique for analyzing gene expression data to group similar genes or conditions. Clustering of co-expressed genes into biologically meaningful groups assists in inferring the biological role of an unknown gene that is co-expressed with a known gene.
gene expression data [3]. Biclustering is a powerful analytical tool when some genes have multiple functions and experimental conditions are diverse.

In this work a novel algorithm is developed for biclustering gene expression data using a greedy strategy. In the first step, high quality bicluster seeds are generated using the K-Means clustering algorithm. Then the seeds are enlarged by adding the node that results in the minimum incremental increase in Hscore. The node addition continues till the Hscore value of the bicluster reaches the given threshold.

A. Model of bicluster

A gene expression dataset is a matrix in which rows represent genes and columns represent experimental conditions. An element aij of the expression matrix A represents the logarithm of the relative abundance of the mRNA of the ith gene under the jth condition. Let X = {G1, G2, ..., GN} be the set of genes and Y = {C1, ..., CM} be the set of conditions in the gene expression dataset. The dataset can be viewed as an N×M matrix A of real numbers. A bicluster is a submatrix B of A; if the size of B is I×J, then I is a subset of the rows X of A and J is a subset of the columns Y of A. The rows and columns of the bicluster B need not be contiguous as in the expression matrix A.

Biclusters are generally classified into four major types: biclusters with constant values, biclusters with constant values on rows or columns, biclusters with coherent values, and biclusters with coherent evolutions. In the gene expression data matrix, constant biclusters disclose subsets of genes with similar expression values within a subset of conditions. On the other hand, a bicluster with constant values in the rows identifies a subset of genes with similar expression values across a subset of conditions, permitting the expression levels to vary from gene to gene. Similarly, a bicluster with constant columns identifies a subset of conditions within which a subset of genes manifests similar expression values, allowing the expression values to vary from condition to condition. A bicluster with coherent values identifies a subset of genes and a subset of conditions with coherent values on both the rows and columns. In this case the similarity among the genes is measured as the mean squared residue score: if the similarity measure (mean squared residue score) of a matrix is within a certain threshold, it is a bicluster. In the case of a bicluster with coherent evolutions, a subset of genes is up-regulated or down-regulated across a subset of conditions without considering their actual expression values [1].

Biclusters with coherent values are biologically more relevant than biclusters with constant values. Hence in this work biclusters with coherent values are identified. Thus the problem of biclustering can be formulated in the following manner: given a data matrix A, find a set of submatrices B1, B2, ..., Bn which satisfy some homogeneous characteristics or coherence. For measuring the degree of coherence, a measure called the mean squared residue score or Hscore was introduced by Cheng and Church. It is the mean of the squared residue scores. The residue score of an element bij in a submatrix B is defined as

    RS(bij) = bij − bIj − biJ + bIJ

where I denotes the row set and J denotes the column set of matrix B, bij denotes the element in the submatrix, biJ denotes the ith row mean, bIj denotes the jth column mean, and bIJ denotes the mean of the whole bicluster. The residue score of an element bij gives the difference between the actual value and its expected value predicted from its row mean, column mean, and bicluster mean. The residue of an element is thus a measure of how well the entry fits into that bicluster. Hence, from the value of the residue, the quality of the bicluster can be evaluated by computing the mean squared residue. That is, the Hscore or mean squared residue score of a bicluster B is

    MSR(B) = (1 / (|I| |J|)) Σ_{i ∈ I, j ∈ J} RS(bij)²

Cheng and Church defined a bicluster to be a matrix with a low mean squared residue score. The maximum value of MSR that a matrix can have to be called a bicluster is called the MSR threshold and is denoted as δ. A submatrix B is called a δ-bicluster if MSR(B) < δ for some δ > 0. A high MSR value signifies that the data is uncorrelated; a low MSR value means that there is correlation in the matrix. The value of δ depends on the dataset: for the Yeast dataset the value of δ is 300, and for the Lymphoma dataset the value of δ is 1200. The volume of a bicluster, or bicluster size, is the product of the number of rows and the number of columns in the bicluster.

This bicluster model is much more flexible than row clusters. The identified submatrices need neither be disjoint nor cover the entire matrix. But the computation of biclusters is costly because one would have to consider all combinations of columns and rows in order to find all the biclusters. The search space for the biclustering problem is 2^(m+n), where m and n are the numbers of genes and conditions respectively. Usually m+n is more than 2000. The biclustering problem is NP-hard.

B. Encoding of bicluster

Each bicluster is encoded as a binary string of fixed length [4]. The length of the string is the sum of the number of rows and the number of columns of the gene expression data matrix. The first N bits represent genes and the next M bits represent conditions. A bit is set to one when the corresponding gene or condition is included in the bicluster; otherwise it is set to zero. This representation is advantageous for node addition and deletion.
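The residue and mean squared residue definitions above can be computed directly. The following is an illustrative Python sketch (not the authors' implementation) for a dense list-of-lists submatrix; the function name is ours.

```python
def msr(matrix):
    # Hscore / mean squared residue of a bicluster, following the definitions
    # above: RS(b_ij) = b_ij - b_Ij - b_iJ + b_IJ, averaged over |I||J| entries.
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]                      # b_iJ
    col_mean = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]                                   # b_Ij
    grand_mean = sum(row_mean) / n_rows                                   # b_IJ
    return sum((matrix[i][j] - row_mean[i] - col_mean[j] + grand_mean) ** 2
               for i in range(n_rows)
               for j in range(n_cols)) / (n_rows * n_cols)
```

A perfectly additive (coherent-values) submatrix such as [[1, 2], [3, 4]] has MSR 0, while a pattern like [[1, 0], [0, 1]] that breaks row/column coherence has a strictly positive MSR; a δ threshold such as 300 for the Yeast data is applied to exactly this quantity.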
C. Algorithm Description

Different types of algorithm design techniques are used to address the biclustering problem, including iterative row and column clustering combination, divide and conquer, greedy iterative search, and evolutionary or metaheuristic algorithms. Greedy iterative search methods are based on the idea of creating biclusters by adding or removing rows/columns, using a criterion that maximizes a local gain. In this work a greedy search method is used for finding δ-biclusters, and it is very fast compared to metaheuristic methods. The algorithm has two major phases. In the first phase, an initial set of seed biclusters is generated using the K-Means one-way clustering algorithm. In the second phase, the seeds are enlarged by adding more rows and columns using a greedy search algorithm.

D. Seed Finding

A good seed of a bicluster is a small bicluster with a very low Hscore value. Hence in the seed there exists the possibility of accommodating more genes and conditions within the given Hscore threshold. In this algorithm a simple seed finding technique is used [5]. For finding seeds, the K-Means clustering algorithm is used. K-Means is a partitional clustering algorithm; the generated clusters are disjoint and flat (non-hierarchical), and the number of clusters to generate must be specified as input. In the K-Means clustering algorithm, the distance measure is a parameter that specifies how the distance between data points is measured. Here cosine angle distance is selected as the distance measure. First, gene and condition clusters are obtained from the K-Means one-way clustering algorithm. That is, the genes in the dataset are partitioned into n gene clusters. Clusters having more than 10 genes are further divided into groups based on cosine angle distance from the cluster centre, so that each group contains at most 10 genes. Similarly, the conditions in the dataset are partitioned into m clusters, and each cluster containing more than 5 conditions is further divided based on cosine angle distance from the cluster centre, so that each group contains at most 5 conditions. This yields p gene clusters and q condition clusters. All combinations of these p gene clusters and q condition clusters are formed, the Hscore value of every combination is calculated, and those with Hscore values below a certain threshold are selected as seeds. Thus the gene expression data matrix is partitioned into fixed-size, tightly co-regulated submatrices. The Yeast dataset is partitioned into 140 gene clusters and 3 condition clusters [4].

E. Seed growing phase

In the seed growing phase a separate list is maintained for the conditions and genes not included in the bicluster. Each seed is enlarged separately by adding more genes and conditions. Initially conditions are added, followed by genes. In the modified greedy search algorithm the best element is selected from the gene list or condition list and added to the bicluster. The quality of an element is determined by the Hscore or MSR value of the bicluster after including the element. The element which results in the minimum Hscore value when added to the bicluster is considered the best element. It cannot be specified as the element with the smallest incremental cost of Hscore, because adding some elements reduces the Hscore value. Seed growing starts from the condition list, followed by the gene list, until the Hscore value reaches the given threshold. This is a greedy method since the aim is to select the next element which produces the bicluster with the minimum Hscore value. The algorithm is deterministic. A pseudo-code description of the modified greedy search algorithm is given below.

F. Modified Greedy Search Algorithm

Algorithm modifiedgreedy(seed, δ)
    bicluster := seed
    calculate Column_List, the list of conditions not included in the bicluster
    while (MSR(bicluster) <= δ)
        No_elem_Col := size(Column_List)
        for i := 1 : No_elem_Col
            bicluster := bicluster + Column_List[i]
            Column_List_msr[i] := MSR(bicluster)
            remove Column_List[i] from bicluster
        end(for)
        find the minimum value in Column_List_msr and the corresponding index K
        bicluster := bicluster + Column_List[K]
        delete Column_List[K] from Column_List
    end(while)
    calculate Row_List, the list of genes not included in the bicluster
    while (MSR(bicluster) <= δ)
        No_elem_Row := size(Row_List)
        for i := 1 : No_elem_Row
            bicluster := bicluster + Row_List[i]
            Row_List_msr[i] := MSR(bicluster)
            remove Row_List[i] from bicluster
        end(for)
        find the minimum value in Row_List_msr and the corresponding index J
        bicluster := bicluster + Row_List[J]
        delete Row_List[J] from Row_List
    end(while)
end(modifiedgreedy)

residue, gene number, condition, volume etc. for the performance comparison of modified greedy with other biclustering algorithms.
400
550
350
G. Difference between Novel Greedy Search and
E x p re s s io n V a lu e
E x p re s s io n V a lu e
500
E x p r e s s io n V a lu e
E x p re s s io n V a lu e
Hscore value of the bicluster combined with a single gene or 250
300
condition which is not included in the bicluster is to be 200
calculated for all the genes or conditions not included in the 150 250
600 600
E x p r e s s io n V a lu e
The proposed algorithm is implemented in Matlab and 400
300
cerevisiae cell cycle expression dataset to assess the quality of 400
250
the proposed method. The dataset is based on Tavazoie et al
200
[7]. Dataset consists of 2884 genes and 17 conditions. The 350
150
values in the expression dataset are integers in the range 0 to 100 300
600. There are 34 missing values represented by -1. The 0 2 4 6 8 10 12 14 16 18 0 2 4 6 8 10
condition
12 14 16 18
500
550
B. Bicluster Plots
400
In Figure 1 eight biclusters identified by the modified
E x p re s s io n V a lu e
500
E x p r e s s io n V a lu e
greedy search algorithm on the Yeast dataset are shown. From 300
the bicluster plots it can be noticed that genes present a similar 450 200
86
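The seed-growing step of modifiedgreedy can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' Matlab code: `msr` computes the mean squared residue (Hscore) of Cheng and Church, and `grow_columns` mirrors the column-addition loop; the function and variable names are our own.

```python
import numpy as np

def msr(data, rows, cols):
    """Mean squared residue (Hscore) of the bicluster data[rows][:, cols]."""
    sub = data[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_mean = sub.mean(axis=0, keepdims=True)   # a_Ij
    residue = sub - row_mean - col_mean + sub.mean()
    return float((residue ** 2).mean())

def grow_columns(data, rows, cols, delta):
    """Column-addition phase: while the current MSR is within delta,
    add the candidate column whose inclusion gives the smallest MSR."""
    cols = list(cols)
    candidates = [c for c in range(data.shape[1]) if c not in cols]
    while candidates and msr(data, rows, cols) <= delta:
        scores = [msr(data, rows, cols + [c]) for c in candidates]
        best = int(np.argmin(scores))
        cols.append(candidates.pop(best))
    return cols
```

The row-addition phase is symmetric, swapping the roles of `rows` and `cols`.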
TABLE 1
INFORMATION ABOUT BICLUSTERS OF YEAST DATASET

Label   Rows   Columns   Bicl. Vol.   MSR
(a)       10     17         170       66.4403
(b)       17     17         289       99.3497
(c)      108     17        1836      194.5204
(d)       14     17         238       97.8389
(e)      107     17        1819      199.1857
(f)       33     17         561       99.9639
(g)       31     17         527       97.9121
(h)     1405      9       12645      299.8968
(p)      147     17        2499      200.2474
(q)      710      8        5680      199.9880
(r)      913      9        8217      256.1985
(s)     1163      8        9304      246.0037
(t)     1200      8        9600      249.9022
(u)     1355      9       12195      294.9206

In the above table the first column contains the label of each bicluster. The second and third columns report the number of rows (genes) and of columns (conditions) of the bicluster respectively. The fourth column reports the volume of the bicluster and the last column contains the mean squared residue of the bicluster. The table also contains details of some biclusters not included in Figure 1, with labels (p), (q), (r), (s), (t) and (u).

The average mean squared residue score is better than that of all the other algorithms listed in the table except DBF. As is clear from Table 2, the performance of modified greedy is better than that of novel greedy in terms of average mean squared residue, average gene number, average volume and largest bicluster size.

In multi-objective evolutionary computation [11] the maximum number of conditions obtained is only 11 for the Yeast dataset, but in this method there are biclusters with all 17 conditions. For the Yeast dataset the maximum number of genes obtained by this algorithm over all 17 conditions is 147, with Hscore value 200.2474. The best result in the literature published so far is that of multi-objective PSO [12], which obtained 141 genes for 17 conditions with Hscore value 203.25.

TABLE 2
PERFORMANCE COMPARISON BETWEEN MODIFIED GREEDY AND OTHER ALGORITHMS FOR YEAST DATASET

Algorithm         Avg. Residue   Avg. Gene Num   Avg. Cond. Num   Avg. Vol.   Largest Bicluster
Modified Greedy      185.88          515.21           13.36         4684.29        12645
Novel Greedy         199.78           94.75           14.75         1422.87         2112
CC                   204.29          166.71           12.09         1576.98         4485
SEBI                 205.18           13.61           15.25          209.92         1394
FLOC                 187.54          195.00           12.80         1825.78         2000
DBF                  114.70          188.00           11.00         1627.20         4000
many of the biclustering algorithms and especially the Novel Greedy Search algorithm. Moreover, this method finds high-quality biclusters that show strikingly similar up-regulations and down-regulations under a set of experimental conditions, which can be inspected visually using the plots.
REFERENCES
[1] S.C. Madeira and A.L. Oliveira, "Biclustering algorithms for biological data analysis: a survey," IEEE Transactions on Computational Biology and Bioinformatics, 2004, pp. 24-45.
[2] J.A. Hartigan, "Direct clustering of a data matrix," Journal of the American Statistical Association, vol. 67, no. 337, 1972, pp. 123-129.
[3] Y. Cheng and G.M. Church, "Biclustering of expression data," Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 2000, pp. 93-103.
[4] A. Chakraborty and H. Maka, "Biclustering of gene expression data using genetic algorithm," Proc. IEEE Symp. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2005, pp. 1-8.
[5] A. Chakraborty and H. Maka, "Biclustering of gene expression data by simulated annealing," Proc. HPCASIA '05, 2005, pp. 627-632.
[6] S. Das and S.M. Idicula, "A novel approach in greedy search algorithm for biclustering gene expression data," accepted for presentation at the International Conference on Bioinformatics, Computational and Systems Biology (ICBCSB), Singapore, Aug. 27-29, 2009.
[7] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho and G.M. Church, "Systematic determination of genetic network architecture," Nat. Genet., vol. 22, no. 3, 1999, pp. 281-285.
[8] F. Divina and J.S. Aguilar-Ruiz, "Biclustering of expression data with evolutionary computation," IEEE Transactions on Knowledge and Data Engineering, vol. 18, 2006, pp. 590-602.
[9] J. Yang, H. Wang, W. Wang and P. Yu, "Enhanced biclustering on expression data," Proc. Third IEEE Symp. Bioinformatics and Bioengineering (BIBE '03), 2003, pp. 321-327.
[10] Z. Zhang, A. Teo, B.C. Ooi and K.L. Tan, "Mining deterministic biclusters in gene expression data," Proc. Fourth IEEE Symp. Bioinformatics and Bioengineering (BIBE '04), 2004, pp. 283-292.
[11] H. Banka and S. Mitra, "Multi-objective evolutionary biclustering of gene expression data," Pattern Recognition, vol. 39, 2006, pp. 2464-2477.
[12] J. Liu, Z. Li and F. Liu, "Multi-objective particle swarm optimization biclustering of microarray data," Proc. IEEE International Conference on Bioinformatics and Biomedicine, 2008, pp. 363-366.
ADCOM 2009
AD-HOC NETWORKS
Session Papers:
1. Rajiv Saxena and Alok Singh, “Solving Bounded Diameter Minimum Spanning Tree Problem Using
Improved Heuristics”
Solving Bounded-Diameter Minimum Spanning
Tree Problem Using Improved Heuristics
Rajiv Saxena and Alok Singh
Department of Computer and Information Sciences
University of Hyderabad
Hyderabad 500046, Andhra Pradesh, India
rajiiv123@gmail.com, alokcs@uohyd.ernet.in
Abstract—The bounded-diameter minimum spanning tree (BDMST) problem is to find a minimum spanning tree of a given connected, undirected, edge-weighted graph G in which no path between any two vertices contains more than k edges. The problem is known to be NP-Hard for 4 ≤ k < n − 1, where n is the number of vertices in G. Therefore, we look for heuristics to find good approximate solutions. This work is an improvement over two existing greedy heuristics - Improved Randomized Greedy Heuristics (RGH-I) and Improved Centre Based Tree Construction (CBTC-I) - themselves improved versions of the heuristics RGH and CBTC. The improvement is such that, given a bounded-diameter spanning tree T as constructed by RGH or CBTC, the heuristic tries to improve the cost of T further by disconnecting a subtree of height h rooted at vertex v in T and attaching it to a vertex where the cost of attaching it is minimum, without violating the diameter constraint. On 25 euclidean instances and 20 non-euclidean instances of up to 1000 vertices, our approach shows substantial improvement over the solutions found by RGH-I and CBTC-I.

I. INTRODUCTION

The bounded-diameter minimum spanning tree problem is useful in many practical applications where a minimum spanning tree (MST) with a small diameter (length of the longest path in the tree) is required - such as in distributed mutual exclusion algorithms [8], in data compression for information retrieval [3] and in linear lightwave networks (LLNs) [2].

Let G = (V, E) be a connected undirected graph where V denotes the set of vertices and E denotes the set of edges. Each edge e ∈ E has a non-negative weight w(e) associated with it. The BDMST problem seeks a minimum spanning tree T on G whose diameter does not exceed a given positive integer k ≥ 2. That is,

    Minimize W(T) = Σ_{e ∈ T} w(e)

such that

    diameter(T) ≤ k

It is to be noted that for diameter k = n − 1 the problem is simply to find the MST of G, for which we already have polynomial-time exact algorithms (Prim's or Kruskal's). When k = 2, the BDMST takes the form of a star; it can be computed in O(n²) time by constructing the star centred at each vertex and then selecting the smallest-weight star as the solution. When k = 3, the BDMST takes the form of a dipolar star where every node must be of degree 1 except at most two nodes. To compute the BDMST in this case we consider each edge e of the graph one by one and make its endpoints the vertices whose degree can exceed 1. For each of the remaining n − 2 nodes, whose degree is 1, a comparison is required to determine to which of the two nodes of degree ≥ 2 it is to be connected. This has to be repeated for every edge in G, and then a spanning tree with the smallest cost is selected. In a complete graph with m edges, the total number of comparisons thus required is (n−2)m, which is O(n³). Finally, when all the edge weights are the same, a minimum-diameter spanning tree can be constructed using breadth-first search in O(mn) time. In the remaining general cases the BDMST problem is NP-Hard [4].

The diameter of a tree is the maximum eccentricity of its vertices. The eccentricity of a vertex v is the length of the longest path from v to any other vertex. The vertex with minimum eccentricity defines the centre of the tree. Every tree has either one or two centres. If the diameter is even there is only one centre vertex; if the diameter is odd then two adjacent vertices form the centre of the tree.

Abdalla et al. [1] presented a greedy heuristic called One-Time Tree Construction (OTTC) for solving the BDMST problem. OTTC is a modification of Prim's algorithm that starts with a vertex and grows the spanning tree by connecting the nearest unconnected vertex to the partially built spanning tree without violating the diameter constraint. It keeps track of the eccentricity of each vertex so that the eccentricity of no vertex exceeds the diameter k.

Raidl and Julstrom [7] proposed a randomized greedy heuristic (RGH). Their algorithm starts by fixing the tree's centre. If k is even, a vertex v0 is chosen at random from the set of vertices V as the centre vertex; if the diameter is odd then another vertex v1 is chosen at random and {v0, v1} forms the centre of the tree. RGH maintains the diameter constraint by maintaining the depth of each vertex in the tree, i.e., the number of edges on the path from the tree's centre to the vertex. No vertex in T can have depth > ⌊k/2⌋. This is based on an important observation by Handler [5] that in a tree of diameter k, no vertex is more than ⌊k/2⌋ edges from the tree's centre. Thus, by fixing the tree's centre and using Handler's observation, RGH grows the spanning tree such that no vertex has depth greater than ⌊k/2⌋. On the test instances considered, RGH outperforms OTTC substantially.
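The O(n²) procedure for the k = 2 case described above can be sketched as follows. This is our own illustrative sketch, not code from the paper; the graph is assumed to be given as a symmetric matrix of non-negative edge weights.

```python
def best_star(w):
    """k = 2 BDMST: build the star centred at each vertex in turn and
    keep the cheapest one. w is a symmetric edge-weight matrix."""
    n = len(w)
    best_centre, best_cost = None, float("inf")
    for c in range(n):
        # Every other vertex attaches directly to the centre c.
        cost = sum(w[c][v] for v in range(n) if v != c)
        if cost < best_cost:
            best_centre, best_cost = c, cost
    return best_centre, best_cost
```

Each of the n candidate centres needs n − 1 additions, giving the O(n²) bound stated in the text.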
Julstrom [6] later proposed a full greedy heuristic, a modified version of RGH called Centre Based Tree Construction (CBTC), for constructing the BDMST.

[Figure residue: only the labels "Centre Vertex" and "parent(v)" survive here; the figure and the remainder of the CBTC description on this page could not be recovered.]
Pseudocode 1 RGH+HT
Require: BDMST T as computed by RGH
Ensure: diam(T) ≤ k
  moreimp ← true                          // while further improvements are possible
  while (moreimp) do
    moreimp ← false
    U ← V − {c0}                          // vertices in U without the centres c0 or c1
    if odd(k) then
      U ← U − {c1}
    end if
    while (U ≠ ∅) do
      v0 ← random(U)
      U ← U − {v0}
      ht ← height of subtree(v0)
      pv0 ← parent(v0)
      minvtx ← next_min_cost(v0)
      while (minvtx ≠ pv0) do
        if (minvtx ∉ desc(v0)) then
          if ((ht + 1 + depth[minvtx]) ≤ ⌊k/2⌋) then
            moreimp ← true
            T ← T − {(pv0, v0)} + {(minvtx, v0)}
            parent[v0] ← minvtx
            for each vertex w in the subtree rooted at v0 do
              depth[w] ← depth[parent(w)] + 1
            end for
            break
          end if
        end if
        minvtx ← next_min_cost(v0)
      end while
    end while
  end while

The improvement for CBTC [6] is the same as specified in the pseudocode for RGH+HT. Once the tree constructed by CBTC is known, we perform the same steps as in RGH+HT. CBTC+HT is CBTC with these improvements.

III. COMPUTATIONAL RESULTS

A. Experimental Setup

All our heuristics are coded in C and executed on an Intel Core 2 Duo 3.00 GHz CPU using 2 GB of RAM in a Linux environment (openSUSE 10.3). CBTC, CBTC-I and CBTC+HT were executed n times on each instance of size n, starting from each vertex in turn. RGH, RGH-I and RGH+HT were executed n times, starting from a randomly chosen vertex each time.

B. Test Instances Description

We have compared the performance of the various heuristics on euclidean as well as non-euclidean instances. The problem instances used in our experiments are the same standard BDMST benchmark instances as used in [6] and [9]. There are 45 instances in all. Twenty-five of these instances are euclidean, with five instances for each value of n ∈ {50, 100, 250, 500, 1000}. These instances can be downloaded from Beasley's OR-library (www.people.brunel.ac.uk/∼mastjjb/jeb/info.html), where they are listed as instances of the euclidean Steiner tree problem. Euclidean instances consist of n points randomly chosen in the unit square. These points are treated as the vertices of a complete graph whose edge weights are the euclidean distances between the points. The library contains 15 instances for each n, and the first 5 of them are used for the BDMST problem. The diameter bound k is taken to be 5, 10, 15, 20 and 25 for n = 50, 100, 250, 500 and 1000 respectively.

Twenty more instances, five each for n = 100, 250, 500 and 1000 vertices, were created by Julstrom [6]. These are non-euclidean or random instances: complete graphs with edge weights chosen at random from the interval [0.01, 0.99]. The diameter bound is taken to be 10, 15, 20 and 25 for n = 100, 250, 500 and 1000 vertices respectively.

C. Results of Experiments

Tables I and II report the results of the various heuristics on euclidean instances, whereas Tables III and IV do the same on non-euclidean instances. For each instance these tables list n, the diameter bound k, the best and average solutions, and the standard deviation (SD) of solutions after running the heuristics n times on each instance. Best results over the three heuristics are printed in bold. It is clear from these tables that:
1) On euclidean instances, RGH+HT obtained the best solutions (Table I). RGH+HT outperforms RGH-I, which so far gives the most promising results on euclidean instances, both in terms of best cost and average cost for all the instances (except on instances of size 50, where the best costs of RGH-I and RGH+HT are the same; however, RGH+HT gives better average values).
2) On non-euclidean instances, it is CBTC+HT (Table IV) that outperforms all other heuristics. On all the instances it gives better results not only in terms of best cost but also in terms of average cost.
3) On euclidean instances, CBTC+HT (Table II) shows substantial improvements in comparison to CBTC-I in terms of best values for all the instances with n = 1000.
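The feasibility test at the heart of the pseudocode above - a subtree of height ht rooted at v0 may be moved under minvtx only if minvtx is not a descendant of v0 and ht + 1 + depth[minvtx] ≤ ⌊k/2⌋ - can be sketched in Python. This is an illustrative sketch with our own data-structure choices (parent/children dictionaries), not the authors' C code.

```python
def subtree_height(children, v):
    """Height of the subtree rooted at v (a leaf has height 0)."""
    if not children.get(v):
        return 0
    return 1 + max(subtree_height(children, c) for c in children[v])

def is_descendant(parent, u, v):
    """True if u lies in the subtree rooted at v (follow parent links up)."""
    while u is not None:
        if u == v:
            return True
        u = parent.get(u)
    return False

def can_reattach(parent, children, depth, k, v, u):
    """Test from Pseudocode 1: reattaching v's subtree under u must not
    create a cycle and must keep every vertex within depth floor(k/2)."""
    ht = subtree_height(children, v)
    return (not is_descendant(parent, u, v)) and (ht + 1 + depth[u] <= k // 2)
```

After an accepted move, the pseudocode updates `depth[w]` for every vertex w in the moved subtree, top-down from its new parent.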
TABLE I
Results of RGH, RGH-I and RGH+HT on 25 euclidean instances having 50, 100, 250, 500 and 1000 vertices

TABLE II
Results of CBTC, CBTC-I and CBTC+HT on 25 euclidean instances having 50, 100, 250, 500 and 1000 vertices

TABLE III
Results of RGH, RGH-I and RGH+HT on 20 non-euclidean instances having 100, 250, 500 and 1000 vertices

TABLE IV
Results of CBTC, CBTC-I and CBTC+HT on 20 non-euclidean instances having 100, 250, 500 and 1000 vertices

[The bodies of Tables I-IV did not survive extraction.]
IV. CONCLUSIONS

We have improved the results of the RGH-I and CBTC-I heuristics both on euclidean and on non-euclidean instances of the BDMST problem. The improved heuristics RGH+HT and CBTC+HT take into consideration the height of a subtree before connecting it to some other vertex of the tree, thus allowing more candidate vertices for improvement than RGH-I and CBTC-I. After attaching the subtree to a new, better vertex, they update the depth of all the vertices in the subtree. This, along with multiple passes over the list of vertices, results in better solution values for the BDMST problem.

As future work, we plan to develop hybrid approaches for the BDMST problem combining RGH+HT with metaheuristics such as [9].
REFERENCES
[1] A. Abdalla, N. Deo, and P. Gupta, "Random-tree diameter and the diameter constrained MST," Congressus Numerantium, vol. 144, 2000, pp. 161-182.
[2] K. Bala, K. Petropoulos and T.E. Stern, "Multicasting in a linear lightwave network," Proceedings of IEEE INFOCOM '93, pp. 1350-1358.
[3] A. Bookstein and S.T. Klein, "Compression of correlated bit-vectors," Information Systems, vol. 16, 1990, pp. 387-400.
[4] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York, 1979.
[5] G.Y. Handler, "Minimax location of a facility in an undirected graph," Transportation Science, vol. 7, 1978, pp. 287-293.
[6] B.A. Julstrom, "Greedy heuristics for the bounded-diameter minimum spanning tree problem," ACM Journal of Experimental Algorithmics, vol. 14, 2009, pp. 1-14.
[7] G.R. Raidl and B.A. Julstrom, "Greedy heuristics and an evolutionary algorithm for the bounded-diameter minimum spanning tree problem," Proceedings of the ACM Symposium on Applied Computing, 2003, pp. 747-752.
[8] K. Raymond, "A tree-based algorithm for distributed mutual exclusion," ACM Transactions on Computer Systems, vol. 7, 1989, pp. 61-77.
[9] A. Singh and A.K. Gupta, "Improved heuristics for the bounded-diameter minimum spanning tree problem," Soft Computing, vol. 11, 2007, pp. 911-921.
Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents
of our proposed computing model, would undoubtedly save more lives in future emergencies.

1.1. Challenges for Cooperation

Wireless ad-hoc networks pose a unique set of challenges which make traditional distributed computing models difficult, if not impossible, to employ in alleviating the effects of resource limitations. Some of the identified challenges are:

• Network size - The number of devices working together to achieve a common goal will be orders of magnitude greater than those seen in traditional distributed systems.

• Heterogeneous architecture - The devices are all likely to have different hardware architectures, as they are typically tailored to perform a specific task within the network.

• Unreliable links - The links in the network are inherently fragile, with device and connection failures being the norm rather than the exception.

• Limited processing power - As mobile devices are likely to have size and weight restrictions, they are limited in their processing power and battery capacity.

• Dynamic topology - The availability of the devices may vary greatly with time, with devices becoming unreachable due to mobility or due to depletion of energy.

• Limited reach - Because of the nature of wireless communication, devices can communicate directly only with those devices that are within their transmission range.

1.2. Cooperative Setup

Applications designed to suit the ACC model will typically target specific properties within the network and not individual devices. Such targeted properties could include specific data and/or resources that the application is interested in. From the application's point of view, devices with the same properties are interchangeable. Thus fixed naming schemes such as IP addressing are inappropriate in most situations. As discussed in [11], a naming scheme based on the content or property of a device is more appropriate for wireless ad-hoc networks.

Due to network volatility and the dynamic binding of names to devices, execution-migration-based distributed computing is more suitable for wireless ad-hoc networks than distributed computing based on data migration (message passing) [4]. Hence, the system architecture for ACC is based on the concept of execution migration. Applications that are in compliance with the ACC model shall consist of migratory execution units called Ant Agents which work together to accomplish a common goal. Ant Agents (AA), similar to Mobile Agents [13], are collections of code and data blocks. They migrate through the network, executing on each device in their path, foraging for devices of interest or devices with desired properties.

The agents are also self-routing, namely, they are responsible for determining their own paths through the network. In the proposed ACC model, AAs forage the network for devices of interest using ant-like routing algorithms [3], [5], [8], [19]. Such routing algorithms, based on the behavior of social insects in nature, are known to result in an optimal route between the source and the destination [20]. For their part, devices in the network support AAs by providing:

• A name based memory system, and

• An architecturally independent environment for receipt and execution of Ant Agents.

To validate the proposed computing model, we have developed a simulator that executes AAs, allowing us to evaluate both the execution and communication time of a distributed application. In this simulator we execute applications that are modeled as per the Bag-of-Tasks paradigm of distributed computing. Simulation results show that our proposed computation model is able to significantly improve the execution times of mobile applications on resource constrained devices.

The rest of this paper is organized as follows. The next section describes our proposed Ad-hoc Cooperative Computation model. Section 3 presents the system architecture that supports the proposed model. In Section 4, we discuss the details of Ant Agents, while in Section 5 we discuss the application paradigms implemented using our proposed model. Section 6 discusses related work and Section 7 concludes the paper.

2. Ad-hoc Cooperative Computation Model

To exploit the raw computing power of large-scale, heterogeneous, wireless ad-hoc networks we propose a distributed computing model called the Ad-hoc Cooperative Computation (ACC) model. This model is based on the social behavior of ants, which work together in groups to execute tasks that are beyond the abilities of a single member. The ACC model consists of distributed applications that are defined as a dynamic collection of Ant Agents (AA) which cooperate amongst themselves to collectively achieve a common objective. The execution of an AA can be described in two phases: a forage-and-migrate phase followed by a computation phase. The AA execution performed at
each step may differ based on the properties of its hosting device. On devices that meet the application's targeted properties (devices of interest), an AA may advance its execution state, while on other devices it only executes its routing algorithm. Like any mobile agent, an AA carries along with it its mobile data and mobile code, as well as a lightweight execution state.

Devices in the network support the reception and execution of AAs by providing an architecturally independent programming environment (e.g., a Virtual Machine [9]) as well as a name based memory system (e.g., Tag Space [22]). The AAs, along with the system support provided by the devices in the network, form the ACC infrastructure which allows the execution of distributed applications over ad-hoc wireless networks.

Our proposed computational model allows the user to execute distributed tasks in ad-hoc networks by simply injecting the corresponding AAs into the network. To do this, the user need not have any prior knowledge about either the scale or the topography of the network, nor the specific functionality of the devices involved. Additionally, making the AAs intelligent eliminates the issue of implementing new protocols on all the devices in the network, a task which is difficult or even impossible to do using current approaches [11].

Because of their intelligence, AAs are reasonably resilient to network volatility. When certain devices become unavailable due to their mobility or energy depletion, AAs are able to adapt well by either finding a new path to their destination or by foraging for other devices in the network that meet the properties targeted by the application.

Figure 1. Generating prime numbers

Let us consider two example applications that demonstrate the computation and communication aspects of the proposed ACC model. Figure 1 depicts an application where the joint task is to generate all prime numbers less than some limit MAX, say 40,000. Since prime number generation is a processor-intensive task, the resource-limited source device, depicted as a black circle, injects AAs into the network seeking cooperation from other devices in the network. Because the originating device is processor limited, the injected AAs are initialized to forage for computing cycles within the network. Once initialized, each AA forages the network for available computing resources. When an AA finds a device with enough spare CPU cycles, depicted as a gray circle, it proceeds to calculate the set of primes within its given range. Upon completion, each AA reports its results back to the originating device by tracing back its migratory route.

Figure 2. 3-D modeling using Computer-aided design

Next, let us consider a Computer-aided design application which is required to generate a 3-D model given its top, front and profile views. Since the originating device, depicted as a black circle in Figure 2, is missing the required views, it injects an AA into the network seeking cooperation from devices which have the required data. Because the originating device is data limited, the injected AA is initialized to forage for specific data within the network. Once initialized, the AA forages the network for the three views of the object in question. When the AA finds a device with the relevant data, depicted as a gray circle, it proceeds to process the available view. Finally, having processed all three views, the agent reports the results back to the originating device by tracing back its migratory path.

For applications that deal with large amounts of data, moving the execution to the source of the data whenever possible will improve the overall performance of the distributed system. For example, when using an AA for object recognition, performing the image analysis on the device that acquired the image whenever possible, rather than transferring the image (or sequence of images) over the network, would result in improved response time and bandwidth usage while reducing the overall energy consumed. Similarly, caching frequently used code blocks on devices
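The prime-generation example can be viewed as a bag of independent range tasks. A minimal sketch of how the source device might split the work among cooperating AAs and merge the reported results (our own illustration; function names are not from the paper or its simulator):

```python
def make_tasks(limit, chunk):
    """Split [2, limit) into independent subranges, one per Ant Agent."""
    return [(lo, min(lo + chunk, limit)) for lo in range(2, limit, chunk)]

def primes_in_range(lo, hi):
    """Task body run on a cooperating device: primes in [lo, hi)."""
    return [n for n in range(lo, hi)
            if n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))]

def merge_results(tasks):
    """Originating device merges the results each AA reports back."""
    primes = []
    for lo, hi in tasks:
        primes.extend(primes_in_range(lo, hi))
    return primes
```

Because the subranges are disjoint and order-independent, each task can execute on whichever device an AA happens to find with spare CPU cycles.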
that regularly host AAs can also limit the impact of the code transfer occurring with every injected AA.

Security is an important issue in any cooperative computing model. Addressing it in our proposed model would mean protecting AAs against malicious devices as well as protecting devices against malicious AAs. Although realizing this requires a comprehensive security framework to be in place, we limit the current architecture to simple admission control using authentication mechanisms based on digital signatures.

3. System Architecture

Considering the heterogeneous nature of the network, the system architecture aims to place as much intelligence as possible in the Ant Agents and keep the support required from the devices in the network to a minimum.

3.2. Resource Manager

Each AA lists its estimated resource requirements in a Resource Table located in the AA header. The Resource Manager is responsible for receiving an authenticated AA from the Security Manager and storing it into the AA Queue, subject to the requested resource constraints being satisfied. The Resource Manager also checks whether the code section of the incoming AA is already cached locally.

3.3. Virtual Machine

The Virtual Machine is a hardware abstraction layer for the execution of AAs across all the heterogeneous hardware platforms present in the ad-hoc wireless network. Examples include the Java Virtual Machine, the K Virtual Machine, etc.
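The Resource Manager's admission decision can be sketched as a comparison of the AA header's resource table against the device's spare capacity, with the code-caching check folded in. This is an illustrative sketch; the field names and return convention are our own assumptions, not from the paper.

```python
def admit(resource_table, available, cached_code_ids, code_id):
    """Admission-control sketch: accept an AA only if every estimated
    requirement in its resource table fits the device's spare capacity.
    Returns (accepted, sections the source still needs to transfer)."""
    for resource, needed in resource_table.items():
        if needed > available.get(resource, 0):
            return False, []  # resource constraint violated: reject the AA
    # Code caching: ask the source only for sections not already held locally.
    if code_id in cached_code_ids:
        return True, ["data", "state"]
    return True, ["code", "data", "state"]
```

In the three-way handshake described later, only the identification, signature and resource-table fields would be examined at this stage; the returned section list tells the source what to send in the second transfer.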
4. Ant Agents 3. After completion of execution, the AA may migrate
to other devices of interest or may return back to the
AAs, like mobile agents, are migratory execution units consisting of code, data and an execution state. In ACC, the behavior of AAs is modeled on the behavior of ants in nature. Just as ants in nature cooperate with each other to forage for food, AAs in ad-hoc networks cooperate with each other to forage for devices that satisfy the application-targeted properties (devices of interest). In the context of the proposed computing model, user applications can be viewed as a collection of AAs cooperating with each other to achieve a common goal. Such AAs are intelligent and are capable of routing themselves without needing any external support. When admitted for execution on the hosting device, the computation code within the AA is embodied in a task. During its execution this task may modify the data sections of the AA, modify the local tags to which it has access, migrate to another device, or block on other tags of interest.

4.1. Format

In addition to its identity and authentication information, an AA comprises code and data sections, a lightweight execution state and a resource estimate table. A digital signature together with the AA and task IDs identifies an AA. The digital signature is used by the host devices to protect access to an AA's tags. The code and data sections contain the mobile code and data that an AA carries from one device to another. The state field contains the execution context necessary for task resumption after a successful migration. The resource table consists of resource estimates such as execution time, memory requirements, etc.; these estimates set a bound on the expected needs of the AA at the host device. Figure 5 depicts the skeletal structure of the Ant Agent.

… originating device with the results of execution.

Ant Agent Admission

To avoid unnecessary resource consumption, the Security Manager executes a three-way handshake protocol for transferring AAs between neighboring devices. First, only the identification information, digital signature and resource table information is sent to the destination for admission control. If the AA admission fails, either due to security or resource constraints, the transferring task is notified so that it can decide upon subsequent action. If the AA is accepted, the Resource Manager at the destination checks whether the code section is already cached locally and informs the source device to transfer only the missing sections. Thus, if code caching is enabled, the subsequent transfer cost of the code is amortized over time.

Ant Agent Execution

Upon admission, an AA becomes a task which is scheduled for execution by the Virtual Machine (VM). The execution of an AA is non-preemptive, but new AAs can be admitted during execution. An executing AA can yield the VM by blocking on a tag. The VM makes sure that a task conforms to its declared resource estimates; otherwise, the task can be forcefully removed from the system.

Ant Agent Migration

If the current computation of the AA does not complete on the hosting device, the task may continue its execution on another device. The current execution state is captured and migrated along with the code and data sections. In case the current computation of the AA does complete successfully, the execution state as well as the results are captured and migrated back to the originating device.
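The AA format described in Section 4.1 can be pictured as a small record type. The following is a minimal illustrative sketch; all field names and the example values are our assumptions for illustration, since the paper specifies only the kinds of information carried, not a concrete layout:

```python
from dataclasses import dataclass, field

@dataclass
class AntAgent:
    """Illustrative sketch of the Ant Agent format of Section 4.1."""
    aa_id: int            # AA identifier
    task_id: int          # task identifier
    signature: bytes      # digital signature protecting access to the AA's tags
    code: bytes           # mobile code section
    data: dict = field(default_factory=dict)       # mutable data sections
    state: dict = field(default_factory=dict)      # lightweight execution state
    resources: dict = field(default_factory=dict)  # resource estimate table

# During admission (Ant Agent Admission), a host would first receive only
# aa_id, task_id, signature and resources; code, data and state follow
# only after the AA is accepted.
agent = AntAgent(aa_id=1, task_id=7, signature=b"sig", code=b"<bytecode>",
                 resources={"exec_time_ms": 50, "mem_kb": 128})
```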
4.3. Routing
Ant Cooperation in Nature

In nature, ants have the ability to find the shortest path from their colonies to food [6]. As an ant moves, it deposits a substance called pheromone on the ground. This deposited pheromone is unique to each colony and is used by its members to establish a route to the food source. Initially, when ants start out with no prior information, they search for food by walking in random directions. When an ant finds food, it follows its pheromone trail back to the colony, laying down more pheromone along its successful path. When other ants run into a trail of pheromone, they give up their own search and start following the existing trail. Since ants follow the trail with the strongest pheromone concentration, the pheromone on the branches of the shortest path to the food grows faster than the concentration on other branches. As pheromone evaporates over time, the colony forgets older, sub-optimal paths.

Since AAs are modeled on the behavior described above, over time they too are expected to avoid sub-optimal paths between the originating device and the device of interest. In case the pheromone tags are missing, an AA can forage for the device of interest by spawning another AA for route discovery and blocking on its pheromone tag. A write on this tag unblocks the waiting AA, which then resumes its migration. Since the tags persist for their lifetime, pheromone information once acquired can be used by subsequent AAs that belong to the same application.

5. Simulations

To demonstrate that many distributed applications can be written using Ant Agents, we have implemented a simple application belonging to the Bag-of-Tasks paradigm of distributed applications. There are three reasons for choosing an application conforming to this paradigm. First, many applications that fit this paradigm are highly computationally intensive and can thus benefit from the cooperation of other devices in wireless ad-hoc networks. Second, such an application can easily be divided into a large number of coarse-grain tasks. Third, each of these tasks is highly asynchronous and self-contained, and there is limited communication amongst the tasks. These three properties make the chosen paradigm suitable for execution in a networked environment.

5.1. Bag-of-tasks Paradigm

The bag-of-tasks paradigm applies when the same function is to be executed a large number of times for a range of different parameters. If applying the function to a set of parameters constitutes a task, then the collection of all tasks that need to be solved is called the bag of tasks. Such a collection of tasks need not be solved in any particular order. Workers are entities capable of executing and solving tasks from the bag. At each iteration a worker grabs one task from the bag and computes the result.

Bag-of-tasks applications share a general structure. The first step is to initialize the problem data. Then the bag of tasks is created, where the termination condition either represents a fixed number of iterations or is given implicitly by reading input values from a file until the end of the file. The actual computation is represented by a loop, which is repeated until the bag is empty. Multiple workers may execute the loop independently. All workers have shared access to the task bag and the output data. Each worker repeatedly removes a task, solves it by applying the main compute function to it, and writes the results into a file.

5.2. ACC Implementation

Because the number of tasks in the bag-of-tasks paradigm may be large, it is useful to allow tasks to be generated on the fly. An AA is created specifically for this purpose. This task-generating AA, called the Generator AA, typically stays at the originating device. When the number of tasks in the task pool falls below a certain threshold, the Generator AA generates additional new tasks; it terminates when the bag of tasks becomes empty.

After generating the initial set of tasks, the Generator AA injects them as AAs into the network. These AAs independently forage the network for adequate computing resources. When an AA discovers a device of interest, it migrates there and starts executing its task. Upon completion, the results are returned to the Generator AA on the originating device. The Generator AA injects a new AA into the network for every execution result received. This new AA can swiftly migrate to a device of interest by following the pheromone trail of the previous successful AAs. The number of AAs in the network is not fixed, but can be dynamically changed to adapt to changes in the network. Figure 6 depicts a snapshot of the ACC implementation, with the Generator AA located at the originating device (black circle) while four application AAs execute at four different devices of interest (gray circles). The arrows indicate the back-and-forth migration of the AAs.

5.3. Performance

The bag-of-tasks paradigm is widely used in many scientific computations. Our experiments with this paradigm were based on a Monte Carlo simulation of a model of light transport in organic tissue [21]. The simulation runs as follows. Once launched, a photon is moved a distance where
it may be scattered, absorbed, propagated undisturbed, internally reflected or transmitted out of the tissue. The photon is repeatedly moved until it either escapes from, or is absorbed by, the tissue. This process is repeated until the desired number of photons has been propagated.

Figure 6. Illustration of the B-o-T implementation

Because the model assumes that the movement of each photon in the tissue is independent of all other photons, this simulation fits well in the bag-of-tasks paradigm. The experiment results are shown in Figure 7. The graph presents a near-linear speedup for each additional AA injected into the network.

Figure 7. Speedup for B-o-T experiments

Ant Agents bear some similarity to Active Messages [7], Active Networks [18], [23], [25], Mobile Agents [10], [16] and Smart Messages [12]. Although Ant Agents borrow implementation solutions from all of them, the concept is markedly different.

Similar to Active Messages [7], the receipt of an Ant Agent at any device in the network leads to the execution of some code block on the receiving device. However, while Active Messages point to a handler at the receiving device, Ant Agents carry their own code with them. Moreover, Ant Agents and Active Messages address completely different problems: while Active Messages target fast communication in system-area networks, Ant Agents are meant to address large, heterogeneous, ad-hoc wireless networks.

The Smart Packets [23] architecture provides a flexible means of network management through the use of mobile code. Smart Packets are implemented over IP, using the IP option header. They are routed just like other data traffic in the network and only execute on arrival at a specific location [4]. Unlike Smart Packets, Ant Agents are executed at each hop in the network, not only to deposit their artificial pheromone but also to determine their next hop towards the destination. Additionally, Ant Agents carry their execution context along with them.

The ANTS [25] capsule model of programmability allows forwarding code to be carried and safely executed inside the network by a Java VM [4]. Compared to Ant Agents, ANTS does not migrate the execution state from device to device. Also, ANTS targets IP networks, while Ant Agents target large, heterogeneous, wireless ad-hoc networks.

A Mobile Agent [16] may be viewed as a task that explicitly migrates from node to node, assuming that the underlying network assures its transport between them [4]. Unlike mobile agents, however, Ant Agents are responsible for their own routing in the network. The ACC architecture further defines the infrastructure that devices in the network must implement in order to support Ant Agents.

Ant Agents are similar to Smart Messages [12], which also use migration of code in wireless networks. Apart from being responsible for their own routing, Smart Messages also carry with them their execution state as well as their code and data blocks during every migration. However, unlike Smart Messages, Ant Agents are modeled on ants in nature. Hence, Ant Agents achieve stigmergic communication by recording their pheromone in the Tag Space of every device visited, where it can be sensed by other Ant Agents belonging to the same distributed application. Such ant-like behavior, when employed in large numbers, is known to yield the emergence of optimal paths [15].

This paper has described a computing paradigm for large-scale, heterogeneous, wireless ad-hoc networks. In the proposed model, distributed applications are implemented as a collection of Ant Agents. The model overcomes the scale, heterogeneity and connectivity issues by placing the intelligence in migratory execution units. The devices in the
network cooperate by providing a common minimal system support for the receipt and execution of Ant Agents. Simulations for the Bag-of-Tasks family of distributed applications demonstrated that Ad-hoc Cooperative Computation represents a flexible and simple solution for surmounting resource constraints on mobile computing devices.

References

[1] Car 2 Car Communication Consortium. http://www.car-to-car.org/.
[2] W. Alsalih, S. Akl, and H. Hassanein. Cooperative ad hoc computing: towards enabling cooperative processing in wireless environments. International Journal of Parallel, Emergent and Distributed Systems, 23(1):58–79, February 2008.
[3] J. S. Baras and H. Mehta. A probabilistic emergent routing algorithm for mobile ad hoc networks, 2003.
[4] C. Borcea, D. Iyer, P. Kang, A. Saxena, and L. Iftode. Cooperative computing for distributed embedded systems. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS 2002), pages 227–236, 2002.
[5] G. D. Caro, F. Ducatelle, and L. M. Gambardella. AntHocNet: An adaptive nature-inspired algorithm for routing in mobile ad hoc networks. European Transactions on Telecommunications, 16:443–455, 2005.
[6] G. D. Caro and M. Dorigo. AntNet: A mobile agents approach to adaptive routing. Technical report, 1997.
[7] T. V. Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: a mechanism for integrated communication and computation, 1992.
[8] M. Günes and O. Spaniol. Routing algorithms for mobile multi-hop ad-hoc networks. In Proceedings of the International Workshop on Next Generation Network Technologies, European Commission Central Laboratory for Parallel Processing, Bulgarian Academy of Sciences, 2002.
[9] R. P. Goldberg. Survey of virtual machine research. IEEE Computer, pages 34–45, June 1974.
[10] R. Gray, D. Kotz, G. Cybenko, and D. Rus. Mobile agents: Motivations and state-of-the-art systems. Technical report, Handbook of Agent Technology, 2000.
[11] J. Heidemann, F. Silva, C. Intanagonwiwat, R. Govindan, D. Estrin, and D. Ganesan. Building efficient wireless sensor networks with low-level naming. In SOSP '01: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, pages 146–159, 2001.
[12] P. Kang, C. Borcea, G. Xu, A. Saxena, U. Kremer, and L. Iftode. Smart Messages: A distributed computing platform for networks of embedded systems. The Computer Journal, Special Focus on Mobile and Pervasive Computing, 47:475–494, 2004.
[13] D. B. Lange and M. Oshima. Seven good reasons for mobile agents. Commun. ACM, 42(3):88–89, 1999.
[14] K. Lorincz, D. J. Malan, T. R. F. Fulford-Jones, A. Nawoj, A. Clavel, V. Shnayder, G. Mainland, M. Welsh, and S. Moulton. Sensor networks for emergency response: Challenges and opportunities. IEEE Pervasive Computing, 3(4):16–23, 2004.
[15] V. Maniezzo and A. Carbonaro. Ant colony optimization: An overview. In Essays and Surveys in Metaheuristics, pages 21–44. Kluwer Academic Publishers, 1999.
[16] D. S. Milojicic, W. LaForge, and D. Chauhan. Mobile objects and agents (MOA), 1998.
[17] R. Min and A. Chandrakasan. A framework for energy-scalable communication in high-density wireless networks. In International Symposium on Low Power Electronics and Design, pages 36–41, 2002.
[18] J. T. Moore, M. Hicks, and S. Nettles. Practical programmable packets. In Proceedings of the 20th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2001), pages 41–50, 2001.
[19] Y. Ohtaki, N. Wakamiya, M. Murata, and M. Imase. Scalable ant-based routing algorithm for ad-hoc networks. In 3rd IASTED International Conference on Communications, Internet, and Information Technology, 2004.
[20] Y. Ohtaki, N. Wakamiya, M. Murata, and M. Imase. Scalable and efficient ant-based routing algorithm for ad-hoc networks. IEICE Transactions on Communications, E89-B(4):1231–1238, January 2006.
[21] S. A. Prahl, M. Keijzer, S. L. Jacques, and A. J. Welch. A Monte Carlo model of light propagation in tissue. SPIE Proceedings of Dosimetry of Laser Radiation in Medicine and Biology, IS(5):102–111, 1989.
[22] O. Riva, T. Nadeem, C. Borcea, and L. Iftode. Context-aware migratory services in ad hoc networks. IEEE Transactions on Mobile Computing, 6(12):1313–1328, 2007.
[23] B. Schwartz, A. W. Jackson, W. T. Strayer, W. Zhou, R. D. Rockwell, and C. Partridge. Smart Packets for active networks, 1998.
[24] B. Warneke, M. Last, B. Liebowitz, and K. S. J. Pister. Smart Dust: Communicating with a cubic-millimeter computer. IEEE Computer, 34(1):44–51, January 2001.
[25] D. Wetherall. Active network vision and reality: Lessons from a capsule-based system, 1999.
A Scenario-based Performance Comparison Study of the Fish-eye State
Routing and Dynamic Source Routing Protocols for Mobile Ad hoc Networks
proposed to lower the traditionally observed higher control overhead associated with this class of protocols. In FSR, a node exchanges its link-state updates more frequently with nearby nodes and less frequently with nodes that are farther away. The number of nodes with which link-state information is exchanged more frequently is controlled by the "Scope" parameter (basically, a number of hops), while the frequency of updating the neighbors outside the scope is controlled by the "Time Period of Update" (TPU) parameter. The operation of FSR is basically controlled by these two parameters. As a result, a node maintains accurate distance and path information about its nearby nodes, with progressively less accurate detail about the paths to nodes that are farther away. A scope value of 1 and a larger TPU value typically result in lower control overhead at the cost of a higher hop count (a sub-optimal path) between any two nodes. On the other hand, a scope value equal to the diameter of the network and a smaller TPU value basically transform FSR into OLSR, resulting in higher control overhead with the advantage of being able to use the minimum-hop path between any two nodes.

Given that the scope parameter is normally set to 1 hop, the critical performance metrics for FSR, such as the control overhead (number of link-state messages exchanged), path hop count and energy consumption, are heavily dependent on the TPU parameter. To date, only a handful of performance studies ([8][9][10]) are available for FSR in the literature, and to the best of our knowledge there is no simulation study of the performance of FSR as a function of the TPU parameter. In addition, we conjecture that as node mobility and network density increase, the proactive routing strategy of FSR may be preferable over the reactive DSR. DSR and FSR have not been categorically studied for different levels of node mobility, network density and offered traffic load. The above observations are the motivation for this paper.

In this paper, we present a simulation-based performance analysis of FSR with respect to the TPU parameter under scenarios generated by different combinations of node mobility, network density and offered traffic load. For each of these scenarios, the performance of FSR is also compared with that obtained for DSR. We categorically state which of the two protocols is preferable for each of the different scenarios. The rest of the paper is organized as follows: Section 2 describes the simulation environment and the scenarios considered. Section 3 defines the performance metrics evaluated. Section 4 illustrates the simulation results obtained for the different scenarios, interprets the performance of FSR with respect to the TPU parameter and compares the performance of FSR vis-à-vis DSR. Section 5 concludes the paper.

2. Simulation Environment

The simulations of FSR and DSR were conducted in ns-2 [10]. The network dimensions are 1000m x 1000m and the transmission range of each node is 250m. We vary the network density by conducting simulations with 50 nodes (a low-density network with an average of 10 neighbors per node) and 75 nodes (a high-density network with an average of 15 neighbors per node). The simulation time is 1000 seconds and the scope value is 1 hop. If all the nodes flooded their link-state updates at the same time instant, there would be collisions in the network; hence, the TPU value for each node is uniformly and randomly chosen from the interval [0…TPUmax]. The values of TPUmax studied in the simulations are 5, 20, 50, 100, 200 and 300 seconds. For simplicity, we refer to TPUmax as TPU for the rest of this paper.

The node mobility model used in all of our simulations is the commonly used Random Waypoint model [11]. Each node starts moving from an arbitrary location towards a randomly selected destination location at a speed uniformly distributed in the range [0,…,vmax]. Once the destination is reached, the node may stop there for a certain time called the pause time (0 seconds in our simulations) and then continues to move by choosing a different target location and a different velocity. The vmax values used are 5 m/s, 50 m/s and 100 m/s; the corresponding average node velocities are 2.5 m/s, 25 m/s and 50 m/s, representing low (school environment), moderate (downtown) and high (interstate highway) mobility levels respectively.

Traffic sources are constant bit rate (CBR). The number of source-destination (s-d) sessions used is 15 (low traffic load) or 40 (high traffic load). The starting times of the s-d sessions are uniformly distributed between 1 and 20 seconds. Data packets are 512 bytes in size and the packet sending rate is 4 data packets per second. While distributing the source and destination roles among the nodes, we ensured that a node does not end up as the source of more than two sessions, nor as the destination of more than two sessions.

Each node is initially provided with 1000 Joules of energy, to make sure that no node fails due to inadequate energy supply. The transmission power loss per hop is fixed at 1.4 W and the reception power loss is 1 W [12]. The Medium Access Control (MAC) layer model used is the standard IEEE 802.11 model [13], wherein access to the channel per hop is accomplished using a Request-to-Send (RTS) and Clear-to-Send (CTS) control message exchange between the sender and the receiver constituting the hop in a path. The different combinations of simulation scenarios used in this paper are summarized in Table 1.
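The Random Waypoint behavior described above can be sketched in a few lines. This is a simplified illustration using the paper's parameter values (a 1000m x 1000m area, zero pause time); the function and variable names are our own choices, not part of the ns-2 setup:

```python
import random

def random_waypoint(area=(1000.0, 1000.0), vmax=50.0, steps=3, seed=7):
    """Generate the waypoints a node visits under Random Waypoint mobility.

    Returns a list of (x, y, speed) tuples: the starting point followed by
    `steps` randomly chosen destinations, each approached at a speed drawn
    uniformly from [0, vmax]. Pause time is 0 s, as in the paper's setup.
    """
    rng = random.Random(seed)
    x, y = rng.uniform(0, area[0]), rng.uniform(0, area[1])
    trajectory = [(x, y, 0.0)]
    for _ in range(steps):
        dest = (rng.uniform(0, area[0]), rng.uniform(0, area[1]))
        speed = rng.uniform(0.0, vmax)
        # the node moves in a straight line to `dest` at `speed`,
        # then immediately picks a new destination (pause time = 0)
        x, y = dest
        trajectory.append((x, y, speed))
    return trajectory
```

A well-known caveat of this model is that speeds drawn near zero make a node linger on one leg almost indefinitely, so practical scenario generators often impose a small minimum speed; the sketch above follows the paper's stated [0, vmax] range.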
Table 1: Scenarios Studied in the Simulation
3. Performance Metrics

The following performance metrics are evaluated for each of the 12 scenarios (listed in Table 1) and each of the six TPU values considered.
(i) Packet Delivery Ratio – the ratio of the number of data packets successfully delivered from the source to the destination to the number of data packets originated at the source.
(ii) Average Hop Count per Path – the average number of hops in the route of an s-d session, time-averaged over the duration of the s-d paths for all the sessions over the entire simulation time.
(iii) Control Message Overhead – the ratio of the total number of control messages (route discovery broadcast messages for DSR, or link-state update broadcast messages for FSR) received at the nodes to the actual number of data packets delivered to the destinations across all s-d sessions.
(iv) Energy Consumption per Node – the average energy consumed across all the nodes in the network. The energy consumed due to the transmission and reception of data packets, periodic broadcasts and receptions (in the case of FSR), and route discoveries (in the case of DSR) all contribute to the energy consumed at a node.

Note that we consider the number of control messages received rather than transmitted, because a typical broadcast involves one node transmitting the control message and all of its neighbors receiving it; the energy expended to receive the control message, summed over all the nodes, is far more than the energy expended to transmit the message.

4. Simulation Results

Each data point in Figures 1 through 4 and Tables 2 and 3 is an average of data collected using 5 mobility trace files for each value of vmax and network density, and 5 sets of randomly selected 15 and 40 s-d sessions. To present the results of FSR (with larger TPU values) and DSR on a comparable scale in the figures, we present the control message overhead and energy consumption per node incurred by FSR for the maximum TPU value of 5 seconds in Tables 2 and 3 respectively.

4.1 Low Network Density and Low Traffic Load (Scenarios 1 through 3)

The packet delivery ratio of FSR (refer Figure 1.1) decreases as the TPU value increases. This can be attributed to the inaccuracy of the routing information stored at the intermediate nodes for certain destination nodes. However, it should be noted that FSR still consistently maintains a packet delivery ratio above 90%, even for TPU values exceeding 200 seconds. For both FSR and DSR, as node mobility is increased from 5 m/s to 50 m/s, there is an increase in the packet delivery ratio. In low-density networks, the spatial distribution of nodes plays a critical role in the effectiveness of a routing protocol. Nodes are sparsely distributed in a low-density network, and if they are also characterized by low mobility, they tend to experience higher rates of network disconnection. Consequently, since the nodes do not change their positions frequently, the disconnected state persists and packet delivery is adversely impacted. In contrast, as node mobility increases, nodes are redistributed and move to new locations, increasing the probability that they come within transmission range of each other. As a result, the probability of network connectivity increases, and with it the likelihood of a node successfully routing a packet to its destination.

In the low node mobility scenario, FSR was observed to yield a more optimal minimum-hop path than DSR for a time period of update (TPU) value of 5
Table 2: Control Message Overhead (control messages received per data packet delivered) for Maximum TPU Value of 5 Seconds

Maximum Node     Low Density,       Low Density,        High Density,      High Density,
Velocity (vmax)  Low Traffic Load   High Traffic Load   Low Traffic Load   High Traffic Load
5 m/s            178                64                  585                220
50 m/s           182                69                  640                235
100 m/s          180                67                  660                250

Table 3: Energy Consumption per Node for Maximum TPU Value of 5 Seconds

Maximum Node     Low Density,       Low Density,        High Density,      High Density,
Velocity (vmax)  Low Traffic Load   High Traffic Load   Low Traffic Load   High Traffic Load
5 m/s            104 Joules         126 Joules          212 Joules         230 Joules
50 m/s           110 Joules         129 Joules          235 Joules         250 Joules
100 m/s          109 Joules         129 Joules          240 Joules         255 Joules
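The rationale given in Section 3 for counting control-message receptions rather than transmissions is easy to see from the per-hop power figures: a single broadcast is transmitted once but received by every neighbor. A quick sketch using the paper's power values (the 1 ms airtime and the neighbor count are illustrative assumptions of ours):

```python
TX_POWER_W = 1.4   # transmission power loss per hop (Section 2)
RX_POWER_W = 1.0   # reception power loss (Section 2)

def broadcast_energy(neighbors, airtime_s):
    """Energy spent on one control-message broadcast: one transmission
    plus one reception at each of `neighbors` nodes."""
    tx_energy = TX_POWER_W * airtime_s
    rx_energy = RX_POWER_W * airtime_s * neighbors
    return tx_energy, rx_energy

# With ~10 neighbors per node (the low-density setup) and a hypothetical
# 1 ms airtime, aggregate reception energy (10 mJ) is roughly 7x the
# single transmission's energy (1.4 mJ).
tx, rx = broadcast_energy(neighbors=10, airtime_s=0.001)
```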
Figure 1: Performance of FSR and DSR in Low Density Network and Low Traffic Load Scenarios (1.1: Packet Delivery Ratio; 1.2: Average Hop Count per Path; 1.3: Control Message Overhead; 1.4: Average Energy Consumption per Node)
seconds (see Figure 1.2). In low node density networks, nodes are sparsely distributed and the availability of routes between s-d pairs is not always guaranteed. At low mobility, nodes are less likely to change their location, which hinders them from discovering more optimal routes to destinations. In addition, DSR tends to maintain its current minimum-hop route until a link failure is detected, predisposing it to retain sub-optimal routing information in low node density scenarios. Consequently, since FSR proactively maintains more accurate topology information at lower TPU values, it outperforms DSR by determining more optimal minimum-hop paths. In contrast, at higher TPU values FSR propagates routing information infrequently, and DSR outperforms FSR. The degradation in the performance of FSR can be attributed to routing inaccuracy resulting from the longer link-state update intervals at which broadcast messages about the network topology are exchanged.

FSR incurs a significantly higher control overhead than DSR at the lower TPU value of 5 seconds (see Table 2 and Figure 1.3). FSR periodically generates network-wide broadcasts, once every TPU interval, with the purpose of establishing routes for every node in the network.
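The scaling behind FSR's overhead trend can be made concrete with a rough count of its periodic floods. This back-of-the-envelope model is our simplification, not the paper's: it simply treats each node as originating one network-wide link-state update per TPU interval.

```python
def fsr_update_count(num_nodes, sim_time_s, tpu_s):
    """Approximate number of link-state updates originated network-wide,
    assuming one update per node per TPU interval (a simplification)."""
    return num_nodes * sim_time_s / tpu_s

# 50 nodes, 1000 s simulation (the paper's low-density setup):
frequent = fsr_update_count(num_nodes=50, sim_time_s=1000, tpu_s=5)    # 10000 updates
sparse = fsr_update_count(num_nodes=50, sim_time_s=1000, tpu_s=300)    # ~167 updates
# Shrinking TPU from 300 s to 5 s multiplies the update traffic 60-fold,
# which is the trend behind FSR's high overhead at a TPU of 5 seconds.
```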
Figure 2: Performance of FSR and DSR in Low Density Network and High Traffic Load Scenarios (2.1: Packet Delivery Ratio; 2.2: Average Hop Count per Path; 2.3: Control Message Overhead; 2.4: Average Energy Consumption per Node)
This process of periodic broadcasts generates high control overhead, especially if it is done rather frequently, as is the case for a TPU value of 5 seconds. In contrast, DSR incurs less overhead than FSR because it generates fewer control packets in a low network density scenario. DSR performs network-wide flooding only when a route is needed for a data transmission session, and thus its control overhead depends on the offered traffic load (the number of s-d sessions).

In comparison to DSR, FSR generates less control message overhead (refer Figure 1.3) for TPU values ranging from 50 to 300 seconds. With respect to node mobility, the amount of overhead generated by DSR appears to grow with increasing mobility, whereas FSR remains unaffected by variations in node mobility.

At lower TPU values, DSR consumes less energy per node than FSR (refer Table 3 and Figure 1.4). This is expected because DSR, a reactive protocol, should incur less energy consumption as a result of generating less control overhead than a proactive routing protocol like FSR. However, we do notice that operating FSR at higher TPU values helps to minimize the energy consumption per node. FSR loses less energy per node than DSR in the high-mobility scenario of 100 m/s for TPU values of 100 seconds and 200 seconds. Figure 1.1 shows that the packet delivery ratios of FSR for TPU values of 100 seconds and 200 seconds in the characteristic high-mobility scenario of 100 m/s are at least 94%. Thus, FSR can be utilized in high node velocity scenarios for applications that require optimized energy consumption and can tolerate a packet delivery ratio of approximately 94%.

4.2 Low Network Density and High Traffic Load (Scenarios 4 through 6)

Both DSR and FSR exhibit an appreciable increase in their respective packet delivery ratios (refer Figure 2.1) at the low node mobility of 5 m/s. However, as node mobility is increased to 50 m/s and 100 m/s, both protocols experience a slight decrease in their packet delivery ratios. This observation is justified as follows: in networks of low density and high traffic load, the number of neighbors per node is significantly smaller than the number of active s-d pairs. As a result, more demand is placed on a few nodes to successfully route packets to their destinations. This results in more packets being dropped at each node and hinders the ability of both protocols to route packets to their destinations at a higher rate. As node velocity is increased from low to high, FSR incurs a higher hop count than DSR, except for the TPU value of 5 seconds (see Figure 2.2). The hop count of DSR is not much affected by the node velocities.

As illustrated in Table 2, FSR is observed to incur a higher control overhead than DSR at the lower TPU value of 5 seconds, due to frequent network-wide broadcasts.
Figure 3: Performance of FSR and DSR in High Density Network and Low Traffic Load Scenarios (3.1: Packet Delivery Ratio; 3.2: Average Hop Count per Path; 3.3: Control Message Overhead; 3.4: Average Energy Consumption per Node)
However, DSR incurs significantly more overhead than FSR as the traffic load is increased to 40 s-d pairs (refer Figure 2.3) for TPU values of 20 seconds and beyond. This observation can be attributed to the reactive nature of DSR and the low node density of the network. DSR determines routes as needed; with an increasing need to determine routes for a growing number of s-d pairs, DSR invokes its route discovery mechanism frequently, leading to frequent flooding of the network with broadcast messages. The number of route discoveries needed to determine routes for all the s-d pairs also increases with increasing mobility, and thus DSR incurs a higher control overhead than FSR, which remains largely unaffected by increasing rates of node mobility.

The amount of energy consumed by both protocols (refer Figure 2.4) is observed to be appreciably larger than that observed in low-density networks with low traffic load (refer Figure 1.4). The spike in energy consumption can be attributed to factors such as the number of data and control packets flowing through the network. An increase in the offered traffic load at low network density is analogous to an increase in the number of active s-d pairs wishing to establish sessions. Consequently, this corresponds to an increase in the number of data packets flowing through each node in the network, which contributes to the observed increase in the energy consumption at each node. In addition, in a low-density network the probability of route failures is rather high. This is attributed to the fact that nodes could be sparsely distributed and, as a result, will be unable to find paths to route data packets successfully to their designated destinations. Thus, there is an observed increase in the amount of control overhead generated to maintain and establish routes for the voluminous amount of data traffic. The energy consumption of FSR is significantly less than that of DSR in moderate to high mobility scenarios for TPU values of 100 seconds and above.

4.3 High Network Density and Low Traffic Load (Scenarios 7 through 9)

In high-density networks, the packet delivery ratios incurred by both FSR and DSR are relatively larger than those incurred in low-density networks (compare Figures 1.1 and 3.1). For the low mobility scenario of 5 m/s, both FSR and DSR deliver packets at approximately 100%. FSR maintains this near-perfect packet delivery rate at low node mobility as TPU values are increased from 5 seconds up to 200 seconds. The better performance of both protocols can be attributed to the fact that each node has more neighbors within its transmission range to route messages along a given s-d route. This distribution almost always guarantees that a packet will be successfully routed to its destination.

In high-density networks, the average hop count per path for both FSR and DSR (shown in Figure 3.2) is appreciably reduced compared to the low network density scenarios of Figures 1.2 and 2.2. Nodes in a high-density network tend to have more neighbors, and as a
Figure 4.1: Packet Delivery Ratio Figure 4.2: Average Hop Count per Path
Figure 4.3: Control Message Overhead Figure 4.4: Average Energy Consumption per Node
Figure 4: Performance of FSR and DSR in High Density Network and High Traffic Load Scenarios
result, have better path alternatives (shorter paths) to choose from among the optimal routes to any given destination. On the other hand, FSR and DSR are observed to incur significantly higher control overhead in high-density networks. This is because more broadcast messages are received at each node due to an increase in the number of neighbors. As illustrated in Table 2 and Figure 3.3, for TPU values of 100 seconds or above, FSR incurs less control overhead than DSR.
The energy consumed per node by both protocols is lower in magnitude for high-density networks compared to that consumed in lower-density networks (see Figures 1.4, 2.4, 3.4 and 4.4). As each node has more neighbors, data gets efficiently routed along optimal paths in high-density networks. In low node mobility scenarios, the energy consumption of FSR is significantly higher than that of DSR. This is because FSR incurs a fixed energy cost due to periodic network broadcasts. However, in higher node mobility scenarios, the energy consumption of FSR converges to that of DSR, and FSR actually outperforms DSR at higher TPU values of 100 seconds and above. Thus, FSR can be employed as a suitable routing alternative in networks characterized by high node density and moderate to high node mobility.

4.4 High Network Density and High Traffic Load (Scenarios 10 through 12)

FSR and DSR maintained a near-perfect packet delivery ratio of 100%, as illustrated in Figure 4.1. For moderate to high node mobility, DSR yielded a higher packet delivery ratio. The discrepancy between FSR and DSR in terms of packet delivery ratio increased as the TPU parameter values were increased from 50 seconds to 300 seconds. It should be noted that FSR is still able to maintain a packet delivery ratio above 97% even at a high TPU value of 300 seconds.
With respect to hop count, FSR outperforms DSR in the low node mobility scenario at a TPU of 5 seconds. Beyond 5 seconds, DSR discovers more optimal minimum-hop paths compared to FSR, due to the discovery of inaccurate routes in FSR. One major difference observed is a slight increase in the magnitude of the hop count discovered by both protocols as compared to the high network density and low traffic load scenario.
As illustrated in Table 2 and Figure 4.3, with respect to the control message overhead, FSR scales considerably better than DSR at TPU values greater than 50 seconds. FSR proactively maintains routing information and is not affected by increasing network density. On the other hand, DSR incurs more overhead with the increasing demand for route discoveries for the s-d sessions. Thus, variations in mobility have a significant effect on the amount of control messages generated by DSR in high node density and high traffic scenarios; FSR is not much affected by changes in node mobility.
It is observed from Table 3 and Figure 4.4 that the energy consumption of both protocols exceeded that of the high network density and low traffic load scenario (refer Figure 3.4). This is justified by the increase
observed in the number of communicating s-d pairs. More packets are routed in the network due to data and control overhead, and as a result nodes expend more energy associated with routing a larger number of packets. Energy consumption per node also increases with an increase in the mobility levels of nodes. When compared to DSR, FSR consumes less energy in moderate to high mobility scenarios at TPU values ranging from 20 seconds to 300 seconds. Thus, it can be suggested that for high mobility and high-density scenarios, FSR can be configured with a lower TPU value of 20 seconds to minimize energy consumption. For moderate node mobility, high density and high traffic load networks, FSR can be selected over DSR by configuring the former with a TPU value of 50 seconds.

5. Conclusions

This paper explores the performance and the associated tradeoffs of the FSR protocol relative to the DSR protocol for MANETs under varying scenarios of network density, node mobility, and traffic load, using a comprehensive simulation-based analysis. Conclusions and suggestions are made with respect to the configuration of the FSR protocol in order to yield better performance than DSR under specific scenarios, based on the results observed in the simulations.
A significant tradeoff has been observed in the performance of FSR regarding the hop count per path. For lower TPU values, FSR has been found to obtain shorter paths due to the increased frequency of route update messages. As the TPU value is increased, FSR has been observed to incur higher hop count values due to the lower update frequency. Consequently, this leads to the persistence of stale routes, which generates longer hop paths. We have identified the TPU values that will generate paths with hop counts comparable to DSR. It has been found that at low mobility levels, FSR yields more optimal paths.
In high-density networks characterized by high traffic load, even at higher TPU values, FSR has a significantly lower control message overhead compared to DSR and yet achieves a packet delivery ratio of at least 90%. The same trend has been noticed with respect to energy consumption at high node density and moderate to high mobility values, with FSR losing less energy for routing and topology maintenance as compared to DSR.

6. References

[1] C. Siva Ram Murthy and B. S. Manoj, "Routing Protocols for Ad Hoc Wireless Networks," Ad Hoc Wireless Networks: Architectures and Protocols, Chapter 7, pp. 299-364, Prentice Hall, June 2004.
[2] C. E. Perkins and P. Bhagwat, "Highly Dynamic Destination Sequenced Distance Vector Routing for Mobile Computers," Proceedings of ACM SIGCOMM, pp. 234-244, October 1994.
[3] P. Jacquet, P. Muhlethaler, T. Clausen, A. Laouiti, A. Qayyum and L. Viennot, "Optimized Link State Routing Protocol for Ad Hoc Networks," Proceedings of the IEEE International Multi Topic Conference, pp. 62-68, Pakistan, December 2001.
[4] D. B. Johnson, D. A. Maltz and J. Broch, "DSR: The Dynamic Source Routing Protocol for Multi-hop Wireless Ad hoc Networks," Ad hoc Networking, edited by Charles E. Perkins, Chapter 5, pp. 139-172, Addison-Wesley, 2001.
[5] C. E. Perkins and E. M. Royer, "Ad hoc On-Demand Distance Vector Routing," Proceedings of the 2nd IEEE Workshop on Mobile Computing Systems and Applications, pp. 90-100, February 1999.
[6] G. Pei, M. Gerla and T.-W. Chen, "Fisheye State Routing: A Routing Scheme for Ad Hoc Wireless Networks," Proceedings of the International Conference on Communications, pp. 70-74, New Orleans, USA, June 2000.
[7] S. Jaap, M. Bechler and L. Wolf, "Evaluation of Routing Protocols for Vehicular Ad Hoc Networks in City Traffic Scenarios," Proceedings of the 5th International Conference on Intelligent Transportation Systems and Telecommunications, Brest, France, June 2005.
[8] E. Johansson, K. Persson, M. Skold and U. Sterner, "An Analysis of the Fisheye Routing Technique in Highly Mobile Ad Hoc Networks," Proceedings of the IEEE 59th Vehicular Technology Conference, Vol. 4, pp. 2166-2170, May 2004.
[9] T.-H. Chu and S.-I. Hwang, "Efficient Fisheye State Routing Protocol using Virtual Grid in High-density Ad Hoc Networks," Proceedings of the 8th International Conference on Advanced Communication Technology, Vol. 3, pp. 1475-1478, February 2006.
[10] Ns-2 Simulator: http://www.isi.edu/nsnam/ns/
[11] C. Bettstetter, H. Hartenstein and X. Perez-Costa, "Stochastic Properties of the Random Waypoint Mobility Model," Wireless Networks, Vol. 10, No. 5, pp. 555-567, September 2004.
[12] L. M. Feeney, "An Energy Consumption Model for Performance Analysis of Routing Protocols for Mobile Ad hoc Networks," Journal of Mobile Networks and Applications, Vol. 3, No. 6, pp. 239-249, June 2001.
[13] G. Bianchi, "Performance Analysis of the IEEE 802.11 Distributed Coordination Function," IEEE Journal on Selected Areas in Communications, Vol. 18, No. 3, pp. 535-547, March 2000.
ADCOM 2009
NETWORK
OPTIMIZATION
Session Papers:
1. Angeline Ezhilarasi G and Shanti Swarup K , “Optimal Network Partitioning for Distributed
Computing Using Discrete Optimization”
2. Suman Kundu and Uttam Kumar Roy, "An Efficient Algorithm to Reconstruct a Minimum
Spanning Tree in an Asynchronous Distributed System"
3. Amit Kumar Mishra, “A SAL Based Algorithm for Convex Optimization Problems”
Optimal Network Partitioning for Distributed
Computing Using Discrete Optimization
G. Angeline Ezhilarasi Dr. K. S. Swarup
Department of Electrical Engineering Department of Electrical Engineering
Indian Institute of Technology Madras Indian Institute of Technology Madras
Chennai, INDIA Chennai, INDIA
angel.ezhil@gmail.com swarup@ee.iitm.ac.in
Abstract—This paper presents an evolutionary-based discrete optimization (DO) technique for optimal network partitioning (NP) of a power system network. The algorithm divides the network model into a number of sub-networks optimally, in order to balance distributed computing and parallel processing of power system computation and to reduce the communication overhead. The partitioning method is illustrated on the IEEE Standard 14 Bus, 30 Bus and 118 Bus Test Systems and compared with other existing methods. The performance of the algorithm is studied using the test systems with different configurations.

Keywords-Network Partitioning, Discrete Particle Swarm Optimization.

I. INTRODUCTION

The power system is a large interconnected complex network involving computation-intensive applications and highly intensive, nonlinear dynamic entities that spread across a vast area. Under normal as well as congested conditions, centralized control requires powerful computing facilities and multiple high-speed communication links at the control centers. Under certain circumstances, a failure in a remote part of the system might spread instantaneously if the control action is delayed. This lack of response may cripple the entire power system, including the centralized control center itself. An effective way to monitor and control a complex power system is to intervene locally at places where there is a disturbance and keep the problem from propagating through the network. Hence distributed computing can greatly enhance the reliability and improve the efficiency of power system monitoring and control.
To simulate and implement distributed computing in a power system, the large interconnected network must be torn into sub-networks in an optimal way. The partitioning should balance the size of the sub-networks against the interconnecting tie lines, in order to reduce the overall parallel execution time. Over the past decades a number of algorithms have been proposed in the literature for optimal network tearing. The techniques include dynamic programming and heuristic clustering approaches. Optimization techniques such as simulated annealing, genetic algorithms [1] and tabu search [2] have also been used for network tearing. For these optimization problems the cost function is formed such that it reflects the features of parallel and distributed processing. However, these methods are computation intensive and involve procedures based on natural selection, crossover and mutation. They also require a large population size and occupy more memory.
This paper presents the application of evolutionary-based discrete particle swarm optimization to the network partitioning problem. The main advantages of the PSO algorithm are: simple concept, easy implementation, robustness to control parameters, and computational efficiency when compared with mathematical algorithms and other heuristic optimization techniques. Recently, PSO has been successfully applied to various fields of power system optimization such as power system stabilizer design, reactive power and voltage control, dynamic security border identification, economic dispatch and optimal power flow.
The rest of the paper is organized as follows. Section 2 deals with the formulation of the objective function for the network partitioning problem. The aim of this optimization is to minimize the cost function, which is a measure of the execution time of the applications in the torn network. An overview of DPSO and its implementation for the NP problem is given in Section 3. The algorithm is tested on IEEE standard test systems, and simulation results are discussed in Section 4. The case studies demonstrate the validity of the application of this algorithm by attaining a near-optimal solution with a small population size and less computational effort.

II. PROBLEM FORMULATION

The objective of the problem is to optimally assign each node of a large interconnected network to a sub-network subject to constraints. The resulting sub-networks are used for efficient distributed computing and parallel processing of power system analysis. The allocation of the nodes to the sub-networks should be such that the number of nodes in each sub-network and the number of tie lines connecting the sub-networks are well balanced. Hence the conventional cost function [3], which models the computational performance of the partitioned network, is taken as the fitness function for solving this optimization problem.

The objective is to minimize

F = alpha * M^2 + beta * L^3    (1)
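Equation (1) can be evaluated directly from a candidate partition: M is the size of the largest sub-network and L is the number of tie lines. The sketch below is illustrative only (the function name, toy network and partition are my own, not from the paper):

```python
from collections import Counter

def partition_cost(edges, assignment, alpha=1.0, beta=1.0):
    """Fitness of equation (1): F = alpha * M^2 + beta * L^3.

    edges      -- list of (u, v) branches of the network
    assignment -- dict mapping each node to its cluster index
    """
    # M: number of nodes in the largest sub-network
    sizes = Counter(assignment.values())
    m = max(sizes.values())
    # L: number of tie lines, i.e. branches joining two different clusters
    l = sum(1 for u, v in edges if assignment[u] != assignment[v])
    return alpha * m ** 2 + beta * l ** 3

# Toy 6-node network split into two clusters of 3 nodes, with 2 tie lines:
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (3, 4), (1, 6)]
assignment = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
print(partition_cost(edges, assignment))  # 1*3^2 + 1*2^3 = 17.0
```

With the weights of Table II this reproduces the reported costs, e.g. M = 8, L = 4 with alpha = beta = 1 gives 8^2 + 4^3 = 128 for the 14 Bus system.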
where,
F: Partition index
M: Maximum number of nodes in a sub-network
L: Total number of branches between all the sub-networks
alpha, beta: Weighting factors

The first term in the fitness function includes the maximum number of nodes in a sub-network, thereby influencing the load balance in the distributed processing of any application. The second term relates to the communication of data between the processes and hence focuses on the number of branches linking the sub-networks. The total fitness function value reflects the overall computation time of the power system analysis problems under distributed or parallel processing.

III. IMPLEMENTATION OF NETWORK PARTITIONING

A. Overview of Discrete Optimization

Particle Swarm Optimization (PSO) was developed by Kennedy and Eberhart through simulation of bird flocking in a two-dimensional space. In this search space, every feasible solution is called a particle, and several such particles in the search space form a group. The particles tend to optimize an objective function with the knowledge of their own best position attained so far and the best position of the entire group. Hence the particles in a group share information among themselves, leading to increased efficiency of the group. The original PSO treats nonlinear optimization problems with continuous variables. However, practical engineering problems are often combinatorial optimization problems, for which Discrete Particle Swarm Optimization (DPSO) can be used.
In a physical n-dimensional search space, the position of a particle p is represented as a vector X = (X_1, X_2, X_3, ..., X_p) and its velocity as V = (V_1, V_2, V_3, ..., V_p). Let Pbest_p = (X_1^best, X_2^best, ..., X_p^best) and Gbest = Best(X_1^best, ..., X_p^best) be the best positions of the particle p and of its neighbors so far. In DPSO [4][5], the particles are initially set to binary values randomly. The probability of the particle making a decision is a function of the current particle position, velocity, Pbest and Gbest. The velocity of the particle, given by equation (2), determines a probability threshold. The sigmoid function shown in equation (3) imposes limits on the velocity updates. The threshold is constrained within the range [0, 1] such that a higher velocity likely chooses 1 and a lower velocity chooses 0. Hence the position update is done using the velocity, as shown in equation (4).

V_p^(k+1) = omega * V_p^k + C1 * rand1 * (Pbest_p^k - X_p^k) + C2 * rand2 * (Gbest - X_p^k)    (2)

S(V_p) = 1 / (1 + e^(-V_p))    (3)

If (rand < S(V_p^new)) then X_p^new = 1, else X_p^new = 0    (4)

where,
omega: Inertia weight parameter
C1, C2: Weight factors
rand, rand1, rand2: Random numbers between 0 and 1
X_p^(k+1), X_p^k: Position of the particle at the (k+1)th and kth iterations
V_p^(k+1), V_p^k: Velocity of the particle at the (k+1)th and kth iterations
Pbest_p^k: Best position of the particle p until the kth iteration
Gbest: Best position of the group until the kth iteration

B. Evolutionary Based Discrete Optimization

The PSO gains self-adapting properties by incorporating any one of the evolutionary techniques such as replication, mutation and reproduction. In order to improve the convergence of PSO, mutation is generally done on the weight factors. Also, if there is no significant change in the Gbest for a considerable amount of time, mutation can be applied. In this work mutation is used to update the particles, as they are constituted by binary values only. The entire position update process of the particles is done based on a mutation probability, normally above 0.85 [6]. This ensures that the particles are neither trapped in their local optimum nor deviate far from their current position. The process of the algorithm and its implementation aspects are described in detail in the following sections.

1) Generation of the Particles

The objective of the network partitioning problem is to allocate every node of the power system network to a sub-network such that the nodes are equally distributed and the number of lines linking them is minimum. Hence the structure of the particle is framed as a matrix of dimension (nc x nn), where nc is the number of clusters or sub-networks (a user-defined quantity) and nn is the total number of nodes in the power system network. It is ensured that each node is assigned to one cluster only, i.e., each column of the particle array always sums to 1. The velocity of the particles corresponds to a threshold probability. Initially all velocities are set to zero vectors, and all solutions, including the Pbest and Gbest, are undefined.
The position of a particle p in the search space is created as follows:
[1, nn/nc].
Step 3) Set the number of nodes to 1 from k to R1, and set k = k + R1.
Step 4) Set j = j - 1; if j = 0 go to step 2, otherwise go to step 5.
Step 5) Repeat steps 1 to 4 for all the particles.
Step 6) Stop the initialization process.

[Figure: example particle of dimension nc x nn before mutation; rows C1 through Cnc, columns N1 through Nnn, with each column containing a single 1.]

The mutation-based position update is performed as follows:
Step 1) Set N = number of nodes and M = number of clusters.
Step 2) Select a column at random from 1 to N.
Step 3) Find the row index whose element is 1.
Step 4) Select a row at random from 1 to M whose element is 0.
Step 5) Flip the elements using the condition given by equation (4).
Step 6) Repeat the above steps for all the particles.

[Figure: the same particle after mutation; one node's 1 entry has moved to a different cluster row, with each column still summing to 1.]
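The particle structure and the two procedures above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are mine, the initialization approximates the partly garbled steps (nodes are handed out in random runs of length R1 in [1, nn/nc]), and the mutation moves a randomly chosen node to another cluster while preserving the one-1-per-column invariant:

```python
import random

def init_particle(nc, nn):
    """Create an nc x nn binary position matrix with exactly one 1 per column."""
    x = [[0] * nn for _ in range(nc)]
    k = 0
    while k < nn:
        cluster = random.randrange(nc)                # pick a cluster row
        run = random.randint(1, max(1, nn // nc))     # R1 in [1, nn/nc]
        for node in range(k, min(k + run, nn)):
            x[cluster][node] = 1                      # assign these nodes
        k += run
    return x

def mutate(x):
    """Move one randomly chosen node to a different cluster (Steps 1-6)."""
    nc, nn = len(x), len(x[0])
    col = random.randrange(nn)                                        # Step 2
    row_one = next(r for r in range(nc) if x[r][col] == 1)            # Step 3
    row_zero = random.choice([r for r in range(nc) if r != row_one])  # Step 4
    x[row_one][col], x[row_zero][col] = 0, 1                          # Step 5
    return x

p = mutate(init_particle(3, 14))
# Invariant: each node still belongs to exactly one cluster.
assert all(sum(p[r][c] for r in range(3)) == 1 for c in range(14))
```

The invariant checked at the end is exactly the paper's constraint that the columns of the particle array always sum to 1.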
in the population. Similarly, Figure 4(a) shows the IEEE 30 Bus System partitioned into three clusters. This is the optimal partition obtained from the DPSO applied to the network partitioning problem, and Figure 4(b) shows a worst-case partition of the same system, present in the population with the same configuration.
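The per-component update of equations (2) through (4) can be sketched as below. The parameter values (omega = 0.7, C1 = C2 = 2.0) are illustrative assumptions, not values from the paper, and the sigmoid is the standard S(v) = 1/(1 + e^(-v)) of equation (3):

```python
import math
import random

def update_component(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0):
    """One DPSO step for a single binary position component.

    Equation (2): new velocity from inertia, cognitive and social terms.
    Equation (3): sigmoid squashes the velocity into a probability threshold.
    Equation (4): the position becomes 1 when a uniform draw falls below it.
    """
    v_new = (w * v
             + c1 * random.random() * (pbest - x)    # pull toward own best
             + c2 * random.random() * (gbest - x))   # pull toward group best
    s = 1.0 / (1.0 + math.exp(-v_new))               # equation (3)
    x_new = 1 if random.random() < s else 0          # equation (4)
    return x_new, v_new

random.seed(1)
x, v = 0, 0.0
x, v = update_component(x, v, pbest=1, gbest=1)
```

Note how a large positive velocity drives the threshold toward 1, so the component almost surely becomes 1, matching the text's description of the constrained [0, 1] threshold.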
In order to test the efficiency of the algorithm under different conditions, simulation is performed with varying configurations of parameters such as weights, population size and mutation rates. Table II shows the partitions of the 14 Bus and 30 Bus systems with different weight factors.

TABLE II. COMPARISON OF COST FOR IEEE 14 BUS AND 30 BUS SYSTEMS WITH DIFFERENT CONFIGURATIONS

Weights                 No. of Clusters   Test System   M    L   Cost
alpha = 1, beta = 1     3                 14 Bus        8    4   128
                                          30 Bus        12   7   487
alpha = 3, beta = 1     3                 14 Bus        5    5   200
                                          30 Bus        16   6   984
alpha = 1, beta = 3     3                 14 Bus        5    5   400
                                          30 Bus        15   6   873

The performance of the optimization algorithm is shown by means of the convergence characteristics in Figures 5(a) and 5(b). Simulation was done with the standard parameters mentioned earlier, and convergence was attained when the system was partitioned into two and three clusters. Unlike SA [1] and GA [3], DPSO reaches a near-optimal solution in fewer iterations and with a minimum number of particles in the population.

Figure 5(a). Convergence of 14 Bus System

V. CONCLUSION

This paper presents a Discrete Particle Swarm Optimization method for optimal tearing of a power system network. The algorithm is simple and can be used for clustering-related problems in any field. In this method the DPSO is used to minimize the cost function, which is actually an estimate of the execution time of a sub-network in the distributed computing environment. The algorithm was implemented in a high performance computing environment which supports distributed computing and parallel processing. The simulation results show that the DPSO can find a near-optimum solution under different operating conditions. The torn sub-networks can aid parallel processing, thereby improving the speed of intensive power system computations.

REFERENCES

[1] M.R. Irving and M.J.H. Sterling, "Optimal network tearing using simulated annealing", IEE Proceedings of Generation, Transmission and Distribution, Vol. 137, No. 1, January 1990, pp. 69-72.
[2] C.S. Chang, L.R. Lu and F.S. Wen, "Power system network partitioning using Tabu Search", Electric Power Systems Research, Vol. 49, 1999, pp. 55-61.
[3] H. Ding, A.A. El-Keib and R. Smith, "Optimal clustering of power networks using genetic algorithms", Electric Power Systems Research, Vol. 30, 1994, pp. 209-214.
[4] Jong-Bae Park, Ki-Song Lee, Joong-Rin Shin and Kwang Y. Lee, "A Particle Swarm Optimization for Economic Dispatch With Nonsmooth Cost Functions", IEEE Transactions on Power Systems, Vol. 20, No. 1, February 2005.
[5] Qian-Li Zhang, Xing Li and Quang-Anh Tran, "A Modified Particle Swarm Optimization Algorithm", Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005.
[7] X.H. Shi, X.L. Xing, Q.X. Wang, L.H. Zhang, X.W. Yang, C.G. Zhou and Y.C. Liang, "A Discrete PSO Method for Generalized TSP Problem", Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004.
[9] Zhongxu Li, Yutian Liu, Rushui Liu and Xinsheng Niu, "Network Partition for Distributed Reactive Power Optimization in Power Systems", IEEE International Conference on Networking, Sensing and Control, 6-8 April 2008, pp. 385-388.
An Efficient Algorithm to Reconstruct a Minimum
Spanning Tree in an Asynchronous Distributed
System
Suman Kundu Dr. Uttam Kr. Roy
Department of Information Technology Department of Information Technology
Jadavpur University Jadavpur University
Salt Lake, Kolkata - 700098 Salt Lake, Kolkata - 700098
sumankundu.nsec@gmail.com u roy@it.jusl.ac.in
Abstract—In a highly dynamic asynchronous distributed network, node failure (or recovery) and link failure (or recovery) trigger topological changes. In many cases, reconstructing the minimum spanning tree after each such topological change is very much required.
In this paper, we describe a distributed algorithm based on message passing to reconstruct the minimum spanning tree after a link failure. The algorithm assumes that no further topological changes occur during its execution. The proposed algorithm requires significantly fewer messages to reconstruct the spanning tree in comparison to other existing algorithms.

I. INTRODUCTION

A distributed network consists of several nodes and connections among them. Each node is a computational unit, and the connections between them can send and receive messages in a duplex manner. Multiple paths may exist between a pair of nodes. A Minimum Spanning Tree (hereafter referred to as MST) of such a network is the minimally connected tree that contains all the nodes of the network. Applications of MSTs include effective communication in distributed systems, effective file searching and sharing in peer-to-peer networks, gateway routing in local area networks, bandwidth allocation in multi-hop radio networks and other computational scenarios. Usually, a cost is associated with each link. The cost may indicate the distance between two nodes, the time required to send or receive data packets, the bandwidth of the communication channel, or any other parameter. An MST always has the minimum cumulative cost within the network.
A distributed system is dynamic in nature, i.e., in any distributed network topological changes occur with respect to time. A change can occur due to the deletion or recovery of nodes and links. In many situations, it is important to reconstruct the MST after each topological change. The main hurdle to reconstructing the MST arises from the asynchronous nature of the system. Moreover, a node only knows local information. If a topological change occurs, it must be propagated to every node via message communication. It is also possible that some part of the network gets the latest knowledge, whereas some portion does not. The algorithm should address this issue as well.
Several algorithms for constructing an MST in distributed systems were proposed in the last three decades. Most of these MST construction algorithms are applicable in a static topology. In this paper, we propose an algorithm based on message passing to reconstruct an MST that works seamlessly even in a dynamic topology. Our algorithm considers a single link failure and assumes that no further topological changes occur during the execution of the algorithm. It is also shown that the total number of messages required to reconstruct the spanning tree is significantly less in comparison to other existing algorithms.
The rest of the paper is organized as follows. Section II describes the related work and an overview of our result. Section III describes the distributed algorithm, its analysis and a proof of correctness. In Section IV, we provide simulation results; Section V concludes the overall algorithm; and finally, in Section VI, we point out the further research areas we are working on.

II. RELATED WORK

In their pioneering paper [2], Gallager, Humblet and Spira proposed one of the first distributed protocols to construct an MST, in the year 1983. The protocol of [2] is further improved in the protocols of [3], [4], [5], [6], [7] and [8]. In [9], some flaws of [5] are rectified. All these protocols for constructing an MST are developed for a static topology. Some of them address message efficiency and some address time efficiency as the performance measure for the algorithm.
However, distributed systems are dynamic, as described in the previous section. Researchers are working on protocols which are resilient in nature, to adapt to topological changes. A few such algorithms are given in [1], [12] and [14]. In their paper [10], B. Das and M. C. Loui provide a serial and a parallel algorithm to address a similar problem, which was later improved by Nardelli, Proietti and Widmayer in their paper [11]. These algorithms do not address the distributed version of the problem. In paper [13], P. Flocchini, T. M. Enriquez, L. Pagli, G. Prencipe and N. Santoro provided a distributed version of the same problem. In [13] the authors provided with
the precomputed node replacement scheme. In our paper we provide an improved version of the distributed algorithm of [1] for a single link failure. In the following section, we describe the response of [1] to a link failure.

A. Basic Algorithm of [1]

In the algorithm of [1], C. Chang, I. A. Cimett and S.P.R. Kumar proposed a resilient algorithm which reconstructs the MST after a link failure or recovery. The complexity of the algorithm for a single link failure is O(e), where e is the number of links in the network.
A link failure triggers a process of fragment expansion, i.e., the failure of a link which is part of the MST breaks the MST into two different fragments. The failed link initiates the process of recovery in the adjacent nodes. The initiator node generates a new fragment identity. In algorithm [1], the authors suggest two approaches to generating a new fragment identity such that no conflict occurs between subsequent topological changes. The first approach is to include the identities of all nodes of the fragment in the fragment identity. The second approach is to maintain a counter for each link; this counter counts the link failures, and its value is included along with the weight when generating the new fragment identity. With the first approach, the fragment identity of a large fragment becomes very large, so the second approach is more efficient in terms of message size.
Suppose a link w(u, v) is broken at a certain time, and let u be the parent of v. In response to the link failure, u will generate a new identity for its fragment of the form w(u, v), u, c, where c is the counter of topological changes of the link w(u, v). Then u will forward the fragment identity to the root of its fragment. In the case of v, after generating the fragment identity w(u, v), v, c, it marks itself as the root of its fragment. After getting the new fragment identity, the roots of these two fragments change their fragment identity, start broadcasting REIDEN<id> over the tree links, and wait for the acknowledgment. Any intermediate node, upon getting the REIDEN<id> message, changes its fragment identity to id and sends the same message to its child nodes. A leaf node, after getting the REIDEN<id>, changes its fragment identity and sends REIDEN ACK<id> to its parent. An intermediate node sends REIDEN ACK<id> to its parent only after getting REIDEN ACK<id> from all of its children. Receiving the REIDEN ACK<id> message indicates that all nodes of the fragment are aware of the new identity value.
The root now changes its state to find and initiates the find minimum outgoing edge (hereafter referred to as MOE) phase by sending a FINDMOE message to its children. When a node receives a FINDMOE message, it changes its state to find. In the find state, each node starts to send a TEST<id> message via each non-tree link. A TEST<id> message is responded to by either ACCEPT<id> or REJECT<id>. An ACCEPT<id> message indicates that the edge is outgoing, leading to another fragment. An important thing to remember [...] previous one. After identifying the MOE, a node propagates FINDMOE ACK<w(MOE)> upward, where w(MOE) is the locally known best outgoing weight, either that of its own MOE or an MOE received from its children, whichever is minimum. After receiving FINDMOE ACK<w(MOE)>, the root sends CHANGE ROOT<id> through the same path from which it received the MOE. The CHANGE ROOT<id> reaches the node on which the MOE of the fragment is incident. That node marks itself as the new root of the fragment and sends a CONNECT message over the MOE. The connect subroutine works the same as in algorithm [2]: it merges the two fragments sharing the same MOE and starts the next iteration.

B. Overview of Our Results

When considering a single link failure, our approach provides a significant improvement in the total number of messages required during reconstruction over the algorithm of [1]. Also, the message size for some control messages is improved slightly.
After a link failure, the fragment which contains the root node of the MST is referred to as the root fragment in this paper. If the root fragment contains E' edges, then our algorithm requires 2 x E' fewer messages to reconstruct the MST. Our approach here is to reuse the previously known fragment identity (call it historical data) for the root fragment. However, how the algorithm evolves if another link failure occurs during the execution is still under observation.

III. ALGORITHM TO RECONSTRUCT THE MST AFTER A LINK FAILURE

We closely followed the response of the algorithm [1] to a single link failure and found some areas for improvement. In the following subsections, we describe the network model, our observations regarding the algorithm [1], our contribution to improving the algorithm, a description of the modified distributed algorithm, an analysis of the outcome and a proof of correctness.

A. Network Model

The communication model for the algorithm is an asynchronous network represented by an undirected weighted graph of N nodes. The graph is represented by G(V, E), where V is the set of nodes and E ⊂ V × V is the set of links. Each node is a computing unit consisting of a processor, a local memory, and an input and an output queue. The input (output) queue is an unlimited-size buffer to send and receive messages. A unique identification number is associated with each node (Node i represents the node with the identification number i).
Each link (u, v), assigned a fixed weight w(u, v), is a bidirectional communication line between the node u and the node v. Each node has only local information, i.e., each node is aware of its identification number and the weights of its incident links. After construction of the MST, each node
here is that the ACCEPT or REJECT message should return will be aware of two additional information; first, the adjacent
the identity number of the test message. This will help to edge leading to the parent node in the MST and secondly the
determine whether the message is for the current failure or adjacent edges, those leading to the child nodes in the MST.
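As an illustration of the Phase-I reidentification of [1] described above, the following sketch (our own code, not the authors') broadcasts REIDEN<id> down the tree links and collects REIDEN ACK<id> back, counting messages to show that the phase costs exactly two messages per tree link.

```python
# Hypothetical sketch of Phase-I of the protocol in [1]: the fragment root
# broadcasts REIDEN<id> over the tree links and waits for REIDEN_ACK<id>
# from every subtree; each tree link carries one message in each direction.

def reidentify(tree, root, new_id, identity, counter):
    """Broadcast new_id from root; tree maps node -> list of children."""
    identity[root] = new_id
    for child in tree.get(root, []):
        counter["msgs"] += 1            # REIDEN<id> down to the child
        reidentify(tree, child, new_id, identity, counter)
        counter["msgs"] += 1            # REIDEN_ACK<id> back to the parent

# Fragment rooted at node 0 with 4 tree links.
tree = {0: [1, 2], 1: [3], 2: [4]}
identity, counter = {}, {"msgs": 0}
reidentify(tree, 0, ("w(u,v)", "u", 1), identity, counter)
print(counter["msgs"])                  # 8 messages = 2 x 4 tree links
```

The 2-messages-per-link count is what the later analysis charges the root fragment for, and what the modified protocol avoids.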
Nodes can communicate only via messages. Messages may be lost if a link fails during transmission; however, if the link is functioning, messages sent from either end are received by the other end within a finite but unpredictable time, without error and in sequence. If a link failure occurs, the failure event triggers the recovery process at each end of the failed link, and the recovery process is initiated by the nodes at both ends.
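The local state assumed by this model can be made concrete with a small sketch (names and structure are ours, not from the paper): each node stores only its own identifier, the weights of its incident links, and, after MST construction, which incident links are tree links.

```python
# Hypothetical local state of a node under the stated network model.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: int
    incident: dict = field(default_factory=dict)   # neighbour -> link weight
    parent: Optional[int] = None                   # neighbour over the Parent link
    children: set = field(default_factory=set)     # neighbours over Child links

    def tree_links(self):
        """Neighbours reached over links included in the MST."""
        links = set(self.children)
        if self.parent is not None:
            links.add(self.parent)
        return links

n = Node(2, incident={0: 8, 1: 10, 3: 7, 4: 6})
n.parent, n.children = 0, {4}
print(sorted(n.tree_links()))   # [0, 4]
```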
B. Observation

In the algorithm of [1], the reconstruction process works in two phases. In Phase-I, the root node informs each node of the fragment of the new fragment identity; in Phase-II, each node finds its own MOE and forwards it toward the root. The root then identifies the MOE of the fragment. Finally, the fragment sends a CONNECT message over the MOE.

After a failure, each fragment changes its fragment identity to a new one. Every message passed to a neighbour carries the fragment identity along with the control information. This is done so that, if overlapping link failures occur, responses can be filtered by fragment identity in such a way that only the current failure is processed during the execution.

Fig. 1. MST of a random network

C. Our Contribution

Our contribution to the algorithm is to reuse historical data, namely the previously known fragment identity, for one fragment. When the previously known fragment identity is used, the fragment with the older identity enters Phase-II without executing Phase-I. In our algorithm, we reuse the previously known fragment identity for the root fragment. Furthermore, the FAILURE message propagating from the failed link to the root of the root fragment no longer needs to carry a newly generated fragment identity, so the FAILURE message size is also reduced. The difficulty with this approach is that when a TEST<id> message is received, the fragment identity of the receiving node may not yet have been updated: the node may belong to the same fragment, or to another fragment still in Phase-I (i.e. propagation of the new fragment identity is not yet complete). However, if a node receives a TEST message in Phase-II, its fragment identity is correct. So, to avoid a conflicting response to a TEST<id> message, the response is delayed until the node enters Phase-II.

It is also assumed that no further failures occur during the execution, which makes it possible to reduce the size of some control messages. For example, the ACCEPT and REJECT messages do not need to carry the fragment identity back to the sender.

D. Description of the protocol

Initially, each node maintains a collection of adjacent edges, sorted by link cost. During its lifetime, an adjacent link can have one of the following statuses:
1) Basic - the link is yet to be processed
2) Parent - the link leads to the parent
3) Child - the link leads to a child
4) Rejected - the link leads to a node in the same fragment
5) Down - the link is not working

It is assumed that the MST has already been constructed using some distributed protocol. A link failure triggers the recovery process at both ends of the link. Suppose a failure occurs on the link e = (u, v, w, c), where u and v are the nodes connected by the link, w is the weight of the link, and c is the status-change count of the link. If the link is not included in the MST, i.e. it is in the Basic or Rejected state, then u and v simply change the link status to Down and do nothing else.

Fig. 2. A non-MST link failure

Otherwise, the nodes mark the link Down and respond in the following manner:
• When the previous status of the link is Parent: the node marks itself as the root of the newly created fragment. It then resets all Rejected links to Basic; this is necessary because, due to the topological change, those links may now lead to other fragments. The node generates a new fragment identity, resets its own fragment identity, and enters Phase-I by sending the INIT<fid> message to its children. Here fid is the new fragment identity as described in [1], i.e. it includes the weight and failure count of the link along with the identity of the node. If u is the parent of v for the example edge e, then after the failure v marks itself as root, changes its fragment identity to fid = (w(u, v), v, c), and initiates Phase-I by sending this fid along with the INIT message. After receiving the INIT<fid> message, a node changes its
fragment identity to fid and marks all Rejected links as Basic. It then forwards the message to its children. If the node is a leaf, it returns a FINISH message to its parent. Each intermediate node waits until it has received a FINISH message from all of its children and then sends a FINISH message to its own parent. A FINISH message received by v (i.e. the root) indicates that every node of the fragment knows the current fragment identity. Then v starts Phase-II by sending the FINDMOE message.
• When the previous status of the link is Child: the node forwards a FAILURE message upward. Note that the node does not generate a new fragment identity to forward along with the FAILURE message. When the root node detects the failure, it initiates Phase-II directly by sending a FINDMOE message to its children.

Fig. 5. TEST message and response
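The INIT/FINISH exchange of the modified Phase-I can be sketched as follows (our own illustration, not the authors' code): the non-root fragment floods INIT<fid> down its tree links and collects FINISH back, while the root fragment skips Phase-I entirely, saving two messages per tree link of the root fragment.

```python
# Sketch of the modified Phase-I: INIT<fid> down each tree link of the
# non-root fragment, one FINISH back up the same link.

def phase_one(tree, root, fid, identity, stats):
    identity[root] = fid
    for child in tree.get(root, []):
        stats["msgs"] += 1               # INIT<fid> down a tree link
        phase_one(tree, child, fid, identity, stats)
        stats["msgs"] += 1               # FINISH back up the same link

# After failure of edge (u, v): v = 5 roots the non-root fragment.
non_root_fragment = {5: [6, 7], 6: [8]}  # 3 tree links
identity, stats = {}, {"msgs": 0}
phase_one(non_root_fragment, 5, ("w(u,v)", 5, 1), identity, stats)
print(stats["msgs"])                     # 6 = 2 x 3 tree links

root_fragment_links = 4                  # E' tree links in the root fragment
print(2 * root_fragment_links)           # 8 messages saved by skipping Phase-I
```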
E. Analysis

Compared with the algorithm of [1], in our approach the root fragment enters Phase-II directly, so the INIT<fid> and FINISH messages (REIDEN<id> and REIDEN ACK<id> in the algorithm of [1]) of Phase-I are not required for the root fragment. Suppose the root fragment has E′ edges after the failure. Executing Phase-I would require sending E′ INIT<fid> messages over E′ links, and likewise E′ FINISH messages over the same links. Hence the reconstruction process of our algorithm requires E′ + E′ = 2E′ fewer messages than the protocol described in [1].

Comparing our approach with the protocol of [1] in terms of message size, some control messages contain very few bits relative to those of [1]. For example, the FAILURE message in the root fragment contains only the control information indicating that a failure has occurred; no fragment identity is sent along with it. Since we assume that no further failure occurs during execution of the protocol, the ACCEPT and REJECT messages also carry only control information; no fragment identity is returned with them.

1) Complexity: Consider a network with N nodes and E edges. The initial MST contains N nodes and e edges before the failure. Suppose the root fragment contains N′ nodes and E′ edges. If the height of the root fragment is h′, then propagating the failure message to the root of the root fragment requires O(h′) messages. Propagating the new fragment identity in the other fragment requires O(N − N′) messages, because this information is propagated over the tree links. Similarly, sending the FINDMOE request and merging the two fragments require O(N) messages, since these messages are also sent over the tree links. However, finding the MOE of the fragments requires sending messages over O(E) links of the network. So the message complexity of the algorithm for a link failure is O(E).

F. Proof of Correctness

We first prove several lemmas used in the distributed algorithm, and then show that the algorithm generates the minimum-weight spanning tree upon completion.

Lemma III-F.1. Before any topological change, each node of the network has information about the tree links incident to it.

Proof: It is assumed that the MST is initially constructed using some distributed protocol; in our simulation we use the algorithm of [7] to construct the MST. Each node maintains a list of incident edges and their statuses as described in Section III-D, i.e. nodes know the links leading to their parent and the links leading to their children. These parent and child links are exactly the tree links included in the MST. Hence, each node is aware of the tree links incident to it before any topological change.

Lemma III-F.2. The roots of the fragments receive the failure notification within a finite time.

Proof: A failed link breaks the MST into two fragments, and the failure is notified to both ends of the link. The node attached to the root fragment propagates the FAILURE message toward the root element; the other node marks itself as root of the new fragment and generates the new fragment identity. Each message is assumed to reach its destination in finite time and in sequence whenever the link is working, and no further failure occurs during the execution. Hence the FAILURE message is correctly received by the roots of both fragments in finite time.

Lemma III-F.3. On each node, the fragment identity is updated according to the latest link failure before the start of Phase-II.

Proof: The root fragment reuses the previously known fragment identity of the existing MST, so it does not need to update its fragment identity; it enters Phase-II directly, and all nodes belonging to the root fragment already know the fragment identity. For the other fragment, the newly generated identity is propagated to the child nodes by the INIT<fid> message. A leaf node returns a FINISH message to its parent after updating its fragment identity, and an intermediate node sends a FINISH message to its parent only after receiving FINISH messages from all its children. As messages are assumed to reach their destinations in sequence and in finite time, receipt of the FINISH message by the root of the fragment implies that every node has already updated its fragment identity. The root node then initiates Phase-II for that fragment. Thus, when a node is in Phase-II, its fragment identity is always updated with the latest link failure.

Lemma III-F.4. Each fragment starts its Phase-II execution in finite time.

Proof: As no link failure occurs during execution of the protocol, all messages of Phase-I are properly answered by the nodes in finite time, and Phase-I terminates when the FINISH message is received by the root of a fragment. The root node then initiates Phase-II by sending the FINDMOE message. Hence each fragment starts its Phase-II in finite time.

Lemma III-F.5. Each fragment finds its MOE within a finite time.

Proof: After receiving the FINDMOE message, each node sends a TEST message over its minimum-weight non-tree link (one not in the Down status) to test whether the link leads to another fragment, and then waits for the response from the other end. The TEST message correctly reaches the other end, because the link is not marked Down and, by assumption, messages are not lost. The TEST response is delayed until the responding node starts executing Phase-II. By the previous lemma, the TEST message is answered within finite time, because the responder enters Phase-II within finite time. The response is also known to be correct, because in Phase-II the fragment identity has already been updated with the new fragment identity. Now let El be the local minimum outgoing
edge and Ec the minimum outgoing edge forwarded by all of its children. An intermediate node then forwards MOE = min(El, Ec) to its parent. Thus, at the root node, the MOE of the fragment is computed as MOE = min(El, Ec) within a finite time.
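The convergecast in this lemma can be illustrated with a short sketch (names are ours): each node reports the minimum of its own cheapest accepted non-tree link and the minima reported by its children, so the root ends up with the fragment-wide MOE.

```python
# Illustrative MOE convergecast: MOE = min(El, Ec) at every node.
import math

def find_moe(tree, node, local_moe):
    """local_moe maps node -> weight of its cheapest ACCEPTed non-tree link
    (math.inf when every non-tree link was REJECTed)."""
    child_min = min((find_moe(tree, c, local_moe) for c in tree.get(node, [])),
                    default=math.inf)
    return min(local_moe[node], child_min)   # MOE = min(El, Ec)

tree = {0: [1, 2], 1: [3]}
local_moe = {0: math.inf, 1: 10, 2: 7, 3: 15}
print(find_moe(tree, 0, local_moe))   # 7: the fragment's minimum outgoing edge
```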
Theorem III-F.1. The algorithm reconstructs the spanning tree in finite time, and the resulting spanning tree is the minimum-weight spanning tree of the network.

Proof: When the minimum-weight outgoing edge (MOE) is determined by the root, the fragment changes its root to the node on which the MOE is incident, and then sends a CONNECT message over the MOE. If the MOE found by one fragment is also the MOE of the other fragment, the two fragments merge into a single fragment; if no other fragment remains, the result is the desired spanning tree. As a merge is only possible when both fragments agree that the MOE is common, the algorithm clearly does not produce any cycle. Considering the case where only one link failure occurs, we can easily derive that the MOE of one fragment is also the MOE of the other fragment (since the network links have unique weights). Hence the fragments merge at the MOE and produce a spanning tree.

Moreover, no other outgoing edge can have weight smaller than the MOE, because the process filters and forwards only the minimum-weight edge from the leaf nodes to the root (MOE = min(El, Ec), by the last lemma), and the root node finally determines the MOE of the fragment. So the merge occurs at the minimum-weight edge between the two fragments, and the merged spanning tree has minimum total weight. Thus, upon termination, the algorithm has reconstructed the spanning tree of minimum weight, i.e. the MST of the network.

IV. EXPERIMENT AND RESULT

To compare the outcomes, we simulated several network graphs using Network Simulator v2 (NS2 2.29) [15]. We constructed the initial MST using the algorithm of [7]. The figures below show the results of one experiment.

Fig. 7. Initial Graph and Constructed MST

The example graph contains the vertex set V, edge set E, and corresponding weight set W given below.

Vertex:
V = {0, 1, 2, 3, 4, 5}

Edges:
E = {e1 = (0, 5), e2 = (0, 2), e3 = (0, 1), e4 = (1, 2), e5 = (1, 4), e6 = (1, 3), e7 = (2, 4), e8 = (2, 3)}

Weight:
W = {w(e1) = 4, w(e2) = 8, w(e3) = 15, w(e4) = 10, w(e5) = 3, w(e6) = 5, w(e7) = 6, w(e8) = 7}

Fig. 8. Recovered MST after failure (in red)

The messages required to complete the execution of the algorithm are tabulated in Table I. In Scenario 1, link e7 fails and initiates the recovery process. Running the construction algorithm (i.e. the algorithm of [7]) again after the failure requires 114 messages, whereas performance improves if the reconstruction algorithm of [1] is used instead. Our modified algorithm takes fewer messages than the existing reconstruction algorithm of [1].

TABLE I
LINK FAILURE VS REQUIRED MESSAGES

Broken Link | MST Construction | Algorithm of [1] | Modified Algorithm
Scenario 1: Single link failure
e7          | 114              | 59               | 55
Scenario 2: Single link failure
e6          | 114              | 54               | 46

V. CONCLUSION

In this paper, we have presented a distributed algorithm for reconstructing a minimum spanning tree after the deletion of a link. The problem can also be solved using the protocol of [1]. We showed that, for the single-link-deletion scenario, our protocol can reconstruct the MST with 2E′ fewer messages than the protocol of [1], where E′ is the number of links in the root fragment. If we consider an MST of large depth with the link failure occurring very close to a leaf node (i.e. E′ much greater than E − E′), our algorithm performs in a much better
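The experimental scenario above can be checked by hand with a short script (our own code, not part of the authors' NS2 simulation): build the MST of the example graph with Kruskal's algorithm, fail link e7 = (2, 4), and verify that the two fragments reconnect over their common minimum outgoing edge.

```python
# Worked check of Scenario 1: MST of the example graph, then recovery
# after the failure of e7 = (2, 4).

def kruskal(n, edges):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

edges = [(4, 0, 5), (8, 0, 2), (15, 0, 1), (10, 1, 2),
         (3, 1, 4), (5, 1, 3), (6, 2, 4), (7, 2, 3)]
mst = kruskal(6, edges)
print(sorted(mst))            # e7 = (6, 2, 4) is a tree link; total weight 26

failed = (6, 2, 4)
# Fragment of node 2: nodes reachable over the surviving tree links.
rest = [e for e in mst if e != failed]
frag, changed = {2}, True
while changed:
    changed = False
    for w, u, v in rest:
        if (u in frag) != (v in frag):
            frag |= {u, v}
            changed = True
# MOE: cheapest surviving edge leaving the fragment.
moe = min(e for e in edges if e != failed and ((e[1] in frag) != (e[2] in frag)))
print(moe)                    # (7, 2, 3): the fragments merge over e8
```

The recovered tree swaps e7 (weight 6) for e8 (weight 7), giving total weight 27, consistent with Fig. 8.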
way. However, when E′ is zero, i.e. the link failure occurs at the root node, the algorithm completes without any improvement in the total number of messages. We can refer to Scenario 3, where for the next link failure there is no improvement in the total number of messages.

VI. FURTHER WORK

When a TEST<fid> message is delayed until the node enters its find state, the same TEST<fid> could be used to determine whether the adjacent edge is Rejected or Accepted, instead of sending another TEST<fid> message over the same edge. However, this approach may lead to other difficulties due to the asynchronous nature of the system, and we are currently working on it.

Whenever the failure occurs at the root or very close to the root, the improvement is close to zero. We are working on making the algorithm use historical data in such a way that it yields an improvement for these scenarios as well.

We are also currently investigating how the algorithm can be modified to accept topological changes during its execution.

REFERENCES

[1] C. Cheng, I. A. Cimett, and S. P. R. Kumar, "A protocol to maintain a minimum spanning tree in a dynamic topology." Computer Communications Review 18, no. 4 (Aug. 1988): 330-338.
[2] R. Gallager, P. Humblet, and P. Spira, "A distributed algorithm for minimum-weight spanning trees." ACM Transactions on Programming Languages and Systems, 5(1):66-77, January 1983.
[3] F. Chin and H. Ting, "An almost linear time and O(n log n + e) messages distributed algorithm for minimum-weight spanning trees." Proceedings of the 26th IEEE Symp. on Foundations of Computer Science, pp. 257-266, 1985.
[4] E. Gafni, "Improvement in the time complexities of two message optimal protocols." Proceedings of the ACM Symp. on Principles of Distributed Computing, 1985.
[5] B. Awerbuch, "Optimal Distributed Algorithms for Minimum Weight Spanning Tree, Counting, Leader Election, and Related Problems," Symp. on Theory of Computing, pp. 230-240, May 1987.
[6] J. Garay, S. Kutten, and D. Peleg, "A Sub-Linear Time Distributed Algorithm for Minimum-Weight Spanning Trees." 34th IEEE Symp. on Foundations of Computer Science, pp. 659-668, November 1993.
[7] Gurdip Singh and Arthur J. Bernstein, "A highly asynchronous minimum spanning tree protocol." Distributed Computing, v.8, n.3, pp. 151-161, March 1995.
[8] M. Elkin, "A faster distributed protocol for constructing a minimum spanning tree." Proceedings of the ACM-SIAM Symp. on Discrete Algorithms, pp. 352-361, 2004.
[9] Michalis Faloutsos and Mart Molle, "Optimal Distributed Algorithm for Minimum Spanning Trees Revisited." Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing, pp. 231-237, 1995.
[10] B. Das and M. C. Loui, "Reconstructing a minimum spanning tree after deletion of any node." Algorithmica, 31, pp. 530-547, 2001.
[11] E. Nardelli, G. Proietti, and P. Widmayer, "Nearly linear time minimum spanning tree maintenance for transient node failures." Algorithmica, 40, pp. 119-132, 2004.
[12] Hichem Megharbi and Hamamache Kheddouci, "Distributed algorithms for Constructing and Maintaining a Spanning Tree in a Mobile Ad hoc Network." First International Workshop on Managing Context Information in Mobile and Pervasive Environments, 2005.
[13] P. Flocchini, L. Pagli, G. Prencipe, and N. Santoro, "Distributed computation of all node replacements of a minimum spanning tree." Euro-Par, volume 4641 of LNCS, pp. 598-607. Springer, 2007.
[14] B. Awerbuch, I. Cidon, and S. Kutten, "Optimal maintenance of a spanning tree." J. ACM 55, 4, Article 18 (September 2008), 45 pages.
[15] Network Simulator version 2 (NS2), URL: http://www.isi.edu/nsnam/ns/
A SAL based algorithm for convex optimization
problems
Amit Kumar Mishra
Department of Electronics and Communication Engineering
Indian Institute of Technology Guwahati, India
Email: akmishra@ieee.org
The above-mentioned updating is done using a digital successive-approximation algorithm. First, the search space is sampled digitally using a binary number representation; this representation, and the number of bits used for each point, depend on the desired accuracy of the algorithm. The starting point of the update is always 0. Let xi be the ith estimate of x0, and let B be the number of bits used to represent a point in the search space. In the ith iteration (i ∈ [1, B]), the (B − i)th bit (counting from the least significant bit 0) is updated in accordance with Algorithm 1. The update for the ith iteration is given in Algorithm 2.

TABLE I: Short description of the functions used in the pseudo-codes

Function name | Description
ceil          | ceiling function
Hs            | Heaviside step function
zeros         | zeros(K) gives a K-bit binary number with all bits set to 0
bin2int       | converts a binary number to the equivalent integer
findslope     | finds the slope of the cost function at the given arguments
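A minimal one-dimensional sketch of this update (our own rendering, with an assumed sign convention: a tentative 1-bit is kept only while the slope at the resulting point is non-positive, i.e. the minimiser has not yet been passed):

```python
# Digital successive-approximation search for the minimiser of a convex
# 1-D function, fixing bits from the most significant downward.

def sar_minimize(slope, lo, hi, bits):
    """Return the B-bit successive-approximation estimate of the minimiser
    of a convex function with derivative `slope` on [lo, hi]."""
    code = 0
    for i in range(1, bits + 1):            # i-th iteration fixes bit (bits - i)
        trial = code | (1 << (bits - i))
        x = lo + trial * (hi - lo) / (1 << bits)
        if slope(x) <= 0:                   # not past the minimum yet: keep the bit
            code = trial
    return lo + code * (hi - lo) / (1 << bits)

x_min = sar_minimize(lambda x: 2 * (x - 3.0), lo=0.0, hi=8.0, bits=20)
print(abs(x_min - 3.0) < 8.0 / 2**20)       # True: within one LSB of the minimum
```

The loop runs exactly B times regardless of the cost function, which is the fixed-iteration property claimed for the algorithm.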
Algorithm 4 Find (xO, yO), given the search space boundaries (xh, yh) and (xl, yl), and the desired resolution in the search space (δx, δy)
1: B1 ⇐ ceil(log2((xh − xl)/δx))
2: B2 ⇐ ceil(log2((yh − yl)/δy))
3: B ⇐ max(B1, B2)
4: δxup ⇐ (xh − xl)/2^B
5: δyup ⇐ (yh − yl)/2^B
6: arg ⇐ zeros(1, 2B)
7: sl ⇐ zeros(1, 2B)
8: argx ⇐ zeros(1, B)
9: argy ⇐ zeros(1, B)
10: for i = 1 to 2B do
11:   arg(i) ⇐ 1
12:   for j = 1 to B do
13:     argx(j) ⇐ arg(2j − 1)
14:     argy(j) ⇐ arg(2j)
15:   end for
16:   xi ⇐ xl + bin2int(argx) · δxup
17:   yi ⇐ yl + bin2int(argy) · δyup
18:   sl(i) ⇐ Hs(findslope(x(i), y(i)))
19:   arg(i) ⇐ sl(i)
20: end for
21: xO ⇐ x2B
22: yO ⇐ y2B
23: return (xO, yO)

IV. COMPARISON WITH SOME STANDARD OPTIMIZATION ALGORITHMS BASED ON NUMERICAL EXPERIMENTS

In this section we compare the proposed SAL-based optimization (SALO) algorithm with some of the existing powerful optimization algorithms reported in the open literature [5], [6]. The comparisons are made on the basis of performance in optimizing a scaled sine-valley function and the scaled Rosenbrock function. Comparisons are made with the classic BFGS algorithm, Yuan's modified BFGS algorithm [6], and the usual trust region method with curvilinear path (UTRCP) [5].

The algorithm has been validated on the following functions:
1) Problem 1: The first problem is a sine-valley function given by:

f(x1, x2) = 100[x2 − sin(x1)]^2 + 0.25 x1^2.   (1)

The starting point for this problem was (3π/2, −1). The solution is (0, 0).
2) Problem 2: The second problem is Rosenbrock's function given by:

f(x1, x2) = 100(x2 − x1^2)^2 + (1 − x1)^2.   (2)

The starting point for this problem was (−1.2, 1.0). The solution is (1, 1).

The problems were solved for a resolution of 10^-8, i.e. ||∇f(xk)|| ≤ 10^-8. The algorithms are compared on two factors, viz. the number of iterations needed (NI) and the number of function evaluations involved (NF).

Table II gives the consolidated results.

TABLE II: Comparison of the proposed algorithm with some standard optimization algorithms

Prob. no. | BFGS [6] NI/NF | MBFGS [6] NI/NF | UTRCP [5] NI/NF | SALO NI/NF
1         | 40/57          | 39/54           | 22/35           | 17/35
2         | 33/45          | 34/45           | 22/38           | 17/35

It can be observed that the performance of the proposed SALO algorithm is better than or comparable to some of the best algorithms in the literature. However, the classic algorithms are designed to work on any unconstrained function; generalization of the SALO algorithm along the same lines is in progress.

V. CONCLUSIONS AND DISCUSSIONS

We have presented an algorithm based on the digital successive-approximation principle for convex optimization problems. The algorithm can be applied directly to SVO and, with minor modification, to MVO. Some of the major advantages of this algorithm are as follows:
• The number of iterations is fixed and equal to the number of bits used to represent numbers in the sample space, irrespective of the cost function.
• The resolution of the algorithm is chosen by the user and is fixed for a given number of bits, irrespective of the cost function.
• A D-dimensional SVO takes DB iterations, which greatly reduces the computational complexity of an SVO.
• The complete algorithm runs in the digital domain and hence is highly amenable to digital computer implementation.
• The proposed algorithm also works for finding non-smooth maxima or minima, provided there are no local maxima or minima.

We have tested the algorithm on a range of quadratic functions with different numbers of variables. In all cases the algorithm found the optimum with error < 2^-B.

We have not discussed the boundary problem in this paper. Still, because of the advantages mentioned above, the algorithm promises to be a useful practical solution for convex optimization problems in any domain.

REFERENCES

[1] R. Fletcher, Practical Methods of Optimization. Wiley Interscience, 1987.
[2] M. Powell, "How bad are the BFGS and DFP methods when the objective function is quadratic?" Mathematical Programming, vol. 34, pp. 34-47, 1986.
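A simplified two-dimensional reading of Algorithm 4 (our own interpretation: the interleaved odd/even bit positions become per-coordinate bits, findslope is taken as the partial derivative in the coordinate being updated, and a bit is kept while that slope is non-positive) can be sketched as:

```python
# Coordinate-interleaved successive-approximation search in 2-D,
# a simplified sketch of Algorithm 4 under assumed sign conventions.

def sar2d(dfdx, dfdy, xl, xh, yl, yh, bits):
    cx = cy = 0
    for i in range(1, bits + 1):
        bit = 1 << (bits - i)
        # tentatively set the next bit of x; keep it while the x-slope is <= 0
        x = xl + (cx | bit) * (xh - xl) / (1 << bits)
        y = yl + cy * (yh - yl) / (1 << bits)
        if dfdx(x, y) <= 0:
            cx |= bit
        # same update for the corresponding bit of y
        x = xl + cx * (xh - xl) / (1 << bits)
        y = yl + (cy | bit) * (yh - yl) / (1 << bits)
        if dfdy(x, y) <= 0:
            cy |= bit
    return (xl + cx * (xh - xl) / (1 << bits),
            yl + cy * (yh - yl) / (1 << bits))

# Convex test problem (ours): f = (x - 1)^2 + (y - 2)^2 on [0, 4]^2.
x, y = sar2d(lambda x, y: 2 * (x - 1), lambda x, y: 2 * (y - 2), 0, 4, 0, 4, 24)
print(abs(x - 1) < 4 / 2**24 and abs(y - 2) < 4 / 2**24)   # True
```

As in the 1-D case, the run length is fixed at 2B slope evaluations, matching the DB-iterations property claimed in the conclusions.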
[3] J. F. Wakerly, Digital Design: Principles and Practices. Prentice Hall, 1999.
[4] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electronics and Computers, vol. EC-8, pp. 330-334, 1959.
[5] Y. Xiao and F. Zhou, "Nonmonotone trust region methods with curvilinear path in unconstrained optimization," Computing, vol. 48, pp. 303-317, 1992.
[6] Y.-X. Yuan, "A modified BFGS algorithm for unconstrained optimization," IMA Journal of Numerical Analysis, vol. 11, pp. 325-332, 1991.
ADCOM 2009
WIRELESS SENSOR
NETWORKS
Session Papers:
1. V. V. S. Suresh Kalepu and Raja Datta , “Energy Efficient Cluster Formation using Minimum
Separation Distance and Relay CH’s in Wireless Sensor Networks”
2. Pankaj Gupta, Tarun Bansal and Manoj Misra, “An Energy Efficient Base Station to Node
Communication Protocol for Wireless Sensor Networks”
3. R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain, “A Broadcast Authentication Protocol
for Multi-Hop Wireless Sensor Networks”,
Energy Efficient Cluster Formation using Minimum Separation Distance
and Relay CH’s in Wireless Sensor Networks
V. V. S. Suresh Kalepu and Raja Datta, Member, IEEE
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur, India, Kharagpur-721302
has a drawback in that the clusters are not evenly distributed, due to the random rotation of the local cluster heads.

The Power Efficient Gathering in Sensor Information Systems (PEGASIS) protocol, another clustering-based routing protocol, further enhances network lifetime by increasing local collaboration among sensor nodes [5]. In PEGASIS, nodes are organized into a chain using a greedy algorithm so that each node transmits to and receives from only one of its neighbors. In each round, a randomly chosen node from the chain transmits the aggregated data to the base station, thus reducing the per-round energy expenditure compared to LEACH.

III. Network and Radio Models

The Network Model and Architecture

Our proposed protocol builds on the realization that the base station is a high-energy node with a large energy supply; thus, it utilizes the base station to control the coordinated sensing task performed by the sensor nodes. In this article we assume a sensor network model, similar to those used in [3, 6], with the following properties:
• A fixed base station is located far away from the sensor nodes.
• The sensor nodes are energy constrained, with a uniform initial energy allocation.
• The nodes are equipped with power control capabilities to vary their transmitted power.
• Each node senses the environment at a fixed rate and always has data to send to the base station.
• All sensor nodes are immobile.

The two key elements considered in the design of the protocol are the sensor nodes and the base station. The sensor nodes are geographically grouped into clusters and are capable of operating in two basic modes:
• The cluster head mode
• The sensing mode

In the sensing mode, the nodes perform sensing tasks and transmit the sensed data to the cluster head. In cluster head mode, a node gathers data from the other nodes within its cluster, performs data fusion, and routes the data to the base station through other cluster head nodes. The base station in turn performs the key tasks of cluster formation, randomized cluster head selection, and CH-to-CH routing path construction.

The Radio Model

As shown in Fig. 1, a typical sensor node consists of four major components: a data processor unit; a micro-sensor; a radio communication subsystem that consists of transmitter/receiver electronics, antennae, and an amplifier; and a power supply unit [1]. Although energy is dissipated in all of the first three components of a sensor node, we mainly consider the energy dissipation associated with the radio component, since the core objective of this article is to develop an energy-efficient network-layer protocol to improve network lifetime. In addition, energy dissipated during data aggregation in the cluster head nodes is also taken into account.

Figure 1. Major components and energy cost parameters of a sensor node

In our analysis, we use the same radio model discussed in [9]. The transmit and receive energy costs for the transfer of an l-bit data message between two nodes separated by a distance of d meters are given by Eqs. 1 and 2, respectively (the symbols were lost in extraction and are reconstructed here from the standard first-order radio model the text describes):

E_Tx(l, d) = l · E_Tx-elec + E_amp(l, d)   (1)
E_Rx(l) = l · E_Rx-elec   (2)

Here E_Tx(l, d) in Eq. 1 denotes the total energy dissipated in the transmitter of the sensor node, and E_Rx(l) in Eq. 2 represents the energy cost incurred in the receiver of the destination node. The parameters E_Tx-elec and E_Rx-elec in Eq. 1 and Eq. 2 are the per-bit energy dissipation for transmission and reception, respectively. E_amp(l, d) is the energy required by the transmit amplifier to maintain an acceptable signal-to-noise ratio in order to transfer data messages reliably. As in [6], we use both the free-space propagation model and the two-ray ground propagation model to approximate the path loss sustained due to wireless channel transmission. Given a threshold transmission distance d0, the free-space model is employed when d < d0, and the two-ray model is applied when d ≥ d0. Using these two models, the energy required by the transmit amplifier is given by

E_amp(l, d) = l · ε_fs · d^2 if d < d0, and l · ε_tr · d^4 if d ≥ d0   (3)

where ε_fs and ε_tr denote the transmit amplifier parameters corresponding to the free-space and two-ray models, respectively, and d0 is the threshold distance given by

d0 = sqrt(ε_fs / ε_tr)   (4)

We assume the same set of parameters used in [3] for all experiments throughout the article; the energy cost for data aggregation is likewise set to the value used in [3].

IV. Energy efficient Routing protocol using MSD and Relay CH's

The proposed routing technique is an extension of LEACH. It uses a centralized cluster formation algorithm to form clusters, meaning that cluster formation is carried out by the BS. The protocol uses the same steady-state protocol as LEACH. During the set-up phase, the base station receives information from each node about
their current location and energy level. After that, the base station runs the centralized cluster formation algorithm to determine cluster heads and clusters for that round. Once the clusters are created, the base station broadcasts the information to all the nodes in the network. Each of the nodes, except the cluster head, determines its local TDMA slot, used for data transmission, before it goes to sleep until it is time to transmit data to its cluster head, i.e., until the arrival of the next slot.
In our method, during the set-up phase we use for cluster formation the minimum separation distance method proposed by Hansen and Neander [7], which overcomes the drawback of the LEACH protocol by spatially distributing the cluster heads. A simple algorithm to find and select cluster heads is described below.

4.1 Cluster head selection algorithm

We randomly choose a node among the eligible nodes to become cluster head, but we also make sure that the nodes are separated by at least the minimum separation distance (if possible) from the other cluster head nodes.
Algorithm: CH selection algorithm
MSD = Minimum Separation Distance
dc = Number of desired cluster heads
energy(n) = Remaining energy for node n
In the cluster head selection part, cluster heads are randomly chosen from a list of eligible nodes. To determine which nodes are eligible, the average energy of the remaining nodes in the network is calculated. In order to spread the load evenly, only nodes with energy levels above average are eligible.
If a node that has been randomly chosen is too close, i.e., within the range of the minimum separation distance from all other chosen cluster heads, a new node has to be chosen to guarantee the minimum separation distance. This process iterates until the desired number of cluster heads is attained. If we cannot find a node outside the range of the minimum separation distance, we choose any node among the eligible nodes to become cluster head. When all cluster heads have been chosen and separated, generally by at least the minimum separation distance, clusters are created the same way as in LEACH.

4.2 Our Approach

In WSNs, asymmetric communication is possible. That is, the base station reaches all the sensor nodes directly, while some sensor nodes cannot reach the base station directly and need other nodes to forward their data; hence routing schemes are necessary.
As the network size increases, the transmission distance within a cluster increases, and thereby the energy consumption increases.
In our approach we present the designed energy-efficient cluster-based routing protocol to overcome the above drawbacks. This section includes the selection of the relay cluster heads to forward the data while the minimum separation distance between clusters is maintained. We also present an efficient routing technique for large sensor network areas: to forward the aggregated data from the CHs to the BS, relay nodes are required. The selection of relay nodes is described below.

4.2.1 Selection of Forwarding CHs

Once the CHs are identified and the nodes are clustered relative to their distance from the CHs, routing towards the base station (BS) is initiated. First, the CH checks if the BS is within communication range. If so, data is sent directly to the BS. Otherwise, the data from the CHs in the sub-network are sent over a multi-hop route to the BS. Here, the selection of a relay node is set to maximize the link cost factor (LCF), which includes energy, end-to-end delay, and distance from the BS to the RN.
Initially, a CH broadcasts HELLO packets to all CH nodes in range and receives ACK packets from all the relay candidates that are in communication range. The ACK packets contain information such as the node ID, the available energy, the processing delay at the node, and the distance from the BS. The RNs that are further away from the BS than the current node do not respond to the HELLO packets. If one of the ACK packets was sent from the BS, then it is selected as the next hop node, thus ending the route discovery procedure. Otherwise, the current node builds a list of potential RNs from the ACKs. Then it selects the optimal RN using the LCF parameter. The same procedure is carried out for all hops to the BS. The advantage of this routing method is the reduction of the number of relay nodes that have to forward data in the network; hence the scheme reduces overhead and minimizes the number of hops and the communication due to flooding.
The LCF from a node to its next hop node is given by (5), where τ represents the delay to reach the next hop node, D is the distance between the next hop node and the BS, and E is the energy remaining at the next hop node:

    LCF = E / (τ · D)        (5)

In equation (5), consideration of the remaining energy at the next hop node increases network lifetime, the distance to the BS from the next hop node reduces the number of hops and end-to-end delay, and the delay
incurred to reach the next hop node minimizes any channel fading problems and processing delay. When multiple RNs are available for routing of packets, the most optimal RN is selected based on the maximum LCF.

4.2.2 Using MST Intra-cluster for Large Sensor Networks

In this protocol the main idea is to use MSTs to replace direct communication in one layer of the network: intra-cluster. The average transmission distance of each node can be reduced by using MSTs instead of direct transmissions, and thus the energy dissipation of transmitting data is reduced.

V. Simulation Results

For an evaluation to be meaningful, the performance of the proposed protocol should be compared with the performances of certain well-known existing energy-aware protocols, namely LEACH and PEGASIS. Performance is measured by the quantitative metrics of average energy dissipation, system lifetime, total data messages successfully delivered, and number of nodes that are alive.
For these simulations, we consider a random network configuration with 100 nodes, where each node is assigned an initial energy of 1 J. Furthermore, the data message size for all simulations is fixed at 500 bytes, and the packet header for each type of packet is 25 bytes long.
In the performed simulations we have varied the minimum separation distances between cluster heads, in order to see the effects on the energy consumption in the network. We have also investigated whether the number of clusters used, together with the minimum separation distance, has any effect on the energy consumption. The minimum separation distance varied between 30 and 45 meters, and the number of clusters varied between 2 and 8 clusters.
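The cluster-head selection of Section 4.1 and the LCF-based relay choice of Section 4.2.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation; the data layout (dictionaries with 'pos', 'energy', 'delay', and 'd_bs' fields) is our own choice.

```python
import math
import random

def select_cluster_heads(nodes, dc, msd, max_tries=1000):
    """Randomized CH selection with a minimum separation distance (MSD).
    `nodes` maps node id -> {'pos': (x, y), 'energy': J}.  Only nodes with
    above-average residual energy are eligible (Section 4.1)."""
    avg_e = sum(n['energy'] for n in nodes.values()) / len(nodes)
    eligible = [i for i, n in nodes.items() if n['energy'] > avg_e]
    heads, tries = [], 0
    while len(heads) < dc and eligible:
        cand = random.choice(eligible)
        far_enough = all(
            math.dist(nodes[cand]['pos'], nodes[h]['pos']) >= msd
            for h in heads)
        if far_enough or tries >= max_tries:
            # Fallback: if no node satisfies the MSD, accept any eligible node,
            # as the paper prescribes.
            heads.append(cand)
            eligible.remove(cand)
            tries = 0
        else:
            tries += 1
    return heads

def select_relay(candidates):
    """Pick the relay maximizing LCF = E / (delay * distance-to-BS), eq. (5).
    Each candidate is {'energy': E, 'delay': tau, 'd_bs': D, ...}."""
    return max(candidates,
               key=lambda c: c['energy'] / (c['delay'] * c['d_bs']))
```

The fallback after `max_tries` failed draws mirrors the paper's rule of choosing any eligible node when the separation constraint cannot be met.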
Table 1. Characteristics of the test network.
  Nodes: 100
  Network size: 100 m x 100 m
  Base station location: (50, 175)
  Radio propagation speed: 3x10^8 m/s
  Processing delay: 50 µs
  Radio speed: 1 Mbps
  Data size: 500 bytes

Figure 4. Distribution of sensor nodes and cluster formation with msd = 30 m.

In Figure 3, we see how the minimum separation distance affects the energy consumption, i.e., the number of messages received at the base station during the lifetime of the network. We also see how the number of clusters used affects the energy consumption in the
network. Further, we see that when using 5 clusters and a minimum separation distance of 30 meters between cluster heads, the base station receives the most messages; this gives the most energy-efficient configuration. Figure 4 gives the distribution of sensor nodes and the formation of clusters with a minimum separation distance of 30 m.
The improvement gained through the MSD with relay CHs protocol is exemplified by the system lifetime graph in Fig. 5. This plot shows the number of nodes that remain alive over the number of rounds of activity for the 100 m × 100 m network scenario. With the MSD protocol, all the nodes remain alive for 920 rounds, while the corresponding numbers for LEACH and PEGASIS are 510 and 825, respectively. Furthermore, if system lifetime is defined as the number of rounds for which 75 percent of the nodes remain alive, the proposed protocol exceeds the system lifetime of LEACH by 30 percent. A 5 percent improvement in system lifetime is observed over PEGASIS.
Next we analyze the number of data messages received by the base station for the three routing protocols under consideration. For this experiment, we again simulated the 100 m × 100 m network topology, where each node begins with an initial energy of 1 J. Figure 7 shows the total number of data messages received by the base station over the average energy dissipation. The plot clearly illustrates the effectiveness of the proposed protocol in delivering significantly more data messages than its counterparts. Moreover, the results in Fig. 7 confirm that the MSD protocol delivers the most data messages per unit of energy of the two schemes. In the final experiment, we evaluate the performance of the routing protocols as the area of the sensor field is increased. For this simulation, 100 nodes are randomly placed in a square field of varying network areas with the base station located at least 75 m away from the closest sensor node, and results were obtained over 25 different network topologies for each network area instance. Figure 8 shows the number of alive nodes after 900 rounds by varying the network area.
There is a comparison in performance between the MSD with relay CHs protocol, LEACH, PEGASIS, and the MST protocol.

Figure 5. Number of alive nodes as the number of rounds (in hundreds) increases, for LEACH, PEGASIS, and MSD.
uniformly placed across the whole sensor field. As a result, the cluster head nodes in LEACH can become concentrated in a certain region of the network, in which case nodes from the "cluster head deprived" regions will dissipate a considerable amount of energy while transmitting their data to a faraway cluster head. The utilization of the greedy algorithm in PEGASIS results in a gradual increase in neighbor distances. This in turn increases the communication energy cost for those PEGASIS nodes that have far neighbors. As shown shortly, increasing neighbor distances have a significant effect on PEGASIS' performance when the area of the sensor field is increased.

Figure 9. Network lifetimes in a network area of 400 m x 400 m (number of nodes alive versus number of rounds, for MSD and MST intra-cluster).

Figure 9 gives the performance comparison of the MSD routing protocol and the MST routing protocol for a 400 m network area. The plots also show that the tree topology makes the protocol perform better than direct transmission in a larger network area; the MST protocol uses a tree topology, whereas the MSD protocol uses direct communication intra-cluster. The simulation results show that MST is an elegant solution in a large network area.

VI. Conclusion

We presented a simple energy-efficient cluster formation algorithm for the AROS architecture. The simulations showed that using a minimum separation distance between cluster heads and using forwarding CHs improves energy efficiency compared to LEACH and PEGASIS, measured by the number of messages received at the base station. Using 5 clusters and a minimum separation distance of 30 meters between cluster heads is the most energy-efficient configuration for our simulated network.
Using an MST within the cluster is an elegant solution for large sensor networks. From the simulation results, this approach reduces the distance between a cluster head and its cluster member nodes, thereby reducing the transmission energy when a cluster member node communicates with its cluster head. Results show that it is more energy efficient than LEACH and PEGASIS for large sensor networks.

References

1) I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "A Survey on Sensor Networks," IEEE Communications Magazine, 40(8):102-114, 2002.
2) J. N. Al-Karaki and A. E. Kamal, "Routing Techniques in Wireless Sensor Networks: A Survey," IEEE Wireless Communications, vol. 11, pp. 6-28, Dec. 2004.
3) W. Heinzelman, A. Chandrakasan, and H. Balakrishnan, "Energy-Efficient Communication Protocols for Wireless Microsensor Networks (LEACH)," Proc. of the 33rd Hawaii International Conference on System Sciences, vol. 8, pp. 3005-3014, January 4-7, 2000.
4) R.-S. Chang and C.-J. Kuo, "An Energy Efficient Routing Mechanism for Wireless Sensor Networks," IEEE, 2006.
5) S. Lindsey and C. Raghavendra, "PEGASIS: Power-Efficient Gathering in Sensor Information Systems," Proc. of the 2002 IEEE Aerospace Conference, pp. 1-6, March 2002.
6) S. D. Muruganathan, D. C. F. Ma, R. I. Bhasin, and A. O. Fapojuwo, "A Centralized Energy-Efficient Routing Protocol for Wireless Sensor Networks," IEEE Communications Magazine, vol. 43, pp. 8-13, 2005.
7) E. Hansen, J. Neander, M. Nolin, and M. Björkman, "Energy-Efficient Cluster Formation for Large Sensor Networks using a Minimum Separation Distance," Proc. of the Fifth Annual Mediterranean Ad Hoc Networking Workshop (MedHocNet), Lipari, Italy, June 2006.
8) G. Huang, X. Li, and J. He, "Dynamic Minimum Spanning Tree Routing Protocol for Large Wireless Sensor Networks," IEEE, 2006.
9) W. B. Heinzelman and A. P. Chandrakasan, "An Application-Specific Protocol Architecture for Wireless Microsensor Networks," IEEE Transactions on Wireless Communications, vol. 1, pp. 660-670, October 2002.
10) N. Israr and I. Awan, "Multihop Routing Algorithm for Inter-Cluster Head Communication," Proc. of the 22nd UK Performance Engineering Workshop, Bournemouth, UK, pp. 24-31, July 2006.
11) M. Younis, M. Youssef, and K. Arisha, "Energy-aware management for cluster-based sensor networks," Computer Networks, 43:649-668, Dec. 2003.
12) A. Manjeshwar and D. P. Agrawal, "TEEN: A Routing Protocol for Enhanced Efficiency in Wireless Sensor Networks," Proc. of the 15th International Parallel and Distributed Processing Symposium, pp. 2009-2015, April 2001.
13) A. Manjeshwar and D. P. Agrawal, "APTEEN: A Hybrid Protocol for Efficient Routing and Comprehensive Information Retrieval in Wireless Sensor Networks," Proc. of the International Parallel and Distributed Processing Symposium (IPDPS 2002), pp. 195-202, April 2002.
14) J. Chang and L. Tassiulas, "Maximum lifetime routing in wireless sensor networks," IEEE/ACM Transactions on Networking, 12(4):609-619, 2004.
15) A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, and J. Anderson, "Wireless Sensor Networks for Habitat Monitoring," Proc. of WSNA'02, September 2002.
An Energy Efficient Base Station to Node
Communication Protocol for
Wireless Sensor Networks
Pankaj Gupta Tarun Bansal Manoj Misra
Department of E.C.E. Department of Computer Science Department of E.C.E.
Indian Institute of Technology Roorkee University of Texas at Dallas Indian Institute of Technology Roorkee
Roorkee-247667, India Richardson TX, USA Roorkee-247667, India
pankiuec@iitr.ernet.in tarun@student.utdallas.edu manojfec@iitr.ernet.in
data from sensor nodes to the base station. These protocols use flooding for data transmission from the BS to individual nodes [3]. Obviously, flooding proves to be a very costly solution. This paper presents a novel method for the base station to communicate with nodes in the network, which could be used in scenarios where BS-to-node communication is frequent, for example, in cases where the BS has to update some parameters at a particular node.
The rest of the paper is organised as follows. In Section II, we discuss previous work in this area. Section III presents an outline of the sensor network model we used and our assumptions. Our protocol is then described in Section IV. In Section V we analyze the performance of the algorithm using simulations. Lastly, the paper is concluded in Section VI with pointers to future work.

II. RELATED WORK

An adaptive clustering scheme called Low-Energy Adaptive Clustering Hierarchy (LEACH) is proposed in [4], which tries to reduce the number of nodes communicating directly with the BS. The protocol achieves this by forming a few clusters (elected randomly), where each cluster-head (CH) collects the data from the nodes in its cluster, fuses it, and sends the result to the BS. LEACH-C [4] uses a centralized cluster formation algorithm to guarantee k nodes in the cluster and minimize the total energy spent by the non-cluster-head nodes, by evenly distributing the CHs throughout the network. UDACH proposed in [5] works along similar lines; however, here the cluster heads are selected based upon the residual energy of each node.
Another clustering-based approach called BCDCP [6] makes clusters of equal size to ensure similar power dissipation. PEDAP [7] follows a minimum spanning tree organisation with the BS as the root, improving the total lifetime and scalability. In PEGASIS [8] a chain is constructed among the sensor nodes so that each node receives from and transmits to a close neighbour.
The protocols discussed above and other data-oriented centralized network protocols allow efficient node-to-BS communication. However, when the base station has to communicate with the nodes, these protocols rely on flooding, where each receiver is obligated to retransmit any packet it has not seen before in the network. Network-wide flooding reduces the network capacity by sending information to mobile hosts which are not supposed to receive it, thus increasing the traffic load and the packet collision rate. This also leads to an increase in individual node power consumption. Obviously this proves to be a very costly solution, especially when it comes to industrial deployments where the base station has to frequently communicate the values of various parameters to the nodes. For example, consider a real-life scenario where nodes are deployed to monitor temperature as directed by the BS and the sampling frequency is different at each node.
Various alternatives to blind flooding have been proposed. Typically these techniques aim at minimising the number of retransmissions of broadcast messages. One of the alternatives proposed is randomised forwarding, where each node forwards a packet to its neighbours with a probability p. This scheme was termed gossiping [9]. The typical value of p lies in the range 0.65 to 0.75 for acceptable reliability of data delivery. However, this probabilistic forwarding increases the delay in data delivery. Directed flooding, proposed in [10], sends data in a specific directional virtual aperture instead of broadcasting. Only nodes within this virtual aperture forward packets, and thus power consumption is reduced while maintaining a low overhead. However, it is very difficult to decide the size of the aperture: if the aperture is small, the adjusting times will be large, and increasing the aperture gives less adjustment but reduces the benefit of directed flooding because of increased overhead. The alternatives proposed in [11, 12] require 1- or 2-hop neighbour information at the nodes and thus do not scale with increasing node density.
Many solutions are available in the literature to solve these problems and avoid flooding. The first solution, LEAR, proposed in [13], is inspired by Dynamic Source Routing [14], where the base station puts the complete path to the destination node in the packet. Intermediate nodes read the path in the received packets and use it to forward the packet to the destination. However, this solution incurs the overhead of carrying the whole path inside the packet. When the network size is large with multiple hops, the overhead of carrying the whole path in the packet proves to be very costly.
The second solution, given by Hyun-sook Kim et al. in [15], requires nodes to maintain information about all their children. So whenever a node receives a packet from the BS, it forwards it to one of its children on the basis of the final destination address. This solution is also not scalable, as the size of the routing tables grows with increasing network density.

III. SYSTEM MODEL AND ASSUMPTIONS

The system consists of the following components:

Node: This refers to sensor nodes. Sensor nodes are the heart of the network; they are in charge of collecting and processing data and routing it to the sink. In other words, sensor nodes can sense data from the environment, perform simple computations, and transmit this data wirelessly to a command center either directly or in a multi-hop fashion through neighbors.

Base Station: The base station is a sensor node responsible for getting requests for data collection from applications. The BS is also responsible for calculation of the routing tree according to
the underlying routing protocol. The base station thus coordinates and controls the sensing tasks performed by the sensor nodes.

Destination Node: This is the node to which the BS wants to send information.

Intermediate Nodes: These are sensor nodes which come in the path between the BS and the destination node. Intermediate nodes forward data packets based upon the underlying routing protocol.

Network topology: This is a connectivity graph where the nodes are sensor nodes and the edges are communication links. In a wireless network, a link represents a one-hop connection, and the neighbours of a node are those within its radio range.

Following are our assumptions, which are consistent with the assumptions made in the literature [4, 6, 7, 8]:
• The base station is a high-energy node with a large amount of energy supply.
• Sensor nodes are homogeneous and energy constrained, with uniform initial energy.
• Sensor nodes are equipped with power control capabilities to vary their transmitted power.
• Sensor nodes exhibit no mobility.
• All sensor nodes communicate through wireless links over a single shared channel.
• Links between two sensor nodes are bidirectional.
• Each node knows its current energy and location (using GPS [16] or other localization mechanisms).
• A message sent by a node is received correctly within a finite time by all one-hop neighbours [17].
• The network topology does not change during network operation.
• Each node can be identified uniquely by its identifier.
• Single-hop broadcast refers to the operation of sending a packet to all single-hop neighbours.

position of the nodes, residual energy, and other heuristic parameters. The base station then broadcasts this routing tree to the nodes, which on receiving the routing tree rebroadcast it. Finally all the nodes in the field are aware of the routing tree. Nodes then use the information available in the routing tree to route their data to the BS. Although this setup allows nodes to transfer their data to the base station efficiently, it provides no support for BS to individual node communication.
In MSTBBN, we propose that along with the routing tree, the BS assigns each node a key KMSTBBN (inspired by the concept of m-way search trees [18]), which is later used by the BS to communicate efficiently with any particular node without any packet or memory overhead. MSTBBN is not limited to the use of any particular routing algorithm: any of the centralized algorithms (like LEACH-C or PEDAP) can be used to provide the underlying routing capabilities. MSTBBN provides BS-to-node communication in O(h) hops and O(h) time and message complexity, where h is the height of the routing tree constructed by the underlying routing protocol with the BS as root.

Next we describe the working of the MSTBBN protocol in the following four phases.

Phase 1: Calculation of the routing tree by the BS.

At the beginning of each round, the routing tree is calculated by the underlying routing protocol. For example, fig. 1 shows the routing tree calculated by LEACH-C with the BS, Cluster Heads (CHs), and Nodes.

Phase 2: Allocation of KMSTBBN by the BS.

In the next phase, using the idea of m-way search trees, the BS assigns a key KMSTBBN to each node. The following rules are followed by the BS while assigning keys to the nodes, where KMSTBBN(X) refers to the key assigned to node X according to MSTBBN:
• Rule 1: KMSTBBN(N1) > KMSTBBN(N2) where N1 is a child of N2. E.g., in fig. 2, KMSTBBN(A) = 0, which is the lowest in the whole tree, since A is the root (BS).
• Rule 2: KMSTBBN(N1) < KMSTBBN(N2) where N2 is the right sibling of N1. E.g., in fig. 2, KMSTBBN(B) < KMSTBBN(C), since C is the right sibling of B, where KMSTBBN(B) = 1 and KMSTBBN(C) = 6.
• Rule 3: KMSTBBN(N3) > KMSTBBN(N1) where N1 is in the subtree rooted at N2 and N3 is the right sibling of N2. E.g., in fig. 2, KMSTBBN(C) < KMSTBBN(D), since D is the right sibling of C, where KMSTBBN(C) = 6 and KMSTBBN(D) = 12, and both are rooted at A (the base station).
• Rule 4: KMSTBBN(N1) < KMSTBBN(N3) where N3 is in the subtree rooted at N2 and N1 is the left sibling of N2. E.g., in fig. 2, KMSTBBN(C) < KMSTBBN(Z), since C is the left sibling of D and Z is in the subtree rooted at D, where KMSTBBN(C) = 6, KMSTBBN(D) = 12, and KMSTBBN(Z) = 14.

Figure 2. Routing tree with key values assigned to nodes (BS A = 0; CH B = 1 with children 2-5; CH C = 6 with children 7-11; CH D = 12 with children 13-16, including Z = 14).

Observe that this method of assigning keys converts the routing tree into a search tree. Moreover, it ensures that the node-to-BS routing paths in the resulting tree are the same as those calculated by the underlying protocol. This is important, as MSTBBN does not require any alteration to the routing tree created by the underlying protocol. In phase 4, we explain how these keys are used to provide efficient BS-to-node communication.

Phase 3: BS broadcasts keys to nodes.

The BS next broadcasts the KMSTBBN of every node. This broadcast is done in a similar fashion as in the underlying protocol (like LEACH-C). Moreover, this broadcast can be piggybacked by the BS while broadcasting the routing tree, thus further cutting down the network overhead. When a node receives these packets, it makes a record of its own KMSTBBN and the KMSTBBN of its child with the highest value of KMSTBBN (or the base station can assign each node this value). Thus the memory requirement of MSTBBN is O(1). For example, Table I shows the KMSTBBN values stored at node C.

TABLE I. KMSTBBN'S STORED AT NODE C
  KMSTBBN(C): 6
  KMSTBBN(C's child with maximum KMSTBBN): 11

Phase 4: BS to node communication.

E.g., in fig. 2, if base station A has to send data to node Z, where KMSTBBN(Z) = 14 = KMSTBBN(destination), it will forward its data to node D with KMSTBBN(D) = 12. However, owing to the broadcast nature of the wireless medium, nodes B and C will also receive the transmitted data (apart from node D). According to the flowchart shown in fig. 3, a node i on receiving the packet will check whether KMSTBBN(destination) (obtained from the packet) lies between the KMSTBBN of node i and the KMSTBBN of i's child with the highest value of KMSTBBN. If not, node i will drop the packet instead of processing it. Hence node i will only forward the packet if the destination node is in its subtree. Here nodes B and C will drop the packet, since Z is in neither the subtree of B nor that of C. Node D will forward the packet to its children, since Z is in the subtree of D. Except for Z, the rest of D's children will discard this packet (since neither is the packet intended for them, nor does the destination lie in their subtree), and in this way the packet finally reaches its destination. It can be easily seen that these decisions follow directly from the traditional m-way tree search algorithms.

V. PERFORMANCE EVALUATION

In order to evaluate our scheme, we simulated MSTBBN over LEACH-C [4] and PEDAP [7] as underlying routing protocols on ns-2 [19, 20] and compared its performance with flooding and directed flooding. We used the following model in our simulation studies:
• The Wireless Sensor Network consisted of randomly (uniformly) distributed nodes in a square field of size 100 x 100 m2.
• Each sensor node was assigned an initial energy of 20 Joules.
• The size of a data message was set to 500 bytes.
• The first-order radio model as described in [4] was used to calculate the energy consumption for receiving and transmitting a data packet.
• The base station was located at the centre of the field.
• One round was defined as the duration of time from when the BS initiates sending of a data packet to a node to the time when this node receives the packet.
• The radio range of the nodes was set to 40 m.
• Performance was measured by the quantitative metrics of average energy dissipated per node and number of packets forwarded.

Figure 5: Average energy dissipated by nodes in LEACH-C using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) versus number of rounds.
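The first-order radio model referenced in the bullets above can be sketched as follows. The numeric parameter values are the ones commonly used with this model in the LEACH literature, stated here as assumptions rather than values taken from this paper:

```python
E_ELEC = 50e-9       # transmit/receive electronics energy, J/bit (assumed)
EPS_FS = 10e-12      # free-space amplifier parameter, J/bit/m^2 (assumed)
EPS_MP = 0.0013e-12  # two-ray (multipath) amplifier parameter, J/bit/m^4 (assumed)

def d0():
    """Threshold distance separating the free-space and two-ray regimes."""
    return (EPS_FS / EPS_MP) ** 0.5

def tx_energy(k_bits, d):
    """Energy to transmit k bits over distance d under the first-order model:
    electronics cost plus d^2 amplifier cost below d0, d^4 cost above it."""
    if d < d0():
        return k_bits * (E_ELEC + EPS_FS * d * d)
    return k_bits * (E_ELEC + EPS_MP * d ** 4)

def rx_energy(k_bits):
    """Energy to receive k bits (electronics cost only)."""
    return k_bits * E_ELEC
```

With these values the threshold distance d0 is roughly 87.7 m, so a 40 m radio range, as used in the simulations, stays in the free-space (d^2) regime.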
A. MSTBBN over LEACH-C

In this section we present simulation results where we ran the MSTBBN algorithm over LEACH-C [4] as the underlying routing protocol.

Energy dissipated with rounds: In the first experiment, we evaluated the average energy dissipated by nodes in the network with an increasing number of rounds. For this experiment, we simulated a deployment of 100 nodes. In each round the base station randomly chooses a destination node and sends a data packet to it. The number of rounds was varied from 600 to 900. Fig. 4 shows the comparison of MSTBBN with flooding, while fig. 5 shows the comparison of MSTBBN with directed flooding. For directed flooding, the virtual aperture was varied from Θ = π/2 to Θ = 3π/2.
The simulation results clearly demonstrate that as we increase the number of rounds, MSTBBN achieves significant energy savings compared to both flooding and directed flooding (for all values of the virtual aperture Θ). This is because MSTBBN limits receptions and retransmissions of packets by assigning KMSTBBN to all nodes and then performing efficient routing on the basis of this assigned key. This results in reduced redundant packet transmissions and thus increased energy savings. The average energy efficiency of MSTBBN compared to flooding was observed to be 15.52%.

Energy dissipated with node density: In the next experiment, we evaluated the average energy dissipated by nodes in the network by varying the number of nodes from 25 to 400. This experiment was done for 100 rounds, and in each round the BS chooses a random destination node for sending a data packet. Fig. 6 shows the comparison of MSTBBN with flooding, while fig. 7 shows the comparison of MSTBBN with directed flooding. The energy dissipation in MSTBBN is nearly constant, while for both flooding and directed flooding the energy dissipation keeps increasing with an increasing number of nodes. This is because in the case of MSTBBN only CHs forward data packets, while in the case of both flooding and directed flooding comparatively more nodes receive flooded packets, which leads to more flooding and thereby increases the overall energy dissipation.
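The key assignment and forwarding check that drive these savings (Phases 2 and 4 of Section IV) amount to a preorder numbering of the routing tree. A minimal sketch follows, with function names of our own choosing; here each node stores the largest key in its subtree, which for leaf children, as in fig. 2 and Table I, equals the largest child key the paper has each node record:

```python
def assign_keys(tree, root):
    """Preorder key assignment satisfying Rules 1-4 of MSTBBN: every node's
    subtree occupies the contiguous key interval [key[n], last[n]].
    `tree` maps a node to the ordered list of its children."""
    key, last = {}, {}
    counter = 0

    def visit(n):
        nonlocal counter
        key[n] = counter          # node's own key (Rule 1: parent < children)
        counter += 1
        for c in tree.get(n, []):
            visit(c)              # left siblings numbered before right ones
        last[n] = counter - 1     # largest key in n's subtree

    visit(root)
    return key, last

def should_forward(node_key, node_last, dest_key):
    """A node processes/forwards a packet only if the destination's key lies
    in its own subtree interval; otherwise it drops the packet (O(1) state)."""
    return node_key <= dest_key <= node_last
```

On the tree of fig. 2 this reproduces the paper's example: for destination Z with key 14, nodes B (interval [1, 5]) and C ([6, 11]) drop the packet, while D ([12, 16]) forwards it.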
Figure 7. Average energy dissipated by nodes in LEACH-C using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

Figure 8. Number of packets forwarded by nodes in LEACH-C using MSTBBN and Flooding with varying node density.

Figure 9. Number of packets forwarded by nodes in LEACH-C using MSTBBN and Directed Flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

The experiment was done for 100 rounds, where the BS chooses a random destination node at each round for sending a data packet. Fig. 8 and fig. 9 show the comparison of MSTBBN with flooding and directed flooding, respectively. These results testify to our interpretation of the nearly constant energy dissipation in LEACH-C.

Figure 11. Average energy dissipated by nodes in PEDAP using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) versus number of rounds.
As we increase the number of rounds, the simulation results clearly indicate that MSTBBN is more energy efficient than flooding. Except for Θ = π/2, MSTBBN is also always more efficient than directed flooding. However, when we compare these results with those for LEACH-C, i.e. with Fig. 4 and Fig. 5, we find that MSTBBN is less efficient for PEDAP. The reason is that PEDAP is based on a spanning tree, while LEACH-C is a single-level cluster-based protocol; the height of the BS-rooted tree is therefore higher for PEDAP, while for LEACH-C it is constant at 2. Thus fewer nodes receive and forward packets in the case of LEACH-C, which can be seen by comparing Fig. 8 and Fig. 14.

Figure 13. Average energy dissipated by nodes in PEDAP using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

Energy dissipated with node density: In the next experiment we evaluated the average energy dissipated by nodes in the network by varying the number of nodes from 25 to 400. This experiment was done for 100 rounds, and in each round the BS chooses a random destination node for sending a data packet. Fig. 12 shows the comparison of MSTBBN with flooding, while Fig. 13 shows its comparison with directed flooding. MSTBBN is thus an efficient protocol compared to flooding, while in the case of directed flooding, for some values of Θ, the energy dissipated is nearly the same as for MSTBBN.

Figure 12. Average energy dissipated by nodes in PEDAP using MSTBBN and Flooding with varying node density.

Figure 15. Number of packets forwarded by nodes in PEDAP using MSTBBN and Directed Flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.
VI. CONCLUSIONS AND FUTURE WORK

Minimizing energy consumption is a key requirement in the design of sensor network protocols and algorithms. Most of the attention, however, has been given to designing protocols for routing data from sensor nodes to the base station. In this paper we presented MSTBBN, a novel scheme for base station to node communication in O(h) hops, where h is the height of the tree constructed by the underlying routing protocol; h can vary from 1 (in LEACH-C) to as much as N (in PEGASIS), where N refers to the number of nodes in the whole network. Currently no protocol such as the proposed one exists for base station to node communication. Existing approaches are based upon multicasting protocols, which are not energy efficient.

Our protocol can work with any underlying centralized routing protocol, such as LEACH-C or BCDCP, without any noticeable modification or message overhead. The routing tree constructed by the underlying protocol is converted to a search tree using keys assigned in MSTBBN. Furthermore, our protocol does not incur any message or memory overhead compared to the existing solutions, making it scalable with increasing number of nodes as well as density. The result is an energy efficient scheme for base station to node communication, which we verified using ns-2 simulations.

Recent developments increasingly call for scenarios where the sensed data must be delivered to multiple base stations. This forms one of the areas for future research, where MSTBBN could be extended to serve scenarios with multiple base stations. Further, we assumed that all the sensor nodes in the network are stationary; MSTBBN could also be explored as a solution for slightly mobile networks, for instance by adapting the update interval of keys based upon node mobility.

REFERENCES

[1] Arampatzis, Th., Lygeros, J., Manesis, S., "A survey of applications of wireless sensors and wireless sensor networks," Proceedings of the 2005 IEEE International Symposium on Intelligent Control and the Mediterranean Conference on Control and Automation, pp. 719-724, 27-29 June 2005.
[2] Li, Yingshu, Thai, My T., Wu, Weili (Eds.), "Wireless Sensor Networks and Applications," Springer Series on Signals and Communication Technology, 2008, ISBN 978-0-387-49591-0.
[3] Carlos de Morais Cordeiro, Dharma Prakash Agarwal, "Ad Hoc & Sensor Networks: Theory and Applications," World Scientific Publishing Company, 2006, ISBN 981-256-681-3.
[4] Wendi B. Heinzelman, Anantha P. Chandrakasan, Hari Balakrishnan, "An application-specific protocol architecture for wireless microsensor networks," IEEE Transactions on Wireless Communications, vol. 1, no. 4, pp. 660-670, October 2002.
[5] Jin-Young Choi, Joon-Sic Cho, Seon-Ho Park, Tai-Myoung Chung, "A clustering method of enhanced tree establishment in wireless sensor networks," in Proc. 10th Int. Conf. on Advanced Communication Technology (ICACT), Feb. 2008, pp. 1103-1107.
[6] S. D. Muruganathan, D. C. F. Ma, R. I. Bhasin, and A. O. Fapojuwo, "A centralized energy-efficient routing protocol for wireless sensor networks," IEEE Communications Magazine, vol. 43, no. 3, pp. S8-S13, Mar. 2005.
[7] Huseyin Ozgur Tan, Ibrahim Korpeoglu, "Power efficient data gathering and aggregation in wireless sensor networks," SIGMOD Record, vol. 32, no. 4, pp. 66-71, December 2003.
[8] S. Lindsey and C. S. Raghavendra, "PEGASIS: Power-efficient gathering in sensor information systems," in Proceedings of the IEEE Aerospace Conference, 2002.
[9] S. Hedetniemi and A. Liestman, "A survey of gossiping and broadcasting in communication networks," Networks, vol. 18, no. 4, pp. 319-349, 1988.
[10] R. Farivar, M. Fazeli, and S. G. Miremadi, "Directed Flooding: A fault tolerant routing protocol for wireless sensor networks," 2005 Systems Communications (ICW'05, ICHSN'05, ICMCS'05, SENET'05), pp. 395-399, 2005.
[11] Hai Liu, Xiaohua Jia, Peng-Jun Wan, Xinxin Liu, Frances F. Yao, "A distributed and efficient flooding scheme using 1-hop information in mobile ad hoc networks," IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 5, pp. 658-671, May 2007.
[12] Trong Duc Le, Hyunseung Choo, "PIB: an efficient broadcasting scheme using predecessor information in multi-hop mobile ad-hoc networks," Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, January 31-February 01, 2008, Suwon, Korea.
[13] Kyungtae Woo, Chansu Yu, Dongman Lee, Hee Yong Youn, Ben Lee, "Non-blocking, localized routing algorithm for balanced energy consumption in mobile ad hoc networks," Proceedings of the Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'01), p. 117, August 15-18, 2001.
[14] David B. Johnson, David A. Maltz, and Josh Broch, "DSR: The dynamic source routing protocol for multi-hop wireless ad hoc networks," in Ad Hoc Networking, edited by Charles E. Perkins, Chapter 5, pp. 139-172, Addison-Wesley, 2001.
[15] Hyun-sook Kim, Ki-jun Han, "A power efficient routing protocol based on balanced tree in wireless sensor networks," Proceedings of the First International Conference on Distributed Frameworks for Multimedia Applications (DFMA'05), pp. 138-143, February 06-09, 2005.
[16] N. Bulusu, J. Heidemann, D. Estrin, "GPS-less low cost outdoor localization for very small devices," IEEE Personal Communications, vol. 7, 2000.
[17] Chalermek Intanagonwiwat, Ramesh Govindan and Deborah Estrin, "Directed diffusion: A scalable and robust communication paradigm for sensor networks," in Proceedings of the Sixth Annual International Conference on Mobile Computing and Networking (MobiCOM '00), August 2000, Boston, Massachusetts.
[18] Sartaj Sahni, "Data Structures, Algorithms, and Applications in C++," edited by June Waldman, Chapter 11, pp. 525-527, McGraw-Hill, 2000.
[19] ns-2. http://www.isi.edu/nsnam/
[20] K. Fall, "ns Notes and Documents," The VINT Project, UC Berkeley, LBL, USC/ISI, and Xerox PARC, Feb. 2000, available at http://www.isi.edu/nsnam/ns-documentation.html.
A Broadcast Authentication Protocol for Multi-Hop
Wireless Sensor Networks
R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain
Dept. of Computer Science & Automation
Indian Institute of Science, Bangalore, India.
{hansdah, neerajkumar, amulya}@csa.iisc.ernet.in
Abstract—A base station in a wireless sensor network (WSN) needs to frequently broadcast messages to sensor nodes, since broadcast communication is used in many applications such as network query, time synchronization, multi-hop routing, etc. One of the main problems of broadcast communication in WSNs is source authentication. Source authentication means that the receivers of broadcast data have to verify that the received data really originated from the claimed source and was not modified on the way. This problem is complicated by untrusted receivers and an unreliable communication environment in which the sender does not retransmit lost packets. In this paper, we propose a novel scheme for authenticating messages from each node of the WSN at the base station using a Diffie-Hellman key. Most existing schemes for broadcast authentication using a hash key chain are limited to single-hop WSNs only. Using the above technique for source node authentication, we extend the broadcast authentication scheme using a hash key chain to multi-hop wireless sensor networks. The number of packet transmissions is also reduced by using selective paths during the broadcast, and as a result the storage and communication overhead is reduced as well. The analysis and experiments show that our protocol is efficient and practical, and achieves better performance than the previous approaches.

I. INTRODUCTION

A wireless sensor network consists of a collection of low cost, low power, and multifunctional sensor nodes. Some designated nodes, called base stations, facilitate computation within the WSN as well as communication with the outside world. A WSN usually has a single base station. The base station controls the sensor nodes as well as collects the data reported by the sensor nodes. A WSN essentially can monitor events of practical importance either periodically, or whenever they occur, over any geographical area such as forests, buildings, etc. As a result, WSNs have the potential to provide practical solutions to many problems of these types. Some of the potential applications of WSNs are environmental and habitat monitoring, monitoring of civil structures like buildings and bridges, target tracking for military as well as civilian applications, monitoring the health conditions of patients, and so on.

Security is very critical to many of the applications of WSNs and the systems using them. There are many types of attacks that can be made on WSNs; a survey of the attacks can be found in [1], [2], [3]. One of the important operations of a WSN is that the base station needs to broadcast messages to the sensor nodes occasionally. But the messages need to be authenticated at each sensor node, ensuring that they have come from the base station only. This problem is known as the broadcast authentication problem. If a global shared secret key is used to authenticate these messages, malicious nodes can either modify the messages if they have to rebroadcast them, or masquerade as the base station even in a single-hop WSN, since they already have the key. A solution to this problem [4] is to use a hash key chain. The first key of the chain is usually distributed to each node of the WSN using some mechanism. The first message is encrypted using the first key of the chain. The key next to the first key of the chain is sent along with the first message, which can be used to authenticate the first message, and so on. The problem in a multi-hop environment, which is quite common in WSNs, is that the sensor nodes which receive the broadcast directly from the base station can modify the messages before rebroadcasting. In a single-hop WSN, this problem does not arise. Also, the solution given above ensures that malicious nodes cannot masquerade as the base station since they do not have the next key. A few solutions to the above problem have been proposed in the literature [5], [6], [7]. Of these solutions, some use digital signatures [5], [6], and others use a one-way hash key chain [7]. The solutions which use digital signatures are quite heavy on the meager resources of sensor nodes. In this paper, we propose a novel scheme to authenticate each node of the WSN at the base station using a Diffie-Hellman key. We also use this scheme to propose a broadcast authentication protocol using a hash key chain for multi-hop WSNs. An important feature of our protocol is that it is fault-tolerant to node failures.

The rest of the paper is organized as follows. In Section II, we give a brief survey of related works. Assumptions and definitions for the proposed protocol are described in Section III. In Section IV, we describe our proposed scheme. Security and performance analysis of the protocol is described in Section V. In Section VI, we discuss our simulation results. Conclusions are given in Section VII.

II. SURVEY OF RELATED WORKS

Broadcast authentication is an essential service in WSNs. Symmetric key based message authentication codes [8] cannot be used directly in resource-constrained wireless sensor networks, since a compromised receiver can easily impersonate the sender. On the other hand, asymmetric key based digital signature schemes [9], which are typically used for broadcast authentication in traditional networks, are too expensive to be used in WSNs, due to the high computation involved in signature verification.
As a result, several broadcast authentication protocols have been proposed for resource-constrained WSNs [4], [7], [10], [11], [12], [13].

Perrig et al. have proposed a broadcast authentication protocol named µTESLA [4]; it is the first protocol proposed for broadcast authentication in WSNs. This protocol is based on a one-way hash key chain. µTESLA uses the key chain to emulate public key cryptography with delayed key disclosure. A key is initially chosen, and the remaining keys are generated using a one-way hash function. The first key of this chain (the last key produced by the hash function) is used to encrypt the first message to be broadcast by the base station, and this key is distributed to each node of the WSN a priori. The sender divides the time period for broadcast into multiple intervals, and in each interval it uses one key, starting with the first key. At the end of each interval, it discloses the next key, which makes it possible to authenticate the messages that were sent encrypted with the previous key. However, the receiving node needs to verify that the next key had not yet been disclosed when it received the messages. After receiving a packet, if the receiver can ensure that the packet was sent before the next key was disclosed, the receiver buffers this packet and authenticates it later, after receiving the next key. The protocol has certain drawbacks. It requires loose time synchronization between sender and receiver. Individual authentication as well as instantaneous authentication is not available in µTESLA. More storage space is required at the receiver side to buffer the packets until the next key is received. Many WSN applications are real-time applications. Hence, to minimize the delay in authentication of real-time data, the maximum number of additional packets that are received before a packet is authenticated should be small. Nonetheless, there would be some delay before a broadcast packet can be authenticated, and therefore µTESLA is not suitable for real-time applications.

To increase the scalability of µTESLA, Liu and Ning have proposed a multilevel µTESLA [10]. The basic idea of this protocol is to predetermine and broadcast parameters such as the key chain commitments, instead of the unicast based message transmission used in µTESLA. Even though it improves the scalability of µTESLA, it still suffers from certain drawbacks like the requirement of time synchronization, more buffer storage, etc.

A broadcast authentication protocol called BiBa (Bins and Balls) [11] has been proposed by Perrig, and it uses a one-time digital signature scheme to authenticate the source. In BiBa, the signer precomputes some t random values, called SEALs (SElf Authenticating vaLues). For each SEAL si, the signer generates a public key fsi = Fsi(0), where Fsi() is a one-way function, and these public keys are transferred to the receiver to authenticate SEALs at the receiver end. For each message M, the signer computes GH(M) for all SEALs s1 to st, where GH(M) is a particular instance from a family of one-way functions whose range is 0 to n-1 (i.e., n possible output values). The signer generates a signature <si, sj>, where GH(M)(si) = GH(M)(sj) and si != sj, and sends the message M with the signature <si, sj> to the receiver. After receiving the message, the receiver authenticates it by verifying the signature using the previously obtained public keys. The advantage of BiBa is fast verification and a short signature, but BiBa takes longer signing time and uses a larger public key size to authenticate the signer.

To make an improvement over the public key size and signing time, Reyzin et al. have proposed a new one-time signature scheme called HORS (Hash to Obtain Random Subset) [12], which reduces the time needed to sign the message and verify the signature. It also reduces the key and signature sizes in comparison to the ones used in BiBa, making HORS the faster one-time signature scheme. The security of BiBa depends upon the random-oracle model, while the security of HORS relies on the assumption of the existence of one-way functions. HORS is computationally efficient, requiring a single hash evaluation to generate the signature and a few hash evaluations for verification, as compared to BiBa. Still, this protocol has a large public key size, which is not suitable in a WSN environment without additional modifications. Signing each packet would definitely provide secure broadcast authentication, but it still has considerable overhead for signing and verifying packets and also uses more bandwidth.

An efficient broadcast authentication scheme [13], proposed by Shang-Ming, is also based on a one-time digital signature scheme. Compared to HORS, this scheme requires less storage and communication overhead at the expense of higher computation cost. In this scheme, key generation is the same as that used in the HORS scheme. This scheme makes an improvement over the HORS scheme by reducing the large key size, but still the public key size is large and the computational overhead per message is also large.

Bekara et al. have proposed a hop-by-hop broadcast source authentication protocol for WSNs [7] to overcome DOS attacks; it limits the effect of an attack to the one-hop neighbours only. In this protocol, the authors use different key chains for different hops of the network, where the maximum hop count of the network can be deduced from the maximum propagation delay in the network. This protocol consists of three phases, i.e., an initialization phase, a data broadcast phase, and a data buffering/verification phase. In the initialization phase, the base station divides time into fixed time intervals, generates a separate hash key chain for each hop of the network, and stores the first key of each key chain and the duration of the intervals in each sensor node. In the data broadcast phase, the base station computes the MAC for the data it sends in the current time interval by using the current key of each key chain, and broadcasts the data together with the MACs. Later, it discloses the keys of the current time interval one after another. In the data buffering/verification phase, after receiving the broadcast data, each node in a particular hop buffers the data until the associated key corresponding to the hop number and time interval is disclosed. If the data packet is authentic, the node forwards it to the next hop. With an increase in the size of the network and the number of nodes, the protocol requires more hash key chains, which leads to a demand for more storage space. Hence, the protocol is not scalable and it has a large storage overhead. Since each node stores one key of each key chain, authentication of the broadcast messages using each of the keys introduces extra computation overhead at each node. Even if the protocol claims that nodes need to buffer data for a duration less than the key disclosure delay as in µTESLA, it still suffers from delay in authentication.
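The delayed-disclosure hash key chain that µTESLA [4] and the hop-by-hop scheme of [7] build on can be sketched as follows. This is a minimal illustration, not an implementation of either protocol: the chain length, the use of SHA-256 as the one-way function, and the function names are assumptions made here for the example.

```python
import hashlib

def make_chain(seed: bytes, n: int) -> list:
    """Build keys k_n, k_{n-1}, ..., k_0 by repeated hashing: k_{i-1} = h(k_i)."""
    chain = [seed]                                   # chain[0] is k_n, the initial key
    for _ in range(n):
        chain.append(hashlib.sha256(chain[-1]).digest())
    return chain                                     # chain[-1] is k_0, the first key

def accept_disclosed(k_disclosed: bytes, k_known: bytes) -> bool:
    """A receiver holding k_{i-1} accepts a disclosed k_i iff h(k_i) = k_{i-1}."""
    return hashlib.sha256(k_disclosed).digest() == k_known

chain = make_chain(b"base-station-seed", 10)
k0, k1 = chain[-1], chain[-2]    # k0 is preloaded in nodes; k1 is disclosed later
assert accept_disclosed(k1, k0)               # a genuine next key is accepted
assert not accept_disclosed(b"forged", k0)    # a forged key is rejected
```

The one-way property of the hash is what prevents a receiver that knows only k_0 from computing k_1 ahead of its disclosure.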
Since symmetric key based cryptography is not suitable for broadcast authentication, most of the proposed protocols have used asymmetric key based cryptography. Among these protocols, a few use the public key concept to achieve asymmetric key based cryptography and others use the hash key chain technique. The protocols which use the public key concept suffer from large public key size and large computational overhead, and the protocols which use the hash key chain technique achieve broadcast authentication for single-hop networks only, with the exception of the protocol proposed in [7]. In this paper, we propose a novel scheme to authenticate each node of the WSN at the base station using a Diffie-Hellman key, and also use this scheme to propose a broadcast authentication protocol for multi-hop wireless sensor networks using a multilevel hash key chain.

III. ASSUMPTIONS AND OBJECTIVES

In this section, we first state the assumptions about WSNs that we make for the proposed broadcast authentication protocol, and then give a brief description of the hash key chain which is used in our protocol. Finally, we briefly describe the objectives that we aim to achieve with our proposed scheme.

A. Assumptions

We make the following assumptions for our proposed protocol to ensure authentication of each node and also to broadcast the messages securely over the whole WSN.

- The WSN has a single base station, and the sensor nodes are static.
- Each sensor node has a unique ID.
- The WSN is connected, i.e., there always exists a path between any pair of sensor nodes.
- The base station is trusted, but the broadcast medium is not trusted, i.e., opponents can eavesdrop on the messages being transmitted.
- The base station is secure, i.e., no one can tamper with it or extract information from it, and it is sufficiently powerful to perform cryptographic computations.

B. Hash Key Chain

A hash key chain of length n+1 consists of a sequence of keys kn, kn-1, kn-2, ..., k1, k0, where kn is the initial key, h is an arbitrary hash function, and each key is derived from its predecessor in the sequence by one application of h, i.e., ki-1 = h(ki). An important property of a hash key chain is that ki-1 can be derived from ki (1 <= i <= n) but not vice versa. The key k0 is referred to as the first key of the chain, since it is used first in any application. Usually, the sender has the key chain, and the receivers have the first key k0. The signature on all messages signed with key ki by the sender can be verified at the receivers using key ki+1 after it becomes available at the receivers.

C. Objectives of the Proposed Protocol

We aim to achieve the following with the proposed protocol.
(i) Messages sent by each node of the WSN to the base station are fully authenticated at the base station.
(ii) Compromise of a single node should not affect other nodes of the WSN, i.e., other nodes are not compromised.
(iii) Intermediate nodes must not be able to modify the broadcast messages.

IV. THE PROPOSED PROTOCOL

In this section, we present our proposed scheme for broadcast authentication. Our proposed protocol uses a multilevel hash key chain and a Diffie-Hellman key for the authentication of messages sent by sensor nodes to the base station, and also of messages broadcast by the base station. The broadcast authentication protocol proposed in this paper is based on a novel scheme using a Diffie-Hellman key to authenticate each sensor node at the base station. The scheme is described in the following subsection.

A. Authentication of Sensor Node at the Base Station

The scheme to authenticate sensor nodes at the base station uses a Diffie-Hellman key between a pair of nodes. A Diffie-Hellman key is generated from the multiplicative group Zp* = {1, 2, ..., p-1}, where p is a large prime number. Let g be a generator of Zp*. In this scheme, the base station is assigned a unique private key alpha, 1 < alpha < p-1, and each sensor node i is assigned a unique private key beta_i, 1 < beta_i < p-1. Now the following is preloaded in each sensor node i.

1. A key ki = f((g^{beta_i})^{alpha} mod p), where f is any suitable function. We refer to the key ki as the Diffie-Hellman key.
2. g^{beta_i} mod p

It is assumed that the beta_i's have been chosen in such a way that beta_i != beta_j implies ki != kj. This property can be ensured at the time of generating the beta_i's, which would mean that each node has a unique key. The base station is preloaded just with alpha and p.

A message M sent by a sensor node to the base station has the generic format shown in Figure 1.

Fig. 1. Authentication request packet from node to a base station

It is important that g^{beta_i} mod p is not encrypted. The message M may or may not be encrypted according to the requirement. Upon receipt of the message M at the base station, the base station computes the key ki, and verifies the authenticity and the integrity of the message.
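The key setup just described can be sketched as follows. The toy modulus, the private keys, and the choice of SHA-256 for the "suitable function" f are assumptions made only for illustration; a real deployment would use a large prime p.

```python
import hashlib

# Toy parameters for illustration: 467 is prime and 2 generates Z_467^*.
p, g = 467, 2
alpha = 127     # base station's private key, 1 < alpha < p-1
beta_i = 233    # private key chosen for sensor node i, 1 < beta_i < p-1

def f(x: int) -> bytes:
    """The 'suitable function' f of the scheme; SHA-256 is an assumption here."""
    return hashlib.sha256(str(x).encode()).digest()

# Preloaded in node i at deployment time:
g_beta = pow(g, beta_i, p)        # g^{beta_i} mod p, later sent unencrypted
k_i = f(pow(g_beta, alpha, p))    # k_i = f((g^{beta_i})^alpha mod p)

# The two exponentiations commute, which is what makes k_i a Diffie-Hellman key:
assert pow(g_beta, alpha, p) == pow(pow(g, alpha, p), beta_i, p)

# The base station, holding only alpha and p, recomputes k_i from the
# clear-text g^{beta_i} mod p carried in the message:
k_i_at_bs = f(pow(g_beta, alpha, p))
assert k_i_at_bs == k_i
```

This is why the base station can authenticate any node while storing only the two values alpha and p.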
The important features of the scheme are as follows.

1. The base station stores only two values, viz. alpha and p, to authenticate any message from any sensor node.
2. At the minimum, each sensor node only needs to compute the MAC of the message that it sends to the base station.
3. The private key beta_i of each node, and the prime p, are not stored in the sensor node i.

B. Informal Description of the Broadcast Authentication Protocol

Since sensor nodes are resource and energy constrained, it is important that any broadcast operation consumes as little of the resources and energy of each node as possible. Keeping this in mind, we initially construct a broadcast tree using which messages are broadcast. So that the intermediate nodes which rebroadcast messages cannot modify them, we divide the sensor nodes into groups, and each group is initialized with a different hash key chain. Using this mechanism, we ensure that a message cannot be modified when it moves from a node in one group to a node in another group. We use the Diffie-Hellman key of each node to distribute the first key of the hash key chain of a group to each member node of the group. If the same tree is used repeatedly for broadcast, the energy of the internal nodes of the tree would dry up quite fast. Therefore, we periodically restructure the broadcast tree by taking into account the remaining energy of each node. After reconstruction, the nodes with higher remaining energy become the internal nodes. As a result, our broadcast authentication protocol consists of the following four phases, which are elaborated upon in the following subsections.

1) Broadcast tree establishment and group formation.
2) Authentication and key distribution.
3) Message broadcast phase.
4) Periodic restructuring of the tree.

C. Broadcast Tree Establishment and Group Formation

In this section, we present an algorithm for the construction of the broadcast tree, using an approach similar to the one given in [14], and for dividing the sensor nodes into groups. The broadcast tree essentially has the following structure. The base station is at level 0, which is the highest level. All the sensor nodes which can be directly reached from the base station are at level 1, the next lower level. When the sensor nodes at level 1 broadcast a message, the new sensor nodes which receive these messages are at level 2, and so on. The algorithm for the construction of the broadcast tree designates some of the nodes with higher remaining energy as internal nodes of the tree at each level 1 <= i < n-1, where n is the total number of levels in the tree. The value of n depends on the extent of the geographical area into which the sensor nodes have been deployed and on the power used to broadcast messages, which essentially determines the communication range of the broadcast. When the internal nodes of the broadcast tree at level i broadcast a message, all the sensor nodes at level i+1 receive the message. The procedure CBT given below describes the algorithm for the construction of the broadcast tree at each node i.

Procedure CBT;
begin
    Initialize its cost to infinity;
    SET_FLAG = false; ACK_NODE = -1;
    Timer_Flag = RESET;
    while (node i receives an ADV message from node j) do
        if (Timer_Flag = RESET) then
            Wait for more advertisement messages for time duration t1;
            Timer_Flag = SET;
        end
        if (cost_i > cost_j + 1/RE_i) then
            cost_i = cost_j + 1/RE_i;
            lev_i = lev_j + 1;
            SET_FLAG = true; ACK_NODE = src;
        end
        if ((Timer_Flag = SET) and (time duration t1 has expired)) then
            break;
        end
    end
    if (SET_FLAG = true) then
        Broadcast a new advertisement message with cost cost_i and level lev_i;
        Send an acknowledgment to node ACK_NODE;
    end
    Wait for a possible ACK message;
    if (an ACK message is received) then
        Node i is an internal node;
    else
        Node i is a leaf node;
    end
end
# cost_j = cost of node j from which the advertisement message has been received.
# RE_i = remaining energy of node i.
# cost_i = sum over j in path of 1/RE_j

Algorithm 1: Broadcast tree construction phase of node i

Each node in the WSN stores the parent node ID and its level number, along with the associated least cost of the path to the base station through the parent node. At the very beginning, each node except the base station sets its cost field, parent node ID, and level number to infinity, -1, and -1 respectively. The base station sets both its cost field and its level number to 0, and sets its parent node ID to its own ID. Initially, the base station broadcasts an advertisement message ADV with its node ID, level number, and cost, as shown in Figure 2. After receiving the first ADV message, a sensor node i waits for a fixed duration of time to receive additional ADV messages. It then chooses, among the nodes from which it received an ADV message, a node j that would result in a path with the least cost from the base station to the node i itself.
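The cost update at the heart of procedure CBT can be sketched as follows. The tuple layout of an ADV message and the function name are assumptions made for illustration; timers and radio I/O are omitted.

```python
def select_parent(advs, re_i):
    """Parent selection of procedure CBT at node i.
    advs: list of (src, cost_j, lev_j) tuples heard during the wait window.
    re_i: remaining energy RE_i of node i.
    Returns (parent, cost_i, lev_i); parent is -1 if no ADV was heard."""
    cost_i, lev_i, parent = float("inf"), -1, -1
    for src, cost_j, lev_j in advs:
        # cost_i accumulates 1/RE over the path, so low-energy nodes are avoided
        if cost_i > cost_j + 1.0 / re_i:
            cost_i = cost_j + 1.0 / re_i
            lev_i = lev_j + 1
            parent = src              # ACK_NODE in the paper's notation
    return parent, cost_i, lev_i

# A node with remaining energy 2.0 J hears the base station (cost 0, level 0)
# and a level-1 node whose accumulated path cost is 0.8; it picks the BS:
parent, cost, lev = select_parent([(0, 0.0, 0), (3, 0.8, 1)], 2.0)
assert (parent, cost, lev) == (0, 0.5, 1)
```

The chosen node is then acknowledged when the node broadcasts its own ADV, which is what marks the parent as an internal node.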
Fig. 2. Advertisement (ADV) message format

When the node i broadcasts its ADV message, it piggybacks an ACK message for node j. When the node j receives the ACK, it comes to know that it is an internal node of the tree. If a node does not receive an ACK message, it concludes that it is a leaf node. The difference between leaf nodes and internal nodes of a broadcast tree is that a leaf node never rebroadcasts a broadcast message. The base station is initially given an estimate of how long the construction of the broadcast tree would take. After this duration elapses, the base station can start using the tree. To maintain the integrity of the ADV messages, we can use a shared secret key for the whole network, which will be preloaded before the deployment of the whole network. This key will not be required afterwards.

Fig. 3. Broadcast tree

To prevent the intermediate nodes from modifying a broadcast message, one can assign a different hash key chain to each level of the tree. But this would run into problems, as the number of levels is not known a priori, and also the level of a node may change after the reconstruction of the tree. Hence, we divide the sensor nodes into groups based on their levels in the broadcast tree constructed initially, as follows.

Let lev_i be the level of node i. Then group_i, the group number of node i, is assigned as:

    group_i = 0,                                                                  if lev_i = 1
    group_i = ((((lev_i - 1) % (MAX_GROUP - 1)) + 2) % (MAX_GROUP - 1)) + 1,      if lev_i > 1    (1)

where MAX_GROUP is the maximum number of groups used for the whole network. The actual number of groups (ANOG) may be less than or equal to MAX_GROUP, and the groups are numbered from 0 to (ANOG-1). Equation (1) is illustrated by an example in Table I, with MAX_GROUP equal to 4.

D. Authentication and Key Distribution

After the level number and the group number are assigned to a sensor node i during the construction of the broadcast tree, it sends an authentication request (ARQ) message to the base station through its parent. On receipt of the authentication request message, the base station replies to the sensor node with an authentication request reply (ARR) message which, apart from other information, contains the first key of the hash key chain of the group to which the node i belongs. We first describe the format of the ARQ and ARR messages, followed by the format of the data packet (DP) messages which are broadcast by the base station. The format of the ARQ message is as shown in Figure 4. Each node sends its ARQ message to the base station through the path created during the broadcast tree construction and group formation phase. To reduce collisions, the ARQ message from each node is sent after a random delay. After an ARQ message is received at the base station, it is authenticated as shown in Figure 5.

Fig. 4. Authentication request (ARQ) message format

Upon receipt of an ARQ message from a node i at the base station, it sends an authentication request reply (ARR) message to the node with the first key of the hash key chain of the group to which the node i belongs. The format of the ARR message is as shown in Figure 6. The ARR message for node i contains the first key of the hash key chain of the first group and the first key of the hash key chain of the group to which node i belongs, encrypted with the key ki. The use of these keys in
levi 1 2 3 4 5 6 7 8 ... i belongs encrypted with key ki . The use of these keys in
groupi 0 1 2 3 1 2 3 1 ... the message broadcast phase is described in the next section.
TABLE I The base station generates a separate hash key chain for each
group of node. We denoted the hash key chain of group i
h h h h h
as Sin → Si(n−1) → Si(n−2) → . . . → Si1 → Si0 . The
The construction of broadcast tree is illustrated in Figure 3. maximum number of groups MAX GROUP is independent of
From Figure 3, it is clear that the broadcast tree construction the number of sensor nodes or the number of levels, and it is
algorithm always chooses a path from base station to a fixed apriori. The group of a node is as given by equation (1).
node, whose intermediate nodes have higher remaining energy
compared to the nodes in other possible paths. A packet from E. Message Broadcast Phase
a node to the base station is sent along the tree towards the After each node has received the ARR message, the
parent. base station can broadcast the data packet. The format of
148
Fig. 7. Data Packet(DP) message format
149
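The group-assignment rule of equation (1) is easy to sanity-check in code. A minimal sketch, with hypothetical function names and MAX_GROUP = 4 as in the Table I example:

```python
MAX_GROUP = 4  # maximum number of groups for the whole network

def group_of(level: int, max_group: int = MAX_GROUP) -> int:
    """Group number for a node at the given tree level, per equation (1)."""
    if level == 1:
        return 0
    return ((((level - 1) % (max_group - 1)) + 2) % (max_group - 1)) + 1

# Reproduces Table I: levels 1..8 map to groups 0 1 2 3 1 2 3 1
print([group_of(l) for l in range(1, 9)])
```

For levels greater than 1 the rule cycles through groups 1 .. MAX_GROUP − 1, so group numbers repeat every MAX_GROUP − 1 levels regardless of the depth of the tree.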
TABLE II. Overhead at the base station

                                    Proposed Protocol                               H²BSAP
  Computation overhead per packet   MAX_GROUP × MACop + (MAX_GROUP + 1) × ENCop     l × MACop
  Transmission overhead per packet  MAX_GROUP × |MAC| + [MAX_GROUP × |key|]         l × |MAC| + [l × |key|]
  Storage overhead                  MAX_GROUP × (n + 1) × |key|                     l × (n + 1) × |key|

TABLE III. Overhead at other nodes

                                    Proposed Protocol                               H²BSAP
  Computation overhead per packet   MACop + 2 × DECop                               MACop + ⌊l × Hashop⌋
  Transmission overhead per packet  MAX_GROUP × |MAC| + [MAX_GROUP × |key|]         (l − r) × |MAC| + [l × |key|]
  Storage overhead                  2 × |key|                                       l × |key|

TABLE IV. Notations

  MACop: MAC operation                Hashop: Hash operation
  ENCop: Encryption operation         DECop: Decryption operation
  |key|: Key length                   |MAC|: MAC length
  n: Size of key chain                l: Maximum hop
  MAX_GROUP: Maximum no. of groups    r: Hop distance from BS

by the base station as the group-one node. But, to compute MAC_S1,0(x), he has to know the key S_1,0, which is not possible. If the attacker modifies the broadcast packet using its own key and forwards the modified packet to the next level, then the next-level nodes would identify the incorrect packet through their MAC if their group is different from that of the sender, and reject the packet.

It is possible that two different nodes in consecutive levels belong to the same group after restructuring of the broadcast tree. In this case, some nodes in the next lower level may not be able to detect the modification. But other nodes will be able to detect it, and the event can be reported to all the nodes in the vicinity. Nonetheless, this situation would be very rare, as only the internal nodes broadcast, and it is very unlikely that a node would become an internal node and its level would also get increased or decreased. Our protocol is also fault-tolerant to node failures, as periodic restructuring of the broadcast tree would eliminate the failed nodes from the internal nodes of the tree. Besides, even if an internal node has failed, a broadcast might still go through other neighbouring internal nodes.

Tables II and III show the comparisons between our proposed protocol and the H²BSAP protocol [7] with respect to computation, transmission, and storage overhead at the base station and the other nodes, respectively. Table IV describes the notations used in Tables II and III. As compared to the H²BSAP protocol, our proposed protocol uses a smaller number of key chains, which depends upon MAX_GROUP, whereas in H²BSAP the maximum number of key chains depends on the depth of the network. Since MAX_GROUP is fixed, our proposed protocol can support a network of any depth, but H²BSAP can support a maximum of up to 15 hops, as given in [7]. As we have already discussed, each node in the proposed protocol keeps only two keys, i.e., the key of the key chain of the first group and the key of the key chain of the group to which it belongs. Hence, the proposed protocol always has less overhead as compared to H²BSAP, and authentication in the proposed protocol is always immediate.

VI. SIMULATION STUDIES

The sensor nodes were deployed in a fixed area of 100 × 100 meter². The network parameters, such as transmission range, transmission rate, sensitivity, transmission power etc., for this simulation study are similar to the parameters specified in the CC2420 [16] data sheet and the TelosB [17] data sheet. We have taken the initial energy of each node to be 29160 joules for 2 AA batteries, as given in the Castalia simulator. Energy consumption for the different radio modes used in this simulator is given in Table V. For this simulation, we assume that the clocks of all the nodes are synchronized. The simulation was carried out for both realistic as well as ideal channels. We have used the TelosB node hardware platform specification for our simulation and have also used the "tunable protocol" provided by Castalia as the MAC layer protocol. The broadcast packet is generated randomly with uniform distribution in every 2-second interval at the base station.

Figure 8 shows the total number of transmissions to broadcast a packet for different sizes of network. In this figure, we have compared our protocol with the previously proposed scheme with respect to the number of transmissions made. Our approach gives better performance as compared to the previously proposed scheme, because our approach generates a broadcast tree where only the internal nodes can forward the packet over the network. This reduces the number of transmissions required to broadcast a packet over the network.

Figure 9 shows the number of authenticated nodes and the average number of nodes that received the broadcast packet for different sizes of WSNs. From this figure, we can say that with an increase in the density of the network, the percentage of authenticated nodes decreases. This happens only due to the collisions of the authentication request packets. It can be reduced by increasing the random delay before the authentication request packet is transferred.

Figure 10 shows the minimum, average, and maximum delivery time of a broadcast packet in the network for different node densities of the network.
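The per-group hash key chains of Section D, whose anchor key S_i,0 is what prevents an attacker from forging a valid MAC, can be sketched as follows. This is an illustrative sketch only: SHA-256 and the chain length are assumptions, as the paper does not fix the hash function.

```python
# Sketch of one group's hash key chain S_n -> ... -> S_0: the base station
# generates the chain by repeated hashing and discloses keys in reverse
# order; a node holding the anchor S_0 can authenticate a claimed S_j by
# hashing it j times.
import hashlib

def h(key: bytes) -> bytes:
    return hashlib.sha256(key).digest()

def make_chain(seed: bytes, n: int):
    """Return [S_n, ..., S_1, S_0], where S_{k-1} = h(S_k)."""
    chain = [seed]
    for _ in range(n):
        chain.append(h(chain[-1]))
    return chain  # chain[0] = S_n, chain[-1] = S_0

chain = make_chain(b"group-1-secret", n=8)
S0 = chain[-1]  # the anchor key distributed in the ARR message

def verify(key: bytes, j: int, commitment: bytes) -> bool:
    """Check that hashing a claimed S_j exactly j times yields S_0."""
    for _ in range(j):
        key = h(key)
    return key == commitment
```

Because h is one-way, knowing a disclosed key S_j reveals nothing about the still-secret keys S_{j+1}, ..., S_n, which is why an attacker who has not seen S_1,0 cannot produce MAC_S1,0(x).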
TABLE V. Radio Characteristics

  Radio mode   Energy Consumption (mW)
  Transmit     57.42
  Receive      62
  Listen       62
  Sleep        1.4

Fig. 8. Number of packets transmitted, with and without flooding, for 50-225 nodes

Fig. 9. Number of authenticated nodes, and average number of nodes that received the broadcast packet

Fig. 10. Minimum, average, and maximum delivery time of a broadcast packet
ADCOM 2009
GRID SCHEDULING
Session Papers:
1. Tapio Niemi, Jukka Kommeri and Ari-Pekka Hameri “Energy-efficient Scheduling of Grid
Computing Clusters”
3. Amit Agarwal and Padam Kumar “A Two-phase Bi-criteria Workflow Scheduling Algorithm in
Grid Environments”
Energy-efficient Scheduling of Grid Computing
Clusters
Tapio Niemi Jukka Kommeri
Helsinki Institute of Physics, Technology Programme Helsinki Institute of Physics, Technology Programme
CERN, CH-1211 Geneva 23, Switzerland CERN, CH-1211 Geneva 23, Switzerland
tapio.niemi@cern.ch kommeri@cern.ch
Ari-Pekka Hameri
HEC, University of Lausanne, CH-1015 Lausanne, Switzerland
ari-pekka.hameri@unil.ch
Abstract—Energy-efficiency is an increasingly important component of computation costs in scientific computing. We have studied different scheduling settings with different hardware for high-throughput computing, trying to minimise the electricity usage of computing jobs. Instead of the common practice of one-task-per-CPU-core scheduling in grid clusters, we have tested variations of different scheduling methods based on the idea of fully loading computing nodes. Our tests showed that running multiple tasks simultaneously can decrease energy usage per computing task by over 40% and improve the throughput of the computing node by up to 100% when running a high-energy physics (HEP) analysis application. The trade-off is that the processing times of individual tasks are longer, but in cases such as HEP computing, in which the tasks are not time critical, only the total throughput is important.

I. INTRODUCTION

Energy consumption has become one of the main costs of computing, and several methods to improve the situation have been suggested. The focus of research has been on hardware and infrastructure aspects. Most of the computing centers and computing clusters of research institutes focus on high-performance computing, trying to optimise the processing time of individual computing jobs. Jobs can have strict deadlines or require massive parallelism. Instead, in high-throughput computing the aim is slightly different, since individual jobs are not time critical and the aim is to optimise the total throughput over a longer period of time.

In computing-intensive sciences, such as high-energy physics (HEP), energy-efficient solutions are important. For example, the Worldwide LHC Computing Grid (WLCG), to be used to analyse the data that the Large Hadron Collider of CERN will produce, includes tens of thousands of CPU cores. At this scale, even a small system optimisation can offer noticeable energy and cost savings. Since scientific computing, and especially high-energy physics computing, has special characteristics, energy-optimisation methods can also be tailored for it. The main characteristics in this sense are: large sets of similar kinds of jobs, data-intensive computing, no time criticality, no preceding conditions, and no intercommunication between jobs or their tasks, i.e. high parallelism. In spite of this special nature, improving energy efficiency in cluster and grid computing for HEP has mostly focused on similar infrastructure issues as in general HPC computing, such as cooling and purchasing energy-efficient hardware. As far as we know, there are not many studies focusing on optimising the system configuration and scheduling settings for grid computing.

In this paper, we focus on a typical grid computing problem: how to process a large set of jobs efficiently. We try to optimise the energy efficiency and the total processing time of the set of jobs by choosing an optimal scheduling policy. In this sense our focus is closer to high-throughput computing than high-performance computing. Basically the problem is similar to production management in any manufacturing process. This kind of optimisation problem can lead to a trade-off situation: improving energy-efficiency can weaken throughput. However, our tests indicated that these two aims are not necessarily contradictory, meaning that optimising system throughput also improves its energy efficiency.

Our method is based on the conclusion that computers should run at full power or be turned off, since the fixed power consumption is around 50% of the full power of the server. Since computers can run multiple tasks simultaneously in a CPU core using time-sharing techniques, this naturally leads to a load-based scheduling policy. Our previous tests [1] indicated that the load should not mean only the processor load but all components of the computer, including memory usage, processor load, and I/O traffic. In the current study we tested different computing hardware: single-core machines, low-energy mini-PCs, and modern multicore systems commonly used in computer centers. Our test software included applications utilising different resources of the computer and a HEP analysis application.

The basic terminology used in this paper is:
- A task is the smallest entity of processing work. The task starts, retrieves/reads its possible input file, processes the data, and possibly writes its output file.
- A job is a collection of tasks. In the general situation tasks can have preceding relations, but in our case the tasks are independent.
- A computing node, i.e. node, is a part of the computing cluster. It has one or more CPU cores, a fixed amount of memory and disk space, and a network connection with some fixed capacity. The node schedules its jobs independently to its CPU cores.
- Energy efficiency means how many similar jobs can be processed by using the same amount of electricity.
- Computing efficiency, i.e. the system throughput, means how many similar jobs can be processed in a time unit.

The paper has been organised as follows. In the Background section we explain the common concepts of scheduling and review related literature. After that, the methodology used is described in Section 4. Tests are explained in Section 5 and results in Section 6. Finally, conclusions are given in Section 7.

II. BACKGROUND

A. Scheduling

Scheduling means in which order and to which computing nodes computing tasks should be allocated. How individual computers schedule their processes is not included in our topic. Scheduling problems can be classified according to the following properties:
- on-line / offline
- knowledge on jobs
- knowledge on computing resources

There is a lot of research on scheduling multiprocessor systems. More formally (e.g. following [2] or [3]), the scheduling problem can be defined as follows: We have m machines Mj (j = 1, ..., m) (i.e. computing nodes in our case) and n jobs Ji (i = 1, ..., n) to be processed. A schedule S is an allocation of time intervals from machines for each job. The challenge is to find an optimal schedule for the jobs when there exist certain constraints. A schedule is called optimal if it minimises a given optimality criterion. The criterion can be, for example, time, cost, or usage of some resource. Table I illustrates a schedule.

  M1 | J1
  M2 | J2 J2 J1
  M3 | J3 J3 J3
       Time →

  TABLE I. A schedule

The optimality criteria can be defined in several ways. If the finishing time of the job Ji is denoted by Ci, the cost is marked by fi(Ci). Now, the usual cost functions are called bottleneck objectives and sum objectives. The bottleneck is the maximum value of the cost functions of all jobs, while the sum is the summarised value. The cost function can be defined in several ways. The most common ones are make-span, total flow time, and weighted flow time. When designing a schedule, the typical objectives are the completion time of the last job or the total completion time, i.e. the sum of all completion times.

Often there are several objectives, such as processing time and energy efficiency in our case. Then the overall objective is a (weighted) sum of the sub-objectives. This often leads to a Pareto-optimal schedule.

B. Cluster Schedulers

There are various batch scheduling systems – also called job schedulers or distributed resource management systems – available, such as Torque¹, OpenPBS², LSF³, Condor⁴, and Sun Grid Engine⁵. These systems have different features, but the basic functionality is very much similar.

In our tests we used the Sun Grid Engine (SGE) [4] of Sun Microsystems, which is also commonly used in grid computing clusters. It has various features to control scheduling. The scheduling is done based on the load of the computing nodes and the resource requirements of the jobs. SGE supports checkpointing and migration of checkpointed jobs among computing nodes. In addition to batch jobs, interactive and parallel jobs are also supported. SGE accounts for the resources, such as CPU, memory, and I/O, that a job has used. SGE contains an accounting and reporting console ("ARCo") that stores accounting data into an SQL database for later analysis.

In SGE, scheduling is done at fixed intervals (the default setting) or triggered by some event such as a new job submission. The scheduler finds the best nodes for pending jobs based on, for example, the resource requirements of the job, the load of the nodes, and the relative performance of the nodes. By default the scheduler dispatches the jobs to the queues, i.e. nodes, in the order in which they have arrived. If several queues are identical, the selection is random.

It is also possible to change the scheduling algorithm, but only one algorithm is shipped with the default distribution. However, the scheduling can be controlled in four ways: 1) dynamic resource management, 2) queue sorting, 3) job sorting, and 4) resource reservation. Here we focus only on queue and job sorting. In queue sorting, the queue instances of computing nodes are ranked in the order in which the scheduler should use them. The ranking possibilities for queues are, for example: system load, scaled system load, user-defined system load, or fixed order. Job sorting can be done, for example, in the following ways: ticket-based job priority, urgency- or POSIX-based priorities, or user- or group-based quotas.

III. RELATED WORK

Venkatachalam and Franz [5] give a detailed overview of techniques that can be used to reduce the energy consumption of computer systems. There are several studies on different parts of the topic, such as optimising processors by dynamic voltage scaling (e.g. [6] and [7]); optimising disk systems (e.g. [8],

1 www.clusterresources.com/pages/products/torque-resource-manager.php
2 http://www.openpbs.org
3 www.platform.com/Products/Workload-Management
4 www.cs.wisc.edu/condor
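The two families of cost functions named in Section II-A can be made concrete with a small sketch. The greedy earliest-free-machine rule below is an illustration only, not the paper's algorithm; job times and machine count are arbitrary.

```python
# Sketch: greedy list scheduling of jobs on m machines, evaluating both a
# bottleneck objective (makespan) and a sum objective (total completion time).
import heapq

def schedule(job_times, m):
    """Assign each job to the machine that frees up first;
    return the completion time C_i of each job."""
    machines = [(0.0, j) for j in range(m)]  # (time when free, machine id)
    heapq.heapify(machines)
    completion = []
    for t in job_times:
        free_at, mid = heapq.heappop(machines)
        finish = free_at + t
        completion.append(finish)
        heapq.heappush(machines, (finish, mid))
    return completion

C = schedule([3, 1, 4, 1, 5], m=2)
makespan = max(C)    # bottleneck objective: completion time of the last job
total_flow = sum(C)  # sum objective: total completion time
print(makespan, total_flow)  # 9 22
```

With several objectives, as in the paper's time-plus-energy case, one would instead minimise a weighted sum of such values.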
[9], and [10]); network optimisation (e.g. [11]); and compilers (e.g. [12]). There are also several studies on pure energy issues. For example, Lefurgy et al. [13] suggest a method to control the peak power consumption of servers. The method is based on power measurement information from each computing server. Controlling peak power makes it possible to use smaller and more cost-effective power supplies.

Scheduling is a widely studied topic, but there is little work on scheduling as an energy saving method. Instead, some works suggest clearly opposite approaches: for example, Koole and Righter [14] suggest a scheduling model in which tasks are replicated to several computers. However, the authors do not estimate how much more resources are needed when the same tasks (or at least parts of them) are computed several times. Fu et al. [15] present a scheduling model that is able to restart batch jobs. They give an efficient algorithm to solve the problem, but they do not touch on the resource usage.

There also exist several studies relevant to our topic. Kurowski et al. [16] study two-level hierarchical grid scheduling. Their approach takes into account all stakeholders of grid computing systems. The approach does not require the time characteristics of jobs to be known, and in it a set of jobs at the grid level is scheduled simultaneously to the local computing resources.

Edmonds [17] studies non-clairvoyant scheduling in multiprocessor environments. In his model, the jobs can have arbitrary arrival times and execution characteristics can change.

Wang et al. [18] have studied optimal scheduling methods in the case of identical jobs and different computers. They aim to maximise the throughput and minimise the total load. They give an on-line algorithm to solve the problem.

Shivam et al. [19] present a learning scheduling model, while Srinivasa Prasanna and Musicus [20] give a theoretical scheduling model in which the number of processors allocated to a task can be a continuous variable, and it is possible to allocate all processors to one task if needed.

Medernach [21] has studied the workload of a grid computing cluster to be able to compare different scheduling methods. The idea of the work was to find ways in which the users of the cluster can be grouped to characterise their usage. The scheduling is based on the one-job-per-CPU-core idea.

Etsion and Tsafrir [22] compared commercial workload management systems, focusing on their scheduling systems and default settings. According to the authors, the default settings are often used by the administrators, or they are just slightly modified.

Aziz and El-Rewini [23] have studied online scheduling algorithms based on evolutionary algorithms in the grid context. Ges et al. [24] have studied scheduling of irregular I/O-intensive parallel jobs. They note that CPU load alone is not enough, but all other system resources (memory, network, storage) must be taken into account in scheduling decisions. Santos-Neto et al. [25] have studied scheduling in the case of data-intensive data mining applications.

IV. METHODOLOGY

A. Problem Description

We assume having a large set of tasks – compared to the number of CPU cores available – organised as a job. Further, we assume that jobs do not have deadlines, all of them arrive at time zero, there are no preceding relations between them or among the tasks in them, and the number of tasks is much larger than the number of computing nodes. These assumptions are usually true in HEP computing, and they make the needed scheduling algorithm easier. Figure 1 illustrates the situation.

Our scheduling problem can be divided into two independent steps:
1. Finding the optimal load combination for the computing node. Optimal means that a job can be run using the smallest possible amount of energy in the minimal time.
2. Scheduling jobs to the computing nodes in such a way that all computing nodes are as close as possible to the optimum state (i.e. Step 1).

In an important special case in which all tasks inside a job are identical, the problem simplifies into the form: how many tasks to run simultaneously in a computing node. Then the problem is how to measure the load of the node and define the optimum load level. Generally, it is important to notice that we did not try to minimise the processing time of an individual task but of a large job containing several tasks. We do not include the local process scheduling on the node in our study.

Fig. 1. Scheduling system [diagram: a job enters the cluster-level queue served by the cluster scheduler, which dispatches to node-level queues and node schedulers, each managing CPU cores, memory, network, and disk]

Briefly, our hypothesis is:
Running several tasks simultaneously in a CPU core improves energy-efficiency and throughput compared to running only one task.
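For identical tasks, Step 1 above reduces to a one-dimensional search over the tasks-per-core setting. A minimal sketch, using as sample data the 2x4-core Opteron disk-test measurements from Table II of this paper:

```python
# Sketch of Step 1 for identical tasks: choose the tasks-per-core setting
# with the lowest energy per job. Energy per job (Wh) is average node
# power (W) divided by throughput (jobs/hour).
measurements = {
    1: (320, 250),  # tasks per core: (jobs/hour, avg. node power in W)
    4: (493, 290),
}

def wh_per_job(jobs_per_hour, power_w):
    return power_w / jobs_per_hour

best = min(measurements, key=lambda k: wh_per_job(*measurements[k]))
print(best, round(wh_per_job(*measurements[best]), 2))  # 4 tasks/core, ~0.59 Wh/job
```

The sketch makes the trade-off explicit: the fully loaded setting draws somewhat more power (290 W vs. 250 W) but processes so many more jobs per hour that its energy per job is clearly lower.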
Possible reasons for this can be:
1. If tasks have I/O access, there would be idle time for the CPU core because of slow disks and network.
2. If tasks have intensive memory access, there would be idle time for the CPU core because of slow main memory access.

B. Test Method and Environment

Our test method was to execute a job, i.e. a large set of tasks, and measure the time and electricity consumed during the test run. The same test job was run with different cluster configurations to find out the optimum one.

We ran our tests on different test environments:
- A Xeon test cluster including one front-end and three computing nodes running Sun Grid Engine [4]. The nodes had two single-core Intel Xeon 2.8 GHz processors (Supermicro X6DVL-EG2 motherboard, 2048 KB L2 cache, 800 MHz front side bus) with 2 gigabytes of memory and 160 gigabytes of disk space.
- A Dell PowerEdge SC1435 computer with two 4-core AMD Opteron 2376 2.3 GHz processors and 32 gigabytes of memory.
- A cluster of three EeeBox mini computers with Intel Atom N270 and 2 gigabytes of memory.

The operating system used with Xeon and Opteron was Rocks 5.0 with kernel version 2.6.18. The EeeBoxes required newer drivers, so with them Rocks 5.2 and kernel version 2.6.24.4 was used. The effect of the kernel was tested and found nonexistent.

The electricity consumption of the computing nodes was measured with the Watts Up Pro electricity meter. We tested the accuracy of our test environment by running the same tests several times with exactly the same settings. The differences between the runs were around ±1% both in time and electricity consumption.

We developed and customised some tools to make the testing process easier. The test runs were submitted using a Perl script that automatically set the wanted cluster parameters and stored all information into a relational database.

We assumed knowing the type of a job and the characteristics of the hardware in advance. The test applications were real HEP analysis applications and dummy test applications simulating CPU intensive, memory intensive, and disk intensive applications.

V. TESTS

To find out the reason why our hypothesis is valid, we formed tests to test our assumptions:
1. I/O access delays: we used two similar test applications, one having intensive I/O access and the other one no I/O access at all.
2. Memory access delays: we used two similar test applications, one having intensive memory access and the other one using very little memory.

We tested two different scenarios:
1. In the simplest case all tasks, all jobs, and all computing nodes were identical.
2. In the second case we had different jobs but identical computing nodes. Now the problem is how to allocate tasks to computing nodes.

The following scheduling methods were tested, focusing on the first two mentioned resources:
- The default scheduling settings of SGE. Job slots are equal to the number of CPU cores. Currently this is often used in grid clusters.
- Slot-based: a fixed number of jobs per CPU core.

A. Basic Tests

We used the following applications for the basic tests:
1. The I/O test application was a simple program that wrote and read 300 MB files multiple times. First it created a file containing numbers generated using the process id as a "random" seed, making all files unique. Files were named using the pid to avoid simultaneous writing/reading into the same file. After generating the file, the contents were copied to another file 20 times. Each time was a bit different (a small shift in numerical values) to avoid buffering.
2. The CPU test application used here was a long loop calculating floating point multiplications. A remainder of the index variable was also used to make compiler optimization harder.
3. The memory test application reserved memory (200 MB) and filled it with numbers. After that it read a part of the memory and wrote it to another part. This was done multiple times.

We performed two types of tests: 1) running identical test applications, and 2) mixing the applications. In the mixed test we submitted test applications to the test clusters in the following order: CPU, memory, I/O, and CPU, i.e. two CPU tests per one set of memory and I/O.

B. Test with Physics Software

We used a CMS analysis application in our tests. Input data for the test was from the CRAFT (CMS Running At Four Tesla) experiment that used cosmic ray data recorded with the CMS detector at the LHC during 2008 [26]. This detector was used in a similar way to future LHC experiments. Our test application uses the CMSSW framework and it is very close to the analysis applications that will be used
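The dummy CPU test application of Section V-A can be sketched like this. It is an illustrative sketch only: the loop count and constants are arbitrary assumptions, not the values used in the actual tests.

```python
# Sketch of the CPU test application: a long floating-point multiplication
# loop, mixing in a remainder of the index variable so the work cannot be
# optimised away by the compiler/interpreter.
def cpu_test(iterations=1_000_000):
    acc = 1.0000001
    for i in range(iterations):
        acc = acc * 1.0000001 + (i % 7)  # remainder keeps the loop non-trivial
    return acc
```

The I/O and memory test applications follow the same pattern: a tight loop exercising a single resource (disk writes/reads of unique files, or repeated copies within a 200 MB buffer) so that each run stresses one component in isolation.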
[Figure: write and read speed (blocks/s) over time (0-6 minutes); legend: write speed, read speed]

VI. RESULTS

Results of our basic tests are shown in Table II, mixed basic tests in Table III, and physics analysis tests in Table IV.

Generally, running more than one task per CPU core improved throughput and decreased electricity consumption. Figure 4 illustrates this in the case of the multicore computer. However, the improvements heavily depended on the application and the hardware. In the modern multicore environment running 3-4 tasks per CPU core gave the best results, while in the older single-core cluster 1-2 tasks per CPU was the best. The single-core cluster was also the only environment in which running multiple tasks can decrease the efficiency. This can partially be related to the low amount of memory.

There are big differences in energy efficiency among different hardware. Figure 5 illustrates this in our test environment: the modern multicore computer is over 7 times more energy efficient than the older single-core cluster.

Fig. 5. Wh/job performance in physics jobs
Hardware          Test    Environment  Jobs/core  Jobs/hour  Wh/job  Avg. W/core  Avg. W/node
Xeon cluster      Memory  normal       1          60         11.88   119.17       715
                          optimal      3          91         8.44    127.50       765
                  Disk    normal       1          119        6.06    119.67       718
                          optimal      1          119        6.06    119.67       718
                  CPU     normal       1          176        3.79    111.33       668
                          optimal      1          176        3.79    111.33       668
2x4 core Opteron  Memory  normal       1          90         2.91    32.50        260
                          optimal      3          91         2.88    32.75        262
                  Disk    normal       1          320        0.79    31.25        250
                          optimal      4          493        0.59    36.25        290
                  CPU     normal       1          407        0.6     30.38        243
                          optimal      3          461        0.56    31.75        254
Eeebox cluster    Memory  normal       1          22         2.05    14.83        44.5
                          optimal      4          32         1.48    15.27        45.8
                  Disk    normal       1          38         1.27    15.53        46.6
                          optimal      3          65         0.76    16.13        48.4
                  CPU     normal       1          24         1.89    14.70        44.1
                          optimal      2          30         1.52    14.73        44.2

TABLE II. Basic tests
VII. CONCLUSION AND FUTURE WORK

Our tests showed that both energy efficiency and throughput can be remarkably improved by running several tasks simultaneously in each CPU core. However, the results depended strongly on the hardware used. The biggest improvement in physics computing in both throughput (100%) and energy consumption (43%) was achieved on the modern computer with 2 x 4 CPU cores. The same hardware was clearly the most energy efficient, too.

Our future work includes the development of a load-based scheduler able to find the best number of tasks per computing node automatically, using the utilisation rates of the different resources of the computing system.

ACKNOWLEDGEMENTS

We would like to thank Arttu Klementtilä, who helped us implement a part of the test applications and execute the test runs, and the Magnus Ehrnrooth Foundation for a grant for the test hardware.
Energy Efficient High Available System:
An Intelligent Agent Based Approach
Ankit Kumar, R.K.Senthil Kumar, B.S.Bindhumadhava
Centre for Development of Advanced Computing,
'C-DAC Knowledge Park', Opp. HAL Aeroengine Division,
No. 1, Old Madras Road, Byappanahalli,
Bangalore-560038, India.
{ankitk, senthil, bindhu}@cdacb.ernet.in
Abstract— For achieving high availability in agent based applications of a mission critical nature, a replica of the real agent system can be created. However, even the replica of the real agent system may fail for various reasons, such as node failure or failure of a communication link, which may lead to agent loss. Another major issue with such systems is the wastage of energy, since the replica has to be in active mode (full power mode) all the time, which is also harmful for the environment. The need for improved energy management in these systems has become essential for many reasons, such as reduced energy consumption and compliance with environmental standards. To overcome these issues, we present an intelligent agent based approach for efficient energy management in these systems, and also for agent loss prevention, by creating a replica 'on demand' of the real agent system using an efficient election algorithm (to find the best-suited system for replication) designed for dynamic networks.

KEY WORDS

Agent System, Mobile Agents, Election Algorithm, Energy Management

I. INTRODUCTION & MOTIVATION

In distributed systems, various activities such as electronic commerce, network management, process control applications and defence applications are mobile agent based. A mobile agent is a self-managed software program that performs a particular task and is capable of autonomously migrating through a heterogeneous network. An agent can exist only on nodes which have an agent system running on them. An agent system provides an execution environment to mobile agents. In CMAF (C-DAC Mobile Agent Framework) [1], agent systems are classified into two categories: real agent systems and proxy agent systems. Pluggable services like registry, communication and user interface provide functionalities to the agent system. These services are called system agents, since they work for the agent system. In a single network domain, there is only one real agent system, and all the other agent systems run as proxy agent systems. A proxy agent system has a lesser load compared to a real agent system. The real agent system maintains a registry of all the agent systems running in the network, whereas a proxy agent system does not. Mobile agent execution is initiated from the real agent system, and an agent can migrate to any proxy agent system which is registered with the real agent system.

Mobile agent based applications of a mission critical nature require a high degree of dependability and consistency. Despite the rapid evolution in all aspects of computer technology, both computer hardware and software are prone to numerous failure conditions which may lead to the termination of these applications. So providing high availability for these types of applications becomes increasingly important. High availability refers to the availability of resources in a system in the wake of component failures in that system. High availability in agent based applications can be achieved by detecting node failures and reconfiguring the system appropriately, so that the workload of the real agent system can be taken over by another node in the system, called the replica. Fault tolerance for the replica is also required, to prevent the loss of agents performing critical applications. Check pointing [3] of the real agent system or the replica alone is not a good approach. We intend to achieve such a goal by applying replication 'on demand'. Instead of making the system run in full power mode all the time, which leads to wastage of energy, we put it into the sleep state and bring it to the active state only when required. To achieve this, we propose an intelligent agent called the "green agent".

In this paper, we present an approach to ensure the reliability of the real agent system using replication based on an election algorithm for dynamic networks. In this approach, we optimize performance by replicating only the real agent system and running all the other agent systems as proxy agent systems with minimal load.

The rest of the paper is organized as follows. In Section 2, we describe the 'agent based highly available environment'. The election algorithm for dynamic networks is discussed in Section 3. Section 4 discusses the agent based energy efficient high available system. Performance evaluation is explained in Section 5. Section 6 discusses the conclusions.

II. AGENT BASED HIGHLY AVAILABLE ENVIRONMENT

In our proposed approach, the 'agent based highly available environment', we use replication to address agent failure due to node failure or agent system failure. Replication increases the dependability and availability of a system.

In this model, all the agents running in the registered agent systems are check pointed in the real agent system. On failure of a proxy agent system, the agents which were abnormally terminated continue their execution from the
real agent system using the last check pointed state. To achieve dependability of the agents, the real agent system is replicated. The locations of the real agent system and its replica are maintained in all the agent systems in realLocFile and replicaLocFile respectively. Whenever a new proxy agent system comes up, it gets the location of the real agent system and its replica from a neighbour agent system. The proxy agent system then registers itself with the real agent system and its replica, as shown in Figure 1.

When the real agent system fails, the replica takes over control and hence becomes the new real agent system. The realLocFile in the real agent system is updated with the new location. The agents which were blocked continue their execution in the new real agent system. These self healing and self configuring properties make our system a highly available and self aware environment for agent execution.

Whenever a replica becomes the new real agent system, another replica is created. The locations of the current real agent system and its replica are updated in all the agent systems. The replication of the real agent system is based on an election algorithm for dynamic networks.

III. ELECTION ALGORITHM FOR DYNAMIC NETWORKS

The agent systems running in a distributed network form a hierarchical structure with the real agent system as the root. Since proxy agent systems can register and unregister at any time, this network is dynamic in nature. In the proposed system, we use an election algorithm to select the best-suited agent system from the proxy agent systems running in the network and reconfigure the selected agent system to create a replica for the real agent system. This algorithm handles the dynamic nature of the network.

Generally, the aim of an election algorithm [4] is to elect a node from a fixed set of nodes. Some of the most common applications of election algorithms are key distribution [5], routing coordination [6], sensor coordination [7] and general control [8], [17]. Nowadays, election algorithms are also being used in mobile agent based applications. In mobile agent based networks, the election algorithm should adapt to the dynamic nature of the network, and it should elect the agent system based on its performance.

Some existing election algorithms work for static networks [9], [10], [11], [7], [12], [13], [14] or assume that the network is static in nature [15], [16]. Existing election algorithms designed for dynamic networks use random selection of a node [17]. Sudarshan Vasudevan, Jim Kurose and Don Towsley proposed an election algorithm for mobile ad hoc networks based on extrema finding [18]. There are also other extrema-finding algorithms [8] and clustering algorithms for mobile networks [19], [20]. But these algorithms are not used in
our approach, since they require a high amount of message passing between nodes, which would increase the overhead.

In our approach, we propose an election algorithm for dynamic mobile agent based networks. This algorithm selects an agent system from among the proxy agent systems based on the performance-related characteristics of the system. Since the processor speed and the amount of memory required for a proxy agent system and a real agent system are the same, we consider only the hard disk space and the load average for the replica election. During the start-up of the real agent system, the election is triggered by creating a mobile agent called ElectionAgent. On failure of the real agent system, the replica takes over control and creates a new replica by reconfiguring the proxy agent system selected by the ElectionAgent.

We now describe the operation of the election algorithm for mobile agent based networks. In Section A, we explain the algorithm for electing the best proxy agent system. An algorithm for performance comparison of proxy agent systems is given in Section B. Section C describes the process of updating the replica location in all the proxy agent systems.

A. Algorithm for Electing the Best Proxy Agent System

When the real agent system starts up, it triggers the election of the best proxy agent system by creating a mobile agent called ElectionAgent. We describe the election process by explaining the methods used by the ElectionAgent. Table I shows the different methods used by the ElectionAgent.

Table I
Methods used by the ElectionAgent
Method        Purpose
getList       gets the list of all proxy agent systems
move          migrates to a particular proxy agent system
getInfo       gets the attributes of the proxy agent system
moveBack      moves back to the real agent system
compareInfo   compares the information gathered
updateList    checks the registry for new proxy agent systems

1) getList: The real agent system maintains a registry of all the agent systems running in the network. The ElectionAgent makes a list of all the proxy agent systems from this registry, where each entry contains the agent system name and its location.

2) move: The agent takes the first entry from the list and tries to get the communication service of that agent system by using its location. Once the communication service is received, the ElectionAgent migrates to that particular proxy agent system. If the communication service is not available, the ElectionAgent retries ten times. After the retrials, it discards this agent system, gets the next agent system entry from the list and tries to migrate to that agent system.

3) getInfo: After the successful migration to the proxy agent system, the ElectionAgent continues its execution. The election of an agent system is made on the basis of the hard disk space and the load average.

4) moveBack: The ElectionAgent gathers this information about the proxy agent system and gets the communication service of the real agent system in order to move back. It migrates back to the real agent system using its communication service.

5) compareInfo: After migrating back to the real agent system, the gathered information is compared with the previous information, if it exists, and the best value is saved in a tmpInfoFile. The tmpInfoFile contains the name, location and attributes of the best proxy agent system.

6) updateList: The ElectionAgent checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems.

For each proxy agent system entry in the list, the ElectionAgent gets the information about that agent system, compares it with the previous information in the tmpInfoFile and updates the tmpInfoFile with the best value. Finally, the tmpInfoFile, which will contain the information of the best proxy agent system, is renamed to infoFile. The ElectionAgent keeps on executing and hence we assume that the infoFile will contain the best proxy agent system. When the real agent system fails, the replica becomes the new real agent system. The realLocFile in the real agent system is updated with the new location, and the agent system contained in the infoFile is reconfigured as the new replica. The replicaLocFile in the real agent system is updated with the new replica location, and the locations of the real agent system and its replica are updated in all the proxy agent systems.

B. Algorithm for Performance Comparison

We describe an algorithm for comparing the performance of proxy agent systems based on the information gathered by the ElectionAgent. Here we consider the hard disk space and the load average of the proxy agent systems for comparison. The load average is the average number of processes in the kernel's run queue during an interval.

We represent a proxy agent system with free hard disk space h and load average l as (h,l). Let (h1,l1) be the proxy agent system contained in the tmpInfoFile and (h2,l2) be the proxy agent system that is to be compared. The best proxy agent system is selected based on the different conditions given below.

Condition 1: h1 = h2
In this case, we compare l1 and l2, and select the proxy agent system having the lesser load average.
l1 > l2 => (h2,l2)
l1 < l2 => (h1,l1)
l1 = l2 => (h1,l1)

Condition 2: negligible difference between h1 and h2
When there is a negligible difference between h1 and h2, we give more priority to the load average for the selection of the proxy agent system.
l1 > l2 => (h2,l2)
l1 < l2 => (h1,l1)
When there is no difference between l1 and l2, we select the proxy agent system having comparatively more free hard disk space.
l1 = l2 => (h1,l1), if h1 > h2
l1 = l2 => (h2,l2), if h2 > h1

Condition 3: significant difference between h1 and h2
When there is a significant difference between h1 and h2, we give priority to either the load average or the free hard disk space for the selection of the proxy agent system, based on certain criteria. Here we consider that a system with a single CPU is overloaded if its load average is greater than 1. So if only one of the systems has a load average less than 1, we select that system.
l1 > 1 and l2 < 1 => (h2,l2)
l1 < 1 and l2 > 1 => (h1,l1)
When l1 < 1 and l2 < 1, or l1 > 1 and l2 > 1, we have two cases. If the system having comparatively more free hard disk space also has the lesser load average, we select that system. Otherwise, we select the system having the lesser load average when the difference between the load averages is significant; when the difference is negligible, the system with comparatively more free hard disk space is selected.
1) h1 > h2
l1 < l2 => (h1,l1)
l1 = l2 => (h1,l1)
l1 > l2 => (h2,l2), if the difference between l1 and l2 is significant
l1 > l2 => (h1,l1), if the difference between l1 and l2 is negligible
2) h1 < h2
l1 > l2 => (h2,l2)
l1 = l2 => (h2,l2)
l1 < l2 => (h1,l1), if the difference between l1 and l2 is significant
l1 < l2 => (h2,l2), if the difference between l1 and l2 is negligible

C. Updating the Replica Location in Proxy Agent Systems

The locations of the real agent system and its replica are updated in all the proxy agent systems by the real agent system. We describe this process below.

The entries of all the proxy agent systems which are registered with the real agent system are added into a list. Each entry in the list contains the agent system name and its location. The locations of the real agent system and its replica are retrieved from the realLocFile and the replicaLocFile. This information is sent by the real agent system to the first agent system entry in the list of agent systems using the communication service. In the proxy agent system, the entries of realLocFile and replicaLocFile are updated with the locations of the new real agent system and its replica. The real agent system checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems.

Example:
We illustrate the operation of the algorithm with an example. Let us consider a network of agent systems as shown in Figure 2(a). It consists of one real agent system (R1), one replica (R2) and three proxy agent systems (P1, P2, P3). The locations of the real agent system and its replica are shown in all the agent systems as [R1,R2].

R1 now creates the ElectionAgent (EA). EA maintains a list of all proxy agent systems as [P1,P2,P3]. It moves to P1 as shown in Figure 2(b). After reaching the proxy agent system P1, EA collects the information about P1, i.e., I(P1), and moves back to the real agent system R1. The gathered information is compared with the previous information, if it exists, and the best value is saved in a tmpInfoFile. This is represented as {P1} as shown in Figure 2(c).

Before EA migrates to the next proxy agent system, it updates the list of proxy agent systems. In this example, a new proxy agent system P4 is added to the list. The updated list is [P2,P3,P4] and is shown in Figure 2(d). EA moves to each agent system in the list, gets the information and saves the best value. Finally, the tmpInfoFile, which will contain the information of the best proxy agent system, is renamed to infoFile; we assume that P3 is the best proxy agent system, as shown in Figure 2(e). EA continues its execution and updates the infoFile each time.

When R1 fails, R2 becomes the new real agent system and the location of the real agent system in R2 is updated as [R2,R2]. Now R2 fetches the best proxy agent system entry from the infoFile, i.e., [P3]. P3 is reconfigured as the new replica R3 and the location of the replica in R2 is updated as [R2,R3] as shown in Figure 2(f).

The real agent system R2 then updates the locations of the real agent system and its replica in all the proxy agent systems. R2 has a list containing the entries of all proxy agent systems registered with it, [P1,P2,P4]. The locations of the real agent system and its replica [R2,R3] are retrieved from the realLocFile and replicaLocFile. The real agent system sends this information to the first proxy agent system entry in the list, P1, using the communication service, as shown in Figure 3(a). In the proxy agent system P1, the locations of the real agent system and its replica [R1,R2] are updated to [R2,R3]. For each proxy agent system entry in the list, the real agent system updates the entries of realLocFile and replicaLocFile with the locations of the new real agent system and its replica, as shown in Figure 3(b). In Figure 3, we assume that EA is running and show only the entries in the tmpInfoFile and infoFile.
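The selection rules of Section III-B can be collected into a single comparison function. The sketch below is illustrative rather than CMAF code, and the thresholds h_eps and l_eps that decide when a difference counts as "negligible" are assumptions, since the paper leaves their exact values open.

```python
def select_best(a, b, h_eps=1.0, l_eps=0.1):
    """Return the better proxy agent system of a = (h1, l1) and b = (h2, l2),
    where h is the free hard disk space and l is the load average.

    h_eps and l_eps are assumed thresholds for a "negligible" difference."""
    (h1, l1), (h2, l2) = a, b
    # Conditions 1 and 2: equal or negligibly different disk space,
    # so the lesser load average wins; a load tie falls back to disk space.
    if abs(h1 - h2) <= h_eps:
        if l1 > l2:
            return b
        if l1 < l2:
            return a
        return a if h1 >= h2 else b
    # Condition 3: significant disk-space difference. A single-CPU system
    # is considered overloaded once its load average exceeds 1, so a
    # non-overloaded system beats an overloaded one outright.
    if l1 > 1 and l2 < 1:
        return b
    if l1 < 1 and l2 > 1:
        return a
    # Both loads on the same side of 1: if more disk and less load agree,
    # take that system; otherwise a significant load gap favours the lesser
    # load, while a negligible gap favours the larger free disk space.
    more_disk = a if h1 > h2 else b
    less_load = a if l1 < l2 else b
    if l1 == l2 or more_disk is less_load:
        return more_disk
    return less_load if abs(l1 - l2) > l_eps else more_disk
```

For instance, select_best((50, 1.5), (200, 0.8)) selects the second system, since only its load average is below 1 (Condition 3).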
Figure 2. Election Process in Proposed Environment
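The election cycle just illustrated can be sketched in a few lines. This is a simplified stand-in, not CMAF code: agent migration (move/getInfo/moveBack) is reduced to a probe() callback, the comparison is collapsed to "lesser load average wins" (Section III-B gives the full disk-space/load rules), and run_election, probe and the attribute tuples are hypothetical names.

```python
def run_election(registry, probe):
    """Elect the best proxy agent system listed in `registry`.

    probe(name) plays the role of move/getInfo/moveBack: it returns the
    (free_disk, load_avg) attributes of that proxy agent system, or None
    when its communication service stays unreachable after the retries."""
    visited, best = set(), None          # `best` mirrors the tmpInfoFile
    pending = list(registry)             # getList: snapshot of the registry
    while pending:
        name = pending.pop(0)            # move: next entry in the list
        visited.add(name)
        attrs = probe(name)              # getInfo (and moveBack)
        if attrs is not None:
            # compareInfo, simplified: lesser load average wins
            if best is None or attrs[1] < best[1][1]:
                best = (name, attrs)
        # updateList: pick up proxies registered since the last pass
        pending += [p for p in registry if p not in visited and p not in pending]
    return best                          # the entry renamed to infoFile

# Re-playing the Figure 2 example: P4 registers while the agent is away.
registry = ["P1", "P2", "P3"]
attrs = {"P1": (40, 0.9), "P2": (35, 0.7), "P3": (60, 0.2), "P4": (20, 0.5)}

def probe(name):
    if name == "P1":                     # P4 comes up during the first visit
        registry.append("P4")
    return attrs[name]

print(run_election(registry, probe))     # -> ('P3', (60, 0.2))
```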
IV. AGENT BASED ENERGY EFFICIENT HIGH AVAILABLE SYSTEM

We choose the replica dynamically 'on demand' using an election algorithm in the 'agent based highly available environment'. Most of the time this replica is idle and is put to use only when it needs to be synchronized with the real agent system. During this idle time the replica runs in full power mode, which leads to a considerable amount of energy wastage.

Modern computer systems are equipped with special utilities that allow users to either manually or automatically schedule their computers to switch to sleep mode for energy saving purposes, which is done by the administrator [22], [23]. Practically, in highly available systems it is not possible for an administrator to determine when exactly the replica will not be in use, which leads to significant energy wastage. So, to reduce the energy wastage, we introduce an agent based efficient energy management approach for highly available systems. This approach is first implemented for the replica of the real agent system of our 'agent based highly available environment' and then extended to the other systems running as proxy agent systems.

In the current approach we put a system into sleep mode using an agent known as the 'Green Agent'. Figure 4 illustrates the working of the Green Agent.

A. Working of the Green Agent

When the real agent system starts up, it triggers the energy saving process by creating a mobile agent called GreenAgent. We describe the working of the GreenAgent by explaining the methods it uses. Table II shows the different methods used by the GreenAgent.

The real agent system has a special registry which contains the status of all the other agent systems, i.e., whether each is in active mode or sleep mode, together with its location; this registry is periodically updated by the GreenAgent.

Table II
Methods used by the GreenAgent
Method        Purpose
getList       gets the list of all active proxy agent systems
move          migrates to a particular proxy agent system
checkInfo     gets specific parameters of the proxy agent system and checks
              them to decide whether that system should sleep or remain active
moveBack      moves back to the real agent system
makeSleep     puts to sleep all the systems whose flag is set to 'ready to sleep'
updateList    checks the registry for new proxy agent systems

1) getList: The GreenAgent gets the entries of all the active proxy agent systems from the registry and makes a list. Each entry in the list contains the agent system name and its location.

2) move: It takes the first agent system entry from the list of agent systems and tries to get the communication service of that agent system by using its location. Once the communication service is received, the GreenAgent migrates to that particular proxy agent system. If the communication service is not available, the GreenAgent retries ten times. If it is still not available after the retrials, it discards this agent system, gets the next agent system entry from the list and tries to migrate to that agent system.

3) checkInfo: After the successful migration to the proxy agent system, the GreenAgent continues its execution. It checks specific parameters like the CPU load, the number of running applications/processes and mouse/keyboard activity in order to detect whether the current system has reached an idle state; if so, it changes the status flag corresponding to this agent system from 'active' to 'ready to sleep' in the real agent system.

4) moveBack: The GreenAgent then gets the communication service of the real agent system and migrates back to the real agent system using its communication service.

5) makeSleep: The GreenAgent finds all the agent systems whose status flags are set to 'ready to sleep', puts all these systems into standby mode by calling a special API function on these agent systems, and sets their flags to 'sleep'.

6) updateList: The GreenAgent checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems along with its status.

The real agent system checks the status flag of a proxy agent system before sending any other mobile agent to it. Only if the flag is set to 'active' does it send the agent to the proxy agent system; if the flag is set to 'sleep' or 'ready to sleep', it first brings the corresponding system into active mode, sets the flag to 'active' and then sends the agent there.

The replica is updated by the real agent system through the synchronization process. It should be in the active state only during synchronization; the rest of the time it should be in the sleep state. As no agent is running on the replica, the real agent system directly puts it into the active and sleep states as and when required. So, the first time after synchronization the real agent system sets the replica to 'stand by' mode, and after that, before each synchronization, it sets the replica back to full power mode. Through this approach we can achieve an efficient energy management technique in highly available systems.

V. PERFORMANCE EVALUATION

For comparing the performance of a mobile agent on CMAF [1] and on the 'agent based highly available environment', four different situations were simulated.
1) ACP represents the normal execution of an agent in CMAF [1]. This involves the migration time, the processing time and the agent check pointing [2] time.

2) ASFT represents the normal execution of an agent in the fault tolerant agent system, i.e., the 'agent based highly available environment'.

3) ASFT-fP represents the execution of an agent in the 'agent based highly available environment' on failure of the real agent system while the agent is in a proxy agent system. In this case, we assume that the agent moves back from the proxy agent system to the replica before the updating of the realLocFile and the replicaLocFile.

4) ASFT-fR represents the execution of an agent in the 'agent based highly available environment' on failure of the real agent system while the agent is in the real agent system itself.

Our algorithm was simulated to study the impact of agent size on its execution time. A simulation environment was set up with one real agent system, one replica and one proxy agent system. For each simulated situation, an agent was sent from the real agent system to the proxy agent system while increasing its size. The results of the simulation are shown in Figure 5(a). The difference between the ACP and ASFT curves is due to the replication overhead. There is a constant difference between ASFT and ASFT-fP. This is due to the delay taken by the agent to realise that the real agent system has failed and that it has to move to the replica. The significant difference between ASFT and ASFT-fR is due to the time taken by the replica to take over control on failure of the real agent system, the time taken to reconfigure a proxy agent system into the replica, and the restoration overhead. Since the updating of the realLocFile and replicaLocFile in the proxy agent systems is done after the restoration of the agents, it does not affect the ASFT-fR curve. In any case, this updating process does not add much delay.

Another simulation environment was set up to analyze the influence of the number of nodes on the agent execution time. For each of the four simulated situations, the agent execution time was measured while increasing the number of nodes. Figure 5(b) shows the results of the simulation. From the figure, we observe that there is a small difference between the ACP and ASFT curves due to the replication overhead. We can also see that the difference between ASFT and ASFT-fP, as well as ASFT-fR, is almost constant.

The energy management system was tested on a network consisting of 12 computers. CMAF was installed on each of the 12 machines and the total power consumed by the 12 computers was measured. After this, the green agent was triggered on all of the 12 machines and the power consumed by all the machines was measured again. The total power consumed by the machines was measured over a significant time in both cases, and a graph of power consumed versus time was plotted from the results obtained. Figure 5(c) shows the power variations with and without the green agent. From the graph it can clearly be concluded that a significant amount of energy is saved if we use the green agent.
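The status-flag protocol of Section IV ('active', 'ready to sleep', 'sleep') can be sketched as follows. The flag values come from the paper; the Node class, the is_idle() thresholds and the wake/standby transitions are illustrative assumptions standing in for CMAF's registry and the platform power API.

```python
class Node:
    """Registry entry for an agent system; the field names are assumed."""
    def __init__(self, name, cpu_load=0.0, procs=0, input_active=False):
        self.name = name
        self.flag = "active"            # 'active' | 'ready to sleep' | 'sleep'
        self.cpu_load = cpu_load
        self.procs = procs
        self.input_active = input_active

    def is_idle(self):
        # checkInfo: CPU load, running processes, mouse/keyboard activity
        # (the concrete thresholds are assumptions, not from the paper)
        return self.cpu_load < 0.05 and self.procs == 0 and not self.input_active

def green_agent_pass(nodes):
    """One GreenAgent tour: mark idle nodes, then put the marked ones to sleep."""
    for n in nodes:                     # move/checkInfo on each active proxy
        if n.flag == "active" and n.is_idle():
            n.flag = "ready to sleep"
    for n in nodes:                     # makeSleep, after moving back
        if n.flag == "ready to sleep":
            n.flag = "sleep"            # stand-in for the standby API call

def dispatch(node, agent):
    """Real agent system's flag check before sending any other mobile agent."""
    if node.flag != "active":
        node.flag = "active"            # wake the system first
    return f"{agent} sent to {node.name}"

nodes = [Node("P1", cpu_load=0.6, procs=3), Node("P2")]
green_agent_pass(nodes)
print(nodes[0].flag, nodes[1].flag)     # active sleep
dispatch(nodes[1], "WorkerAgent")
print(nodes[1].flag)                    # active
```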
[Figure 5: (a) agent execution time (s) versus agent size (kb) and (b) agent execution time (s) versus number of nodes, each with curves for ACP, ASFT, ASFT-fP and ASFT-fR; (c) power consumed versus time of day, with and without the green agent.]
VI. CONCLUSION

In this paper, we have proposed an 'agent-based highly available environment' for achieving high availability in agent-based mission-critical applications by creating a replica 'on demand' for a real agent system. We achieve this by using an election algorithm designed for dynamic networks to find the best-suited proxy agent system in the network. Since the replica is created dynamically 'on demand', agent loss in such systems can be avoided.

We have also considered the issue of heavy energy wastage in such systems and proposed an intelligent-agent-based energy management approach for saving energy.

Finally, from the simulations, we found that the agent execution delay due to replication does not add much overhead. There is a significant delay in agent execution only when the real agent system fails. Our approach enhances the availability and self-awareness of the agent system and provides a highly reliable environment for the execution of mobile-agent-based mission-critical applications, at the expense of replication overhead.

We have also observed and compared the energy consumed by the systems with and without the 'Green Agent'. The results clearly show that a considerable amount of energy is saved by our energy management approach as compared to the normal approach.

Our future work will concentrate on performance optimization of the agent-based, energy-efficient, highly available environment.
A Two-phase Bi-criteria Workflow Scheduling
Algorithm in Grid Environments
Amit Agarwal and Padam Kumar
Department of Electronics and Computer Engineering
Indian Institute of Technology Roorkee
Roorkee, India
{aamitdec, padamfec}@iitr.ernet.in
Abstract—Scheduling workflow applications in a highly dynamic and heterogeneous grid environment is a complex NP-complete optimization problem. Several different criteria may need to be considered simultaneously when evaluating the quality of a solution or schedule. The two most important scheduling criteria frequently addressed by current grid research are execution time and economic cost. This paper presents an efficient bi-criteria scheduling heuristic for workflows called the Duplication-based Bi-criteria Scheduling Algorithm (DBSA). The proposed approach comprises two phases: (1) duplication-based scheduling, which optimizes the primary criterion, i.e. execution time, and (2) sliding-constraint schedule optimization, which optimizes the secondary criterion, i.e. economic cost, keeping the primary criterion within a sliding constraint. The sliding constraint is defined as a function of the primary criterion and determines how much the final solution may differ from the best solution found in primary scheduling. The experimental results reveal that the proposed approach generates schedules which are fairly optimized for both economic cost and makespan while keeping the makespan within defined constraints for executing workflow applications in the grid environment.

Keywords- grid computing; bi-criterion scheduling; optimization; workflow applications; DAG.

I. INTRODUCTION

The grid [1] is a unified computing platform which consists of a diverse set of heterogeneous resources distributed over a large geographical region, inter-connected by high-speed networks and the Internet. A workflow application can be defined as a collection of tasks with precedence constraints that are executed in a well-defined order to achieve a specific goal [2]. Scheduling workflow applications in a grid, with its characteristics of dynamism, heterogeneity, distribution, openness, voluntariness, uncertainty and deception, is a complex optimization problem, and several different criteria need to be considered simultaneously to obtain a realistic schedule. In general, minimization of the total execution time (or makespan) of the schedule is applied as the most important scheduling criterion [3-6]. Current grid computing systems are based on system-centric policies, whose objective is to optimize system-wide performance metrics, i.e. the total execution time or makespan. The convergence of grid computing toward the service-oriented approach is fostering a new vision in which economic aspects are central to the adoption of computing as a utility [7]. In current economic market models [8, 9, 10], economic cost (the cost of executing a workflow application over the grid) has been considered as another important scheduling criterion to employ user-centric policies.

Considering multiple criteria enables us to propose a more realistic solution. Thus, an effective multi-criteria scheduling algorithm is required for executing workflows over the grid while assuring high communication speed and reducing task execution time and economic cost. A workflow-type application can be modeled as a Directed Acyclic Graph (DAG) in which the nodes represent the executable tasks and the directed edges represent the inter-task data and control dependencies. Since the DAG scheduling problem in grids is NP-complete, we have emphasized heuristics for scheduling rather than exact methods. In [8, 10, 11, 12, 13, 14], several scheduling algorithms have been proposed which minimize the makespan and economic cost of the schedule, but only a few of them address workflow types of applications. In [8, 13], Buyya et al. propose multi-objective planning approaches for workflow scheduling on utility grids. In [10], a quality-of-service optimization strategy for multi-criteria scheduling is presented for the criteria of payment, deadline and reliability. In [11], Wieczorek et al. present an efficient bi-criterion scheduling algorithm called the Dynamic Constraint Algorithm (DCA), based on a sliding constraint; it takes a list-based heuristic for primary scheduling. In [12], Dogan and Ozguner show another tradeoff, between makespan and reliability, using a sophisticated reliability model of computation and network performance. Our work uses a different specification for two specific criteria (makespan and economic cost).

In [15], Deelman et al. describe three different workflow scheduling strategies, namely full-plan-ahead scheduling, in-time local scheduling, and in-time global scheduling. In just-in-time scheduling (in-time local scheduling), the scheduling decision for an individual task is postponed as long as possible and performed just before the task execution starts (a fully dynamic approach). In full-ahead planning (full-plan-ahead), the whole workflow is scheduled before its execution starts (a fully static approach). In this paper, we adopt full-ahead planning as it does not incur run-time overheads and scheduling complexity on a federation of geographically distributed computing resources known as a computational grid. Research shows that heuristics performing best in a static environment (e.g. HLD [4], HBMCT [6]) have the highest potential to perform best in a more accurately modeled grid environment.
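As a concrete illustration of how a workflow DAG with computation volumes and communication data translates into execution and transfer times, the sketch below (in Python; the task names, volumes, edge sizes and resource rates are hypothetical values of our own choosing, not the authors' data or code) applies the model's units directly: one volume unit is one million instructions and one data unit is one Kbyte.

```python
# Illustrative sketch, not the authors' code. A workflow W(N, E, T, C):
tasks = ["n1", "n2", "n3", "n4"]                        # N: computational tasks
volume = {"n1": 300, "n2": 500, "n3": 100, "n4": 300}   # T: computation volumes (MI)
edges = {("n1", "n2"): 200, ("n1", "n3"): 150,          # E with C: communication
         ("n2", "n4"): 100, ("n3", "n4"): 100}          # data per edge (Kbytes)

def computation_cost(task, mips):
    """Time (s) to run task n_i on a resource executing at `mips` million
    instructions per second; one volume unit = one million instructions."""
    return volume[task] / mips

def communication_cost(edge, bandwidth_kbps):
    """Transfer time (s) for the data on an edge when its endpoint tasks
    are mapped to distinct resources linked at `bandwidth_kbps`."""
    return edges[edge] / bandwidth_kbps

print(computation_cost("n2", 250))            # 500 MI / 250 MIPS -> 2.0 s
print(communication_cost(("n1", "n2"), 100))  # 200 Kbytes / 100 Kbps -> 2.0 s
```

If the two tasks on an edge land on the same resource, the model charges zero communication cost, so only cross-resource edges would call the second function.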
Scheduling heuristics can be categorized as list-based, cluster-based and duplication-based scheduling. By and large, current multi-criteria scheduling approaches have adopted list-based heuristics (such as HEFT [3] and HBMCT [6]) for primary scheduling. An extensive literature survey shows that duplication-based heuristics [4-5, 16] generate remarkably shorter schedules than list-based and cluster-based heuristics. The duplication approach utilizes idle time slots (scheduling holes) for task duplication which, in turn, reduces the communication cost. The duplication strategy generates more optimized alternative schedules which help to minimize the overall schedule length. This motivates us to adopt the duplication-based approach for primary scheduling to optimize the makespan (primary criterion).

In secondary scheduling, our objective is to optimize the economic cost (secondary criterion) of the schedule while keeping the makespan within the defined sliding constraint. Fig. 1 illustrates that a primary solution (M1, C1) is obtained by considering the primary criterion (makespan) in primary scheduling, yielding a makespan of length M1 with economic cost C1. In secondary scheduling, this schedule is optimized for economic cost by relaxing the makespan from M1 to M2 (M2 being the maximum allowable schedule length), yielding a schedule with makespan M2 and reduced economic cost C2. This approach generates schedules which are remarkably more optimized in terms of both execution time and economic cost as compared to related algorithms.

[Figure 1. Local search direction in the (makespan, economic cost) space during secondary scheduling.]

The rest of the paper is organized as follows. Section II describes the bi-criteria scheduling problem and related terminology. Section III presents the bi-criteria scheduling approach. The proposed bi-criteria scheduling algorithm (DBSA) is described in section IV. In section V, simulation results are presented and discussed. Section VI concludes the proposed research work.

II. BI-CRITERIA SCHEDULING PROBLEM

A. Workflow Application Model

A workflow scheduling problem can be defined as the assignment of available grid resources to different workflow tasks. A workflow can be modeled as a DAG, as shown in fig. 2, and represented by W(N, E, T, C), where N is a set of n computational tasks, T is a set of task computation volumes (one unit of computation volume is one million instructions), E is a set of communication arcs or edges that show the precedence constraints among the tasks, and C is the set of communication data from parent tasks to child tasks (one unit of communication data is one Kbyte). The value of t_i in T is the computation volume of task n_i in N. The value of c_ij in C is the communication data transferred along the edge e_ij in E from task n_i to task n_j, for n_i, n_j in N.

[Figure 2. A sample workflow DAG; n1 is the entry task and the edge labels (100, 300, 500, ...) denote communication data.]

The economic cost (EC) of a schedule is computed over the m resources used as:

    EC = SUM over j = 1..m of ( M_j x PBT_j )    (1)
where M_j is the per-unit-time cost of executing tasks on a resource p_j and PBT_j is the total busy time consumed by the tasks scheduled on resource p_j.

TABLE I. COMPUTATION COST (w_ij), B-LEVEL AND TASK SEQUENCE FOR THE DAG IN FIGURE 2
[Table body not recoverable from the extraction.]

In this model, the cost of the idle time slots between the scheduled tasks on any resource is also included in the economic cost, as it is difficult for the grid scheduler to schedule other workflow tasks into these idle slots. Thus, the total execution time (makespan) can be expressed as:

    makespan = AFT(n_exit) - AST(n_entry)    (3)

where AFT and AST are the actual finish time of the exit task and the actual start time of the entry task, respectively. The normalized schedule length (NSL) of a schedule can be calculated as:

    NSL = makespan / SUM over n_i in CP_min of ( min over p_j in P of { w_ij } )

The denominator is the summation of the minimum execution costs of the tasks on the CP_min [3].

TABLE II. RESOURCE CAPACITY
Resource:                                  p1    p2    p3    p4
Processing capacity f(p_i) (in MIPS):      220   350   450   310

TABLE III. MACHINE PRICE
Resource p_i:                              1     2     3     4
Machine cost per MIPS M(p_i) (in $):       1.0   2.5   3.0   2.0

B. Grid Resource Model

A grid resource model can be represented by an undirected weighted graph G(P, Q, A, B), as shown in fig. 3, where P = { p_i | i = 1, 2, ..., p } is the set of p available resources; A = { f(p_i) | i = 1, 2, ..., p } is the set of execution rates (Table II), where f(p_i) is the execution rate of resource p_i; Q = { q(p_i, p_j) | i = 1, 2, ..., p } is the set of communication links connecting pairs of distinct resources, with q(p_i, p_j) the communication link between p_i and p_j; and B = { b(p_i, p_j) | i = 1, 2, ..., p } is the set of data transfer rates (bandwidths), where b(p_i, p_j) is the data transfer rate between resources p_i and p_j (Fig. 3). In our model, task executions are assumed to be non-preemptive, and the intra-processor communication cost between two tasks scheduled on the same resource is taken as zero.

[Figure 3. Grid with 4 resources p1-p4 (bandwidths in Kbps, e.g. 50 and 100); solid edges are direct paths, dashed edges indirect paths.]

C. Bi-criteria Performance Criteria

The computation cost of task n_i on p_j is w_ij (see Table I). If resource p_j is not capable of processing task n_i, then w_ij = infinity. In a grid, some resources may not be in a fully connected topology. Therefore, bandwidths between such resources are computed by searching alternative paths between them with maximum allowable bandwidths. The communication cost between task n_i scheduled on resource p_m and task n_j scheduled on resource p_n can be computed as:

    c_ij / b(p_m, p_n)

In this model, we ignore the communication startup costs of resources, and the intra-processor communication cost is negligible. A workflow of tasks is submitted to the grid scheduler [14], where tasks are queued in non-increasing order of their b-level. The b-level (bottom level) of task n_i is defined as the longest directed path, including execution cost and communication cost, from task n_i to the exit task in the given DAG. It can be computed recursively as:
    b_i = w~_i + max over n_j in succ(n_i) of ( c~_ij + b_j )

where succ(n_i) refers to the immediate successors of task node n_i, w~_i is the mean computation cost of task n_i, and c~_ij is the mean communication cost between tasks n_i and n_j (for the exit task, b_i = w~_i).

The optimization goal of bi-criteria scheduling is to obtain the schedule with minimum schedule cost. It can be expressed in terms of a performance metric called the effective schedule cost (ESC), computed as:

    ESC = NSL x EC

III. BI-CRITERIA SCHEDULING APPROACH

In this paper, we consider the makespan as the primary criterion and the economic cost as the secondary criterion. We define the sliding constraint for the primary criterion, i.e. how much the final solution may differ from the best solution found for the primary criterion. The primary scheduler adopts an efficient duplication scheduling approach to minimize the schedule length as much as possible. This schedule is then forwarded to the secondary scheduler, which optimizes it to minimize the economic cost. In secondary scheduling, some of the duplicated tasks are removed and the schedule is modified such that the makespan after removing those duplicated tasks remains within the maximum allowable execution length.

Further, the secondary scheduler investigates those tasks in the schedule which have been duplicated on other resources. Such tasks may become unproductive if their descendant tasks receive input data from a duplicated version. Thus, such unproductive tasks are removed in order to reduce the economic cost. If the makespan of the resulting schedule is less than the upper limit of the defined sliding constraint, it can be further modified to reduce the economic cost: tasks are swapped among resources (from costlier to cheaper resources) so as to reduce the economic cost of the schedule while keeping the makespan within the upper limit.

IV. DUPLICATION-BASED BI-CRITERIA SCHEDULING ALGORITHM (DBSA)

Input:
A DAG (workflow) W with task computation and communication costs.
A set of available resources P with cost of execution per unit time.
Sliding constraint L (10%, 25%, 50% and 75% of the makespan c1_prel).

Algorithm I: Primary Scheduling

Construct a priority-based task sequence in decreasing order of b-level
for (each unscheduled task ni in the task sequence)
    Let the finish time Fi of task ni be infinite
    for (each capable resource pj)
        Compute the finish time Fij of task ni on resource pj
        Construct the task predecessor list pred_list(ni)
        Initialize temp_list(ni, pj) to empty
        if (pred_list)
            for (each predecessor dk not scheduled on pj)
                if duplicating dk on pj reduces the finish time Fij
                    Add dk to temp_list(ni, pj)
                    Update the finish time Fij
                endif
            endfor
        endif
        if Fij < Fi
            Fi = Fij
            ri = pj
            if (temp_list)
                Copy temp_list(ni, pj) into duplicate_list(ni, pj)
            endif
        endif
    endfor
    Assign task ni to resource ri and update schedule S
    if (duplicate_list)
        Duplicate the tasks in duplicate_list onto ri and update schedule S
    endif
endfor
Compute the makespan c1_prel (Equation 3) and the economic cost c2_prel (Equation 1) from schedule S
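The b-level ranking and the duplication idea of Algorithm I can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the four-task workflow, its mean costs and the two-resource grid are hypothetical, and the duplication test is cruder than the temp_list/duplicate_list bookkeeping above (each candidate duplicate is priced against the resource's current free time).

```python
from functools import lru_cache

# Hypothetical workflow: mean computation cost per task and mean
# communication cost per edge, both in seconds (not from the paper).
w = {"n1": 2.0, "n2": 3.0, "n3": 2.0, "n4": 1.0}
c = {("n1", "n2"): 4.0, ("n1", "n3"): 1.0, ("n2", "n4"): 1.0, ("n3", "n4"): 1.0}
succ = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"], "n4": []}
pred = {"n1": [], "n2": ["n1"], "n3": ["n1"], "n4": ["n2", "n3"]}

@lru_cache(maxsize=None)
def b_level(ni):
    # b_i = w_i + max over successors of (c_ij + b_j); exit task: b_i = w_i
    if not succ[ni]:
        return w[ni]
    return w[ni] + max(c[(ni, nj)] + b_level(nj) for nj in succ[ni])

resources = ["p1", "p2"]

def schedule():
    finish = {}                           # (task, resource) -> finish time
    placed = {}                           # task -> primary resource
    ready = {p: 0.0 for p in resources}   # next free time on each resource
    for ni in sorted(w, key=b_level, reverse=True):   # decreasing b-level
        best = None
        for pj in resources:
            # earliest start: predecessors on other resources pay c_ij,
            # unless duplicating them onto pj finishes sooner
            start, dups = ready[pj], []
            for dk in pred[ni]:
                arrival = finish[(dk, placed[dk])] + (
                    0.0 if placed[dk] == pj else c[(dk, ni)])
                if placed[dk] != pj and ready[pj] + w[dk] < arrival:
                    arrival = ready[pj] + w[dk]   # rerun dk locally instead
                    dups.append(dk)
                start = max(start, arrival)
            fij = start + w[ni]
            if best is None or fij < best[0]:
                best = (fij, pj, dups)
        fij, pj, dups = best
        for dk in dups:                   # place the duplicated copies first
            ready[pj] += w[dk]
            finish[(dk, pj)] = ready[pj]
        placed[ni], ready[pj] = pj, fij
        finish[(ni, pj)] = fij
    return placed, max(finish.values())

placed, makespan = schedule()
# Here duplicating n1 on p2 lets n3 start earlier, cutting the makespan
# from 7.0 (no duplication) to 6.0.
```

The example reproduces the mechanism only; the paper's algorithm additionally tracks duplicate_list per task and feeds the resulting c1_prel and c2_prel into the secondary phase.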
Algorithm II: Secondary Scheduling

Let L (maximum allowable schedule length) = c1_prel + 10% of c1_prel
if (duplicate_list is non-empty && c1_prel <= L)
    Copy the duplicated tasks into list A and sort them in non-decreasing order of start time
    for (each duplicated task ai in A)
        Compute the schedule length SL without ai in schedule S
        if SL <= L
            c1_final = SL
            Remove ai from S and update S and A
        endif
    endfor
    Construct list B of tasks from A that were duplicated
    Sort list B in non-decreasing order of task start time
    for (each task bi in B)
        Compute the schedule length SL without bi in schedule S
        if SL <= L
            c1_final = SL
            Remove bi from S and update S and B
        endif
    endfor
endif
Compute the economic cost c2_final of the optimized schedule S
for (each task ni scheduled on resource pj in S)
    Construct a list R of capable resources, in non-decreasing order of machine cost, whose machine cost is less than M(pj)
    for (each resource pk in R)
        Reschedule task ni to resource pk provided c1_final <= L
        Compute the economic cost EC' (using Equation 1)
        if EC' < c2_final
            Update schedule S
            c2_final = EC'
        endif
    endfor
endfor

… maximum allowable limit. The schedule in fig. 4(c) is modified to reschedule tasks from resource P3 to P4, which reduces the economic cost to $9.32 while the makespan is kept below 18 (+10% of the makespan from primary scheduling).

V. SIMULATION RESULTS AND ANALYSIS

The algorithms described in section IV have been simulated and implemented for the evaluation of random task graphs (DAGs) of different sizes (100, 200, 300, 400 and 500 tasks) with different degrees of parallelism, i.e. maximum outdegree of the nodes in the DAG (2, 4, 6, 8 and 10). The algorithms have been executed and compared on grids of heterogeneous clusters of different sizes (5, 10, 15, 20 and 25 clusters) with 4 resources in each cluster. The proposed algorithm (DBSA) has been compared with DCA [11] on the performance metric of effective schedule cost (ESC), as described in section II, with respect to workflows and grids of different sizes. The algorithms have been run under the same conditions for a fair comparison: for each workflow, each algorithm is run to find the best possible secondary-criterion cost while keeping the primary criterion within the defined sliding constraint.
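The first step of Algorithm II — dropping duplicated copies as long as the schedule length stays within L — can be sketched minimally as follows. The schedule, resource prices and the duplicate "n1'" are a toy example of our own invention, and the sketch ignores the re-timing of dependent tasks and the idle-slot charging that the full algorithm performs.

```python
# Illustrative sketch of Algorithm II's duplicate-removal step (toy data).
M = {"p1": 1.0, "p2": 2.5}   # machine cost per unit of busy time

# (task, resource, start, end); "n1'" is a duplicated copy of n1 on p2
schedule = [("n1", "p1", 0, 2), ("n2", "p1", 2, 5),
            ("n1'", "p2", 0, 2), ("n3", "p2", 2, 4), ("n4", "p1", 5, 6)]
duplicates = ["n1'"]

def makespan(sched):
    return max(end for (_, _, _, end) in sched)

def economic_cost(sched):
    # busy time per resource times its price (idle gaps between tasks
    # would also be charged in the paper's model; omitted for brevity)
    return sum((end - start) * M[res] for (_, res, start, end) in sched)

L = makespan(schedule) * 1.10        # c1_prel + 10% of c1_prel

for dup in sorted(duplicates):       # non-decreasing start-time order
    trial = [t for t in schedule if t[0] != dup]
    if makespan(trial) <= L:         # removal keeps us within the constraint
        schedule = trial

print(economic_cost(schedule), makespan(schedule))   # -> 11.0 6
```

Removing the copy trims the busy time on the expensive resource p2, so the cost drops while the makespan is unchanged; the paper's second loop would then try migrating tasks to cheaper resources under the same constraint.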
The algorithm is run for primary scheduling for both criteria (makespan and economic cost) to find the best and worst solutions for the primary criterion (c1_best and c1_worst), which yield the maximum sliding constraint, i.e. the difference |c1_worst - c1_best|. The algorithm is run for three different sliding constraint values: 25%, 50% and 75% of the schedule difference |c1_worst - c1_best|. The simulated results and graphs reveal that the proposed bi-criteria scheduling approach outperforms the DCA algorithm in terms of both economic cost and schedule length. In fig. 5 and fig. 6, the DBSA algorithm yields a reduced effective schedule cost (ESC) as compared to DCA over the grid. The simulation parameters for modeling workflows and grid environments are presented in Table IV.

TABLE IV. GRID ENVIRONMENT LAYOUTS
Number of grid resources:      [20, 100]
Resource bandwidth:            [100 Mbps, 1 Gbps]
Number of tasks:               [100, 500]
Computation cost of tasks:     [50, 2000] ms
Data transfer size:            [20 Kbytes, 20 Mbytes]
Resource capability (MIPS):    [220, 580]
Execution cost (per MIPS):     [1-5 $ per MIPS]

[Figure 5. Effect of grid sizes (20-100 resources) on effective schedule cost, DBSA vs. DCA.]

[Figure 6. Effective schedule cost on workflows of 100-500 tasks, DBSA vs. DCA.]

VI. CONCLUSIONS

In this paper, a novel bi-criteria workflow scheduling approach has been presented and analyzed. We have proposed an efficient scheduling algorithm called the Duplication-based Bi-criteria Scheduling Algorithm (DBSA), which optimizes both the makespan and the economic cost of the schedule. The schedule generated by the DBSA algorithm is much more optimized than those of other related bi-criteria algorithms in respect of both makespan and economic cost. The algorithms have been implemented to schedule different random DAGs onto different grids of heterogeneous clusters of various sizes. Different variants of the algorithm were modeled and evaluated.

REFERENCES

[1] I. Foster and C. Kesselman, "The Grid 2: Blueprint for a New Computing Infrastructure", Morgan Kaufmann Pub., Elsevier Inc., 2004.
[2] Z. Shi and J. J. Dongarra, "Scheduling workflow applications on processors with different capabilities", Elsevier, 2005.
[3] H. Topcuoglu, S. Hariri and M. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing", IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, pages 260-274, March 2002.
[4] S. Bansal, P. Kumar and K. Singh, "Dealing with Heterogeneity Through Limited Duplication for Scheduling Precedence Constrained Task Graphs", Journal of Parallel and Distributed Computing, 65(4): 479-491, Apr 2005.
[5] A. Dogan and F. Ozguner, "LDBS: A Duplication Based Scheduling Algorithm for Heterogeneous Computing Systems", Proceedings of the Int'l Conf. on Parallel Processing, pp. 352-359, Aug 2002.
[6] R. Sakellariou and H. Zhao, "A hybrid heuristic for DAG scheduling on heterogeneous systems", 13th IEEE Heterogeneous Computing Workshop (HCW'04), Santa Fe, New Mexico, USA, April 2004.
[7] N. Ranaldo and E. Zimeo, "Time and Cost-Driven Scheduling of Data Parallel Tasks in Grid Workflows", IEEE Systems Journal, vol. 3, no. 1, pp. 104-120, March 2009.
[8] J. Yu, R. Buyya and C. K. Tham, "Cost-based Scheduling of Scientific Workflow Applications on Utility Grids", Proceedings of the 1st IEEE International Conference on e-Science and Grid Computing (e-Science 2005), Melbourne, Australia, IEEE CS Press, Dec. 2005.
[9] C. Ernemann, V. Hamscher and R. Yahyapour, "Economic Scheduling in Grid Computing", Proceedings of the 8th Workshop on Job Scheduling Strategies for Parallel Processing, Vol. 2537 of Lecture Notes in Computer Science, Springer, pages 128-152, 2002.
[10] C. Li and L. Li, "Utility-based QoS optimization strategy for multi-criteria scheduling in Grid", JPDC, 2006.
[11] M. Wieczorek, S. Podlipnig, R. Prodan and T. Fahringer, "Bi-criteria Scheduling of Scientific Workflows for the Grid", 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID '08), pp. 9-16, 19-22 May 2008.
[12] A. Dogan and F. Ozguner, "Biobjective Scheduling Algorithms for Execution Time-Reliability Trade-off in Heterogeneous Computing Systems", Comput. J., vol. 48, no. 3, pp. 300-314, 2005.
[13] J. Yu and R. Buyya, "Scheduling Scientific Workflow Applications with Deadline and Budget Constraints using Genetic Algorithms", Scientific Programming Journal, vol. 14, no. 1, pp. 217-230, 2006.
[14] A. Agarwal and P. Kumar, "An Effective Compaction Strategy for Bi-criteria DAG Scheduling in Grids", Int. J. of Communication Networks and Distributed Systems (IJCNDS), Inderscience Publishers, in press.
[15] E. Deelman, J. Blythe, Y. Gil and C. Kesselman, "Workflow Management in GriPhyN", Grid Resource Management: State of the Art and Future Trends, pages 99-116, 2004.
[16] A. Agarwal and P. Kumar, "Economical Duplication Based Task Scheduling for Heterogeneous and Homogeneous Computing Systems", IEEE International Advance Computing Conference (IACC 2009), pp. 87-93, 6-7 March 2009.
[17] E. Tsiakkouri, H. Zhao, R. Sakellariou and M. D. Dikaiakos, "Scheduling Workflows with Budget Constraints", Proceedings of the CoreGRID Workshop "Integrated Research in Grid Computing", S. Gorlatch and M. Danelutto, Eds., Nov. 2005, pp. 347-357.
ADCOM 2009
HUMAN COMPUTER
INTERFACE -2
Session Papers:
1. Mozaffar Afaq, Mohammed Qadeer, Najaf Zaidi and Sarosh Umar, "Towards Geometrical Password for Mobile Phones"
Towards Geometrical Password for Mobile Phones
Abstract — Mobile cell phones have brought a revolution to the modern world. They have become profound instruments of social as well as financial transformation. Mobile phones today not only hold the key to communication but can also be a suitable medium for commercial and financial transactions. There is an urgent need to establish ways to authenticate people over cell phones. The current method of authentication uses an alphanumeric username and password. The textual password scheme is convenient but has its drawbacks: alphanumeric passwords are often easy to guess, offer limited possibilities and are easily forgotten. With financial transactions at stake, the need of the hour is a collection of robust schemes for authentication. Graphical passwords are one such scheme, offering a plethora of options and combinations. We propose a scheme which is simple for the user and robust at the same time. A graphical password formed by drawing geometries provides a larger password space and, at the same time, allows users to use their photographic memory, making it easy to remember. The proposed scheme is suitable for all touch-sensitive mobile phones.

Keywords: User authentication, graphical password, smart phone security, geometrical password.

I. INTRODUCTION

Cell phones have become a necessity. The ability to keep in touch with family and business associates, and access to email, are not the only reasons for the increasing demand for cell phones. Today's technically advanced cell phones are capable not only of receiving and placing phone calls, but can conveniently store data, take pictures and connect to the internet. These features have allowed them to become successful mediums in the field of e-commerce. With the foray of mobiles into the world of finance, a method to authenticate users and their transactions was required. Textual passwords were the first choice, not because they were robust, but because they were easy to implement. With this choice we opted for a catch-22 situation: if passwords are easy to remember, they are also easy to crack or guess, and when they are complex they are easily forgotten.

Most passwords chosen by users of textual password systems are dictionary-based, and this makes the cracker's (one who tries to guess your password) job easier. Armed with a dictionary of 250,000 words, a cracker could compare their encryptions with those already stored in the password file in a little more than five minutes [1]. Even if edited words are included, this adds only 14 to 17 additional tests per word, adding another 1,000,000 words to the list of possible passwords for each user [1]. In this paper we demonstrate a graphical grid-based password scheme which aims to provide a huge password space along with ease of use. We also analyze its strength by examining the success of the brute-force technique. In this scheme we try to make the password easy for the user to remember and more complex for the attacker.

II. RELATED WORK

Many papers have been published in recent years with a vision of a graphical technique for user authentication. Primarily there are just two methods, based on recall and recognition respectively. Traditionally both methods have been realized through the textual password space, which makes them easy to implement but at the same time easy to crack.

Figure 1: VisKey SFR
Studies show recognition rates of about 90% for 2,560 pictures, each viewed for only a few seconds [2]. Clearly the human mind is best suited to responding to visuals. A recall-based password approach is VisKey [3], which is designed for PDAs. In this scheme, to form a password, users tap spots in sequence. Since PDAs have small screens, it is difficult to point to the exact location of a spot. Theoretically the scheme provides a large password space, but not large enough to withstand a brute-force attack if the number of spots is fewer than seven [4].
In figure 4 above, we have considered a 4x5 grid, keeping in mind the typical screen size of today's PDAs and its width-to-height ratio. Depending on the screen size, the grid can be changed to any justifiable number of rows and columns. Taking that size (4x5), we have a total of 5x4 = 20 blocks, and each block has four triangles, so the possible triangles number (20 blocks) x (4 triangles/block) = 80. Similarly, each block has 4 small diagonal lines, giving (20 blocks) x (4 lines/block) = 80 such lines. We also have the lines that result from joining adjacent points horizontally and vertically: 4x6 = 24 horizontal and 5x5 = 25 vertical lines, for a total of 24+25 = 49 horizontal and vertical lines. In this way we have a total of 80 + 80 + 49 = 209 selectable objects.
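The count above generalizes to any block grid, which can be written as a small function (an illustrative sketch; the grid dimensions are a parameter rather than fixed at 4x5, and the function name is our own):

```python
# Count the selectable objects on a grid of `rows` x `cols` blocks
# (the paper uses 5 rows by 4 columns for a typical PDA screen).
def selectable_objects(rows, cols):
    blocks = rows * cols
    triangles = 4 * blocks            # four triangles per block
    diagonals = 4 * blocks            # four small diagonal lines per block
    horizontal = (rows + 1) * cols    # lines joining horizontally adjacent points
    vertical = rows * (cols + 1)      # lines joining vertically adjacent points
    return triangles + diagonals + horizontal + vertical

print(selectable_objects(5, 4))   # 80 + 80 + 24 + 25 = 209
```

A larger screen with, say, 6x5 blocks would already yield 251 objects, so the selectable-object space, and with it the password space, grows quickly with screen size.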
These 209 objects can be used to choose a password by
drawing some of them in an efficient manner. A password is
the selection of certain lines and triangles. When a triangle is
selected it is filled with some color, and when a line is selected
the color of that line changes (it gets highlighted). Any
combination of selected lines and triangles forms a password,
as shown in figure 5. In this way highlighted lines and filled
triangles provide a larger password space. Filling a triangle and
highlighting a line can be done with the PDA stylus, either by
putting a dot in the triangle or by dragging the stylus across
the line. Research shows that as the number of dots increases,
the difficulty of remembering them also increases. In this
scheme, the filled triangles and highlighted lines make a
geometric shape which is to be recalled, not the dots.
Moreover, we provide another option, an "Invert" button, which
converts all highlighted lines to un-highlighted and vice-versa,
and does the same for filled triangles with a single click; this
at least doubles the password space within the practical limit
of password length. A line which is not inclined at an angle of
45°, 0° or 90°, i.e. a line which is not parallel to the diagonal,
horizontal or vertical lines (let us call these non-parallel
lines), can also be drawn by joining two points after enabling
such drawing with the button labeled "Line". As we can see
in figure 6, crossing the same line again cancels the effect of
highlighting; in general, crossing the same line an even number
of times cancels the highlighting effect. Users do not need to
recall the strokes, only the resulting geometry. Using the
inversion operation shown in figure 7, the user can deselect
all currently highlighted lines and triangles and select all the
unselected lines and triangles. Note that the inversion does not
take place for non-parallel lines.

Figure 8 shows a password made by using parallel and
non-parallel lines. To draw such lines, the stylus is dragged
from one point to another. The start point and end point of
such a line are decided by where the stylus touches the screen
and where it leaves it. As illustrated in figure 8, if the stylus
touches the screen at some coordinate (x, y) between two
vertical lines va and vb (the nearest vertical lines, within half
a cell width) such that va ≤ x < vb, and ha ≤ y < hb, the
nearest grid point of that region, P, is taken as the start point.
The same strategy is adopted for the end point, where the
stylus releases the screen. If the lines drawn by the user are
parallel but the procedure adopted to draw them is that of
non-parallel lines, the scheme automatically detects this; even
if parallel lines are drawn by the non-parallel method of
drawing, they are considered parallel lines.
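The object count for the grid, and the password-space figures used later in the security analysis, can be checked with a short script. This is an illustrative sketch only; the function name is hypothetical, and the count of 220 non-parallel lines is taken directly from the paper's text rather than derived here.

```python
def grid_objects(block_rows, block_cols):
    """Count selectable objects in a grid of blocks, per the scheme:
    each block holds 4 triangles and 4 small diagonal lines, plus the
    horizontal/vertical lines joining adjacent grid points."""
    blocks = block_rows * block_cols
    triangles = blocks * 4
    diagonals = blocks * 4
    # A grid of r x c blocks has (r+1) x (c+1) points.
    horizontal = (block_rows + 1) * block_cols   # 6 rows of 4 segments
    vertical = block_rows * (block_cols + 1)     # 5 columns of 5 segments
    return triangles + diagonals + horizontal + vertical

objects = grid_objects(5, 4)        # the paper's 4x5 grid (5 rows x 4 cols of blocks)
print(objects)                      # 209
print(2 ** objects)                 # password space excluding non-parallel lines
print(2 ** (objects + 220))         # including the paper's 220 non-parallel lines
```

Each of the 209 objects is independently selected or not, which is why the space is a power of two; 2^209 is roughly 8.23 x 10^62 and 2^429 roughly 1.39 x 10^129, matching the figures quoted in the security analysis.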
Figure 6: Drawing lines
Figure 7: Inversion of drawn geometry
VI. STORAGE OF PASSWORD
= 209 + 1 + 10 + np * 10;
VII. SECURITY ANALYSIS

As we have seen in eqn. (1), we have 209 objects, each of which
is individually either highlighted or unselected. Considering
only the 209 objects and excluding the non-parallel lines, we
have a total of 2^209 ≈ 8.2275 x 10^62 possibilities, which is
huge in terms of password space.

So the scheme is very robust from a security point of view even
after excluding non-parallel lines. If we consider the
non-parallel lines as well, an additional 220 lines are added,
each also either highlighted or unselected, so in that case the
total number of possible passwords is 2^(209+220) = 2^429 ≈
1.386 x 10^129. It is clear that the password space increases
exponentially with the number of rows or columns, as shown in
the table above. Devices with a bigger screen (like an ATM) can
have many more columns and rows.

Due to this larger password space it is very difficult to carry
out a brute force attack on this password. With this scheme,
even if the user decides to have a graphical representation of
text, he will be least susceptible to dictionary attacks. We have
computed the above password space in the simple case with only
a 4x5 grid and single-stage password entry. If we include such
extensions, the password space of the scheme increases many
fold. Since we have made no special assumption for text
simulation in this scheme, the password space remains the same
even if we take it as a textual password scheme.

VIII. CASE STUDY

We requested 25 users to try this scheme and share their
experience with us. When asked to rate the ease of use of the
new methodology on a scale of 1-10, we got an average of 8.5.
Twenty out of twenty-five users said they found it easier to
remember passwords as graphical geometries. Twenty-one users
out of twenty-five could reproduce their passwords after an
interval of one week.

IX. CONCLUSION AND FUTURE WORK

In this paper we have proposed a graphical password scheme in
which the user can draw simple geometrical shapes consisting of
lines and solid triangles. The user does not need to remember
the way in which the password has been drawn but just the final
geometrical shape.

This scheme gives more password space and is competent in
resisting brute force attacks. This way of storing the password
requires less space to store passwords as compared to other
graphical schemes.

This scheme is immune to shoulder surfing as the screen of the
hand-held device is visible to the user only. However, when
employed on PCs and ATM machines it is susceptible to shoulder
surfing. To make it more robust and handle the problem of
shoulder surfing, we would have to take into account the order
in which the various components of the geometrical shape were
drawn, i.e. which line or triangle was selected first, which
line was selected next, and so on. This consideration will
limit the scheme's vulnerability to shoulder surfing and will
also expand the password space.

REFERENCES

1. Daniel V. Klein, "Foiling the Cracker: A Survey of, and
   Improvements to, Password Security".
2. "Perception and memory for pictures: Single-trial learning of
   2500 visual stimuli," Psychonomic Science, 19(2):73-74, 1970.
3. SFR-IT-Engineering,
   http://www.sfrsoftware.de/cms/EN/pocketpc/viskey/, accessed
   January 2007.
4. Muhammad Daniel Hafiz, Abdul Hanan Abdullah, Norafida Ithnin,
   Hazinah K. Mammi, "Towards Identifying Usability and Security
   Features of Graphical Password in Knowledge Based
   Authentication Technique," in Proceedings of the Second Asia
   International Conference on Modelling & Simulation, IEEE
   Computer Society.
5. D. Davis, F. Monrose, and M. Reiter, "On User Choice in
   Graphical Password Schemes," in 13th USENIX Security
   Symposium, 2004.
6. J. Thorpe and P. van Oorschot, "Graphical Dictionaries and the
   Memorable Space of Graphical Passwords," in 13th USENIX
   Security Symposium, 2004.
7. I. Jermyn, A. Mayer, F. Monrose, M. Reiter, and A. Rubin, "The
   Design and Analysis of Graphical Passwords," in 8th USENIX
   Security Symposium, 1999.
8. R.-S. French, "Identification of Dot Patterns From Memory as a
   Function of Complexity," Journal of Experimental Psychology,
   47:22–26, 1954.
9. S.-I. Ichikawa, "Measurement of Visual Memory Span by Means of
   the Recall of Dot-in-Matrix Patterns," Behavior Research
   Methods and Instrumentation, 14(3):309–313, 1982.
10. Xiaoyuan Suo, Ying Zhu, G. Scott Owen, "Graphical Passwords:
    A Survey," in Proceedings of the 21st Annual Computer
    Security Applications Conference (ACSAC 2005), IEEE Computer
    Society.
11. Konstantinos Chalkias, Anastasios Alexiadis, George
    Stephanides, "A Multi-Grid Graphical Password Scheme".
12. Julie Thorpe and P.C. van Oorschot, "Towards Secure Design
    Choices for Implementing Graphical Passwords," in Proceedings
    of the 20th Annual Computer Security Applications Conference
    (ACSAC'04), IEEE Computer Society.
Improving Performance of Speaker Identification
System Using Complementary Information Fusion
Md. Sahidullah, Sandipan Chakroborty and Goutam Saha
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology Kharagpur, Kharagpur-721 302, India
Email: sahidullah@iitkgp.ac.in, mail2sandi@gmail.com, gsaha@ece.iitkgp.ernet.in
Telephone: +91-3222-283556/1470, FAX: +91-3222-255303
Abstract—Feature extraction plays an important role as a
front-end processing block in the speaker identification (SI)
process. Most SI systems utilize features like Mel-Frequency
Cepstral Coefficients (MFCC), Perceptual Linear Prediction
(PLP), or Linear Predictive Cepstral Coefficients (LPCC) for
representing the speech signal. Their derivations are based on
short-term processing of the speech signal, and they try to
capture the vocal tract information while ignoring the
contribution from the vocal cord. Vocal cord cues are equally
important in the SI context, as information like pitch
frequency and the phase of the residual signal could convey
important speaker-specific attributes and is complementary to
the information contained in spectral feature sets. In this
paper we propose a novel feature set extracted from the
residual signal of LP modeling. Higher-order statistical
moments are used here to find the nonlinear relationship in
the residual signal. To get the advantages of complementarity,
the vocal cord based decision score is fused with the vocal
tract based score. The experimental results on two public
databases show that the fused-mode system outperforms single
spectral features.

Index Terms—Speaker Identification, Feature Extraction,
Higher-order Statistics, Residual Signal, Complementary
Feature.

I. INTRODUCTION

Speaker identification is the process of identifying a person
by his/her voice signal [1]. A state-of-the-art speaker
identification system requires a feature extraction unit as a
front-end processing block followed by an efficient modeling
scheme. Vocal tract information, such as formant frequencies
and their bandwidths, is supposed to be unique for human
beings. The basic target of the feature extraction block is to
characterize this information. The feature extraction process
also represents the original speech signal in a compact and
robust format while emphasizing the speaker-specific
information. Most speaker identification systems use Mel
Frequency Cepstral Coefficients (MFCC) or Linear Prediction
Cepstral Coefficients (LPCC) as the feature extraction block
[1]. MFCC is a modification of the conventional Linear
Frequency Cepstral Coefficient, keeping in mind the auditory
system of human beings [2]. On the other hand, LPCC is based
on time-domain processing of the speech signal [3]. Later,
conventional LPCC was also modified, motivated by the
perceptual properties of the human ear [4]. Like the vocal
tract, vocal cord information also contains some
speaker-specific information [5]. The residual signal, which
can be obtained from Linear Prediction (LP) analysis of the
speech signal, contains information related to the source, or
vocal cord. Earlier, Auto-associative Neural Networks (AANN),
Wavelet Octave Coefficients of Residues (WOCOR), residual
phase, etc. were used to extract the information from the
residual signal. In this work we introduce higher-order
statistical moments to capture the information from the
residual signal, and we integrate the vocal cord information
with the vocal tract information to boost the performance of
the speaker identification system. The log-likelihood scores
of both systems are fused together to get the advantages of
their complementarity [6], [7]. The speaker identification
results on both databases prove that by combining the two
systems, the performance can be improved over baseline
spectral feature based systems.

This paper is organized as follows. In Section II we first
review the basics of linear prediction analysis, followed by
the proposed feature extraction technique. The speaker
identification experiment with results is shown in Section
III. Finally, the paper is concluded in Section IV.

II. FEATURE EXTRACTION FROM RESIDUAL SIGNAL

In this section we first explain the conventional method of
derivation of the residual signal by LP analysis. The proposed
feature extraction process is described subsequently.

A. Linear Prediction Analysis and Residual Signal

In the LP model, the (n-1)-th to (n-p)-th samples of the
speech wave (n, p are integers) are used to predict the n-th
sample. The predicted value of the n-th speech sample [3] is
given by

\hat{s}(n) = \sum_{k=1}^{p} a(k) s(n-k)    (1)

where \{a(k)\}_{k=1}^{p} are the predictor coefficients and
s(n) is the n-th speech sample. The value of p is chosen such
that it can effectively capture the real and complex poles of
the vocal tract in a frequency range equal to half the
sampling frequency. The Prediction Coefficients (PC) are
determined by
Fig. 1. Example of two speech frames (top), their LP residuals (middle) and corresponding residual moments (bottom).
minimizing the mean square prediction error [1], where the
error is defined as

E = \frac{1}{N} \sum_{n=0}^{N-1} (s(n) - \hat{s}(n))^2    (2)

where the summation is taken over all N samples. The set of
coefficients \{a(k)\}_{k=1}^{p} which minimize the
mean-squared prediction error is obtained as the solution of
the set of linear equations

\sum_{k=1}^{p} \phi(j,k) a(k) = \phi(j,0), \quad j = 1, 2, 3, \ldots, p    (3)

where

\phi(j,k) = \frac{1}{N} \sum_{n=0}^{N-1} s(n-j) s(n-k)    (4)

The PC, \{a(k)\}_{k=1}^{p}, are derived by solving the
recursive equation (3).

Using the \{a(k)\}_{k=1}^{p} as model parameters, equation (5)
represents the fundamental basis of the LP representation:

s(n) = -\sum_{k=1}^{p} a(k) s(n-k) + e(n)    (5)

It implies that any signal can be defined by a linear
predictor and its prediction error. The LP transfer function
can be defined as

H(z) = \frac{G}{1 + \sum_{k=1}^{p} a(k) z^{-k}} = \frac{G}{A(z)}    (6)

where G is the gain scaling factor for the present input and
A(z) is the p-th order inverse filter. These LP coefficients
can themselves be used for speaker recognition, as they
contain some speaker-specific information such as the vocal
tract resonance frequencies and their bandwidths.

The prediction error e(n) is called the Residual Signal, and
it contains the complementary information that is not
contained in the PC. It is worth mentioning here that the
residual signal conveys vocal source cues such as the
fundamental frequency and pitch period.

B. Statistical Moments of Residual Signal

The residual signal introduced in Section II-A generally has a
noise-like behavior and a flat spectral response. Though it
contains vocal source information, it is very difficult to
characterize it perfectly. In the literature, Wavelet Octave
Coefficients of Residues (WOCOR) [7] and Auto-associative Neural
Networks (AANN) [5], residual phase [6], etc. are used to
extract the residual information. It is worth mentioning here
that higher-order statistics have shown significant results in
a number of signal processing applications [8] when the nature
of the signal is non-Gaussian. Higher-order statistics have
also drawn the attention of researchers for retrieving
information from LP residual signals [9]. Recently, the
higher-order cumulant of the LP residual signal has been
investigated [10] for improving the performance of speaker
identification systems.

The higher-order statistical moments of a signal parameterize
the shape of a function [11]. Let the distribution of a random
signal x be denoted by P(x); the central moment of order k of
x is denoted by

M_k = \int_{-\infty}^{\infty} (x - \mu)^k \, dP    (7)

for k = 1, 2, 3, \ldots, where \mu is the mean of x.

On the other hand, the characteristic function of the
probability distribution of the random variable is given by

\varphi_X(t) = \int_{-\infty}^{\infty} e^{jtx} \, dP = \sum_{k=0}^{\infty} M_k \frac{(jt)^k}{k!}    (8)

From the above equation it is clear that the moments M_k are
the coefficients of the expansion of the characteristic
function. Hence, they can be treated as one set of expressive
constants of a distribution. Moments can also effectively
capture the randomness of the residual signal of
autoregressive modeling [12].

In this paper, we use higher-order statistical moments of the
residual signal to parameterize the vocal source information.
The feature derived by the proposed technique is termed the
Higher Order Statistical Moment of Residual (HOSMR). The
different blocks of the proposed feature extraction technique
are shown in fig. 2.

Fig. 2. Block diagram of the Residual Moment Based Feature
Extraction Technique (windowed speech frame -> LP analysis ->
inverse filtering -> magnitude normalization -> higher-order
moment computation -> residual moment feature).

At first the residual signal is normalized to the range
[-1, +1]. Then the central moment of order k of a residual
signal e(n) is computed as

m_k = \frac{1}{N} \sum_{n=0}^{N-1} (e(n) - \mu)^k    (9)

where \mu is the mean of the residual signal over a frame. As
the range of the residual signal is normalized, the
first-order moment (i.e. the mean) becomes zero. The
higher-order moments (for k = 2, 3, 4, \ldots, K) are taken as
vocal source features, as they represent the shape of the
distribution of the random signal. The lower-order moments are
a coarse parametrization, whereas the higher orders are a
finer representation of the residual signal. In fig. 1, the LP
residual signal of a frame is shown along with its
higher-order moments. It is clear from the picture that, when
the lower-order moments are considered, both the even and odd
order values are highly differentiable.

C. Fusion of Vocal Tract and Vocal Cord Information

In this section we propose to integrate vocal tract and vocal
cord parameters for identifying speakers. Although the two
approaches have a significant performance difference, the ways
they represent the speech signal are complementary to one
another. Hence, it is expected that combining the advantages
of both features will improve [13] the overall performance of
the speaker identification system. The block diagram of the
combined system is shown in fig. 3. Spectral features and
residual features are extracted from the training data in two
separate streams. Consequently, speaker modeling is performed
for the respective features independently and the model
parameters are stored in the model database. At the time of
testing, the same process is adopted for feature extraction.
The log-likelihoods of the two different features are computed
w.r.t. their corresponding models. Finally, the output scores
are weighted and combined.

We have used score-level linear fusion which, to get the
advantages of both systems and their complementarity, can be
formulated as follows:

LLR_{combined} = \eta \, LLR_{spectral} + (1 - \eta) \, LLR_{residual}    (10)

where LLR_{spectral} and LLR_{residual} are the log-likelihood
ratios calculated from the spectral and residual based
systems, respectively. The fusion weight is decided by the
parameter \eta.

III. SPEAKER IDENTIFICATION EXPERIMENT

A. Experimental Setup

1) Pre-processing stage: In this work, the pre-processing
stage is kept the same throughout the different feature
extraction methods. It is performed using the following steps:

∙ Silence removal and end-point detection are done using an
  energy threshold criterion.
∙ The speech signal is then pre-emphasized with a 0.97
  pre-emphasis factor.
∙ The pre-emphasized speech signal is segmented into frames of
  20 ms each with 50% overlap, i.e. the total number of
  samples in each frame is N = 160 (sampling frequency
  F_s = 8 kHz).
∙ In the last step of pre-processing, each frame is windowed
  using the Hamming window given by

w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right)    (11)

where N is the length of the window.
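Putting sections II-A and II-B together, the HOSMR feature of a single frame can be sketched in a few lines. This is a simplified illustration, not the authors' code: the function names are hypothetical, the predictor coefficients are obtained by directly solving the normal equations (3)-(4) with plain Gaussian elimination (no pivoting) rather than a production Levinson-Durbin routine, and pre-emphasis/windowing are assumed to have been applied already.

```python
def lp_coefficients(s, p):
    """Solve the normal equations (3)-(4) for a(1)..a(p), using
    covariance-style sums over the valid sample range n = p..N-1."""
    N = len(s)
    def phi(j, k):
        return sum(s[n - j] * s[n - k] for n in range(p, N)) / N
    # Augmented p x (p+1) system: phi(j,k) a(k) = phi(j,0).
    A = [[phi(j, k) for k in range(1, p + 1)] + [phi(j, 0)]
         for j in range(1, p + 1)]
    for i in range(p):                        # forward elimination
        for r in range(i + 1, p):
            f = A[r][i] / A[i][i]
            for c in range(i, p + 1):
                A[r][c] -= f * A[i][c]
    a = [0.0] * p
    for i in reversed(range(p)):              # back substitution
        a[i] = (A[i][p] - sum(A[i][c] * a[c]
                              for c in range(i + 1, p))) / A[i][i]
    return a

def hosmr(s, p, K):
    """Residual e(n) = s(n) - s_hat(n), normalized to [-1, 1], then
    the central moments m_2..m_K of equation (9)."""
    a = lp_coefficients(s, p)
    e = [s[n] - sum(a[k] * s[n - k - 1] for k in range(p))
         for n in range(p, len(s))]
    peak = max(abs(x) for x in e) or 1.0
    e = [x / peak for x in e]                 # magnitude normalization
    mu = sum(e) / len(e)
    return [sum((x - mu) ** k for x in e) / len(e)
            for k in range(2, K + 1)]
```

With the paper's configuration (p = 17, K = 6 moments retained, N = 160 samples per frame), `hosmr(frame, 17, 6)` would yield a 5-dimensional vocal source feature per frame; the sketch above recovers the generating coefficients exactly when fed a noiseless autoregressive signal.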
Fig. 3. Block diagram of Fusion Technique: Score level fusion of Vocal tract (short term spectral based feature) and Vocal cord information (Residual).
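At identification time, the score-level fusion depicted in fig. 3 reduces to a weighted sum of the two per-speaker log-likelihood scores (equation (10)) followed by an arg-max over the enrolled speakers. A minimal sketch, with made-up scores and a hypothetical function name:

```python
def identify(spectral_llr, residual_llr, eta=0.5):
    """Combine per-speaker log-likelihood scores from the spectral
    and residual streams and pick the best-scoring speaker index."""
    assert len(spectral_llr) == len(residual_llr)
    combined = [eta * s + (1 - eta) * r
                for s, r in zip(spectral_llr, residual_llr)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined

# Hypothetical scores for three enrolled speakers:
spectral = [-1210.4, -1185.9, -1248.2]   # vocal tract stream
residual = [-1190.7, -1201.3, -1195.0]   # vocal cord (HOSMR) stream
best, _ = identify(spectral, residual, eta=0.5)
print(best)   # -> 1
```

Note that with eta = 0.5, as in the paper's experiments, the two streams contribute equal evidence; here speaker 1 wins even though each stream alone ranks the speakers differently, which is the complementarity the fusion exploits.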
2) Classification & identification stage: The Gaussian Mixture
Modeling (GMM) technique is used to get a probabilistic model
for the feature vectors of a speaker. The idea of GMM is to
use a weighted summation of multivariate Gaussian functions to
represent the probability density of the feature vectors,
given by

p(\mathbf{x}) = \sum_{i=1}^{M} p_i \, b_i(\mathbf{x})    (12)

where \mathbf{x} is a d-dimensional feature vector,
b_i(\mathbf{x}), i = 1, ..., M are the component densities and
p_i, i = 1, ..., M are the mixture weights, or priors of the
individual Gaussians. Each component density is given by

b_i(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right\}    (13)

with mean vector \boldsymbol{\mu}_i and covariance matrix
\Sigma_i. The mixture weights must satisfy the constraints
\sum_{i=1}^{M} p_i = 1 and p_i \geq 0. The Gaussian mixture
model is parameterized by the means, covariances and mixture
weights of all component densities, and is denoted by

\lambda = \{ p_i, \boldsymbol{\mu}_i, \Sigma_i \}_{i=1}^{M}    (14)

In SI, each speaker is represented by a GMM and is referred to
by his/her model \lambda. The parameters of \lambda are
optimized using the Expectation Maximization (EM) algorithm
[14]. In these experiments, the GMMs are trained with 10
iterations, where the clusters are initialized by the vector
quantization [15] algorithm.

In the identification stage, the log-likelihood score of the
feature vectors of the utterance under test is calculated as

\log p(\mathbf{X}|\lambda) = \sum_{t=1}^{T} \log p(\mathbf{x}_t|\lambda)    (15)

where \mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, ...,
\mathbf{x}_T\} is the feature vector sequence of the test
utterance.

In the closed-set SI task, an unknown utterance is identified
as an utterance of the particular speaker whose model gives
the maximum log-likelihood. This can be written as

\hat{S} = \arg\max_{1 \leq k \leq S} \sum_{t=1}^{T} \log p(\mathbf{x}_t|\lambda_k)    (16)

where \hat{S} is the identified speaker from the speakers'
model set \Lambda = \{\lambda_1, \lambda_2, ..., \lambda_S\}
and S is the total number of speakers.

3) Databases for experiments:

YOHO Database: The YOHO voice verification corpus [1], [16]
was collected while testing ITT's prototype speaker
verification system in an office environment. Most subjects
were from the New York City area, although there were many
exceptions, including some non-native English speakers. A
high-quality telephone handset (Shure XTH-383) was used to
collect the speech; however, the speech was not passed through
a telephone channel. There are 138 speakers (106 males and 32
females); for each speaker, there are 4 enrollment sessions of
24 utterances each and 10 test sessions of 4 utterances each.
In this work, a closed-set text-independent speaker
identification problem is attempted where we consider all 138
speakers as client speakers. For a speaker, all 96 (4 sessions
× 24 utterances) utterances are used for developing the
speaker model, while for testing, 40 (10 sessions × 4
utterances) utterances are put under test. Therefore, for 138
speakers we put 138 × 40 = 5520 utterances under test and
evaluated the identification accuracies.

POLYCOST Database: The POLYCOST database [17] was recorded as
a common initiative within the COST 250 action during
January-March 1996. It contains around 10 sessions recorded by
134 subjects from 14 countries. Each session consists of 14
items, two of which (MOT01 & MOT02 files) contain speech in
the subject's mother tongue. The database was collected
through the European telephone network. The recording was
performed with ISDN cards on two XTL SUN platforms with an
8 kHz sampling rate. In this work, a closed-set
text-independent speaker identification problem is addressed
where only the mother tongue (MOT) files are used. The
specified guideline [17] for conducting closed-set speaker
identification experiments is adhered to, i.e. 'MOT02' files
from the first four sessions are used to build a speaker model
while 'MOT01' files from session five onwards are taken for
testing. As with the YOHO database, all speakers (131 after
deletion of three speakers) in the database were registered as
clients.

4) Score Calculation: In the closed-set speaker identification
problem, identification accuracy as defined in [18] and given
by equation (17) is followed:

Percentage of identification accuracy (PIA) = \frac{\text{No. of utterances correctly identified}}{\text{Total no. of utterances under test}} \times 100    (17)

B. Speaker Identification Experiments and Results

The performance of the speaker identification system based on
the proposed HOSMR feature is evaluated on both databases. The
order of LP is kept at 17, and 6 residual moments are taken to
characterize the residual information. We have conducted
experiments with a GMM-based classifier for different model
orders. The identification results are shown in Table I. The
identification performance is quite low, because the vocal
cord parameters are not the only cues for identifying
speakers; still, they make an inherent contribution to
recognition, and at the same time they contain information
which is not contained in the spectral features. The combined
performance of both systems is therefore to be observed. We
have conducted SI experiments using two major kinds of
baseline features: some are based on LP analysis (LPCC and
PLPCC) and the others (LFCC and MFCC) are based on filterbank
analysis. The feature dimension is set at 19 for all kinds of
features for better comparison. In LP-based systems, 19
filters are used for all-pole modeling of the speech signals.
On the other hand, 20 filters are used for the filterbank-based
systems, and 19 coefficients are taken for extracting Linear
Frequency Cepstral Coefficients (LFCC) and MFCC after
discarding the first coefficient, which represents the dc
component. Detailed descriptions are available in [19], [20].
The derivation of LP-based features can be found in [1], [4],
[21].

The performance of the baseline SI systems and fused systems
for different features and different model orders is shown in
Table II and Table III for the POLYCOST and YOHO databases
respectively. In this experiment, we take equal evidence from
the two systems and set the value of \eta to 0.5. The results
for the conventional spectral features follow the results
shown in [22]. The POLYCOST database consists of speech
signals collected over a telephone channel. The improvement
for this database is significant over YOHO, which is
microphonic. The experimental results show significant
performance improvement for the fused SI system compared to
spectral-only systems for various model orders.

TABLE I
SPEAKER IDENTIFICATION RESULTS ON THE POLYCOST AND YOHO
DATABASES USING THE HOSMR FEATURE FOR DIFFERENT MODEL ORDERS
OF GMM (HOSMR CONFIGURATION: LP ORDER = 17, NUMBER OF HIGHER
ORDER MOMENTS = 6).

Database   Model Order   Identification Accuracy
POLYCOST   2             19.4960
           4             21.6180
           8             19.0981
           16            22.4138
YOHO       2             16.8841
           4             18.2246
           8             15.1268
           16            18.2246
           32            21.2138
           64            21.9565

IV. CONCLUSION

The objective of this paper is to propose a new technique to
improve the performance of conventional speaker identification
systems, which are based on spectral features representing
only vocal tract information. The higher-order statistical
moments of the residual signal are derived and treated as a
parameter carrying vocal cord information. The log-likelihoods
of both systems are fused together. The experimental results
on two popular speech corpora prove that significant
improvement can be obtained in the combined SI system.

REFERENCES

[1] J. P. Campbell, Jr., "Speaker recognition: a tutorial,"
Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep
1997.
[2] S. Davis and P. Mermelstein, "Comparison of parametric
representations for monosyllabic word recognition in
continuously spoken sentences," Acoustics, Speech and Signal
Processing, IEEE Transactions on, vol. 28, no. 4, pp. 357–366,
Aug 1980.
TABLE II
SPEAKER IDENTIFICATION RESULTS ON THE POLYCOST DATABASE
SHOWING THE PERFORMANCE OF THE BASELINE (SINGLE STREAM) SYSTEM
AND THE FUSED SYSTEM (HOSMR CONFIGURATION: LP ORDER = 17,
NUMBER OF HIGHER ORDER MOMENTS = 6, FUSION WEIGHT (η) = 0.5).

Feature   Model Order   Baseline System   Fused System
LPCC      2             63.5279           71.4854
          4             74.5358           78.9125
          8             80.3714           81.6976
          16            79.8408           82.8912
PLPCC     2             62.9973           65.7825
          4             72.2812           75.5968
          8             75.0663           77.3210
          16            78.3820           80.5040
LFCC      2             62.7321           71.6180
          4             74.9337           78.1167
          8             79.0451           81.2997
          16            80.7692           83.4218
MFCC      2             63.9257           69.7613
          4             72.9443           76.1273
          8             77.8515           79.4430
          16            77.8515           79.5756

TABLE III
SPEAKER IDENTIFICATION RESULTS ON THE YOHO DATABASE SHOWING
THE PERFORMANCE OF THE BASELINE (SINGLE STREAM) SYSTEM AND THE
FUSED SYSTEM (HOSMR CONFIGURATION: LP ORDER = 17, NUMBER OF
HIGHER ORDER MOMENTS = 6, FUSION WEIGHT (η) = 0.5).

Feature   Model Order   Baseline System   Fused System
LPCC      2             80.9420           84.7101
          4             88.9855           91.0870
          8             93.8949           94.7826
          16            95.6884           96.2862
          32            96.5399           97.1014
          64            96.7391           97.2826
PLPCC     2             66.5761           72.5543
          4             76.9203           81.0507
          8             85.3080           87.7717
          16            90.6341           91.9022
          32            93.5326           94.3116
          64            94.6920           95.3986
LFCC      2             83.0072           85.8152
          4             90.3623           91.7935
          8             94.6196           95.4891
          16            96.2681           96.6848
          32            97.1014           97.3551
          64            97.2464           97.6268
MFCC      2             74.3116           78.6051
          4             84.8551           86.9384
          8             90.6703           92.0290
          16            94.1667           94.6920
          32            95.6522           95.9964
          64            96.7935           97.1014

[6] K. S. R. Murty and B. Yegnanarayana, "Combining evidence
from residual phase and mfcc features for speaker
recognition," Signal Processing Letters, IEEE, vol. 13, no. 1,
pp. 52–55, Jan. 2006.
[7] N. Zheng, T. Lee, and P. C. Ching, "Integration of
complementary acoustic features for speaker recognition,"
Signal Processing Letters, IEEE, vol. 14, no. 3, pp. 181–184,
March 2007.
[8] A. Nandi, "Higher order statistics for digital signal
processing," Mathematical Aspects of Digital Signal
Processing, IEE Colloquium on, pp. 6/1–6/4, Feb 1994.
[9] E. Nemer, R. Goubran, and S. Mahmoud, "Robust voice
activity detection using higher-order statistics in the lpc
residual domain," Speech and Audio Processing, IEEE
Transactions on, vol. 9, no. 3, pp. 217–231, Mar 2001.
[10] M. Chetouani, M. Faundez-Zanuy, B. Gas, and J. Zarader,
"Investigation on lp-residual representations for speaker
identification," Pattern Recognition, vol. 42, no. 3, pp.
487–494, 2009.
[11] C.-H. Lo and H.-S. Don, "3-d moment forms: their
construction and application to object identification and
positioning," Pattern Analysis and Machine Intelligence, IEEE
Transactions on, vol. 11, no. 10, pp. 1053–1064, Oct 1989.
[12] S. G. Mattson and S. M. Pandit, "Statistical moments of
autoregressive model residuals for damage localisation,"
Mechanical Systems and Signal Processing, vol. 20, no. 3, pp.
627–645, 2006.
[13] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On
combining classifiers," Pattern Analysis and Machine
Intelligence, IEEE Transactions on, vol. 20, no. 3, pp.
226–239, Mar 1998.
[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum
likelihood from incomplete data via the em algorithm," Journal
of the Royal Statistical Society, Series B (Methodological),
vol. 39, pp. 1–38, 1977.
[15] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for
vector quantizer design," IEEE Transactions on Communications,
vol. COM-28, no. 1, pp. 84–95, 1980.
[16] A. Higgins, J. Porter, and L. Bahler, "Yoho speaker
authentication final report," ITT Defense Communications
Division, Tech. Rep., 1989.
[17] H. Melin and J. Lindberg, "Guidelines for experiments on
the polycost database," in Proceedings of a COST 250 workshop
on Application of Speaker Recognition Techniques in Telephony,
1996, pp. 59–69.
[18] D. Reynolds and R. Rose, "Robust text-independent speaker
identification using gaussian mixture speaker models," Speech
and Audio Processing, IEEE Transactions on, vol. 3, no. 1, pp.
72–83, Jan 1995.
[19] S. Chakroborty, A. Roy, S. Majumdar, and G. Saha,
"Capturing complementary information via reversed filter bank
and parallel implementation with mfcc for improved
text-independent speaker identification," in Computing: Theory
and Applications, 2007. ICCTA '07. International Conference
on, March 2007, pp. 463–467.
[20] S. Chakroborty, "Some studies on acoustic feature
extraction, feature selection and multi-level fusion
strategies for robust text-independent speaker
identification," Ph.D. dissertation, Indian Institute of
Technology, 2008.
[21] L. Rabiner and B.-H. Juang, Fundamentals of Speech
Recognition. First Indian Reprint: Pearson Education, 2003.
[22] D. Reynolds, "Experimental evaluation of features for
robust speaker identification," Speech and Audio Processing,
IEEE Transactions on, vol. 2, no. 4, pp. 639–643, Oct 1994.
Right Brain Testing
Applying Gestalt psychology in Software Testing
Narayanan Palani,
Student in Computer Applications, BITS- Pilani, India.
[Tel:+91-9900465054 E-mail: narayananp24@gmail.com]
expects similar objects in consecutive steps/functionalities and
processes. Similar objects/information can be grouped together
in the GUI to overcome this issue (in user interface testing).

A. Application of Emergence:

An SME can identify 'strong emergence' and 'weak emergence' of
components/functions in web pages by applying emergence to
testing. A tester can validate linkages and parameter-passing
mechanisms by applying 'emergence' to track the various
components/scripts which communicate together at various
stages of user interaction. In user interface testing, it
addresses critical aspects like application and user
interaction problems, and the 'strong emergence' of application
components with interrupts. In integration testing, a tester
can easily track the problems that arise when units are
combined strongly (using strong emergence) or weakly (using
weak emergence). It delivers clear differences between strong
and weak emergence of components.

Figure 2. Weak emergence in pictures 1 and 2; strong emergence in picture 3.

B. Application of Reification:
Unit1 Unit2
C. Application of Multistability:

VII. APPLICATION OF 'PRODUCTIVE THINKING' IN USABILITY TESTING:

Formula:

PTR = (Combinations × Test Techniques) / Time

where PTR is the Productive Thinking Result, which analyses test factors. It can be customized, based on evaluations, to identify the weightage of the Productive Thinking process.
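The formula can be sketched as a small calculation; the example values below are hypothetical, for illustration only:

```python
# Sketch of the Productive Thinking Result (PTR):
#   PTR = (Combinations * Test Techniques) / Time
# The input values below are hypothetical illustrations, not from the text.

def productive_thinking_result(combinations, test_techniques, time_units):
    """Higher PTR suggests more test evaluation achieved per unit time."""
    if time_units <= 0:
        raise ValueError("time must be positive")
    return (combinations * test_techniques) / time_units

# E.g. 12 input combinations exercised with 3 techniques in 4 time units:
print(productive_thinking_result(12, 3, 4))  # -> 9.0
```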
XII. APPLICATION OF NEO-GESTALTS VIEW:

objectives' to test. It delivers a good understanding of the AUT for various testing activities.
ADCOM 2009
MOBILE AD-HOC
NETWORKS
Session Papers:
1. Vijayashree Budyal, Sunilkumar Manvi and Sangamesh Hiremath, “Intelligent Agent based
QoS Enabled Node Disjoint Multipath Routing”
2. Adil Erzin, Soumyendu Raha and V.N. Muralidhara, “Close to Regular Covering by Mobile
Sensors with Adjustable Ranges”
Intelligent Agent based QoS Enabled Disjoint
Multipath Routing in MANETs
*Vijayashree Budyal, #S. S. Manvi, **S. G. Hiremath, *Kala K. M.
Abstract: A mobile ad hoc network (MANET) is an infrastructureless, multihop, wireless, and frequently changing network. To support multimedia applications such as video and voice, MANETs require an efficient routing protocol and a Quality of Service (QoS) mechanism. QoS support in MANETs is an important issue, as best-effort routing is not efficient for supporting multimedia applications. Whenever there is a link break on the route, best-effort protocols need to initiate a new route discovery process, which results in a high routing load. On-demand Node-Disjoint Multipath Routing (NDMR) alleviates these problems and reduces routing overhead. This paper proposes an intelligent agent based QoS enabled NDMR, built on NDMR, for supporting multimedia applications; it considers bandwidth and delay as the QoS metrics for optimal path computation in Mobile Ad hoc Networks (MANETs). A mobile agent is employed to find QoS paths and to select an optimal path among them. The performance of the scheme is evaluated for packet delivery ratio, end-to-end delay, QoS acceptance ratio and route discovery time for various network scenarios.

1. Introduction

Mobile ad hoc networks are infrastructureless networks that can be rapidly deployed. They are characterized by multihop wireless connectivity and a frequently changing network topology [1]. The design of efficient and reliable routing protocols in such a network is a challenging issue. Ad hoc On-demand Distance Vector (AODV) and Dynamic Source Routing (DSR) are the two most widely studied on-demand ad hoc routing protocols. The limitation of both is that they build and rely on a single path for each data transmission. Whenever there is a link break on the route, the protocols need to initiate a new route discovery process, resulting in high routing overheads [2].

Multipath routing is one of the solutions that aim to establish multiple paths between source and destination. Many benefits of multipath routing have been explored [4]. On-demand Node-Disjoint Multipath Routing (NDMR) has two novel aspects compared to the other on-demand multipath protocols: it reduces routing overhead dramatically and achieves multiple node-disjoint routing paths (no node except the source node and destination node is common across the multipaths) [5].

An important issue in multimedia communications is the routing of application data based on QoS requirements. QoS routing is a method of finding QoS routes between a source and destination. If a proper QoS route is identified, the applications will meet the guaranteed services [6]. A biologically inspired QoS routing algorithm based on a swarm-intelligence routing technique is described in [7].

Agent technology is emerging as a new paradigm in the areas of artificial intelligence and computing. Agents are said to become the next-generation components in software development because of their inherent structure and behaviour, which can be used to facilitate Internet services.

Agents are autonomous programs situated within a programming environment. Agents achieve their goals by collecting the relevant information from the host without affecting the local processing. They have certain special properties, mandatory and orthogonal (supplementary), which distinguish them from standard programs. The mandatory properties are autonomy, reactivity, proactivity and temporal continuity. The orthogonal properties are communication, mobility, learning and believability. An agent must possess the mandatory properties, which are compulsory. The orthogonal properties enhance the capabilities of agents and provide a strong notion of agency; an agent may or may not possess the orthogonal properties.
Agents can be classified as local/user-interface agents, networked agents, distributed artificial intelligence (AI) agents and mobile agents. Networked agents and user-interface agents are single-agent systems, whereas the other two types are multi-agent systems [9].

From the literature, we have noticed that a dynamic QoS path computation, based on rapidly changing network conditions and capable of providing adaptability, flexibility, software reuse and customizability, has not been addressed. This paper proposes an intelligent agent based QoS enabled NDMR, built on NDMR, for supporting multimedia applications; it considers bandwidth and delay as QoS metrics for optimal path computation. A mobile agent is employed to find QoS paths and to select an optimal path among them.

II. QoS Enabled NDMR

This paper proposes an intelligent agent based QoS enabled NDMR, built on NDMR, for supporting multimedia applications, which identifies a set of multiple paths that meet the QoS requirements of a particular application and selects the path that leads to the highest overall resource efficiency. The scheme uses a mobile agent to perform this operation. Every node in the network comprises an agent platform to support mobile agents. In this section, we describe NDMR in brief and explain the functioning of QoS enabled NDMR.

2.1 Node-disjoint multipath routing protocol

The node-disjoint multipath routing protocol (NDMR) is a new protocol developed by Xuefei Li [5], modifying and extending AODV to enable the path-accumulation feature of DSR in route request packets. It can efficiently discover multiple paths between source and destination nodes with low broadcast redundancy and minimal routing latency.

In the route discovery process, the source creates a route request packet (RREQ) containing the message type, source address, current sequence number of the source, destination address, broadcast ID and route path. The source node then broadcasts the packet to its neighbouring nodes. The broadcast ID is incremented every time the source node initiates a RREQ, forming a unique identifier for the RREQ together with the source node address. Finding node-disjoint multiple paths with low overhead is not straightforward when the network topology changes dynamically. NDMR routing computation has three key features that help it achieve low broadcast redundancy and avoid introducing a broadcast flood in MANETs: path accumulation, decreasing multipath broadcast routing packets (using shortest routing hops), and selecting node-disjoint paths.

In NDMR, AODV is modified to include path accumulation in RREQ packets. When the packets are broadcast in the network, each intermediate node appends its own address to the RREQ packet. When a RREQ packet finally arrives at its destination, the destination is responsible for judging whether or not the route path is a node-disjoint path. If it is, the destination creates a route reply packet (RREP) containing the node list of the whole route path and unicasts it back, along the reverse route path, to the source that generated the RREQ packet.

When an intermediate node receives a RREP packet, it updates its routing table and reverse routing table using the node list of the whole route path contained in the RREP packet. When a duplicate RREQ is received, the possibility of finding node-disjoint multiple paths is lost if it is simply dropped, for it may have come along another path. But if all duplicate RREQ packets are broadcast, a broadcast storm results and performance decreases dramatically. To avoid this problem, a novel approach is introduced in NDMR: recording the shortest routing hops to keep paths loop-free and decrease routing broadcast overhead. When a node receives a RREQ packet for the first time, it checks the node list of the route path, calculates the number of hops from the source node to itself, and records this number as the shortest number of hops in its reverse routing table. If the node receives a duplicate RREQ packet, it computes the number of hops and compares it with the shortest number of hops in its reverse routing table. If the number of hops is larger than the recorded shortest number of hops, the RREQ packet is dropped. Only when it is less than or equal to the shortest number of hops does the node append its own address to the node list of the route path in the RREQ packet and broadcast it to its neighbouring nodes again.

The destination node is responsible for selecting and recording multiple node-disjoint paths. When receiving the first RREQ packet, the destination records the list of node IDs of the entire route path in its reverse route table and sends a RREP packet along the reverse route path. When the destination receives a duplicate RREQ, it compares the node IDs of the entire route path in the RREQ to all of the existing node-disjoint paths in its reverse routing table. If there is no common node (except the source
and destination node) between the node IDs from the RREQ and the node IDs of any node-disjoint path in the destination's reverse table, the route path in the current RREQ is a node-disjoint path and is recorded in the reverse routing table of the destination. Otherwise, the current RREQ is discarded.

2.2 Functioning of QoS enabled NDMR

We describe the QoS routing metrics and the agency at each node considered in the proposed work.

2.2.1 QoS routing metrics

Consider the network as an undirected graph G(V, E), where V is a set of nodes and E is a set of edges, to describe the QoS metrics used in the proposed scheme. A path P from source s to destination d, P(s, d), is a sequence of edges belonging to the set E. The proposed scheme uses the residual bandwidth (bw_e) and delay (d_e) metrics of a link for QoS routing of an application. The QoS of an application is specified as Q = {bw_min, D}, where bw_min is the minimum bandwidth required for the application and D is the bounded end-to-end delay for delivery of information. The application can be viewed by a user with acceptable QoS by guaranteeing the bw_min and D metrics. The concave and additive properties are defined below for a path P = {l1, l2, ..., ln}, where ln is the nth link and m(P) is the metric value on the path P.

2.2.2 Agency at each Node

We assume that every node in the network maintains an agency as shown in Fig. 1. An agency consists of a BlackBoard, a Communication Manager Agent, a Delay and Bandwidth Estimator agent (D&B), and a QoS negotiator/re-negotiator agent.

• BlackBoard: a shared knowledge-base structure, which is read and updated by agents as and when required. It contains information such as the residual bandwidth and delays of the links connected to a node, as shown in Fig 2.

• Communication Manager Agent (CMA): a static agent running at each node to serve the applications requesting on-demand QoS routes, and also to support route-finding operations when the node is acting as an intermediate node. This agent is responsible for the creation of the agents and updates the data in the blackboard. All operations, such as communication, updating the blackboard, reading the blackboard and so on, take place with the permission of the communication manager agent.
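The two NDMR decision rules described above (the intermediate node's shortest-hop filter for duplicate RREQs, and the destination's node-disjoint test) can be sketched as follows; this is an illustrative reconstruction, not the authors' implementation, and the data structures (a table keyed by source and broadcast ID, route paths as lists of node IDs) are assumptions:

```python
# Sketch of two NDMR rules (illustrative; data structures are assumed).
# 1) An intermediate node rebroadcasts a duplicate RREQ only if its hop
#    count does not exceed the shortest count recorded so far.
# 2) The destination accepts a route path only if it shares no node
#    (other than source and destination) with an already recorded path.

class ReverseTable:
    """Per-node record of the fewest hops seen per (source, broadcast ID)."""

    def __init__(self):
        self.shortest_hops = {}

    def should_rebroadcast(self, source, broadcast_id, route_path):
        hops = len(route_path)  # hops from the source to this node
        key = (source, broadcast_id)
        best = self.shortest_hops.get(key)
        if best is None or hops < best:
            self.shortest_hops[key] = hops
        return best is None or hops <= best


def is_node_disjoint(new_path, recorded_paths):
    """True if new_path shares no intermediate node with any recorded path."""
    inner = set(new_path[1:-1])  # exclude source and destination
    return all(not (inner & set(p[1:-1])) for p in recorded_paths)


table = ReverseTable()
print(table.should_rebroadcast("S", 1, ["S", "A"]))       # first copy: True
print(table.should_rebroadcast("S", 1, ["S", "B", "C"]))  # longer: False
print(is_node_disjoint(["S", "C", "E", "D"], [["S", "A", "B", "D"]]))  # True
```

Here the hop count is taken as the length of the accumulated node list; a real implementation would also purge stale entries when routes expire.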
Algorithm 1: Functions of the Communication-Manager-agent

To maintain the QoS requirements of an application by performing dynamic negotiation/re-negotiation.

Begin
1. Receive a connection request for source and destination with QoS requirements (maximum bandwidth and delay required).
2. Trigger the QoS-negotiator/Re-negotiator agent to negotiate the resources to find a QoS route (Algorithm 2a).
3. Trigger the D&B-Estimator-Agent (Algorithm 3a).
4. Trigger the D&B-Estimator-Agent periodically to observe the QoS of an application in the network (Algorithm 3b).
5. Check: if there is a QoS violation, then trigger the QoS-negotiator/Re-negotiator agent to re-negotiate the resources with the nodes in the established path (Algorithm 2b).
6. Repeat steps 4-5 until the session is completed.
7. Dispose of the created agents.
8. Stop.
End.

• QoS Negotiator/Re-negotiator agent: a mobile agent, which is used to find the QoS route (a route satisfying bandwidth and delay) from the source to the destination at the beginning of a session, as well as whenever required. It negotiates/re-negotiates the resources in the path. It follows the principle of the NDMR routing protocol while establishing multiple paths between the source and destination, collecting the neighbours' connectivity and resource information (bandwidth availability, delays). Finally, it chooses a maximum-bandwidth and minimum-delay path among them for resource reservation.

Algorithm 2a: Negotiation phase

To find a QoS route considering the bandwidth and delay parameters.

Begin
1. The QoS negotiator/re-negotiator agent collects the QoS requirements from the Communication-manager-agent.
2. The agent migrates from the source through its neighbours until it reaches the destination. While traversing, it collects the resource availability from each of the intermediate nodes.
3. The agent routes from destination to source as per the pre-existing routing of the NDMR routing protocol.
4. When the agent reaches the source, it finds a set of multiple QoS paths that satisfy the required resources:
   • Prune all the edges/links in the collected connectivity/resource information that fail to meet the desired bandwidth and delay bounds.
   • Find K node-disjoint paths (no node is common to more than one path) by following the principles of the NDMR routing protocol.
5. If QoS path(s) are available, then select the best QoS path (the path with the widest bandwidth and lowest delays) and reserve the resources on the path. Else, inform the communication manager agent that a QoS path is not available.
6. Dispose of the QoS negotiator/re-negotiator agent.
7. Stop.
End.

Algorithm 2b: Re-negotiation phase

To re-negotiate resources whenever a QoS violation or congestion/failure is detected during a session.

Begin
1. The QoS negotiator/re-negotiator agent collects the QoS requirements to be re-negotiated from the Communication-manager-agent.
2. It migrates on the specified path by visiting every node on the path.
3. After reaching the destination, it checks whether the re-negotiation is successful at all the visited nodes.
4. If the re-negotiation is successful, then inform the Communication-manager-agent of the newly negotiated QoS values on the existing path and go to step 6. Else, find the multiple QoS paths that satisfy the required resources (as given in step 4 of Algorithm 2a).
5. If QoS path(s) are available, then select the best QoS path, reserve the resources on the path and inform the server and manager agent. Else, inform the communication manager agent that a QoS path is not available.
6. Dispose of the QoS negotiator/re-negotiator agent.
7. Stop.
End.

• Delay and Bandwidth Estimator Agent: an agent that monitors the node. Such agents are created for each node by the Communication-manager-agent. The Delay-and-Bandwidth-Estimator-agent computes
the bandwidth and delay for each node and updates the Blackboard at regular intervals. Bandwidth is calculated by monitoring the flow of information on the link of a node and the available residual bandwidth. Delay is computed by averaging the queuing delays of all the packets on the link within a given time interval.

Algorithm 3a: The Delay and Bandwidth Estimator agent computes bandwidth and delay

Begin
1. Compute bandwidth and delay as follows:

  bw_P(s,d) = min_{(i,j) ∈ P(s,d)} bw_e(i,j) ≥ bw_min

  delay_P(s,d) = Σ_{(i,j) ∈ P(s,d)} d_e(i,j) ≤ D

3.1 Simulation model

The proposed model has been simulated in various network scenarios on a Pentium-4 machine, using the C++ programming language, to assess the performance and effectiveness of the approach. The simulated area for the network topology is A × B sq. m; a total of N nodes, a link capacity of C Mbps, and the transmission range are considered in the simulation. In order to simulate the mobility of nodes in the network, we consider nodes to move in any of 8 directions over a distance of d m, with speed varying between 0 and 12 m/h (metres per hour). The QoS requirements of an application are specified as Q = {bandwidth, delay}, where all the metrics of Q are randomly distributed. The principles of NDMR routing are used to generate node-disjoint multipaths and create routing tables at each of the nodes before applying the QoS Enabled NDMR scheme.
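The concave (bottleneck) bandwidth metric and additive delay metric used in Algorithm 3a can be written directly; a minimal sketch with hypothetical link values:

```python
# Sketch of the path QoS metrics: residual bandwidth is concave (the
# bottleneck, i.e. the minimum over the path's links), while delay is
# additive (the sum over the path's links). Link values are hypothetical.

def path_metrics(links):
    """links: list of (residual_bandwidth, delay) pairs along a path P."""
    bw_p = min(bw for bw, _ in links)    # bw_P(s,d) = min over (i,j) in P
    delay_p = sum(d for _, d in links)   # delay_P(s,d) = sum over (i,j) in P
    return bw_p, delay_p

def satisfies_qos(links, bw_min, delay_bound):
    """Q = {bw_min, D}: path is feasible iff bw_P >= bw_min and delay_P <= D."""
    bw_p, delay_p = path_metrics(links)
    return bw_p >= bw_min and delay_p <= delay_bound

links = [(10, 2), (7, 1), (12, 4)]     # hypothetical (bandwidth, delay) links
print(path_metrics(links))             # -> (7, 7)
print(satisfies_qos(links, 5, 10))     # -> True
```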
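Steps 4 and 5 of Algorithm 2a (prune links that violate the QoS bounds, then pick the widest-bandwidth, lowest-delay path) can be sketched as below; the link representation and the tie-breaking order are assumptions for illustration:

```python
# Sketch of the negotiation agent's path selection (illustrative): discard
# candidate paths whose bottleneck bandwidth is below bw_min or whose total
# delay exceeds the delay budget, then choose the path with the widest
# bottleneck bandwidth, breaking ties by lowest total delay.

def select_best_path(paths, bw_min, delay_bound):
    """paths: list of candidate paths, each a list of (bandwidth, delay) links."""
    feasible = []
    for path in paths:
        bottleneck = min(bw for bw, _ in path)
        total_delay = sum(d for _, d in path)
        if bottleneck >= bw_min and total_delay <= delay_bound:
            feasible.append((bottleneck, total_delay, path))
    if not feasible:
        return None  # inform the communication manager: no QoS path available
    feasible.sort(key=lambda t: (-t[0], t[1]))  # widest bandwidth, then lowest delay
    return feasible[0][2]

paths = [
    [(5, 2), (4, 3)],   # bottleneck 4, delay 5
    [(6, 4), (6, 4)],   # bottleneck 6, delay 8
    [(2, 1), (9, 1)],   # bottleneck 2: pruned when bw_min = 3
]
print(select_best_path(paths, bw_min=3, delay_bound=10))  # -> [(6, 4), (6, 4)]
```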
• QoS Acceptance ratio: the ratio of the number of QoS-fulfilled paths to the number of existing multipaths.

4. Results

Fig 4. End-to-End Delay vs. number of nodes

As shown in Figure 4, the average end-to-end delay of the proposed scheme is less than that of NDMR, because the proposed scheme selects the optimal path.

requirement, because a larger number of optimal paths are available.

Figure 3 depicts the route discovery time: that of the proposed scheme is greater than that of the NDMR scheme, because the mobile agents need time to traverse the network to find an optimal path.

5. Conclusions

An intelligent-agent-based QoS enabled NDMR scheme, using the metrics bandwidth and delay for feasible path computation, has been proposed. A comparison of NDMR and the proposed QoS enabled scheme is presented in terms of average end-to-end delay, packet delivery ratio, network bandwidth utilization ratio and route discovery time for sparse and dense network scenarios. The results demonstrate that the average end-to-end delay, packet delivery ratio and network bandwidth utilization of the proposed scheme are better than those of the NDMR scheme. The performance of the scheme depends on the richness of the network connectivity information gathered by the mobile agent; that is, the scheme performs better in the case of dense networks. The agent's visibility of its visited nodes also plays an important role in improving the QoS acceptance ratio. Important benefits of the agent-based scheme, as compared with traditional methods of software development, are flexibility, adaptability, software reusability and maintainability.
Acknowledgments
References
Close to Regular Covering by Mobile Sensors with
Adjustable Ranges
A.I.Erzin V.N.Muralidhara S.Raha
Abstract— A mobile wireless sensor network of a set of mobile sensors with adjustable sensing and communication ranges is analyzed for coverage. Each sensor, being in active mode, consumes its limited energy for sensing, communication and movement and, in a sleep mode, preserves its energy. The problem that we focus on in this paper is to maximize the lifetime of the WSN. This problem is sufficiently complex, and even special cases are NP-hard [1]. Our goal is to take advantage of the mobility of sensors, in comparison with static sensors, in the class of regular covers [2], [3], [4].

Index Terms— Adjustable ranges, coverage, energy efficiency, wireless sensor network.

This research was supported jointly by the Russian Foundation for Basic Research (grant 08-07-91300-IND-a) and by the Department of Science and Technology of the Government of India (grant INT/RFBR/P-04).
A.I. Erzin is with the Sobolev Institute of Mathematics, Russian Academy of Sciences, Novosibirsk, Russia, and the Novosibirsk State University, Novosibirsk, Russia.
V.N. Muralidhara was with the Supercomputing Education and Research Center, Indian Institute of Science, Bangalore, India. At present he is a faculty member at the International Institute of Information Technology (IIIT) Bangalore.
S. Raha is with the Supercomputing Education and Research Center, Indian Institute of Science, Bangalore, India.

I. INTRODUCTION

A wireless sensor network (WSN) is composed of a large number of sensor nodes deployed densely close to an area of interest and connected by a wireless interface. Wireless sensor networks constitute the platform of a broad range of applications such as national security, surveillance, military, health care, and environmental monitoring. A sensor node in a WSN is typically equipped with a radio transceiver, a microcontroller and a power supply, and every node in the network has very limited processing, storage and energy resources. In most real-world applications it is almost impossible to replenish the power resources, hence energy optimization is one of the most important issues in WSNs.

Suppose the WSN is presented by the set J, |J| = m, of mobile sensors with adjustable sensing and communication ranges, which are distributed randomly over the plane region O of area S. Each sensor, being in active mode, consumes its limited energy for sensing, communication and movement; in a sleep mode a sensor preserves its energy. Let the monitoring and communication areas of every sensor be disks of certain radii with the sensor in the center [2], [3], [5], [6].

We say that the region O is covered if every point in O belongs to at least one monitoring disk. The lifetime of a WSN is the number of time rounds during which the region O is covered by connected active sensors [7]. Observe that by maximizing the lifetime of a WSN, we are actually maximizing the time period over which the region is covered by the sensor nodes with their limited energy resources. The problem that we focus on in this paper is to maximize the lifetime of the WSN. This problem is sufficiently complex, and even special cases are NP-hard [1]. Our goal is to take advantage of the mobility of sensors, in comparison with static sensors, in the class of regular covers [2], [3], [4]. In the model that we consider, the sensor nodes can adjust their sensing and communication ranges by consuming some energy. In this paper, we show that the mobility of the sensor nodes can be exploited to improve the lifetime of the WSN.

II. FIXED GRID

Let the region O be tiled by regular triangles (tiles) with side R√3. These triangles form a regular grid with the set of grid nodes I. Suppose each sensor has the energy storage q > 0. For any sensor, the sensing energy consumption per time period depends on the sensing range r (the radius of the disk) and equals SE = µ1 r^a, µ1 > 0, a ≥ 2; the communication energy consumption per time period depends on the distance d and equals CE = µ2 d^b, µ2 > 0, b ≥ 2; and the energy consumption per time round during motion depends on the speed v and equals ME = µ3 v^c, µ3 > 0, c > 0. We suppose that during motion a sensor does not consume energy for sensing and communication.

Fig. 1. Covering model A1

If all sensors have the same sensing range R, and are equally placed in the grid nodes, then the covering model, we
call it A1 [8], [4] (Fig. 1), is optimal with respect to the sensing energy consumption (or covering density). In the model A1 each triad of neighbour disks of radius R with centers in the nodes of a triangle has one common point in the center of the tile. In cover A1 each sensor, located in the node i, must cover the disk of radius R with center in the node i (we call it disk i). The density of the cover is D_A1 = 2π/√27 ≈ 1.2091 [4], [8], and the sensing energy consumption of every sensor is SE_A1 = µ1 R^a. The communication distance for each sensor in A1 is R√3, hence the communication energy consumption is CE_A1 = µ2 (R√3)^b. Therefore, the lifetime of one sensor is

  t_A1 = q / (µ1 R^a + µ2 (R√3)^b)

and the lifetime of the WSN in cover A1 is

  L_A1 ≈ t_A1 · m / N ≈ q m √27 / (2S (µ1 R^(a−2) + µ2 (√3)^b R^(b−2)))

Fig. 2. Sensors Inside the Regular Hexagon

Let the sensors be distributed uniformly over the region O, and let the parameter a_ij = 1 if i is the closest grid node to the sensor j (i.e. the distance between j and i is d_ij = min_{k∈I} d_kj), and a_ij = 0 otherwise. Denote the set J_i = {j ∈ J | a_ij = 1}. Then the sensors inside the regular hexagon i, with center in the node i and sides at the distance δ = R√3/2 from the center, are in the set J_i (Fig. 2). We reasonably suppose that a sensor j in J_i (or in hexagon i) must cover the disk i. Then, if j is located at distance r from the grid node i, it must increase its sensing range by r in order to cover the disk i. Moreover, if the distance between the node i and sensor j1 ∈ J_1 is r1, and the distance between the node k and sensor j2 ∈ J_2 is r2, then in order to guarantee communication between the neighbour sensors j1 and j2 it is necessary to increase the communication ranges of j1 and j2 by at least r1 + r2 units. Additionally, every sensor j ∈ J_i can move towards the node i during some time rounds in order to be nearer to i. For the sake of simplicity, we suppose that the speed of every sensor takes one of the two values 0 or v. Therefore, if a sensor j ∈ J_i is moving, then the speed v and the direction (towards the grid node i) are known.

Let us consider the concentric circles of radii δ_k = kv, k = 1, 2, ..., K = δ/v. Denote the set J_i^k = {j ∈ J_i | δ_{k−1} < d_ij ≤ δ_k}. Then any sensor j ∈ J_i^k can reach the node i in at most k time rounds.

Since the resource of each sensor is limited by q, if a sensor j ∈ J_i^k moves for l time rounds and, as a result, consumes l µ3 v^c units of its energy, then, taking into account the remaining sensor-node distance (k − l)v, it can be active during

  t_k(l) ≈ (q − l µ3 v^c) / (µ1 (R + (k−l)v)^a + µ2 (R√3 + 2(k−l)v)^b)

time rounds. The function t_k(l) is concave, so one can find T_k = t_k(l_k) = max_{0≤l≤k} t_k(l) in O(log K) time. For example, when q = 365, µ1 = 0.5, µ2 = 0.25, µ3 = 1.0, v = 0.15, R = a = b = c = 2, k = 6, one gets l_k = k = 6, and the lifetime of a sensor j ∈ J_i^k equals t_k(l_k) = 65.61.

Since the sensors are distributed uniformly, there are N_k ≈ mπ(2k−1)v²/S sensors in every set J_i^k. Let the first active sensors be initially located in J_i^1, and suppose that they do not move and are active during

  L_1 = q N_1 / (µ1 (R + v)^a + µ2 (R√3 + 2v)^b)

time periods. During the time L_1 the sensors in J_i^2 could move towards the grid node i, and then they can be active during L_2 = N_2 max_{0≤l≤min{2,L_1}} t_2(l) time periods. Therefore, during the time Λ_{k−1} = Σ_{l=1..k−1} L_l the sensors in J_i^k could move to the grid node i, and then they can be active during L_k = N_k max_{0≤l≤min{k,Λ_{k−1}}} t_k(l) time periods. As a result, we get the lifetime of the sensors as

  Λ_δ = Λ_K = Σ_{k=1..K} L_k

For example, when q = 365, µ1 = 0.5, µ2 = 0.25, µ3 = 1.0, v = 1, a = b = c = 2, R = 6, we have K = 5 and l_k = k for each sensor j ∈ J_1^k, k = 1, 2, ..., K. Let us compare the lifetime Λ_0 of the WSN in this example when the sensors are static with the lifetime Λ_δ when the sensors are mobile. We have

  Λ_0 ≈ Σ_{k=1..K} q N_k / (µ1 (R + kv)^a + µ2 (R√3 + 2kv)^b)
      ≈ (2·365·mπ/S) Σ_{k=1..5} (2k−1) / (40 + 8(1+√3)k + 3k²)
      ≈ (2·365·mπ/S) (1/64.85 + 3/95.71 + 5/132.56 + 7/175.42 + 9/224.28)
      ≈ 374 m/S

and

  Λ_δ ≈ q N_1 / (µ1 (R + v)^a + µ2 (R√3 + 2v)^b)
        + Σ_{k=2..K} (q − l_k µ3 v^c) N_k / (µ1 (R + (k−l_k)v)^a + µ2 (R√3 + 2(k−l_k)v)^b)
      ≈ (mπ/S) (365/62.89 + (1/45) Σ_{k=2..5} (365−k)(2k−1))
      ≈ 621 m/S

In this example the motion of the sensors gives a considerable gain in lifetime in comparison with the static case; the static case may be advantageous only when the movement energy consumption ME = µ3 v^c is relatively big. In any case, the optimal value of l_k can be zero, so if moving is disadvantageous, the sensors will not move. Therefore, the model with mobile sensors is never worse than the one with static sensors.

III. FREE GRID

The previous results depend on the parameter δ and are obtained in the case of a fixed grid. Suppose that the number of sensors N_k in J_i^k is sufficiently large for each k = 1, 2, ..., K = δ/v. If the grid is wandering, then we may relocate it several times without changing its size (the new grid node i_n is relocated from the previous position i_{n−1} by the distance 2δ to the right or down, as in Fig. 3); then the WSN's lifetime can be increased as follows. Let us set δ = v and suppose that during the first time round, while a part of the sensors in J_{i_1}^1 are active, the other sensors in every J_{i_n}^1, n ≥ 1, move to the grid node i_n. The number of sensors in each J_{i_1}^1 which are active during the first time round is n′_1 ≈ (µ1 (R + v)^a + µ2 (R√3 + 2v)^b)/q, and we suppose that n′_1 ≤ N_1. These sensors do not move and must increase their sensing ranges by v. During the first time round N_1 − n′_1 sensors in each set J_{i_1}^1 will reach the grid node i_1, and it is not necessary to increase their sensing ranges to cover O. Moreover, each sensor in J_{i_n}^1, n ≥ 2, will reach i_n during the first time round. The number of sensors in every set J_{i_n}^1, n ≥ 2, is N_1, and the number of these sets (new grid nodes) is n′_2 ≥ ⌊R√3/(2v)⌋² − 1 (the value n′_2 + 1 is the number of disks of radius δ packed in a rhombus with side R√3), where ⌊A⌋ is the integer part of A. Every sensor located outside the sets J_{i_n}^1, n ≥ 1, has two time rounds to reach the nearest grid node (Fig. 3). The number of such sensors in the rhombus is

  n′_3 ≈ R²√27·m/(2S) − (n′_2 + 1) N_1 ≥ (2√27 − 3π)·mR²/(4S) ≈ 0.24·mR²/S

Fig. 3. Relocation of Grids with Movement of a Sensor

Since N_1 ≈ mπv²/S, the WSN's lifetime in this case is

  Λ_v ≈ 1 + (N_1 − n′_1 + N_1 n′_2)·(q − µ3 v^c)/(µ1 R^a + µ2 (R√3)^b)
        + n′_3·(q − 2µ3 v^c)/(µ1 R^a + µ2 (R√3)^b)
      ≈ 1 + (3R²mπ/(4S) − (µ1 (R + v)^a + µ2 (R√3 + 2v)^b)/q)·(q − µ3 v^c)/(µ1 R^a + µ2 (R√3)^b)
        + (6mR²/(25S))·(q − 2µ3 v^c)/(µ1 R^a + µ2 (R√3)^b)

For the last example (q = 365, µ1 = 0.5, µ2 = 0.25, µ3 = 1.0, v = 1, a = b = c = 2, R = 6), the lifetime of the sensor network with the free grid is Λ_v ≈ 747 m/S, the lifetime in the case of the fixed grid is Λ_δ ≈ 621 m/S, the lifetime of cover A1 is L_A1 ≈ 758 m/S, and the lifetime of the static WSN is Λ_0 ≈ 374 m/S. Then in this example one gets 2Λ_0 ≈ 1.2Λ_δ ≈ 1.01Λ_v ≈ L_A1 and Λ_0 < Λ_δ < Λ_v < L_A1. The inequalities Λ_0 ≤ Λ_δ ≤ L_A1 and Λ_v ≤ L_A1 are always true. The inequality Λ_δ ≤ Λ_v depends on the parameters. Thus, if the energy consumption for motion per unit time, ME = µ3 v^c, is considerably big, one may get the inequality Λ_0 > Λ_v. In the last example, let us change only the value of µ3 and set µ3 = 180. Then Λ_δ is the same, Λ_0 ≈ 374 m/S, but Λ_v ≈ 333 m/S < Λ_0.

IV. GRID SIZE OPTIMIZATION

The above results depend on the radius R which, in turn, determines the tile size. The lifetime L_A1 of model A1 is independent of R. Suppose for simplicity a = b = c = 2, v = 1 and l_k = k for any k ≤ R√3/2, and let us find the optimal value of R ∈ [2, 8] giving the maximum of the WSN's lifetimes Λ_0(R), Λ_δ(R) and Λ_v(R). In this case, for the considered models, the lifetimes are

  Λ_0(R) ≈ (mπq/S) Σ_{k=1..R√3/2} (2k−1) / (µ1 (R + k)² + µ2 (R√3 + 2k)²)
mπq q
Λδ (R) ≈ ( √
S µ1 (R + 1)2 + µ2 (R 3 + 2)2
√
R 3/2
1 X
+ (q − kµ3 )(2k − 1))
(µ1 + 3µ2 )R2
k=2
m(2.6q − 2.83µ3 )
Λv (R) ≈ +1
S(µ1 + 3µ2 )
√
µ1 (R + 1)2 + µ2 (R 3 + 2)2 q − µ3
−
qR2 (µ1 + 3µ2 )
Function Λv (R) is non-decreasing, and when, for exam-
ple, q = 36, µ1 = 1.0, µ2 = 0.2, µ3 = 5.0 then
maxR∈[2,8] Λv (R) = Λv (Rv ) ≈ 57m S , and optimal R
v
=
0 δ
8. Λ (R) and Λ (R) are multi-extremal functions, and get
maxR∈[2,8] Λ0 (R) = Λ0 (R0 ) ≈ 20m S when R0 ≈ 7 and
δ δ δ
maxR∈[2,8] Λ (R) = Λ (R ) ≈ S when Rδ ≈ 2.4.
33m
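The grid-size optimization above is a one-dimensional search over R ∈ [2, 8]. Such a maximization can be sketched by brute force; the lifetime function below is only a placeholder (the paper's Λ formulas depend on m, S, q, µ1, µ2 and µ3), so the numbers are purely illustrative.

```python
import math

# Brute-force search for the radius R in [2, 8] maximizing a lifetime
# function Lambda(R), as done in Sec. IV for the multi-extremal
# Lambda0(R) and Lambda_delta(R). The lifetime used here is a
# placeholder, NOT the paper's formula.
def argmax_on_interval(f, lo, hi, step=0.01):
    best_r, best_val = lo, f(lo)
    n_steps = int(round((hi - lo) / step))
    for k in range(1, n_steps + 1):
        r = lo + k * step
        val = f(r)
        if val > best_val:
            best_r, best_val = r, val
    return best_r, best_val

# Placeholder multi-extremal lifetime (illustration only).
lifetime = lambda R: math.sin(R) + 0.1 * R

R_opt, L_opt = argmax_on_interval(lifetime, 2.0, 8.0)
```

A finer step trades running time for accuracy; for the smooth, low-dimensional functions considered here an exhaustive scan over [2, 8] is entirely adequate.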
Virtual Backbone Based Reliable Multicasting for MANET
Dipankaj G Medhi
ADG, Evolving Systems Network India Ltd
dipankaj@gmail.com
As loss of data packets is common in the wireless environment, special attention needs to be paid to improving the packet delivery ratio to achieve reliability. The normal way of recovering a lost packet is by sending feedback from the receiver(s) to the sender. Based on the feedback, the sender may retransmit the lost packet. This mechanism may introduce a request message explosion problem in extreme conditions. To rectify this problem, ReMHoc [10] slows down the recovery process randomly. Slowing down the recovery process at individual nodes augments end-to-end delay, and hence it is not a suitable solution. To overcome this problem, AG [11], ReACT [12], RALM [13] and RMA [14] introduce a localized loss recovery mechanism, in which the feedback message is sent to a set of nearby nodes. To get back the lost packet in any localized loss recovery system, the receiver must obtain information about the recovery nodes that hold the lost packet. Some of the existing approaches (for example AG [11] and ReACT [12]) use an explicit search mechanism to find the recovery node. This in turn introduces control message overhead. Other approaches (for example RALM [13] and RMA [14]) maintain a list containing the information of all the receiver nodes. This information is collected by flooding control messages in the entire network, or at least in a part of the network. Further, most of the existing approaches depend on an underlying multicast protocol (for example, RALM, ReMHoc, ReACT etc.).

Keeping all these points in mind, we propose a new approach for reliable multicasting that utilizes a virtual backbone infrastructure. Constructing a virtual backbone for routing in MANET is not new, but none of the existing reliable multicasting mechanisms explores this area. In doing so, we developed a distributed clustering algorithm that extracts a dominating set and interconnects its members to form a backbone. The rest of the nodes get associated with one of the cluster-heads and form a forest of varying-depth trees. The backbone construction mechanism is similar to ADB [15] and MobDHop [16], except for the cluster-head selection criteria, which reflect the dynamic environment of MANET. Next, we propose an approach for reliable multicasting over the backbone infrastructure to achieve guaranteed delivery of data packets. Further, we develop an ACK based one-hop loss detection and recovery mechanism to achieve a high degree of reliability. Moreover, we use a NACK based localized loss recovery mechanism with the cluster-head as the recovery node. Thus, our approach does not demand an explicit search for the recovery node, and hence the network does not suffer from control message overhead. If the lost packet can't be recovered locally (within the cluster), a global loss recovery mechanism is proposed, in which the lost packet is pulled from another group (cluster). Further, in our mechanism, the sender does not have to keep any information about the receiver nodes.

The rest of the paper is as follows: Section 2 describes the backbone construction process. Section 3 provides a theoretical correctness proof of the approach. Section 4 explains the mechanism used in this research work to achieve reliable multicasting. Section 5 summarizes this work.

2. Virtual Back-bone construction process

The virtual backbone infrastructure allows a smaller subset of nodes to participate in forwarding control/data packets and helps reduce the cost of flooding. Most virtual backbone construction algorithms for ad hoc networks are based on the connected dominating set problem, but like ADB and MobDHop, we construct a d-hop connected dominating set that creates a forest of varying-depth trees, each of which is rooted at a backbone node.

2.1 The Cluster-head Selection Criteria

For the construction of a virtual backbone, the criterion used for selecting a cluster-head is usually trivial. The uniqueness of our virtual backbone creation approach is its cluster-head selection criterion. For the efficient working of the proposed approach, the following cluster-head selection criteria need to be considered:

Mobility: The most important factor to be considered in the cluster-head selection process is mobility. In order to avoid frequent cluster-head (re)selection, the cluster-head should be relatively stable. A dynamic node changes its geographical position very frequently and hence the nodes associated with it also change very frequently. If such a node is selected as cluster-head, frequent breakdowns of the backbone will take place.

Unlike others, we believe that mobility cannot be measured with speed (WCA), Normalized Link Failure Frequencies (VDBM) or Link Life Time (RMA) alone. The speed of an individual node does not reflect the surrounding environment. Similarly, Normalized Link Failure Frequencies (VDBM/ADM) reflect the dynamic condition of the area surrounding a node in terms of the number of link failures per neighbor. But sometimes a link may fail not due to the unavailability of a
neighbor, but due to the loss of a Neighbor Discovery request packet at the MAC layer (due to buffer overflow or the hidden/exposed terminal problem). Moreover, keeping track of each link to find out the Link Lifetime may be resource consuming in a multicasting environment. So, a combination of speed and Network Layer Link Failure Frequency (NLLFF) is used to represent the true mobility scenario of an individual node at the network layer. A simplified approach for measuring NLLFF is used in this work.

Degree: To reduce the overhead associated with the cluster-head and to achieve proper load balancing, there should be a limitation on the number of nodes that can be associated with a cluster-head. So, unlike other approaches (e.g. WCA), in this work preference in the cluster-head selection process is given to a node which is not highly loaded (the degree of the node is lower than the threshold degree).

Node ID: If the values of the above two criteria for selecting the cluster-head are the same, the conflict is resolved by node ID. The lowest ID node will be selected as cluster-head in this situation.

Based on these criteria, a weight-age is assigned to every node, as follows:

Wt_i^t = speed_i^t × Speed_Factor + NLLFF_i^t × NLLFF_Factor + Degree_i^t × Degree_Factor

where Wt_i^t is the weight of node i at time t, NLLFF_i^t is the Network Layer Link Failure Frequency of node i at time t, Degree_i^t is the degree of node i at time t, and Speed_Factor, NLLFF_Factor and Degree_Factor are the multiplicative factors for speed, NLLFF and degree of node i, respectively.

Network Layer Link Failure Frequency (NLLFF) reflects the dynamic condition of the surrounding area by measuring how frequently the neighbor table of the current node changes. To measure the NLLFF, the node has to remember this information temporarily. After every ∆t interval of time, the node compares the remembered information with the newly gathered neighbor information. The difference in this comparison gives the number of expired links. The Link Failure Frequency for node i at time t can be calculated as

LFF_i^t = NumberOfLinkExpired_i^t / DegreeOfNode_i^t

The Network Layer Link Failure Frequency at time t is

NLLFF_i^t = α × LFF_i^t + (1 − α) × NLLFF_i^(t−∆t)

where α is the smoothing factor for the past history (α < 1) and NLLFF_i^(t−∆t) is the NLLFF at time t−∆t.

2.2 Cluster-head Selection Process

The core selection process at each node begins after the node has waited for a random period of time, usually long enough to allow the node to have heard Hello packets from all of its neighbors. This process decides whether a node should still be serving as a core, or become a child of an existing core. Fig 1 demonstrates the cluster-head election and tree construction procedure.

To keep the number of cluster-heads as low as possible, the cluster-head election procedure has to maintain two constraints:

WT_CONSTRAIN: It limits the maximum cumulative weight from the core to the child node. Because of this constraint, the size of the tree in a highly dynamic area will be smaller.

DEPTH_CONSTRAIN: It limits the maximum depth a tree can have.

For maintaining the backbone structure in a dynamic environment, every node exchanges a periodic NEIGH_UPDATE message in the neighborhood. Upon receiving a NEIGH_UPDATE message, the node updates its NIT table and performs the following checks:

If the NEIGH_UPDATE message sender is a new core node, the node checks the weight of the new core node against the weight of the current core node. If the weight of the new core node is less than that of the current core node, the node can join the new core node provided it does not violate WT_CONSTRAIN and DEPTH_CONSTRAIN.

If the message sender is a tree node with a better cumulative weight, the current node can join the tree provided it does not violate WT_CONSTRAIN and DEPTH_CONSTRAIN.
If the message contains no better information, the node will continue with its current status.

Let, NIT ← One Hop Neighbor Table
     q ← min_wt(NIT)  /* returns the least weighted node information */
     MyBackboneStatus = NONMEMBER

VirtualBackboneCreation()
{
1.  q ← min_wt(NIT);
2.  if (my_wt < q→wt) {
3.      MyCoreID = MyOwnID;
4.      MyParentID = MyOwnID;
5.      MyBackboneStatus = CLUSTERHEAD;
6.      OneHopBroadcast_MyStatus(MyCoreID, MyParentID, Wt, DistanceToCore); }
7.  else {
8.      if (my_wt == q→wt) {
9.          if (my_ID < q→ID) {
10.             MyCoreID = MyOwnID;
11.             MyParentID = MyOwnID;
12.             MyBackboneStatus = CLUSTERHEAD;
13.             OneHopBroadcast_MyStatus(MyCoreID, MyParentID, Wt, DistanceToCore); } } }
14. for (;;) {
15.     on receiving MyStatus(CoreID, ParentID, Wt, DistanceToCore) {
16.         if ((my_wt + Wt) < WT_TH              /* Wt constrain is not violated */
                && (DistanceToCore + 1) < DEPTH_TH) {  /* distance constrain is not violated */
17.             if (my_wt > Wt) {                 /* I am heavier */
18.                 MyCoreID = CoreID;
19.                 MyParentID = ParentID;
20.                 MyBackboneStatus = MEMBER;
21.                 CumulativeWt = my_wt + Wt;
22.                 DistanceToCore++;
23.                 OneHopBroadcast_MyStatus(MyCoreID, MyID, CumulativeWt, DistanceToCore); }
24.             else if (my_wt == Wt) {           /* my wt is the same */
25.                 if (my_ID < senderID) {       /* conflict resolve */
26.                     MyCoreID = CoreID;
27.                     MyParentID = ParentID;
28.                     MyBackboneStatus = MEMBER;
29.                     CumulativeWt = my_wt + Wt;
30.                     DistanceToCore++;
31.                     OneHopBroadcast_MyStatus(MyCoreID, MyID, CumulativeWt, DistanceToCore); } } }
            else {
32.             Wait to be covered by another node }
33.         if (Wait-to-be-covered time expired && MyBackboneStatus == NONMEMBER) {
34.             MyCoreID = CoreID;
35.             MyParentID = ParentID; } } } }

Figure 1: Backbone Construction Process

2.3 An Illustrative Example

Fig 2 to 6 demonstrate the backbone creation process. Figure 2 shows the initial configuration of the nodes in the network with individual node ids. Dotted circles with equal radius represent the fixed transmission range of each node.

Figure 2. Initial Configuration of nodes

Figure 3. Neighbor nodes with weight

Figure 4. Cluster with cluster-head

Fig 3 shows the neighbor nodes with the corresponding weight of every node. This is the resultant weight after executing the backbone construction process. The cluster-head selection procedure is executed in a purely distributed manner and elects 3, 4 and 7 as cluster-heads. For example, node 7 is the minimum weight node among its neighborhood (node 2, node 14 and node 9); hence it declares itself as cluster-head and broadcasts this information to the neighborhood. All other nodes will join a cluster-head, gradually. Finally, a cluster will form locally as shown in Fig 4. From Fig 4, it is observable that node 9 is in the transmission range of node 7 and node 4. So, this node will exchange core node information
with node 7 and node 4. Thus node 7 and node 4 will be able to know each other.

v. In other words, N^0(v) = {v} and N^d(v) = (∪_{u ∈ N^(d−1)(v)} N^1(u)) ∪ N^(d−1)(v) for d ≥ 1. E^d(v) denotes the set of links between d-hop neighbors. The local execution of the algorithm by any node v will generate a N^d(v)
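The d-hop neighborhood defined above can be sketched as a simple iterated expansion. The following is an illustrative rendering (the adjacency-dict representation and function name are ours, not the paper's):

```python
# Sketch of the d-hop neighborhood N^d(v): N^0(v) = {v}, and N^d(v) is
# the union of the 1-hop neighborhoods of all nodes in N^(d-1)(v),
# together with N^(d-1)(v) itself.
def n_hop(adj: dict, v, d: int) -> set:
    nd = {v}                             # N^0(v)
    for _ in range(d):
        nxt = set(nd)                    # keep N^(d-1)(v)
        for u in nd:
            nxt.update(adj.get(u, ()))   # add N^1(u) for every u
        nd = nxt                         # N^k(v) after k rounds
    return nd
```

For a path graph 1-2-3-4, for instance, n_hop gives {1} at d = 0, {1, 2} at d = 1 and {1, 2, 3} at d = 2, matching the recursive definition.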
minimum (Weight, Degree, ID) than the other, which is not possible. Hence, the theorem is proved.

4. Virtual Backbone Based Reliable Multicasting Protocol

Now we describe the proposed reliable multicasting protocol that uses the virtual infrastructure constructed by the above-mentioned approach. For guaranteed delivery of data packets, we use an ACK based technique for ensuring reliable one-hop data delivery and a receiver-initiated NACK based approach for data recovery at the receiver side.

4.1 One Hop Reliable Packet Delivery: ACK based approach

One of the main goals of this research work is to reduce the bottleneck at the sender node by limiting retransmission requests. To achieve this goal, an ACK based retransmission mechanism for successful packet delivery to the neighbor node is used. Consider the scenario shown in Fig 7, in which sender S is sending data packets to receiver R via the intermediate nodes i1 and i2. It is observable that packet no 4 is lost between nodes S and i1. By the one-hop reliable packet delivery scheme, this loss is identified and recovered. In a receiver-initiated NACK based scheme, the receiver would try to recover the data packet (packet no 4) by sending a retransmit request (NACK) over the links {(R, i2), (i2, i1), (i1, S)}. Hence, the links {(R, i2), (i2, i1)} would be overburdened by retransmission requests. We believe that this burden can be reduced by providing an ACK based system in which loss recovery is done in the one-hop neighborhood. Moreover, it reduces the bottleneck at the sender due to NACK messages.

Figure 7: Packet loss during transmission from sender to receiver

In order to recover from one-hop packet loss, every node maintains a packet sent table. As soon as it transmits/forwards a data packet, it makes an entry in that table along with ACK_Recv = 0. As soon as a node receives a data packet, it sends back an ACK to the sender. Upon receiving an ACK, the node sets the ACK_Recv flag in the packet sent table for the particular packet and the corresponding destination node (for the <MCastAddr, SeqNo, TimeStamp, DestID> combination). Periodically, every node checks its packet sent table and retransmits the data packet to the one-hop destination node if it has not received any ACK from that node. This process continues for MAX_RETRY times (in our simulation, this value is 2). After MAX_RETRY attempts, it presumes that the neighbor node has gone out of range.

4.2 Intra Cluster Operation

The reliable multicasting protocol explained in this research work performs its operation at two levels: intra cluster (local) and inter cluster (backbone). Inter-cluster operation is based on the concept of a rendezvous point, where all the control messages and data packets are directed to the cluster-head of the cluster the node belongs to.

4.2.1 Multicast Joining

Each node maintains a Multicast Member Joining Table (MMIT) that keeps the information about the child nodes that participate in the multicast process. To join a multicast group, a multicast member node registers itself with its own cluster-head. For this purpose, node i announces its existence by sending a JOIN_REQ message to its parent (ParentIDi). The parent node updates the MMIT, sets itself as a forwarder and forwards the message towards the cluster-head. This process continues till the JOIN_REQ message reaches the cluster-head. As soon as the cluster-head receives a JOIN_REQ message, it makes an entry in MMIT. Thus the forwarding nodes create a multicast tree rooted at the core.

Similarly, every node sends a LEAVE_REQ message whenever it wants to leave the group. This LEAVE_REQ message is processed in the same way as the JOIN_REQ message. As soon as any parent node/cluster-head hears a LEAVE_REQ message, it simply removes the entry of the node from MMIT.

4.2.2 Data Transmission

After construction of the backbone and the forest of varying-depth trees, any node can start to transmit data packets via the backbone or the tree. If the source is a tree member, it can simply forward the data packet to its parent node and send it to all the multicast member child nodes (consulting the MMIT). Upon receiving a packet from a child node, the parent node forwards the data packet until it reaches the cluster-head. In this data forwarding process, if there is any multicast member on the path towards the cluster-head, it simply
receives the packet and sends it to the upper layer. When a core node receives a data packet from a downstream node, the packet is buffered and forwarded to the other core nodes by the backbone level multicast process. The multicast source inserts a sequence number, the multicast group address and a time stamp (packet creation time) in every outgoing packet to uniquely identify a packet.

4.2.3 NACK based Error Recovery

The one-hop packet delivery mechanism improves the packet delivery ratio between a pair of nodes. This supportive mechanism may not guarantee successful delivery of a packet to the destination node in highly dynamic situations. Hence, a NACK based error recovery mechanism is adopted at both the local and the backbone level.

In our local loss recovery mechanism, the cluster-head acts as the recovery node. Hence, unlike other systems, our approach does not need an explicit search for the recovery node. Instead of sending a retransmission request as soon as it detects a lost packet, a receiver periodically informs the cluster-head about the packets it has not yet received. The major advantage of this mechanism is that it helps reduce NACK explosion at the core node. Each NACK message includes a sequence number R and a vector V. The sequence number R indicates that all the packets up to R have been received successfully, and each flag in the vector corresponds to the sequence number of a lost packet.

4.3 Inter Cluster Operations

The previous section explained how to perform the multicast operation inside a cluster. To achieve data forwarding among different clusters, a core receiving a data packet must distribute it among the other core nodes.

4.3.1 Data Transmission

Any node in the backbone may play one of the following two roles: Core and Forwarder. A core node may be a multicast source or receiver, or it may simply act as a distributor of data packets among other nodes.

If a core node is a source node, it sends the data packet to all other cluster-heads via forward node(s). Moreover, it sends the data packet to all multicast member children within the cluster by consulting the MMIT. Whenever a cluster-head receives a data packet for the first time, it buffers the packet, sends back an ACK to the sender, performs cluster level multicasting, and forwards the data packet to the other nearby core nodes listed in the Core Information Table (CIT). If the received packet is an older one, the packet is simply discarded and an ACK is sent back to the sender. Upon receiving a data packet, a forward node simply delivers the data packet on the link towards the cluster-head.

4.3.2 NACK Based Error Recovery

As soon as a cluster-head receives a new data packet, it buffers that packet to satisfy NACK requestors. Due to the inherent characteristics of MANET, a cluster-head may need to pull a data packet from another cluster-head to satisfy the need of its own cluster members. For this purpose, a NACK based error recovery mechanism is required at the backbone level.

Upon receiving a NACK message from the downstream, the cluster-head checks the availability of the data packet in its buffer. If the packet is available in the buffer, it sends the data packet back to the requester. Otherwise, it makes an entry in the NACK_Table (a table that keeps track of the NACK messages received by a node), generates a NACK request message putting its own ID in the requestor field, and hands the request message over to the neighbor leading to a nearby cluster-head. This helps in reducing the lost packet recovery latency for the receiver.

5 Simulations and Performance Evaluation

To analyze the performance of the proposed approach, we have conducted experiments using GloMoSim 2.02. It is a library based simulator designed by Scalable Network Inc for mobile networks. In our experiments, 50 nodes are placed randomly in a 1000m by 1000m area. Constant bit rate (CBR) traffic is generated by the application with each payload being 512 bytes. UDP is used at the transport layer. 802.11 is used as the MAC layer protocol with the two-ray path loss model. All the experiments are run for 15 minutes of simulation time.

5.1 Performance Analysis

To analyze the performance of the proposed reliable multicast protocol, the following metrics were used:

Average Packet Delivery Ratio: defined as the number of received data packets divided by the number of data packets generated. This metric measures the effectiveness and reliability of the protocol.
Average End-to-End Delay: defined as the average delay over all the packets received by all the receivers. This metric evaluates the protocol's timeliness and efficiency.

First we study the behavior of the proposed approach with varying mobility; for that purpose we fixed the radio range of each individual node at 12.0 dB.

Figure 8: Effect of mobility on PDR (Interdeparture Interval 200ms, Tx Range 12dB). [Plot: Packet Delivery Ratio 0.90–0.98 vs Mobility (m/sec) 0–50; series: Multicast Group = 1, 2, 3]

Fig 9 shows the end-to-end delay comparison with different multicast group sizes. It is clearly observable that there is a sharp increase in end-to-end delay with an increase in mobility and number of senders. The end-to-end delay suffered by the receivers in a small group is inspiring; Multicast Group 1 in Fig 9 is the evidence of this conclusion.

[Figure 9 plot: average end-to-end delay vs mobility; series: Multicast Group = 1, 2, 3]

Figure 10: Effect of Traffic rate on Packet Delivery Ratio (mobility 0, Tx Range 10). [Plot: Packet Delivery Ratio 0.90–1.02 vs Packet Interdeparture Interval 100–500; series: Multicast Group = 1, 2, 3]
Figure 11: Effect of Traffic rate on Packet Delivery Ratio (mobility 50 m/sec, Tx Range 10). [Plot: Packet Delivery Ratio 0.90–1.02 vs Packet Interdeparture Interval 100–500; series: Multicast Group = 2, 3]

Figure 13: Effect of Traffic rate on End-to-end delay (mobility 50 m/sec, Tx Range 10). [Plot: Av End-To-End Delay 0–3 sec vs Packet Interdeparture Interval 100–500; series: Multicast Group = 2, 3]

Fig 12 and Fig 13 show the average end-to-end delay with varying inter-departure time at mobility 0 and mobility 50 m/sec, respectively. As seen in Fig 12, the end-to-end delay constantly increases with a decrease in the packet inter-departure time interval and with an increase in group members. Although the approach shows a promising result with higher inter-departure intervals, the end-to-end delay increases up to approximately 2 seconds with a larger number of sources and a lower packet inter-departure interval.

Figure 12: Effect of Traffic rate on End-to-end delay (mobility 0)

6. Conclusion and Future Directions

This research work proposed a reliable multicasting approach for mobile ad hoc networks. We used a different cluster-head selection criterion to find the d-hop dominating set. For load balancing, the highest degree node is given less weight-age in this process. Then we used the virtual infrastructure for reliable multicasting. The error control mechanism combines a one-hop ACK based reliable data packet delivery approach that helps in reducing global packet retransmission requests and hence increases the packet delivery ratio.

References

[1] T Gopalswamy, M Singhal, D Panda and P Sadayappan, "A Reliable Multicast Algorithm for Mobile Ad Hoc Networks," in Proceedings of ICDCS 2002, July 02-05, 2002.
[2] Journal on Selected Areas in Communications, Vol 5, No 3, Apr 1997, pp 407–421.

[3] J Macker and W Dang, "The Multicast Dissemination Protocol (MDP) framework," IETF Internet Draft, draft-macker-mdp-framework-00.txt.

[4] K Obraczka, "Multicast Transport Mechanisms: A Survey and Taxonomy," IEEE Communications Magazine, Vol 36, No 1, Jan 1998, pp 94–102.

[5] T Speakman, N Bhaskar, R Edmonstone, D Farinacci, S Lin, A Tweedly and L Vicisano, "PGM Reliable Transport Protocol Specification," IETF Internet Draft, draft-speakman-pgm-spec-03.txt.

[6] Jetcheva Jorjeta G and Johnson David B (2001), "Adaptive Demand-Driven Multicast Routing in Multi-hop Wireless Ad Hoc Networks," MobiHoc 2001, CA, USA.

[7] Jetcheva Jorjeta G and Johnson David B (2004), "A Performance Comparison of On-Demand Multicast Routing Protocols for Ad Hoc Networks," reports-archive.adm.cs.cmu.edu/anon/2004/CMU-CS-04-176.pdf.

[13] Tang Ken, Obraczka Katia, Lee Sung-Ju and Gerla Mario (2002), "A Reliable, Congestion-Controlled Multicast Transport Protocol in Multimedia Multihop Networks," WPMC, 2002.

[14] Gopalsamy Thaigaraja, Singhal Mukesh, Panda D and Sadayappan P (2002), "A Reliable Multicast Algorithm for Mobile Ad hoc Networks," 22nd IEEE International Conference on Distributed Computing Systems (ICDCS 02).

[15] Jaikaeo C and Shen Chien-Chung (2002), "Adaptive Backbone-Based Multicasting for Ad Hoc Networks," IEEE International Conference on Communications (ICC), New York City, April 28–May 2, 2002.

[16] Inn Inn ER and Seah Winston K G (2004), "Mobility-Based d-Hop Clustering Algorithm for Mobile Ad Hoc Networks," in Proceedings of the IEEE Wireless Communications and Networking Conference, March 21–25, 2004.
ADCOM 2009
DISTRIBUTED SYSTEMS
Session Papers:
1. Achyanta Kumar Sarmah, Smriti Kumar Sinha and Shyamanta Moni Hazarika, “Exploiting
Multi-context in a Security Pattern Lattice for Facilitating User Navigation”
it in any platform. This chiefly happens because of two reasons: firstly, the templates used are different from one another and specifically suit only the abstraction and perspective for which they are used, and secondly, a hierarchy based organizational scheme for security patterns, which are related to one another, is missing. A hierarchy based organizational scheme allows users a one-point navigation facility to search for patterns based on any characteristic. In [1], we decided on a security template of our own based on Christopher Alexander's definition of pattern and attempted organizing patterns by exploiting results from FCA and constructing a concept lattice of security patterns, the SPL. Here we propose a generic template as discussed above.

III. SP template

A. SP templates from related works

In this section we summarize some of the templates used in the field of SPs. Our aim in reviewing these templates is not to arrive at a template structure that would encompass all of them and give us uniformity of representation. It is rather to detect the various perspectives of SPs. We then attempt to capture these perspectives in the semantics of our proposed template and arrive at an algorithmic structure. Uniformity of representation would allow proper and efficient selection, while the algorithmic structure would help us translate a pattern into any implementing framework.

1) Markus Schumacher's pattern template: Based on the terminology provided in the Common Criteria, Schumacher in [7] proposes a template with the elements Name, Context, Problem and Solution. Along with these come some other optional elements which could improve the comprehension of a SP.

2) Kienzle and Elder's template: Kienzle et al. in [8] consider four levels of abstraction for SPs, viz:
a) Concepts: These encompass general strategies and are represented by abstract nouns that could not be directly implemented by developers, for example "least privilege".
b) Classes of patterns: A class represents a general problem area that could have multiple solutions.
c) Patterns: A pattern is specific enough to allow basic properties to be specified and trade-off analysis to be conducted against other patterns.
d) Examples: An example is typified by sample code. It is the most immediately useful, but in a very narrow context.

The authors take an object oriented approach and propose a template at the third level of abstraction with the elements Pattern name, Abstract, Aliases, Problem, Solution, Static Structure, Dynamic structure, Implementation issues, Common attacks, Known uses, Sample code, Consequences, Related patterns, References. The elements of this template exhibit different perspectives:
• Documentation: Pattern name, Abstract, Aliases, Known uses, Consequences.
• Functionality: Problem, Solution.
• State: Static structure, Dynamic structure.
• Environmental: Related patterns, References, Common attacks.
• Implementation: Implementation issues, Sample code.

3) Sun's Core SP template for JEE by Nagappan et al.: Motivated by the concept of Security by default [3], a notion that ensures security at all OSI levels, the authors put forward a security design methodology with the following stages:
a) Define security requirements
b) Propose a candidate security architecture
c) Perform risk and trade-off analysis
d) Identify SPs and create the security design
e) Implement a prototype
f) Validation testing and auditing

The template used for security patterns in this case has Problem, Forces, Solution, Structure, Strategies, Consequences, Reality checks, and Security actors and risks as its elements. The Reality check element in this template considers the perspective of testing resources for the applicability of a pattern, though at a very conceptual level.

4) Microsoft's Web Service Security template: Microsoft classifies patterns for Web Service security into Architectural, Design and Implementation patterns. For this purpose it considers the elements Name, Context, Problem, Forces and Solution, with semantics similar to the elements in other templates.

Most of these patterns adhere to the basic elements of a security pattern, viz: Name, Context, Problem and Solution, with the addition of some other optional elements. In our case, we consider a pattern to have an algorithmic representation. We therefore build upon these basic elements and propose a template as follows:

CONTEXT - These are the preconditions that need to be met for applying the SP. For example, authentication always needs to be performed before authorization. A precondition here exhibits inter-pattern relationships or dependencies between patterns. It could be implemented as a vector Vect<pattern, perspective>, where each element of the vector is a tuple <pattern, perspective> and the pattern in question is applicable for one or more of the elements of Vect.
PROBLEM - It defines a situation that would require this pattern to be applied. For example, authentication elements, viz. cards, passes, etc., tend to become namesakes after some time.
SOLUTION - These are the security algorithms/measures to be applied for solving the above problem. For example, a digital signature is a solution for authorization.
CONSEQUENCES - The context of a SP would usually be given in terms of some variables or objects. Consequences give us the change in the values of the context, or introduce/deduce them.
Apart from these, we could have a number of optional elements which could serve as selection criteria for the patterns in a repository. Some of these optional elements may be:
• Aliases
• Known uses or examples
• Abstract
• Sample code
• Common attacks and risks
• DOMAIN CONCEPT - These are concepts like role, session and subject which are specific to a certain domain and represent some security requirement.
We illustrate the relationships between the above elements with the example in Figure 1.
[Figure 1: the Authorization concern is linked by <refines> edges to the Access Control List, RBAC and Capability patterns; the engineering artifacts (Subjects, Objects) and domain concepts (Sessions, Roles, Privileges) are connected by <depends> edges.]
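The template elements above can be rendered as a simple data structure. The following is a minimal illustrative sketch in Python; the class and field names are our own, and only the element names (CONTEXT, PROBLEM, SOLUTION, CONSEQUENCES) come from the paper:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A precondition entry: a (related pattern, perspective) tuple,
# matching the Vect<pattern, perspective> element of the CONTEXT.
Precondition = Tuple[str, str]

@dataclass
class SecurityPattern:
    """Hypothetical rendering of the proposed SP template."""
    name: str
    context: List[Precondition]   # CONTEXT: inter-pattern dependencies
    problem: str                  # PROBLEM: situation requiring the SP
    solution: str                 # SOLUTION: algorithm/measure applied
    consequences: List[str]       # CONSEQUENCES: changes to context variables
    aliases: List[str] = field(default_factory=list)  # optional elements
    sample_code: str = ""

# Example: an authorization pattern whose precondition is Authentication.
rbac = SecurityPattern(
    name="RBAC",
    context=[("Authentication", "access-control")],
    problem="Subjects need permissions scoped to organisational roles.",
    solution="Assign privileges to roles; assign roles to subjects.",
    consequences=["subject gains the privileges of its active roles"],
)
assert ("Authentication", "access-control") in rbac.context
```

The precondition vector then supports the second-level search described later: a related pattern is simply looked up in each pattern's `context` list.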
Thereafter, we attempt at classifying SPs within the SPL by the use of scaling techniques.

VI. Multi-context in a SPL

We extend our idea of a SPL to accommodate the search for a specific pattern in a security concept as follows. The conceptual TE is formalized as a SR with a template. The user searches for some SP on the basis of a SR. This takes the user to a security concept in the SPL that has SPs satisfying the provided SR. For reaching the specific SP desired by the user, he/she launches a second-level search with some other parameter that would discriminate amongst the SPs in the security concept at hand. In the approach presented here, the second-level search is done on the basis of a precondition, or in other words a related pattern. The related pattern submitted by the user is searched for in the precondition vector of each of the patterns in the extent of the security concept at hand. The search returns all those patterns whose precondition vector contains the related pattern submitted by the user in the second-level search. From our template of SP, we have the element precondition as multivalued, so we represent a security concept as another lattice, say <G_c, M_c, I_c>, where
G_c = set of security patterns in the present concept;
M_c = union of all preconditions belonging to all patterns in G_c;
I_c = set of relationships between elements of G_c and M_c that tells us which precondition is applicable to which pattern.
In this context, the atomic concept having only the desired precondition as its intension would give us the necessary SPs.

VII. Generating and navigating in a multi-context SPL

A. Concept Generating Algorithms

Generation and navigation in a concept lattice involve iterating through all possible concepts in a given context. Hence, the complexity of any algorithm that generates or navigates in a concept lattice depends upon the number of concepts in the given context. Again, the number of concepts is exponential in the size of its input context. This case is similar to that of finding the powerset of a set, where the complexity increases exponentially with the size of the set. So, from the standpoint of worst-case complexity, an algorithm generating all concepts and/or the concept lattice can be considered optimal if it generates the lattice with polynomial time delay¹ or in space linear in the number of all the concepts [10].

Algorithm 1 presented here is a concept generating algorithm based on the Next-Closure [11], [12] algorithm by Ganter.

Boundary conditions for algorithm 1:
• Single-attribute concepts of the context are given.
• No new attribute or object is added during concept generation.

Data structures involved and input to algorithm 1:
• OBJECTS = array of objects.
• ATTRIBUTES = array of attributes.
• NO_ATR = total number of attributes.
• NO_OBJ = total number of objects.
• CONCEPT_ARRAY = an array whose elements are CONCEPTS.
• SUPREMUM = the CONCEPT with all objects and a null attribute.
• MAXEXTENT = the extent with all objects from OBJECTS as members.
• SINGLEATTRIBUTECONCEPTS = list of concepts whose intents have only one attribute.

Output from algorithm 1:
• CONCEPTLATTICE - the generated lattice of the mined concepts.

Procedures involved in algorithm 1:
• VAL: function to build the bitstring from the indices of a subset.
• UPDATESUBINTENTS: function to update the list of subintents. Input parameters: INTENTS, a list of intents; SUBINTENT, the subintent for all intents in INTENTS.
• RETRIEVEPOSITIONS: function to retrieve the set bit positions in an integer.

Classes involved in algorithm 1:
• SUBSET_ITERATOR (a class which allows us to iterate through the subsets of a given set).
  Data:
  A) SET_SIZE: size of the parent set.
  B) SUBSET_SIZE: size of the subsets.
  C) SUBSET_INDICES: list of indices of the current subset.
  D) FIRSTTIME: flag to check whether the iterator has been called for the first time.
  Functions:
  A) INCREMENT_SUBSET_INDICES(): increments the present subset indices to produce the next subset indices.
  B) NEXT_SUBSET(): function to return the indices of the next subset on the fly.
The set of classes for algorithm 1 is extended with the following class:
• PRECON (a class that represents a precondition).

¹An algorithm for listing a family of combinatorial structures is said to have polynomial delay if it executes at most polynomially many computation steps before either outputting each next structure or halting.
Variables: CUR_ATTRIBUTE_INDICE = 0
1  repeat
2    Find all subsets of ATTRIBUTES of size CUR_ATTRIBUTE_INDICE + 1.
3    CUR_INTENT = 0, CUR_EXTENT = MAXEXTENT.
4    foreach subset S from step 2 do
5      CUR_INTENT = S
6      CUR_EXTENT = intersection of the extents of the concepts from SINGLEATTRIBUTECONCEPTS corresponding to each element in S.
7      if CUR_EXTENT <> 0 then
         1) NEW_CONCEPT = {CUR_INTENT, CUR_EXTENT}
         2) Update SUP_INTENTS of NEW_CONCEPT with CUR_INTENT.
         3) Update SUB_INTENTS of each CONCEPT corresponding to each element in CUR_INTENT with NEW_CONCEPT.
         4) Append NEW_CONCEPT to CONCEPT_ARRAY.
8    end
9  end
10 CUR_ATTRIBUTE_INDICE = CUR_ATTRIBUTE_INDICE + 1; CUR_INTENT = 0
11 until CUR_ATTRIBUTE_INDICE = NO_ATR
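The steps of Algorithm 1 can be sketched compactly. The following is a non-optimized Python illustration of the same idea, using sets instead of the bitstring representation the algorithm assumes; function and variable names are ours:

```python
from itertools import combinations

def generate_concepts(objects, attributes, incidence):
    """Grow intents by size, as in Algorithm 1: for each attribute
    subset, intersect the extents of the given single-attribute
    concepts and keep the pair if the extent is non-empty.
    `incidence` is a set of (object, attribute) pairs."""
    # Boundary condition: single-attribute concepts of the context are given.
    single = {a: {o for o in objects if (o, a) in incidence} for a in attributes}
    concepts = [(frozenset(objects), frozenset())]        # the SUPREMUM
    for size in range(1, len(attributes) + 1):            # CUR_ATTRIBUTE_INDICE loop
        for intent in combinations(sorted(attributes), size):
            extent = set(objects)
            for a in intent:                              # intersect the extents
                extent &= single[a]
            if extent:                                    # CUR_EXTENT <> 0
                concepts.append((frozenset(extent), frozenset(intent)))
    return concepts

# Toy context: two patterns (objects) and two security requirements (attributes).
objs = ["RBAC", "ACL"]
attrs = ["confidentiality", "integrity"]
inc = {("RBAC", "confidentiality"), ("ACL", "confidentiality"), ("ACL", "integrity")}
cs = generate_concepts(objs, attrs, inc)
assert (frozenset({"RBAC", "ACL"}), frozenset({"confidentiality"})) in cs
```

As in the algorithm, the enumeration is driven by attribute-subset size, so the number of iterations is exponential in NO_ATR in the worst case but only the non-empty extents are recorded.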
Let X < 2^NO_ATR, where X is the number of valid concepts.
C1 = time required to perform an indexed search for an extent or intent = linear and constant.
C2 = time required for one computation = linear and constant.
C3 = time required to perform one bit-level operation = linear and constant.
C4 = time required to perform one conditional operation = linear and constant.
C5 = time to retrieve an extent or intent of a concept = linear and constant.
C6 = average time required to search linearly for an attribute in an intent = linear.
C7 = average time required to search for a given precondition in a pattern = linear.

We proceed to compute the analytical complexity of our algorithm based on X as follows.

Number of comparisons/computations to search for a concept:
1) Search for a SECREQ in the intent of the current CONCEPT = C6.
2) Retrieve the extent if the search at step 1 is successful = C5 + 1.
In the worst case, the total number of comparisons that may be required to search for a concept in the SPL = X * (C6 + C5 + 1).

Number of comparisons to search for patterns in a concept with a given precondition:
1) Average number of patterns in the extent of a given concept = (NO_OBJ + 1)/2.
2) Average time required to search for a precondition in a pattern = C7.
3) Total time to search for patterns in a concept with a given precondition = C7 * (NO_OBJ + 1)/2.

Total number of comparisons/computations for selecting patterns in the SPL based on a SECREQ and a precondition = number of comparisons to search for a concept + number of comparisons to search for patterns in a concept with a given precondition = X * (C6 + C5 + 1) + C7 * (NO_OBJ + 1)/2.
Since C5 is linear and constant, taking C5 ≈ 1 we have:
number of comparisons/computations ≈ X * (C6 + 2) + C7 * (NO_OBJ + 1)/2.
In the above expression, C6 is linear in the number of attributes and C7 is linear in the number of patterns. Therefore the terms C7 * (NO_OBJ + 1)/2 and (C6 + 2) are of linear order. Hence the overall complexity of the above expression depends upon the order of X, i.e. it is O(X). Consequently, the algorithm is of polynomial order (and hence feasible) if X is of at most polynomial order. The order of X depends on the sparseness or denseness of the initial context that gives all the single-attribute concepts: the denser the context, the higher the order of X, and correspondingly lower for a sparse context.

VIII. Conclusions and future directions

Analysing the existing templates of SPs, we attempt to capture the various perspectives in which security patterns are considered. Building upon the SPL proposed by the authors earlier, which is a concept lattice organizing security patterns in a formal manner, the present work extends the SPL further to include multiple contexts within the SPL. For this, an all-encompassing template for the security patterns is considered. Also, the applicability of all SPs is considered in light of various SRs based on the CIA model. Casting the SPs and SRs into an FCA framework as in the SPL, the authors propose a generating algorithm for the SPL, based on concept generating tools like Next-Closure, object exploration, attribute exploration, etc. Observing the fact that patterns in a domain always work under a system of forces where the patterns are related to one another, the idea of multi-context is introduced into the template of a SP as PRECON, which is a set of patterns related to the current one. In this regard, the generating algorithm is extended to facilitate navigation to a particular pattern based on a SR and a related SP provided. The present algorithm facilitates navigation based on only one characteristic of a SP, i.e. the CONTEXT or PRECONDITION. Also, the generating algorithm works only with a finite context: if a new SP is to be added, then a rerun of the algorithm is required. Future work in this direction could be to make provision for adding a new concept to the existing lattice without the need to re-run the whole generating algorithm. Also, a facility could be incorporated in the navigation algorithm to allow search on any characteristic of a SP. With the algorithm at hand for generation and navigation, an application could be developed that would allow us to build a view of the generated lattice on any platform automatically.

References

[1] A. K. Sarmah, S. M. Hazarika, S. K. Sinha, Security pattern lattice: A formal model to organize security patterns, in Proceedings of the 2008 19th International Conference on Database and Expert Systems Application (2008) 292–296.
[2] J. Yoder, J. Barcalow, Architectural patterns for enabling application security, in Proc. of PLoP 1997.
[3] C. Steel, R. Nagappan, R. Lai, Core Security Patterns, Prentice Hall, 2007.
[4] S. Romanosky, Security design patterns part 1 v1.1 (2001).
[5] D. Kienzle, M. Elder, D. Tyree, J. Edward-Hewitt, Security pattern repository v1.0.
[6] M. Hafiz, Security patterns and secure software architecture, tutorial in International Conference on Object Oriented Programming, Systems, Languages and Applications.
[7] S. Markus, R. Utz, Security Engineering With Patterns, Springer-Verlag New York, Inc., 2003.
[8] D. M. Kienzle, M. C. Elder, D. S. Tyree, Security patterns template and tutorial, retrieved from citeseerx.ist.psu.edu/viewdoc/summary10.1.1.131.2464.
[9] M. Dan, R. Indrakshi, R. Indrajit, H. S. Hilde, Building security requirement patterns for increased effectiveness early in the development process, in Proc. of Symposium on Requirements Engineering for Information Security, Paris.
[10] S. Kuznetsov, S. Obiedkov, Comparing performance of algorithms for generating concept lattices, Journal of Experimental and Theoretical Artificial Intelligence (2002) 189–216.
[11] B. Ganter, Two basic algorithms in concept analysis, Technical Report preprint, TH Darmstadt.
[12] B. Ganter, K. Reuter, Finding all closed sets: A general approach, Order 8 (1991) 283–290.
Trust in Mobile Ad Hoc Service GRID
P. Varalakshmi, S. Thamarai Selvi, S. Sundar Raman
Department of Information Technology, Madras Institute of Technology, Anna University Chennai,
Chennai-600044, India
Email : varanip@gmail.com, stselvi@annauniv.edu, ssrmit62@gmail.com
Resource Discovery Service - RDS is a service that is used to find a particular resource node using the RLT.
Resource Access Service - RAS is a service that is used to make the job execute in a remote grid node and keep track of the jobs submitted.
Watchdog Service is used to check the job execution status and update the trust factor in the database.
Each node maintains three tables in its database. The tables are node_info, job_sent_info and job_received_info. The node_info table contains the neighbor listing and the available resource information. The trust and mobility factor information for each neighbor node is updated in this table. The job_sent_info table contains the job submission information. Each job carries a status flag which shows the job success and the node to which the particular job is assigned. The job_received_info table contains the jobs accepted from the node's neighbors. This table is maintained as a queue and the job execution is carried out on a First-In First-Out basis.
Consider that a nodei needs to execute its jobj in the MASGRID environment. It gets the neighbor listing from its node_info table. Each node in the neighbor listing will be assigned a part of jobj. Thus nodei's job_sent_info table will be updated with the neighbor nodes' information. Later, the corresponding neighbor nodes' job_received_info tables are also updated with the nodei information. The watchdog service checks for job execution feasibility in the neighbor nodes and updates the status flag of the job in both the job_sent_info and job_received_info tables.

2.3. Evaluation of Mobility Factor

The mobility factor is based on the concept of how far a node is mobile relative to the other nodes in its neighbor list. The interval of a node can be identified based on the theory of relativity. Each node itself moves towards its destination, and hence the neighbor node speed cannot be predicted by just the displacement in a unit interval of time. Thus the mobility factor is calculated as given in Eq. (2):

mobility_nodej = 1 - (interval(nodej) / max_interval(nodej))    (2)

The interval function returns the unit of time taken by the node to reach the destination from a source point. The max_interval function returns the maximum interval among the neighbor nodes of nodej.
Thus a node 'j' with 'm' neighbors can have one node (the one having the maximum interval) with mobility factor 0. All the remaining nodes have mobility factors relative to that node.
The mobility factor for all nodes is 0 initially. Each simulation interval updates the mobility factor for each node. Thus this mobility factor lies in the range [0, 1]. This factor identifies the stability in the network.
The random waypoint mobility model is chosen for random positioning and for keeping the movements of all nodes random. This gives a relatively real-world model for mobile ad hoc networks in which each node is free to move to any destination as it wishes, as shown in Figure 1.
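Eq. (2) is straightforward to compute from the recorded intervals. The following is a small illustrative sketch; the function name and the dictionary representation are our own assumptions:

```python
def mobility_factor(intervals, j):
    """Eq. (2): mobility of node j relative to its neighbor list.
    `intervals` maps each node in j's neighborhood (including j) to the
    time it took to reach its destination.  The node with the maximum
    interval gets mobility factor 0; all values lie in [0, 1]."""
    max_interval = max(intervals.values())
    return 1 - intervals[j] / max_interval

# Three nodes: the one with the largest interval (slowest) scores 0.
iv = {"a": 10.0, "b": 20.0, "c": 5.0}
assert mobility_factor(iv, "b") == 0.0
assert abs(mobility_factor(iv, "a") - 0.5) < 1e-9
```

As the text notes, exactly one neighbor (the maximum-interval one) gets factor 0, and a larger factor flags a node as more mobile and hence less stable for job submission.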
2.4. Job Submission to the next hop

Further broadcasting of jobs can be carried out once the neighbor nodes are found to be not capable enough to finish the job. The broadcasting of these jobs is restricted based on the hop count limit. This increases the job success rate, since each node is in search of capable nodes in the later-hop region.

[Figure 2: next-hop job submission - nodei receives a job from nodej over hop1 and reassigns it to its neighbor nodek over hop2.]

In Figure 2, a scenario is shown that may happen only when nodei was not capable of finishing the job received from some nodej. Then nodei will again assign the same job, with the source node being nodej, to its neighbor, say nodek, and receive the status. The status will be updated for nodej too.
Though job submission to the next hop proved to work well for a small number of jobs, there are some demerits at higher job counts. When the number of jobs is raised, nodei only just broadcasts the jobs and fails to receive the later jobs from nodej. Since the number of resources available for the jobs is increased, the job success rate also increases for the increased hop count.

The neighbor listing for each node is updated at each simulation interval based on the transmission range and the current position of each node. The ad hoc devices are identified and the configurations are assigned randomly to each node. Thus all the nodes collectively act as the ad hoc environment.
For each simulation, the following values need to be configured:
Number of nodes = 30
Transmission range = 400
Terrain = 2000, 2000
Number of jobs = 100 – 800
Max_length, Min_length = 1000, 100
Max_file_size, Min_file_size = 1000, 10
After simulation, the performance evaluation is carried out and the graph is plotted.

4. Results and Discussion

In Figure 3, a graph is plotted with the number of jobs against the job success rate.
The job success rate for a set of nodes can be calculated as the cumulative of all the nodes' job success rates over the number of nodes. Consider a case where there are 'm' nodes chosen for simulation and nodei has the job success rate JSi. Then the cumulative job success rate for a number of jobs 'n' is given by Eq. (3) as

Job Success Rate_n = (Σ_{i=1..m} JS_i) / m    (3)

So, for each unit increment in the number of jobs, the corresponding job success rate is calculated and plotted in all the cases.
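Eq. (3) is a plain average over the per-node success rates; a one-line sketch (names ours):

```python
def cumulative_job_success_rate(rates):
    """Eq. (3): cumulative job success rate over m nodes = sum(JS_i) / m."""
    return sum(rates) / len(rates)

# Example: three nodes with individual job success rates.
assert abs(cumulative_job_success_rate([0.8, 0.6, 0.7]) - 0.7) < 1e-9
```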
2. Trusted vs. untrusted mobility
3. Single vs. next-hop job submissions
For each node, separate tables are maintained so that they can keep the trust and mobility values for their neighbors separately and assign jobs to the neighbors respectively.

[Figure 3: with trust vs. without trust - job success rate (0%-80%) against number of jobs (100-800).]
In Figure 4, a graph is plotted with the number of jobs against the job success rate, and the simulated results are shown for mobility without trust vs. mobility with trust. Here, though initially both have the same performance, increasing the number of jobs gives better performance only for the trusted mobility nodes. Compared to the previous analysis, this shows a fluctuation in the graph; this is due to the pattern of the mobility model. Though the graph fluctuates, the untrusted mobility never beats the trusted mobility value. The effect of the simulation will be clearer once the environment is more mobile and the nodes move randomly.

[Figure 4: trusted mobility vs. untrusted mobility - job success rate (0%-100%) against number of jobs (100-800).]

In Figure 5 also, a graph is plotted with the number of jobs against the job success rate, and the simulated results are shown for first-hop vs. next-hop submission. In this graph, the next-hop values are almost twice those of the first-hop simulation. In this case, the hop count is 2. Increasing the hop count with a proper and efficient flooding scheme shows improved overall performance and hence increases the overall job success rate.

[Figure 5: single hop vs. next hop - job success rate (0%-120%) against number of jobs.]

5. Conclusions

Trust implementation greatly improves the performance of MASGRID in proper resource utilization and increased job success rate. The mobility factors taken into consideration improve the stability of job submission and raise the confidence that the nodes remain stable until the result of execution is received by the sender. The mobility factor also avoids job submission to nodes which are found to be more mobile and less stable; such nodes, once they have received a job, may suddenly leave the transmission range, and it is almost impossible to detect node availability in the ad hoc environment. Extending the job submission to the next hop increases the scalability of job submissions and avoids assigning the job to inefficient nodes again.

References

[1] Ihsan I., Abdul Qadir M., and Iftikhar N., "Mobile Ad hoc Service Grid – MASGRID", Proceedings of World Academy of Science, Engineering and Technology, 2005.
[2] Kurdi H., Li M. and Al-Raweshidy H., "A Classification of Emerging and Traditional Grid Systems", IEEE Distributed Systems, IEEE Computer Society, Vol. 9, No. 3, 2008.
[3] Li Z., Sun L. and Ifeachor E., "Challenges of Mobile Ad-hoc Grids and their Applications in E-HealthCare", Conf. on Computational Intelligence in Medicine and Healthcare, 2005.
[4] Amin K., Laszewski G., Sosonkin M., Mikler R., Hategan M., "Ad Hoc GRID Security Infrastructure", Grid Computing Workshop, 2006.
[5] Ma B., Sun J., Yu C., "Reputation-based Trust Model in Grid Security System", Journal of Communication and Computer, ISSN 1548-7709, 2006.
[6] Kwok Y., Hwang K., and Song S., "Selfish Grids: Game-Theoretic Modeling and NAS/PSA Benchmark Evaluation", IEEE Transactions on Parallel and Distributed Systems, Vol. 18, No. 5, 2007.
[7] Srinivasan A., Teitelbaum J., Liang H., Wu J., Cardei M., "Reputation and Trust-Based Systems for Ad Hoc and Sensor Networks", Conference on GRID Security, 2007.
[8] Martin A., "Trust and Security in Virtual Communities", IEEE Conference on Secure Virtual Organization, 2007.
Scheduling Light-trails on WDM Rings
Soumitra Pal and Abhiram Ranade
Department of Computer Science and Engg.,
Indian Institute of Technology Bombay,
Powai, Mumbai 400076, India.
Abstract—We consider the problem of scheduling communication on optical WDM (wavelength division multiplexing) networks using the light-trails technology. We give two online algorithms which we prove to have competitive ratios O(log n) and O(log² n) respectively. We also consider a simplification of the problem in which the communication pattern is fixed and known beforehand, for which we give a solution using O(c + log n) wavelengths, where c is the congestion and a lower bound on the number of wavelengths needed. While congestion bounds are common in communication scheduling, and we use them in this work, it turns out that in some cases they are quite weak. We present a communication pattern for which the congestion bound is an O(log n/ log log n) factor worse than the best lower bound. In some sense this pattern shows the distinguishing character of light-trail scheduling. Finally we present simulations of our online algorithms under various loads.

I. INTRODUCTION

Light-trails [1] are considered to be an attractive solution to the problem of bandwidth provisioning in optical networks. The key idea in this is the use of optical shutters which are inserted into the optical fiber, and which can be configured to either let the optical signal through or block it from being transmitted into the next segment. By configuring some shutters on (signal let through) and some off (signal blocked), the network can be partitioned into subnetworks in which multiple communications can happen in parallel on the same light wavelength. In order to use the network efficiently, it is important to have algorithms for controlling the shutters.¹

In this paper we consider the simplest scenario: two fiber optic rings, one clockwise and one anticlockwise, passing through a set of some n nodes, where typically n < 20 because of technological considerations. At each node of a ring there are optical shutters that can either be used to block or forward the signal on each possible wavelength. The optical shutters are controlled by an auxiliary network ("out-of-band channel"). It is to be noted that this network is typically electronic, and the shutter switching time is of the order of milliseconds, as opposed to optical signals which have frequencies of gigahertz.

For this setting we give three algorithms for controlling the shutters, or bandwidth provisioning. The first two consider dynamic traffic, i.e. communication requests arrive and depart in an online manner, i.e. they have to be serviced as soon as they arrive. The algorithm must respond very quickly in this case. The third algorithm considers stationary traffic. In this case, our algorithm can be allowed to take more time, because the computed configuration will be used for a long time since the traffic pattern does not change. For both problems, our objective is to minimize the number of wavelengths needed to accommodate the given traffic.²

The input to the stationary problem is a matrix B, in which B(i, j) gives the bandwidth demanded between nodes i and j. We give an algorithm which schedules this traffic using O(c + log n) wavelengths, where c = max_k Σ_{i,j : i≤k<j} B(i, j) is the maximum congestion at any link. The congestion as defined above is a lower bound, and so our algorithm can be seen to use a number of wavelengths close to the optimal. The reader may wonder why the additive log n term arises in the result. We show that there are communication matrices B for which the congestion is much smaller than 1, but which yet require Ω(log n/ log log n) wavelengths. In some sense, this justifies the form of our result.

For the online problem, we use the notion of competitive analysis [2], [3], [4]. Specifically we establish that our first algorithm is O(log n)-competitive, i.e. it requires at most an O(log n) factor more wavelengths as compared to the best possible algorithm, including an unrealistic algorithm which is given all the communication requests in advance. A multiplicative O(log n) factor might be considered to be too large to be relevant for practice (and indeed it is an important open problem whether a lower factor can be proved); however, the experience with online algorithm design is that such algorithms often give good hints for designing practical algorithms. We establish that our second algorithm for the online problem is O(log² n)-competitive; nevertheless we mention it because it is a simplified version of the first algorithm and it seems to perform better in our simulations.

That brings us to our final contribution: we simulate two algorithms based on our online algorithms for some traffic models. We compare them to a baseline algorithm which keeps the optical shutter switched off only in one node for each wavelength. Note that at least one node should switch off its optical shutter, otherwise the light signal will interfere with itself after traversing around the ring. We find that except for the case of very low traffic, our algorithms are better than the

¹Notice that in the on mode, light goes through a shutter without being first converted to an electrical signal – this is one of the major advantages of the light-trail technology.

²If our analysis indicates that some λ wavelengths are needed while only λ′ are available, then effectively the system will have to be slowed down by a factor λ/λ′. This is of course only one formulation; there could be other formulations which allow requests to be dropped and analyse what fraction of requests are satisfied.
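The congestion lower bound c = max_k Σ_{i,j : i≤k<j} B(i, j) used above can be computed directly from the traffic matrix. A small sketch (the function name and dictionary representation are ours, and ring wraparound is ignored, matching the definition):

```python
def congestion(B, n):
    """Maximum, over links k (between node k and k+1), of the total
    bandwidth of all transmissions (i, j) with i <= k < j that must
    cross that link.  B is a dict {(i, j): bandwidth} with i < j."""
    return max(
        sum(b for (i, j), b in B.items() if i <= k < j)
        for k in range(1, n)
    )

# The transmission matrix used as an example later in the paper:
# B(1,2) = B(3,4) = 0.5 and B(2,3) = 1.
B = {(1, 2): 0.5, (3, 4): 0.5, (2, 3): 1.0}
assert congestion(B, 4) == 1.0   # the link between nodes 2 and 3 carries 1.0
```

Since every transmission crossing a link needs its share of that link's unit bandwidth, any schedule needs at least ⌈c⌉ wavelengths, which is why c is a lower bound on the optimum.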
baseline. For very local traffic, our algorithms are in fact much paper [19] suggests that minimizing total number of light-
superior. trails also minimizes total number of wavelengths, it may not
The rest of the paper is organized as follows. We begin be always true. For example, consider a transmission matrix in
in Section II by comparing our work with previous related which B(1, 2) = B(3, 4) = 0.5 and B(2, 3) = 1. To minimize
work. In Section III we give the details of our algorithm for total number of light-trails used, we create two light-trails on
the stationary problem. Section IV gives an example instance two different wavelengths. Transmission (2,3) is put in one
of the stationary problem where congestion lower bound is light-trail and transmissions (1,2) and (3,4) are put in the other
weak. We describe our two algorithms for the online problem light-trail. On the other hand, to minimize total number of
in Section V. We give results of simulation of our online wavelengths, we put each of them in a separate light-trail on
algorithms in Section VI. a single wavelength. We believe that minimizing the number
of light-trails (while fixing the number of wavelengths) is
II. P REVIOUS W ORK an appropriate objective for the online case, where this is a
Our problem as formulated is in fact similar to the problem measure of the work done by the scheduler. In our opinion,
of reconfigurable bus architectures [5], [6]. These have been for the stationary problem, the number of wavelengths is a
proposed for standard electrical communication; like the opti- better measure. There are few other models as well, e.g. [22]
cal shutter in light-trails, there is a switch which connects one minimizes total number of transmitters and receivers used in
segment of the bus to another, and which can be set on or off. all light-trails.
Again, even in this model, the switches are slow as compared The general approach followed in the literature to solve
to the data rates on the buses. So from an abstract view points the stationary problem is to formulate the problem as an
this model is very similar to ours. integer linear program (ILP) and then to solve the ILP using
While there is much work in the reconfigurable bus lit- standard solvers. The papers [18], [19] give two different
erature, it mostly concerns regular interconnection patterns, ILP formulations. However, solving these ILP formulations
such as those arising in matrix multiplication, list ranking takes prohibitive time even with moderate problem size since
and so on [7], [8], [9], [10]. The only work we know of the problem is NP-hard. To reduce the time to solve the
dealing with random communication patterns is in relation ILP, the paper [20] removed some redundant constraints from
to the PARBUS architecture. Such patterns are handled using the formulation and added some valid-inequalities to reduce
standard techniques such as Chernoff bounds [11]. We do not know of any work which discusses how to schedule arbitrary irregular communication patterns in this setting. This is probably understandable because reconfigurable bus architectures have mostly been motivated as special-purpose computers, except for the PRAM simulation motivation of PARBUS, where the communication becomes random. However, if the network is used for general-purpose computing, it does make sense to have algorithms to provision bandwidth for arbitrary irregular patterns, as we do here.

After the light-trail technology was introduced in [1], much work has been published in the literature. For example, [12] gives a mesh implementation of light-trails for general networks. The paper [13] implements a tree-shaped variant of light-trails, called clustered light-trails, for general networks. The paper [14] describes "tunable light-trails", in which the hardware initially works just like a simple light-path but can later be tuned to act as a light-trail. There is some preliminary work on multi-hop light-trails [15], in which transmissions are allowed to go through a sequence of overlapping light-trails. Survivability in case of failures is considered in [16] by assigning each transmission request to two disjoint light-trails.

Even with this basic hardware implementation, there are different works solving different design problems. Several objectives are mentioned in the seminal paper [17]: to minimize the total number of light-trails used, to minimize queuing delay, to maximize network utilization, etc. Most of the work in the literature seems to solve the problem by minimizing the total number of light-trails used [18], [19], [20], [21]. Though the [...] the search space. However, the ILP formulation still remains difficult to solve.

Heuristics have also been used. The paper [20] solves the problem in a general network. It first enumerates all possible light-trails of length not exceeding a given limit. Then it creates a list of eligible light-trails for each transmission and a list of eligible transmissions for each light-trail. Transmissions are allocated in an order combining descending order of bandwidth requirement and ascending order of number of eligible light-trails. Among the eligible light-trails for a transmission, the one with a higher number of eligible transmissions and a higher number of already allocated transmissions is given preference. The paper [21] uses another heuristic for the problem in a general network. For a ring network, [19] uses three heuristics.

For the problem on a general network, [16] solves two subproblems. The first subproblem considers all possible light-trails on all the available wavelengths as bins and packs the transmissions into compatible bins, with the objective of minimizing the total number of light-trails used. The second subproblem assigns these light-trails to wavelengths. The first subproblem is solved using three heuristics, and the second by converting it to a graph coloring problem where each node corresponds to a light-trail and there is an edge between two nodes if the corresponding light-trails conflict with each other.

For the online problem, a number of models are possible. From the point of view of the light-trail scheduler, it is best if transmissions are not moved from one light-trail to another during execution, which is the model we use.
It is also appropriate to allow transmissions to be moved, with some penalty. This is the model considered in [19], [23], where the goal is to minimize the penalty, measured as the number of light-trails constructed. The distribution of the transmissions that arrive is another interesting issue. It is appropriate to assume that the distribution is fixed, as has been considered in many simulation studies including our own. For our theoretical results, however, we assume that the transmission sequence can be arbitrary. The work in [19] assumes that the traffic is an unknown but gradually changing distribution. They use a stochastic-optimization-based heuristic which is validated using simulations. The paper [20] considers a model in which transmissions arrive but do not depart. Multi-hop problems have also been considered, e.g. [24]. An innovative idea to assign transmissions to light-trails using online auctions has been considered in [25].

A. Remarks

As may be seen, there are a number of dimensions along which the work in the literature may be classified: the network configuration, the kind of problem attempted, and the solution approach. Network configurations ranging from simple linear arrays/rings [9], [19], [23] to fully structured/unstructured networks [8], [16], [18], [20], [21], [24] have been considered, both in the optical communication literature as well as the reconfigurable bus literature. The stationary problem as well as the dynamic problem has been considered, with additional minor variations in the models. Finally, three solution approaches can be identified. First is the approach in which scheduling is done using exact solutions of Integer Linear Programs [18], [19], [20]. This is useful for very small problems. For larger problems, using the second approach, a variety of heuristics have been used [16], [19], [20], [21]. The evaluation of the scheduling algorithms has been done primarily using simulations. The third approach could be theoretical. However, except for some work related to random communication patterns [11], we see no theoretical analysis of the performance of the scheduling algorithms.

In contrast, our main contribution is theoretical. We give algorithms with provable bounds on performance, both for the stationary and the online case. Our work uses the competitive analysis approach [2] for the online problem. We use techniques of approximation algorithms to solve the stationary problem. To our knowledge, this competitive analysis and approximation algorithm approach to solving the light-trail scheduling problem has not been used in the literature. We also give simulation results for the online algorithms.

III. THE STATIONARY PROBLEM

In this section, instead of considering two unidirectional rings, we consider a linear array of n nodes, numbered 0 to n − 1. Communication is considered undirected. This simplifies the discussion; it should be immediately obvious that all results directly carry over to the two directed rings mentioned in the introduction.

The input is a matrix B with B(i, j) denoting the bandwidth requirement for the transmission from node i to node j, without loss of generality as a fraction of the bandwidth of a single wavelength. The goal is to schedule these in a minimum number of wavelengths w. The output must give w as well as a partitioning of each wavelength into a set of segments. The partitioning may be specified as an increasing sequence of numbers (what we refer to as a configuration) between 0 and n − 1; if u appears in the sequence it indicates that the shutter in node u is off, otherwise the shutter in node u is on. The segment between two off shutters is a light-trail. A transmission from i to j can be assigned to a light-trail L only if u ≤ i, j ≤ v, where u, v are the endpoints of the light-trail. Further, the sum of the required bandwidths assigned to any single light-trail must not exceed 1.

It is customary to consider two variations: non-splittable, in which a transmission must be assigned to a single light-trail, and splittable, in which a transmission can be split into two or more transmissions by dividing up the bandwidth, and the resultant transmissions can be assigned to different light-trails. Our results hold for both variations.

We will use c_l(S) to denote the congestion induced on a link l by a set S of transmissions. This is simply the total bandwidth requirement of those transmissions from S requiring to cross link l. Clearly max_l c_l(S), the maximum congestion over all links, is a lower bound on the number of wavelengths needed. We use c(S) to denote max_l c_l(S). Finally, if t is a transmission, then we abuse notation to write c_l(t), c(t) instead of c_l({t}), c({t}) for the congestion contributed by t only, which is equal to the bandwidth requirement of t.

The key observation behind our algorithm for the stationary problem is: if all transmissions go the same distance in the network, then it is easy to get a nearly optimal schedule. Thus we partition the transmissions into classes based on the length of the transmission. We then stitch the separate schedules back together.

Define the length of a transmission to be the distance between the origin and the destination. Transmissions with length between 2^{i-1} (non-inclusive) and 2^i (inclusive) are said to belong to the ith class, where 0 ≤ i ≤ ⌈log2(n − 1)⌉.

Let R denote the set of all transmissions, and R_i the set of transmissions in class i. Class 0 is served simply by putting shutters off at every node. Clearly, ⌈c(R_0)⌉ wavelengths will suffice for the splittable case, and twice that many for the non-splittable (using ideas from bin-packing [26]). For R_1 also it is easily seen that O(⌈c(R_1)⌉) wavelengths will suffice. So for the rest of this paper we only consider classes 2 and larger.

A. Schedule Transmissions of Class i

Our aim is to partition R_i further into sets S_0, S_1, . . ., each with congestion at most some constant value, so that overall it does not take many wavelengths. We start with T_0 = R_i, and in general, given T_j, we construct S_j and T_{j+1} = T_j \ S_j as follows:
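The pick-up and move-back construction described in the following paragraphs can be sketched in code. This is our own illustrative sketch, not code from the paper: transmissions are assumed to be (left, right, bandwidth) triples on the linear array, and `construct_Sj` / `cong` are hypothetical names.

```python
def cong(trans, l):
    """Congestion of link l (between nodes l and l+1) under a set of transmissions."""
    return sum(r for a, b, r in trans if a <= l < b)

def construct_Sj(T):
    """One round: split T into (S_j, T_{j+1}) by greedy pick-up plus move-back."""
    rest = list(T)
    S = []
    links = range(max(b for a, b, r in T))
    # Pick-up: sweep links left to right, removing at least one unit of
    # congestion from each link (or all of it, if less than a unit crosses it).
    for l in links:
        removed = 0.0
        while removed < 1.0:
            crossing = [t for t in rest if t[0] <= l < t[1]]
            if not crossing:
                break
            t = crossing[0]
            rest.remove(t)
            S.append(t)
            removed += t[2]
    # Move-back: in reverse pick-up order, return t to rest if condition (1)
    # still holds.  Only the links crossed by t need rechecking, since the
    # condition held before and removing t changes no other link.
    target = {l: min(1.0, cong(T, l)) for l in links}
    for t in reversed(list(S)):
        trimmed = [x for x in S if x is not t]
        if all(cong(trimmed, l) >= target[l] for l in range(t[0], t[1])):
            S.remove(t)
            rest.append(t)
    return S, rest
```

Lemma 1 then guarantees c(S_j) < 4 for the returned S_j, while condition (1) (every link keeps at least min(1, c_l(T_j)) congestion inside S_j) is preserved by construction.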
We add transmissions greedily into S_j, starting from the leftmost link l and moving right: for each l we pick transmissions crossing it and move them into S_j until we have removed at least unit congestion from c_l(T_j) or reduced c_l(T_j) to 0. Then we move to the next link. So, at the end, the following condition holds:

∀l: c_l(S_j) = c_l(T_j) if c_l(T_j) ≤ 1, and c_l(S_j) ≥ 1 otherwise.    (1)

However, to make sure that c(S_j) is not large, we move transmissions back from S_j into T_j, in the reverse of the order in which they were added, so long as condition (1) remains satisfied. Now we claim the following:

Lemma 1. Construction of S_j, T_{j+1} from T_j takes polynomial time, and c(S_j) < 4.

Proof: For the first part, it can be seen that the construction takes at most n|T_j| time in the pick-up step and also in the move-back step.

For the second part, at the end of the move-back step, for any transmission t ∈ S_j there must exist a link l such that c_l(S_j) < 1 + c(t), otherwise t would have been removed. We call l a sweet spot for t. Since c(t) ≤ 1, we have c_l(S_j) < 2 for any sweet spot l.

Now consider any link x. Of the transmissions through x, let L_1 (L_2) denote the transmissions having their sweet spot on the left (right) of x. Consider y, the rightmost of these sweet spots, a sweet spot of some transmission t ∈ L_1. Note first that c_y(S_j) < 2. Also, all transmissions in L_1 pass through both x and y. Thus c_x(L_1) = c(L_1) = c_y(L_1) ≤ c_y(S_j) < 2. Similarly, c_x(L_2) < 2. Thus c_x(S_j) = c_x(L_1) + c_x(L_2) < 4. Since this applies to all links x, c(S_j) < 4.

Next we show that not too many S_j will be constructed.

Lemma 2. Suppose S_j is created for class i. Then j ≤ c(R_i).

Proof: Suppose S_j contains a transmission that uses some link l. The construction process above must have removed at least unit congestion from l in every previous step 0 through j − 1. Hence j ≤ c_l(R_i) ≤ c(R_i).

Every transmission in S_j has length at least 2^{i-1} + 1, and must cross some node whose number is a multiple of 2^{i-1}. The smallest-numbered such node is called the anchor of the transmission. The trail-point of a transmission is the rightmost node numbered with a multiple of 2^{i-1} that is on the left of the anchor. If the transmission has its trail-point at some node with number of the form t·2^{i-1}, then we define t mod 4 as its phase.

Lemma 3. The set S_j can be scheduled using O(1) wavelengths.

Proof: We partition S_j further into sets S_j^p containing the transmissions of phase p. Note that the transmissions in any S_j^p either overlap at their anchors, or do not overlap at all. This is because if two transmissions in S_j^p have different anchors, then these two anchors are at least 2^{i+1} distance apart. Since the length of a transmission is at most 2^i, the two transmissions cannot intersect.

So for the set S_j^p, consider 4 wavelengths, each having shutters off at the nodes numbered (4q + p)2^{i-1}. Clearly, for the splittable case, the transmissions will be accommodated in these wavelengths, since c(S_j^p) < 4. For the non-splittable case, 8 wavelengths will suffice, using standard bin-packing ideas [26].

Thus all of S_j can be accommodated in at most 16 wavelengths for the splittable case, and at most 32 wavelengths for the non-splittable case.

Theorem 4. The entire set R_i can be scheduled such that at each link x there are O(c_x(R_i) + 1) light-trails.

Proof: We first consider the light-trails as constructed in Lemma 3. In that construction, it is possible that some light-trails contain links that are not used by any of the transmissions associated with the light-trail. In such cases we shrink the light-trails by removing the unused links (which can only be near either end of the light-trail, because all transmissions assigned to a light-trail overlap at their anchor).

Let j be the largest index such that x has a transmission from S_j. Then we know that c_x(R_i) ≥ j. For each k = 0, 1, . . . , j we have O(1) light-trails at x as described above. Thus we have a total of O(j + 1) = O(c_x(R_i) + 1) light-trails at x.

B. Merge Light-trails of All Classes Together

If we simply collect together the wavelengths as allocated above, we would get a bound of O(c log n). Note, however, that if two transmissions, one in class i and the other in class j, are spatially disjoint, then they could possibly share the same wavelength. Given below is a systematic way of doing this, which gets us the sharper bound.

We know that at each link l there are a total of O(c_l(R_i) + 1) light-trails. Thus, summing over all classes, the total number of light-trails at l is O(c_l(R) + log n).

Think of each light-trail as an interval, giving us a collection of intervals such that any link l has at most O(c_l(R) + log n) = O(c + log n) intervals. Now this collection of intervals can be colored using O(c + log n) colors. So we put all the intervals of the same color in the same wavelength.

IV. ON THE CONGESTION LOWER BOUND

We now consider an instance of the stationary problem. For convenience, we assume m = n − 1 = 2^k for some k, and all logarithms are with base 2. All the transmissions have the same bandwidth requirement α = 1/(log m + 1).

First, we have a transmission going from 0 to 2^k. Then a transmission from 0 to 2^{k-1} and a transmission from 2^{k-1} to 2^k. Then 4 spanning one-fourth the distance, and so on. Thus we have transmissions of log m + 1 classes, each class having transmissions of the same length. In class i ∈ {0, 1, . . . , log m} there are 2^i transmissions B(s_ij, d_ij) = α, where s_ij = jm/2^i, d_ij = (j + 1)m/2^i and j = 0, 1, . . . , 2^i − 1. All other entries of B are 0. This is illustrated in Fig. 1 for n = 17.

Clearly the congestion of this pattern is uniformly 1. Consider an optimal solution. There has to be a light-trail in which the first transmission, from node 0 to m, is scheduled. Thus we must have a wavelength with no off shutters except at node 0 and node m.
[Fig. 1: the instance for n = 17, showing, for each class i = 0, . . . , 4, its 2^i transmissions over the nodes 0–16.]
In this wavelength, it is easily seen that the longest transmissions should be scheduled. So we start assigning the transmissions of the first few classes to this light-trail. Suppose all the transmissions of the first l classes are assigned. Then we have a total of 1 + 2 + 4 + · · · + 2^l = 2^{l+1} − 1 transmissions assigned to this light-trail. The total bandwidth requirement of these transmissions should be less than 1. This gives us (2^{l+1} − 1)(1/(log m + 1)) ≤ 1, implying l ≤ log(log m + 2) − 1 ≈ log log m.

For the subsequent classes of transmissions, we allocate a new wavelength and create light-trails by putting shutters off at the nodes numbered multiples of m/2^{l+1}. It can be seen that again the transmissions of the next about log log m classes can be put in these light-trails. We repeat this process until all transmissions are assigned.

In each wavelength we assign transmissions of log log m classes. There are in total (1 + log m) classes. Thus the total number of wavelengths needed is ⌈(1 + log m)/log log m⌉ = O(log n/log log n), rather than the congestion bound of 1.

For the example in Fig. 1, using this procedure, we have log log m = 2. Thus we require ⌈(1 + log m)/log log m⌉ = 3 wavelengths. The first wavelength is used for the transmissions of classes {0, 1}, the second wavelength for classes {2, 3}, and the third for class 4.

V. THE ONLINE PROBLEM

In the online case, the transmissions arrive dynamically. An arrival event has parameters (s_i, d_i, r_i), respectively giving the origin, destination, and bandwidth requirement of an arriving transmission request. The algorithm must assign such a transmission to a light-trail L such that s_i, d_i belong to the light-trail, and at any time the total bandwidth requirement of the transmissions assigned to any light-trail is at most 1. A departure event marks the completion of a previously scheduled transmission. The corresponding bandwidth is released and becomes available for future transmissions. The algorithm must make assignments without knowing about subsequent events.

Unlike the stationary problem, the congestion at any link may change over time. Let c_{lt}(S) denote the congestion induced on a link l at time t by a set S of transmissions. This is simply the total bandwidth requirement of those transmissions from S requiring to cross link l at time t. The congestion bound c(S) is max_l max_t c_{lt}(S), the maximum congestion over all links over all time instants.

For the online problem, we present two algorithms, SeparateClass and AllClass, having competitive ratios O(log n) and O(log² n) respectively. They are inspired by the analysis of the algorithm for the snapshot problem, as may be seen.

In both online algorithms, when a transmission request arrives, we first determine its class i and trail-point x (defined in Section III-A). The transmission is allocated to some light-trail with end nodes x and x + 2^{i+1}. However, the algorithms differ in the way a light-trail is chosen from the candidate light-trails.

A. Algorithm SeparateClass

In this algorithm, every allocated wavelength is assigned a class label i and a phase label p, and has shutters off at the nodes (4q + p)2^{i-1} for all q, i.e. it is configured to serve only transmissions of that class and phase. Whenever a transmission of class i and phase p is to be served, it is served only by a wavelength with the same labels. If such a wavelength is found, and the light-trail starting at the transmission's trail-point has space, then the transmission is assigned to that light-trail. If no such wavelength is found, then a new wavelength is allocated, and labeled and configured as above.

When a transmission finishes, it is removed from its associated light-trail. The wavelength can be relabeled only when there are no transmissions in any of its light-trails.

Lemma 5. Suppose, at some point of time, among the wavelengths allocated by the algorithm, x wavelengths had non-empty light-trails of the same class and phase across a link l. Then l must have congestion Ω(x) at some instant.

Proof: Number these wavelengths in the order in which they were allocated. Suppose the xth one was allocated due to a transmission t. This could only happen because t could not fit in the first x − 1 wavelengths.

For the splittable case this can only happen if the previous x − 1 wavelengths contained congestion at least x − 1 − c(t) at the anchor of t when t arrived. But this is Ω(x), giving us the result.

For the non-splittable case, suppose that c(t) ≤ 0.5. Then each of the first x − 1 light-trails must have congestion at least 0.5 when t arrived, giving congestion Ω(x). So suppose c(t) > 0.5. Let k be the largest index such that wavelength k contains a transmission t′ with c(t′) ≤ 0.5. If no such k exists, then clearly the congestion is Ω(x). If k exists, then all the wavelengths higher than k have congestion at least 0.5 when t arrived.
And the wavelengths lower than k had congestion at least 0.5 when t′ arrived. So at one of the two time instants the congestion must have been Ω(x).

Theorem 6. SeparateClass is O(log n) competitive.

Proof: Suppose that SeparateClass uses w wavelengths. We will show that the best possible algorithm (including off-line algorithms) must use at least Ω(w/log n) wavelengths.

Consider the time at which the wth wavelength was allocated. At this time w − 1 wavelengths are already in use, and of these w′ = (w − 1)/4 log n must have the same class and phase. Among these w′ wavelengths consider the one which was allocated last, to accommodate some light-trail L serving some newly arrived transmission. At that time, each of the previously allocated w′ − 1 wavelengths was nonempty in the extent of L. By Lemma 5, c(B) = Ω(((w − 1)/4 log n) − 1) = Ω(w/log n). This is a lower bound on any algorithm, even off-line.

B. Algorithm AllClass

This is a simplification of the previous algorithm in that allocated wavelengths are not labeled. When a transmission arrives, if a light-trail of its class and phase capable of including it is found, then the transmission is assigned to it. If no such light-trail is found, then an attempt is made to create such a light-trail from the unused portions of any of the existing wavelengths. If such a light-trail can be created, then it is created and the transmission is placed in it. Otherwise a new wavelength is allocated, the required light-trail is created, and the rest of the wavelength is considered unused.

When a transmission finishes, it is removed from its associated light-trail. If this makes the light-trail empty, then we consider it as unused. Then we check if there are adjacent unused light-trails in the same wavelength. If so, we merge them by turning on the off shutter between them.

Theorem 7. AllClass is O(log² n) competitive.

Proof: Suppose AllClass uses w wavelengths. We will show that an optimal algorithm must use at least Ω(w/log² n). Clearly, we may assume w = Ω(log² n).

We first prove that there must exist a point of time in the execution of AllClass when there are w/4 log n light-trails crossing the same link.

Number the wavelengths in the order of allocation. Consider the transmission t for which the wth wavelength was allocated for the first time. Let L be the light-trail used for t. Clearly, the wth wavelength had to be allocated because the other wavelengths contained light-trails overlapping with L. Of these, if at least w/4 log n light-trails crossed either end of L, then we are done. If this fails, there must be at least w′ = w − 1 − w/2 log n wavelengths that have light-trails which are strictly contained inside the extent of L. Let L′ be the light-trail allocated on the w′th of these wavelengths. Note that L′ is strictly smaller than L. Thus we can repeat the above argument using L′ and w′ in place of L and w respectively, at most log n times, and if we fail each time, we will end up with a light-trail L″ such that there are at least w″ wavelengths with light-trails conflicting with L″, where w″ = w − log n − log n(w/2 log n) = w/2 − log n ≥ w/4 log n for w = Ω(log² n). But L″ is a single link, and so we are done.

Of these w/4 log n light-trails, at least w/16 log² n must have the same class and phase. But Lemma 5 applies, and hence there is a link having congestion at least w/16 log² n. This is a lower bound on the number of wavelengths required by any algorithm, including an offline algorithm.

VI. SIMULATIONS

We simulate our two online algorithms and a baseline algorithm on a pair of oppositely directed rings, with nodes numbered 0 through n − 1 clockwise.

We use slightly simplified versions of the algorithms described in Section V (but easily seen to have the same bounds): basically, we only use phases 0 and 2. Any transmissions that would go into a class i, phase 1 (or phase 3) light-trail are contained in some class i + 1 light-trail (of phase 0 or 2 only), and are put there. We define a class i, phase 0 light-trail to be one that is created by putting off shutters at the nodes jn/2^i for different j, suitably rounding when n is not a power of 2. A class i, phase 2 light-trail is created by putting off shutters at the nodes (jn/2^i + n/2^{i+1}), again rounding suitably. For AllClass, there is a similar simplification. Basically, we use light-trails having end nodes at jn/2^i and (j + 1)n/2^i, or at jn/2^i + n/2^{i+1} and (j + 1)n/2^i + n/2^{i+1}. As before, in SeparateClass we require any wavelength to contain light-trails of only one class and phase, whereas in AllClass a wavelength may contain light-trails of different classes and phases.

For the baseline algorithm, in each ring we use a single off shutter, at node 0. Transmissions from lower-numbered nodes to higher-numbered nodes use the clockwise ring, and the others the counterclockwise ring.

A. The simulation experiment

A single simulation experiment consists of running the algorithms for 100 time steps on a certain load, characterized by parameters λ, D, r_min and α. In our results, each data-point reported is the average of 150 simulation experiments with the same load parameters.

In each time step, every node j that is not busy transmitting generates a transmission (j, d_j, r_j) active for t_j time units. After that, the node is busy for t_j steps, after which it generates another transmission as before. The transmission duration t_j is drawn from a Poisson distribution with parameter λ. The destination d_j of a transmission is picked using the distribution D, discussed later. The bandwidth is drawn from a modified Pareto distribution with scale parameter 100 × r_min and shape parameter α. The modification is that if the generated bandwidth requirement exceeds the wavelength capacity 1, it is capped at 1.

We experimented with α ∈ {1.5, 2, 3} and λ ∈ {0.01, 0.1}, but report results only for α = 1.5 and λ = 0.01; the results for other values are similar.
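The per-node load generation described above can be sketched as follows. This is our own reading of the parameters, not the paper's code: we treat r_min directly as the Pareto scale measured in fractions of a wavelength (the paper's scale of 100 × r_min suggests bandwidth expressed as a percentage), the destination is drawn as in the Uniform distribution, and `sample_duration` / `sample_request` are hypothetical names.

```python
import math
import random

def sample_duration(lam):
    """Poisson(lam) sample via Knuth's method (adequate for the small lam used here)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

def sample_request(n, src, lam, rmin, alpha):
    """One transmission (src, dst, bw) plus its duration, per Section VI-A."""
    dst = random.randrange(n - 1)      # Uniform: any node other than the source
    if dst >= src:
        dst += 1
    dur = sample_duration(lam)
    # Modified Pareto: paretovariate(alpha) >= 1, scaled by rmin, capped at 1.
    bw = min(1.0, rmin * random.paretovariate(alpha))
    return (src, dst, bw), dur
```

At λ = 0.01 most sampled durations are 0; the paper does not say how zero durations are handled, so the sketch leaves them as-is.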
We tried four values, 0.01, 0.1, 0.25 and 0.5, for r_min. Here we report the results for r_min = 0.01, 0.5. We considered four different distributions D for selecting the destination node of a transmission. 1) Uniform: we select a destination uniformly at random from the n − 1 nodes other than the source node. 2) UniformClass: we first choose a class uniformly from the ⌈log n/2⌉ + 1 possible classes. It should be noted that a destination can be at a distance of at most n/2 in any direction, since we schedule along the direction requiring the shortest path. 3) Bimodal: first we randomly choose one of two possible modes. In mode 1, a destination is selected from the two immediate neighbors, and in mode 2 a destination is chosen uniformly from the nodes other than the two immediate neighbors. For applications where transmissions are generated by structured algorithms, local traffic, i.e. unit or short distances (e.g. √n for mesh-like communications), would dominate. Here, for simplicity, we create bimodal traffic which is a mixture of completely local and completely global. 4) ShortPreferred: we select destinations at shorter distances with higher probability. In fact, we first choose a class i in the range 0, . . . , ⌈log n/2⌉ with probability 1/2^{i+1} and then select a destination uniformly from the possible destinations in that class. We report the results only for the distributions Uniform and Bimodal.

B. Results

Fig. 2 shows the results for the 4 load scenarios. For each scenario, we report the number of wavelengths required by the 3 algorithms and the measured congestion as defined in Section V. Each data-point is the average of 150 simulations (each of 100 time steps) with the same parameters, on rings having n = 5, 6, . . . , 20 nodes. We say that the two scenarios corresponding to r_min = 0.01 have low load and the remaining two scenarios (r_min = 0.5) have high load.

For low load, the baseline algorithm outperforms our algorithms. At this level of traffic, it does not make sense to reserve different light-trails for different classes. However, as the load increases, our algorithms outperform the baseline algorithm. For the same load, it is also seen that our algorithms become more effective as we change from the completely global Uniform distribution to the more local Bimodal distribution. This trend was also seen with the other distributions we experimented with.

Though we could not show analytically that AllClass is always better than SeparateClass, our simulations show that AllClass performs better. It may be noted that our algorithm for the stationary case mixes the light-trails of different classes, which also suggests that AllClass might work better.

VII. CONCLUSIONS AND FUTURE WORK

It can be shown that the non-splittable stationary problem is NP-hard, using a simple reduction from bin-packing. We do not know if the splittable problem is also NP-hard. We gave an algorithm for both variations of the stationary problem which takes O(c + log n) wavelengths. It will also be useful to improve the lower bound arguments; as Section IV shows, congestion is not always a good lower bound. This may lead to a constant factor approximation algorithm for the problem.

In the online case we gave two algorithms which we prove to have competitive ratios O(log n) and O(log² n) respectively. In practice we found that the second algorithm was better, and showing this analytically is an important open problem.

Our online model is very conservative in the sense that once a transmission is allocated on a light-trail, the light-trail cannot be modified. However, other models allow light-trails to shrink/grow dynamically [17]. It will be useful to incorporate this (with some suitable penalty, perhaps) into our model. It will also be interesting to devise special algorithms that work well given the distribution of arrivals.

ACKNOWLEDGMENT

We would like to thank Ashwin Gumaste for encouragement, insightful discussions, and patient clearing of doubts related to light-trails.

REFERENCES

[1] I. Chlamtac and A. Gumaste, "Light-trails: A solution to IP centric communication in the optical domain," Lecture Notes in Computer Science, pp. 634–644, 2003.
[2] A. Borodin and R. El-Yaniv, Online Computation and Competitive Analysis. Cambridge University Press, New York, NY, USA, 1998.
[3] D. Sleator and R. Tarjan, "Amortized efficiency of list update and paging rules," Communications of the ACM, vol. 28, no. 2, pp. 202–208, 1985.
[4] A. Karlin, M. Manasse, L. Rudolph, and D. Sleator, "Competitive snoopy caching," Algorithmica, vol. 3, no. 1, pp. 79–119, 1988.
[5] R. Wankar and R. Akerkar, "Reconfigurable architectures and algorithms: A research survey," IJCSA, vol. 6, no. 1, pp. 108–123, 2009.
[6] K. Bondalapati and V. Prasanna, "Reconfigurable meshes: Theory and practice," in Fourth Workshop on Reconfigurable Architectures, IPPS, 1997.
[7] K. Li, Y. Pan, and S. Zheng, "Parallel matrix computations using a reconfigurable pipelined optical bus," Journal of Parallel and Distributed Computing, vol. 59, no. 1, pp. 13–30, 1999.
[8] C. Subbaraman, J. Trahan, and R. Vaidyanathan, "List ranking and graph algorithms on the reconfigurable multiple bus machine," in International Conference on Parallel Processing, ICPP 1993, vol. 3, 1993.
[9] Y. Pan, M. Hamdi, and K. Li, "Efficient and scalable quicksort on a linear array with a reconfigurable pipelined bus system," Future Generation Computer Systems, vol. 13, no. 6, pp. 501–513, 1998.
[10] Y. Wang, "An efficient O(1) time 3D all nearest neighbor algorithm from image processing perspective," Journal of Parallel and Distributed Computing, vol. 67, no. 10, pp. 1082–1091, 2007.
[11] S. Rajasekaran and S. Sahni, "Sorting, selection, and routing on the array with reconfigurable optical buses," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 11, pp. 1123–1132, 1997.
[12] A. Gumaste and I. Chlamtac, "Mesh implementation of light-trails: a solution to IP centric communication," in Proceedings of the 12th International Conference on Computer Communications and Networks, ICCCN'03, 2003, pp. 178–183.
[13] A. Gumaste, G. Kuper, and I. Chlamtac, "Optimizing light-trail assignment to WDM networks for dynamic IP centric traffic," in The 13th IEEE Workshop on Local and Metropolitan Area Networks, LANMAN'04, 2004, pp. 113–118.
[14] Y. Ye, H. Woesner, R. Grasso, T. Chen, and I. Chlamtac, "Traffic grooming in light trail networks," in IEEE Global Telecommunications Conference, GLOBECOM'05, 2005.
[15] A. Gumaste, J. Wang, A. Karandikar, and N. Ghani, "MultiHop Light-Trails (MLT) - A Solution to Extended Metro Networks," Personal Communication.
[16] S. Balasubramanian, W. He, and A. Somani, "Light-Trail Networks: Design and Survivability," The 30th IEEE Conference on Local Computer Networks, pp. 174–181, 2005.
[Fig. 2. Simulation results: number of wavelengths W (and measured congestion) vs. number of nodes n for AllClass, SeparateClass, and Baseline, under the Uniform and Bimodal distributions; (a) low load, (b) high load.]
[17] A. Gumaste and I. Chlamtac, "Light-trails: an optical solution for IP transport [Invited]," Journal of Optical Networking, vol. 3, no. 5, pp. 261–281, 2004.
[18] J. Fang, W. He, and A. Somani, "Optimal light trail design in WDM optical networks," in IEEE International Conference on Communications, vol. 3, June 2004, pp. 1699–1703.
[19] A. Gumaste and P. Palacharla, "Heuristic and optimal techniques for light-trail assignment in optical ring WDM networks," Computer Communications, vol. 30, no. 5, pp. 990–998, 2007.
[20] A. Ayad, K. Elsayed, and S. Ahmed, "Enhanced optimal and heuristic solutions of the routing problem in light trail networks," Workshop on High Performance Switching and Routing, HPSR'07, pp. 1–6, 2007.
[21] B. Wu and K. Yeung, "OPN03-5: Light-trail assignment in WDM optical networks," in IEEE Global Telecommunications Conference, GLOBECOM'06, 2006, pp. 1–5.
[22] S. Balasubramanian, A. Kamal, and A. Somani, "Network design in IP-centric light-trail networks," in 2nd International Conference on Broadband Networks, IEEE Broadnets'05, 2005, pp. 41–50.
[23] A. Lodha, A. Gumaste, P. Bafna, and N. Ghani, "Stochastic optimization of light-trail WDM ring networks using Benders decomposition," in Workshop on High Performance Switching and Routing, HPSR'07, 2007, pp. 1–7.
[24] W. Zhang, G. Xue, J. Tang, and K. Thulasiraman, "Dynamic light trail routing and protection issues in WDM optical networks," in IEEE Global Telecommunications Conference, GLOBECOM'05, 2005, pp. 1963–1967.
[25] A. Gumaste and S. Zheng, "Dual auction (and recourse) opportunistic protocol for light-trail network design," in IFIP International Conference on Wireless and Optical Communications Networks, 2006, p. 6.
[26] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, "Approximation algorithms for bin packing: a survey," Approximation Algorithms for NP-hard Problems, pp. 46–93, 1997.
ADCOM 2009
FOCUSSED SESSION ON RECONFIGURABLE COMPUTING
AES and ECC Cryptography Processor with
Runtime Configuration
Samuel Antão, Ricardo Chaves, Leonel Sousa
Instituto Superior Técnico/INESC-ID
Technical University of Lisbon
Email: {sfan,rjfc,las}@sips.inesc-id.pt
Abstract—The number of applications that require cryptographic support keeps growing. The functionality and security of these applications rely on the capability of cryptographic accelerators to provide adequate performance while maintaining flexibility. In this paper, a programmable cryptographic processor prototype supporting AES and EC (Elliptic Curve) ciphering is presented. This processor consists of up to 12 programmable processing units. We explore and present results for the dynamic reconfiguration of these processing units, allowing the runtime replacement of AES by EC units (or vice-versa) according to the application needs. By combining programmability and runtime reconfiguration, both flexibility and performance can be improved. Moreover, the reconfiguration capability allows the required hardware area to be further reduced, since these functionalities are multiplexed in time. The presented prototype is supported by a Xilinx XC4VSX35 FPGA, consisting of 6 static EC units and 6 reconfigurable AES/EC units, running simultaneously. This processor is able to cipher a 128-bit AES block in 22.9 µs and perform an EC point multiplication in 2.02 ms. The full reconfiguration of a processing unit can be achieved in less time than an EC multiplication.

I. INTRODUCTION

Currently, most applications require security and authentication services. Several protocols have been designed to provide such requirements to these applications, and they are used in a variety of devices: from smart cards, wireless sensors, cell phones, and laptops, which usually need a small number of connections, to high-end servers that have to establish thousands of connections. For such a wide variety of devices, there is also a wide range of different demands. The following highlights the key features that have to be considered:
• performance, supporting high throughput and low latency;
• low cost, using cheap platforms and mass-produced computing elements;
• compactness, allowing the coexistence of different applications in a small pool of resources;
• flexibility, allowing adjustment to different needs;
• low power, enhancing battery savings, reducing the costs in energy and heat sinks, and increasing autonomy.
Security and authentication protocols are often supported by two main types of cryptographic functions: symmetric and asymmetric. The former allows two entities that share a secret to establish a secure and confidential communication, while the latter allows two entities to create a distributed secret without any previously agreed information. Several algorithms have been proposed to implement these cryptographic functions, and the most successful ones have been adopted for their strength against attacks and their compatibility with the demand for performance and compactness [1], [2].
Regarding the asymmetric algorithms, the Elliptic Curve (EC) cryptosystem has emerged as a reliable and effective alternative to the widely used Rivest-Shamir-Adleman (RSA) algorithm. EC cryptosystems have the advantage of providing greater security per bit of the secret key; therefore, smaller keys need to be used and stored. Consequently, since smaller keys need to be transmitted, more compact and bandwidth-efficient cryptosystems can be developed.
Although symmetric algorithms do not offer the same properties as asymmetric ones in terms of secret key construction, they are simpler, more compact, and more efficiently computed, allowing for better area and throughput metrics. Thus, their usage is mandatory for some applications. Currently, one of the most widely used algorithms for symmetric cryptography is the Advanced Encryption Standard (AES) [2].
Although both symmetric and asymmetric algorithms have been shown to provide good performance metrics, their complexity is still considerable. To overcome this problem, hardware accelerators are employed. Several accelerators have been proposed, supported on Application Specific Integrated Circuit (ASIC) solutions [3], Field Programmable Gate Arrays (FPGA) [4], Graphics Processing Units (GPU) [5], [6], and Instruction Set Architecture (ISA) extensions for general purpose processors [7]. While the flexibility of the solutions increases when we move from ASICs to general purpose solutions, the performance decreases. The ASIC approach allows for fast and low power solutions, but with limited adaptability and higher design costs. General purpose processor solutions allow for optimal programmability, but achieve relatively low performance at higher power costs. GPU solutions allow for the utilization of a large amount of parallel hardware structures at a reduced cost, because of the massive production driven by the gaming market. However, the GPU's datapath is not optimized for cryptographic procedures, the parallelism extraction for cryptography is limited, and the power consumption is significant. FPGA solutions are a compromise between the high performance/low power of the ASIC and the flexibility/low cost of the general purpose
processors. Moreover, FPGAs allow programmable solutions to be combined with reconfiguration capabilities, providing adaptable datapaths. FPGAs can therefore be considered an advisable option to efficiently support a wider range of cryptographic algorithms and procedures.
This paper proposes a general cryptographic processor supported on an FPGA. This programmable processor was designed to take advantage of the reconfigurable capabilities of an FPGA to achieve good performance metrics and enhanced flexibility. The processor proposed in this work aims to support the majority of the security and authentication protocols, introducing microcoded AES and EC cores, and a true Random Number Generator (RNG) supported on oscillator rings to generate secrets. Very few attempts have been made in the related art to combine AES and EC arithmetic into a single arithmetic body. The efficiency of such approaches is compromised by the difference in the size of the datapath (m ≥ 163 for the EC versus m = 8 for the AES), requiring the use of different irreducible polynomials and thus different reduction structures. Our approach is different: instead of sharing the datapath for the AES and EC arithmetic, we create individual, compact and high-performance AES and EC cores that share the same microcoded control unit. With this approach, and using the reconfiguration capabilities of the FPGA, it becomes very easy and efficient to dynamically trade AES and EC cores, depending on the requirements. A compact and flexible cryptographic processor with good performance metrics is obtained. With an RNG associated with the processing units, the secret keys of the protocols can be locally computed and directly stored in the processing units' memory. Avoiding the communication of secret keys makes the system more secure and resistant to external attacks.
The paper is organized as follows. In Section II we provide a brief introduction to the AES and EC arithmetic. In Section III we present the details of the reconfigurable architecture used. In Section IV we describe the system layout used to handle the runtime configuration of processing units. Section V presents results for the developed prototype, and Section VI draws some conclusions about the developed work.

II. AES AND EC CRYPTOGRAPHY

In this section we briefly introduce the arithmetic that the proposed processor supports.

A. AES arithmetic

The AES algorithm is composed of three main operations: the key expansion, the ciphering, and the deciphering. In the key expansion operation, the used key, with 16, 24 or 32 bytes, is expanded in order to obtain 176, 208, or 240 bytes, depending on the initial size. This expanded key is divided into sets of 16 bytes, and each set is used in one round of the ciphering/deciphering operation. The number of rounds depends on the used key size. The key and data used in the ciphering/deciphering rounds are organized in a common way: in a 4 × 4 byte matrix. Each AES round affects each of these matrices' elements using the following elementary operations:
• byte additions over GF(2^8), which correspond to an 8-bit bitwise exclusive OR (XOR) operation;
• the non-linear function S(.), often called an SBox, and its inverse; this function can be computed with multiplications and inversions over GF(2^8) with the irreducible polynomial I(x) = x^8 + x^4 + x^3 + x + 1;
• data matrix multiplication with constant matrices, with the irreducible polynomial I(x);
• a matrix row rotating shift operation.
Further details about these operations and how they are applied can be found in [2].

B. EC arithmetic

An EC over GF(2^m) is a set composed of a point at infinity O and the points P_i = (x_i, y_i) ∈ GF(2^m) × GF(2^m) that comply with the following equation:

y_i^2 + x_i y_i = x_i^3 + a x_i^2 + b,  a, b ∈ GF(2^m).  (1)

By establishing the addition operation over the EC points and by applying it recursively, it is possible to obtain the multiplication-by-a-scalar operation. It is known to be computationally hard to invert this operation, since it is difficult to determine, from the recursive addition result of an EC point, how many times this point was added. This is known as the Elliptic Curve Discrete Logarithm Problem (ECDLP), which supports the security of EC cryptosystems.
The EC point addition and doubling (addition to itself) are performed with operations over the underlying field GF(2^m) applied to the points' coordinates. These GF(2^m) operations are addition, multiplication, squaring and inversion, modulo an irreducible polynomial of degree m. Details about how these operations can be efficiently performed can be found in [8].

III. CRYPTOGRAPHIC PROCESSOR ARCHITECTURE AND DETAILS

In this work we developed a prototype of a cryptographic accelerator supported on reconfigurable hardware, namely a prototyping board powered by a Xilinx Virtex 4 FPGA [9]. The aim of this prototype is to support the majority of protocols that need asymmetric and symmetric cryptographic schemes, as well as the secure generation of secret keys for these protocols. A schematic overview of the proposed processor organization is presented in Figure 1. The processor is composed of several processing units (PUs), responsible for computing the cryptographic procedures. An RNG is also included in order to generate the secret data (such as the private keys). The processor has an I/O interface to communicate with the main controller, which we herein call the host of the processor, and to exchange with it the data (public keys, plain texts, ciphered texts). This interface is also used to provide commands, such as start commands for the PUs or write/read commands of data and instructions. Through this interface the host can query the processor for, e.g., the availability of PUs, or check whether the required tasks are already done. When the host sends a write command to any PU, it also defines the origin
of the data to be written, namely external data or internal data read from the RNG. Thus, the host can use the secret information without having to touch or to know it.

[Fig. 1: Processor Organization Overview.]

All the PUs run according to the microcode stored in a centralized instruction memory. For this, each PU has its own microprogram counter (µPC) and startup addresses to run and control the flow of the correct program. An arbiter controls the access to the instruction memory according to a priority policy, and signals any PU when the requested instruction is valid. Each PU contains its local data memory, which is addressed according to the received microinstructions. Input data and temporary data, as well as the final results, are stored in this local memory. This memory can be accessed by the host when the PU is set to the idle state, through specific microinstructions directly provided by the host, making it possible to read and write data from/to the PUs. The width of the data memory, as well as the details of the available arithmetic units, is customizable according to the type of the PU. Different types of PUs support different cryptographic procedures.
With this modular architecture, the PUs share the same control through the instruction memory while facilitating the replacement of a given PU by another one. This allows full advantage to be taken of the reconfiguration capabilities of the electronic devices.

[Fig. 2: Architecture of the processing units: (a) AES PU; (b) EC PU.]

A. PU for AES

The architecture of the AES PU is presented in Figure 2a. This architecture is composed of a data RAM of 512 positions, a ROM, and two adders. The ROM implements a look-up table for the non-linear function S(.) and its inverse S^-1(.) (see Section II-A). We also include in this ROM the operations 2S(x), 3S(x), 9x, 11x, 13x, and 14x, where x represents the ROM address. With these operations, we are able to perform the multiplications with the constant matrices.
Since the computation of the AES is performed over GF(2^8), the used datapath and memory width is 8 bits. Given that x has 8 bits, the ROM has 2048 entries of 8 bits. This amount of data fits a single BRAM present in the Xilinx Virtex 4 technology. Furthermore, since these BRAMs are dual port, the same ROM can be used for two PUs. The two adders at the input and output of the ROM perform the required additions for the AES arithmetic. With this architecture the following basic operations are implemented, where L(.) is a look-up result: R(c) = R(a) + R(b), R(c) = R(a) + L(R(b)), R(c) = L(R(a) + R(b)), and R(c) = L(R(b)), where a, b, and c are the addresses provided by the microinstruction. An operation to load a constant directly to the memory is also implemented. The byte shift operations can be handled with the appropriate addressing of the data, since each address corresponds to one byte. Regarding the flow control of the program, the following operations are implemented:
• jmpset: set the value of an indexing counter;
• jmpinc: jump if the value associated with this jump instruction matches the value in the indexing counter; the indexing counter is incremented;
• jmpdec: similar to jmpinc, but decrements the indexing counter;
• end: determines the end of the program, whereupon the PU becomes idle.
It is also possible to add the value in the indexing counter, multiplied by 16, to the data addresses. This makes it easy to browse through the 16-byte matrices in which the AES data is organized in the data BRAM, depending on the indexing counter. All these functionalities, including the choice of the ROM look-up and the usage of the indexing counter in the addresses, are mapped in microinstructions of 36-bit width. Each microinstruction runs in 3 clock cycles: one cycle to read the data, one cycle to read the ROM, and another cycle to write the result.

B. PU for EC

We base the EC PU on our previous work, presented in [8], for polynomial basis field arithmetic. The architecture of this compact and flexible PU is similar to that of the AES PU, and is depicted in Figure 2b. There is also a data BRAM where the field elements (of size m ≥ 163) are split and stored in 21-bit words. The arithmetic logic supports additions of two-word operands and Karatsuba-Ofman multiplications.
The microcode adopted for the EC PU can be classified into two main microinstruction types. The complex microinstructions (type I) are performed over field elements, while the lower complexity microinstructions (type II) operate over words. There is a reserved type I microinstruction that corresponds to a customizable sequence of type II operations.
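The word-oriented field arithmetic described above can be sketched in software. The following Python fragment is our own illustration, not the paper's VHDL (the function names `clmul` and `karatsuba_clmul` are ours): it performs one Karatsuba-Ofman step on operands split into two w-bit words, with XOR playing the role of addition over GF(2), mirroring the EC PU's two-word additions.

```python
import random

def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiplication, bit by bit."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_clmul(a: int, b: int, w: int = 21) -> int:
    """One Karatsuba-Ofman step on operands split into w-bit words.

    a and b are polynomials over GF(2) of degree < 2*w, viewed as a
    high and a low w-bit word; additions are XORs, as in the EC PU.
    """
    mask = (1 << w) - 1
    a0, a1 = a & mask, a >> w            # low/high words of a
    b0, b1 = b & mask, b >> w            # low/high words of b
    low = clmul(a0, b0)                  # a0*b0
    high = clmul(a1, b1)                 # a1*b1
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ low ^ high  # cross term
    return (high << (2 * w)) ^ (mid << w) ^ low
```

A single splitting level like this matches the two-word operand structure; a full GF(2^m) multiplier would additionally reduce the product modulo the irreducible polynomial of degree m.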
reconfiguration, when the host enables the PU, a reset pulse is generated for that PU to set it to the idle state.
In order to support both AES and EC processing units, the reconfigurable zones must fit either type of PU. Between the two considered PUs, the most demanding in terms of resources is the EC PU, due to the wider datapath (21-bit instead of 8-bit) and larger complexity. For these reasons, the reconfigurable zones are sized to fit an EC PU.

[Fig. 4: AES PUs with shared look-up memory.]

Since the several PUs compete for access to the instruction memory, conflicts can exist, and thus some PUs may stall waiting for their request to be fulfilled. This conflict penalty will increase if the number of PUs appended to one of the instruction memory ports increases. This effect has to be taken into account when setting the number of PUs in the design and, consequently, the number of reconfigurable zones. Each of the AES operations requires 3 clock cycles to perform, while an EC operation requires from 3 to 14 cycles. This means that the average number of clock cycles per instruction in the AES PUs is lower than the EC PUs' average; thus, the AES PUs will generate more conflicts than the EC PUs. Since the arbiter can issue one instruction per clock cycle, only a maximum of 3 AES PUs can ideally operate at the same time without conflicts. Putting a fourth PU with less priority than the others will cause this fourth PU to stall until one of the others finishes the ongoing computation. This observation determines the number of required reconfigurable zones, which is 3 per instruction memory port. Thus, the system can have up to 3 AES PUs per instruction memory port, implemented in the reconfigurable zones. The system can have more static EC PUs, according to the conflicts that the user admits or to the available resources. Considering a dual-port instruction memory, the number of reconfigurable zones can be doubled, to 6.
The use of dual-port memories also contributes to reducing the resources used in the design of the AES PUs. Considering a dual-port look-up ROM, the same memory can be implemented statically outside the PUs and shared by two AES PUs, as Figure 4 suggests. Moreover, this procedure keeps the information inside the look-up memory out of the configuration data, enhancing the compactness of the bitstream and the configuration speed. Another issue that has to be considered while reconfiguring the PUs is the number of signals that cross the reconfigurable zone boundary, since the path of these signals through the boundary has to be directly instantiated. This instantiation, except for the clock signal when provided by a global buffer, is performed using directional slice bus macros. These bus macros are provided with the Xilinx ISE tools that support dynamic reconfiguration. The number and type of the required macros is determined by the number of PU input and output signals. Each bus macro occupies a Configurable Logic Block (CLB), which corresponds to 4 slices, and supports up to 8-bit signals. To determine the number of bus macros, the maximum number of inputs and the maximum number of outputs over both PU types (AES and EC) have to be considered. For the proposed design, a maximum of 89 input signals (⌈89/8⌉ = 12 bus macros) and 64 output signals (⌈64/8⌉ = 8) is required, corresponding to a total of 20 bus macros.

V. EXPERIMENTAL RESULTS AND RELATED WORK

The proposed design was successfully implemented and experimentally tested on a prototyping board powered by a Xilinx XC4VSX35-10 FPGA [9]. We implemented and evaluated different combinations of the numbers of AES and EC cores. These implementations refer to EC arithmetic over GF(2^163) and AES arithmetic with a 128-bit key size. The FPGA programming files were obtained from a VHDL description of the hardware, synthesized with the Synplify Premier C-2009.06 tools and placed and routed with the ISE 9.2.04i PR14 tools. The Virtex 4 technology supports dynamic reconfiguration using the Internal Configuration Access Port (ICAP). The advantage of using this port is the possibility of directly instantiating it and conjugating it with the remaining design, including the communication logic, which can write the reconfiguration bitstream directly to this port.
The Virtex 4 FPGA contains block RAMs that provide true dual-port capabilities. This allows all the memories employed in the design (instruction, data, and look-up) to be dual port, saving resources. As discussed in Section IV, the maximum number of AES PUs competing for an instruction memory port can be up to 3. Thus, we use 6 reconfigurable zones that can be reconfigured with an EC or AES PU. We also implement another 6 (3 per instruction BRAM port) static EC PUs. Thus the design can have up to 12 PUs working simultaneously, of which up to 6 can be AES PUs. The reason for implementing only 6 static PUs is related to the slice resource constraints and to the increasing number of conflicts while accessing the instruction BRAM. We considered an instruction memory with 1024 36-bit instructions to contain all the routines for EC and AES arithmetic.
The static design contains the required logic to implement the communication with the host, the random number generation, the AES look-up memories, and the 6 static EC PUs. The required resources to implement the static design are 8,446 slices and 11 BRAMs (2 for the instruction memory, 3 for the AES look-up ROMs, and 6 for data storage in the 6 static EC PUs).
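The arbiter argument above (one grant per cycle, 3 cycles per AES instruction, fixed priorities) can be checked with a toy simulation. The model below is our own sketch, not the actual hardware, and the function name and parameters are ours: each PU requests a fetch every `cycles_per_instr` cycles, and a fixed-priority arbiter serves one request per cycle.

```python
def stall_cycles(n_pus: int, cycles_per_instr: int = 3,
                 total_instrs: int = 100) -> int:
    """Toy model of PUs sharing one instruction-memory port.

    Each PU requests a new instruction every `cycles_per_instr`
    cycles; a fixed-priority arbiter grants one request per cycle.
    Returns the total number of cycles PUs spend stalled waiting.
    """
    next_req = [0] * n_pus   # cycle at which each PU wants its next fetch
    done = [0] * n_pus       # instructions completed per PU
    stalls = 0
    t = 0
    while min(done) < total_instrs:
        # highest-priority PU with a pending request gets the grant
        for pu in range(n_pus):
            if next_req[pu] <= t and done[pu] < total_instrs:
                stalls += t - next_req[pu]           # time spent waiting
                next_req[pu] = t + cycles_per_instr  # busy for 3 cycles
                done[pu] += 1
                break
        t += 1
    return stalls
```

In this model, with up to 3 PUs the only stalls are the start-up ones, while a fourth, lowest-priority PU stalls until a higher-priority PU finishes its program, matching the observation that 3 AES PUs per port is the conflict-free maximum.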
[Fig. 5: System layout after place and route, showing the reconfigurable zones and bus macros: (a) no reconfigurable PUs; (b) AES PUs only; (c) ECC PUs only.]

There are 6 reconfigurable zones in the design, with a rectangular shape of 13 CLB width and 21 CLB height (13 × 21 × 4 = 1092 total slices). Considering the size of the Virtex 4 configuration frames, which have 1 CLB width and 16 CLB height, the reconfiguration of a reconfigurable area requires the communication of 26 frames. The different reconfigurable zones do not intersect the reconfiguration frames of the others. For this, the bottom boundaries of the reconfigurable zones are at the CLB coordinates 0, 32, and 64 (slices 0, 64, and 128). The layout of the system, as well as the bus macro locations, is depicted in Figure 5 for different contents of the reconfigurable zones after place and route. Each PU employs 1 BRAM for its data memory. The reconfigurable AES PU requires 157 ± 2 slices and the EC PU requires 943 ± 7 slices. The variation in the slice resources employed by each PU is due to the slightly different placing of the resources by the tools for the different reconfiguration zones. The occupation of the reconfigurable zones by the PUs is 14% and 86% for the AES and EC PUs, respectively. Although the occupation of the reconfigurable zones is not complete, the margin of free resources helps to reduce the routing delays. Regarding the complete design, the required resources are 14,092 slices (92% of the total resources) with all the reconfigurable zones implementing EC PUs, and 9,387 slices (61% of the total resources) with all the reconfigurable zones implementing AES PUs. Considering the reconfigurable zones completely occupied, the required resources for the complete design are 14,998 slices (98% of the complete resources). The obtained system can run at a maximum frequency of 100.3 MHz.
The reconfiguration bitstreams were generated in compressed format, using the appropriate Xilinx tools options. The minimum and maximum sizes, in 32-bit words, of the runtime reconfiguration bitstreams are 30662 and 31067 for the EC PUs, and 27898 and 29500 for the AES PUs. Although the reconfiguration area is the same for the AES and EC PUs, the AES PUs result in approximately 5% smaller bitstreams due to the lower utilization of resources, allowing for a slightly higher compression. The reconfiguration time is directly correlated with the bitstream size and the clock frequency. The ICAP in Virtex 4 technologies allows a 32-bit reconfiguration word to be written in each clock cycle. The maximum ICAP working frequency is 100 MHz [13]; thus we expect that the maximum reconfiguration time can be 31067/100 MHz ≈ 310 µs. In the developed prototype, however, the reconfiguration bitstream is communicated from outside the device and written directly to the ICAP, so the reconfiguration time is limited by the communication process. The communication is performed through a PCI bus working at 33 MHz. Hence, we drive the ICAP at the same bus frequency, with the incoming data immediately transferred to the ICAP. With this, we obtain a maximum reconfiguration time of 31067/33 MHz ≈ 941 µs. In the next subsection we present the results specific to the RNG and PU operation.

A. Random Number Generator

In order to validate the implementation of the RNGs, random bitstreams were collected from the processor and their randomness tested using batteries of tests. Two main batteries of tests are used for this purpose: the National Institute of Standards and Technology (NIST) tests [14] and the Diehard ones [15]. For the implementation proposed in [12], in order to pass both batteries of tests successfully, the authors obtained a RNG with 25 oscillators of 3 inverters each, sampled at
100 MHz. The option of using 3 inverters is justified by the enhanced compactness of the implementation.
The randomness of the bitstream is enhanced if the number of oscillators increases and/or the sampling frequency decreases. For the processor presented herein, using 3 inverters per oscillator, the number of oscillators required to pass both the NIST and Diehard tests at 100 MHz, which is the operating frequency of the prototype, was shown to be 20. Each oscillator is implemented within a CLB, resulting in a very compact RNG.

B. Processing Units

Using the proposed architecture and microcode format, we were able to program the EC scalar multiplication and point addition in 401 instructions, and the AES key expansion, ciphering and deciphering in 253 instructions. The total latency for the EC PUs is 201,661 clock cycles for the EC scalar multiplication and 4,796 clock cycles for the point addition. The latency for the key expansion and ciphering/deciphering in our AES PU is 610 clock cycles and 2,290 clock cycles, respectively.
Performance metrics for different combinations of simultaneously working PUs in the cryptographic processor are presented in Table I. These metrics are measured at the prototype operating frequency, 100 MHz.
The evaluation in Table I uses 1 EC scalar multiplication and 88 consecutive AES ciphering operations, because the time consumed by one individual EC point multiplication is approximately the time of 88 AES operations, allowing a fair analysis. Although the instruction memory has two ports, we focus our analysis on a single arbiter individually, and thus on one of the instruction memory ports. This analysis holds for both arbiters, even if the configuration of the PUs attached to them is different.
An EC point multiplication produces a result in 2.02 ms if no conflicts occur; thus the proposed design provides a throughput of 496 Op/s for only one PU. For 6 EC PUs running simultaneously, the throughput is 1,536 Op/s, which is lower than 6 times the throughput for one PU, due to the conflicts in accessing the instructions. Performing the same analysis for the AES arithmetic, considering the ciphering of 128-bit blocks, the proposed processor provides a throughput from 5.6 Mbit/s for 1 PU to 16.8 Mbit/s for 3 PUs. In this case, the throughput of the system scales directly with the number of PUs, since all the instructions for the 3 AES PUs competing for the instruction memory take the same 3 clock cycles, and thus no conflicts will occur. Intermediate configurations can be useful for the dynamic requirements of the host.
We also introduce an efficiency metric in Table I. This efficiency measures the impact of the request conflicts solved by the instruction memory arbiter: it is the ratio of time used for useful computing by all the operating PUs within a specific time interval. To perform this efficiency measurement we programmed all the PUs to run the same program consecutively, and after a specific time interval T, measured in clock cycles, the number of complete EC (n_EC) and AES (n_AES) operations were counted. The efficiency (E) is given by:

E = (n_EC T_EC + n_AES T_AES) / (n_PU T),  (2)

where T_EC and T_AES are the times of a single EC and AES operation, respectively, without conflicts in the memory accesses, measured in clock cycles, and n_PU is the number of active PUs. From Table I, it can be observed that the efficiency is very close to 100% for configurations with fewer than 4 PUs. This result arises from the fact that an instruction takes at least 3 clock cycles to complete, so the number of conflicts in the arbiter is negligible. Moreover, for the other configurations the efficiency is always greater than 61%.
Comparing the presented results with the related work is not straightforward, since different technologies and different metrics are used by different authors. Nonetheless, we introduce some related art results to comparatively evaluate our design.
In [4] a compact AES/EC design is proposed, supported on a Xilinx Virtex XCV800 platform running at 41 MHz. Several Logical Units (LUs) that support the basic field operations over GF(2^8) are organized in two reconfigurable modes: a Single-Instruction-Multiple-Data (SIMD) mode that supports the AES arithmetic, and a Single-Instruction-Single-Data (SISD) mode that supports the EC arithmetic. This design does not support simultaneous EC and AES arithmetic, since the LUs must be reconfigured to reuse resources. This design offers a throughput of 3.8 Mbit/s for the AES ciphering (128-bit key), and a point multiplication (in GF(2^163)) latency of 5.36 ms. Our AES throughput when using one PU is 5.5 Mbit/s (1.4 times higher) and our latency for the EC point multiplication is 2.02 ms (2.65 times lower). The design in [4] occupies 220K gates (approx. 2329 slices), which is 2.1 times more than one reconfigurable zone in our design. In [4], the sharing of the datapath between the AES and EC results in the splitting of an operation into smaller ones, when these operations could be more efficiently computed in dedicated hardware or using look-up tables. This could justify the lower performance metrics of this design.
In [3] a 0.18 µm ASIC solution operating at 100 MHz is proposed. In this solution the AES and EC arithmetic share most of the multipliers and registers. With 56K gates, the authors in [3] state that a throughput of 64 Mbit/s for the AES, and a latency of 1.8 µs for a field multiplication, can be achieved. Considering that 983 field multiplications and 650 squaring operations are required for implementing the EC multiplication algorithm, we estimate that the EC point multiplication latency would be >2.9 ms. The design proposed herein is able to perform the EC point multiplication 1.4 times faster. Although our AES throughput is lower, our design can operate AES and EC simultaneously and offers a flexibility and programmability that an ASIC solution cannot.
In [16], a compact solution for AES supported by a Xilinx XC2S15 FPGA running at 67 MHz is proposed. This design is supported by two main arithmetic units: a multiply-accumulate unit and a byte substitution unit, to support the non-linear function
TABLE I: Performance metrics for different combinations of simultaneously working PUs.

ECC PUs | AES PUs | Latency (K clk cycles) | Latency (ms) | ECC throughput (Op/s) | AES throughput (Mbit/s) | Efficiency (%)
0 | 0 | - | - | - | - | -
1 | 0 | 201.7 | 2.02 | 496 | - | 100.00
2 | 0 | 201.7 | 2.02 | 992 | - | 100.00
3 | 0 | 201.7 | 2.02 | 1488 | - | 100.00
4 | 0 | 342.3 | 3.42 | 1169 | - | 82.50
5 | 0 | 344.9 | 3.45 | 1450 | - | 71.60
6 | 0 | 390.5 | 3.91 | 1536 | - | 61.67
0 | 1 | 201.5 | 2.02 | - | 5.59 | 99.98
1 | 1 | 206.9 | 2.07 | 483 | 5.44 | 99.08
2 | 1 | 223.5 | 2.24 | 895 | 5.04 | 96.61
3 | 1 | 348.8 | 3.49 | 860 | 3.23 | 81.80
4 | 1 | 354.5 | 3.55 | 1128 | 3.18 | 71.24
5 | 1 | 391.4 | 3.91 | 1278 | 2.88 | 61.59
0 | 2 | 201.5 | 2.02 | - | 11.18 | 99.98
1 | 2 | 208.1 | 2.08 | 481 | 10.83 | 98.57
2 | 2 | 348.7 | 3.49 | 574 | 6.46 | 79.27
3 | 2 | 350.3 | 3.50 | 856 | 6.43 | 70.72
4 | 2 | 385.9 | 3.86 | 1037 | 5.84 | 61.20
0 | 3 | 201.5 | 2.02 | - | 16.77 | 99.98
required in the AES. These units are controlled by microinstructions, and a microprogram counter controls the program flow and branches. This design achieves a throughput of 2.2 Mbit/s occupying 124 slices and 2 BRAMs. Our AES PU offers a throughput 2.5 times higher with 1092 slices allocated for its reconfigurable zone and 4 BRAMs. These 4 BRAMs are the minimum required for an AES PU to operate in the herein proposed design.

VI. CONCLUSIONS

In this paper, a microcoded and customizable cryptographic processor prototype is presented, capable of efficiently computing the AES and EC algorithms, as well as the generation of secrets through a RNG. The adopted approach relies on efficient and compact EC and AES processing units that share the same control from a central microinstruction memory, allowing simultaneous computing of AES and EC routines. With this processor, customization can be performed by adding processing units according to the processing needs. Additional configuration can be achieved at runtime through the dynamic reconfiguration capabilities of the FPGA. These characteristics make this processor highly adaptable and flexible. The reconfiguration time for a single PU is smaller than an EC multiplication, resulting in negligible impact on the system performance if several reconfigurations need to be performed. The proposed processing units, which provide the computing power of the processor, have shown to be very compact and suitable for embedded systems, supporting AES and EC with configurations fitting reconfiguration zones of 1092 slices each, and throughputs up to 1536 Op/s for EC and 16.8 Mbit/s for AES. Another advantage of the proposed processor is the inclusion of a compact true RNG in the architecture. This true RNG allows for the internal generation of secrets (such as private keys), thus enhancing the system security.

REFERENCES

[2] ——, "Federal Information Processing Standards Publication 197: Advanced Encryption Standard," November 2001.
[3] J. Wang, X. Zeng, and J. Chen, "A VLSI implementation of ECC combined with AES," Proc. 8th International Conference on Solid-State and Integrated Circuit Technology, pp. 1899–1904, March 2006.
[4] W. Lim and M. Benaissa, "Subword parallel GF(2^m) ALU: an implementation for a cryptographic processor," Proc. IEEE Workshop on Signal Processing Systems, pp. 63–68, Aug. 2003.
[5] R. Szerwinski and T. Guneysu, "Exploiting the Power of GPUs for Asymmetric Cryptography," Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES, pp. 79–99, Aug. 2008.
[6] S. Manavski, "CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography," Proc. IEEE International Conference on Signal Processing and Communications, pp. 65–68, Nov. 2007.
[7] O. Kocabas, E. Savas, and J. Grossschadl, "Enhancing an Embedded Processor Core with a Cryptographic Unit for Speed and Security," Proc. International Conference on Reconfigurable Computing and FPGAs, pp. 409–414, Dec. 2008.
[8] S. Antão, R. Chaves, and L. Sousa, "Compact and Flexible Microcoded Elliptic Curve Processor for Reconfigurable Devices," Proc. 7th IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM, March 2009.
[9] Annapolis Micro Systems, Inc., Wildcard 4 Summary Description, 2007, http://www.annapmicro.com/wc4.html.
[10] P. Kocher, J. Jaffe, and B. Jun, "Differential Power Analysis," Proc. 19th Annual International Cryptology Conference, Advances in Cryptology, CRYPTO, vol. 1666, pp. 388–397, 1999.
[11] B. Sunar, W. Martin, and D. Stinson, "A provably secure true random number generator with built-in tolerance to active attacks," IEEE Transactions on Computers, vol. 56, no. 1, p. 109, 2007.
[12] K. Wold and C. Tan, "Analysis and Enhancement of Random Number Generator in FPGA Based on Oscillator Rings," Proc. International Conference on Reconfigurable Computing and FPGAs, REConFig, pp. 385–390, 2008.
[13] Xilinx, Inc., Virtex-4 FPGA Data Sheet: DC and Switching Characteristics, version 3.7, 2009, http://www.xilinx.com/support/documentation/data_sheets/ds302.pdf.
[14] National Institute of Standards and Technology, "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, Special Publication 800-22, Revision 1," August 2008, http://csrc.nist.gov/publications/nistpubs/800-22-rev1/SP800-22rev1.pdf.
[15] G. Marsaglia, "Diehard Battery of Tests of Randomness," 1995, http://stat.fsu.edu/pub/diehard/.
[16] T. Good and M. Benaissa, "AES on FPGA from the Fastest to the Smallest," Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES, pp. 427–440, September 2005.
The Delft Reconfigurable VLIW Processor
Stephan Wong, Fakhar Anjam
Computer Engineering Laboratory
Delft University of Technology
Mekelweg 4, 2628 CD Delft,The Netherlands
E-mail: J.S.S.M.Wong@tudelft.nl, F.Anjam@tudelft.nl
Abstract—In this paper, we present the rationale and design of the Delft reconfigurable and parameterized VLIW processor called ρ-VEX. Its architecture is based on the Lx/ST200 ISA developed by HP and STMicroelectronics. We implemented the processor on an FPGA as an open-source softcore and made it freely available. Using the ρ-VEX, we intend to bridge the gap between general-purpose and application-specific processing through parameterization of many architectural and organizational features of the processor. The parameters include: instruction set (number and type of supported instructions), the number and type of functional units (FUs), issue-width (number of slots), register file size, and memory bandwidth. The parameters can be set in a static or dynamic manner in order to provide the best performance or the best utilization of available resources on the FPGA. A complete toolchain including a C compiler and a simulator is freely available. Any application written in C can be mapped to the ρ-VEX processor. This VLIW processor is able to exploit the instruction level parallelism (ILP) inherent in an application and make its execution faster compared to a RISC processor system. This project creates research opportunities in the domain of softcore embedded VLIW processor prototyping, as well as designs that can be used in high-performance applications.

Keywords: Reconfigurable computing, FPGA, softcore, ILP, VLIW.

I. INTRODUCTION

Very Long Instruction Word (VLIW) processors can be used to increase the performance beyond normal Reduced Instruction Set Computer (RISC) architectures [1]. While RISC architectures only take advantage of temporal parallelism (by utilizing pipelining), VLIW architectures can additionally take advantage of spatial parallelism by using multiple functional units (FUs) to execute several operations simultaneously. A VLIW processor improves the performance by exploiting Instruction Level Parallelism (ILP) in a program. Field-programmable gate arrays (FPGAs) have become a widely used tool for rapid prototyping, providing both flexibility (as in software programming) and performance (as in dedicated hardware). Nowadays, FPGAs are moving beyond their simple prototyping beginnings towards mainstream products being utilized in many markets: general-purpose, high-performance, and embedded.

For an application to take advantage of performance improvement from an FPGA, it must possess inherent parallelism, or the application source code should be structured in such a way as to expose its parallelism. Applications in different domains such as multimedia, bio-informatics, wireless communication, and numerical analysis contain a lot of ILP, as they have many independent repetitive calculations. VLIW processors such as the Lx/ST200 [1] from HP and STMicroelectronics and the TriMedia [2] from NXP can exploit the ILP found in an application by means of a compiler. By issuing multiple operations in one instruction, a VLIW processor is able to accelerate an application many times compared to a RISC system [1][3].

This paper presents the design of an open source, extensible and reconfigurable softcore VLIW processor. The processor architecture is based on the VEX (VLIW Example) Instruction Set Architecture (ISA), as introduced in [4], and is implemented on an FPGA. Parameters of the VLIW processor such as the number and type of functional units (FUs), supported instructions, memory bandwidth, and register file size can be chosen based on the application and the available resources on the FPGA. A software development toolchain including a highly optimizing C compiler and a simulator for VEX is made freely available by Hewlett-Packard (HP) [5]. We additionally present a development framework to optimally utilize the processor. Any application written in C can be executed on the processor implemented on the FPGA. The ISA can be extended with custom operations, and the compiler is able to generate code for the custom hardware units, further enhancing the performance.

The remainder of the paper is organized as follows. Section II explains the rationale behind the project. In Section III, some previous work related to softcore processors is discussed. The VEX VLIW processor architecture and the available software toolchain are discussed in Section IV. Section V presents the design and implementation details of our softcore VLIW processor ρ-VEX. Finally, conclusions are presented in Section VI.

II. THE RATIONALE

The utilization of reconfigurable hardware (the most common nowadays being the field-programmable gate array (FPGA)) has increased tremendously in the past years due to its inherent parallelism¹, which can be exploited in order to improve the execution of many applications, e.g., multimedia, bio-informatics, and many large-scale scientific computing applications.

¹ There are multiple factors that played a role, e.g., the lowering cost of ownership, but these are not mentioned as the discussion is focused on performance.

Many approaches have been adopted to exploit reconfigurable hardware, but no single all-encompassing solution has emerged, as each usually performs well only for its particular environment or supported application(s). However, many of these solutions are hampered not by their ingenious designs but by the lack of tools to fully exploit the solution for more general-purpose cases. Therefore, we proposed the ρ-VEX processor as a reconfigurable and extensible VLIW softcore processor to bridge the gap between application-specific and general-purpose processing. In the following, we first highlight the advantages of our choice for a VLIW processor as a starting point:

• simple hardware: One of the main advantages of VLIW processors is that their hardware design is relatively simple compared to RISC processors, as there is no need for complex instruction decoders (e.g., for out-of-order execution) in hardware, since the compiler has already taken care of the instruction scheduling. This means that the hardware we need to implement on the FPGA can be kept simple and, therefore, higher clock frequencies can be achieved to improve performance. Furthermore, additional parallelism can be provided by simply adding more issue slots or functional units.

• availability of existing tools: Compilers for VLIW processors are readily available, and research and development effort in improving them is still ongoing. Moreover, for the VEX that we have chosen as a basis, a simulator is available to investigate the performance gains for different architectural instances of the VEX processor. This means we can exploit existing compilers (and simulators) and future advancements without the need to dedicate much effort to their development.

• no need for language translations: Another benefit of using an existing VLIW architecture and its toolchain is that there is no longer a need for translators and automatic synthesis tools. Nowadays, e.g., when looking at C-2-VHDL tools, restrictions must be placed on the C constructs before they can be utilized for the purpose of automatic hardware synthesis, and sometimes code rewriting is necessary to achieve improved performance. In the latter case, the (software) programmer needs to possess hardware knowledge, which is not always the case. This means we can take any existing code and compile it to our VLIW processor without rewriting code and without requiring the programmer to have hardware knowledge. We see a clear motivation for a reconfigurable VLIW processor between hardware design using automatic synthesis tools (starting from C) and manual design, as adequate performance can be achieved after just the compilation time.

The choice for a VLIW processor clearly has its advantages, and in the following we will discuss the reconfigurability-specific benefits we foresee:

• static resource sharing: When the size of the reconfigurable hardware structure and the available hardware area are known beforehand, one or several pre-configured VLIW softcore(s) can be instantiated and configured on the FPGA. In this manner, a short trade-off study, e.g., via a simulator or model, can determine the parameters most suited for the available hardware and targeted application(s) at hand. This scenario is most suited for the embedded design environment, as the requirements and platform are usually well-known and fixed. The sharing of resources between multiple VLIW processors is also pre-determined.

• dynamic resource sharing: When neither the application nor the precise characteristics of the attached reconfigurable hardware are known at design time, the most appropriate scenario is to allow for dynamic resource sharing. In this scenario, enough resources are instantiated to allow for sharing among the multiple VLIW processors running on the same chip. The method for doing this is under investigation, and initial solutions have been proposed already.

• on-the-fly resource instantiation: When new resources are needed, they can be instantiated on-the-fly. Similarly, when they are no longer needed, their space can be freed and dedicated to other applications.

The most promising solution to implement is most certainly the combination of the second and the third benefit mentioned above. On the other hand, one must not lose sight of certain intrinsic disadvantages of VLIW architectures that prevented them from becoming mainstream processors. However, we believe that these disadvantages are mainly due to their fixed design, and many of them can be mitigated when being implemented on reconfigurable hardware. We will highlight several issues² in the following and how they could be addressed:

² The length of this paper does not allow for an extensive discussion of the shortcomings of VLIW processors and how they can be addressed. Therefore, we only mention the most important ones.

• varying instruction word widths: Different applications contain different levels of parallelism (this is true even within the same application). In order to fully exploit this, more issue slots should be used, leading to longer (and therefore different) instruction widths. Moreover, using a different number of instructions can lead to a different encoding scheme of the VLIW instructions, thereby varying their length again. This issue can be easily dealt with by the reconfigurable nature of a reconfigurable and parameterized VLIW processor, as different instruction decoders can be instantiated. This can be achieved with or without reconfiguring the issue slots (in the latter case, unused issue slots can be shared among other softcores).

• high number of NOPs: Due to the traditionally fixed implementation nature of VLIW processors, their organization may not completely match the parallelism inherent in the application, leading to a high number of NOPs being scheduled. This leads to an under-utilization of the available resources (in some cases over 50%). Instead
of idling issue slots, the reconfigurable VLIW processor can reconfigure the issue slots or reduce their number, i.e., either physically or by enabling sharing.

• unbalanced issue slots: This issue is tightly coupled with the previous one, as it is one of the causes for the scheduling of NOPs: functional units might not be available across all issue slots. This issue can be addressed by adding more functional units per issue slot.

Having stated how a reconfigurable and parameterized VLIW can overcome the traditional shortcomings of a VLIW processor, we will highlight in the following how such a reconfigurable VLIW processor can be used in two likely scenarios:

1) stand-alone general-purpose processor: In this scenario, complete applications (or application threads) run on the VLIW processor. The implementation of the processor can be fixed during the execution of multiple applications, but our envisioned reconfigurable VLIW processor should be able to adapt itself to different applications (or even to code portions within a single application).

2) application-specific co-processor: In this scenario, only specific kernels that require acceleration are compiled to the VLIW processor. The benefits are: (1) no need for code rewriting, (2) avoidance of complex tools such as C-2-VHDL translators, and (3) manual design of accelerators can be skipped. We have to note again that we are not stating that there is no need for the aforementioned actions or tools, but they can be avoided when the VLIW processor is capable of providing good enough performance within the requirements (such as power and area) set.

Having stated the above, we present an advantage due to the existence of a reconfigurable and parameterized VLIW processor, namely instruction-set architecture (ISA) emulation. This means that we can implement different ISAs on top of the VLIW processor and ensure that each emulation is the most efficient. This has the obvious advantage that applications compiled for different architectures can be executed without code recompilation (cumbersome) or software code emulation (slow). Moreover, having the mentioned ability allows for the following scenarios:

1) ISA extension emulation: When new ISA extensions are being introduced, much research and development effort is needed in order to ensure market acceptance. However, with a reconfigurable ISA emulator it is possible to implement and ship the (draft) extension to potential end-users for actual use and evaluation. Furthermore, bug reports can lead to further improvements before the extension is fixed in hardware. The latter is still needed since the performance and power utilization of reconfigurable hardware is usually not optimal. However, early-on experience of developers can lead to a much earlier market adoption of the intended ISA extension.

2) instantiation of dedicated processor organizations: When new processors are released, in many cases code recompilation is needed to take advantage of new organizational improvements. This need can be relaxed, as dedicated organizational features can be provided in the reconfigurable hardware for particular already-compiled code.

3) relaxation of backwards compatibility: Rarely used instructions can be implemented in reconfigurable hardware, and their implementation can be instantiated when needed. This means that complex instruction decoding hardware can be avoided, leading to simpler hardware design and potentially lower power consumption.

By no means is our research in the design of the ρ-VEX processor finished; there are still many open questions that need to be solved. However, discussing them is beyond the scope of this paper. In the remainder of this paper, we highlight several other similar approaches and describe in more depth the design of our ρ-VEX processor and its current development status.

III. RELATED WORK

In the literature, few softcore VLIW processors with a complete toolchain can be found. The first VLIW softcore processor found in the literature is Spyder [6]. The design and implementation of Spyder marked the beginning of the reconfigurable VLIW softcore processor. Spyder consists of three reconfigurable units. A compiler toolchain was made available. One of the drawbacks of Spyder was that both the processor architecture and the compiler were designed from scratch. Because the designer had to put effort in both directions, the processor did not evolve extensively.

Instance-specific VLIW processors are presented in [7][8]. These architectures are specific implementations for some applications, and do not represent a more general VLIW processor. A VLIW processor with a reconfigurable instruction set is presented in [9]. An FPGA-based design of a VLIW softcore processor is presented in [10]. Additionally, this processor is able to execute custom hardware operations. It has an ISA that is binary-code compatible with the Altera NIOS-II soft processor. To support this architecture, a compilation and design automation flow are described for programs written in C. The compilation scheme consists of Trimaran [11] as the front-end and the extended NIOS-II as the back-end. Due to the licensed Altera NIOS-II, this VLIW design is less flexible and not open source.

In [12], a modular design of a VLIW processor is reported. Certain parameters of the processor architecture could be altered in a modular fashion. The lack of a good software toolchain and the absence of parametric extensibility limited the use of this architecture. In [13], the architecture and micro-architecture of a customizable soft VLIW processor are presented. Additionally, tools are discussed to customize, generate and program this processor. Performance and area trade-offs achieved by customizing the processor's datapath and ISA are evaluated. The limitation is the absence of a compiler. In [14], the design and architecture of a VLIW
microprocessor is presented without any toolchain, which restricts the processor's usability.

In [3], we presented the design and implementation of a reconfigurable VLIW softcore processor called ρ-VEX. In addition, a development framework to utilize the processor is presented. The processor architecture is based on the VLIW Example (VEX) ISA, as introduced in [4]. VEX represents a scalable technology platform that allows variation in many aspects, including instruction issue-width, organization of functional units, and instruction set. A software development toolchain for the VEX architecture [5] is freely available from Hewlett-Packard (HP). The ρ-VEX processor is open-source and implemented on an FPGA. Different parameters such as the number and types of functional units, supported instructions, memory bandwidth, and size of the register file can be chosen based on the application requirements and the available resources on the FPGA. Initially, an instruction ROM file had to be generated for each application to be run on the processor, and the design had to be re-synthesized along with the instruction ROM file. A boot-loader-like functionality has now been added, and the executable files can be downloaded to the instruction memory and executed directly, avoiding the necessity of resynthesis.

IV. THE VEX VLIW PROCESSOR: HARDWARE AND RELATED SOFTWARE

Compared to superscalar and RISC processors, a VLIW architecture requires a more powerful compiler due to more complex operation scheduling. [4] presents a definition of the VLIW design philosophy: "The VLIW design philosophy is to design processors that offer ILP in ways completely visible in the machine-level program and to the compiler".

A. The VEX System

VEX stands for VLIW Example. VEX is a system

3) The VEX Simulation System: The VEX simulator is an architectural-level simulator that uses compiled simulator technology to achieve faster execution. It additionally provides a set of POSIX-like libc and libm libraries (based on the GNU newlib libraries), a simple built-in cache simulator (level-1 cache only) and an Application Program Interface (API) that enables other plug-ins used for modeling the memory system.

A VEX software toolchain including the VEX C compiler and the VEX simulator is made freely available by Hewlett-Packard Laboratories [5]. The reason behind choosing the VEX architecture for our project is the scalability and customizability of the VEX ISA and the availability of the free C compiler and simulator, which can be used for architecture exploration.

B. The VEX Instruction Set Architecture

VEX offers a 32-bit clustered VLIW ISA. VEX models a scalable technology platform for embedded VLIW processors that allows variations in the parameters of the processor. Following the VLIW design philosophy, the parameters of the processor, such as issue-width, FUs, register files and processor instruction set, can be varied. The compiler is responsible for scheduling the instructions. Along with basic data and operation semantics, VEX includes many features for compiler flexibility in scheduling multiple concurrent operations. Some of these features are [4]:

• Parallel execution units, such as integer ALUs and multipliers.
• Parallel memory pipelines, including access to multiple data memory ports.
• Data prefetching and other locality hints supported by the architecture.
• A large multiported shared register file made visible by
encoded operation in VEX system is called a syllable. Multiple in both the source and destination cluster and may require
syllables are combined to form an instruction, which is an more than one cycle (pipelined or not). There is only a
atomic unit of execution in a VLIW processor. The instruction single instruction cache (I-cache), but different data cache
issue-width is the number of syllables in an instruction that (D-cache) ports and/or private memories can be associated
could be issued, and it depends on the number of FUs in with each cluster. This means that VEX allows multiple
the processor. An instruction having multiple syllables or memory accesses to execute simultaneously. Figure 1 depicts
operations is issued every cycle by the compiler to the multiple multiple D-cache blocks, attached by a crossbar to different
execution units of the VLIW processor, which is the main clusters, which allows a variety of memory configurations.
reason for performance compared to a RISC processor, which VEX clusters obey the following set of rules [4]:
has an issue-width of one. • Each cluster has the ability to issue multiple operations
1) Multicluster Organization: The number of read ports in the same instruction.
of the shared multiported register file in a VLIW processor • Different clusters can have different issue-widths and
is twice the issue-width, and the number of write ports is different types of operations.
equal to the issue-width (assuming that each FU requires two • Different clusters can have different VEX ISA, and not
input operands and writes one output as a result). Therefore, all clusters have to support the entire VEX ISA.
the issue-width is proportional to the product of the number • All units within a cluster are indistinguishable or equally
of read and write ports of the shared register file. The likely for selection. This means that the operations to
resource/area requirement for a multiported register file is be executed by a cluster do not have to be assigned to
directly proportional to the product of number of read and particular units within this cluster. To assign operations
write ports, therefore these parameters are not scalable to a to the units within a cluster is the job of the hardware
large extent, or we can say that the issue-width is not scalable to a large extent. To reduce this pressure on the number of read and write ports of the shared register file, VEX defines a clustered architecture [4]. Using modular execution clusters, VEX provides scalability of issue-width and functionality. A cluster is a collection of register files and a tightly coupled set of FUs.

VEX clusters are numbered from zero. Cluster 0 is a special cluster that must always be present in any VEX implementation, because the control operations execute on this cluster. Different clusters have different unit/register mixes, but a single Program Counter (PC) and a unified I-cache control them all, so that they run in lockstep or proper sequence [1]. The structure of a VEX multicluster architecture is depicted in Figure 1. FUs within a cluster can only access registers in the same cluster. VEX provides a simple pair of send-receive instruction primitives that move data among registers on different clusters. These intercluster copy operations may consume resources ... decoding logic.

2) Structure of the Default VEX Cluster: The default single VEX cluster is a 4-issue VLIW core, as depicted in Figure 2, and consists of the following units [4]:

• Four 32-bit integer ALUs
• Two 16x32 multipliers (MULs)
• One Load/Store Unit
• One Branch Unit
• 64 32-bit general-purpose registers (GRs)
• 8 1-bit branch registers (BRs)

This cluster can issue up to four operations per instruction. These operations could be either integer ALU, MUL, or Load/Store operations. All FUs are directly connected to registers, and no FU is directly connected to another FU. The two types of register banks are GR and BR; both are multiported, shared register files. Memory units support only load and store operations, i.e., operations that act on memory and save results directly in memory are not supported by the VEX system. The branch unit (control unit) in the default cluster is used for program sequencing and is present only in cluster 0 in case of a multicluster machine.
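As a rough illustration of the unit mix listed above, the following sketch (our own, not part of the VEX toolchain; all names are hypothetical) checks whether a 4-syllable bundle respects the default cluster's resources:

```c
#include <assert.h>

/* Hypothetical sketch: classify each of the four issue slots (syllables)
 * of a default-cluster instruction by the FU type it targets. */
typedef enum { FU_ALU, FU_MUL, FU_MEM, FU_CTRL, FU_NOP } FuType;

/* Check a bundle against the default cluster's resources quoted in the
 * text: four ALUs, two MULs, one Load/Store unit, one Branch unit. */
static int bundle_is_valid(const FuType syl[4]) {
    int count[5] = {0, 0, 0, 0, 0};
    for (int i = 0; i < 4; i++)
        count[syl[i]]++;
    return count[FU_ALU] <= 4 && count[FU_MUL] <= 2 &&
           count[FU_MEM] <= 1 && count[FU_CTRL] <= 1;
}
```

For example, an add, two multiplies, and a load fit in one instruction, while two loads do not, because the default cluster has a single Load/Store unit.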
Figure 2. The Default VEX Cluster

the compiler supports this scalability and customizability. To compile for a different configuration, the compiler is provided with configuration information in the form of a Machine Model Configuration (fmm) file. To include a custom instruction, the application code is annotated with pragmas. Different compiler pragmas are available to improve performance. Refer to [4] for details on how to use the VEX compiler.

D. The VEX Simulation System

The VEX toolchain provides tools that allow C programs compiled for a VEX VLIW configuration to be simulated on a host workstation. The VEX simulator is a fast compiled simulator (CS) that translates a VEX binary to a host computer binary. It first converts the VEX binary to C and then, using the C compiler of the host, generates a host executable. The compiled simulation workflow is depicted in Figure 3.

The VEX simulator produces instrumentation code to count execution cycles and other statistics, and generates a log file at the end of simulation. This log file contains all the statistical information that can be analyzed for performance analysis and architecture exploration. The simulator provides a simple cache simulation library, which models an L1 instruction and data cache. The default cache simulation can be replaced by a user-defined library. In addition, the VEX simulator includes support for gprof, and the different statistical files generated at the end of simulation can be used with the gprof tool for analysis and profiling of the simulated application. Refer to [4] for details on how to use the VEX simulator.

Figure 3. The VEX Simulation Flow

V. THE ρ-VEX SOFTCORE VLIW PROCESSOR

In [3], we presented the design and implementation of a reconfigurable and extensible softcore VLIW processor. We implemented a single-cluster standard configuration of the VEX machine for our processor called ρ-VEX. Figure 4 depicts the organization of our 32-bit, 4-issue VLIW processor implemented on an FPGA. The ρ-VEX processor consists of fetch, decode, execute and writeback stages. The fetch unit fetches a VLIW instruction from the attached instruction memory, and splits it into syllables that are passed on to the decode unit. In the decode stage, the instructions are decoded and register contents used as operands are fetched from the register file. The actual operations take place in either the execute unit, or in one of the parallel CTRL or MEM units. ALU and MUL operations are performed in the execute stage. This stage is implemented parametrically, so that the number of ALU and MUL functional units can be adapted. All jump and branch operations are handled by the CTRL unit, and all data memory load and store operations are handled by the MEM unit. All write activities are performed in the writeback unit to ensure that all targets are written back at the same time. Different write targets could be the GR register file, the BR register file, data memory or the PC.

The ρ-VEX implements all of the 73 operations of the VEX operation set. It additionally supports reconfigurable operations, as the VEX compiler supports the use of custom instructions via pragmas inside the application code. In the current ρ-VEX prototype, it takes only a few lines of VHDL code to add a custom operation to the architecture. One of the 24 available reserved opcodes can be chosen, and a provided template VHDL function can be extended with the custom functionality. Currently, the following properties of ρ-VEX are parametric:

• Syllable issue-width
• Number of ALU units
• Number of MUL units
• Number of GR registers (up to 64)
• Number of BR registers (up to 8)
• Types of accessible FUs per syllable
• Width of memory busses

To optimally exploit processor utilization, a development framework is provided: a piece of C code is compiled with the VEX compiler, and a VHDL instruction ROM file is generated by assembling the resulting assembly file with our assembler [3]. The ROM file is then synthesized with the rest of the processor VHDL design files.

As the target reconfigurable technology, a Xilinx Virtex-II Pro (XC2VP30) FPGA was chosen, embedded on the XUP V2P development board by Digilent. All experiments were performed on a non-pipelined ρ-VEX system with 32 general-purpose registers (GR). A data memory of 1 kB implemented using BlockRAM was connected to ρ-VEX to store results. The issue-width of ρ-VEX was varied between 1, 2 and 4. All configurations had the same number of ALU units as their issue-width. The 2- and 4-issue ρ-VEX configurations had 2 MUL units. The application code was loaded in the instruction memory before synthesis. We developed a debugging UART interface to transmit data via the serial RS-232 protocol. This interface invoked a transmission of the hexadecimal representation of the data memory contents, as well as the contents
of the internal ρ-VEX cycle counter register. Synthesis results for the ρ-VEX processor are presented in Table II.

Figure 4. The ρ-VEX VLIW Processor

A. Recent Developments

The following design improvements were added to the original ρ-VEX processor:

• The assembler has been extended and now generates a binary executable file for the processor. The hardware design was modified so that BlockRAM is used as the instruction memory. The executable file can be downloaded into the instruction memory of the already placed processor on the FPGA, using a serial port on the PC and the FPGA development board; therefore, re-synthesis and re-implementation of the processor when changing the application is not required.

• We implemented a dynamically reconfigurable register file for the ρ-VEX processor to reduce the resources required by the multiported register file [17]. The VEX architecture supports up to 64 multiported shared registers in a register file for a single-cluster VLIW processor. This register file accounts for a considerable amount of area when the VLIW processor is implemented on an FPGA. Our processor design supports dynamic partial reconfiguration, allowing the creation of dedicated register file sizes for different applications. The processor can dynamically create its own register file composed of the actual number of registers the application needs. This means that valuable area can be freed and utilized for other implementations running on the same FPGA when the full register file size is not needed. Our design needs 924 slices on a Virtex-2 Pro device for dynamically placing a chunk of 8 registers, and places registers in multiples of 8 registers to simplify the design. The processor thus does not permanently need 64 registers requiring 8594 slices, thereby considerably reducing the slice utilization at run-time without increasing cycle count.

Table II
SYNTHESIS RESULTS FOR ρ-VEX PROCESSOR

ρ-VEX     Slices         Max. Frequency
1-issue   1895 (13%)     89.44 MHz
2-issue   5105 (37%)     89.44 MHz
4-issue   10433 (76%)    89.44 MHz

VI. CONCLUSIONS

In this paper, we presented the design and implementation of a reconfigurable softcore VLIW processor based on the Lx/ST200 ISA developed by HP and STMicroelectronics. Our processor design, called ρ-VEX, is parametric: different parameters such as the number and type of functional units, the supported instructions, the memory bandwidth, and the register file size can be chosen depending upon the application and the available resources on the FPGA. A toolchain including a C compiler and a simulator is freely available. We provide a development framework to optimally utilize the reconfigurable VLIW processor. Any application written in C can be mapped to the VLIW processor on the FPGA. This VLIW processor is able to exploit the instruction-level parallelism (ILP) inherent in an application and make its execution faster compared to a RISC processor system. We described our rationale for the ρ-VEX processor and presented the possible advantages it can provide for its use as a general-purpose processor or application-specific co-processor.

REFERENCES

[1] P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, and F. Homewood, "Lx: A Technology Platform for Customizable VLIW Embedded Processing", in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA 00), June 2000, pp. 203-213.
[2] TriMedia Processor Series. http://www.nxp.com/.
[3] S. Wong, T. van As, and G. Brown, "ρ-VEX: A Reconfigurable and Extensible Softcore VLIW Processor", in IEEE International Conference on Field-Programmable Technology (ICFPT 08), Taiwan, December 2008.
[4] J. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann, 2004.
[5] Hewlett-Packard Laboratories. VEX Toolchain. [Online]. Available: http://www.hpl.hp.com/downloads/vex/.
[6] C. Iseli and E. Sanchez, "Spyder: A Reconfigurable VLIW Processor using FPGAs", in FPGAs for Custom Computing Machines, January 1993, pp. 17-24.
[7] C. Grabbe, M. Bednara, J.V.Z. Gathen, J. Shokrollahi, and J. Teich, "A High Performance VLIW Processor for Finite Field Arithmetic", in Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS 03), April 2003.
[8] M. Koester, W. Luk, and G. Brown, "A Hardware Compilation Flow For Instance-Specific VLIW Cores", in Proceedings of the 18th International Conference on Field Programmable Logic and Applications (FPL 08), Sep 2008.
[9] A. Lodi, M. Toma, F. Campi, A. Cappelli, and R. Canegallo, "A VLIW Processor with Reconfigurable Instruction Set for Embedded Applications", IEEE Journal of Solid-State Circuits, vol. 38, no. 11, Nov 2003, pp. 1876-1886.
[10] A.K. Jones, R. Hoare, D. Kusic, J. Fazekas, and J. Foster, "An FPGA-based VLIW Processor with Custom Hardware Execution", in Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field Programmable Gate Arrays (FPGA 05), New York, NY, USA: ACM, 2005, pp. 107-117.
[11] http://www.trimaran.org/.
[12] V. Brost, F. Yang, and M. Paindavoine, "A Modular VLIW Processor", in IEEE International Symposium on Circuits and Systems (ISCAS 2007), Apr 2007, pp. 3968-3971.
[13] M.A.R. Saghir, M. El-Majzoub, and P. Akl, "Customizing the Datapath and ISA of Soft VLIW Processors", in High Performance Embedded Architectures and Compilers (HiPEAC 07), LNCS 4367, pp. 276-290, Springer-Verlag Berlin Heidelberg, 2007.
[14] W.F. Lee, VLIW Microprocessor Hardware Design for ASICs and FPGA. McGraw-Hill, 2008.
[15] P.G. Lowney et al., "The Multiflow Trace Scheduling Compiler", The Journal of Supercomputing, 7(1/2), pp. 51-142, 1993.
[16] J. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction", IEEE Trans. on Computers, C-30(7), pp. 478-490, 1981.
[17] S. Wong, F. Anjam, and M.F. Nadeem, "Dynamically Reconfigurable Register File for a Softcore VLIW Processor", accepted for publication in DATE 2010.
Run-time Reconfiguration of Polyhedral Process Networks
Implementations
1.1 Problem Statement

The principal benefits of using dynamic (partial) reconfiguration (DPR) are the ability to execute larger hardware designs with fewer gates and to realize the flexibility of a software-based (multi-threaded) solution while retaining the execution speed of a more traditional, hardware-based approach. However, this comes at the price associated with the difficulties in realizing run-time reconfigurable computing. First, the provided design flows are weak and mostly experimental. It is not possible to model DPR during all the steps of a system development. For instance, SystemC can be used for the first high-level steps, but then it is difficult to use other tools, e.g., HW/SW partitioning tools, simply because DPR is not integrated by the tool vendors. For the low-level steps, it is (almost) impossible to simulate and validate the designs before the platform is integrated into the final board. As a result, designers are overwhelmed with too many and very low-level details in order to "get it right", making reconfigurable computing a highly error-prone and time-consuming task.

In addition to the lack of tool support, a major challenge when using dynamic reconfiguration is the execution management of the dynamic (reconfigurable) modules. This includes both spatial and temporal management. The latter is especially important in realizing reconfigurable implementations with consistent run-time behavior. Consistency here means that any reconfigurable implementation and execution generates results equivalent to its non-reconfigurable counterpart for the same application. The challenge in realizing an execution management is further exacerbated by the complexity of today's applications, especially in the domain of multimedia embedded systems. Usually, such systems consist of multiple compute modules that operate in a globally asynchronous fashion. If these modules require reconfiguration, i.e., they are dynamic, it is very easy to violate consistency at run time. This very much resembles the challenges in software multi-threading: common problems with thread synchronization include deadlock and the inability to (correctly) compose program fragments that are correct in isolation [3, 6]. In general, it is not known how a programmer can come up with a multi-threaded program with a correctness guarantee. The same problems arise in reconfigurable computing as well, i.e., there is no correctness guarantee for applications demanding and implementing reconfiguration at run time. We address this issue, and in this paper we present an approach based on conditions defining "safe" points when reconfiguration may occur. The main contribution of the proposed approach is that if the defined conditions are respected, consistent system executions are guaranteed while allowing asynchronous reconfiguration of different dynamic modules at run time.

The remaining part of the paper is organized as follows. In Section 2, we discuss the scope of the approach and the main assumptions it relies on. Section 3 presents the solution approach. Implementation details are discussed in Section 4. Section 5 concludes the paper.

2 Scope of Work

One of the main assumptions in our work is that we consider only dataflow-dominated applications in the realm of multimedia, imaging, and signal processing that naturally contain tasks communicating via streams of data. Such applications are very well modeled by using the parallel dataflow model of computation (MoC) called Kahn Process Network (KPN) [4]. The KPN model we use is a network of concurrent autonomous processes that communicate data in a point-to-point fashion over bounded FIFO channels, using a blocking read/write on an empty/full FIFO as a synchronization mechanism. Each process in the network performs a sequential computation concurrently with the other processes. A well-known characteristic of KPNs is that their MoC is deterministic: for a given input data, always one and the same output data is produced. This input/output relation does not depend on the order in which the processes are executed. As the control is incorporated into the processes, no global scheduler is present.

To represent KPNs, we use polyhedral descriptions; therefore, we call our KPNs polyhedral process networks (PPN). The PPNs are a specific case of KPNs, i.e., PPNs are static and everything about the execution of the process networks is known at compile time. Moreover, the PPNs execute in finite memory and the amount of data communicated through the FIFO channels is also known. We are interested in this subset of KPNs because they are analyzable, e.g., FIFO buffer sizes and execution schedules are decidable, and SW/HW synthesis from them is possible. A PPN is implemented as a heterogeneous multiprocessor system on chip (MPSoC) using the Daedalus design methodology [1, 10]. In such MPSoCs, the processing components are programmable processors and dedicated HW compute modules (IP cores). The latter may provide run-time reconfiguration. In this paper, we consider fixed communication topologies, i.e., a communication topology cannot be reconfigured in a target MPSoC. Hence, reconfiguration can be applied only on the dedicated dynamic IP cores.
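The blocking-FIFO discipline described above can be sketched in C (our own minimal illustration, not Daedalus or Espam code; `Fifo`, `fifo_write`, and `fifo_read` are hypothetical names). In hardware or a threaded run-time the producer and consumer would stall; here a return value of 0 signals that the operation would block:

```c
#include <stddef.h>

#define FIFO_CAP 4  /* bounded channel, as required by the PPN subset */

typedef struct {
    int buf[FIFO_CAP];
    size_t head, tail, count;
} Fifo;

/* Returns 0 when the channel is full: a blocking write would stall here. */
static int fifo_write(Fifo *f, int token) {
    if (f->count == FIFO_CAP) return 0;
    f->buf[f->tail] = token;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

/* Returns 0 when the channel is empty: a blocking read would stall here. */
static int fifo_read(Fifo *f, int *token) {
    if (f->count == 0) return 0;
    *token = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}
```

Because every channel is bounded and every access blocks on empty/full, the network needs no global scheduler, which is exactly the KPN property the approach relies on.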
(Figure: an RGB 'Conv' process streams Y, U, and V pixels to 'Yproc', 'Uproc', and 'Vproc' processes; panel (b) shows processing with reconfiguration, with a 'reconfigure (Y,U,V)' control input.)

An IP core implements the main computation of a PPN process, which behaves like a function call. Therefore, the computation performed by a reconfigurable IP has to resemble a function call as well. This means that for each input data read by the IP core, the core is executed and it produces output data after an arbitrary
a balancing of the production and consumption of tokens in the network. When this balancing is dependent on dynamic parameters, consistency conditions may be violated. In the remaining part of the paper, we discuss how we address this problem in order to guarantee consistent executions of applications modeled as PPNs on platforms using run-time partial reconfiguration.

3.2 Polyhedral (Kahn) process networks (PPN)

The parallelism in our PPNs is expressed at the level of the application tasks, as a process implements a single application task only. A process of a PPN consists of a function, input ports, output ports, and control. The function specifies how data tokens from input streams are transformed into data tokens on output streams. The function also has input and/or output arguments. The input and output ports are used to connect a process to FIFO channels in order to read data tokens, initializing the function input arguments, and to write data generated as a result of the function execution. The control specifies how many times the function is executed and which ports to read/write at every execution, i.e., at every iteration (firing) of the process. The control of a process can be compactly represented mathematically in terms of linearly bounded sets of iterator vectors using the polytope model [2]. A process has a Process Domain (DM) which is the set of all iterator vectors. Each iterator vector corresponds to one and only one integral point in a polytope. Formally,

DM = {P(p) ∩ Z^n},

where P(p) is a parametric polytope,

P(p) = {i ∈ Q^n, p ∈ Z^m | Ai ≥ Bp + C},

where i is an iteration vector, A, B and C are integral matrices of appropriate dimensions, and p is a parameter vector with an affine range R(p),

R(p) = {p ∈ Z^m | Dp ≥ E},

where D and E are integral matrices of appropriate dimensions. We use the values of the parameter vector's elements to determine different configuration options at run time.

3.3 Process network instance

Figure 2. PPN and process execution cycle: (a) a PPN with dynamic parameters, i.e., processes P1 and P2 connected by FIFO channels a, b, c and parameter channels N1 and N2; (b) the structure of a process:

1  // Execution of process P1
2  while( 1 ) {
3    // Execution cycle
4    read_parameter( N1 );
5    for ( int i=1; i<=N1; i=i+1 ) {
6      read( a, x );
7      execute_P1( x, &y );
8      write( y, b );
9    }
10 }

... a producer-consumer pair, shown in Figure 2(a). N1 and N2 are the FIFO channels of the parameters N1 for process P1 and N2 for process P2, respectively. Each parameter can take values within a fixed range. PPN(N1, N2) denotes an instance of the PPN. There is generally a relation between the parameters, in this example between N1 and N2. Therefore, some instances PPN(N1, N2) are invalid instances. For the PPN in Figure 2(a), all different instances are:

Parameter ranges: 1 ≤ N1 ≤ 3; 1 ≤ N2 ≤ 3; N2 ≥ N1
PPN instances PPN(N1,N2): PPN(1,1); PPN(1,2); PPN(1,3); PPN(2,1); PPN(2,2); PPN(2,3); PPN(3,1); PPN(3,2); PPN(3,3)

Instances PPN(2,1), PPN(3,1), and PPN(3,2) are invalid because they violate the condition N2 ≥ N1. Similarly, instance PPN(2,4) is invalid because N2 is out of its range. Figure 2(b) shows the structure of a process we propose to deal with dynamic parameters. Network instances are selected by reading parameter values at run time. For this purpose, we add a read-parameters phase, see line 4, prior to the actual processing at lines 5-9. Because reading parameters and data processing are repeated (possibly an infinite number of times), we call this a process execution cycle (lines 3-9). When all processes in a PPN have performed an execution cycle, a network instance has performed an execution.

Definition 3.1 (Consistency of a PN instance) A PN instance is consistent if, after an execution, the number of tokens written to any channel is equal to the number of tokens read from it.
tency when changing parameter values at run time. A point may violate the consistency of the instances and the PPN execution. In order to transfer new values for parameters to a process of the PPN at run time, i.e., to select a new PPN instance, we use control channels with FIFO organization using a blocking read/write synchronization mechanism. In addition, we define the following three conditions which are sufficient to preserve consistency when changing parameter values dynamically at run time.

C1: Parameter sets have to correspond to valid network instances.

C2: A valid parameter set has to initiate a network instance execution.

C3: Processes may read new parameters from a valid set (corresponding to the selection of a new valid network instance) after they have completed a process execution cycle.

In other words, parameter values may be changed (reconfiguration may take place) either before or after an execution cycle of the processes. This is taken into account by the proposed execution cycle of a process illustrated in Figure 2(b). Note that the defined conditions are valid only for consistent PPN instances. Therefore, a consistency check of a PPN instance is required, either at design time or at run time. In our approach, a consistency check is performed at design time, since everything about the execution of a PPN is known. For more details about the defined conditions and the approach to deal with dynamic parameters at run time, we refer to [7], where the presented approach has been generalized for the SBF MoC [5].

Figure 3. HW Module top-view (FIFOs and a CONTROL block around the READ, EXECUTE, and WRITE blocks, with Enable/Valid, Exist/Read, Full/Write, and Conf/Done signals).

4 Implementation

We consider that reconfiguration is applied on HW IP cores integrated in an MPSoC generated by Espam [8, 9]. To integrate an IP core, Espam generates a HW Module (HM) around an IP core taken from a library. To describe how reconfiguration, based on parameter values, is realized with respect to the previously defined conditions, we explain the structure of a HM, shown in Figure 3. For additional details about HW IP core integration with Espam, we refer to [8]. The processes in our PPNs always have the same structure. It reflects the KPN operational semantics, i.e., read-execute-write using a blocking read/write synchronization mechanism. Therefore, a HW Module realizing a process of a PPN has a similar structure, shown in Figure 3, consisting of READ, EXECUTE, and WRITE blocks. The READ and WRITE blocks constitute the communication part of a HM. A set of input data ports belongs to the read unit and a set of output data ports belongs to the write unit. The number of input/output ports is equal to the number of edges going into (respectively out of) the process of a PPN. The read unit is responsible for getting data from the proper channels (FIFOs) at each iteration. The write unit is responsible for writing the result to the proper channels (FIFOs) at each iteration. Selecting a proper channel at each iteration means following a local schedule incorporated into the read and write units. These local schedules are extracted from the PPN specification automatically by the Espam tool.

The EXECUTE block of a HW Module (HM) is actually a dedicated HW IP core to be integrated. It is not generated by Espam but is taken from a library. In order to be incorporated into a HW Module, an IP core has to provide an Enable/Valid control interface. The Enable signal is a control input to the IP core which allows the core to run when there is data to be processed. If input data is not available, or there is no room to store the output of the IP core in the output FIFO channels, then Enable is used to suspend the operation of the IP core. The Valid signal is a control output signal from the IP used to indicate whether the data on the IP outputs is valid and ready to be written to an output FIFO channel. In addition, the IP core also has to provide an interface for accepting configuration information, illustrated by the Conf/Done signals in Figure 3.

A CONTROL block is added to capture the process behavior, e.g., the number of process firings, and to synchronize the operation of the other three blocks. Also, CONTROL implements the blocking read/write synchronization mechanism using the Exist/Read and Full/Write signals. Another function of the CONTROL block is to allow the parameter values to be set/modified from outside the HW Module at run time. Below, we present how the CONTROL block
implements the reconfiguration process such that the previously defined conditions are respected.

4.1 Respecting the conditions

Recall that the defined conditions are taken into account by the proposed execution cycle of a PPN process, shown in Figure 2(b). Therefore, to respect the conditions and to preserve the consistency of our PPNs, the CONTROL block of a HW Module (see Figure 3) implements this execution cycle.

In the beginning, the CONTROL block reads parameter values from the corresponding control FIFO channels. If data has not been written, the control block stalls waiting for it. The correctness of the parameter values (i.e., the configuration data) has to be guaranteed (condition C1) by the module generating them. Thus, the combined writing of parameter values and the reading of these parameters by the control block respects condition C2, because only a valid parameter set will cause a PPN process to initiate an execution cycle and, consequently, an execution of a network instance. After reading the control data (e.g., iteration domains and information about configuring the IP core), the CONTROL block initiates an execution cycle. First, it performs an IP (re)configuration if required, as well as setting control information in the READ and WRITE blocks. After IP core (re)configuration is completed (indicated by signal 'Done'), the control block uses the 'Exist/Read', 'Enable/Valid', and 'Full/Write' interfaces (see Figure 3) to control the execution (cycle) of the HW Module. The end of the cycle is reached when the READ and WRITE blocks have performed all required read and write operations. This is indicated by the corresponding 'Done' signals. After that, the control block is free to initiate another execution cycle (respecting condition C3), i.e., to read new configuration data from the control channels and to repeat the steps described above.

4.2 Discussion

By using FIFO control channels with a blocking synchronization mechanism, we keep the KPN semantics of our polyhedral process networks with dynamic parameters, i.e., we have the capability to control the execution without changing the model. Keeping the KPN model means that the deterministic behavior of our PPNs with dynamic parameters is preserved. The FIFO organization of the control channels and the blocking synchronization mechanism (the KPN semantics) keep the right order of selecting new network instances, i.e., the order in which the parameter sets are generated outside the network and written to the control channels. Since new parameter values are read by the processes after performing an execution cycle, parameter values selecting alternative PPN instances may be written to the control channels while a PPN instance is being executed. In addition, the proposed mechanism allows the processes to read the parameter values independently of each other without violating the conditions defined for preserving the consistency.

Our approach for run-time reconfiguration is applied at two levels: high-level (no FPGA reconfiguration), by setting control registers, and low-level, by reconfiguring the FPGA logic. Since we consider a fixed communication topology, the READ and WRITE units are reconfigured by just writing data to control registers, e.g., the amount of data to be communicated and the particular communication patterns to read/write from/to different FIFO channels. Dynamic partial reconfiguration is applied only on the IP core of a HW Module.

From a design-complexity perspective, the proposed approach of using PPNs with dynamic parameters to capture (run-time) reconfiguration information and to target reconfigurable MPSoC implementations contributes to a simplified (low-level) design effort because:

1. By using the defined conditions and the control FIFOs, explicit handshaking (between processes) is eliminated. In addition, a reconfigurable IP core has to set only a "Done" signal to the CONTROL block after reconfiguration;

2. During the reconfiguration process, the dataflow FIFOs used for communication between the dynamic modules ensure proper operation of the static portion of the design.

5 Conclusions

In this paper, we proposed a general and technology-independent approach for modeling and implementation of run-time execution management for applications modeled as polyhedral process networks (PPNs) and targeting reconfigurable computing. Based on the characteristics of the PPN formal model of computation, we proposed conditions which define "safe" points when reconfiguration can occur. The main contribution of the presented work is that it guarantees consistent executions of reconfigurable implementations. In addition, the FIFO communication and synchronization mechanism of the polyhedral process networks simplifies design efforts and facilitates automated implementations.
References

[1] Daedalus, a system-level design methodology and toolflow. http://daedalus.liacs.nl/.
[2] P. Feautrier. Automatic parallelization in the polytope model. In The Data Parallel Programming Model, volume 1132 of LNCS, pages 79–103, 1996.
[3] M. Herlihy. The multicore revolution. In 27th FSTTCS: Foundations of Software Technology and Theoretical Computer Science, pages 1–8, 2007.
[4] G. Kahn. The Semantics of a Simple Language for Parallel Programming. In Proc. IFIP Congress 74. North-Holland Publishing Co., 1974.
[5] B. Kienhuis and E. Deprettere. Modeling stream-based applications using the SBF model of computation. Journal of VLSI Signal Processing, 34(3), July 2003.
[6] E. A. Lee. The Problem With Threads. IEEE Computer, 39(5):33–42, 2006.
[7] H. Nikolov and E. Deprettere. Parameterized Stream-Based Functions Dataflow Model of Computation. In 6th Int. Workshop on Optimizations for DSP and Embedded Systems (ODES-6), Boston, USA, Apr. 6, 2008.
[8] H. Nikolov, T. Stefanov, and E. Deprettere. Automated Integration of Dedicated Hardwired IP Cores in Heterogeneous MPSoCs Designed with ESPAM. EURASIP Journal on Embedded Systems, vol. 2008, Article ID 726096, 15 pages, 2008. doi:10.1155/2008/726096.
[9] H. Nikolov, T. Stefanov, and E. Deprettere. Systematic and automated multiprocessor system design, programming, and implementation. IEEE Trans. on CAD of Integrated Circuits and Systems, volume 27, Mar. 2008.
[10] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere. Daedalus: Toward composable multimedia MP-SoC design. In Proc. 45th ACM/IEEE Design Automation Conference (DAC'08), pages 574–579, Anaheim, USA, June 8-13, 2008.
REDEFINE: Optimizations for Achieving
High Throughput
Keshavan Varadarajan#, Ganesh Garga#, Mythri Alle#, Alexander Fell#, Ranjani Narayan‡, S K Nandy#‡
# CAD Lab, SERC, Indian Institute of Science, Bangalore, India
‡ Morphing Machines, Bangalore, India
Figure 1 Functional Units (along with Reservation Stations) and Register File connected through the Bypass Network (a bypass bus linking the reservation stations, the FUs, and the register file)
register file. These are connected over a bypass network. The bypass network distributes the result operand to all operations that are waiting on this result, so that they can proceed to execution. The bypass network uses a broadcast mechanism to distribute the result operand to all dependent operations. The use of broadcast is a worst-case design, since it assumes that there might be operations in all functional units awaiting this result. Our analysis indicates that in most cases a result has a single consumer, and that 98% of results are consumed by at most 3 operations, when measured across several kernels including IDCT, deblocking filter, CAVLC, FIR and FFT. The cumulative density plot is shown in Figure 2 and Figure 3.

Figure 3 Plot showing the average cumulative density for different number of destinations of an instruction.

Further, the bus-based interconnect used in modern superscalar processors [2] is not scalable. However, the use of dataflow computing inside the execute stage makes the design resilient to delays that can be encountered, such as Load-Store delays. We modify the design of the execute and write back stages as shown in Figure 4.
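The consumer-count analysis above (one consumer in most cases; 98% of results consumed by at most 3 operations) amounts to a cumulative density over per-result destination counts. The following is an illustrative sketch, not the paper's measurement tooling, and the sample counts are made up:

```python
# Sketch: cumulative density of result-consumer counts, as in the
# analysis above. The sample data below is illustrative only, not the
# measured traces from the IDCT/CAVLC/FIR/FFT kernels.
from collections import Counter

def consumer_cdf(dest_counts):
    """Fraction of results consumed by at most k operations, per k."""
    total = len(dest_counts)
    hist = Counter(dest_counts)
    cdf, running = {}, 0
    for k in sorted(hist):
        running += hist[k]
        cdf[k] = running / total
    return cdf

# Hypothetical destination counts for the results of one kernel run:
counts = [1, 1, 1, 2, 1, 3, 1, 2, 1, 1]
print(consumer_cdf(counts))  # {1: 0.7, 2: 0.9, 3: 1.0}
```

A heavily skewed distribution like this is what motivates replacing the broadcast bypass bus with point-to-point transports.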
The following changes were performed:
• The register file was eliminated, since it is a major source of contention.
• The registers of the reservation stations were made addressable, to enable explicit writes to them.
• The bus-based bypass network was replaced with a 2D interconnect. In this case, we chose a Honeycomb network, since it has the lowest degree per node [3].
Figure 4 Modified execute and write back stages: a Tile comprising a Compute Element (CE), with its ALU and Reservation Station, and a Router.
• Explicit write back can be implemented only if all dependent instructions are in one of the reservation stations. Since this cannot be guaranteed, an external register file is essential to store those operands that cannot be consumed immediately.
• Explicit write back also requires the source operation to know the placement of the destination operations. Dynamic assignment of reservation stations, and of slots within them, increases the complexity of the hardware.
Our implementation of these requirements is elucidated in the subsequent section. In Transport Triggered Architectures these requirements were implemented in the context of VLIW processors, whereas we approach the problem in the context of the dataflow paradigm.
The FPGA perspective of runtime reconfiguration is presented in section III. Section IV compares our solution with other solutions. In section V, we present a performance comparison between REDEFINE and a GPP. In the subsequent section, section VI, we indicate how the performance can be improved through the use of custom IP blocks. Other possible improvements, identified through an analysis of the time spent in various stages, are presented in section VII. The conclusions and scope for future work are presented in section VIII.

II. REDEFINE: A RUNTIME RECONFIGURABLE ARCHITECTURE

Figure 5 Schematic block diagram of REDEFINE: the Reconfigurable Fabric together with the Resource Binder, HyperOp Launcher, Scheduler (hosting the Global Wait Match Unit) and Inter-HyperOp Data Forwarder.

In order to specify explicit transports between two operations, the compiler, RETARGET [4], constructs a dataflow graph from the SSA representation of the application specification (written in the C language), generated by LLVM [5]. LLVM transforms the application into SSA form based on a virtual instruction set architecture (VISA). The operations in the VISA are simple and non-orthogonal. The application is subdivided into smaller units called HyperOps [4]. HyperOps are constructed by grouping together basic blocks such that a total order of HyperOps can be constructed for execution. Explicit transports can be performed between operations of a HyperOp. These explicit transports are transformed into Transport Metadata that is interpreted by the compute element in order to forward results to the destination. Explicit transports are facilitated through the use of a Network on Chip [6]. Any data communication across HyperOps, i.e. inter-HyperOp data traffic, is facilitated through the Inter-HyperOp Data Forwarder (Figure 5; [7]), and data is stored in the global wait match unit, which is a part of the Scheduler.
For performing transports, the operations of a HyperOp need to know the exact placement of the dependent operations and their position on the reconfigurable fabric. The compiler performs virtual binding of all operations, i.e. each operation assumes that it is placed at location (0,0). (The position of each CE on the fabric is specified by an ordered pair giving its offset from the origin along the x and y directions.) The transport directives to dependent operands are determined relative to the current location [8]. The routers support routing based on relative addresses. This arrangement makes the HyperOps relocatable: an operation can be placed at any location as long as the related operations, i.e. operations that supply data to the said operation or receive data from it, are at offsets that are predetermined by the compiler. The exact location where a HyperOp will be placed is determined at run time by the Resource Binder (Figure 5).
A brief description of the various modules shown in Figure 5 is provided below:
• The reconfigurable fabric consists of compute elements (CEs). The CEs include an ALU along with its reservation station and router; a tile comprises a CE and a router. The ALU supports a subset of the operations present in the LLVM VISA. The interconnect employed is a toroidal honeycomb structure [9]. The reservation stations determine which of the ready operations will fire. (Static scheduling could be used in place of dynamic scheduling; however, since an NoC is employed, dynamic scheduling can schedule other instructions in case of network delays.) In our current implementation (Figure 5), 64 tiles are interconnected in an 8x8 configuration, with 12 access routers along the periphery. The access routers provide connectivity to the HyperOp Launcher, the Inter-HyperOp Data Forwarder and the Load-Store Units that are used to access the data memory banks. The access routers serve as gateways of communication between the Fabric and the external logic that drives it.
• The HyperOp Launcher is the hardware unit responsible for transferring the compute and transport metadata to the Fabric, along with any data and constant operands. The compute metadata specifies the operation to be executed. The Launcher is connected to an instruction memory that is split into 5 banks, enabling parallel transfer of 5 operations from the instruction memory to the HyperOp Launcher.
• The Resource Binder, as described previously, determines the exact location on the fabric where a HyperOp is to be placed. It also keeps track of the busy and unused tiles on the fabric.
• The Scheduler hosts the global wait match unit that serves as the external register file, where data exchanged between HyperOps is stored. Whenever all input operands of a HyperOp instance are available, it is considered for launch on the fabric for execution. The HyperOps thus chosen are then
forwarded to the Resource Binder, where an appropriate location on the fabric is determined. The HyperOp Launcher transfers compute metadata, transport metadata, constant operands and input operands to the HyperOps.
• The input operands of a HyperOp are not placed at a predetermined position. This necessitates a lookup table that indicates the position in the global wait match unit where the input operands are placed. The Inter-HyperOp Data Forwarder is also responsible for storing loop invariants and employs a sticky token store for this purpose [10].

III. RUNTIME RECONFIGURATION: AN FPGA PERSPECTIVE

FPGAs are composed of Look Up Tables (LUTs) interconnected by a programmable interconnect. The LUTs emulate the behavior of a combinational circuit: the truth table of the combinational circuit is programmed into the LUTs, and more complex combinational circuits are realized through the interconnection of LUTs. Programming an FPGA thus involves transferring the truth tables for each LUT and the programming bits for the interconnect, in order to set up paths between communicating LUTs. The fine-grained structure of the LUTs and the programmable interconnect has both advantages and disadvantages. The fine-grained nature of the LUTs makes them amenable to the realization of combinational circuits, and the use of a programmable interconnect renders the transport of data overhead free, due to pre-established paths. The primary disadvantage of the fine-grained structure is the higher latency incurred in programming the LUTs and the interconnect. An FPGA has a reconfiguration time of the order of milliseconds at best, while a GPP can reconfigure itself every clock cycle. Due to the large latencies involved in reconfiguration, FPGAs are not amenable to runtime reconfiguration. More recently, FPGA vendors have been supporting partial reconfiguration, so as to reduce the runtime overhead. However, this alone does not address the problem, as the ratio of the configuration size for an application to the size of its source specification is quite high.
In order to bring down the latency of reconfiguration in REDEFINE, we replaced the LUTs with more coarse-grained functional units (viz. adders, shifters), akin to a GPP. This ensures that the amount of information needed to program the Compute Elements (CEs) is quite low (just an opcode). The programmable interconnect is replaced by a packet-switched Network on Chip. This trades reconfiguration overhead for hardware complexity. The routers in the NoC have embedded routing logic, which determines the path to be taken to transfer data from source to destination.

IV. RELATED WORK

Several solutions have been proposed that try to address the power-performance tradeoffs of embedded hardware. Modern embedded processors, viz. the PowerPC, come with a host of domain-specific accelerators that help improve performance while having a lower energy overhead for the accelerated application (when compared to a GPP). The recently released Intel Nehalem too has several onboard application-specific accelerators. On the other hand, FPGA vendors ship a General Purpose core alongside to execute software code, while the FPGA fabric provides hardware acceleration. Stretch Inc. explores a similar solution [11]: it embeds an FPGA fabric alongside a Tensilica core to support Instruction Set Extensions at the post-fabrication stage. Molen [12] uses a similar hardware fabric as Stretch; however, the compiler for Molen supports C to RTL transformation to automatically program the embedded FPGA. All these solutions help in exploiting the best features of both GPPs and FPGAs. However, none of them reduce the time to reconfigure a hardware platform, which is a critical requirement for runtime reconfigurable architectures.
The requirement of lower reconfiguration time has been addressed in several hardware solutions, viz. NEC-DRP and DAP-DNA [13]. These hardware architectures employ reconfigurable functional units and ALUs in place of LUTs. However, they continue to use the programmable interconnect, as in the FPGA. In order to reduce the runtime reconfiguration overhead, multiple configuration planes are used. However, such a hardware configuration is useful only if the execution times of the application subtasks are sufficiently large to hide the configuration load latencies.
The hardware structure shown in Figure 5 is akin to several recently proposed general-purpose architectures, viz. RAW [14], TRIPS [15] and WaveScalar [16]. In the RAW processor, several MIPS cores are repeated in space and interconnected by a 3-level mesh interconnect. The TRIPS and WaveScalar processors use function units and ALUs in place of a full core. All these processors are geared towards better exploitation of available thread level parallelism. Our solution, on the other hand, is intended towards kernel execution acceleration. This difference in emphasis leads to a completely different utilization of resources on the Fabric.
The requirements for explicit transports stated in section I can be implemented in several ways. Our solution employs explicit transports in the context of modern superscalar processors, i.e. it uses a dataflow paradigm. In Transport-Triggered Architectures (TTA) [1] this was achieved in the context of VLIW processors: the compiler computes the explicit transports, and the register file too is made a functional unit into which data can be explicitly transferred. Due to the VLIW nature of the machine, the locations of the operations are known a priori. The technique of explicit transports is amenable to application in other architectural paradigms as well. The primary reason to adopt the dataflow paradigm in REDEFINE, as elucidated in the previous section, is to work around network delays. Other techniques, such as dynamic scheduling [17], could have been employed in the context of VLIW processors to work around nondeterministic NoC delays.
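The configuration-size argument of section III (a full truth table per LUT versus a single opcode per CE) can be made concrete with a back-of-the-envelope sketch. The element counts below are assumed for illustration, not taken from REDEFINE:

```python
# Sketch (assumed numbers): per-element configuration state for a
# fine-grained LUT versus a coarse-grained compute element (CE).
import math

def lut_config_bits(k_inputs):
    """A k-input LUT stores one truth-table bit per input pattern."""
    return 2 ** k_inputs

def ce_config_bits(num_opcodes):
    """A CE only needs enough bits to select one opcode."""
    return math.ceil(math.log2(num_opcodes))

print(lut_config_bits(6))   # 64 bits per 6-input LUT
print(ce_config_bits(64))   # 6 bits to select one of 64 opcodes
```

This order-of-magnitude gap in configuration state, multiplied across thousands of elements plus interconnect bits, is one way to see why LUT-based fabrics reconfigure in milliseconds while opcode-programmed CEs can be reconfigured far faster.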
V. COMPARING PERFORMANCE WITH A GPP

The impact on performance of the architectural changes in REDEFINE, as compared to a GPP, is provided in this section. The SHA-1 hashing function is executed on both REDEFINE and a GPP. SHA-1 is among the most widely used cryptographic hash functions. Cryptographic hash functions compute a digest of a message, such that a change in the message causes the generation of a completely different hash. This facilitates detection of message modifications. The results of the SHA-1 run are available in Table 1. The GPP run was performed on a Pentium 4 running at 2.26GHz; the number of cycles was measured using Intel VTune, as described in [7]. The REDEFINE cycle count was obtained through a cycle-accurate simulation, for an 8x8 fabric of tiles. A computation granularity of 32 bits is used in both the GPP and REDEFINE.

Table 1 Comparison of performance of the SHA-1 function in REDEFINE vs a GPP.

    SHA-1 Performance     REDEFINE           General Purpose Processor
    Execution Cycles      111777             21746290
    Operating Frequency   100MHz (@130nm)    2.26GHz (@90nm)
    Total Time Taken      1.118 ms           1.071 ms

A. Analysis of Results

REDEFINE takes 4.2% more time (in seconds) to execute the same program. However, the reported execution time does not take into account the technology node at which the processors were synthesized. After technology normalization, we find that REDEFINE performs 1.91x better than the Pentium processor. This improvement in performance can be attributed to the use of the LLVM compiler [5] (which is known to perform about 30% better for the said application compared to gcc [18]), the use of RISC operations (as opposed to CISC operations in the Pentium 4 processor), the elimination of loads and stores for scalars, and the reduction in register writebacks. In REDEFINE, the average number of CEs used across several HyperOp executions is ~2.7, with a peak CE utilization of 7 CEs. Thus the effects of variable issue width, as reported in [19], are also seen, whereas the Intel Pentium is a fixed issue width processor.
The SHA-1 function implemented in an ASIC can support up to 116Mbps, as compared to ~5Mbps supported by REDEFINE. Therefore REDEFINE performs ~20x worse than an ASIC.

VI. IMPROVING PERFORMANCE THROUGH CUSTOM FU

The REDEFINE architecture described thus far contains ALUs that are quite general in nature. In order to improve the throughput and bring it as close as possible to that of an ASIC, it is essential to integrate custom IP blocks into the CEs, so as to accelerate the computation. However, these IP blocks need to be domain-specific, as opposed to application-specific, so as to avoid the pitfalls of an ASIC. We consider the example of a field multiplier to illustrate this.
Field multiplication is an important constituent of several cryptographic kernels, viz. ECDSA and ECC. Field multiplication, unlike normal multiplication, involves a sequence of Shift-Reduce operations (as opposed to a Shift in normal multiplication) and XOR operations (as opposed to an add in normal multiplication). The Shift-Reduce operation is compute intensive and is a good candidate for custom-FU-based acceleration. The design of the CE that contains the custom FU is shown in Figure 6.

Figure 6 Design of the CE containing the custom FU, alongside the Reservation Station.

The results of the runs without and with the Custom Function Unit are shown in Table 2. The performance of REDEFINE with the custom FU is about 4x better than without it. A similar improvement in performance is seen in the context of FFT, as reported in [7].

Table 2 Comparison of execution time of field multiplication without and with the Custom FU.

                         Without Custom FU    With Custom FU
    Execution Time       1692                 424
    (in clock cycles)
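The shift-reduce-plus-XOR structure of field multiplication described above can be sketched in software. This is an illustration of the arithmetic, not the custom FU's implementation: GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1 is an assumed example field, whereas ECDSA/ECC kernels operate over much larger fields.

```python
# Sketch: GF(2^m) multiplication as repeated shift-reduce and XOR,
# shown over GF(2^8) with the AES polynomial as an assumed example.
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def shift_reduce(a, poly=AES_POLY, m=8):
    """Shift left by one bit, then reduce modulo the field polynomial."""
    a <<= 1
    if a & (1 << m):        # overflow past degree m-1: subtract (XOR) poly
        a ^= poly
    return a

def field_mul(a, b, poly=AES_POLY, m=8):
    """Accumulate with XOR instead of add; shift-reduce instead of shift."""
    acc = 0
    for _ in range(m):
        if b & 1:
            acc ^= a        # XOR plays the role of addition in GF(2^m)
        b >>= 1
        a = shift_reduce(a, poly, m)
    return acc

print(hex(field_mul(0x57, 0x83)))  # 0xc1 (the worked example in FIPS 197)
```

The inner `shift_reduce` step, executed m times per product, is exactly the operation the custom FU accelerates.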
opcodes and data operands are transferred onto the fabric by the HyperOp Launcher. It should be noted that the Inter-HyperOp Data Forwarder continues to receive data even during normal execution, and thus the data transport is overlapped by computation. Similarly, the HyperOp Launcher continues to transfer data operands even after the first operation is ready for launch. Of the time spent in the Support Logic, the maximum amount of time is spent in the HyperOp Launcher. The Inter-HyperOp Data Forwarder, Scheduler and Resource Binder take fixed time periods for processing a HyperOp. However, the time taken by the HyperOp Launcher depends upon the number of operations, input operands and constant operands that need to be transferred, and upon the position of the identified CEs on the fabric. Since the HyperOp Launcher can transfer data only through the ports along the periphery of the fabric, it incurs different latencies for different CEs. This limits the scalability of the fabric. Beyond a certain size, 3-D structures may become essential to support scaling of the fabric while not incurring high launch delays. The HyperOp Launcher latency is also affected by routing delays due to network congestion, since it employs the NoC to transfer data to the identified CEs.

Table 3 Percentage of execution time spent in Support Logic.

    Application                     Percentage time spent in Support Logic
    SHA-1                           48.58
    Shift-Reduce (w/o Custom FU)    41.49
    Shift-Reduce (w/ Custom FU)     48.05

A. Enhancements to Reduce Support Logic Overhead

In order to ameliorate this, several possible solutions exist:
• The maximum latencies are incurred in HyperOps that are part of a loop. In these cases the HyperOp's opcodes and constants may be retained on the fabric; only the new input operands are transferred to the HyperOp after every iteration. These are called persistent HyperOps.
• The persistent HyperOps solution helps reduce HyperOp launch latency only in the case of frequently executed HyperOps. Another mechanism is to prefetch a HyperOp. In this case the next HyperOp to be launched is statically predicted using compile time analysis augmented with profiling runs. This information is used to prefetch HyperOps, so as to overlap the launch of the next HyperOp with the execution of the current HyperOp.
• In the case of both persistent HyperOps and prefetch, the input data needs to be transferred from the Scheduler to the HyperOp Launcher, for transfer onto the fabric. In the case of frequently interacting HyperOps, this latency can be eliminated through a Fused HyperOp. In a Fused HyperOp, also referred to as a Custom Instruction in [20], two or more closely interacting HyperOps are merged to form a combined entity, in which the inter-HyperOp communication is achieved completely on the fabric. However, since the fabric is a static dataflow machine, it is not possible to allow one HyperOp to proceed to the next iteration while the other HyperOp of the same Fused HyperOp is executing the previous iteration.
• To improve performance further, a pipeline of Fused HyperOps can be formed, referred to as the Fused HyperOp Pipeline. In [20] this was referred to as the Custom Instruction Pipeline. In a Fused HyperOp Pipeline, several instances of the Fused HyperOps can be unrolled, and data transfer between the various instances can be accomplished by setting up a pipeline between these entities. The Fused HyperOp Pipeline is useful for linear communication structures. However, other applications are known to have different communication structures, as mentioned in [21]. Other computation structures may need to be created for these applications.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we presented the architectural evolution of REDEFINE, a runtime reconfigurable architecture. We used a general purpose processor (GPP) as the starting point for evolving a runtime reconfigurable architecture, since a GPP reconfigures itself every clock cycle. With certain architectural optimizations, i.e. explicit write backs, we were able to achieve a 1.91x performance improvement over a GPP for SHA-1. However, this solution performs 20x worse than an ASIC. To further improve the performance, the use of domain-specific custom IP blocks within the CE becomes necessary. Runtime reconfigurable hardware with a custom IP block gives a 4x improvement in performance for the Shift-Reduce kernel, which is the most compute intensive component of field multiplication. Apart from optimizations to the execution fabric, optimizations are also essential in the support logic, viz. the schedule, placement and launch of application substructures, in order to come close to the performance of an ASIC.
In this paper, we have presented only the performance comparison with regard to an ASIC and a GPP. We intend to extend this work to include a detailed power comparison and power-performance comparison with regard to an FPGA. We also intend to develop a complete library of domain-specific custom IPs that can be used within a compute element of the reconfigurable fabric, to accelerate cryptographic applications.

REFERENCES

[1] H. Corporaal, Microprocessor Architectures: From VLIW to TTA. John Wiley & Sons, 1998.
[2] K. C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, vol. 16, no. 2, pp. 28-40, 1996.
[3] A. N. Satrawala, K. Varadarajan, M. Alle, S. K. Nandy, and R. Narayan, "REDEFINE: Architecture of a SoC Fabric for Runtime Composition of Computation Structures," in FPL 2007: International Conference on Field Programmable Logic and Applications, Amsterdam, 2007, pp. 558-561.
[4] M. Alle, K. Varadarajan, A. Fell, S. K. Nandy, and R. Narayan, "Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures," in ARC '09: International Workshop on Applied Reconfigurable Computing, vol. 5453/2009, London, U.K., 2009, pp. 204-215.
[5] C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," in CGO '04: Proceedings of the International Symposium on Code Generation and Optimization, Palo Alto, CA, 2004, p. 75.
[6] N. Joseph, et al., "RECONNECT: A NoC for polymorphic ASICs using a low overhead single cycle router," in ASAP '08: Proceedings of the 2008 International Conference on Application-Specific Systems, Architectures and Processors, Leuven, Belgium, 2008, pp. 251-256.
[7] A. Fell, et al., "Streaming FFT on REDEFINE-V2: An Application-Architecture Design Space Exploration," in CASES '09: Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, Grenoble, France, 2009, pp. 127-136.
[8] G. K. Singh, K. Varadarajan, M. Alle, S. K. Nandy, and R. Narayan, "A Generic Graph-Oriented Mapping Strategy for a Honeycomb," International Journal on Futuristic Computer Applications, vol. 1, no. 1, pp. xx-xx, 2010.
[9] A. Fell, P. Biswas, J. Chetia, S. K. Nandy, and R. Narayan, "Generic Routing Rules and a Scalable Access Enhancement for the Network-on-Chip RECONNECT," in IEEE International SoC Conference, Glasgow, 2009, pp. xx-xx.
[10] J. R. Gurd, C. C. Kirkham, and I. Watson, "The Manchester prototype dataflow computer," Commun. ACM, vol. 28, no. 1, pp. 34-52, 1985.
[11] Stretch Inc. Stretch S6000 Devices. [Online]. Available: http://www.stretchinc.com/_files/s6ArchitectureOverview.pdf
[12] S. Vassiliadis, et al., "The Molen Polymorphic Processor," IEEE Transactions on Computers, vol. 53, no. 11, pp. 1363-1375, Nov. 2004.
[13] Fujitsu Limited, IPflex Inc. (2004, Mar.) IPFlex and Fujitsu Introduce DAP/DNA®-2, the Dynamically Reconfigurable Processor. [Online]. Available: http://www.fujitsu.com/global/news/pr/archives/month/2004/20040317-01.html
[14] M. B. Taylor, et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs," IEEE Micro, vol. 22, no. 2, pp. 25-35, Mar. 2002.
[15] K. Sankaralingam, et al., "The Distributed Microarchitecture of the TRIPS Prototype Processor," in 39th International Symposium on Microarchitecture (MICRO), Orlando, 2006, pp. 480-491.
[16] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, "WaveScalar," in 36th Annual International Symposium on Microarchitecture (MICRO-36), Washington, DC, USA, 2003, p. 291.
[17] B. R. Rau, "Dynamically scheduled VLIW processors," in MICRO 26: Proceedings of the 26th Annual International Symposium on Microarchitecture, Austin, Texas, 1993, pp. 80-92.
[18] M. Larabel. (2009, Sep.) GCC vs. LLVM-GCC Benchmarks. [Online]. Available: http://www.phoronix.com/scan.php?page=article&item=apple_llvm_gcc&num=1
[19] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008.
[20] M. Alle, et al., "REDEFINE: Runtime reconfigurable polymorphic ASIC," ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 1-48, 2009.
[21] K. Asanovic, et al., "A view of the parallel computing landscape," Commun. ACM, vol. 52, no. 10, pp. 56-67, Oct. 2009.
ADCOM 2009
POSTER PAPERS
A Comparative Study of Different Packet Scheduling Algorithms with Varied
Network Service Load In IEEE 802.16 Broadband Wireless Access Systems
For this purpose we have designed a scenario, which is shown in figure 1. The configuration used is:
Radio Type – 802.16 Radio
MAC Protocol – 802.16
BS Frame Duration – 20 ms
IP Queue scheduling – strict priority, WFQ, RR, SCFQ, WRR
IP Queue Management type – FIFO, RED, RIO, WRED
PHY Channel bandwidth – 20 MHz

Figure 2. Average Throughput (bps) comparison in queue scheduling schemes
Figure 4. Average Queuing Delay (sec) comparison for Service Load 1
Figure 6. Queue Service ratio comparison for Service Load 1
Figure 8. Packet loss (%) comparison for Service Load 2
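The scheduling disciplines compared in the figures differ in how they drain the per-class queues. A minimal sketch of two of them, strict priority and weighted round robin, is given below; this is an assumed implementation for illustration, not the simulator's code:

```python
# Sketch (assumed implementations) of two of the compared disciplines.
from collections import deque

def strict_priority(queues):
    """Always serve the highest-priority non-empty queue first.
    `queues` is ordered from highest to lowest priority."""
    for q in queues:
        if q:
            return q.popleft()
    return None  # all queues empty

def weighted_round_robin(queues, weights):
    """Yield packets, serving up to weights[i] packets from queue i
    in each round, so no class is starved."""
    while any(queues):
        for q, w in zip(queues, weights):
            for _ in range(w):
                if not q:
                    break
                yield q.popleft()

qs = [deque([1, 2, 3]), deque([4, 5, 6])]
print(list(weighted_round_robin(qs, [2, 1])))  # [1, 2, 4, 3, 5, 6]
```

Strict priority minimizes delay for the top class but can starve lower classes under load, which is one reason the best-performing discipline changes with the Service Load.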
In the case of SERVICE LOAD 1, SERVICE LOAD 2 and SERVICE LOAD 3, we can observe that the queue scheduling algorithms SCFQ, RR and WFQ, respectively, combined with queue management type RED, provide good QoS support among all the algorithms.

Figure 9. Queue Service ratio comparison for Service Load 2

3. Conclusion

In the existing literature, many researchers have tried to find flaws in the queue scheduling algorithms and have modified them to obtain better QoS support irrespective of the network Service Load. The present work clearly shows that a particular algorithm which is considered good in a specific environment may not provide good QoS in another environment. It is reasonable to take the network Service Load into account while modifying the packet
A Simulation Based Comparison of Gateway Load Balancing Strategies in
Integrated Internet-MANET
sum of the queue lengths of all mobile nodes lying on a path, as explained in the previous section.
Aggregate of Routing Table Entries (ARTE): Every mobile node maintains a routing table which contains information about valid routes to a set of destinations. If the number of entries is large, then this mobile node acts as an intermediate node which knows many destinations, and hence will find itself sending more Route Reply messages. Thus, a mobile node with more routing table entries is more likely to be a busy node and hence is better avoided. The Aggregate Routing Table Entry (ARTE) is the sum of all routing table entries on a route.
In the WLB-AODV routing protocol, a weighted sum of the above three metrics is used in the selection of a route. Every mobile node maintains the AIQL and ARTE values in its routing table for known destinations, apart from the hop count as in the original AODV. The weighted sum is:
Mi = a * HL + b * AIQL + c * ARTE, with a + b + c = 1
where Mi represents the value of the weighted metric for route 'i', and a, b and c represent the weights given to each of the components.

3. Gateway Load Balancing Strategies for Integrated Internet-MANET using Load Balanced Routing Protocols

The proposed network architecture for gateway load balancing consists of two tiers. The high tier consists of foreign agents and the low tier consists of mobile nodes which form the mobile ad hoc network, as shown in figure 1.
mobile ad hoc network. The second strategy, called Strategy-2, uses the Modified-AODV routing protocol. Both strategies have been implemented for the hybrid gateway discovery mechanism.

4. Simulation of the Proposed Strategies

A simulation of the proposed gateway load balancing strategies was carried out to determine the effectiveness of their performance when compared to each other. The network simulator used was ns-2.31. The simulations were carried out for varying Packet Sending Rates. The simulation parameters common to all the simulations are given in Table 4.1. The constants a, b and c in the metric Mi are given the values a=0.5, b=0.25 and c=0.25.

Table 4.1. Simulation Parameters

    Parameter Name                  Value
    Number of mobile nodes          25
    Number of source nodes          5
    Number of IGW                   3
    Number of Correspondent Nodes   3
    Topology Size                   1200 x 500 m
    Transmission Range              250 m
    Traffic Type                    Constant Bit Rate
    Mobile Node Speed               20 m/sec
    Packet Size                     512 bytes
    Pause Time                      5 sec
    Mobility Model                  Random Waypoint Model
    Carrier Sensing Range           500 m
    Simulation time                 900 sec
    Packet Sending Rate             5 – 40 Packets/Sec
    Interface Queue Length          50 packets
    Advertisement Interval          5 sec
    Advertisement Zone              3 hops
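The WLB-AODV route selection can be sketched directly from the weighted sum above. This is an illustration, not the authors' ns-2 code; the candidate routes are hypothetical (HL, AIQL, ARTE) triples, and a smaller metric indicates a less loaded route:

```python
# Sketch of the WLB-AODV weighted route metric Mi = a*HL + b*AIQL + c*ARTE,
# using the weights from the simulations (a=0.5, b=0.25, c=0.25).
def route_metric(hl, aiql, arte, a=0.5, b=0.25, c=0.25):
    assert abs(a + b + c - 1.0) < 1e-9  # weights must sum to 1
    return a * hl + b * aiql + c * arte

def best_route(routes):
    """Pick the candidate route with the smallest weighted metric,
    i.e. the shortest and least loaded one."""
    return min(routes, key=lambda r: route_metric(*r))

# Two hypothetical candidates as (HL, AIQL, ARTE) triples:
print(best_route([(3, 10, 8), (4, 2, 5)]))  # -> (4, 2, 5)
```

Note how the longer route wins here: its extra hop is outweighed by its much smaller queue and routing-table load, which is exactly the load-balancing behavior the protocol aims for.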
outperforms Strategy-2. This indicates that Strategy-1 successfully chooses less loaded routes to enable faster delivery of data packets. Figure 3 shows the comparison of the routing load of the two strategies. Here a slight, albeit consistent, advantage is observed for Strategy-1. This indicates that Strategy-1 incurs lower routing overhead, due to fewer control packet retransmissions, by choosing lightly loaded routes. Figure 4 again establishes the superiority of Strategy-1 over Strategy-2, by delivering a better packet delivery ratio. This is due to the fact that less loaded routes are chosen for data delivery and hence more packets are delivered.

Figure 3. Normalized Routing Load as a function of Packet Sending Rate

and the other based on Modified-AODV. Through simulations in ns-2.31, it is observed that the Strategy based on WLB-AODV gives performance enhancements over the Strategy based on Modified-AODV, based on the performance metrics End-to-End Delay, Normalized Routing Load and Packet Delivery Ratio.

6. References

[1] E.M. Royer and C-K. Toh, "A Review of Current Routing Protocols for Ad Hoc Mobile Wireless Networks", IEEE Personal Communications Magazine, pp. 46-55, (1999).
[5] Y-Y Hsu, Y-C Tseng, C-C Tseng, C-F Huang, J-H Fan and H-L Wu, "Design and Implementation of Two-Tier Mobile Ad Hoc Networks with Seamless Roaming and Load-Balancing Routing Capability", Proceedings of the First International Conference on Quality of Service in Heterogeneous Wired/Wireless Networks, pp. 52-58, (2004).
ECAR: An Efficient Channel Assignment and
Routing in Wireless Mesh Network
Chaitanya P. Umbare Dr. S. V. Rao
Department of Computer Science and Engineering Department of Computer Science and Engineering
Indian Institute of Technology Indian Institute of Technology
Guwahati, Assam Guwahati, Assam
India-781039 India-781039
Email: umbare@iitg.ernet.in Email: svrao@iitg.ernet.in
Abstract—Wireless mesh networking is a promising design paradigm for future generation wireless networks. The wireless mesh network (WMN) came into existence to resolve the limitations and to significantly improve the performance of ad-hoc networks, wireless local area networks (WLANs), etc. In a WMN, or any network, channel allocation is done based on the queue states at different nodes, which depend on how routing is done. Thus it is important to have joint routing and channel allocation algorithms to control measures of performance such as delay and throughput. For better and more efficient use of multi-radio and advanced physical layer technologies, cross-layer control is required. In this paper, we propose an intelligent channel assignment strategy which results in minimum interference between channels, and efficient routing which results in a reduction of end-to-end delay.

I. INTRODUCTION

A WMN is a multi-hop, self-configured and self-healing network, which provides reliability, redundancy and scalability [1]. WMNs consist of two types of nodes: mesh routers and mesh clients. Mesh routers, which form the backbone of the network, are generally not mobile and are equipped with one or more radios operating on the same or different radio technologies. Mesh clients are generally mobile and equipped with only one radio because of power constraints.

Now we discuss some of the past work related to our protocol. The use of multiple 802.11 NICs per node is explored for routing in multi-radio, multi-hop WMNs [2]. Due to the static channel assignment to all the nodes, the throughput improvement is proportional to the number of NICs. The main idea in the multichannel CSMA MAC protocol for multihop wireless networks [3] is to find an optimum channel for every single packet transmission. Channel switching is done on a packet-by-packet basis, which requires re-synchronization among the communicating network cards for every packet and leads to a decrease in network throughput. The centralized channel assignment and routing algorithm for multi-channel wireless mesh networks [4] visits all the virtual links in decreasing order of their expected loads and, upon visiting a link, greedily assigns the channel that leads to minimum interference. This centralized algorithm demands complete information about the network to perform channel assignment and routing. Distributed channel assignment in multi-radio 802.11 mesh networks [5] presents a distributed, self-stabilizing mechanism that assigns channels to multi-radio nodes in a WMN. The main disadvantage of this protocol is that the assigned channels are not changed once the channel assignment has stabilized. A distributed channel assignment and routing scheme for WMNs [6] is proposed by Raniwala. The assumption made in that paper is that all traffic goes to or comes from the gateway nodes only; for other traffic patterns, the protocol does not work. Also, due to uncoordinated allocation among nodes with the same priority, the channel assignment may not converge and may thus cause severe interference among nodes.

In this paper we address two main issues which significantly affect the performance of the network: routing and channel assignment. We try to improve the protocol of [6] with an intelligent channel assignment which significantly reduces the interference between neighboring nodes, and intelligent routing which reduces much of the routing overhead. The next section describes our protocol in detail.

II. PROPOSED SCHEME

The ECAR protocol basically consists of two phases:
1) Load-Balancing Routing
2) Distributed Load-Aware Channel Assignment
These phases are explained in the following sections.

A. Load-Balancing Routing

The main idea in the routing process is: if the source and destination are in the same subnet and the hop count between the router and the destination is less than or equal to three, then the router forwards the data packet using NIC[0] to the appropriate neighbor with the help of the master routing table. If the destination router is in the subtree rooted at this router, it sends the data packets to the appropriate child using NIC[2] and the inter-domain routing table; otherwise it sends the data packet to its parent using NIC[1] and the inter-domain routing table.

Each router periodically broadcasts a HELLO message to all its 1-hop neighbors. After receiving a HELLO message, a router updates its neighbor list, considering hop-count
information, and broadcasts this updated information to its 1-hop neighbors. As a result, each router has information about all the routers in the network, with the minimum hop-count from the corresponding router. In this way, each router builds the master routing table.

In order to build the inter-domain routing table, each gateway node broadcasts an ADVERTISE message to all its 1-hop neighbors with the residual capacity of its uplink. Upon receiving an ADVERTISE message, a router sends a JOIN message to the advertiser if the advertised residual capacity exceeds that of its existing gateway; otherwise it does not reply. When a router receives a JOIN message, it adds the router to its children list and sends an ACCEPT message containing information about the channel to be used for communication. The router also sends a ROUTE-ADD message to its parent, containing the address of the node and its children, and this process continues up to the gateway node. When a router receives an ACCEPT message, it updates its parent entry and sends a LEAVE message to its previous parent. Upon receiving a LEAVE message, a router deletes the forwarding entries to that router and all its children and sends a ROUTE-DELETE message to its parent; the process of sending ROUTE-DELETE is recursive. The master routing table and the inter-domain routing table can thus be built with only one message, which avoids the overhead of sending extra control messages.

B. Distributed Load-Aware Channel Assignment

In this section we present a localized distributed algorithm for channel assignment to each interface. The NIC used to communicate with the parent router is termed the parent NIC, whereas the child NIC denotes the NIC used to communicate with a router's children. Each WMN router is responsible for assigning a channel to its child NIC, and the parent NIC of the router is associated with a unique child NIC of the parent router and is assigned the same channel as the parent's child NIC. […] may result in a change in the various channel loads, which may imbalance the channel load and decrease the network […] channel assigned to their child NIC by executing the above procedure periodically.

III. PERFORMANCE EVALUATION

This section gives the implementation details of our scheme in ns-2 [7]. We also present our simulation environment and results.

A. Simulation Environment

The simulation environment is as described in Table I. All the simulations are done in ns-2.

TABLE I. SIMULATION PARAMETERS
No.  Parameter                Value
1    Area                     240 m x 240 m
2    Transmission Range       22.5 m
3    Nodes                    25, 50, 75, 100
4    Data Rate                50 Kbps – 4 Mbps
5    Number of Channels       13
6    Number of NICs/Node      3
7    Number of Flows          5–30
8    Number of Gateway Nodes  1–4
9    Simulation Time          400 sec

B. Simulation Results

The performance parameters measured in our simulations are the average aggregate throughput and the average end-to-end delay. Our protocol assigns channels to interfaces intelligently; because of this, the average aggregate throughput is much better in our case than for DCAR [6], as shown in the graphs below.

[Figure: Average Aggregate Throughput (Mbits/sec) vs. Number of Flows (10–30) for ECAR and DCAR; 100 nodes, 3 gateway nodes.]
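The forwarding decision and the ADVERTISE/JOIN exchange described in Section II-A can be sketched as follows; the data structures, names and capacity values are our assumptions for illustration, not code from the paper.

```python
# A minimal sketch of ECAR's load-balancing routing phase: the three-way
# NIC forwarding decision and the residual-capacity gateway selection.

def pick_nic(dst, subnet, subtree, hop_count):
    """Choose the outgoing NIC index for a packet toward dst."""
    if dst in subnet and hop_count.get(dst, float("inf")) <= 3:
        return 0   # same subnet, <= 3 hops: master routing table, NIC[0]
    if dst in subtree:
        return 2   # destination under this router: child link, NIC[2]
    return 1       # otherwise: uplink toward the parent, NIC[1]

class MeshRouter:
    def __init__(self):
        self.parent = None   # (gateway_id, residual_capacity)

    def on_advertise(self, gateway_id, residual_capacity):
        """Send JOIN only when the advertised residual uplink capacity
        beats the current gateway's; otherwise stay silent."""
        if self.parent is None or residual_capacity > self.parent[1]:
            self.parent = (gateway_id, residual_capacity)
            return True    # JOIN sent; ACCEPT/LEAVE handling omitted
        return False

r = MeshRouter()
r.on_advertise("gw1", 10)          # first gateway heard: JOIN
joined = r.on_advertise("gw2", 5)  # lower residual capacity: ignored
```

The comparison on residual capacity is what steers routers toward less loaded gateways, which is the load-balancing effect the protocol relies on.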
[Figure: Average Aggregate Throughput (Mbits/sec) vs. Number of Gateway Nodes (1–4) for ECAR and DCAR; 30 flows, 100 nodes.]

[…] which of the available non-overlapping radio channels should be assigned to each 802.11 interface in the WMN? Second, how should packets be routed through a multi-channel wireless […] master routing table, by sending data packets directly to the next hop instead of through the gateway nodes, which results in a reduction of routing overhead. Moreover, dynamically changing the assigned channels minimizes interference and hence increases the throughput of the network.
In the following graphs we show the effect on average end-to-end delay of changing the number of hops (1 in the first graph, 2 in the second and 3 in the third) between the source and destination, with a varying number of nodes.

[Figure: Average End-to-End Delay (msec) vs. Number of Nodes (20–110) for ECAR and DCAR; 1 hop between source and destination.]

[Figure: Average End-to-End Delay (msec) vs. Number of Nodes (20–110) for ECAR and DCAR; 2 hops between source and destination.]

[Figure: Average End-to-End Delay (msec) vs. Number of Nodes (20–110) for ECAR and DCAR; 3 hops between source and destination.]

REFERENCES

[1] R. Bruno, M. Conti, and E. Gregori, "Mesh networks: commodity multihop ad hoc networks," IEEE Communications Magazine, vol. 43, pp. 123–131, 2005.
[2] R. Draves, J. Padhye, and B. Zill, "Routing in multi-radio, multi-hop wireless mesh networks," in MobiCom '04: Proceedings of the 10th Annual International Conference on Mobile Computing and Networking. ACM Press, 2004, pp. 114–128.
[3] A. Nasipuri, J. Zhuang, and S. Das, "A multichannel CSMA MAC protocol for multihop wireless networks," 1999, pp. 1402–1406.
[4] A. Raniwala, K. Gopalan, and T.-C. Chiueh, "Centralized channel assignment and routing algorithms for multi-channel wireless mesh networks," SIGMOBILE Mob. Comput. Commun. Rev., vol. 8, pp. 50–65, 2004.
[5] B.-J. Ko, V. Misra, J. Padhye, and D. Rubenstein, "Distributed channel assignment in multi-radio 802.11 mesh networks," in Wireless Communications and Networking Conference (WCNC 2007), IEEE, 2007, pp. 3978–3983.
[6] A. Raniwala and T.-C. Chiueh, "Architecture and algorithms for an IEEE 802.11-based multi-channel wireless mesh network," vol. 3, 2005, pp. 2223–2234.
[7] Information Sciences Institute, "NS-2 network simulator," Software Package, 2003, http://www.isi.edu/nsnam/ns/.
Rotational Invariant Texture Classification of Color
Images using Local Texture Patterns
A. Suruliandi
Department of Computer Science and Engineering, Manonmaniam Sundaranar University
Tirunelveli, Tamilnadu, India
suruliandi@yahoo.com
E. M. Srinivasan, Department of ECE, Government Polytechnic College, Nagercoil, Tamilnadu, India. emsvasan@yahoo.com
K. Ramar, Department of CSE, National Engineering College, Kovilpatti, Tamilnadu, India. kramar_nec@rediffmail.com
Abstract—In this paper a new approach extending the Local Texture Patterns (LTP) texture model to color images is presented. In this study, to extract the spatial features of a color image, Gray-Local Texture Patterns (GLTP) is introduced. Contrast is another important property of images; to extract contrast features, Color-Local Contrast Variance Patterns (CLCVP) is proposed. However, much important information contained in the image can be revealed by joint distributions of individual features. Hence, GLTP/CLCVP is proposed as a textural feature extraction technique for the classification of color images. The performance of the proposed features is evaluated on rotational invariant classification of Outex texture database images. From the experimental results it is observed that GLTP/CLCVP yields a high classification accuracy of 99.32% for color images.

Keywords—Texture Analysis, Texture Classification, Local Texture Patterns, Local Gray scale Texture Patterns, Local Color Contrast Variance Patterns.

I. INTRODUCTION

Texture methods can be categorized as statistical, geometrical/structural, model-based and signal processing features. A comparative study on texture measures is reported in [3].

A. Motivation and Justification for the Proposed Approach

Color texture models are essential for at least two reasons. The first is that most approaches to texture analysis quantify texture measures by single values such as mean, variance, entropy, etc. However, much important information contained in the distribution of texture values might be lost. The second is color, which has considerable importance in image analysis. Color as well as texture has been discussed intensively in the literature, but most of the known texture models are based on gray-scale images. This is an inadequate restriction in many real-world applications of computer vision. These restrictions are the motivation behind the development of color texture models.

He and Wang [1] proposed a texture modeling scheme called 'Texture Spectrum (TS)'. The TS operator is gray-scale invariant but not rotational invariant. Ojala et al. extended their earlier work in 2002 [4] and proposed a new texture model called 'Local Binary Patterns (LBP)'. The LBP model is gray-scale and rotational invariant. Recently, Suruliandi and Ramar [6] proposed a new texture model, LTP, combining the best features of the TS and LBP models. Hence it is proposed to extend the LTP model and to combine it with the contrast variance feature for the classification of color images.

B. Outline of the Proposed Work

The spatial texture feature used in this study is LTP, which is basically a gray-scale operator; it is extended here to process color image patterns, and the feature GLTP is introduced for this purpose. LTP is an excellent measure of the spatial structure of local image texture, but it discards the contrast variance of the image, which is an important property. The feature CLCVP is proposed based on the contrast variance feature. Color and texture being complementary, it is expected that their orthogonal property, in the form of a joint distribution, can perform better for texture analysis. Hence, the feature GLTP/CLCVP is proposed as a joint distribution texture model.

C. Organization of the Paper

This paper is organized as follows. Section II describes the basic operators LTP and VAR. The proposed texture feature extraction technique for color images is explained in Section III. The classification principle is illustrated in Section IV. Experimental results are presented in Section V. Discussion and conclusion are presented in Section VI.

II. BASIC OPERATORS

A. Local Texture Patterns (LTP)

The local image texture information can be extracted from a neighborhood of 3 x 3 local regions. Let gc, g1,
g2, …, g8 be the pixel values of a local region, where gc is the value of the central pixel and g1, g2, …, g8 are the pixel values of its 8-neighborhood. Let the pattern unit P between gc and its neighbor gi (i = 1, 2, …, 8) be defined as

  P(gi, gc) = 0  if gi < (gc − Δg)
  P(gi, gc) = 1  if (gc − Δg) ≤ gi ≤ (gc + Δg)      i = 1, 2, …, 8      (1)
  P(gi, gc) = 9  if gi > (gc + Δg)

where Δg is a small positive value that represents a threshold set by the user. The values for P can be any three distinct values; here 0, 1 and 9 are chosen in order to make the pattern labeling process easier, and they have no other significance. Fig. 1 shows a 3 x 3 local region, the P values calculated along the border, and its pattern string.

Fig. 1. (a) 3 x 3 local region: [123 110 113; 117 120 135; 130 125 128]. (b) Pattern units matrix for Δg = 4: [1 0 0; 1 · 9; 9 9 9]. (c) Pattern string: 00999911.

To define uniform circular patterns over the pattern string, a uniformity measure U, which corresponds to the number of circular spatial transitions in the pattern string, is defined as

  U = s(P(g8, gc), P(g1, gc)) + Σ_{i=2}^{8} s(P(gi, gc), P(g_{i−1}, gc))      (2)

where

  s(X, Y) = 1 if |X − Y| > 0, and 0 otherwise.      (3)

The following rotational and gray-scale-shift invariant 'Local Texture Pattern (LTP)' operator is proposed for describing local image texture:

  LTP = Σ_{i=1}^{8} P(gi, gc)  if U ≤ 3;  73 otherwise.      (4)

B. Contrast Variance (VAR)

The local contrast variance (VAR) is defined by

  VAR = (1/P) Σ_{i=0}^{P−1} (gi − μc)²,  where μc = (1/P) Σ_{i=0}^{P−1} gi      (5)

VAR is, by definition, invariant against shifts in gray scale. In forming a pattern over a local region, either LTP or VAR can be used as a texture descriptor. The joint pair LTP+VAR, or their joint distribution LTP/VAR, can also be used as texture descriptors.

C. Patterns Unification Procedure

VAR has a continuous-valued output, and hence quantization of its feature space is needed. This is achieved by the patterns unification procedure. The unique code corresponding to the inputs P, Q and R is computed as shown in Figure 2. The variables P, Q and R may represent VAR values. A '●' in the upper position means its value is higher, a '●' in the lower position means its value is lower, and '●'s on the same line mean the values are equal.

[Figure 2. Patterns unification codes 1–12 for the possible relative orderings of P, Q and R.]

III. PROPOSED TEXTURE MODEL FOR COLOR IMAGES

The procedure for computing GLTP/CLCVP is illustrated in Figure 3.
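The pattern-unit and uniformity computations of Eqs. (1), (2) and (4) can be sketched for one 3 x 3 region as below, using the Fig. 1 values; the clockwise neighbor ordering is an assumption of this sketch.

```python
# Sketch of the LTP operator on a single 3x3 region (Eqs. 1, 2 and 4).

def pattern_unit(gi, gc, dg=4):
    """Eq. (1): 0 below, 1 within, 9 above the +/- dg band around gc."""
    if gi < gc - dg:
        return 0
    if gi <= gc + dg:
        return 1
    return 9

def ltp(region, dg=4):
    """region: 3x3 list of pixel values; returns the LTP label (Eq. 4)."""
    gc = region[1][1]
    # neighbors g1..g8 taken clockwise around the border (our ordering)
    nbrs = [region[0][0], region[0][1], region[0][2], region[1][2],
            region[2][2], region[2][1], region[2][0], region[1][0]]
    units = [pattern_unit(g, gc, dg) for g in nbrs]
    # Eq. (2): number of circular transitions in the pattern string
    u = sum(1 for i in range(8) if units[i] != units[i - 1])
    return sum(units) if u <= 3 else 73  # Eq. (4)
```

For the Fig. 1 region, the eight pattern units are 1, 0, 0, 9, 9, 9, 9, 1, giving U = 3 (a uniform pattern) and LTP = 38.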
Figure 3. Extraction of GLTP/CLCVP joint distribution features: from an RGB image, the R, G and B planes are isolated. The intensity image I = 0.299R + 0.587G + 0.114B is formed, and the LTP plane is obtained by applying the LTP operator over I with a 3 x 3 sliding window. In parallel, VAR is computed over 3 x 3 local regions of the R, G and B planes to form R, G and B VAR planes, which are combined by the patterns unification procedure into a unified VAR plane. The 2D joint distribution histogram GLTP/CLCVP is then formed from the LTP plane and the unified VAR plane.

IV. CLASSIFICATION PRINCIPLE

A. Texture Similarities

Similarity between different textures is evaluated by comparing their pattern spectra using the log-likelihood ratio, also known as the G-statistic [5]:

  G = 2 [ Σ_{s,m} Σ_{i=1}^{n} fi log fi − Σ_{s,m} (Σ_{i=1}^{n} fi) log (Σ_{i=1}^{n} fi)
        − Σ_{i=1}^{n} (Σ_{s,m} fi) log (Σ_{s,m} fi) + (Σ_{s,m} Σ_{i=1}^{n} fi) log (Σ_{s,m} Σ_{i=1}^{n} fi) ]      (6)

where s and m denote the sample and model histograms and fi is the frequency in bin i.

B. k-Nearest Neighbor Classification

The algorithm for k-Nearest Neighbor classification is as follows:
• Have training data (xi, ti; i = 1, 2, ..., n), where xi is the attribute vector of training sample i and ti is the class label of training sample i.
• Have some test point x to classify.
• Calculate the similarity or dissimilarity between the test point and the training points.
• Find the k training points k1, k2, ..., kk which are closest to the test point.
• Set the classification t of the test point to the most common label among the k nearest neighbors.
• In the special case k = 1, the algorithm behaves simply as nearest neighbor classification.

V. EXPERIMENTS

A. Images Used in the Experiments

The performance of the proposed texture features for color images is demonstrated with textures downloaded from the Outex database [2]. In this study five classes of textures are used for classification purposes. In each class of texture, images at three different rotation angles, three different illuminations and three different resolutions (100, 300 and 600 dpi) were used. In total, 135 (5 x 3 x 3 x 3) textures were used for the experiments. Figure 4 shows the five texture classes.

B. Rotational Invariant Texture Classification

There are many applications of texture analysis in which rotation invariance is important. In most approaches to texture classification, it is assumed that the unknown samples to be classified are identical to the training samples with respect to orientation. However, in reality the samples to be classified can occur at arbitrary orientations.

In this experiment, the classifier was trained with samples of illuminant 'inca', rotation angle 0° and resolution 600 dpi. 10 samples, each of size 64 x 64, were extracted from all five texture classes shown in Figure 4.
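A hedged sketch of this classification principle follows, combining the G-statistic of Eq. (6) as the dissimilarity measure with the k-NN rule; the histograms and class labels are invented for illustration.

```python
# Sketch: k-NN texture classification with the G-statistic as dissimilarity.
import math
from collections import Counter

def g_statistic(s, m):
    """Log-likelihood G-statistic between sample and model histograms."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    term1 = sum(xlogx(f) for h in (s, m) for f in h)     # per-bin, per-histogram
    term2 = sum(xlogx(sum(h)) for h in (s, m))           # per-histogram totals
    term3 = sum(xlogx(si + mi) for si, mi in zip(s, m))  # per-bin pooled counts
    term4 = xlogx(sum(s) + sum(m))                       # grand total
    return 2.0 * (term1 - term2 - term3 + term4)

def knn_classify(test_hist, train, k=1):
    """train: list of (histogram, label); majority label of k nearest."""
    nearest = sorted(train, key=lambda t: g_statistic(test_hist, t[0]))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

train = [([8, 2, 0], "canvas"), ([0, 2, 8], "bark")]
label = knn_classify([7, 3, 0], train, k=1)  # nearest model wins
```

Identical histograms yield G = 0, and G grows as the distributions diverge, which is what makes it usable as a k-NN dissimilarity.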
Figure 4. The five texture classes used in the experiments: (a)–(e).

Texture                      CV-1-45  CV-1-90  CV-2-30  CV-2-60  CV-3-00  CV-3-10  CV-5-15  CV-5-30  CB-1-15  CB-1-75  Average
Classification Accuracy (%)  90.91    97.73    100      100      100      100      100      100      100      100      98.86
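The reported average follows directly from the per-texture accuracies in the table above:

```python
# Quick arithmetic check of the average classification accuracy.
acc = [90.91, 97.73, 100, 100, 100, 100, 100, 100, 100, 100]
avg = round(sum(acc) / len(acc), 2)  # 98.86
```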
Time Synchronization for an Efficient Sensor Network System
Anita Kanavalli, Vijay Krishan, Ridhi Hirani, Santosh Prasad, Saranya K.,
P Deepa Shenoy, and Venugopal K R
Department of Computer Science and Engineering
University Visvesvaraya College of Engineering, Bangalore, India
anita.kanavalli@gmail.com
L M Patnaik
Vice Chancellor
Defence Institute of Advanced Technology, Pune, India
2 Model and Algorithm

Our system consists of a network of sensor nodes which are assumed to be within each other's power ranges. Such sensors are called neighbors of each other. The radio link between the sensor nodes is bidirectional, allowing two-way communication between the nodes. In the group synchronization algorithm, the nodes are synchronized if their delay value has not crossed a pre-calculated threshold value. Both the receiver-sender and the sender-receiver can fall prey to the pulse delay attack.

2.1 Problem definition

Implementation of the modified L-SGS algorithm: the algorithm uses the broadcast property of the wireless channel to broadcast messages from the central node to the other nodes. The times Tij are measured by the local clock of the node, where Tij is time i received by node j. Each node has to keep track of four times, Ti, i = 1 to 4, in order to calculate dj and δj, if not compromised. Any of the nodes in the group can serve as the central node, provided that collisions in the wireless channel do not cause any drop of packets to and from the central node. The Resync algorithm is implemented in order to counter the external attack and re-enter the synchronization process after being compromised.

3 Resync Algorithm

The proposed algorithm is a continuation of the modified L-SGS. As compared to the L-SGS, the modified L-SGS does not abort when a node gets compromised. Instead, the node which has been identified as the compromised node runs the Resync algorithm in order to counter the external attack and re-enter the synchronization process after being compromised. The main idea used in the Resync algorithm is that once the outliers, i.e. the malicious time offsets, have been detected, they need to be excluded while calculating the true offsets between the nodes. This is done by calculating the mean of the offsets of the benign nodes to approximate the true time offsets.

Let Γ be the time offset data set from all the nodes and χ be the set of outlier time offsets; then the benign time offset set is defined as Γ − χ. Let µ be defined as the average of the set Γ − χ. If the total number of time offsets is n and the number of time offsets of compromised nodes is k, then µ is calculated as

  µ = Σ_{i=1}^{n−k} (Γ − χ)i / (n − k)

Therefore, when a node gets compromised, it has to ensure that it gets the true offsets from the nodes which are not compromised in the network. Accordingly, it takes the average of the time offsets received and sets its offset to the calculated mean. The Resync algorithm executes as follows:

Table II: Resync Algorithm
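The detect-and-correct pipeline around the Resync algorithm can be sketched as below. The threshold form D = d_avg + n·σ is inferred from the sample-run values reported in Section 4.1 (0.0198446 + 3 × 0.00794624 = 0.04368332 s); the node names, delays and offsets are invented for illustration.

```python
# Sketch: threshold-based outlier detection plus the Resync mean correction.

def delay_threshold(d_avg, sigma, n):
    """Outlier threshold D = d_avg + n*sigma (form inferred, see lead-in)."""
    return d_avg + n * sigma

def detect_compromised(delays, threshold):
    """delays: {node: measured delay}; nodes above the threshold are flagged."""
    return {node for node, d in delays.items() if d > threshold}

def resync_offset(offsets, outliers):
    """mu: mean over the benign offset set (Gamma minus chi)."""
    benign = [off for node, off in offsets.items() if node not in outliers]
    return sum(benign) / len(benign)

D = delay_threshold(0.0198446, 0.00794624, 3)                 # 0.04368332 s
bad = detect_compromised({"n1": 0.05, "n2": 0.02, "n3": 0.018}, D)
mu = resync_offset({"n1": 9.5, "n2": 1.2, "n3": 0.8}, bad)    # benign mean
```

Excluding the flagged offsets before averaging is what keeps a node under pulse delay attack from skewing the corrected offset.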
4 Implementation and Performance Analysis

4.1 Implementation of modified L-SGS

Our implementation of the modified L-SGS is simulation oriented. The simulation was performed using Castalia, a recently developed simulator built specifically for wireless sensor networks. Castalia is built on top of the open-source network simulation framework OMNeT++.

For the sample run, nine nodes are taken into account, one acting as the central node. The delays have been plotted when no node is compromised. Based on the threshold calculation presented above, the average delay is davg = 0.0198446 s and σ = 0.00794624. The required n = 3, and thus the threshold is calculated to be D = 0.04368332 s. This value is then used to detect outliers in the subsequent runs. Based on the threshold delay calculation, the presence of compromised nodes can be detected as described in the algorithm. Figure 1 illustrates how the algorithm is able to pick up on the presence of a node under pulse delay attack based on the threshold value.

Figure 1: End-to-end delay in a sample run

The scatter graph in Figure 1 plots the uncompromised nodes and their respective delays. All the delays are below the threshold, which is represented by the dashed line.

4.2 Performance Evaluation

The Resync algorithm is responsible for ensuring that if a node is compromised, it broadcasts its status to the other nodes present in its group and, using the time offsets of the benign nodes, is able to correct its offset.

Figure 2: Run 1 - Comparison of true and calculated offsets

As we can see from the first plot, Nodes 1, 3, 4, 5 and 7, i.e. five out of the eight nodes, have a difference in the time offsets between 1 and 10 ms. Thus, as we can see from the plot, the Resync algorithm has an efficiency > 70% in most cases.

5 Conclusion

The performance of the modified L-SGS is measured using the Successful Detection Rate (SDR), where different delays are introduced into the network. The performance of the Resync algorithm is then measured by comparing the calculated offsets with the true time offsets. Even though in most cases the accuracy between the time offsets is about 70%, the algorithm can be refined in order to achieve a higher accuracy rate.

References

[1] S. Ganeriwal, R. Kumar, M. B. Srivastava, "Timing-sync protocol for sensor networks," in Proceedings of the First ACM Conference on Embedded Networked Sensor Systems (SenSys), pp. 138-149, 2003.

[2] H. Song, S. Zhu, G. Cao, "Attack-resilient time synchronization for wireless sensor networks," Ad Hoc Networks, vol. 5, no. 1, pp. 112-125, 2006.
Parallel Hybrid Germ Swarm Computing for Video
Compression
K. M. Bakwad¹, S. S. Pattnaik¹, B. S. Sohi², S. Devi¹, B. K. Panigrahi³, M. R. Lohokare¹
¹ National Institute of Technical Teachers' Training and Research, Chandigarh, India
² UIET, Panjab University, Chandigarh, India
³ Indian Institute of Technology, Delhi, India
km_malikaforu@yahoo.co.in
shyampattnaik@yahoo.com
Abstract—This paper proposes Parallel Hybrid Germ Swarm Computing (PHGSC) for real-time video compression. The convergence of Bacterial Foraging Optimization (BFO) is very slow because of its fixed step size, and its performance is heavily degraded for real-time processing. In this paper, the authors first increase the speed of BFO by updating the bacteria positions in parallel instead of serially, which is treated as Parallel Germ Computing (PGC). Parallel germ computing is then hybridized with GLBest Particle Swarm Optimization (GLBestPSO) to improve the global performance of PGC. PHGSC is used to reduce the computational time of motion estimation in video compression. The adaptive step size with prediction, the zero motion vector and the Von Neumann neighborhood topology implemented in PHGSC find the best matching block very quickly. The presented PHGSC saves up to 93.36% of computational time when compared with other published methods.

Keywords—Parallel Hybrid Germ Swarm Computing (PHGSC); Global and Local Best Particle Swarm Optimization (GLBestPSO); video compression; computational time; peak signal to noise ratio.

I. INTRODUCTION

Based upon the foraging strategies of the E. coli bacterium, K. M. Passino proposed Bacterial Foraging Optimization in 2002 [1]. Bacterial foraging optimization [1] is gaining popularity in the research community due to its attractive features. Motion estimation is popularly used in video signal processing and is a fundamental component of video compression; its computational complexity amounts to 70 to 90 percent of that of video compression. The exhaustive search (ES), or full search, algorithm gives the highest peak signal-to-noise ratio of any block-matching algorithm but requires the most computational time [2]. To reduce the computational time of the exhaustive search method, many other methods have been proposed: Simple and Efficient Search (SES) [2], Three Step Search (TSS) [3], New Three Step Search (NTSS) [3], Four Step Search (4SS) [4], Diamond Search (DS) [5], Adaptive Rood Pattern Search (ARPS) [6], Novel Cross Diamond Search [7], the New Cross-Diamond Search algorithm [8], the Adaptive Block Matching Algorithm [9], Efficient Block Matching Motion Estimation [10], Content Adaptive Video Compression [11] and a fast motion estimation algorithm [12]. GA has also been used for fast motion estimation [13]. In this paper, the authors propose the fusion of parallel germ computing with GLBestPSO for motion estimation.

II. PARALLEL HYBRID GERM SOFT COMPUTING

BFO can be classified into serial and parallel BFO. Standard BFO is serial. In BFO, if all of the bacteria update their information at the same time, this is treated as Parallel Bacterial Foraging or Parallel Germ Computing. The Parallel Bacterial Foraging (Parallel Germ Computing) developed by the authors can be found in [14]. PGC, when hybridized with GLBestPSO, is called PHGSC. PHGSC has been used for video compression.

The authors propose an adaptive step size, as given in Eq. (1), which is used to predict the best matching block in the reference frame with respect to the macro block in the current frame for which the motion vector is found. In PHGSC, the positions of the bacteria are updated as given in Eq. (2). In the step size equation, W and C are the same as in GLBestPSO [15], as given in Eq. (3) and Eq. (4). Due to the adaptive step equation of PHGSC, the next block search starts near the best matching block of the previous step.

  Step size = abs[Mx + My] + r·W·C      (1)

  θ(i, j+1, k) = θ(i, j, k) + C(i) · Δ(i)/√(Δᵀ(i)Δ(i)) + Step size      (2)

  w = 1.1 − gbesti / pbesti      (3)

  c = 1 + gbesti / pbesti      (4)

where
Mx = horizontal component of the motion vector of the previous block,
My = vertical component of the motion vector of the previous block.

Step-by-step algorithm of PHGSC
Step 1: Initialize parameters p, S, Nc, Ns, Nre, C(i), i = 1, 2, ..., S.
Where,
p = dimension of the search space
S = number of bacteria in the population
Nc = number of chemotactic steps
Ns = number of swimming steps
Nre = number of reproduction steps
C(i) = step size taken in the random direction specified by the tumble
J(i, j, k) = fitness value or cost of the i-th bacterium in the j-th chemotactic and k-th reproduction step
θ(i, j, k) = position vector of the i-th bacterium in the j-th chemotactic and k-th reproduction step
Jbest(j, k) = fitness value or cost of the best position in the j-th chemotactic and k-th reproduction step
Jglobal = fitness value or cost of the global best position in the entire search space
Step 2: Update the following parameters:
J(i, j, k)
Jbest(j, k)
Jglobal = Jbest(j, k)
Step 3: Reproduction loop: k = k + 1.
Step 4: Chemotaxis loop: j = j + 1.
a) Compute the fitness function J(i, j, k) for i = 1, 2, 3, ..., S.
b) Update Jbest(j, k).
c) Tumble: generate a random vector Δ(i) ∈ ℝ^p with each element Δm(i), m = 1, 2, ..., p, a random number on [-1, 1].
d) Compute θ for i = 1, 2, ..., S:

    θ(i, j+1, k) = θ(i, j, k) + C(i) · Δ(i) / √(Δᵀ(i) Δ(i))

e) Swim:
i) Let m = 0 (counter for swim length).
ii) While m < Ns:
    Let m = m + 1.
    Compute the fitness function J(i, j+1, k) for i = 1, 2, 3, ..., S.
    Update Jbest(j+1, k).
    If Jbest(j+1, k) < Jbest(j, k) (if doing better), set Jbest(j, k) = Jbest(j+1, k) and compute θ for i = 1, 2, ..., S:

        θ(i, j+1, k) = θ(i, j, k) + C(i) · Δ(i) / √(Δᵀ(i) Δ(i))

    Use this θ(i, j+1, k) to compute the new J(i, j+1, k).
    Else, let m = Ns; this is the end of the while statement.
Step 6: If j < Nc, go to step 4. In this case chemotaxis continues, since the life of the bacteria is not over.
Step 7: The Sr = S/2 bacteria with the highest cost-function values die, and the other Sr = S/2 bacteria with the best values split.
Step 8: Update Jglobal from Jbest(j, k).
Step 9: If k < Nre, go to step 3; otherwise end.

III. PHGSC FOR MOTION ESTIMATION

The authors have already used MPPSO [14] and PBFO [16] for motion estimation. The Von Neumann topology is used as the search pattern. In the proposed method, a macro block is treated as a bacterium, and five bacteria are used in PHGSC for motion estimation. The initial position of the block to be searched in the reference frame is the same as that of the block in the current frame for which the motion vector is found. The mean absolute difference (MAD) is taken as the objective or cost function for motion estimation and is expressed as Eq. (5):

    MAD = (1 / MN) Σ_{i=1}^{M} Σ_{j=1}^{N} | CurrentBlock(i, j) − ReferenceBlock(i, j) |    (5)

Where,
M = number of rows in the frame.
N = number of columns in the frame.

The performance of the proposed method is evaluated by the peak signal to noise ratio (PSNR), given by Eq. (6):

    PSNR = 10 log10 [ 255² / ( (1 / MN) Σ_{i,j=1}^{M,N} (OriginalFrame(i, j) − CompensatedFrame(i, j))² ) ]    (6)

IV. RESULTS AND DISCUSSIONS

The proposed method (PHGSC) has been tested on standard video, i.e., the Caltrain and lecture-based video sequences. Video sequences with a distance of two frames between the current frame and the reference frame are used to generate frame-by-frame results of the proposed algorithm. To compare the efficiency of the proposed algorithm against existing algorithms, the algorithms are executed in MATLAB on an HP workstation with a 3.0 GHz CPU and 2 GB RAM. The performance of PHGSC is compared with that of the other methods [2][3][4][5][6] and the results are presented in Table 1 and Table 2. PHGSC is faster than the published methods, and its PSNR is close to that of the published methods, as shown in Table 3. PHGSC saves 93.36 to 6.06 percent of computational time with a PSNR gain of -0.1573 to +1.7441. In the suggested method, zero motion is stored directly; this zero-motion-vector implementation of PHGSC saves computational time while maintaining accuracy.

V. CONCLUSION

This paper presents a new hybrid soft computing technique known as PHGSC. The proposed technique is used for motion estimation in video. As compared to ES, PHGSC gives a PSNR lower by 0.1573 and 0.0189 for the Caltrain and lecture-based video sequences respectively.
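The tumble of Step 4(c)-(d) moves each bacterium by a fixed step C(i) along a random unit direction. A minimal sketch of that update follows (an illustration, not the authors' MATLAB code; the two-dimensional search space in the usage line is an assumption):

```python
import math
import random

def tumble(theta, step_size, rng=random):
    # Step 4(c): random vector Delta(i) with each component drawn on [-1, 1].
    delta = [rng.uniform(-1.0, 1.0) for _ in theta]
    # Step 4(d): theta(i, j+1, k) = theta(i, j, k) + C(i) * Delta / sqrt(Delta' Delta)
    norm = math.sqrt(sum(d * d for d in delta))
    return [t + step_size * d / norm for t, d in zip(theta, delta)]

# One chemotactic move of a bacterium at (0, 0) with step size C(i) = 2.
new_position = tumble([0.0, 0.0], 2.0)
```

Whatever direction is drawn, each tumble moves the bacterium exactly C(i) away from its previous position, because the random vector is normalized to unit length.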
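The two figures of merit, MAD from Eq. (5) and PSNR from Eq. (6), can be sketched in plain Python for 8-bit frames (a minimal illustration, not the authors' implementation):

```python
import math

def mad(current_block, reference_block):
    # Eq. (5): mean absolute difference over an M x N block.
    m, n = len(current_block), len(current_block[0])
    total = sum(abs(current_block[i][j] - reference_block[i][j])
                for i in range(m) for j in range(n))
    return total / (m * n)

def psnr(original_frame, compensated_frame):
    # Eq. (6): peak signal-to-noise ratio in dB; the peak value is 255
    # for 8-bit video.
    m, n = len(original_frame), len(original_frame[0])
    mse = sum((original_frame[i][j] - compensated_frame[i][j]) ** 2
              for i in range(m) for j in range(n)) / (m * n)
    return 10.0 * math.log10(255.0 ** 2 / mse)
```

In a block-matching search, the candidate block with the smallest MAD is chosen as the match, and the PSNR of the compensated frame summarizes the overall reconstruction quality.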
TABLE I. COMPARISON OF MEAN PSNR OF THE PROPOSED METHOD AND EXISTING METHODS
TABLE II. COMPARISON OF COMPUTATIONAL TIME IN SECONDS OF THE PROPOSED METHOD AND EXISTING METHODS
TABLE III. COMPUTATIONAL TIME SAVED AND PSNR GAIN BY THE PROPOSED METHOD OVER EXISTING METHODS
The PHGSC saves 93.36 to 6.06 percent of computational time with a PSNR gain of 1.7441 to 0.1726 over existing methods. The results show a promising improvement in accuracy while drastically reducing the computational time. The code developed is general in nature and proves to be a useful tool for motion estimation.

REFERENCES

[1] Liu and K. M. Passino, "Biomimicry of Social Foraging Bacteria for Distributed Optimization: Models, Principles, and Emergent Behaviors", Journal of Optimization Theory and Applications, vol. 115, no. 3, December 2002, pp. 603-628.
[2] Jianhua Lu and Ming L. Liou, "A Simple and Efficient Search Algorithm for Block Matching Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 7, no. 2, April 1997, pp. 429-433.
[3] Renxiang Li, Bing Zeng and Ming L. Liou, "A New Three-Step Search Algorithm for Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 4, no. 4, August 1994, pp. 438-442.
[4] Lai-Man Po and Wing-Chung Ma, "A Novel Four-Step Search Algorithm for Fast Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 6, no. 3, June 1996, pp. 313-317.
[5] Shan Zhu and Kai-Kuang Ma, "A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation", IEEE Trans. Image Processing, vol. 9, no. 2, February 2000, pp. 287-290.
[6] Yao Nie and Kai-Kuang Ma, "Adaptive Rood Pattern Search for Fast Block-Matching Motion Estimation", IEEE Trans. Image Processing, vol. 11, no. 12, December 2002, pp. 1442-1448.
[7] Chun-Ho Cheung and Lai-Man Po, "A Novel Cross Diamond Search Algorithm for Fast Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 12, no. 12, December 2002, pp. 1168-1177.
[8] C. W. Lam, L. M. Po and C. H. Cheung, "A New Cross-Diamond Search Algorithm for Fast Block Matching Motion Estimation", 2003 IEEE International Conference on Neural Networks and Signal Processing, Nanjing, China, December 2003, pp. 1262-1265.
[9] Humaira Nisar and Tae-Sun Choi, "An Adaptive Block Motion Estimation Algorithm Based on Spatio-Temporal Correlation", Digest of Technical Papers, International Conference on Consumer Electronics, January 7-11, 2006, pp. 393-394.
[10] Viet-Anh Nguyen and Yap-Peng Tan, "Efficient Block-Matching Motion Estimation Based on Integral Frame Attributes", IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 2, March 2006, pp. 375-385.
[11] Jiancong Luo, Ishfaq Ahmad, Yong Fang Liang and Viswanathan Swaminathan, "Motion Estimation for Content Adaptive Video Compression", IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 7, July 2008, pp. 900-909.
[12] Chun-Man Mak, Chi-Keung Fong and Wai-Kuen Cham, "Fast Motion Estimation for H.264/AVC in Walsh-Hadamard Domain", IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 6, June 2008, pp. 735-745.
[13] Shen Li, Weipu Xu, Nanning Zheng and Hui Wang, "A Novel Fast Motion Estimation Method Based on Genetic Algorithm", Acta Electronica Sinica, vol. 28, no. 6, June 2000, pp. 114-117.
[14] K. M. Bakwad, S. S. Pattnaik, B. S. Sohi, Swapna Devi and M. R. Lohakare, "Parallel Bacterial Foraging Optimization for Video Compression", International Journal of Recent Trends in Engineering (Computer Science), vol. 1, no. 1, June 2009, pp. 118-122.
[15] M. Senthil Arumugam, M. V. C. Rao and Aarthi Chandramohan, "A New and Improved Version of Particle Swarm Optimization Algorithm with Global-Local Best Parameters", Journal of Knowledge and Information Systems (KAIS), Springer, vol. 16, no. 3, 2008, pp. 324-350.
[16] K. M. Bakwad, S. S. Pattnaik, B. S. Sohi, Swapna Devi, Ch. Vidya Sagar, P. K. Patra and Sastry V. R. S. Gollapudi, "Small Population Based Modified Parallel Particle Swarm Optimization for Motion Estimation", Proc. IEEE Sixteenth International Conference on Advanced Computing and Communication (ADCOM-08), Anna University, Chennai, India, December 17, 2008, pp. 367-373.
Texture Classification using Local Texture Patterns:
A Fuzzy Logic Approach
E.M. Srinivasan
Department of Electronics and Communication Engineering, Government Polytechnic College,
Nagercoil, Tamilnadu, India.
emsvasan@yahoo.com
A. Suruliandi
Department of CSE, M S University,
Tirunelveli, Tamilnadu, India.
suruliandi@yahoo.com

K. Ramar
Department of CSE, National Engineering College,
Kovilpatti, Tamilnadu, India.
kramar_nec@rediffmail.com
Abstract— Texture analysis plays a vital role in image processing. The prospect of texture based image analysis depends on the texture features and the texture model. This paper presents a new texture model, 'Fuzzy Local Texture Patterns (FLTP)', together with the 'Fuzzy Pattern Spectrum (FPS)'. The local image texture is described by FLTP and the global image texture is described by FPS, which is the occurrence frequency of FLTP over the entire image. The efficiency of the proposed texture model is tested with texture classification. The results show that the proposed method provides a very good and robust performance.

Keywords- Texture Analysis, Texture Classification, Local Texture Patterns, Fuzzy Local Texture Patterns, Fuzzy Pattern Spectrum.

I. INTRODUCTION

Numerous texture modeling techniques have been developed by many researchers. Each method is superior in discriminating particular texture characteristics, but there is no obvious texture modeling technique common to all texture images. A comparative study of various texture analysis methods can be found in [5, 9].

A. Motivation and Justification for the Proposed Approach

Barcelo et al. [1] proposed a texture characterization approach, 'Fuzzy Texture Spectrum (FTS)', which is based on the texture model 'Texture Spectrum (TS)' introduced by He and Wang [3, 4, 8]. In the FTS approach, fuzzy logic and fuzzy techniques are included in the TS texture model with due consideration of the uncertainties introduced by noise and by different caption and digitization processes. In this representation scheme, the spectrum requires a total of 6561 bins. For real textures, the FTS method gives a better representation. Moreover, the FTS method provides superior discrimination between textured regions and homogeneous regions.

Recently, Suruliandi and Ramar [7] proposed a new texture modeling approach, 'Local Texture Patterns (LTP)'. They describe the local image texture by LTP and the global image texture by the 'Pattern Spectrum (PS)', which is the occurrence frequency of LTP over the whole image. In the LTP model, the pattern associated with a local texture region of size 3 x 3 is uniform or non-uniform, based on the gray-level difference between the central pixel and its neighbors as well as a uniformity measure computed by a specific rotation scheme. In this approach, the total number of patterns, and hence the number of bins in the histogram, is only 46. The LTP operator is computationally simple and robust against gray-scale and rotational variations.

As noted for the FTU model, if fuzzy techniques are used, there is a significant improvement in texture characterization. However, the number of bins required for the FTS model is 6561. This large number of bins brings out the local texture information in more detail, but at the same time, as the number of bins increases, the computational time complexity of texture analysis also increases. In the case of the LTP model, the total number of bins required is only 46 and hence the model is computationally efficient for texture analysis. Thus, it is realised that the fuzzy techniques used in the FTU model may be combined with the LTP model for a progressive approach. Hence, in this paper, it is proposed to introduce a new texture model that incorporates the advantages of both methods.

B. Outline of the Proposed Work

In this paper, a new texture analysis operator, 'Fuzzy Local Texture Patterns (FLTP)', is proposed. The local image texture is described by FLTP and the global image texture is described by the 'Fuzzy Pattern Spectrum (FPS)', which is the occurrence frequency of FLTP over the entire image. The performance of the proposed approach is demonstrated with texture classification.

C. Organization of the Paper

This paper is organized as follows. Section II describes the LTP texture model. Section III describes the proposed FLTP texture model. Section IV includes experiments conducted on texture classification of Brodatz [2] images and a comparative analysis of various texture models based on classification performance. Section V concludes the work.
II. LOCAL TEXTURE PATTERNS (LTP) AND PATTERN SPECTRUM (PS)

A. Local Image Texture Description by LTP

Let gc be the central pixel value and g1, g2, ..., g8 be its neighbor pixel values in a 3 x 3 local region. Let the 'Pattern Unit' P between gc and its neighbors gi (i = 1, 2, ..., 8) be defined as

    P(gi, gc) = 0 if gi < (gc - Δg)
                1 if (gc - Δg) ≤ gi ≤ (gc + Δg)      i = 1, 2, ..., 8    (1)
                9 if gi > (gc + Δg)

where Δg is a positive value that represents the gray-value tolerance and has its importance in forming the patterns. P can be assigned one of three distinct values: 0, 1 and 9. There are eight P values for a local region, and a Pattern Units matrix is filled with these values. The method of calculating the P values in a 3 x 3 local region and the formation of the Pattern Units matrix and the Pattern String are shown in Figure 1.

    128 115 118        1 0 0
    122 125 140        1 _ 9
    135 130 133        9 9 9

       (a)               (b)

    0 0 9 9 9 9 1 1
         (c)

Figure 1. (a) 3 x 3 local region (b) Pattern Units matrix for Δg = 4 (c) 'Pattern String'

A 'Uniformity' measure U, which corresponds to the number of circular spatial transitions in the 'Pattern String', is defined as

    U = s(P(g8, gc), P(g1, gc)) + Σ_{i=2}^{8} s(P(gi, gc), P(g(i-1), gc))    (2)

where

    s(X, Y) = 1 if |X - Y| > 0
              0 otherwise    (3)

Patterns with a U value of at most 3 are treated as uniform patterns and the others as non-uniform patterns. The LTP operator for describing a local texture is defined as

    LTP = Σ_{i=1}^{8} P(gi, gc) if U ≤ 3
          73 otherwise    (4)

There are 46 LTP in total. As there are a few holes in the LTP numbering scheme, they are relabeled into continuous numbers from 1 to 46 using a lookup table.

B. Global Image Texture Description by PS

The occurrence frequency of all the LTP is the PS, with the abscissa indicating the LTP and the ordinate representing its occurrence frequency. The global image texture is described with the help of the PS. The spectrum uses the LTP defined earlier as the measure to describe the global texture.

III. PROPOSED TEXTURE MODEL

In this section, the FLTP and FPS texture modeling approach is proposed. The proposed method borrows some of the basic principles of the LTP method and the FTS method.

A. Fuzzy Local Texture Patterns - FLTP

It is noted from Figure 1(b) that the Pattern Units matrix is filled with unique P values (0, 1 or 9). The P values simply represent the relationship between the central pixel and its neighbors within a small 3 x 3 pixel image region. To represent the same in a more flexible way, each cell of the Pattern Units matrix can be assigned three membership values. Without loss of generality of FTU and LTP, the membership values are directly associated with the degree to which the neighbor pixel is smaller than (0), equal to (1) or greater than (9) the centre pixel.

In a 3 x 3 local image region, let gc be the value of the central pixel and gi (i = 1, 2, ..., 8) be the values of its neighbor pixels. With the assumption Δg = 0 in (1), let the difference between gc and gi be xi (i = 1, 2, ..., 8). Let µ0(xi), µ1(xi) and µ9(xi) be the membership degrees for the values 0, 1 and 9 of xi respectively. The 'Fuzzy Pattern Unit (FP)' value between gc and its neighbors gi (i = 1, 2, ..., 8) is defined as

    FP(gc, gi) = (µ0(xi)/0, µ1(xi)/1, µ9(xi)/9)      i = 1, 2, ..., 8    (5)

If there is a local homogeneous region, the difference between gc and gi will be equal or almost equal to zero, so µ1(xi) will be higher while µ0(xi) and µ9(xi) will be lower. In the case of a textured region, the difference between gc and gi increases, and therefore µ1(xi) decreases while µ0(xi) and µ9(xi) increase.

Based on the above considerations, three membership functions are proposed here, arrived at from heuristic results, with parameters {a, b} which determine the boundary coordinates of xi. The membership functions are given below.

    µ0(xi) = 1 if xi ≤ -a
             -(xi + b) / (a - b) if -a < xi < -b    (6)
             0 otherwise

    µ1(xi) = 0 if abs(xi) ≥ a
             -(abs(xi) - a) / (a - b) if b < abs(xi) < a    (7)
             1 otherwise
    µ9(xi) = 1 if xi ≥ a
             µ0(-xi) if b < xi < a    (8)
             0 otherwise

The degrees to which the pixel gi is negative (smaller), zero (similar), or positive (larger) with regard to the central pixel gc are µ0(xi), µ1(xi) and µ9(xi) respectively. Hence, the FP associated with the central pixel is given by

    FP = {(µ0(x1)/0, µ1(x1)/1, µ9(x1)/9), ..., (µ0(x8)/0, µ1(x8)/1, µ9(x8)/9)}    (9)

The local region can be represented as a Fuzzy Pattern Units matrix as shown in Figure 2. The entries in the matrix are FP values, calculated using (6)-(9).

    µ0(x1)/0, µ1(x1)/1, µ9(x1)/9    µ0(x2)/0, µ1(x2)/1, µ9(x2)/9    µ0(x3)/0, µ1(x3)/1, µ9(x3)/9
    µ0(x8)/0, µ1(x8)/1, µ9(x8)/9                                    µ0(x4)/0, µ1(x4)/1, µ9(x4)/9
    µ0(x7)/0, µ1(x7)/1, µ9(x7)/9    µ0(x6)/0, µ1(x6)/1, µ9(x6)/9    µ0(x5)/0, µ1(x5)/1, µ9(x5)/9

Figure 2. Fuzzy Pattern Units matrix

From the matrix elements, the FLTP is calculated by the following procedure. Each matrix element contains three P values (0, 1 or 9) and the corresponding membership values. Using these values, a set of 'Pattern Strings (S)' is constructed:

    Sk(psi) = Pi^u                      for one non-zero membership, µu(xi) = 1
    Sk(psi) = Pi^u, Sk+1(psi) = Pi^v    for two non-zero memberships, 0 < µu(xi), µv(xi) < 1    (10)

where psi (i = 1, 2, ..., 8) is the element of S, and Pi^v means the P value of the i-th element having membership v. If the i-th matrix element contains one membership degree equal to '1', the i-th element of the string is filled with the corresponding P value. For the other non-zero membership values, there will be two strings filled with the corresponding P values at the i-th position of the strings.

We use a new mLTP operator with minor modifications. When the membership degree values are '1' in all the matrix elements, there will be only one S and one mLTP. If there are n elements in the matrix having two non-zero membership values, the total number of S and mLTP is 2^n. The degree of each mLTP is obtained by multiplying the eight corresponding membership degrees:

    µ(mLTP) = Π_{i=1}^{8} µ_psi(xi)    (12)

So, when this 3 x 3 local region is considered, the central pixel has an associated FLTP which is defined by

    FLTP = Σ_{k=1}^{K} mLTPk · µ(mLTP)k    (13)

where K is the total number of S or mLTP.

B. Fuzzy Pattern Spectrum - FPS

Using the procedure outlined in the previous section, the FLTP are calculated. These FLTP are identified as uniform or non-uniform using (2). If U ≤ 3, the pattern is taken as uniform, and otherwise as non-uniform. In some natural textures, non-uniform patterns also help in describing the texture characteristics. In this proposed approach, it is decided to have 73 uniform patterns (0 to 72) and another 73 non-uniform patterns (73 to 146). Therefore, there are a total of 146 bins in the occurrence histogram of the FPS.

IV. EXPERIMENTS AND RESULTS

A. Textures used in the Experiments

The textures used in the experiments are taken from the publicly available Brodatz benchmark database and are shown in Figure 3. The textures are Beach sand, Stone, Sand, Grass and Water. Normally, these are the images encountered in remotely sensed image analysis.

Figure 3. Brodatz images (a) Beach sand (b) Stone (c) Sand (d) Grass (e) Water.

B. Texture Similarity

Similarity between different textures is evaluated by comparing their histograms. The histograms are compared as a test of goodness-of-fit using a nonparametric statistic, the log-likelihood ratio, also known as the G-statistic [6]. The G-statistic compares the bins of two histograms and is defined as

    G = 2 [ Σ_{s,m} Σ_{i=1}^{n} fi log fi − Σ_{i=1}^{n} (Σ_{s,m} fi) log (Σ_{s,m} fi)
          − Σ_{s,m} (Σ_{i=1}^{n} fi) log (Σ_{i=1}^{n} fi) + (Σ_{s,m} Σ_{i=1}^{n} fi) log (Σ_{s,m} Σ_{i=1}^{n} fi) ]    (14)

where s is a histogram of the first image, m is a histogram of the second image, n is the total number of bins in the histogram, and fi is the frequency at bin i.

C. Classification of Brodatz Images using the Proposed Texture Model

An experiment on image classification was conducted to prove the efficiency of the proposed FLTP model. For this study, Brodatz texture images shown in Figure 3 of size
512 x 512 were taken. Each individual texture image was considered as a sample, and there were 5 samples in total. Test images were extracted from the source images, keeping each pixel of the 512 x 512 image as the center of a sample, and thus 262144 test samples were extracted irrespective of the sample size. Each test sample was compared against the model samples using (14), and the test sample was classified as the category of the model sample giving the minimum G value. The test was carried out for test samples of size W equal to 15, 30 and 45. Table 1 shows the results.

TABLE 1. CLASSIFICATION OF BRODATZ IMAGES USING THE PROPOSED FLTP METHOD

                    Classified Samples (Total: 262144 Samples)
    Texture      W    Beach sand  Stone    Sand     Grass    Water    Accuracy (%)
    Beach sand   15   259905      832      489      918      0        99.15
    Beach sand   45   262144      0        0        0        0        100
    Stone        15   3111        249738   0        9295     0        95.27
    Sand         30   0           0        262144   0        0        100
    Sand         45   0           0        262144   0        0        100
    Grass        15   121         3504     1806     256713   0        97.93
    Water        30   0           0        3958     0        258186   98.49
    Water        45   0           0        343      0        261801   99.87

D. Quantitative Analysis of Various Texture Models using Classification Accuracy

The performance of the FTS model and the LTP model was compared with the proposed FLTP model. The result of the comparison is tabulated in Table 2.

TABLE 2. CLASSIFICATION ACCURACY OF VARIOUS TEXTURE MODELS

                    Classification Accuracy (%)
    Tex. Model   Beach sand  Stone   Sand    Grass   Water   Avg
    FTU          100         99.88   100     100     99.67   99.91
    FLTP         100         98.92   100     100     99.52   99.69
    LTP          99.90       99.12   99.71   99.12   100     99.57

Using the FTS model, a classification accuracy of 99.91 percent was obtained. This is due to the fact that the FTS model has very good discriminating power.

The LTP model yields an accuracy of 99.57 percent. The strength of this model is that it is robust against gray-scale and rotational variations, which is an important criterion for real-time applications.

V. DISCUSSION AND CONCLUSION

In this paper, a new texture characterization technique based on FLTP and FTS is presented. Local patterns are identified by the FLTP method, and these patterns are used to form the FPS, which characterizes the global texture feature of the given texture image. The classification results in Table 1 show a high classification accuracy of more than 99% for Brodatz images. It is observed from Table 2 that the classification accuracy is above 99% for the FLTP model, which compares well with the other models. From the results, it is inferred that the proposed model has very good discriminatory power.

REFERENCES

[1] A. Barcelo, E. Montseny and P. Sobrevilla, "Fuzzy Texture Unit and Fuzzy Texture Spectrum for Texture Characterization", Fuzzy Sets and Systems.
[2] P. Brodatz, "Textures: A Photographic Album for Artists and Designers", Dover, New York (1966).
[3] D. C. He and L. Wang, "Texture Unit, Texture Spectrum, and Texture Analysis", IEEE Trans. on Geoscience and Remote Sensing, 28(4), 509-512 (1990).
[4] D. C. He and L. Wang, "Unsupervised Textural Classification of Images Using the Texture Spectrum", Pattern Recognition, 25(3), 247-255 (1992).
[5] T. Ojala, M. Pietikäinen and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions", Pattern Recognition, 29(1), 51-59 (1996).
[6] R. R. Sokal and F. J. Rohlf, "Introduction to Biostatistics", 2nd ed., W. H. Freeman (1987).
[7] A. Suruliandi and K. Ramar, "Local Texture Patterns - A Univariate Texture Model for Classification of Images", Proceedings of the 2008 16th International Conference on Advanced Computing and Communications (ADCOM08), 32-39 (2008).
[8] L. Wang and D. C. He, "Texture Classification using Texture Spectrum", Pattern Recognition, 23, 905-910 (1990).
[9] J. Zhang and T. Tan, "Brief Review of Invariant Texture Analysis Methods", Pattern Recognition, 35, 735-747 (2002).
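The fuzzy membership functions of Eqs. (6)-(8) can also be sketched directly; the sample parameters a = 4, b = 1 used below are illustrative assumptions, not values prescribed by the paper:

```python
def mu0(x, a, b):
    # Eq. (6): degree to which the neighbour is smaller than the centre (label 0).
    if x <= -a:
        return 1.0
    if -a < x < -b:
        return -(x + b) / (a - b)
    return 0.0

def mu1(x, a, b):
    # Eq. (7): degree to which the neighbour is similar to the centre (label 1).
    if abs(x) >= a:
        return 0.0
    if b < abs(x) < a:
        return -(abs(x) - a) / (a - b)
    return 1.0

def mu9(x, a, b):
    # Eq. (8): degree to which the neighbour is larger than the centre (label 9),
    # obtained by reflecting mu0 in the transition band b < x < a.
    if x >= a:
        return 1.0
    if b < x < a:
        return mu0(-x, a, b)
    return 0.0

def fp(x, a=4, b=1):
    # Eq. (5): Fuzzy Pattern Unit for one neighbour difference x.
    return (mu0(x, a, b), mu1(x, a, b), mu9(x, a, b))
```

A difference that falls inside the transition band contributes to two labels at once, which is exactly what produces multiple pattern strings in Eq. (10).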
Integer Sequence based Discrete Gaussian and
Reconfigurable Random Number Generator
Arulalan Rajan, H S Jamadagni
Centre for Electronics Design and Technology,
Indian Institute of Science, India
(mrarul,hsjam)@cedt.iisc.ernet.in

Ashok Rao
Dept of E & C, CIT,
Gubbi, Tumkur, India
ashokrao.mys@gmail.com
These sequences, of certain length (determined by the range in which the random numbers are needed), are pairwise convolved and their convolution plots studied. The envelope of each of these convolution plots turns out to be similar to a Gaussian. Fig. 2 shows the convolution of sequences, in which HCS denotes the Hofstadter-Conway sequence and GS denotes the Golomb sequence. The idea of generating a Gaussian distribution directly follows from the plot and is discussed in detail as follows. The indices of the sequence resulting from the convolution of two sequences are taken to be the set of random numbers that one can generate. The index typically runs from 1 to L+M-1. We take these numbers, n, from 1 to L+M-1 as the values that a random variable X can take. The probability that X = n is given by the value of the convolution at n. We explain this in detail with an example. We take S1 to be a Golomb sequence of length 51, and S2 also to be a Golomb sequence of length 50. We convolve the two sequences, and the sum of the convolution over the entire length of the result,

    Total = Σ_{n=1}^{L+M-1} S3(n), where S3 = S1 * S2    (2)

gives the total number of random numbers that can be generated, between 1 and 100. The probability that the random variable X takes an integer value n between 1 and 100 is given by S3(n), the sequence resulting from the convolution, i.e.,

    P(X = n) = S3(n) / Total    (3)

With the technique of generating Gaussian random numbers having been described in detail, we now proceed to discuss the hardware implementation of a Gaussian random number generator.

A. Hardware Implementation of the Random Number Generator

A sequence like the Fibonacci sequence [11], described by the recurrence

    a(n) = a(n-1) + a(n-2), a(1) = 0, a(2) = 1    (4)

is easy to generate in hardware as it involves only a simple recursion. However, this is not the case with most of the other sequences, which involve more than one recursion. To generate these sequences, new and simple strategies were developed. Let us look at the following sequences and discuss in detail the strategies developed for generating them:

    a(n) = 1 + a(n - a(a(n-1))), a(1) = 1, a(2) = 2    (5)
    a(n) = a(a(n-1)) + a(n - a(n-1)), a(1) = a(2) = 1    (6)

Equations (5) and (6) are used to generate the Golomb sequence [12] and the Hofstadter-Conway sequence [13] respectively. The elements generated from (5) are as follows:

    1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, ...

As seen from (5) and (6), the direct hardware implementation of the generating function is not that simple. An alternate approach to generate these sequences, based on their inherent pattern, was proposed in [14].

Having looked at the individual sequence generators, we now proceed to describe the architecture used to generate the random numbers.

Conventional random number generators take the mean and variance as inputs and then generate random values. Not deviating much from the conventional scheme, we propose to use the variance as the input. One can make the lengths of the sequences, or the sequences themselves, depend on the given variance.

On obtaining the lengths of the sequences, we can use a generator as simple as the one proposed in [14] to generate the sequence. The sequence elements are computed for half the length, and symmetry is forced on the sequence for the other half of the sequence length. Once the sequences are generated, they can be convolved using either a single multiply-accumulate (MAC) unit or multiple MAC units. The result of this convolution is taken as the probability density function for the random variable X.

The higher-level block diagram of the random number generator shown in figure 4 is then used to generate the random numbers with a Gaussian distribution. An approach similar to the one used for generating the sequence can be used here also. We obtain a pattern from the convolution and store it in the pattern information memory (PIM). The addresses of this memory are precisely the values that the random numbers can take. The addresses are generated using an LFSR, with the initial seed capable of taking the LFSR through all of its states. Once all the states have been obtained, the seed of the LFSR is changed. The LFSR is used to provide the randomness in the generator's output. The decrement and compare unit (DCU) decrements the content of the location pointed to by the LFSR. The compare logic checks whether the contents of the memory location pointed to by the LFSR are zero; in that case, the address is incremented by 1 so that the next element can be output. Since the LFSR value has to remain the same until the random value is output and the contents of the memory are decremented, the LFSR can be made to operate at half the clock rate of the memory. The control engine shown is used to decide the sequence and the length of the sequence, based on the variance input and the distribution type.

A simple modification to figure 4 results in a dynamically reconfigurable random number generator. Here the pattern information memory is segmented, holding the pattern information of multiple convolution results. The higher-order address bits can be made to identify the distribution type, and the lower-order address bits can be used as the values that the random variable can take. Based on the variance and the distributions, the sequences and their lengths can be obtained. Depending on the distribution, one can use the integer sequences directly, convolve them, or perform any other operations, followed by writing into the pattern information memory. An alternate approach to making the hardware reconfigurable is to have another memory segment where the pattern can be stored. The contents of this memory could be written into the pattern information memory shown in figure 4 on the fly, so that any distribution can be obtained using the same hardware as in figure 4.
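The pipeline described above can be sketched in software: generate the Golomb sequence with recurrence (5), convolve the two sequences as in the worked example, and sample indices with probability S3(n)/Total per Eq. (3). This is an illustration of the scheme only (the paper's own implementation is in MATLAB and hardware):

```python
import random

def golomb(length):
    # Recurrence (5): a(n) = 1 + a(n - a(a(n-1))), with a(1) = 1 (1-based).
    a = [0, 1]                       # a[0] is a dummy so indices match the paper
    for n in range(2, length + 1):
        a.append(1 + a[n - a[a[n - 1]]])
    return a[1:]

def convolve(s1, s2):
    # Plain linear convolution; the result has length L + M - 1.
    out = [0] * (len(s1) + len(s2) - 1)
    for i, x in enumerate(s1):
        for j, y in enumerate(s2):
            out[i + j] += x * y
    return out

def sample(s3, rng=random):
    # Eq. (3): P(X = n) = S3(n) / Total, for indices n = 1 .. L+M-1,
    # drawn by inverting the cumulative sum of S3.
    total = sum(s3)
    r = rng.uniform(0, total)
    acc = 0
    for n, value in enumerate(s3, start=1):
        acc += value
        if r <= acc:
            return n
    return len(s3)

# Worked example: Golomb sequences of lengths 51 and 50 give a roughly
# Gaussian-shaped distribution of integers on 1..100.
s3 = convolve(golomb(51), golomb(50))
x = sample(s3)
```

The histogram of repeated draws follows the bell-shaped envelope of the convolution, which is what the hardware PIM/LFSR/DCU arrangement realizes with counters instead of a cumulative sum.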
IV. RESULTS

Sequences listed in Table 1 were considered for random number generation. Convolution was performed with the following combinations of the sequences:
• A sequence S1 convolved with itself
• A sequence S1 convolved with another sequence S2

Without loss of generality, the lengths of the sequences were taken to be the same, in order to study the relation between length and variance. The numbers were generated in MATLAB using the proposed technique and compared with random numbers generated using the same mean and variance as those of the proposed technique. The mean squared error is also plotted.

As mentioned in section 3, figure 1 and figure 2 are the plots of the sequences for various lengths and of their convolution, respectively. Figure 3 shows the histogram plot of the convolution of the Golomb sequence with itself. The CDF plot is shown in figure 5. The variation of length (assuming that the convolved sequences have the same length, equal to L) with standard deviation and with variance is shown in figure 6 and figure 7 respectively. We find that, for a Gaussian distribution, the length of the sequence and the standard deviation have a linear relation, while there is a square relation between the length and the variance. We thus find that the length of the sequences can be made dependent on the standard deviation and hence on the variance.

In the usual Gaussian random number generators, within a given interval, the variance σ² can vary depending on the requirement, and hence the profile or shape of the Gaussian PDF varies. The analogous situation here is to keep the range the same but, for a different variance, change the lengths of the two sequences. To address this, the lengths of the two sequences can be changed from 50 each to, say, 61 and 40, or any other combination of lengths such that the sequence resulting from the convolution has a length of 100. This is illustrated in figure 8.

Figure 9 compares the estimated PDF with the obtained PDF. Figure 10 gives the mean square error plot of the comparison. We find that the mean square error is of the order of 10^-6. This clearly shows that the technique proposed in this paper is efficient for generating random integers following a Gaussian distribution.

V. CONCLUSION

The technique of using integer sequences for random number generation has been proposed in this paper. It has also been shown that the convolution of a certain family of integer sequences can be used to generate a discrete Gaussian random variable. The variance of the Gaussian random variable is made to influence the choice of the integer sequences and the lengths of the sequences to be convolved. A simple architecture to generate Gaussian random numbers is also presented in this paper. It has also been illustrated that a slight modification to the proposed architecture makes it dynamically reconfigurable. We propose to explore the use of integer sequences in generating extreme-valued probability distribution functions.

VI. REFERENCES

[1] D. E. Knuth, "Seminumerical Algorithms - The Art of Computer Programming", Vol. 2, 3rd ed., Addison-Wesley, USA, 1998.
[2] Solomon W. Golomb, "Shift Register Sequences", Aegean Park Press, 1981.
[3] Xilinx, "Additive White Gaussian Noise (AWGN) Core", CoreGen documentation file, 2002.
[4] D. Thomas, P. Leong and J. Villasenor, "Gaussian Random Number Generators", ACM Computing Surveys, vol. 39, no. 4, article 11, October 2007.
[5] G. E. P. Box and M. E. Muller, "A Note on the Generation of Random Normal Deviates", The Annals of Mathematical Statistics, vol. 29, 1958, pp. 610-611.
[6] A. Alimohammad, S. F. Fard, Bruce F. Cockburn and C. Schlegel, "A Compact and Accurate Gaussian Variate Generator", IEEE Trans. on VLSI Systems, vol. 16, no. 5, 2008, pp. 517-527.
[7] G. Marsaglia and T. A. Bray, "A Convenient Method for Generating Normal Variables", SIAM Review, vol. 6, no. 3, 1964, pp. 260-264.
[8] G. Zhang, P. H. W. Leong, D. Lee, J. D. Villasenor, R. C. C. Cheung and W. Luk, "Ziggurat-based Hardware Gaussian Random Number Generator", IEEE Intl. Conference on Field Programmable Logic and its Applications, 2005, pp. 275-280.
[9] T. Lindeberg, "Scale-Space for Discrete Signals", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, no. 3, March 1990, pp. 234-254.
[10] N. J. A. Sloane, "The On-Line Encyclopedia of Integer Sequences", www.research.att.com/~njas/sequences/
[11] www.research.att.com/~njas/sequences/A000045
[12] www.research.att.com/~njas/sequences/A001462
[13] www.research.att.com/~njas/sequences/A004001
[14] A. Rajan, H. S. Jamadagni and A. Rao, "Integer Sequence Window based Reconfigurable FIR Filters", Proc. of the First IEEE Intl. Workshop on Reconfigurable Computing, Dec 2008, India, http://ewh.ieee.org/conf/hprcw/rcw08.html

Table 1. List of Integer Sequences from [10]

    Sequence   Generating Function                       Offset
    A001462    a[n] = 1 + a[n - a[a[n-1]]]               a[1] = 1, a[2] = 2
    A004001    a[n] = a[a[n-1]] + a[n - a[n-1]]          a[1] = a[2] = 1
    A113886    a[n] = a[a[n-2]] + a[n - a[n-1]]          a[1] = a[2] = 1
    A005229    a[n] = a[a[n-2]] + a[n - a[n-2]]          a[1] = a[2] = 1
    A098378    a[n] = a[a[a[n-1]]] + a[n - a[a[n-1]]]    a[1] = a[2] = 1
    A006158    a[n] = a[a[n-3]] + a[n - a[n-3]]          a[1] = a[2] = 1
    A006161    a[n] = a[a[n-1] - 1]                      a[1] = a[2] = 1
this architecture yields a reconfigurable random number + a[n+1-a[n-1]
generator, based on different distributions. The mean square
error plot shows that the error is very less. In future, we
Figure 3. Histogram plot for random variable obtained from Golomb sequence convolution
Figure 4. Random Number Generator
Figure 5. Estimated CDF for random variable obtained from Golomb sequence convolution
Figure 6. Standard Deviation vs Length plot for Golomb sequence convolution based distribution
Figure 7. Variance vs Sequence Length
Figure 8. Profile variation with change in lengths for the same range between 1 and 100, but different variance
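The recurrences in Table 1 can be checked directly. The sketch below (Python, written for illustration here; it is not the paper's hardware architecture) evaluates the Golomb sequence A001462 and the sequence A004001 from the generating functions listed in Table 1, then forms the self-convolution used for the histogram of Figure 3. Convolving sequences of lengths L1 and L2 yields a sequence of length L1 + L2 − 1, which is the length/variance coupling the text exploits (e.g. 61 + 40 − 1 = 100).

```python
def golomb(n):
    """A001462 from Table 1: a[n] = 1 + a[n - a[a[n-1]]], a[1] = 1, a[2] = 2."""
    a = [0, 1, 2]  # 1-indexed; a[0] is unused padding
    for k in range(3, n + 1):
        a.append(1 + a[k - a[a[k - 1]]])
    return a[1:]

def conway(n):
    """A004001 from Table 1: a[n] = a[a[n-1]] + a[n - a[n-1]], a[1] = a[2] = 1."""
    a = [0, 1, 1]
    for k in range(3, n + 1):
        a.append(a[a[k - 1]] + a[k - a[k - 1]])
    return a[1:]

def convolve(x, y):
    """Linear convolution; len(result) == len(x) + len(y) - 1."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

g = golomb(50)
c = convolve(g, g)                 # bell-shaped profile, cf. Figure 3
pmf = [v / sum(c) for v in c]      # normalised, usable as a discrete PMF
```

Changing the two input lengths while keeping their sum fixed changes the spread of the resulting profile without changing its support, which is the reconfiguration the paper describes.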
Parallelization of PageRank and HITS
Algorithm on CUDA Architecture
Kumar Ishan, Mohit Gupta, Naresh Kumar, Ankush Mittal
Department of Electronics & Computer Engineering,
Indian Institute of Technology, Roorkee, India.
{kicomuec, mickyuec, naresuec, ankumfec}@iitr.ernet.in
Abstract
Efficiency of any search engine mostly depends on how efficiently and precisely it can determine the importance and popularity of a web document. The PageRank algorithm and the HITS algorithm are widely known approaches to determining the importance and popularity of web pages. Due to the large number of documents available on the World Wide Web, a huge amount of computation is required to determine the rank of web pages, making ranking very time consuming. Researchers have devoted much attention to parallelizing PageRank on PC clusters, grids, and multi-core processors like the Cell Broadband Engine to overcome this issue, but with little or no success. In this paper, we discuss the issues in porting these algorithms to the Compute Unified Device Architecture (CUDA) and introduce an efficient parallel implementation of these algorithms on CUDA by exploiting the block structure of the web, which not only cuts down the computation time but also significantly reduces the cost of the hardware required.

1. INTRODUCTION
In present days, the unceasing growth of the World Wide Web has led to a lot of research in the page ranking algorithms used by search engines to provide the most relevant results to the user for any particular query. The dynamic and diverse nature of the web graph further exacerbates the challenges in achieving optimum results. Web link analysis provides a way to order web pages by studying the link structure of web graphs. PageRank and HITS (Hyperlink-Induced Topic Search) are two such popular algorithms used by some current search engines, in the same or modified form, to rank documents based on their link structure.
PageRank, originally introduced by Brin and Page [1], is based on the fact that a web page is more important if many other web pages link to it. At its core, it involves continuously iterating over the web graph until the rank assigned to all of the pages converges to a stable value. In contrast to PageRank, the similar HITS algorithm, developed by Kleinberg [2], ranks documents on the basis of two scores which it assigns to a particular set of documents dependent on a specific query, although the basis for computation is the same for both. This paper addresses issues related to the parallel implementation of these algorithms and proposes an innovative way of exploiting the block structure of the web existing at a much lower level. Our approach to the parallel implementation of these algorithms on NVIDIA's multi-core CUDA architecture not only reduces the computation time but also requires much cheaper hardware.

2. PAGERANK
2.1. Algorithm
Let Rank(p) denote the rank of web page p from the set of all web pages P. Let S_p be the set of all web pages that point to page p and N_u be the outdegree of the page u ∈ S_p. Then the "importance" given by a page u to the page p due to its link is measured as Rank(u)/N_u, so the total "importance" given to a page p is the sum of all the "importance" due to incoming links to p. This is computed iteratively n times for each page rank to converge. The iteration is as follows:

    ∀p ∈ P,   Rank_i(p) = (1 − d) + d · Σ_{u ∈ S_p} Rank_{i−1}(u) / N_u    … (1)

where d is the "damping factor" from the random surfer model [1]. We will be using 0.85 as the value of d in the rest of this paper, as given in [1]. The use of d ensures the convergence of the PageRank algorithm [5].

2.2. Related Works
Since PageRank involves a huge amount of computation, many researchers have attempted their own approaches to its parallel implementation. Haveliwala et al. [6] exploit the block structure of the web for computing PageRank. Rungsawang and Manaskasemsak used a PC cluster to compute PageRank [3], dividing the input graph into equal sets and computing them on the cluster nodes. Rungsawang and Manaskasemsak also implemented a partition-based parallel PageRank algorithm [4] on a PC cluster. Another efficient parallel implementation of PageRank on a PC cluster, by Kohlschutter et al. [5], achieves a gain of 10 times by using the block structure of the web and reformulating the algorithm by combining the Jacobi and Gauss-Seidel methods. The implementation [9] on the 8-SPU multi-core Cell BE has shown that the PageRank algorithm runs 22 times slower.

2.3. CUDA Architecture
CUDA™, introduced by NVIDIA, is a general-purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. These GPUs are used as coprocessors to assist the CPU in computationally intensive tasks. More details about this architecture can be explored in [11]. Here, we highlight the features of the CUDA architecture that need special mention in relation to our work:
1. SIMT architecture
2. Asynchronous concurrent execution
3. Warps
4. Memory coalescing

2.4. Porting Issues of Parallel Implementation on CUDA Architecture
Porting issues with the PageRank algorithm are mainly concerned with hardware restrictions of the CUDA architecture. Some important issues are as follows:

1) Non-coalesced memory access: CUDA has some constraints related to memory accesses. The protocol followed by the CUDA architecture for memory transactions ensures that all the threads referencing memory in the same segment are serviced simultaneously. Therefore, bandwidth is used most efficiently only if simultaneous memory accesses by threads in a half warp belong to one segment. But due to the uneven and random nature of the indegrees of nodes, memory references sometimes become non-coalesced, hindering the simultaneous service of memory transactions and wasting memory bandwidth.
Solution: Nodes generally link within their locality, with few links to farther nodes. To improve the rank calculation of a node, say p, we process on the kernel only those nodes which belong to the locality of p, determined by the lower and upper limits, and the rest of the nodes are processed on the host processor. So we create two link-structured input files: one to be processed by the kernel, containing the nodes lying in the locality, and the other containing the rest of the nodes, to be processed on the host processor.

2) Divergence in control flow: CUDA demands the execution path of all threads in a warp to be similar for the threads to execute in parallel; hence, divergent execution paths of threads in a warp cause CUDA to suspend the parallel execution of threads and execute them sequentially (serialization), decreasing throughput. As the indegrees of nodes can be very dissimilar, the calculation loop, which iterates over the number of indegrees, can make a thread's control flow diverge from that of other threads.
Solution: To solve this problem, we tried to exploit the block structure [8] of the web. A careful study reveals that block structure exists even at a smaller level. So we divide all the nodes into blocks and calculate the average for each block separately. Then the rest of the nodes are added to the link-structured input file for the host. When blocks are scheduled on the device's multiprocessors, all threads in a warp follow a similar execution flow to a greater extent.
The number of calculations on the host can be further decreased if we use some constant multiple (the average factor) of the average value. This constant for peak performance is different for different input graphs, depending on the distribution of the indegrees among the nodes.

2.5. Results and Observations
We used four different parameters for the experiments: block size, average factor, and the lower and upper limits for the range of locality. Increasing the range limits beyond a few thousand either decreases performance or leaves it unchanged, and the block sizes giving a reasonable increase in performance are 32 and 64: for smaller block sizes the number of threads executing in parallel is too small, and with larger block sizes threads become more divergent. As the average values for smaller block sizes are very high, increasing the average factor decreases performance, while with larger blocks increasing the average factor improves performance. Using suitable parameters based on the above discussion, we achieved some promising results.

3. HITS
3.1. Algorithm
The HITS algorithm also ranks web pages on the basis of the link structure of the web. For this purpose, it assigns two scores to a web page, namely an Authority Score and a Hub Score. A higher Authority Score means that the given web page is linked to by many documents with high Hub Scores, and a higher Hub Score means that the given document points to many documents with high Authority Scores. In contrast to PageRank, this algorithm is query dependent, and both scores are assigned at run time depending on the query.
For a given query, a relatively small set of relevant documents, the Root Set (R), is retrieved from the web, generally on the basis of the occurrences of the words of the query (Q) or TF-IDF. Then from the Root Set, a Complete Set (C) is formed by including all the documents which either point to at least one document in the Root Set or are pointed to by at least one document in the Root Set. Finally, the scores are assigned to all the documents in the Complete Set in a number of iterations. In [7], it is shown that the scores generally converge in 5-10 iterations.
The Authority Value A_i is the sum of the Hub Scores H_j of the documents pointing towards document i, and the Hub Value H_i is defined as the sum of the Authority Scores A_j of the documents it points to. Since the Root Set and Complete Set, and therefore the Authority Score and Hub Score, are calculated at the time of the query, both
these scores are query specific. This makes the algorithm quite slow, making it infeasible for use in real-life situations. Since the third step is the most time consuming, we present our algorithm to make this part run faster on the CUDA architecture.

3.2. Parallel Implementation
Since the computations involved in this algorithm are similar to PageRank, it follows the model of implementation discussed in section 2.4. As CUDA allows asynchronous concurrent execution, control returns to the host before the device completes its task, leading to parallel execution of code between the host (CPU) and the device (GPU). So, in order to minimize the computation time, the task of calculating each score is uniformly divided between host and device such that they both take approximately the same time to compute their part of the job.

3.3. Results of Parallel Implementation of HITS
Since the computations in the HITS algorithm are similar to PageRank, following the same model of implementation on a set of 300,000 nodes generated using WebGraph [12], we achieved a significant gain on the CUDA architecture as compared to the CPU.

4. CONCLUSION

REFERENCES
[1] S. Brin and L. Page, "The Anatomy of a Large Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, Volume 30, Issue 1-7, April 1998.
[2] J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, Volume 46, Issue 5, September 1999.
[3] A. Rungsawang and B. Manaskasemsak, "PageRank Computation Using PC Cluster", Proceedings of the 10th European PVM/MPI User's Group Meeting, Venice, Italy, 29 Sep – 2 Oct 2003.
[4] A. Rungsawang and B. Manaskasemsak, "Partition-Based Parallel PageRank Algorithm", Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), Sydney, 4–7 July 2005.
[5] C. Kohlschutter, P. Chirita, and W. Nejdl, "Efficient Parallel Computation of PageRank", Proceedings of the 28th European Conference on Information Retrieval (ECIR), London, United Kingdom, 2006.
[6] S. Kamvar, T. H. Haveliwala, C. D. Manning, G. H. Golub, "Exploiting the Block Structure of the Web for Computing PageRank", Technical Report SCCM-03-02, Computer Science Department, Stanford University, 2003.
[7] Y. G. Saffar, K. S. Esmaili, M. Ghodsi, and H. Abolhassani, "Parallel Online Ranking of Web Pages", The 4th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA-06), UAE, March 2006, pp. 104–109.
[8] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin, "PageRank Computation and the Structure of the Web: Experiments and Algorithms", in Proceedings of the 11th World Wide Web Conference, poster track, Honolulu, Hawaii, 7–11 May 2002.
[9] G. Buehrer, S. Parthasarathy, and M. Goyder, "Data Mining on the Cell Broadband Engine", Proceedings of ICS'08, Cairo, Egypt, 20–24 October 2008.
[10] S. Nomura, S. Oyama, T. Hayamizu, and T. Ishida, "Analysis and Improvement of HITS Algorithm for Detecting Web Communities".
[11] NVIDIA CUDA Programming Guide 2.2, NVIDIA Corporation.
[12] WebGraph Laboratory, http://webgraph.dsi.unimi.it/, 2006.
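For reference, the update of Eq. (1) and the Hub/Authority updates of Section 3.1 can be written as a small sequential sketch. The Python below is illustrative only (plain CPU loops on a tiny hypothetical four-page graph), not the CUDA implementation described in this paper; `links[u]` lists the pages that u points to, and d = 0.85 as in [1].

```python
def pagerank(links, n_iter=50, d=0.85):
    """Eq. (1): Rank_i(p) = (1 - d) + d * sum_{u in S_p} Rank_{i-1}(u) / N_u."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        rank = {p: (1 - d) + d * sum(rank[u] / len(links[u])
                                     for u in pages if p in links[u])
                for p in pages}
    return rank

def hits(links, n_iter=10):
    """Section 3.1: authority(i) = sum of hub scores of pages linking to i;
    hub(i) = sum of authority scores of the pages i links to."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        auth = {p: sum(hub[u] for u in pages if p in links[u]) for p in pages}
        hub = {p: sum(auth[v] for v in links[p]) for p in pages}
        # normalise both score vectors so they converge instead of growing
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / na for p, a in auth.items()}
        hub = {p: h / nh for p, h in hub.items()}
    return auth, hub

# hypothetical graph: pages b, c, d all link to a; a links back to b
links = {'a': ['b'], 'b': ['a'], 'c': ['a'], 'd': ['a']}
ranks = pagerank(links)  # 'a', with the most in-links, gets the highest rank
```

The O(n²) in-link scans here are exactly the per-node loops over indegrees that, on the GPU, cause the non-coalesced accesses and warp divergence discussed in section 2.4.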
Designing Application Specific Irregular Topology
for Network-on-Chip
Naveen Choudhary
Department of Computer Science & Engineering
College of Engineering and Technology, Udaipur, India
naveenc121@yahoo.com

M.S. Gaur, V. Laxmi
Department of Computer Engineering
Malaviya National Institute of Technology, Jaipur, India
{gaurms|vlaxmi}@mnit.ac.in

Virendra Singh
SERC, Indian Institute of Science, Bangalore, India
viren@serc.iisc.ernet.in
process. Based on the constructed minimum spanning tree and using Dijkstra's shortest path algorithm, the routing table entries for the routers of the NoC are generated for each edge in the core graph. At this stage the traffic load on these tree paths is assigned according to the bandwidth requirements in the core graph: the basic tree path for (υi, υj) in the NoC topology graph is assigned the traffic load bw_ij, and similarly the edges of the path (υi, υj) are assigned a traffic load equal to the summation of their previously assigned traffic load and bw_ij. In the next phase of the methodology, a genetic algorithm based heuristic is used for the design of the customized irregular NoC. The proposed genetic algorithm explores the search space to generate an irregular topology with optimized bandwidth load distribution and improved energy requirements.

A. Solution Representation
Each chromosome is represented by an array of genes, with the maximum size of the gene array equal to the number of edges in the core graph. Similarly, each gene contains the information regarding the various possible paths in the NoC topology graph between the <source(υi), destination(υj)> pair. A gene is only permitted to have a maximum of n (a configurable parameter) paths; among these n paths at least one is the shortest path through edges exclusively of the minimum spanning tree, and the rest of the paths are generated by adding shortcuts.

Figure 1. Network construction flow using genetic algorithm

3) Energy-Reduction-Mutation: This mutation is done on a randomly selected chromosome, with a bias towards the best class of the population in each generation. In this mutation, each path of every gene of the chromosome is traversed and we try to find a replacement shorter path by adding a suitable shortcut.

C. Crossover Operator
To achieve crossover, two chromosomes and a random crossover point are selected, and then the genes of these chromosomes are mixed over the crossover point to produce two new chromosomes. A new chromosome is accepted only if it leads to an improvement in the cost.
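The crossover step just described can be sketched as follows. This Python fragment is illustrative only: the list-of-numbers chromosome and the `total_load` cost function are hypothetical stand-ins for the paper's gene arrays and its bandwidth/energy cost.

```python
import random

def crossover(parent_a, parent_b, cost):
    """Single-point crossover (Section C): pick a random crossover point,
    mix the parents' genes over it, and keep a child only if it improves
    on the better parent's cost."""
    point = random.randrange(1, len(parent_a))           # random crossover point
    child_1 = parent_a[:point] + parent_b[point:]        # mix genes over the point
    child_2 = parent_b[:point] + parent_a[point:]
    best_parent = min(cost(parent_a), cost(parent_b))
    return [c for c in (child_1, child_2) if cost(c) < best_parent]

# toy stand-in: a chromosome is a list of per-gene path loads, and the
# cost to minimise is the total load (both hypothetical)
total_load = lambda chromosome: sum(chromosome)
accepted = crossover([4, 7, 3, 9], [5, 2, 8, 1], total_load)
```

Rejecting children that do not improve the cost makes the operator greedy, which matches the paper's acceptance rule at the price of reduced population diversity.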
extended version of NIRGAM [13] for supporting irregular NoC with table based routing. For performance comparison, IrNIRGAM was run for 10000 clock cycles, and network throughput in flits and average flit latency were used as the parameters for comparison.

Figure 2. Average performance comparison of IrNoC with 2-D Mesh topology with X-Y and OE routing

The proposed genetic algorithm was run for 1000 generations with a population size of 200 to obtain the customized irregular topology. Mutations are done on 15% of the population and crossover on 30% of the population in each generation. During optimization the maximum channel length was set to twice the length of a tile. Figure 2 summarizes the performance results averaged over 50 generated irregular topologies (IrNoC) with a permitted node/core degree of 6, with the number of cores varying between 16 and 81, against a 2D-mesh with an equal number of cores using X-Y [9] and OE [9] routing. For IrNoC, table based routing was used. Figure 2 shows that optimized IrNoCs sustain a higher throughput and lower transmission latency in all cases. IrNoC with a permitted node degree of 6 achieves 19.4% and 32% more throughput on average, with a decrease in average flit latency of 15.2 and 60.3 clock cycles, in comparison to the corresponding 2-D Mesh with X-Y and OE routing respectively. Similar tests on IrNoC with a permitted node degree of 4 showed gains of (7.5%, 18.9%) in throughput and (12.4 clocks, 57.45 clocks) in latency respectively. Figure 3 shows the throughput and latency comparison of IrNoC (with permitted node degree of 6) and 2-D mesh with X-Y and OE routing with varying packet injection interval in clock cycles.

V. CONCLUSION
A genetic algorithm based methodology was implemented to tailor the NoC topology to the requirements of the application captured in the core graph. In our future work, to further analyze the effectiveness of the proposed methodology, we intend to compare it with other application specific design methodologies proposed in the literature, using realistic benchmarks in addition to fine grained energy estimates, for providing multiple objective optimization frameworks.

REFERENCES
[1] W. J. Dally, B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in Proceedings of the 38th Design Automation Conference (DAC), pp. 684–689, 2001.
[2] L. Benini, G. De Micheli, "Networks on Chips: A New SoC Paradigm," IEEE Computer, Vol. 35, No. 1, pp. 70–78, January 2002.
[3] M. D. Schroeder et al., "Autonet: A High-Speed Self-Configuring Local Area Network Using Point-to-Point Links," IEEE Journal on Selected Areas in Communications, vol. 9, Oct. 1991.
[4] A. Jouraku, A. Funahashi, H. Amano, M. Koibuchi, "L-turn Routing: An Adaptive Routing in Irregular Networks," in International Conference on Parallel Processing, pp. 374–383, Sep. 2001.
[5] Y. M. Sun, C. H. Yang, Y. C. Chung, T. Y. Hang, "An Efficient Deadlock-Free Tree-Based Routing Algorithm for Irregular Wormhole-Routed Networks Based on Turn Model," in International Conference on Parallel Processing, vol. 1, pp. 343–352, Aug. 2004.
[6] U. Ogras, J. Hu, R. Marculescu, "Key Research Problems in NoC Design: A Holistic Perspective," IEEE CODES+ISSS, pp. 69–74, 2005.
[7] W. H. Ho, T. M. Pinkston, "A Methodology for Designing Efficient On-Chip Interconnects on Well-Behaved Communication Patterns," HPCA 2003, pp. 377–388, Feb 2003.
[8] J. Hu, R. Marculescu, "Energy-Aware Mapping for Tile-based NoC Architectures Under Performance Constraints," ASP-DAC 2003, Jan 2003.
[9] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, Elsevier, 2003.
[10] J. Hu, R. Marculescu, "Energy- and Performance-Aware Mapping for Regular NoC Architectures," IEEE Trans. on CAD of Integrated Circuits and Systems, 24(4), April 2005.
[11] R. P. Dick, D. L. Rhodes, W. Wolf, "TGFF: Task Graphs for Free," in Proc. Intl. Workshop on Hardware/Software Codesign, March 1998.
[12] Y. C. Chang, Y. W. Chang, G. M. Wu and S. W. Wu, "B*-Trees: A New Representation for Non-Slicing Floorplans," in Proc. 37th Design Automation Conference, pp. 458–463, 2000.
[13] Lavina Jain, B. M. Al-Hashimi, M. S. Gaur, V. Laxmi, A. Narayanan, "NIRGAM: A Simulator for NoC Interconnect Routing and Application Modelling," DATE 2007.
QoS Aware Minimally Adaptive XY routing for
NoC
Navaneeth Rameshan∗ , Mushtaq Ahmed† , M.S.Gaur‡ , Vijay Laxmi§ and Anurag Biyani¶
∗†‡§ Computer Engineering, Malaviya National Institute of Technology Jaipur, India
∗ navaneeth.rameshan@gmail.com, † mushtaq@mnit.ac.in, ‡ gaurms@mnit.ac.in, § vlaxmi@mnit.ac.in
¶ Jaypee Institute of Information technology, Noida, India
anuragbiyani@gmail.com
Abstract—Network-on-Chip (NoC) has emerged as a solution to communication handling in Systems-on-Chip design. A major design consideration is high performance of a router of small size. To achieve this objective, the routing algorithm needs to be simple as well as congestion-aware. QoS is also emerging as one of the design objectives in NoC design. Recent work has shown that deterministic routing does not fare well when traffic in the network increases [6]. An ideal routing algorithm should take congestion awareness into account. In this paper, we propose a new Quality-of-Service (QoS) aware routing algorithm which is simple to implement and adapts partially (considering only minimal paths) to traffic congestion in order to meet different QoS requirements such as Best Effort (BE) and Guaranteed Throughput (GT) [5]. Comparison of our algorithm with other routing algorithms, namely XY, Odd-Even (OE) and DyAd, suggests improved performance in terms of average delay and jitter.

I. INTRODUCTION
NoC needs to support Quality of Service (QoS), which becomes a main concern when a variety of applications share the network, as it becomes necessary to offer guarantees on performance. QoS refers to the capability of the network to provide communication services above a certain minimum value(s) of one or more performance metrics such as dedicated bandwidth, control jitter and latency [2]. To manage the allocation of resources to communication streams more judiciously and efficiently, network traffic is often divided into a number of classes when QoS is at a premium. Different classes of packets have different QoS requirements and different levels of importance. In general, network traffic can be classified into two traffic classes: (1) Guaranteed Throughput (GT) and (2) Best Effort (BE) [5].
In this paper we propose a new method, Adaptive XY Routing, for routing in NoC and compare its performance with various standard routing algorithms, viz., XY, Odd-Even [3] and DyAd, in the context of QoS in NoC. The open-source simulator NIRGAM [4] is used for comparison with the other routing algorithms.

II. XY AND ODD-EVEN ROUTING
In NoC, one of the simplest routing algorithms is XY routing. In XY routing, the path is determined solely by the addresses of the source and destination nodes in a deterministic manner, i.e., the same path is chosen for a given pair of source and destination nodes irrespective of the traffic situation in the network. In XY routing, a packet is forwarded in the X-direction until the destination and current columns become equal, and then the packet is routed in the Y-direction until the destination node is reached. XY routing is non-adaptive, which leads to bad load balancing and a lack of adaptability to congestion.
Odd-Even routing [3] is a deadlock-free, partially adaptive routing in which turning is restricted to prevent deadlock occurrence. The routing path is governed by the following rule (columns numbered from 0): for any packet, the following turns are not allowed.
1. East-to-North or East-to-South at an even column.
2. North-to-West or South-to-West at an odd column.

III. MAXY: MINIMALLY ADAPTIVE XY ROUTING
We propose a variation of the XY routing algorithm which introduces a capability to adapt to traffic, while still retaining the biggest advantage of XY routing: simplicity. We select our routing direction only from at most two path-length-reducing directions at any stage, i.e., the algorithm is minimal, and hence free from livelock issues [1]. But despite being minimal, our algorithm is adaptive, and decisions among directions are made at crucial positions en route.
We illustrate the functioning of the proposed algorithm by example. Let us say a packet is to be routed from source node (Sx, Sy) to destination node (Dx, Dy). Unlike XY routing, in which we always route the packet first in the X direction, here we route the packet with the aim of equalizing the absolute differences between the X and Y coordinates of the current and final nodes. So the first step is to route the packet in the direction which helps in equalizing these absolute differences. Once the absolute differences are equal, the buffer availability of both feasible directions is taken into account and the packet is routed to the one which has the maximum buffer availability. If both have the same buffer availability, then a random selection is made, since this gives an equal load distribution probability among directions with the same number of free output buffer channel(s). The same process (equalizing the coordinates, and using buffer availability when they are equal) is repeated till the packet reaches the destination.
For example, assume that in an 8x8 mesh we have to route a packet from (0,0) to (3,6). In this case, first we compute
the absolute differences between the current node (initially the source node) and the destination node. Here, since ∆x = |3−0| = 3 and ∆y = |6−0| = 6, we route the packet in the direction which reduces |∆y|. Thus we choose the South direction initially, and the current node becomes (0,1). Since still |∆y| > |∆x|, the packet is routed in the South direction, making the current node (0,2). Again |∆y| > |∆x|, so the South direction is chosen again and the current position becomes (0,3). At this stage, |∆y| = |∆x|; after this point the algorithm becomes non-deterministic, i.e., the routing path becomes a function of the buffer availabilities at the nodes. We now look at the buffer availability of both of the favorable directions for routing (South and East in this case). Say the East direction's output buffer has more channels free than the South direction's output buffer at (0,3); the packet is then routed in the East direction, making the current node (1,3). Now |∆y| > |∆x|, so the packet is routed in the South direction and the next node is (1,4). Since at (1,4) |∆y| = |∆x|, the same method is used for choosing between the South and East directions. This process is repeated till we reach the destination node (3,6). Thus one possible path for a packet can be: (0,0) → (0,1) → (0,2) → (0,3) → (1,3) → (1,4) → (1,5) → (2,5) → (3,5) → (3,6). The complete algorithm is given in Algorithm 1.
The proposed routing algorithm is free from livelock as it is minimal in nature, and it requires only two virtual channels for deadlock-free routing. The virtual channel (VC) assignment depends on the relative position of the source S and destination D: if D is towards the East (West) of S, packets use the first (second) virtual channel along the Y direction. For the X direction, any virtual channel can be used. This approach breaks the cycle formed in the channel dependency graph, thereby preventing deadlock, as illustrated in Figure 1.

Algorithm 1 Minimally Adaptive XY Algorithm
Require: Sx, Sy — (x, y) coordinates of source node
         Cx, Cy — (x, y) coordinates of current node
         Dx, Dy — (x, y) coordinates of destination node
Ensure: Route from (Sx, Sy) to (Dx, Dy)
Initialization
1: Cx = Sx
2: Cy = Sy
3: while (true) do
4:   absX = |Dx − Cx|
5:   absY = |Dy − Cy|
6:   if ((Cx == Dx) and (Cy == Dy)) then
7:     Return IP CORE
8:   end if
9:   if (Cx > Dx) then
10:    dirX = WEST
11:  else
12:    dirX = EAST
13:  end if
14:  if (Cy > Dy) then
15:    dirY = NORTH
16:  else
17:    dirY = SOUTH
18:  end if
19:  if (absX == absY) then
20:    if (buffer[dirX] > buffer[dirY]) then
21:      Route in dirX
22:    else if (buffer[dirX] < buffer[dirY]) then
23:      Route in dirY
24:    else
25:      Route in random(dirX, dirY)
26:    end if
27:  else if (absX > absY) then
28:    Route in dirX
29:  else
30:    Route in dirY
31:  end if
32:  Update (Cx, Cy)
33: end while
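Algorithm 1 translates almost directly into software. The sketch below (Python, written for illustration here; the `free_buffers` callback standing in for per-port buffer state is hypothetical) follows the same decision rule. Because every hop reduces the Manhattan distance, a route from source to destination always takes exactly |Dx − Sx| + |Dy − Sy| hops.

```python
import random

def maxy_route(src, dst, free_buffers):
    """Minimally adaptive XY routing (Algorithm 1). free_buffers(node, d)
    returns the number of free output buffer channels at node in direction
    d ('N', 'S', 'E', 'W'). Returns the list of nodes visited, src included."""
    cx, cy = src
    dx, dy = dst
    path = [(cx, cy)]
    while (cx, cy) != (dx, dy):
        abs_x, abs_y = abs(dx - cx), abs(dy - cy)
        dir_x = 'W' if cx > dx else 'E'
        dir_y = 'N' if cy > dy else 'S'
        if abs_x == abs_y:
            # both feasible directions reduce distance: compare buffers,
            # break ties randomly (lines 19-26)
            bx = free_buffers((cx, cy), dir_x)
            by = free_buffers((cx, cy), dir_y)
            step = dir_x if bx > by else dir_y if by > bx else random.choice([dir_x, dir_y])
        elif abs_x > abs_y:
            step = dir_x        # equalize the larger coordinate gap first
        else:
            step = dir_y
        cx += {'E': 1, 'W': -1}.get(step, 0)
        cy += {'S': 1, 'N': -1}.get(step, 0)   # y grows southward, as in the example
        path.append((cx, cy))
    return path

# reproduce the example: (0,0) -> (3,6) with uniform (tied) buffer counts
route = maxy_route((0, 0), (3, 6), lambda node, d: 4)
```

With uniform buffer counts every tie is broken randomly, so repeated runs trace out the family of minimal paths the example describes; the first three hops are always South, since |∆y| > |∆x| until (0,3).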
A. Experimental Setup
Fig. 2. A 5x5, two-dimensional mesh showing source-destination pairs for GT (1-17, 2-21) and BE (3-21, 6-17) traffic
Fig. 3. Average latency of BE traffic for XY, OE, DyAd and MAXY routing with bandwidth for GT = 0.75