
Proceedings of the

17 th International Conference on Advanced Computing and Communications

December 14 – 17, 2009, Bangalore, India

ADCOM 2009

Copyright © 2009 by the Advanced Computing and Communications Society

All rights reserved.

Advanced Computing & Communications Society

Gate #2, CV Raman Avenue, (Adj to Tata Book House) Indian Institute of Science, Bangalore, India PIN: 560012

CONTENTS

Message from the Organizing Committee

i

Steering Committee

ii

 

Reviewers

iii

GRID ARCHITECTURE

Parallel Implementation of Video Surveillance Algorithms on GPU Architecture using NVIDIA CUDA Sanyam Mehta, Ankush Mittal, Arindam Misra, Ayush Singhal, Praveen Kumar, and Kannappan Palaniappan

2

Adapting Traditional Compilers onto Higher Architectures incorporating Energy Optimization Methods for Sustained Performance Prahlada Rao BB, Mangala N and Amit Chauhan

10

SERVER VIRTUALIZATION

Is I/O Virtualization ready for End-to-End Application Performance? J. Lakshmi, S.K.Nandy

19

Eco-friendly Features of a Data Centre OS S. Prakki

26

HUMAN COMPUTER INTERFACE -1

 

Low Power Biometric Capacitive CMOS Fingerprint Sensor System Shankkar B, Roy Paily and Tarun Kumar

34

Particle Swarm Optimization for Feature Selection: An Application to Fusion of Palmprint and Face Raghavendra R, Bernadette Dorizzi, Ashok Rao and Hemantha Kumar

38

GRID SERVICES

OpenPEX: An Open Provisioning and EXecution System for Virtual Machines Srikumar Venugopal, James Broberg and Rajkumar Buyya

45

Exploiting Grid Heterogeneity for Energy Gain Saurabh Kumar Garg

53

Intelligent Data Analytics Console Snehal Gaikwad, Aashish Jog and Mihir Kedia

60

COMPUTATIONAL BIOLOGY

Digital Processing of Biomedical Signals with Applications to Medicine D. Narayan Dutt

69

Supervised Gene Clustering for Extraction of Discriminative Features from Microarray Data C. Das, P. Maji, S. Chattopadhyay

75

AD-HOC NETWORKS

Solving Bounded Diameter Minimum Spanning Tree Problem Using Improved Heuristics Rajiv Saxena and Alok Singh

90

Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents Santosh Kulkarni and Prathima Agrawal

96

A Scenario-based Performance Comparison Study of the Fish-eye State Routing and Dynamic Source Routing Protocols for Mobile Ad hoc Networks Natarajan Meghanathan and Ayomide Odunsi

83

NETWORK OPTIMIZATION

Optimal Network Partitioning for Distributed Computing Using Discrete Optimization Angeline Ezhilarasi G and Shanti Swarup K

113

An Efficient Algorithm to Reconstruct a Minimum Spanning Tree in an Asynchronous Distributed Systems Suman Kundu and Uttam Kumar Roy

118

A SAL Based Algorithm for Convex Optimization Problems Amit Kumar Mishra

125

WIRELESS SENSOR NETWORKS

Energy Efficient Cluster Formation using Minimum Separation Distance and Relay CH’s in Wireless Sensor Networks V. V. S. Suresh Kalepu and Raja Datta

130

An Energy Efficient Base Station to Node Communication Protocol for Wireless Sensor Networks Pankaj Gupta, Tarun Bansal and Manoj Misra

136

A Broadcast Authentication Protocol for Multi-Hop Wireless Sensor Networks R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain

144

GRID SCHEDULING

 

Energy-efficient Scheduling of Grid Computing Clusters Tapio Niemi, Jukka Kommeri and Ari-Pekka Hameri

153

Energy Efficient High Available System: An Intelligent Agent Based Approach Ankit Kumar, Senthil Kumar R. K. and Bindhumadhava B. S

160

A Two-phase Bi-criteria Workflow Scheduling Algorithm in Grid Environments Amit Agarwal and Padam Kumar

168

HUMAN COMPUTER INTERFACE -2

Towards Geometrical Password for Mobile Phones Mozaffar Afaq, Mohammed Qadeer, Najaf Zaidi and Sarosh Umar

175

Improving Performance of Speaker Identification System Using Complementary Information Fusion Md Sahidullah, Sandipan Chakroborty and Goutam Saha

182

Right Brain Testing-Applying Gestalt psychology in Software Testing Narayanan Palani

188

MOBILE AD-HOC NETWORKS

Intelligent Agent based QoS Enabled Node Disjoint Multipath Routing Vijayashree Budyal, Sunilkumar Manvi and Sangamesh Hiremath

193

Close to Regular Covering by Mobile Sensors with Adjustable Ranges Adil Erzin, Soumyendu Raha and.V.N. Muralidhara

200

Virtual Backbone Based Reliable Multicasting for MANET Dipankaj Medhi

204

DISTRIBUTED SYSTEMS

Exploiting Multi-context in a Security Pattern Lattice for Facilitating User Navigation Achyanta Kumar Sarmah, Smriti Kumar Sinha and Shyamanta Moni Hazarika

215

Trust in Mobile Ad Hoc Service GRID Sundar Raman S and Varalakshmi P

223

Scheduling Light-trails on WDM Rings Soumitra Pal and Abhiram Ranade

227

FOCUSSED SESSION ON RECONFIGURABLE COMPUTING

AES and ECC Cryptography Processor with Runtime Reconfiguration Samuel Anato, Ricardo Chaves, Leonel Sousa

236

The Delft Reconfigurable VLIW Processor Stephen Wong, Fakhar Anjam

244

Runtime Reconfiguration of Polyhedral Process Network Implementation Hristo Nikolov, Todur Stefanov, Ed Depprettere

252

REDEFINE: Optimizations for Achieving High Throughput Keshavan Varadarajan, Ganesh Garga,Mythri Alle, S K Nandy, Ranjani Narayan

259

Poster Papers

A Comparative Study of Different Packet Scheduling Algorithms with Varied Network Service Load in IEEE 802.16 Broadband Wireless Access Systems Prasun Chowdhury, Iti Saha Misra

267

A Simulation Based Comparison of Gateway Load Balancing Strategies in Integrated Internet-MANET Rafi-U-Zaman, Khaleel-Ur-Rahman Khan, M. A. Razzaq, A. Venugopal Reddy

270

ECAR: An Efficient Channel Assignment and Routing in Wireless Mesh Network S. V. Rao, Chaitanya P. Umbare

273

Rotational Invariant Texture Classification of Color Images using Local Texture Patterns A. Suruliandi, E. M. Srinivasan, K. Ramar

276

 

Time Synchronization for an Efficient Sensor Network System Anita Kanavalli, Vijay Krishan, Ridhi Hirani, Santosh Prasad, Saranya K.,P Deepa Shenoy, and Venugopal K R

280

Parallel Hybrid Germ Swarm Computing for Video Compression K. M. Bakwad, S.S. Pattnaik, B. S. Sohi, S. Devi1, B.K. Panigrahi, M. R. Lohokare

283

Texture Classification using Local Texture Patterns: A Fuzzy Logic Approach E.M. Srinivasan,A. Suruliandi K.Ramar

286

Integer Sequence based Discrete Gaussian and Reconfigurable Random Number Generator Arulalan Rajan, H S Jamadagni, Ashok Rao

290

 

Parallelization of PageRank and HITS Algorithm on CUDA Architecture Kumar Ishan, Mohit Gupta, Naresh Kumar, Ankush Mittal

294

Designing Application Specific Irregular Topology for Network-on-Chip Virendra Singh, Naveen Choudhary M.S.Gaur, V. Laxmi

297

QoS Aware Minimally Adaptive XY routing for NoC Navaneeth Rameshan , Mushtaq Ahmed , M.S.Gaur , Vijay Laxmi and Anurag Biyani

300

Message from the Organizers

Welcome to the 17 th International Conference on Advanced Computing and Communications (ADCOM 2009) being held at the Indian Institute of Science, Bangalore, India during December 14-18, 2009.

ADCOM, the flagship event of the Advanced Computing and Communication Society (ACCS), is a major international conference that attracts professionals from industry and academia across the world to share and disseminate their innovative and pioneering views on recent trends and development in computational sciences. ACCS is a registered scientific society founded to provide a forum to individuals, institutions and industry to promote advanced Computing and Communication technologies.

Building upon the success of last year’s conference, the 2009 Conference will focus on "Green Computing" to promote higher standards for energy-efficient data centers, central processing units, servers and peripherals as well as reduced resource consumption towards a sustainable 'green' ecosystem. ADCOM will also explore computing for the rural masses in improving delivery of public services like education and primary health care.

Prof. Patrick Dewilde from Technical University Munich and Prof. N. Balakrishnan from the Indian Institute of Science are the General Chairs for ADCOM 2009. The organizers thank Padma Bhushan Professor Thomas Kailath, Professor Emeritus at Stanford University, for being the Chief Guest for the inaugural event of the conference and for honoring the “DAC-ACCS Foundation Awards 2009” awardees, Prof. Raj Jain from Washington University, USA and Prof. Anurag Kumar from the Indian Institute of Science, India, recognized for their exceptional contributions to the advancement of networking technologies.

The conference features 8 plenary and 8 invited talks from internationally acclaimed leaders in industry and academia. The Programme Committee had the arduous task of selecting 30 papers to be presented in 12 sessions and 11 poster presentations from a total of 326 submissions. ADCOM 2009 will have a Special Focused session on “Emerging Reconfigurable Systems” with 4 invited presentations to be followed by an open forum for discussions. In tune with the theme of the conference, an Industry Session on Green Datacenters is organized to disseminate awareness on energy-efficient solutions. A total of 8 tutorials on current topics in various aspects of computing are arranged following the main conference.

The organizers sincerely thank all authors, reviewers, programme committee members, volunteers and participants for their continued support for the success of ADCOM 2009. We welcome you all to enjoy the green and serene ambience of the Indian Institute of Science, the venue of the conference, in the IT capital of India.


ADCOM 2009 STEERING COMMITTEE

Patron

P. Balaram, IISc, India

General Chairs

N. Balakrishnan, IISc, India

Patrick Dewilde, TU Munich

Technical Programme Chairs
S. K. Nandy, IISc, India
S Uma Mahesh, Indrion, India

Organising Chairs
B.S. Bindhumadhava, CDAC, India
S. K. Sinha, CEDT, India

Industry Chairs

H. S. Jamadagni, IISc, India
Krithiwas Neelakantan, SUN, India
Lavanya Rastogi, Value-One, India
Saragur M. Srinidhi, Prometheus Consulting, India

Publicity & Media Chairs G. L. Ganga Prasad, CDAC, India P. V. G. Menon, VANN Consulting, India

Publications Chair K. Rajan, IISc, India

Finance Chairs G. N. Rathna, IISc, India

Advisory Committee Harish Mysore, India K. Subramanian, IGNOU, India Ramanath Padmanabhan, INTEL, India Sunil Sherlekar, CRL, India Vittal Kini, Intel, India N. Rama Murthy, CAIR, India Ashok Das, Sun Moksha, India Sridhar Mitta, India


Reviewers for ADCOM 2009

The following reviewers participated in the review process of ADCOM 2009. We gratefully acknowledge them for their contributions.

Benjamin Premkumar Dhamodaran Sampath Ilia Polian Jordi Torres Lipika Ray Hari Gupta Sudha Balodia Madhu Gopinathan Kapil Vaswani

K V Raghavan

P S Sastry

Parag C Prasad

R Govindarajan

Rahul Banerjee Santanu Mahapatra Sathish S Vadhiyar Shipra Agarwal Soumyendu Raha Srinivasan Murali Sundararajan V

Chakraborty Joy

T V Prabhakar

Aditya Kanade

Thara Angksun

Arnab De

V Kamakoti

Vadlamani Lalitha

Aditya Nori

Veni Madhavan C E

Karthik Raghavan Rajugopal Gubbi

Venkataraghavan k Vinayak Naik

A Sriniwas

Virendra Singh

P V Ananda Mohan

Vishwanath G

Anirban Ghosh

V C V Rao

Asim Yarkhan

Gopinath K

Bhakthavathsala

Vivekananda Vedula

Chandra Sekhar Seelamantula Chiranjib Bhattacharyya Debnath Pal Debojyoti Dutta Deepak D' Souza Haresh Dagale Joy Kuri

K R Ramakrishnan

R. Krishna Kumar

K S Venkataraghavan

Krishna Kumar R K

M J Shankar Raman

Manikandan Karuppasamy Mrs J Lakshmi Nagasuma Chandra Nagi Naganathan Narahari Yadati Natarajan Kannan Natarajan Meghanathan Neelesh B Mehta Nirmal Kumar Sancheti

Y N Srikant

Zhizhong Chen


ADCOM 2009 GRID ARCHITECTURE

Session Papers:

1. Sanyam Mehta, Ankush Mittal, Arindam Misra, Ayush Singhal, Praveen Kumar, and Kannappan Palaniappan, “Parallel Implementation of Video Surveillance Algorithms on GPU Architecture using NVIDIA CUDA”.

2. Prahlada Rao BB, Mangala N and Amit Chauhan, “Adapting Traditional Compilers onto Higher Architectures incorporating Energy Optimization Methods for Sustained Performance”.


Parallel Implementation of Video Surveillance Algorithms on GPU Architecture using NVIDIA CUDA

Sanyam Mehta, Arindam Misra, Ayush Singhal, Praveen Kumar, Ankush Mittal, Kannappan Palaniappan

Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee, INDIA

Department of Computer Science, University of Missouri-Columbia, USA

E-mail: san01uec@iitr.ernet.in, ari07uce@iitr.ernet.in, ayush488@gmail.com, praveen.kverma@gmail.com, ankumfec@iitr.ernet.in, palaniappank@missouri.edu

Abstract

At present high-end workstations and clusters are the commonly used hardware for the problem of real-time video surveillance. Through this paper we propose a real time framework for a 640×480 frame size at 30 frames per second (fps) on an NVIDIA graphics processing unit (GPU) (GeForce 8400 GS) costing only Rs. 4000, which comes with many laptops and PCs. The processing of surveillance video is computationally intensive and involves algorithms like Gaussian Mixture Model (GMM), Morphological Image operations and Connected Component Labeling (CCL). The challenges faced in parallelizing Automated Video Surveillance (AVS) were: (i) previous work had shown difficulty in parallelizing CCL on CUDA due to the dependencies between sub-blocks while merging; (ii) the overhead due to a large number of memory transfers reduces the speedup obtained by parallelization. We present an innovative parallel implementation of the CCL algorithm, overcoming the problem of merging. The algorithms scale well for small as well as large image sizes. We have optimized the implementations for the above mentioned algorithms and achieved speedups of 10X, 260X and 11X for GMM, Morphological image operations and CCL respectively, as compared to the serial implementation, on the GeForce GTX 280.

Keywords: GPU, thread hierarchy, erosion, dilation, real time object detection, video surveillance.

1. Introduction

Automated Video Surveillance is a sector that is witnessing a surge in demand owing to the wide range of applications like traffic monitoring, security of public places and critical infrastructure like dams and bridges, preventing cross-border infiltration, identification of military targets and providing crucial evidence in the trials of unlawful activities [11][13]. Obtaining the desired frame processing rates of 24-30 fps in real-time for such algorithms is the major challenge faced by the developers. Furthermore, with the recent advancements in video and network technology, there is a proliferation of inexpensive network based cameras and sensors for widespread deployment at any location. With the deployment of progressively larger systems, often consisting of hundreds or even thousands of cameras distributed over a wide area, video data from several cameras need to be captured, processed at a local processing server and transmitted to the control station for storage etc. Since there is an enormous amount of media stream data to be processed in real time, there is a great requirement of a High Performance Computing (HPC) solution to obtain an acceptable frame processing throughput.

The recent introduction of many parallel architectures has ushered a new era of parallel computing for obtaining real-time implementation of the video surveillance algorithms. Various strategies for parallel implementation of video surveillance on multi-cores have been adopted in earlier works [1][2], including our work on Cell Broadband Engine [15]. The grid based solutions have a high communication overhead and the cluster implementations are very costly.

The recent developments in the GPU architecture have provided an effective tool to handle the workload. The GeForce GTX 280 GPU is a massively parallel, unified shader design consisting of 240 individual stream processors having a single precision floating point capability of 933 GFlops. CUDA enables new applications with a standard platform for extracting valuable information from vast quantities of raw data. It enables HPC on normal enterprise workstations and server environments for data-intensive applications, e.g. [12]. CUDA combines well with multi-core CPU systems to provide a flexible computing platform. In this paper the parallel implementation of various video surveillance algorithms on the GPU architecture is presented. This work focuses on algorithms like (i) Gaussian mixture model for background modeling, (ii) Morphological image operations for image noise removal, and (iii) Connected component labeling for identifying the foreground objects. In each of these algorithms, different memory types and thread configurations provided by the CUDA architecture have been adequately exploited. One of the key contributions of this work is a novel algorithmic modification for parallelization of the divide and conquer strategy for CCL. The speed-ups obtained with the GTX 280 (30 multiprocessors or 240 cores) were very significant; the corresponding speed-ups on the 8400 GS (2 multiprocessors or 16 cores) were sufficient to process the 640×480 sized surveillance video in real-time. The scalability was tested by executing different frame sizes on both the GPUs.

2. GPU Architecture and CUDA

NVIDIA’s CUDA [14] is a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems. The programmable GPU is a highly parallel, multi-threaded, many-core co-processor specialized for compute intensive, highly parallel computation.


Fig 1 Thread hierarchy in CUDA


The three key abstractions of CUDA are the thread hierarchy, shared memories and barrier synchronization, which render it essentially an extension of C. All the GPU threads run the same code, are very light weight and have a low creation overhead. A kernel can be executed by a one dimensional or two dimensional grid of multiple equally-shaped thread blocks. A thread block is a 3, 2 or 1-dimensional group of threads as shown in Fig. 1. Threads within a block can cooperate among themselves by sharing data through some shared memory and synchronizing their execution to coordinate memory accesses. Threads in different blocks cannot cooperate and each block can execute in any order relative to other blocks. The number of threads per block is therefore restricted by the limited memory resources of a processor core. On current GPUs, a thread block may contain up to 512 threads. The multiprocessor SIMT (Single Instruction Multiple Threads) unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The constant memory is useful only when it is required that the entire warp read a single memory location. The shared memory is on chip and its accesses are 100x-150x faster than accesses to local and global memory. The shared memory, for high bandwidth, is divided into equal sized memory modules called banks, which can be accessed simultaneously. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The banks are organized such that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per two clock cycles. For devices of compute capability 1.x, the warp size is 32 and the number of banks is 16. The texture memory space is cached, so a texture fetch costs one memory read from device memory only on a cache miss; otherwise it just costs one read from the texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve best performance. The local and global memories are not cached and their access latencies are high. However, coalescing in global memory significantly reduces the access time and is an important consideration (for compute capability 1.3, global memory accesses are more easily coalesced than in earlier versions). Also, the CUDA 2.2 release provides page-locked host memory, which helps in increasing the overall bandwidth when the memory is required to be read or written exactly once. Such memory can also be mapped to the device address space, so no explicit memory transfer is required.
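As a concrete illustration of the thread hierarchy, shared memory and barrier synchronization described above, the short CUDA kernel below is our own minimal sketch, not code from the paper; the kernel name and buffers are illustrative. One thread handles one pixel and each 16×16 block stages its tile in per-block shared memory.

#include <cuda_runtime.h>

// Illustrative kernel: one thread per pixel; each block stages a 16x16 tile in
// shared memory before writing a (trivially) processed value back to global memory.
__global__ void halvePixels(const unsigned char *in, unsigned char *out,
                            int width, int height)
{
    __shared__ unsigned char tile[16][16];            // per-block on-chip memory

    int x = blockIdx.x * blockDim.x + threadIdx.x;    // global pixel coordinates
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    bool inside = (x < width) && (y < height);

    tile[threadIdx.y][threadIdx.x] = inside ? in[y * width + x] : 0;
    __syncthreads();                                  // barrier for all threads of the block

    if (inside)
        out[y * width + x] = tile[threadIdx.y][threadIdx.x] / 2;
}

// Host-side launch: a 2D grid of 16x16 thread blocks covering the image.
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   halvePixels<<<grid, block>>>(d_in, d_out, width, height);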


Fig. 2 The device memory space in CUDA

3. Our approach for the Video Surveillance Workload

A typical Automated Video Surveillance (AVS) workload consists of various stages like background modelling, foreground/background detection, noise removal by morphological image operations and object identification. Once the objects have been identified other applications can be developed as per the security requirements. Fig. 3 shows the multistage algorithm for a typical AVS system. The different stages and our approach to each of them are described as follows:


Fig. 3 A typical video surveillance system

3.1 Gaussian Mixture Model

Many approaches for background modelling like [4][5] have been proposed. Here, the Gaussian Mixture model proposed by Stauffer and Grimson [3] is taken up, which assumes that the time series of observations,


at a given image pixel, is independent of the observations at other image pixels. It is also assumed that these observations of the pixel can be modelled by a mixture of K Gaussians (currently, from 3 to 5 are used). Let $x_t$ be a pixel value at time t. Thus, the probability that the pixel value $x_t$ is observed at time t is given by:

$$P(x_t) = \sum_{k=1}^{K} w_{k,t}\,\eta\!\left(x_t \mid \mu_{k,t}, \sigma_{k,t}\right) \qquad (1)$$

where $\eta(\cdot)$ is a Gaussian probability density function and $w_{k,t}$, $\mu_{k,t}$ and $\sigma_{k,t}$ are the weight, the mean, and the standard deviation, respectively, of the k-th Gaussian of the mixture associated with the signal at time t. At each time instant t the K Gaussians are ranked in descending order of the $w/\sigma$ value (the most ranked components represent the “expected” signal, or the background) and only the first B distributions are used to model the background, where

$$B = \arg\min_{b} \left( \sum_{k=1}^{b} w_{k,t} > T \right) \qquad (2)$$

T is a threshold representing the minimum fraction of data used to model the background. As the parameters of each pixel's mixture change, the Gaussian most likely to have been produced by the background process is the one with the most supporting evidence and the least variance, since the variance of a new moving object that occludes the image is high, which can easily be checked from the value of $\sigma_{k,t}$.

GMM offers pixel level data parallelism which can be easily exploited on the CUDA architecture. Since the GPU consists of multiple cores which allow independent thread scheduling and execution, it is perfectly suitable for independent pixel computation. So, an image of size m × n requires m × n threads, implemented using appropriately sized blocks running on multiple cores. Besides this, the GPU architecture also provides shared memory which is much faster than the local and global memory spaces. In fact, for all threads of a warp, accessing the shared memory is as fast as accessing a register as long as there are no bank conflicts [14] between the threads. In order to avoid too many global memory accesses, the shared memory was utilised to store the arrays of various Gaussian parameters. Each block has its own shared memory (up to 16 KB) which is accessible (read/write) to all its threads simultaneously, which greatly improves the computation on each thread since memory access time is significantly reduced. The value for K (the number of Gaussians) is selected as 4, which not only results in effective coalescing [14] but also reduces the bank conflicts. As shown in Table 1, the efficacy of coalescing is quite prominent.

The approach for GMM involves streaming (Fig. 4): processing the input frame using two streams

allows for the memory copies of one stream to overlap with the kernel execution of the other stream.

for i <=2 do

create stream i //cudaStreamCreate for each stream i do

copy half the image from host to device each stream i. //cudaMemcpyAsync

for each stream i do

kernel execution for each stream i. (half image processed) //gmm<<<….>>>

cudaThreadSynchronize();

Fig. 4 Algorithm depicting streaming in CUDA

Streaming resulted in significant speed up in the case of 8400 GS, where the time for memory copies was closely matched to the time for kernel execution, while in case of GTX 280, the speed up was not so significant as the kernel execution took little time, being spread over 30 multiprocessors.
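The following CUDA fragment is a hedged sketch of the two-stream pattern of Fig. 4; the kernel body, buffer names and launch configuration are placeholders rather than the paper's actual implementation.

#include <cuda_runtime.h>

// Placeholder kernel standing in for the per-pixel GMM update (illustrative only).
__global__ void gmm_kernel(unsigned char *pixels, int nPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPixels)
        pixels[i] = pixels[i];        // real code would update the Gaussian mixture here
}

// Two-stream pattern of Fig. 4: the copy of one half of the frame can overlap the
// kernel working on the other half.  h_frame must be page-locked (cudaMallocHost)
// for cudaMemcpyAsync to be truly asynchronous.
void processFrameWithStreams(unsigned char *d_frame, unsigned char *h_frame, int nPixels)
{
    cudaStream_t stream[2];
    int half = nPixels / 2;
    dim3 block(256);
    dim3 grid((half + 255) / 256);

    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < 2; ++i)       // asynchronous host-to-device copy of each half
        cudaMemcpyAsync(d_frame + i * half, h_frame + i * half, half,
                        cudaMemcpyHostToDevice, stream[i]);

    for (int i = 0; i < 2; ++i)       // kernel for each half queued on its own stream
        gmm_kernel<<<grid, block, 0, stream[i]>>>(d_frame + i * half, half);

    cudaThreadSynchronize();          // wait for both streams (CUDA 2.2 era API)

    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(stream[i]);
}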

3.2 Morphological Image Operations

After the identification of the foreground pixels from the image, there are some noise elements (like salt and pepper noise) that creep into the foreground image. They essentially need to be removed in order to find the relevant objects by the connected component labelling method. This is achieved by the morphological image operation of erosion followed by dilation [6]. Each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbours, depending on the structuring element (Fig. 5) used. In case of dilation (denoted by ⊕) the value of the output pixel is the maximum value of all the pixels in the input pixel's neighbourhood. In a binary image, if any of the pixels in the neighbourhood corresponding to the structural element is set to the value 1, the output pixel is set to 1. With binary images, dilation connects areas that are separated by spaces smaller than the structuring element and adds pixels to the perimeter of each image object. In erosion (denoted by ⊖), the value of the output pixel is the minimum value of all the pixels in the input pixel's neighbourhood. In a binary image, if any of the pixels in the neighbourhood corresponding to the structural element is set to the value 0, the output pixel is set to 0. With binary images, erosion completely removes objects smaller than the structuring element and removes perimeter pixels from larger image objects. This is described mathematically as:

 

$$A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \emptyset \,\} \qquad (3)$$

and

$$A \ominus B = \{\, z \mid (B)_z \subseteq A \,\} \qquad (4)$$

where $(B)_z$ is the translation of set B by point z as per the set theoretic definition and $\hat{B}$ is the reflection of set B.

0 0 1 0 0
0 1 0 1 0
1 0 0 0 1
0 1 0 1 0
0 0 1 0 0

Fig. 5 A 5×5 structuring element

As the texture cache is optimized for 2-dimensional spatial locality, the 2-dimensional texture memory is used to hold the input image; this has an advantage over reading pixels from the global memory, when coalescing is not possible. Also, the problem of out of bound memory references at the edge pixels is avoided by the cudaAddressModeClamp addressing mode of the texture memory, in which out of range texture coordinates are clamped to a valid range. Thus the need to check out of bound memory references with conditional statements never arose, preventing the warps from becoming divergent and adding a significant overhead.


Fig. 6 Approach for erosion and dilation

As shown in Fig. 6 a single thread is used to process two pixels. A half warp (16 threads) has a bandwidth of 32 bytes/cycle and hence 16 threads, each processing 2 pixels (2 bytes) use full bandwidth,

while writing back noise-free image. This halves the total number of threads thus reducing the execution time significantly. A structuring element of size 7×7 was used both in dilation and erosion. A straightforward convolution was done with one thread running on two neighbouring pixels. The execution times for the morphological image operations for the GTX 280 and the 8400 GS are shown in Table 2.
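To make the texture-based approach concrete, the following is a simplified, illustrative erosion kernel of our own, not the paper's code; the texture reference and kernel name are assumed, and unlike the two-pixels-per-thread scheme of Fig. 6 it processes one pixel per thread.

#include <cuda_runtime.h>

// Input image bound on the host to this 2D texture reference, with
// texImage.addressMode[0] = texImage.addressMode[1] = cudaAddressModeClamp so that
// out-of-range reads at the borders are clamped instead of branch-checked.
texture<unsigned char, 2, cudaReadModeElementType> texImage;

// Illustrative 3x3 erosion: each thread writes the minimum value found in its
// pixel's neighbourhood.
__global__ void erode3x3(unsigned char *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int v = 255;
    for (int dy = -1; dy <= 1; ++dy)          // minimum over the 3x3 neighbourhood
        for (int dx = -1; dx <= 1; ++dx)
            v = min(v, (int)tex2D(texImage, x + dx, y + dy));

    out[y * width + x] = (unsigned char)v;
}

// Host side (sketch): bind the image with cudaBindTexture2D after setting the clamp
// address modes, then launch with a 2D grid of 16x16 blocks as for the other kernels.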

3.3 Connected Component Labelling

The connected component labelling algorithm works on a black and white (binary) image input to identify the various objects in the frame by checking pixel connectivity [8]. The image is scanned pixel-by-pixel (from top to bottom and left to right) in order to identify connected pixel regions, i.e. regions of adjacent pixels which share the same set of intensity values, and temporary labels are assigned. The connectivity can be either 4 or 8 neighbour connectivity (8-connectivity in our case). Then, the labels are put under equivalence classes, pertaining to their belonging to the same object. After constructing the equivalence classes the labels for the connected pixels are resolved by assigning the label of the equivalence class to all the pixels of that object. Here the approach for parallelizing CCL on the GPU belongs to the class of divide and conquer algorithms [7]. The proposed implementation divides the image into small parts and labels the objects in those small parts. Then in the conquer phase the image parts are stitched back to see if the two adjoining parts have the same object or not. For initial labelling the image was divided into N×N small regions and the sub-images were scanned pixel by pixel from left to right and top to bottom. These small regions were executed in parallel on different blocks (32×32 in case of 1024×1024 images). Each pixel was labelled according to its connectivity with its neighbours. In case of more than one neighbour, one of the neighbours' labels was used and the rest were marked under one equivalence class. This was done similarly for all blocks that were running in parallel. The equivalence class array was stored in shared memory for each block, which saved a lot of memory access time. The whole image frame was stored in texture memory to reduce memory access time, as global memory coalescing was not possible due to random but spatial accesses.


Region 1 – labels limited from 0 to 7
Region 2 – labels limited from 8 to 15
Region 3 – labels limited from 16 to 23
Region 4 – labels limited from 24 to 31

Fig 7. The connected components are assigned the maximum label after resolution

In earlier works on CCL like [9][10], the major limitation was that the sub-blocks into which the problem was broken had to be merged serially, the reason being that each sub-block had blobs with serial labels and, while merging any two connected sub-blocks, the labels in all the other sub-blocks had to be modified – clearly no parallelization was possible. A new approach to enable parallelization of CCL is presented in this paper. The code (as indicated in Fig. 7) labels the blobs (objects) independent of the other sub-blocks, but according to the CUDA thread ids (i.e. the 1st sub-block can label the blobs from 0 to 7, the 2nd sub-block can label the blobs from 8 to 15 and so on). So in this case no sub-block can detect more than 8 blobs (which is generally the case, but one may easily choose to have a higher limit). In order to avoid conflicts between sub-blocks, connected parts of the image in different regions were given the highest label from amongst the different labels in different regions, as shown in Fig. 7. So, as a result of making the entire code ‘portable on GPU’, the speed up obtained was enormous – the entire processing being split and made parallel to be executed on the GTX 280, resulting in the entire CCL (i.e. including merge) code being executed in just 2.4 milliseconds (Table 3) for a 1024×768 image, a speed-up of 11x as compared to sequential code.
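The label-range idea can be sketched as follows; this is an illustrative fragment under assumed names (MAX_BLOBS_PER_BLOCK, labelBase), not the authors' implementation.

// Sketch of the per-block label ranges described above (not the paper's code):
// sub-block b may only use labels in [b*MAX_BLOBS, (b+1)*MAX_BLOBS), so no two
// sub-blocks can ever hand out the same label and the merge step never has to
// renumber unrelated sub-blocks.
#define MAX_BLOBS_PER_BLOCK 8

__device__ int labelBase(void)
{
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;   // linear index of this sub-block
    return blockId * MAX_BLOBS_PER_BLOCK;                // first label this sub-block may assign
}

// Inside the labelling kernel, the n-th blob found locally would get
//     int label = labelBase() + n;          // n < MAX_BLOBS_PER_BLOCK
// and during the merge phase two touching blobs from neighbouring sub-blocks are
// both given the larger of their two labels, as illustrated in Fig. 7.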

Table 1 CUDA Profiler output for 1024×768 image size

Function | Occupancy | Coalesced global memory loads | Coalesced global memory stores | Total branches | Divergent branches | Static memory per block | Total global memory loads | Total global memory stores
GMM      | 0.75 | 5682 | 826  | 829  | 27 | 3112 | 5682 | 347
GMM      | 0.75 | 5094 | 186  | 500  | 5  | 3112 | 5094 | 93
Erode    | 1    | 0    | 102  | 408  | 2  | 20   | 0    | 37
Dilate   | 1    | 0    | 154  | 4975 | 31 | 20   | 0    | 77
CCL      | 0.25 | 2657 | 1902 | 4128 | 0  | 536  | 2657 | 823
Merge    | 0.25 | 9898 | 36   | 1936 | 2  | 1064 | 9898 | 18

4. EXPERIMENTAL RESULTS

The parallel implementation of the above mentioned AVS workload was executed on two NVIDIA GPUs: the first GPU used is the GeForce GTX 280 on board a 3.2 GHz Intel Xeon machine with 1 GB of RAM; the second one was the GeForce 8400 GS on board a 2 GHz Intel Centrino Duo machine. The GTX 280 has a single precision floating point capability of 933 GFlops and a memory bandwidth of 141.7 GB/sec; it also has 1 GB of dedicated DDR3 video memory and consists of 30 multiprocessors with 8 cores each, hence a total of 240 stream processors. It belongs to compute capability 1.3, which supports many advanced features like page-locked host memory and those which take care of the alignment and synchronization issues. The 8400 GS has a memory bandwidth of 6.4 GB/sec and has two multiprocessors with 8 cores each, i.e. 16 stream processors, a single precision floating point capability of 28.8 GFlops and 128 MB of dedicated memory. It belongs to compute capability 1.2. The development environment used was Visual Studio 2005 and the CUDA profiler version 2.2 was used for profiling the CUDA implementation. The image sizes that have been used are 1600×1200, 1024×768, 640×480, 320×240 and 160×120. In the subsequent discussion we mention the results obtained for the image size of 1024×768 on the GTX 280 (30 multiprocessors).

The Gaussian Mixture Model, used for background modelling, has the kind of parallelism that is required for implementation on a GPU. As evident from Fig. 8, the time of execution increases with the increase in image size and the amount of speedup achieved also increases almost proportionately; this is due to the execution of a large number of threads that keeps the GPU busy. Hence, a significant speedup of 10x has been achieved for the 320×240 image.

Fig. 8 Execution times for GMM (Speed up = 10 for image size 320×240, as compared to sequential code)

Table 2. Execution times for erosion and dilation for a 3×3 structuring element.

IMAGE SIZE | GTX 280 (ms) | 8400 GS (ms)
160×120    | 0.0445       | 0.120
320×240    | 0.0586       | 0.465
640×480    | 0.1254       | 1.75
1024×768   | 0.2429       | 3.61
1600×1200  | 0.5625       | 11.7

Shared memory was used to reduce the global memory accesses, keeping in view the shared memory size (16 KB). As can be seen from Table 1, a total of 4 blocks (192×4 threads out of a maximum of 1024 threads) could be executed in parallel on a multiprocessor, giving an occupancy of 0.75. As a result of using K=4, all the global memory loads were coalesced, as can be seen from Table 1; also, there were fewer bank conflicts. The use of streaming reduced the memory copy overhead, but not to the extent anticipated, due to the efficient memory copying in the GTX 280 (compute capability 1.3). This approach, however, was of great help in the case of the 8400 GS (compute capability 1.2).

The morphological image operations contribute a major portion of the computational expense of the AVS workload. In our approach we are able to drastically reduce their execution time. The speedup scales with the image size both on the GTX 280 and the 8400 GS; the comparison of sequential code with the parallel implementation for the 1024×768 image size shows a significant speedup of 260X with a structuring element of 7×7. The time taken by the sequential implementation was 89.806 ms as compared to the 0.352 ms taken by the parallel implementation. For this image size we were able to unleash the full computational power of the GPU with an occupancy of 1 (i.e. neither shared memory nor the registers per multiprocessor were the limiting factors) on the GTX 280, as indicated in Table 1. Moreover, the use of texture memory and address clamp modes has reduced the percentage of divergent threads to <1%. On the 8400 GS also a significant speedup has been achieved.

Fig. 9 Image output of various stages of AVS: (a) input image, (b) foreground image, (c) image after noise removal, (d) output

In CCL (i.e. CCL & Merge), 32×32 sized independent sub-blocks were assigned to each thread and 32 threads were run on one block (which was experimentally observed to be optimal). Since the maximum number of active blocks on a multiprocessor can be 8, the total number of active threads per multiprocessor was 256, and hence an occupancy of 0.25. The optimal parallelization of the CCL algorithm was significant in itself, as the parallelization of CCL on CUDA has not been reported and was deemed very difficult. Apart from the code being parallelized, the use of shared memory and then texture memory to store appropriate data led to significant increases in speedup. The use of texture memory not only prevented any warps from diverging by avoiding the conditional statements (due to clamped accesses in texture memory) but also led to speedup due to the spatial locality of references in CCL. However, the implementation of CCL is block size dependent, which still remains a bottleneck.

Table 3 Execution times for CCL

IMAGE SIZE | GTX 280 (ms) | 8400 GS (ms)
160×120    | 0.106        | 0.522
320×240    | 0.220        | 1.34
640×480    | 1.256        | 4.5
1024×768   | 2.494        | 14.1
1600×1200  | 2.649        | 46.2

In each of the above kernels, page-locked host memory (a feature of CUDA 2.2) has been used whenever only one memory read and write were involved, which increased the memory throughput.

Architectures dedicated to video surveillance cost as much as lakhs of rupees, while the GeForce GTX 280 costs Rs. 17000 and the 8400 GS costs merely Rs. 4000. Even for an image size of 640×480, 30 frames per second could be processed on the 8400 GS; for an image size of 1024×768 close to 15 frames per second could be processed, and for images of smaller size 30 frames could be easily processed, as shown in Fig. 10.

Fig. 10 Comparison of total time for images of different sizes

5. CONCLUSION AND FUTURE WORK

Through this paper, we describe the implementation of a typical AVS workload on the parallel architecture of NVIDIA GPUs to perform real time AVS. The various algorithms, as described in the previous sections, are GMM for background modelling, morphological image operations, and CCL for object identification. In our previous work [15] a detailed comparison has been done between the Cell BE and CUDA for these algorithms. During the implementation on the GPU architecture, major emphasis was given to selecting the thread configurations and the memory types for each kind of data, out of the numerous options available on the GPU architecture, so that the memory latency can be reduced and hidden. A lot of emphasis was given to memory coalescing and avoiding bank conflicts. Efficient usage of the different kinds of memories offered by the CUDA architecture and subsequent experimental verification resulted in the most optimal implementations. As a result, significant overall speed-up was achieved. Further testing and validations are going on. We have examined the performance on only the 8400 GS (2 multiprocessors) and the GTX 280 (30 multiprocessors) in this paper, hence a range of intermediate devices are yet to be explored. Our future work will include the implementation of the AVS workload on other GPU devices to examine the scalability, as well as comparison with other parallel architectures to get an idea of their viability as compared to the GPU implementation.

6. REFERENCES

[1] S. Momcilovic and L. Sousa. A parallel algorithm for advanced video motion estimation on multicore architectures. Int. Conf. Complex, Intelligent and Software Intensive Systems, pp 831-836, 2008. [2] M. D. McCool. Data-Parallel Programming on the Cell BE and the GPU Using the RapidMind Development Platform. GSPx Multicore Applications Conference, 9 pages, 2006. [3] C. Stauffer and W. Grimson,Adaptive background

mixture models for real-time tracking, In Proceedings

CVPR, pp. 246–252, 1999.

[4] Zoran Zivkovic, Improved Adaptive Gaussian Mixture Model for Background Subtraction. In Proc. ICPR, pp. 28-31, vol. 2, 2004.

[5] Toyama, K.; Krumm, J.; Brumitt, B.; Meyers, B., Wallflower: principles and practice of background maintenance. The Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1, pp. 255-261, 20-25 September, 1999, Kerkyra, Corfu, Greece.

[6] H. Sugano and R. Miyamoto. Parallel implementation of morphological processing on Cell/BE with OpenCV interface. Communications, Control and Signal Processing, 2008. ISCCSP 2008, pp 578–583,

2008.

[7] J. M. Park, G. C. Looney, and H. C. Chen, “ A Fast

Connected Component Labeling Algorithm Using Divide and Conquer”, CATA 2000 Conference on

Computers and Their Applications, pp 373-376, Dec.

2000.

[8] R. Fisher, S. Perkins, A. Walker and E. Wolfart Connected Component Labeling, 2003 [9] K.P. Belkhale and P. Banerjee, "Parallel Algorithms for Geometric Connected Component Labeling on a Hypercube Multiprocessor," IEEE Transactions on Computers, vol. 41, no. 6, pp. 799-

709,1992

[10] M. Manohar and H.K. Ramapriyan. Connected component labeling of binary images on a mesh connected massively parallel processor. Computer vision, Graphics, and Image Processing, 45(2):133-

149,1989.

[11] K.Dawson-Howe, “Active surveillance using dynamic background subtraction,” Tech. Rep.TCD- CS-96-06, Trinity College, 1996. [12] Michael Boyer, David Tarjan, Scott T. Acton†, and Kevin Skadron, Accelerating Leukocyte Tracking using CUDA:A Case Study in Leveraging Manycore

Coprocessors, 2009.

[13] A.C. Sankaranarayanan, A. Veeraraghavan, and R. Chellappa. Object detection, tracking and recognition for multiple smart cameras. Proceedings of the IEEE, 96(10):1606–1624,2008. [14] NVIDIA CUDA Programming Guide,Version 2.2,

pp. 10, 27-35, 75-97, 2009.

[15] Praveen Kumar, Kannappan Palaniappan, Ankush Mittal and Guna Seetharaman. Parallel Blob Extraction using Multicore Cell Processor. Advanced Concepts for Intelligent Vision Systems (ACIVS) 2009. LNCS 5807, pp. 320–332, 2009.

Adapting Traditional Compilers onto Higher Architectures incorporating Energy Optimization Methods for Sustained Performance

Prahlada Rao B B, Mangala N, Amit K S Chauhan

Centre for Development of Advanced Computing (CDAC), #1, Old Madras Road, Byappanahalli, Bangalore-560038, India

email: {prahladab, mangala}@cdacb.ernet.in

ABSTRACT - Improvements in processor technology are offering benefits, such as a large virtual address space, faster computations, non-segmented memory, higher precision etc., but require upgradation of system software to be able to exploit the benefits offered. The authors in this paper present various tasks and constraints experienced in enhancing compilers from a 32-bit to a 64-bit platform. The paper describes various aspects, ranging from design changes to porting and testing issues, that have been dealt with while enhancing C-DAC's Fortran90 (CDF90) compiler to work with 64-bit architecture and I:8 support. Features supported by CDF90 for energy efficiency of code are presented. The regression testing carried out to test the I:8 support added in the compiler is discussed.

KEYWORDS: Compilers, Testing, Porting, LP64, CDF90, Optimizations

I. INTRODUCTION Based on Moore’s law, processors will continue to show an increase in speed and processing capability with time, and chip making companies like Intel, AMD, IBM, Sun etc. put

efforts to stay on the Moore curve of progress. Processor architectures are evolving with different techniques to gain in performance – dividing each instruction into a large number

of pipeline stages, scheduling instructions in an order different

from the order the processor received them, providing 64-bit architecture, providing multi cores, etc. However, the benefits

of all these components can be derived only if the system software is tailored to match the new processor. To keep pace with the rapid changes in processor technology, usually the existing system software codes are tweaked to exploit the offerings of the new architectures. Domain experts in areas such as weather forecasting, climate modeling and atomic physics still continue to maintain large programs written in Fortran language. Thus, Fortran being popular in the scientific community, requires to be supported on new architectures to take advantage of advancements in the hardware. However, redesigning a compiler from scratch is a major task. Instead, modifying an

existing compiler so that it generates code for newer targets is

a common way to make compilers compliant to enhanced

processors. This paper describes the important aspects of

adapting the Fortran90 compiler (CDF90) developed by C- DAC to higher architectures.

A. Overview of CDAC’s Fortran90 Compiler

The highlights of the CDF90 compiler are that it supports both the F77 and F90 standards; it supports the Message Passing Interface (MPI) and mixed language programming with C [6]. It also has an in-built Fortran77 to Fortran90 converter. It is available on AIX, Linux and Solaris. The CDF90 source code is written in the ‘C’ language and comprises about 557 source code files with 190 Kilo Lines of Code (KLOC). As with other traditional compilers, CDF90 includes the key phases of lexical analysis, syntax and semantic analysis, optimization and code generation. Yacc [9] is used for developing the syntax analysis modules in CDF90. Internally the context free grammar is represented by a tree using AST (Abstract Syntax Tree) [7] notation. The tree can be traversed to generate intermediate code and also to carry out optimization transformations.

B. Traditional and Retargetable Methodologies

Advanced versions of the GNU Compiler Collection (GCC) offer many advantages over traditional compilers. Even so, traditional compilers are important because they provide the basic building block for adding sophisticated features as requirements arise, or for research work carried out on top of existing compiler projects; this provides the flexibility of not writing everything from scratch. In an attempt in the early 90s to make the compiler highly portable, the code generator module was replaced with a ‘translator to C’ in order to generate intermediate C code, since stable C compilers were available on a majority of platforms. Hence CDF90 acts as a converter by translating the input Fortran77/90 program into an equivalent C program which is then passed to some standard ‘C’ compiler like gcc. CDF90 incorporates the traditional compiler development approach. CDF90 offers various optimization techniques, such as Loop Unrolling, Loop Interchanging, Loop Merging, Loop Distribution and Function in-lining, for efficient storage


and execution of the code, while gcc comes with more sophisticated and complex optimization procedures, such as inter-procedural analysis, for better performance. CDF90 takes the benefits of both approaches by converting the code into intermediate C code through the traditional approach and later passing the intermediate C code to the gcc compiler to exploit the benefits offered by advanced techniques. Various compilation approaches are depicted in Figure 1.

[Figure 1 contrasts three compilation approaches. Traditional: Scientific Code → Lexical Analyzer → Syntax + Semantic Analyzer → Optimizer (machine independent) → Code Generator & Assembler for Target Architecture (machine dependent) → Executable for Target Architecture. Portable: Scientific Code → Lexical Analyzer → Syntax + Semantic Analyzer → Optimizer → Translator to commonly used language (C/C++) for portability (machine independent) → Compile using popular compiler (machine dependent) → Different Executables for Targets. Retargetable: Scientific Code → Lexical Analyzer → Syntax + Semantic Analyzer → General Optimizer → Intermediate Code Generator (machine independent) → Code Generation using .md & .rtl for Target Architecture (machine dependent) → Different Executables for Targets.]

Figure 1. Different Compiler Approaches

II. CDF90 ARCHITECTURE

CDF90 is conventionally composed of the following main modules:

B. Lexer

This module converts sequence of characters into a sequence of tokens, which will be given as input to the Parser for construction of Abstract Syntax Tree. This Abstract Syntax Tree will be used in later modules of compiler for further processing.

C. Parser

This module receives input in the form of sequential source program instructions, interactive commands, markup tags, or some other defined interface and breaks them up into parts that can then be managed by other compiler components. Parser will get the tokens from the Lexer and will construct a data structure, usually an Abstract Syntax Tree.

D. Optimizer

This module provides a suite of traditional optimization methods to minimize energy cost of a program by minimizing memory access instructions and execution time. Optimization techniques like Loop Unrolling, Loop interchanging, Loop Merging, Loop Distribution and Function in-lining etc. have been applied on the Parse Tree structure.
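As an illustration of the kind of transformation listed above, the following C fragment (our own example, not CDF90 output) shows a loop before and after unrolling by a factor of 4.

/* Illustrative example (not CDF90 output) of one of the transformations listed
 * above, loop unrolling: the optimizer rewrites the loop body so that fewer
 * branches and index updates are executed per array element processed. */

/* before unrolling */
void scale(float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 2.0f;
}

/* after unrolling by a factor of 4 (n assumed divisible by 4 for brevity) */
void scale_unrolled(float *a, int n)
{
    for (int i = 0; i < n; i += 4) {
        a[i]     = a[i]     * 2.0f;
        a[i + 1] = a[i + 1] * 2.0f;
        a[i + 2] = a[i + 2] * 2.0f;
        a[i + 3] = a[i + 3] * 2.0f;
    }
}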


E. Translator

This module translates FORTRAN source code to correct, compilable and clean C source code. The translator makes an in-order traversal of the full parse tree and replaces each FORTRAN construct by its corresponding C construct. The output of the translator module is a ‘.c’ file. This .c file is passed to some standard C compiler to produce the final executable. I/O libraries of CDF90 are linked using the ‘–l’ option and passed to the linker ‘ld’.
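For illustration only, and assuming a made-up input program (this is not actual CDF90 output), a small Fortran fragment and the kind of C a source-to-source translator could emit might look like the following:

/* Illustrative only -- not actual CDF90 output.  A Fortran 90 fragment such as
 *
 *     integer :: i, s
 *     s = 0
 *     do i = 1, 10
 *        s = s + i
 *     end do
 *     print *, s
 *
 * could be lowered by a Fortran-to-C translator into C along these lines, which is
 * then handed to a standard C compiler such as gcc: */
#include <stdio.h>

int main(void)
{
    int i, s;

    s = 0;
    for (i = 1; i <= 10; i++)
        s = s + i;

    printf(" %d\n", s);
    return 0;
}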

F. I/O Library

This module contains methods that are invoked by the intermediate C code generated as an output of translator module with the help of linker at link time.

[Figure 2 shows the control flow through CDF90: a Fortran77/90 application (.f or .f90) is fed to the Lexer, which produces tokens for the Parser; the Parser builds the AST, which the Optimizer transforms into an optimized AST; the Translator, together with the I/O Library, emits intermediate C code, which a C compiler (gcc) turns into the final executable file (XCOFF/XCOFF-64bit).]

Figure 2. Control Flow Graph for CDF90 Compiler

III. CONSIDERATIONS FOR MIGRATING CDF90 TO 64-BIT

A. Need for Migration of CDF90

64-bit architectures have been gaining popularity over the past decade with a promise of higher accuracy and speed through the use of 64-bit registers and 64-bit addressing. The advantages offered by 64-bit processors are:

Large virtual address space

Non-segmented memories

64–bit arithmetic

Faster Computations

Removal of certain system limitations

A study was taken up to understand the feasibility, impact, and effort for migrating existing Fortran90 compiler. Considering the extensive features offered by CDF90 and the advantages of enhancing this to higher bit processors, it was decided to enhance the existing compiler to support LP64 model, and this would require reasonable changes mainly in parser, translator and library modules.

B. Data Model Standards Followed

For higher precision and better performance, newer architectures have been designed to support various data models. The three basic models that are supported by most of the major vendors on 64-bit platform are LP64, ILP64 and LLP64 [5].

LP64 (also known as 4/8/8) denotes int as 4 bytes, long and pointer as 8 bytes each.

ILP64 (also known as 8/8/8) means int, long and pointers are 8 bytes each.

LLP64 (also known as 4/4/8) adds a new type (long long) and pointers as 64-bit types. Many 64-bit compilers support the LP64 [5] data model, including gcc and xlc on the AIX5.3 platform. CDF90 acts as a front-end and depends upon gcc/xlc for backend compilation. Since gcc/xlc follow the LP64 data model for 64-bit compilation on the AIX5.3 platform, 64BitCDF90 also needs to follow the same data model. Hence the LP64 data model has been adopted for 64-bit compilation on the 64-bit AIX5.3/POWER5 platform.

C. Approach Followed for Migration

Adding 8-byte integer (I:8) support to the CDF90 compiler was required to enjoy the benefit of faster computing with larger integer values. In order to implement this in CDF90, various implicit FORTRAN library functions needed to be modified and new functions written that support 8-byte integer computations. Also, various changes were required in different compiler modules, which will be described in a later part of the paper.

FORTRAN applications passed to CDF90 are translated to C code hence it was required to consider the data model of the underlying C compiler also. gcc4.2 follows LP64 data model for 64-bit compilation and for 32-bit compilation it uses ILP32 data model even on the 64-bit AIX platform. Hence LP64 data model would be suitable in this situation.

Most 64-bit processors support both 32-bit and 64-bit execution modes [10]. Hence 64BitCDF90 also needs to provide both 32-bit and 64-bit compilation support through the help of 32-bit and 64-bit compilation libraries, though the executable may be 32-bit only. Hence it is identified that two different libraries need to be prepared for 32-bit and 64-bit compilation environments.

The Fortran77/90 executable file format needs to be changed to XCOFF64 when compiled by 64BitCDF90 on a 64-bit platform using 64-bit libraries. 64-bit CDF90 APIs need to be generated for 64-bit compilation of any application, though the CDF90 executable may be 32-bit only. Please note that the same approach has been followed by ‘gcc’, which uses a 32-bit executable for 64-bit compilation of any application file passed to it, through the use of 64-bit library files.

64BitCDF90 needs to be validated for 64-bit architecture compliance against the existing test cases along with the newly added test cases specific to 8-byte integer support.

IV. EFFECT OF LP64 DATA MODEL ON CDF90

Some fundamental changes occur when moving from the ILP32 data model to the LP64 data model, which are listed as follows (a small C sketch after this list illustrates the size changes and the truncation pitfall):

long and pointers are no longer 32 bits in size. Direct or indirect assignment of an int to a long or pointer value is no longer valid.

For 64-bit compilation, CDF90 needs to use 64-bit library archives. It also needs to supply 64-bit specific flags to the backend compiler so that it can operate in 64-bit mode.

System derived types such as size_t, time_t and ptrdiff_t are 64-bit aligned in 64-bit compilation environments. Hence these values must not be stored in or assigned to 32-bit variables.
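A minimal C sketch of these effects, assuming an LP64 toolchain (illustrative, not CDF90 code):

/* Minimal sketch (not CDF90 code) of the ILP32 -> LP64 differences listed above:
 * under LP64, long and pointers widen to 8 bytes while int stays at 4 bytes, so
 * storing a pointer or a size_t in an int silently truncates the upper 32 bits. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    printf("int=%zu long=%zu void*=%zu size_t=%zu\n",
           sizeof(int), sizeof(long), sizeof(void *), sizeof(size_t));
    /* prints 4 8 8 8 under LP64, 4 4 4 4 under ILP32 */

    char *p = malloc(16);
    /* int  addr = (int)p;      BAD under LP64: the pointer is truncated to 32 bits */
    long addr = (long)p;     /* OK under LP64: long is wide enough to hold a pointer */

    printf("%lx\n", (unsigned long)addr);
    free(p);
    return 0;
}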

V. DESIGN/IMPLEMENTATION CHANGES FOR 64-BIT ENHANCEMENTS

64BitCDF90 is able to perform 64-bit compilation correctly on a 64-bit platform with I:8 support after carrying out the following tasks, described in two phases as below:

A. Migration from 32-bit to 64-bit

Major porting concerns can be summarized as below:

Pointer size changes to 64-bit. All direct or implied assignments or comparisons between “integer” and “pointer” values have been examined and removed.

Long size changes to 64 bits. All casts added to make the compiler accept assignment and comparison between "long" and "integer" have been examined to ensure validity.

Code has been updated to use the new 64-bit APIs and hence the executable generated after 64-bit compilation is 64-bit compliant.

Macros depending on 32-bit layout have been adjusted for the 64-bit environment.

A variety of other issues, such as data truncation arising from sign extension, memory allocation sizes, shift counts, array offsets, and other factors, have to be handled with extreme care.

The user has the option to select between the 32-bit and 64-bit APIs. If a 64-bit compiler flag is used while compiling (for example, the '-maix64' flag for backend compilation with gcc 4.2), then the 64-bit API is linked and the 64-bit object file format is generated. If the user does not supply any 64-bit flag, then by default the 32-bit API is linked to the application and the 32-bit object file format is generated. Code has been added in the compiler to select either the 32-bit or the 64-bit API depending upon whether the user has supplied 64-bit specific flags or not, as sketched below.
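A minimal sketch of how such a selection might look inside the compiler driver is given below; the function and archive names (select_library, libcdf90_64.a, libcdf90_32.a) are hypothetical and only illustrate the decision described above, not the actual CDF90 code.

#include <string.h>

/* Hypothetical helper: choose the library archive to link against,
 * depending on whether a 64-bit specific flag such as -maix64 was supplied. */
static const char *select_library(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-maix64") == 0)
            return "libcdf90_64.a";   /* link the 64-bit API, emit 64-bit XCOFF */
    }
    return "libcdf90_32.a";           /* default: 32-bit API and 32-bit XCOFF   */
}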

64-bit library files are generated by compiling the source code using gcc 4.2 in 64-bit mode with the '-maix64' option on the AIX5.3/Power5 platform to produce 64-bit XCOFF file formats. These 64-bit object files are passed to the 'ar' tool with the '-X64' flag to produce 64-bit library archives.

The compiler code, which is written in C, is compiled using gcc 4.2 on the 64-bit architecture (AIX5.3/Power5) without any 64-bit specific flags, and by default this generates a 32-bit executable for the CDF90 compiler. The same CDF90 executable can be linked to the 64-bit library API to compile applications passed to it in 64-bit mode and to generate object files in the 64-bit XCOFF file format. The same 32-bit CDF90 executable can also be linked to the 32-bit API to compile applications in 32-bit mode and to generate the 32-bit XCOFF object file format on the same 64-bit platform (AIX5.3/POWER5).

The 32-bit library archives are generated by compilation with gcc 4.2 followed by archive file generation with the 'ar' tool, without any 64-bit specific flags, on the AIX5.3 platform with the PowerPC_Power5 architecture.

B. Adding I:8 Support in 64BitCDF90

The 32-bit CDF90 compiler supports the following KIND values for Integer data types in a Fortran77/90 program.

TABLE 1
KIND VALUES FOR INTEGERS

KIND Value    Size in bytes
    1               1
    2               2
    3               4

The compiler has been enhanced to allow KIND value 4 for which the size of the Integer is 8 bytes. Modifications are performed in the following modules to achieve the desired results.

Lexer: Code has been added to identify INTEGER*8 as a valid token.

Parser: The parser code has been modified so that it correctly adds the (I:8) symbol to the AST generated after the parser phase. All other programming constructs, like functions, macros, data types etc., dealing with 8-byte integer size are also updated in the parser module so that they transform into the correct AST structure.

Translator: If the integer variable KIND value is 4 in an input Fortran77/90 program, then the translator should be able to declare and translate the symbol into the corresponding C code symbol. Corresponding code has been added in the translator module. Functionality has also been added in the translator module to correctly convert implicit FORTRAN library functions dealing with KIND value 4 integer data types to their corresponding C library function names, based on conditional type checking for 8-byte integer data types. After successful conversion, the function call is dumped into an intermediate .c file. The C functions dealing with I:8 data types are called from the CDF90 libraries.


E.g. the translator is now able to internally translate the implicit matmul(a, b) function, where a and b are matrices with 8-byte integer elements, into _matmul_i8i8_22(long long **a, long long **b), an intermediate C library function that carries out the actual multiplication. This translation is performed on the basis of integer size conditional checking. Most of these changes have been carefully debugged with the help of gdb6.5.

I/O Library: There are two libraries supported by 64BitCDF90. One library is used for 32-bit mode and the other is linked for 64-bit mode compilation. Most of the library functions in the 32-bit CDF90 library have been modified or added to handle 64-bit integer data types, and hence to create the 64-bit CDF90 library. The 64-bit CDF90 library can therefore handle 8-byte integers for most of the functions it contains. The translator module generates the corresponding C code that invokes these library functions, which handle the 8-byte integer data size. E.g. the

_matmul_i8i8_(long long **a, long long **b)

library function has been added in the CDF90 library to compute matrix-matrix multiplications whose elements are 8-byte integers.

Building Libraries: The 32-bit library archives have been created by compiling the source code in 32-bit mode, while the 64-bit library archives have been prepared by compiling the source code in 64-bit mode on the 64-bit platform. More specifically, the 64-bit libraries have been prepared by compiling the source code using gcc4.2 with the '-maix64' flag and then passing the objects to the archive tool 'ar' with the '-X64' flag to create the 64-bit library archives for 64BitCDF90.
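To make the role of the 8-byte integer type concrete, the self-contained C sketch below multiplies two small matrices of long long elements, the C type that KIND value 4 integers map to in the generated code. It is only an illustration under that assumption; it does not reproduce the actual _matmul_i8i8_ routine from the CDF90 library.

#include <stdio.h>

/* Illustrative stand-in for an 8-byte integer matrix multiply:
 * c = a * b for n x n matrices of long long elements. */
static void matmul_i8(int n, long long a[][2], long long b[][2], long long c[][2])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            long long s = 0;
            for (int k = 0; k < n; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

int main(void)
{
    long long a[2][2] = {{1, 2}, {3, 4}};
    long long b[2][2] = {{5, 6}, {7, 8}};
    long long c[2][2];

    matmul_i8(2, a, b, c);
    printf("%lld %lld\n%lld %lld\n", c[0][0], c[0][1], c[1][0], c[1][1]);
    return 0;
}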

VI. CODE OPTIMIZATIONS: ENERGY EFFICIENT METHODS OFFERED BY CDF90

The energy cost of a program depends upon the number of memory accesses [15]. A large number of memory accesses results in high energy dissipation. Energy efficient compilers reduce memory accesses and thereby reduce the energy consumed by these accesses. One approach is to reduce LOAD and STORE instructions by keeping data in registers or cache memory, providing faster and more efficient computation and reducing CPU cycles significantly. According to Kremer [22], "Traditional compiler optimizations such as common subexpression elimination, partial redundancy elimination, strength reduction, or dead code elimination increase the performance of a program by reducing the work to be done during program execution". There are also several other compiler strategies for energy reduction, such as compiler assisted Dynamic Voltage Scaling [20], instruction scheduling to reduce functional unit utilization, memory bank allocation, and inter-program optimizations [19], which are being tried by researchers. Reducing the power dissipation of processors is also being attempted at the hardware level and in the operating system. Examples include power aware techniques for on-chip buses, transactional memory, memory banks with low power modes [16,17] and workload adaptation [18,23]. However, energy reduction through the compiler promises some advantages: no overhead at execution time, the ability to assess 'future' program behavior through aggressive whole program analysis, and the ability to identify optimizations and make code transformations for reduced energy usage [24].

A. Optimizations in CDF90 compiler

Programs compiled without any optimization generally run very slowly, hence take more CPU cycles and result in higher energy consumption. Usually a medium level of optimization (-O2 on many machines) typically leads to a speed-up by a factor of 2-3 without any significant increase in compilation time. The different types of optimizations performed by CDF90 to improve program performance are listed below.

i. Common Sub-expression Elimination

The compiler takes a common sub-expression out of several expressions and calculates it only once instead of several times:

t1=a+b-c
t2=a+b+c

The above two statements will be reduced by an optimizing compiler into the following statements:

t=a+b
t1=t-c
t2=t+c

Though this approach may not exhibit any significant performance gain for smaller expressions, for bigger expressions it shows a noticeable performance improvement.

ii. Strength Reduction

Strength reduction replaces an existing arithmetic expression by an equivalent expression that can be evaluated faster. A simple example is replacing 2*i by i+i, since integer addition is faster than integer multiplication.
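A small C sketch of the same idea, with hypothetical arrays and function names, is shown below:

/* Before strength reduction: a multiplication on every iteration. */
void fill_even(int n, int a[])
{
    for (int i = 0; i < n; i++)
        a[i] = 2 * i;
}

/* After strength reduction: the multiplication 2*i is replaced by a
 * running value that is updated with the cheaper addition t = t + 2. */
void fill_even_reduced(int n, int a[])
{
    int t = 0;
    for (int i = 0; i < n; i++) {
        a[i] = t;
        t = t + 2;
    }
}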

iii. Loop Invariant Code

Code that does not depend on the loop iteration is moved out of the loop and calculated only once, instead of being recalculated on every iteration. For example, the following FORTRAN code

do i=1,n
  a(i)=m*n*b(i)
end do

will be replaced by an optimizing compiler with

t=m*n
do i=1,n
  a(i)=t*b(i)
end do

iv. Constant Value Propagation

An expression involving several constant values will be calculated and replaced by a new constant value. For example:

x=2*y

z=3*x*s

will be transformed into the following code by an optimizing compiler

z=6*y*s

v. Register Allocation and Instruction Scheduling

This particular optimization is both the most difficult and the most important. Since CDF90 depends upon gcc for backend compilation, it is left entirely to the backend compiler.

B. Code Transformation Criteria and Applied Techniques

CDF90 offers the following loop optimization techniques.

i. Loop Interchange

Loop interchange is the process of exchanging the order of two iteration variables. It can often be used to enhance the performance of code on parallel or vector machines. Determining when loops may be safely and profitably interchanged requires a study of the data dependences in the program. A loop interchange mechanism has been implemented in CDF90. It contains a function which checks whether interchanging of loops can be done: the loops can be interchanged only if they are perfectly nested, no loop bound depends on the index variable of another loop, and the loops are totally independent, so that no invalid computation results after interchanging. Based on the outcome of this check, loop interchanging is performed by CDF90 supported functions. For example, consider the FORTRAN matrix addition loop below:

DO I = 1, N
  DO J = 1, M
    A(I, J) = B(I, J) + C(I, J)
  ENDDO
ENDDO

The loop accesses the arrays A, B and C row by row, which, in FORTRAN, is very inefficient. Interchanging the I and J loops, as shown in the following example, facilitates column by column access.

DO J = 1, M
  DO I = 1, N
    A(I, J) = B(I, J) + C(I, J)
  ENDDO
ENDDO


ii. Loop Vectorization

Loop vectorization is the conversion of loops from a non-vectored form to a vectored form. A vectored form is one in which the same operation is applied to all members of a range (SIMD) without dependencies. Ideally the ranges should be in contiguous memory areas, because this makes it easy for the processor to know where to apply the computation next; conceptually, however, it could be any set of data. The only built-in data types in C and FORTRAN that occupy contiguous memory are arrays.
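For example, a loop of the following C form applies the same multiply-add to every element of contiguous arrays with no cross-iteration dependency, which is exactly the shape a vectorizer looks for (the arrays and names here are illustrative):

/* Each iteration is independent and the data is contiguous in memory,
 * so the body can be applied to whole vector chunks at a time (SIMD). */
void scale_add(int n, float a, const float x[], float y[])
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}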

iii. Loop Merging

Loop merging is another technique implemented by CDF90 to reduce loop overhead. When two adjacent loops iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data, as sketched below.
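A small C sketch of the transformation over hypothetical arrays:

/* Two adjacent loops with the same trip count and no references to
 * each other's data ... */
void separate(int n, int a[], const int b[], int c[], const int d[])
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1;
    for (int i = 0; i < n; i++)
        c[i] = d[i] * 2;
}

/* ... can be merged into one loop, halving the loop-control overhead. */
void merged(int n, int a[], const int b[], int c[], const int d[])
{
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 1;
        c[i] = d[i] * 2;
    }
}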

iv. Loop Unrolling

Loop unrolling duplicates the body of the loop multiple times in order to decrease the number of times the loop condition is tested and the number of jumps, both of which can degrade performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all loop overhead (except for multiple instruction fetches and increased program load time), but requires that the number of iterations be known at compile time. The loop unrolling pass is designed to unroll loops for parallelizing and optimizing compilers. To illustrate, consider the following loop:

for (i = 1; i <= 60; i++)
    a[i] = a[i] * b + c;

This FOR loop can be transformed into the following equivalent loop consisting of multiple copies of the original loop body:

for (i = 1; i <= 60; i += 3)
{
    a[i]   = a[i]   * b + c;
    a[i+1] = a[i+1] * b + c;
    a[i+2] = a[i+2] * b + c;
}

The loop is said to have been unrolled twice (the body now appears three times), and the unrolled loop runs faster because of the reduction in loop overhead. Loop unrolling was initially developed for reducing loop overhead and for exposing instruction level parallelism on machines with multiple functional units. Loop unrolling is limited by the number of registers available: if too few registers are available, the number of LOAD/STORE instructions, and hence memory accesses, increases. In case loop unrolling leads to performance loss because of frequent LOAD/STORE instructions, the loop is split into two, a procedure called loop fission. This can be done in a straightforward manner if two independent code segments are present in the loop.

v. Loop Distribution

Loop distribution is used for transforming a sequential program into a parallel one. It attempts to break a loop into multiple loops over the same index range, each taking only a part of the original loop's body, as sketched below. This can improve locality of reference, both for the data being accessed in the loop and for the code in the loop's body.
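A C sketch of the transformation, again with hypothetical arrays:

/* Original loop: the body does two unrelated pieces of work. */
void fused(int n, int a[], int b[], const int x[], const int y[])
{
    for (int i = 0; i < n; i++) {
        a[i] = x[i] + 1;
        b[i] = y[i] * 2;
    }
}

/* After distribution: two loops over the same index range, each
 * carrying only a part of the original body. */
void distributed(int n, int a[], int b[], const int x[], const int y[])
{
    for (int i = 0; i < n; i++)
        a[i] = x[i] + 1;
    for (int i = 0; i < n; i++)
        b[i] = y[i] * 2;
}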

vi. Function Inlining

Function inlining is a powerful high level optimization which eliminates call cost and increases the chances of other optimizations taking effect by breaking down call boundaries. By declaring a function inline, you can direct CDF90 to integrate that function's code into the code of its callers. This makes execution faster by eliminating the function-call overhead. The effect on code size is less predictable; object code may be larger or smaller with function inlining, depending on the particular case. A C illustration follows.
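In C the same effect is requested with the inline hint, as in this small sketch (the function names are illustrative, not part of CDF90):

/* Declaring the function inline invites the compiler to substitute its
 * body directly at each call site, removing the call/return overhead. */
static inline long long square(long long v)
{
    return v * v;
}

long long sum_of_squares(int n, const long long a[])
{
    long long s = 0;
    for (int i = 0; i < n; i++)
        s += square(a[i]);   /* candidate for inlining */
    return s;
}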

VII. TESTING 64BitCDF90 ON 64-BIT ARCHITECTURE

Compilers are used to generate software for systems where correctness is important [4] and testing [2] is needed for quality control and error detection in compilers.

A. Challenges

One of the challenges encountered in the CDF90 project was the non-availability of a free (open source), standard test suite to test the 64-bit compiler (especially the I:8 checks). Hence a test suite covering all the functionalities of Fortran 77/90 was developed (DPTESTCASES). This was not an easy task, as the test suite developer needed to be aware of all the features of the language.

B. Test Suites Used

Two suites of test programs, FCVS [8] and DPTESTCASES, are used. Whenever the compiler is modified, the test programs are compiled using both the new and old versions of the compiler, and any differences in the target programs' output are reported back to the development team.

i. FORTRAN Compiler Validation System (FCVS)

FCVS, developed by NIST, is a standard test suite for FORTRAN compiler validation and is used to validate 64BitCDF90 against FORTRAN77/90 applications. Script files are written to compile and execute every test case present in the test suite. After running the scripts one can see the result/status for FCVS.

ii. DPTESTCASES

Unfortunately FCVS does not contain test cases to test 8-byte integer support in the various FORTRAN language constructs. Hence a large set of small and specific test cases that cover the complete set of language constructs with 8-byte integer support was developed, called DPTESTCASES. These tests verify each compiler construct with 8-byte integer support. DPTESTCASES enables testing of 8-byte integer support for various FORTRAN language features like data types and declarations, specification statements, control statements, I/O statements, intrinsic functions, subroutine functions, and boundary checks for integer constants. Shell scripts are written to compile all test cases, execute them and check the compiler warnings and error messages against the expected results of DPTESTCASES. The test package has been modified in several ways, like supplying large values in suitable test cases for boundary value analysis and checking for correct results. This approach to testing proved to be helpful in verifying compiler capabilities.

C. Testing Approach

The approach followed for testing 64BitCDF90 can be described by the following main points.

i. Testing of 64BitCDF90 in 32-bit mode on a 64-bit Machine

Execute each test case with the 64BitCDF90 compiler in 32-bit mode and check for correct results. This check is required to see whether the modifications performed in the compiler code for 64-bit compatibility have resulted in any adverse side effects. In 32-bit mode the compiler internally uses 32-bit library archives and generates 32-bit XCOFF on the AIX5.3 platform. FCVS is used for testing in 32-bit mode.

ii. Testing of 64BitCDF90 in 64-bit mode on a 64-bit Machine

Execute each test case with the enhanced CDF90 compiler in 64-bit mode and check for correct results. This check is required to see whether the enhanced compiler successfully compiles each test case in the 64-bit environment. Both FCVS and DPTESTCASES are used to test 64-bit compatibility and 8-byte integer support in the various FORTRAN language constructs. In 64-bit mode the compiler internally uses 64-bit library archives and generates 64-bit XCOFF on the AIX5.3/Power5 platform.

iii. White Box Testing

White-box testing is the execution of the maximum number of accessible code branches with the help of a debugger or other means; the greater the code coverage achieved, the more complete the testing [13]. For the large code base of 64BitCDF90 it is not easy to traverse the full code, hence white-box testing methods are used when a test case uncovers an error and the cause needs to be located.

iv. Black Box Testing

Unit testing may be treated as black-box testing. The main idea of the method is to write a set of tests for the separate modules and functions of Fortran 90, which together exercise all the main constructs of Fortran.

v. Functional Testing

64BitCDF90 was tested along with the original CDF90 on the same test cases with the intent of finding defects, demonstrating that defects are not present, verifying that each module performs its intended functions with integers of up to 8 bytes (Fortran supports integers of 1, 2, 4 and 8 bytes), and establishing confidence that the compiler does what it is supposed to do.

vi. Regression Testing

Similar in scope to a functional test, a regression test allows a consistent, repeatable validation of each new release of the compiler against new requirements. Such testing ensures that reported compiler defects have been corrected in each new release and that no new quality problems were introduced in the maintenance process. Regression testing can be performed manually.

vii. Static Analysis of the CDF90 Code

The code size being very large, we needed a suitable static analyser which supports the LP64 data model for 64-bit compilation on the 64-bit platform. We found lint [12] most suitable; it offers the following advantages:

Warns about incorrect, error-prone or nonstandard code that the compiler does not necessarily flag, and points out potential bugs and portability problems.

Assists in improving the source code's effectiveness, including reducing its size and required memory.

Out of the various options provided by lint, -errchk=longptr64 in particular has proved to be very helpful for migrating CDF90 to the 64-bit platform.

D. Testing Statistics

i. 32-bit Compilation

FCVS is compiled by 64BitCDF90 in 32-bit mode on the 64-bit machine (AIX5.3/Power5). 198 out of the total 200 test cases are successful; the two failing test cases are being worked on. DPTESTCASES, consisting of 210 test cases, is compiled successfully with 64BitCDF90 and produces correct results.

ii. 64-bit Compilation

FCVS is compiled in 64-bit mode by setting the 64-bit flag (the '-maix64' flag when using gcc4.2 for backend compilation) to obtain 64-bit executable files. The same two test cases mentioned above are unsuccessful in 64-bit mode as well. All the test cases present in DPTESTCASES compile successfully and produce correct results.

VIII. CONCLUSIONS

Major hardware vendors have recently shifted to 64-bit processors because of the performance, precision, and scalability that 64-bit platforms can provide. The constraints of 32-bit systems, especially the 4GB virtual memory ceiling, have spurred companies to consider migrating to 64-bit platforms. This paper presents some important issues faced while porting a compiler from 32-bit to 64-bit, such as adding a new language feature (the I:8 data type) to an existing compiler and ensuring that this feature is supported by the range of already existing library functions. The testing methodology is elaborated. Performance optimizations contributing to the energy efficiency of the generated code are explained in the paper. The authors present the traditional optimizations implemented in CDF90; other energy-saving optimizations are being explored and shall be presented in future reports.

REFERENCES

[1] Alfred V. Aho and Jeffrey D. Ullman, "Principles of Compiler Design", Tenth Indian Reprint, Pearson Education, 2003.
[2] Jiantao Pan, "Software Testing", 18-849b Dependable Embedded Systems, Carnegie Mellon University, Spring 1999. http://www.ece.cmu.edu/~koopman/des_s99/sw_testing/
[3] R. A. DeMillo, E. W. Krauser and A. P. Mathur, "An Overview of Compiler-integrated Testing", Australian Software Engineering Conference 1991: Engineering Safe Software; Proceedings, 1991.
[4] A. S. Kossatchev and M. A. Posypkin, "Survey of Compiler Testing Methods", Programming and Computer Software, Vol. 31, No. 1, 2005, pp. 10-19.
[5] Harsha S. Adiga, "Porting Linux applications to 64-bit systems", 12 Apr 2006. http://www.ibm.com/developerworks/library/l-port64.html
[6] http://cdac.in/html/ssdgblr/f90ide.asp
[7] http://www.cocolab.com/en/cocktail.html
[8] http://www.fortran2000.com/ArnaudRecipes/cvs21_f95.html
[9] Stephen C. Johnson, "Yacc: Yet Another Compiler-Compiler", July 31, 1978. http://www.cs.man.ac.uk/~pjj/cs211/yacc/yacc.html
[10] Cathleen Shamieh, "Understanding 64-bit PowerPC architecture", 19 Oct 2004. http://www.ibm.com/developerworks/library/pamicrodesign/
[11] Steven Nakamoto and Michael Wolfe, "Porting Compilers & Tools to 64 Bits", Dr. Dobb's Portal, August 01, 2005.
[12] "lint Source Code Checker", C User's Guide, Sun Studio 11, 819-3688-10.
[13] Andrey Karpov and Evgeniy Ryzhkov, "Traps detection during migration of C and C++ code to 64-bit Windows". http://www.viva64.com/content/articles/64-bit-development/?f=TrapsDetection.html=en&content=64-bit-development
[14] http://www.itl.nist.gov/div897/ctg/fortran_form.htm
[15] Stefan Goedecker and Adolfy Hoisie, "Performance Optimization of Numerically Intensive Codes", Society for Industrial and Applied Mathematics (Cambridge University Press), 1987.
[16] Tali Moreshet, R. Iris Bahar and Maurice Herlihy, "Energy Reduction in Multiprocessor Systems Using Transactional Memory", ISLPED'05, USA.
[17] Cao Y., Okuma and Yasuura, "Low-Energy Memory Allocation and Assignment Based on Variable Analysis for Application-Specific Systems", IEIC Technical Report, pp. 31-38, Japan, 2002.
[18] Changjiu Xian and Yung-Hsiang Lu, "Energy reduction by workload adaptation in a multi-process environment", Proceedings of the Conference on Design, Automation and Test in Europe, 2006.
[19] J. Hom and U. Kremer, "Inter-Program Optimizations for Conserving Disk Energy", International Symposium on Low Power Electronics and Design (ISLPED'05), San Diego, California, August 2005.
[20] C-H. Hsu and U. Kremer, "Compiler-Directed Dynamic Voltage Scaling Based on Program Regions", Rutgers University Technical Report DCS-TR461, November 2001.
[21] Wei Zhang, "Compiler-Directed Data Cache Leakage Reduction", IEEE Computer Society Annual Symposium on VLSI, ISVLSI'04.
[22] U. Kremer, "Low Power/Energy Compiler Optimizations", in Low-Power Electronics Design (Editor: Christian Piguet), CRC Press, 2005.
[23] Majid Sarrafzadeh, Prithviraj Banerjee, Alok Choudhary and Andreas Moshovos, "PACT: Power Aware Compilation and Architectural Techniques", California University LA, Dept of Computer Science.
[24] U. Kremer, "Compilers for Power and Energy Management", Tutorial, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'03), San Diego, CA, June 2003.

ADCOM 2009 SERVER VIRTUALIZATION

Session Papers:

1. J. Lakshmi, S. K. Nandy, "Is I/O Virtualization ready for End-to-End Application Performance?" – INVITED PAPER

2. S. Prakki, "Eco-friendly Features of a Data Centre OS" – INVITED PAPER


Is I/O virtualization ready for End-to-End Application Performance?

J. Lakshmi, S. K. Nandy Indian Institute of Science, Bangalore, India {jlakshmi,nandy}@serc.iisc.ernet.in

Abstract: Workload consolidation using the system virtualization feature is the key to many successful green initiatives in data centres. In order to exploit the available compute power, such systems warrant sharing of other hardware resources like memory, caches, I/O devices and their associated access paths among multiple threads of independent workloads. This mandates the need for ensuring end-to-end application performance. In this paper we explore the current practices for I/O virtualization, using sharing of the network interface card as the example, with the aim of studying the support for end-to-end application performance guarantees. To ensure end-to-end application performance and limit the interference caused by the sharing of devices, we present an evaluation of a previously proposed end-to-end I/O virtualization architecture. The architecture is an extension to the PCI-SIG IOV specification of I/O hardware to support reconfigurable device partitions and uses the VMM-bypass technique for device access by the virtual machines. Simulation results of the architecture for application quality of service guarantees demonstrate the flexibility and scalability of the architecture.

Keywords: Multicore server virtualization, I/O virtualization architectures, QoS, Performance.

I. Introduction

Multi-core servers have brought tremendous computing capacity to commodity systems. These multi-core servers have not only prompted applications to use fine-grained parallelism to take advantage of the abundance of CPU cycles, they have also initiated the coalescing of multiple independent workloads onto a single server. Multicore servers combined with system virtualization have led to many successful green initiatives of data centre workload consolidation. This consolidation, however, needs to satisfy end-to-end application performance guarantees. Current virtualization technologies have evolved from the prevalent single-hardware single-OS model, which presumes the availability of all other hardware resources to the currently scheduled process. This causes performance interference among multiple independent workloads sharing an I/O device, depending on the individual workloads. Major efforts towards consolidation have focussed on aggregating the CPU cycle requirement of the target workloads. But I/O handling of these workloads on the consolidated servers results in sharing of the physical resources and their associated access paths. This sharing causes interference that is dependent on the consolidated workloads and makes the application performance non-deterministic [1][2][3]. In such scenarios, it is essential to have appropriate mechanisms to define, monitor and ensure resource sharing policies across the various contending workloads. Many applications, like real-time hybrid voice and data communication systems onboard aircraft and naval vessels, streaming and on-demand video delivery, and database and web-based services, when consolidated onto virtualized servers, need to support soft real-time application deadlines to ensure performance.

Standard I/O devices are not virtualization aware and hence their virtualization is achieved using a software layer that multiplexes device access among independent VMs. In such cases I/O device virtualization is commonly achieved in two basic modes, namely para-virtualization and emulation [4]. In para-virtualization the physical device is accessed and controlled from a protected domain, which could be the virtual machine monitor (VMM) itself or an independent virtual machine (VM), also called the independent driver domain (IDD) as in Xen. The VMM or IDD actually does the data transfer to and from the device into its I/O address space using the device's native driver. From there the copy or transfer of the data to the VM's address space is done using what is commonly called the para-virtualized device driver. The para-virtualized driver is written to support a specific mechanism of data transfer between the VMM/IDD and the VM and needs a change in the OS of the VM (also called the GuestOS).

In emulation, the GuestOS of the VM installs a device driver for the emulated virtual device. All the calls of this emulated device driver are trapped by the VMM and translated to the native device driver's calls. The advantage of emulation is that it allows the GuestOS to be unmodified and is hence easier to adopt. However, para-virtualization has been found to perform much better than emulation, because emulation results in per-instruction translation whereas para-virtualization involves only page-address translation. But both these modes of device virtualization impose resource overheads when compared to non-virtualized servers. These overheads translate into application performance loss.

The second drawback of the existing architectures is their lack of sufficient Quality of Service (QoS) controls to manage device usage on a per-VM basis. A desirable feature of these controls is that they should guarantee application performance with the specified QoS on the shared device, and this performance should be unaffected by the workloads sharing the device. The other desirable feature is that the unused device capacity should be available for use by the other VMs. Prevalent virtualization technologies like Xen and Vmware, and even standard Linux distributions, use a software layer within the network stack to implement NIC usage policies. Since these systems were built with the assumption of the single-hardware single-OS model, these features provide the required control on the outgoing traffic from the NIC of the server. The issue is with the incoming traffic. Since the policy management is done above the physical layer, ingress traffic accepted by the device is later dropped based on input stream policies. This results in the respective application not receiving the data, which perhaps satisfies the application QoS, but causes wasted use of the device bandwidth, which affects the delivered performance of all the applications sharing the device. Also, it leads to non-deterministic performance that varies with the type of applications using the device. This model is insufficient for a virtualized server supporting sharing of the NIC across multiple VMs. In this paper we describe and evaluate an end-to-end I/O virtualization architecture that addresses these drawbacks.

The rest of the paper is organized as follows. Section II presents experimental results on existing virtualization technologies, namely Xen and Vmware, that motivate this work; section III then describes an end-to-end I/O virtualization architecture to overcome the issues raised in section II; section IV details the evaluation of the architecture and presents the results; section V highlights the contributions of this work with respect to existing literature and section VI details the conclusions.

II. Motivation

Existing I/O virtualization architectures use extra CPU cycles to fulfill an equivalent I/O workload. These overheads reflect in the achievable application performance, as depicted by the graph of Figure 1. The data in this graph represents the achievable throughput of the httperf [5] benchmark hosted on a non-virtualized server and on virtualized Xen [6] and Vmware-ESXi [7] servers. In each case the http server was hosted on a Linux (FC6) OS and, for the virtualized servers, the hypervisor, IDD (Xen) [8] and the virtual machine were pinned to the same physical CPU. The server used was a dual core Intel Core2Duo system with 2GB RAM and a 10/100/1000Mbps NIC. In the Xen hypervisor the virtual NIC used by the VM was configured to use a para-virtualized device driver implemented using the event channel mechanism and a software bridge for creating virtual NICs. In the case of the Vmware hypervisor the virtual NIC used inside the VM was configured using a software switch with access to the device through emulation.


Figure 1: httperf benchmark throughput graph for non-virtualized, Xen and Vmware-ESXi virtual machines hosting an http server on Linux (FC6). 1

As can be observed from the graphs of Figure 1, the sustainable throughput of the benchmark drops considerably when the http server is moved from the non-virtualized server to a virtualized one. The reason for this drop is answered by the CPU utilization graph depicted in Figure 2. From the graphs we notice that on moving the http server from the non-virtualized to the virtualized server, the %CPU utilization needed to support the same httperf workload increases significantly, and this increase is substantial for the emulated mode of device virtualization. The reason for this increased CPU utilization is the I/O device virtualization overhead.

Further, when the same server is consolidated with two VMs sharing the same NIC, each supporting one stream of an independent httperf benchmark, there is a further drop in achievable throughput per VM. This is explicable since each VM now contends for the same NIC. The virtualization mechanisms not only share the device but also the device access paths. This sharing causes serialization, which leads to latencies and application performance loss that depend on the nature of the consolidated workloads. Also, the increased latencies in supporting the same I/O workload on the virtualized platform cause loss of usable device bandwidth, which further reduces the scalability of device sharing by multiple VMs.

1 Some data of the graph has been reused from [9][10]


Figure 2: CPU resource utilized by the http server to support httperf benchmark throughput. 1

This scalability can be improved to some extent by pinning different VMs to independent cores and using a high speed, high bandwidth NIC. Still, the high virtualization overheads coupled with serialization due to shared access paths restrict device sharing scalability.

The next study evaluates the NIC specific QoS controls existing in Xen and Vmware. Since current NICs do not support QoS controls, these are provided by the OS managing the device. In either case, these controls are implemented in the network stack above the physical device. Because of this they are insufficient, as is displayed by the graphs in Figure 3 and Figure 4.


Figure 3: httperf achievable throughput on a Vmware-ESXi consolidated server with NIC sharing and QoS guarantees.




Figure 4: httperf achievable throughput on a Xen consolidated server with NIC sharing and QoS guarantees.

Each of the graphs shows the maximum achievable throughput by VM1 when VM2 is constrained by a specified QoS guarantee. This guarantee is the maximum throughput that VM2 should deliver and is implemented using the network QoS controls available in the Xen and Vmware servers. In Xen these QoS controls are implemented using the tc utilities of the netfilter module of the Linux OS of the Xen-IDD. In Vmware, Veam enabled network controls are used. VM1 and VM2 are two different VMs hosted on the same server sharing the same NIC. We observe that for the unconstrained VM, in this case VM1, the maximum achievable throughput does not exceed 200 for Vmware and 475 for Xen. This is considerably low when compared to the maximum achievable throughput for a single VM using the NIC. The reason is that the constrained VM, namely VM2, is still receiving all requests. VM2 is also processing these requests and generating appropriate replies, which results in CPU resource consumption. Only some replies to the received requests are dropped, based on the currently applicable QoS on the usable bandwidth. This is because both Vmware and Xen support QoS controls on the egress traffic at the NIC. This approach of QoS control on resource usage is wasteful and coarse-grained. As can be observed, as the constraint on VM2 is relaxed, the behaviour of NIC sharing approaches best effort and the resulting throughput achievable by either VM is obviously less than what can be achieved when a single VM is hosted on the server. These graphs clearly demonstrate the insufficiency of the existing QoS controls.

From the above experiments we conclude the following drawbacks in the existing I/O virtualization architectures:

Building hypervisors or VMMs using the single-hardware single-OS model leads to cohesive architectures with high virtualization overheads. Because the virtualization overheads are high, usable device bandwidth is lost, which often results in under-utilized resources and limited consolidation ratios, particularly for I/O workloads. The remedy to this approach is to build I/O devices that are virtualization aware and to decouple device management from device access, i.e., provide native access to the I/O device from within the VM and let the VMM manage concurrency issues rather than ownership issues.

Lack of fine-grained QoS controls on device sharing causes performance loss which depends on the workloads of the VMs sharing the device. This leads to scalability issues in sharing the I/O device. To address this, the I/O device should support QoS controls for both the incoming and outgoing traffic.

To overcome the above drawbacks, we propose an I/O virtualization architecture. This architecture proposes an extension to the PCI-SIG IOV specification [11] for virtualization enabled hardware I/O devices with a VMM-bypass [12] mechanism for virtual device access.

III. End-to-End I/O Virtualization Architecture

We propose an end-to-end I/O virtualization architecture that enables direct or native access to the I/O device from within the VM, rather than accessing it through the layer of the VMM or IDD. The PCI-SIG IOV specification proposes virtualized I/O devices that can support native device access by the VM, provided the hypervisor is built to support such architectures. IOV specified hardware can support multiple virtual devices at the hardware level. The VMM needs to be built such that it can recognize and export each virtual device to an independent VM, as if the virtual device were an independent physical device. This allows native device access to the VM. When a packet hits the hardware virtualized NIC, the VMM recognizes the destination VM of the incoming packet by the interrupt raised by the device and forwards it to the appropriate VM. The VM processes the packet as it would in a non-virtualized environment. Here, device access and the scheduling of device communication are managed by the VM that is using the device. This eliminates the intermediary VMM/IDD on the device access path and reduces I/O service time, which improves the usable device bandwidth and application throughput.

To support the idea of QoS based on device usage, we extend the IOV architecture specification by enabling reconfigurable memory on the I/O device. For each of the virtual devices defined on the physical device, the device memory associated with the virtual device is derived from the QoS requirement of the VM to which the virtual device is allocated. This, along with features like TCP offload, virtual device priority and bandwidth specification support at the device level, provides fine-grained QoS controls at the device while sharing it with other VMs, as is elaborated upon in the evaluation section.

Figure 5 gives a block schematic of the proposed I/O virtualization architecture. 2 The picture depicts a NIC card that can be housed within a virtualized server. The card has a controller that manages the DMA transfer to and from the device memory. The standard device memory is replaced by a re-partitionable memory supported with n sets of device registers. A set of m memory partitions, where m ≤ n, with device registers forms the virtual-NICs (vNICs). Ideally the device memory should be reconfigurable, i.e. dynamically partitionable, and the VM's QoS requirements would drive the sizing of the memory partition. The advantage of having a dynamically partitionable device memory is that any unused memory can easily be added to or removed from a vNIC in order to match adaptive QoS specifications. The NIC identifies a vNIC request by generating message signaled interrupts (MSI). The number of interrupts supported by the controller restricts the number of vNICs that can be exported. Based on the QoS guarantees a VM needs to honour, judicious use of native and para-virtualized access to the vNICs can overcome this limitation. A VM that has to support stringent QoS guarantees can choose native access to the vNIC, whereas VMs that need only best-effort NIC access can be allowed para-virtualized access to the vNIC.


Figure 5: NIC architecture supporting MSI interrupts with partitionable device memory, multiple device register sets and DMA channels enabling independent virtual-NICs.

2 This section is being reproduced from [10] to maintain continuity in the text. Complete architecture description with performance statistics on achievable application throughput can be found in [10].

The VMM can aid in setting up the appropriate hosting connections based on the requested QoS requirements.

The proposed architecture can be realized by the following modifications:

Virtual-NIC: In order to define a vNIC, the physical device should support time-sharing in hardware. For a NIC this can be achieved by using MSI and dynamically partitionable device memory. These form the basic constructs to define a virtual device on a physical device, as depicted in Figure 5. Each virtual device has a specific logical device address, like the MAC address in the case of NICs, based on which the MSI is routed. Dedicated DMA channels, a specific set of device registers and a partition of the device memory are part of the virtual device interface which is exported to a VM when it is started. We call this virtual interface the virtual-NIC or vNIC; it forms a restricted address space on the device for the VM to use and remains in the possession of the VM until the VM terminates or relinquishes the device.

Accessing the virtual-NIC: For accessing the virtual-NIC, a native device driver is hosted inside the VM and is initialized with the help of the VMM when the VM is initialized. This device driver can only manipulate the restricted device address space which was exported through the vNIC interface by the VMM. With the vNIC, the VMM only identifies and forwards the device interrupts to the destination VM. The OS of the VM now handles the I/O access and thus can be held accountable for the resource usage it incurs. This eliminates the performance interference due to the VMM/IDD handling multiple VMs' requests to and from a shared device. Also, because the I/O access is now directly done by the VM, the service time on the I/O access reduces, thereby resulting in better bandwidth utilization. With the vNIC interface, data transfer is handled by the VM. While initializing the device driver for the virtual NIC, the VM sets up the Rx/Tx descriptor rings within its address space and makes a request to the VMM to initialize the I/O page translation table. The device driver uses this table and performs DMA transfers directly into the VM's address space.

QoS and the virtual-NIC: The device memory partition acts as a dedicated device buffer for each of the VMs, and with appropriate logic on the NIC card one can easily implement QoS based SLAs on the device that translate to bandwidth restrictions and VM specific priority. The key is being able to identify the incoming packet with the corresponding VM, which the NIC is now expected to do. While communicating, the NIC controller decides whether to accept or reject the incoming packet based on the bandwidth specification or the virtual device's available memory. This gives fine-grained control on the incoming traffic and helps reduce the interference effects. The outbound traffic can be controlled by the VM itself using any of the mechanisms employed in the existing architectures.

IV. Architecture Evaluation for QoS controls

An LQN (Layered Queuing Network) model of the proposed architecture was built, and service times for the various entries of the model were obtained by using runtime profilers on an actual Xen-based virtualized server. Complete model building and validation details are available in [9][10]. Here we present the results of the QoS evaluation carried out using the LQN model [12] of the proposed architecture. The QoS experiments were conducted along the same lines as described in the introduction section. The difference now is that the QoS control is applied on the ingress traffic of the constrained VM, namely VM2. The results obtained are depicted in Figure 6. The proposed architecture allows for achieving higher application throughput on the shared NIC, firstly because of the VMM-bypass [12]. Also, as can be observed from the graphs, the control of ingress traffic in the case of the httperf benchmark shows a highly improved performance benefit for the unconstrained VM, namely VM1.


Figure 6: httperf throughput sharing on a QoS controlled, shared NIC between two VMs using the proposed architecture with throughput constraints applied on the ingress traffic of VM2 at the NIC.

For request-response benchmarks like httperf, controlling the ingress bandwidth is beneficial because once a request is dropped due to saturation of the allocated bandwidth, there is no downstream activity associated with it, and wasteful resource utilization of the NIC and CPU is avoided. The QoS control at the device on the input stream of VM2 and the native access to the vNICs by the VMs give the desired flexibility of making the unused bandwidth available to the unconstrained VM.

V. Related work

In early implementations, I/O virtualization adopted dedicated I/O device assignment to a VM. This later evolved to device sharing across multiple VMs through virtualized software interfaces [14][4]. A dedicated software entity, called the I/O domain, is used to perform physical device management. The I/O domain is either part of the VMM or is by itself an independent domain, like the IDD of Xen [8][15]. With this intermediary software layer between the device and the VM, any application in a VM seeking access to the device has to route the request through it. This architecture still builds over the single-hardware single-OS model [16]-[21]. The consequence of such virtualization techniques is visible in the loss of application throughput and usable device bandwidth on virtualized servers, as discussed earlier. Because of the poor performance of these I/O virtualization architectures, a need to build concurrent access to shared I/O devices was felt, and recent publications on concurrent direct network access (CDNA) [22][19] and a scalable self-virtualizing network interface describe such efforts. However, the scalable self-virtualizing interface [23] describes assigning a specific core for network I/O processing on the virtual interface and exploits multiple cores on embedded network processors for this. The paper does not detail how the address translation issues are handled, particularly in the case of virtualized environments. The CDNA architecture is similar to the proposal in this paper in terms of allowing multiple VM specific Rx and Tx device queues. But CDNA still builds over the VMM/IDD handling the data transfer to and from the device. Although the results of this work are exciting, the architecture still lacks the flexibility required to support fine-grained QoS. And the paper neither discusses the performance interference due to uncontrolled data reception by the device nor highlights the need for addressing the QoS constraints at the device level. The proposed architecture in this paper addresses these issues and also pushes the basic constructs for assigning QoS attributes, like required bandwidth and priority, into the device to get finer control on resource usage and on restricting performance interference.

The proposed architecture has its basis in the exokernel's [24] philosophy of separating device management from protection. In exokernel, the idea was to extend native device access to applications, with the exokernel providing the protection. In our approach, the extension of native device access is to the VM, the protection being managed by the VMM. A VM is assumed to be running a traditional OS. Further, the PCI-SIG community has realized the need for I/O device virtualization and has come out with the IOV specification to deal with it. The IOV specification, however, details device features to allow native access to virtual device interfaces, through the use of I/O page tables, virtual device identifiers and virtual device specific interrupts. The specification presumes that QoS is a software feature and does not address it. Many implementations adhering to the IOV specification are now being introduced in the market by Intel [25], Neterion [26], NetXen [27], Solarflare [28] etc. The CrossBow [29] suite from SUN Microsystems talks about this kind of resource provisioning, but it is a software stack over standard IOV compliant hardware. The results published using any of these products are exciting in terms of the performance achieved, but almost all of them have ignored the control of reception at the device level. We believe that lack of such a control on highly utilized devices will cause non-deterministic application performance loss and under-utilization of the device bandwidth.

VI. Conclusion

In this paper we described how the lack of virtualization awareness in I/O devices leads to latency overheads on the I/O path. Added to this, the intermixing of device management and data protection issues further increases the latency, thereby reducing the effective usable bandwidth of the device. Also, lack of appropriate device sharing control mechanisms at the device level leads to loss of bandwidth and performance interference among the VMs sharing the device. To address these issues we proposed an I/O device virtualization architecture, as an extension to the PCI-SIG IOV specification, and demonstrated its benefit through simulation techniques. The results demonstrate that by moving the QoS controls to the shared device, the unused bandwidth is made available to the unconstrained VM, unlike the case in prevalent technologies. The proposed architecture also improves the scalability of VMs sharing the NIC because it eliminates the common software entity which regulates I/O device sharing. The other advantage is that with this architecture, the maximum resource utilization is now accounted for by the VM. Also, this architecture reduces the workload interference on sharing a device and simplifies the consolidation process.

References

[1] M. Welsh and D. Culler, "Virtualization considered harmful: OS design directions for well-conditioned services", Hot Topics in OS, 8th Workshop, 2001.
[2] Kyle J. Nesbit, James E. Smith, Miquel Moreto, Francisco J. Cazorla, Alex Ramirez and Mateo Valero, "Multicore Resource Management", IEEE Micro, Vol. 28, Issue 3, pp. 6-16, 2008.
[3] Kyle J. Nesbit, Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Mateo Valero and James E. Smith, "Virtual Private Machines: Hardware/Software Interactions in the Multicore Era", IEEE Micro special issue on Interaction of Computer Architecture and Operating System in the Manycore Era, May/June 2008.
[4] Scott Rixner, "Breaking the Performance Barrier: Shared I/O in virtualization platforms has come a long way, but performance concerns remain", ACM Queue Virtualization, Jan/Feb 2008.
[5] D. Mosberger and T. Jin, "httperf: A Tool for Measuring Web Server Performance", ACM Workshop on Internet Server Performance, pp. 59-67, June 1998.
[6] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt and Andrew Warfield, "Xen and the art of virtualization", 19th ACM SIGOPS, Oct. 2003.
[7] "VMware ESX Server 2 - Architecture and Performance Implications", 2005, available at plications.pdf
[8] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield and M. Williamson, "Safe hardware access with the Xen virtual machine monitor", 1st Workshop on OASIS, Oct 2004.
[9] J. Lakshmi and S. K. Nandy, "Modeling Architecture-OS interactions using Layered Queuing Network Models", International Conference Proceedings of HPC Asia, March 2009, Taiwan.
[10] J. Lakshmi and S. K. Nandy, "I/O Device virtualization in Multi-core era, a QoS Perspective", Workshop on Grids, Clouds and Virtualization, International Conference on Grids and Pervasive Computing, Geneva, May 2009.
[11] PCI-SIG IOV Specification, available online at http://www.pcisig.com/specifications/iov
[12] J. Liu, W. Huang, B. Abali and D. K. Panda, "High performance VMM-bypass I/O in virtual machines", Proceedings of the USENIX Annual Technical Conference, June 2006.
[13] Layered Queueing Network Solver software package, http://www.sce.carleton.ca/rads/lqns/
[14] T. von Eicken and W. Vogels, "Evolution of the virtual interface architecture", Computer, 31(11), 1998.
[15] J. Sugerman, G. Venkatachalam and B. Lim, "Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor", Proceedings of the USENIX Annual Technical Conference, June 2001.
[16] D. Gupta, L. Cherkasova, R. Gardner and A. Vahdat, "Enforcing performance isolation across virtual machines in Xen", in M. van Steen and M. Henning, editors, Middleware, volume 4290 of Lecture Notes in Computer Science, pages 342-362, Springer, 2006.
[17] Weng, C., Wang, Z., Li, M. and Lu, X., "The hybrid scheduling framework for virtual machine systems", Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Washington, DC, USA, March 11-13, 2009.
[18] Kim, H., Lim, H., Jeong, J., Jo, H. and Lee, J., "Task-aware virtual machine scheduling for I/O performance", Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Washington, DC, USA, March 11-13, 2009.
[19] Menon, J. R. Santos, Y. Turner, G. J. Janakiraman and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment", Proceedings of the ACM/USENIX Conference on Virtual Execution Environments, June 2005.
[20] Menon, A. L. Cox and W. Zwaenepoel, "Optimizing network virtualization in Xen", Proceedings of the USENIX Annual Technical Conference, June 2006.
[21] Santos, J. R., Janakiraman, G., Turner, Y. and Pratt, I., "Netchannel 2: Optimizing network performance", Xen Summit Talk, November 2007.
[22] Willmann, P., Shafer, J., Carr, D., Menon, A., Rixner, S., Cox, A. L. and Zwaenepoel, W., "Concurrent direct network access for virtual machine monitors", Proceedings of the International Symposium on High-Performance Computer Architecture, February 2007.
[23] H. Raj and K. Schwan, "Implementing a scalable self-virtualizing network interface on a multicore platform", Workshop on the Interaction between Operating Systems and Computer Architecture, Oct. 2005.
[24] M. Frans Kaashoek, et al., "Application Performance and Flexibility on Exokernel Systems", 16th ACM SOSP, Oct. 1997.
[25] Intel Virtualization Technology for Directed-I/O, www.intel.com/technology/itj/2006/v10i3/2-io/7-conclusion.htm
[26] Neterion, http://www.neterion.com/
[27] NetXen, http://www.netxen.com/
[28] Solarflare Communications, http://www.solarflare.com/
[29] CrossBow: Network Virtualization and Resource Control, http://www.opensolaris.org/os/community/networking/crossbow_sunlabs_ext.pdf

Eco-Friendly Features of a Data center OS

Surya Prakki

Sun Microsystems Bangalore, India surya.prakki@sun.com

Abstract—This paper presents the different technologies a modern operating system like OpenSolaris offers which help data centers become more eco-friendly. It starts with the various virtualization technologies OpenSolaris provides, which help drive consolidation of systems and thus reduce the system footprint in a data center. It then introduces the power aware dispatcher (PAD) and how it works with the various processor supported power states, and moves on to observability tools which show how well the system is using the power management features.

Keywords-virtualization; consolidation; green computing;

I. INTRODUCTION

Data centers are becoming the backbone of the success of any enterprise. Enterprises are offering a lot of services over the web, whether it be bank transactions, booking of tickets, or maintaining social relationships. This is leading to the automation of more and more services, resulting in even more computers being deployed in the data center: server sprawl. As these computers get more powerful, their energy requirements go up, and with increasing energy costs, data centers impact both the ecology and the economics of running them.

This brings along very interesting challenges to modern OS developers:

How to move away from the traditional approach of hosting one service on one computer to running multiple services on a single computer, and address the following challenges in doing so:

o Isolation
o Minimizing the overheads
o Meeting peak load requirements (QoS)
o Secure execution
o Meeting different patch requirements of different applications
o Heterogeneous work loads
o Testing and development
o Enforcing resource controls
o Observability
o Fail over through replication
o Supporting legacy applications
o Simple administration

How to reduce the energy requirements of an idling computer?

The first problem, of workload consolidation, can be addressed using virtualization. There is no silver bullet virtualization technology that can address all of the above challenges.

II. OPERATING SYSTEM LEVEL VIRTUALIZATION

If the requirement is to consolidate homogeneous workloads (applications compiled for the same platform, i.e. operating system), it is preferable to opt for OS level virtualization. For the rest of the discussion, let us look at the Zones technology in OpenSolaris, which provides this feature, and see how it solves the above challenges.

A. Zones

Zones provide a very low overhead mechanism to virtualize operating system services, allowing one or more processes to run in isolation from other activity on the system. The kernel exports a number of distinct objects that can be associated with a particular zone, like processes, file system mounts, network interfaces (I/F), etc. No zone can access objects belonging to another zone. This isolation prevents processes running within a given zone from monitoring or affecting processes running in other zones. Thus a zone is a 'sandbox' within which one or more applications can run without affecting or interacting with the rest of the system.

As the underlying kernel is the same, physical resources are multiplexed across zones, improving the effective utilization of the system.

The first zone that comes up on installing OpenSolaris system is referred to as 'global zone' and all non-global zones need to be configured and installed from this global zone.

The Zones framework provides an abstraction layer that separates applications from physical attributes of the machine on which they are deployed, such as physical device paths and network I/F names. This enables, for example, multiple web servers running in different zones to connect to the same port using the distinct IP addresses associated with each zone; a hypothetical sketch of this follows.
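As a purely illustrative sketch of that idea (the addresses and port below are invented, and on a real system each server would simply bind the address assigned to its own zone), two listeners can share a port number as long as they bind different IP addresses:

    import socket

    # Hypothetical per-zone addresses; most systems treat the whole 127/8
    # range as local, so both binds usually succeed on one machine.
    ZONE_ADDRESSES = ["127.0.0.1", "127.0.0.2"]
    PORT = 8080  # stand-in for port 80, which needs privileges

    listeners = []
    for addr in ZONE_ADDRESSES:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((addr, PORT))   # same port, different IP: no conflict
        s.listen(5)
        listeners.append(s)
        print("listening on %s:%d" % (addr, PORT))

    for s in listeners:
        s.close()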

OpenSolaris has broken down the privileges associated with root-owned processes into finer grained ones, and zones inherit a subset of these privileges(5), thus making sure that even if a zone is compromised, the intruder can't do any damage to the rest of the system. Another outcome of this is that even privileged processes in non-global zones are prevented from performing operations that can have system-wide impact.

Administration of individual zones can be delegated to others, knowing that any actions taken by them would not affect the rest of the system.

Zones do not present a new API or ABI to which applications need to be 'ported', i.e. existing OpenSolaris applications run inside zones without any changes or recompilation. So a process running inside a zone runs natively on the CPU and hence doesn't incur any performance penalty.

A zone or multiple zones can be tied to a resource pool, which pulls together CPUs and a scheduler. The resources associated with a pool can be dynamically changed. This enables the global administrator to give more resources to a zone when it is nearing its peak demand. The Fair Share Scheduler (FSS) can be associated with a pool. Using FSS, a physical CPU assigned to multiple zones via a resource pool can be shared as per their entitlement, even in the face of the most demanding workloads, thus guaranteeing quality of service (QoS). The physical memory and swap memory are also configured per zone and can be changed dynamically to meet the varying demands of a zone.
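As a rough illustration of the FSS entitlement rule (the zone names and share counts below are invented), each zone's guaranteed CPU fraction under full contention is its shares divided by the total active shares:

    # Hypothetical FSS share assignments for zones in one resource pool.
    zone_shares = {"web": 30, "db": 60, "batch": 10}

    def cpu_entitlement(shares):
        """Guaranteed CPU fraction per zone when every zone is busy."""
        total = sum(shares.values())
        return {zone: s / total for zone, s in shares.items()}

    print(cpu_entitlement(zone_shares))
    # {'web': 0.3, 'db': 0.6, 'batch': 0.1}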

Zones can be booted and halted independent of the underlying kernel. As a zone boot involves only setting up the virtual platform and mounting the configured file systems, it is a much faster operation than booting even the smallest physical system.

The software packages that need to be available in a zone are a function of the service it is hosting, and a zone administrator is free to pick and choose. This makes a zone look like a slick, 'Just Enough OS' style environment which is tailor-made for the services it hosts. Likewise, a zone administrator is also free to maintain the patch levels of the packages she installed.

To save on file system space, two types of zones are supported: sparse and whole root. In the case of a sparse zone, some file systems like /usr are loopback mounted from the global zone so that multiple copies of the binaries are not present. In a whole root zone, all components that make up the platform are installed, which thus takes more disk space.

The zones technology has been extended to be able to run applications compiled for earlier versions of Solaris. This is referred to as branding a zone and is used to run Solaris 8 and Solaris 9 applications. In such a branded zone, the runtime environment for a Solaris 8 application is no different from what it was on a box running the Solaris 8 kernel. This branding feature helps customers replace old power-hungry systems with newer eco-friendly computers. The operating environment should also provide tools which will help in making this transition smoother.

The OpenSolaris system has a rich set of observability tools like DTrace(1M), kstat(1M), truss(1), proc(4) and debugging tools like mdb(1), which can be used to study the behavior of, and debug, applications in a production environment.

These tools can be run either from the global zone or from inside a non-global zone itself.

The extensibility of the zones framework can be gauged by the fact that there is an 'lx' brand using which Linux 2.6, 32-bit applications can be run unmodified on an OpenSolaris x86 system inside an lx branded zone. Thus even Linux applications can be observed from the global zone using the earlier mentioned tools.

A zone can be configured with an exclusive IP stack, such that it can have its own routing table, ARP table, IPSec policies and associations, IP filter rules and TCP/IP ndd variables. Each zone can continue to run services such as NFS on top of TCP/UDP/IP. This way a physical NIC can be set aside for exclusive use of the zone. Zones can also be configured to have shared IP stacks on top of a single NIC.

Disks can be provided to a zone for its exclusive use, or a portion of a file system space can be set aside for the zone, or using ZFS file system, space can be grown as the demand grows dynamically.

Thus to summarize, zones virtualize the following facilities:

o Processes
o File Systems
o Networking
o Identity
o Devices
o Packaging

The following block diagram captures the above discussion:

Figure 1.

1) Field Use Case :

www.blastwave.com offers open source packages for different versions of Solaris. As a result they needed to continue to have physical systems running Solaris 8 and Solaris 9. When they made use of the branded zones feature, they were able to consolidate these legacy systems onto systems running Solaris 10, and they quantified the gain as follows:

o 65 percent reduction in rack space, saving tens of thousands of dollars in power, cooling, and hardware-maintenance costs
o Reduced setup time from hours to minutes.

Sun IT heavily uses zones to host quite a few services, to quote a few:

o Requests to make source changes are made via a portal which runs in a zone.
o A service to manage lab infrastructure.
o A Namefinder service which reports basic information about employees.
o A service to host software patches.

One web server is run in each of the above zones and they all listen on port 80 without stamping on each other. All the above services are very critical to the running of the business. The first three cases do not see much change in workload and hence are easily virtualized using zones. In the case of the fourth, patches are released periodically and there will be a lot of hits in the first 48 hours of patches being made available; during this time, additional CPUs and physical memory can be set aside for this zone using dynamic resource pools via a cron(1M) job. In consolidating these services, we replaced 4 physical systems with a single one.

Price et al. [11] report less than 4% performance degradation for time-sharing workloads, which is attributed to the loopback mounts in the case of sparse zones.

B. Crossbow

Crossbow provides the building blocks for network virtualization and resource control by virtualizing the network stack and NIC around any service (HTTP, HTTPS, FTP, NFS, etc.), protocol or virtual machine. Crossbow does to the networking stack what Zones did to OS services.

One of the main components of Crossbow is the ability to virtualize a physical NIC into multiple virtual NICs (VNICs). These VNICs can be assigned to either zones or any virtual machines sharing the physical NIC. Virtualization is implemented by the MAC layer and the VNIC pseudo driver of the OpenSolaris network stack. It allows physical NIC resources such as hardware rings and interrupts to be allocated to specific VNICs. This allows each VNIC to be scheduled independently as per the load on the VNIC, and also allows classification of packets between VNICs to be off-loaded in hardware.

Each VNIC is assigned its own MAC address and optional VLAN id (VID). The resulting MAC+VID tuple is used to identify a VNIC on the network, physical or virtual.

Crossbow allows a bandwidth limit to be set on a VNIC. The bandwidth limit is enforced by the MAC layer transparently to the user of the VNIC. This mechanism allows the administrator to configure the link speed of VNICs that are assigned to zones or VMs; this way they can't use more bandwidth than their assigned share.
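A generic way to enforce such a cap is a token bucket; the sketch below illustrates only the idea and is not Crossbow's actual MAC-layer implementation (the rate, burst size and frame length are invented):

    import time

    class TokenBucket:
        """Token-bucket limiter: tokens accrue at `rate` bytes per second."""
        def __init__(self, rate, burst):
            self.rate, self.capacity = rate, burst
            self.tokens, self.last = burst, time.monotonic()

        def allow(self, frame_len):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= frame_len:
                self.tokens -= frame_len
                return True       # transmit now
            return False          # hold: the VNIC is over its assigned share

    # Hypothetical 10 Mbit/s cap with a 64 KB burst allowance.
    limiter = TokenBucket(rate=10_000_000 / 8, burst=64 * 1024)
    print(limiter.allow(1500))    # a full-size frame is admitted initially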

Crossbow provides virtual switching semantics between VNICs created on top of the same physical NIC. Virtual switching done by Crossbow is consistent with the behavior of a typical physical switch found on a physical network.

Crossbow VNICs and virtual switches can be combined to build a Crossbow Virtual Wire (vWire). A vWire can be a fully virtual network in a box. A vWire can be used to instantiate a layer-2 network which can be used to run a distributed application spanning multiple virtual hosts.

Virtual Network Machines (VNMs) are pre-canned OpenSolaris zones which encapsulate a network function such as routing, load balancing, etc. VNMs are assigned their own VNIC(s), and can be deployed on a vWire to provide network functions needed by the network applications. VNMs can come in pre-configured fashion, which helps in deploying new instances quickly.

III. PLATFORM LEVEL VIRTUALIZATION

There are some instances where OS level virtualization falls short:

o Any decent sized data center will have heterogeneous workloads, i.e. applications compiled for different platforms, and any effort to consolidate such workloads can't be fully achieved by OS level virtualization.
o A service or application needs a specific kernel module or driver to operate.
o Different applications need different kernel patch levels.
o Consolidating legacy applications which expect end-of-life (EOL) operating systems.

In such scenarios, a platform virtualization solution can be used to consolidate the workloads. These solutions carve multiple Virtual Machines (VMs) out of a physical machine and are referred to as either hypervisors or Virtual Machine Monitors (VMMs).

For the rest of the discussion let us look at the hypervisors LDOMs and Xen, which OpenSolaris supports on the Sparc and x86 architectures respectively, and see how they address some of the challenges mentioned in the Introduction.

A. Logical DOMains

LDOMs can be viewed as a hypervisor implemented in firmware. The LDOMs hypervisor is shipped along with sun4v based Sparc systems.

The hypervisor divides the physical Sparc machine into multiple virtual machines, called domains. Each domain can be dynamically configured with a subset of machine resources and its own independent OS. Isolation and protection are provided by the LDOMs hypervisor by restricting access to registers and address space identifiers (ASIs).

The hypervisor takes care of partitioning of resources. Hardware resources are assigned to logical domains using 'Machine Descriptions (MD)'. An MD is a graph describing the devices available, which includes CPU threads/strands, memory blocks and the PCI bus. Each LDOM has its own MD. This is used to build the Open Boot Prom (OBP) and consequently the OS device tree upon booting the guest. A fallout of this model is that a guest OS doesn't even see any HW resources not present in its MD.

The first domain that comes up on a sun4v based system acts as the control domain. The LDOMs software manager runs in the control domain. This helps us to interact with the hypervisor and allocate and deallocate physical resources to guest domains. The management software consists of a daemon (ldmd(1M)) which interacts with the hypervisor, and an ldm(1M) CLI to interact with the daemon. ldmd(1M) can be run only in the control domain.

The hypervisor provides a fast point-to-point communication channel called the 'logical domain channel' (LDC) to enable communication between LDOMs, and between an LDOM and the hypervisor. LDC end points are defined in the MD. ldmd(1M) also uses LDC to interact with the hypervisor in managing domains.

Initially all the resources are given to the control domain. Using ldm(1M), the administrator needs to detach resources from the primary domain and pass them on to the newly created guest domains.

I/O for guest domains can be configured in two ways:

Direct I/O: A guest domain can be given direct access to a PCI device, and the guest domain can manage all the I/O devices connected to it using its own drivers.

Virtualized I/O: One of the domains, called a 'service domain', controls the HW present on the system and presents 'virtual devices' to the other guest domains. The guest then uses a virtual block device driver which forwards the I/O to the service domain through an LDC. The virtual device presented to the guest can be backed by a physical disk, a slice of it, or even a regular file.

In the case of network I/O, the hypervisor presents a layer-2 'virtual network switch' which enables domain-to-domain traffic. Any number of virtual LANs can be created by creating additional v-switches in the service domain. The switch talks to the device driver in the service domain to connect to the physical NIC for external network connectivity, and can leverage the layer-3 features of the service domain kernel to do routing, IP filtering, NAT and firewalling.


The hypervisor automatically powers off CPU cores that are not in use, i.e. not assigned to any domains.

The following block diagram captures the above discussion:

Figure 2.

B. x86 Platform Virtualization

In the x86 space, over the years, many virtualization technologies have come in, and they can be classified under two types:

Type 1 – where the virtualization solution does not need a host OS to operate.

Type 2 – where the virtualization solution needs a host OS to operate.

For the rest of the discussion, let us look at Xen (a Type 1 hypervisor) and VirtualBox (a Type 2 hypervisor).

1) Xen

Xen is an open source hypervisor technology developed at the University of Cambridge. This solution is specific to Solaris running on x86 based systems. Each VM created can run a complete OS. Xen sits directly on top of the HW and below the guest operating systems. It takes care of partitioning the available CPU(s) and physical memory resources across the VMs.

Unlike other contemporary virtualization solutions that exist for x86, Xen started off with a different design approach to minimize the overheads a guest OS incurs: the guest needs to be ported to the Xen architecture, which very closely resembles the underlying x86 architecture. This approach is referred to as 'Para-Virtualization (PV)'.

In this approach, a PV guest kernel is made to run in ring-1 of the x86 architecture, while the hypervisor runs in ring-0. This way the hypervisor protects itself against any malicious guest kernel. Just as an application makes a system call to get into the kernel, a PV kernel needs to make a hypercall to get into the hypervisor; hypercalls are needed to request any services from the hypervisor.

The first VM that comes up on boot is referred to as Dom0 and the rest of the VMs are referred to as DomUs. To keep the hypervisor thin, Xen runs the device drivers that control the peripherals on the system in dom0. This approach makes a lot of code execute in ring-1 rather than ring-0 and thus improves the security of the system.

In PV mode, I/O between guests and their virtual devices is directed through a combination of Front End (FE) drivers that run in a guest and Back End (BE) drivers that run in a domain that is hosting the device. The communication between the front end and back end happens via the hypervisor and hence is subject to privilege checks. The FE and BE drivers are class specific, i.e. one set for all block devices and one set for network devices; thus the finer details associated with each specific device are completely avoided. This model of implementing the I/O performs far better than the emulated devices approach.

Depending on the guest OS needs, more physical memory can be passed on dynamically and, likewise, if there is a memory shortage in the system, the hypervisor can also take back memory from a guest.

Likewise, the number of virtual CPUs (vCPUs) associated with a guest can be dynamically increased if the guest needs more compute resources. Xen schedules these vCPUs onto the physical CPUs.

The management tools needed to configure and install guests also run in dom0, thus making the hypervisor even thinner. This also improves the debuggability of the tools as they run in user space.

Given the recent advances in CPUs, like Intel's VT-x and AMD's SVM, Xen allows unmodified guests to be run in a VM; these are referred to as HVM guests. There are a couple of major differences between how a PV guest is handled vis-a-vis how an HVM guest is handled:

Unlike a PV guest, an HVM guest expects to handle the devices directly by installing its own drivers. For this, the hypervisor has to emulate different physical devices, and this emulation is done in the user land of dom0. As can be inferred, I/O this way will be slower than in the PV approach.

An HVM guest tries to install its own page tables. Xen uses shadow page tables which track the guest's page table modifications. This can slow down the handling of page faults in the guest.

In both the PV and HVM cases, dom0 needs to be ported to the Xen platform. The applications run without any modifications inside the guest operating systems.

x86 CPUs, of late, are seeing a steady stream of new features to make VMMs easier to implement:


Extended Page Tables (EPT): With EPT, the VMM doesn't have to maintain shadow page tables. Just as page tables convert virtual addresses to physical addresses, EPT converts guest physical addresses to host physical addresses. This virtualizes CR3, which the guest continues to manipulate (a toy sketch of the resulting two-stage lookup follows this list).

VT-d: This enables guests to access the devices directly so that performance impact of emulated devices is cut out.
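The toy sketch below models the two-stage lookup mentioned for EPT with plain dictionaries keyed by page number; the 4 KiB page size and the mappings are invented, and this shows only the addressing arithmetic, not a real hardware page walk:

    PAGE = 4096  # assume 4 KiB pages

    # Guest page table: guest virtual page -> guest physical page (toy values).
    guest_pt = {0x10: 0x200}
    # EPT: guest physical page -> host physical page (toy values).
    ept = {0x200: 0x7F3}

    def translate(guest_va):
        """VA -> guest PA via the guest's page table, then guest PA -> host PA via EPT."""
        vpn, offset = divmod(guest_va, PAGE)
        guest_pa = guest_pt[vpn] * PAGE + offset
        host_pa = ept[guest_pa // PAGE] * PAGE + guest_pa % PAGE
        return guest_pa, host_pa

    print([hex(a) for a in translate(0x10123)])   # ['0x200123', '0x7f3123']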

To get the best of both worlds (i.e. the PV approach while using the HW advances), hybrid virtualization is picking up momentum: we start off with a guest in HVM mode (so that there is no porting exercise) and then incrementally add PV drivers which bypass device emulation and thus reduce virtualization overheads. In the case of OpenSolaris these PV drivers are already implemented for block and network devices.

The Xen tracing facility provides a way to record hypervisor events like a vCPU getting on and off a CPU, a VM blocking, etc., and this data can help nail down performance issues with virtual machines.

Xen allows live migration of guests to similar physical machines, which effectively brings in load balancing features.

Xen also allows suspend and resume of guests, which can be used to start a service on demand. This feature, along with ZFS snapshots, can be used to configure a guest, take a snapshot of it, and move it to a different physical machine; such nodes could give a simple fail-over capability.

The following block diagram captures the above discussion:

Figure 3.

To summarize, Xen helps in consolidating the heterogeneous workloads that are commonly seen in a data center.

2) VirtualBox (VB)

VB is an open source virtualization solution for x86 based systems. VB runs as a process in the host operating system and supports various recent as well as legacy operating systems as guests. The power of VB lies in the fact that the guest is completely virtualized and is run as just a process on the host OS. VB follows the usual x86 virtualization technique of running guest ring-0 code in ring-1 of the host, and running guest ring-3 code natively.

Given its ease of use, VB can be used to consolidate desktop workloads; a new desktop can be configured and deployed in a few seconds' time.

It supports cold migration of the guests, where one could copy over the Virtual Disk Image (VDI) to a different system along with XML files (config details) and start the VM on the other system.

Creating duplicate VMs is as easy as copying over the VDI, removing the UUID from the image, and registering it with VB. VB emulates PCnet and Intel PRO network adapters, and supports different networking modes: NAT, HIF and internal networking.

C. Field Use Case :

The following are real world cases where we used platform virtualization inside Sun:

Once a supported release of Solaris is made, a separate build machine is spawned off which is used to host sources and allows gatekeepers to create patches. Earlier, one physical system used to be set aside for each release. Now, with Xen and LDOMs, a new VM is created for each release and multiple such guests can be hosted on a single system. Patch windows for each of the releases are staggered such that multiple guests won't hit peak load together. A complete build of Solaris takes at least 10% longer inside a guest, but this is still acceptable as it is not a time-critical job. The performance impact on the interactive workload which an engineer might see while pushing her code change is too small to be noticed.

Engineers heavily use VB to test their changes in older releases of Solaris. So even though the performance degradation could be in the range of 5-20% depending on workload, it is still acceptable as it is only functionality and sanity testing (after my kernel change, will the system boot?).

So even though there is an increase in the number of supported Solaris releases, there is not a corresponding increase in the number of physical systems in the data center, thus significantly saving on capital expenditure and carbon footprint.

IV. CPU MANAGEMENT

To address the second problem mentioned in the Introduction, the operating system should support the various features provided by the hardware to reduce the idling power consumption of the system. For the rest of the discussion, let us look at how the OpenSolaris kernel supports various power saving features of both the x86 and Sparc platforms.


A. Advanced Configuration and Power Interface (ACPI)

Processor C-states:

The ACPI C0 state refers to the normal functioning of the CPU, when it draws the rated voltage. The CPU enters the C1 state while running the idle thread, by issuing a halt instruction; in this state, other than the APIC and the bus, the rest of the units do not consume power. The CPU runs the idle thread when there are no threads in a runnable state.

ACPI processor C3 state and beyond are referred to as deep C-states, as the wake-up latency of these states is higher than that of the earlier states. In the C3 state, even the APIC and bus units are stopped, and caches lose state. OS support is needed for the C3 state because of this state loss, and OpenSolaris incorporates this support.

B. Power Aware Dispatcher (PAD)

This feature extends the existing topology aware scheduling facility to bring 'power domain' awareness to the dispatcher. With this awareness in place, the dispatcher can implement a coalescence dispatching policy to consolidate utilization onto a smaller subset of CPU domains, freeing up other domains to be power managed. In addition to being domain aware, the dispatcher will also tend to prefer domains already running at lower C-states; this will increase the duration and extent to which domains can remain quiescent, improving the kernel's power saving ability.

Because the dispatcher will track power domain utilization along the way, it can drive active domain state changes in an event driven fashion, eliminating the need for the power management subsystem to poll.

With the current conservative settings this yields a 3.5% improvement in SPECpower on Nehalem but, more importantly, a 22.2% idle power saving.
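A highly simplified sketch of such a coalescence placement policy is shown below: among power domains that already have work, prefer the most utilized one that still has headroom, so that idle domains can stay quiescent. The domain utilizations and the headroom threshold are invented, and this is not the OpenSolaris dispatcher code.

    # Per power-domain utilization (0..1); domains at 0 can be power managed.
    domains = {"pd0": 0.62, "pd1": 0.15, "pd2": 0.0, "pd3": 0.0}
    HEADROOM = 0.85   # don't pack a domain beyond this utilization

    def place_thread(domains):
        """Pick a power domain for a newly runnable thread, coalescing load."""
        busy = {d: u for d, u in domains.items() if 0 < u < HEADROOM}
        if busy:
            # Prefer the busiest non-saturated domain so others stay quiescent.
            return max(busy, key=busy.get)
        idle = [d for d, u in domains.items() if u == 0.0]
        return idle[0] if idle else min(domains, key=domains.get)

    print(place_thread(domains))   # -> 'pd0'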

C. CPU Frequency Scaling :

It is possible to reduce power consumption of a system by running at a reduced clock frequency, when it is observed that CPU utilization is low.

D. Memory Power Management:

Like CPUs, memory can also be put in a power saving state when the system is idle. OpenSolaris enables this feature on chipsets which support it.

E. Suspend to RAM:

It is common for desktops and laptops to have extended periods of no activity, for instance when the user is away at lunch. To save power in such cases, OpenSolaris supports what is referred to as ACPI S3: the whole system is suspended to RAM and power is cut off to the CPU.

F. CPU hotplug:

This is supported on quite a few Sparc platforms and is achieved by 'Dynamic Reconfiguration (DR)' support in the kernel. A DRed-out board containing CPUs is effectively powered off and can be pulled out, or it can be left in place so that, depending on workload changes, the board can be DRed back in.

V. OBSERVABILITY TOOLS

There should be a mechanism by which an administrator can see how well a system is taking advantage of the power management features discussed above. For this, OpenSolaris provides a couple of tools:

A. PowerTOP:

powertop(1M) reports the activity that is making the CPU move to lower C-states and thus increase the power consumption. Addressing the reasons for that activity will make a CPU stay longer in the higher C-states.

B. kstat:

kstat(1M) reports the clock frequency at which the CPU is currently operating and what the supported frequencies are. The lower the frequency, the better from a power consumption point of view.


REFERENCES

[1] Paul Barham et al. Xen and the art of virtualization. In Proceedings of the 19th Symposium on Operating Systems Principles, 2003.
[2] Bryan Cantrill, Mike Shapiro, and Adam Leventhal. Dynamic instrumentation of production systems. In USENIX Annual Technical Conference, 2004.
[3] I. Pratt, et al. The Ongoing Evolution of Xen. In Proceedings of the Ottawa Linux Symposium (OLS), Canada, 2006.
[4] C. Clark, et al. Live Migration of Virtual Machines. In Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI), USENIX, Boston, 2005.
[5] http://www.usenix.org/event/lisa04/tech/price.html
[6] http://hub.opensolaris.org/bin/view/Project+tesla/
[9] http://hub.opensolaris.org/bin/view/Project+ldoms-mgr/
[10] http://hub.opensolaris.org/bin/view/Community+Group+zones/
[11] http://www.sun.com/customers/software/blastwave.xml

ADCOM 2009 HUMAN COMPUTER INTERFACE -1

Session Papers:

1. Shankkar B, Roy Paily and Tarun Kumar, “Low Power Biometric Capacitive CMOS Fingerprint Sensor System”

2. Raghavendra R, Bernadette Dorizzi, Ashok Rao and Hemantha Kumar, “Particle Swarm Optimization for Feature Selection: An Application to Fusion of Palmprint and Face”

Low Power Biometric Capacitive CMOS Fingerprint Sensor System

B. Shankkar, Tarun Kumar and Roy Paily

Department of Electronics and Communication Engineering IIT Guwahati, Assam, India

Abstract—A charge-sharing based sensor for obtaining fingerprints has been designed. The design uses sub-threshold operation of MOSFETs to achieve a low power sensor device working at 0.5 V. The interfacing circuitry and a fingerprint matching algorithm were also designed to support the sensor and complete a fingerprint verification system. Index Terms—Fingerprint, sensor, charge sharing, sub-threshold, low power

I. INTRODUCTION

Biometrics is the automated method of recognizing a person based on a physiological or behavioral characteristic. Biometric recognition can be used in identification mode, where the biometric system identifies a person from the entire enrolled population by searching a database for a match based solely on the biometric. A biometric system can also be used in verification mode, where the biometric system authenticates a person's claimed identity from their previously enrolled pattern. Various biometrics used for such a purpose are signature verification, face recognition, fingerprints and iris recognition. Fingerprinting is one of the oldest ways of technically identifying a person's identity. Fingerprints have a distinct ridge and valley pattern on the tip of a finger for every individual. Every fingerprint can be divided into two structures: ridges and valleys. As the world becomes more accustomed to devices reducing in size with each passing day, and dependence on mobile devices increases at an enormous rate, a fingerprint authentication and identification system that can be mounted on such a device is imperative. Any fingerprint authentication system requires a sensor and corresponding circuitry, an interface and a fingerprint matching algorithm. This paper presents the simulation results achieved for all these modules.

II. SENSOR AND CORRESPONDING CIRCUITRY

Figure 1 shows the basic principle of the charge-sharing scheme [1]. The finger is modeled as the upper metal plate of the capacitor, with its lower metal plate in the cell. These electrodes are separated by the passivation layer of the silicon chip, formed by the metal oxide layer on the chip. The series combination of the two capacitances is called C_s. The basic principle behind the working of capacitive fingerprint sensors is that C_s changes according to whether the part of the finger on that pixel is a ridge or a valley. If the part of the finger is a ridge then C_s is higher than when it encounters a valley, in which case the series combination of the two capacitances falls low, due to the modeling of the capacitor between the metal plate and the finger as a capacitor with an air medium in between. C_p1 and C_p2 are the internal parasitic capacitances of the nodes N_1 and N_2. In the pre-charge phase, the switches S_1 and S_3 are on and S_2 is off. The capacitors C_p1 and C_p2 get charged up. During the evaluation phase, S_2 is turned on. The voltage stored during the pre-charge phase is now divided between C_s, C_p1 and C_p2. The output voltage at N_1 is given by:

\[ V_0 = V_{N1} = V_{N2} = \frac{C_{p1} V_1 + C_{p2} V_2 + C_s V_1}{C_{p1} + C_{p2} + C_s} \tag{1} \]

Fig. 1. Basic Charge Sharing Scheme [1]

As given in Figure 2, C_s differs for ridge and valley, thus the output voltage also differs according to the above expression. This difference in voltage, when passed to a comparator with an appropriate reference voltage, gives a binary output. The binary values from all the pixels in the chip then constitute the required fingerprint image. In the pre-charge phase (pch = 0), it can be seen that N_1 and N_3 are kept at V_dd by the PMOS transistors. During this phase, the capacitors C_p2 and C_p3 are shorted, with the voltages at both ends as ground and V_dd respectively. The capacitors C_p1 and C_s begin to charge up. They store a charge of C_p1 V_dd and C_s (V_dd - V_fin) respectively. This is the charge accumulation phase. At the beginning of the evaluation phase (pch = 1), both the input and output voltages are equal to V_dd. Even when the voltage at N_1 starts decreasing due to charge-sharing between the capacitors, the unity-gain buffer ensures that the voltage at N_3 is equal to the voltage at N_1, thus effectively shorting the capacitor C_p3 and removing its effect. Meanwhile, the comparator is also enabled, which is able to produce the required binary output. Thus the fingerprint pattern is captured. The circuits were simulated and the implementation results are shown in Figure 3. The presence of a ridge and a valley are characterized by the voltage levels 1.6 V and 1.2 V respectively.

Fig. 2. The Sensor Circuitry

Fig. 3. Resolution obtained for the basic Sense Amplifier
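To make equation (1) concrete, the short sketch below evaluates it for invented capacitance and voltage values; the only point is that the larger C_s seen over a ridge pulls the sensed node voltage higher than the smaller C_s of a valley.

    def sense_voltage(cs, cp1, cp2, v1, v2):
        """Equation (1): voltage at N1/N2 after charge sharing."""
        return (cp1 * v1 + cp2 * v2 + cs * v1) / (cp1 + cp2 + cs)

    # Invented values: capacitances in femtofarads, voltages in volts.
    CP1, CP2 = 50.0, 50.0
    V1, V2 = 1.8, 0.0
    ridge_cs, valley_cs = 120.0, 40.0   # a ridge couples more strongly than a valley

    print("ridge :", round(sense_voltage(ridge_cs, CP1, CP2, V1, V2), 3))   # higher
    print("valley:", round(sense_voltage(valley_cs, CP1, CP2, V1, V2), 3))  # lower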

III. INTERFACE

A circuit was designed for interfacing the sensor with the digital processing system. Circuitry was required to select particular pixels to be activated and then to selectively transfer the pixel value to the FPGA board for fingerprint matching. The module involved use of a decoder and an and-or gate array. A basic circuit to detect a long sequence of a single value was also designed to implement auto correction. This ensured that a long stream of a single bit value is not consecutively sensed, as it is highly unlikely to occur in practical conditions. The complete sensor module along with the circuitry and the auto error detection module is shown in Figure 4. The in0 and in1 signals are from a counter which increments on the positive edge of the clock. The ctrl signal goes to the reset of the counter. The out signal is the sensed value of the fingerprint and goes to the fingerprint matching algorithm. The four fps blocks represent the sensor array. The decode unit helps activate a single pixel sensor at a time, the values of which are passed to the output through the and-or gate array.

Fig. 4. Building blocks for integrating the sensor module with the FPGA kit

IV. RESOLUTION IMPROVEMENT

We proposed a modification to the basic circuit to improve the resolution of the output signal. We introduced an inverter at the output stage, which magnifies the output difference for ridge and valley obtained at the output port. The inverter was designed to have a characteristic such that the point where V_in = V_out occurs at a value equal to the middle of the voltage swing for ridge and valley. This difference is reasonably easily discernible, and it could help ease the limited range of V_ref. It could give a wider range for V_ref, thus making the circuit more reliable. A voltage swing of around 2 V was achieved using the designed inverter. The result of the proposed improvement is shown in Figure 5.

Fig. 5. Results obtained with the Resolution Improvement Circuit

V. POWER IMPROVEMENT

Sub-threshold [3] operation of MOSFETs was used to achieve a reduction in supply voltage and power for the required purpose. The various steps of implementing the circuitry using the sub-threshold approach are as follows:

35

The supply voltage was first fixed at 0.5 V.

Since the current when the transistor is in weak inversion is very small, the node capacitances take a long time to charge and discharge. Thus the clock frequency was reduced to 1 kHz.

The sense amplifier has to work as a unity gain amplifier when the output and V- terminals are shorted. To design the sense amplifier to work at 0.5 V, 0.5 V was given at V+ and the width of the transistors was changed to get the output voltage as 0.5 V. It is well known that the NMOS transistor acts as a pull-down device and the PMOS transistor acts as a pull-up device. After many iterations, the widths were decided, with the corresponding characteristics. The circuit developed is shown in Figure 6. The output characteristics are shown in Figure 7.

Fig. 6. Low Power Sense Amplifier

Fig. 7. Low Power Sense Amplifier input-output response

The inverter was designed to have a change from logic 0 to logic 1 at 0.25 V. The width of the transistors was changed to get the desired characteristics. The circuit is shown in Figure 8 and the output characteristics are given in Figure 9.

Fig. 8. Low Power Inverter

Fig. 9. DC transfer characteristics of the Low Power Inverter

After the modules were designed and characterized individually, they were combined to obtain the overall circuit. Since the voltages at the node capacitances are very small, care has to be taken to allow more current flow for faster charging. This decided the widths of the transistors in the main circuit.

Thus the overall circuit was designed at a 0.5 V supply voltage with the transistors in weak inversion to reduce power.

The improvement resulted in a power requirement of 736 nW per pixel position. The improvements introduced in the MOSFET designs also improved resolution further, and we obtained a resolution of around 350 mV for a power supply of 0.5 V. Results are presented in Figure 10.


Fig. 10. Results obtained with the low power Resolution Improvement Circuit

VI. FINGERPRINT MATCHING ALGORITHMS

Minutiae points [4] are the local ridge characteristics that occur either at a ridge ending or a ridge bifurcation. A ridge ending is defined as the point where the ridge ends abruptly, and a ridge bifurcation is the point where the ridge splits into two or more branches. Once the fingerprint image has been enhanced and the thinning algorithm applied, the minutiae are extracted. The algorithms were implemented and simulated on the Xilinx 10.1 ISE simulator. The simulations were performed based on the Virtex II Pro FPGA board. The following points can be considered the basic steps used for implementation of the fingerprint matching algorithm.

Removing local variations: Local variations in the fingerprint were removed by scanning the whole image pixel by pixel and at each location determining the values of all neighboring pixels. If a particular pixel value is different from all other pixel values then it is changed to the value of the neighboring pixels. The code was implemented using VHDL.

Thinning Algorithm: Thinning algorithms are used to narrow down the ridge structures within a fingerprint image. Thinning [5] is done by removing pixels from the borders of the ridges without causing any disturbance to the continuity of the structures. For this purpose the image was scanned from left to right. Whenever the pixel value on the left did not match the values on the right, the particular pixel values were changed to '1' (white). This way only the single pixel values at the end were retained as '0', representing a single-pixel-wide ridge pattern.

Minutiae Extraction: The minutiae that were considered are ridge endings and ridge bifurcations. At every pixel location, sharp changes were detected by checking opposite points. If any two opposite pixel positions with respect to the current pixel position had different values, then it was considered to be a special point. Certain subsequent pixels were then analyzed to determine whether a particular pixel was a minutia or not. For every minutia, its location with respect to the first minutia was saved for later comparison.

Minutiae matching: In this step the minutiae of the fingerprints in the records were loaded and compared to the minutiae of the recently processed fingerprint. The minutiae of the fingerprint records were saved as separate templates. For every clock cycle one of the templates was accessed. The minutiae set of the current candidate fingerprint was compared element by element to that of the accessed template. In the case of 15 or more matches in minutiae locations, a match is announced and the loop exited. One sample of the actual 100 pixel x 50 pixel image that was input to the matching algorithm, and the corresponding thinned image with minutiae positions, is shown in Figure 11; a minimal sketch of this location-matching step is given after the figure.

Fig. 11. Fingerprint image and extracted minutiae
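The location-matching step can be sketched as follows; the pixel tolerance and the toy template are invented, the 15-match threshold is the one stated above, and the authors' actual implementation is in VHDL on the FPGA rather than Python.

    def count_matches(candidate, template, tol=3):
        """Count candidate minutiae that fall within `tol` pixels of a template minutia."""
        hits = 0
        for (cx, cy) in candidate:
            if any(abs(cx - tx) <= tol and abs(cy - ty) <= tol for (tx, ty) in template):
                hits += 1
        return hits

    def is_match(candidate, templates, threshold=15):
        """Accept if any stored template shares at least `threshold` minutiae locations."""
        return any(count_matches(candidate, t) >= threshold for t in templates)

    # Tiny invented example, with a lowered threshold just to exercise the logic.
    cand = [(10, 12), (40, 7), (55, 30)]
    tmpl = [[(11, 12), (41, 8), (54, 31)]]
    print(is_match(cand, tmpl, threshold=3))   # True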

Another approach, based on correlation of the test fingerprint image with the fingerprint images in the database, was implemented on the FPGA board. In this approach every single pixel value that was fed to the FPGA board was compared to the corresponding pixel value in all the other fingerprint images stored in the database. Parameters were updated at the occurrence of each pixel value to calculate the correlation of the complete images. A correlation value higher than 0.6 was used to announce a match.
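A minimal, non-streaming version of this correlation check is sketched below using Pearson correlation over the flattened images and the 0.6 threshold mentioned above; the per-pixel running updates performed on the FPGA are collapsed into a single pass, and the tiny images are invented.

    import math

    def correlation(a, b):
        """Pearson correlation between two equal-length pixel sequences."""
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = math.sqrt(sum((x - ma) ** 2 for x in a))
        sb = math.sqrt(sum((y - mb) ** 2 for y in b))
        return cov / (sa * sb)

    def is_match(test_pixels, stored_images, threshold=0.6):
        """Announce a match if any stored image correlates above the threshold."""
        return any(correlation(test_pixels, img) >= threshold for img in stored_images)

    test = [0, 1, 1, 0, 1, 0, 0, 1]
    database = [[0, 1, 1, 0, 1, 0, 1, 1], [1, 0, 0, 1, 0, 1, 1, 0]]
    print(is_match(test, database))   # True: the first stored image correlates strongly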

VII. CONCLUSION

The paper presents a complete basic model for obtaining a fingerprint and identifying or authenticating it. The whole project was implemented on suitable simulators for each individual module. The sensor and related circuitry were simulated using Mentor tools and the fingerprint algorithm was implemented on Xilinx 10.1. Important achievements were in terms of resolution improvement and reduction in power requirement.

REFERENCES

[1] J.W. Lee, D.J. Min, J.Y. Kim and W.C. Kim, “A 600-dpi capacitive fingerprint sensor chip and image-synthesis technique,” IEEE J. Solid-State Circuits, 1999.
[2] Jin-Moon Nam, Seung-Min Jung and Moon-Key Lee, “Design and implementation of a capacitive fingerprint sensor circuit in CMOS technology,” Sensors and Actuators, 2007.
[3] Hendrawan Soeleman and Kaushik Roy, “Ultra-low power digital sub-threshold logic circuits,” Proceedings of the International Symposium on Low Power Electronics and Design, 1999.
[4] Nimitha Chama, “Fingerprint image enhancement and minutiae extraction,” http://www.ces.clemson.edu/stb/ece847/fall2004/projects/proj10.doc
[5] Pu Hongbin, Chen Junali, Zhang Yashe, “Fingerprint Thinning Algorithm Based on Mathematical Morphology,” The Eighth International Conference on Electronic Measurement and Instruments, 2007.


PARTICLE SWARM OPTIMIZATION FOR FEATURE SELECTION: AN APPLICATION TO FUSION OF PALMPRINT AND FACE

Raghavendra R. 1, Bernadette Dorizzi 2, Ashok Rao 3, Hemantha Kumar G. 1

1 Dept of Studies in Computer Science, University of Mysore, Mysore-570 006, India. 2 Institut TELECOM, TELECOM and Management SudParis, France. 3 Professor, Dept of E & C, CIT, Gubbi, India.

ABSTRACT

This paper relates to multimodal biometric analysis. Here we present an efficient feature level fusion and selection scheme that we apply on face and palmprint images. The features for each modality are obtained using the Log Gabor transform and concatenated to form a fused feature vector. We then use a Particle Swarm Optimization (PSO) scheme, a random optimization method, to select the dominant features from this fused feature vector. Final classification is performed in the projection space of the selected features using Kernel Direct Discriminant Analysis. Extensive experiments are carried out on a virtual multimodal biometric database of 250 users built from the face FRGC and the palmprint PolyU databases. We compare the proposed selection method with well known feature selection methods such as Sequential Floating Forward Selection (SFFS), Genetic Algorithm (GA) and Adaptive Boosting (AdaBoost) in terms of both the number of features selected and performance. Experimental results show that the proposed method of feature fusion and selection using PSO outperforms all other schemes in terms of reduction of the number of features, and it corresponds to a system that is easier to implement while showing the same or even better performance.

Keywords: Multimodal Biometrics, Feature level fusion, Feature selection, Particle Swarm Optimization

1. INTRODUCTION

Recently, multimodal biometric fusion techniques have attracted much attention, as the complementary information between different modalities can improve the recognition performance. In practice, fusion of several biometric systems can be performed at 4 different levels: sensor level, feature level, match score level and decision level. As reported in [1], a biometric system that integrates information at an earlier stage of processing is expected to provide better performance than systems that integrate information at a later stage, because of the availability of more and richer information.


In this paper, we are going to explore the interest of feature fusion, and we will experiment with it on two widely studied modalities, namely face and palmprint. Very few papers address the feature fusion of palmprint and face [2][3][4][5]. From these articles, it is clear that performing feature level fusion leads to the curse of dimensionality, due to the large size of the fused feature vector, and hence linear or non-linear projection schemes are used by the above mentioned authors to overcome the dimensionality problem. In this work, we address this problem by reducing the dimension of the feature space through an appropriate feature selection procedure. To this aim, we experimented with the binary Particle Swarm Optimization (PSO) algorithm proposed in [6] to perform feature selection. Indeed, PSO based feature selection has been shown to be very efficient on some large scale application problems, with performance better than Genetic Algorithms [7][8]. We therefore implemented it for this biometric feature fusion problem of high dimension, and this is the main novelty of this paper. Extensive experiments conducted on a virtual multimodal biometric database of 250 users show the efficacy of the proposed scheme. The rest of this paper is organized as follows: Section 2 describes the proposed method of feature fusion using PSO and also discusses the selection of parameters for PSO. Section 3 presents the experimental setup, Section 4 describes the results and discussion, and finally Section 5 draws the conclusion.

2. PROPOSED METHOD

Fig. 1. Proposed scheme of feature fusion and selection in Log Gabor Space including PSO feature selection

Figure 1 shows the proposed block diagram of feature level fusion of palmprint and face in Log Gabor space. As observed from Figure 1, we first extract the texture features of face and palmprint separately using the Log Gabor transform. We use the Log Gabor transform as it is suitable for analyzing gradually changing data such as face, iris and palmprint [3], and it is also mentioned in [9] that the Log Gabor transform can reflect the frequency response of images more realistically than the usual Gabor transform. On the linear frequency scale, the transfer function of the Log Gabor transform has the form [9]:

\[ G(\omega) = \exp\left( \frac{-\big(\log(\omega/\omega_o)\big)^2}{2\big(\log(k/\omega_o)\big)^2} \right) \tag{1} \]

where ω_o is the filter center frequency. To obtain a constant shape filter, the ratio k/ω_o must be held constant for varying ω_o.

The Log Gabor transform used in our experiments has 4 different scales and 8 orientations. We fixed these values based on the results of different trials and also in conformity with the literature [4][3]. Thus, each image (of palmprint and face) is analyzed using 8x4 different Log Gabor filters, resulting in 32 different filtered images of resolution 60 x 60. To reduce the computation cost, we down sample the image by a ratio equal to 6. Thus, the final size is reduced to 40x80. A similar type of analysis is also carried out for the palmprint modality. By concatenating the column vectors associated to each image we obtain the fused feature vector of size 6400x1. As the imaging conditions of face and palmprint are different, a feature vector normalization is carried out as mentioned in [3]. In order to reduce the size of each vector, we propose to perform feature selection through PSO, as explained in Section 2.1 and illustrated in Figure 2, where 'K' indicates the dimension of the feature space after concatenation and 'S' indicates the reduced dimension obtained by PSO. Then, we use KDDA to project the selected features onto the kernel discriminant space. Here we employ KDDA because of its good performance as well as its high dimension reduction ability. Finally, the decision about accept/reject is carried out using NNC.
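As an illustration of equation (1), the sketch below evaluates the radial (frequency-domain) Log Gabor response with numpy for an assumed centre frequency and an assumed k/ω_o ratio; the full 4-scale, 8-orientation filter bank also needs an angular component, which is omitted here.

    import numpy as np

    def log_gabor_radial(omega, omega0, k_ratio=0.55):
        """Equation (1): G(w) = exp(-(log(w/w0))^2 / (2 * (log(k_ratio))^2)),
        where k_ratio stands for k/w0. Zero frequency is handled separately
        since log(0) is undefined (a Log Gabor filter has no DC response)."""
        omega = np.asarray(omega, dtype=float)
        response = np.zeros_like(omega)
        nz = omega > 0
        response[nz] = np.exp(-(np.log(omega[nz] / omega0) ** 2)
                              / (2.0 * np.log(k_ratio) ** 2))
        return response

    # Assumed centre frequency of 0.1 cycles/pixel; the response peaks at omega0.
    freqs = np.linspace(0.0, 0.5, 6)
    print(np.round(log_gabor_radial(freqs, omega0=0.1), 3))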

2.1. Particle Swarm Optimization (PSO)

PSO is a stochastic, population based optimization technique aiming at finding a solution to an optimization problem in a search space. The PSO algorithm was first described by J. Kennedy and R.C. Eberhart in 1995 [8]. The main idea of PSO is to simulate the social behavior of birds flocking to describe an evolving system. Each candidate solution is therefore modeled by an individual bird, that is, a particle in a search space. Each particle adjusts its flight by making use of its individual memory and of the knowledge gained from its neighbors to find the best solution.

2.2. Principle of PSO

The main objective of PSO is to optimize a given function called the fitness function. PSO is initialized with a population of particles distributed randomly over the search space, which are evaluated to compute the fitness function. Each particle is treated as a point in an N-dimensional space. The i-th particle is represented as X_i = {x_1, x_2, ..., x_N}. At every iteration, each particle is updated by two best values called pbest and gbest. pbest is the best position associated with the best fitness value of particle i obtained so far and is represented as pbest_i = {pbest_i1, pbest_i2, ..., pbest_iN} with fitness function f(pbest_i). gbest is the best position among all the particles in the swarm. The rate of position change (velocity) for particle i is represented as V_i = {v_i1, v_i2, ..., v_iN}. The particle velocities are updated according to the following equation [8]:

\[ V_{id}^{new} = w \, V_{id}^{old} + C_1 \, rand_1() \, (pbest_{id} - x_{id}) + C_2 \, rand_2() \, (gbest_{d} - x_{id}) \tag{2} \]
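A direct transcription of equation (2) is sketched below; the inertia weight w, the acceleration constants C1 and C2, and the particle state are placeholder values, and the pbest/gbest bookkeeping and the binary position update of [6] are not shown.

    import random

    def update_velocity(v, x, pbest, gbest, w=0.7, c1=2.0, c2=2.0):
        """Equation (2): per-dimension velocity update for one particle."""
        new_v = []
        for d in range(len(x)):
            r1, r2 = random.random(), random.random()
            new_v.append(w * v[d]
                         + c1 * r1 * (pbest[d] - x[d])
                         + c2 * r2 * (gbest[d] - x[d]))
        return new_v

    # Placeholder state for one 4-dimensional particle.
    x = [0.2, 0.8, 0.5, 0.1]
    v = [0.0, 0.0, 0.0, 0.0]
    pbest = [0.3, 0.7, 0.6, 0.0]
    gbest = [0.9, 0.1, 0.4, 0.0]
    print([round(val, 3) for val in update_velocity(v, x, pbest, gbest)])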