
The Handbook of Research on Scalable Computing Technologies
Kuan-Ching Li
Providence University, Taiwan
Ching-Hsien Hsu
Chung Hua University, Taiwan
Laurence Tianruo Yang
St. Francis Xavier University, Canada
Jack Dongarra
University of Tennessee, USA
Hans Zima
Jet Propulsion Laboratory, California Institute of Technology, USA
and University of Vienna, Austria

Information Science Reference


Hershey New York

Director of Editorial Content: Kristin Klinger
Senior Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Publishing Assistant: Sean Woznicki
Typesetter: Carole Coulson, Dan Wilson, Daniel Custer, Kait Betz
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by


Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com/reference

Copyright © 2010 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data


Handbook of research on scalable computing technologies / Kuan-Ching Li ...
[et al.], editors.
p. cm.
Includes bibliographical references and index.
Summary: "This book presents, discusses, shares ideas, results and
experiences on the recent important advances and future challenges on enabling
technologies for achieving higher performance"--Provided by publisher.
ISBN 978-1-60566-661-7 (hardcover) -- ISBN 978-1-60566-662-4 (ebook) 1.
Computational grids (Computer systems) 2. System design. 3. Parallel
processing (Electronic computers) 4. Ubiquitous computing. I. Li, Kuan-Ching.
QA76.9.C58H356 2009
004--dc22
2009004402

British Cataloguing in Publication Data


A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.

Editorial Advisory Board

Minyi Guo, The University of Aizu, Japan


Timothy Shih, Tamkang University, Taiwan
Ce-Kuen Shieh, National Cheng Kung University, Taiwan
Liria Matsumoto Sato, University of São Paulo, Brazil
Jeffrey Tsai, University of Illinois at Chicago, USA
Chia-Hsien Wen, Providence University, Taiwan
Yi Pan, Georgia State University, USA

List of Contributors

Allenotor, David / University of Manitoba, Canada ..........................................................................................471


Altmann, Jorn / Seoul National University, South Korea ..................................................................................442
Alves, C. E. R. / Universidade São Judas Tadeu, Brazil ....................................................................378
Bertossi, Alan A. / University of Bologna, Italy .................................................................................................645
Buyya, Rajkumar / The University of Melbourne, Australia.....................................................................191, 517
Cáceres, E. N. / Universidade Federal de Mato Grosso do Sul, Brazil ..............................................378
Cappello, Franck / INRIA & UIUC, France ........................................................................................................31
Chang, Jih-Sheng / National Dong Hwa University, Taiwan ................................................................................1
Chang, Ruay-Shiung / National Dong Hwa University, Taiwan ...........................................................................1
Chen, Jinjun / Swinburne University of Technology, Australia.......................................................396
Chen, Zizhong / Colorado School of Mines, USA ..............................................................................................760
Chiang, Kuo / National Taiwan University, Taiwan .............................................................................................. 123
Chiang, Shang-Feng / National Taiwan University, Taiwan ................................................................................ 123
Chiu, Kenneth / University at Binghamton, State University of NY, USA ........................................................471
Dai, Yuan-Shun / University of Electronic Science and Technology of China, China
& University of Tennessee, Knoxville, USA .........................................................................................................219
de Assunção, Marcos Dias / The University of Melbourne, Australia ..............................................517
de Mello, Rodrigo Fernandes / University of São Paulo – ICMC, Brazil.........................................................338
Dehne, F. / Carleton University, Canada ............................................................................................................378
Dodonov, Evgueni / University of São Paulo – ICMC, Brazil ...........................................................338
Dongarra, Jack / University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA;
& University of Manchester, UK ..........................................................................................................................219
Doolan, Daniel C. / Robert Gordon University, UK ...........................................................................................705
Dou, Wanchun / Nanjing University, P. R. China .............................................................................................396
Dümmler, Jörg / Chemnitz University of Technology, Germany .........................................................246
Eskicioglu, Rasit / University of Manitoba, Canada ..........................................................................................486
Fahringer, Thomas / University of Innsbruck, Austria ........................................................................................89
Fedak, Gilles / LIP/INRIA, France .......................................................................................................................31
Ferm, Tore / Sydney University, Australia ..........................................................................................................354
Gabriel, Edgar / University of Houston, USA ....................................................................................................583
Gaudiot, Jean-Luc / University of California, Irvine, USA ...............................................................................552
Gentzsch, Wolfgang / EU Project DEISA and Board of Directors of the Open Grid Forum, Germany .............62
Graham, Peter / University of Manitoba, Canada .............................................................................................486
Grigg, Alan / Loughborough University, UK ......................................................................................................606
Grigoras, Dan / University College Cork, Ireland .............................................................................................705
Guan, Lin / Loughborough University, UK ........................................................................................................606
Gunturu, Sudha / Oklahoma State University, USA ..........................................................................................841
Guo, Minyi / Shanghai Jiao Tong University, China ..........................................................................................421

Gupta, Phalguni / Indian Institute of Technology Kanpur, India .......................................................................645


He, Xiangjian / University of Technology, Sydney (UTS), Australia ..........................................................739, 808
Jang, Yong J. / Yonsei University, Seoul, Korea .................................................................................................276
Ji, Yanqing / Gonzaga University, USA ..............................................................................................................874
Jiang, Hai / Arkansas State University, USA ......................................................................................................874
Jiang, Hong / University of Nebraska–Lincoln, USA .........................................................................785
Kondo, Derrick / ENSIMAG - antenne de Montbonnot, France..........................................................................31
Lam, King Tin / The University of Hong Kong, Hong Kong .............................................................................658
Li, Xiaobin / Intel Corporation, USA ..............................................................................................................552
Li, Xiaolin / Oklahoma State University, USA ....................................................................................................841
Liu, Chen / Florida International University, USA ............................................................................................552
Liu, Shaoshan / University of California, Irvine, USA.......................................................................................552
Malécot, Paul / Université Paris-Sud, France ......................................................................................31
Malyshkin, V.E. / Russian Academy of Science, Russia .....................................................................................295
March, Verdi / National University of Singapore, Singapore ............................................................................140
Mihailescu, Marian / National University of Singapore, Singapore..................................................................140
Nadeem, Farrukh / University of Innsbruck, Austria ..........................................................................................89
Nanda, Priyadarsi / University of Technology, Sydney (UTS), Australia ..........................................................739
Oh, Doohwan / Yonsei University, Seoul, Korea ................................................................................................276
Ou, Zhonghong / University of Oulu, Finland ...................................................................................................682
Parashar, Manish / Rutgers, The State University of New Jersey, USA ..............................................................14
Pierson, Jean-Marc / Paul Sabatier University, France ......................................................................................14
Pinotti, M. Cristina / University of Perugia, Italy .............................................................................................645
Prodan, Radu / University of Innsbruck, Austria .................................................................................................89
Quan, Dang Minh / International University in Germany, Germany ................................................................442
Ranjan, Rajiv / The University of Melbourne, Australia ..................................................................................191
Rauber, Thomas / University of Bayreuth, Germany .............................................................................246
Rautiainen, Mika / University of Oulu, Finland ................................................................................................682
Rezmerita, Ala / Université Paris-Sud, France ....................................................................................31
Rizzi, Romeo / University of Udine, Italy ...........................................................................................................645
Ro, Won W. / Yonsei University, Seoul, Korea....................................................................................................276
Rünger, Gudula / Chemnitz University of Technology, Germany .......................................................246
Shen, Haiying / University of Arkansas, USA ....................................................................................................163
Shen, Wei / University of Cincinnati, USA .........................................................................................................718
Shorfuzzaman, Mohammad / University of Manitoba, Canada.......................................................................486
Song, S. W. / Universidade de São Paulo, Brazil................................................................................378
Sun, Junzhao / University of Oulu, Finland .......................................................................................................682
Tabirca, Sabin / University College Cork, Ireland .............................................................................................705
Tang, Feilong / Shanghai Jiao Tong University, China ......................................................................................421
Teo, Yong Meng / National University of Singapore, Singapore ........................................................................140
Thulasiram, Ruppa K. / University of Manitoba, Canada ........................................................................312, 471
Thulasiraman, Parimala / University of Manitoba, Canada ............................................................................312
Tian, Daxin / Tianjin University, China ..............................................................................................................858
Tilak, Sameer / University of California, San Diego, USA ................................................................................471
Wang, Cho-Li / The University of Hong Kong, Hong Kong ..............................................................................658
Wang, Sheng-De / National Taiwan University, Taiwan ....................................................................................... 123
Wu, Qiang / University of Technology, Australia ...............................................................................................808
Xiang, Yang / Central Queensland University, Australia ...................................................................................858
Xu, Meilian / University of Manitoba, Canada ..................................................................................................312
Yang, Laurence Tianruo / St. Francis Xavier University, Canada ...........................................................442, 841
Yi, Jaeyoung / Yonsei University, Seoul, Korea .................................................................................................276

Ylianttila, Mika / University of Oulu, Finland...................................................................................................682


Yu, Ruo-Jian / National Taiwan University, Taiwan ............................................................................................. 123
Zeng, Qing-An / University of Cincinnati, USA .................................................................................................718
Zhou, Jiehan / University of Oulu, Finland........................................................................................................682
Zhu, Yifeng / University of Maine, USA .............................................................................................................785
Zomaya, Albert Y. / Sydney University, Australia..............................................................................................354

Table of Contents

Foreword .......................................................................................................................................... xxxi


Preface ............................................................................................................................................xxxiii
Acknowledgment ........................................................................................................................... xxxiv

Volume I
Section 1
Grid Architectures and Applications
Chapter 1
Pervasive Grid and its Applications ....................................................................................................... 1
Ruay-Shiung Chang, National Dong Hwa University, Taiwan
Jih-Sheng Chang, National Dong Hwa University, Taiwan
Chapter 2
Pervasive Grids: Challenges and Opportunities ................................................................................... 14
Manish Parashar, Rutgers, The State University of New Jersey, USA
Jean-Marc Pierson, Paul Sabatier University, France
Chapter 3
Desktop Grids: From Volunteer Distributed Computing to High Throughput
Computing Production Platforms ......................................................................................................... 31
Franck Cappello, INRIA & UIUC, France
Gilles Fedak, LIP/INRIA, France
Derrick Kondo, ENSIMAG - antenne de Montbonnot, France
Paul Malécot, Université Paris-Sud, France
Ala Rezmerita, Université Paris-Sud, France
Chapter 4
Porting Applications to Grids................................................................................................................ 62
Wolfgang Gentzsch, EU Project DEISA and Board of Directors
of the Open Grid Forum, Germany

Chapter 5
Benchmarking Grid Applications for Performance and Scalability Predictions .................................. 89
Radu Prodan, University of Innsbruck, Austria
Farrukh Nadeem, University of Innsbruck, Austria
Thomas Fahringer, University of Innsbruck, Austria

Section 2
P2P Computing
Chapter 6
Scalable Index and Data Management for Unstructured Peer-to-Peer Networks ...................................123
Shang-Feng Chiang, National Taiwan University, Taiwan
Kuo Chiang, National Taiwan University, Taiwan
Ruo-Jian Yu, National Taiwan University, Taiwan
Sheng-De Wang, National Taiwan University, Taiwan
Chapter 7
Hierarchical Structured Peer-to-Peer Networks.................................................................................. 140
Yong Meng Teo, National University of Singapore, Singapore
Verdi March, National University of Singapore, Singapore
Marian Mihailescu, National University of Singapore, Singapore
Chapter 8
Load Balancing in Peer-to-Peer Systems ............................................................................................ 163
Haiying Shen, University of Arkansas, USA
Chapter 9
Decentralized Overlay for Federation of Enterprise Clouds............................................................... 191
Rajiv Ranjan, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia

Section 3
Programming Models and Tools
Chapter 10
Reliability and Performance Models for Grid Computing ................................................................. 219
Yuan-Shun Dai, University of Electronic Science and Technology of China, China
& University of Tennessee, Knoxville, USA
Jack Dongarra, University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA;
& University of Manchester, UK

Chapter 11
Mixed Parallel Programming Models Using Parallel Tasks ............................................................... 246
Jörg Dümmler, Chemnitz University of Technology, Germany
Thomas Rauber, University of Bayreuth, Germany
Gudula Rünger, Chemnitz University of Technology, Germany
Chapter 12
Programmability and Scalability on Multi-Core Architectures .......................................................... 276
Jaeyoung Yi, Yonsei University, Seoul, Korea
Yong J. Jang, Yonsei University, Seoul, Korea
Doohwan Oh, Yonsei University, Seoul, Korea
Won W. Ro, Yonsei University, Seoul, Korea
Chapter 13
Assembling of Parallel Programs for Large Scale Numerical Modeling............................................ 295
V.E. Malyshkin, Russian Academy of Science, Russia
Chapter 14
Cell Processing for Two Scientific Computing Kernels ..................................................................... 312
Meilian Xu, University of Manitoba, Canada
Parimala Thulasiraman, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada

Section 4
Scheduling and Communication Techniques
Chapter 15
On Application Behavior Extraction and Prediction to Support and
Improve Process Scheduling Decisions ............................................................................................. 338
Evgueni Dodonov, University of São Paulo – ICMC, Brazil
Rodrigo Fernandes de Mello, University of São Paulo – ICMC, Brazil
Chapter 16
A Structured Tabu Search Approach for Scheduling in Parallel Computing Systems........................ 354
Tore Ferm, Sydney University, Australia
Albert Y. Zomaya, Sydney University, Australia
Chapter 17
Communication Issues in Scalable Parallel Computing ..................................................................... 378
C.E.R. Alves, Universidade São Judas Tadeu, Brazil
E. N. Cáceres, Universidade Federal de Mato Grosso do Sul, Brazil
F. Dehne, Carleton University, Canada
S. W. Song, Universidade de São Paulo, Brazil

Chapter 18
Scientific Workflow Scheduling with Time-Related QoS Evaluation ................................................ 396
Wanchun Dou, Nanjing University, P. R. China
Jinjun Chen, Swinburne University of Technology, Australia

Section 5
Service Computing
Chapter 19
Grid Transaction Management and Highly Reliable Grid Platform ................................................... 421
Feilong Tang, Shanghai Jiao Tong University, China
Minyi Guo, Shanghai Jiao Tong University, China
Chapter 20
Error Recovery for SLA-Based Workflows Within the Business Grid ............................................... 442
Dang Minh Quan, International University in Germany, Germany
Jorn Altmann, Seoul National University, South Korea
Laurence T. Yang, St. Francis Xavier University, Canada
Chapter 21
A Fuzzy Real Option Model to Price Grid Compute Resources ........................................................ 471
David Allenotor, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada
Kenneth Chiu, University at Binghamton, State University of NY, USA
Sameer Tilak, University of California, San Diego, USA

Volume II
Chapter 22
The State of the Art and Open Problems in Data Replication in Grid Environments ........................ 486
Mohammad Shorfuzzaman, University of Manitoba, Canada
Rasit Eskicioglu, University of Manitoba, Canada
Peter Graham, University of Manitoba, Canada
Chapter 23
Architectural Elements of Resource Sharing Networks ..................................................................... 517
Marcos Dias de Assunção, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia

Section 6
Optimization Techniques
Chapter 24
Simultaneous MultiThreading Microarchitecture ............................................................................... 552
Chen Liu, Florida International University, USA
Xiaobin Li, Intel Corporation, USA
Shaoshan Liu, University of California, Irvine, USA
Jean-Luc Gaudiot, University of California, Irvine, USA
Chapter 25
Runtime Adaption Techniques for HPC Applications ........................................................................ 583
Edgar Gabriel, University of Houston, USA
Chapter 26
A Scalable Approach to Real-Time System Timing Analysis............................................................. 606
Alan Grigg, Loughborough University, UK
Lin Guan, Loughborough University, UK
Chapter 27
Scalable Algorithms for Server Allocation in Infostations ................................................................. 645
Alan A. Bertossi, University of Bologna, Italy
M. Cristina Pinotti, University of Perugia, Italy
Romeo Rizzi, University of Udine, Italy
Phalguni Gupta, Indian Institute of Technology Kanpur, India

Section 7
Web Computing
Chapter 28
Web Application Server Clustering with Distributed Java Virtual Machine ...................................... 658
King Tin Lam, The University of Hong Kong, Hong Kong
Cho-Li Wang, The University of Hong Kong, Hong Kong
Chapter 29
Middleware for Community Coordinated Multimedia ....................................................................... 682
Jiehan Zhou, University of Oulu, Finland
Zhonghong Ou, University of Oulu, Finland
Junzhao Sun, University of Oulu, Finland
Mika Rautiainen, University of Oulu, Finland
Mika Ylianttila, University of Oulu, Finland

Section 8
Mobile Computing and Ad Hoc Networks
Chapter 30
Scalability of Mobile Ad Hoc Networks ............................................................................................. 705
Dan Grigoras, University College Cork, Ireland
Daniel C. Doolan, Robert Gordon University, UK
Sabin Tabirca, University College Cork, Ireland
Chapter 31
Network Selection Strategies and Resource Management Schemes
in Integrated Heterogeneous Wireless and Mobile Networks............................................................ 718
Wei Shen, University of Cincinnati, USA
Qing-An Zeng, University of Cincinnati, USA

Section 9
Fault Tolerance and QoS
Chapter 32
Scalable Internet Architecture Supporting Quality of Service (QoS) ................................................. 739
Priyadarsi Nanda, University of Technology, Sydney (UTS), Australia
Xiangjian He, University of Technology, Sydney (UTS), Australia
Chapter 33
Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing ................................. 760
Zizhong Chen, Colorado School of Mines, USA

Section 10
Applications
Chapter 34
Efficient Update Control of Bloom Filter Replicas in Large Scale Distributed Systems ................... 785
Yifeng Zhu, University of Maine, USA
Hong Jiang, University of Nebraska–Lincoln, USA
Chapter 35
Image Partitioning on Spiral Architecture .......................................................................................... 808
Qiang Wu, University of Technology, Australia
Xiangjian He, University of Technology, Australia

Chapter 36
Scheduling Large-Scale DNA Sequencing Applications .................................................................... 841
Sudha Gunturu, Oklahoma State University, USA
Xiaolin Li, Oklahoma State University, USA
Laurence Tianruo Yang, St. Francis Xavier University, Canada
Chapter 37
Multi-Core Supported Deep Packet Inspection .................................................................................. 858
Yang Xiang, Central Queensland University, Australia
Daxin Tian, Tianjin University, China
Chapter 38
State-Carrying Code for Computation Mobility ................................................................................. 874
Hai Jiang, Arkansas State University, USA
Yanqing Ji, Gonzaga University, USA

Compilation of References ............................................................................................................... 895

Detailed Table of Contents

Foreword .......................................................................................................................................... xxxi


Preface ............................................................................................................................................xxxiii
Acknowledgment ........................................................................................................................... xxxiv

Volume I
Section 1
Grid Architectures and Applications
Chapter 1
Pervasive Grid and its Applications ....................................................................................................... 1
Ruay-Shiung Chang, National Dong Hwa University, Taiwan
Jih-Sheng Chang, National Dong Hwa University, Taiwan
With the advancement of computer systems and communication technologies, Grid computing can be seen as a key
technology driving the next generation of distributed computing applications. For general users, however, grid
middleware is complex to set up and requires a steep learning curve, so providing transparent access to the grid
system from the users' point of view becomes a critical issue. Various challenges can also arise from incomplete
system design when coordinating existing computing resources to achieve a pervasive grid environment. This chapter
investigates current research on pervasive grids and analyzes the most important factors and components for
constructing a pervasive grid system. Finally, to improve the efficiency of teaching and research within a campus,
we introduce our pervasive grid platform.
Chapter 2
Pervasive Grids: Challenges and Opportunities ................................................................................... 14
Manish Parashar, Rutgers, The State University of New Jersey, USA
Jean-Marc Pierson, Paul Sabatier University, France
Pervasive Grid computing is motivated by the advances in Grid technologies and the proliferation of pervasive
systems, and is leading to the emergence of a new generation of applications that use pervasive and ambient
information as an integral part of how they manage, control, adapt and optimize. However, the inherent scale and
complexity of Pervasive Grid systems fundamentally impact how applications are formulated, deployed and managed,
and present significant challenges that permeate all aspects of the systems software stack.
In this chapter, the authors present some use cases of Pervasive Grids and highlight their opportunities
and challenges. They then explain why semantic knowledge and autonomic mechanisms are seen as
foundations for conceptual and implementation solutions that can address these challenges.
Chapter 3
Desktop Grids: From Volunteer Distributed Computing to High Throughput
Computing Production Platforms ......................................................................................................... 31
Franck Cappello, INRIA & UIUC, France
Gilles Fedak, LIP/INRIA, France
Derrick Kondo, ENSIMAG - antenne de Montbonnot, France
Paul Malécot, Université Paris-Sud, France
Ala Rezmerita, Université Paris-Sud, France
Desktop Grids, literally Grids made of Desktop Computers, are very popular in the context of Volunteer
Computing for large scale Distributed Computing projects like SETI@home and Folding@home.
They are very appealing, as Internet Computing platforms for scientific projects seeking a huge amount
of computational resources for massive high throughput computing, like the EGEE project in Europe.
Companies are also interested in cheap computing solutions that do not add extra hardware or
cost of ownership. A more recent argument for Desktop Grids is their ecological impact: by scavenging
unused CPU cycles without excessively increasing power consumption, they reduce wasted
electricity. This book chapter presents the background of Desktop Grids, their principles and essential
mechanisms, the evolution of their architectures, their applications and the research tools associated
with this technology.
Chapter 4
Porting Applications to Grids................................................................................................................ 62
Wolfgang Gentzsch, EU Project DEISA and Board of Directors
of the Open Grid Forum, Germany
The aim of this chapter is to guide developers and users through the most important stages of implementing
software applications on Grid infrastructures, and to discuss important challenges and potential solutions.
Those challenges come from the underlying grid infrastructure, like security, resource management, and
information services; the application data, data management, and the structure, volume, and location of
the data; and the application architecture, monolithic or workflow, serial or parallel. As a case study, we
present DEISA (the Distributed European Infrastructure for Supercomputing Applications) and describe
its DEISA Extreme Computing Initiative (DECI) for porting and running scientific grand challenge applications.
The chapter concludes with an outlook on Compute Clouds, and suggests ten rules for building
a sustainable grid as a prerequisite for the long-term sustainability of grid applications.

Chapter 5
Benchmarking Grid Applications for Performance and Scalability Predictions .................................. 89
Radu Prodan, University of Innsbruck, Austria
Farrukh Nadeem, University of Innsbruck, Austria
Thomas Fahringer, University of Innsbruck, Austria
Application benchmarks can play a key role in analyzing and predicting the performance and scalability
of Grid applications, serve as an evaluation of the fitness of a collection of Grid resources for running a
specific application or class of applications (Tsouloupas & Dikaiakos, 2007), and help in implementing
performance-aware resource allocation policies of real time job schedulers. However, application benchmarks have been largely ignored due to diversified types of applications, multi-constrained executions,
dynamic Grid behavior, and heavy computational costs. To remedy these, we present an approach taken
by the ASKALON Grid environment that computes application benchmarks considering variations in
the problem size of the application and machine size of the Grid site. Our system dynamically controls
the number of benchmarking experiments for individual applications and manages the execution of these
experiments on different Grid sites. We present experimental results of our method for three real-world
applications in the Austrian Grid environment.

Section 2
P2P Computing
Chapter 6
Scalable Index and Data Management for Unstructured Peer-to-Peer Networks ...................................123
Shang-Feng Chiang, National Taiwan University, Taiwan
Kuo Chiang, National Taiwan University, Taiwan
Ruo-Jian Yu, National Taiwan University, Taiwan
Sheng-De Wang, National Taiwan University, Taiwan
In order to improve the scalability and reduce the traffic of Gnutella-like unstructured peer-to-peer
networks, index caching and controlled flooding mechanisms have been an important research topic in
recent years. In this chapter we describe the current state of the art in index management schemes, interest
groups, and data clustering for unstructured peer-to-peer networks. Index caching mechanisms are an approach
to reducing the traffic of keyword querying. However, the cached indices may incur redundant replication
across the network, leading to less efficient use of storage and increased traffic. We propose a multilayer
index management scheme that actively diffuses indices in the network and groups them according to their
request rate. Peers in groups holding indices with higher request rates are placed in layers that receive
queries earlier. Our simulation shows that the proposed approach maintains a high query success rate while
reducing the flooding size.
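To make the layering idea concrete, the following sketch is a minimal illustrative toy, not the chapter's actual scheme: the layer count, thresholds, keyword names and helper functions are invented. It shows how indices for frequently requested keywords can be diffused to an early layer so that a controlled flood usually stops after probing only a few peers.

```python
# Illustrative sketch (not the chapter's scheme): peers are organized in layers and
# indices for frequently requested keywords are diffused to the earliest layer, so a
# controlled flood usually stops early. Names, thresholds and numbers are invented.
from collections import defaultdict

N_LAYERS = 3
layers = [defaultdict(set) for _ in range(N_LAYERS)]   # layer -> keyword -> holders
request_rate = defaultdict(int)

def diffuse(keyword, holder):
    """Place the index in a layer according to the keyword's observed request rate."""
    rate = request_rate[keyword]
    layer = 0 if rate >= 100 else (1 if rate >= 10 else 2)
    layers[layer][keyword].add(holder)

def query(keyword):
    """Flood layer by layer; stop at the first layer that answers (smaller flood)."""
    request_rate[keyword] += 1
    for depth, layer in enumerate(layers, start=1):
        if keyword in layer:
            return layer[keyword], depth
    return set(), N_LAYERS

request_rate["popular-file"] = 500
diffuse("popular-file", "peerA")
diffuse("rare-file", "peerB")
print(query("popular-file"))   # found after probing 1 layer
print(query("rare-file"))      # found after probing all 3 layers
```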

Chapter 7
Hierarchical Structured Peer-to-Peer Networks.................................................................................. 140
Yong Meng Teo, National University of Singapore, Singapore
Verdi March, National University of Singapore, Singapore
Marian Mihailescu, National University of Singapore, Singapore
Structured peer-to-peer networks are scalable overlay network infrastructures that support Internet-scale
network applications. A globally consistent peer-to-peer protocol maintains the structural properties
of the network with peers dynamically joining, leaving and failing in the network. In this chapter, we
discuss hierarchical distributed hash tables (DHT) as an approach to reduce the overhead of maintaining
the overlay network. In a two-level hierarchical DHT, the top-level overlay consists of groups of nodes
where each group is distinguished by a unique group identifier. In each group, one or more nodes are
designated as supernodes and act as gateways to nodes at the second level. Collisions of groups occur
when concurrent node joins result in the creation of multiple groups with the same group identifier. This
has the adverse effects of increasing the lookup path length due to a larger top-level overlay, and the
overhead of overlay network maintenance. We discuss two main approaches to address the group collision problem: collision detection-and-resolution, and collision avoidance. As an example, we describe
an implementation of hierarchical DHT by extending Chord as the underlying overlay graph.
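The group-collision scenario can be sketched in a few lines. The code below is a hypothetical illustration only; class and method names such as `Group`, `concurrent_join` and `resolve_collisions` are not taken from the chapter. It models collision detection-and-resolution in a two-level hierarchy, where concurrent joins may create duplicate groups for the same identifier that a later pass merges.

```python
# Illustrative sketch of group-identifier collisions in a two-level hierarchical DHT.
import hashlib
from collections import defaultdict

def group_id(key: str, bits: int = 8) -> int:
    """Hash a group key onto a small identifier space (hypothetical scheme)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** bits)

class Group:
    """A second-level overlay; the first member acts as the supernode/gateway."""
    def __init__(self, gid: int, supernode: str):
        self.gid = gid
        self.supernode = supernode
        self.members = [supernode]

class TopLevelOverlay:
    """Top-level overlay keyed by group identifier (collision detection-and-resolution)."""
    def __init__(self):
        self.groups = defaultdict(list)   # gid -> list of Group; duplicates = collision

    def concurrent_join(self, node: str, key: str) -> Group:
        gid = group_id(key)
        g = Group(gid, node)              # concurrent joins may each create a group
        self.groups[gid].append(g)
        return g

    def resolve_collisions(self):
        """Merge duplicate groups for the same gid into one, keeping one supernode."""
        for gid, gs in self.groups.items():
            if len(gs) > 1:
                survivor = gs[0]
                for g in gs[1:]:
                    survivor.members.extend(g.members)
                self.groups[gid] = [survivor]

overlay = TopLevelOverlay()
overlay.concurrent_join("nodeA", "cluster-eu")
overlay.concurrent_join("nodeB", "cluster-eu")  # same key -> same gid -> collision
overlay.resolve_collisions()
print(len(overlay.groups[group_id("cluster-eu")]))  # 1 group remains after resolution
```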
Chapter 8
Load Balancing in Peer-to-Peer Systems ............................................................................................ 163
Haiying Shen, University of Arkansas, USA
Structured peer-to-peer (P2P) overlay networks like Distributed Hash Tables (DHTs) map data items to
the network based on a consistent hashing function. Such mapping for data distribution has an inherent
load balance problem. Thus, a load balancing mechanism is an indispensable part of a structured P2P
overlay network for high performance. The rapid development of P2P systems has posed challenges in
load balancing due to their features characterized by large scale, heterogeneity, dynamism, and proximity.
An efficient load balancing method should be flexible and resilient enough to deal with these characteristics. This chapter will first introduce P2P systems and the load balancing problem in P2P systems. It then
introduces the current technologies for load balancing in P2P systems, and provides a case study of a
dynamism-resilient and proximity-aware load balancing mechanism. Finally, it indicates the future and
emerging trends of load balancing, and concludes the chapter.
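As a rough, self-contained illustration of why plain consistent hashing skews load, and how virtual servers (one common remedy in the load-balancing literature) even it out, consider the sketch below; the peer names, key counts and hash choices are invented for the example.

```python
# Illustrative sketch: consistent hashing on a ring, with and without virtual servers.
import hashlib
from bisect import bisect_right
from collections import Counter

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=1):
    """Place each node (optionally several virtual copies) on a hash ring."""
    return sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))

def lookup(ring, key):
    """Map a key to the first node clockwise from its hash."""
    points = [p for p, _ in ring]
    i = bisect_right(points, h(key)) % len(ring)
    return ring[i][1]

nodes = [f"peer{i}" for i in range(4)]
keys = [f"item{i}" for i in range(10000)]

for vnodes in (1, 64):
    ring = build_ring(nodes, vnodes)
    load = Counter(lookup(ring, k) for k in keys)
    print(vnodes, "virtual server(s) per node ->", dict(load))
# With a single point per node the key counts are typically very uneven;
# with many virtual servers per node they approach an even split.
```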
Chapter 9
Decentralized Overlay for Federation of Enterprise Clouds............................................................... 191
Rajiv Ranjan, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia
This chapter describes Aneka-Federation, a decentralized and distributed system that combines enterprise
Clouds, overlay networking, and structured peer-to-peer techniques to create scalable wide-area networking of compute nodes for high-throughput computing. The Aneka-Federation integrates numerous small
scale Aneka Enterprise Cloud services and nodes that are distributed over multiple control and enterprise
domains as parts of a single coordinated resource leasing abstraction. The system is designed with the

aim of making distributed enterprise Cloud resource integration and application programming flexible,
efficient, and scalable. The system is engineered such that it: enables seamless integration of existing
Aneka Enterprise Clouds as part of single wide-area resource leasing federation; self-organizes the system components based on a structured peer-to-peer routing methodology; and presents end-users with a
distributed application composition environment that can support a variety of programming and execution
models. This chapter describes the design and implementation of a novel, extensible and decentralized
peer-to-peer technique that helps to discover, connect and provision the services of Aneka Enterprise
Clouds among the users who can use different programming models to compose their applications.
Evaluations of the system with applications that are programmed using the Task and Thread execution
models on top of an overlay of Aneka Enterprise Clouds have been described here.

Section 3
Programming Models and Tools
Chapter 10
Reliability and Performance Models for Grid Computing ................................................................. 219
Yuan-Shun Dai, University of Electronic Science and Technology of China, China
& University of Tennessee, Knoxville, USA
Jack Dongarra, University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA;
& University of Manchester, UK
Grid computing is a newly developed technology for complex systems with large-scale resource sharing,
wide-area communication, and multi-institutional collaboration. It is hard to analyze and model the Grid
reliability because of its largeness, complexity and stiffness. Therefore, this chapter introduces the Grid
computing technology, presents different types of failures in grid systems, models grid reliability with
star and tree structures, and finally studies optimization problems for grid task partitioning and
allocation. The chapter then presents models for the star topology considering data dependence and the
tree structure considering failure correlation. Evaluation tools and algorithms are developed, building on
the universal generating function and graph theory. Failure correlation and data dependence are then
incorporated into the model, and numerical examples illustrate the modeling and analysis.
Chapter 11
Mixed Parallel Programming Models Using Parallel Tasks ............................................................... 246
Jörg Dümmler, Chemnitz University of Technology, Germany
Thomas Rauber, University of Bayreuth, Germany
Gudula Rünger, Chemnitz University of Technology, Germany
Parallel programming models using parallel tasks have been shown to be successful for increasing scalability
on medium-size homogeneous parallel systems. Several investigations have shown that these programming models
can be extended to the hierarchical and heterogeneous systems which will dominate in the
future. In this chapter, we discuss parallel programming models with parallel tasks and describe these
programming models in the context of other approaches for mixed task and data parallelism. We discuss compiler-based as well as library-based approaches for task programming and present extensions

to the model which allow a flexible combination of parallel tasks and an optimization of the resulting
communication structure.
Chapter 12
Programmability and Scalability on Multi-Core Architectures .......................................................... 276
Jaeyoung Yi, Yonsei University, Seoul, Korea
Yong J. Jang, Yonsei University, Seoul, Korea
Doohwan Oh, Yonsei University, Seoul, Korea
Won W. Ro, Yonsei University, Seoul, Korea
In this chapter, we describe today's technological trends in building multi-core based microprocessors and
the associated programmability and scalability issues. Ever since multi-core processors were commercialized,
we have seen many different multi-core processors. However, the issues related to how to utilize the physical
parallelism of cores for software execution have not been suitably addressed so far. Compared to implementing
multiple identical cores on a single chip, separating an original sequential program into multiple running
threads has been an even more challenging task. In this chapter, we introduce several software programs which
can be successfully ported to future multi-core based processors and describe how they could benefit from
multi-core systems. Towards the end, future trends in multi-core systems are surveyed.
Chapter 13
Assembling of Parallel Programs for Large Scale Numerical Modeling............................................ 295
V.E. Malyshkin, Russian Academy of Science, Russia
The main ideas of the Assembly Technology (AT), as applied to the parallel implementation of large-scale
realistic numerical models on a rectangular mesh, are considered and demonstrated through the parallelization
(fragmentation) of a Particle-In-Cell (PIC) application for solving the problem of energy exchange in a plasma
cloud. Implementing numerical models with the assembly technology is based on the construction of a fragmented
parallel program. Assembling a numerical simulation program under AT automatically provides the target program
with useful dynamic properties, including dynamic load balancing based on the migration of fragments from
overloaded to underloaded processor elements of a multicomputer. The parallel program assembling approach can
also be seen as a combination and adaptation, for parallel programming, of the well-known modular programming
and domain decomposition techniques, supported by system software for assembling fragmented programs.
Chapter 14
Cell Processing for Two Scientific Computing Kernels ..................................................................... 312
Meilian Xu, University of Manitoba, Canada
Parimala Thulasiraman, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada
This chapter uses two scientific computing kernels to illustrate challenges of designing parallel algorithms
for one heterogeneous multi-core processor, the Cell Broadband Engine processor (Cell/B.E.). It describes

the limitation of the current parallel systems using single-core processors as building blocks. The limitation deteriorates the performance of applications which have data-intensive and computation-intensive
kernels such as Finite Difference Time Domain (FDTD) and Fast Fourier Transform (FFT). FDTD is a
regular problem with a nearest-neighbour communication pattern under a synchronization constraint. FFT
based on indirect swap network (ISN) modifies the data mapping in traditional Cooley-Tukey butterfly
network to improve data locality, hence reducing the communication and synchronization overhead.
The authors aim to unleash the Cell/B.E. by designing parallel FDTD and parallel FFT based on ISN,
taking into account unique features of the Cell/B.E. such as its eight SIMD processing units on a single
chip and its high-speed on-chip bus.
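The nearest-neighbour pattern mentioned above is visible in the FDTD update itself. The following 1-D sketch is purely illustrative, plain Python with arbitrary constants rather than the chapter's Cell/B.E. implementation, and shows that each grid point reads only adjacent field values, which is what makes the kernel amenable to SIMD units with small local stores.

```python
# Minimal 1-D FDTD update loop (illustrative; constants are arbitrary).
import math

n_cells, n_steps = 200, 100
ez = [0.0] * n_cells      # electric field
hy = [0.0] * n_cells      # magnetic field

for t in range(n_steps):
    # H update: each hy[i] reads only its neighbour ez[i+1] (nearest-neighbour access).
    for i in range(n_cells - 1):
        hy[i] += 0.5 * (ez[i + 1] - ez[i])
    # Soft Gaussian source injected in the middle of the grid.
    ez[n_cells // 2] += math.exp(-((t - 30.0) ** 2) / 100.0)
    # E update: each ez[i] reads only hy[i - 1] and hy[i].
    for i in range(1, n_cells):
        ez[i] += 0.5 * (hy[i] - hy[i - 1])

print(max(ez), min(ez))   # the pulse has propagated outward from the source
```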

Section 4
Scheduling and Communication Techniques
Chapter 15
On Application Behavior Extraction and Prediction to Support and
Improve Process Scheduling Decisions ............................................................................................. 338
Evgueni Dodonov, University of São Paulo – ICMC, Brazil
Rodrigo Fernandes de Mello, University of São Paulo – ICMC, Brazil
Knowledge of application behavior allows the prediction of an application's expected workload and future
operations. Such knowledge can be used to support, improve and optimize scheduling decisions by distributing
data accesses and minimizing communication overheads. Different techniques can be used to obtain such
knowledge, ranging from simple source code analysis and sequential access pattern extraction to history-based
approaches and on-line behavior extraction methods. The extracted behavior can later be classified into
different groups, representing process execution states, and then used to predict future process events.
This chapter describes different approaches, strategies and methods for application behavior extraction
and classification, and also how this information can be used to predict new events, focusing on distributed process scheduling.
Chapter 16
A Structured Tabu Search Approach for Scheduling in Parallel Computing Systems........................ 354
Tore Ferm, Sydney University, Australia
Albert Y. Zomaya, Sydney University, Australia
Task allocation and scheduling are essential for achieving the high performance expected of parallel
computing systems. However, there are serious issues pertaining to the efficient utilization of computational resources in such systems that need to be resolved, such as achieving a balance between system
throughput and execution time. Moreover, many scheduling techniques involve massive task graphs with
complex precedence relations, processing costs, and inter-task communication costs. In general, there are
two main issues that should be highlighted: problem representation and finding an efficient solution in
a timely fashion. In the work proposed here, we have attempted to overcome the first problem by using
a structured model which offers a systematic method for the representation of the scheduling problem.
The model used can encode almost all of the parameters involved in a scheduling problem in a very

systematic manner. To address the second problem, a Tabu Search algorithm is used to allocate tasks to
processors in a reasonable amount of time. The use of Tabu Search has the advantage of obtaining solutions to more general instances of the scheduling problem in reasonable time spans. The efficiency of
the proposed framework is demonstrated by using several case studies. A number of evaluation criteria
will be used to optimize the schedules. Communication- and computation-intensive task graphs are
analyzed, as are a number of different task graph shapes and sizes.
Chapter 17
Communication Issues in Scalable Parallel Computing ..................................................................... 378
C.E.R. Alves, Universidade São Judas Tadeu, Brazil
E. N. Cáceres, Universidade Federal de Mato Grosso do Sul, Brazil
F. Dehne, Carleton University, Canada
S. W. Song, Universidade de São Paulo, Brazil
In this book chapter, we discuss some important communication issues to obtain a highly scalable
computing system. We consider the CGM (Coarse-Grained Multicomputer) model, a realistic computing model to obtain scalable parallel algorithms. The communication cost is modeled by the number of
communication rounds and the objective is to design algorithms that require the minimum number of
communication rounds. We discuss some important issues and make considerations of practical importance, based on our previous experience in the design and implementation of parallel algorithms. The first
issue is the amount of data transmitted in a communication round. For a practical implementation to be
successful we should attempt to minimize this amount, even when it is already within the limit allowed
by the CGM model. The second issue concerns the trade-off between the number of communication
rounds which the CGM attempts to minimize and the overall communication time taken in the communication rounds. Sometimes a larger number of communication rounds may actually reduce the total
amount of data transmitted in the communication rounds. These two issues have guided us to present
efficient parallel algorithms for the string similarity problem, used as an illustration.
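To illustrate the notion of counting communication rounds in a CGM-style algorithm, here is a small simulated example; the function name and data are invented, and a real implementation would exchange the partial sums with MPI rather than Python lists. Prefix sums over p blocks can be computed with a single communication round that moves only p partial sums.

```python
# Illustrative sketch (not from the chapter): a CGM-style computation where the cost
# measure is the number of communication rounds and the data volume sent per round.

def cgm_prefix_sums(blocks):
    """Prefix sums over p blocks using a single communication round.

    Local phase 1: each processor sums its own block.
    Communication round: the p partial sums (O(p) data, within the coarse-grained
    O(n/p) bound) are gathered by every processor.
    Local phase 2: each processor offsets its local prefix sums.
    """
    rounds = 0

    partial = [sum(b) for b in blocks]                    # local computation
    rounds += 1                                           # one all-gather of p values
    offsets = [sum(partial[:i]) for i in range(len(blocks))]

    result = []
    for off, b in zip(offsets, blocks):                   # local computation
        acc, out = off, []
        for x in b:
            acc += x
            out.append(acc)
        result.append(out)
    return result, rounds

blocks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
prefixes, rounds = cgm_prefix_sums(blocks)
print(prefixes, "communication rounds:", rounds)          # rounds == 1
```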

Chapter 18
Scientific Workflow Scheduling with Time-Related QoS Evaluation ................................................ 396
Wanchun Dou, Nanjing University, P. R. China
Jinjun Chen, Swinburne University of Technology, Australia
This chapter introduces a scheduling approach for cross-domain scientific workflow execution with time-related
QoS evaluation. Generally, scientific workflow execution spans self-managing administrative domains to achieve
a global collaboration advantage. In practice, it is infeasible for a domain-specific application to disclose
its process details for privacy or security reasons. Consequently, coordinating scientific workflows and their
distributed domain-specific applications from a service invocation perspective is a challenging endeavor.
Therefore, in this chapter, we propose a collaborative scheduling approach, with time-related QoS evaluation,
for navigating cross-domain collaboration. Under this
collaborative scheduling approach, a private workflow fragment could maintain temporal consistency
with a global scientific workflow in resource sharing and task enactments. Furthermore, an evaluation
is presented to demonstrate the scheduling approach.

Section 5
Service Computing
Chapter 19
Grid Transaction Management and Highly Reliable Grid Platform ................................................... 421
Feilong Tang, Shanghai Jiao Tong University, China
Minyi Guo, Shanghai Jiao Tong University, China
As Grid technology expands from scientific computing to business applications, open grid platforms
increasingly need the support of transaction services. This chapter proposes a grid transaction service
(GridTS) and a GridTS-based transaction processing model, and defines two kinds of grid transactions: atomic
grid transactions for short-lived reliable applications and long-lived transactions for business processes.
We also present solutions for managing these two kinds of transactions to meet different consistency
requirements. Moreover, this chapter investigates a mechanism for the automatic generation of compensating
transactions during the execution of long-lived transactions through GridTS. Finally, we discuss
future trends in reliable grid platform research.
Chapter 20
Error Recovery for SLA-Based Workflows Within the Business Grid ............................................... 442
Dang Minh Quan, International University in Germany, Germany
Jorn Altmann, Seoul National University, South Korea
Laurence T. Yang, St. Francis Xavier University, Canada
This chapter describes the error recovery mechanisms in the system handling the Grid-based workflow
within the Service Level Agreement (SLA) context. It classifies the errors into two main categories.
The first comprises large-scale errors, where one or several Grid sites are detached from the Grid system at a
time. The second comprises small-scale errors, which may happen inside an RMS. For each type of error, the
chapter introduces a recovery mechanism, with the SLA context imposing the goal of the mechanism.
The authors believe that it is very useful to have an error recovery framework to avoid or eliminate the
negative effects of the errors.
Chapter 21
A Fuzzy Real Option Model to Price Grid Compute Resources ........................................................ 471
David Allenotor, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada
Kenneth Chiu, University at Binghamton, State University of NY, USA
Sameer Tilak, University of California, San Diego, USA
A computational grid is a geographically dispersed heterogeneous computing facility owned by dissimilar
organizations with diverse usage policies. As a result, guaranteeing grid resource availability as well as
pricing those resources raises a number of challenging issues, ranging from security to the management of
grid resources. In this chapter we design and develop a grid resource pricing model using a fuzzy real
option approach and show that finance models can be effectively used to price grid resources.

Volume II
Chapter 22
The State of the Art and Open Problems in Data Replication in Grid Environments ........................ 486
Mohammad Shorfuzzaman, University of Manitoba, Canada
Rasit Eskicioglu, University of Manitoba, Canada
Peter Graham, University of Manitoba, Canada
Data Grids provide services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored at distributed locations around the world. For example,
the next-generation of scientific applications such as many in high-energy physics, molecular modeling,
and earth sciences will involve large collections of data created from simulations or experiments. The size
of these data collections is expected to be of multi-terabyte or even petabyte scale in many applications.
Ensuring efficient, reliable, secure and fast access to such large data is hindered by the high latencies of
the Internet. The need to manage and access multiple petabytes of data in Grid environments, and to ensure
data availability and access optimization, presents challenges that must be addressed.
Chapter 23
Architectural Elements of Resource Sharing Networks ..................................................................... 517
Marcos Dias de Assunção, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia
This chapter first presents taxonomies on approaches for resource allocation across resource sharing
networks such as Grids. It then examines existing systems and classifies them under their architectures,
operational models, support for the life-cycle of virtual organisations, and resource control techniques.
Resource sharing networks have been established and used for various scientific applications over the
last decade. The early ideas of Grid computing have foreseen a global and scalable network that would
provide users with resources on demand. In spite of the extensive literature on resource allocation and
scheduling across organisational boundaries, these resource sharing networks mostly work in isolation,
thus contrasting with the original idea of Grid computing. Several efforts have been made towards providing architectures, mechanisms, policies and standards that may enable resource allocation across Grids.
A survey and classification of these systems are relevant for the understanding of different approaches
utilised for connecting resources across organisations and virtualisation techniques. In addition, a classification also sets the ground for future work on inter-operation of Grids.

Section 6
Optimization Techniques
Chapter 24
Simultaneous MultiThreading Microarchitecture ............................................................................... 552
Chen Liu, Florida International University, USA
Xiaobin Li, Intel Corporation, USA
Shaoshan Liu, University of California, Irvine, USA
Jean-Luc Gaudiot, University of California, Irvine, USA
Due to the conventional sequential programming model, the Instruction-Level Parallelism (ILP) that
modern superscalar processors can explore is inherently limited. Hence, multithreading architectures
have been proposed to exploit Thread-Level Parallelism (TLP) in addition to conventional ILP. By issuing and executing instructions from multiple threads at each clock cycle, Simultaneous MultiThreading
(SMT) achieves some of the best possible system resource utilization and accordingly higher instruction
throughput. In this chapter, we describe the origin of SMT microarchitecture, comparing it with other
multithreading microarchitectures. We identify several key aspects for high-performance SMT design:
fetch policy, handling long-latency instructions, resource sharing control, synchronization and communication. We also describe some potential benefits of SMT microarchitecture: SMT for fault-tolerance
and SMT for secure communications. Given the need to support sequential legacy code and the emergence of new parallel programming models, we believe the SMT microarchitecture will play a vital role as we enter the multi-threaded multi/many-core processor design era.
Chapter 25
Runtime Adaption Techniques for HPC Applications ........................................................................ 583
Edgar Gabriel, University of Houston, USA
This chapter discusses runtime adaption techniques targeting high-performance computing applications.
In order to exploit the capabilities of modern high-end computing systems, applications and system
software have to be able to adapt their behavior to hardware and application characteristics. Using
the Abstract Data and Communication Library (ADCL) as the driving example, the chapter shows the
advantage of using adaptive techniques to exploit characteristics of the network and of the application.
This makes it possible to reduce the execution time of applications significantly and to avoid having to maintain different architecture-dependent versions of the source code.
Chapter 26
A Scalable Approach to Real-Time System Timing Analysis............................................................. 606
Alan Grigg, Loughborough University, UK
Lin Guan, Loughborough University, UK
This chapter describes a real-time system performance analysis approach known as reservation-based
analysis (RBA). The scalability of RBA is derived from an abstract (target-independent) representation of system software components, their timing and resource requirements and run-time scheduling
policies. The RBA timing analysis framework provides an evolvable modeling solution that can be

instigated in early stages of system design, long before the software and hardware components have
been developed, and continually refined through successive stages of detailed design, implementation
and testing. At each stage of refinement, the abstract model provides a set of best-case and worst-case
timing guarantees that will be delivered subject to a set of scheduling obligations being met by the
target system implementation. An abstract scheduling model, known as the rate-based execution model, then provides an implementation reference model; compliance with it ensures that the imposed set of timing obligations will be met by the target system.
Chapter 27
Scalable Algorithms for Server Allocation in Infostations ................................................................. 645
Alan A. Bertossi, University of Bologna, Italy
M. Cristina Pinotti, University of Perugia, Italy
Romeo Rizzi, University of Udine, Italy
Phalguni Gupta, Indian Institute of Technology Kanpur, India
The server allocation problem arises in isolated infostations, where mobile users passing through the coverage area require immediate, high bit-rate communication services such as web surfing, file transfer, voice messaging, email, and fax. Given a set of service requests, each characterized by a temporal interval and
a category, an integer k, and an integer hc for each category c, the problem consists in assigning a server
to each request in such a way that at most k mutually simultaneous requests are assigned to the same
server at the same time, out of which at most hc are of category c, and the minimum number of servers
is used. Since this problem is computationally intractable, a scalable 2-approximation on-line algorithm
is exhibited. Generalizations of the problem are considered, which contain bin-packing, multiprocessor
scheduling, and interval graph coloring as special cases, and admit scalable on-line algorithms providing constant approximations.

Section 7
Web Computing
Chapter 28
Web Application Server Clustering with Distributed Java Virtual Machine ...................................... 658
King Tin Lam, The University of Hong Kong, Hong Kong
Cho-Li Wang, The University of Hong Kong, Hong Kong
Web application servers, being today's enterprise application backbone, have warranted a wealth of J2EE-based clustering technologies. Most of them, however, need complex configurations and excessive programming effort to retrofit applications for cluster-aware execution. This chapter proposes a clustering
approach based on distributed Java virtual machine (DJVM). A DJVM is a collection of extended JVMs
that enables parallel execution of a multithreaded Java application over a cluster. A DJVM achieves
transparent clustering and resource virtualization, extolling the virtue of single-system-image (SSI). We
evaluate this approach through porting Apache Tomcat to our JESSICA2 DJVM and identify scalability
issues arising from fine-grain object sharing coupled with intensive synchronizations among distributed
threads. By leveraging relaxed cache coherence protocols, we are able to conquer the scalability barriers and harness the power of our DJVM's global object space design to significantly outstrip existing clustering techniques for cache-centric web applications.
Chapter 29
Middleware for Community Coordinated Multimedia ....................................................................... 682
Jiehan Zhou, University of Oulu, Finland
Zhonghong Ou, University of Oulu, Finland
Junzhao Sun, University of Oulu, Finland
Mika Rautiainen, University of Oulu, Finland
Mika Ylianttila, University of Oulu, Finland
Community Coordinated Multimedia (CCM) envisions a novel paradigm that enables the user to consume multiple media through requesting multimedia-intensive Web services via diverse display devices,
converged networks, and heterogeneous platforms within a virtual, open and collaborative community.
These trends yield new requirements for CCM middleware. This chapter aims to systematically and extensively describe middleware challenges and opportunities to realize the CCM paradigm by reviewing
the activities of middleware with respect to four viewpoints, namely mobility-aware, multimedia-driven,
service-oriented, and community-coordinated.

Section 8
Mobile Computing and Ad Hoc Networks
Chapter 30
Scalability of Mobile Ad Hoc Networks ............................................................................................. 705
Dan Grigoras, University College Cork, Ireland
Daniel C. Doolan, Robert Gordon University, UK
Sabin Tabirca, University College Cork, Ireland
This chapter addresses scalability aspects of mobile ad hoc network management and of the clusters built on top of such networks. Mobile ad hoc networks are created by mobile devices, without the help of any infrastructure, for the purpose of communication and service sharing. As a key supporting service, the management of mobile ad hoc networks is identified as an important aspect of their exploitation. Obviously, management must be simple, effective, frugal with resources, reliable, and scalable. The first section of this chapter discusses different incarnations of the management service of mobile ad hoc networks in the light of these characteristics. Cluster computing is an interesting computing paradigm that, by aggregating network hosts, provides more resources than are available on any one of them. Clustering mobile and heterogeneous devices is not an easy task, as is shown in the second part of the chapter. Both sections include innovative solutions, proposed by the authors, for the management and clustering of mobile ad hoc networks.

Chapter 31
Network Selection Strategies and Resource Management Schemes
in Integrated Heterogeneous Wireless and Mobile Networks............................................................ 718
Wei Shen, University of Cincinnati, USA
Qing-An Zeng, University of Cincinnati, USA
An integrated heterogeneous wireless and mobile network (IHWMN) is formed by combining different types of wireless and mobile networks (WMNs) in order to provide more comprehensive services, such as high bandwidth with wide coverage. In an IHWMN, a mobile terminal equipped with multiple network interfaces can connect to any available network, or even to multiple networks at the same time. The terminal can also change its connection from one network to another while keeping its communication alive. Although the IHWMN is very promising and a strong candidate for future WMNs, it raises many issues because different types of networks or systems need to be integrated to provide seamless service to mobile users. In this chapter, we focus on some major issues in IHWMNs. Several novel network selection strategies and resource management schemes are also introduced for IHWMNs to provide better resource allocation for this new network architecture.

Section 9
Fault Tolerance and QoS
Chapter 32
Scalable Internet Architecture Supporting Quality of Service (QoS) ................................................. 739
Priyadarsi Nanda, University of Technology, Sydney (UTS), Australia
Xiangjian He, University of Technology, Sydney (UTS), Australia
The evolution of the Internet and its technologies has brought tremendous growth in business, education, and research over the last four decades. With the dramatic advances in multimedia technologies and the increasing popularity of real-time applications, Quality of Service (QoS) support in the Internet has recently been in great demand. Given the deployment of such applications over the Internet in recent years, and the trend to manage them efficiently with a desired QoS in mind, researchers have been pursuing a major shift from the Internet's Best Effort (BE) model to a service-oriented model. Such efforts have resulted in Integrated Services (Intserv), Differentiated Services (Diffserv), Multi Protocol Label Switching (MPLS), Policy Based Networking (PBN), and many more technologies. The reality, however, is that such models have been implemented only in certain areas of the Internet, and many of them also face scalability problems when dealing with huge numbers of traffic flows with varied priority levels. As a result, an architecture that addresses the scalability problem and satisfies end-to-end QoS remains a big open issue in the Internet. In this chapter we propose a policy-based architecture which we believe can achieve scalability while offering end-to-end QoS in the Internet.

Chapter 33
Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing ................................. 760
Zizhong Chen, Colorado School of Mines, USA
Today's long-running scientific applications typically tolerate failures by checkpoint/restart, in which all process states of an application are saved to stable storage periodically. However, as the number of processors in a system increases, the amount of data that needs to be saved to stable storage also increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this chapter, we introduce some scalable techniques to tolerate a small number of process failures in large parallel and distributed computing. We present several encoding strategies for diskless checkpointing to improve the scalability of the technique. We introduce the algorithm-based checkpoint-free fault tolerance technique to tolerate fail-stop failures without checkpointing or rollback recovery. Coding approaches and floating-point erasure correcting codes are also introduced to help applications survive multiple simultaneous process failures. The introduced techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. Experimental results demonstrate that the introduced techniques are highly scalable.

Section 10
Applications
Chapter 34
Efficient Update Control of Bloom Filter Replicas in Large Scale Distributed Systems ................... 785
Yifeng Zhu, University of Maine, USA
Hong Jiang, University of Nebraska–Lincoln, USA
This chapter discusses the false-positive and false-negative rates of Bloom filters in a distributed environment. A Bloom filter (BF) is a space-efficient data structure that supports probabilistic membership queries. In distributed systems, a Bloom filter is often used to summarize local services or objects, and this Bloom filter is replicated to remote hosts. This allows remote hosts to perform fast membership queries without contacting the original host. However, when the services or objects change, the remote Bloom filter replica may become stale. This chapter analyzes the impact of staleness on the false positives and false negatives of membership queries on a Bloom filter replica. An efficient update control mechanism is then proposed, based on the analytical results, to minimize the update overhead. This chapter validates the analytical models and the update control mechanism through simulation experiments.
Chapter 35
Image Partitioning on Spiral Architecture .......................................................................................... 808
Qiang Wu, University of Technology, Australia
Xiangjian He, University of Technology, Australia
Spiral Architecture is a relatively new and powerful approach to image processing. It contains very useful
geometric and algebraic properties. Based on the abundant research achievements in the past decades,
it is shown that Spiral Architecture will play an increasingly important role in image processing and

computer vision. This chapter presents a significant application of Spiral Architecture for distributed
image processing. It demonstrates the impressive characteristics of spiral architecture for high performance image processing. The proposed method tackles several challenging practical problems during
the implementation. The proposed method reduces the data communication between the processing
nodes and is configurable. Moreover, the proposed partitioning scheme has a consistent approach: after
image partitioning each sub-image should be a representative of the original one without changing the
basic object, which is important to the related image processing operations.
Chapter 36
Scheduling Large-Scale DNA Sequencing Applications .................................................................... 841
Sudha Gunturu, Oklahoma State University, USA
Xiaolin Li, Oklahoma State University, USA
Laurence Tianruo Yang, St. Francis Xavier University, Canada
This chapter studies a load scheduling strategy with near-optimal processing time that is designed to
explore the computational characteristics of DNA sequence alignment algorithms, specifically, the
Needleman-Wunsch Algorithm. Following the divisible load scheduling theory, an efficient load scheduling strategy is designed in large-scale networks so that the overall processing time of the sequencing
tasks is minimized. In this study, the load distribution depends on the length of the sequence and the number of processors in the network, and the total processing time is also affected by the communication link speed. Several cases have been considered in the study by varying the sequences, the communication and computation speeds, and the number of processors. Through simulation and numerical analysis, this study demonstrates that, for a constant sequence length, the processing time for the job decreases as the number of processors in the network increases, until the minimum overall processing time is achieved.
Chapter 37
Multi-Core Supported Deep Packet Inspection .................................................................................. 858
Yang Xiang, Central Queensland University, Australia
Daxin Tian, Tianjin University, China
Network security applications such as intrusion detection systems (IDSs), firewalls, anti-virus/spyware systems, anti-spam systems, and security visualisation applications are all computing-intensive applications. These applications all rely heavily on deep packet inspection, which examines the content of each network packet's payload. Today these security applications cannot cope with the speed of the broadband Internet that has already been deployed; that is, processing power lags far behind bandwidth. The recent development of multi-core processors brings more processing power. Multi-core processors represent a major evolution in computing hardware technology. While two years ago most network processors and personal computer microprocessors had a single-core configuration, the majority of current microprocessors contain dual or quad cores, and the number of cores on a die is expected to grow exponentially over time. The purpose of this chapter is to discuss research on using multi-core technologies to parallelize deep packet inspection algorithms, and how such an approach will improve the performance of deep packet inspection applications. This will eventually give security systems the capability of real-time packet inspection and thus significantly improve the overall state of security in the current Internet infrastructure.

Chapter 38
State-Carrying Code for Computation Mobility ................................................................................. 874
Hai Jiang, Arkansas State University, USA
Yanqing Ji, Gonzaga University, USA
Computation mobility enables running programs to move among machines and is essential to performance gains, fault tolerance, and increased system throughput. State-carrying code (SCC) is a software mechanism that achieves such computation mobility by saving and retrieving computation states during normal program execution in heterogeneous multi-core/many-core clusters. This chapter analyzes the pros and cons of different kinds of state saving/retrieving mechanisms. To achieve a portable, flexible and scalable solution, SCC adopts the application-level thread migration approach. The major deployment features are explained, and one example system, MigThread, is used to illustrate implementation details. Future trends are given to point out how SCC can evolve into a complete lightweight virtual machine. New high-productivity languages might step in to raise SCC to the language level. With SCC, thorough resource utilization is expected.

Compilation of References ............................................................................................................... 895


Foreword

I am delighted to write the Foreword to this book, as it is a very useful resource in a time where change
is dramatic and guidance on how to proceed in the development and use of scalable computing technology is in demand.
The book is timely, as it comes at the meeting point of two major challenges and opportunities.
Information technology, having grown at an increasingly rapid pace since the construction of the first
electronic computer, has now reached a point where it represents an essential, new and transformational
enabler of progress in science, engineering, and the commercial world. The performance of today's
computer hardware and the sophistication of their software systems yield a qualitatively new tool for
scientific discovery, industrial engineering, and business solutions. The solutions complement and
promise to go beyond those achievable with the classical two pillars of science - theory and real-world
experiments. The opportunity of the third pillar is substantial; however, building it is still a significant
challenge. The second challenge and opportunity lies in the current transformation that the computer
industry is undergoing. The emergence of multicore processors has been called "the greatest disruption
information technology has seen." As several decades of riding Moore's law to easily accelerate clock
speeds have come to an end, parallel hardware and software solutions must be developed. While the
challenge of creating such solutions is formidable, it also represents an opportunity that is sure to create food for thought and work for new generations of scientists, engineers, students, and practitioners.
Scalable computing technologies are at the core of both challenges; they help create the hardware and
software architectures underlying the third pillar of science and they help create the parallel computing
solutions that will make or break the multicore revolution.
The book addresses many issues related to these challenges and opportunities. Among them is the
question of the computer model of the future. Will we continue to obtain computer services from local
workstations and personal computers? Will the compute power be concentrated in servers? Will these
systems be connected in the form of Grids? The book also discusses the Cloud model, where the end user
obtains all information services via networks from providers "out there" - possibly via small hand-held
devices. Embedded and real-time computer systems are another factor in this equation, as information
technology continues to penetrate all appliances, equipment, and wearables in our daily lives.
While computer systems evolve, the question of the relevant new applications continues to boggle
our minds. Classical performance-thirsty programs are those in the area of science and engineering. Future scalable applications are likely to include business and personal software, such as web and mobile
applications, tools running on ad-hoc networks, and a myriad of entertainment software.
Among the grandest challenges is the question of programming tools and environments for future,
scalable software. In the past, parallel programming has been a niche for a small number of scientists and
geeks. With multicores and large-scale parallel systems, this technology now must quickly be learned by
masses of software engineers. Many new models are being proposed. They include those where multiple


cores and computers communicate by exchanging messages as well as those that share a global address
space. The book also discusses mixed models, which will likely have an important role in bridging and
integrating heterogeneous computer architectures.
The book touches on both classical and newly emerging issues to reach for the enormous opportunities ahead. Among the classical issues are those of performance analysis and modeling, benchmarking,
development of scalable algorithms, communication, and resource management. While many solutions
to these issues have been proposed in the past, evolving them to true scalability is likely to lead to many
more doctoral dissertations at universities and technologies in industries.
Among the chief pressing new issues is the creation of scalable hardware and software solutions.
Tomorrow's high-performance computers may contain millions of processors; even their building blocks
may contain tens of cores within foreseeable time. Today's hardware and software solutions are simply
inadequate to deal with this sheer scale. Managing power and energy is another issue that has emerged
as a major concern. On one hand, power dissipation of computer chips is the major reason that clock
speeds can no longer increase; on the other hand, the overall consumption of information technology's
power has risen to a political issue - we will soon use more energy for information processing than for
moving matter! Furthermore, as computer systems scale to a phenomenal number of parts, their dependability is of increasing concern; failures and their tolerance may need to be considered as part of standard
operating procedures. Among the promising principles underlying many of these technologies is that of
dynamic adaptation. Future hardware and software systems may no longer be static. They may change,
adapting to new data, environments, faults, resource availability, power, and user demands. They may
dynamically incorporate newly available technology, possibly creating computer solutions that evolve
continually.
The large number of challenges, opportunities, and solutions presented herein will benefit a broad
readership from students, to scientists, to practitioners. I am pleased to be able to recommend this book
to all those who are looking to learn, use, and contribute to future scalable computing technologies.

Rudolf Eigenmann
Professor of Electrical and Computer Engineering and
Technical Director for HPC, Computing Research Institute
Purdue University
November 2008

Rudolf Eigenmann is a professor at the School of Electrical and Computer Engineering and Technical Director for HPC
of the Computing Research Institute at Purdue University. His research interests include optimizing compilers, programming
methodologies and tools, performance evaluation for high-performance computers and applications, and Internet sharing technology. Dr. Eigenmann received a Ph.D. in Electrical Engineering/Computer Science in 1988 from ETH Zurich, Switzerland.


Preface

There is a constantly increasing demand for computational power for solving complex computational
problems in science, engineering and business. The past decade has witnessed a proliferation of more
and more high-performance scalable computing systems. The impressive progress is mainly due to the
availability of enabling technologies in hardware, software or networks. High-end innovations on such
enabling technologies have been fundamental and present cost-effective tools to explore the currently
available high performance systems to make further progress.
To that end, this Handbook of Research on Scalable Computing Technologies presents and discusses ideas, results, and experiences concerning recent important advances and future challenges in such enabling technologies.
This handbook is directed to those interested in developing programming tools and environments for academic or research computing, extracting inherent parallelism, and achieving higher performance. It will also be useful for upper-level undergraduate and graduate students studying this subject.
The main topics covered in this book center on scalable computing and span a wide array of areas:

Architectures and systems
Software and middleware
Data and resource management paradigms
Programming models, tools, problem solving environments
Trust and security
Service-oriented computing
Data-intensive computing
Cluster and Grid computing
Community and collaborative computing networks
Scheduling and load balancing
Economic and utility computing models
Peer-to-Peer systems
Multi-core/Many-core based computing
Parallel and distributed techniques
Scientific, engineering and business computing

This book is a valuable source for those interested in the development of the field of grid engineering for academic or enterprise computing, and it is aimed at computer scientists, researchers, and technical managers working in all areas of science, engineering, and economics in academia, research centers, and industry.


Acknowledgment

Of course, the areas and topics represented in this handbook are not an exhaustive representation of the world of current scalable computing. Nonetheless, they represent the rich and many-faceted knowledge that we have the pleasure of sharing with the readers.
The editors would like to acknowledge all of the authors for their insights and excellent contributions
to this handbook and the help of all involved in the collaboration and review process of the handbook,
without whose support the project could not have been satisfactorily completed. Most of the authors of
chapters included in this handbook also served as referees for chapters written by other authors. Thanks
go to all those who provided constructive and comprehensive reviews.
Special thanks also go to the publishing team at IGI Global, whose contributions throughout the
whole process, from the inception of the initial idea to final publication, have been invaluable. In particular, thanks go to Rebecca Beistline, who continuously prodded us via e-mail to keep the project on schedule, and to Joel A. Gamon, who helped us complete the production of the book project professionally.

Kuan-Ching Li
Ching-Hsien Hsu
Laurence Tianruo Yang
Jack Dongarra
Hans Zima

Section 1

Grid Architectures and Applications

Chapter 1

Pervasive Grid and its Applications
Ruay-Shiung Chang
National Dong Hwa University, Taiwan
Jih-Sheng Chang
National Dong Hwa University, Taiwan

DOI: 10.4018/978-1-60566-661-7.ch001

ABSTRACT
With the advancement of computer system and communication technologies, Grid computing can be seen as the popular technology bringing about a significant revolution for the next generation of distributed computing applications. For general users, however, grid middleware is complex to set up and entails a steep learning curve. How to access the grid system transparently from the users' point of view therefore becomes a critical issue. Various challenges may arise from incomplete system design when coordinating existing computing resources to achieve a pervasive grid environment. The authors investigate current research on pervasive grids and analyze the most important factors and components for constructing a pervasive grid system. Finally, in order to improve the efficiency of teaching and research within a campus, they introduce their pervasive grid platform.

INTRODUCTION
Current scientific problems are becoming more and more complex for computers. With the advances in hardware computing power and the diversification of Internet services, distributed computing applications are becoming increasingly important and widespread. However, earlier technologies such as cluster and parallel computing are insufficient to process data-intensive or computation-intensive applications that involve large amounts of data file transmission. In addition, from the perspective of most users, a secure and powerful computing environment is beneficial for a tremendous number of computing jobs and data-intensive applications. Fortunately, a new technology
called grid computing (Reed, 2003; Foster, 2002; Foster, 2001) has recently been developed to provide the powerful computing ability needed to support such distributed computing applications. The grid is a burgeoning technology capable of integrating a variety of computing resources and of scheduling jobs from various sites, in order to supply a large number of users with breakthrough computing power at low cost.
Most current grid systems in operation are based on a middleware approach. A few grid middleware projects have been developed so far, such as Globus, Legion, UNICORE, and SRB. However, for general users, grid middleware is complex to set up and entails a steep learning curve. Take Globus, which is now in widespread use for the deployment of grid middleware, as an example: only a command-line environment is provided for users. Working well with Globus requires strong knowledge of grid functions and system architecture, so for a general user, manipulating the grid middleware seems rather complex. The overhead of managing and maintaining grid middleware limits its popularization among users. In addition, it is hard to integrate computing resources such as mobile devices, handsets, and laptops into a ubiquitous computing platform because system support for underlying heterogeneous resources is deficient. How to access the grid system transparently from the users' point of view therefore becomes a critical issue. On the other hand, for programmers, a lack of programming modules may increase the complexity of system development for a pervasive grid, and limited support for application-level components also restricts their ability to develop pervasive services.
Therefore, various challenges may arise from incomplete system design when coordinating existing computing resources to achieve a pervasive grid environment. In this chapter we investigate current research on pervasive grids and analyze the most important factors and components for constructing a pervasive grid system. In addition, in order to improve the efficiency of teaching and research within a campus, we introduce our pervasive grid platform, which aims to make resources available as conveniently as possible. The pervasive grid platform integrates all wired and mobile devices into a uniform resource on top of the existing grid infrastructure, so that resources can be accessed easily anytime and anywhere.

CURRENT AND FUTURE RESEARCH TRENDS


(Cannataro, 2003) proposed an architecture for the pervasive grid that makes use of diverse grid technologies, as indicated in Figure 1(a). For example, the knowledge grid is able to extract interesting information from huge amounts of source data by means of data mining technology. The semantic grid is an emerging technology aiming to translate semantic jobs into corresponding grid jobs or commands. The grid fabric provides various grid services, including the data grid and the information grid. The data grid processes data-intensive jobs by way of a powerful distributed storage system and data management technology, in order to achieve superior performance with minimal job execution time. The information grid provides the job broker with complete system information for job dispatch. The interconnection between diverse computing resources is achieved via P2P technology coupled with efficient management strategies, tending towards a more complete architecture.
Several works (Ali, 2006; Padala, 2003; Vazhkudai, 2002) have attempted to develop a high-performance framework for a grid-enabled operating system. A modular architecture


Figure 1.

called GridOS (Padala, 2003) was proposed in order to provide a layered infrastructure. Four design principles are considered: modularity, policy neutrality, universality of infrastructure, and minimal core operating system changes.
Figure 1(b) shows the system framework of GridOS from the point of view of modular design. At the kernel level, GridOS focuses on a high-performance I/O processing engine. For data-intensive applications, since large amounts of data are distributed and transported across the Internet, how to process these requests efficiently deserves a great deal of thought. Two aspects need to be taken into consideration: internal disk I/O processing and TCP transmission throughput. Internal I/O processing is improved by integrating the user-level FTP service into the kernel, so that the overhead of copying data from system space to user space is avoided. As for TCP transmission throughput, the optimal buffer size is calculated to maximize the throughput. In addition, three modules built on top of this I/O processing engine support, respectively, multi-threaded communication with different quality-of-service requirements, resource allocation management, and inter-process communication management:

Communication module
Resource management module


Process management module

(Ali, 2006) suggested several design points for developing a P2P-based grid operating system. A centralized system may not be appropriate for supporting plug and play: if many external computing resources attempt to join a pervasive grid environment, a centralized system can hardly manage the join and leave processes dynamically and efficiently. Hence, enabling the grid operating system to discover and share distributed resources transparently in a P2P fashion may be a proper alternative.
Figure 1(c) shows the overall architecture of the P2P-based grid operating system. Existing grid middleware supports only a few types of applications and no interactive ones; the grid-enabled process management layer is intended to support grid-enabled interactive applications rather than only batch ones. Furthermore, the process management layer also governs process migration, that is, the transit of a process between two grid nodes. Regarding the underlying connection, each node connects with the others by P2P communication in a self-organizing way; nearby peers are organized into a sub-Grid, while each sub-Grid is a member of a RootGrid. In addition, in order to provide all processes with a grid-wide shared memory space for accessing required data, a virtual file system is used to emulate such a global data access system.
A proxy-based clustered architecture was proposed in (Phan, 2002) for integrating mobile devices into a grid computing environment, as shown in Figure 1(d). A dedicated node called the interlocutor, running grid middleware such as Globus, is responsible for job management and resource aggregation on behalf of the mobile devices. All requests from users are handled and decomposed by the interlocutor for further job dispatch or resource requests. This is a scalable way to help mobile devices join a grid computing environment, since most of them lack the capacity to install and run grid middleware.
In (Hwang, 2004), a proxy-based wireless grid architecture was proposed. A proxy component is deployed as an interface between computing resources and mobile devices for service management and QoS requirements. With the proxy in place, a mobile user can connect to a grid environment with ease, without having to care about the differences between various mobile devices, thereby attaining heterogeneous interworking and pervasive computing. Registry and discovery mechanisms are deployed via Web Services, while all non-grid wireless devices are able to access the grid system, as illustrated in Figure 2(a).
A conception of a pervasive wireless grid was put forward in (Srinivasan, 2005). The whole computing environment consists of a backbone grid and access grids, as depicted in Figure 2(b). Mobile devices are treated as terminals connected to the backbone grid, and most computing jobs are dispatched to the backbone grid. In addition, the impact of service handoff on mobile users is discussed in that paper.
In conclusion, in light of the above discussion, whether to implement a pervasive grid system at the OS level or at the middleware level depends on the requirements. As indicated in Figure 2(c), the middleware level is suitable for developing a pervasive computing system for mobile devices because of its scalability. Since mobile devices cannot afford the overhead of running a middleware system, a proxy-based approach may be a proper solution: a dedicated proxy server handles the interconnection between mobile devices and grids, making it easier for mobile devices to join a grid environment. As for fixed devices with powerful computing ability and storage resources, an OS-level implementation is an efficient way to bring all available resources into full play.


Figure 2.

APPLICATION OF PERVASIVE GRID


In this section, we introduce an application based on the pervasive grid conception. In our implementation, we have made use of the Globus Toolkit as our system infrastructure. It provides several fundamental grid technologies along with an open source implementation of a series of grid services and libraries.
A few critical components of the Globus Toolkit are listed below:

Security: GSI (Grid Security Infrastructure) provides the authentication and authorization mechanisms for system protection based on X.509 proxy certificates.
Data management: components for manipulating data, including GridFTP and RLS (Replica Location Service). RLS maintains replica location information, mapping logical file names (LFNs) to physical file names (PFNs).
Execution management: GRAM (Grid Resource Allocation and Management) provides a series of uniform interfaces to simplify access to remote grid resources for job execution. A job is defined in RSL (Resource Specification Language) in terms of its executable, arguments, standard output, and so forth.
Information services: MDS (Monitoring and Discovery System) provides monitoring and discovery services for grid resources.

We have developed a portal program on the client side by means of the CoG Toolkit. The Java CoG Toolkit provides a series of programming interfaces as well as reusable objects for grid services such as GSI, GRAM, GridFTP, and MDS. It presents programmers with a mapping between the Globus Toolkit and Java APIs, so as to ease the programming complexity.
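To give a flavour of how such a portal drives Globus through the Java CoG Toolkit, the following minimal sketch submits a simple RSL-described job through the GramJob class. It is only an illustration: the RSL string and the gatekeeper contact are hypothetical, and class or method names may differ slightly between CoG releases.

import org.globus.gram.GramJob;
import org.globus.gram.GramJobListener;

// Minimal sketch of a GRAM submission via the Java CoG Toolkit.
// The RSL string and the gatekeeper contact below are illustrative only.
public class SubmitJobSketch {
    public static void main(String[] args) throws Exception {
        // RSL describes the executable, its arguments and where stdout goes.
        String rsl = "&(executable=/bin/hostname)(stdout=/tmp/hostname.out)(count=1)";

        GramJob job = new GramJob(rsl);

        // Print status changes (PENDING, ACTIVE, DONE, FAILED) as they arrive.
        job.addListener(new GramJobListener() {
            public void statusChanged(GramJob j) {
                System.out.println("Job status: " + j.getStatusAsString());
            }
        });

        // Submit to a (hypothetical) gatekeeper on one of our grid sites.
        // In a real client one would wait for the job to finish before exiting.
        job.request("gridnode1.ndhu.edu.tw/jobmanager-fork");
    }
}

A graphical portal can wrap this create, listen, request pattern behind its job tool, so end users never see the RSL syntax directly.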
Figure 2(d) shows the overall architecture of the pervasive grid with its hierarchical components. The underlying grid middleware is deployed using the Globus Toolkit, and the pervasive grid platform is implemented on top of it. A service-oriented provider, consisting of data, computation and information services, offers users a comprehensive computing environment. Data transmission and replication are the main operations in the data service, and we use GridFTP as the underlying transmission protocol. The computation service provides the computing resources for job execution. The information service gives up-to-date resource information such as CPU frequency, available memory space and average system load; such information can be used by the region job dispatcher during job submission in order to choose a proper grid site for execution. E-campus applications are built on top of the platform and services.
The pervasive grid system exploits the advantages of the pervasive grid to provide students and teachers with a digitalized education system. From the perspective of most users, a friendly interface without complicated manipulation is necessary. In order to simplify interconnection and operation, we have developed a user portal by means of the Java CoG Toolkit. Thanks to the cross-platform nature of Java, our portal solution can run on various operating systems. A user can connect to and access the e-campus services in a straightforward way via our client portal.

System Overview
This research applies grid technology in support of pervasive computing for a digitalized campus platform. We attempt to develop a pervasive grid environment, based on grid computing technology, that coordinates all wireless and wired computing devices within a grid computing environment. From the standpoint of users, all resources appear as a uniform type, regardless of the actual resource type. A user can access a variety of resources conveniently through the Web Services deployed in our system.
We have adopted a layered-design approach to implement the pervasive grid system. The design framework of the pervasive grid system appears in Figure 3(a). The layered design makes the pervasive grid system more flexible when new services are added as needed. Based on the Web Services architecture, we develop a service-oriented provider that offers users comprehensive grid computing services, including data, computation and information services, and provides flexibility for future services that support the pervasive grid system.
There are five components within the pervasive grid system, as shown in Figure 3(b):

Core computing infrastructure
Edge grid node
Web services
Pervasive grid platform
Applications

Figure 3.

The core computing infrastructure is the main computing and storage resource. It provides a computing platform with storage elements, a scheduling system, and workflow management. In contrast to the core computing infrastructure, an edge grid node is a terminal, such as a notebook, PDA, or personal computer, that connects to the core grid infrastructure. An edge grid node can access the grid services as well as publish services to the public. Web Services is a popular technology based on XML and HTTP for the construction of distributed computing applications. It has an open architecture capable of bridging any programming language, computing hardware, or operating system. Accordingly, we adopt Web Services as our software interface, in order to build a uniform entry point between an edge grid node and our grid services. As for the pervasive grid platform, it is
a middleware to provide the basic grid services for users, such as location management, service handoff,
and personal information management. In the applications layer, we develop some useful applications
based on the pervasive grid platform, including e-Classroom and e-Ecology.

Pervasive Grid Platform


The core computing infrastructure contains the computing power and storage capability needed to provide mobile or wired users with grid services, while an edge grid node is just a terminal between a user and the core computing infrastructure. The core computing infrastructure must provide users with an efficient interface in a seamless and transparent way. Consequently, it is essential to develop a high-performance platform to process users' requirements.
Several tasks, listed below, must be addressed by the pervasive grid platform:

Processing the join and leave of edge computing nodes: our system follows GSI (Grid Security Infrastructure) to design a user authentication/authorization mechanism adapted to our environment.
Managing the interconnection between the core computing infrastructure and edge grid nodes: there are several differences and limitations among the various edge grid devices. The pervasive grid platform is capable of managing these differences as well as satisfying users' QoS (Quality of Service) requirements.
The interconnection between the pervasive grid platform and the core computing infrastructure: as presented in Figure 6, we implement the interface that handles the interconnection between the pervasive grid platform and the core computing infrastructure through the Globus APIs. The corresponding algorithm for handling users' jobs via Globus is developed as well.
Job dispatch, management, and QoS: we are concerned with the development of a flexible, high-performance, and reliable dispatcher and scheduler within the pervasive grid platform, in order to satisfy the requirements of users. Users with different priorities obtain corresponding service levels (a simple dispatcher sketch follows this list).
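As an illustration of the last point, the sketch below shows one simple way such a dispatcher could serve higher-priority users first. All class and field names here are hypothetical; they are not part of Globus or of the platform's published interfaces.

import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical job descriptor: a smaller priority value is served earlier.
class UserJob implements Comparable<UserJob> {
    final String user;
    final String rsl;      // job description later handed to GRAM
    final int priority;    // derived from the user's service level

    UserJob(String user, String rsl, int priority) {
        this.user = user;
        this.rsl = rsl;
        this.priority = priority;
    }

    public int compareTo(UserJob other) {
        return Integer.compare(this.priority, other.priority);
    }
}

// Minimal priority-aware dispatcher sketch (not the actual platform code).
class RegionJobDispatcher {
    private final PriorityBlockingQueue<UserJob> queue = new PriorityBlockingQueue<>();

    void submit(UserJob job) {
        queue.put(job);
    }

    // A worker thread calls this in a loop and forwards the job to the
    // grid site chosen with the help of the information service.
    UserJob nextJob() throws InterruptedException {
        return queue.take();
    }
}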

Grid Service Provider


On the basis of the pervasive grid platform, we implement a service-oriented provider in a modular way based on Web Services technology, so that a service can easily be added or removed without taking great pains to maintain the system services. For example, the Data Grid service (Chervenak, 2000; Hoschek, 2000) is intended to provide a large amount of storage resources and distributed access technology for data-intensive applications. There are three grid service modules within our system: the Data-Grid service, the computational service, and the information service. The computational service supplies users with computing services for job execution. The information service is able to gather information about hardware resources.
In addition to the core grid infrastructure, an edge node can also publish and provide specific services. For instance, a PDA (Personal Digital Assistant) may publish a GPS service to the public, and other edge grid nodes can then access the GPS service provided by the PDA. Through this sharing of services, our system gains not merely a better service-oriented architecture but also a complete and diverse service provider. Therefore, as shown in Figure 3(c), it is reasonable to deploy and build a service repository system for maintaining all registered services dynamically, supporting operations such as querying, joining, and removing services (a minimal sketch of such a repository is given below).

Figure 4.
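Such a repository can be as simple as a registry mapping service names to the endpoints of the edge nodes that publish them. The sketch below uses hypothetical names and only illustrates the query, join and remove operations mentioned above; the real repository would be exposed through Web Services.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical in-memory service repository for the pervasive grid platform.
public class ServiceRepository {
    // service name -> endpoints of the edge nodes currently publishing it
    private final Map<String, List<String>> services = new ConcurrentHashMap<>();

    // "join": an edge node publishes a service (e.g. a PDA publishing "gps").
    public void register(String serviceName, String endpoint) {
        services.computeIfAbsent(serviceName, k -> new CopyOnWriteArrayList<>()).add(endpoint);
    }

    // "remove": the node withdraws the service, e.g. when it leaves the grid.
    public void unregister(String serviceName, String endpoint) {
        List<String> endpoints = services.get(serviceName);
        if (endpoints != null) {
            endpoints.remove(endpoint);
        }
    }

    // "query": find all current providers of a service.
    public List<String> lookup(String serviceName) {
        return services.getOrDefault(serviceName, Collections.emptyList());
    }
}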

Applications
Building on the pervasive grid platform and the service-oriented provider system, we study academic applications within a campus environment, collectively called the e-Campus system, which provides teachers and students with comprehensive services for research and teaching. The e-Campus system includes two applications: e-Ecology and e-Classroom.
National Dong Hwa University (NDHU) has a widespread natural ecosystem, which is a precious treasure for teaching and education. In addition, visitors may wish to understand and observe the natural environment within NDHU. For these reasons, we developed the e-Ecology application, shown in Figure 3(d), which keeps records of the daily activities of the natural ecosystem in NDHU as video files over long periods of time. Clearly, the data size of these video files is very large.


As presented in Figure 4(a), in order to cope with such a large amount of data, we have implemented a storage broker based on Data-Grid technology in support of the e-Ecology system. The overall components of the storage broker are presented in Figure 4(b). The file mover uses GridFTP as its transmission protocol to copy files between two grid nodes. The upload processing engine gets the free-space information of each storage node from MDS. We adapt the roulette wheel algorithm (Goldberg, 1989) to the storage broker for choosing a node to which a file is uploaded: the larger a node's available storage capacity, the higher its probability of being chosen, with a view to balancing the load across the storage resources (a sketch of this selection is given after this paragraph). The download processing engine is an agent distributed on each node. When the broker gets a download request, it retrieves information from the RLS database and redirects the request to the node that contains the file; the download agent on that node receives the request and starts transferring the file. The storage broker distributes download jobs across the storage nodes in order to shorten the download time, so users can browse the digital files smoothly without significant delay. Finally, the search engine helps users look for specific files by keywords or properties.
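The roulette wheel selection used by the upload processing engine can be sketched as follows: each candidate node is weighted by its free storage space (as reported by MDS), and a node is drawn with probability proportional to that weight. The node names and free-space figures below are purely illustrative.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Roulette wheel selection over storage nodes, weighted by free space.
public class RouletteStorageSelector {
    private final Random random = new Random();

    // freeSpacePerNode maps a node's contact string to its free space (in MB).
    public String chooseNode(Map<String, Long> freeSpacePerNode) {
        long total = 0;
        for (long free : freeSpacePerNode.values()) {
            total += free;
        }
        // Spin the wheel: pick a point in [0, total) and find the node it lands on.
        long spin = (long) (random.nextDouble() * total);
        long cumulative = 0;
        for (Map.Entry<String, Long> e : freeSpacePerNode.entrySet()) {
            cumulative += e.getValue();
            if (spin < cumulative) {
                return e.getKey();
            }
        }
        return null; // unreachable when the map is non-empty and total > 0
    }

    public static void main(String[] args) {
        Map<String, Long> nodes = new LinkedHashMap<>();
        nodes.put("storage1.ndhu.edu.tw", 800_000L);  // hypothetical figures
        nodes.put("storage2.ndhu.edu.tw", 200_000L);
        System.out.println(new RouletteStorageSelector().chooseNode(nodes));
    }
}

In this sketch a node with four times the free space of another is four times as likely to receive the next upload, which is exactly the balancing behaviour described above.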
With this efficient storage broker system, visitors can join the pervasive grid and access our ecological data via the e-Ecology system, as long as they are authorized. Students and teachers can also investigate the ecology within NDHU for their research.
As for e-Classroom, the video data for a course can be digitalized and stored in our system via Data-Grid technology. Students can review a course by browsing its video data via e-Classroom; with such multimedia review of courses, teaching efficiency can be improved. In addition, as shown in Figure 4(c), the teaching data can be shared among universities for distance learning, so as to achieve the objective of sharing education resources.

Implementation
We have implemented an integrated portal for the pervasive grid system with a friendly user interface called the NDHU Grid Client, as shown in Figure 4(d). The NDHU Grid Client is easy to use for students and teachers even if they have little computer knowledge. Several functions are integrated in this portal, including a user certification tool, a GridFTP transmission tool, a grid job tool, and the e-Campus applications. Each application is created in its own internal frame as an independent thread, so, thanks to the multi-threaded programming model, no job interferes with the others.
Among the e-Campus applications, take e-Ecology as an example: to browse an ecological video file via e-Ecology, we first connect to the storage broker, as shown in Figure 5(a), and then input the LFN (Logical File Name) of the video file. The storage broker searches for an optimal site containing this file and downloads it via GridFTP. GridFTP supports parallel data transfer using multiple TCP streams to improve bandwidth over a single stream, and we make use of parallel data transfer to shorten the waiting time for users. After the transmission, the ecological file is presented through the e-Ecology interface, as shown in Figure 5(b).

Performance Evaluation and Analyses


GridFTP supports parallel data transfer using multiple TCP streams for better performance, and we adopt parallel data transfer in our system in order to shorten the download time. Although increasing the parallelism of a transmission would seem to achieve better performance, it may also lead to more computing overhead on account of too many working threads in the system. We have experimented with the number

Figure 5.

of TCP data streams, from one to six, for downloading a 700-megabyte video file, in order to determine an appropriate parallelism value. The result is shown in Figure 5(c): three data streams turn out to be superior to the other settings. Therefore, we adopt a parallelism of three data streams in our implementation (a sketch of how such a transfer might be issued is given below).
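For reference, a parallel download like the one measured above could be issued with the CoG/jglobus FTP client roughly as follows. The sketch assumes the org.globus.ftp.GridFTPClient API (class and method names as found in the Java CoG kit, which may vary between versions) and a hypothetical host and file path; it selects extended block mode and three parallel data streams.

import java.io.File;

import org.globus.ftp.GridFTPClient;
import org.globus.ftp.GridFTPSession;
import org.globus.ftp.RetrieveOptions;
import org.globus.ftp.Session;

// Sketch of a parallel GridFTP download using three TCP streams.
public class ParallelDownload {
    public static void main(String[] args) throws Exception {
        GridFTPClient client = new GridFTPClient("storage1.ndhu.edu.tw", 2811);
        client.authenticate(null);                   // use the default GSI proxy credential

        client.setType(Session.TYPE_IMAGE);          // binary transfer
        client.setMode(GridFTPSession.MODE_EBLOCK);  // extended block mode enables parallelism
        client.setOptions(new RetrieveOptions(3));   // three parallel data streams

        client.get("/ecology/videos/sample.avi", new File("sample.avi"), null);
        client.close();
    }
}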
We have also compared the conventional single-stream transmission with the parallel transmission. As shown in Figure 5(d), the results indicate that our transmission model outperforms the conventional one, so users obtain excellent browsing quality for large video data via the e-Campus system.

CONCLUSION
In this chapter, we have investigated current research on pervasive grids and analyzed the most important factors and components for constructing a pervasive grid system. Two main implementation approaches have been explored so far: the OS level and the middleware level. Which one to choose depends on the system requirements and environment. Finally, we have introduced applications of the pervasive grid system.


REFERENCES
Ali, A., McClatchey, R., Anjum, A., Habib, I., Soomro, K., Asif, M., et al. (2006). From grid middleware
to a grid operating system. In Proceedings of the Fifth International Conference on Grid and Cooperative Computing, (pp. 9-16). China: IEEE Computer Society.
Cannataro, M., & Talia, D. (2003). Towards the next-generation grid: A pervasive environment for
knowledge-based computing. In Proceedings of the International Conference on Information Technology: Computers and Communications (pp.437-441), Italy.
Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2000). The data grid: Towards
an architecture for the distributed management and analysis of large scientific data sets. Journal of
Network and Computer Applications, 23(3), 187–200. doi:10.1006/jnca.2000.0110
CoG Toolkit (n.d.). Retrieved from http://www.cogkit.org/
Foster, I. (2002). The grid: A new infrastructure for 21st century science. Physics Today, 55, 42–47.
doi:10.1063/1.1461327
Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200–222.
Globus: Grid security infrastructure (GSI) (n.d.). Retrieved from http://www.globus.org/security/
Globus: The grid resource allocation and management (GRAM) (n.d.). Retrieved from http://www.
globus.org/toolkit/docs/3.2/gram/
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. New York: Addison-Wesley.
IBM Grid Computing. (n.d.). Retrieved from http://www-1.ibm.com/grid/
GridFTP (n.d.). Retrieved from http://www.globus.org/toolkit/docs/4.0/data/gridftp/
GSI (Globus Security Infrastructure). Retrieved from http://www.globus.org/Security/
Hoschek, W., Jaen-Martinez, J., Samar, A., Stockinger, H., & Stockinger, K. (2000). Data management
in an international data grid project. grid computing - GRID 2000 (pp.333-361). UK.
Hwang, J., & Arvamudham, P. (2004). Middleware services for P2P computing in wireless grid networks.
IEEE Internet Computing, 8(4), 40–46. doi:10.1109/MIC.2004.19
Information Services. (n.d.). Retrieved from http://www.globus.org/toolkit/mds/
Legion (n.d.). from http://www.legion.virginia.edu/
Padala, P., & Wilson, J. N. (2003). GridOS: Operating system services for grid architectures. In High
Performance Computing (pp. 353-362). Berlin: Springer.
Phan, T., Huang, L., & Dulan, C. (2002). Challenge: Integrating mobile wireless devices into the computational grid. In Proceedings of the 8th annual international conference on Mobile computing and
networking (pp. 271-278), USA.


Reed, D. A. (2003). Grids: The teragrid, and beyond. IEEE Computer, 36(1), 62–68.
Replica Location Service (RLS) (n.d.). Retrieved from http://www.globus.org/toolkit/docs/4.0/data/rls/
Siagri, R. (2007). Pervasive computers and the GRID: The birth of a computational exoskeleton for
augmented reality. In 6th Joint Meeting of the European Software Engineering Conference and the ACM
SIGSOFT Symposium on The foundations of software engineering (pp.1-4), Croatia.
SRB (Storage Resource Broker) (n.d.). Retrieved from http://www.sdsc.edu/srb/index.php/Main_Page
Srinivasan, S. H. (2005). Pervasive wireless grid architecture. In Proceedings of The Second Annual
Conference on Wireless On-demand Network Systems and Services (pp.83-88), Switzerland.
The EU Data Grid Project (n.d.). Retrieved from http://www.eu-datagrid.org/.
The Globus Alliance (n.d.). Retrieved from http://www.globus.org/
Unicore (n.d.). Retrieved from http://unicore.sourceforge.net
Vazhkudai, S., Syed, J., & Maginnis, T. (2002). PODOS - The design and implementation of a performance oriented Linux cluster. Future Generation Computer Systems, 18(3), 335–352. doi:10.1016/S0167-739X(01)00055-3

KEY TERMS AND DEFINITIONS

Grid Computing: A technology developed to provide powerful computing capability for supporting distributed computing applications.
Grid Middleware: A software toolkit between grid applications and grid fabrics that provides a series of functionalities, including grid security infrastructure, data management, job management, and information services.
The Grid Resource Allocation and Management (GRAM): GRAM provides a series of uniform interfaces that simplify access to remote grid resources for job execution.
Grid Security Infrastructure (GSI): GSI provides the authentication and authorization mechanisms for system protection, based on X.509 proxy certificates.
GridFTP: A secure file transfer protocol for grid computing.
Pervasive Grid: A novel grid architecture that enables users to manipulate grid services transparently.
Replica Location Service (RLS): RLS maintains the location information of replicas, mapping logical file names (LFNs) to physical file names (PFNs).


Chapter 2

Pervasive Grids

Challenges and Opportunities

Manish Parashar
Rutgers, The State University of New Jersey, USA
Jean-Marc Pierson
Paul Sabatier University, France
ABSTRACT
The Pervasive Grid is motivated by the advances in Grid technologies and the proliferation of pervasive systems, and is leading to the emergence of a new generation of applications that use pervasive and ambient information as an integral part to manage, control, adapt and optimize. However, the inherent scale and complexity of Pervasive Grid systems fundamentally impact how applications are formulated, deployed and managed, and present significant challenges that permeate all aspects of the systems software stack. In this chapter, the authors present some use-cases of Pervasive Grids and highlight their opportunities and challenges. They then explain why semantic knowledge and autonomic mechanisms are seen as foundations for conceptual and implementation solutions that can address these challenges.

INTRODUCTION
Grid computing has emerged as the dominant paradigm for wide-area distributed computing (Parashar &
Lee, 2005). The goal of the original Grid concept is to combine resources spanning many organizations
into virtual organizations that can more effectively solve important scientific, engineering, business and
government problems. Over the last decade, significant resources and research efforts have been devoted
towards making this vision a reality and have led to the development and deployment of a number of
Grid infrastructures targeting a variety of applications.
However, recent technical advances in computing and communication technologies and associated
cost dynamics are rapidly enabling a ubiquitous and pervasive world - one in which the everyday objects
surrounding us have embedded computing and communication capabilities and form a seamless Grid of
information and interactions. As these technologies weave themselves into the fabrics of everyday life
(Weiser, 1991), they have the potential of fundamentally redefining the nature of applications and how
they interact with and use information.
This leads to a new revolution in the original Grid concept and the realization of a Pervasive Grid
vision. The Pervasive Grid vision is driven by the advances in Grid technologies and the proliferation
of pervasive systems, and seamlessly integrates sensing/actuating instruments and devices together with
classical high performance systems as part of a common framework that offers the best immersion of users
and applications in the global environment. This is, in turn, leading to the emergence of a new generation
of applications using pervasive and ambient information as an integral part to manage, control, adapt and
optimize (Pierson, 2006; Matossian et al., 2005; Bangerth, Matossian, Parashar, Klie, & Wheeler, 2005; Parashar et al., 2006). These applications span a range of areas, including crisis management, homeland security, personal healthcare, predicting and managing natural phenomena, monitoring and managing engineering systems, optimizing business processes, etc. (Baldridge et al., 2006).
Note that it is reasonable to argue that in concept, the vision of Pervasive Grids was inherent in the
visions of computing as a utility articulated originally by Corbató et al. (Corbató & Vyssotsky, 1965) and later by Foster et al. (Foster, Kesselman, & Tuecke, 2001). In this sense, Pervasive Grids are the next significant step towards realizing the metaphor of the power grid. Furthermore, while Foster et al. defined a computational Grid in (Foster & Kesselman, 1999) as "... a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities", the term pervasive in this definition refers to the transparent access to resources rather than the nature of the resources themselves. Pervasive Grids focus on the latter and essentially address an extreme generalization of the Grid concept, where the resources are pervasive and include devices, services, information, etc.
The aim of this chapter is to introduce the vision of Pervasive Grid computing and to highlight its
opportunities and challenges. In this chapter, we first describe the nature of applications in a Pervasive
Grid and outline their requirements. We then describe key research challenges, and motivate semantic
knowledge and autonomic mechanisms as the foundations for conceptual and implementation solutions
that can address these challenges.

PERVASIVE GRID APPLICATIONS AND THEIR REQUIREMENTS

The applications enabled by Pervasive Grid systems can be classified along three broad axes based on
their programming and runtime requirements. Opportunistic applications can discover and use available pervasive information and resources, to potentially adapt, optimize, improve QoS, provide a better
user experience, etc. For example, a navigation system may use real-time traffic information (possibly
obtained from other vehicles) to reduce or avoid congested routes. Similarly, a vehicle safety system
may use information from oncoming vehicles to appropriately warn the driver of possible hazards. A
key characteristic of these applications is that they do not depend on the availability of the information,
but can opportunistically use information if it is available. Note that these applications may consume raw information and process it locally. Alternatively, they may outsource the processing of information using
available resources at the source of the information or within the pervasive environment. While the above
applications are centered on a single user, in cooperative applications, multiple application entities (possibly wireless devices) cooperate with each other, each providing partial information, to make collective
decisions in an autonomous manner. An example is a swarm of wireless robotic devices cooperatively
exploring a disaster site or a group of cars sharing information to estimate the overall traffic situation.
Finally, certain control applications provide autonomic control capabilities using actuation devices in
addition to sensors, for example, a car may anticipate traffic/road conditions and appropriately apply
the brakes.
As an illustration consider the somewhat futuristic use-case scenario presented below that describes
how an international medical emergency may be handled using the anytime-anywhere access to information and services provided by a Pervasive Grid. This scenario shares some of the views of (Akogrimo,
2004) while adding the semantic dimension to the process.
Mr. Smith lives in Toulouse, France, and leaves for a few days for Vienna, Austria. Unfortunately, on the way, he is involved in an accident that leaves him lying unconscious on the road. When help arrives, the responders only find a single piece of information on Mr. Smith, i.e., a numerical identifier (for example on a smart card), which allows them to quickly access Mr. Smith's medical file (which is at least partially in France, perhaps in Toulouse) and to find important information (for example, details of drug allergies, of his surgical history - has he already been anesthetized, and with which product? did he have an operation? are there records available such as an operation report or x-rays?) that will allow the responders to adapt and customize the care given to Mr. Smith.

Let us consider this use-case in detail. First, let us assume (unrealistically) that the problem of the single identifier is solved (this particular point raises political and ethical issues and is far from being solved, even at the European scale), and that Mr. Smith has a health card that encodes his identifier. Pervasive sensors are already embedded with Mr. Smith to monitor his blood pressure and his blood sugar level. These data are available through a specific application available for a range of devices (Palm, notebooks, ...) and transmitted via WiFi from the sensors to the application devices. Further, Mr. Smith's medical data
is distributed across various medical centers. The contents of the medical files must be accessible in a
protected way: Only authorized individuals should be able to access relevant parts of the file, and ideally
these authorizations are given by Mr. Smith himself. Note that all the documents would be naturally in
French and possibly in different formats and modalities.

Now, the Austrian responder, who only speaks German, has a Palm with a WiFi connection. The WiFi hot
spot is located in the ambulance and allows the responder to consult patient medical records through
a public hospital network. The intervention by the responder begins on the spot of the accident and
continues on the road towards the hospital. Please note that at this stage, the responder has no idea of
the pervasive presence of the sensors embedded with Mr. Smith. When the responder wants to access
information about allergies to certain medication, he should initially know where this information resides.
From both the identifier of Mr. Smith and the request itself (allergies?), the system seeks the storage
centers likely to have some information about Mr. Smith. The responder contacts these centers. He also
needs to obtain authorization to enter the French information systems, which he obtains by starting from
his certificate of membership in the Austrian health system. Trust certificates are established to allow
him to access the network of care where the required data are.


An integration service must transform the responder's request to be compatible with the schema of the databases containing the relevant information, and negotiates, according to his profile and to the presented request, the parts of the database accessible to him. The request is expressed using a common vocabulary and semantic representation (an ontology of the medical field) to get around the language issue. To reach the data itself, the responder presents the mandatory certificates to read the files. Mr. Smith must have previously created certificates for standard accesses to some of his data; for example, people able to assume the responder's role can access information about drug allergies. A repository of the standard certificates for Mr. Smith must be accessible online. The responder presents the retrieved certificates, which authorize the access, and the data are returned.

After this interaction, two kinds of information are available: First, the system alerts the responder of
the presence of sensors with Mr. Smith, and starts the download of the appropriate application (whose graphic and language interface must be adapted) on his Palm. Thanks to the retrieved information, the responder knows the patient's blood sugar level. The second kind of information is related to the medical records of
Mr. Smith. The metadata of the documents are analyzed to know their nature and to see how the Palm
can exploit them. An adaptation service is probably required, to create a chain of transformation from
the original documents (in written and spoken French) into documents that can be used by the responder
currently in the moving ambulance, where he can only read and not listen (due to the noisy environment).
Appropriate services include a service for audio-to-text transformation, a French-German translation
service, etc. Finally, the first-aid worker gets the relevant data and administers the appropriate medication to the patient.

During the transportation, information about the patient (drugs, known allergies, identifier of the patient) is transmitted to the hospital. In the hospital, even before the arrival of the ambulance, a surgeon can retrieve, using similar mechanisms but under different conditions (a less constrained terminal, a higher role in the care network, etc.), more complete information (surgical history, scans, etc.) in order to be able to intervene appropriately. The surgeon can decide to start more complex computations on the retrieved data, such as comparing this patient's characteristics (and data, such as images and analyses) against a patient database to better address this particular case and provide personalized help. This may require the use of utility computing facilities on a stable infrastructure.
In the scenario, the responder is very active, interacting with the local sensors and the global infrastructure. One should understand that many of these tasks should be automated, delegated and performed transparently by his device.
The pervasive grid ecosystem, which integrates computers, networks, data archives, instruments,
observatories, experiments, and embedded sensors and actuators, is also enabling new paradigms in
science and engineering - ones that are information/data-driven and that symbiotically and opportunistically combine computations, experiments, observations, and real-time information to understand and
manage natural and engineering systems.
For example, an Instrumented Oil-Field can (theoretically) achieve efficient and robust control
and management of diverse subsurface and near subsurface geo-systems by completing the symbiotic
feedback loop between measured data and a set of computational models, and can provide efficient,
cost-effective and environmentally safe production of oil reservoirs. Similar strategies can be applied
to CO2 sequestration, contaminated site cleanup, bio-landfill optimization, aquifer management and
fossil fuel production.
Another example application is the modelling and understanding of complex marine and coastal phenomena, and the associated management and decision-making processes. This involves an observational
assessment of the present state, and a scientific understanding of the processes that will evolve the state
into the future, and requires combining surface remote sensing mechanisms (satellites, radar) and spatially
distributed in situ subsurface sensing mechanisms to provide a well sampled blueprint of the ocean, and
coupling this real-time data with modern distributed computational models and experiments. Such a
pervasive information-driven approach is essential to address important national and global challenges
such as (1) safe and efficient navigation and marine operations, (2) efficient oil and hazardous material
spill trajectory prediction and clean up, (3) monitoring, predicting and mitigating coastal hazards, (4)
military operations, (5) search and rescue, and (6) prediction of harmful algal blooms, hypoxic conditions, and other ecosystem or water quality phenomena. For example, underwater and aerial robots and
oceanic observatories can provide real-time data which, coupled with online satellite, radar and historical
data, advanced models and computational and data-management systems, can be used to predict and
track extreme weather and coastal behaviours, manage atmospheric pollutants and water contaminants
(oil spills), perform underwater surveillance, study coastal changes, track hydrothermal plumes (black
smokers), and study the evolution of marine organisms and microbes.
An area where pervasive grids can potentially have a dramatic impact is crisis management and response, where immediate and intelligent responses to a rapidly changing situation could mean the difference between life and death for people caught up in a terrorist or other crisis situation. For example, a
prototype disaster response test bed, which combines information and data feeds from an actual evolving
crisis event with a realistic simulation framework (where the on-going event data are continually and
dynamically integrated with the on-line simulations), can provide the ability for decision support and
crisis management of real situations as well as more effective training of first-responders. Similarly,
one can conceive of a fire management application where computational models use streaming information from sensors embedded in the building along with real time and predicted weather information
(temperature, wind speed and direction, humidity) and archived history data to predict the spread of the
fire and to guide fire-fighters, warning of potential threats (blowback if a door is opened) and indicating
most effective options. This information can also be used to control actuators in the building to manage
the fire and reduce damage.

CROSSCUTTING CHALLENGES
The Pervasive Grid environment is inherently large, heterogeneous and dynamic, globally aggregating large numbers of independent computing and communication resources, data stores, instruments
and sensing/actuating devices. The result is an unprecedented level of uncertainty that is manifested
in all aspects of the Pervasive Grid: System, Information and Application (Parashar & Browne, 2005;
Parashar, 2006).


System uncertainty reflects in its structure (e.g., flat, hierarchical, P2P, etc.), in the dynamism of its components (entities may enter, move or leave independently and frequently), in the heterogeneity of its components (their connectivity, reliability, capabilities, cost, etc.), in the lack of guarantees, and, more importantly, in the lack of common knowledge of the numbers, locations, capacities, availabilities and protocols used by its constituents.
Information uncertainty is manifested in its quality, availability, compliance with common understanding and semantics, as well as the trust in its source.
Finally, application uncertainty is due to the scale of the applications, the dynamism in application behaviours, and the dynamism in their compositions, couplings and interactions (services may connect to others in a dynamic and opportunistic way).

The scale, complexity, heterogeneity, and dynamism of Pervasive Grid environments, and the resulting uncertainty, thus require that the underlying technologies, infrastructures and applications be able to detect and dynamically respond during execution to changes in the state of the execution environment, the state and requirements of the application, and the overall context of the applications.
This requirement suggests that (Parashar & Browne, 2005):
1. Applications should be composed from discrete, self-managing components which incorporate separate specifications for all of functional, non-functional and interaction-coordination behaviours.
2. The specifications of computational (functional) behaviours, interaction and coordination behaviours and non-functional behaviours (e.g. performance, fault detection and recovery, etc.) should be separated so that their combinations are composable.
3. The interface definitions of these components should be separated from their implementations to enable heterogeneous components to interact and to enable dynamic selection of components.

Given these features, a Pervasive Grid application requiring a given set of computational behaviours
may be integrated with different interaction and coordination models or languages (and vice versa) and
different specifications for non-functional behaviours such as fault recovery and QoS to address the
dynamism and heterogeneity of the application and the underlying environments.
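
A minimal sketch of this separation of concerns is given below: the functional behaviour, the interaction/coordination behaviour and a non-functional behaviour (here, fault recovery) are specified independently and only combined when a component is instantiated. The Python class and policy names are illustrative and are not part of any existing Pervasive Grid framework.

```python
# Illustrative sketch: functional, interaction and non-functional behaviours are
# specified separately and composed at instantiation time.

def average_temperature(readings):            # functional behaviour
    return sum(readings) / len(readings)

def publish_subscribe(result):                # one possible interaction/coordination behaviour
    print(f"publishing result {result:.2f} to subscribers")

def point_to_point(result):                   # an alternative interaction behaviour
    print(f"sending result {result:.2f} to a single peer")

def retry_on_failure(fn, attempts=3):         # non-functional behaviour: fault recovery
    def wrapped(*args):
        for i in range(attempts):
            try:
                return fn(*args)
            except Exception:
                if i == attempts - 1:
                    raise
    return wrapped

class Component:
    """A component assembled from independently chosen behaviour specifications."""
    def __init__(self, functional, interaction, non_functional):
        self.run = non_functional(functional)
        self.interact = interaction

    def process(self, data):
        self.interact(self.run(data))

if __name__ == "__main__":
    # The same functional behaviour composed with two different interaction models.
    Component(average_temperature, publish_subscribe, retry_on_failure).process([21.0, 22.5, 23.1])
    Component(average_temperature, point_to_point, retry_on_failure).process([21.0, 22.5, 23.1])
```
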

RESEARCH OPPORTUNITIES IN PERVASIVE GRID COMPUTING

We believe that addressing the challenges outlined above requires a new paradigm for realizing the Pervasive Grid infrastructure and its technologies, one founded on semantic knowledge and autonomic mechanisms (Parashar & Browne, 2005; Parashar, 2006). Specifically, this paradigm requires:
1. Static (defined at the time of instantiation) application requirements and system and application behaviours to be relaxed,
2. The behaviours of elements and applications to be sensitive to the dynamic state of the system and the changing requirements of the application, and to be able to adapt to these changes at runtime,
3. Common knowledge to be expressed semantically (ontology and taxonomy) rather than in terms of names, addresses and identifiers,
4. The core enabling middleware services (e.g., discovery, coordination, messaging, security) to be driven by such semantic knowledge. Further, the implementations of these services must be resilient and must scalably support asynchronous and decoupled behaviours.


Key research challenges include:

Programming models, abstractions and systems: Applications targeted to emerging Pervasive Grids must be able to address the high levels of uncertainty inherent in these environments, and require the ability to discover, query, interact with, and control instrumented physical systems using
semantically meaningful abstractions. As a result, they require appropriate programming models
and systems that support notions of dynamic space-time context, as well as enable applications
capable of correctly and consistently adapting their behaviours, interactions and compositions
in real time in response to dynamic data and application/system state, while satisfying real time,
functional, performance, reliability, security, and quality of service constraints. Furthermore, since
these behaviours and adaptations are context dependent, they need to be specified separately and
at runtime, and must consistently and correctly orchestrate appropriate mechanisms provided by
the application components to achieve autonomic management.
Data/information quality/uncertainty management: A key issue in pervasive systems is the
characterization of the quality of information and the need of estimating its uncertainty, so that
it can effectively drive the decision making process. This includes algorithms and mechanisms
to synthesize actionable information with dynamic qualities and properties from streams of data
from the physical environment, and address issues of data quality assurance, statistical synthesis
and hypotheses testing, and in-network data assimilation, spatial and/or temporal multiplexing,
clustering and event detection. Work done in the field of data management (Dong, Halevy, & Yu, 2007; Benjelloun, Sarma, Halevy, Theobald, & Widom, 2008) gives some hints on how to handle data integration when individual sources are uncertain. Another related aspect is providing mechanisms for adapting the level and frequency of sensing based on this information. Achieving this in an online and in-network manner (as opposed to post-processing stored data) under strict space-time constraints presents significant challenges, which are not addressed by most existing systems. Note that, since different in-network data processing algorithms will have different cost/performance behaviours, strategies for adaptively managing these tradeoffs so as to optimize overall application requirements are required (a minimal uncertainty-weighted fusion sketch follows this list).
Systems software and runtime & middleware services: Runtime execution and middleware
services have to be extended to support context-/content-/location-aware and dynamic, data/knowledge-driven and time-constrained executions, adaptations, interactions, and compositions
of application elements and services, while guaranteeing reliable and resilient execution and/
or predictable and controllable performances. Furthermore, data acquisition, assimilation and
transport services have to support seamless acquisition of data from varied, distributed and possibly unreliable data sources, while addressing stringent real-time, space and data quality constraints. Similarly, messaging and coordination services must support content-based scalable and
asynchronous interactions with different service qualities and guarantees. Finally, sensor system
management techniques are required for the dynamic management of sensor systems including
capacity and energy aware topology management, runtime management including adaptations
for computation/communication/power tradeoffs, dynamic load-balancing, and sensor/actuator
system adaptations.
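
As a concrete illustration of the uncertainty-aware synthesis mentioned in the data/information quality item above, the sketch below fuses readings from several sensors by weighting each one by the inverse of its reported variance, a standard estimator; the sensor values and the decision threshold are invented for the example.

```python
# Inverse-variance weighted fusion of uncertain sensor readings (illustrative values).

def fuse(readings):
    """readings: list of (value, variance). Returns (fused value, fused variance)."""
    weights = [1.0 / var for _, var in readings]
    fused_value = sum(w * v for (v, _), w in zip(readings, weights)) / sum(weights)
    fused_variance = 1.0 / sum(weights)   # the fused estimate is less uncertain than any input
    return fused_value, fused_variance

if __name__ == "__main__":
    # Three temperature sensors of different quality (variance in squared degrees Celsius).
    sensors = [(21.8, 0.5), (22.4, 0.1), (20.9, 2.0)]
    value, variance = fuse(sensors)
    print(f"fused temperature: {value:.2f} C, variance {variance:.3f}")
    # A decision process can act only once the fused uncertainty drops below a threshold.
    if variance < 0.2:
        print("confidence sufficient to drive actuation")
```
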


RELATED WORK
Research Landscape in Grid and Autonomic Computing
Grid computing research efforts over the last decade can be broadly divided into efforts addressing the
realization of virtual organizations and those addressing the development of Grid applications. The
former set of efforts has focused on the definition and implementation of the core services that enable
the specification, construction, operation and management of virtual organizations and instantiation of
virtual machines that are the execution environments of Grid applications. Services include:

Security services to enable the establishment of secure relationships between a large number of
dynamically created subjects and across a range of administrative domains, each with its own local security policy,
Resource discovery services to enable discovery of hardware, software and information resources
across the Grid,
Resource management services to provide uniform and scalable mechanisms for naming and locating remote resources, support the initial registration/discovery and ongoing monitoring of resources, and incorporate these resources into applications,
Job management services to enable the creation, scheduling, deletion, suspension, resumption,
and synchronization of jobs,
Data management services to enable accessing, managing, and transferring of data, and providing
support for replica management and data filtering.

Efforts in this class include Globus (The Globus Alliance), Unicore (Unicore Forum), Condor (Thain,
Tannenbaum, & Livny, 2002) and Legion (Grimshaw & Wulf, 1997).
Other efforts in this class include the development of common APIs, toolkits and portals that provide
high-level uniform and pervasive access to these services. These efforts include the Grid Application
Toolkit (GAT) (Allen et al., 2003), DVC (Taesombut & Chien, 2004) and the Commodity Grid Kits (CoG
Kits) (Laszewski, Foster, & Gawor, 2000). These systems often incorporate programming models or
capabilities for utilizing programs written in some distributed programming model. For example, Legion
implements an object-oriented programming model, while Globus provides a capability for executing
programs utilizing message passing.
The second class of research efforts deals with the formulation, programming and management of
Grid applications. These efforts build on the Grid implementation services and focus on programming
models, languages, tools and frameworks, and application runtime environments. Research efforts in
this class include GrADS (Berman et al., 2001), GridRPC (Nakada et al., 2003), GridMPI (Ishikawa,
Matsuda, Kudoh, Tezuka, & Sekiguchi, 2003), Harness (Migliardi & Sunderam, 1999), Satin/IBIS
(Nieuwpoort, Maassen, Wrzesinska, Kielmann, & Bal, 2004; Nieuwpoort et al., 2005), XCAT (Govindaraju et al., 2002; Krishnan & Gannon, 2004), Alua (Ururahy & Rodriguez, 2004), G2 (Kelly, Roe,
& Sumitomo, 2002), J-Grid (Mathe, Kuntner, Pota, & Juhasz, 2003), Triana (Taylor, Shields, Wang, &
Philp, 2003), and ICENI (Furmento, Hau, Lee, Newhouse, & Darlington, 2003). These systems have
essentially built on, combined and extended existing models for parallel and distributed computing. For
example, GridRPC extends the traditional RPC model to address system dynamism. It builds on Grid
system services to combine resource discovery, authentication/authorization, resource allocation and task
scheduling to remote invocations. Similarly, Harness and GridMPI build on the message passing parallel computing model, while Satin supports divide-and-conquer parallelism on top of the IBIS communication system. GrADS builds on the object model and uses reconfigurable objects and performance contracts to address Grid dynamics, while XCAT and Alua extend the component based model. G2, J-Grid, Triana and ICENI build on various service based models. G2 builds on .Net (Microsoft .Net), J-Grid builds on Jini (Jini Network Technology), and current implementations of Triana and ICENI build on JXTA (Project
JXTA, 2001). While this is natural, it also implies that these systems implicitly inherit the assumptions
and abstractions that underlie the programming models of the systems upon which they are based and
thus in turn inherit their assumptions, capabilities and limitations.
In recent years, the semantic grid paradigm has gained much interest from authors and at the Global Grid Forum. In (De Roure, Jennings, & Shadbolt, 2005), De Roure and Jennings propose a view of the semantic grid: its past, present and future. They identify some key requirements of the semantic grid: resource description, discovery and use; process description and enactment; security and trust; annotation to enrich the description of digital content; information integration and fusion (potentially on the fly); context awareness; communities; smart environments; etc. Ontologies and semantic web services are expected to help achieve a semantic grid. Work on the semantic grid can be enlarged to encompass pervasive computing (Roure, 2003). In this work, the author describes where the semantic grid can benefit from pervasive devices, and vice versa: on one side, the semantic grid can benefit the processing of the data acquired, for instance, by sensors; on the other hand, the semantic grid benefits from metadata coming from the pervasive appliances themselves, allowing for the automatic creation of annotations describing them.
There has also been research by the authors and others on applying Autonomic Computing (Kephart & Chess, 2003; Parashar & Hariri, 2006) concepts to Grid systems and applications. The autonomic computing paradigm is inspired by biological systems, and aims at developing systems and applications that can manage and optimize themselves using only high-level guidance. The key concept is a separation of (management, optimization, fault-tolerance, security) policies from enabling mechanisms, allowing a repertoire of mechanisms to operate at runtime to respond to the heterogeneity and dynamics of both the applications and the infrastructure. This enables undesired changes in operation to trigger changes in the behaviour of the computing system, so that the system continues to operate (or possibly degrades) in a conformant manner - for example, the system may recover from faults, reconfigure itself to match its environment, and maintain its operations at near optimal performance. Autonomic techniques have been applied to various aspects of Grid computing such as application runtime management, workload management and data distribution, data streaming and processing, etc. (Parashar & Hariri, 2006).
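
The separation of policy from mechanism that underlies autonomic computing can be illustrated with a small monitor-analyze-plan-execute style loop: the mechanisms (adding or removing workers) are fixed, while the policy that decides when to invoke them is supplied separately and could be swapped at runtime. This is a generic Python sketch, not the interface of any particular autonomic toolkit.

```python
# Sketch of an autonomic manager: mechanisms are fixed, the policy is pluggable.

class ElasticService:
    """Managed element exposing mechanisms; the manager decides when to use them."""
    def __init__(self):
        self.workers = 2

    def add_worker(self):
        self.workers += 1

    def remove_worker(self):
        self.workers = max(1, self.workers - 1)

def conservative_policy(load, workers):
    """High-level guidance: scale out above 80% load, scale in below 20%."""
    if load > 0.8:
        return "add"
    if load < 0.2 and workers > 1:
        return "remove"
    return "hold"

def autonomic_loop(service, policy, observed_loads):
    """Monitor -> analyze/plan (policy) -> execute (mechanism), for each observation."""
    for load in observed_loads:
        action = policy(load, service.workers)
        if action == "add":
            service.add_worker()
        elif action == "remove":
            service.remove_worker()
        print(f"load={load:.2f} action={action} workers={service.workers}")

if __name__ == "__main__":
    autonomic_loop(ElasticService(), conservative_policy, [0.55, 0.91, 0.95, 0.40, 0.10])
```
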
As we will see in the next part, these works on semantically enhanced grids and autonomic computing
are complementary to other works directly related to the presence of mobile and context-aware appliances in the environment. Most of these works do not deal with all the specificities of Pervasive Grids. We now detail some works in these specific directions.

Pervasive Grid Efforts

Davies, Storz and Friday (Storz, Friday, & Davies, 2003; Davies, Friday, & Storz, 2004) were among
the first to introduce the concept of Ubiquitous Grid, which is close to our Pervasive Grid vision. The purpose of their research paper is to compare the notion of Grid Computing (the definition of I. Foster (Foster, Kesselman, Nick, & Tuecke, 2002)) and the notion of Pervasive Systems (the definition of M. Weiser (Weiser, 1991)). They identify similar interests: heterogeneity, interoperability, scalability, adaptability and fault tolerance, resource management, service composition, discovery, security, communication, audit, payment. They then briefly present a use-case for a ubiquitous Grid, which they develop using Globus Toolkit 3 (GT3). The lack of details makes it difficult to evaluate exactly what has been done to make GT3 behave as a ubiquitous Grid, and which aspects of ubiquity have been addressed.
Hingne et al. (Hingne, Joshi, Finin, Kargupta, & Houstis, 2003) propose a multi-agent approach to
realize a P-Grid. They are primarily interested in communication, heterogeneity, discovery and services
composition, and scheduling of tasks between the different devices constituting the P-Grid.
McKnight et al. (McKnight, Howison, & Bradner, 2004) introduce the concept of a Wireless Grid.
Their interest is in the mobile and nomadic issues, which they compare with traditional computing Grids,
P2P networks and web services. An interesting aspect of this article is that it investigates the relationships
between these actors. In the article, the authors focus on services that they identify as the most important,
i.e., resources description and discovery, coordination, trust management and access control.
In (Srinivasan, 2005), S. H. Srinivasan details a Wireless Pervasive Grid architecture. The author separates the Grid into two parts: the backbone grid, physically linked and analogous to network backbones, and the wireless access grid. Agents act as proxies between the two grids, operating on behalf
of mobile devices in the access grid on the backbone grid. Interesting aspects of this effort are the
pro-activity and context-awareness of the presentation to end-users.
Coulson et al. (Coulson et al., 2005) present a middleware structured using a lightweight run-time
component model (OpenCom) that enables appropriate profiles to be configured on a wide range of
device types, and facilitates runtime reconfiguration (as required to adapt to dynamic environments).
More recently, Coronato and De Pietro (Coronato & Pietro, 2007) describe MiPEG, a middleware
consisting of a set of services (compliant with the OGSA grid standard) that enhance classic Grid environments (namely the Globus Toolkit) with mechanisms for handling mobility, context-awareness, user sessions and the distribution of tasks on the users' computing facilities.
Complementary to these, existing research efforts have tackled aspects of integrating pervasive systems with computing Grids, primarily in the fields of mobile computing and pervasive computing. They
include works on interaction, mobility and context adaptation. Research presented in (Allen et al., 2003;
Graboswki, Lewandowski, & Russell, 2004; Gonzalez-Castano, Vales-Alonso, Livny, Costa-Montenegro,
& Anido-Rifo, 2003) focused on the use of light devices to interact with computing Grids, e.g., submitting jobs and visualizing results. A closer integration of mobile devices with the Grids is addressed in
(Phan, Huang, & Dulan, 2002; Park, Ko, & Kim, 2003), which proposes proxy services to distribute and
organize jobs among a pool of light devices. The research presented in (Kurkovsky, Bhagyavati, Ray, &
Yang, 2004) solicits surrounding devices to participate in a problem-solving environment.
Mobile Grids have received much interest in the last years with the development of ad hoc networks
and/or IPv6 and the works in the mobile computing field. Some researchers (Chu & Humphrey, 2004;
Clarke & Humphrey, 2002) have investigated how a Grid middleware (Legion, OGSI.NET) can be adapted
to tackle mobility issues. In (Litke, Skoutas, & Varvarigou, 2004), the authors present the opportunities and research challenges in resource management in mobile grid environments, namely resource discovery and selection, job management (from scheduling and replication to migration and monitoring), and replica management. (Li, Sun, & Ifeachor, 2005) describes some challenges of mobile ad-hoc networks and adds a Quality of Service dimension to previous works, including provisioning and continuity of service, latency, energy constraints, and fault tolerance in general. The authors map their observations to a mobile
healthcare scenario. (Oh, Lee, & Lee, 2006) proposes, in a wireless world, to dynamically allocate tasks to surrounding resources, taking into account the context of these resources (their possibilities in terms of energy, network, CPU power, ...). In (Waldburger & Stiller, 2006) the authors focus on the provisioning of services in mobile grids and compare business and technical metrics between Grid Computing, Service Grids, mobile and knowledge grids, SOA and P2P systems. They extend the vision of classical Virtual Organizations to Mobile Dynamic VOs. Mobile agents are used in (Guo, Zhang, Ma, & Zhang, 2004; Bruneo, Scarpa, Zaia, & Puliafito, 2003; Baude, Caromel, Huet, & Vayssiere, 2000) to migrate objects and code among the nodes, while (Wang, Yu, Chen, & Gao, 2005) apply mobile agents to MANETs with dynamic and ever-changing neighbors. (Wong & Ng, 2006) focus on security while combining mobile agents and the Globus grid middleware to handle mobile grid services. (Akogrimo, 2004; Jiang, O'Hanlon, & Kirstein, 2004) are interested in the advantages of the mobility features of IPv6 for the notification and adaptation of Grids. The authors of (Messig & Goscinski, 2007) relate their work on autonomic system management in mobile grid environments, encompassing self-discovery, self-configuration, dynamic deployment, and self-healing for fault tolerance.
Context-awareness is the primary focus of the work presented in (Jean, Galis, & Tan, 2004). The
authors present an extension of virtual organization to context, providing personalization of the services.
In (Zhang & Parashar, 2003), the authors propose context-aware access control for grids. (Yamin et al.,
2003; Otebolaku, Adigun, Iyilade, & Ekabua, 2007) include mobility and context-awareness in their
presentation.

CONCLUSION
The proliferation of pervasive sensing/actuating devices, coupled with advances in computing and communication technologies, is rapidly enabling the next revolution in Grid computing - the emergence of Pervasive Grids. This, in turn, is enabling a new generation of applications that use pervasive information and services to manage, control, adapt and optimize natural and engineering real-world systems. However, the inherent scale and complexity of Pervasive Grid systems fundamentally impact the nature of applications and how they are formulated, deployed and managed, and present significant challenges that permeate all aspects of the systems software stack, from applications to programming models and systems to middleware and runtime services. This chapter outlined the vision of Pervasive Grid Computing along with its opportunities and challenges, and presented a research agenda for enabling this vision.

REFERENCES
Allen, G., Davis, K., Dolkas, K. N., Doulamis, N. D., Goodale, T., Kielmann, T., et al. (2003). Enabling
applications on the grid: A Gridlab overview. International Journal of High Performance Computing
Applications: Special issue on Grid Computing: Infrastructure and Applications.
Baldridge, K., Biros, G., Chaturvedi, A., Douglas, C. C., Parashar, M., How, J., et al. (2006, January).
National Science Foundation DDDAS Workshop Report. Retrieved from http://www.dddas.org/nsfworkshop-2006/wkshp report.pdf.


Bangerth, W., Matossian, V., Parashar, M., Klie, H., & Wheeler, M. (2005). An autonomic reservoir
framework for the stochastic optimization of well placement. Cluster Computing, 8(4), 255269.
doi:10.1007/s10586-005-4093-3
Baude, F., Caromel, D., Huet, F., & Vayssiere, J. (2000, May). Communicating mobile active objects
in java. In R. W. Marian Bubak Hamideh Afsarmanesh & B. Hetrzberger (Eds.), Proceedings of HPCN
Europe 2000 (Vol. 1823, p. 633-643). Berlin: Springer. Retrieved from http://www-sop.inria.fr/oasis/
Julien.Vayssiere/publications/18230633.pdf
Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald, M., & Widom, J. (2008). Databases with uncertainty and lineage. The VLDB Journal, 17(2), 243264. doi:10.1007/s00778-007-0080-z
Berman, F., Chien, A., Cooper, K., Dongarra, J., Foster, I., & Gannon, D. (2001). The grads project:
Software support for high-level grid application development. International Journal of High Performance
Computing Applications, 15(4), 327344. doi:10.1177/109434200101500401
Bruneo, D., Scarpa, M., Zaia, A., & Puliafito, A. (2003). Communication paradigms for mobile grid
users. In CCGRID 03 (p. 669).
Chu, D., & Humphrey, M. (2004, November 8). Bmobile ogsi.net: Grid computing on mobile devices.
In Grid computing workshop (associated with supercomputing 2004), Pittsburgh, PA.
Clarke, B., & Humphrey, M. (2002, April 19). Beyond the device as portal: Meeting the requirements
of wireless and mobile devices in the legion grid computing system. In 2nd International Workshop On
Parallel And Distributed Computing Issues In Wireless Networks And Mobile Computing (associated
with ipdps 2002), Ft. Lauderdale, FL.
Corbató, F. J., & Vyssotsky, V. A. (1965). Introduction and overview of the Multics system. FJCC, Proc. AFIPS, 27(1), 185–196.
Coronato, A., & Pietro, G. D. (2007). Mipeg: A middleware infrastructure for pervasive grids. Journal
of Future Generation Computer Systems.
Coulson, G., Grace, P., Blair, G., Duce, D., Cooper, C., & Sagar, M. (2005, April). A middleware approach for pervasive grid environments. In Uk-ubinet/ uk e-science programme workshop on ubiquitous
computing and e-research.
Davies, N., Friday, A., & Storz, O. (2004). Exploring the grids potential for ubiquitous computing.
IEEE Pervasive Computing / IEEE Computer Society [and] IEEE Communications Society, 3(2), 7475.
doi:10.1109/MPRV.2004.1316823
De Roure, D., Jennings, N., & Shadbolt, N. (2005, March). The semantic grid: Past, present, and future.
Proceedings of the IEEE, 93(3), 669681. doi:10.1109/JPROC.2004.842781
Dong, X., Halevy, A. Y., & Yu, C. (2007). Data integration with uncertainty. In Vldb 07: Proceedings
of the 33rd International Conference on Very Large Data Bases (pp. 687698). VLDB Endowment.
Foster, I., & Kesselman, C. (Eds.). (1999). The grid: Blueprint for a new computing infrastructure. San
Francisco: Morgan Kaufmann Publishers, Inc.


Foster, I., Kesselman, C., Nick, J., & Tuecke, S. (2002). The physiology of the grid: An open grid services
architecture for distributed systems integration. Retrieved from citeseer.nj.nec.com/foster02physiology.
html
Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200222.
Furmento, N., Hau, J., Lee, W., Newhouse, S., & Darlington, J. (2003). Implementations of a serviceoriented architecture on top of jini, jxta and ogsa. In Proceedings of uk e-science all hands meeting.
Gonzalez-Castano, F. J., Vales-Alonso, J., Livny, M., Costa-Montenegro, E., & Anido-Rifo, L. (2003).
Condor grid computing from mobile handheld devices. SIGMOBILE Mobile Comput. Commun. Rev.,
7(1), 117126. doi:10.1145/881978.882005
Govindaraju, M., Krishnan, S., Chiu, K., Slominski, A., Gannon, D., & Bramley, R. (2002, June). Xcat
2.0: A component-based programming model for grid web services (Tech. Rep. No. Technical ReportTR562). Dept. of C.S., Indiana Univ., South Bend, IN.
Graboswki, P., Lewandowski, B., & Russell, M. (2004). Access from j2me-enabled mobile devices to
grid services. In Proceedings of Mobility Conference 2004, Singapore.
Grimshaw, A. S., & Wulf, W. A. (1997). The legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), 3945. doi:10.1145/242857.242867
Guo, S.-F., Zhang, W., Ma, D., & Zhang, W.-L. (2004, Aug.). Grid mobile service: using mobile software agents in grid mobile service. Machine learning and cybernetics, 2004. In Proceedings of 2004
International Conference on, 1, 178-182.
Hingne, V., Joshi, A., Finin, T., Kargupta, H., & Houstis, E. (2003). Towards a pervasive grid. In International parallel and distributed processing symposium (ipdps03) (p. 207).
Ishikawa, Y., Matsuda, M., Kudoh, T., Tezuka, H., & Sekiguchi, S. (2003). The design of a latency-aware
mpi communication library. In Proceedings of swopp03.
Jean, K., Galis, A., & Tan, A. (2004). Context-aware grid services: Issues and approaches. In Computational scienceiccs 2004: 4th international conference Krakow, Poland, June 69, 2004, proceedings,
part iii (LNCS Vol. 3038, p. 1296). Berlin: Springer.
Jiang, S., O'Hanlon, P., & Kirstein, P. (2004). Moving grid systems into the IPv6 era. In Proceedings of
Grid And Cooperative Computing 2003 (LNCS 3033, pp. 490499). Heidelberg, Germany: SpringerVerlag.
Kelly, W., Roe, P., & Sumitomo, J. (2002). G2: A grid middleware for cycle donation using. net. In
Proceedings of the 2002 International Conference on Parallel and Distributed Processing Techniques
and Applications.
Kephart, J. O., & Chess, D. M. (2003). The vision of autonomic computing. Computer IEEE Computer
Society, 36(1), 4150.


Krishnan, S., & Gannon, D. (2004). Xcat3: A framework for cca components as ogsa services. In Proceedings of Hips 2004, 9th International Workshop on High-Level Parallel Programming Models and
Supportive Environments.
Kurkovsky, S. Bhagyavati, Ray, A., & Yang, M. (2004). Modeling a grid-based problem solving environment for mobile devices. In ITCC (2) (p. 135). New York: IEEE Computer Society.
Laszewski, G. v., Foster, I., & Gawor, J. (2000). Cog kits: A bridge between commodity distributed
computing and high-performance grids. In ACM 2000 Conference on java grande (p.97 - 106). San
Francisco, CA: ACM Press.
Li, Z., Sun, L., & Ifeachor, E. (2005). Challenges of mobile ad-hoc grids and their applications in ehealthcare. In Proceedings of Second International Conference on Computational Intelligence in Medicine
And Healthcare (cimed 2005).
Litke, A., Skoutas, D., & Varvarigou, T. (2004). Mobile grid computing: Changes and challenges of
resource management in a mobile grid environment. In Proceedings of Practical Aspects of Knowledge
Management (PAKM 2004), Austria.
Mathe, J., Kuntner, K., Pota, S., & Juhasz, Z. (2003). The use of jini technology in distributed and grid
multimedia systems. In MIPRO 2003, Hypermedia and Grid Systems (p. 148-151). Opatija, Croatia.
Matossian, V., Bhat, V., Parashar, M., Peszynska, M., Sen, M., & Stoffa, P. (2005). Autonomic oil reservoir optimization on the grid. Concurrency and Computation, 17(1), 1–26.
doi:10.1002/cpe.871
McKnight, L., Howison, J., & Bradner, S. (2004, July). Wireless grids, distributed resource sharing by
mobile, nomadic and fixed devices. IEEE Internet Computing, 8(4), 2431. doi:10.1109/MIC.2004.14
Messig, M., & Goscinski, A. (2007). Autonomic system management in mobile grid environments. In
Proceedings of the Fifth Australasian Symposium on ACSW Frontiers (ACSW 07), (pp. 4958). Darlinghurst, Australia: Australian Computer Society, Inc.
Migliardi, M., & Sunderam, V. (1999). The harness metacomputing framework. In Proceedings of Ninth
Siam Conference on Parallel Processing for Scientific Computing. San Antonio, TX: SIAM.
Nakada, H., Matsuoka, S., Seymour, K., Dongarra, J., Lee, C., & Casanova, H. (2003). Gridrpc: A remote
procedure call api for grid computing.
Nieuwpoort, R. V. v., Maassen, J., Wrzesinska, G., Hofman, R., Jacobs, C., & Kielmann, T. (2005). Ibis:
a flexible and efficient Java-based Grid programming environment. Concurrency and Computation,
17(7/8), 1079-1108.
Nieuwpoort, R. V. v., Maassen, J., Wrzesinska, G., Kielmann, T., & Bal, H. E. (2004). Satin: Simple and
efficient Java-based grid programming. Journal of Parallel and Distributed Computing Practices.
Oh, J., Lee, S., & Lee, E. (2006). An adaptive mobile system using mobile grid computing in wireless
network. In Computational Science And Its Applications - ICCSA 2006 (LNCS Vol. 3984, pp. 49-57).
Berlin: Springer.


Otebolaku, A., Adigun, M., Iyilade, J., & Ekabua, O. (2007). On modeling adaptation in context-aware
mobile grid systems. In Icas 07: Proceedings of the Third International Conference on Autonomic And
Autonomous Systems (p. 52). Washington, DC: IEEE Computer Society.
Parashar, M., & Browne, J. (2005, Mar). Conceptual and implementation models for the grid. Proceedings of the IEEE, 93(3), 653668. doi:10.1109/JPROC.2004.842780
Parashar, M., & Hariri, S. (Eds.). (2006). Autonomic grid computing concepts, requirements, infrastructures, autonomic computing: Concepts, infrastructure and applications, (pp. 4970). Boca Raton, FL:
CRC Press.
Parashar, M., & Hariri, S. (Eds.). (2006). Autonomic computing: Concepts, infrastructure and applications. Boca Raton, FL: CRC Press.
Parashar, M., & Lee, C. A. (2005, March). Scanning the issue: Special issue on grid-computing. In
Proceedings of the IEEE, 93 (3), 479-484. Retrieved from http://www.caip.rutgers.edu/TASSL/Papers/
proc-ieee-intro-04.pdf
Parashar, M., Matossian, V., Klie, H., Thomas, S. G., Wheeler, M. F., Kurc, T., et al. (2006). Towards
dynamic data-driven management of the ruby gulch waste repository. In V. N. Alexandrox & et al. (Eds.),
Proceedings of the Workshop on Distributed Data Driven Applications and Systems, International Conference on Computational Science 2006 (ICCS 2006) (Vol. 3993, pp. 384392). Berlin: Springer Verlag.
Park, S.-M., Ko, Y.-B., & Kim, J.-H. (2003, December). Disconnected operation service in mobile grid computing. In First International Conference on Service Oriented Computing (ICSOC2003), Trento, Italy.
Phan, T., Huang, L., & Dulan, C. (2002). Challenge: integrating mobile wireless devices into the computational grid. In Mobicom 02: Proceedings of the 8th annual international conference on mobile
computing and networking (pp. 271278). New York: ACM Press.
Pierson, J.-M. (2006, June). A pervasive grid, from the data side (Tech. Rep. No. RR-LIRIS-2006-015).
LIRIS UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/École Centrale de Lyon. Retrieved from http://liris.cnrs.fr/publis/?id=2436
Roure, D. D. (2003). Semantic grid and pervasive computing. http://www.semanticgrid.org/GGF/ggf9/gpc/
Srinivasan, S. H. (2005). Pervasive wireless grid architecture. In Second annual conference on wireless
on-demand network systems and services (wons05).
Storz, O., Friday, A., & Davies, N. (2003, October). Towards ubiquitous ubiquitous computing: an alliance with the grid. In Proceedings of the First Workshop On System Support For Ubiquitous Computing
Workshop (UBISYS 2003) in association with Fifth International Conference On Ubiquitous Computing,
Seattle, WA. Retrieved from http://ciae.cs.uiuc.edu/ubisys/papers/alliance-w-grid.pdf
Taesombut, N., & Chien, A. (2004). Distributed virtual computer (dvc): Simplifying the development
of high performance grid applications. In Workshop on Grids and Advanced Networks (GAN 04), IEEE
Cluster Computing and the Grid (ccgrid2004) Conference, Chicago.


Taylor, I., Shields, M., Wang, I., & Philp, R. (2003). Distributed p2p computing within triana: A galaxy
visualization test case. In International Parallel and Distributed Processing Symposium (IPDPS03).
Nice, France: IEEE Computer Society Press.
Thain, D., Tannenbaum, T., & Livny, M. (2002). Condor and the grid. John Wiley & Sons Inc.
Ururahy, C., & Rodriguez, N. (2004). Programming and coordinating grid environments and applications. In Concurrency and computation: Practice and experience.
Waldburger, M., & Stiller, B. (2006). Toward the mobile grid: Service provisioning in a mobile dynamic virtual organization. In Proceedings of the IEEE International Conference on Computer Systems and Applications, 2006 (pp. 579–583).
Wang, Z., Yu, B., Chen, Q., & Gao, C. (2005). Wireless grid computing over mobile ad-hoc networks
with mobile agent. In Skg 05: Proceedings of the first international conference on semantics, knowledge
and grid (p. 113). Washington, DC: IEEE Computer Society.
Weiser, M. (1991, February). The computer for the 21st century. Scientific American, 265(3), 6675.
Wong, S.-W., & Ng, K.-W. (2006). Security support for mobile grid services framework. In Nwesp06:
Proceedings of the international conference on next generation web services practices (pp.7582).
Washington, DC: IEEE Computer Society.
Yamin, A., Augustin, I., Barbosa, J., da Silva, L., Real, R., & Cavalheiro, G. (2003). Towards merging
context-aware, mobile and grid computing. International Journal of High Performance Computing Applications, 17(2), 191203. doi:10.1177/1094342003017002008
Zhang, G., & Parashar, M. (2003). Dynamic context-aware access control for grid applications. In 4th
international workshop on grid computing (grid 2003), (pp. 101 108). Phoenix, AZ: IEEE Computer
Society Press. Retrieved from citeseer.ist.psu.edu/zhang03dynamic.html

KEY TERMS AND DEFINITIONS

Autonomic Computing: Refers to a system that does not need human intervention to work, repair itself, adapt and optimize. Autonomous entities must adapt to their usage context to find the best fit for their execution.
Grid: The goal of the original Grid concept is to combine resources spanning many organizations
into virtual organizations that can more effectively solve important scientific, engineering, business and
government problems.
Pervasive: A term that covers the ubiquity of the system. A pervasive system is transparent to its users, who use it without noticing it. It is often linked with mobility, since mobility helps provide anywhere/anytime access to resources for nomadic users.
Pervasive Grid: A pervasive grid combines grid resource sharing with anywhere/anytime access to these resources, whether data or computing resources.
Quality of Service: Designates the achievable performance that a system, an application or a service is expected to deliver to its consumers.


Semantic Knowledge: Designates the enriched value of information. Raw information coming from sensors or monitored by the system is not enough to achieve ubiquitous access to resources. Only higher-level abstractions allow the system to be handled seamlessly.
Uncertainty: The doubt that can be cast on the system, the application or the information in a pervasive grid. Information cannot be accepted without question, and double checking through redundancy is often the rule.


Chapter 3

Desktop Grids:
From Volunteer Distributed Computing
to High Throughput Computing Production Platforms

Franck Cappello
INRIA and UIUC, France
Gilles Fedak
LIP/INRIA, France
Derrick Kondo
ENSIMAG - antenne de Montbonnot, France
Paul Malécot
Université Paris-Sud, France
Ala Rezmerita
Université Paris-Sud, France

ABSTRACT
Desktop Grids, literally Grids made of Desktop Computers, are very popular in the context of Volunteer
Computing for large scale Distributed Computing projects like SETI@home and Folding@home.
They are also very appealing as Internet Computing platforms for scientific projects seeking a huge
amount of computational resources for massive high throughput computing, like the EGEE project
in Europe. Companies are also interested in cheap computing solutions that do not add extra
hardware and cost of ownership. A very recent argument for Desktop Grids is their ecological impact:
by scavenging unused CPU cycles without excessively increasing power consumption, they reduce
the waste of electricity. This book chapter presents the background of Desktop Grids, their principles
and essential mechanisms, the evolution of their architectures, their applications and the research tools
associated with this technology.

DOI: 10.4018/978-1-60566-661-7.ch003

Copyright 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Desktop Grids

ORIGINS AND PRINCIPLES


Nowadays, Desktop Grids are very popular and are among the largest distributed systems in the world:
the BOINC platform is used to run over 60 Internet Computing projects and scales up to 4 million participants. To arrive at this outstanding result, theoretical and experimental projects have investigated how to take advantage of idle CPUs and derived the principles of Desktop Grids.

Origins of Desktop Grids


The very first paper discussing a Desktop Grid like system (Shoch & Hupp, 1982) presented the Worm
programs and several key ideas that are currently investigated in autonomous computing (self replication, migration, distributed coordination, etc). Several projects preceded the very popular SETI@home.
One of the first applications of Desktop Grids was cracking RSA keys. Another early system, launched in 1997, gave Desktop Grids the alternative name of distributed computing: distributed.net, a project whose aim was solving cryptographic challenges. The folding@home project was, with SETI@home, one of the first projects to gather thousands of participants in the early 2000s.
At that time folding@home used the COSM technology. The growing popularity of Desktop Grids has
raised a significant interest in the industry. Companies like Entropia (Chien, Calder, Elbert, Bhatia,
2003), United Devices1, Platform2, Mesh Technologies3 and Data Synapse have proposed Desktop Grid middleware. Performance-demanding users are interested in these platforms, considering their cost-performance ratio, which is even lower than that of clusters. As a mark of success, several Desktop Grid platforms are used daily in production by large companies in the domains of pharmacology, petroleum, aerospace, etc.
The origin of Desktop Grids came from the association of several key concepts: 1) cycle stealing, 2)
computing over several administration domains and 3) the Master-Worker computing paradigm.
Desktop Grids inherit the principle of aggregating inexpensive, often already in place, resources,
from past research in cycle stealing. Roughly speaking, cycle stealing consists of using the CPU cycles of other computers. This concept is particularly relevant when the target computers are idle. Mutka et al. demonstrated in 1987 that the CPUs of workstations are mostly unused (M. W. Mutka & Livny, 1987), opening the opportunity for highly demanding users to scavenge these cycles for their
applications. Due to its high attractiveness, cycle stealing has been studied in many research projects
like Condor (Litzkow, Livny, Mutka, 1988), Glunix (Ghormley, Petrou, Rodrigues, Vahdat, Anderson,
1998) and Mosix (Barak, Guday, 1993), to cite a few. In addition to the development of these computing environments, a lot of research has focused on theoretical aspects of cycle stealing (Bhatt, Chung,
Leighton, Rosenberg, 1997).
Early cycle stealing systems were bound to the limits of a single administration domain. To harness more resources, techniques were proposed to cross the boundaries of administration domains. A first
approach was proposed by Web Computing projects such as Jet (Pedroso, Silva, Silva, 1997), Charlotte
(Baratloo, Karaul, Kedem, Wyckoff, 1996), Javelin (P. Cappello et al., 1997), Bayanihan (Sarmenta & Hirano, 1999), SuperWeb (Alexandrov, Ibel, Schauser, Scheiman, 1997), ParaWeb (Brecht, Sandhu, Shan, Talbot, 1996) and PopCorn (Camiel, London, Nisan, Regev, 1997). These projects emerged with Java, taking advantage of the virtual machine's properties: high portability across heterogeneous hardware and operating systems, the wide diffusion of the virtual machine in Web browsers, and a strong security model associated with bytecode execution. Performance and functionality limitations are some of the fundamental

motivations of the second generation of Global Computing systems like COSM4, BOINC (Anderson,
2004) and XtremWeb (Fedak, Germain, Néri, Cappello, 2001). These systems use firewall- and NAT-traversing protocols to transport the required communications.
The Master-Worker paradigm is the third enabling concept of Desktop Grids. The concept of Master-Worker programming is quite old (Mattson, Sanders, Massingill, 2004), but its application to large scale computing over many distributed resources emerged a few years before 2000 (Sarmenta & Hirano, 1999). The Master-Worker programming approach essentially allows the implementation of non-trivial (bag-of-tasks) parallel applications on loosely coupled computing resources. Because it can be combined with simple fault detection and tolerance mechanisms, it fits extremely well with Desktop Grid platforms, which are inherently very dynamic.
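To make the Master-Worker idea concrete, the following minimal sketch (plain Python, with invented names; it is not the API of any middleware cited in this chapter) shows the pull style of work distribution: workers ask the master for tasks, compute them and report results, and a task whose worker disappears is simply put back in the queue.

import queue
import random
import threading

class Master:
    def __init__(self, tasks):
        self.pending = queue.Queue()
        for t in tasks:
            self.pending.put(t)
        self.results = {}

    def request_task(self):
        # Workers always pull; the master never pushes (friendly to NATs and firewalls).
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None

    def report_result(self, task, result):
        self.results[task] = result

    def requeue(self, task):
        # Fault tolerance: resubmit a task whose worker left without reporting.
        self.pending.put(task)

def worker(master):
    while True:
        task = master.request_task()
        if task is None:
            return
        if random.random() < 0.2:          # simulate a volatile host leaving
            master.requeue(task)
            return
        master.report_result(task, task * task)   # the bag-of-tasks "computation"

master = Master(range(20))
while len(master.results) < 20:             # keep recruiting workers until all tasks are done
    pool = [threading.Thread(target=worker, args=(master,)) for _ in range(4)]
    for t in pool: t.start()
    for t in pool: t.join()
print(len(master.results), "results collected")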

Main Principles
Desktop Grids emerged while the community was considering clustering and hierarchical designs as good performance-cost trade-offs. However, several parameters distinguish Desktop Grids from clusters: scale, communication, heterogeneity and volatility. Moreover, Desktop Grids share with Grids a common
objective: to extend the size and accessibility of a computing infrastructure beyond the limit of a single
administration domain. In (Foster & Iamnitchi, 2003), the authors present the similarities and differences
between Grids and Desktop Grids. Two important distinguishing parameters are the user community
(professional or not) and the resource ownership (who owns the resources and who uses them).
From the system architecture perspective, we consider two main differences: the system scale and the
lack of control of the participating resources. The notion of Large Scale is linked to a set of features that
has to be taken into account. An example is the system dynamicity caused by node volatility: in Internet Computing Platforms (also called Desktop Grids), an unpredictable number of nodes may leave the system at any time. Some studies even consider that nodes may quit the system without any prior notice and reconnect in the same way. The lack of control of the participating nodes has a direct consequence on node connectivity. Desktop Grid designers cannot assume that an external administrator is able to intervene in the network settings of the nodes, especially their connection to the Internet via NAT
and Firewalls. This means that we have to deal with the in place infrastructure in terms of performance,
heterogeneity, dynamicity and connectivity. Large scale and lack of control have many consequences, at
least on the architecture of system components, the deployment methods, programming models, security
(trust) and more generally on the theoretical properties achievable by the system. These characteristics
established a new research context in distributed systems.
From the previous considerations, Desktop Grid designers arrived at a set of properties that any Desktop Grid system should fulfill: resource connectivity across administrative boundaries, resilience to high resource volatility, job scheduling that is efficient for heterogeneous resources, and standalone, self- and automatically-managed resource applications. Several extra properties have been considered and integrated in some Desktop Grids: resource security, result certification, etc.
Figure 1 presents the simple architecture of basic Desktop Grids.
A typical Desktop Grid consists of three components: clients that submit requests, servers that accept requests and return results, and a coordinator that schedules the client requests to the servers. Desktop Grids have applications in High Throughput Computing as well as in data access and communication. Thus, for the sake of simplicity, the requests and results presented in the figure can be either computing or data operations. Clients may send requests with some specific requirements, such as CPU


Figure 1. General architecture of desktop Grids

architecture, OS version, the availability of some applications and libraries. Because only some servers
may provide the required environment, the task of the coordinator is generally extended to realize the
matchmaking between client requests and server capabilities. Clients and servers are PCs belonging to different administrative domains. They are protected by firewalls and may sit behind a NAT. By default, there is no possibility of direct communication between them. As a consequence, any Desktop Grid should implement some protocols to cross administrative domain boundaries. The communication between the components of the Desktop Grid concerns data, job descriptions, job parameters and results, but also application code. If the application is not available on the servers, it is transmitted by the
client or the coordinator to the servers, prior to the execution. The coordinator can be implemented in
various ways. The simplest organization consists of a central node. This architecture can be extended to tolerate the failure of the central node by using replicated nodes. Other designs consider using a distributed architecture where several nodes handle and manage the client requests and server results. In addition
to scheduling and matchmaking, the coordinator must implement fault detection and fault tolerance
mechanisms because it is expected that some servers will fail or quit the Desktop Grid (permanently or not) without prior notification. The lack of control of the servers implies that Desktop Grids rely on humans (in most cases, the owners of the PCs) for the installation of the server software on participating PCs. However, Desktop Grid systems must not rely on PC owners for the management and maintenance of
the software. Thus the server software is designed to allow remote upgrade and remote management. The
server software as well as all other Desktop Grid related software components are managed remotely
by the Desktop Grid administrator.

CLASSIFICATION OF DESKTOP GRIDS


In this section, we propose a classification of Desktop Grid systems.


Figure 2. Overview of the OurGrid platform architecture

Local Desktop Grids


An Enterprise Desktop Grid consists of desktop PC hosts within a LAN. LANs are often found within a corporation or university, and several companies such as Entropia and United Devices have specifically
targeted these LANs as a platform for supporting Desktop Grid applications.
Enterprise Desktop Grids are an attractive platform for large scale computation because the hosts
usually have better connectivity (100 Mbps Ethernet, for example) and relatively less volatility
and heterogeneity than Desktop Grids that span the entire Internet. Nevertheless, compared to dedicated
clusters, enterprise Desktop Grids are volatile and heterogeneous platforms, and so the main challenge
is then to develop fault-tolerant, scalable, and efficient scheduling.
Enterprises also provide commercial Desktop Grids. Their source code is most of the time unavailable and there is little documentation about their internal components. The server part may be available for use inside an enterprise.
There are several industrial Desktop Grid platforms, from Entropia (Chien et al., 2003), which ceased commercial operations in 2004, and from United Devices, Platform and Mesh Technologies.

Collaborative Desktop Grids


Collaborative Desktop Grids consist of several Local Desktop Grids which agree to aggregate their resources for a common goal. The OurGrid project (Andrade, Cirne, Brasileiro, Roisenberg, 2003 ; Cirne et al., 2006) is a typical example of such systems. It proposes a mechanism for laboratories to pool their local Desktop Grids, allowing the local resource managers to construct a
P2P network (Figure 2). This solution is attractive because utilization of computing power by scientists
is usually not constant. When scientists need extra computing power, this setup allows them to easily access the resources of partner universities. In exchange, when their own resources are idle, they can be given or rented to other universities. This requires cooperation among the local Desktop Grid systems, usually at the resource manager level, and mechanisms to schedule several applications.
A similar approach has been proposed by the Condor team under the term flock of Condors (Pruyne
& Livny, 1996).


Internet Volunteer Desktop Grids


For over a decade, the largest distributed computing platforms in the world have been Internet Volunteer
Desktop Grids (IVDG), which use the idle computing power and free storage of a large set of networked (and often shared) hosts to support large-scale applications. In this kind of Grid, the owners of resources are end-user Internet volunteers who provide their personal computers for free. IVDGs are an extremely attractive platform because they offer huge computational power at relatively low cost. Currently, many
projects, such as SETI@home (Anderson, Cobb, Korpela, Lebofsky, Werthimer, 2002), FOLDING@
home (Shirts & Pande, 2000), and EINSTEIN@home5, use TeraFlops of computing power of hundreds
of thousands of desktop PCs to execute large, high-throughput applications from a variety of scientific
domains, including computational biology, astronomy, and physics.

Single-Application Internet Volunteer Desktop Grids.


At the beginning of Internet Volunteer Desktop Grids, most of the largest projects were running only one
application. Only data were automatically distributed, most of the time using a simple CGI script on a
web server. Upgrading the application required volunteers to manually download and install the
application. In this section, we will describe some of these projects.
The Great Internet Mersenne Prime Search (GIMPS)6 is one of the oldest computations using resources provided by volunteer Desktop Grid users. It started in 1996 and is still running. The 44th known Mersenne prime was found in September 2006. Each client connects to a central server (PrimeNet) to get some work. Resources are divided into 3 classes based on the processor model and get different types of tasks. The program only uses 8 MB of RAM and 10 MB of disk space, and does very little communication with the servers (a permanent connection is not required). The program checkpoints every half hour.
Since 1997, Distributed.net7 has been trying to solve cryptographic challenges. RC5 and several DES challenges
have been solved.
The first version of SETI@Home was released in May 1999. There were already 400,000 pre-registered volunteers, and 200,000 clients registered during the first week. Between July 2001 and July 2002, the platform computed workunits at an average rate of 27.36 TeraFLOPS. The program processes a signal recorded by a radio telescope and then searches it for particular artificially made signals. The original recording is split into workunits both by time (107 s long) and by frequency (10 KHz).
The Electric Sheep8 (Draves, 2005) screen-saver realizes the collective dream of sleeping computers.
It harnesses the power of idle computers (because they are running the screen-saver) to render, using a
genetic algorithm, the fractal animations it displays. The computation relies on the volunteers to decide which animations are beautiful and should be improved. This system consists of only one application but, as the project website claims, about 30,000 unique IP addresses contact the server each day and 2 TB are transferred. At the time of writing, the single centralized server was the bottleneck of this system.

XtremWeb.
XtremWeb (Fedak et al., 2001; Cappello et al., 2004) is an open source research project at LRI and LAL
that belongs to the family of lightweight Grid systems. It is primarily designed to explore scientific issues in Desktop Grid, Global Computing and Peer-to-Peer distributed systems, but it has also been used in real computations, especially in physics. The first version was released in 2001.


Figure 3. Overview of the XtremWeb platform architecture

The architecture (Figure 3) is similar to most well known platforms. It is a three-tier architecture with
clients, servers and workers. Several instances of those components might be used at the same time.
Clients allow platform users to interact with the platform by submitting stand-alone jobs, retrieving results and managing the platform. Workers are responsible for executing jobs. The server is a coordination service that connects clients and workers. The server accepts tasks from clients, distributes them to workers according to the scheduling policy, provides the applications for running them and supervises the execution by detecting worker crashes or disconnections. If needed, tasks are restarted on other available workers. At the end, it retrieves and stores results before clients download them.
Clients and workers are the initiators of all connections to the server; as a consequence, only the server needs to be reachable through firewalls. Multiple protocols are supported and can be used depending on the type of workload. Communications may also be secured both by encryption
and authentication.
Since its first version, XtremWeb has been deployed over networks of common Desktop PCs providing
an efficient and cost effective solution for a wide range of application domains: bioinformatics, molecular
synthesis, high energy physics, numerical analysis and many more. At the same time, there has been much research around XtremWeb: XtremWeb-CH9 (Abdennadher & Boesch, 2006), funded by the University of Applied Sciences in Geneva, is an enrichment of XtremWeb designed to better match P2P
concepts. Communications are distributed, i.e. direct communications between workers are possible. It
provides a distributed scheduler that takes into account the heterogeneity and volatility of workers. There
is an automatic detection of the optimal task granularity according to the number of available workers and tasks to schedule. There is also a monitoring tool for visualizing the execution of the applications.

BOINC
All these mono-application projects share many common components. So, there was a need for a generic platform that would provide all these components for an easy integration and deployment of these


Figure 4. Overview of the BOINC platform architecture

projects. Only the part that really does the computation needs to be changed for each project.
The Berkeley Open Infrastructure for Network Computing (BOINC) (Anderson, 2004) is the largest
volunteer computing platform. More than 900,000 users from nearly all countries participate with more
than 1,300,000 computers. More than 40 projects, not including private projects, are available including
the popular SETI@Home project. Projects usually last several months, mainly because of the time needed to attract volunteers and set up a user community.
Each client (computing node) is manually attached by the user to one or more projects (servers).
Each project runs a central server (Figure 4) and most of the scheduling is done by clients. Projects have
the ability to run a small number of different applications, which can be updated (jobs have to be very homogeneous).
The BOINC server is composed of several daemons which execute the management tasks: first,
workunits are produced by a generator. Then, the transitioner, the daemon that will take care of the
different states of the workunit life cycle, will replicate (for redundancy) the workunit into several results
(instances of workunits). Each result will be executed on a different client. Then, back to the server,
each result will be checked by the validator before being stored in the project science database by the
assimilator.
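The following toy sketch (plain Python; the function names are illustrative and do not correspond to BOINC's internal interfaces) mimics this pipeline: a generator produces workunits, a transitioner replicates each of them into several results, and a validator keeps only the outputs on which a majority of replicas agree before the assimilator stores them.

from collections import Counter

def generate_workunits(n):
    return [{"id": i, "input": i} for i in range(n)]

def transition(workunit, redundancy=3):
    # Replicate the workunit into several "results" (instances) for different clients.
    return [{"wu": workunit, "instance": k} for k in range(redundancy)]

def execute(result):
    # Stand-in for the computation performed by a volunteer's client.
    return result["wu"]["input"] ** 2

def validate(outputs, quorum=2):
    # Keep the output only if enough replicas agree on it.
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes >= quorum else None

science_db = {}                            # the "assimilator" target
for wu in generate_workunits(5):
    outputs = [execute(r) for r in transition(wu)]
    canonical = validate(outputs)
    if canonical is not None:
        science_db[wu["id"]] = canonical
print(science_db)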
All communications are done using CGI programs on the project server, so only port 80 and client-to-server connections are needed. Each user is rewarded with credits, a virtual currency, for the CPU cycles used on his or her computer.
The client maintains a cache of results to be executed between connections to the Internet. The scheduler tries to enforce many constraints. First, the user may choose to run the applications according to his or her activity (screen-saver mode), working hours, and available resources. Second, the user assigns a resource share ratio to the projects. Third, sometimes, some projects may run out of work to distribute.


Some other projects were inspired by the BOINC platform. SLINC10 (Baldassari, Finkel, & Toth, 2006) addresses the main limitations of BOINC by simplifying the project creation process. This software is also operating-system independent, as it runs on the Java platform. It is also database independent (it uses Hibernate) while BOINC runs only with MySQL. All communications between components are done with XML-RPC and, to simplify the architecture, the validator component has been removed. User applications are also programming-language independent, but only Java and C++ are available for now. Two versions of the same application, one written in Java and the other in C++, have almost the same performance. Some BOINC issues have not been fixed here, such as the time
needed to have all the volunteers register their resources.
POPCORN (Nisan, London, Regev, Camiel, 1998) is a platform for global distributed computing over the Internet. It was available from mid-1997 until mid-1998; today, only the papers and documentation are still available. This platform runs on the Java platform and tasks are executed on workers as computelets, a mechanism similar to usual Java applets. Computelets only need to be instantiated for a task to be distributed. Error detection and verification are left to the application level. The platform provides a debugging tool that shows the tree of spawned computelets (for debugging concurrency issues). There is also a market system that enables users to sell their CPU time; the currency works almost the same way as BOINC credits. Some applications have been tested on the platform: brute-force code breaking, genetic algorithms, etc. At the implementation level, the developers had some issues with Java immaturity (in 1997-1998).
Bayanihan (Sarmenta & Hirano, 1999) is another platform for volunteer computing over the Internet.
It is written in Java and uses HORB, a package similar to Sun's RMI, for communications. Many clients
(applet started from a web browser or command line applications) connect to one or more servers.
Korea@Home (Jung, 2005) is a Korean volunteer computing platform. Work management is centralized on one server but since version 2, there is a P2P mechanism that allows direct communication
between computing nodes (agents). This platform harnesses more than 36,000 agents, about 300 of
which are available at the same time.

EVOLUTION OF MIDDLEWARE ARCHITECTURE


Job Management
The functionality required for job management includes job submission, resource discovery, resource
selection and scheduling, and resource binding. With respect to job submission, most systems, like
XtremWeb or Entropia, have an interface similar to batch systems, such as PBS, where a job's executable and inputs are specified. Recently, there have been efforts to provide higher-level programming
abstractions, such as Map-Reduce (Dean & Ghemawat, 2004).
After a job is submitted to the system, the job management system must identify a set of available
resources. Resource discovery is the process of identifying which resources are currently available and
is challenging given the dynamicity and large scale of these systems. There have been both centralized and
distributed approaches. The classic method is via matchmaking (Raman, Livny, Solomon, 1998) where
application requirements are paired with compatible resource requirements via ClassAds. A number of
works have addressed the scalability and fault-tolerance issue of this type of centralized matchmaking
system.
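As an illustration of the matchmaking idea (not Condor's actual ClassAd syntax; the attribute names below are invented for the example), a request can be expressed as a set of predicates that a resource description must satisfy:

def matches(requirements, resource):
    # A resource is eligible only if every predicate in the request is satisfied.
    return all(pred(resource.get(attr)) for attr, pred in requirements.items())

resources = [
    {"name": "pc1", "os": "linux",   "arch": "x86_64", "mem_mb": 2048, "has_app": True},
    {"name": "pc2", "os": "windows", "arch": "x86_64", "mem_mb": 1024, "has_app": False},
]

request = {
    "os":      lambda v: v == "linux",
    "mem_mb":  lambda v: v is not None and v >= 1024,
    "has_app": lambda v: bool(v),
}

print([r["name"] for r in resources if matches(request, r)])   # ['pc1']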


Several distributed approaches have been proposed. The challenges of building a distributed resource
discovery system are the overheads of distributing queries, guaranteeing that queries can be satisfied,
being able to support a range of application constraints specified through queries, and being able to
handle dynamic loads on nodes. In (Zhou & Lo, 2006), the authors propose distributed resource discovery using a distributed hash table (DHT) in the context of a P2P system. This was one of the first
P2P resource discovery mechanisms ever proposed. However, the characteristics of resources can be heavily skewed, such that the query load becomes heavily imbalanced.
In (Iamnitchi, Foster, Nurmi, 2002), the authors propose a P2P approach where the overheads of a
query are limited with a time-to-live (TTL). The drawback of this approach is that there is no guarantee
that a resource that meets the constraints of the application will be found. In (Kim et al., 2006), the
authors proposed a rendezvous-node tree (RNT) where load is balanced using random application assignment. The RNT deals with load dynamics by conducting a random-walk (of limited length) after the
mapping. In (Lee, Ren, Eigenmann, 2008), the authors use a system where information is summarized
hierarchically, and a bloom filter is used to reduce the overheads for storage and maintenance.
After a set of suitable resources has been determined, the management system must then select a subset of the resources and determine how to schedule tasks among them. We discuss this issue in depth in the next section.
Once resources have been selected and a schedule has been determined, the tasks must then be deployed across the resources, i.e., bound. In systems such as the Condor Matchmaker, binding occurs last, in a separate step between the consumer and provider (without the matchmaker as the middle-man), to allow for the detection of any change in state. If a change in state occurs (for example, the resource is no longer available), then the renegotiation of selected resources can occur.

Resource Scheduling
At the application and resource management level, most research assumes that a centralized scheduler
maintains a queue of tasks to be scheduled, and a ready queue of available workers. As workers become
available, they notify the server, and the scheduler on the server places the corresponding task requests
of workers in the ready queue. During resource selection, the scheduler examines the ready queue to
determine the possible choices for task assignment. Because the hosts are volatile and heterogeneous, the
size of the host ready queue changes dramatically during application execution as workers are assigned
tasks (and thus removed from the ready queue), and as workers of different speeds and availability complete tasks and notify the server. The host ready queue is usually only a small subset of all the workers,
since workers only notify the server when they are available for task execution.
At the Worker level, most research assumes that the worker running on each host periodically sends
a heartbeat to the server that indicates the state of the task. In the XtremWeb system (Fedak et al., 2001),
a worker sends a heartbeat every minute to indicate whether the task is running or has failed. With respect to the recovery from failures, some works assume local checkpointing abilities. However, remote
checkpointing is still work in progress in real Internet-wide systems such as BOINC (Anderson, 2004)
and XtremWeb (Fedak et al., 2001).
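A heartbeat-based failure detector on the server side can be sketched in a few lines (the one-minute period follows the XtremWeb example above; the tolerance of three missed beats is an arbitrary illustration):

import time

HEARTBEAT_PERIOD = 60        # seconds, as in the XtremWeb example above
MISSED_BEATS_ALLOWED = 3     # illustrative tolerance before declaring a failure

last_seen = {}               # worker id -> timestamp of its last heartbeat

def on_heartbeat(worker_id, now=None):
    last_seen[worker_id] = time.time() if now is None else now

def failed_workers(now=None):
    now = time.time() if now is None else now
    limit = HEARTBEAT_PERIOD * MISSED_BEATS_ALLOWED
    return [w for w, t in last_seen.items() if now - t > limit]

# Worker "a" keeps beating; worker "b" goes silent, so its task would be rescheduled.
on_heartbeat("a", now=0); on_heartbeat("b", now=0)
on_heartbeat("a", now=200)
print(failed_workers(now=200))   # ['b']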
Also, most works do not assume the server can cancel a task once it has been scheduled on a worker.
The reason for this is that resource access is limited, as firewalls are usually configured to block all
incoming connections precluding incoming RPCs and to allow only outgoing connections (often on a
restricted set of ports like port 80). As such, the heuristics cannot preempt a task once it has been assigned, and workers must take the initiative to request tasks from the server.
This platform model deviates significantly from traditional grid scheduling models (Berman, Wolski,
Figueira, Schopf, Shao, 1996 ; Casanova, Legrand, Zagorodnov, Berman, 2000 ; Casanova, Obertelli,
Berman, Wolski, 2000 ; Foster & Kesselman, 1999). The pull nature of work distribution and random
behavior of resources in desktop grids places several limitations on scheduling operations. First, it
makes advance planning with sophisticated Gantt charts difficult as resources may not be available for
task execution at the scheduled time slot. Second, as task requests are typically handled in a centralized
fashion and a (web) server can handle a maximum of a few hundred connections, the choice of resources
available is always a small subset of the whole. Nevertheless, we focus on scheduling solutions applicable
in current centralized systems below.
The majority of application models in desktop grid scheduling have focused on jobs requiring either
high-throughput (Sonnek, Nathan, Chandra, Weissman, 2006) or low latency (Heien, Fujimoto, Hagihara,
2008 ; Kondo, Chien, 2004). These jobs are typically compute-intensive.
There are four complementary strategies for scheduling in desktop grid environments, namely resource selection, resource prioritization, task replication, and host availability prediction. In practice,
these strategies are often combined in heuristics.
With respect to resource selection, hosts can be prioritized according to various static or dynamic
criteria. Surprisingly, simple criteria such as clock rates has been shown to be effective with real-world
traces (Kondo, Chien, 2004). Other studies (Kondo, Chien, Casanova, 2007 ; Sonnek et al., 2006) have
used probabilistic techniques based on a hosts history of unavailability to distinguish more stables hosts
from others.
With respect to resource exclusion, hosts can be excluded using various criteria, as often slow hosts
(either due to failures, slow clock rates, or other host load) are the bottlenecks in the computation. Thus,
excluding them from the entire resource pool can improve performance dramatically.
With respect to task replication, schedulers often replicate a fixed number of times. The authors in
the studies (Kondo et al., 2007) and (Sonnek et al., 2006) investigated the use of probabilistic methods
for varying the level of replication according to a host's volatility.
With respect to host availability prediction, recently the authors in (Andrzejak, Domingues, Silva,
2006) have shown that simple prediction methods (in particular a naive Bayes classifier) can allow one
to give guarantees on host availability. In particular, in that study, the authors show how to predict that
N hosts will be available for T time.
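The following sketch combines, in a deliberately simplified way, three of the strategies just discussed: resource exclusion of very slow hosts, prioritization by clock rate, and a replication factor that grows with a host's historical unavailability (all numbers are made up for the example):

hosts = [
    {"name": "h1", "clock_ghz": 3.2, "availability": 0.95},
    {"name": "h2", "clock_ghz": 1.6, "availability": 0.50},
    {"name": "h3", "clock_ghz": 2.8, "availability": 0.80},
    {"name": "h4", "clock_ghz": 0.9, "availability": 0.99},
]

def select_hosts(hosts, min_clock=1.0):
    # Resource exclusion: drop very slow hosts, then prioritize by clock rate.
    usable = [h for h in hosts if h["clock_ghz"] >= min_clock]
    return sorted(usable, key=lambda h: h["clock_ghz"], reverse=True)

def replication_factor(host, base=1, max_extra=2):
    # Task replication: the more volatile the host, the more replicas of its task.
    return base + round(max_extra * (1.0 - host["availability"]))

for task_id, host in enumerate(select_hosts(hosts)):
    print("task", task_id, "->", host["name"], "x", replication_factor(host), "replicas")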

Volatility Tolerance
There are several issues with volatility concerning detection and resolution. With respect to detection,
systems such as XtremWeb (Fedak et al., 2001) and Entropia (Chien et al., 2003) use heartbeats. In
BOINC, where a centralized (web) server can only handle a few hundred connections simultaneously, the use of heartbeats with millions of resources is not an option. Moreover, heartbeats cannot be used when BOINC operates without a network connection. Instead, BOINC uses job deadlines as an indication of whether the job has permanently failed or not.
When a failure has been detected, one can resolve the failure in a number of ways. Task checkpointing is one means of dealing with task failures since the task state can be stored periodically either on
the local disk or on a remote checkpointing server; in the event that a failure occurs, the application
can be restarted from the last checkpoint. In combination with checkpointing, process migration can be

used to deal with CPU unavailability or when a better host becomes available by moving the process
to another machine.
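A minimal local checkpoint/restart loop looks like the sketch below (pure illustration; production systems would store the checkpoint on a remote server or, as in the P2P approach cited next, on peers):

import os
import pickle

CHECKPOINT = "task.ckpt"

def run_task(n_steps):
    # Resume from the last checkpoint if one exists, otherwise start from scratch.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            step, total = pickle.load(f)
    else:
        step, total = 0, 0
    while step < n_steps:
        total += step                    # stand-in for one unit of real work
        step += 1
        if step % 100 == 0:              # periodically persist the task state
            with open(CHECKPOINT, "wb") as f:
                pickle.dump((step, total), f)
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)
    return total

print(run_task(1000))                    # survives being killed and restarted mid-way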
The authors in (Araujo, Domingues, Kondo, Silva, 2008 ; Domingues, Araujo, Silva, 2006) develop
a distributed checkpoint system where checkpoints are stored in peers in a P2P fashion using a DHT, or
using a clique. Thus, when a failure occurs, a checkpoint could potentially be used to restart the computation on another node in a scalable way.
Another common solution for masking failures is replication. The authors in (Ghare & Leutenegger,
2004 ; Kondo et al., 2007 ; Sonnek et al., 2006) use probabilistic models to analyze various replication
issues. The platform model used in (Ghare & Leutenegger, 2004) assumes that the resources are shared,
task preemption is disallowed, and checkpointing is not supported. One of the application models was based on tightly-coupled applications, while the other was based on loosely-coupled applications, which consisted of task-parallel components before each barrier synchronization. The authors then assume that the probability of task completion follows a geometric distribution.
The work in (Leutenegger & Sun, 1993) examines analytically the costs of executing task parallel
applications in desktop grid environments. The model assumes that after a machine is unavailable for
some fixed number of time units, at least one unit of work can be completed. Thus, the estimates for
execution time are lower bounds. The assumption is restrictive, especially since the sizes of availability intervals can be correlated in time (Mutka & Livny, 1991); that is, a short availability interval (which
would likely cause task failure) will most likely be followed by another short availability interval.
In terms of proactively avoiding failures, the authors in (Andrzejak, Kondo, Anderson, 2008) use
prediction methods for avoiding resources likely to fail. They show the existence of long stretches of
availability of certain Internet hosts and that such patterns can be modeled efficiently with basic classification algorithms. Simple and computationally cheap metrics are reliable indicators of predictability, and resources can be divided into high and low predictability groups based on such indicators. So,
they show that a deployment of enterprise services in a pool of volatile resources is possible and incurs
reasonable overheads.

Data Management
Despite the attractiveness of Desktop Grids, little work has been done to support data-intensive applications in this context of massively distributed, volatile, heterogeneous, and network-limited resources.
Most Desktop Grid systems, like BOINC (Anderson, 2004), XtremWeb (Fedak et al., 2001), Condor
(Litzkow et al., 1988) and OurGrid (Andrade et al., 2003) rely on a centralized architecture for indexing
and distributing the data, and thus potentially face issues with scalability and fault tolerance. Thus, data
management is still a challenging issue.
Parameter-sweep applications composed of a large set of independent tasks sharing large data are the
first class of applications which has driven a lot of effort in the area of data distribution.
The authors in (Wei, Fedak, Cappello, 2005) have shown that using a collaborative data distribution protocol such as BitTorrent instead of FTP can improve the execution time of parameter-sweep applications. In contrast, it has also been observed that the BitTorrent protocol suffers from a higher overhead compared to FTP when transferring small files. Thus, one must be able to select the correct distribution protocol according to the size of the file and the level of sharability of data among the task inputs.
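A data scheduler could therefore apply a rule of the following kind (the thresholds are purely illustrative and would have to be tuned experimentally):

def pick_protocol(file_size_mb, n_consumers, size_threshold_mb=20, consumer_threshold=10):
    # Collaborative distribution pays off for large inputs shared by many tasks;
    # small or rarely shared files are cheaper to serve directly over FTP.
    if file_size_mb >= size_threshold_mb and n_consumers >= consumer_threshold:
        return "bittorrent"
    return "ftp"

print(pick_protocol(file_size_mb=500, n_consumers=1000))   # bittorrent
print(pick_protocol(file_size_mb=2, n_consumers=1000))     # ftp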
Recently, a similar approach has been followed in (Costa, Silva, Fedak, Kelley, 2008), where the
BitTorrent protocol has been integrated within the BOINC platform.


This work confirms that the basic building blocks for Data Management components can be found in P2P systems. Recently, a subsystem dedicated to data management for Desktop Grids, named BitDew, has been proposed in (Fedak, He, Cappello, 2008). It could be easily integrated into systems like BOINC, OurGrid or XtremWeb. It offers programmers (or an automated agent that works on behalf of the user) a simple API for creating, accessing, storing and moving data with ease, even in highly dynamic and volatile environments.
Research on DHTs (Distributed Hash Tables) (Stoica, Morris, Karger, Kaashoek, Balakrishnan, 2001 ; Maymounkov & Mazières, 2002 ; Rowstron & Druschel, 2001), collaborative data distribution (Cohen, 2003 ; Gkantsidis & Rodriguez, 2005 ; Fernandess & Malkhi, 2006), storage over volatile resources (Bolosky, Douceur, Ely, Theimer, 2000 ; Butt, Johnson, Zheng, Hu, 2004 ; Vazhkudai, Ma, Strickland, Tammineedi, Scott, 2005) and wide-area network storage (Bassi et al., 2002 ; Rhea et al., 2003) offers various tools that could be of interest for Data Grids. To build Data Grids from these components and to utilize them effectively, one needs to bring them together into a comprehensive framework. BitDew suits
this purpose by providing an environment for data management and distribution in Desktop Grids.
Large data movement across wide-area networks can be costly in terms of performance because bandwidth across the Internet is often limited, variable and unpredictable. Caching data on the local storage
of the Desktop PC (Iamnitchi, Doraimani, Garzoglio, 2006 ; Otoo, Rotem, Romosan, 2004 ; Vazhkudai
et al., 2005) with adequate scheduling strategies (Santos-Neto, Cirne, Brasileiro, Lima, 2004 ; Wei et
al., 2005) to minimize data transfers can improve overall application execution performance.
Long-running applications are challenging due to the volatility of the executing nodes. To complete their execution, they require local or remote checkpoints to avoid losing the intermediate computational state when a failure occurs. In the context of Desktop Grids, these applications also have to cope with replication and sabotage. An idea proposed in (Kondo, Araujo, et al., 2006) is to compute a signature of checkpoint images and use signature comparison to eliminate diverging executions. Thus, indexing data with their checksums, as commonly done by DHT and P2P software, permits basic sabotage tolerance even without retrieving the data.
BitDew leverages the use of metadata, a technique widely used in Data Grids (Jin, Xiong, Wu, Zou, 2006), but in a more directive style. It defines 5 different types of metadata: i) replication indicates how many occurrences of a datum should be available at the same time in the system, ii) fault tolerance controls the resilience of data in the presence of machine crashes, iii) lifetime is a duration, absolute or relative to the existence of other data, which indicates when a datum is obsolete, iv) affinity drives the movement of data according to dependency rules, and v) transfer protocol gives the runtime environment hints about the file transfer protocol appropriate for distributing the data. Programmers tag each datum with these simple attributes, and simply let the BitDew runtime environment manage the operations of data creation, deletion, movement and replication, as well as fault tolerance.
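As a rough illustration of this programming style (the attribute names simply mirror the five metadata types listed above; this is not the actual BitDew API), data can be tagged as follows and handed to the runtime:

from dataclasses import dataclass

@dataclass
class DataAttributes:
    replication: int = 1          # occurrences that should exist at the same time
    fault_tolerance: bool = True  # recreate copies lost after a machine crash
    lifetime: str = "job_end"     # absolute date, or relative to another datum
    affinity: str = ""            # datum this one should be co-located with
    protocol: str = "ftp"         # hint about the transfer protocol to use

inputs = {
    "genome.db":  DataAttributes(replication=20, protocol="bittorrent"),
    "params.txt": DataAttributes(affinity="genome.db"),
}
for name, attrs in inputs.items():
    print(name, attrs)            # the runtime would act on these attributes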

Security Model
In this section we review the security models of several Desktop Grid systems.
The BOINC (Anderson, 2004) middleware is a popular Volunteer Computing System which makes it possible to aggregate huge computing power from thousands of Internet users. A key point is the asymmetry of its security model: there are a few well-identified projects, belonging to established institutions (for example, the University of California at Berkeley for the SETI@Home project), while volunteers are numerous and
anonymous. Of course the notion of user exists in BOINC, because volunteers need to receive a reward for their contribution. However, the definition of a user is close to that of an avatar: it allows
users to participate in forums and to receive credits according to the computing time and power given to
the project.
Despite anonymity, the security model is based on trust. Volunteers trust the project they are contributing to. The security mechanism is simple and based on asymmetric cryptography.
The security model aims at enforcing the trust between volunteers and the project they participate in. At installation time, the owners of a project produce a pair of public/private keys and store those keys in a safe place, typically, as recommended on the BOINC web site, on a machine isolated from the network.
When volunteers contribute for the first time to the project, they obtain the public key of the project.
Project owners have to digitally sign the application files of the project, so that volunteers can verify
that the binary code downloaded by the BOINC client really belongs to the project. This mechanism ensures that, if a pirate gets access to one of the BOINC servers, he would not be able to upload malicious code to hundreds of thousands of users. While volunteers trust the projects, the reverse is not true. To protect against malicious users, BOINC implements a result certification mechanism (Sarmenta, 2002), based on redundant computation. BOINC gives project administrators the ability to write their own custom result certification code according to their application.
XtremWeb is an Internet Desktop Grid middleware which also permits public-resource computing. It differs from BOINC by the ability given to every participant to submit new applications and tasks to the system. XtremWeb is a P2P system in the sense that every participant can provide computing resources but can also utilize other participants' computing resources. XtremWeb is organized as a three-tier architecture where clients consume resources, workers provide resources, and the coordinator is a central agent which manages the system by performing the scheduling and fault-tolerance tasks.
Even if BOINC defines users in its implementation, they are anonymous and are only used to facilitate platform management. They cannot be trusted; only project owners can be trusted. In contrast with BOINC, because everyone can submit applications, there cannot be any form of trust between users, applications, results, and even the coordinator itself. Thus, the XtremWeb security model is based on autonomous mechanisms which aim at protecting each component of the platform from the other elements. For instance, to protect the volunteer's computer from malicious code, a sandbox mechanism is used to isolate and monitor the running application and prevent it from damaging the volunteer's system. A public/private key mechanism is also used to authenticate the coordinator, to prevent results from being uploaded to another coordinator.
The Xgrid system, proposed by Apple, is a Desktop Grid designed to run in a local network environment. Xgrid features ease of use and ease of deployment. To work, the Xgrid system needs an Xgrid server, which can be configured with or without a password. If the server runs without a password, then every user in the local environment can submit jobs and applications; otherwise, only those who can authenticate to the server are granted this authorization. Computing nodes in the Xgrid system can accept jobs or not; this property is set on the computing nodes themselves. Thus, there is no real distinction between users, and there is no possibility for a user or a machine to accept or refuse another user's applications or work. While this solution is acceptable when used within a single organization (a lab or a small company), it would not scale to a large Grid setup, which typically aims at making several institutions cooperate.


Figure 5. Bridging service Grid and desktop Grid, the superworker approach vs. the gliding-in approach

Bridging Service Grids and Desktop Grids


There exist two main approaches to bridging Service Grids and Desktop Grids (see Figure 5). In this section we present the principles of these two approaches and discuss them from a security perspective.

The Superworker Approach


A first solution, used by the Lattice (Myers, Bazinet, Cummings, 2008) project and the SZTAKI Desktop Grid (Balaton et al., 2007), is to build a superworker which enables several Grid or cluster resources to compute for a Desktop Grid. The superworker is a bridge implemented as a daemon between the Desktop Grid server and the Service Grid resources. From the Desktop Grid server point of view, the Grid or cluster appears as one single resource with large computing capabilities. The superworker continuously fetches tasks or work units from the Desktop Grid server, then wraps and submits the tasks to the local Grid or cluster resource manager. When the computations are finished on the SG computing nodes, the superworker sends back the results to the Desktop Grid server. Thus, the superworker is itself a scheduler which needs to continuously scan the queues of the computing resources and watch for available resources to launch jobs.
Since the superworker is a centralized agent, this solution has several drawbacks: i) the superworker can become a bottleneck when the number of computing resources increases, ii) the round trip of a work unit is increased because it has to be marshalled/unmarshalled by the superworker, and iii) it introduces a single point of failure in the system, which gives low fault tolerance. On the other hand, this centralized solution provides better security properties concerning the integration with the Grid. First, the superworker does not require modification of the infrastructure; it can be run under any user identity as long as that user has the right to submit jobs on the Grid. Next, as work units are wrapped by the superworker, they are run under the user's identity, which conforms with regular security usage, in contrast with the approach described in the following paragraph.

The Gliding-In Approach


The gliding-in approach, which glides into cluster resources spread across different Condor pools using the Global Computing system XtremWeb, was first introduced in (Lodygensky et al., 2003). The main principle consists in wrapping the XtremWeb worker as a regular Condor task and submitting this task to the Condor pool. Once the worker is executed on a Condor resource, it pulls jobs from the Desktop Grid server, executes the XtremWeb tasks and returns the results to the XtremWeb server. As a consequence, the Condor resources communicate directly with the XtremWeb server. Similar mechanisms are now commonly employed in Grid Computing (Thain & Livny, 2004). For example, DIRAC (Tsaregorodtsev, Garonne, Stokes-Rees, 2004) uses a combination of push/pull mechanisms to execute jobs on several Grid clusters. The generic approach on the Grid is called a pilot job: instead of submitting jobs directly to the Grid's gatekeeper, the system submits so-called pilot jobs; when executed, a pilot job fetches jobs from an external job scheduler.
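The pilot-job principle fits in a few lines of sketch code (the coordinator class below is a stand-in for an XtremWeb- or BOINC-like server, not a real API): the job submitted through the gatekeeper is only a loop that pulls and runs the real tasks.

def pilot_job(server, node_name):
    # This is what actually runs on the Grid node once the pilot has been scheduled.
    while True:
        task = server.fetch_task(node=node_name)
        if task is None:
            break                         # no more work: the pilot terminates
        result = task["run"]()            # execute the user task locally
        server.push_result(task["id"], result)

class FakeCoordinator:
    # Stand-in for a Desktop Grid server reachable over the network.
    def __init__(self, tasks):
        self.tasks, self.results = list(tasks), {}
    def fetch_task(self, node):
        return self.tasks.pop() if self.tasks else None
    def push_result(self, task_id, result):
        self.results[task_id] = result

server = FakeCoordinator([{"id": i, "run": (lambda i=i: i * i)} for i in range(5)])
pilot_job(server, "grid-node-07")
print(server.results)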
The gliding-in or pilot job approach has several advantages. While simple, this mechanism efficiently balances the load between heterogeneous computing sites. It benefits from the fault tolerance provided by the Desktop Grid server: if Grid nodes fail, then jobs get rescheduled to the next available resources. Finally, as the performance study of the Falkon (Raicu, Zhao, Dumitrescu, Foster, Wilde, 2007) system shows, it gives better performance because series of jobs do not have to go through the gatekeeper queues, which are generally characterized by long waiting times, and communications are direct between the CE and the Desktop Grid server, without an intermediate agent such as the superworker. From the security point of view, this approach breaks the Grid security rules because the job owner may be different from the pilot job owner. This is a well-known issue of pilot jobs, and new solutions such as gLExec (Sfiligoi et al., 2007) have been proposed to circumvent this security hole.

Result Certification
Result certification in desktop grids is essential for several reasons. First, malicious users can report
erroneous results. Second, hosts can unintentionally report erroneous results because of viruses that
corrupt the system or hardware problems (for example overheating of the CPU). Third, differences in
system or hardware configuration can result in different computational results.
We discuss three of the most common state-of-the-art methods (Sarmenta, 2002 ; Zhao & Lo, 2001;
Taufer, Anderson, Cicotti, 2005) for result certification, namely spot-checking, majority voting, and
credibility-based techniques, and emphasize the issues related to each method.
The majority voting method detects erroneous results by sending identical workunits to multiple
workers. After the results are retrieved, the result that appears most often is assumed to be correct. In
(Sarmenta, 2002), the author determines the amount of redundancy for majority voting needed to achieve
a bound on the frequency of voting errors given the probability that a worker returns an erroneous result.
Let the error rate φ be the probability that a worker is erroneous and returns an erroneous result unit,
and let ε be the percentage of final results (after voting) that are incorrect.
Let m be the number of identical results out of 2m-1 required before a vote is considered complete
and a result is decided upon. Then the probability of an incorrect result being accepted after a majority

vote is given by:

\varepsilon_{majv}(\varphi, m) = \sum_{j=m}^{2m-1} \binom{2m-1}{j} \varphi^{j} (1 - \varphi)^{2m-1-j}     (1)

The redundancy of majority voting is m / (1 - φ).
The main issues for majority voting are the following. First, the error bound assumes that error
rates are not correlated among hosts. Second, majority voting is most effective when error rates are
relatively low (<1%); otherwise the required redundancy could be too high.
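Equation (1) is straightforward to evaluate; the short sketch below (assuming independent errors, as the model does) shows how fast the final error rate drops with m when the per-result error rate φ is small, and what the corresponding redundancy cost is:

from math import comb

def majority_vote_error(phi, m):
    # Equation (1): probability that an erroneous value wins an m-out-of-(2m-1) vote.
    n = 2 * m - 1
    return sum(comb(n, j) * phi**j * (1 - phi)**(n - j) for j in range(m, n + 1))

def majority_vote_redundancy(phi, m):
    return m / (1 - phi)

for m in (2, 3, 5):
    print(m, majority_vote_error(0.01, m), majority_vote_redundancy(0.01, m))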
A more efficient method for error detection is spot-checking, whereby a workunit with a known
correct result is distributed at random to workers. The workers' results are then compared to the
previously computed and verified result. Any discrepancies cause the corresponding worker to be
blacklisted, i.e., any past or future results returned from the erroneous host are discarded (perhaps
unknowingly to the host).
Erroneous workunit computation was modelled as a Bernoulli process (Sarmenta, 2002) to determine the error rate of spot-checking given the portion of work contributed by the host, and the rate
at which incorrect results are returned. The model uses a work pool that is divided into equally sized
batches.
Allowing the model to exclude coordinated attacks, let q be the frequency of spot-checking, let n
be the amount of work contributed by the erroneous worker, let f be the fraction of hosts that commit
at least one error, and let s be the error rate per erroneous host. Then (1 - qs)^n is the probability that an erroneous host is not discovered after processing n workunits. The rate at which spot-checking with blacklisting will fail to catch bad results is given by:
\varepsilon_{scbl}(q, n, f, s) = \frac{s f (1 - qs)^{n}}{(1 - f) + f (1 - qs)^{n}}     (2)

The amount of redundancy of spot-checking is given by 1 / (1 - q).
There are several critical issues related to spot-checking with blacklisting. First, it assumes that
blacklisting will effectively remove erroneous hosts, in spite of the possibility of hosts registering
with new identities or high host churn as shown by (Anderson & Fedak, 2006). Without blacklisting,
the upper bound on the error rate is much higher and does not decrease inversely with n. Second,
spot-checking is effective only if error rates are consistent over time. Third, spot-checking is most

effective when error rates are high (>1%); otherwise, the number of workunits to be computed per
worker n must be extremely high.
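Equation (2) can be explored numerically in the same way (the parameter values below are arbitrary and only meant to show how the error rate decreases as each host processes more workunits n):

def spotcheck_error(q, n, f, s):
    # Equation (2): rate at which spot-checking with blacklisting misses bad results.
    survive = (1 - q * s) ** n        # erroneous host still undetected after n workunits
    return (s * f * survive) / ((1 - f) + f * survive)

def spotcheck_redundancy(q):
    return 1 / (1 - q)

for n in (10, 100, 1000):
    print(n, spotcheck_error(q=0.1, n=n, f=0.05, s=0.5), spotcheck_redundancy(q=0.1))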
To address the potential weaknesses of majority voting and spot-checking, credibility-based systems
were proposed (Sarmenta, 2002), which use the conditional probabilities of errors given the history of
host result correctness. The idea is based on the assumption that hosts that have computed many results
with relatively few errors have a higher probability of errorless computation than hosts with a history
of returning erroneous results. Workunits are assigned to hosts such that more attention is given to the
workunits distributed to higher risk hosts.
To determine the credibility of each host, any error detection method such as majority voting, spot-checking, or various combinations of the two can be used. The credibilities are then used to compute the conditional probability of a result's correctness.

RESEARCH AND EXPLORATION TOOLS


Platforms Observations
Before improving the algorithms of existing platforms and simulating them, it is necessary to observe the real behavior of existing software on the Internet. First, in (Kondo, Taufer, Brooks, Casanova, Chien, 2004 ; Kondo, Fedak, Cappello, Chien, Casanova, 2006), hundreds of desktop PCs were measured and characterized at the University of California at San Diego and the University of Paris-Sud.
In (Anderson & Fedak, 2006), the authors measured aggregate statistics gathered through BOINC.
A limitation of this work is that the measurements do not describe the temporal structure of availability
of individual resources.
Recently, the XtremLab11 (Malécot, Kondo, Fedak, 2006) project, running on BOINC, has collected over 15 months of traces from 15,000 hosts. It runs active measurement software that gives the exact amount of computing power obtained by the project over time for each node. After minimal processing, these traces are used by simulators.

DG Simulation: SimBOINC
There are several challenges for desktop grid simulation. First, one needs ways for abstracting failures
and handling them. In simulation toolkits, one needs a way to specify the type of failure (permanent
or transient) and how the system reacts to the failures (for example, restart after the system becomes
available again). Simulators, such as SimGrid, are beginning to use exception handling for cleanly dealing with failures. Second, one needs to be able to deal with scaling issues. In some respects, building
a trace-driven simulator using 50,000 resources is trivial when resources are not shared and they are
interconnected with trivial network models. However, when resources are shared by a number of competing entities, issues of scale arise because one must recompute the allocation of the resource for each
entity whenever the resource state changes. Third, as desktop grids and volunteer computing systems
are invariably distributed over wide-area networks, one needs accurate network models that scale to
hundreds and thousands of resources. The open issue is to get the speed of flow-based network models
and at the same time the accuracy of packet-level simulation. Below we describe one recent approach
for desktop grid simulation.


SimBOINC is a simulator for heterogeneous and volatile desktop grids and volunteer computing
systems. The goal of this project is to provide a simulator by which to test new scheduling strategies
in BOINC and other desktop and volunteer systems in general. SimBOINC is based on the SimGrid
simulation toolkit for simulating distributed and parallel systems (Casanova, Legrand, & Quinson), and
uses it to simulate BOINC (in particular, the client CPU scheduler and, eventually, the work-fetch
policy) by implementing a number of required functionalities.

Simulator Overview.
SimBOINC simulates a client-server platform where multiple clients request work from a central server.
In particular, we have implemented a client class that is based on the BOINC client, and uses (almost
exactly) the client's CPU scheduler source code. The characteristics of the client (for example, speed,
project resource shares, and availability), of the workload (for example, the projects, the size of each
task, and checkpoint frequency), and of the network connecting the client and server (for example,
bandwidth and latency) can all be specified as simulation inputs. With those inputs, the simulator
will execute and produce an output file that gives the values for a number of scheduler performance
metrics, such as effective resource shares, and task deadline misses.
The current simulator can simulate a single client that downloads workunits from multiple projects
and use its CPU scheduler to decide when to schedule each workunit.
The server in SimBOINC is different from the typical BOINC server in that there is one server for
multiple projects, and so requests for work from multiple projects are channeled to a single server.
The server consists of a request_handler that basically uses work_req_seconds and project_id parameters sent in the scheduler_request to determine the amount of work from a specific project to send
to a client.
We understand that for testing new work-fetch policies and CPU schedulers, only a single client that downloads work from multiple projects is needed. But we wanted SimBOINC to be a general
purpose volunteer computing simulator that could simulate new uses of BOINC by different kinds
of applications. For example, people should be able to use SimBOINC to simulate the scheduling of
low-latency jobs or for simulating large peer-to-peer file distribution; in both these cases, simulating
multiple clients would be essential.

Execution.
SimBOINC expects the following inputs in the form of XML files:

•	Platform file: This specifies the hosts in the platform and the network connecting the hosts.
•	Host availability trace files: These are to be specified within the platform file.
•	Workload file: This specifies the jobs, i.e., projects, to be executed on the clients.
•	Client states file: This specifies the configuration of the BOINC client.
•	Simulation file: This specifies the configuration of the specific simulator execution.

The platform file is where one constructs the computing and network resources on which the
BOINC client and server run. In particular, SimBOINC expects a set of CPU resources, and a set of
network links that connect those resources. For each resource, one can specify a set of attributes. For
example, with CPU resources, one can specify the power, and corresponding availability trace files.
For network resources, one can specify their bandwidth and latency.
The workload file specifies the projects to be executed over the BOINC platform. In particular, it
specifies for each project, the name, total number of tasks to execute, the task size in terms of computation,
the task size in terms of communication, the checkpoint frequency for each task, and the delay_bound
and rsc_fpops_est BOINC task attributes.
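Purely as an illustration, the snippet below generates a tiny workload description of this kind. The element and attribute names are assumptions made for this sketch (the actual SimBOINC schema is not reproduced in this chapter); only the kind of information carried matters here.

```python
import xml.etree.ElementTree as ET

# NOTE: tag and attribute names below are illustrative placeholders, not the
# real SimBOINC workload schema.
workload = ET.Element("workload")
project = ET.SubElement(workload, "project", name="climate_sim")
ET.SubElement(project, "tasks",
              count="100",              # total number of tasks to execute
              flops="3.0e12",           # task size in terms of computation
              input_bytes="2.0e6",      # task size in terms of communication
              checkpoint_period="600")  # checkpoint frequency (seconds)
ET.SubElement(project, "boinc",
              delay_bound="86400",      # BOINC task attribute: report deadline
              rsc_fpops_est="3.0e12")   # BOINC task attribute: estimated FLOPs

ET.ElementTree(workload).write("workload.xml", encoding="utf-8", xml_declaration=True)
```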

Client States File.


The client states input file is based on the client states format exported by the BOINC client to store
persistent state. The idea is that the client states files could be collected and assembled to produce a client_states input file to SimBOINC, which would allow the simulation of BOINC clients using realistic
settings.

Simulation File
This simulation input file specifies the type of simulation to be conducted (e.g. BOINC), the maximum
time for simulation after which the simulation will be terminated, and the output file name.

Using Availability Traces


In SimGrid, the availability of network and CPU resources can be specified through traces. For CPU
resources, one specifies a cpu availability file that denotes the availability of the cpu as a percentage
over time. Also, for the cpu, one specifies a failure file that indicates when the cpu fails. A cpu is considered to have failed when it is no longer available for computation. In SimGrid, a CPU failure causes all
processes running on that CPU to terminate. In BOINC, at least three things can cause an executing
task to fail. First, the task could be preempted by the BOINC client because of the client scheduling
policy. Second, the task could be preempted by the BOINC client because of user activity, according to
the user's preferences. Third, the host could fail (for example, due to a machine crash or shutdown). In
SimBOINC, the failures of a host specified in the CPU trace files represent the failure resulting from
the latter two causes. That is, when a cpu fails as specified in the traces, all processes on the cpu will
terminate. However, their state is maintained and persists through the failure so that when the host becomes available again, the processes will be restarted in the same state. That is, the tasks that had been
executing before the failure are restarted from the last checkpoint after the failure, and the client state
data structure is the same as before the failure.
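The following self-contained sketch (not SimBOINC code) replays such a trace against a single task to make the availability and failure semantics concrete. The trace format, the rule of checkpointing at the end of every available interval, and the numbers are simplifying assumptions.

```python
def simulate_host(trace, task_flops, host_speed):
    """trace: list of (duration_seconds, availability_fraction); availability
    0.0 is treated as a failure interval.  Work done since the last checkpoint
    is lost on failure; for simplicity a checkpoint is taken at the end of
    every interval during which the host was available."""
    checkpointed = 0.0    # flops saved by the last checkpoint
    progress = 0.0        # flops done since the last checkpoint
    clock = 0.0
    for duration, availability in trace:
        clock += duration
        if availability == 0.0:
            progress = 0.0                    # failure: roll back to the checkpoint
            continue
        progress += duration * host_speed * availability
        if checkpointed + progress >= task_flops:
            return clock                      # task finished (approximately)
        checkpointed += progress              # checkpoint at the end of the interval
        progress = 0.0
    return None                               # trace ended before completion

# Example: a 1 GFLOPS host, 50% available for an hour, a 10-minute failure,
# then two hours fully available.
print(simulate_host([(3600, 0.5), (600, 0.0), (7200, 1.0)],
                    task_flops=5e12, host_speed=1e9))
```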

Logging
SimBOINC uses the logging facility called XBT provided by SimGrid, which is similar in spirit to log4j
(and, in turn, log4cxx, etc.). It allows for runtime configuration of message output and of the level of
detail. However, it does not yet support appenders. We chose to use XBT instead of BOINC's message
logger because XBT is integrated with SimGrid, and as such can show more informative messages by
default (such as the name of the process and the simulation time).


Simulator Output and Performance Metrics.


The simulator output file must be specified in the simulation input file. The simulator then outputs the
following metrics to that file in XML, for each client and for each project that the client participates in: the total
number of tasks completed; the resource share and the effective resource share (calculated using the CPU
time for each completed task); the number and percentage of missed report deadlines for
completed tasks; and the number and percentage of report deadlines met for completed tasks. Also, for each CPU
specified in the platform.xml file, the simulator will output a corresponding .trace file, which records
information about the execution of tasks on that CPU. In particular, the trace file shows, in each column,
the simulation time, the task name, the event (START, COMPLETED, CANCELLED, or FAILED), the
CPU name, and the completion time when applicable.
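Post-processing such a file takes only a few lines of scripting. The sketch below tallies events and averages completion times, assuming whitespace-separated columns in the order just described; the real file layout may differ.

```python
from collections import Counter

def summarize_trace(path):
    """Count events and average completion times in a per-CPU .trace file
    whose columns are assumed to be: simulation time, task name, event,
    CPU name, and (when applicable) completion time."""
    events = Counter()
    completions = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4:
                continue                      # skip malformed or empty lines
            _sim_time, _task, event, _cpu = fields[:4]
            events[event] += 1
            if event == "COMPLETED" and len(fields) >= 5:
                completions.append(float(fields[4]))
    mean = sum(completions) / len(completions) if completions else None
    return dict(events), mean

# counts, mean_completion = summarize_trace("client-1.trace")
```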

Use of SimGrid.
We chose to implement the BOINC simulator using SimGrid for a number of reasons. First, SimGrid
provides a number of abstractions and tools that simplify the process of simulating complex parallel and
distributed systems. For example, SimGrid provides abstractions for processes, computing elements,
network links, etc. These abstractions and tools greatly simplified the implementation of the BOINC
simulator. Second, we can leverage the proven accuracy of SimGrid's resource models. For example,
SimGrid models the allocation of network bandwidth among competing data transfers using a flow-based
TCP model that has been shown to be reasonably accurate. Third, SimGrid is implemented
in C, and using it with BOINC's C++ source code is straightforward.

APPLICATIONS
Bag of Task Applications
Applications composed of a set of independent tasks are the most common class of application that one can
execute on a Desktop Grid. This class of application is straightforward to schedule and simple to execute
when there is little I/O, and it is very popular in many scientific
domains. In particular, it permits multi-parametric studies, in which one application, typically a simulation
code, is run against a large set of parameters in order to explore a range of possible solutions.
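A minimal sketch of such a multi-parametric, bag-of-tasks study is shown below. Locally the independent tasks are farmed out to a process pool; on a Desktop Grid, each call would instead become a workunit sent to a volunteer host. The parameter names, values, and the stand-in simulation function are placeholders.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def simulate(params):
    """Stand-in for the simulation code run once per parameter combination."""
    alpha, beta = params
    return alpha, beta, alpha * beta          # placeholder result

if __name__ == "__main__":
    # One independent task per point of the parameter grid.
    sweep = list(product([0.1, 0.2, 0.5], range(1, 4)))
    with ProcessPoolExecutor() as pool:       # on a Desktop Grid: one workunit per point
        for result in pool.map(simulate, sweep):
            print(result)
```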

Data Intensive
Enabling Data Grids is one of the fundamental efforts of the computational science community as
emphasized by projects such as EGEE (Enabling Grids for E-Science in Europe) and PPDG (2006).
This effort is pushed by the new requirements of e-Science, that is, large communities of researchers
collaborating to extract knowledge and information from huge amounts of scientific data. This has led
to the emergence of a new class of application, called data-intensive applications that require secure
and coordinated access to large datasets, wide-area transfers and broad distribution of TeraBytes of data
while keeping track of multiple data replicas. The Data Grid aims at providing such an infrastructure and
services to enable data-intensive applications. Despite the attractiveness of Desktop Grids, little work has
been done to support data-intensive applications in this context of massively distributed, volatile, shared
and heterogeneous resources. Most Desktop Grid systems, like BOINC (Anderson, 2004), XtremWeb
(Fedak et al., 2001) and OurGrid (Andrade et al., 2003) rely on a centralized architecture for indexing
and distributing the data, and thus potentially face issues with scalability and fault-tolerance.
Large data movement across wide-area networks can be costly in terms of performance because
bandwidth across the Internet is often limited, variable and unpredictable. Caching data on local
workstation storage (Iamnitchi et al., 2006 ; Otoo et al., 2004 ; Vazhkudai et al., 2005) with adequate
scheduling strategies (Santos-Neto et al., 2004 ; Wei et al., 2005) to minimize data transfers can improve
overall application execution time. Implementing a simple execution principle like Owner Compute
still requires the system to efficiently locate data and to provide a model for the cost of moving data.
Moreover, accurate modeling (Qiu & Srikant, 2004) and forecasting of P2P communication is still a
challenging and open issue, and it will be required before one can efficiently execute more demanding
types of applications, such as those that require real-time or stream processing.

Long Running Applications


Long-running applications are challenging due to the volatility of executing nodes; completing their
execution requires local or remote checkpointing services to avoid
losing their computational state when a failure occurs.

Real-Time Applications
In this paragraph, we focus on enabling soft real-time applications to execute on enterprise desktop
Grids; soft real-time applications often have a deadline associated with each task but can afford to miss
some of these deadlines. A number of soft real-time applications ranging from information processing
of sensor networks (Sensor Networks), real-time video encoding (Rodriguez, Gonzalez, Malumbres),
to interactive scientific visualization (Lopez et al., 1999 ; Smallen, Casanova, Berman, 2001) could potentially utilize desktop Grids. An example of such an application that has soft real-time requirements is
on-line parallel tomography (Smallen et al., 2001). Tomography is the construction of 3-D models from
2-D projections, and it is common in electron microscopy to use tomography to create 3-D images of
biological specimens. On-line parallel tomography applications are embarrassingly parallel as each 2-D
projection can be decomposed into independent slices that must be distributed to a set of resources for
processing. Each slice is on the order of kilobytes or megabytes in size, and there are typically hundreds
or thousands of slices per projection, depending on the size of each projection. Ideally, the processing
time of a single projection can be done while the user is acquiring the next image from the microscope,
which typically takes several minutes (Hsu, 2005). As such, on-line parallel tomography could potentially
be executed on desktop Grids if there were an effective method for meeting the application's relatively
stringent timing demands.

Network-Intensive Applications
There are a few desktop Grid applications that are not CPU or data intensive; they use other resources
available on the compute node. The execution time is not limited by processing speed, the amount of
available memory, or communication times but by the availability of these resources.


The network is one of these resources. Malicious distributed applications (zombie PCs) use it for
sending huge amounts of data: sending spam or mounting distributed attacks targeting a given host. But the network
may also be useful for web spiders. For example, YaCy is a P2P-based search engine. On each volunteer
resource, a web crawler collects data from the web, which is locally indexed and stored. A local client is
available for retrieving search results from other computing nodes through a DHT.
Those tasks often require a special scheduling policy from the desktop Grid because the usual criteria cannot be used. For example, BOINC has support for non-CPU-intensive tasks (a special mode that applies to a
whole project), but some limitations are imposed. First, the client does not maintain a cache of tasks
to run: there is only one task present on the client at a given time. This is because BOINC
cannot estimate completion time by measuring CPU usage, as it does for normal projects. Second, non-CPU-intensive applications have to restrict their CPU usage to a minimum because other
CPU-intensive tasks may be running at the same time: BOINC does not mix scheduling policies.

CONCLUSION
Throughout this chapter, we have presented a historical review of Desktop Grid systems as well as the
state of the art of scientific research and the most recent technological innovations. The history of
Desktop Grid systems started in the late 1990s with simple computational applications featuring trivial
and massive parallelism. Systems were based on common and rough-and-ready technologies, such as
Web servers with server-side scripts and Java applets. Despite or because of this seeming architectural
simplicity, these systems grew rapidly to appear amongst the largest distributed applications. In the early
2000s, the challenge of gathering TeraFlops from volunteers' PCs was met, attracting the attention of the
mainstream media. Several high-tech companies were built up to sell services and commercial
systems. In some sense, Desktop Grid systems appeared to be the most successful amongst Grid
applications in popularizing and democratizing Grids to people at large.
During the first decade of research in Desktop Grid systems, a huge effort has been made to bring
this paradigm to a common facility usable for a broad range of scientific and industrial applications. As
a consequence, this effort has led to an impressive set of innovations which have improved Desktop Grid
systems in terms of reliability (for instance, fault-tolerant communication libraries, distributed checkpointing), of data management (use of P2P protocols to distribute and manage data), of security (result
certification, sandboxing) and performance (new classes of scheduling heuristics based on replication
and evaluation of host availability).
What are the perspectives of Desktop Grid systems? The singularity of DG systems is their location,
at the frontier between Grid systems and the Internet. As DG systems become more efficient and
more reliable, they will be incorporated more deeply into Grid systems. On the one hand, this will enable more
scientists to benefit from this technology. On the other hand, the price will be an increased complexity
in terms of management.
As such, the future of DG systems will certainly follow the evolution of the Internet towards more
user-provided content, social networks, distributed intelligence, etc.
Desktop Grid computing may also have a role to play in the context of Cloud computing. Currently,
the service infrastructure envisioned for Clouds is designed around large-scale data centers. However, as
for P2P systems, an approach to Cloud computing based on communities of users sharing resources for
free may counterbalance the current trend toward commercial service infrastructures. Of course, using
Desktop Grids as an underlying technology and infrastructure for Cloud computing raises a lot of research
issues and opens exciting perspectives for Desktop Grids.

REFERENCES
Abdennadher, N., & Boesch, R. (2006, August). A scheduling algorithm for high performance peer-topeer platform. In W. Lehner, N. Meyer, A. Streit, & C. Stewart (Eds.), Coregrid Workshop, Euro-Par
2006 (p. 126-137). Dresden, Germany: Springer.
Alexandrov, A. D., Ibel, M., Schauser, K. E., & Scheiman, C. (1997, April). SuperWeb: Towards a global
web-based parallel computing infrastructure. In Proceedings of the 11th IEEE International Parallel
Processing Symposium (IPPS).
Anderson, D. (2004). BOINC: A system for public-resource computing and storage. In Proceedings of
the 5th IEEE/ACM International Grid Workshop, Pittsburgh, PA.
Anderson, D., & Fedak, G. (2006). The computational and storage potential of volunteer computing. In Proceedings of The IEEE International Symposium on Cluster Computing and The Grid (CCGRID06).
Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M., & Werthimer, D. (2002, November). Seti@
home: An experiment in public-resource computing. Communications of the ACM, 45(11), 5661.
doi:10.1145/581571.581573
Andrade, N., Cirne, W., Brasileiro, F., & Roisenberg, P. (2003, June). OurGrid: An approach to easily
assemble grids with equitable resource sharing. In Proceedings of the 9th Workshop on Job Scheduling
Strategies for Parallel Processing.
Andrzejak, A., Domingues, P., & Silva, L. (2006). Predicting Machine Availabilities in Desktop Pools.
In IEEE/IFIP Network Operations and Management Symposium (pp. 225234).
Andrzejak, A., Kondo, D., & Anderson, D. P. (2008). Ensuring collective availability in volatile resource
pools via forecasting. In 19th Ifip/Ieee Distributed Systems: Operations And Management (DSOM 2008).
Samos Island, Greece.
Araujo, F., Domingues, P., Kondo, D., & Silva, L. M. (2008, April). Using cliques of nodes to store
desktop grid checkpoints. In Coregrid Integration Workshop, Crete, Greece.
Balaton, Z., Gombas, G., Kacsuk, P., Kornafeld, A., Kovacs, J., & Marosi, A. C. (2007, March 26-30).
Sztaki desktop grid: a modular and scalable way of building large computing grids. In Proceedings of
the 21st International Parallel And Distributed Processing Symposium, Long Beach, CA.
Baldassari, J., Finkel, D., & Toth, D. (2006, November 13-15). Slinc: A framework for volunteer computing. In Proceedings of the 18th Iasted International Conference On Parallel And Distributed Computing
And Systems (PDCS 2006). Dallas, TX.
Barak, A., Guday, S., & Wheeler, R. (1993). The MOSIX Distributed Operating System: Load Balancing for
UNIX (Vol. 672). Berlin: Springer-Verlag.


Baratloo, A., Karaul, M., Kedem, Z., & Wyckoff, P. (1996). Charlotte: Metacomputing on the Web.
In Proceeidngs of the 9th International Conference On Parallel And Distributed Computing Systems
(PDCS-96).
Bassi, A., Beck, M., Fagg, G., Moore, T., Plank, J. S., & Swany, M. (2002). The Internet BackPlane
Protocol: A Study in Resource Sharing. In Second ieee/acm international symposium on cluster computing and the grid, Berlin, Germany.
Berman, F., Wolski, R., Figueira, S., Schopf, J., & Shao, G. (1996). Application-Level Scheduling on
Distributed Heterogeneous Networks. In Proc. of supercomputing96, Pittsburgh, PA.
Bhatt, S. N., Chung, F. R. K., Leighton, F. T., & Rosenberg, A. L. (1997). On optimal strategies
for cycle-stealing in networks of workstations. IEEE Transactions on Computers, 46(5), 545557.
doi:10.1109/12.589220
Bolosky, W., Douceur, J., Ely, D., & Theimer, M. (2000). Feasibility of a Serverless Distributed file
System Deployed on an Existing Set of Desktop PCs. In Proceedings of sigmetrics.
Brecht, T., Sandhu, H., Shan, M., & Talbot, J. (1996). Paraweb: towards world-wide supercomputing.
In Ew 7: Proceedings of the 7th workshop on acm sigops european workshop (pp. 181188). New York:
ACM.
Butt, A. R., Johnson, T. A., Zheng, Y., & Hu, Y. C. (2004). Kosha: A Peer-to-Peer Enhancement for the
Network File System. In Proceeding of International Symposium On Supercomputing SC04.
Camiel, N., London, S., Nisan, N., & Regev, O. (1997, April). The PopCorn Project: Distributed computation over the Internet in Java. In Proceedings of the 6th international world wide web conference.
Cappello, F., Djilali, S., Fedak, G., Herault, T., Magniette, F., & Nri, V. (2004). Computing on large
scale distributed systems: Xtremweb architecture, programming models, security, tests and convergence
with grid. Future Generation Computer Science (FGCS).
Cappello, P., Christiansen, B., Ionescu, M., Neary, M., Schauser, K., & Wu, D. (1997). Javelin: InternetBased Parallel Computing Using Java. In Proceedings of the sixth acm sigplan symposium on principles
and practice of parallel programming.
Casanova, H., Legrand, A., & Quinson, M. SimGrid: a Generic Framework for Large-Scale Distributed
Experimentations. In Proceedings of the 10th ieee international conference on computer modelling and
simulation (uksim/eurosim08).
Casanova, H., Legrand, A., Zagorodnov, D., & Berman, F. (2000, May). Heuristics for Scheduling Parameter Sweep Applications in Grid Environments. In Proceedings of the 9th heterogeneous computing
workshop (hcw00) (pp. 349363).
Casanova, H., Obertelli, G., Berman, F., & Wolski, R. (2000, Nov.). The AppLeS Parameter Sweep
Template: User-Level Middleware for the Grid. In Proceedings of supercomputing 2000 (sc00).
Chien, A., Calder, B., Elbert, S., & Bhatia, K. (2003). Entropia: Architecture and performance of an enterprise desktop grid system. Journal of Parallel and Distributed Computing, 63, 597610. doi:10.1016/
S0743-7315(03)00006-6


Cirne, W., Brasileiro, F., Andrade, N., Costa, L., Andrade, A., & Novaes, R. (2006, September). Labs of
the world, unite!!! Journal of Grid Computing, 4(3), 225246. doi:10.1007/s10723-006-9040-x
Cohen, B. (2003). Incentives build robustness in BitTorrent. In Workshop on economics of peer-to-peer
systems, Berkeley, CA.
Costa, F., Silva, L., Fedak, G., & Kelley, I. (2008, in press). Optimizing the Data Distribution Layer of
BOINC with BitTorrent. In 2nd workshop on desktop grids and volunteer computing systems (pcgrid
2008), Miami, FL.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In Osdi04:
Sixth symposium on operating system design and implementation, (pp. 137150). San Francisco, CA.
Domingues, P., Araujo, F., & Silva, L. M. (2006, December). A dht-based infrastructure for sharing
checkpoints in desktop grid computing. In Conference on e-science and grid computing (escience 06),
Amsterdam, The Netherlands.
Draves, S. (2005, March). The electric sheep screen-saver: A case study in aesthetic evolution. In 3rd
european workshop on evolutionary music and art.
Fedak, G., & Germain, C. Neri, V., & Cappello, F. (2001, May). XtremWeb: A Generic Global Computing System. In Proceedings of the ieee international symposium on cluster computing and the grid
(ccgrid01).
Fedak, G., He, H., & Cappello, F. (2008, November). BitDew: A Programmable Environment for LargeScale Data Management and Distribution. In Proceedings of the acm/ieee supercomputing conference
(sc08), Austin, TX.
Federation, M. D. The Biomedical Informatics Research Network (2003). In I. Foster & C. Kesselman (Eds.), The grid, blueprint for a new computing infrastructure (2nd ed.). San Francisco: Morgan
Kaufmann.
Fernandess, Y., & Malkhi, D. (2006). On Collaborative Content Distribution using Multi-Message Gossip. In Proceedings of the international parallel and distributed processing symposium. Rhodes Island,
Greece: IEEE.
Foster, I., & Kesselman, C. (Eds.). (1999). The Grid: Blueprint for a New Computing Infrastructure.
San Francisco, USA: Morgan Kaufmann Publishers, Inc.
Foster, I. T., & Iamnitchi, A. (2003). On death, taxes, and the convergence of peer-to-peer and grid
computing. Lecture Notes in Computer Science, 2735, 118-128.
Ghare, G., & Leutenegger, L. (2004, June). Improving Speedup and Response Times by Replicating
Parallel Programs on a SNOW. In Proceedings of the 10th workshop on job scheduling strategies for
parallel processing.
Ghormley, D., Petrou, D., Rodrigues, S., Vahdat, A., & Anderson, T. (1998, July). GLUnix: A global
layer unix for a network of workstations. Software, Practice & Experience, 28(9), 929. doi:10.1002/
(SICI)1097-024X(19980725)28:9<929::AID-SPE183>3.0.CO;2-C


Gkantsidis, C., & Rodriguez, P. (2005, March). Network Coding for Large Scale Content Distribution.
In Proceedings of ieee/infocom 2005, Miami, USA.
Heien, E., Fujimoto, N., & Hagihara, K. (2008). Computing low latency batches with unreliable workers
in volunteer computing environments. In Pcgrid.
Hsu, A. (2005, March). Personal communication.
Iamnitchi, A., Doraimani, S., & Garzoglio, G. (2006). Filecules in High-Energy Physics: Characteristics and Impact on Resource Management. In proceeding of 15th ieee international symposium on high
performance distributed computing hpdc 15, Paris.
Iamnitchi, A., Foster, I. T., & Nurmi, D. (2002). A peer-to-peer approach to resource location in grid
environments. In Hpdc (p. 419).
Jin, H., Xiong, M., Wu, S., & Zou, D. (2006). Replica Based Distributed Metadata Management in Grid
Environment. Computational Science (LNCS 3944, pp. 1055-1062). Berlin: Springer-Verlag.
Jung, E. B., Choi, S.-J., Baik, M.-S., Hwang, C.-S., Park, C.-Y., & Young, S. (2005). Scheduling scheme
based on dedication rate in volunteer computing environment. In Third international symposium on
parallel and distributed computing (ispdc 2005), Lille, France.
Kim, J.-S., Nam, B., Keleher, P. J., Marsh, M. A., Bhattacharjee, B., & Sussman, A. (2006). Resource
discovery techniques in distributed desktop grid environments. In Grid (pp. 9-16).
Kondo, D., Araujo, F., Malecot, P., Domingues, P., Silva, L. M., & Fedak, G. (2006). Characterizing result
errors in internet desktop grids (Tech. Rep. No. INRIA-HALTech Report 00102840), INRIA, France.
Kondo, D., Chien, A., & Casanova, H. (2004, November). Rapid Application Turnaround on Enterprise Desktop
Grids. In Acm conference on high performance computing and networking, sc2004.
Kondo, D., Chien, A. A., & Casanova, H. (2007). Scheduling task parallel applications for rapid turnaround on enterprise desktop grids. Journal of Grid Computing, 5(4), 379405. doi:10.1007/s10723007-9063-y
Kondo, D., Fedak, G., Cappello, F., Chien, A. A., & Casanova, H. (2006, December). On Resource
Volatility in Enterprise Desktop Grids. In Proceedings of the 2nd IEEE International Conference On
E-Science And Grid Computing (eScience06) (pp. 7886). Amsterdam, Netherlands.
Kondo, D., Taufer, M., Brooks, C., Casanova, H., & Chien, A. (2004, April). Characterizing and evaluating desktop grids: An empirical study. In Proceedings of the International Parallel and Distributed
Processing Symposium (IPDPS04).
Lee, S., Ren, X., & Eigenmann, R. (2008). Efficient content search in ishare, a p2p based internet-sharing
system. In PCGRID.
Leutenegger, S., & Sun, X. (1993). Distributed computing feasibility in a non-dedicated homogeneous
distributed system. In Proceedings of SC93, Portland, OR.
Litzkow, M., Livny, M., & Mutka, M. (1988). Condor - A hunter of idle workstations. In Proceedings
of the 8th International Conference Of Distributed Computing Systems (ICDCS).


Lodygensky, O., Fedak, G., Cappello, F., Neri, V., Livny, M., & Thain, D. (2003). XtremWeb & Condor:
Sharing resources between Internet connected condor pools. In Proceedings of CCGRID2003, Third
International Workshop On Global And Peer-To-Peer Computing (GP2PC03) (pp. 382389). Tokyo,
Japan.
Lopez, J., Aeschlimann, M., Dinda, P., Kallivokas, L., Lowekamp, B., & O'Hallaron, D. (1999, June).
Preliminary report on the design of a framework for distributed visualization. In Proceedings of the international conference on parallel and distributed processing techniques and applications (PDPTA99)
(pp. 18331839). Las Vegas, NV.
Malcot, P., Kondo, D., & Fedak, G. (2006, June). Xtremlab: A system for characterizing internet desktop
grids. In Poster in the 15th ieee international symposium on high performance distributed computing
hpdc06. Paris, France.
Mattson, T., Sanders, B., & Massingill, B. (2004). Patterns for parallel programming. New York:
Addison-Wesley.
Maymounkov, P., & Mazières, D. (2002). Kademlia: A Peer-to-peer Information System Based on the
XOR Metric. In Proceedings of the 1st international workshop on peer-to-peer systems (iptps02) (pp.
5365).
Mutka, M., & Livny, M. (1991, July). The available capacity of a privately owned workstation environment. Performance Evaluation, 4(12).
Mutka, M. W., & Livny, M. (1987). Profiling workstations available capacity for remote execution. In
Proceedings of performance-87, the 12th ifip w.g. 7.3 international symposium on computer performance
modeling, measurement and evaluation. Brussels, Belgium.
Myers, D. S., Bazinet, A. L., & Cummings, M. P. (2008). Expanding the reach of grid computing:
combining globus- and boinc-based systems. In Grids for Bioinformatics and Computational Biology.
New York: Wiley.
Nisan, N., London, S., Regev, O., & Camiel, N. (1998). Globally distributed computation over the internet - the popcorn project. In International conference on distributed computing systems 1998 (p. 592).
New York: IEEE Computer Society.
Otoo, E., Rotem, D., & Romosan, A. (2004). Optimal File-Bundle Caching Algorithms for Data-Grids.
In Sc 04: Proceedings of the 2004 acm/ieee conference on supercomputing (p. 6). Washington, DC:
IEEE Computer Society.
Pedroso, J., Silva, L., & Silva, J. (1997, June). Web-based metacomputing with JET. In Proc. of the acm
ppopp workshop on java for science and engineering computation.
PPDG. (2006). From fabric to physics (Tech. Rep.). The Particle Physics Data Grid.
Pruyne, J., & Livny, M. (1996). A Worldwide Flock of Condors: Load Sharing among Workstation
Clusters. Journal on Future Generations of Computer Systems, 12.
Qiu, D., & Srikant, R. (2004). Modeling and performance analysis of bittorrent-like peer-to-peer networks. Computer Communication Review, 34(4), 367378. doi:10.1145/1030194.1015508


Raicu, I., Zhao, Y., Dumitrescu, C., Foster, I., & Wilde, M. (2007). Falkon: a fast and light-weight task
execution framework. In Ieee/acm supercomputing.
Raman, R., Livny, M., & Solomon, M. H. (1998). Matchmaking: Distributed resource management for
high throughput computing. In Hpdc (p. 140).
Rhea, S. C., Eaton, P. R., Geels, D., Weatherspoon, H., Zhao, B. Y., & Kubiatowicz, J. (2003). Pond: The
oceanstore prototype. In FAST.

Rodriguez, A., Gonzalez, A., & Malumbres, M. P. Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations. In International Conference
on Parallel Computing in Electrical Engineering (PARELEC'04), 354-357.
Rowstron, A., & Druschel, P. (2001, November). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the 18th ifip/acm international conference
on distributed systems platforms (middleware 2001), Heidelberg, Germany.
Santos-Neto, E., Cirne, W., Brasileiro, F., & Lima, A. (2004). Exploiting Replication and Data Reuse
to Efficiently Schedule Data-intensive Applications on Grids. In Proceedings of the 10th workshop on
job scheduling strategies for parallel processing.
Sarmenta, L. F. G. (2002). Sabotage-tolerance mechanisms for volunteer computing systems. Future
Generation Computer Systems, 18(4), 561572. doi:10.1016/S0167-739X(01)00077-2
Sarmenta, L. F. G., & Hirano, S. (1999). Bayanihan: Building and studying volunteer computing systems
using Java. Future Generation Computer Systems, 15(5/6), 675-686.
Sensor Networks. Retrieved from http://www.sensornetworks.net.au/network.html
Sfiligoi, K. O., Venekamp, G., Yocum, D., Groep, D., & Petravick, D. (2007). Addressing the Pilot
security problem with gLExec (Tech. Rep. No. FERMILAB-PUB-07-483-CD). Fermi National Laboratory, Batavia, IL.
Shirts, M., & Pande, V. (2000). Screen savers of the world, unite! Science, 290, 19031904. doi:10.1126/
science.290.5498.1903
Shoch, J. F., & Hupp, J. A. (1982, March). The worm programs - early experience with a distributed
computation. Communications of the ACM, 25(3).
Smallen, S., Casanova, H., & Berman, F. (2001, Nov.). Tunable on-line parallel tomography. In Proceedings of Supercomputing01, Denver, CO.
Sonnek, J. D., Nathan, M., Chandra, A., & Weissman, J. B. (2006). Reputation-based scheduling on
unreliable distributed infrastructures. In ICDCS (p. 30).
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001, August). Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the ACM SIGCOMM 01
Conference, San Diego, CA.
Taufer, M., Anderson, D., Cicotti, P., & Brooks, C. L., III. (2005). Homogeneous redundancy: a technique to
ensure integrity of molecular simulation results using public computing. In Proceedings of The International Heterogeneity In Computing Workshop.


Thain, D., & Livny, M. (2004). Building reliable clients and services. In The grid2 (pp. 285318). San
Francisco: Morgan Kaufman.
The seti@home project. Retrieved from http://setiathome.ssl.berkeley.edu/
Tsaregorodtsev, A., Garonne, V., & Stokes-Rees, I. (2004). Dirac: A scalable lightweight architecture
for high throughput computing. In Fifth IEEE/ACM International Workshop On Grid Computing
(Grid04).
Vazhkudai, S., Ma, X., Freeh, V., Strickland, J., Tammineedi, N., & Scott, S. (2005). FreeLoader: Scavenging
desktop storage resources for scientific data. In Proceedings of Supercomputing 2005 (SC05), Seattle,
WA.
Wei, B., Fedak, G., & Cappello, F. (2005). Scheduling independent tasks sharing large data distributed with
BitTorrent. In The 6th IEEE/ACM International Workshop On Grid Computing, 2005, Seattle, WA.
YaCy - distributed P2P-based Web indexing.
Zhao, S., & Lo, V. (2001, May). Result Verification and Trust-based Scheduling in Open Peer-to-Peer Cycle
Sharing Systems. In Proceedings of Ieee Fifth International Conference on Peer-To-Peer Systems.
Zhou, D., & Lo, V. M. (2006). Wavegrid: A scalable fast-turnaround heterogeneous peer-based desktop
grid system. In IPDPS.

KEY TERMS AND DEFINITIONS


Cycle Stealing: Consists of using the unused cycles of desktop workstations. Participating workstations also donate some supporting amount of disk storage space, RAM, and network bandwidth, in
addition to raw CPU power. The volunteer must get back full usage of its resources with no delay when
it requests them.
Desktop Grid: A computing environment making use of desktop computers connected via non-dedicated network connections such as the Internet. Desktop Grids are used not only for volunteer computing projects, but also for enterprise Grids.
Master-Worker Paradigm: Consists of two kinds of entities: a master and several workers. The master
decomposes the problem into smaller tasks and distributes them among the workers. Each worker receives
a task from the master, executes it, and sends the result back to the master.
Result Certification: In distributed computing, result certification is a mechanism that aims to
validate the results computed by volatile and possibly malicious hosts. The most common mechanisms
for result validation are majority voting, spot-checking, and credibility-based techniques.
Volunteer Computing: An arrangement in which computer owners provide their computing resources
to one or more projects that use them to do distributed computing. Such Desktop Grids are made
of a multitude of tiny and uncontrollable administrative domains.


ENDNOTES
1 United Devices Inc., http://www.ud.com/
2 Platform Computing Inc., http://www.platform.com/
3 Mesh Technologies, http://www.meshtechnologies.com/
4 The COSM project, http://www.mithral.com/projects/cosm/
5 EINSTEIN@home, http://einstein.phys.uwm.edu
6 The Great Internet Mersenne Prime Search, http://www.mersenne.org/
7 Distributed.net, www.distributed.net
8 Electric Sheep, http://electricsheep.org/
9 XtremWeb-CH's website, http://www.xtremwebch.net/
10 Simple Light-weight Infrastructure for Network Computing, http://slinc.sourceforge.net/
11 XtremLab: A System for Characterizing Internet Desktop Grids, http://xtremlab.lri.fr


Chapter 4

Porting Applications to Grids1


Wolfgang Gentzsch
EU Project DEISA and Board of Directors of the Open Grid Forum, Germany

ABSTRACT
The aim of this chapter is to guide developers and users through the most important stages of implementing
software applications on Grid infrastructures, and to discuss important challenges and potential solutions.
Those challenges come from the underlying grid infrastructure, like security, resource management, and
information services; the application data, data management, and the structure, volume, and location of
the data; and the application architecture, monolithic or workflow, serial or parallel. As a case study, the
author presents the DEISA Distributed European Infrastructure for Supercomputing Applications and
describes its DEISA Extreme Computing Initiative DECI for porting and running scientific grand challenge applications. The chapter concludes with an outlook on Compute Clouds, and suggests ten rules
of building a sustainable grid as a prerequisite for long-term sustainability of the grid applications.

INTRODUCTION
Over the last 40 years, the history of computing has been deeply marked by the affliction of the application
developers who are continuously porting and optimizing their application codes to the latest and greatest computing architectures and environments. After the von Neumann mainframe came the vector
computer, then the shared-memory parallel computer, the distributed-memory parallel computer, the
very-long-instruction word computer, the workstation cluster, the meta-computer, and the Grid (never
fear, it continues, with SOA, Cloud, Virtualization, Many-core, and so on). There is no easy solution
to this, and the real solution would be a separation of concerns between discipline-specific content and
domain-independent software and hardware infrastructure. However, this often comes along with a loss
of performance stemming from the overhead of the infrastructure layers. Recently, users and developers
face another wave of complex computing infrastructures: the Grid.
Let's start by answering the question: What is a Grid? Back in 1998, Ian Foster and Carl Kesselman
(1998) attempted the following definition: A computational grid is a hardware and software infrastructure
that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities. In a subsequent article (Foster, 2002), The Anatomy of the Grid, Ian Foster, Carl Kesselman,
and Steve Tuecke changed this definition to include social and policy issues, stating that Grid computing
is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional
virtual organizations. The key concept is the ability to negotiate resource-sharing arrangements among
a set of participating parties (providers and consumers) and then to use the resulting resource pool for
some purpose. They continued: The sharing that we are concerned with is not primarily file exchange
but rather direct access to computers, software, data, and other resources, as is required by a range of
collaborative problem-solving and resource-brokering strategies emerging in industry, science, and
engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which
sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call
a virtual organization. This author's concern, from the beginning (Gentzsch, 2002), was that the new
definition seemed very ambitious, and, as history has proven, many of the Grid projects with a focus
on these ambitious objectives have so far not led to a sustainable grid production environment. We can
only repeat that the simpler the grid infrastructure, the easier it is to use, and the sharper its focus, the
bigger its chance of success. And it is for a good reason (which we will explain in the following) that
currently the so-called Clouds are becoming more and more popular (Amazon, 2007).
Over the last ten years, hundreds of applications in science, industry and enterprises have been ported
to Grid infrastructures, mostly prototypes in the early definition of Foster & Kesselman (1998). Each
application is unique in that it solves a specific problem, based on modeling, for example, a specific
phenomenon in nature (physics, chemistry, biology, etc.), presented as a mathematical formula together
with appropriate initial and boundary conditions, represented by its discrete analogue using sophisticated
numerical methods, translated into a programming language computers can understand, adjusted to the
underlying computer architecture, embedded in a workflow, and accessible remotely by the user through
a secure, transparent and application-specific portal. In just these very few words, this summarizes the
wide spectrum and complexity we face in problem solving on grid infrastructures.
The user (and especially the developer) faces several layers of complexity when porting applications to a computing environment, especially to a compute or data grid of distributed networked nodes
ranging from desktops to supercomputers. These nodes, usually, consist of several to many loosely or
tightly coupled processors and, more and more, these processors contain few to many cores. To run
efficiently on such systems, applications have to be adjusted to the different layers, taking into account
different levels of granularity, from fine-grain structures deploying multi-core architectures at processor
level to the coarse granularity found in application workflows representing for example multi-physics
applications. On top of this, the user has to take into account the specific requirements of the grid, coming
from the different components of the grid services architecture, such as security, resource management,
information services, and data management.
Obviously, in this article, it seems impossible to present and discuss the complete spectrum of applications and their adaptation and implementation on Grids. Therefore, we restrict ourselves in the
following to briefly describe the different application classes, present a checklist (or classification) with

space, here, we are not able to include a discussion of mental, social, or legal aspects which sometimes
might be the knock-out criteria for running applications on a grid. Other show-stoppers such as sensitive data, security concerns, licensing issues, and intellectual property, were discussed in some detail
in Gentzsch (2007a).
In the following, we will consider the main three areas of impact on porting applications to grids:
infrastructure issues, data management issues, and application architecture issues. These issues can
have an impact on effort and success of porting, on the resulting performance of the grid application,
and on the user-friendly access to the resources, the grid services, the application, the data, and the final
processing results, among others.

APPLICATIONS AND THE GRID INFRASTRUCTURE


As mentioned before, the successful porting of an application to a grid environment highly depends on
the underlying distributed resource infrastructure. The main services components offered by a grid infrastructure are security, resource management, information services, and data management. Bart Jacob
et al. suggest that each of these components can affect the application architecture, its design, deployment, and performance. Therefore, the user has to go through the process of matching the application
(structure and requirements) with those components of the grid infrastructure, as described here, closely
following the description in Jacob et al. (2003).

Applications and Security


The security functions within the grid architecture are responsible for the authentication and authorization
of the user, and for the secure communication between the grid resources. Fortunately, these functions
are an inherent part of most grid infrastructures and don't usually affect the applications themselves,
provided the user (and thus the user's application) is authorized to use the required resources. Also,
security from an application point of view might be taken into account in the case that sensitive data is
passed to a resource to be processed by a job and is written to the local disk in a non-encrypted format,
and other users or applications might have access to that data.

Applications and Resource Management


The resource management component provides the facilities to allocate a job to a particular resource,
provides a means to track the status of the job while it is running and its completion information, and
provides the capability to cancel a job or otherwise manage it. In conjunction with Monitoring and
Discovery Service (described below) the application must ensure that the appropriate target resource(s)
are used. This requires that the application accurately specifies the required environment (operating
system, processor, speed, memory, and so on). The more the application developer can do to eliminate
specific dependencies, the better the chance that an available resource can be found and that the job will
complete. If an application includes multiple jobs, the user must understand (and maybe reduce) their
interdependencies. Otherwise, logic has to be built to handle items such as inter-process communication,
sharing of data, and concurrent job submissions. Finally, the job management provides mechanisms to
query the status of the job as well as perform operations such as canceling the job. The application may
need to utilize these capabilities to provide feedback to the user or to clean up or free up resources when
required. For instance, if one job within an application fails, other jobs that may be dependent on it may
need to be cancelled before needlessly consuming resources that could be used by other jobs.
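As a hypothetical sketch of this last point, the snippet below propagates a failure through a job-dependency graph and cancels every job that transitively depends on the failed one; the job names are illustrative, and cancel_job stands in for whatever cancellation call the grid middleware actually offers.

```python
dependencies = {                      # job -> jobs it depends on (illustrative)
    "mesh": [],
    "solve": ["mesh"],
    "postprocess": ["solve"],
    "report": ["postprocess", "solve"],
}

def dependents_of(failed, deps):
    """All jobs that directly or indirectly depend on the failed job."""
    out, frontier = set(), {failed}
    while frontier:
        frontier = {j for j, needs in deps.items() if frontier & set(needs)} - out
        out |= frontier
    return out

def on_job_failure(failed, deps, cancel_job):
    for job in dependents_of(failed, deps):
        cancel_job(job)               # free resources before they are consumed needlessly

on_job_failure("solve", dependencies, cancel_job=lambda j: print("cancelling", j))
```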

Applications and Resource Information Services


An important part of the process of grid-enabling an application is to identify the appropriate (if not
optimal) resources needed to run the application, i.e. to submit the respective job to. The service which
maintains and provides the knowledge about the grid resources is the Grid Information Service (GIS),
also known as the Monitoring and Discovery Service (e.g., MDS in Globus). MDS provides access
to static and dynamic information of resources. Basically, it contains the following components:

•	Grid Resource Information Service (GRIS): the repository of local resource information derived
from information providers.
•	Grid Index Information Service (GIIS): the repository that contains indexes of resource information registered by the GRIS and other GIISs.
•	Information providers: translate the properties and status of local resources to the format defined
in the schema and configuration files.
•	MDS client: initially performs a search for information about resources in the grid
environment.

Resource information is obtained by the information provider and it is passed to GRIS. GRIS registers its local information with the GIIS, which can optionally also register with another GIIS, and so
on. MDS clients can query the resource information directly from GRIS (for local resources) and/or a
GIIS (for grid-wide resources).
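To illustrate only the information flow just described (providers feed a GRIS, a GRIS registers with a GIIS, and clients query either level), here is a small self-contained sketch; it is not the Globus MDS API, and the resource attribute names are made up.

```python
class GRIS:
    """Toy stand-in for a local Grid Resource Information Service."""
    def __init__(self, site):
        self.site = site
        self.resources = {}                       # resource name -> attributes

    def publish(self, name, attributes):          # fed by information providers
        self.resources[name] = attributes

    def query(self, **required):
        return {n: a for n, a in self.resources.items()
                if all(a.get(k) == v for k, v in required.items())}

class GIIS:
    """Toy stand-in for an index service aggregating registered GRISs/GIISs."""
    def __init__(self):
        self.members = []

    def register(self, index):
        self.members.append(index)

    def query(self, **required):                  # grid-wide query: fan out to members
        hits = {}
        for index in self.members:
            hits.update(index.query(**required))
        return hits

gris = GRIS("site-a")
gris.publish("node01", {"os": "linux", "cpus": 8})
gris.publish("node02", {"os": "windows", "cpus": 4})
giis = GIIS()
giis.register(gris)
print(giis.query(os="linux"))                     # an MDS-client-style grid-wide search
```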
It is important to fully understand the requirements for a specific job so that the MDS query can be
correctly formatted to return resources that are appropriate. The user has to ensure that the proper information is in MDS. There is a large amount of data about the resources within the grid that is available
by default within the MDS. However, if the application requires special resources or information that is
not there by default, the user may need to write her own information providers and add the appropriate
fields to the schema. This may allow the application or broker to query for the existence of the particular
resource/requirement.

Applications and Data Management


Data management is concerned with collectively maximizing the use of the limited storage space, networking bandwidth, and computing resources. Within the application, data requirements are built
in which determine how data will be moved around the infrastructure or otherwise accessed in a secure
and efficient manner. Standardizing on a set of grid protocols allows communication with any
data source that is available within the software design. Data-intensive applications especially often
have a federated database to create a virtual data store, or other options including storage area networks, network file systems, and dedicated storage servers. Middleware like the Globus Toolkit provides
GridFTP and Global Access to Secondary Storage (GASS) data transfer utilities in the grid environment. The
GridFTP facility (extending the FTP File Transfer Protocol) provides secure and reliable data transfer
between grid hosts.
Developers and users face a few important data management issues that need to be considered in application design and implementation. For large datasets, for example, it is not practical and may be impossible to move the data to the system where the job will actually run. Using data replication or otherwise
copying a subset of the entire dataset to the target system may provide a solution. If the grid resources are
geographically distributed with limited network connection speeds, design considerations around slow
or limited data access must be taken into account. Security, reliability, and performance become an issue
when moving data across the Internet. When data access may be slow or prevented, one has to build
the required logic to handle this situation. To assure that the data is available at the appropriate location
by the time the job requires it, the user should schedule the data transfer in advance. One should also be
aware of the number and size of any concurrent transfers to or from any one resource at the same time.
Besides the above-described main requirements for applications to run efficiently on a grid infrastructure, there are a few more issues, discussed in Jacob (2003), such as scheduling, load
balancing, grid brokers, inter-process communication, and portals for easy access, as well as non-functional
requirements such as performance, reliability, topology aspects, and consideration of mixed-platform
environments.

The Simple API for Grid Applications (SAGA)


Among the many efforts in the grid community to develop tools and standards which simplify the porting
of applications to Grids by enabling the application to make easy use of the Grid middleware services
as described above, one of the more predominant ones is SAGA, a high-level Application Programmer's
Interface (API), or programming abstraction, defined by the Open Grid Forum (OGF, 2008), an international committee that coordinates standardization of Grid middleware and architectures. SAGA intends
to simplify the development of grid-enabled applications, even for scientists without any background in
computer science or grid computing. Historically, SAGA was influenced by the work on the GAT Grid
Application Toolkit, a C-based API developed in the EU-funded project GridLab (GAT, 2005). The purpose of SAGA is two-fold:
1. Provide a simple API that can be used with much less effort compared to the interfaces of existing
grid middleware.
2. Provide a standardized, portable, common interface for the various grid middleware systems.

According to Goodale (2008) SAGA facilitates rapid prototyping of new grid applications by allowing
developers a means to concisely state very complex goals using a minimum amount of code.
SAGA provides a simple, POSIX-style API to the most common Grid functions at a sufficiently high level of abstraction so as to be independent of the diverse and dynamic Grid environments. The
SAGA specification defines interfaces for the most common Grid-programming functions grouped as a
set of functional packages. Version 1.0 (Goodale, 2008) defines the following packages:


•	File package - provides methods for accessing local and remote file systems, browsing directories,
moving, copying, and deleting files, setting access permissions, as well as zero-copy reading and
writing.
•	Replica package - provides methods for replica management, such as browsing logical file systems, moving, copying, and deleting logical entries, adding and removing physical files from a logical
file entry, and searching for logical files based on attribute sets.
•	Job package - provides methods for describing, submitting, monitoring, and controlling local and
remote jobs. Many parts of this package were derived from the largely adopted DRMAA [11]
specification (see the sketch after this list).
•	Stream package - provides methods for authenticated local and remote socket connections with
hooks to support authorization and encryption schemes.
•	RPC package - is an implementation of the OGF GridRPC API definition and provides methods
for unified remote procedure calls.
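To give a feel for the level of abstraction the job package targets, the sketch below mimics its describe/submit/monitor pattern in Python. The class and attribute names are local stand-ins defined here for illustration; they are not the actual SAGA API, whose exact bindings differ between implementations and language mappings, and the stub simply runs the job locally.

```python
import subprocess

class JobDescription:
    """Stand-in for a SAGA-style job description (names are illustrative)."""
    def __init__(self, executable, arguments=(), working_directory="."):
        self.executable = executable
        self.arguments = list(arguments)
        self.working_directory = working_directory

class Job:
    def __init__(self, description):
        self.description = description
        self._proc = None

    def run(self):
        self._proc = subprocess.Popen(
            [self.description.executable] + self.description.arguments,
            cwd=self.description.working_directory)

    def wait(self):
        return self._proc.wait()

class JobService:
    """Stand-in for a SAGA-style job service.  Here it runs jobs locally;
    a real implementation would dispatch to the selected grid middleware."""
    def __init__(self, resource_url="fork://localhost"):
        self.resource_url = resource_url

    def create_job(self, description):
        return Job(description)

# The application only manipulates descriptions, services, and jobs, and never
# talks to the middleware directly (assumes a Unix-like system for /bin/echo).
jd = JobDescription("/bin/echo", ["hello", "grid"])
job = JobService("fork://localhost").create_job(jd)
job.run()
print("exit status:", job.wait())
```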

The two critical aspects of SAGA are its simplicity of use and the fact that it is well on the road to
becoming a community standard. It is important to note that these two properties provide the added
value of using SAGA for Grid application development. Simplicity arises from being able to limit the
scope to only the most common and important grid functionality required by applications. There are major
advantages arising from its simplicity and imminent standardization. Standardization reflects the fact
that the interface is derived from a wide range of applications using a collaborative approach, the
output of which is endorsed by the broader community.
More information about the SAGA C++ Reference Implementation (developed at the Center for
Computation and Technology at the Louisiana State University) and various aspects of grid enabling
toolkits is available on the SAGA implementation home page (SAGA, 2006). It also provides additional
information with regard to different aspects of grid enabling toolkits.

GRID APPLICATIONS AND DATA


Any e-science application at its core has to deal with data, from input data (e.g. in the form of output data
from sensors, or as initial or boundary data), to processing data and storing of intermediate results, to
producing final results (e.g. data used for visualization). Data has a strong influence on many aspects of
the design and deployment of an application and determines whether a grid application can be successfully
ported to the grid. Therefore, in the following, we present a brief overview of the main data management
related aspects, tasks and issues which might affect the process of grid-enabling an application, such as
data types and size, shared data access, temporary data spaces, network bandwidth, time-sensitive data,
location of data, data volume and scalability, encrypted data, shared file systems, databases, replication,
and caching. For a more in-depth discussion of data management related tasks, issues, and techniques,
we refer to Bart Jacobs tutorial on application enabling with Globus (Jacob, 2003).

Shared Data Access


Data may be accessed concurrently by several jobs and by other processes within the network. Access to the input and output data of the jobs can be of various kinds. During the planning and design of the grid application, potential restrictions on the access of databases, files, or other data stores, for either read or write, have to be considered. The established policies need to be observed and sufficient access rights have to be granted to the jobs. Concerning the availability of data in shared resources,
it must be assured that at run-time of the individual jobs the required data sources are available in the
appropriate form and at the expected service level. Potential data access conflicts need to be identified
up front and planned for. Individual jobs should not try to update the same record at the same time, nor deadlock each other. Situations of concurrent access have to be anticipated and appropriate resolution policies imposed.
Federated databases may be useful in data grids where jobs must handle large amounts of data in various different data stores. They offer a single interface to the application and are capable of accessing data in large heterogeneous environments. Federated database systems contain information about the location (node, database, table, record) and access methods (SQL, VSAM, privately defined methods) of the connected data sources. To present a simplified interface to the user (a grid job or other client), a request should not need to name the data source explicitly; instead, a discovery service determines the relevant data source and access method.
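As a rough illustration of this idea, the toy sketch below resolves a logical data-set name into a physical location and access method, much as a federated database or discovery service would. All names and the catalog structure are hypothetical.

```python
# Hypothetical discovery client: maps a logical data-set name to its
# physical location and access method, as a federated database would.
CATALOG = {
    "crash_test_results": {"node": "db1.example.org",
                           "database": "simdata",
                           "table": "runs",
                           "method": "SQL"},
}

def locate(logical_name):
    """Return location and access method for a logical data source."""
    entry = CATALOG.get(logical_name)
    if entry is None:
        raise KeyError("no physical source registered for %r" % logical_name)
    return entry

# A grid job asks only for the logical name; the discovery service resolves
# node, database, table, and access method on its behalf.
source = locate("crash_test_results")
print(source["node"], source["method"])
```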

Data Topology
The size of the data, network bandwidth, and time sensitivity of the data determine the location of data for a grid application. The total amount of data within the grid application may exceed the
amount of data input and output of the grid application, as there can be a series of sub-jobs that produce
data for other sub-jobs. For permanent storage the grid user needs to be able to locate where the required
storage space is available in the grid. Other temporary data sets that may need to be copied from or to
the client also need to be considered.
The amount of data that has to be transported over the network is restricted by available bandwidth.
Less bandwidth requires careful planning of the data traffic among the distributed components of a
grid application at runtime. Compression and decompression techniques are useful to reduce the data
amount to be transported over the network. In turn, however, this raises the issue of using consistent compression techniques on all involved nodes; this may exclude the utilization of scavenging for a grid if there are no agreed
standards universally available.
Another issue in this context is time-sensitive data. Some data may have a certain lifetime, meaning
its values are only valid during a defined time period. The jobs in a grid application have to reflect this
in order to operate with valid data when executing. Especially when using data caching or other replication techniques, it has to be assured that the data used by the jobs is up-to-date, at any given point
in time. The order of data processing by the individual jobs, especially the production of input data for
subsequent jobs, has to be carefully observed.
Depending on the job, the authors Jacob et al. (2003) recommend considering the following data-related questions, which refer to input as well as output data of the jobs within the grid application:
• Is it reasonable that each job or set of jobs accesses the data via the network?
• Does it make sense to transport a job or set of jobs to the data location?
• Is there any data access server (for example, implemented as a federated database) that allows access by a job locally or remotely via the network?
• Are there time constraints for data transport over the network, for example, to avoid busy hours and transport the data to the jobs in a batch job during off-peak hours?
• Is there a caching system available on the network to be exploited for serving the same data to several consuming jobs?
• Is the data only available in a unique location for access, or are there replicas that are closer to the executable within the grid?

Data Volume
The ability for a grid job to access the data it needs will affect the performance of the application. When
the data involved is either a large amount of data or a subset of a very large data set, then moving the
data set to the execution node is not always feasible. Some of the considerations as to what is feasible
include the volume of the data to be handled, the bandwidth of the network, and logical interdependencies of the data between multiple jobs.
Data volume issues: In a grid application, transparent access to its input and output data is required.
In most cases the relevant data resides permanently at remote locations and the jobs are likely to process local copies. This access to the data incurs a network cost which must be carefully quantified. Data volume and network bandwidth play an important role in determining the scalability of a
grid application.
Data splitting and separation: Data topology considerations may require the splitting, extraction, or
replication of data from data sources involved. There are two general approaches that are suitable for
higher scalability in a grid application: Independent tasks per job and a static input file for all jobs. In
the case of independent tasks, the application can be split into several jobs that work independently on disjoint subsets of the input data. Each job produces its own output data, and gathering all job results yields the overall output. The scalability of such a solution depends on the time required to transfer the input data, and on the processing time needed to prepare the input data and generate the final result. In this case the input data may be transported to the individual nodes on which the corresponding jobs are to be run. Preloading of the data might be possible, depending on criteria like the timeliness of the data or the size of the separated data subsets in relation to the network bandwidth. In the case of static input files, each job repeatedly works on the same static input data over a long period of time, but with different parameters, generating differing results for each parameter set. A major improvement for
the performance of the grid application may be derived by transferring the input data ahead of time as
close as possible to the compute nodes.
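The independent-tasks approach can be sketched as follows: the input is split into disjoint subsets, each (hypothetical) job processes its own subset, and a final gathering step combines the partial results. In a real grid the run_job calls would be separate jobs submitted to different nodes; here they simply run in a loop for illustration.

```python
def split_input(records, n_jobs):
    """Partition the input records into n_jobs disjoint subsets."""
    return [records[i::n_jobs] for i in range(n_jobs)]

def run_job(subset):
    """Stand-in for the real application working on its own data subset."""
    return sum(subset)               # each job produces its own partial result

records = list(range(1_000))
partials = [run_job(s) for s in split_input(records, n_jobs=4)]   # could run on different grid nodes
final = sum(partials)                # gathering all job outputs yields the overall result
print(final)
```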
Other cases of data separation: More unfavorable cases may appear when jobs have dependencies on each other. The application flow must be carefully checked in order to determine the level of parallelism that can be reached; the number of jobs that can run simultaneously without dependencies is important in this context. For jobs that are not fully independent, synchronization mechanisms need to be in place to handle the concurrent access to the data.
Synchronizing access to one output file: Here all jobs work with common input data and generate their
output to be stored in a common data store. The output data generation implies that software is needed
to provide synchronization between the jobs. Another way to process this case is to let each job generate
individual output files, and then to run a post-processing program to merge all these output files into the
final result. A similar case arises when each job consumes its own individual input data set, but all jobs then produce output data to be stored in a common data set. As described above, the synchronization of the output for the final result can be done through software designed for the task.
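A sketch of the post-processing alternative mentioned above: each job writes its own output file (the file-naming scheme is hypothetical), and a separate merge step produces the final result, so no run-time synchronization on a shared output file is needed.

```python
import glob

# Each grid job wrote its own partial result file (hypothetical naming scheme);
# a post-processing step merges them into the final result instead of having
# all jobs synchronize on a single shared output file.
with open("final_result.dat", "w") as final:
    for part in sorted(glob.glob("job_output_*.dat")):
        with open(part) as f:
            final.write(f.read())
```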
Hence, thorough evaluation of the input and output data for jobs in the grid application is needed to
properly handle it. Also, one should weigh the available data tools, such as federated databases, a data joiner, and related products and technologies, in case the grid application is highly data-oriented or the data has a complex structure.

PORTING AND PROGRAMMING GRID APPLICATIONS


Besides taking into account the underlying grid resources and the application's data handling, as discussed in the previous two paragraphs, another challenge is the porting of the application program itself. In this context, developers and users face mainly two different approaches when implementing their application on a grid. Either they port an existing application code onto a set of distributed grid resources; often, such an application has previously been developed and optimized with a specific computer architecture in mind, for example mainframes or servers, single- or multiple-CPU vector computers, shared- or distributed-memory parallel computers, or loosely coupled distributed systems like workstation clusters. Or developers start from scratch and design and develop a new application program with the grid in mind, often such that the application architecture and its inherent numerical algorithms are optimally mapped onto the best-suited (set of) resources in a grid.
In both scenarios, the effort of implementing an application can be huge. Therefore, it is important to perform a careful analysis beforehand of: the user requirements for running the application on a grid (e.g. cost, time); the application type (e.g. compute- or data-intensive); the application architecture and algorithms (e.g. explicit or implicit) and the application components and how they interact (e.g. loosely or tightly coupled, or workflows); the best way to map the application onto a grid; and the best-suited grid architecture to run the application with optimal performance. In the following, we summarize the most popular strategies for porting an existing application to a grid, and for designing and developing a new grid application.
Many scientific papers and books deal with the issues of designing, programming, and porting grid
applications, and it is difficult to recommend the best suited among them. Here, we mainly follow the
books from Ian Foster and Carl Kesselman (1999 & 2004), the IBM Redbook (Jacob, 2003), the SURA
Grid Technology Cookbook (SURA, 2007), several research papers on programming models and environments, e.g. Soh (2006), Badia (2003), Karonis (2002), Seymour (2002), Buyya (2000), Venugopal
(2004), Luther (2005), Altintas (2004), and Frey (2005), and our own experience at Sun Microsystems
and MCNC (Gentzsch, 2004), RENCI (Gentzsch, 2007), D-Grid (Gentzsch, 2008, and Neuroth, 2007),
and currently in DEISA-2 (2008).

Grid Programming Models and Environments


Our own experience in porting applications to distributed resource environments is very similar to the one
from Soh et al. (2006) who present a useful discussion on grid programming models and environments
which we briefly summarize in the following. In their paper, they start with differentiating application
porting into resource composition and program composition. Resource composition, i.e. matching the
application to the grid resources needed, has already been discussed in paragraphs 2 and 3 above.
Concerning program composition, there is a wide spectrum of strategies for distributing an application
onto the available grid resources. This spectrum ranges from the ideal situation of simply distributing a
list of, say, n parameters together with n identical copies of that application program onto the Grid, to the
other end of the spectrum where one has to compose or parallelize the program into chunks or components that can be distributed to the grid resources for execution. In the latter case, Soh (2006) differentiates
between implicit parallelism, where programs are automatically parallelized by the environment, and
explicit parallelism which requires the programmer to be responsible for most of the parallelization effort such as task decomposition, mapping tasks to processors and inter-task communication. However,
implicit approaches often lead to non-scalable parallel performance, while explicit approaches often are
complex and work- and time-consuming. In the following we summarize and update the approaches and
methods discussed in detail in Soh (2006):
Superscalar (or StarSs): sequential applications composed of tasks are automatically converted into parallel applications in which the tasks are executed on different parallel resources. The parallelization takes into account the existing data dependences between the tasks, building a dependence graph. The runtime takes care of the task scheduling and data handling between the different resources, and takes into account, among other aspects, the locality of the data. There are several implementations available, like GRID Superscalar (GRIDSs) for computational Grids (Badia, 2003), which is also used in production on the MareNostrum supercomputer at BSC in Barcelona; Cell Superscalar (CellSs) for the Cell processor (Perez, 2007); and SMP Superscalar (SMPSs) for homogeneous multicores or shared-memory machines.
Explicit Communication, such as Message Passing and Remote Procedure Call (RPC). A message-passing example is MPICH-G2 (Karonis, 2002), a Grid-enabled implementation of the Message Passing Interface (MPI), which defines standard functions for communication between processes and groups of processes, extended by the Globus Toolkit. An RPC example is GridRPC, an API for Grids (Seymour, 2002), which offers a convenient, high-level abstraction whereby many interactions with a Grid environment can be hidden.
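As a small illustration of explicit message passing, the sketch below uses the mpi4py Python bindings (MPICH-G2 itself is a C-level MPI implementation; this is only a generic MPI example): every process computes a partial result on its share of the work, and an explicit reduction combines the partial results on rank 0.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process computes a partial result on its share of the work ...
partial = sum(range(rank, 1_000, size))

# ... and the partial results are combined with an explicit reduction.
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print("sum over all processes:", total)
```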
Bag of Tasks, which can be easily distributed on grid resources. An example is the Nimrod-G Broker
(Buyya, 2000) which is a Grid-aware version of Nimrod, a specialized parametric modeling system.
Nimrod uses a simple declarative parametric modeling language and automates the task of formulating,
running, monitoring, and aggregating results. Another example is the Gridbus Broker (Venugopal, 2004)
which gives users transparent access to heterogeneous Grid resources.
Distributed Objects, as in ProActive (2005), a Java-based library that provides an API for the creation, execution, and management of distributed active objects. ProActive is composed of only standard Java classes and requires no changes to the Java Virtual Machine (JVM), allowing Grid applications to be developed using standard Java code.
Distributed Threads, for example Alchemi (Luther, 2005), a Microsoft .NET Grid computing framework, consisting of service-oriented middleware and an application program interface (API). Alchemi
features a simple and familiar multithreaded programming model.
Grid Workflows. Many Workflow Environments have been developed in recent years for Grids, such
as Triana, Taverna, Simdat, P-Grade, and Kepler. Kepler, for example, is a scientific workflow management system along with a set of Application Program Interfaces (APIs) for heterogeneous hierarchical
modeling (Altintas, 2004). Kepler provides a modular, activity oriented programming environment, with
an intuitive GUI to build complex scientific workflows.
Grid Services. An example is the Open Grid Services Architecture (OGSA) (Frey, 2005) which is an
ongoing project that aims to enable interoperability between heterogeneous resources by aligning Grid
technologies with established Web services technology. The concept of a Grid service is introduced as
a Web service that provides a set of well defined interfaces that follow specific conventions. These grid
services can be composed into more sophisticated services to meet the needs of users.

Grid-Enabling Application Programs and Numerical Algorithms


In many cases, restructuring (grid-enabling, decomposing, parallelizing) the core algorithm(s) within a single application program doesn't make sense, especially when a more powerful higher-level grid-enabling strategy is available. This is the case, for example, for parameter jobs (see below), where many identical copies of the application program together with different data sets can easily be distributed onto many grid nodes; where the application program components can be mapped onto a workflow; or where applications (in terms of granularity, run time, spatial dimension, etc.) are simply too small to run efficiently on a grid, so that grid latencies and management overhead become dominant. In other cases, however, where e.g. just one very long run has to be performed, grid-enabling the application program itself can lead to dramatic performance improvements and, thus, time savings. In an effort to better guide the reader through this complex field, in the following we briefly present a few popular application codes and their algorithmic structure and provide recommendations for some meaningful grid-enabling strategies.
General Approach. First, we have to make sure that we gain an important benefit from running our application on a grid. Then we should ask a few more general questions, top-down. Has this code
been developed in-house, or is it a third-party code, developed elsewhere? Will I submit many jobs (as
e.g. in a parameter study), or is the overall application structure a workflow, or is it a single monolithic
application code? In case of the latter, are the core algorithms within the application program of explicit
or of implicit nature? In many cases, grid-enabling those kinds of applications can build on experience gained in the past with parallelizing them for moderately or massively parallel systems, see e.g.
Fox et al. (1994) and Dongarra et al. (2003).
In-house Codes. In case of an application code developed in-house, the source code of this application
is often still available, and ideally the code developers are still around. Then, we have the possibility to
analyze the structure of the code, its components (subroutines), dependencies, data handling, core algorithms, etc. With older codes, sometimes, this analysis has already been done before, especially for the
vector and parallel computer architectures of the 1980s and 1990s. Indeed, some of this knowledge
can be re-used now for the grid-enabling process, and often only minor adjustments are needed to port
such a code to the grid.
Third-Party Codes licensed from so-called Independent Software Vendors (ISVs) cannot be grid-enabled without the support of these ISVs. Therefore, in this case, we recommend contacting the ISV. If the ISV receives similar requests from other customers as well, there is a real chance that the ISV will either provide a grid-enabled code, completely change its sales strategy and sell its software as a service, or develop its own application portal to provide access to the application and the
computing resources. But, obviously, this requires patience and is thus not a solution if you are under
a time constraint.
Parameter Jobs. In science and engineering, often, the application has to run many times: same
code, different data. Only a few parameters have to be modified for each individual job, and at the end
of the many job runs, the results are analyzed with statistical or stochastic methods, to find a certain
optimum. For example, during the design of a new car model, many crash simulations have to be performed, with the aim to find the best-suited material and geometry for a specific part of the wire-frame
model of the car.
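A minimal sketch of such a parameter study, assuming the Python DRMAA bindings and a DRMAA-capable resource manager; the solver path and the parameter values are hypothetical. Each parameter value becomes one independent job, and the statistical post-processing starts only after all jobs have finished.

```python
import drmaa

parameters = [0.1, 0.2, 0.5, 1.0]            # hypothetical design parameters

s = drmaa.Session()
s.initialize()

job_ids = []
for p in parameters:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/path/to/crash_simulation"   # hypothetical solver binary
    jt.args = ["--thickness=%s" % p]                  # same code, different data
    job_ids.append(s.runJob(jt))
    s.deleteJobTemplate(jt)

# Block until every parameter job has finished, then post-process statistically.
s.synchronize(job_ids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
s.exit()
```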
Application Workflows. It is very common in so-called Problem Solving Environments that the
application program consists of a set of components or modules which interact with each other. This can
be modeled in grid workflow environments which support the design and the execution of the workflow representing the application program. Usually, these grid workflow environments contain a middleware
layer which maps the application modules onto the different resources in the grid. Many Workflow Environments have been developed in recent years for Grids, such as Triana (2003), Taverna (2008), Simdat
(2008), P-Grade (2003), and Kepler (Altintas, 2004). One application which is well suited for such a
workflow is climate simulation. Today's climate codes consist of modules for simulating the weather on
the continent with mesoscale meteorology models, and include other modules for taking into account
the influence from ocean and ocean currents, snow and ice, sea ice, wind, clouds and precipitation, solar
and terrestrial radiation, absorption, emission, and reflection, land surface processes, volcanic gases
and particles, and human influences. Interactions happen between all these components, e.g. air-ocean,
air-ice, ice-ocean, ocean-land, etc. resulting in a quite complex workflow which can be mapped onto
the underlying grid infrastructure.
Highly Parallel Applications. Amdahl's Law states that the scalar portion of a parallel program becomes a dominant factor as the processor number increases, leading to a loss in application scalability with a growing number of processors. Gustafson (1988) showed that this holds only for fixed problem size, and that in practice, with an increasing number of processors, the user increases the problem size as well, always trying to solve the largest possible problem on any given number of CPUs. Gustafson demonstrated this on a 1024-processor parallel system, for several applications. For example, he was able to achieve a speed-up factor of over 1000 for a Computational Fluid Dynamics application with 1024 parallel processes on the 1024-processor system. Porting these highly parallel applications to a grid, however, has shown that many of them degrade in performance simply because the communication overhead of message-passing operations (e.g. send and receive) grows from a few microseconds on a tightly-coupled parallel system to a few milliseconds on a (loosely-coupled) workstation cluster or grid. In this case, therefore, we recommend implementing a coarse-grain Domain Decomposition approach, i.e. dynamically partitioning the overall computational domain into sub-domains (each consisting of as many parallel processes, volumes, or finite elements as possible), such that each sub-domain completely fits onto the available processors of the corresponding parallel system in the grid. Thus, only moderate performance degradation from the reduced amount of inter-system communication can be expected. A
prerequisite for this to work successfully is that the subset of selected parallel systems is of homogeneous
nature, i.e. architecture and operating system of these parallel systems should be identical. One Grid
infrastructure which offers this feature is the Distributed European Infrastructure for Supercomputing
Applications (DEISA, 2008), which (among others) provides a homogeneous cluster of parallel AIX
machines distributed over several of the 11 European supercomputing centers which are part of DEISA
(see also Section 5 in this Chapter).
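The coarse-grain idea can be sketched as follows: a one-dimensional computational domain (purely illustrative) is partitioned into sub-domains whose sizes are proportional to the processor counts of the selected homogeneous parallel systems, so that each sub-domain fits completely onto one system and only the sub-domain boundaries require inter-system communication. The processor counts below are hypothetical.

```python
def decompose(n_cells, system_capacities):
    """Split a 1-D computational domain into sub-domains sized so that each
    fits onto one parallel system of the (homogeneous) grid."""
    total = sum(system_capacities)
    bounds, start = [], 0
    for cap in system_capacities:
        length = round(n_cells * cap / total)
        bounds.append((start, min(start + length, n_cells)))
        start += length
    # Give any rounding remainder to the last sub-domain.
    bounds[-1] = (bounds[-1][0], n_cells)
    return bounds

# Hypothetical example: three parallel systems with 512, 1024, and 512 processors.
print(decompose(100_000, [512, 1024, 512]))
# -> [(0, 25000), (25000, 75000), (75000, 100000)]
```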
Moderately Parallel Applications. These applications, which have been parallelized in the past,
often using Message Passing MPI library functions for the inter-process communication on workstation
clusters or on small parallel systems, are well-suited for parallel systems with perhaps a few dozen to
a few hundred processors, but they won't scale easily to a large number of parallel processes (and processors). Reasons are a significant scalar portion of the code which can't run in parallel and/or a relatively high ratio of inter-process communication to computation, resulting in relatively high idle times of the CPUs waiting for the data. Many commercial codes fall into this category, for example finite-element codes such as Abaqus, Nastran, or Pamcrash. Here we recommend checking whether the main goal is to analyze many similar scenarios with one and the same code but on different data sets, and if so, to run as many code instances in parallel as possible, on as many moderately parallel sub-systems as possible (these could be virtualized sub-systems on one large supercomputer, for example).

Explicit vs. Implicit Algorithms. Discrete Analogues of systems of partial differential equations,
stemming from numerical methods such as finite difference, finite volume, or finite element discretizations, often result in large sets of explicit or implicit algebraic equations for the unknown discrete variables
(e.g. velocity vectors, pressure, temperature). The explicit methods are usually slower (in convergence
to the exact solution vector of the algebraic system) than the implicit ones but they are also inherently
parallel, because there is no dependence of the solution variables among each other, and therefore there
are no recursive algorithms. In case of the more accurate implicit methods, however, solution variables
are highly inter-dependent, leading to recursive sparse-matrix systems of algebraic equations which cannot easily be split (parallelized) into smaller systems. Again, here, we recommend introducing a Domain Decomposition approach as described in the above section on Highly Parallel Applications: solve an implicit sparse-matrix system within each domain, and bundle sets of neighboring domains into super-sets to submit to the (homogeneous) grid.
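The difference can be made concrete with a schematic one-dimensional diffusion step (illustrative only, using NumPy): the explicit update touches each point using only old neighbor values and is therefore trivially parallel, while the implicit update couples all unknowns and requires solving a (here tridiagonal) system, which resists naive splitting.

```python
import numpy as np

n, dt, dx = 100, 2.5e-5, 1e-2
alpha = dt / dx**2                     # 0.25, schematic stability choice
u = np.random.rand(n)

# Explicit update: every new value depends only on *old* neighbor values,
# so all n points can be updated in parallel with no recursion.
u_new = u.copy()
u_new[1:-1] = u[1:-1] + alpha * (u[2:] - 2 * u[1:-1] + u[:-2])

# Implicit update: the new values are coupled, so a sparse (here tridiagonal)
# system has to be solved; this is the part that cannot easily be split.
A = (np.eye(n) * (1 + 2 * alpha)
     - np.eye(n, k=1) * alpha
     - np.eye(n, k=-1) * alpha)
u_impl = np.linalg.solve(A, u)
```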
Domain Decomposition. This has been discussed in the paragraphs on Highly Parallel Applications
and on Explicit vs. Implicit Algorithms.
Job Mix. Last but not least, one of the most trivial but most widely used scenarios often found in
university and research computer centers is the general job mix, stemming from hundreds or thousands
of daily users, with hundreds or even thousands of different applications, with varying requirements
for computer architecture, data handling, memory and disc space, timing, priority, etc. This scenario is
ideal for a grid which is managed by an intelligent Distributed Resource Manager (DRM), for example
GridWay (2008) for a global grid, Sun Grid Engine Enterprise Edition (Chaubal, 2003) for an enterprise
grid, or the open source Grid Engine (2001) for a departmental grid or a simple cluster. These DRMs are
able to equally balance the overall job load across the distributed resource environment and submit the
jobs always to the best suited and least loaded resources. This can result in overall resource utilization
of 90% and higher.

Applications and Grid Portals


Grid portals are an important part of the process of grid-enabling, composing, manipulating, running,
and monitoring applications. After all the lower layers of the grid-enabling process have been performed
(described in the previous paragraphs), often, the user is still exposed to the many details of the grid
services and even has to take care of configuring, composing, provisioning, etc. the application and
the services by hand. This however can be drastically simplified and mostly hidden from the user
through a Grid portal, which is a Web-based portal able to expose Grid services and resources through
a browser to allow users remote, ubiquitous, transparent and secure access to grid services (computers,
storage, data, applications, etc). The main goal of a Grid portal is to hide the details and complexity of
the underlying Grid infrastructure from the user in order to improve usability and utilization of the Grid,
greatly simplifying the use of Grid-enabled applications through a user-friendly interface.
Grid portals have become popular in the research and industry communities. Using Grid portals,
computational and data-intensive applications such as genomics, financial modeling, crash test analysis,
oil and gas exploration, and many more, can be provided over the Web as traditional services. Examples
of existing scientific application portals are the GEONgrid (2008) and CHRONOS (2004) portals that
provide a platform for the Earth Science community to study and understand the complex dynamics
of Earth systems; the NEESGrid project (2008) focuses on earthquake engineering research; the BIRN
portal (2008) targets biomedical informatics researchers; and the MyGrid portal (2008) provides access to bioinformatics tools running on a back-end Grid infrastructure. As it turns out, scientific portals are
usually being developed inside specific research projects. As a result they are specialized for specific
applications and services satisfying project requirements for that particular research application area.
In order to rapidly build customized Grid portals in a flexible and modular way, several more generic
toolkits and frameworks have been developed. These frameworks are designed to meet the diverse needs
and usage models arising from both research and industry. One of these frameworks is EnginFrame,
which simplifies development of highly functional Grid portals exposing computing services that run on
a broad range of different computational Grid systems. EnginFrame (Beltrame, 2006) has been adopted
by many industrial companies, and by organizations in research and education.

Example: The EnginFrame Portal Environment


EnginFrame (2008) is a Web-based portal technology that enables the access and the exploitation of
grid-enabled applications and infrastructures. It allows organizations to provide application-oriented
computing and data services to both users (via Web browsers) and in-house or ISV applications (via
SOAP/WSDL based Web services), thus hiding the complexity of the underlying Grid infrastructure.
Within a company or department, an enterprise portal aggregates and consolidates the services and exposes them to the users, through the Web. EnginFrame can be integrated as Web application in a J2EE
standard application server or as a portlet in a JSR168 compliant portlet container.
As a Grid portal framework, EnginFrame offers a wide range of functionalities to IT developers facing the task of providing application-oriented services to end users. EnginFrame's plug-in mechanism allows its set of functionalities and services to be extended easily and dynamically. A plug-in is a self-contained software bundle that encapsulates XML (Extensible Markup Language) service descriptions, custom layout in XSL (Extensible Stylesheet Language), and the scripts or executables involved in the services' actions. A flexible authentication delegation offers a wide set of pre-configured authentication mechanisms: OS/NIS/PAM, LDAP, Microsoft Active Directory, MyProxy, Globus, etc. It can also be extended through the plug-in mechanism.
Besides authentication, EnginFrame provides an authorization framework that allows defining groups of users and Access Control Lists (ACLs), and binding ACLs to resources, services, service parameters, and service results. The Web interface of the services provided by the portal can be authorized and thus tailored to the specific users' roles and access rights.
EnginFrame supports a wide variety of compute Grid middleware like LSF, PBS, Sun Grid Engine,
Globus, gLite and others. An XML virtualization layer invokes specific middleware commands and
translates results, jobs and Grid resource descriptions into a portable XML format called GridML that
abstracts from the actual underlying Grid technology. For the GridML, as for the service description
XML, the framework provides pre-built XSLs to translate GridML into HTML. EnginFrame data management allows browsing and handling of data residing on the client side or archived remotely in the Grid, and hosts each service's working environment in file-system areas called spoolers.
The EnginFrame architecture is structured into three tiers: Client, Resource, and Server. The Client Tier normally consists of the user's Web browser and provides an easy-to-use interface based on established Web standards like XHTML and JavaScript; it is independent of the specific software and hardware
environment used by the end user. When needed, the client tier also provides integration with desktop
virtualization technologies like Citrix Metaframe (ICA), VNC, X, and Nomachine NX. The Resource
Tier consists of one or more Agents deployed on the back-end Grid infrastructure whose role is to control and provide distributed access to the actual computing resources. The Server Tier consists of a server component that provides resource brokering to manage resource activities in the back-end.
component that provides resource brokering to manage resource activities in the back-end.
The EnginFrame server authenticates and authorizes incoming requests from the Web, and asks an
Agent to execute the required actions. Agents can perform different kinds of actions, ranging from the execution of a simple command on the underlying operating system to the submission of a job to the Grid. The results of the executed action are gathered by the Agent and sent back to the Server, which applies post-processing transformations, filters the output according to ACLs, and transforms the results
into a suitable format according to the nature of the client: HTML for Web browsers and XML in a SOAP
message for Web services client applications.

CASE STUDY: APPLICATIONS ON THE DEISA INFRASTRUCTURE


As one example, in the following we discuss DEISA, the Distributed European Infrastructure for Supercomputing Applications. DEISA (2008) is different from many other Grid initiatives which aim at building a general-purpose grid infrastructure and therefore have to cope with many (almost) insurmountable barriers, such as complexity, resource sharing, crossing administrative (and even national) domains, handling IP and legal issues, dealing with sensitive data, working on interoperability, and facing the need to expose every little detail of the underlying infrastructure services to the grid application user. DEISA avoids most of these barriers by staying very focused: its main aim is to provide the European supercomputer user with a flexible, dynamic, user-friendly supercomputing ecosystem (one could say a "Supercomputing Cloud", see next paragraph) for the easy handling, submitting, and monitoring of long-running jobs on the best-suited and least-loaded supercomputer(s) in Europe, trying to avoid the
just mentioned barriers. In addition, DEISA offers application-enabling support. For a similar European
funded initiative especially focusing on enterprise applications, we refer the reader to the BEinGRID
project (2008), which consists of 18 so-called business experiments each dealing with a pilot application that addresses a concrete business case, and is represented by an end-user, a service provider, and
a Grid service integrator. Experiments come from key business sectors such as multimedia, finance, engineering, chemistry, gaming, environmental science, logistics, and so on, based on different Grid
middleware solutions, see (BEinGRID, 2008).

The DEISA Project


DEISA is the Distributed European Infrastructure for Supercomputing Applications, funded by the EU in Framework Program 6 (DEISA1, 2004-2008) and Framework Program 7 (DEISA2, 2008-2011). The DEISA Consortium consists of 11 partners, MPG-RZG (Germany, consortium lead), BSC (Spain), CINECA (Italy), CSC (Finland), ECMWF (UK), EPCC (UK), FZJ (Germany), HLRS (Germany), IDRIS (France), LRZ (Germany), and SARA (Netherlands), and 3 associated partners, KTH (Sweden), CSCS
(Switzerland), and JSCC (Russia).
DEISA develops and supports a distributed high performance computing infrastructure and a collaborative environment for capability computing and data management. The resulting infrastructure enables
the operation of a powerful Supercomputing Grid built on top of national supercomputing services,
facilitating Europe's ability to undertake world-leading computational science research. DEISA is certainly instrumental in advancing computational sciences in scientific and industrial disciplines within
Europe and is paving the way towards the deployment of a cooperative European HPC ecosystem. The
existing infrastructure is based on the coupling of eleven leading national supercomputing centers, using
dedicated network interconnections (currently 10 Gbit/s) of GÉANT2 and the NRENs.
DEISA2 develops activities and services relevant for applications enabling, operation, and technologies, as these are indispensable for the effective support of computational sciences in the area of
supercomputing. The service provisioning model is extended from one that supports a single project (in
DEISA1) to one supporting Virtual European Communities (now in DEISA2). Collaborative activities
will be carried out with new European and other international initiatives. Of strategic importance is the
cooperation with the PRACE (2008) initiative which is preparing for the installation of a limited number
of leadership-class Tier-0 supercomputers in Europe.

The DEISA Infrastructure Services


The essential services to operate the infrastructure and support its efficient usage are organized in the
three Service Activities: Operations, Technologies, and Applications:
Operations refer to operating the infrastructure including all existing services, adopting approved new
services from the Technologies activity, and advancing the operation of the DEISA HPC infrastructure
to a turnkey solution for the future European HPC ecosystem by improving the operational model and
integrating new sites.
Technologies cover monitoring of technologies in use in the project, identifying and selecting
technologies of relevance for the project, evaluating technologies for pre-production deployment, and
planning and designing specific sub-infrastructures to upgrade existing services or deliver new ones
based on approved technologies. User-friendly access to the DEISA Supercomputing Grid is provided
by the DEISA Services for Heterogeneous management Layer (DESHL, 2008) and the Uniform Interface to Computing Resources (UNICORE, 2008).
Applications cover the areas of applications enabling and extreme computing projects, environment- and user-related application support, and benchmarking. Applications enabling focuses on enhancing
scientific applications from the DEISA Extreme Computing Initiative (DECI), Virtual Communities
and EU projects. Environment- and user-related application support addresses the maintenance and
improvement of the DEISA application environment and interfaces, and DEISA-wide user support in
the applications area. Benchmarking refers to the provision and maintenance of a European Benchmark
Suite for supercomputers.
In DEISA2, two Joint Research Activities (JRA) complement the portfolio of service activities. JRA1
(Integrated DEISA Development Environment) aims at an integrated environment for scientific application development, based on a software infrastructure for tools integration, which provides a common
user interface across multiple computing platforms. JRA2 (Enhancing Scalability) aims at the enabling
of supercomputer applications for the efficient exploitation of current and future supercomputers, to
cope with a production infrastructure characterized by an aggressive parallelism on heterogeneous HPC
architectures at a European scale.

DECI: DEISA Extreme Computing Initiative for Supercomputing Applications


The DEISA Extreme Computing Initiative (DECI, 2008) was launched in May 2005 by the DEISA Consortium, as a way to enhance its impact on science and technology. The main purpose
of this initiative is to enable a number of grand challenge applications in all areas of science and
technology. These leading, ground-breaking applications must deal with complex, demanding, and innovative simulations that would not be possible without the DEISA infrastructure, and which benefit
from the exceptional resources provided by the Consortium. The DEISA applications are expected to
have requirements that cannot be fulfilled by the national services alone.
In DEISA2, the single-project oriented activities (DECI) will be qualitatively extended towards
persistent support of Virtual Science Communities. This extended initiative will benefit from and build
on the experiences of the DEISA scientific Joint Research Activities where selected computing needs
of various scientific communities and a pilot industry partner were addressed. Examples of structured
science communities with which close relationships are planned to be established are EFDA and the
European climate community. DEISA2 will provide a computational platform for them, offering integration via distributed services and web applications, as well as managing data repositories.

Applications Adapted to the DEISA Grid Infrastructure


In the following, we describe examples of application profiles and use cases that are well-suited for the
DEISA supercomputing Grid, and that can benefit from the computational resources made available by
the DECI Extreme Computing Initiative.
International collaboration involving scientific teams that access the nodes of the AIX super-cluster in different countries can benefit from a common data repository and a unique, integrated programming and production environment (via common global file systems). Imagine, for example, that team A in France and team B in Germany have allocated resources at IDRIS in Paris and FZJ in Juelich,
respectively. They can benefit from a shared directory in the distributed super-cluster, and for all practical purposes it looks as if they were accessing a single supercomputer.
Extreme computing demands of a challenging project requiring a dominant fraction of a single
supercomputer. Rather than spreading a huge, tightly coupled parallel application on two or more supercomputers, DEISA can organize the management of its distributed resource pool such that it is possible to allocate a substantial fraction of a single supercomputer to this project which is obviously more
efficient that splitting the application and distributing it over several supercomputers.
Workflow applications involving at least two different HPC platforms. Workflow applications are
simulations where several independent codes act successively on a stream of data, the output of one
code being the input of the next one in the chain. Often, this chain of computations is more efficient if
each code runs on the best-suited HPC platform (e.g. scalar, vector, or parallel supercomputers) where
it develops the best performance. Support of these applications via UNICORE (2008), which allows treating the whole simulation chain as a single job, is one of the strengths of the DEISA Grid.
Coupled applications involving more than one platform. In some cases, it does make sense to spread
a complex application over several computing platforms. This is the case of multi-physics, multi-scale
application codes involving several computing modules each dealing with one particular physical phenomenon, and which only need to exchange a moderate amount of data in real time. DEISA has already
developed a few applications of this kind, and is ready to consider new ones, providing substantial support
to their development. This activity is more prospective, because systematic production runs of coupled
applications require a co-allocation service which is currently being implemented.

APPLICATIONS IN THE CLOUD


With increasing demand for higher performance, efficiency, productivity, agility, and lower cost, Information and Communication Technologies (ICT) have for several years been changing dramatically from static silos with manually managed resources and applications towards dynamic virtual environments with automated and shared services, i.e. from silo-oriented to service-oriented architectures.
With sciences and businesses turning global and competitive, applications, products and services
becoming more complex, and research and development teams being distributed, ICT is in transition
again. Global challenges require global approaches: on the horizon, so-called virtual organizations and
partner grids will provide the necessary communication and collaboration platform, with grid portals
for secure access to resources, applications, data, and collaboratories.
One component which will certainly foster this next-generation scenario is Cloud Computing, as
recently offered by companies like Sun (2006) Network.com, IBM (2008), Amazon (2007) Elastic Compute Cloud, and Google (2008) App Engine (see also Google Group, 2008, and CloudCamp, 2008), with many more to come in the near future.
grids, adding a new external dimension of flexibility to them by enhancing their home resource capacity whenever needed, on demand. Existing businesses will use them for their peak demands and for
new projects, service providers will host their applications on them and provide Software as a Service
(SaaS), start-ups will integrate them in their offerings without the need to buy resources upfront, and
setting up new Web 2.0 communities will become very easy.
Cloud-enabling applications will follow strategies similar to those for grid-enabling, as discussed in the
previous paragraphs. Similarly challenging as with Grids, though, are the cultural, mental, legal, and
political aspects in the Cloud context. Building trust and reputation among the users and the providers will
help in some scenarios. But it is currently difficult to imagine that users may easily entrust their corporate
core assets and sensitive data to Cloud service providers. Today (in October 2008) the status of Clouds
seems to be similar to the status of Grids in the early 2000s: a few simple and well-suited application
scenarios run on Clouds, but by far most of the more complex and demanding applications in research
and enterprises will face many barriers on Clouds which still have to be removed, one by one.
One example of an early innovative Cloud system came from Sun when it truly built its SunGrid (2005)
from scratch, based on the vision that "the network is the computer". As with other early technologies in the past, Sun paid a high price for being first and doing all the experiments and the evangelization, but its reputation as an innovator is here to stay. Its successor, Sun Network.com (2008), is very popular among its few die-hard clients. This is because of an easy-to-use technology (Grid Engine, Jini, JavaSpaces), but especially because of its innovative early users, such as CDO2 (2008), and because of
the instant support users get from the Sun team.
A similarly promising example in the future might be DEISA, the Distributed European Infrastructure for Supercomputing Applications, with its DECI, the DEISA Extreme Computing Initiative. Why is DECI
currently so successful in offering millions of supercomputing cycles to the European e-Science community and helping scientists gain new scientific insights? Several reasons, in my opinion: because
DEISA has a very targeted focus on specific (long-running) supercomputing applications and most of
the applications just run on one (best-suited) system; because of its user-friendly access, through technology like DESHL (2008) and UNICORE (2008); because of staying away from those more ambitious general-purpose Grid efforts; because of its coordinating function which leaves the consortium partners (the European supercomputer centers) fully independent; and, similar to network.com, because of ATASKF (DECI, 2008), the application task force of application experts who help the users with porting their applications to the DEISA infrastructure. If all this is here to stay, and the (currently funded)
activities will be taken over by the individual supercomputer centers, DEISA will have a good chance
to exist for a long time, even after the funding runs dry. And then, we might end up with a DEISA
Cloud which will become an (external) HPC node within your Grid application workflow.
With this sea change ahead of us, it will remain strategically important for sciences and businesses to support the work of the Open Grid Forum (OGF, 2008), because only standards will make it possible to build e-infrastructures and grid-enabled applications easily from different components and to transition towards an agile platform for federated services. Standards developed in OGF guarantee, up front, the interoperation of the components best suited for your applications, thus reducing dependency on proprietary building blocks, keeping cost under control, and increasing research and business flexibility.

CONCLUSION: 10 RULES FOR BUILDING A SUSTAINABLE GRID FOR SUSTAINABLE APPLICATIONS

Sustainable grid-enabled applications require sustainable grid infrastructures. It doesn't make any sense, for example, in a three-year funded Grid project, to develop or port a complex application to a Grid which will be shut down after the project ends. We have to make sure that we are able to build sustainable
grid infrastructures which will last for a long time. Therefore, in the following, the author offers his 10
rules for building a sustainable grid, available also from the OGF Thought Leadership (2008). These rules
are derived from mainly four sources: my research on major grid projects published in a RENCI report
(Gentzsch, 2007a), the e-IRG Workshop on A Sustainable Grid Infrastructure for Europe in (Gentzsch,
2007b), the 2nd International Workshop on Campus and Community Grids at OGF20 in Manchester
(McGinnis, 2007), and my personal experience with coordinating the German D-Grid Initiative (D-Grid,
2008). The rules presented here are mainly non-technical, because I believe most of the challenges in
building and operating a grid are in the form of cultural, legal and regulatory barriers.
Rule 1: Identify your specific benefits. Your first thought should be about your users and your
organization. What's in it for them? Identify the benefits which fit best: transparent access to and better
utilization of resources; almost infinite compute and storage capacity; flexibility, adaptability and automation through dynamic and concerted interoperation of networked resources; cost reduction through
utility model; shorter time-to-market because of more simulations at the same time on the grid. Grid
technology helps to adjust an enterprises IT architecture to real business requirements (and not vice
versa). For example, global companies will be able to decompose their highly complex processes into
modular components of a workflow which can be distributed around the globe such that on-demand
availability and access to suitable workforce and resources are assured, productivity increased, and cost
reduced. Application of grid technology in these processes guarantees seamless integration of and communication among all distributed components and provides transparent and secure access to sensitive company information and other proprietary assets, world-wide. Grid computing is especially of great benefit for those research and business groups which cannot afford expensive IT resources. It enables
engineers to remotely access any IT resource as a utility, to simulate any process and any product (and
product life cycle) before it is built, resulting in higher quality, increased functionality, and cost and
risk reduction.

Rule 2: Evangelize your decision makers first. They give you the money and authority for your
grid project. The more they know about the project and the more they believe in it (and in you) the more
money and time you will get, and the easier becomes your task to lead and motivate your team and to
get things done. Present a business case: current deficiencies, specific benefits of the grid (see Rule #1), how much it will cost and how much it will return, etc. The decision makers might also have to modify existing policies, top down, to make it easier for users (and providers) to cope with the challenges of the new services and to accept and use them. For example, why would a researcher (or a department in an enterprise) stop buying computers when money continues to be allocated for buying them? This policy should be changed to
support a utility model instead of an ownership model. If you are building a national grid, for example,
convincing your government to modify its research funding model is a tough task.
Rule 3: Don't re-invent wheels. In the early grid days, many grid projects tried to develop the whole software stack themselves: from the middleware layer, to the software tools, to grid-enabling the applications, to the portal and Web layer, and got into trouble with the next technology change. Today, so many grid technologies, products, and projects exist that you want to start by looking for similar projects, selecting your favorite (successful) ones which best fit your users' needs, and copying what they have built; that will be your prototype. Then, you might still have some time and money left to optimize it
so it fully matches the requirements of your users. Consider, however, that all grids are different. For
example, research grids are mainly about sharing (e.g. sharing resources, knowledge, data), commercial
enterprise grids are about cost and revenue (e.g. TCO, ROI, productivity). Therefore, if your community
is academic, look for academic use cases; if it's commercial, look for commercial use cases in your
respective business field.
Rule 4: KISS (Keep It Simple and Stupid). It took your users years to get acquainted with their current working environment and tools. Ideally, you won't change that. Try hard to stick with what they have
and how they do things. Plan for an incremental approach and lots of time listening and talking. Social
effects dominate in grids. Join forces with the system people to change/modify mainly the lower layers of
the architecture. Your users are your customers, they are king. Differentiate between two groups of users:
the end users who are designing and developing the products (or the research results) which account for
all the earnings of your company (or reputation and therefore funding for your research institute), and
the system experts who are eager to support the end users with the best possible services. Therefore, you
can only succeed if you demonstrate a handful of clear benefits to these two user groups.
Rule 5: Evolution, not revolution. As the saying goes: never change a running system... We all
hate changes in our daily lives, except when we are sure that things will drastically improve. Your users and their applications deeply depend on a reliable infrastructure. So, whenever you have to change
especially the user layer, only change it in small steps and in large time cycles. And, start with enhancing
existing service models moderately, and test suitable utility models first as pilots. And, very important,
part of your business plan has to be an excellent training and communications strategy.
Rule 6: Establish a governance structure. Define clear responsibilities and dependencies for
specific tasks, duties, and people during and after the project. An advisory board should include representatives of your end-users as well as application and system experts. In case of more complex
projects, e.g. consisting of an integration project and several application or community projects, an
efficient management board should lead and steer coordination and collaboration among the projects
and the working groups. The management board (Steering Committee) should consist of leaders of the
sub-projects. Regular face-to-face meetings are very important.
Rule 7: Money, money, money. Don't have unrealistic expectations that grid computing will save you money initially. In their early stage, grid projects need enough funding to get over the early-adopter phase
into a mature state with a rock-solid grid infrastructure such that other user communities can join easily.
In research grids, for example, we estimate this funding phase currently to be in the order of 3-5 years,
with more funding in the beginning for the grid infrastructure, and later more funding for the application
communities. In larger (e.g. global) research grids, funding must cover Teams or Centers of Excellence,
for building, managing, and operating the grid infrastructure, and for middleware tools, application support, and training. Also, today's funding models in research and education are often project-based and thus not ready for a utility approach where resource usage is paid on a pay-as-you-go basis.
Old funding models first have to be adjusted accordingly before a utility model can be introduced successfully. For example, todays existing government funding models are often counter-productive when
establishing new and efficient forms of utility services (see Rule #2). In the long run, grid computing
will save you money through a much more efficient, flexible and productive infrastructure.
Rule 8: Secure some funding for after the end of the project. Continuity, especially for maintenance and support, is extremely important for the sustainability of your grid infrastructure. Make sure
at the beginning of your project that additional funding will be available after the end of the project, to
guarantee service and support and continuous improvement and adjustment of the infrastructure.
Rule 9: Try not to grid-enable your applications in the first place. Adjusting your application to
changing technologies costs a lot of effort and money, and takes a lot of your precious time. Did you
macro-assemble, vectorize, multitask, parallelize, or multithread your application yourself in the past?
Then, grid-enabling that code is relatively easy, as we have seen in this chapter. But doing this from
scratch is not what the user should do. Better to use the money to buy (lease, rent, subscribe to) software
as a service or to hire a few consultants who grid-enable your application and/or (even better) help you
enable your grid architecture to dynamically cope with the applications and user requirements (instead
of vice versa). Today, in grids, we are looking more at chunks of independent jobs (or chunks of transactions). And we let our schedulers and brokers decide how to distribute these chunks onto the best-suited
and least-loaded servers in the grid, or let the servers decide themselves to share the chunks with their
neighbors automatically whenever they become overloaded.
Rule 10: Adopt a human business model. Don't invent new business models. This usually increases
the risk for failure. Learn from the business models we have with our other service infrastructures: water,
gas, telephony, electricity, mass transportation, the Internet, and the World Wide Web. Despite this wide
variety of areas, there is only a handful of successful business models: on one end of the spectrum, you
pay the total price, and the whole thing is yours. Or you pay only a share of it, but pay the other share
on a per usage basis. Or you rent everything, and pay chunks back on a regular basis, like a subscription
fee or leasing. Or you pay just for what you use. Sometimes, however, there are hidden or secondary
applications. For example, electrical power alone doesn't help. It's only useful if it generates something,
e.g. light, or heat, or cold, etc. And this infrastructure is what creates a whole new industry of new appliances: light bulbs, heaters, refrigerators, etc. Back to grids: providing the right (transparent) infrastructure
(services) and the right (simple) business model will most certainly create a new set of services which
most probably will improve our quality of life in the future.


REFERENCES
Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., & Mock, S. (2004). Kepler: an extensible
system for design and execution of scientific workflows. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM), Santorini Island, Greece.
Amazon Elastic Compute Cloud (2007). Retrieved from www.amazon.com/ec2
Badia, R. M., Labarta, J. S., Sirvent, R. L., Perez, J. M., Cela, J. M., & Grima, R. (2003). Programming grid applications with GRID superscalar. Journal of Grid Computing, 1, 151–170. doi:10.1023/B:GRID.0000024072.93701.f3
Baker, S. (2007). Google and the wisdom of clouds. Business Week, Dec. 13. Retrieved from www.
businessweek.com/magazine/content/07_52/b4064048925836.htm
BEinGRID. (2008). Business experiments in grids. Retrieved from www.beingrid.com
Beltrame, F., Maggi, P., Melato, M., Molinari, E., Sisto, R., & Torterolo, L. (2006, February 2-3). SRB
Data grid and compute grid integration via the enginframe grid portal. In Proceedings of the 1st SRB
Workshop, San Diego, CA. Retrieved from www.sdsc.edu/srb/Workshop/SRB-handout-v2.pdf
BIRN. (2008). Biomedical informatics research network. Retrieved from www.nbirn.net/index.shtm
Buyya, R., Abramson, D., & Giddy, J. (2000). Nimrod/G: An architecture for a resource management
and scheduling system in a global computational grid. In Proceedings of the 4th International Conference on High Performance Computing in the Asia-Pacific Region. Retrieved from www.csse.monash.
edu.au/~davida/nimrod/nimrodg.htm
CDO2. (2008). CDOSheet for pricing and risk analysis. Retrieved from www.cdo2.com
Chaubal, Ch. (2003). Sun grid engine, enterprise edition: Software configuration guidelines and use
cases. Sun Blueprints, Retrieved from www.sun.com/blueprints/0703/817-3179.pdf
CloudCamp. (2008). Retrieved from http://www.cloudcamp.com/
D-Grid (2008). Retrieved from www.d-grid.de/index.php?id=1&L=1
DECI. (2008). DEISA extreme computing initiative. Retrieved from www.deisa.eu/science/deci
DEISA. (2008). Distributed European infrastructure for supercomputing applications. Retrieved from
www.deisa.eu
DESHL. (2008). DEISA services for heterogeneous management layer. http://forge.nesc.ac.uk/projects/
deisa-jra7/
Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy, K., Torczon, L., & White, A. (2003). Sourcebook
of parallel computing. San Francisco: Morgan Kaufmann Publishers.
EnginFrame. (2008). Grid and cloud portal. Retrieved from www.nice-italy.com
Foster, I. (2000). Internet computing and the emerging grid. Nature. Retrieved from www.nature.com/
nature/webmatters/grid/grid.html


Foster, I. (2002). What is the Grid? A three point checklist. Retrieved from http://www-fp.mcs.anl.
gov/~foster/Articles/WhatIsTheGrid.pdf
Foster, I., Kesselman, C., & Tuecke, S. (2002). The anatomy of the Grid: Enabling scalable virtual organizations. Retrieved from www.globus.org/alliance/publications/papers/anatomy.pdf
Foster, I., & Kesselman, C. (Eds.). (1999). The Grid: Blueprint for a new computing infrastructure. San
Francisco: Morgan Kaufmann Publishers.
Foster, I., & Kesselman, C. (Eds.). (2004). The Grid 2: Blueprint for a new computing infrastructure.
San Francisco: Morgan Kaufmann Publishers.
Fox, G., Williams, R., & Messina, P. (1994). Parallel computing works! San Francisco: Morgan Kaufmann Publishers.
Frey, J., Mori, T., Nick, J., Smith, C., Snelling, D., Srinivasan, L., & Unger, J. (2005). The open grid
services architecture, Version 1.0. Retrieved from www.ggf.org/ggf_areas_architecture.htm
GAT. (2005). Grid application toolkit. www.gridlab.org/WorkPackages/wp-1/
Gentzsch, W. (2002). Response to Ian Foster's What is the Grid? GRIDtoday, August 5. Retrieved
from www.gridtoday.com/02/0805/100191.html
Gentzsch, W. (2004). Grid computing adoption in research and industry. In A. Abbas (Ed.), Grid computing: A practical guide to technology and applications (pp. 309–340). Florence, KY: Charles River
Media Publishers.
Gentzsch, W. (2004). Enterprise resource management: Applications in research and industry. In I. Foster
& C. Kesselman (Eds.), The Grid 2: Blueprint for a new computing infrastructure (pp. 157–166). San
Francisco: Morgan Kaufmann Publishers.
Gentzsch, W. (2007a). Grid initiatives: Lessons learned and recommendations. RENCI Report. Retrieved
from www.renci.org/publications/reports.php
Gentzsch, W. (Ed.). (2007b). A sustainable Grid infrastructure for Europe, Executive Summary of the
e-IRG Open Workshop on e-Infrastructures, Heidelberg, Germany. Retrieved from www.e-irg.org/
meetings/2007-DE/workshop.html
Gentzsch (2008). Top 10 rules for building a sustainable Grid. In Grid thought leadership series. Retrieved from www.ogf.org/TLS/?id=1
GEONgrid. (2008). Retrieved from www.geongrid.org
Goodale, T., Jha, S., Kaiser, H., Kielmann, T., Kleijer, P., Merzky, A., et al. (2008). A simple API for
Grid applications (SAGA). Grid Forum Document GFD.90. Open Grid Forum. Retrieved from www.
ogf.org/documents/GFD.90.pdf
Google (2008). Google App Engine. Retrieved from http://code.google.com/appengine/
Google Groups. (2008). Cloud computing. Retrieved from http://groups.google.ca/group/cloud-computing


Grid Engine. (2001). Open source project. Retrieved from http://gridengine.sunsource.net/


GridSphere (2008). Retrieved from www.gridsphere.org/gridsphere/gridsphere
GridWay. (2008). Metascheduling technologies for the grid. Retrieved from www.gridway.org/
Gustafson, J. (1988). Reevaluating Amdahl's law. Communications of the ACM, 31, 532–533.
doi:10.1145/42411.42415
Jacob, B., Ferreira, L., Bieberstein, N., Gilzean, C., Girard, J.-Y., Strachowski, R., & Yu, S. (2003).
Enabling applications for Grid computing with Globus. IBM Redbook. Retrieved from www.redbooks.
ibm.com/abstracts/sg246936.html?Open
Jha, S., Kaiser, H., El Khamra, Y., & Weidner, O. (2007, Dec. 10-13). Design and implementation of
network performance aware applications using SAGA and Cactus. 3rd IEEE Conference on eScience
and Grid Computing (pp. 143–150). Bangalore, India.
Karonis, N. T., Toonen, B., & Foster, I. (2003). MPICH-G2: A Grid-enabled implementation of the
message passing interface. [JPDC]. Journal of Parallel and Distributed Computing, 63, 551–563.
doi:10.1016/S0743-7315(03)00002-9
Lee, C. (2003). Grid programming models: Current tools, issues and directions. In F. Berman, G. Fox,
& T. Hey (Eds.), Grid computing (pp. 555–578). New York: Wiley Press.
Luther, A., Buyya, R., Ranjan, R., & Venugopal, S. (2005). Peer-to-peer grid computing and a .NET-based Alchemi framework. In M. Guo (Ed.), High performance computing: Paradigm and infrastructure.
New York: Wiley Press. Retrieved from www.alchemi.net
McGinnis, L., Wallom, D., & Gentzsch, W. (Eds.). (2007). 2nd International Workshop on Campus and
Community Grids. Retrieved from http://forge.gridforum.org/sf/go/doc14617?nav=1
MyGrid. (2008). Retrieved from www.mygrid.org.uk
NEESgrid. (2008). Retrieved from www.nees.org/
Neuroth, H., Kerzel, M., & Gentzsch, W. (Eds.). (2007). German Grid Initiative D-Grid. Göttingen, Germany: Universitätsverlag Göttingen Publishers. Retrieved from www.d-grid.de/index.php?id=4&L=1
OGF. (2008). Open Grid Forum. Retrieved from www.ogf.org
P-GRADE. (2003). Parallel grid run-time and application development environment. Retrieved from
www.lpds.sztaki.hu/pgrade/
Perez, J.M., Bellens, P., Badia, R.M., & Labarta, J. (2007, August). CellSs: Programming the Cell/B.E.
made easier. IBM Journal of R&D, 51(5).
Portal, CHRONOS. (2004). Retrieved from http://portal.chronos.org/gridsphere/gridsphere
PRACE. (2008). Partnership for advanced computing in Europe. Retrieved from www.prace-project.
eu/


Proactive (2005). Proactive manual REVISED 2.2., Proactive, INRIA. Retrieved from http://www-sop.
inria.fr/oasis/Proactive/
Saara Vrt, S. (Ed.). (2008). Advancing science in Europe. DEISA Distributed European Infrastructure for Supercomputing Applications. EU FP6 Project. Retrieved from www.deisa.eu/press/DEISAAdvancingScienceInEurope.pdf
SAGA. (2006). SAGA implementation home page. Retrieved from http://fortytwo.cct.lsu.
edu:8000/SAGA
Seymour, K., Nakada, H., Matsuoka, S., Dongarra, J., Lee, C., & Casanova, H. (2002). Overview of
GridRPC: A remote procedure call API for Grid computing. In Proceedings of the Third International
Workshop on Grid Computing, Baltimore, MD (LNCS 2536, pp. 274–278). Berlin: Springer.
SIMDAT. (2008). Grids for industrial product development. Retrieved from www.scai.fraunhofer.de/
about_simdat.html
Soh, H., Shazia Haque, S., Liao, W., & Buyya, R. (2006). Grid programming models and environments.
In Yuan-Shun Dai, et al. (Eds.), Advanced parallel and distributed computing (pp. 141–173). Hauppauge,
NY: Nova Science Publishers.
Sun Network.com. (2008). Retrieved from www.network.com/
SunGrid. (2005). Sun utility computing. Retrieved from www.sun.com/service/sungrid/
SURA Southeastern Universities Research Association. (2007). The Grid technology cookbook: Programming concepts and challenges. Retrieved from www.sura.org/cookbook/gtcb/
TAVERNA. (2008). The Taverna Workbench 1.7. Retrieved from http://taverna.sourceforge.net/
TRIANA. (2003). The Triana Project. Retrieved from www.trianacode.org/
UNICORE. (2008). UNiform Interface to COmputing Resources. Retrieved from www.unicore.eu/
Venugopal, S., Buyya, R., & Winton, L. (2004). A grid service broker for scheduling distributed data-oriented applications on global grids. In Proceedings of the 2nd Workshop on Middleware for Grid Computing, Toronto, Canada (pp. 75–80). Retrieved from www.Gridbus.org/broker

KEY TERMS AND DEFINITIONS


Cloud Computing: Computing paradigm focusing on provisioning of metered services related to
the use of hardware, software platforms, and applications, billed on a pay-per-use base, and pushed by
vendors such as Amazon, Google, Microsoft, Salesforce, Sun, and others. Accordingly, there are many
different (but similar) definitions (as with Grid Computing).
DECI: The purpose of the DEISA Extreme Computing Initiative (DECI) is to enhance the impact
of the DEISA research infrastructure on leading European science and technology. DECI identifies,
enables, deploys and operates flagship applications in selected areas of science and technology. These
leading, ground breaking applications must deal with complex, demanding, innovative simulations that
would not be possible without the DEISA infrastructure, and which would benefit from the exceptional
resources of the Consortium.
DEISA: The Distributed European Infrastructure for Supercomputing Applications is a consortium
of leading national supercomputing centres that currently deploys and operates a persistent, production
quality, distributed supercomputing environment with continental scope. The purpose of this EU funded
research infrastructure is to enable scientific discovery across a broad spectrum of science and technology,
by enhancing and reinforcing European capabilities in the area of high performance computing. This
becomes possible through a deep integration of existing national high-end platforms, tightly coupled by
a dedicated network and supported by innovative system and grid software.
Grid: A service for sharing computer power and data storage capacity over the Internet, unlike the
Web which is a service just for sharing information over the Internet. The Grid goes well beyond simple
communication between computers, and aims ultimately to turn the global network of computers into
one vast computational resource. Today, the Grid is a work in progress, with the underlying technology still in a prototype phase, and being developed by hundreds of researchers and software engineers
around the world.
Open Grid Forum: The Open Grid Forum is a community of users, developers, and vendors leading
the global standardisation effort for grid computing. OGF accelerates grid adoption to enable business
value and scientific discovery by providing an open forum for grid innovation and developing open
standards for grid software interoperability. The work of OGF is carried out through community-initiated
working groups, which develop standards and specifications in cooperation with other leading standards
organisations, software vendors, and users. The OGF community consists of thousands of individuals in
industry and research, representing over 400 organisations in more than 50 countries.
Globus Toolkit: A software toolkit designed by the Globus Alliance to provide a set of tools for Grid
Computing middleware based on standard grid APIs. Its latest development version, GT4, is based on
standards currently being drafted by the Open Grid Forum.
Grid Engine: An open source batch-queuing and workload management system. Grid Engine is
typically used on a compute farm or compute cluster and is responsible for accepting, scheduling, dispatching, and managing the remote execution of large numbers of standalone, parallel or interactive user
jobs. It also manages and schedules the allocation of distributed resources such as processors, memory,
disk space, and software licenses.
Grid Portal: A Grid Portal provides a single secure web interface for end-users and administrators
to computational resources (computing, storage, network, data, applications) and other services, while
hiding the complexity of the underlying hardware and software of the distributed computing environment. An example is the EnginFrame cluster, grid, and cloud portal which, in DEISA for example, serves
as the portal for the Life Science community.
OGSA: The Open Grid Services Architecture, describes an architecture for a service-oriented grid
computing environment for business and scientific use, developed within the Open Grid Forum. OGSA
is based on several Web service technologies, notably WSDL and SOAP. Briefly, OGSA is a distributed
interaction and computing architecture based around services, assuring interoperability on heterogeneous
systems so that different types of resources can communicate and share information. OGSA has been
described as a refinement of the emerging Web Services architecture, specifically designed to support
Grid requirements.
Web Service: A software system designed to support interoperable machine-to-machine interaction
over a network. It has an interface described in a machine-processable format (specifically WSDL).
Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialisation in conjunction with other Web-related standards.
UNICORE: The Uniform Interface to Computing Resources offers a ready-to-run Grid system
including client and server software. UNICORE makes distributed computing and data resources available in a seamless and secure way in intranets and the internet. The UNICORE project created software
that allows users to submit jobs to remote high performance computing resources without having to
learn details of the target operating system, data storage conventions and techniques, or administrative
policies and procedures at the target site.
Virtual Organization: A group of people with similar interests who primarily interact via communication media such as newsletters, telephone, email, online social networks, etc. rather than face to face,
for social, professional, educational or other purposes. In Grid Computing, a VO is a group who shares
the same computing resources.

ENDNOTE
1
Another version of this chapter was published in the International Journal of Grid and High Performance Computing, Volume 1, Issue 1, edited by Emmanuel Udoh, pp. 55-76, copyright 2009
by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).


Chapter 5

Benchmarking Grid
Applications for Performance
and Scalability Predictions
Radu Prodan
University of Innsbruck, Austria
Farrukh Nadeem
University of Innsbruck, Austria
Thomas Fahringer
University of Innsbruck, Austria

ABSTRACT
Application benchmarks can play a key role in analyzing and predicting the performance and scalability
of Grid applications, serve as an evaluation of the fitness of a collection of Grid resources for running
a specific application or class of applications (Tsouloupas & Dikaiakos, 2007), and help in implementing performance-aware resource allocation policies of real time job schedulers. However, application
benchmarks have been largely ignored due to diversified types of applications, multi-constrained executions, dynamic Grid behavior, and heavy computational costs. To remedy these, the authors present an
approach taken by the ASKALON Grid environment that computes application benchmarks considering
variations in the problem size of the application and machine size of the Grid site. Their system dynamically controls the number of benchmarking experiments for individual applications and manages the
execution of these experiments on different Grid sites. They present experimental results of their method
for three real-world applications in the Austrian Grid environment.

INTRODUCTION
Grid infrastructures provide an opportunity to the scientific and business communities to exploit the
powers of heterogeneous resources in multiple administrative domains under a single umbrella (Foster
& Kesselman, The Grid: Blueprint for a Future Computing Infrastructure, 2004). Proper characterization
DOI: 10.4018/978-1-60566-661-7.ch005
of Grid resources is of key importance in effective mapping and scheduling of the jobs in order to
minimize execution time of complex workflows and utilize maximum power of these resources.
Benchmarking has been used for many years to characterize a large variety of resources ranging from CPU architectures to file systems, databases, parallel systems, internet infrastructures, or
middleware (Dikaiakos, 2007). There have always been issues regarding optimized mapping of jobs
to the Grid resources on the basis of available benchmarks (Tirado-Ramos, Tsouloupas, Dikaiakos,
& Sloot, 2005). Existing Grid benchmarks (or their combinations) do not suffice to measure/predict
application performance and scalability, or to give a quantitative comparison of different Grid sites
for individual applications while taking into account variations in the problem size. In addition, there
are no integration mechanisms and common units available for existing benchmarks to make meaningful inferences about the performance and scalability of individual Grid applications on different
Grid sites.
Application benchmarking on the Grid can provide a basis for users and Grid middleware services
(like meta-schedulers (Berman, et al., 2005) and resource brokers (Raman, Livny, & Solomon, 1999))
for optimized mapping of jobs to the Grid resources by serving as an evaluation of fitness to compare
different computing resources in the Grid. The performance results obtained from real application
benchmarking are much more useful for scheduling these applications on a highly distributed Grid
infrastructure than the regular resource information provided by the standard Grid information services
(Tirado-Ramos, Tsouloupas, Dikaiakos, & Sloot, 2005) (Czajkowski, Fitzgerald, Foster, & Kesselman, 2001). Application benchmarks are also helpful in predicting the performance and scalability of
Grid applications, studying the effects of variations in application performance for different problem
sizes, and gaining insights into the properties of computing nodes architectures.
However, the complexity, heterogeneity, and the dynamic nature of Grids raise serious questions
about the overall realization and applicability of application benchmarking. Moreover, diversified
types of applications, multi-constrained executions, and heavy computational costs make the problem
even harder. Above all, mechanizing the whole process of controlling and managing benchmarking
experiments and making benchmarks available to users and Grid services in an easy and flexible
fashion makes the problem more challenging.
To overcome this situation, we present a three-layered Grid application benchmarking system
that produces benchmarks for Grid applications taking into account the variations in problem size and
machine size of the Grid sites. Our system provides the necessary support for conducting controlled
and reproducible experiments, for computing performance benchmarks accurately, and for comparing and interpreting benchmarking results in the context of application performance and scalability
predictions. It takes the specifications of executables, the set of problem sizes, pre-execution requirements,
and the set of available Grid sites as input in XML format. These XML specifications, along with
the available resources, are parsed to generate jobs to be submitted to different Grid sites. At first, the
system completes pre-experiment requirements like the topological order of activities in a workflow,
and then runs the experiments according to the experimental strategy. The benchmarks are computed
from experimental results and archived in a repository for later use. Performance and scalability prediction and analysis from the benchmarks are available through a graphical user interface and Web
Service Resource Framework (WSRF) (Banks, 2006) service interfaces. We do not require complex
integration/analysis of measurements, or new metrics for interpretation of benchmarking results.
Among our considerations for the design of Grid application benchmarks were conciseness, portability, easy computation and adaptability for different Grid users/services. We have implemented a
prototype of the proposed system as a WSRF service in the context of the ASKALON Grid application development and computing environment (Fahringer, et al., 2006).
The rest of the chapter is organized as follows. The next section presents the Grid resource, application, and execution models that serve as foundation for our work. Then, we summarize the requirements
of a prediction system, followed by a detailed architecture design. Afterwards we present our experimental design method for benchmarking and prediction of Grid applications. Experimental results that
validate our work on real-world applications in a real Grid environment are presented in the second
half of this chapter, followed by a related work summary and an outlook into the future work. The last
section concludes the chapter.

BACKGROUND
In this section we first review the relevant related work in the area of Grid application benchmarking,
and then define the general Grid resource, application, and execution models that represent the foundation for our benchmarking and prediction work.

Related Work
There have been several significant efforts that targeted benchmarking of individual Grid resources
such as (Hockney & Berry, 1994) (Bailey, et al., 1991) (Dixit, 1991) (Dongarra, Luszczek, & Petitet,
2003). The discussion presented in (Van der Wijngaart & Frumkin, 2004) shows that the configuration, administration, and analysis of NAS Grid Benchmarks requires an extensive manual effort like
other benchmarks. Moreover, these benchmarks lack some integration mechanism needed to make
meaningful inferences about the performance of different Grid applications.
A couple of comprehensive tools like (Tsouloupas & Dikaiakos, 2007) are also available for benchmarking a wide range of Grid resources. These provide easy means of archiving and publishing of
results. Likewise, GrenchMark (Iosup & Epema, GRENCHMARK: A Framework for Analyzing, Testing,
and Comparing Grids, 2006) is a framework for analyzing, testing, and comparing Grid settings. Its main focus is the generation and submission of synthetic Grid
workloads. In contrast, our work focuses on single application benchmarks which are extensively
supported.
Individual benchmarks have been successfully used for resource allocation (Afgan, Velusamy, &
Bangalore, 2005) (Jarvis & Nudd, 2005) and application scheduling (Heymann, Fernandez, Senar,
& Salt, 2003). A good work for resource selection is presented in (Jarvis & Nudd, 2005) by building
models from resource performance benchmarks and application performance details. Authors in (Afgan,
Velusamy, & Bangalore, 2005) present a resource filter, resource ranker, and resource MakeMatch on
the basis of benchmarks and user-provided information. Though this work provides good accuracy, it
requires much user intervention during the whole process. Moreover, these benchmarks do not support
cross-platform performance translations of different Grid applications while considering variations in
problem sizes.
A similar work has been presented in (Tirado-Ramos, Tsouloupas, Dikaiakos, & Sloot, 2005). The
authors present a tool for resource selection for different applications while considering variations
in performance due to different machine sizes. Importance of application-specific benchmarks is
also described by (Seltzer, Krinsky, & Smith, 1999). In this work, the authors present three different
methodologies to benchmark Grid applications by modeling application and Grid site information, which
require much manual intervention.
The distinctive part of our work is that we focus on controlling and specifying the total number of
experiments needed for the benchmarking process. Our proposed benchmarks are flexible regarding variations in machine size as well as problem sizes required for real-time scheduling and application performance prediction. Moreover, we support a semi-automatic benchmarking process. The cross-platform
interoperability of our benchmarks allows trade-off analysis and translation of performance information
between different platforms.

Grid Resource Model


We consider the Grid as an aggregation of heterogeneous Grid sites. A Grid site consists of a number of
compute and storage systems that share same local security, network, and resource management policies. Our experimental Grid environment comprises homogeneous parallel computers within a Grid site,
including cache coherent Non-Uniform Memory Architectures (ccNUMA), Clusters of Workstations
(COW), and Networks of desktop Workstations (NOW). Each parallel computer is utilized as a single
computing resource using a local resource management system such as Sun Grid Engine (SGE), Portable
Batch System (PBS) or its Maui and Torque derivatives.
To simplify the presentation and without losing any generality, we assume in the remainder of this
chapter that a Grid site is a homogeneous parallel computer. A heterogeneous Grid consists of an aggregation of homogeneous sites.

Grid Workflow Model


The workflow model based on loosely-coupled coordination of atomic activities has emerged as one
of the most attractive paradigms in the Grid community for programming Grid applications. Despite
this, most existing Grid application development environments provide the application developer with
a nontransparent Grid. Commonly, application developers are explicitly involved in tedious tasks such
as selecting software components deployed on specific sites, mapping applications onto the Grid, or
selecting appropriate computers for their applications. In this section we propose an abstract Grid workflow model that is completely decoupled from the underlying Grid technologies such as Globus toolkit
(Foster & Kesselman, Globus: A Metacomputing Infrastructure Toolkit, 1997) or Web services ((W3C),
World Wide Web Consortium).
We define a workflow as a Directed Acyclic Graph (DAG): W = (Nodes, C-edges, D-edges, IN-ports, OUT-ports), where Nodes is the set of activities, C-edges ⊆ Nodes × Nodes is the set of control flow dependencies (A1, A2), D-edges is the set of data flow dependencies (A1, A2, D-port), IN-ports is the set of workflow input data ports, and OUT-ports is the set of output data ports.
An activity A ∈ Nodes is a mapping from a set of input data ports IN-portsA to a set of output data ports OUT-portsA:

A: IN-portsA → OUT-portsA.
A data port D-port ∈ IN-portsA ∪ OUT-portsA is an association between a unique identifier (within the workflow representation) and a well-defined activity type: D-port = (identifier, type).
The type of a data port is instantiated by the type system supported by the underlying implementation language, e.g. the XML schema. The most important data type in our experience that shall be supported for Grid workflows is file, alongside other basic types such as integer, float, or string.
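As an illustration only (not the AGWL schema or any ASKALON code; all class and field names are our own), the workflow model above could be captured by a small data structure along the following lines:

    # Hypothetical sketch of W = (Nodes, C-edges, D-edges, IN-ports, OUT-ports);
    # names and fields are illustrative assumptions, not the AGWL schema.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass(frozen=True)
    class DataPort:
        identifier: str   # unique within the workflow representation
        type: str         # e.g. "file", "integer", "float", "string"

    @dataclass
    class Activity:
        name: str
        in_ports: List[DataPort] = field(default_factory=list)
        out_ports: List[DataPort] = field(default_factory=list)

    @dataclass
    class Workflow:
        nodes: List[Activity] = field(default_factory=list)
        # control flow dependencies (A1, A2)
        c_edges: List[Tuple[str, str]] = field(default_factory=list)
        # data flow dependencies (A1, A2, D-port)
        d_edges: List[Tuple[str, str, DataPort]] = field(default_factory=list)
        in_ports: List[DataPort] = field(default_factory=list)
        out_ports: List[DataPort] = field(default_factory=list)

The acyclicity check and the distinction between atomic and composite activities, introduced next, are omitted from this sketch for brevity.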
An activity N ∈ Nodes can be of two kinds:
1. Computational activity or atomic activity represents an atomic unit of computation such as a legacy sequential or parallel application;
2. Composite activity is a generic term for an activity that aggregates multiple (atomic and composite) activities according to one of the following four patterns:
a. parallel loop activity allows the user to express large-scale workflows consisting of hundreds
or thousands of atomic activities in a compact manner;
b. sequential loop activity defines repetitive computations with possibly unknown number of
iterations (e.g. dynamic convergence criteria that depend on the runtime output data port
values computed within one iteration);
c. conditional activity models if and switch-like statements that activate one from its multiple
successor activities based on the evaluation of a boolean condition;
d. workflow activity is introduced for modularity and reuse purposes, and is recursively defined
according to this definition.

In the remainder of this chapter we will use the terms activity and application interchangeably. In this
chapter we only deal with the benchmarking and prediction of computational activities, while data transfer
prediction has been addressed in related work such as (Wolski, 2003).
We designed and implemented our approach within the ASKALON Grid application development
and computing environment that allows the specification of workflows according to this model at two
levels of abstraction (see Figure 1):

graphical, based on the standard Unified Modeling Language (UML);
XML-based, using the Abstract Grid Workflow Language (AGWL), which can be automatically
generated from the graphical UML representation.

Grid Execution Model


The XML-based AGWL representation of a workflow represents the input to the ASKALON WSRF-based (Banks, 2006) middleware services for execution on the Grid (see Figure 1). To support this
requirement transparently, a set of sophisticated services whose functionality is not directly exposed to
the end-user is essential:

Resource Manager (Siddiqui, Villazon, Hoffer, & Fahringer, 2005) is responsible for negotiation,
reservation, allocation of resources, and automatic deployment of services required for executing
Grid applications. In combination with AGWL, the Resource Manager shields the user from the
low-level Grid infrastructure;
Figure 1. The ASKALON Grid application development and computing environment architecture

Scheduler (Wieczorek, Prodan, & Fahringer, 2005) determines effective mappings of single or
multiple workflows onto the Grid using graph-based heuristics and single or bi-criteria optimization algorithms such as dynamic programming, game theory, or genetic algorithms;
Performance prediction supports the scheduler with information about the expected execution
time of activities on individual Grid resources. The design and implementation of this service is
the scope of this chapter;
Enactment Engine (Duan, Prodan, & Fahringer, 2006) targets scalable, reliable and fault-tolerant
execution of workflows;
Data repository is a relational database used by the Enactment Engine to log detailed workflow
execution events required for post-mortem analysis and visualization;
Performance analysis (Prodan & Fahringer, 2008) supports automatic instrumentation and bottleneck detection through online monitoring of a broad set of high-level workflow overheads (over
50), systematically organized in a hierarchy comprising job management, control of parallelism,
communication, load imbalance, external load, or other middleware overheads.

PREDICTION REQUIREMENTS
The performance of an application is dependent upon a number of inter-related parameters at different
levels of Grid infrastructure (e.g. Grid site, computing nodes, processor architecture, memory hierarchy,
I/O, storage node, network (LAN or WAN), network topology), as shown in Figure 2 adapted from
(Dikaiakos, 2007). Practically it is almost impossible to characterize the effects of all these individual
components to shape the overall performance behavior of different Grid applications. Even benchmarks
of different resource components cannot be put together to describe application performance because
application properties must also be taken into account (Hockney & Berry, 1994). In such a case, application performance benchmarks with some mechanism of application performance translation across the heterogeneous Grid sites can help to describe its performance in the Grid. Application benchmarks include the effects of different resource components, in particular their combined varying effects specific
to individual applications.

Figure 2. Factors affecting Grid application performance
Our solution to performance prediction is therefore to benchmark scientific applications according
to a well-thought-out experimental design strategy across heterogeneous Grid sites. These benchmarks are
flexible with respect to application problem size and number of processors (machine size) and are thus called soft
benchmarks.
More specifically there is a need for benchmarks, which:

Represent the performance of Grid applications on different Grid sites;
Incorporate the individual effects of different Grid resources specific to different applications (like memory, caching, etc.);
Can be used for performance and scalability predictions of the application;
Are portable to different platforms (De Roure & Surridge, 2003);
Are flexible regarding variations in problem and machine sizes;
Support fast and simplified computation and management;
Are comprehensively understandable and usable by different users and services.

On the other hand, it is also necessary to address the high cost of Grid benchmarking administration,
benchmarking computation, and analysis which requires a comprehensive system with a visualization
and analysis component.


ARCHITECTURE DESIGN
The design of our prediction framework illustrated in Figure 3 consists of a set of tools organized in three
layers that perform and facilitate the benchmarking process (the benchmarking experiments, computation, and storage of results) in a flexible way, and later publish the results and perform analysis.
In the first layer, the application details for benchmarking experiments are specified in an XML-based language, which is parsed by a small compiler that produces the job descriptions in the Globus
Resource Specification Language (RSL) (Foster & Kesselman, Globus: A Metacomputing Infrastructure
Toolkit, 1997). Later, resource specifications identifying where these jobs are to be launched are added
to these job descriptions, producing the final jobs used for executing the benchmark experiments. In this layer,
the total number of benchmarking experiments for individual applications is controlled with respect to
different parameters.
In layer 2, the Experiment Execution Engine executes the benchmark experiments on available Grid
sites provided by the Resource Manager (Siddiqui, Villazon, Hoffer, & Fahringer, 2005). A Grid site
is considered at both micro-level (the individual Grid nodes), as well as macro-level (the entire parallel computer) by taking machine size as a variable in the benchmark measurements. Such application
benchmarks therefore incorporate the variations in application performance associated to different
problem and machine sizes.

Figure 3. The prediction framework architecture


The monitoring component watches the execution of the benchmarking experiments and alerts the
Orchestrator component in layer 3 to collect the data and coordinate the start-up of the Benchmarks
Computation component to compute the benchmarks. The Archive component stores this information in
the benchmarks repository for future use. The Benchmarks Visualization Browser publishes the benchmarks in a graphical user interface for user analysis, and the Information Service component provides an interface
to other services.

EXPERIMENTAL DESIGN
To support the automatic application execution time prediction, benchmarking experiments need to be
made according to some experimental design and the generated data must be archived automatically for later use.
Specifically in our work, the general purpose of the experimental design phase is to set a strategy for
generation and execution of a minimum number of benchmarking experiments for an application to
support its performance prediction later on. Among others, our key objectives for this phase are to:

Reduce/minimize the training phase time;
Minimize/eliminate the heavy modeling requirements after the training phase;
Develop and maintain the efficient scalability of the experimental design with respect to the Grid size;
Make it generalizable to a wide range of applications on heterogeneous Grid sites.

To address these objectives, we develop our experimental design in light of the guidelines given by
Montgomery (Montgomery, 2004):
1. Recognition of and statement of the problem: We describe our problem statement as obtaining maximum execution time information of the application at different problem sizes on all heterogeneous Grid sites, with different possible machine sizes, in a minimum number of experiments;
2. Selection of response variables: In our work the response variable is the execution time of the application;
3. Choice of factors, levels, and ranges: The factors affecting the response variable are the problem size of the application, the Grid size, and the machine size of one parallel computer;
4. Choice/formulation of experimental design: In our experimental design strategy, we minimize first the combinations of Grid size with problem size, and then the combinations of Grid size with machine size. By minimizing the combinations of Grid size with problem size, we minimize the number of experiments against different problem sizes across the heterogeneous Grid sites. Similarly, by minimizing the Grid size combinations with the machine size factor, we minimize the number of experiments against different problem sizes across different numbers of processors. We designed this to eliminate the need for the next two steps presented by Montgomery, called statistical analysis and modeling, and conclusions, in order to minimize the serving costs on the fly;
5. Performing of experiments: We address the performing of experiments as part of the automatic training phase, as described later in this chapter.


Experiment Specification
To describe application specifications we created a small language called Grid Application Description Language (GADL). A GADL definition specifies the application to be executed, its exact paths
retrieved from the Resource Manager, the problem size ranges, and pre-requisites of execution (e.g.
predecessor activities within a workflow, environment variables), if any. More precisely, every GADL
instance is described by:

Application name with a set of problem sizes given either as enumerations or as value ranges
using a start:end:step pattern:

<application name="Wien2k" />


<parameter>
<name="k-points" value="5.0:0.1:9.0">
</parameter>

Resource manager (Siddiqui, Villazon, Hoffer, & Fahringer, 2005) URI used to retrieve the available Grid sites and location of the application executables implementing the activity types:

<resourcemanager>
<location path="http://karwendel.dps.uibk.ac.at:40105/wsrf/services/GlareService/"/>
</resourcemanager>

A set of pre-requisites, comprising the activities which must be finished before the execution (of
some components of) the application:

<prerequisites>
<location path="http://dps.uibk.ac.at:/home/farrukh/pre-reqs/"/>
</prerequisites>

A set of input files required for executing the application:

<inputfile>
<location path="http://dps.uibk.ac.at:/home/farrukh/input.tar/"/>
</inputfile>


An executable needed to change the problem size in some peculiar input files characteristic to
scientific applications:


<probsizechange>
<location path="http://dps.uibk.ac.at:/home/farrukh/probSize/"/>
</probsizechange>
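The snippets above are schematic. As a rough illustration only (our own code and assumed semantics, not the GADL compiler of the first layer), a reader for such a specification might expand the problem-size range and extract the application attributes as follows; note that the chapter states a start:end:step pattern while the example value 5.0:0.1:9.0 reads as start:step:end, which is what this sketch assumes:

    # Hypothetical, simplified GADL-like document and reader.
    import xml.etree.ElementTree as ET

    GADL_EXAMPLE = """
    <gadl>
      <application name="Wien2k"/>
      <parameter name="k-points" value="5.0:0.1:9.0"/>
    </gadl>
    """

    def expand_range(spec):
        # Expand "start:step:end" into the list of problem sizes (assumed semantics).
        if ":" not in spec:
            return [float(spec)]
        start, step, end = (float(x) for x in spec.split(":"))
        count = int(round((end - start) / step)) + 1
        return [round(start + i * step, 6) for i in range(count)]

    root = ET.fromstring(GADL_EXAMPLE)
    application = root.find("application").get("name")
    problem_sizes = expand_range(root.find("parameter").get("value"))
    print(application, len(problem_sizes), "problem sizes")   # prints: Wien2k 41 problem sizes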

Training Experiments
The training or benchmarking phase for an application consists of the experiment executions for different selections of problem and machine sizes on different Grid sites to obtain the corresponding execution times, referred to as the training set or historical data. Automatic performance prediction based on such
historical data needs a sufficient amount of data present in the data repository in order to deliver accurate
results. In general, there is a tradeoff between the number of experiments conducted and the accuracy of
the prediction. The historical data needs to be generated for every new application ported onto the Grid
and/or for every new machine (different from existing machines) added to the Grid.
Conducting automatic benchmarking for application execution time predictions on the Grid is a
complex problem due to the variety of factors involved. More formally, the automatic training phase
benchmarking comprises:

A set of activities of different activity types belonging to a workflow W = (Nodes, C-edges, D-edges, IN-ports, OUT-ports);
A set of distributed heterogeneous Grid sites;
A set of Grid sites or homogeneous parallel computers;
A set PA of different workflow problem sizes for every workflow activity type A.

The total number of experiments N produced by this parameter set is:


N = Σ_{A ∈ Nodes} Σ_{site ∈ Grid} Σ_{m=1}^{|site|} |PA|,

where |PA| denotes the number of problem sizes in the set PA (or the set cardinality) and |site| the number
of processors in a Grid site. The cardinality of Nodes, PA, and site, as well as the number of Grid sites and
CPU types have a significant effect on the number of experiments and, therefore on the overall duration
of the automatic training phase. The goal is to compute a set of experiments such that N is minimized
and the prediction accuracy is maximized.
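For illustration with purely hypothetical numbers: a workflow with three activity types, |PA| = 5 problem sizes per activity, and a Grid of two sites with 8 and 16 processors respectively would already require N = 3 × (5 × 8 + 5 × 16) = 360 experiments for the full cross product, which motivates the reduction strategy described next.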

Performance Sharing and Translation


Computing the full cross product of the parameters involved in the benchmarking process may lead to
a huge number of experiments that cannot be executed exhaustively (or is not necessary). Therefore,
controlling the number of experiments is of key importance in the efficiency of the whole benchmarking process. Our focus is to reduce the total number of benchmarking experiments and to maximize the
utility of benchmarking results.
To reduce the experimental space, we introduce a Performance Sharing and Translation (PST) mechanism based on several multi-parameter performance relativity properties, experimentally observed for


our case study applications. We normalize the execution times against that of a reference problem size,
selected by default as the largest problem size in the set of problem sizes specified by the user, in order to take into account the effect of inter-process communication. The normalization mechanism not only makes the performance
of different machines comparable, but also provides a basis for translating different performance values
across different Grid sites. The normalization of values is based on the observation that for many compute-intensive applications, and in particular the embarrassingly parallel applications that scale linearly
with the machine and problem sizes and that drive our experimental work, the normalized execution
times for different problem and machine sizes are the same on all the Grid sites with 90% accuracy.
This allows cross-platform interoperability. For example, the normalized execution time on a Grid site g for
a certain problem size and machine size will be equal to that on another Grid site h.
We define in the following sections the inter- and intra-platform performance relativity properties.

Inter-Platform PST
Inter-platform PST specifies that the performance behavior Tg(A,p) of an application A for a problem
size p relative to another problem size q on a Grid site g is the same as that of the same problem sizes
on another Grid site h:
Tg(A, p) / Tg(A, q) ≈ Th(A, p) / Th(A, q)

This phenomenon is based on the fact that the rate of change in execution time of an application across
different problem sizes is preserved on different Grid sites, i.e. the rate of change in execution time of an
application A for the problem size p (the target problem size) with respect to another problem size q (the
reference problem size) on Grid site g is equal to the rate of change in execution time for the problem
size p with respect to the problem size q on Grid site h:
ΔTg(A, p) / ΔTg(A, q) ≈ ΔTh(A, p) / ΔTh(A, q)
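As a purely hypothetical numeric illustration: if the reference problem size q took Th(A, q) = 100 seconds and the target size p took Th(A, p) = 250 seconds on site h, and only the reference size was benchmarked on site g with Tg(A, q) = 60 seconds, the inter-platform property suggests Tg(A, p) ≈ (250/100) × 60 = 150 seconds, without ever running problem size p on site g.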

Intra-Platform PST
Similarly, intra-platform PST specifies that the performance behavior of an embarrassingly parallel application A on a Grid site g for a machine size m relative to another machine size n for a problem size p
is similar to that for another problem size q:
Tg(A, p, m) / Tg(A, p, n) ≈ Tg(A, q, m) / Tg(A, q, n)

This phenomenon is based on the fact that the rate of change in execution time of an application across
different problem sizes is preserved for different machine sizes, i.e. the rate of change in execution
time of an application for the problem size p and machine size m on Grid site g with respect to that for
machine size n will be equal to the rate of change in execution time for the problem size q and machine
size m with respect to that for a machine size n on the same Grid site:
ΔTg(A, p, m) / ΔTg(A, p, n) ≈ ΔTg(A, q, m) / ΔTg(A, q, n)

Similarly, the rate of change in execution time of the application across different machine sizes is
also preserved for different problem sizes:
ΔTg(A, p, m) / ΔTg(A, q, m) ≈ ΔTg(A, p, n) / ΔTg(A, q, n)

We use this phenomenon to share execution times for scalability within one Grid site. The accuracy
of inter- and intra-platform similarity of normalized behaviors does not depend upon the selection of
reference point for embarrassingly parallel applications. However, for parallel applications exploiting
inter-process communications during their executions, this accuracy increases as the reference point
gets closer to the target point. Usually, the closer the reference point, the greater the similarity (of interprocess communication) it encompasses. Thus, the accuracy increases as the reference problem size gets
closer to the target problem size in case of inter-platform PST, respectively the reference problem and
machine sizes get closer to the target problem and machine sizes in case of intra-platform PST. More
formally, for inter-platform PST:
lim_{q→p} [ Tg(A, p)/Tg(A, q) - Th(A, p)/Th(A, q) ] = 0,
and similarly for intra-platform PST:
lim_{n→m} [ Tg(A, p, m)/Tg(A, p, n) - Tg(A, q, m)/Tg(A, q, n) ] = 0.
For normalization from the minimum training set only, we select the maximum problem size (in
the normal practice of the user of the application) and the maximum machine size as reference point, to incorporate the maximum effects of inter-process communications in the normalization. The distance between
the target point and the reference point for inter- and intra-platform PST on one Grid site is calculated
respectively as:
d = √((T(p) - T(q))² + (p - q)²), for the nearest problem size;
d = √((T(m) - T(n))² + (m - n)²), for the nearest machine size.
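For instance, with hypothetical values T(p) = 120 s, T(q) = 100 s and problem sizes p = 12 and q = 10, the distance to this candidate reference point would be d = √((120 - 100)² + (12 - 10)²) = √404 ≈ 20.1.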


Reduced Experiment Set


Our methodology of reducing the number of benchmarking experiments is to enable sharing of the
benchmarking information across the Grid sites, as explained in the previous section. The performance ratios
of different Grid sites are different for different applications and they also vary for different problem
and machine sizes.
First of all, for each application we make one experiment for the reference problem size on each of
the non-identical Grid sites. Afterwards, we make a full factorial design of benchmarking experiments
on the fastest (in terms of processor speed) available Grid site considering the problem and machine
size as parameters for that application. We select the largest problem size as the reference problem size
whose benchmarks are used to share information across the Grid sites. In the next prediction phase, the
process of normalization helps in completing the benchmarks computation for all the Grid sites. For
scalability analysis and prediction, one benchmark experiment for each of different machine sizes is
also made for the reference problem size.
By means of inter-platform PST, the total number of experiments for an activity A with |PA|
problem sizes, on a Grid with G sites and an average of M different machine sizes per site, is reduced from |PA| × G ×
M to |PA| × M + G - 1, and for single parallel computers from |PA| × M to |PA| + M - 1. By introducing intra-platform PST, we further reduce the total number of experiments for parallel machines as Grid sites to a
linear complexity of |PA| + (M - 1) + (G - 1).
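To make the reduction concrete with hypothetical values |PA| = 10, G = 10 and M = 16: the full cross product would require 10 × 10 × 16 = 1600 experiments, inter-platform PST brings this down to 10 × 16 + 10 - 1 = 169, and intra-platform PST reduces it further to 10 + (16 - 1) + (10 - 1) = 34.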
Later, we employ a prediction mechanism to derive performance values for the problem and machine
size combinations that were not effectively benchmarked.
We argue that this reduction in the number of performed benchmarks is a reasonable trade-off between
duration of the benchmarking process and accuracy. In Section 5, we show experimentally that predictions based on our approach are within 90% accuracy. A similar or better accuracy can be achieved with
either more benchmarks, or by using analytical modeling techniques. However, both these alternatives
are time-consuming. In addition, analytical modeling requires a separate model and expert knowledge for
each new type of application. With current Grid environments hosting hundreds to thousands of different
applications (Iosup & Epema, Build-and-Test Workloads for Grid Middleware: Problem, Analysis, and
Applications, 2007), analytical modeling for individual application performance and scalability (which
requires manual efforts) is impractical, whereas, benchmarking requires only one generic setup.

Experiment Execution
Once the set of experiments has been computed, the next phase towards prediction is to execute them
according to the experimental design strategy.
We employ the opportunistic load balancing algorithm (Schwiegelshohn & Yahyapour, 2000) for scheduling benchmarking experiments in the Grid. The algorithm for automatically conducting benchmarking
experiments is shown in Figure 4. We schedule the benchmarking experiments for each workflow activity
type in topological order. For every activity type, we make one experiment on every Grid site for one reference problem size and one sequential machine size, which we use later on in the normalization process.
Afterwards, we perform the full factorial design of experiments for one processor as machine size. Finally,
for scalability predictions we make one experiment for each of the different machine sizes for the reference
problem size on the fastest available Grid site. Jobs within one Grid site are submitted in parallel to the
local queuing system that executes them according to the local system administration policies.


Figure 4. The automatic application benchmarking algorithm


algorithm benchmark_scheduling;
input: W = (Nodes, C-edges, D-edges, IN-ports, OUT-ports);
Set of problem sizes: PA, ∀ A ∈ Nodes;
Set of Grid sites;
output: TS = execution time set;
TS = ∅;
for A ∈ Nodes ∧ pred(A) ∉ Nodes do
p = reference problem size of A (p ∈ PA);
for ∀ site ∈ Grid do in parallel
Tsite(A, p) = time(execution of A on site for reference problem size p);
TS = TS ∪ Tsite(A, p);
end for;
site = the idle Grid site with the fastest processor CPU;
for ∀ r ∈ PA do in parallel
Tsite(A, r, n) = time(execution of A on site for problem size r on n reference processors);
TS = TS ∪ Tsite(A, r, n);
end for;
for m = 1 to |site| do in parallel
Tsite(A, p, m) = time(execution of A on site for problem size p with machine size m);
TS = TS ∪ Tsite(A, p, m);
end for;
Nodes = Nodes \ A;
end for;
return TS;
end algorithm.

The algorithm returns the execution times of these experiments which are then archived and later
used by Benchmarks Computation component to calculate the benchmarks (see Figure 3).

Background Load
Sometimes the background load, that is, the applications run by external users, severely affects the performance of some (or even all) of the applications in the system, especially on ccNUMA SMP parallel computers. This happens mostly when several applications contend for the same network or processor shares, or when resource utilization is very high and the resource manager is ineffective (Arpaci-Dusseau, Arpaci-Dusseau, Vahdat, Liu, Anderson, & Patterson, 1995). However, our benchmarking procedure does not take the background load into account, at least for the moment. The reason is threefold. First, our goal is to quantify the best achievable performance of an application on a Grid platform without the contention generated by additional users. The work in (Arpaci-Dusseau, Arpaci-Dusseau, Vahdat, Liu, Anderson, & Patterson, 1995) helps quantify the ratio between the maximum achievable performance and the performance achieved in practice. Second, work on hotspot or symbiotic scheduling (Snavely & Weinberg, 2006) helps schedule applications with overlapping resource requirements such that the overlap is minimized. Third, while mechanisms for ensuring a given background load on the resources have been proposed (Mohamed & Epema, 2005), a better understanding of the structure and of the impact of the background load is needed. We plan to investigate aspects of this problem in future work.


Performance and Scalability Predictions


The benchmarks are computed from the results of the experiments and archived in a data repository for future reference. This is done in a manner that facilitates comparisons between the benchmarks for different Grid sites, problem sizes, and machine sizes, along with the performance and scalability predictions. The benchmarks can be browsed through a graphical user interface (see Figure 5) for application performance and scalability predictions for different problem sizes on different Grid sites. In this section we explain how the benchmarks are used for performance and scalability predictions and Grid site comparisons.
The performance of an application A can be predicted for any problem size p on any Grid site g from
another Grid site h (for which execution time for problem size p exists) from the benchmarks using the
normalization method, as follows:
T_g(A, p) = ( T_h(A, p) / T_h(A, q) ) · T_g(A, q).

Figure 5. Graphical user interface for application benchmarks and predictions


where T_g(A, p) represents the execution time of an activity A for a problem size p on a Grid site g.
Similarly, for scalability analysis and prediction taking the machine size as a parameter, the performance of parallel applications for different numbers of CPUs can be predicted from the benchmarks as follows:
T_g(A, p, m) = ( T_g(A, q, m) / T_g(A, q, n) ) · T_g(A, p, n).

where T_g(A, p, m) represents the execution time of an application for problem size p on a Grid site g with machine size m.
For execution time and scalability predictions, normalization is done based on the execution time for the closest set of parameters (problem size and machine size). At the start, this is based on the only common set of parameters in the benchmark repository; later, if other performance values become available (after adding experimental values from real runs), the normalization is calculated from the closest performance value, as this increases the accuracy of cross-platform performance and scalability predictions. For our prediction results we obtained a minimum accuracy of 90% from our proposed number of experiments, as we will demonstrate in the Experiments section.
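As a minimal illustration of this normalization step, the Python sketch below predicts the execution time on a target Grid site from archived benchmark values; the dictionary layout and the reference-selection rule are hypothetical and only meant to mirror the formulas above.

```python
# Archived benchmark times, keyed by (grid_site, problem_size); values in seconds.
# These numbers are made up for illustration.
times = {
    ("agrid1", 5.0): 120.0,
    ("agrid1", 9.0): 310.0,
    ("altix1", 5.0): 45.0,
}

def predict(times, target_site, source_site, problem_size):
    """T_target(A, p) = T_source(A, p) / T_source(A, q) * T_target(A, q),
    where q is a reference problem size measured on both sites, chosen as
    close as possible to the target problem size p."""
    common = [q for (site, q) in times
              if site == target_site and (source_site, q) in times]
    q = min(common, key=lambda x: abs(x - problem_size))
    ratio = times[(source_site, problem_size)] / times[(source_site, q)]
    return ratio * times[(target_site, q)]

# Predict the time of problem size 9.0 on altix1 from the agrid1 measurements.
print(predict(times, "altix1", "agrid1", 9.0))
```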

Grid Site Comparisons


As their second key use, our training benchmarks facilitate the comparison of application performance for different values of problem and machine sizes on different Grid sites. This can guide the Grid site selection policies of real-time schedulers, resource brokers, and different Grid users. Furthermore, these comparisons provide application developers with information about the systems' capabilities in terms of application performance, so that they can develop and tune their applications for high-quality implementations.

EXPERIMENTS
We have conducted experiments to validate our experimental design method in a heterogeneous subset
of the Austrian Grid environment summarized in Table 1.

Workflow Applications
We used three real-world workflow applications to validate our method: WIEN2k, MeteoAG, and Invmod, which we describe in the next subsections.

WIEN2k
WIEN2k (Schwarz, Blaha, & Madsen, 2002) is a program package for performing electronic structure
calculations of solids using density functional theory based on the full potential (linearized) augmented


Table 1. The Austrian Grid testbed


Site Name   | Architecture           | No. CPUs | Processor Architecture | Gigahertz | RAM [megabytes] | Location
altix1.jku  | ccNUMA, SGI Altix 3000 | 64       | Itanium 2              | 1.6       | 14000           | Linz
altix1.uibk | ccNUMA, SGI Altix 350  | 16       | Itanium 2              | 1.6       | 16000           | Innsbruck
schafberg   | ccNUMA, SGI Altix 350  | 16       | Itanium 2              | 1.6       | 14000           | Salzburg
agrid1      | NOW, Fast Ethernet     | 20       | Pentium 4              | 1.8       | 1800            | Innsbruck
hydra       | COW, Fast Ethernet     | 16       | AMD Athlon             | 2.0       | 1600            | Linz
hc-ma       | NOW, Fast Ethernet     | 16       | AMD Opteron            | 2.2       | 4000            | Innsbruck
zid-cc      | NOW, Fast Ethernet     | 22       | Intel Xeon             | 2.2       | 2000            | Innsbruck
karwendel   | COW, Infiniband        | 80       | AMD Opteron            | 2.4       | 16000           | Innsbruck

Figure 6. Simplified WIEN2k workflow representation

plane wave ((L)APW) and local orbital (lo) method. We first ported the application onto the Grid by splitting the monolithic code into several coarse-grain activity types coordinated in a workflow, as illustrated in Figure 6. The LAPW1 and LAPW2 activities can be solved in parallel by a fixed number of so-called k-points. A final activity called Converged, applied on several output files, tests whether the problem convergence criterion is fulfilled. The number of sequential loop iterations is statically unknown.


MeteoAG
We designed MeteoAG (Schüller, Qin, Nadeem, Prodan, Fahringer, & Mayr, 2006) as a Grid workflow application for meteorological simulations based on the RAMS (Cotton et al., 2003) numerical atmospheric model. The simulations produce spatial and temporal fields of heavy precipitation cases over the
western part of Austria to resolve most alpine watersheds and thunderstorms. A database of reanalyzed
heavy precipitation cases is generated in order to study various aspects of objective analysis algorithms
for rain gauge networks and the impact of weather radar on the analysis.
Figure 7 illustrates the workflow structure, in which a large set of simulation cases is modeled as a parallel loop. For each simulation case, another nested parallel loop is executed with a different so-called akmin parameter value. For each individual akmin value, the activities rams_makevfile, rams_init, revu_compare, and raver are processed sequentially. Based on the results of the raver activity, a conditional

Figure 7. The simplified MeteoAG workflow


activity decides whether the activity rams_hist or the parallel loop pforTimeStep (in which the activity revu_dump is enclosed as the iteration body) is executed.

Invmod
Invmod is a hydrological application designed at the University of Innsbruck for the calibration of parameters of the WaSiM tool developed at the Swiss Federal Institute of Technology Zurich. Invmod uses the Levenberg-Marquardt algorithm to minimize the least squares of the differences between the measured and the simulated runoff for a determined time period. We re-engineered the monolithic Invmod application into a Grid-enabled scientific workflow consisting of two levels of parallelism, as depicted in Figure 8:

• Each iteration of the outermost parallel loop, called random run, performs a local search optimization starting from an arbitrarily chosen initial solution;
• Alternative local changes are examined separately for each calibrated parameter, which is done in parallel in the inner nested parallel loop.

Figure 8. The simplified Invmod workflow


Figure 9. Experiment reduction with problem, machine, and Grid sizes

The number of sequential loop iterations is variable and depends on the actual convergence of the
optimization process. However, it is usually equal to the input maximum iteration number.

Experiment Set Reduction


We analyzed the scalability of our experimental design strategy by varying the problem size of our applications from 10 to 200 for fixed values of the remaining factors: ten Grid sites with machine size 20
and 50 single processor machines. We observed a reduction in the total number of experiments from
96% to 99%, as shown in Figure 9. A reduction from 77% to 97% in the total number of experiments
was observed when we varied the machine size from one to 80, for fixed factors of 10 parallel machines,
50 single processor Grid sites and problem size of five. From another perspective, we observed that
the total number of experiments increased from 7% to 9% when the Grid size was increased from 15
to 155, for the fixed factors of five parallel machines with machine size of 10 and problem size 10. We
observed an overall reduction of 78% to 99% when we varied all factors simultaneously: five parallel
machines with machine size from 1 to 80, single processor Grid sites from 10 to 95, and problem size
from 10 to 95.

Normalized Benchmarks
Due to space limitations, we report results on benchmarking one activity from each of the three previously introduced workflows: LAPW1 from WIEN2k, rams_hist from MeteoAG, and wasim_b2c from Invmod.
The training benchmarks for the LAPW1 activity type of the WIEN2k workflow on different Grid
sites of the Austrian Grid are shown in Figure 10. We made a total of 45 benchmark experiments for


41 different problem sizes on 5 different Grid sites. The total execution time of our reduced LAPW1 benchmarking phase was 4203.24 seconds, while the full factorial set would need 5.6 times longer (23614.8 seconds). We repeated every experiment five times and took the average to reduce anomalies in the measurements due to external factors. For LAPW1, we took the execution time for problem size 9.0 as the base performance value for normalization. The similar benchmark curves (for different values of the problem size) on different machines show that the normalized performance behavior of the Grid benchmarks holds across heterogeneous platforms.
Performance and scalability benchmarks for different numbers of CPUs for the rams_hist activity type of the MeteoAG workflow on the zid-cc and hc-ma Grid sites are shown in Figure 11 and Figure 12, respectively. A total of 30 benchmarking experiments were made for 19 problem sizes and 12 machine sizes on zid-cc, and 32 experiments for 19 problem sizes and 14 machine sizes on hc-ma. In these experiments, we used a machine size of one for normalization. The identical scalability curves demonstrate that the normalized performance behavior of the application benchmarks holds with respect to problem and machine size on one platform.
We observed similar results for the wasim_b2c activity of Invmod for different problem sizes on different Grid sites, as shown in Figure 13. Here, the reduced training phase took 2190.62 seconds, while
the full factorial set would need about 10711.5 seconds (4.8 times longer).

Grid Site Comparison


A comparison of different Grid sites for LAPW1 and wasim_b2c is shown in Figure 14 and Figure 15,
respectively. The scalability comparison for MeteoAG for different problem sizes on two different platforms, the 32-bit zid-cc and the 64-bit hc-ma, is shown in Figure 16 and Figure 17, respectively.
Figure 10. Normalized LAPW1 benchmarks for 41 problem sizes and 5 Grid sites


Figure 11. Normalized rams_hist benchmarks on zid-cc with 19 problem sizes and 12 machine sizes

Figure 12. Normalized rams_hist benchmarks on hc-ma with 19 problem sizes and 14 machine sizes


Figure 13. Normalized wasim_b2c benchmarks on hc-ma with 9 problem sizes and 5 Grid sites

A comparison of two different versions of LAPW1 (32-bit versus 64-bit) on karwendel is presented in Figure 18. These graphs were generated from the application benchmarks when only one benchmark measurement for the 64-bit version was made.
To give a glimpse of the variability in the quantitative comparisons of different Grid sites for different applications, we present our experimental results in Figure 19. As shown in this figure, the agrid1
and altix1.uibk Grid sites yielded different execution time ratios for the three different applications. For
WIEN2k this ratio is 2.37, for Invmod 10.37, and for MeteoAG 1.71. It is noteworthy that these ratios
are independent of the total execution times on these Grid sites. This is why benchmarks
for individual resources (e.g. CPU, memory) do not suffice for application performance and scalability
predictions. Furthermore, considering one application, the comparison of execution times on Grid sites
yields different ratios for different problem sizes. This performance behavior of Grid applications urged
us to make a full factorial design of experiments on the Grid, rather than modeling individual applications
analytically, which is complex and inefficient. The execution time ratios of the two Grid sites altix1.uibk
and agrid1 for 41 different problem sizes are shown in Figure 20.

Prediction Accuracy
Figure 21 and Figure 22 show the comparison between the measured and predicted values for LAPW1 and wasim_b2c, respectively. The lowest curves in both figures represent the execution values on agrid1, taken as the reference Grid site. The execution times of these activities on the four other Grid sites were calculated through the normalization process. We used the reference point available from the training set for normalization.


Figure 14. Performance benchmark and Grid site comparison for 41 problem sizes of LAPW1 and five
Grid sites

Figure 15. Performance benchmark and Grid site comparison for 9 problem sizes of wasim_b2c and
five Grid sites


Figure 16. Performance benchmark and Grid site comparison for 18 problem sizes of rams_hist and
12 machine sizes on zid-cc

Figure 17. Performance benchmark and Grid site comparison for 18 problem sizes of rams_hist and
14 machine sizes on hc-ma


Figure 18. 32 bit versus 64 bit performance benchmark and comparison for different problem sizes on
karwendel

Figure 19. Quantitative performance comparison of altix1 and agrid1 for three workflow applications


Figure 20. Execution times ratios for altix1.uibk and agrid1 for 41 problem sizes

Each pair of measured and predicted curves is very similar. However, we can see that they are closest near the reference problem size and differ slightly more as the distance from the reference problem size increases. For this reason, we always take the reference problem size as close as possible to the target problem size. We observed a maximum average variation of 10% from the actual values (obtained from real runs) in our performance and scalability predictions, which means 90% accuracy in our predictions with a maximum standard deviation of 2%. As we get more data from actual runs, the probability of finding problem sizes closer than the one obtained during the benchmarking phase increases; these can be used in the normalization and thus increase the accuracy even beyond 90%.

FUTURE TRENDS
In the future we plan to enhance our present work in multiple dimensions. First, we aim to refine the
experimental design phase to further reduce the number of experiments on one Grid site (with full factorial experiments) by applying intelligent space search methods. In the beginning this will be done with
the help of end-users, and later we plan to automate it for different applications. Second, we intend to
make another set of benchmarks by keeping track of memory used for different problem sizes of an application. This will help in translating application performance across different machines with different
memory capacities, including performance variations due to paging in case of data-intensive applications. Third, we are also enhancing our present work towards application benchmarking at the level of
Grid constellations comprising multiple sites spread across multiple Virtual Organizations. Fourth,
we plan to incorporate application throughput information for performance transformation across the


Figure 21. Comparison of real and predicted values for LAPW1

Figure 22. Comparison of real and predicted values for wasim_b2c

platforms and learn from previous prediction errors. Last but not least, we want to study the effect of prediction inaccuracies on the scheduling of workflows.


CONCLUSION
Application benchmarks provide a concrete basis for performance analysis and predictions incorporating
variations in the problem and machine sizes on different platforms, and for real quantitative comparison
of different Grid sites.
Efficient and reliable design of experiments to support automatic benchmarking for the training set
of application performance prediction on the Grid is of crucial importance. We proposed in this chapter an effective experimental design through a step-by-step controlling mechanism that reduces the combinations of the factors affecting the application performance prediction. Our scalable approach is based on two intra- and inter-platform performance sharing and translation mechanisms that reduce the number of benchmarking experiments in the training phase to a complexity linear in the number of problem
sizes, the size of one Grid site, and the number of Grid sites.
Benchmarking an application with our method requires executing a full factorial set of experiments
on one Grid site, and a scalability analysis for different machine sizes for a reference problem size.
Using this information, predicting the performance of the application for an arbitrary problem size and
machine size on another Grid site requires performing one single additional benchmarking experiment
and then applying the inter- and intra-platform translation mechanisms.
We demonstrated experimental results for three real-world applications in the Austrian Grid environment. In our experiments we achieved a 77% to 99% reduction in the number of experiments while maintaining 90% accuracy in the prediction results.

REFERENCES
Afgan, E., Velusamy, V., & Bangalore, P. (2005). Grid resource broker using application benchmarking.
European Grid Conference, (LNCS 3470, pp. 691-701). Amsterdam: Springer Verlag.
Arpaci-Dusseau, R. H., Arpaci-Dusseau, A. C., Vahdat, A., Liu, L. T., Anderson, T. E., & Patterson, D.
A. (1995). The interaction of parallel and sequential workloads on a network of workstations. SIGMETRICS, (pp. 267-278).
Bailey, D., Barszcz, E., Barton, J. T., Browning, D. S., Carter, R. L., & Dagum, L. (1991). The NAS parallel benchmarks. The International Journal of Supercomputer Applications, 5(3), 63-73.
Banks, T. (2006). Web services resource framework (WSRF). Organization for the Advancement of Structured Information Standards (OASIS).
Berman, F., Casanova, H., Chien, A. A., Cooper, K. D., Dail, H., & Dasgupta, A. (2005). New Grid scheduling and rescheduling methods in the GrADS project. International Journal of Parallel Programming, 33(2-3), 209-229. doi:10.1007/s10766-005-3584-4
Cotton, W., Pielke, R., Walko, R., Liston, G., Tremback, C., & Jiang, H. (2003). RAMS 2001: Current status and future directions. Meteorology and Atmospheric Physics, 82(1-4), 5-29. doi:10.1007/s00703-001-0584-9


Czajkowski, K., Fitzgerald, S., Foster, I., & Kesselman, C. (2001). Grid information services for distributed resource sharing. 10th International Symposium on High Performance Distributed Computing
(pp. 181-194). San Francisco: IEEE Computer Society Press.
De Roure, M., & Surridge, D. (2003). Interoperability challenges in Grid for industrial applications.
GGF9 Semantic Grid Workshop, Chicago.
Dikaiakos, M. D. (2007). Grid benchmarking: Vision, challenges, and current status. Concurrency and Computation: Practice and Experience, 19, 89-105. New York: Wiley InterScience. doi:10.1002/cpe.1086
Dixit, K. M. (1991). The SPEC benchmarks. Parallel Computing, 17(10-11), 1195-1209. doi:10.1016/S0167-8191(05)80033-X
Dongarra, J., Luszczek, P., & Petitet, A. (2003, August). The LINPACK benchmark: Past, present and future. Concurrency and Computation: Practice and Experience, 15(9), 803-820. doi:10.1002/cpe.728
Duan, R., Prodan, R., & Fahringer, T. (2006). Run-time optimization for Grid workflow applications.
International Conference on Grid Computing. Barcelona, Spain: IEEE Computer Society Press.
Fahringer, T., Prodan, R., Duan, R., Hofer, J., Nadeem, F., Nerieri, F., et al. (2006). ASKALON: A development and grid computing environment for scientific workflows. In I. J. Taylor, E. Deelman, D. G.
Ganon, & M. Shields (Eds.), Workflows for e-Science (p. 530). Berlin: Springer Verlag.
Foster, I., & Kesselman, C. (1997). Globus: A metacomputing infrastructure toolkit. International
Journal of Supercomputer Applications and High Performance Computing, 11(2), 115128.
doi:10.1177/109434209701100205
Foster, I., & Kesselman, C. (2004). The Grid: Blueprint for a future computing infrastructure (2 Ed.).
San Francisco: Morgan Kaufmann.
Heymann, E., Fernandez, A., Senar, M. A., & Salt, J. (2003). The EU-Crossgrid approach for grid
application scheduling. European Grid Conference, (LNCS 2970, pp. 17-24). Amsterdam: Springer
Verlag.
Hockney, R., & Berry, M. (1994). PARKBENCH report: public international benchmarks for parallel
computers. Science Progress, 3(2), 101146.
Iosup, A., & Epema, D. H. (2006). GRENCHMARK: A framework for analyzing, testing, and comparing grids. International Conference on Cluster Computing and the Grid (pp. 313-320). Singapore:
IEEE Computer Society Press.
Iosup, A., & Epema, D. H. (2007). Build-and-test workloads for Grid middleware: Problem, analysis,
and applications. International Conference on Cluster Computing and the Grid (pp. 205-213). Rio
de Janeiro, Brazil: IEEE Computer Society Press.
Jarvis, S. A., & Nudd, G. R. (2005, February). Performance-based middleware for Grid computing. Concurrency and Computation: Practactice and Experience, 17(2-4), 215234. doi:10.1002/
cpe.925


Mohamed, H. H., & Epema, D. H. (2005). Experiences with the KOALA co-allocating scheduler in
multiclusters. International Conference of Cluster Computing and the Grid (pp. 784-791). Cardiff,
UK: IEEE Computer Society Press.
Montgomery, D. C. (2004). Design and analysis of experiments (6th ed.). New York: Wiley.
Prodan, R., & Fahringer, T. (2008, March). Overhead analysis of scientific workflows in Grid environments. IEEE Transactions on Parallel and Distributed Systems, 19(3), 378-393. doi:10.1109/TPDS.2007.70734
Raman, R., Livny, M., & Solomon, M. H. (1999). Matchmaking: An extensible framework for distributed resource management. Cluster Computing, 2(2), 129-138. doi:10.1023/A:1019022624119
Schüller, F., Qin, J., Nadeem, F., Prodan, R., Fahringer, T., & Mayr, G. (2006). Performance, scalability and quality of the meteorological grid workflow MeteoAG. In Austrian Grid Symposium. Innsbruck, Austria: OCG Verlag.
Schwarz, K., Blaha, P., & Madsen, G. K. (2002). Electronic structure calculations of solids using the WIEN2k package for material sciences. Computer Physics Communications, 147, 71.
Schwiegelshohn, U., & Yahyapour, R. (2000). Fairness in parallel job scheduling. Journal of Scheduling, 3(5), 297-320. doi:10.1002/1099-1425(200009/10)3:5<297::AID-JOS50>3.0.CO;2-D
Seltzer, M. I., Krinsky, D., & Smith, K. A. (1999). The case for application-specific benchmarking.
Workshop on Hot Topics in Operating Systems (pp. 102-109). Rio Rico, AZ: IEEE Computer Society
Press.
Siddiqui, M., Villazon, A., Hoffer, J., & Fahringer, T. (2005). GLARE: A Grid activity registration,
deployment, and provisioning framework. Supercomputing Conference. Seattle, WA: IEEE Computer
Society Press.
Snavely, A., & Weinberg, J. (2006). Symbiotic space-sharing on SDSCs datastar system. Job Scheduling
Strategies for Parallel Processing. (LNCS 4376, pp.192-209). St. Malo, France: Springer Verlag.
Theiner, D., & Rutschmann, P. (2005). An inverse modelling approach for the estimation of hydrological model parameters. (I. Publishing, Ed.) Journal of Hydroinformatics.
Tirado-Ramos, A., Tsouloupas, G., Dikaiakos, M. D., & Sloot, P. M. (2005). Grid resource selection
by application benchmarking: A computational haemodynamics case study. International Conference
on Computational Science. (LNCS 3514, pp. 534-543). Atlanta, GA: Springer Verlag.
Tsouloupas, G., & Dikaiakos, M. D. (2007). GridBench: A tool for the interactive performance exploration of Grid infrastructures. Journal of Parallel and Distributed Computing, 67(9), 10291045.
doi:10.1016/j.jpdc.2007.04.009
Van der Wijngaart, R. F., & Frumkin, M. A. (2004). Evaluating the information power Grid using
the NAS Grid benchmarks. International Parallel and Distributed Processing Symposium. Santa Fe,
NM: IEEE Computer Society Press.


Wieczorek, M., Prodan, R., & Fahringer, T. (2005). Scheduling of scientific workflows in the ASKALON Grid environment. SIGMOD Record, 34(3), 56-62.
Wolski, R. (2003). Experiences with predicting resource performance on-line in computational grid settings. ACM SIGMETRICS Performance Evaluation Review, 30(4), 41-49. doi:10.1145/773056.773064
World Wide Web Consortium (W3C). (n.d.). Web services activity. Retrieved from http://www.
w3.org/2002/ws/

KEY TERMS AND DEFINITIONS


Benchmark: A measurement to be used as a reference value for future calculations such as performance predictions.
Experimental Design: Design of all information gathering exercises where variation is present,
whether under the full control of the experimenter or not.
Grid: A geographically distributed hardware and software infrastructure that integrates high-end
computers, networks, databases, and scientific instruments from multiple sources to form a virtual
supercomputer on which users can work collaboratively within virtual organizations.
Performance Prediction: Estimation of the execution time of an application for a certain problem
size in a certain configuration (e.g. machine size) on the target computer architecture.
Scalability: The ability of a system to handle growing amounts of work without losing processing speed.
Scheduling: The process of assigning an appropriate execution resource to each atomic activity
of a large application; scheduling is usually employed for parallel applications, bags of tasks, and
workflows and is an NP-complete problem for certain objective functions such as execution time.
Scientific Workflow: A large-scale loosely coupled application consisting of a set of commodity
off-the-shelf software components (also called tasks or activities) interconnected in a directed graph
through control flow and data flow dependencies.


Section 2

P2P Computing


Chapter 6

Scalable Index and Data Management for Unstructured Peer-To-Peer Networks
Shang-Feng Chiang
National Taiwan University, Taiwan
Kuo Chiang
National Taiwan University, Taiwan
Ruo-Jian Yu
National Taiwan University, Taiwan
Sheng-De Wang
National Taiwan University, Taiwan

ABSTRACT
In order to improve the scalability and reduce the traffic of Gnutella-like unstructured peer-to-peer networks, index caching and controlled flooding mechanisms have been an important research topic in recent years. In this chapter the authors describe and present the current state of the art in index management schemes, interest groups, and data clustering for unstructured peer-to-peer networks. Index caching mechanisms are an approach to reducing the traffic of keyword querying. However, the cached indices may incur redundant replications in the whole network, leading to less efficient use of storage and an increase in traffic. The authors propose a multilayer index management scheme that actively diffuses the indices in the network and groups indices according to their request rate. Indices with a higher request rate are placed in layers that receive queries earlier. Their simulation shows that the proposed approach can maintain a high query success rate while reducing the flooding size.

DOI: 10.4018/978-1-60566-661-7.ch006

Copyright 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


INTRODUCTION
With the growth of the Internet, peer-to-peer (P2P) systems have become an important paradigm for designing large-scale distributed systems. Peer-to-peer systems (Androutsellis-Theotokis & Spinellis, 2004) provide effective ways of sharing data and can be based on overlay networks, which are classified by their degree of decentralization into three categories: purely decentralized, partially decentralized, and hybrid decentralized architectures. Supporting efficient search for desired documents has been the most important issue in decentralized peer-to-peer networks. The overlay for decentralized peer-to-peer networks can be either unstructured or structured based on some distributed hash function. Gnutella and Napster are pioneers in peer-to-peer file sharing systems and belong to the unstructured ones.
A class of structured peer-to-peer networks uses DHT (Distributed Hash Table) to maintain the shared
documents. Distributed hash tables (DHTs) make use of hashing functions to provide distribution and
lookup services. In this way, any participating node can efficiently retrieve the value associated with a
given key. Responsibility for maintaining the mapping from names to values is distributed among the
nodes, in such a way that a change in the set of participants causes a minimal amount of disruption.
DHTs are typically designed to scale to a large number of nodes and to handle node arrivals and departures.
With a routing table, all the participating nodes only need to communicate with a small fraction of all
the nodes in a structured overlay network.
On the other hand, unstructured peer-to-peer networks often rely on flooding mechanisms to search for the desired objects. As a result, they need to use techniques such as index caching, active replication, or controlled flooding to reduce the query traffic. The search algorithm of Gnutella uses a kind of flooding method to discover objects, which sends queries to all nodes within a given TTL value. However, the mechanism is not scalable, since the number of query messages grows exponentially due to its blind search method.
In this chapter, we will discuss some scalable techniques for index and data management for unstructured peer-to-peer networks. The concepts of interest group and data clustering will also be addressed.
We also propose an index diffusion scheme that maintains a high query success rate and reduces the traffic load for unstructured peer-to-peer systems.

BACKGROUND AND RELATED WORK


BitTorrent
BitTorrent is a peer-to-peer communication protocol (Cohen 2002) that can distribute large amounts of
data widely without the original distributor incurring the entire costs of hardware, hosting, and bandwidth
resources. Instead, when data is distributed using the BitTorrent protocol, each recipient supplies pieces
of the data to newer recipients, reducing the cost and burden on any given individual source, providing
redundancy against system problems, and reducing dependence on the original distributor.


Blind Search Methods


Most search methods in Gnutella-like peer-to-peer networks can be categorized into blind search methods and informed search methods (Tsoumakos & Rousseopoulos, 2006). In blind search methods, there is no mechanism to keep query results, to judge the best query path, or to use other information to reduce traffic. Blind search simply has nodes transmit messages to some (instead of all) adjacent nodes, without using information from the messages or choosing the best path for transmission. Typical systems include Gnutella, Modified-BFS (Kalogeraki, Gunopulos, & Zeinalipour-Yazti, 2002), Random Walks (Lv, Cao, Cohen, Li, & Shenker, 2002), and Dynamic Query (Fisk, 2003; Ripeanu, Foster, & Iamnitchi, 2002).

Index Caching Mechanisms


Some informed search methods are efficient in reducing flooding messages by using a class of caching mechanisms. In a normal Uniform Index Caching (UIC) mechanism, all peers in the path of a query record the query hit results in their caches. DiCAS records the query hit results in a multilayer peer-to-peer network (Wang, Xiao, Liu, & Zheng, 2004). In DiCAS with m layers, each peer randomly takes an initial value in a certain range, for example, 0 to m-1, as a group id when it joins the peer-to-peer network. A query qr matches a peer if and only if the peer's group id satisfies the following equation:

id = hash(qr) mod m    (1)

Unstructured peer-to-peer networks often use a partially centralized architecture to reduce the query traffic. Some nodes in the network play specific roles, for example, managing the network configuration, monitoring the network status, and forwarding messages to other peers. Distributed caching (Ambastha, Beak, Gokhale, & Mohr, 2003) and adaptive search (DiCAS) algorithms have been proposed to reduce network search traffic with the help of a small cache space contributed by each individual peer. With indices passively cached in a group of peers based on a predefined hash function, the DiCAS protocol can significantly reduce network search traffic. Based on the DiCAS algorithm, we will propose an index management scheme to enhance the search performance.
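As a rough illustration of equation (1), the sketch below shows how a peer can decide whether a query belongs to its group; the use of MD5 as the hash function and the value of m are assumptions made only for this example.

```python
import hashlib

M = 3  # assumed number of groups for this example

def group_id(query: str, m: int = M) -> int:
    """id = hash(qr) mod m, with MD5 standing in for the hash function."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

def matches(peer_group_id: int, query: str, m: int = M) -> bool:
    # A query matches a peer if and only if the peer's group id equals hash(qr) mod m.
    return peer_group_id == group_id(query, m)

print(group_id("some keyword"), matches(1, "some keyword"))
```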

Interest Group and Data Clustering


The concept of interest groups (Chao, 2006) can play an important role in designing peer-to-peer systems. Everyone has their own interests, and the communities on the Internet reflect this fact. If we can place files with the same attributes, or files of interest to the same group of users, in the same cluster, users can easily find the files by searching within that cluster. It is worth noting that the resource types themselves present fractal properties in both coarse and fine classifications of resources. To further improve efficiency, locality should be considered by application-layer multicast or routing algorithms when constructing overlay networks (Zhang et al., 2004). Even better performance can be achieved if we combine locality with the concept of interest groups.


Popularity of Content:
The shared contents in a peer-to-peer network generally follow a kind of probability distribution that reflects some form of popularity. We would like to model this kind of distribution in peer-to-peer networks so that it is close to realistic cases. In many cases, query keywords follow a Zipf distribution. The content popularity model described in (Saleh & Hefeeda, 2006) follows a Mandelbrot-Zipf distribution (Silagadze, 1997). The Mandelbrot-Zipf distribution, a general form of the Zipf-like distribution with an extra parameter, defines the probability of accessing an object at rank i out of N available objects as:
p(i) = H_{N,s,q} / (i + q)^s,   where   H_{N,s,q} = 1 / Σ_{i=1}^{N} 1/(i + q)^s,   q ≥ 0    (2)

Here, s is the skewness factor and q is the plateau factor. In (Saleh & Hefeeda, 2006), it is observed that the typical value of s is between 0.4 and 0.7, and the typical value of q is between 5 and 60. The Mandelbrot-Zipf distribution degenerates to a Zipf-like distribution with skewness factor s if q = 0. In our simulation, we will use these parameters to set up the simulation environment in order to make the simulation model close to real cases.
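Under the formula as reconstructed above, a few lines of Python are enough to generate this popularity model for a simulation; the function below is only a sketch of that computation.

```python
def mandelbrot_zipf(N: int, s: float, q: float) -> list[float]:
    """p(i) = H / (i + q)**s for ranks i = 1..N,
    with H = 1 / sum_{i=1}^{N} 1 / (i + q)**s (so the probabilities sum to 1)."""
    H = 1.0 / sum(1.0 / (i + q) ** s for i in range(1, N + 1))
    return [H / (i + q) ** s for i in range(1, N + 1)]

# Two setups within the reported typical ranges: s in [0.4, 0.7], q in [5, 60].
p = mandelbrot_zipf(N=1268, s=0.7, q=5)
print(round(sum(p), 6), p[0], p[-1])   # rank 1 is the most popular object
```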

AN INDEX DIFFUSION MECHANISM


Request Popularity and Disposition
In order to group indices, we cluster the indices of files with similar properties into the same groups. In structured peer-to-peer networks, each keyword is matched exactly by a single node. The architecture of structured peer-to-peer networks may therefore be the best choice for grouping the indices by keywords, and search in structured peer-to-peer networks is also more efficient than in unstructured ones. For an unstructured peer-to-peer network, we group indices by request popularity instead of keywords. The request popularity is a special property of a shared file, and it cannot be calculated from the information of the shared file alone. To measure the request popularity of shared files, the peer that owns a file must update its hit count by listening to the queries or to the index update messages. This is similar to webpage ranking: a popular webpage is marked with a higher rank value by the search engine, and the search engine can sort the search results by the rank of each webpage. Similarly, we can assign a higher rank value to a more popular file. The indices of the more popular files can be distributed to peers that lie earlier on the query path, so that a query can reach them earlier.
We design a hierarchical multilayer peer-to-peer network in which the layers have different priorities. The top layer has the highest priority, and queries always reach the peers in the top layer first. Indices of the most popular files are placed in the top layer. We move file indices whose request rates are growing up to an upper layer so that they can be reached earlier in subsequent queries. The indices of files that become unpopular are replaced by more popular ones. One situation needs to be considered: when new files are inserted into the peer-to-peer network, we assume they are popular, so we insert their indices into the top layer to make them reachable as early as possible.

Figure 1. The multilayer structure

Network Architecture
In order to maintain our multilayer architecture easily and balance the load in each layer, we divide the network into three layers (0 to 2) and 7 groups (0 to 6). The architecture is shown in Figure 1. Each peer randomly selects a value (0 to 6) as its group id when it joins the network. The top layer, Layer-0, contains only one group (Group-0), and the second layer contains two groups (Group-1 and Group-2). The other four groups are assigned to the third layer. Because only a few files are popular, Layer-0 has only one group. The index diffusion is carried out by the owner of the file according to the request rate, which is characterized by the query hit counts observed in some period T. The owner of a file will diffuse the index in Layer-0 if the hit count is larger than 4 in T; if the hit count is 2 or 3 in T, the index will be randomly diffused to one group in Layer-1. In order to control the traffic, indices in Layer-2 never expire; an index in Layer-2 is removed if and only if the index update procedure fails. So, when a new file is inserted into the network, the file owner will not only place the file index in Layer-0, but will also randomly select one group in Layer-2 to keep the file index.
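The diffusion rule stated in this paragraph can be condensed into a small helper; the sketch below (hypothetical Python, with the thresholds exactly as given in the text) is one way the file owner could pick the target groups.

```python
import random

LAYER0_GROUPS = [0]            # top layer: one group
LAYER1_GROUPS = [1, 2]         # second layer: two groups
LAYER2_GROUPS = [3, 4, 5, 6]   # third layer: indices here never expire

def diffusion_targets(hits_in_T: int, new_file: bool = False) -> list[int]:
    """Return the group ids to which the file owner diffuses the index."""
    targets = []
    if new_file:
        # a new file is assumed popular: Layer-0 plus one permanent Layer-2 group
        targets.append(LAYER0_GROUPS[0])
        targets.append(random.choice(LAYER2_GROUPS))
    elif hits_in_T > 4:
        targets.append(LAYER0_GROUPS[0])
    elif hits_in_T in (2, 3):
        targets.append(random.choice(LAYER1_GROUPS))
    return targets

print(diffusion_targets(5), diffusion_targets(3), diffusion_targets(0, new_file=True))
```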

Control of Indices
Most index caching mechanisms are based on either passively overhearing the traffic or actively querying neighbors and collecting information about their shared files. Overhearing incurs no additional traffic overhead, because caching the results of query hits does not affect the original flooding algorithms. However, actively querying for and collecting the locations of shared files produces overhead due to the update mechanism. The overhead is significant when peers frequently leave or join the overlay network. Obviously, the hit counts of shared files that contain more popular keywords will be higher than those of others. As a result, passive caching methods cache more copies of the indices of popular items than of unpopular ones. The advantage is that popular files will be found easily and quickly because their indices are cached widely over the network. In this case, a flooding algorithm with a low TTL value is sufficient to find such an item. However, index caching mechanisms have no method to control the number of indices of an item. This produces at least two disadvantages: (1)

it is hard to find new shared files that contain popular keywords; (2) the cached information may be invalid in peer-to-peer networks with a high churn rate. In order to limit the traffic incurred by flooding, some search methods assign a smaller TTL value in the search process. Flooding with a small TTL covers only a small part of the network, so the peers that have newer files may not be included in the flooding range. On the other hand, in a peer-to-peer network, the destination peer pointed to by an index in the index cache may have left the network. In the DiCAS algorithm, peers use a hash method to redirect query messages in the multilayer peer-to-peer network. With this method, search is more efficient than the original uniform index caching mechanism in terms of traffic. But there is one issue concerning the number of layers in the multilayer peer-to-peer network. If we insert more layers into the peer-to-peer network, the probability of finding a desired file in a matched layer will be lower. This may limit the query success rate in the peer-to-peer network, and the number of results may not be enough to meet the preset expected number of matched results.
The proposed active index diffusion (AID) scheme can overcome these problems of index caching mechanisms. In our scheme, we group the peers that share the same files into a cluster, and they can diffuse indices actively. There are four main strategies in the AID scheme:
1. Clustering peers that share the same file names: The main concept of our AID scheme is similar to some existing peer-to-peer applications, such as BitTorrent. There are several trackers that
record the owners of files and the peers that are downloading them in BitTorrent. These trackers provide this information when peers request the files. Peer-to-peer networks like BitTorrent are efficient at getting peer lists from trackers, so we group peers that share the same file names into a cluster. In the AID scheme, each peer knows the other peers that share files with the same file names. In other words, every peer can play a role just like the trackers in BitTorrent: if a peer requests a file, the owner of the file will provide the information about it. Another reason for grouping files with the same file names is that there are more and more fake files appearing in peer-to-peer networks. Fake files are detected in KaZaA, while files may be infected by viruses in Winny and Share. By grouping these fake files and real files together, we can apply some method to differentiate between fake and real files with a tag. The tag may help peers tell whether a file is real.
2. Index update when an index hit occurs: As we have discussed, request popularity characterizes our index diffusion scheme. We define the request popularity by the number of index hits over a timeout period T. In each cluster, the owners elect one peer as the master of the file. The master may be the one that has the longest online time in the cluster; when the master leaves the network, the peers re-elect a new master. When an index hit occurs at a peer, that peer sends an index update message to the file owner. The owner of the file then randomly selects a peer in its cluster to carry out the index update. The file owner also sends an index update message to the master. The master in each cluster keeps the hit count obtained from the index update messages.
3. Actively diffusing indices: According to the hit rate, the master will actively diffuse its indices to the peers in an appropriate layer when the indices expire. We denote the timeout value by T. The value of T should not be too small; otherwise masters may update indices too often. In order to control the traffic of actively diffusing indices, T must be assigned a value larger than a certain number that may be related to the network size and the amount
of shared files.

Figure 2. Index copy when peer joins

Assume that a network has N peers, the index replication rate is r, a total of I different files are shared, and the maximal number of messages is S. If we want to limit the number of messages to S, the following inequality must be satisfied:
S ≥ (total popular indices that need updates in T) / T = N · r · I · k / T    (3)
So we can determine the value of T by:


T ≥ N · r · I · k / S    (4)

where k is the probability that a file is popular. In (Chawathe, Ratnasamy, Breslau, Lanham, & Shenker, 2003), it is estimated that 32% to 42% of queries are repeated. According to the Zipf distribution, these queries request only files with small sizes. Thus, we assign a small value to k.
4. Copy indices when peer joins: The basic procedure for joining our network is the same as in Gnutella/LimeWire (Fisk, 2003). Because we wish to control the flooding size and keep the hit rate steady, the index replication rate should be kept at a constant value. In a peer-to-peer network, we cannot control the departure of each peer, but we can do something when peers join. When peers receive copy request messages, they replicate a part of their indices to the newly joining peer; this keeps the replication ratio at a stable value. After finishing the joining procedure, the newcomer requests index replication by sending n copy request messages to n connected peers. If a peer receives a copy request message, it replies to the newcomer with 1/n of the indices in its index cache. An example is shown in Figure 2. As a result, the index replication ratio will


Table 1. p′(F) and F′ for different network sizes

Peers     | 10000 peers: p′(F) | 10000 peers: F′ | 100000 peers: p′(F) | 100000 peers: F′
6 groups  | 95.57%             | 30.26303        | 95.48%              | 30.31680
7 groups  | 97.43%             | 29.01158        | 97.35%              | 29.06710
8 groups  | 98.52%             | 28.04401        | 98.46%              | 28.10261
9 groups  | 99.15%             | 27.30189        | 99.11%              | 27.35940
10 groups | 99.52%             | 26.73488        | 99.49%              | 26.78876

keep a constant value because the new peer will keep n · (1/n) = 1 table of indices.
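A toy version of this join-time copying, with hypothetical data structures, is sketched below: the newcomer asks each of its n neighbors for 1/n of that neighbor's cache, so on average it ends up holding about one cache's worth of indices.

```python
import random

def copy_on_join(neighbor_caches: list[list[str]]) -> list[str]:
    """Each of the n neighbors replies with 1/n of the indices in its own cache."""
    n = len(neighbor_caches)
    newcomer_cache: list[str] = []
    for cache in neighbor_caches:
        share = len(cache) // n
        newcomer_cache.extend(random.sample(cache, min(share, len(cache))))
    return newcomer_cache

# Four neighbors with 50 cached indices each: the newcomer receives roughly 50 indices.
caches = [[f"file-{i}-{j}" for j in range(50)] for i in range(4)]
print(len(copy_on_join(caches)))
```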

Flooding of Query
In peer-to-peer networks, if the file indices are diffused in the whole network and no intelligent method (just blind flooding) is applied to search, the query success rate can be described by a hypergeometric distribution. By definition, if a random variable X follows the hypergeometric distribution with parameters N, D, and n, the probability of getting exactly k successes is given by
f(X = k; N, D, n) = C(D, k) · C(N − D, n − k) / C(N, n)    (5)

where C(a, b) denotes the binomial coefficient, N is the size of the whole sample space, D is the total number of desired items, and n is the number of samples. Now, let a simple peer-to-peer network consist of N peers with an index replication rate of r. Then the probability model becomes:
f(X = k; N, rN, m) = C(rN, k) · C((1 − r)N, m − k) / C(N, m)    (6)

If the query messages are sent to m peers, the success rate h_m is equal to:

h_m = 1 − f(X = 0; N, rN, m)    (7)

Generally speaking, a larger flooding size incurs more traffic overhead, but the success rate can also be higher. We want our flooding method in each group to be efficient in terms of both flooding size and success rate, so we modify the dynamic query search method to suit our architecture. The concept is to satisfy the demand for files as in dynamic query. In our search method, we expect the success rate to be higher than a certain value if the indices of the desired items are in the group.
To make our network scalable, we set the basic flooding size F of each group to the same value. Searching in each group consists of two iterations and begins by sending query messages to F peers in


Figure 3. Example of routing path 1 with the modified dynamic query

the first iteration. The message will be flooded to another F peers if and only if there is no hit message
returned in the first iteration. When the network is divided into g groups, the success rate of a search in
a group having the indices of desired files is:
p(F) = 1 − f(X = 0; N/g, rN, F)    (8)

The final success rate with two iterations can be written as:
p′(F) = p(F) + (1 − p(F)) · p(F) = 2p(F) − p(F)²    (9)

Figure 4. Example of routing path 1 with the modified random walk


Figure 5. Example of routing path 2 with the modified random walk

The value of p′(F) is very close to p(2F), and the average flooding size F′ is equal to:

F′ = F + (1 − p(F)) · F = 2F − F · p(F)    (10)

Because p(F) is close to 1, the average flooding size F′ will approach F. If we flood a query to F peers directly, the value of p(F) is lower than p′(F). We set the value of F to 25 and then calculate p′(F) and F′ for different network sizes. The results are shown in Table 1. Our architecture with 10000 peers reaches a success rate of 97.43%, with F′ = 29.01158 for the 7-group peer-to-peer system. When the network size is ten times larger, the success rate is 97.35%, with F′ = 29.06710. The average flooding sizes and success rates for different network sizes are quite similar, so there is no problem with the scalability of our network.
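These formulas can be checked numerically. The Python sketch below computes p(F), p′(F), and F′ with math.comb; the replication rate r is not stated alongside Table 1, so the 1% used here is an assumption (with that value the printed numbers come out close to the table entries).

```python
from math import comb

def hyper_miss(M: int, D: int, n: int) -> float:
    """f(X = 0; M, D, n): probability that n sampled peers miss all D index copies."""
    return comb(M - D, n) / comb(M, n)

def group_search(N: int, g: int, r: float, F: int = 25):
    """Two-iteration search with flooding size F per group (eqs. 8-10):
    p  = 1 - f(X = 0; N/g, rN, F)   single-iteration success rate,
    p2 = 2p - p^2                    success rate after at most two iterations,
    Fb = 2F - F*p                    average flooding size."""
    group_size = round(N / g)
    copies = round(r * N)
    p = 1.0 - hyper_miss(group_size, copies, F)
    return p, 2 * p - p * p, 2 * F - F * p

for g in (6, 7, 8, 9, 10):
    p, p2, Fb = group_search(N=10_000, g=g, r=0.01)   # r = 1% is assumed
    print(f"{g} groups: p'(F) = {p2:.2%}, F' = {Fb:.5f}")
```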

Figure 6. Flooding size


Figure 7. Average success rate


Modified Search Methods


Because our peer-to-peer network is based on a multilayer architecture, the existing search methods are not directly suitable for it. We therefore adapt some existing blind search methods to our peer-to-peer network.
1. Dynamic query: A requesting peer randomly selects a peer from each group as the forwarder; the query then starts by sending a query message to the forwarder of Group-0. This forwarder sends the query in Layer-0. If there are not enough hit results, the query messages are sent to the Group-1 and Group-2 peers in Layer-1. If the results returned from Layer-0 and Layer-1 do not reach the desired amount, the query reaches the final layer, Layer-2. Figure 3 illustrates an example of the modified dynamic query; a short code sketch of this layer-by-layer search is given after this list.
2. Random walks: We adopt an F-walker random walk algorithm in which each walker sends queries in each group at most two times, where F is set to 25. The requesting peer sends 25 walkers, and the walkers start walking in Group-0. Walkers send queries in each group at most two times and end in Group-3 or Group-6 of Layer-2. Examples of the routing paths are shown in Figure 4 and Figure 5. There are two cases in which a walker stops: one is when the walker has finished the query, and the other is when a hit occurs.
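As referenced in the dynamic query item above, the following sketch outlines the layer-by-layer search in plain Python; the `group_search` callables and the desired-result threshold are hypothetical placeholders for whatever a real implementation would use.

```python
def modified_dynamic_query(query, layers, desired_results=4):
    """Search Layer-0 first, then Layer-1, then Layer-2, and stop as soon as
    enough hits have been collected. `layers` is a list of layers, each a list
    of per-group search callables returning a list of hits for the query."""
    results = []
    for layer in layers:
        for group_search in layer:
            results.extend(group_search(query))
        if len(results) >= desired_results:
            break   # enough hits; do not flood the lower layers
    return results
```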


Figure 8. Average hit files

SIMULATION RESULTS
In our simulations, there are 10000 peers in the unstructured peer-to-peer network, and these peers are divided into 3 layers and 7 groups. The basic flooding size F is set to 25 peers for each group. The timeout value T is 2500. The file names in our simulations are the titles of 1268 distinct books, and the popularity of the file names is modeled by the Mandelbrot-Zipf distribution. Each book has an average of 7 keywords. Finally, we issue a total of 20000 query iterations in each simulation, and the Mandelbrot-Zipf parameters of the query keywords are the same as those used for the file names.
As suggested by Share, we tried two different experimental setups of the Mandelbrot-Zipf distribution, one with s = 0.4 and q = 60, and the other with s = 0.7 and q = 5. In the following experiments, their results are slightly different. We use m-zipf(0.7, 5) and m-zipf(0.4, 60) to distinguish these two setups. In the configuration with m-zipf(0.7, 5), we inserted a total of 3415 files into the whole network, and we assigned 1866 files to the network configured with m-zipf(0.4, 60).
In (Yinglian & O'Hallaron, 2002), the authors only experimented on the DiCAS peer-to-peer network with a 2-layer configuration. We compare the query success rates of DiCAS in different layer configurations. The network degree varies from 5 to 25, and the cache size is set to 50. The dynamic query search method was implemented and modified for DiCAS. First, we want to know how the flooding size is reduced when we insert more layers into DiCAS. The simulations run in 2- to 4-layer configurations. Due to the use of the index caching mechanism and the dynamic query search method, the flooding sizes decrease as more query iterations are submitted.


Figure 9. Average duplicated rate per hit

The number of desired files for dynamic query is set to 4. After the 10000th query iteration, the flooding sizes become more stable than before, so we calculate the average flooding size from the 10000th query iteration to the 20000th one. Figure 6 shows the average flooding sizes of the different multilayer configurations in DiCAS.
The average flooding sizes in DiCAS are larger than 850 peers in the original 2-layer configuration with both m-zipf(0.7, 5) and m-zipf(0.4, 60). As the number of layers grows, the flooding sizes decrease evidently; in the 4-layer configuration, the flooding size is only half that of the 2-layer configuration. We can also notice in this figure that the traffic with m-zipf(0.4, 60) is larger than with m-zipf(0.7, 5).

Figure 10. Average messages of DiCAS


Figure 11. Average messages of the AID



Compared with the results of DiCAS, the proposed method has a smaller flooding size. Furthermore, more layers in DiCAS reduce the query success rate, as shown by the simulation results in Figure 7. In order to make a fair comparison, the success rates are also calculated from the success hit counts between the 10000th and the 20000th query iteration. During that period, the success rates are more stable than before. In the 2-layer configuration, the hit rates are close to 95%, but the query success rate in the 4-layer configuration with m-zipf(0.4, 60) is lower than 85%. From this figure we can see that the query success rate decreases as the number of layers increases.
For our AID scheme with two different search methods, dynamic query (DQ) and random walks (RW), the query success rates of each search method can be seen in Figure 7. The success rates of each method are all higher than 98%, and some of them reach 99%. Obviously, the success rate is closer to 1 in the proposed AID scheme than in DiCAS. Figure 6 also shows that the average flooding sizes of each search method in the AID scheme are smaller than in DiCAS. Figure 8 shows the number of hit files per query for DiCAS and the AID scheme.
The other problem with index caching mechanisms like DiCAS is that we can get a lot of redundant indices from the hit messages, but these indices cover only a few different files. In other words, too many indices are replicated in the system. Although more layers in the network could reduce the replicated indices per query, the success rate becomes lower. We also calculated the average replication rate per query, shown in Figure 9, to compare DiCAS and our AID scheme. The ratios in the AID scheme are much smaller than in DiCAS, lying between 1.8 and 2.6. This shows that our AID scheme eliminates more of the hit messages for indices that are replicated in each search. We can also compare the total messages in DiCAS and in our AID scheme. The results are shown in Figure 10 and Figure 11. When the networks become stable, the average number of messages per query in the DiCAS approach is larger than 1000, but our AID scheme costs only 270 to 340 messages. Finally, the Mandelbrot-Zipf distribution with s = 0.7 and q = 5 in DiCAS gets more files and a higher success rate with a smaller flooding size. The main difference between our AID scheme and the DiCAS scheme is that our modified search methods try to maximize the search success rate, so the success rate values in the AID scheme are quite similar across the different Mandelbrot-Zipf configurations.


DISCUSSIONS AND FUTURE TRENDS

We understand that the popularity of files cannot be determined by a single period of time T alone. In other
words, deciding which layer to diffuse a file to is better done if we also take past hit counts into
consideration. Otherwise, if the popularity of files varies rapidly, files will bounce between layers after each time period T even when it is unnecessary. Hence we introduce the concept
of a moving average, which smooths short-term oscillations and makes the value representing the
popularity of files more faithful in the case of a short-term abnormal burst or lull. Assume HN is the
hit count observed in the time interval [(N-1)*T, N*T], and W is the weight we assign to HP(N-1), which
is the value representing the popularity of files in the time interval [(N-2)*T, (N-1)*T]; HN * (1-W) means that
we assign the weight (1-W) to HN with respect to the weight W given to HP(N-1). We can use an exponential
moving average to estimate the popularity of files HP(N) in the time interval [(N-1)*T, N*T] as follows.
HP(N) = HP(N-1) * W + HN * (1 - W)    (11)

Furthermore, we could define popular files more precisely by using a learning algorithm. By training on a large amount of data, such a method can decide more effectively which layer to diffuse files to.

CONCLUSION
The experimental results show that our architecture is efficient in keeping a high success rate of queries and reducing the flooding size. The proposed approach also results in an index table with fewer redundant indices than the existing system DiCAS, making efficient use of storage without incurring too much traffic. Flooding and caching according to hashed keywords, as in DiCAS, may cause flooding in layers that do not contain the desired files and result in a lower query hit rate. This problem is even worse when the system is configured with more layers. Compared with DiCAS, the query success rate improves from 95% to 98% and the message traffic is reduced by 73% to 80%. The replicated indices per query are also decreased to 13%. This shows that the proposed unstructured peer-to-peer system is scalable and efficient in reducing traffic and increasing the hit rate.

REFERENCES
Ambastha, N., Beak, I., Gokhale, S., & Mohr, A. (2003). A cache-based resource location approach for
unstructured P2P network architectures. Graduate Research Conference, Department of Computer Science, Stony Brook University, NY.
Androutsellis-Theotokis, S., & Spinellis, D. (2004). A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36(4), 335-371. doi:10.1145/1041680.1041681
Chao, C.-H. (2006, April). An Interest-based architecture for peer-to-peer network systems. In Proceedings of the International Conference AINA.


Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., & Shenker, S. (2003). Making gnutella-like p2p
systems scalable. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures,
and Protocols for Computer Communications (pp. 407-418).
Cheng, A. H., & Joung, Y. J. (2006). Probabilistic file indexing and searching in unstructured peer-to-peer networks. Computer Networks, 50(1), 106-127. doi:10.1016/j.comnet.2005.12.008
Cohen, B. (2002). BitTorrent Protocol 1.0. Retrieved from BitTorrent.org.
Fisk, A. (2003). Gnutella dynamic query protocol v. 0.1. Retrieved from http://www9.limewire.com/
developer/dynamic query.html.
Kalogeraki, V., Gunopulos, D., & Zeinalipour-Yazti, D. (2002). A local search mechanism for peer-to-peer networks. In Proceedings of the Eleventh International Conference on Information and Knowledge
Management (pp. 300-307).
Lv, Q., Cao, P., Cohen, E., Li, K., & Shenker, S. (2002). Search and replication in unstructured peer-to-peer networks. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (pp. 258-259).
Ripeanu, M., Foster, I., & Iamnitchi, A. (2002). Mapping the gnutella network: properties of large-scale
peer-to-peer systems and implications for system design. IEEE Internet Computing, 6(1), 50-57.
Saleh, O., & Hefeeda, M. (2006). Modeling and caching of peer-to-peer traffic. In Proc. of 14th IEEE
International Conference on Network Protocols (ICNP'06) (pp. 249-258).
Silagadze, Z. (1997). Citations and the Mandelbrot-Zipf's law. Complex Systems, 11, 487-499.
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for Internet applications. In ACM SIGCOMM, August (pp. 149-160).
Tsoumakos, D., & Roussopoulos, N. (2006). Analysis and comparison of p2p search methods. In
Proceedings of the 1st International Conference on Scalable Information Systems (INFOSCALE 2006),
No. 25.
Wang, C., Xiao, L., Liu, Y., & Zheng, P. (2004). Distributed caching and adaptive search in multilayer P2P
networks. In International Conference on Distributed Computing Systems (ICDCS'04) (pp. 219-226).
Yang, B., & Garcia-Molina, H. (2002). Improving search in peer-to-peer networks. In Proceedings of
the 22nd International Conference on Distributed Computing Systems (ICDCS'02) (p. 5).
Yinglian, X., & O'Hallaron, D. (2002). Locality in search engine queries and its implications for caching.
In Proceedings of the IEEE Infocom (pp. 1238-1247).
Zhang, X. Y., Zhang, Q., Zhang, Z., Song, G., & Zhu, W. (2004). A construction of locality-aware overlay
network: mOverlay and its performance. IEEE Journal on Selected Areas in Communications, 22(1),
18-28. doi:10.1109/JSAC.2003.818780


KEY TERMS AND DEFINITIONS

Dynamic Query: The search technique that Gnutella adopted to reduce traffic overhead and make
searches more efficient, where a search reaches only those clients that are likely to have the files and
stops as soon as the program has acquired enough search results.
Peer-to-Peer Networks: A network with equal peer nodes that simultaneously function as both clients
and servers to the other nodes on the network.
Random Walk: A search algorithm used in a peer-to-peer network where the query will be forwarded
up and down the given list until a match is found, the query is aborted, or it reaches the limits of the
list.
Structured Peer-to-Peer Networks: A peer-to-peer network that employs a consistent protocol, such
as a distributed hashing function, to ensure that any node can efficiently route a search to some peer that
has the desired file. Consistent hashing is also used to distribute files to peers.
Uniform Index Caching: A cache scheme where query results are cached in all peers along the
query path.
Unstructured Peer-to-Peer Networks: A peer-to-peer network that is formed when the overlay
links are established arbitrarily.


Chapter 7

Hierarchical Structured
Peer-to-Peer Networks
Yong Meng Teo
National University of Singapore, Singapore
Verdi March
National University of Singapore, Singapore
Marian Mihailescu
National University of Singapore, Singapore

ABSTRACT
Structured peer-to-peer networks are scalable overlay network infrastructures that support Internet-scale
network applications. A globally consistent peer-to-peer protocol maintains the structural properties of
the network with peers dynamically joining, leaving and failing in the network. In this chapter, the authors
discuss hierarchical distributed hash tables (DHT) as an approach to reduce the overhead of maintaining
the overlay network. In a two-level hierarchical DHT, the top-level overlay consists of groups of nodes
where each group is distinguished by a unique group identifier. In each group, one or more nodes are
designated as supernodes and act as gateways to nodes at the second level. Collisions of groups occur
when concurrent node joins result in the creation of multiple groups with the same group identifier. This
has the adverse effects of increasing the lookup path length due to a larger top-level overlay, and the
overhead of overlay network maintenance. They discuss two main approaches to address the group collision problem: collision detection-and-resolution, and collision avoidance. As an example, they describe
an implementation of hierarchical DHT by extending Chord as the underlying overlay graph.

INTRODUCTION
Structured peer-to-peer systems or distributed hash tables (DHT) are self-organizing distributed systems
designed to support efficient and scalable lookups with dynamic network topology changes. Nodes are
organized as structured overlay networks, and data is mapped to nodes in the overlay network based
on their identifiers. There are two main types of structured peer-to-peer architectures: flat and hierarchical. A flat DHT (Alima, 2003; Ratnasamy, 2001; Stoica, 2001; Rowstron, 2001; Maymounkov, 2002;
Zhao, 2001) organizes nodes into one overlay network, in which each node has the same responsibility
and uses the same rules for routing messages. On the other hand, a hierarchical DHT organizes nodes
into a multi-level overlay network with the primary aim of reducing the maintenance overhead of its
overlay network. In a peer-to-peer system, peers join and leave the system dynamically. A process called
stabilization updates the routing information maintained in each peer so as to keep the overlay network
up-to-date (Ghinita, 2006).
A hierarchical DHT employs a multi-level overlay network where the top-level overlay consists of
logical groups (Garcés-Erice, 2003; Harvey, 2003; Karger, 2004; Mislove, 2004; Tian, 2005; Xu, 2003;
Zhao, 2003). Each group, which consists of a number of nodes, is assigned a group identifier with a
specific objective such as improving administrative autonomy (Harvey, 2003; Mislove, 2004; Zhao,
2003), reducing network latency (Tian, 2005; Xu, 2003), and integrating various services into one system (Karger, 2004). Within a group, one or more nodes are selected as supernodes to act as gateways to
nodes in the group. Within each group, nodes can further form a second-level overlay network.
In this chapter, we discuss the organization of a hierarchical DHT with the aim of reducing its overlay maintenance overhead. Using a two-level hierarchical Chord as an example, the top-level overlay
network consists of groups with distinct group identifiers. However, collision of groups occurs when
two or more groups are created with the same group identifier. Collisions increase stabilization overhead
and degrade lookup performance. To address the collision problem, we discuss two main approaches:
collision detection-and-resolution, and collision avoidance.
The rest of this chapter is organized as follows. Section 2 presents an overview of flat DHT using
Chord as the example (Stoica, 2001). Three main approaches to reduce routing maintenance overhead
are introduced: hierarchical DHT, varying frequency of stabilizations, and varying number of routing
states. Extending Chord into a hierarchical Chord DHT, Section 3 discusses two differing approaches in
addressing the collision problem, namely, collision detection-and-resolution, and collision avoidance.
Section 4 summarizes this chapter and discusses open issues.

DISTRIBUTED HASH TABLES

A distributed hash table (Gummadi, 2003; Hsiao, 2003; Ratnasamy, 2002; Stribling, 2004) is a decentralized lookup scheme designed to provide scalable lookups, i.e., shorter lookup path length with high
result guarantee and reduced number of false negative answers. The DHT protocol provides an interface
to retrieve a key-value pair. A key is an identifier assigned to a resource; traditionally this key is a hash
value associated with the resource. A value is an object to be stored into DHT; this could be the shared
resource itself such as a file, an index or a pointer to a resource, or a resource metadata. An example of
a key-value pair is <SHA1(file name), http://peer-id/file>, where the key is the SHA1 hash of the file's
name and the value is the address (location) of the file.
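As a minimal illustration (the m value, file name, and helper name are our own assumptions, not part of the cited systems), such a key-value pair could be produced as follows:

import hashlib

m = 160  # identifier space size in bits; SHA-1 digests are 160 bits

def make_key(file_name):
    # Hash the file name and interpret the digest as an integer identifier.
    digest = hashlib.sha1(file_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** m)

key = make_key("song.mp3")            # SHA1(file name), reduced to the identifier space
value = "http://peer-id/song.mp3"     # address (location) of the shared file
pair = (key, value)                   # the key-value pair to be stored in the DHT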
To support scalable lookups with high result guarantee, DHT exploits the following:
1. Key-to-node mapping: Assuming that keys and nodes share the same identifier space, DHT maps key k to node n, where n is the node closest to k in the identifier space; we refer to n as the responsible node of k. The key-to-node mapping improves result guarantee because searching for a key-value pair equals locating the node responsible for the key (Loo, 2004).
2. Data-item distribution: Key-value pairs, also called data items, with key equal to k are stored at node n independent of the owners of these key-value pairs. This is implemented in DHT by a store operation (Dabek, 2003; Rhea, 2005). The concept of data-item distribution has been further exploited for various optimizations, including load balancing (Godfrey, 2004; Godfrey, 2005; Karger, 2004) and high availability (Dabek, 2001; Ghodsi, 2005a; Kubiatowicz, 2000; Landers, 2004; Leslie, 2006).
3. Structured overlay network: Searching and storing a key-value pair requires routing of the request to a responsible node. To achieve scalable routing, nodes are organized as a structured overlay network. A structured overlay network exhibits two main properties: (i) it resembles a graph and is organized into a network topology such as a ring (Rowstron, 2001; Stoica, 2001), a torus (Ratnasamy, 2001), or a tree (Aberer, 2003; Maymounkov, 2002), and (ii) each node uses its identifier to position itself in the structured overlay network. The tradeoffs among different overlay topologies are routing performance and the overhead of maintaining routing states.

Figure 1. Chord lookup

As an example of DHT implementation, we discuss Chord which supports O(log N)-hop lookup path
length and maintains O(log N) routing states per node, where N denotes the total number of nodes (Stoica,
2001). Chord organizes nodes as a ring that represents an m-bit one-dimensional circular identifier space,
and as a consequence, all arithmetic is modulo 2^m. To form a ring overlay, each node n maintains two
pointers to its immediate neighbors as shown in Figure 1(a). The successor pointer points to successor(n),
the immediate clockwise neighbor of n. Similarly, the predecessor pointer points to predecessor(n), the immediate counter-clockwise neighbor of n.

Figure 2. Join operation in Chord
In Chord, every piece of data is assigned an m-bit identifier called a key. Key k is then mapped onto
successor(k), the first node whose identifier is equal to or greater than k in the identifier space (Figure
1(b)). Thus, node n is responsible for keys in the range of (predecessor(n), n], i.e. keys that are greater
than predecessor(n) but smaller than or equal to n. For example, node 32 is responsible for all keys in
(21, 32]. All key-value pairs whose key equals k are then stored on successor(k) regardless of who
owns the key-value pairs. This distribution of keys is called data-item distribution.
Finding key k implies that we route a request to successor(k). To achieve scalable routing, each node
n maintains a finger table of m entries as shown in Figure 1(c). Each entry in this table is also called a
finger. The i-th finger of n is denoted as n.finger[i] and points to successor(n + 2^(i-1)), where 1 ≤ i ≤ m. Note
that the first finger is also the successor pointer, while the largest finger divides the circular identifier space
into two halves. When N < 2^m, the finger table consists of only O(log N) unique entries (Stoica, 2001).
By utilizing finger tables, Chord locates successor(k) in O(log N) hops with high probability (Stoica,
2001). Intuitively, the process resembles a binary search where each step halves the distance to successor(k).
Thus, each node n forwards a request to the nearest known preceding node of k. This is repeated until the
request arrives at predecessor(k), the node whose identifier precedes k, which will forward the request to
successor(k). Figure 1(d) shows an example of finding successor(54) initiated by node 8. Node 8 forwards
the request to its sixth finger which points to node 48. Node 48 is the predecessor of key 54 because its
first finger points to node 56 and 48 < 54 ≤ 56. Finally, node 48 will forward the request to node 56.
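To make the routing procedure concrete, the following self-contained Python sketch simulates greedy finger-table lookup on a small static ring; the node identifiers are chosen to match the example above, 0-based finger indices are used, and it is only an illustration of the idea, not the protocol's message-level pseudocode.

M = 6                       # bits in the identifier space, so the ring has 2^6 = 64 positions
RING = 2 ** M

def between(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.finger = []        # finger[i] points to successor(id + 2^i)

def build_ring(ids):
    ids = sorted(ids)
    nodes = {i: Node(i) for i in ids}
    def successor(k):
        return next((i for i in ids if i >= k), ids[0])
    for n in nodes.values():
        n.finger = [nodes[successor((n.id + 2 ** i) % RING)] for i in range(M)]
    return nodes

def find_successor(start, key):
    """Greedy routing: repeatedly forward to the closest preceding finger of key."""
    n = start
    while not between(key, n.id, n.finger[0].id):      # finger[0] is the successor pointer
        nxt = next((f for f in reversed(n.finger)
                    if between(f.id, n.id, key) and f.id != key), None)
        if nxt is None or nxt is n:
            break
        n = nxt
    return n.finger[0]

nodes = build_ring([1, 8, 14, 21, 32, 38, 48, 56])
print(find_successor(nodes[8], 54).id)   # 8 -> 48 -> answer 56, as in Figure 1(d)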
Figure 2 illustrates the construction of a Chord ring. A new node n joins a Chord ring by locating
its own successor. Then, n inserts itself between successor(n) and the predecessor of successor(n), as illustrated in Figure 2(a). The key-value pairs stored on successor(n) whose keys are less than or equal to n
are migrated to node n (Figure 2(b)). Because the join operation invalidates the ring overlay, every node
periodically invokes a maintenance process called stabilization to correct its successor and predecessor
pointers (Figure 2(c)), and its remaining fingers.
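A minimal sketch of this join-then-stabilize behavior, assuming a simplified local object model with a single successor and predecessor pointer per node (the helper and class names are ours, not Chord's published pseudocode):

def in_open(x, a, b):
    """True if x lies strictly in the circular interval (a, b)."""
    return (a < x < b) if a < b else (x > a or x < b)

class Peer:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

def join(n, successor_of_n):
    """n joins the ring: it only learns its successor; stabilization fixes the rest."""
    n.predecessor = None
    n.successor = successor_of_n

def stabilize(n):
    """Run periodically by every node to correct its successor pointer."""
    x = n.successor.predecessor
    if x is not None and in_open(x.id, n.id, n.successor.id):
        n.successor = x                # a newly joined node sits between n and its successor
    notify(n.successor, n)

def notify(s, n):
    """s learns that n may be its predecessor and corrects its pointer if needed."""
    if s.predecessor is None or in_open(n.id, s.predecessor.id, s.id):
        s.predecessor = n

Repeated rounds of stabilize() at every node converge the successor and predecessor pointers after joins; fingers are refreshed by an analogous periodic procedure.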
A number of approaches have been proposed to reduce the maintenance overhead of DHT. We classify these approaches into three main categories: hierarchical DHT, varying frequency of stabilizations,
and varying number of routing states. The last two approaches are applicable directly to both flat and
hierarchical DHTs.


HIERARCHICAL DHT
In hierarchical DHT, nodes are organized as a two-level overlay network. The top-level overlay consists
of logical groups of nodes, where each group is identified by a group identifier (gid). In each group,
one or more nodes are designated as supernodes and act as gateways to the nodes at the second level.
Each node is assigned an identifier consisting of two subfields: a unique node identifier as is common
in DHT to distinguish different peers, and a group identifier to reflect the node's group. For example, in
compute-cycle sharing, a group identifier denotes the type of shared resource or processor type (March,
2007). Grouping of shared resources by processor types facilitates resource discovery and allocation.
Figure 3 shows a hierarchical Chord system (Garcés-Erice, 2003), where nodes with the same gid
form a group and the groups are organized in the top-level overlay network. Routing in the top-level and
the second-level overlay are based on the group identifier and the node identifier, respectively.
A hierarchical DHT groups nodes based on various properties to achieve specific objectives. Examples include:
1. Grouping by administrative domains improves administrative autonomy and reduces latency (Harvey, 2003; Mislove, 2004; Zhao, 2002);
2. Grouping by physical proximity reduces network latency (Tian, 2005; Xu, 2003);
3. Grouping by services promotes the integration of services into one system (Karger, 2004).

In terms of topology maintenance, the hierarchical structure has the following advantages compared
to the flat structure:
1. Lower overhead of overlay maintenance: Maintenance of a structured overlay network involves the correction of nodes' routing states to adapt to the dynamic events of nodes joining, leaving, or failing. Since the hierarchical structure partitions nodes into multiple overlays, each of which is smaller than a flat overlay, maintenance messages are routed only in one of these smaller overlays. This speeds up the correction of routing states while reducing the number of stabilization messages processed by each node.
2. Isolation of churn: Topology changes within a group due to churn, i.e., continuous changes due to node joins, leaves, or failures, do not affect the top-level overlay or other groups. Stable overlay topologies improve the result guarantee of DHT lookups.

However, when new nodes join such a hierarchical DHT system, collisions of groups may occur.
Collisions result in the top-level overlay containing two or more groups with the same group identifier,
and increase the size of the overlay. For example, in a join operation, a new node first requests a bootstrap node to locate an existing group identified by gid. However, when the bootstrap node belongs to another group gid' and some routing states in the top-level overlay are incorrect, the bootstrap node may fail to locate group gid. Thus, instead of joining group gid, the new node creates a new group with the same gid.
Collisions increase the size of the top-level overlay, which in turn increases the lookup path length
and the total number of stabilization messages. In the worst case, collisions lead to the degeneration of
the hierarchical structure into the flat structure, where every node occupies the top-level overlay. If the
number of groups is c times larger than the number of ideal groups1, the lookup path length is increased

144

Hierarchical Structured Peer-to-Peer Networks

Figure 3. Hierarchical structured chord

by O(log c) hops, but the total number of stabilization messages is increased by (c) times.
There are two main approaches to address the problem of collisions in hierarchical DHT systems:
1. Collision detection and resolution: With this approach, collisions are allowed to occur, but it is the responsibility of the hierarchical DHT system to detect collisions and merge the colliding groups into a single group (March, 2005). In systems such as the hierarchical Chord-based DHT (Garcés-Erice, 2003), Diminished Chord (Karger, 2004), Hieras (Xu, 2003) and HONet (Tian, 2005), collisions can occur but the problem is not directly addressed. They assume that collisions can be resolved by mechanisms inherent in the system structure, and the extent of collisions is not studied.
2. Collision avoidance: In hierarchical DHT systems, schemes can be devised to ensure that collisions do not occur. This can be achieved through collision-free join protocols (Teo, 2008) or collision-free grouping policies (Harvey, 2003; Karger, 2004; Mislove, 2004; Xu, 2003; Zhao, 2003). A collision-free join protocol such as that in (Teo, 2008) uses the predecessor node to serialize the join lookup operation. All nodes in the overlay network maintain accurate fingers, and new groups are reflected instantaneously by the predecessor supernode. The leave protocol is also modified to ensure the correctness of the finger table, the successor pointers, and the predecessor pointers. Thus, a departing supernode notifies its successor and predecessor to update their pointers accordingly. As long as the fingers are maintained in an accurate state, collisions do not occur.

In hierarchical DHTs such as Brocade (Zhao, 2003), SkipNet (Harvey, 2003), and hierarchical Scribe (Mislove, 2004), collisions do not occur because a new node always chooses a
bootstrap node from the same group. In such systems, nodes are grouped by their administrative domain. Therefore, it is natural for the new node to choose a bootstrap node from the
same administrative domain. This grouping policy guarantees that multiple groups with the
same group identifier are not created. However, such systems do not address other grouping
policies that can introduce collisions, i.e., when a new node is bootstrapped from a node in


a different group.
In (Karger, 2004; Xu, 2003), all nodes in a group are assumed to be supernodes. Hence, collisions do not occur. However, the size of the top-level overlay, with or without collisions,
is the same. In addition, the top-level overlay is larger than in systems where only a subset of
nodes become supernodes. Thus, the total number of stabilization messages is increased
because more supernodes have to perform stabilization.

VARYING FREQUENCY OF STABILIZATION

Frequency-based approaches such as adaptive stabilization (Castro, 2004; Ghinita, 2006), piggybacking
stabilization with lookups (Alima, 2003; Li, 2005), and reactive stabilization (Alima, 2003) reduce the
maintenance overhead by reducing the frequency in invoking routing-state correction procedures. Adaptive
stabilization adjusts the frequency based on churn rate and the importance of each routing state to lookup
performance2. Systems such as DKS (Alima, 2003) and Accordion (Li, 2005) piggyback stabilization
with lookups to reduce the necessity of performing dedicated periodic stabilization; DKS refers to this
as correction-on-use. Reactive stabilization such as DKS's correction-on-change (Ghodsi, 2005) does
away altogether with periodic stabilization. Instead, changes to overlay networks due to membership
changes are propagated immediately when membership-change events are detected. However, Rhea
et al. reported that reactive stabilization can increase maintenance overhead under high churn rate and
constrained bandwidth availability (Rhea, 2004).
As an example, we discuss the stabilization mechanism in DKS (Distributed k-ary Search). DKS is
proposed as a framework that generalizes different DHT implementations as a k-ary search, e.g., Chord
is an instance of DKS when k = 2 (Alima, 2003). Rather than periodic stabilization, DKS maintains
its overlay network based on three main principles: local atomic operations, correction-on-use, and
correction-on-change. With the local atomic operations, DKS serializes concurrent node insertions/
leaves between two existing adjacent nodes. This reduces the number of incorrect successor and predecessor pointers during churn. However, the local atomic join does not correct other routing states
such as fingers affected by the churn. These routing states will be corrected by correction-on-use and
correction-on-change.
The correction-on-use technique piggybacks stabilization during lookup processes. If the number
of lookup messages is high, then the overlay network can be maintained without a need for dedicated
stabilizations. Essentially, a routing table entry is not corrected until it is used during lookups. To realize
correction-on-use, every lookup message contains information about the position of the receiver from
the sender's perspective3. If the receiver determines that the information (i.e., the sender's perspective regarding the position of the receiver) is wrong, then the receiver advises the sender about the correct information (to the best of the receiver's knowledge). The disadvantage of correction-on-use is that the
speed at which the overlay network is corrected depends on the amount of lookup traffic. To address
this disadvantage, DKS also employs correction-on-change: after a new node joins, it notifies all nodes
that need to be updated.


VARYING SIZE OF ROUTING TABLES

This approach reduces the size of routing tables so that the number of routing states to correct becomes
smaller. Examples of DHT that implement this approach include CAN (Ratnasamy, 2001), Koorde
(Kaashoek, 2003), and Accordion (Li, 2005). However, reducing the size of routing tables potentially
increases lookup path length (Xu, 2003).
In Accordion (Li, 2005), the size of routing tables is controlled through the process of acquisition and
eviction of routing states. The rate of state acquisition is determined by a specified bandwidth budget,
while the rate of state eviction is influenced by the churn rate. During acquisition, new states are added
into a routing table. Accordion couples DKS's correction-on-use approach with explicit stabilization. The frequency of explicit stabilization is constrained by the bandwidth budget. During eviction, a node removes routing entries that point to nodes perceived to be non-existent. In addition, Accordion favors routing states that point to nodes with a longer lifetime; pointers to relatively newer nodes have a higher probability of being evicted. Thus, a higher bandwidth budget increases routing-table size, whereas a higher
churn rate reduces it.
Besides reducing the size of routing tables, DHT can also partition each routing table into two parts:
one part consisting of entries that are corrected through stabilization, and the other part consisting of
cached entries. This reduces the maintenance overhead while achieving a shorter lookup path length.
For example, in the latest implementation of Chord, a finger table consists of O(log N) fingers, and a
number of location caches maintained by an LRU replacement policy (Stoica, 2001).
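As a small sketch of such a two-part routing table (our own simplification with hypothetical names; the actual Chord implementation differs in detail), stabilized fingers can be kept separately from a bounded LRU cache of locations learned during lookups:

from collections import OrderedDict

class RoutingState:
    def __init__(self, fingers, cache_size=64):
        self.fingers = list(fingers)     # O(log N) entries, corrected by periodic stabilization
        self.cache = OrderedDict()       # key identifier -> node, filled by lookup results
        self.cache_size = cache_size

    def remember(self, key_id, node):
        """Cache a node learned from a lookup; evict the least recently used entry if full."""
        self.cache[key_id] = node
        self.cache.move_to_end(key_id)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)

    def candidates(self):
        """Routing entries considered when choosing the next hop."""
        return list(self.cache.values()) + self.fingers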

Hierarchical Chord
A hierarchical Chord partitions its nodes into a multi-level overlay network. Because nodes join a smaller
overlay network than in a flat structure, each node maintains and corrects a smaller number of routing
states than in a flat structure. Figure 4 shows an example of hierarchical Chord. In hierarchical Chord,
each node is assigned a group identifier (gid) and a unique node identifier (nid). We use the notation
gid|nid to denote the group identifier and node identifier of each node.
Nodes with the same gid form a group and groups are organized in the top-level as Chord overlay
network. Within each group, nodes are organized as a second-level overlay using the node identifier. The
topology and stabilization mechanism can differ from the top-level. In each group, one or more nodes
designated as supernodes act as gateways to other nodes in the group. In Figure 4, node 0|5, node 2|7,
node 4|2, and node 6|4 are respectively the supernodes of groups g0, g2, g4, and g6.
In hierarchical Chord, a lookup request for key k implies locating the group responsible for k. Figure
5 illustrates the process. Firstly, a lookup request for key k is routed to the supernode of the initiating
group. Secondly, using the Chord lookup algorithm (Stoica, 2001), the lookup request is further routed to
the supernode of the group whose group identifier is gid = successor(k). Thirdly, the lookup request can be
further forwarded to one of the second-level nodes in group k based on additional criteria. As shown in
Figure 5, a lookup request for key 2, initiated by second-level node 6|6, is forwarded to its supernode
6|4 (step 1). In the top-level overlay, the lookup request is routed to supernode 2|7 of group 2 (step 2).
Finally, supernode 2|7 can further forward the request to its second-level nodes (step 3), e.g., lookup for
compute resources of type 2 in multiple administrative domains (Teo, 2005).
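A high-level sketch of this three-step routing (our own rendering; the two helper functions stand in for the ordinary top-level Chord lookup and for the group-internal criteria of step 3, both of which are abstracted away here):

def hierarchical_lookup(node, key, top_level_lookup, pick_second_level):
    """Route a lookup for `key` in a two-level hierarchical Chord (sketch)."""
    # Step 1: forward the request to the supernode of the initiating group.
    supernode = node if node.is_super else node.supernode

    # Step 2: ordinary Chord routing on the top-level overlay of supernodes,
    # using group identifiers, reaches the supernode of gid = successor(key).
    responsible = top_level_lookup(supernode, key)

    # Step 3: optionally forward to a second-level node of that group,
    # according to additional criteria (e.g., administrative domain).
    return pick_second_level(responsible, key)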
If new nodes join a hierarchical Chord when some routing states in the top-level overlay are incorrect,
i.e., yet to be updated, then the top-level overlay may end up with two or more groups with the same


Figure 4. A two-level overlay network consisting of four groups

group identifier. This is called collisions of groups. In the following subsections, we discuss how collisions occur and present a collision detection and resolution scheme, and a collision avoidance scheme.
To avoid sending additional overhead messages, collision detection is performed together with successor stabilization, i.e., the process of correcting successor pointers. This is because successful collision
detections require the successor pointers in the top-level Chord overlay to be correct, and the correctness
of the successor pointers is maintained by stabilization.
In presenting our algorithm, we assume that each node maintains a list of variables shown in Table
1. The algorithm adopts the same convention as in (Stoica, 2001), where remote procedure calls or
variables are preceded by the remote node identifier, while the local procedure calls and variables omit
the local node identifier.
Figure 5. Example of lookup in hierarchical Chord


Table 1. Variables maintained by node n in hierarchical Chord

Variable      Description
gid           m-bit group identifier
nid           m-bit node identifier
successor     pointer to successor(gid) if n is a supernode, nil otherwise
predecessor   pointer to predecessor(gid) if n is a supernode, nil otherwise
is_super      true if n is a supernode, false otherwise
supernode     pointer to supernode of group gid if node is a supernode, nil otherwise

COLLISIONS OF GROUP IDENTIFIERS

Collisions of group identifiers arise because of join operations invoked by nodes. Figure 6 shows the node-join algorithm for hierarchical Chord. Node n, whose group identifier is denoted as n.gid, makes a request to join group g through a bootstrap node n'. In a hierarchical Chord, this means finding successor(g|0) in the top-level overlay. If n' successfully finds an existing group g, then n joins this group using a group-specific protocol (lines 5-9). However, if n' returns a group g' > g, then n creates a new group with identifier g (lines 11-15). A collision occurs if the new group is created even though a group with identifier g already exists. This happens when n and the bootstrap node n' are in two different groups, and the top-level overlay has not fully stabilized, i.e., some supernodes' successor pointers are yet to be updated.
Figure 7 illustrates a collision scenario in which nodes 1|2 and 1|3, belonging to the same group g1, join concurrently. Due to the concurrent joins, find_successor() invoked by both nodes returns node 2|7. As a result, the two node joins create two groups with the same group identifier g1.
Collisions increase the maintenance overhead in the top-level Chord ring by Ω(c) times. Let K denote the number of groups and N denote the number of nodes. Assuming that each group assigns one supernode, the ideal size of the top-level overlay is K supernodes. Without collisions, the total number of stabilization messages is denoted as S. With collisions, the size of the top-level overlay is increased by c times, i.e., to cK groups. As each group performs periodic stabilization, the cost of stabilization with collisions (SC) is Ω(cS). The stabilization cost ratio, with and without collisions, is shown in Equation 1.
SC / S = (cK log2 cK) / (K log2 K) = (c log2 cK) / (log2 K) = Ω(c)    (1)

Collisions also increase the lookup path length in the top-level Chord by O(log c) hops. Without collisions, the top-level Chord ring consists of K supernodes, and hence, the lookup path length is O(log
K). With collisions, the size of the top-level overlay becomes cK and the lookup path length is O(log
cK) = O(log c + log K) hops.
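As a concrete illustration with numbers of our own choosing: if the ideal top-level overlay has K = 64 groups and collisions inflate it by c = 4 to cK = 256 supernodes, Equation 1 gives SC / S = (256 * log2 256) / (64 * log2 64) = 2048 / 384, i.e., roughly 5.3 times more stabilization messages, while the lookup path length grows only from O(log2 64) = O(6) to O(log2 256) = O(8) hops, an increase of log2 c = 2 hops.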


Figure 6. Join operation

COLLISION DETECTION AND RESOLUTION SCHEME

Collisions can be detected during successor stabilization. This is achieved by extending Chord's stabilization so that it not only checks and corrects the successor pointer of supernode n, but also detects whether n and its new successor should be in the same group. Figure 8 presents a collision detection algorithm. It first ensures that the successor pointer of a node is valid (lines 4-5). It then checks for a potential collision (lines 8-10), before updating the successor pointer to point to the correct node (lines 11-13).

Figure 7. Collision at the top-level overlay

Figure 8. Collision detection algorithm
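As a rough Python illustration of this idea (our own structure and names, not the chapter's Figure 8 pseudocode), collision detection can be piggybacked on the successor stabilization of each supernode:

def in_open(x, a, b):
    """True if x lies strictly in the circular interval (a, b)."""
    return (a < x < b) if a < b else (x > a or x < b)

def stabilize_and_detect(n):
    """Successor stabilization of supernode n, extended with collision detection."""
    x = n.successor.predecessor
    if x is not None and x is not n:
        if x.gid == n.gid:
            # x carries the same group identifier as n but is a different node:
            # two instances of the group exist, so n's group is merged into x's.
            merge_groups(n, x)
            return
        if in_open(x.gid, n.gid, n.successor.gid):
            n.successor = x            # ordinary successor correction, as in flat Chord
    # (the usual notification of the successor would follow here)

def merge_groups(loser, winner):
    """Placeholder for the resolution step (supernode- or node-initiated merging)."""
    pass

Note that with more than one supernode per group this simple gid test over-detects, which is why the chapter extends the check with a per-group set of supernodes (Figure 10(b)).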
Figure 9 illustrates the collision detection process. In Figure 9(a), a collision occurs when nodes 1|2
and 1|3 belonging to the same group, group 1, join concurrently. In Figure 9(b), node 1|3 stabilizes and
causes node 2|7 to set its predecessor pointer to node 1|3 (step 1). Then, the stabilization by node 0|5
causes 0|5 to set its successor pointer to node 1|3 (step 2), and node 1|3 to set its predecessor pointer to node 0|5 (step 3). In Figure 9(c), the stabilization by node 1|2 causes 1|2 to set its successor pointer to node 1|3. At this time, a collision is detected by node 1|2 and is resolved by merging 1|2 into 1|3.

Figure 9. Collision detection piggybacks successor stabilization
If each group contains more than one supernode, then the is_collision routine shown in Figure 8 may incorrectly detect collisions. Consider the example in Figure 10(a). When node n stabilizes, it incorrectly detects a collision with node n' because n.successor.predecessor = n' and n'.gid = n.gid. An approach to avoid this problem is for each group to maintain a set of its supernodes (Garcés-Erice, 2003; Gupta, 2003) so that each supernode can accurately decide whether a collision has occurred. The modified collision detection algorithm is shown in Figure 10(b).
To resolve collisions, groups associated with the same gid are merged. After the merging, some
supernodes, depending on the group policy, become ordinary nodes. Before a supernode changes its
state into a second-level node, the supernode notifies its successors and predecessors to update their
pointers (see Figure 11). Nodes in the second level also need to be merged into the new group. We
discuss two methods to merge groups, namely supernode initiated and node initiated.


Figure 10. Collision detection for groups with several supernodes

Supernode Initiated
To merge the two colliding groups, supernode n notifies its second-level nodes to join the group of the other supernode n' (Figure 12). The advantage of this approach is that second-level nodes join the new group as soon as a collision is detected. However, n needs to keep track of its group membership. If n has only partial knowledge of group membership, some nodes in the second level can become orphans.

Figure 11. Announce leave to preceding and succeeding supernodes


Figure 12. Collision resolution: supernode-initiated approach

Node Initiated
In node-initiated merging, each second-level node periodically checks that its known supernode n is still
a valid supernode (Figure 13). If n is no longer a supernode, then the second-level node will ask n to
find the correct supernode. These second-level nodes then join a new group through the new supernode.
This approach does not require supernodes to track group membership. However, it introduces additional overhead for the second-level nodes, as they periodically check the status of their supernode.
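A minimal sketch of this node-initiated check (our own naming; locate_supernode and join_through are assumed helpers):

def check_supernode(node, locate_supernode, join_through):
    """Run periodically by a second-level node to verify its recorded supernode."""
    old = node.supernode
    if old.is_super:
        return                           # the recorded supernode is still valid
    # The old supernode has stepped down (e.g., after a group merge):
    # ask it to find the current supernode of this group, then rejoin through it.
    new_super = locate_supernode(old, node.gid)
    node.supernode = new_super
    join_through(new_super, node)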

COLLISION AVOIDANCE SCHEME

Avoiding collisions has the following advantages:
1. Lower overhead: Runaway collisions are very costly, and detecting and resolving collisions is highly difficult in a decentralized and dynamic peer-to-peer system with a high churn rate (Teo, 2008);
2. Reduced bootstrap time: New peers can join the network at a faster rate because the time between the join event and the update of the underlying overlay network states is reduced;
3. Improved lookup performance: Without collisions, the top-level overlay is maintained at the ideal size;
4. Faster resource availability: As costly collision resolution is not necessary, resources are available as soon as nodes join the network.


Figure 13. Collision resolution: node-initiated approach

In the join operation in Figure 6, a node performs a lookup for the group identifier, which is handled
by the supernode of the successor group. If the joining node and the supernode that responds to the
lookup have the same group identifier, the node joins the second-level overlay. Collisions occur when
concurrent joins create multiple new groups with the same group identifier in the first-level overlay.
This scenario arises because before the routing states are updated, each joining node is unaware of the
existence of other joining nodes.
To avoid collisions due to join requests, the join protocol is modified such that the predecessor node
handles the join lookup request instead of the successor node. The rationale behind this change is that
all join requests are serialized at the predecessor. If the group identifier of the successor's supernode
is different from the group identifier of the joining node, then the predecessor immediately changes its
successor pointer to reflect the new group created by the joining node. Thus, this modification allows
the overlay network to reveal new groups to subsequent joining nodes and make them available to incoming lookups.


Figure 14. Collision-free join operation

Join Protocol
The detailed join algorithm shown in Figure 14 is divided into the following steps:
1. A joining node performs a lookup for group gid, which is routed in the top-level overlay to the supernode whose identifier is successor(gid|0) (line 3).
2. If a group for the resource type exists, a supernode has already been created for the resource type and the joining node becomes a member of the second-level overlay (lines 5-7).
3. If a group for the resource type does not exist, the joining node becomes the supernode of a newly created group. The joining node then sets its predecessor and successor pointers accordingly (lines 9-11). In addition, the supernode in step 1 updates its successor pointer to the joining node.
4. Stabilization is used by the new supernode to build a finger table (line 12).
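A condensed sketch of these steps (our own rendering, not the code of Figure 14; find_predecessor and join_second_level are assumed helpers), in which the predecessor supernode serializes the join lookup as described above:

def collision_free_join(n, bootstrap, find_predecessor):
    """n joins a two-level hierarchical Chord without creating a colliding group."""
    # Step 1: the join lookup for group gid is answered by the predecessor supernode.
    pred = find_predecessor(bootstrap, n.gid)
    succ = pred.successor

    if succ.gid == n.gid:
        # Step 2: a group with this gid already exists; join its second-level overlay.
        succ.join_second_level(n)
        return

    # Step 3: no such group exists; n becomes the supernode of a new group and the
    # predecessor immediately exposes it by updating its successor pointer.
    n.is_super = True
    n.predecessor, n.successor = pred, succ
    pred.successor = n
    # Step 4: n builds its finger table through subsequent stabilization.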


Figure 15. Leave operation

Leave Protocol
When a supernode leaves, its group becomes an orphan group if the supernode is the only one in the
group. If a new node attempts to join the orphan group, then a collision occurs because the new node
cannot locate the orphan group in the top-level overlay. Hence, a new group is created in the top-level
overlay whose group identifier is the same as that of the orphan group. To prevent this type of collision,
the departing supernode notifies its first-level overlay successor and predecessor to update their finger
tables. Furthermore, a new supernode needs to be elected for the orphan group to prevent collisions
during subsequent node joins.
Figure 15 presents a simple-but-costly leave protocol that reuses our collision-free join operation
(Figure 14) to elect new supernodes. In this protocol, the orphan group is disbanded and all its members are forced to rejoin the system. Thus, the node that completes its join operation first becomes
the new supernode.

Failures
A more complex case that leads to collisions is when supernodes fail. A supernode failure invalidates other nodes' successor pointers and finger tables. While inaccurate finger tables only degrade lookup performance, inaccurate successor pointers lead to collisions. However, avoiding collisions due to supernode failures is a challenging problem. Unlike departures (Section 3.3.2), where supernodes leave the overlay network gracefully, failures can be viewed as supernodes leaving the overlay network silently.
This means that there is no notification to the overlay network to indicate that any collision avoidance
procedures should be triggered. Hence, it is necessary for the system to detect the presence of supernode failures so that any corrective measures can be initiated, e.g. the collision detection-and-resolution


scheme presented in Section 3.2.

SUMMARY AND OPEN ISSUES

Efficient lookup is an essential service in peer-to-peer applications. In structured peer-to-peer systems,
dynamic joining and leaving of peers and failing of peer nodes change the structural properties of the
overlay network. Stabilization, the process of overlay network maintenance, is a necessary overhead and impacts lookup performance. In this chapter, we discuss three main approaches to reducing overlay maintenance overhead, namely, hierarchical DHT, varying the frequency of stabilizations, and varying the number of routing states. We discuss in more detail hierarchical DHT, where nodes are organized as multi-level overlay networks. In hierarchical DHT, collisions of groups occur when concurrent node joins result in multiple groups with the same group identifier being created at the top-level overlay. Collisions increase the size of the top-level overlay by a factor c, which increases the lookup path length by only O(log c) hops but increases the total number of stabilization messages by Ω(c) times. To address the collision problem, we first present a collision detection-and-resolution scheme with two approaches to merge colliding groups, namely, supernode-initiated and node-initiated. Though the effect of collisions can be reduced by collision detection and resolution, the message overhead is high. A collision avoidance scheme in which join and leave operations are collision free is then discussed.
The open issues of group collisions in hierarchical DHT include:
1. Current experimental results on both the collision detection-and-resolution and the collision avoidance schemes assume that node joins, leaves, and failures occur exclusively (March, 2005; Teo, 2008). However, in practice, these three events are interleaved and are important when the network churn rate is high. Thus, in addition to the frequency of the top-level overlay's stabilizations during collision detection (March, 2005), churn also impacts how often second-level nodes should check the status of their supernode during the node-initiated collision resolution approach. An adaptive method similar to (Ghinita, 2006) is a possible direction; however, this has not been studied in detail.
2. When a supernode leaves, the current collision-free leave protocol uses a simple but naive approach to deal with orphan groups, where all the second-level nodes are forced to rejoin the hierarchical DHT. A more efficient approach is required. For example, an efficient distributed election scheme can be used to select a supernode among the second-level nodes, and only the elected supernode joins the top-level overlay.
3. Node failures are unplanned, and collisions that arise due to node failures are therefore harder to address. Avoiding collisions due to supernode failures is a challenge. We envisage two possible solutions, both using multiple supernodes. Firstly, each group employs a number of backup supernodes so that the collision-free join protocol is able to resolve the problem of the orphan group before redirecting new nodes to the group. Alternatively, each group can have multiple supernodes in the top-level overlay, but this is at the expense of a larger top-level overlay.


REFERENCES
Aberer, K., Cudré-Mauroux, P., Datta, A., Despotovic, Z., Hauswirth, M., & Punceva, M. (2003). P-Grid:
A self-organizing structured p2p system. SIGMOD Record, 32(3), 29-33. doi:10.1145/945721.945729
Alima, L. O., El-Ansary, S., Brand, P., & Haridi, S. (2003). DKS (N, k, f): A Family of Low Communication, Scalable and Fault-tolerant Infrastructures for P2P Applications. In Proceedings of the 3rd IEEE Intl.
Symp. on Cluster Computing and the Grid (pp. 344-350). New York: IEEE Computer Society Press.
Androutsellis-Theotokis, S., & Spinellis, D. (2004). A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36(4), 335-371. doi:10.1145/1041680.1041681
Castro, M., Costa, M., & Rowstron, A. (2004). Performance and Dependability of Structured Peer-toPeer Overlays. In Proceedings of the 2004 Intl. Conf. on Dependable Systems and Networks (pp. 9-18).
New York: IEEE Computer Society Press.
Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., & Stoica, I. (2001). Wide-Area Cooperative Storage with CFS. In Proceedings of the 11th ACM Symp. on Operating Systems Principles (pp. 202-215).
New York: ACM Press.
Dabek, F., Zhao, B. Y., Druschel, P., Kubiatowicz, J., & Stoica, I. (2003). Towards a Common API for
Structured Peer-to-Peer Overlays. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems
(pp. 33-44). Berlin: Springer-Verlag.
Garcés-Erice, L., Biersack, E. W., Felber, P. A., Ross, K. W., & Urvoy-Keller, G. (2003). Hierarchical
Peer-to-Peer Systems. In Proceedings of the 9th Intl. Euro-Par Conf. (pp. 1230-1239). Berlin: Springer-Verlag.
Ghinita, G., & Teo, Y. M. (2006). An adaptive stabilization framework for distributed hash tables. In
Proceedings of the 20th IEEE Intl. Parallel and Distributed Processing Symp. New York: IEEE Computer Society Press.
Ghodsi, A., Alima, L. O., & Haridi, S. (2005). Low-bandwidth topology maintenance for robustness
in structured overlay networks. In Proceedings of 38th Hawaii Intl. Conf. on System Sciences (p. 302).
New York: IEEE Computer Society Press.
Ghodsi, A., Alima, L. O., & Haridi, S. (2005a). Symmetric replication for structured peer-to-peer systems. In Proceedings of the 3rd Intl. Workshop on Databases, Information Systems and Peer-to-Peer
Computing (p. 12). Berlin: Springer-Verlag.
Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2004). Load balancing in dynamic
structured p2p systems. In Proceedings of INFOCOM (pp. 2253- 2262). New York: IEEE Press.
Godfrey, P. B., & Stoica, I. (2005). Heterogeneity and load balance in distributed hash tables. In Proceedings of INFOCOM (pp. 596-606). New York: IEEE Press.
Gummadi, K., Gummadi, R., Gribble, S., Ratnasamy, S., Shenker, S., & Stoica, I. (2003). The impact
of dht routing geometry on resilience and proximity. In Proceedings of ACM SIGCOMM (pp. 381-394).
New York: ACM Press.


Gupta, I., Birman, K., Linga, P., Demers, A., & Renesse, R. V. (2003). Kelips: Building an efficient and
stable P2P DHT through increased memory and background overhead. In Proceedings of the 2nd Intl.
Workshop on Peer-to-Peer Systems (pp. 160-169). Berlin: Springer-Verlag.
Harvey, N. J., Jones, M. B., Saroiu, S., Theimer, M., & Wolman, A. (2003). SkipNet: A scalable overlay
network with practical locality properties. In Proceedings of the 4th USENIX Symp. on Internet Technologies and Systems (pp. 113-126). USENIX Association.
Hsiao, H.-C., & King, C.-T. (2003). A tree model for structured peer-to-peer protocols. In Proceedings
of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid (pp. 336-343). New York: IEEE Computer Society Press.
Kaashoek, M. F., & Karger, D. R. (2003). Koorde: A simple degree-optimal distributed hash table. In
Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 98-107). Berlin: Springer-Verlag.
Karger, D. R., & Ruhl, M. (2004). Diminished chord: A protocol for heterogeneous subgroup. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 288-297). Berlin: Springer-Verlag.
Karger, D. R., & Ruhl, M. (2004). Simple, efficient load balancing algorithms for peer-to-peer systems.
In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 131-140). Berlin: Springer-Verlag.
Kubiatowicz, J., Bindel, D., Chen, Y., Eaton, P., Geels, D., Gummadi, R., et al. (2000). OceanStore: An
Architecture for Global-Scale Persistent Storage. In Proceedings of the 9th Intl. Conf. on Architectural
Support for Programming Languages and Operating Systems (pp. 190-201). New York: ACM Press.
Landers, M., Zhang, H., & Tan, K.-L. (2004). PeerStore: Better performance by relaxing in peer-to-peer
backup. In Proceedings of the 4th Intl. Conf. on Peer-to-Peer Computing (pp. 72-79). New York: IEEE
Computer Society Press.
Leslie, M., Davies, J., & Huffman, T. (2006). replication strategies for reliable decentralised storage. In
Proceedings of the 1st Workshop on Dependable and Sustainable Peer-to-Peer Systems (pp. 740-747).
New York: IEEE Computer Society Press.
Li, J., Stribling, J., Gil, T. M., Morris, R., & Kaashoek, M. F. (2004). Comparing the performance of
distributed hash tables under churn. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems
(pp. 87-99). Berlin: Springer-Verlag.
Li, J., Stribling, J., Morris, R., & Kaashoek, M. F. (2005). Bandwidth-efficient management of dht
routing tables. In Proceedings of 2nd Symp. on Networked Systems Design and Implementation (pp.
99-114). USENIX Association.
Loo, B. T., Huebsch, R., Stoica, I., & Hellerstein, J. M. (2004). The case for a hybrid p2p search infrastructure. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 141-150). Berlin:
Springer-Verlag.
March, V., Teo, Y. M., Lim, H. B., Eriksson, P., & Ayani, R. (2005). Collision detection and resolution in
hierarchical peer-to-peer systems. In Proceedings of the 30th IEEE Conf. on Local Computer Networks
(pp. 2-9). New York: IEEE Computer Society Press.


March, V., Teo, Y. M., & Wang, X. (2007). DGRID: A DHT-based resource indexing and discovery
scheme for computational grids. In Proceedings of the 5th Australasian Symp. on Grid Computing and
e-Research (pp. 41-48). Australian Computer Society, Inc.
Maymounkov, P., & Mazieres, D. (2002). Kademlia: A peer-to-peer information system based on the
XOR metric. In Proceedings of the 1st Intl. Workshop on Peer-to-Peer Systems (pp. 53-65). Berlin:
Springer-Verlag.
Mislove, A., & Druschel, P. (2004). Providing administrative control and autonomy in structured peer-to-peer overlays. Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 162-172). Berlin:
Springer-Verlag.
Oram, A. (2001). Peer-to-Peer: Harnessing the power of disruptive technologies. O'Reilly.
Rao, A., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2003). Load Balancing in structured
P2P systems. Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 68-79). Berlin:
Springer-Verlag.
Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable
network. Proceedings of ACM SIGCOMM (pp. 161-172). New York: ACM Press.
Ratnasamy, S., Stoica, I., & Shenker, S. (2002). Routing algorithms for DHTs: Some open questions.
Proceedings the 1st Intl. Workshop on Peer-to-Peer Systems (pp. 45-52). Berlin: Springer-Verlag.
Rhea, S., Geels, D., Roscoe, T., & Kubiatowicz, J. (2004). Handling Churn in a DHT. Proceedings of
the USENIX (pp. 127-140). USENIX Association.
Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., et al. (2005). OpenDHT:
A public DHT service and its uses. In Proceedings of ACM SIGCOMM (pp. 73-84). New York: ACM
Press.
Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of IFIP/ACM Intl. Conf. on Distributed Systems Platforms
(pp. 329-350). Berlin: Springer-Verlag.
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of ACM SIGCOMM (pp. 149-160). New
York: ACM Press.
Teo, Y. M., & Mihailescu, M. (2008). Collision avoidance in hierarchical peer-to-peer systems. In Proceedings of 7th Intl. Conf. on Networking (pp. 336-341). New York: IEEE Computer Society Press.
Tian, R., Xiong, Y., Zhang, Q., Li, B., Zhao, B. Y., & Li, X. (2005). Hybrid Overlay Structure Based
on Random Walks. In Proceedings of the 4th Intl. Workshop on Peer-to-Peer Systems (pp. 152-162).
Berlin: Springer-Verlag.
Xu, J. (2003). On the fundamental tradeoffs between routing table size and network diameter in peer-to-peer networks. In Proceedings of INFOCOM (pp. 2177-2187). New York: IEEE Press.


Xu, Z., Min, R., & Hu, Y. (2003). HIERAS: A DHT based hierarchical p2p routing algorithm. In Proceedings of the 2003 Intl. Conf. on Parallel Processing (pp. 187-194). New York: IEEE Computer
Society Press.
Zhao, B. Y., Duan, Y., Huang, L., Joseph, A., & Kubiatowicz, J. (2003). Brocade: landmark routing
on overlay networks. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 34-44).
Berlin: Springer-Verlag.
Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (2001). Tapestry: An infrastructure for fault-tolerant wide-area location and routing (Tech. Rep.). UC Berkeley, Computer Science Department, Berkeley, CA.

KEY TERMS AND DEFINITIONS

Chord: A structured overlay network with nodes organized as a logical ring.
Churn: Changes in overlay networks due to dynamic node joins, leaves, or failures.
Collision of Groups: An occurrence in which two or more groups with the same group identifier occupy the top-level overlay network.
Distributed Hash Table: A class of distributed systems in which keys are mapped onto nodes and nodes are organized as a structured overlay network to support a scalable lookup service.
Finger: An entry in each node's routing table (finger table) in Chord.
Key-Value Pair: A tuple consisting of a unique identifier (key) and an object (value) to be stored in a DHT.
Predecessor: The immediate counter-clockwise neighbor of a node in Chord.
Successor: The immediate clockwise neighbor of a node in Chord.
Supernode: A gateway node to a second-level hierarchical overlay network.
Stabilization: A procedure to keep the routing information of each peer node up to date.

ENDNOTES

1. Size of the top-level overlay without collision.
2. Routing states with higher importance, such as successor pointers in Chord (Stoica, 2001) and leaf sets in Pastry (Rowstron, 2001), are refreshed/corrected more frequently. This is possible due to the k-ary model.

Chapter 8

Load Balancing in Peer-to-Peer Systems

Haiying Shen
University of Arkansas, USA

DOI: 10.4018/978-1-60566-661-7.ch008

ABSTRACT
Structured peer-to-peer (P2P) overlay networks like Distributed Hash Tables (DHTs) map data items to
the network based on a consistent hashing function. Such mapping for data distribution has an inherent
load balance problem. Thus, a load balancing mechanism is an indispensable part of a structured P2P
overlay network for high performance. The rapid development of P2P systems has posed challenges in
load balancing due to their features characterized by large scale, heterogeneity, dynamism, and proximity.
An efficient load balancing method should be flexible and resilient enough to deal with these characteristics. This chapter first introduces P2P systems and the load balancing problem in P2P systems. It then
introduces the current technologies for load balancing in P2P systems, and provides a case study of a
dynamism-resilient and proximity-aware load balancing mechanism. Finally, it indicates the future and
emerging trends of load balancing, and concludes the chapter.

1. INTRODUCTION
A peer-to-peer (P2P) overlay network is a logical network on top of a physical network in which peers are organized without any centralized coordination. Each peer has equivalent responsibilities, and offers both client and server functionalities to the network for resource sharing. Over the past years, the immense popularity of P2P resource sharing services has produced a significant stimulus to content-delivery overlay network research (Xu, 2005). An important class of the overlay networks is structured
P2P overlays, i.e. distributed hash tables (DHTs), that map keys to the nodes of a network based on a
consistent hashing function (Karger, 1997). Representatives of the DHTs include CAN (Ratnasamy,
2001), Chord (Stoica, 2003), Pastry (Rowstron, 2001), Tapestry (Zhao, 2001), Kademlia (Maymounkov,
2002), and Cycloid (Shen, 2006); see (Shen, 2007) and references therein for the details of the representatives of the DHTs.
In a DHT overlay, each node and key has a unique ID, and each key is mapped to a node according to
the DHT definition. The ID space of each DHT is partitioned among the nodes, and each node is responsible for the keys whose IDs fall within its portion of the space. For example, in Chord, a key is stored in the node whose ID is equal to or immediately succeeds the key's ID. However, a downside of consistent hashing
is uneven load distribution. In theory, consistent hashing produces a bound of O(log n) imbalance of
keys between nodes, where n is the number of nodes in the system (Karger, 1997). Load balancing is an
indispensable part of DHTs. The objective of load balancing is to prevent nodes from being overloaded
by distributing application load among the nodes in proportion to their capacities.
Although the load balancing problem has been studied extensively in a general context of parallel and
distributed systems, the rapid development of P2P systems has posed challenges in load balancing due to
their features characterized by large scale, heterogeneity, dynamism/churn, and proximity. An efficient
load balancing method should be flexible and resilient enough to deal with these characteristics. Network churn represents a situation where a large percentage of nodes and items join, leave, and fail continuously and rapidly, leading to an unpredictable P2P network size. Effective load balancing algorithms should work
for DHTs with and without churn and meanwhile be capable of exploiting the physical proximity of the
network nodes to minimize operation cost. By proximity, we mean that the logical proximity abstraction
derived from DHTs does not necessarily match the physical proximity information in reality. In the past,
numerous load balancing algorithms were proposed with different characteristics (Stoica, 2003; Rao,
2003; Godfrey, 2006; Zhu, 2005; Karger, 2006). This chapter is dedicated to providing the reader with
a complete understanding of load balancing in P2P overlays.
The rest of this chapter is organized as follows. In Section 2, we will give an in-depth background
of load balancing algorithms in P2P overlays. We move on to present the load balancing algorithms
discussing their goals, properties, initialization, and classification in Section 3. Also, we will present a
case study of a dynamism-resilient and locality-aware load balancing algorithm. In Section 4, we will
discuss the future and emerging trends in the domain of load balancing, and present the current open
problems in load balancing from the P2P overlay network perspective. Finally, in Section 5 we conclude
this chapter.

2. BACKGROUND
Over the past years, the immense popularity of the Internet has produced a significant stimulus to P2P file sharing systems. A recent study of large-scale traffic characterization (Saroiu, 2002) shows that more
than 75% of Internet traffic is generated by P2P applications. Load balancing is an inherent problem in
DHTs based on consistent hashing functions. Karger et al. proved that the consistent hashing function in
Chord (Karger, 1997) leads to a bound of O(log n) imbalance of keys between the nodes. Load imbalance
adversely affects system performance by overloading some nodes, while preventing a P2P overlay from
taking full advantage of all resources. One main goal of P2P overlays is to harness all available resources
such as CPU, storage, and bandwidth in the P2P network so that users can efficiently and effectively
access files. Therefore, load balancing is crucial to achieving high performance of a P2P overlay. It helps
to avoid overloading nodes and make full use of all available resources in the P2P overlay.


Load balancing in DHT networks remains challenging because of their two unique features:

1. Dynamism. A defining characteristic of DHT networks is dynamism/churn. A great number of nodes join, leave, and fail continually and rapidly, leading to an unpredictable network size. A load balancing solution should be able to deal with the effect of churn. The popularity of items may also change over time. A load balancing solution that works for static situations does not necessarily guarantee good performance in dynamic scenarios. Skewed query patterns may also result in a considerable number of visits at hot spots, hindering efficient item access.
2. Proximity. A load balancing solution tends to utilize proximity information to reduce the load balancing overhead. However, the logical proximity abstraction derived from DHTs does not necessarily match the physical proximity information in reality. This mismatch becomes a big obstacle for the deployment and performance optimization of P2P applications.

In addition, DHT networks are often highly heterogeneous. With the increasing emergence of diversified end devices on the Internet equipped with various computing, networking, and storage capabilities, the heterogeneity of participating peers of a practical P2P system is pervasive. This requires a load
balancing solution to distribute not only the application load (e.g. file size, access volume), but also the
load balancing overhead among the nodes in proportion to their capacities.
Recently, numerous load balancing methods have been proposed. The methods can be classified into
three categories: virtual server, load transfer and ID assignment or reassignment. Virtual server methods
(Stoica, 2003; Godfrey, 2005) map keys to virtual servers, whose number is much larger than the number of real servers. Each real node runs a number of virtual servers, so that each real node is responsible for O(1/n)
of the key ID space with high probability. Load transfer methods (Rao, 2003; Karger, 2004; Zhu, 2005)
move load from heavily loaded nodes to lightly loaded nodes to achieve load balance. ID assignment or
reassignment methods (Bienkowski, 2005; Byers, 2003) assign a key to a lightly loaded node among a
number of options, or reassign a key from a heavily loaded node to a lightly loaded node.

3. LOAD BALANCING METHODS


3.1 Examples of Load Balancing Methods
In this section we will review various load balancing methods that have been proposed for structured
P2P overlays over the last few years. For each method, we will review its goals, algorithms, properties,
and pros and cons.

3.1.1 Virtual Server


Basic virtual server method. Consistent hashing leads to a bound of O(log n) imbalance of keys between
nodes. Karger et al. (Karger, 1997) pointed out that the O(log n) can be reduced to an arbitrarily small
constant by having each node run Θ(log n) virtual nodes, each with its own identifier. If each real node
runs v virtual nodes, all bounds should be multiplied by v. Based on this principle, Stoica et al. (2003)
proposed an abstraction of virtual servers for load balancing in Chord. With the virtual server method,
Chord makes the number of keys per node more uniform by associating keys with virtual nodes, and mapping multiple virtual nodes (with unrelated identifiers) to each real node. This provides a more uniform
coverage of the identifier space. For example, if Θ(log n) randomly chosen virtual nodes are allocated to each real node, with high probability each of the n bins will contain O(log n) virtual nodes (Motwani, 1995). The virtual server-based approach to load balancing is simple in concept, and there is no need to change the underlying DHT. However, the abstraction incurs a large space overhead and compromises lookup efficiency. The storage for each real server increases from O(log n) to O(log^2 n), and the network traffic increases considerably, by a factor of Θ(log n). In addition, node joins and departures generate high overhead for nodes to update their routing tables. The virtual server abstraction thus simplifies the treatment of the load balancing problem at the cost of higher space overhead and compromised lookup efficiency. Moreover, the original concept of virtual servers ignores node heterogeneity.
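To make the virtual server idea concrete, the following minimal sketch (not taken from the chapter; the class name VirtualServerRing and the parameter num_virtual are illustrative) builds a consistent-hashing ring in which each physical node owns several virtual IDs, so keys spread more evenly across physical nodes.

```python
import hashlib
import bisect

def hash_id(value: str) -> int:
    # Map any string to a point on the ID ring (here a 32-bit space).
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** 32)

class VirtualServerRing:
    """Consistent hashing ring where each physical node runs several virtual servers."""

    def __init__(self, num_virtual: int = 8):
        self.num_virtual = num_virtual
        self.ring = []        # sorted list of (virtual ID, physical node)

    def join(self, node: str):
        # Each physical node is mapped to num_virtual points on the ring.
        for v in range(self.num_virtual):
            self.ring.append((hash_id(f"{node}#vs{v}"), node))
        self.ring.sort()

    def lookup(self, key: str) -> str:
        # A key is stored at the first virtual ID clockwise from the key's ID (as in Chord).
        kid = hash_id(key)
        idx = bisect.bisect_left(self.ring, (kid, "")) % len(self.ring)
        return self.ring[idx][1]

ring = VirtualServerRing(num_virtual=8)
for n in ["A", "B", "C", "D"]:
    ring.join(n)
print(ring.lookup("some-file.mp3"))   # physical node responsible for the key
```

Increasing num_virtual smooths out the key distribution at the cost of more routing state per physical node, which is exactly the trade-off discussed above.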
Y0 DHT protocol. Godfrey and Stoica (Godfrey, 2005) addressed the problems of the virtual server method by arranging for each real server to own a virtual ID space of consecutive virtual IDs. This reduces the load imbalance
from O(log n) to a constant factor. The authors developed a DHT protocol based on Chord, called Y0,
that achieves load balancing with minimal overhead under the assumption that the load is uniformly
distributed in the ID space. The authors proved that Y0 can achieve near-optimal load balancing with
low overhead, and it increases the size of the routing tables by at most a constant factor.
Y0 is based on the concept of virtual servers, but with a twist: instead of picking k virtual servers with random IDs, a node clusters those IDs in a random fraction Θ(k/n) of the ID space. This allows the node to share a single set of overlay links among all k virtual servers. As a result, the number of links per physical node is still Θ(log n), even with Θ(log n) virtual servers per physical node. To deal with
node heterogeneity, Y0 arranges higher-capacity nodes to have a denser set of overlay links, and allows
lower-capacity nodes to be less involved in routing. It results in reduced lookup path length compared
to the homogeneous case in which all nodes have the same number of overlay links. Y0 leads to more significant improvement than Chord with the original virtual server concept, because its placement of virtual servers
provides more control over the topology.
Real-world simulation results show that Y0 reduces the load imbalance of Chord from O(log n) to less than 3.6 without increasing the number of links per node. In addition, the average path length is significantly reduced as node capacities become increasingly heterogeneous. For a real-world distribution of node capacities, the path length in Y0 is asymptotically less than half the path length in the case
of a homogeneous system.
Y0 operates under the uniform load assumption that the load of each node is proportional to the size
of the ID space it owns. This is reasonable when all objects generate similar load (e.g., have the same
size), the object IDs are randomly chosen (e.g., are computed as a hash of the object's content), and the number of objects is large compared to the number of nodes (e.g., Ω(n log n)). However, some of these
cases may not hold true in reality.
Virtual node activation. In virtual server methods, to maintain connectivity of the network, every
virtual node needs to periodically check its neighbors to ensure their updated status. More virtual nodes
will lead to higher overhead for neighbor maintenance. Karger and Ruhl (Karger, 2004) coped with the
virtual server problem by arranging for each real node to activate only one of its O(log n) virtual servers
at any given time. The real node occasionally checks its inactive virtual servers and may migrate to one
of them if the distribution of load in the system has changed. Since only one virtual node is active, the
overhead for neighbor information storage and neighbor maintenance will not be increased in a real node.
As in Chord with the original virtual server method, this scheme gives each real node a small number of addresses on the Chord ring, preserving Chord's protection against address spoofing by malicious
nodes trying to disrupt the routing layer. Combining the virtual node activation load-balancing scheme
with the Koorde routing protocol (Kaashoek, 2003), the authors got a protocol that simultaneously offers (i) O(log n) degree per real node, (ii) O(log n/log log n) lookup hops, and (iii) constant factor load
balance. The authors claimed that previous protocols could achieve any two of these but not all three.
Generally speaking, achieving (iii) required operating O(log n) virtual nodes, which pushed the degree to O(log^2 n) and failed to achieve (i).

3.1.2 ID Assignment or Reassignment


In this category of load balancing methods, most proposals are similar in that they consider a number
of (typically, Θ(log n)) locations for a node and select the one which gives the best load balance. The
proposals differ in which locations should be considered, and when the selection should be conducted
(Godfrey, 2005). Some proposals let a newly-joined node select a location, while others let
nodes re-select a location when a node is overloaded.
Naor and Wieder (2003) proposed a method in which a node checks Θ(log n) random IDs when joining, and chooses the ID which leads to the best load balance. They show that this method produces a maximum share of 2 if there are no node deletions. Share is an important metric for evaluating the performance of a load balancing method (Godfrey, 2005). Node v's share is defined as

share(v) = f_v / (c_v / n),

where f_v is the fraction of the ID space assigned to node v, and c_v is the normalized capacity of node v such that the average capacity is 1, i.e., Σ_v c_v = n. To handle load imbalance incurred by node departures, nodes are divided into groups of Θ(log n) nodes and periodically reposition themselves in each group. Adler et al. (Adler, 2003) proposed to let a joining node randomly contact an existing node already in the DHT. The joining node then chooses an ID in the longest interval owned by one of the contacted node's O(log n) neighbors and divides that interval in half. As a result, the intervals owned by nodes have almost the same length, leading to an O(1) maximum share.
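As a quick illustration of the share metric defined above, the sketch below (illustrative only; the node capacities and ID-space fractions are made up) normalizes capacities so that they average 1 and then computes share(v) = f_v / (c_v / n) for each node.

```python
def shares(id_space_fractions, capacities):
    """Compute share(v) = f_v / (c_v / n) for each node.

    id_space_fractions: fraction of the ID space owned by each node (sums to 1).
    capacities: raw node capacities; they are normalized so the average is 1.
    """
    n = len(capacities)
    avg = sum(capacities) / n
    normalized = [c / avg for c in capacities]          # now sum(normalized) == n
    return [f / (c / n) for f, c in zip(id_space_fractions, normalized)]

# Four equal-capacity nodes: perfectly balanced ownership would give every node share == 1.
print(shares([0.25, 0.25, 0.30, 0.20], [10, 10, 10, 10]))
# A maximum share of 2 means some node owns twice its "fair" portion of the ID space.
```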
Manku (Manku, 2004) proposed a load balancing algorithm in which a newly-joined node randomly chooses a node and splits in half the largest interval owned by one of the Θ(log n) nodes adjacent to the chosen node in the ID space. This achieves a maximum share of 2 while moving at most one node ID for each node arrival or departure. It extends to balancing within a factor of 1 + ε, but moves Θ(1/ε) IDs for any ε > 0. As mentioned above, Karger and Ruhl (Karger, 2004) proposed an algorithm in which each node has O(log n) virtual nodes, i.e., IDs, and periodically selects one of them as its active ID. This achieves a maximum share of 2 + ε, but requires reassignment of O(log log n) IDs per arrival or departure.
Bienkowski et al. (2005) proposed a node departure and re-join strategy to balance the key ID intervals
across the nodes. In the algorithm, lightly loaded nodes leave the system and rejoin the system with a
new ID to share the load of heavily loaded ones. The strategy reduces the number of reassignments to
a constant, but shows only O(1) maximum share.
Byers et al. (Byers, 2003) proposed the use of the power of two choices algorithm. In this algorithm, each object is hashed to d ≥ 2 different IDs and is placed in the least loaded of the nodes
responsible for those IDs. The other nodes are given a redirection pointer to the destination node so that
searching is not slowed significantly. For homogeneous nodes and objects and a static system, picking
d = 2 achieves a load balance within a Θ(log log n) factor of optimal, and when d = Θ(log n), the load
balance is within a constant factor of optimal.
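The following minimal sketch (hypothetical helper names; not the authors' implementation) illustrates the power-of-two-choices placement described above: an object is hashed to d candidate nodes and stored on the least loaded one; in a real DHT the remaining candidates would keep a redirection pointer to the chosen node.

```python
import hashlib

def candidate_nodes(obj_key: str, nodes: list, d: int = 2) -> list:
    # Hash the object under d different salts to obtain d candidate nodes.
    picks = []
    for i in range(d):
        h = int(hashlib.sha1(f"{obj_key}#{i}".encode()).hexdigest(), 16)
        picks.append(nodes[h % len(nodes)])
    return picks

def place_object(obj_key: str, nodes: list, load: dict, d: int = 2) -> str:
    # Store the object on the least loaded of the d candidates;
    # the other candidates would only record a redirection pointer.
    cands = candidate_nodes(obj_key, nodes, d)
    target = min(cands, key=lambda n: load[n])
    load[target] += 1
    return target

nodes = [f"node{i}" for i in range(8)]
load = {n: 0 for n in nodes}
for k in range(1000):
    place_object(f"object-{k}", nodes, load, d=2)
print(load)   # node loads stay within a small factor of each other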
The ID assignment or reassignment methods reassign IDs to nodes in order to maintain the load
balance when nodes arrive and depart the system. The object transfer and neighbor update involved in
ID rearrangement would incur a high overhead. Moreover, few methods directly take into account the
heterogeneity of file load.

3.1.3 Load Transfer


The virtual server methods and key assignment and reassignment methods ignore the heterogeneity of
file load. Further load imbalance may result from non-uniform distribution of files in the identifier space
and a high degree of heterogeneity in file loads and node capacities. In addition, few of the methods are
able to deal with both the network churn and proximity. In general, the DHT churn should be dealt with
by randomized matching between heavily loaded nodes and lightly loaded nodes. Load transfer methods
to move load from heavily loaded nodes to lightly loaded nodes can deal with these problems.
Rao et al. (2003) proposed three algorithms to rearrange load based on nodes different capacities:
one-to-one, many-to-many, and one-to-many. Their basic idea is to move virtual servers, i.e. load, from
heavily loaded nodes to lightly loaded nodes so that each nodes load does not exceed its capacity.
Specifically, the method periodically collects the information of servers' load status, which helps load
rearrangement between heavily loaded nodes and lightly loaded nodes. The algorithms are different
primarily in the amount of information used to decide load rearrangement. In the one-to-one algorithm,
each lightly loaded server randomly probes nodes for a match with a heavily loaded one. In the many-to-many algorithm, each heavily loaded server sends its excess virtual nodes to a global pool, which
executes load rearrangement periodically. The one-to-one scheme produces too many probes, while
the many-to-many scheme increases overhead in load rearrangement. As a trade-off, the one-to-many
algorithm works in a way that each heavily loaded server randomly chooses a directory which contains
information about a number of lightly loaded servers, and moves its virtual servers to lightly loaded
servers until it is not overloaded anymore.
In a DHT overlay, a node's load may vary greatly over time since the system can be expected to experience continuous insertions and deletions of objects, skewed object arrival patterns, and continuous arrival and departure of nodes. To cope with this problem, Godfrey et al. (2006) extended Rao's work
(Rao, 2003) for dynamic DHT networks with rapid arrivals and departures of items and nodes. In their
approach, if a node's capacity utilization exceeds a predetermined threshold, its excess virtual servers
will be moved to a lightly loaded node immediately without waiting for the next periodic balancing.
This work studied this algorithm by using extensive simulations over a wide set of system scenarios
and algorithm parameters.
Most recently, Karger and Ruhl (2004) proved that the virtual server method could not be guaranteed
to handle item distributions where a key ID interval has more than a certain fraction of the load. As a
remedy, they proposed two schemes with provable features: moving items and moving nodes to achieve
equal load between a pair of nodes, and then achieves a system-wide load balance state. In the moving
items scheme, every node occasionally contacts a random other node. If one of the two nodes has much
larger load than the other, then items are moved from the heavily loaded node to the lightly loaded node
until their loads become equal. In the moving nodes scheme, if a pair of nodes has very uneven loads,
the load of the heavier node gets split between the two nodes by changing their addresses. However, this
scheme breaks DHT mapping and cannot support key locations as usual. Karger and Ruhl (2004) provided
a theoretical treatment of the load balancing problem and proved that good load balance can be achieved by
moving items if the fraction of address space covered by every node is O(1/n) (Karger, 2004).
Almost all of these algorithms assume the objective of minimizing the amount of moved load. The
algorithms treat all nodes equally in random probing, and neglect the factor of physical proximity on
the effectiveness of load balancing. With proximity consideration, load transferring and communication
should take place between physically close heavy and light nodes.
One of the first works to utilize the proximity information to guide load balancing is due to Zhu and
Hu (2005). They presented a proximity-aware algorithm that takes node proximity information into account in load balancing. The authors suggested building a K-nary tree (KT) structure on top of a DHT
overlay. Each KT node is planted in a virtual server. A K-nary tree node reports the load information of
its real server to its parent, until the tree root is reached. The root then disseminates final information
to all the virtual nodes. Using this information, each real server can determine whether it is heavily
loaded or not. Lightly loaded and heavily loaded nodes report their free capacity and excess virtual node information to their KT leaf nodes, respectively. The leaf nodes will propagate the information upwards
along the tree. When the total length of information reaches a certain threshold, the KT node would
execute load rearrangement between heavily loaded nodes and lightly loaded nodes. The KT structure
helps to use proximity information to move load between physically close heavily and lightly loaded
nodes. However, the construction and maintenance of KT are costly, especially in churn. In churn, a KT
will be destroyed without timely fixes, degrading load balancing efficiency. For example, when a parent
fails or leaves, the load imbalance of its children in the subtree cannot be resolved before its recovery.
Therefore, although the network is self-organized, the algorithm is hardly applicable to DHTs with
churn. Besides, the tree needs to be reconstructed every time after virtual server transferring, which is
imperative in load balancing. Second, a real server cannot start determining its load condition until the
tree root gets the accumulated information from all nodes. This centralized process is inefficient and
hinders the scalability improvement of P2P systems.

3.2 Case Study: Locality-Aware Randomized Load Balancing Algorithms


This section presents Locality-Aware Randomized load balancing algorithms (LAR) (Shen, 2007) that take
into account proximity information in load balancing and deal with network dynamism at the same time.
The algorithms take advantage of the proximity information of the DHTs in node probing and distribute
application load among the nodes according to their capacities. The LAR algorithms introduce a factor
of randomness in the probing of lightly loaded nodes in a range of proximity so as to make the probing
process robust in DHTs with churn. The LAR algorithms further improve the efficiency by allowing
the probing of multiple candidates at a time. Such a probing process is referred to as d-way probing, d ≥ 1.
The algorithms are implemented in Cycloid (Shen, 2006), based on a concept of moving item (Karger,
2004) for retaining DHT network efficiency and scalability. The algorithms are also suitable for virtual
server methods. The performance of the LAR load balancing algorithms is evaluated via comprehensive
simulations. Simulation results demonstrate the superiority of a locality-aware 2-way randomized load
balancing algorithm, in comparison with other pure random approaches and locality-aware sequential
algorithms. In DHTs with churn, it performs no worse than the best churn-resilient algorithm. In the following, the Cycloid DHT is first introduced before the LAR algorithms are presented.

Table 1. Routing table of a Cycloid node (4, 101-1-1010)

NodeID: (4, 101-1-1010)
Routing Table:
  Cubical neighbor: (3, 101-0-xxxx)
  Cyclic neighbor: (3, 101-1-1100)
  Cyclic neighbor: (3, 101-1-0011)
Leaf Sets (half smaller, half larger):
  Inside Leaf Set: (3, 101-1-1010), (6, 101-1-1010)
  Outside Leaf Set: (7, 101-1-1001), (6, 101-1-1011)

3.2.1 Cycloid: A Constant-Degree DHT


Cycloid (Shen, 2006) is a lookup-efficient constant-degree DHT that we recently proposed. In a Cycloid system with n = d · 2^d nodes, each lookup takes O(d) hops with O(1) neighbors per node. In this section,
we give a brief overview of the Cycloid architecture and its self-organization mechanism, focusing on
the structural features related to load balancing.
ID and structure. In Cycloid, each node is represented by a pair of indices (k, a_{d-1} a_{d-2} ... a_0), where k is a cyclic index and a_{d-1} a_{d-2} ... a_0 is a cubical index. The cyclic index is an integer ranging from 0 to d − 1, and the cubical index is a binary number between 0 and 2^d − 1. Each node keeps a routing table and two leaf sets, an inside leaf set and an outside leaf set, with a total of 7 entries to maintain its connectivity to the rest of the system. Table 1 shows the routing state table for node (4, 101-1-1010) in an 8-dimensional Cycloid, where x indicates an arbitrary binary value. Its corresponding links in both the cubical and cyclic aspects are shown in Figure 1.

Figure 1. Cycloid node routing links state

In general, a node (k, a_{d-1} a_{d-2} ... a_0), k ≠ 0, has one cubical neighbor (k − 1, a_{d-1} a_{d-2} ... a_k x x ... x), where x denotes an arbitrary bit value, and two cyclic neighbors (k − 1, b_{d-1} b_{d-2} ... b_0) and (k − 1, c_{d-1} c_{d-2} ... c_0). The cyclic neighbors are the first larger and first smaller nodes with cyclic index (k − 1) mod d whose most significant differing bit with the current node's cubical index is no larger than (k − 1). That is,

(k − 1, b_{d-1} ... b_1 b_0) = min{ (k − 1, y_{d-1} ... y_1 y_0) | y_{d-1} ... y_0 ≥ a_{d-1} ... a_1 a_0 },
(k − 1, c_{d-1} ... c_1 c_0) = max{ (k − 1, y_{d-1} ... y_1 y_0) | y_{d-1} ... y_0 ≤ a_{d-1} ... a_1 a_0 }.
The node with cyclic index k = 0 has no cubical neighbor or cyclic neighbors. The node with cubical index 0 has no smaller cyclic neighbor, and the node with cubical index 2^d − 1 has no larger cyclic neighbor. The nodes with the same cubical index are ordered by their cyclic index (mod d) on a local circle. The inside leaf set of a node points to the node's predecessor and successor in the local circle. The largest cyclic index node in a local circle is called the primary node of the circle. All local circles together form a global circle, ordered by their cubical index (mod 2^d). The outside leaf set of a node points to the primary nodes in its preceding and succeeding small circles in the global circle. The Cycloid connection
pattern is resilient in the sense that even if many nodes are absent, the remaining nodes are still capable
of being connected. The Cycloid DHT assigns keys onto its ID space by the use of a consistent hashing
function. For a given key, the cyclic index of its mapped node is set to the hash value modulo d, and the cubical index is set to the hash value divided by d. If the target node of an item key (k, a_{d-1} ... a_1 a_0) is not present in the system, the key is assigned to the node whose ID is first numerically closest to a_{d-1} a_{d-2} ... a_0 and then numerically closest to k.
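A small sketch of the key-to-ID mapping just described (illustrative; the dimension d and the hash function are assumptions, not taken from the Cycloid implementation): the cyclic index is the key's hash modulo d and the cubical index is the hash divided by d.

```python
import hashlib

def cycloid_id(key: str, d: int = 8):
    """Map a key to a Cycloid ID (cyclic index, cubical index) for an n = d * 2^d overlay."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16) % (d * 2 ** d)
    cyclic_index = h % d          # integer in [0, d-1]
    cubical_index = h // d        # integer in [0, 2^d - 1]
    return cyclic_index, cubical_index

k, cube = cycloid_id("some-file.mp3", d=8)
print(k, format(cube, "08b"))     # cyclic index and an 8-bit cubical index
```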
Self-organization. P2P systems are dynamic in the sense that nodes are frequently joining and
departing from the network. Cycloid deals with the dynamism in a distributed manner. When a new
node joins, it initializes its routing table and leaf sets, and notifies the nodes in its inside leaf set of its
participation. It also needs to notify the nodes in its outside leaf set if it becomes the primary node of its
local circle. Before a node leaves, it notifies its inside leaf set nodes, as well. Because a Cycloid node
has no incoming connections for cubical and cyclic neighbors, a leaving node cannot notify those who
take it as their cubical neighbor or cyclic neighbor. The need to notify the nodes in its outside leaf set
depends on whether the leaving node is a primary node or not. Updating cubical and cyclic neighbors is the responsibility of system stabilization, as in Chord.

3.2.2 Load Balancing Framework


This section presents a framework for load balancing based on item movement on Cycloid. It takes
advantage of Cycloid's topological properties and conducts a load balancing operation in two steps:
local load balancing within a local circle and global load balancing between circles. A general approach
with consideration of node heterogeneity is to partition the nodes into a super node with high capacity
and a class of regular nodes with low capacity (Fasttrack, 2001; Yang, 2003). Each super node, together
with a group of regular nodes, forms a cluster in which the super node operates as a server to the others.
All the super nodes operate as equals in a network of super-peers. Super-peer networks strike a balance
between the inherent efficiency of centralization and distribution, and take advantage of capacity heterogeneity, as well. Recall that each local circle in Cycloid has a primary node. We regard Cycloid as a
quasi-super-peer network by assigning each primary node as a leading super node in its circle. A node is designated as a supernode if its capacity is higher than a pre-defined threshold.

Table 2. Donating and starving sorted lists (load information in a primary node)

Donating sorted list (DSL): <ΔL_j, A_j>, <ΔL_m, A_m>, ...
Starving sorted list (SSL): <L_{i,1}, D_{i,1}, A_i>, ..., <L_{i,k}, D_{i,k}, A_i>, ...

The Cycloid rules are
modified slightly for node join and leave to ensure that every primary node meets the capacity requirement of supernodes. If the cyclic ID selected by a regular node is the largest in its local circle, it needs to make another choice unless it is the bootstrap node of the circle. In the case of primary node departure or failure, a supernode needs to be found to take the primary node's place if the node with the second largest cyclic ID in the circle is not a supernode. This operation can be regarded as the new supernode leaving and re-joining the system with the ID of the leaving or failing primary node. Let L_{i,k} denote the load of item k in node i. It is determined by the item size S_{i,k} and the number of visits of the item V_{i,k} during a certain time period. That is, L_{i,k} = S_{i,k} × V_{i,k}. The actual load of a real server i, denoted by L_i, is the total load of all of its items:
L_i = Σ_{k=1}^{m_i} L_{i,k},

assuming the node has m_i items. Let C_i denote the capacity of node i; it is defined as a pre-set target load which the node is willing to hold. We refer to a node whose actual load is no larger than its target load (i.e., L_i ≤ C_i) as a light node, and otherwise as a heavy node. We define the utilization of a node i, denoted by NU_i, as the fraction of its target capacity that is occupied, that is, NU_i = L_i / C_i. System utilization, denoted by SU, is the ratio of the total actual load to the total node capacity. Each node contains a list of data items, labeled D_k, k = 1, 2, .... To make full use of node capacity, the excess items chosen to transfer should have minimum load. We define the excess items of a heavy node as a subset of its resident items satisfying the following condition. Without loss of generality, we assume the excess items are {D_1, D_2, ..., D_{m'}}, 1 ≤ m' ≤ m_i, with corresponding loads {L_{i,1}, ..., L_{i,m'}}. The set of excess items is determined in such a way that it

minimizes Σ_{k=1}^{m'} L_{i,k}    (1)

subject to (L_i − Σ_{k=1}^{m'} L_{i,k}) ≤ C_i    (2)
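The definitions above translate directly into code. The sketch below (illustrative names; a simple lightest-items-first greedy choice is used to approximate the minimization in Eqs. (1)-(2)) computes the per-item load L_{i,k} = S_{i,k} × V_{i,k}, the node utilization NU_i = L_i / C_i, and a set of excess items whose removal brings a heavy node back under its target capacity.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    capacity: float                              # target load C_i the node is willing to hold
    items: dict = field(default_factory=dict)    # item id -> (size S, visits V)

    def item_load(self, item_id):
        size, visits = self.items[item_id]
        return size * visits                     # L_{i,k} = S_{i,k} * V_{i,k}

    def load(self):
        return sum(self.item_load(k) for k in self.items)   # L_i

    def utilization(self):
        return self.load() / self.capacity                  # NU_i = L_i / C_i

    def excess_items(self):
        """Greedy approximation of Eqs. (1)-(2): shed the lightest items
        until the remaining load is no larger than the capacity."""
        if self.load() <= self.capacity:
            return []                            # light node, nothing to shed
        excess, shed = [], 0.0
        need = self.load() - self.capacity
        for item_id in sorted(self.items, key=self.item_load):
            if shed >= need:
                break
            excess.append(item_id)
            shed += self.item_load(item_id)
        return excess

n = Node(capacity=100.0, items={"a": (30, 2), "b": (10, 3), "c": (20, 4)})
print(n.load(), n.utilization(), n.excess_items())   # 170.0 1.7 ['b', 'a']
```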

Each primary node has a pair of sorted donating and starving lists which store the load information of
all nodes in its local cycle. A donating sorted list (DSL) is used to store load information of light nodes
and a starving sorted list (SSL) is used to store load information of heavy nodes as shown in Table 2.
The free capacity of light node i is defined as ΔL_i = C_i − L_i. Load information of heavy node i includes the information of its excess items in a set of 3-tuples: <L_{i,1}, D_{i,1}, A_i>, ..., <L_{i,k}, D_{i,k}, A_i>, ..., <L_{i,m'}, D_{i,m'}, A_i>, in which A_i denotes the IP address of node i. Load information of light node j is represented in the form <ΔL_j, A_j>. An SSL is sorted in descending order of L_{i,k}; min L_{i,k} represents the item with the minimum load in the primary node's starving list. A DSL is sorted in ascending order of ΔL_j; max ΔL_j represents the maximum ΔL_j in the primary node's donating list. Load rearrangement is executed between a pair of DSL and SSL, as shown in Algorithm 1.
This scheme guarantees that heavier items have a higher priority to be reassigned to a light node, which
means faster convergence to a system-wide load balance state. A heavy item L_{i,k} is assigned to the most-fit light node, i.e., the light node with ΔL_j ≥ L_{i,k} that has the minimum free capacity left after item L_{i,k} is transferred to it.
It makes full use of the available capacity. Our load balancing framework is based on item movement,
which transfers items directly instead of virtual servers to save cost. Cycloid maintains two pointers for
each transferred item. When an item D is transferred from heavy node i to light node j, node i will have
a forward pointer at D's original location pointing to item D in j's place; item D will have a backward pointer to node i indicating its original host. When queries for item D reach node i, they will be redirected to node j with the help of the forward pointer. If item D needs to be transferred from node j to another node, say g, for load balancing, node j will notify node i of the item's new location via its backward pointer.
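A minimal sketch of the two-pointer bookkeeping described above (a hypothetical data structure, not the chapter's code): when an item moves, its original host keeps a forward pointer used to redirect queries, and the item keeps a backward pointer to its original host so that later moves can be reported back.

```python
class ItemDirectory:
    def __init__(self):
        self.location = {}   # item -> node currently storing it
        self.forward = {}    # (original node, item) -> current node, used to redirect queries
        self.backward = {}   # item -> original node, so later moves can update the forward pointer

    def store(self, item, node):
        self.location[item] = node

    def transfer(self, item, src, dst):
        # Move an item for load balancing while keeping lookups working.
        self.location[item] = dst
        origin = self.backward.get(item, src)   # original host that owns the DHT mapping
        self.forward[(origin, item)] = dst      # queries arriving at the origin are redirected
        self.backward[item] = origin

    def lookup(self, item, mapped_node):
        # A query first reaches the node the DHT maps the item to, then follows the pointer.
        return self.forward.get((mapped_node, item), mapped_node)

d = ItemDirectory()
d.store("D", "node_i")
d.transfer("D", "node_i", "node_j")     # i keeps a forward pointer to j
d.transfer("D", "node_j", "node_g")     # j reports the move via the backward pointer; i now points to g
print(d.lookup("D", "node_i"))          # -> node_g
```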
Algorithm 1: A primary node periodically performs load rearrangement between a pair of DSL and SSL
for each item k in the SSL do
  for each entry j in the DSL do
    if L_{i,k} ≤ ΔL_j then
      arrange item k to be transferred from node i to node j
      if ΔL_j − L_{i,k} > 0 then
        put <(ΔL_j − L_{i,k}), A_j> back into the DSL
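A runnable sketch of the rearrangement loop in Algorithm 1 (illustrative data structures; ΔL_j denotes the free capacity of light node j, and the inner loop stops after an item is matched): the heaviest excess item is matched to the light node with the smallest sufficient free capacity, and any leftover free capacity is returned to the DSL.

```python
def rearrange(ssl, dsl):
    """ssl: list of (item_load, item_id, heavy_node), i.e. the starving list.
       dsl: list of (free_capacity, light_node), kept sorted in ascending order.
       Returns a list of planned transfers (item_id, heavy_node, light_node)."""
    transfers = []
    for item_load, item_id, heavy in sorted(ssl, reverse=True):   # heaviest items first
        # find the light node with the smallest free capacity that still fits the item
        for idx, (free, light) in enumerate(dsl):
            if item_load <= free:
                transfers.append((item_id, heavy, light))
                dsl.pop(idx)
                if free - item_load > 0:
                    # return the node to the DSL with its remaining free capacity
                    dsl.append((free - item_load, light))
                    dsl.sort()
                break
    return transfers

ssl = [(80, "D3", "n1"), (30, "D7", "n2")]
dsl = [(50, "n5"), (100, "n9")]
print(rearrange(ssl, dsl))   # [('D3', 'n1', 'n9'), ('D7', 'n2', 'n5')]
```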
We use a centralized method in local load balancing, and a decentralized method in global load balancing. Each node (k, a_{d-1} a_{d-2} ... a_0) periodically reports its load information to the primary node in its local circle. Unlike a real super-peer network, Cycloid has no direct link between a node and the primary node. The load information needs to be forwarded using the Cycloid routing algorithm, which ensures the information reaches the up-to-the-minute primary node. Specifically, the information is targeted to the node (d − 1, a_{d-1} a_{d-2} ... a_0). By the routing algorithm, the destination it reaches, say node i, may be the primary node or its successor, depending on which one is closer to the ID. If the cyclic index of successor(i) is larger than the cyclic index of i, then the load information is forwarded to predecessor(i), which is the primary node. Otherwise, node i is the primary node. According to the Cycloid routing algorithm, each report needs to take d/2 steps in the worst case. A Cycloid cycle contains a primary node at all times. Since the load information is guaranteed to reach the up-to-the-minute primary node, primary node updates have no serious adverse effect on load balancing. After receiving the load information, the primary node puts it into its own DSL and SSL accordingly. A primary node with a nonempty starving list (PNS) first performs local load rearrangement between its DSL and SSL. Afterwards, if its SSL is still not empty, it probes other primary nodes' DSLs for global load rearrangement one by one until its SSL becomes empty. When a primary node does not have enough capacity for load balancing, it can search for a high-capacity node to replace itself. We arrange for the PNS to initiate probing because the probing process will stop once it is no longer overloaded. If a node with a nonempty donating list initiated probing, the probing process could proceed indefinitely, incurring many more communication messages and
bandwidth cost. Because primary nodes are super peers with high capacities, they are less likely to be
overloaded during load balancing. This avoids the situation in which heavy nodes are further loaded by performing the probing themselves, as in the schemes in (Rao, 2003). This scheme can be extended to perform load
rearrangement between one SSL and multiple DSLs for improvement.

3.2.3 Locality-Aware Randomized Load Balancing Algorithms


The load balancing framework in the preceding section facilitates the development of load balancing
algorithms with different characteristics. A key difference between the algorithms is, for a PNS, how to
choose another primary node for a global load rearrangement between its SSL and that node's DSL. This choice affects the efficiency and overhead of reaching a system-wide load balance state.
D-way randomized probing. A general approach to dealing with the churn of DHTs is randomized
probing. In the policy, each PNS probes other primary nodes randomly for load rearrangement. A simple
form is one-way probing, in which a PNS, say node i, probes other primary nodes one by one to execute
load rearrangement between SSL_i and DSL_j, where j is a probed node. We generalize the one-way randomized probing policy to d-way probing, in which d primary nodes are probed at a time and the primary node with the most total free capacity in its DSL is chosen for load rearrangement. A critical performance issue is the choice of an appropriate value of d. The randomized probing in our load balancing framework is
similar to load balancing problem in other contexts: competitive online load balancing and supermarket
model. Competitive online load balancing is to assign each task to a server on-line with the objective of
minimizing the maximum load on any server, given a set of servers and a sequence of task arrivals and
departures. Azar et al. (1994) proved that in competitive online load balancing, allowing each task to have
two server choices and assigning it to the less loaded of the two, instead of just one choice, can exponentially reduce the maximum server load and result in a more balanced load distribution. The supermarket model allocates each randomly arriving task, modeled as a customer with service requirements, to a processor (or
server) with the objective of reducing the time each customer spends in the system. Mitzenmacher et al.
(1997) proved that allowing a task two server choices, and serving it at the server with the smaller workload, instead of just one choice leads to exponential improvements in the expected execution time of each task, but that a poll size larger than two gains much less substantial extra improvement. The randomized
probing between the lists of SSLs and DSLs is similar to the above competitive load balancing and
supermarket models if we regard SSLs as tasks, and DSLs as servers. But the random probing in P2P
systems involves more general workload and server models. Servers are dynamic, with new ones joining and existing ones leaving. Servers are heterogeneous with respect to their capacities. Tasks are of different sizes and arrive at different rates. In (Fu, 2008), we proved that the random probing is equivalent
to a generalized supermarket model and showed the following results.
Theorem 5.1: Assume servers join in a Poisson distribution. For any fixed time interval [0,T], the
length of the longest queue in the supermarket model with d = 1 is (ln n / ln ln n)(1 + o(1)) with high probability; the length of the longest queue in the model with d ≥ 2 is ln ln n / ln d + O(1), where
n is the number of servers.
The theorem implies that 2-way probing could achieve a more balanced load distribution with faster
speed even in churn, because 2-way probing has a higher probability of reaching an active node than 1-way
probing, but d-way probing, d > 2, may not result in much additional improvement.
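The d-way randomized probing policy can be sketched as follows (illustrative; the dictionaries and function names are assumptions): a primary node with a nonempty SSL probes d random primary nodes and performs load rearrangement with the one whose DSL advertises the most total free capacity.

```python
import random

def d_way_probe(pns, primary_nodes, d=2):
    """pns: the probing primary node (has a nonempty SSL).
       primary_nodes: dict name -> {"dsl": [(free_capacity, light_node), ...]}.
       Returns the probed primary node chosen for load rearrangement."""
    candidates = random.sample([p for p in primary_nodes if p != pns], d)

    def total_free(p):
        # total free capacity advertised by p's donating sorted list
        return sum(free for free, _ in primary_nodes[p]["dsl"])

    return max(candidates, key=total_free)

primaries = {
    "P0": {"dsl": []},                       # the overloaded primary node doing the probing
    "P1": {"dsl": [(40, "a"), (10, "b")]},
    "P2": {"dsl": [(5, "c")]},
    "P3": {"dsl": [(120, "d")]},
}
chosen = d_way_probe("P0", primaries, d=2)
print(chosen)   # the probed node with more total free capacity among the two candidates
```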
Locality-aware probing. One goal of load balancing is to effectively keep each node lightly loaded with minimum load balancing overhead.

Table 3. Simulation settings and algorithm parameters

Environmental Parameter      Default Value
Object arrival location      Uniform over ID space
Number of nodes              4906
Node capacity                Bounded Pareto: shape 2; lower bound: 2500, upper bound: 2500 * 10
Number of items              20480
Existing item load           Bounded Pareto: shape 2; lower bound: mean item actual load / 2, upper bound: (mean item actual load / 2) * 10

Proximity is one of the most important performance factors.
Mismatch between logical proximity abstraction and physical proximity information in reality is a big
obstacle for the deployment and performance optimization of P2P applications. Techniques
to exploit topology information in overlay routing include geographic layout, proximity routing and
proximity-neighbor selection (Castro, 2002).
The proximity-neighbor selection and topologically-aware overlay construction techniques in (Xu,
2003; Castro, 2002; Waldvogel, 2002) are integrated into Cycloid to build a topology-aware Cycloid. As
a result, the topology-aware connectivity of Cycloid ensures that a message reaches its destination with
minimal overhead. Details of topology-aware Cycloid construction will be presented in Section 3.2.4.
In a topology-aware Cycloid network, the cost for communication and load movement can be reduced
if a primary node contacts other primary nodes in its routing table or primary nodes of its neighbors. In
general, the primary nodes of a node's neighbors are closer to the node than randomly chosen primary nodes in the entire network, so load is moved between closer nodes. To our knowledge, this is the first work that handles the load balancing issue using the information already maintained for efficient routing. There are two methods for locality-aware probing: randomized and sequential.
1. Locality-aware randomized probing (LAR): In LAR, each PNS contacts, in random order, the primary nodes in its routing table or the primary nodes of its neighbors, except the nodes in its inside leaf set. After all these primary nodes have been tried, if the PNS's SSL is still nonempty, global random probing is started over the entire ID space.
2. Locality-aware sequential probing (Lseq): In Lseq, each PNS contacts its larger outside leaf set node, Successor(PNS). After load rearrangement, if its SSL is still nonempty, the larger outside leaf set node of Successor(PNS), i.e., Successor(Successor(PNS)), is tried. This process is repeated until the SSL becomes empty. The distances between a node and its sequential nodes are usually smaller than the distances between the node and randomly chosen nodes in the entire ID space. (A sketch of both probing orders follows this list.)
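As a sketch of the two probing orders just described (hypothetical helper names; the routing-table and leaf-set contents are placeholders): LAR first tries, in random order, the primary nodes known from the node's own routing state and then falls back to global random probing, while Lseq walks the global circle of primary nodes via outside-leaf-set successors.

```python
import random

def lar_probe_order(routing_table_primaries, all_primaries):
    """LAR: shuffle the primary nodes known from the routing table / neighbors,
    then fall back to random probing over the whole ID space."""
    local = list(routing_table_primaries)
    random.shuffle(local)
    remaining = [p for p in all_primaries if p not in routing_table_primaries]
    random.shuffle(remaining)
    return local + remaining

def lseq_probe_order(start, successor):
    """Lseq: follow outside-leaf-set successors one by one around the global circle."""
    order, cur = [], successor[start]
    while cur != start:
        order.append(cur)
        cur = successor[cur]
    return order

succ = {"P0": "P1", "P1": "P2", "P2": "P3", "P3": "P0"}
print(lar_probe_order(["P2"], ["P1", "P2", "P3"]))
print(lseq_probe_order("P0", succ))   # ['P1', 'P2', 'P3']
```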

3.2.4 Performance Evaluation


We designed and implemented a simulator in Java for evaluation of the load balancing algorithms on
topology-aware Cycloid. Table 3 lists the parameters of the simulation and their default values. The
simulation model and parameter settings are not necessarily representative of real DHT applications.
They are set in a similar way to related studies in the literature for fair comparison. We will compare the
different load balancing algorithms in Cycloid without churn in terms of the following performance
metrics; the algorithms in Cycloid with churn will also be evaluated.

1. Load movement factor: Defined as the total load transferred due to load balancing divided by the system actual load, which is the system target capacity times SU. It represents the load movement cost.
2. Total time of probing: Defined as the time spent on primary node probing, assuming that probing one node takes 1 time unit and probing a number of nodes simultaneously also takes 1 time unit. It represents the speed of the probing phase of load balancing in achieving a system-wide load balance state.
3. Total number of load rearrangements: Defined as the total number of load rearrangements between a pair of SSL and DSL. It represents the efficiency of probing for light nodes.
4. Total probing bandwidth: Defined as the sum of the bandwidth consumed by all probing operations. The bandwidth of a probing operation is the sum of the bandwidth of all involved communications, each of which is the product of the message size and the physical path length the message traveled. It is assumed that the size of a message asking for or replying with information is 1 unit. It represents the traffic burden caused by probings.
5. Moved load distribution: Defined as the cumulative distribution function (CDF) of the percentage of moved load versus moving distance. It represents the load movement cost for load balance. The more load moved along shorter distances, the lower the load balancing cost.

Topology-aware Cycloid construction. GT-ITM (transit-stub and tiers) (Zegura, 1996) is a network
topology generator, widely used for the construction of topology-aware overlay networks (Ratnasamy,
2002; Xu, 2003; Xu, 2003; Gummadi, 2003). We used GT-ITM to generate transit-stub topologies for
Cycloid, and obtain the physical hop distance for each pair of Cycloid nodes. Recall that we use the proximity-neighbor selection method to build topology-aware Cycloid; that is, we select the routing table entries pointing to the physically nearest of all nodes whose nodeIDs lie in the desired portion of the ID space.
We use landmark clustering and Hilbert number (Xu, 2003) to cluster Cycloid nodes. Landmark
clustering is based on the intuition that close nodes are likely to have similar distances to a few landmark
nodes. A Hilbert number can convert the d-dimensional landmark vector of each node into a one-dimensional index while still preserving the closeness of nodes. We selected 15 nodes as landmark nodes to generate the landmark vector and a Hilbert number for each node's cubical ID. Because the nodes in a stub domain have
close (or even the same) Hilbert numbers, their cubical IDs are also close to each other. As a result, physically close nodes are close to each other in the DHT's ID space, and nodes in one cycle are physically close
to each other. For example, assume nodes i and j are very close to each other in physical locations but
far away from node m. Nodes i and j will get approximately equivalent landmark vectors, which are different from m's. As a result, nodes i and j would get the same cubical IDs and be assigned to a circle different from m's. In the landmark approach, for each topology, we choose landmarks at random with
the only condition that the landmarks are separated from each other by four hops. More sophisticated
placement schemes, such as those described in (Jamin, 2000), would only serve to improve our results.
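A deliberately simplified sketch of landmark clustering (illustrative only; the real construction converts each landmark vector to a Hilbert number, which is omitted here): each node measures its distance to a few landmarks, and nodes with similar landmark vectors are assigned nearby cubical IDs.

```python
def landmark_vector(node, landmarks, distance):
    # Vector of measured distances from this node to each landmark.
    return tuple(distance(node, lm) for lm in landmarks)

def assign_cubical_ids(nodes, landmarks, distance, d=8):
    """Order nodes by their landmark vectors (a stand-in for Hilbert-curve indexing)
    so that physically close nodes receive numerically close cubical IDs."""
    ordered = sorted(nodes, key=lambda n: landmark_vector(n, landmarks, distance))
    step = max(1, (2 ** d) // max(1, len(ordered)))
    return {node: i * step for i, node in enumerate(ordered)}

# Toy example: nodes live on a line, landmarks at positions 0 and 100.
positions = {"a": 3, "b": 5, "c": 80, "d": 82}
dist = lambda n, lm: abs(positions[n] - lm)
print(assign_cubical_ids(["a", "b", "c", "d"], [0, 100], dist))
# Physically close nodes (a,b and c,d) end up in neighboring regions of the cubical ID space.
```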
Our experiments are built on two transit-stub topologies: ts5k-large and ts5k-small with approximately 5,000 nodes each. In the topologies, nodes are organized into logical domains. We classify
the domains into two types: transit domains and stub domains. Nodes in a stub domain are typically an
endpoint in a network flow; nodes in transit domains are typically intermediate in a network flow. ts5k-large has 5 transit domains, 3 transit nodes per transit domain, 5 stub domains attached to each transit node, and 60 nodes in each stub domain on average. ts5k-small has 120 transit domains, 5 transit nodes per transit domain, 4 stub domains attached to each transit node, and 2 nodes in each stub domain on average. ts5k-large has a larger backbone and a sparser edge network (stub) than ts5k-small. ts5k-large is used to represent a situation in which the Cycloid overlay consists of nodes from several big stub domains, while ts5k-small represents a situation in which the Cycloid overlay consists of nodes scattered over the entire Internet and only a few nodes from the same edge network join the overlay. To account for the fact that interdomain routes have higher latency, each interdomain hop counts as 3 units of latency while each intradomain hop counts as 1 unit of latency.

Figure 2. Effect of load balancing
Effectiveness of LAR algorithms. In this section, we will show the effectiveness of the LAR load balancing algorithms. First, we present the impact of the LAR algorithm on the alignment of the skews in load distribution and node capacity when the system is fully loaded.

Figure 3. Effect of load balancing due to different probing algorithms

Figure 2(a) shows the initial node utilization of
each node. Recall that node utilization is a ratio of the actual load to its target (desired) load. Many of
the nodes were overloaded before load balancing.
Load balancing operations drove all node utilizations down below 1 by transferring excess items
between the nodes, as shown in Figure 2(b). Figure 2(c) shows the scatterplot of loads according to node
capacity. It confirms the capacity-aware load balancing feature of the LAR algorithm. Recall that the LAR algorithm is based on item movement, using forward pointers to preserve the DHT lookup protocol. We calculated the fraction of items that are pointed to by forward
pointers in systems of different utilization levels. We found that the fraction increased linearly with the
system load, but it would be no higher than 45% even when the system becomes fully loaded. The cost
is reasonably low compared to the extra space, maintenance cost, and efficiency degradation of the virtual server load balancing approach.
We measured the load movement factors due to different load balancing algorithms: one-way random
(R1), two-way random (R2), LAR1, LAR2, and Lseq, on systems of different loads and found that the
algorithms led to almost the same amount of load movement in total at any given utilization level. This
is consistent with the observation by Rao et al. (2003) that the load moved depends only on the distribution of loads and the target to be achieved, not on the load balancing algorithm. This result suggests that an effective load balancing algorithm should aim to move the same amount of load over shorter distances and in less time to reduce load balancing overhead.
In the following, we will examine the performance of various load balancing algorithms in terms of
other performance metrics. Because metrics (2) and (3) are not affected by topology, their results in ts5k-small are sometimes omitted.
Comparison with other algorithms. Figure 3(a) shows that the probing process in Lseq takes much more time than in R1 and LAR1. This implies that the randomized algorithms are better than the sequential algorithm in probing efficiency.
Figure 3(b) shows that the numbers of rearrangements of the three algorithms are almost the same.
This implies that they need almost the same number of load rearrangement to achieve load balance.
However, the long probing time of Lseq suggests that it is not as efficient as random probing. This is consistent
with the observation of Mitzenmacher in (Mitzenmacher, 1997) that simple randomized load balancing
schemes can balance load effectively. Figure 3(c) and (d) show the performance of the algorithms in
ts5k-large. From Figure 3(c), we can observe that unlike in lightly loaded systems, in heavily loaded
systems, R1 takes more bandwidth than LAR1 and Lseq, and the performance gap increases as the system
load increases. This is because far fewer probings are needed in a lightly loaded system, so probing distance has less effect on bandwidth consumption.
The bandwidth results of LAR and Lseq are almost the same when the SU is under 90%; when the SU
goes beyond 0.9, LAR consumes more bandwidth than Lseq. This is due to the fact that in a more heavily loaded system, more nodes need to be probed in the entire ID space, leading to longer load transfer
distances. Figure 3(d) shows the moved load distribution in load balancing as the SU approaches 1. We
can see that LAR1 and Lseq are able to transfer about 60% of global moved load within 10 hops, while
R1 transfers only about 15% because R1 is locality-oblivious.
Figures 3(e) and (f) show the performance of the algorithms in ts5k-small. These results also confirm that LAR1 achieves better locality-aware performance than R1, although the improvement is not as significant as in ts5k-large. This is because in the ts5k-small topology, nodes are scattered in the entire
network, and the neighbors of a primary node may not be physically closer than other nodes.


Figure 4. Breakdown of probed nodes

Figures 3(d) and (f) also include the results due to two other popular load balancing approaches:
the proximity-aware K-nary tree (KTree) algorithm (Zhu, 2005) and the churn-resilient algorithm (CRA) (Godfrey, 2006) for comparison. From the figures, we can see that LAR performs as well as KTree, and outperforms the proximity-oblivious CRA, especially in ts5k-large. The performance gap between proximity-aware and proximity-oblivious algorithms is not as large in ts5k-small, because the nodes in ts5k-small are scattered over the entire Internet with less locality.

Figure 5. Effect of load balancing due to different LAR algorithms


In summary, the results in Figure 3 suggest that the randomized algorithm is more efficient than the
sequential algorithm in the probing process. The locality-aware approaches can effectively assign and
transfer loads between neighboring nodes first, thereby reducing network traffic and improving load balancing efficiency. The LAR algorithm performs no worse than the proximity-aware KTree algorithm. In
Section 3.2.5, we will show LAR works much better for DHTs with churn.
Effect of d-way random probing (Figure 4). We tested the performance of the LARd algorithms with different probing concurrency degrees d. Figure 5(a) shows that LAR2 takes much less probing time than LAR1, which implies that LAR2 reduces the probing time of LAR1 at the cost of a larger number of probings. Unlike LAR1, in LAR2 a probing node sends its SSL only to the one of the two probed nodes with more total free capacity in its DSL. The more items transferred in one load rearrangement, the less probing time is needed. This leads to fewer SSL-sending operations in LAR2 than in LAR1, resulting in fewer load rearrangements, as shown in Figure 5(b). Therefore, simultaneous probings to find a node with more
total free capacity in its DSL can save load balancing time and reduce network traffic load.
Figures 4(a) and (b) show, for LAR1 and LAR2 respectively, the breakdown (in percentage) of the total
number of probed nodes that are neighbors or are randomly chosen from the entire ID space. The label
"one neighbor and one random" represents the condition in which there is only one neighbor in the routing
table, so the other probed node is chosen randomly from the ID space. We can see that neighbor primary
nodes constitute the largest share, which means that neighbors can absorb most of the system's excess items
in load balancing.
As the SU increases, the percentage of neighbor primary nodes decreases because the neighbors' DSLs
do not have enough free capacity for a larger number of excess items, so randomly chosen primary
nodes must be resorted to. Figures 5(a) and (b) show that the probing efficiency of LARd (d>2) is almost
the same as that of LAR2, though these algorithms need to probe more nodes than LAR2. The results are
consistent with the expectations in Section 3.2.1 that two-way probing leads to an exponential improvement
over one-way probing, whereas d-way (d>2) probing yields much less substantial additional improvement. In
the following, we analyze whether the improvement of LARd (d ≥ 2) over LAR1 comes at the cost of
more bandwidth consumption or degraded locality-aware performance. We can observe from Figure
5(c) that the probing bandwidth of LAR2 is almost the same as that of LAR1.
Figure 5(d) shows the moved load distribution in global load balancing for the different algorithms.
We can see that LAR2 leads to a distribution approximately identical to that of LAR1, and both incur slightly
less global load movement cost than LAR4 and LAR6. This is because the more nodes are probed
simultaneously, the lower the chance that the best primary node is a close neighbor. These observations
demonstrate that LAR2 improves on LAR1 at no extra bandwidth cost and retains the advantage of
locality-aware probing. Figures 5(e) and (f) show the performance of the different algorithms in ts5k-small.
Although the performance gap is not as wide as in ts5k-large, the relative performance of the
algorithms remains the same.
In practice, nodes and items continuously join and leave P2P systems, and it is hard to achieve load
balance in networks with churn. We conducted a comprehensive evaluation of the LAR
algorithm in dynamic situations and compared it with CRA, which was designed for DHTs
with churn. The performance factors we considered include load balancing frequency, item arrival/departure
rate, non-uniform item arrival patterns, and network scale and node capacity heterogeneity. We
adopted the same metrics as in (Godfrey, 2006):


Figure 6. Effect of load balancing with churn

1. The 99.9th percentile node utilization (99.9th NU). We measure the maximum 99.9th percentile
of the node utilizations after each load balancing period T in simulation and take the average of
these results over a period as the 99.9th NU. The 99.9th NU represents the efficiency of LAR to
minimize load imbalance.
2. Load moved/DHT load moved (L/DHT-L), defined as the total load moved incurred due to load
balancing divided by the total load of items moved due to node joins and departures in the system.
This metric represents the efficiency of LAR to minimize the amount of load moved.
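As a rough illustration (not the authors' simulator), the two metrics above can be computed from logged per-period utilization samples and moved-load totals as in the following Python sketch; the numpy dependency and the toy numbers are assumptions.

import numpy as np

def percentile_nu(node_utilizations, q=99.9):
    # q-th percentile of node utilizations observed after one balancing period T
    return float(np.percentile(node_utilizations, q))

def load_moved_ratio(load_moved_by_balancer, load_moved_by_dht):
    # L/DHT-L: load moved by load balancing / load moved by the DHT due to churn
    return load_moved_by_balancer / load_moved_by_dht

# Toy example: three balancing periods, each with per-node utilization samples.
periods = [np.random.uniform(0.2, 1.0, size=1000) for _ in range(3)]
nu_999 = np.mean([percentile_nu(p) for p in periods])   # averaged over periods
ratio = load_moved_ratio(load_moved_by_balancer=120.0, load_moved_by_dht=300.0)
print(nu_999, ratio)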

Unless otherwise indicated, we ran each trial of the simulation for 20T simulated seconds, where
T is a parameterized load balancing period whose default value was set to 60 seconds in our tests. The
item and node join/departure rates were modeled by Poisson processes. The default item join/
departure rate was 0.4; that is, there was one item join and one item departure every 2.5 seconds. We
ranged the node interarrival time from 10 to 90 seconds, with a 10-second increment in each step. A node's
lifetime is computed as the node interarrival time multiplied by the number of nodes in the system. The
default system utilization SU was set to 0.8.
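A Poisson arrival process such as the one described above can be generated from exponential interarrival times. The following Python sketch is only illustrative of this standard construction; the function name and the way the default rate of 0.4 is plugged in are assumptions.

import random

def poisson_event_times(rate, horizon):
    # Event times of a Poisson process with the given rate (events per second)
    # over [0, horizon] seconds; interarrival gaps are exponential with mean 1/rate.
    t, times = 0.0, []
    while True:
        t += random.expovariate(rate)
        if t > horizon:
            return times
        times.append(t)

# Default setting from the text: item join/departure rate 0.4 (roughly one join
# and one departure every 2.5 seconds), simulated for 20*T seconds with T = 60.
T = 60
item_joins = poisson_event_times(rate=0.4, horizon=20 * T)
item_departures = poisson_event_times(rate=0.4, horizon=20 * T)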
Performance comparison with CRA in churn. Figure 6 plots the performance of LAR1 and CRA
versus the node interarrival time during a period T. Comparing the results of LAR1 and CRA yields a
number of observations. First, the 99.9th NUs of LAR1 and CRA are kept no higher than 1 and 1.25, respectively.

Figure 7. Impact of system utilization under continual node joins and departures


Figure 8. Impact of load balancing frequency

This implies that on average, LAR1 is comparable with CRA in achieving the load balancing
goal under churn. Second, LAR1 moves up to 20% and CRA moves up to 45% of the system load to achieve
load balance for an SU as high as 80%. Third, the load moved due to load balancing is very small compared
with the load moved due to node joins and departures: it is at most 40% of the latter for LAR1 and 53% for CRA.
When the node interarrival time is 10 seconds, the L/DHT-L is the highest. This is because faster node joins and
departures generate much higher load imbalance, so that more load must be transferred to achieve
load balance. The fact that the results of LAR1 are comparable to those of CRA implies that the LAR algorithm is
as efficient as CRA in handling churn by moving a small amount of load.
The results in Figure 6 correspond to the default node join/leave rate of 0.4. Figure 7 plots the 99.9th NU,
the load movement factor, and the L/DHT-L as functions of SU for different node interarrival times. We can
observe that all three metrics increase as SU increases. That is because
nodes are prone to being overloaded in a heavily loaded system, resulting in more load being transferred to
achieve load balance. We can also observe that the metrics increase as the interarrival time
decreases, though the increase is not pronounced. This is because, with faster node joins and departures,
nodes become overloaded more easily, leading to an increase in the 99.9th NU and in the load moved
in load balancing. The low NUs across different SUs and node interarrival times mean that LAR is effective
in keeping each node lightly loaded in a dynamic DHT with different node join/departure rates and different
SUs, and confirm the churn-resilient feature of the LAR algorithm.
Impact of load balancing frequency in churn. It is known that frequent load balancing keeps
the system balanced at a high cost, whereas infrequent load balancing can hardly guarantee load balance
at all times. In this simulation, we varied the load balancing interval T from 60 to 600 seconds, at a step size
of 60, and we conducted the test in a system with SU varying from 0.5 to 0.9 at a step size of 0.1.
Figures 8(a) and (b) show the 99.9th NU and the load movement factor for different system utilizations and
time intervals. We can see that the 99.9th NU and the load movement factor increase as SU increases. This is
because nodes are more likely to be overloaded in a highly loaded system, leading to a high maximum
NU and a large amount of load that needs to be transferred for load balance.
Figure 9. Impact of item arrival/departure rate

Figure 8(a) shows that all the 99.9th NUs are less than 1, and when the actual load of a system reaches
more than 60% of its target load, the 99.9th NU quickly converges to 1. This implies that the LAR algorithm
is effective in keeping every node lightly loaded, and that it can quickly transfer the excess load of heavy nodes
to light nodes even in a highly loaded system. Observing Figures 8(a) and (b), we find that for a given SU, the
more load is moved, the lower the 99.9th NU. This is consistent with our expectation that more load moved leads
to a more balanced load distribution.
Intuitively, a higher load balancing frequency should lead to a lower 99.9th NU and more load moved.
Our observation from Figure 8 is counter-intuitive: the 99.9th NU increases and the load movement
factor decreases as load balancing is performed more frequently. Recall that the primary objective
of load balancing is to keep each node from being overloaded, rather than to keep the application load evenly
distributed between the nodes. Whenever a node's utilization is below 1, it does not need to transfer its
load to others. With a high load balancing frequency, few nodes are likely to be overloaded; they may
have utilizations that are high yet below 1, and thus end up with less load movement and high node utilization.
Figure 8(b) reveals a linear relationship between the load movement factor and system utilization, and shows
that the slope at a low frequency is larger than at a high frequency because load balancing frequency has a
stronger impact on highly loaded systems.
Impact of item arrival/departure rate in churn. Continuous and fast item arrivals increase the probability
of generating overloaded nodes, while item departures create nodes with available capacity for
excess items. An efficient load balancing algorithm must quickly find nodes with sufficient free capacity for
excess items in order to maintain a load-balanced state under churn. In this section, we evaluate the efficiency
of the LAR algorithm in the face of rapid item arrivals and departures. In this test, we varied the item
arrival/departure rate from 0.05 to 0.45 at a step size of 0.1, varied SU from 0.5 to 0.9 at a step size of
0.05, and measured the 99.9th NU and the load movement factor in each condition. Figures 9(a) and (b),
respectively, plot the 99.9th NU and the load movement factor as functions of the item arrival/departure rate.
As expected, the 99.9th NU and load movement factor increase with system utilization. It is consistent
with the results in the load balancing frequency test. Figure 9(a) shows that all the 99.9th NUs are less
than 1, which means that LAR is effective in assigning excess items to lightly loaded nodes even under
rapid item arrivals and departures. From the figures, we can also see that as the item arrival/departure
rate increases, the 99.9th NU decreases in a heavily loaded system, unlike in a lightly loaded system. This is
due to efficient LAR load balancing, in which more load rearrangements are initiated in a timely manner by
overloaded nodes under a high item arrival rate. In a lightly loaded system, on the other hand, although the
loads of nodes accumulate quickly under a high item arrival rate, most nodes remain lightly loaded with no
need to move load out, leading to an increase of the 99.9th NU. This is confirmed by the observation in
Figure 9(b) that more load is moved in a heavily loaded system than in a lightly loaded one, and that the
movement factor drops faster in a highly loaded system, which means that faster item departures lead to
less load moved for load balance. Figure 9(b) also demonstrates that the load movement factor drops as the
item arrival/departure rate increases. This is because the total system load (the denominator of the load
movement factor) grows quickly with a high item arrival/departure rate. In summary, the item arrival/departure
rate has a direct effect on the NU and the load movement factor in load balancing, and LAR is effective in
achieving load balance under rapid item arrivals and departures.

Figure 10. Impact of non-uniform item arrival patterns
Impact of non-uniform item arrivals in churn. Furthermore, we tested the LAR algorithm to see whether it is
churn-resilient enough to handle skewed load distributions. We define an impulse of items as a group
of items that suddenly join the system and whose IDs are distributed over a contiguous interval of the ID
space. We set their total load to 10% of the total system load, and varied the spread of the interval
from 10% to 90% of the ID space.
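The impulse workload just described can be generated roughly as in the following Python sketch: a batch of items whose total load is 10% of the system load and whose IDs fall inside one contiguous interval covering a chosen fraction of the ID space. The parameter values and names are illustrative assumptions.

import random

def generate_impulse(total_system_load, id_space=2**32, spread=0.1, n_items=1000):
    impulse_load = 0.1 * total_system_load          # impulse load = 10% of system load
    start = random.randrange(id_space)
    width = int(spread * id_space)
    items = []
    for _ in range(n_items):
        item_id = (start + random.randrange(width)) % id_space   # contiguous ID interval
        items.append({"id": item_id, "load": impulse_load / n_items})
    return items

impulse = generate_impulse(total_system_load=1_000_000, spread=0.3)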
Figure 10(a) shows that, for different impulses and SUs, the LAR algorithm kept the 99.9th NU below
1.055, which implies that the LAR algorithm can resolve the impulses almost completely. The 99.9th NU is
high at a high SU and a low impulse spread. Except when SU equals 0.8, an impulse with a spread larger
than 0.3 can be successfully resolved by the LAR algorithm. When the impulse is assigned to a small ID
space interval of less than 0.3, the load of the nodes in that ID space interval accumulates quickly, leading to
higher NUs. The situation becomes worse at a higher SU, because there is already less available capacity
left in the system for the impulse. The fact that the curve for SU=0.8 lies largely above the others is mainly
due to the item load and node capacity distributions, and to the impulse load relative to the SU. In that case,
it is hard to find nodes with a large enough capacity to support excess items because of the fragmentation of
the 20% of capacity left in the system. The results are consistent with those reported in (Godfrey, 2006).
Figure 10(b) shows that the load movement factor decreases as the impulse spread increases and
as the SU decreases. With a low impulse spread, a large amount of load assigned to a small region generates
a large number of overloaded nodes, so the LAR load balancing algorithm cannot handle them quickly.
This situation becomes worse when SU increases to 0.8, due to the little available capacity left. Therefore,
the 99.9th NU and the load movement factor are high in a highly loaded system with a low impulse interval.
In summary, the LAR algorithm can generally handle non-uniform item arrivals. It can deal with a sudden
increase of 10% of the load in 10% of the ID space in a highly loaded system with an SU of 0.8, keeping the
99.9th NU close to 1.

Figure 11. Impact of the number of nodes in the system
Impact of node number and capacity heterogeneity in churn. The consistent hashing function adopted
in DHTs leads to an O(log n) imbalance of keys between the nodes, where n is the number of
nodes in the system. Node heterogeneity in capacity makes the load balancing problem even more severe.
In this section, we study the effects of the number of nodes and of a heterogeneous capacity distribution
on load balancing. We varied the number of nodes from 1000 to 8000 at a step size of 1000,
and tested the NU and the load movement factor when node capacities were heterogeneous and homogeneous.
In the homogeneous case, all node capacities are set to 50,000; in the heterogeneous case, node capacities are
determined by the default Pareto node capacity distribution.
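A minimal sketch of the two capacity settings is given below: equal capacities of 50,000 in the homogeneous case, and Pareto-distributed capacities in the heterogeneous case. The Pareto shape and scale values are assumptions for illustration, not the parameters used in the authors' simulations.

import random

def node_capacities(n, heterogeneous=True, shape=2.0, scale=25000.0):
    if not heterogeneous:
        return [50000.0] * n                      # homogeneous: equal capacities
    # random.paretovariate(alpha) draws samples >= 1 with Pareto tail index alpha.
    return [scale * random.paretovariate(shape) for _ in range(n)]

caps = node_capacities(8000, heterogeneous=True)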
Figure 11(a) shows that in the heterogeneous case, the 99.9th NUs are all around 1. This means that
LAR can keep nodes lightly loaded at different network scales when node capacities are heterogeneous.
In the homogeneous case, the 99.9th NU stays around 1 when the number of nodes is no more than 5000, but
it grows linearly as the number of nodes increases beyond 5000. It is somewhat surprising
that LAR achieves better load balance in a large-scale network when node capacities are heterogeneous
than when they are homogeneous. Intuitively, this is because, in the heterogeneous case, very high-load
items can be accommodated by large-capacity nodes, whereas in the homogeneous case there is no node
with a capacity large enough to handle them. The results are consistent with those in (Godfrey, 2006).
Figure 11(b) shows that in both cases, the load movement factors increase as the number of nodes
grows. A larger system scale generates a higher key imbalance, so more load needs to be transferred
for load balance. The figure also shows that the factor in the homogeneous case is markedly lower than
that in the heterogeneous case. This is due to the heterogeneous capacity distribution, in which some
nodes have very small capacities but are assigned much higher load, which then needs to be moved out for
load balance. The results show that node heterogeneity helps, rather than hurts, the scalability of the LAR
algorithm. The LAR algorithm can achieve good load balance even in a large-scale network by arranging
load transfers in a timely manner.

3.2.6. Summary
This section presents the LAR load balancing algorithms, which deal with both the proximity and the dynamism
of DHTs simultaneously. The algorithms distribute application load among the nodes by moving items
according to node capacities and proximity information in topology-aware DHTs. The LAR algorithms
introduce a factor of randomness into the probing process, within a range of proximity, to deal with DHT churn.
The efficiency of the randomized load balancing is further improved by d-way probing.
Simulation results show the superiority of locality-aware 2-way randomized load balancing in DHTs
with and without churn. The algorithm saves bandwidth in comparison with purely randomized load balancing
because of its locality-aware feature. Due to the randomness factor in node probing, it can achieve load
balance for an SU as high as 90% in dynamic situations by moving up to 20% of the system load, and
up to 40% of the load moved by the underlying DHT due to node joins and departures. The LAR algorithm
is further evaluated with respect to a number of performance factors, including load balancing frequency,
the arrival/departure rate of items and nodes, skewed item ID distribution, and node number and capacity
heterogeneity. Simulation results show that the LAR algorithm can effectively achieve load balance by
moving a small amount of load even under a skewed distribution of items.

4. FUTURE TRENDS
Though a lot of research has been conducted in the field of load balancing in parallel and distributed
systems, load balancing methods are still in their incubation phase when it comes to P2P overlay networks. In this section, we discuss the future and emerging trends, and present a number of open issues
in the domain of load balancing in P2P overlay networks.
P2P overlay networks are characterized by heterogeneity, dynamism and proximity. To account for heterogeneity,
a load balancing method should allocate load among nodes based on the actual file load
rather than the number of files. A dynamism-resilient load balancing method should not generate high
load balancing overhead when nodes join, leave or fail continuously and rapidly. A proximity-aware
load balancing method moves load between physically close nodes so as to reduce the overhead of load
balancing. However, few of the current load balancing methods take these three factors into account to
improve the efficiency and effectiveness of load balancing. Virtual server methods and ID assignment and
reassignment methods only aim to distribute the number of files evenly among nodes, and are therefore
unable to consider file heterogeneity. In addition, these methods lead to high overhead due to neighbor
maintenance and the varying ID intervals owned by nodes under churn. These two categories of methods can be
complementary to the load transfer methods, which have the potential to deal with all three features of P2P
overlay networks. Thus, combining the three types of load balancing strategies to overcome each other's
drawbacks and exploit the benefits of each method is a promising future direction.
The LAR algorithms were built on the Cycloid structured DHT. Importantly, the LAR algorithms
are applicable to other DHT networks as well. They must, however, be complemented by node clustering that
groups DHT nodes according to their physical locations to facilitate LAR's probing within a range of proximity.
The work in (Shen, 2006) presents a way of clustering physically close nodes in a general DHT
network, which can be applied to generalize LAR to other DHT networks.


Currently, most heterogeneity-unaware load balancing methods measure load by the number of files
stored on a node, and heterogeneity-aware load balancing methods consider only file size when determining
a node's load. In addition to the storage required, the load incurred by a file also includes the bandwidth
consumption caused by file queries. Frequently queried files generate high load, while infrequently
queried files lead to low load. Since files stored in the system often have different popularities, and the
access patterns to the same file may vary over time, a file's load changes dynamically. However, most
load balancing methods are not able to cope with the load variance caused by non-uniform and time-varying
file popularity. Thus, an accurate method to measure a file's load that considers all factors affecting
load is required. On the other hand, node capacity heterogeneity should also be identified. As far as the
author knows, all current load balancing methods assume that there is one bottleneck resource, though
there are various resources including CPU, memory, storage and bandwidth. For highly effective load
balancing, the various kinds of load, such as bandwidth and storage, should be differentiated, and the various
node resources should be differentiated as well. Rather than mapping a generalized node capacity to a
generalized load, each kind of load should be mapped to the corresponding node resource in load balancing.
These improvements will significantly enhance the accuracy and effectiveness of a load balancing method.
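As a purely illustrative sketch of the kind of file-load model argued for above (the linear form, the weights, and the function name are assumptions, not a method proposed in the text), a file's load could combine the storage it occupies with the bandwidth its queries consume:

def estimate_file_load(size_bytes, query_rate, avg_reply_bytes, w_storage=1.0, w_bw=1.0):
    # storage component: bytes the file occupies on the node
    storage_load = w_storage * size_bytes
    # bandwidth component: queries per second times bytes served per query
    bandwidth_load = w_bw * query_rate * avg_reply_bytes
    return storage_load + bandwidth_load

# A frequently queried small file can impose more load than a large, rarely queried one.
hot = estimate_file_load(size_bytes=1_000_000, query_rate=50.0, avg_reply_bytes=1_000_000)
cold = estimate_file_load(size_bytes=100_000_000, query_rate=0.01, avg_reply_bytes=100_000_000)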
Most load balancing algorithms balance only the key distribution among nodes. In file-sharing P2P
systems, a main function of nodes is to handle key location queries, so query load balancing is a critical
part of P2P load balancing; that is, the number of queries that nodes receive, handle and forward should
correspond to their different capacities. A highly effective load balancing method will balance both the key
load and the query load.

5. CONCLUSION
A load balancing method is indispensable to a high performance P2P overlay network. It helps to avoid
overloading nodes and take full advantage of node resources in the system. This chapter has provided a
detailed introduction of load balancing in P2P overlay networks, and has examined all aspects of load
balancing methods including their goals, properties, strategies and classification. A comprehensive
review of research works focusing on load balancing in DHT networks has been presented, along with
an in-depth discussion of their pros and cons. Furthermore, a load balancing algorithm that overcomes
the drawbacks of previous methods has been presented in detail. Finally, the future and emerging trends
and open issues in load balancing in P2P overlay networks have been discussed.

REFERENCES
Adler, M., Halperin, E., Karp, R. M., & Vazirani, V. (2003, June). A stochastic process on the hypercube
with applications to peer-to-peer networks. In Proc. of STOC.
Azar, Y., Broder, A., et al. (1994). Balanced allocations. In Proc. of STOC (pp. 593–602).
Bienkowski, M., Korzeniowski, M., & auf der Heide, F. M. (2005). Dynamic load balancing in distributed hash tables. In Proc. of IPTPS.


Godfrey, P. B., & Stoica, I. (2005). Heterogeneity and load balance in distributed hash tables. In
Proc. of IEEE INFOCOM.
Byers, J., Considine, J., & Mitzenmacher, M. (2003, Feb.). Simple load balancing for distributed hash
tables. In Proc. of IPTPS.
Castro, M., Druschel, P., Hu, Y. C., & Rowstron, A. (2002). Topology-aware routing in structured peer-to-peer overlay networks. In Future Directions in Distributed Computing.
Fasttrack product description. (2001). http://www.fasttrack.nu/index.html.
Fu, S., Xu, C. Z., & Shen, H. (2008, April). Random choices for churn resilient load balancing in peer-to-peer networks. In Proc. of the IEEE International Parallel and Distributed Processing Symposium.
Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2006). Load balancing in dynamic
structured P2P systems. Performance Evaluation, 63(3).
Gummadi, K., Gummadi, R., Gribble, S., Ratnasamy, S., Shenker, S., & Stoica, I. (2003). The impact
of DHT routing geometry on resilience and proximity. In Proc. of ACM SIGCOMM.
Jamin, S., Jin, C., Jin, Y., Raz, D., Shavitt, Y., & Zhang, L. (2000). On the placement of Internet instrumentation. In Proc. of INFOCOM.
Kaashoek, F., & Karger, D. R. (2003). Koorde: A simple degree-optimal hash table. In Proc. of IPTPS.
Karger, D., Lehman, E., Leighton, T., Levine, M., et al. (1997). Consistent hashing and random trees:
Distributed caching protocols for relieving hot spots on the World Wide Web. In Proc. of STOC (pp. 654–663).
Karger, D. R., & Ruhl, M. (2004). Simple efficient load balancing algorithms for Peer-to-Peer systems.
In Proc. of IPTPS.
Manku, G. (2004). Balanced binary trees for ID management and load balance in distributed hash tables.
In Proc. of PODC.
Maymounkov, P., & Mazières, D. (2002). Kademlia: A peer-to-peer information system based on the XOR metric. In Proc. of the 1st International Workshop on Peer-to-Peer Systems (IPTPS).
Mitzenmacher, M. (1997). On the analysis of randomized load balancing schemes. In Proc. of SPAA.
Mondal, A., Goda, K., & Kitsuregawa, M. (2003). Effective load-balancing of peer-to-peer systems. In
Proc. of IEICE DEWS DBSJ Annual Conference.
Motwani, R., & Raghavan, P. (1995). Randomized Algorithms. New York: Cambridge University
Press.
Naor, M., & Wieder, U. (2003, June). Novel architectures for P2P applications: The continuous-discrete approach. In Proc. of SPAA.
Rao, A., Lakshminarayanan, K., et al. (2003). Load balancing in structured P2P systems. In Proc. of
IPTPS.


Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable
network. In Proceedings of ACM SIGCOMM (pp. 329–350).
Ratnasamy, S., Handley, M., Karp, R., & Shenker, S. (2002). Topologically aware overlay construction
and server selection. In Proc. of INFOCOM.
Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, decentralized object location and routing for
large-scale peer-to-peer systems. In Proc. of the 18th IFIP/ACM International Conference on Distributed
Systems Platforms (Middleware).
Saroiu, S., et al. (2002). A measurement study of peer-to-peer file sharing systems. In Proc. of MMCN.
Shen, H., & Xu, C. (2006, April). Hash-based proximity clustering for load balancing in heterogeneous
DHT networks. In Proc. of IPDPS.
Shen, H., Xu, C., & Chen, G. (2006). Cycloid: A scalable constant-degree P2P overlay network. Performance Evaluation, 63(3), 195–216. doi:10.1016/j.peva.2005.01.004
Shen, H., & Xu, C.-Z. (2007). Locality-aware and churn-resilient load balancing algorithms in structured
peer-to-peer networks. IEEE Transactions on Parallel and Distributed Systems (TPDS), 18(6),
849–862. doi:10.1109/TPDS.2007.1040
Stoica, I., Morris, R., et al. (2003). Chord: A scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Transactions on Networking.
Waldvogel, M., & Rinaldi, R. (2002). Efficient topology-aware overlay network. In Proc. of HotNets-I.
Xu, C. (2005). Scalable and Secure Internet Services and Architecture. Boca Raton, FL: Chapman &
Hall/CRC Press.
Xu, Z., Mahalingam, M., & Karlsson, M. (2003). Turning heterogeneity into an advantage in overlay
routing. In Proc. of INFOCOM.
Xu, Z., Tang, C., & Zhang, Z. (2003). Building topology-aware overlays using global soft-state. In Proc.
of ICDCS.
Yang, B., & Garcia-Molina, H. (2003). Designing a super-peer network. In Proc. of ICDE.
Zegura, E., Calvert, K., et al. (1996). How to model an internetwork. In Proc. of INFOCOM.
Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (2001). Tapestry: An infrastructure for fault-tolerant
wide-area location and routing (Tech. Rep. UCB/CSD-01-1141). University of California at Berkeley,
Berkeley, CA.
Zhu, Y., & Hu, Y. (2005). Efficient, proximity-aware load balancing for DHT-based P2P systems. IEEE
Transactions on Parallel and Distributed Systems (TPDS), 16(4).


KEY TERMS AND DEFINITIONS


Dynamism/Churn: A great number of nodes join, leave and fail continually and rapidly, leading to
an unpredictable network size.
Heterogeneity: The intrinsic properties of participating peers, including computing capability, differ widely
and deserve serious consideration in the construction of a truly efficient, widely deployed application.
Load Balancing Method: A method that keeps the load on each node no higher than the node's
capacity.
Peer: A peer (or node) is an abstract notion of participating entities. It can be a computer process, a
computer, an electronic device, or a group of them.
Peer-to-Peer Network: A peer-to-peer network is a logical network on top of physical networks in
which peers are organized without any centralized coordination.
Proximity: The mismatch between the logical proximity abstraction derived from DHTs and the physical
proximity of nodes in reality, which is a major obstacle to the deployment and performance optimization
of P2P applications.
Structured Peer-to-Peer Network/Distributed Hash Table: A peer-to-peer network that maps keys
to the nodes based on a consistent hashing function.


Chapter 9

Decentralized Overlay for Federation of Enterprise Clouds
Rajiv Ranjan
The University of Melbourne, Australia
Rajkumar Buyya
The University of Melbourne, Australia

ABSTRACT
This chapter describes Aneka-Federation, a decentralized and distributed system that combines enterprise
Clouds, overlay networking, and structured peer-to-peer techniques to create a scalable wide-area
network of compute nodes for high-throughput computing. The Aneka-Federation integrates numerous
small-scale Aneka Enterprise Cloud services and nodes that are distributed over multiple control and
enterprise domains as parts of a single coordinated resource leasing abstraction. The system is designed
with the aim of making distributed enterprise Cloud resource integration and application programming
flexible, efficient, and scalable. The system is engineered such that it: enables seamless integration of
existing Aneka Enterprise Clouds as part of a single wide-area resource leasing federation; self-organizes
the system components based on a structured peer-to-peer routing methodology; and presents end-users
with a distributed application composition environment that can support a variety of programming and
execution models. This chapter describes the design and implementation of a novel, extensible and
decentralized peer-to-peer technique that helps to discover, connect and provision the services of Aneka
Enterprise Clouds among users, who can use different programming models to compose their applications.
Evaluations of the system with applications programmed using the Task and Thread
execution models on top of an overlay of Aneka Enterprise Clouds are also described.

INTRODUCTION
Wide-area overlays of enterprise Grids (Luther, Buyya, Ranjan, & Venugopal, 2005; Andrade, Cirne,
Brasileiro, & Roisenberg, 2003; Butt, Zhang, & Hu, 2003; Mason & Kelly, 2005) and Clouds (Amazon
DOI: 10.4018/978-1-60566-661-7.ch009


Elastic Compute Cloud, 2008; Google App Engine, 2008; Microsoft Live Mesh, 2008; Buyya, Yeo,
Venugopal, 2008) are an appealing platform for the creation of high-throughput computing resource
pools and cross-domain virtual organizations. An enterprise Cloud1 is a type of computing infrastructure
that consists of a collection of inter-connected computing nodes, virtualized computers, and software
services that are dynamically provisioned among the competing end-users' applications based on their
availability, performance, capability, and Quality of Service (QoS) requirements. Various enterprise
Clouds can be pooled together to form a federated infrastructure of resource pools (nodes, services,
virtual computers). In a federated organisation: (i) every participant gets access to a much larger pool
of resources; (ii) the peak-load handling capacity of every enterprise Cloud increases without the need
to maintain or administer any additional computing nodes, services, and storage devices; and
(iii) the reliability of an enterprise Cloud is enhanced as a result of multiple redundant clouds that can
efficiently tackle disaster conditions and ensure business continuity.
Emerging enterprise Cloud applications and the underlying federated hardware infrastructure (Data
Centers) are inherently large, with heterogeneous resource types whose resource conditions may vary over
time. The unique challenges in efficiently managing a federated Cloud computing environment
include:

Large scale: composed of distributed components (services, nodes, applications, users, virtualized
computers) that combine to form a massive environment. These days, enterprise
Clouds consisting of hundreds of thousands of computing nodes are common (Amazon Elastic
Compute Cloud, 2008; Google App Engine, 2008; Microsoft Live Mesh, 2008), and hence federating
them together leads to a massive-scale environment;
Resource contention: driven by the resource demand pattern and a lack of cooperation among
end-users' applications, a particular set of resources can get swamped with excessive workload,
which significantly undermines the overall utility delivered by the system; and
Dynamic: the components can leave and join the system at will.

The aforementioned characteristics of the infrastructure account for significant development, system
integration, configuration, and resource management challenges. Further, end-users follow a variety
of programming models to compose their applications. In other words, in order to efficiently harness
the computing power of enterprise Cloud infrastructures (Chu, Nandiminti, Jin, Venugopal, & Buyya,
2007; Amazon Elastic Compute Cloud, 2008; Google App Engine, 2008; Microsoft Live Mesh, 2008),
software services that can support a high level of scalability, robustness, self-organization, and application
composition flexibility are required.
This chapter has two objectives. The first is to investigate the challenges regarding the design and
development of a decentralized, scalable, self-organizing, and federated Cloud computing system. The
second is to introduce the Aneka-Federation software system, which includes various software services,
peer-to-peer resource discovery protocols and resource provisioning methods (Ranjan, 2007; Ranjan,
Harwood, & Buyya, 2008) to deal with the challenges in designing a decentralized resource management
system in a complex, dynamic and heterogeneous enterprise Cloud computing environment. The
components of the Aneka-Federation, including computing nodes, services, providers and end-users,
self-organize themselves based on a structured peer-to-peer routing methodology to create a scalable
wide-area overlay of enterprise Clouds. In the rest of this chapter, the terms Aneka Cloud(s) and Aneka
Enterprise Cloud(s) are used interchangeably.


The unique features of Aneka-Federation are: (i) wide-area scalable overlay of distributed Aneka
Enterprise Clouds (Chu et al., 2007); (ii) realization of a peer-to-peer based decentralized resource discovery technique as a software service, which has the capability to handle complex resource queries;
and (iii) the ability to enforce coordinated interaction among end-users through the implementation
of a novel decentralized resource provisioning method. This provisioning method is engineered over
a peer-to-peer routing and indexing system that has the ability to route, search and manage complex
coordination objects in the system.
The rest of this chapter is organized as follows: First, the challenges and requirements related to the design
of decentralized enterprise Cloud overlays are presented. Next follows a brief introduction of the
Aneka Enterprise Cloud system, including the basic architecture, key services and programming models.
Then, finer details related to the Aneka-Federation software system, which builds upon the decentralized
Content-based services, are presented. Comprehensive details on the design and implementation of the
decentralized Content-based services for message routing, search, and coordinated interaction follow. Next,
an experimental case study and analysis based on the test run of two enterprise Cloud applications on
the Aneka-Federation system are presented. Finally, this work is put in context with related work,
and the chapter ends with a brief conclusion.

DESIGNING DECENTRALIZED ENTERPRISE CLOUD OVERLAY


In a decentralized organization of Cloud computing systems, both control and decision making are decentralized
by nature, and different system components interact to adaptively maintain
and achieve a desired system-wide behavior. A distributed Cloud system configuration is considered to
be decentralized if no component in the system is more important than the others: if
one of the components fails, it is neither more nor less harmful to the system than the
failure of any other component.
A fundamental challenge in managing a decentralized Cloud computing system is to maintain
consistent connectivity between the components (self-organization) (Parashar & Hariri, 2007). This challenge
cannot be overcome by introducing a central network model to connect the components, since the
information needed for managing the connectivity and making the decisions is completely decentralized
and distributed. Further, a centralized network model (Zhang, Freschl, & Schopf, 2003) does not scale
well, lacks fault tolerance, and requires expensive server hardware infrastructure. System components
can leave, join, and fail in a dynamic fashion; hence it is impossible to manage such a network
centrally. Therefore, an efficient decentralized solution that can gracefully adapt and scale
to changing conditions is mandatory.
A possible way to efficiently interconnect the distributed system components is to use structured
peer-to-peer overlays. In the literature, structured peer-to-peer overlays are more commonly referred
to as Distributed Hash Tables (DHTs). DHTs provide hash-table-like functionality at Internet
scale. DHTs such as Chord (Stoica, Morris, Karger, Kaashoek, & Balakrishnan, 2001), CAN (Ratnasamy,
Francis, Handley, Karp, & Schenker, 2001), Pastry (Rowstron & Druschel, 2001), and Tapestry (Zhao,
Kubiatowicz, & Joseph, 2001) are inherently self-organizing, fault-tolerant, and scalable. DHTs provide
services that are lightweight and hence do not require an expensive hardware platform for hosting,
which is an important requirement for building and managing an enterprise Cloud system that
consists of commodity machines. A DHT is a distributed data structure that associates a key with a data item.


Entries in a DHT are stored as (key, data) pairs. A data item can be looked up within a logarithmic number of
overlay routing hops if the corresponding key is known.
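The (key, data) abstraction can be sketched as follows in Python. This is only a minimal consistent-hashing toy with a global view of the ring; real DHTs such as Chord or Pastry replace the global lookup with O(log n) routing, and all class and method names here are illustrative.

import hashlib
from bisect import bisect_left

class TinyDHT:
    def __init__(self, node_names):
        # Each node owns the arc of the hash ring ending at its own hash value.
        self.ring = sorted((self._h(n), n) for n in node_names)
        self.store = {n: {} for n in node_names}

    def _h(self, s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def _owner(self, key):
        ids = [i for i, _ in self.ring]
        pos = bisect_left(ids, self._h(key)) % len(self.ring)
        return self.ring[pos][1]

    def put(self, key, data):
        self.store[self._owner(key)][key] = data

    def get(self, key):
        return self.store[self._owner(key)].get(key)

dht = TinyDHT(["nodeA", "nodeB", "nodeC"])
dht.put("service:index", {"addr": "10.0.0.5"})
print(dht.get("service:index"))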
The effectiveness of a decentralized Cloud computing system depends on the level of coordination
and cooperation among the components (users, providers, services) with respect to scheduling and
resource allocation. Realizing cooperation among distributed Cloud components requires the design and
development of self-organizing, robust, and scalable coordination protocols. The Aneka-Federation
system implements one such coordination protocol using DHT-based routing, lookup and discovery
services. Finer details about the coordination protocol are discussed later in the text.

ANEKA ENTERPRISE CLOUD: AN OVERVIEW


Aneka (Chu et al., 2007) is a .NET-based service-oriented platform for constructing enterprise Clouds. It
is designed to support multiple application models, persistence and security solutions, and communication
protocols, such that the preferred selection can be changed at any time without affecting an existing
Aneka ecosystem. To create an enterprise Cloud, the resource provider only needs to start an instance of
the configurable Aneka container hosting the required services on each selected Cloud node. The purpose
of the Aneka container is to initialize services and to act as a single point of interaction with the rest of
the enterprise Cloud.
Figure 1 shows the design of the Aneka container on a single Cloud node. To support scalability,
the Aneka container is designed to be lightweight by providing the bare minimum functionality needed
for an enterprise Cloud node. It provides the base infrastructure that consists of services for persistence, security (authorization, authentication and auditing), and communication (message handling
and dispatching). Every communication within the Aneka services is treated as a message, handled and
dispatched through the message handler/dispatcher that acts as a front controller. The Aneka container
hosts a compulsory MembershipCatalogue service, which maintains the resource discovery indices
(such as a .Net remoting address) of those services currently active in the system. The Aneka container
can host any number of optional services that can be added to augment the capabilities of an enterprise
Cloud node. Examples of optional services are indexing, scheduling, execution, and storage services.
This provides a single, flexible and extensible framework for orchestrating different kinds of enterprise
Cloud application models.
To support reliability and flexibility, services are designed to be independent of each other in a container.
A service can interact with other services on the local node or on other Cloud nodes only through
known interfaces. This means that a malfunctioning service will not affect other working services and/
or the container. Therefore, the resource provider can seamlessly configure and manage existing services
or introduce new ones into a container. Aneka thus provides the flexibility for the resource provider to
implement any network architecture for an enterprise Cloud. The implemented network architecture
depends on the interaction of services among enterprise Cloud nodes since each Aneka container on a
node can directly interact with other Aneka containers reachable on the network.
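The front-controller message handling described above can be sketched as follows; the class and method names are illustrative only and do not represent Aneka's actual .NET API.

class MessageDispatcher:
    # Every interaction between services is a message routed through this component.
    def __init__(self):
        self.handlers = {}

    def register(self, message_type, service):
        self.handlers[message_type] = service

    def dispatch(self, message_type, payload):
        service = self.handlers.get(message_type)
        if service is None:
            raise LookupError(f"no service registered for {message_type!r}")
        return service.handle(payload)

class MembershipCatalogue:
    def handle(self, payload):
        # e.g. record the remoting address of a newly started service
        return {"registered": payload["address"]}

dispatcher = MessageDispatcher()
dispatcher.register("membership.update", MembershipCatalogue())
dispatcher.dispatch("membership.update", {"address": "tcp://10.0.0.7:9090"})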


Figure 1. Design of Aneka container

ANEKA-FEDERATION
The Aneka-Federation system self-organizes the components (nodes, services, clouds) based on a DHT
overlay. Each enterprise Cloud site in the Aneka-Federation (see Figure 2) instantiates a new software
service, called Aneka Coordinator. Based on the scalability requirements and system size, an enterprise
Cloud can instantiate multiple Aneka Coordinator services. The Aneka Coordinator basically implements the resource management functionalities and resource discovery protocol specifications. The
software design of the Aneka-Federation system decouples the fundamental decentralized interaction
of participants from the resource allocation policies and the details of managing a specific Aneka Cloud
Service. The Aneka-Federation software system utilizes the decentralized Cloud services for efficient
distributed resource discovery and coordinated scheduling.

DESIGN AND IMPLEMENTATION


The Aneka Coordinator software service is composed of the following components:


Figure 2. Aneka-Federation network with the coordinator services and Aneka enterprise Clouds

Aneka services: These include the core services for peer-to-peer scheduling (Thread Scheduler,
Task Scheduler, Dataflow Scheduler) and peer-to-peer execution (Thread Executor, Task Executor)
provided by the Aneka framework. These services work independently in the container and have
the ability to interact with other services such as the P2PMembershipCatalogue through the
MessageDispatcher service deployed within each container.
Aneka peer: This component of the Aneka Coordinator service loosely glues together the core
Aneka services with the decentralized Cloud services. The Aneka peer seamlessly encapsulates
the following: the Apache Tomcat container (hosting environment and web service front end
to the Content-based services), the Internet Information Server (IIS) (hosting environment for the ASP.
Net service), the P2PMembershipCatalogue, and the Content-based services (see Figure
4). The basic functionalities of the Aneka peer (refer to Figure 3) include providing services for:
(i) Content-based routing of lookup and update messages; and (ii) facilitating decentralized coordination
for efficient resource sharing and load balancing among the Internet-wide distributed
Aneka Enterprise Clouds. The Aneka peer service operates at the Core services layer of the layered
architecture shown in Figure 9.

Figure 4 shows a block diagram of interaction between various components of Aneka Coordinator
software stack. The Aneka Coordinator software stack encapsulates the P2PMembershipCatalogue
and the Content-based decentralized lookup services. The design components for peer-to-peer scheduling,
execution, and membership are derived from the basic Aneka framework components through object-
oriented software inheritance (see Figure 5, Figure 6, and Figure 7).

Figure 3. Aneka-Federation over decentralized Cloud services
A UML (Unified Modeling Language) class diagram that displays the core entities within the Aneka
Coordinator's Scheduling service is shown in Figure 5. The main class (refer to Figure 5) that undertakes
activities related to application scheduling within the Aneka Coordinator is the P2PScheduling service,
which is programmatically inherited from Aneka's IndependentScheduling service class. The P2PScheduling
service implements the methods for: (i) accepting application submissions from client nodes
(see Figure 8); (ii) sending search queries to the P2PMembershipCatalogue
service; (iii) dispatching applications to Aneka nodes (P2PExecution service); and (iv) collecting the
application output data. The core programming models in Aneka, including Task, Thread, and Dataflow,
instantiate the P2PScheduling service as their main scheduler class. This runtime binding of the P2PScheduling
service class to different programming models is taken care of by the Microsoft .NET platform and the
Inversion of Control (IoC) (Fowler, 2008) implementation in the Spring .NET framework (Spring.Net, 2008).
Similar to the P2PScheduling service, the binding of the P2PExecution service to specific programming
models (such as P2PTaskExecution, P2PThreadExecution) is done by the Microsoft .NET platform and the
IoC implementation in the Spring .NET framework. The interaction between the services (such as the
P2PTaskExecution and P2PTaskScheduling services) is facilitated by the MessageDispatcher service.
The P2PExecution services update their node usage status with the P2PMembershipCatalogue through
the P2PExecutorStatusUpdate component (see Figure 6). The core Aneka framework defines distinct
message types to enable seamless interaction between services. The functionality of handling, compiling,
and delivering the messages within the Aneka framework is implemented in the MessageDispatcher
service. Recall that the MessageDispatcher service is automatically deployed in the Aneka container.
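The runtime binding described above can be illustrated with a small configuration-driven registry. The real system uses the Spring .NET IoC container on the .NET platform; the Python registry below and its names are only an illustrative sketch of the idea.

class P2PScheduling:
    def __init__(self, model):
        self.model = model   # e.g. "Task", "Thread", or "Dataflow"

# Each programming model is bound to the P2PScheduling service as its scheduler class.
SCHEDULER_BINDINGS = {model: P2PScheduling for model in ("Task", "Thread", "Dataflow")}

def make_scheduler(programming_model):
    # Resolve and instantiate the scheduler class bound to a programming model at runtime.
    return SCHEDULER_BINDINGS[programming_model](programming_model)

scheduler = make_scheduler("Thread")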


Figure 4. A block diagram showing interaction between various components in the Aneka Coordinator
software stack

The P2PMembershipCatalogue service is the core component that interacts with the Content-based
decentralized Cloud services and aids in the organization and management of the Aneka-Federation overlay.
The UML class design for this service within the Aneka Coordinator is shown in Figure 7. This service
accepts resource claim and ticket objects from P2PScheduling and P2PExecution services respectively
(refer to Figure 8), which are then posted with the Content-based services hosted in the Apache Tomcat
container.
The P2PMembershipCatalogue interacts with the components hosted within the Apache Tomcat container (Java implementation) using the SOAP-based web services Application Programming Interfaces
(APIs) exposed by the DFPastryManager component (see Figure 7). The Content-based service communicates
with the P2PMembershipCatalogue service through an ASP.NET web service hosted within
the IIS container (see Figure 4 or 8).
The mandatory services within an Aneka Coordinator that are required to instantiate a fully functional
Aneka Enterprise Cloud site include the P2PMembershipCatalogue, P2PExecution, P2PScheduling, .Net
web service, and Content-based services (see Figure 8). These services export an enterprise Cloud site to
the federation and give it the capability to accept remote jobs based on its load condition (using its
P2PExecution services) and to submit local jobs to the federation (through its P2PScheduling services).
Figure 5. Class design diagram of P2PScheduling service

Figure 8 demonstrates a sample application execution flow in the Aneka-Federation system. Clients
directly connect and submit their applications to a programming-model-specific scheduling service. For
instance, a client having an application programmed using Aneka's Thread model would submit the
application to the Thread P2PScheduling service (refer to step 1 in Figure 8). Clients discover the point
of contact for local scheduling services by querying their domain-specific Aneka Coordinator service.
On receipt of an application submission message, a P2PScheduling service encapsulates the resource
requirements for that application in a resource claim object and sends a query message to the
P2PMembershipCatalogue (see step 2 in Figure 8).
Execution services (such as the P2PThreadExecution and P2PTaskExecution services), which are distributed
over different enterprise Clouds and administered by enterprise-specific Aneka Coordinator services, update
their status by sending a resource ticket object to the P2PMembershipCatalogue (see step 3 in Figure
8). A resource ticket object in the Aneka-Federation system abstracts the type of service being offered,
the underlying hardware platform, and the level of QoS that can be supported. Finer details about the
composition and the mapping of resource ticket and claim objects are discussed later in this chapter.
The P2PMembershipCatalogue then posts the resource ticket and claim objects with the decentralized
Content-based services (see steps 4 and 5 in Figure 8). When a resource ticket, issued by a P2PExecution
service, matches a resource claim object posted by a P2PScheduling service, the Content-based
service sends a match notification to the P2PScheduling service through the P2PMembershipCatalogue
(see steps 6, 7, and 8 in Figure 8). After receiving the notification, the P2PScheduling service deploys its
application on the P2PExecution service (see step 9 in Figure 8). On completion of a submitted application,
the P2PExecution service directly returns the output to the P2PScheduling service (see step 10 in Figure 8).

Figure 6. Class design diagram of P2PExecution service
The Aneka Coordinator service supports the following two inter-connection models for creating
an Aneka Enterprise Cloud site (see Figure 9 and Figure 10). First, a resource sharing domain or enterprise
Cloud can instantiate a single Aneka Coordinator service and let the other nodes in the
Cloud connect to the Coordinator service. In such a scenario, the other nodes need to instantiate only the
P2PExecution and P2PScheduling services. These services depend on the domain-specific Aneka
Coordinator service for load updates, resource lookup, and membership to the federation (see
Figure 11). In the second configuration, each node in a resource domain can be installed with all the services
within the Aneka Coordinator (see Figure 4). This kind of inter-connection leads to a true peer-to-peer
Aneka-Federation Cloud network, where each node is an autonomous computing node and has the
ability to implement its own resource management and scheduling decisions. Hence, in this case the
Aneka Coordinator service can support a completely decentralized Cloud computing environment both
within and between enterprise Clouds.


Figure 7. Class design diagram of P2PMembershipCatalogue service

CONTENT-BASED DECENTRALIZED CLOUD SERVICES


As mentioned earlier, a DHT-based overlay presents a compelling solution for creating a decentralized
network of Internet-wide distributed Aneka Enterprise Clouds. However, DHTs are efficient only at handling
single-dimensional search queries such as "find all services that match a given attribute value". Since
Cloud computing resources such as enterprise computers, supercomputers, clusters, storage devices,
and databases are identified by more than one attribute, a resource search query for these
resources is always multi-dimensional. These resource dimensions or attributes include service type,
processor speed, architecture, installed operating system, available memory, and network bandwidth.
Recent advances in the domain of decentralized resource discovery have been based on extending
existing DHTs with the capability of multi-dimensional data organization and query routing (Ranjan,
Harwood, & Buyya, 2008).
Our decentralized Cloud management middleware supports peer-to-peer Content-based resource
discovery and coordination services for efficient management of distributed enterprise Clouds. The
middleware is designed based on a 3-tier layered architecture: the Application layer, Core Services layer,
and Connectivity layer (see Figure 9). Cloud services such as the Aneka Coordinator, resource brokers,
and schedulers work at the Application layer and insert objects via the Core services layer.

Figure 8. Application execution sequence in Aneka-Federation

The core functionality, including the support for decentralized coordinated interaction and scalable resource
discovery, is delivered by the Core services layer. The Core services layer, which is managed by the Aneka
peer software service, is composed of two sub-layers (see Figure 9): (i) Coordination Service (Ranjan
et al., 2007); and (ii) Resource discovery service. The Coordination service component of Aneka peer
accepts the coordination objects such as a resource claim and resource ticket. A resource claim object is a
multi-dimensional range look-up query (Samet, 2008) (spatial range object), which is initiated by Aneka
Coordinators in the system in order to locate the available Aneka Enterprise Cloud nodes or services that
can host their clients' applications. A resource claim object has the following semantics:

Aneka Service = "P2PThreadExecution" && CPU Type = "Intel" && OSType = "WinXP" && Processor Cores > 1 && Processor Speed > 1.5 GHz
On the other hand, a resource ticket is a multi-dimensional point update query (spatial point object),
which is sent by an Aneka Enterprise Cloud to report the availability status of the local Cloud nodes and
the deployed services. A resource ticket object has the following semantics:

Aneka Service = "P2PThreadExecution" && CPU Type = "Intel" && OSType = "WinXP" && Processor Cores = 2 && Processor Speed = 3 GHz

Figure 9. Layered view of the content-based decentralized Cloud services
Further, both of these queries can specify different kinds of constraints on the attribute values. If a
query specifies a fixed value for each attribute, it is referred to as a multi-dimensional point query.
However, if the query specifies a range of values for the attributes, it is referred to as a multi-dimensional
range query. The claim and ticket objects encapsulate coordination logic, which in this case
is the resource provisioning logic. The calls between the Coordination service and the Resource Discovery
service are made through the standard publish/subscribe technique. The Resource Discovery service
is responsible for efficiently mapping these complex objects to the DHT overlay.
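The matching of a ticket (a point) against a claim (a range query) can be sketched as follows in Python, using the attribute names from the example queries above. The dictionary representation and the helper name are assumptions made only for illustration.

def ticket_matches_claim(ticket, claim):
    # A claim constraint is either an exact value or an inclusive (low, high) range.
    for attr, constraint in claim.items():
        value = ticket.get(attr)
        if isinstance(constraint, tuple):
            low, high = constraint
            if value is None or not (low <= value <= high):
                return False
        elif value != constraint:
            return False
    return True

claim = {"Aneka Service": "P2PThreadExecution", "CPU Type": "Intel",
         "OSType": "WinXP", "Processor Cores": (2, float("inf")),     # cores > 1
         "Processor Speed": (1.5, float("inf"))}                       # speed > 1.5 GHz
ticket = {"Aneka Service": "P2PThreadExecution", "CPU Type": "Intel",
          "OSType": "WinXP", "Processor Cores": 2, "Processor Speed": 3.0}
print(ticket_matches_claim(ticket, claim))   # True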
The Resource Discovery service organizes the resource attributes by embedding a logical publish/
subscribe index over a network of distributed Aneka peers. Specifically, the Aneka peers in the system
create a DHT overlay that collectively maintains the logical index to facilitate a decentralized resource
discovery process. The spatial publish/subscribe index builds a multi-dimensional attribute space based
on the Aneka Enterprise Cloud nodes' resource attributes, where each attribute represents a single
dimension. The multi-dimensional spatial index assigns regions of space to the Aneka peers. The calls
between the Core services layer and the Connectivity layer are made through standard DHT primitives such
as put(key, value) and get(key), which are defined by the peer-to-peer Common Application Programming
Interface (API) specification.

Figure 10. Resource claim and ticket object mapping and coordinated scheduling across Aneka Enterprise
Cloud sites. Spatial resource claims {T1, T2, T3, T4}, index cell control points {A, B, C, D}, spatial
point tickets {l, s} and some of the spatial hashings to the Pastry ring, i.e. the d-dimensional (spatial)
coordinate values of a cell's control point are used as the Pastry key. For this figure, fmin = 2, dim = 2.
There are different kinds of spatial indices, such as the Space Filling Curves (SFCs) (including the
Hilbert curves, Z-curves), k-d tree, MX-CIF Quad tree and R*-tree that can be utilized for managing,
routing, and indexing of objects by resource discovery service at Core services layer. Spatial indices are
well suited for handling the complexity of Cloud resource queries. Although some spatial indices can
have issues as regards to routing load-balance in case of a skewed attribute set, all the spatial indices
are generally scalable in terms of the number of hops traversed and messages generated while searching
and routing multi-dimensional/spatial claim and ticket objects.
Resource claim and ticket object mapping: At the Core services layer, a spatial index that assigns regions of multi-dimensional attribute space to Aneka peers has been implemented. The MX-CIF
Quadtree spatial hashing technique (Tanin, Harwood, & Samet, 2007) is used to map the logical multi-dimensional control point (point C in Figure 10 represents a 2-dimensional control point) onto a Pastry DHT overlay.

Figure 11. Aneka-Federation test bed distributed over 3 departmental laboratories

If an Aneka peer is assigned a region in the multi-dimensional attribute space,
then it is responsible for handling all the activities related to the lookups and updates that intersect with
the region of space. Figure 10 depicts a 2-dimensional Aneka resource attribute space for mapping resource claim and ticket objects. The attribute space resembles a mesh-like structure due to its recursive
division process. The index cells, resulted from this process, remain constant throughout the life of a
d-dimensional attribute space and serve as the entry points for subsequent mapping of claim and ticket
objects. The number of index cells produced at the minimum division level fmin is always equal to (fmin)^dim, where dim is the dimensionality of the attribute space. These index cells are called base index cells
and they are initialized when the Aneka Peers bootstrap to the federation network. Finer details on the
recursive subdivision technique can be found in (Tanin et al., 2007). Every Aneka Peer in the federation has the basic information about the attribute space coordinate values, dimensions, and minimum
division levels.
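As a small illustration of the base-cell count stated above, the following sketch simply divides a normalized attribute space into fmin equal parts per dimension and computes each cell's centroid; it follows the chapter's (fmin)^dim formula rather than reproducing the actual MX-CIF recursive subdivision, and the cell representation is an assumption. With fmin = 2 and dim = 2 it yields the four base cells whose control points correspond to A, B, C, and D in Figure 10.

```python
# Minimal sketch of base index cell generation at the minimum division level.
# Assumes a normalized attribute space [0, 1)^dim; the data structures are
# illustrative, not Aneka's implementation of the MX-CIF Quadtree.
from itertools import product

def base_index_cells(f_min: int, dim: int):
    """Return the (f_min)**dim base cells as (lower_corner, upper_corner, control_point)."""
    step = 1.0 / f_min
    cells = []
    for corner in product(range(f_min), repeat=dim):
        lower = tuple(c * step for c in corner)
        upper = tuple((c + 1) * step for c in corner)
        control_point = tuple((lo + hi) / 2 for lo, hi in zip(lower, upper))  # centroid
        cells.append((lower, upper, control_point))
    return cells

cells = base_index_cells(f_min=2, dim=2)
print(len(cells))              # 4 = (f_min)**dim base cells
for lo, hi, cp in cells:
    print(lo, hi, cp)          # each cell with its control point
```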
Every cell at the fmin level is uniquely identified by its centroid, termed as the control point. Figure
10 shows four control points A, B, C, and D. A DHT hashing (cryptographic functions such as SHA1/2) method is utilized to map the responsibility of managing control points to the Aneka Peers. In a
2-dimensional setting, an index cell is defined as i = (x1, y1, x2, y2), and its control point (centroid) is computed as ((x1+x2)/2, (y1+y2)/2). The spatial hashing technique takes two input parameters, SpatialHash (control point coordinates, object's coordinates), in terms of the DHT common API primitive that can be written as put (Key,
Value), where the cryptographic hash of the control point acts as the Key for DHT overlay, while Value
is the coordinate values of the resource claim or ticket object to be mapped. In Figure 10, the Aneka peer
at Cloud s is assigned index cell i through the spatial hashing technique, which makes it responsible for
managing all objects that map to the cell i (Claim T2, T3, T4 and Ticket s).
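A minimal sketch of this spatial hashing step follows: the control point of the cell an object maps to is hashed (here with SHA-1) to produce the DHT key, and the object's coordinates are stored under that key. A plain in-memory dictionary stands in for the Pastry overlay's put(key, value) primitive, and the coordinate values are invented for illustration.

```python
# Illustrative spatial hashing: control point -> cryptographic hash -> DHT key.
# A dict stands in for the DHT overlay; this is not FreePastry's actual API.
import hashlib

dht = {}  # key -> list of stored objects (claim/ticket coordinates)

def spatial_hash(control_point, obj_coords):
    """put(Key, Value): Key = SHA-1 of the control point, Value = object's coordinates."""
    key = hashlib.sha1(repr(control_point).encode()).hexdigest()
    dht.setdefault(key, []).append(obj_coords)
    return key

# Cell i's control point (e.g. point C in Figure 10) and a ticket's coordinates.
control_point_C = (0.75, 0.25)        # assumed coordinates, for illustration only
ticket_s = (0.8, 0.3)                 # a spatial point object
key = spatial_hash(control_point_C, ticket_s)
print(key, dht[key])
```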
For mapping claim objects, the process of mapping index cells to the Aneka Peers depends on whether
it is spatial point object or spatial range object. The mapping of point object is simple since every point
is mapped to only one cell in the attribute space. For spatial range object (such as Claims T2, T3 or
T4), the mapping is not always singular because a range object can cross more than one index cell (see
Claim T5 in Figure 10). To avoid mapping a spatial range object to all the cells that it intersects, which
can create many duplicates, a mapping strategy based on a diagonal hyperplane in the attribute space
is implemented. This mapping involves feeding spatial range object coordinate values and candidate
index as inputs to a mapping function, Fmap (spatial object, candidate index cells). An Aneka Peer service
uses the index cell(s) currently assigned to it and a set of known base index cells as candidate cells,
which are obtained at the time of bootstrapping into the federation. Fmap returns the index cells and
their control points with which the given spatial range object should be stored. Next, these control
points and the spatial object are given as inputs to the function SpatialHash(control point, object), which in
connection with the Connectivity layer generates DHT Ids (Keys) and performs routing of claim/ticket
objects to the Aneka Peers.
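The following is a rough sketch of the mapping step for range objects under simplifying assumptions: candidate cells are only tested for geometric intersection with the range object, and the diagonal-hyperplane strategy used to avoid duplicate mappings is deliberately not reproduced. The function and variable names are illustrative, not the actual Aneka implementation.

```python
# Simplified F_map: select the candidate index cells that a spatial range object
# intersects. The diagonal-hyperplane duplicate-avoidance step is omitted here.

def intersects(cell, range_obj):
    """cell and range_obj are (lower, upper) tuples of per-dimension bounds."""
    (c_lo, c_hi), (r_lo, r_hi) = cell, range_obj
    return all(r_lo[d] < c_hi[d] and r_hi[d] > c_lo[d] for d in range(len(c_lo)))

def f_map(range_obj, candidate_cells):
    """Return the candidate cells (and their control points) the object maps to."""
    selected = []
    for lower, upper in candidate_cells:
        if intersects((lower, upper), range_obj):
            control_point = tuple((lo + hi) / 2 for lo, hi in zip(lower, upper))
            selected.append(((lower, upper), control_point))
    return selected

# Example: a claim whose spatial extent spans two of the four base cells.
candidates = [((0.0, 0.0), (0.5, 0.5)), ((0.5, 0.0), (1.0, 0.5)),
              ((0.0, 0.5), (0.5, 1.0)), ((0.5, 0.5), (1.0, 1.0))]
claim_range = ((0.4, 0.1), (0.9, 0.4))   # (lower corner, upper corner)
print(f_map(claim_range, candidates))    # two intersecting cells with their control points
```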
Similarly, the mapping process of a ticket object also involves the identification of the intersection
index cells in the attribute space. A ticket is always associated with a region (Gupta, Sahin, Agarwal, &
Abbadi, 2004) and all cells that fall fully or partially within the region are selected to receive the corresponding ticket. The calculation of the region is based upon the diagonal hyperplane of the attribute
space.
Coordinated load balancing: Both resource claim and ticket objects are spatially hashed to an index
cell i in the multi-dimensional Aneka services attribute space. In Figure 10, resource claim object for
task T1 is mapped to index cell A, while for T2, T3, and T4, the responsible cell is i with control point
value C. Note that, these resource claim objects are posted by P2PScheduling services (Task or Thread)
of Aneka Cloud nodes. In Figure 10, scheduling service at Cloud p posts a resource claim object which
is mapped to index cell i. The index cell i is spatially hashed to an Aneka peer at Cloud s. In this case,
Cloud s is responsible for coordinating the resource sharing among all the resource claims that are currently mapped to the cell i. Subsequently, Cloud u issues a resource ticket (see Figure 10) that falls under
a region of the attribute space currently required by the tasks T3 and T4. Next, the coordination service
of the Aneka peer at Cloud s has to decide which of the tasks (either T3 or T4, or both) is allowed to claim
the ticket issued by Cloud u. The load-balancing decision is based on the principle that it should not
lead to over-provisioning of resources at Cloud u. This mechanism leads to coordinated load-balancing
across Aneka Enterprise Clouds and aids in achieving system-wide objective function, while at the same
time preserving the autonomy of the participating Aneka Enterprise Clouds.
The examples in Table 1 are a list of resource claim objects that are stored with an Aneka peer's coordination service at time T = 700 secs. Essentially, the claims in the list arrived at a time <= 700 and wait for
a suitable ticket object that can meet their applications' requirements (software, hardware, service type).
Table 2 depicts a ticket object that has arrived at T = 700. Following the ticket arrival, the coordination
service undertakes a procedure that allocates the ticket object among the list of matching claims. Based
on the Cloud node's attribute specification, both Claim 1 and Claim 2 match the ticket-issuing Cloud node's configuration. As specified in the ticket object, there is currently one processor available within
the Cloud 2, which means that at this time only Claim 1 can be served. Following this, the coordination
service notifies the Aneka-Coordinator, which has posted the Claim 1. Note that Claims 2 and 3 have to
wait for the arrival of tickets that can match their requirements.

Table 1. Claims stored with an Aneka Peer service at time T

Time   Claim ID   Service Type         Speed (GHz)   Processors   Type
300    Claim 1    P2PThreadExecution   > 2           1            Intel
400    Claim 2    P2PTaskExecution     > 2           –            Intel
500    Claim 3    P2PThreadExecution   > 2.4         –            Intel

Table 2. Ticket published with an Aneka Peer service at time T

Time   Cloud ID   Service Type         Speed (GHz)   Processors      Type
700    Cloud 2    P2PThreadExecution   2.7           1 (available)   Intel
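Under stated assumptions (field names are invented, and matching is simplified to CPU type and speed), the ticket-redemption step just described can be sketched as follows; with the values of Tables 1 and 2 only Claim 1 is served, because Cloud 2 advertises a single available processor.

```python
# Illustrative coordination-service logic: redeem an incoming ticket against the
# waiting claims without over-provisioning the ticket-issuing Cloud.
# Field names are assumptions; matching is simplified to CPU type and speed.

claims = [  # stored claims, ordered by arrival time (cf. Table 1)
    {"id": "Claim 1", "min_speed": 2.0, "cpu": "Intel"},
    {"id": "Claim 2", "min_speed": 2.0, "cpu": "Intel"},
    {"id": "Claim 3", "min_speed": 2.4, "cpu": "Intel"},
]
ticket = {"cloud": "Cloud 2", "speed": 2.7, "cpu": "Intel", "available": 1}  # cf. Table 2

def redeem(ticket, claims):
    """Serve matching claims in arrival order until the ticket's capacity is exhausted."""
    served = []
    for claim in claims:
        if ticket["available"] == 0:
            break                                   # no over-provisioning of Cloud 2
        if claim["cpu"] == ticket["cpu"] and ticket["speed"] >= claim["min_speed"]:
            served.append(claim["id"])              # notify the posting Aneka-Coordinator
            ticket["available"] -= 1
    return served

print(redeem(ticket, claims))   # ['Claim 1']; Claims 2 and 3 keep waiting for new tickets
```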
The Connectivity layer is responsible for undertaking key-based routing in the DHT overlay, where
it can implement the routing methods based on DHTs, such as Chord, CAN, and Pastry. The actual
implementation protocol at this layer does not directly affect the operations of the Core services layer.
In principle, any DHT implementation at this layer could perform the desired task. DHTs are inherently
self-organizing, fault-tolerant, and scalable.
At the Connectivity layer, our middleware utilizes the open source implementation of Pastry DHT
known as FreePastry (2008). FreePastry offers a generic, scalable and efficient peer-to-peer routing framework for the development of decentralized Cloud services. FreePastry is an open source
implementation of the well-known Pastry routing substrate. It exposes a Key-based Routing (KBR) API
and, given a Key K, the Pastry routing algorithm can find the peer responsible for this key in log_b n messages, where b is the base and n is the number of Aneka Peers in the network. Nodes in a Pastry overlay
form a decentralized, self-organising and fault-tolerant circular network within the Internet. Both data
and peers in the Pastry overlay are assigned Ids from a 160-bit unique identifier space. These identifiers
are generated by hashing the objects' names, a peer's IP address or public key using cryptographic
hash functions such as SHA-1/2. FreePastry is currently available under BSD-like license. FreePastry
framework supports the P2P Common API specification proposed in the paper (Dabek, Zhao, Druschel,
Kubiatowicz, & Stoica, 2003).

EXPERIMENTAL EVALUATION AND DISCUSSION


In this section, we evaluate the performance of the Aneka-Federation software system by creating a
resource sharing network that consists of 5 Aneka Enterprise Clouds (refer to Figure 11). These Aneka
Enterprise Clouds are installed and configured in three different Laboratories (Labs) within the Computer
Science and Software Engineering Department, The University of Melbourne. The nodes in these Labs
are connected through a Local Area Network (LAN). The LAN connection has a data transfer bandwidth
of 100 Mb/s (megabits per second). Next, the various parameters and application characteristics
related to this study are briefly described.
Aneka enterprise cloud configuration: Each Aneka Cloud in the experiments is configured to
have 4 nodes, out of which one node instantiates the Aneka-Coordinator service. In addition
to the Aneka Coordinator service, this node also hosts the other optional services including the P2PScheduling (for Thread and Task models) and P2PExecution services (for Thread and Task models).
The remaining 3 nodes are configured to run the P2PExecution services for Task and Thread programming models. These nodes connect and communicate with the Aneka-Coordinator service through .Net
remoting messaging APIs. The P2PExecution services periodically update their usage status with the
Aneka-Coordinator service. The update delay is a configurable parameter with values in milliseconds or
seconds. The nodes across different Aneka Enterprise Clouds update their status dynamically with the
decentralized Content-based services. The node status update delays across the Aneka Enterprise Clouds
are uniformly distributed over interval [5, 40] seconds.
FreePastry network configuration: Both Aneka Peers' nodeIds and claim/ticket objectIds are
randomly assigned from and uniformly distributed in the 160-bit Pastry identifier space. Every Content-based service is configured to buffer a maximum of 1000 messages at a given instance of time. The buffer size is chosen to be sufficiently large such that FreePastry does not drop any messages. Other
network parameters are configured to the default values as given in the file freepastry.params. This file
is provided with the FreePastry distribution.
Spatial index configuration: The minimum division fmin of logical d-dimensional spatial index that
forms the basis for mapping, routing, and searching the claim and ticket objects is set to 3, while the
maximum height of the spatial index tree, fmax, is constrained to 3. In other words, the division of the
d-dimensional attribute space is not allowed beyond fmin. This is done for simplicity; understanding the load-balancing issues of spatial indices (Tanin et al., 2007) with increasing fmax is a different research problem
and is beyond the scope of this chapter. The index space has provision for defining claim and ticket objects
that specify the Aneka nodes'/services' characteristics in 4 dimensions: Aneka service type, number of processors, processor architecture, and processing speed. The aforementioned spatial index
configuration results in 81 (3^4) index cells at the fmin level. On average, 16 index cells are hashed to an
Aneka Peer in a network of 5 Aneka Coordinators.
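A quick check of this arithmetic (a tiny illustrative snippet; the per-peer average simply assumes cells are spread evenly over the 5 peers):

```python
# Base index cells at the minimum division level and average cells per peer.
f_min, dim, peers = 3, 4, 5
cells = f_min ** dim
print(cells)                 # 81
print(round(cells / peers))  # ~16 cells hashed to each of the 5 Aneka Peers on average
```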
Claim and ticket objects spatial extent: Ticket objects in the Aneka-Federation express equality
constraints on an Aneka node's hardware/software attribute values (e.g. =). In other words, ticket objects
are always d-dimensional (spatial) point queries for this study. On the other hand, the claim objects posted
by P2PScheduling services have their spatial extent in d dimensions with both range and fixed constraints
(e.g. >=, <=) on the attributes. The spatial extent of a claim object in a given attribute dimension is
controlled by the characteristics of the node that is hosting the P2PScheduling service. Attributes
including Aneka service type, processor architecture, and number of processors are fixed, i.e. they are
expressed as equality constraints. The value for processing speed is expressed using >= constraints, i.e.
the search is for Aneka services that can process applications at least as fast as what is available on the
submission node. However, the P2PScheduling services can create claim objects with different kinds of
constraints, which can result in different routing, searching, and matching complexity. Studying this
behavior of the system is beyond the scope of this chapter.
Application models: Aneka supports composition and execution of applications programmed using
different models (Vecchiola & Chu, 2008) on the same enterprise Cloud infrastructure. The
experimental evaluation in this chapter considers simultaneous execution of applications programmed
using Task and Thread models. The Task model defines an application as a collection of one or more
tasks, where each task represents an independent unit of execution. Similarly, the Thread model defines
an application as a collection of one or more independent threads. Both models can be successfully
utilized to compose and program embarrassingly parallel programs (parameter sweep applications).
The Task model is more suitable for cloud-enabling legacy applications, while the Thread model
fits better for implementing and architecting new applications and algorithms on clouds, since it gives a finer
degree of control and flexibility as regards runtime control.
To demonstrate the effectiveness of the Aneka-Federation platform with regard to: (i) ease and flexibility of heterogeneous application composition; (ii) support for different programming models; and
(iii) feasibility of concurrently scheduling heterogeneous applications on a shared Cloud computing infrastructure, the experiments are run with the following applications:

Persistence of Vision Raytracer (2008): This application is cloud enabled using the Aneka Task
programming model. POV-Ray is an image rendering application that can create very complex
and realistic three dimensional models. Aneka POV-Ray application interface allows the selection
of a model, the dimension of the rendered image, and the number of independent tasks into which
rendering activities have to be partitioned. The task partition is based on the values that a user
specifies for parameter rows and columns on the interface. In the experiments, the values for the
rows and the columns are varied over the interval [5 x 5, 13 x 13] in steps of 2.
Mandelbrot Set (2008): Mathematically, the Mandelbrot set is an ordered collection of points in
the complex plane, the boundary of which forms a fractal. Aneka implements and cloud enables
the Mandelbrot fractal calculation using the Thread programming model. The application submission interface allows the user to configure the number of horizontal and vertical partitions into
which the fractal computation can be divided. The number of independent thread units created is
equal to the horizontal x vertical partitions. For evaluations, we vary the values for horizontal and
vertical parameters over the interval [5 x 5, 13 x 13] in steps of 2. This configuration results in 5
observation points, as illustrated in the short calculation after this list.
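For both applications above, the problem-size sweep can be summarized with a short calculation (illustrative only): each observation point corresponds to an n x n partitioning, and the number of independent tasks or threads is the product of the two partition counts.

```python
# Granularity sweep used for both POV-Ray (tasks) and Mandelbrot (threads):
# partitions vary over [5 x 5, 13 x 13] in steps of 2, giving 5 observation points.
partitions = range(5, 14, 2)                     # 5, 7, 9, 11, 13
sizes = {f"{n} x {n}": n * n for n in partitions}
print(len(sizes))   # 5 observation points
print(sizes)        # {'5 x 5': 25, '7 x 7': 49, '9 x 9': 81, '11 x 11': 121, '13 x 13': 169}
```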

Results and Discussion


To measure the performance of the Aneka-Federation system as regards scheduling, we quantify the response time metric for the POV-Ray and Mandelbrot applications. The response time for an application
is computed by subtracting the time at which the application is submitted from the output arrival time
of the last task/thread in the execution list. The observations are made for different application granularities (sizes) as discussed in the previous section.
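In code form (a trivial illustrative helper; the names and numbers are assumptions):

```python
# Response time = arrival time of the last task/thread output - submission time.
def response_time(submission_time, output_arrival_times):
    return max(output_arrival_times) - submission_time

print(response_time(0.0, [12.3, 40.1, 37.8]))  # 40.1 seconds for this hypothetical run
```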
Figure 12 depicts the results for response time in seconds with increasing granularity for the POVRay application. The users at Aneka Cloud 1, 3, 4 submit the applications to their respective Aneka
Coordinator services (refer to the Figure 11). The experiment results show that the POV-Ray application
submitted at Aneka Cloud 1 experienced comparatively lower response times for its POV-Ray tasks
compared to the ones submitted at Aneka Clouds 3 and 4. The fundamental reason behind this behavior
of the system is the spatial extent and attribute constraints of the resource claim objects posted by the
P2PTaskScheduling service at Aneka Cloud 1. As shown in Figure 11, every Aneka Cloud offers processors of type Intel with varying speeds. Based on the configuration described in the previous section, the processing speed
is expressed using >= constraints, which means that the applications submitted in Aneka Enterprise
Clouds 1 and 2 (processing speed = 2.4 GHz) can be executed on any of the nodes in the enterprise
Clouds 1, 2, 3, 4, and 5.
However, the application submitted at Aneka Clouds 3 and 4 can be executed only on Clouds 3, 4,
and 5. Accordingly, the application submitted in Aneka Cloud 3 can only be processed locally, as the
spatial dimension for processing speed in its resource claim objects specifies the constraint >= 3.5
GHz. Due to these spatial constraints on the processing speed attribute value, the applications in different Clouds get access to varying Aneka node pools, which results in different levels of response times.

Figure 12. POV-Ray application: Response time (secs) vs. problem size

Figure 13. Mandelbrot application: Response time (Secs) vs. problem size


Figure 14. P2PTaskExecution service: Time (secs) vs. number of jobs completed

For the aforementioned reasons, it can be seen in Figure 12 and Figure 13 (Mandelbrot application)
that applications at Aneka Clouds 1 and 2 have relatively better response times as compared to the ones
submitted at Aneka Cloud 3, 4, and 5.
Figure 14 and Figure 15 present the results for the total number of jobs processed in different
Aneka Clouds by their P2PTaskExecution and P2PThreadExecution services. The results show that
the P2PTaskExecution and P2PThreadExecution services hosted within the Aneka Clouds 3, 4, and 5
process relatively more jobs as compared to those hosted within Aneka Clouds 1 and 2. This happens
due to the spatial constraint on the processing speed attribute value in the resource claim object posted
by different P2PScheduling (Task/Thread) services across the Aneka Clouds. As Aneka Cloud 5 offers
the fastest processing speed (within the spatial extent of all resource claim objects in the system), it
processes more jobs as compared to the other Aneka Clouds in the federation (see Figure 14 and Figure
15). Thus, in the proposed Aneka-Federation system, the spatial extent for resource attribute values
specified by the P2PScheduling services directly controls the job distribution and application response
times in the system.

Figure 15. P2PThreadExecution service: Time (secs) vs. number of jobs completed

Figure 16. Enterprise Cloud Id vs. job %
Figure 16 shows the aggregate percentage of task and thread jobs processed by the nodes of the different Aneka Clouds in the federation. As mentioned in our previous discussions, Aneka Clouds 3, 4, and 5
end up processing a larger percentage of jobs for both the Task and Thread application composition models. Together
they process approximately 140% of the total 200% of jobs (100% task + 100% thread) in the federation.

RELATED WORK
Volunteer computing systems including SETI@home (Anderson, Cobb, Korpela, Lebofsky, & Werthimer,
2002) and BOINC (Anderson, 2004) are the first generation implementation of public resource computing systems. These systems are engineered on the traditional master/worker model, wherein a centralized scheduler/coordinator is responsible for scheduling, dispatching tasks and collecting data from the
participant nodes in the Internet. These systems do not provide any support for multiple applications and
programming models, a capability which the Aneka-Federation platform inherits from Aneka.
Unlike SETI@home and BOINC, Aneka-Federation creates a decentralized overlay of Aneka Enterprise
Clouds. Further, Aneka-Federation allows submission, scheduling, and dispatching of applications from
any Aneka-Coordinator service in the system, thus giving every enterprise Cloud site autonomy and
flexibility as regards to decision making.
OurGrid (Andrade et al., 2003) is a peer-to-peer middleware infrastructure for creating an Internetwide enterprise Grid computing platform. The message routing and communication between the OurGrid
sites is done via a broadcast messaging primitive based on the JXTA (Gong, 2001) substrate. The ShareGrid
Project (2008) extends the OurGrid infrastructure with fault-tolerant scheduling capability by replicating
tasks across a set of available nodes. In contrast to the OurGrid and the ShareGrid, Aneka-Federation
implements a coordinated scheduling protocol by embedding a d-dimensional index over a DHT overlay,
which makes the system highly scalable and guarantees deterministic search behavior (unlike JXTA).
Further, the OurGrid system supports only the parameter sweep application programming model, while
the Aneka-Federation supports more general programming abstractions including Thread, Task, and
Dataflow.
Peer-to-Peer Condor flock system (Butt et al., 2003) aggregates Internet-wide distributed condor
work pools based on the Pastry overlay (Rowstron et al., 2001). The site managers in the Pastry overlay
accomplish the load-management by announcing their available resources to all sites whose Identifiers
(IDs) appear in the routing table. An optimized version of this protocol proposes recursively propagating
the load-information to the sites whose IDs are indexed by the contacted site's routing table. The scheduling coordination in an overlay is based on probing each site in the routing table for resource availability. The
probe message propagates recursively in the network until a suitable node is located. In the worst case,
the number of messages generated due to recursive propagation can result in broadcast communication. In contrast, Aneka-Federation implements a more scalable, deterministic and flexible coordination
protocol by embedding a logical d-dimensional index over DHT overlay. The d-dimensional index gives
the Aneka-Federation the ability to perform deterministic search for Aneka services, which are defined
based on the complex node attributes (CPU type, speed, service type, utilization).
XtremWeb-CH (Abdennadher & Boesch, 2005) extends the XtremWeb project (Fedak, Germain,
Neri, & Cappello, 2002) with functionalities such as peer-to-peer communication among the worker
nodes. However, the core scheduling and management component in XtremWeb-CH, which is called
the coordinator, is a centralized service that has limited scalability. G2-P2P (Mason & Kelly, 2005)
uses the Pastry framework to create a scalable cycle-stealing framework. The mappings of objects to
nodes are done via Pastry routing method. However, the G2-P2P system does not implement any specific
scheduling or load-balancing algorithm that can take into account the current application load on the
nodes and, based on that, perform run-time load-balancing. In contrast, the Aneka-Federation realizes
a truly decentralized, cooperative and coordinated application scheduling service that can dynamically
allocate applications to the Aneka services/nodes without over-provisioning them.

CONCLUSION AND FUTURE DIRECTIONS


The functionality exposed by the Aneka-Federation system is very powerful, and our experimental
results on a real test-bed prove that it is a viable technology for federating high-throughput Aneka Enterprise Cloud systems. One of our immediate goals is to support substantially larger Aneka-Federation
setups than the ones used in the performance evaluations. We intend to provide support for composing
more complex application models such as e-Research workflows that have both compute and data node
requirements. The resulting Aneka-Federation infrastructure will enable a new generation of application
composition environments where the application components, Enterprise Clouds, services, and data
would interact as peers.
There are several important aspects of this system that require further implementation and future
research efforts. One such aspect is developing fault-tolerant (self-healing) application scheduling
algorithms that can ensure robust executions in the event of concurrent failures and rapid join/leave
operations of enterprise Clouds/Cloud nodes in the decentralized Aneka-Federation overlay. Another important
design aspect that we would like to improve is ensuring a truly secure (self-protected) Aneka-Federation
infrastructure based on peer-to-peer reputation and accountability models.

ACKNOWLEDGMENT
The authors would like to thank Australian Research Council (ARC) and the Department of Innovation,
Industry, Science, and Research (DIISR) for supporting this research through the Discovery Project
and International Science Linkage grants respectively. We would also like to thank Dr. Tejal Shah, Dr.
Sungjin Choi, Dr. Christian Vecchiola, and Dr. Alexandre di Costanzo for proofreading the initial draft
of this chapter. The chapter is partially derived from our previous publications (Ranjan, 2007).

REFERENCES
Abdennadher, N., & Boesch, R. (2005). Towards a peer-to-peer platform for high performance computing.
In HPCASIA05 Proceedings of the Eighth International Conference in High-Performance Computing in
Asia-Pacific Region, (pp. 354-361). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://
doi.ieeecomputersociety.org/10.1109/HPCASIA.2005.98
Amazon Elastic Compute Cloud. (2008, November). Retrieved from http://www.amazon.com/ec2
Anderson, D. P. (2004). BOINC: A system for public-resource computing and storage. In Grid04 Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing, (pp. 4-10). Los Alamitos,
CA: IEEE Computer Society. Retrieved from http://dx.doi.org/10.1109/GRID.2004.14
Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M., & Werthimer, D. (2002). SETI@home: An experiment in public-resource computing. Communications of the ACM, 45(11), 56-61. New York: ACM
Press. Retrieved from http://doi.acm.org/10.1145/581571.581573
Andrade, N., Cirne, W., Brasileiro, F., & Roisenberg, R. (2003, October). OurGrid: An approach to
easily assemble grids with equitable resource sharing. In JSSPP03 Proceedings of the 9th Workshop
on Job Scheduling Strategies for Parallel Processing (LNCS). Berlin/Heidelberg, Germany: Springer.
doi: 10.1007/10968987
Butt, A. R., Zhang, R., & Hu, Y. C. (2003). A self-organizing flock of condors. In SC 03 Proceedings
of the ACM/IEEE Conference on Supercomputing, (p. 42). Los Alamitos, CA: IEEE Computer Society.
Retrieved from http://doi.ieeecomputersociety.org/10.1109/SC.2003.10031
Buyya, R., Yeo, C. S., & Venugopal, S. (2008, September). Market-oriented cloud computing: vision,
hype, and reality for delivering it services as computing utilities. In HPCC08 Proceedings of the 10th
IEEE International Conference on High Performance Computing and Communications. Los Alamitos,
CA: IEEE CS Press.


Chu, X., Nadiminti, K., Jin, C., Venugopal, S., & Buyya, R. (2007, December). Aneka: Next-generation
enterprise grid platform for e-science and e-business applications, e-Science07: In Proceedings of the
3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India (pp. 151-159).
Los Alamitos, CA: IEEE Computer Society Press. For more information, see http://doi.ieeecomputersociety.org/10.1109/E-SCIENCE.2007.12
Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., & Stoica, I. (2003). Towards a common API for
structured peer-to-peer overlays. In IPTPS03 Proceedings of the 2nd International Workshop on Peerto-Peer Systems, (pp. 33-44). Heidelberg, Germany: SpringerLink. doi: 10.1007/b11823
Fedak, G., Germain, C., Neri, V., & Cappello, F. (2002, May). XtremWeb: A generic global computing
system. In CCGRID01: Proceeding of the First IEEE Conference on Cluster and Grid Computing,
workshop on Global Computing on Personal Devices, Brisbane, (pp. 582-587). Los Alamitos, CA: IEEE
Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/CCGRID.2001.923246
Fowler, M. (2008, November). Inversion of control containers and the dependency injection pattern.
Retrieved from http://www.martinfowler.com/articles/injection.html
FreePastry. (2008, November). Retrieved from http://freepastry.rice.edu/FreePastry
Gong, L. (2001, June). JXTA: A network programming environment. IEEE Internet Computing, 5(3),
88-95. Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.
org/10.1109/4236.93518
Google App Engine. (2008, November). Retrieved from http://appengine.google.com
Gupta, A., Sahin, O. D., Agarwal, D., & El Abbadi, A. (2004). Meghdoot: Content-based publish/
subscribe over peer-to-peer networks. In Middleware04 Proceedings of the 5th ACM/IFIP/USENIX
International Conference on Middleware, (pp. 254-273). Heidelberg, Germany: SpringerLink. doi:
10.1007/b101561.
Luther, A., Buyya, R., Ranjan, R., & Venugopal, S. (2005, June). Alchemi: A .NET-based enterprise
grid computing system, In ICOMP05 Proceedings of the 6th International Conference on Internet
Computing, Las Vegas, USA.
Mandelbrot Set. (2008, November). Retrieved from http://mathworld.wolfram.com/MandelbrotSet.
html.
Mason, R., & Kelly, W. (2005). G2-p2p: A fully decentralized fault-tolerant cycle-stealing framework.
In R. Buyya, P. Coddington, and A. Wendelborn, (Eds.), In AusGrid05 Australasian Workshop on Grid
Computing and e-Research, Newcastle, Australia, (Vol. 44 of CRPIT, pp. 33-39).
Microsoft Live Mesh. (2008, November). Retrieved from http://www.mesh.com.
Parashar, M., & Hariri, S. (Eds.). (2007). Autonomic computing: Concepts, infrastructures, and applications. Boca Raton, FL: CRC Press, Taylor and Francis Group.
Persistence of Vision Raytracer. (2008, November). Retrieved from http://www.povray.org


Ranjan, R. (2007, July). Coordinated resource provisioning in federated grids. Doctoral thesis, The
University of Melbourne, Australia.
Ranjan, R., Harwood, A., & Buyya, R. (2008, July). Peer-to-peer resource discovery in global grids: A
tutorial. IEEE Communication Surveys and Tutorials (COMST), 10(2), 6-33. New York: IEEE Communications Society Press. doi:10.1109/COMST.2008.4564477
Ranjan, R., Harwood, A., & Buyya, R. (2008). Coordinated load management in peer-to-peer coupled
federated grid systems. (Technical Report GRIDS-TR-2008-2). Grid Computing and Distributed Systems
Laboratory, The University of Melbourne, Australia. Retrieved from http://www.gridbus.org/reports/CoordinatedGrid2007.pdf
Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Schenker, S. (2001). A scalable content-addressable
network. In SIGCOMM01 Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, (pp. 161-172). New York: ACM Press. Retrieved
from http://doi.acm.org/10.1145/ 383059.383072
Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, decentralized object location, and routing for
large-scale peer-to-peer systems. In Middleware01 Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms, (pp. 329-350). Heidelberg, Germany: SpringerLink. doi:
10.1007/3-540-45518-3
Samet, H. (2008, November). The design and analysis of spatial data structures. New York: Addison-Wesley Publishing Company.
ShareGrid Project. (2008, November). Retrieved from http://dcs.di.unipmn.it/sharegrid.
Spring.NET. (2008, November). Retrieved from http://www.springframework.net.
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peerto-peer lookup service for internet applications. In SIGCOMM01 Proceedings of the 2001 Conference
on Applications, Technologies, Architectures, and Protocols for Computer Communications, (pp. 149
160). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/383059.383071
Tanin, E., Harwood, A., & Samet, H. (2007). Using a distributed quadtree index in peer-to-peer networks. The VLDB Journal, 16(2), 165-178. Heidelberg, Germany: SpringerLink. doi:10.1007/
s00778-005-0001-y
Vecchiola, C., & Chu, X. (2008). Aneka tutorial series on developing task model applications. (Technical Report). Grid Computing and Distributed Systems Laboratory, The University of Melbourne,
Australia.
Zhang, X., Freschl, J. L., & Schopf, J. M. (2003, June). A performance study of monitoring and information services for distributed systems. In HPDC03: Proceedings of the Twelfth International Symposium on High Performance Distributed Computing, (pp. 270-281). Los Alamitos, CA: IEEE Computer
Society Press.
Zhao, B. Y., Kubiatowicz, J. D., & Joseph, A. D. (2001, April). Tapestry: An infrastructure for Faulttolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, UC Berkeley, USA.


KEY TERMS AND DEFINITIONS


Enterprise Cloud: An enterprise Cloud is a type of computing infrastructure that consists of a collection of inter-connected computing nodes, virtualized computers, and software services that are dynamically provisioned among the competing end-users' applications based on their availability, performance,
capability, and Quality of Service (QoS) requirements.
Aneka-Federation: The Aneka-Federation integrates numerous small scale Aneka Enterprise Cloud
services and nodes that are distributed over multiple control and enterprise domains as parts of a single
coordinated resource leasing abstraction.
Overlay Networking: A logical inter-connection of services, nodes, devices, sensors, instruments,
and data hosts at application layer (under TCP/IP model) over an infrastructure of physical network routing systems such as the Internet or Local Area Network (LAN). In overlays, the routing and forwarding
of messages between services is done on the basis of their relationship in the logical space, while the
messages are actually transported through the physical links.
Decentralized Systems: A distributed Cloud system configuration is considered to be decentralized if none of the components in the system is more important than the others; if one of the
components fails, it is neither more nor less harmful to the system than the failure of any
other component in the system.
Distributed Hash Table (DHT): A DHT is a data structure that associates a unique index with a data item.
Entries in DHTs are stored as (index, data) pairs. A data item can be looked up within a logarithmic bound on overlay
routing hops and messages if the corresponding index is known. DHTs are self-managing in their
behavior as they can dynamically adapt to leave, join and failure of nodes or services in the system.
Recently, DHTs have been applied to build Internet scale systems that involve hundreds of thousands
of components (node, service, data, and file).
Resource Discovery: Resource discovery activity involves searching for the appropriate service, node,
or data type that match the requirements of applications such as file sharing, Grid applications, Cloud
applications etc. The resource discovery methods can be engineered based on various network models
including centralized, decentralized, and hierarchical with varying degree of scalability, fault-tolerance,
and network performance.
Multi-Dimensional Queries: Complex web services, Grid resource characteristics, and Cloud services are commonly represented by a number of attributes such as service type, hardware (processor
type, speed), installed software (libraries, operating system), service name, security (authentication and
authorization control); efficiently discovering the aforementioned services with deterministic guarantees
in a decentralized and scalable manner requires lookup queries to encapsulate search values for each attribute (search dimension). The search is resolved by satisfying the constraints for the values expressed
in each dimension, hence resulting in multi-dimensional queries that search for values in a virtual space
that has multiple dimensions (x, y, z).

ENDNOTE
1. 3rd generation enterprise Grids are exhibiting properties that are commonly envisaged in Cloud
computing systems.


Section 3

Programming Models and Tools


Chapter 10

Reliability and Performance Models for Grid Computing

Yuan-Shun Dai
University of Electronic Science and Technology of China, China & University of Tennessee, Knoxville, USA
Jack Dongarra
University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA; & University of
Manchester, UK

ABSTRACT
Grid computing is a newly developed technology for complex systems with large-scale resource sharing,
wide-area communication, and multi-institutional collaboration. It is hard to analyze and model the Grid
reliability because of its largeness, complexity and stiffness. Therefore, this chapter introduces the Grid
computing technology, presents different types of failures in a grid system, models the grid reliability with
star structure and tree structure, and finally studies optimization problems for grid task partitioning and
allocation. The chapter then presents models for the star-topology considering data dependence and the tree-structure considering failure correlation. Evaluation tools and algorithms are developed, evolved from
the universal generating function and graph theory. Then, the failure correlation and data dependence are
considered in the model. Numerical examples are illustrated to show the modeling and analysis.

INTRODUCTION
Grid computing (Foster & Kesselman, 2003) is a newly developed technology for complex systems with
large-scale resource sharing, wide-area communication, and multi-institutional collaboration etc, see
e.g. Kumar (2000), Das et al. (2001), Foster et al. (2001, 2002) and Berman et al. (2003). Many experts
believe that the grid technologies will offer a second chance to fulfill the promises of the Internet.
The real and specific problem that underlies the Grid concept is coordinated resource sharing and
problem solving in dynamic, multi-institutional virtual organizations (Foster et al., 2001). The sharing
that we are concerned with is not primarily file exchange but rather direct access to computers, software,
data, and other resources. This is required by a range of collaborative problem-solving and resource-
brokering strategies emerging in industry, science, and engineering. This sharing is highly controlled
by the resource management system (Livny & Raman, 1998), with resource providers and consumers
defining what is shared, who is allowed to share, and the conditions under which the sharing occurs.
Recently, the Open Grid Service Architecture (Foster et al., 2002) enables the integration of services and resources across distributed, heterogeneous, dynamic, virtual organizations. A grid service is
desired to complete a set of programs under the circumstances of grid computing. The programs may
require using remote resources that are distributed. However, the programs initially do not know the
site information of those remote resources in such a large-scale computing environment, so the resource
management system (the brain of the grid) plays an important role in managing the pool of shared resources, in matching the programs to their requested resources, and in controlling them to reach and use
the resources through wide-area network.
The structure and functions of the resource management system (RMS) in the grid have been introduced in detail by Livny & Raman (1998), Cao et al. (2002), Krauter et al. (2002) and Nabrzyski et
al. (2003). Briefly stated, the programs in a grid service send their requests for resources to the RMS.
The RMS adds these requests into the request queue (Livny & Raman, 1998). Then, the requests are
waiting in the queue for the matching service of the RMS for a period of time (called waiting time),
see e.g. Abramson et al. (2002). In the matching service, the RMS matches the requests to the shared
resources in the grid (Ding et al., 2002) and then builds the connection between the programs and their
required resources. Thereafter, the programs can obtain access to the remote resources and exchange
information with them through the channels. The grid security mechanism then operates to control the
resource access through the Certification, Authorization and Authentication, which constitute various
logical connections that cause dynamicity in the network topology.
Although the developmental tools and infrastructures for the grid have been widely studied (Foster
& Kesselman, 2003), grid reliability analysis and evaluation are not easy because of its complexity,
largeness and stiffness. Grid computing involves different types of failures that can make a service
unreliable, such as blocking failures, time-out failures, matching failures, network failures, program
failures and resource failures. This chapter thoroughly analyzes these failures.
Usually the grid performance measure is defined as the task execution time (service time). This index
can be significantly improved by using the RMS that divides a task into a set of subtasks which can be
executed in parallel by multiple online resources. Many complicated and time-consuming tasks that
could not be implemented before are working well under the grid environment now.
It is observed in many grid projects that the service time experienced by the users is a random
variable. Finding the distribution of this variable is important for evaluating the grid performance and
improving the RMS functioning. The service time is affected by many factors. First, various available
resources usually have different task processing speeds online. Thus, the task execution time can vary
depending on which resource is assigned to execute the task/subtasks. Second, some resources can fail
when running the subtasks, so the execution time is also affected by the resource reliability. Similarly,
the communication links in grid service can be disconnected during the data transmission. Thus, the
communication reliability influences the service time as well as data transmission speed through the
communication channels. Moreover, the service requested by a user may be delayed due to the queue of
earlier requests submitted from others. Finally, the data dependence imposes constraints on the sequence
of the subtasks execution, which has significant influence on the service time.


Figure 1. Grid computing system

This chapter first introduces the grid computing system and service, and analyzes various failures in
grid system. Both reliability and performance are analyzed in accordance with the performability concept. Then the chapter presents models for star- and tree-topology grids respectively. The reliability and
performance evaluation tools and algorithms are developed based on the universal generating function,
graph theory, and Bayesian approach. Both failure correlation and data dependence are considered in
the models.

GRID SERVICE RELIABILITY AND PERFORMANCE


Description of the Grid Computing
Today, the Grid computing systems are large and complex, such as the IP-Grid (Indiana-Purdue Grid)
that is a statewide grid (http://www.ip-grid.org/). IP-Grid is also a part of the TeraGrid that is a nationwide grid in the USA (http://www.teragrid.org/). The largeness and complexity of the grid challenge
the existing models and tools to analyze, evaluate, predict and optimize the reliability and performance
of grid systems. The global grid system is generally depicted in Figure 1. Various organizations
(Foster et al., 2001) integrate/share their resources on the global grid. Any program running on the grid
can use those resources if it can be successfully connected to them and is authorized to access them.
The sites that contain the resources or run the programs are linked by the global network as shown in
the left part of Figure 1.


The distribution of the service tasks/subtasks among the remote resources is controlled by the
Resource Management System (RMS) that is the brain of the grid computing, see e.g. Livny & Raman (1998). The RMS has five layers in general, as shown in Figure 1: program layer, request layer,
management layer, network layer and resource layer.
1. Program layer: The program layer represents the programs of the customers' applications. The programs describe their required resources and constraint requirements (such as deadline, budget, function etc.). These resource descriptions are translated to resource requests and sent to the next request layer.
2. Request layer: The request layer provides the abstraction of program requirements as a queue of resource requests. The primary goals of this layer are to maintain this queue in a persistent and fault-tolerant manner and to interact with the next management layer by injecting resource requests for matching, and claiming matched resources of the requests.
3. Management layer: The management layer may be thought of as the global resource allocation layer. It has the function of automatically detecting new resources, monitoring the resource pool, removing failed/unavailable resources, and, most importantly, matching the resource requests of a service to the registered/detected resources. If resource requests are matched with the registered resources in the grid, this layer sends the matched tags to the next network layer.
4. Network layer: The network layer dynamically builds connections between the programs and resources when receiving the matched tags and controls them to exchange information through communication channels in a secure way.
5. Resource layer: The resource layer represents the shared resources from different resource providers, including the usage policies (such as service charge, reliability, serving time etc.).

Failure Analysis of Grid Service


Even though all online nodes or resources are linked through the Internet with one another, not all resources or communication channels are actually used for a specific service. Therefore, according to this
observation, we can make tractable models and analyses of grid computing via a virtual structure for a
certain service. The grid service is defined as follows:
Grid service is a service offered under the grid computing environment, which can be requested by different users through the RMS, which includes a set of subtasks that are allocated to specific resources
via the RMS for execution, and which returns the result to the user after the RMS integrates the outputs
from different subtasks.
The above five layers coordinate together to achieve a grid service. At the Program layer, the subtasks
(programs) composing the entire grid service task initially send their requests for remote resources to
the RMS. The Request layer adds these requests in the request queue. Then, the Management layer
tries to find the sites of the resources that match the requests. After all the requests of those programs
in the grid service are matched, the Network layer builds the connections among those programs and
the matched resources.
It is possible to identify various types of failures on respective layers:


Program layer: Software failures can occur during the subtask (program) execution; see e.g. Xie
(1991) and Pham (2000).
Request layer: When the programs' requests reach the request layer, two types of failures may
occur: blocking failure and time-out failure. Usually, the request queue has a limitation on
the maximal number of waiting requests (Livny & Raman, 1998). If the queue is full when a new
request arrives, the request blocking failure occurs. The grid service usually has its due time set
by customers or service monitors. If the waiting time for the requests in the queue exceeds the due
time, the time-out failure occurs, see e.g. Abramson et al. (2002).
Management layer: At this layer, matching failure may occur if the requests fail to match with
the correct resources, see e.g. Xie et al. (2004, pp. 185-186). Errors, such as incorrectly translating
the requests, registering a wrong resource, ignoring resource disconnection, misunderstanding the
users' requirements, can cause these matching failures.
Network layer: When the subtasks (programs) are executed on remote resources, the communication channels may be disconnected either physically or logically, which causes the network
failure, especially for those long time transmissions of large dataset, see e.g. Dai et al. (2002).
Resource layer: The resources shared on the grid can be of software, hardware or firmware type.
The corresponding software, hardware or combined faults can cause resource unavailability.

Grid Service Reliability and Performance


Most previous research on distributed computing studied performance and reliability separately. However,
performance and reliability are closely related and affect each other, in particular under the grid computing
environment. For example, while a task is fully parallelized into m subtasks executed by m resources, the
performance is high but the reliability might be low because the failure of any resource prevents the entire
task from completion. This causes the RMS to restart the task, which reversely increases its execution
time (i.e. reduces performance). Therefore, it is worthwhile to assign some subtasks to several resources to
provide execution redundancy. However, excessive redundancy, even though improving the reliability,
can decrease the performance by not fully parallelizing the task. Thus, the performance and reliability
affect each other and should be considered together in the grid service modeling and analysis.
In order to study performance and reliability interactions, one also has to take into account the effect of service performance (execution time) upon the reliability of the grid elements. The conventional
models, e.g. Kumar et al. (1986), Chen & Huang (1992), Chen et al. (1997), and Lin et al., (2001), are
based on the assumption that the operational probabilities of nodes or links are constant, which ignores
the links' bandwidth, communication time and resource processing time. Such models are not suitable
for precisely modeling the grid service performance and reliability.
Another important issue that has much influence on the performance and reliability is data dependence,
which exists when some subtasks use the results from some other subtasks. The service performance and
reliability are affected by data dependence because the subtasks cannot be executed totally in parallel. For
instance, the resources that are idle waiting for the input to run the assigned subtasks are usually hot-standby, because a cold start is time consuming. As a result, these resources can fail in waiting mode.
The considerations presented above lead the following assumptions that lay in the base of grid service
reliability and performance model.


Assumptions:
1. The service request reaches the RMS and is served immediately. The RMS divides the entire service task into a set of subtasks. Data dependence may exist among the subtasks. The order is determined by precedence constraints and is controlled by the RMS.
2. Different grid resources are registered or automatically detected by the RMS. In a grid service, the structure of the virtual network (consisting of the RMS and resources involved in performing the service) can form a star topology with the RMS in the center, or a tree topology with the RMS in the root node.
3. The resources are specialized. Each resource can process one or multiple subtask(s) when it is available.
4. Each resource has a given constant processing speed when it is available and has a given constant failure rate. Each communication channel has a constant failure rate and a constant bandwidth (data transmission speed).
5. The failure rates of the communication channels or resources are the same when they are idle or loaded (hot standby model). The failures of different resources and communication links are independent.
6. If the failure of a resource or a communication channel occurs before the end of output data transmission from the resource to the RMS, the subtask fails.
7. Different resources start performing their tasks immediately after they get the input data from the RMS through communication channels. If the same subtask is processed by several resources (providing execution redundancy), it is completed when the first result is returned to the RMS. The entire task is completed when all of the subtasks are completed and their results are returned to the RMS from the resources.
8. The data transmission speed in any multi-channel link does not depend on the number of different packages (corresponding to different subtasks) sent in parallel. The data transmission time of each package depends on the amount of data in the package. If the data package is transmitted through several communication links, the link with the lowest bandwidth limits the data transmission speed.
9. The RMS is fully reliable, which can be justified by considering a relatively short interval of running a specific service. An imperfect RMS can also be easily included as a module connected in series to the whole grid service system.

Grid Service Time Distribution and Reliability/Performance Measures


The data dependence on task execution can be represented by an $m \times m$ matrix $H$ such that $h_{ki} = 1$ if subtask
$i$ needs for its execution output data from subtask $k$, and $h_{ki} = 0$ otherwise (the subtasks can always be
numbered such that $k < i$ for any $h_{ki} = 1$). Therefore, if $h_{ki} = 1$, execution of subtask $i$ cannot begin before
completion of subtask $k$. For any subtask $i$ one can define a set $\varphi_i$ of its immediate predecessors: $k \in \varphi_i$
if $h_{ki} = 1$.
The data dependence can always be presented in such a manner that the last subtask $m$ corresponds
to the final task processed by the RMS when it receives the output data of all the subtasks completed by the
grid resources.
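As a small illustration of this representation (a sketch with a made-up dependence matrix, not one of the chapter's examples), the predecessor sets $\varphi_i$ can be read directly off the columns of $H$:

```python
# Illustrative 4-subtask dependence matrix H (H[k][i] = 1 means subtask i+1
# needs output data from subtask k+1). The matrix itself is invented.
H = [
    [0, 1, 0, 1],   # subtask 1 feeds subtasks 2 and 4
    [0, 0, 0, 1],   # subtask 2 feeds subtask 4
    [0, 0, 0, 1],   # subtask 3 feeds subtask 4 (the final, RMS-processed subtask)
    [0, 0, 0, 0],
]

def predecessors(H, i):
    """phi_i: set of immediate predecessors of subtask i (1-based numbering)."""
    return {k + 1 for k in range(len(H)) if H[k][i - 1] == 1}

for i in range(1, 5):
    print(i, predecessors(H, i))   # e.g. 4 -> {1, 2, 3}
```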


The task execution time is defined as the time from the beginning of input data transmission from the RMS to a resource to the end of output data transmission from the resource to the RMS.
The amount of data that should be transmitted between the RMS and resource j that executes subtask i is denoted by a_i. If the data transmission between the RMS and resource j is accomplished through links belonging to a set γ_j, the data transmission speed is

s_j = min{b_x : L_x ∈ γ_j},   (1)

where b_x is the bandwidth of link L_x. Therefore, the random time t̃_ij of subtask i execution by resource j can take two possible values:

t_ij = τ_j + a_i / s_j   (2)

if resource j and the communication path γ_j do not fail until the subtask completion, and t̃_ij = ∞ otherwise. Here, τ_j is the processing time of the j-th resource.
Subtask i can be successfully completed by resource j if this resource and the communication path γ_j do not fail before the end of subtask execution. Given the constant failure rates of resource j and of the links, one can obtain the conditional probability of subtask success as

p_j(t_ij) = e^(−(λ_j + π_j)·t_ij),   (3)

where λ_j is the failure rate of resource j and π_j is the failure rate of the communication path between the RMS and resource j, which can be calculated as π_j = Σ_{L_x ∈ γ_j} λ_x, where λ_x is the failure rate of link L_x. The exponential distribution (3) is common in software and hardware component reliability and has been justified in both theory and practice, see e.g. Xie et al. (2004).
This gives the conditional distribution of the random subtask execution time t̃_ij: Pr(t̃_ij = t_ij) = p_j(t_ij) and Pr(t̃_ij = ∞) = 1 − p_j(t_ij).
Assume that each subtask i is assigned by the RMS to the resources composing a set ω_i. The RMS can initiate execution of any subtask i (send the data to all the resources from ω_i) only after the completion of every subtask k ∈ φ_i. Therefore the random time of the start of subtask i execution, T_i, can be determined as

T_i = max{Θ_k : k ∈ φ_i},   (4)

where Θ_k is the random completion time of subtask k. If φ_i = ∅, i.e. subtask i does not need data produced by any other subtask, the subtask execution starts without delay: T_i = 0. If φ_i ≠ ∅, T_i can have different realizations T_il (1 ≤ l ≤ N_i). Having the time T_i when the execution of subtask i starts and the time t̃_ij of


subtask i execution by resource j, one obtains the completion time for subtask i on resource j as

Θ̃_ij = T_i + t̃_ij.   (5)

In order to obtain the distribution of the random time Θ̃_ij, one has to take into account that the probability of any realization Θ̃_ij = T_il + t_ij is equal to the product of the probabilities of three events:
execution of subtask i starts at time T_il: q_il = Pr(T_i = T_il);
resource j does not fail before the start of execution of subtask i: p_j(T_il);
resource j does not fail during the execution of subtask i: p_j(t_ij).
Therefore, the conditional distribution of the random time Θ̃_ij, given that execution of subtask i starts at time T_il (T_i = T_il), takes the form

Pr(Θ̃_ij = T_il + t_ij) = p_j(T_il)·p_j(t_ij) = p_j(T_il + t_ij) = e^(−(λ_j + π_j)(T_il + t_ij)),   (6)
Pr(Θ̃_ij = ∞) = 1 − p_j(T_il + t_ij) = 1 − e^(−(λ_j + π_j)(T_il + t_ij)).

The random time of subtask i completion, Θ_i, is equal to the shortest time in which one of the resources from ω_i completes the subtask execution:

Θ_i = min{Θ̃_ij : j ∈ ω_i}.   (7)

According to the definition of the last subtask m, the time of its beginning corresponds to the service completion time, because the time of the final task processing by the RMS is neglected. Thus, the random service time is equal to T_m. Having the distribution (pmf) of the random value T_m in the form q_ml = Pr(T_m = T_ml) for 1 ≤ l ≤ N_m, one can evaluate the reliability and performance indices of the grid service.
In order to estimate both the service reliability and its performance, different measures can be used depending on the application. In applications where the execution time of each task (service time) is of critical importance, the system reliability R(θ*) is defined (according to the performability concept in Meyer (1980), Grassi et al. (1988) and Tai et al. (1993)) as the probability that the correct output is produced in a time less than θ*. This index can be obtained as

R(θ*) = Σ_{l=1}^{N_m} q_ml · 1(T_ml < θ*).   (8)

When no limitations are imposed on the service time, the service reliability is defined as the probability that the service produces correct outputs regardless of the service time, which can be referred to as R(∞). The conditional expected service time W is considered to be a measure of its performance; it determines the expected service time given that the service does not fail, i.e.

W = Σ_{l=1}^{N_m} T_ml·q_ml / R(∞).   (9)

Figure 2. Grid system with star architecture

STAR TOPOLOGY GRID ARCHITECTURE


A grid service is requested to execute a certain task under the control of the RMS. When the RMS receives a service request from a user, the task can be divided into a set of subtasks that are executed in parallel. The RMS assigns those subtasks to available resources for execution. After the resources complete the assigned subtasks, they return the results to the RMS, and the RMS then integrates the received results into the entire task output requested by the user.
The above grid service process can be approximated by a structure with star topology, as depicted in Figure 2, where the RMS is directly connected with each resource through a respective communication channel. The star topology is feasible when the resources are totally separated, so that their communication channels are independent. Under this assumption the grid service reliability and performance can be derived by using the universal generating function technique.


Universal Generating Function


The universal generating function (u-function) technique was introduced in (Ushakov, 1987) and proved
to be very effective for the reliability evaluation of different types of multi-state systems.
The u-function representing the pmf of a discrete random variable Y is defined as the polynomial

u_Y(z) = Σ_{k=1}^{K} α_k · z^(y_k),   (10)

where the variable Y has K possible values y_k and α_k is the probability that Y is equal to y_k.
To obtain the u-function representing the pmf of a function f(Y_i, Y_j) of two independent random variables Y_i and Y_j, composition operators are introduced. These operators determine the u-function for f(Y_i, Y_j) using simple algebraic operations on the individual u-functions of the variables. All of the composition operators take the form

U(z) = u_i(z) ⊗_f u_j(z) = (Σ_{k=1}^{K_i} α_ik·z^(y_ik)) ⊗_f (Σ_{h=1}^{K_j} α_jh·z^(y_jh)) = Σ_{k=1}^{K_i} Σ_{h=1}^{K_j} α_ik·α_jh·z^(f(y_ik, y_jh)).   (11)

The u-function U(z) represents all of the possible mutually exclusive combinations of realizations of the variables by relating the probability of each combination to the value of the function f(Y_i, Y_j) for this combination.
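To make the composition operator concrete, the following minimal sketch (not part of the original text) represents a u-function as a Python dictionary mapping each realization to its probability and applies Eq. (11) for an arbitrary function f. The function name compose and the numerical values, which are borrowed from the star-topology example given later in this chapter, are purely illustrative.

```python
def compose(u_i, u_j, f):
    """Composition operator of Eq. (11): u-function of f(Y_i, Y_j) for independent Y_i, Y_j."""
    result = {}
    for y_ik, a_ik in u_i.items():
        for y_jh, a_jh in u_j.items():
            y = f(y_ik, y_jh)
            result[y] = result.get(y, 0.0) + a_ik * a_jh
    return result

INF = float('inf')  # the realization "the subtask is never completed"

# Completion time of a subtask executed redundantly on two resources (Eq. (7)):
# the subtask is completed by whichever resource returns a result first.
u_a = {100: 0.779, INF: 0.221}
u_b = {180: 0.968, INF: 0.032}
print(compose(u_a, u_b, min))   # {100: 0.779, 180: ~0.214, inf: ~0.007}
```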
In the case of the grid system, the u-function u_ij(z) can define the pmf of the execution time of subtask i assigned to resource j. This u-function takes the form

u_ij(z) = p_j(t_ij)·z^(t_ij) + (1 − p_j(t_ij))·z^∞,   (12)

where t_ij and p_j(t_ij) are determined according to Eqs. (2) and (3) respectively.
The pmf of the random start time T_i of subtask i can be represented by the u-function U_i(z) taking the form

U_i(z) = Σ_{l=1}^{N_i} q_il·z^(T_il),   (13)

where q_il = Pr(T_i = T_il).


For any realization T_il of T_i, the conditional distribution of the completion time Θ̃_ij of subtask i executed by resource j, given T_i = T_il, according to (6) can be represented by the u-function

u_ij(z, T_il) = p_j(T_il + t_ij)·z^(T_il + t_ij) + (1 − p_j(T_il + t_ij))·z^∞.   (14)


Table 1. Parameters of grid system for analytical example

No of subtask i   No of resource j   λ_j + π_j (sec^-1)   t_ij (sec)   p_j(t_ij)
1                 1                  0.0025               100          0.779
1                 2                  0.00018              180          0.968
2                 3                  0.0003               250          -
2                 4                  0.0008               300          -
3                 5                  0.0005               300          0.861
3                 6                  0.0002               430          0.918

The total completion time of subtask i assigned to a pair of resources j and d is equal to the minimum of the completion times for these resources, according to Eq. (7). To obtain the u-function representing the pmf of this time, given T_i = T_il, the composition operator with f(Y_j, Y_d) = min(Y_j, Y_d) should be used:

u_i(z, T_il) = u_ij(z, T_il) ⊗_min u_id(z, T_il)
= [p_j(T_il + t_ij)·z^(T_il + t_ij) + (1 − p_j(T_il + t_ij))·z^∞] ⊗_min [p_d(T_il + t_id)·z^(T_il + t_id) + (1 − p_d(T_il + t_id))·z^∞]
= p_j(T_il + t_ij)·p_d(T_il + t_id)·z^(T_il + min(t_ij, t_id)) + p_d(T_il + t_id)·(1 − p_j(T_il + t_ij))·z^(T_il + t_id)
+ p_j(T_il + t_ij)·(1 − p_d(T_il + t_id))·z^(T_il + t_ij) + (1 − p_j(T_il + t_ij))·(1 − p_d(T_il + t_id))·z^∞.   (15)

The u-function u_i(z, T_il) representing the conditional pmf of the completion time Θ_i of subtask i assigned to all of the resources from the set ω_i = {j_1, …, j_|ω_i|} can be obtained as

u_i(z, T_il) = u_ij_1(z, T_il) ⊗_min u_ij_2(z, T_il) ⊗_min … ⊗_min u_ij_|ω_i|(z, T_il).   (16)

u_i(z, T_il) can be obtained recursively:

u_i(z, T_il) = u_ij_1(z, T_il),
u_i(z, T_il) = u_i(z, T_il) ⊗_min u_ie(z, T_il) for e = j_2, …, j_|ω_i|.   (17)

Having the probabilities q_il = Pr(T_i = T_il) of the mutually exclusive realizations of the start time T_i, and the u-functions u_i(z, T_il) representing the corresponding conditional distributions of the subtask i completion time, we can now obtain the u-function representing the unconditional pmf of the completion time Θ_i as

Ũ_i(z) = Σ_{l=1}^{N_i} q_il · u_i(z, T_il).   (18)


Figure 3. Subtask execution precedence constraints for analytical example

Having the u-functions Ũ_k(z) representing the pmf of the completion time Θ_k for any subtask k ∈ φ_i = {k_1, …, k_|φ_i|}, one can obtain the u-function U_i(z) representing the pmf of the subtask i start time T_i according to (4) as

U_i(z) = Ũ_k_1(z) ⊗_max Ũ_k_2(z) ⊗_max … ⊗_max Ũ_k_|φ_i|(z) = Σ_{l=1}^{N_i} q_il·z^(T_il).   (19)

U_i(z) can be obtained recursively:

U_i(z) = z^0,
U_i(z) = U_i(z) ⊗_max Ũ_e(z) for e = k_1, …, k_|φ_i|.   (20)

It can be seen that if φ_i = ∅ then U_i(z) = z^0.


The final u-function U_m(z) represents the pmf of the random task completion time T_m in the form

U_m(z) = Σ_{l=1}^{N_m} q_ml·z^(T_ml).   (21)

Using the operators defined above, one can obtain the service reliability and performance indices by implementing the following algorithm:
1. Determine t_ij for each subtask i and each resource j ∈ ω_i using Eq. (2).
2. Define for each subtask i (1 ≤ i ≤ m): Ũ_i(z) = U_i(z) = z^0. For all i: if φ_i = ∅, or if Ũ_k(z) ≠ z^0 for every k ∈ φ_i (i.e. the u-functions representing the completion times of all of the predecessors of subtask i have been obtained):
2.1. Obtain U_i(z) = Σ_{l=1}^{N_i} q_il·z^(T_il) using the recursive procedure (20);
2.2. For l = 1, …, N_i:
2.2.1. For each j ∈ ω_i obtain u_ij(z, T_il) using Eq. (14);
2.2.2. Obtain u_i(z, T_il) using the recursive procedure (17);
2.3. Obtain Ũ_i(z) using Eq. (18).
3. If U_m(z) = z^0, return to step 2.
4. Obtain the reliability and performance indices R(θ*) and W using Eqs. (8) and (9).

Illustrative Example
This example presents an analytical derivation of the indices R(θ*) and W for a simple grid service that uses six resources. Assume that the RMS divides the service task into three subtasks. The first subtask is assigned to resources 1 and 2, the second subtask is assigned to resources 3 and 4, and the third subtask is assigned to resources 5 and 6:
ω_1 = {1, 2}, ω_2 = {3, 4}, ω_3 = {5, 6}.
The failure rates of the resources and communication channels and the subtask execution times are presented in Table 1.
Subtasks 1 and 3 get their input data directly from the RMS, subtask 2 needs the output of subtask 1, and the service task is completed when the RMS gets the outputs of both subtasks 2 and 3: φ_1 = φ_3 = ∅, φ_2 = {1}, φ_4 = {2, 3}. These subtask precedence constraints can be represented by the directed graph in Figure 3.
Since φ_1 = φ_3 = ∅, the only realization of the start times T_1 and T_3 is 0 and therefore U_1(z) = U_3(z) = z^0. According to step 2 of the algorithm we can obtain the u-functions representing the pmf of the completion times Θ̃_11, Θ̃_12, Θ̃_35 and Θ̃_36. In order to determine the subtask execution time distributions for the individual resources, define the u-functions u_ij(z) according to Table 1 and Eq. (12):
u_11(z, 0) = exp(−0.0025·100)·z^100 + [1 − exp(−0.0025·100)]·z^∞ = 0.779·z^100 + 0.221·z^∞.
In a similar way we obtain
u_12(z, 0) = 0.968·z^180 + 0.032·z^∞;
u_35(z, 0) = 0.861·z^300 + 0.139·z^∞;
u_36(z, 0) = 0.918·z^430 + 0.082·z^∞.


Figure 4. A virtual tree structure of a grid service

The u-function representing the pmf of the completion time for subtask 1 executed by both resources 1 and 2 is

Ũ_1(z) = u_11(z, 0) ⊗_min u_12(z, 0) = (0.779·z^100 + 0.221·z^∞) ⊗_min (0.968·z^180 + 0.032·z^∞)
= 0.779·z^100 + 0.214·z^180 + 0.007·z^∞.


The u-function representing the pmf of the completion time for subtask 3 executed by both resources
5 and 6 is
Ũ_3(z) = u_35(z, 0) ⊗_min u_36(z, 0) = (0.861·z^300 + 0.139·z^∞) ⊗_min (0.918·z^430 + 0.082·z^∞)
= 0.861·z^300 + 0.128·z^430 + 0.011·z^∞.


Execution of subtask 2 begins immediately after the completion of subtask 1. Therefore,
U_2(z) = Ũ_1(z) = 0.779·z^100 + 0.214·z^180 + 0.007·z^∞
(T_2 has three realizations: 100, 180 and ∞).


The u-functions representing the conditional pmf of the completion times of subtask 2 executed by the individual resources are obtained as follows:
u_23(z, 100) = e^(−0.0003·(100+250))·z^(100+250) + [1 − e^(−0.0003·(100+250))]·z^∞ = 0.9·z^350 + 0.1·z^∞;
u_23(z, 180) = e^(−0.0003·(180+250))·z^(180+250) + [1 − e^(−0.0003·(180+250))]·z^∞ = 0.879·z^430 + 0.121·z^∞;
u_23(z, ∞) = z^∞;
u_24(z, 100) = e^(−0.0008·(100+300))·z^(100+300) + [1 − e^(−0.0008·(100+300))]·z^∞ = 0.726·z^400 + 0.274·z^∞;
u_24(z, 180) = e^(−0.0008·(180+300))·z^(180+300) + [1 − e^(−0.0008·(180+300))]·z^∞ = 0.681·z^480 + 0.319·z^∞;
u_24(z, ∞) = z^∞.
The u-functions representing the conditional pmf of the subtask 2 completion time are:
u_2(z, 100) = u_23(z, 100) ⊗_min u_24(z, 100) = (0.9·z^350 + 0.1·z^∞) ⊗_min (0.726·z^400 + 0.274·z^∞)
= 0.9·z^350 + 0.073·z^400 + 0.027·z^∞;
u_2(z, 180) = u_23(z, 180) ⊗_min u_24(z, 180) = (0.879·z^430 + 0.121·z^∞) ⊗_min (0.681·z^480 + 0.319·z^∞)
= 0.879·z^430 + 0.082·z^480 + 0.039·z^∞;
u_2(z, ∞) = u_23(z, ∞) ⊗_min u_24(z, ∞) = z^∞.

According to Eq. (18) the unconditional pmf of the subtask 2 completion time is represented by the following u-function:
Ũ_2(z) = 0.779·u_2(z, 100) + 0.214·u_2(z, 180) + 0.007·z^∞
= 0.779·(0.9·z^350 + 0.073·z^400 + 0.027·z^∞) + 0.214·(0.879·z^430 + 0.082·z^480 + 0.039·z^∞) + 0.007·z^∞
= 0.701·z^350 + 0.056·z^400 + 0.188·z^430 + 0.018·z^480 + 0.037·z^∞.
The service task is completed when subtasks 2 and 3 return their outputs to the RMS (which corresponds to the beginning of subtask 4). Therefore, the u-function representing the pmf of the entire service time is obtained as

U_4(z) = Ũ_2(z) ⊗_max Ũ_3(z)
= (0.701·z^350 + 0.056·z^400 + 0.188·z^430 + 0.018·z^480 + 0.037·z^∞) ⊗_max (0.861·z^300 + 0.128·z^430 + 0.011·z^∞)
= 0.603·z^350 + 0.049·z^400 + 0.283·z^430 + 0.017·z^480 + 0.048·z^∞.
The pmf of the service time is:
Pr(T_4 = 350) = 0.603; Pr(T_4 = 400) = 0.049;
Pr(T_4 = 430) = 0.283; Pr(T_4 = 480) = 0.017; Pr(T_4 = ∞) = 0.048.
From the obtained pmf we can calculate the service reliability using Eq. (8):
R(θ*) = 0.603 for 350 < θ* ≤ 400; R(θ*) = 0.652 for 400 < θ* ≤ 430;
R(θ*) = 0.935 for 430 < θ* ≤ 480; R(∞) = 0.952;
and the conditional expected service time according to Eq. (9):
W = (0.603·350 + 0.049·400 + 0.283·430 + 0.017·480) / 0.952 = 378.69 sec.
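The whole derivation of this example can also be reproduced numerically. The following sketch (an illustration only, not part of the original chapter) encodes the data of Table 1 as dictionaries mapping completion times to probabilities and applies the min/max composition operators; the helper names u_start0, u_cond and compose are invented for this sketch.

```python
import math

INF = float('inf')

def compose(u1, u2, f):
    """Composition operator of Eq. (11) for a function f of two independent variables."""
    out = {}
    for y1, p1 in u1.items():
        for y2, p2 in u2.items():
            y = f(y1, y2)
            out[y] = out.get(y, 0.0) + p1 * p2
    return out

def u_start0(rate, t):
    """Eq. (12): a subtask started at time 0 finishes at t with prob exp(-rate*t), else never."""
    p = math.exp(-rate * t)
    return {t: p, INF: 1.0 - p}

def u_cond(rate, start, t):
    """Eq. (14): conditional completion-time u-function given the start time `start`."""
    if start == INF:
        return {INF: 1.0}
    p = math.exp(-rate * (start + t))
    return {start + t: p, INF: 1.0 - p}

# Table 1: subtask 1 -> resources 1, 2; subtask 2 -> resources 3, 4; subtask 3 -> resources 5, 6
U1 = compose(u_start0(0.0025, 100), u_start0(0.00018, 180), min)   # completion of subtask 1
U3 = compose(u_start0(0.0005, 300), u_start0(0.0002, 430), min)    # completion of subtask 3

# Subtask 2 starts when subtask 1 completes; condition on every realization (Eq. (18)).
U2 = {}
for start, q in U1.items():
    cond = compose(u_cond(0.0003, start, 250), u_cond(0.0008, start, 300), min)
    for y, p in cond.items():
        U2[y] = U2.get(y, 0.0) + q * p

# The service ends when subtasks 2 and 3 have both returned their results (Eq. (19)).
U4 = compose(U2, U3, max)

R_inf = sum(p for t, p in U4.items() if t != INF)
W = sum(t * p for t, p in U4.items() if t != INF) / R_inf
print(U4)          # pmf ~ {350: 0.603, 400: 0.049, 430: 0.283, 480: 0.017, inf: 0.048}
print(R_inf, W)    # ~0.952 and ~378.7 seconds, matching Eqs. (8) and (9)
```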

TREE TOPOLOGY GRID ARCHITECTURE


In the star grid, the RMS is connected with each resource by one direct communication channel (link). However, such an approximation is not always accurate, even though it simplifies the analysis and computation. For example, several resources located in the same local area network (LAN) can use the same gateway to communicate outside the network. These resources are therefore not connected with the RMS through independent links: the resources are connected to the gateway, which communicates with the RMS through one common communication channel. Another example is a server that contains several resources (it has several processors that can run different applications simultaneously, or contains different databases). Such a server communicates with the RMS through the same links. These situations cannot be modeled using only the star topology grid architecture.
In this section, we present a more realistic virtual structure which has a tree topology. The root of the virtual tree structure is the RMS and the leaves are the resources, while the branches of the tree represent the communication channels linking the leaves to the root. Some channels are commonly used by multiple resources. An example of the tree topology is given in Figure 4, in which four resources (R1, R2, R3, R4) are available for a service.
The tree structure models the common cause failures in shared communication channels. For example, in Figure 4, a failure in channel L6 makes resources R1, R2, and R3 unavailable. This type of common cause failure was ignored by conventional parallel computing models and by the star-topology models above. For small-area communication, such as a LAN or a cluster, the assumption that ignores common cause failures on communications is acceptable because the communication time is negligible compared to the processing time. However, for wide-area communication, such as in a grid system, failures on communication channels are more likely and the communication time cannot be neglected. In many cases, the communication time may even dominate the processing time due to the large amount of data transmitted. Therefore, the virtual tree structure is an adequate model for representing the functioning of grid services.


Table 2. Parameters of the MTSTs paths

Element, subtask                   R1, J1   R2, J2   R3, J2   R4, J1
Data transmission speed (Kbps)     5        6        4        10
Data transmission time (s)         30       15       22.5     15
Processing time (s)                48       25       35.5     38
Time to subtask completion (s)     78       40       58       53

Algorithms for Determining the pmf of the Task Execution Time


With the tree structure, the simple u-function technique is not applicable because it does not consider the failure correlations. Thus, new algorithms are required. This section presents a novel algorithm to evaluate the performance and reliability of the tree-structured grid service, based on graph theory and the Bayesian approach.

Minimal Task Spanning Tree (MTST)


The set of all nodes and links involved in performing a given task forms a task spanning tree. This task spanning tree can be considered to be a combination of minimal task spanning trees (MTST), where each MTST represents a minimal possible combination of available elements (resources and links) that guarantees the successful completion of the entire task. The failure of any element in an MTST leads to the failure of the entire task.
For solving the graph traversal problem, several classical algorithms have been suggested, such as Depth-First search, Breadth-First search, etc. These algorithms can find all MTSTs in an arbitrary graph (Dai et al., 2002). However, MTSTs in graphs with a tree topology can be found in a much simpler way because each resource has a single path to the RMS, and the tree structure is acyclic.
After the subtasks have been assigned to corresponding resources, it is easy to find all combinations
of resources such that each combination contains exactly m resources executing m different subtasks
that compose the entire task. Each combination determines exactly one MTST consisting of links that
belong to the paths from the m resources to the RMS. The total number of MTSTs is equal to the total number of such combinations N, where

N = ∏_{j=1}^{m} |ω_j|   (22)

(see Example 4.2.1).
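As a small illustration of Eq. (22), the following sketch (not part of the original text) enumerates the resource combinations for the assignment used in the later example, where subtask J1 is assigned to resources R1 and R4 and subtask J2 to resources R2 and R3; each combination corresponds to exactly one MTST.

```python
from itertools import product

# omega[j]: resources assigned to subtask j (assignment taken from Example 4.2.1)
omega = {"J1": ["R1", "R4"], "J2": ["R2", "R3"]}

combinations = list(product(*omega.values()))   # one resource per subtask
print(len(combinations))   # N = |omega_J1| * |omega_J2| = 4, as in Eq. (22)
print(combinations)        # [('R1', 'R2'), ('R1', 'R3'), ('R4', 'R2'), ('R4', 'R3')]
```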


Along with the procedures of searching all the MTST, one has to determine the corresponding running time and communication time for all the resources and links.
For any subtask j, and any resource k assigned to execute this subtask, one has the amount of input
and output data, the bandwidths of links, belonging to the corresponding paths k, and the resource processing time. With these data, one can obtain the time of subtask completion (see Example 4.2.2).


Some elements of the same MTST can belong to several paths if they are involved in data transmission to several resources. To track the involvement of an element in performing different subtasks, and to record the corresponding times during which the failure of the element causes the failure of a subtask, we create lists of two-field records for each subtask in each MTST. For any MTST S_i (1 ≤ i ≤ N), and any subtask j (1 ≤ j ≤ m), this list contains the names of the elements involved in performing subtask j, and the corresponding time of subtask completion y_ij (see Example 4.2.3). Note that y_ij is the conditional time of subtask j completion given that only MTST i is available.
Note also that an MTST completes the entire task if none of its elements fails by the maximal time needed to complete the subtasks in which they are involved. Therefore, when calculating the element reliability in a given MTST, one has to use the corresponding record with the maximal time.

pmf of The Task Execution Time


Having the MTSTs, and the times of involvement of their elements in performing the different subtasks, one can determine the pmf of the entire service time.
First, we can obtain the conditional time of the entire task completion, given that only MTST S_i is available, as

Y_{i} = max{y_ij : 1 ≤ j ≤ m} for any 1 ≤ i ≤ N.   (23)

For a set ψ of available MTSTs, the task completion time is equal to the minimal conditional task completion time among these MTSTs:

Y_ψ = min{Y_{i} : i ∈ ψ} = min_{i ∈ ψ} max_{1 ≤ j ≤ m}(y_ij).   (24)
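The following sketch (an illustration only, using the subtask completion times that appear later in Table 2 and the MTST element lists of the example) evaluates Eqs. (23) and (24) for the four MTSTs; the dictionary layout is an assumption of this sketch.

```python
# y[S][J]: conditional completion time of subtask J given that only MTST S is available
y = {
    "S1": {"J1": 78, "J2": 40},
    "S2": {"J1": 78, "J2": 58},
    "S3": {"J1": 53, "J2": 40},
    "S4": {"J1": 53, "J2": 58},
}
Y = {s: max(times.values()) for s, times in y.items()}   # Eq. (23)
print(Y)                    # {'S1': 78, 'S2': 78, 'S3': 53, 'S4': 58}
print(min(Y.values()))      # Eq. (24): 53 s when all four MTSTs are available
```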

Now, we can sort the MTSTs in increasing order of their conditional task completion times Y_{i}, and divide them into different groups containing MTSTs with identical conditional completion times. Suppose there are K such groups, denoted G_1, G_2, …, G_K, where 1 ≤ K ≤ N, and any group G_i contains MTSTs with the identical conditional task completion time θ_i (0 ≤ θ_1 < θ_2 < … < θ_K). Then, it can be seen that the probability Q_i = Pr(Θ = θ_i) can be obtained as

Q_i = Pr(E_i, Ē_{i−1}, Ē_{i−2}, …, Ē_1),   (25)

where E_i is the event that at least one MTST from the group G_i is available, and Ē_i is the event that no MTST from the group G_i is available.
Suppose the MTSTs in a group G_i are arbitrarily ordered, and F_ij (j = 1, 2, …, N_i) represents the event that the j-th MTST in the group is available. Then, the event E_i can be expressed as

E_i = ∪_{j=1}^{N_i} F_ij,   (26)


and (25) takes the form

Pr(E_i, Ē_{i−1}, Ē_{i−2}, …, Ē_1) = Pr(∪_{j=1}^{N_i} F_ij, Ē_{i−1}, Ē_{i−2}, …, Ē_1) = Σ_{j=1}^{N_i} Pr(F_ij, F̄_i(j−1), …, F̄_i1, Ē_{i−1}, Ē_{i−2}, …, Ē_1).   (27)

Using the Bayesian theorem on conditional probability, we obtain from (27) that

Q_i = Σ_{j=1}^{N_i} Pr(F_ij) · Pr(F̄_i(j−1), F̄_i(j−2), …, F̄_i1, Ē_1, Ē_2, …, Ē_{i−1} | F_ij).   (28)

The probability Pr(F_ij) can be calculated as the product of the reliabilities of all the elements belonging to the j-th MTST from group G_i.
The probability Pr(F̄_i(j−1), F̄_i(j−2), …, F̄_i1, Ē_1, Ē_2, …, Ē_{i−1} | F_ij) can be computed by the following two-step algorithm (see Example 4.2.4).
Step 1: Identify the failures of all the critical elements in a period of time (defined by a start and an end time) during which they lead to the failure of any MTST from the groups G_m for m = 1, 2, …, i−1 (events Ē_m), and of any MTST S_k from group G_i for k = 1, 2, …, j−1 (events F̄_ik), but do not affect the MTST S_j from group G_i.
Step 2: Generate all the possible combinations of the identified critical elements that lead to the event (F̄_i(j−1), F̄_i(j−2), …, F̄_i1, Ē_1, Ē_2, …, Ē_{i−1} | F_ij) using a binary search, and compute the probabilities of those combinations. The sum of the probabilities obtained is equal to Pr(F̄_i(j−1), F̄_i(j−2), …, F̄_i1, Ē_1, Ē_2, …, Ē_{i−1} | F_ij). When calculating the failure probabilities of the MTST's elements, the maximal time from the corresponding records in the list for the given MTST should be used. The algorithm for obtaining the probabilities Pr(Ē_1, Ē_2, …, Ē_{i−1} | E_i) can be found in Dai et al. (2002).
Having the conditional task completion times Y_{i} for the different MTSTs, and the corresponding probabilities Q_i, one obtains the task completion time distribution (θ_i, Q_i), 1 ≤ i ≤ K, and can easily calculate the indices (8) and (9) (see Example 4.2.5).

Illustrative Example
Consider the virtual grid presented in Figure 4, and assume that the service task is divided into two subtasks: J1, assigned to resources R1 and R4, and J2, assigned to resources R2 and R3. J1 and J2 require 50 Kbits and 30 Kbits of input data, respectively, to be sent from the RMS to the corresponding resource, and 100 Kbits and 60 Kbits of output data, respectively, to be sent from the resource back to the RMS.
The subtask processing times of the resources, the bandwidths of the links, and the failure rates are presented in Figure 4 next to the corresponding elements.


Table 3. pmf of service time

θ_i     Q_i      θ_i · Q_i
53      0.3738   19.8114
58      0.1480   8.584
78      0.0945   7.371
∞       0.3837   -

4.2.1. The Service MTST


The entire graph constitutes the task spanning tree. There exist four possible combinations of two resources
executing both subtasks: {R1, R2}, {R1, R3}, {R4, R2}, {R4, R3}. The four MTST corresponding to
these combinations are: S1: {R1, R2, L1, L2, L5, L6}; S2: {R1, R3, L1, L3, L5, L6}; S3: {R2, R4, L2,
L5, L4, L6}; S4: {R3, R4, L3, L4, L6}.

4.2.2. Parameters of the MTSTs Paths


Having the MTSTs, one can obtain the data transmission speed for each path between a resource and the RMS (as the minimal bandwidth of the links belonging to the path), and calculate the data transmission times and the times of subtask completion. These parameters are presented in Table 2. For example, resource R1 (belonging to the two MTSTs S1 and S2) processes subtask J1 in 48 seconds. To complete the subtask, it should receive 50 Kbits of data and return 100 Kbits to the RMS. The speed of data transmission between the RMS and R1 is limited by the bandwidth of link L1, and is equal to 5 Kbps. Therefore, the data transmission time is 150/5 = 30 seconds, and the total time of subtask completion by R1 is 30 + 48 = 78 seconds.
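The per-resource completion time described above can be expressed as a small helper; this is only a sketch, and the bandwidths in the example path other than the limiting 5 Kbps link L1 are placeholders.

```python
def completion_time(input_kbits, output_kbits, path_bandwidths_kbps, processing_s):
    """Input plus output data transmitted at the speed of the slowest link of the path
    (Eq. (1)), plus the resource processing time."""
    speed = min(path_bandwidths_kbps)
    return (input_kbits + output_kbits) / speed + processing_s

# Resource R1 executing J1: 50 Kbits in, 100 Kbits out, path limited by L1 = 5 Kbps, 48 s processing.
print(completion_time(50, 100, [5, 8, 10], 48))   # 150/5 + 48 = 78 s, as in Table 2
```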

4.2.3. List of MTST Elements


Now one can obtain the lists of two-field records for components of the MTST.
S1: path for J1:(R1,78); (L1,78); (L5,78); (L6,78); path for J2: (R2,40); (L2,40); (L5,40); (L6,40).
S2: path for J1: (R1,78), (L1,78), (L5,78), (L6,78); path for J2: (R3,58), (L3,58), (L6,58).
S3: path for J1: (R4,53), (L4,53); path for J2: (R2,40), (L2,40), (L5,40), (L6,40).
S4: path for J1: (R4,53), (L4,53); path for J2: (R3,58), (L3,58), (L6,58).

4.2.4. pmf of Task Completion Time


The conditional times of the entire task completion by different MTST are
Y1=78; Y2=78; Y3=53; Y4=58.


Therefore, the MTSTs compose three groups:

G_1 = {S3} with θ_1 = 53; G_2 = {S4} with θ_2 = 58; and G_3 = {S1, S2} with θ_3 = 78.

According to (25), we have for group G_1: Q_1 = Pr(E_1) = Pr(S3). The probability that the MTST S3 completes the entire task is equal to the product of the probabilities that R4 and L4 do not fail by 53 seconds, and that R2, L2, L5, and L6 do not fail by 40 seconds:
Pr(Θ = 53) = Q_1 = exp(−0.004·53)·exp(−0.004·53)·exp(−0.008·40)·exp(−0.003·40)·exp(−0.001·40)·exp(−0.002·40) = 0.3738.
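This product of exponentials is easy to check numerically; the following one-liner (purely illustrative) reproduces Q_1 from the failure rates and exposure times listed above.

```python
import math

# R4 and L4 must survive 53 s; R2, L2, L5 and L6 must survive 40 s (rates per second).
Q1 = math.exp(-(0.004 + 0.004) * 53) * math.exp(-(0.008 + 0.003 + 0.001 + 0.002) * 40)
print(round(Q1, 4))   # 0.3738
```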
Now we can calculate Q_2 as

Q_2 = Pr(E_2, Ē_1) = Pr(F_21)·Pr(Ē_1 | F_21) = Pr(F_21)·Pr(F̄_11 | F_21) = Pr(S4)·Pr(S̄3 | S4),

because G_2 and G_1 have only one MTST each. The probability Pr(S4) that the MTST S4 completes the entire task is equal to the product of the probabilities that R3, L3, and L6 do not fail by 58 seconds, and that R4 and L4 do not fail by 53 seconds:
Pr(S4) = exp(−0.004·53)·exp(−0.003·58)·exp(−0.004·53)·exp(−0.004·58)·exp(−0.002·58) = 0.3883.

To obtain Pr(S̄3 | S4), one first should identify the critical elements according to the algorithm presented in Dai et al. (2002). These elements are R2, L2, and L5. Any failure occurring in one of these elements by 40 seconds causes the failure of S3, but does not affect S4. The probability that at least one failure occurs in the set of critical elements is

Pr(S̄3 | S4) = 1 − exp(−0.008·40)·exp(−0.003·40)·exp(−0.001·40) = 0.3812.

Then,

Pr(Θ = 58) = Q_2 = Pr(E_2, Ē_1) = Pr(S4)·Pr(S̄3 | S4) = 0.3883·0.3812 = 0.1480.


Now one can calculate Q_3 for the last group G_3 = {S1, S2}, corresponding to θ_3 = 78, as

Q_3 = Pr(E_3, Ē_2, Ē_1) = Pr(F_31)·Pr(Ē_1, Ē_2 | F_31) + Pr(F_32)·Pr(F̄_31, Ē_1, Ē_2 | F_32)
= Pr(S1)·Pr(S̄3, S̄4 | S1) + Pr(S2)·Pr(S̄1, S̄3, S̄4 | S2).

The probability that the MTST S1 completes the entire task is equal to the product of the probabilities that R1, L1, L5, and L6 do not fail by 78 seconds, and that R2 and L2 do not fail by 40 seconds:
Pr(S1) = exp(−0.007·78)·exp(−0.008·40)·exp(−0.005·78)·exp(−0.003·40)·exp(−0.001·78)·exp(−0.002·78) = 0.1999.
The probability that the MTST S2 completes the entire task is equal to the product of the probabilities that R1, L1, L5, and L6 do not fail by 78 seconds, and that R3 and L3 do not fail by 58 seconds:
Pr(S2) = exp(−0.007·78)·exp(−0.003·58)·exp(−0.005·78)·exp(−0.004·58)·exp(−0.001·78)·exp(−0.002·78) = 0.2068.

To obtain Pr(S̄3, S̄4 | S1), one first should identify the critical elements. Any failure of either R4 or L4 in the time interval from 0 to 53 seconds causes the failure of both S3 and S4, but does not affect S1. Therefore,

Pr(S̄3, S̄4 | S1) = 1 − exp(−0.004·53)·exp(−0.004·53) = 0.3456.

The critical elements for calculating Pr(S̄1, S̄3, S̄4 | S2) are R2 and L2 in the interval from 0 to 40 seconds, and R4 and L4 in the interval from 0 to 53 seconds. The failure of both elements in any one of the following four combinations causes the failures of S3, S4, and S1, but does not affect S2:
1. R2 during the first 40 seconds, and R4 during the first 53 seconds;
2. R2 during the first 40 seconds, and L4 during the first 53 seconds;
3. L2 during the first 40 seconds, and R4 during the first 53 seconds; and
4. L2 during the first 40 seconds, and L4 during the first 53 seconds.
In other words, at least one element of the pair {R2, L2} must fail within 40 seconds, and at least one element of the pair {R4, L4} must fail within 53 seconds. Therefore,

Pr(S̄1, S̄3, S̄4 | S2) = ∏_{i=1}^{2} {1 − ∏_{j=1}^{2} exp(−λ_ij·t_ij)} = [1 − exp(−(0.008+0.003)·40)]·[1 − exp(−(0.004+0.004)·53)] = 0.1230,


where λ_ij is the failure rate of the j-th critical element in the i-th pair (j = 1, 2; i = 1, 2), and t_ij is the duration of the time interval for the corresponding critical element.
Having the values of Pr(S1), Pr(S2), Pr(S̄3, S̄4 | S1), and Pr(S̄1, S̄3, S̄4 | S2), one can calculate

Pr(Θ = 78) = Q_3 = 0.1999·0.3456 + 0.2068·0.1230 = 0.0945.


After obtaining Q_1, Q_2, and Q_3, one can evaluate the total task failure probability as

Pr(Θ = ∞) = 1 − Q_1 − Q_2 − Q_3 = 1 − 0.3738 − 0.1480 − 0.0945 = 0.3837,

and obtain the pmf of the service time presented in Table 3.

4.2.5. Calculating the Reliability Indices


From Table 3, one obtains the probability that the service does not fail as

R(∞) = Q_1 + Q_2 + Q_3 = 0.6164,

the probability that the service time is not greater than a pre-specified value of θ* = 60 seconds as

R(θ*) = Σ_{i=1}^{3} Q_i·1(θ_i < θ*) = 0.3738 + 0.1480 = 0.5218,

and the expected service execution time, given that the system does not fail, as

W = Σ_{i=1}^{3} θ_i·Q_i / R(∞) = 35.7664 / 0.6164 = 58.025 seconds.
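As a quick numerical check (purely illustrative), the indices above follow directly from the pmf of Table 3:

```python
pmf = {53: 0.3738, 58: 0.1480, 78: 0.0945}          # finite service times from Table 3

R_inf = sum(pmf.values())                             # ~0.616
R_60  = sum(q for t, q in pmf.items() if t < 60)      # 0.5218
W     = sum(t * q for t, q in pmf.items()) / R_inf    # ~58.0 seconds
print(R_inf, R_60, W)
```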

Parameterization and Monitoring


In order to obtain the reliability and performance indices of the grid service one has to know such model
parameters as the failure rates of the virtual links and the virtual nodes, and bandwidth of the links. It is
easy to estimate those parameters by implementing the monitoring technology.
A monitoring system (called Alertmon Network Monitor, http://www.abilene.iu.edu/noc.html) is being applied in the IP-grid (Indiana Purdue Grid) project (www.ip-grid.org) to detect component failures, to record service behavior, to monitor the network traffic, and to control the system configurations.
With this monitoring system, one can easily obtain the parameters required by the grid service reliability model by adding the following functions in the monitoring system:


1. Monitoring the failures of the components (virtual links and nodes) in the grid service, and recording the total execution time of those components. The failure rate of a component can simply be estimated as the number of failures over the total execution time (a small sketch follows this list).
2. Monitoring the real-time network traffic of the involved channels (virtual links) in order to obtain the bandwidth of the links.
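The estimator in item 1 is a simple ratio; the sketch below (the function name and the figures are illustrative assumptions) shows the accumulation a sensor would perform.

```python
def estimate_failure_rate(accumulated_failures: int, accumulated_exec_time_s: float) -> float:
    """Failure rate = number of observed failures / total execution time of the component."""
    return accumulated_failures / accumulated_exec_time_s

# e.g. 3 failures observed over roughly 17 days of accumulated execution time
print(estimate_failure_rate(3, 1.5e6))   # 2e-06 failures per second
```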

To realize the above monitoring functions, network sensors are required. We present a type of sensor attached to the components, acting like neurons attached to the skin: the components themselves, or adjacent components, play the role of sensors while they are working. Only a small amount of computational resource in the components is used for accumulating failures and execution time and for the division operation, and only a little memory is required for saving the data (accumulated number of failures, accumulated time and current bandwidth). The virtual nodes that have memory and computational capability can play the sensing role themselves; if some links have no CPU or memory, then the adjacent processors or routers can perform these data-collecting operations. Using such a self-sensing technique avoids overloading the monitoring center, even in a grid system containing numerous components. Moreover, it does not affect the service performance considerably, since only a small part of the computation and storage resources is used for the monitoring. In addition, such a self-sensing technique can also be applied to monitoring other measures.
When evaluating the grid service reliability, the RMS automatically loads the required parameters
from corresponding sensors and calculates the service reliability and performance according to the
approaches presented in the previous sections. This strategy can also be used for implementing the
Autonomic Computing concept.

CONCLUSION
Grid computing is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration. Although the development tools and techniques for the grid have been widely studied, grid reliability analysis and modeling are not easy because of the complexity of combining various types of failures.
This chapter introduced the grid computing technology and analyzed the grid service reliability and
performance under the context of performability. The chapter then presented models for star-topology
grid with data dependence and tree-structure grid with failure correlation. Evaluation tools and algorithms were presented based on the universal generating function, graph theory, and Bayesian approach.
Numerical examples are presented to illustrate the grid modeling and reliability/performance evaluation
procedures and approaches.
Future research can extend the models for grid computing to other large-scale distributed computing systems. After analyzing the details and specifics of the corresponding systems, the approaches and models can be adapted to real conditions. The models are also applicable to wireless networks, which are more failure-prone.
Hierarchical models can also be analyzed in which output of lower level models can be considered
as the input of the higher level models. Each level can make use of the proposed models and evaluation
tools.


ACKNOWLEDGMENT
This work was supported in part by National Science Foundation (NSF) under grant number 0831609.

REFERENCES
Abramson, D., Buyya, R., & Giddy, J. (2002). A computational economy for grid computing and its implementation in the Nimrod-G resource broker. Future Generation Computer Systems, 18(8), 1061-1074. doi:10.1016/S0167-739X(02)00085-7
Berman, F., Wolski, R., Casanova, H., Cirne, W., Dail, H., & Faerman, M. (2003). Adaptive computing on the Grid using AppLeS. IEEE Transactions on Parallel and Distributed Systems, 14(4), 369-382. doi:10.1109/TPDS.2003.1195409
Cao, J., Jarvis, S. A., Saini, S., Kerbyson, D. J., & Nudd, G. R. (2002). ARMS: An agent-based resource management system for grid computing. Science Progress, 10(2), 135-148.
Chen, D. J., Chen, R. S., & Huang, T. H. (1997). A heuristic approach to generating file spanning trees for reliability analysis of distributed computing systems. Computers and Mathematics with Applications, 34(10), 115-131. doi:10.1016/S0898-1221(97)00210-1
Chen, D. J., & Huang, T. H. (1992). Reliability analysis of distributed systems based on a fast reliability algorithm. IEEE Transactions on Parallel and Distributed Systems, 3(2), 139-154. doi:10.1109/71.127256
Dai, Y. S., & Levitin, G. (2006). Reliability and performance of tree-structured grid services. IEEE Transactions on Reliability, 55(2), 337-349. doi:10.1109/TR.2006.874940
Dai, Y. S., Pan, Y., & Zou, X. K. (2006). A hierarchical modelling and analysis for grid service reliability. IEEE Transactions on Computers.
Dai, Y. S., Xie, M., & Poh, K. L. (2002). Reliability analysis of grid computing systems. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC2002) (pp. 97-104). New York: IEEE Computer Press.
Dai, Y. S., Xie, M., & Poh, K. L. (2005). Markov renewal models for correlated software failures of multiple types. IEEE Transactions on Reliability, 54(1), 100-106. doi:10.1109/TR.2004.841709
Dai, Y. S., Xie, M., & Poh, K. L. (2006). Availability modeling and cost optimization for the grid resource management system. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 38(1), 170.
Dai, Y. S., Xie, M., Poh, K. L., & Liu, G. Q. (2003). A study of service reliability and availability for distributed systems. Reliability Engineering & System Safety, 79(1), 103-112. doi:10.1016/S0951-8320(02)00200-4


Dai, Y. S., Xie, M., Poh, K. L., & Ng, S. H. (2004). A model for correlated failures in N-version programming. IIE Transactions, 36(12), 1183-1192. doi:10.1080/07408170490507729
Das, S. K., Harvey, D. J., & Biswas, R. (2001). Parallel processing of adaptive meshes with load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(12), 1269-1280. doi:10.1109/71.970562
Ding, Q., Chen, G. L., & Gu, J. (2002). A unified resource mapping strategy in computational grid environments. Journal of Software, 13(7), 1303-1308.
Foster, I., & Kesselman, C. (2003). The Grid 2: Blueprint for a new computing infrastructure. San Francisco: Morgan-Kaufmann.
Foster, I., Kesselman, C., Nick, J. M., & Tuecke, S. (2002). Grid services for distributed system integration. Computer, 35(6), 37-46. doi:10.1109/MC.2002.1009167
Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. International Journal of High Performance Computing Applications, 15, 200-222. doi:10.1177/109434200101500302
Grassi, V., Donatiello, L., & Iazeolla, G. (1988). Performability evaluation of multicomponent fault-tolerant systems. IEEE Transactions on Reliability, 37(2), 216-222. doi:10.1109/24.3744
Krauter, K., Buyya, R., & Maheswaran, M. (2002). A taxonomy and survey of grid resource management systems for distributed computing. Software, Practice & Experience, 32(2), 135-164. doi:10.1002/spe.432
Kumar, A. (2000). An efficient SuperGrid protocol for high availability and load balancing. IEEE Transactions on Computers, 49(10), 1126-1133. doi:10.1109/12.888048
Kumar, V. K. P., Hariri, S., & Raghavendra, C. S. (1986). Distributed program reliability analysis. IEEE Transactions on Software Engineering, SE-12, 42-50.
Levitin, G., Dai, Y. S., & Ben-Haim, H. (2006). Reliability and performance of star topology grid service with precedence constraints on subtask execution. IEEE Transactions on Reliability, 55(3), 507-515. doi:10.1109/TR.2006.879651
Levitin, G., Dai, Y. S., Xie, M., & Poh, K. L. (2003). Optimizing survivability of multi-state systems with multi-level protection by multi-processor genetic algorithm. Reliability Engineering & System Safety, 82, 93-104. doi:10.1016/S0951-8320(03)00136-4
Lin, M. S., Chang, M. S., Chen, D. J., & Ku, K. L. (2001). The distributed program reliability analysis on ring-type topologies. Computers & Operations Research, 28, 625-635. doi:10.1016/S0305-0548(99)00151-3
Liu, G. Q., Xie, M., Dai, Y. S., & Poh, K. L. (2004). On program and file assignment for distributed systems. Computer Systems Science and Engineering, 19(1), 39-48.
Livny, M., & Raman, R. (1998). High-throughput resource management. In The Grid: Blueprint for a new computing infrastructure (pp. 311-338). San Francisco: Morgan-Kaufmann.


Meyer, J. (1980). On evaluating the performability of degradable computing systems. IEEE Transactions on Computers, 29, 720-731. doi:10.1109/TC.1980.1675654
Nabrzyski, J., Schopf, J. M., & Weglarz, J. (2003). Grid resource management. Amsterdam: Kluwer Publishing.
Pham, H. (2000). Software reliability. Singapore: Springer-Verlag.
Tai, A., Meyer, J., & Avizienis, A. (1993). Performability enhancement of fault-tolerant software. IEEE Transactions on Reliability, 42(2), 227-237. doi:10.1109/24.229492
Xie, M. (1991). Software reliability modeling. Hackensack, NJ: World Scientific Publishing Company.
Xie, M., Dai, Y. S., & Poh, K. L. (2004). Computing systems reliability: Models and analysis. New York: Kluwer Academic Publishers.
Yang, B., & Xie, M. (2000). A study of operational and testing reliability in software reliability analysis. Reliability Engineering & System Safety, 70, 323-329. doi:10.1016/S0951-8320(00)00069-7

KEY TERMS AND DEFINITIONS


Bayesian Analysis: The use of Bayes' method to obtain a posterior distribution from a prior distribution.
Graph Theory: The use of graph algorithms to analyze a given network graph.
Grid Computing: Grid computing is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration.
Modeling: A representation, generally in mathematical form, that shows the construction or appearance of a computing system.
Performance: The inverse of the execution time.
Reliability: The probability for the service to be successfully completed within a given execution time.
Universal Generating Function: Also called the u-function; a technique to express and evaluate models in a polynomial format.


Chapter 11

Mixed Programming Models Using Parallel Tasks

Jörg Dümmler
Chemnitz University of Technology, Germany
Thomas Rauber
University of Bayreuth, Germany
Gudula Rünger
Chemnitz University of Technology, Germany

ABSTRACT
Parallel programming models using parallel tasks have proven successful for increasing scalability on medium-size homogeneous parallel systems. Several investigations have shown that these
programming models can be extended to hierarchical and heterogeneous systems which will dominate
in the future. In this chapter, the authors discuss parallel programming models with parallel tasks and
describe these programming models in the context of other approaches for mixed task and data parallelism. They discuss compiler-based as well as library-based approaches for task programming and
present extensions to the model which allow a flexible combination of parallel tasks and an optimization
of the resulting communication structure.

INTRODUCTION
Large modular parallel applications can be decomposed into a set of cooperating parallel tasks. This set
of parallel tasks and their cooperation or coordination structure are a flexible representation of a parallel
program for the specific application. The flexibility in scheduling and mapping the parallel tasks can be
exploited to achieve efficiency and scalability on a specific distributed memory platform by choosing a
suitable mapping and scheduling of the tasks. Each parallel task is responsible for the computation of a
specific part or module of the application, and can be executed on an arbitrary number of processors. The
terms multiprocessor tasks, malleable tasks and moldable tasks have been used to denote such parallel
tasks. In the following, we use the term multiprocessor task (M-task). An M-task can be implemented
DOI: 10.4018/978-1-60566-661-7.ch011

Copyright 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Mixed Programming Models Using Parallel Tasks

using an SPMD programming model (basic M-task) or can be hierarchically composed of other M-tasks
and thereby support nested parallelism (composed M-task). The advantage of the M-task programming
model is to exploit coarse-grained parallelism between M-tasks and fine-grained parallelism within basic
M-tasks in the same program and thus the potential parallelism and scalability can be increased.
Each M-task provides an interface consisting of a set of input and output parameters. These parameters
are parallel data structures that are distributed among the processors executing the M-task according to
a predefined distribution scheme, e.g. a block-wise distribution of an array. A data dependence between
M-tasks M1 and M2 arises if M1 produces output data required as an input for M2. Such data dependencies
might lead to data re-distribution operations if M1 and M2 are executed on different sets of processors
or if M1 produces its output in a different data distribution than expected by M2. Control dependencies
are introduced by coordination operators, e.g. loop constructs for the repeated execution of an M-task
or constructs for the conditional execution of an M-task. The data and control dependencies between
M-tasks can be captured by a graph representation. Examples are macro dataflow graphs (Ramaswamy, Sapatnekar, & Banerjee, 1997) or series-parallel (SP) graphs (Rauber & Rünger, 2000).
The actual execution of an M-task program is based on a schedule of the M-tasks that has to take the
data and control dependencies into account. M-tasks that are connected by a data or control dependence have to be executed consecutively. For independent M-tasks, both a concurrent execution on disjoint processor groups and an execution one after another are possible. The optimal schedule depends on the
structure of the application and on the communication and computing performance of the parallel target
platform. For the same application a pure data parallel schedule that executes all M-tasks consecutively
on all available processors might lead to the best results on one platform but a mixed task and data parallel
schedule may result in lower execution times on another platform. Thus, parallel programming with M-tasks offers a very flexible programming style that exploits different levels of granularity and makes parallel programs easily adaptable to a specific parallel platform.
Examples for M-task applications come from multiple areas. Large multi-disciplinary simulation programs consist of a collection of algorithms from different fields, e.g. aircraft design (Chapman, Haines,
Mehrotra, Zima, & van Rosendale, 1997; Bal & Haines, 1998) that uses models from aerodynamics,
propulsion, and structural analysis or environmental simulations (Chapman et al., 1997) that combine
atmospheric, surface water, and ground water models. Examples from numerical analysis include solution methods for ordinary differential equations (ODEs) like extrapolation methods (Rauber & Rünger, 2000), iterated Runge-Kutta methods (Rauber & Rünger, 1999a), implicitly iterated Runge-Kutta methods (Rauber & Rünger, 2000), or Parallel Adams methods (Rauber & Rünger, 2007). These time-stepping
methods compute a fixed number of independent stage vectors within each time step and combine these
vectors into the new approximation vector for the next time step. Partial differential equations (PDEs) can
be defined over geometrically complex domains that are decomposed into sets of partially overlapping
discretization meshes. Solution methods for PDEs can exploit coarse-grained parallelism between these
meshes and fine-grained parallelism within the meshes (Merlin, Baden, Fink, & Chapman, 1999; Diaz,
Rubio, Soler, & Troya, 2003). Hierarchical algorithms and divide-and-conquer algorithms compute partial
solutions for independent subsets of the input and derive the final solution from these partial results. Examples are multi-level matrix multiplication algorithms (Hunold, Rauber, & Rünger, 2008). Stream-based
applications process input streams by several pipeline stages and can exploit task and data parallelism
by replicating non-scaling stages and executing the replicas concurrently. Examples come from image
processing (Subhlok & Yang, 1997) and sensor-based programs that periodically process data produced
by sensors (Subhlok & Yang, 1997; Bal & Haines, 1998; Orlando, Palmerini, & Perego, 2000).

247

Mixed Programming Models Using Parallel Tasks

There is a large variety of specific parallel programming models which support the programming with
parallel tasks, multiprocessor tasks or related concepts. In the next section we start with an overview of
programming approaches for mixed parallelism with different ways of programming support. A more
detailed description is given for the TwoL (two-level) approach with its compiler support and the TLib
approach with a library interface. Moreover, we present extensions to the parallel programming with
M-tasks that have been proposed recently. The scheduling and mapping of M-tasks with dependencies
is an important method to get efficient versions of an M-task program. Thus, we present mapping techniques for M-task programs and finish the chapter with measurements of numerical codes on up-to-date
multi-core clusters.

TASK-BASED PROGRAMMING APPROACHES


Several approaches have been proposed for the use of M-tasks for programming large parallel systems
including language extensions as well as skeleton-based, library-based and coordination-based approaches.
We give an overview of these approaches in the following subsections.

Language Extensions
Language extensions enrich existing programming languages with additional annotations or language
constructs to support a mixed task and data parallel execution. The host languages are often data parallel
languages that are extended to support task parallelism or task parallel languages with support for data
parallelism. A special compiler is required to translate the language extensions. Most approaches use a
source-to-source compiler which creates a program in the host language that utilizes a runtime library
to realize the extensions.
Fortran M (Foster & Chandy, 1995) is a task parallel language based on Fortran 77. Language
constructs are provided for creating processes and communication channels that enable one-to-one communication between processes based on a message passing paradigm. The process model is dynamic,
i.e., new processes and communication channels can be created at runtime. Fortran D (Fox, Hiranandani, Kennedy, Koelbel, Kremer, & Tseng et al., 1990) and High Performance Fortran (HPF) (High Performance Fortran Forum, 1993) are data parallel languages based on Fortran 90 that include primitives to
distribute arrays among processors and data parallel operations such as array expressions and parallel
loops. The integration of Fortran M with either Fortran D or HPF is described in (Chandy, Foster, Kennedy, Koelbel, & Tseng, 1994). Fortran M is responsible for resource management, e.g. starting the data
parallel tasks that are executed on processor groups specified by the user, and Fortran D or HPF takes
care of the distribution of computations and data structures on these groups. Two concurrently executed
data parallel tasks can communicate with each other using send and receive operations on a channel that
has to be provided by the parent task.
Opus (Chapman, Mehrotra, van Rosendale, & Zima, 1994; Chapman et al., 1997) defines a set of
extensions to the data parallel HPF language to support the coordination of multiple independent data
parallel modules. Target applications of Opus are coarse-grained multi-disciplinary simulations consisting of independent program parts that periodically exchange information, e.g. for the simultaneous
optimization of the aerodynamic and structural design of an aircraft configuration. Task parallelism in
Opus is realized by special subroutines that may be invoked onto a specific set of processor resources


that has to be provided by the user. The heart of the Opus extensions is ShareD Abstractions (SDAs), which are objects encapsulating data and methods. An SDA may be shared by multiple tasks, thus supporting communication and coordination between these tasks. The framework OpusJava (Laure, Mehrotra, &
Zima, 1999; Laure, 2001) has been proposed to integrate Opus components into larger distributed Java
based environments and thus providing support for loosely coupled heterogeneous platforms.
Braid (West & Grimshaw, 1995) adds data parallelism to the object-oriented Mentat task parallel
language. Mentat is based on C++ and provides high-level abstractions to define task parallel objects.
The Mentat system handles the dynamic creation, communication, synchronization, and scheduling of
these objects. Braid additionally supports data parallel objects that include overlay methods to initialize data, aggregate methods to apply operations on all or a subset of the data elements, and reduction
methods to distill information from the values of the data set. The user can provide annotations to inform the compiler and runtime system about the communication behavior of the objects. This includes the local communication within data parallel methods, e.g. a nearest-neighbor communication pattern, the interaction between objects, and which operations are dominant. The runtime system realizes the distribution of the data based on the user's annotations and platform-specific characteristics.
Fx (Subhlok & Yang, 1997) is a Fortran-based language that integrates directives to partition and
layout data similar to HPF and directives to control task parallelism. Task parallelism can be exploited
within specific areas of the program called task regions. Within a task region subroutine calls can be
executed in a data parallel way by a subset of the available processors. The size and layout of the processor subsets can be computed at runtime. Each subroutine may contain additional task regions, thus
providing support for nested parallelism. The Fx framework includes a mapping tool to compute an
optimized task placement based on a dynamic programming approach (Subhlok & Vondran, 1995). The required cost information is obtained by executing the program with different mappings.
High Performance Fortran 2.0 (HPF 2.0) (High Performance Fortran Forum, 1997) is a language
extension based on Fortran 95 including approved extensions for a mixed task and data parallel execution. The utilized task model is similar in spirit to the Fx approach. The task region construct provides
support for creating independent coarse-grained tasks, each of which can itself execute data parallel or
nested task parallel computations. The on directive allows the programmer to control the distribution
of computations among the processors of a parallel machine. The distribution of the data on processor groups and subgroups is supported by the distribute and align directives. The shape of the utilized
processor groups can be computed at runtime.
Orca (Ben Hassen, Bal, & Jacobs, 1998) defines a specification language that is translated into
C code utilizing a special runtime library. Data parallelism is available in form of partitioned objects
that may be distributed over multiple processors. Data parallel computations are performed using the
owner-computes rule and communication operations to access remote data are inserted by the compiler.
Task parallelism is expressed by using processes that can be started dynamically. The data distribution
for partitioned objects and the processors for the execution of a task have to be explicitly coded by the
programmer. The communication between processes is supported by shared objects that are implemented
as instances of abstract data types. Each process can read and modify data within the shared objects by
using atomic operations, thus enabling data exchanges between concurrently executing processes.
Spar/Java (van Reeuwijk, Kuijlman, & Sips, 2003; Sips & van Reeuwijk, 2004) defines language
extensions for Java that are translated to C++ code by the Timber compiler using the Vnus language
as an intermediate step. The compiler includes special optimizations for multi-dimensional arrays. The
language extensions provide annotations to explicitly distribute data and computations. The syntax of
these annotations is similar to that of functional languages. The foreach construct defines data parallel computations, e.g. operations on array elements. The each construct defines data independence for a set of
statements and therefore enables a task parallel execution. The executing processors can be specified
for each statement using the on annotation.
Fortress (Allen, Chase, Hallett, Luchangco, Maessen, & Ryo et al., 2008), Chapel (Chamberlain,
Callahan, & Zima, 2007) and X10 (Charles, Grothoff, Saraswat, Donawa, Kielstra, & Ebcioglu et al.,
2005) are new parallel programming languages that are currently under development. The underlying
programming models provide a higher level of abstraction than the previously mentioned approaches
and are targeted to increase the productivity of the programmers. The memory is assumed to be globally
shared by all program parts; necessary communication operations have to be automatically determined
by the compiler.
Fortress is an object-oriented language that expresses parallel computations with implicit and explicit
threads. Explicit threads are created by the programmer; implicit threads are created by parallel language
constructs, e.g. also-do blocks to define independent computations for task parallelism or for loops
which are parallel by default and can be executed in a data parallel way. The parallel target platform is
modeled by regions that can be hierarchically nested; threads can be assigned to specific regions by the
user to increase the performance. Currently, Fortress is only available for shared memory platforms but
an extension for distributed memory systems is planned.
Parallel platforms in the Chapel language can be described by a set of locales, e.g. a locale per cluster
node. Data and computations can be mapped on locales using the on clause. Data parallelism in Chapel
is expressed by domains that define the size and shape of arrays. Domains can be distributed among
locales. Data parallel operations on array elements can be expressed using the forall loop, or the reduce
and scan functions. Task parallelism is supported by the cobegin directive that expresses independent
computations.
X10 is based on a partitioned global address space (PGAS) memory model that is represented by a
set of places. Multiple activities may be executed concurrently by different places. The async statement
supports the creation of asynchronous activities on specific places. These activities can be synchronized
using the finish statement. X10 supports multi-dimensional arrays that may be distributed among a
set of places using pre-defined or user-defined distribution types. Data parallel operations on arrays can
be performed using the ateach construct.

Skeleton-Based Approaches
Skeleton-based approaches include a predefined set of coordination patterns to combine sequential code
or small parallel program fragments into complex parallel applications. Parallel skeletons can provide
support for data parallel computations, e.g. mapping the same code onto different parts of the input data,
or for task parallel computations, e.g. arranging different tasks in form of a pipeline. Multi-level parallelism is supported by nesting different skeletons within each other.
P3L (Pelagatti, 2003) is a skeleton coordination language using C as a host language that is used to
express the sequential portions of the application. The supported skeletons include data parallel, task
parallel and control skeletons that can be nested within each other. Data parallel skeletons are map to
distribute data and to apply a specific skeleton to each data element, reduce to combine distributed data
into a single value, scan to compute the parallel prefix of an array, and comp for functional composition.
The task parallel skeletons operate on streams of input data; pipe applies a sequence of skeletons to the
input data one after another forming a pipeline and farm applies the same skeleton to different items of
the input data stream. Control skeletons are seq for wrapping sequential code and loop for the repeated
execution of another skeleton. P3L includes a compiler for the generation of C+MPI programs that utilize
a library which provides optimized implementation templates for each skeleton. A cost expression for
each skeleton is available and, thus, the costs for the entire application can be determined by combining these cost expressions according to the hierarchical structure of the application. The costs for the
sequential fraction of the code are obtained by profiling.
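As a purely illustrative example (the concrete cost expressions of P3L are not reproduced here), the per-item cost of a two-stage pipeline whose first stage is a farm with w workers could be composed as T = max(T1 / w, T2), reflecting that the throughput of a pipeline is bounded by its slowest stage and that a farm divides the work of a non-scaling stage among its workers; inserting profiled values for the sequential costs T1 and T2 then yields a cost estimate for the whole skeleton hierarchy.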
taskHPF (Ciarpaglini, Folchi, Orlando, Pelagatti, & Perego, 2000) uses a two-tier model for combining
task and data parallelism. The task parallel coordination structure is described by a high-level language
that includes the definition of data parallel tasks with input and output parameters and the interaction
between tasks based on predefined skeletons. Available skeletons are the pipeline pattern and the replicate
directive to create multiple incarnations of non-scalable stages. The specification language includes the
on processors directive to define the number of executing processors for each data parallel task. HPF is
used to implement the data parallel tasks and to describe data distributions within these tasks. Necessary
re-distribution operations between data parallel tasks are identified by the compiler and are realized by
the COLTHPF (Orlando & Perego, 1999; Orlando et al., 2000) coordination layer.
LLC (Dorta, González, Rodriguez, & de Sande, 2003; Dorta, López, & de Sande, 2006) is a high-level parallel language with support for algorithmic skeletons. The host language is C augmented with
OpenMP-like directives to define skeletons and to provide additional information to the compiler. The
compiler llCoMP translates this code into a parallel C+MPI program. Basic data parallel skeletons in
LLC are forall to define parallel loops and taskq to define task farms. Task parallelism is provided by
the sections skeleton to define independent computations and the pipeline skeleton to describe pipelined computations. The implementation of the basic skeletons partitions the available processors into
a number of subgroups equal to the number of tasks, e.g. number of pipeline stages or loop iterations.
The mapping of tasks onto processor groups can be controlled by assigning weights.
ASSIST (Vanneschi, 2002) is a framework for the skeleton-based composition of sequential and
parallel modules into complex applications. Sequential modules are provided in a host language of
ASSIST (C, C++, and Fortran) and operate on streams of input data. Parallel modules are expressed by
the parmod construct that defines the input and output streams, a set of virtual processors and a virtual
processor topology. Additionally, modules can access external objects that may be declared as shared,
thus supporting data exchanges between concurrently executing modules. The interaction between the
modules in form of a directed graph is described in the ASSIST-CL coordination language. The nodes
of the graph correspond to components and the edges are data streams that are communicated between
components. For program execution, the virtual processors of the parallel modules have to be mapped onto
physical processors. This mapping can also be reconfigured at runtime (Vanneschi & Veraldi, 2007).
Lithium (Aldinucci, Danelutto, & Teti, 2003) is a Java-based programming environment for the
development of structured parallel applications based on a set of predefined skeletons provided in form
of a library. A variety of skeletons is supported, e.g. the data parallel map and divide-&-conquer, the
task parallel farm and pipe, and control skeletons to model loops and conditionals. The execution of a
Lithium application is based on a master-slave approach: the master contains a task pool with all executable tasks that are distributed to the slave nodes.
DIP (Diaz et al., 2003) is a pattern-based coordination language with focus on domain decomposition
and multi-block applications, e.g. solution methods for PDEs. The implementation of DIP is based on
the border-based coordination language BCL (Diaz, Rubio, Soler, & Troya, 2002), i.e., the DIP compiler
translates a DIP specification program into a BCL program. BCL supports the solution of numerical
problems with multiple domains by automatically creating necessary border-exchange operations
between domains. Basic data parallel tasks are implemented in HPF. DIP provides the multiblock
pattern to describe a fixed number of k-dimensional domains with fixed boundary coordinates that
require a periodic exchange of border values. Additionally, the pipe pattern to describe a chain of
pipeline stages and the replicate directive to create multiple independent incarnations of a pipeline
stage each operating on different data from the input stream are available. DIP supports multiple
implementation templates for each pattern. The programmer is responsible for selecting an appropriate template and for specifying the number of processors to execute each task.
SBASCO (Diaz, Rubio, Soler, & Troya, 2004) is an enhancement of the DIP approach that similarly
supports the multiblock, pipe and farm skeletons. Additionally, SBASCO includes a cost model for
the estimation of the execution time of each skeleton depending on hardware parameters. SBASCO
distinguishes two different views on the specification, the application view and the configuration
view. The application view describes the structure of the application using the available skeletons and
the basic data parallel components with their input and output parameters. The configuration view
extends the application view with information on data distributions, processor layout and the internal
structure of the components. The application view is provided by the programmer; the configuration view is used by a configuration tool to obtain an efficient allocation of the different application
components on parallel platforms based on the cost model enhanced by a run-time analysis.

Library-Based Approaches
Library-based approaches provide library routines to support task and data parallel executions. This
includes the support of coordination and synchronization of multiple data parallel tasks, the provision of data re-distribution routines, the creation of processor groups and the execution of tasks on
the correct processors.
HPF/MPI (Foster, Kohr, Krishnaiyer, & Choudhary, 1996) is a library that provides an HPF
binding to the MPI message passing library and thus enables HPF programs to issue MPI communication operations. Therefore, the coordination and synchronization of different concurrently
executing programs is supported. Arbitrary variables defined in the HPF program can be used as
parameters for the communication operations provided. These variables may be distributed among
the processors executing the data parallel module and therefore the implementation of the library
has to deal with arbitrary distributed data structures. For example, a point-to-point communication
operation between two modules has to handle the case of different source and target distribution
types. This is realized using a descriptor exchange to exchange distribution information between
communicating modules.
HPF_TASK_LIBRARY (Brandes, 1999) enables the interaction of data parallel HPF tasks during their execution time by providing point-to-point and collective communication operations. The
library is designed for the HPF 2.0 task model that supports the creation of data parallel tasks on
disjoint processor groups but does not allow communication between concurrently executed tasks.
The library supports the exchange of data structures that are distributed among multiple processors.
Therefore, the distribution information has to be exchanged prior to the data transmission to determine the resulting communication pattern. Nested parallel executions are supported, but only tasks
on the same nesting level may communicate with each other.

Figure 1. Decomposition of the set of processes V = {1, 2, ..., 9} into a two-dimensional grid and execution of a group-SPMD phase using vertical processor groups (left) and horizontal processor groups (right).

KeLP-HPF (Merlin et al., 1999) uses the C++ class library KeLP (Fink, 1998) to coordinate multiple
data parallel HPF tasks. KeLP provides high-level abstractions to simplify the development of block-structured algorithms on SMP clusters. KeLP builds on MPI and includes mechanisms to manage data
layout, data motion and the parallel control flow. For the data layout, general block decompositions are
supported and the communication schedules are determined at runtime. In the KeLP-HPF programming
model, KeLP can dynamically create processor groups and start new HPF tasks. Thus, the programming
model is especially suited for applications that execute regular data parallel operations on an irregular
or dynamic domain, e.g. multi-block codes or adaptive refinement methods. The arguments for the data
parallel tasks are provided by KeLP in a distributed format along with a mapping descriptor that informs
the HPF code of the distribution type.
The library ORT (Rauber, Reilein-Ruß, & Rünger, 2004a) supports the programming of applications
with a two or higher dimensional task grid and task dependencies mainly aligned in the dimensions of
the task grid. Examples are algorithms from linear algebra based on two or higher dimensional arrays,
like the LU decomposition. The programming model of the ORT library is based on a group-SPMD
model in which the set of processors is subdivided into a set of disjoint groups of processors and each
processor group executes a parallel task in parallel to the other groups. In the programming model of the ORT library there exist several partitions of the entire set of processors into disjoint groups with the
specific property that the groups are orthogonal to each other in a two or higher dimensional grid; Figure
1 shows the two-dimensional case.
A typical ORT program consists of computation phases and communication phases. Each phase is
executed on exactly one of the processor decompositions and performs either group-SPMD computation
on the decomposition or a communication within the groups. During the execution of the ORT program
the active processor decomposition changes from phase to phase such that different tasks cooperate in the group-SPMD way. The ORT library calls support the building of processor decompositions based on MPI and the mapping of tasks to the processor groups. The orthogonal way of communication can speed up the communication phases of an application, and it is useful to integrate it into a hierarchical model, as will be described for an extended programming model in a subsequent section of this article.
Coordination-Based Approaches
Coordination-based approaches are based on a static task structure that might be provided in form of an
explicit specification of the available parallelism. A compiler or a transformation-based toolset can be
used to translate the specification into executable code. In contrast to language extensions, the complete
structure of the application is visible to the compiler and optimizations like scheduling can be applied.
Paradigm (Joisha & Banerjee, 1999) is a parallel research compiler framework for HPF programs
that additionally supports task parallel extensions proposed in (Ramaswamy et al., 1997). The extensions include annotations in the program source that enable the automatic extraction of the task parallel
structure of the application in form of a macro dataflow graph (MDG). The MDG has a hierarchical
structure with simple nodes representing computation, loop nodes (for or while), conditional nodes (if) and user-defined nodes; edges symbolize data dependencies. The MDG is annotated with cost parameters
and user-defined nodes, edges symbolize data dependencies. The MDG is annotated with cost parameters
for simple nodes and possible data re-distribution operations resulting from data dependencies. The node
costs are determined by profiling and fitting the obtained results to a curve according to Amdahl's law.
The costs for data re-distribution operations depend on the size of the transmitted data, the overhead for
sending and receiving data, and the transmission time of the network. The Paradigm framework includes
scheduling support to map MDGs on a specific target platform. Two scheduling algorithms, TSAS and
SAS, are available (Ramaswamy, 1996). The final stage is the generation of an optimized MPMD code
that utilizes a data re-distribution library for multi-dimensional arrays. The communication pattern and
the communication schedule of the re-distribution operations are calculated at runtime using the FALLS
algorithm (Ramaswamy, Simons, & Banerjee, 1996).
Network of Tasks (Pelagatti & Skillicorn, 2001) is a programming model that defines a coordination
language for coarse-grained tasks with an emphasis on runtime prediction. An application is modeled
as a directed acyclic graph with nodes being arbitrary parallel programs that may be heterogeneous.
The nodes are adaptive, i.e., different implementations may be available and the number of executing
processors can be modified. The directed edges of the graph indicate one-way communication with
parallel data structures or streams. The scheduling of the task graph on a target platform is performed
by a work-based allocation technique (Skillicorn, 1999). Pipelining and farming are used to increase
application performance in case there are enough processors available. Pipelining allows the simultaneous execution of all nodes of a subgraph, e.g. iterations of a loop, and farming increases the effective
parallelism by replicating non-scaling nodes. The costs for the entire application are composed of the
costs for the nodes that are provided by the user and the costs for the communication that are derived
using the BSP model.
S-Net (Grelck, Scholz, & Shafarenko, 2007) is a stream-based coordination language to combine
data parallel modules implemented in SAC. SAC is a side-effect free functional language that supports
data parallel operations on n-dimensional stateless arrays. S-Net treats data parallel SAC programs as
stateless boxes operating on input streams and producing an output stream. On arrival of an item on
the input stream the box is expected to apply its operation and produce one or more output items on a
single output stream. In S-Net, the functionality of these boxes is defined using a box signature that
maps the input type to output types.
Four constructors are available to hierarchically compose boxes into complex networks. The static
serial composition connects two networks A and B by connecting the output of A with the input of B.
The static parallel composition of two networks A and B sends input items depending on their types
either to A or B and merges the output streams of A and B. The serial and parallel replicators support the
repeated execution of a network A where the iterations are connected via serial composition or parallel
composition, respectively. A compiler for S-Net programs for shared memory platforms is currently
under development.
The performance-aware composition framework (Kessler & Löwe, 2007) supports the combination of parallel and sequential components into parallel applications with an emphasis on performance
prediction. Each component is required to provide a functional interface specifying the parameters and
a performance interface that contains information on the execution time depending on the number of
executing processors. For each component there may exist multiple parallel or sequential implementation variants that share the same functional interface but define separate performance interfaces. The
implementation of the parallel variants may be based on an SPMD programming model. The structure
of the application is defined using a host language extended by annotations that are evaluated by a
composition tool. Parallel components may include compose_parallel operators that mark independent
invocations of components, i.e., any sequential or parallel execution order is valid. Calls to components
outside this operator are assumed to be executed sequentially in the specified order.
The execution of the target application is based on a static variant dispatch table that is created by
the composition tool. This table contains the optimal implementation variant for each combination of
component and processor number. Additionally, this table contains a schedule for each compose_parallel operator for each number of executing processors. The schedule is determined using scheduling
techniques for independent M-tasks and specifies the execution order and sizes of processor groups. At
runtime, the optimal schedule or implementation variant is selected depending on the actual problem
size and number of available processors. A prototype compiler using the C-based parallel language Fork
for the implementation of the components has been realized.

HIERARCHICAL M-TASK PROGRAMMING


In this section, we present hierarchical programming approaches for mixed programming with parallel
tasks. In particular, we describe the library TLib (task library), as well as the coordination model TwoL
(two level) which is based on a specification language for the hierarchical composition of tasks.

TwoL Model
Support for the programming with parallel tasks (called modules) can also be provided in form of a
coordination approach. The TwoL (two level) model (Rauber & Rünger, 1996; Rauber & Rünger, 2000)
is a top-down method for the development of applications that distinguishes two well separated layers
of parallelism. The lower (data parallel) level defines the interfaces of modules that are provided by the
application developer. These basic modules are treated as black-box SPMD codes that can be executed
on an arbitrary number of processors. For each basic module there may be multiple implementation
variants differing in the data distribution of the parameters or the employed algorithm. The upper (task
parallel) level defines composed modules that are hierarchically composed of other modules.
The structure of the composed modules is defined in the platform-independent TwoL-specification
language. The specification of a composed module is based on a module expression consisting of invocations of modules (as basic elements) and a set of coordination operators that specify data dependence
or data independence between subexpressions. The ||-operator combines independent computations for
which both a concurrent and a consecutive execution are possible. The sequential composition operator demands the subsequent execution of computations due to data dependencies, which may lead to data re-distribution operations.
Sequential loops can be defined using the for and while operators and parallel loops are specified using
the parfor operator. A data dependence between the iterations of sequential loops is assumed whereas
the iterations of parallel loops are independent from each other and can be computed concurrently. The
conditional execution of subexpressions is supported by the if operator.
The initial TwoL-specification of an algorithm defines the maximum degree of available task parallelism. For an execution on a specific target platform, the actual degree of task parallelism that should
be exploited and the data distributions of the modules need to be fixed. These decisions are made by
several incremental transformation steps resulting in a non-executable parallel frame program. The
parallel frame program does not include any platform-dependent information, but different platforms
may require different frame programs to achieve a good performance. The final transformation step of
the TwoL framework translates the parallel frame program into an executable message passing program.
The generated program is responsible for the creation of the correct communication context for the
execution of the basic modules, e.g. by using communicators provided by MPI, and a correct dataflow
between modules by inserting data re-distribution operations at the appropriate positions.
The design steps in the derivation of an efficient parallel frame program are based on a cost model
that has to provide accurate predictions for the execution times of modules depending on the number
of executing processors and for the data re-distribution operations between modules depending on the
source and target processor groups and distribution types. For the basic modules, a cost model based
on parameterized runtime formulas is employed. These closed-form symbolic formulas consist of a
term describing the execution times of the arithmetic operations and functions that describe the runtime of the internal communication operations such as single-transfer and broadcast operations. The
platform-independent structure of the runtime formulas can be derived by inspecting the program text.
The compiler tool SCAPP (Kühnemann, Rauber, & Rünger, 2004) has been developed to automate this
task. The platform-specific parameters of the formulas can be determined through profiling techniques.
Data re-distribution costs are modeled using a platform-dependent startup time and byte-transfer time of the interconnection network. For composed modules, the runtime functions are composed according to the hierarchical structure of the module (see Figure 2).
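To give a flavor of such a formula (this example is illustrative and not taken from the TwoL literature), a basic module performing a matrix-vector multiplication of dimension n on p processors might be described by T(n, p) = (2n^2/p) * t_op + log2(p) * (tau + n * t_c), where the first term models the arithmetic operations and the second term a broadcast of the result vector; the structural form of the expression follows from the program text, while t_op, tau and t_c are the platform-specific parameters determined by profiling.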
The scheduling step of the TwoL framework determines an execution order for independent modules,
assigns processors to modules and load balances processor groups. For this step, the specification program is transformed into a global module dependence graph (MDG) that captures the data dependencies
between modules. The MDG is a directed acyclic graph that exhibits a series-parallel (SP) structure,
Figure 3 (left) shows an example. For the scheduling and load balancing, the TwoL-Level (Rauber &
Rünger, 1998) and the TwoL-Tree (Rauber & Rünger, 1999b) algorithms have been developed and implemented in a scheduling toolkit (Dümmler, Kunis, & Rünger, 2007a). The runtime of a TwoL program may also be influenced by the choice of the data distribution for the input and output parameters of
the modules. Therefore, the TwoL framework includes support for the automatic derivation of suitable
data distributions. The derivation process uses a dynamic programming approach that determines the
optimal distribution types by exploiting the hierarchical structure of the application (Rauber, Rünger,
& Wilhelm, 1995).
The core concepts of the TwoL model have been implemented in form of a compiler tool (Rauber, Reilein-Ruß, & Rünger, 2004b; Reilein-Ruß, 2005).

Figure 2. Illustration of the hierarchical structure of a TLib program. M-task M1 consecutively executes
M2, M3 and M13; M3 concurrently executes M4 and M9; M4 executes M5 and M6 one after another, where M6 further subdivides the available processors to execute M7 and M8 in parallel; M9 consists of the sequential execution of M10, M11, and M12.
M6 further subdivides the available processors to excute M7 and M8 in parallel; M9 consists of the sequential execution of M10, M11, and M12.

Figure 3. Illustration of a task graph representing an M-task application in the TwoL model (left) and
a possible CM-task graph (right)

Programming Interface TLib


The runtime library TLib has been developed to support the programming with hierarchically structured
M-tasks. TLib library functions are designed to be called in an SPMD manner which results in multilevel group-SPMD programs. The entire management of groups and M-tasks at execution time is done
by the library. Thus, the TLib API provides support for:
a. The creation and administration of a dynamic hierarchy of processor groups;
b. The coordination and mapping of nested M-tasks to processor groups;
c. The handling and termination of recursive calls and group splittings;
d. The organization of communication between M-tasks.

Internally, the library uses distributed information stored in distributed descriptors which cannot be
accessed directly by the user, thus hiding the complexity of the group management and the multi-level
group-SPMD organization. This relieves the application programmer from realizing the technical details
of hierarchical M-tasks and the corresponding group management and allows him to concentrate on how
to exploit the potential M-task structure of the given application. The current version of the library is
based on C and is built on top of MPI. A TLib program consists of:
a. A set of basic functions expressing M-tasks that are executed in an SPMD style and that comprise the computations to be performed;
b. A set of coordination functions to control the execution of the basic functions.

The processors executing a basic M-task can exchange information with arbitrary MPI operations.
The coordination functions allow a concurrent execution of basic M-tasks by the activation of suitable
library functions. The coordination functions can be nested arbitrarily. Thus, a coordination function
can assign other coordination functions to subgroups of processors for execution, which can then again
split the corresponding subgroup and assign other coordination functions. A basic M-task function F is
expressed as a function of the form,
void *F(void *arg, MPI_Comm comm, T_Descr *pdescr)

where the parameter arg comprises the arguments used by F; the parameter comm specifies an MPI communicator which can be used for internal communication within the M-task F; pdescr is a reference to a
TLib group descriptor containing information about the processor group onto which F is mapped. This
descriptor can be used to dynamically split this processor group further in the body of F, if F exhibits
an internal task parallel structure. F may also generate a recursive call of itself on a smaller subgroup,
thus enabling the implementation of divide-and-conquer algorithms.
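To make the shape of such a basic M-task concrete, the following sketch shows a simple function in the documented form; the argument struct norm_arg_t and the assumption that the TLib type T_Descr is made available by the library's header are illustrative and not part of the TLib interface.

#include <mpi.h>
/* T_Descr is assumed to be provided by the TLib header (the header name
   is not given in the text and is therefore omitted here). */

typedef struct { double *x; int n; double result; } norm_arg_t;  /* illustrative */

/* Basic M-task: squared Euclidean norm of a vector distributed over the
   processor group executing this task. Group-internal communication uses
   the communicator passed by the TLib runtime. */
void *squared_norm(void *arg, MPI_Comm comm, T_Descr *pdescr) {
    norm_arg_t *a = (norm_arg_t *) arg;
    double local = 0.0;
    for (int i = 0; i < a->n; i++)       /* local part of the vector */
        local += a->x[i] * a->x[i];
    MPI_Allreduce(&local, &a->result, 1, MPI_DOUBLE, MPI_SUM, comm);
    (void) pdescr;   /* no further group splitting in this simple M-task */
    return &a->result;
}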
The TLib library provides functions for initialization, splitting of groups into two or more subgroups,
assignment of tasks to processor groups, and getting information on the subgroup structure. An example
of a library function for splitting a processor group into two processor groups is,

int T_SplitGrp(T_Descr *pdescr, T_Descr *pdescr1, float per1, float per2)

where pdescr is a reference to the group descriptor of the original group and pdescr1 is a reference to
a new group descriptor that is generated by the library function. The parameters per1 and per2 specify
fractional values with per1 + per2 ≤ 1. The effect of the operation is a splitting into two disjoint processor groups with a percentage of per1 or per2 of the processors of the original group, respectively, as
specified by the parameter pdescr. More splitting operations are provided, allowing, e.g., the splitting
into an arbitrary number of processor groups.
After a splitting operation generating a number of subgroups, M-tasks can be assigned to the newly
generated subgroups by corresponding mapping functions. An example for a mapping operation onto
two processor groups is,
int T_Par(void *(*F1)(void *, MPI_Comm, T_Descr *),
void *parg1,
void *pres1,
void *(*F2)(void *, MPI_Comm, T_Descr *),
void *parg2,
void *pres2,
T_Descr *pdescr)
where F1 and F2 describe the M-tasks to be mapped to the subgroups and to be executed concurrently
by the subgroups; parg1 and parg2 are the parameters for F1 and F2, respectively; pres1 and pres2 are
the results produced by F1 and F2, respectively; the subgroups are described by the parameter pdescr.
More mapping operations are provided to assign M-tasks to an arbitrary number of subgroups. Figure
2 shows an example for the emerging hierarchical structure of TLib programs.
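As a small sketch of how these calls combine (the basic M-tasks task_A and task_B, their argument and result buffers, and the surrounding initialization are assumptions of the example, not part of the TLib description above), a coordination function could split its group and run two M-tasks concurrently as follows.

void *task_A(void *, MPI_Comm, T_Descr *);   /* basic M-tasks assumed to exist */
void *task_B(void *, MPI_Comm, T_Descr *);

/* Coordination function in M-task form: split the current group into two
   halves and execute the basic M-tasks task_A and task_B concurrently. */
void *run_pair(void *arg, MPI_Comm comm, T_Descr *pdescr) {
    T_Descr sub;                      /* descriptor generated by the split   */
    double res_a = 0.0, res_b = 0.0;  /* result buffers (illustrative)       */

    T_SplitGrp(pdescr, &sub, 0.5f, 0.5f);   /* per1 + per2 <= 1              */
    T_Par(task_A, arg, &res_a,              /* first M-task, first subgroup  */
          task_B, arg, &res_b,              /* second M-task, second subgroup */
          &sub);
    (void) comm;                      /* no group-wide communication needed  */
    return NULL;
}

Because run_pair itself has the M-task signature, it can in turn be assigned to a subgroup by another coordination function, which yields the nested structure shown in Figure 2.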
A detailed description of the TLib library is given in (Rauber & Rünger, 2005) along with example applications demonstrating the use of the library. The use of TLib for specifying efficient parallel implementations for matrix multiplication based on the Strassen algorithm has been described in (Hunold, Rauber, & Rünger, 2004). A data re-distribution library DRDLib to support re-distributions between cooperating M-tasks using TLib has been described in (Rauber & Rünger, 2006).

EXTENDED PROGRAMMING MODEL


In the TwoL programming model, two modules M1 and M2 can only communicate in between their executions, i.e. the module supplying the data has to finish its execution and the module consuming the data
must not have started. For applications that require periodic data exchanges between program parts, it
might also be beneficial to allow further data exchanges during the execution of concurrently executed
modules. Examples for such applications are time stepping methods that perform a data exchange at the
end of each time step. An implementation using the TwoL programming model restricts the modules
to the execution of a single time step and, thus, limits the possible granularity of the modules. A more
natural way to structure such applications is to combine multiple time steps within a module and to support communication between running modules to perform the required data exchanges. In the following,
we present an extended programming model that follows this idea and discuss programming support for
the development of efficient parallel implementations in this model.

CM-Task Programming Model


The programming model of communicating multiprocessor tasks (CM-tasks) (Dümmler, Rauber, & Rünger, 2007) extends the TwoL model by providing support for communication between concurrently
executed modules. CM-tasks are parallel modules which have a set of input and output parameters and
support the execution on an arbitrary number of processors. The interactions between CM-tasks are
expressed by P-relations and C-relations:

• Precedence relations (P-relations) capture the input/output dependencies between CM-tasks. A P-relation between CM-tasks A and B denotes that A produces output data required as an input for B and might lead to a data re-distribution operation between A and B when A and B are executed on different subsets of the processors or if A provides its output data in a different distribution than expected by B. These are the dependencies captured in the original TwoL model.
• A communication relation (C-relation) between CM-tasks A and B denotes that A and B have to communicate with each other during their execution. This is an extension of the TwoL model since modules can now communicate during their execution, if they are connected with a C-relation.

The structure of a CM-task program can be represented by a CM-task graph G = (V, E) where the
set of nodes V corresponds to the set of CM-tasks. The set of edges E = Ep ∪ Ec consists of the set of
directed edges Ep representing P-relations and the set of bidirectional edges Ec symbolizing C-relations.
Figure 3 (right) shows an illustration of a CM-task graph. The possible execution orders of the CM-tasks
are limited by the P-relations and C-relations. A P-relation connecting CM-tasks A and B requires that
the execution of A must have been finished and all required data re-distributions between A and B must
have been carried out before B can be started. CM-tasks connected by a C-relation must be executed
concurrently by disjoint subsets of the processors to perform the specified data exchanges. Therefore, there cannot be both a P-relation and a C-relation between CM-tasks A and B, and hence Ep ∩ Ec = ∅.
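A possible in-memory representation of such a graph (purely illustrative; not the data structure of the compiler framework discussed below) keeps the two edge sets separate, mirroring E = Ep ∪ Ec with Ep ∩ Ec = ∅:

typedef struct { int src, dst; } edge_t;   /* indices into the task array */

typedef struct {
    int     n_tasks;   /* |V|: number of CM-tasks                         */
    edge_t *p_edges;   /* Ep: directed precedence relations               */
    int     n_p;
    edge_t *c_edges;   /* Ec: bidirectional communication relations,      */
    int     n_c;       /*     stored with src < dst by convention         */
} cm_task_graph_t;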
Examples for CM-task programs are iterated Runge-Kutta (IRK) methods (van der Houwen & Sommeijer, 1991; Rauber & Rünger, 1999a) and Parallel Adams methods (van der Houwen & Messina, 1999)
that are time-stepping methods for the solution of initial value problems of non-stiff ODEs. Due to data
dependencies between successive time steps, each time step of these applications has to be computed
by a separate set of modules in the TwoL model. Using the CM-task model, successive time steps can
be combined within a single set of CM-tasks and data dependencies between time steps are modeled by
C-relations. This enables the CM-task version to exploit optimized communication patterns. Examples
are the orthogonal arrangement of the processes (cf. Figure 1) and the use of concurrent multi-broadcast
operations to realize the data exchanges between the CM-tasks. Benchmark results comparing a pure
M-task based implementation with a CM-task version are presented at the end of this chapter.

Figure 4. Overview of the incremental transformation steps used to create an executable CM-task coordination program

Development of CM-Task Programs


Support for the development of CM-task programs has been proposed in form of a compiler framework
(Dümmler, Rauber, & Rünger, 2008a). The framework consists of several consecutive transformation
steps and supports the incremental creation of an executable coordination program from an initial specification of the structure of the CM-task application using the non-executable, platform-independent
specification language. Each transformation step of the framework adds additional information resulting
in an augmented specification. The specification language supports the definition of basic CM-tasks
whose implementation has to be provided by the application developer and composed CM-tasks whose
structure is visible to the framework. Composed CM-tasks consist of CM-task activations and control
constructs guiding the control flow, i.e. conditional execution (if-statement), the repeated execution
with data dependencies (while-loop, for-loop) and without data dependencies between loop iterations
(parfor-loop).
The transformation process is depicted in Figure 4 and consists of four consecutive steps. The first
step, the Dataflow Analyzer, takes the initial specification of a parallel algorithm as an input. The data
dependencies in this specification program are defined implicitly using input/output parameter lists and
variable names. The Dataflow Analyzer is responsible for uncovering these dependencies and inserting
the appropriate P-Relations and C-Relations.
The successive transformation step, the Scheduler, requires additional information about the target
platform that is provided in form of a Machine Description. This input file specifies the number of available processors and contains approximations of the computational power, i.e. the average time required
to execute an arithmetic operation, and the communication performance, i.e. the startup and byte-transfer
time of the interconnection network. The output of the Scheduler is a platform-dependent specification
program with annotations that define the execution order and the executing processor groups for each
CM-task invocation.
The Data Manager inspects all P-relations and uses the computed schedule for the CM-task program
to decide which data re-distribution operations are required for a correct execution. The Code Generator creates the final message passing program in the target language. The created coordination program
consists of:

• The execution of CM-tasks on the processor groups computed by the scheduler; basic CM-tasks are provided by the user in form of a library and the coordination code for composed CM-tasks is created by the framework;
• The execution of data re-distribution operations between CM-tasks; the communication pattern is statically computed by the framework and included in the coordination program;
• Coordination constructs (loops, conditions) according to the input specification;
• Processor group management code that creates the correct MPI communicators for communication between concurrently executed CM-tasks as specified by the C-relations and for the execution of CM-tasks and data re-distribution operations.

A prototype realization of the transformation framework as a compiler tool for the C target language
is available.
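The kind of group management code such a coordination program has to contain can be illustrated with plain MPI; the listing below is not output of the prototype compiler, but a hand-written sketch showing how two CM-tasks on disjoint groups of an (assumed even-sized) set of processes can exchange data in every time step, as required by a C-relation. The buffer size n and the number of time steps are illustrative parameters.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: two CM-tasks run on the lower and the upper half of the processes
   (even total number of processes assumed); corresponding ranks of the two
   halves exchange n values per time step. */
void run_two_cm_tasks(int steps, int n) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half  = size / 2;
    int color = (rank < half) ? 0 : 1;        /* CM-task id                  */
    MPI_Comm group;                           /* group-internal communicator */
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

    int partner = (color == 0) ? rank + half : rank - half;
    double *mine  = calloc(n, sizeof(double));
    double *other = calloc(n, sizeof(double));

    for (int t = 0; t < steps; t++) {
        /* ... group-internal computation and communication via 'group' ... */
        MPI_Sendrecv(mine, n, MPI_DOUBLE, partner, 0,
                     other, n, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);   /* C-relation exchange */
    }
    free(mine); free(other);
    MPI_Comm_free(&group);
}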

SCHEDULING AND MAPPING


A suitable schedule is crucial to obtain the maximum performance of an M-task application. For homogeneous platforms, the schedule defines the execution order of the M-tasks and the number of executing
processors. Unfortunately, the M-task scheduling problem of determining the optimal schedule that
leads to the lowest execution time is NP-complete. Therefore, several research groups have proposed
scheduling heuristics and approximation algorithms to automatically determine a good schedule. Examples are TwoL-Level (Rauber & Rünger, 1998), TwoL-Tree (Rauber & Rünger, 1999b), CPR (Radulescu, Nicolescu, van Gemund, & Jonker, 2001), CPA (Radulescu & van Gemund, 2001), iCASLB (Vydyanathan, Krishnamoorthy, Sabin, Çatalyürek, Kurç, & Sadayappan et al., 2006a) and Loc-MPS (Vydyanathan, Krishnamoorthy, Sabin, Çatalyürek, Kurç, & Sadayappan et al., 2006b), see (Dümmler, Kunis, & Rünger, 2007b) for a comparison. For heterogeneous platforms, the schedule additionally has
to define the mapping of M-tasks onto specific processors. Scheduling heuristics for large heterogeneous
cluster-of-cluster platforms restrict the execution of an M-task to a single homogeneous subcluster, but
each subcluster is allowed to execute multiple M-tasks concurrently. Examples are M-HEFT (Suter,
Desprez, & Casanova, 2004) and HCPA (Ntakpé & Suter, 2006), see (Ntakpé, Suter, & Casanova,
2007) for a comparison. Multi-core and SMP clusters are heterogeneous platforms that are built up of
a set of homogeneous processing cores interconnected by a heterogeneous network. For these systems,
the scheduling can be performed by two consecutive steps (Dümmler, Rauber, & Rünger, 2008b):
1. Scheduling the M-task graph describing the application on a set of homogeneous symbolic cores whose computing performance is equal to that of the physical cores of the target platform, while a homogeneous interconnection network is assumed, and
2. Mapping the symbolic cores used for the scheduling decisions onto physical cores.

Figure 5. Illustration of a tree structure representing the architecture of a multi-core SMP cluster (left) and the use of the Dewey notation to label the computing elements (right).

The scheduling step is similar to the homogeneous scheduling algorithms mentioned above. In the following, we concentrate
on the mapping step that has to define a mapping for each point in time, i.e. there has to be an assignment of the symbolic cores of the currently executing M-tasks to the physical cores of the architecture.
For each M-task, the mapping is fixed, i.e. during its lifespan an M-task is executed by the same set of
physical cores.
The multi-core target architecture is represented in a tree structure with cores C as leaves, processors
P as intermediate nodes that combine cores, computing nodes N as intermediate nodes that combine
processors, and the entire architecture A as a root node. The levels of the tree correspond to different
interconnection networks. For a unique identification of the physical cores within the tree structure we
use the Dewey notation (Knuth, 1975). Each node n gets a label l(n) that describes the path from the
root node to the specific node. The root node r gets label l(r) = 0. The label l(n) of a node n consists of
the label of the parent node m concatenated with the digit i, if n is child i of m, i.e. l(n) = l(m).i. Figure 5 illustrates the tree structure of the architecture and the use of the Dewey notation to describe multi-core clusters.
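The relation between a linear core index and its Dewey label can be stated in a few lines of code; the helper below is an illustration (not part of any of the discussed frameworks) and assumes the node-major ordering of the physical cores that is also used for the consecutive mapping described next.

/* Compute the Dewey label node.proc.core (1-based, as in Figure 5) of the
   k-th physical core, k = 0, ..., n*p*c - 1, for a machine with p processors
   per node and c cores per processor, in node-major order. */
void dewey_label(int k, int p, int c, int *node, int *proc, int *core) {
    *node = k / (p * c) + 1;
    *proc = (k % (p * c)) / c + 1;
    *core = k % c + 1;
}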
For the definitions of the mappings, we consider the situation that g independent M-tasks should be executed concurrently and the scheduling step has assigned the group of symbolic cores Gi to M-task Mi, i = 1, ..., g, with Gi ∩ Gj = ∅ for i ≠ j. The number of symbolic cores in group Gi is denoted as gi and has been determined in the scheduling step. The mapping is a function F : {G1, ..., Gg} → 2^C where C denotes the set of physical cores. F maps the groups of symbolic cores to disjoint physical cores, i.e. F(Gi) ∩ F(Gj) = ∅ for i ≠ j, and each symbolic group is mapped on a physical group with the same size, i.e. |F(Gi)| = |Gi|. For each proposed mapping we define a sequence of physical cores

s1, s2, ..., sm with m = p*c*n

assuming an architecture with c cores per processor, p processors per node and n total nodes. The mapping function F assigns the symbolic cores of a group Gi, i = 1, ..., g, to consecutive physical cores in

263

Mixed Programming Models Using Parallel Tasks

Figure 6. Illustration of a consecutive mapping (left) and a scattered mapping (right) for M-tasks {M1,
M2, M3, M4} each requiring 4 symbolic cores on a platform with 4 nodes consisting of 2 dual-core processors.

this sequence, i.e.

i -1

F (Gi ) = s j , s j +1,..., s j +gi -1 j = 1 + gk

k =1
.

The Node-Oriented Consecutive Mapping tries to map the symbolic cores of an M-task onto the
same cluster node. If an M-task does not fit on a single node of the architecture, multiple nodes are
used. Figure 6 (left) shows an illustration of the consecutive mapping. The advantage of this mapping
strategy is to enable shared memory optimizations for the implementation of the M-tasks, e.g. to speed
up the internal communication by using optimized MPI libraries or to use a shared memory or hybrid
programming model. In this mapping, the physical cores are ordered such that cores of the same node
are adjacent, i.e. the sequence of physical cores is,
1.1.1, 1.1.2, ..., 1.1.c, 1.2.1, ..., 1.p.c, 2.1.1, ..., 2.1.c, ..., n.p.c.
The Scattered Core-Level Mapping assigns corresponding symbolic cores of different M-tasks onto
the same cluster node. If the number of cores of the architecture exceeds the number of independent
M-tasks, multiple symbolic cores of each M-task are mapped on the same node. The scattered mapping
is illustrated in Figure 6 (right). This mapping strategy ensures an equal participation of all nodes in the
internal communication of the M-tasks and can speed up data exchanges between M-tasks, especially in
the case that only corresponding symbolic cores communicate with each other. In the sequence of physical cores the corresponding cores of neighboring nodes are adjacent, i.e. the sequence is given by
1.1.1, 2.1.1, ..., n.1.1, 1.1.2, 2.1.2, ..., n.1.c, 1.2.1, ..., n.p.c.
The Mixed Core-Level Mapping is a generalization of the consecutive and the scattered mappings.
A parameter d, 1 ≤ d ≤ p*c, describes the number of consecutive symbolic cores of an M-task that are
mapped to the same cluster node. For d = 1 the scattered mapping results and setting d = p*c results in
the consecutive mapping. This mapping can be used to adapt to the ratio of communication within M-tasks and data exchanges between M-tasks. The sequence of physical cores is given by

1.1.1, ..., 1.(1 + ⌊(d-1)/c⌋).(1 + ((d-1) mod c)), 2.1.1, ..., n.(1 + ⌊(d-1)/c⌋).(1 + ((d-1) mod c)), ..., 1.(1 + ⌊(2d-1)/c⌋).(1 + ((2d-1) mod c)), ..., n.p.c.

A suitable compiler tool can be used to integrate the mapping strategies in the code generation
process. A realization using the MPI library can adapt the order of the processes within the appropriate
communicators.
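As a sketch of such a realization (illustrative only; it assumes that rank r of MPI_COMM_WORLD already runs on the r-th physical core in node-major order), the scattered order can be obtained by re-sorting the ranks with MPI_Comm_split:

#include <mpi.h>

/* Derive a communicator whose rank order corresponds to the scattered
   mapping for n nodes, p processors per node and c cores per processor. */
MPI_Comm scattered_order(int n, int p, int c) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int node = rank / (p * c);      /* 0-based node index                    */
    int slot = rank % (p * c);      /* position of the core within its node  */
    int key  = slot * n + node;     /* position in the scattered sequence    */
    MPI_Comm reordered;
    MPI_Comm_split(MPI_COMM_WORLD, 0, key, &reordered);
    return reordered;
}

Assigning consecutive blocks of ranks of the reordered communicator to the independent M-tasks then realizes the scattered placement; analogous keys can be computed for the consecutive and the mixed mapping.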

EXAMPLE AND EXPERIMENTS


As example applications for an M-task based execution model we consider numerical codes. The first
examples are solution methods for non-stiff ordinary differential equations (ODEs). We consider iterated Runge-Kutta (IRK) methods (van der Houwen & Sommeijer, 1991; Rauber & Rünger, 1999a) and
Parallel Adams methods (van der Houwen & Messina, 1999) that have been developed for a parallel
execution. The IRK method computes s stage vectors using m fixed point iteration steps with an implicit Runge-Kutta corrector. In the M-task programming model, each fixed point iteration step can be
represented by s independent M-tasks but the M-tasks of successive steps cannot be combined due to
data dependencies. The extended CM-task programming model provides another possibility to structure
IRK methods. The computation of each stage vector is accomplished by a single CM-task that computes all fixed point iteration steps and performs the required data exchanges during its execution. The
Parallel Adams methods include the explicit Parallel Adams-Bashforth (PAB) and the implicit Parallel
Adams-Moulton (PAM) methods. The combination of the PAB and PAM methods in a predictor-corrector
scheme results in an implicit ODE solver (PABM). Each time step of the PABM method involves the
computation of K independent stage vectors each requiring group-based communication. At the end of
a time step, a global data exchange is required to compute the next approximation vector. The M-task
version of PABM uses K independent M-tasks each computing one of the stage vectors, but each time
step requires a separate set of M-tasks. The extended CM-task model enables the adoption of a single
set of CM-tasks that keeps running over all time steps.
The second example comes from the area of solution methods for partial differential equations (PDEs)
that operate on a set of meshes (also called zones). Within each time step, the computation of the solution is performed independently for each zone. At the end of a time step, a border exchange between
overlapping zones is required. The NAS parallel benchmark multi-zone version (NPB-MZ) provides
solvers for discretized versions of the unsteady, compressible Navier-Stokes equations that operate on
multiple zones (van der Wijngaart & Jin, 2003). The fine grain parallelism within the zones is exploited
using shared memory OpenMP programming; the coarse grain parallelism between the zones is realized using message passing with MPI. For the purpose of this article we consider a modified version of
the Lower-Upper Symmetric Gauss-Seidel multi-zone (LU-MZ) benchmark which uses MPI for both
levels of parallelism. This has the advantage that multiple nodes of a distributed memory platform can

265

Mixed Programming Models Using Parallel Tasks

Figure 7. Comparison of the execution times of a single time step of the IRK method using different
execution schemes on the Xeon cluster (left) and on the CHiC cluster (right)

operate on the same zone.


For the benchmark test, a variety of multi-core platforms is used. The benchmarks on the SGI Altix
are executed within a partition consisting of 128 nodes each equipped with two dual-core Intel Itanium
Montecito processors running at 1.6 GHz. The interconnection network is a high-speed NUMAlink 4
that offers a bidirectional bandwidth of 6.4 GByte/s per link. The Intel quad-core Xeon cluster consists
of two nodes each equipped with two Intel Xeon E5345 Clovertown quad-core processors clocked at
2.33 GHz. An InfiniBand network with 10 GBit/s connects the nodes. The CHiC cluster includes 530
nodes consisting of two AMD Opteron 2218 dual-core processors with a clock rate of 2.6 GHz connected
by a 10 GBit/s InfiniBand network.

Figure 8. Speedups of different execution schemes for the IRK method on the SGI Altix (left) and for the PABM method on the CHiC cluster (right)

Figure 9. Comparison of the execution times of a single time step of the PABM method using different mapping strategies on the SGI Altix (left) and on the quad-core Xeon cluster (right)
First, we compare a standard data parallel version with task parallel versions based on the M-task
and CM-task models. The CM-task version of the IRK and PABM methods utilizes an optimized communication scheme based on an orthogonal arrangement of the processes (cf. Figure 1). The execution
times of a single time step of the IRK method using the RadauIIA7 method with s = 4 stage vectors are
compared for the sparse Brusselator system on 16 processor cores of the Xeon cluster in Figure 7 (left).
The M-task version is not competitive due to its large communication overhead. This overhead can be reduced significantly by the CM-task version, leading to much lower runtimes.

Figure 10. Performance of the LU-MZ benchmark for problem classes C and D on CHiC (left) and SGI Altix (right)

Figure 7 (right) shows
the execution times of the IRK method using 960 processor cores of the CHiC cluster and the dense
Schrödinger equation. Communication is less important for dense systems and therefore the differences between the program versions are smaller. Again, the lowest execution times are achieved by the CM-task program version. Figure 8 compares the achieved speedups for the IRK method on the SGI Altix
(left) and for the PABM method with K = 8 stage vectors on the CHiC cluster (right). In both cases the
CM-task version achieves a superior performance compared to M-task based task parallelism and pure
data parallelism.
Figure 9 shows the execution times of a single time step of the PABM method on the SGI Altix using
256 processor cores (left) and on the quad-core Xeon cluster using 16 processor cores (right). On both
systems, task parallelism leads to better execution times because the communication within M-tasks can
be restricted to subgroups of cores. The best mapping strategy is the scattered mapping because the data
exchanges between M-tasks at the end of each time step can be executed within a cluster node.
The performance of the LU-MZ benchmark is depicted in Figure 10 for the CHiC cluster (left) and
for the SGI Altix (right). Problem classes C with a global mesh size of 480 × 320 × 28 and D with a global mesh size of 1632 × 1216 × 34 are used. The data parallel version of class C can only be
executed for up to 448 cores because a minimum amount of data is required for each process. For a low
number of cores, pure data parallelism leads to better results because a data exchange between zones
is not required. But on a high number of cores the communication within the zones becomes more
important because the amount of data and, thus, the amount of computation assigned to each process
becomes smaller. The node consecutive mapping leads to the best performance on both platforms. For
class D the computation to communication ratio is much higher leading to smaller differences between
the program versions.
The IRK, PABM and LU-MZ benchmarks show that a mixed task and data parallel execution scheme can outperform pure data parallelism on a variety of platforms. Additional optimizations of the communication pattern, as they are possible with the extended CM-task model, lead to a further increase of the
performance. Additionally, it has been shown that different mapping strategies can lead to significant
differences in the performance on multi-core clusters. The best mapping strategy mainly depends on
the communication requirements of the application, but also the communication performance of the
interconnection networks of the architecture needs to be taken into account.

CONCLUSION
Mixed task and data parallel execution schemes are a flexible method to exploit the computing power of up-to-date distributed memory platforms. Program development in these mixed programming models is more complex and error-prone compared to pure task or data parallel models, because the organization of the processor groups and the execution of data re-distribution operations additionally have to be taken into account. Moreover, the optimal assignment of the tasks of an application to processors may depend on the target platform, and therefore a complex restructuring of the application might be required when porting it to another platform. Therefore, a variety of programming support is available to assist the application developer. In this chapter, we have discussed several of these approaches.
In particular, we have considered the runtime library TLib and the coordination model TwoL. TLib
supports the structuring of application programs using hierarchically organized multiprocessor tasks.


The library provides an easy-to-use interface and relieves the programmer of the processor group management. The TwoL model includes a specification language for the definition of hierarchical
multiprocessor task programs. Several transformation steps are available to transform a specification
of a parallel algorithm into an executable message passing program. The transformation is guided by
an underlying cost model.
Additionally, we have discussed the model of communicating multiprocessor tasks, which is a natural extension of existing models and supports communication between running tasks. The advantage of this model is that it enables special communication patterns like orthogonal communication. The benefits of this programming model were demonstrated using solution methods for ordinary differential equations. Programming support for communicating multiprocessor tasks has been presented in the form of a transformation-based compiler framework.
Finally, we have presented several mapping strategies to adapt multiprocessor task applications to
the hierarchical structure of recent multi-core SMP clusters. The proposed mapping strategies have been
applied to example codes from numerical analysis. It was shown that the optimal mapping strategy
depends on the ratio of communication within multiprocessor tasks and between tasks.

REFERENCES
Aldinucci, M., Danelutto, M., & Teti, P. (2003). An advanced environment supporting structured parallel programming in Java. Future Generation Computer Systems, 19(5), 611–626. doi:10.1016/S0167-739X(02)00172-3
Allen, E., Chase, D., Hallett, J., Luchangco, V., Maessen, J.-W., Ryu, S., et al. (2008). The Fortress language specification, Version 1.0. Santa Clara, CA: Sun Microsystems, Inc.
Bal, H. E., & Haines, M. (1998). Approaches for integrating task and data parallelism. IEEE Concurrency, 6(3), 74–84. doi:10.1109/4434.708258
Ben Hassen, S., Bal, H. E., & Jacobs, C. J. H. (1998). A task- and data-parallel programming language based on shared objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(6), 1131–1170. doi:10.1145/295656.295658
Brandes, T. (1999). Exploiting advanced task parallelism in high performance Fortran via a task library. In Euro-Par '99: Proceedings of the 5th International Euro-Par Conference on Parallel Processing (pp. 833–844). London: Springer-Verlag.
Chamberlain, B. L., Callahan, D., & Zima, H. P. (2007). Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3), 291–312. doi:10.1177/1094342007078442
Chandy, M., Foster, I., Kennedy, K., Koelbel, C., & Tseng, C.-W. (1994). Integrated support for task and data parallelism. The International Journal of Supercomputer Applications, 8(2), 80–98.
Chapman, B., Haines, M., Mehrotra, P., Zima, H., & van Rosendale, J. (1997). Opus: A coordination language for multidisciplinary applications. Scientific Programming, 6(4), 345–362.


Chapman, B. M., Mehrotra, P., van Rosendale, J., & Zima, H. P. (1994). A software architecture for multidisciplinary applications: Integrating task and data parallelism. In CONPAR 94 - VAPP VI: Proceedings of the Third Joint International Conference on Vector and Parallel Processing (pp. 664–676). London: Springer-Verlag.
Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA '05: Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (pp. 519–538). New York: ACM.
Ciarpaglini, S., Folchi, L., Orlando, S., Pelagatti, S., & Perego, R. (2000). Integrating task and data parallelism with taskHPF. In H. R. Arabnia (Ed.), Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2000. Las Vegas, NV: CSREA Press.
Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2002). A border-based coordination language for integrating task and data parallelism. Journal of Parallel and Distributed Computing, 62(4), 715–740. doi:10.1006/jpdc.2001.1814
Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2003). Domain interaction patterns to coordinate HPF tasks. Parallel Computing, 29(7), 925–951. doi:10.1016/S0167-8191(03)00064-4
Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2004). SBASCO: Skeleton-based scientific components. In Proceedings of the 12th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP 2004) (pp. 318–325). Washington, DC: IEEE Computer Society.
Dorta, A. J., González, J. A., Rodriguez, C., & de Sande, F. (2003). LLC: A parallel skeletal language. Parallel Processing Letters, 13(3), 437–448. doi:10.1142/S0129626403001409
Dorta, A. J., López, P., & de Sande, F. (2006). Basic skeletons in LLC. Parallel Computing, 32(7-8), 491–506. doi:10.1016/j.parco.2006.07.001
Dümmler, J., Kunis, R., & Rünger, G. (2007a). A scheduling toolkit for multiprocessor-task programming with dependencies. In Proceedings of the 13th International Euro-Par Conference (pp. 23–32). Berlin: Springer.
Dümmler, J., Kunis, R., & Rünger, G. (2007b). A comparison of scheduling algorithms for multiprocessor-tasks with precedence constraints. In Proceedings of the 2007 High Performance Computing & Simulation (HPCS'07) Conference (pp. 663–669). ECMS.
Dümmler, J., Rauber, T., & Rünger, G. (2007). Communicating multiprocessor-tasks. In Proceedings of the 20th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2007). Berlin: Springer.
Dümmler, J., Rauber, T., & Rünger, G. (2008a). A transformation framework for communicating multiprocessor-tasks. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2008) (pp. 64–71). New York: IEEE Computer Society.


Dümmler, J., Rauber, T., & Rünger, G. (2008b). Mapping algorithms for multiprocessor tasks on multi-core clusters. In Proceedings of the 37th International Conference on Parallel Processing (ICPP'08). New York: IEEE Computer Society.
Fink, S. J. (1998). A programming model for block-structured scientific calculations on SMP clusters. Doctoral thesis, University of California, San Diego, CA.
Foster, I., Kohr, D. R., Krishnaiyer, R., & Choudhary, A. (1996). Double standards: Bringing task parallelism to HPF via the message passing interface. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (p. 36). New York: IEEE Computer Society.
Foster, I. T., & Chandy, K. M. (1995). Fortran M: A language for modular parallel programming. Journal of Parallel and Distributed Computing, 26(1), 24–35. doi:10.1006/jpdc.1995.1044
Fox, G., Hiranandani, S., Kennedy, K., Koelbel, C., Kremer, U., Tseng, C.-W., et al. (1990). Fortran D language specification (No. CRPC-TR90079). Houston, TX.
Grelck, C., Scholz, S.-B., & Shafarenko, A. V. (2007). Coordinating data parallel SAC programs with S-Net. In Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007) (pp. 1–8). New York: IEEE.
High Performance Fortran Forum. (1993). High Performance Fortran language specification, version 1.0 (No. CRPC-TR92225). Center for Research on Parallel Computation, Rice University, Houston, TX.
High Performance Fortran Forum. (1997). High Performance Fortran language specification 2.0. Center for Research on Parallel Computation, Rice University, Houston, TX.
Hunold, S., Rauber, T., & Rünger, G. (2004). Multilevel hierarchical matrix-matrix multiplication on clusters. In Proceedings of the 18th International Conference on Supercomputing (ICS'04) (pp. 136–145). New York: ACM.
Hunold, S., Rauber, T., & Rünger, G. (2008). Combining building blocks for parallel multi-level matrix multiplication. Parallel Computing, 34(6-8), 411–426. doi:10.1016/j.parco.2008.03.003
Joisha, P. G., & Banerjee, P. (1999). PARADIGM (version 2.0): A new HPF compilation system. In IPPS '99/SPDP '99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing (pp. 609–615). Washington, DC: IEEE Computer Society.
Kessler, C. W., & Löwe, W. (2007). A framework for performance-aware composition of explicitly parallel components. In Proceedings of the International Conference ParCo 2007 (pp. 227–234). Jülich/Aachen, Germany: IOS Press.
Knuth, D. E. (1975). The art of computer programming. Volume 1: Fundamental algorithms. Reading, MA: Addison Wesley.
Kühnemann, M., Rauber, T., & Rünger, G. (2004). A source code analyzer for performance prediction. In Proceedings of the IPDPS'04 Workshop on Massively Parallel Processing (WMPP'04). New York: IEEE.


Laure, E. (2001). OpusJava: A Java framework for distributed high performance computing. Future Generation Computer Systems, 18(2), 235–251. doi:10.1016/S0167-739X(00)00094-7
Laure, E., Mehrotra, P., & Zima, H. P. (1999). Opus: Heterogeneous computing with data parallel tasks. Parallel Processing Letters, 9(2). doi:10.1142/S0129626499000256
Merlin, J. H., Baden, S. B., Fink, S., & Chapman, B. M. (1999). Multiple data parallelism with HPF and KeLP. Future Generation Computer Systems, 15(3), 393–405. doi:10.1016/S0167-739X(98)00083-1
N'takpé, T., & Suter, F. (2006). Critical path and area based scheduling of parallel task graphs on heterogeneous platforms. In Proceedings of the Twelfth International Conference on Parallel and Distributed Systems (ICPADS) (pp. 3–10). Minneapolis, MN.
N'takpé, T., Suter, F., & Casanova, H. (2007). A comparison of scheduling approaches for mixed-parallel applications on heterogeneous platforms. In 6th International Symposium on Parallel and Distributed Computing (pp. 35–42). Hagenberg, Austria: IEEE Computer Press.
Orlando, S., Palmerini, P., & Perego, R. (2000). Coordinating HPF programs to mix task and data parallelism. In Proceedings of the 2000 ACM Symposium on Applied Computing (SAC'00) (pp. 240–247). New York: ACM Press.
Orlando, S., & Perego, R. (1999). COLTHPF, a run-time support for the high-level co-ordination of HPF tasks. Concurrency (Chichester, England), 11(8), 407–434. doi:10.1002/(SICI)1096-9128(199907)11:8<407::AID-CPE435>3.0.CO;2-0
Pelagatti, S. (2003). Task and data parallelism in P3L. In F. A. Rabhi & S. Gorlatch (Eds.), Patterns and skeletons for parallel and distributed computing (pp. 155–186). London: Springer-Verlag.
Pelagatti, S., & Skillicorn, D. B. (2001). Coordinating programs in the network of tasks model. Journal of Systems Integration, 10(2), 107–126. doi:10.1023/A:1011228808844
Radulescu, A., Nicolescu, C., van Gemund, A. J. C., & Jonker, P. (2001). CPR: Mixed task and data parallel scheduling for distributed systems. In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS'01) (pp. 39–46). New York: IEEE Computer Society.
Radulescu, A., & van Gemund, A. J. C. (2001). A low-cost approach towards mixed task and data parallel scheduling. In Proceedings of the International Conference on Parallel Processing (ICPP'01) (pp. 69–76). New York: IEEE Computer Society.
Ramaswamy, S. (1996). Simultaneous exploitation of task and data parallelism in regular scientific computations. Doctoral thesis, University of Illinois at Urbana-Champaign.
Ramaswamy, S., Sapatnekar, S., & Banerjee, P. (1997). A framework for exploiting task and data parallelism on distributed memory multicomputers. IEEE Transactions on Parallel and Distributed Systems, 8(11), 1098–1116. doi:10.1109/71.642945
Ramaswamy, S., Simons, B., & Banerjee, P. (1996). Optimizations for efficient array redistribution on distributed memory multicomputers. Journal of Parallel and Distributed Computing, 38(2), 217–228. doi:10.1006/jpdc.1996.0142


Rauber, T., Reilein-Ruß, R., & Rünger, G. (2004a). Group-SPMD programming with orthogonal processor groups. Concurrency and Computation: Practice and Experience, Special Issue on Compilers for Parallel Computers, 16(2-3), 173–195.
Rauber, T., Reilein-Ruß, R., & Rünger, G. (2004b). On compiler support for mixed task and data parallelism. In G. R. Joubert, W. E. Nagel, F. J. Peter, & W. V. Walter (Eds.), Parallel Computing: Software Technology, Algorithms, Architectures & Applications. Proceedings of the 12th International Conference on Parallel Computing (ParCo'03) (pp. 23–30). New York: Elsevier.
Rauber, T., & Rünger, G. (1996). The compiler TwoL for the design of parallel implementations. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques (PACT'96) (pp. 292–301). Washington, DC: IEEE Computer Society.
Rauber, T., & Rünger, G. (1999). Compiler support for task scheduling in hierarchical execution models. Journal of Systems Architecture, 45(6-7), 483–503. doi:10.1016/S1383-7621(98)00019-8
Rauber, T., & Rünger, G. (1999a). Parallel execution of embedded and iterated Runge-Kutta methods. Concurrency (Chichester, England), 11(7), 367–385. doi:10.1002/(SICI)1096-9128(199906)11:7<367::AID-CPE430>3.0.CO;2-G
Rauber, T., & Rünger, G. (1999b). Scheduling of data parallel modules for scientific computing. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing (PPSC), SIAM (CD-ROM), San Antonio, TX.
Rauber, T., & Rünger, G. (2000). A transformation approach to derive efficient parallel implementations. IEEE Transactions on Software Engineering, 26(4), 315–339. doi:10.1109/32.844492
Rauber, T., & Rünger, G. (2005). TLib - a library to support programming with hierarchical multiprocessor tasks. Journal of Parallel and Distributed Computing, 65(3), 347–360.
Rauber, T., & Rünger, G. (2006). A data re-distribution library for multi-processor task programming. International Journal of Foundations of Computer Science, 17(2), 251–270. doi:10.1142/S0129054106003814
Rauber, T., & Rünger, G. (2007). Mixed task and data parallel executions in general linear methods. Scientific Programming, 15(3), 137–155.
Rauber, T., Rünger, G., & Wilhelm, R. (1995). Deriving optimal data distributions for group parallel numerical algorithms. In Proceedings of the Conference on Programming Models for Massively Parallel Computers (PMMP'94) (pp. 33–41). Washington, DC: IEEE Computer Society.
Reilein-Ruß, R. (2005). Eine komponentenbasierte Realisierung der TwoL Spracharchitektur. PhD thesis, TU Chemnitz, Fakultät für Informatik, Chemnitz, Germany.
Sips, H. J., & van Reeuwijk, C. (2004). An integrated annotation and compilation framework for task and data parallel programming in Java. In Parallel Computing (PARCO): Software Technology, Algorithms, Architectures and Applications (pp. 111–118). New York: Elsevier.
Skillicorn, D. B. (1999). The network of tasks model (TR1999-427). Queen's University, Kingston, Canada.


Subhlok, J., & Vondran, G. (1995). Optimal mapping of sequences of data parallel tasks. ACM SIGPLAN Notices, 30(8), 134–143. doi:10.1145/209937.209951
Subhlok, J., & Yang, B. (1997). A new model for integrated nested task and data parallel programming. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 1–12). New York: ACM Press.
Suter, F., Desprez, F., & Casanova, H. (2004). From heterogeneous task scheduling to heterogeneous mixed parallel scheduling. In Proceedings of the 10th International Euro-Par Conference (Euro-Par'04) (LNCS: Vol. 3149, pp. 230–237). Pisa, Italy: Springer.
van der Houwen, P. J., & Messina, E. (1999). Parallel Adams methods. Journal of Computational and Applied Mathematics, 101(1-2), 153–165. doi:10.1016/S0377-0427(98)00214-3
van der Houwen, P. J., & Sommeijer, B. P. (1991). Iterated Runge-Kutta methods on parallel computers. SIAM Journal on Scientific and Statistical Computing, 12(5), 1000–1028. doi:10.1137/0912054
van der Wijngaart, R. F., & Jin, H. (2003). The NAS parallel benchmarks, multi-zone versions (No. NAS-03-010). NASA Ames Research Center, Moffett Field, CA.
van Reeuwijk, C., Kuijlman, F., & Sips, H. J. (2003). Spar: A set of extensions to Java for scientific computation. Concurrency and Computation, 15, 277–299. doi:10.1002/cpe.659
Vanneschi, M. (2002). The programming model of ASSIST, an environment for parallel and distributed portable applications. Parallel Computing, 28(12), 1709–1732. doi:10.1016/S0167-8191(02)00188-6
Vanneschi, M., & Veraldi, L. (2007). Dynamicity in distributed applications: Issues, problems and the ASSIST approach. Parallel Computing, 33(12), 822–845. doi:10.1016/j.parco.2007.08.001
Vydyanathan, N., Krishnamoorthy, S., Sabin, G., Çatalyürek, Ü. V., Kurç, T. M., Sadayappan, P., et al. (2006a). An integrated approach for processor allocation and scheduling of mixed-parallel applications. In Proceedings of the 2006 International Conference on Parallel Processing (ICPP'06) (pp. 443–450). New York: IEEE.
Vydyanathan, N., Krishnamoorthy, S., Sabin, G., Çatalyürek, Ü. V., Kurç, T. M., Sadayappan, P., et al. (2006b). Locality conscious processor allocation and scheduling for mixed parallel applications. In Proceedings of the 2006 IEEE International Conference on Cluster Computing, September 25-28, 2006, Barcelona, Spain. New York: IEEE.
West, E. A., & Grimshaw, A. S. (1995). Braid: Integrating task and data parallelism. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95) (p. 211). New York: IEEE Computer Society.

KEY TERMS AND DEFINITIONS


CM-Task: A CM-task is an extension of an M-task that additionally supports data exchanges with
other CM-tasks during its execution.


Data Parallelism: Data parallel computations apply the same operation in parallel on different elements of the same set of data.
M-Task: An M-task is a parallel program fragment that operates on a set of input parameters and
produces a set of output parameters. The implementation of an M-task supports an execution on an
arbitrary number of processors.
Mapping: Mapping assigns specific physical processing units, e.g., specific cores of a multi-core
SMP cluster, to tasks of an application.
Mixed Parallelism: Mixed parallelism is a combination of task and data parallelism that supports
the concurrent execution of independent data parallel tasks each operating on a different set of data.
Scheduling: Scheduling defines an execution order of the tasks of an application and an assignment
of tasks to processing units. Scheduling for mixed parallel applications additionally has to fix the number
of executing processors for each data parallel task.
Task Parallelism: Task parallel computations consist of a set of different tasks that operate independently on different sets of data.


Chapter 12

Programmability and Scalability on Multi-Core Architectures
Jaeyoung Yi
Yonsei University, Seoul, Korea
Yong J. Jang
Yonsei University, Seoul, Korea
Doohwan Oh
Yonsei University, Seoul, Korea
Won W. Ro
Yonsei University, Seoul, Korea

ABSTRACT
In this chapter, we describe today's technological trends in building multi-core based microprocessors and the related programmability and scalability issues. Ever since multi-core processors were commercialized, we have seen many different multi-core processors. However, the issues related to how to utilize the physical parallelism of the cores for software execution have not been suitably addressed so far. Compared to implementing multiple identical cores on a single chip, separating an originally sequential program into multiple running threads has been an even more challenging task. In this chapter, we introduce several software programs which can be successfully ported to future multi-core based processors and describe how they could benefit from multi-core systems. Towards the end, future trends in multi-core systems are surveyed.

INTRODUCTION
Intel shipped its first dual-core processor as early as 2005, and many major processor vendors have developed dual-core or quad-core processors since then. We are now entering the new era of multi-core processors, and practically every field in computer science or computer engineering will be affected by this strong movement. Though computing power has improved dramatically with higher clock frequencies and techniques such as superscalar execution, superpipelining, and VLIW (Very Long Instruction Word), it seems that this progress will see a significant slowdown and we will have to come up with
other solutions to maintain the speed of improvement we are enjoying now.
There are three main reasons for the slowdown in single-core performance improvement. First of all, we cannot keep increasing the clock frequency along with Moore's Law because of power dissipation concerns and thermal problems. As millions of transistors are integrated onto one chip and the clock speed goes up, the heat becomes too much to handle with current affordable cooling solutions. Secondly, the latency of processor-memory requests becomes a limiting factor, caused by the gap in speed advancements between the processor and the memory. Indeed, this becomes a major bottleneck for overall computing performance. Lastly, it is known that ILP (instruction-level parallelism) from a single thread has almost reached its limit with current microprocessor architectures and compiler techniques. Therefore, we can see that the next path to take is the multi-core approach; instead of trying to improve the performance of single-thread execution, we should partition applications into multiple threads that can run in parallel on prevailing multi-core systems.
In this chapter, we are going to look at recent research on multi-core architectures and how to fully utilize multi-core systems. In the next section, we look at hardware designs and characteristics of multi-core architectures. In Section 3, we explain software programming techniques to exploit parallelism in two specific applications: network coding and intrusion detection systems (IDS). In Section 4, we touch on the programmability and scalability issues of using multi-core systems for graphics applications, and in Section 5, we conclude the chapter with a forecast of future multi-core systems.

BACKGROUND STUDY: HARDWARE DESIGNS OF MULTI-CORE ARCHITECTURE

In this section, the general hardware architecture of current multi-core processors is surveyed as background. We first describe the basic methods used to build multi-core processors and then describe the memory hierarchy design issues for multi-core systems.

Multi-Core Processor Architecture: Homogeneous or Heterogeneous


The simplest way to design a multi-core processor is to arrange multiple processing units (so-called cores) on a single chip. In theory, this might be a good way to boost performance by providing parallelism through more processors. However, in reality, multiple cores do not always guarantee a performance improvement for software execution (Hill & Marty, 2008). In fact, they can even cause the opposite effect due to communication restrictions and memory sharing problems. Therefore, hardware designers of multi-core based processors have to investigate various multi-core structures and, at the same time, find effective data sharing schemes to boost the performance and efficiency of processing.
Depending on the specific design, the internal architecture of the cores can vary; in particular, the cores can be either heterogeneous or homogeneous. In Figure 1, we show two ways to build a multi-core processor. The two diagrams on the left show multi-core processors from Intel and AMD, which can be classified as homogeneous multi-core processors. As the name implies, these models integrate identical cores onto a single chip. Yorkfield from Intel is a quad-core CPU that integrates two dual-core dies into a single package. Phenom from AMD consists of four cores that have identical architectures and a shared L3 cache on a single chip.


Figure 1. Two ways to build a multi-core processor

On the other hand, the diagram on the right side shows the Cell Broadband Engine, a heterogeneous processor. The processor was originally developed as a joint project of Sony, IBM, and Toshiba. It contains one general-purpose core and eight synergistic processor elements performing data-intensive processing (Gschwind, Hofstee, Flachs, Hopkins, Watanabe & Yamazaki, 2006).
Today, most manufacturers have followed the homogeneous approach for a simple reason: they prefer to reuse the previously developed design of a single-core processor. In fact, this can provide a balance of computing ability between high throughput and good single-thread performance. In addition, the simple reuse of a microarchitecture provides good portability and scalability for previously developed applications and legacy code. It is an easy way to build multi-core architectures, as we only need to add multiple copies of the core to a single-chip processor.
In spite of these advantages of the homogeneous structure, most processor architects expect that heterogeneous multi-core processors such as the Cell processor will perform more powerfully and effectively than homogeneous ones (Pericas, Cristal, Cazorla, Gonzalez, Jimenez & Valero, 2007; Kumar, Tullsen & Jouppi, 2006). The reason is that a heterogeneous design can be tailored to match each application to the core best suited for it, which is a better approach to achieving high performance. This approach also allows us to design architectures with more efficient power consumption in a smaller chip die area (Kowaliski, 2008). Another reason is that, with the development of digital technology, the amount of data to be processed is growing much more sharply than the number of instructions. For processing such large amounts of data, it is needless to say that a heterogeneous design is much more suitable than a homogeneous one.
One example of a heterogeneous multi-core architecture is implemented with Graphics Processing Units. The expectation of processing large amounts of data has led the GPU (Graphics Processing Unit) to be developed into a general-purpose processing unit named the GP-GPU (general purpose GPU). Most companies that make processing units such as GPUs, DSPs, and CPUs also expect the GP-GPU to make a strong transition into the CPU market (Kowaliski, 2008).


Figure 2. Single-core and various multi-core processors

Examples of this trend can be seen at the major processor companies: AMD has acquired ATI, a famous graphics processing unit maker; Intel appears to be interested in this trend as well, developing the GPU architecture named Larrabee; and NVIDIA, one of the most famous GPU makers, is trying to develop a general-purpose processing unit based on its traditional graphics processing units.

Memory Hierarchy Design


Main memory latency is a major source of delay in modern computer systems, whether single-core or multi-core. Indeed, the architectural design of the memory system is very important for the performance of multi-core processors. In particular, the cache has one additional duty in multi-core systems: the coherency protocol. The shared cache in a multi-core processor has to observe changes to the data and notify each core about them. In Figure 2, we show various multi-core processor models with different cache designs.
There is no cache sharing in the single-core processor and the simple dual-core processor. On the other hand, the shared-L2-cache dual-core processors integrate a shared L2 cache; the cache coherence protocol must be designed adequately in order to use the shared L2 cache in a proper way.
One of the most important aspects in designing multi-core processors is the memory hierarchy and the data sharing techniques among different cores. Much like in traditional shared memory architectures, the data between running threads on different cores are passed through the shared memory. To this end, the cache design plays an important role and becomes a major issue in designing multi-core processors.


As a consequence, the performance of today's multi-core systems depends remarkably on the size of the cache, the cache hierarchy, and the shared cache architecture (Schirrmeister, 2007).
On a single-core architecture, performance generally improves as the cache becomes larger. However, the advantages of a large cache become less effective in multi-core processors due to the data sharing operations and the coherence protocol (Kowaliski, 2008). Indeed, the structure of the cache hierarchy and the coherence protocol are considered more important than the cache size in multi-core processors. For that reason, the cache structure of multi-core processors is being studied from various angles; many different structures have been proposed and developed in order to achieve performance improvements.
The design of a multi-core processor is more complicated than the design of a single core, due to data sharing between cores and the grouping of cores. A simple combination of single cores without any consideration of grouping does not perform well; the main reason is that the original single core was not designed with parallelism in mind. To obtain a better multi-core design, we must fully consider the parallel execution of software and the communication between cores. Therefore, special design techniques should be added on top of a simple arrangement of cores. These techniques may concern the basic core architectures or the structure that combines the cores. There are many ways to extract the potential parallelism by providing a better hardware platform; to find the best one, we must design the hardware platform based on the parallel execution of the software.

EXPLOITING SOFTWARE PARALLELISM ON A MULTI-CORE SYSTEM

In this section, we look into two approaches to exploiting software parallelism using a multi-core system. The first application described is network coding. After that, we present the development of an intrusion detection algorithm as a multi-core application.

Network Coding on Multi-Core Systems


Network coding is a method that increases the network transmission rate while also improving reliability and security. It does this by performing coding operations on packets not only at the source nodes but also at intermediate nodes throughout the network topology between the source and the receivers. The idea was first proposed by Ahlswede et al. (Ahlswede, Cai, Li & Yeung, 2000), who showed the usefulness of network coding in multicast networks. It was further researched by others, who showed that simple linear structures could be used for the implementation of network coding (Li, Yeung & Cai, 2003), and, going further, that a random combination of the linear codes could be used in decoding (Ho et al., 2006).
In Figure 3, we show a communication network, a directed graph where the edges represent pathways for information (Li, Yeung & Cai, 2003). At the source S, information is generated and then multicast to other nodes in the network. Here, every node can pass on whatever information it has received. Now, suppose we generate data bits a and b at source S. We want to send the data to both node D and node E. By the max-flow min-cut theorem, we can calculate the maximum flow, that is, the maximum amount of information we can transmit through this network. We cannot achieve this maximum rate by just routing, and that is where network coding comes in.


Figure 3. A communication network for network coding

We first send data a along the edges SA, AC, and AD, and data b along SB, BC, and BE. With the routing scheme, we can only send a copy of either a or b, but not both, from C down the path CZ. Suppose we send data a through CZ. Then node D would receive data a twice, once from A and once from Z, and would not get data b. Sending data b instead would raise the same problem for node E. Therefore, routing is insufficient, as it cannot send both data a and data b to both destination nodes D and E simultaneously. Using network coding, on the other hand, we can encode the data a and b received at node C and send the encoded version down CZ. Say we use bitwise XOR for encoding. Then data a and b are encoded to a xor b. The encoded data is sent along the edges CZ, ZD, and ZE. Node D receives data a and a xor b, so it can decode and extract data b. It is the same for node E, which receives data b and a xor b, extracting data a by decoding.
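As a concrete illustration, the following minimal sketch (our own example, not code from the chapter) mimics the butterfly network above: node C encodes the two bits with a bitwise XOR, and each destination recovers the missing bit with one more XOR.

#include <cassert>
#include <cstdint>

int main() {
    // Source S generates two data bytes a and b.
    std::uint8_t a = 0x5A;
    std::uint8_t b = 0x3C;

    // Node C receives a (via A) and b (via B) and encodes them with bitwise XOR;
    // the coded byte travels along C-Z and is then forwarded to D and E.
    std::uint8_t coded = a ^ b;

    // Node D receives a (via A) and the coded byte (via Z) and recovers b;
    // node E receives b (via B) and the coded byte (via Z) and recovers a.
    std::uint8_t b_at_D = a ^ coded;
    std::uint8_t a_at_E = b ^ coded;

    assert(b_at_D == b && a_at_E == a);   // both destinations obtain both data items
    return 0;
}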
By looking at this example, it is clear that network coding has a huge advantage over simple routing.
Network coding enables us to multicast two bits per unit time from the source down to the destinations,
which you cannot achieve through routing. Now, with this high transmission capacity, another factor in
performance is the encoding/decoding speed. It has to be fast enough not to be a performance bottleneck,
and in todays multi-core environment, fast encoding/decoding can be achieved by exploiting parallelism. In this case, we shall pick the method of linear encoding and random linear decoding mentioned
earlier, and see how we can parallelize it. First, we will take a look at the big picture of encoding and
decoding, then the specific algorithm and parallelization.
Let us assume that an application generates a stream of equal-sized frames. Organize these frames into blocks, each containing a number of consecutive frames. Suppose the frames are numbered; then b(blockID, blockSize) denotes a block which holds frame(blockID) to frame(blockID + blockSize - 1). A coded packet c(blockID, blockSize) is a linear combination of the frames within b(blockID, blockSize), that is,

c(blockID, blockSize) = \sum_{k=1}^{blockSize} e_k \, p_{blockID + k - 1},

where p_k is an application frame and the coefficient e_k is an element of a chosen finite field F. Every arithmetic operation is performed over the field F (Figure 4).

Figure 4. Blocks and coded packets

A blockSize number of application frames is needed to make a coded packet, so a source node waits for enough frames to accumulate before starting the encoding. The encoded packet will be broadcast
to other destination nodes along with the coefficient vector stored in the header. Nodes on the path to the destination nodes will re-encode the coded packets and send them along. When a coded packet reaches a destination node, it is stored in memory. For the destination node to decode the packets into the original data block with blockSize frames, it needs to receive blockSize coded packets with independent coefficient vectors.
If we denote E = [e_1^T ... e_{blockSize}^T]^T, C = [c_1^T ... c_{blockSize}^T]^T, and P = [p_{blockID}^T ... p_{blockID + blockSize - 1}^T]^T, where the superscript T stands for the transpose operation, then the coded packets can be written as C = EP, and we can decode C back into the original block P at the destination nodes with the formula P = E^{-1}C. Note that the matrix E needs to be invertible, so all coefficient vectors e_k must be linearly independent of each other.
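As a tiny worked illustration (our own example, with blockSize = 2 over the field GF(2), where addition is the XOR operation \oplus):

E = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \qquad
C = E P = \begin{pmatrix} p_1 \\ p_1 \oplus p_2 \end{pmatrix}, \qquad
E^{-1} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \qquad
P = E^{-1} C = \begin{pmatrix} c_1 \\ c_1 \oplus c_2 \end{pmatrix}.

Substituting c_1 = p_1 and c_2 = p_1 \oplus p_2 into the last expression indeed returns p_1 and p_2, confirming that the two independent coefficient vectors (1, 0) and (1, 1) suffice to decode the block.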
The pseudocode for the encoding algorithm is given in Figure 5, and the process is represented more visually in Figure 4. It is basically a matrix operation. We can parallelize this operation to run in multiple threads, and depending on how we divide the work, the speedup can range from zero to twofold. For instance, suppose you divide the work within each row * column multiplication operation in Figure 5. The operation calculates a1·b1 + a2·b2 + ... + a8·b8, so you could create a few threads that each perform one multiplication an·bn and add up all the results of the threads at the end. This seems possible when you only think about it at the algorithm level, but the problem is that the threads cannot store their results in memory simultaneously. Race conditions could occur, so you have to add locks around the critical section to avoid collisions. However, these locks mean that the threads cannot run in parallel; each thread has to wait its turn to acquire the lock. Therefore, even if the multiplications can be done on separate cores simultaneously, the memory store operation becomes a bottleneck and serializes the whole process, hindering the speedup that can be achieved in a multi-core environment (Figure 6).


Figure 5. Encoding algorithm

Figure 6. Encoding process

On the other hand, suppose you divide the work by assigning each whole row * column multiplication to a different thread, instead of breaking down each individual operation across threads, as in Figure 7. This way the process is divided into chunks, so storing to memory is not a problem. Inside each chunk, the operations store their temporary results in the same memory cell, but since the work inside each thread is sequential, this is safe. Different chunks store all of their results in different memory cells, avoiding collisions and thus making them suitable to run in parallel. The specific description of the algorithm is as follows.
As shown in Figure 7, we can parallelize the encoding algorithm by using threads to divide the workload. In Figure 6, a single thread performs the vector multiplications e1 · b(1,8), e2 · b(1,8), ..., e8 · b(1,8) to compute the coded packets c1, c2, ..., c8 sequentially, one at a time. In the parallelized version, we split the work into 4 independent parts, each running on a different thread. Thus, if the processor has more than 4 cores, we can get the coded packets c1 and c5 computed on the first core, c2 and c6 on the second, and so on. This is shown in Figure 8. Note that coded packets of the same color are calculated in the same thread. This parallelization means we get a fourfold throughput increase in the encoding process, excluding the time it takes to manage the threads.

Figure 7. Parallelized encoding algorithm

Figure 8. Parallelized encoding process
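To make the row-partitioned scheme concrete, here is a minimal sketch of the parallel encoder (our own illustration, not the chapter's implementation; the name encodeRows and all sizes are hypothetical). For brevity the finite field is GF(2), where multiplication is a bitwise AND and addition is XOR; a real coder would use GF(2^8) lookup tables instead. Each thread produces complete coded packets, so no two threads write to the same output cell and no locks are needed.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

using Frame = std::vector<std::uint8_t>;

// Each thread computes the coded packets r, r + numThreads, r + 2*numThreads, ...
// so, with 4 threads, thread 0 produces c1 and c5, thread 1 produces c2 and c6, etc.
void encodeRows(const std::vector<Frame>& block,                  // blockSize frames
                const std::vector<std::vector<std::uint8_t>>& E,  // coefficient rows
                std::vector<Frame>& coded,                        // output packets
                std::size_t firstRow, std::size_t numThreads) {
    for (std::size_t r = firstRow; r < E.size(); r += numThreads) {
        Frame& c = coded[r];
        c.assign(block[0].size(), 0);
        for (std::size_t k = 0; k < block.size(); ++k)    // c_r = sum_k e_{r,k} * p_k
            if (E[r][k])                                  // GF(2) multiplication
                for (std::size_t i = 0; i < c.size(); ++i)
                    c[i] ^= block[k][i];                  // GF(2) addition
    }
}

int main() {
    const std::size_t blockSize = 8, frameLen = 1024, numThreads = 4;
    std::vector<Frame> block(blockSize, Frame(frameLen, 1));
    std::vector<std::vector<std::uint8_t>> E(blockSize,
                                             std::vector<std::uint8_t>(blockSize, 1));
    std::vector<Frame> coded(blockSize);

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < numThreads; ++t)
        workers.emplace_back(encodeRows, std::cref(block), std::cref(E),
                             std::ref(coded), t, numThreads);
    for (auto& w : workers) w.join();
    return 0;
}

The re-encoding step at intermediate nodes can reuse exactly the same structure, with the received coded packets taking the place of the application frames.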
After the blocks are encoded into coded packets, they are sent to the destination node. At every node on the way to the destination, the packets go through a re-encoding process. The re-encoding process is basically the same as the encoding process, where the coded packets are encoded once more with a randomly selected re-encoding vector. The newly coded packet

c'(blockID, blockSize) = \sum_{k=1}^{blockSize} e'_k c_k

is sent along with the combined coefficient vector e'' = \sum_{k=1}^{blockSize} e'_k e_k. This re-encoding process can therefore be parallelized similarly, with multiple threads doing the calculations of the matrix multiplication.

Figure 9. Re-encoding process

Figure 10. Decoding algorithm in a single-core system
After going through the re-encoding process at every node it passes, a coded packet, along with the corresponding coefficient vector, finally reaches the destination(s). The destination node waits until it has received enough of these coded packets to decode, which is blockSize packets, supposing that all coefficient vectors are independent. The process that follows, after all needed packets are received, is decoding: reconstructing the original data from the coded packets, and this is another place where we can improve performance on a multi-core system. If we arrange the coded blocks as a matrix, we can calculate the inverse matrix and then the original block using the Gauss-Jordan algorithm (Wikipedia, Gauss-Jordan elimination). On a single-core machine without using threads, this process is purely sequential, as in Figure 10. However, in a multi-core environment, we can speed this process up by dividing the work into several threads which run on separate cores in parallel.
The aim of the decoding process is to transform the coefficient matrix into its reduced row echelon form with basic row operations. For each ith vector in the coefficient matrix, the first step is to divide the whole vector by the value of the basis coordinate, say val, to make the basis coordinate 1. The second step is then: for the rows above, subtract (ith vector) * val to make the value in column i equal to 0; for the rows below, divide the vector by val and then subtract the current vector. Each iteration of this sets column i to 0 everywhere except in the ith row. After going through every vector like this, we obtain the identity matrix. Applying the same operations of the reduction process to an identity matrix reveals the inverse matrix, which we can then simply multiply with the coded packets to get the original blocks.
The parallelization here is simple. In the second step of the above algorithm, the row operations are independent of one another, with no race conditions. Thus we can partition the row operations into small groups that run concurrently on separate threads. This part of the algorithm corresponds to lines 5-14 in the pseudocode of Figure 10, and simply creating multiple threads to execute this part will do the job (see Figure 11).
Figure 11. Parallelized decoding algorithm (ith basis operation)

We have looked at the encoding/decoding process used in network coding and have parallelized it into multiple threads. As you can see, there are several problems to think about before letting multiple threads divide up the job, such as race conditions, critical regions, and locks. The hard part is that present compilers do not catch these errors; you are on your own there. Race conditions are especially hard to manage, because once such an error slips in, it can be hard to detect. It is possible that a problem only pops up after years of safe use of the program, which could lead to a critical failure if the code is hidden in medical equipment, spacecraft, and so on. Thus, as with all software, check and double-check the algorithm, and perform sufficient testing before declaring it safe to use.

Implementing Intrusion Detection System on Multi-Core Architecture


Ever since the Internet service was introduced, it has been a major way for people to communicate with each other and collect useful information from around the world. Although this provides a lot of convenience in everyday life, it also carries some serious threats in that it may expose personal privacy. Hence, the necessity of Internet security has been emphasized in order to protect personal information. Moreover, ubiquitous computing will be actively developed and widely adopted in the near future; this trend will require more advanced Internet services as a backbone platform to implement a successful ubiquitous computing environment. In this section, we discuss parallel intrusion detection algorithms designed for multi-core systems.

Overview of IDS (Intrusion Detection System)


Among the network security products, the Intrusion Detection System (IDS) is a leading network security solution in the market. IDS systems can be divided into two groups: host-based IDS and network-based IDS. Since we are interested in the parallel implementation of IDS applications, the network-based IDS is the main target of our research.
One of the main advantages of the network-based IDS is that it can support a large-scale network. In addition, it is also able to detect an attack before the host server is compromised. However, there are two major problems with the network-based IDS: the packet filtering problem and the classification problem based on string matching (Akenine-Moller, 2002). The former has been improved by several previous studies; however, the latter still needs further study.

Figure 12. Multi-thread test


The Boyer-Moore algorithm is the best known string matching algorithm. It is a general-purpose string matching algorithm; it scans and compares the pattern against the input string starting from the rightmost character of the pattern (Boyer & Moore, 1977). The weakness of string matching is that all data must be scanned; the scanning process consumes a large amount of power and slows down performance (Chen & Lee, 1999). Therefore, this weakness should be improved.
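To make the right-to-left scanning concrete, here is a compact sketch of the Boyer-Moore-Horspool variant (a simplified relative of Boyer-Moore that keeps only the bad-character shift); this is our own illustration, not code from the chapter.

#include <array>
#include <cstddef>
#include <string>

// Return the index of the first occurrence of 'pat' in 'text', or std::string::npos.
// Each alignment is compared from the rightmost character of the pattern; on a
// mismatch, the window is shifted according to the text character under the
// pattern's last position (the bad-character rule).
std::size_t horspoolSearch(const std::string& text, const std::string& pat) {
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || m > n) return std::string::npos;

    std::array<std::size_t, 256> shift;
    shift.fill(m);                                        // default: shift by the whole pattern
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[static_cast<unsigned char>(pat[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        std::size_t j = m;                                // compare right to left
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) --j;
        if (j == 0) return pos;                           // full match found
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return std::string::npos;
}

For IDS-style multi-pattern matching, such a single-pattern routine would either be applied once per signature or be replaced by a multi-pattern automaton; it is shown here only to illustrate the scanning order.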
Pattern matching is the most important part of the network-based IDS. However, it always requires many calculations and, even worse, the number of attack patterns is increasing day by day. Pattern matching should be able to efficiently handle patterns with various lengths, uppercase and lowercase letters, and several ordered letters at the same time. Therefore, it is efficient to use multi-pattern matching in a network-based IDS to handle packets coming in at high speed (Ni, Lin, Chen & Ungsunan, 2007).
Figure 12 shows data comparing Intel's Q6600 (quad-core) and E2140 (dual-core) processors. It is possible to improve efficiency by up to 40% by using multiple threads for virus checking. This means that large parts of network security processing can be improved by being parallelized. However, present network security research is focused on mathematical approaches, data communication, and data processing. Thus, network security methods could be implemented more efficiently by using multi-core based systems.

Parallelization of Intrusion Detection System


Future computer systems will be designed based on multi-core processors, and this trend will also be applied to IDS. However, most of the research related to IDS still targets single-core systems and software. As mentioned before, several parts of an IDS can be parallelized. In fact, the pattern matching algorithm is a sequential process, so it takes a long time (Ni, et al., 2007); this can be addressed by exploiting parallelism in the pattern matching algorithm.
Figure 13 and Table 1 show the structure of the pattern matching and the order of execution, respectively (Ni, et al., 2007). As shown in Figure 13, the IDS scans the patterns through pattern matching, and the CPU then decides whether or not a scanned pattern is an attack pattern. This process takes a long time because all patterns need to be scanned, and up to this point it is performed sequentially. However, the patterns can be divided into several blocks because they are independent of one another.


Figure 13. Structure of parallel pattern matching

Because of this characteristic, the blocks can be allocated to several cores in the multi-core environment (Kowaliski, 2008). Each block is assigned to its own thread, and those threads run on their allocated cores. In this way, the pattern matching process achieves high processing speed on a multi-core system.
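A minimal sketch of this block-parallel scheme (our own illustration; the signature strings, payload, and function name are hypothetical): the signature set is split into blocks, and each thread checks the packet payload against its own block of patterns, so no synchronization is needed beyond a shared alert flag.

#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Returns true if the payload contains any of the given attack patterns. The pattern
// set is divided into numThreads blocks; each thread scans its own block, and the
// threads only share the atomic alert flag.
bool payloadMatchesAny(const std::string& payload,
                       const std::vector<std::string>& patterns,
                       std::size_t numThreads) {
    std::atomic<bool> alert{false};
    auto worker = [&](std::size_t first) {
        for (std::size_t i = first; i < patterns.size() && !alert; i += numThreads)
            if (payload.find(patterns[i]) != std::string::npos)   // any matcher fits here,
                alert = true;                                     // e.g. the routine above
    };
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < numThreads; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
    return alert;
}

int main() {
    std::vector<std::string> signatures = {"cmd.exe", "/etc/passwd", "DROP TABLE", "<script>"};
    std::string packetPayload = "GET /index.php?q=<script>alert(1)</script> HTTP/1.1";
    return payloadMatchesAny(packetPayload, signatures, 2) ? 1 : 0;
}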

GPU DEVELOPMENT ON MULTI-CORE SYSTEM


Parallel computing and data-parallel programming environments provide high performance in a computer system. This is especially crucial for graphics processors, because nearly 10,000 data elements need to be handled at a given time (Boyd, 2008). As a result, the graphics processor has become one of the target applications for utilizing multi-core systems. Large-scale 3D graphics applications have driven the creation of many-core GPUs as well as systems with a large number of CPUs.
Indeed, many-core CPUs need a new software paradigm which can easily exploit software parallelism. To reflect this trend, NVIDIA announced CUDA in 2007, a parallel programming language. CUDA is available on multi-core systems that have shared-memory parallel processing architectures (Nickolls, Buck & Garland, 2007).

Table 1. Order of pattern matching
1. Fetch the patterns from the network interface
2. Scan the patterns and send them to the L2 cache
3. Patterns saved in the shared L2 cache
4. Several blocked patterns input to each CPU
5. Patterns saved in the L1 cache of each CPU
6. Check whether each pattern is an attack pattern or not

Figure 14. 3D graphics system architecture


In the graphics market, many-core GPUs have developed rapidly because graphics processing involves many calculations. Generally, the development of the graphics market requires 3D graphics technology, and 3D graphics technology has to change because it needs high processing power. As a result, 3D graphics technology needs to distribute its work across multiple cores in order to reduce power consumption. Moore's law implies that more transistors fit on one die, so more cores can be integrated. Today's GPUs have been improved into GPGPUs (General-Purpose computation on GPUs) (Nickolls, et al., 2007), which leads to a strong improvement in speed.

Background of 3D Graphics Processing and GPU


Generally, the process of 3D graphics is divided into three stages: the application stage, the geometry stage, and the rasterization stage (Chen & Lee, 1999). These stages are shown in Figure 14.
The figure illustrates the three stages of the 3D graphics process. The first stage is the application stage. This stage is processed on the CPU and manages the user input, calculates the physical 3D object data, and also provides vertex data such as points, lines, triangles, or polygons. The second stage is the geometry stage. In this stage, the vertex data is received from the application stage, and the geometry transformations are calculated (involving vector multiplications) together with operations such as clipping, lighting, and coordinate transformation. The last stage is the rasterization stage; it maps the textures prepared in the geometry stage, applies additional effects, and saves the final result in the frame buffer.
Most 3D graphics chips (GPUs) focus on accelerating the rasterization stage and leave the geometry processing to the host CPU, because geometry processing demands a large number of floating-point operations which cannot be handled easily (Chen & Lee, 1999). To achieve real-time 3D graphics performance, the focus is therefore on the rasterization stage.
The GPU is the most heavily used processor in the graphics field. It has shown rapid improvement over the last ten years, and this has supported the increase in computing capacity. In particular, the GPU provides solutions for rendering and for data-parallel workloads. Figure 15 shows a simplified view of a modern GPU (Boyd, 2008).

Figure 15. A modern GPU
Using the GPU in parallel with the CPU has solved the data processing problem that causes a performance bottleneck. However, this alone could not handle the enormous amount of data in graphics processing. Therefore, dedicated hardware (the GPU) has been designed to solve this problem. The main core of the GPU is the shader. The shader performs the calculations on vertex or pixel data, which make up the bulk of the graphics workload (Boyd, 2008). Furthermore, a unified parallel shader has been designed to solve the I/O problems and to improve the capacity of the shader.

Parallelization of 3D Graphic Processing


Originally, the whole 3D graphics pipeline was executed on the CPU. However, the load on the CPU becomes too large because of the heavy computation involved (Fernando, et al., 2004). As a result, the GPU was introduced to take the geometry stage and the rasterization stage off the CPU. The shader used in the GPU is a specialized processor that accelerates the graphics API (Application Program Interface). The shader programs, supplied by the 3D graphics application or by other hardware blocks in the GPU, operate on the vertex and pixel data of the 3D graphics pipeline, and these calculations are either independent of each other or very similar in structure. Therefore, the GPU provides parallelized vertex and pixel shaders, and inside each shader there is a parallelized structure to improve its capability.
The current trend is to integrate the shaders in the GPU to improve shader utilization. Integrated shaders make no distinction between a vertex shader and a pixel shader: small shaders with the same structure are combined into one big component. This big component operates efficiently because the inner small shaders operate in parallel, and each of them can act as a vertex shader or a pixel shader depending on the case. Because the 3D graphics API is programmed as a pipeline, it is possible to perform the calculations independently even though the vertex and pixel shaders are integrated. A parallelized integrated shader can thus avoid the problem of the computation converging on one shader type and can save hardware resources. Moreover, it uses the same instruction set, so shader programming becomes easier. Figure 16 shows the inner structure of a GPU where a parallelized integrated shader is used.

Figure 16. GPU using parallelized shader
A parallelized shader is a homogeneous multiprocessor that consists of several small shaders with the same structure. Each parallelized shader is designed to accelerate the calculation of vertex or pixel data in 3D programs.
Even though the GPU is a processor core specialized for data-parallel work, there is a problem in using it for general-purpose computation; at the same time, it is inefficient to use it only for graphics processing. As a result, the GPGPU has been developed for general-purpose numerical computation. Furthermore, the two leading companies in the graphics market, NVIDIA and AMD, have developed general-purpose parallel programming tools: CUDA (Compute Unified Device Architecture) from NVIDIA and CTM (Close to the Metal) from AMD. For example, one can install NVIDIA's GPGPU in a computer together with the CUDA software; this can increase the speed of programs with many floating-point calculations, such as graphics. CUDA uses a parallel data cache between the ALUs and memory to perform thread-level parallel processing using several ALUs. Roughly speaking, a few clusters of desktop PCs can then be as effective as a supercomputer.

CONCLUSION
Today, most processor manufacturers are interested in multi-core architectures and release various multi-core products. We expect this trend to last for the next several years for the following three reasons. First of all, clock speed is no longer a major driver of processor design. This is because, as the clock speed increases, the leakage current on a chip also becomes higher, which causes a dramatic increase in power consumption and raises the processor temperature. Therefore, a higher clock frequency is no longer a useful lever, and new implementation techniques such as multi-core architectures are required to elevate processor performance. The second reason is that manufacturers can integrate more and more transistors onto a single chip. Since the number of transistors per chip has increased continually, multi-core processors that integrate two or more processing cores onto a single chip have become realistic in computer technology (Hayes, 2007). Thirdly, market need is another factor that drives the trend. Most computer users want to perform multiple tasks on their desktop machines concurrently, such as listening to music, playing games, watching television, surfing the Internet, and so on. Therefore, the needs of computer users encourage processor makers to obtain performance improvements through parallelism across processing cores.
In this chapter, we have presented several software applications which can be used efficiently on multi-core processors.

REFERENCES
Akenine-Moller, T., & Haines, E., (2002, July). Real-time rendering (2nd Ed.). Wellesley, MA: A. K.
Peters Publishing Company.
Aldwairi, M., Conte, T., & Franzon, P. (2005). Configurable string matching hardware for speeding up intrusion detection. ACM SIGARCH Computer Architecture News, 33(1).
Ahlswede, R., Cai, N., Li, S.-Y. R., & Yeung, R. W. (2000). Network information flow. IEEE Transactions on Information Theory, 46(4), 1204–1216.
Boyd, C. (2008, March/April). Data-parallel computing. ACM Queue, 6(2). doi:10.1145/1365490.1365499
Boyer, R., & Moore, J. (1977). A fast string searching algorithm. Communications of the ACM, 20(10), 762–777. doi:10.1145/359842.359859
Chen, C.-H., & Lee, C.-Y. (1999). A cost effective lighting processor for 3D graphics application. In Proceedings of the International Conference on Image Processing, 2, 792–796.
Dharmapurikar, S., & Lockwood, J. (2006, October). Fast and scalable pattern matching for network intrusion detection systems. IEEE Journal on Selected Areas in Communications, 24(10).
Femando, R., Harris, M., Wloka, M., & Zeller, C. (2004). Programming graphics hardware. In Tutorial
on EUROGRAPHICS. NVIDIA Corporation.
Gschwind, M., Hofstee, H. P., Flachs, B., Hopkins, M., Watanabe, Y., & Yamazaki, T. (2006). Synergistic processing in Cell's multicore architecture. IEEE Micro, 26(2), 10–24.
Hammond, L., Nayfeh, B. A., & Olukotun, K. (1997, September). A single-chip multiprocessor. IEEE Computer, 30(9), 79–85.
Hayes, B. (2007). Computing in a parallel universe. American Scientist, 95(6).
Hennessy, J. L., & Patterson, D. A. (2007). Computer Architecture: A Quantitative Approach (4th ed.). San Francisco, CA: Morgan Kaufmann.


Hill, M. D., & Marty, M. R. (2008, July). Amdahl's law in the multicore era. In HPCA 2008: IEEE 14th International Symposium on High Performance Computer Architecture (p. 187).
Ho, T., Medard, M., Koetter, R., Karger, D. R., Effros, M., Shi, J., & Leong, B. (2006, October). A random linear network coding approach to multicast. IEEE Transactions on Information Theory, 52(10). doi:10.1109/TIT.2006.881746
Koetter, R., & Medard, M. (2003, October). An algebraic approach to network coding. IEEE/ACM Transactions on Networking, 11(5), 782–795.
Kowaliski, C. (2008). NVIDIA CEO talks down CPU-GPU hybrids, Larrabee. The Tech Report, April
11th. Retrieved from http://techreport.com/discussions.x/14538
Kumar, R., Tullsen, D. M., & Jouppi, N. P. (2006). Core architecture optimization for heterogeneous chip multiprocessors. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006) (pp. 23–32).
Kumar, R., Tullsen, D. M., Ranganathan, P., Jouppi, N. P., & Farkas, K. I. (2004, June). Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In Proceedings of the 31st International Symposium on Computer Architecture (ISCA '04).
Kwok, T. T.-O., & Kwok, Y.-K. (2007). Design and evaluation of parallel string matching algorithms for network intrusion detection systems. In Network and Parallel Computing (NPC 2007) (LNCS 4672, pp. 344–353). Berlin: Springer.
Li, S.-Y. R., Yeung, R. W., & Cai, N. (2003, February). Linear network coding. IEEE Transactions on Information Theory, 49(2), 371–381. doi:10.1109/TIT.2002.807285
Ni, J., Lin, C., Chen, Z., & Ungsunan, P. (2007, September). A fast multi-pattern matching algorithm for deep packet inspection on a network processor. In Proceedings of the International Conference on Parallel Processing (ICPP 2007) (p. 16).
Nickolls, J., Buck, I., Garland, M., & Skadron, K. (2008, March/April). Scalable parallel programming with CUDA. ACM Queue, 6(2), 40–53.
Olukotun, K., & Hammond, L. (2005, September). The future of microprocessors. ACM Queue, 3(7), 26–29.
Patterson, D. A., & Hennessy, J. L. (2004). Computer Organization and Design: The Hardware/Software Interface (3rd ed.). San Francisco, CA: Morgan Kaufmann.
Paxson, V., & Sommer, R. (2007). An architecture for exploiting multi-core processors to parallelize network intrusion prevention. In Proceedings of the IEEE Sarnoff Symposium (pp. 26–29).
Pericas, M., Cristal, A., Cazorla, F. J., Gonzalez, R., Jimenez, D. A., & Valero, M. (2007). A flexible heterogeneous multi-core architecture. In Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (pp. 13–24).
Schirrmeister, F. (2007). Multi-core processors: Fundamentals, trends, and challenges. In Embedded Systems Conference (pp. 6–15).
Shen, J. P., & Lipasti, M. (2004). Modern Processor Design: Fundamentals of Superscalar Processors (1st ed.).


Wikipedia. Gauss-Jordan elimination. Retrieved from http://en.wikipedia.org/wiki/Gauss-Jordan_elimination

Wikipedia. Max-flow min-cut theorem. Retrieved from http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem

KEY TERMS AND DEFINITIONS


Cache Coherency: Cache coherence is a method of managing conflicts and maintaining consistency between caches and memory.
Compiler: A compiler is a program (or set of programs) that translates text written in one computer language (the source language) into another computer language (the target language).
Instruction-Level Parallelism (ILP): ILP is a measure of how many of the instructions in a computer program can be executed simultaneously.
Multi-Core: A multi-core processor integrates two or more processing cores onto a single chip; architectures with a high number of cores are also called many-core architectures.
Multiprocessor: A multiprocessor is a single computer system that has two or more processors.
Parallelism: Parallelism is a method of computation in which many calculations are carried out simultaneously.
Thread-Level Parallelism (TLP): TLP is a form of parallelism in which multiple threads execute concurrently across processing cores or parallel computing nodes.


Chapter 13

Assembling of Parallel
Programs for Large Scale
Numerical Modeling
V. E. Malyshkin
Russian Academy of Sciences, Russia

ABSTRACT
The main ideas of the Assembly Technology (AT) as applied to the parallel implementation of large-scale realistic numerical models on a rectangular mesh are considered and demonstrated through the parallelization (fragmentation) of the Particle-In-Cell (PIC) method applied to the problem of energy exchange in a plasma cloud. The implementation of numerical models with the assembly technology is based on the construction of a fragmented parallel program. Assembling a numerical simulation program under AT automatically provides different useful dynamic properties of the target program, including dynamic load balancing based on the migration of fragments from overloaded to underloaded processor elements of a multicomputer. The parallel program assembling approach can also be considered as a combination and adaptation, for parallel programming, of the well-known modular programming and domain decomposition techniques, and it is supported by system software for assembling fragmented programs.

INTRODUCTION
Parallel implementation of realistic numerical models, which use direct numerical modeling of a physical phenomenon based on a description of its behaviour in a local area, usually requires high-performance computation. However, the algorithms of these models, although based on regular data structures (such as a rectangular mesh), are also remarkable for irregularity, and even dynamically changing irregularity, of the data structure (adaptive meshes, variable time steps, particles, etc.). For example, in the PIC method the test particles are the source of such irregularity. Hence, these models are very difficult to parallelize effectively and to implement with high performance using conventional programming languages and systems.
The Assembly Technology (AT) (Kraeva & Malyshkin, 1997; Kraeva & Malyshkin, 1999; Valkovskii & Malyshkin, 1988) was created especially to support the development of fragmented parallel programs for multicomputers. Fragmentation and dynamic load balancing are the key features of programming and program execution under AT. The application of AT to the implementation of large-scale numerical models is demonstrated with the example of a parallel implementation of the PIC method (Berezin & Vshivkov, 1980; Hockney & Eastwood, 1981; Kraeva & Malyshkin, 2001) applied to the problem of energy exchange in a plasma cloud.
AT integrates such well-known programming techniques as modular programming and domain decomposition in order to provide a suitable technology for the development of parallel programs implementing large-scale numerical models. AT supports precisely the process of assembling the whole program out of atomic fragments of computation.
The process of extracting new knowledge has traditionally consisted of two major components. First, a new fact is found in real physical (chemical, etc.) experiments. After that, a theory is constructed that should explain the new fact and predict unknown facts. The theory serves science until some new fact is found that it cannot explain. This is a long and resource-consuming process: real experiments are often very expensive, and the equipment for such experiments takes a long time to prepare. Now a third component has been added to the scientific process. Numerical simulation of natural phenomena on supercomputers is used to test the developed theory in numerical rather than real physical experiments. Such numerical experiments also often help to design a new real experiment if one is necessary. Sometimes the parameters of a physical system cannot be measured at all, for example, the processes in plasma or inside the sun. In these cases only numerical simulation can provide arguments to support or reject the theory.
In comparison with real experiments, numerical experiments consume far fewer resources and can be organized very quickly. Therefore, investigations of a phenomenon can be done more quickly, and the phenomenon can be studied more carefully in numerous experiments. It is no wonder that modern supercomputers are mostly loaded with large-scale numerical simulations (Kedrinskii, Vshivkov, Dudnikova, Shokin & Lazareva, 2004; Kuksheva, Malyshkin, Nikitin, Snytnikov, Snytnikov, & Vshivkov, 2005).
Unfortunately, the development of parallel programs is a very difficult problem. Earlier, sequential programming languages and systems gave numerical mathematicians the possibility of programming their numerical models reasonably well without any assistance from professional programmers. Programming was their private technology (Malyshkin, 2006). The situation is now different: the development of parallel programs is far more difficult and labor-consuming work. Additionally, parallel programs are very sensitive to any errors, to non-optimal design decisions, and to inefficiencies in programming. As a result, numerical mathematicians are now often unable to develop parallel programs implementing their numerical models without assistance from professional programmers.
The technology of assembling parallel numerical modeling programs out of ready-made atomic fragments is suggested as a private technology of programming for numerical mathematicians, who often work with a restricted number of numerical methods and algorithms. The AT is demonstrated on a PIC implementation (parallelization/fragmentation of the algorithms and construction of the program).
Methods of assembling a whole program out of atomic fragments of computation have long been in use in different forms (scalable computing, granularity, etc.). As noted above, AT combines the well-known techniques of modular programming and domain decomposition into a technology suitable for developing parallel programs that implement large-scale numerical models; the peculiarities of the parallel implementation of numerical algorithms are also taken into account.
The approach closest to AT for the development of parallel application programs is demonstrated by the IBM ALF programming system for the Cell microprocessor (ALF for Cell BE Programmer's Guide and API Reference; ALF for Hybrid-x86 Programmer's Guide and API Reference).

THE PIC METHOD AND THE PROBLEMS OF ITS PARALLEL IMPLEMENTATION


The particle simulation is a powerful tool for modeling the behaviour of complex non-linear phenomena in plasmas and fluids. In the PIC method, the trajectories of a huge number of test particles are calculated as these particles move under the influence of electromagnetic fields computed self-consistently on a discrete mesh. These trajectories represent the desired solution of the system of differential equations describing the physical phenomenon under study (Berezin & Vshivkov, 1980; Hockney & Eastwood, 1981).
A real physical space is represented by a model of the simulation domain called the space of modeling (SM). The electric field E and the magnetic field B are defined as vectors and discretised on a rectangular mesh (or on several shifted meshes, as shown in Figures 1 and 2). Thus, as distinct from other numerical methods on a rectangular mesh, in the PIC method there are two different data structures: particles and meshes. No particle affects another particle directly. At any moment of modeling a particle belongs to a certain cell of each mesh.
Each charged particle is characterized by its mass, co-ordinates and velocity. Instead of solving the equations in the 6D space of co-ordinates and velocities, the dynamics of the system is determined by integrating the equations of motion of every particle in a series of discrete time steps. At each time step t_{k+1} = t_k + Δt the following is done:
1. For each particle, the Lorentz force is calculated from the values of the electromagnetic fields at the nearest mesh points (gathering phase);
2. For each particle, the new co-ordinates and velocity are calculated; a particle can move from one cell to another (moving phase);
3. For each particle, the charge carried by the particle to the new cell vertices is calculated to obtain the current charge and density, which are also discretised on the rectangular mesh (scattering phase);
4. Maxwell's equations are solved to update the electromagnetic field (mesh phase).

The sizes of the time step and of a cell are chosen in such a manner that a particle cannot fly farther than into an adjacent cell during one time step of modeling. The number of time steps depends on the physical experiment. A more detailed description of the PIC method can be found in (Berezin & Vshivkov, 1980; Hockney & Eastwood, 1981).
The PIC algorithm has great potential for parallelization, because all the particles move independently. The volume of computation in the first three phases of each time step is proportional to the number of particles, and about 90% of the multicomputer resources are spent on particle processing.
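For illustration only, the four phases can be sketched in C as a single simplified time step; the uniform field, the scalar charge accumulator and the omitted field solver are simplifying assumptions for this sketch, not the actual PIC code discussed in this chapter:

#include <stdio.h>

#define NP 4      /* number of test particles (toy size) */
#define DT 0.1    /* time step */

typedef struct { double x[3], v[3], q, m; } Particle;

/* One simplified PIC time step over all particles.  E is a uniform field
   standing in for the values gathered from the nearest mesh points;
   rho accumulates the scattered charge. */
void pic_time_step(Particle *p, int np, const double E[3], double *rho)
{
    *rho = 0.0;
    for (int i = 0; i < np; i++) {
        /* 1. gathering phase: force from the field at the nearest mesh points */
        double F[3] = { p[i].q * E[0], p[i].q * E[1], p[i].q * E[2] };
        /* 2. moving phase: update velocity and co-ordinates */
        for (int d = 0; d < 3; d++) {
            p[i].v[d] += DT * F[d] / p[i].m;
            p[i].x[d] += DT * p[i].v[d];
        }
        /* 3. scattering phase: deposit charge back onto the mesh (simplified) */
        *rho += p[i].q;
    }
    /* 4. mesh phase: solve Maxwell's equations to update E and B
       (omitted here; this is where the field solver would run). */
}

int main(void)
{
    Particle p[NP] = {{{0}, {0}, 1.0, 1.0}, {{0}, {0}, 1.0, 1.0},
                      {{0}, {0}, -1.0, 1.0}, {{0}, {0}, -1.0, 1.0}};
    double E[3] = {1.0, 0.0, 0.0}, rho;
    pic_time_step(p, NP, E, &rho);
    printf("x of particle 0 after one step: %f\n", p[0].x[0]);
    return 0;
}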


Figure 1. A cell of the SM with the electric E and magnetic B fields, discretised upon shifted meshes

Thus, in order to implement the PIC code on a multicomputer with high performance, an equal number of particles should be assigned for processing to each processor element (PE). However, on MIMD distributed-memory multicomputers, the performance of the PIC code depends crucially on how the mesh and the particles are distributed among the PEs. In order to decrease the communication overheads in the first and third phases of a time step, it is required that a PE contain both the cells (the values of the electromagnetic fields at the mesh points) and the particles located inside them. Unfortunately, in the course of modeling, some particles may fly from one cell to another. To satisfy the above requirement, two basic decompositions can be used (Kraeva & Malyshkin, 1997).
In the so-called Lagrangian decomposition, an equal number of particles is assigned to each PE with no regard for their position in the SM. In this case, the values of the electromagnetic fields, the current charge and the density at all the mesh points must be copied into every PE; otherwise, the communication overheads in the first and third phases will decrease the effectiveness of parallelization. The disadvantages of the Lagrangian decomposition are the following:
• Strict memory requirements;
• Communication overheads at the second phase (to update the current charge and the current density in each PE).

Figure 2. The whole space of modeling (SM) assembled out of cells


In the Eulerian decomposition, each PE contains a fixed rectangular sub-domain, including the electromagnetic fields at the corresponding mesh points and the particles in the corresponding cells. If a particle leaves its sub-domain and flies to another sub-domain in the course of modeling, then this particle must be transferred to the PE containing the latter sub-domain. Thus, even with an equal initial workload of the PEs, after several steps of simulation some PEs might contain more particles than others. This results in load imbalance. The character of the particle motion depends not only on the equations, but also on the initial particle distribution and the initial value of the electromagnetic field.
Many researchers have studied parallel implementations of the PIC method on different multicomputers, and several methods of PIC parallelization have been developed. A long list of references to articles devoted to PIC parallelization can be found in (Kraeva & Malyshkin, 1997). In order to reach high performance, these methods take the particle distribution into account.
Let us consider some examples of particle distributions, which correspond to different real physical experiments.


• Uniform distribution of the particles in the entire space of modeling.
• The case of a plate. The space of modeling has size n1 × n2 × n3. The particles are uniformly distributed in a space of size k × n2 × n3 (k << n1).
• Flow. The set of particles is divided into two subsets: the particles with zero initial velocity and the active particles with an initially nonzero velocity. The active particles are organized as a flow crossing the space along a certain direction.
• Explosion. There are two subsets of particles. The background particles with zero initial velocities are uniformly distributed in the entire space of modeling. All the active particles form a symmetric cloud (r << h, where r is the radius of the cloud and h is the mesh step). The velocities of the active particles are directed along the radius of the cloud.

The main problem for programming is that the data distribution among the PEs depends not only on the volume of data, but also on the data properties (particle velocities, configuration of the electromagnetic field, etc.). With the same volume of data but different particle distributions inside the space, the data processing is organized in different ways.
As the particle distribution does not stay stable in the course of modeling, the program control and the data distribution among the PEs should change dynamically. It is clear that the parallel implementation of the PIC method on a distributed-memory multicomputer strongly demands dynamic load balancing.

BASIC CONCEPTS OF THE TECHNOLOGY OF FRAGMENTED PROGRAMMING


Numerical algorithms in general, and the PIC method in particular, are very suitable for the application of AT. When considering different approaches to the parallel implementation of numerical models, it is necessary always to bear in mind that the constructed parallel programs should possess dynamic properties such as:

1. Non-determinism of execution. The order of process execution is not fully fixed; the order is chosen in the course of execution for better use of the multicomputer resources.
2. Dynamic tunability of the program to all the available resources.
3. Dynamic resource assignment.
4. Dynamic load balancing.
5. Program portability.
6. Dynamic behavior of the program: the program should follow the behavior of the simulated phenomenon.


Provision of the dynamic properties of a program can be achieved through a fragmented representation of the algorithm and the program. This affects different stages of application program development.
It is worth remarking that only technological solutions, i.e., solutions that can be used in a universal technology of program construction, are selected for inclusion into AT. AT provides a high-quality implementation of any suitable numerical model. However, if a certain numerical model must be implemented with the maximum possible quality, then, with the use of specific algorithms and specific programming techniques, an implementation of higher performance than the one produced under AT can be developed manually.

Algorithm and Program Fragmentation


A. An application problem description should be divided into a system of reasonably small atomic fragments representing the realization entities of a model. Fragments might be represented in programming languages by variables, procedures, subroutines, macros, nets, notions, functions, etc. An atomic fragment (P_fragment) contains both data and code. In other words, a program realizing an application problem is assembled out of such small P_fragments of computation, which are connected through variables for data transfer. Under AT the size of atomic fragments can change from one program execution to another.
B. The fragmented structure of an application parallel program is kept in the executable code and provides the possibility of organizing flexible and high-performance execution of the fragmented program. The general idea is the following. A fragmented program is composed as a set of executable P_fragments. Into every PE a part of the P_fragments is loaded, and these constitute the program for that PE. This program is executed inside every PE, looping over all the P_fragments loaded into the PE. If these fragments are small enough, then initially an equal workload can be assembled out of these P_fragments for each PE of the multicomputer.
C. The workload of the PEs can change in the course of computation, and if at least one PE becomes overloaded, then a part of the P_fragments (together with the data they process) that were assigned for execution to the overloaded PE should migrate to underloaded neighbouring PEs, equalizing the workload of the multicomputer's PEs. Providing dynamic load balancing of a multicomputer, scalability and many other dynamic properties of an application program is based on such a fragmentation.

This is of course a general idea only.
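A minimal C sketch of the idea behind items A and B (the type names, fields and the per-PE loop are illustrative assumptions, not the actual AT runtime) could look as follows:

#include <stdio.h>

/* A P_fragment couples a piece of data with the code that processes it. */
typedef struct {
    void *data;                 /* fragment-local data (e.g., one cell)    */
    void (*step)(void *data);   /* computation bound to this fragment      */
    long  weight;               /* load measure, e.g., number of particles */
} PFragment;

/* The per-PE program: loop over all fragments currently loaded into this PE.
   Overweight fragments are candidates for migration to a neighbouring PE. */
void pe_run_one_iteration(PFragment *frags, int nfrags)
{
    for (int i = 0; i < nfrags; i++)
        frags[i].step(frags[i].data);
}

static void print_value(void *data) { printf("%d\n", *(int *)data); }

int main(void)
{
    int a = 1, b = 2;
    PFragment frags[2] = { { &a, print_value, 10 }, { &b, print_value, 20 } };
    pe_run_one_iteration(frags, 2);
    return 0;
}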


Assembling vs. Partitioning


Our basic keyword is assembly. Contrary to partitioning, AT supports the explicit assembling of a whole program out of ready-made fragments of computation, rather than dividing a problem, defined as a whole, into suitable fragments to be executed on the different PEs of a multicomputer. These fragments are the elementary blocks, the bricks, from which a whole program is constructed. The algorithm of the problem's assembling is kept and used later for dynamic program/problem parallelization. Assembling defines the seams of the whole computation, i.e., the way the fragments are connected. Therefore, these seams are the most suitable places to cut the entire computation for parallel execution. For this reason program parallelization can always be done if an appropriate size of atomic fragments is chosen.

Separation of the Fine Grain and the Coarse Grain Computations


The fine-grain computations are encapsulated inside a module that realizes the computations bound up with an atomic fragment (P_fragment). Such a module can be implemented efficiently on a processor element.
The whole parallel program is assembled out of these ready-made P_fragments. The set of P_fragments of a program defines a set of interacting processes (coarse-grain computations). Encapsulating the fine-grain computations, with all their complexity, inside an atomic fragment makes it possible to formalize the construction of a parallel program and to use an explicitly two-level representation of an algorithm: the programming level inside an atomic fragment and the scheme level of application program assembling.

Explicitly Two-Level Programming


First, suitable atomic fragments of computation are designed, programmed and debugged separately. Then the whole computation (problem solution) is assembled out of these fragments. As the code for an atomic fragment, a sequential library subroutine can be used, for example.

Automatic Providing of Dynamic Properties of a Program


When fragmenting an algorithm we should try to satisfy two conditions that cannot always be satisfied in general, but for numerical algorithms usually can be:
• All the processes should consume approximately equal volumes of resources.
• On the set of all the processes there should exist a partial ordering relation < such that the processes interact only with their neighbours, i.e., process p_i can interact with process p_j iff p_i < p_j and there is no process p such that p_i < p < p_j.
Explicit numerical algorithms on a rectangular mesh practically always permit such a fragmentation.
The execution of the set of P_fragments can be organized so that their behaviour imitates the behaviour of a liquid in a system of communicating vessels. This is the technological basis for solving, in a uniform way, the problem of automatically providing the dynamic properties of a target program.


Figure 3. Decomposition of SM for implementation of the PIC on the line of PEs

Computation and Communicating in Parallel


Enough P_fragments are loaded into each PE. As a result, if one of the P_fragments has started a communication, the others can continue computing. Therefore, in the course of program execution most communications can be performed in parallel with computations. This is a kind of multiprogramming over the set of P_fragments.
Many other properties of a program, such as the accumulation of a library of subroutines for any platform, program scalability, optimization of memory use and so on, can also be provided by fragmentation.

Separation of Semantics and Scheme of Computation


The fine-grain computations define a sufficient part of the semantics (functions) of the computation. They are realized within an atomic fragment. Therefore, on the coarse-grain level only a scheme (non-interpreted or semi-interpreted) of the computation is assembled. This means that formal methods of automatic synthesis of parallel programs (Valkovskii & Malyshkin, 1988) can be successfully used, and standard schemes of parallel computation can be accumulated in libraries.

PARALLELIZATION OF NUMERICAL METHODS WITH AT


Let us consider the assembly approach to the parallelization of numerical algorithms using the example of the parallel implementation of the PIC method.
The line and the 2D grid structures of interprocessor communication of a multicomputer are sufficient for the effective parallel implementation of numerical methods on rectangular meshes. In (Valkovskii & Malyshkin, 1988) an algorithm is given for mapping the 2D grid into the hypercube while preserving the neighbourhood of PEs.
A cell is the natural atomic fragment of computation for the implementation of a numerical method. It contains both the data (the particles inside the cells of a fragment and the values of the electromagnetic fields and current density at their mesh points) and the procedures that operate on these data.


Figure 4. Decomposition of SM for implementation of PIC on the 2D grid of PEs

For the PIC method, when a particle moves from one cell to another, it should be removed from the former cell and added to the latter cell. Thus, we can say that with AT the Eulerian decomposition is implemented.
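For illustration, a cell-level atomic fragment could be represented in C roughly as follows (the field layout and the particle representation are assumptions made for this sketch only, not the data structures of the actual implementation):

/* A cell of the SM as the atomic fragment: it couples the mesh data of the
   cell with the particles currently located inside it. */
typedef struct { double x[3], v[3], q, m; } Particle;

typedef struct Cell {
    double E[3], B[3];          /* field values at the cell's mesh points    */
    double rho;                 /* charge density scattered into this cell   */
    Particle *particles;        /* particles currently inside the cell       */
    int nparticles;             /* also serves as the fragment's weight      */
    struct Cell *neighbour[6];  /* adjacent cells (Eulerian decomposition)   */
} Cell;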

PIC Parallelization on the line of PEs


Let us first consider how the PIC is parallelized for multicomputers with a line structure of interprocessor communication. The three-dimensional simulation domain is initially partitioned into N blocks (where N is the number of PEs). Each block_i consists of several adjacent layers of cells and contains approximately the same number of particles (Figure 3). When the load imbalance becomes critical, some layers of the block located in an overloaded PE are transferred to another, less loaded PE. In the course of modeling, adjacent blocks are located in linked PEs; therefore, adjacent layers are located in the same or in linked PEs. This is important for the second phase, at which some particles can fly from one cell into another, and for the fourth phase, when the values in adjacent cells are also used for recalculating the values of the electromagnetic fields in a given cell.
Figure 5. Virtual layers for implementation of PIC on the 2D grid of PEs


Figure 6. Direction of data transfer for implementation of the PIC on the 2D grid

PIC parallelization on the 2D grid of PEs


Let us now consider the PIC method parallelization for a 2D grid of PEs. Let the number of PEs be equal to l × m. Then the SM is divided into l blocks orthogonal to a certain axis. Each block consists of several adjacent layers and contains about NP/l particles (in the same way as was done for the line of PEs). Block_i is assigned for processing to the i-th row of the 2D grid (Figure 4). Blocks are formed so as to provide an equal total workload for every row of PEs of the processor grid. Then every block_i is divided into m sub-blocks block_i_j, which are distributed for processing among the m PEs of the row. These sub-blocks are composed in such a way as to provide an equal workload for every PE of the row. If at least one PE becomes overloaded in the course of modeling, that PE is able to recognize it at the moment when its number of particles substantially exceeds NP/(l × m). Then this PE initiates the re-balancing procedure.
If the number of layers k is comparable to N (or to l in the case of the grid of PEs), it is difficult or even impossible to divide the SM into blocks with an equal number of particles. Also, if the particles are concentrated inside a single cell, it is definitely impossible to divide the SM into equal sub-domains. In order to attain a better load balance, the following modified domain decomposition is used. A layer containing more than the average number of particles is copied into 2 or more neighbouring PEs (Figure 5); these are virtual layers. The set of particles located inside such a layer is distributed among all these PEs. In the course of the load balancing, particles inside the virtual layers are the first to be redistributed among the PEs, and only if necessary are the layers themselves also redistributed. For the computations inside a PE there is no difference between virtual and non-virtual layers. Any layer can become virtual in the course of modeling, and a virtual layer can stop being virtual.
We can see that in both cases there is no necessity to allow an individual cell to migrate ("fly") between PEs. A cell is a very small fragment, and therefore considerable resources would be spent to support its migration. Thus, there is a need to use bigger indivisible fragments at the step of execution (not at the step of problem/program assembling!). In the case of the line of PEs, a layer of the SM is chosen as the indivisible fragment of the concrete PIC implementation. Such a fragment is called a minimal fragment. For the PIC implementation on the grid of PEs, a column is taken as the minimal indivisible fragment. The procedure realizing a minimal fragment is composed statically out of P_fragments, before the whole program is assembled; this essentially improves the performance of the executable code.
In this way, a cell is used as the atomic fragment at the step of numerical algorithm description.

Figure 7. Unification of the Hx mesh variables for two minimal fragments (layers)

At the step of execution of the numerical algorithm, different minimal fragments, assembled out of atomic fragments, are chosen depending on the architecture of the multicomputer.

General PIC Method Fragmentation


The general PIC method fragmentation is based on dividing the SM into parallelepipeds. The size of a parallelepiped is chosen in such a way that several fragments can be loaded into every PE. All the other notions (equal workload, virtual fragments and so on) are defined in the same way as before. This type of fragmentation is suitable for any current multicomputer.

IMPLEMENTATION OF THE PIC METHOD ON MULTICOMPUTERS


Using AT, the PIC method has been implemented on different multicomputer systems. In order to provide good portability, the C language was chosen for the parallel PIC code implementation. For the dynamic load balancing of the PIC, several algorithms were developed (Kraeva & Malyshkin, 1999; Kraeva & Malyshkin, 2001). In the cases of the grid communication structure (Figure 6) and of virtual fragments (particles might fly not only to the neighbouring PEs), special tracing functions are used.
According to AT, the array of particles is divided into N parts, where N is the number of minimal fragments (layers of the SM in the case of the line of PEs, and columns or parallelepipeds in the case of the 2D grid). When elements of the mesh variables (electromagnetic fields, current charge and density) of different minimal fragments hit upon the same point of the SM, they are unified (Figure 7). The elements of the mesh variables of one minimal fragment are not added to the data structure of that fragment, but are stored in 3D arrays together with the elements of the mesh variables of the other minimal fragments in the PE. This is possible due to the rectangular shape of the blocks block_i (block_i_j in the case of the 2D grid). Such a decision allows us to decrease the memory requirements and to speed up the computations during the fourth phase of the PIC algorithm.
In the case of dynamic load balancing, when some minimal fragments are transferred from one PE to another, the size of the 3D arrays of elements of the mesh variables changes dynamically. This demands a special implementation of such dynamic arrays. In the case of mesh fragmentation into parallelepipeds, dynamic load balancing is achieved by fragment migration only.


DYNAMIC LOAD BALANCING


To attain high performance of the parallel PIC implementation, a number of centralized and decentralized load balancing algorithms were specially developed.

Initial Load Balancing


A layer of cells is chosen as the minimal fragment for the PIC implementation on the line of PEs. Each minimal fragment has its own weight; the weight of a minimal fragment is equal to the number of particles in this fragment. The sum of the weights of all the minimal fragments in a PE determines the workload of that PE. The 3D simulation domain is initially partitioned by a certain algorithm into N blocks (where N is the number of PEs). Each block consists of several adjacent minimal fragments and contains approximately the average number of particles. For the initial load balancing two heuristic centralized algorithms were designed (Kraeva & Malyshkin, 1999; Kraeva & Malyshkin, 2001). These algorithms employ the information about the weights of all the minimal fragments. Each PE has this information and constructs the workload card by the same algorithm. The workload card contains the list of the minimal fragments that should be loaded into each of the PEs. If the number of minimal fragments is much greater than the number of PEs, it is usually possible to distribute the minimal fragments among the PEs in such a way that every block contains approximately the same number of particles.
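A minimal sketch in C of how such a workload card could be built (the greedy splitting rule and the names used here are assumptions for illustration, not the algorithms of Kraeva & Malyshkin): each PE scans the fragment weights and cuts the sequence of layers into N contiguous blocks of approximately equal total weight.

/* weights[i] = number of particles in minimal fragment (layer) i.
   card[i]    = index of the PE that fragment i is assigned to.    */
void build_workload_card(const long *weights, int nfrags, int npes, int *card)
{
    long total = 0;
    for (int i = 0; i < nfrags; i++)
        total += weights[i];

    long target = total / npes;   /* average workload per PE        */
    long acc = 0;                 /* weight accumulated for this PE */
    int  pe = 0;

    for (int i = 0; i < nfrags; i++) {
        card[i] = pe;
        acc += weights[i];
        /* Close the current block when it reaches the average,
           keeping the remaining fragments for the remaining PEs. */
        if (acc >= target && pe < npes - 1) {
            pe++;
            acc = 0;
        }
    }
}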
If a considerable portion of the particles is concentrated inside a single cell, it is impossible to divide the SM into blocks with a reasonably equal workload. To solve this problem, the notion of a virtual layer is introduced. The centralized algorithm was modified for the case of virtual fragments.
If overloading of at least one PE occurs in the course of modeling, this PE is able to recognize it at the moment when its number of particles substantially exceeds NP/N. Then this PE initiates the re-balancing procedure.

Dynamic Load Balancing


If a load imbalance occurs, the procedure BALANCE is called. In this procedure, the decision about the direction of data transfer and the volume of data to be transferred is taken. Each load balancing algorithm has its own realization of the procedure BALANCE. The procedure TRANSFER is used for the data transfer; there are two implementations of this procedure, one for the line of PEs and one for the grid of PEs. The procedure TRANSFER is the same for any load-balancing algorithm on the line of PEs; its parameters are the number of particles to be exchanged and the direction of data transfer.
Let us consider the algorithms of dynamic load balancing of the PIC. All the PEs are numbered. In the case of the line of PEs, each PE has a number i, where 0 ≤ i < number_of_PEs. In the case of an (l × m) grid of PEs, the number of a PE is the pair (i, j), where 0 ≤ i < l and 0 ≤ j < m. The layers and columns of the SM are numbered in the same way.

Centralized Dynamic Load Balancing Algorithm


For the dynamic load balancing, the initial load balancing algorithm can be used. One of the PEs collects the information about the weights of the minimal fragments and broadcasts this information to all the other PEs. All the PEs build the new workload card. After that, neighbouring PEs exchange minimal fragments according to the information in the new workload card.


Figure 8. Review window for Hx mesh variable of an atomic fragment of computation



Imbalance threshold. If centralized algorithms are used for the dynamic load balancing, the PEs exchange information about the load; therefore every PE knows the number of particles in all the PEs. In every PE the difference mnp − NP/N is calculated, where mnp is the maximum number of particles in a PE, NP is the total number of particles and N is the number of PEs. If the difference is greater than a threshold Th, the procedure BALANCE is called. The threshold can be a constant chosen in advance, or an adaptive number. In the latter case, initially Th = 0. In the course of modeling, the time t_part required to perform steps (1-3) of the PIC algorithm for one particle is measured. After every call of BALANCE the balancing time t_bal is measured and Th is set equal to t_bal/t_part (the number of particles that could be processed in the same time as one balancing requires). After each subsequent step of the PIC algorithm, Th is decreased by mnp − NP/N. When the value of Th becomes negative, BALANCE is called.
If the threshold is always equal to zero, the procedure BALANCE is called after each time step of the modeling.
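The adaptive threshold logic can be summarized in a few lines of C (an illustrative sketch with assumed variable names; the timings and particle counts would come from the actual PIC code):

static double Th = 0.0;   /* adaptive imbalance threshold, initially zero */

/* Returns nonzero when BALANCE should be called.  mnp is the maximum number
   of particles held by any PE, NP the total number of particles, N the number
   of PEs; Th is decreased by the current imbalance at every PIC step. */
int need_balance(long mnp, long NP, long N)
{
    Th -= (double)(mnp - NP / N);
    return Th < 0.0;
}

/* Called right after BALANCE: t_bal is the measured balancing time and t_part
   the time to process one particle through phases 1-3, so Th becomes the number
   of particles that could be processed in the time of one balancing. */
void reset_threshold(double t_bal, double t_part)
{
    Th = t_bal / t_part;
}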

Decentralized Dynamic Load Balancing Algorithm


for the Constant Number of Particles
The use of centralized algorithms is good enough for multicomputers containing a few PEs. But if the number of PEs is large, the communication overheads can neutralize the advantages of dynamic load balancing. In this case it is preferable to use decentralized algorithms. Such algorithms use information about the load balance in a local domain of PEs only.
If the number of test particles does not change in the course of modeling, a simple decentralized algorithm can be suggested. Each PE has the information on how many particles were moved to/from its neighbouring PEs. To equalize the load it is sufficient just to receive/send the same number of particles from/to the neighbouring PEs. It should be noted that this algorithm works only in the case of virtual fragments.


Specialized Decentralized Algorithm


To equalize the load for the PIC implementation, the following specialized algorithm was designed. In the course of the simulation, every PE calculates the main direction of particle motion from the values of the particles' velocities. During load balancing, particles are delivered in the direction opposite to the main direction. As in the previous algorithm, it is assumed that the number of particles does not change in the course of the simulation. The number of particles to be transferred from overloaded PEs to their neighbours in the direction opposite to the main direction is calculated from the average number of particles and the number of particles in the PE. Some particles are transferred in advance, in order to reduce the number of calls of the dynamic load balancing procedure. This is a case of dynamic behavior of the program, when the program follows the behavior of the model.

Diffusive Load Balancing Algorithms


The basic diffusive load-balancing algorithm was implemented and tested for the parallel PIC implementation (Kraeva & Malyshkin, 1999; Kraeva & Malyshkin, 2001; Kuksheva, Malyshkin, Nikitin, Snytnikov, Snytnikov, & Vshivkov, 2005). The size of the local domain is equal to two. A diffusive algorithm is characterized by its number of steps; the number of steps defines how far data from one PE can be transferred in the course of load balancing. For every step the procedure TRANSFER is called. The larger the number of steps of the diffusive algorithm, the better the load balance that can be attained, but also the more time is required for the load balancing itself. The tests have shown that the total program time does not decrease with the growth of the number of steps.
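As a rough illustration, one diffusive exchange between a PE and a single neighbour (a local domain of size two) could be expressed in C as follows; the actual TRANSFER procedure and its message passing are not shown:

/* One step of the diffusive scheme for a local domain of two PEs: returns how
   many particles this PE should hand to the given neighbour (a negative value
   means the neighbour should send particles instead). */
long diffusive_transfer(long my_load, long neighbour_load)
{
    return (my_load - neighbour_load) / 2;
}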

AUTOMATIC GENERATION OF PARALLEL CODE


The PIC method is applied to the simulation of many natural phenomena. In order to facilitate parallel PIC implementation, a special system for the automatic generation of parallel programs was designed. This system consists of the VISual system of parallel program Assembling (VisA) and a parallel code generator for the C language.
The process of generating a parallel program for the PIC (and, in the same way, for the other numerical algorithms on rectangular meshes) consists of three steps.
At the first step, a user defines the atomic fragment of computation, a cell of the SM. This cell contains elements of the mesh variables at several mesh points, an array of particles (for the PIC method) and procedures in the C language that describe all the computations inside the cell {procedure1, ..., procedurek} (Figure 8).
At the second step, the description of the assembling of the minimal fragments out of atomic fragments is given in the visual system VisA; after that the whole computation is assembled, much as a wall is assembled out of bricks.
Then the generator constructs a program implementing the defined minimal fragment (a layer, a column or a parallelepiped). The particle arrays of the atomic fragments are merged into a single particle array for the minimal fragment. The elements of the mesh variables that hit upon the same point of the SM are unified. In this way, for every mesh variable only one 3D array of its elements is formed.
At the third step, the decision on the PEs' workload is made. The generator creates a parallel program
implementing the whole computation for the target multicomputer. This program includes data initialization and a time loop. At each iteration of the time loop, k loops over all the minimal fragments of the PE are run (where k is the number of procedures in the description of an atomic fragment). After each k-th loop, those elements of the mesh variables that are copied in several PEs are updated (if necessary).
All the particles are stored in m arrays (where m is the number of minimal fragments in a given PE). However, as in the case of the assembling of a minimal fragment, the elements of a mesh variable in all the minimal fragments of a PE form one 3D array.
The user develops the procedures (the computations inside a cell) in the C language, using also several additional statements for defining the computations over the mesh variables.
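For example, a user-defined cell procedure might look like the following simplified C sketch (the argument list and the uniform field are assumptions for illustration; the real procedures also use the additional statements for the mesh variables, which are omitted here):

/* A user-written cell procedure of the kind registered as procedure1..procedurek:
   a simplified moving phase that pushes every particle of the cell by a uniform
   electric force q*E for one time step dt. */
void cell_move(int np, double (*x)[3], double (*v)[3],
               const double E[3], double q, double m, double dt)
{
    for (int i = 0; i < np; i++)
        for (int d = 0; d < 3; d++) {
            v[i][d] += dt * q * E[d] / m;
            x[i][d] += dt * v[i][d];
        }
}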

CONCLUSION
The AT provides high performance of the assembled program's execution, high flexibility in reconstruction of the code, and dynamic tunability to the available resources of a multicomputer. The high performance of program execution makes it possible to model large-scale problems such as the study of a plasma cloud explosion in a magnetized background, the modeling of the interaction of a laser pulse with plasma, the solution of astrophysical problems, etc.
We have applied the AT to the implementation of different numerical methods and hope to create a general tool to support the implementation of mathematical approximating models.
Finally, the question can be asked: how many numerical algorithms can be fragmented? The answer to this question can be found in (Malyshkin, Sorokin & Chauk, 2008). The answer is: any numerical mass algorithm can be fragmented, but with different results. In order to reach a good result, much effort should be made. Very often a deep modification of the initial algorithm should be done, similar to the modification of algorithms for their parallelization. But this is another topic for consideration.

REFERENCES
ALF for Cell BE Programmers Guide and API Reference. Retrieved from http://www01.ibm.com/chips/
techlib/techlib.nsf/techdocs/41838EDB5A15CCCD002573530063D465
ALF for Hybrid-x86 Programmers Guide and API Reference. Retrieved from http://www01.ibm.com/
chips/techlib/techlib.nsf/techdocs/389BBE99638335B80025735300624044
Berezin, Y. A., & Vshivkov, V. A. (1980). The method of particles in rarefied plasma dynamics. Novosibirsk, Russia: Nauka (Science).
Corradi, A., Leonardi, L., & Zambonelli, F. (1997). Performance comparison of load balancing policies based on a diffusion scheme. In Proceedings of Euro-Par'97 (LNCS Vol. 1300). Berlin: Springer.
Hockney, R., & Eastwood, J. (1981). Computer simulation using particles. London: McGraw-Hill.


Kedrinskii, V. K., Vshivkov, V. A., Dudnikova, G. I., Shokin, Yu. I., & Lazareva, G. G. (2004). Focusing of an oscillating shock wave emitted by a toroidal bubble cloud. Journal of Experimental and Theoretical Physics, 98(6), 1138–1145. doi:10.1134/1.1777626
Kraeva, M. A., & Malyshkin, V. E. (1997). Implementation of PIC method on MIMD multicomputers
with assembly technology. In Proc. of the High Performance Computing and Networking Europe 1997
Int. Conference. (LNCS, Vol.1255), (pp. 541-549). Berlin: Springer Verlag.
Kraeva, M. A., & Malyshkin, V. E. (1999). Algorithms of parallel realization of PIC method with assembly technology. In Proceedings of 7th High Performance Computing and Networking Europe, (LNCS
Vol. 1593), (pp. 329-338). Berlin: Springer Verlag.
Kraeva, M. A., & Malyshkin, V. E. (2001). Assembly technology for parallel realization of numerical models on MIMD-multicomputers. International Journal on Future Generation Computer Systems, 17(6), 755–765. doi:10.1016/S0167-739X(00)00058-3
Kuksheva, E. A., Malyshkin, V. E., Nikitin, S. A., Snytnikov, A. V., Snytnikov, V. N., & Vshivkov, V. A. (2005). Supercomputer simulation of self-gravitating media. International Journal on Future Generation Computer Systems, 21(5), 749–758. doi:10.1016/j.future.2004.05.019
Malyshkin, V. (2006). How to create the magic wand? Currently implementable formulation of the
problem. In New Trends in Software Methodologies, Tools and Techniques, Proceedings of the Fifth
SoMeT_06, 147, 127-132.
Malyshkin, V. E. (1995). Functionality in ASSY system and language of functional programming. In
Proceedings of the First Aizu International Symposium on Parallel Algorithms/Architecture Synthesis.
(pp. 92-97). Aizu-Wakamatsu, Japan: IEEE Comp. Soc. Press.
Malyshkin, V. E., Sorokin, S. B., & Chauk, K. G. (2008, May). Fragmented numerical algorithms for the library of parallel standard subroutines. Accepted for publication in Siberian Journal of Numerical Mathematics, Novosibirsk, Russia.
Snytnikov, V. N., Vshivkov, V. A., Kuksheva, E. A., Neupokoev, E. V., Nikitin, S. A., & Snytnikov, A. V. (2004). Three-dimensional numerical simulation of a nonstationary gravitating N-body system with gas. Astronomy Letters, 30(2), 124–138. doi:10.1134/1.1646697
Valkovskii, V. A., & Malyshkin, V. E. (1988). Synthesis of parallel programs and systems on the basis
of computational models. Novosibirsk, Russia: Nauka.
Vshivkov, V. A., Nikitin, S. A., & Snytnikov, V. N. (2003). Studying instability of collisionless systems on stochastic trajectories. JETP Letters, 78(6), 358–362. doi:10.1134/1.1630127
Walker, D. W. (1990). Characterising the parallel performance of a large-scale, particle-in-cell plasma simulation code. Concurrency: Practice and Experience, 2(4), 257–288. doi:10.1002/cpe.4330020402


KEY TERMS AND DEFINITIONS


Assembly Technology: A technology for the development of parallel programs for large-scale numerical simulation, based on assembling the whole computation out of atomic fragments of computation. The technology integrates the well-known techniques of modular programming and domain decomposition and is supported by system software.
Cluster: A multicomputer with a tree structure of the communication net.
Dynamic Load Balancing: Equalizing the workload of the multicomputer's processor elements in the course of a program's execution in order to reach better multicomputer performance.
Dynamic Tunability of a Program to All the Available Resources: A program should be able to use all the available resources of a multicomputer.
Multicomputer: A set of computers connected by a communication net and able, with the use of special system software, to solve the same application problem jointly. Well-known examples of multicomputer communication nets are the rectangular mesh, tree, torus and hypercube.
Parallel Programming: The development of programs able to be executed on multicomputers.
Particle-In-Cell Method: A widely used numerical method for the direct simulation of natural phenomena in which the material is represented by a huge number of test particles. Instead of solving the system of partial differential equations in the 6D space of co-ordinates and velocities, the dynamics of the simulated phenomenon is determined by integrating the equations of motion of every particle in a series of discrete time steps. The method became practically applicable only with the use of supercomputers.


Chapter 14

Cell Processing for two


Scientific Computing Kernels
Meilian Xu
University of Manitoba, Canada
Parimala Thulasiraman
University of Manitoba, Canada
Ruppa K. Thulasiram
University of Manitoba, Canada

ABSTRACT
This chapter uses two scientific computing kernels to illustrate the challenges of designing parallel algorithms for a heterogeneous multi-core processor, the Cell Broadband Engine processor (Cell/B.E.). It describes the limitations of current parallel systems that use single-core processors as building blocks. These limitations degrade the performance of applications that have data-intensive and computation-intensive kernels such as the Finite Difference Time Domain (FDTD) method and the Fast Fourier Transform (FFT). FDTD is a regular problem with a nearest-neighbour communication pattern under a synchronization constraint. The FFT based on the indirect swap network (ISN) modifies the data mapping of the traditional Cooley-Tukey butterfly network to improve data locality, hence reducing the communication and synchronization overhead. The authors aim to unleash the Cell/B.E. and design a parallel FDTD and a parallel FFT based on the ISN by taking into account unique features of the Cell/B.E., such as its eight SIMD processing units on a single chip and its high-speed on-chip bus.

INTRODUCTION
High performance computing (HPC) clusters provide increased performance by splitting computational tasks among the nodes of the cluster and have been commonly used to study scientific computing applications. These clusters are cost-effective, scalable and run standard software libraries such as MPI which are specifically designed for developing scientific application programs on HPC systems. They are also comparable in performance and availability to supercomputers. A typical example is the Beowulf cluster
which uses commercial off-the-shelf computers to produce a cost-effective alternative to a traditional


supercomputer. In the list of top 500 fastest computers in top500.org, many of them are pure clusters. One
of the crucial issue in clusters is the communication bandwidth. High speed interconnection networks
such as Infiniband have paved the way for increased performance gain in clusters.
However, the development trend in clusters has been greatly influenced by hardware constraints leading to three walls, collectively called the brick wall (Asanovic et al., 2006). According to Moore's law, the number of transistors on a chip doubles every 18 to 24 months. However, the speed of processor clocks has not kept up with the growth in transistor counts. This is due to the physical constraints imposed on clock speed increases. For example, too much heat dissipation requires complicated cooling techniques to prevent the hardware from deteriorating, and too much power consumption deters customers from adopting new hardware, increasing the cost of commodity applications. Power consumption doubles with the doubling of the operating frequency, leading to the first of the three walls, the power wall.
On the other hand, even with the increased processor frequency achieved so far, the system performance
has not improved significantly in comparison to the increased clock speeds. In many applications, the
data size operated on by each processor changes dynamically, which in turn, affects the computational
requirements of the problem leading to communication/synchronization latencies and load imbalance.
Multithreading is one way of tolerating latencies. However, previous research (Thulasiram & Thulasiraman, 2003; Thulasiraman, Khokhar, Heber, & Gao, 2004) has indicated that though multithreading
solves the latency problem to some extent by keeping all processors busy exploiting parallelism in an
application, it has not been enough. Accessing data in such applications greatly affects memory access
efficiency due to the non-uniform memory access patterns that are unknown until runtime. In addition,
the gap between the processor speed and memory speed is widening as processor speed increases more
rapidly than memory speed, leading to the second wall, the memory wall. To solve this problem, many
memory levels are incorporated which requires exotic management strategies. However, the time and
effort required to extract the full benefits of these features detracts from the effort exerted on real coding and optimization. Furthermore, it has become a very difficult task for algorithm designers to fully
exploit instruction level parallelism (ILP) to utilize the processor resources effectively to keep the processors busy. Solutions to this problem have been in using deep pipelines with out-of-order execution.
However, this approach impacts the performance of the algorithm due to the high penalty paid on wrong
branch predictions. This leads to the third wall, the ILP wall. These three walls force architecture designers
to develop solutions that can sustain the requirements imposed by applications and provide solutions to
some of the problems imposed by hardware in traditional multiprocessors.
A multi-core architecture is one of the solutions to tackle the three walls. These architectures are driven
by the need for decreased power consumption, increased operations/watt and Moore's Gap. A multi-core
architecture consists of a multi-core processor, which is also called a chip-level multiprocessor (CMP).
A multi-core processor combines two or more independent cores into a single die. It is a new architecture and cannot be regarded as a new SMP (Symmetric MultiProcessor) architecture since all cores in
this architecture share on-chip resources while separate processors in the conventional SMP do not. For
example, each core of AMD Opteron dual-core processor has its own L2 cache, but the two cores still
share other interconnect to the rest of system such as the memory controller. These dual-core processors
belong to homogeneous multi-core processors because the resources and execution units (or cores) are
mere replications of each other. The number of cores on a single die is still growing. Quad-Core Intel
Xeon processor and Quad-Core AMD Opteron processor are already available. Cyclops64 has as many
as 64 homogeneous cores on a single chip, which is usually known as a many-core architecture. On the other hand, the IBM Cell Broadband Engine (Cell/B.E.) processor is a heterogeneous multi-core processor
(Chen, Raghavan, Dale, & Iwata, 2007), which has one conventional microprocessor, Power Processor
Element (PPE), and eight SIMD co-processing elements called Synergistic Processor Elements (SPEs). The PPE and the SPEs use different instruction set architectures (ISAs). These devices communicate with one another over a high-speed on-chip bus called the Element Interconnect Bus (EIB). The PPE, a
superscalar RISC processor, acts as the central controller for the SPEs and provides multithreaded support to better utilize the resources of modern processor architectures. Just as the neuron cells in the brain
work together, the Cell incorporates many electronic devices to work together as a complete system. The
Cell is, therefore, a System-on-Chip or heterogeneous multi-core architecture. The concept of multi-core
architectures and its implementations have paved the way to building tera- and peta-scale supercomputer
systems. Los Alamos' Roadrunner, an Opteron-Cell hybrid supercomputer, aims at sustained petaflop performance using AMD Opteron multi-core processors and Cell/B.E. processors. The
concept of multiprocessors is not new. It has existed in other hardware designs such as GPU (Graphics
Processing Unit), FPGA (Field Programmable Gate Array), and network processors.
In this chapter, we design and develop parallel algorithms for two scientific computing kernels, FDTD (Finite-Difference Time-Domain) and FFT (Fast Fourier Transform), on multicore architectures, in particular the Cell/B.E. FDTD is a regular scientific computing problem with applications in electromagnetic theory and medical imaging (Xu, Sabouni, Thulasiraman, Noghanian, & Pistorius, 2007). It follows a nearest-neighbour communication pattern and is synchronous in nature. FDTD is computationally and data intensive and is usually a kernel inside larger applications. Therefore, improving the FDTD algorithm
is very important to the overall performance of the application. The FFT is a semi-irregular problem
and a kernel in many applications such as computed tomography and option pricing in finance (Barua,
Thulasiram, & Thulasiraman, 2005). The partners of the butterfly computation change at each iteration
thereby changing the communication pattern at each iteration. In the FFT algorithm the processors can
be only one iteration ahead of their neighbouring processors. In this chapter we explain the Indirect Swap
Networks (ISN) technique, an idea proposed in VLSI circuits that can be efficiently used to compute the
butterfly computations in FFT. Data mapping in the swap network topology reduces the communication
overhead by half at each iteration compared to the traditional Cooley-Tukey algorithm.
The rest of the chapter is organized as follows. The next section introduces the Cell/B.E. in detail, which brings new challenges to parallel algorithm design. The following section describes FDTD; parallel FDTD algorithms for distributed memory machines and for homogeneous multicore processors are then presented, followed by the parallel algorithm design on the Cell/B.E. and the experimental results for these three parallel algorithms. The FFT section then introduces the FFT and the indirect swap network, explains the ISN-based algorithm parallelized on the Cell/B.E., and presents its experimental results. The experience of exploiting multicore processors for these two scientific computing kernels is summarized in the conclusion, which closes this chapter.

CELL BROADBAND ENGINE PROCESSOR


Applications that require streaming of data and instructions are well suited to vector processors such as the Cray X-MP supercomputers of the 1980s and 1990s. In recent years, there have been several other vector computer architectures, such as the NEC SX series, Cray X1, Fujitsu vector systems, and the Hitachi SR8000, emulating vector architectures. Furthermore, the SSE instructions in regular Intel processors introduce vector instructions (even if for very short vector lengths) to regular processor chips.
The Cell/B.E. is also an architecture that supports vector operations (Chen et al., 2007). One of the Cell/B.E.'s unique features, Single Instruction Multiple Data (SIMD) computing, allows data-level parallelism
and moves towards vector processing. The Cell/B.E. processor is the first implementation of the Cell
Broadband Engine Architecture (CBEA) (Chen et al., 2007). CBEA was implemented to address some
of the issues related to the three walls existing in conventional uni-processor systems. The Cell/B.E.
processor is a heterogeneous multi-core processor. It consists of one conventional 64-bit Power Processor Element (PPE), eight Synergistic Processor Elements (SPEs), a memory controller, an I/O controller, and an on-chip coherent bus EIB (Element Interconnect Bus) which connects all elements on the
single chip. The eight SPEs are purposely designed for intensive computing via large number of wide
uniform registers (128-entry 128-bit registers) and 256KB local store for each SPE. The Memory Flow
Controller (MFC) on each SPE and the high bandwidth EIB (with a peak bandwidth of 204.8 GBytes/s)
enable SPEs to interact with the PPE, with other SPEs, and with main memory efficiently. These novel features make the Cell/B.E. processor attractive and well suited for scientific computing applications
(Williams et al., 2006).
The Cell/B.E. processor exhibits several levels of parallelism. Coarse-grained parallelism exists between the PPE and SPEs, and between different SPEs. The PPE and SPEs can work on different tasks
concurrently. Each SPE can also perform different tasks simultaneously. Fine-grained parallelism can
be implemented both on the PPE and on the SPE. Both the PPE and the SPEs have their own SIMD
instruction sets, each capable of executing two instructions per clock cycle. The PPE has a two-way
multi-threaded hardware support and is a dual-issue in-order processor. The SPE does not support multithreading at the hardware level; however, it is also a dual-issue in-order processor because of its two
pipelines. Also, the MFC of each SPE can move data around without interrupting the ongoing tasks on
the PPE and SPEs. The nature of parallelism on the Cell/B.E. processor is expected to produce significant
performance improvement if fully explored and utilized (Chen et al., 2007).
All these features make the Cell/B.E. processor an attractive and new architecture for compute intensive applications. Liu et al. (Liu et al., 2007) develop a digital media indexing application on Cell/B.E..
Williams et al. (Williams et al., 2006) investigate the performance of several key scientific computing
kernels on the Cell/B.E. processor. They conclude that the Cell/B.E. processor's three-level software-controlled
memory architecture (the 128 registers, the LS, and the main memory) outperforms the conventional
cache-based architectures, especially for applications with predictable memory access patterns by effective overlap of computation and communication.

FINITE DIFFERENCE TIME DOMAIN ALGORITHM


This section explains the Finite-Difference Time-Domain (FDTD) method. FDTD is a popular method
in many applications such as electromagnetic theory (Yu, Mittra, Su, Liu, & Yang, 2006) and medical
imaging (Xu et al., 2007). FDTD is inherently data-intensive and compute-intensive exhibiting nearest
neighbour communication patterns. Since it is usually a kernel in many applications, the performance of the FDTD algorithm is crucial to the overall performance of the entire application. In this section, we develop a parallel FDTD algorithm for the Cell/B.E. and compare the results with two different
architectures, distributed memory clusters and homogeneous multicore AMD Opteron. We discuss the experimental results and the challenges posed in developing the algorithms taking into consideration
the architectural features of these architectures.

Finite Difference Time Domain Algorithm


FDTD is a numerical technique proposed by Yee in 1966 to solve Maxwell's equations in electromagnetics (Yee, 1966). Yee's algorithm discretizes the 3D region of interest into a mesh of cubic cells or a 2D region into a grid of rectangular cells. These cells are called Yee cells. Each Yee cell has electrical fields (E) and magnetic fields (H) when the region is pinged with microwaves. The electrical and magnetic fields interleave with each other spatially: the edges of the cells in the electrical mesh lie at the centers of the cells in the magnetic mesh. Electrical fields and magnetic fields are updated at alternate
half time steps in a leapfrog scheme in time. An application of FDTD for breast cancer detection uses
the following equations to model the electrical and magnetic fields update. We refer readers (Xu et al.,
2007) for details of the application.
$$E_{zx}\big|_{i,j}^{\,n+1} = a\, E_{zx}\big|_{i,j}^{\,n} + \frac{b}{\Delta x}\Big( H_y\big|_{i,j}^{\,n+1/2} - H_y\big|_{i-1,j}^{\,n+1/2} \Big) \qquad (1)$$

$$E_{zy}\big|_{i,j}^{\,n+1} = a\, E_{zy}\big|_{i,j}^{\,n} - \frac{b}{\Delta y}\Big( H_x\big|_{i,j}^{\,n+1/2} - H_x\big|_{i,j-1}^{\,n+1/2} \Big) \qquad (2)$$

$$H_x\big|_{i,j}^{\,n+1/2} = H_x\big|_{i,j}^{\,n-1/2} - g\Big( E_{zx}\big|_{i,j+1}^{\,n} + E_{zy}\big|_{i,j+1}^{\,n} - E_{zx}\big|_{i,j}^{\,n} - E_{zy}\big|_{i,j}^{\,n} \Big) \qquad (3)$$

$$H_y\big|_{i,j}^{\,n+1/2} = H_y\big|_{i,j}^{\,n-1/2} + g\Big( E_{zx}\big|_{i+1,j}^{\,n} + E_{zy}\big|_{i+1,j}^{\,n} - E_{zx}\big|_{i,j}^{\,n} - E_{zy}\big|_{i,j}^{\,n} \Big) \qquad (4)$$

$$a = \frac{1 - \dfrac{\sigma \Delta t}{2 \varepsilon_0 \varepsilon_r}}{1 + \dfrac{\sigma \Delta t}{2 \varepsilon_0 \varepsilon_r}} \qquad (5)$$

$$b = \frac{\dfrac{\Delta t}{\varepsilon_0 \varepsilon_r}}{1 + \dfrac{\sigma \Delta t}{2 \varepsilon_0 \varepsilon_r}} \qquad (6)$$

$$g = \frac{\Delta t}{\mu \, \Delta y} \qquad (7)$$

In the equations, $E_{zx}|_{i,j}^{n+1}$ is the electrical field at position (i, j) at time step (n + 1), and $H_x|_{i,j}^{n+1/2}$ is the magnetic field at position (i, j) at time step (n + 1/2). $\sigma$ is the conductivity of the material. $\varepsilon_0$ and $\varepsilon_r$ represent the permittivity of free space and of the material respectively. $\mu$ denotes the permeability of the material. $\Delta t$ is the time step, and $\Delta x \times \Delta y$ is the size of a Yee cell.
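As a concrete illustration of equations (5) to (7), the short C program below evaluates the coefficients a, b and g for one set of material and discretization parameters. It is a minimal sketch; the numerical values of the conductivity, relative permittivity, time step and cell size are illustrative assumptions only, not values taken from the application in (Xu et al., 2007).

/* Hypothetical helper (not from the chapter): evaluates the update
 * coefficients a, b and g of equations (5)-(7) for one material. */
#include <stdio.h>

int main(void)
{
    const double eps0  = 8.854e-12;   /* permittivity of free space, F/m      */
    const double mu0   = 1.2566e-6;   /* permeability, H/m                    */
    const double eps_r = 9.0;         /* relative permittivity (assumed)      */
    const double sigma = 0.4;         /* conductivity in S/m (assumed)        */
    const double dt    = 1.0e-12;     /* time step in seconds (assumed)       */
    const double dy    = 5.0e-4;      /* Yee-cell edge length in m (assumed)  */

    double loss = sigma * dt / (2.0 * eps0 * eps_r);   /* common loss term    */
    double a = (1.0 - loss) / (1.0 + loss);            /* equation (5)        */
    double b = (dt / (eps0 * eps_r)) / (1.0 + loss);   /* equation (6)        */
    double g = dt / (mu0 * dy);                        /* equation (7)        */

    printf("a = %g, b = %g, g = %g\n", a, b, g);
    return 0;
}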
A sequential FDTD on a conventional computer is shown in Algorithm 1. N is the number of Yee cells
in each direction, assuming that each direction is equally divided. MAX_TIMESTEPS is the max number
of time steps (iterations) for field updates. FDTD is the kernel of many applications in electromagnetic
field (Taflove & Hagness, 2000; Xu et al., 2007). As an iterative algorithm, its performance is critical to
its widespread applications. However, it is computationally intensive and therefore parallel processing
is required (Xu et al., 2007). The complexity of a 2D FDTD algorithm is O(N³). The sequential FDTD algorithm takes about 200 seconds for a 600 × 600 computational domain over 4000 time steps on an AMD Athlon 64 X2 dual-core processor at 2GHz. In medical imaging, finer granularity is a necessity
to produce more accurate results. However, increased granularity indicates increased computation time
along with more memory requirement. These reasons have led us to design parallel FDTD algorithms
for different architectures.
A number of parallel FDTD studies have been reported using different parallel schemes on different platforms for different applications. Guiffaut et al. (Guiffaut & Mahdjoubi, 2001) implement a parallel FDTD on a computational domain of 150 × 150 × 50 cells on PCs and the Cray T3E. They use the Message Passing Interface (MPI) and adopt a vector communication scheme and a matrix communication scheme, obtaining higher efficiency with the latter. Su et al. (Su, El-Kady, Bader, & Lin, 2004) combine OpenMP and MPI to parallelize FDTD: OpenMP is used for the one-time initialization and the per-time-step updating of the E-fields and H-fields; MPI is used for the communication between neighboring processors. Yu et al. (Yu et al., 2006) introduce three communication schemes in parallel FDTD. The three
schemes differ in which components of E-fields and H-fields should be exchanged and which process
should update the E-fields on the interface.
Algorithm 1 Sequential FDTD on a conventional computer

Initialize electric fields and magnetic fields;


Calculate coefficients for all Yee cells;
for n = 1 to MAX_TIMESTEPS do
for i = 1 to N do
for j = 1 to N do
Update Ezx[i][j] using equation 1;
Update Ezy[i][j] using equation 2;
Update Hx[i][j] using equation 3;
Update Hy[i][j] using equation 4;
end for
end for
end for
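To make the structure of Algorithm 1 concrete, the following C sketch implements the field sweeps on an (N+2) × (N+2) array with a one-cell boundary. It is a minimal illustration, not the authors' implementation; it assumes the coefficients a, b/Δx, b/Δy and g have already been computed, and it splits the E and H updates into two sweeps so that each update sees a consistent time level.

/* Sequential 2D FDTD sketch following Algorithm 1 (illustrative only). */
#define N             600
#define MAX_TIMESTEPS 4000

static float Ezx[N + 2][N + 2], Ezy[N + 2][N + 2];
static float Hx[N + 2][N + 2],  Hy[N + 2][N + 2];

void fdtd_sequential(float a, float b_dx, float b_dy, float g)
{
    for (int n = 0; n < MAX_TIMESTEPS; n++) {
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++) {
                /* equations (1) and (2): electric field update */
                Ezx[i][j] = a * Ezx[i][j] + b_dx * (Hy[i][j] - Hy[i - 1][j]);
                Ezy[i][j] = a * Ezy[i][j] - b_dy * (Hx[i][j] - Hx[i][j - 1]);
            }
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++) {
                /* equations (3) and (4): magnetic field update */
                Hx[i][j] -= g * (Ezx[i][j + 1] + Ezy[i][j + 1] - Ezx[i][j] - Ezy[i][j]);
                Hy[i][j] += g * (Ezx[i + 1][j] + Ezy[i + 1][j] - Ezx[i][j] - Ezy[i][j]);
            }
    }
}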


FDTD on Distributed-Memory Machines


FDTD is data-parallel in nature and exhibits apparent nearest-neighbor communication pattern (Yu et
al., 2006). Therefore, FDTD is a suitable algorithm for parallelization on distributed memory machines
using Message Passing Interface (MPI). The factors impacting the performance of parallel FDTD on
distributed memory machines are communication and synchronization overhead. As shown in the previous
section, the field update of each Yee cell requires information from its neighbors. There is no communication overhead if the neighbors of a Yee cell reside on the same processor. However, communication becomes an issue at the border of the decomposition, where some or all of a cell's neighbors are on neighboring processors. The computational domain has to be large to provide accurate results, which implies that the communication overhead of transferring large amounts of data is high. Therefore, overlapping computation with communication is critical to gaining performance. Another characteristic of FDTD is that the
field updates cannot proceed to the next time step until all Yee cells have been updated for the current
time step. This incurs synchronization overhead for each time step. Therefore, in designing the parallel
FDTD algorithm, proper data distribution and mapping on the available processors is critical to avoiding communication bottlenecks.
Yu et al. (Yu et al., 2006) introduce three communication schemes for parallel FDTD. The three schemes differ in which components of E and H should be exchanged and which processor should update the E field on the interface. The division of the computational domain is on the E field along a Cartesian axis. In this chapter, the computational domain is divided along the x axis of E. Suppose the computational domain is divided into n × n cells and p processors are used for the FDTD computation. Then, each processor receives a matrix of m × n cells where m = n/p. Each processor i (where i is neither 1 nor p, the last processor) shares the first and mth rows of its computational domain with processors i − 1 and i + 1 respectively. Therefore, the E fields on the interface of adjacent processors are calculated on both processors. The purpose of the scheme is to eliminate the communication of E and to communicate only H, in order to improve the computation/communication efficiency. The parallel FDTD algorithm, referred to as the MPI-version parallel FDTD algorithm, is given in Algorithm 2.
Algorithm 2 Parallel FDTD on distributed-memory machines (MPI-version parallel FDTD)

Initialize electric fields and magnetic fields;


if processor is master processor then
Calculate coefficients for all Yee cells;
Decide the Yee cells for each processor and send coefficients of
those Yee cells to the corresponding processors;
else
Receive the coefficients of the Yee cells residing on the local processor;
end if
for n = 1 to MAX_TIMESTEPS do
for i = 1 to N/P do
for j = 1 to N do
Update Ezx[i][j] using equation 1;
Update Ezy[i][j] using equation 2;
Update Hx[i][j] using equation 3;
Update Hy[i][j] using equation 4;


end for
end for
Exchange magnetic fields with the neighboring processors;
Synchronize among all processors
end for
if processor is not master processor then
Send the final results to master processor;
else
Receive results from all other processors;
Output the results at the observation points;
end if
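The per-time-step magnetic-field exchange of Algorithm 2 can be sketched in MPI as below. This is a hedged illustration under the row-block decomposition described above (m = n/p rows per processor); the buffer names and tag values are hypothetical, and the surrounding field-update code is elided.

/* Illustrative halo exchange of H rows between neighbouring ranks. */
#include <mpi.h>

void exchange_h_rows(float *hy_first_row, float *hy_last_row,
                     float *hy_ghost_above, float *hy_ghost_below,
                     int n, int rank, int p, MPI_Comm comm)
{
    MPI_Status st;
    int below = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;
    int above = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;

    /* send my last row down, receive the ghost row coming from above */
    MPI_Sendrecv(hy_last_row,    n, MPI_FLOAT, below, 0,
                 hy_ghost_above, n, MPI_FLOAT, above, 0, comm, &st);

    /* send my first row up, receive the ghost row coming from below */
    MPI_Sendrecv(hy_first_row,   n, MPI_FLOAT, above, 1,
                 hy_ghost_below, n, MPI_FLOAT, below, 1, comm, &st);

    /* Algorithm 2 then synchronizes all ranks before the next time step */
    MPI_Barrier(comm);
}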

FDTD on Homogeneous Multicore Architecture


The multicore machine available for this work is a Sun Fire X4600 server. It is configured with eight sockets, each holding an AMD Opteron dual-core processor. The processors are connected with AMD HyperTransport Technology links. As a whole system, it is a ccNUMA (cache-coherent Non-Uniform Memory Access) SMP system. Each processor has its dedicated memory attached to its two cores. It can access the memory of other processors via AMD's Direct Connect Architecture
(DCA). AMD Opteron dual-core processors have interesting design technologies to tackle some aspects of the three walls. Each core has separate L1 and L2 cache. Separate L2 caches prevent potential
synchronization bottleneck for multiple threads on multiple cores competing over the same data cache.
Hence, separate cores can process separate data sets, avoiding cache contention and coherency problems.
Furthermore, AMD has a unique implementation of ccNUMA based on DCA. Inside each of the eight
dual-core processors, there is a cross-bar switch. One side of the switch attaches the two cores. The other
side of the switch attaches the DCA with a shared memory controller and HyperTransport Technology
links. The shared memory controller connects the two cores to dedicated memory. The HyperTransport
Technology links allow the two cores on one processor to access another processor's dedicated memory.
Therefore, for each core, some memory is directly attached, yielding a lower latency, while some is
not directly attached and has a higher latency. The combination of ccNUMA and DCA technology can
improve performance by locating data close to the thread that needs it, which is called memory affinity. In addition, the hypervisor, which virtualizes the underlying multi-core processors and the multi-processor system, provides facilities to specify thread affinity, assigning dedicated cores to threads. This facility
contributes to further performance improvement.
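The thread-affinity facility mentioned above can be illustrated with a small helper that pins the calling thread to a chosen core, so that the data it touches stays close to the local memory controller. The sketch below uses the Linux sched_setaffinity() interface purely as an example; the Sun Fire X4600 used in this work ran Solaris, where the analogous call is processor_bind().

/* Illustrative only: pin the calling thread to one core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}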
Although FDTD is computationally intensive, it shows apparent data parallelism and high data locality. Each Yee cell update, both for electrical fields and for magnetic fields, only needs information from its near neighbors, as shown in equations 1 through 4. Locality is one of the key factors that impact performance on cache-based computers. The inherent locality of FDTD may bring significant performance gains on the homogeneous multi-core system via the shared-memory parallel programming paradigm, especially with the hardware support of a separate L2 cache for each core. Therefore, we
designed a shared-memory version of FDTD as shown in Algorithm 3, which we refer to as OpenMP
version parallel FDTD.


Algorithm 3 Parallel FDTD on shared memory machines (OpenMP-version parallel FDTD)

Initialize electric fields and magnetic fields;


Calculate coefficients for all Yee cells;
for n = 1 to MAX_TIMESTEPS do
#pragma omp parallel
{
#pragma omp for private(i, j)
for i = 1 to N do
for j = 1 to N do
Update Ezx[i][j] using equation 1;
Update Ezy[i][j] using equation 2;
Update Hx[i][j] using equation 3;
Update Hy[i][j] using equation 4;
end for
end for
}
end for
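A compilable C rendering of Algorithm 3 is sketched below. It is an illustration rather than the authors' exact code; static scheduling gives each thread a contiguous block of rows, which matches the per-core L2 locality argument made above, and the implicit barrier between the two omp for loops keeps the E and H updates at consistent time levels.

/* OpenMP sketch of one FDTD time step (illustrative only). */
#include <omp.h>

void fdtd_openmp_step(int N, float a, float b_dx, float b_dy, float g,
                      float **Ezx, float **Ezy, float **Hx, float **Hy)
{
    int i, j;
    #pragma omp parallel
    {
        #pragma omp for private(j) schedule(static)
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++) {
                Ezx[i][j] = a * Ezx[i][j] + b_dx * (Hy[i][j] - Hy[i - 1][j]);
                Ezy[i][j] = a * Ezy[i][j] - b_dy * (Hx[i][j] - Hx[i][j - 1]);
            }

        #pragma omp for private(j) schedule(static)
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++) {
                Hx[i][j] -= g * (Ezx[i][j + 1] + Ezy[i][j + 1] - Ezx[i][j] - Ezy[i][j]);
                Hy[i][j] += g * (Ezx[i + 1][j] + Ezy[i + 1][j] - Ezx[i][j] - Ezy[i][j]);
            }
    }
}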

FDTD on Cell/B.E. Processor


This section lists the key issues in fully utilizing the parallelism of the Cell/B.E. for FDTD.
One issue is the limited size of the local store (LS) on each SPE. The 256KB LS holds both instructions and data. Based on equations (1) to (4), for a computational domain of 600 × 600 Yee cells, 16MB of memory is needed to hold the variables for the coefficients and the fields at run time, without considering the code and other variables. Therefore, one of the main issues is to decide how to make the data fit in the limited memory size at run time. A solution to this issue is to let each SPE consider one part of the computational domain at a time. At each time step, the SPE fetches the coefficients
and the field values of the Yee cells within the part of the computational domain, and updates the fields
using the equations. The updated field values are stored back to the corresponding memory locations to
make space for the next part of the computational domain. The SPE then starts with the next part of the
domain. The process continues until all the Yee cells of the computational domain are updated for the
current time step. Another round of the whole process starts for the next time step for MAX_TIMESTEPS
rounds. By fetching and storing data between the main memory and the LS, each SPE can manage the
LS to have instructions and data under 256KB limit at run time.
Another issue is how to decide on the size and the frequency of exchanging data between the memory
and the LS such that the Cell/B.E. processor is fully utilized. The SPEs can only operate on instructions and data residing in the LS. Unlike the PPE, the SPEs cannot access the main memory directly. They have to fetch instructions and data from the memory to the LS using asynchronous coherent DMA commands.
Therefore, the communication cost must be considered during algorithm design. A suitable size and
frequency for the transfers has to be determined to ensure there is no data starvation and there is minimal
overhead. Several points are critical in reducing the communication cost and achieving efficient SPE
data access: data alignment, access pattern, DMA initiator, and location. The MFC of the SPE supports transfers of 1, 2, 4, 8 and n × 16 (up to 16K) bytes. Transfers of less than 16 bytes must be naturally aligned and have the same quad-word offset for the source and the destination addresses. Also, all transfers go through the EIB, so the cost on the EIB must be minimized. Minimal overhead on the EIB can be achieved if transfers are at least 128 bytes, and transfers greater than or equal to 128 bytes are cache-line aligned, i.e., aligned to 128 bytes. Furthermore, whenever possible we let the SPE initiate the DMAs and pull the data from the main memory instead of from the PPE's L2 cache. MFC transfers from
the system memory have high bandwidth and moderate latency, whereas transfers from L2 cache have
moderate bandwidth and low latency.
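The alignment and size constraints above can be made concrete with a small SPE-side sketch, assuming the Cell SDK's spu_mfcio.h interface. The 6-row tile size (14,400 bytes) mirrors the transfer granularity discussed in the experiments later; the buffer and function names are hypothetical.

/* SPE-side DMA fetch of one tile into the local store (illustrative only). */
#include <spu_mfcio.h>

#define TILE_BYTES (6 * 600 * sizeof(float))   /* 6 rows of 600 floats = 14,400 bytes */

/* local-store buffer, cache-line (128-byte) aligned as recommended above */
static float tile[TILE_BYTES / sizeof(float)] __attribute__((aligned(128)));

void fetch_tile(unsigned long long ea, unsigned int tag)
{
    /* one DMA command may move at most 16 KB, so 14,400 bytes fits */
    mfc_get(tile, ea, TILE_BYTES, tag, 0, 0);

    /* block until this tag group completes before computing on the tile */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}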
The third issue is the synchronization problem when more than one SPE is used to exploit the parallelism between SPEs. Algorithm 1 shows that the field update for all Yee cells must be completed for the
current time step before any Yee cells can be dealt with further for the next time step. Therefore, when
more than one SPE is used to update different part of the computational domain, the synchronization
among all participant SPEs is mandatory for correct results.
The Cell/B.E. processor supports different synchronization mechanisms (Chen et al., 2007): MFC
atomic update commands, mailboxes, SPE signal notification registers, events and interrupts, or just
polling of the shared memory. We consider mailboxes and SPE signal notification registers in this chapter. For the first method, the SPEs use mailboxes while the PPE acts as the arbitrator. When an SPE finishes its tasks for the current time step, it uses a mailbox to notify the PPE that it is ready for the next
time step. When the PPE receives messages from all participant SPEs, it sends a message via mailboxes
to those SPEs and lets the SPEs start the task for the next time step. The PPE is not involved for SPE
signal notification registers method. One SPE acts as the master SPE, and other SPEs are slave SPEs.
The slave SPEs send signals to the master SPE when their tasks for the current time step are completed, and wait for the signal to start the task for the next time step from the master SPE. The master SPE sends this signal only when it has received signals from all slave SPEs.
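The mailbox scheme can be sketched from the SPE side as follows, assuming the spu_mfcio.h mailbox calls of the Cell SDK. The READY and GO message values are illustrative; on the PPE side, the arbitrator would read one READY message per SPE and then write GO to every SPE's inbound mailbox.

/* SPE-side end-of-time-step synchronization via mailboxes (illustrative). */
#include <spu_mfcio.h>

#define READY 1u
#define GO    2u

void end_of_timestep_sync(void)
{
    /* tell the PPE that this SPE has finished the current time step */
    spu_write_out_mbox(READY);

    /* block until the PPE, having heard from every SPE, replies */
    while (spu_read_in_mbox() != GO)
        ;   /* ignore unrelated messages */
}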
The last issue is at the implementation level: the exploitation of SIMD on the SPE. SPEs are SIMD-only co-processors. Scalar code, especially code for arithmetic operations, may deteriorate performance since the SPE has to re-organize the data and instructions before executing them. Code written in a high-level language must rely on the compiler to be auto-vectorized in order to exploit the SIMD capability of the SPE. However, the flexibility of high-level languages makes it difficult to achieve optimal results for different applications. Therefore, explicit control of the instructions by the programmer is instrumental in achieving optimal performance. For this purpose, the SPE provides intrinsics, which are essentially inline assembly code with C function call syntax. These intrinsics provide such functions as register coloring, instruction scheduling, data loads and stores, looping and branching, and literal vector construction. This chapter considers literal vector construction since most of the tasks in FDTD are arithmetic operations. We aim to manually apply SIMD to the two for loops shown in Algorithm 1.
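As an illustration of such manual SIMDization, the fragment below updates four adjacent Ezx values of equation (1) using the SPU intrinsics in spu_intrinsics.h. It is a hedged sketch: the vectors are assumed to hold cells (i, j)..(i, j+3) of row i and of row i-1, and the function name is hypothetical.

/* SIMD update of four Ezx cells per equation (1) (illustrative only). */
#include <spu_intrinsics.h>

vector float update_ezx4(vector float ezx, vector float hy, vector float hy_im1,
                         float a, float b_dx)
{
    vector float va   = spu_splats(a);      /* broadcast scalar coefficients */
    vector float vb   = spu_splats(b_dx);
    vector float diff = spu_sub(hy, hy_im1);   /* Hy[i][j..j+3] - Hy[i-1][j..j+3] */
    /* a*Ezx + (b/dx)*diff, fused as multiply-add */
    return spu_madd(vb, diff, spu_mul(va, ezx));
}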
Based on all the issues and solutions, we designed a parallel FDTD algorithm as shown in Algorithm
5 (referred to as the CellBE-version parallel FDTD) for the SPE side. The PPE is used to manage all SPE threads and calculate the initialization values. The purpose is to fully exploit the natural parallelism provided by the processor in order to achieve significant performance improvement.
Algorithm 5 FDTD on the SPE (CellBE-version parallel FDTD)

Send ready signal to the PPE;


Receive information for synchronization;


DMA in control block to get information about its assigned task and run settings;
for n = 1 to MAX TIMESTEPS do
while Ezx of Yee cells not updated do
Fetch chunks of data, including the coefficients and the field values of last time step;
Update Ezx using SIMD version of equation 1;
Store the updated Ezx back to the corresponding memory location;
end while
while Ezy of Yee cells not updated do
Fetch chunks of data, including the coefficients and the field values of last time step;
Update Ezy using SIMD version of equation 2;
Store the updated Ezy back to the corresponding memory location;
end while
while Hx of Yee cells not updated do
Fetch chunks of data, including the coefficients and the field values of last time step;
Update Hx using SIMD version of equation 3;
Store the updated Hx back to the corresponding memory location;
end while
while Hy of Yee cells not updated do
Fetch chunks of data, including the coefficients and the field values of last time step;
Update Hy using SIMD version of equation 4;
Store the updated Hy back to the corresponding memory location;
end while
Synchronize with other SPEs;
end for
Send finish signal to the PPE;

Experimental Results and Comparisons


The three parallel algorithms were designed for three architectures: distributed memory machines,
homogeneous multicore machines and the Cell/B.E. processor. They were run on four configurations.
We use the number of processing units as the x axis to avoid confusion between processor number, thread number, core number, and SPE number. Although the PPE is used for part of the computation, its contribution to the
final performance is negligible compared to the computation on the SPEs. Therefore, the processing unit
number indicates the number of SPEs for the Cell/B.E. processor. The four configurations on which the
parallel algorithms are designed and implemented are summarized below.

AMD Athlon cluster: a cluster of 24 nodes. Each node has an AMD Athlon dual-core processor at 2GHz with 512KB cache, connected by a 100Mb/s Ethernet switch; GNU C compiler.


AMD Opteron single-core cluster: a cluster of 16 nodes. Each node has two AMD Opteron single-core processors at 2.4GHz and 2GB of physical memory, with a Voltaire InfiniBand switched-fabric interconnection; C compiler from the Portland Group.
AMD Opteron dual-core shared memory machine: 8 AMD dual-core Opteron processors at 1GHz, 1MB cache per core, 4GB memory per processor and 32GB distributed shared memory in the system; Sun C compiler and Omni compiler.
IBM Cell/B.E. processor: Georgia Tech Cell/B.E. cluster containing 14 IBM Blade QS20 dual-Cell blades, each running at 3.2GHz; GNU C compiler.

Figure 1 depicts the computation time for these four configurations. There are different ways to
compare the performance between pairs of configurations.
It illustrates the performance of the MPI-version parallel FDTD algorithm (Algorithm 2) on two
clusters. The AMD Opteron single-core cluster outperforms the AMD Athlon cluster when the same
number of processing units is used. One of the main reasons for this difference is that the two clusters
use different interconnection networks between processors. The Voltaire Infiniband Switched-fabric
interconnection network of the AMD Opteron single-core cluster provides faster communication and lower communication and synchronization latencies.
Figure 1 also shows the performance of the MPI-version on the AMD Opteron single-core cluster and
the OpenMP-version on AMD Opteron dual-core shared memory machines. We notice that the AMD
Opteron dual-core processor outperforms Opteron single-core processor both at the core level (1 processing unit) and at processor level (2 processing units for dual-core versus 1 processing unit for single-core).
However, for 4 and 8 processing units, the homogeneous multicore architecture with Opteron dual-core
processors has longer computation times than the AMD Opteron single-core cluster. The reason is the overhead of multiple threads and the longer memory latency of the dual-core system. Although the single-core Opteron processors reside on different computers, these computers are connected with the Voltaire InfiniBand switched-fabric interconnection, which minimizes the communication latencies.
The performance comparison between AMD Opteron single-core cluster and the Cell/B.E. processor
is depicted in Figure 1. The computation time on the Cell/B.E. decreases steadily as more SPEs are involved. This implies that the performance of the Cell/B.E. processor maintains an almost constant ratio of 1.45 over the AMD Opteron single-core processors, no matter how many processing units are involved. At the processor level, a Cell/B.E. processor using 8 SPEs is 7.05 times faster than an Opteron single-core processor.

Figure 1. Computation time for different processors
The final comparison is between the Cell/B.E. processor and the Opteron dual-core processors in the
homogeneous multicore architecture. We can see from the figure that when more processing units are
involved, the speedup of Opteron dual-core processor in the shared memory machine is lower than the
speedup of the Cell/B.E. processor. This is due to the thread overheads when using more cores. At the
processor level, a Cell/B.E. processor using 8 SPEs is 3.37 times faster than an Opteron dual-core processor.
As discussed in the previous section, DMA size may be a factor for the performance improvement
since a large number of transfers incur more communication overhead. For this purpose, we designed
a simulation scenario where different numbers of rows (each row has 600 floats, which is 2,400 bytes) in the computational domain are transferred in each DMA command. The result is depicted in Figure 2(a). The almost flat curves (dipping slightly for 6 rows per DMA command) indicate that the DMA size is not a factor for FDTD, since the minimal transfer size (for 1 row) is already 2,400 bytes. For bigger sizes, the next DMA command has to wait for the previous transfer (a large amount of data, e.g. 14,400 bytes for 6 rows) to be completed. The figure shows another source of communication overhead, the time for synchronization. This can be seen from the different spacings between the curves. The space between the top two curves (for 1 SPE and 2 SPEs) is the widest, while the space between the bottom two curves (for 4 SPEs and 8 SPEs) is the narrowest. The observation verifies the fact that more overhead occurs when more SPEs are involved. In fact, the speedup is 1.95 for 2 SPEs, 3.69 for 4 SPEs, and 4.92 for 8 SPEs.
Another scenario was designed to verify the performance difference when using signal and mailbox
synchronization mechanisms. The results shown in Figure 2(b) indicate that the two mechanisms give
comparatively equal performance.
Based on these comparisons, we can conclude that:

The Cell/B.E. processor provides significant performance improvement over conventional processors and parallel architectures.
All parts of the whole parallel system are important to the final performance. These include the
processor, the interconnection network, and the compiler.

FFT
This section explains the Fast Fourier Transform (FFT). Communication and synchronization are two
main latency issues in computing FFT on parallel architectures (Loan, 1992). Both latencies have to be
either hidden or tolerated to achieve high performance. One approach to achieve this is multithreading. Another approach to tolerating latency is to map data efficiently onto the processors' local memories and exploit data locality. Indirect swap networks (ISN), an idea proposed for VLSI circuits, can be used efficiently to compute the butterfly computations in FFT (Yeh, Parhami, Varvarigos, & Lee, 2002).
Data mapping in the swap network topology reduces the communication overhead by half at each iteration. This section explains the traditional FFT Cooley-Tukey algorithm followed by the FFT algorithm
based on the ISN method. The parallel FFT algorithm based on ISN is designed on the Cell/B.E. and
compared to a cluster.

324

Cell Processing for two Scientific Computing Kernels

Figure 2. Performance of FDTD on Cell/B.E.

Fast Fourier Transform


The discrete Fourier transform (DFT) is used in many applications, such as digital signal processing, to analyze a signal's frequency spectrum, to solve partial differential equations, or to perform convolutions. The 1D DFT computation can be expressed as a matrix-vector multiplication. A straightforward solution for N input elements is of complexity O(N²). The Fast Fourier Transform (FFT) proposed by
Cooley-Tukey is a fast algorithm for computing the DFT that reduces the complexity to O(N log N).
The FFT has been studied extensively as a frequency analysis tool in diverse applications areas such
as audio, signal, image processing, computed tomography and computational finance (Barua et al., 2005). There are many variants of the FFT algorithm. Mathematically, all variations differ in the use of
permutations and transformations of the data points (Loan, 1992). For a sequence x(r) with N data points, decimation-in-time (DIT) FFT divides the sequence into its even- and odd-indexed data points at every iteration. On the other hand, decimation-in-frequency (DIF) FFT divides the sequence into two halves x1(r) and x2(r) at every iteration. The different division methods lead to different structures of the butterfly computation. Depending on the number of groups into which the input elements are divided, there are radix-2, radix-4, mixed-radix, and split-radix FFTs in the literature. In this chapter, we consider the basic radix-2 DIT FFT
on N input complex elements where N is a power of 2.
Parallelizing the FFT on multiprocessor computers concerns the mapping of data onto processors. On
shared-memory processors, the whole data is placed in one global memory, allowing all processors to
have access to the data. The computation is subdivided among the processors in such a way that the load
is balanced and memory conflict is low. The recursive FFT algorithm can be easily programmed on such
machines. On distributed architectures, each processor has its own local memory and data exchanges are
via message passing. In this architecture, the recursive FFT algorithm is not the appropriate algorithm
because combining even and odd parts of the elements at each iteration, while the data is distributed over different processors, requires a relatively high level of programming sophistication. Therefore, an iterative
scheme of FFT is more suitable for distributed machines.
There are mainly two latency issues in computing FFTs on parallel architectures: communication and
synchronization. During the butterfly computation, the partners change at each iteration and an efficient
data mapping is difficult. Data need to be communicated between processors at every iteration. This
implies synchronization between processors. In order to achieve high performance, both these latencies
have to be either hidden or tolerated. One such approach is multithreading (Thulasiraman, Theobald,
Khokhar, & Gao, 2000). Another approach to tolerating latency is to map data efficiently onto the processors' local memories, that is, to exploit data locality. Yeh et al. (Yeh et al., 2002) proposed an efficient parallel architecture for FFT in VLSI circuits using indirect swap networks (ISN). Data mapping
in the swap network topology reduces the communication overhead by half at each iteration. The idea
of swap network has been applied to option pricing in computational finance applications (Barua et al.,
2005) and has shown to produce better performance than the traditional parallel DIT FFT. However,
synchronization latency is still an issue for large data size.

Cooley-Tukey Butterfly Network and ISN


At each iteration of the FFT computations, two data points perform a butterfly computation. The butterfly computation can be conceptually described as follows: a and b are points, or complex numbers. The upper part of the butterfly operation computes the summation of a and b with a twiddle factor while the lower part computes the difference. In each iteration, there are N/2 summations and N/2 differences (Grama, Gupta, Kumar, & Karypis, 2003).
In general, a parallel algorithm for FFT, with a blocked data distribution of N/P elements on P processors, involves communication for log P iterations and terminates after log N iterations. If we assume shuffled input data at the beginning (Grama et al., 2003), the first (log N − log P) iterations require no communication. Therefore, during the first (log N − log P) iterations (the local stage), a sequential FFT algorithm can be used inside each processor. At the end of the (log N − log P)th iteration, the latest computed values for N/P data points exist in each processor. The last log P iterations require remote communications (the remote stage). Note that the other half of the pairs for the N/P elements on one processor reside on the same remote processor. The identity of the processors for remote communication can be identified very easily. That is, at the kth stage of the remote stages (k = 0, ..., log P − 1), if processor Pi needs to communicate with processor Pj then j = i XOR 2^k, where XOR is the exclusive OR binary operation (Chu & George, 2000; Grama et al., 2003).
Note that in the Cooley-Tukey FFT algorithm, N/P data elements are exchanged between two paired processors without inter-processor permutation at the remote stage, leaving each of the paired processors with the same copy of 2N/P elements for butterfly computations. Since the same butterfly computations are performed on both processors, there are redundant computations. If only one processor performs the butterfly computations, then some of the processors may be idle. Furthermore, this communication incurs a message overhead of N/P elements at each remote stage, and the distance each message travels increases as the iterations move forward, depending on the interconnection network. The consequence is that more communication and synchronization overhead leads to traffic congestion in the butterfly network.
One solution to reduce data communication at each remote stage is inter-processor permutation using the Indirect Swap Network (ISN) (Yeh et al., 2002). For local stages, each processor permutes N/2P elements locally and performs N/2P butterfly calculations; for remote stages, each processor permutes and exchanges N/2P data with its paired processor. Note that the permutation exploits data locality and thereby reduces the message overhead between two paired processors by N/2P. This is a significant decrease in communication for very large networks. An indirect swap network is depicted in Figure 3 for 16 elements on 4 processors. In this example, at remote stage 0, processors 0 and 1 exchange data points 2, 3 and 4, 5 respectively. The other data points are kept intact in their respective processors (data points 0 and 1 in processor 0, data points 6 and 7 in processor 1). In general, for given N and P processors, N/2P data points are swapped between two processors. The result is a reduction of the communication of the traditional butterfly network by half.
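The two building blocks used in this discussion, the radix-2 butterfly and the XOR rule that identifies the remote-stage partner, can be written down in a few lines of C. The sketch below is self-contained and purely illustrative; it is not the chapter's Cell/B.E. implementation.

/* Radix-2 butterfly and remote-stage partner identification (illustrative). */
#include <complex.h>
#include <stdio.h>

static void butterfly(float complex *a, float complex *b, float complex w)
{
    float complex t = w * (*b);
    *b = *a - t;           /* lower output: difference */
    *a = *a + t;           /* upper output: sum        */
}

int main(void)
{
    /* remote-stage partners for P = 4 processors: j = i XOR 2^k */
    int P = 4;
    for (int k = 0; P >> (k + 1); k++)          /* k = 0 .. log2(P)-1 */
        for (int i = 0; i < P; i++)
            printf("stage %d: processor %d pairs with %d\n", k, i, i ^ (1 << k));
    return 0;
}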

Parallel FFT Based on ISN on Cell/B.E.


As a new architecture for high performance computing, Cell/B.E. has been investigated for different FFT
algorithms. Chow et al. (Chow, Gossum, & Brokenshire, 2005) investigate the performance of Cell/B.E.
for a modified stride-by-1 algorithm proposed by Bailey based on Stockham Self-sorting FFT. They fix
the input sampling size to 16 million (2^24) single precision complex elements and achieve 46.8 Gflop/s
on a 3.2GHz Cell/B.E.. Williams et al. (Williams et al., 2006) investigate 1D/2D FFT on Cell/B.E. on
one SPE. Bader et al. (Bader & Agarwal, 2007) investigate an iterative out-of-place DIF FFT with 1K to
16K complex input samples and obtain a single precision performance of 18.6Gflop/s. Their approach
incurs frequent synchronization overhead both at the end of the butterfly update and at the end of the permutations that follow. FFTW provides various benchmarks of FFT on the IBM Cell Blade and PlayStation 3 for different combinations of single precision, double precision, real and complex inputs, and 1D, 2D, and 3D transforms.

Figure 3. Indirect swap network with bit-reversed input and scrambled output
In the implementation of the FFT algorithm on the Cell/B.E., assume N is the size of the data and P
is the number of SPEs. The PPE bit-reverses input data which is naturally ordered. The PPE prepares
and conveys information such as the memory address of the bit-reversed data, the memory address of
the swap area, the number of SPEs, and the problem size when creating SPE threads. After SPEs receive
the information, each of them gets its corresponding amount of data (N/P) from the main memory according to its id. At the same time, the SPE can overlap the communication between the main memory
and LS with the task of computing twiddle factors. Each SPE then starts (log N log P) iterations of
the sequential computation. After (log N log P) iterations, each of the SPEs starts the iterations of the
remote stage. At every iteration of the remote stage, each SPE stores intermediate results back to the
swap area and synchronizes to ensure every SPE stores their portion of intermediate results to the swap
area. At the end of the synchronization, each SPE gets its paired partners from the swap area to perform the butterfly computation. Note that in the swap network, N/2P data is stored back by each SPE at each iteration. In the Cell/B.E. implementation, the SPEs do not exchange data directly with one another, which is different from the distributed algorithm implementation. In a cluster, the data is initially distributed to the processors by the master processor, and the processors communicate with one another to obtain their paired partners at each iteration. This requires N/2P communications, which is an overhead in the distributed implementation. On the Cell/B.E., data exchange is between the SPE and main memory via
asynchronous DMA transfer issued by SPE. This is a significant advantage on the Cell/B.E. over distributed memory machines. The EIB is fast and allows fast communication between the main memory
and SPEs. On the distributed memory machines, the interconnection network plays a crucial role in the
exchange of data between processors. On the Cell/B.E., since it is system-on-chip architecture, DMA
access is fast, and every element works together to accomplish the task.
Another issue is synchronization. On the distributed memory machines, the processors synchronize
at each iteration. In the FFT implementation on Cell/B.E., the SPEs also need to synchronize, but some
unique features of the Cell/B.E. bring great benefits to FFT computations. As described earlier for FDTD, two
synchronization mechanisms, mailbox and SPE signal notification registers, are used for FFT.
At the end of all iterations, the SPEs write the final results back to the main memory. This new FFT
algorithm based on ISN for Cell/B.E. is presented as a pseudo-code in Algorithm 6. It only shows the
workload on the SPE. The PPE is responsible for bit-reversing the naturally-ordered input at the beginning and for shuffling the final computation results of the SPEs such that the overall output is naturally ordered, as in the butterfly network.
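The PPE-side bit-reversal step can be sketched as follows: reverse the lowest log2(N) bits of each index and swap each pair once. This is a hedged illustration, not the authors' PPE code.

/* Bit-reversal permutation of N complex samples (illustrative only). */
#include <complex.h>

static unsigned bit_reverse(unsigned x, unsigned log2n)
{
    unsigned r = 0;
    for (unsigned b = 0; b < log2n; b++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

void bit_reverse_permute(float complex *data, unsigned n, unsigned log2n)
{
    for (unsigned i = 0; i < n; i++) {
        unsigned j = bit_reverse(i, log2n);
        if (j > i) {                    /* swap each pair exactly once */
            float complex tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}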
Algorithm 6 Parallel FFT based on ISN for Cell on SPE

Input: N/P bit-reversed single precision complex numbers in array A[N/P], P SPEs, N = 2^i, P = 2^j, N >> P; array B[N/P] to store transferred data temporarily
Output: scrambled N/P transformed complex numbers in array A
DMA in N/P complex numbers to array A;
Compute twiddle factors and store them in array W[N/2];
for i = 0 to (log N - log P - 1) do
NG = 2^i; {number of groups}
Shuffle twiddle factors W[N/2];
for j = 0 to N/P - 1 step 2 do
if ((j & NG) = 0) then
pID = j xor NG; {butterfly partner id}
Copy A[j] and A[pID] to B[j] and B[j+1];
else
pID = (j + 1) xor NG;
Copy A[pID] and A[j+1] to B[j] and B[j+1];
end if
end for
while UTE > 8 do {UTE: number of un-transformed elements}
SIMDize the butterfly computation over 8 neighboring elements in array B; UTE -= 8;
end while
Compute any remaining un-transformed elements if N/P is not a multiple of 8;
Swap results in array B to array A;
end for
for i = 0 to (log P - 1) do
if ((SPEid & NG) = 0) then
DMA out all N/2P elements with odd number indices to main memory;


else
DMA out all N/2P elements with even number indices to main memory;
end if
Synchronize with all other SPEs;
if ((SPEid & NG) = 0) then
DMA in N/2P elements from main memory and put into the odd number
indexed positions;
else
DMA in N/2P elements from main memory and put into the even number
indexed positions;
end if
NG = 2^i;
Shuffle twiddle factors W[N/2];
while UTE > 8 do
SIMDize the butterfly computation over 8 neighboring elements in array B; UTE -= 8;
end while
Compute any remaining un-transformed elements if N/P is not a multiple of 8;
Swap results in array B to array A;
end for
DMA out final N/P transformed results in array A to PPE;

Experimental Results
The new FFT algorithm based on the swap network was implemented using SDK 2.1 on an IBM Blade QS20 dual-Cell blade running at 3.2GHz, available at the Georgia Institute of Technology, with the xlc compiler. Figure 4 shows the performance of the algorithm for different problem sizes on different
numbers of SPEs. The figure shows that the execution time decreases with an increasing number of SPEs for the different input sizes. Furthermore, the time for the 4K input decreases faster than the time for the 1K input as the number of SPEs increases. This is because DMA supports asynchronous transfers of up to 16KB between
main memory and local store. Therefore, for larger problem size, the communication overhead is very
close to the overhead of smaller problem size. The difference between run time for larger problem size
and smaller problem size is mainly the computation time on each SPE.
In order to investigate features of Cell/B.E., we compare the execution time of the algorithm on
Cell/B.E. with its execution time on a cluster (Barua et al., 2005). The cluster is a 20-node SunFire 6800 running MPI. The SunFire system consists of UltraSPARC III CPUs with a 1050 MHz clock rate and 40 gigabytes of cumulative shared memory, running the Solaris 8 operating system. The comparison is depicted
in Figure 5(a) for 16K single precision complex numbers. As shown in the figure, Cell/B.E. performs
much better than the cluster. For 8 SPEs on Cell/B.E. and for 8 processors of the cluster, Cell/B.E. is
6.4 times faster than the cluster for 16K input data size. The reason is the large communication overhead in the cluster, that is, (N/2P) log P communications per processor over the log P remote iterations.

Figure 4. Computation time for different problem sizes on different numbers of SPEs

On the
contrary, the high-speed EIB on Cell/B.E., which supports a peak bandwidth of 204.8GBytes/s for intrachip transfers, provides good performance for Cell/B.E., especially when the problem size increases. This
can be further validated by Figure 5(b). The FFT algorithm for Cell/B.E. outperforms the FFT algorithm
for the traditional cluster significantly for larger problem sizes.
Note that the communication between the main memory and the SPEs does not degrade the performance of the algorithm. This is in part due to the system-on-chip architecture of the Cell/B.E. The interconnection network, which is a hindrance on distributed memory machines, is not a concern on the Cell/B.E. We have used the high-speed EIB, asynchronous DMA transfers overlapped with computation, and the large number of wide uniform registers for SIMD operations available on the Cell/B.E. to our advantage in the FFT
implementation.
This is our initial work on FFT. We hope to compare our algorithm and results on Cell/B.E. with
other existing FFT algorithms and their results on Cell/B.E., such as FFTW and FFTC.

Figure 5. Comparison between Cell/B.E. and cluster

CONCLUSION
Although many scientific computing applications have achieved great performance improvements via different parallel paradigms, further improvement is limited by the three walls facing conventional processors. In this chapter, we have focused on multicore processors, especially the Cell/B.E., which aims to tackle the three walls and to provide significant performance improvement via novel features such as the ultra high speed on-chip bus (EIB), eight SIMD co-processor SPEs, a software-managed memory hierarchy, and hardware support for asynchronous communication between the hierarchical memories.
However, the novelty brings challenges for parallel algorithm design. Therefore, we have investigated two scientific computing kernels, FDTD and FFT, as case studies to illustrate the challenges and
solutions when designing parallel algorithms on the Cell/B.E. For 2D FDTD, we have achieved an overall speedup of 14.14 over an AMD Athlon running at 2GHz and 7.05 over an AMD Opteron running at 2.4GHz at the processor level for a computational domain of 600 × 600 Yee cells. As for 1D FFT, for 8 SPEs of an IBM Blade QS20 dual-Cell blade running at 3.2GHz and for 8 processors of the SunFire 6800 cluster running at a 1050MHz clock rate, the Cell/B.E. is 3.7 times faster than the cluster for 4K input data size
and 6.4 times faster than the cluster for 16K input data size. The results obtained from these problems
are promising to further consider the Cell/B.E. as a high performance computing architecture for many more applications.
It is not difficult to see that the Cell/B.E., especially its eight independent SPEs, brings great performance
improvement via manual SIMD operations, explicit data movement management by asynchronous DMA
transfers and explicit scheduling and synchronization (such as multiple buffering) among all nine cores.
However, all the performance improvement techniques put more burdens on developers compared to
conventional processors, which in turn impacts productivity and code portability. Therefore, developers have to balance productivity and performance (a metric called relative productivity), which can be measured by the ratio of speedup over SLOC (source lines of code) (Alam, Meredith, & Vetter, 2007). With the popularity of multi-core architectures, industry and academia
have been improving the productivity from multi-core architectures by enhancing the software stacks
such as the optimized compiler, optimized library, and different programming models and development
platforms. These enhancements will help developers on multi-core architectures, including Cell/B.E.,
fully unleash the power of multi-core without introducing too much programming complexity, thus
achieving high relative productivity.

ACKNOWLEDGMENT
The authors are thankful to the University of Manitoba Research Grants Program for their support in this
research. The authors would also like to acknowledge the partial financial support from Natural Sciences
and Engineering Research Council (NSERC) of Canada. The authors acknowledge Georgia Institute of
Technology, its Sony-Toshiba-IBM Center of Competence, and the National Science Foundation, for
the use of Cell Broadband Engine resources that have contributed to this research.

REFERENCES
Alam, S. R., Meredith, J. S., & Vetter, J. S. (2007, Sept.). Balancing productivity and performance on the Cell Broadband Engine. IEEE Annual International Conference on Cluster Computing.
Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J., Husbands, P., Keutzer, K., et al. (2006, Dec). The Landscape of Parallel Computing Research: A View from Berkeley (Tech. Rep. No. UCB/EECS-2006-183). EECS Department, University of California, Berkeley.
Bader, D. A., & Agarwal, V. (2007, Dec). FFTC: Fastest Fourier transform on the IBM Cell Broadband Engine. In 14th IEEE International Conference on High Performance Computing (HiPC 2007), Goa, India,
(pp. 1821).
Barua, S., Thulasiram, R. K., & Thulasiraman, P. (2005, Aug.). High performance computing for a
financial application using fast Fourier transform. In Euro-Par Parallel Processing (pp. 1246-1253), Lisbon, Portugal.
Chen, T., Raghavan, R., Dale, J. N., & Iwata, E. (2007, Sept.). Cell Broadband Engine Architecture and its first implementation - A performance view. IBM Journal of Research and Development, 51(5), 559-572.


Chow, A. C., Gossum, G. C., & Brokenshire, D. A. (2005). A programming example: Large FFT on the Cell Broadband Engine. In Proceedings of the Global Signal Processing Expo (GSPx).
Chu, E., & George, A. (2000). Inside the FFT black box: Serial and parallel fast Fourier transform algorithms. Boca Raton, FL: CRC Press LLC.
Grama, A., Gupta, A., Kumar, V., & Karypis, G. (2003). Introduction to parallel computing. Upper
Saddle River, NJ: Pearson Education Limited.
Guiffaut, C., & Mahdjoubi, K. (2001, April). A Parallel FDTD Algorithm Using the MPI Library. IEEE Antennas and Propagation Magazine, 43(2), 94-103.
Liu, L.-L., Liu, Q., Natsev, A., Ross, K. A., Smith, J. R., & Varbanescu, A. L. (2007, July). Digital media
indexing on the cell processor. In 16th international conference on parallel architecture and compilation
techniques, Beijing, China (pp. 425425).
Loan, C. V. (1992). Computational frameworks for the fast Fourier transform. Philadelphia, PA: Society
for Industrial and Applied Mathematics.
Su, M., El-Kady, I., Bader, D. A., & Lin, S. (2004, August). A Novel FDTD Application Featuring OpenMP-MPI Hybrid Parallelization. In 33rd International Conference on Parallel Processing (ICPP), Montreal, Canada (pp. 373-379).
Taflove, A., & Hagness, S. (2000). Computational Electrodynamics: The Finite-Difference Time-Domain Method (2nd ed.). Boston: Artech House.
Thulasiram, R. K., & Thulasiraman, P. (2003, August). Performance evaluation of a multithreaded fast Fourier transform algorithm for derivative pricing. The Journal of Supercomputing, 26(1), 43-58. doi:10.1023/A:1024464001273
Thulasiraman, P., Khokhar, A., Heber, G., & Gao, G. (2004, Jan.). A fine-grain load adaptive algorithm of the 2D discrete wavelet transform for multithreaded architectures. Journal of Parallel and Distributed Computing, 64(1), 68-78. doi:10.1016/j.jpdc.2003.06.003
Thulasiraman, P., Theobald, K. B., Khokhar, A. A., & Gao, G. R. (2000, July). Multithreaded algorithms
for the fast Fourier transform. In ACM Symposium on Parallel Algorithms and Architectures, Winnipeg, Canada (pp. 176-185).
Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., & Yelick, K. (2006, May). The Potential of
the Cell Processor for Scientific Computing. In Computing frontiers (cf06) Ischia, Italy (pp. 920).
Xu, M., Sabouni, A., Thulasiraman, P., Noghanian, S., & Pistorius, S. (2007, Sept.). Image Reconstruction using Microwave Tomography for Breast Cancer Detection on Distributed Memory Machine. In
International conference on parallel processing (icpp) XiAn, China (p. 1-8).
Yee, K. (1966, May). Numerial solution of initial boundary value problems involving maxwells equations in isotropic media. IEEE Transactions on Antennas and Propagation, AP-14(8), 302307.


Yeh, C.-H., Parhami, B., Varvarigos, E. A., & Lee, H. (2002, July). VLSI layout and packaging of butterfly networks. In ACM Symposium on Parallel Algorithms and Architectures, Winnipeg, Canada (pp. 196-205).
Yu, W., Mittra, R., Su, T., Liu, Y., & Yang, X. (2006). Parallel finite-difference time-domain method. Boston: Artech House.

KEY TERMS AND DEFINITIONS


Cell Broadband Engine Architecture: The Cell Broadband Engine Architecture, also abbreviated as CBEA or Cell/B.E., or called Cell for short, is a microprocessor architecture jointly developed by Sony, Toshiba, and IBM. It is a heterogeneous multi-core architecture that combines one general-purpose Power architecture core, the PPE (Power Processor Element), with eight streamlined coprocessing elements, the SPEs (Synergistic Processor Elements). The PPE supports the operating system and is mainly used for control tasks. The SPEs support SIMD (Single Instruction Multiple Data) processing. Each SPE has 128 128-bit registers and 256KB of local memory (called local storage) for both instructions and data. The PPE, SPEs and memory subsystem are connected by the on-chip Element Interconnect Bus (EIB). Cell is designed as a general-purpose high-performance processor to bridge the gap between conventional desktop processors and more specialized high-performance processors. It has been installed in the Sony PlayStation 3.
Cooley-Tukey FFT Algorithm: The Cooley-Tukey algorithm, named after J. W. Cooley and John Tukey, is the most common FFT algorithm. It re-expresses the DFT of an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2, recursively, in order to reduce the computation time to O(N log N).
Discrete Fourier Transform: The discrete Fourier transform (DFT) is one of the specific forms of Fourier analysis; it transforms a function in the time domain into another function in the frequency domain. A DFT decomposes a sequence of values into components of different frequencies. A direct way to compute a DFT of N points takes O(N²) arithmetical operations. The inverse DFT (IDFT) transforms a function in the frequency domain into another function in the time domain.
Fast Fourier Transform: The fast Fourier transform (FFT) is an efficient algorithm to compute the DFT and its inverse. Instead of the O(N²) arithmetical operations needed to compute a DFT of N points directly, the FFT can compute the same result in only O(N log N) operations.
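As a concrete illustration of the two complexities above (this sketch is not part of the original chapter; the function names and the power-of-two length are illustrative assumptions), a direct O(N²) DFT and a recursive radix-2 Cooley-Tukey FFT can be written in a few lines of Python/NumPy and checked against numpy.fft:

import numpy as np

def dft_direct(x):
    # O(N^2): multiply the signal by the full N x N DFT matrix
    n = np.arange(len(x))
    W = np.exp(-2j * np.pi * np.outer(n, n) / len(x))
    return W @ x

def fft_radix2(x):
    # O(N log N): Cooley-Tukey recursion for power-of-two N,
    # re-expressing the DFT in terms of two half-size DFTs
    N = len(x)
    if N == 1:
        return x.astype(complex)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

x = np.random.rand(256)
assert np.allclose(dft_direct(x), np.fft.fft(x))
assert np.allclose(fft_radix2(x), np.fft.fft(x))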
Finite Difference Time Domain: Finite Difference Time Domain (FDTD) is a numerical technique
proposed by Yee in 1966 to solve Maxwells equations in electromagentics fields. It discretizes a 3D field
into a mesh of cubic cells or a 2D field into a grid of rectangular cells using central-difference approximations. It is a time-stepping algorithm. Each cell has electrical field vector components and magnetic
field vector components which are updated at alternate half time steps in a leapfrog scheme in time.
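A minimal one-dimensional version of the leapfrog update can be sketched as follows (an illustrative sketch, not the chapter's implementation; normalized units with a Courant number of 0.5 and a Gaussian soft source are assumptions):

import numpy as np

nx, nt = 200, 400
ez = np.zeros(nx)      # electric field component
hy = np.zeros(nx)      # magnetic field component
courant = 0.5          # dt/dx in normalized (c = 1) units

for t in range(nt):
    # magnetic field update from the spatial difference of E (first half step)
    hy[:-1] += courant * (ez[1:] - ez[:-1])
    # electric field update from the spatial difference of H (second half step)
    ez[1:] += courant * (hy[1:] - hy[:-1])
    # soft source: inject a Gaussian pulse at the center of the grid
    ez[nx // 2] += np.exp(-((t - 30) / 10.0) ** 2)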
Indirect Swap Network: The Indirect Swap Network (ISN) is an improvement to the Cooley-Tukey butterfly network. It aims to reduce data communication by half compared to the traditional Cooley-Tukey network by introducing inter-processor permutations when implemented on parallel systems using the message passing model.
Multi-Core Processor: A multi-core processor combines two or more independent cores (normally CPUs) into a single package composed of a single die, or of multiple dies packaged together. It is also called a chip-level multiprocessor (CMP), and it implements multiprocessing in a single physical package. If the cores are identical, the processor is called a homogeneous multi-core processor, such as the AMD Opteron dual-core processor. If the cores are not the same, the processor is called a heterogeneous multi-core processor, such as the Cell/B.E. processor.
Single Instruction Multiple Data: Single Instruction Multiple Data (SIMD) is a technique used to achieve data-level parallelism, with the same instruction applied to multiple data. It is one of the processor architecture categories proposed in Flynn's taxonomy. SIMD was popular in large-scale supercomputers with vector processors. Now, smaller-scale SIMD operations have become widespread in general-purpose computers, such as the SSE instructions in regular Intel processors and the SIMD instruction set in Cell/B.E. processors.


Section 4

Scheduling and Communication Techniques


Chapter 15

On Application Behavior
Extraction and Prediction to
Support and Improve Process
Scheduling Decisions
Evgueni Dodonov
University of São Paulo - ICMC, Brazil
Rodrigo Fernandes de Mello
University of São Paulo - ICMC, Brazil

ABSTRACT
The knowledge of application behavior allows predicting an application's expected workload and future operations.
Such knowledge can be used to support, improve and optimize scheduling decisions by distributing data
accesses and minimizing communication overheads. Different techniques can be used to obtain such
knowledge, varying from simple source code analysis, sequential access pattern extraction, history-based
approaches and on-line behavior extraction methods. The extracted behavior can be later classified into
different groups, representing process execution states, and then used to predict future process events.
This chapter describes different approaches, strategies and methods for application behavior extraction and classification, and also how this information can be used to predict new events, focusing on
distributed process scheduling.

INTRODUCTION
The knowledge of the application behavior allows predicting the application workload during its execution and forecasting distributed data accesses. In order to obtain such data, different strategies can be
employed, varying from simple source code analysis, sequential access pattern extraction (Kotz & Ellis,
1993), history-based approaches (Gibbons, 1997; Smith, Foster & Taylor, 1998) and on-line behavior
extraction methods (Senger, Mello, Santana, & Santana, 2005; Dodonov, Mello, & Yang, 2006).
It is possible to define two different strategies for application behavior extraction. The first approach
DOI: 10.4018/978-1-60566-661-7.ch015

Copyright 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


consists of a static source code evaluation, where the behavior is evaluated in an empiric way, without
real application execution. The second, also known as dynamic evaluation, consists of the application
behavior evaluation during execution.
The static evaluation approach is long established, having been conceived by Church and Turing, and is well described
by Fischer (1965). Originally, the technique was applied to finite state automata and Turing machines,
aiming at detecting possible problems which could lead to deadlock states or improper application
terminations.
With the evolution of computing systems, new static evaluation techniques were introduced. Among
those techniques are the model verification method, which aims at reducing the application behavior to a
formal representation (Schuster, 2003), and the abstract interpretation method (Loiseaux, Graf, Sifakis,
Bouajjani, & Bensalem, 1995), which represents the application behavior using a series of finite state
machines, characterizing different application behavior using independent automata states.
The dynamic behavior evaluation technique, in turn, investigates the behavior during process execution, usually with the aid of debugging or monitoring utilities. It can be further divided into continuous
monitoring and event-based approaches (Jain, 1991).
Continuous monitoring techniques consist of periodic extraction of application characteristics, determining the current behavior by evaluating differences among each execution state. This technique
can be easily employed, as no application modification is required. However, as the monitoring occurs
at pre-determined intervals, it introduces a constant overhead. Moreover, a disadvantage of this approach lies in its imprecision: as the monitoring occurs at fixed intervals, it is not possible to determine the precise behavior at specific execution states.
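As a rough illustration of continuous monitoring (a simplified sketch, not taken from the cited systems; it assumes a Linux /proc file system and a process name without spaces), the CPU times of a process can be sampled at a fixed interval as follows:

import time

def sample_cpu_times(pid, interval=1.0, samples=5):
    # Periodically read the accumulated user/system CPU times (in clock ticks)
    # of a process from /proc/<pid>/stat; the fixed interval causes a constant,
    # behavior-independent monitoring overhead, as discussed above.
    history = []
    for _ in range(samples):
        with open("/proc/%d/stat" % pid) as f:
            fields = f.read().split()
        utime, stime = int(fields[13]), int(fields[14])
        history.append((time.time(), utime, stime))
        time.sleep(interval)
    return history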
The event-based technique evaluates the behavior by determining critical execution states and extracting application characteristics when such states are reached, resulting in a more precise behavior
determination. The adoption of this technique is more difficult, as it requires previous knowledge of
application functionalities in order to correctly determine critical execution states that are further used
to extract its behavior. Besides, this approach usually requires source code instrumentation or the interception and interpretation of function calls.
Among the advantages of event-based approach is the lower execution overhead, as the behavior is
only extracted on specific execution points (for example, on data transfers, or thread synchronization
operations).
The area of behavior prediction has received a lot of attention over the last years, resulting in a series
of comparisons among different approaches, motivating competitions such as the K. U. Leuven (Suykens
& Vandewalle, 2000), Eunite (Chen, Chang, & Lin, 2004) and Santa Fe (Weigend & Gershenfeld, 1994).
However, most of this research is focused on generic chaotic time-series prediction, or on posterior behavior reconstruction. The study of the applicability of behavior prediction techniques to support and improve the performance of distributed systems, combined with our previous research in this field, has
motivated us to write this chapter, aiming at outlining different approaches and strategies employed on
the process behavior extraction, classification and prediction, and their usage in distributed scheduling,
load balancing and data access anticipation.
This chapter is organized as follows: Section 2 overviews the evolution of application behavior
extraction strategies. Section 3 describes different approaches and techniques for behavior classification and prediction, and Section 4 outlines practical applications of the prediction results in distributed
environments. Finally, Section 5 summarizes this chapter.


BACKGROUND: BEHAVIOR EXTRACTION APPROACHES


One of the most common tasks for static application behavior evaluation is source code analysis. The most common approaches rely on widely available tools, such as grep, find and sed (Wilding & Behman, 2005). It is also possible to use specific source code indexing applications, such as LXR (http://lxr.sourceforge.net) and similar tools, or code documentation-oriented approaches, such as those of Javadoc or Doxygen (http://www.doxygen.org).
Application behavior can also be evaluated by specific tools, such as Lint, Blast and Valgrind
(Nethercote & Fitzhardinge, 2004), which allow extracting and profiling the application behavior, aiming at
identifying possible faults during execution (such as incorrect parameter casting, bad memory usage
and unreachable code).
Those approaches, however, rely on the availability of application source code. In cases where the
code is not available, other techniques can be employed. One such technique is known as tracing, which consists of the extraction and further analysis of the system calls performed by the application.
Generally, the system kernel provides support to trace such operations, usually via the ptrace system
call, which notifies the kernel that the application is being monitored. In this case, it is possible to trace
application operations in a completely transparent way. As examples of applications that rely on system
call tracing, we may mention strace (http://strace.sourceforge.net) and GridBox (Dodonov, Souza, &
Guardia, 2004) projects.
The ptrace approach allows modifying and retrieving additional information for each system call,
being widely used for system and application monitoring and debugging. However, as the call tracing
occurs on the kernel level, it is difficult to intercept high-level functions. When such functionality is
required, dynamic system linker-related techniques can be used, such as specific environment variables
used by Linux linker (ld.so), or a tracing application like ltrace (Wilding & Behman, 2005). This approach allows intercepting any function call and tracing or modifying it when required, as demonstrated
by the GridBox (Dodonov et al., 2004) and MidHPC (Andrade Filho et al., 2008) projects.
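For illustration, the following hedged sketch (not part of the cited projects) launches an arbitrary command under strace from Python and collects the per-system-call summary that such tracing tools produce:

import subprocess

def trace_syscalls(cmd, outfile="strace_summary.txt"):
    # -c produces a per-syscall count/time summary, -f follows child processes,
    # -o writes the summary to a file instead of stderr
    subprocess.run(["strace", "-c", "-f", "-o", outfile] + list(cmd))
    with open(outfile) as f:
        return f.read()

# Example: print(trace_syscalls(["ls", "-l"]))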
Continuous monitoring-based approaches, in turn, are more widely used as they do not require
knowledge of application functionality. Such methods usually rely on periodical system statistics collecting and analysis, either on the user level, as in the StatMonitor (Keskar & Leibowitz, 2005; Dodonov
et al., 2006), or on the kernel level (Senger et al., 2005).
Furthermore, application-specific techniques can be employed when using a pre-defined execution
environment, such as MPI, as demonstrated next.

MPI Application Behavior Extraction


One of the most widely used parallel programming techniques is the MPI (Message Passing Interface) standard
(Gropp, Lusk, Doss, & Skjellum, 1996; Squyres & Lumsdaine, 2003). This approach provides a mechanism for transparent message passing among distributed nodes, allowing synchronous and asynchronous
data transfers in a distributed environment.
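A minimal message-passing example using the mpi4py Python bindings (an illustrative sketch; the tags and payloads are arbitrary assumptions) shows the blocking and non-blocking transfer styles mentioned above:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"step": 1, "payload": list(range(8))}, dest=1, tag=11)  # blocking send
    req = comm.isend("asynchronous ping", dest=1, tag=12)              # non-blocking send
    req.wait()
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    msg = comm.recv(source=0, tag=12)
    print("rank 1 received:", data, msg)

Such a program would typically be launched with a command of the form mpirun -np 2 python script.py.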
The most widely used MPI implementations are MPICH (Gropp et al, 1996) and LAM-MPI (Squyres
& Lumsdaine, 2003). Among other implementations are the Intel MPI Library (http://intel.com/go/mpi),
originally known as Vampir (Nagel, Arnold, Weber, Hoppe, & Solchenbach, 1996), and the recently
introduced Open-MPI (Gabriel et al., 2004).
Initially, MPI was intended to be used in cluster environments, composed of homogeneous nodes


and without fault tolerance mechanisms. However, with the evolution of distributed environments,
such assumptions became a clear disadvantage, being inadequate for heterogeneous and large-scale systems. This led to the introduction of new MPI versions, focused on large heterogeneous distributed environments such as computational grids, with approaches such as the GRID-MPI project
(Ishikawa, Matsuda, Kudoh, Tezuka, & Sekiguchi, 2003).
Since the MPI standard is implemented in the form of a library, the application debugging and
monitoring requires specific libraries or tools to obtain the behavior information, such as the Xmpi
utility, included in most of the MPI distributions.
Specific MPI profiling and monitoring utilities are also widely used, such as the Intel Trace Tools
(ITT), included in the Intel MPI library, which allow both continuous and event-based evaluation
of distributed MPI applications. The ITT applications provide support for controlled execution, application tracing, flow control and communication pattern visualization among nodes, being divided
into two main applications: Intel Trace Collector, used for on-line application behavior collection,
and Intel Trace Analyzer, which evaluates the obtained data. The monitoring and behavior extraction
is carried out using a shared library, dynamically linked to a MPI application. The data is stored in
a consistent way, and is further evaluated by Intel Trace Analyzer. It is also possible to instrument
source code of an application using specific functions to define, in a precise manner, which data
should be captured.
A similar approach is used in the MPE (Multi-Processing Environment) system, composed of a series of libraries, applications and graphical utilities for performance evaluation of MPI applications
(Chan, Gropp, & Lusk, 2003). The system employs an event-based approach, providing libraries
for MPI event tracing, similar to Intel Trace Tools, and a graphical analysis tool. MPE, unlike Intel
Trace Tools, is freely available.
A similar strategy is used by the Instrumentation Library for MPI (MPICL), which provides an API
to profile and instrument the MPI applications (Huband & McDonald, 2001). However, it requires
manual source code modifications.
Besides such tools, mpiP utility may also be mentioned, as it provides a lightweight and scalable performance analysis library for MPI applications (mpiP: Lightweight, Scalable MPI Profiling,
http://mpip.sourceforge.net).
Finally, a dynamic monitoring approach is introduced by Dodonov et al. (2006), which aims at
extracting and predicting MPI operations by transparently intercepting function calls during the application execution.

Distributed Application Behavior Extraction


The extracted process behavior can be classified and analyzed using stochastic techniques (Devarakonda & Iyer, 1989; Feitelson, Rudolph, Schwiegelshohn, Sevcik & Wong, 1997), processing of
historical traces (Gibbons, 1997; Smith et al., 1998), on-line evaluation algorithms (Arpaci-Dusseau,
Culler & Mainwaring, 1998; Silva & Scherson, 2000), and mixed approaches (Senger et al., 2005,
Dodonov et al., 2006).
Devarakonda and Iyer (1989) proposed a statistic approach to predict CPU resource usage, input/
output operations and application memory utilization, using a clustering algorithm based on k-means
combined with the Markov chains approach. This strategy allows evaluating the behavior and identifying
resource-intensive applications.


A similar approach is proposed by Feitelson et al. (1997), where the same application is repeatedly
executed and its behavior variations evaluated. The authors observed small differences between different executions, therefore making it possible to employ previously observed application functional parameters to
determine the future behavior without explicit user cooperation or application modification.
Approaches based on application execution are studied by several authors (Gibbons, 1997; Smith et
al., 1998), who evaluate the average CPU load, memory usage and input/output operations. The derived
results are further used to predict application behavior.
Distributed memory accesses can also be evaluated using series of execution traces, as demonstrated
by Fang et al. (2004). Similar approaches are also employed in distributed memory systems for non-sequential data access pattern prediction, as discussed by Bianchini, Pinto and Amorim (1998).
Complex data access patterns can be determined using semantic structures and analytical approaches
(Lei & Duchamp, 1997), data access classification (Mehrotra & Harrison, 1999), and stochastic approaches such as hidden Markov chains model (Madhyastha & Reed, 1997).
On-line process behavior prediction approaches are studied by Silva and Scherson (1997), using
Bayes model and fuzzy logic to model application behavior. Similar approaches are discussed by ArpaciDusseau et al. (1998) and Corbalan, Martorell and Labarta (2001), using the collected information to
control a dynamic load balancing mechanism.
Finally, approaches based on artificial intelligence techniques for on-line behavior classification and
prediction are studied by Senger, Mello and Dodonov (Senger, 2004; Senger et al., 2005; Mello, Senger
& Yang, 2005; Dodonov et al., 2006; Dodonov & Mello, 2007). Those works employ neural networks to
classify and predict the process behavior. Prediction results are further used for dynamic load balancing
and scheduling algorithms (Mello & Senger, 2004), and for data prefetching (Dodonov et al., 2006).
The approaches introduced in this section allow extracting and representing the behavior of distributed
applications. However, the resulting data must be processed in order to be useful for future behavior
prediction, as it will be shown in the next section.

APPROACHES FOR APPLICATION BEHAVIOR CLASSIFICATION AND PREDICTION
The application behavior classification technique aims at reducing repetitive or similar events to a series
of representative patterns, which can further be used to predict or forecast future events.
It is possible to use the raw data extracted from the application behavior directly. However, this approach results in a huge
amount of repetitive and similar data, which could compromise the prediction process. In this context,
classification techniques can be used to reduce the dimensionality of data, grouping similar behaviors
and performing predictions over the most relevant points. The classification is also particularly useful
in cases where similarities among observed behavior cannot be trivially detected.
The classification can be performed with aid of stochastic techniques, such as the iterative clustering,
linear classification or auto-regressive approaches; or with the aid of artificial intelligence techniques
such as neural networks and evolutionary computing.
The iterative clustering technique is represented by the k-means, fuzzy c-means and quality threshold
(qt) clustering models. The k-means technique intends to separate the input space into different clusters,
by determining a centroid value which minimizes the classification quadratic error (Bradley, Fayyad,
& Reina, 1998). The fuzzy c-means approach modifies the k-means model by calculating the similarity


degree among all clusters (Liao, Celmins, & Hammell, 2003), and the qt clustering improves
both k- and c-means strategies by automatically calculating the ideal number of clusters (Jiang, Tang,
& Zhang, 2004).
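A minimal k-means sketch (illustrative only; the three-dimensional behavior vectors are an assumption) groups observed behavior samples into a fixed number of clusters by iteratively recomputing centroids:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each behavior sample to its nearest centroid
        labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# rows could be (CPU usage, memory usage, I/O rate) samples per monitoring interval
X = np.random.rand(200, 3)
labels, centroids = kmeans(X, k=3)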
Another classification strategy is used by linear classification techniques, also known as the maximum margin classifiers, such as the Support Vector Machines (SVM) (Schölkopf & Smola, 2001). Such
technique maps the input patterns, represented by a multi-dimensional weight vector, into a higher degree
space by constructing a series of hyperplanes to maximize the separation of patterns from each other.
The larger the distance among the hyperplanes, the higher the separation degree and, consequently, the better the generalization. This approach can also be used to predict future patterns, as demonstrated
by Hirose et al. (Hirose, Shimizu, Kanai, Kuroda, & Noguchi, 2007), where it is employed on the prediction of long disordered sequences.
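A hedged sketch of such a maximum-margin classifier, using the scikit-learn SVC implementation with synthetic, purely illustrative execution-state features and labels (none of which come from the cited works):

import numpy as np
from sklearn.svm import SVC

X = np.random.rand(100, 4)              # feature vectors describing execution states
y = (X[:, 0] > X[:, 1]).astype(int)     # illustrative binary labels (e.g. CPU- vs I/O-bound)

clf = SVC(kernel="rbf", C=1.0)          # maps patterns into a higher-dimensional space
clf.fit(X[:80], y[:80])
print("held-out accuracy:", clf.score(X[80:], y[80:]))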
Model-based behavior prediction is also used in different stochastic auto-regressive approaches, such
as: SVCA, which employs non-linear auto-regression; NARX and ARMAX models, intended to predict
time series with independent variables; and ARMA and ARIMA models, used for general time series
prediction (Jain, 1991).
Self-Organized Maps, or SOM networks, originally introduced by Kohonen (Kaski & Oja, 1999),
are frequently used for input data generalization according to similarities among them. SOM uses an
unsupervised learning model, where the best matching neuron represents similar patterns, which are
determined by distributing all the input values over a map and evaluating distances among them. In this
process, the neuron with the weight closest to the input value is declared the winner, and has its weight
adjusted towards the input value. The weights of the other neurons in its neighborhood are also adjusted,
according to their distance to the winning neuron.
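The winner-take-most update described above can be sketched in a few lines of NumPy (a simplified SOM with a fixed decay schedule; the grid size and learning parameters are illustrative assumptions):

import numpy as np

def train_som(X, grid=(8, 8), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows, cols, X.shape[1]))   # one weight vector per map neuron
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        sigma = sigma0 * (1 - t / epochs) + 1e-3
        for x in X:
            d = ((W - x) ** 2).sum(-1)
            winner = np.unravel_index(np.argmin(d), d.shape)   # best-matching unit
            h = np.exp(-((coords - np.array(winner)) ** 2).sum(-1) / (2 * sigma ** 2))
            W += lr * h[..., None] * (x - W)   # pull the winner and its neighbors toward x
    return W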
Although the self-organized maps provide unsupervised pattern classification and feature extraction,
their efficiency is limited by the need for a prior definition of the network topology. Aiming at a more
flexible pattern classification, self-expansible neural networks were introduced (Kunze & Steffens,
1995; Fritzke, 1995; Thacker, Abraham & Courtney, 1997), such as Cascade-Correlation Learning
Architecture (CCLA), Growing Cell Structure (GCS), Probabilistic GCS, Growing Neural Gas (GNG),
Growing Self-Organizing Maps, Restricted Coloumb Energy (RCE) and Contextual Layered Associative
Memory (CLAM).
The basic training and classification process of self-expansible networks is similar to self-organized
maps. However, the self-expansible networks create new elements on demand, according to the variations
among the input patterns, aiming at reducing the global residual training error by supporting the neuron
with the highest accumulated error rate. The creation of new elements is periodic and requires several previous training phases, therefore limiting the network's capabilities for on-line classification.
The introduction of novelty-detection networks, such as GWR (Marsland et al., 2002) and SONDE
(Albertini & Mello, 2007), further extended the family of adaptive data classification approaches. Such
networks provide additional advantages, creating new neurons at any time and forgetting irrelevant
information, which may result in a more precise and efficient pattern classification for temporal applications.
A different approach is employed by the Adaptive Resonance Theory (ART) family of neural networks,
proposed by Grossberg (Carpenter & Grossberg, 1987). The basic ART system is an unsupervised learning model and typically consists of neuron layers for comparison and recognition, a vigilance parameter,
and a reset module. The comparison and recognition layers are responsible for similarity determination
among input values, and their classification. The vigilance parameter has considerable influence in the


classification: the higher the vigilance parameter, the more accurate the classification, although such accuracy implies a loss of generalization. Finally, the reset module is used to control the algorithm, constantly verifying the classification precision according to the vigilance parameter. In order to
determine the adequate number of clusters, the resulting classification can be evaluated to determine
the inter-cluster and intra-cluster distance, as proposed by Mello et al. (2005). The average distance
among clusters is known as inter-cluster distance, and the distance among elements of the same cluster
as intra-cluster distance. Those parameters determine the level of independence among the classified
data, allowing tuning the desired generalization.
The original version of ART network is known as ART-1, and was limited to the classification of
binary data. In order to allow the classification of continuous input values, an extension of the network
was proposed, called ART-2. Further performance optimizations of the ART-2 network resulted in the ART-2A network (He, Tan & Tan, 2004). Among other modifications of the ART family are Fuzzy ART, which
employs fuzzy logic to reduce the number of clusters created during classification; ART-3 network that
allows partial neuron inhibition using a neuro-transmitting mechanism; ARTMAP or Predictive ART,
which combines ART-1 and ART-2 network into a supervised learning structure, among others (Marinai,
Gori, & Soda, 2005).
A different approach is used in the Independent Component Analysis (ICA) family of networks which
allows extracting a series of independent signals from a composite one by detecting the correlation among
them. Applied to process behavior classification, such networks can be used to determine and separate
different behaviors of a process. Among different ICA network implementations are Infomax, FastICA
and MF-ICA (Rosca, Erdogmus, Príncipe & Haykin, 2006).
A different classification and prediction model is used by the Radial Basis Function (RBF) networks
(Powell, 1987), which calculate a regressive function to represent the input patterns. This is performed
by combining a series of Gaussian functions, whose combination results in a regression equation which
describes the observed behavior. Among notable extensions to the RBF network are the Recurrent RBF,
described by Zemouri et al. (2003), where it is used to predict chaotic time series, and Time-Delay RBF
(Berthold, 1994), employed on temporal behavior recognition.
The concept of interconnections among neurons with constant feedback and back-propagation techniques is used in the Multi-Layer Perceptron (MLP) neural network (Hornik, Stinchcombe & White,
1989). This allows the network to learn and adapt itself to the input patterns.
Another approach is employed in the Time-Delay Neural Networks (TDNN), introduced by Waibel
et al. (1989), and used for position-independent recognition of features within a larger pattern. While in
a traditional neural network the basic unit computes the weighted sum of its inputs and then forwards it
through a nonlinear function to other units, the basic unit of the TDNN network is modified by introducing n delays to the input, so an input layer composed of y inputs would generate z = y * (n+1) inputs to
the network, corresponding to the past inputs. By employing delays, the neural network can relate and
compare the current input to previously observed data. In this way it effectively implements a short-term memory mechanism. The output values are compared to the expected ones and the error value is
obtained and back-propagated through the network, updating the weights and decreasing the prediction
error. This procedure is repeated until the results converge to the expected outputs.
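The delay-based input expansion is essentially a sliding window over the observed series; a small sketch (illustrative only, with a single input feature, so z = n + 1 inputs per pattern) is shown below:

import numpy as np

def delayed_inputs(series, n_delays):
    # Build TDNN-style patterns [x(t-n), ..., x(t-1), x(t)] with the next value as target
    X, y = [], []
    for t in range(n_delays, len(series) - 1):
        X.append(series[t - n_delays:t + 1])
        y.append(series[t + 1])
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 8 * np.pi, 500))   # illustrative observed behavior signal
X, y = delayed_inputs(series, n_delays=4)         # each row holds 5 time-shifted inputs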
Among the extensions of the time-delay family of networks is the ATNN network, which extends the
TDNN by including the ability to adjust the delays dynamically, therefore allowing a better adaptation
to different access patterns, as demonstrated by Day and Davenport (1993).
A statistical approach for prediction is introduced by Bayesian networks (Morawski, 1989), which


consider the correlation and conditional dependences among different patterns in order to predict future
accesses.
A different prediction strategy relies on Markov Chains (Bolch, Greiner, Meer & Trivedi, 1998),
representing a sequence of application state changes, with given probability for each transition. Thus,
the complete application execution can be represented as a sequence of state changes. By knowing the
probability of each such change, it is possible to predict future application behavior (Dodonov et al.,
2006). This approach is further extended by Hidden Markov Chains Model (HMM), which considers
series of state changes to realize more complex predictions (Madhyastha & Reed, 1997), and Kalman
filter, which aims at forecasting the behavioral trends (Kohler, 1997).
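A minimal first-order Markov predictor (a sketch under the assumption that execution states have already been reduced to discrete labels, for example by the classification stage) estimates the transition probabilities and returns the most likely next state:

import numpy as np

def transition_matrix(states):
    labels = sorted(set(states))
    idx = {s: i for i, s in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for a, b in zip(states, states[1:]):
        counts[idx[a], idx[b]] += 1            # count observed state changes
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1                # avoid division by zero for absorbing states
    return labels, counts / row_sums

def predict_next(current, labels, P):
    return labels[int(np.argmax(P[labels.index(current)]))]

seq = ["cpu", "cpu", "io", "comm", "cpu", "io", "comm", "cpu"]   # illustrative state labels
labels, P = transition_matrix(seq)
print(predict_next("io", labels, P))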
It is also possible to employ similarity and clustering techniques, such as the ones used by the SOM
neural network, to forecast future application behavior. Among such techniques are Temporal Kohonen
Maps, or TKM (Chappell & Taylor, 1993), which aim at classifying the application behavior according to its temporal states. This technique introduces the concept of short-term memory, which is used to
store the historical neighborhood changes for each application state.
A different strategy is used by auto-regressive SOM (AR-SOM) model (Lampinen & Oja, 1989),
which represents each application state as an auto-regressive vector, composed of a time series of past
application behaviors. Therefore, such an approach allows using the traditional SOM model to predict possible application behavior changes. Another approach, also based on the SOM network, is represented by
the Vector-Quantized Temporal Associative Memory (VQTAM) network, proposed by Barreto and Araujo
(2004). This method combines the application behavior classification with an associative memory to
predict future application events.
The short-term memory concept for data classification and prediction is employed by recurrent neural networks, characterized by usage of connections to store recent events, such as the Elman network
(Kremer, 1995), which employs a special context layer to maintain historical events, or Hopfield network
(Hopfield, 1988) that acts as an associative memory between input and output patterns. Such networks
provide a fast access to recent events, and are highly efficient for short-time predictions. However, while
such approaches can be effective for short-term prediction, they do not provide adequate results for long-term sequences, requiring a different approach. Therefore, the Long Short-Term Memory (LSTM) network was introduced in research by Hochreiter, Schmidhuber and Gers (Hochreiter & Schmidhuber, 1997;
Gers & Schmidhuber, 2000), combining conventional short-term memory with a long-term prediction,
noise detection and reduction techniques.
The network aims at maintaining a constant error rate flow, preventing false positives and resulting in a superior prediction precision when compared to other approaches, as shown in several works
(Pérez-Ortiz, Gers & Schmidhuber, 2003). The long-term prediction is differentiated from short-term by
introducing memory gates which control whether the memory content should be modified. This approach
results in an effective prediction for both short and long-range sequences, even over noisy observations.
It also requires a lower computational cost. However, the network topology must be carefully planned
for an efficient prediction.
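A hedged sketch of an LSTM-based predictor using the Keras API (the window length, the three monitored metrics and the synthetic target are assumptions, not data from the chapter):

import numpy as np
import tensorflow as tf

# 500 windows of 10 time steps, each with 3 monitored metrics; predict the next value of metric 0
X = np.random.rand(500, 10, 3).astype("float32")
y = X[:, -1, 0]

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 3)),   # gated memory cells for short- and long-term context
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)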
A novel approach for behavior prediction was simultaneously introduced by several authors, represented
by techniques such as Echo State Networks, Liquid State Machines and Backpropagation-Decorrelation,
all known under the name of Reservoir Computing (Jaeger, 2007). Their functionalities are similar to
recurrent networks, storing temporal information in internal network nodes. The proposed approaches
aim at providing efficient and low-cost prediction of series with a priori unknown behavior, being based
on a large network with randomly generated units, known as reservoir. The input signal is submitted to


the network, being mapped into a higher dimension by the reservoir dynamics. A supervising mechanism, controlled by the output weight vector, is trained to read the state of the reservoir and map it
according to the desired output. Such mapping is known as the echo property of the reservoir. As only
the output weights of the network are modified, it results in a constant execution time. The results
obtained from such techniques are promising, as they are able to outperform previously employed
techniques by several orders of magnitude according to Jaeger (2007).
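A minimal echo state network sketch (illustrative only; the reservoir size, spectral radius and the sine-wave task are assumptions) shows the key idea that only the linear readout is trained:

import numpy as np

def esn_fit_predict(u, y, n_res=300, rho=0.9, seed=0):
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, n_res)
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= rho / max(abs(np.linalg.eigvals(W)))     # scale spectral radius below 1 (echo property)
    x, states = np.zeros(n_res), []
    for u_t in u:
        x = np.tanh(W_in * u_t + W @ x)           # fixed, randomly generated reservoir dynamics
        states.append(x.copy())
    S = np.array(states)
    W_out, *_ = np.linalg.lstsq(S, y, rcond=None) # only the output weights are trained
    return S @ W_out

t = np.linspace(0, 20 * np.pi, 2000)
pred = esn_fit_predict(np.sin(t), np.sin(t + 0.3))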
Among the limitations of reservoir computing are the learning instability, inflicted by unconstrained
training, and the size of reservoir network, which is often composed of thousands of neurons. Aiming
at overcoming such limitations, a new approach was introduced recently by Gao et al. (2007), known as the Spiral Recurrent Neural Network (SRNN). The proposed network introduces a fading memory feature by combining a trainable hidden recurrent layer with the echo property of reservoir computing,
resulting in highly effective prediction for both short-term and long-term periods according to the
researchers.

APPLICATIONS
The usage of process behavior classification and prediction aiming at performance improvements in
distributed systems has received considerable attention over the last decades, and currently is used in
different areas of high-performance computing.
Devarakonda and Iyer (1989) use the k-means technique to evaluate the application history, determining the critical execution points and overall resource utilization at different execution states.
Linear classification, as performed by SVM, can be used for pattern prediction over long disordered
sequences, as demonstrated by Hirose et al. (2007), and for time series prediction, as shown by Muller
et al. (1997) and Chen et al. (2004).
Markov chains and time-delay neural network are employed for access pattern detection and prediction in distributed systems by Sakr et al. (1996) and Dodonov and Mello (2006, 2007). The works
demonstrated the efficiency of such techniques for correct access pattern representation in distributed
systems, aiming at supporting data prefetching and load balancing.
Artificial intelligence techniques are used for distributed scheduling in the MidHPC project (Mello
et al., 2007), intended for transparent execution of concurrent applications over large distributed environments. Application behavior knowledge is employed by the Route scheduling algorithm (Mello, Senger, & Yang, 2006), which deploys jobs based on the observed behavior. The Route algorithm is further extended by the RouteGA algorithm (Mello, Andrade, Senger, & Yang, 2007), which takes
scheduling decisions considering the network analysis performed by a genetic algorithm, and migrates
processes according to the network environment configuration (Dodonov, Mello, & Yang, 2005).
The nature of data accesses in distributed systems is studied by Kroeger and Long (1999), demonstrating that the analysis of execution history can improve the prediction results by up to 80%.
Similar results were obtained in research by Byna et al. (2004), evaluating MPI accesses in distributed
applications. The prediction of process behavior is also used by Martin et al. (2003) to improve the
trade-off between latency and bandwidth in shared memory multiprocessors.
Neural network-based approaches are also used for job scheduling in grid environments by Ishii,
Mello and Yang (2007), as well as for user behavior extraction and classification (Santos, Mello, &
Yang, 2007).


Finally, a series of innovative process scheduling approaches based on application behavior analysis
are studied in research by Mello and Yang (2008), evaluating chaos theory approaches, and by Nery et al. (2006), who distribute processes over the network using ant colony optimization.

CONCLUSION
In this chapter, we presented different techniques for application behavior extraction, classification and
prediction, ranging from source code evaluation approaches and process execution tracing to platform-specific techniques, such as MPI-based approaches and distributed application execution monitoring.
The correct identification and classification of different process execution states are responsible for
the effectiveness of behavior prediction. Therefore, several statistical and artificial intelligence-based
approaches were studied for access pattern extraction, classification and prediction.
Finally, applications based on different behavior prediction strategies were discussed in this chapter
to illustrate the effectiveness of such techniques.

REFERENCES
Albertini, M. K., & Mello, R. F. (2007). A self-organizing neural network for detecting novelties. In SAC '07: Proceedings of the 2007 ACM Symposium on Applied Computing (pp. 462-466). New York: ACM Press.
Andrade Filho, J. A., Mello, R. F., Dodonov, E., Senger, L. J., Yang, L. T., & Li, K. C. (2008). Toward an efficient middleware for multithreaded applications in computational grid. In IEEE International Conference on Computational Science and Engineering (pp. 147-154).
Arpaci-Dusseau, A. C., Culler, D. E., & Mainwaring, M. (1998). Scheduling with implicit information in distributed systems. In Proceedings of ACM SIGMETRICS '98 (pp. 233-248).
Barreto, G. A., & Araujo, A. F. R. (2004). Identification and control of dynamical systems using the self-organizing map. IEEE Transactions on Neural Networks (Special Issue on Temporal Coding), 15, 1244-1259.
Berthold, M. (1994). A time delay radial basis function network for phoneme recognition. In Proceedings of the IEEE World Congress on Computational Intelligence / 1994 IEEE International Conference on Neural Networks (Vol. 7).
Bianchini, R., Pinto, R., & Amorim, C. L. (1998). Data prefetching for software DSMs. In International Conference on Supercomputing (pp. 385-392).
Bolch, G., Greiner, S., de Meer, H., & Trivedi, K. S. (1998). Queueing networks and Markov chains: Modeling and performance evaluation with computer science applications. New York: Wiley-Interscience.
Bradley, P. S., Fayyad, U. M., & Reina, C. (1998). Scaling clustering algorithms to large databases. In Knowledge Discovery and Data Mining (pp. 9-15).


Byna, S., Sun, X.-H., Gropp, W., & Thakur, R. (2004). Predicting memory-access cost based on data-access patterns. In Cluster '04: Proceedings of the 2004 IEEE International Conference on Cluster Computing (pp. 327-336). Washington, DC: IEEE Computer Society.
Carpenter, G. A., & Grossberg, S. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919-4930. doi:10.1364/AO.26.004919
Chan, A., Gropp, W., & Lusk, E. (2003). User's guide for MPE extensions for MPI programs. Technical Report ANL-98/xx, Argonne National Laboratory, 1998. Retrieved from ftp://ftp.mcs.anl.gov/pub/mpi/mpeman.ps
Chappell, G. J., & Taylor, J. G. (1993). The temporal Kohonen map. Neural Networks, 6(3), 441-445. doi:10.1016/0893-6080(93)90011-K
Chen, B. J., Chang, M. W., & Lin, C. J. (2004). Load forecasting using support vector machines: A study on EUNITE competition 2001. IEEE Transactions on Power Systems, 19(4), 1821-1830. doi:10.1109/TPWRS.2004.835679
Corbalan, J., Martorell, X., & Labarta, J. (2001). Improving gang scheduling through job performance analysis and malleability. In International Conference on Supercomputing (pp. 303-311). Sorrento, Italy.
Day, S. P., & Davenport, M. R. (1993, March). Continuous-time temporal back-propagation with adaptable time delays. IEEE Transactions on Neural Networks, 4(2), 348-354. doi:10.1109/72.207622
de Mello, R. F., Filho, J. A. A., Senger, L. J., & Yang, L. T. (2007). RouteGA: A grid load balancing algorithm with genetic support. In AINA (pp. 885-892). New York: IEEE Computer Society.
de Mello, R. F., Senger, L. J., & Yang, L. T. (2006). Performance evaluation of Route: A load balancing algorithm for grid computing. Research Initiative, Treatment Action, 13(1), 87-108.
Devarakonda, M. V., & Iyer, R. K. (1989). Predictability of process resource usage: A measurement-based study on UNIX. IEEE Transactions on Software Engineering, 15(12), 1579-1586. doi:10.1109/32.58769
Dodonov, E., Mello, R., & Yang, L. T. (2006). Adaptive technique for automatic communication access pattern discovery applied to data prefetching in distributed applications using neural networks and stochastic models. In Proceedings of ISPA '06.
Dodonov, E., & Mello, R. F. (2007). A model for automatic on-line process behavior extraction, classification and prediction in heterogeneous distributed systems. In CCGRID '07: Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (pp. 899-904). Washington, DC: IEEE Computer Society.
Dodonov, E., Mello, R. F., & Yang, L. T. (2005). A network evaluation for LAN, MAN and WAN grid environments. In L. T. Yang, M. Amamiya, Z. Liu, M. Guo, & F. J. Rammig (Eds.), EUC (LNCS Vol. 3824, pp. 1133-1146). Berlin: Springer.


Dodonov, E., Sousa, J. Q., & Guardia, H. C. (2004). GridBox: Securing hosts from malicious and greedy applications. In Proceedings of the 2nd Workshop on Middleware for Grid Computing (pp. 17-22). New York: ACM Press.
dos Santos, M. L., de Mello, R. F., & Yang, L. T. (2007). Extraction and classification of user behavior. In T.-W. Kuo, E. H.-M. Sha, M. Guo, L. T. Yang, & Z. Shao (Eds.), EUC (LNCS Vol. 4808, pp. 493-506). Berlin: Springer.
Fang, W., Wang, C. L., Zhu, W., & Lau, F. C. M. (2004). PAT: A postmortem object access pattern analysis and visualization tool. In CCGRID 2004, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (pp. 379-386). Chicago: IEEE Computer Society.
Feitelson, D. G., Rudolph, L., Schwiegelshohn, U., Sevcik, K. C., & Wong, P. (1997). Theory and practice in parallel job scheduling. In Job Scheduling Strategies for Parallel Processing (LNCS Vol. 1291, pp. 1-34). Berlin: Springer Verlag.
Fischer, P. C. (1965). On formalisms for Turing machines. Journal of the ACM, 12(4), 570-580. doi:10.1145/321296.321308
Fritzke, B. (1995). A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7 (pp. 625-632). Cambridge, MA: MIT Press.
Gabriel, E., Fagg, E. G., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., et al. (2004). Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings of the 11th European PVM/MPI Users' Group Meeting (pp. 97-104).
Gao, H., Sollacher, R., & Kriegel, H. P. (2007). Spiral recurrent neural network for online learning. In Proceedings of the 15th European Symposium on Artificial Neural Networks (pp. 483-488).
Gers, F. A., & Schmidhuber, J. (2000). Long short-term memory learns context-free and context-sensitive languages. In Proceedings of IDSIA, 3.
Gibbons, R. (1997). A historical application profiler for use by parallel schedulers. In Job Scheduling Strategies for Parallel Processing (pp. 58-77). Berlin: Springer Verlag.
Gropp, W., Lusk, E., Doss, N., & Skjellum, A. (1996, September). A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6), 789-828. doi:10.1016/0167-8191(96)00024-5
He, J., Tan, A.-H., & Tan, C.-L. (2004). Modified ART 2A growing network capable of generating a fixed number of nodes. IEEE Transactions on Neural Networks, 15(3), 728-737. doi:10.1109/TNN.2004.826220
Hirose, S., Shimizu, K., Kanai, S., Kuroda, Y., & Noguchi, T. (2007). POODLE-L: A two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics (Oxford, England), 23(16), 2046-2053. doi:10.1093/bioinformatics/btm302
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735


Hopfield, J. J. (1988). Neural networks and physical systems with emergent collective computational abilities. Neurocomputing: Foundations of Research, 457-464.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366. doi:10.1016/0893-6080(89)90020-8
Huband, S., & McDonald, C. (2001). A preliminary topological debugger for MPI programs. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid (pp. 422-429).
Ishii, R. P., de Mello, R. F., & Yang, L. T. (2007). A complex network-based approach for job scheduling in grid environments. In R. H. Perrott, B. M. Chapman, J. Subhlok, R. F. de Mello, & L. T. Yang (Eds.), HPCC (LNCS Vol. 4782, pp. 204-215). Berlin: Springer.
Ishikawa, Y., Matsuda, M., Kudoh, T., Tezuka, H., & Sekiguchi, S. (2003). The design of a latency-aware MPI communication library. In SWoPP '03.
Jaeger, H. (2007). Echo state network. Scholarpedia, 2(9), 2330. Available at http://www.scholarpedia.org/article/Echo_state_network
Jain, R. (1991). The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modeling. New York: John Wiley and Sons.
Jiang, D., Tang, C., & Zhang, A. (2004). Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16(11), 1370-1386. doi:10.1109/TKDE.2004.68
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82(Series D), 35-45.
Kaski, S., & Oja, E. (1999). Kohonen maps. New York: Elsevier Science Inc.
Keskar, D., & Leibowitz, M. (2005). Speeding up OpenOffice: Profiling, tools, approaches. In First OpenOffice.org Conference.
Kotz, D., & Ellis, C. S. (1993). Practical prefetching techniques for multiprocessor file systems. Journal of Distributed and Parallel Databases, 1(1), 33-51. doi:10.1007/BF01277519
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(4), 1000-1004. doi:10.1109/72.392262
Kroeger, T. M., & Long, D. D. E. (1999). The case for efficient file access pattern modeling. In Workshop on Hot Topics in Operating Systems (pp. 14-19).
Kunze, M., & Steffens, J. (1995). Growing cell structure and neural gas: Incremental neural networks. In Proceedings of the 4th AIHEP Workshop, Pisa, Italy.
Lampinen, J., & Oja, E. (1989). Self-organizing maps for spatial and temporal AR models. In M. Pietikäinen & J. Röning (Eds.), Proceedings of the 6th SCIA, Scandinavian Conference on Image Analysis (pp. 120-127). Helsinki, Finland: Suomen Hahmontunnistustutkimuksen Seura r.y.


Lei, H., & Duchamp, D. (1997). An analytical approach to file prefetching. In 1997 USENIX Annual Technical Conference, Anaheim, USA.
Liao, T. W., Celmins, A. K., & Hammell, R. J., II (2003). A fuzzy c-means variant for the generation of fuzzy term sets. Fuzzy Sets and Systems, 135(2), 241-257. doi:10.1016/S0165-0114(02)00136-7
Loiseaux, C., Graf, S., Sifakis, J., Bouajjani, A., & Bensalem, S. (1995). Property preserving abstractions for the verification of concurrent systems. Formal Methods in System Design, 6(1), 11-44. doi:10.1007/BF01384313
Madhyastha, T. M., & Reed, D. A. (1997). Input/output access pattern classification using hidden Markov models. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems (pp. 57-67). San Jose, CA: ACM Press.
Marinai, S., Gori, M., & Soda, G. (2005). Artificial neural networks for document analysis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1), 23-35. doi:10.1109/TPAMI.2005.4
Marsland, S., Shapiro, J., & Nehmzow, U. (2002). A self-organizing network that grows when required. Neural Networks, 15(8-9), 1041-1058. doi:10.1016/S0893-6080(02)00078-3
Martin, M., Harper, P., Sorin, D., Hill, M., & Wood, D. (2003, June). Using destination-set prediction to improve the latency/bandwidth tradeoff in shared memory multiprocessors. In Proceedings of the 30th Annual International Symposium on Computer Architecture.
Mehrotra, S., & Harrison, L. (1996). Examination of a memory access classification scheme for pointer-intensive and numeric programs. In ICS '96 (pp. 133-140).
Mello, R. F., Andrade, J. A., Dodonov, E., Ishii, R. P., & Yang, L. T. (2007). Optimizing distributed data access in grid environments by using artificial intelligence techniques. In I. Stojmenovic, R. K. Thulasiram, L. T. Yang, W. Jia, M. Guo, & R. F. de Mello (Eds.), ISPA '07 (LNCS Vol. 4742, pp. 125-136). Berlin: Springer.
Mello, R. F., & Senger, L. J. (2004). A new migration model based on the evaluation of processes load and lifetime on heterogeneous computing environments. In 16th Symposium on Computer Architecture and High Performance Computing (SBAC 2004), Foz do Iguaçu, PR, Brazil (pp. 222-227).
Mello, R. F., Senger, L. J., & Yang, L. T. (2005). Automatic text classification using an artificial neural network. High Performance Computational Science and Engineering, 1, 121.
Mello, R. F., & Yang, L. T. (2008). Prediction of dynamical, non-linear and unstable process behavior. Journal of Supercomputing. Dordrecht, The Netherlands: Springer Netherlands.
Morawski, P. (1989). Understanding Bayesian belief networks. AI Expert, 4(5), 44-48.
Müller, K., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. In Proceedings of the International Conference on Artificial Neural Networks (pp. 999-1004).


Nagel, W. E., Arnold, A., Weber, M., Hoppe, H. C., & Solchenbach, K. (1996). VAMPIR: Visualization and analysis of MPI resources. Supercomputer, 12(1), 69-80.
Nery, B. R., de Mello, R. F., de Carvalho, A. C. P. L. F., & Yang, L. T. (2006). Process scheduling using ant colony optimization techniques. In M. Guo, L. T. Yang, B. D. Martino, H. P. Zima, J. Dongarra, & F. Tang (Eds.), ISPA (LNCS Vol. 4330, pp. 304-316). Berlin: Springer.
Nethercote, N., & Fitzhardinge, J. (2004, January). Bounds-checking entire programs without recompiling. In Informal Proceedings of the Second Workshop on Semantics, Program Analysis, and Computing Environments for Memory Management (SPACE 2004), Venice, Italy.
Pérez-Ortiz, J. A., Gers, F. A., Eck, D., & Schmidhuber, J. (2003). Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Networks, 16(2).
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. Clarendon Press, Institute of Mathematics and Its Applications Conference Series (pp. 143-167).
Rosca, J. P., Erdogmus, D., Principe, J. C., & Haykin, S. (Eds.). (2006). Independent component analysis and blind signal separation. In Proceedings of the 6th International Conference, ICA 2006, Charleston, SC. New York: Springer.
Sakr, M., Giles, C., Levitan, S., Horne, B., Maggini, M., & Chiarulli, D. (1996). On-line prediction of multiprocessor memory access patterns. In Proceedings of the IEEE International Conference on Neural Networks (pp. 1564-1569).
Schölkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond (adaptive computation and machine learning). Cambridge, MA: The MIT Press.
Schuster, A. (2003). Scalable distributed model checking: Experiences, lessons, and expectations. Electronic Notes in Theoretical Computer Science, 89(1).
Senger, L. J., Mello, R. F., Santana, M. J., & Santana, R. C. (2005). An on-line approach for classifying and extracting application behavior on Linux. In L. T. Yang & M. Guo (Eds.), High Performance Computing: Paradigm and Infrastructure (chap. 20). New York: John Wiley and Sons.
Silva, F. A. B. D., & Scherson, I. D. (2000). Improving parallel job scheduling using runtime measurements. In D. G. Feitelson & L. Rudolph (Eds.), Job Scheduling Strategies for Parallel Processing (LNCS Vol. 1911, pp. 18-38). Berlin: Springer Verlag.
Smith, W., Foster, I. T., & Taylor, V. E. (1998). Predicting application run times using historical information. In JSSPP (pp. 122-142).
Squyres, J. M., & Lumsdaine, A. (2003). A component architecture for LAM/MPI. In Proceedings of the 10th European PVM/MPI Users' Group Meeting (pp. 379-387). Venice, Italy: Springer-Verlag.
Suykens, J. A., & Vandewalle, J. (2000). The K. U. Leuven competition data: A challenge for advanced neural network techniques. In ESANN (pp. 299-304).


Thacker, N. A., Abraham, I., & Courtney, P. (1997). Supervised learning extensions to the CLAM network. Neural Networks, 10(2), 315-326. doi:10.1016/S0893-6080(96)00074-3
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 328-339. doi:10.1109/29.21701
Weigend, A. S., & Gershenfeld, N. A. (1994). Time series prediction: Forecasting the future and understanding the past. In A. S. Weigend & N. A. Gershenfeld (Eds.), Santa Fe Institute Studies on the Sciences of Complexity, Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis, Santa Fe, New Mexico, May 14-17, 1992. New York: Addison-Wesley.
Wilding, M., & Behman, D. (2005). Self-service Linux: Mastering the art of problem determination (1st ed.). Upper Saddle River, NJ: Prentice Hall.
Zemouri, R., Racoceanu, D., & Zerhouni, N. (2003). Recurrent radial basis function network for time-series prediction. Engineering Applications of Artificial Intelligence, 16(5-6), 453-463. doi:10.1016/S0952-1976(03)00063-0

KEY TERMS AND DEFINITIONS


Application Behavior Extraction: The process of extraction and transcription of events observed
during the application execution.
Application Behavior Classification: Evaluation of extracted application behavior, aiming at determining the most representative execution patterns and reducing the data dimensionality.
Application Behavior Prediction: Forecasting of future application actions based on previously
observed behavior.
Application Knowledge: The transcription of resource usage by applications during the course of
execution.
Data Prefetching: Anticipated reading of data elements according to the forecasted execution patterns, aiming at reducing the access latency.
Process Execution States: Set of information which defines the process behavior on a given time
instant.
Process Scheduling: Allocation of applications across the environment, aiming at reducing the system
idleness and minimizing the total execution time.


Chapter 16

A Structured Tabu Search Approach for
Scheduling in Parallel Computing Systems
Tore Ferm
Sydney University, Australia
Albert Y. Zomaya
Sydney University, Australia

ABSTRACT
Task allocation and scheduling are essential for achieving the high performance expected of parallel
computing systems. However, there are serious issues pertaining to the efficient utilization of computational resources in such systems that need to be resolved, such as achieving a balance between system
throughput and execution time. Moreover, many scheduling techniques involve massive task graphs
with complex precedence relations, processing costs, and inter-task communication costs. In general,
there are two main issues that should be highlighted: problem representation and finding an efficient
solution in a timely fashion. In the work proposed here, the authors have attempted to overcome the
first problem by using a structured model which offers a systematic method for representing the
scheduling problem and can encode almost all of the parameters involved. To address the second problem, a Tabu Search algorithm is used to
allocate tasks to processors in a reasonable amount of time. The use of Tabu Search has the advantage
of obtaining solutions to more general instances of the scheduling problem in reasonable time spans.
The efficiency of the proposed framework is demonstrated by using several case studies. A number of
evaluation criteria will be used to optimize the schedules. Communication- and computation-intensive
task graphs are analyzed, as are a number of different task graph shapes and sizes.

DOI: 10.4018/978-1-60566-661-7.ch016

INTRODUCTION
The impressive proliferation of parallel processor systems in a great variety of
applications is the result of many breakthroughs over the last two decades. These breakthroughs span
a wide range of specialities, such as device technology, computer architectures, theory, and software
tools. However, there remain many problems that need to be addressed which will keep the research
community busy for years to come (Zomaya, 1996).
The scheduling problem involves the allocation of a set of tasks or jobs to resources, such that the
optimum performance is obtained. If these tasks are not inter-dependent the problem is known as task
allocation. In a parallel computing system one would expect a linear speedup in performance when more
processors (or computers) are employed. However, in practice, this is generally not the case, due to such
factors as communication overhead, control overhead, and precedence constraints between tasks (Lee
et al., 2008). Thus, the development of efficient scheduling techniques would improve the operation of
parallel processor systems.
The efficiency of a parallel processor system is commonly measured by completion time, speedup,
or throughput, which in turn reflect the quality of the scheduler. Many heuristic algorithms have already
been developed which provide effective solutions. Most of these methods, however, can solve only
limited instances of the scheduling problem (El-Rewini, 1996; Macey and Zomaya, 1997; Nabhan and
Zomaya, 1997).
The scheduling problem is known to be NP-complete for the general case and even for many restricted
instances (Salleh and Zomaya, 1999). For this reason, scheduling is usually handled by heuristic methods which provide reasonable solutions for restricted instances of the problem (El-Rewini, 1996). Most
research on scheduling has dealt with the problem when the tasks, inter-processor communication costs,
and precedence constraints are fully known. When the task information is known a priori, the problem is
known as static scheduling. On the other hand, when there is no a priori knowledge about the tasks the
problem is known as dynamic scheduling. For dynamic scheduling problems with precedence constraints,
optimal scheduling algorithms are not known to exist (Lee and Zomaya, 2008).
In non-preemptive scheduling, once a task has begun on a processor, it must run to completion before
another task can start execution on the same processor. In preemptive scheduling, it is possible for a task
to be interrupted during its execution, and resumed from that position on the same or any other processor, at a later time. Although preemptive scheduling requires additional overhead, due to the increased
complexity, it may perform more effectively than non-preemptive methods (El-Rewini, 1996).
Furthermore, a non-adaptive scheduler does not change its behaviour in response to feedback from
the system. This means that it is unable to adapt to changes in system activity. In contrast, an adaptive
scheduler changes its scheduling according to the recent history and/or current behaviour of the system
(Zomaya and Teh, 2001, Seredynski and Zomaya, 2002). In this way, adaptive schedulers may be able
to adapt to changes in system use and activity. Adaptive schedulers are usually known as dynamic, since
they make decisions based on information collected from the system.

TASK SCHEDULING AND PROBLEM FORMULATION


An application can be represented by a Directed Acyclic Graph (DAG), G = (V,E), where V is the set
of v nodes, and E is the set of e edges. A node ni in a DAG represents a task, and the corresponding
weight wi represents the computational cost required to complete that task. An edge (i,j), connecting
nodes ni and nj (in the direction from i to j), represents a task precedence constraint, in which ni must be
completed before nj can begin, with the corresponding weight cij representing the communication cost
of sending the required data to task j from task i. This communication cost is only required if nodes ni
and nj are scheduled onto different processors. A node with no incoming edges is known as an entry
task, and one without any outgoing edges is known as an exit task. In the case of multiple entry or exit
tasks, a pseudo-entry/exit node is created that has zero-cost edges connecting it to all entry/exit nodes.
This simplifies the DAG and does not affect the schedule. A task is considered a ready task if all its
precedence constraints have been met, i.e. all parent nodes have completed.
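To make this notation concrete, the following minimal Python sketch shows one possible in-memory representation of such a weighted task graph. The class and field names are ours, chosen for illustration only; they are not taken from the chapter.

# A minimal sketch (Python) of a weighted task graph; all names are illustrative.
class TaskGraph:
    def __init__(self):
        self.comp_cost = {}   # node i -> computational cost wi
        self.comm_cost = {}   # edge (i, j) -> communication cost cij, with i preceding j
        self.succ = {}        # node i -> successor nodes
        self.pred = {}        # node i -> predecessor nodes

    def add_task(self, i, cost):
        self.comp_cost[i] = cost
        self.succ.setdefault(i, [])
        self.pred.setdefault(i, [])

    def add_edge(self, i, j, cost):
        # Precedence constraint: task i must complete before task j may begin.
        self.comm_cost[(i, j)] = cost
        self.succ[i].append(j)
        self.pred[j].append(i)

    def entry_tasks(self):
        return [i for i in self.comp_cost if not self.pred[i]]

    def exit_tasks(self):
        return [i for i in self.comp_cost if not self.succ[i]]

    def ready_tasks(self, completed):
        # A task is ready once all of its parent nodes have completed.
        return [i for i in self.comp_cost
                if i not in completed and all(p in completed for p in self.pred[i])]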
The goal of scheduling a DAG is to reduce the fitness criteria specified, by mapping the tasks onto
processors, properly ordering the tasks on these processors, and by ensuring that all precedence constraints are met. The most common fitness criterion is a simple measure of the length of the schedule
(makespan) produced. However, other methods exist, such as: minimizing the amount of communication
across the interconnection network; load balancing the computation as equally as possible among all
processors; minimizing idle time on the processors; or any combination of these. Some heuristics also
aim at the parallel system architecture and attempt to minimize the setup costs of the parallel processors
(or computers) (Bruno et al., 1974; Dogan and Özgüner, 2002).
There are a number of different broad techniques that have been developed for solving the task scheduling
problem. List-based techniques are the most common, and are popular because they produce competitive
solutions with relatively low time complexity when compared to the other techniques. The two steps
that comprise most list-based techniques are task prioritisation and processor selection, where tasks are
prioritised based upon a prioritising function and subsequently mapped onto a processor. (This second
step is trivial for homogeneous systems, where the processor speed does not matter.) The algorithms
maintain a list of all the tasks ordered by their priority.
In clustering techniques, initial clusters contain a single task. An iteration of the heuristic improves
the clustering by combining some of the clusters, should the resulting combined cluster reduce the finish
time. This technique requires an additional step, when compared to list-based techniques, which involves
mapping an arbitrary number of clusters onto a bounded number of processors (or further merging the
clusters so that the number of clusters matches the number of processors available).
Duplication-based techniques are another design which revolves around reducing the amount of
communication overhead in a parallel execution. By (redundantly) duplicating certain tasks
and running them on more than one processor, the precedence constraints are maintained, but the communication from the duplicated task to a child task is eliminated.
Finally, the most computationally expensive of the scheduling techniques is the guided random search
technique. This technique involves some very popular algorithms, such as Simulated Annealing, Genetic
Algorithms, Tabu Search and Neural Networks. These algorithms generally require a set of parameters
especially tailored to the problem they are attempting to solve. A variety of these different approaches
have been compared in terms of efficiency, and Tabu Search performed roughly in the middle
of all the approaches on all the tests performed (Siegel and Ali, 2000).
In general, the efficient management of both the processors and communication links of a parallel
and distributed system is essential in order to obtain high performance (Kwok and Ahmad, 1999). It is
unfortunate that the communication links are often the bottleneck in a distributed system, and processors
often end up wasting cycles idling while waiting for data from another processor in order to proceed.
One different approach is a heuristic that attempts to increase the idle time on a given processor for
extended periods of time so that power consumption is reduced for that processing resource (Zomaya
and Chan, 2005).


TABU SEARCH
Tabu Search (TS) is best thought of as an intelligent, iterative Hill Descent algorithm, which avoids
becoming stuck in local minima by using short- and long-term memory. It has gained a number of key
influences from a variety of sources; these include early surrogate constraint methods and cutting plane
approaches (Glover and Laguna, 2002). TS incorporates adaptive memory and responsive exploration as
prime aspects of the search; in this way it can be considered "intelligent" problem solving. TS has been
proven to be effective when compared to a number of heuristics and algorithms previously proposed
(Porto and Ribeiro, 1995), and has been used to solve a wide variety of combinatorial optimisation
problems. A very useful introduction to TS can also be found in (Hertz et al., 1995), which demonstrates
the main attributes and applications of this powerful search technique.

Short Term Memory


Tabu Search's short-term memory focuses on restricting the search space to avoid revisiting solutions
and cycling. A list of previously visited solutions is recorded during the search which, along with a value
that represents the lifetime that the move is to remain tabu (or restricted), represents the history of the
search. The size of this tabu list can also affect the search considerably. If the size is too small, then the
primary goal of the short-term memory, preventing cycling, might not be achieved. Conversely, if it is
too large, then too many restrictions are created, which limit the search space covered. Unfortunately,
there is no exact method for determining the proper value to prevent cycling in an optimisation problem
such as task scheduling. The existence of the tabu list does have the side effect of preventing the search
from exploring certain areas of the search space in some cases.
Unlike Genetic Algorithms or Simulated Annealing, TS attempts to avoid randomization whenever
possible, employing it only when other implementation approaches are cumbersome.

Longer Term Memory


Long-term memory is generally frequency-based, and keeps track of the number of times a certain attribute has occurred and the value of the solution while it was present. Long-term memory is used in
TS in order to apply graduated tabu states, which can be used to define penalty and incentive values to
modify the evaluation of moves. This allows certain aspects of a solution to increase (or decrease) the
overall fitness of a move. In this way a move may be chosen not purely on makespan: an attribute it
contains may make the resulting solution beneficial, or may help lead the search into a
promising region of the search space.
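One way such frequency-based information could be folded into the evaluation of a move is sketched below in Python. The penalty and incentive weights are assumptions made purely for illustration; the chapter does not prescribe a specific formula.

# Sketch of frequency-based long-term memory; the weighting scheme is assumed.
from collections import defaultdict

class LongTermMemory:
    def __init__(self, penalty_weight=0.1, incentive_weight=0.1):
        self.count = defaultdict(int)   # attribute (e.g. a task-processor pair) -> occurrences
        self.best_seen = {}             # attribute -> best fitness observed while it was present
        self.penalty_weight = penalty_weight
        self.incentive_weight = incentive_weight
        self.iterations = 0

    def record(self, attributes, fitness):
        self.iterations += 1
        for a in attributes:
            self.count[a] += 1
            self.best_seen[a] = min(self.best_seen.get(a, fitness), fitness)

    def adjusted(self, fitness, attributes):
        # Penalize attributes that occur very frequently (a graduated tabu state) and
        # reward attributes that have historically appeared in good solutions.
        value = fitness
        for a in attributes:
            frequency = self.count[a] / max(1, self.iterations)
            value += self.penalty_weight * frequency * abs(fitness)
            if a in self.best_seen:
                value -= self.incentive_weight * max(0, fitness - self.best_seen[a])
        return value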

Intensification
Intensification strategies are based on modifying choice rules in order to guide the search towards solutions that have particularly promising attributes, which have been discovered to be historically good.
Similarly, they may also move the search entirely to a promising region of the search space, in order to analyze that area more thoroughly.


Diversification
Diversification is performed in TS in order to visit areas of the search space that may remain ignored due to the nature of the search. The easiest way to reach these areas is to perform a number
of random restarts from differing points on the search plane. Diversification strategies are employed
within TS for a variety of reasons. The chief among these reasons is to avoid cycling or visiting
the same set of solutions repeatedly. Other reasons for diversification include adding robustness
to the search and escaping from local optima.
Genetic Algorithms use randomisation and population based techniques in order to diversify
their search domain, while Simulated Annealing also uses randomisation in the form of the temperature function. Diversification is particularly useful when better solutions can only be reached
by crossing barriers in the solution space.

Implementation
An analysis of previous scheduling heuristics and algorithms has shown that they do not account
for the amount of communication present in the schedules produced. Many of the previous designs
have either ignored communication altogether, assumed communication is constant, or have used
communication but considered it to have no bearing on the outcome of the final schedule.
The TS implementation has been developed with a variety of schedule evaluation criteria, in
order to determine whether the schedules produced are effectively using the computational resources
available, while limiting the load on potentially constraining resources such as the interconnection network.
Each of these evaluation criteria also aims to minimize the makespan (or increase speedup) of the
schedules produced, in order for TS to remain competitive (Porto and Ribeiro, 1995).

Design Overview
As mentioned previously, TS has been proven to be an effective scheduling technique; thus it
provides a good basis to develop a scheduler for heterogeneous parallel systems. This TS implementation consists of a number of classes, each providing a necessary role in the TS; it has been
very loosely based on a TS skeleton that has been previously developed (Blesa et al., 2001):

Solution: Represents a solution to the task scheduling problem. Contains the current schedule and fitness rating, also providing methods to alter the solution legally (i.e. performing
moves).
Movement: Represents a move from one solution to a neighbouring one. This can either be
moving a task from one processor to another, or swapping the processor assignments of two
tasks.
Solver: Runs the actual TS and provides the means to keep track of the current solution and
the best solutions.
TabuStorage: A list of all the moves that have taken place; moves currently considered
tabu have a TabuLife greater than 0 iterations.

There are also two auxiliary classes:


Problem: Represents the Task Graph itself. Contains all the precedence constraints, task computation values (per processor), and inter-task communication values.
Setup: A simple class containing a number of variables that can be provided to the TS in order to
customise it further, including the Aspiration Plus parameters, TabuLife, etc.
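A rough Python outline of how these classes might fit together is given below. The class names follow the descriptions above, but the members, types and constructor signatures are our assumptions rather than the authors' actual code.

# Skeleton (Python) of the class structure described above; member details are assumed.
class Problem:
    # Task graph: precedence constraints, per-processor computation costs, edge communication costs.
    def __init__(self, comp_cost, comm_cost, predecessors):
        self.comp_cost = comp_cost          # comp_cost[task][processor]
        self.comm_cost = comm_cost          # comm_cost[(parent, child)]
        self.predecessors = predecessors    # predecessors[task] -> list of parent tasks

class Setup:
    # Tuning knobs handed to the search (Aspiration Plus parameters, TabuLife, iteration limit, ...).
    def __init__(self, tabu_life=10, asp_min=5, asp_max=200, asp_plus=20, max_iterations=3000):
        self.tabu_life = tabu_life
        self.asp_min, self.asp_max, self.asp_plus = asp_min, asp_max, asp_plus
        self.max_iterations = max_iterations

class Solution:
    # A schedule: an ordered task list per processor, plus the fitness of that schedule.
    def __init__(self, assignment):
        self.assignment = assignment        # assignment[processor] -> ordered list of tasks
        self.fitness = None

class Movement:
    # Either relocate one task to another processor, or swap the assignments of two tasks.
    def __init__(self, src, dst, task_a, task_b=None):
        self.src, self.dst, self.task_a, self.task_b = src, dst, task_a, task_b

class TabuStorage:
    # Moves made so far; a move is tabu while its remaining TabuLife is greater than 0.
    def __init__(self):
        self.life = {}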

The general diversification steps that exist within most TS implementations, in order to increase the
variety of the solutions explored, have been altered to include a system that automatically increases the
number of processors available on-the-fly. As the solution becomes stagnant at a particular number
of processors, another processor is added and the algorithm begins again from this new point. In this
way, the search need only be run once, with a maximum number of processors specified; the search
will proceed until it reaches a maximum set number of iterations, or reaches the processor limit. It can
be shown that the optimum number of processors may not always be the largest provided, depending
on the structure of the task graph itself. The generalized graphical structure of the TS implementation
algorithm is shown in Figure 1, with a brief algorithm description in pseudo code.

Figure 1. A description of the Tabu Search algorithm

Figure 2. The algorithm for generating an initial solution
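Under some simplifying assumptions, the outer loop just described could look like the Python sketch below; the stagnation limit of 100 non-improving moves is taken from the execution trace later in the chapter, while the helper callables and their names are hypothetical.

# Outer-loop sketch (Python). The greedy-start and neighbourhood routines are passed in as
# callables because their details are covered elsewhere; this is not the authors' exact code.
def run_search(initial_solution, best_admissible_move, max_iterations, max_processors,
               stagnation_limit=100):
    processors = 1
    current = best = initial_solution(processors)
    stagnant = 0
    for _ in range(max_iterations):
        current = best_admissible_move(current)          # respects the tabu list
        if current.fitness < best.fitness:
            best, stagnant = current, 0
        else:
            stagnant += 1
        # When the search stagnates, add a processor and restart from this new point,
        # until the processor limit is reached.
        if stagnant >= stagnation_limit:
            if processors >= max_processors:
                break
            processors += 1
            current = initial_solution(processors)
            stagnant = 0
    return best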

Initialization
The random task graphs for this work have been generated using Task Graphs For Free (TGFF) version 3.0 (Dick et al., 1998). There are three important aspects contained within the .tgff file: the
properties of the task graph itself, the processors (computation), and the network (communication).
The task graph properties are a listing of the nodes and edges of the task graph, also containing their
type. The edges also represent the precedence constraints within a DAG; there is a single list containing the edge-type-to-communication-cost relationship, as it is assumed that the network is constant
throughout the parallel system. The processor properties contain a list of types, with corresponding
computational cost (these map to the task types provided in the task graph properties section of the
file); there is a separate list for each processor.
In the initialisation phase, before the TS begins, the properties of the task graph are read into a
number of data structures, which are used to represent the tasks themselves, the precedence constraints, and the computational costs for each processor. The initial solutions for the current_solution
and best_solution are generated here using a greedy algorithm, so as to provide a starting point for
the algorithm.

Initial Solution
The initial solution of the TS implementation is generated by a greedy heuristic algorithm that ensures each
task's earliest finish time (EFT). The finish time is used instead of the start time on a heterogeneous
processing system, because the computational cost is not fixed.
At each iteration, a task is selected and assigned to the processor that will provide it with the EFT.
This algorithm maintains precedence constraints and benefits from the heterogeneity by allowing it
to make a decision based upon both the communication time (if the processor is different to a parent
node) and the computation cost of each particular processor (Figure 2).
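A minimal Python sketch of this greedy earliest-finish-time construction is given below. It assumes tasks are visited in some precedence-respecting order and that costs are stored as in the class skeleton sketched earlier; it is an illustration of the idea rather than the authors' implementation.

# Greedy EFT initial solution (Python sketch); data layout and helper names are assumed.
def greedy_eft(tasks_in_precedence_order, comp_cost, comm_cost, predecessors, processors):
    finish = {}                                     # task -> finish time
    placed_on = {}                                  # task -> processor
    proc_free = {p: 0 for p in range(processors)}   # processor -> time it becomes free
    for t in tasks_in_precedence_order:
        best_p, best_finish = None, None
        for p in range(processors):
            # Task t may start once p is free and all parent data has arrived; the
            # communication cost applies only when a parent sits on a different processor.
            ready = proc_free[p]
            for parent in predecessors.get(t, []):
                arrival = finish[parent]
                if placed_on[parent] != p:
                    arrival += comm_cost.get((parent, t), 0)
                ready = max(ready, arrival)
            f = ready + comp_cost[t][p]              # heterogeneous: cost depends on p
            if best_finish is None or f < best_finish:
                best_p, best_finish = p, f
        placed_on[t], finish[t] = best_p, best_finish
        proc_free[best_p] = best_finish
    return placed_on, finish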

Neighborhood
A neighborhood of solutions is obtained by removing a single task from the task list of one processor
and moving it to the task list of another. The entire neighbourhood is obtained by going through every
task, and moving it to every other processor in the system.
To expand upon the neighbourhood and to help expand the search area, we have added another move
category: the swap move. In this move, two tasks are selected and their processor assignments are
swapped. In this way the search can proceed further within fewer moves (if the move is worth making),
at the expense of some computational efficiency: since more moves are present in each neighbourhood,
more should be examined at each iteration of the search. It also allows the search to escape a poor area
of the solution space, by allowing the search to look ahead two moves, as opposed to a single step.
Thus a neighbouring solution is a solution that differs by a single task assignment, or the swapping
of two task assignments. Each move consists of relabelling the processor a task belongs to. A move
consists of a source and destination processor, and either one or two task ids.
A solution's neighbourhood is generated at every iteration of the search in the form of a move list, which
is the list of moves possible from the current solution to transform it into all possible neighbouring solutions. These moves do not include moves considered tabu, which are located in the tabu list.
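The move list can be generated along the lines of the following Python sketch, which enumerates every single-task relocation and every pairwise swap while filtering out moves that are currently tabu. The move encoding is our own.

# Neighbourhood sketch (Python): relocation moves plus swap moves, excluding tabu moves.
from itertools import combinations

def neighbourhood(assignment, processors, is_tabu):
    # assignment: task -> processor; is_tabu: predicate over a move tuple
    moves = []
    for task, src in assignment.items():
        for dst in range(processors):
            if dst != src:
                move = ("move", task, src, dst)
                if not is_tabu(move):
                    moves.append(move)
    for a, b in combinations(assignment, 2):
        if assignment[a] != assignment[b]:
            move = ("swap", a, b, assignment[a], assignment[b])
            if not is_tabu(move):
                moves.append(move)
    return moves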

Candidate List
We have used the Aspiration Plus candidate selection criteria in order to reduce the computational cost
of the TS implementation. In Aspiration Plus, moves are analyzed until a move under the given threshold is found, called "first". The search continues for "plus" moves, whereupon the best move found is made
(Rangaswamy, 1998). This strategy is further strengthened by the use of a "min" and "max" value: the search
will always analyze min moves, but never more than max.
The user specifies three variables (within the Setup class) for the Aspiration Plus candidate selection:


max: the maximum number of moves to analyze,
min: the minimum number of moves to analyze, and
plus: the number of moves to search after "first" is found.

Unfortunately, while the Aspiration Plus candidate selection strategy dramatically reduces the computational cost of the TS by several orders of magnitude, there remains the possibility that good moves
are not evaluated in a given iteration, because only a small subset of moves is considered.
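The selection rule can be expressed compactly, as in the hedged Python sketch below: scan the move list until a move beats the threshold (first), keep scanning for plus further moves, but always look at at least min and never more than max moves, returning the best move seen. The evaluation callable is an assumption.

# Aspiration Plus candidate selection (Python sketch).
def select_candidate(moves, evaluate, threshold, asp_min, asp_max, asp_plus):
    best_move, best_value, first_index = None, None, None
    for index, move in enumerate(moves):
        if index >= asp_max:
            break                         # never analyze more than "max" moves
        if first_index is not None and index >= max(asp_min, first_index + asp_plus):
            break                         # stop "plus" moves after "first", but never before "min"
        value = evaluate(move)
        if best_value is None or value < best_value:
            best_move, best_value = move, value
        if first_index is None and value < threshold:
            first_index = index           # the "first" qualifying move
    return best_move, best_value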

Tabu List
The main mechanism for using memory within TS is the tabu list, a list of all the moves that have been
made so far, along with a time limit specifying the number of iterations that they are to remain tabu.
Each move made within the search is stored inversely within the tabu list, with its associated tabulife.


Table 1. Evaluation criteria

Minimize Length (makespan): The goal of most scheduling research, to reduce the overall length of
the solution schedule.
Minimize Communication: Minimizes the total amount of communication present in the final solution
as well as the length of the schedule. A good compromise between finding a good schedule and an
efficient use of network resources.
Load Balance: Spreads the load of computation as evenly as possible among the available processors,
while still attempting to keep the overall length of the schedule to a minimum.
Combination: This function attempts to minimize communication, balance the load as much as possible
among the component processors, and minimize the length of the final solution.

This short-term memory function is used to prevent the search from cycling or returning to previously
visited solutions easily.
The tabu list must be checked before a move is to be applied, in order to ensure that a tabu move is
not made. Merely being located within the tabu list is not enough; the move must also have a tabulife
value greater than zero to be considered tabu. If a move located within the tabu list has a tabulife value
of zero, it means that the inverse move had been made previously, has since become non-tabu, and thus
is available to be performed again.
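These rules can be captured in a small Python sketch of the tabu storage; the move encoding matches the neighbourhood sketch above, and the details follow the description in this section rather than the authors' code.

# Tabu list sketch (Python): each applied move stores its inverse together with a TabuLife counter.
class TabuList:
    def __init__(self, tabu_life=10):
        self.tabu_life = tabu_life
        self.life = {}                                   # inverse-move key -> remaining iterations

    def record(self, move):
        # Store the inverse of the move just made, so the search cannot immediately undo it.
        if move[0] == "move":
            _, task, src, dst = move
            inverse = ("move", task, dst, src)
        else:
            _, a, b, pa, pb = move
            inverse = ("swap", a, b, pb, pa)             # swapping back restores the old assignment
        self.life[inverse] = self.tabu_life

    def is_tabu(self, move):
        # Merely being present in the list is not enough: the remaining life must exceed zero.
        return self.life.get(move, 0) > 0

    def tick(self):
        # Called once per iteration to age every stored move.
        for key in self.life:
            if self.life[key] > 0:
                self.life[key] -= 1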

Evaluation Functions
A number of different evaluation functions, specified by the user, can be invoked on the solutions. These
evaluation functions allow the user to determine the governing factors they want to be used in evaluating
the fitness of a solution found by the algorithm. While the overall goal of many scheduling algorithms
has been to reduce the overall makespan (or length) of the schedules produced, it is noted that this may
not always be the only factor necessary for an efficient solution.
The amount of communication present in a solution can determine the load that will be placed upon
the interconnection network during the execution of the parallelised application. On a limited-bandwidth
interconnection network, the cost of placing high loads upon the network is at a premium; therefore it is
wise to reduce the overall inter-task communication as much as possible. Modern media and information services
have increased the amount of traffic traversing networks, and this means that the network is becoming
more and more important to many different applications. In order to produce competitive solutions in a
congested network environment, the amount of communication must be limited. Minimizing the communication used is especially important in communication-intensive applications, where each inter-task
communication is likely to be highly expensive.
Some parallel systems seek to distribute as much of the computation as evenly as possible over their
component processors, in other words: load balancing. This is another measure by which the user can
analyze the schedules produced by the TS implementation. This evaluation scheme is useful because
the solutions generated have processing time proportional to the speed at which they execute tasks.
A processor twice as fast as another will have twice as many tasks scheduled to it, but will
maintain the same processing time. Load balancing is also important in time-shared systems, where a
user may only have a specific amount of time on each processor in order to execute their parallelized
application.


Figure 3. (a) Task graph generated by TGFF v3.0, 50 tasks, even in/out degree; and (b) Scheduling of
task graph in (a) with the TS implementation using the minimize length evaluation criteria

Figure 4. Execution trace of the task graph in Figure 3(a)


Similarly, a combination of all these methods might be appropriate: the user may want to analyze
the schedules for an evenly distributed system that reduces the total amount of communication, with
the fastest finish time possible. In all the above cases, the requirements for different systems are catered
for by the specific use of differing evaluation functions. Each evaluation function allows the user more
control over the properties of the schedules produced.
Table 1 lists the four evaluation criteria that have been used to determine the effect of communication
on the schedules produced by the TS implementation.
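The four criteria can be written as simple fitness functions over a finished schedule, as in the Python sketch below. The weights used for the combination criterion are an assumption for illustration; the chapter does not state the exact weighting.

# Evaluation-function sketches (Python); the combination weights are assumed.
def makespan(finish_times):
    return max(finish_times.values())

def total_communication(assignment, comm_cost):
    # Only edges whose endpoints sit on different processors contribute communication.
    return sum(c for (i, j), c in comm_cost.items() if assignment[i] != assignment[j])

def load_imbalance(busy_time):
    # busy_time: processor -> total computation time assigned to it.
    loads = list(busy_time.values())
    return max(loads) - min(loads)

def combination(finish_times, assignment, comm_cost, busy_time, w_comm=0.5, w_load=0.5):
    return (makespan(finish_times)
            + w_comm * total_communication(assignment, comm_cost)
            + w_load * load_imbalance(busy_time))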

Execution Trace
The major components of the TS implementation have been described in detail in the previous six sections. Figure 3(a) illustrates what a task graph looks like before it is input into the TS implementation. For this brief trace, the "minimize length" evaluation criteria will be used. The search begins with
a sequential execution (on a single processor). At each iteration, the search attempts to improve upon
the makespan of the schedule by moving a task from one processor to another, or swapping two tasks'
processor assignments. If the schedule cannot be improved upon in 100 moves, then another processor
is added and the algorithm continues.
In this case the best solution found for minimizing the makespan of the schedule was with 15
processors. The resulting schedule can be seen in Figure 3(b), and the general slow-start of the parallelization can easily be seen, with the top end of the schedule being very under-utilized when compared
to the lower end.
Figure 4 contains a trace of the TS implementation, and the value of the schedule length at each iteration. The trace begins with a sequential schedule on a single processor, which has a value of 460; when the
second processor is added, the schedule immediately improves to around 200 time units. Similarly, when
the third processor is added the schedule length drops dramatically again. Each additional processor
after this, however, only decreases the schedule length minimally. The light line represents the theoretical minimum, which is defined as:

Theoretical Minimum = Sequential Execution Time / Number of Processors


The best performance obtained close to the theoretical minimum is at four processors. After this
value, the improvement in schedule length fails to keep up with the theoretical minimum; this results
in a poorer utilization of the parallel heterogeneous system. The reason the schedule length is able to
perform better than the theoretical minimum, and the critical path, is the heterogeneity of the
parallel system, where some processors will perform faster than the average (which is used to calculate
these values).

RESULTS: COMPUTATION-INTENSIVE CASES


A comparative analysis of the different evaluation functions of the schedules proposed is presented in
this chapter. More specifically, the results presented in this section will focus on computation-intensive
task graphs. Examples of computation-intensive applications include large-scale simulations, such as
the SETI@home class of applications (Anderson et al., 2000), without the need for much communication.

Table 2. Task graph degree shapes

High In/Low Out Degree: very tall, thin task graphs; less scope for parallelising
Even In/Out Degree: generally in-between the two extremes
Low In/High Out Degree: very wide task graphs; excellent for parallelising

Figure 5. (a) A high in/low out degree task graph, (b) An even in/out degree task graph, and (c) A low
in/high out degree task graph

Table 3. The tests performed for the computation-intensive task graphs

Degree: Even In/Out Degree (5:5), Low In/High Out Degree (3:7), High In/Low Out Degree (6:2)
Number of tasks per graph: 20, 50, 100, 150, 200
Evaluation criteria: Minimize Communication, Minimize Length, Load Balance, Combination

Three measures have been used to analyze the quality of the schedules produced and evaluated by the
TS implementation. The first is the speedup, which is an important factor when gauging any scheduling
algorithm and has been used to analyze the effectiveness of each of the evaluation functions, and it
is defined to be:

Speedup = Sequential Execution Time / Parallel Execution Time


The sequential execution time is the schedule length of a task graph on a single processor. The parallel execution time is the schedule length of a task graph on multiple processors.
The second measure that is used to evaluate the efficiency of a given schedule is the speedup per
processor. The speedup per processor can be used to determine the amount of idle time present in the
schedule, and can also be referred to as the average utilization of each processor in the parallel system.
As the value reaches 1, the processors are reaching full utilization. The upper bound for the speedup per
processor is 1.0, and it is defined as:

Speedup per Processor = Speedup / Number of Processors


The final factor used to judge the effectiveness of the solutions presented is the communication
usage, and is defined to be the amount of communication time units used in a solution divided by the
sequential execution time.

CCR' = Communication / Sequential Execution Time

Figure 6. Speedup for computation-intensive: (a) High in/low out; (b) Even in/out; and (c) Low in/high
out degree task graph sets
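Taken together, the three measures can be computed as in the short Python example below; the numbers are invented purely to illustrate the arithmetic.

# Worked example (Python) of the three measures; all values are illustrative.
sequential_time = 460.0        # schedule length on a single processor
parallel_time = 115.0          # schedule length of the parallel schedule
processors = 4
communication = 60.0           # communication time units used in the solution

speedup = sequential_time / parallel_time              # 4.0
speedup_per_processor = speedup / processors           # 1.0 would be full utilization
ccr_prime = communication / sequential_time            # roughly 0.13

print(speedup, speedup_per_processor, ccr_prime)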


To test the TS implementation for computationally intensive task graphs, three sets of randomly
generated task graphs have been used. The first set contains task graphs with an even in-out degree, the
second set contains task graphs with high in degree and low out degree, and the third a low in degree
and high out degree. Table 2 describes the various attributes that apply to each shape of task graph, and
Figure 5 illustrates these features graphically.

Test Parameters
The three test sets each contained five task graphs generated by TGFF v3.0 (Dick et al., 1998). The
number of tasks varied in each task graph of a set, ranging from 20 up to 200 tasks, totalling five task
graphs per set. The number of tasks was not increased beyond 200 because of the high time complexity
of the TS implementation. Such tests would take an unacceptable amount of time to complete. Each of
these task graphs was then run four times with the program for 3000 iterations, using all four evaluation
criteria, which results in a total of over 60 tests being performed.

Figure 7. Speedup per processor for computation-intensive: (a) High in/low out; (b) Even in/out; and
(c) Low in/high out degree task graph sets


Since these task graphs were computation-intensive, the computation to communication ratio (CCR)
was set to 5:1; this allows for a wider disparity between the computation- and communication-intensive
task graphs. Table 3 summarizes the tests performed.

Performance Results
The test results are presented in four different sections. The first section is shown in Figure 6, where
comparisons between the speedup obtained by the differing evaluation criteria are conducted on a range
of differing graph sizes and shapes. The second section consists of an analysis of the number of processors used in the best solution found in each test and the overall usage of these processors, as shown in
Figure 7. Communication usage is presented and discussed in the third section, and is shown in Figure
8. The final section presents only a single example, in order to illustrate that the differences between
physical features of the task graphs can affect the final solution considerably. This is represented in
Figures 9 and 10.

Comparisons with Differing Evaluation Criteria


It is shown in Figure 6 that regardless of the evaluation criteria being used, the speedup obtained from
the TS implementation is competitive. Each schedule produced generates a speedup value that increases
with task graph size and width. This is important because the trade-off between speedup and specialized
properties of the schedules produced is very small; the average penalty to the speedup for minimizing
communication is less than 30%.

Figure 8. Communication usage for computation-intensive: (a) High in/low out; (b) Even in/out; and
(c) Low in/high out degree task graph sets
Therefore for heterogeneous parallel systems where the network may be congested (or the bottleneck),
schedules can be produced that perform on par with the most efficient schedules produced, while also
maintaining the minimum load on the interconnection network. Similarly with time-sharing systems,
where each processor may only be available for a certain amount of time, the processing can be spread
evenly across all processors while still obtaining a competitive speedup.

Figure 9. Schedule length & communication usage for a computation-intensive task graph with 100
tasks, with respect to different task graph shapes/types

Figure 10. Speedup for a computation-intensive task graph with 100 tasks, with respect to different task
graph shapes/types

There are a few anomalies present within the results. The limit of 3000 iterations (or moves) for the
search (imposed to restrict the overall running time of the algorithm) forced the larger task graphs (150-200 tasks) to terminate before they had reached the allotted maximum of 15 processors. This potentially
limited the speedup attainable; therefore the speedup obtained for the "minimize length" evaluation criteria is not reflective of general results, where it should obtain the fastest speedup. This is
a result of the variable time until a new processor is added to the algorithm; as such, the "minimize
length" criteria did not add as many processors to the algorithm as some of the other criteria.

Comparisons with Number of Processors


Figure 7 shows the results obtained when using the average speedup in Figure 6, and dividing it by
the number of processors used in each solution. This shows the overall effectiveness of the speedup,
where the closer a solution gets to having a speedup per processor value of 1.0, the more efficient
the schedule is in terms of processor usage. Also noticeable is that the task graphs that are more parallelisable are able to obtain a higher speedup per processor as the number of tasks increases. This
is due to the fact that the schedules provided for these task graphs are able to utilize additional
processors more efficiently than the high in/low out degree task graphs, because the task graphs
widen faster.
The poor speedup per processor of the "minimize length" criteria occurs because many of the
solutions produced for these evaluation criteria are very sparse. The slow-start effect of many of
the task graphs also reduces the total speedup possible: as the number of processors increases, the
amount of idle time at the beginning of the schedule increases, until the task graph begins to widen
sufficiently to take advantage of the additional processors.

Minimizing Communication
The test results in Figure 8 demonstrate the efficiency of the evaluation criteria on reducing the
burden placed upon the network by the parallelized application. There is a significant drop, in the
order of over 50%, in the amount of communication placed on the network when either the Combination or Minimize Communication evaluation criteria are used. As mentioned previously, the
penalty in speedup for using these alternate evaluation criteria is minimal when compared to the
reduction in communication. The results are clearly consistent throughout the numerous task graphs,
and highlight the efficient use of network resources with these evaluation criteria.

Figure 11. Speedup for communication-intensive: (a) High in/low out; (b) Even in/out; and (c) Low in/
high out degree task graph sets
The computation-intensive task graphs tend to have a very minimal impact on network load due to
the very small values of the communication edges. Therefore computation-intensive task graphs are ideal
for parallelizing without evaluating the amount of communication being used in a solution.
The reason that the Load Balance and Minimize Length evaluation criteria perform worse than
the Minimize Communication and Combination evaluation criteria is that they give no regard to
the amount of communication being used. They may produce schedules with a higher speedup value but
unnecessarily burden the network with additional communication in order to obtain somewhat minimal
gains in speedup.

Comparisons with Various Graph Features


It is clear from Figure 9 that the task graphs that widen quickly (Low In/High Out and Even In/Out)
both achieve better schedule lengths. This is because the parallel processors can begin to take immediate
advantage of the parallelizability of these task graphs. The communication usage differs among all
the task graph types, and is based upon the number and nature of the precedence constraints
present in each graph.

Figure 12. Speedup per processor for communication-intensive: (a) High in/low out; (b) Even in/out;
and (c) Low in/high out degree task graph sets


Not surprisingly, the trends continue in favor of the highly parallelizable task graphs for speedup and
utilization, as shown in Figure 10. The more parallelizable the task graph, the higher the speedup (as
also shown in the lower schedule lengths in Figure 8) and the higher the utilization in general. The higher
utilization is reached because, on average, the wider task graphs begin to use the additional processors
a lot sooner than the taller task graphs.
The test results presented in Figures 9 and 10 can easily be replicated among the other equivalent
tests, but only a single result is presented here to demonstrate that the shape of a task graph significantly
limits the speedup achievable in the final schedules.

RESULTS: COMMUNICATION-INTENSIVE CASES


This section will focus on presenting the results for the second set of tests conducted, aimed at communication-intensive task graphs. These differ quite markedly from computation-intensive task graphs,
where tasks are able to be shifted to other processors with little-to-no penalty. In communication-intensive task graphs, the cost for changing processors tends to be expensive and often unfeasible. Real-world examples of communication-intensive applications include any application that is running on
time-shared hardware, where one or more components may have to wait for its share to proceed (thus
increasing communication times), such as Acoustic Beam Forming (Lee and Sullivan, 1993). The same
performance measurements used in the previous section (speedup, utilization and CCR') will again be
used to compare the various evaluation criteria and schedule outputs.

Figure 13. Communication usage for communication-intensive: (a) High in/low out; (b) Even in/out;
and (c) Low in/high out degree task graphs

Test Parameters
Table 4 shows the tests that were performed on the computation-intensive task graphs, and an identical
set of tests was conducted on the communication-intensive task graphs; TGFF v3.0 was again used to generate the task graphs. The task graphs themselves differed slightly from the computation-intensive ones
because of the random nature of the task graph generation.
The computation to communication ratio was set to 1:5, so that the effects of the communication on
parallelizing applications can be readily seen when compared to results in the previous section.

Performance Results
The performance results for the communication intensive task graphs have been split up into three sections. The first section contains a comparison between the speedup produced for a variety of task graph
shapes and number of tasks. The second section analyzes the computational efficiency of the solutions
produced by the TS implementation. The final section is most important for communication intensive
task graphs and displays the amount of communication present in each schedule produced; it is here
that we can truly see the differences in schedules produced by both the computation-intensive and
communication-intensive task graphs.

Comparisons with Differing Evaluation Criteria


Figure 11 clearly demonstrates that it is not feasible to parallelise communication intensive task graphs
which are not very wide. A speedup value of 1 indicates that there was no improvement on the sequential
solution. Similar to the results in the previous section, the wider the task graph, the higher the speedup
obtainable. These wide task graphs halved the makespan of their schedules while minimizing the amount
of communication present, on all but the smallest task graphs.
The other evaluation criteria (which disregard communication) obtained increasing speedups as the
number of tasks increased, irrespective of task graph shape.

Comparisons with Number of Processors


The speedup per processor is a measure of how utilized, on average, every processor is in the parallel
system. To reach a value of 1 the processor must contain no idle time, and this is only achievable in
rare cases with more than one processor. The results of a speedup per processor of value 1 contained in
Figure 12 are because those task graphs were found to be unsuitable for parallelization, and the produced
schedule contained a single processor. A sequential result on a single processor will always return a
utilization value of 1, which means 100% processor usage.
When compared to the results in the previous section, the utilization of processors changed rapidly
depending on the number of processors in the final solution. The computation intensive task graphs,
however, increased the utilization of the processors as the number of tasks increased. This is due to the
smaller communication costs which allow child nodes to be located on different processors without delaying the overall schedule. In a communication intensive task graph, there is a significant delay for a child
node located on a different processor before the child can begin processing; unfortunately, this increases
the amount of idle time on the processors, reducing the overall utilization of the parallel system.
The theoretical upper bound of 1.0 is overstepped a few times in the results. This occurs because of
the heterogeneous nature of the parallel system. The sequential time is calculated from the average of
all processors available, and if the tasks are run on processors that use a faster time than the average,
then the overall utilisation can appear to be more than 100%. Basically, a higher percent of processing
is performed in a lower than average amount of time.

Minimizing Communication
Communication-intensive task graphs are truly where the communication needs to be taken into account
because any communication across processors will be significantly large. Figure 13 clearly shows that if
the evaluation criteria contain the need to minimize communication, then in most cases for a communication-intensive application it is not feasible to parallelize (this is shown by a value of 0 communication usage for many of the minimize communication and combination evaluation results). The Low
In/High Out Degree and the larger Even In/Out Degree task graphs, which are the widest, can be
parallelized with reasonable efficiency and reduction in communication usage.
It should also be noted that when increasing the speedup several times over (see Figure 11), the
communication cost for these schedules is alarmingly high. Again, these schedules are only efficient if
the interconnection network has a lot of bandwidth. Should the network be shared by other systems or
be subject to bandwidth restrictions, burdening the network this much is inefficient. For comparative
purposes, the communication usage for the "minimize length" and "load balance" evaluation criteria on
150 tasks is roughly 3-4 times the processing time units required to compute sequentially. This is clearly
inefficient, and demonstrates the need to account for communication usage in a communication-intensive
task graph schedule if it is to be viable in realistic networks.

CONCLUSION
A major goal when designing a scheduling algorithm is to reduce the makespan in order to reduce the
running time of the application. Unfortunately this doesn't always lead to the best use of resources,
whether they are computational resources, networking resources, or time itself. Designing proper evaluation criteria in order to efficiently utilize these resources is essential if larger, distributed heterogeneous
systems (such as Grids) are to become effective. Despite the excellent quality of the schedules produced by previous work, and its accounting for communication costs, previously proposed algorithms
do not consider how much they utilize the network as part of the criteria for determining the efficiency
of a schedule.
In this chapter, a variety of evaluation criteria were presented, which demonstrate, in conjunction
with a robust Tabu Search implementation for parallel computing systems, that good quality schedules can be produced that are tailored to the specific requirements of the computing system. The most
balanced schedules were produced when all three criteria were used collectively in the "combination"
evaluation criteria. This resulted in schedules that limited the load on the network, utilized the available
processors evenly and generally had a competitive speedup. The "minimize length" criteria generally
produced excellent quality solutions in regards to speedup, but this resulted in an extremely poor use of
the interconnection network, especially for the communication-intensive task graphs. Conversely, the
"minimize communication" criteria produced schedules of reasonable makespan, but significantly limited the load
placed on the interconnection network.
Further extensions of this work consist of reducing the number of assumptions placed upon
the system and taking into account variability in network conditions. Expanding the Tabu Search implementation to take advantage of some of the more advanced features of Tabu Search would increase the robustness of
the search.

REFERENCES
Anderson, D., et al. (2000). Internet computing for SETI. In G. Lemarchand & K. Meech (Eds.), The
Proceedings of Bioastronomy '99: A New Era in Bioastronomy, ASP Conference Series No. 213 (p. 511).
San Francisco: Astronomical Society of the Pacific.
Blesa, M. J., Hernandez, L., & Xhafa, F. (2001). Parallel skeletons for Tabu search method. In The Proceedings of International Conference on Parallel and Distributed Systems (ICPADS).
Bruno, J., Coffman, E. G., & Sethi, R. (1974). Scheduling independent tasks to reduce mean finishing
time. Communications of the ACM, 17(7), 382-387. doi:10.1145/361011.361064
Dick, R. P., Rhodes, D. L., & Wolf, W. (1998). TGFF: Task graphs for free. In The Proceedings of the
6th International Workshop on Hardware/Software Codesign (pp. 97-101).
Dogan, A., & Özgüner, F. (2002). Matching and scheduling algorithms for minimizing execution time
and failure probability of applications in heterogeneous computing. IEEE Transactions on Parallel and
Distributed Systems, 13(3), 308-323. doi:10.1109/71.993209
El-Rewini, H. (1996). Partitioning and scheduling. In A. Y. Zomaya (Ed.), Parallel and Distributed
Computing Handbook (pp. 239-273). New York: McGraw-Hill.
Glover, F., & Laguna, M. (2002). Tabu search. New York: Kluwer Academic Publishers.
Hertz, A., Taillard, E., & de Werra, D. (1995). A tutorial on Tabu search. In Proceedings of Giornate di
Lavoro AIRO (pp. 13-24).
Kwok, Y.-K., & Ahmad, I. (1999). Static scheduling algorithms for allocating directed task graphs to
multiprocessors. ACM Computing Surveys, 31(4), 406-471. doi:10.1145/344588.344618
Lee, C. E., & Sullivan, D. (1993). Design of a heterogeneous parallel processing system for beam forming. In The Proceedings of the Workshop on Heterogeneous Processing (pp. 113-118).
Lee, Y.-C., Subrata, R., & Zomaya, A. Y. (2008). Efficient exploitation of grids for large-scale parallel
applications. In A. S. Becker (Ed.), Concurrent and Parallel Computing: Theory, Implementation and
Applications (pp. 8.165-8.184). Hauppauge, NY: Nova Science Publishers.
Lee, Y.-C., & Zomaya, A. Y. (2008). Scheduling in grid environments. In S. Rajasekaran & J. Reif (Eds.),
Handbook of Parallel Computing: Models, Algorithms and Applications (pp. 21.1-21.19). Boca Raton,
FL: Chapman & Hall/CRC Press.
Macey, B. S., & Zomaya, A. Y. (1997). A comparison of list scheduling heuristics for communication intensive task graphs. International Journal of Cybernetics and Systems, 28, 535-546.
doi:10.1080/019697297125921
Nabhan, T. M., & Zomaya, A. Y. (1997). A parallel computing engine for a class of time critical processes [Part
B]. IEEE Transactions on Systems, Man, and Cybernetics, 27(5), 774-786. doi:10.1109/3477.623231
Porto, S. C. S., & Ribeiro, C. C. (1995). A Tabu search approach to task scheduling on heterogeneous
processors under precedence constraints. International Journal of High Speed Computing, 7(1).
doi:10.1142/S012905339500004X
Rangaswamy, B. (1998). Tabu search candidate list strategies in scheduling. In The Proceedings of the 6th
INFORMS Advances in Computational and Stochastic Optimization, Logic Programming and Heuristic
Search: Interfaces in Computer Science and Operations Research Conference.
Salleh, S., & Zomaya, A. Y. (1999). Scheduling in parallel computing systems: Fuzzy and annealing
techniques. New York: Kluwer Academic Publishers.
Seredynski, F., & Zomaya, A. Y. (2002). Sequential and parallel cellular automata-based scheduling
algorithms. IEEE Transactions on Parallel and Distributed Systems, 13(10), 1009-1023. doi:10.1109/
TPDS.2002.1041877
Siegel, H. J., & Ali, S. (2000). Techniques for mapping tasks to machines in heterogeneous computing
systems. Journal of Systems Architecture, 46(8), 627-639. doi:10.1016/S1383-7621(99)00033-8
Zomaya, A. Y. (Ed.). (1996). Parallel and distributed computing handbook. New York: McGraw-Hill.
Zomaya, A. Y., & Chan, F. (2005). Efficient clustering for parallel task execution in distributed systems.
Journal of Foundations of Computer Science, 16(2), 281-299. doi:10.1142/S0129054105002991
Zomaya, A. Y., & Teh, Y.-W. (2001). Observations on using genetic algorithms for dynamic load-balancing.
IEEE Transactions on Parallel and Distributed Systems, 12(9), 899-911. doi:10.1109/71.954620

KEY TERMS AND DEFINITIONS


Adaptive Scheduler: This type of scheduler changes its scheduling scheme according to the recent
history and/or current behaviour of the system. In this way, adaptive schedulers may be able to adapt to
changes in system use and activity. Adaptive schedulers are usually known as dynamic, since they make
decisions based on information collected from the system.
Non-Adaptive Scheduler: This type of scheduler does not change its behaviour in response to
feedback from the system. This means that it is unable to adapt to changes in system activity.
Non-Preemptive Scheduling: In this class of scheduling once a task has begun on a processor, it
must run to completion before another task can start execution on the same processor.
Preemptive Scheduling: In this class of scheduling it is possible for a task to be interrupted during its
execution, and resumed from that position on the same or any other processor, at a later time. Although
preemptive scheduling requires additional overhead, due to the increased complexity, it may perform
more effectively than non-preemptive methods.
Scheduling: The allocation of a set of tasks or jobs to resources, such that the optimum performance
is obtained. If these tasks are not inter-dependent the problem is known as task allocation. When the task
information is known a priori, the problem is known as static scheduling. On the other hand, when there
is no a priori knowledge about the tasks the problem is known as dynamic scheduling.
Tabu Search: An intelligent, iterative Hill Descent algorithm which avoids local minima by using short- and long-term memory. It has gained a number of key influences from a variety of sources;
these include early surrogate constraint methods and cutting plane approaches. Tabu search incorporates
adaptive memory and responsive exploration as prime aspects of the search.


Chapter 17

Communication Issues in
Scalable Parallel Computing
C. E. R. Alves
Universidade Sao Judas Tadeu, Brazil
E. N. Cáceres
Universidade Federal de Mato Grosso do Sul, Brazil
F. Dehne
Carleton University, Canada
S. W. Song
Universidade de Sao Paulo, Brazil

ABSTRACT
In this book chapter, the authors discuss some important communication issues to obtain a highly scalable
computing system. They consider the CGM (Coarse-Grained Multicomputer) model, a realistic computing model to obtain scalable parallel algorithms. The communication cost is modeled by the number
of communication rounds and the objective is to design algorithms that require the minimum number
of communication rounds. They discuss some important issues and make considerations of practical
importance, based on their previous experience in the design and implementation of parallel algorithms.
The first issue is the amount of data transmitted in a communication round. For a practical implementation to be successful they should attempt to minimize this amount, even when it is already within the
limit allowed by the CGM model. The second issue concerns the trade-off between the number of communication rounds, which the CGM attempts to minimize, and the overall communication time taken in
the communication rounds. Sometimes a larger number of communication rounds may actually reduce
the total amount of data transmitted in the communication rounds. These two issues have guided the authors to
present efficient parallel algorithms for the string similarity problem, used as an illustration.

DOI: 10.4018/978-1-60566-661-7.ch017


INTRODUCTION
In this book chapter, we discuss some important communication issues to obtain a highly scalable computing system. Scalability is a desirable property of a system, a network, or a process, which indicates
its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged.
We consider the CGM (Coarse-Grained Multicomputer) model, a realistic computing model to obtain
scalable parallel algorithms. A CGM algorithm that solves a problem of size n with p processors each
with O(n/p) memory consists of an alternating sequence of computation rounds and communication
rounds. In one communication round, we allow the exchange of O(n/p) data among the processors. The
communication cost is modeled by the number of communication rounds and the objective is to design
algorithms that require the minimum number of communication rounds. We discuss some important
issues and make considerations of practical importance, based on our previous experience in the design
and implementation of several parallel algorithms.
The first issue is the amount of data transmitted in a communication round. For a practical implementation to be successful we should attempt to minimize this amount, even when it is already within
the maximum allowed by the CGM model which is O(n/p).
The second issue concerns the trade-off between the number of communication rounds which the
CGM attempts to minimize and the overall communication time taken in the communication rounds.
Under the CGM model we want to minimize the number of communication rounds so that we do not
have to care about the particular interconnection network. In a practical implementation, we do have
more information concerning the hardware utilized and the communication times in a particular interconnection network. Sometimes a larger number of communication rounds may actually reduce the
total amount of data transmitted in the communication rounds. Although the goal of the CGM model
is to minimize the number of communication rounds, ultimately the main objective is to minimize the
overall running time that includes the computation and the communication times.
These two issues have guided us to present efficient parallel algorithms for the string similarity problem, used as an illustration. By using the wavefront-based algorithms we present in this book chapter to
illustrate these two issues, we also address a third issue, the desirability of avoiding costly global communication such as broadcast and all-to-all primitives. This is obtained by using wavefront or systolic
parallel algorithms where each processor communicates with only a few other processors.
The string similarity problem is presented here as an illustration. This problem is interesting in its
own right. Together with many other important string processing problems (Alves et al., 2006), string
similarity is a fundamental problem in Computational Biology that appears in more complex problems
(Setubal & Meidanis, 1997), such as the search of similarities between bio-sequences (Needleman &
Wunsch, 1970; Sellers, 1980; Smith & Waterman, 1981). We show two wavefront parallel algorithms to
solve the string similarity problem. We implement both the basic algorithm (Alves et al., 2002) and the
improved algorithm (Alves et al., 2003) by taking into consideration the communication issues discussed
in this book chapter and obtain very efficient and scalable solutions.

PARALLEL COMPUTATION MODEL


Valiant (1990) introduced a simple coarse grained parallel computing model, called Bulk Synchronous
Parallel Model BSP. It gives reasonable predictions on the performance of the algorithms when


implemented on existing, mainly distributed memory, parallel machines. It is also one of the earliest
models to consider communication costs and to abstract the characteristics of parallel machines with
a few parameters. The main objective of BSP is to serve as a bridging model between hardware and software
requirements; such a bridging role is one of the fundamental characteristics behind the success of the von Neumann
model. In the BSP model, parallel computation is modeled by a series of super-steps. In this model,
p processors with local memory communicate through some interconnection network managed by a
router with global synchronization. A BSP algorithm consists of a sequence of super-steps separated by
synchronization barriers. In a super-step, each processor executes a set of independent operations using
local data available in each processor at the start of the super-step, as well as communication consisting
of send and receive of messages. An h-relation in a super-step corresponds to sending or receiving at
most h messages in each processor. The response to a message sent in one super-step can only be used
in the next super-step.
In this chapter we use a similar model called the Coarse Grained Multicomputers (denoted by BSP/
CGM), proposed by Dehne et al. (1993). A BSP/CGM consists of a set of p processors P1, P2, ..., Pp
with O(n/p) local memory per processor and each processor is connected through any interconnection
network. The term coarse granularity comes from the fact that the problem size in each processor n/p
is considerably larger than the number of processors, that is, n/p>>p. A BSP/CGM algorithm consists
of alternating local computation and global communication rounds separated by a barrier synchronization. The BSP/CGM model uses only two parameters: the input size n and the number of processors
p. In a computing round, each processor runs a sequential algorithm to process its data locally. A communication round consists of sending and receiving messages, in such a way that each processor sends
at most O(n/p) data and receives at most O(n/p) data. We require that all information sent from a given
processor to another processor in one communication round is packed into one long message, thereby
minimizing the message overhead.
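To make the packing requirement concrete, the following small sketch is offered as an illustration only (it is not taken from this chapter); it assumes an MPI environment with the mpi4py binding, and all names in it are illustrative. It shows one communication round in which everything addressed to the same destination processor is delivered as a single packed message.

# Hedged sketch: one BSP/CGM communication round with mpi4py, where every
# processor packs all data destined to each peer into a single message, as
# the model requires.  Function and variable names are illustrative only.
from mpi4py import MPI

def communication_round(outgoing):
    """outgoing[j] is the (possibly empty) list of items this processor wants
    to send to processor j; the call returns, for every peer, the single
    packed message received from it in this round."""
    comm = MPI.COMM_WORLD
    assert len(outgoing) == comm.Get_size()
    # alltoall delivers exactly one (pickled) object per source-destination
    # pair, so each pair exchanges one long message per round.
    return comm.alltoall(outgoing)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    p, me = comm.Get_size(), comm.Get_rank()
    # Toy payload: send our rank, repeated, to every processor.
    received = communication_round([[me] * 3 for _ in range(p)])
    print(f"processor {me} received {received}")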
In the BSP/CGM model, the communication cost is modeled by the number of communication rounds
which we wish to minimize. In a good BSP/CGM algorithm the number of communication rounds
does not depend on the input size n. The ideal algorithm requires a constant number of communication
rounds. If this is not possible, we attempt to get an algorithm for which this number is independent of
n but depends on p. This is the case of the present chapter.
The BSP/CGM model has the advantage of producing results that are close to the actual performance
of commercially available parallel machines. Some algorithms for computational geometry and graph
problems require a constant number or O(log p) communication rounds (e.g. see Dehne et al. (1993)).
The BSP/CGM model is particularly suitable for current parallel machines in which the global computing speed is considerably greater than the global communication speed.
One way to explore the use of parallel computation can be through the use of clusters of workstations
or Fast/Gigabit Ethernet connected Linux-based Beowulf machines, with Parallel Virtual Machine (PVM) or Message Passing Interface (MPI) libraries. The latency in such 1 Gb/s clusters or Beowulf machines
is currently less than 10 μs, and programming using these resources is today a major trend in
parallel and distributed computing.
Though much effort has been expended to deal with the problems of interconnection of clusters or
Beowulfs and the programming environment, there are still few works on methodologies to design and
analyze algorithms for scalable parallel computing systems.


Figure 1. String alignment examples

THE STRING SIMILARITY PROBLEM


In Molecular Biology, the search for tools that identify, store, compare and analyze very long bio-sequences
is becoming a major research area in Computational Biology. In particular, sequence comparison is a
fundamental problem that appears in more complex problems (Setubal & Meidanis, 1997), such as
the search of similarities between bio-sequences (Needleman & Wunsch, 1970; Sellers, 1980; Smith
& Waterman, 1981), as well as in the solution of several other problems such as approximate string
matching, file comparison, and text searching with errors (Hall & Dowling, 1980; Hunt & Szymansky,
1977; Wu & Manber, 1992).
One main motivation for biological sequence comparison, in particular of proteins, comes from the
fact that proteins that have similar three-dimensional structures usually have the same functionality. The three-dimensional structure is given by the sequence of symbols that constitutes the protein. In this way, we can
guess the functionality of a new protein by searching for a known protein that is similar to it.
In this section we present the string similarity problem. One way to identify similarities between
sequences is to align them, with the insertion of spaces in the two sequences, in such way that the two
sequences become equal in length. We expect that the alignment of two sequences that are similar will
show the parts where they match, and different parts where spaces are inserted. We are interested in the
best alignment between two strings, and the score of such an alignment gives a measure of how much
the strings are similar.
The similarity problem is defined as follows. Let A = a1a2...am and C = c1c2...cn be two strings over
some alphabet.
To align the two strings, we insert spaces in the two sequences in such way that they become equal
in length. See Figure 1 where each column consists of a symbol of A (or a space) and a symbol of C (or
a space). An alignment between A and C is a matching of the symbols a ∈ A and c ∈ C in such a way
that if we draw lines between the corresponding matched symbols, these lines cannot cross each other.
The alignment shows the similarities between the two strings. Figure 1 shows two simple alignment
examples where we assign a score of 1 when the aligned symbols in a column match and 0 otherwise.
The alignment on the right has a higher score (5) than that on the left (3).
A more general score assignment for a given alignment between strings is done as follows. Each
column of the alignment receives a certain value depending on its contents and the total score for the
alignment is the sum of the values assigned to its columns. Consider a column consisting of symbols
r and s. If r = s (i.e. a match), it will receive a value p(r, s) > 0. If r ≠ s (a mismatch), the column will
receive a value p(r, s) < 0. Finally, a column with a space in it receives a value −k, where k ∈ N. We
look for the alignment (optimal alignment) that gives the maximum score. This maximum score is called
the similarity measure between the two strings to be denoted by sim(A,C) for strings A and C. There may


Figure 2. Grid DAG G for A= baabcbca and B = baabcabcab

be more than one alignment with maximum score (Setubal & Meidanis, 1997).
Dynamic programming is a technique used in the solution of many optimization and decision problems.
It decomposes the problem into a sequence of optimization or decision steps that are interconnected and
are solved one after another. The optimal solution of the problem is obtained by the decomposition of
the problem in sub-problems, and computing the optimal solution for each sub-problem. By combining
these solutions we obtain the optimal solution of the global problem.
Unlike other optimization methods, such as linear programming and branch and bound,
dynamic programming is not a general technique. Optimization problems should be translated into a
more specific form before dynamic programming can be used. This translation can be very difficult. This
constitutes a further difficulty in addition to the need of formulating the problem to be solved efficiently
by the dynamic programming approach.
Consider two strings A and C, where |A| = m and |C| = n. We can solve the string similarity problem
by computing all the similarities between arbitrary prefixes of the two strings starting with the shorter
prefixes and use previously computed results to solve the problem for larger prefixes. There are m + 1
possible prefixes of A and n + 1 prefixes of C. Thus, we can arrange our calculations in an (m + 1) ×
(n + 1) matrix S where each S(r, s) represents the similarity between A[1..r] and C[1..s], which denote the
prefixes a1a2...ar and c1c2...cs, respectively.
Observe that we can compute the values of S(r, s) by using the three previous values S(r − 1, s), S(r −
1, s − 1) and S(r, s − 1), because there are only three ways to compute an alignment between A[1..r] and
C[1..s]. We can align A[1..r] with C[1..s − 1] and match a space with C[s], or align A[1..r − 1] with C[1..s
− 1] and match A[r] with C[s], or align A[1..r − 1] with C[1..s] and match a space with A[r] (Figure 2).

Figure 3. The recursive definition of the similarity score


The similarity score S of the alignment between strings A and C can be computed as in Figure 3.
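Figure 3 is not reproduced here, but from the three cases just described the recurrence presumably takes the standard form S(r, s) = max{S(r − 1, s) − k, S(r, s − 1) − k, S(r − 1, s − 1) + p(A[r], C[s])}, with S(r, 0) = −rk and S(0, s) = −sk. As a rough illustration only (the concrete scoring values below are assumptions, since the chapter keeps p(r, s) and k generic), a sequential version can be sketched as follows.

# Hedged sketch of the sequential dynamic-programming similarity computation
# described above.  The match/mismatch scores and the gap penalty k are
# illustrative assumptions.
def similarity(A, C, match=1, mismatch=-1, k=1):
    m, n = len(A), len(C)
    # S[r][s] = similarity between the prefixes A[1..r] and C[1..s]
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for r in range(1, m + 1):
        S[r][0] = -r * k          # A[1..r] aligned against r spaces
    for s in range(1, n + 1):
        S[0][s] = -s * k          # C[1..s] aligned against s spaces
    for r in range(1, m + 1):
        for s in range(1, n + 1):
            p = match if A[r - 1] == C[s - 1] else mismatch
            S[r][s] = max(S[r - 1][s] - k,        # space matched with A[r]
                          S[r][s - 1] - k,        # space matched with C[s]
                          S[r - 1][s - 1] + p)    # A[r] aligned with C[s]
    return S[m][n]

print(similarity("baabcbca", "baabcabcab"))

The double loop makes the O(mn) sequential complexity mentioned below immediate.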
An l1 × l2 grid DAG (Figure 2) is a directed acyclic graph whose vertices are the l1·l2 points of an l1 ×
l2 grid, with edges from grid point G(i, j) to the grid points G(i, j + 1), G(i + 1, j) and G(i + 1, j + 1).
Let A and C be two strings with |A| = m and |C| = n symbols, respectively. We associate an (m + 1) ×
(n + 1) grid DAG G with the similarity problem in the natural way: the (m + 1)(n + 1) vertices of G
are in one-to-one correspondence with the (m + 1)(n + 1) entries of the S-matrix, and the cost of an edge
from vertex (t, l) to vertex (i, j) is equal to −k if t = i and l = j − 1 or if t = i − 1 and l = j, and to p(i, j) if
t = i − 1 and l = j − 1.
It is easy to see that the string similarity problem can be viewed as computing the minimum source-sink path in a grid DAG. In Figure 2 the problem is to find the minimum path from (0,0) to (8,10).
A sequential algorithm to compute the similarity between two strings of lengths m and n uses a technique called dynamic programming. The complexity of this algorithm is O(mn). The construction of the
optimal alignment can be done in sequential time O(m + n) (Setubal & Meidanis, 1997).
PRAM (Parallel Random Access Machine) algorithms for the dynamic programming problem have
been obtained by Galil and Park (1991). PRAM algorithms for the string editing problem have been
proposed by Apostolico et al. (1990). A more general study of parallel algorithms for dynamic programming can be seen in (Gengler, 1996).
We present two algorithms that use the realistic BSP/CGM model. A characteristic and advantage
of the wavefront or systolic algorithm is the modest communication requirement, with each processor
communicating with few other processors. This makes it very suitable as a potential application for
grid computing where we wish to avoid costly global communication operations such as broadcast and
all-to-all operations.

THE BASIC SIMILARITY ALGORITHM


The basic similarity algorithm is due to Alves et al. (2002). It is a BSP/CGM algorithm and attempts to
minimize the number of communication rounds.
Consider two given strings A = a1a2...am and C = c1c2...cn. The basic similarity algorithm computes
the similarity between A and C on a CGM/BSP with p processors and O(mn/p) local memory in each processor.
We divide C into p pieces of size n/p, and each processor Pi, 1 ≤ i ≤ p, receives the string A and the
i-th piece of C (c(i−1)n/p+1, ..., cin/p).
Each processor Pi computes the elements Si(r, s) of the submatrix Si, where 1 ≤ r ≤ m and
(i − 1)n/p + 1 ≤ s ≤ in/p, using the three previous elements Si(r − 1, s), Si(r − 1, s − 1) and Si(r, s − 1),
because, as mentioned before, there are only three ways of computing an alignment between A[1..r] and
C[1..s]. We can align A[1..r] with C[1..(s − 1)] and match a space with C[s], or align A[1..(r − 1)] with
C[1..(s − 1)] and match A[r] with C[s], or align A[1..(r − 1)] with C[1..s] and match a space with A[r].
To compute the submatrix Si, each processor Pi uses the best sequential algorithm locally. It is easy
to see that processor Pi, i > 1, can only start computing the elements Si(r, s) after processor Pi−1 has
computed part of the submatrix Si−1(r, s).
Denote by Rik, 1 ≤ i, k ≤ p, all the elements of the right boundary (rightmost column) of the k-th part
of the submatrix Si. More precisely, Rik = {Si(r, in/p) : (k − 1)m/p + 1 ≤ r ≤ km/p}.


Figure 4. An O(p) communication rounds scheduling used in the basic algorithm

The idea of the algorithm is the following: after computing the k-th part of the submatrix Si, processor Pi sends to processor Pi+1 the elements of Rik. Using Rik, processor Pi+1 can compute the k-th part
of the submatrix Si+1. After p − 1 rounds, processor Pp receives Rp−1,1 and computes the first part of the
submatrix Sp. At round 2p − 2, processor Pp receives Rp−1,p and computes the p-th part of the submatrix
Sp and finishes the computation.
Using this schedule (Figure 4), we can see that in the first round, only processor P1 works. In the
second round, processors P1 and P2 work. It is easy to see that in round k, all processors Pi work, where
1 ≤ i ≤ k.
We now present the basic string similarity algorithm: Basic Similarity Algorithm (see Figure 5).
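Figure 5 is not reproduced here. Purely as an illustration of the wavefront schedule just described, and not of the authors' actual implementation, the following hedged mpi4py sketch shows the role of one processor: it holds the whole string A and its slice of C, computes its sub-matrix in p blocks of m/p rows, and after each block forwards the block's rightmost column to its right neighbour. The scoring values, message tags, and the assumption that p evenly divides the string lengths are illustrative choices.

# Hedged mpi4py sketch (not the authors' code) of the basic wavefront algorithm.
from mpi4py import MPI

def score(a, c, match=1, mismatch=-1):
    return match if a == c else mismatch

def basic_similarity(A, C_slice, col_offset, k=1):
    comm = MPI.COMM_WORLD
    rank, p = comm.Get_rank(), comm.Get_size()
    m, w = len(A), len(C_slice)
    mb = m // p                                    # rows per block (we assume p divides m)
    # top[j] = S(row above the current block, col_offset + j); row 0 is known everywhere
    top = [-(col_offset + j) * k for j in range(w + 1)]
    for b in range(p):
        lo = b * mb + 1                            # first global row of block b
        if rank == 0:
            left = [-r * k for r in range(lo, lo + mb)]   # column 0 of S
        else:
            left = comm.recv(source=rank - 1, tag=b)      # boundary of the left neighbour
        block_right = []
        prev_row = top
        for t, r in enumerate(range(lo, lo + mb)):
            row = [left[t]] + [0] * w
            for j in range(1, w + 1):
                row[j] = max(row[j - 1] - k,                              # space with C
                             prev_row[j] - k,                            # space with A[r]
                             prev_row[j - 1] + score(A[r - 1], C_slice[j - 1]))
            block_right.append(row[w])
            prev_row = row
        if rank < p - 1:
            comm.send(block_right, dest=rank + 1, tag=b)  # forward this block's boundary
        top = prev_row                                    # last row of the block
    return top[w]                                         # S(m, n) on processor p - 1

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank, p = comm.Get_rank(), comm.Get_size()
    A, C = "baabcbca", "baabcabcab"                # the strings of Figure 2
    w = len(C) // p                                # we also assume p divides |C| and |A|
    result = basic_similarity(A, C[rank * w:(rank + 1) * w], rank * w)
    if rank == p - 1:
        print("similarity score:", result)

Run under mpiexec with a processor count that divides both string lengths, the last processor prints the score; the send/receive pattern is exactly the wavefront schedule of Figure 4.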
Theorem 1.
The basic similarity algorithm uses 2p − 2 communication rounds with O(mn/p) sequential computing
time in each processor.
Proof.
Processor P1 sends R1k to processor P2 after computing the k-th block of m/p rows of the mn/p sub-matrix S1.

Figure 5. The basic similarity algorithm

Figure 6. Table of running times of the basic algorithm for various string lengths

After p − 1 communication rounds, processor P1 finishes its work. Similarly, processor P2
finishes its work after p communication rounds. Then, after p − 2 + i communication rounds, processor
Pi finishes its work. Since we have p processors, after 2p − 2 communication rounds, all the p processors have finished their work.
Each processor uses a sequential algorithm to compute the similarity submatrix Si. Thus this algorithm
takes O(mn/p) computing time.
Theorem 2.
At the end of the basic similarity algorithm, S(m, n) will store the score of the similarity between
the strings A and C.
Proof.
By Theorem 1, after 2p − 2 communication rounds, processor Pp finishes its work. Since we are
essentially computing the similarity sequentially in each processor and sending the boundaries to the
right processor, the correctness of the algorithm comes naturally from the correctness of the sequential
algorithm. Then, after 2p − 2 communication rounds, S(m, n) will store the similarity between the strings
A and C.

Experimental Results of the Basic Algorithm


In this section we present the experimental results of the basic similarity algorithm. The following figures
give running time curves.
We have implemented the O(p) rounds basic similarity algorithm on a Beowulf with 64 nodes. Each
node has 256 MB of RAM and an additional 256 MB of swap space. The nodes are connected through a
100 Mbps interconnection network.


Figure 7. Curves of the observed times for various string lengths

Figure 8. Curves of the observed times for various string lengths


Figure 9. An O(p) communication rounds scheduling with α = 1

The obtained times (Figures 6, 7 and 8) show that with small sequences, the communication time is
significant when compared to the computation time with more than 8 and 16 processors, respectively
(512 × 512 and 512 × 1024). When we apply the algorithm to sequences longer than 8192, using one
or two processors, the main memory is not enough to solve the problem. The use of swap gives
us meaningless running times. This would not occur if the nodes had more main memory. Thus we
have suppressed these times.
In general, the implementation of the CGM/BSP algorithm shows that the theoretical results are
confirmed in the implementation.
The basic similarity algorithm requires O(p) communication rounds to compute the score of the
similarity between two strings. We have worked with a fixed block size of m/p × n/p. Another good
alternative is to work with an adaptive choice of the optimal block size to further decrease the running
time of the algorithm.
The alignment between the two strings can be obtained with O(p) communication rounds by backtracking
from the lower right corner of the grid graph in O(m + n) time (Setubal & Meidanis, 1997). For this, S(r,
s) for all points of the grid graph must be stored during the computation (requiring O(mn) space).

THE IMPROVED SIMILARITY ALGORITHM


Alves et al. (2003) extend and improve the basic similarity algorithm (Alves et al., 2002) for computing
an alignment between two strings A and C, with |A| = m and |C| = n. On a distributed memory parallel
computer of p processors each with O((m + n)/p) memory, the improved algorithm also requires O(p)
communication rounds, more precisely (1 + 1/α)p − 2 communication rounds, where α is a parameter
to be presented shortly, and O(mn / p) local computing time. As in the basic algorithm, the processors
communicate in a wavefront or systolic manner, such that each processor communicates with few other
processors. Actually each processor sends data to only two other processors.
The novelty of the improved similarity algorithm is based on a compromise between the workload
of each processor and the number of communication rounds required, expressed by a parameter called
α. The proposed algorithm is expressed in terms of this parameter, which can be tuned to obtain the best
overall parallel time in a given implementation. In addition to showing the theoretical complexity, we confirm
the efficiency of the proposed algorithm through implementation. As will be seen shortly, very promising
experimental results are obtained on a 64-node Beowulf machine.
We present a parameterized O(p) communication rounds parallel algorithm for computing the simi-


Figure 10. An O(p) communication rounds scheduling with α = 1/2

larity between two strings A and C, over some alphabet, with |A|= m and |C|= n. We use the CGM/BSP
model with p processors, where each processor has O(mn / p) local memory. As will be seen later, this
can be reduced to O((m + n) / p).
Let us first give the main idea to compute the similarity matrix S by p processors. The string A is
broadcast to all processors, and the string C is divided into p pieces of size n/p, and each processor
Pi, 1 ≤ i ≤ p, receives the i-th piece of C (c(i−1)n/p+1 ... cin/p).
The scheduling scheme is illustrated in Figure 9. The notation Pik denotes the work of processor Pi
at round k. Thus initially P1 starts computing at round 0. Then P1 and P2 can work at round 1, P1, P2 and
P3 at round 2, and so on. In other words, after computing the k-th part of the sub-matrix Si (denoted Sik),
processor Pi sends to processor Pi+1 the elements of the right boundary (rightmost column) of Sik. These
elements are denoted by Rik. Using Rik, processor Pi+1 can compute the k-th part of the sub-matrix
Si+1. After p − 1 rounds, processor Pp receives Rp−1,1 and computes the first part of the sub-matrix Sp. In
round 2p − 2, processor Pp receives Rp−1,p and computes the p-th part of the sub-matrix Sp and finishes
the computation.

Figure 11. The improved similarity algorithm


It is easy to see that with this scheduling, processor Pp only initiates its work when processor P1 is
finishing its computation, at round p − 1. Therefore, we have very poor load balancing.
In the following we attempt to assign work to the processors as soon as possible. This can be done
by decreasing the size of the messages that processor Pi sends to processor Pi+1. Instead of message
size m/p we consider sizes αm/p and explore several values of α. In our work, we make the assumption
that the message size αm/p divides m. Therefore, Sik (the similarity sub-matrix computed by
processor Pi at round k) represents rows k·αm/p + 1 to (k + 1)·αm/p of Si, which are computed at the
k-th round.
We now present the improved similarity algorithm.
The improved algorithm works as follows: after computing Sik, processor Pi sends Rik to processor
Pi+1. Processor Pi+1 receives Rik from Pi and computes Si+1,k+1. After p − 2 rounds, processor Pp receives
Rp−1,p−2 and computes Sp,p−1. If we use α < 1, all the processors will work simultaneously after the (p − 2)-th
round. We explore several values of α, trying to find a balance between the workload of the processors
and the number of rounds of the algorithm. Figure 10 shows how the algorithm works when α = 1/2.
In this case, processor Pp receives Rp−1,3p−3, computes Sp,3p−2 and finishes the computation.
Improved Similarity Algorithm (see Figure 11).
Using the schedule of Figure 10, we can see that in the first round, only processor P1 works. In the
second round, processors P1 and P2 work. It is easy to see that at the k-th round, all processors Pi work,
where 1 ≤ i ≤ k. Although the total number of rounds increases with smaller values of α, the processors
start working earlier.
Theorem 3
The improved algorithm uses (1 + 1/α)p − 2 communication rounds with O(mn/p) sequential computing time in each processor.
Proof:
Processor P1 sends R1k to processor P2 after computing the k-th block of αm/p rows of the mn/p
sub-matrix S1. After p/α − 1 communication rounds, processor P1 finishes its work. Similarly, processor
P2 finishes its work after p/α communication rounds. Then, after p/α − 2 + i communication rounds,
processor Pi finishes its work. Since we have p processors, after (1 + 1/α)p − 2 communication rounds,
all the p processors have finished their work.
Each processor uses a sequential algorithm to compute the similarity sub-matrix Si. Thus this algorithm takes O(mn / p) computing time.
Theorem 4
At the end of the improved algorithm, S(m, n) will store the score of the similarity between the strings
A and C.


Figure 12. Table showing running times for various values of α with m=8K and n=16K

Proof:
Theorem 3 proves that after (1 + 1/α)p − 2 communication rounds, processor Pp finishes its work.
Since we are essentially computing the similarity sequentially in each processor and sending the boundaries to the right processor, the correctness of the algorithm comes naturally from the correctness of the
sequential algorithm. Then, after (1 + 1/α)p − 2 communication rounds, S(m, n) will store the similarity
between the strings A and C.

Figure 13. Time curves vs. number of processors with m=8K and n=16K


Figure 14. Time curves vs. values of α with m=8K and n=16K

Figure 15. Table showing running times for various values of α with m=4K and n=8K

Figure 16. Time curves versus number of processors with m=4K and n=8K


Figure 17. Curves of the observed times - quadratic space

Experimental Results of the Improved Similarity Algorithm


In this section we present the experimental results of the improved similarity algorithm. We have implemented the improved similarity algorithm on a Beowulf with 64 nodes. Each node has 256 MB of RAM
memory in addition to 256 MB of swap space. The nodes are connected through a 100 Mbps interconnection
network.
Figures 12, 13 and 14 show the running times of the improved similarity algorithm for different values
of α for string lengths of m=8K and n=16K, where K=1024. For a given experiment and hardware platform a parameter
tuning phase is required to obtain the best value for α. It can
be seen that, for very small α, the communication time is significant when compared to the computation time. We have analyzed the behavior of α to estimate the optimal block size. The observed times
show that when αm/p decreases from 16 to 8 (the number of rows of the sub-matrix Sik), we have an
increase in the total time. The best times are obtained for α between 1/4 and 1/8.
Figures 15 and 16 show the running times of the improved similarity algorithm for different values

Figure 18. Curves of the observed times - linear space


of α for string lengths of m=4K and n=8K. Again, for a given experiment and hardware platform a parameter tuning phase is required to obtain the best value for α.

Quadratic vs. Linear Space Implementation


We can further improve our results by exploring a linear space implementation, by storing a vector instead of the entire matrix. In the usual quadratic space implementation, each processor uses O(mn / p)
space, while in the linear space implementation each processor requires only O((m + n) / p) space. The
results are impressive, as shown in Figures 17 and 18. With less demand on swap disk space, we
get an almost 50% improvement. We have used α = 1.
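As a hedged sketch of the linear-space idea (not the authors' code; the scoring values are again illustrative assumptions), the score S(m, n) can be computed while keeping only one row of the matrix; combined with the block-wise boundary exchange, this is what brings the per-processor storage from O(mn/p) down to O((m + n)/p).

# Hedged sketch of the linear-space computation: only the previous row of S
# is kept, so a single vector of length n + 1 suffices for the score.
def similarity_linear_space(A, C, match=1, mismatch=-1, k=1):
    m, n = len(A), len(C)
    row = [-s * k for s in range(n + 1)]          # row 0 of S
    for r in range(1, m + 1):
        prev_diag, row[0] = row[0], -r * k        # save S(r-1, 0), set S(r, 0)
        for s in range(1, n + 1):
            p = match if A[r - 1] == C[s - 1] else mismatch
            prev_diag, row[s] = row[s], max(row[s] - k,        # S(r-1, s) - k
                                            row[s - 1] - k,    # S(r, s-1) - k
                                            prev_diag + p)     # S(r-1, s-1) + p
        # 'row' now holds S(r, .)
    return row[n]

print(similarity_linear_space("baabcbca", "baabcabcab"))

Note that only the score is recovered this way; reconstructing the alignment itself would still require storing boundary information, as discussed for the basic algorithm.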

CONCLUSION
We have presented a basic and an improved parameterized BSP/CGM parallel algorithm to compute the
score of the similarity between two strings. On a distributed memory parallel computer of p processors
each with O((m + n) / p) memory, the proposed algorithm requires O(p) communication rounds and O(mn
/ p) local computing time. The novelty of the improved similarity algorithm is based on a compromise
between the workload of each processor and the number of communication rounds required, expressed
by a new parameter called α. We have worked with a variable block size of αm/p × n/p and studied the
behavior of the block size. We show how this parameter can be tuned to obtain the best overall parallel
time in a given implementation. Very promising experimental results are shown.
Though we dedicated considerable space to present the two string similarity algorithms, these algorithms serve the purpose of illustrating two main issues. The first issue is the amount of data transmitted
in a communication round. For a practical implementation to be successful we should attempt to minimize
this amount, even when it is already within the limit allowed by the CGM model. The second issue concerns the trade-off between the number of communication rounds which the CGM attempts to minimize
and the overall communication time taken in the communication rounds. Sometimes a larger number of
communication rounds may actually reduce the total amount of data transmitted in the communication
rounds. To this end the parameter α is introduced in the improved similarity algorithm. By adjusting
the value of α appropriately, we can actually require more communication rounds while diminishing the total
amount of data transmitted in the communication rounds, thus resulting in a more efficient solution.
As a final observation notice that a characteristic of the wavefront communication requirement is
that each processor communicates with few other processors. This makes it very suitable as a potential
application for grid computing.

REFERENCES
Alves, C. E. R., Caceres, E. N., Dehne, F., & Song, S. W. (2002). A CGM/BSP Parallel Similarity Algorithm. In Proceedings I Brazilian Workshop on Bioinformatics (pp. 1-8). Porto Alegre: SBC Computer
Society.


Alves, C. E. R., Caceres, E. N., Dehne, F., & Song, S. W. (2003). A Parallel Wavefront Algorithm for
Efficient Biological Sequence Comparison. In V. Kumar, M. L. Gavrilova, C. J. K. Tan, & P. L'Ecuyer (Eds.),
The 2003 International Conference on Computational Science and its Applications (LNCS Vol. 2668,
pp. 249-258). Berlin: Springer Verlag.
Alves, C. E. R., Caceres, E. N., & Song, S. W. (2006). A coarse-grained parallel algorithm for the all-substrings longest common subsequence problem. Algorithmica, 45(3), 301-335. doi:10.1007/s00453-006-1216-z
Apostolico, A., Atallah, M. J., Larmore, L. L., & McFaddin, S. (1990). Efficient parallel algorithms for string
editing and related problems. SIAM Journal on Computing, 19(5), 968-988. doi:10.1137/0219066
Dehne, F. (1999). Coarse grained parallel algorithms. Algorithmica, 24(3/4), 173-176.
Dehne, F., Fabri, A., & Rau-Chaplin, A. (1993). Scalable parallel geometric algorithms for coarse grained
multicomputers. In Proceedings of the 9th Annual ACM Symposium on Computational Geometry (pp. 298-307).
Galil, Z., & Park, K. (1991). Parallel dynamic programming (Tech. Rep. CUCS-040-91). New York:
Columbia University, Computer Science Department.
Gengler, M. (1996). An introduction to parallel dynamic programming. In Solving Combinatorial Optimization Problems in Parallel (LNCS Vol. 1054, pp. 87-114). Berlin: Springer Verlag.
Hall, P. A., & Dowling, G. R. (1980). Approximate string matching. Computing Surveys, 12(4), 381-402.
doi:10.1145/356827.356830
Hunt, J. W., & Szymansky, T. (1977). An algorithm for differential file comparison. Communications
of the ACM, 20(5), 350-353. doi:10.1145/359581.359603
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in
the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453. doi:10.1016/0022-2836(70)90057-4
Sellers, P. H. (1980). The theory and computation of evolutionary distances: Pattern recognition. Journal
of Algorithms, 1(4), 359-373. doi:10.1016/0196-6774(80)90016-4
Setubal, J., & Meidanis, J. (1997). Introduction to computational molecular biology. Boston: PWS
Publishing Company.
Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of
Molecular Biology, 147, 195-197.
Valiant, L. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8),
103-111. doi:10.1145/79173.79181
Wu, S., & Manber, U. (1992). Fast text searching allowing errors. Communications of the ACM, 35(10),
83-91. doi:10.1145/135239.135244


KEY TERMS AND DEFINITIONS


Coarse-Grained Multicomputer: A simple and realistic parallel computing model, characterized by
two parameters (input size n and number of processors p), in which local computation rounds alternate with
global communication rounds, with the goal of minimizing the number of communication rounds.
Granularity: A measure of the size of the components, or descriptions of components, that make up
a system. In parallel computing, granularity refers to the amount of computation that can be performed
by the processors before requiring a communication step to exchange data.
Scalability: A desirable property of a system, a network, or a process, which indicates its ability to
either handle growing amounts of work in a graceful manner, or to be readily enlarged.
String Similarity Metrics: Textual based metrics resulting in a similarity or dissimilarity (distance)
score between two pairs of text strings for approximate matching or comparison.
Systolic Algorithm: An algorithm that has the characteristics of a systolic array.
Systolic Array: A pipelined network of processing elements called cells, used in parallel computing, where cells compute data and store it independently of each other and pass the computed data to
neighboring cells.
Wavefront Algorithm: An algorithm that has the characteristics of a systolic array, also known as
systolic algorithm.

ENDNOTE
1. Partially supported by FAPESP Proc. No. 2004/08928-3, CNPq Proc. No. 55.0094/05-9, 55.0895/07-8,
30.5362/06-2, 30.2942/04-1, 62.0123/04-4, 48.5460/06-8, FUNDECT 41/100.115/2006, and the
Natural Sciences and Engineering Research Council of Canada.


Chapter 18

Scientific Workflow
Scheduling with Time-Related QoS Evaluation
Wanchun Dou
Nanjing University, P. R. China
Jinjun Chen
Swinburne University of Technology, Australia

ABSTRACT
This chapter introduces a scheduling approach for cross-domain scientific workflow execution with time-related QoS evaluation. Generally, scientific workflow execution often spans self-managing administrative
domains to achieve a global collaboration advantage. In practice, it is infeasible for a domain-specific
application to disclose its process details for privacy or security reasons. Consequently, it is a challenging endeavor to coordinate scientific workflows and their distributed domain-specific applications from
a service invocation perspective. Therefore, in this chapter, the authors aim at proposing a collaborative
scheduling approach, with time-related QoS evaluation, for navigating cross-domain collaboration.
Under this collaborative scheduling approach, a private workflow fragment could maintain temporal
consistency with a global scientific workflow in resource sharing and task enactments. Furthermore, an
evaluation is presented to demonstrate the scheduling approach.

INTRODUCTION
In the past few years, some computing infrastructures, e.g., the grid infrastructure, have emerged for
accommodating powerful computing and for enhancing the resource sharing capabilities required by cross-organizational workflow applications (Wieczorek, 2005; Fox, 2006). The scientific workflow is a new, special type of workflow
that often underlies many large-scale complex e-science applications such as climate modeling, structural
biology and chemistry, medical surgery or disaster recovery simulation (Ludscher, 2005; Bowers, 2008;
Zhao, 2006). This new type of scientific workflow application is gaining more and more momentum due
to its key role in e-Science and cyber-infrastructure applications. As scientific workflows are typically
DOI: 10.4018/978-1-60566-661-7.ch018


data-centric and dataflow-oriented analysis pipelines (Ludscher, 2005; McPhillips, 2005), scientists
often need to glue together various cross-domain services such as cross-organizational data management, analysis, simulation, and visualization services (Yan, 2007; Rygg, 2008). Compared with business
workflows, scientific workflows have special features such as computation, data or transaction intensity,
less human interactions, and a larger number of activities (Wieczorek, 2005). Accordingly, scientific
workflow applications frequently require collaborative patterns marked by multiple domain-specific
applications from different organizations. An engaged domain-specific application often contributes a
definite local computing goal to global scientific workflow execution. Typically, in this loosely coupled
application environment, goal-specific scientists are rather individualistic and more likely to create their
own knowledge discovery workflow by taking advantage of available services (Ludscher, 2005). It
promotes scientific collaboration in the form of service invocation for achieving certain computing goals.
To facilitate scientific workflow development and execution, cross-domain workflow modeling and
scheduling are key topics that are currently attracting more and more attention (Wieczorek, 2005; Yan, 2007; Yu,
J., & Buyya, R., 2005; Yu, J., Buyya, R., & Tham, C. K., 2005). For example, Yu and Buyya (Yu, J., &
Buyya, R., 2005) provided a general taxonomy of scientific workflow, in which workflow design, workflow
scheduling, fault tolerance, and data movement are four key features associated with the development and
execution of a scientific workflow management system in Grid environment. Furthermore, they believed
that a scientific workflow paradigm could greatly enhance scientific collaboration through spanning
multiple administrative domains to obtain specific processing capabilities. Here, scientific collaborations are often navigated by data-dependency and temporal-dependency relations among goal-specific
domain applications, in which a domain-specific application is often implemented as a local workflow
fragment deployed inside a self-managing organization for providing the demanded services in time. In
a grid computing infrastructure, a service for scientific collaboration is often called a grid service, which
addresses resource discovery, security, resource allocation, and other concerns (Foster, 2001). For cross-organizational collaboration, existing (global) analysis techniques often mandate every domain-specific
service to unveil all individual behaviors for scientific collaboration (Chiu, 2004). Unfortunately, such
an analysis is infeasible when a domain-specific service refuses to disclose its process details for privacy
or security reasons (Dumitrescu, 2005; Liu, 2006). Therefore, it is always a challenging endeavor to
coordinate a scientific workflow and its distributed domain-specific applications (local workflow fragments for producing domain-specific service), especially when a local workflow fragment is engaged
in different scientific workflow executions in a concurrent environment. Generally, a local workflow
fragment for producing domain-specific service is often deployed inside a self-governing organization,
which could be treated as a private workflow fragment of the self-governing organization.
In this situation, to effectively coordinate internal service performances and external service invocation, as well as their quality, collaborative scheduling between a scientific workflow and engaged
self-managing organizations may be greatly helpful for promoting the interactions of independent local
applications with the higher-level global application. It aims at coordinating executions for computation- and data-rich scientific collaboration in a readily available way. For example, resource management in
Grid environment is typically subject to individual access, accounting, priority, and security policies
of the resource owner. Resource sharing is, necessarily, highly controlled, with resource providers and
consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs in the form of services. The usage policy imposed on these resources
is often enforced by a self-managing organization (Foster, 2001; Batista, 2008). At runtime, if a self-managing organization refuses to disclose its process details for privacy or security reasons, the resource


service process is often promoted by a resource-broker (Abramsona, 2002; Elmroth, 2008). Besides,
if a resource cannot be shared by different resource users at the same time, executions of different
scientific workflows around these resources should coordinate their resource sharing in a compromising
way. Otherwise, conflicts would occur during the execution. Therefore, cross-organizational
scientific workflow execution, resource allocation, and compromising usage policy should be scheduled
in an incorporated way in a concurrent environment (Yan, 2007; Li, 2006). For instance, a computing
center is a typical self-managing organization that often bears heavy computing loads from numerous goal-specific applications. The scheduling of a computing center for satisfying its multiple external
service requirements is a typical coordinative process between a scientific workflow and a self-managing
organization. A resource-compromising usage policy is often adopted for coordinating its computational
resource usage processes engaged in different scientific collaborations in a concurrent environment.
Additionally, for a performance-driven scientific workflow execution, the collaborative scheduling process
is a more complex situation, as it not only covers cross-organizational
resource sharing, but also covers task enactments deployed inside self-managing organizations (Yan,
2007; Batista, 2008), which are often initiated by domain-specific service specifications and their application context specifications.
In view of these observations, a collaborative scheduling approach is investigated, in this chapter,
for achieving coordinated executions of a scientific workflow with time-related QoS evaluation. It is
specifically deployed in a Grid environment. Taking advantage of the collaborative scheduling strategy,
a private workflow fragment could maintain its temporal consistency with a scientific workflow in resource sharing and task enactments. Please note that our method subscribes to relative time rather than
absolute time in collaborative scheduling applications. The rest of this chapter is organized as follows.
In Section 2, some preliminary knowledge of QoS is presented for piloting our further discussion. In
Section 3, a temporal model of service-driven scientific workflow execution is investigated. In section 4,
application context analyses of scientific workflow execution are discussed. In Section 5, taking advantage
of the temporal model presented in Section 3 and the context analysis presented in Section 4, a temporal
reasoning rule is put forward for the collaborative scheduling of a scientific workflow. In Section 6, an evaluation is proposed for demonstrating the approach presented in this chapter. In Section 7,
related works and comparison analysis are presented to evaluate the feasibility of our proposal. Finally,
the conclusions and our future work are presented in Section 8.

PRELIMINARY KNOWLEDGE OF QoS


With recent advances in pervasive devices and communication technologies, there are increasing demands
in scientific and engineering applications for ubiquitous access to networked services. These services
extend support from Web browsers on personal computers to handheld devices and sensor networks.
Generally, a service is a function that is well-defined, self-contained, and does not depend on the context
or state of other services. Service-Oriented Architecture (SOA) is essentially a collection of services.
These services communicate with each other and the communication can involve either simple data
passing or it could involve two or more services for coordinating some activity (http://www.servicearchitecture.com/).
Figure 1 illustrates a general style of the service-oriented scenario. In Fig.1, a service consumer sends
a service request to a service provider, and the service provider returns a response to the service consumer.


Figure 1. Service-oriented scenario between service consumer and service provider

The request and subsequent response connections are defined in some way that is understandable
to both the service consumer and the service provider. Here, the service could be reified into a unit of work
done by a service provider to achieve a desired goal for a service consumer. Both the service provider
and the service consumer could be roles played by software agents on behalf of their owners.
Service-oriented applications are mostly launched by this Web service invocation style, as illustrated in
Fig.1. Generally, Web services are self-contained business applications, which can be published, located
and invoked by other applications over the Internet. Different vendors, research firms and standards
organizations could define their Web services differently; however, the common theme in all these definitions is that Web services are loosely coupled, dynamically bound, accessed over the Web, and standards
based. Web services often use XML schema to create a robust connection. They are based on strict
standard specifications to work together and with other similar kinds of applications. More specifically,
Web services are based on three key standards in their current manifestation, i.e., SOAP (XML-based
message format), WSDL (XML-based Web Services Description Language), and UDDI (XML-based
Universal Description, Discovery, and Integration of Web Services). Any use of these basic standards
constitutes a Web service. Universal, platform independent connectivity (via XML-based SOAP messages) and self-describing interfaces (through WSDL) characterize the Web services, and UDDI is the
foundation for a dynamic repository which provides the means to locate appropriate Web Services. The
typical Web Service invocation is demonstrated by Fig.2, by taking advantage of those standards.
Web services allow for the development of loosely coupled solutions. The independent resources
expose an interface, which can be accessed over the network. For example, a firm may expose a particu-

Figure 2. A typical Web service invocation paradigm in technology


lar application as a service, which would allow the firm's partners to access the particular service. This
is made possible by standards which define how Web services are described, discovered, and invoked.
This adherence to strict standards enables applications in one business to inter-operate easily with other
businesses. In addition, it allows application interactions across disparate platforms and those running
on legacy systems and thereby offers a company the capability of conducting business electronically
with potential business partners in a multitude of ways at reasonable cost.
It has to be acknowledged that Web Services technology is only one of several technologies that enable
component-based distributed computing and support information system integration efforts, largely due
to its universal nature and the broad support by major IT technologies. Other standards, such as WSFL
(Web Services Flow Language) or BPEL4WS (Business Process Execution Language for Web Services),
also play an important role, but are not necessarily required to consume or provide Web Services, and if
the location of the Web Service is known, even UDDI is not required. The basic concepts and scenarios
mentioned above can also be found at http://www.w3.org/2002/ws/.
The emergence of Web services has created unprecedented opportunities for organizations to establish more agile and versatile collaborations with other organizations. Widely available and standardized
Web services make it possible to realize cross-organizational collaboration. A typical SOA paradigm
based on Web Service rationale could be illustrated by Fig.3, in which there are three fundamental roles:
Service Provider, Service Requestor and Service Registry and 3 fundamental operations: Publish, Find
and Bind. The service provider is responsible for creating a service description, publishing its service
description to one or more service registries, and receiving Web service invocation messages from one
or more service requestors. A service requestor is responsible for finding a service description published
to one or more service registries and is responsible for using service descriptions to bind to or invoke
Web services hosted by service providers. The service registry is responsible for advertising Web service descriptions published to it by service providers and for allowing service requestors to search the
collection of service descriptions contained within the service registry. Once the service registry makes
the match, the rest of the interaction is directly between the service requestor and the service provider
for the Web service invocation (Graham, 2001). Please note that even though the grid service is often used in
the grid application and scientific workflow research domains, as it is essentially a special Web service that
provides a set of well-defined interfaces and follows specific conventions, we do not distinguish between a
grid service and a Web service in this chapter.
In Fig.3, since they are intended to be discovered and used by other applications across the Web,
Web services need to be described and understood both in terms of functional capabilities and QoS
properties. Therefore, a service is always specified by its function-attributes (i.e. a service's function
specification including inputs, outputs, preconditions and effects) and its non-function attributes (e.g.
time, price, availability, etc., for evaluating a service's execution). Generally, the service profile primarily describes a service's function-attributes. In cross-domain grid service invocations, the quality of a grid
service (mainly specified by its non-function attributes) is often evaluated in terms of common security semantics,
distributed workflow and resource management, coordinated fail-over, problem determination services,
and other metrics across a collection of resources with heterogeneous and often dynamic characteristics.
In (Zeng, 2004), five generic quality criteria for elementary services are presented as follows: (1) Execution price. Given an operation op of a service s, the execution price qpr(s; op) is the fee that a service
requester has to pay for invoking the operation op. (2) Execution duration. Given an operation op of a
service s, the execution duration qdu(s; op) measures the expected delay in seconds between the moment
when a request is sent and the moment when the results are received. (3) Reputation. The reputation


Figure 3. A typical SOA paradigm based on Web service

qrep(s) of a service s is a measure of its trustworthiness. It mainly depends on end users' experiences of
using the service s. Generally, different end users may have different opinions on the same service. (4)
Successful execution rate. The successful execution rate qrat(s) of a service s is the probability that a
request is correctly responded (i.e., the operation is completed and a message indicating that the execution has been successfully completed is received by service requestor) within the maximum expected
time frame indicated in the Web service description. The successful execution rate (or success rate for
short) is a measure related to hardware and/or software configuration of Web services and the network
connections between the service requesters and providers, and (5) Availability. The availability qav(s) of
a service s is the probability that the service is accessible.
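These five criteria can be carried alongside a service description as a simple record. The following sketch is an illustration only; the field names, weights and the scoring rule are assumptions made here and are not part of the cited model.

# Hedged sketch: a record bundling the five quality criteria listed above for
# one service operation, plus an illustrative weighted score (weights and the
# sign conventions are assumptions).
from dataclasses import dataclass

@dataclass
class ServiceQoS:
    price: float         # q_pr(s, op): execution price
    duration: float      # q_du(s, op): expected execution duration in seconds
    reputation: float    # q_rep(s): in [0, 1], higher is more trustworthy
    success_rate: float  # q_rat(s): probability of a correct, timely response
    availability: float  # q_av(s):  probability that the service is accessible

    def score(self, w_price=0.2, w_dur=0.2, w_rep=0.2, w_rat=0.2, w_av=0.2):
        # Lower price and duration are better, so they enter negatively.
        return (-w_price * self.price - w_dur * self.duration
                + w_rep * self.reputation + w_rat * self.success_rate
                + w_av * self.availability)

q = ServiceQoS(price=0.05, duration=2.5, reputation=0.9,
               success_rate=0.98, availability=0.995)
print(round(q.score(), 3))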

A Temporal Model of Service-Driven Scientific Workflow Execution


In this chapter, we mainly focus on discussing scientific workflow scheduling with time-related QoS
evaluation in a grid environment. More specifically, a temporal model for service-driven scientific workflow
execution is presented in this Section, and its further applications are investigated in later sections. In
(Cardoso, 2004), four distinct advantages are highlighted for organizations to characterize their workflow
developments and executions based on QoS:
1. QoS-based design: it allows organizations to translate their vision into their business processes more
efficiently, since workflows can be designed according to QoS metrics. For e-commerce processes
it is important to know the QoS an application will exhibit before making the service available to
its customers.
2. QoS-based selection and execution: it allows for the selection and execution of workflows based
on their QoS, to better fulfill customer expectations. As workflow systems carry out more complex
and mission-critical applications, QoS analysis serves to ensure that each application meets user
requirements.
3. QoS monitoring: it makes possible the monitoring of workflows based on QoS. Workflows must
be rigorously and constantly monitored throughout their life cycles to assure compliance both with
initial QoS requirements and targeted objectives. QoS monitoring allows adaptation strategies to
be triggered when undesired metrics are identified or when threshold values are reached.


4. QoS-based adaptation: it allows for the evaluation of alternative strategies when workflow adaptation becomes necessary. In order to complete a workflow according to initial QoS requirements,
it is necessary to expect to adapt, re-plan, and reschedule a workflow in response to unexpected
progress, delays, or technical conditions. When adaptation is necessary, a set of potential alternatives is generated, with the objective of changing a workflow so that its QoS continues to meet initial
requirements. For each alternative, prior to actually carrying out the adaptation in a running workflow, it is necessary to estimate its impact on the workflow QoS.

In a service-driven workflow system, time is one of the key parameters engaged in its QoS specification (Cardoso, 2004). Timing constraints are often associated with organizational rules, laws, commitments, technical demands, and so on. In (Zeng, 2004), two timing constraints are put forward related to
activities, which are internal timing constraint and external timing constraint. They specified the internal
timing constraint as the execution duration or executable time span; and specified the external timing
constraints as the temporal dependency relations between different activities. On the assumption that
given a workflow model, designers could assign execution duration and executable time span (during
which an activity could be executed) to every individual activity based on their experience and expectation from the past execution, Li, et al (Zeng, 2004) defined the duration time accurately in their timing
constraint model. In practice, we believe that it may be more reasonable to specify the duration time
as a time span. For example, it may be more acceptable to specify the execution duration is 3 days to
5 days (ab. (3, 5)) than to specify the execution duration is 4 days, at the stage of system modeling. By
extending the timing constraint definitions presented in (Zeng, 2004) with this idea, we put forward a
general timing constraint model for service invocation engaged in cross-domain workflow execution.
To facilitate temporal-dependency analysis, we observe that the service invocation cost often consists of a service producing cost and a service delivering cost. The service producing process aims at producing concrete service content, and it underlies the later service delivering. As the service producing process is deployed to realize the required service item, its time-related QoS evaluation is typically calculated from the internal temporal cost inside an organization. In contrast, the time-related QoS evaluation of the service delivering cost is typically calculated from the external temporal cost associated with the service distributing process among service providers and service consumers, as well as the administrative cost consumed in cross-organizational collaboration. Accordingly, the cost evaluation of a service invocation is calculated from these two costs. For example, when a car part vendor receives an order for some parts, the service process spans the time from receiving the order to delivering the products. It often contains two stages: the first stage focuses on manufacturing the required parts and is associated with the enterprise's internal elapsed time; the second stage focuses on timely delivery of the required parts and is associated with the enterprise's external elapsed time. The QoS is therefore related to the times of both the service producing process and the service delivering process. Here, the internal time is determined by the service producing process and the external time is determined by the service delivering process. The cost of the service can be evaluated from the time costs of these two stages, which are two sides of the time analysis of the same service invocation. Please note that in some situations there may be only a service delivering process without a service producing process. For example, the service provided by an Urban Emergency Monitoring and Support Centre (EMSC) is often related to service delivering without a service producing process in a concrete service invocation. In this case, the QoS is related only to the time of service delivering.
Figure 4. A temporal logic-based time model for steering service invocation

Associated with the service producing process and the service delivering process, some typical service invocation modes are discussed here to specify their coordination. Fig. 4 illustrates a typical coordination relation between a service provider and a service consumer. The temporal parameters illustrated in Fig. 4 are specified in Table 1.
According to the temporal-dependency relations among the parameters listed in Table 1, some typical service invocation styles are specified as follows; a small illustrative sketch is given after the list.

1.  If SP-End = SD-Start, we say that the service delivering process follows a strong service delivering style.
2.  If SP-End < SD-Start, we say that the service delivering process follows a weak service delivering style.
3.  If SC-Start = SD-End, we say that the service consuming process follows a strong service invocation style.
4.  If SC-Start < SD-End, we say that the service consuming process follows a weak service invocation style.
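As a minimal sketch under assumptions (the class and field names below are invented for exposition and are not part of the chapter's formal model), these styles can be classified directly from the temporal parameters of Table 1:

# Illustrative sketch: classifying service delivering / invocation styles
# from the temporal parameters of Table 1. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class ServiceInvocation:
    sp_start: float   # SP-Start: start of the service producing process
    sp_end: float     # SP-End: end of the service producing process
    sd_start: float   # SD-Start: start of the service delivering process
    sd_end: float     # SD-End: end of the service delivering process
    sc_start: float   # SC-Start: start of the service consuming process
    sc_end: float     # SC-End: end of the service consuming process

    def delivering_style(self) -> str:
        if self.sp_end == self.sd_start:
            return "strong service delivering"   # delivering starts when producing ends
        if self.sp_end < self.sd_start:
            return "weak service delivering"     # a gap separates producing and delivering
        return "undefined"

    def invocation_style(self) -> str:
        if self.sc_start == self.sd_end:
            return "strong service invocation"   # consuming starts when delivering ends
        if self.sc_start < self.sd_end:
            return "weak service invocation"
        return "undefined"

# Example: producing over (0, 3), delivering over (3, 5), consuming starting at 5.
inv = ServiceInvocation(0, 3, 3, 5, 5, 8)
print(inv.delivering_style(), "|", inv.invocation_style())

A degraded producing phase with [SP-Start, SP-End] = (0, 0), as discussed below, would simply correspond to sp_start = sp_end = 0 in this sketch.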

In practice, SC-Start is often determined first, which initiates the service invocation in a bottom-up way. More specifically, a service provider often schedules its service producing process and its service delivering process according to the deadline required by a service consumer.

Table 1. Specifications of temporal parameters indicated in Figure 4


Temporal Parameters     Specifications
SD-Start                The start time of the service delivering process
SD-End                  The end time of the service delivering process
SP-Start                The start time of the service producing process
SP-End                  The end time of the service producing process
SC-Start                The start time of the service consuming process
SC-End                  The end time of the service consuming process
[T-SR]                  The time point for firing the service producing process

What a service requestor cares about is the expected time at which it obtains its required item, while the service provider takes care of the times of service producing and service delivering, respectively, subject to a given service-providing deadline. Here, if the time pair [SP-Start, SP-End] degenerates to (0, 0), the invocation is a special service paradigm without a service producing process. For example, if the car part vendor has the required parts in stock, the part producing time is omitted and the service process depends only on the part delivering process.
If the service provider and the service consumer illustrated in Figure 4 respectively stand for two workflow fragments, the figure indicates a typical service-driven cross-domain workflow execution with time-related QoS evaluation. It is particularly suitable for specifying a service-driven scientific workflow execution in a grid environment. Furthermore, in Fig. 4, the service definitions mainly consist of such foundational prescriptions as the definition of the service function, the QoS prescription, the cost evaluation, the serving-relation definition, and other service item definitions. They prescribe the policies of how to organize the required Web services into a service-based workflow system, and they satisfy the security requirements of a scientific workflow execution in a grid environment. For example, in a grid environment, resource access and resource sharing are often enabled by granting a license to a certain resource user and then opening resource access to valid consumers (Yan, 2007; Foster, 2001). According to the rationale of temporal logic and the service invocation scenario demonstrated by Fig. 4, a security policy can easily be embedded into grid service invocation for a grid computing paradigm. In practice, security policies are often integrated into a grid service's QoS evaluation, which provides different profiles of control flow and data flow specification. In the following sections, we take advantage of this temporal model to explore the scheduling of cross-domain scientific workflow execution in a grid environment.

APPLICATION CONTEXT ANALYSES OF SCIENTIFIC WORKFLOW EXECUTION


As mentioned in (Yu, J., & Buyya, R., 2005), the cross-organizational collaboration engaged in a scientific workflow often aims at obtaining specific processing capabilities by spanning multiple administrative domains. Here, a domain-specific application engaged in a scientific collaboration is often uniquely associated with a local workflow fragment deployed in a self-managing organization. In this chapter, when a workflow fragment refuses to disclose some of its process details for privacy or security reasons, it is treated as a private workflow fragment of a self-managing organization. For a private workflow fragment, the actions and resources hidden from the scientific workflow specification and execution are treated as silent actions and silent resources.
In contrast to its silent actions and silent resources, a self-managing organization exposes only its publicly accessible port for the scientific collaboration. Therefore, a private goal-specific workflow fragment consists of a set of silent actions, silent resources and some publicly accessible ports. It is essentially a gray box embedded in a scientific workflow. In scientific workflow execution, it acts wholly as a functional unit for scientific collaboration, and it is triggered through its publicly accessible port for certain computing goals. In this chapter, a publicly accessible port is treated as an interaction interface between a scientific workflow and a self-managing organization.
Figure 5. Global application context of a cross-organizational scientific workflow execution

Fig. 5 demonstrates a scientific workflow and its application context associated with three self-managing organizations. In Fig. 5, a scientific workflow consists of three tasks (i.e., Ti, Tj, and Tk) and three resources (i.e., Ri, Rj, and Rk). Ti, Tj, and Tk are respectively associated with three private workflow fragments (i.e., Pri-WF1, Pri-WF2, and Pri-WF3) for achieving certain local computing goals. Pri-WF1, Pri-WF2, and Pri-WF3 are respectively deployed inside three self-managing organizations (i.e., SM-Org-1, SM-Org-2, and SM-Org-3). Obviously, this scientific workflow is typically deployed in a cross-organizational way.
A scientific workflow specification cannot cover the silent actions and silent resources contained in private fragments, as they belong exclusively to the self-managing organizations for certain privacy or security reasons. Fig. 6 illustrates Pri-WF1, Pri-WF2, and Pri-WF3, together with a global scientific workflow view obtained by masking the silent actions and silent resources engaged in Pri-WF1, Pri-WF2, and Pri-WF3.
From Fig. 6, we can see that Pri-WF1, Pri-WF2, and Pri-WF3 are respectively enacted by different self-managing organizations in isolated environments. The scientific workflow covers only their publicly accessible ports. As a private workflow fragment often hides part of its own internal workflow specification and its scheduling specification from the scientific workflow, it is a challenging endeavor to coordinate the executions of a scientific workflow and a private workflow fragment at runtime for their scientific collaboration, and such coordination greatly depends on the collaborative execution specification.
In view of this challenge, we now discuss the temporal context of a scientific workflow for its runtime scheduling. As mentioned in (Rajpathak, 2006), scheduling deals with the assignment of jobs and activities to resources and time ranges in accordance with relevant constraints and requirements. For a scientific workflow, its scheduling application is always promoted in a top-down way. For example, scheduling tools such as Petri nets, WF-nets, or DAGs (Zeng, 2004; Li, 2003; van der Aalst, 1998; Guan, 2006) are typically associated with a direction from source behaviors to sink behaviors; they typically follow a downstream scheduling style. In this scheduling style, the start times of the source behaviors are determined in advance, and then the succeeding activities are scheduled according to certain workflow patterns (e.g., the And-Split, Or-Split, And-Join, and Or-Join workflow patterns (van der Aalst, 2003), to name a few) and certain temporal dependencies (e.g., Before, Meet, Overlap (Allen, 1983), etc.). The scheduling application thus unfolds in the same direction as the actual execution. Here, we use a time axis t1 to indicate the scheduling application of a scientific workflow.


Figure 6. Three private workflow fragments and a scientific workflow view by masking its silent context

For a private workflow fragment associated with a scientific workflow, its scheduling application differs from that of the scientific workflow. As a private workflow fragment is always triggered through its publicly accessible port, the concrete temporal parameters of its behaviors cannot be scheduled independently, even though there are certain source behaviors and sink behaviors in its model and in its later concrete execution. Its scheduling application unfolds in two stages. At the first stage, according to the expected computing goal specified by the scientific collaboration, a private workflow fragment schedules its workflow model and execution in an isolated application environment. At this scheduling stage, the temporal constraints of the publicly accessible ports specified by the scientific workflow scheduling specification are not yet considered. Here, we use a time axis t2 to indicate the scheduling application environment of a private workflow fragment. At the second stage, taking advantage of the temporal constraints of the publicly accessible port, the temporal distributions of the private workflow fragment indicated by time axis t2 are wholly mapped onto time axis t1 to keep the temporal consistency with its publicly accessible port. Through this time mapping, we can guarantee the temporal consistency between the executions of a private workflow fragment and a scientific workflow in their scientific collaboration.
The first scheduling stage aims at specifying a private workflow fragment's internal temporal dependencies among its silent behaviors and its publicly accessible port, without external temporal constraints. Its scheduling application starts from a certain source point and unfolds in the same direction as the workflow fragment's actual execution. The second scheduling stage aims at keeping the external temporal consistency with the scientific workflow execution for the scientific collaboration, through a temporal transfer from time axis t2 to time axis t1. This temporal calculating process is initiated by a publicly accessible port, and it may unfold in the reverse direction compared with the workflow fragment's actual execution. It is a typical hierarchical scheduling process.

Figure 7. Typical temporal parameters and their distributions for scheduling a scientific workflow SWF

For example, in Fig. 5, the publicly accessible ports of Ri and Ti inside SM-Org-1 stand for a sink resource and a sink task of the private workflow fragment Pri-WF1. In this situation, the scheduling application of Pri-WF1 depends on the expected start time of the scientific workflow. As the scheduling application of Pri-WF1 is initiated by the scheduling result of the scientific workflow, their executions should be scheduled in an incorporated way.
Please note that the temporal disciplines (e.g., the weak service delivering style, strong service delivering style, weak service invocation style, and strong service invocation style) presented in Section 3 can be incorporated into the application context of a cross-organizational scientific workflow execution. A concrete example of a service delivering style is demonstrated in the evaluation analysis presented in Section 6.2. In the next section, we focus on exploring a temporal reasoning rule that can coordinate the cross-organizational executions of a scientific workflow.

A Temporal Reasoning Rule for Collaborative Scheduling Application of Scientific Workflow
According to the temporal model presented in Section 3, a temporal reasoning rule is investigated in this section for cross-domain scientific workflow scheduling. First, suppose that there is just one publicly accessible port contained in a private workflow fragment in a self-managing organization. More complex situations are investigated at the end of this section.
Definition 1. For a scientific workflow SWF, its expected executable duration can be specified by a time period [SWF-Estart, SWF-Eend], in which SWF-Estart and SWF-Eend respectively stand for SWF's expected start time and expected end time.
Definition 2. Suppose that there is a private workflow fragment Pri-WFi associated with a scientific workflow SWF, and that it has a publicly accessible port Pi engaged in SWF's execution. For Pi, its expected executable duration can be indicated by a time period [Pri-WFi-Ep-start, Pri-WFi-Ep-end], in which Pri-WFi-Ep-start and Pri-WFi-Ep-end respectively stand for Pi's expected start time and expected end time.
Here, SWF-Estart, SWF-Eend, Pri-WFi-Ep-start, and Pri-WFi-Ep-end are specified by SWF's scheduling specification. Fig. 7 indicates this scheduling process of SWF with a time axis t1. Please note that the temporal parameters indicated by time axis t1 are relative times rather than absolute times.
In practice, Pri-WFi is uniquely associated with SWF's execution through Pi, in which SWF plays the role of a service consumer and Pri-WFi plays the role of a service provider in their cross-domain scientific collaboration. In this service-driven scientific collaboration, as a service consumer, SWF should first specify its service requirement in terms of what and when; then, as a service provider, Pri-WFi is scheduled to provide the demanded service in time, in terms of how and when. Concretely, Pri-WFi's execution aims at providing the demanded service in time, based on SWF's specification. Once a service item is determined in terms of what, the required silent resources and silent task enactments can be deployed by a self-managing organization to achieve the expected computing goal; this corresponds to the first scheduling stage mentioned in the previous section. To provide the demanded service in time, Pri-WFi's implementation should be scheduled in terms of when; this corresponds to the second scheduling stage mentioned in the previous section.

Figure 8. Typical temporal parameters and their distributions for scheduling a private workflow fragment Pri-WFi
Generally, for a goal-driven workflow execution, if there is no external temporal dependency with other workflow executions, it can be scheduled based on its capacity and past experience in a self-managing way with a specific execution goal (Li, 2003).
Definition 3. For a private workflow fragment Pri-WFi that has no external temporal dependency with other workflow executions, its expected executable duration can be specified by a time period [Pri-WFi-Estart, Pri-WFi-Eend], in which Pri-WFi-Estart and Pri-WFi-Eend respectively stand for Pri-WFi's expected start time and expected end time.
Fig. 8 demonstrates this scheduling process of Pri-WFi specified by time axis t2. Similarly, the temporal parameters indicated by time axis t2 are also relative times rather than absolute times.
As Pri-WFi is uniquely associated with SWF through Pi, there are certain temporal dependencies between Pri-WFi and SWF. To provide the required computing service in time, Pri-WFi should be active during the duration required by these temporal dependencies. Pri-WFi's start time should therefore be deduced from the temporal constraints of its publicly accessible port specified by SWF's specification, rather than determined independently. Suppose that, associated with time axis t2, Pi's expected start time and expected end time are respectively indicated by Pri-WFi-E'p-start and Pri-WFi-E'p-end (the primed parameters denote values relative to t2). In terms of absolute time, i.e., in the actual execution, Pri-WFi-E'p-start and Pri-WFi-E'p-end should be respectively equal to Pri-WFi-Ep-start and Pri-WFi-Ep-end. To keep the temporal consistency between Pri-WFi and SWF, the time parameters Pri-WFi-Estart, Pri-WFi-Eend, Pri-WFi-E'p-start and Pri-WFi-E'p-end indicated by t2 should be mapped onto SWF's time axis t1.
In view of this observation, a temporal transferring rule is investigated below for keeping temporal consistency in cross-organizational scientific collaboration.

Figure 9. Temporal parameters and their distributions of a scientific workflow example SWF associated with time axis t1

1.  For a scientific workflow SWF, as the time period [SWF-Estart, SWF-Eend] covers the time period [Pri-WFi-Ep-start, Pri-WFi-Ep-end], i.e., [Pri-WFi-Ep-start, Pri-WFi-Ep-end] ⊆ [SWF-Estart, SWF-Eend], SWF-Estart should be determined first, and then Pri-WFi-Ep-start and Pri-WFi-Ep-end are determined according to the value of SWF-Estart and SWF's internal temporal distributions. This temporal scheduling is formalized by the scheduling logic SWF-Estart → [Pri-WFi-Ep-start, Pri-WFi-Ep-end]. It indicates a top-down, or global-to-local, temporal reasoning path for scheduling a scientific workflow.
2.  For a private workflow fragment Pri-WFi, although the time period [Pri-WFi-Estart, Pri-WFi-Eend] covers the time period [Pri-WFi-E'p-start, Pri-WFi-E'p-end], i.e., [Pri-WFi-E'p-start, Pri-WFi-E'p-end] ⊆ [Pri-WFi-Estart, Pri-WFi-Eend], a different temporal scheduling logic holds. More specifically, only after the values of Pri-WFi-Ep-start and Pri-WFi-Ep-end have been obtained through the scheduling logic SWF-Estart → [Pri-WFi-Ep-start, Pri-WFi-Ep-end] can Pri-WFi-Estart and Pri-WFi-Eend be deduced from the concrete values of Pri-WFi-Ep-start, Pri-WFi-Ep-end and Pri-WFi's internal temporal distributions (Fig. 8 illustrates Pri-WFi's internal temporal distributions in a qualitative way). This temporal scheduling process is formalized by the scheduling logic [Pri-WFi-Ep-start, Pri-WFi-Ep-end] → [Pri-WFi-Estart, Pri-WFi-Eend]. It indicates a bottom-up, or local-to-global, temporal reasoning path for scheduling a private workflow fragment in an incorporated scheduling environment, which is the reverse of the global-to-local temporal reasoning path.

Pri-WFi-E'p-start and Pri-WFi-E'p-end should be respectively equal to Pri-WFi-Ep-start and Pri-WFi-Ep-end in terms of absolute time. Accordingly, the temporal association relation between SWF and Pri-WFi is specified by Definition 4.
Definition 4. The temporal dependency around a publicly accessible port between SWF and Pri-WFi can be formalized as [SWF-Estart, SWF-Eend] ⊇ [Pri-WFi-Ep-start, Pri-WFi-Ep-end] ⊆ [Pri-WFi-Estart, Pri-WFi-Eend] for keeping temporal consistency in a cross-organizational scientific collaboration.
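A minimal, hedged sketch of the containment checks behind Definition 4 is given below; the function names and tuple layout are assumptions made only for illustration.

# Illustrative sketch: checking the interval containments behind Definition 4.
# Intervals are (start, end) pairs of relative times; all names are hypothetical.
def contains(outer: tuple, inner: tuple) -> bool:
    """True if the inner time period lies within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def temporally_consistent(swf: tuple, port_on_t1: tuple,
                          fragment_on_t2: tuple, port_on_t2: tuple) -> bool:
    # The port interval fixed by SWF must fit inside SWF's expected duration (axis t1),
    # and the same port, expressed on the fragment's own axis t2, must fit inside
    # the fragment's expected duration.
    return contains(swf, port_on_t1) and contains(fragment_on_t2, port_on_t2)

# Example values taken from Fig. 9 and Fig. 10:
print(temporally_consistent(swf=(0, 10), port_on_t1=(3, 5),
                            fragment_on_t2=(0, 5), port_on_t2=(2, 4)))  # True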
Here, an example is presented to demonstrate the application of the temporal transferring process
according to Definition 4.
Fig. 9 and Fig. 10 respectively illustrate the scheduled temporal distributions of a SWF and a Pri-WFi. These two scheduled temporal distributions are respectively associated with time axes t1 and t2. More specifically, in Fig. 9, SWF-Estart = 0, SWF-Eend = 10, Pri-WFi-Ep-start = 3, and Pri-WFi-Ep-end = 5; in Fig. 10, Pri-WFi-Estart = 0, Pri-WFi-Eend = 5, Pri-WFi-E'p-start = 2, and Pri-WFi-E'p-end = 4.
Here, a temporal association relation is taken into consideration between these two time axes. The incorporated temporal scheduling environment should be specified by a united time axis. For brevity and without loss of generality, time axis t1 is selected as the united time axis t. Here, some duration parameters associated with Pri-WFi are specified below for later temporal reasoning.


Figure 10. Temporal parameters and their distributions of a private workflow fragment example Pri-WFi associated with time axis t2

1.  The duration between Pri-WFi-Estart and Pri-WFi-E'p-start is 2 time units, i.e., Pri-WFi-d1 = 2 time units;
2.  The duration between Pri-WFi-E'p-start and Pri-WFi-E'p-end is 2 time units, i.e., Pri-WFi-d2 = 2 time units; and
3.  The duration between Pri-WFi-E'p-end and Pri-WFi-Eend is 1 time unit, i.e., Pri-WFi-d3 = 1 time unit.

According to these parameters, the time parameters of Pri-WFi can be re-specified, as below, on the united time axis t. Fig. 11 illustrates the re-specified temporal parameters and their distributions on the united time axis t; a small worked sketch of this mapping follows the list.

1.  Pri-WFi-E'p-start = Pri-WFi-Ep-start = 3.
2.  Pri-WFi-E'p-end = Pri-WFi-Ep-end = 5.
3.  Pri-WFi-Estart = Pri-WFi-Ep-start − Pri-WFi-d1 = 3 − 2 = 1.
4.  Pri-WFi-Eend = Pri-WFi-Ep-end + Pri-WFi-d3 = 5 + 1 = 6.
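The following Python fragment reproduces this re-specification step as a worked sketch; the function name and argument layout are assumptions for illustration, not the chapter's notation.

# Illustrative sketch: mapping a private fragment's t2-relative times onto the
# united time axis t, anchored at the port times fixed by the SWF specification.
def map_fragment_onto_united_axis(port_on_t1, fragment_on_t2, port_on_t2):
    """Return (E_start, Ep_start, Ep_end, E_end) of Pri-WFi on the united axis t."""
    ep_start_t1, ep_end_t1 = port_on_t1        # e.g., (3, 5) from SWF's specification
    e_start_t2, e_end_t2 = fragment_on_t2      # e.g., (0, 5) on axis t2
    ep_start_t2, ep_end_t2 = port_on_t2        # e.g., (2, 4) on axis t2

    d1 = ep_start_t2 - e_start_t2              # lead time before the port opens
    d3 = e_end_t2 - ep_end_t2                  # tail time after the port closes

    e_start_t = ep_start_t1 - d1               # 3 - 2 = 1
    e_end_t = ep_end_t1 + d3                   # 5 + 1 = 6
    return e_start_t, ep_start_t1, ep_end_t1, e_end_t

print(map_fragment_onto_united_axis((3, 5), (0, 5), (2, 4)))  # (1, 3, 5, 6)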

Figure 11. Re-specified temporal parameters and their distributions of example SWF and Pri-WFi associated with a united time axis t based on their temporal-dependency relation

In Fig. 11, Pri-WFi's start time as specified on time axis t (i.e., Pri-WFi-Estart = 1) is the ideal start time for producing the required service item for SWF's execution. Otherwise, Pri-WFi either incurs additional time costs or cannot provide the demanded service item in time. For example, if Pri-WFi starts at the zero time point on time axis t, i.e., Pri-WFi-Estart = 0, then, as Pi's expected start time is fixed in SWF's specification, i.e., Pri-WFi-Ep-start = 3 and Pri-WFi-Ep-end = 5 cannot be changed, the duration between Pri-WFi-Estart and Pri-WFi-Ep-start becomes 3 time units, i.e., Pri-WFi-d1 = 3 time units. Obviously, this wastes 1 time unit compared with Pri-WFi-d1's original value (i.e., Pri-WFi-d1 = 2) calculated previously. On the other hand, if Pri-WFi starts at the 2nd time point on time axis t, i.e., Pri-WFi-Estart = 2, then, as the duration from Pri-WFi-Estart to Pri-WFi-Ep-start is a fixed value, according to their relative time distributions the port's actual start time would fall at the 4th time point, i.e., at t = 4, to meet Pri-WFi's internal workflow specification. Obviously, this delays the service invocation required for SWF's execution.


This example demonstrates a real-time application for scientific collaboration. In practice, the execution of a scientific workflow system may be a mixture of hard real-time applications and soft real-time
applications. Generally, a system is said to be real-time if the total correctness of an operation depends
not only upon its logical correctness, but also upon the time in which it is performed (Liu, 2002).
Moreover, in a hard or immediate real-time system, the completion of an operation after its deadline is
considered useless. On the other hand, a soft real-time system tolerates such lateness and takes the overhead of context switching into consideration. Soft real-time systems are typically useful when concurrent access issues require keeping a number of connected systems up to date with changing situations.
To incorporate the soft real-time property into the workflow scheduling application, a publicly accessible port should have the typical attributes Has-Earliest-Start-Time, Has-Latest-Start-Time, Has-Earliest-End-Time, and Has-Latest-End-Time, as specified in (Rajpathak, 2006). Moreover, a temporally dependable service initiated by a publicly accessible port subscribes to Allen's (Allen, 1983) representation of standard time and relations between a private workflow fragment and a scientific workflow. These temporal attributes are key temporal constraints for task enactment and resource allocation in scientific collaboration.
Here, some more complex situations are investigated. Suppose that there is more than one publicly accessible port contained in a self-managing organization. With this scenario in mind, the following complex situations can be distinguished.
1.  If the ports belong to the same private workflow fragment Pri-WFi, and Pri-WFi is engaged in only one scientific collaboration with a scientific workflow, its local temporal scheduling among its silent actions, silent resources and publicly accessible ports simply aims at providing the demanded service item in time. In this situation, there is no conflict in resource sharing or task enactment, and it is easy to schedule Pri-WFi in a self-managing way.
2.  If the ports belong to the same private workflow fragment Pri-WFi, and Pri-WFi is engaged in more than one scientific workflow in a concurrent environment, its local temporal scheduling among its silent actions, silent resources and publicly accessible ports should be coordinated so as to satisfy the different service items of the different scientific workflows. Pri-WFi is required to compromise among the service producing processes if there is a conflict in resource sharing or task enactment.
3.  If the ports belong to different private workflow fragments, and there is no shared silent action or silent resource for producing the service items among these private workflow fragments, the private workflow fragments can be scheduled independently of each other, according to the temporal transferring rule proposed in this chapter.
4.  If the ports belong to different private workflow fragments but there are some shared silent actions or silent resources among the workflow fragments, the scheduling of these workflow fragments should be promoted in an incorporated way. In the following section, an evaluation is presented to demonstrate this complex situation based on the method presented in this section.


Figure 12. Scientific workflow execution process based on context switching and role binding

EVALUATION: A SCIENTIFIC WORKFLOW ENGINEERING DEVELOPMENT WITH TIME-RELATED QoS EVALUATION
Fig. 12 demonstrates a scientific workflow execution process based on a context-awareness technique, in a simulated setting. It is a typical scheduler-based paradigm, and it provides the foundation for developing a context- and role-driven workflow engine through timely role binding. The execution is navigated through effective context switching at runtime. More specifically, a Member judges its application situation, perceives application context information relevant to its behavioral goals, selects a suitable role according to certain application logic, and then enters a concrete application context with that role specification. The taxonomy of context illustrated in Fig. 12 navigates effective context awareness and suitable role binding for promoting a scientific workflow execution.
Please note that the application logic demonstrated in Fig. 12 unfolds in two directions with certain state transitions: one is driven by a vertical navigating logic and the other by a horizontal navigating logic. The vertical navigating logic is organized in three stages: task performing, resource access, and collaborating with others. The horizontal navigating logic is composed of context awareness, runtime context switching, and concrete task execution. With the navigating discipline demonstrated in Fig. 12, a cross-domain scientific workflow system can be effectively promoted in a collaborative way.
Accordingly, a service-driven scientific workflow system can be characterized as sequences of service
invocations based on certain temporal logic among distributed and heterogeneous stand-alone systems
that can provide autonomous services. The autonomous services can be treated as task-oriented processing units. According to the temporal disciplines presented in this chapter, a certificate-driven scientific
workflow application paradigm will be explored with time-related QoS evaluation.
In practice, a scientific workflow is often deployed in a cross-organizational way, which is often enabled in the form of a virtual organization. In (Foster, 2001), virtual organizations were characterized by flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and organizations. In (Yuan, 2000), three generic accounts of virtual organizations were summarized. The first concerns organizations that extend some of their organizational activities externally, thus forming virtual alliances to achieve organizational objectives; integrating several companies' core competences and resources may form virtual organizations in a collaborative way. The second description of the virtual organization refers to a perceptual organization that is abstract, unseen and existing within the minds of those who form a particular organization, which is an antithesis of the physical organization with which we are familiar. The third type of description is of organizations that are established with information technology, such as corporations with an intensive use of telecommuting. Service execution times, level of control and flexible change control can be abstracted as key aspects for service-driven workflow specifications and enactment (Grefen, 1999), especially for processes crossing organizational boundaries (Grefen, 2001).
For scientific workflow execution, the servers supporting the workflow application are decentralized (duplicated) throughout the virtual organization, and the distributed servers are controlled by a centralized authority (headquarters). This setting is characterized by the following basic features:

1.  The lifetime of the cooperation is limited;
2.  Cross-organizational collaboration;
3.  Access to a wide range of specialized resources during the collaboration;
4.  Task- or goal-driven autonomous processes;
5.  Role-based communication, etc.

Here, time constraints are often imposed on resource access or task enactment. At the stage of modeling a cross-organizational workflow, the concept of control flow is exploited to prescribe the service relations among organizations together with their temporal dependencies. During workflow execution, control flow is instantiated into logical switching, according to a scheduled temporal logic among activities, to satisfy a certain collaboration. The temporal logic of a cross-organizational collaboration thus specifies the cross-organizational workflow execution. If these temporal specifications are issued by a global workflow engine, private workflow fragments at the organization level can be automatically navigated by the specified temporal logic. This guarantees that the workflow fragments are fired in time to satisfy the collaboration in the form of service invocations. Each workflow fragment at the organization level can then concentrate on its internal execution in a self-governing way, taking little care of its temporal context switching. In a Grid application environment, this temporal logic is often realized in the form of a certificate mechanism. A service is typically opened by granting certificates to creditable candidates for certain resource accesses. This certificate mechanism provides authorization in terms of what, when, who, and how, in which the temporal discipline is a key factor for certificate granting and use.
In view of these observations, a certificate-driven workflow execution scenario is explored for cross-organizational workflow collaboration and resource access. A certificate is valid only for a limited lifetime of service invocation during workflow execution. This certificate-driven workflow execution scenario can be specified as follows:
1.  Server-level or proxy-level private workflow fragments delegate their certificate granting to the workflow engine;
2.  Invocations of services and functions among private workflow fragments are awakened through certificates granted by the workflow engine;
3.  The validity period of a certificate reflects the lifetime of the cooperation and guarantees the QoS in time;
4.  Private workflow fragments are task- or goal-driven autonomous processes, in which the workflow engine plays the role of a nerve center of the workflow system.

Figure 13. A grid-oriented and certificate-driven workflow system in self-governing fashion

Essentially, this certificate-driven scientific workflow execution is initiated by a server-based workflow engine that controls the global workflow execution according to the cross-organizational collaboration among private workflow fragments and their resource accesses. Fig. 13 demonstrates the enactment of a prototype that conforms to the service computing paradigm demonstrated by Fig. 1, Fig. 2, and Fig. 3.
The collaboration disciplines engaged in Figure 13 are depicted as follows.
Step 1: Proxy-based private workflow fragments hand over their routines of certificate release to a global workflow engine that acts as the certificate authority in the subsequent scientific workflow execution.
Step 2: According to the pre-defined global scientific workflow application logic, the global workflow engine initiates cross-organizational collaboration by granting a certificate to a candidate for a certain service invocation (resource access). Service invocation is enabled via certificate identification and authentication, and the validity period of a certificate specifies the collaborative duration.
Step 3: After granting a certificate, a duplicate of the certificate is sent to the resource or service host for identifying and authenticating future logins or visits.
Step 4: According to the certificate and its security level, the certificate holder can get access to the needed resource, or invoke a service across the borders of different security domains, in order to achieve its local computing goals.
Step 5: If a task is not finished within the validity period, further resource access is forbidden; in this situation, the actor must apply for additional time and then repeat Step 2. Otherwise, the task is finished within the scheduled time. Please note that this step is indispensable if an unexpected requirement arises during workflow execution across the borders of different security domains.
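A minimal, hedged sketch of the time-limited certificate check implied by Steps 2 to 5 is given below; the class, field and method names are illustrative assumptions and do not correspond to any particular grid middleware API.

# Illustrative sketch: a time-limited certificate guarding cross-domain access.
from dataclasses import dataclass

@dataclass
class AccessCertificate:
    holder: str          # the private workflow fragment the engine granted it to
    resource: str        # the resource or service it opens
    valid_from: float    # start of the validity period (relative time units)
    valid_until: float   # end of the validity period; reflects the cooperation lifetime

    def permits(self, requester: str, resource: str, now: float) -> bool:
        """Step 4: access is allowed only for the holder, on the named resource,
        and only while the certificate is still valid (Step 5 otherwise)."""
        return (requester == self.holder
                and resource == self.resource
                and self.valid_from <= now <= self.valid_until)

cert = AccessCertificate("Pri-WF1", "Ri", valid_from=3, valid_until=5)
print(cert.permits("Pri-WF1", "Ri", now=4))   # True: within the validity period
print(cert.permits("Pri-WF1", "Ri", now=6))   # False: must re-apply (Step 5)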
The invocation processes among these steps are certificate-driven and are essentially driven by QoS evaluation. The granting is initiated by the temporal discipline discussed in the previous sections. The execution logic and the process logic are illustrated, respectively, by Fig. 13.a and Fig. 13.b. The period of validity is based on the time constraint model exploited in Section 3. To achieve this objective, the global scientific workflow engine demonstrated in Fig. 13 should contain some basic items related to service definitions, as listed below:
1.  A resource pool indexing the available resources supporting workflow execution;
2.  A directory-based resource location mechanism and a workflow-peer location mechanism;
3.  A certificate authority (CA) for certificate granting;
4.  Trigger mechanisms initiated by service invocations or ECA rules (a small illustrative sketch of such a trigger follows this list);
5.  A delegation capability supporting dynamic process data transportation, agent applications, and other proxy-based issues in access control.
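As a hedged illustration of item 4, the fragment below sketches a tiny event-condition-action (ECA) trigger table; the rule structure and event fields are assumptions for exposition, not a description of the prototype engine.

# Illustrative sketch: a tiny event-condition-action (ECA) trigger table that a
# workflow engine could consult when service-invocation events arrive.
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Rule = Tuple[Callable[[Event], bool], Callable[[Event], None]]   # (condition, action)

rules: List[Rule] = [
    # If a certificate is about to expire while its task is still running,
    # ask the engine to extend it (cf. Step 5 of the certificate scenario).
    (lambda e: e["type"] == "cert_near_expiry" and not e["task_done"],
     lambda e: print(f"request extension for {e['holder']}")),
    # If a publicly accessible port opens, fire the associated private fragment.
    (lambda e: e["type"] == "port_opened",
     lambda e: print(f"fire fragment bound to port {e['port']}")),
]

def on_event(event: Event) -> None:
    for condition, action in rules:
        if condition(event):
            action(event)

on_event({"type": "port_opened", "port": "Pi"})
on_event({"type": "cert_near_expiry", "task_done": False, "holder": "Pri-WF1"})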

The scientific workflow engine is characterized by the key mechanism of service definition mentioned in Section 3. The temporal-dependency relations engaged in the cross-domain collaboration are navigated by the temporal-dependency rationale presented in Section 4. Please note that the global scientific workflow engine plays the part of a centrally managed security mechanism by taking over the security issues in certificate granting, while the issues and routines related to authentication are carried out among the private workflow fragments directly. Private workflow fragments enforce their own security control internally, in a hierarchical way, according to their service ability, storage space, security level, networking speed, and service demand.

RELATED WORKS AND COMPARISON ANALYSIS


The scheduling issue is very important for enhancing the scalability, autonomy, quality and performance of scientific workflows (Ludäscher, 2005; Yan, 2007; Li, 2006; Rajpathak, 2006). In (Yu, J., & Buyya, R., 2005), three major categories of scientific workflow scheduling architecture are presented, i.e., centralized, hierarchical and decentralized scheduling schemes. In the centralized workflow enactment environment, one central scheduler makes scheduling decisions for all tasks engaged in the future workflow execution. In hierarchical scheduling, there is a central manager and multiple lower-level sub-workflow schedulers; the central manager is responsible for controlling workflow execution and assigning sub-workflows to the lower-level schedulers. In contrast with the centralized and hierarchical schemes, in decentralized scheduling there are multiple schedulers without any central controller. A scheduler can communicate with the others and assign a sub-workflow to another scheduler with a lower load. The authors (Yu, J., & Buyya, R., 2005) argued that the centralized scheme can produce efficient schedules because the scheduler has all the necessary information about all tasks engaged in the workflow execution; however, it does not scale with the number of tasks and Grid resources, which are generally autonomous. The major advantage of using the hierarchical architecture is that different scheduling policies can be deployed in the central
system failure. Decentralized scheduling is more scalable but faces more challenges to generate optimal solutions for overall workflow performance. The method presented in this paper falls into the third
scheme, i.e., decentralized scheduling scheme.
Compared with the related works, the main contributions of this chapter are twofold.
First, as a typical application environment, the Grid is an efficient infrastructure for scientific workflow development and execution. In a general Grid environment, the scheduling of resource allocation is an important issue for cross-organizational Grid service invocation based on certain privacy and security usage policies (Dumitrescu, 2005; Batista, 2008; Abramson, 2002; Elmroth, 2008; Li, 2006). Generally, however, little consideration is given to the task scheduling (i.e., private workflow fragment scheduling) enacted inside a self-managing organization for achieving a Grid service. In this chapter, we incorporate the resources and tasks into a private workflow fragment scheduling so as to satisfy a demanded service item with a certain QoS evaluation. This enhances the QoS of a cross-organizational service invocation for a scientific collaboration by keeping the temporal consistency between a scientific workflow and the private workflow fragments.
Second, for a cross-organizational scientific collaboration, privacy and security issues are key factors that should be incorporated into the concrete scheduling application. Technically, brokering strategies (Abramson, 2002; Elmroth, 2008) and view techniques (Chiu, 2004) have proved to be efficient approaches for dealing with this problem. In this chapter, the collaboration scheduling is essentially promoted based on the workflow view technique, in which publicly accessible ports act as the interaction views opened for scientific workflow execution. Concretely, a scientific workflow imposes certain temporal constraints on the publicly accessible ports, and the silent resources and silent tasks engaged in a private workflow fragment are scheduled based on these temporal constraints of the publicly accessible ports. This guarantees that the scheduling application of a private workflow fragment is closely navigated by the scientific workflow scheduling application. To the best of our knowledge, the workflow view technique has mainly been employed in cross-organizational business workflows for execution supervision; in this chapter, we use the technique for the collaboration scheduling of a scientific workflow, which is a novel application of the workflow view technique.

CONCLUSION
In this chapter, a collaborative scheduling approach with time-related QoS evaluation has been presented based on a temporal model. The proposed approach aims at keeping the temporal consistency of a scientific collaboration among distributed private workflow fragments in resource sharing and task enactment. Through an evaluation, we also demonstrated the capability of our approach for promoting multiple scientific workflow executions in a concurrent environment. This collaborative scheduling approach could also be helpful for QoS-aware middleware development for cross-organizational scientific collaborations, which will be studied as a future research topic.


ACKNOWLEDGMENT
This work is partly supported by the National Science Foundation of China under Grant No. 60673017, and part of this chapter is adapted from our previous research work.

REFERENCES
Abramson, D., Buyya, R., & Giddy, J. (2002). A computational economy for grid computing and its implementation in the Nimrod-G resource broker. Future Generation Computer Systems, 18(8), 1061–1074. doi:10.1016/S0167-739X(02)00085-7

Allen, J. F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832–834. doi:10.1145/182.358434

Batista, D. M., da Fonseca, N. L. S., Miyazawa, F. K., & Granelli, F. (2008). Self-adjustment of resource allocation for grid applications. Computer Networks: The International Journal of Computer and Telecommunications Networking, 52(8), 1762–1781.

Bowers, S., McPhillips, T. M., & Ludäscher, B. (2008). Provenance in collection-oriented scientific workflows. Concurrency and Computation, 20(5), 519–529. doi:10.1002/cpe.1226

Cardoso, J., Miller, J., Sheth, A., & Arnold, J. (2004). Quality of service for workflow and web service processes. Web Semantics: Science, Services and Agents on the World Wide Web, 1(3), 281–308. doi:10.1016/j.websem.2004.03.001

Chiu, D. K. W., Cheung, S. C., Till, S., Karlapalem, K., Li, Q., & Kafeza, E. (2004). Workflow view-driven cross-organizational interoperability in a web service environment. Information Technology and Management, 5(3/4), 221–250. doi:10.1023/B:ITEM.0000031580.57966.d4

Dumitrescu, C. L., Wilde, M., & Foster, I. (2005). A model for usage policy-based resource allocation in grids. In R. Dienstbier (Ed.), Proc. 6th IEEE Int'l Workshop on Policies for Distributed Systems and Networks (pp. 191-200). Stockholm, Sweden: IEEE Computer Society Press.

Elmroth, E., & Tordsson, J. (2008). Grid resource brokering algorithms enabling advance reservations and resource selection based on performance predictions. Future Generation Computer Systems, 24(6), 585–593. doi:10.1016/j.future.2007.06.001

Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200–222.

Fox, G. C., & Gannon, D. (2006). Special issue: Workflow in grid systems. Concurrency and Computation, 18(10), 1009–1019. doi:10.1002/cpe.1019

Graham, S., Davis, D., Simeonov, S., Daniels, G., Brittenham, P., Nakamura, Y., et al. (Eds.). (2001). Building web services with Java: Making sense of XML, SOAP, WSDL, and UDDI. New York: Sams Publishing.

Grefen, P., Aberer, K., Ludwig, H., & Hoffner, Y. (2001). CrossFlow: Cross-organizational workflow management for service outsourcing in dynamic virtual enterprises. A Quarterly Bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering, 24(1), 52–57.

Grefen, P., & Hoffner, Y. (1999). CrossFlow: Cross-organizational workflow support for virtual organizations. In Research Issues on Data Engineering: Information Technology for Virtual Enterprises (RIDE-VE '99) (pp. 90-91). Sydney, Australia: IEEE Computer Society Press.

Guan, Z., Hernandez, F., Bangalore, P., Gray, J., Skjellum, A., Velusamy, V., & Liu, Y. (2006). GridFlow: A grid-enabled scientific workflow system with a Petri-Net-based interface. Concurrency and Computation, 18(10), 1115–1140. doi:10.1002/cpe.988

Li, C., & Li, L. (2006). A distributed multiple dimensional QoS constrained resource scheduling optimization policy in computational grid. Journal of Computer and System Sciences, 72(4), 706–726. doi:10.1016/j.jcss.2006.01.003

Li, J. Q., Fan, Y. S., & Zhou, M. C. (2003). Timing constraint workflow nets for workflow analysis. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 33(2), 179–193. doi:10.1109/TSMCA.2003.811771

Liu, D. T., Franklin, M. J., Abdulla, G. M., Garlick, J., & Miller, M. (2006). Data-preservation in scientific workflow middleware. In R. Dienstbier (Ed.), Proc. 18th Int'l Conf. on Scientific and Statistical Database Management (SSDBM'06) (pp. 49-58). Vienna, Austria: IEEE Computer Society Press.

Liu, J. W. S. (Ed.). (2002). Real-Time Systems. New York: Pearson Education Press.

Ludäscher, B., & Goble, C. (2005). Guest editors' introduction to the special section on scientific workflows. SIGMOD Record, 34(3), 3–4. doi:10.1145/1084805.1084807

McPhillips, T. M., & Bowers, S. (2005). An approach for pipelining nested collections in scientific workflows. SIGMOD Record, 34(3), 12–17. doi:10.1145/1084805.1084809

Rajpathak, D., Motta, E., Zdrahal, Z., & Roy, R. (2006). A generic library of problem solving methods for scheduling applications. IEEE Transactions on Knowledge and Data Engineering, 18(6), 815–828. doi:10.1109/TKDE.2006.85

Rygg, A., Roe, P., Wong, O., & Sumitomo, J. (2008). GPFlow: An intuitive environment for web-based scientific workflow. Concurrency and Computation, 20(4), 393–408. doi:10.1002/cpe.1216

van der Aalst, W. M. P. (1998). The application of Petri nets to workflow management. Journal of Circuits, Systems, and Computers, 8(1), 21–66.

van der Aalst, W. M. P., ter Hofstede, A. H. M., & Barros, A. P. (2003). Workflow patterns. Distributed and Parallel Databases, 14(1), 5–51. doi:10.1023/A:1022883727209

Wieczorek, M., Prodan, R., & Fahringer, T. (2005). Scheduling of scientific workflows in the ASKALON grid environment. SIGMOD Record, 34(3), 56–62. doi:10.1145/1084805.1084816


Yan, Y., & Chapman, B. (2007). Scientific workflow scheduling in computational grids: Planning, reservation, and data/network-awareness. In R. Dienstbier (Ed.), Proc. 8th IEEE/ACM Int'l Conf. on Grid Computing (pp. 18-25). Austin, Texas: IEEE Computer Society Press.

Yu, J., & Buyya, R. (2005). A taxonomy of scientific workflow systems for grid computing. SIGMOD Record, 34(3), 44–49. doi:10.1145/1084805.1084814

Yu, J., Buyya, R., & Tham, C. K. (2005). Cost-based scheduling of scientific workflow applications on utility grids. In R. Dienstbier (Ed.), Proc. 1st Int'l Conf. on e-Science and Grid Computing (e-Science'05) (pp. 1-8). Melbourne, Australia: IEEE Computer Society Press.

Yuan, P. S., Matthew, K. O., & Shao, Y. L. (2000). Virtual organizations: The key dimensions. In R. Dienstbier (Ed.), Proceedings of the Academia/Industry Working Conference on Research Challenges (AIWORC'00) (pp. 3-8). Buffalo, New York: IEEE Computer Society Press.

Zeng, L., Benatallah, B., Ngu, A. H. H., Dumas, M., Kalagnanam, J., & Chang, H. (2004). QoS-aware middleware for Web services composition. IEEE Transactions on Software Engineering, 30(5), 311–327. doi:10.1109/TSE.2004.11

Zhao, Z., Booms, S., Belloum, A., de Laat, C., & Hertzberger, B. (2006). VLE-WFBus: A scientific workflow bus for multi e-Science domains. In R. Dienstbier (Ed.), Proc. 2nd IEEE Int'l Conf. on e-Science and Grid Computing (e-Science'06) (p. 11). Amsterdam, Netherlands: IEEE Computer Society Press.

KEY TERMS AND DEFINITIONS


Certificate Mechanism: A security policy adopted by Grid applications, in which cross-domain resource sharing is enabled by a certificate verification process among collaborators.
Grid: The next-generation infrastructure of the Internet and its Web-based applications.
QoS: A set of evaluation parameters for evaluating the quality of a service.
Scheduling: Scheduling deals with the assignment of jobs and activities to resources and time ranges in accordance with relevant constraints and requirements.
Scientific Workflow: A novel workflow application style for e-Science activities.
Temporal Model: A model for specifying the temporal-dependency relations among collaborative activities.
Workflow Fragment: A local workflow deployed and executed within a self-managing organization.


Section 5

Service Computing


Chapter 19

Grid Transaction
Management and Highly
Reliable Grid Platform
Feilong Tang
Shanghai Jiao Tong University, China
Minyi Guo
Shanghai Jiao Tong University, China

ABSTRACT
As Grid technology expands from scientific computing to business applications, open grid platforms increasingly need the support of transaction services. This chapter proposes a grid transaction service (GridTS) and a GridTS-based transaction processing model, and defines two kinds of grid transactions: atomic grid transactions for short-lived reliable applications and long-lived transactions for business processes. The chapter also presents solutions for managing these two kinds of transactions to meet different consistency requirements. Moreover, this chapter investigates a mechanism for the automatic generation of compensating transactions in the execution of long-lived transactions through the GridTS. Finally, it discusses future trends in reliable grid platform research.

INTRODUCTION
Grid computing is a natural evolution of distributed computing and Internet applications for large-scale science and engineering problems, aiming at effective resource sharing and task collaboration in distributed and self-governing environments. The main goal of grid computing is to share large-scale resources and to accomplish collaborative tasks by enabling people to utilize computing and storage resources transparently. By providing service-oriented computing and data infrastructures, grid technology is becoming the preferred basis for large-scale distributed computing, and is expanding from scientific computing to business applications (Berman, 2003; Foster, 2002; Wang, 2004).
Many key grid applications, especially business applications, require reliability guarantees from a highly reliable grid computing platform (Jiang, 2006). As an effective and widely used means, transaction
technology can help people make this vision a reality, providing application developers with multiple transparencies regarding location, replica, concurrency and failure (Wang, 2008). As a result, transaction management is one of the most important core services of a reliable grid platform for mission-critical commercial grid applications (Yang, 2008).
In grids, a transaction is a set of operations that execute on geographically distributed grid services. The transaction management service is responsible for ensuring the reliable execution of these distributed grid applications, keeping the system consistent and freeing the applications from various failures. Ideally, it also shields users from the complex recovery process.
Traditional distributed transactions, where application systems are tightly coupled, have the ACID properties, i.e., Atomicity, Consistency, Isolation and Durability. However, traditional distributed transaction models and Web service transaction specifications do not work in open grid environments because:

•   Grid systems are loosely coupled and autonomous. For security and efficiency, autonomous grid services typically do not allow themselves to be locked by outside applications, while traditional atomic transaction models generally adopt a locking mechanism to guarantee atomicity.
•   It is difficult, even impracticable, for application programmers to develop compensating transactions. Existing transaction models require application programmers to provide all compensating transactions. However, the grid services that execute a business application are discovered dynamically, and autonomous service providers may set up special compensating rules based on their own business models. For example, some service providers allow users to cancel a ticket order without further action, while others may require users to pay a compensation fee.
•   Grid systems are dynamic, i.e., grid services may exit the system during the execution of a business process. A grid transaction service has to hide this dynamism from users.

As a result, it is an important and significant task to propose and implement a transaction service for grid computing. Generally, a grid transaction service has to address the following issues (a small illustrative sketch of the long-lived case follows this list):

•   Coordination of short-lived activities to form an atomic transaction, such as transferring funds from one bank account to another.
•   Coordination of long-lived transactional activities to fulfill a common agreement, for example, a journey arrangement that involves booking tickets, booking hotel rooms and hiring cars.
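As a minimal, hedged sketch of the long-lived case, the fragment below shows a generic saga-style compensation pattern; it is not the GridTS API, and all names are illustrative assumptions.

# Illustrative sketch: a long-lived transaction as a sequence of activities,
# each paired with a compensating action that undoes it if a later step fails.
def run_long_lived_transaction(steps):
    """steps: list of (do, compensate) callables. Returns True if all steps commit."""
    done = []
    try:
        for do, compensate in steps:
            do()
            done.append(compensate)
        return True
    except Exception:
        for compensate in reversed(done):   # undo completed activities in reverse order
            compensate()
        return False

def fail(msg):
    raise RuntimeError(msg)

journey = [
    (lambda: print("book flight"), lambda: print("cancel flight")),
    (lambda: print("book hotel"),  lambda: print("cancel hotel")),
    (lambda: fail("no car available"), lambda: None),
]
print(run_long_lived_transaction(journey))
# book flight, book hotel, then cancel hotel and cancel flight; prints False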

This chapter presents a grid transaction service (GridTS) and coordination algorithms to manage atomic and long-lived grid transactions, providing commercial applications with reliability support. Moreover, we propose a solution for the automatic generation of compensating transactions, which is a significant advantage over existing long-lived transaction models. The objective is to set up a reliable grid platform based on the transaction service for grid applications with reliability requirements, enabling application programmers to use the GridTS to implement transactional applications easily. The proposal has the following advantages over existing related research. Firstly, the GridTS can automatically generate compensating transactions. Secondly, the GridTS can hide the dynamicity of grids from users. Next, the GridTS reserves resources for the atomic transaction commit, to adapt to the autonomous grid environment. Finally, it is extensible because it is built on top of a series of open standards, technologies and infrastructures.

BACKGROUND
The demands for a reliable grid platform and a transaction management service result from practical grid
applications. Since the beginning of this century, both academia and the computer industry have regarded the development of grids as another chance to improve the current paradigm of Internet computing.
ShanghaiGrid is a good example of such grid projects.
As an international city with 18 million people, Shanghai has urgent needs for an information infrastructure that enables the sharing of heterogeneous resources, improving government efficiency and
the quick response to emergency events. To share heterogeneous computing, storage and data resources,
the Shanghai municipality launched the ShanghaiGrid project in 2003. ShanghaiGrid is a grid infrastructure
that aggregates several heterogeneous supercomputers, data centers and other applications scattered across
different organizations in Shanghai for the city government as well as enterprises and communities. The
primary goal is to establish a metropolitan-area service grid for widespread upper-layer applications from
both research communities and government departments, tailored to the characteristics of Shanghai.
The ShanghaiGrid project is built upon four major computational aggregations and networks in
Shanghai, i.e., CHINANET (public Internet backbone built by China Telecom), SHERNET (Shanghai
Education and Research Network), STNC (Shanghai Science and Technology Network Communication), and campus networks. From the perspective of hardware infrastructure, ShanghaiGrid aggregates
various distributed and heterogeneous resources, including computers, networks, storage devices and
so on. From the perspective of software infrastructure, one of the research focuses is to develop the
ShanghaiGrid system software (SHGSS).
The hardware facilities interconnected in ShanghaiGrid include supercomputers, data storage systems, devices (e.g., sensors and traffic surveillance equipment) and other resources. These facilities are distributed in the
intra-grids of the participating organizations of this project. As Figure 1 illustrates, the ShanghaiGrid
hardware infrastructure comprises more than five connected intra-grids, including Shanghai Supercomputing Center (SSC) intra-grid, Shanghai Jiao Tong University (SJTU) intra-grid, Shanghai University
(SHU) intra-grid, Tongji University (TJU) intra-grid and Shanghai Urban Traffic Information Center
(SUTIC) intra-grid.
The ShanghaiGrid system software SHGSS was designed with three levels: (1) the low-level access and management software for encapsulating heterogeneous resources as grid services, (2) the grid
middleware that provides top-level traffic applications and others with transparent access to grid services,
no matter where the grid services are located, whether failures occur during the execution of tasks, or how
these services are composed (Fox, 2001), and (3) the top-level grid portals that enable users to access
ShanghaiGrid in a way similar to accessing the Web. In the first stage, SHGSS supported resource encapsulation
and management, service scheduling and accounting, data aggregation and adaptive transmission, as well
as intelligent traffic management. After this stage, a transaction management service was required for
more key grid applications. From then on, the grid transaction service was investigated and implemented.
The following summarizes research and reports related to our grid transaction service.
Transaction management for distributed environments has been widely researched (Ammann, 1997;
Ancilotti, 1990; Thomasian, 1997).
Traditional transaction models. Distributed Transaction Processing (DTP) and Object Transaction
Service (OTS) are widely used in the traditional distributed environment. DTP defines three kinds of
roles, Application Program, Transaction Manager and Resource Manager, and two types of interfaces,
TX and XA interfaces. These two models do not release locked resources until the end of a global transaction and thus cannot coordinate long-lived grid transactions. Consequently, they are generally not applicable to applications comprising loosely coupled, Web-based business services (Dalal, 2003).

Figure 1. ShanghaiGrid hardware infrastructure

Web Services transaction specifications. WS-Coordination (WS-C) and WS-Transaction (WS-T)
provide a set of transaction specifications for Web Services (Cabrera, 2002). WS-C describes a transaction framework comprising an Activation Service, a Registration Service and a Protocol Service, and it can accommodate multiple coordination protocols. WS-T classifies transactions in the Web Services environment
into atomic transactions and business activities, and defines the corresponding coordination protocols.
The Business Transaction Protocol (BTP) is another important service-oriented transaction specification that
defines a conceptual model and a set of complex messages to be exchanged between a coordinator
and participants, and specifies how Web services interact (Dalal, 2003).
Compensating transactions. Existing long-lived transaction models are generally built on the concept of transaction compensation, first proposed by Gray (Gray, 1981). The typical implementation of compensating transactions is the Sagas model, which is widely adopted in many extended
transaction models (Chrysanthis, 1992; Garcia-Molina, 1987; Liang, 1996).
Sagas (Garcia-Molina, 1987) is a classical transaction model for handling long-lived transactions
based on transaction compensation. In Sagas, a transaction is called a Saga, which consists of a set
of sub-transactions with atomicity, consistency, isolation and durability properties, T={T1, T2, . . ., Tn}, and a set of associated compensating transactions, C={C1, C2, . . ., Cn}, where each
sub-transaction Ti is associated with a compensating transaction Ci that can semantically undo the effects
caused by the commit of Ti. Sub-transactions in Sagas commit independently and immediately release
the resources accessed during their execution in order to reduce the duration of resource
locking and improve system efficiency. In Sagas, all committed sub-transactions must be undone if
a subsequent sub-transaction fails, which wastes a lot of valuable work already completed. ACTA
(Chrysanthis, 1992) is a comprehensive transaction framework that permits a transaction modeler to
specify the effects of extended transactions on each other and on objects in the database. ACTA allows
one to specify interactions between transactions in terms of relationships and of the effects of transactions on an object's state and concurrency status. ACTA provides a reasoning ability more powerful and flexible than
Sagas through a series of variations of the original Sagas. ConTracts (Wachter, 1992) is a mechanism for
grouping transactions into a multi-transaction activity. It consists of a set of predefined actions called
steps and an explicitly specified execution plan called a script. In case of a failure, the ConTract state
must be restored and its execution may continue.
The above transaction models do not work well in grid environments because:
1. Traditional atomic transaction models generally lock resources to achieve a consistent commit. Grid services, however, are not ready to be locked by outside grid applications.
2. Existing long-lived transaction models require application programmers to provide compensating transactions for all sub-transactions. The Business Transaction Protocol (BTP) and Web Services Transaction (WS-Transaction) also propose using compensation for the coordination of long-running activities, but they do not specify how to provide compensating transactions. Owing to the autonomy of grids, service providers may set up special compensating rules according to their own business models; for example, different providers require users to pay different charges for the cancellation of ticket orders. Moreover, the grid services that actually execute sub-transactions are dynamically discovered just before the beginning of a grid transaction. Application programmers do not know the special compensating policies of dynamically discovered services, and therefore they cannot provide the corresponding compensating transactions in advance.
3. Grid services may dynamically join and leave the grid system before a global grid transaction completes.
The following sections present how we extend this related work for grid environments.

MAIN FOCUS OF THE CHAPTER


Grid Transaction Service and Highly Reliable Grid Platform
A grid transaction is a set of operations that execute on different grid services (Tang, 2004). The transaction service GridTS is a special service responsible for coordinating these services to keep the
system consistent, and it shields users from the complex recovery process.

Layered Architecture
The architecture of the highly reliable grid platform consists of three layers (see Figure 2). The middle layer,
GridTS, consists of the following main components.

Figure 2. Architecture of reliable grid platform

Coordinator and Participant. They cooperatively coordinate a transaction for an application and
grid services respectively. The Coordinator and Participant themselves do not execute actual application operations.
Scheduler. This component takes charge of (1) creating a Coordinator and a coordination context
(CC) on the application side and Participants on the service side, and (2) scheduling the Service Discovery module.
Compensating Transaction Generator (CTG). If a predefined event occurs, the component first queries the corresponding compensating rule(s), and then dynamically generates a compensating operation.
Also, it encapsulates the generated compensating operations into a compensating transaction when the
sub-transaction commits.
Log Service. This component records the coordination operations and the state information for recovery of transactions from failures.
Service Discovery. This component dynamically discovers the qualified grid services according to
the user's requirements, such as cost, quality and availability, to complete the specified sub-transactions.
Interfaces. GridTS provides grid applications with two types of APIs: the extended TX interfaces
for transaction management, and the service-specific interfaces for management of the GridTS service
instances and discovery of grid services to execute application operations in sub-transactions.
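To make these two API families more tangible, the following minimal Java sketch shows how they might be exposed to an application programmer. The interface name, method names and signatures are assumptions made for illustration only; they are not the actual GridTS API.

import java.util.List;

// Hypothetical sketch of the two API families described above.
// All names and signatures are illustrative assumptions, not the real GridTS interfaces.
public interface GridTransactionApi {

    // Extended TX-style interfaces for transaction management.
    String begin(String transactionKind, long timeoutMillis);   // returns a transaction identifier
    void commit(String transactionId);
    void rollback(String transactionId);

    // Service-specific interfaces for managing GridTS instances and
    // discovering grid services that will execute sub-transactions.
    List<String> discoverServices(String requirementQuery);     // e.g. cost, quality, availability
    void enroll(String transactionId, String serviceEndpoint);  // register a discovered service
}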

HOW TO USE GRIDTS


GridTS is itself a special grid service, so it possesses all the properties of a grid service. The interfaces of GridTS
are encapsulated in the TX portType of grid services by defining each interface and its corresponding input and
output parameters as an operation with input and output messages. The interface definition is exemplified in
Figure 3.

Figure 3. The definition of interface begin of GridTS

GridTS ensures reliability for grid applications in the following two ways.
1. Public transaction service. GridTS is published in a public registration center, and transactional applications discover and invoke it. This approach is flexible and convenient: users may share the reliability support without installing GridTS themselves.
2. Private transaction service. GridTS is located on the application-side and service-side nodes. The strength of this method is efficiency; the weakness is less flexibility.

Grid Transaction Processing Framework and Flow


The framework of our grid transaction processing, shown in Figure 4, is built on GridTS, with the following three actors:
1. Initiator. A transactional application is the initiator of a global grid transaction. It initiates the transaction through GridTS, then requests the remote grid services involved in the transaction to execute application operations.
2. GridTS. It is the grid transaction manager. The Coordinator and Participant in GridTS perform the coordination algorithm according to the transaction kind. They exchange coordination messages to guarantee the consistency of the global transaction. The Coordinator is the controller of the global transaction.
3. (Remote) Grid services. These grid services actually perform the application operations in the transaction.

Based on the above framework, a typical transaction processing flow is as follows. First, GridTS initiates a global transaction on behalf of an application and discovers and selects the required grid services
to serve as participants. Then, its Scheduler broadcasts the CC message to all selected participants.
Finally, the created Coordinator and Participants interact to control the transaction execution, including
correct completion and failure recovery. The following sections discuss the details of transaction
coordination.
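Before turning to the coordination details, the flow just described might look roughly as follows in client code. This hypothetical sketch reuses the illustrative GridTransactionApi interface introduced earlier; it is not the actual GridTS client code.

import java.util.List;

public class BookingClient {
    public static void runAtomicTransfer(GridTransactionApi gridTs) {
        // 1. Initiate a global transaction through GridTS (transaction kind "AT").
        String txId = gridTs.begin("AT", 60_000);
        try {
            // 2. Discover and select the grid services that will act as participants.
            List<String> banks = gridTs.discoverServices("type=bank;availability=high");
            for (String endpoint : banks) {
                gridTs.enroll(txId, endpoint);  // the Scheduler broadcasts the CC message to each participant
            }
            // 3. Coordinator and Participants interact; on success the global transaction commits.
            gridTs.commit(txId);
        } catch (RuntimeException failure) {
            // Failure recovery: the coordinator rolls the global transaction back.
            gridTs.rollback(txId);
        }
    }
}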

Figure 4. GridTS-based transaction processing framework

TRANSACTION COORDINATION
GridTS coordinates short-lived and long-lived transactions by executing the corresponding coordination
algorithm based on the incoming transaction kind.

Transaction Coordination for Atomic Grid Transaction


An atomic transaction lasts a short time, typically a few seconds. The following is the formal
definition.

Figure 5. An example of the atomic grid transaction

Figure 6. An example of transaction process: Dashed line refers to alternative candidate, which occurs
only on the following long-lived transactions

Definition 1. An atomic transaction (AT) is a 4-tuple AT = {T, D, S, R}, where
T = {T1, T2, . . ., Tn} is the set of sub-transactions, where each Ti (1 ≤ i ≤ n) can itself be a set of lower-level atomic transactions Tij (1 ≤ j ≤ m),
D is the set of data carried by the transaction,
S is the set of states, and
R is the set of dependency relationships between (sub)transactions.
In an atomic transaction, all participants have to commit or abort synchronously, and the intermediate
results of an atomic transaction are invisible (for read or write) to other concurrent transactions. As shown
in Figure 5, consider an atomic transaction that transfers $1000 from an account in bank service A to another in
bank service B, such that T={T1, T2}. The sub-transactions T1 and T2 have to execute atomically.
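A minimal, hypothetical sketch of how this bank-transfer example of Definition 1 could be represented as a tree of sub-transactions is shown below; the class and field names are illustrative assumptions only.

import java.util.ArrayList;
import java.util.List;

// Sketch of an atomic transaction as a tree of sub-transactions (Definition 1).
public class AtomicTransaction {
    final String operation;                                      // e.g. "debit A 1000" or "credit B 1000"
    final List<AtomicTransaction> children = new ArrayList<>();  // lower-level sub-transactions T_ij

    AtomicTransaction(String operation) { this.operation = operation; }

    public static void main(String[] args) {
        AtomicTransaction transfer = new AtomicTransaction("transfer $1000 from A to B");
        transfer.children.add(new AtomicTransaction("T1: debit $1000 at bank service A"));
        transfer.children.add(new AtomicTransaction("T2: credit $1000 at bank service B"));
        // T1 and T2 must commit or abort together; the coordination mechanism below enforces this.
        System.out.println(transfer.operation + " with " + transfer.children.size() + " sub-transactions");
    }
}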
1. Coordination mechanism: Coordination of atomic grid transactions includes the following phases:


1. Initiation of an atomic transaction. For remote grid services to join an atomic transaction, the
client-side GridTS sends a CoordinationContext (CC) message to them and creates a Coordinator,
which lives until the end of the transaction. The CC message includes the information necessary to
create a transaction, including the transaction type AT, the transaction identifier, the coordinator address
and the expiration.
Each participant returns a Response message to the Coordinator, indicating that it agrees to join the
transaction.
2. Preparation for the transaction commit. The Coordinator sends Prepare messages to the participants
Pi (i=1, 2, . . ., N), where N is the number of sub-transactions. Each Pi reserves the necessary
resources and returns a Prepared or NotPrepared message, depending on whether the reservation is successful or not.
3. Transaction commit. Within T1, if the Coordinator receives N Prepared messages, it sends Commit
to all Pi and records the commit in the log. Otherwise, it sends Abort to them, making them cancel
the previous reservation. On receiving the Commit message, each Participant: (a) requests allocation of
the reserved resources, (b) records the transaction in the log in order to recover later
from possible failures, and (c) monitors the execution of the corresponding task and reports the result
to the Coordinator.

Figure 7. Coordination algorithms of atomic transactions (CAAT)

Within T2, if the Coordinator receives N Committed messages, it judges that the transaction has completed correctly. Otherwise, it reports the failure information to the user and sends Rollback messages to all Pi, making them recover their previous states. During the execution of a transaction, if any Pi itself contains sub-transactions, it applies the above mechanism recursively. In this case, the nested transactions form a tree structure, and Pi is not only a participant but also serves as the sub-coordinator for its children Pij (see Figure 6).
2. Coordination Algorithm: The coordination algorithm for atomic transactions includes two parts, ActionOfParent and ActionOfChild, executed by the Coordinator and the Participant respectively, as illustrated in Figure 7, where t is the waiting time of the Coordinator and the Participants, CC is the transaction context, and Tc is the timeout within which a Participant waits for the Commit message. (A simplified sketch of this two-phase exchange appears at the end of this subsection.)
3. Nested Atomic Grid Transactions: In the above coordination algorithm CAAT, if a participant nests lower-level sub-transactions, it uses the above mechanism recursively so as to form a transaction tree, in which an internal node acts not only as a participant of its parent but also as a sub-coordinator of its children; that is, these nodes interact with their parents using the participant algorithm and with their children using the coordinator algorithm. The root node represents the global transaction and always executes the coordinator algorithm, while leaf nodes actually perform the application operations and always execute the participant algorithm.
Formation of nested sub-transactions. Let P(i,j) be the Participant associated with the jth sub-transaction T(i,j) at the ith level (j=1,2,. . ., n(i)), where n(i) is the number of sub-transactions at the ith level. We demonstrate how T(i,j) creates its child transactions:

Figure 8. A long-lived transaction

1. T(i,j) calls the interface Begin to initiate its child transactions and creates a SubCoordinator(i,j).
2. T(i,j) creates the sub-transaction context CC(i+1,j), whose PortReference is the address of SubCoordinator(i,j), taking the current context CC(i,j) as the input parameter.
3. T(i,j) propagates CC(i+1,j) to all of its child transactions T(i+1,j) (j=1,2,. . ., n(i+1)).
4. Each child transaction T(i+1,j) creates a Sub-Participant(i+1,j) and registers with SubCoordinator(i,j) in a Response message.
From then on, if T(i+1,j) still nests child transactions, set i=i+1 and repeat steps 1 to 4 until no further child transactions are nested.
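The following is the simplified coordinator-side sketch promised above: a condensed, hypothetical Java rendering of the prepare/commit exchange of CAAT (ActionOfParent). Timeouts, logging and the recursive nested case are omitted or only hinted at in comments, and the type and method names are assumptions rather than the actual GridTS implementation.

import java.util.List;

// Simplified sketch of the coordinator side of CAAT (ActionOfParent in Figure 7).
public class AtomicCoordinatorSketch {

    interface Participant {
        boolean prepare();   // reserve resources; true -> Prepared, false -> NotPrepared
        boolean commit();    // allocate reserved resources; true -> Committed
        void abort();        // cancel the previous reservation
        void rollback();     // recover to the previous state
    }

    static boolean coordinate(List<Participant> participants) {
        // Phase 1: send Prepare to all participants and collect their votes.
        for (Participant p : participants) {
            if (!p.prepare()) {
                participants.forEach(Participant::abort);   // any NotPrepared -> Abort to all
                return false;
            }
        }
        // Phase 2: send Commit; if any participant fails to report Committed, roll all back.
        boolean allCommitted = true;
        for (Participant p : participants) {
            allCommitted &= p.commit();                      // the result would also be written to the log
        }
        if (!allCommitted) {
            participants.forEach(Participant::rollback);
        }
        return allCommitted;
    }
}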

Long-Lived Grid Transaction


A long-lived transaction is one that lasts for a long period, such as a few hours or even a few days, because of user interaction and processing delays. Long-lived transactions are generally associated with
business processes, so they should have the following features.

It should be convenient for a user to select among committed sub-transactions in order to adapt to the business model. In view of cost or other factors, a user often requests multiple services for the same subtask in a business process and then selects the best result; for example, a traveler who hopes to buy the cheapest airline ticket while keeping his travel plan uninterrupted may request multiple booking services for a single airline ticket.
Resources accessed by sub-transactions should be released as early as possible to improve system efficiency.
The transaction service must be able to shield the dynamicity of grids and make the global transaction proceed even if services involved in the transaction dynamically leave the grid system before the global transaction completes.

Definition 2. A long-lived grid transaction (LGT) is a 5-tuple LGT = {T, D, S, R, OP}, where
T = {Ti | 1 ≤ i ≤ n}, where n is the number of sub-transactions involved in T,
D is a set of data operated on by T,
S is a set of states,
R is a set of dependency relationships between (sub)transactions, and
OP = {AP-OP, TM-OP} is a set of operations, where TM-OP = {Begin, Enroll, Confirm, Cancel} is the set of
coordination messages and AP-OP is the set of application operations.
For long-lived transactions, transaction compensation is an appropriate method to release the grid resources held by sub-transactions as early as possible. For example, in the traveling arrangement
shown in Figure 8, a traveler orders airplane tickets from ticket services A and B, and reserves a room
from service C. If the hotel reservation fails, the two ticket orders have to be cancelled. On the other
hand, after the tickets are successfully booked, the traveler may select the best (e.g., the cheapest) one
and cancel the other.
1. Coordination Mechanism: As the name suggests, a long-lived grid transaction (LGT) takes a relatively long time to finish, even without interference from other concurrent transactions, so the LGT has to relax the atomicity and isolation properties. More specifically, an LGT allows some candidates to abort while others commit. Application operations on grid services exhibit a loose unit of work, in which results are shared prior to the completion of the global transaction, i.e., sub-transactions in an LGT independently commit and then immediately release the held resources before the global transaction finishes. Compared with an AT, an LGT has the following new characteristics:

Grid services that participate in a transaction independently commit sub-transactions after receiving the pre-commit message, then immediately release the held resources.
If some participants fail to participate in or commit a sub-transaction, the global transaction can proceed by initiating new requests to locate substitutes.
Users can confirm or cancel committed sub-transactions according to their interests using compensating transactions. Note that a coordinator confirms or cancels each participant only once.
Grid services that participate in an LGT may leave the grid before the global transaction completes. In that case, these services notify the coordinator before leaving the grid.
Long-lived grid transaction processing consists of the following three phases:

An application initiates an LGT. This is similar to the AT case, except that the transaction type is LGT.
Candidates commit independently. The Coordinator sends Enroll messages to all candidates. The candidates reserve and allocate the necessary resources, record their operations in the log, and then directly commit the transaction. If successful, each candidate generates the corresponding compensating transaction and returns a Committed message, which contains the execution results, to the Coordinator. Otherwise, it automatically rolls back the operations taken previously and returns an Aborted message. From then on, it is removed from the transaction.
The user confirms successful candidates. According to the returned results, the user may take one of the following actions: (1) for candidates committing successfully, he confirms some and cancels the others by sending Confirm and Cancel messages to them respectively, within Tvalid; and (2) for failed candidates, he need not reply to them and may resend CoordinationContext messages to locate new candidates. As a result, if a candidate receives a Confirm message within Tvalid, it responds with a Confirmed message. Otherwise, it executes a compensating transaction to recover the system to the previous state.

Figure 9. Coordination algorithms of long-lived grid transactions (CALGT)
2. Coordination algorithm: The coordination algorithm for LGT transactions (CALGT) also consists of two parts, the coordinator algorithm ActionOfCoordinator and the participant algorithm ActionOfParticipant, as shown in Figure 9, where t is the system time, CC is a coordination context, and Tvalid is the valid time within which a coordinator must send a confirmation or cancellation decision and participants must report their commit states. If a coordinator does not confirm or cancel a sub-transaction within Tvalid, the corresponding participant automatically undoes the committed sub-transaction by executing the compensating transaction. On the other hand, a coordinator presumes that a participant has failed if the participant does not return the commit result before Tvalid. The algorithm allows users to confirm or cancel committed sub-transactions according to their own requirements. (A simplified sketch of this Tvalid-based confirm/cancel handling appears at the end of this list.)
In the above long-lived coordination algorithm, if a sub-transaction Ti comprises lower-level sub-transactions Tij, it calls the coordinator algorithm in the nested way described in the previous section.
3. Shielding the Dynamicity of Grid Services: Services that participate in an LGT may exit the grid before the global transaction completes. GridTS shields the dynamicity of the grid in the following ways.

Figure 10. A LGT transaction and its compensating transactions

1. Keep on handling the global transaction. If a grid service leaves the grid before the global transaction completes, it notifies the coordinator of this decision by means of the grid notification mechanism. To support subscription to notification messages, the Coordinator implements the NotificationSink interface to receive notification messages, and the Participant implements the NotificationSubscription and NotificationSource interfaces to manage subscriptions and send notification messages. Moreover, the CC message provides enough information to tell remote services how to notify the coordinator of their decision before they leave the grid system. The subscription request within the CC consists of:
the kind and content of the notification messages,
the address at which to call NotificationSink, i.e., the network address of the coordinator, and
the initial lifetime of the subscription, which is equal to Tvalid.
When a service leaves the grid, the associated GridTS notifies the coordinator using the NotificationSink. The coordinator then resends the CC message to a new grid service to perform that sub-transaction, where the valid expiration for the new service is still Tvalid.
2. Undo effects on grid services. We assume that services do not leave the grid during the commit process itself. Services may leave the grid in the following two situations, and GridTS takes the corresponding actions.
1. A grid service leaves the grid before the commit of a sub-transaction. In this case, the service simply leaves the grid without performing any compensating actions.
2. A grid service leaves the grid after the commit of a sub-transaction, when its coordinator has received the Enrolled message. In this case, the coordinator sends Cancel messages to notify the participants to execute their compensating transactions.
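The sketch referred to in the coordination-algorithm item above is given here: a hypothetical, heavily simplified view of how a participant might guard a committed sub-transaction with Tvalid, executing its stored compensating transaction if neither Confirm nor Cancel arrives in time. All class and method names are illustrative assumptions, not the GridTS implementation.

import java.util.Timer;
import java.util.TimerTask;

// Simplified participant-side sketch of the Tvalid handling in CALGT.
public class LongLivedParticipantSketch {
    private final Timer timer = new Timer(true);
    private volatile boolean decided = false;
    private final Runnable compensatingTransaction;

    LongLivedParticipantSketch(Runnable compensatingTransaction, long tValidMillis) {
        this.compensatingTransaction = compensatingTransaction;
        // Start the Tvalid clock as soon as the sub-transaction has committed.
        timer.schedule(new TimerTask() {
            @Override public void run() {
                if (!decided) {
                    compensatingTransaction.run();   // no decision arrived in time: undo the committed sub-transaction
                }
            }
        }, tValidMillis);
    }

    void onConfirm() { decided = true; /* the compensating transaction becomes useless and would be deleted */ }

    void onCancel()  { decided = true; compensatingTransaction.run(); }
}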

AUTOMATIC GENERATION OF COMPENSATING TRANSACTIONS


Existing long-lived transaction models are generally built on compensating transactions, both in traditional distributed systems and in Web Services environments. The compensating transaction was first implemented in Sagas, which requires application programmers to provide compensating transactions before a transaction executes.
This section focuses on how to automate the generation of compensating transactions in grid environments, freeing application programmers from complex compensation details. We define common compensating rules for data modification operations and transaction coordination messages, while allowing service providers to add and modify their own rules. During the execution of an LGT, GridTS dynamically generates and stores a compensating transaction for each sub-transaction based on these compensating rules, as shown in Figure 10. On receiving a Confirm message, GridTS deletes the generated compensating transaction from the database, because the result(s) of the sub-transaction will not change from then on.

Figure 11. Generation of compensating transactions

Key Technologies for Automatic Generation of Compensating Transaction


Compensating actions are closely related to system states. The states describe current properties and
possible further action(s) of a transaction system. For example, we can describe the state of an airline
booking system as S={reservation, available}, where reservation is the number of available tickets, and
available indicates whether the system can accept new reservations or not. If reservation is greater than
0, available becomes true. Otherwise, available becomes false.
States are changed by operations. However, not all operations affect system states. For example,
the update(d1,d2) operation changes the data value from d1 to d2 and the Enroll message transfers the
Participant state from Active to Committing, but reading a data value does not affect the system state.
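A small, hypothetical sketch of the airline-booking state S = {reservation, available} and of an operation that changes it may make this distinction concrete; the class and method names are illustrative assumptions only.

// Sketch of the airline-booking state S = {reservation, available}.
public class BookingState {
    int reservation;        // number of available tickets
    boolean available;      // whether new reservations are accepted

    BookingState(int reservation) {
        this.reservation = reservation;
        this.available = reservation > 0;
    }

    // A state-changing operation: booking one ticket.
    void book() {
        reservation = reservation - 1;
        available = reservation > 0;
    }

    // A read does not affect the state, so it needs no compensating operation.
    int read() { return reservation; }
}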
Definition 3. A compensating transaction (CT) is a transaction that rolls back the operations taken by
a committed transaction T and semantically undoes the effects of the commit of T, the original
transaction of the compensating transaction.
A compensating transaction mainly involves two aspects: one is to undo the effects of the original operations, and the other is to recover system consistency. The key technologies for automatically generating compensating transactions include:
Definition of compensating rules,
Generation of compensating operations during the execution of a long-lived transaction, and
Generation of a compensating transaction at the commit of a sub-transaction.

Set Compensating Rules


Generation of a compensating transaction is event-driven. Compensating rules indicate how to undo the
effects from events that change system states. We divide these events into three types: data modification
event, transaction coordination event and service self-definition event (see Figure 11). Compensating
rules for the first two types of events are provided by GridTS while rules for service self-definition
events are set by service providers through the following interfaces:

setCompensatingRule(): sets compensating rules for grid services.
getCompensatingRule(): gets the compensating rules of grid services.

1. Data modification event: Currently, most companies store their information in relational databases. Data modification operations in an LGT mainly consist of the insertion, deletion and replacement of records in databases.
Definition 4. A data modification event refers to the insertion, deletion or modification of database records. Let eTi[p(d)] be a data modification event in which transaction Ti modifies data d using operation p, where p ∈ OT belongs to one of the operation types. Furthermore, DETi is the set of data modification events caused by Ti, and eTi[p(d)] ∈ DETi.
For a relational database, OT = {update, insert, delete}. We mainly analyze how to compensate these
three data modification operations.
Let Si and Si+1 be the states before and after Ti commits, respectively, CTi a compensating transaction of Ti, and Tj (j ≠ i) a dependent transaction that executes concurrently between Ti and CTi. If the
data accessed by Ti is not modified by Tj, CTi simply executes a reversed action for each operation
in Ti. Otherwise, CTi undoes the committed transaction Ti, but it may not change the results of the
dependent transaction Tj. For example, the cancellation of Alice's airline ticket reservation cannot
affect Bob's reservation. The compensating rules for the update, insert and delete operations are set
as follows.
Update:
Let opi = update(d1, d2) be an operation in Ti that replaces d1 with d2. How to compensate opi depends on the data modification operation opj in Tj.
An insert operation opj = insert(d) in Tj does not affect the result of opi. The compensating operation for opi is copi = update(d2, d1).
A delete operation opj = delete(d2) in Tj will delete the result of opi. As a result, it is not necessary to compensate opi.
An update operation opj = update(d2, d3) in Tj will change the result of opi. The compensating operation for opi depends on the type of the replacement operation.
Relevant replacement, Si+1 = f(Si, Ti). This means that the state Si+1 is relevant to the state Si, and copi has to remove the effect of opi. For example, if opi = update(d1, d2) and d2 = d1+n, the corresponding compensating operation is copi = update(d3, d4), where d4 = d3-n.
Irrelevant replacement, Si+1 = f(Ti), where Si+1 is irrelevant to Si, e.g., opi = update(Monday, Tuesday) and opj = update(Tuesday, Wednesday). Such a replacement need not be compensated.
Insert:
Let the operation opi = insert(d1) in Ti insert a record with value d1. copi also depends on the data modification operation opj in Tj.
opj = insert(d2) does not affect the result of opi, so the compensating rule for opi is copi = delete(d1).
opj = delete(d1) will delete the result of opi; however, opi need not be compensated, in order to keep the result of opj.
opj = update(d1, d2) shall be compensated as follows:
    if (d2 ≠ d1)
        if (relevant replacement) {
            temp = change caused by opj = update(d1, d2);
            insert(temp);
            delete(d1); }
        else
            do nothing;
Delete: A delete operation opi = delete(d) in Ti deletes a record with the value d. No operation in Tj can affect the result of opi, so the compensating operation for opi is simply the reversed operation copi = insert(d).
2. Transaction coordination event:
Definition 5. A transaction coordination event denotes that a sub-transaction receives a message from a coordinator. The set of transaction coordination events involved in a sub-transaction Ti is TETi ⊆ {CC, Enroll, Confirm, Cancel}.

Each transaction coordination event changes the state of the transaction system. GridTS sets the compensating rules for transaction coordination events in the following way:
For a CC message, it records the original transaction identifier and the input parameters.
For an Enroll message, it encapsulates the compensating operations between the delimiters Begin and Commit, and stores the compensating transaction CTi in a database.
For a Cancel message, it invokes the CTi stored in the database.
For a Confirm message, it deletes CTi from the database, because CTi is useless after the sub-transaction is confirmed.
3. Service self-definition event:
Definition 6. A service self-definition event refers to the actions that a service provider takes according to the states of a business process; it depends on the particular business model of the service provider.
The compensating rules for service self-definition events are defined by the service providers. They typically focus on:

Subsequent activities after undoing the operations of an original transaction, e.g., sending an email to notify the user of newly available services.

Economic compensation. For example, if a user cancels a committed sub-transaction that has finished a transportation order, the transportation company typically requires the user to pay compensation.

Generate Compensating Operations


During the execution of an LGT, the Compensating Transaction Generator (CTG) of GridTS monitors events,
such as a delete operation or an Enroll message. Once a predefined event occurs, the CTG examines whether
the conditions for a rule are satisfied. If so, it extracts the type and parameters of the operation, queries
the corresponding compensating rules for the operation, generates a compensating operation, and records
the input parameters. For example, when a sub-transaction deletes a record from the database, the Delete
event will generate a compensating operation to re-insert the record.
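A minimal sketch of this rule-driven generation, restricted to the three data modification operations and ignoring dependent transactions Tj and the relevant/irrelevant replacement analysis, might look as follows; the class, enum and method names are assumptions for illustration, not the actual CTG code.

// Simplified sketch of mapping a data modification event to a compensating operation.
public class CompensatingRuleSketch {

    enum Op { INSERT, DELETE, UPDATE }

    static String compensate(Op op, String d1, String d2) {
        switch (op) {
            case INSERT:  return "delete(" + d1 + ")";            // insert(d1)     -> delete(d1)
            case DELETE:  return "insert(" + d1 + ")";            // delete(d1)     -> insert(d1)
            case UPDATE:  return "update(" + d2 + "," + d1 + ")"; // update(d1,d2)  -> update(d2,d1)
            default:      return "noop";
        }
    }

    public static void main(String[] args) {
        // A sub-transaction deletes a record; the generated compensating operation re-inserts it.
        System.out.println(compensate(Op.DELETE, "ticket-4711", null));
    }
}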

Generate and Call Compensating Transactions


The Enroll message enables the CTG to generate the delimiters Begin and Commit and to combine the
compensating operations into a transaction. If the sub-transaction fails, all compensating operations
generated previously are abandoned. A compensating transaction is stored in a database and is deleted
from the database when GridTS receives a Confirm message from the Coordinator.
In an LGT, both a Cancel message and a timeout signal, which is generated after the transaction
expiration Tvalid, can start the corresponding compensating transaction.

Handle Noncompensable Transaction


A transaction is compensable if the effects of its commit can be semantically undone by another transaction, i.e., the corresponding compensating transaction. Otherwise, the transaction is noncompensable.
A compensating transaction consists of a set of compensating operations. A transaction T is compensable if and only if each operation OPi ∈ T has a corresponding compensating operation COPi. Some transactional grid applications comprise noncompensable operations, so these transactions are noncompensable. Generally, noncompensable operations can be divided into two types:
1. Operations that are difficult to compensate, such as the sale of stocks bought previously, where executing the compensating operation may cause unexpected results.
2. Operations that cannot be compensated at all. For example, it is impossible to compensate a launched missile.

Noncompensable operations often generate effects on outside activities, so in general their effects
are not allowed to be visible outside these applications. Thus, GridTS does not allow such a sub-transaction
to commit in the pre-commit phase if it cannot find compensating rule(s) for an operation. Instead, we
handle noncompensable transactions with the following policies.

GridTS imposes a commit dependence between the sub-transaction and the global transaction, which means that the sub-transaction actually commits only if the global transaction commits. GridTS rolls back the operations taken previously but returns the Committed message to the coordinator; after receiving the Confirm message, it redoes and commits the sub-transaction.
GridTS rolls back the executed operations and reports a commit exception to the user, who then decides how to handle the exception.
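A hypothetical sketch of the first policy, commit dependence, is shown below: the sub-transaction's operations are recorded during the pre-commit phase but only actually applied once the global Confirm arrives (a simplified stand-in for rolling back and later redoing them). The class and method names are assumptions for illustration.

import java.util.ArrayList;
import java.util.List;

// Sketch of the commit-dependence policy for noncompensable sub-transactions.
public class NoncompensableSubTransactionSketch {
    private final List<Runnable> deferredOperations = new ArrayList<>();

    void record(Runnable operation) {
        deferredOperations.add(operation);   // pre-commit phase: nothing is made visible yet
    }

    String preCommit() {
        return "Committed";                  // report Committed to the coordinator without applying effects
    }

    void onConfirm() {
        deferredOperations.forEach(Runnable::run);   // redo and actually commit the sub-transaction
    }

    void onCancel() {
        deferredOperations.clear();          // simply discard; nothing was ever applied
    }
}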

FUTURE TRENDS
We have proposed a transaction service, GridTS, and coordination algorithms for short-lived and long-lived transaction management in grid environments. It is an effort towards a reliable grid platform.
With the increasing reliability requirements of business applications, the reliable grid platform will be
an important research direction. A transaction service is an indispensable component of the emerging reliable grid infrastructure. Therefore, our GridTS and coordination algorithms will provide powerful
support for research on the reliability of grid platforms.
Future research along this direction includes the following aspects, in order to make GridTS
more practical and effective in commercial grid environments. The first is a security guarantee
during transaction processing. The Grid Security Infrastructure (GSI) may be used because it provides
authentication, authorization and communication protection based on the public-key mechanism, and it is the de facto standard authentication method with the single sign-on property. Another issue
is to investigate mechanisms for solving the possible deadlock problem of competing transactions,
as well as approaches for combining transaction management with resource scheduling and
management to enhance system efficiency.

CONCLUSION
We have proposed a grid transaction service, GridTS, and coordination algorithms for the management of
short-lived and long-lived reliable activities in grids. GridTS can coordinate different grid transactions by executing the corresponding coordination algorithms. The design that separates GridTS from the
algorithms makes GridTS more flexible and scalable, since new algorithms can be added for future
reliable applications.
Our proposal has four advantages. Firstly, for transactional grid applications, users only need to
submit the corresponding parameters (e.g., the transaction type and timeout); GridTS can intelligently
invoke the different coordination algorithms and handle the entire transaction process on behalf of the users, hiding the complex process from them. Secondly, GridTS is able to dynamically generate
compensating transactions during the execution of long-lived transactions, and at the same time it provides interfaces for setting up service-specific compensating rules to satisfy different application requirements.
Next, the long-lived coordination algorithm allows users to select committed results, which is applicable
to practical business applications. Finally, GridTS is extensible because it is built on top of a series of
open standards, technologies and infrastructures.

ACKNOWLEDGMENT
Feilong Tang would like to thank The Japan Society for the Promotion of Science (JSPS) and The University of Aizu (UoA), Japan for providing the excellent research environment during his JSPS Postdoctoral
Fellow Program in UoA, Japan. Thanks are also given to Dr. Chao-Li Wang in The University of Hong
Kong, China and Professor Zixue Cheng in UoA, Japan for their precious help.
This work is supported by the National High Technology Research and Development Program (863
Program) of China (Grant Nos. 2006AA01Z172, 2006AA01Z199 and 2008AA01Z106), the National
Natural Science Foundation of China (NSFC) (Grant Nos. 60773089,60533040, and 60725208), and
Shanghai Pujiang Program (Grant No. 07pj14049).

REFERENCES
Ammann, P., Jajodia, S., & Ray, I. (1997). Applying formal methods to semantic-based decomposition of transactions. ACM Transactions on Database Systems (TODS), 22(2), 215-254. doi:10.1145/249978.249981
Ancilotti, P., Lazzerini, B., & Prete, C. A. (1990). A distributed commit protocol for a multicomputer system. IEEE Transactions on Computers, 39(5), 718-724. doi:10.1109/12.53589
Berman, F., Fox, G., & Hey, T. (Eds.). (2003). Grid computing: Making the global infrastructure a reality. New York: Wiley Series in Communication Networking & Distributed Systems.
Cabrera, F., Copeland, G., Cox, B., et al. (2002). Web Services Transaction (WS-Transaction). Retrieved from http://www.ibm.com/developerworks/library/ws-transpec.
Chrysanthis, P., & Ramamritham, K. (1992). ACTA: The SAGA continues. In Transaction Models for Advanced Database Applications. San Francisco: Morgan Kaufmann.
Chrysanthis, P. K., & Ramamritham, K. (1994). Synthesis of extended transaction models using ACTA. ACM Transactions on Database Systems, 19(3), 450-491. doi:10.1145/185827.185843
Dalal, S., Temel, S., & Little, M. (2003). Coordinating business transactions on the Web. IEEE Internet Computing, 7(1), 30-39. doi:10.1109/MIC.2003.1167337
Foster, I., Kesselman, C., & Nick, J. (2002). Grid services for distributed system integration. IEEE Computer, 35(6), 37-46.
Fox, G., & Gannon, D. (2001). Computational grids. Computing in Science & Engineering, 3(4), 74-77. doi:10.1109/5992.931906
Garcia-Molina, H., & Salem, K. (1987). Sagas. In Proceedings of the ACM SIGMOD '87 International Conference on Management of Data, 16(3), 249-259.
Gray, J. (1981). The transaction concept: Virtues and limitations. In Proceedings of the 7th International Conference on VLDB (pp. 144-154).
Jiang, J. L., Yang, G. W., & Shi, M. L. (2006). Transaction model for service grid environment and implementation considerations. In Proceedings of the IEEE International Conference on Web Services (pp. 949-950).
Liang, D., & Tripathi, S. (1996). Performance analysis of long-lived transaction processing systems with rollbacks and aborts. IEEE Transactions on Knowledge and Data Engineering, 8(5), 802-815. doi:10.1109/69.542031
Tang, F. L., Li, M. L., & Huang, Z. X. (2004). Real-time transaction processing for autonomic grid applications. Engineering Applications of Artificial Intelligence, 17(7), 799-807. doi:10.1016/S0952-1976(04)00122-8
Thomasian, A. (1997). A performance comparison of locking methods with limited wait depth. IEEE Transactions on Knowledge and Data Engineering, 9(3), 421-434. doi:10.1109/69.599931
Wachter, H., & Reuter, A. (1992). ConTracts: A means for extending control beyond transaction boundaries. In Advanced Transaction Models for New Applications. San Francisco: Morgan Kaufmann.
Wang, T., Vonk, J., Kratz, B., & Grefen, P. (2008). A survey on the history of transaction management: From flat to grid transactions. Distributed and Parallel Databases, 23(3), 235-270. doi:10.1007/s10619-008-7028-1
Yang, Y. G., Jin, H., & Li, M. L. (2004). Grid computing in China. Journal of Grid Computing, 2(2), 193-206. doi:10.1007/s10723-004-4201-2
Yang, H. T., Wang, Z. H., & Deng, Q. H. (2008). Scheduling optimization in coupling independent services as a grid transaction. Journal of Parallel and Distributed Computing, 68(6), 840-854. doi:10.1016/j.jpdc.2008.01.004

KEY TERMS AND DEFINITIONS


Atomic Transaction: A short-lived transaction with the "all or nothing" property, i.e., the sub-transactions in an atomic transaction all commit or all abort.
Compensating Transaction: A transaction for undoing submitted transactions, which means canceling the submitted operations and recovering system consistency.
Grid Computing: A distributed computing paradigm for large-scale and effective resource sharing and task collaboration, enabling people to utilize computing and storage resources transparently.
Grid Transaction: A set of operations that execute on geographically distributed grid services.
Long-Lived Transaction: A transaction with a long lifetime. Generally, a long-lived transaction relaxes the atomicity and isolation properties.
Reliability: In transaction processing, reliability is the ability of a system or component to keep the system consistent by performing its required functions under stated conditions for a specified period of time.
Transaction Processing: A technology responsible for ensuring the reliable execution of distributed grid applications, keeping the system consistent and free from various failures. Ideally, it also shields users from the complex recovery process.


Chapter 20
Error Recovery for SLA-Based Workflows Within the Business Grid
Dang Minh Quan
International University in Germany, Germany
Jörn Altmann
Seoul National University, South Korea
Laurence T. Yang
St. Francis Xavier University, Canada

ABSTRACT
This chapter describes the error recovery mechanisms in the system handling the Grid-based workflow
within the Service Level Agreement (SLA) context. It classifies the errors into two main categories. The
first is large-scale errors, which occur when one or several Grid sites are detached from the Grid system at a
time. The second is small-scale errors, which may happen inside an RMS. For each type of error, the
chapter introduces a recovery mechanism, with the SLA context imposing the goal of the mechanism.
The authors believe that it is very useful to have an error recovery framework to avoid or eliminate the
negative effects of the errors.

INTRODUCTION
In the Grid Computing environment, many users need the results of their calculations within a specific
period of time. Examples of those users are meteorologists running weather forecasting workflows,
automobile producers running dynamic fluid simulation workflow (Lovas et al., 2004). Those users are
willing to pay for getting their work completed on time. However, this requirement must be agreed on
by both, the users and the Grid provider, before the application is executed. This agreement is kept in the
Service Level Agreement (SLA) (Sahai et al., 2003). In general, SLAs are defined as an explicit statement of expectations and obligations in a business relationship between service providers and customers.
SLAs specify the a-priori negotiated resource requirements, the quality of service (QoS), and costs. The
application of such an SLA represents a legally binding contract. This is a mandatory prerequisite for
the Next Generation Grids. The basic concepts of a system handling the Grid-based workflow within
an SLA context are described in the following sections.

Grid-Based Workflow Model


Workflows have received enormous attention in the databases and information systems research and development community (Georgakopoulos et al., 1995). According to the definition from the Workflow Management Coalition (WfMC) (Fischer, 2004), a workflow is "the automation of a business process, in
whole or parts, where documents, information or tasks are passed from one participant to another to be
processed, according to a set of procedural rules." Although business workflows have a great influence
on research, another class of workflows emerged in sophisticated scientific problem-solving environments, called Grid-based workflows. A Grid-based workflow differs slightly from the WfMC
definition, as it concentrates on intensive computation and data analysis rather than on the business process.
A Grid-based workflow is characterized by the following features (Singh et al., 1997):

A Grid-based workflow usually includes many sub-jobs (i.e. applications), which perform data
analysis tasks. However, those sub-jobs are not executed freely but in a strict sequence.
A sub-job in a Grid-based workflow depends tightly on the output data from previous sub-jobs.
With incorrect input data, a sub-job will produce wrong results and damage the result of the whole
workflow.
Sub-jobs in the Grid-based workflow are usually computationally intensive. They can be sequential or parallel programs and require a long runtime.
Grid-based workflows usually require powerful computing facilities (e.g. super-computers or
clusters) to run on.

Most existing Grid-based workflows (Ludtke et al., 1999; Berriman et al., 2003; Lovas et al., 2004)
can be represented in Directed Acyclic Graph (DAG) form, so only DAG workflows are considered in
this chapter. The user specifies the required resources needed to run each sub-job, the data transfer between
sub-jobs, the estimated runtime of each sub-job, and the expected runtime of the whole workflow.
In this chapter, we assume that time is split into slots. Each slot equals a specific period of real time,
from 3 to 5 minutes. We use the time slot concept in order to limit the number of possible start-times
and end-times of sub-jobs. Moreover, a delay of 3 minutes has little impact on the customer. Note
that the data to be transferred between sub-jobs can be very large.
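As a rough illustration, not taken from the system described here, a DAG workflow whose runtimes are expressed in time slots might be represented as in the following hypothetical sketch; all class and field names are assumptions.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a DAG-shaped Grid workflow with runtimes in time slots
// (e.g. one slot = 3 to 5 minutes of real time).
public class WorkflowSketch {

    static class SubJob {
        final String name;
        final int estimatedRuntimeSlots;                         // runtime estimated by the user, in slots
        final List<SubJob> successors = new ArrayList<>();       // edges carrying (possibly large) data
        SubJob(String name, int estimatedRuntimeSlots) {
            this.name = name;
            this.estimatedRuntimeSlots = estimatedRuntimeSlots;
        }
    }

    public static void main(String[] args) {
        SubJob sj0 = new SubJob("sub-job 0", 4);
        SubJob sj1 = new SubJob("sub-job 1", 6);
        SubJob sj2 = new SubJob("sub-job 2", 3);
        sj0.successors.add(sj1);   // sub-job 1 needs the output data of sub-job 0
        sj0.successors.add(sj2);
        System.out.println(sj0.name + " feeds " + sj0.successors.size() + " sub-jobs");
    }
}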

Grid Service Model


The computational Grid includes many High Performance Computing Centers (HPCCs). The resources of each HPCC are managed by a software called local Resource Management System (RMS)1.
Each RMS has its own unique resource configuration. A resource configuration comprises the number of
CPUs, the amount of memory, the storage capacity, the software, the number of experts, and the service
price. To ensure that the sub-job can be executed within a dedicated time period, the RMS must support
advance resource reservation, such as CCS (Hovestadt, 2003). In our model, we reserve three main types
of resources: CPU, storage, and expert. The addition of further resources is straightforward.
If two output-input-dependent sub-jobs are executed on the same RMS, it is assumed that the time
required for the data transfer equals zero. This can be assumed since all compute nodes in a cluster usually use a shared storage system like NFS or DFS. In all other cases, it is assumed that a specific amount
of data will be transferred within a specific period of time, requiring the reservation of bandwidth.
The link capacity between two local RMSs is determined as the average available capacity between
those two sites in the network. The available capacity is assumed to be different for each different RMS
couple. Whenever a data transfer task is required on a link, the possible time period on the link is determined. During that specific time period, the task can use the whole capacity, and all other tasks have
to wait. A more realistic model for bandwidth estimation (than the average capacity) can be found in
(Wolski, 2003). Note that the kind of bandwidth estimation model does not have any impact on the working
of the overall mechanism.
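Under the simplifications stated above, the transfer duration in time slots could be estimated as in the following hypothetical sketch: zero on the same RMS, otherwise data size divided by the average link capacity, rounded up to whole slots. Parameter names and values are illustrative assumptions.

// Sketch of the simple transfer-time estimate used in this model.
public class TransferTimeSketch {

    static int transferSlots(String sourceRms, String targetRms,
                             double dataSizeMb, double avgCapacityMbPerSec, int slotSeconds) {
        if (sourceRms.equals(targetRms)) {
            return 0;                                       // shared storage such as NFS/DFS
        }
        double seconds = dataSizeMb / avgCapacityMbPerSec;  // the task may use the whole link capacity
        return (int) Math.ceil(seconds / slotSeconds);
    }

    public static void main(String[] args) {
        // 20 GB between two RMSs over an average 50 MB/s link, with 3-minute slots -> 3 slots.
        System.out.println(transferSlots("RMS-A", "RMS-B", 20_000, 50, 180) + " slot(s)");
    }
}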

Business Model
In the case of Grid-based workflow, letting users work directly with resource providers has two main
disadvantages:

The user has to have sophisticated resource discovery and mapping tools in order to find the appropriate resource providers.
The user has to manage the workflow, ranging from monitoring the running process to handling
error events.

To free users from this kind of work, it is necessary to introduce a broker handling the workflow
execution for the user. We proposed a business model (Quan and Altmann, 2007a) for the system. There
are three main entities: the end-user, the SLA broker and the service provider:
The end-user wants to run a workflow within a specific period of time. The user asks the broker to
execute the workflow for him and pays the broker for the workflow execution service. The user does not
need to know in detail how much he has to pay to each service provider. He only needs to know the total
amount. This amount depends on the urgency of the workflow and the budget of the user. If there is an
SLA violation, for example if the runtime deadline has not been met, the user will ask the broker for compensation. This compensation is clearly defined in the Service Level Objectives (SLOs) of the SLA.
The SLA workflow broker represents the user as specified in the SLA with the user. It controls the
workflow execution. This includes mapping of sub-jobs to resources, signing SLAs with the service
providers, monitoring, and error recovery. When the workflow execution has finished, it settles the accounts. It pays the service providers and charges the end-user. The profit of the broker is the difference.
The value-add that the broker provides is the handling of all the tasks for the end-user.
The service providers execute the sub-jobs of the workflow. In our business model, we assume that
each service provider fixes the price for its resources at the time of the SLA negotiation. As the resources
of an HPCC usually have the same configuration and quality, each service provider has a fixed policy
for compensation if its resources fail. For example, such a policy could be that n% of the cost will be
compensated if the sub-job is delayed one time slot.
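A small, hypothetical sketch of such a fixed compensation policy, with n% of the sub-job cost refunded per delayed time slot, is given below; the percentage, the cap and all names are illustrative assumptions.

// Sketch of a provider's fixed compensation policy: n% of the sub-job cost per delayed slot.
public class CompensationPolicySketch {

    static double compensation(double subJobCost, double percentPerSlot, int delayedSlots) {
        double refund = subJobCost * (percentPerSlot / 100.0) * delayedSlots;
        return Math.min(refund, subJobCost);   // never refund more than was paid (an assumed cap)
    }

    public static void main(String[] args) {
        // A 200-unit sub-job, 5% per slot, delayed by 3 time slots -> 30 units of compensation.
        System.out.println(compensation(200, 5, 3));
    }
}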
Figure 1 depicts a sample scenario of running a workflow on the Grid environment.

Figure 1. A sample running Grid-based workflow scenario

Problem Statement
In a large and complex system like the Grid, errors can happen at any time and in any part of the system
with high frequency. The sources of errors vary: broken network cables, software faults, hardware
errors, and so on. Specifically, though, we classify the errors into two main categories.

The Large-Scale Error


A large-scale error happens when one or several Grid sites are detached from the Grid system at any
given time. This error may be caused by a broken network link, a system power down and similar breakdowns. When one RMS is detached from the Grid system, all running/waiting sub-jobs from several
workflows in that RMS are considered as failed since the system cannot control the status and collect
the result from it.
The checkpoint images of all sub-jobs in the failed RMS cannot be used to restart them in other
healthy RMSs. Moreover, output data from the finished sub-jobs in the failed RMS is not available. Thus,
several waiting sub-jobs in the other healthy RMSs cannot be run because of the unavailability of input
data. If the workflow is cancelled because of the error, the system will be seriously fined as stated
in the SLA. Thus, the system has no choice but to try to finish executing the workflow by re-running all
failed sub-jobs. However, this task faces two main problems.

Mapping and re-executing only the failed sub-jobs in the other healthy RMSs is not enough. A workflow requires strict execution order to ensure its integrity. Considering only the failed sub-jobs
and dismissing the others may break this integrity.
Thus, determining all sub-jobs needed to continue the workflow execution is a mandatory
requirement.
When sub-jobs of the workflow must be re-executed, the likelihood of finishing the workflow execution on time as stated in the original SLA is very low and the likelihood of being fined for not
fulfilling the SLA is very high. Within the SLA context, which relates to business, the fine is usually

very costly and increases with the lateness of the workflow's finished time. Thus, those sub-jobs
must be mapped to the healthy RMSs in a way which minimizes the workflow's finished time.

The Small-Scale Error


An error inside an RMS may happen at any time during the sub-job running period. The error could have
been caused by an operating system error, hardware error, or internal network cable error. In this case,
the RMS will restart the sub-job from the checkpoint image. We also assume that the time to detect the
error and the time to re-run the sub-job from the checkpoint image will cause the end-time of the sub-job to be later than the pre-determined deadline. According to our business model, because the provider is responsible for the error, the late sub-job will not be cancelled but will be allowed
to run a few additional time slots. However, when one sub-job is delayed, the output data transfer to the
subsequent sub-jobs is delayed as well, causing the start-time of those sub-jobs to be delayed. If those
sub-jobs do not have sufficient computational resources allocated to compensate for the shorter time
available for completing their calculations, the original error of the RMS might cause them to fail their
calculations. This causes a cascading effect of failing sub-jobs. Therefore, the whole workflow will
fail because of one single error.
A concrete example of such an error scenario for the workflow in Figure 1 is that if sub-job 0 is delayed
1 time slot, the data transfer tasks 0-1, 0-2, 0-3, 0-4 cannot be executed. Thus, sub-jobs 1, 2, 3, and 4 do
not get the input data to start their calculation at the specified start-times. The consequence is that the
start of those sub-jobs will also be delayed by one time slot. Those sub-jobs, however, might not have
enough computational resources available to finish their calculation on time.
To avoid the delay of the whole workflow, the resource allocation of the sub-jobs of the workflow
must be re-scheduled so that they compensate for the delay. However, this re-scheduling may bring some negative side-effects.

The finished time of the workflow may exceed the pre-determined time period. The broker will be
fined by the user according to the length of the delay.
In the re-scheduling, if the remaining sub-jobs must be moved to other RMSs, the broker has to
cancel the old reservation contracts. If this is not mentioned in the SLA, the broker will be fined
by the service providers.

Thus, the system must have error recovery mechanisms in order to avoid or eliminate the negative
effect caused by both the large-scale errors and the small-scale errors. In detail, it is desirable to have a
mechanism to re-schedule the sub-jobs of the workflow in such a way that the workflow can be executed
to produce the final result, while trying to keep the fines as low as possible. This chapter presents an
error recovery framework for the SLA-based workflows that addresses these problems.
The chapter is organized as follows. The second section describes the related work. The third section
presents the error recovery mechanisms. The fourth section describes the re-mapping algorithms to re-map
sub-jobs of the affected workflow to the healthy Grid resources. The experiment on the performance
of the recovery mechanisms is discussed in the fifth section. The sixth section presents the future research
direction and the last section concludes the chapter with a short summary.


Figure 2. The error recovery framework

RELATED WORK
Little work exists on the issue of error recovery for workflows, although the importance of fault tolerance in Grid computing has already been acknowledged with the establishment of the Grid Checkpoint
Recovery Working Group. Its purpose is to define user-level mechanisms and Grid services for achieving
fault tolerance. (Stone, 2004) described some initial results of the group's effort.
The well-known Condor system has also implemented a mechanism to handle errors (Condor, 2006).
When the mechanism detects the error, it continues to execute the other sub-jobs of the workflow as
long as possible. This mechanism is reasonable if no SLAs have to be considered. Since it does not pay
attention to meeting the deadline of a workflow, the cost incurred through fines and the need for extra
resources can become very high.
The literature records a considerable amount of work in related areas, especially in finding recovery
methods for single Grid jobs. (Garbacki et al., 2005) present transparent fault tolerance for Grid
applications based on Java RMI. They use globally consistent checkpoints to avoid having to restart long-running computations from scratch after a system crash.
(Hwang and Kesselman, 2003) present a framework for handling errors on the Grid. Central to the
framework is flexibility in handling errors, which is achieved by using the workflow structure as a high-level recovery policy specification.
(Heine et al., 2005a) describe an SLA-aware job migration mechanism in Grid environments. Checkpoints of the running job can be migrated to the same or other clusters running the HPC4U software (see
Heine et al., 2005b). An architecture called VRM (Virtual Resource Management) manages the status
of the process continuously.


ERROR RECOVERY FRAMEWORK


The error recovery framework is presented in Figure 2. Error detection is done with a monitoring module
which collects information about the RMS status, the RMS resources, the RMS reservations, the sub-job
states and so on from all RMSs. The information is analyzed and stored in the central database to ensure
that the broker module has an overall picture of the system. When an error is detected, the broker
activates the error recovery module with an appropriate recovery strategy.

Recovering from the Large-Scale Error


In the SLA context, every sub-job of the workflow is planned to run on reserved resources within a
specific time period to ensure the QoS while still preserving the integrity of the workflow. During the
running process of the workflow, one or several RMSs can be detached from the system at any time. If
this happens, we first determine all affected workflows. For each affected workflow, we check
whether only independent sub-jobs are affected. If so, we try to re-map those sub-jobs to the healthy
RMSs in a way that does not affect other sub-jobs of the workflow. If there are dependently affected
sub-jobs or if the re-mapping of independently affected sub-jobs fails, the affected workflow is added
to a list. After that, we determine the re-mapping priority for those workflows. For each workflow in
the priority order, we determine the sub-jobs which need to be re-mapped. Those sub-jobs form a new
workflow derived from the old one. We use the w-Tabu algorithm to map this new workflow to the healthy RMSs and
optimize the finished time. The following parts describe each step in detail.

Checking Workflows having only Independent Sub-Jobs Affected


An independently affected sub-job is a sub-job that the error affects directly without affecting its previous
or subsequent dependent sub-jobs. To have a clear view, we can look at the running scenario in Figure 1.
If RMS 3 fails while sub-job 3 is running, the error directly affects only sub-job 3. Sub-job 0
and sub-job 6 are not directly affected. Thus, sub-job 3 is an independently affected sub-job. In this case,
we try to re-map sub-job 3 to another healthy RMS in a way that does not affect the start time of sub-job 6.
This problem is similar to the problem of recovering the directly affected sub-jobs, which is described
in the small-scale error recovery section. This step is worth doing as it can be performed in a relatively
short period of time and, if it succeeds, the negative effect of the error is greatly reduced.
If RMS 2 fails while sub-job 2 is running, sub-jobs 1, 2, 6, and 7 are directly affected. Sub-job 6
depends on sub-job 2, and sub-job 7 depends on sub-jobs 1 and 6. Thus, sub-jobs 1, 2, 6, and 7 are dependently affected.
Re-mapping those sub-jobs seriously affects the integrity structure of the old workflow mapping solution
and we consider the workflow as seriously affected. To recover from this error, we use the following procedures.

Determining the Re-Mapping Priority


When the error happens, many workflows can be affected simultaneously and we have to re-plan many
new workflows formed from the sets of determined affected sub-jobs. One problem is determining the
priority order in which to re-map the workflows. This order is important because it affects the lateness of the workflows. Here,
we use the Earliest Deadline First (EDF) policy, which is used broadly in real-time systems. A workflow
with an earlier deadline is given higher priority, as it occupies resources for a shorter time and the other work-


flows need to wait a shorter time for available resources. Thus, the total lateness is reduced and the fine
amount is also reduced.
To clarify the problem, suppose we have two workflows that need to be re-mapped and the Grid system can
execute only one workflow at a time. Workflow 1 was planned to finish at t1 and workflow 2
was planned to finish at t2, with t2 > t1. Suppose that the penalty for each hour of lateness is P. If workflow
2 is mapped first, workflow 1 has to wait until workflow 2 is finished. Thus, the minimal fine will
be P*(t2-fail_slot). If workflow 1 is mapped first, the minimal fine will be P*(t1-fail_slot). Therefore,
mapping workflow 1 first is better than mapping workflow 2 first. In a real, complex situation, mapping workflow
1 first gives a better chance to finish workflow 1 earlier and to release resources earlier, and thus gives a better
chance for workflow 2 to be mapped with smaller lateness.
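The EDF ordering and the fine bound from this two-workflow example can be written down in a few lines. The following Python sketch uses hypothetical field names and is only an illustration of the policy described above, not the chapter's implementation.

def edf_order(workflows):
    """Sort affected workflows so that the earliest planned deadline is re-mapped first."""
    return sorted(workflows, key=lambda wf: wf["deadline"])

def minimal_fine(deadline, fail_slot, penalty_per_slot):
    """Lower bound on the fine P*(t - fail_slot) used in the example above."""
    return penalty_per_slot * max(0, deadline - fail_slot)

workflows = [{"id": 2, "deadline": 40}, {"id": 1, "deadline": 25}]
print([wf["id"] for wf in edf_order(workflows)])   # -> [1, 2]: workflow 1 is re-mapped first
print(minimal_fine(deadline=25, fail_slot=10, penalty_per_slot=3.0))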

Determining Sub-Jobs which need to be Re-Planned


Determining all sub-jobs to be re-mapped in a workflow is done with the following procedure (a code sketch after the steps illustrates it).


Step 1: Clear the re-mapped set.
Step 2: Put all sub-jobs which are running in a failed RMS into the re-mapped set.
Step 3: Re-mapping those determined sub-jobs leads to the need for re-mapping all other
subsequent sub-jobs to ensure the integrity of the workflow. Thus, all those subsequent sub-jobs
are put into the re-mapped set.
Step 4: Affected sub-jobs located in the failed RMSs will not have input data to run
if their directly preceding, already finished sub-jobs were also in the failed RMSs. Thus, it is necessary to put
those finished sub-jobs into the re-mapped set.
Step 5: Affected sub-jobs located in the healthy RMSs will not have input data to
run if their directly preceding, already finished sub-jobs were in the failed RMSs and the related data transfer
task has not finished. Those finished sub-jobs must also be put into the re-mapped set.
Step 6: All other sub-jobs of the workflow which have not yet received their data from those determined
sub-jobs must be re-mapped to ensure the integrity of the workflow.
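The following rough Python sketch mirrors Steps 1 to 6; the workflow object, its fields, and the traversal details are assumptions made for illustration and do not come from the chapter's implementation.

def subjobs_to_remap(workflow, failed_rms):
    """workflow.subjobs: iterable of sub-job ids; workflow.state[s] in {'finished', 'running', 'waiting'};
    workflow.rms[s]: the RMS a sub-job was mapped to; workflow.successors/predecessors: dependency maps;
    workflow.transfer_done[(p, s)]: whether the data transfer p -> s completed (all hypothetical fields)."""
    remap = set()

    # Step 2: running sub-jobs located in a failed RMS.
    frontier = [s for s in workflow.subjobs
                if workflow.state[s] == "running" and workflow.rms[s] in failed_rms]
    remap.update(frontier)

    # Steps 3 and 6: every downstream sub-job that depends on a re-mapped one.
    while frontier:
        s = frontier.pop()
        for succ in workflow.successors[s]:
            if succ not in remap:
                remap.add(succ)
                frontier.append(succ)

    # Steps 4 and 5: finished predecessors in a failed RMS whose output is no longer usable.
    for s in list(remap):
        for pred in workflow.predecessors[s]:
            if (workflow.state[pred] == "finished"
                    and workflow.rms[pred] in failed_rms
                    and (workflow.rms[s] in failed_rms                              # Step 4
                         or not workflow.transfer_done.get((pred, s), False))):    # Step 5
                remap.add(pred)

    return remap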

Based on the determined priority, each workflow will be mapped in sequence to the healthy RMSs.
To do the mapping, we express the new workflow in Directed Acyclic Graph (DAG) format and then
use the mapping module to map this new DAG workflow to RMSs. When forming the DAG for a workflow,
it is necessary to consider the dependency of the affected sub-jobs on the running sub-jobs in the healthy RMSs
to ensure the integrity of the workflow. To represent that dependency, in the new workflow, for each
running sub-job in the healthy RMSs, we create a corresponding pseudo sub-job with:



Runtime = deadline - fail slot - time overhead
Number of required CPUs = 0
Number of required storage = 0
Number of required experts = 0

where the time overhead value is the period needed to perform the recovery process. Moreover, we also need a new
pseudo source sub-job for the workflow with runtime and resource requirements equal to 0.
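As a small illustration (field names are assumptions), a pseudo sub-job and the pseudo source sub-job could be built as follows.

def make_pseudo_subjob(deadline_slot: int, fail_slot: int, overhead_slots: int) -> dict:
    """Zero-resource placeholder for a sub-job still running in a healthy RMS."""
    return {
        "runtime": max(0, deadline_slot - fail_slot - overhead_slots),
        "cpus": 0,        # no resources are actually reserved for the placeholder
        "storage": 0,
        "experts": 0,
    }

pseudo_source = {"runtime": 0, "cpus": 0, "storage": 0, "experts": 0}
print(make_pseudo_subjob(deadline_slot=30, fail_slot=12, overhead_slots=2))  # runtime 16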
Because of having to rerun even the already finished sub-jobs, the probability of having many solu-


tions that meet the original deadline is very low. Thus, we bypass the attempt to optimize the cost while
ensuring the deadline. We use the w-Tabu algorithm to minimize the finished time of the workflow. The
w-Tabu algorithm is presented in the re-mapping algorithms section.

Recovering from the Small-Scale Error


When the small-scale error happens, we try to re-map the remaining sub-jobs in a way that the workflow
can complete with little delay and little extra cost. The entire strategy includes three phases, as described in
Figure 2. Each phase represents a certain approach to finding a re-mapping solution. The phases are sorted
according to their simplicity and the cost that they incur.

Phase 1: Re-Mapping the Directly Affected Sub-Jobs


In the first phase, we will try to re-map the directly affected sub-jobs in a way that does not affect the
start time of the other remaining sub-jobs in the workflow. When we re-map the directly affected sub-jobs, we also have to re-map their related data transfers. For the example in Figure 1, if sub-job 0 is
delayed, the affected sub-jobs are sub-jobs 1, 2, 3, and 4 and their related data transfers. This task can be
feasible for several reasons.


The delay of the late sub-job could be very small.


The Grid may have other solutions in which the data transfers are shorter because the links have
higher bandwidth.
The Grid may have RMSs with higher CPU power which can execute the sub-jobs in a shorter
time.

In the first place, we try to adjust the execution time of the input data transfers, the affected sub-jobs, and the output data transfers within the same RMS as pre-determined. Sub-jobs which cannot be
adjusted will be re-mapped to other RMSs. If this phase is successful, the broker only has to pay the
following costs:

The fee for canceling the reserved resources of directly affected sub-jobs.
The extra resource cost if the new mapping solution is more expensive than the old one.

As the cost of this phase is the lowest of the three phases, it should be tried first. The algorithm to re-map the
directly affected sub-jobs, called G-Map, is described in more detail in the re-mapping algorithms section.

Phase 2: Re-Mapping the Workflow to Meet the Pre-Determined Deadline


This phase is executed if the first phase was not successful. In this phase, we will try to re-map the
remaining workflow in a way that the deadline of the workflow is met and the cost is minimized. The
remaining workflow is formed in a way similar to that described in the large-scale error recovery section. If this phase is
successful, the broker has to pay the following costs:


The fee for canceling the reserved resources of all remaining sub-jobs.


The extra resource cost if the new mapping solution is more expensive than the old one.

To perform the mapping, we use the H-Map algorithm to find the solution. The detailed description
of the H-Map algorithm can be found in the re-mapping algorithms section.

Phase 3: Re-Mapping the Workflow to have Minimal Runtime


This phase is the final attempt to recover from the error. It is initiated if the two previous phases were
not successful. In this phase, we try to re-map the remaining workflow in a way that minimizes the delay
of the entire workflow. If the solution has an acceptable lateness, the broker has to pay the following
costs:


The fee for canceling the reserved resources of all remaining sub-jobs.
The extra resource cost if the new mapping solution has higher cost than the old one.
The fine for finishing the entire workflow late. This cost increases proportionally with the length
of the delay.

If the algorithm only finds a solution with a delay higher than accepted by the user, the whole workflow will be cancelled and the broker has to pay the following costs:

The fee for canceling the reserved resources of all remaining sub-jobs.
The fine for not finishing the entire workflow.

The goal of this phase is equivalent to minimizing the total runtime of the workflow. To do the re-mapping,
we use the w-Tabu algorithm, which is described in the re-mapping algorithms section.

Recovery Procedure
When the error recovery module is activated, it performs the following actions in strict sequence (a schematic code sketch follows the list):

Access the database to retrieve information about the failed RMSs and determine the affected workflows as
well as the sub-jobs of each workflow that need to be remapped.
Based on the determined information about the affected workflows and sub-jobs, activate the negotiation
module to cancel all sub-job SLAs with the local RMSs related to those sub-jobs. All negotiation
activities are done with the help of the SLA text as the means of communication.
Activate the monitoring module to update the newest information about the RMSs, especially information
about resource reservations.
Call the mapping modules to determine where and when the sub-jobs of the affected workflow will be
run.
Based on the mapping information, activate the negotiation module to sign a new SLA for each sub-job with the specific local RMS.
Update the workflow control information and sub-job information in the central database.
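The strict sequence above can be summarized by the following schematic sketch; the module objects and their method names are hypothetical and only mirror the order of the actions listed.

def recover(db, negotiation, monitoring, mapping, failed_rms):
    # 1. Determine the affected workflows and the sub-jobs that must be re-mapped.
    affected = db.affected_workflows(failed_rms)

    # 2. Cancel the existing sub-job SLAs with the local RMSs.
    for wf in affected:
        negotiation.cancel_slas(wf.subjobs_to_remap)

    # 3. Refresh the resource and reservation information of the healthy RMSs.
    monitoring.update_rms_status()

    # 4. Decide where and when the affected sub-jobs will run.
    solutions = [mapping.remap(wf) for wf in affected]

    # 5. Sign a new SLA for every re-mapped sub-job with its local RMS.
    for solution in solutions:
        negotiation.sign_slas(solution)

    # 6. Persist the new workflow control and sub-job information.
    db.update(solutions)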


Figure 3. w-Tabu algorithm overview

RE-MAPPING ALGORITHMS
This section presents all algorithms used in the error recovery process. They include the w-Tabu algorithm to
optimize the finished time of a workflow, the H-Map algorithm to optimize the cost of running a workflow
while ensuring the deadline, and the G-Map algorithm to map a group of sub-jobs satisfying the deadline
while optimizing the cost.

Formal Mapping Problem Statement


The formal specification of the described problem includes the following elements:

Let R be the set of Grid RMSs. This set includes a finite number of RMSs, which provide static
information about controlled resources and the current reservations/assignments.
Let S be the set of sub-jobs in a given workflow including all sub-jobs with the current resource
and deadline requirements.
Let E be the set of edges in the workflow, which express the dependency between the sub-jobs and
the necessity for data transfers between the sub-jobs.
Let Ki be the set of resource candidates of sub-job si. This set includes all RMSs which can run
sub-job si, Ki ⊆ R.

Based on the given input, a feasible and possibly optimal solution is sought, allowing the most efficient mapping of the workflow in a Grid environment with respect to the given global deadline. The
required solution is a set defined in Formula 1.
M = {(si, rj, start_slot) | si ∈ S, rj ∈ Ki}                (1)

If the solution does not have start_slot for each si, it becomes a configuration as defined in Formula
2.
a = {(si, rj) | si ∈ S, rj ∈ Ki}                (2)


A feasible solution must satisfy the following conditions:


Criterion 1: The finished time of the workflow must be smaller than or equal to the expected deadline
of the user.
Criterion 2: All Ki ≠ ∅. There is at least one RMS in the candidate set of each sub-job.
Criterion 3: The dependencies of the sub-jobs are resolved and the execution order remains
unchanged.
Criterion 4: The capacity of an RMS must be equal to or greater than the requirement at any time
slot. Each RMS provides a profile of currently available resources and can run many sub-jobs of
a single workflow both sequentially and in parallel. Those sub-jobs which run on the same RMS form
a profile of resource requirements. For each RMS rj running sub-jobs of the Grid workflow, and
for each time slot in the profile of available resources and the profile of resource requirements, the
number of available resources must be at least as large as the resource requirement.
Criterion 5: The data transmission task eki from sub-job sk to sub-job si must take place in dedicated time slots on the link between the RMS running sub-job sk and the RMS running sub-job si, for all
eki ∈ E.

In the next phase, the feasible solution with the lowest cost is sought. The cost C of running a Grid
workflow is defined in Formula 3. It is the sum of four factors: the cost of using the CPU, the cost of
using the storage, the cost of using the experts' knowledge, and finally the expense for transferring data
between the resources involved.
C = Σi=1..n [ si.rt * (si.nc * rj.pc + si.ns * rj.ps + si.ne * rj.pe) + eki.nd * rj.pd ]                (3)

with si.rt, si.nc, si.ns, si.ne being the runtime, the number of CPUs, the amount of storage, and the number
of experts of sub-job si, respectively. rj.pc, rj.ps, rj.pe, rj.pd are the prices of using the CPU, the storage, the
experts, and the data transmission of RMS rj, respectively. eki.nd is the amount of data to be transferred
from sub-job sk to sub-job si.
If two dependent sub-jobs run on the same RMS, the cost of transferring data from the previous sub-job to the subsequent sub-job is neglected.
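For illustration, Formula 3 (including the same-RMS exception) can be transcribed directly into code; the dictionary layout, field names, and the choice of which RMS's transfer price applies are assumptions.

def workflow_cost(mapping, subjobs, edges, rms):
    """mapping[s] -> RMS id; subjobs[s] -> dict with rt, nc, ns, ne;
    edges[(k, i)] -> amount of data nd; rms[r] -> dict with prices pc, ps, pe, pd."""
    cost = 0.0
    for s, job in subjobs.items():
        price = rms[mapping[s]]
        cost += job["rt"] * (job["nc"] * price["pc"]
                             + job["ns"] * price["ps"]
                             + job["ne"] * price["pe"])
    for (k, i), nd in edges.items():
        if mapping[k] != mapping[i]:                 # same-RMS transfers cost nothing
            # the price of the RMS running the destination sub-job is assumed here
            cost += nd * rms[mapping[i]]["pd"]
    return cost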
For the problem of optimizing the finished time of the workflow, it is not necessary to meet Criterion 1. For the problem of mapping a group of sub-jobs to resources, Criterion 1 is expressed as follows:
the start time of each input data transfer must be later than the end of the sub-job it depends on, and the stop time
of each output data transfer must be earlier than the start of the next sub-job which depends on it.
Suppose the Grid system has m RMSs which can satisfy the requirements of the n sub-jobs in a workflow. As an RMS can run several sub-jobs at a time, exhaustively finding the optimal solution needs m^n loops. It
can easily be shown that the optimal mapping of the workflow to the Grid RMSs as described above is
an NP-hard problem.


w-Tabu Algorithm
The main purpose of the w-Tabu algorithm is finding a solution with the minimal finished time. Although
the problem has the same objective as most existing algorithms mapping a DAG to resources (Deelman et al., 2004), the defined context is different from all other contexts appearing in the literature. In
particular, our context is characterized by resource reservation, each sub-job being a parallel application,
and each RMS being able to run several sub-jobs simultaneously. Thus, a dedicated algorithm is necessary. We
proposed a mapping strategy as depicted in Figure 3.
Firstly, a set of referent configurations is created. Then we use a specific module to improve the quality of each configuration as far as possible. The best configuration will be selected. This strategy resembles
the outline of a long-term local search such as Tabu Search, GRASP, Simulated Annealing and
so on. However, the detailed description below distinguishes our algorithm from them.

Generating Referent Solution Set


Each configuration in the referent configuration set can be thought of as the starting point for a
local search, so the configurations should be spread as widely as possible over the search space. To satisfy
this space-spreading requirement, the number of identical sub-job:RMS mappings between two configurations must
be as small as possible. The number of members in the referent set depends on the number of available RMSs and the number of sub-jobs. During the process of generating the referent solution set, each
candidate RMS of a sub-job has an associated assign_number counting the times that RMS has been assigned
to the sub-job. During the process of building a referent configuration, we use a similar set to store all
already defined configurations having at least one sub-job:RMS mapping in common with the configuration being created.
The algorithm is defined in Algorithm 1.
Algorithm 1. Generating reference set algorithm

assign_number of each candidate RMS = 0
While m_size < max_size {
    Clear the similar set
    For each sub-job in the workflow {
        For each RMS in the candidate list {
            For each solution in the similar set {
                If the solution contains sub-job:RMS
                    num_sim++
            }
            Store tuple (sub-job, RMS, num_sim) in a list
        }
        Sort the list
        Pick the best result
        assign_number++
        If assign_number > 1
            Find the defined solutions having the same sub-job:RMS and put them into the similar set
    }
}


While building a configuration, for each sub-job in the workflow we select the RMS in the set of
candidate RMSs which creates a minimal number of identical sub-job:RMS mappings with the other configurations in
the similar set. After that, we increase the assign_number of the selected RMS. If this value is larger
than 1, which means that the RMS has been assigned to the sub-job more than once, there must exist
configurations that contain the same sub-job:RMS mapping and thus satisfy the similarity condition. We search for
these configurations in the reference set, excluding those already in the similar set, and then add them to the
similar set. When finished, the configuration is put into the referent set. After all reference configurations
are defined, we use a specific procedure to refine each of the configurations as far as possible.

Solution Improvement Algorithm


To improve the quality of a configuration, we use a specific procedure based on short-term Tabu Search
for this problem. We use Tabu Search because it can also play the role of a local search but with a wider
search area. Besides the standard components of Tabu Search, there are some components specific to
the workflow problem.
The Neighborhood Set Structure
One of the most important concepts of Tabu Search, as of any local search, is the neighborhood set structure. A configuration can be represented as a vector. The index of the vector represents the sub-job, and
the value of the element represents the RMS. With a configuration a, a = a1a2...an with all ai ∈ Ki, we
generate n*(m-1) configurations a'. We change the value of ai to each and every value in the candidate
list which is different from the present value. Each change results in a new configuration. After that we
have the set A, |A| = n*(m-1). A is the neighborhood set of configuration a.
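A minimal sketch of this neighborhood structure, assuming configurations are stored as plain Python lists, is given below.

def neighborhood(config, candidates):
    """config: list where config[i] is the RMS assigned to sub-job i;
    candidates[i]: list of candidate RMSs for sub-job i."""
    neighbors = []
    for i, current in enumerate(config):
        for rms in candidates[i]:
            if rms != current:
                new_config = list(config)   # change exactly one position
                new_config[i] = rms
                neighbors.append(new_config)
    return neighbors

print(len(neighborhood([0, 1], [[0, 1, 2], [0, 1, 2]])))  # 2 sub-jobs * 2 alternatives = 4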
The Assigning Sequence of the Workflow
When the RMS to execute each sub-job and the bandwidth among sub-jobs have been determined, the next task
is determining a time slot to run each sub-job in the specified RMS. At this point, the assigning sequence of
the workflow becomes important. The sequence in which the runtimes of the workflow's sub-jobs are determined
in an RMS can also affect the final finished time of the workflow, especially in the case of having many
sub-jobs in the same RMS.
In general, to ensure the integrity of the workflow, sub-jobs in the workflow are assigned based on
the sequence of the data processing. However, that principle does not cover the case of a set of sub-jobs
which have the same priority in the data sequence and do not depend on each other. To solve this problem,
we determine the earliest and the latest start time of each sub-job of the workflow under ideal conditions.
The time period for a data transfer between sub-jobs is computed by dividing the amount of data by a fixed
bandwidth. The earliest and latest start and stop times of each sub-job and data transfer depend only
on the workflow topology and the runtime of the sub-jobs, not on the resource context. These parameters
can be determined using conventional graph algorithms. We observe that mapping the sub-job with the smaller
latest start time first makes the lateness smaller. Thus, the latest start time value determined as above
is used to determine the assigning sequence. The sub-job with the smaller latest start time will be assigned earlier. This procedure satisfies Criterion 3.
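Under the stated ideal conditions (fixed bandwidth, no resource contention), the latest start times and the resulting assigning sequence could be computed roughly as follows; the data layout is an assumption made for illustration.

def latest_start_times(runtime, successors, data, deadline, bandwidth):
    """runtime[s]: runtime of sub-job s; successors[s]: list of dependent sub-jobs;
    data[(s, t)]: amount of data transferred from s to t (all hypothetical fields)."""
    lst = {}

    def visit(s):
        if s in lst:
            return lst[s]
        if not successors[s]:                       # sink sub-job
            lst[s] = deadline - runtime[s]
        else:
            lst[s] = min(visit(t) - data[(s, t)] / bandwidth
                         for t in successors[s]) - runtime[s]
        return lst[s]

    for s in runtime:
        visit(s)
    return lst

def assigning_sequence(runtime, successors, data, deadline, bandwidth):
    lst = latest_start_times(runtime, successors, data, deadline, bandwidth)
    return sorted(runtime, key=lambda s: lst[s])    # smaller latest start time first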


Computing the Timetable Procedure


To determine the finished time of a solution we have to determine the timetable for executing the sub-jobs
and their related data transfers. In the error recovery phase, finding a solution that meets or nearly meets
Criterion 1 is very important. Therefore, we do not simply use the provided runtime of each sub-job but
modify it according to the performance of each RMS. Let pki and pkj be the performance of a CPU in RMS
ri and rj respectively, with pkj > pki. Suppose that a sub-job has the provided runtime rti when run on ri. Then the runtime rtj of the sub-job on rj is determined as in Formula 4.
rtj = rti / [ (pki + (pkj - pki) * k) / pki ]                (4)

Parameter k represents the effect of the sub-job's communication characteristics and the RMS's communication infrastructure. For example, if pkj equals 2*pki and rti is 10 hours, rtj will be 5 hours if
k equals 1. However, k=1 only when there is no communication among the parallel tasks of the sub-job.
Otherwise, k will be less than 1.
A practical Grid workflow usually has a fixed input data pattern. For example, the weather-forecasting workflow is executed day by day and finishes within a constant period of time once all data has been
collected (Lovas et al., 2004). This characteristic is the basis for estimating the runtime of Grid workloads
(Spooner et al., 2003). In our chapter, parameter ka is an average value which is determined by the user
through many experiments and is provided as input to the algorithm.
In the real environment, k may fluctuate around this average value depending on the network infrastructure of the system. For example, suppose that ka equals 0.8. If the cluster has good network
communications, the real value of k may increase to 0.9. If the cluster has poorer network communications, the real value of k may decrease to 0.7. Nowadays, with very good network technology in
High Performance Computing Centers, the fluctuation of k is small. To overcome the fluctuation
problem, we use the pessimistic value kp instead of k in Formula 4 to determine the new runtime of
the sub-job as follows.


If ka >0.8, for example with the rare communication sub-job, kp =0.5.


If 0.8> ka >0.5, for example with normal communication sub-job, kp =0.25.
If ka <0.5, for example with heavy communication sub-job, kp =0.

The pessimistic policy will ensure that the sub-job can be finished within the newly determined runtime period.
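Formula 4 together with the pessimistic choice of kp can be expressed compactly; the following sketch reproduces the 10-hour example from the text under these rules, with the function names being illustrative only.

def pessimistic_k(ka: float) -> float:
    """Map the average communication factor ka to the pessimistic kp as described above."""
    if ka > 0.8:      # rare communication between parallel tasks
        return 0.5
    if ka > 0.5:      # normal communication
        return 0.25
    return 0.0        # heavy communication: assume no speed-up from faster CPUs

def adjusted_runtime(rt_i: float, pk_i: float, pk_j: float, ka: float) -> float:
    """Runtime on an RMS with CPU performance pk_j, given the provided runtime rt_i on
    an RMS with performance pk_i (Formula 4 with k replaced by the pessimistic kp)."""
    kp = pessimistic_k(ka)
    return rt_i * pk_i / (pk_i + (pk_j - pk_i) * kp)

# Example from the text: pk_j = 2 * pk_i and rt_i = 10 h give 5 h with k = 1;
# with ka = 0.9 the pessimistic kp = 0.5 yields a more conservative 6.67 h.
print(round(adjusted_runtime(10.0, 1.0, 2.0, 0.9), 2))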
With this assumption, the algorithm to compute the timetable is presented in Algorithm 2. As the w-Tabu
algorithm applies to both light and heavy workflows, determining the parameters for each case
cannot be done the same way. With a light workflow, the end time of a data transfer equals the time slot after the end
of the corresponding source sub-job. With a heavy workflow, the end time of a data transfer is determined by
searching the bandwidth reservation profile. This procedure satisfies Criteria 4 and 5.
Algorithm 2. Determining timetable algorithm for workflow in w-Tabu


Figure 4. H-Map algorithm overview

With each sub-job k following the assigning sequence {
    Determine the set Q of assigned sub-jobs which have an output
        data transfer to sub-job k
    With each sub-job i in Q {
        min_st_tran = end_time of sub-job i + 1
        If heavy-weight workflow {
            Search the reservation profile of the link between the RMS running
            sub-job k and the RMS running sub-job i to determine the start and
            end time of the data transfer task with start time > min_st_tran
        } else {
            end time of data transfer = min_st_tran
        }
    }
    min_st_sj = max end time of all above data transfers + 1
    Search the reservation profile of the RMS running
    sub-job k to determine its start and end time with
    start time > min_st_sj
}

The Modified Tabu Search Procedure


In a normal Tabu search, in each move iteration, we would try assigning each sub-job si ∈ S to each RMS
rj in the candidate set Ki, use the procedure in Algorithm 2 to compute the runtime, and then check
for overall improvement and pick the best one. This method is not efficient as it requires a lot of time for
computing the runtime of the workflow, which is not a simple procedure. We improve the method
by proposing a new neighborhood based on the following two comments.
Comment 1: The runtime of the workflow depends mainly on the execution time of the critical path.
In one iteration, we can move only one sub-job to one RMS. If the sub-job does not belong to the


critical path, after the movement, the old critical path will have a very low probability of being
shortened and the finished time of the workflow has a low probability of improvement. Thus, we
concentrate only on sub-jobs in the critical path. With a defined solution and runtime table, the
critical path of a workflow is determined with Algorithm 3.
Algorithm 3. Determining critical path algorithm

Let C be the set of sub-jobs in the critical path
Put the last sub-job into C
next_subjob = last sub-job
do {
    prev_subjob = the sub-job having the latest finished
                  output data transfer to next_subjob
    Put prev_subjob into C
    next_subjob = prev_subjob
} until prev_subjob = first sub-job
We start with the last sub-job. The next sub-job added to the critical path is the one with the latest
finishing data transfer to the previously determined sub-job. The process continues until the first sub-job is
reached.
Comment 2: In one move iteration, with only one change of one sub-job to one RMS, if the finish time
of the data transfer from this sub-job to the next sub-job in the critical path is not decreased, the
critical path cannot be shortened. For this reason, we only consider changes which shorten the
finish time of the subsequent data transfer. It is easy to see that checking whether we can improve the data
transfer time is much faster than computing the runtime table for the whole workflow.
With these two comments, and with the other remaining procedures similar to the standard Tabu search, we build
the overall improvement procedure as presented in Algorithm 4.
Algorithm 4. Configuration improvement algorithm in w-Tabu

while (num_loop < max_loop) {
    Determine the critical path
    For each sub-job in the critical path {
        For each RMS in the candidate set {
            If the finish time of the subsequent data transfer can be improved {
                Compute the timetable for the new solution
                Store tuple (sub-job, RMS, makespan) in the candidate list
            } } }
    Pick the solution having the smallest makespan
        or one not affected by the tabu rule
    Assign tabu_number for the selected RMS
    If smaller makespan then store the solution
    num_loop++
}

Figure 5. G-Map algorithm overview

Performance of w-Tabu Algorithm


To study the performance of the w-Tabu algorithm, we took the ideas in the recently published
literature related to mapping workflows to Grid resources with the same objective of minimizing the finished
time and adapted them to our problem. Those algorithms include w-DCP, GRASP, min-min, max-min, and
suffer (Quan, 2008). After an extensive experiment with simulation data, the experimental data
shows that all algorithms need only a few seconds to find the solutions and the w-Tabu algorithm outperforms
all other algorithms. In particular, the quality of the solutions found by the w-Tabu algorithm is from 15%
to 20% higher than that of the solutions found by the other algorithms. More detail about the experiment and results
can be seen in (Quan, 2008).

H-Map Algorithm
The goal of the H-Map algorithm is finding a solution which satisfies Criteria 1-5 and is as inexpensive
as possible. The overall H-Map algorithm is presented in Figure 4.
Firstly, a set of initial configurations C0 is created. The configurations in C0 should be distributed
widely over the search space and must satisfy Criterion 1. If C0 = ∅, we can deduce that there are few free
resources on the Grid and the w-Tabu algorithm is invoked. If w-Tabu also cannot find a feasible
solution, the algorithm stops. If C0 ≠ ∅, the set is gradually refined to obtain better quality solutions.
The refining process stops when the solutions in the set cannot be improved any more and we have the final


set C*. The best solution in C* will be output as the result of the algorithm. The following sections
describe each procedure of the algorithm in detail.

Constructing the Set of Initial Configurations


The purpose of this algorithm is to create a set of initial configurations which will be distributed widely
over the search space.
Step 0: For each sub-job si, we sort the RMSs in the candidate set Ki according to the cost of running
si on them. The cost is computed according to Formula 3. The sorted configuration space consists of
many layers. A configuration in an outer layer has a greater cost than one in an inner layer. The cost of
a configuration lying between two layers is greater than the cost of the inner layer and smaller
than the cost of the outer layer.
Step 1: We pick the first configuration as the first layer in the configuration space. The determined
configuration can be represented as a vector. The index of the vector represents the sub-job, and
the value of the element represents the RMS. Although the first configuration has minimal cost
according to Formula 3, we cannot be sure that this is the optimal solution. The real cost of a
configuration must consider the neglected cost of data transmission when two sequential sub-jobs
are in the same RMS.
Step 2: We construct the other configurations as follows. The second configuration is
the second layer of the configuration space. Then we create a configuration whose cost lies between
layer 1 and layer 2 by combining the first and the second configuration. To do this, we take the
first p elements from the first configuration vector, then the next p elements from the second configuration vector, and repeat until we have n elements forming the third configuration. Thus, we get
(n/2) elements from the first configuration vector and (n/2) other elements from the second one
(see the sketch after Step 3).
Combining in this way ensures that the cost of the target configuration, according to Formula 3, differs
noticeably from the costs of the source configurations. The process continues until reaching the final layer. Thus, we have in total 2*m-1 configurations. With this method, we can ensure
that the set of initial configurations is distributed over the search space according to the cost criterion.
Step 3: We check Criteria 4 and 5 for all 2*m-1 configurations. To verify Criteria 4 and 5, we have to
determine the timetable for all sub-jobs of the workflow. The procedure to determine the timetable
of the workflow is similar to the one described in Algorithm 2. If some of the configurations do not satisfy
Criteria 4 and 5, we construct more configurations until we again have 2*m-1 of them. To do the
construction, we vary the value of the parameter p in the range from 1 to (n/2) in Step 2 to create
new configurations.
After this phase we have the set C0 containing at most (2*m-1) valid configurations.
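A compact sketch of the layer construction and the chunk-wise combination of Step 2 is given below; the representation of the sorted candidate lists and the helper names are assumptions made for illustration.

def combine(config_a, config_b, p=1):
    """Interleave chunks of p elements from two configurations to form a new one."""
    n = len(config_a)
    result, take_from_a, i = [], True, 0
    while i < n:
        source = config_a if take_from_a else config_b
        result.extend(source[i:i + p])
        i += p
        take_from_a = not take_from_a
    return result[:n]

def initial_configurations(sorted_candidates, m):
    """sorted_candidates[s]: candidate RMSs of sub-job s sorted by cost (cheapest first).
    Layer l assigns every sub-job its l-th cheapest RMS; combining adjacent layers gives
    at most 2*m - 1 configurations."""
    layers = [[cands[min(l, len(cands) - 1)] for cands in sorted_candidates]
              for l in range(m)]
    configs = [layers[0]]
    for l in range(1, m):
        configs.append(combine(layers[l - 1], layers[l]))  # cost between layer l-1 and l
        configs.append(layers[l])
    return configs

print(initial_configurations([[10, 20], [11, 21], [12, 22]], m=2))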

Improving Solution Quality Algorithm


To improve the quality of the solutions, we use the neighborhood structure described in the w-Tabu
algorithm section. Calling A the neighborhood set of a configuration, the procedure to find the highest
quality solution includes the following steps.


Step 1: For all a' ∈ A, calculate cost(a') and timetable(a'); pick the a* with the smallest cost(a*) that satisfies
Criterion 1 and put a* into the set C1. The detailed technique of this step is described in Algorithm 5.
Algorithm 5. Algorithm to improve the solution quality
For each sub-job in the workflow {
    For each RMS in the candidate list {
        If cheaper then put (sub-job id, RMS id, improve_value) in a list
    } }
Sort the list according to improve_value
From the beginning of the list {
    Compute the timetable to get the finished time
    If finished time < limit
        break
}
Store the result
We consider only the configurations having a smaller cost than the present configuration. Therefore,
instead of computing the cost and the timetable of all configurations in the neighborhood set, we compute
only their cost. All the cheaper configurations are stored in a sorted list. We then compute the
timetable of the cheaper configurations along the list to find the first feasible configuration. This technique
helps to decrease the algorithm's runtime considerably.
Step 2: Repeat Step 1 with all a ∈ C0 to form C1.
Step 3: Repeat Steps 1 to 2 until Ct = Ct-1.
Step 4: Ct becomes C*. Pick the best configuration of C*.

Performance of H-Map Algorithm


To study the performance of the H-Map algorithm, we applied standard metaheuristics such as Tabu
Search, Simulated Annealing, Iterated Local Search, Guided Local Search, Genetic Algorithms, and Estimation of Distribution Algorithms to our problem. The experimental results show that the H-Map algorithm
finds solutions of equal or higher quality within a much shorter runtime than the other algorithms in most
cases. With small-scale problems, some metaheuristics using local search, such as ILS, GLS, and EDA,
find results equal to those of H-Map and better than those of SA or GA. But with large-scale problems, they
have an exponential runtime with unsatisfactory results. The runtime of the H-Map algorithm is just a few
seconds. More detail about the experiment and results can be seen in (Quan, 2008).

G-Map Algorithm
The G-Map algorithm maps a group of sub-jobs onto the Grid resources, where G stands for Group. In the
G-Map algorithm, we try to compress the solution space so that the chance of finding feasible solutions
is higher. After that, a set of initial configurations is constructed. This set is improved by a local
search until it cannot be improved any more. Finally, we pick the best solution from the final set. The


architecture of the algorithm is presented in Figure 5.

Refining the Solution Space


The set of candidate RMSs for each sub-job can be continuously refined based on the following observation:
an RMS is valid for a sub-job only if the sub-job, when assigned to that RMS, satisfies the start times of
the next sequential sub-jobs. The algorithm to refine the solution space is presented in Algorithm 6.
Algorithm 6. Refining the solution space procedure

for each sub-job k in the set {
    for each RMS r in the candidate list of k {
        for each link to k in the assigned sequence {
            min_st_tran = end_time of source sub-job
            search the reservation profile of the link for
                start_tran > min_st_tran
            end_tran = start_tran + num_data/bandwidth
            update the reservation profile
        }
        min_st_sj = max(end_tran)
        search the reservation profile of r for
            start_job > min_st_sj
        end_job = start_job + runtime
        for each link from k in the assigned sequence {
            min_st_tran = end_job
            search the reservation profile of the link for
                start_tran > min_st_tran
            end_tran = start_tran + num_data/bandwidth
            update the reservation profile
            if end_tran >= end_time of destination sub-job
                remove r from the candidate list
        } } }

For each separate sub-job, we determine the scheduled times of the input data transfers, the sub-job itself,
and the output data transfers. From Algorithm 6, we can see that the resource reservation profile is not
updated. We call this the ideal assignment. If the stop time of an output data transfer is not earlier than
the start time of the next sequential sub-job, then we remove the RMS from the candidate set.

Constructing the Set of Initial Configurations


The goal of the algorithm is finding a feasible solution which satisfies all required criteria and is as
inexpensive as possible. Therefore, the set of initial configurations should satisfy two criteria.


The configurations in the set must differ from each other as far as possible. This criterion will


ensure that the set of initial configurations will be distributed widely over the search space.
The RMSs running the sub-jobs within a configuration should differ from each other. This criterion will
ensure that each sub-job is assigned under close to ideal conditions; thus the chance of obtaining a feasible solution is increased.
The procedure to create the set of initial configurations is as follows.

Step 1: Sorting the candidate set according to the cost factor. For each sub-job, we compute the cost
of running the sub-job on each RMS in the candidate set and then sort the RMSs according to this
cost.
Step 2: Forming the first configuration. The procedure to form the first configuration in the set is presented in Algorithm 7. We form the first solution with as small a cost as possible. For each unassigned sub-job, we compute m_delta = the cost of running in the first feasible RMS minus the cost
of running in the second feasible RMS in the sorted candidate list. The sub-job having the smallest
m_delta will be assigned to its first feasible RMS. The purpose of this action is to ensure that the
sub-job with the higher potential of increasing the cost is assigned first. After that, we
update the reservation profile and check whether the assigned RMS is still available for the other sub-jobs.
If not, we mark it as unavailable. This process is repeated until all sub-jobs are assigned. This
selection of which sub-job to assign first matters most when many sub-jobs have the
same RMS as their first feasible choice.
Algorithm 7. The algorithm to form the first configuration
While the set of unassigned sub-jobs is not empty {
    For each sub-job s in the set of unassigned sub-jobs {
        m_delta = cost in first feasible RMS - cost in second feasible RMS
        put (s, RMS, m_delta) in a list
    }
    Sort the list to get the minimum m_delta
    Assign s to the RMS
    Drop s from the set of unassigned sub-jobs
    Update the reservation profile of the RMS
    Check if the RMS is still feasible for the other unassigned sub-jobs
        if not, mark the RMS as infeasible
}

Step 3: Forming the other configurations. The procedure to form the other initial configurations is
described in Algorithm 8. To satisfy the two criteria described above, we use assign_number
to keep track of the number of times an RMS has been assigned to a sub-job and l_ass to keep track of the appearance frequency of each RMS within a configuration. The RMS with the smaller assign_number
and the smaller appearance frequency in l_ass is selected.


Algorithm 8. Procedure to create the initial configuration set

assign_number of each candidate RMS = 0
While number of configurations < max_sol {
    clear the list of assigned RMSs l_ass
    for each sub-job in the set {
        find in the candidate list the RMS r having the
            smallest number of appearances in l_ass
            and the smallest assign_number
        Put r into l_ass
        assign_number++
    } }

Determining the Assigning Order


When the RMS to execute each sub-job and the bandwidths among sub-jobs have been determined, the
next task is determining the time slot to run each sub-job in the specific RMS. At this point, the order of
determining the scheduled times for sub-jobs becomes important. The sequence of determining the runtimes
of sub-jobs in an RMS can also affect Criterion 1, especially in the case of having many sub-jobs in the
same RMS. In this algorithm, we use the following policy. The input data transfer having the smaller
earliest start time will be scheduled earlier. The output data transfer having the smaller latest stop time
will be scheduled earlier. The sub-job having the earlier deadline will be scheduled earlier.

Checking the Feasibility of a Solution


To check for the feasibility of a solution, we have to determine the timetable with a procedure as presented in Algorithm 9.
Algorithm 9. Procedure to determine the timetable

for each sub-job k in the set {
    for each link to k in the assigned sequence {
        min_st_tran = end_time of source sub-job
        search the reservation profile of the link for
            start_tran > min_st_tran
        end_tran = start_tran + num_data/bandwidth
        update the link reservation profile
    }
    min_st_sj = max(end_tran)
    search the reservation profile of the RMS running k for
        start_job > min_st_sj
    end_job = start_job + runtime
    update the resource reservation profile
    for each link from k in the assigned sequence {
        min_st_tran = end_job
        search the reservation profile of the link for
            start_tran > min_st_tran
        end_tran = start_tran + num_data/bandwidth
        update the link reservation profile
    } }
After determining the timetable, the stop time of the output data transfer will be compared with the
start time of the next sequential sub-jobs. If there is a violation, the solution is deemed infeasible.

Improving Solution Quality Algorithm


To improve the quality of the solution, we use a procedure similar to the one used in the H-Map algorithm. If the initial configuration set C0 ≠ ∅, the set is gradually refined to obtain better quality
solutions. The refining process stops when the solutions in the set cannot be improved any more and we
have the final set C*. The best solution in C* will be output as the result of the algorithm.

Performance of the G-Map Algorithm


To study the performance of the G-Map algorithm, we applied the Deadline Budget Constraint (DBC), H-Map, and Search All Cases (SAC) algorithms to this problem. The experimental results show that only the SAC
algorithm has exponential runtime when the size of the problem is large. The other algorithms have very small
runtimes, just a few seconds. The H-Map algorithm has a limited chance of finding a feasible solution. The
reason is that H-Map is designed for mapping the whole workflow and the step of refining the solution
space is not performed. Therefore, there are a lot of infeasible solutions in its initial configuration set.
The G-Map and DBC algorithms have the same ability to find a feasible solution. Thus, we only compare
the quality of the solutions between the G-Map and DBC algorithms. On average, G-Map finds solutions
5% better than the DBC algorithm. More detail about the experiment and results can be seen in (Quan and
Altmann, 2007b).

PERFORMANCE EXPERIMENT
The experiment is done with simulation to study the performance of the error recovery mechanisms.
We use simulation data because we want to cover a wide range of workload characteristics, which is
impossible with a real workload. The hardware and software used in the experiments are rather standard
and simple (Pentium D 2.8 GHz, 1 GB RAM, Fedora Core 5, MySQL).

Large-Scale Error Recovery Experiment


The goal of this experiment is to measure the total reaction time of the error recovery mechanism, in
absolute terms, when an error happens. Determining the total reaction time is important because it helps


define the earliest start time of the re-mapped workflow, which is a necessary parameter for the mapping
algorithm. To do the experiment, we use 20 RMSs with different resource configurations and then we
fill all the RMSs with randomly selected workflows having a start time slot equal to 20. We generated 20
different workflows which:



Have different topologies.


Have a different number of sub-jobs from 7 to 32.
Have different sub-job specifications. Without loss of generality, we assume that each sub-job has
the same CPU performance requirement.
Have different amounts of data transfer.

The number of failing RMSs increases from 1 to 3. The failed RMSs are selected randomly. For each
number of failed RMSs, the fail slot is increased along the reservation axis. The reason for this is
that the error can happen at any random time slot along the reservation axis. Thus, the broader the range
of experiment times, the more accurately the reaction time value is determined. At each time, we used the
described recovery mechanism to re-map all affected workflows as well as all affected sub-jobs and
measured the runtime. The runtime is computed in seconds.
When 1 RMS fails, the experiment data shows that the total reaction time of the mechanism increases
with the total number of affected sub-jobs. When the number of failed RMSs increases, the total number of affected sub-jobs increases but the number of healthy RMSs decreases.
For that reason, the total reaction time of the mechanism when the number of failed RMSs increases
does not differ much from the case of having 1 failed RMS. Furthermore, having more
than 2 failed RMSs simultaneously is very rare. For those reasons, the simulation
data can be considered dependable. With a total reaction time of less than 2 minutes, compared to workflows
running for hours, the performance of the algorithm is well acceptable in real situations. In the mapping
algorithm, time is computed in slots, which can have a resolution from 3 to 5 minutes. The reaction time
of the mechanism occupies 1 time slot, and the time for the system to do the negotiation takes about 1
time slot. Thus the start time slot of the re-mapped workflow can be assigned the value of the present
time slot plus 2.
From the experiment data, we also see that the module recovering groups of independently affected sub-jobs is rarely invoked. One main reason for this result is that consecutive sub-jobs of a workflow
are mapped to the same RMS to save data transfer cost. Thus, when the RMS fails, a series of
dependent sub-jobs of the workflow is affected.

Small-Scale Error Recovery Performance


The goal of this experiment is to study the effectiveness of the multi-phase error recovery and the effect
of the late period on the recovery process. To do the experiment, we generated 8 different workflows
which:


Have different topologies.


Have a different maximum number of potentially directly affected sub-jobs. The number of such sub-jobs is in the range from 1 to 10. The number of potentially directly affected sub-jobs stops at
10 because, as far as we know, with the workload model as described in Part 1, this number in


real workflows is just between 1 and 7.


Have different sub-job specifications. Without loss of generality, we assume that each sub-job has
the same CPU performance requirement.
Have different amounts of data transfer.

As differences in the static factors of an RMS, such as OS, CPU speed and so on, can be easily
filtered out by an SQL query, we use 20 RMSs with resource configurations equal to or even better than
the requirements of the sub-jobs. Those RMSs already have some initial workload in their resource
reservation profiles and bandwidth reservation profiles. The 8 workflows are mapped to the 20 RMSs.
We select the late sub-job in each workflow in such a way that the number of directly affected sub-jobs
equals the maximum number of potentially directly affected sub-jobs of that workflow. The late period is 1 time slot. For each group of affected sub-jobs, we change the power configuration of the
RMSs and the k value of the affected sub-jobs. The RMS configuration spreads over a wide range, from having
many RMSs with more powerful CPUs to having many RMSs with CPUs equal to the requirement. The
workload configuration changes widely, from having many sub-jobs with a big k to having many sub-jobs
with a small k. We have chosen this experiment schema because we want to study the behavior of the
algorithm in all possible cases.

The Effectiveness of the Error Recovery Mechanism


In this section, we study the effectiveness of the mechanisms appearing in the three phases of the error recovery strategy for the small-scale error. The performance of an error recovery mechanism is defined as the
cost that the broker has to pay for the negative effect of the error, as described in the error recovery section.
If the cost is smaller, the performance of the mechanism is better, and vice versa.
To do the experiment, we set the lateness period to 1. Each reserved-resource cancellation costs 10%
of the resource hiring value. For each affected sub-job group, for each resource power configuration
scenario, and for each workload configuration scenario, we execute both the recovery strategy including three
phases and the recovery strategy including only phase 3. We record the cost and the phase in which the
error recovery strategy including three phases is successful. For each phase, we compute the average
relative cost of the successful solutions found by both strategies. The experiment showed that if phase 1 or
phase 2 is successful, the performance of the two strategies differs. If the error recovery mechanism
for a sub-job that is 1 time slot late succeeds in phase 1 or 2, the broker pays less money than when using the mechanism of
phase 3. The probability of recovering successfully in phase 1 or 2 is large when the delay is small.

The Effect of the Late Period on the Recovery Process


To evaluate the effect of the late period on the recovery process, we change the lateness period from 1
time slot to 5 time slots. For each affected sub-job group, for each resource power configuration scenario,
for each workload configuration scenario, and for each late period, we perform the whole recovery process
with the G-Map, the H-Map, and the w-Tabu algorithms. If the G-Map algorithm in phase 1 is not successful, the H-Map algorithm in phase 2 is invoked. If H-Map is not successful, the w-Tabu algorithm
in phase 3 is invoked. Thus, for each late period, we have a total of 8*12*12=1152 recovery instances.
For each late period, we record the number of feasible solutions for each algorithm and also for each
phase of the recovery process. From the experimental data, the error is effectively recovered when the

467

Error Recovery for SLA-Based Workflows

late period is between 1 and 3 time slots. If the late period is less than or equal to 3 time slot, the ability
to successfully recover with a low cost by the first phase is very high, 830 times out of 1152. When the
late period is greater than 3, the chance of the failing of phase 1 increases sharply and we have to invoke
the second phase or third phase, whichever has the higher cost.
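To make the three-phase fallback concrete, the following sketch shows how such a recovery driver could be organized. It is our own illustration, not code from the system described here; the names g_map, h_map, and w_tabu are hypothetical stand-ins for the G-Map, H-Map, and w-Tabu re-mapping algorithms, each assumed to return a solution or None when it fails.

```python
# Minimal sketch of the three-phase recovery driver, under the assumptions above.

def recover(instance, g_map, h_map, w_tabu):
    """Try the cheap phases first and fall back to the costlier ones."""
    for phase, algorithm in enumerate((g_map, h_map, w_tabu), start=1):
        solution = algorithm(instance)
        if solution is not None:
            return phase, solution       # record which phase succeeded
    return None, None                    # no feasible re-mapping found

# Enumerating the 8 * 12 * 12 = 1152 recovery instances for one late period:
# results = [recover((wf, rms_cfg, wl_cfg), g_map, h_map, w_tabu)
#            for wf in affected_groups        # 8 affected sub-job groups
#            for rms_cfg in rms_configs       # 12 power configurations
#            for wl_cfg in workload_configs]  # 12 workload configurations
```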

FUTURE RESEARCH DIRECTION


The reaction time of the error recovery depends mainly on the re-mapping time and the negotiation time. From the experimental results, we can see that the reaction time of the error recovery procedure is about 2 time slots. We want to reduce this value further to lessen the negative effect of the error. One potential way to realize this idea is to reduce the re-mapping time. In particular, we will focus on improving the speed of the re-mapping algorithms while not degrading the mapping quality.

CONCLUSION
This chapter has presented an error recovery framework for a system handling SLA-based workflows in the Grid environment. The framework deals with both small-scale errors and large-scale errors. When a large-scale error happens, many workflows can be affected simultaneously. After checking whether the directly affected sub-jobs of each affected workflow can be recovered, the system focuses on re-mapping those workflows in a way that minimizes the lateness. When a small-scale error happens, only one workflow is affected and the system tries several recovery steps. In the first step, we try to re-map the directly affected sub-jobs in such a way that the start times of the other remaining sub-jobs in the workflow are not affected. If the first step is not successful, we try to re-map the remaining workflow in a way that meets the deadline of the workflow as inexpensively as possible. If the second step is not successful, we try to re-map the remaining workflow in a way that minimizes the lateness of the workflow. The experiments study many aspects of the error recovery mechanism, and the results show the effectiveness of applying separate error recovery mechanisms. The total reaction time of the system is 2 time slots in the bad case when a large-scale error happens. In the case of a small-scale error, the error is effectively recovered when the late period is between 1 and 3 time slots. Thus, the error recovery framework can be employed as an important part of a system supporting Service Level Agreements for Grid-based workflows.


KEY TERMS AND DEFINITIONS


Business Grid: The business Grid is a Grid of resource providers that sell their computing resources.
Error Recovery: Error recovery is a process that acts against an error in order to reduce its negative effect.
Grid-Based Workflow: A Grid-based workflow usually includes many dependent sub-jobs. Sub-jobs in a Grid-based workflow are usually computationally intensive and require powerful computing facilities to run on.
Grid Computing: Grid computing (or the use of a computational grid) is the combining of the computing resources of many organizations to work on a problem at the same time.
Service Level Agreement: SLAs are defined as an explicit statement of expectations and obligations in a business relationship between service providers and customers.
Workflow Mapping: Workflow mapping is the process that determines where and, optionally, when each sub-job of the workflow will run.
Workflow Broker: The workflow broker coordinates the work of many service providers to execute a workflow successfully.

ENDNOTE
1. In this chapter, RMS is used to represent the cluster/supercomputer as well as the Grid service provided by the HPCC.

Chapter 21

A Fuzzy Real Option Model to Price Grid Compute Resources
David Allenotor
University of Manitoba, Canada
Ruppa K. Thulasiram
University of Manitoba, Canada
Kenneth Chiu
University at Binghamton, State University of New York, USA
Sameer Tilak
University of California, San Diego, USA

ABSTRACT
A computational grid is a geographically dispersed heterogeneous computing facility owned by dissimilar organizations with diverse usage policies. As a result, guaranteeing the availability of grid resources, as well as pricing them, raises a number of challenging issues ranging from security to management of the grid resources. In this chapter we design and develop a grid resource pricing model using a fuzzy real option approach and show that finance models can be effectively used to price grid resources.

INTRODUCTION
Ian Foster and Carl Kesselman (I. Foster & Kesselman, 1999) describe the grid as an infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities and that enables the sharing, exchange, selection, and aggregation of geographically distributed resources. A computational grid is analogous to an electrical power grid. In the electric power grid, electrical energy is generated from various sources such as coal, solar, hydro, or nuclear. The user of electrical energy has no knowledge about the source of the energy and is only concerned about the availability and ubiquity of the energy. Likewise, the computational grid is characterized by heterogeneous resources (grid resources) that are owned by multiple organizations and individuals. The distributed grid resources include, but are not limited to, CPU cycles, memory, network bandwidth, throughput, computing power, disks, processors, software, various measurement and instrumentation tools, catalogue data and databases, special devices and instruments, and people/collaborators. We describe the grid compute resources as grid compute commodities (gccs) that need to be priced. This chapter
the grid compute resources as grid compute commodities (gccs) that need to be priced. This chapter
focuses on the design and development of a grid resource pricing model with an objective to provide
optimal gain (profitability wise) for the grid operators and a satisfaction guarantee measured as Quality
of Service1 (QoS) requirements for grid resource users and resources owners through a regulated Service
Level Agreements2 (SLAs)-based resource pricing. We design our pricing model using a discrete time
numerical approach to model grid resources spot price. We then model resources pricing problem as a
real option pricing problem. We monitor and maintain the grid service quality by addressing uncertainty
constraints using fuzzy logic.
In recent times, research efforts in computational grid has focused on developing standard for grid
middleware in order to provide solutions to grid security issues and infrastructure-based issues (I. T.
Foster, Kesselman, Tsudik & Tuecke, 1998), and grid market economy, (Schiffmann, Sulistio, & Buyya,
2007). Since grid resources have been available for free there has been only little effort made to price
them. However, a trend is developing due to large interest in grid for public computing and because
several business operatives do not want to invest in computing infrastructures due to the dynamic nature
of information technology, there is expected to be huge demand for grid computing infrastructures and
resources. In the future, therefore, a sudden explosion of grid usage is expected. In anticipation to cope
with the sudden increase in grid and grid resources usage, Amazon has introduced a Simple Storage
Service (S3) (Palankar, Onibokun, Iamnitchi, & Ripeanu, 2007) for grid consumers. S3 offers a pay-asyou-go online storage, and as such, it provides an alternative to in-house mass storage. A major drawback
of the S3 is data access performance. Although the S3 project is successful, its current architecture lack
requirements for supporting scientific collaborations due to its reliance on a set of assumptions based
on built-in trusts.

BACKGROUND
A financial option is defined (see, for example (Hull, 2006)) as the right to buy or to sell an underlying
asset that is traded in an exchange for an agreed-upon sum. The right to buy or sell expires if it is not exercised on or before a specific date, in which case the option buyer forfeits the premium paid at the beginning of the contract. The exercise price (strike price) specified in an option contract is the stated price at which the asset can be bought or sold at a future date. A call option grants the holder the right to purchase the underlying asset at the specified strike price. On the other hand, a put option grants the holder the right to sell the underlying asset at the specified strike price. An American option can be exercised at any time during the life of the option contract; a European option can only be exercised at expiry. Options are derivative securities because their value is derived from the price of the underlying asset upon which the option is written. They are also risky securities because the price of the underlying asset at any future time cannot be predicted with certainty. This means the option holder has no assurance that the option will be in-the-money (i.e., yield a non-negative reward) before expiry.
A real option provides a choice from a set of alternatives. In the context of this study, these alternatives
include the flexibilities of exercising, deferring, finding other alternatives, waiting or abandoning an
option. We capture these alternatives using fuzzy logic (Bojadziew & Bojadziew, 1997) and express the
choices as a fuzzy number. A fuzzy number is expressed as a membership function that lies between 0 and 1; i.e., a membership function maps all elements in the universal set X to the interval [0, 1]. We map all possible flexibilities using membership functions.
The majority of current research efforts in grid computing ((Buyya, Abramson, & Venugopal, 2005) and references therein) focus on the grid market economy. The current literature on real option approaches to valuing projects presents the real option framework in eight categories (Gray, Arabshahi, Lamassoure, Okino, & Andringa, 2004): option to defer, time-to-build option, option to alter, option to expand, option to abandon, option to switch, growth options, and multiple options. Efforts have also been directed towards improving the selection and decision methods used in predicting the capital that an investment may consume. Carlsson and Fullér (Carlsson & Fullér, 2003) apply a hybrid approach to valuing real options. Their method incorporates real options, fuzzy logic, and probability to account for the uncertainty involved in the valuation of future cash flow estimates. The results of the research given in (Gray et al., 2004) and (Carlsson & Fullér, 2003) make no formal reference to the QoS that characterizes a decision system. Carlsson and Fullér apply fuzzy methods to measure the level of decision uncertainty but do not price grid resources.
We propose a finance concept for pricing grid resources. In our model, we design and develop a pricing function similar in concept to Mutz et al. (Mutz, Wolski, & Brevik, 2007), where they model resource allocation in a batch-queue of jobs ji, for i = 1, 2, ..., n, waiting to be granted resources. Job ji receives service before ji+1. The resources granted are based on the owner's parameters. Their basis for modeling the payment function depends on the users' behavior, which imposes some undesirable externality constraints (resource usage policies across multiple organizations) on the jobs in the queue. With specific reference to the job value vi (currency based), and with the delay in total turnaround time d expressed as a tolerance factor, Mutz et al. obtain a job priority model using the efficient mechanism design of (Krishna & Perry, 2007). They also propose a compensation function based on the propensity with which a job scheduled for time tn-1 wishes to be done earlier. The compensation, which is determined by d, is paid by the job owner whose job is to be done earlier and is disbursed in the form of incentives (say, more gccs) to the jobs (or owners of jobs) scheduled before it. Our pricing model incorporates a price variant factor (pvf) penalty function. The pvf is a fuzzy number and, based on its fuzziness (or uncertainty in availability or changes in technology), the pvf trend influences the price of a grid resource.
In this chapter we draw our inferences by comparing simulated results to the results obtained from a research grid (SHARCNET (SHARCNET, 2008)). This choice allows us to capture a real-life situation in this grid type. We evaluate our proposed grid resource pricing model and provide a justification by comparing real grid behavior to simulation results obtained using some base spot prices for the gccs. In particular, we emphasize the provision of service guarantees measured as Quality of Service (QoS) and profitability from the perspectives of the users and grid operators respectively. We aim to maintain a balance between the service users require from the grid, profitability for resource utilization, and satisfaction in using grid resources.

RESEARCH METHODOLOGY
Black and Scholes (Black & Scholes, 1973) developed one of the most important models for pricing financial options, which was enhanced by Merton (Merton, 1973). Cox, Ross, and Rubinstein (Cox, Ross, & Rubinstein, 1979) developed a discretized version of this model. The Black-Scholes and other models form the fundamental concepts of real options. In an increasingly uncertain and dynamic global market place (such as the grid market), managerial flexibility has become a major concern. A real options framework captures the set of assumptions, concepts, and methodologies for assessing decision flexibility in a known future. Flexibilities, which are characterized by uncertainties in investment decisions, are critical because not all of them have value in the future. This challenge in the real options concept has propelled several research efforts in recent times. Real option theory becomes most useful when the business in question can be expressed as a process that includes: (1) an option; (2) an irreversible investment; and (3) a measure of uncertainty about the value of the investment and the possibility of losses. The uncertainty referred to here is the observed price volatility of the underlying asset, σ. The value of this volatility is in direct proportion to the time value of the option. That is, if the volatility is small, the time value of the option becomes negligible and hence the real option approach does not add value to the valuation. Several schemes exist in the literature to price financial options: (1) application of the Black-Scholes model (Black & Scholes, 1973), which requires the solution of a partial differential equation that captures the price movements continuously; (2) application of a discrete time and state binomial model of the underlying asset price that captures the price movement discretely (Cox, Ross, & Rubinstein, 1979). In our simulation, we use the trinomial model (see, for example, Hull, 2006) to solve the real option pricing problem. This is a discrete-time approach that calculates discounted expectations in a trinomial-tree structure. A good description of the binomial lattice model can be found in (Thulasiram, Litov, Nojumi, Downing, & Gao, 2001). We start with grid utilization trace gathering and analysis to determine the extent and effect that a particular grid resource's usage has on the overall behavior of the grid.

Model Assumptions and Formulation


We formulate the grid resource pricing model based on the following set of assumptions.
First, we assume that it is more cost effective to use the resources from a grid than other resources elsewhere. We also assume some base prices for the gccs such that they are close to the current real sale prices but discounted. For instance, given the current cost of 1 GB of Random Access Memory (RAM), we can set a correspondingly discounted weekly price for a block of memory. The option holder has the sole right to exercise the option at any time before the expiration (American-style option).
Secondly, since the resources exist in non-storable (non-stable) states, we can value them as real assets. This assumption qualifies them to fit into the general investment valuation model of the real option valuation approach. This assumption also justifies resource availability. Since the gccs are non-stable, availability could be affected by a high volatility (σ). This implies that the grid resource utilization times are in effect short relative to the life of an option in financial valuation methods. Hence, a holder of the option to use the grid resources has an obligation-free chance of exercising the right. The obligation-free status enables us to apply existing finance option valuation theory to model our pricing scheme. As an example, consider an asset whose price is initially S0 and an option on the asset whose current price is f. Suppose the option lasts for a time T and that during the life of the option the asset price can either move up from S0 to a new level S0u with a payoff value of fu or move down from S0 to a new level S0d with a payoff value of fd, where u > 1 and d < 1. This leads to a one-step binomial. We define a grid job as a service request that utilizes one or more of the gccs between its start and finish.
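As an aside, the one-step binomial just described can be valued with the standard risk-neutral formula (see Hull, 2006). The sketch below is our own illustration, not code from the chapter; the up/down factors and the numeric values in the example call are assumed for demonstration only.

```python
import math

def one_step_binomial(S0, T, r, u, d, payoff):
    """Risk-neutral value of an option in a one-step binomial model.

    S0: initial asset price, T: time to expiry, r: riskless rate,
    u > 1: up factor, d < 1: down factor,
    payoff: function of the terminal asset price (call or put payoff).
    """
    p = (math.exp(r * T) - d) / (u - d)   # risk-neutral up probability
    fu = payoff(S0 * u)                   # payoff after the up move
    fd = payoff(S0 * d)                   # payoff after the down move
    return math.exp(-r * T) * (p * fu + (1 - p) * fd)

# Example (assumed numbers): a call with strike 0.70 on an asset worth 0.80.
value = one_step_binomial(0.80, 0.5, 0.06, 1.1, 0.9,
                          payoff=lambda s: max(s - 0.70, 0.0))
```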


Price Variant Factor


Our model objective is to keep the grid busy (i.e., without idle compute cycles). To achieve this objective, we set up a control function defined as the price variant factor (pf). The pf is a fuzzy number and a multiplier based on the fuzziness (or uncertainty) of changes in technology; it is a real number with 0 ≤ pf ≤ 1. Its value depends on changes in technological trends. These changes (new and faster algorithms, faster and cheaper processors, or changes in usage rights and policies) are not determinable prior to exercising any of the options to hold the use of a grid resource and cannot be predicted exactly. Therefore, we treat pf as a fuzzy number and apply fuzzy techniques to capture the uncertainties in pf.
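As a small illustration of how the pf could act on prices (our own sketch, not the authors' code), the (pf)^-1 adjustment described later in this chapter can be applied to a base price; the function name and the numeric values are assumed.

```python
def adjusted_price(base_price, pf):
    """Apply the (pf)^-1 price adjustment used for the grid operator.

    pf is the price variant factor, a multiplier in (0, 1]; the chapter later
    uses pf = 0.1 for unchanged technology and pf = 1.0 for new technology.
    """
    if not 0.0 < pf <= 1.0:
        raise ValueError("pf is expected to lie in (0, 1]")
    return base_price / pf

# Example with assumed values: a base price of $0.70 and pf = 0.1.
print(adjusted_price(0.70, 0.1))   # -> 7.0
```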

Real Option Discretization of Trinomial Process


The trinomial-tree model was introduced in (Boyle, 1986) to price primarily American-style and European-style options on a single underlying asset. Option pricing under the Black-Scholes model (Black & Scholes, 1973) requires the solution of a partial differential equation satisfied by the option price. Option prices can also be obtained by building a discrete time and state binomial model of the asset price and then applying discounted expectations. A generalization of such a binomial valuation model (Hull, 2006) to a trinomial model, and option valuation on the trinomial model, are useful since solving the partial differential equation of the option price by the explicit finite difference method is equivalent to performing discounted expectations in a trinomial-tree (Hull, 2006). Because the asset price in a trinomial-tree moves in three directions, compared with only two for a binomial tree, the time horizon (number of steps) can be reduced in a trinomial-tree to attain the same accuracy obtained by a binomial-tree.
Consider an asset whose current price is S, and let r be the riskless, continuously compounded interest rate. The stochastic differential equation for the risk-neutral geometric Brownian motion (GBM) model of an asset price paying a continuous dividend yield of δ per annum (Hull, 2006) is given by the expression:

dS = (r − δ) S dt + σ S dz    (1)

For convenience, in terms of x = ln S, we take the derivative of x, i.e.,

dx = ν dt + σ dz    (2)

where ν = r − δ − σ²/2. Consider a trinomial model of the asset price in a small time interval Δt; we let the asset price change by Δx, remain the same, or change by −Δx, with probability of an up movement pu, probability of a steady move (without a change) pm, and probability of a downward movement pd. Figure 1 shows a one-step trinomial lattice expressed in terms of Δx and Δt.

Figure 1. One-step trinomial lattice

The drift (ν, due to known factors) and volatility (σ, due to unknown factors) parameters of the asset price can be captured in the simplified discrete process using Δx, pu, pm, and pd. The space step can be computed (with a choice) using Δx = σ√(3Δt). A relationship between the parameters of the continuous time process and the trinomial process (a discretization of the geometric Brownian motion (GBM)) is obtained by equating
the mean and variance over the time interval Δt and imposing the unitary sum of probabilities, i.e.,

E[Δx] = pu(Δx) + pm(0) + pd(−Δx) = νΔt    (3)

where E[Δx] is the expectation of Δx. In addition to Equation (3), matching the second moment gives

E[Δx²] = pu(Δx²) + pm(0) + pd(Δx²) = σ²Δt + ν²Δt²    (4)

while the unitary sum of probabilities can be presented as

pu + pm + pd = 1    (5)

pu, pm, and pd are the probabilities of the price going up, remaining the same, and going down, respectively. Solving Equations (3), (4), and (5) yields the transitional probabilities:

pu = 0.5 ((σ²Δt + ν²Δt²)/Δx² + νΔt/Δx)    (6)

pm = 1 − (σ²Δt + ν²Δt²)/Δx²    (7)

pd = 0.5 ((σ²Δt + ν²Δt²)/Δx² − νΔt/Δx)    (8)


Figure 2. SHARCNET: CPU time vs. number of jobs

The trinomial process of Figure 1 could be repeated a number of times to form an n-step trinomial tree. Figure 2 shows a four-step trinomial. For a number of time steps (horizontal levels) n = 4, the number of leaves (height) of such a tree is given by 2n + 1. We index a node by a pair (i, j), where i points at the level (row index) and j indicates the distance from the top (column index). Time t is referenced from the level index i by t = iΔt. From Figure 2(b), node (i, j) is thus connected to node (i + 1, j) (upward move), to node (i + 1, j + 1) (steady move), and to node (i + 1, j + 2) (downward move). The option price and the asset price at node (i, j) are given by C[i, j] = Ci,j and S[i, j] = Si,j respectively. The asset price can be computed from the number of up and down moves required to reach (i, j) from the root node and is given by

S[i, j] = S[0, 0] (u^i d^j)    (9)

The options at maturity (i.e., when T = nΔt for European-style options; T ≤ nΔt for American-style options) are determined by the payoff. So, for a call option (the intent to buy an asset at a previously determined strike price), the payoff is Cn,j = max(0, Sn,j − K), and for a put option (the intent to sell) it is given by Cn,j = max(0, K − Sn,j). The value K represents the strike price at maturity T = nΔt for a European-style option, and the strike price at any time before or on maturity for an American-style option. To compute option prices, we apply discounted expectations under the risk-neutral assumption. For an American put option (for example), for i < n:

Ci,j = max(e^(−rΔt) (pu Ci+1,j + pm Ci+1,j+1 + pd Ci+1,j+2), K − Si,j)    (10)

For a European call option (exercised at maturity only), for i < n,

Ci,j = e^(−rΔt) (pu Ci+1,j + pm Ci+1,j+1 + pd Ci+1,j+2)    (11)

The option price is obtained at C0,0 by applying the payoff expression for Cn,j along with Equations (9), and (10) or (11), at every time step and node of the trinomial-tree. We now model grid resources based on the transient availability3 of the grid compute cycles, the availability of compute cycles, and the volatility of the prices associated with the compute cycles. Given a maturity date t and the risk-neutral dynamics dgcci / gcci = μgcc dt + σgcc dzi, the future price F(t) of a contract on grid resources can be expressed as (see, for example, (Hull, 2006)):

F(t) = E[S(t)] = S(0) e^(∫(μ − δ) dτ)    (12)

Consider a trinomial model (see, e.g., (Hull, 2006), (Cox et al., 1979)) of the asset price in a small time interval Δt: the asset price increases by Δx, remains the same, or decreases by Δx, with probability of an up movement pu, probability of a steady move (staying at the middle) pm, and probability of a downward movement pd. Figure 1 shows a one-step trinomial tree and Figure 2(b) shows a multi-step trinomial tree.
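A compact way to see Equations (1)-(11) working together is the following sketch of a log-space trinomial pricer for an American put on a single gcc. It is our own illustrative implementation of the standard scheme (Hull, 2006; Boyle, 1986), not code from the chapter; the parameter values in the example call are only placeholders echoing sample values used later in this chapter.

```python
import math

def trinomial_american_put(S0, K, T, r, sigma, delta=0.0, N=24):
    """Price an American put via the log-space trinomial lattice.

    S0: spot price, K: strike, T: maturity, r: riskless rate,
    sigma: volatility, delta: continuous dividend yield, N: time steps.
    """
    dt = T / N
    nu = r - delta - 0.5 * sigma ** 2          # drift of x = ln S (Eq. 2)
    dx = sigma * math.sqrt(3.0 * dt)           # space step choice
    var = sigma ** 2 * dt + nu ** 2 * dt ** 2
    pu = 0.5 * (var / dx ** 2 + nu * dt / dx)  # Equation (6)
    pm = 1.0 - var / dx ** 2                   # Equation (7)
    pd = 0.5 * (var / dx ** 2 - nu * dt / dx)  # Equation (8)
    disc = math.exp(-r * dt)

    # Terminal asset prices S[N, j] = S0 * u^i * d^j with u = e^dx, d = e^-dx.
    S = [S0 * math.exp((N - j) * dx) for j in range(2 * N + 1)]
    C = [max(K - s, 0.0) for s in S]           # put payoffs at maturity

    # Backward induction with the early-exercise test of Equation (10).
    for i in range(N - 1, -1, -1):
        for j in range(2 * i + 1):
            cont = disc * (pu * C[j] + pm * C[j + 1] + pd * C[j + 2])
            S_ij = S0 * math.exp((i - j) * dx)
            C[j] = max(cont, K - S_ij)
    return C[0]

# Example call with assumed sample values (K=$0.70, S=$0.80, T=0.5, r=0.06, sigma=0.2).
price = trinomial_american_put(0.80, 0.70, 0.5, 0.06, 0.2, N=24)
```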

GRID COMPUTE RESOURCES PRICING


Consider some grids Gi = g1, g2, ..., gn and the compute commodities that exist in those grids, CCi = cc1, cc2, ..., ccm. Suppose we have set base prices (some assumed base values) Pi = p1, p2, ..., pn; then we can set up a Grid Resources Utilization and Pricing (GRUP) matrix whose entries pi ccj^(gi), for i = 1, 2, ..., n and j = 1, 2, ..., m, record the priced utilization of commodity ccj in grid gi. For the grid resources utilization of several grids and several resources, we have:

d ln Si = [gcci(t) − pf ln Si] dt + σi dzi,  i = 1, 2, ..., n    (13)

where each Si occurring in Equation (13) is modeled by a trinomial tree that gives the price of a grid compute commodity. At each level l = 0, 1, ..., n − 1, a solution for the best exercise is required. Therefore, each occurrence of j = 1, 2, ..., (2l + 1) requires large computational resources of the grid because of its large size. In other words, the problem of finding the prices of grid resources is itself large and would require a large amount of grid computing power.
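As a rough illustration of Equation (13) (our own sketch, not the authors' code), the per-commodity log-price dynamics can be stepped forward in discrete time; the drift values, pf, and volatilities below are assumed sample inputs.

```python
import math
import random

def simulate_gcc_prices(S0, gcc_drift, pf, sigma, T=0.5, steps=24, seed=1):
    """Discrete-time simulation of d ln S_i = [gcc_i(t) - pf ln S_i] dt + sigma_i dz_i.

    S0, gcc_drift and sigma are lists with one entry per grid compute commodity.
    Returns one simulated price path per commodity.
    """
    rng = random.Random(seed)
    dt = T / steps
    paths = []
    for s0, g, sig in zip(S0, gcc_drift, sigma):
        x = math.log(s0)                        # work in log-price space
        path = [s0]
        for _ in range(steps):
            dz = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment
            x += (g - pf * x) * dt + sig * dz   # Equation (13) step
            path.append(math.exp(x))
        paths.append(path)
    return paths

# Example: three commodities (say CPU, RAM, disk) with assumed parameters.
paths = simulate_gcc_prices(S0=[0.80, 0.70, 0.50],
                            gcc_drift=[0.05, 0.04, 0.03],
                            pf=0.1, sigma=[0.2, 0.2, 0.2])
```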
To price the multi-resource system, we suppose the real option depends on some other variables, such as the expected growth rate μgcc and the volatility σgcc, respectively. Then if we let

gcc = (gcc1, gcc2, ..., gccn)    (14)

for any number of derivatives of gcc, with prices p(gcc : tut = tn | QoS ∩ SLA) respectively, we have:

d ln Si = dpi / pi = μi dt + σi dz    (15)

where the variables gcci = {the set of resources}. Applying the price variant factor pf for pricing options,
we have:


d ln S = [gcc(t) − pf ln S] dt + [stochastic term]

(16)

Where dz is called the stochastic term. The strength of the pf is determined by the value of its membership function (high for pf > 0). For a multi-asset problem, we have:
d ln Si = [gcci(t) − pf ln Si] dt + σi dzi,  i = 1, 2, ..., n

(17)

The value of gcc(t) is determined such that F(t) = E[S(t)], i.e., the expected value of S is equal to the future price. A scenario similar to what we may encounter is a user who suspects that he might need more compute cycles (bandwidth) in 3, 6, and 9 months from today and therefore decides to pay some amount $s upfront to hold a position for the expected increase. We illustrate this process using a 3-step trinomial process. Suppose the spot price for bandwidth is $sT per bit per second (bps) and the projected 3-, 6-, and 9-month future prices are $s1, $s2, and $s3 respectively. In this scenario, the two uncertainties are the amount of bandwidth that will be available and the price per bit. However, we can obtain an estimate of the stochastic process for bandwidth prices by substituting some reasonably assumed values of pf and σ (e.g., pf = 10%, σ = 20%) into Equation (16) and obtaining the value of S from Equation (17). Suppose Vl,j represents the option value at level l, for l = 0, 1, ..., n − 1, and node j, for j = 1, 2, ..., (2l + 1) (for a trinomial lattice only); i.e., V1,1 represents the option value at level 1 reached via the up move pu. Similarly, in our simulation, using the base price values that we assume, we obtain option values for the trinomial tree at time steps of 2, 4, 8, and 16.

Fuzzy Logic Framework


We express the value of the gcc flexibility opportunities as:

gcc: tn = tut    (18)

where tn denotes the time-dimensional space, given as 0 ≤ tn ≤ 1, and tut describes the corresponding utilization time. If tn = 0, the gcc usage is now (today); if tn = 1, the gcc has a usage flexibility opportunity for the future, where the future is not to exceed 6 months (say). Users often request and utilize gccs at extremely high computing power but only for a short time, for tut = tn ≈ 0. Therefore, disbursing the gccs on demand and satisfying users' Quality of Service (QoS) requires that the distributed resources be over-committed or under-committed (for tn = 1 or 0, respectively) in order to satisfy the conditions specified in the Service Level Agreement (SLA) document. Such extreme conditions (for example, holding a gcc over a long time) incur some cost in the form of storage. Therefore, we express the utilization time tn as a membership function of a fuzzy set T. A fuzzy set is defined (see, for example, (Bojadziew & Bojadziew, 1997)) as:

T = {(t, μT(t)) | t ∈ T, μT(t) ∈ [0, 1]}    (19)

Thus, given that T is a fuzzy set in a time domain (the time-dimensional space), μT(tn) is called the membership function of the fuzzy set T, which specifies the degree of membership (between 0 and 1) to which tn belongs to the fuzzy set T. We express the triangular fuzzy membership function as follows:


Figure 3. SHARCNET: Used memory vs. number of jobs

μT(tn) = 1,  for x = b
μT(tn) = (x − a)/(b − a),  for a ≤ x ≤ b
μT(tn) = (c − x)/(c − b),  for b ≤ x ≤ c
μT(tn) = 0,  otherwise, i.e., if x ∉ [a, c]    (20)

where [a, c] is called the universe of discourse, or the entire life of the option. Therefore, for every gcc at utilization time tn, the availability of the gcc expressed as a membership function is the value compared against the stated QoS conditions given in the SLA document.
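A direct transcription of Equation (20) follows as an illustration; it is our own sketch, and the breakpoints a, b, and c in the example call are assumed values, not taken from the chapter.

```python
def triangular_membership(x, a, b, c):
    """Triangular fuzzy membership of Equation (20) over the universe [a, c]."""
    if x < a or x > c:
        return 0.0                  # outside the universe of discourse
    if x == b:
        return 1.0                  # peak of the triangle
    if x <= b:
        return (x - a) / (b - a)    # rising edge on [a, b]
    return (c - x) / (c - b)        # falling edge on [b, c]

# Example: utilization time tn = 0.4 on a fuzzy set spanning [0, 1], peaking at 0.5.
mu = triangular_membership(0.4, 0.0, 0.5, 1.0)   # -> 0.8
```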
An SLA document (Pantry & Griffiths, 1997) describes the agreed-upon services provided by an application system, to ensure that it is reliable, secure, and available to meet the needs of the business it supports. The SLA document consists of the technicalities and requirements specific to service provisioning, e.g., the expected processor cycles, QoS parameters, and some legal and financial aspects such as violation penalties and utilization charges for resource use. The implication of a service constraint that guarantees QoS and meets the specified SLA conditions within a set of intermittently available gccs is a system that compromises the basic underlying design objective of the grid as a commercial computing service resource (Yeo & Buyya, 2007). Therefore, Equation (18) becomes:

gcc : tut = tn | QoS ∩ SLA    (21)

To satisfy the QoS-SLA requirements, we evaluate existing grid utilization behavior from utilization traces. Based on the observed values of resource demands from the utilization traces, we obtain results from the SHARCNET traces and observe the utilization of memory and CPU time. To price the gccs, we run the trinomial lattice using the following model parameters: for example, for a one-step trinomial tree we use K = $0.70, S = $0.80, T = 0.5, r = 0.06, σ = 0.2, and Nj = 2N + 1. We extend our study by varying the number of time steps, N = 4, 8, 16, 24. For a 6-month contract, for example, N = 3 would mean a 2-month step size and N = 12 would mean a 2-week step size. Unlike stock prices, we need not go to very small step sizes.
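Assuming the trinomial pricer sketched in the previous section, the sample parameters above can be evaluated across the step counts mentioned. This is only an illustration of the procedure, not the chapter's own experiment code.

```python
# Illustrative use of the trinomial_american_put sketch from the earlier section,
# with the assumed sample parameters K = $0.70, S = $0.80, T = 0.5, r = 0.06, sigma = 0.2.
for n in (4, 8, 16, 24):
    value = trinomial_american_put(S0=0.80, K=0.70, T=0.5, r=0.06,
                                   sigma=0.2, N=n)
    print(n, round(value, 4))   # the option value should level off as n grows
```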
The relationships between the used CPU time and memory in the grid and the number of jobs requesting their use are depicted in Figures 2 and 3 respectively for SHARCNET. The trace analysis shows


Figure 4. Option value for RAM (in-the-money)

that SHARCNET has a symmetrically skewed effect of CPU time on the number of jobs served by the
grid.
Although SHARCNET delivers a large proportion of jobs, it experiences a sharp drop in the number of jobs it serves. The times/dates of low CPU availability are due to waste, waiting, or priority jobs served by the grid, or any combination of these. If we compare the CPU usage characteristics displayed by SHARCNET in Figure 2 with the memory utilization in Figure 3, we observe that a particular application (such as those involving signal processing/image rendering) that requires high CPU as well as high memory from the grid will not necessarily run optimally. If such an application is run on the SHARCNET grid (for example), it would run using sufficient CPU but under a depleted memory condition.
In our experiments, we simulate the grid compute commodities (gccs) and monitor users' requests for utilization. For a call option, we simulate the effects of time on exercising the option to use one of the gccs, such as memory (RAM), hard disk (HD), and CPU. We start with memory (one of the gccs) using the following parameters: S = $6.849 × 10^7, T = 0.5, r = 0.06, N = 4, 8, 16, 24, σ = 0.2, and Nj = 2N + 1; we vary K such that we can have in-the-money and out-of-the-money conditions. These values reflect the market value of this raw infrastructure in general. We are not certain about the type of RAM available in the example grids. However, one can easily map the above parametric values to correspond to the infrastructure available in the grids. This is true for other gccs, such as the CPU and hard disk discussed later. We obtain option values and study the variation over several step sizes to determine the effect on the option value of the fluctuations (uncertainty) that exist between the total period of the option contract and the time of exercise.
Figure 4 shows the in-the-money option value for RAM. It shows an option value that increases with the number of time steps. Over the range of step sizes, the option value reaches a steady state. The actual value indicates that entering the contract is beneficial for the user while still generating reasonable revenue for the grid provider.
This is an indication that, at any given time, a user's actual cost of utilizing the grid resources is the sum of the base cost and an additional cost that depends on the time of utilization of the gcc. However, for an equilibrium service-profit system, we impose a price modulation factor, the price variant factor pf (see the Price Variant Factor section above). The value of the pf depends on changes/variations in the technology


Figure 5. Execution time for various commodities

or architecture of the grid infrastructure. These variations are unknown prior to exercising the options to hold the use of a grid resource, and hence determining the exact price of a gcc in real life is uncertain and hard to predict. Therefore, to maximize gcc utilization (ut): with more computing facilities and the same technology, the value of pf(ut) is set to 0.1, and with new technology, pf = 1.0. The fuzzified boundary values of pf are constructed as pf(ut) = [0.1, 1.0] to facilitate fuzzification. Our model, therefore, adjusts the price of using grid resources by (pf(ut))^-1 (for the grid operator) while providing quality service. For example, applying pf reverses an unprofitable late exercise of an out-of-the-money option value into an early exercise with an in-the-money option value through a 10% adjustment. Figure 11 shows a corresponding out-of-the-money option value for CPU.
Similarly, we obtain from our simulation the option values for both in-the-money and out-of-the-money conditions for the CPU using the parameters S = $68.49 and K = $68.47 and $80.47 (all values scaled by 10^6), simulated for varying time steps of 4, 8, 16, and 24; those results are not shown.
We repeat this for the various grids and for the various gccs, first individually and then using a combination of the individual gccs. Figure 5 shows the execution time for HD, CPU, and RAM at various time steps. From the figures of option values we observed that by 24 steps the option value is reaching a steady state, and hence we did not experiment beyond 24 steps. Since the number of nodes to be computed increases, the time required to reach a steady state in the option value also increases, as shown in Figure 5.
Our interest in the design and development of an equilibrium service-profit grid resource pricing model is centered, in particular, on levels where resource utilization in the grid shows depleted values and does not sufficiently provide the service quality necessary to guarantee a user high QoS. The depleted resource (from the traces) is memory utilization in SHARCNET (Figure 5). In these circumstances, a user's QoS must be guaranteed. We use our price variant factor pf, discussed earlier, to modulate the


effective gcc prices by awarding incentives in the form of dividends to users who require composite resources.

CONCLUSION AND FUTURE WORK


We use the behavior of the grid resource utilization patterns observed from the traces to develop a novel pricing model as a real option problem. Our two important contributions are: (1) option value determination for grid resource utilization and the determination of the best point at which to exercise the option to utilize any of the grid resources, which helps the user as well as the grid operator optimize resources for profitability; and (2) the incorporation of a price variant factor, which controls the price of the resources and ensures that at any time the grid user gets the maximum benefit at the best prices while the operators also generate reasonable revenue at the current base spot price settings.
Our future work will focus on the larger problem of pricing grid resources for applications that utilize heterogeneous resources across heterogeneous grids and cloud computing. For example, if an application requires memory in one grid and CPU time from another grid simultaneously, then we will have to deal with a more complex, computationally intensive, multi-dimensional option pricing problem. This would require a more complex optimization of the solution space of the grid resource utilization matrix as well as determining the best node (time) at which to exercise the option.

REFERENCES
Black, F., & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. The Journal of Political Economy, 81(3), 637-654. doi:10.1086/260062
Bojadziew, G., & Bojadziew, M. (1997). Fuzzy Logic for Business, Finance, and Management Modeling (2nd ed.). Singapore: World Scientific Press.
Boyle, P. P. (1986). Option Valuation Using a Three Jump Process. International Options Journal, 3(2).
Buyya, R., Abramson, D., & Venugopal, S. (2005). The Grid Economy. Proceedings of the IEEE, 93(3), 698-714.
Buyya, R., Giddy, J., & Abramson, D. (2000). An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications. In Proceedings of the 2nd Workshop on Active Middleware Services, Pittsburgh, PA.
Carlsson, C., & Fullér, R. (2003). A Fuzzy Approach to Real Option Valuation. Fuzzy Sets and Systems, 139(2), 297-312.
Cox, J. C., Ross, S., & Rubinstein, M. (1979). Option Pricing: A Simplified Approach. Journal of Financial Economics, 7(3), 229-263.
Foster, I., & Kesselman, C. (1999). The Grid: Blueprint for a New Computing Infrastructure. San Francisco: Morgan Kaufmann Publishers, Inc.
Foster, I., Kesselman, C., Tsudik, G., & Tuecke, S. (1998). A Security Architecture for Computational Grids. In Proceedings of the ACM Conference on Computer and Communications Security.
Gray, A. A., Arabshahi, P., Lamassoure, E., Okino, C., & Andringa, J. (2004). A Real Option Framework for Space Mission Design. Technical report, National Aeronautics and Space Administration (NASA).
Hull, J. C. (2006). Options, Futures, and Other Derivatives (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Krishna, V., & Perry, M. (2007). Efficient Mechanism Design.
Merton, R. C. (1973). Theory of Rational Option Pricing. The Bell Journal of Economics and Management Science, 4(1), 141-183. doi:10.2307/3003143
Mutz, A., Wolski, R., & Brevik, J. (2007). Eliciting Honest Value Information in a Batch-Queue Environment. In Proceedings of the 8th IEEE/ACM International Conference on Grid Computing (Grid 2007), Austin, Texas, USA.
Palankar, M., Onibokun, A., Iamnitchi, A., & Ripeanu, M. (2007). Amazon S3 for Science Grids: A Viable Solution? Poster at the 4th USENIX Symposium on Networked Systems Design and Implementation (NSDI'07).
Pantry, S., & Griffiths, P. (1997). The Complete Guide to Preparing and Implementing Service Level Agreements (1st ed.). London: Library Association Publishing.
Schiffmann, W., Sulistio, A., & Buyya, R. (2007). Using Revenue Management to Determine Pricing of Reservations. In Proceedings of the 3rd International Conference on e-Science and Grid Computing (eScience 2007), Bangalore, India, December 10-13.
SHARCNET. (2008). Shared Hierarchical Academic Research Computing Network (SHARCNET).
Thulasiram, R. K., Litov, L., Nojumi, H., Downing, C. T., & Gao, G. R. (2001). Multithreaded Algorithms for Pricing a Class of Complex Options. In Proceedings (CD-ROM) of the International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA.
Yeo, C. S., & Buyya, R. (2007). Integrated Risk Analysis for a Commercial Computing Service. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007). Los Alamitos, CA: IEEE CS Press.

KEY TERMS AND DEFINITIONS


Distributed Computing: Grid resources as they relate to geographical regions, which is a factor in terms of availability and computability.
Fuzzy Support for QoS: A decision support system based on managing the uncertainties associated with grid resource availability.
Grid Computing: A computing grid is a system that delivers the processing power of a massively parallel computation and facilitates the deployment of resource-intensive applications.
Price Adjustments: A control/feedback structure that modulates grid resource prices with the specific objective of benefiting users and grid operators; its value depends on the current technology or market trend.
Real Option Model: A mathematical framework similar to financial options but characterized by uncertainty in decision flexibility in a known future, used for determining project viabilities.


Resource Management: This refers to the provision of grid resources to users at the time of requested utilization.
Resource Pricing: A fair pricing of the grid resources that depends highly on availability (monitored by the price variant factor) rather than on the market forces of demand and supply.

ENDNOTES
1. QoS describes a user's perception of a service against a set of predefined service conditions, contained in a Service Level Agreement (SLA), that is necessary to achieve a user-desired service quality.
2. An SLA (Pantry & Griffiths, 1997) is a legal contract in which a resource provider (say, a grid operator) agrees to deliver an acceptable minimum level of QoS to the users.
3. A reserved quantity at a certain time (tn-1) may be unavailable at tn.


Chapter 22

The State of the Art and Open Problems in Data Replication in Grid Environments
Mohammad Shorfuzzaman
University of Manitoba, Canada
Rasit Eskicioglu
University of Manitoba, Canada
Peter Graham
University of Manitoba, Canada

ABSTRACT
Data Grids provide services and infrastructure for distributed data-intensive applications that need
to access, transfer and modify massive datasets stored at distributed locations around the world. For
example, the next generation of scientific applications, such as many in high-energy physics, molecular
modeling, and earth sciences will involve large collections of data created from simulations or experiments. The size of these data collections is expected to be of multi-terabyte or even petabyte scale in
many applications. Ensuring efficient, reliable, secure and fast access to such large data is hindered
by the high latencies of the Internet. The need to manage and access multiple petabytes of data in Grid
environments, as well as to ensure data availability and access optimization are challenges that must
be addressed. To improve data access efficiency, data can be replicated at multiple locations so that a
user can access the data from a site near where it will be processed. In addition to the reduction of data
access time, replication in Data Grids also uses network and storage resources more efficiently. In this
chapter, the state of current research on data replication and arising challenges for the new generation
of data-intensive grid environments are reviewed and open problems are identified. First, fundamental
data replication strategies are reviewed which offer high data availability, low bandwidth consumption,
increased fault tolerance, and improved scalability of the overall system. Then, specific algorithms for
selecting appropriate replicas and maintaining replica consistency are discussed. The impact of data
replication on job scheduling performance in Data Grids is also analyzed. A set of appropriate metrics
including access latency, bandwidth savings, server load, and storage overhead for use in making critical


comparisons of various data replication techniques is also discussed. Overall, this chapter provides a comprehensive study of replication techniques in Data Grids that not only serves as a tool for understanding this evolving research area but also provides a reference to which future efforts may be mapped.

INTRODUCTION
The popularity of the Internet as well as the availability of powerful computers and high-speed network
technologies is changing the way we use computers today. These technology opportunities have also
led to the possibility of using distributed computers as a single, unified computing resource, leading to
what is popularly known as Grid Computing (Kesselman & Foster, 1998). Grids enable the sharing,
selection, and aggregation of a wide variety of resources including supercomputers, storage systems,
data sources, and specialized devices that are geographically distributed and owned by different organizations for solving large-scale computational and data intensive problems in science, engineering, and
commerce (Venugopal, Buyya, & Ramamohanarao, 2006).
Data Grids deal with providing services and infrastructure for distributed data-intensive applications
that need to access, transfer and modify massive datasets stored across distributed storage resources.
For example, scientists working in areas as diverse as high energy physics, bioinformatics, and earth
observations need to access large amounts of data. The size of these data is expected to be terabyte or
even petabyte scale for some applications. Maintaining a local copy of data on each site that needs the
data is extremely expensive. Also, storing such huge amounts of data in a centralized manner is almost
impossible due to extensively increased data access time. Given the high latency of wide-area networks
that underlie many Grid systems, and the need to access or manage several petabytes of data in Grid
environments, data availability and access optimization are key challenges to be addressed.
An important technique to speed up data access for Data Grid systems is to replicate the data in
multiple locations, so that a user can access the data from a site in his vicinity (Venugopal et al., 2006).
Data replication not only reduces access costs, but also increases data availability for most applications.
Experience from parallel and distributed systems design shows that replication promotes high data availability, lower bandwidth consumption, increased fault tolerance, and improved scalability. However, the
replication algorithms used in such systems cannot always be directly applied to Data Grid systems due
to the wide-area (mostly hierarchical) network structures and special data access patterns in Data Grid
systems that differ from traditional parallel systems.
In this chapter, the state of the current research on data replication and its challenges for the new
generation of data-intensive grid environments are reviewed and open problems are discussed. First,
different data replication strategies are introduced that offer efficient replica1 placement in Data Grid
systems. Then, various algorithms for selecting appropriate replicas and maintaining replica consistency are discussed. The impact of data replication on job scheduling performance in Data Grids is also
investigated.
The main objective of this chapter, therefore, is to provide a basis for categorizing present and future developments in the area of replication in Data Grid systems. This chapter also aims to provide
an understanding of the essential concepts of this evolving research area and to identify important and
outstanding issues for further investigation.


The remainder of this chapter is organized as follows. First, an overview of the data replication
problem is presented, describing the key issues involved in data replication. In the following section,
progress made to date in the area of replication in Data Grid systems is reviewed. Following this, a critical comparison of data placement strategies, probably the core issue affecting replication efficiency in
Data Grids, is provided. A summary is then given and some open research issues are identified.

OVERVIEW OF REPLICATION IN DATA GRID SYSTEMS


The efficient management of huge distributed and shared data resources across Wide Area Networks
(WANs) is a significant challenge for both scientific research and commercial applications. The Data
Grid as a specialization and extension of the Grid (Baker, Buyya, & Laforenza, 2006) provides a solution to this problem. Essentially, Data Grids (Chervenak, Foster, Kesselman, Salisbury, & Tuecke, 2000)
deal with providing services and infrastructure for distributed data-intensive applications that need to
access, transfer and modify massive datasets stored in distributed storage resources. At the minimum, a
Data Grid provides two basic functions: a high-performance, reliable data transfer mechanism, and a
scalable replica discovery and management mechanism. Depending on application requirements, other
services may also be needed (e.g. security, accounting, etc.).
Grid systems typically involve loosely coupled jobs that require access to a large number of datasets.
Such a large volume of datasets poses a challenging problem: how to make the data more easily and efficiently available to the users of these systems. In most situations, the datasets requested by a user's job
cannot be found at the local nodes in the Data Grid. In this case, data must be fetched from other nodes
in the grid which causes high access latency due to the size of the datasets and the wide-area nature of
the network that underlies most grid systems. As a result, job execution time can become very high due
to the delay of fetching data (often over the Internet).
Replication (Ranganathan & Foster, 2001b) of data is the most common solution used to address
access latency in Data Grid systems. Replication results in the creation of copies of data files at many
different sites in the Data Grid. Replication of data has been demonstrated to be a practical and efficient
method to achieve high network performance in distributed environments, and it has been applied widely
in the areas of distributed databases and some Internet applications (Ranganathan & Foster, 2001b; Chervenak et al., 2000). Creating replicas can effectively reroute client requests to different replica sites and
offer remarkably higher access speed than a single server. At the same time, the workload of the original
server is distributed across the replica servers and, therefore, decreases significantly. Additionally, the
network load is also distributed across multiple network paths thereby decreasing the probability of
congestion related performance degradation. In these ways, replication plays a key role in improving
the performance of data-intensive computing in Data Grids.

The Replication Process and Components


The use of replication in Data Grid systems speeds up data access by replicating the data at multiple
locations so that a user can access data from a nearby site (Venugopal et al., 2006). Replication
of data, therefore, aims to reduce both access latency and bandwidth consumption. Replication can also
help in server load balancing and can enhance reliability by creating multiple copies of the same data.
Replication is, of course, limited by the amount of storage available at each site in the Data Grid and by the bandwidth available between those sites. A replica management system, therefore, must ensure access to the required data while managing the underlying storage and network resources.

Figure 1. A replica management architecture
A replica management system, shown in Figure 1, consists of storage nodes that are linked to each
other via high-performance data transport protocols. The replica manager directs the creation and management of replicas according to the demands of the users and the availability of storage, and a catalog
(or directory) keeps track of the replicas and their locations. The catalog can be queried by applications
to discover the number and locations of available replicas of a given file.
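To make the catalog's role concrete, the following short sketch (in Python) shows a minimal, hypothetical replica catalog that maps logical file names to the physical locations of their replicas; the class and method names are invented for illustration and do not correspond to the API of any particular Grid middleware.

from collections import defaultdict

class ReplicaCatalog:
    # Minimal illustrative catalog mapping logical file names to physical replica
    # locations. Hypothetical sketch only; not the API of a real replica manager.
    def __init__(self):
        # logical file name -> set of physical locations (e.g. "gridftp://siteA/...")
        self._replicas = defaultdict(set)

    def register(self, logical_name, physical_location):
        # Record that a replica of logical_name exists at physical_location.
        self._replicas[logical_name].add(physical_location)

    def unregister(self, logical_name, physical_location):
        # Remove a replica entry, e.g. after a replica is deleted.
        self._replicas[logical_name].discard(physical_location)

    def locate(self, logical_name):
        # Return all known physical locations of a logical file.
        return sorted(self._replicas[logical_name])

# Example query by an application:
catalog = ReplicaCatalog()
catalog.register("lfn:run42/events.dat", "gridftp://siteA/store/events.dat")
catalog.register("lfn:run42/events.dat", "gridftp://siteB/store/events.dat")
print(len(catalog.locate("lfn:run42/events.dat")), "replicas known")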

Issues and Challenges in Data Replication


Although the necessity of replication in Data Grid systems is evident, its implementation entails several
issues and challenges such as selecting suitable replicas, maintaining replica consistency, and so on. The
following fundamental issues are identified:
a. Strategic placement of replicas is needed to obtain maximum gains from replication according to the objectives of applications.
b. The degree of replication must be selected so that the minimum number of replicas is used without reducing the performance of applications.
c. Replica selection identifies the replica that best matches the user's quality of service (QoS) requirements and, perhaps, achieves one or more system-wide management objectives.
d. Replica consistency management ensures that the multiple copies (i.e., replicas) of a given file are kept consistent in the presence of multiple concurrent updates.
e. The impact of replication on the performance of job scheduling must also be considered.

Figure 2. Taxonomy of the issues in data replication

Figure 2 presents a visual taxonomy of these issues which will be used in the next subsections.

Replica Placement
Although data replication is one of the major optimization techniques for promoting high data availability, low bandwidth consumption, increased fault tolerance, and improved scalability, the problem of
replica placement has not been well studied for large-scale Grid environments. To obtain the maximum
possible gains from file replication, strategic placement of the file replicas in the system is critical. The
replica placement service is the component of a Data Grid architecture that decides where in the system
a file replica should be placed. The overall file replication problem consists of making the following
decisions (Ranganathan & Foster, 2001b): (1) which files should be replicated; (2) when and how many replicas should be created; and (3) where these replicas should be placed in the system.
Replication methods can be classified as static or dynamic (M. Tang, Lee, Yeo, & Tang, 2005). For
static replication, after a replica is created, it will exist in the same place until it is deleted manually
by users or its replica duration expires. The drawback of static replication is evident: when client access patterns change greatly in the Data Grid, the benefits brought by replicas decrease sharply.
On the contrary, dynamic replication takes into consideration changes in the Data Grid environment
and automatically creates new replicas for popular data files or moves the replicas to other sites when
necessary to improve performance.

Replica Selection
A system that includes replicas also requires a mechanism for selecting and locating them at file access
time. Choosing and accessing appropriate replicas are very important to optimize the use of grid resources. A replica selection service discovers the available replicas and selects the best replica given
the users location and quality of service (QoS) requirements. Typical QoS requirements when doing
replica selection might include access time as well as location, security, cost, and other constraints. The
replica selection problem can be divided into two sub-problems (Rahman, Barker, & Alhajj, 2005): 1)
discovering the physical location(s) of a file given a logical file name, and 2) selecting the best replica
from a set based on some selection criteria.
Network performance can play a major role when selecting a replica. Slow network access limits the
efficiency of data transfer regardless of client and server implementation. One optimization technique
to select the best replica from different physical locations is by examining the available (or predicted) bandwidth between the requesting computing element and various storage elements that hold replicas.
The best site, in this case, is the one that has the minimum transfer time required to transport the replica
to the requested site. Although network bandwidth plays a major role in selecting the best replica, other
factors including additional characteristics of data transfer (most notably, latency), replica host load, and
disk I/O performance are important as well.
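As a rough illustration of this kind of selection, the sketch below (in Python, with invented site names and measurements) ranks candidate replica sites by a predicted transfer time combining latency and bandwidth; a real selector might add further terms for replica host load and disk I/O.

def estimated_transfer_time(file_size_mb, site):
    # Estimate transfer time (seconds) from a replica site to the requesting element.
    # A simple latency + size/bandwidth model; host load and disk I/O are ignored here.
    return site["latency_s"] + file_size_mb / site["bandwidth_mb_per_s"]

def select_best_replica(file_size_mb, candidate_sites):
    # Pick the site minimizing the predicted transfer time.
    return min(candidate_sites, key=lambda s: estimated_transfer_time(file_size_mb, s))

# Hypothetical measurements for three sites holding the requested replica.
sites = [
    {"name": "siteA", "latency_s": 0.08, "bandwidth_mb_per_s": 40.0},
    {"name": "siteB", "latency_s": 0.01, "bandwidth_mb_per_s": 12.0},
    {"name": "siteC", "latency_s": 0.20, "bandwidth_mb_per_s": 95.0},
]
best = select_best_replica(file_size_mb=2000, candidate_sites=sites)
print("fetch from", best["name"])   # siteC has the lowest predicted transfer time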

Replica Consistency
Consistency and synchronization problems associated with replication in Data Grid systems are not well
addressed in the existing research, with files often being regarded as read-only. However, as grid solutions are increasingly used by a wider range of applications, requirements will arise for mechanisms that
maintain the consistency of replicated data that can change over time. The replica consistency problem
deals with concurrent updates made to multiple replicas of a file. When one file is updated, all other
replicas then have to have the same contents and thus provide a consistent view. Consistency therefore
requires some form of concurrency control.
Replica consistency is a traditional issue in distributed systems, but it introduces new problems in Data
Grid systems. The traditional consistency implementations such as invalidation protocols, distributed
locking mechanisms, atomic operations and two-phase commit protocols are not necessarily suitable
for Data Grid environments because of the long delays introduced by the use of a wide-area network
and the high degree of autonomy of Data Grid resources (Domenici, Donno, Pucciani, Stockinger, &
Stockinger, 2004). For example, in a Data Grid, the replicas for a file may be distributed over different
countries. So, if one node that holds a replica is unavailable while an update operation is in progress, the whole updating process could fail.

The Impact of Data Replication on Job Scheduling


Dealing with the large number of data files that are geographically distributed causes many challenges in
a Data Grid. One that is not commonly considered is scheduling jobs to take data location into account
when determining job placement. The locations of data required by a job clearly impact grid scheduling
decisions and performance (M. Tang, Lee, Yeo, & Tang, 2006).
Traditional job schedulers for grid systems are responsible for assigning incoming jobs to compute
nodes in such a way that some evaluative conditions are met, such as the minimization of the overall execution time of the jobs or the maximization of throughput or utilization. Such systems generally take
into consideration the availability of compute cycles, job queue lengths, and expected job execution
times, but they typically do not consider the location of data required by the jobs. Indeed, the impact of
data and replication management on job scheduling behaviour has largely remained unstudied.
Data intensive applications such as High Energy Physics and Bioinformatics require both Computational Grid and Data Grid features. Performance improvements for these applications can be achieved
by using a Computational Grid that provides a large number of processors and a Data Grid that provides
efficient data transport and data replication mechanisms. In such environments, effective resource scheduling is a challenge. One must consider not only the abundance of computational resources but also data
locations. A site that has enough available processors may not be the optimal choice for computation if it
doesn't have the required data nearby. (Allocated processors might wait a long time to access the remote data.) Similarly, a site with local copies of required data is not a good place to compute if it doesn't have adequate computational resources.


An effective scheduling mechanism is required that allows the fastest possible access to the required data, thereby reducing the data access time. Since creating data replicas can significantly reduce the data access cost, a tighter integration of job scheduling and automated data replication can bring substantial
improvement in job execution performance.

DATA REPLICATION: STATE OF THE ART


As mentioned earlier, data replication becomes more challenging because of some unique characteristics
of Data Grid systems. This section surveys existing replication strategies in Data Grids and the issues
involved in replication that will form a basis for future discussion of open issues in the next section.

Replica Placement Strategies


With the high latency of the wide-area networks that underlie most Grid systems, and the need to access
and manage multiple petabytes of data, data availability and access optimization become key challenges
to be addressed. Hence, most of the existing replica placement algorithms focus on at least two types
of objective functions for placing replicas in Data Grid systems. The first type of replica placement
strategy looks towards decreasing the data access latency and the network bandwidth consumption. The
other type of replica placement strategy focuses on how to improve system reliability and availability.
Figure 3 shows a taxonomy of the replica placement algorithms based on the realized objective functions
together with references to papers in each category.

Figure 3. Taxonomy of the replica placement algorithms

Algorithms Focusing on Access Latency and Bandwidth Consumption


Ranganathan and Foster (Ranganathan & Foster, 2001b, 2001a) present and evaluate different replication strategies for a hierarchical Data Grid architecture. These strategies are defined depending on when,
where, and how replicas are created and destroyed in a hierarchically structured grid environment. They
test six different replication strategies: 1) No Replication: only root node holds replicas; 2) Best Client:
replica is created for the client who accesses the file the most; 3) Cascading: a replica is created on the
path from the root node to the best client; 4) Plain Caching: a local copy is stored upon initial request;
5) Caching plus Cascading: combines plain caching and cascading; 6) Fast Spread: file copies are stored
at each node on the path from the root to the best client. They show that the cascading strategy reduces
response time by 30% over plain caching when data access patterns contain both temporal and geographical locality. When access patterns contain some locality, Fast Spread saves significant bandwidth over
other strategies. These replication algorithms assume that popular files at one site are also popular at
others. The client site counts hops for each site that holds replicas, and the model selects the site that is the fewest hops from the requesting client; however, the model does not consider current network bandwidth and is limited to a hierarchical grid. The proposed replication algorithms could be refined so that the time interval and the replication threshold change automatically based on user behaviour.
Lamehamedi et al. (Lamehamedi, Szymanski, Shentu, & Deelman, 2002; Lamehamedi, Szymanski, Shentu, & Deelman, 2003) study replication strategies where the replica sites can be arranged in different
topologies such as a ring, tree or hybrid. Each site or node maintains an index of the replicas it hosts and
the other locations that it knows about that host replicas of the same files. Replication decisions are made
based on a cost model that evaluates both the data access costs and performance gains of creating each
replica. The estimation of costs and gains is based on factors such as run-time accumulated read/write
statistics, response time, bandwidth, and replica size. The replication strategy places a replica at a site
that minimises the total access costs including both read and write costs for the datasets. The write cost
considers the cost of updating all the replicas after a write at one of the replicas. They show via simulation that the best results are achieved when the replication process is carried out closest to the users.
Bell et al. (W. H. Bell et al., 2003) present a file replication strategy based on an economic model
that optimises the selection of sites for creating replicas. Replication is triggered based on the number of
requests received for a dataset. Access mediators receive these requests and start auctions to determine
the cheapest replicas. A Storage Broker (SB) participates in these auctions by offering a price at which
it will sell access to a replica if it is available. If the replica is not available at the local storage site, then
the broker starts an auction to replicate the requested file onto its storage if it determines that having the
dataset is economically feasible. Other SBs then bid with the lowest prices that they can offer for the
file. The lowest bidder wins the auction but is paid the amount bid by the second-lowest bidder.
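The auction just described follows second-price (Vickrey-style) rules. The toy sketch below, with hypothetical broker names and bids, shows how the winner and the price it receives are determined; it illustrates the pricing rule only and is not the cited economic model's implementation.

def run_replica_auction(bids):
    # Given {broker: price_offered}, return (winning_broker, price_paid).
    # The lowest bid wins; the winner is paid the second-lowest bid (second-price rule).
    ordered = sorted(bids.items(), key=lambda kv: kv[1])
    winner, _ = ordered[0]
    price_paid = ordered[1][1] if len(ordered) > 1 else ordered[0][1]
    return winner, price_paid

# Hypothetical storage brokers bidding to supply access to a replica.
bids = {"SB_siteA": 12.0, "SB_siteB": 9.5, "SB_siteC": 11.0}
winner, paid = run_replica_auction(bids)
print(winner, "wins and is paid", paid)   # SB_siteB wins and is paid 11.0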
In subsequent research, Bell et al. (W. Bell et al., 2003) describe the design and implementation of a
Grid simulator, OptorSim. In particular, OptorSim allows the analysis of various replication algorithms.
The goal is to evaluate the impact of the choice of an algorithm on the throughput of typical grid jobs.
The authors implemented a simple remote access heuristic and two traditional cache replacement algorithms (oldest file deletion and least accessed file deletion).
Their simulation was constructed assuming that the grid consists of several sites, each of which may
provide computational and data-storage resources for submitted jobs. Each site consists of zero or more
Computing Elements and zero or more Storage Elements. Computing Elements run jobs, which use the
data in files stored on Storage Elements. A Resource Broker controls the scheduling of jobs to Computing Elements. Sites without Storage or Computing Elements act as network routing nodes.
Various algorithms were compared to a novel algorithm (W. H. Bell et al., 2003) based on an economic model. The comparison was based on several grid scenarios with various workloads. The results
obtained from OptorSim suggest that the economic model performs at least as well as traditional methods.
However, the economic model shows marked performance improvements over other algorithms when
data access patterns are sequential.
Sang-Min Park et al. (Park, Kim, Ko, & Yoon, 2003) propose a dynamic replication strategy, called
BHR (Bandwidth Hierarchy based Replication), to reduce data access time by avoiding network congestion
in a Data Grid network. The BHR algorithm benefits from network-level locality, which holds when the required file is located at a site that has broad bandwidth to the site of the job's execution.
In Data Grids, some sites may be located within a region where sites are linked closely. For instance, a
country or province/state might constitute a network region. Network bandwidth between sites within
a region will be broader than bandwidth between sites across regions. That is, a hierarchy of network
bandwidth may appear in the Internet. If the required file is located in the same region, less time will
be consumed to fetch the file. Thus, the benefit of network-level locality can be exploited. The BHR
strategy reduces data access time by maximizing this network-level locality.
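One minimal way to express this network-level locality is sketched below: among the sites holding the required file, sites in the same region as the job's execution site are preferred, and bandwidth to the job site is used otherwise. The region map and bandwidth figures are invented, and the sketch is not the published BHR algorithm.

def pick_source_site(required_file, job_site, region_of, replica_sites, bandwidth):
    # Prefer a replica inside the job's region (broad intra-region links); otherwise
    # pick the remote holder with the highest bandwidth to the job site.
    # Illustrative sketch of BHR-style network-level locality only.
    holders = replica_sites[required_file]
    same_region = [s for s in holders if region_of[s] == region_of[job_site]]
    candidates = same_region if same_region else holders
    return max(candidates, key=lambda s: bandwidth[(job_site, s)])

region_of = {"jobSite": "EU", "siteA": "EU", "siteB": "US", "siteC": "ASIA"}
replica_sites = {"fileX": ["siteA", "siteB", "siteC"]}
bandwidth = {("jobSite", "siteA"): 900, ("jobSite", "siteB"): 80, ("jobSite", "siteC"): 45}  # Mb/s
print(pick_source_site("fileX", "jobSite", region_of, replica_sites, bandwidth))  # siteA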
Rahman et al. (Rahman, Barker, & Alhajj, 2005b) present a replica placement algorithm that considers
both the current state of the network and file requests. Replication is started by placing the master files
at one site. Then the expected utility or risk index is calculated for each site that does not currently
hold a replica and then one replica is placed on the site that optimizes the expected utility or risk. The
proposed utility-based algorithm selects a candidate site to host a replica by assuming that future requests and load will follow the current pattern of user requests and load. Conversely, the risk-index-based algorithm identifies sites that are far from all other sites and assumes a worst-case scenario whereby future requests will primarily originate from such a distant site, thereby attempting to provide good access throughout the
network. One major drawback of these strategies is that the algorithms select only one site per iteration
and place a replica there. Grid environments can be highly dynamic and thus there might be a sudden
burst of requests such that a replica needs to be placed at multiple sites simultaneously to quickly satisfy
the large spike of requests.
Figure 4. An example of the history and node relations

Two dynamic replication mechanisms (M. Tang, Lee, Yeo, & Tang, 2005) are proposed for a multitier architecture for Data Grids: Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU). The SBU algorithm replicates any data file that exceeds a pre-defined threshold of access rate as close as possible to
the clients. The main shortcoming of SBU is that it does not consider the relationships among historical access records. To address this problem, ABU was designed to take into account the access histories of files used by sibling nodes and to aggregate the access records of similar files so that these frequently
accessed files are replicated first. This process is repeated until the root is reached. An example of a
data file access history and the network topology of the related nodes is shown in Figure 4. The history
indicates that node N1 has accessed file A five times, while N2 and N3 have accessed B four and three
times, respectively. Nodes N1, N2 and N3 are siblings and their parent node is P1.
If we assume that the SBU algorithm is adopted and the given threshold is five, the last two records
in the history will be skipped and only the first record will be processed. The result is that the file A will
be created in node P1 if it has enough space, and file B will not be replicated. Considering this example
it is clear that the decision of SBU is not optimal, because from the perspective of the whole system, file
B, which is accessed seven times by node N2 and N3, is more popular than A, which is only accessed
five times by node N1. Hence, the better solution is to replicate file B to P1 first, then replicate file A to
P1 if it still has enough space available. The Aggregate Bottom-Up (ABU) algorithm works in a similar
fashion. With a hierarchical topology, a client searches for files along the path from the client to the root. In addition,
the root replicates the needed data at every node. Therefore, access latency can be improved significantly.
On the other hand, significant storage space may be used. Storage space utilization and access latency
must be traded off against each other.
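The contrast in this example can be reproduced in a few lines of code. The sketch below uses the access counts from Figure 4 and a threshold of five: an SBU-style pass judges each record in isolation, while an ABU-style pass first aggregates the counts of sibling nodes at their parent. It illustrates the idea only and is not the published algorithms.

# Access history from the Figure 4 example: (client node, file, access count).
history = [("N1", "A", 5), ("N2", "B", 4), ("N3", "B", 3)]
parent = {"N1": "P1", "N2": "P1", "N3": "P1"}
THRESHOLD = 5

# SBU-style: each record judged on its own, so only file A (count 5) qualifies.
sbu_candidates = {(parent[node], f) for node, f, count in history if count >= THRESHOLD}

# ABU-style: aggregate sibling counts at the parent before applying the threshold,
# so file B (4 + 3 = 7 accesses under P1) is replicated first, then A.
aggregated = {}
for node, f, count in history:
    key = (parent[node], f)
    aggregated[key] = aggregated.get(key, 0) + count
abu_order = sorted((k for k, c in aggregated.items() if c >= THRESHOLD),
                   key=lambda k: aggregated[k], reverse=True)

print("SBU replicates:", sbu_candidates)        # {('P1', 'A')}
print("ABU replicates (in order):", abu_order)  # [('P1', 'B'), ('P1', 'A')]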
Rahman et al. (Rahman, Barker, & Alhajj, 2005a) propose a multi-objective approach to address the
replica placement problem in Data Grid systems. A grid environment is highly dynamic, so predicting user
requests and network load a priori is difficult. Therefore, if only a single objective is considered, variations in user requests and network load will have a larger impact on system performance. Rahman et al. use
two models: the p-median and p-center models (Hakami, 1999), for selecting the candidate sites at which
to host replicas. The p-median model places replicas at sites that optimize the request-weighted average
response time (which is the time required to transfer a file from the nearest replication site). The response
time is zero if a local copy exists. The request-weighted response time is calculated by multiplying the
number of requests at a particular site by the response time for that site. The average is calculated by
averaging the request weighted response times for all sites. The p-center model selects candidate sites
to host replicas by minimizing the maximum response time. Rahman et al. consider a multi-objective approach that combines the p-center and p-median objectives to decide where to place replicas.
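A small sketch of how the two objectives differ is given below: for one candidate placement, the p-median score is the request-weighted average response time and the p-center score is the worst-case response time. The request counts and response times are invented for illustration.

def p_median_score(requests, response_time):
    # Request-weighted average response time (p-median objective).
    total_requests = sum(requests.values())
    weighted = sum(requests[s] * response_time[s] for s in requests)
    return weighted / total_requests

def p_center_score(response_time):
    # Maximum response time over all sites (p-center objective).
    return max(response_time.values())

# Hypothetical evaluation of one candidate placement: response_time[s] is the time
# for site s to reach its nearest replica (0.0 if s holds a local copy).
requests = {"s1": 120, "s2": 30, "s3": 10}
response_time = {"s1": 0.0, "s2": 4.0, "s3": 9.0}
print("p-median:", round(p_median_score(requests, response_time), 2))  # 1.31
print("p-center:", p_center_score(response_time))                      # 9.0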

Algorithms Focusing on System Reliability and Availability


Once bandwidth and computing capacity become relatively cheap, data access time can decrease dramatically. How to improve the system reliability and availability then becomes the focal point for replication
algorithms. Lei and Vrbsky (Lei & Vrbsky, 2006) propose a replica strategy to improve availability when
storage resources are limited without increasing access time.
To better express system data availability, Lei and Vrbsky introduce two new measures: the file
missing rate and the bytes missing rate. The File Missing Rate (FMR) represents the number of files
potentially unavailable out of all the files requested by all the jobs. The Bytes Missing Rate (BMR)
represents the number of bytes potentially unavailable out of the total number of bytes requested by all
jobs. Their replication strategy is aimed at minimizing the data missing rate. To minimize the FMR and
BMR, their proposed strategy makes the replica and placement decisions based on the benefits received from replicating the file in the long term. If the requested file is not at a site, it is replicated at the site if
there is enough storage space. If there is not enough free space to store the replica, an existing file must
be replaced. Their replication algorithm can be enhanced by differentiating between the file missing rate
and bytes missing rate in the grid when the file size is not unique.
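Since the chapter defines the two measures only in words, the following formulation is an assumption, but it captures the intent: FMR is the fraction of requested files that are potentially unavailable, and BMR is the corresponding fraction of requested bytes.

def missing_rates(requested_files, unavailable):
    # Compute (FMR, BMR) for one workload.
    # requested_files: {file_name: size_in_bytes} over all job requests (assumed form).
    # unavailable: set of requested file names that could not be served.
    total_files = len(requested_files)
    total_bytes = sum(requested_files.values())
    missing_files = sum(1 for f in requested_files if f in unavailable)
    missing_bytes = sum(size for f, size in requested_files.items() if f in unavailable)
    return missing_files / total_files, missing_bytes / total_bytes

# Hypothetical workload: four requested files, the largest one unavailable.
requested = {"f1": 100, "f2": 300, "f3": 50, "f4": 550}
print(missing_rates(requested, unavailable={"f4"}))  # (0.25, 0.55)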
Ranganathan et al. (Ranganathan, Iamnitchi, & Foster, 2002) present a dynamic replication strategy
that creates copies based on trade-offs between the cost and the future benefits of creating a replica.
Their strategy is designed for peer-to-peer environments where there is a high degree of unreliability and hence considers the minimum number of replicas that might be required given the probability of
a node being up and the accuracy of information possessed by a site in a peer-to-peer network. In their
approach, peers create replicas automatically in a decentralized fashion, as required to meet availability
goals. The aim of the framework is to maintain a threshold level of availability at all times.
Each peer in the system possesses a model of the peer-to-peer storage system that it can use to determine how many replicas of any file are needed to maintain the desired availability. Each peer applies
this model to the (necessarily incomplete and/or inaccurate) information it has about the system state
and replication status of its files to determine if, when, and where new replicas should be created. The
result is a completely decentralized system that can maintain performance guarantees. These advantages
come at the price of accuracy since nodes make decisions based on partial information, which sometimes
leads to unnecessary replication. Simulation results show that the redundancy in action associated with
distributed authority is more evident when nodes are highly unreliable.
An analytical model for determining the optimal number of replica servers is presented by Schintke
and Reinefeld (Schintke & Reinefeld, 2003) to guarantee a given overall reliability given unreliable
system components. Two views are identified: the requester who requires a guaranteed availability
of the data (local view), and the administrator who wants to know how many replicas are needed and
how much disk space they would occupy in the overall system (global view). Their model captures the
characteristics of peer-to-peer-like environments as well as that of grid systems. Empirical simulations
confirm the accuracy of this analytical model.
Abawajy (Abawajy, 2004) addresses the file replication problem, focusing on the strategic placement of replicas with the objectives of increasing data availability and improving response time while distributing load equally. Abawajy proposes a replica placement service called Proportional
Share Replication (PSR). The main idea underlying the PSR policy is that each file replica should serve
an approximately equal number of requests in the system. The objective is to place the replicas on a set
of sites systematically in such a way that file access parallelism is increased while the access costs are
decreased. Abawajy argues that no replication approach balances the load of data requests within the
system both at the network and host levels. Simulation results show that file replication improves the
performance of data access but the gains depend on several factors including where the file replicas are
located, burstiness of the request arrivals, packet losses and file sizes.
To use distributed replicas efficiently and to improve the reliability of data transfer, Wang et al. (C.
Wang, Hsu, Chen, & Wu, 2006) propose an efficient multi-source data transfer algorithm for data replication, whereby a data replica can be assembled in parallel from multiple distributed data sources in
a fashion that adapts to various network bandwidths. The goal is to minimize the data transfer time by
scheduling sub-transfers among all replica sites. All replica sites must deliver their source data continuously to maximize their aggregated bandwidth, and all sub-transfers of source data should, ideally, be
fully overlapped throughout the replication. Experimental results show that their algorithm can obtain
more aggregated bandwidth, reduce connection overheads, and achieve superior load balance.


Algorithms Focusing on Overall Grid Performance


Although a substantial amount of work has been done on data replication in Grid systems, most of it has
focused on infrastructure for replication and mechanisms for creating and deleting replicas. However, to
obtain the maximum benefit from replication, a strategic placement of replicas considering many factors
is essential. Notably, different sites may have different service quality requirements. Therefore, quality
of service is an important additional factor in overall system performance.
Lin et al. (Lin, Liu, & Wu, 2006) address the problem of data replica placement in Data Grids given
traffic patterns and locality requirements. They consider several important issues. First, the replicas
should be placed in proper server locations so that the workload on each server is balanced. Another
important issue is choosing the optimal number of replicas when the maximum workload capacity for
each replica server is known. The denser the distribution of replicas is, the shorter the distance a client
site needs to travel to access a data copy. However, maintaining multiple copies of data in Grid systems
is expensive, and therefore, the number of replicas must be bounded. Clearly, optimizing the access
cost of data requests and reducing the cost of replication are two conflicting goals. Finding a balance
between them is a challenging task. Lin et al. also consider the issue of service locality. Each user may specify the maximum distance to the nearest data server that he or she is willing to accept. This serves as a locality assurance, and the system must ensure that there is a server within the specified range to answer any file request.
Lin et al. assume a hierarchical Data Grid model. In such a hierarchical Data Grid model, all the
request traffic may reach the root, if not satisfied by a replica. This introduces additional complexity for
the design of an efficient algorithm for replica placement in Grid systems when network congestion is
one of the objective functions to be optimized.
Tang and Xu (X. Tang & Xu, 2005) suggest a QoS-aware replica placement approach to cope with
quality-of-service issues. They provide two heuristic algorithms for general graphs, and a dynamic
programming solution for a tree topology. Every edge uses the distance between the two end-points as
a cost function. The distance between two nodes is used as a metric for quality (i.e. access time) assurance. A request must be answered by a server that is within the distance specified by the request. Every
request knows the nearest server that has the replica and the request takes the shortest path to reach the
server. Their goal has been to find a replica placement that satisfies all requests without violating any
range constraint, and that minimizes the update and storage costs at the same time. They show that this
QoS-aware replica placement problem is NP-complete for general graphs.
Wang et al. (H. Wang, Liu, & Wu, 2006) study the QoS-aware replica placement problem and provide
a new heuristic algorithm to determine the positions of the replicas to improve system performance and
satisfy the quality requirements specified by the users simultaneously. Their model is based on general
graphs and their algorithm starts by finding the cover set (Revees, 1993) of every server in the network.
In the second phase, the algorithm identifies and deletes super cover sets in the network. Finally, it
inserts replicas into the network iteratively until all servers are satisfied. Experimental results indicate
that the algorithm efficiently finds near-optimal solutions so that it can be deployed in various realistic
environments. However, the study does not consider the workload capacity of the servers.


Figure 5. A taxonomy of replica selection algorithms and selected papers

Replica Selection Algorithms


To improve replica retrieval we must determine the best replica location using a replica selection technique.
Such techniques attempt to select the single best server to provide optimum transfer rates. This can be
challenging because bandwidth quality can vary unpredictably due to the shared nature of the Internet.
Another approach is to use co-allocation technology (Vazhkudai, 2003) to download data. Co-allocation
of data transfers enables the clients to download data from multiple locations by establishing multiple
connections in parallel. This can improve the performance compared to single-server approaches and
helps to mitigate Internet congestion problems. Figure 5 shows a taxonomy of replica selection algorithms
based on the method used for retrieving the replicas distributed in the system.

Algorithms Based on Selecting the Best Replica


Vazhkudai et al. (Vazhkudai, Tuecke, & Foster, 2001) discuss the design and implementation of a high-level replica selection service that uses information regarding replica location and user preferences to
guide selection from among storage replica alternatives. An application that requires access to replicated
data begins by querying an application specific metadata repository, specifying the characteristics of
the desired data. The metadata repository maintains associations between representative characteristics
and logical files, thus enabling the application to identify logical files based on application requirements
rather than by a possibly unknown file name. Once the logical file has been identified, the application
uses the replica catalog to locate all replica locations containing physical file instances of this logical
file, from which it can choose a suitable instance for retrieval. Vazhkudai et al. use Globus (Foster, 2006)
information service capabilities concerning storage system properties to collect dynamic information to
improve and optimize the selection process.
Chervenak et al. (Chervenak, 2002) characterize the requirements for a Replica Location Service
(RLS) and describe a Data Grid architectural framework, Giggle (GIGa-scale Global Location Engine),
within which a wide range of RLSs can be defined. An RLS is composed of a Local Replica Catalog
(LRC) and a Replica Location Index (RLI). The LRC maps logical identifiers to physical locations
and vice versa. It periodically sends out information to other RLSs about its contents (mappings) via a soft-state propagation method. Collectively, the LRCs provide a complete and locally consistent record
of global replicas. The RLI contains a set of pointers from logical identifier to LRC. The RLS uses the
RLIs to find LRCs that contain the requested replicas. RLIs may cover a subset of LRCs or cover the
entire set of LRCs.
To select the best replica Rahman et al. (Rahman, K.Barker, & Alhajj, 2005) design an optimization
technique that considers both network latency and disk state. They present a model that uses a simple data
mining approach to select the best replica from a number of sites that hold replicas. Previous history of
data transfers can help in predicting the best site to hold a replica. Rahman et al.'s approach is one such
predictive technique. In their technique, when a new request arrives for the best replica, all previous data
are examined to find a subset of previous file requests that are similar and then they are used to predict
the best site to hold the replica. The proposed model shows significant performance improvement for
sequential and unitary random file access patterns. The client node always contacts the site found by the
classifier and requests the file, regardless of the accuracy of the classification result. Switching from a
classification method to a traditional one is not considered even when the classification result is far from
ideal. Hence, system performance will decrease when the classification is inaccurate. Future work could be
done on designing an adaptive algorithm so that the algorithm can switch to a traditional approach for
consecutive file transfers when it encounters misclassification.
Sun et al. (M. Sun, Sun, Lu, & Yu, 2005) propose an ant optimization algorithm for file replica selection in Data Grids. The general idea of the ant-based approach is to use an ant colony optimization
algorithm to decide which data file replicas should be accessed when a job requires data resources. The
ant algorithm (Dorigo, 1992) is a meta-heuristic method which mimics the behavior of how real ants find
the shortest path from their nest to a food source. The main idea is to mimic the pheromone trail used
by real ants as a medium for communication and feedback among ants. For the selection of a data replica, the ant uses pheromone information which
reflects the efficiency of previous accesses. The algorithm has been implemented and the advantages of
the new ant algorithm have been investigated using the grid simulator OptorSim (W. Bell et al., 2003).
Their evaluation demonstrates that their ant algorithm can reduce data access latency, decrease bandwidth
consumption and distribute storage site load.

Algorithms Using Co-Allocation Mechanism


Vazhkudai (Vazhkudai, 2003) developed several co-allocation mechanisms to enable parallel downloading of files. The most interesting one is called Dynamic Co-Allocation. The dataset that the client
wants is divided into k disjoint blocks of equal size. Each available server is assigned to deliver one
block in parallel. When a server finishes delivering a block, another block is requested, and so on, until
the entire file is downloaded. Faster servers can deliver the data quickly, thus serving larger portions of
the file requested when compared to slower servers. This approach exploits the partial copy feature of
GridFTP (Allcock, 2003) provided by the Globus Toolkit (Foster, 2006) to reduce the total transfer time.
One drawback of this approach is that faster servers must wait for the slowest server to deliver the final
block. This idle-time drawback is common to existing co-allocation strategies. It is important to reduce
the differences in completion time among replica servers to achieve the best possible performance.
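The block-by-block behaviour described above can be sketched as a simple simulation: the file is split into equal blocks, each server is handed the next block as soon as it finishes the previous one, and faster servers therefore deliver more blocks. The per-block times are invented and the code only illustrates the scheme, not Vazhkudai's implementation.

import heapq

def dynamic_coallocation(num_blocks, block_time):
    # Simulate dynamic co-allocation: whenever a server finishes a block it is given
    # the next one, so faster servers serve more blocks. Returns blocks per server and
    # the overall completion time (bounded by whoever delivers the final block).
    served = {s: 0 for s in block_time}
    free_at = [(0.0, s) for s in block_time]   # (time this server becomes free, server)
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_blocks):
        t, server = heapq.heappop(free_at)
        t += block_time[server]                # server delivers one more block
        served[server] += 1
        finish = max(finish, t)
        heapq.heappush(free_at, (t, server))
    return served, finish

# Hypothetical per-block transfer times (seconds) for three replica servers.
blocks, total = dynamic_coallocation(num_blocks=10, block_time={"A": 1.0, "B": 2.0, "C": 5.0})
print(blocks, total)   # {'A': 6, 'B': 3, 'C': 1} finishing at about 6.0 s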
Chang et al. (Chang, Wang, & Chen, 2005) suggest an improvement to dynamic co-allocation to
address the problem of faster servers waiting for slower ones. Their work is based on a co-allocation architecture coupled with prediction techniques. They propose two techniques: (1) abort and retransfer,
and (2) one by one co-allocation. These techniques can increase the volume of data requested from
faster servers and reduce the volume of data fetched from slower servers thereby balancing the load
and individual completion times.
The Abort and Retransfer scheme allows the aborting of the slowest server transfer so the work can
be moved to faster servers. This can change the allocation on the fly based on the current conditions of the transfer. When all data blocks are assigned, the procedure will check the remaining
transfer time of the slowest server. If the remaining time is longer than the time of transferring the last
data block from the fastest server, the final data block will be re-assigned to the fastest server.
One by one co-allocation focuses on preventing problematic allocation to the slowest server a priori. It is a pre-scheduling method used to allocate the data blocks to be
transferred to the available servers. By using a prediction technique, the transfer time of each server is
estimated. The data blocks are then assigned to the server with the lowest estimated transfer time in each
round. Further, if one server is assigned to transfer more than one data block in an earlier round, its total
transfer time is accumulated.
Yang et al. (Yang, Yang, Chen, & Wang, 2006) propose a dynamic co-allocation scheme based on a
co-allocation grid data transfer architecture called the Recursive-Adjustment Co-Allocation scheme that
reduces the idle time spent waiting for the slowest server and improves data transfer performance. Their
co-allocation scheme works by continuously adjusting each replica servers workload to correspond to
its real-time bandwidth during file transfers. Yang et al. also provide a function that enables users to
define a final block threshold according to the characteristics of their Data Grid environment, to avoid continuous over-adjustment.
Usually, a complete file is replicated to many Grid sites for local access (including when co-allocation
is used). However, a site may only need certain parts of a given replica. Therefore, to use the storage
system efficiently, it may be desirable for a grid site to store only part(s) of a replica.
Chang and Chen (Chang & Chen, 2007) propose a concept called fragmented replicas where, when
doing replication, a site can store only those partial contents that are needed locally. This can greatly
reduce the storage space wasted in storing unused data. Chang and Chen also propose a block mapping
procedure to determine the distribution of blocks in every available server for later replica retrieval.
Using this procedure, a server can provide its available partial replica contents to other members in the
grid system since clients can retrieve a fragmented replica directly by using the block mapping procedure. Given the block mapping procedure, co-allocation schemes (Vazhkudai, 2003; Chang et al., 2005)
can be used to retrieve data sets from the available servers given the added constraint that only specific
servers will hold a particular fragment. Simulation results show that download performance is improved
in their fragmented replication system.
Chang and Chen (Chang & Chen, 2007) assume that the blocks in a fragmented replica are contiguous. If they were not, then the data structure to represent the fragmented replica and the algorithm for
retrieval would be more complicated. Also, the proposed algorithms do not always find an optimal
solution as explained by the authors. It would be interesting to determine if a worst case performance
bound exists for these algorithms.
When multiple replicas exist, a client uses a replica selection mechanism to find the best source from
which to download. However, this simple approach may not yield the best performance and reliability
because data is received from only one replica server. To avoid this problem, Zhou et al. (Zhou, Kim,
Kim, & Yeom, 2006) developed ReCon, a fast and reliable replica retrieval system for Data Grids that acquires data not only from the best source but from other sources as well. Through concurrent transfer,
they are able to achieve significant performance improvement when retrieving a replica. ReCon also
provides fault-tolerant replica retrieval since multiple replication sites are employed.
For fast replica retrieval, Zhou et al. considered various fast retrieval algorithms, among which probe-based retrieval appears to be the best approach, providing twice the transfer rate of the best replica server
chosen by the replica selection service. Probe-based retrieval predicts the future network throughput of
the replica servers by sending probing messages to each server. This allows them to select replicas which
will provide fast access. For reliable replica retrieval, they introduce a recursive scheduling mechanism,
which provides fault-tolerant retrieval by rescheduling failed sub-transfers.

Replica Consistency Algorithms


As mentioned earlier, the replica consistency problem in Data Grid systems deals with the update synchronization of multiple copies (replicas) of a file. When one file is updated, all other replicas have to
be synchronized to have the same contents and thus provide a consistent view. Different algorithms for
maintaining such consistency have been proposed in the literature.
Replication consistency algorithms have traditionally been classified into strong and weak consistency.
A strong consistency algorithm (Duvvuri, Shenoy, & Tewari, 2000) ensures that all the replicas have
exactly the same content (synchronized systems) before any transaction is carried out. In an unreliable network like the Internet, with a large number of replicas, latency can become high, so such systems become impractical. Strong consistency algorithms are suitable for systems with few replicas,
and on a reliable, low latency network where a large amount of bandwidth is available.
In contrast, weak consistency algorithms (Golding, 1992) maintain approximate consistency of the
replicas, sacrificing data freshness in a controlled way to improve availability and performance. They are
very useful in systems where it is not necessary for all the replicas to be totally consistent for carrying
out transactions (systems that can withstand a certain degree of inconsistency).
In the context of weak consistency, the fast consistency algorithm (Elias & Moldes, 2002b) prioritizes
replicas with high demand in such a way that a large number of clients receive fresh content. As described
by Elias and Moldes (Elias & Moldes, 2002a), this algorithm gives high performance in one zone of high
demand, but in multiple zones the performance may become poor. To improve this poor performance,
Elias and Moldes (Elias & Moldes, 2003) propose an election algorithm based on demand, whereby the
replicas in each zone of high demand select leader replicas that subsequently construct a logical topology, linking all the replicas together. In this way, changes are able to reach all the high demand replicas
without the low demand zones forming a barrier to prevent this from happening.
Two coherence protocols for Data Grids were introduced by Sun and Xu (Y. Sun & Xu., 2004) called
lazy-copy and aggressive-copy. In the lazy-copy based protocol, replicas are only updated as needed if
someone accesses them. This can save network bandwidth by avoiding transferring up-to-date replicas
every time some modifications are made. However, the lazy-copy protocol has to pay penalties in terms
of access delay when inter-site updating is required. For the aggressive-copy protocol, replicas are always
updated immediately when the original file is modified. In other words, full consistency for replicas
is guaranteed in aggressive-copy, whereas partial consistency is applied to lazy-copy. Compared with
lazy-copy, access delay time can be reduced by an aggressive-copy based mechanism without suffering
from long update time during each replica access. Nevertheless, full consistency with frequent replica
updates could consume a considerable amount of network bandwidth. Furthermore, some updates may be unnecessary because it is probable that they will never be used.
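The difference between the two protocols can be sketched with two toy classes: aggressive-copy pushes every write to all replicas immediately, while lazy-copy only marks replicas stale and refreshes a replica when it is actually read. This is an illustration of the general idea, not the protocols as published.

class AggressiveCopyFile:
    # Every update is propagated to all replicas immediately (full consistency).
    def __init__(self, sites):
        self.replicas = {s: b"" for s in sites}

    def write(self, data):
        for s in self.replicas:
            self.replicas[s] = data          # immediate propagation to every site

    def read(self, site):
        return self.replicas[site]

class LazyCopyFile:
    # Updates only touch the master; a replica is refreshed when someone reads it.
    def __init__(self, sites):
        self.master = b""
        self.replicas = {s: b"" for s in sites}
        self.stale = set()

    def write(self, data):
        self.master = data
        self.stale = set(self.replicas)      # mark all replicas stale, transfer nothing yet

    def read(self, site):
        if site in self.stale:               # pay the synchronization cost on access
            self.replicas[site] = self.master
            self.stale.discard(site)
        return self.replicas[site]

lazy = LazyCopyFile(["siteA", "siteB", "siteC"])
lazy.write(b"v2")
print(lazy.read("siteA"), len(lazy.stale))   # b'v2' 2 (siteB and siteC never refreshed)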


Chang and Chang (Chang & Chang, 2006) propose an architecture called the Adaptable Replica Consistency Service (ARCS), which deals with the replica consistency problem to provide better performance and load balancing for file replication in Data Grids.
The ARCS architecture works by modifying the two previously described distinct coherence protocols.
Chang and Chang make use of the concept of network regions in (Park et al., 2003) to develop the
scheme. Several grid sites located closely together are organized into a grid group called a grid region.
A Region Server is responsible for the consistency service within a region. Each region is connected
via the Internet.
Each grid region has at most one master replica, and the master replicas are distributed over grid regions. A region server must be aware of the locations of other master replicas to maintain full
consistency among all master replicas. Update modifications are propagated to other connected region
servers for their master replicas with the aid of a file locking mechanism whenever a master replica is
modified in a certain grid region. Thus, a master replica within a grid region has the latest information
all the time. Each secondary replica can update its contents more efficiently in accordance with the
master replica if the region has a master replica. Simulation results show that ARCS is superior to the
coherence protocols described in (Y. Sun & Xu., 2004).
Belalem and Slimani (Belalem & Slimani, 2006, 2007) proposed a hybrid model to manage the consistency of replicas in large scale systems. The model combines two existing approaches (optimistic and
pessimistic (Saito & Levy, 2000; Saito & Shapiro, 2005)) to consistency management. The pessimistic
approach prohibits any access to a replica unless it is provably up to date. The main advantage of this
approach is that all replicas converge at the same time, a fact that guarantees high consistency of data.
Hence, any problem of divergence is avoided. On the contrary, the optimistic approach allows access to
any replica at any time regardless of the state of the replica sets, which might be incoherent. This also
means that the approach can cause replica contents to diverge. Optimistic techniques require a follow-up
phase to detect and correct divergences among replicas by converging them toward a coherent state.
Pessimistic replication and optimistic replication are two contrasting replication models. The work
of Belalem and Slimani tries to benefit from the advantages of both approaches. Optimistic principles
are used to ensure replica consistency within each site in the grid individually. Global consistency,
i.e., consistency between sites, is covered by the application of algorithms inspired by the pessimistic
approach. Their model aims to substantially reduce the communication time between sites to achieve
replica consistency, increase the effectiveness of consistency management, and more importantly, be
adaptive to changes in large systems.
Domenici et al. (Domenici, Donno, Pucciani, & Stockinger, 2006) propose a Replica Consistency
Service, CONStanza, that is general enough to be suitable for most types of applications in a grid environment and which meets the general requirements for grid middleware, such as performance, scalability, reliability, and security. Their proposed replica consistency service allows for replica updates in
a single-master scenario with lazy update synchronization. Two types of replicas are maintained which
have different semantics and access permissions for end-users. The first is a master replica that can be
updated by end-users of the system. The master replica is the one that is, by definition, always up-to-date.
The other is a secondary replica (also referred to as secondary copy) that is updated/synchronized by
CONStanza with a certain delay to eventually have the same contents as the master replica. Obviously,
the longer the update propagation delay is, the more unsynchronized the master and the secondary replica are, and the higher is the probability of experiencing stale reads on secondary replicas. This service

502

The State of the Art and Open Problems in Data Replication in Grid Environments

Figure 6. A taxonomy of algorithms considering data scheduling and associated papers

provides users with the ability to update data using a certain consistency delay parameter (hence relaxed
consistency) to adapt to specific application requirements and tolerances.

Impact of Data Replication on Job Scheduling


Effective scheduling can reduce the amount of data transferred across the Internet by dispatching a job
to where the needed data files are available. Assume a job is scheduled to be executed at a particular
compute node. When job scheduling is coupled to replication and the data has to be fetched from remote
storage, the scheduler can create a copy of the data at the point of computation so that future requests
for the same file that come from the neighborhood of the compute node can be satisfied more quickly.
Further, in the future, any job dealing with that particular file can be preferentially scheduled at that
compute node if it is available. In a decoupled scheduler, the job is scheduled to a suitable computational
resource and a suitable replica location is then identified to request the required data from. In this case,
the storage requirement is transient, that is, disk space is required only for the duration of execution.
Figure 6 shows a taxonomy of replication algorithms considering data scheduling. A comparison of
decoupled against coupled strategies by Ranganathan and Foster (Ranganathan & Foster, 2002) has
shown that decoupled strategies promise increased performance and reduce the complexity of designing
algorithms for Data Grid environments.
He et al. (He, Sun, & Laszewski, 2003) deal with the problem of Integrated Replication and Scheduling (IRS). They couple job scheduling and data scheduling. At the end of periodic intervals when
jobs are scheduled, the popularity of required files is calculated and then used by the data scheduler to
replicate data for the next set of jobs. While these may or may not share the same data requirements as the previous set, there is often a high probability that they will.
The importance of data locality in job scheduling was also realized by Ranganathan and Foster
(Ranganathan & Foster, 2004). They proposed a Data Grid architecture based on three main components: External Scheduler (ES), Local Scheduler (LS) and Dataset Scheduler (DS). An ES receives job
submissions from the user and then decides the remote site to which the job should be sent depending on its scheduling strategy. The LS of each site decides how to schedule all the jobs assigned to it, using its local resources. The DS keeps track of the popularity of each dataset currently available and makes
data replication decisions. Using this architecture, Ranganathan and Foster developed and evaluated
various replication and scheduling strategies. Their results confirmed the importance of data locality in
scheduling jobs.
Dheepak et al. (Dheepak, Ali, Sengupta, & Chakrabarti, 2005) have created several scheduling
techniques based on a developed replication strategy. The scheduling strategies are Matching based Job Scheduling (MJS), Cost based Job Scheduling (CJS) and Latency based Job Scheduling (LJS). In MJS,
the jobs are scheduled to those sites that have the maximum match in terms of data.
For example, if a job requests n files and all of those files are already present at a site, then the amount of data in bytes corresponding to those n files represents the match for the job request at that site.
In CJS, the cost of scheduling a job onto a site is defined as the combined cost of moving the data to the
site, the time to compute the job at the site, and the wait time in the queue at the site. The job is scheduled onto the site which has the minimum cost. Finally, LJS takes the latency experienced into account
before taking the scheduling decision. The cost of scheduling in this case includes the latency involved
in scheduling the current job based on the current data locations, and also the latency involved due to
the current queue. Simulation results show that among the strategies, LJS and CJS perform similarly
and MJS performs less well.
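A minimal sketch of a CJS-style decision is shown below: for each candidate site, the cost is the sum of the time to move the missing data there, the estimated computation time, and the queue wait, and the job goes to the cheapest site. The field names and numbers are invented, and this is not the authors' implementation.

def cjs_cost(job, site):
    # Cost of running `job` at `site`: data movement + computation + queue wait.
    missing_mb = sum(size for f, size in job["files_mb"].items()
                     if f not in site["local_files"])
    data_time = missing_mb / site["inbound_mb_per_s"]
    compute_time = job["work_units"] / site["speed_units_per_s"]
    return data_time + compute_time + site["queue_wait_s"]

def schedule(job, sites):
    # CJS-style choice: send the job to the site with the minimum total cost.
    return min(sites, key=lambda s: cjs_cost(job, s))

job = {"files_mb": {"f1": 800, "f2": 200}, "work_units": 600}
sites = [
    {"name": "s1", "local_files": {"f1", "f2"}, "inbound_mb_per_s": 10,
     "speed_units_per_s": 5, "queue_wait_s": 300},
    {"name": "s2", "local_files": set(), "inbound_mb_per_s": 50,
     "speed_units_per_s": 20, "queue_wait_s": 10},
]
print(schedule(job, sites)["name"])   # s2: 20 + 30 + 10 = 60 beats s1: 0 + 120 + 300 = 420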
Venugopal et al. (Venugopal & Buyya, 2005) propose a scheduling algorithm that considers two
cost metrics: an economic budget and time. The algorithm tries to optimize one of them given a bound
on the other, e.g., spend a budget as small as possible, while not missing any deadline. The incoming
applications consist of a set of independent tasks each of which requires a computational resource and
accesses a number of data sets located on different storage sites. The algorithm assumes every data set
has only one copy in the Grid, so that the resource selection is only for computational resources, taking
into account the communication costs from data storage sites to different computation sites as well as
the actual computation costs. Instead of doing a thorough search in a space whose size is exponential
in the number of data sets requested by a task, the resource selection procedure simply performs a local
search which only guarantees that the current mapping is better than the previous one. In this way, the
cost of the search procedure is linear. The drawback of this strategy is that it is not guaranteed to find a
feasible schedule even if there is one.
As we have seen, data can be decomposed into multiple independent sub datasets and distributed for
parallel execution and access. Most of the existing studies on scheduling in Data Grids do not consider
this possibility which is typical in many data intensive applications. Kim and Weissman (Kim & Weissman, 2004), however, exploit such parallelism to achieve desired performance levels when scheduling
large Data Grid applications. When parallel applications require multiple data files from multiple data
sources, the scheduling problem is challenging in several dimensions: how should data be decomposed, should data be moved to the computation or vice versa, and which computing resources should be used? The
problem can be optimally solved by adding some constraints (e.g., decomposing data into sub datasets of
equal size only). Another approach is to use heuristics such as those based on optimization techniques
(e.g. genetic algorithms, simulated annealing, and tabu search). Kim and Weissman propose a novel
genetic algorithm (GA) based approach to address the scheduling of decomposable Data Grid applications, where communication and computation are considered at the same time. Their proposed algorithm
is novel in two ways. First, it automatically balances load, that is, data in this case, onto communication/
computation resources while generating a near optimal schedule. Second, it does not require a job to
be pre-decomposed. This algorithm is a competitive choice for scheduling large Data Grid applications
in terms of both scheduling overhead and the quality of solutions when compared to other algorithms.
However, this work does not consider the case of multiple jobs competing for shared resources.
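As a flavour of how a GA can search such a decomposition space, the toy example below evolves the fractions of a single decomposable data set assigned to three compute sites so that the estimated makespan (staging plus processing at the slowest site) is minimised. This is only a hedged illustration: Kim and Weissman's encoding is richer (it also selects data sources and handles multiple files and jobs), and all bandwidth and compute figures used here are invented.

    import random
    from typing import List

    TOTAL_BYTES = 10e9                      # size of the decomposable data set
    BANDWIDTH = [50e6, 120e6, 80e6]         # bytes/s towards each compute site (assumed)
    COMPUTE_RATE = [4e6, 2e6, 6e6]          # bytes processed per second at each site (assumed)

    def makespan(fractions: List[float]) -> float:
        """Schedule length: every site stages and processes its share in parallel,
        so the completion time is determined by the slowest site."""
        return max(
            (f * TOTAL_BYTES) / BANDWIDTH[i] + (f * TOTAL_BYTES) / COMPUTE_RATE[i]
            for i, f in enumerate(fractions)
        )

    def normalise(fractions: List[float]) -> List[float]:
        total = sum(fractions)
        return [f / total for f in fractions]

    def random_individual(n: int) -> List[float]:
        return normalise([random.random() for _ in range(n)])

    def crossover(a: List[float], b: List[float]) -> List[float]:
        point = random.randrange(1, len(a))
        return normalise(a[:point] + b[point:])

    def mutate(ind: List[float], rate: float = 0.2) -> List[float]:
        return normalise([f * random.uniform(0.5, 1.5) if random.random() < rate else f
                          for f in ind])

    def ga_decompose(pop_size: int = 40, generations: int = 200) -> List[float]:
        """Evolve a data decomposition (share of the data set per site) that
        minimises the estimated makespan."""
        n = len(BANDWIDTH)
        population = [random_individual(n) for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=makespan)
            parents = population[: pop_size // 2]          # truncation selection
            children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(pop_size - len(parents))]
            population = parents + children
        return min(population, key=makespan)

    if __name__ == "__main__":
        best = ga_decompose()
        print([round(f, 3) for f in best], round(makespan(best), 1), "seconds")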
Tang et al. (M. Tang, Lee, Tang, & Yeo, 2005; M. Tang et al., 2006) propose a Data Grid architecture
supporting efficient data replication and job scheduling. The computing sites are organized into individual domains according to the network structure, and a replica server is placed in each domain. Two
centralized dynamic replication algorithms with different replica placement methods and a distributed
dynamic replication algorithm are proposed. At regular intervals, the dynamic replication algorithms
exploit the data access history for popular data files and compute the replication destinations to improve
data access performance for grid jobs.
Coupled with these replication algorithms, the grid scheduling heuristics of Shortest Turnaround Time
(STT), Least Relative Load and Data Present are proposed. For each incoming job, the STT heuristic
estimates the turnaround time on every computing site and assigns a new job to the site that provides
the shortest turnaround time. The Least Relative Load heuristic assigns a new job to the computing
site that has the least relative load. This scheduling heuristic attempts to balance the workloads for all
computing sites in the Data Grid. Finally, the Data Present heuristic considers the data location as the
major factor when assigning the job. Simulation results demonstrate that the proposed algorithms can
shorten the job turnaround time greatly.
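The three heuristics lend themselves to a compact expression as alternative site-selection rules. The sketch below is our paraphrase of the descriptions above, not code from Tang et al.; job.input_files is assumed to map file names to sizes, and the site attributes (replicas, bandwidth, queued_work, compute_power) are illustrative.

    def shortest_turnaround_time(job, sites):
        """STT: estimate the turnaround time on every site (staging of missing input
        data, the queued work ahead of the job and the job's own run time) and
        assign the job to the site with the minimum estimate."""
        def turnaround(site):
            missing = sum(size for name, size in job.input_files.items()
                          if name not in site.replicas)
            return (missing / site.bandwidth
                    + (site.queued_work + job.work) / site.compute_power)
        return min(sites, key=turnaround)

    def least_relative_load(job, sites):
        """Least Relative Load: send the job to the site whose queued work is the
        smallest relative to its compute power, balancing load across sites."""
        return min(sites, key=lambda s: s.queued_work / s.compute_power)

    def data_present(job, sites):
        """Data Present: favour the site that already holds the largest amount of
        the job's input data."""
        return max(sites, key=lambda s: sum(size for name, size in job.input_files.items()
                                            if name in s.replicas))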
Analyzing earlier work, Dang and Lim (Dang & Lim, 2007) identified two shortcomings.
The first is not considering the relationships among data files and between the data files and jobs.
By replicating a set of files that has a high probability of being used together on nearby resources, they expect
that the jobs using these files will be scheduled to that small area. The second is a limitation in the use
of the Dataset Scheduler (DS) (Ranganathan & Foster, 2004). Instead of just tracking data popularity,
the DS plays the role of an independent scheduler. They propose a tree of data types in which the relationship between data in the same category and the relationship between nearby categories are defined. By
means of this, a correlation between data is extracted. The idea is then to gather data that is related to
a small region so that any job requiring such data will be executed inside that region. This reduces the
cost of transferring data to the job execution site and therefore improves the job execution performance.
Desprez et al. (Desprez & Vernois, 2006) describe an algorithm that combines data management and
scheduling via a steady state approach. Using a model of the grid platform, the number of requests as well
as their distribution, and the number and size of data files, they define a linear programming problem to
satisfy the constraints at every level of the platform in steady-state. The solution of this linear program
provides a placement for the data files on the servers as well as, for each kind of job, the server on which
they should be executed. However, this heuristic approach for approximating an integer solution to the
linear program does not always give the best mapping of data and can potentially give results that are
far from the optimal value of the objective function.
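The following deliberately simplified linear program conveys the shape of such a steady-state formulation; the notation is ours and the constraint set is only a sketch of the kind of model the authors solve, with x_{d,s} indicating whether data file d is placed on server s and alpha_{k,s} the rate of jobs of type k executed on s. It is the relaxation and rounding of the integral placement variables that makes the resulting mapping heuristic.

    \begin{align*}
    \text{maximise}\quad & \rho && \text{(aggregate job throughput)}\\
    \text{subject to}\quad
      & \sum_{s} \alpha_{k,s} = f_k\,\rho && \forall k \ \text{(job type $k$ keeps its share $f_k$ of the requests)}\\
      & \sum_{k} \alpha_{k,s}\, c_k \le C_s && \forall s \ \text{(compute capacity of server $s$)}\\
      & \alpha_{k,s} \le M\, x_{d(k),s} && \forall k,s \ \text{(type $k$ runs on $s$ only if its file $d(k)$ is there)}\\
      & \sum_{d} x_{d,s}\,\mathit{size}_d \le S_s && \forall s \ \text{(storage capacity of server $s$)}\\
      & \alpha_{k,s} \ge 0, \quad x_{d,s} \in \{0,1\}.
    \end{align*}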
Chang et al. (Chang, Chang, & Lin, 2007) developed a job scheduling policy, called Hierarchical
Cluster Scheduling (HCS), and a dynamic data replication strategy, called Hierarchical Replication
Strategy (HRS), to improve data access efficiency in a cluster structured grid. Their HCS scheduling
policy considers the locations of required data, the access cost and the job queue length of a computing
node. HCS uses hierarchical scheduling that takes cluster information into account to reduce the search
time for an appropriate computing node. HRS integrates the previous replication strategy with the job
scheduling policy to increase the chance of accessing data at a nearby node. The probability of scheduling the same type of job to the same cluster will be rather high in their proposed scheduling algorithm,
leading to possible load balancing problems. The consideration of system load balancing with other
scheduling factors will be an important direction for future research. In addition, balancing between data
access time, job execution time, and network capabilities also needs to be studied further.
Some recent work has addressed data movement in task scheduling. The current research has developed
along two directions: allocating the task to where the data is, and moving the data to where the task is.
He and Sun (He & Sun, 2005) incorporate data movement into task scheduling using a newly introduced
data structure called the Data Distance Table (DDT) to measure the dynamic data movement cost, and
integrate this cost into an extended Min-Min (He et al., 2003) scheduling heuristic. A replication algorithm
dynamically adjusts data placement, moving data onto under-utilized sites before any possible load imbalance occurs. Based on the DDT, a data-conscious task scheduling heuristic is introduced to minimize data
access delay. Experimental results show that their data-conscious, dynamically adjusting scheduling heuristic outperforms the general Min-Min technique significantly for data-intensive applications, especially
when the critical data sets are unevenly distributed.
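A minimal sketch of a data-conscious Min-Min step is shown below; it assumes the DDT can be queried as a mapping from (file, host) pairs to estimated transfer times, and the task and host attributes are illustrative. He and Sun's heuristic additionally adjusts replica placement dynamically, which is omitted here.

    def data_conscious_min_min(tasks, hosts, ddt, ready_time):
        """Min-Min extended with data movement cost: the completion time of a task
        on a host adds the transfer times looked up in the Data Distance Table (ddt)
        to the host's ready time and the task's execution time; at every step the
        task whose best completion time is smallest is scheduled first."""
        schedule = {}
        unscheduled = set(tasks)
        while unscheduled:
            best = None                                    # (completion, task, host)
            for task in unscheduled:
                for host in hosts:
                    data_cost = sum(ddt[(f, host)] for f in task.input_files)
                    completion = ready_time[host] + data_cost + task.exec_time[host]
                    if best is None or completion < best[0]:
                        best = (completion, task, host)
            completion, task, host = best
            schedule[task] = host
            ready_time[host] = completion
            unscheduled.remove(task)
        return schedule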
Khanna et al. (Khanna et al., 2006) address the problem of efficient execution of a batch of data-intensive tasks with batch-shared I/O behavior, on coupled storage and compute clusters. They approach
the problem in three stages. The first stage, called sub-batch selection, partitions a batch of tasks into
sub-batches such that the total size of the files required for a sub-batch does not exceed the available
aggregate disk space on the compute cluster. The second stage accepts a sub-batch as input and yields
an allocation of the tasks in the sub-batch onto the nodes of the compute cluster to minimize the sub-batch execution time. The third stage orders the tasks allocated to each node at runtime and dynamically
determines what file transfers need to be done and how they should be scheduled to minimize end-point
contention on the storage cluster.
Two scheduling schemes are proposed to solve this three stage problem. The first approach formulates
the sub-batch selection problem using a 0-1 Integer Programming (IP) formulation. The second stage is
also modeled as a 0-1 IP formulation to determine the mapping of tasks to nodes, source and destination
nodes for all replications, and the destination nodes for all remote transfers. The second approach, called
BiPartition, employs a bi-level hypergraph partitioning based scheduling heuristic that formulates the
sharing of files among tasks as a hypergraph. The BiPartition approach results in slightly longer batch
execution times, but is much faster than the IP based approach. Thus, the IP based approach is attractive
for small workloads, while the BiPartition approach is preferable for large scale workloads.
Lee and Zomaya (Lee & Zomaya, 2006) propose a novel scheduling algorithm called the Shared Input
data based Listing (SIL) algorithm for Data intensive Bag-of-Tasks (DBoT) applications on grids. The
algorithm uses a set of task lists that are constructed taking data sharing patterns into account and that
are reorganized dynamically based on the performance of resources during the execution of the application. The primary goal of this dynamic listing is to minimize data transfers, thereby shortening
the overall completion time of DBoT applications.
The SIL algorithm also attempts to reduce serious increases in schedule length (that occur because of inefficient task/host assignments) by using task duplication. The SIL algorithm consists of two major phases.
The task grouping phase groups tasks into a set of lists based on their data sharing patterns, associates
these task lists with sites, and further breaks and/or associates them with hosts. Then the scheduling
phase assigns tasks to hosts, dynamically reorganizing task lists, and duplicates tasks once all tasks are
scheduled but some are still running.
Additionally, Santos-Neto et al. (Santos-Neto, Cirne, Brasileiro, & Lima, 2004) have developed a
Storage Affinity (SA) algorithm which tries to minimize data transfers by making scheduling decisions
incorporating the location of data previously transferred. In addition, they consider task replication as
soon as a host becomes available between the time the last unscheduled task gets assigned and the
time the last running task completes its execution. The SA algorithm determines task/host assignments
based on a storage affinity metric. The storage affinity of a task to a host is the amount of the task's
input data already stored at the site to which the host belongs. Although the scheduling decision SA
makes is between task and host, storage affinity is calculated between task and site. This is because in
the grid model used for SA, each site in the grid uses a single data repository that is accessed by all the
hosts in the site.
For each scheduling decision, the SA algorithm calculates storage affinity values for all unscheduled
tasks and dispatches the task with the largest storage affinity value. If none of the tasks has a positive
storage affinity value, one of them is scheduled at random. By the time this initial scheduling is completed,
all the hosts will be busy running the same number of tasks. On the completion of any of the running
tasks, the SA algorithm starts task replication. Then each of the remaining running tasks is considered
for replication and the best one is selected. The selection decision is based on the storage affinity value
and the number of replicas available.
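The storage affinity computation and the dispatch rule described above can be sketched as follows. The attribute names are assumptions (task.input_files mapping file names to sizes, site.stored_files as a set, and sites_of mapping each host to its site), and the replication phase of SA is omitted.

    import random

    def storage_affinity(task, site):
        """Bytes of the task's input data already stored at the site's repository."""
        return sum(size for name, size in task.input_files.items()
                   if name in site.stored_files)

    def sa_dispatch(unscheduled_tasks, idle_host, sites_of):
        """Initial SA phase: when a host becomes idle, dispatch the unscheduled task
        with the largest storage affinity to the host's site; if no task has a
        positive affinity, pick one at random."""
        site = sites_of[idle_host]
        best = max(unscheduled_tasks, key=lambda t: storage_affinity(t, site))
        if storage_affinity(best, site) <= 0:
            best = random.choice(list(unscheduled_tasks))
        unscheduled_tasks.remove(best)
        return best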

COMPARISON OF REPLICA PLACEMENT STRATEGIES


In this section, we summarize current and past research on different replica placement techniques for
Data Grid environments. Several important factors such as grid infrastructure, data access patterns, network traffic conditions, and so on are taken into account when choosing a replica placement strategy.
In the presence of diverse and varying grid characteristics it is difficult to create a common ground for
comparison of different strategies. To gain insight into the effectiveness of different replication strategies, we compare them by considering metrics that include response time (access latency), bandwidth consumption, and server workload.
Response Time: This is the time that elapses from when a node sends a request for a file until it receives the complete file. If a local copy of the file exists, the response time is assumed to be zero.
Bandwidth Consumption: This includes the bandwidth consumed by the data transfers that occur when a node requests a file and when a server creates a replica at another node.
Server Workload: This is the amount of work done by the servers. Ideally, the replicas should be placed so that the workload on each server is balanced.

Comparison of Replica Placement Algorithms


We start with the initial work (Ranganathan & Foster, 2001b) on replication strategies proposed for
hierarchical Data Grids. Among these strategies, Fast Spread shows relatively consistent performance
and is best both in terms of access latency and bandwidth consumption given random access patterns.
The disadvantage is that it has high storage requirements. The entire storage space at each tier is fully
utilized by Fast Spread. If, however, there is sufficient locality in the access patterns, Cascading would
work better than the others in terms of both access latency and bandwidth consumption. The Best Client
algorithm is naive and illustrates the worst case performance among those presented in (Ranganathan
& Foster, 2001b).
An improvement to the Cascading technique is the Proportional Share Replica policy (Abawajy,
2004). The method is a heuristic one that places replicas at the optimal locations by assuming that the
number of sites and the total number of replicas to be distributed are already known. Firstly, an ideal load
distribution is calculated and then replicas are placed on candidate sites that can service replica requests
slightly greater than or equal to that ideal load. This technique was evaluated based on mean response
time (mean access latency). Simulation results show that it performs better than the cascading technique
with increased availability of data and considers load sharing among replica servers. Unfortunately, the
approach is unrealistic for most scenarios and is inflexible once placement decisions have been made.
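Our reading of the heuristic can be summarised by the short sketch below: the ideal load per replica is the total request rate divided by the number of replicas, and each replica is placed on the remaining candidate site whose serviceable request rate is the smallest value that still reaches the ideal load. The fallback when no site reaches the ideal load is our own assumption.

    def proportional_share_placement(total_requests, num_replicas, candidate_sites):
        """Proportional Share sketch.  candidate_sites is a list of
        (site_name, serviceable_request_rate) pairs; returns the chosen site names."""
        ideal_load = total_requests / num_replicas
        placements = []
        available = list(candidate_sites)
        for _ in range(num_replicas):
            eligible = [s for s in available if s[1] >= ideal_load]
            # place on the site just able to serve the ideal load; otherwise best effort
            chosen = min(eligible, key=lambda s: s[1]) if eligible else max(available, key=lambda s: s[1])
            placements.append(chosen[0])
            available.remove(chosen)
        return placements

    # Example: 1200 requests per hour shared by 3 replicas gives an ideal load of 400
    print(proportional_share_placement(1200, 3,
          [("siteA", 350), ("siteB", 420), ("siteC", 900), ("siteD", 500)]))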
With the aim of improving the performance of data access given varying workloads, dynamic replication algorithms were presented by Tang et al. (M. Tang, Lee, Yeo, & Tang, 2005). In their paper, two
dynamic replication algorithms, Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU), were
proposed for a multi-tier Data Grid. Their simulation results show that both algorithms can reduce the
average response time of data access significantly compared to static replication methods. ABU can
achieve great performance improvements for all access patterns even if the available storage size of
the replication server is relatively small. Comparing the two algorithms to Fast Spread, the dynamic
replication strategy ABU proves to be superior. As for SBU, although the average response time of Fast
Spread (Ranganathan & Foster, 2001b) is better in most cases, Fast Spread's replication frequency may
be too high to be useful in the real world.
A multi-objective approach to dynamic replication placement exploiting operations research techniques was proposed in (Rahman, Barker, & Alhajj, 2005a). In this method, replica placement decisions
are made considering both the current network status and data request patterns. Dynamic maintainability
is achieved by considering replica relocation cost. Decisions to relocate are made when a performance
metric degrades significantly over a specified number of preceding time periods. Their technique was evaluated
in terms of request-weighted average response time, but the performance results were not compared to
any of the other existing replication techniques.
The BHR (Park et al., 2003) dynamic replication strategy focuses on network-level locality by trying
to place the targeted file at a site that has broad bandwidth to the site of job execution. The BHR strategy
was evaluated in terms of job execution time (which includes access latency) with varying bandwidths
and storage spaces using the OptorSim (W. Bell et al., 2003) simulator.
The simulation results show that it can even outperform aggressive replication strategies like LRU
Delete and Delete Oldest (W. Bell et al., 2003) in terms of data access time especially when grid sites
have relatively small storage capacity and a clear hierarchy of bandwidths.
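The essence of the network-level locality that BHR exploits can be sketched as follows; this is a loose illustration rather than the published algorithm, and the site attributes (region, a set of replicas, free_space) and the bandwidth table are assumptions.

    def bhr_fetch(file_name, file_size, exec_site, sites, bandwidth):
        """Fetch the file from the holder with the highest bandwidth to the executing
        site and, if the executing site's region does not hold a copy yet and space
        allows, keep a replica inside the region so that later jobs there avoid the
        narrow inter-region links."""
        holders = [s for s in sites if file_name in s.replicas]
        source = max(holders, key=lambda s: bandwidth[(s.name, exec_site.name)])
        region_sites = [s for s in sites if s.region == exec_site.region]
        if not any(file_name in s.replicas for s in region_sites):
            target = max(region_sites, key=lambda s: s.free_space)
            if target.free_space >= file_size:
                target.replicas.add(file_name)
                target.free_space -= file_size
        return source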
The work of Lin et al. (2006) is one of the relatively few replication efforts that focus on overall grid
performance. Their proposed placement algorithm, targeted for a tree-based network model, finds optimal locations for replicas so that the workload among the replicas is balanced. They also propose a new
algorithm to determine the minimum number of replicas required when the maximum workload capacity of each replica server is known. All these algorithms ensure that QoS requirements from the users
are satisfied. Subsequent work by Wang et al. (H. Wang et al., 2006) addresses the replica placement
problem when the underlying network is a general graph, instead of a tree. Their experimental results
indicate that their proposed algorithm efficiently finds near-optimal solutions.

SUMMARY AND OPEN PROBLEMS


This survey has reviewed data replication in grid systems, considering the issues and challenges involved
in replication with a primary focus on replica placement which is at the core of all replication strategies.
Although replication in parallel and distributed systems has been intensively studied, new challenges
in grid environments make replication an interesting ongoing topic and many research efforts are underway in this area. We have identified heterogeneity, dynamism, system reliability and availability, and
the impact of data replication on scheduling as the primary challenges addressed by current research in
grid replication. We also find that the evolution of Data Grid architectures (e.g., support for a variety of
grid structure models, fragmented replicas, co-allocation mechanisms and data sharing) provides an opportunity to implement sophisticated data replication algorithms with specific benefits. In addition
to enhancements to classic replication algorithms, new methodologies have been applied, such as grid
economic models and nature inspired heuristics (e.g., genetic and ant algorithms).
Due to the characteristics of grid systems and the challenges involved in replication, there are still
many open issues related to data replication on the grid. Without any specific assumptions, we find the
following general issues deserving of further exploration.

Fragmented Replication

The use of fragmented replicas in replica placement and selection is a recent
research trend. As mentioned when discussing algorithms that use co-allocation methods, the problem
with current strategies that deal with fragmented replicas is increased complexity. Usually, the blocks in a
fragmented replica are considered to be contiguous. If they were not, then the data structure to represent
the fragmented replica and the algorithm for retrieval would be more complicated. Also, the proposed
algorithms (Chang & Chen, 2007) do not always find an optimal solution. It would be interesting to
determine whether a worst-case performance bound exists for the algorithms. Finding efficient ways to handle
fragmented replica updates would also be an interesting area for future research.

Algorithms that are Adaptive to Performance Variation


It will likely be important to come up with a suite of adaptive job placement and data movement algorithms that can dynamically select strategies depending on current and predicted grid conditions. The
limitations of current rescheduling algorithms for Data Grids are high cost and lack of consideration of
dependent tasks. For jobs whose turn-around times are large, rescheduling can improve performance
dramatically. However, rescheduling is itself costly, especially when there are extra data dependencies
among tasks compared to independent applications. In addition, many related problems also must be
considered. For example, it must be decided when the rescheduling mechanisms should be invoked, what measurable parameters should be used to decide whether rescheduling will be profitable, and where tasks should be
migrated. Research on rescheduling for Data Grids is largely an open field for future work.

Enhanced Algorithms Combining Computation and Data Scheduling


Only a handful of current research efforts consider the simultaneous optimization of computation and data
transfer scheduling, which suggests possible opportunities for future work. Consideration of data staging
in grid scheduling has an impact on the choice of a computational node for a task. The situation could
be far more complex if there were multiple copies of data, and data dependencies among the tasks were
considered. As discussed, the work in (Kim & Weissman, 2004) on scheduling decomposable Data Grid
applications does not consider the case of multiple jobs competing for shared resources, which would be
an interesting topic for future research. Also, combined computation and data scheduling may lead to
possible load balancing problems (e.g., the probability of scheduling the same type of job to the same
cluster is high in the scheduling algorithm proposed in (Chang et al., 2007)). Thus, consideration of
system load balancing with different scheduling factors will be an important future research direction.

New Models of Grid Architecture


Grid-like complex distributed environments cannot always be organized and controlled in a hierarchical
manner. Any central directory service would inevitably become a performance bottleneck and a single
point of failure. Rather, in the future, many of these systems will likely be operated in a self-organizing
way, using replicated catalogs and a mechanism for the autonomous generation and placement of replicas
at different sites. As discussed earlier, one open question for replica placement in such environments is
how to determine replica locations when the network is a general graph, instead of a tree. It is important
to consider the properties of such other graphs and derive efficient algorithms for use with them. The
design of efficient algorithms for replica placement in grid systems when network congestion is one of
the objective functions to be optimized also needs to receive further consideration.

Increased Collaboration Using VO-based Data Grids


Foster et al. (Foster, Kesselman, & Tuecke, 2001) have proposed a grid architecture for resource sharing
among different entities based around the concept of Virtual Organizations (VOs). A VO is formed
when different organizations pool resources and collaborate to achieve a common goal. A VO defines
the resources available for the participants and the rules for accessing and using the resources and the
conditions under which the resources may be used. A VO also provides protocols and mechanisms for
applications to determine the suitability and accessibility of available resources. The existence of VOs
impacts the design of Data Grid architectures in many ways. For example, a VO may be stand-alone
or may be composed of a hierarchy of regional, national and international VOs. In the latter case, the
underlying Data Grid may have a corresponding hierarchy of repositories and the replica discovery and
management system might be structured accordingly. More importantly, sharing of data collections is
guided by the relationships that exist between the VOs that own each of the collections.
While Data Grids may be built around VOs, current technologies do not provide many of the capabilities required for enabling collaboration between participants. For example, the tree structure of
many replication mechanisms inhibits direct copying of data between participants that reside on different branches. Replication systems, therefore, will likely need to follow hybrid topologies that involve
peer-to-peer links between different branches for enhanced collaboration.
With the use of VOs, efforts have moved towards community-based scheduling in which schedulers follow policies that are set at the VO level and enforced at the resource level through service level
agreements and allocation quotas (Dumitrescu & Foster, 2004). Since communities are formed by pooling of resources by participants, resource allocation must ensure fair shares to everyone. This requires
community-based schedulers that assign quotas to each of the users based on priorities and resource
availability. Individual user schedulers should then submit jobs taking into account the assigned quotas
and could negotiate with the central scheduler for a quota increase or change in priorities. It could also
be possible to swap or reduce quotas to gain resource share in the future. Users are able to plan ahead
for future resource requirements by advance reservation of resources. This community-based scheduling
combined with enhanced Data Grid capabilities for collaboration will introduce new challenges to
efficient replica placement in Data Grids and also the need to reduce replication cost.

CONCLUSION
Data Grids are being adopted widely for sharing data and collaboratively managing and executing large-scale scientific applications that process large data sets, some of which are distributed around the world.
However, ensuring efficient and fast access to such huge and widely distributed data is hindered by the
high latencies of the Internet upon which many Data Grids are built. Replication of data is the most common solution to this problem. In this chapter, we have studied, characterized, and categorized the issues
and challenges involved in such data replication systems. In doing so, we have tried to provide insight
into the architectures, strategies and practices that are currently used in Data Grids for data replication.
Also, through our characterization, we have attempted to highlight some of the shortcomings in
the work done and identify gaps in the current architectures and strategies. These represent some of the
directions for future research in this area. This chapter provides a comprehensive study of replication in
Data Grids that should not only serve as a tool for understanding this area but also provide a reference by
which future efforts can be classified.

REFERENCES
Abawajy, J. (2004). Placement of file replicas in data grid environments. In Proceedings of international conference on computational science (Vol. 3038, pp. 66–73).
Allcock, W. (2003, Mar). GridFTP protocol specification. Global Grid Forum Recommendation GFD.20.
Baker, M., Buyya, R., & Laforenza, D. (2002). Grids and grid technologies for wide-area distributed computing. Software: Practice and Experience, 32, 1437–1466. doi:10.1002/spe.488
Belalem, G., & Slimani, Y. (2006). A hybrid approach for consistency management in large scale systems. In Proceedings of the international conference on networking and services (pp. 71–76).
Belalem, G., & Slimani, Y. (2007). Consistency management for data grid in OptorSim simulator. In Proceedings of the international conference on multimedia and ubiquitous engineering (pp. 554–560).
Bell, W., Cameron, D., Capozza, L., Millar, P., Stockinger, K., & Zini, F. (2003). Optorsim - A grid simulator for studying dynamic data replication strategies. International Journal of High Performance Computing Applications, 17, 403–416. doi:10.1177/10943420030174005
Bell, W. H., Cameron, D. G., Carvajal-Schiaffino, R., Millar, A. P., Stockinger, K., & Zini, F. (2003). Evaluation of an economy-based file replication strategy for a data grid. In Proceedings of the 3rd IEEE/ACM international symposium on cluster computing and the grid.
Chang, R., & Chang, J. (2006). Adaptable replica consistency service for data grids. In Proceedings of the third international conference on information technology: New generations (ITNG'06) (pp. 646–651).
Chang, R., Chang, J., & Lin, S. (2007). Job scheduling and data replication on data grids. Future Generation Computer Systems, 23(7), 846–860. doi:10.1016/j.future.2007.02.008
Chang, R., & Chen, P. (2007). Complete and fragmented replica selection and retrieval in data grids. Future Generation Computer Systems, 23(4), 536–546. doi:10.1016/j.future.2006.09.006
Chang, R., Wang, C., & Chen, P. (2005). Replica selection on co-allocation data grids. In Proceedings of the second international symposium on parallel and distributed processing and applications (Vol. 3358, pp. 584–593).
Chervenak, A. (2002). Giggle: A framework for constructing scalable replica location services. In Proceedings of the IEEE supercomputing (pp. 1–17).
Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2000). The Data Grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23, 187–200. doi:10.1006/jnca.2000.0110
Dang, N. N., & Lim, S. B. (2007). Combination of replication and scheduling in data grids. International Journal of Computer Science and Network Security, 7(3).
Desprez, F., & Vernois, A. (2006). Simultaneous scheduling of replication and computation for data-intensive applications on the grid. Journal of Grid Computing, 4(1), 66–74. doi:10.1007/s10723-005-9016-2
Dheepak, R., Ali, S., Sengupta, S., & Chakrabarti, A. (2005). Study of scheduling strategies in a dynamic data grid environment. In Distributed Computing - IWDC 2004 (Vol. 3326). Berlin: Springer.
Domenici, A., Donno, F., Pucciani, G., & Stockinger, H. (2006). Relaxed data consistency with CONStanza. In Proceedings of the sixth IEEE international symposium on cluster computing and the grid (pp. 425–429).
Domenici, A., Donno, F., Pucciani, G., Stockinger, H., & Stockinger, K. (2004, Nov). Replica consistency in a Data Grid. Nuclear Instruments and Methods in Physics Research, 534, 24–28. doi:10.1016/j.nima.2004.07.052
Dorigo, M. (1992). Optimization, learning and natural algorithms (Tech. Rep.). Ph.D. Thesis, Politecnico di Milano, Milan, Italy.
Dumitrescu, C., & Foster, I. (2004). Usage policy-based CPU sharing in virtual organizations. In Proceedings of the fifth IEEE/ACM international workshop on grid computing (pp. 53–60).
Duvvuri, V., Shenoy, P., & Tewari, R. (2000). Adaptive leases: A strong consistency mechanism for the World Wide Web. In Proceedings of IEEE INFOCOM (pp. 834–843).


Elias, J. A., & Moldes, L. N. (2002a). Behaviour of the fast consistency algorithm in the set of replicas with multiple zones with high demand. In Proceedings of symposium in informatics and telecommunications.
Elias, J. A., & Moldes, L. N. (2002b). A demand based algorithm for rapid updating of replicas. In Proceedings of IEEE workshop on resource sharing in massively distributed systems (pp. 686–691).
Elias, J. A., & Moldes, L. N. (2003). Generalization of the fast consistency algorithm to a grid with multiple high demand zones. In Proceedings of international conference on computational science (ICCS 2003) (pp. 275–284).
Foster, I. (2006). Globus toolkit version 4: Software for service-oriented systems. In Proceedings of the international conference on network and parallel computing (pp. 2–13).
Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. International Journal of High Performance Computing Applications, 15(3), 200–222. doi:10.1177/109434200101500302
Golding, R. A. (1992, Dec). Weak-consistency group communication and membership (Tech. Rep.). Ph.D. Thesis, Computer and Information Sciences, University of California.
Hakami, S. (1999). Optimum location of switching centers and the absolute centers and medians of a graph. Operations Research, 12, 450–459. doi:10.1287/opre.12.3.450
He, X., & Sun, X. (2005). Incorporating data movement into grid task scheduling. In Proceedings of grid and cooperative computing (pp. 394–405).
He, X., Sun, X., & Laszewski, G. (2003). QoS guided Min-Min heuristic for grid task scheduling. Journal of Computer Science and Technology, Special Issue on Grid Computing, 18(4).
Kesselman, C., & Foster, I. (1998). The Grid: Blueprint for a new computing infrastructure. San Francisco: Morgan Kaufmann Publishers.
Khanna, G., Vydyanathan, N., Catalyurek, U., Kurc, T., Krishnamoorthy, S., Sadayappan, P., et al. (2006). Task scheduling and file replication for data-intensive jobs with batch-shared I/O. In Proceedings of high-performance distributed computing (HPDC) (pp. 241–252).
Kim, S., & Weissman, J. B. (2004). A genetic algorithm based approach for scheduling decomposable data grid applications. In Proceedings of international conference on parallel processing (Vol. 1, pp. 405–413).
Lamehamedi, H., Szymanski, B., Shentu, Z., & Deelman, E. (2002). Data replication strategies in grid environments. In Proceedings of the fifth international conference on algorithms and architectures for parallel processing (pp. 378–383).
Lamehamedi, H., Szymanski, B., Shentu, Z., & Deelman, E. (2003). Simulation of dynamic data replication strategies in data grids. In Proceedings of the international parallel and distributed processing symposium (pp. 1020).


Lee, Y. C., & Zomaya, A. Y. (2006). Data sharing pattern aware scheduling on grids. In Proceedings of International Conference on Parallel Processing (pp. 365–372).
Lei, M., & Vrbsky, S. V. (2006). A data replication strategy to increase data availability in data grids. In Proceedings of the international conference on grid computing and applications (pp. 221–227).
Lin, Y., Liu, P., & Wu, J. (2006). Optimal placement of replicas in data grid environments with locality assurance. In Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS'06) (Vol. 1, pp. 465–474).
Park, S., Kim, J., Ko, Y., & Yoon, W. (2003). Dynamic data grid replication strategy based on Internet hierarchy. In Proceedings of the second international workshop on grid and cooperative computing (GCC2003).
Rahman, R. M., Barker, K., & Alhajj, R. (2005). Replica selection in grid environment: A data-mining approach. In Proceedings of the ACM symposium on applied computing (pp. 695–700).
Rahman, R. M., Barker, K., & Alhajj, R. (2005a). Replica placement in data grid: A multi-objective approach. In Proceedings of the international conference on grid and cooperative computing (pp. 645–656).
Rahman, R. M., Barker, K., & Alhajj, R. (2005b). Replica placement in data grid: Considering utility and risk. In Proceedings of the international conference on information technology: Coding and computing (ITCC'05) (Vol. 1, pp. 354–359).
Ranganathan, K., & Foster, I. (2001a). Design and evaluation of dynamic replication strategies for a high performance data grid. In Proceedings of the international conference on computing in high energy and nuclear physics (pp. 260–263).
Ranganathan, K., & Foster, I. (2002). Decoupling computation and data scheduling in distributed data-intensive applications. In Proceedings of the 11th international symposium on high performance distributed computing (HPDC) (pp. 352–358).
Ranganathan, K., & Foster, I. (2003). Simulation studies of computation and data scheduling algorithms for data grids. Journal of Grid Computing, 1(1), 53–62. doi:10.1023/A:1024035627870
Ranganathan, K., & Foster, I. T. (2001b). Identifying dynamic replication strategies for a high-performance data grid. In Proceedings of the International Workshop on Grid Computing (GRID2001) (pp. 75–86).
Ranganathan, K., Iamnitchi, A., & Foster, I. (2002). Improving data availability through dynamic model-driven replication in large peer-to-peer communities. In Proceedings of the 2nd IEEE/ACM international symposium on cluster computing and the grid (CCGRID'02) (pp. 376–381).
Revees, C. (1993). Modern heuristic techniques for combinatorial problems. Oxford, UK: Blackwell Scientific Publication.
Saito, Y., & Levy, H. M. (2000). Optimistic replication for internet data services. In Proceedings of international symposium on distributed computing (pp. 297–314).


Saito, Y., & Shapiro, M. (2005). Optimistic replication. ACM Computing Surveys, 37(1), 42–81. doi:10.1145/1057977.1057980
Santos-Neto, E., Cirne, W., Brasileiro, F., & Lima, A. (2004). Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids. In Proceedings of 10th workshop on job scheduling strategies for parallel processing (Vol. 3277, pp. 210–232).
Schintke, F., & Reinefeld, A. (2003). Modeling replica availability in large data grids. Journal of Grid Computing, 1(2), 219–227. doi:10.1023/B:GRID.0000024086.50333.0d
Sun, M., Sun, J., Lu, E., & Yu, C. (2005). Ant algorithm for file replica selection in data grid. In Proceedings of the first international conference on semantics, knowledge, and grid (SKG 2005) (pp. 64–66).
Sun, Y., & Xu, Z. (2004). Grid replication coherence protocol. In Proceedings of the 18th international parallel and distributed processing symposium (pp. 232–239).
Tang, M., Lee, B., Tang, X., & Yeo, C. K. (2005). Combining data replication algorithms and job scheduling heuristics in the data grid. In Proceedings of European conference on parallel computing (pp. 381–390).
Tang, M., Lee, B., Yeo, C., & Tang, X. (2005). Dynamic replication algorithms for the multi-tier data grid. Future Generation Computer Systems, 21(5), 775–790. doi:10.1016/j.future.2004.08.001
Tang, M., Lee, B., Yeo, C., & Tang, X. (2006). The impact of data replication on job scheduling performance in the data grid. Future Generation Computer Systems, 22(3), 254–268. doi:10.1016/j.future.2005.08.004
Tang, X., & Xu, J. (2005). QoS-aware replica placement for content distribution. IEEE Transactions on Parallel and Distributed Systems, 16(10), 921–932. doi:10.1109/TPDS.2005.126
Vazhkudai, S. (2003, Nov). Enabling the co-allocation of grid data transfers. In Proceedings of the fourth international workshop on grid computing (pp. 41–51).
Vazhkudai, S., Tuecke, S., & Foster, I. (2001). Replica selection in the globus data grid. In Proceedings of the first IEEE/ACM international conference on cluster computing and the grid (CCGRID 2001) (pp. 106–113).
Venugopal, S., & Buyya, R. (2005, Oct). A deadline and budget constrained scheduling algorithm for e-science applications on data grids. In Proceedings of the 6th international conference on algorithms and architectures for parallel processing (ICA3PP-2005) (pp. 60–72).
Venugopal, S., Buyya, R., & Ramamohanarao, K. (2006). A taxonomy of data grids for distributed data sharing, management, and processing. ACM Computing Surveys, 38(1), 1–53.
Wang, C., Hsu, C., Chen, H., & Wu, J. (2006). Efficient multi-source data transfer in data grids. In Proceedings of the sixth IEEE international symposium on cluster computing and the grid (CCGRID'06) (pp. 421–424).
Wang, H., Liu, P., & Wu, J. (2006). A QoS-aware heuristic algorithm for replica placement. Journal of Grid Computing, 96–103.


Yang, C., Yang, I., Chen, C., & Wang, S. (2006). Implementation of a dynamic adjustment mechanism with efficient replica selection in data grid environments. In Proceedings of the ACM symposium on applied computing (pp. 797–804).
Zhou, X., Kim, E., Kim, J. W., & Yeom, H. Y. (2006). ReCon: A fast and reliable replica retrieval service for the data grid. In Proceedings of IEEE international symposium on cluster computing and the grid (pp. 446–453).

KEY TERMS AND THEIR DEFINITIONS


Access Latency: Access latency is the time that elapses from when a node sends a request for a file
until it receives the complete file.
Data Grids: Data Grids primarily deal with providing services and infrastructure for distributed
data-intensive applications that need to access, transfer and modify massive datasets stored in distributed
storage resources.
Job Scheduling: Job scheduling assigns incoming jobs to compute nodes in such a way that some
evaluative conditions are met, such as the minimization of the overall execution time of the jobs.
Replica Consistency: The replica consistency problem deals with the update synchronization of
multiple copies (replicas) of a file.
Replica Placement: The replica placement service is the component of a Data Grid architecture that
decides where in the system a file replica should be placed.
Replica Selection: A replica selection service discovers the available replicas and selects the best
replica that matches the user's location and quality of service (QoS) requirements.
Replication: Replication is an important technique to speed up data access for Data Grid systems by
replicating the data in multiple locations, so that a user can access the data from a site in his vicinity.

ENDNOTE

1. A replica may be a complete or a partial copy of the original dataset.

Chapter 23

Architectural Elements of
Resource Sharing Networks
Marcos Dias de Assunção
The University of Melbourne, Australia
Rajkumar Buyya
The University of Melbourne, Australia

ABSTRACT
This chapter first presents taxonomies on approaches for resource allocation across resource sharing
networks such as Grids. It then examines existing systems and classifies them under their architectures,
operational models, support for the life-cycle of virtual organisations, and resource control techniques.
Resource sharing networks have been established and used for various scientific applications over the
last decade. The early ideas of Grid computing have foreseen a global and scalable network that would
provide users with resources on demand. In spite of the extensive literature on resource allocation and
scheduling across organisational boundaries, these resource sharing networks mostly work in isolation, thus contrasting with the original idea of Grid computing. Several efforts have been made towards
providing architectures, mechanisms, policies and standards that may enable resource allocation across
Grids. A survey and classification of these systems are relevant for the understanding of different approaches utilised for connecting resources across organisations and virtualisation techniques. In addition, a classification also sets the ground for future work on inter-operation of Grids.

INTRODUCTION
Since the formulation of the early ideas on meta-computing (Smarr & Catlett, 1992), several research
activities have focused on mechanisms to connect worldwide distributed resources. Advances in distributed computing have enabled the creation of Grid-based resource sharing networks such as TeraGrid
(Catlett, Beckman, Skow, & Foster, 2006) and Open Science Grid (2005). These networks, composed
of multiple resource providers, enable collaborative work and sharing of resources such as computers,
storage devices and network links among groups of individuals and organisations. These collaborations,
widely known as Virtual Organisations (VOs) (Foster, Kesselman, & Tuecke, 2001), require resources
from multiple computing sites. In this chapter we focus on networks established by organisations to
share computing resources.
Despite the extensive literature on resource allocation and scheduling across organisational boundaries (Butt, Zhang, & Hu, 2003; Grimme, Lepping, & Papaspyrou, 2008; Iosup, Epema, Tannenbaum,
Farrellee, & Livny, 2007; Ranjan, Rahman, & Buyya, 2008; Fu, Chase, Chun, Schwab, & Vahdat, 2003;
Irwin et al., 2006; Peterson, Muir, Roscoe, & Klingaman, 2006; Ramakrishnan et al., 2006; Huang, Casanova, & Chien, 2006), existing resource sharing networks mostly work in isolation and with different
utilisation levels (Assunção, Buyya, & Venugopal, 2008; Iosup et al., 2007), thus contrasting with the
original idea of Grid computing (Foster et al., 2001). The early ideas of Grid computing have foreseen
a global and scalable network that would provide users with resources on demand.
We have previously demonstrated that there can exist benefits for Grids to share resources with one
another, such as reducing the costs incurred by over-provisioning (Assunção & Buyya, in press). Hence, it
is relevant to survey and classify existing work on mechanisms that can be used to interconnect resources
from multiple Grids. A survey and classification of these systems are important in order to understand
the different approaches utilised for connecting resources across organisations and to set the ground for
future work on inter-operation of resource sharing networks, such as Grids. Taxonomies on resource
management systems for resource sharing networks have been proposed (Iosup et al., 2007; Grit, 2005).
Buyya et al. (2000) and Iosup et al. (2007) have described the architectures used by meta-scheduler systems and how jobs are directed to the resources where they execute. Grit (2005) has classified the roles
of intermediate parties, such as brokers, in resource allocation for virtual computing environments.
This chapter extends existing taxonomies, thus making the following contributions:

It examines additional systems and classifies them under a larger property spectrum, namely resource control techniques, scheduling considering virtual organisations, and arrangements for resource sharing.
It provides classifications and a survey of work on resource allocation and scheduling across
organisations, such as centralised scheduling, meta-scheduling and resource brokering in Grid
computing. This survey aims to show different approaches to federate organisations in a resource
sharing network and to allocate resources to its users. We also present a mapping of the surveyed
systems against the proposed classifications.

BACKGROUND
Several of the organisational models followed by existing Grids are based on the idea of VOs. The VO
scenario is characterised by resource providers offering different shares of resources to different VOs
via some kind of agreement or contract; these shares are further aggregated and allocated to users and
groups within each VO. The life-cycle of a VO can be divided into four distinct phases namely creation,
operation, maintenance, and dissolution. During the creation phase, an organisation looks for collaborators and then selects a list of potential partners to start the VO. The operation phase is concerned with
resource management, task distribution, and usage policy enforcement (Wasson & Humphrey, 2003;
Dumitrescu & Foster, 2004). The maintenance phase deals with the adaptation of the VO, such as
allocation of additional resources according to its users' demands. The VO dissolution involves legal and
economic issues such as determining the success or failure of the VO, intellectual property and revocation of access and usage privileges.
The problem of managing resources within VOs in Grid computing is further complicated by the fact
that resource control is generally performed at the job level. Grid-based resource sharing networks have
users with units of work to execute, also called jobs; some entities decide when and where these jobs
will execute. The task of deciding where and when to run the users' work units is termed scheduling.
The resources contributed by providers are generally clusters of computers and the scheduling in these
resources is commonly performed by Local Resource Management Systems (LRMSs) such as PBS (2005)
and SGE (Bulhões, Byun, Castrapel, & Hassaine, 2004). Scheduling of Grid users' applications and
allocation of resources contributed by providers is carried out by Grid Resource Management Systems
(GRMSs). A GRMS may comprise components such as:

Meta-schedulers, which communicate with LRMSs to place jobs at the provider sites;
Schedulers that allocate resources considering how providers and users are organised in virtual
organisations (Dumitrescu & Foster, 2005); and
Resource brokers, which represent users or organisations by scheduling and managing job execution on their behalf.

These components interact with providers' LRMSs either directly or via interfaces provided by the
Grid middleware. The Grid schedulers can communicate with one another in various ways, which include
via sharing agreements, hierarchical scheduling, Peer-to-Peer (P2P) networks, among others.
Recently, utility data centres have deployed resource managers that allow the partitioning of physical resources and the allocation of raw resources that can be customised with the operating system and
software of the users' preference. This partitioning is made possible by virtualisation technologies such
as Xen (Barham et al., 2003; Padala et al., 2007) and VMWare1. The use of virtualisation technologies
for resource allocation enables the creation of customised virtual clusters (Foster et al., 2006; Chase,
Irwin, Grit, Moore, & Sprenkle, 2003; Keahey, Foster, Freeman, & Zhang, 2006). The use of virtualisation technology allows for another form of resource control termed containment (Ramakrishnan et al.,
2006), in which remote resources are bound to the user's local computing site on demand. The resource
shares can be exchanged across sites by intermediate parties. Thereby, a VO can allocate resources on
demand from multiple resource providers and bind them to a customised environment, while maintaining it isolated from other VOs (Ramakrishnan et al., 2006).
In the following sections, we classify existing systems according to their support for the life-cycle of
VOs, their resource control techniques and the mechanisms for inter-operation with other systems. We
also survey representative work and map them according to the proposed taxonomies.

CLASSIFICATIONS FOR GRID RESOURCE MANAGEMENT SYSTEMS


Buyya et al. (2000) and Iosup et al. (2007) have classified systems according to their architectures and
operational models. We present their taxonomy in this section because it classifies the way that schedulers can be organised in a resource sharing network. We have included a new operational model to the
taxonomy (i.e. hybrid of job routing and job pulling). Moreover, systems with similar architecture may
still differ in terms of the mechanisms employed for resource sharing, the self-interest of the system's
participants, and the communication model. A Grid system can use decentralised scheduling wherein
schedulers communicate their decisions with one another in a co-operative manner, thus guaranteeing the
maximisation of the global utility of the system. On the other hand, a broker may represent a particular
user community within the Grid, have contracts with other brokers in order to use the resources they
control, and allocate resources that maximise its own utility (generally given by the achieved profit). We
classify the arrangements between brokers in this section. Furthermore, systems can also differ according to their resource control techniques and support for different stages of the VO life-cycle. This section
classifies resource control techniques and the systems' support for virtual organisations. The attributes
of GRMSs and the taxonomy are summarised in Figure 1.

Figure 1. Taxonomy on Grid resource management systems

Architecture and Operational Models of GRMSs


This section describes several manners in which schedulers and brokers can be organised in Grid systems.
Iosup et al. (2007) considered a multiple-cluster scenario and classified the architectures possibly used
as Grid resource management systems. They classified the architectures into the following categories:

Independent clusters - each cluster has its LRMS and there is no meta-scheduler component.
Users submit their jobs to the clusters of the organisations to which they belong or on which they
have accounts. We extend this category by including single-user Grid resource brokers. In this
case, the user sends her jobs to a broker, which on behalf of the user submits jobs to clusters the
user can access.
Centralised meta-scheduler - there is a centralised entity to which jobs are forwarded. Jobs are
then sent by the centralised entity to the clusters where they are executed. The centralised component is responsible for determining which resources are allocated to the job and, in some cases,
for migrating jobs if the load conditions change.
Hierarchical meta-scheduler - schedulers are organised in a hierarchy. Jobs arrive either at the
root of the hierarchy or at the LRMSs.
Distributed meta-scheduler - cluster schedulers can share jobs that arrive at their LRMSs with
one another. Links can be defined either in a static manner (i.e. by the system administrator at the
system's startup phase) or in a dynamic fashion (i.e. peers are selected dynamically at runtime).
Grit (2007) discusses the types of contracts that schedulers (or brokers) can establish with one
another.
Hybrid distributed/hierarchical meta-scheduler - each Grid site is managed by a hierarchical
meta-scheduler. Additionally, the root meta-schedulers can share the load with one another.

This classification is comprehensive since it captures the main forms through which schedulers and
brokers can be organised in resource sharing networks. However, some categories can be further extended. For example, the site schedulers can be organised in several decentralised ways and use varying
mechanisms for resource sharing, such as a mesh network in which contracts are established between
brokers (Irwin et al., 2006; Fu et al., 2003) or via a P2P network with a bartering-inspired economic
mechanism for resource sharing (Andrade, Brasileiro, Cirne, & Mowbray, 2007).
Iosup et al. also classified a group of systems according to their operational model; the operational
model corresponds to the mechanism that ensures jobs entering the system arrive at the resource in which
they run. They have identified three operational models:

Job routing, whereby jobs are routed by the schedulers from the arrival point to the resources
where they run through a push operation (scheduler-initiated routing);
Job pulling, through which jobs are pulled from a higher-level scheduler by resources (resource-initiated routing); and
Matchmaking, wherein jobs and resources are connected to one another by the resource manager,
which acts as a broker matching requests from both sides.

We add a fourth category to the classification above in which the operational model can be a hybrid
of job routing and job pulling. Examples of such cases include those that use a job pool to (from) which
jobs are pushed (pulled) by busy (unoccupied) site schedulers (Grimme et al., 2008). (See Figure 2).


Figure 2. Architecture models of GRMSs

Arrangements Between Brokers in Resource Sharing Networks


This section describes the types of arrangements that can be established between clusters in resource
sharing networks when decentralised or semi-decentralised architectures are in place. It is important to
distinguish the way links between sites are established and their communication pattern from
the mechanism used for negotiating the resource shares. We classify the work according to the communication model in the following categories:


P2P network - the sites of the resource sharing network are peers in a P2P network. They use the
network to locate sites where the jobs can run (Butt et al., 2003; Andrade, Cirne, Brasileiro, &
Roisenberg, 2003).
Bilateral sharing agreements - sites establish bilateral agreements through which a site can
locate another suitable site to run a given job. The redirection or acceptance of jobs occurs only
between sites that have a sharing agreement (Epema, Livny, Dantzig, Evers, & Pruyne, 1996).
Shared spaces - sites co-ordinate resource sharing via shared spaces such as federation directories
and tuple spaces (Grimme et al., 2008; Ranjan et al., 2008).
Transitive agreements - this is similar to bilateral agreements. However, a site can utilise resources from another site with which it has no direct agreement (Fu et al., 2003; Irwin et al.,
2006).

Although existing work can present similar communication models or similar organisational forms
for brokers or schedulers, the resource sharing mechanisms can differ. The schedulers or brokers can
use mechanisms for resource sharing from the following categories:

System centric - the mechanism is designed with the goal of maximising the overall utility of the
participants. Such mechanisms aim to, for example, balance the load between sites (Iosup et al.,
2007) and prevent free-riding (Andrade et al., 2007).
Site centric - brokers and schedulers are driven by the interest of maximising the utility of the
participants within the site they represent without the explicit goal of maximising the overall utility across the system (Butt et al., 2003; Ranjan, Harwood, & Buyya, 2006).
Self-interested - brokers act with the goal of maximising their own utility, generally given by
profit, yet satisfying the requirements of their users. They do not take into account the utility of
the whole system (Irwin et al., 2006).

Resource Control Techniques


The emergence of virtualisation technologies has resulted in the creation of testbeds wherein multiple-site slices (i.e. multiple-site containers) are allocated to different communities (Peterson et al., 2006). In
this way, slices run concurrently and are isolated from each other. This approach, wherein resources are
bound to a virtual execution environment or workspace where a service or application can run, is termed
here as a container model. Most existing Grid middleware employs a job model in which jobs are
routed until they reach the sites' local batch schedulers for execution. It is clear that both models can
co-exist, thus an existing Grid technology can be deployed in a workspace enabled by container-based
resource management (Ramakrishnan et al., 2006; Montero, Huedo, & Llorente, 2008). We classify
systems in the following categories:

Job model - this is the model currently utilised by most of the Grid systems. The jobs are directed
or pulled across the network until they arrive at the nodes where they are finally executed.
Container-based - resource managers in this category can manage a cluster of computers within
a site by means of virtualisation technologies (Keahey et al., 2006; Chase et al., 2003). They bind
resources to virtual clusters or workspaces according to a customer's demand. They commonly
provide an interface through which one can allocate a set of nodes (generally virtual machines)
and configure them with the operating system and software of choice.

Single-site - these container-based resource managers allow the user to create a customised
virtual cluster using shares of the physical machines available at the site. These resource
managers are termed here as single-site because they usually manage the resources of one
administrative site (Fontán, Vázquez, Gonzalez, Montero, & Llorente, 2008; Chase et al., 2003), although they can be extended to enable container-based resource control at multiple
sites (Montero et al., 2008).
Multiple-site - existing systems utilise the features of single-site container-based resource
managers to create networks of virtual machines on which an application or existing Grid
middleware can be deployed (Ramakrishnan et al., 2006). These networks of virtual machines are termed here as multiple-site containers because they can comprise resources
bound to workspaces at multiple administrative sites. These systems allow a user to allocate
resources from multiple computing sites thus forming a network of virtual machines or a
multiple-site container (Irwin et al., 2006; Shoykhet, Lange, & Dinda, 2004; Ruth, Jiang,
Xu, & Goasguen, 2005; Ramakrishnan et al., 2006). This network of virtual machines is also
referred to as virtual Grid (Huang et al., 2006) or slice (Peterson et al., 2006).

Some systems such as Shirako (Irwin et al., 2006) and VioCluster (Ruth, McGachey, & Xu, 2005)
provide container-based resource control. Shirako also offers resource control at the job level (Ramakrishnan et al., 2006) by providing a component that is aware of the resources leased. This component gives
recommendations on which site can execute a given job.

Taxonomy on Virtual Organisations


The idea of user communities or virtual organisations underlies several of the organisational models
adopted by Grid systems and guides many of the efforts on providing fair resource allocation for Grids.
Consequently, the systems can be classified according to the VO awareness of their scheduling and
resource allocation mechanisms. One may easily advocate that several systems that were not explicitly
designed to support VOs can be used for resource management within a VO. We restrict ourselves to
providing a taxonomy that classifies systems according to (i) the VO awareness of their resource allocation
and scheduling mechanisms; and (ii) the provision of tools for handling different issues related to the
VO life-cycle. For the VO awareness of scheduling mechanisms we can classify the systems in:

Multiple VOs - those scheduling mechanisms that perform scheduling and allocation taking into
consideration the various VOs existing within a Grid; and
Single VO - those mechanisms that can be used for scheduling within a VO.

Furthermore, the idea of VO has been used in slightly different ways in the Grid computing context.
For example, in the Open Science Grid (OSG), VOs are recursive and may overlap. We use several
criteria to classify VOs as presented in Figure 3.
With regard to dynamism, we classify VOs as static and dynamic (Figure 3). Although Grid computing is mentioned as the enabler for dynamic VOs, it has been used to create more static and long-term
collaborations such as APAC (2005), EGEE (2005), the UK National e-Science Centre (2005), and
TeraGrid (Catlett et al., 2006). A static VO has a pre-defined number of participants and its structure
does not change over time. A dynamic VO presents a number of participants that changes constantly
as the VO evolves (Wesner, Dimitrakos, & Jeffrey, 2004). New participants can join, whereas existing
participants may leave.
A dynamic VO can be stationary or mobile. A stationary VO is generally composed of highly specialised
resources including supercomputers, clusters of computers, personal computers and data resources. The


Figure 3. Taxonomy on Grid facilitated VOs

components of the VO are not mobile. In contrast, a mobile VO is composed of mobile resources such as
Personal Digital Assistants (PDAs) and mobile phones. The VO is highly responsive and adapts to different
contexts (Wesner et al., 2004). Mobile VOs can be found in disaster handling and crisis management
situations. Moreover, a VO can be hybrid, having both stationary and mobile components.
Considering goal-orientation, we divide VOs into two categories: targeted and non-targeted (Figure
3). A targeted VO can be an alliance or collaboration created to explore a market opportunity or achieve
a common research goal. A VO for e-Science collaboration is an example of a targeted VO as the
participants have a common goal (Hey & Trefethen, 2002). A non-targeted VO is characterised by the
absence of a common goal; it generally comprises participants who pursue different goals, yet benefit
from the VO by pooling resources. This VO is highly dynamic because participants can leave when
they achieve their goals.
VOs can be short-, medium- or long-lived (Figure 3). A short-lived VO lasts for minutes or hours. A
medium-lived VO lasts for weeks and is formed, for example, when a scientist needs to carry out experiments that take several days to finish; data may be required to carry out such experiments. This scenario
may be simplified if the VO model is used, and the VO may no longer be needed once the experiments have
been carried out. A long-lived VO is formed to explore a market opportunity (goal-oriented) or to pool
resources to achieve disparate objectives (non-targeted). Such endeavours normally last from months
to years; hence, we consider a long-lived VO to last for several months or years.
As discussed in the previous section, the formation and maintenance of a VO present several challenges.
These challenges have been tackled in different ways, which in turn have created different formation and
maintenance approaches. We thus classify the formation and membership, or maintenance, as centralised
and decentralised (Figure 3). The formation and membership of a centralised VO is controlled by a
trusted third party, such as Open Science Grid (2005) or the Enabling Grids for E-SciencE (2005). OSG
provides an open market where providers and users can advertise their needs and intentions; a provider
or user may form a VO for a given purpose. EGEE provides a hierarchical infrastructure to enable the
formation of VOs. On the other hand, in a decentralised controlled VO, no third party is responsible for
enabling or controlling the formation and maintenance. This kind of VO can be complex as it can require
multiple Service Level Agreements (SLAs) to be negotiated among multiple participants. In addition,
the monitoring of SLAs and commitment of the members are difficult to control. The VO also needs to
self-adapt when participants leave or new participants join.
Regarding the enforcement of policies, VOs can follow different approaches, such as hub or democratic.
This is also referred to as topology. Katzy et al. (2005) classify VOs in terms of topology, identifying the
following types: chain, star or hub, and peer-to-peer. Sairamesh et al. (2005) identify business models
for VOs; the business models are analogous to topologies. However, by discussing the business models
for VOs, the authors are concerned with a larger set of problems, including enforcement of policies,
management, trust and security, and financial aspects. In our taxonomy, we classify the enforcement and
monitoring of policies as star or hub, democratic or peer-to-peer, hierarchical, and chain (Figure 3).
Some projects such as Open Science Grid (2005) and EGEE (2005) aim to establish consortiums or
clusters of organisations, which in turn allow the creation of dynamic VOs. Although not very related
to the core issues of VOs, they aim to address an important problem: the establishment of trust between
organisations and the means for them to look for and find potential partners. These consortiums can be
classified as hierarchical and market-like (Figure 3). A market-like structure is any infrastructure that
offers a market place, which organisations can join and present interests in starting a new collaboration
or accepting to participate in an ongoing collaboration. These infrastructures may make use of economic
models such as auctions, bartering, and bilateral negotiation.

A SURVEY OF EXISTING WORK


This section describes work relevant to the proposed taxonomy in more detail. First, it describes
a range of systems that have a decentralised architecture. Some systems present a hierarchy of scheduling whereby jobs are submitted to the root of the hierarchy or to its leaves; in either case, the jobs
execute at the leaves of the hierarchical structure. Second, this section presents systems with a hierarchical
structure, resource brokers and meta-scheduling frameworks. During the past few years, several Grid-based resource sharing networks and other testbeds have been created; third, we therefore discuss work on
the inter-operation between resource sharing networks. Finally, this section discusses relevant work focusing on VO issues.

Distributed Architecture Based Systems


Condor Flocking: The flocking mechanism used by Condor (Epema et al., 1996) provides a software
approach to interconnect pools of Condor resources. The mechanism requires manual configuration of
sharing agreements between Condor pools. Each pool owner and each workstation owner maintains full
control of when their resources can be used by external jobs.


The developers of Condor flocking opted for a layered design for the flocking mechanism, which
enables the Condor's Central Manager (CM) (Litzkow, Livny, & Mutka, 1988) and other Condor machines to remain unmodified and operate transparently from the flock.
The basis of the flocking mechanism is formed by Gateway Machines (GW). There is at least one
GW in each Condor pool. GWs act as resource brokers between pools. Each GW has a configuration file
describing the subset of connections it maintains with other GWs. Periodically, a GW queries the status
of its pool from the CM. From the list of resources obtained, the GW makes a list of those resources that
are idle. The GW then sends this list to the other GWs to which it is connected. Periodically, the GW that
received this list chooses a machine from the list, and advertises itself to the CM with the characteristics
of this machine. The flocking protocol (which is a modified version of the normal Condor protocol)
allows the GWs to create shadow processes so that a submission machine is under the impression
that it is contacting the execution machine directly.
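The gateway cycle described above might be sketched roughly as follows; the class and method names (GatewayMachine, share_idle_resources, and the minimal CentralManager stand-in) are hypothetical and greatly simplify the actual Condor protocol.

```python
import random

class CentralManager:
    """Minimal stand-in for a pool's Central Manager (CM), used only for the sketch."""
    def __init__(self, machines):
        self.machines = machines
    def query_status(self):
        return self.machines
    def advertise(self, gateway, characteristics):
        print(f"CM received advert from GW of {gateway.pool_name}: {characteristics}")

class GatewayMachine:
    """Sketch of a Gateway Machine's periodic cycle: query the local CM for idle
    machines, share the list with connected gateways, and advertise machines
    learned from peers back to the local CM."""
    def __init__(self, pool_name, central_manager):
        self.pool_name = pool_name
        self.cm = central_manager
        self.peers = []          # connections listed in the GW's configuration file
        self.remote_idle = []    # idle machines reported by peer gateways

    def share_idle_resources(self):
        idle = [m for m in self.cm.query_status() if m["state"] == "idle"]
        for peer in self.peers:
            peer.receive_idle_list(self.pool_name, idle)

    def receive_idle_list(self, origin_pool, idle_machines):
        self.remote_idle = [dict(m, origin=origin_pool) for m in idle_machines]

    def advertise_to_local_cm(self):
        # Pick one remote machine and present its characteristics to the local CM,
        # so local jobs can be matched against it as if it were a local resource.
        if self.remote_idle:
            self.cm.advertise(self, random.choice(self.remote_idle))

cm_a = CentralManager([{"name": "a1", "state": "idle"}, {"name": "a2", "state": "busy"}])
cm_b = CentralManager([])
gw_a, gw_b = GatewayMachine("pool-a", cm_a), GatewayMachine("pool-b", cm_b)
gw_a.peers, gw_b.peers = [gw_b], [gw_a]
gw_a.share_idle_resources()      # pool-a tells pool-b about idle machine a1
gw_b.advertise_to_local_cm()     # pool-b's GW advertises a1 to its own CM
```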
Self-Organizing Flock of Condors: The original flocking scheme of Condor has the drawback
that knowledge about all pools with which resources can be shared needs to be known a priori, before
starting Condor (Epema et al., 1996). This static information poses limitations regarding the number
of resources available and resource discovery. Butt et al. (2003) introduced a self-organising resource
discovery mechanism for Condor, which allows pools to discover one another and available resources
dynamically. The P2P network used by the flocking mechanism is based on Pastry and takes network
proximity into account, which may save bandwidth during data transfers and speed up communication.
Experiments with an implementation comprising four pools of four machines each were reported.
Additionally, simulation results demonstrated the performance of the flocking mechanism when interconnecting 1,000 pools.
Shirako: Shirako (Irwin et al., 2006) is a system for on-demand leasing of shared networked resources
across clusters. Shirako's design goals include: autonomous providers, who may offer resources to the
system on a temporary basis and retain the ultimate control over them; adaptive guest applications that
lease resources from the providers according to changing demand; pluggable resource types, allowing
participants to include various types of resources, such as network links, storage and computing; brokers that provide guest applications with an interface to acquire resources from resource providers; and
allocation policies at guest applications, brokers and providers, which define the manner in which resources are
allocated in the system.
Shirako utilises a leasing abstraction in which authorities representing provider sites offer their
resources to be provisioned by brokers to guest applications. Shirako brokers are responsible for coordinating resource allocation across provider sites. The provisioning of resources determines how much
of each resource each guest application receives, when and where. The site authorities define how much
resource is given to which brokers. The authorities also define which resources are assigned to serve
requests approved by a broker. When a broker approves a request, it issues a ticket that can be redeemed
for a lease at a site authority. The ticket specifies the type of resource, the number of resource units
granted and the interval over which the ticket is valid. Sites issue tickets for their resources to brokers;
the brokers' policies may decide to subdivide or aggregate tickets.
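The ticket abstraction described above could be represented, in simplified form, as follows; the field names and the split operation are illustrative assumptions rather than Shirako's actual interfaces.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Ticket:
    """Illustrative ticket: it names a resource type, a number of units and the
    interval over which it is valid, as described in the text above."""
    resource_type: str
    units: int
    valid_from: datetime
    valid_until: datetime
    issuing_site: str

    def split(self, units):
        """A broker policy may subdivide a ticket into two smaller tickets."""
        assert 0 < units < self.units
        first = Ticket(self.resource_type, units, self.valid_from,
                       self.valid_until, self.issuing_site)
        rest = Ticket(self.resource_type, self.units - units, self.valid_from,
                      self.valid_until, self.issuing_site)
        return first, rest

# A site issues a ticket to a broker, which subdivides it between two guests.
now = datetime.now()
ticket = Ticket("virtual_machine", 8, now, now + timedelta(hours=6), "site-a")
guest1_ticket, guest2_ticket = ticket.split(3)
```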
A service manager is a component that represents the guest application and uses the lease API provided by Shirako to request resources from the broker. The service manager determines when and how
to redeem existing tickets, extend existing leases, or acquire new leases to meet changing demand. The
system allows guest applications to renew or extend their leases. The broker and site authorities match
accumulated pending requests with resources under the authorities' control. The broker prioritises requests
and selects resource types and quantities to serve them. The site authority assigns specific resource units
from its inventory to fulfill lease requests that are backed by a valid ticket. Site authorities use Cluster
on Demand (Chase et al., 2003) to configure the resources allocated at the remote sites.
The leasing abstraction provided by Shirako is a useful basis to co-ordinate resource sharing for
systems that create distributed virtual execution environments of networked virtual machines (Keahey et
al., 2006; Ruth, Rhee, Xu, Kennell, & Goasguen, 2006; Adabala et al., 2005; Shoykhet et al., 2004).
Ramakrishnan et al. (2006) used Shirako to provide a hosting model wherein Grid deployments run
in multiple-site containers isolated from one another. An Application Manager (AM), which is the entry
point of jobs from a VO or Grid, interacts with a Grid Resource Oversight Coordinator (GROC) to obtain
a recommendation of a site to which jobs can be submitted. The hosting model uses Shirako's leasing
core. A GROC performs the functions of leasing resources from computing sites and recommending
sites for task submission. At the computing site, Cluster on Demand is utilised to provide a virtual cluster
used to run Globus 4 along with Torque/MAUI.
VioCluster: VioCluster is a system that enables dynamic machine trading across clusters of computers (Ruth, McGachey, & Xu, 2005). VioCluster introduces the idea of virtual domain. A virtual domain,
originally comprising its physical domain of origin (i.e. a cluster of computers), can grow in its number
of computing resources by dynamically allocating resources from other physical domains according
to the demands of its user applications.
VioCluster presents two important system components: the creation of dynamic virtual domains and
the mechanism through which resource sharing is negotiated. VioCluster uses machine and network
virtualisation technology to move machines between domains. Each virtual domain has a broker that
interacts with other domains. A broker has a borrowing policy and a lending policy. The borrowing
policy determines under which circumstances the broker will attempt to obtain more machines. The
lending policy governs when it is willing to let another virtual domain make use of machines within its
physical domain.
The broker represents a virtual domain when negotiating trade agreements with other virtual domains.
It is the broker's responsibility to determine whether trades should occur. The policies for negotiating
the resources specify: the reclamation, that is, when the resources will be returned to their home domain; the machine properties, which describe the machines to be borrowed; and the machines' location, as
some applications require communication. The borrowing policy must be aware of the communication
requirements of user applications.
Machine virtualisation simplifies the transfer of machines between domains. When a machine belonging to a physical domain B is borrowed by a virtual domain A, it is utilised to run a virtual machine. This
virtual machine matches the configuration of the machines in physical domain A. Network virtualisation
enables the establishment of virtual network links connecting the new virtual machine to the nodes of
domain A. For the presented prototype, PBS is used to manage the nodes of the virtual domain. PBS
is aware of the computers' heterogeneity and never schedules jobs on a mixture of virtual and physical
machines. The size of the work queue in PBS was used as a measure of the demand within a domain.
OurGrid: OurGrid (Andrade et al., 2003) is a resource sharing system organised as a P2P network
of sites that share resources equitably in order to form a Grid to which they all have access. OurGrid
was designed with the goal of easing the assembly of Grids, thus it provides connected sites with access
to the Grid resources with a minimum of guarantees needed. OurGrid is used to execute Bag-of-Tasks
(BoT) applications. BoT are parallel applications composed of a set of independent tasks that do not
communicate with one another during their execution. In contrast to other Grid infrastructures, the system
does not require offline negotiations if a resource owner wants to offer her resources to the Grid.
OurGrid uses a resource exchange mechanism termed network of favours. A participant A is doing
a favour to participant B when A allows B to use her resources. According to the network of favours,
every participant does favours to other participants expecting the favours to be reciprocated. In conflicting situations, a participant prioritises those who have done favours to it in the past. The more favours
a participant does, the more it expects to be rewarded. Participants account for their favours locally and
cannot profit from them in any way other than expecting other participants to return favours to them.
Detailed experiments have demonstrated the scalability of the network of favours (Andrade et al., 2007),
showing that the larger the network becomes, the fairer the mechanism performs.
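A minimal sketch of this local accounting is given below; the names (FavourLedger and its methods) are assumptions made for the example, and the code only illustrates the balance and prioritisation idea, not OurGrid's implementation.

```python
from collections import defaultdict

class FavourLedger:
    """Sketch of local favour accounting: a peer's balance grows when it donates
    resources to us and shrinks (never below zero) when it consumes ours; under
    contention, peers with the highest balance are served first."""
    def __init__(self):
        self.balance = defaultdict(float)   # peer -> favours that peer has done for us

    def peer_donated(self, peer, amount):
        # The peer allowed us to use 'amount' of its resources (a favour to us).
        self.balance[peer] += amount

    def we_donated(self, peer, amount):
        # We allowed the peer to use our resources, repaying part of its favours.
        self.balance[peer] = max(0.0, self.balance[peer] - amount)

    def prioritise(self, requesters):
        # In conflicting situations, serve peers that did us the most favours first.
        return sorted(requesters, key=lambda p: self.balance[p], reverse=True)

ledger = FavourLedger()
ledger.peer_donated("site-b", 10.0)   # site-b gave us 10 CPU-hours in the past
ledger.peer_donated("site-c", 3.0)
print(ledger.prioritise(["site-c", "site-d", "site-b"]))  # ['site-b', 'site-c', 'site-d']
```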
The three participants in OurGrid's resource sharing protocol are clients, consumers, and providers. A client requires access to the Grid resources to run her applications. The consumer receives requests
from the client to find resources. When the client sends a request to the consumer, the consumer first
finds the resources able to serve the request and then executes the tasks on the resources. The provider
manages the resources shared in the community and provides them to consumers.
Delegated Matchmaking: Iosup et al. (2007) introduced a matchmaking protocol in which a computing
site binds resources from remote sites to its local environment. A network of sites, created on top of the
local cluster schedulers, manages the resources of the interconnected Grids. Sites are organised according
to administrative and political agreements so that parent-child links can be established. Then, a hierarchy
of sites is formed with the Grid clusters at the leaves of the hierarchy. After that, supplementary to the
hierarchical links, sibling links are established between sites that are at the same hierarchical level and
operate under the same parent site. The proposed delegated matchmaking mechanism enables requests
for resources to be delegated up and down the hierarchy thus achieving a decentralised network.
The architecture is different from work wherein a scheduler forwards jobs to be executed on a remote
site. The main idea of the matchmaking mechanism is to delegate ownership of resources to the user
who requested them through this network of sites, and to add the resources transparently to the user's local
site. When a request cannot be satisfied locally, the matchmaking mechanism adds remote resources
to the users site. This simplifies security issues since the mechanism adds the resources to the trusted
local resource pool. Simulation results show that the mechanism leads to an increase in the number of
requests served by the interconnected sites.
Grid Federation: Ranjan et al. (2005) proposed a system that federates clusters of computers via a
shared directory. Grid Federation Agents (GFAs), representing the federated clusters, post quotes about
idle resources (i.e. a claim stating that a given resource is available) and, upon the arrival of a job, query
the directory to find a resource suitable to execute the job. The directory is a shared-space implemented
as a Distributed Hash Table (DHT) P2P network that can match quotes and user requests (Ranjan et al.,
2008).
An SLA driven co-ordination mechanism for Grid superscheduling has also been proposed (Ranjan
et al., 2006). GFAs negotiate SLAs and redirect requests through a Contract-Net protocol. GFAs use a
greedy policy to evaluate resource requests. A GFA is a cluster resource manager and has control over
the cluster's resources. GFAs engage in bilateral negotiations for each request they receive, without
considering network locality.
Askalon: Siddiqui et al. (2006) introduced a capacity planning architecture with a three-layer negotiation protocol for advance reservation on Grid resources. The architecture is composed of allocators that
make reservations of individual nodes and co-allocators that reserve multiple nodes for a single Grid
application. A co-allocator receives requests from users and generates alternative offers that the user
can utilise to run her application. A co-allocation request can comprise a set of allocation requests, each
allocation request corresponding to an activity of the Grid application. A workflow with a list of activities is an example of Grid application requiring co-allocation of resources. Co-allocators aim to agree
on Grid resource sharing. The proposed co-ordination mechanism produces contention-free schedules
either by eliminating conflicting offers or by lowering the objective level of some of the allocators.
GRUBER/DI-GRUBER: Dumitrescu et al. (2005) highlighted that challenging usage policies can
arise in VOs that comprise participants and resources from different physical organisations. Participants
want to delegate access to their resources to a VO, while maintaining such resources under the control
of local usage policies. They seek to address the following issues:

How usage policies are enforced at the resource and VO levels.


What mechanisms are used by a VO to ensure policy enforcement.
How the distribution of policies to the enforcement points is carried out.
How policies are made available to VO job and data planners.

They have proposed a policy management model in which participants can specify the maximum
percentage of resources delegated to a VO. A VO in turn can specify the maximum percentage of resource
usage it wishes to delegate to a given VO group. Based on this model, they have proposed a Grid
resource broker termed GRUBER (Dumitrescu & Foster, 2005). The GRUBER architecture is composed of
four components, namely:

Engine: implements several algorithms to detect available resources.


Site monitoring: one of the data providers for the GRUBER engine; it is responsible for collecting data on the status of Grid elements.
Site selectors: consist of tools that communicate with the engine and provide information about
which sites can execute the jobs.
Queue manager: resides on the submitting host and decides how many jobs should be executed
and when.

Users who want to execute jobs do so by sending them to submitting hosts. The integration of existing external schedulers with GRUBER is performed on the submitting hosts. The external scheduler utilises
GRUBER either as the queue manager that controls the start time of jobs and enforces VO policies, or
as a site recommender. The second case is applicable if the queue manager is not available.
DI-GRUBER, a distributed version of GRUBER, has also been presented (Dumitrescu, Raicu, &
Foster, 2005). DI-GRUBER works with multiple decision points, which gather information to steer
resource allocations defined by Usage Service Level Agreements (USLAs). These points make decisions on a per-job basis to comply with the resource allocations assigned to VO groups. The authors advocated that 4 to
5 decision points are enough to handle the job scheduling of a Grid 10 times larger than Grid3 at the
time the work was carried out (Dumitrescu, Raicu, & Foster, 2005).
Other important work: Balazinska et al. (2004) have proposed a load balancing mechanism for
Medusa. Medusa is a stream processing system that allows the migration of stream processing operators
from overloaded to under-utilised resources. Requests are offloaded based on their marginal cost: the
marginal cost of a request for a participant is the increase (or decrease) in the participant's cost curve
caused by accepting (or removing) that request from the set of requests it serves.
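The marginal-cost test might be expressed as in the following sketch; the convex cost curve and the load values are arbitrary assumptions used only to illustrate why an under-utilised participant is a cheaper place to serve a request.

```python
def marginal_cost(cost_curve, current_load, request_load):
    """Marginal cost of accepting a request: the increase in the participant's
    cost curve when the request's load is added to the load it already serves.
    'cost_curve' is any function mapping total load to operating cost."""
    return cost_curve(current_load + request_load) - cost_curve(current_load)

# Hypothetical convex cost curve: cost grows quickly as a node approaches saturation.
convex_cost = lambda load: load ** 2

# An overloaded participant offloads a request if an under-utilised peer can
# serve it at a lower marginal cost.
busy_cost = marginal_cost(convex_cost, current_load=9.0, request_load=1.0)   # 19.0
idle_cost = marginal_cost(convex_cost, current_load=2.0, request_load=1.0)   # 5.0
should_offload = idle_cost < busy_cost
```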


NWIRE (Schwiegelshohn & Yahyapour, 1999) links various resources to a metacomputing system,
also termed a meta-system, and enables scheduling in these environments. A meta-system comprises
interconnected MetaDomains. Each MetaDomain is managed by a MetaManager that manages a set of
ResourceManagers. A ResourceManager interfaces the scheduler at the cluster level. The MetaManager
permanently collects information about all of its resources. It handles all requests inside its MetaDomain
and works as a resource broker to other MetaDomains. In this way, requests received by a MetaManager
can be submitted either by users within its MetaDomain or by other MetaManagers. Each MetaManager
contains a scheduler that maps requests for resources to a specific resource in its MetaDomain.
Grimme et al. (2008) have presented a mechanism for collaboration between resource providers by
means of job interchange through a central job pool. According to this mechanism, a cluster scheduler
adds to the central pool jobs that cannot be started immediately. After scheduling local jobs, a local
scheduler can schedule jobs from the central pool if resources are available.
Dixon et al. (2006) have provided a tit-for-tat or bartering mechanism based on local, non-transferable
currency for resource allocation in large-scale distributed infrastructures such as PlanetLab. The currency
is maintained locally within each domain in the form of credit given to other domains for providing
resources in the past. This creates pair-wise relationships between administrative domains. The mechanism resembles OurGrid's network of favours (Andrade et al., 2003). The information about exchanged
resources decays with time, so that recent behaviour is more important. Simulation results showed that,
for an infrastructure like PlanetLab, the proposed mechanism is fairer than the free-for-all approach
currently adopted by PlanetLab.
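A rough sketch of such decaying, pair-wise credit is shown below; the exponential decay and the 24-hour half-life are assumptions made for the example and are not taken from Dixon et al.'s paper.

```python
import math

class DecayingCredit:
    """Sketch of pair-wise, non-transferable credit with exponential decay, so
    that recently provided resources weigh more than old contributions."""

    def __init__(self, half_life_hours=24.0):
        self.decay_rate = math.log(2) / half_life_hours
        self.credit = {}   # peer domain -> (value, time of last update, in hours)

    def _decayed(self, peer, now):
        value, last = self.credit.get(peer, (0.0, now))
        return value * math.exp(-self.decay_rate * (now - last))

    def record_contribution(self, peer, amount, now):
        # 'peer' provided 'amount' of resources to this domain at time 'now'.
        self.credit[peer] = (self._decayed(peer, now) + amount, now)

    def current_credit(self, peer, now):
        return self._decayed(peer, now)

ledger = DecayingCredit()
ledger.record_contribution("domain-b", 100.0, now=0.0)
print(ledger.current_credit("domain-b", now=24.0))  # roughly 50 after one half-life
```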
Graupner et al. (2002) have introduced a resource control architecture for federated utility data
centres. In this architecture, physical resources are grouped in virtual servers and services are mapped
to virtual servers. The meta-system is the upper layer implemented as an overlay network whose nodes
contain descriptive data about the two layers below. Allocations change according to service demand,
which requires the control algorithms to be reactive and to deliver quality solutions. The control layer
performs allocation of services to virtual server environments and its use has been demonstrated by a
capacity control example for a homogeneous Grid cluster.

Hierarchical Systems, Brokers and Meta-Scheduling


This section describes some systems that are organised in a hierarchical manner. We also describe work
on Grid resource brokering and frameworks that can be used to build meta-schedulers.
Computing Center Software (CCS): CCS (Brune, Gehring, Keller, & Reinefeld, 1999) is a system
for managing geographically distributed high-performance computers. It consists of three components,
namely: the CCS, which is vendor-independent resource management software for local HPC systems; the Resource and Service Description (RSD), used by the CCS to specify and map hardware and
software components of computing environments; and the Service Coordination Layer (SCL), which
co-ordinates the use of resources across computing sites.
The CCS controls the mapping and scheduling of interactive and parallel jobs on massively parallel
systems. It uses the concept of island, wherein each island has components for user interface, authorisation
and accounting, scheduling of user requests, access to the physical parallel system, system control, and
management of the island. At the meta-computing level, the Center Resource Manager (CRM) exposes
scheduling and brokering features of the islands. The CRM is a management tool atop the CCS islands.
When a user submits an application, the CRM maps the user request to the static and dynamic information on resources available. Once the resources are found, the CRM requests the allocation of all required
resources at all the islands involved. If not all resources are available, the CRM either re-schedules the
request or rejects it. Center Information Server (CIS) is a passive component that contains information
about resources and their statuses, and is analogous to Globus Metacomputing Directory Service (MDS)
(Foster & Kesselman, 1997). It is used by the CRM to obtain information about resources available.
The Service Co-ordination Layer (SCL) is located one level above the local resource management
systems. The SCL co-ordinates the use of resources across the network of islands. It is organised as
a network of co-operating servers, wherein each server represents one computing centre. The centres
determine which resources are made available to others and retain full autonomy over them.
EGEE Workload Management System (WMS): EGEE WMS (Vázquez-Poletti, Huedo, Montero,
& Llorente, 2007) has a semi-centralised architecture. One or more schedulers can be installed in the
Grid infrastructure, each providing scheduling functionality for a group of VOs. The EGEE WMS
components are: The User Interface (UI) from where the user dispatches the jobs; the Resource Broker
(RB), which uses Condor-G (Frey, Tannenbaum, Livny, Foster, & Tuecke, 2001); the Computing Element (CE), which is the cluster front-end; the Worker Nodes (WNs), which are the cluster nodes; the
Storage Element (SE), used for job files storage; and the Logging and Bookkeeping service (LB) that
registers job events.
Condor-G: Condor-G (Frey et al., 2001) leverages software from Globus and Condor (Frey et al.,
2001) and allows users to utilise resources spanning multiple domains as if they all belonged to one personal
domain. Although Condor-G can be viewed as a resource broker itself (Venugopal, Nadiminti, Gibbins,
& Buyya, 2008), it can also provide a framework to build meta-schedulers.
The GlideIn mechanism of Condor-G is used to start a daemon process on a remote resource. The
process uses standard Condor mechanisms to advertise the resource availability to a Condor collector
process, which is then queried by the Scheduler to learn about available resources. Condor-G uses Condor
mechanisms to match locally queued jobs to the resources advertised by these daemons and to execute
them on those resources. Condor-G submits an initial GlideIn executable (a portable shell script), which
in turn uses GSI-authenticated GridFTP to retrieve the Condor executables from a central repository. By
submitting GlideIns to all remote resources capable of serving a job, Condor-G can guarantee optimal
queuing times to user applications.
Gridbus Broker: The Gridbus Grid resource broker (Venugopal et al., 2008) is a user-centric broker that
provides scheduling algorithms for both computing- and data-intensive applications. In Gridbus, each
user has her own broker, which represents the user by (i) selecting resources that satisfy the user's
quality of service constraints, such as execution deadline and budget spent; (ii) submitting jobs to remote
resources; and (iii) copying input and output files. Gridbus interacts with various Grid middlewares
(Venugopal et al., 2008).
Gridway: GridWay (Huedo, Montero, & Llorente, 2004) is a Globus-based resource broker that
provides a framework for the execution of jobs in a 'submit and forget' fashion. The framework performs
job submission and execution monitoring. Job execution adapts itself to dynamic resource conditions
and application demands in order to improve performance. The adaptation is performed through application migration following performance degradation, sophisticated resource discovery, requirements
change, or remote resource failure.
The framework is modular wherein the following modules can be set on a per-job basis: resource
selector, performance degradation evaluator, prolog, wrapper and epilog. The names of the first two
modules are intuitive, so we describe here only the last three. During the prolog, the component
responsible for job submission (i.e. the submission manager) submits the prolog executable, which configures the remote system and transfers the executable and input files. In the case of a restart of an execution,
the prolog also transfers restart files. The wrapper executable is submitted after prolog and wraps the
actual job in order to obtain its exit code. The epilog is a script that transfers the output files and cleans
the remote resource.
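The prolog/wrapper/epilog flow can be illustrated with the following sketch; the function and the shell commands are placeholders written for this example, not GridWay's interface.

```python
import subprocess

def run_gridway_style_job(prolog, job_cmd, epilog):
    """Illustrative reproduction of the prolog/wrapper/epilog flow described above."""
    # Prolog: prepare the remote system and stage in the executable and input files.
    subprocess.run(prolog, shell=True, check=True)

    # Wrapper: run the actual job and capture its exit code for the submitter.
    result = subprocess.run(job_cmd, shell=True)
    exit_code = result.returncode

    # Epilog: stage out the output files and clean up the remote resource,
    # regardless of whether the job succeeded.
    subprocess.run(epilog, shell=True, check=True)
    return exit_code

# Hypothetical usage with placeholder POSIX shell commands.
code = run_gridway_style_job(
    prolog="mkdir -p workdir && echo 'input data' > workdir/input.dat",
    job_cmd="cd workdir && wc -l input.dat > job.out",
    epilog="mkdir -p results && cp workdir/*.out results/ && rm -rf workdir",
)
```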
GridWay also enables the deployment of virtual machines in a Globus Grid (Rubio-Montero, Huedo,
Montero, & Llorente, 2007). The scheduling and selection of suitable resources is performed by GridWay whereas a virtual workspace is provided for each Grid job. A pre-wrapper phase is responsible for
performing advanced job configuration routines, whereas the wrapper script starts a virtual machine
and triggers the application job on it.
KOALA: Mohamed and Epema (in press) have presented the design and implementation of KOALA,
a Grid scheduler that supports resource co-allocation. The KOALA Grid scheduler interacts with cluster
batch schedulers for the execution of jobs. The work proposes an alternative to advance reservation at
local resource managers, when reservation features are not available. This alternative allows processors
to be allocated from multiple sites at the same time.
SNAP-Based Community Resource Broker: The Service Negotiation and Acquisition Protocol
(SNAP)-based community resource broker uses an interesting three-phase commit protocol. SNAP is
proposed because traditional advance reservation facilities cannot cope with the fact that resource
availability may change between the moment at which it is queried and the time
when the reservation of resources is actually performed (Haji, Gourlay, Djemame, & Dew, 2005). The
three phases of SNAP protocol consist of (i) a step in which resource availability is queried and probers
are deployed, which inform the broker in case the resource status changes; (ii) then, the resources are
selected and reserved; and (iii) after that, the job is deployed on the reserved resources.
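The three phases might be sketched as follows; the StubBroker and its methods are hypothetical stand-ins used only to make the sequence of steps concrete, not a real SNAP API.

```python
class StubBroker:
    """Minimal stand-in used only to make the sketch executable."""
    def query_availability(self, reqs):
        return ["node-1", "node-2", "node-3"]
    def deploy_probers(self, resources):
        self.changed = {"node-2"}          # pretend node-2 changed status meanwhile
    def status_changed(self, resource):
        return resource in self.changed
    def reserve(self, resources, reqs):
        return resources[:reqs["nodes"]]
    def deploy(self, job, reservation):
        return f"{job['name']} deployed on {reservation}"

def snap_three_phase(broker, job):
    # Phase 1: query availability and deploy probers that report status changes.
    candidates = broker.query_availability(job["requirements"])
    broker.deploy_probers(candidates)
    # Phase 2: select and reserve resources, discarding those whose status changed.
    viable = [r for r in candidates if not broker.status_changed(r)]
    reservation = broker.reserve(viable, job["requirements"])
    # Phase 3: deploy the job on the reserved resources.
    return broker.deploy(job, reservation)

print(snap_three_phase(StubBroker(), {"name": "job-42", "requirements": {"nodes": 2}}))
```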
Platform Community Scheduler Framework (CSF): CSF (2003) provides a set of tools that can
be utilised to create a Grid meta-scheduler or a community scheduler. The meta-scheduler enables users
to define the protocols to interact with resource managers in a system-independent manner. The interface
with a resource manager is performed via a component termed the Resource Manager (RM) Adapter. An RM
Adapter interfaces a cluster resource manager. CSF supports the GRAM protocol to access the services
of the resource managers that do not support the RM Adapter interface.
Platform's LSF and MultiCluster products leverage the CSF to provide a framework for implementing meta-scheduling. The Grid Gateway is an interface that integrates Platform LSF and CSF. A scheduling
plug-in for the Platform LSF scheduler decides which LSF jobs are forwarded to the meta-scheduler. This
decision is based on information obtained from an information service provided by the Grid Gateway.
When a job is forwarded to the meta-scheduler, the job submission and monitoring tools dispatch the job
and query its status information through the Grid Gateway. The Grid Gateway uses the job submission,
monitoring and reservation services from the CSF. Platform MultiCluster also allows multiple clusters
using LSF to forward jobs to one another transparently to the end-user.
Other important work: Kertész et al. (2008) introduced a meta-brokering system in which the meta-broker, invoked through a Web portal, submits jobs, monitors job status and copies output files using
brokers from different Grid middleware, such as NorduGrid Broker and EGEE WMS.
Kim and Buyya (2007) tackle the problem of fair-share resource allocation in hierarchical VOs. They
provide a model for hierarchical VO environments based on a resource sharing policy, together with a
heuristic solution for fair-share resource allocation in such environments.


Inter-Operation of Resource Sharing Networks


Relevant work on the attempts to enable inter-operation between resource sharing networks is discussed
in this section.
PlanetLab: PlanetLab (Peterson et al., 2006) is a large-scale testbed that enables the creation of
slices, that is, distributed environments based on virtualisation technology. A slice is a set of virtual
machines, each running on a unique node. The individual virtual machines that make up a slice contain
no information about the other virtual machines in the set and are managed by the service running in
the slice. Each service deployed on PlanetLab runs on a slice of PlanetLab's global pool of resources.
Multiple slices can run concurrently and each slice is like a network container that isolates services
from other containers.
The principals in PlanetLab are:

Owner: organisation that hosts (owns) one or more PlanetLab nodes.


User: researcher who deploys a service on a set of PlanetLab nodes.
PlanetLab Consortium (PLC): centralised trusted intermediary that manages nodes on behalf of
a group of owners and creates slices on those nodes on behalf of a group of users.

When PLC acts as a Slice Authority (SA), it maintains the state of the set of system-wide slices for
which the PLC is responsible. The SA provides an interface through which users register themselves,
create slices, bind users to slices, and request the slice to be instantiated on a set of nodes. PLC, acting
as a Management Authority (MA), maintains a server that installs and updates the software running on
the nodes it manages and monitors these nodes for correct behaviour, taking appropriate action when
anomalies and failures are detected. The MA maintains a database of registered nodes. Each node is
affiliated with an organisation (owner) and is located at a site belonging to the organisation. The MA provides an interface used by node owners to register their nodes with the PLC and allows users and slice
authorities to obtain information about the set of nodes managed by the MA.
PlanetLab's architecture has evolved to enable decentralised control or federations of PlanetLabs
(Peterson et al., 2006). The PLC has been split into two components, namely the MA and the SA, which allows
PLC-like entities to evolve these two components independently. Therefore, autonomous organisations
can federate and define peering relationships with each other. For example, peering relationships with
other infrastructure is one of the goals of PlanetLab Europe (2008). A resource owner may choose a MA
to which it wants to provide resources. MAs, in turn, may blacklist particular SAs. A SA may trust only
certain MAs to provide it with the virtual machines it needs for its users. This enables various types of
agreements between SAs and MAs.
It is also important to mention that Ricci et al. (2006) have discussed issues related to the design of a
general resource allocation interface that is sufficiently wide for allocators in a large variety of current
and future testbeds. An allocator is a component that receives as input the user's abstract description
for the required resources and the resource status from a resource discoverer and produces allocations
performed by a deployment service. The goal of an allocator is to allow users to specify characteristics
of their slice in high-level terms and find resources to match these requirements. The authors have described
their experience in designing PlanetLab and Emulab and, among several important issues, they have
advocated that:


In future infrastructures, several allocators may co-exist and it might be difficult for them to co-exist without interfering with one another;
With the current proportional-share philosophy of PlanetLab, where multiple management services can co-exist, allocators do not have guarantees over any resources;
Thus, co-ordination between the allocators may be required.

Grid Interoperability Now - Community Group (GIN-CG): GIN-CG (2006) has been working
on providing interoperability between Grids by developing components and adapters that enable secure
and standard job submissions, data transfers, and information queries. These efforts provide the basis
for load management across Grids by facilitating standard job submission and request redirection. They
also enable secure access to resources and data across Grids. Although GIN-CG's efforts are relevant,
its members also highlight the need for common allocation and brokering of resources across Grids.2
InterGrid: Assunção et al. (2008) have proposed an architecture and policies to enable the interoperation of Grids. This set of architecture and policies is termed the InterGrid. The InterGrid is inspired
by the peering agreements between Internet Service Providers (ISPs). The Internet is composed of
competing ISPs that agree to allow traffic into one another's networks. These agreements between ISPs
are commonly termed peering and transit arrangements (Metz, 2001).
In the InterGrid, a Resource Provider (RP) contributes a share of computational resources, storage
resources, networks, application services or other type of resource to a Grid in return for regular payments. An RP has local users whose resource demands need to be satisfied, yet it delegates provisioning
rights over spare resources to an InterGrid Gateway (IGG) by providing information about the resources
available in the form of free time slots (Assunção & Buyya, 2008). A free time slot includes information
about the number of resources available, their configuration and time frame over which they will be
available. The control over resource shares offered by providers is performed via a container model, in
which the resources are used to run virtual machines. Internally, each Grid may have a resource management system organised in a hierarchical manner. However, for the sake of simplicity, experimental results
consider that RPs delegate provisioning rights directly to an IGG (Assunção & Buyya, in press).
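The free time slot abstraction described above might be represented as in the following sketch; the field names and the can_serve check are assumptions made for illustration rather than the InterGrid implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FreeTimeSlot:
    """Illustrative structure for a free time slot: the number of resources
    offered, their configuration, and the time frame over which they are available."""
    num_resources: int
    configuration: dict          # e.g. cores and memory of each resource unit
    start: datetime
    end: datetime
    provider: str

    def can_serve(self, resources_needed, duration):
        # Simple feasibility check used by a gateway when selecting offers.
        return (self.num_resources >= resources_needed
                and self.end - self.start >= duration)

slot = FreeTimeSlot(
    num_resources=16,
    configuration={"cores": 2, "memory_gb": 4},
    start=datetime(2009, 1, 1, 8, 0),
    end=datetime(2009, 1, 1, 20, 0),
    provider="rp-1",
)
print(slot.can_serve(resources_needed=8, duration=timedelta(hours=6)))  # True
```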
A Grid has pre-defined peering arrangements with other Grids, managed by IGGs, through which
they co-ordinate the use of resources of the InterGrid. An IGG is aware of the terms of the peering with
other Grids; provides Grid selection capabilities by selecting a suitable Grid able to provide the required
resources; and replies to requests from other IGGs. The peering arrangement between two Grids is represented as a contract. Request redirection policies determine which peering Grid is selected to process
a request and at what price the processing is performed (Assunção & Buyya, in press).
Other important work: Boghosian et al. (2006) have performed experiments using resources from
more than one Grid for three projects, namely Nektar, SPICE and Vortonics. The applications in these
three projects require massive numbers of computing resources only achievable through Grids of Grids.
Although resources from multiple Grids were used during the experiments, they emphasised that several human interactions and negotiations are required in order to use federated resources. The authors
highlighted that even if interoperability at the middleware level existed, it would not guarantee that
the federated Grids can be utilised for large-scale distributed applications because there are important
additional requirements such as compatible and consistent usage policies, automated advanced reservations and co-scheduling.
Caromel et al. (2007) have proposed the use of a P2P network to acquire resources dynamically from
a Grid infrastructure (i.e. Grid5000) and desktop machines in order to run compute-intensive applications. The communication between the P2P network and Grid5000 is performed through SSH tunnels.
Moreover, the allocation of nodes for the P2P network uses the deployment framework of ProActive by
deploying Java Virtual Machines on the allocated nodes.
In addition to GIN-CG's efforts, other Grid middleware interoperability approaches have been
presented. Wang et al. (2007) have described a gateway approach to achieve interoperability between
gLite (2005) (the middleware used in EGEE) and CNGrid GOS (2007) (the middleware of the Chinese
National Grid (2007)). The work focuses on job management interoperability, but also describes interoperability between the different protocols used for data management as well as resource information. In the
proposed interoperability approach, gLite is viewed as a type of site job manager by GOS, whereas the
submission to GOS resources by gLite is implemented in a different manner; an extended job manager
is instantiated for each job submitted to a GOS resource. The extended job manager sends the whole
batch job to be executed in the CNGrid.

Virtual Organisations
We have also carried out a survey on how projects address different challenges in the VO life-cycle.
Two main categories of projects have been identified: the facilitators for VOs, which provide means
for building clusters of organisations hence enabling collaboration and formation of VOs; and enablers
for VOs, which provide middleware and tools to help in the formation, management, maintenance and
dissolution of VOs. The classification is not strict because a project can fall into two categories, providing software for enabling VOs and working as a consortium, which organisations can join and start
collaborations that are more dynamic. We divide our survey into three parts: middleware and software
infrastructure for enabling VOs; consortiums and charters that facilitate the formation of VOs; and other
relevant work that addresses issues related to a VOs life-cycle.

Enabling Technology
Enabling a VO means providing the software tools required to help in the different phases of the life-cycle
of a VO. As we show in this section, due to the complex challenges in the life-cycle, many projects
do not address all of the phases.
The CONOISE Project: CONOISE (Patel et al., 2005) uses a marketplace (auctions) for the formation of VOs (Norman et al., 2004). The auctions are combinatorial; combinatorial auctions allow a good
degree of flexibility so that VO initiators can specify a broad range of requirements. A combinatorial
auction allows multiple units of a single item or multiple items to be sold simultaneously. However,
combinatorial auctions lack means for bid representation and efficient clearing algorithms to determine
prices, quantities and winners. As demonstrated by Dang (2004), clearing combinatorial auctions is an
NP-complete problem; thus, polynomial-time, sub-optimal clearing algorithms for combinatorial auctions have been proposed.
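One simple greedy heuristic of this kind is sketched below; it is an illustration of a sub-optimal clearing strategy, not the algorithm used by CONOISE, and the bids are invented for the example.

```python
def greedy_clearing(bids):
    """Sub-optimal, polynomial-time clearing: sort bids by price per item and
    accept each bid whose items are still unallocated.
    'bids' is a list of (bidder, items, price) tuples."""
    winners, allocated = [], set()
    for bidder, items, price in sorted(bids, key=lambda b: b[2] / len(b[1]), reverse=True):
        if allocated.isdisjoint(items):
            winners.append((bidder, items, price))
            allocated.update(items)
    return winners

bids = [
    ("sp-1", {"cpu", "storage"}, 10.0),   # bundle bid on two items
    ("sp-2", {"cpu"}, 7.0),
    ("sp-3", {"storage", "network"}, 6.0),
]
# The greedy result (sp-2 and sp-3, total 13.0) is not the optimal allocation
# (sp-1 and sp-3, total 16.0), illustrating the sub-optimality of the heuristic.
print(greedy_clearing(bids))
```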
Stakeholders in VOs enabled by CONOISE are called agents. As an example of VO formation, a user
may request a service from an agent, which in turn verifies whether it is able to provide the requested service at the
time specified. If the agent cannot provide the service, it looks for the Service Providers (SPs) offering
the service required. The Requesting Agent (RA) then starts a combinatorial auction and sends calls for
bids to the SPs. Once the RA receives the bids, it determines the best set of partners and then starts the formation of the VO. Once the VO is formed, the RA becomes the VO manager.


An agent that receives a call for bids has the following options: (a) it can decide not to bid in
the auction; (b) it can bid considering its own resources; (c) it may bid using resources from an existing
collaboration; or (d) it may identify the need to start a new VO to provide the extra resources required.
Note that calls for bids are recursive. CONOISE uses cumulative scheduling based on a Constraint
Satisfaction Program (CSP) to model the decision process of an agent.
CONOISE also focuses on the operation and maintenance phases of VOs. Once a VO is formed, it
uses principles of coalition formation for distributing tasks amongst the member agents (Patel et al.,
2005). An algorithm for coalition structure generation, whose results are within a provable bound from the optimal, is presented and
evaluated (Dang, 2004). Although not very focused on authorisation issues, the CONOISE project also
deals with issues regarding trust and reputation in VOs by providing reputation and policing mechanisms
to ensure minimum quality of service.
The TrustCoM Project: TrustCoM (2005) addresses issues related to the establishment of trust
throughout the life-cycle of VOs. Its members envision that the establishment of Service Oriented Architectures (SOAs) and the dynamic open electronic marketplaces will allow dynamic alliances and VOs
among enterprises to respond quickly to market opportunities. The establishment of trust, not only at a
resource level but also at a business process level, is hence of importance. In this light, TrustCoM aims to
provide a framework for trust, security and contract management to enable on-demand and self-managed
dynamic VOs (Dimitrakos, Golby, & Kearley, 2004; Svirskas, Arevas, Wilson, & Matthews, 2005).
The framework extends current VO membership services (Svirskas et al., 2005) by providing means
to: (i) identify potential VO partners through reputation management; (ii) manage users according to
the roles defined in the business process models that VO partners perform; (iii) define and manage the
SLA obligations on security and privacy; (iv) enable the enforcement of policies based on the SLAs and
contracts. From a corporate perspective, Sairamesh et al. (2005) provide examples of business models
on the enforcement of security policies and the VO management.
While the goal is to enable dynamic VOs, TrustCoM focuses on the security requirements for the
establishment of VOs composed of enterprises. Studies and market analysis to identify the main issues
and requirements to build a secure environment in which VOs form and operate have been performed.

Facilitators or Breeding Environments


In order to address the problem of trust between organisations, projects have created federations and
consortiums which physical organisations or Grids can join to start VOs based on common interests. We
describe the main projects in this field and explain some of the technologies they use.
Open Science Grid (OSG): OSG (2005) can be considered as a facilitator for VOs. The reason is
that the project aims at forming a cluster or consortium of organisations and suggests that they follow a
policy that states how collaboration takes place and how a VO is formed. To join the consortium and
consequently form a VO, it is necessary to have a minimum infrastructure and preferably use the middleware suggested by OSG. In addition, OSG provides tools to check the status and monitor existing VOs.
OSG facilitates the formation of VOs by providing an open-market-like infrastructure that allows the
consortium members to advertise their resources and goals and establish VOs to explore their objectives.
The VO concept is used in a recursive manner; VOs may be composed of sub-VOs. For more information we refer to the Blueprint for the OSG (2004).
A basic infrastructure must be provided to form a VO, including a VO Membership Service (VOMS)
and operation support. The operation support's main goal is to provide technical support services at
the request of a member site. As OSG intends to federate across heterogeneous Grid environments,
the resources of the member sites and users are organised in VOs under the contracts that result from
negotiations among the sites, which in turn have to follow the consortium's policies. Such contracts are
defined at the middleware layer and can be negotiated in an automated fashion; however, thus far there
is no easily responsive means to form a VO and the formation requires complex multilateral agreements
among the involved sites.
OSG middleware uses VOMS to support authorisation services for VO members, hence helping in
the maintenance and operation phases. Additionally, for the sake of scalability and ease of administration, the Grid User Management System (GUMS) facilitates the mapping of Grid credentials to site-specific credentials. GUMS and VOMS provide means to facilitate authorisation in the operation and
maintenance phases. GridCat provides maps and statistics on jobs running and storage capacity of the
member sites. This information can guide schedulers and brokers on job submission and in turn facilitate
the operation phase. Additionally, MonALISA (MONitoring Agents using a Large Integrated Services
Architecture) (Legrand et al., 2004) has been utilised to monitor computational nodes, applications and
network performance of the VOs within the consortium.
Enabling Grids for E-sciencE (EGEE): Similarly to OSG, EGEE (2005) federates resource centres
to enable a global infrastructure for researchers. EGEE's resource centres are hierarchically organised:
an Operations Manager Centre (OMC) located at CERN, Regional Operations Centres (ROC) located
in different countries, Core Infrastructure Centres (CIC) and Resource Centres (RC) responsible for
providing resources to the Grid. A ROC carries out activities such as supporting deployment and operations,
negotiating SLAs within its region, and organising certification authorities. CICs are in charge of providing
VO-services, such as maintaining VO-Servers and registration; VO-specific services such as databases,
resource brokers and user interfaces; and other activities such as accounting and resource usage. The
OMC interfaces with international Grid efforts. It is also responsible for activities such as approving
connection with new RCs, promoting cross-trust among CAs, and enabling cooperation and agreements
with user communities, VOs and existing national and regional infrastructures.
To join EGEE, in addition to the installation of the Grid middleware, there is a need for a formal
request and further assessment from special committees. Once the application is considered suitable for
EGEE, a VO will be formed. Accounting is based on the use of resources by members of the VO. EGEE
currently utilises LCG-2/gLite (2005).

Other Important Work


Resource allocation in a VO depends on, and is driven by, many conditions and rules: the VO can be formed
by physical organisations under different, sometimes conflicting, resource usage policies. Participating
organisations provide their resources to the VO, which can be defined in terms of SLAs, and agree to
enforce VO level policies defining who has access to the resources in the VO. Different models can be
adopted for negotiation and enforcement of SLAs. One model is by relying on a trusted VO manager.
Resource providers supply resources to the VO according to SLAs established with the VO manager.
The VO manager in turn assigns resource quotas to VO groups and users based on a commonly agreed
VO-level policy. In contrast, a VO can follow a democratic or P2P sharing approach, in which you give
what you can and get what others can offer, or you get what you give (Wasson & Humphrey, 2003).
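To make the trusted VO manager model concrete, the short Python sketch below illustrates one way such a manager could turn provider SLAs into per-group quotas. The class, method and share names are illustrative assumptions made for this chapter, not part of any of the surveyed systems.

# Illustrative sketch (not from any surveyed system): a trusted VO manager
# aggregates the capacity promised by providers through SLAs and divides it
# among VO groups according to a commonly agreed VO-level share policy.
class VOManager:
    def __init__(self, group_shares):
        # group_shares: fractions agreed at the VO level, summing to 1.0
        assert abs(sum(group_shares.values()) - 1.0) < 1e-9
        self.group_shares = group_shares
        self.sla_capacity = {}  # provider name -> CPU-hours promised via SLA

    def register_sla(self, provider, cpu_hours):
        # Record the capacity a provider agreed to supply to the VO.
        self.sla_capacity[provider] = cpu_hours

    def group_quotas(self):
        # Split the total contributed capacity according to the VO-level policy.
        total = sum(self.sla_capacity.values())
        return {group: share * total for group, share in self.group_shares.items()}

# Example: two providers contribute capacity; quotas follow the agreed shares.
manager = VOManager({"physics": 0.5, "biology": 0.3, "chemistry": 0.2})
manager.register_sla("site-A", 1000)
manager.register_sla("site-B", 600)
print(manager.group_quotas())  # {'physics': 800.0, 'biology': 480.0, 'chemistry': 320.0}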
Elmroth and Gardfjäll (2005) presented an approach for enabling Grid-wide fair-share scheduling.
The work introduces a scheduling framework that enforces fair-share policies on a Grid-wide scale. The
policies are hierarchical in the sense that they can be subdivided recursively to form a tree of shares.
Although the policies are hierarchical, they are enforced in a flat and decentralised manner. In the proposed framework, resources have local policies that divide the available capacity among the VOs. These
local policies have references to the VO-level policies. Although the proposed framework and algorithm
do not require a centralised scheduler, they may impose some overhead for locally caching global usage
information.
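The fragment below gives a rough illustration of the share-tree idea in Python: it flattens a nested policy tree into per-leaf target shares and derives a priority from how far recorded usage deviates from the target. It is a simplified sketch with invented names and a toy priority formula, not the actual algorithm of Elmroth and Gardfjäll.

# Simplified sketch of hierarchical fair-share evaluation; names and the
# priority formula are illustrative, not those of the surveyed framework.
def flatten_shares(node, inherited=1.0, path=""):
    # node maps a name to (weight, children); children is None for leaves.
    shares = {}
    total_weight = sum(weight for weight, _ in node.values())
    for name, (weight, children) in node.items():
        share = inherited * weight / total_weight
        full_path = path + "/" + name
        if children:
            shares.update(flatten_shares(children, share, full_path))
        else:
            shares[full_path] = share
    return shares

def priorities(usage, shares, total_usage):
    # Leaves that consumed less than their target share get a higher value.
    return {leaf: share - usage.get(leaf, 0.0) / total_usage
            for leaf, share in shares.items()}

# Example policy tree: VO-A holds 60% of the Grid and splits it 2:1 internally.
policy = {"VO-A": (60, {"hep": (2, None), "bio": (1, None)}),
          "VO-B": (40, None)}
shares = flatten_shares(policy)  # hep -> 0.4, bio -> 0.2, VO-B -> 0.4 (up to rounding)
usage = {"/VO-A/hep": 700.0, "/VO-A/bio": 100.0, "/VO-B": 200.0}
print(priorities(usage, shares, total_usage=1000.0))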

MAPPING OF SURVEYED WORK AGAINST THE TAXONOMIES


This section presents the mapping of the surveyed projects against the proposed taxonomies. For
simplicity, only a selection of the surveyed work is included in the tables
presented in this section.
Table 1 classifies existing work according to their architectures and operational models. Gridbus
Broker, GridWay, and SNAP-based community resource broker are resource brokers that act on behalf
of users to submit jobs to Grid resources to which they have access. They follow the operational model
based on job routing. Although GridWay provides means for the deployment of virtual machines, this
deployment takes place on a job basis (Rubio-Montero et al., 2007). DI-GRUBER, VioCluster, Condor
flocking and CSF have a distributed-scheduler architecture in which brokers or meta-schedulers have
bilateral sharing agreements between them (Table 2). OurGrid and Self-organising flock of Condors
utilise P2P networks of brokers or schedulers, whereas Grid federation uses a P2P network to build a
shared space utilised by providers and users to post resource claims and requests respectively (Table
2). VioCluster and Shirako enable the creation of virtualised environments in which job routing or job
pulling based systems can be deployed. However, in these last two systems, resources are controlled at
the level of containment or virtual machines.
Table 2 summarises the communication models and sharing mechanisms utilised by distributed-scheduler based systems. Shirako uses transitive agreements in which brokers can exchange claims of
resources issued by site authorities who represent the resource providers. It allows brokers to delegate
access to resources multiple times.
The resource control techniques employed by the surveyed systems are summarised in Table 3. As
described beforehand, VioCluster and Shirako use containment based resource control, whereas the remaining systems utilise the job model. EGEE WMS and DI-GRUBER take into account the scheduling
of jobs according to the VOs to which users belong and the shares contributed by resource providers.
The other systems can be utilised to form a single VO wherein jobs can be controlled on a user basis.
The support of various works to the VO life-cycle phases is depicted in Table 4. We select a subset
of the surveyed work, particularly the work that focuses on VO related issues such as their formation
and operation. DI-GRUBER and gLite schedule jobs by considering the resource shares of multiple
VOs. EGEE and OSG also work as facilitators of VOs by providing consortiums to which organisations
can join and start VOs (Table 5). However, the process is not automated and requires the establishment
of contracts between the consortium and the physical resource providers. Shirako enables the creation
of virtualised environments spanning multiple providers, which can be used for hosting multiple VOs
(Ramakrishnan et al., 2006).
The systems characteristics and the VOs they enable are summarised in Table 5. Conoise and Akogrimo allow the formation of dynamic VOs in which the VO can be started by a user utilising a mobile

539

Architectural Elements of Resource Sharing Networks

Table 1. GRMSs according to architectures and operational models

System                                    Architecture            Operational Model
SGE and PBS                               Independent clusters    Job routing
Condor-G                                  Independent clusters*   Job routing
Gridbus Broker                            Resource Broker         Job routing
GridWay                                   Resource Broker         Job routing**
SNAP-Based Community Resource Broker      Resource Broker         Job routing
EGEE WMS                                  Centralised             Job routing
KOALA                                     Centralised             Job routing
PlanetLab                                 Centralised             N/A***
Computing Center Software (CCS)           Hierarchical            Job routing
GRUBER/DI-GRUBER                          Distributed/static      Job routing
VioCluster                                Distributed/static      N/A***
Condor flocking                           Distributed/static      Matchmaking
Community Scheduler Framework             Distributed/static      Job routing
OurGrid                                   Distributed/dynamic     Job routing
Self-organising flock of Condors          Distributed/dynamic     Matchmaking
Grid federation                           Distributed/dynamic     Job routing
Askalon                                   Distributed/dynamic     Job routing
SHARP/Shirako                             Distributed/dynamic     N/A***
Delegated Matchmaking                     Hybrid                  Matchmaking

* Condor-G provides software that can be used to build meta-schedulers.
** GridWay also manages the deployment of virtual machines.
*** PlanetLab, VioCluster and Shirako use resource control at the containment level, even though they enable the creation of virtual execution environments on which systems based on job routing can be deployed.

Table 2. Classification of GRMSs according to their sharing arrangements

System                                    Communication Pattern    Sharing Mechanism
GRUBER/DI-GRUBER                          Bilateral agreements     System centric
VioCluster                                Bilateral agreements     Site centric
Condor flocking                           Bilateral agreements     Site centric
OurGrid                                   P2P network              System centric
Self-organising flock of Condors          P2P network              Site centric
Grid federation                           Shared space             Site centric
Askalon                                   Bilateral agreements     Site centric
SHARP/Shirako                             Transitive agreements    Self-interest
Delegated MatchMaking                     Bilateral agreements     Site centric


Table 3. Classification of GRMSs according to their support for VOs and resource control

System                                    Support for VOs    Resource Control
EGEE WMS                                  Multiple VO        Job model
KOALA                                     Single VO          Job model
GRUBER/DI-GRUBER                          Multiple VO        Job model
VioCluster                                Single VO          Container model/multiple site*
Condor flocking                           Single VO          Job model
OurGrid                                   Single VO          Job model
Self-organising flock of Condors          Single VO          Job model
Grid federation                           Single VO          Job model
Askalon                                   Single VO          Job model
SHARP/Shirako                             Multiple VO**      Container model/multiple site***
Delegated MatchMaking                     Single VO          Job model

* VioCluster supports containment at both single site and multiple site levels.
** Shirako enables the creation of multiple containers that can in turn be used by multiple VOs, even though it does not handle issues on job scheduling amongst multiple VOs.
*** Shirako supports containment at both (i) single site level through Cluster on Demand and (ii) multiple-site level. Shirako also explores resource control at job level by providing recommendations on the site in which jobs should be executed.

Table 4. Support to the phases of the VO's life-cycle by the projects analysed

Project Name     Creation                            Operation                           Maintenance     Dissolution     Support for short-term collaborations
OSG*             Partial                             Partial                             Not available   Not available   Not available
EGEE/gLite*      Partial                             Available                           Not available   Not available   Not available
CONOISE          Available                           Available                           Available       Not available   Available
TrustCoM         Mainly related to security issues   Mainly related to security issues   Not available   Not available   Not available
DI-GRUBER        Not available                       Available                           Partial**       Not available   Not available
Akogrimo***      Partial                             Partial                             Partial         Partial         Partial
Shirako          Not available                       Available                           Available       Not available   Not available

* OSG and EGEE work as consortiums enabling trust among organisations and facilitating the formation of VOs. They also provide tools for monitoring the status of resources and job submissions. EGEE's WMS performs the scheduling taking into account multiple VOs.
** DI-GRUBER's policy decision points allow for the re-adjustment of the VOs according to the current resource shares offered by providers and the status of the Grid.
*** Akogrimo aims at enabling collaboration between doctors upon the patient's request or in case of a health emergency.


Table 5. Mapping of the systems against the proposed VO taxonomies

System               Dynamism         Goal Orientation   Duration                Control         Policy Enforcement   Facilitators
Conoise*             Dynamic/Hybrid   Targeted           Medium-lived            Decentralised   Democratic           N/A
TrustCoM**           Static           Targeted           Long-lived              N/A             N/A                  N/A
GRUBER/DI-GRUBER     Static           Targeted           Long-lived              Decentralised   Decentralised***     N/A
gLite/EGEE           Static           Targeted           Long-lived              Centralised     Centralised          Centralised+
Open Science Grid    Static           Targeted           Long-lived              Hierarchical    Centralised          Market-like
Akogrimo             Dynamic/Hybrid   Targeted           Short or Medium-lived   Decentralised   Democratic           N/A
Shirako              Dynamic          Non-targeted       Medium-lived            Decentralised   Democratic           N/A

* Conoise and Akogrimo allow a client using a mobile device to start a VO, thus the VO can comprise fixed and mobile resources.
** TrustCoM deals with security issues and does not provide tools for the management and policy enforcement in VOs.
*** DI-GRUBER uses a network of decision points to guide submitting hosts and schedulers about which resources can execute the jobs.
+ The EGEE Workload Management System is aware of the VOs and schedules jobs according to the VOs in the system.

The virtual environments enabled by Shirako can be adapted by leasing additional resources or
terminating leases according to the demands of the virtual organisation they are hosting (Ramakrishnan et
al., 2006). Resource providers in Shirako may offer their resources in return for economic compensation, meaning that the resource providers may not have a common target in solving a particular resource
challenge. This makes the VOs non-targeted.
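The following Python fragment sketches, under our own simplifying assumptions, the kind of control loop that could adapt such a lease to the hosted VO's demand; the function name and thresholds are hypothetical and do not correspond to Shirako's API.

# Hypothetical control loop for adapting a leased virtual environment to load
# (illustrative names and thresholds; not the Shirako API).
def adapt_lease(current_nodes, observed_load, grow_at=0.8, shrink_at=0.3):
    # Return the node count the next lease renewal should request.
    utilisation = observed_load / current_nodes
    if utilisation > grow_at:
        return current_nodes + max(1, current_nodes // 4)   # lease extra resources
    if utilisation < shrink_at and current_nodes > 1:
        return current_nodes - 1                             # let part of the lease expire
    return current_nodes

# Example: a VO holding 8 nodes with roughly 7.5 nodes' worth of work queued.
print(adapt_lease(current_nodes=8, observed_load=7.5))  # -> 10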

FUTURE TRENDS
Over the last decade, the distributed computing realm has been characterised by the deployment of
large-scale Grids such as EGEE and TeraGrid. Such Grids have provided the research community with
an unprecedented number of resources, which have been used for a wide range of scientific research. However,
the hardware and software heterogeneity of the resources provided by the organisations within a Grid
has increased the complexity of deploying applications in these environments. Recently, application
deployment has been facilitated by the intensifying use of virtualisation technologies.
The increasing ubiquity of virtual machine technologies has enabled the creation of customised environments atop a physical infrastructure and the emergence of new business models such as virtualised
data centres and cloud computing. The use of virtual machines brings several benefits such as: server
consolidation, the ability to create VMs to run legacy code without interfering with other applications'
APIs, improved security through the creation of application sandboxes, dynamic provisioning of virtual
machines to services, and performance isolation.


Existing virtual-machine based resource management systems can manage a cluster of computers
within a site allowing the creation of virtual workspaces (Keahey et al., 2006) or virtual clusters (Foster
et al., 2006; Montero et al., 2008; Chase et al., 2003). They can bind resources to virtual clusters or
workspaces according to a customer's demand. These resource managers allow the user to create customised virtual clusters using shares of the physical machines available at the site. In addition, current
data centres are using virtualisation technology to provide users with the look and feel of tapping into a
dedicated computing and storage infrastructure for which they are charged a fee based on usage (e.g.
Amazon Elastic Computing Cloud3 and 3Tera4).
These factors are resulting in the creation of virtual execution environments or slices that span both
commercial and academic computing sites. Virtualisation technologies minimise many of the concerns
that previously prevented the peering of resource sharing networks, such as the execution of unknown
applications and the lack of guarantees over resource control. For the resource provider, substantial
work is being carried out on the provisioning of resources to services and user applications. Techniques
such as workload forecasts along with resource overbooking can reduce the need for over-provisioning
a computing infrastructure. Users can benefit from the improved reliability, the performance isolation,
and the environment isolation offered by virtualisation technologies.
We are likely to see an increase in the number of virtual organisations enabled by virtual machines,
allocating resources from both commercial data centres and research testbeds. We suggest that
emerging applications will require the prompt formation of VOs, which are also quickly responsive
and automated. VOs can have dynamic resource demands, which can be met quickly by data centres
relying on virtualisation technologies. There can also be an increase in business workflows relying on
globally available messaging-based systems for process synchronisation5. Our current research focuses
on connecting computing sites managed by virtualisation technologies to create distributed virtual
environments for user applications.

CONCLUSION
This chapter presents classifications and a survey of systems that can provide means for inter-operating
resource sharing networks. It also provides taxonomies on Virtual Organisations (VOs) with a focus on
Grid computing practices. Hence, we initially discussed the challenges in VOs and presented a background
on the life-cycle of VOs and on resource sharing networks. This chapter suggests that future applications
will require the prompt formation of VOs, which are also quickly responsive and automated. This may
be enabled by virtualisation technology and corroborates current trends towards multi-site containers
or virtual workspaces. Relevant work and technology in the area were presented and discussed.

ACKNOWLEDGMENT
We thank Marco Netto, Alexandre di Costanzo and Chee Shin Yeo for sharing their thoughts on the
topic and helping to improve the structure of this chapter. We are grateful to Mukaddim Pathan for
proofreading a preliminary version of this chapter. This work is supported by research grants from the
Australian Research Council (ARC) and the Australian Department of Innovation, Industry, Science and
Research (DIISR). Marcos' PhD research is partially supported by NICTA.


REFERENCES
A Blueprint for the Open Science Grids. (2004, December). Snapshot v0.9.
Adabala, S., Chadha, V., Chawla, P., Figueiredo, R., Fortes, J., & Krsul, I. (2005, June). From virtualized resources to virtual computing Grids: the In-VIGO system. Future Generation Computer Systems,
21(6), 896-909. doi:10.1016/j.future.2003.12.021
Andrade, N., Brasileiro, F., Cirne, W., & Mowbray, M. (2007). Automatic Grid assembly by promoting
collaboration in peer-to-peer Grids. Journal of Parallel and Distributed Computing, 67(8), 957-966.
doi:10.1016/j.jpdc.2007.04.011
Andrade, N., Cirne, W., Brasileiro, F., & Roisenberg, P. (2003). OurGrid: An approach to easily assemble
Grids with equitable resource sharing. In 9th Workshop on Job Scheduling Strategies for Parallel Processing (Vol. 2862, pp. 61-86). Berlin/Heidelberg: Springer.
Australian Partnership for Advanced Computing (APAC) Grid. (2005). Retrieved from http://www.apac.
edu.au/programs/GRID/index.html.
Balazinska, M., Balakrishnan, H., & Stonebraker, M. (2004, March). Contract-based load management
in federated distributed systems. In 1st Symposium on Networked Systems Design and Implementation
(NSDI) (pp. 197-210). San Francisco: USENIX Association.
Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., et al. (2003). Xen and the art of virtualization. In 19th ACM Symposium on Operating Systems Principles (SOSP 2003) (pp. 164-177). New
York: ACM Press.
Boghosian, B., Coveney, P., Dong, S., Finn, L., Jha, S., Karniadakis, G. E., et al. (2006, June). Nektar,
SPICE and vortonics: Using federated Grids for large scale scientific applications. In IEEE Workshop
on Challenges of Large Applications in Distributed Environments (CLADE). Paris: IEEE Computing
Society.
Brune, M., Gehring, J., Keller, A., & Reinefeld, A. (1999). Managing clusters of geographically distributed high-performance computers. Concurrency (Chichester, England), 11(15), 887-911. doi:10.1002/
(SICI)1096-9128(19991225)11:15<887::AID-CPE459>3.0.CO;2-J
Bulhões, P. T., Byun, C., Castrapel, R., & Hassaine, O. (2004, May). N1 Grid Engine 6 Features and
Capabilities [White Paper]. Phoenix, AZ: Sun Microsystems.
Butt, A. R., Zhang, R., & Hu, Y. C. (2003). A self-organizing flock of condors. In 2003 ACM/IEEE
Conference on Supercomputing (SC 2003) (p. 42). Washington, DC: IEEE Computer Society.
Buyya, R., Abramson, D., & Giddy, J. (2000, June). An economy driven resource management architecture for global computational power grids. In 7th International Conference on Parallel and Distributed
Processing Techniques and Applications (PDPTA 2000). Las Vegas, AZ: CSREA Press.
Caromel, D., di Costanzo, A., & Mathieu, C. (2007). Peer-to-peer for computational Grids: Mixing clusters
and desktop machines. Parallel Computing, 33(4-5), 275-288. doi:10.1016/j.parco.2007.02.011


Catlett, C., Beckman, P., Skow, D., & Foster, I. (2006, May). Creating and operating national-scale
cyberinfrastructure services. Cyberinfrastructure Technology Watch Quarterly, 2(2), 210.
Chase, J. S., Irwin, D. E., Grit, L. E., Moore, J. D., & Sprenkle, S. E. (2003). Dynamic virtual clusters in
a Grid site manager. In 12th IEEE International Symposium on High Performance Distributed Computing (HPDC 2003) (p. 90). Washington, DC: IEEE Computer Society.
Chinese National Grid (CNGrid) Project Web Site. (2007). Retrieved from http://www.cngrid.org/
CNGrid GOS Project Web site. (2007). Retrieved from http://vega.ict.ac.cn
Dang, V. D. (2004). Coalition Formation and Operation in Virtual Organisations. PhD thesis, Faculty
of Engineering, Science and Mathematics, School of Electronics and Computer Science, University of
Southampton, Southampton, UK.
de Assunção, M. D., & Buyya, R. (2008, December). Performance analysis of multiple site resource
provisioning: Effects of the precision of availability information [Technical Report]. In International
Conference on High Performance Computing (HiPC 2008) (Vol. 5374, pp. 157-168). Berlin/Heidelberg:
Springer.
de Assunção, M. D., & Buyya, R. (in press). Performance analysis of allocation policies for interGrid
resource provisioning. Information and Software Technology.
de Assunção, M. D., Buyya, R., & Venugopal, S. (2008, June). InterGrid: A case for internetworking
islands of Grids. [CCPE]. Concurrency and Computation, 20(8), 997-1024. doi:10.1002/cpe.1249
Dimitrakos, T., Golby, D., & Kearley, P. (2004, October). Towards a trust and contract management
framework for dynamic virtual organisations. In eChallenges. Vienna, Austria.
Dixon, C., Bragin, T., Krishnamurthy, A., & Anderson, T. (2006, September). Tit-for-Tat Distributed
Resource Allocation [Poster]. The ACM SIGCOMM 2006 Conference.
Dumitrescu, C., & Foster, I. (2004). Usage policy-based CPU sharing in virtual organizations. In 5th
IEEE/ACM International Workshop on Grid Computing (Grid 2004) (pp. 53-60). Washington, DC:
IEEE Computer Society.
Dumitrescu, C., & Foster, I. (2005, August). GRUBER: A Grid resource usage SLA broker. In J. C. Cunha
& P. D. Medeiros (Eds.), Euro-Par 2005 (Vol. 3648, pp. 465-474). Berlin/Heidelberg: Springer.
Dumitrescu, C., Raicu, I., & Foster, I. (2005). DI-GRUBER: A distributed approach to Grid resource
brokering. In 2005 ACM/IEEE Conference on Supercomputing (SC 2005) (p. 38). Washington, DC:
IEEE Computer Society.
Dumitrescu, C., Wilde, M., & Foster, I. (2005, June). A model for usage policy-based resource allocation
in Grids. In 6th IEEE International Workshop on Policies for Distributed Systems and Networks (pp.
191-200). Washington, DC: IEEE Computer Society.
Elmroth, E., & Gardfjäll, P. (2005, December). Design and evaluation of a decentralized system for
Grid-wide fairshare scheduling. In 1st IEEE International Conference on e-Science and Grid Computing
(pp. 221-229). Melbourne, Australia: IEEE Computer Society Press.


Enabling Grids for E-sciencE (EGEE) project. (2005). Retrieved from http://public.eu-egee.org.
Epema, D. H. J., Livny, M., van Dantzig, R., Evers, X., & Pruyne, J. (1996). A worldwide flock of condors: Load sharing among workstation clusters. Future Generation Computer Systems, 12(1), 53-65.
doi:10.1016/0167-739X(95)00035-Q
Fontán, J., Vázquez, T., Gonzalez, L., Montero, R. S., & Llorente, I. M. (2008, May). OpenNEbula: The
open source virtual machine manager for cluster computing. In Open Source Grid and Cluster Software
Conference Book of Abstracts. San Francisco.
Foster, I., Freeman, T., Keahey, K., Scheftner, D., Sotomayor, B., & Zhang, X. (2006, May). Virtual
clusters for Grid communities. In 6th IEEE International Symposium on Cluster Computing and the
Grid (CCGRID 2006) (pp. 513-520). Washington, DC: IEEE Computer Society.
Foster, I., & Kesselman, C. (1997, Summer). Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputer Applications, 11(2), 115-128.
Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the Grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200-222.
Frey, J., Tannenbaum, T., Livny, M., Foster, I. T., & Tuecke, S. (2001, August). Condor-G: A computation
management agent for multi-institutional Grids. In 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001) (pp. 55-63). San Francisco: IEEE Computer Society.
Fu, Y., Chase, J., Chun, B., Schwab, S., & Vahdat, A. (2003). SHARP: An architecture for secure resource
peering. In 19th ACM Symposium on Operating Systems Principles (SOSP 2003) (pp. 133-148). New
York: ACM Press.
gLite - Lightweight Middleware for Grid Computing. (2005). Retrieved from http://glite.web.cern.ch/
glite.
Graupner, S., Kotov, V., Andrzejak, A., & Trinks, H. (2002, August). Control Architecture for Service
Grids in a Federation of Utility Data Centers (Technical Report No. HPL-2002-235). Palo Alto, CA:
HP Laboratories Palo Alto.
Grid Interoperability Now Community Group (GIN-CG). (2006). Retrieved from http://forge.ogf.org/
sf/projects/gin.
Grimme, C., Lepping, J., & Papaspyrou, A. (2008, April). Prospects of collaboration between compute
providers by means of job interchange. In Job Scheduling Strategies for Parallel Processing (Vol. 4942,
p. 132-151). Berlin / Heidelberg: Springer.
Grit, L. E. (2005, October). Broker Architectures for Service-Oriented Systems [Technical Report].
Durham, NC: Department of Computer Science, Duke University.
Grit, L. E. (2007). Extensible Resource Management for Networked Virtual Computing. PhD thesis,
Department of Computer Science, Duke University, Durham, NC. (Adviser: Jeffrey S. Chase)


Haji, M. H., Gourlay, I., Djemame, K., & Dew, P. M. (2005). A SNAP-based community resource broker
using a three-phase commit protocol: A performance study. The Computer Journal, 48(3), 333-346.
doi:10.1093/comjnl/bxh088
Hey, T., & Trefethen, A. E. (2002). The UK e-science core programme and the Grid. Future Generation
Computer Systems, 18(8), 1017-1031. doi:10.1016/S0167-739X(02)00082-1
Huang, R., Casanova, H., & Chien, A. A. (2006, April). Using virtual Grids to simplify application
scheduling. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006). Rhodes Island, Greece: IEEE.
Huedo, E., Montero, R. S., & Llorente, I. M. (2004). A framework for adaptive execution in Grids.
Software, Practice & Experience, 34(7), 631-651. doi:10.1002/spe.584
Iosup, A., Epema, D. H. J., Tannenbaum, T., Farrellee, M., & Livny, M. (2007, November). Inter-operating
Grids through delegated matchmaking. In 2007 ACM/IEEE Conference on Supercomputing (SC 2007)
(pp. 1-12). New York: ACM Press.
Irwin, D., Chase, J., Grit, L., Yumerefendi, A., Becker, D., & Yocum, K. G. (2006, June). Sharing
networked resources with brokered leases. In USENIX Annual Technical Conference (pp. 199-212).
Berkeley, CA: USENIX Association.
Katzy, B., Zhang, C., & Löh, H. (2005). Virtual organizations: Systems and practices. In L. M. Camarinha-Matos, H. Afsarmanesh, & M. Ollus (Eds.), (p. 45-58). New York: Springer Science+Business
Media, Inc.
Keahey, K., Foster, I., Freeman, T., & Zhang, X. (2006). Virtual workspaces: Achieving quality of service
and quality of life in the Grids. Science Progress, 13(4), 265-275.
Kertész, A., Farkas, Z., Kacsuk, P., & Kiss, T. (2008, April). Grid enabled remote instrumentation. In F.
Davoli, N. Meyer, R. Pugliese, & S. Zappatore (Eds.), 2nd International Workshop on Distributed Cooperative Laboratories: Instrumenting the Grid (INGRID 2007) (pp. 303-312). New York: Springer US.
Kim, K. H., & Buyya, R. (2007, September). Fair resource sharing in hierarchical virtual organizations
for global Grids. In 8th IEEE/ACM International Conference on Grid Computing (Grid 2007) (pp.
50-57). Austin, TX: IEEE.
Legrand, I., Newman, H., Voicu, R., Cirstoiu, C., Grigoras, C., Toarta, M., et al. (2004, SeptemberOctober). Monalisa: An agent based, dynamic service system to monitor, control and optimize Grid based
applications. In Computing in High Energy and Nuclear Physics (CHEP), Interlaken, Switzerland.
Litzkow, M. J., Livny, M., & Mutka, M. W. (1988, June). Condor - a hunter of idle workstations. In 8th
International Conference of Distributed Computing Systems (pp. 104-111). San Jose, CA: Computer
Society.
Metz, C. (2001). Interconnecting ISP networks. IEEE Internet Computing, 5(2), 74-80.
doi:10.1109/4236.914650
Mohamed, H., & Epema, D. (in press). KOALA: A co-allocating Grid scheduler. Concurrency and
Computation.


Montero, R. S., Huedo, E., & Llorente, I. M. (2008, September/October). Dynamic deployment of custom
execution environments in Grids. In 2nd International Conference on Advanced Engineering Computing
and Applications in Sciences (ADVCOMP 2008) (pp. 33-38). Valencia, Spain: IEEE Computer Society.
National e-Science Centre. (2005). Retrieved from http://www.nesc.ac.uk.
Norman, T. J., Preece, A., Chalmers, S., Jennings, N. R., Luck, M., & Dang, V. D. (2004). Agent-based formation of virtual organisations. Knowledge-Based Systems, 17, 103-111. doi:10.1016/j.
knosys.2004.03.005
Open Science Grid. (2005). Retrieved from http://www.opensciencegrid.org
Open Source Metascheduling for Virtual Organizations with the Community Scheduler Framework
(CSF) (Tech. Rep.) (2003, August). Ontario, Canada: Platform Computing.
OpenPBS. The portable batch system software. (2005). Veridian Systems, Inc., Mountain View, CA.
Retrieved from http://www.openpbs.org/scheduler.html
Padala, P., Shin, K. G., Zhu, X., Uysal, M., Wang, Z., Singhal, S., et al. (2007, March). Adaptive control
of virtualized resources in utility computing environments. In 2007 Conference on EuroSys (EuroSys
2007) (pp. 289-302). Lisbon, Portugal: ACM Press.
Patel, J., Teacy, L. W. T., Jennings, N. R., Luck, M., Chalmers, S., & Oren, N. (2005). Agent-based virtual
organisations for the Grids. International Journal of Multi-Agent and Grid Systems, 1(4), 237-249.
Peterson, L., Muir, S., Roscoe, T., & Klingaman, A. (2006, May). PlanetLab Architecture: An Overview
(Tech. Rep. No. PDN-06-031). Princeton, NJ: PlanetLab Consortium.
PlanetLab Europe. (2008). Retrieved from http://www.planet-lab.eu/.
Ramakrishnan, L., Irwin, D., Grit, L., Yumerefendi, A., Iamnitchi, A., & Chase, J. (2006). Toward a
doctrine of containment: Grid hosting with adaptive resource control. In 2006 ACM/IEEE Conference
on Supercomputing (SC 2006) (p. 101). New York: ACM Press.
Ranjan, R., Buyya, R., & Harwood, A. (2005, September). A case for cooperative and incentive-based
coupling of distributed clusters. In 7th IEEE International Conference on Cluster Computing. Boston,
MA: IEEE CS Press.
Ranjan, R., Harwood, A., & Buyya, R. (2006, September). SLA-based coordinated superscheduling
scheme for computational Grids. In IEEE International Conference on Cluster Computing (Cluster
2006) (pp. 18). Barcelona, Spain: IEEE.
Ranjan, R., Rahman, M., & Buyya, R. (2008, May). A decentralized and cooperative workflow scheduling algorithm. In 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID
2008). Lyon, France: IEEE Computer Society.
Ricci, R., Oppenheimer, D., Lepreau, J., & Vahdat, A. (2006, January). Lessons from resource allocators for large-scale multiuser testbeds. SIGOPS Operating Systems Review, 40(1), 25-32.
doi:10.1145/1113361.1113369


Rubio-Montero, A., Huedo, E., Montero, R., & Llorente, I. (2007, March). Management of virtual
machines on Globus Grids using GridWay. In IEEE International Parallel and Distributed Processing
Symposium (IPDPS 2007) (pp. 1-7). Long Beach, USA: IEEE Computer Society.
Ruth, P., Jiang, X., Xu, D., & Goasguen, S. (2005, May). Virtual distributed environments in a shared
infrastructure. IEEE Computer, 38(5), 63-69.
Ruth, P., McGachey, P., & Xu, D. (2005, September). VioCluster: Virtualization for dynamic computational domains. In IEEE International Conference on Cluster Computing (Cluster 2005) (pp. 1-10). Burlington,
MA: IEEE.
Ruth, P., Rhee, J., Xu, D., Kennell, R., & Goasguen, S. (2006, June). Autonomic live adaptation of virtual
computational environments in a multi-domain infrastructure. In 3rd IEEE International Conference on
Autonomic Computing (ICAC 2006) (pp. 5-14). Dublin, Ireland: IEEE.
Sairamesh, J., Stanbridge, P., Ausio, J., Keser, C., & Karabulut, Y. (2005, March). Business Models for
Virtual Organization Management and Interoperability (Deliverable A - WP8&15 WP - Business &
Economic Models No. V.1.5). Deliverable document 01945 prepared for TrustCom and the European
Commission.
Schwiegelshohn, U., & Yahyapour, R. (1999). Resource allocation and scheduling in metasystems. In
7th International Conference on High-Performance Computing and Networking (HPCN Europe 99)
(pp. 851860). London, UK: Springer-Verlag.
Shoykhet, A., Lange, J., & Dinda, P. (2004, July). Virtuoso: A System For Virtual Machine Marketplaces
[Technical Report No. NWU-CS-04-39]. Evanston/Chicago: Electrical Engineering and Computer Science Department, Northwestern University.
Siddiqui, M., Villazón, A., & Fahringer, T. (2006). Grid capacity planning with negotiation-based advance reservation for optimized QoS. In 2006 ACM/IEEE Conference on Supercomputing (SC 2006)
(pp. 2121). New York: ACM.
Smarr, L., & Catlett, C. E. (1992, June). Metacomputing. Communications of the ACM, 35(6), 44-52.
doi:10.1145/129888.129890
Svirskas, A., Arevas, A., Wilson, M., & Matthews, B. (2005, October). Secure and trusted virtual organization management. ERCIM News (63).
The TrustCoM Project. (2005). Retrieved from http://www.eu-trustcom.com.
Vázquez-Poletti, J. L., Huedo, E., Montero, R. S., & Llorente, I. M. (2007). A comparison between two grid
scheduling philosophies: EGEE WMS and Grid Way. Multiagent and Grid Systems, 3(4), 429-439.
Venugopal, S., Nadiminti, K., Gibbins, H., & Buyya, R. (2008). Designing a resource broker for heterogeneous Grids. Software, Practice & Experience, 38(8), 793-825. doi:10.1002/spe.849
Wang, Y., Scardaci, D., Yan, B., & Huang, Y. (2007). Interconnect EGEE and CNGRID e-infrastructures
through interoperability between gLite and GOS middlewares. In International Grid Interoperability
and Interoperation Workshop (IGIIW 2007) with e-Science 2007 (pp. 553-560). Bangalore, India: IEEE
Computer Society.


Wasson, G., & Humphrey, M. (2003). Policy and enforcement in virtual organizations. In 4th International Workshop on Grid Computing (pp. 125-132). Washington, DC: IEEE Computer Society.
Wesner, S., Dimitrakos, T., & Jeffrey, K. (2004, October). Akogrimo - the Grid goes mobile. ERCIM
News, (59), 32-33.

ENDNOTES
1 http://www.vmware.com/
2 The personal communication amongst GIN-CG members is online at: http://www.ogf.org/pipermail/gin-ops/2007-July/000142.html
3 http://aws.amazon.com/ec2/
4 http://www.3tera.com/
5 http://aws.amazon.com/sqs/

Section 6

Optimization Techniques


Chapter 24

Simultaneous MultiThreading
Microarchitecture
Chen Liu
Florida International University, USA
Xiaobin Li
Intel Corporation, USA
Shaoshan Liu
University of California, Irvine, USA
Jean-Luc Gaudiot
University of California, Irvine, USA

ABSTRACT
Due to the conventional sequential programming model, the Instruction-Level Parallelism (ILP) that
modern superscalar processors can explore is inherently limited. Hence, multithreading architectures
have been proposed to exploit Thread-Level Parallelism (TLP) in addition to conventional ILP. By issuing and executing instructions from multiple threads at each clock cycle, Simultaneous MultiThreading
(SMT) achieves some of the best possible system resource utilization and accordingly higher instruction
throughput. In this chapter, the authors describe the origin of SMT microarchitecture, comparing it with
other multithreading microarchitectures. They identify several key aspects for high-performance SMT
design: fetch policy, handling long-latency instructions, resource sharing control, synchronization and
communication. They also describe some potential benefits of SMT microarchitecture: SMT for fault-tolerance and SMT for secure communications. Given the need to support sequential legacy code and
the emergence of new parallel programming models, we believe SMT microarchitecture will play a vital role as
we enter the multi-thread multi/many-core processor design era.

INTRODUCTION
Ever since the first integrated circuits (IC) were independently invented by Jack Kilby (Nobel Prize
Laureate in Physics in 2000) from Texas Instruments and Robert Noyce (co-founder of Intel) around
50 years ago, we have witnessed an exponential growth of the whole semiconductor industry.
DOI: 10.4018/978-1-60566-661-7.ch024


Simultaneous MultiThreading Microarchitecture

Figure 1. Moore's Law: Transistor count increase

Moore's Law and Memory Wall


The semiconductor industry has been driven by Moore's law (Moore, 1965) for about 40 years with the
continuing advancements in VLSI technology. Moore's law states that the number of transistors on a
single chip doubles every TWO years, as shown in Figure 1, which is based on data from both Intel
and AMD. A corollary of Moore's law states that the feature size of chip manufacturing technology keeps
decreasing at the rate of one half approximately every FIVE years (a quarter every two years), based on
our observation shown in Figure 2.
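A quick back-of-the-envelope check of these two rates can be expressed in a few lines of Python; the starting values below are only illustrative.

# Back-of-the-envelope growth implied by the two observations above
# (arbitrary 1971-style starting values, purely for illustration).
transistors, feature_nm = 2300, 10000.0
for year in range(0, 41, 10):
    print(year, format(transistors, ","), "transistors,", round(feature_nm), "nm")
    transistors *= 2 ** 5            # five two-year doublings per decade
    feature_nm *= 0.5 ** (10 / 5)    # halves roughly every five years
# After four decades the toy model reaches a few billion transistors and
# feature sizes of a few tens of nanometres, in line with the text above.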
As the number of transistors on a chip grows exponentially, we have reached the point where we
could have more than one billion transistors on a single chip. For example, the Dual-Core Itanium 2
from Intel integrates more than 1.7 billion transistors (Intel, 2006). How to efficiently utilize this huge
amount of transistor estate is a challenging task which has recently preoccupied many researchers and
system architects from both academia and industry.
Processor and memory integration technologies both follow Moore's law. Memory latency, however,
is drastically increasing relative to the processor speed. This is often referred to as the Memory Wall
problem (Hennessy, 2006). Indeed, Figure 3 shows that CPU performance increases at an average rate
of 55% per year, while memory performance increases at a much lower average rate of 7% per year.
There is no sign this gap will be remedied in the near future. Even though the processor speed is continuously increasing, and processors can handle increasing numbers of instructions in one clock cycle,
we will continue experiencing considerable performance degradation each time we need to
access the memory. Pipeline stalls will occur when the data does not arrive soon enough after it has
been requested from the memory.
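The widening gap implied by these two growth rates can be illustrated with a small calculation (a sketch only; both curves are normalised to 1 in year zero).

# Processor vs. memory performance divergence, using the growth rates above
# (55% and 7% per year); both are normalised to 1 in year zero.
cpu = mem = 1.0
for _ in range(20):
    cpu *= 1.55
    mem *= 1.07
print("CPU improved %.0fx, memory %.0fx, gap roughly %.0fx" % (cpu, mem, cpu / mem))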


Figure 2. Moore's Law: Feature size decrease

Figure 3. Memory wall


Overcoming the Limits of Instruction-Level Parallelism


Modern superscalar processors are capable of fetching multiple instructions at the same time and executing
as many instructions as there are functional units, exploiting the Instruction-Level Parallelism (ILP) that
inherently exists even in otherwise sequential programs. Furthermore, in order to extract more instructions that can be executed in parallel, these processors have employed dynamic instruction scheduling
and have been equipped with larger instruction windows than ever.
Even though increasing the size of the instruction window would increase to some extent the amount
of ILP that a superscalar processor can deliver, control and data dependencies among the instructions,
branch mispredictions, and long-latency operations such as memory accesses limit the effective size of
the instruction window. For SPEC benchmark programs (http://www.spec.org/), a basic instruction block
typically consists of up to 25 instructions (Huang, 1999). However, the average block size for integer
programs has remained small (Mahadevan, 1994), around 4-5 instructions (Marcuello, 1999). Wall (1991)
also pointed out that most representative application programs do not have an intrinsic ILP higher than 7
instructions per cycle even with unbounded resources and optimistic assumptions. Hence, if many slots
in the instruction window are occupied by those depending on a preceding instruction suffering from a
cache miss for the input operands, the effective size of the instruction window is quite small: only a few
instructions can be issued due to the lack of Instruction-Level Parallelism. Therefore, the performance
achieved by such processors is far below the theoretical peak as a result of poor resource utilization. For
example, even a superscalar processor with a fetch width of eight instructions, derived from the MIPS
R10000 processor (Yeager, 1996), equipped with out-of-order execution and speculation, provided an
Instruction Per Cycle (IPC) reading of only 2.7 for a multi-programming workload (multiple independent
programs), and 3.3 for a parallel-programming workload (one parallelized program), despite a potential
of eight (Eggers, 1997).

BACKGROUND
As the design of modern microprocessors, either of superscalar or Very Long Instruction Word (VLIW)
architectures, has been pushed to its limit, the performance gain that could be achieved is diminishing
due to limited Instruction-Level Parallelism, even with deeper (in terms of pipeline stages), wider (in
terms of the fetch/execute/retire bandwidth) pipeline design (Culler, 1998; Eggers, 1997; Hennessy,
2006). Needless to say, the performance of a superscalar processor depends on how many independent
instructions are delivered to both the front-end (all the stages before execution) and the back-end stages
of the pipeline. Due to the sequential programming model, most of the software programs are written
without giving consideration to parallelizing the code. This introduces practical problems when it comes
to executing those programs because of many control and data dependencies. This has compelled hardware architects to focus on breaking the barriers introduced by limited ILP:

One approach entails performing speculative execution in order to deliver more Instruction-Level
Parallelism. Many techniques for speculative execution have been studied to alleviate the impact
of control dependencies among instructions. As the pipeline of microprocessors becomes wider
and deeper, however, the penalty of incorrect speculation increases significantly.


The other approach entails exploiting Thread-Level Parallelism (TLP) as well as ILP. If we can
break the boundary among threads and execute instructions from multiple threads, there is a better
chance to find instructions ready to execute.

Multithreading Microarchitectures
Multithreading microarchitectures can be classified by their method of thread switching: coarse-grain
multithreading, fine-grain multithreading, Chip Multi-Processing (CMP) and Simultaneous MultiThreading (SMT). Different implementation methods can significantly affect the behavior of the application.
In coarse-grain multithreading and fine-grain multithreading, at each cycle, we still execute instructions
from a single thread only. In Chip Multi-Processing and Simultaneous MultiThreading, at each cycle
we execute instructions from multiple threads concurrently.

Hardware-Supported Multithreading
The original idea of hardware-supported multithreading was to increase the performance by overlapping communication operations with computation operations in parallel architectures, without any
intervention from the software (Culler, 1998). Based on the frequency of thread swapping operations,
hardware-supported multithreading can be divided into two categories (a small cycle-by-cycle scheduling sketch contrasting the two follows the list):

In coarse-grain multithreading (or blocked multithreading), a new thread is selected for execution
only when a long-latency event occurs for the current thread, such as an L2 cache miss or a remote communication request. The advantage of coarse-grain multithreading is that it masks the
otherwise wasted slots with the execution of another thread. The disadvantage is that when there
are multiple short-latency events, the context switch overhead is high. Due to the limited ILP, the
issue slot is not fully utilized when executing one thread. The MIT Alewife is implemented using
this technique (Agarwal, 1995).
In fine-grain multithreading (or interleaved multithreading), a new thread is selected for execution
at every clock cycle, compared with coarse-grain multithreading which only switches context on
long-latency events. The advantage is that it does not require extra logic to detect the long-latency events and it handles both long-latency and short-latency events because the context switch
will happen anyway. The disadvantage is also the context switch overhead. Due to the single-thread execution at every clock cycle, the issue slot is not fully utilized either. HEP (Smith, 1981),
HORIZON (Thistle, 1988) and TERA (Alverson, 1990) all belong to this category.
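The toy Python sketch below contrasts the two switching rules cycle by cycle; it is an illustration under simplified assumptions, not the selection logic of any of the machines cited above.

# Toy contrast of the two thread-switching rules (not any real machine's logic).
# stalled[t] is True when thread t is blocked on a long-latency event.
def next_thread_coarse(current, stalled):
    # Coarse-grain: keep the current thread unless it hits a long-latency event.
    if not stalled[current]:
        return current
    return next_thread_fine(current, stalled)    # switch only on a stall

def next_thread_fine(current, stalled):
    # Fine-grain: rotate to a different ready thread every single cycle.
    n = len(stalled)
    for offset in range(1, n + 1):
        candidate = (current + offset) % n
        if not stalled[candidate]:
            return candidate
    return current                               # all threads are stalled

# With thread 0 stalled on an L2 miss, both rules pick another thread, but the
# fine-grain rule would also have rotated away even without the miss.
print(next_thread_coarse(0, [True, False, False]))  # -> 1
print(next_thread_fine(1, [True, False, False]))    # -> 2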

Chip Multi-Processing
A CMP processor normally consists of multiple single-thread processing cores. As each core executes
a separate thread, concurrent execution of multiple threads, hence TLP, is realized (Hammond, 1997).
However, the resources on each core are not shared with others, even though a shared L2 cache (or L3
cache if there is one) is common in CMP designs. Each core is relatively simpler than a heavy-weight
superscalar processor.
The width of the pipeline of each core is smaller so that the pressure to explore Instruction-Level
Parallelism is reduced. Because of the simpler pipeline design, each core does not need to run at a high
frequency, directly leading to a reduction in power consumption. Actually this is one of the reasons why
the industry shifted from single-core to multi-core processor design: the Power
Wall problem. We cannot keep adding transistors to the processor and rely solely on raw frequency
increase to claim it as the next-generation processor, simply because the power density is prohibitive. On the other hand, with multi-core design, we avoid the Power Wall because we can now operate
at a lower frequency while the power consumption increases only linearly with the number of cores. For a
CMP processor, however, if the application program cannot be effectively parallelized, the cores will
be under-utilized because we cannot find enough threads to keep all the cores busy at one time. In the
worst case, only one core is working and we cannot execute across the cores to utilize the idle functional
units of other cores.
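A rough numerical illustration of this argument, using the common dynamic-power approximation P ~ C * V^2 * f with made-up voltage and frequency values, is given below.

# Rough dynamic-power comparison, P ~ C * V^2 * f, with made-up values.
def dynamic_power(cores, voltage, frequency_ghz, capacitance=1.0):
    return cores * capacitance * voltage ** 2 * frequency_ghz

single_fast = dynamic_power(cores=1, voltage=1.3, frequency_ghz=4.0)   # ~6.8 units
quad_slow = dynamic_power(cores=4, voltage=0.9, frequency_ghz=2.0)     # ~6.5 units
# The four slower, lower-voltage cores draw comparable power while offering up
# to twice the aggregate throughput, provided the workload parallelises.
print(single_fast, quad_slow)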
There are actually two categories of CMP design: homogeneous multi-core and heterogeneous multi-core. In homogeneous multi-core, we have identical cores on the same die. For example, the Stanford
Hydra (Hammond, 2000) integrates four MIPS-based processors on a single die. The IBM POWER4
(Tendler, 2002) is a 2-way CMP design. The Intel Core 2 Duo processor series, Core 2 Extreme Quad-Core processors and AMD Opteron Quad-Core processors are new-generation CMP designs.
Heterogeneous multi-core is an asymmetric design: there is normally one or more general-purpose
processing core(s) and multiple specialized, application-specific processing units. The IBM Cell/B.E.
belongs to this category.

Simultaneous MultiThreading
Superscalar or VLIW architectures are often equipped with more functional units than the width of the
pipeline, because of a more aggressive execution type. Often, not all functional units are active at the
same time because of an insufficient number of instructions to execute due to limited ILP. Simultaneous
MultiThreading has been proposed as an architectural technique whose goal is to efficiently utilize the
resources of a superscalar machine without introducing excessive additional control overhead. An SMT
processor is still one physical processor, but it is made to appear like multiple logical processors. In an
effort to reduce hardware implementation overhead, most of the pipeline resources are shared, including instruction queues and functional units. Only hardware parts necessary to retain the thread context
are duplicated, e.g., program counter (PC), register files and branch predictors, as shown in Figure 4
(Kang, 2004). By allowing one processor to execute two or more threads concurrently, a Simultaneous
MultiThreading microarchitecture can exploit both Instruction-Level Parallelism and Thread-Level
Parallelism, accordingly achieving improved instruction throughput (Burns, 2002; Lee, 2003; Nemirovsky, 1991; Shin, 2003; Tullsen, 1995; Tullsen, 1996; Yamamoto, 1995). The multiple threads can
come either from a parallelized program (parallel-programming workload) or from multiple independent
programs (multi-programming workload). With the help of multiple thread contexts that keep track of the
dynamic status of each thread, SMT processors have the ability to fetch, issue and execute instructions
from multiple threads at every clock cycle, taking advantage of the vast number of functional units that
neither superscalar nor VLIW processors can absorb. Also because of TLP, the pressure on exploring
ILP within a single thread is reduced and we do not need aggressive speculative execution any longer.
This reduces the chances of wrong-path execution. Hence, Simultaneous MultiThreading is one of the
most efficient architectures for utilizing the vast computing power that such a microprocessor would have,
achieving near-optimal system resource utilization and higher performance.


Figure 4. SMT vs. CMP

The difference in scheduling among Superscalar, CMP and SMT is shown in Figure 5: CMP exploits
TLP by executing different threads in parallel on different processing cores while SMT exploits TLP by
simultaneously issuing instructions from different threads with a large issue width on a single processor.
Figure 5. Resource utilization comparison of different microarchitectures


From the graph we can see that SMT processors inherently decrease the horizontal and vertical waste
by executing instructions fetched from different threads (Eggers, 1997). They can provide enhanced
performance in terms of instruction throughput as a result of making better use of the resources.
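The notions of horizontal and vertical waste can be made concrete with a small computation over a per-cycle issue trace; the sketch below uses invented trace values purely for illustration.

# Horizontal waste: issue slots left empty in cycles where some instructions issue.
# Vertical waste: whole cycles in which nothing issues at all.  Trace values invented.
def issue_waste(issued_per_cycle, issue_width):
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    total_slots = issue_width * len(issued_per_cycle)
    return horizontal / total_slots, vertical / total_slots

superscalar_trace = [3, 0, 2, 0, 4, 1, 0, 2]   # many empty slots and idle cycles
smt_trace = [6, 4, 5, 3, 7, 5, 4, 6]           # other threads fill the empty slots
print(issue_waste(superscalar_trace, 8))       # -> (0.4375, 0.375)
print(issue_waste(smt_trace, 8))               # -> (0.375, 0.0)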

Commercial Implementation of SMT


SMT has been an active research area for more than one decade and has also met with some commercial success. Among others, embryonic implementations can be found in the design of the CDC 6600
(Thornton, 1970), the HEP (Smith, 1981), the TERA (Alverson, 1990), the HORIZON (Thistle, 1988),
and the APRIL (Agarwal, 1990) architectures, in which there exists some concept of multithreading
or Simultaneous MultiThreading. The first major commercial development of SMT was embodied in
the DEC 21464 (EV-8) (Preston, 2002). However, it never made it into production line after DEC was
acquired by Compaq. The Intel Pentium 4 processor 3.06GHz or higher (Hinton, 2001) and Intel
Xeon processor families (Marr, 2002) are the first modern desktop/server processor implemented
SMT, with a basic 2-thread SMT engine (named Hyper-Threading (HT) technology by Intel). When
multiple threads are available, two threads can be executed simultaneously; if there is only one thread
to execute, the resources can be combined together as if it were one single processor. Intel claims its
Hyper-Threading technology implementation only requires 5% hardware overhead, while provides up
to a 65% performance improvement (Marr, 2002). This matches exactly the stated implementation goal
of Hyper-Threading: smallest hardware overhead and high enough performance gain (Marr, 2002).
Recently we see a trend to blur the boundary between CMP and SMT, which is multi-core multi-thread
processor. For example, IBM POWER5 (Sinharoy, 2005) is such an implementation, with multi-core
on a single chip and each core is a 2-thread SMT engine. MIPS Technology designed an SMT system
called MIPS MT. One implementation of this architecture has 8 cores and each core is a 4-thread SMT
engine. All these examples demonstrate the power and popularity of SMT.

SMT DESIGN ASPECTS


With the concept of SMT in mind, this section will dive into the unique design aspects of this microarchitecture. The techniques used to boost the performance of SMT processors can be roughly divided
into the following categories: fetch policy, handling long-latency instructions, active resource allocation
and cache coherence communication.

Thread Selection Policy


Just like superscalar machines, the performance of an SMT processor is affected by the quality of the
instructions injected into the pipeline. There are two critical aspects to this observation:

First, if the instructions fetched and/or executed have dependencies among each other or if they
have long latencies, the ILP and TLP which can be exploited will be limited. This will result in a
clogging of the instruction window and a stalling of the front-end stages.
Second, if the instructions fetched and/or executed belong to the wrong path, these instructions
would compete with the instructions from the correct path for system resources in both the front-end and the back-end, which would degrade the overall performance and power efficiency.

Therefore, how to fill the front-end stages of an SMT processor with high-quality instructions from
multiple threads is a critical decision which must be made at each cycle. Tullsen et al. (1996) suggested
the following priority-based thread-scheduling policies for SMT microarchitectures that surpass the
simple Round-Robin policy:

BRCOUNT policy, which prioritizes the threads according to the number of unresolved branches
in the front-end of the pipeline.
MISSCOUNT policy, which prioritizes the threads according to the number of outstanding
D-Cache misses.
ICOUNT policy, which prioritizes the threads according to the number of instructions in the front-end stages.
IQPOSN policy, which prioritizes the threads according to which one has the oldest instruction in
the instruction queue.

Among those, ICOUNT policy was found to provide the best performance in terms of overall instruction
throughput. The reason is that the ICOUNT variable can indicate the current performance of the thread
to some extent. However, the ICOUNT policy does not take speculative execution into account because
it does not consider that after an instruction has been injected into the pipeline, it may be discarded
whenever a conditional branch preceding the instruction has been determined to have been incorrectly
predicted. ICOUNT fails to distinguish between the instructions discarded in the intermediate stages
due to incorrect speculation and the ones normally retired from the pipeline. Furthermore, the ICOUNT
policy does not handle long-latency instructions well. If one thread has a temporarily low ICOUNT, it
does not necessarily mean that a cache miss will not happen to the current instructions from that thread.
As a result, the ICOUNT variable may incorrectly reflect the respective activities of the threads. This is
one of the reasons why the sustained instruction throughput obtained under the ICOUNT-based policy
still remains significantly lower than the possible peak.
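A minimal sketch of an ICOUNT-style fetch choice is shown below; it is illustrative only, since real implementations maintain the counters in hardware and may fetch from more than one thread per cycle.

# ICOUNT-style selection: fetch from the thread with the fewest instructions
# occupying the front-end stages (decode, rename and the instruction queues).
def icount_pick(front_end_counts, fetchable):
    # front_end_counts[t]: thread t's instructions currently in the front-end;
    # fetchable[t]: whether thread t can fetch this cycle (e.g. no I-cache miss).
    candidates = [t for t in range(len(front_end_counts)) if fetchable[t]]
    if not candidates:
        return None
    return min(candidates, key=lambda t: front_end_counts[t])

# Thread 2 has the fewest in-flight front-end instructions, so it fetches first.
print(icount_pick([12, 7, 3, 9], [True, True, True, False]))  # -> 2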
Sometimes, a priority-based fetch policy could cause uneven execution of the threads, considering
the case that one thread has very few cache misses while the other one has frequent misses. In an effort
to avoid biased execution so that all the threads can progress equally, Raasch et al. (1999) proposed
a priority-rotating scheme in an attempt to increase the execution of instructions from less efficient
threads when threads are of equal priority. However, the performance of this scheme is not as good as
anticipated: the throughput falls short of the ICOUNT policy, and sometimes even of the Round-Robin policy. The
authors suggested strengthening the scheme by incorporating a branch confidence estimator into the
fetch decision-making process.

Handling Long-Latency Instructions


Due to the Memory Wall problem, there is a major factor that affects resource distribution in SMT microarchitectures: long-latency instructions such as load misses. These instructions will clog the pipeline unless the data can be pre-fetched from memory. When one thread has injected many instructions into the pipeline and a load miss happens, the missing instruction and the instructions depending on it cannot move forward at all. Thus, the residency of those instructions in the pipeline does not necessarily translate into an increased overall instruction throughput. On the contrary, they pollute the instruction window and waste system resources which could otherwise be utilized by instructions from other threads. Considering the severe damage these instructions can cause, an SMT processor must be aware of the execution of long-latency instructions.
Since ICOUNT does not handle long-latency instructions well, Tullsen et al. (2001) proposed two fetch
policies that can better deal with those instructions. One is STALL, which immediately stops fetching
from a thread once a cache miss has been detected. The other is FLUSH, which flushes the instructions
from those threads with long-latency loads out of the pipeline, rather than occupying system resources
while waiting for the completion of the long-latency operations. In both schemes, however, the detection of long-latency operations comes too late (after an L2 miss), and flushing out all the instructions
already fetched into the pipeline is not a power-efficient solution.
There are several other techniques that attempt to advance the handling of those long-latency instructions, hence improving SMT performance. In DG (El-Moursy, 2003), when the number of outstanding
L1 data cache misses from a thread is beyond a preset threshold, fetching from that thread is prohibited.
However, L1 cache misses do not necessarily lead to L2 cache misses. Therefore, stalling a thread in
such a case may be too severe and would cause an unnecessary stall and resource under-use. It has thus
been proposed in DWarn (Cazorla, IPDPS, 2004) to use L1 cache misses as an indicator of L2 cache
misses and give those threads with cache misses a lower fetch priority instead of stalling them. This
allows DWarn to act in a controlled manner on L1 misses before L2 misses even happen so as to reduce
resource under-use and avoid harming a thread when L1 misses do not lead to L2 misses.
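The behavioral difference between these policies can be sketched as follows; the attribute names, thresholds, and the single scalar priority are hypothetical simplifications, used only to contrast outright gating (STALL, DG) with de-prioritization (DWarn):

# Illustrative comparison of miss-handling fetch policies (not the published
# implementations). Returns None when the thread should not fetch at all,
# otherwise a priority value (higher is preferred).

L1_MISS_THRESHOLD = 2        # hypothetical DG gating threshold
DWARN_PENALTY = 1000         # hypothetical demotion applied by DWarn

def fetch_priority(thread, policy):
    if policy == "STALL" and thread.outstanding_l2_misses > 0:
        return None                            # stop fetching after an L2 miss
    if policy == "DG" and thread.outstanding_l1_misses > L1_MISS_THRESHOLD:
        return None                            # gate on L1 misses (may over-react)
    priority = -thread.icount                  # ICOUNT base: fewer in-flight wins
    if policy == "DWARN" and thread.outstanding_l1_misses > 0:
        priority -= DWARN_PENALTY              # demote the thread instead of stalling it
    return priority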

Resource Partitioning among Multiple Threads


If we want to exploit more TLP, we need multiple threads to co-exist in the pipeline. At the same time,
competition for system resources among these threads is also introduced. The overall performance of
an SMT processor depends on many factors. How to distribute the resources among multiple threads is
certainly one of the key issues in order to achieve better performance. Nevertheless, there are different
opinions when it comes to this specific problem. Sometimes a dynamic sharing method can be applied
to the system resources at every pipeline stage in SMT microarchitectures (Eggers, 1997; Tullsen, 1995;
Tullsen, 1996), which means threads can compete for the resources and there is no quota on the resources
that one single thread could utilize. At other times, all the major queues can be statically partitioned
(Koufaty, 2003; Marr, 2002), so that each thread has its own portion of the resources and there is no
overlap. In most of the fetch policy studies, dynamic sharing was normally used and assumed to be
capable of maximizing the resource utilization and corresponding performance.
Fetch policy alone achieves the resource distribution function in an extremely indirect and limited
way. Upon a load miss, the pipeline of a superscalar processor will simply stall once it runs out of instructions, until the operand returns from memory. In an SMT processor, on a load miss, the other thread(s) can still proceed thanks to TLP, but only in a handicapped way, because the instructions from the thread with the cache miss continue to occupy system resources in the pipeline. This directly translates into a reduction in the amount of system resources that the other thread(s) can utilize; this is what we call mutual-hindrance execution. Hence, we do need direct control over system resources in order to achieve what we call mutual-benefit execution. This would allow us to avoid the resources being unevenly distributed among threads, which could cause pipeline clogging.
An investigation of the impact of different system resource partitioning mechanisms on SMT processors
was performed by Raasch et al. (2003). Various system resources, like instruction queue, ReOrder Buffer
(ROB), issue bandwidth, and commit bandwidth are studied under different partitioning mechanisms.


Figure 6. Fetch prioritizing and throttling scheme

The authors concluded that the true power of SMT lies in its ability to issue and execute instructions
from different threads at every clock cycle. If these resources are partitioned among threads, this would severely impair the ability of SMT to exploit TLP. Hence, the issue bandwidth has to be shared all the time. They also observed that partitioning the storage queues, like ROB, has little impact on the overall system performance. DCRA (Cazorla, MICRO, 2004) was proposed in an attempt to dynamically allocate the resources among threads by dividing the execution of each thread into different phases, using instruction and cache miss counts as indicators. The study shows that DCRA achieves around 18% performance gain over ICOUNT in terms of harmonic mean. Hill-Climbing (Choi, 2006) dynamically allocates the resources based on the current performance of each thread, which is fed back into the resource-allocation engine. It uses a hill-climbing algorithm to first sample several different resource distributions, find the local optimum, and then adopt that distribution. It achieves a slightly higher performance (2.4%) than DCRA but is certainly the most expensive one in terms of execution overhead when it comes to finding the local optimum. There is also the concern of how to establish that the local optimum found is indeed the global optimum.
Liu C. et al. (2008) extended their work by proposing several different resource sharing control
schemes and combining them with the front-end fetch policy to enforce the resource distribution. They
also studied the impact on the overall performance caused by enforcing resource sharing control on both
the front-end and the back-end of the pipeline. They introduced a two-level decision making process.
The widely accepted ICOUNT policy is still used for thread prioritizing in order to select the candidate
thread to fetch instructions from in the next clock cycle. On top of the ICOUNT policy, another variable,
the Occupancy Counter, is adopted. Each thread occupying a resource currently monitored is associated
with a designated Occupancy Counter. At every clock cycle, more instructions from a given thread are
fed into the queue. Also some instructions from the thread leave the queue and are passed onto the next
stage of the pipeline or retire. The value of the Occupancy Counter is updated every cycle after evaluating the number of instructions from that thread in the specific queue. If, after updating, the value of the Occupancy Counter of a running thread is greater than its assigned resource cap, the fetching of instructions from that thread will be stalled in the next clock cycle, even if it has the highest priority under the ICOUNT policy. This allows the throttling of selected thread(s) after prioritizing, which enforces the resource sharing control among multiple threads, as shown in Figure 6. Four different resource sharing control mechanisms have been proposed:

D-Share: Both Instruction Fetch Queue (IFQ) and ROB are in the dynamic sharing mode, just like
other system resources. No throttling.
IFQ-Fen: Enforcing the sharing control on IFQ. The cap is set to half of the IFQ entries, and other system resources are in the dynamic sharing mode. Throttling is based on the Occupancy Counter of IFQ.
ROB-Fen: Enforcing the sharing control on ROB. The cap is set to half of the ROB entries, while other system resources are in the dynamic sharing mode. Throttling is based on the Occupancy Counter of ROB.
Dual-Fen: Enforcing the sharing control on both IFQ and ROB. The cap is set to half of the IFQ or ROB entries, and other system resources are in the dynamic sharing mode. Throttling is based on the Occupancy Counters of either IFQ or ROB.

It is found that controlling the resource sharing of either IFQ or ROB is not sufficient if implemented
alone. However, when controlling the resource sharing of both IFQ and ROB, the Dual-Fen scheme can
yield an average performance gain of 38% when compared with the dynamic sharing case. The average
L1 D-Cache miss rate has been reduced by 33%. The average time during which an instruction resides
in the pipeline has been reduced by 34%. This demonstrates the power of the resource sharing control
mechanism for SMT microarchitectures.
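A minimal sketch of this two-level decision, assuming hypothetical per-thread ICOUNT and Occupancy Counter bookkeeping and Dual-Fen-style caps of half the IFQ and ROB entries, could look as follows:

# Sketch of ICOUNT prioritizing followed by occupancy-based throttling
# (Dual-Fen-style; sizes, names, and structure are illustrative assumptions).

IFQ_ENTRIES, ROB_ENTRIES = 32, 128
CAPS = {"IFQ": IFQ_ENTRIES // 2, "ROB": ROB_ENTRIES // 2}

def select_fetch_thread(threads):
    # Level 1: ICOUNT-style prioritizing (fewest in-flight instructions first).
    for t in sorted(threads, key=lambda th: th.icount):
        # Level 2: throttle any thread whose Occupancy Counter exceeds its cap.
        if t.occupancy["IFQ"] <= CAPS["IFQ"] and t.occupancy["ROB"] <= CAPS["ROB"]:
            return t
    return None    # every thread is over a cap; nothing is fetched this cycle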

SMT Synchronization and Communication


When multiple processes share data, their accesses to the shared data must be serialized according to the
program semantics so as to avoid errors caused by non-deterministic data access behavior. Conventional
synchronization mechanisms in Symmetric MultiProcessing (SMP) designs are constrained by long
synchronization latency, resource contention, as well as synchronization granularity. Synchronization
latency is determined by where synchronization operations take place. For conventional SMP machines
that perform synchronization operations in memory, it can take hundreds of cycles to complete one synchronization operation. Resource contention exists in many of the existing synchronization operations,
e.g., test-and-set and compare-and-swap. These operations utilize a polling mechanism, which introduces serious contention problems. When multiple processes attempt to lock a shared variable in memory, only one process will succeed, while all other attempts are purely overhead. In addition, contention
may lead to deadlock situations that require extra mechanisms for deadlock prevention, which further
degrade system performance. Furthermore, due to the long-latency associated with each synchronization
operation, most synchronization operations in SMP designs are coarse-grained. Thus, a data structure
such as an array needs to be locked for synchronization although only one array element is actually being synchronized at any instant of parallel execution. This results in unnecessary serialization of the access
to data structures, and restricts the parallelization of programs (Liu, 2007).


Figure 7. Microarchitecture of the Godson-2 SMT processor

The granularity and performance of synchronization operations determine the degree of parallelism
that can be extracted from a program. Hence the conventional coarse-grained synchronization operations cannot exploit the fine-grained parallelism which is required for SMT designs. As demonstrated
by Tullsen et al. (1999), an SMT processor differs from a conventional multiprocessor in several crucial
ways which influence the design of SMT synchronization:

Threads share data in L1 cache, instead of in memory as in SMP designs, implying a much lower
synchronization latency.
Hardware thread contexts on an SMT processor share functional units, thus synchronization and
communication of data can be much more effective than through memory. Based on this characteristic, one possible way of synchronization is through direct register access between two threads.
Threads on an SMT processor compete for all fetch and execution resources each cycle, thus
synchronization mechanisms that consume any shared resources without making progress can
impede other threads. In the extreme case, when one thread demands blocking synchronization
while holding all the resources such as all instruction window entries, a deadlock would occur.

Based on these differences between SMT and conventional multiprocessor designs, the synchronization operations for SMT designs should possess the following properties:


Low Latency: this can be easily achieved because threads in SMT share data in the L1 cache. As
mentioned before, one possibility of synchronization is through direct register access, but this
may complicate the hardware design to avoid deadlock situations.
Fine-Grained: the degree of parallelism that can be exploited in a parallel computing system is limited by the granularity of synchronization. To achieve high performance, the SMT design must be capable of handling fine-grained synchronization.


Minimum Contention: conventional synchronization mechanisms such as spin locks require either spinning or retrying, thus consuming system resources. This effect is highly undesirable. To
achieve high performance, stalled threads must use zero processor resources.
Deadlock Free: blocked threads must release processor resources to allow execution progress.

One interesting SMT synchronization mechanism is implemented in the Godson-2 SMT processor.
As shown in Figure 7 (Li, 2006), the Godson-2 SMT processor supports the simultaneous execution of two
threads, and each thread owns its individual program counter, logical registers, and control registers.
Other system resources, including various queues, pipeline path, functional units, and caches are shared
between the two threads.
The Godson-2 SMT processor implements full/empty synchronization to pass messages between
threads at the register level. Each register has an associated full/empty bit and each register can be read
and written by synchronized read and write instructions. Communication and synchronization through
registers meets the goal of low latency; also, the granularity of synchronization in this case is at the
single register level, which meets the goal of fine granularity. On the other hand, the full/empty scheme may result in deadlock. Consider a synchronized read instruction that, after being decoded, reaches the register renaming stage while the register it reads is still empty (not ready or not yet produced). If this instruction waits in the register renaming stage for the register it reads to be set to full, it will block the pipeline and result in a
deadlock. One solution to this problem is to block synchronized read/write instructions in the instruction buffer in the decode stage and rename the register to get the correct physical register number only
after the register is full (ready or produced). This approach avoids blocking the whole pipeline and thus
prevents deadlocks. Furthermore, this synchronization mechanism is contention-free because once a
synchronized read operation is issued, the thread is blocked and not consuming any processor resources
until the operation is retired.
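The full/empty semantics can be illustrated with a small software model; this is only a behavioral sketch, since the actual mechanism is implemented in the Godson-2 decode and rename hardware:

# Behavioral model of full/empty register synchronization (illustrative only).

class SyncRegister:
    def __init__(self):
        self.value, self.full = None, False

    def sync_write(self, value):       # producer thread
        self.value, self.full = value, True

    def try_sync_read(self):           # consumer thread
        # A synchronized read is held back (in the instruction buffer at the
        # decode stage, in the Godson-2 design) until the register is full,
        # so the shared pipeline itself is never blocked.
        if not self.full:
            return None                # not yet produced; retry later
        self.full = False              # consume the value and mark the register empty
        return self.value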
Another interesting SMT synchronization approach has been proposed by Tullsen et al. (1999). This
approach uses hardware-based blocking locks such that a thread which fails to acquire a lock, blocks and
frees all resources it is using except for the hardware context itself. Further, a thread that releases a lock
causes the blocked thread to be restarted. The implementation of this scheme consists of two hardware
primitives, Acquire and Release, and one hardware data structure, a lock box. The Acquire operation
acquires a memory-based lock and does not complete until the lock has been acquired. The Release
operation releases the lock if no other thread is waiting; otherwise, the next waiting thread is unblocked.
The lock box contains one entry per context and each entry contains the address of the lock, a pointer
to the lock instruction that blocked and a valid bit. The scheme works as follows: when a thread fails to
acquire a lock, the lock address and instruction pointer are stored in the lock box entry, and the thread is
flushed from the processor after the lock instruction. When another thread releases the lock, the blocked
thread is found in the lock box and its execution is resumed. In the meantime, this thread's lock box entry is invalidated. This approach has low latency and is fine-grained because synchronization takes place at
the level of the L1 cache and the size of the data can be adjusted. Also, when a thread is blocked, all its
instructions are flushed from the instruction queue, thus guaranteeing execution progress and freedom
from deadlock. In addition, this approach imposes minimal contention because once Acquire fails, the
thread is blocked and consumes no processor resources.
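A behavioral sketch of the lock-box scheme is given below; the data structures and return values are hypothetical software stand-ins for what the proposal implements directly in hardware:

# Sketch of the lock-box scheme (one entry per hardware context; illustrative).

lock_box = {}      # context id -> (lock address, pc of the blocked lock instruction)
lock_owner = {}    # lock address -> context id currently holding the lock

def acquire(ctx, lock_addr, pc):
    if lock_addr not in lock_owner:
        lock_owner[lock_addr] = ctx
        return "acquired"
    lock_box[ctx] = (lock_addr, pc)    # record the blocked context ...
    return "blocked"                   # ... which is then flushed from the pipeline

def release(lock_addr):
    del lock_owner[lock_addr]
    for waiter, (addr, pc) in list(lock_box.items()):
        if addr == lock_addr:          # hand the lock to a waiting context
            del lock_box[waiter]
            lock_owner[lock_addr] = waiter
            return ("restart", waiter, pc)   # the blocked thread resumes at pc
    return ("freed", None, None)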
As indicated by Liu S. et al. (2008), modern applications may not contain enough ILP due to data
dependencies among instructions. Nevertheless, value prediction techniques are able to exploit the inherent data redundancy in application programs. Specifically, value prediction techniques can predict the value an instruction will produce before it executes, so that execution can move on with the correctly predicted value. Value prediction requires extra hardware resources, as well as a recovery mechanism for the cases in which a value is not correctly predicted. SMT is a natural platform for value prediction because the system is underutilized precisely when there is not enough ILP. When this happens, a speculative thread can be triggered to perform value prediction on the underutilized resources, which allows the execution to proceed if the value is correctly predicted. Value prediction techniques in the context of SMT architectures have been studied by Gontmakher et al. (2006) and Tuck et al. (2005). Tuck et al. (2005) show that, by allowing the value-speculative execution to proceed in a separate thread, value prediction is able to overcome data dependencies present in traditional computing paradigms; a 40% performance gain has been reported. Gontmakher et al. (2006) examine the interaction of speculative execution with thread-related operations and develop techniques to allow thread-related operations to be speculatively executed; the results demonstrate a 25% performance improvement.

POTENTIAL BENEFITS OF SMT


We have discussed a number of design issues. We will now address some potential incidental benefits
of SMT microarchitectures beyond strict performance improvement.

SMT for Fault-Tolerance


One possible SMT application is to design microprocessors resistant to transient faults (Li X., 2006).
The multi-thread execution paradigm inherently provides the spatial and temporal redundancy necessary for fault-tolerance. We can run two copies of the same thread on an SMT processor and compare the results in order to detect any transient fault which may have occurred in the meantime. Upon detection of an error, the processor state can then be rolled back to a known safe point and the instructions retried, thereby resulting in an error-free execution. This means that temporal redundancy is
inherently implemented by SMT: for instance, assume a soft error occurred in a functional unit (FU)
when executing an instruction from thread #1. Even though the FUs are typically shared between active
threads, since the soft error is assumed to be transient, as long as the same instruction from thread #2 is
executed at a different moment, the results of the redundant execution from the two copied threads would
not match. Furthermore, if any fault in the pipeline is detected, the checkpoint information can then be
used to return the processor to a state corresponding to a fault-free point. After that, the processor can
retry the instructions from the point of recovery.
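Conceptually, the detect-and-recover loop looks like the following sketch; it is a purely illustrative software model, since in a real design the checkpointing, duplication, and comparison are all performed by the hardware:

# Illustrative model of redundant execution with checkpointing and rollback.

def run_with_fault_detection(program, execute, checkpoint, restore):
    state = checkpoint()                    # known-safe starting point
    for inst in program:
        r_lead = execute(inst, copy="LT")   # leading copy
        r_trail = execute(inst, copy="TT")  # trailing copy, run at a different time
        while r_lead != r_trail:            # mismatch: a transient fault occurred
            restore(state)                  # roll back to the safe point and retry
            r_lead = execute(inst, copy="LT")
            r_trail = execute(inst, copy="TT")
        state = checkpoint()                # advance the safe point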
Nevertheless, this basic idea comes at a cost. Generally speaking, it requires the redundant execution
from the two copied threads to have appropriate fault detection coverage for a given processor component. Hence, the higher the desired fault detection coverage, the more redundant the required execution.
However, redundant execution inevitably comes at the cost of performance overhead, added hardware,
increased design complexity, etc. Consequently, how to trade off the fault detection coverage against the added costs is essential for the practicality of the basic idea. Specifically, consider the need to generate redundant
executing threads: given a general five-stage pipeline which is comprised of instruction fetch, decode,
issue, execute and retire stages, all stages can be exploited for that requirement. Take the fetch stage as an example; we can generate the redundant threads by fetching instructions twice. Since the instruction
fetch stage is the first pipeline stage, the redundant execution would then cover all the pipeline stages,
thus, the largest possible fault detection coverage could be achieved. However, allowing two redundant
threads to fetch instructions would possibly end up halving the effective fetch bandwidth. Consequently,
that halved fetch bandwidth would be an upper bound for the maximum pipeline throughput. Additionally,
the redundant thread generated in the fetch stage would then compete not only for the decode bandwidth,
the issue bandwidth, and the retire bandwidth, but also for Issue Queue (IssueQ) and ROB capacity,
which are all identified as key factors that affect the performance of the redundant execution. Conversely,
we can re-issue the retired instructions from ROB back to the functional units for redundant execution.
In doing so, the bandwidth and spatial occupancy contention at IssueQ and ROB can be relieved, thus
the performance overhead can be lowered. However, this retire-stage-based design comes at the price
of smaller fault detection coverage: only the execution stage would be covered.

Figure 8. Functional diagram of the fault-tolerant SMT data path

Given these trade-off considerations, we can simply fetch the instructions once and then immediately copy the instructions fetched to generate the redundant thread. In doing this, there is no need for
partitioning the fetch bandwidth between the redundant threads. Moreover, we can rely on the dispatch
thread scheduling and redundant thread reduction to relieve the contention in the IssueQ and ROB. Both
techniques lower the performance overhead.
Other than the design trade-off, another issue associated with the basic idea is the need to prevent
deadlocks. In a fault-tolerant SMT design, two copies of the same thread are now cooperating with each
other. Such cooperation could cause deadlocks. We present a systematic deadlock analysis and conclude
that as long as ROB, Load Queue (LQ) and Store Queue (SQ) (the instruction issue queues for load and
store instructions, respectively) have allocated some dedicated entries to the trailing thread, the deadlock
situations identified can be prevented. Based on this conclusion, we propose two ways for the prevention of any deadlock situation: one is to statically allocate entries in ROB, LQ, and SQ for the redundant
thread copy; the other is to dynamically monitor for deadlocks.

Lowering the Performance Overhead


As discussed, to lower the performance overhead, we can simply fetch the instructions once and then
immediately copy the instructions fetched in order to generate the redundant thread. However, in doing
so, faults in three major components of the fetch stage (I-Cache, Program Counter, and Branch Prediction Units (BPU)) might not be covered. In particular, any transient faults which happen inside
the I-Cache might not be detected. However, to protect the I-Cache, we can implement Error Correcting
Code (ECC)-like mechanisms that are very effective at handling transient faults in memory structures.


Further, faults occurring in the BPUs will have no effect on the functional correctness of program execution; however, the critical PCs (program counters) must also be protected by ECC-like mechanisms. As shown in Figure 8, the instruction copy operation is simple: the instructions fetched are simply buffered into two instruction queues; hence, the copy operation would not lengthen the pipeline cycle time, nor would
another pipeline stage be added. To be specific, each instruction fetched can be bound to a sequential
number and a unique thread ID. For instructions that are stored in IFQ, the leading thread (LT) is used
as their thread ID, whereas for those stored in another IFQ, called trace queue (traceQ), the trailing
thread (TT) is used. It should be noted that traceQ also serves in the two performance overhead lowering techniques which will be described in detail in the following subsections.
Focusing on our redundant execution mode, the key factors that affect the performance of redundant
execution can be identified as: contention for bandwidth as far as issue, execution, and retire operations
are concerned, as well as the capacity contention in IssueQ and ROB. We now address these types of
resource contention by introducing four schemes to make TT as lightweight as possible (remember that executing TT is merely for fault detection purposes). In doing so, the competition for IssueQ, ROB, and FUs could then be reduced.
The first scheme we propose is to prevent the mispredicted TT instructions from being dispatched
for execution. This is based on the observation that the number of dynamic mispredicted instructions
might be a significant portion of the total fetched instructions. For example, Kang et al. (2008) observed
that nearly 16.2% to 28.8% of the instructions fetched would be discarded from the pipeline even with
high branch prediction accuracy. Hence, if we could prevent mispredicted instructions in TT from being
dispatched, the effective utilization of IssueQ, ROB, and FUs would be accordingly improved. Based
on this observation, we leverage LT branch resolution results to completely prevent the mispredicted
instructions in TT from being dispatched. It should also be noted that in this design neither a branch
outcome queue nor a branch prediction queue is needed. Specifically, when encountering a branch instruction in traceQ, the dispatch operation will check its prediction status: if its prediction outcome has
been resolved by its counterpart from LT, we continue its dispatch operation; otherwise, the TT dispatch
operation will be paused. In order not to pause the TT dispatch operation, LT must be executed ahead of
TT. The LT ahead of TT execution mode is called staggered execution. To set up the TT branch instruction status (the initial status is set as unresolved), every completed branch instruction from LT will
search traceQ to match its TT counterpart. We should note here that the sequential numbers provide the means for matching the two redundant threads' instructions. As we have seen, each instruction fetched
is associated with a sequential number at first, and then the fetched instruction is replicated to generate
the redundant thread. In doing so, two copied instructions will have the same sequential numbers in different threads. It should also be noted that such a sequential number feature has been implemented, for
example, in the Alpha and PowerPC processors. If the branch has been correctly predicted, the status of the matched counterpart TT branch instruction will be marked as resolved. Conversely, if the branch
has been mispredicted, LT will perform its usual branch misprediction recovery, while at the same time
it will flush all those instructions inside traceQ that are located behind the matched counterpart branch
instruction. In other words, LT performs the branch misprediction recovery for both LT and TT. Thus,
TT does not recover from any branch misprediction by itself. After recovery, the status of the TT branch instruction will be set to resolved.
In the second scheme, we adopt the Load Value Queue (LVQ) design (Reinhardt, 2000) and include
it in our design as shown in Figure 8. Basically, when an LT load fetches data from the cache (or the
main memory), the data fetched and the matching tag associated are also buffered into the LVQ. Instead of accessing the memory hierarchy, the TT loads simply check and match the LVQ for the data fetched.
In doing so, TT might reduce the D-Cache miss penalties and in turn improve its performance. It should
be noted here that in order to fully benefit from the LT data prefetching, we must guarantee that LT is
always ahead of TT, which requires a staggered execution mode.
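The LVQ interaction can be modeled roughly as follows; the names are illustrative, and the tag check merely stands in for the hardware matching logic:

# Illustrative model of the Load Value Queue (LVQ) interaction.
from collections import deque

lvq = deque()                          # FIFO of (tag, data) filled by LT loads

def lt_load(tag, read_from_memory):
    data = read_from_memory(tag)       # LT goes through the memory hierarchy
    lvq.append((tag, data))            # and buffers the value for TT
    return data

def tt_load(tag):
    buffered_tag, data = lvq.popleft() # TT never touches the cache
    assert buffered_tag == tag         # a mismatch would indicate a fault
    return data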
The third scheme consists of applying dispatch thread scheduling. A well-known fact is that there
are many idle slots in the execution pipeline. Hence, we must make sure that the redundant execution will
exploit those idle slots as much as possible in order to circumvent the identified performance affecting
contentions. To exploit the idle slots, we must ensure that whenever one thread is idle for any reason, the
execution resources can be promptly allocated to another thread that can utilize them more efficiently.
Based on this observation, the ICOUNT policy (Tullsen, 1996) was proposed to schedule threads in
order to fill IssueQ with issuable instructions, i.e., restrict threads from clogging IssueQ. However, we
argue that it is the dispatch stage that directly feeds IssueQ with useful instructions. Hence, scheduling
threads at the dispatch stage level would react more promptly to thread idleness in IssueQ. Therefore, we modify the ICOUNT policy as follows (see also Figure 8): at each clock cycle, we count the
number of instructions that are still waiting in the IssueQ from LT and TT. A higher dispatch priority
is assigned to the thread with the lower instruction count. More specifically, when the dispatch rate is
eight instructions per cycle, the selected thread is allowed to dispatch as many instructions as possible
(up to eight). If any dispatch slot is left from the selected thread, the alternate thread would consume
the remaining slots. The above policy is denoted as ICOUNT.2.8.dispatch.
While developing techniques to make TT as simple as possible, we found that a staggered execution
mode is beneficial for those techniques. To that end, the fourth scheme, the slack dispatch scheme, is proposed: in the instruction dispatch stage, if the selected thread is TT, we check the instruction distance
between LT and TT. If the distance is less than a predefined threshold, we skip the TT dispatch operation
and continue buffering TT in traceQ. This means that the size of traceQ (the entry number of traceQ)
must meet the following requirement: sizeof (traceQ) > sizeof (IFQ) + predefined distance. Moreover,
for the purposes of fault-detection, all retired LT instructions and their execution results are buffered
into the checking queue (chkQ), as shown in Figure 8. Hence, TT is responsible for triggering the result
comparison. We further assume the register file of TT is protected by ECC-like mechanisms. This means
that, if any fault is detected, the register file state of TT could be used to recover that of LT.
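A sketch of the slack dispatch check, using a hypothetical slack threshold and sequence-number bookkeeping, is shown below:

# Sketch of the slack dispatch check for staggered execution (illustrative).

SLACK = 64       # hypothetical minimum instruction distance LT must keep ahead of TT

def may_dispatch_tt(lt_next_seq, tt_next_seq, trace_q):
    # lt_next_seq / tt_next_seq: sequential numbers of the next instructions
    # LT and TT would dispatch (hypothetical bookkeeping).
    if lt_next_seq - tt_next_seq < SLACK:
        return False                   # too close: keep buffering TT in traceQ
    return len(trace_q) > 0            # dispatch TT only if traceQ has instructions

# traceQ must hold the buffered slack plus a full fetch queue's worth of
# instructions, hence the requirement sizeof(traceQ) > sizeof(IFQ) + SLACK.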

Deadlock Analysis and Prevention


As pointed out before, the two copies of a thread cooperate with each other for fault checking and
recovering. However, if not carefully synchronized, such cooperation could result in deadlock situations where neither copy could make any progress. To prevent this, a detailed analysis and appropriate
synchronization mechanisms are necessary.
Resource sharing is one of the underlying conditions of deadlocks. Indeed, it should be noted that
there is much resource sharing between the two thread copies. For example, IssueQ is a shared hardware
resource and both thread copies contend for it. The availability of instructions being issued is another type
of resource sharing: the issue bandwidth is dynamically partitioned between the two thread copies.
Take chkQ as an example: only if there is a free entry in chkQ could LT retire its instruction and back
up the retiring instructions and execution results there. On the other hand, the entry in chkQ can only
be freed by TT: only after an instruction has been retired and compared, can the corresponding entry
in chkQ be released. Further, due to the similarity between dispatch and issue operations, we combine them under the term issue resource in the discussion which follows.

Figure 9. Resource allocation graph for the fault-tolerant SMT deadlock analysis

Based on Figure 9, we can list all possible circular wait conditions. However, some conditions obviously do not end up in a deadlock (e.g., LT → traceQ → TT → SQ → LT). After exhausting the list,
we describe all the possible deadlock scenarios as follows:
1. LT → chkQ → TT → issue resource → LT

Scenario: When chkQ is full, LT cannot retire its instructions. Then, those instructions ready to retire
from LT are simply stalled in ROB. If that stalling ends with an ROB full of instructions from LT (the
case in which ROB is full of instructions from LT could be exacerbated by the fact that LT is favored by the dispatch thread scheduling policy for the staggered execution mode), the instruction dispatch operation
will be blocked, thus, TT will be stalled in traceQ. Consequently, no corresponding instructions from
TT can catch up to release the chkQ entries and then a deadlock can happen. In summary, the condition
for this deadlock situation is derived from the following:
Observation 1: When chkQ is full and ROB is full of instructions from LT, a deadlock happens.
2. LT → LVQ → TT → issue resource → LT

Observation 2: When LVQ is full and LQ is full of instructions from LT, a deadlock happens.
Similarly, the stalled load instructions could end up in a full ROB, thus, the instruction dispatch
operation will be blocked. Hence, the deadlock observation follows:
Observation 3: When LVQ is full, ROB is full, and there are no load instructions from TT in ROB, a deadlock happens.
3. LT → SQ → TT → issue resource → LT

Observation 4: When SQ is full of instructions from LT, a deadlock happens.


Based on the above systematic deadlock analysis, we propose two mechanisms to handle the possible
deadlock situations: static hardware resource partitioning and dynamic deadlock monitoring.
In static hardware resource partitioning, where each thread has its own allocated resources, the deadlock conditions identified can be broken such that the deadlock is prevented. For example, we can partition
the ROB in order to prevent the possible deadlock situation identified in Observation 1: if some entries
of the ROB are reserved for TT, TT dispatch operations could continue since, when chkQ is full, the
partitioned ROB cannot be full of instructions from LT. Subsequently, those dispatched TT instructions
will be issued and their execution completed afterwards. After completion, they will trigger the result
comparison and then free the corresponding chkQ entries if the operation was found to be fault-free.
After some chkQ entries have been freed, LT is allowed to make progress.
Moreover, we find that only three hardware resources (ROB, LQ, and SQ) need to be partitioned in
order to prevent all the deadlock situations that we identified: Partitioning the ROB to break the deadlock
situation identified in Observation 1: ROB will never be full of instructions from LT such that TT will
be dispatched and then chkQ will be released. Similarly, partitioning LQ to break the deadlock situation
identified in Observation 2; Partitioning SQ to break the deadlock situation identified in Observation 4.
Now considering Observation 3, when LVQ is full, an LT load instruction LD_k in LQ cannot be
issued. However, since ROB is now partitioned between LT and TT, the stalled load instruction LD_k
in ROB will only block LT from being dispatched. In other words, the TT dispatch operation will not be
blocked by the stalled load instruction LD_k, thus, for example, another load instruction LD_i from TT
will be dispatched which will then release the LVQ entry occupied by the counterpart load instruction
LD_i from LT. Once free LVQ entries are made available, the stalled LT load instruction LD_k can be
issued. In summary, we have the following observation:
Observation 5: For each of ROB, LQ, and SQ, allocating some dedicated entries for TT will prevent
the deadlock situations identified.
It should be noted, however, that static hardware resource partitioning has some performance impact
on the SMT, particularly when partitioning ROB, LQ, and SQ. To mitigate this performance impact, we
allocate a minimum number of entries for TT to prevent deadlocks, and the remainder of the queue is
shared between LT and TT. Hence, the maximum available entry number for LT is the total queue entry
number minus the reserved entry number whereas the maximum available entry number for TT is the
total queue entry number.
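This allocation rule can be summarized in a small sketch; q_total and r_reserved are hypothetical parameters standing for the total and reserved entry counts of any one of the three queues:

# Sketch of the partially partitioned queue rule (illustrative): r_reserved
# entries of a q_total-entry queue are reserved for TT, the rest is shared.

def can_allocate(thread, used_lt, used_tt, q_total, r_reserved):
    if used_lt + used_tt >= q_total:
        return False                       # the queue is physically full
    if thread == "LT":
        return used_lt < q_total - r_reserved   # LT may not touch TT's reserve
    return True                            # TT may use both reserved and shared entries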
From the deadlock analysis, we can also conclude that if we could dynamically regulate the progress of LT such that none of ROB, LQ, or SQ can be filled with instructions only from LT, the deadlock situations identified can be prevented. As illustrated in Figure 10, we can dynamically count the number of instructions from LT in ROB, LQ, and SQ, respectively; a caution signal is then generated if at least one of the counted numbers exceeds the corresponding predefined occupancy threshold. Furthermore, as long as the caution signal is asserted, the dispatch thread scheduling policy will hold LT back from being dispatched.
Figure 10. Dynamic monitoring for the deadlock prevention

Algorithm 1. Dispatch thread scheduling policy (with dynamic deadlock monitoring)
Apply ICOUNT.2.8.dispatch policy;
If ((selected thread is LT) AND
    (IFQ not empty) AND
    (no caution signal))
        Dispatch from IFQ;
else if ((distance between LT and TT meets predefined staggered execution mode) AND
    (traceQ is not empty) AND
    (not an unresolved branch instruction))
        Dispatch from traceQ;
else
        Nothing to be dispatched.
To be specific, the comprehensive dispatch thread scheduling policy we developed is listed in Algorithm 1: first, we apply the ICOUNT.2.8.dispatch policy. If the selected thread is LT, we must then check whether IFQ is empty, since no instruction can be dispatched in the case of an empty IFQ. Furthermore, we need to make sure no caution signal has been generated; if there is one such signal, we must stop dispatching from LT. On the other hand, if the selected thread is TT, we check the following conditions before dispatching TT: (1) the staggered execution mode requirement; (2) traceQ not empty; (3) no unresolved branch instruction encountered.
It should be noted that the dynamic deadlock monitoring approach offers higher design flexibility than
the static resource partitioning. By adjusting the predefined occupancy thresholds, we can manipulate
the resource allocation between the cooperating threads. However, this flexibility comes at the cost of
additional hardware as well as a more complicated thread scheduling policy.


SMT for Secure Communication


Another possibility is to exploit SMT microarchitectures for secure communication. Traditionally, computer
security focuses on the prevention and avoidance of software attacks. Nevertheless, the PC architecture
is too trusting of its code environment, which makes PCs vulnerable to hardware attacks (Huang, 2003).
For instance, to take control of a processor, a hacker can initiate a man-in-the-middle attack that injects
malicious code into a system bus that connects the processor and the memory module. One approach to
counter these attacks is memory authentication, which guarantees data integrity. Memory authentication
normally needs to be performed in three steps (Yan, 2006). First, all memory lines have to be brought to
the processor for authentication tag computation. Then, these lines are sent back to memory. Finally, each time a line is brought to the processor at run-time, the authentication tag is recomputed and compared.
This approach takes extra CPU cycles for authentication, generates extra bus traffic, and is vulnerable
at system start time. To compensate for the performance overhead, many proposals add extra pipeline stages or hardware units for authentication (Shi, 2004). However, the extra hardware overhead involved makes
trusted systems only affordable to high-end users.
Is it worth the hardware overhead? Financial institutions can afford to spend hundreds of thousands of dollars on trusted systems, but this is too much for ordinary PC users. Is it worth the performance overhead? Large trusted systems can afford to spend 60% of their cycle time scrutinizing every instruction received, but this is certainly not acceptable for ordinary PC users either. To address these issues, we propose the Patrolling Thread (PT) for instruction memory authentication in an SMT microarchitecture. We choose to authenticate only instruction memory because the most common attack is malicious code injection, and thus instruction memory is the Achilles' heel in computer (hardware) security. Also, instruction memory is one-way traffic (read-only), which makes security schemes easier to implement. In our proposed scheme, little performance overhead is incurred because authentication computation is carried out on idle resources by employing the SMT technique. What's more, since PT uses only existing pipeline stages/resources, little hardware overhead is necessary. In addition, by dedicating a hardware
thread for system security, our approach provides tunable security levels for the system to operate under
different requirements and environments.
Even though SMT exploits TLP in addition to ILP, the pipeline utilization still cannot reach 100%. Thus the patrolling thread can take advantage of unused pipeline resources to execute the instruction memory authentication algorithm, thereby minimizing the impact on regular program execution. If one
incoming instruction does not pass the authentication test, then a warning would be issued and the system
stops taking in any more instructions from memory until recovery.
To accommodate different security requirements and performance overhead, we have proposed three
different schemes to implement patrolling thread:
i. Regular-checking scheme: served as the baseline scheme, e.g., check one in every ten incoming
instruction lines. This approach introduces some performance overhead and can secure the system
when utilization is high, but a small number of malicious instructions could still sneak in. In most
situations, this approach is secure enough because one malicious instruction is usually not enough
to cause disastrous effects to the system. For instance, if n instructions are required to hack the
system, then as long as the patrolling thread can catch one line of malicious instructions before all
n instructions enter the processor, the system is safe.

ii. Self-checking scheme: the patrolling thread examines the incoming instruction lines only if there are free pipeline slots available. This scheme incurs no performance overhead but it becomes vulnerable when the system utilization is kept high.
iii. Secured-checking scheme: schedule the patrolling thread to authenticate every incoming instruction line regardless of the system utilization.

For the authentication algorithm, we choose the One-Key CBC MAC (OMAC), which is a block-cipher-based message authentication algorithm for messages of any bit length (Iwata, 2003). Using this algorithm, a block cipher and a tag length need to be specified, and then both parties (the Memory Management Unit (MMU) and the processor in our case) share the same secret key for tag generation and tag verification. In the proposed PT approach, when a memory line is requested, the MMU generates a tag for the line, and the processor can check the line's authenticity by verifying its tag. We assume that the MMU is able to generate authentication tags on the fly, since it has been demonstrated that the MMU can be modified to carry out more sophisticated security operations, such as encryption and decryption (Gilmont, 1999).
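A minimal software model of the PT check is sketched below; HMAC-SHA256 from the Python standard library is used only as a stand-in for the OMAC tags the MMU would compute in hardware, and the key, function names, and one-in-ten sampling are illustrative assumptions:

# Software model of the patrolling-thread check (illustrative only).
import hmac, hashlib

SECRET_KEY = b"shared-by-mmu-and-processor"    # hypothetical shared secret

def make_tag(line_bytes):                      # performed by the MMU per memory line
    return hmac.new(SECRET_KEY, line_bytes, hashlib.sha256).digest()

def patrol_check(line_bytes, tag, scheme, line_no, pipeline_has_idle_slot):
    if scheme == "regular" and line_no % 10 != 0:
        return True                            # only every tenth line is verified
    if scheme == "self" and not pipeline_has_idle_slot:
        return True                            # skip the check when the pipeline is busy
    # The secured-checking scheme (and any selected line above) verifies the tag.
    if not hmac.compare_digest(make_tag(line_bytes), tag):
        raise RuntimeError("authentication failed: flush the pipeline and recover")
    return True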
Here is a brief analysis of the probability of detecting malicious code using the patrolling thread scheme. The question can be summarized as follows: if we have m lines of instructions coming in, n of which are malicious code, and we perform memory authentication on k lines of instructions, what is the probability P(Detection) of catching one line of malicious code? The probability that the first line we authenticate is malicious code can be written as:

$P(\mathrm{Detection}_1) = \frac{n}{m}$

If the first memory line passes the authentication, then we choose the second memory line to check:

$P(\mathrm{Detection}_2) = \frac{m-n}{m} \cdot \frac{n}{m-1}$

The first term represents the probability that the first memory line checked is one of the m-n genuine lines; the second time, we pick one of the remaining m-1 memory lines to authenticate. The probability that we catch the malicious code on the third attempt is:

$P(\mathrm{Detection}_3) = \frac{m-n}{m} \cdot \frac{m-n-1}{m-1} \cdot \frac{n}{m-2}$

Until the k-th attempt:

$P(\mathrm{Detection}_k) = \frac{m-n}{m} \cdot \frac{m-n-1}{m-1} \cdots \frac{n}{m-(k-1)}$


Figure 11. P(Detection) with m = 10

Figure 12. Pipeline with patrolling thread


P(Detection) is the summation of all the above terms, which we can write as:

$P(\mathrm{Detection}) = \sum_{i=1}^{k} \frac{n \binom{m-i}{n}}{\binom{m}{n}\,(m-i-n+1)}$

Here $\binom{\cdot}{\cdot}$ represents the binomial coefficient.
Based on this equation, we plotted P(Detection) for the scenario where we perform memory authentication on 10 memory lines, with the malicious code ratio and the detection ratio varying, as shown in Figure 11. As we can see, if the detection rate (DT) is 0.1 (corresponding to the regular-checking scheme), then the probability of detecting the malicious code corresponds directly to the malicious code ratio. On the other hand, if DT is 1 (corresponding to the secured-checking scheme), then we will always be able to detect the malicious code, hence P(Detection) is a horizontal line. Finally, for the self-checking scheme we proposed, P(Detection) falls between those of the previous two schemes.
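For reference, the detection probability can be evaluated numerically by following the step-by-step derivation above (a small sketch; m, n, and k are as defined in the text):

# Numerical evaluation of P(Detection) (illustrative).

def p_detection(m, n, k):
    # m incoming lines, n of them malicious, k lines authenticated.
    total, p_no_hit = 0.0, 1.0
    for i in range(1, k + 1):
        remaining = m - (i - 1)            # lines not yet examined
        total += p_no_hit * n / remaining  # first hit occurs at the i-th check
        p_no_hit *= (remaining - n) / remaining
    return total

# With m = 10 and n = 2, checking a single line gives 0.2, while checking all
# ten lines (the secured-checking scheme) gives a detection probability of 1.
print(p_detection(10, 2, 1), p_detection(10, 2, 10))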
PT is designed in a similar fashion to the detector thread proposed by Shin et al. (2006). As shown in
Figure 12, PT's initial program image is loaded via DMA into the PT RAM by the OS during boot. Once loaded, PT can start running from its own reset address, depending on the patrolling scheme we choose. Whenever we have an instruction cache miss, the MMU sends in the instruction line together with its authentication tag. The instructions are then sent to the instruction cache as usual, while the corresponding security tag is sent to the PT RAM. The tag and the instructions share the same memory address, and the PT RAM and the Instruction Cache use the same cache indexing algorithm; this makes sure that the tag and its corresponding instructions remain associated with each other. The PT-enabled fetch unit decides which thread to fetch from, the patrolling thread or the regular thread(s).
Whenever the patrolling thread finishes the authentication process with a pass, nothing happens and the pipeline continues to flow. If, however, the authentication fails, an alert is raised and the whole pipeline is flushed. The program counter(s) are rolled back to the last known good position and execution restarts.

TRENDS AND CONCLUSIONS


With major general-purpose processor manufacturers transitioning from single-core to multi-core processor production, one basic obstacle lies before us: are we truly ready for the multi-core era?
With most programs still developed under the sequential programming model, the extent of the Instruction-Level Parallelism we can exploit is very limited. Many on-chip resources simply remain idle, and we fall considerably short of fully utilizing the vast computing power of those chips. To solve this problem, we need to revolutionize the whole computing paradigm, from incorporating parallel programming into application program development, to hardware designs that better facilitate parallel execution. On the other hand, the Simultaneous MultiThreading microarchitecture model has proven to be capable of maximizing the utilization of on-chip resources and hence achieving improved performance.


Here we see a perfect match through utilizing the multi-thread microarchitecture to harness the computing power of multi-core, exploiting Thread-Level Parallelism (TLP) in addition to ILP.
We expect near-future processors to be of the multi-thread, multi/many-core kind. SMT would fit both the homogeneous and the heterogeneous multi/many-core system cases, with one or many cores running multiple threads. Because of limited ILP, the main thread normally cannot use all the system resources. On the other hand, there is demand from house-keeping functions, such as on-chip resource usage analysis, data synchronization, and routing decision making, to assist the execution of the main thread. With the SMT microarchitecture, we can achieve better system utilization when running those multiple threads together, hence achieving an overall performance improvement.
According to the "Refrigerator Theory" of Professor Yale Patt from the University of Texas at Austin, another trend for future heterogeneous multi/many-core processor design is to include some application-specific cores, in addition to the general-purpose processing cores. For example, AMD's vision of Fusion (AMD, 2008) and the next-generation Intel Larrabee processor (Seiler, 2008) are both targeting a GPU/CPU combined design. These specific cores can be used as performance boosters for
specific applications to achieve an overall performance improvement. In order to utilize these specific
cores effectively, we need:

An instruction set enhancement, to add dedicated instructions to best exploit these special cores.
Improvement from the compiler, to extract more code that can be run on these special cores.
Operating system assistance, to be aware of these special cores for better job scheduling.

For some power-constrained applications, we may need to put those specific cores into sleep mode in
order to reduce the power consumption during normal execution and then power them back on when the
need arises. If that is the case, however, the sleep state enter/exit latency would be a factor that should
not be overlooked. Unless the core will be idle for a considerably extended period of time, the gain you could get from running the specific core(s) may not justify the latency needed for the core mode change (from active to sleep/deep sleep, or vice versa). What's more, putting a core into deep sleep is not a trivial job in terms of hardware overhead. Due to these limiting factors, this approach needs to be considered cautiously by system architects.
SMT technology has become one of the de facto features of modern microprocessors. In this chapter, we examined this important technology in terms of its motivation, design aspects, and applications. We strongly believe that, if utilized effectively, SMT will continue to play a critical role in future multi/many-core processor design.

REFERENCES
Agarwal, A., Bianchini, R., Chaiken, D., Johnson, K. L., Kranz, D., Kubiatowicz, J., et al. (1995). The
MIT Alewife machine: architecture and performance. In Proceedings of the 22nd Annual International
Symposium on Computer Architecture (ISCA95), S. Margherita Ligure, Italy, (pp. 2-13). New York:
ACM Press.


Agarwal, A., Lim, B.-H., Kranz, D., & Kubiatowicz, J. (1990). April: a processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture
(ISCA90), (pp. 104-114), Seattle, WA: ACM Press.
Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., & Smith, B. (1990) The Tera
computer system. In Proceedings of the 4th International Conference on Supercomputing (ICS90), (pp.
1-6). Amsterdam: ACM Press.
Burns, J., & Gaudiot, J.-L. (2002). SMT layout overhead and scalability. IEEE Transactions on Parallel
and Distributed Systems, 13(2), 142-155. doi:10.1109/71.983942
Cazorla, F. J., Ramirez, A., Valero, M., & Fernandez, E. (2004). Dcache Warn: an I-fetch policy to increase
SMT efficiency. In Proceedings of the 18th International Parallel & Distributed Processing Symposium
(IPDPS04), (pp. 74-83). Santa Fe, NM: IEEE Computer Society Press.
Cazorla, F. J., Ramirez, A., Valero, M., & Fernandez, E. (2004). Dynamically controlled resource allocation in SMT processors. In Proceedings of the 37th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO04), (pp. 171-182). Portland, OR: IEEE Computer Society Press.
Choi, S., & Yeung, D. (2006). Learning-based SMT processor resource distribution via hill-climbing.
In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA06), (pp.
239-251), Boston: IEEE Computer Society Press.
Culler, D. E., Singh, J. P., & Gupta, A. (1998) Parallel computer architecture: a hardware/software
approach, (1st edition). San Francisco: Morgan Kaufmann.
Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., Stamm, R. L., & Tullsen, D. M. (1997). Simultaneous multithreading: a platform for next-generation processors. IEEE Micro, 17(5), 12-19.
doi:10.1109/40.621209
El-Moursy, A., & Albonesi, D. H. (2003). Front-end policies for improved issue efficiency in SMT processors. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture
(HPCA03), (pp. 31-40). Anaheim, CA: IEEE Computer Society Press.
Gilmont, T., Legat, J.-D., & Quisquater, J.-J. (1999). Enhancing the security in the memory management
unit. In Proceedings of the 25th EuroMicro Conference (EUROMICRO99). 1, 449-456. Milan, Italy:
IEEE Computer Society Press.
Gontmakher, A., Mendelson, A., Schuster, A., & Shklover, G. (2006) Speculative synchronization and
thread management for fine granularity threads. In Proceedings of the 12th International Symposium
on High-Performance Computer Architecture (HPCA06), (pp. 278-287). Austin, TX: IEEE Computer
Society Press.
Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M., & Olukotun, K. (2000). The Stanford
Hydra CMP. IEEE Micro, 20(2), 71-84. doi:10.1109/40.848474
Hammond, L., Nayfeh, B. A., & Olukotun, K. (1997). A single-chip multiprocessor. IEEE Computer,
30(9), 79-85.


Hennessy, J., & Patterson, D. (2006). Computer architecture: a quantitative approach (4th Ed.). San
Francisco: Morgan Kaufmann.
Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., & Roussel, P. (2001). The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 5(1), 1-13.
Huang, A. (2003) Hacking the Xbox: an introduction to reverse engineering, (1st Ed.). San Francisco:
No Starch Press.
Huang, J., & Lilja, D. J. (1999). Exploiting basic block value locality with block reuse. Proceedings of
5th International Symposium on High-Performance Computer Architecture (HPCA99), (pp. 106-114).
Orlando, FL: IEEE Computer Society Press.
Intel News Release. (2006). New dual-core Intel Itanium 2 processor doubles performance, reduces
power consumption. Santa Clara, CA: Author.
Iwata, T., & Kurosawa, K. (2003). OMAC: One-Key CBC MAC. In 10th International Workshop on
Fast Software Encryption (FSE03), (LNCS Vol. 2887/2003, pp. 129-153), Lund, Sweden. Berlin/
Heidelberg: Springer.
Kang, D.-S. (2004) Speculation-aware thread scheduling for simultaneous multithreading. Doctoral
Dissertation, University of Southern California, Los Angeles, CA.
Kang, D.-S., Liu, C., & Gaudiot, J.-L. (2008). The impact of speculative execution on SMT processors.
International Journal of Parallel Programming, 36(4), 361-385. doi:10.1007/s10766-007-0052-3
Koufaty, D., & Marr, D. (2003). Hyperthreading technology in the Netburst microarchitecture. IEEE
Micro, 23(2), 56-65. doi:10.1109/MM.2003.1196115
Lee, S.-W., & Gaudiot, J.-L. (2003). Clustered microarchitecture simultaneous multithreading. In 9th
International Euro-Par Conference on Parallel Processing (Euro-Par03), (LNCS Vol. 2790/2004, pp.
576-585), Klagenfurt, Austria. Berlin/Heidelberg: Springer.
Li, X., & Gaudiot, J.-L. (2006). Design trade-offs and deadlock prevention in transient fault-tolerant
SMT processors. In Proceedings of 12th Pacific Rim International Symposium on Dependable Computing (PRDC06), (pp. 315-322). Riverside, CA: IEEE Computer Society Press.
Li, Z., Xu, X., Hu, W., & Tang, Z. (2006). Microarchitecture and performance analysis of Godson-2
SMT processor. In Proceedings of the 24th International Conference on Computer Design (ICCD06),
(pp. 485-490). San Jose, CA: IEEE Computer Society Press.
Liu, C., & Gaudiot, J.-L. (2008). Resource sharing control in simultaneous multithreading microarchitectures. In Proceedings of the 13th IEEE Asia-Pacific Computer Systems Conference (ACSAC08), (pp.
1-8). Hsinchu, Taiwan: IEEE Computer Society Press.
Liu, S., & Gaudiot, J.-L. (2007). Synchronization mechanisms on modern multi-core architectures. In
Proceedings of the 12th Asia-Pacific Computer Systems Architecture Conference (ACSAC07), (LNCS
Vol. 4697/2007), (pp. 290-303), Seoul, Korea. Berlin/Heidelberg: Springer.

579

Simultaneous MultiThreading Microarchitecture

Liu, S., & Gaudiot, J.-L. (2008). The potential of fine-grained value prediction in enhancing the performance of modern parallel machines. In Proceedings of the 13th IEEE Asia-Pacific Computer Systems
Conference (ACSAC08), (pp. 1-8). Hsinchu, Taiwan: IEEE Computer Society Press.
Mahadevan, U., & Ramakrishnan, S. (1994) Instruction scheduling over regions: A framework for scheduling across basic blocks. In Proceedings of the 5th International Conference on Compiler Construction
(CC94), Edinburgh, (LNCS Vol. 786/1994, pp. 419-434). Berlin/Heidelberg: Springer.
Marcuello, P., & Gonzalez, A. (1999) Exploiting speculative thread-level parallelism on a SMT processor. In Proceedings of the 7th International Conference on High-Performance Computing and Networking (HPCN Europe99), Amsterdam, the Netherlands, (LNCS Vol. 1593/1999, pp. 754-763) Berlin/
Heidelberg: Springer.
Marr, D.T., Binns, F., Hill, D.L., Hinton, G., Koufaty, D.A, Miller, J.A., & Upton, M. (2002). Hyperthreading technology architecture and microarchitecture. Intel Technology Journal, 6(1), 4-15.
Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics Magazine,
38(8).
Nemirovsky, M. D., Brewer, F., & Wood, R. C. (1991). DISC: dynamic instruction stream computer. In
Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO91), Albuquerque, NM (pp. 163-171). New York: ACM Press.
Preston, R. P., Badeau, R. W., Bailey, D. W., Bell, S. L., Biro, L. L., Bowhill, W. J., et al. (2002). Design
of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In Digest of Technical
Papers of the 2002 IEEE International Solid-State Circuits Conference (ISSCC02), San Francisco, CA
(Vol. 1, pp. 334-472). New York: IEEE Press.
Raasch, S. E., & Reinhardt, S. K. (1999). Applications of thread prioritization in SMT processors. In
Proceedings of the 3rd Workshop on Multithreaded Execution and Compilation (MTEAC99), Orlando,
FL.
Raasch, S. E., & Reinhardt, S. K. (2003). The impact of resource partitioning on SMT processors. In
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
(PACT03), (pp. 1525). New Orleans, LA: IEEE Computer Society.
Reinhardt, S., & Mukherjee, S. (2000). Transient fault detection via simultaneous multithreading. In ACM
SIGARCH Computer Architecture News: Special Issue: Proceedings of the 27th Annual International
Symposium on Computer Architecture (ISCA00), (pp. 25-36). Vancouver,Canada: ACM Press
Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., & Dubey, P. (2008). Larabee: a
many-core x86 architecture for visual computing. [TOG]. ACM Transactions on Graphics, 27(3).
doi:10.1145/1360612.1360617
Shi, W., Lee, H.-H., Ghosh, M., & Lu, C. (2004). Architectual support for high speed protection of
memory integrity and confidentiality in multiprocessor systems. In Proceedings of the 13th International
Conference on Parallel Architectures and Computation Techniques (PACT04), Antibes Juan-les-Pins,
France (pp.123-134). New York: IEEE Computer Society.

580

Simultaneous MultiThreading Microarchitecture

Shin, C.-H., & Gaudiot, J.-L. (2006). Adaptive dynamic thread scheduling for simultaneous multithreaded
architectures with a detector thread. Journal of Parallel and Distributed Computing, 66(10), 13041321.
doi:10.1016/j.jpdc.2006.06.003
Shin, C.-H., Lee, S.-W., & Gaudiot, J.-L. (2003). Dynamic scheduling issues in SMT architectures. In
Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS03),
Nice, France, (p. 77b). New York: IEEE Computer Society.
Sinharoy, B., Kalla, R. N., Tendler, J. M., Eickemeyer, R. J., & Joyner, J. B. (2005). Power5 system
microarchitecture. IBM Journal of Research and Development, 49(4/5), 505521.
Smith, B. J. (1981). Architecture and applications of the HEP multiprocessor computer system. In SPIE
Proceedings of Real Time Signal Processing IV, 298, 241-248.
Tendler, J. M., Dodson, J. S. Jr, Fields, J. S., Le, H., & Sinharoy, B. (2002). Power4 system microarchitecture. IBM Journal of Research and Development, 46(1), 525.
Thistle, M. R., & Smith, B. J. (1988). A processor architecture for Horizon. In Proceedings of the 1988
ACM/IEEE conference on Supercomputing (SC88), Orlando, FL, (pp. 35-41). New York: IEEE Computer Society Press.
Thornton, J. E. (1970). Design of a computer - the Control Data 6600. Upper Saddle River, NJ: Scott
Foresman & Co.
Tuck, N., & Tullsen, D. M. (2005). Multithreaded value prediction. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA05), (pp. 5-15), San Francisco:
IEEE Computer Society.
Tullsen, D. M., & Brown, J. A. (2001). Handling long-latency loads in a simultaneous multithreading
processor. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture
(MICRO01), (pp. 318327). Austin, TX: IEEE Computer Society.
Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., & Stamm, R. L. (1996). Exploiting
choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA96), Philadelphia,
(pp. 191202). New York: ACM Press.
Tullsen, D. M., Eggers, S. J., & Levy, H. M. (1995). Simultaneous multithreading: maximizing on-chip
parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture
(ISCA95), Santa Margherita Ligure, Italy (pp. 392-403). New York: ACM Press.
Tullsen, D. M., Lo, J. L., Eggers, S. J., & Levy, H. M. (1999). Supporting fine-grained synchronization
on a simultaneous multithreading processor. In Proceedings of the 5th International Symposium on High
Performance Computer Architecture (HPCA99), Orlando, FL (pp. 54-58). New York: IEEE Computer
Society.
Wall, D. W. (1991). Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA
(ASPLOS-IV), (pp. 176-188). New York: ACM Press.

581

Simultaneous MultiThreading Microarchitecture

White Paper, A. M. D. (2008). The industry-changing impact of accelerated computing.


Yamamoto, W., & Nemirovsky, M. (1995). Increasing superscalar performance through multistreaming.
In Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation
Techniques (PACT95), (pp. 49-58). Limassol, Cyprus: IFIP Working Group on Algol.
Yan, C., Rogers, B., Englender, D., Solihin, Y., & Prvulovic, M. (2006). Improving cost, performance,
and security of memory encryption and authentication. In Proceedings of 33rd Annual International Symposium on Computer Architecture (ISCA06), (pp. 179-190). Boston: IEEE Computer Society Press.
Yeager, K. C. (1996). The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2), 2840.
doi:10.1109/40.491460

KEY TERMS AND DEFINITIONS


Cache Coherence: The integrity of the data stored in local caches of a shared resource.
Fault Tolerance: The property that enables a system (often computer-based) to continue operating
properly in the event of the failure of (or one or more faults within) some of its components.
Fetch Policy: A mechanism that determines which thread(s) to fetch instructions from when executing multiple threads.
Instruction-Level Parallelism: A measure of how many of the operations in a computer program
can be performed simultaneously.
Simultaneous Multithreading: A technique to improve the overall efficiency by executing instructions from multiple threads simultaneously to better utilize the resources provided by modern processor
architecture.
Microarchitecture: A description of the electrical circuits of a processor that is sufficient to completely describe the operation of the hardware.
Resource Sharing Control: A mechanism which allows the distribution of various resources in the
pipeline among multiple threads.
Secure Communication: Means by which information is shared with varying degrees of certainty
so that third parties cannot know what the content is.
Synchronization: Timekeeping which requires the coordination of events to operate a system in
unison.
Thread-Level Parallelism: A measure of how many of the operations across multiple threads can
be performed simultaneously.

ENDNOTES
1. Some literature refers to 18 months. However, the official Moore's law website of Intel, and even an interview with Dr. Gordon Moore, confirms the two-year figure.
2. In Chinese, "Fen" means "Divide" or "Partition".
3. As long as we detect one line of malicious code, we will trigger an alert.

Chapter 25

Runtime Adaption Techniques for HPC Applications

Edgar Gabriel
University of Houston, USA

ABSTRACT
This chapter discusses runtime adaption techniques targeting high-performance computing applications.
In order to exploit the capabilities of modern high-end computing systems, applications and system
software have to be able to adapt their behavior to hardware and application characteristics. Using
the Abstract Data and Communication Library (ADCL) as the driving example, the chapter shows the
advantage of using adaptive techniques to exploit characteristics of the network and of the application.
This makes it possible to reduce the execution time of applications significantly and to avoid having to maintain
different architecture-dependent versions of the source code.

INTRODUCTION
High Performance Computing (HPC) has reshaped science and industry in many areas. Recent groundbreaking achievements in biology, drug design and medical computing would not have been possible without the use of massive computational resources. However, software development for HPC systems is currently facing significant challenges, since many of the software technologies applied in the last ten years have reached their limits. The number of applications capable of efficiently using several thousand processors or achieving a sustained performance of multiple teraflops is very limited, and those that do are usually the result of many person-years of optimization for a particular platform. These optimizations are, however, often not portable. As an example, an application optimized for a commodity PC cluster often performs poorly on an IBM Blue Gene or the NEC Earth Simulator. Among the problems application developers face is the wide variety of available hardware and software components, such as:
Processor type and frequency, number of processors per node, and number of cores per processor,
Size and performance of the main memory, cache hierarchy,
Characteristics and performance of the network interconnect,
Operating system, device drivers and communication libraries,

and the influence of each of these components on the performance of their application. Hence, an end-user faces a unique execution environment on each parallel machine he uses. Even experts struggle to
fully understand correlations between hardware and software parameters of the execution environment
and their effect on the performance of a parallel application.

Motivating Example
In the following, we would like to clarify the dilemma of an application developer using a realistic and
common example. Consider a regular 3-dimensional finite difference code using an iterative algorithm
to solve the resulting system of linear equations. The parallel equation solver consists of three different operations requiring communication: scalar products, vector norms and matrix-vector products.
Although the first two operations do have an impact on the scalability of the algorithm, the dominating
operation from the communication perspective is the matrix-vector product. The occurring communication pattern for this operation is neighborhood communication, i.e. each process has to exchange data
with its six neighboring processes multiple times per iteration of the solver. Depending on the execution
environment and some parameters of the application (e.g. problem size), different implementations for
the very same communication pattern can lead to optimal performance. We analyze the execution times
for 200 iterations of the equation solver applied for a steady problem using 32 processes on the same
number of processors on a state-of-the-art PC cluster for two different problem sizes (32x32x32 and 64x32x32 mesh points per process) and two different network interconnects (4xInfiniBand and Gigabit
Ethernet). The neighborhood communication has been implemented in four different ways, named here
fcfs, fcfs-pack, ordered, overlap. While the nodes/processors have been allocated exclusively for these
measurements using a batch-scheduler, the network interconnect was shared with other applications
using the same PC cluster.
The results indicate that, already for this simple test case on a single platform, three different implementations of the neighborhood communication lead to the best performance of this application: although
the differences between the different implementations are not dramatic over this network interconnect,
fcfs shows the best performance for both problem sizes when using the InfiniBand interconnect. This
implementation is initiating all required communications simultaneously using asynchronous communication followed by a Waitall operation on all pending messages. However, for the Gigabit Ethernet
interconnect the fcfs approach seems to congest the network. Instead, the implementation which is
overlapping communication and computation (overlap), is showing the best performance for the small
problem size (6.2 seconds for overlap vs. 6.6 seconds for fcfs, 7.5 seconds for fcfs-pack, 8.1 seconds
for ordered) while the ordered algorithm, which limits the number of messages concurrently on the fly,
is the fastest implementation for the large problem size for this network interconnect (14.7 seconds for
ordered vs. 26.9 seconds for fcfs, 19.9 for fcfs-pack and 23.4 seconds for overlap). The implementation
considered to be the fastest one over the InfiniBand network thus leads to a performance penalty of nearly 80% over Gigabit Ethernet. An application developer implementing the neighborhood communication
using a particular, fixed algorithm will inevitably give up performance on certain platforms.
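
To make the fcfs variant described above concrete, the following fragment sketches what such an exchange could look like in MPI. It is an illustration rather than the solver's actual code: the neighbor ranks, buffers and message size are assumed to be set up by the application.

#include <mpi.h>

/* Illustrative sketch of an fcfs-style halo exchange (not the solver's
 * actual code): all six neighbor messages are initiated at once with
 * non-blocking calls and completed by a single MPI_Waitall.  The arrays
 * neighbor[], sendbuf[], recvbuf[] and the message size 'count' are
 * assumed to be set up by the application; non-existing neighbors can be
 * set to MPI_PROC_NULL and are then skipped automatically by MPI. */
void halo_exchange_fcfs(MPI_Comm cart_comm, const int neighbor[6],
                        double *sendbuf[6], double *recvbuf[6], int count)
{
    MPI_Request reqs[12];
    int i, nreq = 0;

    for (i = 0; i < 6; i++) {
        MPI_Irecv(recvbuf[i], count, MPI_DOUBLE, neighbor[i], 0,
                  cart_comm, &reqs[nreq++]);
        MPI_Isend(sendbuf[i], count, MPI_DOUBLE, neighbor[i], 0,
                  cart_comm, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}

The ordered and overlap variants mentioned above differ only in how many of these requests are in flight at a time and in whether computation is interleaved before the final wait.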

Problem Description
As demonstrated above, the wide variety in hardware and software leads to the inherent limitation that
any code sequence or communication operation which contributes (significantly) towards the overall
execution time of an application will inevitably give up performance, if the operation is hard-coded in
the source code, i.e. if the code sequence or communication operation does not have the ability to adapt its behavior at runtime in response to changing conditions. Traditional tuning approaches have fundamental limitations and are not capable of solving the problem in a satisfactory manner.

Specific Goals
The goal of this chapter therefore is to present dynamic runtime optimization techniques applied in high
performance computing. Runtime adaption in HPC serves two purposes: first, it allows tweaking the performance of a code in order to exploit the capabilities of the hardware. At the same time, it simplifies
software maintenance, since an application developer does not have to maintain multiple different versions of his code for different platforms.
The chapter focuses on one specific project, the Abstract Data and Communication Library (ADCL).
ADCL enables the creation of self-optimizing applications by allowing an application to register alternative
versions of a particular function. Furthermore, ADCL offers several pre-defined operations allowing for
seamless optimization of frequently occurring communication patterns in MPI parallel applications. Although not fundamentally restricted to collective communication operations, most operations optimized through
ADCL are collective in nature. In the following we discuss the related work in that area, present the
concept of ADCL and give performance results obtained in three different scenarios using ADCL.

BACKGROUND
In the last couple of years many projects have been dealing with optimizing collective operations in High
Performance Computing. For the subsequent discussion, projects are categorized as approaches which
are applying either static tuning, i.e. projects which typically lead to software components that cannot
alter their behavior during execution, or dynamic tuning, in which the software/application adapts its behavior at runtime as a reaction to varying conditions.

Static Tuning of Applications


Most projects applying static tuning focus on one of two approaches to determine the best performing
implementation for a particular operation: they either apply a pre-execution tuning step by testing the
performance of different versions of the same operation for various message lengths and process counts;
alternatively, some projects rely on performance prediction using sophisticated communication models
to compare different algorithms. We will discuss representatives, advantages and disadvantages for both
approaches in the next paragraphs.


Among the best known projects representing the first approach are the Automatically Tuned Linear
Algebra Software (ATLAS) (Whaley, 2005) and the Automatically Tuned Collective Communications
(ATCC) (Pjesivac-Grbovic, 2007) framework. ATLAS is a library providing optimized implementations
of the Basic Linear Algebra Subprograms (BLAS) library routines. As one of the very first projects acknowledging the wide variety of hardware and software components, ATLAS uses an extensive configuration
step to determine the best performing implementation from a given pool of available algorithms on a
specific platform and a given compiler for each operation. Furthermore, based on additional information such as cache sizes, ATLAS determines optimal, internal parameters such as the blocking factor
for blocked algorithms. As a result of the configuration step, the ATLAS library will only contain the
routines known to deliver the best performance on that platform.
Similarly to ATLAS, ATCC determines the optimal algorithms for MPI's collective operations on a
given platform by using a parallel configuration step. During this configure step, several implementations
for each collective operation are tested and the fastest algorithm for each message length is stored
in a configuration file. The resulting set of algorithms and parameters for this platform are then used
during the execution of the application. In order to minimize the size of the configuration file, ATCC uses
quad-tree encoding to represent the overall decision tree. This is also used to conclude which algorithm
to use for message sizes/process counts which have not been tested in the parallel configure step.
Projects such as ATLAS and ATCC face a number of fundamental drawbacks. First, the tuning procedure itself often takes more time than running an individual application. If the system administrators of a cluster do not reserve the necessary time slots to tune these libraries in advance (and typically they will only reserve limited time, not multiple days, to tune e.g. the MPI collective operations exhaustively on a multi-thousand-node cluster), end-users themselves will very probably not use their valuable compute time to perform these time-consuming operations. Additionally, several factors
influencing the performance of the application can only be determined while executing the application.
These factors include process placement by the batch scheduler due to non-uniform network behavior
(Evans, 2003), resource utilization due to the fact that some resources such as the network switch or file
systems are shared by multiple applications, operating system jitter leading to slow-down of a subset of
processes utilized by a parallel job (Petrini, 2003) and application characteristics such as communication
volume and frequencies. Furthermore, some projects have also highlighted the influence of process arrival patterns on the performance of collective communication operations: depending on the work that
each process has to perform, the order in which processes start to execute a collective operation varies
strongly depending on the application. Thus, the algorithm determined to lead to the best performance
using a synthetic benchmark might in fact be suboptimal in a real application (Faraj, 2007).
The second common approach used e.g. by the MagPIe project (Kielmann, 1999) compares the predicted execution time of various algorithms for a given operation using performance models. Although
some of the communication models used such as LogP (Culler, 1993) and LogGP (Alexandrov, 1995)
are highly sophisticated, these projects ultimately suffer from three limitations. Firstly, it is often hard
to determine some parameters of (sophisticated) communication models. As an example, no approach is
published as of today which derives a reasonable estimate of the receive-overhead in the LogGP model
(Hoefler, 2007). Second, while it is possible to develop a performance model for a simple MPI-level
communication operation, more complex functions involving alternating and irregular sequences of
computation and communication have hardly been modeled as of today. Lastly, all models have their
fundamental limitations and break-down scenarios, since they represent simplifications of the real world
behavior of the machines. Thus, while modeling collective communication operations can improve the understanding of performance characteristics for various algorithms, tuning complex operations based
on these models is fundamentally limited.

Dynamic Tuning of Applications


The dynamic optimization and tuning problem can be related to multiple research areas in various
domains. Starting from the lower level of the software hierarchy, most runtime optimization problems
are represented as empirical search procedures, with the boundary condition that any evaluation of
the results has to be computationally inexpensive to minimize the overhead introduced by the runtime
optimization itself. Depending on the type of the parameters tuned during the runtime optimization,
various approaches from optimization theory (Gill, 1993) can be applied as well, e.g. the method of steepest descent in the case of a continuous, real-valued parameter. Vuduc (2004) provides an excellent overview of various algorithms.
On top of the search algorithms, statistical methods are often used to remove outliers and analyze the performance results of the various alternatives tested during the search. These algorithms can vary in their complexity and range from simple inter-quartile range methods to sophisticated algorithms from cluster analysis and robust statistics. Benkert (2008) gives a good overview and a performance
comparison of different approaches.
Finally, since most approaches used in runtime optimization are separated into an evaluation phase,
where the runtime library uses certain events or instances in order to learn the best approach for those
operations, and a phase applying the knowledge determined earlier, theories from machine learning
(Witten, 2005) can often be applied as well. Once again, the main constraint for applying machine learning algorithms is due to the fact that the overhead introduced by the learning algorithm itself has to be
very low for a runtime library.
A vast body of research for code optimizations is furthermore available in the field of compilers. As
an example ADAPT (Voss, 2000) introduces runtime adaption and optimization by having different variants of a code sequence. During a sampling phase, the runtime environment explores the performance of
the different versions and decides which one performs best. The runtime environment can furthermore
invoke a separate dynamic code generator, which delivers new alternative code versions which can be
loaded dynamically.
Despite the significant progress in these areas, the number of projects applying automated (runtime) tuning techniques in HPC is still very limited. Among those projects are FFTW (Frigo, 2005),
PhiPAC (Vuduc, 2004), STAR-MPI (Faraj, 2006), and SALSA (Dongarra, 2003). In the following, we
detail three of these projects which utilize advanced adaptation techniques, and compare them to various
aspects of ADCL (Gabriel, 2007).

FFTW
The FFTW (Fastest Fourier Transform in the West) library optimizes sequential and parallel Fast Fourier
Transform (FFT) operations. To compute an FFT, the application has to invoke first a planner step
specifying a problem which has to be solved. Depending on an argument passed by the application to
the planner routine, the library measures the actual runtime of many different implementations and selects the fastest one (FFTW_MEASURE). In case many transforms of the same size are executed in an
application, this plan delivers the optimal performance for all subsequent FFTs. Since creating a plan can be time consuming, FFTW also provides a mode of operation where the planner comes up quickly
with a good estimate, which might however not necessarily be the optimal one (FFTW_ESTIMATE).
The decision procedure is initiated just once by the user. Thus, FFTW performs the runtime optimization upfront in the planner step without performing any useful work. In contrast to the approach taken by FFTW, ADCL integrates the runtime selection logic into the regular execution of the application. Thus, the ADCL approach enables the library to restart the runtime selection logic in case the observed performance deviates significantly from the performance measured during the tuning step, e.g. due to changing network conditions.
FFTW also has the notion of historic learning, through a feature called Wisdom. The user can
export experiences gathered in previous runs into a file, and reload it at subsequent executions. However,
the wisdom concept in FFTW lacks any notion of related problems, i.e. wisdom can only be reused
for exactly the same problem size that was used to generate it. Furthermore, the wisdom functionality
also does not include any mechanism which helps to recognize outdated or invalid wisdom, e.g. if the
platform used for collecting the wisdom is significantly different from the platform used when reloading the wisdom.
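
For illustration, the planner and wisdom interfaces described above are used roughly as in the following minimal sketch, assuming FFTW 3.3 or newer; the transform size and the wisdom file name are arbitrary choices of this example.

#include <fftw3.h>

/* Minimal sketch of FFTW's plan-then-execute pattern with wisdom export
 * (assumes FFTW 3.3 or newer for the *_wisdom_*_filename calls).
 * FFTW_MEASURE makes the planner time several candidate implementations
 * up front; the accumulated wisdom can be saved and reloaded later. */
int main(void)
{
    const int n = 1024;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
    int i;

    fftw_import_wisdom_from_filename("fftw.wisdom");   /* may fail on the first run */

    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    for (i = 0; i < n; i++) { in[i][0] = (double) i; in[i][1] = 0.0; }
    fftw_execute(plan);            /* the same plan can be reused for many FFTs */

    fftw_export_wisdom_to_filename("fftw.wisdom");
    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}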

STAR-MPI
STAR-MPI incorporates runtime optimization of collective communication operations providing a similar
API as defined in the MPI specifications. Using an Automatic Empirical Optimization Software (AEOS),
the library performs dynamic tuning of each collective operation by determining the performance of
all available algorithms in a repository. Once performance data for all available algorithms have been
gathered, STAR-MPI determines the most efficient algorithm.
STAR-MPI tunes different instances/call-sites for each operation separately. In order to achieve this
goal, the prototypes of the STAR-MPI collective operations have been extended by an additional argument, namely an integer value uniquely identifying each call-site. This is however hidden from applications by using pre-processor directives to redirect the MPI calls to their STAR-MPI counterparts.
Similarly to all projects focusing on runtime adaption techniques, the largest overhead in STAR-MPI
comes from the initial evaluation of the underperforming algorithms and the distributed decision logic,
which is necessary to ensure that all processes agree on the final winner. While STAR-MPI does a
good job of minimizing the latter one by only introducing a single collective global reduction, the first
item, i.e. testing of underperforming implementations is highly evident in STAR-MPI, due to the independent optimization of all operations per call-site. In contrary to that, ADCL allows for both, per-call
site optimization and concatenating performance data of multiple call-sites for the same operation and
message length.
One approach used in STAR-MPI to minimize the problem outlined in the previous paragraph is to
introduce grouping of algorithms. STAR-MPI initially compares a single algorithm from all available
groups. After the winner group has been determined, the library does a fine-tuning of the performance
by evaluating all other available algorithms within the winner group. As described later, ADCL further
extends the notion of grouping implementations using an attribute concept, which allows characterizing algorithms and alternative implementations without enforcing the participation of an algorithm in
a single group.
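
The call-site mechanism can be pictured as in the following sketch; the wrapper name STAR_Allgather, its signature and the macro below are purely hypothetical illustrations and do not reflect STAR-MPI's actual interface.

#include <mpi.h>

/* Hypothetical illustration of per-call-site redirection: every textual
 * MPI_Allgather call in the application is redirected to a wrapper that
 * carries an identifier for its call site.  Names and signature are
 * illustrative only, not STAR-MPI's real API. */
int STAR_Allgather(void *sbuf, int scount, MPI_Datatype stype,
                   void *rbuf, int rcount, MPI_Datatype rtype,
                   MPI_Comm comm, int call_site_id)
{
    /* A real tuning library would select (or keep measuring) the best
     * algorithm for this call site; here we simply fall through to MPI. */
    (void) call_site_id;
    return MPI_Allgather(sbuf, scount, stype, rbuf, rcount, rtype, comm);
}

/* Tag each call site with its source line number via the preprocessor. */
#define MPI_Allgather(sb, sc, st, rb, rc, rt, comm) \
        STAR_Allgather((sb), (sc), (st), (rb), (rc), (rt), (comm), __LINE__)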


SALSA
The Self-Adapting Large-scale Solver Architecture (SALSA) aims at providing the best suitable linear
and non-linear system solver to an application. Using the characteristics of the application matrix, the
solver contacts a knowledge database and provides an estimate on the best solver to use. Among the
characteristics used for choosing the right solver are structural properties of the matrix (e.g. maximum
and minimum number of non-zeros per row), matrix norms such as the 1- or the Frobenius-norm, and
spectral properties.
Recently, the authors have applied algorithms from machine learning such as boosting algorithms
and alternating decision trees to improve the prediction quality of the system (Bhowmick, in press). The
decision algorithm has been trained by using a large set of matrices from various application domains.
Among the interesting features of this approach is that the algorithm is capable of handling missing features for the prediction, e.g. in case some norms are considered too expensive to be calculated at runtime. The main drawback of the approach within the context of this chapter lies in the fact that the training steps have to be executed before running the application, due to the computational complexity of the according operations. The problem is however softened by the fact that the knowledge database is by design reusable across multiple runs/executions.

THE ABSTRACT DATA AND COMMUNICATION LIBRARY


The Abstract Data and Communication Library (ADCL) enables the creation of self-optimizing applications by either registering alternative versions of a particular function or by using predefined operations
capable of self-optimization. ADCL uses the initial iterations of the application to determine the fastest
available code version. Once performance data on a sufficient number of versions is available, the library
makes a decision on which alternative to use throughout the rest of the execution.
From the conceptual perspective, ADCL takes advantage of two characteristics of most scientific
applications:
1. Iterative execution: most parallel, scientific applications are centered around a large loop and therefore execute the same code sequence over and over again. Consider for example an application which solves a time-dependent partial differential equation (PDE). These problems are often solved by discretizing the PDE in space and time, and by solving the resulting system of linear equations for each time step. Depending on the application, iteration counts can reach six-digit numbers.
2. Collective execution: most large-scale parallel applications are based on data decomposition, i.e. all processes execute the same code sequence on different data items. Processes are typically also synchronized, i.e. all processes are in the same loop iteration. This synchronization is often required for numerical reasons and is enforced by communication operations.

Description of the ADCL API


The ADCL API offers high-level interfaces for application-level collective operations. These are required
in order to be able to switch the implementation of the according collective operation within the library
without modifying the application itself. The main objects within the ADCL API are:


ADCL_Topology: provides a description of the process topology and neighborhood relations within the application.
ADCL_Vector: specifies the data structures to be used during the communication. The user can for example register a data structure such as a matrix with the ADCL library, detailing how many dimensions the object has, the extent of each dimension, the number of halo-cells, and the basic datatype of the object.
ADCL_Function: each ADCL function is the equivalent of an actual implementation of a particular operation.
ADCL_Fnctset: a collection of ADCL functions providing the same functionality. ADCL provides pre-defined function-sets, such as for neighborhood communication (ADCL_FNCTSET_NEIGHBORHOOD). The user can however also register their own functions in order to utilize the ADCL runtime selection logic.
ADCL_Attribute: an abstraction for a particular characteristic of a function/implementation. Each attribute is represented by the set of possible values for this characteristic.
ADCL_Attrset: an ADCL attribute-set is a collection of ADCL attributes. An ADCL function-set can have an ADCL attribute-set attached to it, in which case all functions in the function-set have to provide valid values for each attribute in the attribute-set.
ADCL_Request: combines a process topology, a function-set and a vector object. The application can initiate a communication by starting a particular ADCL request.

The following code sequence gives a simple example for an ADCL code, using a 2-D neighborhood
communication on a 2-D process topology. This application first generates a 2-D process topology using an MPI Cartesian communicator. By registering a multi-dimensional matrix with ADCL, the library
generates a vector-object. Combining the process topology, the vector object and the predefined function
set ADCL_FNCTSET_NEIGHBORHOOD allows the library to determine automatically which portions
of the vector have to be transferred to which process. Afterwards, each call to ADCL_Request_start
initiates a neighborhood communication.

double vector[...][...];
ADCL_Vector   vec;
ADCL_Topology topo;
ADCL_Request  request;

/* Generate a 2-D process topology */
MPI_Cart_create (MPI_COMM_WORLD, 2, cart_dims, periods, 0, &cart_comm);
ADCL_Topology_create (cart_comm, &topo);

/* Register a 2-D vector with ADCL */
ADCL_Vector_register (ndims, vec_dims, NUM_HALO_CELLS, MPI_DOUBLE,
                      vector, &vec);

/* Combine description of data structure and process topology */
ADCL_Request_create (vec, topo, ADCL_FNCTSET_NEIGHBORHOOD, &request);

/* Main application loop */
for (i = 0; i < NIT; i++) {
    ...
    /* Initiate neighborhood communication */
    ADCL_Request_start (request);
    ...
}

Technical Concept
Two key components of ADCL are the algorithm used to determine which versions of a particular operation shall be tested, and the mechanism used to decide efficiently across multiple processes on the best performing
version. In the following, we give some details on both components.

Distributed Decision Logic


A fundamental assumption within ADCL is that the library has multiple alternative versions of a particular functionality available to choose from. These alternatives will be stored as different functions in the same function-set. The number of alternatives can range from a few (e.g. the user providing three different versions of a parallel matrix-multiply operation) to many millions, in case the user is exploring different values for internal or external parameters, such as various buffer sizes, loop unroll depths, etc. As of today, ADCL incorporates two different strategies for version selection at runtime. The first strategy is a simple brute-force search, which evaluates all available alternatives. An alternative version selection algorithm is used if the user annotates the implementations with a set of attributes/
attribute values. These attributes are used to reduce the time taken by the runtime selection procedure,
by tuning each attribute separately.
Independently of the version selection approach used by the library, the collective decision logic of
ADCL will have to compare performance data of multiple functions gathered on different processes.
The challenge lies in the fact that, in the most general case, processes only have access to their own
performance data and performance data for the same code version might in fact differ significantly
across multiple processes. Distributing the performance data of all processes for all versions to all other
processes is however not feasible, since the costs for communicating these large volumes of data would
often offset the performance benefits achieved by runtime tuning. The approach taken by the library
relies therefore on data reduction, i.e. each process provides only a single value for each alternative
version of the code section being optimized. In order to detail the algorithm, let us assume that ADCL gathers $n$ measurements/data points for each version $i$ on each process $j$. Let us denote the execution time of the $k$-th measurement by $t(i, j, k)$. In an initial step, the library removes outliers from the data set, i.e. measurements not fulfilling the condition $C: t(i, j, k) < \beta \cdot \min_k t(i, j, k)$, with $\beta$ being a well-defined constant. This leads to a filtered subset $M_f(i, j) = \{ t(i, j, k) \mid t(i, j, k) \text{ fulfills } C \}$ of measurements with cardinality $n_f(i, j)$. Then, the performance measurements for each version are analyzed locally on each process and characterized by the local average execution time


m(i, j) = \frac{1}{n} \sum_{k} t(i, j, k)

and its filtered counterpart

m_f(i, j) = \frac{1}{n_f(i, j)} \sum_{k \in M_f(i, j)} t(i, j, k)

as estimates of the mean value. In a global reduction operation, the library determines for each version
the maximum average execution time across all processes
m(i) = \max_{j} m(i, j), \qquad m_f(i) = \max_{j} m_f(i, j),

considering all data and only the filtered data, respectively, as well as the maximum number of outliers n_o(i) over all processes,

n_o(i) = \max_{j} n_o(i, j).

This reduction is motivated by a fundamental law in parallel computing, which states that the performance of a (synchronous) application is determined by the slowest process/processor. Finally, the library selects the maximum execution time including or excluding outliers by

r(i) = \begin{cases} m_f(i) & \text{if } n_o(i) \leq n_{mo} \\ m(i) & \text{otherwise} \end{cases}

depending on whether the maximum number of outliers allowed, n_{mo}, is exceeded or not. The version i fulfilling r(i) = \min_{i'} r(i') is chosen as the best one. Assuming that the runtime environment produces reproducible performance data over the lifetime of an application, this algorithm is guaranteed to find the fastest of the available implementations for the current tuple of {problem size, runtime environment, versions tested}.
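
A compact sketch of this decision step is given below. It is not ADCL's internal code; it assumes that each process has already collected its n raw timings per version, and it omits the outlier-count fallback r(i) described above.

#include <stdlib.h>
#include <mpi.h>

/* Sketch of the collective decision step: each process condenses its n raw
 * timings per version into one filtered local mean, a single MPI_MAX
 * reduction yields the execution time of the slowest process for every
 * version, and the version with the smallest such maximum is selected on
 * all processes.  t[v][k] is the k-th timing of version v on this process;
 * beta (> 1) is the outlier threshold from the text. */
int select_best_version(double **t, int nversions, int n,
                        double beta, MPI_Comm comm)
{
    double *local  = (double *) malloc(nversions * sizeof(double));
    double *global = (double *) malloc(nversions * sizeof(double));
    int v, k, best = 0;

    for (v = 0; v < nversions; v++) {
        double tmin = t[v][0], sum = 0.0;
        int nf = 0;
        for (k = 1; k < n; k++)                  /* fastest observation */
            if (t[v][k] < tmin) tmin = t[v][k];
        for (k = 0; k < n; k++)                  /* filtered local mean */
            if (t[v][k] < beta * tmin) { sum += t[v][k]; nf++; }
        local[v] = sum / nf;
    }

    /* the slowest process determines the time attributed to each version */
    MPI_Allreduce(local, global, nversions, MPI_DOUBLE, MPI_MAX, comm);

    for (v = 1; v < nversions; v++)
        if (global[v] < global[best]) best = v;

    free(local);
    free(global);
    return best;    /* identical result on every process */
}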

Attribute Based Tuning


ADCL extends the algorithm grouping concept used in STAR-MPI by introducing the formal notion of
attributes. The main idea behind this concept is that any implementation of a collective communication operation has certain implicit requirements on the hardware and software environment in order to
achieve the expected performance. As an example, ADCL uses as of today three attributes in order to
characterize an implementation for the neighborhood communication function-set:
1. Number of simultaneous communication partners: this attribute characterizes how many communication operations are initiated at once. For neighborhood communication, the currently supported values by ADCL are all (ADCL attribute value aao) and one (pair). This parameter is typically bound by the network/switch.
2. Handling of non-contiguous messages: supported values are MPI derived data types (ddt) and pack/unpack (pack). The optimal value for this parameter will depend on the MPI library and some hardware characteristics.
3. Data transfer primitive: a total of eight different data transfer primitives are available in ADCL as of today, which can be categorized as either blocking communication (e.g. MPI_Send, MPI_Recv), non-blocking/asynchronous communication (e.g. MPI_Isend, MPI_Irecv), or one-sided operations (e.g. MPI_Put, MPI_Get). Which data transfer primitive will deliver the best performance depends on the implementation of the according function in the MPI library and potentially some hardware support (e.g. for one-sided communication).

Please note that not all combinations of attributes might really lead to feasible implementations. As an
example, blocking data transfer primitives such as MPI_Send/MPI_Recv cannot be used by implementations having more than one simultaneous communication partner, since this would
potentially result in a deadlock. Therefore, a total of 20 implementations are currently available within
ADCL for the n-dimensional neighborhood communication. Further attributes such as the capability of
the library/environment to overlap communication and computation will be added in the near future.
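
To illustrate the second attribute, the following sketch contrasts sending one non-contiguous halo face either through an MPI derived datatype or by packing it into a contiguous scratch buffer first; the block count, block length and stride are assumptions of this illustration and depend on the actual mesh layout.

#include <mpi.h>

/* Sketch of the 'ddt' versus 'pack' handling of one non-contiguous halo
 * face, modeled as 'count' blocks of 'blocklen' doubles separated by
 * 'stride' doubles. */
void send_face_ddt(double *face, int count, int blocklen, int stride,
                   int dest, MPI_Comm comm)
{
    MPI_Datatype facetype;
    MPI_Type_vector(count, blocklen, stride, MPI_DOUBLE, &facetype);
    MPI_Type_commit(&facetype);
    MPI_Send(face, 1, facetype, dest, 0, comm);   /* let MPI gather the data */
    MPI_Type_free(&facetype);
}

void send_face_pack(double *face, int count, int blocklen, int stride,
                    int dest, MPI_Comm comm, double *scratch)
{
    int i, j, n = 0;
    for (i = 0; i < count; i++)                   /* copy into contiguous buffer */
        for (j = 0; j < blocklen; j++)
            scratch[n++] = face[i * stride + j];
    MPI_Send(scratch, n, MPI_DOUBLE, dest, 0, comm);
}

Which of the two performs better depends on how efficiently the MPI library handles derived datatypes, which is exactly why ADCL treats it as a tunable attribute.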
In order to speed up the selection logic, an alternative runtime heuristic based on the attributes characterizing an implementation has been developed. The heuristic is based on the assumption that the fastest
implementation for a given problem size on a given execution environment is also the implementation
having optimal values for the attributes in the given scenario. Therefore, the algorithm tries to determine
the optimal value for each attribute used to characterize an implementation. Once the optimal value for
an attribute has been found, the library removes all implementations not having the required value for
the according attribute and thus shrinks the list of available implementations.
In order to explain the approach in more detail, let us assume that an implementation is characterized by N attributes. Each attribute has n_v(i), i = 1,...,N possible values v(i, j), j = 1,...,n_v(i). The library assumes that the optimal value k_opt(i) for an attribute i has been found if r_c(i) measurements confirm this hypothesis. In order to be able to deduce the optimal value of a single attribute from a set of measurements, the library only compares the execution times of implementations whose attributes differ only in the according attribute. To clarify this approach, please assume that we have to deal with four different attributes (N = 4) and want to determine the best value for the second attribute. We assume that this attribute has three distinct values (n_v(2) = 3), e.g. v(2, 1) = 1, v(2, 2) = 2, and v(2, 3) = 3. Since the values of all attributes except for the second one are constant, we assume that any performance differences between the three implementations can be attributed to the second attribute. The library determines collectively across all processes which of the three implementations has the lowest average execution time, using the same approach as outlined in the previous subsection. If we assume, as an example, that the implementation with the attribute values [v(1, j), 3, v(3, j), v(4, j)] has the lowest average execution time, the library would develop the hypothesis that 3 is the optimal value for the second attribute.
At this point, only one set of measurements confirms the hypothesis that 3 is the optimal value for the second attribute. Thus, the confidence value of this hypothesis is set to 1. Typically, a hypothesis has to be confirmed by more than one set of measurements before ADCL considers it to be probably correct. Thus, an additional set of measurements with differing (but constant) values for one of the other attributes has to be gathered, e.g. by using v(3, j+1) as the value for the third attribute. If the new
set of measurements confirms the result of the previous set, the confidence value for the hypothesis is
increased. If another attribute value is determined for this set of measurements to be the best one, the
confidence value for the original performance hypothesis is decreased by one. Once a hypothesis reaches
the required number of confirmations, the library removes all implementations which do not have the
optimal value for the according attribute and shrinks the list of available implementations. Please note
that if the measurements do not converge toward an optimal value for an attribute, no implementation
will be removed based on this attribute.
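
The bookkeeping behind this heuristic can be pictured for a single attribute as follows; the structure and update rule are one possible realization of the confidence scheme described above, not ADCL's internal data structures.

/* One possible realization of the per-attribute confidence bookkeeping:
 * 'winner' is the attribute value that won the latest set of comparable
 * measurements; the function returns 1 once the current hypothesis has
 * been confirmed often enough. */
struct attr_state {
    int hypothesis;   /* currently assumed optimal value, -1 if none yet   */
    int confidence;   /* number of measurement sets confirming it          */
    int required;     /* confirmations needed (r_c(i) in the text)         */
};

int attr_update(struct attr_state *a, int winner)
{
    if (a->hypothesis == winner) {
        a->confidence++;                   /* another set confirms the hypothesis */
    } else if (a->hypothesis == -1 || --a->confidence <= 0) {
        a->hypothesis = winner;            /* adopt, or switch to, a new hypothesis */
        a->confidence = 1;
    }
    return a->confidence >= a->required;   /* attribute decided? */
}

Once attr_update() reports a decided attribute, all implementations whose value for that attribute differs from the hypothesis can be dropped from the candidate list, as described above.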

Ongoing Research
In addition to the topics described so far, active research is currently being performed in multiple areas.
In this subsection we would like to highlight some of the topics.
The first research direction is related to the version management and selection within ADCL. As
described in the previous paragraphs, ADCL utilizes as of today either a brute-force search strategy, which is applied in case a function-set is not characterized by any attributes, or the per-attribute based search strategy. Both approaches can however be further improved. The
brute-force search can be extended by algorithms containing early stopping criteria, as described in (Vuduc, 2004). This approach helps to reduce the number of alternative versions tested by randomly selecting versions to test and by giving a measure of when the worst implementations have been excluded with a certain probability.
The main restriction of the per-attribute based search strategy as of today is that it assumes that attributes are fundamentally not correlated. Depending on the usage scenario and attributes defined by the
application, this assumption is not necessarily correct. However, there is a broad body of work known
in experimental design theory, namely the 2^k factorial design algorithms, which provide an excellent framework for version management of correlated attributes. ADCL is currently being extended to include 2^k factorial design algorithms for correlated attributes.
A second active research area is to introduce the notion of historic learning in ADCL, i.e. to develop mechanisms to propagate the results of various optimizations from one run to another. Historic learning in ADCL extends the Wisdom concept of FFTW in multiple areas: first, the historic data in ADCL is always accompanied by a high-level description of the architecture used to determine the results of the optimization. This is necessary in order to develop mechanisms which can automatically discard data in the historic database, e.g. in case an application is run on a different network than the machine had when a particular optimization was performed. Discarding historic data is furthermore supported by
introducing the notion of expected performance. In case the historic data suggests a particular version of
a function-set to lead to the optimal performance, ADCL can also generate an estimate of the execution
time for that operation. If the measured execution time for the operation deviates from the predicted execution time by more than a given threshold, the library automatically discards the results and starts a new optimization, assuming that the runtime conditions have changed compared to the original assumptions.


Lastly, ADCL introduces the notion of related problems, in order to have the ability to transfer results from similar problems solved on similar machines. Related problems in that sense are defined by introducing a function-set specific distance measure. As an example, for the neighborhood communication the library uses the Euclidean distance between two problems, using the vector of the data sizes transmitted to each process as the base measure.
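
For the neighborhood function-set, such a distance measure could be realized as simply as in the following sketch; the function name and the use of per-neighbor message sizes in bytes are illustrative assumptions, not ADCL's actual implementation.

#include <math.h>

/* Illustrative sketch of a function-set specific distance measure between
 * two "related problems": the Euclidean distance between their per-neighbor
 * message-size vectors ('n' neighbors, sizes e.g. in bytes). */
double problem_distance(const double *sizes_a, const double *sizes_b, int n)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        double d = sizes_a[i] - sizes_b[i];
        sum += d * d;
    }
    return sqrt(sum);
}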

PERFORMANCE EVALUATION
In the following, we present performance results of three different scenarios. The first one discusses the
optimization of the three-dimensional neighborhood communication as often occurring in scientific applications. The second scenario describes the results achieved for tuning a parallel matrix-matrix multiply
kernel. Finally, the third scenario describes the usage of ADCL within the context of a tool optimizing
the system parameters of the Open MPI library.

Optimizing 3-D Nearest Neighbor Communication


In the following, we will analyze the effect of using different implementations for the neighborhood
communication on the performance of a parallel, iterative solver as often applied in scientific applications. The software used in this section solves a set of linear equations that stem from the discretization of a partial differential equation (PDE) using center differences. The parallel implementation subdivides the computational domain into subdomains of equal size. The processes are mapped onto a regular three-dimensional Cartesian topology. Due to the discretization scheme, a processor has to communicate with
at most six processors to perform a matrix-vector product. For the subsequent analysis, the code has
been modified such that it makes use of the ADCL library, i.e. the sections of the source code which
established the 3-D process topology and the neighborhood communication routines have been replaced
by the according ADCL counterparts.
In order to evaluate whether the runtime selection logic makes the correct decision, we additionally executed the
same application using one single implementation at a time by circumventing the runtime selection logic
of ADCL. In order to make the conditions as comparable as possible, the reference data was produced
within the same batch scheduler allocation and thus had the same node assignments. We will refer to
these measurements as verification runs throughout the rest of this section. During each verification
run, the execution times of 700 iterations per implementation were stored and subsequently averaged
over all three runs. Depending on the machine, we have executed three different problem sizes, namely
a small test case with (32 x 32 x 32) mesh points per process, a medium test case with (64 x 32 x 32)
mesh points per process, and a large test case with (64 x 64 x 32) mesh points per process. Since most
MPI libraries do not show performance advantages for MPI put/get operations compared to two-sided
communication on a typical PC cluster and in order to simplify our analysis, we have configured ADCL
for the following tests without the one-sided data transfer primitives. This leaves twelve implementations
for the 3-D neighborhood communication for the runtime selection logic to choose from. The number
of tests required to evaluate an implementation has been set to 30.
Tests have been executed on six platforms in total:


Table 1. Best performing implementation on different architectures and for different problem sizes

Architecture            # of proc    Pb Size/Proc    Best implementation
DataStar p655 (DS)      64           64x32x32        SendIrecv_aao
                        64           64x64x32        SendIrecv_aao
                        128          64x32x32        SendIrecv_pair_pack
                        128          64x64x32        IsendIrecv_aao
                        256          64x32x32        IsendIrecv_aao
                        256          64x64x32        IsendIrecv_aao
                        512          64x32x32        SendIrecv_pair_pack
                        512          64x64x32        IsendIrecv_aao
IBM Blue Gene/L (BG)    128          64x32x32        SendIrecv_pair_pack
                        128          64x64x32        IsendIrecv_aao_pack
                        256          64x32x32        SendIrecv_pair_pack
                        256          64x64x32        IsendIrecv_aao_pack
                        512          64x32x32        SendIrecv_pair_pack
                        512          64x64x32        IsendIrecv_aao_pack
NEC SX8 (SX)            16           64x32x32        Sendrecv_pair
                        16           64x64x32        Sendrecv_pair
                        32           64x32x32        SendIrecv_pair_pack
                        32           64x64x32        SendIrecv_pair_pack
                        64           64x32x32        Sendrecv_pair_pack
                        64           64x64x32        SendIrecv_pair_pack
SharkIB (ShIB)          32           32x32x32        IsendIrecv_aao
                        32           64x64x32        SendIrecv_pair_pack
                        48           32x32x32        SendIrecv_aao
                        48           64x64x32        IsendIrecv_aao
SharkGE (ShGE)          32           32x32x32        IsendIrecv_aao
                        32           64x64x32        IsendIrecv_aao_pack
                        48           32x32x32        IsendIrecv_aao
                        48           64x64x32        Sendrecv_pair
CacauGE (CGE)           64           64x32x32        SendRecv_pair_pack

1. IBM Blue Gene/L: The Blue Gene system at the San Diego Supercomputing Center consists of 3,072 compute nodes with 6,144 processors. Each node consists of two PowerPC processors that run at 700 MHz and share 512 MB of memory. All compute nodes are connected by two high-speed networks: a 3-D torus for point-to-point message passing and a global tree for collective message passing. We ran tests using 128, 256 and 512 processes.
2. NEC SX8: the installation at the High Performance Computing Center in Stuttgart, Germany (HLRS) consists of 72 nodes, each node having 8 vector processors of 16 GFLOPS peak (2 GHz) and 128 GB of main memory. The nodes are interconnected by an IXS switch. Each node can send and receive data with 16 GB/s in each direction. We executed tests with 16, 32 and 64 processes.


Figure 1. Performance overhead of the slowest vs. fastest implementation on each platform and for
each problem size

3. DataStar p655: the p655 partition of the DataStar cluster at the San Diego Supercomputing Center has 272 8-way compute nodes: 176 nodes with 1.5 GHz Power4+ CPUs and 16 GB of memory, and 96 with 1.7 GHz Power4+ CPUs and 32 GB of memory. The nodes are connected by an IBM high-speed Federation switch. The tests executed and presented in this subsection include runs with 64, 128, 256 and 512 processes.
4. SharkIB: this cluster consists of 24 nodes, each node having a dual-core AMD Opteron processor and 2 GB of main memory. The nodes are interconnected by a 4xInfiniBand network. We present results for 32 and 48 processes on 16 respectively 24 nodes.
5. SharkGE: this is the same cluster as described in the previous item, using however the Gigabit Ethernet network interconnect.
6. CacauGE: Cacau is a 200-node, dual-processor Intel EM64T cluster at the HLRS. Although the main network of the cluster is a 4xInfiniBand interconnect, we used for the subsequent analysis the secondary network of the cluster, namely a hierarchical Gigabit Ethernet network. This network consists of six 48-port switches; each 48-port switch has four links to the upper-level 24-port Gigabit Ethernet switch. Thus, this network has a 12:1 blocking factor. We have executed tests using 64 processors on 64 nodes in order to ensure that communication between the processes has to use two or more of the 48-port switches.


Figure 2. Comparison of the largest problems run on each platform (left) and of various platforms and
problem sizes (right)

Table 1 summarizes for each platform and problem size the implementation of the neighborhood
communication which leads to the overall best performance. In the 29 different test cases presented
here, seven out of the twelve implementations available in ADCL for that communication pattern turn
out to lead to the minimal execution time of the application. Most notably, the best performing implementation changes depending on both the number of processors and the problem size per process for basically all platforms tested.
In the following, we show that (1) hard coding a particular code sequence can lead to a significant
performance penalty on any platform; (2) pre-tuning the code on a platform does not lead to portable
performance on other platforms and (3) that ADCL is capable of generating portable code with minimal
overhead compared to manually tuned versions.
Since most applications hard-code the neighborhood communication using a sequence of Send/Receive operations, we next quantify the performance implication of using a suboptimal code version for the neighborhood communication on the overall performance of the code. For this, Figure 1 shows the maximum performance penalty that an application could face in this scenario, obtained by comparing the performance of the best and the worst performing implementation. The penalty an application faces in this scenario depends on the platform used. While most of the platforms analyzed show a performance penalty in the range of 5-20% in this test, some platforms are far more sensitive to the implementation, such as the two platforms using Gigabit Ethernet networks, for which the execution time nearly doubles in the worst case. The NEC SX8 also shows a significant sensitivity to the implementation, with an additional overhead of more than 60% depending on the number of processes and the problem size.
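To make this concrete, the sketch below shows one typical way such a neighborhood exchange is hard-coded with plain MPI point-to-point calls. The buffer and neighbor names are hypothetical, and this is only one of the many code variants among which a runtime selection library such as ADCL would choose; it is not the ADCL interface itself.

    /* Sketch: hand-coded neighborhood exchange with non-blocking
     * point-to-point operations. Names (nneighbors, neighbor[], sbuf, rbuf,
     * count) are illustrative placeholders, not part of ADCL. */
    #include <mpi.h>

    void exchange_halo(double *sbuf[], double *rbuf[], int count,
                       int nneighbors, const int neighbor[], MPI_Comm comm)
    {
        MPI_Request reqs[2 * 32];   /* assumes at most 32 neighbors */
        int nreq = 0;

        /* post all receives first to reduce unexpected-message overhead */
        for (int i = 0; i < nneighbors; i++)
            MPI_Irecv(rbuf[i], count, MPI_DOUBLE, neighbor[i], 0, comm,
                      &reqs[nreq++]);

        /* then post the matching sends */
        for (int i = 0; i < nneighbors; i++)
            MPI_Isend(sbuf[i], count, MPI_DOUBLE, neighbor[i], 0, comm,
                      &reqs[nreq++]);

        /* wait for the whole exchange to complete */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }

Whether this particular ordering of non-blocking receives and sends is the fastest variant depends on the platform, which is exactly the portability problem quantified in Figure 1.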
Next, we would like to quantify the penalty an application would pay by using a code version which
has been tuned on one platform on another platform. We detail two scenarios. First, for each platform
we choose the largest problem on the largest number of processors that we ran, and evaluate what the
performance penalty would be to use an implementation which has been determined to be the winner on any of the other platforms. In Figure 2, each entry represents the performance penalty of the application running on the platform shown on the x-axis when using the implementation determined to be the winner on the platform shown on the corresponding y-axis. As an example, the bar in the first row, third column


shows the performance penalty for the application running the large problem size on the DataStar cluster using the winner function determined by the SX8 with 64 processes.

Figure 3. Performance difference between the manually tuned and an automatically tuned code version using ADCL
The most remarkable result of Figure 2 (left) is that the winner functions of SharkGE and CacauGE cause significant performance penalties on the high performance interconnects used on DataStar, IBM Blue Gene and the NEC SX8. The performance penalty ranges from 1.74% up to 58.99%. Vice versa, the implementations chosen on these machines lead to significant performance penalties on SharkGE and CacauGE, increasing the execution time by up to 90% compared to their fastest implementation. This result is especially relevant, since many large-scale scientific applications are originally developed and tuned on a smaller cluster within the institute where the authors reside. Typically, these smaller clusters utilize Gigabit Ethernet network interconnects. Our results indicate that, when moving to a large-scale system at a remote site, the code tuned for the Gigabit Ethernet network might in fact pay a significant performance penalty when run without modifications.
In the second scenario, we focus on only three architectures, namely Datastar, IBM Blue Gene and
the NEC SX8. We analyze the execution times for the medium problem size for all available number of
processes. The results are presented using the same format as before. Although the performance penalty
for many scenarios shown in the right part of Figure 2 is negligible, there are notable exceptions. As
an example, applying the winner function of the 64 processes case on Datastar onto the 128 processes


case on Blue Gene/L leads to a 5% increase in the execution time of that simulation. Similarly, the best performing implementation in the 256 processes test case on the Blue Gene would lead to an 11.49% increase in the execution time for the 256 processes test case on the DataStar architecture. Last but not least, the implementation leading to the best performance in the 64 processes test case on the SX8 would lead to a performance penalty of more than 15% on the same architecture for the same problem size per process but for the 32 processes test case.

ADCL Performance Results


So far, we have documented that the performance of an application does depend on the implementation of the neighborhood communication, the hardware architecture, the application problem size, and the number of processes. In the following, we would like to show that using the ADCL runtime selection logic leads in most cases to close-to-optimal performance. Figure 3 documents the average overhead of the application when using ADCL compared to the performance of the application using the fastest implementation for that particular scenario. Figure 3 distinguishes between the two runtime selection logics of ADCL, namely the brute-force search strategy and the attribute-based search strategy.
The main result of Figure 3 is that the execution time of the application when using ADCL with its runtime adaption features is in fact very close to the optimal performance determined in the verification runs. The overhead introduced by ADCL is in the vast majority of the test cases below 1%. This (minimal) overhead stems from two facts: first, during the initial iterations of the application, ADCL evaluates some implementations which show a suboptimal performance on that platform for that particular problem size. Second, ADCL incorporates a distributed decision algorithm in order for all processes to agree on the same implementation as the winner. This distributed decision algorithm requires one allreduce operation per implementation. Furthermore, the attribute-based search strategy shows in virtually all test scenarios a lower overhead due to the reduced number of implementations being tested.
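The following sketch illustrates the general shape of such a timing-based, distributed selection step. The callback array, the averaging over a few repetitions and the use of the maximum time across processes are assumptions made for illustration; the actual ADCL decision logic, including its outlier handling, is more elaborate.

    #include <mpi.h>

    /* Sketch: measure each candidate implementation a few times, agree on a
     * per-candidate cost with one allreduce per implementation, and pick the
     * candidate with the smallest agreed cost. run_candidate[] is a
     * hypothetical array of callbacks, each executing one iteration. */
    int select_winner(void (*run_candidate[])(void), int ncandidates,
                      int repetitions, MPI_Comm comm)
    {
        int winner = 0;
        double best = -1.0;

        for (int c = 0; c < ncandidates; c++) {
            double t0 = MPI_Wtime();
            for (int r = 0; r < repetitions; r++)
                run_candidate[c]();
            double local = (MPI_Wtime() - t0) / repetitions;

            /* one allreduce per implementation: every process sees the same
             * (here: maximum) cost, so all processes agree on the winner */
            double global;
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, comm);

            if (best < 0.0 || global < best) { best = global; winner = c; }
        }
        return winner;   /* identical on every process */
    }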
There are two notable exceptions to the results above: firstly, the hierarchical Gigabit Ethernet network used on Cacau and, secondly, the 48 processes test case on SharkGE when using the attribute-based search strategy and the large problem size. In the first scenario, despite the fact that the ADCL runtime selection logic did determine the correct implementation as the winner in all three runs, the performance penalty for using a suboptimal implementation during the learning phase turned out to be tremendous: the overall execution time of the application increased by 72% compared to using the optimal implementation from the very beginning. This result highlights the necessity for additional, improved runtime selection algorithms which can further reduce the time required to determine the fastest algorithm.
For the second problem scenario, a more detailed analysis shows that in two out of the three runs which were used to calculate the average overhead in Figure 3, the attribute-based search strategy did reveal a very good performance, showing only a minor overhead compared to the optimal execution time. However, in the third run, the system seemed to face some perturbations which led to a wrong decision by the ADCL runtime selection logic. Using a suboptimal implementation for that test case resulted in a significant overhead. ADCL as of today relies on the assumption that the data gathered during the training phase is representative of the overall execution. In case this assumption turns out to be wrong, as happened in the third run, the runtime selection logic will make a suboptimal decision. In order to handle this scenario, ADCL will be extended by a monitoring subsystem in the near future. In case the performance data of the winner implementation deviates significantly from the performance data

gathered during the learning phase, ADCL will be able to restart the runtime selection logic and thus correct an erroneous decision. However, this component is not yet available today.

Figure 4. Performance of various algorithms for the parallel matrix-matrix multiply operation and the corresponding ADCL results

Tuning a Parallel Matrix-Matrix Multiply Kernel


Matrix-matrix multiplication is a common operation in many applications from graph theory, numerical
algorithms, digital control, and signal processing. Within the scope of a master's thesis (Huang, 2007),
three different kernels for a parallel matrix-matrix multiply operation have been implemented. The code
used as the basis for this analysis assumed that matrices are decomposed among the processes using 1-D
decomposition, i.e. each process holds a certain number of columns of the overall matrix. During execution, a process calculates a partial result using the sub-matrices it currently has access to. Using some form
of communication, the processes then exchange their sub-matrices and successively perform the same
calculation on new sub-matrices, adding the new results to the previous ones. From the computational
perspective, Cannon's algorithm is used to implement the parallel matrix-matrix multiplication. The
main difference between the versions is the approach taken for communication between the processes.
The three communication patterns explored within the thesis can be described as follows:

Synchronous: In this version, the algorithm performs the computation on the local part of the matrices, followed by a circular shift of the sub-matrices of B. Thus, after the first communication step, process 1 holds the sub-matrix of B which had originally been assigned to process 0, process 2 holds the sub-matrix of B originally assigned to process 1, and so on. This sequence of computation and communication is repeated p times for a run with p processes. After the final computation, an additional shift operation is required in order to restore the original assignment of the sub-matrices of B. This implementation is called synchronous since there is no overlap between the communication and the computation (a minimal sketch of this shift pattern is given after this list).
Overlapping: The main difference between this version and the previous one is that the code tries to overlap the communication occurring in the shift operations with the computation. For this, an additional temporary buffer is required to hold a sub-matrix. Using a double-buffering concept, the buffer holding the sub-matrix of B and the temporary buffer are used in an alternating fashion for


communication and computation.


Broadcast: This implementation avoids the circular shift operations for transferring sub-matrices of B. Instead, process i broadcasts its sub-matrix in iteration i to all other processes, and each process performs the corresponding part of the computation.
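As referenced in the description of the synchronous variant above, the sketch below outlines the communication skeleton of that version. The block layout, the block size and the local partial product are omitted, and all names are illustrative placeholders rather than the code from the thesis.

    #include <mpi.h>

    /* Sketch of the "synchronous" variant: the local block of B is passed
     * around the ring of processes, and between two shifts each process
     * performs its share of the computation. */
    void synchronous_shift_skeleton(double *b_block, int block_elems,
                                    MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int left  = (rank - 1 + size) % size;   /* we receive from the left */
        int right = (rank + 1) % size;          /* and send to the right    */

        for (int step = 0; step < size; step++) {
            /* ... partial matrix-matrix product using the local columns of A
             * and the currently held block of B, accumulated into C
             * (omitted in this sketch) ... */

            /* circular shift of the B block; blocking, so there is no
             * overlap of communication and computation in this variant */
            MPI_Sendrecv_replace(b_block, block_elems, MPI_DOUBLE,
                                 right, 0, left, 0, comm, MPI_STATUS_IGNORE);
        }
        /* after 'size' shifts every process again holds its original block */
    }

The overlapping variant replaces the blocking MPI_Sendrecv_replace with non-blocking transfers into a second buffer so that the next block can arrive while the current one is still being used.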

The three versions of the matrix-matrix multiplication described in this section have been integrated with ADCL in order to let the ADCL runtime selection logic decide dynamically which algorithm performs best for a given application and hardware configuration. Note that ADCL deals in this subsection with a user-defined function set, in contrast to the pre-defined function set used in the previous subsection.
For our tests, we used the SharkIB and SharkGE clusters described in the previous subsection. We show here performance results for matrix sizes of 1600x1600 and 3200x3200, using 8, 16 and 32 processes (Figure 4).
The results obtained indicate that the algorithm using the broadcast operation for disseminating the sub-matrices performs significantly slower than the implementations labeled synchronous and overlap. Over InfiniBand, synchronous typically achieves a slightly better performance than overlap, while over Gigabit Ethernet overlap is somewhat faster. ADCL chose the right implementation in all instances. The execution time of the operations when using ADCL is, however, somewhat higher than that of the corresponding best implementation, due to the fact that the library also has to test the under-performing broadcast version. This is especially evident for the 32 process test case over Gigabit Ethernet, where the penalty for using the broadcast version is tremendous.

Using ADCL to Tune System Parameters of Open MPI


The last usage scenario of ADCL described in this chapter deals with tuning runtime parameters of a
communication library such as Open MPI (Gabriel, 2004). Open MPI supports a large number of runtime parameters, which allow an end-user or system administrator to tune specifics of the library, such
as network parameters, settings for collective operations or processor affinity options, without having
to recompile the library. The Open Tool for Parameter Optimization (OTPO) (Chaarawi, 2008), the result of a joint project between Cisco Systems and the University of Houston, allows the user to specify
a certain number of Open MPI parameters, desired values or ranges of values to be explored, and the
benchmark to be executed with each run. The result of an OTPO run is a collection of Open MPI parameters and the according optimal values, which lead to the minimal execution time of the benchmark
previously specified.
Internally, OTPO maps Open MPI parameters to ADCL attributes, and creates an ADCL function set.
Each function in the function set executes the same sequence, namely spawning a new process which
starts an MPI job with the according Open MPI parameters. This allows OTPO to take advantage of the
regular ADCL mechanisms to determine the best performing function in the function set.
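Conceptually, each run launched by OTPO amounts to starting the benchmark as a separate MPI job with one particular MCA parameter setting. The sketch below illustrates that idea only; the parameter name, the value list and the benchmark binary are hypothetical placeholders, and OTPO's real parameter handling and result collection are considerably more sophisticated.

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: launch the same benchmark repeatedly as a fresh MPI job, each
     * time with a different value of one Open MPI MCA parameter. The
     * parameter name "some_mca_param" and "./netpipe_benchmark" are
     * illustrative placeholders. */
    int main(void)
    {
        const char *values[] = { "32", "64", "128" };
        int nvalues = (int)(sizeof(values) / sizeof(values[0]));
        char cmd[512];

        for (int i = 0; i < nvalues; i++) {
            snprintf(cmd, sizeof(cmd),
                     "mpirun --mca some_mca_param %s -np 2 "
                     "./netpipe_benchmark > run_%d.log",
                     values[i], i);
            /* spawn the MPI job with this parameter setting; the resulting
             * latency would then be read back from the log and compared */
            if (system(cmd) != 0)
                fprintf(stderr, "run %d failed\n", i);
        }
        return 0;
    }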
In (Chaarawi, 2008) we demonstrated how this tool can be used to tune the InfiniBand parameters
of Open MPI. Two separate sets of tests were executed, one exploiting the shared-receive-queue feature of InfiniBand, the second using a separate receive queue per process. For the sake of clarity, we focused on tuning only four parameters, which led to 825 possible combinations for the first scenario and 275 possible combinations for the second scenario. The test code executed for both scenarios
consisted of the NetPipe (Turner, 2002) benchmark.


The results reveal a small number of parameter sets that resulted in the lowest latency (3.77 μs and 3.78 μs), namely four parameter sets for shared receive queues and six parameter sets for per-peer receive queues. However, there were a significant number of parameter combinations leading to results within 0.05 μs of the best latency. These results highlight that, typically, the optimization process using OTPO will not deliver a single set of parameters leading to the best performance, but will result in groups of parameter sets leading to similar performance.

CONCLUSION
This chapter presented a library capable of adapting the behavior of an application at runtime, which
allows for tuning the performance of a particular code section by switching between different implementations. The library has been used in various scenarios, such as for tuning the neighborhood communication of scientific applications, tuning parallel matrix-matrix multiply operations, or adjusting the
InfiniBand parameters of the Open MPI library. ADCL not only allows for seamless tuning of an application at runtime, but also helps from a software engineering perspective, since it avoids having to maintain different code versions for different platforms. These two features combined make us believe that runtime adaptation techniques such as those used in ADCL are among the most promising approaches for successfully exploiting petascale architectures.

REFERENCES
Alexandrov, A., Ionescu, M. F., Schauser, K. E., & Scheiman, C. (1995). LogGP: Incorporating long messages into the LogP model. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (pp. 95-105). New York: ACM Press.
Benkert, K., Gabriel, E., & Resch, M. M. (2008). Outlier Detection in Performance Data of Parallel Applications. In Proceedings of the 9th IEEE International Workshop on Parallel Distributed Scientific and Engineering Computing (PDESC), Miami, Florida, USA.
Bhowmick, S., Eijkhout, V., Freund, Y., Fuentes, E., & Keyes, D. (in press). Application of Machine Learning in Selecting Sparse Linear Solvers. Submitted for publication to the International Journal of High Performance Computing Applications.
Chaarawi, M., Squyres, J., Gabriel, E., & Feki, S. (2008). A Tool for Optimizing Runtime Parameters of Open MPI. Accepted for publication in EuroPVM/MPI, September 7-10, Dublin, Ireland.
Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, E., Subramonian, R., & von Eicken, T. (1993). LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 1-12). New York: ACM Press.
Dongarra, J. J., & Eijkhout, V. (2003). Self-Adapting Numerical Software for Next-Generation Applications. International Journal of High Performance Computing Applications, 17(2), 125-131. doi:10.1177/1094342003017002002
Evans, J. J., Hood, C. S., & Gropp, W. D. (2003). Exploring the Relationship Between Parallel Application Run-Time Variability and Network Performance. In Proceedings of the Workshop on High-Speed Local Networks (HSLN), IEEE Conference on Local Computer Networks (LCN) (pp. 538-547).
Faraj, A., Yuan, X., & Lowenthal, D. (2006). STAR-MPI: Self tuned adaptive routines for MPI collective operations. In ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing (pp. 199-208). New York: ACM Press.
Faraj, A., Patarasuk, P., & Yuan, X. (2007). A Study of Process Arrival Patterns for MPI Collective Operations. In Proceedings of the International Conference on Supercomputing (pp. 168-179).
Frigo, M., & Johnson, S. (2005). The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2), 216-231. doi:10.1109/JPROC.2004.840301
Gabriel, E., Fagg, G., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., et al. (2004). Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In D. Kranzlmueller, P. Kacsuk, & J. J. Dongarra (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface (LNCS, Vol. 3241, pp. 97-104). Berlin: Springer.
Gabriel, E., & Huang, S. (2007). Runtime optimization of application level communication patterns. In Proceedings of the 2007 International Parallel and Distributed Processing Symposium, 12th International Workshop on High-Level Parallel Programming Models and Supportive Environments (p. 185).
Gill, P. E., Murray, W., & Wright, M. H. (1993). Practical Optimization. London: Academic Press Ltd.
Hoefler, T., Lichei, A., & Rehm, W. (2007). Low-Overhead LogGP Parameter Assessment for Modern Interconnect Networks. In Proceedings of the IPDPS, Long Beach, CA, March 26-30. New York: IEEE.
Huang, S. (2007). Applying Adaptive Software Technologies for Scientific Applications. Master's thesis, Department of Computer Science, University of Houston, Houston, TX.
Jain, R. K. (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. New York: Wiley.
Kielmann, T., Hofman, R. F. H., Bal, H. E., Plaat, A., & Bhoedjang, R. A. F. (1999). MagPIe: MPI's collective communication operations for clustered wide area systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99), 34(8), 131-140.
Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing.
Pjesivac-Grbovic, J., Bosilca, G., Fagg, G. E., Angskun, T., & Dongarra, J. J. (2007). MPI Collective Algorithm Selection and Quadtree Encoding. Parallel Computing, 33(9), 613-623. doi:10.1016/j.parco.2007.06.005
Turner, D., & Chen, X. (2002). Protocol-dependent message-passing performance on Linux clusters. In Proceedings of the 2002 IEEE International Conference on Linux Clusters (pp. 187-194). New York: IEEE Computer Society.
Voss, M. J., & Eigenmann, R. (2000). ADAPT: Automated De-coupled Adaptive Program Transformation. In Proceedings of the International Conference on Parallel Processing, Toronto, Canada (p. 163).
Vuduc, R., Demmel, J., & Bilmes, J. A. (2004). Statistical Models for Empirical Search-Based Performance Tuning. International Journal of High Performance Computing Applications, 18(1), 65-94. doi:10.1177/1094342004041293
Whaley, R. C., & Petitet, A. (2005). Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35(2), 101-121. doi:10.1002/spe.626
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). San Francisco: Morgan Kaufmann.

KEY TERMS AND DEFINITIONS


Adaptive Applications: Applications capable of changing their behavior, switching to alternate code sections, or changing the values of certain parameters at runtime in response to different input data or changing conditions.
Decision Algorithms: Algorithms used to compare different versions of the same functionality, while executing the application, with respect to a particular metric such as execution time.
Dynamic Tuning: Tuning of a code sequence or function during the execution of the real application.
Static Tuning: Tuning of a code sequence or function before executing the real application.


Chapter 26

A Scalable Approach to Real-Time System Timing Analysis


Alan Grigg
Loughborough University, UK
Lin Guan
Loughborough University, UK

ABSTRACT
This chapter describes a real-time system performance analysis approach known as reservation-based
analysis (RBA). The scalability of RBA is derived from an abstract (target-independent) representation
of system software components, their timing and resource requirements and run-time scheduling policies.
The RBA timing analysis framework provides an evolvable modeling solution that can be initiated in
early stages of system design, long before the software and hardware components have been developed,
and continually refined through successive stages of detailed design, implementation and testing. At
each stage of refinement, the abstract model provides a set of best-case and worst-case timing guarantees that will be delivered subject to a set of scheduling obligations being met by the target system
implementation. An abstract scheduling model, known as the rate-based execution model, then provides an implementation reference model; compliance with this model will ensure that the imposed set of timing obligations are met by the target system.

INTRODUCTION
A key requirement in the development of a real-time system is the ability to demonstrate that the final
target system has met its specified timing requirements. This can be carried out by constructing a model
of system timing behaviour that can be used to make predictions about worst-case performance in terms
of maximum response times, communication delays, delay variations and resource utilisation. In order
to develop a modelling solution that is scalable, however, the timing analysis model must be constructed
with this goal specifically in mind.
DOI: 10.4018/978-1-60566-661-7.ch026


Scalability of the timing analysis solution can ultimately be achieved by meeting two key modelling
objectives:

Partitionable analysis: The ability to model the timing behaviour of parts of the system independently from other parts and, correspondingly, limit the scope of re-analysis required in the event
of localised changes;
Evolvable analysis: The ability to assess the timing behaviour of the system incrementally
throughout the development process (and through later modifications), based on the information
available at each stage of development.

Partitionable Analysis
It was recognised by Audsley et al. (1993) that the holistic nature of timing analysis models for distributed
real-time systems (the inability to analyse the system in part due to circular inter-dependencies in the
model) arises due to a combination of functional and physical integration effects. It therefore follows
that the model could be partitioned by placing appropriate restrictions on the manner in which integration of the system software and hardware components is performed as follows:

Functional partitioning: Constraints on data communication and ordering/precedence relationships between related software components;
Physical partitioning: Constraints on resource sharing policies (scheduling and communication
protocols).

Functional partitioning can be implemented by breaking up end-to-end transactions (sequences of processing and communication activities required to meet some higher level system objective) into individual components based purely on knowledge of the end-to-end timing requirements of the system (and
the transaction topology). The net result is that the timing behaviour of groups of software components
allocated to individual processors (and other resources) can potentially be analysed independently.
Functional partitioning, in its extreme, results in a federated timing analysis model where resource
boundaries in the physical architecture dictate the partitions in the timing model. The resulting model can
only be applied/evaluated after the physical integration stage of system development has been performed,
i.e. the allocation of software components to hardware processors (and other resources) has been defined
and the details of the scheduling solution and communication protocols have been finalised.
Physical partitioning can be implemented by applying resource scheduling mechanisms that support temporal partitioning between the components being scheduled. The net result is that the timing
behaviour of functionally-dependent groups of components, i.e. end-to-end transactions, can be analysed
independently. This offers some key advantages over functional partitioning. Firstly, for systems that
embody a degree of safety-critical functionality, this means that there is inherent temporal isolation
between safety-critical and non-safety-critical software components that share processing resources
and/or communication media. Secondly, and more generally from the perspective of an overall system
engineering process, there is potential for supporting independent development and verification of groups
of functionally-related components for a target computing environment that is physically integrated.
Moreover, the analysis can potentially be applied before the physical integration stage of development
has been performed, i.e. much earlier in the development life of the system.


The Reservation-Based Analysis (RBA) method described in this chapter uses a physical partitioning approach throughout but also supports functional partitioning with the guideline that this is to be
used sparingly. Before going on to describe RBA, the second of the two key modelling requirements is
discussed below.

Evolvable Analysis
No particular form of development life-cycle is assumed here, merely the notion that requirements,
design, implementation/integration and testing/verification stages are involved, where these stages may
be performed with a degree of concurrency and iteration. More specifically, it is assumed that the following activities are involved in the process:

Definition of system-level timing requirements;


Decomposition and refinement of these requirements during system design and implementation;
Development/acquisition and integration of software and hardware components;
Verification of the system's timing properties against stated requirements.

Timing analysis models are normally developed in a bottom-up manner, i.e. the model is not finalised
until after the implementation and integration details of the system have been decided. Hence, it is not
possible to assess the timing behaviour of the system until late in the development process. Furthermore,
subsequent changes to the implementation or integration details of the system, particularly changes to
the scheduling or communication protocols involved, are likely to impact on the details of the timing
model. The results of analysis performed at this late stage of development are, of course, essential to
support final verification of the system timing requirements. Any deficiencies discovered at this late
stage, however, can give rise to significant re-work, involving possible re-consideration of the artefacts
produced from the integration, implementation, design or even requirements phases of development,
depending on the severity of the problem. The costs associated with re-work can be a major factor in
the overall development cost of industrial real-time systems.
The amount of re-work associated with the development and verification of the timing properties
of the system could be reduced by making the notion of timing analysis more integral to the systems
engineering process as a whole, allowing it to be applied throughout the development of the system,
starting much earlier in the process. This would allow an ongoing assessment of emerging system timing properties relative to specified timing requirements and also provide progressive guidance on the
selection of future design/implementation details at successive stages of development. Two fundamental issues can be identified regarding the provision of an evolvable timing analysis model through the
system life-cycle:


How to perform timing analysis with a lack of system integration and implementation details during the earlier stages of development;
How to deal with a continually evolving definition of the system, with different parts of the system
evolving at different rates.


Timing Analysis without System Integration and Implementation Details


In the earlier stages of development, the timing-related information for all parts of the system will be
scarce. Whilst the system level timing requirements may be reasonably well understood and decomposed to varying extent during design stages, the ability to perform timing analysis relies fundamentally
on the notion of resources - some media through which the functions of the system can be performed.
Processing resource details are used as a basis for calculating worst-case execution times of software
components. Similarly, details of the communication media (and associated protocols) are used to characterise worst-case communication delays.
In order to perform timing analysis prior to the system implementation stage, it is therefore
necessary to work with an implementation-independent, abstract model of system resources. Such a
model could provide a means to perform timing analysis in the earlier stages of system development
using estimations or assumptions about system integration and implementation details that will not become concrete until a later date. This abstract model must be defined to provide a sufficient (although
pessimistic) basis for performing timing analysis but without over-constraining the final implementation of the system. When implementation details are eventually finalised, the results of timing analysis
performed via the abstract model could then be verified. This gives rise to a two-stage approach to
timing analysis:

Abstract timing analysis: Performed during the definition and decomposition stages of development on the basis of the abstract resource model; the net result is a set of worst-case (and best-case) guarantees regarding the timing behaviour of the system that are subject to a set of obligations being met by the final implementation of the system;
Target-specific timing analysis: Performed during the system implementation and integration
stages of system development, the aim being to demonstrate that the set of obligations imposed
during the abstract timing analysis phase have actually been met.

By appropriate construction of the timing model, the guarantees generated during the abstract timing
analysis phase can be inferred automatically at the target-specific analysis stage, i.e. there should be no
need to reconstruct the results of the system-wide abstract analysis after the target-specific details have
been finalised and verified.
Significantly, the opportunity can be taken here to define a consistent (unified) model for system-wide
analysis, i.e. for end-to-end timing analysis across all system resources. This is a key attribute of the RBA
framework. It is normally the case that different types of system resources (processing and communication media) and resource access policies (scheduling and communication protocols) are modelled in
different ways. The use of consistent abstractions improves the composability and ultimately scalability
of the system-wide timing analysis model.

Timing Analysis for an Evolving System Definition


The development of a timing analysis model for application through the system life-cycle must inherently
address the problem of dealing with an evolving definition of the system. Throughout the development
process, different parts of the system will mature at different rates, leading to an inconsistent level of
detail regarding the timing properties of the system. The timing model must not only be able to represent


the various parts of the system at different levels of detail but also still be evaluated to determine system
timing behaviour based on such information.
This can be achieved by structuring the real-time transaction model as a hierarchy, where successive
levels of the hierarchy capture the results of successive stages of system evolution. At any given stage
of development, the hierarchy needs to capture all known ordering/precedence relationships between
the entities that describe the system at that stage and all known resource requirements (if any). Ideally,
the nodes of the hierarchy should be of a form that is consistent throughout the development of the
system in order to avoid any problems associated with transforming the timing properties of the system
between different representations at different stages of development. The proposal to partition the timing
model in the physical domain, as described above, suggests that the nodes of the hierarchy should be
characterised by indivisible resource usage requirements (and the corresponding delay/response times)
associated with computational and communication elements within the system.
In this way, functionally unrelated parts of the system can potentially be modelled, analysed and
modified independently and therefore allowed to evolve at different rates throughout the development
process. Similarly, in the later stages of system development, the analysis should be applicable to partially
integrated subsets of the system in order that verification of system timing properties can occur in step
with the integration process itself, rather than waiting for the complete system to be finalised before the
timing behaviour of any one part of it can be verified.
In terms of the two-stage approach described above, the timing guarantees generated from the abstract
timing analysis model at successive stages of refinement can be taken to represent timing obligations on
subsequent stages of development. As the model is evolved, timing obligations are refined to generate
increasingly detailed implementation constraints on the final system.

Abstract Timing Analysis Model


The system level, end-to-end timing characteristics are captured, developed and analysed in terms of
transactions. A transaction captures the temporal relationship between:

A set of input events from the external environment of the system (or from other transaction(s));
A set of output events to the external environment of the system (or to some other
transaction(s)).

In the true sense, transaction input and output events correspond to the arrival or dispatch of control
signals and/or information from or to the external interfaces of the system. In order to better support
the engineering of large-scale real-time systems, however, such events can relate to other transactions
within the system as well as the external environment of the system. This allows particular end-to-end
requirements of the system to be modelled as multiple transactions where practical engineering constraints
dictate, such as the need to allocate responsibilities across industrial partnerships and sub-contractors.
This practical concern is supported without compromising the system timing model due to the uniform
nature adopted for expressing transaction structural and timing properties, as explained below.


Real-Time Transaction Topology


For reasons of supporting an evolving definition of the system during development, the structure of the
transaction model is hierarchical. The body of a transaction at any stage in its development is expressed
in the form of an acyclic, directed, nested graph whose leaf nodes capture the concurrent processing
and communication elements of the transaction, termed activities. Non-leaf nodes in the hierarchy are
referred to as nested transactions. The edges (or arcs) of the graph capture the precedence (ordering)
and nesting relationships within the transaction:

Precedence relationships describe the required order of execution of activities and nested
transactions;
Nesting relationships capture strict refinement (specification-implementation) relationships that
arise during the evolution of the transaction model.

Consider the transaction depicted in Figure 1. This illustrates the evolving definition of a transaction from an initial system level description (referred to as the level 0 model) through two successive stages/levels of refinement. At the first stage of refinement, the level 0 nested transaction τ_1 is implemented via three level 1 nested transactions, each of which is implemented as a set of level 2 activities at the second stage of refinement.
In the general case, let τ_{i,..,k} denote any arbitrarily nested transaction or activity within transaction τ. This notation is sufficient to define the topology of the transaction for the purpose of performing timing analysis (once the timing characteristics of the individual activities have been defined - see section 2.2).

Figure 1. Example transaction topology


Further structural information, however, can optionally be provided regarding the nature of precedence and nesting relationships as a means of reducing the pessimism in the analysis. This is achieved via the association of a guard function Q_{i,..,k} with each node τ_{i,..,k}, describing the conditions that trigger its arrival for execution. In many cases, Q_{i,..,k} is implicit given the form of relationship involved. For example, the arrival of τ_{i,..,k} dependent on completion of a single predecessor does not require Q_{i,..,k} to be defined at all, since the constraint is wholly described by the topology of the directed graph. In other cases, however, Q_{i,..,k} is not sufficiently defined from the graph topology alone and can, if required, be expressed explicitly. For example, Q_{1,3} could be defined such that the arrival of τ_{1,3} is triggered only upon completion of the two predecessors, τ_{1,1} and τ_{1,2}, or, alternatively, on completion of either one of those predecessors. The choice of guard functions will have an impact on the timing analysis in terms of the accuracy of the model.
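As an illustration, and using an ad hoc conjunction/disjunction notation (the chapter does not prescribe a concrete syntax for guard functions, so this notation is only an assumption), the two readings of Q_{1,3} mentioned above could be written as:

    Q_{1,3} = complete(τ_{1,1}) ∧ complete(τ_{1,2})   (arrival requires both predecessors)
    Q_{1,3} = complete(τ_{1,1}) ∨ complete(τ_{1,2})   (arrival requires either predecessor)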
To this end, a basic categorisation of precedence and nesting relationships, from which more complex
transaction topologies can be constructed, is given below. Consider first the case of precedence relationships, which can be categorised as follows:

One-to-many: Where the completion of a single activity or nested transaction triggers the arrival
of one or more successors (as illustrated in Figure 1 within the level 0 model);
Many-to-one: Where the completion of one or more concurrent activities or nested transactions
is required to trigger the arrival of a single successor (as illustrated in Figure 1 within the level 1
model).

The simple case of one-to-one precedence, where the completion of a single activity or nested transaction triggers the arrival of a single successor, is a special case of each of these classes and actually
represents the intersection between the two.
An analogous categorisation can be defined for nesting relationships as follows, observing that these
relationships involve successive (referred to as parent and child) levels of the transaction hierarchy:

One-to-many: Where the arrival of a single nested transaction at the parent level triggers the arrival of one or more concurrent activities at the child level (as illustrated in Figure 1 in the transition from the level 1 to the level 2 model at τ_{1,2}) or, traversing the hierarchy in the opposite direction, where the completion of a single activity at the child level triggers the completion of one or more concurrent nested transactions at the parent level;
Many-to-one: Where the arrival of one or more concurrent nested transactions at the parent level triggers the arrival of a single activity at the child level or, traversing the hierarchy in the opposite direction, where the completion of one or more concurrent activities at the child level triggers the completion of a single nested transaction at the parent level (as illustrated in Figure 1 in the reverse transition from the level 2 to the level 1 model at τ_{1,1}).

For ease of future reference, nesting relationships that are directed from parent to child will be referred
to as descending and those that are directed from child to parent will be referred to as ascending.
The purpose of the nesting relationships in the transaction topology is to reflect the ongoing evolution of the system during development and later in service. For convenience, transactions can also be
visualised at any stage of refinement in an equivalent flat form, i.e. with nesting relationships transformed into precedence relationships. This is achieved by performing a depth-first traversal of the nested


transaction graph and successively replacing each nested transaction with its lower level sub-graph; the
sub-graph inheriting all higher level nesting and precedence relationships from the nested transaction
that it replaces. For example, Figure 2 illustrates the example transaction with the nesting relationship
that stems from τ_{1,1} resolved.
Applying this process repeatedly to resolve all nesting relationships for the example transaction gives
the resultant flat topology illustrated in Figure 3.
Note that this flattening of the transaction hierarchy is merely for user visualisation purposes and
does not affect the transaction from the perspective of the timing model.
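The following sketch illustrates the depth-first expansion described above on a minimal, hypothetical data structure. Only the nesting relation is represented; re-attaching the precedence arcs of a replaced nested transaction to its sub-graph, as the text requires, is omitted for brevity.

    #include <stdio.h>

    /* Sketch: depth-first expansion of a nested transaction hierarchy into
     * its leaf activities. The node structure is an illustrative
     * simplification, not a structure defined by the chapter. */
    #define MAX_CHILDREN 8

    struct node {
        const char *name;                  /* e.g. "tau_1,1,1"   */
        int nchildren;                     /* 0 => leaf activity */
        struct node *children[MAX_CHILDREN];
    };

    /* visit leaves in depth-first order, i.e. the activities of the
     * flattened model */
    static void flatten(const struct node *n)
    {
        if (n->nchildren == 0) {
            printf("activity %s\n", n->name);
            return;
        }
        for (int i = 0; i < n->nchildren; i++)
            flatten(n->children[i]);
    }

    int main(void)
    {
        struct node a1 = { "tau_1,1,1", 0, {0} };
        struct node a2 = { "tau_1,1,2", 0, {0} };
        struct node t11 = { "tau_1,1", 2, { &a1, &a2 } };
        struct node t1  = { "tau_1",   1, { &t11 } };
        flatten(&t1);                      /* prints the two leaf activities */
        return 0;
    }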

Figure 2. Example transaction with nesting partially resolved

Figure 3. Example transaction with nesting fully resolved


Real-Time Transaction Properties


A set of timing parameters must be assigned for each transaction in order to perform timing analysis.
A key consideration here is the need for the timing model to be evolvable and, to this end, the timing
parameters that represent the nested transaction and the activity are defined such that these objects are
interchangeable. In other words, any nested transaction can be implemented in terms of activities (and
further nested transactions) and, vice versa, an activity can be replaced by a nested transaction and become the subject of further evolution. In this way, the same timing analysis approach can be applied to
predict the behaviour of a transaction throughout its evolution.
The only distinction between activities and nested transactions is that activities, since these represent
leaf nodes of the transaction graph, must define their resource requirements directly, whereas nested
transactions inherit such characteristics from the activities they embody. Otherwise, the parameters via
which timing behaviour is represented and observed are the same for a single activity, a group of activities, a nested transaction and all the way up the hierarchy to a level 0 transaction.
In the general case, for any arbitrarily nested transaction or activity τ_{i,..,k}, with level 0 parent τ_i, the timing properties are captured as follows:

input jitter, J^{in}_{i,..,k} - the maximum width of the time window that spans the arrival of all input events associated with τ_{i,..,k};
output jitter, J^{out}_{i,..,k} - the maximum width of the time window that spans the delivery of all output events associated with τ_{i,..,k};
minimum I/O separation, d_{i,..,k} - the minimum separation in time (delay) between the input and output event windows of τ_{i,..,k};
minimum inter-arrival time, a_{i,..,k} - the minimum separation in time between the input event windows associated with successive instances of τ_{i,..,k}.

The relationships between these parameters are depicted in Figure 4.


Figure 4. Timing properties of transaction/activity τ_{i,..,k}

In Figure 4, the minimum inter-arrival time of τ_{i,..,k} is shown to be greater than the latest completion time of any τ_{i,..,k} output event, corresponding to the case where any one instance of τ_{i,..,k} will always complete before the next instance arrives. This is merely to keep the illustration simple and is not actually a constraint on the model. For example, a pilot display generation application is likely to require a minimum frame update time that is significantly less than the worst-case latency of the data being
displayed. The model supports this type of behaviour, though there are some other constraints imposed
on model parameters as follows, for reasons as given:

d_{i,..,k} > -J^{in}_{i,..,k}: A necessary condition arising from the basic restriction that, for any given instance of τ_{i,..,k}, the output event window cannot begin before the input event window begins;
a_{i,..,k} > 0: A constraint imposed to ensure that the input event windows associated with successive instances of τ_{i,..,k} are totally ordered.

End-to-End Timing Analysis


End-to-end timing analysis can be performed at any stage of evolution of the transaction model, based
on the information specified at that stage. Clearly, the analysis results will become more accurate as the
definition of the model/system is evolved during development. The starting point for describing this
timing analysis is to express the relationship between the basic timing parameters as specified and the
overall delays accrued, at each level of nesting in the transaction definition. Let r_{i,..,k} and R_{i,..,k} denote the minimum and maximum accrued delays (response times) associated with any nested transaction or activity τ_{i,..,k}. The following relationships are observed:

d_{i,..,k} = r_{i,..,k} - J^{in}_{i,..,k}    (1)

J^{out}_{i,..,k} = J^{in}_{i,..,k} + (R_{i,..,k} - r_{i,..,k})    (2)

These relationships are clarified by the illustration in Figure 5.


The values r_{i,..,k} and R_{i,..,k} must then be specified for each leaf node of the hierarchy, i.e. for each activity - these activity level delays are referred to as localised delays. The end-to-end analysis can then
proceed by recursively descending the nested transaction topology/graph definition, accounting at each
stage for the impact of nesting relationships, precedence relationships and localised delays on the overall
end-to-end delays. The same approach can be taken to determining accrued delay variation, i.e. jitter.
This will be illustrated by example later in the section.
Figure 5. Delay relationships for transaction/activity τ_{i,..,k}

Ultimately, all accrued delay and jitter values relate back to the input event window for the level 0 transaction, τ. That said, the end-to-end timing model is constructed such that the delays and jitter accrued across any nested transaction or activity can be calculated relative to those inherited at its time of arrival, i.e. the relative impact of a given stage in the transaction can be observed. In the transaction
depicted in Figure 1, for example, the activities τ_{1,2,1} and τ_{1,2,2} will each inherit accrued delay and jitter values on their arrival via a nesting relationship from their parent transaction τ_{1,2}. In the general case, let d^{in}_{i,..,k} denote the accrued delay inherited by τ_{i,..,k} upon its arrival and, in turn, let d^{out}_{i,..,k} denote the accrued delay that τ_{i,..,k} exports to its successors upon completion. These values are related as follows:

d^{out}_{i,..,k} = d^{in}_{i,..,k} + r_{i,..,k}    (3)

This relationship is illustrated in Figure 6. Notice in the diagram that the term d^{in}_{i,..,k} actually represents
the separation in time between two input windows, rather than one input and one output window. This is
consistent with the use of d to denote minimum I/O separation since the input event window of activity
τ_{i,..,k} is equivalent to the output window that describes the combined output jitter of its predecessors.
Equation (3) provides a means by which minimum localised delay values calculated at the activity
level can be consolidated into the end-to-end delay calculation. The consolidation of maximum local
delays is taken care of by the output jitter calculation given in Equation (2).
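As a small numerical illustration of Equations (2) and (3), with arbitrarily chosen values that are not taken from the chapter's example:

    d^{in}_{i,..,k} = 10,  J^{in}_{i,..,k} = 4,  r_{i,..,k} = 6,  R_{i,..,k} = 9
    ⇒  d^{out}_{i,..,k} = 10 + 6 = 16   and   J^{out}_{i,..,k} = 4 + (9 - 6) = 7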
The manner in which accrued delays and jitter are inherited (and exported) for any particular τ_{i,..,k}
depends on the form of precedence or nesting relationships involved. In the previous section, a number
of fundamental forms of such relationships were identified. In the following circumstances, accrued
delays and jitter are directly inherited (unchanged) in the direction of the relationship:

one-to-many precedence relationship: When τ_{i,..,k} is one of many successors to τ_{i,..,j} (Figure 7):

d^{in}_{i,..,k} = d^{out}_{i,..,j}    (4)

J^{in}_{i,..,k} = J^{out}_{i,..,j}    (5)

One-to-many nesting relationship (descending): When τ_{i,..,k} is one of many child activities whose arrival is triggered by that of the parent transaction τ_{i,..,j} (Figure 8):

d^{in}_{i,..,k} = d^{in}_{i,..,j}    (6)

J^{in}_{i,..,k} = J^{in}_{i,..,j}    (7)

One-to-many nesting relationship (ascending): When τ_{i,..,k} is one of many parent transactions whose completion is triggered by that of the child activity τ_{i,..,j} (Figure 9):

d^{out}_{i,..,k} = d^{out}_{i,..,j}    (8)

J^{out}_{i,..,k} = J^{out}_{i,..,j}    (9)

Figure 6. Accrued delay relationships for transaction/activity τ_{i,..,k}

Figure 7. Delay inheritance - one-to-many precedence relationship

Figure 8. Delay inheritance - one-to-many nesting relationship (descending)

In circumstances other than those cases illustrated above, however, delay and jitter inheritance is less straightforward, depending ultimately on the form of the guard function Q_{i,..,k} that resides over the arrival of τ_{i,..,k}. Without knowledge of Q_{i,..,k}, the exact values for the inherited delay and jitter parameters cannot be determined, but the smallest safe range of values can be stated as follows:

many-to-one precedence relationship: When τ_{i,..,j} is one of many predecessors to τ_{i,..,k} (Figure 10):

min_j(d^{out}_{i,..,j}) ≤ d^{in}_{i,..,k} ≤ max_j(d^{out}_{i,..,j})    (10)

min_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{in}_{i,..,k} ≤ J^{in}_{i,..,k} ≤ max_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{in}_{i,..,k}    (11)

many-to-one nesting relationship (descending): When τ_{i,..,j} is one of many parent transactions whose arrival is required to trigger that of the child activity τ_{i,..,k} (Figure 11):

min_j(d^{in}_{i,..,j}) ≤ d^{in}_{i,..,k} ≤ max_j(d^{in}_{i,..,j})    (12)

min_j(d^{in}_{i,..,j} + J^{in}_{i,..,j}) - d^{in}_{i,..,k} ≤ J^{in}_{i,..,k} ≤ max_j(d^{in}_{i,..,j} + J^{in}_{i,..,j}) - d^{in}_{i,..,k}    (13)

many-to-one nesting relationship (ascending): When τ_{i,..,j} is one of many child activities whose completion is required to trigger that of the parent transaction τ_{i,..,k} (Figure 12):

min_j(d^{out}_{i,..,j}) ≤ d^{out}_{i,..,k} ≤ max_j(d^{out}_{i,..,j})    (14)

min_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{out}_{i,..,k} ≤ J^{out}_{i,..,k} ≤ max_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{out}_{i,..,k}    (15)
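A brief numerical illustration of the bounds in Equations (10) and (11), for a node with two predecessors and arbitrarily chosen values:

    d^{out}_{i,..,1} = 20,  J^{out}_{i,..,1} = 3;   d^{out}_{i,..,2} = 26,  J^{out}_{i,..,2} = 5

    Equation (10):  20 ≤ d^{in}_{i,..,k} ≤ 26
    Equation (11):  23 - d^{in}_{i,..,k} ≤ J^{in}_{i,..,k} ≤ 31 - d^{in}_{i,..,k}

At one extreme (arrival on completion of any one predecessor) this gives d^{in}_{i,..,k} = 20 and J^{in}_{i,..,k} = 3; at the other extreme (arrival only on completion of all predecessors) it gives d^{in}_{i,..,k} = 26 and J^{in}_{i,..,k} = 5.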

Note again that these bounds are derived without knowledge of Q_{i,..,k} and are safe but pessimistic:

The stated lower bounds correspond to the case where Q_{i,..,k} is defined such that the arrival of τ_{i,..,k} is triggered upon completion of any one τ_{i,..,j};
The stated upper bounds correspond to the case where Q_{i,..,k} is defined such that the arrival of τ_{i,..,k} is triggered upon completion of all τ_{i,..,j}.

Figure 9. Delay inheritance - one-to-many nesting relationship (ascending)

Figure 10. Delay inheritance - many-to-one precedence relationship

In practice, the form of Q_{i,..,k} could be defined and refined in line with the development of the associated transaction, and this information can be used to reduce the pessimism of the accrued delay and jitter bounds compared to those determined by Equations (10) to (15). This can be illustrated by considering the two extreme cases that are used as the basis for deriving those equations, as given below.
In the first case, where Q_{i,..,k} is defined such that the arrival of τ_{i,..,k} is triggered upon completion of any one τ_{i,..,j}, Equations (10) to (15) can be reduced as follows:

many-to-one precedence relationship: When τ_{i,..,j} is one of many predecessors to τ_{i,..,k}:

d^{in}_{i,..,k} = min_j(d^{out}_{i,..,j})    (10a)

J^{in}_{i,..,k} = min_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{in}_{i,..,k}    (11a)

many-to-one nesting relationship (descending): When τ_{i,..,j} is one of many parent transactions whose arrival is required to trigger that of the child activity τ_{i,..,k}:

d^{in}_{i,..,k} = min_j(d^{in}_{i,..,j})    (12a)

J^{in}_{i,..,k} = min_j(d^{in}_{i,..,j} + J^{in}_{i,..,j}) - d^{in}_{i,..,k}    (13a)

many-to-one nesting relationship (ascending): When τ_{i,..,j} is one of many child activities whose completion is required to trigger that of the parent transaction τ_{i,..,k}:

d^{out}_{i,..,k} = min_j(d^{out}_{i,..,j})    (14a)

J^{out}_{i,..,k} = min_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{out}_{i,..,k}    (15a)

Figure 11. Delay inheritance - many-to-one nesting relationship (descending)

Figure 12. Delay inheritance - many-to-one nesting relationship (ascending)

In the second case, where Q_{i,..,k} is defined such that the arrival of τ_{i,..,k} is triggered only upon completion of all τ_{i,..,j}, Equations (10) to (15) can be reduced as follows:

many-to-one precedence relationship: When τ_{i,..,j} is one of many predecessors to τ_{i,..,k}:

d^{in}_{i,..,k} = max_j(d^{out}_{i,..,j})    (10b)

J^{in}_{i,..,k} = max_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{in}_{i,..,k}    (11b)

many-to-one nesting relationship (descending): When τ_{i,..,j} is one of many parent transactions whose arrival is required to trigger that of the child activity τ_{i,..,k}:

d^{in}_{i,..,k} = max_j(d^{in}_{i,..,j})    (12b)

J^{in}_{i,..,k} = max_j(d^{in}_{i,..,j} + J^{in}_{i,..,j}) - d^{in}_{i,..,k}    (13b)

many-to-one nesting relationship (ascending): When τ_{i,..,j} is one of many child activities whose completion is required to trigger that of the parent transaction τ_{i,..,k}:

d^{out}_{i,..,k} = max_j(d^{out}_{i,..,j})    (14b)

J^{out}_{i,..,k} = max_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{out}_{i,..,k}    (15b)

Timing Model Initialisation and Finalisation


Initialisation of the end-to-end timing analysis model involves the assignment of input jitter and minimum I/O separation parameters for all nested transactions and activities that directly service the input events of the transaction. For example, for the transaction depicted in Figure 1, this relates to the nested transaction τ_1. In the general case, the following assignments are made for each level 0 τ_i whose arrival is triggered directly by some transaction input event:

d^{in}_i = -J^{in}    (16)

J^{in}_i = J^{in}    (17)

where J^{in} is the transaction level input jitter, i.e. the maximum variation in arrival time over all transaction input events.
Finalisation of the end-to-end analysis involves the assignment of transaction level minimum I/O separation and output jitter values. Transaction level values are determined by consolidating the values of the same parameters for all nested transactions and activities that directly service the output events of the transaction. For example, for the transaction depicted in Figure 1, this relates to the activities τ_2, τ_3 and τ_4. In the general case, the following expressions are evaluated over all level 0 τ_i that relate directly to transaction output events:

min_i(d^{out}_i) ≤ d ≤ max_i(d^{out}_i)    (18)

min_i(d^{out}_i + J^{out}_i) - d ≤ J^{out} ≤ max_i(d^{out}_i + J^{out}_i) - d    (19)

where d and J^{out} are the transaction level minimum I/O separation and output jitter, respectively.

Example Transaction Definition and Decomposition


In order to evaluate the end-to-end timing model at any stage of refinement, values must be assigned to
the localised (activity level) delay parameters ri,..,k and Ri,..,k for each activity i,..,k. In general terms, this
can be done in one of two ways:


By assigning budgeted values for each ri,..,k and Ri,..,k, e.g. based on knowledge of the transaction
timing requirements;
By assigning actual values for each ri,..,k and Ri,..,k, e.g. based on actual measurement or static
analysis of code.

The latter approach is clearly only applicable when the target hardware and software implementation
are complete (or at least underway). The former approach is what is required during the early stages of
system development and evolution and is used in the example that follows.
To start, consider the assignment of the following end-to-end timing properties to the example transaction - in practice, such details could be extracted from an overall statement of system level end-to-end
timing requirements. (Figure 13 and Figure 14)
The initial level 0 model of the transaction is depicted in Figure 14 - in practice, this information
could be extracted from the top level software architecture design.
From the set of boundary conditions given in Equations (16) to (19), it is straightforward to assign a set of values to the corresponding level 0 timing attributes. Firstly, from the transaction input conditions given in Equations (16) and (17), values are directly inferred for the input jitter and initial accrued delay of 1:

d^in_1 = -5, J^in_1 = 5

Figure 13. Example - transaction timing requirements

Figure 14. Example - Level 0 Model


Figure 15. Example partial assignment of Level 0 attributes

From the transaction output conditions given in Equations (18) and (19), suitable values can be found
for the output jitter and final accrued delay of 2, 3 and 4:
d^out_2 = 40, J^out_2 = 25
d^out_3 = 45, J^out_3 = 17
d^out_4 = 55, J^out_4 = 15
In this example, the parameters have been assigned such that the output event window of the transaction is exactly spanned by the set of activity level output windows. This means that the corresponding
transaction level timing requirements have been met exactly, rather than leaving an element of redundancy
in the transaction level requirements relative to the level 0 model timing properties. Beyond that, and the
satisfaction of the boundary conditions given in Equations (16) to (19), the actual values assigned are
somewhat arbitrary and chosen purely for the purposes of illustration. Figure 15 illustrates this (partial)
assignment of level 0 timing attributes.
The rest of the level 0 timing attributes can be assigned on the basis of the level 0 topology details
and the appropriate means of accounting for precedence relationships as defined in Equations (4) to (9).
In practice, additional application-specific information could be taken into account here. In this example,
values have been assigned as explained below.
Given the one-to-many precedence relationship between 1 and its successor activities 2, 3 and 4,
Equation (5) implies that the parameters J^out_1, J^in_2, J^in_3 and J^in_4 should all be assigned the same value.


Given that jitter tends to increase in the direction of control flow along the transaction (unless specific
jitter control mechanisms are introduced such as by using time-triggered releases), this value should be
less than any of the output jitter values already assigned for activities 2, 3 and 4. For the purposes of
this example, the following assignment has been made:
J^out_1 = J^in_2 = J^in_3 = J^in_4 = 12
From Equation (4), the positions of the time windows whose widths are defined by the above jitter values are all fixed by the size of d^out_1. Hence, the parameters d^out_1, d^in_2, d^in_3 and d^in_4 should all be assigned the same value. Given the topology of the transaction, a reasonable assignment for illustration purposes would be:

d^out_1 = d^in_2 = d^in_3 = d^in_4 = 35
The level 0 timing attributes are now sufficiently defined to fix the position of all input and output windows in the level 0 topology. Figure 16 illustrates this (full) assignment of level 0 timing attributes.
The final stage of level 0 transaction definition is to derive the set of timing obligations that are to be
inherited as constraints on the next stage of model refinement (or implementation if so desired). These
obligations are in the form of a set of minimum and maximum response times and minimum I/O separation values and can be determined (uniquely; see the Endnote) for all level 0 activities from the application of Equations
(3), (2) and (1), respectively:
r1 = 40 R1 = 47 d1 = 35
r2 = 5 R2 = 18 d2 = 7
r3 = 10 R3 = 15 d3 = 2
r4 = 20 R4 = 23 d4 = 8
The level 0 model is now completely defined. To illustrate how the approach supports further refinement of the timing model, a second stage of decomposition is now illustrated. The level 1 model for the
nested transaction 1 is depicted in Figure 17.
From the statement of 1 timing attributes above and the set of Equations (4) to (15), it is straightforward to assign a set of values to the corresponding level 1 timing attributes. Firstly, given the one-to-many nesting relationship (descending) between 1 and its child input activities 1,1 and 1,2, Equations (6) and (7) give:

d^in_{1,1} = -5, J^in_{1,1} = 5
d^in_{1,2} = -5, J^in_{1,2} = 5


Figure 16. Example - full assignment of Level 0 attributes

Figure 17. Example - Level 1 Model for 1

Given the one-to-one nesting relationship (ascending) between 1 and its child output activity 1,3, Equations (8) and (9) give:

d^out_{1,3} = 35, J^out_{1,3} = 12


Figure 18 illustrates this (partial) assignment of level 1 timing attributes for 1.


The rest of the level 1 timing attributes for 1 can be assigned on the basis of the level 1 topology
details and the appropriate means of accounting for precedence relationships as defined in Equations
(4) to (9). Assuming Q1,3 has been specified such that the arrival of 1,3 will be triggered only upon
completion of both predecessors 1,1 and 1,2, Equations (10b) and (11b) can be applied to assign level
1 parameter values as described below.
Given that jitter tends to increase in the direction of control flow along the transaction, J^in_{1,3} is assigned an appropriate intermediate value between the specified input and output jitter values for 1. J^out_{1,1} and J^out_{1,2} are then assigned with knowledge of Q_{1,3} and the corresponding relationship with J^in_{1,3} as expressed in Equation (11b). This expression implies that larger jitter values than J^in_{1,3} can be assigned to some (though not all) predecessors in a many-to-one precedence relationship so long as the output window terminates no later than the required successor input window. On this basis, J^out_{1,1} has been assigned a larger value than J^in_{1,3}, which means that this must be taken into account in the assignment of d^out_{1,1} (see below). This leads to the following assignments:

J^in_{1,3} = J^out_{1,2} = 9, J^out_{1,1} = 11

An appropriate intermediate value can be assigned for d^in_{1,3} to fix the position of the input window for 1,3. The accrued minimum delay requirements can then be specified for 1,1 and 1,2 such that the latest completion time for 1,1 is less than that of 1,2:

d^in_{1,3} = d^out_{1,2} = 20, d^out_{1,1} = 15
Figure 19 illustrates the full assignment of level 1 timing attributes for 1.
Figure 18. Example - partial assignment of Level 1 attributes for 1

Once again, a set of timing obligations can be determined for the purposes of further refinement or direct implementation. These obligations are specified in the form of a set of minimum and maximum response times and minimum I/O separation values for all level 1 activities by application of Equations (3), (2) and (1), respectively:
r1,1 = 20 R1,1 = 26 d1,1 = 15
r1,2 = 25 R1,2 = 29 d1,2 = 20
r1,3 = 15 R1,3 = 18 d1,3 = 6
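As a cross-check of the level 1 obligations just listed, the following sketch derives r, R and d from the assigned windows. Equations (1)-(3) appear earlier in the chapter and are not reproduced here; the particular forms used below are an assumption, adopted only because they reproduce the three sets of values above.

# Assumed forms of Equations (1)-(3): obligations from input/output windows.

def obligations(d_in, j_in, d_out, j_out):
    r = d_out - d_in                      # assumed Eq (3): minimum response time
    R = (d_out + j_out) - (d_in + j_in)   # assumed Eq (2): maximum response time
    d = d_out - (d_in + j_in)             # assumed Eq (1): minimum I/O separation
    return r, R, d

if __name__ == "__main__":
    level1 = {
        "1,1": (-5, 5, 15, 11),
        "1,2": (-5, 5, 20, 9),
        "1,3": (20, 9, 35, 12),
    }
    for name, window in level1.items():
        print(name, obligations(*window))
    # 1,1 -> (20, 26, 15), 1,2 -> (25, 29, 20), 1,3 -> (15, 18, 6)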
A final stage of decomposition for the example transaction gives the set of level 2 timing attributes for
the nested transaction 1 as depicted in Figure 20. In practice, refinement of the transaction and its timing
attributes could continue until the required level of detail is obtained. Clearly, the topological details
generated at each stage of refinement and the number of refinement stages performed are dependent on
the nature of the application and the software design/implementation approach. (Table 1)
The final set of timing obligations for 1 can now be determined via Equations (3), (2) and (1); these are listed in Table 1.
Assuming no further refinement of the transaction prior to its implementation, this final set of timing
obligations represents a set of constraints on the implementation of the system. During the implementation
stages, however, the timing model could be evolved further in the same manner as above as a means of
supporting more progressive implementation (and integration) of the final system. It should be observed
during any such evolution, however, that the stage at which the timing model becomes target-specific or
integration-specific, i.e. appropriate to a particular scheduling or communication regime, is the stage at
which it ceases to support changes to the target system without the need to restate the model.
When the final implementation and integration details of the system are stabilised, the timing model
must be verified, i.e. the timing obligations defined in the abstract timing model must be shown to be
safe. This requires some form of localised timing analysis model, i.e. a model to determine activity
Figure 19. Example - full assignment of Level 1 attributes for 1


Figure 20. Example - final assignment of Level 2 attributes for 1


Table 1. Example transaction (localised timing attributes for 1)

          r_{i,..,k}   d_{i,..,k}   R_{i,..,k}
1,1,1         15           10           18
1,1,2                      -3
1,1,3                      -3           12
1,2,1         10                        20
1,2,2         20           15           22
1,2,3                      -2
1,3,1         13                        15
1,3,2                      -9

level delay and jitter characteristics based on some notion of what constitutes a resource - processor or
communication medium. This is almost where the transition begins towards a target-specific model.
RBA permits the transition to be deferred a little longer, however, by adopting a rate-based execution
model, an abstract model of run-time scheduling behaviour. This abstract scheduling model can then
be implemented using either cyclic or priority-based scheduling. This next stage is described below.

RATE-BASED EXECUTION MODEL


The RBA rate-based execution model is a generalised form of scheduling model that provides independence from the final target implementation and integration details of the system, including the precise
form of the final run-time scheduling solution. This abstract scheduling model can be used to guide the
final target scheduling solution to preserve the performance predictions of the abstract timing model. A
range of compliant scheduler implementation schemes will be described later.
Let {j; j=1,..,n} denote the set of activities allocated to a shared system resource and denote the associated set of timing obligations by {(Cj, vj, Rj); j=1,..,n}, where Cj is the maximum execution time (or
analogous communication bandwidth requirement), vj is the minimum required rate of execution and
Rj is the worst-case response time requirement. The rate-based execution model defines the following
simple linear relationship between these parameters:
v_j = C_j / R_j    (20)

An analogous set of best-case parameters is also defined by {(cj, Vj, rj); j=1,..,n}. The objective of
any compliant implementation scheme is thus to maintain the run-time execution rate of each activity
within the required range [vj, Vj], as illustrated in Figure 21.
To illustrate the application of the rate-based execution model (and subsequent implementation
schemes) by example, Table 2 presents a set of timing attributes for the GAP task set (Locke, 1991).
Each GAP task is modeled as a single RBA activity since there is no benefit in further decomposition in this example. All GAP tasks are periodic with period Tj=Rj, except for 10 which is sporadic with


minimum inter-arrival time a10=200. Since no input jitter is specified for the periodic tasks, it is assumed
that aj=Tj for these tasks. Conversely, assigning a10=T10 for the sporadic task (the value of 200 shown in
brackets in the table) gives a total task set utilisation requirement of 83.5%.
The set of minimum execution rates is derived from Equation (20) but, since all GAP tasks are periodic with period = deadline (Tj = Dj), then {vj = Uj; j = 1,..,16}. The total bandwidth reservation requirement is therefore equal to the total utilisation requirement of the task set, i.e. 83.5%, which would be
schedulable on a single processor by an exact implementation of the rate-based execution model.
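As a quick illustration (not from the chapter), Equation (20) can be applied to the GAP parameters of Table 2 (taking Cj = vj Rj, as in the reconstruction of that table); summing the resulting rates reproduces the 83.5% total requirement quoted above.

# Equation (20) applied to the GAP task set: vj = Cj / Rj, summed over all tasks.

GAP = [  # (Cj, Rj) pairs for tasks 1..16, as in Table 2
    (2, 25), (5, 25), (1, 40), (3, 50), (5, 50), (8, 59), (9, 80), (2, 80),
    (5, 100), (1, 200), (3, 200), (1, 200), (1, 200), (3, 200), (1, 1000), (1, 1000),
]

rates = [c / r for c, r in GAP]
print(round(sum(rates), 4))   # 0.8351 -> the 83.5% total bandwidth requirement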

Basic Schedule Implementation Scheme


A form of cyclic schedule implementation scheme can be used to directly implement the RBA rate-based
execution model. This allows the run-time scheduling solution for a system to be derived directly from
an RBA target-independent timing analysis model for the system without compromising the original
timing requirements. The simplest form of such scheme has the following attributes:

A fixed cycle time min R j ;


j

A fixed time and duration of execution j for each activity j within each cycle;
The restriction Rj aj for each activity j.

Consequently, each activity will execute for exactly δ_j time units in any interval of size Ω, i.e. not necessarily aligned to the minor cycle. The actual order of execution of activities within each cycle is arbitrary. Moreover, the execution time δ_j allocated to an activity within a cycle does not need to be contiguous.
It is necessary to assign an appropriate value for Ω and for each δ_j such that the timing obligations for each activity j are met. The following scheme can be applied to achieve this but note that other valid assignments will normally exist for a given set of timing obligations. An example is developed alongside the description of the scheme by considering the activity 7 from the GAP task set. Firstly,
Figure 21. Valid execution space


Table 2. Example task set

 j   Function                    Cj     Rj    vj=Uj
 1   Radar Track Filter           2     25    0.08
 2   RWR Contact Mgt.             5     25    0.2
 3   Data Bus Poll Device         1     40    0.025
 4   Weapon Aiming                3     50    0.06
 5   Radar Target Update          5     50    0.1
 6   Nav. Update                  8     59    0.1355
 7   Display Graphic              9     80    0.1125
 8   Display Hook Update          2     80    0.025
 9   Target Update                5    100    0.05
10   Weapon Protocol              1    200    0.005
11   Nav. Steering Cmds.          3    200    0.015
12   Display Stores Update        1    200    0.005
13   Display Keyset               1    200    0.005
14   Display Stat. Update         3    200    0.015
15   BET E Status Update          1   1000    0.001
16   Nav. Status                  1   1000    0.001

define the normalised response time value R̄_j ≤ R_j as follows:

R̄_j = ⌊R_j / Ω⌋ Ω    (21)

From Equation (20), define the corresponding normalised execution rate v̄_j as follows:

v̄_j = C_j / R̄_j    (22)

It can be seen from Equation (20) that v̄_j ≥ v_j since R̄_j ≤ R_j. Subsequently, assign δ_j the minimum value that will guarantee activity j to meet its normalised response time requirement R̄_j:

δ_j = ⌈v̄_j Ω⌉    (23)

In the final schedule, each activity j will consequently be executed at a guaranteed minimum rate v_j^δ given as follows:

v_j^δ = δ_j / Ω    (24)


Hence, for the example task, a value of Ω = 25 gives R̄_7 = 75, v̄_7 = 0.12, δ_7 = 3 and v_7^δ = 0.12.
Since each activity will be executed at a rate which is no less than that specified by its minimum rate requirement, the worst-case response time can be guaranteed for any worst-case execution time in the range [0, Cj]. This makes final verification for a specific target implementation very straightforward.
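A minimal sketch (not from the chapter) of Equations (21)-(24) as reconstructed above; run for activity 7 of the GAP set (taking C7 = 9, i.e. U7 R7, and R7 = 80) with Ω = 25 it reproduces the values just quoted.

# Basic cyclic scheme for one activity: Equations (21)-(24).
import math

def basic_cyclic(C, R, omega):
    R_bar = (R // omega) * omega            # Eq (21): normalised response time
    v_bar = C / R_bar                       # Eq (22): normalised execution rate
    delta = math.ceil(v_bar * omega)        # Eq (23): integer allocation per cycle
    v_delta = delta / omega                 # Eq (24): guaranteed run-time rate
    return R_bar, v_bar, delta, v_delta

if __name__ == "__main__":
    print(basic_cyclic(9, 80, 25))          # (75, 0.12, 3, 0.12) for GAP activity 7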
Denoting the target-specific resource requirement of each activity j by C_j^* gives the target-specific feasibility test:

C_j^* ≤ C_j    (25)

This test is independent from the choice of Ω and, for any given activity, the timing and resource requirements of other activities allocated to the shared resource. The test also allows simple re-verification of activity j following any software implementation changes that impact on the value of C_j^*.
A target-independent feasibility test (which could be applied as a resource allocation constraint) for
the set of activities as a whole is as follows:
Σ_{j=1}^{n} v_j^δ ≤ 1    (26)

Since the value of v_j^δ is only dependent upon the timing obligations for activity j, the test can be applied incrementally, i.e. to accept or reject the addition of a new activity to an existing set by comparing its final rate requirement with the remaining capacity available, independent from the actual rate requirements of activities that already exist in the schedule (that are already guaranteed). Hence, denoting the new activity by n+1, the following acceptance test can be applied:
v_{n+1}^δ ≤ 1 - Σ_{j=1}^{n} v_j^δ    (27)
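A small sketch (not from the chapter) of the feasibility and incremental acceptance tests of Equations (26) and (27); the rate values below are hypothetical.

# Feasibility (Eq 26) and incremental acceptance (Eq 27) over final rates v_j^delta.

def feasible(v_deltas):
    return sum(v_deltas) <= 1.0                       # Eq (26)

def accept(v_deltas, v_new):
    return v_new <= 1.0 - sum(v_deltas)               # Eq (27)

if __name__ == "__main__":
    existing = [0.08, 0.20, 0.16, 0.12]               # hypothetical guaranteed rates
    print(feasible(existing))                         # True  (0.56 <= 1)
    print(accept(existing, 0.40))                     # True  (0.40 <= 0.44)
    print(accept(existing, 0.50))                     # False (0.50 >  0.44)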

Neither form of the test can be applied until the value of Ω is fixed, since this determines the value of each v_j^δ. When the value of Ω is fixed, typically at design-time, this could be taken into account in the assignment of timing obligations at the final stage of decomposition of the end-to-end transactions. For example, the value of Ω impacts on the efficiency of the final bandwidth allocation, as discussed below.
For any activity, the inefficiency of the scheme increases as the worst-case response time requirement
decreases relative to the minimum inter-arrival time. This inefficiency is manifest in the final scheduling
solution as an over-allocation of bandwidth compared to that which is actually required, as stipulated
by the true utilisation requirement of the activity. This arises since the minimum inter-arrival time is
not recognised in the construction of the cyclic schedule beyond the assumption that it is greater than
the original response time requirement, i.e. R_j ≤ a_j. Consequently, sufficient capacity is reserved in the
schedule to execute each activity once in any time interval of duration Rj (irrespective of the minimum
inter-arrival time). Any over-allocated bandwidth, however, along with any that is allocated but unused
at run-time due to variation in execution times, can potentially be reclaimed at run-time. Reclamation


of unused bandwidth is discussed later in the paper. Alternatively, the over-allocation of bandwidth can
be exploited to give a larger upper bound for the target-specific worst-case computation time for activity j by restating the target-specific feasibility test, as previously stated in Equation (25), to give:

C_j^* ≤ C_j^δ    (28)

where C_j^δ represents the actual maximum computation time allocated in the cyclic schedule over the time duration R̄_j and is given as follows:

C_j^δ = ( R̄_j / Ω ) δ_j    (29)

Note that the value of C_j^δ will automatically be integer since δ_j is integer and R̄_j is exactly divisible by Ω. This larger upper bound on the target-specific worst-case computation time can then be exploited to give a (specified) margin for error in either:

The actual execution time of activity j at run-time compared to the specified value Cj, such that transient over-run of the activity can be tolerated;
The worst-case computation time of a software component procured from some third party compared to the specified value Cj, such that failure of the supplier to meet the original specification can be tolerated to a limited extent.

The final target-independent response time R_j^δ for activity j, given the original computation time budget Cj and the final bandwidth allocation due to the cyclic scheduling solution, can be stated as follows:

R_j^δ = ⌈C_j / δ_j⌉ Ω    (30)

Hence, the target-dependent response time R_j^* for activity j, given an actual target-specific computation time value C_j^* ≤ C_j^δ, can be stated as follows:

R_j^* = ⌈C_j^* / δ_j⌉ Ω    (31)

Note that the scheme is exact, in the sense that the allocated bandwidth is both necessary and sufficient to meet true worst-case utilisation requirements, only under certain conditions. This is the case
when there are nil effects from rounding in Equations (21) and (23), the sufficient-but-not-necessary
stages of the calculation. In the general case, the degree of inefficiency of the scheme is dependent upon
the actual timing requirements of the activities.
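As a small illustration (not from the chapter) of Equations (29)-(31), the sketch below computes the computation time actually reserved over a response window and the response times implied by a given allocation. The parameters are those of GAP task 15 under the basic scheme as reconstructed in Tables 2-5 (C = 1, R̄ = 1000, δ = 1, Ω = 25), where the reserved budget of 40 units shows the margin that a larger target-specific C* could absorb.

# Allocated budget and response times under the basic cyclic scheme.
import math

def allocated_budget(R_bar, delta, omega):
    return (R_bar // omega) * delta             # Eq (29): budget reserved over R_bar

def response_time(C, delta, omega):
    return math.ceil(C / delta) * omega         # Eq (30) for C, Eq (31) for C*

if __name__ == "__main__":
    omega, R_bar, delta, C = 25, 1000, 1, 1     # GAP task 15 under the basic scheme
    print(allocated_budget(R_bar, delta, omega))   # 40
    print(response_time(C, delta, omega))          # 25
    print(response_time(30, delta, omega))         # 750: still within R_bar for C* = 30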


Example Application of Basic Cyclic Scheme


The basic scheme can be applied to determine an RBA-compliant schedule by first selecting an appropriate value for the cycle time Ω. Then, for each activity j:

Determine the normalised response time R̄_j;
Determine the normalised execution rate v̄_j;
Determine the time δ_j for which the task must be executed in each cycle Ω;
Derive the guaranteed response time R_j^δ;
Derive the minimum run-time execution rate v_j^δ;
Derive the guaranteed computation time C_j^δ.

Observing the original schedule construction constraint Ω ≤ min_j R_j, assign Ω = 25. This leads to the solution given in Table 3, Table 4, and Table 5.
A number of observations can be made from these results. From Table 3, the sum of the initial execution rate parameters (vj) corresponds exactly to the total utilisation requirement of the task set (83.51%).
This arises since the worst-case response time of every task is equal to its minimum inter-arrival time.
After defining a cycle time of Ω = 25, the sum of the rate parameters (v̄_j) corresponds to a total bandwidth allocation of 88.37%, a noticeable but reasonable increase compared to the true requirement. At the final stage of calculation, however, the need to provide integer values for the final rate parameters (v_j^δ) gives rise to a significant over-allocation of bandwidth due to the combination of rounding effects for
the overall task set. The final bandwidth allocation is 120% and, hence, the extent of the over-allocation
is sufficient to make the task set no longer schedulable on a single processor (by this scheme). The
cyclic schedule has been constructed, however, to allow individual activities to be removed (or have
their timing attributes changed) without affecting other activities in the schedule. Hence, it is straightforward to reduce the task set to one that is schedulable on a single processor by simply removing one
or more activities (to be reallocated to another processor) until the final bandwidth allocation is less than
100%. The ability to manipulate the schedule in this manner is a considerable benefit in the context of
engineering larger-scale real-time systems.
A counter effect of bandwidth over-allocation is an equivalent reduction in worst-case response times (R_j^δ) compared to the stated requirements (Rj), as can be seen in Table 4. For example, task 15 has a final bandwidth allocation of 4% (equivalent to its execution rate of 0.04) compared to its stated requirement
of 0.1%. The corresponding reduction in its worst-case response time is apparent in the final value of
25 compared to an original requirement of 1000.
The over-allocation of bandwidth is due to the restriction that every task is executed (for a duration δ_j) in every cycle Ω, as reflected in the final computation times (C_j^δ) given in Table 5. This restriction
leads to a simpler (and more readily modifiable) scheduling solution but can be lifted to allow a more
flexible scheme to be defined in favour of reducing the bandwidth over-allocation. Such a scheme is
described and illustrated in the next section.
Note that the basic scheme does not compromise the true timing requirements of the task set: there
is no imposition of false iteration rates for the purposes of constructing a schedule (a criticism often
levelled at cyclic scheduling solutions). Furthermore, the schedule is incrementally modifiable such that


schedulability can be maintained following activities being added, removed or modified by merely ensuring that the final bandwidth allocation is less than 100% (and that the choice of Ω is still suitable).

Bandwidth Server-Based Implementation Scheme


As suggested above, it is possible to reduce the bandwidth over-allocation associated with the basic
cyclic implementation scheme by relaxing the constraint that every activity must be offered the chance
to execute in every cycle. This gives rise to the cyclic bandwidth server scheme.
The starting point is once again the selection of a cycle time Ω subject to the same constraint. Then define the server activity S(δ_S, N_S) as a notional activity that is allocated δ_S execution time units in every cycle but does not actually consume that allocation itself. Instead, the server offers the resource to other activities so that these can execute with an effective cycle time of N_S Ω. The total bandwidth of the server can then be used to execute a number of activities that individually have relatively low bandwidth requirements that would otherwise be allocated a disproportionate amount of bandwidth by the basic scheme. Assuming that the server executes its allocated activities in a fixed order, an activity j allocated to an S(δ_S, N_S) server will execute for a duration δ_j spread over an interval of N_S Ω.
The cyclic server exploits the fact that the basic scheme, and its analysis, does not require the execution time allocated to an activity within a scheduling cycle to be contiguous. The analysis associated with the cyclic server is, therefore, exactly analogous to that for the basic cyclic scheme but with Ω replaced by N_S Ω. Hence, the derivation of δ_j for an activity j executed via a cyclic server S(δ_S, N_S) is

Table 3. Example - cyclic schedule implementation (execution rate parameters)

 j     Uj       vj       v̄_j     v_j^δ
 1   0.0800   0.0800   0.0800   0.0800
 2   0.2000   0.2000   0.2000   0.2000
 3   0.0250   0.0250   0.0400   0.0400
 4   0.0600   0.0600   0.0600   0.0800
 5   0.1000   0.1000   0.1000   0.1200
 6   0.1356   0.1356   0.1600   0.1600
 7   0.1125   0.1125   0.1200   0.1200
 8   0.0250   0.0250   0.0267   0.0400
 9   0.0500   0.0500   0.0500   0.0800
10   0.0050   0.0050   0.0050   0.0400
11   0.0150   0.0150   0.0150   0.0400
12   0.0050   0.0050   0.0050   0.0400
13   0.0050   0.0050   0.0050   0.0400
14   0.0150   0.0150   0.0150   0.0400
15   0.0010   0.0010   0.0010   0.0400
16   0.0010   0.0010   0.0010   0.0400
     0.8351   0.8351   0.8837   1.2000


Table 4. Example - cyclic schedule implementation (response time parameters)

 j     Rj     R̄_j    R_j^δ
 1     25     25      25
 2     25     25      25
 3     40     25      25
 4     50     50      50
 5     50     50      50
 6     59     50      50
 7     80     75      75
 8     80     75      50
 9    100    100      75
10    200    200      25
11    200    200      75
12    200    200      25
13    200    200      25
14    200    200      75
15   1000   1000      25
16   1000   1000      25

Table 5. Example - cyclic schedule implementation (computation time parameters)

 j    Cj   C_j^δ
 1     2     2
 2     5     5
 3     1     1
 4     3     4
 5     5     6
 6     8     8
 7     9     9
 8     2     3
 9     5     8
10     1     8
11     3     8
12     1     8
13     1     8
14     3     8
15     1    40
16     1    40

given by Equations (21) to (23) with Ω replaced by N_S Ω. Similarly, Equations (24), (29) and (30) can be applied with Ω replaced by N_S Ω to determine the final rate, computation time and response time values, respectively. For this reason, the cyclic server method is actually a generalisation of the basic cyclic scheme described previously, where multiple cycle times are supported.
For the general case of activity execution via a cyclic server S(δ_S, N_S), the expression for determining the normalised response time for activity j is adapted as follows:

R̄_j = ⌊R_j / (N_S Ω)⌋ N_S Ω    (32)

The corresponding normalised execution rate v̄_j is then found as before by Equation (22). The minimum execution time δ_j per interval N_S Ω that will guarantee activity j to meet its normalised response time requirement R̄_j (and therefore its true requirement Rj) is given by:

δ_j = ⌈v̄_j N_S Ω⌉    (33)

Allocating δ_j execution time units per interval N_S Ω in the final schedule means that each activity j will be executed at a guaranteed minimum rate v_j^δ as follows:

v_j^δ = δ_j / (N_S Ω)    (34)

The allocated computation time C_j^δ over a time interval of duration R̄_j is as follows:

C_j^δ = ( R̄_j / (N_S Ω) ) δ_j    (35)
The final target-independent response time R_j^δ for activity j, given the original computation time budget Cj and the final bandwidth allocation due to the cyclic scheduling solution, is given as follows:

R_j^δ = ⌈C_j / δ_j⌉ N_S Ω    (36)
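The following sketch (not from the chapter) applies the server adaptation, Equations (32)-(36) as reconstructed above, to a single activity executed via a server with multiplier N_S. With C = 1, R = 1000, Ω = 25 and N_S = 8 it reproduces the S(1,8) figures for activity 15 quoted in the example that follows; the C^δ value of 5 is simply Equation (35) applied and is not quoted in the text.

# Cyclic server adaptation: the basic-scheme calculations with Omega replaced by Ns*Omega.
import math

def server_alloc(C, R, omega, Ns):
    period = Ns * omega
    R_bar = (R // period) * period              # Eq (32)
    v_bar = C / R_bar                           # Eq (22)
    delta = math.ceil(v_bar * period)           # Eq (33)
    v_delta = delta / period                    # Eq (34)
    C_delta = (R_bar // period) * delta         # Eq (35)
    R_delta = math.ceil(C / delta) * period     # Eq (36)
    return R_bar, v_bar, delta, v_delta, C_delta, R_delta

if __name__ == "__main__":
    print(server_alloc(1, 1000, 25, 8))         # (1000, 0.001, 1, 0.005, 5, 200)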

To illustrate by example, consider activity 15 from the GAP case study (which was shown above to suffer a factor of 40 bandwidth over-allocation when the basic scheme is applied). For example, a cyclic server S(1,8) allocated to serve activity 15 leads to the following results from successive application of Equations (32), (22), (33) and (34): R̄_15 = 1000, v̄_15 = 0.001, δ_15 = 1 and v_15^δ = 0.005. This represents a factor of 5 over-allocation, a significant improvement compared to the basic scheme but still quite poor, although the remaining server capacity could be used to service further activities. The use of a server S(1,40) dedicated to activity 15 would be required to give an exact allocation for the single activity alone.
The utilisation-based feasibility test given in Equation (26) is no longer exact but merely a necessary


condition when one or more activities are executed via servers. A sufficient test can be produced
by replacing the combined execution rates of the activities executed by servers by the total capacities
of the corresponding servers, where the total capacity v_S of a server S(δ_S, N_S) is given by adaptation of Equation (24):

v_S = δ_S / Ω    (37)

A simple test for feasible allocation of server bandwidth is given as follows:

Σ_{k=1}^{m} v_k^δ ≤ v_S    (38)

where v_k^δ denotes the final execution rate of each of the m activities k allocated to the server. For a server S(δ_S, N_S) with period N_S Ω and set of allocated activities {k; k = 1,..,n_S}, the value of δ_S can be derived from the set of activity execution times {δ_k in N_S Ω; k = 1,..,n_S}:

δ_S = ⌈( Σ_{k=1}^{n_S} δ_k ) / N_S⌉    (39)

Observing that the total time required to execute non-server-based activities in the basic cyclic schedule is equivalent to that of a cyclic server S(δ_S, 1), referred to as the base level server, the final bandwidth requirement is given as follows:

Y = (1/Ω) Σ_{i=1}^{n_S} δ_i^S    (40)

given the set of servers {S_i(δ_i^S, N_i^S); i = 1,..,n_S} that includes the base level server. This expression is sufficient-but-not-necessary since, depending on the actual server periods and utilisation figures, the bandwidth requirements of a lower rate server could, in practice, be absorbed within the spare capacity of a higher rate server. In such cases, the bandwidth requirements of a lower rate server can be effectively eliminated from the total bandwidth calculation (as illustrated by example later in the paper).
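As a small illustration (not from the chapter) of Equations (37)-(39), the sketch below sizes a server from the per-interval allocations of the activities it serves and checks the capacity test of Equation (38). The single-activity example matches the S(1,8) case above (Ω = 25).

# Server sizing (Eq 39), server capacity (Eq 37) and the capacity test (Eq 38).
import math

def size_server(deltas, Ns, omega):
    delta_S = math.ceil(sum(deltas) / Ns)       # Eq (39): allocation per cycle Omega
    v_S = delta_S / omega                       # Eq (37): server capacity
    return delta_S, v_S

def rates_fit(v_deltas, v_S):
    return sum(v_deltas) <= v_S                 # Eq (38)

if __name__ == "__main__":
    delta_S, v_S = size_server([1], Ns=8, omega=25)
    print(delta_S, v_S)                         # 1 0.04
    print(rates_fit([0.005], v_S))              # True: the served rate fits the capacity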

Example Application of Server-Based Scheme


This example illustrates the use of the cyclic server method to improve bandwidth allocation compared to the basic cyclic implementation scheme. Assuming the same basic cycle time Ω = 25, define a server S(2,40) to execute the low utilisation activities {10, …, 16}. Table 6 shows the improved results under this scheme (the values of other parameters not shown in the table are the same as before under the basic cyclic scheme).


The total capacity of the server S(2,40) is given by Equation (37) as v_S = 0.08. Hence, 92% of the total processor capacity is available for non-server-based activities {1, …, 9} and 8% for server-based activities {10, …, 16}. So, whilst the total bandwidth allocation is more efficient than for the basic scheme (97.5% compared to 120%), this is not sufficient to guarantee feasibility on a single processor: it is also necessary to show separately that activities {1, …, 9} can be executed within their 92% allocation and that activities {10, …, 16} can be executed within their 8% allocation. From Table 6, the combined allocation for activities {1, …, 9} turns out to be exactly 92% and the combined allocation for activities {10, …, 16} is 5.5%. Hence, the complete set of activities is schedulable on a single processor under this scheme.
The improved efficiency of this scheme is also reflected in the increased number of activities that have been allocated the exact bandwidth to meet their requirements: 10 out of the 16 activities now, compared to only 4 previously.

Introducing Priorities to Improve Resource Bandwidth Allocation and System Responsiveness

It is now shown that the RBA rate-based execution model and cyclic implementation scheme can co-exist alongside a static priority-based scheduling regime to provide a flexible three-tier run-time execution model as follows:

High priority activities that execute according to a static priority-based regime;
RBA-compliant activities that execute according to the cyclic RBA implementation scheme, subject to interference from the set of high priority activities;
Low priority activities that execute according to a static priority-based regime, subject to interference from the set of high priority activities and the set of RBA-compliant activities.

The motivation for this combined scheme is two-fold. Firstly, the high priority band can be used
to schedule activities with short response requirements compared to their minimum inter-arrival time
without incurring bandwidth over-allocation. Secondly, the low priority band can be used to execute
activities in the bandwidth that is over-allocated by the RBA cyclic/server scheme, plus any remaining
capacity of the resource, thus reclaiming such bandwidth.

Introducing High Priority Activities for Improved Responsiveness


Activities executed in the high priority band will execute according to a static priority-based scheduling regime in accordance with their relative priorities and always in preference to activities in the RBA
band. These activities can be verified by static priority-based response time analysis given in (Audsley,
1993). The rate-based execution model and cyclic implementation schemes must be extended, however,
to cater for interference effects due to the execution of high priority activities.
The rate-based execution model can be adapted to recognise interference effects using an analogous
approach to that for response time analysis for static priority-based scheduling. The solution is simply
to add a worst-case interference time to the actual worst-case response time or, analogously, to subtract
the interference delay from the required worst-case response time (deadline). The required minimum
execution rate for an activity j subjected to worst-case interference Ij is thus stated as follows:


Table 6. Example - cyclic server implementation (improved allocation for {10, …, 16})

 j    v_j^δ    R_j^δ   C_j^δ
 1   0.0800     25
 2   0.2000     25
 3   0.0400     25
 4   0.0800     50
 5   0.1200     50
 6   0.1600     50
 7   0.1200     75
 8   0.0400     50
 9   0.0800     75
10   0.0050    200
11   0.0150    200
12   0.0050    200
13   0.0050    200
14   0.0150    200
15   0.0050    200      40
16   0.0050    200      40
     0.9750

v_j = C_j / ( R_j - I_j )    (41)

For an activity j that shares a resource with a set of high priority activities {k; k = 1,..,n_H}, I_j equates to the interference term stated in the response time expression for static priority-based scheduling. Hence, given a set of timing attributes for the set of high priority activities, I_j can be determined as follows:

I_j = Σ_{k=1}^{n_H} ⌈( R_j + J_k^in ) / T_k⌉ C_k    (42)
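A minimal sketch (not from the chapter) of Equations (42) and (41): the worst-case interference suffered from a set of high priority activities and the resulting minimum execution rate. The high priority parameters below are hypothetical.

# Interference from the high priority band (Eq 42) and adjusted rate (Eq 41).
import math

def interference(R, high_priority):
    """high_priority: list of (C, T, J_in) for the interfering activities."""
    return sum(math.ceil((R + j_in) / T) * C for C, T, j_in in high_priority)   # Eq (42)

def min_rate(C, R, high_priority):
    return C / (R - interference(R, high_priority))                             # Eq (41)

if __name__ == "__main__":
    hp = [(1, 50, 0), (2, 100, 5)]              # hypothetical high priority activities
    print(interference(200, hp))                # 4*1 + 3*2 = 10
    print(round(min_rate(5, 200, hp), 4))       # 5 / 190 = 0.0263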

Introducing high priority activities and interference into the cyclic implementation scheme has two effects (the scheme is otherwise unchanged). Firstly, the initial assignment of Ω is now subject to the constraint Ω ≤ min_j ( R_j - I_j ). Secondly, for the general case of activity execution via a cyclic server S(δ_S, N_S), the expression for determining the normalised response time for activity j, Equation (32), is adapted as follows:

R̄_j = ⌊( R_j - I_j ) / (N_S Ω)⌋ N_S Ω    (43)


Introducing Low Priority Activities for Improved Bandwidth Allocation


The problem of bandwidth over-allocation has been highlighted in the series of examples given earlier.
This problem occurs in the target-independent RBA rate-based execution model due to the calculation
of activity execution rates based on response time requirements rather than minimum inter-arrival times.
The problem is then compounded in the cyclic implementation scheme due to the need for a common
cycle time and integer execution times within this cycle time (or some multiple of the cycle time when
cyclic servers are used). This motivates the consideration of bandwidth reclamation via the execution of
activities outside the RBA scheme according to a priority-based regime. This new set of priority-based
activities is referred to as low priority since none of these can pre-empt any RBA activity nor any high
priority activity.
The RBA cyclic scheme itself does not actually require modification. The low priority activities can
be guaranteed (or rejected) by adapting the response time analysis for static priority-based scheduling
as shown below. Given a set of high priority activities {k; k = 1,..,nH}, a set of RBA activities {j; j =
1,..,n} and a set of low priority activities {i; i = 1,..,nL}, the following response time can be stated for
a given low priority activity l:
R_l = C_l + Σ_{k=1}^{n_H} ⌈( R_l + J_k^in ) / T_k⌉ C_k + Σ_{j=1}^{n} ⌈( R_l + J_j^in ) / T_j⌉ C_j + Σ_{i=1}^{l-1} ⌈( R_l + J_i^in ) / T_i⌉ C_i    (44)

Note that the difference between this expression and the response time analysis for static priority-based scheduling given in (Audsley, 1993) is merely notational: the interference term is decomposed into three bands to reflect the composite nature of the scheme.
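Equation (44) defines the response time implicitly, since R_l appears on both sides; one way to evaluate it (a sketch, not from the chapter) is the usual fixed-point iteration used for static priority response time analysis. All parameters below are hypothetical, and the three interference bands are simply concatenated into one list of (C, T, J_in) triples.

# Smallest fixed point of Eq (44): R = C + sum over interferers of ceil((R+J)/T)*C_k.
import math

def response_time(C, interferers, deadline):
    R = C
    while True:
        nxt = C + sum(math.ceil((R + j) / T) * Ck for Ck, T, j in interferers)
        if nxt == R:
            return R                    # converged to the smallest fixed point
        if nxt > deadline:
            return None                 # would miss its deadline: infeasible
        R = nxt

if __name__ == "__main__":
    interferers = [(2, 25, 0), (3, 50, 0), (1, 100, 0)]   # hypothetical bands combined
    print(response_time(4, interferers, deadline=150))    # 10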

Example of Reclaiming Over-Allocated Bandwidth


Consider the GAP task set extended by the introduction of a set of low priority activities subject to deadline monotonic priority assignment (Table 7). Equation (44) then gives the response times shown in Table 8.
All low priority activities are thus feasible since all response times are less than their corresponding deadline. The total bandwidth requirement for the set of low priority activities is 12.8%. This has
effectively been reclaimed from the over-allocated bandwidth for the set of RBA activities, whose true
requirement is 83.5% but final allocation is 97.5% (or, including spare server capacity, exactly 100%).

Table 7. Example - low priority timing attributes

 i    Ci     Ti     Di     Ui
 1     3    200    150   0.015
 2     8    200    180   0.04
 3    25    500    400   0.05
 4     4    500    450   0.008
 5    15   1000    800   0.015


Table 8. Example - low priority response times

 i     Ri    Feasible?
 1    140    Yes
 2    148    Yes
 3    384    Yes
 4    388    Yes
 5    789    Yes

RELATED WORK
A number of scheduling schemes that support bandwidth-based (or, analogously, rate-based) expression
of timing and resource requirements have previously been proposed for multimedia applications. These
schemes offer a degree of abstraction from the target platform in the way that requirements are specified
but are invariably aimed at dynamic applications and generally require the use of dynamic earliestdeadline-first (EDF) scheduling at run-time. Examples of such schemes include generalised processor
sharing (GPS) (Parekh, 1994), virtual clock (Yau, 1996), constant utilisation server (Deng, 1999) and
weighted fair queuing (Demers, 1989). Due to the reliance on EDF, however, the final bandwidth allocation (or execution rate) granted to each task is dependent on the actual degree of competition for
resources at run-time as the total demand on a resource increases, the bandwidth reserved for a given
task will decrease in absolute terms. Such solutions are more accurately referred to as proportional share
methods than bandwidth reservation methods and are not suitable for dependable applications that require
a priori performance guarantees. See (Grigg, 2002) for a comprehensive survey of related work.

SUMMARY
RBA provides a target-independent timing analysis framework for application during the definition and
decomposition stages of real-time system development, based on an abstract representation of target
system processing and communication resources. Application of the abstract model provides a set of
best-case and worst-case timing guarantees that will be delivered subject to a set of scheduling obligations being met by the target system implementation. An abstract scheduling model, known as the
rate-based execution model, then provides an implementation reference model with which compliance
will ensure that the imposed set of timing obligations will be met by the target system.
The end-to-end timing properties of the system are captured, decomposed and analysed in terms of
real-time transactions. The transaction model is hierarchical, in the form of an acyclic, directed, nested
graph, capturing an evolving system definition during development. The leaf nodes of the graph capture the concurrent processing and communication elements within the transaction, termed activities;
non-leaf nodes are referred to as nested transactions. The edges of the graph capture the precedence and
nesting relationships within the transaction. The parameters via which timing behaviour is represented
and observed are the same for a single activity, a group of related activities, a nested transaction and a
system level transaction, thus providing a highly composable and scalable model of real-time system
performance.


End-to-end delays and jitter are determined by a depth-first traversal of each transaction graph, accounting for activity level delays, precedence relationships and nesting relationships. In the earlier stages
of system development, activity level delays can be specified directly in the form of budgets. Later in
development, these delays can be determined via some form of localised timing analysis model. When
the target platform implementation details are finally fixed, these delays can be verified.
A number of further developments of the RBA framework and implementation schemes are being
investigated. This includes extending the cyclic server implementation scheme to support nested or
hierarchical bandwidth servers as a means of further reducing the extent of bandwidth over-allocation.
Other work is beginning to investigate RBA-compliant support for scheduling communication network
resources, initially focusing on ATM networks for future avionics applications.
Work is also underway to develop RBA process and tool support for technology transfer into the sponsoring customer's organization. Tool support is being implemented as an extension to the customer's software design environment rather than as a separate standalone tool.

REFERENCES
Audsley, N. C., Burns, A., Richardson, M. F., Tindell, K., & Wellings, A. (1993). Applying New Scheduling Theory to Static Priority Pre-emptive Scheduling. Software Engineering Journal, 8(5).
Demers, A., Keshav, S., & Shenker, S. (1989). Analysis and Simulation of a Fair Queuing Algorithm.
Proceedings of ACM SIGCOMM.
Deng, Z., Liu, J.W.S., Zhang, L., Mouna, S., & Frei, A. (1999). An Open Environment for Real-Time
Applications. Real-Time Systems Journal, 16(2/3).
Grigg, A. (2002). Reservation-Based Timing Analysis: A Partitioned Timing Analysis Model for Distributed Real-Time Systems (YCST-2002-10). York, UK: University of York, Dept. of Computer Science.
Locke, C. D., Vogel, D. R., & Mesler, T. J. (1991). Building A Predictable Avionics Platform in Ada. In
Proceedings of IEEE Real-Time Systems Symposium.
Parekh, A.K. & Gallager, R.G. (1994). A Generalised Processor Sharing Approach to Flow Control in
Integrated Services Networks. IEEE Transactions on Networking 2(2).
Yau, D. K. Y., & Lam, S. S. (1996). Adaptive Rate-Controlled Scheduling for Multimedia Applications.
In Proceedings of ACM Multimedia Conference.

ENDNOTE


The assignment can easily be shown to be unique by inspection of Equations (3), (2) and (1).


Chapter 27

Scalable Algorithms for Server Allocation in Infostations
Alan A. Bertossi
University of Bologna, Italy
M. Cristina Pinotti
University of Perugia, Italy
Romeo Rizzi
University of Udine, Italy
Phalguni Gupta
Indian Institute of Technology Kanpur, India

ABSTRACT
The server allocation problem arises in isolated infostations, where mobile users going through the coverage area require immediate high-bit rate communications such as web surfing, file transferring, voice
messaging, email and fax. Given a set of service requests, each characterized by a temporal interval
and a category, an integer k, and an integer hc for each category c, the problem consists in assigning
a server to each request in such a way that at most k mutually simultaneous requests are assigned to
the same server at the same time, out of which at most hc are of category c, and the minimum number
of servers is used. Since this problem is computationally intractable, a scalable 2-approximation online algorithm is exhibited. Generalizations of the problem are considered, which contain bin-packing,
multiprocessor scheduling, and interval graph coloring as special cases, and admit scalable on-line
algorithms providing constant approximations.

INTRODUCTION
An infostation is an isolated pocket area with small coverage (about a hundred meters) of high bandwidth connectivity (at least a megabit per second) that collects information requests of mobile users
DOI: 10.4018/978-1-60566-661-7.ch027



Table 1. Examples of actual time intervals to serve different kinds of requests

Category                            Size (kbps)   Time (s) low rate   Time (s) high rate
FTP download                           10000            100                  10
Video stream                            5000             50                   5
Audio stream, E-mail attachment          512                                  0.5
E-mail, Web browsing                      64              0.6                 0.06

and delivers data while users are going through the coverage area. The available bandwidth usually
depends on the distance between the mobile user and the center of the coverage area: increasing with
decreasing distance. An infostation represents a way in the current generation of mobile communication
technology for supporting at many-time many-where high-speed and high-quality services of various
categories, like web surfing, file transferring, video messaging, emails and fax. It has been introduced
to reduce the cost per bit on wireless communications, and hence to encourage the exchange of ever
increasing volumes of information. Infostations are located along roadways, at airports, in campuses,
and they provide access ports to Internet and/or access to services managed locally (Goodman, Borras,
Mandayam, & Yates, 1997; Wu, Chu, Wine, Evans, & Frenkiel, 1999; Zander, 2000; Jayram, Kimbrel,
Krauthgamer, Schieber, & Sviridenko, 2001).
It is desirable that the infostation be resource scalable, that is able to easily expand and contract its
resource pool to accomodate a heavier or lighter load in terms of number and kind of users, and/or category
of services. Indeed, the mobile user connection lasts for a temporal interval, which starts when the user
first senses the infostations presence and finishes when it leaves the coverage area. Depending on the
mobility options, three kinds of users are characterized: drive-through, walk-through, and sit-through.
According to the mobility options, the response time must be immediate for drive-through, slightly
delayed for walk-through, and delayed for sit-through. In general, several communication paradigms
are possible: communications can be either broadcast or dedicated to a single user, data can be locally
provided or retrieved from a remote gateway, and the bit-rate transmission can be fixed or variable,
depending on the infostation model and on the mobility kind of the user.
Each mobile user going through the infostation may require a data service out of a finite set of possible
service categories available. The admission control, i.e., the task of deciding whether or not a certain
request will be admitted, is essential. In fact, a user going through an infostation to obtain a (toll) service
is not disposed to have its request delayed or refused. Hence, the service dropping probability must be
kept as low as possible. For this purpose, many admission control and bandwidth allocation schemes for
infostations maintain a pool of servers so that when a request arrives it is immediately and irrevocably
assigned to a server thus clearing the service dropping probability. Precisely, once a request is admitted,
the infostation assigns a temporal interval and a proper bandwidth for serving the request, depending on
the service category, on the size of the data required and on the mobility kind of the user, as shown in
Table 1 for a sample of requests with their actual parameters. Moreover, the infostation decides whether
the request may be served locally or through a remote gateway. In both cases, a server is allocated on
demand to the request during the assigned temporal interval. The request is immediately assigned to its
server without knowing the future, namely with no knowledge of the next request. Requests are thus
served on-line, that is in an ongoing manner as they become available.
Each server, selected out of the predefined server pool, may serve more than one request simultane-


ously but it is subject to some architecture constraints. For example, no more than k requests could be
served simultaneously by a local server supporting k infrared channels or by a gateway server connected
to k infostations. Similarly, no more than h services of the same category can be delivered simultaneously
due to access constraints on the original data, such as software licenses, limited on-line subscriptions
and private access.
This chapter considers the infostation equipped with a large pool of servers, and concentrates on the
server allocation problem where one has to determine how many servers must be reserved to on-line
satisfy the requests of drive-through users, so that the temporal, architectural and data constraints are
not violated. In particular, it is assumed that the isolated infostation controls in a centralized way all the
decisions regarding the server allocation. Moreover, the pool of servers of the infostation is localized
in the center of the coverage area, and therefore the distance from a mobile user and any server in the
pool is the same. In other words, all the servers are equivalent to serve a mobile user, independent of
the user proximity.
In details, a service request r will be modeled by a service category cr and a temporal interval Ir =
[sr, er) with starting time sr and ending time er. Two requests are simultaneous if their temporal intervals
overlap. The input of the problem consists of a set R of service requests, a bound k on the number of
mutually simultaneous requests to be served by the same server at the same time, and a set C of service
categories with each category c characterized by a bound hc. The output is a mapping from the requests
in R to the servers that uses the minimum possible number of servers to assign all the requests in R
subject to the constraints that the same server receives at most k mutually simultaneous requests at the
same time (k-constraint), out of which at most hc are of category c (h-constraint). In this chapter, we
refer to this problem as the Server Allocation with Bounded Simultaneous Requests (Bertossi, Pinotti,
Rizzi, & Gupta, 2004).
It is worth noting that, equating servers with bins and requests with items, the above problem is
similar to a generalization of Bin-Packing, known as Dynamic Bin-Packing (Coffman, Galambos, Martello,
& Vigo, 1999), where in addition to size constraints on the bins, the items are characterized by an arrival
and a departure time, and repacking of already packed items is allowed each time a new item arrives. The
problem considered in this chapter, in contrast, does not allow repacking and has capacity constraints also
on the bin size for each category. Furthermore, equating servers with processors and requests with tasks,
the above problem becomes a generalization of deterministic multiprocessor scheduling with task release
times and deadlines (Lawler & Lenstra, 1993) where in addition each processor can execute more than
one task at the same time, according to the k-constraints and h-constraints. Moreover, equating servers
with colors and requests with intervals, our problem is a generalization of the classical interval graph
coloring (Golumbic, 1980), but with the additional k-constraints and h-constraints. Another generalization
of interval graph coloring has been introduced for modelling a problem involving an optical line system
(Winkler & Zhang, 2003), which reduces to ours where only the k-constraint is considered. Finally, a
weighted generalization of interval coloring has been introduced (Adamy & Erlebach, 2004) where there
is only the k-constraint, namely, where each interval has a weight in [0,1] and the sum of the weights of
the overlapping intervals which are colored the same cannot exceed 1. Further generalizations of such
a weighted version were also considered (Bertossi, Pinotti, Rizzi, & Gupta, 2004).
This chapter surveys the complexity results as well as the main scalable on-line algorithms for the
Server Allocation with Bounded Simultaneous Requests problem, which are published in the literature
(Adamy & Erlebach, 2004; Winkler & Zhang, 2003; Bertossi, Pinotti, Rizzi, & Gupta, 2004). Briefly,
the rest of this chapter is structured as follows. The first section shows that the Server Allocation with


Bounded Simultaneous Requests problem is computationally intractable and therefore a solution using
the minimum number of servers cannot be found in polynomial time. The second section deals with
ρ-approximation algorithms, that is polynomial time algorithms that provide solutions which are guaranteed to never be greater than ρ times the optimal solutions. In particular, a 2-approximation on-line algorithm is exhibited, which asymptotically gives a (2 - h/k)-approximation, where h is the minimum among all the hc's. Finally, a generalization of the problem is considered in the third section, where each request
r is also characterized by an integer bandwidth rate wr, and the bounds on the number of simultaneous
requests to be served by the same server are replaced by bounds on the sum of the bandwidth rates of
the simultaneous requests assigned to the same server. For this problem, on-line scalable algorithms are
illustrated which give a constant approximation.

COMPUTATIONAL INTRACTABILITY
The Server Allocation with Bounded Simultaneous Requests problem on a set R = {r1,...,rn} of requests
can be formulated as a coloring problem on the corresponding set I = {I1,…,In} of temporal intervals.
Indeed, equating servers with colors, the original server allocation problem is equivalent to the following coloring problem:
Problem 1 (Interval Coloring with Bounded Overlapping). Given a set I of intervals each belonging
to a category, an integer k, and an integer hc ≤ k for each category c, assign a color to each interval in
such a way that at most k mutually overlapping intervals receive the same color (k-constraint), at most
hc mutually overlapping intervals all having category c receive the same color (h-constraint), and the
minimum number of colors is used.
To prove that Problem 1 is computationally intractable, the following simplified decisional formulation of Problem 1 was considered, where |C| = 4, k = 2, and hc = 1 for each category c.
Problem 2 (Interval Coloring with Bounded Overlapping and Four Categories). Given a set I of
intervals each belonging to one of four categories, and an integer b, decide whether b colors are enough
to assign a color to each interval in such a way that at most two mutually overlapping intervals receive
the same color and no two overlapping intervals with the same category receive the same color.
In (Bertossi, Pinotti, Rizzi, & Gupta, 2004), Problem 2 was proved to be NP-complete by exhibiting a polynomial time reduction from the 3-Satisfiability (3SAT) problem, a well-known NP-complete
problem (Garey & Johnson, 1979):
Problem 3 (3SAT). Given a boolean formula B in conjunctive normal form, i.e. as a product of
clauses, over a set U of boolean variables, such that each clause is the sum of exactly 3 literals, i.e. direct
or negated variables, decide whether there exists a truth assignment for U which satisfies B.
Theorem 1. Interval Coloring with Bounded Overlapping and Four Categories is NP-complete.
By the above result, Problem 2, and hence the Server Allocation with Bounded Simultaneous Requests problem, is computationally intractable. Therefore, one is forced to abandon the search for fast algorithms that find optimal solutions. Instead, one can devise fast algorithms that provide sub-optimal solutions which are fairly close to optimal. This strategy is followed in the next section, where a scalable polynomial-time approximation algorithm is exhibited for providing sub-optimal solutions that will
never differ from the optimal solution by more than a specified percentage.
Moreover, further negative results have been proved (Bertossi, Pinotti, Rizzi, & Gupta, 2004). Assume that the intervals in I arrive one by one, and are indexed by non-decreasing starting times. When an interval Ii arrives, it is immediately and irrevocably colored, and the next interval Ii+1 becomes known
only after Ii has been colored. If multiple intervals arrive at the same time, then they are colored in any
order. An algorithm that works in such an ongoing manner is said to be on-line (Karp, 1992). On-line algorithms are opposed to off-line algorithms, where the intervals are not colored as they become available,
but they are all colored only after the entire sequence I of intervals is known. While Theorem 1 shows
that Problem 1 is computationally intractable even if there are only four categories, k = 2, and hc = 1 for
each category, the following result shows also that there is no optimal on-line algorithm even when the
number of categories becomes three.
Theorem 2. There is no optimal on-line algorithm for the Interval Coloring with Bounded Overlapping problem even if there are only 3 categories, k = 2, and h1 = h2 = h3 = 1.

ALGORITHM FOR INTERVAL COLORING WITH BOUNDED OVERLAPPING


Since there are no fast algorithms that find optimal solutions for Problem 1, on-line algorithms providing sub-optimal solutions are considered. An α-approximation algorithm for a minimization problem is a polynomial-time algorithm producing a solution of value appr(x) on input x such that, for all the inputs x, appr(x) ≤ α · opt(x), where opt(x) is the value of the optimal solution on x. In other words, the approximate solution is guaranteed to never be greater than α times the optimal solution (Garey &
Johnson, 1979). For the sake of simplicity, from now on, appr(x) and opt(x) will be simply denoted by
appr and opt, respectively.
A simple polynomial-time on-line algorithm for the Interval Coloring with Bounded Overlapping
problem can be designed based on the following greedy strategy:
Algorithm Greedy(Ii)
Color Ii with any already used color which does not violate the k-constraints and the h-constraints.
If no color can be reused, then use a brand new color.
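To make the greedy rule concrete, the following minimal sketch implements it under the stated on-line model: intervals are processed by non-decreasing starting time, and counters of currently active intervals are kept per color, both globally and per category. The class and method names are illustrative and are not taken from the cited papers.

import java.util.*;

class GreedyColoring {
    record Interval(double start, double end, int category) {}

    private final int k;
    private final int[] h;                                       // h[c] = per-category bound
    private final List<Integer> total = new ArrayList<>();       // per color: active interval count
    private final List<int[]> perCategory = new ArrayList<>();   // per color: active count per category
    private final PriorityQueue<double[]> ending =                // (end time, color, category)
            new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));

    GreedyColoring(int k, int[] h) { this.k = k; this.h = h; }

    int color(Interval iv) {
        // Retire intervals that end before (or exactly at) the new start: they no longer overlap.
        while (!ending.isEmpty() && ending.peek()[0] <= iv.start()) {
            double[] e = ending.poll();
            int c = (int) e[1], cat = (int) e[2];
            total.set(c, total.get(c) - 1);
            perCategory.get(c)[cat]--;
        }
        // Reuse any already used color that violates neither the k- nor the h-constraint.
        for (int c = 0; c < total.size(); c++) {
            if (total.get(c) < k && perCategory.get(c)[iv.category()] < h[iv.category()]) {
                return assign(iv, c);
            }
        }
        // Otherwise introduce a brand new color.
        total.add(0);
        perCategory.add(new int[h.length]);
        return assign(iv, total.size() - 1);
    }

    private int assign(Interval iv, int c) {
        total.set(c, total.get(c) + 1);
        perCategory.get(c)[iv.category()]++;
        ending.add(new double[]{iv.end(), c, iv.category()});
        return c;
    }
}

Checking the constraints only at the start instant of the arriving interval is sufficient because every set of mutually overlapping intervals contains the starting point of its latest-starting member.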
Theorem 3. Algorithm Greedy provides a 2-approximation for the Interval Coloring with Bounded
Overlapping problem.
Proof. Let appr = φ be the solution given by the algorithm and assume that the colors 1, ..., φ have been introduced in this order. Let Ir = [sr, er) be the first interval colored φ. Let Γ1 be the set of intervals in I containing sr and let Γ2 be the set of intervals in I containing sr whose category is cr. Clearly, Γ2 is contained in Γ1. Let ω1 and ω2 be the cardinalities of Γ1 and Γ2, respectively. Clearly, opt ≥ ω1/k and opt ≥ ω2/hcr. Color φ was introduced to color Ir because, for every color γ, 1 ≤ γ ≤ φ - 1, at least one of the following two conditions held:

1. Exactly k intervals in Γ1 have color γ;
2. Exactly hcr intervals in Γ2 have color γ.

For i = 1 and 2, let ni be the number of colors in {1, ..., φ - 1} for which Condition i holds (if for a color both conditions hold, then choose one of them arbitrarily). Hence, n1 + n2 = φ - 1 or, equivalently, appr = φ = n1 + n2 + 1. Clearly, ω1 ≥ k n1 + hcr n2 + 1 and ω2 ≥ hcr n2 + 1. Therefore:

opt ≥ max{ω1/k, ω2/hcr} ≥ max{(k n1 + hcr n2 + 1)/k, (hcr n2 + 1)/hcr} ≥ max{n1 + (h/k) n2, n2 + 1}

where h = min{h1, ..., h|C|}.

If n2 + 1 ≥ n1 + (h/k) n2, then:

appr/opt ≤ (n1 + n2 + 1)/(n2 + 1) ≤ (n2 (1 - h/k) + 1 + n2 + 1)/(n2 + 1) = 2 - (h/k) · n2/(n2 + 1) ≤ 2.

If n2 + 1 ≤ n1 + (h/k) n2, then:

appr/opt ≤ (n1 + n2 + 1)/(n1 + (h/k) n2) ≤ (n1 + n1 + (h/k) n2)/(n1 + (h/k) n2) = 1 + n1/(n1 + (h/k) n2) ≤ 2.

Therefore, Algorithm Greedy gives a 2-approximation. QED


Actually, a stronger result has been proved (Bertossi, Pinotti, Rizzi, & Gupta, 2004):
Theorem 4. Algorithm Greedy asymptotically provides a (2 - h/k)-approximation for the Interval Coloring with Bounded Overlapping problem, where h = min{h1, ..., h|C|}.
Moreover, such an asymptotic bound is the best possible, even in the very special case that h = 1, k
= 2, and no interval contains another interval:
Theorem 5. Algorithm Greedy admits no α-approximation with α < 2 - 1/k for the Interval Coloring with Bounded Overlapping problem, even if min{h1, ..., h|C|} = 1, k = 2, and no interval is properly contained within another interval.
Finally, the result below shows that the Greedy algorithm is optimal in some special cases.
Theorem 6. Algorithm Greedy is optimal for the Interval Coloring with Bounded Overlapping problem when either Σc∈C hc ≤ k, or hc = k for all c ∈ C.

Proof. Let φ be the solution given by the Greedy algorithm and assume without loss of generality that φ ≥ 2, since otherwise the solution is trivially optimal. As in the proof of Theorem 3, let Ir = [sr, er) be the first interval colored φ, let Γ1 be the set of intervals in I containing sr, let Γ2 be the set of intervals in I containing sr with category cr, and let ω1 = |Γ1| and ω2 = |Γ2|. Recall that φ was introduced to color Ir because, for every color γ, 1 ≤ γ ≤ φ - 1, at least one of the following two conditions held:

1. Exactly k intervals in Γ1 have color γ;
2. Exactly hcr intervals in Γ2 have color γ.

When Σc∈C hc ≤ k, it is easy to see that if Condition 1 is true for any color γ then Condition 2 is also true. Indeed, by hypothesis, the only way to exhaust a color γ is to have exactly hcr intervals of category cr all colored γ. Therefore, ω2 ≥ (φ - 1) hcr + 1 and opt ≥ ⌈ω2/hcr⌉ ≥ φ.

When hc = k for all c ∈ C, it is easy to see that a color γ cannot be reused only if Condition 1 is true. Thus, ω1 ≥ (φ - 1) k + 1 and opt ≥ ⌈ω1/k⌉ ≥ φ.

In conclusion, in both cases the Greedy algorithm provides the optimal solution. QED
Note that, in Theorem 6, when hc = k for all c ∈ C, the h-constraint is redundant, since it is dominated by the k-constraint. When hc = k = 1 for all c ∈ C, the Greedy algorithm reduces to the well-known optimal algorithm for coloring interval graphs (Golumbic, 1980). Moreover, when hc = k for all c ∈ C and k > 1, Problem 1 is the same as the Generalized Interval Graph Coloring problem (Winkler & Zhang, 2003). As regards the time complexity of the Greedy algorithm, the following result holds:
Theorem 7. Algorithm Greedy requires O(1) time to color each interval Ir.
Proof. The algorithm employs |C| palettes P1, ..., P|C|, one for each category. The generic palette Pc is implemented as a doubly linked list and stores all the colors that can be assigned to a new interval of category c. For each color γ, a record Rγ with |C| + 1 counters and |C| pointers is maintained. For each category c, the corresponding counter Rγ.countc stores how many intervals of category c can still be colored γ (such a counter is initialized to hc). Moreover, there is an additional counter Rγ.kcount (initialized to k) storing how many intervals of any category can still be colored γ. Finally, for each category c, there is a pointer to the position of color γ in Pc.

The algorithm uses a global counter, initialized to 0, to keep track of the overall number of colors used. When a brand new color is needed, the global counter is incremented. Let γ be the new value of the global counter. Then, a new record Rγ is initialized, color γ is inserted in all the palettes, and the pointers of Rγ to the palettes are updated. This requires O(|C|) time.

When a new interval Ii starts, say of category ci, it is colored in O(1) time by any color γ available in palette Pci. Then, the counters Rγ.countci and Rγ.kcount are decremented. If Rγ.countci becomes 0, then color γ is deleted from Pci. Whereas, if Rγ.kcount becomes 0, then color γ is deleted from all the palettes. In the worst case, O(|C|) time is needed.

When interval Ii ends, the counters Rγ.countci and Rγ.kcount are incremented, where γ is the color of Ii. If Rγ.kcount becomes 1, then color γ is inserted in all the palettes Pc for which Rγ.countc is greater than 0. Instead, if Rγ.kcount is larger than 1, then color γ is inserted in Pci if Rγ.countci becomes 1. Again, in the worst case, O(|C|) time is needed.


Since |C| is a constant, O(1) time is required to color each single interval Ii. QED
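One possible realization of this bookkeeping is sketched below, with each palette kept as a LinkedHashSet instead of a hand-rolled doubly linked list with back pointers. The identifiers (countc, kcount, colorStart, intervalEnd) are illustrative rather than taken from the original paper, but the update rules follow the proof above.

import java.util.*;

class PaletteGreedy {
    private final int k;
    private final int[] h;                                        // h[c] for each category c
    private final List<LinkedHashSet<Integer>> palettes;          // palettes.get(c) = colors usable for category c
    private final List<int[]> countc = new ArrayList<>();         // per color: remaining capacity per category
    private final List<Integer> kcount = new ArrayList<>();       // per color: remaining global capacity

    PaletteGreedy(int k, int[] h) {
        this.k = k;
        this.h = h;
        this.palettes = new ArrayList<>();
        for (int c = 0; c < h.length; c++) palettes.add(new LinkedHashSet<>());
    }

    // Called when an interval of the given category starts; returns its color.
    int colorStart(int category) {
        if (palettes.get(category).isEmpty()) {                   // no reusable color: open a new one
            int gamma = kcount.size();
            kcount.add(k);
            countc.add(h.clone());
            for (int c = 0; c < h.length; c++) palettes.get(c).add(gamma);
        }
        int gamma = palettes.get(category).iterator().next();
        kcount.set(gamma, kcount.get(gamma) - 1);
        countc.get(gamma)[category]--;
        if (countc.get(gamma)[category] == 0) palettes.get(category).remove(gamma);
        if (kcount.get(gamma) == 0)
            for (int c = 0; c < h.length; c++) palettes.get(c).remove(gamma);
        return gamma;
    }

    // Called when an interval of the given category and color gamma ends.
    void intervalEnd(int gamma, int category) {
        kcount.set(gamma, kcount.get(gamma) + 1);
        countc.get(gamma)[category]++;
        if (kcount.get(gamma) == 1) {                             // gamma had been exhausted globally
            for (int c = 0; c < h.length; c++)
                if (countc.get(gamma)[c] > 0) palettes.get(c).add(gamma);
        } else if (countc.get(gamma)[category] == 1) {            // gamma had been exhausted for this category
            palettes.get(category).add(gamma);
        }
    }
}

Since every operation touches at most |C| palettes and |C| is a constant, the per-interval cost stays constant, matching Theorem 7.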

ALGORITHM FOR WEIGHTED INTERVAL COLORING


Consider now a generalization of the Server Allocation with Bounded Simultaneous Requests problem, where each request r is also characterized by an integer bandwidth rate wr, and the bounds on the
number of simultaneous requests to be served by the same server are replaced by bounds on the sum of
the bandwidth rates of the simultaneous requests assigned to the same server. Such a problem can be
formulated as a weighted generalization of Problem 1 as follows.
Problem 4 (Weighted Interval Coloring with Bounded Overlapping). Given a set I of intervals, with
each interval Ir characterized by a category cr and an integer weight wr, an integer k, and an integer hc ≤ k for each category c, assign a color to each interval in such a way that the sum of the weights for mutually overlapping intervals receiving the same color is at most k (k-constraint), the sum of the weights
for mutually overlapping intervals of category c receiving the same color is at most hc (h-constraint),
and the minimum number of colors is used.
More formally, denote by

I[t] the set of intervals which are active at instant t, that is, I[t] = {Ir ∈ I: sr ≤ t < er};
I[c] the set of intervals belonging to the same category c, that is, I[c] = {Ir ∈ I: cr = c};
I(γ) the set of intervals colored γ;
I(γ)[t] = I(γ) ∩ I[t], namely, the set of intervals colored γ and active at instant t; and
I(γ)[t][c] = I(γ)[t] ∩ I[c], namely, the set of intervals of category c, colored γ, and active at instant t.

Then, the k-constraints and h-constraints can be stated as follows:

Σ_{Ir ∈ I(γ)[t]} wr ≤ k for all γ and t (k-constraints),

Σ_{Ir ∈ I(γ)[t][c]} wr ≤ hc for all γ, t, and c (h-constraints).

Note that Problem 1 is a particular case of Problem 4, where wr = 1 for each interval Ir. When considering only the k-constraints and normalizing each weight wr in [0,1], Problem 4 is a generalization
of that introduced in (Adamy & Erlebach, 2004) where a 195-approximate solution is provided under
a particular on-line notion, namely, when the intervals are not given by their arrival time, but by some
externally specified order. An approximation on-line algorithm for Problem 4, which contains Bin-Packing
as a special case (Coffman, Galambos, Martello, & Vigo, 1999), is presented below.
Algorithm First-Color(Ii)
Color interval Ii with the smallest already used color which does not violate the k-constraints and the
h-constraints. If no color can be reused, then use a brand new color.
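Under the same on-line model as before, the only change with respect to the unweighted Greedy sketch given earlier is that per-color counters of active intervals become per-color sums of active weights, and the smallest feasible color index is chosen. A minimal sketch of the feasibility test, with illustrative names, is:

final class FirstColorCheck {
    // Color gamma can absorb an arriving interval of the given category and
    // integer weight wr if the running totals of active weights stay within
    // k (globally) and h[category] (for that category). Checking only at the
    // arrival instant suffices when intervals arrive by non-decreasing start time.
    static boolean fits(int gamma, int category, int wr,
                        int[] activeWeight,        // per color: sum of active weights
                        int[][] activeCatWeight,   // per color and category
                        int k, int[] h) {
        return activeWeight[gamma] + wr <= k
            && activeCatWeight[gamma][category] + wr <= h[category];
    }
}

First-Color then scans the colors in increasing index order and reuses the first one for which the test succeeds, opening a new color only when all tests fail.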
The following result has been proved in (Bertossi, Pinotti, Rizzi, & Gupta, 2004).

Theorem 8. Algorithm First-Color asymptotically provides a constant approximation for the Weighted
Interval Coloring with Bounded Overlapping problem.
The worst approximation constant proved by Theorem 8 is 5k/h when k/h > 8/5, and 8 otherwise (by the way, an 8-approximation could be achieved even in the case that k/h > 8/5, but by a different, off-line algorithm). It is worth noting that in the case there are no h-constraints on the total weight of mutually overlapping intervals of the same category, the First-Color algorithm yields a 4-approximation. As regards the time complexity of algorithm First-Color, an implementation similar to that described in Theorem 7 can be used, where the palettes are maintained as heaps. Then, it is easy to see that a single interval can be colored in O(log φ) time, where φ is the total number of colors used.

FURTHER GENERALIZATIONS
Consider now two further generalizations of the Server Allocation with Bounded Simultaneous Requests
problem, where each request r is characterized by real bandwidths, normalized in [0,1] for analogy with
the Bin-Packing problem (Coffman, Galambos, Martello, & Vigo, 1999).
In the first generalization, which contains Multi-Dimensional Bin-Packing as a special case, each
request r is characterized by a k-dimensional bandwidth rate wr = (wr(1), ..., wr(k)), where the c-th component specifies the bandwidth needed for the c-th category and k is the number of categories, i.e. k =
|C|. The overall sum of the bandwidth rates of the simultaneous requests of the same category assigned
to the same server at the same time is bounded by 1, which implies that the total sum of the bandwidth
rates over all the categories is bounded by k. Such a generalized problem can be formulated as the following variant of the interval coloring problem.
Problem 5 (Multi-Dimensional Weighted Interval Coloring with Unit Overlapping). Given a set I of
intervals, with each interval Ir characterized by a k-dimensional weight wr = (wr(1), ..., wr(k)), where wr(c) ∈ [0,1] for 1 ≤ c ≤ k, assign a color to each interval in such a way that the overall sum of the weights
of the same category for mutually overlapping intervals receiving the same color is bounded by 1 and
the minimum number of colors is used.
More formally, according to the notations introduced in the previous section, the constraints of
Problem 5 can be stated as follows:

Σ_{Ir ∈ I(γ)[t][c]} wr(c) ≤ 1 for all γ, t, and c.

Note that the above constraints are in fact h-constraints and, when added up over all the categories in
C, imply the following redundant k-constraint
Σ_{c=1}^{k} Σ_{Ir ∈ I(γ)[t][c]} wr(c) ≤ k for all γ and t,

which is analogous to the k-constraint of Problem 4. Problem 5 can also be solved on-line by the First-Color algorithm introduced in the previous section.
Theorem 9. Algorithm First-Color provides a 4k-approximation for the Multi-Dimensional Weighted
Interval Coloring with Unit Overlapping problem.
It is worth mentioning that the above problem, when considered as an off-line problem, is APX-hard
since it contains Multi-Dimensional Bin-Packing as a special case, which has been shown to be APX-hard (Woeginger, 1997) already for k = 2. Therefore, there is no polynomial time approximation scheme (PTAS) that solves the problem within every fixed constant ε (that is, one different polynomial time approximation algorithm for each constant ε) unless P = NP.
In the second generalization, instead, each request r is characterized by a gender bandwidth rate gr,cr associated to the category cr and by a bandwidth rate wr. The overall sum of the bandwidth rates of the
simultaneous requests assigned to the same server at the same time is bounded by 1, as well as the overall
sum of the gender bandwidth rates of the simultaneous requests of the same category assigned to the
same server at the same time, which is also bounded by 1. This generalized problem can be formulated
as the following variant of the interval coloring problem.
Problem 6 (Double Weighted Interval Coloring with Unit Overlapping). Given a set I of intervals,
with each interval Ir characterized by a gender bandwidth gr,cr ∈ (0,1] associated to the category cr and by a bandwidth weight wr ∈ (0,1], assign a color to each interval in such a way that the overall sum of
the gender weights for mutually overlapping intervals of the same category receiving the same color is
bounded by 1 (h-constraint), the overall sum of the bandwidth weights for mutually overlapping intervals
receiving the same color is bounded by 1 (k-constraint), and the minimum number of colors is used.
Formally, the constraints of Problem 6 are given below:

Σ_{Ir ∈ I(γ)[t]} wr ≤ 1 for all γ and t,

Σ_{Ir ∈ I(γ)[t][c]} gr,c ≤ 1 for all γ, t, and c.

Note that Problem 6 is a generalization of Bin-Packing, and hence it is NP-hard. However, Problem
6 can again be solved on-line by the First-Color algorithm introduced in the previous section.
Theorem 10. Algorithm First-Color provides a constant approximation and, asymptotically, an 11-approximation for the Double Weighted Interval Coloring with Unit Overlapping problem.

CONCLUSION
This chapter has considered several scalable on-line approximation algorithms for problems arising in
isolated infostations, where user requests characterized by categories and temporal intervals have to be
assigned to servers in such a way that a bounded number of simultaneous requests are assigned to the
same server and the number of servers is minimized. However, several questions still remain open. For
instance, one could lower the approximation bounds derived for the problems reviewed in this chapter.
Moreover, it is still an open question to determine whether the NP-hardness result reported in this chapter
still holds when k = 2, there are only 3 categories, and h1 = h2 = h3 = 1. Finally, one could consider the

scenario in which the number of servers is given in input, each request has a deadline, and the goal is
to minimize the overall completion time for all the requests.

REFERENCES
Adamy, U., & Erlebach, T. (2004). Online coloring of intervals with bandwidth (LNCS Vol. 2909, pp. 1-12). Berlin: Springer.
Bertossi, A. A., Pinotti, M. C., Rizzi, R., & Gupta, P. (2004). Allocating servers in infostations for bounded
simultaneous requests. Journal of Parallel and Distributed Computing, 64, 1113-1126. doi:10.1016/
S0743-7315(03)00118-7
Coffman, E. G., Galambos, G., Martello, S., & Vigo, D. (1999). Bin-packing approximation algorithms:
Combinatorial analysis. In D. Z. Du & P. M. Pardalos (Eds.), Handbook of Combinatorial Optimization (pp. 151-207). Dordrecht, The Netherlands: Kluwer.
Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability. San Francisco: Freeman.
Golumbic, M. C. (1980). Algorithmic Graph Theory and Perfect Graphs. New York: Academic Press.
Goodman, D. J., Borras, J., Mandayam, N. B., & Yates, R. D. (1997). INFOSTATIONS: A new system
model for data and messaging services. Proceedings of the 47th IEEE Vehicular Technology Conference
(VTC), Phoenix, AZ, (Vol. 2, pp. 969-973).
Jayram, T. S., Kimbrel, T., Krauthgamer, R., Schieber, B., & Sviridenko, M. (2001). Online server
allocation in server farm via benefit task systems. Proceedings of the ACM Symposium on Theory of
Computing (STOC'01), Crete, Greece, (pp. 540-549).
Karp, R. M. (1992). Online algorithms versus offline algorithms: How much is it worth to know the
future? In J. van Leeuwen, (Ed.), Proceedings of the 12th IFIP World Computer Congress. Volume 1:
Algorithms, Software, Architecture, (pp. 416-429). Amsterdam: Elsevier.
Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., & Shmoys, D. B. (1993). Sequencing and Scheduling:
Algorithms and Complexity. Amsterdam: North-Holland.
Winkler, P., & Zhang, L. (2003). Wavelength assignment and generalized interval graph coloring. In
Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA'03), Baltimore, MD, (pp. 830-831).
Woeginger, G. J. (1997). There is no asymptotic PTAS for two-dimensional vector packing. Information
Processing Letters, 64, 293-297. doi:10.1016/S0020-0190(97)00179-8
Wu, G., Chu, C. W., Wine, K., Evans, J., & Frenkiel, R. (1999). WINMAC: A novel transmission protocol for infostations. Proceedings of the 49th IEEE Vehicular Technology Conference (VTC), Houston,
TX, (Vol. 2, pp. 1340-1344).

Zander, J. (2000). Trends and challenges in resource management for future wireless networks. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), Chicago, (Vol. 1, pp. 159-163).

KEY TERMS
α-Approximation Algorithm: An algorithm producing a solution which is guaranteed to be no worse than α times the best solution.
Bin-Packing: A combinatorial problem in which objects of different volumes must be packed into a
finite number of bins of given capacity in a way that minimizes the number of bins used.
Infostation: An isolated pocket area with small coverage of high bandwidth connectivity that delivers data on demand to mobile users.
Interval Graph Coloring: A combinatorial problem in which colors have to be assigned to intervals
in such a way that two overlapping intervals are colored differently and the minimum number of colors
is used. Such a problem corresponds to coloring the vertices of an interval graph, that is, a graph representing the intersections of the set of intervals.
Multiprocessor Scheduling: A method by which tasks are assigned to processors.
On-Line Algorithm: An algorithm that processes its input data sequence in an ongoing manner, that
is, as the data become available, without knowledge of the entire input sequence.
Scalable Algorithm: An algorithm able to maintain the same efficiency when the workload
grows.
Server Allocation: An assignment of servers to the user requests.


Section 7

Web Computing


Chapter 28

Web Application Server Clustering with Distributed Java Virtual Machine
King Tin Lam
The University of Hong Kong, Hong Kong
Cho-Li Wang
The University of Hong Kong, Hong Kong

ABSTRACT
Web application servers, being today's enterprise application backbone, have warranted a wealth of
J2EE-based clustering technologies. Most of them however need complex configurations and excessive
programming effort to retrofit applications for cluster-aware execution. This chapter proposes a clustering
approach based on distributed Java virtual machine (DJVM). A DJVM is a collection of extended JVMs
that enables parallel execution of a multithreaded Java application over a cluster. A DJVM achieves
transparent clustering and resource virtualization, extolling the virtue of single-system-image (SSI). The
authors evaluate this approach through porting Apache Tomcat to their JESSICA2 DJVM and identify
scalability issues arising from fine-grain object sharing coupled with intensive synchronizations among
distributed threads. By leveraging relaxed cache coherence protocols, we are able to conquer the scalability barriers and harness the power of our DJVMs global object space design to significantly outstrip
existing clustering techniques for cache-centric web applications.

INTRODUCTION
Scaling applications in the web server environment is a fundamental requisite for continued growth of e-business, and is also a pressing challenge to most web architects when designing large-scale enterprise
systems. Following the success of the Java 2 Platform, Enterprise Edition (J2EE), the J2EE world has
developed an alphabet soup of APIs (JNDI, JMS, EJB, etc) that programmers would need to slurp down
if they are to cluster their web applications. However, comprehending the bunch of these APIs and the
clustering technologies shipped with J2EE server products is practically daunting for even those experienced programmers. Besides the extra configuration and setup time, intrusive application rework is
usually required for the web applications to behave correctly in the cluster environment. Therefore, there
is still much room for researchers to contribute improved clustering solutions for web applications.
In this chapter, we introduce a generic and easy-to-use web application server clustering approach
coming out from the latest research in distributed Java virtual machines. A Distributed Java Virtual
Machine (DJVM) fulfills the functions of a standard JVM in a distributed environment, such as clusters.
It consists of a set of JVM instances spanning multiple cluster nodes that work cooperatively to support
parallel execution of a multithreaded Java application. The Java threads created within one program
can be distributed to different nodes and perform concurrently to exploit higher execution parallelism.
The DJVM abstracts away the low-level clustering decisions and hides the physical boundaries across
the cluster nodes from the application layer. All available resources in the distributed environment,
such as memory, I/O and network bandwidth can be shared among distributed threads for solving more
challenging problems. The design of DJVM adheres to the standard JVM specification, so ideally all
applications that follow the original Java multithreaded programming model on a single machine can
now be clustered across multiple servers in a virtually effortless manner.
In the past, various efforts have been conducted in extending JVM to support transparent and parallel
execution of multithreaded Java programs on a cluster of computers. Among them, Hyperion (Antoniu et al., 2001) and Jackal (Veldema et al., 2001) compile multithreaded Java programs directly into
distributed applications in native code, while Java/DSM (Yu & Cox, 1997), cJVM (Aridor, Factor, &
Teperman, 1999), and JESSICA (Ma, Wang, & Lau, 2000) modify the underlying JVM kernel to support
cluster-wide thread execution. These DJVM prototypes debut as proven parallel execution engines for
high-performance scientific computing over the last few years. Nevertheless, their leverage to clustering
real-life applications with commercial server workloads has not been well-studied.
We strive to bridge this gap by presenting our experience in porting the Apache Tomcat web application
server on a DJVM called JESSICA2. A wide spectrum of web application benchmarks modeling stock
quotes, online bookstore and SOAP-based B2B e-commerce are used to evaluate the clustering approach
using DJVMs. We observe that the highly-threaded execution of Tomcat involves enormous fine-grain
object accesses to Java collection classes such as hash tables all over the request handling cycles. This
presents the key hurdles to scalability when the thread-safe object read/write operations and the associated synchronizations are performed in a cluster environment. To overcome this issue, we employ a
home-based hybrid cache coherence protocol to support object sharing among the distributed threads.
For cache-centric applications that cache hot and heavyweight web objects at the application-level, we
find that by using JESSICA2, addition of nodes can grow application cache hits linearly, significantly
outperforming the share-nothing approach using web server load balancing plug-in. This is attributed to
our global object space (GOS) architecture that virtualizes network-wide memory resources for caching the application data as a unified dataset for global access by all threads. Clustering HTTP sessions
over the GOS enables effortless cluster-wide session management and leads to a more balanced load
distribution across servers than the traditional sticky-session request scheduling. Our coherence protocol
also scales better than the session replication protocols adopted in existing Tomcat clustering. Hence,
most of the benchmarked web applications show better or equivalent performance compared with the
traditional clustering techniques.
Overall, the DJVM approach emerges as a more holistic, cost-effective and transparent clustering
technology that disappears from the application programmer's point of view. With efficient protocol
support for shared object access, such a middleware-level clustering solution is suitable for scaling most

web applications in a cluster environment. Maturing of the DJVM technology would bring about stronger
server resource integration and open up new vistas of clustering advances among the web community.
The rest of the chapter is organized as follows. In Section 2, we survey the existing web application
clustering technologies. Section 3 presents the system architecture of our JESSICA2 DJVM. In Section 4, we describe Tomcat execution on top of the JESSICA2 DJVM. Section 5 discusses JESSICA2's
global object space design and implementation. In Section 6, we evaluate the performance of Tomcat
clustering using the DJVM. Section 7 reviews the related work. Section 8 concludes this chapter and
suggests some possible future work.

EXISTING WEB APPLICATION CLUSTERING TECHNOLOGIES


In the web community, clustering is broadly viewed as server load balancing and failover. Here, we
discuss several widely adopted clustering technologies under the hood of J2EE.
The most common and cost-effective way for load balancing is to employ a frontend web server with
load balancing plug-ins such as Apache mod_jk (ASF, 2002) to dispatch incoming requests to different
application servers. The plug-ins usually support sticky-sessions to maintain a user session entirely on
one server. This solution could make the cluster resource utilization more restricted and is not robust
against server failures.
More advanced solutions need to support application state sharing among servers. Large-scale J2EE
server products generally ship with clustering support for HTTP sessions and stateful session beans.
One traditional approach is to serialize the session contents and persist the states to a data store like
a relational database or a shared file system. However, this approach is not scalable. In-memory session replication is an improved technique also based on Java serialization to marshal session-bound
objects into byte streams for sending to peer servers by means of some group communication services
such as JGroups (Ban, 1997) (based on point-to-point RMI or IP multicast). Such a technique has been
implemented in common web containers such as Tomcat. However, scalability issues are still present in
group-based synchronous replications, especially over the general all-to-all replication protocols which
are only efficient in very small-size clusters.
Enterprise JavaBeans (EJB) is a server-side component architecture for building modular enterprise
applications. Yet the EJB technology itself and its clustering are both complicated. Load balancing
among EJB containers can be achieved by distributed method call, messaging or name services which
correspond to the three specifications: Remote Method Invocation (RMI), Java Messaging Service (JMS)
and Java Naming and Directory Interface (JNDI). In particular, JNDI is an indispensible element of EJB
clustering as EJB access normally starts with a lookup of its home interface in the JNDI tree. For clients
to look up clustered objects, EJB containers implement some global JNDI services (e.g. cluster-wide
shared JNDI tree) and ship with special RMI compilers to generate replica-aware stubs for making user-defined EJBs cluster-aware. The stub contains the list of accessible target EJB instances and code
for load balancing and failover among the instances. EJB state changes are serialized and replicated to
peer servers after the related transaction commits or after each method invocation. Undoubtedly, this
clustering technology is expensive, complicated and with application design restrictions.
In recent years, a growing trend in web application development has begun to adopt lightweight
containers such as the Spring Framework (Johnson, 2002) to be the infrastructural backbone instead
of the EJB technology. On such a paradigm, business objects are just plain old Java objects (POJOs)

implementing data access logic and running in web containers like Tomcat. Caching POJOs in a collection object like Hashtable is also a common practice for saving long-latency access to database and file
systems. To support clustering of POJOs which conform to no standard interface, it seems almost inevitable that application programmers have to rework their application code to use extra APIs to synchronize
object replicas among the JVMs. Though distributed caching libraries (Perez, 2003) can facilitate POJO
clustering, these solutions again rely on Java serializations and require complex configurations. The
cache sizes they support are usually bounded by single-node memory capacity as a result of employing
simplistic all-to-all synchronization and full replication protocols.
Although the clustering solutions surveyed so far have their own merit points, most of them share
several significant shortcomings.

Restrictions on application design: Many object sharing mechanisms rely on Java serializations
which pose restrictions on application design and implementation. They cannot easily work in a
cluster environment.
Possible loss of referential integrity: Most solutions suffer the break of referential integrity since deserialization creates clones of the replicated object graph and may lose the original object identity. That's why when a shared object undergoes changes, it must be put back into the container object by an explicit call like setAttribute() to reflect the new referential relation (see the sketch after this list). Likewise, consistency problems occur when attributes with cross-references in HTTPSession are modified and unmarshaled separately.
Costly communication: Object serialization is known to be hugely costly in performance. It performs a coarse trace and clones a lot of objects even for one field change. So there is certain limit
on the number and sizes of objects that can be bound in a session.
No global signaling/coordination support: Subtle consistency problems arise when some design patterns and services are migrated to clusters. For example, the singleton pattern sharing a single static instance among threads, as well as some synchronization code, becomes localized to each server, losing global coordination. Event-based services like timers make no sense if they are not executed on a single platform. Only a few products (e.g. JBoss's clustered singleton facility) ship with configurable cluster-wide coordination support to ease these situations.
Lacking global resource sharing: Most clustering solutions in the web domain put little focus on
global integration of resources. They cannot provide a global view of the cluster resources such as
memory, so each standalone server just does its own work without cooperation and may not fully
exploit resources.
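As an illustration of the referential-integrity point above, the fragment below shows the explicit put-back that serialization-based session replication typically requires. The servlet and cart classes are hypothetical examples introduced only for illustration; they are not part of any product discussed here.

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpSession;

public class CartUpdater {
    public void addItem(HttpServletRequest request, String itemId) {
        HttpSession session = request.getSession();
        ShoppingCart cart = (ShoppingCart) session.getAttribute("cart");
        cart.add(itemId);                    // in-place change alone is invisible to replica servers
        session.setAttribute("cart", cart);  // explicit put-back to trigger replication of the new state
    }
}

class ShoppingCart implements java.io.Serializable {
    private final java.util.List<String> items = new java.util.ArrayList<>();
    void add(String itemId) { items.add(itemId); }
}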

JESSICA2 DISTRIBUTED JVM


JESSICA2 (Zhu, Wang, & Lau, 2002) is a DJVM designed to support transparent parallel execution of
multithreaded Java applications in a networked cluster environment. It was developed based on Kaffe
JVM (Wilkinson, 1998). The acronym JESSICA2 stands for Java-Enabled Single-System-Image Computing Architecture version 2; this architecture promotes the single-system image (SSI) notion when
connecting Java with clusters. Such a design concept is helpful to take away the burden of clustering by
hand from application developers. The key advantage with using JESSICA2 is its provision of transparent clustering services which require no source code modification and bytecode preprocessing. It will

automatically take care of thread distribution, data consistency of the shared objects and I/O redirection so that the program will run under an SSI illusion with integrated computing power, memory and I/O capacity of the cluster.

Figure 1. JESSICA2 DJVM System Architecture
Figure 1 shows the system architecture of the JESSICA2 DJVM. JESSICA2 has bundled a number
of salient features extended from the standard JVM that realize the SSI services. To execute a Java application on JESSICA2, a tailored command is called to start the master JVM on the local host and the
worker JVMs on remote nodes, based on the specified list of hostnames. In each JVM, a class loader is
responsible for importing bytecode data (of both the basic Java class library classes and the application
classes) into its method area where a Java thread can look up a specific method to invoke. The class
loader of JESSICA2 is extended to support remote class loading which ensures when a worker JVM
cannot find a class file locally, it can request the class bytecode on demand and fetch the initialized static
data from the master JVM through network communication. This feature greatly simplifies cluster-wide
deployment of Java applications and hence transparently provides the web farming support which traditionally requires application server extension to fulfill.
When the Java threads of the application are started, the thread scheduler of the JVM will put their
contexts (e.g. program counter and other register values) into the execution engine in turn. The Java methods invoked by the running thread will be compiled by the Just-In-Time (JIT) compiler into native code for high-speed execution. JESSICA2 incorporates a cluster-aware JIT compiler to support
lightweight Java thread migration across node boundaries to assist global thread scheduling. Java
threads will be assigned to each worker JVM at the startup time in a round-robin manner to strike a raw
load balance. Dynamic load balancing during runtime can be done by migrating Java threads that are
running into computation hotspots to the less loaded nodes. For detecting hotspots, each JVM instance

has a load monitor daemon that periodically wakes up and sends current load status such as CPU and
memory utilization to the master JVM which is then able to make thread migration decisions with a
global resource view.
Java threads migrated to remote JVMs may still be carrying references to the objects under the source
JVM heaps. For seamless object visibility, JESSICA2 employs a special heap-level service called the
Global Object Space (GOS) to support location-transparent object access. Objects can be shared among
distributed threads over the GOS as if they were under a single JVM heap. For this to happen, the GOS
implements object packing functions to transform object graphs into byte streams for shipping to the
requesting nodes. The shipped object data will be saved as a cache copy under the local heap of the
requesting node. Caching improves data access locality but leads to cache consistency issues. To tackle
the problem of stale data, the GOS employs release-consistent memory models stemming from software
Distributed Shared Memory (DSM) systems to preserve correct memory views on shared objects across
reads/writes done by distributed threads.
JESSICA2 offers parallel I/O and location-transparent file access. We extend JESSICA2 to support
transparent I/O redirection mechanism so that I/O requests (file and socket access) can be virtually served
at any node. Our system does not rely on shared distributed file systems such as NFS, nor does it need
to restrict a single IP address for all the nodes in the running cluster. Rather, we extend each JVM to run
a transparent I/O redirection mechanism to redirect non-home I/O operations on files or sockets to their
home nodes. To attain I/O parallelism atop transparency, read-only file operations and connectionless
network I/O can be done at the local nodes concurrently without redirection.
Finally, all inter-node communication activities required by the subsystems at upper layers like the
GOS and I/O redirections are supported by a common module called the host manager which wraps
up the underlying TCP communication functions with connection caching and message compression
optimizations.
On the whole, we can see that DJVM is a rather generic middleware system that supports parallel
execution of any Java program. Since the unveiling of DJVMs, their application domains have remained mostly
in scientific computing over the last few years. They were used to support multithreaded Java programs
that are programmed in a data-parallel manner. These applications tend to be simple, embarrassingly
parallel so that DJVMs could offer good scalability. However, much more mainstream applications are
business-oriented, centered at server-side platforms and run atop some Java application servers. Their
object access and synchronization patterns are far more complex. In the next sections, we will elaborate
on the common runtime characteristics of application servers and their impacts on the DJVM performance
through a case study of Apache Tomcat running on JESSICA2.

APACHE TOMCAT ON DISTRIBUTED JVM


Apache Tomcat is a Java servlet container developed at the Apache Software Foundation (ASF). It serves
as the official reference implementation of the Java Servlet and JavaServer Page (JSP) specifications.
Tomcat is the world's most widely used open-source servlet engine and has been used by renowned
companies like WalMart, E*Trade Securities and The Weather Channel to power their large-scale and
mission-critical web applications in production systems.
As a common design in many servers, Tomcat maintains a thread pool to avoid thread creation cost
for every short-lived request as well as to give an upper bound to the overall system resource usage.

Upon an incoming connection, a thread is scheduled from the pool to handle it. The web container then
performs various processing such as HTTP header parsing, sessions handling, web context mapping
and servlet class loading. The request eventually reaches the servlet code which implements application
logics such as form data processing, database querying, HTML/XML page generation, etc. Finally, the
response is sent back to the client. This request service cycle is complex, comes across many objects
throughout the container hierarchy and imposes multithreading challenges to the DJVM runtime.
Being a classical and large-scale web application server, Tomcat reflects an important class of real-life object-oriented server execution patterns that are summarized as follows.
1. I/O-intensive workload: Most web server workloads are I/O-bound and composed of short-lived request processing. The per-request computation-communication ratio is usually small.
2. Highly-threaded: It is common that a server instance is configured with a large number of threads, typically a few tens to a hundred per server, to hide I/O blocking latency.
3. High read/write ratios: Shaped by customer buying behaviors and e-business patterns, web applications usually exhibit a high read/write ratio, say around 90/10; the dominant reads come from browsing while only a few writes owing to ordering happen over a period.
4. Long-running: Typically a server application runs for an indefinitely long time, processing requests received from the client side.
5. High utilization of the collection framework: Tomcat makes extensive use of Java collection classes like Hashtable and Vector to store information (e.g. web contexts, sessions, attributes, MIME types, status codes, etc). They are accessed frequently when checking, mapping and searching operations happen inside the container. To reduce object creation and garbage collection costs, many application servers apply the object pooling technique and use collection classes to implement the object pools.
6. Fine-grain object access: Fine-grain object access has two implications here: (1) the object size is small; (2) the interval between object accesses to the heap is short. Unlike many scientific applications which have well-structured objects with sizes of at least several hundred bytes, Tomcat contains an abundance of small-size objects (averaging about 80 bytes in our experience) throughout the container hierarchy. Object accesses are very frequent due to the object-oriented design of Tomcat.
7. Complex object graph with irregular reference locality: Some design patterns such as facade and chain of interceptors used in Tomcat yield ramified object connectivity, cross-referencing and irregular reference locality among objects throughout the container hierarchy. By property 5, heavy use of Java Hashtable or HashMap also intensifies the irregularity of reference locality as hash entries are accessed in a shuffling pattern, contrasting with the consecutive memory access pattern in array-based scientific computations.

Figure 2 depicts the execution of the Tomcat application server on top of a 4-node cluster. When
Tomcat is executed atop JESSICA2, it is exposed to an SSI view of the integrated resources of the cluster nodes as if it were running on one powerful server. A customized Tomcat startup script is used to bring up
the server, running atop the master JVM. The script is tailored to supply the DJVM runtime parameters
(e.g. the port number for master-worker handshaking) and to read a host configuration file which defines
the hostnames or IP addresses of the worker nodes the DJVM would span.
When the server spawns a pool of threads, the threads will be migrated to the worker nodes. They
will load the classes of the Java library, Tomcat and the web applications deployed dynamically through the cluster-aware class loader of JESSICA2.

Figure 2. Execution of Tomcat on JESSICA2 DJVM

In this way, virtual web application server instances are
set up on the worker nodes. The virtual server instances pull workload continuously from the master
node by accepting and handling incoming connections through transparent I/O redirections. On each
worker node, I/O operations (accept, read/write and close) performed on the shared server socket object
(wrapped in the pooled TCP connector) will be redirected to the master node where it was bound to the
outside world. Most other I/O operations can be performed on I/O objects created locally; so each cluster
node can serve web page requests and database queries in parallel.
When a client request is accepted, the context manager of Tomcat will match it to the target web application context. If the request carries session state such as a cookie, the standard manager will search
for the allocated session object from the sessions hash table. In essence, all Tomcat container objects
including the context manager, the standard manager, the sessions hash table and web contexts allocated
in the master JVM heap are transparently shared among the distributed threads by means of the underlying
GOS service mentioned in section 3. When a thread gets the first access to a non-local object reference,
it will encounter an access fault and send a fetching request to the objects home node. The home node
will respond with the up-to-date object data and export the local object as the home copy of the shared
object. Cluster-wide data consistency will be enforced on the home copy and all cache copies derived
from it thereafter. Since each thread will be able to see the shared object updates made by others through
synchronization, the global shared heap creates an implicit cooperative caching effect among the threads.
The power of this effect can be exemplified by collection classes like hash tables.
As illustrated, all HTTP sessions stored in a Tomcat-managed hash table can be globally accessible.
The responsibility of maintaining HTTP session data consistency across servers has transparently shifted
to the GOS layer. In other words, every server is eligible to handle requests belonging to any client
session. This leads to more freedom of choice in request scheduling policies over sticky-sessions load

balancing which can run into hotspots. Another useful scenario is using the GOS to augment the effective cache size of an application-level in-memory Java cache (e.g., a hash table for looking up database
query results). The fact that every thread sees the cache entries created by one another contributes to
secondary (indirect) application cache hits through remote object access. The cache size can now scale
linearly with additional nodes, so we can greatly take the load off the bottlenecked database tier by
caching more data at the application tier.
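For instance, an application-level query cache of the kind described above can be as simple as the fragment below. Under JESSICA2 the same code would run unmodified, with the Hashtable transparently shared through the GOS so that entries inserted by a thread on one node can become secondary cache hits for threads on other nodes. The class and its tiny API are hypothetical, shown only to illustrate the pattern.

import java.util.Hashtable;

// A plain application-level cache for database query results. On a single
// JVM it caches per process; on a DJVM the table lives in the global object
// space, so all distributed worker threads share one logical cache.
final class QueryCache {
    private final Hashtable<String, Object> cache = new Hashtable<>();

    Object lookup(String sqlKey) {
        return cache.get(sqlKey);        // may be a hit produced by a thread on another node
    }

    void store(String sqlKey, Object resultSet) {
        cache.put(sqlKey, resultSet);    // Hashtable's synchronized methods double as GOS sync points
    }
}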
The DJVM approach inherits most advantages of clusters. However, the aforesaid server runtime
properties bring additional design challenges on the DJVM runtime. First, I/O intensive workloads are
known to be more difficult to scale efficiently over a cluster. Second, the high thread count property
implies higher blocking latency if contention occurs. More memory overhead would result from any per-thread protocol data structures. A high read/write ratio is positive news to the GOS as it implies shared writes are limited, so our protocols can take this property as a design tradeoff. Next, for long-running applications, we need to make sure the memory overhead induced by the coherence protocol
data structures scales up slowly for less frequent garbage-collection cycles. Property 5 puts up the biggest barrier to scalability. Frequent synchronizations on the globally shared thread pool and object pools
produce intensive remote locking overhead. Worse still, these pools are usually built from Java collection
classes which are not scalable. For example, fine-grain accesses to hash entries of a Java hash table are
all bottlenecked around the single map-wide lock contention which will be much intensified by distributed locking. Finally, properties 6 and 7 together issue enormous remote access roundtrips and demand
smart object prefetching techniques for aggregating fine-grain communications. These observations call
for a renovation of JESSICA2's global object space (GOS) architecture.

GLOBAL OBJECT SPACE


In this section, we elaborate on the design and implementation of our enhanced GOS system. We discuss
the structure of the extended JVM heap, a home-based cache coherence protocol tailored for managing
locks and a cluster-wide sequential consistency protocol for handling volatile field updates.

5.1 Overview of the Extended JVM Heap


To support cluster-wide object modification propagation and consistency maintenance, the heap of the
standard JVM should be extended to make it cluster-aware. In JESSICA2, each JVM heap is logically
divided into two areas: the master heap area and the cache heap area. The master heap area essentially
rides on the unmodified JVM heap, storing ordinary local objects. To make it cluster-aware, when
the local objects are being shared with some remote threads, they are exported as home objects with
special flags marked in their object headers. The cache heap area manages cache objects brought from
the master heap of a peer node. It consists of extra data structures for maintaining cluster-wide data
consistency. The original GOS follows an intuitive design in which each thread has its own cache heap
area, resembling the thread-private working memory based on the Java memory model (JMM). This
design prevents local threads from interfering with each other's cache copies (such as during invalidations) but
wastes precious memory space to keep redundant per-thread cache copies on the same node. So we adopt
a unified cache design in the enhanced GOS which allows all local threads running on a single node to
share a common cache copy. This design not only makes better use of available memory resources but also reduces remote object fetching, since when a thread faults in an object, other peer threads at the same
node requesting the same object could find it in place. We also switch to a release consistency memory
model in which the dominant read-only objects are never invalidated, so the interference among local
threads is practically small. These modifications potentially could accommodate a high server thread
count and achieve better memory utilization.
Figure 3. GOS internal data structures

Figure 3 shows the internal data structures of the extended JVM heap. The object header of every
object is augmented with special fields such as the cache pointer. A local or home object has a null cache
pointer whereas a cache object has its cache pointer pointing to an internal data structure called cache
header that contains the state and home information of the object. A node-level hash table (shared by all
local threads) is used to manage and to look up cache headers during fetching events. In order to tell the
home nodes of the modifications made on cache objects, each thread maintains a dirty list that records
the ids of cache objects it has modified. At synchronization points, updates made on the dirty objects
are flushed back to their home nodes. A similar per-node volatile dirty list is used to record updates on
objects with volatile fields which are maintained by a separate single-writer protocol to be explained
in section 5.3.
Object state is composed of two bits: valid/invalid and clean/dirty. The JIT compiler is tweaked to
perform inline checking on each cache object access to see if its state is valid for read/write. Read/write
on an invalid object will trigger appropriate interface functions to fault-in the up-to-date copy from its
home. For efficiency, the software check is injected as a tiny assembly code fragment to the relevant
bytecode instructions (GETFIELD, PUTFIELD, AALOAD, AASTORE, etc), testing the last two bits
of the cache pointer. Valid object access passing the check will not impose any GOS interface function
call overhead and is thus as fast as local object access.
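The logic of that injected check can be pictured roughly as follows. Since the real test manipulates the low-order bits of the cache pointer in JIT-generated native code, this Java model replaces the tagged pointer with an explicit state field; all names here are illustrative, not JESSICA2's actual interfaces.

// Rough model of the per-access check injected before GETFIELD/PUTFIELD-like
// operations: home/local objects (no cache header) and valid cache copies take
// the fast path; an invalid cache copy is faulted in from its home node first.
final class CacheHeader {
    static final int VALID = 0x1;   // bit 0: valid/invalid
    static final int DIRTY = 0x2;   // bit 1: clean/dirty
    volatile int state;
    int homeNode;
}

abstract class GosObject {
    CacheHeader cache;              // null for local and home objects

    final Object readField(int fieldIndex) {
        if (cache == null || (cache.state & CacheHeader.VALID) != 0) {
            return localRead(fieldIndex);             // fast path, no GOS interface call
        }
        fetchFromHome(cache.homeNode);                // fault in the up-to-date copy
        cache.state |= CacheHeader.VALID;
        return localRead(fieldIndex);
    }

    abstract Object localRead(int fieldIndex);
    abstract void fetchFromHome(int homeNode);
}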
Creating a single-heap illusion to distributed threads entails an advanced design of distributed cache
coherence protocol as it has to be compliant with the Java memory model that defines the memory consistency semantics across multiple threads. The Java language provides two synchronization constructs for

the programmers to write thread-safe code: the synchronized and volatile keywords. The synchronized keyword guarantees atomicity and memory visibility for a code fragment or method, while volatile
ensures that threads can see the latest values of volatile variables. We will discuss our enhancements of
the GOS for handling the two types of synchronizations in Section 5.2 and 5.3 respectively.

5.2 Home-based Lazy Release Consistency Protocol


Entering and exiting a synchronized block or method correspond to acquiring and releasing the lock associated with the synchronized object. To fulfill the Java memory model, the original GOS implements
an intuitive solution that works as follows. Upon a lock release, all updates to cache objects are flushed
to their home nodes. Upon a lock acquire, all cache objects are invalidated, so later accesses will fault
in the up-to-date copies from the home nodes. However, this would incur significant object fault-in
overheads after every lock acquire. Thus, we renovate the original global object space by adopting a
more relaxed home-based lazy release consistency (HLRC) memory model.
Contrary to the intuitive solution, upon a lock acquire, we confine invalidations to cache copies of
shared objects that have been modified by other nodes only, rather than invalidating the total cache heap
area. Our home-based cache coherence protocol guarantees memory visibility based on Lazy Release
Consistency (LRC) (Keleher, Cox, & Zwaenepoel, 1992). LRC delays the propagation of modifications
to a node until it performs a lock acquire. Lock acquire and release delimit the start and end of an interval. Specifically, LRC ensures that the node can see the memory changes performed in other nodes' intervals according to the happened-before-1 partial order (Adve & Hill, 1993), which is basically given by each node's local locking order and the shared lock transfer events. This means all memory updates
preceding the release performed by a node should be made visible to the node that acquires the same
lock. HLRC is similar to LRC in the sense of lock management but shapes the modification propagation
into home-based patterns.
Memory updates are communicated based on a multiple-writer protocol implemented using the
twin-and-diff technique that allows two or more threads to modify different parts (i.e. different fields or
array portions) of the same shared object concurrently without conflict. In this technique, a twin copy
is made as a data snapshot before the first write to a cache object in the current interval. Upon a shared
lock release, for each dirty cache object, the modified part, i.e. diff, is differentiated from the twin. The
diff is eagerly flushed to the corresponding home node, keeping the home copy always up-to-date. The
thread can then safely discard the twins and diffs and close the interval. When the lock is acquired by
another thread, the releaser passes write notices along with the lock grant to the acquirer. The acquirer uses
the write notices to invalidate the corresponding cached objects. It also saves the write notices so that
they can be passed on to the next acquirer, enforcing the happens-before partial order. A later access on
an invalidated cache object will fault in the up-to-date copy from its home.
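A minimal sketch of the twin-and-diff idea, treating an object's data as a byte array (class and method names are illustrative, not the JESSICA2 API):

import java.util.ArrayList;
import java.util.List;

final class TwinDiff {
    // snapshot taken before the first write to the object in the current interval
    static byte[] makeTwin(byte[] objectData) {
        return objectData.clone();
    }

    // at lock release: compare the current data against the twin and collect the
    // modified (offset, value) pairs, which form the diff flushed to the home node
    static List<int[]> diff(byte[] current, byte[] twin) {
        List<int[]> changes = new ArrayList<>();
        for (int i = 0; i < current.length; i++) {
            if (current[i] != twin[i]) {
                changes.add(new int[] { i, current[i] });
            }
        }
        return changes;
    }
}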
Here, we have to deal with some tricky data-race problems arising from sharing a unified cache
copy among local threads. First, for systems of object-based granularity as in our case, field-level false
sharing may occur since protecting different fields of one object by different locks is reckoned as well-synchronized in Java. For example, while one thread T1 holds a lock for modifying field A of a cache
copy and puts it in the dirty state, another local thread T2 may acquire a lock for modifying
field B of the same object. If another node has modified field B using the same lock, then T2 will invalidate that cache copy and fault-in the home copy, overwriting those pending modifications made by
T1. Second, in systems with object prefetching, it is possible for one thread faulting in a home object A

with object B prefetched to overwrite the pending modifications on the shared cache copy B made by
another thread. Currently, we deal with these hazards by reconciling the timestamp field associated with
each object to resolve detectable version conflicts and by incorporating techniques similar to two-way
diffing (Stets et al., 1997).
For home objects, local read/write can be done directly without generating and applying diffs. This
benefit is usually known as the home effect (Zhou, Iftode, & Li, 1996). Some minor overhead that
home nodes still need to pay is to keep record of the local writes for the next remote acquiring thread to
invalidate the relevant cache copies. Locking of home objects resembles locking of local objects if the
lock ownership has not been given to any remote nodes. Otherwise, it has to wait for the lock release
done by the last remote acquirer.
Compared with homeless protocols, the advantages of HLRC are: 1. the home effect for reducing
high diffing overheads; 2. fewer messages since an object fault can always be satisfied by one round-trip
to home instead of diff request messages to multiple writers; 3. no diff accumulation and so no need for
garbage collection of diffs. Hence, this becomes our protocol design choice for shorter latency seen by
I/O-bound workload and less garbage accruing from long-running server applications.
Nevertheless, we depart from the usual HLRC implementations in some aspects. To track and enforce
the happens-before partial order, traditional HLRC implementations rely heavily on vector timestamps to
dig out the exact minimal intervals (and write notices) that the acquirer must apply. While this ensures
the most relaxed invalidation, it entails complex data structures, like the interval records in TreadMarks
(Keleher et al., 1994) or the bins database (Iosevich & Schuster, 2005), to keep the stacks of vectors. The
storage size occupied by them scales with the number of lock operations on shared objects. For lock-intensive
applications, these stacks can grow quickly and consume enormous space. For long-running server
applications, the problem becomes more critical and systems that rely on pre-allocation schemes such
as cyclic bins buffers (Iosevich & Schuster, 2005) will ultimately run out of space and result in runtime
failure. Discarding interval records is possible if they have already been applied to all nodes. But some
nodes may never acquire a particular lock while some nodes acquire it intensively. This issue cannot be
ignored, particularly in multithreaded protocols where the length of the vector timestamp scales with the
number of threads. For highly-threaded applications like Tomcat, this has scalability impacts on both
memory and network bandwidth. Therefore, our protocol eschews the use of vector timestamps.
Rather, we employ object-level scalar timestamps to assist in deriving the set of write notices. The
basic idea is illustrated by Figure 4. Each node maintains a data structure called timestamp map which
is essentially a hash table recording all shared objects that have once been modified. Each map entry
consists of the object id, a scalar timestamp and an n-bit binary vector (n being the number of nodes) and is
used to derive the corresponding write notice, formatted as a pair (object id, timestamp). The n-bit
binary vector is used to keep track of which node has applied the write notice (0 = not yet; 1 = applied).
If all the n nodes have applied the write notice, it is considered obsolete and can be discarded. The size
of this map scales with the number of modified shared objects rather than the number of shared lock operations. Repetitive locking on the same object will not generate separate interval records but updates
the same entry in the timestamp map. Due to high read/write ratios in web applications, the number of
modified shared objects is limited. The timestamp map will also undergo a periodic shrinking phase to
clean up those obsolete entries. So the map is practically small in size at most of the time.
Upon a shared lock release, modifications will be recorded into the local node's timestamp map.
When a lock transfer happens, all non-obsolete map entries will be extracted as write notices and passed
from the releaser to the acquirer. They will also be saved in the acquirer's map. Write notices with a

Figure 4. Timestamp map for implementing HLRC

newer timestamp will overwrite an old map entry if any and reset its n-bit vector to all zeros so that
future acquirers will be able to know the changes. Without tracking the exact partial order, the set of
write notices sent to an acquirer may not be minimal and may include modifications that happen after the release of the lock being acquired. The drawback is that some cache objects at the acquirer
side may be invalidated earlier than necessary. However, this effect is insignificant since if the thread is
really going to access the invalidated cache objects, it eventually needs to see the modifications. This
effect will not accrue owing to our periodic cleanup of obsolete map entries and selective invalidations
based on object timestamp comparison.
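The following sketch captures the essence of the timestamp map (identifiers are illustrative, and details such as write-notice extraction and transfer are omitted):

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class TimestampMap {
    static final class Entry {
        long timestamp;        // scalar timestamp of the last recorded modification
        final BitSet applied;  // bit i set once node i has applied the write notice
        Entry(long ts, int nodes) { timestamp = ts; applied = new BitSet(nodes); }
    }

    private final Map<Long, Entry> entries = new HashMap<>();
    private final int numNodes;

    TimestampMap(int numNodes) { this.numNodes = numNodes; }

    // repeated locking on the same object updates a single entry instead of
    // accumulating per-interval records
    void recordWrite(long objectId, long timestamp) {
        Entry e = entries.computeIfAbsent(objectId, id -> new Entry(timestamp, numNodes));
        if (timestamp >= e.timestamp) {
            e.timestamp = timestamp;
            e.applied.clear();          // newer write: no node has applied it yet
        }
    }

    void markApplied(long objectId, int nodeId) {
        Entry e = entries.get(objectId);
        if (e != null) {
            e.applied.set(nodeId);
        }
    }

    // periodic shrinking phase: discard entries every node has already applied
    void shrink() {
        entries.values().removeIf(e -> e.applied.cardinality() == numNodes);
    }
}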

5.3 Volatile Consistency Protocol


Most DJVM prototype implementations enforce cluster-wide semantics of the volatile construct in a way
that is stricter than necessary. For straightforward compatibility, the volatile construct is usually treated
as if it were a lock, thus introducing unnecessary mutual exclusivity to the application. The latest Java
concurrent utility package (JSR166, 2004), particularly the ConcurrentHashMap class shipped with it,
employs segment-based locks plus volatile count and object value fields to guard different hash bucket
ranges. This advanced data structure offers much more scalable throughput than the conventional Java
Hashtable. However, such a good design for concurrency will be smothered if the underlying DJVM
handles the volatile fields as locks. So we decided to tailor consistency support to volatile fields.
Our new protocol for maintaining cluster-wide volatile field consistency is a passive-based concurrent-read exclusive-write (CREW) protocol. It enforces sequential consistency to ensure the next reader
thread can see the updates made by the last writer on the same object.
To implement this model, we need to assign a manager for each object with volatile fields and it is
naturally the home node where the object is created. For ease of explanation, we call an object with
a volatile field a volatile object. The home node needs to maintain two states on the home copy of a
volatile object, readable and exclusive, as well as a list called the copyset of the nodes that currently have
a valid cache copy of this object. When the home node receives a fetch request from a node on a readable volatile object, the node's id will be added to the copyset list of the home copy. The consistency
of a volatile object relies on the active writer to tell the readers of such an update. When a thread wants

to write the object, whether the home or a cache copy, it must first gain the exclusive right on it from
its home node. Before the exclusive right is granted to the candidate writer, the home will broadcast
invalidations to all members of the copyset and clear the copyset. The writer will record its modified objects into the per-node volatile dirty list. The exclusive right will be returned to the home
when the modification (diff) is flushed. Reads/writes on home objects similarly need to go through the
state check, except that they are done directly on the object data without diff generation and flushing.
There is no need to generate any write notices because volatile cache copies are passively invalidated
by the home when a writer exists.
Upon a read on an invalid volatile object, the reader needs to contact the home for the latest copy and join
the copyset again. If the state of the home copy is exclusive, then the fetch request will be put into a
queue pointed to by the volatile object header. When the writer returns the diff and the exclusive right to the home,
the home will turn the object state back to readable and reply to all queued readers with the updated object
data. As long as the state of a cached volatile object stays valid, its consistency is guaranteed and
the thread can directly trust it until an invalidation is received when some writer exists. This is the
beauty of this protocol, which results in much better concurrency. Reads on a valid volatile object are purely
local operations without remote locks or any communication. For applications with a high read/write ratio,
our design tradeoff shifts the communication overhead from the dominant reads to the writes.
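A minimal sketch of the home-node side of this protocol (readable/exclusive states, copyset maintenance and passive invalidation); the class and its methods are illustrative and omit diff handling and queue details:

import java.util.HashSet;
import java.util.Set;

final class VolatileObjectHome {
    enum State { READABLE, EXCLUSIVE }

    private State state = State.READABLE;
    private byte[] data;                                   // the home copy
    private final Set<Integer> copyset = new HashSet<>();  // nodes holding valid cache copies

    VolatileObjectHome(byte[] initialData) { this.data = initialData; }

    // fetch request from a reader node: queued while a writer holds the object
    synchronized byte[] fetch(int nodeId) throws InterruptedException {
        while (state == State.EXCLUSIVE) {
            wait();
        }
        copyset.add(nodeId);           // remember who must be invalidated later
        return data.clone();
    }

    // a candidate writer asks for the exclusive right
    synchronized void acquireExclusive(Invalidator invalidator) throws InterruptedException {
        while (state == State.EXCLUSIVE) {
            wait();                    // one writer at a time
        }
        for (int node : copyset) {
            invalidator.invalidate(node);   // passive invalidation of cached copies
        }
        copyset.clear();
        state = State.EXCLUSIVE;
    }

    // the writer flushes its modification and returns the exclusive right
    synchronized void releaseExclusive(byte[] updatedData) {
        this.data = updatedData;
        state = State.READABLE;
        notifyAll();                   // reply to queued readers
    }

    interface Invalidator { void invalidate(int nodeId); }
}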

PERFORMANCE ANALYSIS
In this section, we present the performance results obtained by running Tomcat on JESSICA2.

6.1 Experimental Setup


Our experimental platform consists of three tiers: 1. web tier: a 2-way Xeon SMP server with 4GB RAM
for running the master JVM of JESSICA2 with Apache Tomcat 3.2.4 started up on it. 2. application
tier: a cluster of eight x86-based PCs with 512 MB RAM serving as the DJVM worker nodes. 3. data
tier: a cluster of four x86-based PCs with 2GB RAM supporting MySQL Database Server 5.0.45. All
nodes run under Fedora Core 1 (kernel 2.4.22). A Gigabit Ethernet switch is used to link up the three
tiers, while nodes within the same tier are connected by Fast Ethernet networks.
The initial and maximum heap sizes of each worker JVM are set to 128MB and 256MB respectively.
Each database node has the same dataset replica with MySQL replication enabled to synchronize data
updates across database servers at nearly real time. Jakarta JMeter 2.2 is used to synthesize varying
workloads to stress the testing platform.
Table 1 shows the application benchmark suite that we use to evaluate our clustering approach using
the DJVM. They are designed to model real-life web application patterns.
1. Bible-quote characterizes applications like text search engines, news archives and company catalogs. The servlet application is I/O-intensive, serving document retrievals and search requests over a set of text files of books.
2. Stock-quote models stock market data providers. We follow the trend of web services that deliver price data by XML messages. The application reads stock price data matching the input date range from the database and formats the query result into an XML response.

Table 1. Application Benchmark Suite

Application         Object Sharing             Workload Nature                 I/O
Bible-quote         No sharing                 I/O-intensive                   Text files
Stock-quote         No sharing                 Relatively compute-intensive    Database
Stock-quote/RSA     No sharing                 Relatively compute-intensive    Database
SOAP-order          HTTP session               I/O-intensive                   Database
TPC-W               HTTP session               I/O-intensive                   Database and image files
Bulletin-search     Cached database records    Memory-intensive                Database

3. Stock-quote/RSA is a secure version of Stock-quote involving compute-intensive operations of 1024-bit RSA encryption on the price data.
4. SOAP-order models a B2B e-commerce web service. A SOAP engine is needed to support the service. We choose Apache SOAP 2.3.1 and deploy it to Tomcat. The application logic is to parse a SOAP message enclosing securities order placements, validate the user account and order details, and then put the successful transactions into the database.
5. TPC-W is a standard transactional web benchmark specification. It models an online bookstore with session-based workloads and a mix of static and dynamic web interactions. We adopt the Java servlet implementation developed by ObjectWeb (2005) but tailor the utility class for data access by disabling the default database connection pooling and utilizing thread-local storage to cache connections instead.
6. Bulletin-search emulates a search engine in a bulletin board or web forum system. We take the data dump from the RUBBoS benchmark (ObjectWeb, 2004) to populate the database. The application maintains a hash-based LRU-cache map of the results of the costly database searches, and is thus memory-intensive. In order not to lift the garbage collection frequency too much, we impose a capacity limit on the cache map, taking up about one-fourth of the local JVM heap.

The original Tomcat is ported to JESSICA2 with a few customizations as follows: 1. the shared thread
pool is disbanded. We replace the original thread pool by a simpler implementation which spawns a
static count of non-pooled threads based on the server configuration file. 2. several shared object pools
(e.g. static mapping tables for MIME types and status codes) are disintegrated into thread-local caches.
The total number of modified lines of code, including the new thread pool source file we introduce, is less than
370 (about 0.76% of the Tomcat source base).
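For illustration, spawning a fixed number of non-pooled worker threads might look roughly like the following (a sketch only, not the actual replacement class; the connection-handling logic is omitted):

final class StaticWorkerThreads {
    // start 'threadCount' independent workers; each owns its own resources,
    // so no pool object needs to be shared (and kept coherent) across the cluster
    static void start(int threadCount, Runnable connectionHandler) {
        for (int i = 0; i < threadCount; i++) {
            Thread worker = new Thread(connectionHandler, "tomcat-worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }
}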

6.2 Scalability Study


In this experiment, we measure the maximum throughputs and average response times obtained by scaling the number of worker nodes from two to eight. The speedup is calculated by dividing the baseline
runtime of Tomcat on Kaffe JVM 1.0.7 by the parallel runtime of Tomcat on JESSICA2. Figure 5 shows
the results obtained for each benchmark. We can see that most of the applications scale well and achieve
efficiency ranging from 66% (SOAP-order) to 96.7% (Stock-quote). Bible-quote, Stock-quote and Stock-quote/RSA show almost linear speedup because they belong to the class of stateless applications, undergoing true parallelism without any GOS communications between the JVMs. In particular, Stock-quote

Figure 5. Scalability and average response time obtained by Tomcat on JESSICA2

and Stock-quote/RSA involve operations of coarser work granularity, such as string manipulations and
RSA encryptions, and hence attain nearly perfect scalability more readily. The relatively poorer
speedups seen by SOAP-order and TPC-W are expected as they are stateful applications and involve
GOS overheads when sharing HTTP session objects among JVM heaps. We will further discuss the
limited speedup obtained by SOAP-order in section 6.5.
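Restating the speedup and efficiency calculation above in formula form (attributing the 96.7% figure to the 8-node configuration is our reading of Figure 5, not stated explicitly in the text):

\[
\text{speedup}(N) = \frac{T_{\text{Tomcat on Kaffe}}}{T_{\text{Tomcat on JESSICA2},\,N}},
\qquad
\text{efficiency}(N) = \frac{\text{speedup}(N)}{N},
\]

so, for example, a 96.7% efficiency at N = 8 corresponds to a speedup of about 0.967 x 8 ~ 7.7.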
Bulletin-search shows a nonlinear but steepening speedup curve as the number of worker nodes
scales out, due to the implicit cooperative cache effect given by the GOS that we described in section 4.
As the nodes scale, when the cluster-wide aggregated available memory becomes large enough
to accommodate most of the data objects cached in the application, the cache benefit contributes a
sharp rise in speedup. Further study on this effect is given in section 6.4.
Table 2 shows the cluster-wide thread count used in each application and the overall protocol messaging overheads inside JESSICA2 in the 8-node configuration. The count of I/O redirections is proportional
to the request throughput and generally does not have an impact on the scalability. The higher number of
GOS protocol messages explains the poorer scalability obtained by an application when cross-checked with
Figure 5. Bulletin-search is an exceptional case, as its performance is determined more by
its cooperative caching benefits, which can supersede the cost of GOS communications.

6.3 Comparison with Existing Tomcat Clustering


A control experiment is conducted on the same platform to compare the DJVM approach with an existing clustering method for Tomcat using web load balancing plug-ins. We run an instance of Apache web
server 2.0.53 on the web tier and eight standalone Tomcat servers on the application tier of our platform.
The web server is connected to the Tomcat servers via the mod_jk connector 1.2.18 with sticky-session
enabled (in-memory session replication is not supported in this comparison). The cluster-wide total
number of threads and heap size configurations in this experiment are equal to the previous ones used
in the DJVM approach.
Figure 6 shows the throughputs obtained by the two clustering approaches on eight nodes. We can
see that both solutions achieve similar performance (within 8%) for those stateless web applications
(Bible-quote, Stock-quote and Stock-quote/RSA). These applications are embarrassingly parallel

Table 2. Protocol message overheads of JESSICA2 DJVM

Application         # Threads    # GOS Messages / Sec    # I/O Redirections / Sec
Bible-quote         80           —                       2006
Stock-quote         80           —                       1791
Stock-quote/RSA     80           —                       275
SOAP-order          16           979                     146
TPC-W               40           351                     1413
Bulletin-search     16           483                     297

and will not gain much advantage from the GOS. So, putting the GOS aside, we can expect both solutions
to perform more or less the same, because both our transparent I/O redirection and mod_jk's socket
forwarding are functionally alike for dispatching requests and collecting responses. Yet, extra overheads
could be incurred in our solution when transferring big chunks of data via fine-grain I/O redirections and
during object state checks.
TPC-W performs about 11% better on the DJVM than with mod_jk. One reason is that servers sharing sessions over the GOS are no longer restricted to handling requests bound to their sticky sessions,
whereas load hotspots can happen intermittently when using mod_jk. On the other hand, SOAP-order
performs 26% poorer on JESSICA2 than with mod_jk. The main factor that pulls down the performance
is that the SOAP library has some code performing fairly intensive synchronizations in every request
processing cycle. We will see later that the overhead breakdown presented in Section 6.5 echoes this
factor. Bulletin-search performs 8.5 times better on the DJVM due to application cache hits augmented
by the GOS. We will explain why the DJVM approach has significantly outperformed the existing solution
in the next section.

Figure 6. Comparison of Tomcat on DJVM and existing Tomcat clustering

Table 3. Bulletin-search's cache size settings and hit rates augmented by the GOS

No. of    Cache Size          Relative      Total       Indirect Hit    Cost Ratio of          Throughput
Nodes     (#Cache Entries)    Cache Size    Hit Rate    Latency (ms)    Miss : Indirect Hit    Speedup
1         512                 12.5%         18.6%       N/A             N/A                    N/A
2         931                 22.7%         33.9%       9.07            40.79                  1.26
4         1862                45.5%         59.3%       8.18            45.23                  2.02
8         3724                90.9%         90.7%       11.74           31.52                  7.96

6.4 Effect of Implicit Cooperative Caching


Bulletin-search exemplifies the class of web applications that can exploit the GOS to virtualize a large
heap for caching application data. Table 3 shows the application cache hits obtained by Bulletin-search
when the number of cluster nodes scales from one to eight. With the GOS, the capacity setting of the
cache map can be increased in proportion to the node count, beyond the single-node limit, because different
portions of the map are stored under different heaps. This is not possible without the GOS. Upon the
creation of a new cache entry, the object reference added to the map is made visible to all threads across
synchronization points. So redundant caching is eliminated. Threads can exploit indirect (or global)
cache hits in case the desired object is not in the local heap, easing the database bottleneck.
We can see from Figure 7 that the overall hit rate keeps rising along with the scaling of worker nodes
of the DJVM, and most of the cache hits are contributed by the indirect hits once the single-node capacity has been exceeded. This is the reason why our approach achieves a multifold throughput improvement over the
existing clustering approach, in which there are only direct (local) hits that level off or even drop
slightly no matter how many nodes are added.
Here we define a term called relative cache size (RCS) that refers to the percentage of the aggregated
cache size (combining all nodes) relative to the total size of the data set. When the RCS is below 50% in
the 4-node case, the achievable cache hit rate is only around 60% and the 40% misses get no improvement

Figure 7. Composition of application cache hits in Bulletin-search with GOS

Table 4. GOS overhead breakdown

                          # Messages / Sec
GOS Message Type          SOAP-order    TPC-W    Bulletin-search
Lock acquire              198           48       61
Lock release              198           48       61
Flush                     217           70       92
Static data fetch         18            10       —
Object fault-in           197           99       160
Array fault-in            79            50       105

such that the application obtains a speedup of merely two. But when the RCS exceeds a certain level (e.g.
90% in the 8-node case), most of the requests are fulfilled by the global cache instead of going through
the database tier. This explains the non-uniform scalability curve of this application in Figure 5.
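Written out as a formula (the total data set size of roughly 4096 cacheable entries is inferred from Table 3, where 512 cached entries correspond to a relative cache size of 12.5%, rather than stated explicitly in the text):

\[
\mathrm{RCS} = \frac{\sum_{i=1}^{N} C_i}{D} \times 100\%,
\]

where C_i is the cache capacity on node i and D is the total data set size; for example, 1862/4096 ~ 45.5% in the 4-node case and 3724/4096 ~ 90.9% in the 8-node case.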

6.5 GOS Overhead Breakdowns


Table 4 shows the GOS overhead breakdowns in terms of message count per second for the three stateful
applications. Figure 8 supplements this with a percentage breakdown of the message count as well as message
latency. Lock acquire and release messages are issued when locking a remote object. Flush messages
are sent upon lock releases, but the flush message count is a bit higher than the lock release count because
in some cases updates are flushed to more than one home. Other overheads are related to access faults
which translate to communications with the corresponding home nodes. It is obvious that SOAP-order
involves much more remote locking overhead than the other applications.
Our further investigation finds that one utility class of the deployed SOAP library induces, for each request,
about five to six remote locks on several shared hash tables and four remote locks on

Figure 8. GOS percentage overhead

Table 5. Cluster-wide locking overheads

Application        # Local Locks / Sec    # Remote Locks / Sec    % Remote Locks Under Contention    Ratio of Local : Remote Locks
SOAP-order         232631                 198                     35%                                1175:1
TPC-W              240470                 48                      45%                                5010:1
Bulletin-search    27380                  61                      6.5%                               449:1

ServletContextFacade, coming from the facade design pattern of Tomcat. Such heavy cluster-wide synchronization overheads explain the relatively poorer scalability of this application.
Table 5 presents the local and remote locking rates for each application. We can see that local locks
are far more numerous than remote locks. The main reason behind this is that in Java-based servers, thread-safe reads/writes on I/O stream objects are exceptionally frequent, producing a tremendous number of local locks.
While local lock latency is very short (our benchmarking shows an average of 0.2 us), remote lock latency is
however at least several thousand times longer in commodity clusters; yet remote locks are practically
much fewer in most web applications. Another piece of information given by Table 5 is that SOAP-order
and TPC-W have about 35% to 45% remote locks under cluster-wide contention, thus prolonging the
wait time before locks are granted. This is why lock acquire has been the dominant part in the message
latency for these two applications in Figure 8.

RELATED WORK
Despite the boom of software DSM and the later DJVM research, it seems there have been only a few
attempts at transparently supporting real-life server applications by means of shared virtual memory
systems. Even fewer have been successful in demonstrating good scalability, even though some of them
relied on non-commodity hardware to support their systems.
Shasta (Scales & Gharachorloo, 1997) is a fine-grained software DSM system that uses binary code
instrumentation techniques extensively to transparently extend memory accesses to have cluster-wide
semantics. Oracle 7.3 database server was ported to Shasta running on SMP clusters, albeit without
success in achieving good scalability. They used TPC-B and TPC-D database benchmarks which model
online transaction processing and decision support queries respectively. TPC-B failed to scale at all due to
overly frequent updates, while TPC-D managed a speedup of only slightly above one on three servers
connected by non-commodity Memory Channel Network. To some extent, their experience and result
exhibit many limitations of implementing a single system image at operating system level, compared to
our approach of clustering at the middleware level. For example, a relaxed memory consistency model cannot usually be adopted at the operating system level, since the correctness of binary applications often
relies on the consistency model imposed by the hardware, which is generally much stricter than the Java memory
model. Being able to adopt a relaxed memory model such as HLRC in our case is very important for server
applications, which may be synchronization-intensive.
cJVM (Aridor et al., 1999) is one of the earliest DJVMs, designed with the intent of enabling large multithreaded server applications such as Jigsaw to run transparently on a cluster. cJVM operates in interpreter mode; it employs a master-proxy model and a method-shipping approach to support object sharing among

distributed threads. The system relies on proxy objects to redirect field access and method invocation
to the node where the object's master copy resides. This model basically conforms to sequential consistency and is not efficient, since every object access and method invocation may require communication,
although some optimization techniques were developed to avoid needless shipping. In contrast, our DJVM
runs in JIT-compilation mode and conforms to release consistency, both propelling faster execution. In
(Aridor et al., 2000), cJVM was evaluated by running pBOB (Portable Business Object Benchmark), a
multithreaded business benchmark inspired by TPC-C, on a 4-node cluster connected by non-commodity
Myrinet. They obtained an efficiency of around 80%. However, it is unclear whether cJVM would
perform as well if JIT compilation were enabled and commodity Ethernet were used, as in our case.
Terracotta (Zilka, 2006) is a JVM-level clustering product that has been on the market for a couple of
years. It applies bytecode instrumentation techniques similar to JavaSplit (Factor, Schuster, & Shagin,
2003) to a predefined list of common products and to user-defined classes for clustering among multiple
Java application instances. Users need to manually specify shared classes as distributed shared objects
(DSOs) and their cluster-aware concurrency semantics. Contrasting with our SSI-oriented approach,
this configuration-driven approach may impair user transparency and create subtle semantic violations.
Terracotta uses a hub and spoke architecture that requires setting up a central server, namely the L2
server, to store all DSOs and to coordinate heap changes (field-level diffs) across JVMs. At synchronization points, changes on a DSO have to be sent to the L2 server that forwards the changes to all other
clustered JVMs in the DSO's copyset to keep all replicas consistent. Our home-based protocol needs
to keep only the home copy up-to-date by flushing diffs, then the next acquirer can see the changes by
faulting in the whole object. Terracotta's centralized architecture may make the cluster susceptible to a
global bottleneck when scaling out. Removing the bottleneck requires forklift upgrades of the L2 server
(i.e. vertical scaling), which spoils the virtue of horizontal scaling using commodity hardware. We believe
a home-based peer-to-peer protocol is a more scalable architecture for distributed object sharing.

CONCLUSION AND FUTURE WORK


In this chapter, we introduce a new transparent clustering approach using distributed JVMs (DJVMs) for
web application servers like Apache Tomcat. A DJVM couples a group of extended JVMs for distributing a multithreaded Java application on a cluster. It realizes transparent clustering without the need for
introducing new APIs and incorporates most of the advantages of an SSI-centric system, such as global
resource integration and coordination. Using DJVMs to cluster web application servers can enhance the
ease of web application clustering and global resource utilization, both of which have been poorly addressed in most
existing clustering solutions in the web community.
We port Tomcat to the JESSICA2 DJVM to validate this clustering approach. Our study addresses the new
challenges of supporting web application servers, whose runtime properties as today's
object-oriented servers differ markedly from the classical scientific applications evaluated in previous DJVM projects. The key challenge lies in making the system scalable with a large number of threads and offering
efficient shared memory support for fine-grain object sharing among the JVMs. We enhance the cache
coherence protocol design accordingly in several aspects: 1. adopt a unified cache among local threads
for better memory utilization; 2. implement a timestamp-assisted HLRC protocol to ensure release
consistency of shared objects; 3. enforce sequential consistency among cluster-wide volatile fields via
a concurrent-read exclusive-write (CREW) protocol. These improvements result in more relaxed

coherence maintenance and higher concurrency. Our experimental results have illustrated the significant cache hits
obtained by using the global object space (GOS) to cache a large application dataset with automatic
consistency guarantee.
Several trends favor the advent of the DJVM clustering technology. First, today's web applications are becoming increasingly resource-intensive due to security enhancements, more complicated
business logic and XML-based standards. The collaborative computing paradigm provisioned by DJVMs
becomes vital for generating a helpful cache effect across cluster nodes for efficient resource usage. Second,
application logic tends to increase in complexity, and more and more application frameworks are now
POJO-based. Clustering at the application level, and the adoption of proprietary clustering mechanisms shipped
with particular application server products, will tend to be laborious and error-prone, if not infeasible. We
foresee that DJVMs, typifying the kind of generic clustering middleware system, will gain more user
acceptance. Third, the design and development of user applications, server programs and library support
nowadays put more emphasis on scalability than ever. When scalability or performance portability
is no longer a problem and DJVMs excel in cost-effectiveness, this will have a catalytic
effect, with more applications readily adopting the DJVM technology.
In the future, we will investigate solutions to enhance fine-grain object sharing efficiency in the DJVM
environment. In our research plans, we would consider incorporating transactional consistency (Hammond et al., 2004) into the cluster-wide memory coherence protocol.

REFERENCES
Adve, S. V., & Hill, M. D. (1993). A Unified Formalization of Four Shared-Memory Models. IEEE
Transactions on Parallel and Distributed Systems, 4(6), 613–624. doi:10.1109/71.242161
Antoniu, G., Bougé, L., Hatcher, P., MacBeth, M., McGuigan, K., & Namyst, R. (2001). The Hyperion
system: Compiling multithreaded Java bytecode for distributed execution. Parallel Computing, 27(10),
1279–1297. doi:10.1016/S0167-8191(01)00093-X
Aridor, Y., Factor, M., & Teperman, A. (1999). cJVM: A Single System Image of a JVM on a Cluster.
Paper presented at the Proceedings of the 1999 International Conference on Parallel Processing.
Aridor, Y., Factor, M., Teperman, A., Eilam, T., & Schuster, A. (2000). Transparently obtaining scalability
for Java applications on a cluster. Journal of Parallel and Distributed Computing, 60(10), 1159–1193.
doi:10.1006/jpdc.2000.1649
ASF. (2002). The Apache Tomcat Connector. Retrieved June 18, 2008, from http://tomcat.apache.org/
connectors-doc/
Ban, B. (1997). JGroups - A Toolkit for Reliable Multicast Communication. Retrieved June 18, 2008,
from http://www.jgroups.org/javagroupsnew/docs/index.html
Factor, M., Schuster, A., & Shagin, K. (2003). JavaSplit: a runtime for execution of monolithic Java
programs on heterogenous collections of commodity workstations. Paper presented at the Proceedings
of the IEEE International Conference on Cluster Computing.

Hammond, L., Wong, V., Chen, M., Carlstrom, B. D., Davis, J. D., & Hertzberg, B. (2004). Transactional Memory Coherence and Consistency. SIGARCH Comput. Archit. News, 32(2), 102.
doi:10.1145/1028176.1006711
Iosevich, V., & Schuster, A. (2005). Software Distributed Shared Memory: a VIA-based implementation
and comparison of sequential consistency with home-based lazy release consistency: Research Articles.
Software, Practice & Experience, 35(8), 755–786. doi:10.1002/spe.656
Johnson, R. (2002). Spring Framework - a full-stack Java/JEE application framework. Retrieved June
18, 2008, from http://www.springframework.org/
JSR166. (2004). Java concurrent utility package in J2SE 5.0 (JDK1.5). Retrieved June 24, 2008, from
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html
Keleher, P., Cox, A. L., Dwarkadas, S., & Zwaenepoel, W. (1994). TreadMarks: Distributed Shared
Memory on Standard Workstations and Operating Systems. Paper presented at the Proceedings of Winter
1995 USENIX Conference.
Keleher, P., Cox, A. L., & Zwaenepoel, W. (1992). Lazy release consistency for software distributed
shared memory. Paper presented at the Proceedings of the 19th annual international symposium on
Computer architecture.
Ma, M. J. M., Wang, C. L., & Lau, F. C. M. (2000). JESSICA: Java-enabled single-system-image computing architecture. Journal of Parallel and Distributed Computing, 60(10), 1194–1222. doi:10.1006/
jpdc.2000.1650
ObjectWeb. (2004). RUBBoS: Bulletin Board Benchmark. Retrieved June 19, 2008, from http://jmob.
objectweb.org/rubbos.html
ObjectWeb. (2005). TPC-W Benchmark (Java Servlets version). Retrieved June 19, 2008, from http://
jmob.objectweb.org/tpcw.html
Perez, C. E. (2003). Open Source Distributed Cache Solutions Written in Java. Retrieved June 24, 2008,
from http://www.manageability.org/blog/stuff/distributed-cache-java
Scales, D. J., & Gharachorloo, K. (1997). Towards transparent and efficient software distributed shared
memory. Paper presented at the Proceedings of the sixteenth ACM symposium on Operating systems
principles.
Stets, R., Dwarkadas, S., Hardavellas, N., Hunt, G., Kontothanassis, L., & Parthasarathy, S. (1997).
Cashmere-2L: software coherent shared memory on a clustered remote-write network. SIGOPS Oper.
Syst. Rev., 31(5), 170–183. doi:10.1145/269005.266675
Veldema, R., Hofman, R. F. H., Bhoedjang, R., & Bal, H. E. (2001). Runtime optimizations for a Java
DSM implementation. Paper presented at the Proceedings of the 2001 joint ACM-ISCOPE conference
on Java Grande.
Wilkinson, T. (1998). Kaffe - a clean room implementation of the Java virtual machine. Retrieved 2002,
from http://www.kaffe.org/

Yu, W., & Cox, A. (1997). Java/DSM: A Platform for Heterogeneous Computing. Concurrency
(Chichester, England), 9(11), 1213–1224. doi:10.1002/(SICI)1096-9128(199711)9:11<1213::AID-CPE333>3.0.CO;2-J
Zhou, Y., Iftode, L., & Li, K. (1996). Performance evaluation of two home-based lazy release consistency
protocols for shared virtual memory systems. SIGOPS Oper. Syst. Rev., 30(SI), 75-88.
Zhu, W., Wang, C. L., & Lau, F. C. M. (2002). JESSICA2: A Distributed Java Virtual Machine with
Transparent Thread Migration Support. Paper presented at the Proceedings of the IEEE International
Conference on Cluster Computing.
Zilka, A. (2006). Terracotta - JVM Clustering, Scalability and Reliability for Java. Retrieved June 19,
2008, from http://www.terracotta.org

KEY TERMS AND DEFINITIONS


Copyset: The current set of nodes or threads that hold a valid cache copy of an object. This data
structure is kept at the home node of the object and is helpful for sending invalidations in a single-writer-multiple-reader cache coherence protocol.
Distributed Java Virtual Machine (DJVM): A parallel execution environment composed of a
collaborative set of extended Java virtual machines spanning multiple cluster nodes for running a multithreaded Java application.
Global Object Space (GOS): A virtualized memory address space for location-transparent object
access and sharing across distributed threads. The GOS for distributed Java virtual machines is built
upon a distributed shared heap architecture.
Java Memory Model (JMM): A memory (consistency) model that defines legal behaviors in a
multi-threaded Java code with respect to the shared memory. The JMM serves as a contract between
programmers and the JVM.
Lazy Release Consistency (LRC): The most widely adopted memory consistency model in software
distributed shared memory (DSM), in which the propagation of shared page/object modifications (in
the form of invalidations/updates) is delayed until lock-acquire time.
Implicit Cooperative Caching (ICC): A helpful cache effect created by distributed threads through
cluster-wide accesses to a collection of shared object references.

ENDNOTE
1. This research was supported by Hong Kong RGC grant (HKU7176/06E) and China 863 grant (2006AA01A111).

Chapter 29

Middleware for Community Coordinated Multimedia

Jiehan Zhou
University of Oulu, Finland
Zhonghong Ou
University of Oulu, Finland
Junzhao Sun
University of Oulu, Finland
Mika Rautiainen
University of Oulu, Finland
Mika Ylianttila
University of Oulu, Finland

DOI: 10.4018/978-1-60566-661-7.ch029

ABSTRACT
Community Coordinated Multimedia (CCM) envisions a novel paradigm that enables the user to consume multiple media through requesting multimedia-intensive Web services via diverse display devices,
converged networks, and heterogeneous platforms within a virtual, open and collaborative community.
These trends yield new requirements for CCM middleware. This chapter aims to systematically and extensively describe middleware challenges and opportunities to realize the CCM paradigm by reviewing
the activities of middleware with respect to four viewpoints, namely mobility-aware, multimedia-driven,
service-oriented, and community-coordinated.

INTRODUCTION
With the popularity of mobile devices (e.g. mobile phone, camera phone, PDA), the advances of mobile
ad hoc networks (e.g. enterprise networks, home networks, sensor networks), and the rapidly increasing
amount of end user-generated multimedia content (e.g. audio, video, animation, text, image), human
experience is being enhanced and extended by the consumption of multimedia content and multimedia
services over mobile devices.


This enhanced human experience paradigm is generalized with the term of Community Coordinated
Multimedia, abbreviated as CCM, in this chapter. The emerging CCM communication takes on the feature of pervasively or wirelessly accessing multimedia-intensive Web services for aggregating, sharing,
viewing TV broadcasting/multicasting services, or on-demand audiovisual content over mobile devices
collaboratively. Thus the end user's experience is enhanced and extended by mobile multimedia communication with transparency in networking, location, synchronization, group communication,
coordination, collaboration, etc. (Zhou et al, 2008a).
Middleware plays a key role in offering the transparent networking, location, synchronization, group
communication, coordination, collaboration, etc. In this chapter, middleware is perceived as a software
layer that sits above the network operating system and below the application layer. It encapsulates the
knowledge from presentation layer and session layer in OSI model that provides controls on the dialogues/
connections (sessions) and the understanding of syntax and semantics between distributed applications,
and abstracts the heterogeneity of the underlying environment between distributed applications.
This chapter presents a survey and initial design of P2P service-oriented community coordinated
multimedia middleware. This work is a part of the EUREKA ITEA2 project CAM4Home1, a metadata-enabled
content delivery and service framework. The chapter investigates technological CCM middleware
challenges and opportunities from four viewpoints that describe the CCM: mobility-aware, multimedia-driven, service-oriented, and community-coordinated. These are the most highlighted characteristics of
CCM applications. The following lists identified middleware categories for addressing challenges and
opportunities in the CCM paradigm:

Middleware for mobility management. The middleware for mobility management aims to provide
mobile access to distributed multimedia applications and services, and addresses the limitations
caused by terminal heterogeneity, network resource limitation, and node mobility.
Middleware for multimedia computing and communication. The middleware for multimedia
computing and communication aims to provide standard formats, specification and techniques for
representing all multimedia types in a digital form, handling compressed digital video and audio
data, and delivery streams.
Middleware for service computing and communication. The middleware for service computing
and communication aims to provide specifications and standards in the context of Web services to
achieve the service-oriented multimedia computing paradigm covering service description, interaction, discovery, and composition.
Middleware for community computing and communication. The middleware for community
computing and communication aims to provide standards and principles which govern the participation of peers into the community and messaging models.

The remainder of the chapter is organized as follows: Section 2 defines concepts relevant to CCM
and middleware. Section 3 illustrates a generic CCM scenario. Section 4 analyzes the requirements
of middleware for CCM. Section 5 designs a middleware architecture for CCM. Section 6 surveys
middleware technology for CCM with respect to mobility-aware, multimedia-driven, service-oriented,
and community coordinated viewpoints. Section 7 discusses the future trends towards the evolution of
CCM. Finally, Section 8 draws a conclusion for the chapter.

Figure 1. Middleware ontology in the context of CCM

DEFINITIONS
This section specifies a few concepts relevant to the CCM paradigm and CCM middleware as follows.
Multimedia: Represents a synchronized presentation of bundled media types, such as text, graphic, image, audio, video, and animation.
Community: Generally defined as a group of a limited number of people held together by common interests and understandings, a sense of obligation and possibly trust (Bender, 1982).
Community Coordinated Multimedia (CCM): A CCM system maintains a virtual community for the consumption of CCM multimedia elements, i.e. both content generated by end users and content from professional multimedia providers (e.g., Video on Demand). The consumption involves a series of interrelated multimedia-intensive processes such as content creation, aggregation, annotation, etc. In the context of CCM, these interrelated multimedia-intensive processes are encapsulated into Web services instead of multimedia applications, namely multimedia-intensive services, or briefly multimedia services.
Standard: Refers to an accepted industry standard. A protocol refers to a set of governing rules in communication between computing endpoints. A specification is a document that proposes a standard.
Middleware: The key technology which integrates two or more distributed software units and allows them to exchange data via heterogeneous computing and communication devices (Quasy, 2004). In this chapter, middleware is perceived as an additional software layer in the OSI model encapsulating knowledge from the presentation and session layers, consisting of standards, specifications, forms, and protocols for multimedia, service, mobility and community computing and communication. Figure 1 illustrates the middleware ontology in relation to the other defined concepts.

Figure 2. An example scenario for CCM as presented in (Zhou et al, 2008b)

USAGE SCENARIO
CCM envisions that user experiences are enriched and extended by the collaborative consumption of
multimedia services through the interplay of two key enabling technologies: Web services and P2P technology. Figure 2 illustrates the CCM paradigm. The usage sequence for the CCM paradigm is given in
(Zhou et al, 2008b).

CCM MIDDLEWARE REQUIREMENTS


Figure 3 illustrates the four major viewpoints which provide a cooperating outline for the specification of
CCM middleware which supports the emerging CCM application solutions (e.g. mobile content creation,
online learning, collaborative multimedia creation, etc.). These four viewpoints are mobility-aware,
multimedia-driven, service-oriented, and community-coordinated perspectives. The requirements

Figure 3. CCM middleware viewpoints

associated with each viewpoint comprise the complete requirements for the CCM middleware specifications.
Mobility-aware CCM. Ubiquitous computing is becoming prominent as small portable devices, and
the wireless networks to support them, have become more and more pervasive. In the CCM system,
content can be consumed, created, analyzed, aggregated, and transmitted over mobile devices. Also
services can be requested, discovered, invoked and interacted over mobile devices. Examples of such
mobile devices are portable computers, PDAs, and mobile phones. Mobile communication is the key
technical enabler allowing mobile service computing to deliver services, including conventional voice
and video on demand over broadband 3G connection. In the last decade, the mobile communication
industry has been one of the most flourishing sectors within the ICT industry. Mobile communication
is a solution that enables flexible integration of smart mobile phones, such as camera phones, to other
computer systems. Smart phones make it possible to access services anywhere and anytime. In the context
of mobility-aware CCM, mobile communication systems and infrastructures play an important role in
delivering services cost-efficiently and with high QoS guarantees to support the user's activities on the move.
Therefore, the CCM middleware systems must provide context management, dynamic configuration,
connection management, etc. to facilitate anytime, anywhere service access.
Multimedia-driven CCM. Digital convergence between audiovisual media, high-speed networks and
smart devices becomes a reality and helps make media content more directly manageable by computers (ERCIM, 2005). It presents new opportunities for CCM applications to enhance, enrich and extend
user daily experiences. The CCM applications are multimedia intensive. These applications include
multimedia content creation, annotation, aggregation, and sharing. The CCM is expected to facilitate
multimedia content management through multimedia-intensive automatic or semi-automatic applications
(e.g. automatic content classification and annotation), or services. The nature of multimedia-driven CCM

yields the requirements for multimedia representation, compression, and streaming delivery.
Service-oriented CCM. The CCM system is expected to employ service-oriented computing (Krafzig
et al, 2005, Erl, 2005) and content delivery networking (Vakali et al, 2003, Dixit et al, 2004) technologies for delivering content by broadcasting (e.g. music, TV, radio). By providing Web sites and tool
suites, the CCM system enables all end users to access content and services with the desired quality and
functionality. This vision takes advantage of the registration, discoverability, composability,
open standards, etc. inherent in the service-orientation approach. Service orientation perceives functions as distinct
service units, which can be registered, discovered, and invoked over a network. Technically, a service
consists of a contract, one or more interfaces, and an implementation (Krafzig et al, 2005). The service
orientation principles for modern software system design are gained and promoted through contemporary service-oriented architecture (SOA) by introducing standardizations to service registration,
semantic messaging platforms, and Web services technology. Therefore the CCM middleware system
must provide description, discovery, and interaction mechanisms in the context of multimedia services.
Multimedia services dealing with multimedia computing such as content annotation are networked and
made available by service discovery.
Community-coordinated CCM. The CCM attempts to build an online community for end users and
service providers by providing the means of managing community memberships and members' contacts. As illustrated in the scenario, end user Bob aggregates content with information that is relevant
within a common subject and sends it to his friend Alice in the community, who is also interested in
it. Moreover, the CCM attempts to provide the end user with individual and customized content and
services by managing CCM users' preferences and profiles. The CCM usually has a large user base for
sharing multimedia. The user base is usually organized in terms of specific interest ties and membership
management. In order for community coordination in multimedia sharing to succeed, the CCM middleware
system must provide standards and principles which govern the participation of peers in the community
(peer management), preference management, and various messaging models.

MIDDLEWARE ARCHITECTURE FOR CCM


Based on the analysis of the middleware requirements from the four viewpoints, a middleware architecture
for CCM is introduced in Figure 4. It comprises four layered categories which abstract the components
for computing and communication into multimedia, service, mobility, and community management perspectives. From top to bottom, the CCM middleware architecture consists of the Community-coordinated,
Service-oriented, Multimedia-driven, and Mobility-aware layers. Each layer is a collection of related
standards and specifications that provides services to manage multimedia data representations, sessions,
and end-to-end collaboration in a heterogeneous computer environment (Table 1).
The lowest layer is the mobility-aware middleware, which aims to provide multimedia service access
to the multimedia user equipped with a portable device anytime and anywhere. Due to the limitations imposed by
terminal heterogeneity, network resources, and node mobility, this mobility-aware middleware layer
must meet various requirements such as management of context, resources, and connections.
The second layer is multimedia-driven middleware, which establishes the context of video and audio
representation, encoding standards, and communications protocol for audio, video, and data. In this
sense, the multimedia-driven middleware layer contains specifications and standards for multimedia
representation and multimedia communication.

Figure 4. The middleware architecture for CCM

Table 1. Overview of the middleware architecture for CCM

Middleware layer: Mobility-aware
  Specification: Context management, connection management, dynamic configuration, and adaptivity
  Keywords: Context-awareness, network connection, mobile nodes
  Related CCM viewpoint: Mobility- and pervasiveness-aware CCM

Middleware layer: Multimedia-driven
  Specification: Multimedia representation, compression, and communication
  Keywords: Multimedia description language; audio, image, and video codecs; multimedia streaming
  Related CCM viewpoint: Multimedia-driven CCM

Middleware layer: Service-oriented
  Specification: Service description languages, messaging formats, discovery mechanisms
  Keywords: Service description, service discovery, service composition
  Related CCM viewpoint: Service-oriented CCM

Middleware layer: Community-coordinated
  Specification: Principles and rules for grouping and messaging management
  Keywords: Peer, group, messaging modes, etc.
  Related CCM viewpoint: Community-coordinated CCM
The third layer is service-oriented middleware, which comprises specifications and standards
which allow traditional software applications to exchange data with one another as they participate in
multimedia business processes. These specifications include XML, SOAP, WSDL, UDDI, etc. By using
service-oriented middleware, the traditional multimedia applications are transformed into multimedia
services which are made accessible over a network and can be combined and reused in the development
of multimedia applications.
The top layer is community-coordinated middleware, which establishes the technical context that
allows any peer connected to a network to exchange messages and collaborate independently of the
underlying network topology. This ultimately leads to the idea of creating a community-coordinated
multimedia communication environment for social, professional, educational, or other purposes.

SURVEY IN MIDDLEWARE FOR CCM


This section aims to elaborate the middleware layers identified above. On the one hand, this section
describes the state of the art of middleware with respect to four major viewpoints, i.e. multimedia-driven, mobility-aware, service-oriented, and community-coordinated CCM. On the other hand, this
section presents a feasible and integrated middleware solution to meet the generic CCM requirements
specified in Section 4.

MIDDLEWARE FOR MOBILITY-AWARE CCM


Limitations and Requirements
The rapid growth of wireless technologies and the development of smaller and smaller portable devices
have led to the widespread use of mobile computing. Any user, equipped with a portable device, is able
to access any service anytime and anywhere. Mobile access to distributed applications and services
brings with it a great number of new issues. The limitations caused by the inherent characteristics of
mobility are as follows:

Terminal heterogeneity (Gaddah et al, 2008). Mobile devices usually have diverse physical capabilities, e.g. CPU processing, storage capacity, power consumption, etc. For example, laptops
own much more storage capacity and provide faster CPU processing capability, while pocket
PCs and mobile phones usually have far fewer available resources. Though mobile terminal
technology has progressed rapidly in recent years, it is still impossible to make mobile
devices as competitive as fixed terminals. Hence, middleware should be designed to achieve optimal resource utilization.
Network resource limitation. Compared with fixed networks, the performance of wireless networks (GPRS, UMTS, Beyond 3G networks, Wi-Fi/WiMAX, HiperLAN, Bluetooth, etc.) varies significantly depending on the protocols and technologies involved. Meanwhile, mobile devices may encounter sharp drops in network bandwidth, high interference, or temporary disconnection when moving around different areas. Therefore, the middleware should be designed in a way that intrinsically takes into account the optimization of limited network resources.


Mobility. Owing to node mobility, when mobile devices move from one place to another, they will have to deal with different types of networks, services, and security policies. In turn, this requires applications to behave accordingly and to handle the various dynamic changes of the environment parameters. Hence, the design of middleware should take the mobility of nodes into consideration as well.
Various requirements of middleware for mobility-aware CCM are as follows:


Context management. The characteristics of mobile networks in the CCM environment are intermittent network connections and limited network bandwidth. Disconnections can happen frequently, either for active reasons, i.e. to save power, or for passive reasons, such as temporary lack of coverage or high interference. In order to deal with disconnection effectively and efficiently, the context of the middleware should be disclosed to the upper application layer instead of being hidden from it, which makes application development much easier. Bellavista (Bellavista et al, 2007) summarized the context in a mobile computing environment as three categories: network context, device context, and user context (a minimal data-model sketch is given after this list). The network context consists of the adopted wireless technology, available bandwidth, addressing protocol, etc. The device context includes details on the status of the available resources, such as CPU, battery, memory, etc. The user context, in turn, is composed of information related to the location, preferences, and QoS requirements of the user.
Dynamic reconfiguration (Gaddah et al, 2008). During the CCM application lifetime, dynamic changes in infrastructure facilities, e.g. the availability of certain services, will require the application behavior to be altered accordingly. Therefore, dynamic reconfiguration is needed in such environments. It can be achieved by adding new functionality or changing existing functionality at CCM application runtime. To support dynamic reconfiguration, middleware should be able to detect changes in the available resources and adopt corresponding approaches to deal with them. Reflective middleware is the most widely adopted solution to this problem.
Connection management. User mobility and intermittent wireless signals result in frequent disconnection and reconnection of mobile devices, which is exceptional in fixed distributed systems. Therefore, middleware for mobility-aware CCM should adopt connection management mechanisms different from those of fixed distributed systems. An asynchronous communication mechanism is usually used to decouple the client from the server; tuple space systems are one of the typical solutions. Another issue related to connection management is the provision of services based on the concept of a session. In this case, a proxy can be adopted to hide disconnections from the service layer.
Resource management. Mobile devices are characterized by their limited resources, such as battery, CPU, memory, etc. Hence, mobile middleware for CCM should be lightweight enough
to avoid overloading the mobile devices. Currently, middleware platforms designed for fixed distributed systems, e.g. CORBA, are too heavy to run on mobile devices as they usually have a
number of functionalities which are not necessarily needed in resource-limited devices. Modular
middleware design is adopted widely to make the middleware more lightweight.
Adaptability. In mobile CCM, adaptability mainly refers to the ability to adapt to context changes dynamically. Depending on the currently available resources, adaptability allows middleware to optimize the system behavior by choosing the protocol suite that best suits the current environment, integrating new functionalities and behaviors into the system, and so on.
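To make the three context categories more concrete, the following is a minimal, illustrative data-model sketch in Java; all class and field names are assumptions chosen for this example and do not come from Bellavista's work or any particular middleware API.

```java
// Illustrative data model for the three context categories discussed above.
// All class and field names are hypothetical examples.
public final class MobileContext {

    public static final class NetworkContext {
        String wirelessTechnology;   // e.g. "UMTS", "Wi-Fi", "Bluetooth"
        int availableBandwidthKbps;  // currently measured bandwidth
        String addressingProtocol;   // e.g. "IPv4", "IPv6"
    }

    public static final class DeviceContext {
        int cpuLoadPercent;          // current CPU utilization
        int batteryLevelPercent;     // remaining battery charge
        long freeMemoryBytes;        // available memory
    }

    public static final class UserContext {
        String location;             // symbolic or geographic location
        String preferences;          // user preferences (codec, language, ...)
        String qosRequirement;       // e.g. "low-latency", "best-effort"
    }

    final NetworkContext network = new NetworkContext();
    final DeviceContext device = new DeviceContext();
    final UserContext user = new UserContext();

    // A context-aware middleware would expose an object like this to the
    // application layer so that applications can react to context changes
    // instead of having them hidden by the middleware.
}
```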

STATE OF THE ART IN MIDDLEWARE FOR MOBILITY-AWARE CCM


Traditional middleware for fixed distributed systems is too heavy to be used in mobile computing environments. In order to provide new solutions, research has progressed along two distinct directions over the last decade (Cotroneo et al, 2008): (1) extending traditional middleware implementations with primitive mobile-enabled capabilities (e.g. Wireless CORBA (Kangasharju, 2002)), and (2) proposing middleware that adopts mobile-enabled computing models (e.g. LIME (Murphy et al, 2001)). The former adopts a more effective computing model, but does not effectively overcome the intrinsic limitation of the synchronous remote procedure call; the latter adopts decoupled interaction mechanisms, but fails to provide a high-level and well understood computing model abstraction (Migliaccio, 2006; Quasy, 2004). Following similar categories, we divide middleware for mobility-aware CCM into four categories: extended traditional middleware, reflective middleware, tuple space middleware, and context-aware middleware.
Extending traditional middleware. To enable mobile devices to interoperate with existing fixed networks, object-oriented middleware has been proposed for mobile environments. Wireless CORBA is the CORBA specification for Wireless Access and Terminal Mobility (OMG, 2002). The overall system architecture is divided into three separate domains: the home domain, the terminal domain, and the visited domain. In ALICE (Haahr et al, 1999), mobile devices running the Windows CE operating system and equipped with GSM sensors were adopted in order to support the client/server architecture in nomadic environments. The main focus of the existing examples of extending traditional middleware is on the provision of services from a backbone network to the network edge, i.e. mobile devices. Therefore, the main concerns are how to deal with connectivity and how to exchange messages. However, in cases where the networks are unstructured and the services have to be provided by the mobile devices, traditional middleware does not work well and new paradigms have to be put forward. This has motivated the emergence of, e.g., reflective middleware, tuple space middleware, and context-aware middleware.
Reflective middleware. The primary motivation of reflective middleware is to increase its adaptability to the changing environment. A reflective system consists of two levels, referred to as the meta-level and the base-level. The former performs computation on the objects residing in the lower level; the latter performs computation on the application domain entities (Gaddah et al, 2008). Open-ORB (Blair et al, 2002) and Globe (Steen et al, 1999) are two examples of middleware that utilize the concept of reflection.
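The meta-level/base-level split can be illustrated with a small Java sketch; this is not the Open-ORB or Globe API, and all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Base level: the components that do the actual application work.
interface Transport {
    void send(byte[] data);
}

// Meta level: holds a representation of the base-level components and can
// inspect or replace them while the application keeps running.
final class MetaLevel {
    private final Map<String, Transport> bindings = new HashMap<>();

    void bind(String name, Transport impl) { bindings.put(name, impl); }

    Transport inspect(String name) { return bindings.get(name); }

    // Reconfiguration: swap the implementation behind a binding, e.g. when
    // the device roams from Wi-Fi to GPRS and a lighter protocol is needed.
    void replace(String name, Transport newImpl) { bindings.put(name, newImpl); }
}
```

The application keeps invoking the component through its binding name; when the context changes, the meta-level swaps the implementation without the application code being modified.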
Tuple space middleware. The characteristics of the wireless propagation environment make the synchronous communication mechanism typical of most traditional distributed systems unsuitable for mobile applications. One solution to this is the so-called tuple space. A tuple space is a globally shared, associatively addressed memory space that is organized as a bag of tuples (Ciancarini, 1996). Client processes can create tuples using the write operation and later retrieve them using the read operation. LIME (Murphy et al, 2001), TSpaces (Wyckoff et al, 1998), and JavaSpaces (Bishop et al, 2002) are examples of tuple space based systems.
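The write/read interaction style can be sketched in plain Java as follows; this is a deliberately simplified illustration of the tuple space idea, not the LIME, TSpaces, or JavaSpaces API.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// A deliberately simplified tuple space: a shared bag of tuples with a
// non-blocking write and a blocking, template-matched read.
final class SimpleTupleSpace {
    private final List<Object[]> tuples = new LinkedList<>();

    // write: add a tuple to the space.
    public synchronized void write(Object... tuple) {
        tuples.add(tuple);
        notifyAll();
    }

    // read: block until a tuple matching the template is present.
    // A null field in the template is a wildcard that matches anything.
    // (The tuple is not removed, mirroring the Linda "read" semantics.)
    public synchronized Object[] read(Object... template) throws InterruptedException {
        while (true) {
            for (Object[] t : tuples) {
                if (matches(template, t)) {
                    return Arrays.copyOf(t, t.length);
                }
            }
            wait(); // producer and consumer are decoupled in time
        }
    }

    private static boolean matches(Object[] template, Object[] tuple) {
        if (template.length != tuple.length) return false;
        for (int i = 0; i < template.length; i++) {
            if (template[i] != null && !template[i].equals(tuple[i])) return false;
        }
        return true;
    }
}
```

For example, a producer can call space.write("videoClip", 42, payload) and a consumer can later call space.read("videoClip", null, null); the two parties never need to be connected at the same time, which matches the intermittent connectivity of mobile nodes.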
Context-aware middleware. Mobile systems are characterized by a dynamic execution context due to the mobility of the devices. The context information has to be exposed to the application layer to make it adaptable to the corresponding changes that happen in the lower levels. Context-aware computing was first proposed in (Schilit et al, 1994; Haahr et al, 1999).


Table 2. Overview of standards and specifications for multimedia representation


Specification | Key notes | Role in CCM
DCMI | Element, qualifier, application, profile | Document description
DICOM | Image specific | Image specific
SMDL | Music data, SGML-based | Music data
MULL | Course, XML-based | Multimedia course preparation
MRML | Multimedia, XML-based | Multimedia
EDL | Sessions, XML-based | Session description
SMIL | Audiovisual, XML-based | AV data
SMEF | Media, SMEF-DM | Media description
P/Meta | Metadata, XML-based, P/Meta scheme | Metadata framework
SMPTE | Metadata, XML-based | Metadata framework
MXF | Audio-visual, SMPTE-based | AV data description
SVG | Describing 2D graphics, XML-based | Describing 2D graphics
TV-Anytime | Audio-visual, descriptor, preferences | AV data description
MPEG-7 | Multimedia content data, interactive, integrated audio-visual | Multimedia description
MPEG-21 | Common framework, multimedia delivery chain, digital item | Common multimedia description framework

Since then, much research interest has poured into this field, but most approaches focus on location awareness; e.g., Nexus (Fritsch et al, 2000) was designed to provide various kinds of location-aware applications. Some other approaches have investigated the feasibility of utilizing reflection in the context of mobile systems to offer dynamic context-awareness and adaptation mechanisms (Roman et al, 2001).

MIDDLEWARE FOR MULTIMEDIA-DRIVEN CCM


In order to support multimedia content transmission over various networks, the issues of semantic multimedia representation, multimedia storage capacity, and delivery delay must be taken into consideration. Middleware for multimedia-driven CCM aims to abstract the knowledge about multimedia representation and communication, and comprises specifications and standards for multimedia representation, compression, and communication.

STANDARDS FOR MULTIMEDIA REPRESENTATION AND COMPRESSION


Table 2 presents an overview of standards and specifications for multimedia representation. A brief description of these specifications is given below; details can be found in the corresponding references.
Dublin Core Metadata Initiative (DCMI). In the Dublin Core (DC) (Stuart et al, 2008), the description of information resources is created using Dublin Core elements, and may be refined or further explained by a qualifier. Qualification schemes are used to ensure a minimum level of metadata interoperability. No formal syntax rules are defined.


Table 3. Some compression standards for multimedia


Specification | Key notes | Role in CCM
JPEG | Image, discrete cosine transform-based, codec specification, ISO standard | Compression for single images
JPEG-2000 | Image, wavelet-based, greater decompression time than JPEG | Compression for single images
MPEG-1 | Lossy video and audio compression, MP3, ISO standard | Compression for video and audio
MPEG-2 | Lossy video and audio compression, popular DTV format, ISO standard | Compression for video and audio
MPEG-4 | AV compression for web, CD distribution, voice and TV applications | Compression for video and audio

DCMI evolution involves extending the element set, the description of images, standardization, special interests, and the metadata scheme. The Digital Imaging and Communications in Medicine (DICOM) standard (ACR, 2008) is used for the exchange of images and related information. The DICOM standard provides several kinds of support, including image exchange between senders and receivers, retrieval of image information, and image management. Standard Music Description Language (SMDL) (ISO/IEC, 2008) defines an architecture for the representation of music information, either alone or in conjunction with text, graphics, or other information needed for publishing or business purposes. MUltimedia Lecture description Language (MULL) (Polak et al, 2001) makes it possible to modify and control a remote presentation. Multimedia Retrieval Markup Language (MRML) (MRML, 2008) aims to unify access to multimedia retrieval and management software components in order to extend their capabilities. Event Description Language (EDL) (Rodriguez, 2002) describes advanced multimedia sessions for supporting multimedia service management, provision, and operation. The Synchronized Multimedia Integration Language (SMIL) (SMIL, 2005) enables
simple authoring of interactive audiovisual presentations, which integrate streaming audio and video
with images, text, or any other media type. Standard Media Exchange Framework (SMEF) (BBC, 2005)
is defined by BBC to support and enable media asset production, management, and delivery. P/Meta
(Hopper, 2002) is developed for content exchange by providing the P/Meta Scheme which consists of
common attributes and transaction sets for P/Meta members such as content creators and distributors.
Metadata Dictionary & Sets Registry (SMPTE) (SMPTE, 2004) creates the Metadata Dictionary (MDD)
and a sets registry. The MDD dynamic document encompasses all the data elements considered relevant
by the industry. The sets registry describes the business purpose and the structure of the sets. The Material eXchange Format (MXF) (Pro-MPEG, 2005) aims to support the interchange of audio-visual
material with associated data and metadata. Scalable Vector Graphics (SVG) (Watt et al, 2003) describes
2D graphics and graphical applications in XML. It contains two parts: (1) an XML-based file format and
(2) a programming API for graphical applications. TV-Anytime metadata (TV-Anytime, 2005) consists
of the attractors/descriptors used, e.g. in Electronic Program Guides (EPG), or in Web pages to describe
content. Multimedia Content Description Interface (MPEG-7) (ISO/IEC, 2003) describes the multimedia
content data that supports some degree of interpretation of the information meaning, which can be passed
onto, or accessed by a device or a computer code. The MPEG-21 multimedia framework (Burnett, 2006)
identifies and defines the key elements needed to support the multimedia delivery chain.
Table 3 presents widely used compression techniques that are in part competitive and in part complementary. The details about the standards and specifications are given as follows.


Table 4. Protocols and specifications for multimedia communication


Middleware | Specification
Remote Procedure Call (RPC) | Procedure-oriented call, synchronous interaction model
Remote Method Invocation (RMI) | Object-oriented RPC, object-oriented references
Message Oriented Middleware | Message-oriented communication, asynchronous interaction model
Stream-oriented communication | Continuous asynchronous, synchronous, isochronous, and QoS-specified multimedia transmission

The ISO JPEG standard (Pennebaker et al, 1992) defines how an image is compressed into a stream of bytes
using discrete cosine transform and decompressed back into an image. It also defines the file format used
to contain that stream. JPEG 2000 (Taubman et al, 2001) is an image compression standard advanced
from JPEG. It is based on wavelet-based compression, which requires longer decompression time than
JPEG and allows more sophisticated progressive downloads. MPEG-1 (Harte et al, 2006) is a standard for lossy compression of video and audio. It was used, for example, as the standard for video CDs, but later video disc formats adopted newer codecs. It also defines the well-known MP3 audio compression format. MPEG-2 (Harte et al, 2006) describes several lossy video and audio compression methods for various purposes. It is widely used in terrestrial, cable, and satellite digital television formats. MPEG-4 (Harte et al, 2006) defines newer compression techniques for audio and video data. H.264/AVC (Richardson et al, 2003), also known as MPEG-4 Part 10, is a video compression standard widely utilized in modern mobile TV standards and specifications. Its audio counterpart is AAC, defined in MPEG-4 Part 3.

MIDDLEWARE FOR MULTIMEDIA COMMUNICATION


The CCM multimedia applications support the view that local multimedia systems expand towards
distributed solutions. Applications such as multimedia creation, aggregation, consumption, and others
require high speed networks with a high transfer rate. Multimedia communication sets several requirements on services and protocols, e.g. processing of AV data needs to be bounded by deadlines or by a time
interval. Multimedia communication standards and protocols can be categorized into Remote Procedure
Call (RPC) based (Nelson, 1981, Tanenbaum et al, 2008), Message Oriented Middleware (MOM) based
(Tanenbaum et al, 2008, Quasy, 2004), Remote Method Invocation (RMI) based (Tanenbaum et al, 2008),
and Stream based (Tanenbaum et al, 2008, Halsall, 2000). They define the middleware alternatives for
multimedia communication (Table 4). Details about the protocols and specifications are given below
with relevant references.
Remote procedure call (RPC) (Nelson, 1981; Tanenbaum et al, 2008) allows a software program to invoke a subroutine or procedure that executes on another computer. With remote procedure calls, the programmer writes the subroutine call in the same way whether the subroutine is local or remote. Remote Method Invocation (RMI) (Nelson, 1981; Tanenbaum et al, 2008) is another RPC paradigm based on distributed objects. In the case of Java RMI, the programmer can create applications consisting of Java objects residing on different host computers. Message-oriented middleware (MOM) (Tanenbaum et al, 2008; Quasy, 2004) typically supports asynchronous calls between the client and server. With the message queue mechanism, MOM reduces the involvement of application developers.

For example, applications send a message to logical contact points or indicate their interest in a specific type of message. As examined in the CCM scenario, CCM multimedia communication involves multiple media types, i.e. audio and video. It therefore becomes necessary for CCM to use stream-oriented middleware for streaming multimedia, whose purpose is to support continuous asynchronous, synchronous, isochronous, and QoS-specified media transmission. Examples of stream-oriented middleware (Tanenbaum et al, 2008) are MPEG-TS (Harte, 2006), the Resource ReSerVation Protocol (RSVP) (Liu et al, 2006), and the Real-time Transport Protocol (RTP) (Perkins, 2003). MPEG-TS (Harte, 2006) is designed to allow multiplexing of digital video and audio and to synchronize the output. RSVP (Liu et al, 2006) is a transport layer protocol designed to reserve resources across a network for an integrated services Internet. RTP (Perkins, 2003) defines a standardized packet format for delivering audio and video over the Internet.
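As an illustration of the RMI style mentioned above, the following minimal Java RMI sketch exposes a remote object and invokes it as if it were local; the service name and method are hypothetical examples, not part of the chapter's scenario.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: the contract visible to clients on other hosts.
interface PlaylistService extends Remote {
    String nextClip(String channelId) throws RemoteException;
}

// Server-side implementation, exported as a remote object.
class PlaylistServiceImpl extends UnicastRemoteObject implements PlaylistService {
    protected PlaylistServiceImpl() throws RemoteException { super(); }

    public String nextClip(String channelId) throws RemoteException {
        return "clip-001-for-" + channelId; // placeholder content
    }
}

public class RmiSketch {
    public static void main(String[] args) throws Exception {
        // Server side: start a registry and bind the remote object under a name.
        LocateRegistry.createRegistry(1099);
        Naming.rebind("rmi://localhost/PlaylistService", new PlaylistServiceImpl());

        // Client side: look up the stub and invoke the method as if it were local.
        PlaylistService svc = (PlaylistService) Naming.lookup("rmi://localhost/PlaylistService");
        System.out.println(svc.nextClip("42"));
    }
}
```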

MIDDLEWARE FOR SERVICE-ORIENTED CCM


This section discusses middleware for service-oriented CCM, which consists of standards and specifications that govern the conversion of conventional multimedia applications into a service-oriented computing environment. These standards and specifications are based on several notable Web service technologies, i.e. XML (Ray, 2003), WSDL (WSDL, 2005; Erl, 2005), UDDI (UDDI, 2004), SOAP (SOAP, 2003), and BPEL (Thatte, 2003). See Figure 5.

Figure 5. The relationship between Web service technologies
The eXtensible Markup Language (XML) (Ray, 2003) is used to represent
information objects consisting of elements (e.g. tags and attributes). XML defines the syntax for markup
languages. XML Schema allows definition of languages in a machine readable format. Web Services
Description Language (WSDL) (WSDL, 2005) is an XML-based language for describing Web services
in a machine understandable form. WSDL describes and exposes a web service using major elements
of portType, message, types, and binding. The portType element describes the operations performed
by a web service. The message element defines the data elements of an operation. The types element


defines the data type used by the web service. The binding element defines the message format and
protocol details for each port. Universal Description Discovery and Integration (UDDI) (UDDI, 2004)
is regarded as a specification of the service, service definition, and metadata hub for service-oriented
architecture. Various structural templates are provided by UDDI for representing data about business
entities, their services, and the mechanisms for governing them. The UDDI upper service model consists
of a BusinessEntity (who), a BusinessService (what), a BindingTemplate (how and where) and a tModel
(service interoperability). XML Schema Language is used in UDDI to formalize its data structures.
Simple Object Access Protocol (SOAP) (SOAP, 2003) is a protocol for exchanging XML-based messages over computer networks. One of the most common SOAP messaging patterns is the Remote Procedure Call (RPC) pattern, in which the client communicates with the server through request/response messages. Business Process Execution Language for Web Services (BPEL, also WS-BPEL or BPEL4WS) (Thatte, 2003) provides a flexible way to define business processes composed of services. BPEL supports executable processes and abstract processes. Executable processes allow the exact details of business processes to be specified and can be executed by an orchestration engine. The process descriptions for business protocols are called abstract processes, which allow the public message exchanges between parties to be specified. With BPEL, complex business processes can be defined in an algorithmic manner (Thatte, 2003).
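To make the SOAP request/response pattern concrete, the following sketch uses the SAAJ API (javax.xml.soap) that ships with older Java SE releases; the endpoint URL, namespace, and operation name are hypothetical assumptions.

```java
import javax.xml.namespace.QName;
import javax.xml.soap.*;

public class SoapClientSketch {
    public static void main(String[] args) throws Exception {
        // Build a SOAP request whose body carries one RPC-style operation.
        MessageFactory messageFactory = MessageFactory.newInstance();
        SOAPMessage request = messageFactory.createMessage();
        SOAPBody body = request.getSOAPBody();
        // Namespace and operation name are illustrative only.
        QName operation = new QName("http://example.org/ccm", "getPlaylist", "m");
        SOAPElement call = body.addChildElement(operation);
        call.addChildElement("channelId").addTextNode("42");
        request.saveChanges();

        // Send it to a (hypothetical) multimedia Web service endpoint and
        // receive the response synchronously (request/response pattern).
        SOAPConnection connection = SOAPConnectionFactory.newInstance().createConnection();
        SOAPMessage response = connection.call(request, "http://example.org/ccm/mediaService");
        response.writeTo(System.out);
        connection.close();
    }
}
```

In a service-oriented CCM setting, the operation and its message formats would be described by the service's WSDL document, and the endpoint would typically be discovered through a UDDI registry rather than hard-coded.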

MIDDLEWARE FOR COMMUNITY-COORDINATED CCM


In the CCM scenario, the end users' experience is enriched and extended by community-coordinated multimedia. In the case of a community-coordinated TV channel, TV viewers can watch video clips that they themselves have uploaded, add comments, and even vote for them. The user preference profile is maintained in the community coordinator. The moderator moderates the incoming videos and compiles a playlist for the TV program. The moderator also filters the incoming comments and chooses which can be shown on the program. This section discusses the standards and principles that govern the participation of peers in the community (peer management); messaging models, e.g. point-to-point, publish/subscribe, multicast, and broadcast; and profile management, new P2P SIP features, coordination, etc.

CLASSIFICATION OF COMMUNITIES
From the technical point of view, user communities can be classified into private and public communities, as done in the JXTA project (Oaks et al, 2002). When taking the purpose of communities into account, these two fundamental classes can be further divided at least into social, commercial, and professional communities (Koskela et al, 2008), a distinction that can be captured to some extent in an attribute-based system.
In practice, there will probably be situations where the members of a public community do not want to reveal their memberships to nodes outside their sub-community. These kinds of communities, where some of the members do not publish their membership to the main overlay, are called partially private communities. However, to maintain the community, at least one of the peer members must publish its membership to the main overlay (Koskela, 2008).


REQUIREMENTS FOR COMMUNITY MIDDLEWARE


The requirements of middleware for community are initially specified as messaging management and peer management.
Messaging management. Messaging management is crucial for community middleware, as it provides the basic communication methods based on the messaging models.
Peer management. The peer management functionality manages the formation of the peer group, the scale of the community, and the joining and leaving of peers.

SURVEY ON MIDDLEWARE TECHNOLOGY FOR COMMUNITY-COORDINATED CCM
Messaging models. A solid understanding of the available messaging models is crucial to understanding the unique capabilities each provides. Four main messaging models are commonly available: unicast, broadcast, multicast, and anycast. The unicast model, also known as the point-to-point messaging model, provides a straightforward exchange of messages between software entities. Broadcast is a very powerful mechanism used to disseminate information between anonymous message consumers and producers. It provides a one-to-many distribution mechanism where the number of receivers is not limited. The multicast model (Pairot et al, 2005) is a variation of broadcast. It works by sending a multicast message to a specific group of members. The main difference between broadcast and multicast is that multicast only sends messages to the members of a subscribed group, while broadcast sends messages to everyone without any membership limitation. The broadcast model can also be implemented as a publish/subscribe messaging model so that it resembles multicast. The anycast model means sending an anycast notification to a group, which will make the group member closest to the sender in the network answer, as long as it satisfies a condition (Pairot et al, 2005). This feature is very useful for retrieving object replicas from the service network.
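A minimal Java sketch of the multicast model using java.net.MulticastSocket is shown below; the group address, port, and payload are arbitrary example values.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

public class MulticastSketch {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("230.0.0.1"); // example group address
        int port = 4446;                                        // example port

        // Receiver: joining the group corresponds to subscribing to it;
        // only group members receive the messages (unlike broadcast).
        MulticastSocket receiver = new MulticastSocket(port);
        receiver.joinGroup(group);

        // Sender: one send reaches every member of the subscribed group.
        // (For illustration both ends run in one process; in a real community
        // the sender and the receivers would run on different nodes.)
        byte[] payload = "new clip available".getBytes("UTF-8");
        MulticastSocket sender = new MulticastSocket();
        sender.send(new DatagramPacket(payload, payload.length, group, port));
        sender.close();

        // Receive the message on the subscriber side.
        byte[] buffer = new byte[256];
        DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
        receiver.receive(packet);
        System.out.println(new String(packet.getData(), 0, packet.getLength(), "UTF-8"));
        receiver.leaveGroup(group);
        receiver.close();
    }
}
```

Joining the group plays the role of subscribing, and leaving it the role of unsubscribing, which is why multicast maps naturally onto community membership.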
Peer management protocols. For the purpose of interoperability and other peer management functionalities, the Internet Engineering Task Force (IETF) recently founded a Peer-to-Peer Session Initiation Protocol (P2PSIP) working group. The P2PSIP working group is chartered to develop protocols and mechanisms for the use of the Session Initiation Protocol (SIP) in settings where the service of establishing and managing sessions is principally handled by a collection of intelligent endpoints, rather than by centralized servers as in SIP as currently deployed (P2PSIP, 2008). There are two different kinds of nodes in P2PSIP networks: P2PSIP peers and P2PSIP clients. P2PSIP peers participate in the P2PSIP overlay network, provide routing information to other peers, etc. P2PSIP clients do not participate in the P2PSIP overlay network, but instead utilize the services provided by the peers to locate users and resources. In this way, P2PSIP can determine the correct destination of SIP requests through this distributed mechanism. The other functionalities, e.g. session management, messaging, and presence functions, are performed using conventional SIP. The work of the P2PSIP working group is still in progress, but it has put forward some peer protocols, such as RELOAD (Jennings et al, 2008), SEP (Jiang et al, 2008), etc. for the management of peers, and two client protocols (Pascual et al, 2008; Song, 2008) to manage the clients. Furthermore, JXTA also supports the community concept, which it calls a group. It provides a dedicated Membership Service to manage group-related issues.


FUTURE TRENDS
The trend of CCM is towards delivering multimedia services with a customized quality over heterogeneous networks, which enables multimedia services to be adapted to any IP-based mobile and P2P content delivery network. The future work on middleware for CCM is identified as follows:

Context-aware middleware. Context-aware middleware provides mobile applications with the necessary knowledge about the execution context in order to make them adapt to dynamic changes in mobile conditions. However, most current systems focus only on location awareness; thus, there is no middleware that can fully support all the requirements of mobile applications. Further research is still needed.
QoS-aware middleware. The strong motivation for QoS-aware middleware comes from meeting stringent QoS requirements such as predictability, latency, efficiency, scalability, dependability, and security. The goal is to help accelerate the software process by making it easier to integrate parts together and by shielding developers from many inherent and accidental complexities, such as platform and language heterogeneity, resource management, and fault tolerance (Quasy, 2004). The extensions of the Web service specifications, i.e. the WS-* specifications (Erl, 2005), provide a means to assert control over QoS management.
Middleware for multimedia service delivery over 4G networks. The motivation for 4G network operators is to provide multimedia services for mobile devices. Incorporating IMS (Camarillo, 2006) into mobile multimedia services is part of the vision for evolving mobile networks beyond GSM.
Middleware for multimedia service delivery over P2P SIP. P2P technologies have been widely used on the Internet for file sharing and other applications, including VoIP, instant messaging, and presence. This research continues the study of community middleware and extends the capabilities of delivering multimedia services to mobile devices over P2P networks, especially by employing SIP session management.

CONCLUSION
Community Coordinated Multimedia presents a novel usage paradigm for consuming multimedia through requesting multimedia-intensive Web services via diverse terminal devices, converged networks, and heterogeneous platforms within a virtual, open, and collaborative community. In order to realize this paradigm, this chapter focused on the key enabling technology of middleware for CCM. It started with the definition of concepts relevant to CCM and the specification of a middleware ontology in the context of CCM. Then a generic CCM scenario was described and the requirements for CCM middleware were
analyzed with respect to the characteristics of mobility-aware, multimedia-driven, service-oriented,
and community-coordinated CCM. A middleware architecture for CCM was introduced to address the
requirements from four viewpoints. Each part of the middleware architecture for CCM was surveyed.
Finally, the future trends in the evolution of CCM middleware were discussed.


ACKNOWLEDGMENT
This work is being carried out in the EUREKA ITEA2 CAM4Home project funded by the Finnish Funding Agency for Technology and Innovation (Tekes).

REFERENCES
P2PSIP Working Group. (2008). Peer-to-Peer Session Initiation Protocol Specification. Retrieved June
15th, 2008, from http://www.ietf.org/html.charters/p2psip-charter.html
ACR-NEMA. (2005). DICOM (Digital Image and Communications in Medicine). Retrieved June 15th,
2008, from http://medical.nema.org/
BBC. (2005). SMEF- Standard Media Exchange Framework. Retrieved June 15th, 2008, from http://
www.bbc.co.uk/guidelines/smef/.15th June, 2008.
Bellavista, P., & Corradi, A. (2007). The Handbook of Mobile Middleware. New York: Auerbach publications.
Bender, T. (1982). Community and Social Change in America. Baltimore, MD: The Johns Hopkins
University Press.
Bishop, P., & Warren, N. (2002). JavaSpaces in Practice. New York: Addison Wesley.
Blair, G. S., Coulson, G., Blair, L., Duran-Limon, H., Grace, P., Moreira, R., & Parlavantzas, N. (2002). Reflection, self-awareness and self-healing in OpenORB. In WOSS '02: Proceedings of the First Workshop on Self-Healing Systems (pp. 9-14).
Burnett, I. (2006). MPEG-21: Digital Item Adaptation - Coding Format Independence, Chichester,
UK. Retrieved 15th June, 2008, from http://www.ipsi.fraunhofer.de/delite/projects/mpeg7/Documents/
mpeg21-Overview4318.htm#_Toc523031446.
Ciancarini, P. (1996). Coordination Models and Languages as Software Integrators. ACM Computing Surveys, 28(2), 300–302. doi:10.1145/234528.234732
Cotroneo, D., Migliaccio, A., & Russo, S. (2007). The Esperanto Broker: a communication platform
for nomadic computing systems. Software: Practice & Experience, 37(10), 1017–1046. doi:10.1002/spe.794
Dixit, S., & Wu, T. (2004). Content Networking in the Mobile Internet. New York: John Wiley &
Sons.
ERCIM. (2005). Multimedia Informatics. ERCIM News, 62.
Erl, T. (2005). Service-Oriented Architecture (SOA): Concepts, Technology, and Design. Upper Saddle
River, NJ: Prentice Hall.


Fritsch, D., Klinec, D., & Volz, S. (2000). NEXUS positioning and data management concepts for location
aware applications. In the 2nd International Symposium on Telegeoprocessing (Nice-Sophia-Antipolis,
France), (pp. 171-184).
Gaddah, A., & Kunz, T. (2003). A survey of middleware paradigms for mobile computing. Carleton
University and Computing Engineering [Research Report]. Retrieved June 15th, 2008, from http://www.
sce.carleton.ca/wmc/middleware/middleware.pdf
Camarillo, G., & García-Martín, M.-A. (2006). The 3G IP Multimedia Subsystem (IMS): Merging the
Internet and the Cellular Worlds. New York: Wiley.
Haahr, M., Cunningham, R., & Cahill, V. (1999). Supporting CORBA applications in a mobile environment. In MobiCom 99: Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile
Computing and Networking, (pp. 36-47).
Halsall, F. (2000). Multimedia Communications: Applications, Networks, Protocols and Standards
(Hardcover). New York: Addison Wesley.
Harte, L., Wiblitzhouser, A., & Pazderka, T. (2006). Introduction to MPEG; MPEG-1, MPEG-2 and
MPEG-4. Fuquay Varina, NC: Althos Publishing.
Hopper, R. (2002). P/Meta - metadata exchange scheme. Retrieved June 15th, 2008, from http://www.
ebu.ch/trev_290-hopper.pdf
ISO/IEC. (1995). SMDL (Standard Music Description Language) Overview. Retrieved June 15th, 2008,
from http://xml.coverpages.org/gen-apps.html#smdl
ISO/IEC. (2003). MPEG-7 Overview. Retrieved June 15th, 2008, from http://www.chiariglione.org/
mpeg/standards/mpeg-7/mpeg-7.htm.
Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., & Schulzrinne, H. (2008). REsource LOcation
And Discovery (RELOAD). Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-bryan-p2psipreload-04.txt.
Jiang, X.-F., Zheng, H.-W., Macian, C., & Pascual, V. (2008). Service Extensible P2P Peer Protocol.
Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-jiang-p2psip-sep-01.txt
Kangasharju, J. (2002). Implementing the Wireless CORBA Specification. PhD Dissertation, Computer
Science Department, University of Helsinki, Helsinki, Finland. Retrieved June 15th, 2008, from http://
www.cs.helsinki.fi/u/jkangash/laudatur-jjk.pdf
Koskela, T., Kassinen, O., Korhonen, J., Ou, Z., & Ylianttila, M. (2008). Peer-to-Peer Community
Management using Structured Overlay Networks. In the Proc. of International Conference on Mobile
Technology, Applications and Systems, September 10-12, Yilan, Taiwan.
Krafzig, D., Banke, K., & Slama, D. (2005). Enterprise SOA: Service-Oriented Architecture Best Practices. Upper Saddle River, NJ: Prentice Hall.
Liu, C., Qian, D., Liu, Y., Li, Y., & Wang, C. (2006). RSVP Context Extraction in IP Mobility Environments. Vehicular Technology Conference, 2006, VTC 2006-Spring, IEEE 63rd, (Vol. 2, pp. 756-760).


Matjaz, B. J. (2008). BPEL and Java. Retrieved June 15th, 2008, from http://www.theserverside.com/
tt/articles/article.tss?l=BPELJava
Migliaccio, A. (2006). The Design and Development of a Nomadic Computing Middleware: the Esperanto
Broker. PhD Dissertation, Department of Computer and System Engineering, Federico II, University
of Naples, Naples, Italy.
MRML. (2003). MRML- Multimedia Retrieval Markup Language. Retrieved June 15th, 2008, from
http://www.mrml.net/
Murphy, A. L., Picco, G. P., & Roman, G. (2001). LIME: a middleware for physical and logical mobility.
21st International Conference on Distributed Computing Systems, (pp. 524-533).
Nelson, B. J. (1981). Remote Procedure Call. Palo Alto, CA: Xerox - Palo Alto Research Center.
Oaks, S., Traversat, B., & Gong, L. (2002). JXTA in a Nutshell. Sebastopol, CA: O'Reilly Media, Inc.
OMG. (2002). Wireless Access and Terminal Mobility in CORBA Specification. Retrieved June 15th,
2008, from http://www.info.fundp.ac.be/~ven/CIS/OMG/new%20documents%20from%20OMG%20
on%20CORBA/corba%20wireless.pdf
Pairot, C., Garcia, P., Rallo, R., Blat, J., & Gomez Skarmeta, A. F. (2005). The Planet Project: collaborative educational content repositories on structured peer-to-peer grids. CCGrid 2005, IEEE International
Symposium on Cluster Computing and the Grid, (Vol. 1, pp. 35-42).
Pascual, V., Matuszewski, M., Shim, E., Zheng, H., & Song, Y. (2008). P2PSIP Clients. Retrieved June
15th, 2008, from http://tools.ietf.org/id/draft-pascual-p2psip-clients-01.txt
Pennebaker, W. B., & Mitchell, J. L. (1992). JPEG: Still Image Data Compression Standard (Digital
Multimedia Standards). Berlin: Springer.
Perkins, C. (2003). RTP: Audio and Video for the Internet. New York: Addison-Wesley.
Polak, S., Slota, R., Kitowski, J., & Otfinowski, J. (2001). XML-based Tools for Multimedia Course
Preparation. Archiwum Informatyki Teoretycznej i Stosowanej, 13, 321.
Pro-MPEG. (2005). Material eXchange Format (MXF). Retrieved 15th June, 2008, from http://www.
pro-mpeg.org.
Quasy, H. M. (2004). Middleware for Communications. Chichester, UK: John Wiley & Sons Ltd.
Ray, E. (2003). Learning XML. Sebastopol, CA: O'Reilly Media, Inc.
Richardson, I., & Richardson, I. E. G. (2003). H.264 and MPEG-4 Video Compression: Video Coding
for Next Generation Multimedia. Chichester, UK: Wiley.
Rodriguez, B. (2002). EDLXML serialization. Retrieved 15th June, 2008, from download.sybase.com/
pdfdocs/prg0390e/prsver39edl.pdf
Roman, M., Kon, F., & Campbell, R. (2001). Reflective Middleware: From your Desk to your Hand.
IEEE Communications Surveys, 2(5).


Schilit, B., Adams, N., & Want, R. (1994). Context-aware computing applications. In Proceedings of
Mobile Computing Systems and Applications, (pp. 85-90).
SMIL/ W3C. (2005). SMIL- Synchronized Multimedia Integration Language. Retrieved June 15th, 2008
from http://www.w3.org/AudioVideo/
SMPTE. (2004). Metadata dictionary registry of metadata element descriptions. Retrieved June 15th,
2008, from http://www.smpte-ra.org/mdd/rp210-8.pdf
SOAP/W3C. (2003). SOAP Version 1.2 Part 1: Messaging Framework. Retrieved June 15th, 2008, from
Http://www.w3.org/TR/2003/REC-soap12-part1-20030624/
Song, Y., Jiang, X., Zheng, H., & Deng, H. (2008). P2PSIP Client Protocol. Retrieved June 15th, 2008,
from http://tools.ietf.org/id/draft-jiang-p2psip-sep-01.txt.
van Steen, M., Homburg, P., & Tanenbaum, A. S. (1999). Globe: A wide area distributed system. IEEE Concurrency, 7(1), 70–78.
Stuart, W., & Koch, T. (2000). The Dublin Core Metadata Initiative: Mission, Current Activities, and
Future Directions, (Vol. 6). Retrieved June 15th, 2008, from http://www.dlib.org/dlib/december00/weibel/12weibel.html
Tanenbaum, A. S., & Steen, M. V. (2008). Distributed Systems: Principles and Paradigms. Upper Saddle
River, NJ: Prentice Hall.
Taubman, D., & Marcellin, M. (2001). JPEG2000: Image Compression Fundamentals, Standards and
Practice. Berlin: Springer.
Thatte, S. (2003). BPEL4WS, business process execution language for web services. Retrieved June
15th, 2008, from http://xml.coverpages.org/ni2003-04-16-a.html
TV-Anytime. (2005). TV-Anytime. Retrieved June 15th, 2008, from http://www.tv-anytime.org
UDDI. (2004). UDDI Version 3.0.2. Retrieved June 15th, 2008, from http://www.Oasis-Open.org/committees/uddi-spec/doc/spec/v3/uddi-v3.0.2-20041019.Htm
Vakali, A., & Pallis, G. (2003). Content Delivery Networks: Status and Trends. IEEE Internet Computing, 7(6), 68–74. doi:10.1109/MIC.2003.1250586
Watt, A., Lilley, C., et al. (2003). SVG Unleashed. Indianapolis, IN: SAMS.
WSDL/W3C. (2005). WSDL: Web Services Description Language (WSDL) 1.1. Retrieved June 15th,
2008, from http://www.w3.org/TR/wsdl.
Wyckoff, P., McLaughry, S. W., Lehman, T. J., & Ford, D. A. (1998). T Spaces. IBM Systems Journal, 37, 454–474.
Zhou, J., Ou, Z., Rautiainen, M., & Ylianttila, M. (2008b). P2P SCCM: Service-oriented Community
Coordinated Multimedia over P2P. In Proceedings of 2008 IEEE International Conference on Web
Services, Beijing, China, September 23-26, (pp. 34-40).


Zhou, J., Rautiainen, M., & Ylianttila, M. (2008a). Community coordinated multimedia: Converging
content-driven and service-driven models. In proceedings of 2008 IEEE International Conference on
Multimedia & Expo, June 23-26, 2008, Hannover, Germany.

KEY TERMS AND DEFINITIONS


Community: generally defined as a group of a limited number of people held together by common interests and understandings, a sense of obligation, and possibly trust.
Community Coordinated Multimedia (CCM): a CCM system maintains a virtual community for the consumption of CCM multimedia elements, i.e. both content generated by end users and content from professional multimedia providers (e.g., Video on Demand). The consumption involves a series of interrelated multimedia-intensive processes such as content creation, aggregation, annotation, etc. In the context of CCM, these interrelated multimedia-intensive processes are encapsulated into Web services rather than multimedia applications, namely multimedia-intensive services, or briefly multimedia services.
Middleware: the key technology that integrates two or more distributed software units and allows them to exchange data via heterogeneous computing and communication devices. In this chapter, middleware is perceived as an additional software layer in the OSI model encapsulating knowledge from the presentation and session layers, consisting of standards, specifications, forms, and protocols for multimedia, service, mobility, and community computing and communication.
Multimedia: a synchronized presentation of bundled media types, such as text, graphics, images, audio, video, and animation.
Standard: refers to an accepted industry standard. A protocol is a set of governing rules for communication between computing endpoints. A specification is a document that proposes a standard.

ENDNOTE
1. http://www.cam4home-itea.org/


Section 8

Mobile Computing and Ad Hoc Networks


Chapter 30

Scalability of Mobile Ad Hoc Networks
Dan Grigoras
University College Cork, Ireland
Daniel C. Doolan
Robert Gordon University, UK
Sabin Tabirca
University College Cork, Ireland

ABSTRACT
This chapter addresses scalability aspects of mobile ad hoc networks management and clusters built on
top of them. Mobile ad hoc networks are created by mobile devices without the help of any infrastructure
for the purpose of communication and service sharing. As a key supporting service, the management
of mobile ad hoc networks is identified as an important aspect of their exploitation. Obviously, management must be simple, effective, reliable, scalable, and consume as few resources as possible. The first section of this chapter discusses different incarnations of the management service of mobile ad hoc networks considering the above mentioned characteristics. Cluster computing is an interesting computing paradigm that, by aggregating network hosts, provides more resources than are available on each of them. Clustering mobile and heterogeneous devices is not an easy task, as is shown in the second part of the chapter. Both sections include innovative solutions for the management and clustering of mobile ad hoc networks, proposed by the authors.

INTRODUCTION
In this chapter, we discuss the concept of scalability applied to Mobile Ad hoc NETworks (MANET).
MANETs are temporarily formed networks of mobile devices without the support of any infrastructure.
One of the most important characteristics of MANETs is the unpredictable evolution of their configuration. The number of member nodes within a MANET can vary immensely over a short time interval,
from tens to thousands and vice-versa. Therefore, the scalability of network formation and management,
mobile middleware and applications is a key factor in evaluating the overall MANET effectiveness.
The large diversity and high penetration of mobile wireless devices make their networking a very
important aspect of their use. By self-organizing in mobile ad hoc networks, heterogeneous devices can
communicate, share their resources and services and run new and more complex distributed applications.
Mobile applications such as multiplayer games, personal health monitoring, emergency and rescue,
vehicular nets and control of home/office networks illustrate the potential of mobile ad hoc networks.
However, the complexity of these networks brings new challenges regarding the management of heterogeneity, mobility, communication and scarcity of resources that all have an impact on scalability.
The scalability property of complex distributed systems does not have a general definition and evaluation strategy. Within the realm of MANET, scalability can refer to several aspects, from performance
at the application layer to the way scarce resources are consumed. One example is the case where the mobile system is not scalable because battery energy is exhausted by demanding management operations. A mobile middleware service is not scalable if it does not meet the mobile clients' requirements with
similar performance, irrespective of their number or mobility patterns.
All current MANET deployments or experiments involve a small number of devices, at most of the order of a few tens, but, in the future, hundreds, thousands, or even more devices will congregate and run
the same application(s). Therefore it is essential to consider the strategies by which scalability will be
provided to the network and application layers such that any number of devices and clients will be accommodated with the same performance. When used, mobile middleware systems will also be required
to be scalable.
In the following, the most important aspects of scalability with regard to mobile ad hoc networks will be reviewed, considering how MANET can be managed cost-effectively and how an important application of large distributed systems, clustering, can be implemented in a scalable manner on MANET.
This chapter is organized as follows. The first section discusses the management service of mobile ad hoc networks and innovative means for making it a scalable service. The rapid change of MANET membership impacts node address management. Additionally, frequent operations such as split and merge require address management as well. Therefore, MANET management is mostly the management of node addresses.
As potentially large networks, MANET can be used as the infrastructure that supports mobile cluster computing. Consequently, the second section is dedicated to cluster computing on MANET and its
related scalability issues.

MANAGEMENT OF MANET
The Set of Management Operations for MANET
An ad hoc network is a dynamic system whose topology and number of member nodes can change at
any time. The MANET scenario assumes that there will always be one node that sets up the network, followed by other nodes that join the network by acquiring unique addresses.
This is, for example, the strategy of Bluetooth, where the initial node, also known as the master, creates
the network identity and allocates individual addresses to up to seven new members of the network as
they join it.


During its lifetime, any MANET can be characterized by the following set of management operations:

Network setup, usually executed by one (initial) node that also creates the MANET identity;
Join/leave, when a node joins or leaves an existing MANET;
Merge of two or more MANETs and the result is one larger MANET;
Split of MANET into two or more sub-MANETs;
Termination when nodes leave the network and the MANET ceases to exist.

As these operations can be quite frequent due to node mobility, MANET management becomes complex and resource-consuming, especially in terms of node addresses, battery energy, and bandwidth. Therefore it is important to study the strategies proposed for MANET organization with respect to the way these critical resources are managed. Although one of the main goals of mobile platforms is to minimize energy consumption, for example by introducing more sleep states for the CPU and an idle thread in the operating system, less attention is paid to the cost of MANET management operations in terms of resources. For example, if many messages are used for a basic join operation, there will be high energy and bandwidth costs for each joining node and for the entire system. A good strategy uses a minimum number of short messages and is scalable, while a poor strategy fails to manage a large number of devices.
Currently, there are two main technologies, Bluetooth (Bluetooth, 2008) and IEEE 802.11x (WiFi,
2008), used to create MANET. Almost all new mobile phones, PDAs and laptops are Bluetooth-enabled
making them potentially members of Bluetooth MANET. However, Bluetooth accepts only eight devices
in a piconet (this is the name of the Bluetooth MANET) as the standard adopted only three bits for addressing a node. The first node, setting up the piconet, becomes its master. It creates the piconet identity
and allocates addresses to nodes that join the network. Regarding the merge operation, Bluetooth does
not have any specific procedure for it. However, a node that is within the range of two piconets can switch from one to the other. Alternating membership between the two piconets creates the potential for the node to act as a gateway; the larger network is called a scatternet. There is no provision for split, as this was probably not considered a likely event for such a small network. There is no clear protocol for piconet termination, this operation being left to the users' intervention. The lack of provision for MANET management in the case of Bluetooth is explained by its primary
goal of eliminating cables and not to create P2P mobile networks. However the increasing popularity of
Bluetooth may lead to the necessity to manage large scatternets (collections of interconnected piconets)
and a rethink of this technology.
The 802.11x standards cover only the physical and link layers. A WiFi MANET is generally an IP-based network.

Solutions for the Management of IP-based MANET


The management of an IP-based MANET is the management of IP addresses. Considering the role of an IP address as a pointer to a unique computer/node in a certain network, it is easy to understand the difficulty of porting this concept to the mobile area, where nodes can change network membership often.
Several solutions have been proposed for IP address allocation to mobile nodes. The simplest ones use the current features of the protocol. Either there is a DHCP server that has a pool of IP addresses, or mobile
nodes are self-configuring. In the former situation, it is assumed that one node runs the DHCP server,


the pool has enough IP addresses for all potential candidates for membership, and the DHCP host will belong to the MANET for its entire lifetime. Obviously, all these assumptions are difficult to guarantee. In the latter situation, each node picks an IP address from the link-local range or from a private range of addresses and then checks for duplicates. From time to time, nodes will search for a DHCP server and, if one is present, will ask for an IP address. Although much simpler, this strategy assumes that nodes share the same link and that the size of the network is not too large. Otherwise, duplicate checks would dominate the network communication. Moreover, join and merge can be committed only after a duplicate check confirms the lack of duplicates. If these operations are frequent, the management of IP addresses will consume a lot of the mobile nodes' most important resources: battery energy and bandwidth.
IP-based MANET termination is either triggered by the return of all IP addresses to the pool, or by
a user defined protocol.

MANETconf
One of the earliest projects that offered a full solution to IP-based MANET management is MANETconf
(Nesargi, 2002). This protocol assumes that all nodes use the same private address block, e.g., 10.0.0.0 to
10.255.255.255, and each node that requests to join a network will benefit from the services of an existing member of the network. This node, acting as a proxy, will allocate an IP address to the newly arrived node after checking with all the other nodes that that IP address is idle. Conflicts among multiple proxies trying to allocate IP addresses at the same time are solved by introducing priorities. The proxy with the lower IP address has priority over the other(s). Split and merge were considered as well. While split is managed by simply cleaning up the IP addresses of departed nodes belonging to all the other partitions, merge requires a more elaborate algorithm. The authors associate with each partition an identity represented by a 2-tuple. The first element of the tuple is the lowest IP address in use in the partition. The second element is a universally unique identifier (UUID) proposed by the node with this lowest IP address. Each node in the partition stores the tuple. When two nodes come into each other's radio range, they exchange their partitions' identities. If these are different, a potential merge is detected. This operation will proceed by exchanging the sets of idle IP addresses and then broadcasting them to all the other members of each partition. If merging produces conflicting addresses (duplicates), the node(s) with the lower number of active TCP connections will request a new IP address.
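The partition identity and merge detection described above can be sketched as follows; this is an illustrative reconstruction based on the prose, not code from the MANETconf implementation, and all names are hypothetical.

```java
import java.util.Objects;

// Illustrative sketch of a MANETconf-style partition identity: a 2-tuple of
// (lowest IP address in use, UUID proposed by that node). When two nodes come
// into radio range they exchange these tuples; different tuples signal a
// potential merge of two partitions.
final class PartitionId {
    final String lowestIpInUse;  // first element of the 2-tuple
    final String uuid;           // second element, proposed by that node

    PartitionId(String lowestIpInUse, String uuid) {
        this.lowestIpInUse = lowestIpInUse;
        this.uuid = uuid;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PartitionId)) return false;
        PartitionId other = (PartitionId) o;
        return lowestIpInUse.equals(other.lowestIpInUse) && uuid.equals(other.uuid);
    }

    @Override
    public int hashCode() {
        return Objects.hash(lowestIpInUse, uuid);
    }

    // Called when a neighbour's identity is received over the radio link.
    static boolean mergeDetected(PartitionId mine, PartitionId neighbours) {
        return !mine.equals(neighbours);
    }
}
```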
MANETconf as a complete solution requires a lot of communication that increases with the size of
the network. Therefore, we cannot consider this protocol a scalable solution to MANET management.
IP addressing and the associated protocol are effective for wired networks and, to a certain extent, for Access Point-based mobile networks (Mobile IPv4 or IPv6), but difficult to manage in a MANET (Ramjee, 2005; Tseng, 2003). The main difficulty arises from the fact that an IP address has no relevance
for a mobile node that can change network membership frequently. Any such change may result in a new
address allocated to that node, each time, followed by duplicate checks. More important than a numeric
address is the set of services and resources made available to other peers by the node. In this respect,
there are initiatives to introduce new ways of addressing mobile nodes of MANET (Adjie-Winoto,
1999), (Balazinska, 2002). One particular project deals with a service oriented strategy which builds on
the assumption that MANET are mainly created for the purpose of sharing services and, in this context,
IP addresses as an indication of location have no relevance (Grigoras, 2005). Then, service discovery,
remote execution and service composition are the most important operations related to service sharing.


If the Internet Protocol is not used anymore, new transport protocols, probably simpler but still reliable
if replacing TCP, have to be designed.
To define the scope of MANET, a new concept of soft network identity was proposed in (Grigoras,
2007a). As this concept provides a totally new approach in the way MANET are managed and moreover
assures scalability, we will explain it in the following section.

The Management of Non-IP MANET


The difficulties and high cost of managing IP addresses led to the idea that we might find better ways
to manage MANET, preserving the requirements of communication and service provisioning among
all nodes. Because a MANET is a system with a limited lifetime, it makes sense to allocate it an identity that is valid only as long as the MANET is active/alive. This identity is then used for all the management operations.
The first node that organizes the MANET computes a network identity, net_id for short, based on its MAC address, date, and time. It then attaches to it a time-to-live (TTL), an expectation of how long that network will be alive: {net_id, TTL}.
This pair represents the soft identity state of the new network. For example, {112233, 200} corresponds to network 112233 with a life expectancy of 200 seconds. A node joins the network after
requesting, receiving and storing, from any one-hop neighbour already in the network, the net_id and
updated TTL, for example {112233, 150}. TTL is counted down by each node. When it times out, the
associated net_id is cancelled, meaning that the node is no more a member of the network, 112233 in
our example. On the other hand, TTL is prolonged when a message carrying the net_id in the header is
received by the node. The significance is that messages mean activity and therefore the network should
be kept alive (i.e. the node is or can be on an active path).
To increase the chance of finding services of interest, a node may join as many MANETs as it wants using the Greedy join algorithm (Grigoras, 2007a). All {net_id, TTL} records are cached and each of them is managed separately, as sketched below. If a node leaves a network, it may still be active in other network(s). Within a MANET, a node is uniquely addressed by its MAC address and its set of public and private services.
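The following Java fragment is a minimal sketch of such a cached {net_id, TTL} record, assuming a one-second timer on each node; the class and method names are illustrative and are not taken from (Grigoras, 2007a).

```java
// Sketch of the soft-state record a node caches for every network it joins.
// Only the behaviour (count the TTL down, cancel on expiry, prolong on traffic)
// comes from the text; names and the refresh policy are assumptions.
public class NetIdRecord {
    private final String netId;   // e.g. "112233"
    private long ttlSeconds;      // remaining time-to-live

    public NetIdRecord(String netId, long ttlSeconds) {
        this.netId = netId;
        this.ttlSeconds = ttlSeconds;
    }

    // called once per second by the node's timer;
    // returns false when the net_id must be cancelled (membership ends)
    public boolean tick() {
        return --ttlSeconds > 0;
    }

    // called whenever a message carrying this net_id is received:
    // activity means the network should be kept alive
    public void prolong(long freshTtlSeconds) {
        ttlSeconds = Math.max(ttlSeconds, freshTtlSeconds);
    }

    public String netId() {
        return netId;
    }
}
```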

The MANET Management Operations


MANET setup and join are executed by the same algorithm: initially, the host broadcasts a message asking to join an existing network; if there is a reply carrying the net_id and TTL, the host caches them and becomes a member of the network; if there is no reply within a join time interval, it still waits for a delay interval for possible late replies and then, if no reply was received, it computes its own net_id and attaches a TTL. The expectation is that other hosts will join this network and activity will start. Otherwise, the TTL counter will time out and the net_id will be cancelled. The host is free to start the procedure again. For example, this can be a background process that assists a distributed application by providing the network infrastructure. A minimal sketch of this procedure is given below.
The join and delay time intervals are two parameters whose initial values, picked by the user, can be updated depending on the environment (number of failures, mobility pattern, etc.).
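The sketch below illustrates the setup/join procedure, reusing the NetIdRecord class from the earlier sketch; the helper methods broadcastJoinRequest(), awaitReply() and generateNetId(), as well as the timing constants, are hypothetical and chosen only for illustration.

```java
// Sketch of MANET setup/join with the soft network identity. The control flow
// follows the description in the text; all names and values are assumptions.
public class SoftIdentityJoin {
    static final long JOIN_INTERVAL_MS  = 2_000;   // join time interval (user-chosen)
    static final long DELAY_INTERVAL_MS = 1_000;   // extra wait for late replies
    static final long DEFAULT_TTL_S     = 200;     // initial TTL for a new network

    public NetIdRecord setupOrJoin(MobileHost host) throws InterruptedException {
        host.broadcastJoinRequest();                          // ask one-hop neighbours
        NetIdRecord reply = host.awaitReply(JOIN_INTERVAL_MS);
        if (reply == null) {
            reply = host.awaitReply(DELAY_INTERVAL_MS);       // possible late replies
        }
        if (reply != null) {
            return reply;                                     // cache {net_id, TTL}: member now
        }
        // no reply: compute an own net_id from MAC address, date and time
        return new NetIdRecord(host.generateNetId(), DEFAULT_TTL_S);
    }
}
```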
Merge is triggered by a host that receives messages carrying a new net_id in the header. This operation can be executed on demand, or implicitly when two or more overlapping networks merge. In both cases, the contact host will forward the new {net_id, TTL} pair to all peers. The merge may or may not be mandatory. Obviously, islands of nodes may lose their membership by time out if they don't receive or route messages. This behaviour was indeed observed during simulation (Grigoras, 2007a).
Split is simpler: all sub-networks preserve the net_id; if there is activity, the TTL will be prolonged, otherwise it will time out. If split networks originating from the same network come together again, the net_id is still the same.
Termination is signalled by time out. Indeed, when there is no activity, the counter will time out and
hosts will gracefully leave.
The MANET management based on the soft net_id concept, presented here, is simple, uses the minimum number of messages (two), is scalable and offers a full solution.
Experimental results (Grigoras, 2007b) showed not only that the net_id strategy uses the minimum number of messages for carrying out the management operations but also that it is scalable. Scalability is provided by the use of local operations. Indeed, when a node plans to join one or more networks, it simply broadcasts its join request and then listens for offers. By storing the net_id, the node becomes a de facto member of the network and can then communicate with other nodes. No management operation requires global communication, and this is a key requirement for scalable distributed systems.
As MANET is still a new networking model, there will be more, and potentially better, strategies for their management that will also be scalable, i.e. able to accept any number of new nodes with minimum consumption of resources.

CLUSTER COMPUTING ON MANET


Global High Performance Mobile Computing
The world of High Performance Computing (HPC) utilises the combined processing power of several inter-connected nodes to compute the solution of a complex problem within a reasonable timeframe. Presently the top-rated HPC machine is IBM's Blue Gene/L (IBM, 2007), comprising 131,072 processors and providing a peak performance of 596 Teraflops. According to performance projections, it is expected that a Petaflop-capable machine should be in place before 2009 (TOP500, 2007) and a ten-Petaflop machine by 2012.
The SETI@Home project is the most well-known distributed Internet computing project in the world. It is just one of several projects (BOINC, 2008) that are part of the Berkeley Open Infrastructure for Network Computing (BOINC). The recent upgrade of the world's largest radio telescope in Arecibo, Puerto Rico, from where SETI@Home receives its data stream, means a five-hundred-fold increase in the amount of data that needs to be processed (Sanders, 2008). This amounts to 300 gigabytes of data per day. The SETI@Home project uses a divide-and-conquer strategy implemented as a Client/Server architecture, whereby client applications running on personal computers throughout the world carry out the task of processing the data and return the results to the servers at Berkeley. The project has over five million registered volunteers, with over 201,147 users processing data blocks on a regular basis across 348,819 hosts. As of 10th March 2008, the project was running at 445.4 teraflops (SETIstats, 2008), while the combined speed of all the BOINC projects was rated at 948.7 teraflops across 2,781,014 hosts (BOINCstats, 2008). The Folding@home project is similar to SETI@home, using the processing power of volunteers from around the globe. The client statistics for Folding@home (as of 9th March 2008) showed 264,392 active nodes operating at 1,327 Teraflops (Folding@home, 2008), well over twice the performance of the world's most powerful supercomputer. The bulk of this processing came from PlayStation 3 gaming machines, which contributed 1,048 Teraflops from 34,715 active nodes.

Table 1. Comparison of mobile phone CPU speeds

    Phone         Announced     OS                  CPU       JBenchmark ACE
    Nokia N96     11/02/2008    Symbian OS v9.3     400 MHz   Unknown
    Nokia N93     25/04/2006    Symbian OS v9.1     330 MHz   329 MHz
    Nokia N70     27/04/2005    Symbian OS v8.1a    220 MHz   220 MHz
    Nokia N73     25/04/2006    Symbian OS v9.1     206 MHz   221 MHz
    Nokia 6680    14/02/2005    Symbian OS v8.0a    220 MHz   224 MHz
    Nokia 6630    14/06/2004    Symbian OS v8.0a    220 MHz   227 MHz
    Nokia 7610    18/03/2004    Symbian OS v7.0s    123 MHz   126 MHz
Clearly, for applications that require a high degree of processing, the architecture of distributing the work out to numerous clients can achieve processing speeds far in excess of the world's top HPC machines. Scalability still poses many questions within the realm of HPC, such as what type of architecture a million-node system should have, or how an application should be scheduled on a 1,024-core processor. Could the principle of client applications carrying out CPU-intensive operations be feasible within the world of mobile computing? If so, what possibilities may lie in store for the future of mobile distributed computation?
The rate at which mobile phone technology is being adopted is astonishing: the first billion subscribers took 20 years, the next billion required just 40 months, while the third required a mere 24 months. It would appear that the world has an ever growing and insatiable hunger for mobile technology. November 29, 2007 saw a significant milestone in global mobile phone ownership when it was announced that mobile telephone subscriptions had reached 50%, which amounts to over 3.3 billion subscribers (Reuters, 2007). Reports predict that subscriptions may be as high as five billion by 2012 (PortoResearch, 2008). Could the computing power of these billions of mobile phones be harnessed? If so, what would the combined computing power of all these devices be? The present number of phones outstrips the total number of processors in the world's largest supercomputer by over 25,000 times. Mobile devices may have far less computing power than high-end server machines, but their sheer and ever growing number more than counteracts this, as do their rapidly increasing processing capabilities. In January 2008, ARM announced that it had achieved the ten billion processor milestone. The rate at which these chips are being created is staggering, with the annual run rate now estimated at three billion units per year (ARM, 2008).
What level of computing power could the mobiles of today provide? In 1999 a 500 MHz Pentium III machine had a capacity of about 1,354 MIPS. In October 2005 ARM announced the 1 GHz Cortex-A8 processor, capable of 2,000 MIPS. Even the processors of phones from five years ago are rated at around 200 MIPS. Table 1 gives a cross-section overview of a selection of mobile phone types and their associated processor speeds. The table was compiled using information from the Nokia developer forum and the reviews section of the my-symbian.com website, and was cross-referenced with a site that provides detailed and up-to-date comparisons of mobile phone specifications (Litchfield, 2008). The phones presented were also evaluated against JBenchmark's ARM CPU Estimator (ACE), which provides an accurate estimate of the processor's CPU speed. It is generally very difficult to obtain concrete and detailed information about a phone's specification. Most manufacturers neglect to provide detailed specifications, so mobile phones are not weighed up by consumers against the common factors, such as system memory, processor speed and persistent storage, by which desktop and laptop machines are examined. This may change in time as phones gain more and more computing capabilities. A testament to the increasing power of the mobile device is Sun Microsystems' (Shankland, 2007) move away from Java Micro Edition in favour of a full-blown virtual machine of the kind that runs on the desktop systems of today.
An article in late 2007 (Davis, 2007) considered the notion of cell-phone grid computing and posed the question of whether Android, the new open-source mobile platform, could provide a foundation for it. It is therefore becoming evident that people are now becoming aware of the huge potential computing capability that the billions of mobile phones could provide. In summary, the combined might of all the world's mobile phones could be the most powerful supercomputer in the world if we could just harness their processing capabilities in a manner similar to the BOINC projects. One may of course object that processor-intensive computation would quickly drain the limited battery. However, even if processing were carried out only while the phone was connected to mains power, it would still allow for probably two hours of solid processing per week. Given this, one would still have upwards of 40 million devices contributing their processing power at any instant, given the 3.3 billion mobile phone population at the end of 2007; a rough estimate of this figure is shown below.
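This back-of-the-envelope figure follows by assuming the two hours of mains-connected processing per handset are spread evenly over the 168 hours of a week:

\[
3.3 \times 10^{9}\ \text{phones} \times \frac{2\ \text{h}}{168\ \text{h/week}} \approx 3.9 \times 10^{7}\ \text{devices active at any instant.}
\]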
Such extreme mobile parallel computing systems would of course be suitable only for hyper-parallel tasks that can be easily divided into millions or billions of distinct jobs. Third-generation phones allow for relatively fast Internet connectivity, with rates of several hundred kbit/s. The main prohibiting factors in the creation of a hyper-parallel mobile grid are the interconnectivity costs and people's willingness to participate. Costs are continually falling, and as more and more people join BOINC-like projects they are realising that they can contribute to the solving of complex and processor-intensive problems. Moving into the future, science will tackle larger and larger problems that will require all the processing power we can muster to solve them within a reasonable timeframe.
It may take some time before we see a hyper-parallel globalised mobile grid, but on a smaller scale
the alternative is to use the processing power of the phones within our local vicinity. This is where
technologies such as Bluetooth and message passing come into their own.

Localised Mobile Parallel Computing


The majority of today's phones are Bluetooth-enabled as standard; they also have the ability to execute Java-based applications in the form of MIDlets. Most of these Bluetooth-enabled devices allow for data transmission rates of up to 723 kbit/s (Bluetooth 1.2) and have an effective range of 10 meters. These mobile phones are therefore well-suited platforms for parallel computing tasks on a small scale. The standard Bluetooth Piconet allows for up to eight devices to be interconnected. This functions in a star network topology using a Client/Server architecture. A star network is of limited use when one Client device wishes to communicate with another Client device; in this case, all traffic has to be routed through the Master device. The solution lies in the bedrock of parallel computing today, the message passing interface, whereby any node is capable of communicating with any other node. In the mobile world this is achieved by first creating the standard star network topology after the process of device and service discovery has been carried out. With connections established to a central node, the process of creating the inter-client connections can take place, allowing a fully interconnected mesh network to be built up (Figure 1). A system called the Mobile Message Passing Interface (MMPI) allows such an infrastructure to be created and provides methods for both point-to-point and global communications (Doolan, 2006).

Figure 1. MMPI network structure for Piconet and Scatternet sized networks
Bluetooth itself is inherently Client/Server based; therefore, when establishing a parallel world using the MMPI system, it is necessary for the user to indicate whether the application should be started in Client or Server mode. In the case of a node started with a Client setting, its primary task is to create a Server object to advertise itself as being available. With all the Client nodes up and running, the remaining node can be started as a Master node, which will carry out the discovery process and coordinate the creation of the inter-client links. Bluetooth programming in itself changes the form of how a typical Client/Server system works, as the Client devices are required to establish server connections to allow the Server device to carry out the discovery process and establish Client connections to them. In standard Client/Server systems it is the Server application that is started first and left running, remaining in a constant loop awaiting incoming Client applications to connect to it; a web server is a typical example.
To ensure correct inter-node communication, each node maintains an array of connections to every other device within the world. This takes the form of a set of DataInputStreams and DataOutputStreams. Communication between nodes is achieved through a set of methods that abstracts the developer from dealing with streams, communication errors and so forth. In the case of point-to-point communication one simply calls a method such as send() to transmit a message to another node. The parameters that are passed are, firstly, the data to be transmitted (an array of data), an offset, the amount of data to send, the data type, and, most importantly, the id (rank) of the receiving device. Correspondingly, the receiving device must have a matching recv() method call to correctly receive the message from the source node; a short sketch is given below.
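The following is a minimal sketch of what such a point-to-point exchange might look like. The method names send() and recv(), and their general parameter list (buffer, offset, count, type, rank), follow the description above; the class name MMPI, its constructor, the rank()/finish() helpers and the type constant are assumptions made purely for illustration and are not taken from (Doolan, 2006).

```java
// Hypothetical sketch of an MMPI-style point-to-point exchange between two nodes.
public class RankExchange {
    public static void main(String[] args) throws Exception {
        MMPI mmpi = new MMPI();                 // assumed: discovers peers and joins the world
        int me = mmpi.rank();                   // assumed: this node's id (rank) in the world
        int[] data = new int[64];

        if (me == 0) {
            // node 0 transmits 64 integers, starting at offset 0, to the node with rank 1
            mmpi.send(data, 0, data.length, MMPI.INT, 1);
        } else if (me == 1) {
            // node 1 posts the matching receive from source node 0
            mmpi.recv(data, 0, data.length, MMPI.INT, 0);
        }
        mmpi.finish();                          // assumed: tear down the Bluetooth connections
    }
}
```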
Can the MMPI system scale to a world larger than eight nodes? The Bluetooth Piconet allows for a maximum of eight interconnected devices; however, one may use the Scatternet architecture to build larger systems by interconnecting two or more Piconets by way of a bridging node common to both. Using a Scatternet framework, the MMPI system can be scaled to allow for larger networks; for example, one could have a network of twelve, fifteen or even twenty devices that allows for inter-node communications between all nodes. This is achieved by the creation of a Java class called
CommsCenter which forms the heart of the Scatternet MMPI system (Donegan, 2008). The CommsCenter
receives raw data from the network and translates it into MMPI messages. These messages are passed
on to the MMPI interface that is exposed to the developer by means of an additional intermediary class
called the MMPINode. The purpose of this is to interface between the high level MMPI methods and
the lower level communications; it also helps to take care of the discovery process. Messages that are
sent out on to the Bluetooth network are fed up and down through this chain of classes that allows for
the abstraction of lower level operations.
Messages that are received by the CommsCenter are identified by their header and may take one of five forms: Bridge, Master, Slave, Confirm and Data. The first three are used for the establishment of the network structure, informing a specific device of the role it should take. The Confirm message is issued on completion of the network formation process. The Data header is used for the transportation of inter-node messages; a simple sketch of these headers follows.
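The five headers can be captured by a simple enumeration such as the one below; the enum name and its representation are assumptions for illustration, while the five roles themselves come from (Donegan, 2008) as described above.

```java
// Sketch of the five CommsCenter message headers described in the text.
public enum MessageHeader {
    BRIDGE,   // instructs a device to act as a bridging node between two Piconets
    MASTER,   // instructs a device to act as the Master of a sub-network
    SLAVE,    // instructs a device to act as a Slave within a Piconet
    CONFIRM,  // signals completion of the network formation process
    DATA      // carries an inter-node MMPI message payload
}
```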
If the number of discovered devices exceeds the limit of seven, one of these devices will be chosen to act as a bridging node, essentially forming two distinct Piconets. The root node carries out this selection process, as it is aware of the number of active nodes that are advertising themselves for inclusion in the parallel world. The root node builds up a list of which devices are to be in each network and, for devices that will appear in a network connected to a bridging node, a Bridging message is sent to the bridge in question with a list of the node addresses to which it should establish connections.
A routing table is also required, as many nodes may not have a direct connection to several of the other nodes in the world. Each node maintains such a table, with an entry for every other node except itself. Each entry indicates the node through which a message should be routed in order to reach its destination (a minimal sketch is given below). In the case of a slave node on one Piconet wishing to communicate with a slave node on another Piconet, the message is first transmitted to the Master node of the first Piconet (Figure 1). The Master then forwards the message to the bridging node that interconnects the two networks, from where it is again forwarded to the Master node of the second Piconet. The message can then finally be delivered to its destination (a slave node on the second Piconet).
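A per-node routing table of this kind can be sketched as below; the class name, method names and the fall-back to a direct connection are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the routing table described above: for every other node it stores
// the next hop (by rank) through which a message must be routed.
public class RoutingTable {
    private final Map<Integer, Integer> nextHop = new HashMap<>();

    // record that messages for 'destination' must first be forwarded to 'via'
    public void addRoute(int destination, int via) {
        nextHop.put(destination, via);
    }

    // resolve the node a message should be forwarded to on its way to 'destination';
    // when no entry exists, a direct connection is assumed
    public int resolve(int destination) {
        return nextHop.getOrDefault(destination, destination);
    }
}
```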
Figure 1 clearly shows the interconnections for MMPI running on Piconet-sized (limited to eight or fewer devices) and Scatternet-sized networks. In Piconet-sized worlds every node maintains direct connections with every other node. This differs greatly in larger MMPI worlds, where the network structure reverts to a Scatternet structure comprising star network topologies interconnected by bridges. In this case the Master node of each sub-network must deal with the routing of messages between Slave nodes; therefore, this node can easily become a communications bottleneck when there is a high volume of data transmission. The larger incarnation of the MMPI architecture was developed in a Scatternet manner to keep the routing tables as simple as possible. This, however, could be improved by creating inter-slave/bridge connections between the nodes in each sub-network. The process of network formation and routing would be more complex for the initial creation of the world, but it would have the effect of reducing the bottleneck on the sub-network Master nodes in the case of inter-slave communications.
The MMPI system can be used for a myriad of applications, from parallel processing and graphics to mLearning and multiplayer gaming. Due to its high level of abstraction, it liberates the developer from Java Bluetooth development, network formation, and the handling of data streams. One can develop a multi-node application very rapidly in comparison to a multi-node application developed from scratch. Instead of writing hundreds of lines to carry out discovery and establish connections, one needs only to call the constructor of the MMPI system. Therefore one single line replaces hundreds; when carrying out communications between nodes one simply needs to call an appropriate method, be it for point-to-point or global communications. In the space of less than a dozen lines one can develop the necessary code to build a fully functional Bluetooth network and achieve communications between the nodes. This has several advantages, such as speeding up application development and allowing the developer to focus on the domain-specific task at hand rather than worrying about detailed communications issues. In the area of games development, a well-built single-user game can be transformed into a multiplayer game in a matter of hours, requiring minimal code changes. Many people enjoy playing computer games, but playing against another human player rather than an AI algorithm adds far more unpredictability to the game. The number of multiplayer Bluetooth-enabled games for mobile phones is quite limited; one reason is the need to keep a game compatible with as many devices as possible. The process of transforming a single-player game into a multiplayer game can also be time consuming and require significant development resources, but this is no longer the case. Perhaps as more and more people invest in Bluetooth-enabled phones we will see a change in the market, with more multiplayer games being developed; as such, the MMPI system may prove to be of significant advantage to these developers, both in reducing development time and costs and in reducing code complexity.

CONCLUSION
In this chapter, we addressed aspects of the scalability of MANET management and MANET clusters. Regarding MANET management, the prevalent strategy is to use IP. However, managing IP addresses is resource-consuming and, for a large MANET, can become a nightmare. Our conclusion is that IP-based MANETs cannot be scalable. New approaches such as the net_id are simpler, use fewer resources and, more importantly, provide scalability.
Clustering is an interesting solution for creating more powerful systems out of many basic devices. For example, mobile phones can generally be classed as having very limited resources, be it a combination of electrical power, processor and system memory. The use of parallel computing techniques can allow these small devices to divide up large tasks among themselves and carry out jobs that would otherwise be impossible for a single device. One example would be a job with higher memory requirements than are available on a single device. Another, more imperative, restriction is electrical power, whereby a task may take too long to process given a limited battery. The division of work among multiple nodes spreads the resource cost across a number of devices, allowing tasks that are impossible for a single device to be completed and the results obtained in a far shorter wall-clock time. The amalgamation of Bluetooth and message passing paradigms to form Java-based mobile parallel applications is one solution to this problem, allowing mobile parallel computing to take place between a limited number of mobile devices. Perhaps in the not too distant future we will see the rise of the hyper-parallel globalised mobile grid as the information processing needs of research projects escalate. Supercomputing may no longer be the realm of high-end server farms, but ubiquitous throughout the world, with devices such as our phones, set-top boxes, desktop computers, and even our cars providing their free clock cycles to solve the data processing requirements of tomorrow.


REFERENCES
Adjie-Winoto, W., Schwartz, E., Balakrishnan, H., & Lilley, J. (1999). The design and implementation of an intentional naming system. Operating Systems Review, 34(5), 186-201. doi:10.1145/319344.319164
ARM. (2008). ARM Achieves 10 Billion Processor Milestone. Retrieved March 10, 2008, from http://
www.arm.com/news/19720.html
Balazinska, M., Balakrishnan, H., & Karger, D. (2002). INS/Twine: a scalable peer-to-peer architecture
for intentional resource discovery. In Pervasive 2002, Zurich, Switzerland, August. Berlin: Springer
Verlag.
Bluetooth (2008). Retrieved November 2008 from www.bluetooth.com
BOINC. (2008). Berkeley Open Infrastructure for Network Computing. Retrieved March 10, 2008 from
http://boinc.berkeley.edu
BOINCstats. (2008). Seti@home Project Statistics. Retrieved March 10, 2008, from http://boincstats.
com/stats/project_graph.php?pr=sah
Davis, C. (2007). Could Android open door for cellphone Grid computing? Retrieved March 10, 2008,
from http://www.google-phone.com/could-android-open-door-for-cellphone-grid-computing-12217.
php
Donegan, B., Doolan, D. C., & Tabirca, S. (2008). Mobile Message Passing using a Scatternet Framework. International Journal of Computers, Communications & Control, 3(1), 51-59.
Doolan, D. C., Tabirca, S., & Yang, L. T. (2006). Mobile Parallel Computing. In Proceedings of the Fifth
International Symposium on Parallel and Distributed Computing (ISPDC 06), (pp. 161-167).
Folding@home, (2008). Client statistics by OS. Retrieved March 10, 2008, from http://fah-web.stanford.
edu/cgi-bin/main.py?qtype=osstats
Grigoras, D. (2005). Service-oriented Naming Scheme for Wireless Ad Hoc Networks. In Proceedings of the NATO ARW Concurrent Information Processing and Computing, July 3-10, 2003, Sinaia, Romania (pp. 60-73). Amsterdam: IOS Press.
Grigoras, D., & Riordan, M. (2007a). Cost-effective mobile ad hoc networks management. Future Generation Computer Systems, 23(8), 990-996. doi:10.1016/j.future.2007.04.001
Grigoras, D., & Zhao, Y. (2007b). Simple Self-management of Mobile Ad Hoc Networks. In Proceedings of the 9th IFIP/IEEE International Conference on Mobile and Wireless Communication Networks, 19-21 September 2007, Cork, Ireland.
IBM. (2007). Blue Gene. Retrieved March 10, 2008, from http://domino.research.ibm.com/comm/
research_projects.nsf/pages/bluegene.index.html
Litchfield, S. (2008). A detailed comparison of Series 60 (S60) Symbian smartphones. Retrieved March 10, 2008, from http://3lib.ukonline.co.uk/s60history.htm


Nesargi, S., & Prakash, R. (2002). MANETconf: Configuration of Hosts in a Mobile Ad Hoc Network.
In Proceedings of the IEEE Infocom 2002, New York, June 2002.
PortoResearch. (2008). Slicing Up the Mobile Services Revenue Pie. Retrieved March 10, 2008, from
http://www.portioresearch.com/slicing_pie_press.html
Ramjee, R., Li, L., La Porta, T., & Kasera, S. (2002). IP paging service for mobile hosts. Wireless Networks, 8, 427-441. doi:10.1023/A:1016534027402
Reuters (2007). Global cellphone penetration reaches 50 pct. Retrieved March 10, 2008, from http://
investing.reuters.co.uk/news/articleinvesting.aspx?type=media&storyID=nL29172095
Sanders, R. (2008). SETI@home looking for more volunteers. Retrieved 10 March, 2008, from http://
www.berkeley.edu/news/media/releases/2008/01/02_setiahome.shtml
SETIstats. (2008). Seti@home Project Statistics. Retrieved March 10, 2008, from http://boincstats.com/
stats/project_graph.php?pr=bo
Shankland, S. (2007). Sun starts bidding adieu to mobile-specific Java. Retrieved March 10, 2008, from
http://www.news.com/8301-13580_3-9800679-39.html?part=rss&subj=news&tag=2547-1_3-0-20
TOP500. (2007). TOP 500 Supercomputer Sites, Performance Development, November 2007. Retrieved
March 10, 2008 from http://www.top500.org/lists/2007/11/performance_development
Tseng, Y-C., Shen, C-C. & Chen, W-T. (2003). Integrating Mobile IP with ad hoc networks. IEEE
Computer, May, 48-55.
WiFi (2008). Retrieved November 2008 from http://www.ieee802.org/11/

KEY TERMS AND DEFINITIONS


Bluetooth: An RF-based wireless communications technology that has very low power requirements, making it a suitable system for energy-conscious mobile devices. The JSR-82 Bluetooth API facilitates the development of Java-based Bluetooth applications.
IEEE 802.11x (WiFi): A set of standards defined by the IEEE for wireless local area networks.
IP: The Internet Protocol is a data communication protocol used on packet-switched networks.
MANET: Mobile ad hoc network, a network temporarily created by mobile devices without any infrastructure support.
MMPI: The Mobile Message Passing Interface is a library designed to run on a Bluetooth Piconet network. It facilitates the development of parallel programs, parallel graphics applications, multiplayer games and handheld multi-user mLearning applications.
Net_id: The mobile ad hoc network identity created by the mobile host which organizes the network. It is a soft variable that is valid only as long as the network is active.


Chapter 31

Network Selection Strategies and Resource Management Schemes in Integrated Heterogeneous Wireless and Mobile Networks
Wei Shen
University of Cincinnati, USA
Qing-An Zeng
University of Cincinnati, USA

ABSTRACT
The integrated heterogeneous wireless and mobile network (IHWMN) is introduced by combining different types of wireless and mobile networks (WMNs) in order to provide more comprehensive service, such as high bandwidth with wide coverage. In an IHWMN, a mobile terminal equipped with multiple network interfaces can connect to any available network, even to multiple networks at the same time. The terminal can also change its connection from one network to another while still keeping its communication alive. Although the IHWMN is very promising and a strong candidate for future WMNs, it brings a lot of issues, because different types of networks or systems need to be integrated to provide seamless service to mobile users. In this chapter, the authors focus on some major issues in IHWMNs. Several novel network selection strategies and resource management schemes are also introduced for the IHWMN, to provide better resource allocation for this new network architecture.

INTRODUCTION
Wireless and mobile networks (WMNs) attract a lot of attention in both academic and industrial fields, and they have witnessed great success in recent years. Generally, WMNs can be classified into two types, centralized (or infrastructure-based) and distributed (or infrastructure-less) WMNs. Cellular networks are the most widely deployed centralized WMNs and have evolved from the earliest 1G cellular
network to current 2G/3G cellular networks. Generally, the service area of a cellular network is divided into multiple small areas that are called cells. Each cell has a central control unit that is referred to as the base station (BS). All the communications in the cellular network take place via the BSs; that is, communication in a cellular network must be relayed through a BS. The IEEE 802.11 WLAN (Wireless Local Area Network) is another type of centralized WMN, which has much smaller coverage compared to cellular networks. Because WLANs are easy to deploy and can provide high-bandwidth service, they have experienced rapid growth and wide deployment since they were launched on the market. In a WLAN, the central control unit is called the access point (AP). Similar to cellular networks, the communications in a WLAN must go via the APs. The BSs or APs are connected to the backbone networks and provide connections with other external networks, such as the Public Switched Telephone Network (PSTN) and the Internet. Besides cellular networks and WLANs, there are also many other types of centralized WMNs, such as satellite networks, WiMAX, HiperLAN, etc.
Unlike centralized WMNs, there is no fixed network structure in a decentralized (or distributed) WMN. The wireless and mobile ad hoc network is a typical distributed WMN that has attracted a lot of research interest recently (Agrawal, 2006). A wireless and mobile ad hoc network is dynamically created and maintained by its nodes. The nodes forward packets to/from each other via a common wireless channel without the help of any wired infrastructure. When a node needs to communicate with other nodes, it runs a route discovery procedure to find a potential routing path to the destination node. Due to the frequent movement of communication nodes, the routing path between two communicating nodes is not fixed. When a relay node moves out of the transmission range of other communication nodes, the current routing path is broken. As a result, another routing path has to be found in order to keep the communication alive. Wireless and mobile ad hoc networks are very useful in areas where a centralized WMN is not possible or is inefficient, such as disaster recovery and the battlefield.
Although there are a lot of wireless and mobile networks and they have witnessed great success in recent years, different types of WMNs have different design goals and restrictions in wireless signal transmission, which limits the services they can offer. Therefore, they cannot satisfy all the communication needs of mobile users. For example, no single type of existing WMN is able to provide a comprehensive service such as high bandwidth with wide coverage. In order to provide more comprehensive services, the concept of the integrated heterogeneous wireless and mobile network (IHWMN) is introduced by combining different types of WMNs.
On the other hand, a traditional mobile terminal supports only one network interface, which can connect to only one type of network. With the advance of software-defined radio technology, it is now possible to integrate multiple WMN interfaces (multi-mode interfaces) into a single mobile terminal. Such a multi-mode terminal is able to access multiple WMNs if it is under the coverage of multiple WMNs. For example, a mobile terminal equipped with cellular network and WLAN interfaces can connect to the cellular network or the WLAN if both networks are available; it can even connect to both networks at the same time. However, this is a big challenge, since effective and efficient schemes are required to manage the connections.
It is obvious that the introduction of the IHWMN, as well as of the multi-mode terminal, brings more flexible and plentiful access options for mobile users. A mobile user can connect to the most suitable network for each communication purpose. For example, a mobile user may connect to the cellular network for voice communication and connect to a WLAN to receive email and surf the Internet. However, there are many challenges, such as the architecture of network integration, network selection strategies, handoff schemes, resource allocation, etc. These problems have to be solved before launching the IHWMN onto the commercial market and enjoying its benefits.

Figure 1. An example of integrated heterogeneous wireless and mobile network

The major challenges for the IHWMN are:

• How to integrate different types of wireless and mobile networks? In the same geographic area, it is possible to have more than one network, as shown in Figure 1. If these networks belong to the same operator, it is obvious that the operator can allocate the network resources in a centralized way. However, in most cases these networks belong to different operators. These different networks may manage resources individually, based on different policies, which may cause low resource utilization for the whole IHWMN. Therefore, how to integrate these different types of WMNs from different operators directly affects the performance of the whole system.

• How to manage radio resources in IHWMNs? Resource management becomes more complex in an IHWMN due to the diversity of the services (or traffic) provided by heterogeneous WMNs. Another challenging issue that has to be handled in a multiple-traffic system is fairness among different types of traffic and different types of networks. That is, some low-priority traffic may obtain poor performance while some high-priority traffic obtains over-qualified service. From the network's point of view, the throughput of some networks may saturate due to a high volume of traffic while other networks have much less traffic to handle.

• How to select a network in the IHWMN? When a mobile user having a multi-mode terminal generates a new (or originating) call in an IHWMN, a network selection strategy is needed to determine which network should be accessed.

• How to manage the vertical handoff? When a mobile user roams in an IHWMN, the multi-mode terminal may change its connection from one network to another. Such a process is called vertical handoff and will be discussed in the following section. A great challenge for the IHWMN is how to manage the vertical handoff, since frequent vertical handoffs cause a lot of signaling burden and fluctuation of the service quality.

• How to efficiently manage multi-mode interfaces? Since the power consumption of each wireless network interface cannot be neglected even in the idle or power-saving mode, the terminal cannot activate all the interfaces all the time. Therefore, algorithms are required to find the preferred networks promptly. Another problem in managing multi-mode interfaces is how to use multiple interfaces at the same time. Such technology is very useful for supporting services that require very high bandwidth. However, it brings a lot of issues, such as bandwidth allocation among different types of networks, synchronization, etc.

In this chapter, we focus on several issues in the IHWMN. The existing strategies and schemes are reviewed and compared. Several novel schemes are proposed to improve the performance of the IHWMN; they can be categorized as network selection strategies and resource management schemes. The issues addressed in this chapter provide many insights into characterizing the emerging problems in IHWMNs.
The remainder of this chapter is organized as follows. We first introduce some basic definitions for the IHWMN and review the existing work in the next section. Then, we tackle the network selection and resource management problems in IHWMNs and provide several novel solutions for these problems. After that, we analyze potential directions for the next steps in IHWMN research.

BACKGROUND
Figure 1 is an example of an integrated heterogeneous wireless and mobile network. The entire service area in Figure 1 is covered by a satellite network. There are two cells of a cellular network, which has a smaller cell size than the satellite network. Each cell of the cellular network has a BS, which manages the communication in the cell. Several WLAN cells overlap with the cells of the cellular network, so that some areas are covered by both networks. All these networks can be integrated as a whole, forming an IHWMN. A mobile user having a multi-mode terminal can enjoy multiple communication modes in the IHWMN. In the following, we give some basic definitions for the IHWMN.
In a traditional WMN, such as a cellular network or a WLAN, an active mobile user (in communication) may move from one cell to another. In order to keep the communication alive, the connection has to be changed from one BS (or AP) to another BS (or AP). Such a process of changing the connection within the same network is called handoff (Wang, 2003); in this chapter, we define such a handoff as a horizontal handoff. For example, the handoff between two adjacent cellular network cells in Figure 1 is a horizontal handoff. In an IHWMN, however, besides the horizontal handoff, the connection may be changed from one network to another for better service. Such a process of changing the connection between two different types of networks is called vertical handoff (Chen, 2004). For example, the handoff between the cellular network and the WLAN in Figure 1 is a vertical handoff. Since the vertical handoff happens between two different types of networks (or systems), it is also known as inter-network
handoff happens between two different types of networks (or systems), it is also known as inter-network
handoff or inter-system handoff. Therefore, the horizontal handoff can be called intra-network or intra-

721

Network Selection Strategies and Resource Management Schemes

system handoff. Compared to horizontal handoff, vertical handoff is more complex and brings a lot of
issues such as vertical handoff decision and vertical handoff execution, need to be handled carefully.
The horizontal handoff usually has to be made in order to keep the communication alive in traditional cellular networks or WLANs; therefore, it is mandatory. Vertical handoff, which is more complicated than horizontal handoff, can be divided into two categories. If an active mobile user roams into a new network that provides better service than the current serving network, it may request a vertical handoff and change its connection to the better network. Unlike the horizontal handoff, this type of vertical handoff is optional and is called downward vertical handoff (DVH) (Chen, 2004); the mobile user may instead keep its communication with the current serving network. On the other hand, when an active mobile user moves out of the coverage of the current serving network, it has to make a vertical handoff to another available network. Such a vertical handoff is called upward vertical handoff (UVH) (Chen, 2004). Similar to the horizontal handoff, the UVH is mandatory, since a failed vertical handoff terminates the communication. In the following, we review the related work that has been done in the field of IHWMNs.
Some research has been done on integrating different types of wireless and mobile networks. In (Salkintzis, 2002; Salkintzis, 2004), two different mechanisms, tight coupling and loose coupling, have been introduced to interconnect WLANs and cellular networks (GPRS and 3G networks). In the tight coupling mechanism, the WLAN connects to the GPRS core network like other Radio Access Networks (RANs). In other words, the traffic between the WLAN and other external communication networks goes through the core network of the cellular network. Therefore, the traffic of the WLAN imposes a burden on the core network of the cellular network. In the loose coupling mechanism, however, the WLAN is deployed as a complementary network for the cellular network, and the traffic of the WLAN does not go through the core network of the cellular network. The tight coupling mechanism requires that the WLAN and cellular networks belong to the same operator. By using the loose coupling mechanism, WLANs and cellular networks can be deployed individually; they need not belong to the same operator, which is more flexible than the tight coupling mechanism. Additionally, the 3GPP (3rd Generation Partnership Project) (3GPP, 2007) working groups have also discussed the requirements, principles, architectures, and protocols for interworking 3G networks and WLANs. In (Akyildiz, 2005), the authors proposed to use a third party to integrate different types of wireless and mobile networks. The third party, called the Network Inter-operating Agent, resides in the Internet and manages the vertical handoff between different types of networks.
When a multi-mode terminal generates a call in an IHWMN, it requires a strategy to determine which network should be accessed. In (Stemm, 1998), a mobile user always selects the network with the highest available bandwidth among all the available networks during its communication, so the only concern of network selection for the mobile user is bandwidth. From the user's point of view, this is good for the service quality. In (Nam, 2004), a network selection strategy that only considers the power consumption of mobile users has been introduced. In order to maximize the battery life, the mobile user selects the uplink and the downlink from the 3G network or the WLAN that have the lowest power consumption; consider, for example, the scenario in which the power consumption of the uplink in the 3G network is less than that of the uplink in the WLAN, while the power consumption of the downlink in the 3G network is larger than that of the downlink in the WLAN, so that the uplink and the downlink are taken from different networks. In (Wang, 1999), the authors have proposed a policy-enabled network selection strategy which combines several factors, such as bandwidth provision, price, and power consumption. A mobile user defines the best network based on his preferences; by setting different weights over different factors, a mobile user can calculate the total preference of each available network. The mobile
user connects to the network with the highest preference, which is its most desired network. In order to reduce the computational complexity of the cost function in (Wang, 1999), an optimization algorithm has been proposed in (Zhu, 2004). The authors have proposed another network selection algorithm in (Song, 2005) by using two mathematical methods: the analytical hierarchy process (AHP) and grey relational analysis (GRA). The AHP algorithm divides the complex network selection problem into a number of decision factors, and the optimal solution can be found by integrating the relative dominance among these factors. The GRA [17] has also been proposed for selecting the best network for a mobile user. Although the above network selection strategies have their own advantages, they are all designed to meet individual mobile users' needs; that is, they are user-centric. Furthermore, they do not pay much attention to system performance measures, such as the blocking probability of originating calls and the forced termination probabilities of horizontal and vertical handoff calls.
Generally, the vertical handoff can be divided into three phases: system discovery, vertical handoff decision, and vertical handoff execution (McNair, 2004). In the first phase, the multi-mode mobile terminal keeps searching for another network that can provide better service. Once such a network is found, a vertical handoff decision is made. The vertical handoff decision is a multiple-criteria process which involves many factors, such as bandwidth usage, monetary cost, QoS parameters, etc. The decision results affect both the degree of user satisfaction and the system performance. If the decision has been made to change the connection to the new network, the context has to be switched so that the change is smooth and transparent to the user. Since the vertical handoff may not be mandatory and it incurs significant signaling messages, the decision algorithm is critical to the IHWMN. The vertical handoff decision is also seen as a network selection problem in some of the literature (Wang, 1999; Song, 2005). The number of users in a WMN after a successful vertical handoff is considered to affect the QoS of the IHWMN; a modified Elman neural network is used to predict this number, and the predicted number of mobile users is fed into a fuzzy inference system to make the vertical handoff decision.
With the rapid emergence of multimedia applications such as voice, video, and data, these different types of traffic should be supported in wireless and mobile networks. Generally, multiple traffic can be classified into real-time and non-real-time traffic based on its sensitivity to delay. The major challenge in supporting such multiple traffic is that different types of traffic are incorporated into one system and each type of traffic has its own distinct QoS requirements. For example, real-time traffic (such as voice and video) is delay-sensitive, while non-real-time traffic (such as data) is delay-tolerant. Therefore, an efficient resource management scheme supporting multiple traffic has to treat the traffic types differently and satisfy their individual QoS requirements. In an integrated wireless and mobile network, resource management faces more challenges due to the diversity of the services provided by different types of wireless and mobile networks. Unfairness may arise among different types of traffic when handling multiple traffic; that is, the performance of lower-priority traffic should be improved once the higher-priority traffic has been provided with satisfactory services.
In (Pavlidou, 1994), the authors have presented different call admission control policies for voice and data traffic. Since data traffic is delay-insensitive while voice traffic is sensitive to access delay, they have proposed to allow voice traffic to preempt data traffic. A priority queue is introduced to hold the preempted data calls. When a data call arrives, it is also put into the queue if there are not enough resources. Although their scheme improves the blocking probability of originating voice calls, it treats the originating calls and the handoff calls equally. Since terminating an ongoing call is more frustrating than blocking an originating call from a user's point of view, higher priority should be provided to the ongoing calls (handoff calls). In (Wang, 2003), the authors have proposed an analytical model that
supports preemptive and priority reservation for handoff calls. Detailed performance analysis is also
provided to give guidelines on how to configure system parameters to balance the blocking probability
of originating calls and the forced termination probability of handoff calls. Multiple traffic types with different QoS requirements have been discussed in (Xu, 2005). In order to support different types of traffic, a model that gives different priorities to different types of traffic has been designed. Their model allows traffic with lower priority to be preempted by traffic with higher priority, which can support DiffServ (Differentiated Services) in WMNs. Although all of the above resource management schemes achieve significant improvements in system performance, they focus on a single WMN, and hence may not efficiently support multiple traffic in an IHWMN.
Compared to a single type of WMN, the resource management in an IHWMN has to face more challenges due to the heterogeneity of different types of WMNs. That is, different types of WMNs may have
different resource management policies for the same type of traffic. The resource management scheme
in (Park, 2003) treats the real-time and non-real-time traffic differently in an integrated CDMA-WLAN
network. For real-time traffic, vertical handoff is made as soon as possible to minimize the handoff delay.
For non-real-time traffic, they considered that the amount of data being transmitted is more important
than the delay. Therefore, the connection to the higher bandwidth network is kept as long as possible to
maximize the throughput. In (Zhang, 2003), the authors have also proposed different vertical handoff
policies for real-time and non-real-time traffic. Although all of the above schemes improve system performance in certain respects, call-level performance measures, such as the blocking probability of originating calls and the forced termination probability of handoff calls, are not examined. Furthermore, in all of the above schemes, any type of traffic is switched to a higher-bandwidth network when such a network becomes available. However, this policy may not be suitable for some delay-sensitive traffic, because frequent handoffs may interrupt the ongoing communications. The goal of the proposed resource management scheme in (Liu, 2006) is to increase users' data rates and decrease the blocking probability
and the forced termination probability. A switch profit is used to encourage the vertical handoff to the
network that can offer better bandwidth. On the other hand, a handoff cost is used to prevent excessive
vertical handoff. The switch profit depends on the bandwidth gain obtained from the vertical handoff,
while the handoff cost depends on the delay incurred by the vertical handoff. The simulation results
show that their scheme can reduce the blocking probability and the forced termination probability. It
also achieves better throughput and grade of service. Although the above schemes focus on the resource
management scheme in IHWMNs, they only consider a single type of traffic. Therefore, they may not
efficiently support multiple traffic in IHWMNs.

NETWORK SELECTION STRATEGIES AND RESOURCE MANAGEMENT SCHEMES
Cost-Function-Based Network Selection Strategies
System Model
As we mentioned before, most existing network selection strategies are user-centric and focus on individual users' needs. Our motivation is to design a network selection strategy from the system's perspective that can also meet certain individual users' needs. Before we discuss how our proposed cost-function-based network selection strategy (CFNS) works, we briefly describe our system model.
We consider an integrated heterogeneous wireless and mobile system having M different types of networks. We assume that the entire service area of the system is covered by network N1, which consists of many homogeneous cells and provides a low-bandwidth service. We assume that network Ni (2 ≤ i ≤ M) is randomly distributed in the service area covered by network N1 and provides a higher-bandwidth service than network N1. Network Ni (2 ≤ i ≤ M) has limited coverage, covering only some portion of the entire service area. For example, a cellular network N1 covers several WLANs (N2, ..., NM). For the purpose of simplicity, we focus on one cell of network N1, called the marked cell, where some area is covered by several high-bandwidth networks. Each cell of a higher-bandwidth network Ni (2 ≤ i ≤ M) has an AP (access point), and each cell of network N1 has a BS (base station).
We assume that each cell of network Ni (1 ≤ i ≤ M) has a circular shape with radius Ri. We denote the area covered by network Ni (2 ≤ i ≤ M) as area Ai (2 ≤ i ≤ M). In the overlapped areas, mobile users may have more than one connection option. We assume that each cell of network Ni (2 ≤ i ≤ M) has Bi bandwidth units. It is necessary to clarify that each bandwidth unit is a logical channel which can be allocated to a mobile user. We assume that mobile users are uniformly distributed in the service area and that they move in all directions with equal probability. The moving speed V (a random variable) of a mobile user follows an arbitrary distribution with a mean value of E[V]. In the system, we assume that there are three types of calls, namely originating calls, horizontal handoff calls, and vertical handoff calls. An originating call is an initial call in the system, and a handoff call, either a horizontal or a vertical handoff call, is an ongoing call. When an active mobile user changes its connection from its current serving network Ni to network Nj (for all i, j), a handoff call (request) is generated in network Nj. If i = j, the handoff call is a horizontal handoff call; if i ≠ j, it is a vertical handoff call.

Cost-Function-Based Network Selection Strategy


When an originating call is generated, the proposed network selection strategy works as follows.

• If there is no free bandwidth unit, the originating call is blocked;
• If only one available network has free bandwidth units, the originating call is accepted by that network;
• If more than one available network has free bandwidth units, all these candidate networks are compared based on the network selection strategy and the originating call is accepted by the most desired network.

Since we focus on the network selection strategy in this chapter, horizontal handoff is handled in a traditional way, as in (Wang, 2003), and vertical handoff is handled as follows: when an active mobile user moves from an area covered by network Ni into an adjacent area covered by network Nj, it changes its connection from network Ni to network Nj if network Nj offers a higher bandwidth than Ni and there are free bandwidth units in Nj. If the target area is not covered by network Ni, the mobile user has to change its connection to another available network. If there is no free bandwidth unit in any other available network, the vertical handoff call will be forcibly terminated. If more than one available network has free bandwidth units, the vertical handoff call is randomly accepted by any one of these networks.
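A rough sketch of this handoff handling follows; the helper types Terminal and Network and their methods are hypothetical, and only the decision rules themselves come from the description above.

```java
import java.util.List;
import java.util.Optional;

// Sketch of the vertical handoff handling used alongside the CFNS strategy.
public class VerticalHandoff {
    static void onAreaChange(Terminal user, Network current, List<Network> reachable) {
        // downward vertical handoff: move to a reachable higher-bandwidth network with free units
        for (Network n : reachable) {
            if (n.bandwidth() > current.bandwidth() && n.hasFreeUnit()) {
                user.handoffTo(n);
                return;
            }
        }
        if (!reachable.contains(current)) {
            // the current network no longer covers the user: upward vertical handoff
            Optional<Network> target = reachable.stream().filter(Network::hasFreeUnit).findAny();
            if (target.isPresent()) {
                user.handoffTo(target.get());   // pick any network with free units (the text specifies a random choice)
            } else {
                user.terminateCall();           // forced termination of the vertical handoff call
            }
        }
    }
}
```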


Our proposed network selection strategy prefers an originating call to be accepted by a network with a low traffic load and a strong received signal strength, which achieves better traffic balance among the different types of networks and a good service quality. We therefore define a cost function that combines these two factors, traffic load and received signal strength. The cost of using network Ni for an originating call is defined as

Ci = wg Gi + ws Si, for i = 1, 2, ..., M,     (1)

where Gi is the complement of the normalized utilization of network Ni and Si is the relative received signal strength from network Ni. wg and ws are the weights that express the preferences given to Gi and Si, where 0 ≤ wg, ws ≤ 1. The constraint between wg and ws is given by

wg + ws = 1.     (2)

The complement of the normalized utilization, Gi, is defined as

Gi = Bif / Bi, for i = 1, 2, ..., M,     (3)

where Bif is the number of available bandwidth units of network Ni and Bi is the total number of bandwidth units of network Ni. In general, a stronger received signal strength indicates better signal quality. Therefore, an originating call prefers to be accepted by a network with a higher received signal strength. However, it is difficult to compare received signal strengths among different types of networks because they have different maximum transmission powers and receiver thresholds. As a result, we propose to use a relative received signal strength to compare different types of WMNs. Si in Equation (1) is therefore defined as

Si = (Pic - Pith) / (Pimax - Pith), for i = 1, 2, ..., M,     (4)

where Pic is the current received signal strength from network Ni, Pith is the receiver threshold of network Ni, and Pimax is the maximum transmitted signal strength from network Ni. Note that we only consider path loss in the propagation model. Consequently, the received signal strength (in decibels) from network Ni is given by

Pic = Pimax - 10γ log(ri),     (5)

where ri is the distance between the mobile user and the BS (or AP) of network Ni, and γ is the fading factor, which is generally in the range [2, 6]. Therefore, the receiver threshold of network Ni is given by

Pith = Pimax - 10γ log(Ri).     (6)
Substituting Equations (5) and (6) into Equation (4), the Pimax terms cancel and the relative received signal strength from network Ni can be rewritten as

Si = 1 - log(ri) / log(Ri), for i = 1, 2, ..., M.     (7)

If an originating call has more than one connection option, the costs of all candidate networks are calculated using the cost function of Equation (1). The originating call is accepted by the network that has the largest cost, which indicates the best network. If there is more than one best network, the originating call is randomly accepted by any one of them.
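To make the selection procedure concrete, the following sketch (in Python; the function and variable names are illustrative and not part of the original proposal) computes Gi from Equation (3), Si from Equation (7), and the cost Ci from Equation (1), and then picks the candidate network with the largest cost, breaking ties randomly as described above.

import math
import random

def network_cost(free_units, total_units, distance, radius, wg=0.5, ws=0.5):
    g = free_units / total_units                      # Eq. (3): complement of normalized utilization
    s = 1.0 - math.log(distance) / math.log(radius)   # Eq. (7): relative received signal strength
    return wg * g + ws * s                            # Eq. (1), with wg + ws = 1 (Eq. (2))

def select_network(candidates, wg=0.5, ws=0.5):
    # candidates: dicts with 'free_units', 'total_units', 'distance', 'radius'
    available = [c for c in candidates if c["free_units"] > 0]
    if not available:
        return None                                   # no free bandwidth unit: call is blocked
    costs = [network_cost(c["free_units"], c["total_units"],
                          c["distance"], c["radius"], wg, ws) for c in available]
    best = max(costs)
    # the call is accepted by the network with the largest cost; ties are broken randomly
    return random.choice([c for c, v in zip(available, costs) if v == best])

Setting the weights to wg = 1, ws = 0 or wg = 0, ws = 1 reproduces the two special cases discussed next.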
In the following, we discuss two special cases of the proposed CFNS strategy, i.e., wg = 1 and wg = 0. When wg = 1, the cost function of Equation (1) only considers Gi. This gives rise to another network selection strategy, which we call the traffic-balance-based network selection (TBNS) strategy. This strategy tries to achieve the best traffic balance among the different types of networks. In this case, when an originating call is generated and more than one network has free bandwidth units, the originating call is accepted by the network that has the largest Gi, that is, by the network with the most free bandwidth units. In the second case, when wg = 0, the proposed CFNS strategy reduces to the received signal strength-based network selection (RSNS) strategy, in which the only criterion for selecting the desired network is the received signal quality. In this case, when an originating call is generated in an area covered by more than one network, the call is accepted by the network that has the largest Si.
Although the cost function in this chapter consists of only two factors, traffic load and received signal strength, it is easy to extend it with more factors, such as the access fee for using network Ni. The cost function can then be rewritten as

Ci = wg Gi + ws Si + wf Φi,     (8)

where the access-fee factor Φi is given by

Φi = 1 - Fi / Fmax,     (9)

where Fmax is the highest access fee that the mobile user is willing to pay and Fi is the actual access fee for using network Ni. The mobile user does not connect to a network which charges more than Fmax even if that network has free bandwidth units. wf (0 ≤ wf ≤ 1) is the weight for the access fee, with the constraint

wg + ws + wf = 1.     (10)

Therefore, a network with a cheaper price has a larger cost value, and the mobile user is more likely to be accepted by that network. In a similar way, other factors can also be included in the cost function after proper normalization.
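A minimal extension of the earlier sketch along these lines, assuming a hypothetical per-network access fee fee and a user budget fee_max (the names are illustrative only):

def cost_with_fee(g, s, fee, fee_max, wg=0.4, ws=0.4, wf=0.2):
    # weights chosen so that wg + ws + wf = 1, as required by Eq. (10)
    if fee > fee_max:
        return None                       # the user refuses networks charging more than Fmax
    phi = 1.0 - fee / fee_max             # Eq. (9): normalized access-fee factor
    return wg * g + ws * s + wf * phi     # Eq. (8)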

Numerical Results
We also apply a Markov model to analyze the system performance of the proposed CFNS strategy. Due to space limitations, we do not provide the details of the performance analysis and results, which can be found in (Shen, 2007; Shen, 2008). In the following, we give some numerical results for the system performance of the proposed CFNS strategy. A comparison of the major system performance metrics shows that the CFNS strategy can achieve a tradeoff between the blocking probability of originating calls and the average received signal strength, both of which are very important for systems and users. This is the major difference between our strategy and most existing strategies: ours considers both system performance and users' needs.

RESOURCE MANAGEMENT SCHEMES FOR MULTIPLE TRAFFIC


System Model
Since the IHWMN is a new concept, there has been little research on resource management schemes that support multiple traffic types in an IHWMN. In this section, we propose a novel resource management scheme to support real-time and non-real-time traffic in an IHWMN. The fairness issue between real-time and non-real-time traffic is also addressed to avoid unbalanced QoS provision to non-real-time traffic.
The system model used in this section is similar to that of the last section, except that we consider two types of traffic: real-time and non-real-time. In this chapter, voice traffic is used as the real-time traffic and data traffic as the non-real-time traffic. Each bandwidth unit in a different network provides a different amount of bandwidth, and the bandwidth provision in network N2 is much larger than that in network N1. In the following, we describe our scheme, starting with how it handles voice traffic.

Preemption-Based Resource Management Scheme


An ongoing voice call is forcibly terminated by a failed handoff, since voice traffic is delay-sensitive. Therefore, a resource management scheme needs to reduce the number of handoffs. On the other hand, a voice call only needs a low bandwidth, and the call holding time of a voice call does not change even if a higher bandwidth channel is allocated. In other words, resource utilization is not efficient if a higher bandwidth channel is allocated to a voice call. Therefore, we assume that a voice call is accepted only by network N1, which prevents vertical handoffs and the occupation of higher bandwidth channels. As a result, there are only two types of voice call arrivals in our system, i.e., originating voice calls and horizontal handoff voice calls. A horizontal handoff voice request is generated in the marked cell when an active voice call user moves into the marked cell from a neighboring cell of network N1. When an originating voice request or a horizontal handoff voice request is generated in the marked cell of network N1, it is accepted if there are free channels in the marked cell of network N1. We assume that voice traffic has a higher priority than data traffic since voice traffic is delay-sensitive. Therefore, an incoming voice call, either an originating or a horizontal handoff voice call, can preempt an ongoing data call in the marked cell of network N1 if there is no free bandwidth unit upon its arrival. We adopt a queue to hold the preempted data calls in the marked cell of network N1. Two concerns may arise in such a preemption-based resource management scheme. First of all, excessive preemption easily results in unfairness between voice traffic and data traffic, which must be avoided. The other concern is the priority of horizontal handoff voice calls over originating voice calls. From a user's point of view, terminating a horizontal handoff voice call is more frustrating than blocking an originating voice call. Therefore, higher priority should be provided to horizontal handoff voice calls. In some channel reservation schemes, certain logical channels are exclusively reserved for handoff calls to provide such priority. In our scheme, however, the originating and horizontal handoff voice calls completely share the resources, unlike in reservation schemes. Therefore, we have to treat them differently during preemption in order to provide higher priority to horizontal handoff voice calls. In the following, we describe how the preemption works to differentiate the originating and horizontal handoff voice calls.
First, we do not want to terminate an ongoing data call in order to accept an incoming voice call. Therefore, the preemption fails if the queue of the marked cell of network N1 is full. We further propose two thresholds, VHmax and VOmax, to prevent excessive preemption. Both thresholds are defined as the maximum number of ongoing voice calls allowed when an incoming voice call tries to make a preemption. VHmax is used for preemption by horizontal handoff voice calls, and VOmax is used for preemption by originating voice calls. In the following, we introduce how the preemption works with VHmax and VOmax.
VHmax is a real value and can be written as

VHmax = ⌊VHmax⌋ + αH,     (11)

where ⌊VHmax⌋ is the integral part of VHmax and αH is the fractional (decimal) part of VHmax.

When an incoming horizontal handoff voice call tries to make a preemption, the result of the preemption depends on the value of VHmax and the state of the marked cell, as follows:

• If the number of current ongoing voice calls is less than ⌊VHmax⌋, the incoming horizontal handoff call can successfully preempt an ongoing data call if there are ongoing data calls and the queue is not full;
• If the number of current ongoing voice calls is equal to ⌊VHmax⌋, the incoming horizontal handoff call can successfully preempt an ongoing data call only with probability αH if there are ongoing data calls and the queue is not full. In other words, the preemption fails with probability 1 - αH;
• If the number of current ongoing voice calls is larger than ⌊VHmax⌋, the preemption fails even if there are ongoing data calls and the queue is not full in the marked cell of network N1.

When implementing the above preemption scheme, the preemption succeeds (or fails) if the number of current ongoing voice calls is less (or larger) than ⌊VHmax⌋. If the number of current ongoing voice calls is equal to ⌊VHmax⌋, a random number is generated uniformly in the range [0, 1). If the generated random number is less than αH, the preemption succeeds. Otherwise, the preemption fails and the incoming horizontal handoff voice call is forcibly terminated.
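The preemption test with a fractional threshold can be sketched as follows (Python; the function and argument names are illustrative and not taken from the chapter). It applies the three rules above using the integral part ⌊VHmax⌋ (or ⌊VOmax⌋) and the fractional part αH (or αO):

import math
import random

def preemption_succeeds(ongoing_voice_calls, v_max, ongoing_data_calls, queue_full):
    # v_max is the real-valued threshold VHmax (handoff) or VOmax (originating)
    if ongoing_data_calls == 0 or queue_full:
        return False                       # nothing to preempt, or nowhere to queue the data call
    integral = math.floor(v_max)           # integral part of the threshold
    alpha = v_max - integral               # fractional part, as in Eqs. (11) and (12)
    if ongoing_voice_calls < integral:
        return True
    if ongoing_voice_calls == integral:
        return random.random() < alpha     # succeeds with probability alpha
    return False                           # already above the threshold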
Similar to Equation (11), VOmax can be written as

VOmax = ⌊VOmax⌋ + αO,     (12)

where ⌊VOmax⌋ is the integral part of VOmax and αO is the fractional (decimal) part of VOmax. The preemption by originating voice calls is the same as the preemption by horizontal handoff voice calls, except that VOmax is used instead of VHmax. If the preemption fails, the incoming originating voice call is blocked. It is obvious that the thresholds VOmax and VHmax place a certain limit on preemption.
Unlike voice traffic, data traffic is delay-tolerant and benefits from a higher bandwidth channel. That is, a higher bandwidth channel can improve the throughput of a data call and reduce its holding time. Therefore, we assume that an originating data call always tries the highest bandwidth network first if more than one network is available. When an originating data call is generated in an overlapped area, it tries network Ni (2 ≤ i ≤ M) first. If there is no free channel in network Ni, the originating data call is put into the queue of network Ni (2 ≤ i ≤ M). When an originating data call is generated in an area that is covered only by network N1, it is accepted by network N1 if there are free channels in the marked cell of network N1. Otherwise, it is put into the queue of network N1 if the queue is not full, or terminated if the queue is full.
If an active data call mobile user moves into the marked cell from a neighboring cell of network N1, a horizontal handoff data request is generated if only network N1 is available. The horizontal handoff data call is accepted if there are free channels in the marked cell of network N1. Otherwise, it is put into the queue of the marked cell if the queue is not full, or terminated if the queue is full. A data call waiting in the queue of a neighboring cell of network N1 also generates a horizontal handoff data request in the marked cell of network N1 when its mobile user moves into the marked cell. If an active data call mobile user in a singly covered area moves into an area covered by more than one network, a DVH (downward vertical handoff) request is generated in the higher bandwidth network Ni. A data call in the queue of network N1 also generates a DVH request in network Ni (2 ≤ i ≤ M) when its mobile user moves into a doubly covered area. If an active data call mobile user in a higher bandwidth network Ni (2 ≤ i ≤ M) moves out of its coverage before call completion, it generates a UVH (upward vertical handoff) request in an available network Nj (j ≠ i). A data call in the queue of network Ni (2 ≤ i ≤ M) also generates a UVH request in an available network Nj when the mobile user moves out of the coverage of Ni.
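As a rough illustration of the data-call handling just described (Python; the dictionary fields are assumptions made only for this sketch), an originating data call tries the highest-bandwidth covering network first and otherwise falls into that network's queue:

def admit_originating_data_call(covering_networks):
    # covering_networks: dicts ordered from highest to lowest bandwidth, each with
    # 'free_channels', 'queue' (a list) and 'queue_capacity'
    target = covering_networks[0]              # try the highest bandwidth network first
    if target["free_channels"] > 0:
        target["free_channels"] -= 1
        return "accepted"
    if len(target["queue"]) < target["queue_capacity"]:
        target["queue"].append("originating data call")
        return "queued"
    return "terminated"                        # queue full: the call cannot be served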
In the following, we define λODA as the average arrival rate of originating data calls in the different areas, λOV as the average arrival rate of originating voice calls in the marked cell, and λHHV and λHHD as the average arrival rates of horizontal handoff voice and data calls, respectively. λDVH (λUVH) is the average arrival rate of downward (upward) vertical handoff data calls.

Fairness Between Voice and Data Traffic


The aim of WMNs is to provide desired services to mobile users, which can be measured using QoS requirements. The two main QoS requirements of voice traffic are the blocking probability BOV of originating calls and the forced termination probability BHHV of handoff calls. For data traffic, the main QoS requirements include the blocking probability of originating data calls and the average delay. In our system, the QoS requirements of voice traffic are our main concern, but we also do not want to ignore the performance of data traffic. That is, the system provides guaranteed BOV and BHHV to voice traffic and best effort service to data traffic. In other words, BOV and BHHV must be less than certain thresholds. Therefore, we define two probability thresholds, BOVth and BHHVth, where the blocking probability of originating voice calls must not be larger than BOVth and the forced termination probability of handoff voice calls must not be larger than BHHVth.

Figure 2. Performances of voice traffic with different VOmax
Intuitively, by increasing the values of VOmax and VHmax, BOV and BHHV decrease, since the originating voice calls and the horizontal handoff voice calls obtain more priority. Since resources are completely shared by voice and data traffic, the performance of data traffic deteriorates when the performance of voice traffic improves. If the QoS requirements of voice traffic have already been met, further increasing VOmax and VHmax imposes unfairness on data traffic. In order to provide best effort service to data traffic and guaranteed QoS to voice traffic, we have to find the minimum values of VOmax and VHmax that satisfy the QoS requirements of voice traffic. It is obvious that these minimum values of VOmax and VHmax result in the best performance for data traffic. A bisection algorithm is used to find the minimum values of VOmax and VHmax.
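A one-dimensional bisection search of the kind referred to above can be sketched as follows (Python). It assumes, as stated in the text, that the voice-call blocking (or forced-termination) probability decreases as the threshold grows, and it returns the smallest threshold whose probability does not exceed the target; the probability function itself would come from the Markov analysis omitted here, and the search would be applied to VOmax and VHmax in turn.

def min_threshold(prob, target, lo, hi, tol=1e-3):
    # prob(v): blocking or forced-termination probability for threshold v (decreasing in v)
    # target:  the QoS requirement, e.g. 0.05 for BOVth = 5%
    # lo, hi:  search interval; prob(hi) is assumed to satisfy the target
    if prob(hi) > target:
        raise ValueError("QoS requirement cannot be met within [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob(mid) <= target:
            hi = mid        # mid already satisfies the requirement: search below it
        else:
            lo = mid        # mid is too small: the minimum lies above it
    return hi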

Numerical Results
Due to space limitations, we do not provide the details of how the performance metrics are obtained through Markov methods. Figure 2, Figure 3, Figure 4, and Figure 5 give numerical results that show the performance of our proposed schemes.

Figure 3. Average delay of data calls with different VOmax

Figure 2 shows the blocking probability of originating voice calls and the forced termination probability of horizontal handoff voice calls with different VOmax and VHmax. The offered voice traffic load is
fixed. With the increase of VOmax, the blocking probability of originating voice calls decreases. For fixed VOmax, the forced termination probability of horizontal handoff voice calls improves significantly when VHmax increases. It is obvious that larger VOmax and VHmax can provide better performance for voice traffic. However, larger VOmax and VHmax result in the deterioration of data traffic performance. Figure 3 shows that the average delay becomes longer when VOmax and VHmax increase. Therefore, we have to find suitable VOmax and VHmax to provide the best service for data traffic.
In the following, we examine the system performance under our optimum VOmax and VHmax. The total offered traffic load is fixed while the ratio between voice and data traffic changes. The QoS requirements of voice traffic are set to BOVth = 5% and BHHVth = 2%. In order to compare the performance of voice traffic using the optimum set of VOmax and VHmax with other sets, we define three sets of {VOmax, VHmax}: Set 1 = {12.6, 14.6}, Set 2 = {16.5, 19.1}, and Set 3 = {20.2, 21.2}. Figure 4 shows the performance of voice traffic using the different sets of VOmax and VHmax. Only Set 3 and the optimum set can provide the guaranteed BOVth and BHHVth under any offered voice traffic load. When using the optimum set, the blocking probability of originating voice calls and the forced termination probability of horizontal handoff calls are not larger than the guaranteed values, i.e., BOVth and BHHVth. Set 2 can provide the guaranteed BOVth and BHHVth only under low traffic load. The QoS requirements of voice traffic cannot be met by Set 1, since the forced termination probability of horizontal handoff voice calls is larger than BHHVth under any offered voice traffic load. Figure 5 shows the average delay of data traffic.

Figure 4. Performances of voice traffic with different offered voice traffic load

Figure 5. Average delay of data calls with different offered voice traffic load

Although Set 3 can provide better service
for voice traffic than the other sets, it achieves the worst average delay of data traffic. Set 1 provides the best service for data traffic, as shown in Figure 5. However, it cannot provide satisfactory service for voice traffic. Compared to Set 2, the optimum set achieves better data traffic performance when both of them satisfy the QoS requirements of voice traffic. Therefore, the optimum set can provide the best service for data traffic while satisfying the QoS requirements of voice traffic.

FUTURE TRENDS
In the next generation of wireless and mobile networks (Beyond 3G or 4G), cellular networks will still play a major role due to their dominant market share and good service quality. Other types of networks, such as WiMAX and WLAN, are also witnessing fast deployment. However, no single type of WMN can always provide the "best" service for every mobile user everywhere. Network integration is a promising approach to offering the best service. Based on the type of the user's service request and network availability, the mobile user can obtain the best service from IHWMNs. However, such integration still faces a number of challenges, as follows:


• Adjustment of bandwidth allocation in IHWMNs: In an IHWMN, different types of WMN have different bandwidth provision. On the other hand, different types of traffic have different bandwidth requirements. Therefore, users may experience unstable QoS when vertical handoffs happen frequently. Bandwidth adjustment or smoothing algorithms are required to make the transition smooth and achieve more stable QoS.
• Application of bandwidth splitting in IHWMNs: Some applications, such as Internet TV, have very high bandwidth requirements. As a result, a single type of WMN may not support such high bandwidth applications very well. Bandwidth splitting is an approach to solve this problem, where the whole bandwidth requirement is divided into several parts and different parts are served by different types of WMNs. However, such a splitting approach raises a number of issues, such as bandwidth splitting strategies and synchronization among different networks. Efficient algorithms are required to provide satisfactory services through multiple networks at the same time.
• Adaptive bandwidth allocation in IHWMNs: In modern wireless and mobile systems, adaptive bandwidth allocation can be applied to accept more mobile users when the incoming traffic becomes heavy. However, it becomes more difficult when applied to support multiple services in IHWMNs. The system needs to decide whether to make a vertical handoff, which may cause adaptive bandwidth reallocation, or to stay within the current serving network.
• Charging model and its effect: Different types of WMNs have different charging models. For example, some networks charge the access fee based on the amount of traffic, while other networks have monthly charging plans. Such diversity of charging models will affect users' preferences. An approach that combines charging models and resource management schemes is urgently required.


CONCLUSION
In this chapter, we have reviewed the current research in integrated heterogeneous wireless and mobile networks. We also proposed network selection strategies and resource management schemes for IHWMNs and analyzed their system performance. Unlike most existing network selection strategies, which are user-centric, our proposed CFNS (cost-function-based network selection) strategy is designed from the system's perspective while also considering users' needs. The numerical results showed that the proposed CFNS strategy can achieve a tradeoff between the blocking probability of originating calls and the average received signal strength. We also proposed a preemption-based resource management scheme to support voice and data traffic in IHWMNs, which takes advantage of the heterogeneity of traffic and networks and the moving nature of mobile users. In the proposed preemption scheme, two thresholds were set to differentiate the originating and horizontal handoff voice calls. In order to provide the best service for data traffic and the guaranteed QoS for voice traffic, a bisection algorithm was used to find the suitable thresholds. The numerical results showed that the proposed scheme can provide best effort service to data traffic while satisfying the QoS requirements of voice traffic. Finally, we discussed the open issues for IHWMNs. We believe that the research topics and analytic methods presented in our work will contribute to the research and development of future IHWMNs.

REFERENCES
Agrawal, D. P., & Zeng, Q.-A. (2006). Introduction to wireless and mobile systems (2nd ed.). Florence, KY: Thomson.
Akyildiz, I., Mohanty, S., & Xie, J. (2005). A ubiquitous mobile communication architecture for next-generation heterogeneous wireless systems. IEEE Radio Communications, 43(6), 29–36. doi:10.1109/MCOM.2005.1452832
Chen, W., Liu, J., & Huang, H. (2004). An adaptive scheme for vertical handoff in wireless overlay networks. In IEEE International Conference on Parallel and Distributed Systems (ICPADS) (pp. 541–548). Washington, DC: IEEE.
3GPP TS 23.234 V7.5.0 (2007). 3GPP system to WLAN interworking, 3GPP specification. Retrieved May 1, 2008, from http://www.3gpp.org
Liu, X., Li, V., & Zhang, P. (2006). Joint radio resource management through vertical handoffs in 4G networks. In IEEE GLOBECOM (pp. 1–5). Washington, DC: IEEE.
McNair, J., & Fang, Z. (2004). Vertical handoffs in fourth-generation multinetwork environments. IEEE Wireless Communications, 11(3), 8–15. doi:10.1109/MWC.2004.1308935
Nam, M., Choi, N., Seok, Y., & Choi, Y. (2004). WISE: Energy-efficient interface selection on vertical handoff between 3G networks and WLANs. In IEEE PIMRC 2004, Vol. 1 (pp. 692–698). Washington, DC: IEEE.
Park, H.-S., Yoon, S.-H., Kim, T.-Y., Park, J.-S., Do, M., & Lee, J.-Y. (2003). Vertical handoff procedure and algorithm between IEEE 802.11 WLAN and CDMA cellular network (LNCS, pp. 103–112). Berlin: Springer.
Pavlidou, F. N. (1994). Two-dimensional traffic models for cellular mobile systems. IEEE Transactions on Communications, 42(2/3/4), 1505–1511. doi:10.1109/TCOMM.1994.582831
Salkintzis, A. K. (2004). Interworking techniques and architectures for WLAN-3G integration toward 4G mobile data networks. IEEE Wireless Communications, 11(3), 50–61. doi:10.1109/MWC.2004.1308950
Salkintzis, A. K., Fors, C., & Pazhyannur, R. (2002). WLAN-GPRS integration for next generation mobile data networks. IEEE Wireless Communications, 9(5), 112–124. doi:10.1109/MWC.2002.1043861
Shen, W., & Zeng, Q.-A. (2007). Cost-function-based network selection strategy in heterogeneous wireless networks. In IEEE International Symposium on Ubiquitous Computing and Intelligence (UCI-07). Washington, DC: IEEE.
Shen, W., & Zeng, Q.-A. (2008). Cost-function-based network selection strategy in integrated heterogeneous wireless and mobile networks. To appear in IEEE Transactions on Vehicular Technology.
Song, Q., & Jamalipour, A. (2005). Network selection in an integrated wireless LAN and UMTS environment using mathematical modeling and computing techniques. IEEE Wireless Communications, 12(3), 42–48. doi:10.1109/MWC.2005.1452853
Stemm, M., & Katz, R. H. (1998). Vertical handoffs in wireless overlay networks. ACM Mobile Networking (MONET), Special Issue on Mobile Networking in the Internet, 3(4), 335–350. New York: ACM.
Wang, H., Katz, R., & Giese, J. (1999). Policy-enabled handoffs across heterogeneous wireless networks. In IEEE Workshop on Mobile Computing Systems and Applications (WMCSA) (pp. 51–60).
Wang, J., Zeng, Q.-A., & Agrawal, D. P. (2003). Performance analysis of a preemptive and priority reservation handoff scheme for integrated service-based wireless mobile networks. IEEE Transactions on Mobile Computing, 2(1), 65–75. doi:10.1109/TMC.2003.1195152
Xu, Y., Liu, H., & Zeng, Q.-A. (2005). Resource management and QoS control in multiple traffic wireless and mobile Internet systems. Wiley Journal of Wireless Communications and Mobile Computing (WCMC), 2(1), 971–982. doi:10.1002/wcm.360
Zhang, Q., Guo, C., Guo, Z., & Zhu, W. (2003). Efficient mobility management for vertical handoff between WWAN and WLAN. IEEE Communications Magazine, 41(11), 102–108. doi:10.1109/MCOM.2003.1244929
Zhu, F., & McNair, J. (2004). Optimizations for vertical handoff decision algorithms. In IEEE Wireless Communications and Networking Conference (WCNC) (pp. 867–872).

KEY TERMS AND DEFINITIONS


Fairness Among Different Types of Traffic: Due to the limitations of some resource management schemes, some traffic may be allocated too many resources while other traffic may achieve very bad performance.
Integrated Heterogeneous Wireless and Mobile Networks: A new network architecture that combines different types of wireless and mobile networks and provides comprehensive services.
Multi-Mode Terminal: A terminal equipped with multiple network interfaces.
Multiple Traffic: The combination of different types of traffic, e.g., voice and data traffic in this chapter.
Network Selection Strategy: A strategy to determine which network should be connected to in an IHWMN.
Preemption: A resource allocation scheme that preempts ongoing lower-priority traffic when higher-priority traffic arrives and there are not enough resources in the system.
Resource Management: The allocation of radio resources, such as channels and bandwidth, to different types of traffic. Optimization of resource management can achieve better system performance.
Vertical Handoff: A switching process that changes the connection from one network to a different type of network in integrated heterogeneous wireless and mobile networks.


Section 9

Fault Tolerance and QoS


Chapter 32

Scalable Internet
Architecture Supporting
Quality of Service (QoS)
Priyadarsi Nanda
University of Technology, Sydney (UTS), Australia
Xiangjian He
University of Technology, Sydney (UTS), Australia

ABSTRACT
The evolution of the Internet and its successful technologies has brought tremendous growth in business, education, research, etc. over the last four decades. With the dramatic advances in multimedia technologies and the increasing popularity of real-time applications, Quality of Service (QoS) support in the Internet has recently been in great demand. With the deployment of such applications over the Internet in recent years, and the trend toward managing them efficiently with a desired QoS in mind, researchers have been pursuing a major shift from the Internet's Best Effort (BE) model to a service-oriented model. Such efforts have resulted in Integrated Services (Intserv), Differentiated Services (Diffserv), Multi Protocol Label Switching (MPLS), Policy Based Networking (PBN) and many more technologies. But the reality is that such models have been implemented only in certain areas of the Internet, not everywhere, and many of them also face scalability problems when dealing with the huge number of traffic flows with varied priority levels in the Internet. As a result, an architecture that addresses the scalability problem and satisfies end-to-end QoS still remains a big issue in the Internet. In this chapter the authors propose a policy based architecture which they believe can achieve scalability while offering end-to-end QoS in the Internet.

INTRODUCTION
The concept of Policy Based Networking has long been in use by networks for controlling traffic flows and allocating network resources to various applications. A network policy defines how traffic, users and/or applications should be treated differently within the network based on QoS parameters, and may include policy statements. In most cases, such statements are defined and managed manually by the network administrator based upon the Service Level Agreements (SLAs) between the network and its customers. Management of network devices so that policy conditions are satisfied is usually carried out through a set of actions performed on the various devices. For example, Internet Service Providers (ISPs) rely on network operators to monitor their networks and reconfigure the routers when necessary. Such actions may work well within an ISP's own network, but when considered across the Internet, they may have a serious effect on balancing traffic across many ISPs on an end-to-end basis. Hence, managing traffic over multiple Autonomous System (AS) domains creates an obvious need for change in the architecture of the current Internet and the way it functions.

DOI: 10.4018/978-1-60566-661-7.ch032
Traffic control and policy management between these AS domains also encounter an additional set of challenges that are not present in the intra-domain case, including the trust relationships between competing ISPs. We demonstrate the architecture based on these heterogeneous policy issues and identify various architectural components which may contribute significantly towards the simplification of traffic management over the Internet. The validity of the architecture and its deployment in the Internet depend heavily on the following factors:
1. Service Level Agreements (SLAs)
2. Autonomous Systems (ASs) relationships
3. Traffic engineering and Internet QoS routing
4. Internet-wide resource and flow management
5. Device configuration in support of QoS

The architecture takes the above-mentioned factors into account in an integrated approach in order to support end-to-end QoS over the Internet. These factors are discussed and the design objectives of our architecture are presented throughout this chapter. We first discuss the design objectives of the architecture. In section two, we introduce background knowledge about the Internet topology and hierarchy, and identify the various relationships which exist between those hierarchies. We also discuss how this knowledge of the relationships between Autonomous Systems affects key design decisions. Section three provides an overview of our architecture with a brief description of the various components involved. Section four summarizes the key features of the architecture and concludes this chapter.

DESIGN OBJECTIVES
A Service Level Agreement (SLA) is one of the first requirements for implementing a policy-based network architecture in the Internet. With a growing demand for better QoS, AS domains and network operators need to enforce strong SLAs at various service boundaries through additional supporting mechanisms. Hence, in order to achieve end-to-end QoS over the Internet, the SLAs must be extended beyond the standard customer-provider relationships used in the past, and the architecture should incorporate the necessary components to build such SLAs dynamically, spanning the different ASs in the end-to-end path.
The current Internet is a collection of interconnected ASs, where the connections between the ASs are strongly influenced by the relationships based on which such connectivity is formed. Fundamentally, the relationships between those ASs may be categorized as peer-to-peer, customer-provider and sibling (Gao, 2001), and these are the driving forces behind the economic benefits of individual domains. Most ASs try to perform load balancing through certain links connected to their neighbors and peers by using traffic engineering approaches, such as MPLS and ATM, policy routing decisions supported by the Border Gateway Protocol (BGP), or a combination of traffic engineering and Internet routing. However, there is no standard mechanism which may be applied universally by individual networks.
One alternative for supporting better end-to-end QoS over the Internet is to deploy overlay networks. Such approaches are also being used by various network service providers to support many new applications and protocols without any changes to the underlying network layer (Li & Mohapatra, 2004). Because overlay traffic uses application layer profiles, overlays can effectively use the Internet as a low-level infrastructure to provide high-level services to end users of various applications, provided that the lower layers support adequate QoS mechanisms.
Traffic engineering (Awduche, Chiu, Elwalid, Widjaja & Xiao, 2002) is crucial for any operational network, and this is reflected in the architecture by using BGP-based parameter tuning in support of end-to-end QoS. We discuss various aspects of traffic engineering and its impact on the architecture in this chapter.
Managing resources for QoS flows plays an important role in supporting multiple users with multiple service requirements and can be seen directly as a result of traffic engineering in the Internet. Because ASs act upon their own policy rules defined by their network administrators, achieving network-wide traffic engineering and resource management is quite difficult, though not impossible. Our proposed architecture is based upon a hierarchical resource management scheme (Simmonds & Nanda, 2002) which distributes the control of network functions at three different levels.
Policy Based Networking has been seen as a major area of research in the past few years and continues to draw attention from researchers, network vendors and service providers due to the increasing number of network services over the Internet. The need for a policy-based network architecture should also be considered more actively for current as well as future demands on Internet use.
Earlier work on policy-based network management within the IETF has resulted in two standards working groups: the Resource Allocation Protocol (RAP) and Policy Framework working groups (Yavatkar, Pendarakis & Guerin, 2000; Salsano, 2001). These standard architectures describe standard policy mechanisms (mainly policy core schemas and QoS schemas that are currently being used in policy servers, specifically to manage intra-domain traffic) but do not specify how and where they should be applied within the current structure of the Internet. The following key components are addressed within the architecture using a bottom-up approach:
1. Service differentiation mechanism
2. Network-wide traffic engineering
3. Resource availability to support QoS
4. Routing architecture to dynamically reflect any policy changes
5. Inter-domain, intra-domain and device-level policy co-ordination mechanisms

The bottom-up approach emphasizes that the network can be made ready for policy compliance by configuring device-level policies first and storing them in a database. Then, looking at the high-level policies based on business relationships, a mapping function is able to pick the right devices, traffic routes and other associated resources in order to satisfy the QoS of various services. Based on this, the architecture is broadly divided into a three-layer model, which is presented later in this chapter.
The architecture is able to control various network devices with policy rules, both within and between the AS domains, dynamically, as opposed to the current static procedures deployed in the Internet. The architecture then works collaboratively with underlying technologies such as Diffserv, Intserv, and other lower-layer support to achieve performance enhancements over the Internet. Scalability is considered one of the important features of the architecture, and in order to address the scalability issue within such an architecture, we emphasize that each domain may manage its so-called policy devices hierarchically in three groups:
1. The service support provided within the network, with the related traffic characteristics (such as bandwidth, delay, loss and jitter) for any traffic belonging to that specific service class
2. The devices (such as routers, switches and servers) falling within each of the service classes in support of QoS
3. The management of all the devices in the second group by fine-tuning traffic engineering through the proper selection of protocols and their related parameters

Currently, most networks involving policy-based management activities support intra-domain QoS-guaranteed services for their own networks only. However, in this architecture we consider support for inter-domain QoS (Quoitin & Bonaventure, 2005; Agarwal, Chuah & Katz, 2003) with the assumption that QoS inside each network will still be met.
Hence, our effort is to present the design of a network architecture based on the policies negotiated between the customer and the service provider in a direct SLA, and to define a policy negotiation/co-ordination mechanism between the ASs, because without such negotiation an end-to-end QoS model is difficult to achieve, particularly at the service level. By doing so, the architecture allows network administrators to automate many key labor-intensive tasks and hence increase overall QoS-related performance for various services. The network architecture is designed to achieve the following objectives when deployed across the Internet:


• Scalability: Considering the intrinsic characteristics of various traffic types and their QoS requirements in the Internet, the architecture can incrementally scale to support a large number of users with heavy traffic volumes and real-time performance requirements. It is also intended to manage control plane activities by easily adapting to a three-layer hierarchy, with a clear understanding of the communication between each layer.
• Efficient use of network resources: The architecture attempts to allocate resources at three different levels: device level, network level and application level. Interactions between these three levels are controlled through dedicated resource managers and their associated protocols. One of the key components of such a resource management strategy is a resource availability mechanism that uses BGP community attribute announcements carrying specific values for the resources available within an AS domain.
• Provisioning QoS parameters between end nodes: The architecture does not restrict itself to technology specifics such as Intserv, Diffserv or MPLS. However, it does recommend using aggregated resource allocation strategies along the source and destination networks in the Internet. Such an approach simplifies overall network management and achieves scalability in the face of an increased user base, with better co-ordination between the control and data planes.
• Support for standard architectural components: In order to support optimal QoS performance involving various applications in the Internet, the architecture is built upon various key functions such as traffic engineering, inter-domain routing, resource management and Service Level Agreements. Hence, an integrated framework is presented, without which end-to-end QoS would be difficult to achieve in the Internet.
The resource management mechanism is implemented through the proper co-ordination of service parameters, both within an AS and between neighboring ASs, in a hierarchical manner. The architecture also ensures that there are sufficient available resources on both intra- and inter-domain links to carry a QoS-aware application before admitting the flow into the network. Such a strategy controls various factors, such as the maximum loss rate, delay and jitter, contributing to performance improvement before deciding which QoS flows to allow and which to deny, leading to a better QoS model in the Internet.
We define the policy and trust issues between various AS domains based on their connectivity in the Internet. We also investigate the effect of such policies on the other components of the architecture. Policies are central to two or more entities where various levels of service are offered, based upon the Service Level Agreements (SLAs) between them. The current Internet is comprised of groups of ASs (ISPs) placed in different tiers, and the connectivity between these tiers is provided through Internet Exchange Points (IXPs) (Huston, n.d.). One of the key concerns about connectivity among the various tiers is the kind of relationship each AS holds with its neighbors and peers. Hence, the architecture considers those relationships between ASs and investigates further to identify their effect on the various components of the architecture. The following section of this chapter presents AS relationships along with the AS hierarchy in the Internet.

AUTONOMOUS SYSTEM (AS) RELATIONSHIPS AND NETWORK POLICIES


The current Internet, with more than 16,000 Autonomous Systems (ASs), reflects tremendous growth both in size and complexity since its commercialization. These ASs may be classified into different types, such as Internet Service Providers (ISPs), universities or other enterprises having their own administrative domains. Sometimes, an administrative domain may have several ASs. Based upon the properties of each AS and the connections between them, it is important to develop intelligent routing policies for the transportation of Internet traffic and to achieve the desired performance objectives for various applications. We first describe those properties related to individual ASs and then work further to reflect our principles within the scope of the proposed architecture. In (Gao, 2001), Gao classified the types of routes that could appear in BGP routing tables on the basis of the relationships between ASs and presented a heuristic algorithm, based on the degree of connectivity, for inferring AS relationships from BGP routing tables.

Internet Connectivity
Based on network properties, type of connectivity and traffic transportation principles, ASs may be classified under the following three categories. While most stub networks are generic in nature and mainly limited to customer networks only, multi-homed and transit ASs are widely used within the Internet hierarchy because of the working relationships through which traffic is transported through them.


1. Stub AS: A stub AS usually refers to an end customer's internal network, typically a LAN (Local Area Network). One of the most important properties of a stub network is that hosts in a stub network do not carry traffic for other networks (i.e., no transit service).
2. Multi-homed AS: Many organizations having their own AS depend upon Internet connectivity to support critical applications. One popular approach for improving Internet connectivity is to use a technique called multi-homing to connect to more than one Internet service provider (ISP). Multi-homing can be very effective in ensuring continuous connectivity, eliminating the ISP as a single point of failure, and it can be cost effective as well. However, ASs must plan their multi-homing strategy carefully to ensure that such a scheme actually improves connectivity instead of degrading service availability. Also, the number of providers an AS can subscribe to is always limited because of economic considerations. In most cases, an AS uses only one of its ISP connections for normal traffic, whilst the second one is reserved as a back-up link in case of failure. From a traffic engineering point of view, such a scheme improves traffic throughput across the multiple links.
3. Transit AS: Transit ASs are described as multi-homed due to their multiple connections with other service providers, and they carry both local and transit traffic. Such networks are generally the ISPs located within the Internet hierarchy (tier-1, tier-2, ..., customer network), as shown and described below. The figure does not show tiers 3 to 1; however, connectivity between tiers is through exchange points in the Internet. Such exchange points then carry the transit traffic between each tier connected to them. (Figure 1)

Figure 1. Transit AS: Multiple connections in the Internet hierarchy

Connectivity among different ISPs in the Internet is always subject to the tier in which they are placed, the size of each ISP and the number of subscribers. There are mainly four ISP tier levels:

• Tier-1: These ISPs are called transit providers in each country and carry the core traffic in the Internet.
• Tier-2: These are nationwide backbone networks with over a million subscribers. Such networks are connected to the transit ISPs in each country.
• Tier-3: Tier-3 ISPs are regional backbone networks which may have over 50,000 subscribers and connect to Tier-1 ISPs through peering relationships.
• Tier-4: These ISPs are local service providers and consist of many small ISPs in each country. Tier-4 ISPs support fewer than 50,000 users, offering local services to their customers.

Apart from the above-mentioned properties of ASs, they can be categorized based upon the contractual relationships and agreements between them. These agreements between the ASs play an important role in shaping the structure of the Internet as well as end-to-end performance characteristics. Such relationships between the ASs are fundamental to the architecture and are discussed in the following:

1. Customer-Provider relationship: In a customer-provider relationship scenario, a customer buys services (network connectivity and service support) from its provider, which is typically an ISP. Similarly, ISPs (such as tier-4 providers) buy the services they offer to their customers from their upstream service providers. In other words, a provider transits traffic for its own customers, whereas a customer does not transit traffic between any two of its providers even if multi-homed. A network architecture supporting the customer-provider relationship needs to address issues relating to the Service Level Agreements (SLAs) enforced between the customer and its providers.
2. Peer-to-Peer relationship: Two ASs offering connectivity between their respective customers without the exchange of any payments are said to have a peer-to-peer relationship. Hence, these two ASs agree to exchange traffic between their respective customers free of charge. Such a relationship is enforced through routing policies between ASs at the same level within the Internet hierarchy. For example, a tier-4 service provider must peer with another service provider at the same level, i.e., another tier-4 only. Such a relationship and agreement between two ASs mutually benefits both, perhaps because roughly equal amounts of traffic flow between them.
3. Sibling relationship: A sibling relationship may be established between two or more ASs if they are closely placed to each other. In this situation, the relationship allows the individual domains to provide connectivity to the rest of the Internet for each other. Also sometimes called a mutual transit relationship, it may be used to provide backup connectivity to the Internet for each other when the connection of one of the ASs fails. Sibling relationships between ASs may also be used for load balancing and for using bandwidth efficiently among various services in the Internet, provided the ASs involved agree to such an arrangement.

In order to design a new Internet architecture based on the AS properties and relationships (stub, multi-homed or transit, and customer-provider, peer or sibling), it is important that the architecture first of all support them and then derive related policies to enforce them across the Internet. Such an architecture will then be able to address the following issues when deployed across the Internet:

• Resource management with multiple service support
• End-to-end QoS for individual services
• Load balancing
• Fault management
• Facilitating overlay routing
• Security related to information sharing between ASs

One mechanism which has the potential to address the above issues is BGP, which is dynamic and also supports network policies. The next section of this chapter provides details about how BGP can be used to support network policies in our proposed architecture.

Border Gateway Protocol (BGP) and AS Relationships


Currently, the Border Gateway Protocol (BGP) is deployed across the Internet as the standard routing protocol between AS domains, where AS relationships are enforced by configuring certain BGP parameters. BGP carries nearly 90% of Internet route announcements due to its rich features supporting network policies, and it contributes significantly towards Internet load balancing, traffic engineering and support for fall-back procedures in the event of network bottlenecks. BGP, as the inter-domain routing protocol standard for the current Internet, allows each AS domain to select its own administrative policy by choosing the best route and announcing and accepting routes to and from other AS domains connected as neighbors. Though such an approach works reasonably well for most ASs individually, satisfying their own objectives and maximizing their profits, it does not address the impact of such an approach on a global scale.
Before presenting the architecture in detail, in the following we present a few of the BGP approaches for policy enforcement between ASs. We present those policy issues and how BGP is currently configured in relation to the AS relationships mentioned above.
The Border Gateway Protocol (BGP-4) (Rekhter & Li, 2002) was a simple path vector protocol when first developed, and the main purpose of BGP was to communicate and control path-level information between ASs so as to control the route selection process between them. Using the path-level announcements of its neighbors, an AS decides which path to use in order to reach specific prefixes. One of the main reasons
Figure 2. Connectivity among autonomous systems


Table 1. BGP routing decision process


1. Find the path with the highest Local-Preference
2. Compare AS-path lengths and choose the path with the shortest length
3. Look for the path with the lowest MED attribute
4. Prefer eBGP-learned routes over iBGP-learned routes
5. Choose the path with the lowest IGP metric to the next hop

ASs use BGP for inter-domain routing is that their own policies can be communicated to their neighbors and subsequently across the whole Internet.
Many modifications to the original BGP have happened over time, and today we see BGP as a protocol weighed down with a huge number of enhancements that overlap and conflict in various unpredictable ways. In this chapter we do not try to analyze those complex issues with BGP; instead, our aim is to use BGP as a transport vehicle across ASs and to implement network-wide policies between them. It is sensible at this point to consider the ASs as ISPs; we can then be more specific in exploring policies related to those ISPs and work towards better management of Internet-wide traffic, mapping it to the relationships we have mentioned before. Henceforth, this chapter will use the terms AS and ISP interchangeably. Figure 2 shows a scenario connecting different ASs and representing their relationships with each other.
One of the key features of BGP is the decision process through which each BGP router determines the path to destination prefixes. The rules are given in Table 1.
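A simplified sketch of this decision order (Python; the attribute names are illustrative and this is not a full BGP implementation, which, for example, compares MED values only under additional conditions) ranks candidate routes by the rules of Table 1 in sequence:

def best_route(routes):
    # each route: dict with 'local_pref', 'as_path' (list of AS numbers),
    # 'med', 'learned_via' ('ebgp' or 'ibgp') and 'igp_metric'
    return min(
        routes,
        key=lambda r: (
            -r["local_pref"],                        # 1. highest Local-Preference
            len(r["as_path"]),                       # 2. shortest AS path
            r["med"],                                # 3. lowest MED
            0 if r["learned_via"] == "ebgp" else 1,  # 4. prefer eBGP over iBGP
            r["igp_metric"],                         # 5. lowest IGP metric to next hop
        ),
    )

Under this ordering, a path that has been artificially lengthened by AS path pre-pending (discussed below) loses to a shorter alternative, which is exactly the effect pre-pending is meant to achieve.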
As shown in Table 1, the relationships between individual ASs are realized through BGP attributes, and in order to determine the actions to be performed for traffic engineering between them, the following must be considered:

• Use of Local Preference to influence path announcements: In a customer-provider relationship, providers prefer routes learned from their customers over routes learned from peers and providers when all those routes to the same prefix are available. Hence, in Figure 2, ISP A would certainly prefer to reach the prefixes within customer Y via customer X instead of via ISP B. By doing so, ISP A can generate revenue by sending traffic through its own customer. If ISP A instead sends traffic through its own provider (not shown), it costs ISP A money, and if it sends traffic through ISP B (a peer), the credit rating of ISP A will be downgraded. In order to implement such a policy (prefer customer route advertisements) using the Local Preference attribute, ISPs in general must assign a higher Local Preference value to the path for a given prefix learned from their customer. In (Caesar & Rexford, 2002), Caesar et al. described assigning a non-overlapping range of Local Preference values to each type of peering relationship between AS domains and noted that the Local Preference can be varied within each range to perform traffic engineering between them. Hence, the Local Preference attribute in BGP can be used to perform traffic engineering, especially to control outgoing traffic between ASs, as well as to uphold the policy relationships between them.
Use of AS path pre-pending and Multi-Exit Discriminator (MED) attributes to influence transit and peering relationships: ISPs may influence the load balance of incoming traffic on the different links connected to their neighbors. Such a scheme can be implemented by selectively exporting routes to their neighbors. For example, an ISP may selectively announce its learned paths to its peer, thereby forcing the peer to only get information on specific routes. Transit ISPs can control their incoming traffic by selectively announcing their learned routes to their peers. Apart from this, BGP makes use of the AS path pre-pending technique, in which an ISP prepends its own AS number multiple times when announcing a path to its neighbor.
Because the BGP decision process selects the path with the shortest AS path (rule 2 in Table 1), such a technique will force the neighbor to choose another path, if one is available, instead of the pre-pended AS path it has been announced. To investigate further the policy mechanisms associated with AS path pre-pending and load balancing between peers, consider Figure 2 again. If ISP A decides to pre-pend its AS number three times in the path announcement to ISP B, ISP A will announce the path ISP A, ISP A, ISP A, Customer A to ISP B. Hence ISP B will instead choose a different path to reach the prefixes within Customer A. Such a scheme, however, is often tuned manually on a trial-and-error basis simply to avoid attracting more traffic from other domains. In the architecture we discuss, AS path pre-pending is used only when the relationship between peers is based upon strict exchange of traffic without monetary involvement.
In another scheme, ISPs use the Multi-Exit Discriminator (MED) attribute to control incoming traffic from their neighbors. ISPs having multiple links to other ISPs can use the MED attribute to influence which link the other ISP should use to send its traffic towards a specific destination. However, use of the MED attribute must be negotiated beforehand between the two peering ISPs. In the architecture, the MED attribute is only used between transit ISPs having multiple links to other ISPs.

Use of the community attribute for route export to neighbors: ISPs have been using BGP community attributes for traffic engineering and for providing their customers finer control over the redistribution of their routes (Quoitin, Uhlig, Pelsser, Swinnen & Bonaventure, 2003).

The Internet Assigned Numbers Authority (IANA) typically assigns a block of 65536 community values to each AS, though only a few of them are used to perform community based traffic engineering. By tagging these attributes onto a path announcement, an ISP may ask its neighbor or customer to perform a set of actions on the AS path when distributing that path to its own neighbors (Li & Mohapatra, 2004), (Uhlig, Bonaventure, & Quoitin, 2003). For example, in Figure 2, ISP A may want its customer (customer X) to pre-pend customer X's AS path information three times before announcing it further upstream. Similarly, ISP B may ask ISP A not to announce any path to customer X (e.g. NO_EXPORT). By doing so, ISPs are able to better control their incoming traffic. However, because community based attributes are yet to be standardized, a uniform structure for these attributes is needed before they can be applied across the Internet. Also, because each community value requires defining a filter for each supported community in the BGP router, such a process adds more complexity to the already fragile BGP and hence increases the processing time of BGP messages (Yavatkar, Pendarakis & Guerin, 2000).
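The sketch below is an added illustration (not part of the original text) of one way the non-overlapping Local Preference ranges and the selective AS path pre-pending discussed above could be expressed as a simple import/export policy. The range boundaries, AS numbers and function names are assumptions made only for this example.

```python
# Hypothetical, non-overlapping Local-Preference ranges per relationship,
# in the spirit of Caesar and Rexford (2005): customer routes are always
# preferred over peer routes, which are preferred over provider routes.
LOCAL_PREF_RANGE = {
    "customer": (300, 399),
    "peer":     (200, 299),
    "provider": (100, 199),
}

def import_policy(route, relationship, tweak=0):
    """Assign a Local Preference from the range of the peering relationship.
    'tweak' moves the value within the range for finer traffic engineering."""
    low, high = LOCAL_PREF_RANGE[relationship]
    route["local_pref"] = min(high, low + tweak)
    return route

def export_policy(route, my_asn, prepend=0):
    """Optionally pre-pend our own AS number before announcing the route,
    making the path look longer and less attractive to the neighbour."""
    return dict(route, as_path=[my_asn] * (prepend + 1) + route["as_path"])

route = {"prefix": "203.0.113.0/24", "as_path": [65020], "local_pref": 0}
route = import_policy(route, "customer")            # prefer customer-learned route
announcement = export_policy(route, my_asn=65001, prepend=2)
print(announcement["as_path"])                      # [65001, 65001, 65001, 65020]
```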
In summary, AS relationships play an important role in Internet connectivity between various ISPs and contribute significantly towards designing a new architecture for the Internet. BGP based traffic engineering can be made more scalable by carefully selecting and configuring the attributes based on the business relationships between the various ISPs. The architecture presented in the next section is based upon the preceding analysis of ISP relationships and the corresponding traffic engineering attributes supported by the features of BGP.


THREE-LAYER POLICY ARCHITECTURE FOR THE INTERNET


In this section we present the architecture supporting scalability and end-to-end performance, which accomplishes the following major tasks:

Traffic flow management and resource monitoring: Flows are usually aggregated based on desired characteristics such as delay, loss, jitter, bandwidth requirements and assigned priority levels, which determine end-to-end performance for various applications. Based on such aggregated flow characteristics, Bandwidth Brokers in each AS domain then decide whether to accept or reject the flows. Such flow management activities are performed at layer-2 of the architecture, delivering network layer QoS across the multiple domains in the end-to-end path.
QoS area identification: Each domain in the Internet is engineered and provisioned to support a variety of services to its customers. In the worst case, an AS having connectivity to the Internet at least supports Best Effort service as a default service. In addition, many of these ASs are also engineered to support VoIP, video conferencing and other time critical applications. Identifying these AS domains in the Internet and routing traffic through them improves overall QoS for various applications. This function is supported at layer-3 of the architecture, which essentially performs inter-domain QoS routing by identifying QoS areas that support a specified service. This layer is different from TCP based session establishment because, in our architecture, it is assumed that there exist multiple AS domains offering multiple levels of QoS. Hence, using QoS routing, the architecture supports the selection of different QoS networks based on AS relationships for QoS sensitive applications.
Traffic engineering and load balancing: Since traffic engineering is important to improve end-to-end QoS, the architecture tries to balance traffic flows between various domains through a policy co-ordination mechanism. The mechanism uses an approximation technique to resolve any traffic parameter conflict between neighboring domains and improve overall QoS for services.
Policy based routing: Applications requiring strict QoS must adhere to certain policies; our architecture uses BGP based policy decisions and applies various attributes of BGP to compute optimized routing paths. A route server in each domain relieves the routers from such complex policy decisions and processes this information quickly.

The above mentioned functions are integrated and operate at different levels within our proposed architecture. Resource management within and between AS domains is supported through the hierarchical grouping of various architectural components, as described below:
1. The architecture is hierarchical and operates at different levels to support both high level (business focused) and low level (device focused) resource availability for maintaining end-to-end QoS.
2. The control plane of the architecture is separated from the data plane in order to maintain scalability across the Internet with a wide variety of service classes.
3. Each level in the hierarchy is controlled by a manager which receives resource availability information from components within that level only, and the output is informed to a higher level in the hierarchy. Hence the approach is bottom up.
4. Communication between same levels is allowed through peer-to-peer session establishment without additional overhead to manage the Internet. Any conflict for end-to-end QoS is resolved through proper policy co-ordination mechanisms.
5. Apart from resource management, the architecture also includes routing and traffic engineering and hence is an integrated approach to manage various services in the Internet.

The logical view of the architecture is presented in Figure 3. Each level in the hierarchy is associated with a number of functions independent of any technology choice.
One of the key aspects of the architecture is the separation of the control plane from the data forwarding plane by hierarchically grouping network management functions. It is also important to note that both layer-2 and layer-3 can be combined to form the inter-network layer of the TCP/IP network architecture. Essentially, layer-3 of the architecture determines the AS domains supporting a specified QoS through which a flow can be set up, but does not go beyond that to address the issues of resource and flow management, which are performed by layer-2 only. The architecture also considers flow and resource management functions between domains only, as individual domains need to guarantee QoS based on the device capabilities within their own domain. A detailed description of the individual layers of the architecture is presented below.

Layer-1: Device Layer QoS


Network devices, including routers, switches and servers, are the key components for maintaining and managing traffic flows across networks in an AS domain. These devices can be configured with low-level policy information managed by a single device manager or by multiple device managers, depending on the size of the network. Support for QoS for various applications heavily depends on identifying, configuring, maintaining and accounting for these QoS aware devices within an AS. In order to support device level QoS in our architecture, the following policies may be applied:
Figure 3. Scalable QoS architecture, logical view


1. Each device registers its functionality in a policy repository, indicating the kind of service support offered to different applications (a small example of such a repository entry is sketched after this list).
2. The repository has a direct interface with a network management tool such as SNMPv3, which monitors the devices over time to obtain a fairly accurate assessment of the physical connectivity of the various devices in the network.
3. Information about the queuing strategies, scheduling mechanisms and prioritized traffic handling supported by different devices may also be obtained from the repository. Such information is useful to determine the kind of QoS architecture supported (Intserv, Diffserv, MPLS) within a network domain.
4. The decision on admitting a traffic flow and offering a particular level of QoS to the flow depends on the capabilities of the devices falling on the path of the flow. However, such admission control decisions are managed at the next level of the architecture by inspecting the policy repository along with any routing decisions.
5. The overall management of network devices within an AS is performed by a device manager. The device manager then needs a direct interface with the management tool and the policy repository. Use of a separate management component reduces the load on the SNMP tool and carries out the next level of communication within the architecture. Hence the device manager handles device configuration decisions and communicates with higher level managers to indicate the various device level QoS capabilities in the network. Note that device level QoS is only responsible for obtaining the different device resources and helping to prepare the network topology for QoS support to the various flows within the AS domain.
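As an illustration of the kind of record the policy repository might hold for a registered device, the following sketch is added here; all field names and values are hypothetical and not prescribed by the chapter.

```python
# Hypothetical policy-repository entry registered by a device manager.
# Field names are illustrative; an actual repository would be populated
# through a management interface such as SNMPv3.
device_entry = {
    "device_id": "edge-router-7",
    "role": "ingress",
    "qos_model": "Diffserv",                 # Intserv, Diffserv or MPLS support
    "queuing": ["WFQ", "priority"],          # supported queuing strategies
    "classes": {"EF": {"bandwidth_mbps": 100},
                "AF11": {"bandwidth_mbps": 300},
                "BE": {"bandwidth_mbps": None}},
    "last_polled": "2009-01-15T10:00:00Z",   # filled in by the monitoring tool
}

def supports_class(entry, traffic_class):
    """Simple lookup the device manager could use during admission control."""
    return traffic_class in entry["classes"]

print(supports_class(device_entry, "EF"))    # True
```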

The logical view of device level QoS is shown in Figure 4. It is assumed that resources within an AS are computed based on the aggregated bandwidth requirements of the different service classes. For scalability reasons, the architecture handles aggregated reservation of resources within each device in the path of the flows that share the same link and request the same traffic class.

Figure 4. Device layer QoS components
If any one of the devices cannot support the required resources and QoS parameters, one of the following actions may be performed, based on service priority and any further economic incentives:

• The request for the flow may be denied.
• Alternative devices may be selected and reported to the next layer, the network layer, where a separate path may be created by applying network QoS flow management using the BB resource management strategy.
• The device manager may negotiate with the network layer and further up the hierarchy, and the final decision to offer a lower QoS level may be communicated between the different entities involved in this process.

However, the network layer decision on QoS flow management across various AS domains is important within the architecture in order to determine how traffic can best be handled with the desired QoS support from the devices below. The architectural support at the next level in the hierarchy (network layer QoS) is based on admission control policy, inter-domain flow management and signaling for various flows. The layer-2 functions of the architecture are presented below.

Layer-2: Network Layer QoS


Network layer service guarantees for QoS sensitive applications are supported through the information from the device managers and their associated management tools, which are located in the bottommost layer of the architecture. Once this information is obtained, the network layer QoS determines and supports QoS guarantees between network boundaries within an AS domain. Hence this layer performs flow and resource management functions both intra-domain and inter-domain in the Internet. The intra-domain functions are presented first; the equally important inter-domain functions follow. For the intra-domain case, the following procedures may be noted:
1. Identifying paths between edge routers by applying intra-domain routing and ranking them (high to low) according to QoS service support, based on parameters such as bandwidth, delay, jitter, loss and cost (a simple path-ranking sketch follows this list).
2. Resource allocation for each device in every edge-to-edge path, based on an aggregated reservation strategy, classified under the intra-domain resource management framework. Such an aggregated reservation is made for a group of flows sharing the same link and requesting the same traffic class.
3. Once the allocation of resources is completed, this information is stored in a repository which is updated at regular intervals.
4. Admission control for individual flows at the edge (ingress) routers, by checking SLAs and any policy related information.
5. A QoS manager within each AS ensures support for network layer QoS to various applications, as well as communicating with the device manager and other higher level components in the architecture.
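A minimal sketch (added for illustration) of the path-ranking step in item 1 is given below; the ordering of the QoS parameters is an assumption, since the chapter does not prescribe a particular weighting.

```python
# Illustrative ranking of edge-to-edge paths from best to worst QoS support.
# The scoring function is a made-up example; any domain-specific weighting
# of bandwidth, delay, jitter, loss and cost could be substituted.
def rank_paths(paths):
    def score(p):
        return (-p["bandwidth_mbps"],   # more bandwidth is better
                p["delay_ms"],          # then lower delay
                p["jitter_ms"],         # then lower jitter
                p["loss_pct"],          # then lower loss
                p["cost"])              # then lower cost
    return sorted(paths, key=score)

paths = [
    {"id": "A-B-D", "bandwidth_mbps": 100, "delay_ms": 20, "jitter_ms": 2,
     "loss_pct": 0.1, "cost": 5},
    {"id": "A-C-D", "bandwidth_mbps": 100, "delay_ms": 12, "jitter_ms": 3,
     "loss_pct": 0.1, "cost": 7},
]
print([p["id"] for p in rank_paths(paths)])   # ['A-C-D', 'A-B-D']
```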

The network layer QoS is also responsible for topology discovery within an AS where connectivity


information between various devices is obtained in order to know the exact path between end points
within an AS domain. Such topology discovery information is required for two different cases. The first
one is the information related to physical topology of the network, describing physical connectivity between different devices which in most cases is static unless physical changes happen within the network.
The second is the routing topology, which changes more frequently, as the routes taken between any pair of devices within the network are likely to change relatively often, for example as a result of traffic engineering to load balance traffic among different links.
Physical topology information is obtained by interacting with the device manager while the routing
topology information is based on factors such as the type of routing protocol, various routing schemes
(overlay routing vs. standard next-hop routing) and any routing policies applied in order to support service
guarantees for different traffic flows within the AS domain. However, we only describe the architecture
and its associated components without further details on any specific mechanisms for deployment in the
Internet. For the sake of an example we consider Diffserv, but it is entirely up to the network managers/
designers to decide the kind of technology to be used with their relevant QoS support for various applications in the Internet. The logical view of the network layer QoS for intra-domain flow management
is shown in Figure 5.
The QoS manager plays a central role in managing network wide QoS delivery by interacting with the other components of the architecture. Since the device managers manage the various devices and interact with the device repository and management tool to monitor, configure and record device level statistics in support of QoS in the network, such information is crucial for the QoS manager at the network layer to apply between network edges for intra-domain QoS guarantees. Hence an accurate view of both the device support at lower layers and the resource management contributes significantly towards building a good architecture for the Internet.
One of the important tasks at the network layer is to make sure sufficient resources are available for the different QoS sensitive flows originating from both within the network and outside the network. While flows originating within the network are guaranteed resources for specific QoS based on the SLAs and policies between the network and its customers, flows entering from outside the network are permitted only if a prior contract and relationship have been established with the other network domains. Otherwise, the network treats the flows as best effort (BE) without further guarantees on QoS.

Figure 5. Network level QoS (intra-domain): logical view
Another interesting point to note within the architecture is the interaction between routing topology
and physical topology information. While intra-domain routing protocols are used to determine the
network level path within an AS, such a path may not give an optimized solution to support QoS for the application. Hence, the QoS manager communicates with the QoS path repository to determine whether a better path with the desired resources is available for that application at that instant. If an alternative path
is discovered between the same edge points, actions may be taken by the QoS manager to inform the
device manager to configure those devices falling on the path. Physical topology of a network describing connectivity between devices may be used as a choice for forwarding the traffic in situations where
routing protocols may not be useful in support for QoS within an AS. Such considerations are taken
within the architecture in order to support better than best-effort QoS particularly involving control
load services defined in Diffserv.
The QoS manager is responsible for providing service level guarantees within an AS only. However, end-to-end QoS guarantees in the Internet need to be supported by the multiple domains through which Internet wide connectivity is established. Various factors apart from individual network level QoS are important to consider in this regard. Within the architecture, the third layer in the hierarchy, the inter-domain QoS layer, is designed to further manage end-to-end QoS for various applications. Issues relating to trust management, policy co-ordination, inter-domain routing, traffic engineering and competitive pricing structures are some of the key factors considered at this next level of the architecture and are described below.

Layer-3: End-to-End QoS


The End-to-End QoS layer in the architecture is responsible for managing higher level policies in the
networks in order to guarantee end-to-end QoS across the Internet. One of the most important functions
performed by this layer is selecting QoS areas between end nodes and routing traffic flows based on
their QoS parameters. Inter-domain policy routing, traffic engineering for load balancing and supporting
various user QoS requirements using SLAs are the key functions of this layer. Hence this layer extends
single AS level QoS (offered at layer-2 of the architecture) to multiple AS level QoS by adding the following functions to the architecture:
1. Application level QoS is supported through SLAs between the network service provider and customers. Hence identifying various parameters from the SLAs, such as customer identification, service category, resource support, service schedule, traffic descriptor, QoS parameters and out-of-profile traffic treatments, is important at this layer of the architecture.
2. Admission control policies determine user authentication through a central repository within each AS and establish the level of QoS support for the flows (a simple admission-check sketch follows this list).
3. Administrative and pricing policies are considered part of the admission control process and of the resource allocation strategy for various applications. However, the architecture does not include pricing issues.
4. AS relationships and trust issues are central to determining the existence of any QoS paths between end nodes spanning multiple domains in the Internet. This approach investigates a number of QoS paths rather than simply choosing the lowest-metric routing path between end points in the Internet. AS relationships are determined by inferring various Internet characteristics as well as by using policy based routing information between different domains.
5. Inter-domain routing decisions based on various policies are given preference, allowing the set of service requirements to be optimally supported within the network's aggregate resource capability.
6. A central domain coordinator within each AS is responsible for the above mentioned activities by interacting with domain coordinators from other domains in the Internet. Hence, identifying QoS domains and investigating their service offerings are key to such an architecture.
7. Any conflict in resource and traffic management, including simple pricing parameters, is resolved by applying a resource co-ordination or similar algorithm between the various domains in the QoS path.
8. Once the QoS discovery process is completed, the extracted technical parameters from the SLAs, which are referred to as service level specifications (Goderis, et al., 2001), are mapped across network layer QoS components and finally through the device managers in individual ASs.
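A simplified admission-control check in the spirit of items 1 and 2 is added below for illustration; the SLA fields, thresholds and identifiers are assumptions, not definitions from the chapter.

```python
# Hypothetical SLA record and flow request used by a domain coordinator
# for policy-based admission control; field names are illustrative.
sla = {
    "customer_id": "cust-42",
    "service_category": "VoIP",
    "max_bandwidth_mbps": 10,
    "max_delay_ms": 150,
    "valid_hours": range(0, 24),
}

def admit(flow, sla, available_bandwidth_mbps):
    """Accept the flow only if it matches the SLA and resources remain."""
    return (flow["customer_id"] == sla["customer_id"]
            and flow["service_category"] == sla["service_category"]
            and flow["bandwidth_mbps"] <= sla["max_bandwidth_mbps"]
            and flow["bandwidth_mbps"] <= available_bandwidth_mbps
            and flow["hour"] in sla["valid_hours"])

flow = {"customer_id": "cust-42", "service_category": "VoIP",
        "bandwidth_mbps": 2, "hour": 14}
print(admit(flow, sla, available_bandwidth_mbps=50))   # True
```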

The architectural support at the End-to-End QoS layer for the above mentioned functions is achieved largely through a series of negotiation and policy co-ordination activities before the exact QoS parameters are determined by the domain coordinator and applied across the various domains in the QoS path. Such an approach then guarantees the SLAs between the service provider and its customers, supporting end-to-end QoS objectives for various applications. The logical view of the End-to-End QoS layer is presented in Figure 6, together with the various interactions among the components within it.
The function of the domain coordinator can be compared with a Bandwidth Broker (Terzis, Wang, Ogawa & Zhang, 1999), (Li, Zhang, Duan, Gao & Hou, 2000) or a Policy Server (Yavatkar, Pendarakis & Guerin, 2000), which performs policy based admission control by comparing SLSs with user flow parameters. Resource control similar to the work in (Yavatkar, Pendarakis & Guerin, 2000) and (Terzis, Wang, Ogawa & Zhang, 1999) is also performed by the domain coordinator. The domain coordinator primarily manages two different sets of policies, as specified in the logical diagram, and exchanges them with other domains.

Figure 6. End-to-End QoS: Logical view
The customer specific policy controls access to the available services within a service provider's domain by comparing parameters such as priorities, usable services, resource availability and valid time against the SLA rules specified between the service provider and its customer. A decision on whether to accept the customer's flow or to deny it is finally conveyed through the admission control module. The right side of the
domain coordinator as shown in Figure 6 is responsible for service and resource specific policies. Service
parameters related to QoS values, valid time and cost are compared against any policy rules found in their
respective SLAs. In order to determine optimized values for these service parameters, the domain coordinator needs to consider various traffic engineering policies as well as routing policies involving peering
domains. Finally the architecture deals with resource specific policies such as: bandwidth, delay and jitter
available at the network level by communicating with the QoS manager.
In case of policy conflicts (e.g. available resources are not sufficient), the domain coordinator initiates a
policy coordination algorithm between domains present in the end-to-end QoS path. In order to understand
the overall architecture on an end-to-end basis, the flow diagram in Figure 7 demonstrates various activities
and their sequence of operations between end systems in the Internet. In the functional description of the
architecture, while the objective is to support QoS between various domains under different AS administrative control, individual domains should support both network level and device level QoS throughout the life
of the flow.
Figure 7. Functional descriptions and interactions among different layers


CONCLUSION
We discussed a policy architecture that handles resource management for both intra- and inter-domain resources for QoS specific (high-priority) applications. One of the strengths of the architecture is its separation of the control and management plane from the data plane in order to facilitate better end-to-end QoS control. The architecture is also hierarchical and operates at different levels to support both high level (business focused) and low level (device focused) resource availability for maintaining end-to-end QoS. Each level in the hierarchy is controlled by a manager that receives resource availability information from components within that level only, and the output is informed to a higher level in the hierarchy. Hence the approach is also bottom up.
Communication between the same levels is allowed through peer-to-peer session establishment, making use of the necessary signaling protocols. Any conflict over end-to-end QoS is resolved through proper policy co-ordination mechanisms. Apart from resource management, the architecture also includes policy based routing and traffic engineering through fine tuning of BGP routing policy structures. Hence the architecture is scalable and integrated, and is aimed at improving end-to-end QoS for various services in the Internet.
Validation of the architecture, presenting the functionalities of the three different layers, is performed using three different environments. The layer-1 functionality of the architecture is demonstrated on a Diffserv network by creating a test bed scenario with a Diffserv capable domain and measuring end-to-end QoS parameters for a VoIP application in the presence of other background traffic. Such experiments then motivate the consideration of various QoS parameters and the use of a resource management strategy between AS domains. Layer-2 of our architecture mainly deals with resource management between neighboring AS domains on an end-to-end basis. For this we designed a prototype based on a Bandwidth Broker, using our own signaling scheme to properly manage traffic flows with different QoS classes. Finally, layer-3 of our architecture is designed to select QoS domains and forward traffic based on an inter-domain routing protocol such as BGP to enforce routing policies in a dynamic way. In order to demonstrate these functions of our architecture we used several simulation experiments based on the OPNET simulator. The simulation environment also considered parameters used by real routers and demonstrated the efficiency of using the community based attribute and the policy co-ordination algorithm in case of policy conflict. A series of experiments was conducted to investigate the effect of BGP based policy enforcement, load balancing between AS domains and traffic engineering for scalability and better management of QoS in the Internet.
We presented a policy based architecture designed to support end-to-end QoS for multiple service classes in an integrated way. With this integrated approach, our design and the performance evaluation results presented in (Nanda, 2008) indicate that such end-to-end QoS can be achieved with the help of service mapping, policy based routing and traffic engineering, resource management using BB functionalities, and device level QoS support across the Internet. The main strengths of our design are scalability, the ability to handle heterogeneous policies, and distributed resource management support. This chapter also establishes a foundation for further research on policy routing involving security, policy based billing and charging in the Internet, and application level resource management in the Internet.


REFERENCES
Agarwal, S., Chuah. C. N., & Katz, R. H. (2003). OPCA: Robust Inter-domain Policy Routing and
Traffic Control, OPENARCH.
Awduche, D. O., Chiu, A., Elwalid, A., Widjaja, I., & Xiao, X. (2002). A framework for Internet traffic engineering [draft 2]. Retrieved from IETF draft database.
Caesar, M., & Rexford, J. (2005, March). BGP routing policies in ISP networks (Tech. Rep. UCB/CSD-05-1377). Berkeley, CA: U. C. Berkeley.
Gao, L. (2001, December). On inferring autonomous system relationships in the Internet. IEEE/ACM Transactions on Networking, 9(6).
Goderis, D., et al. (2001, July). Service level specification semantics and parameters: draft-tequila-sls-01.txt [Internet Draft].
Huston, G. (n.d.). Peering and settlements Part-1. The Internet protocol journal. San Jose, CA: CISCO
Systems.
Li, Z., Zhang, Duan, Z., Gao, L., & Hou, Y. T. (2000). Decoupling QoS control from core routers: A novel bandwidth broker architecture for scalable support of guaranteed services. Proc. of SIGCOMM'00, Stockholm, Sweden (pp. 71-83).
Li, Z., & Mohapatra, P. (2004, January). QoS Aware routing in Overlay networks (QRON). IEEE Journal
on Selected Areas in Communications, 22(1).
Nanda, P. (2008, January). A three layer policy based architecture supporting Internet QoS. Ph.D. thesis,
University of Technology, Sydney, Australia.
Quoitin, B., & Bonaventure, O. (2005). A Co-operative approach to Inter-domain traffic engineering.
1st Conference on Next Generation Internet Networks Traffic Engineering (NGI 2005), Rome, Italy,
April 18-20th.
Quoitin, B., Uhlig, S., Pelsser, C., Swinnen, L., & Bonaventure, O. (2003). Internet traffic engineering
with BGP: Quality of Future Internet Services. Berlin: Springer
Rekhter, Y. & Li, T. (2002, January). A border gateway protocol 4 (BGP-4): draft-ietf-idr-bgp4-17.txt
[Internet draft, work in progress].
Salsano, S. (2001 October). COPS usage for Diffserv resource allocation (COPS-DRA) [Internet
Draft].
Simmonds, A., & Nanda, P. (2002). Resource management in differentiated services networks. In C. McDonald (Ed.), Proceedings of Converged Networking: Data and Real-time Communications over IP, IFIP Interworking 2002, Perth, Australia, October 14-16 (pp. 313-323). Amsterdam: Kluwer Academic Publishers.
Terzis, A., Wang, L., Ogawa, J., & Zhang, L. (1999, December). A two-tier resource management model for the Internet. Global Internet (pp. 1808-1817).


Uhlig, S., Bonaventure, O., & Quoitin, B. (2003). Internet traffic engineering with minimal BGP configuration. 18th International Teletraffic Congress.
Yavatkar, R., Pendarakis, D., & Guerin, R. (2000, January). A framework for policy based admission
control, (RFC 2753).

KEY TERMS AND DEFINITIONS


Autonomous System (AS): An autonomous system is an independent routing domain connecting
multiple networks under the control of one or more network operators that presents a common, clearly
defined routing policy to the Internet and has been assigned an Autonomous System Number (ASN).
Bandwidth Broker (BB): A Bandwidth Broker is a logical entity that acts as a resource manager both within a network and between networks so as to guarantee performance.
Border Gateway Protocol (BGP): BGP is a routing protocol which allows networks to tell other networks about the destinations for which they are responsible, by exchanging routing information between different autonomous systems.
Differentiated Services (Diffserv): Diffserv supports QoS guarantee by aggregating traffic flows
on a per class basis.
Integrated services (Intserv): Intserv supports end-to-end QoS guarantee on a per flow basis.
Policy Based Networking (PBN): Policy based networking is defined as the management of a network so that various kinds of traffic get the priority of availability and the bandwidth needed to serve the network's users effectively.
Quality of Service (QoS): Quality of Service (QoS) is defined as supporting and guaranteeing network resources to various users, applications and services in the Internet.
Traffic Engineering (TE): Traffic Engineering (TE) is concerned with performance optimization
of operational IP networks and can be used to reduce congestion and improve resource utilization by
careful distribution of traffic in the network.


Chapter 33

Scalable Fault Tolerance


for Large-Scale Parallel and
Distributed Computing
Zizhong Chen
Colorado School of Mines, USA

ABSTRACT
Today's long running scientific applications typically tolerate failures by checkpoint/restart, in which
all process states of an application are saved into stable storage periodically. However, as the number
of processors in a system increases, the amount of data that need to be saved into stable storage also
increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability
problem for large parallel systems. In this chapter, we introduce some scalable techniques to tolerate
a small number of process failures in large parallel and distributed computing. We present several encoding strategies for diskless checkpointing to improve the scalability of the technique. We introduce
the algorithm-based checkpoint-free fault tolerance technique to tolerate fail-stop failures without
checkpoint or rollback recovery. Coding approaches and floating-point erasure correcting codes are
also introduced to help applications to survive multiple simultaneous process failures. The introduced
techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. Experimental results demonstrate that the introduced
techniques are highly scalable.

INTRODUCTION
The unquenchable desire of scientists to run ever larger simulations and analyze ever larger data sets is
fueling a relentless escalation in the size of supercomputing clusters from hundreds, to thousands, and
even tens of thousands of processors (Dongarra, Meuer & Strohmaier, 2004). Unfortunately, the struggle
to design systems that can scale up in this way also exposes the current limits of our understanding
DOI: 10.4018/978-1-60566-661-7.ch033

Copyright 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Scalable Fault Tolerance

of how to efficiently translate such increases in computing resources into corresponding increases in
scientific productivity. One increasingly urgent part of this knowledge gap lies in the critical area of
reliability and fault tolerance.
Even making generous assumptions on the reliability of a single processor, it is clear that as the
processor count in high end clusters grows into the tens of thousands, the mean time to failure (MTTF)
will drop from hundreds of days to a few hours, or less. The type of 100,000-processor machines (Adiga, et al., 2002) projected in the next few years can expect to experience a processor failure almost daily, perhaps hourly. Although today's architectures are robust enough to incur process failures without suffering complete system failure, at this scale and failure rate, the only technique available to application developers for providing fault tolerance within the current parallel programming model, checkpoint/restart, has performance and conceptual limitations that make it inadequate to the future needs of the communities that will use these systems. Alternative fault tolerance techniques need to be investigated.
In this chapter, we present some scalable techniques to tolerate a small number of process failures in
large scale parallel and distributed computing. The introduced techniques are scalable in the sense that
the overhead to survive k failures in p processes does not increase as the total number of application
processes p increases. We introduce several encoding strategies into diskless checkpointing to improve the scalability of the technique. We present an algorithm-based checkpoint-free fault tolerance approach in which, instead of taking checkpoints periodically, a coded globally consistent state of the critical application data is maintained in memory by modifying applications to operate on encoded data. Because no periodic checkpoint or rollback recovery is involved in this approach, process failures can often be tolerated with a surprisingly low overhead. We explore a class of numerically stable floating-point number erasure codes based on random matrices which can be used in the algorithm-based checkpoint-free fault tolerance technique to tolerate multiple simultaneous process failures. Experimental results
demonstrate that the introduced fault tolerance techniques can survive a small number of simultaneous
processor failures with a very low performance overhead.

BACKGROUND
Current parallel programming paradigms for high-performance distributed computing systems are typically
based on the Message-Passing Interface (MPI) specification (Message Passing Interface Forum, 1994). However, the current MPI specification does not specify the behavior of an MPI implementation when one or more process failures occur during runtime. MPI gives the user the choice between two possibilities for handling failures. The first one, which is the default mode of MPI, is to immediately abort all the surviving processes of the application. The second possibility is just slightly more flexible, handing
control back to the user application without guaranteeing that any further communication can occur.

FT-MPI Overview
FT-MPI (Fagg, Gabriel, Bosilca, Angskun, Chen, Pjesivac-Grbovic, et al., 2004) is a fault tolerant version of MPI that is able to provide basic system services to support fault survivable applications. FT-MPI implements the complete MPI-1.2 specification and parts of the MPI-2 functionality, and extends some of the semantics of MPI to support self-healing applications. FT-MPI is able to survive the failure of n − 1 processes in an n-process job and, if required, can re-spawn the failed processes. However, fault


tolerant applications have to be implemented in a self-healing way so that they can survive failures.
Although FT-MPI provides basic system services to support self-healing applications, prevailing benchmarks show that the performance of FT-MPI is comparable (Fagg, Gabriel, Bosilca, Angskun, Chen,
Pjesivac-Grbovic, et al., 2005) to the current state-of-the-art non-fault-tolerant MPI implementations.

FT-MPI Semantics
FT-MPI provides semantics that answer the following questions:
1. What is the status of an MPI communicator after recovery?
2. What is the status of the ongoing communication and messages during and after recovery?

When running an FT-MPI application, there are two parameters used to specify which modes the
application is running. The first parameter is communicator mode which indicates the status of an MPI
object after recovery. FT-MPI provides four different communicator modes, which can be specified
when starting the application:

ABORT: like any other MPI implementation, in this FT-MPI mode the application aborts itself after a failure.
BLANK: failed processes are not replaced; all surviving processes have the same rank as before the crash and MPI_COMM_WORLD has the same size as before.
SHRINK: failed processes are not replaced; however, the new communicator after the crash has no holes in its list of processes. Thus, processes might have a new rank after recovery and the size of MPI_COMM_WORLD will change.
REBUILD: failed processes are re-spawned; surviving processes have the same rank as before. The REBUILD mode is the default, and the most used mode of FT-MPI.

The second parameter, the communication mode, indicates how messages, which are sent but not
received while a failure occurs, are treated. FT-MPI provides two different communication modes, which
can be specified while starting the application:

CONT/CONTINUE: all operations which returned the error code MPI_SUCCESS will finish
properly, even if a process failure occurs during the operation (unless the communication partner
has failed).
NOOP/RESET: all pending messages are dropped. The assumption behind this mode is that on
error the application returns to its last consistent state, and all currently pending operations are not
of any further interest.

FT-MPI Usage
It usually takes three steps to tolerate a failure: 1) failure detection, 2) failure notification, and 3) recovery. The only assumption the FT-MPI specification makes about the first two points is that the run-time
environment discovers failures and all remaining processes in the parallel job are notified about these
events. The recovery procedure consists of two steps: recovering the MPI run-time environment, and


recovering the application data. The latter one is considered to be the responsibility of the application
developer.
In the FT-MPI specification, the communicator-mode discovers the status of MPI objects after recovery, and the message-mode ascertains the status of ongoing messages during and after recovery. FT-MPI
offers for each of these modes several possibilities. This allows application developers to take the specific
characteristics of their application into account and use the best-suited method to tolerate failures.

SCALABLE DISKLESS CHECKPOINTING FOR


LARGE SCALE SCIENTIFIC COMPUTING
In this section, we introduce some techniques to improve the scalability of the classical diskless checkpointing technique.

Diskless Checkpointing: From an Application Point of View


Diskless checkpointing (Plank, Li & Puening, 1998) is a technique to save the state of a long running
computation on a distributed system without relying on stable storage. With diskless checkpointing,
each processor involved in the computation stores a copy of its state locally, either in memory or on
local disk. Additionally, encodings of these checkpoints are stored in local memory or on local disk of
some processors which may or may not be involved in the computation. When a failure occurs, each
live processor may roll its state back to its last local checkpoint, and the failed processors state may
be calculated from the local checkpoints of the surviving processors and the checkpoint encodings. By
eliminating stable storage from checkpointing and replacing it with memory and processor redundancy,
diskless checkpointing removes the main source of overhead in checkpointing on distributed systems
(Plank, Li & Puening, 1998). Figure 1 is an example of how diskless checkpoint works.
To make diskless checkpointing as efficient as possible, it can be implemented at the application
level rather than at the system level (Plank, Kim & Dongarra, 1997). In typical long running scientific
applications, when diskless checkpointing is performed at the application level, what needs to be checkpointed is often some numerical data (Kim, 1996). These numerical data can either be treated as bit-streams or as floating-point numbers. If the data are treated as bit-streams, then bit-stream operations such as parity can be used to encode the checkpoint. Otherwise, floating-point arithmetic such as addition can be used to encode the data. In this research, we treat the checkpoint data as floating-point numbers rather than bit-streams. However, the corresponding bit-stream version schemes could also be used if the application programmer thinks they are more appropriate. In the rest of this chapter, we discuss how local checkpoints can be encoded efficiently so that applications can survive process failures.

Figure 1. Fault tolerance by diskless checkpointing

Checksum-Based Checkpointing
The checksum-based checkpointing is a floating-point version of the parity-based checkpointing
scheme proposed in (Plank, Li, & Puening, 1998). In the checksum-based checkpointing, instead of
using parity, floating-point number addition is used to encode the local checkpoint data. By encoding
the local checkpoint data of the computation processors and sending the encoding to some dedicated
checkpoint processors, the checksum-based checkpointing introduces a much lower memory overhead
into the checkpoint system than neighbor-based checkpointing. However, due to the calculation and sending of the encoding, the performance overhead of the checksum-based checkpointing is usually higher than that of neighbor-based checkpointing schemes. The basic checksum scheme works as follows. If the program is executing on N processors, then there is an (N + 1)-st processor called the checksum processor. At all points in time a consistent checkpoint is held in memory on the N processors. Moreover, a checksum of
the N local checkpoints is held in the checksum processor. Assume Pi is the local checkpoint data in the
memory of the i-th computation processor. C is the checksum of the local checkpoints in the checkpoint
processor. If we look at the checkpoint data as arrays of real numbers, then the checkpoint encoding actually establishes the identity

P1 + ... + PN = C          (1)

between the checkpoint data Pi on the computation processors and the checksum data C on the checksum processor. If any processor fails, then the identity (1) becomes an equation with one unknown. Therefore, the data in the failed processor can be reconstructed by solving this equation.
Due to the floating-point arithmetic used in the checkpoint and recovery, there will be round-off errors
in the checkpoint and recovery. However, the checkpoint involves only additions and the recovery involves
additions and only one subtraction. In practice, the increased possibility of overflows, underflows, and
cancellations due to round-off errors in the checkpoint and recovery algorithm is negligible.
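A small numerical sketch (added for illustration) of the checksum encoding and recovery just described follows; it simulates the local checkpoints of N = 4 processors as NumPy arrays on a single machine rather than using MPI.

```python
import numpy as np

# Simulated local checkpoints of N = 4 computation processors.
rng = np.random.default_rng(0)
P = [rng.standard_normal(5) for _ in range(4)]

# Checkpoint: the checksum processor stores C = P1 + ... + P4, i.e. identity (1).
C = sum(P)

# Suppose processor 2 fails and loses its data.
lost = 2
survivors = [P[i] for i in range(4) if i != lost]

# Recovery: identity (1) becomes one equation with one unknown.
recovered = C - sum(survivors)
print(np.allclose(recovered, P[lost]))   # True up to floating-point round-off
```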

Overhead and Scalability Analysis


Assume diskless checkpointing is performed in a parallel system with p processors and the size of the checkpoint on each processor is m bytes. It takes α + βx to transfer a message of size x bytes between two processors, regardless of which two processors are involved; α is often called the latency of the network, and 1/β is called the bandwidth of the network. Assume the rate to calculate the sum of two arrays is γ seconds per byte. We also assume that it takes α_d + β_d x to write x bytes of data into the stable storage. Our default network model is the duplex model, where a processor is able to concurrently send a message to one partner and receive a message from a possibly different partner. The more restrictive simplex model permits only one communication direction per processor. We also assume that disjoint pairs of processors can communicate with each other without interfering with each other.
In classical diskless checkpointing, a binary-tree based encoding algorithm is often used to perform the checkpoint encoding (Chiueh & Deng, n.d.), (Kim, 1996), (Plank, 1997), (Plank, Li & Puening, 1998), (Silva & Silva, 1998). By organizing all processors as a binary tree and sending local checkpoints along the tree to the checkpoint processor (see Figure 2 (Plank, Li & Puening, 1998)), the time to perform one checkpoint for a binary-tree based encoding algorithm, T_diskless-binary, can be represented as

T_diskless-binary = 2 log p (α + (β + γ)m).


In high performance scientific computing, the local checkpoint is often a relatively large message (megabyte level), so (β + γ)m is usually much larger than α. Therefore,

T_diskless-binary ≈ 2 log p (β + γ)m.

Note that, in a typical checkpoint/restart approach, where m is usually also much larger than the stable-storage latency α_d, the time to perform one checkpoint, T_checkpoint/restart, is

T_checkpoint/restart = p (α_d + β_d m) ≈ p β_d m.
Therefore, by eliminating stable storage from checkpointing and replacing it with memory and processor redundancy, diskless checkpointing improves the scalability of checkpointing greatly on parallel
and distributed systems.
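The following small calculator is an added illustration (not from the original chapter); it evaluates the two overhead expressions above for sample machine parameters. The parameter values are arbitrary assumptions, the symbols follow the reconstructed notation of this section (α, β, γ for network latency, per-byte transfer time and per-byte summation time; α_d, β_d for stable storage), and the logarithm is assumed to be base 2.

```python
from math import ceil, log2

def t_checkpoint_restart(p, m, alpha_d, beta_d):
    # p processors each write an m-byte checkpoint to shared stable storage.
    return p * (alpha_d + beta_d * m)

def t_diskless_binary(p, m, alpha, beta, gamma):
    # Binary-tree encoding of m-byte local checkpoints to a checksum processor;
    # the logarithm is taken base 2 here (an assumption of this sketch).
    return 2 * ceil(log2(p)) * (alpha + (beta + gamma) * m)

# Arbitrary example parameters: 1 GB checkpoints, 10 us network latency,
# 1 GB/s network, 2 GB/s summation rate, 500 MB/s aggregate storage bandwidth.
p, m = 1024, 1e9
print(t_checkpoint_restart(p, m, alpha_d=1e-3, beta_d=1 / 500e6))      # roughly 2.0e3 s
print(t_diskless_binary(p, m, alpha=1e-5, beta=1 / 1e9, gamma=1 / 2e9))  # roughly 30 s
```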

A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing


Although the classical diskless checkpointing technique improves the scalability of checkpointing dramatically on parallel and distributed systems, the overhead to perform one checkpoint still increases
logarithmically as the number of processors increases. In this section, we propose a new style of encoding
algorithm which improves the scalability of diskless checkpointing significantly. The new encoding algorithm is based on the pipeline idea. When the number of processors is one or two, there is not much that
we can improve. Therefore, in what follows, we assume the number of processors is at least three.
Figure 2. Encoding local checkpoints using the binary tree algorithm


Pipelining
The key idea of pipelining is (1) the segmenting of messages and (2) the simultaneous non-blocking
transmission and receipt of data. By breaking up a large message into smaller segments and sending
these smaller messages through the network, pipelining allows the receiver to begin forwarding a segment while receiving another segment. Data pipelining can produce several significant improvements in
the process of checkpoint encoding. First, pipelining masks the processor and network latencies that are
known to be an important factor in high-bandwidth local area networks. Second, it allows the simultaneous sending and receiving of data, and hence exploits the full duplex nature of the interconnect links in
the parallel system. Third, it allows different segments of a large message to be transmitted over different interconnect links in parallel after the pipeline is established, hence fully utilizing the multiple interconnects of a parallel and distributed system.

Chain-Pipelined Encoding for Diskless Checkpointing


Let m[i] denote the data on the ith processor. The task of checkpoint encoding is to calculate the encoding m[0] + m[1] + ... + m[p−1] and deliver it to the checkpoint processor. The chain-pipelined encoding algorithm works as follows. First, organize all computational processors and the checkpoint processor as a chain. Second, segment the data on each processor into small pieces. Assume the data on each processor are segmented into t segments of size s. The jth segment of m[i] is denoted as m[i][j]. Third, m[0] + m[1] + ... + m[p−1] is calculated by computing m[0][j] + m[1][j] + ... + m[p−1][j] for each 0 ≤ j ≤ t−1 in a pipelined way. Fourth, when the jth segment of the encoding m[0][j] + m[1][j] + ... + m[p−1][j] is available, it is sent to the checkpoint processor.
Figure 3 demonstrates an example of calculating a chain-pipelined checkpoint encoding for three processors (processor 0, processor 1, and processor 2) and delivering it to the checkpoint processor (processor 3). In step 0, processor 0 sends its m[0][0] to processor 1. Processor 1 receives m[0][0] from processor 0 and calculates m[0][0] + m[1][0]. In step 1, processor 0 sends its m[0][1] to processor 1. Processor 1 first concurrently receives m[0][1] from processor 0 and sends m[0][0] + m[1][0] to processor 2, and then calculates m[0][1] + m[1][1]. Processor 2 first receives m[0][0] + m[1][0] from processor 1 and then calculates m[0][0] + m[1][0] + m[2][0]. As the procedure continues, at the end of step 2 the checkpoint processor gets its first segment of the encoding, m[0][0] + m[1][0] + m[2][0]. From then on, the checkpoint processor receives one segment of the encoding at the end of each step. After the checkpoint processor receives the last segment of the encoding, the checkpoint is finished.
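The following sketch (added for illustration) reproduces the segment-by-segment arithmetic of the chain-pipelined encoding on a single machine; it models only the data flow, not the overlapping of sends, receives and additions that a real non-blocking MPI implementation would exploit.

```python
import numpy as np

def chain_pipelined_encoding(local_data, num_segments):
    """Compute the checksum of the processors' data segment by segment,
    mimicking the flow of partial sums along the chain (logical model only)."""
    p = len(local_data)
    segments = [np.array_split(d, num_segments) for d in local_data]
    received = []                       # what the checkpoint processor collects
    for j in range(num_segments):       # one pipeline 'wavefront' per segment
        partial = segments[0][j].copy()
        for i in range(1, p):           # partial sum travels down the chain
            partial = partial + segments[i][j]
        received.append(partial)        # delivered to the checkpoint processor
    return np.concatenate(received)

rng = np.random.default_rng(1)
data = [rng.standard_normal(12) for _ in range(3)]   # 3 computation processors
C = chain_pipelined_encoding(data, num_segments=4)
print(np.allclose(C, sum(data)))        # True: same checksum as a direct sum
```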

Overhead and Scalability Analysis


In the chain-pipelined checkpoint encoding, the time for each step is T_each-step = α + βs + γs, where s is the size of the segment. The number of steps needed to encode and deliver t segments in a p-processor system is t + p − 2. If we assume the size of the data on each processor is m (= ts), then the total time for encoding and delivery is

T_total(s) = (t + p − 2)(α + βs + γs)
           = (m/s + p − 2)(α + βs + γs)
           = αm/s + (p − 2)(β + γ)s + (p − 2)α + (β + γ)m
           ≥ 2√((p − 2)α(β + γ)m) + (p − 2)α + (β + γ)m
           = (β + γ)m (1 + O(√(p/m))).

The minimum is achieved when

s = √( αm / ((p − 2)(β + γ)) ).          (2)

Therefore, by choosing the optimal segment size, the chain-pipelined encoding algorithm is able to reduce the checkpoint overhead to tolerate a single failure from 2 log p ((β + γ)m + α), the binary-tree encoding time, to (β + γ)m (1 + O(√(p/m))).

In diskless checkpointing, the size of the checkpoint m is often large (megabytes level). The latency α is often very small compared with the time to send a large message. If p is not too large, then T_total ≈ (β + γ)m. Therefore, in practice, the number of processors often has very little impact on the time to perform one checkpoint unless p is very large. If p does become very large, the strategies in the one dimensional weighted checksum scheme can be used to guarantee that the latency-related terms remain small.
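A small numerical check (added for illustration) of the optimal segment size and the resulting total time follows; the parameter values are arbitrary assumptions and the expressions are the reconstructed ones above.

```python
from math import sqrt

def t_total(s, p, m, alpha, beta, gamma):
    # Total chain-pipelined encoding time with segment size s (bytes).
    steps = m / s + p - 2
    return steps * (alpha + (beta + gamma) * s)

def optimal_segment(p, m, alpha, beta, gamma):
    # Segment size minimising t_total, from equation (2).
    return sqrt(alpha * m / ((p - 2) * (beta + gamma)))

p, m = 1024, 1e9                                   # arbitrary example values
alpha, beta, gamma = 1e-5, 1 / 1e9, 1 / 2e9
s_opt = optimal_segment(p, m, alpha, beta, gamma)
print(round(s_opt), round(t_total(s_opt, p, m, alpha, beta, gamma), 3))
# With the optimal s, the total stays within a small factor of
# (beta + gamma) * m = 1.5 s even for p = 1024 processors.
```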

Coding to Tolerate Multiple Simultaneous Process Failures


To tolerate multiple simultaneous process failures of arbitrary patterns with minimum process redundancy,
a weighted checksum scheme can be used. A weighted checksum scheme can be viewed as a version
of the Reed-Solomon erasure coding scheme (Plank, 1997) in the real number field. The basic idea of
this scheme works as follows: each processor takes a local in-memory checkpoint, and M equalities are established by saving weighted checksums of the local checkpoints into M checksum processors. When f failures happen, where f ≤ M, the M equalities become M equations with f unknowns. By appropriately
choosing the weights of the weighted checksums, the lost data on the f failed processors can be recovered
by solving these M equations.

Figure 3. Chain-pipelined encoding for diskless checkpointing


The Basic Weighted Checksum Scheme


Suppose there are n processors used for computation. Assume the checkpoint data on the ith computation
processor is Pi. In order to be able to reconstruct the lost data on failed processors, another M processors
are dedicated to hold M encodings (weighted checksums) of the checkpoint data (see Figure 4).
The weighted checksum Cj on the jth checksum processor can be calculated from

a11 P1 + ... + a1n Pn = C1
  ...                                    (3)
aM1 P1 + ... + aMn Pn = CM

where aij, i = 1, 2, ..., M, j = 1, 2, ..., n, are the weights we need to choose. Let A = (aij) denote the M × n matrix of these weights. We call A the checkpoint matrix for the weighted checksum scheme. Suppose that k computation processors and M − h checkpoint processors have failed. Then there are n − k computation processors and h checkpoint processors that have survived. If we look at the data on the failed processors as unknowns, then (3) becomes M equations with M − (h − k) unknowns.
If k > h, then there are fewer equations than unknowns. There is no unique solution for (3), and the
lost data on the failed processors can not be recovered. However, if k < h, then there are more equations
than unknowns. By appropriately choosing A, a unique solution for (3) can be guaranteed, and the lost
data on the failed processors can be recovered by solving (3).
Let Ar denote the coefficient matrix of the linear system that needs to be solved to recover the lost data. Whether we can recover the lost data on the failed processes directly depends on whether Ar has full column rank. However, Ar can be any sub-matrix (including minor) of A, depending on the distribution of the failed processors. If any square sub-matrix (including minor) of A is non-singular and no more than M processes have failed, then Ar is guaranteed to have full column rank. Therefore, to be able to recover from no more than any M failures, the checkpoint matrix A has to satisfy the condition that any square sub-matrix (including minor) of A is non-singular. How can we find such matrices? It is well known that some structured matrices such as the Vandermonde matrix and the Cauchy matrix satisfy this condition (Golub & Van Loan, 1989).
Let T_diskless_pipeline(k, p) denote the encoding time to tolerate k simultaneous failures in a p-processor system using the chain-pipelined encoding algorithm, and T_diskless_binary(k, p) denote the corresponding encoding time using the binary-tree encoding algorithm.

Figure 4. Basic weighted checksum scheme for diskless checkpointing


When tolerating k simultaneous failures, k basic encodings have to be performed. Note that, in addition to the summation operation, there is an additional multiplication operation involved in (3). Therefore, the computation time for each number increases from γ to 2γ. Hence, when the binary-tree encoding algorithm is used to perform the weighted checksum encoding, the time for one basic encoding is 2⌈log p⌉((β + 2γ)m + α). Therefore, the time for k basic encodings is

T_diskless_binary(k, p) = 2k⌈log p⌉(α + (β + 2γ)m) ≈ 2k⌈log p⌉(β + 2γ)m.


When the chain-pipelined encoding algorithm is used to perform the checkpoint encoding, the overhead to tolerate k simultaneous failures becomes

T_diskless_pipeline(k, p) = k(1 + p/m)(β + 2γ)m = (1 + p/m) · k(β + 2γ)m.

When the number of processors p is not too large, the overhead for the basic weighted checksum scheme is T_diskless_pipeline(k, p) ≈ k(β + 2γ)m.
However, in today's large computing systems, the number of processors p may become very large. If there is a large number of processors in the computing system, either the one dimensional weighted checksum scheme or the localized weighted checksum scheme discussed in the following can be used.

One Dimensional Weighted Checksum Scheme


The one dimensional weighted checksum scheme works as follows. Assume the program is running on p = g·s processors. Partition the p processors into s groups with g processors in each group. Dedicate another M checksum processors to each group. In each group, the checkpoints are encoded using the basic weighted checksum scheme (see Figure 5). This scheme can survive M processor failures in each group. The advantage of this scheme is that the checkpoints are localized to a subgroup of processors, so the checkpoint encoding in each subgroup can be performed in parallel. Therefore, compared with the basic weighted checksum scheme, the performance of the one dimensional weighted checksum scheme is usually better.
By using the pipelined encoding algorithm in each subgroup, the time to tolerate k simultaneous failures in a p-processor system is now reduced to

T_diskless_pipeline(k, p) = T_diskless_pipeline(k, g) = (1 + g/m) · k(β + 2γ)m,

which is independent of the total number of processors p in the computing system. Therefore, in this fault tolerance scheme, the overhead to survive k failures in a p-processor system does not increase as the total number of processors p increases. It is in this sense that the sub-group based chain-pipelined checkpoint encoding algorithm is a scalable recovery algorithm.
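A minimal sketch of the grouping idea, assuming NumPy; the group sizes and helper names are illustrative only:

import numpy as np

def one_dimensional_encode(P, g, M, rng):
    """Split the p local checkpoints into groups of g processors and encode
    each group independently with its own M x g Gaussian weight matrix.

    Returns a list of (weight matrix, checksums) pairs, one per group;
    in a real system each group's encoding would run in parallel.
    """
    p = P.shape[0]
    groups = []
    for start in range(0, p, g):
        block = P[start:start + g]
        A = rng.standard_normal((M, block.shape[0]))
        groups.append((A, A @ block))
    return groups

rng = np.random.default_rng(1)
P = rng.standard_normal((12, 5))                          # p = 12 processors, checkpoints of length 5
encoded = one_dimensional_encode(P, g=4, M=2, rng=rng)    # 3 groups, each survives up to 2 failures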


CHECKPOINT-FREE FAULT TOLERANCE FOR MATRIX MULTIPLICATION


It has been proved in previous research (Huang & Abraham, 1984) that, for some matrix operations,
the checksum relationship in input checksum matrices is preserved in the final computation results at
the end of the operation. Based on this checksum relationship in the final computation results, Huang
and Abraham have developed the famous algorithm-based fault tolerance (ABFT) (Huang & Abraham,
1984) technique to detect, locate, and correct certain processor miscalculations in matrix computations
with low overhead. The algorithm-based fault tolerance proposed in (Huang & Abraham, 1984) was later extended by many researchers (Anfinson & Luk, 1988), (Banerjee, Rahmeh, Stunkel, Nair, Roy, Balasubramanian & Abraham, 1990), (Balasubramanian & Banerjee, 1990), (Boley, Brent, Golub & Luk, 1992), (Luk & Park, 1986).
In order to be able to recover from a fail-stop process failure in the middle of the computation, a global consistent state of the application is often required when a process failure occurs. Checkpointing and message logging are the typical approaches to maintain or construct such a global consistent state in a distributed environment. But if there exists a checksum relationship between the application data on different processes, that checksum relationship can itself be treated as a global consistent state. However, it is still an open problem whether the checksum relationship in the input checksum matrices of ABFT can be maintained during computation. Therefore, whether ABFT can be extended to tolerate fail-stop process failures in a distributed environment also remains open.
In this section, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not preserved during computation. We then prove that, for the outer product version of the matrix-matrix multiplication algorithm, it is possible to maintain the checksum relationship in the input checksum matrices during computation. Based on this checksum relationship maintained during computation, we demonstrate that it is possible to tolerate fail-stop process failures (which are typically tolerated by checkpointing or message logging) in distributed outer product matrix multiplication without checkpointing. Because no periodic checkpoint or rollback-recovery is involved in this approach, process failures can often be tolerated with a surprisingly low overhead.

Figure 5. One dimensional weighted checksum scheme for diskless checkpointing


Maintaining Checksum at the End of Computation


For any general m-by-n matrix A defined by

$$A = \begin{pmatrix} a_{00} & \cdots & a_{0,n-1} \\ \vdots & & \vdots \\ a_{m-1,0} & \cdots & a_{m-1,n-1} \end{pmatrix},$$

the column checksum matrix A^c of the matrix A is defined by

$$A^c = \begin{pmatrix} a_{00} & \cdots & a_{0,n-1} \\ \vdots & & \vdots \\ a_{m-1,0} & \cdots & a_{m-1,n-1} \\ \sum_{i=0}^{m-1} a_{i0} & \cdots & \sum_{i=0}^{m-1} a_{i,n-1} \end{pmatrix}.$$

The row checksum matrix A^r of the matrix A is defined by

$$A^r = \begin{pmatrix} a_{00} & \cdots & a_{0,n-1} & \sum_{j=0}^{n-1} a_{0j} \\ \vdots & & \vdots & \vdots \\ a_{m-1,0} & \cdots & a_{m-1,n-1} & \sum_{j=0}^{n-1} a_{m-1,j} \end{pmatrix}.$$

The full checksum matrix A^f of the matrix A is defined by

$$A^f = \begin{pmatrix} a_{00} & \cdots & a_{0,n-1} & \sum_{j=0}^{n-1} a_{0j} \\ \vdots & & \vdots & \vdots \\ a_{m-1,0} & \cdots & a_{m-1,n-1} & \sum_{j=0}^{n-1} a_{m-1,j} \\ \sum_{i=0}^{m-1} a_{i0} & \cdots & \sum_{i=0}^{m-1} a_{i,n-1} & \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} a_{ij} \end{pmatrix}.$$

Theorem 1: Assume A, B, and C are three matrices. If A · B = C, then A^c · B^r = C^f.

Proof: Assume A is m-by-n and B is n-by-k; then C is m-by-k. Let e denote a column vector of all ones of the appropriate dimension. Then

$$A^c = \begin{pmatrix} A \\ e^T A \end{pmatrix}, \qquad B^r = \begin{pmatrix} B & Be \end{pmatrix}, \qquad C^f = \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix},$$

and therefore

$$A^c B^r = \begin{pmatrix} A \\ e^T A \end{pmatrix} \begin{pmatrix} B & Be \end{pmatrix} = \begin{pmatrix} AB & ABe \\ e^T AB & e^T ABe \end{pmatrix} = \begin{pmatrix} C & Ce \\ e^T C & e^T Ce \end{pmatrix} = C^f.$$

Theorem 1 was first proved by Huang and Abraham in (Huang & Abraham, 1984). The reason that we repeat the proof here is to point out that the proof of Theorem 1 is independent of the algorithm used for the matrix-matrix multiplication operation. Therefore, no matter which algorithm is used to perform the matrix-matrix multiplication, the checksum relationship of the input matrices will always be preserved in the final computation results at the end of the computation.
Based on this checksum relationship in the final computation results, the low-overhead algorithm-based fault tolerance technique has been developed in (Huang & Abraham, 1984) to detect, locate, and correct certain processor miscalculations in matrix computations.
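A quick numerical check of Theorem 1, sketched with NumPy; the helper function names are ours, not from the chapter:

import numpy as np

def column_checksum(A):
    """Append a row holding the column sums of A (the matrix A^c)."""
    return np.vstack([A, A.sum(axis=0)])

def row_checksum(A):
    """Append a column holding the row sums of A (the matrix A^r)."""
    return np.hstack([A, A.sum(axis=1, keepdims=True)])

def full_checksum(A):
    """Append both a checksum row and a checksum column (the matrix A^f)."""
    return row_checksum(column_checksum(A))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = A @ B
# Theorem 1: A^c * B^r equals the full checksum matrix C^f.
assert np.allclose(column_checksum(A) @ row_checksum(B), full_checksum(C))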

Is the Checksum Maintained During Computation?


Algorithm-based fault tolerance usually detects, locates, and corrects errors at the end of the computation. But in today's high performance computing environments such as MPI, after a fail-stop process failure occurs in the middle of the computation, it is often required to recover from the failure first before continuing the rest of the computation.
In order to be able to recover from fail-stop failures that occur in the middle of the computation, a global consistent state of an application is often required in the middle of the computation. The checksum relationship, if it exists, can actually be treated as a global consistent state. However, from Theorem 1, it is still uncertain whether the checksum relationship is preserved in the middle of the computation. In what follows, we demonstrate that, for both Cannon's algorithm and Fox's algorithm for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is generally not preserved in the middle of the computation.
Assume A is an (n − 1)-by-n matrix and B is an n-by-(n − 1) matrix. Then A^c = (a_ij)_{n×n}, B^r = (b_ij)_{n×n}, and C^f = A^c · B^r are all n-by-n matrices. For convenience of description, but without loss of generality, assume there are n² processors, with each processor storing one element of A^c, B^r, and C^f, respectively. The n² processors are organized into an n-by-n processor grid.


Consider using Cannon's algorithm (Cannon, 1969) in Figure 6 to perform A^c · B^r in parallel on an n-by-n processor grid. We can prove the following Theorem 2.
Theorem 2: If Cannon's algorithm in Figure 6 is used to perform the parallel matrix-matrix multiplication, then there exist matrices A and B such that, at the end of each step s, where s = 0, 1, 2, ..., n − 2, the matrix C = (c_ij) is not a full checksum matrix.
When Cannon's algorithm in Figure 6 is used to perform A^c · B^r in parallel for general matrices A and B, it can be proved that at the end of the s-th step

c_ij = Σ_{k=0}^{s} a_{i, (i+j+k) mod n} · b_{(i+j+k) mod n, j}.

It can be verified that C = (c_ij)_{n×n} is not a full checksum matrix unless s = n − 1, which is the end of the computation. Therefore, the checksum relationship in the matrix C is generally not preserved during computation in Cannon's algorithm for matrix multiplication.
It can also be demonstrated that the checksum relationship in the matrix C is not preserved during computation in many other parallel matrix-matrix multiplication algorithms, such as Fox's algorithm.

Figure 6. Matrix-matrix multiplication by Cannon's algorithm with checksum matrices


Figure 7. Matrix-matrix multiplication by outer product algorithm with input checksum matrices

An Algorithm That Maintains the Checksum during Computation


Although the checksum relationship of the input matrices is preserved in the final results at the end of the computation no matter which algorithm is used, we know from the last subsection that the checksum relationship is not necessarily preserved during computation. However, it is interesting to ask: is there any algorithm that preserves the checksum relationship during computation?
Consider using the outer product version of the algorithm (Golub & Van Loan, 1989) in Figure 7 to perform A^c · B^r in parallel. Assume the matrices A^c, B^r, and C^f have the same data distribution scheme as before.
Theorem 3: If the algorithm in Figure 7 is used to perform the parallel matrix-matrix multiplication, then the matrix C = (c_ij)_{n×n} is a full checksum matrix at the end of each step s, where s = 0, 1, 2, ..., n − 1.
Proof: It is trivial to show that, at the end of the s-th step in Figure 7, the c_ij in the algorithm satisfies

c_ij = Σ_{k=0}^{s} a_{ik} b_{kj},

where i, j = 0, 1, 2, ..., n − 1.
Note that A^c is a column checksum matrix; therefore

Σ_{t=0}^{n−2} a_{tj} = a_{n−1, j},

where j = 0, 1, 2, ..., n − 1.
Since B^r is a row checksum matrix, we have

Σ_{t=0}^{n−2} b_{it} = b_{i, n−1},

where i = 0, 1, 2, ..., n − 1.
Therefore, for all j = 0, 1, 2, ..., n − 1, we have

Σ_{t=0}^{n−2} c_{tj} = Σ_{t=0}^{n−2} Σ_{k=0}^{s} a_{tk} b_{kj} = Σ_{k=0}^{s} (Σ_{t=0}^{n−2} a_{tk}) b_{kj} = Σ_{k=0}^{s} a_{n−1, k} b_{kj} = c_{n−1, j}.

Similarly, for all i = 0, 1, 2, ..., n − 1, we have

Σ_{t=0}^{n−2} c_{it} = Σ_{t=0}^{n−2} Σ_{k=0}^{s} a_{ik} b_{kt} = Σ_{k=0}^{s} a_{ik} (Σ_{t=0}^{n−2} b_{kt}) = Σ_{k=0}^{s} a_{ik} b_{k, n−1} = c_{i, n−1}.

Therefore, we can conclude that C is a full checksum matrix. Hence, at the end of each step s, where s = 0, 1, 2, ..., n − 1, C = (c_ij)_{n×n} is a full checksum matrix.
Theorem 3 implies that a coded global consistent state of the critical application data (i.e., the checksum relationship in A^c, B^r, and C^f) can be maintained in memory at the end of each iteration of the outer product version of matrix-matrix multiplication if we perform the computation with the checksum input matrices.
However, in a high performance distributed environment, different processes may update their data
in local memory asynchronously. Therefore, if a failure happens at a time when some processes have
updated their local matrix in memory and other processes are still in the communication stage, then the
checksum relationship in the distributed matrix will be damaged and the data on all processes will not
form a global consistent state.
But this problem can be solved by simply performing a synchronization before performing local
memory update. Therefore, it is possible to maintain a coded global consistent state (i.e. the checksum
relationship) of the matrix Ac, Br and Cf in the distributed memory at any time during computation.
Hence, a single fail-stop process failure in the middle of the computation can be recovered from the
checksum relationship.
Note that it is also the outer product version of the algorithm that is often used in today's high performance computing practice. The outer product version is more popular due to both its simplicity and its efficiency on modern high performance computer architectures. In the widely used parallel numerical


linear algebra library ScaLAPACK (Blackford, Choi, Cleary, Petitet, Whaley, Demmel, et al., 1996), it is also the outer product version of the algorithm that is chosen to perform the matrix-matrix multiplication. More importantly, it can also be proved that a similar checksum relationship exists for the outer product versions of many other matrix operations (such as Cholesky and LU factorization).
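A small sketch of the outer product update with checksum inputs, assuming NumPy; it checks that C stays a full checksum matrix after every step, in the spirit of Theorem 3 (names are ours):

import numpy as np

def is_full_checksum(C, tol=1e-9):
    """True if the last row/column of C hold the column/row sums of the rest."""
    return (np.allclose(C[-1, :], C[:-1, :].sum(axis=0), atol=tol) and
            np.allclose(C[:, -1], C[:, :-1].sum(axis=1), atol=tol))

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n - 1, n))                 # (n-1)-by-n
B = rng.standard_normal((n, n - 1))                 # n-by-(n-1)
Ac = np.vstack([A, A.sum(axis=0)])                  # column checksum matrix, n-by-n
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row checksum matrix, n-by-n

C = np.zeros((n, n))
for s in range(n):                                  # outer product version: C += Ac[:, s] * Br[s, :]
    C += np.outer(Ac[:, s], Br[s, :])
    assert is_full_checksum(C)                      # checksum relationship holds at every step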

PRACTICAL NUMERICAL ISSUES


Both the encoding schemes introduced for scalable checkpointing and the algorithm-based fault tolerance presented before involve solving systems of linear equations to recover from multiple simultaneous process failures. Therefore, the practical numerical issues involved in recovering from multiple simultaneous process failures have to be addressed.

Numerical Stability of Real Number Codes


From the previous section, it has been derived that, to be able to recover from any no more than M failures, the encoding matrix A has to satisfy the condition that any square sub-matrix (including minors) of A is non-singular. This requirement for the encoding matrix coincides with the properties of the generator matrices of real-number Reed-Solomon style erasure correcting codes. In fact, our weighted checksum encoding discussed before can be viewed as a version of the Reed-Solomon erasure coding scheme (Plank, 1997) over the real number field. Therefore, any generator matrix from real-number Reed-Solomon style erasure codes can actually be used as the encoding matrix of algorithm-based checkpoint-free fault tolerance. In the existing real-number or complex-number Reed-Solomon style erasure codes in the literature, the generator matrices mainly include: the Vandermonde matrix (Vander), the Vandermonde-like matrix for the Chebyshev polynomials (Chebvand), the Cauchy matrix (Cauchy), the Discrete Cosine Transform matrix (DCT), and the Discrete Fourier Transform matrix (DFT). Theoretically, these generator matrices can all be used as the encoding matrix of the algorithm-based checkpoint-free fault tolerance scheme. However, in computer floating point arithmetic, where no computation is exact due to round-off errors, it is well known (Golub & Van Loan, 1989) that, in solving a linear system of equations, a condition number of 10^k for the coefficient matrix leads to a loss of accuracy of about k decimal digits in the solution. Therefore, in order to get a reasonably accurate recovery, the encoding matrix A actually has to satisfy the condition that any square sub-matrix (including minors) of A is well-conditioned.
The generator matrices from above real number or complex-number Reed-Solomon style erasure
codes all contain ill-conditioned sub-matrices. Therefore, in these codes, when certain error patterns
occur, an ill-conditioned linear system has to be solved to reconstruct an approximation of the original
information, which can cause the loss of precision of possibly all digits in the recovered numbers.

Numerically Good Real Number Codes Based on Random Matrices


In this section, we will introduce a class of new codes that are able to reconstruct a very good approximation of the lost data with high probability regardless of the failure patterns of the processes. Our new codes are based on random matrices over the real number field. It is well known (Edelman, 1988) that Gaussian random matrices are well-conditioned. To estimate how well conditioned Gaussian random matrices are, we have proved the following theorem:


Theorem 4: Let G_{m×n} be an m×n real random matrix whose elements are independent and identically distributed standard normal random variables, and let κ₂(G_{m×n}) be the 2-norm condition number of G_{m×n}. Then, for any m ≥ 2, n ≥ 2 and x ≥ |n − m| + 1, κ₂(G_{m×n}) satisfies

(c/x)^{|n−m|+1} < P( κ₂(G_{m×n}) / (n / (|n−m|+1)) > x ) < (C/x)^{|n−m|+1}

and

E(ln κ₂(G_{m×n})) < ln( n / (|n−m|+1) ) + 2.258,

where 0.245 ≤ c ≤ 2.000 and 5.013 ≤ C ≤ 6.414 are universal positive constants independent of m, n and x.
Due to the length of the proof of Theorem 4, we omit it here and refer interested readers to (Chen & Dongarra, 2005) for the complete proof.
Note that any sub-matrix of a Gaussian random matrix is still a Gaussian random matrix. Therefore, with high probability, every sub-matrix of a Gaussian random matrix is well-conditioned.
Theorem 4 can be used to estimate the accuracy of recovery in the weighted checksum scheme. For example, if an application uses 100,000 processes to perform computation and 20 processes to hold encodings, then the encoding matrix is a 20-by-100,000 Gaussian random matrix. If 10 processes fail concurrently, then the coefficient matrix in the recovery algorithm is a 20-by-10 Gaussian random matrix. From Theorem 4, we can get E(log₁₀ κ₂(A_r)) < 1.25 and P{κ₂(A_r) > 100} < 3.1 × 10⁻¹¹. Therefore, on average, we will lose about one decimal digit in the recovered data, and the probability of losing two digits is less than 3.1 × 10⁻¹¹.
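A quick empirical illustration of why Gaussian random encoding matrices behave well numerically, sketched with NumPy (the sizes mirror the example above; names are ours):

import numpy as np

rng = np.random.default_rng(0)
M, failed = 20, 10   # 20 checksum processes, 10 concurrent failures

# Condition numbers of the 20-by-10 recovery matrices over many random trials.
# Theorem 4 predicts E(log10 k2) < 1.25 for this shape.
conds = [np.linalg.cond(rng.standard_normal((M, failed))) for _ in range(1000)]
print("mean log10 condition number:", np.mean(np.log10(conds)))
print("worst case over 1000 trials:", max(conds))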

EXPERIMENTAL EVALUATION
In this section, we evaluate the performance of the introduced fault tolerance schemes experimentally.

Performance of the Chain-Pipelined Checkpoint Encoding


In this section, we evaluate the scalability of the proposed chain-pipelined checkpoint encoding algorithm using a preconditioned conjugate gradient (PCG) equation solver (Barrett, Berry, Chan, Demmel,
Donato, Dongarra, Eijkhout, et al., 1994). The basic weighted checksum scheme is incorporated into
our PCG code. The checkpoint encoding matrix we used is a pseudo random matrix. The programming
environment we used is FT-MPI (Fagg & Dongarra, 2000), (Fagg, et al., 2004), (Fagg, et al., 2005). A
process failure is simulated by killing one process in the middle of the computation. The lost data on
the failed process is recovered by solving the checksum equation.


We fix the number of simultaneous processor failures and increase the total number of processors
for computing. But the problems to solve are chosen very carefully such that the size of checkpoint
on each processor is always the same (about 25 Megabytes) in every experiment. By keeping the size
of checkpoint per processor fixed, we are able to observe the impact of the total number of computing
processors on the performance of the checkpointing.
In all experiments, we performed a checkpoint every 100 iterations and ran PCG for 2000 iterations. In practice, there is an optimal checkpoint interval which depends on the failure rate, the time cost of each checkpoint and the time cost of each recovery. Much literature about the optimal checkpoint interval (Gelenbe, 1979), (Plank & Thomason, 2001), (Young, 1974) is available. We will not address this issue further here.
Figure 8 reports both the checkpoint overhead (for one checkpoint) and the recovery overhead (for one recovery) for tolerating 4 simultaneous process failures on an IBM RS/6000 with 176 Winterhawk II thin nodes (each with four 375 MHz Power3-II processors). The number of checkpoint processors in the experiment is four. We simulate four simultaneous processor failures by killing four processes during the execution.
Figure 8 demonstrates that both the checkpoint overhead and the recovery overhead are very stable as the total number of computing processes increases from 60 to 480. This is consistent with our theoretical result in the previous section.

Performance of the Algorithm-Based Checkpoint-Free Fault Tolerance


In this section, we experimentally evaluate the performance overhead of applying this checkpoint-free fault tolerance technique to tolerate a single fail-stop process failure in the widely used ScaLAPACK matrix-matrix multiplication kernel. The sizes of the problems and the numbers of computation processes used in our experiments are listed in Figure 9.
All experiments were performed on a cluster of 64 dual-processor nodes with AMD Opteron(tm) Processor 240. Each node of the cluster has 2 GB of memory and runs a Linux operating system. The nodes are connected with Myrinet. The timer we used in all measurements is MPI_Wtime. The programming environment we used is FT-MPI (Fagg, et al., 2005).
Figure 8. Scalability of the checkpoint encoding and recovery decoding


Figure 9. Experiment configurations

Figure 10. The total overhead (time) for fault tolerance

When no failure occurs, the total overhead equals the overhead for calculating the encoding at the beginning plus the overhead of performing the computation with larger (checksum) matrices. If failures occur, then the total performance overhead equals the overhead without failures plus the overhead for recovering the FT-MPI environment and the overhead for recovering the application data from the checksum relationship.
Figure 10 reports the execution times of the original matrix-matrix multiplication, the fault-tolerant matrix-matrix multiplication without failures, and the fault-tolerant matrix-matrix multiplication with a single fail-stop process failure. Figure 11 reports the total fault tolerance overhead (%).


Figure 11 demonstrates that, as the number of processors increases, the total overhead (%) decreases. This is because, as the number of processors increases, the time overhead is fairly stable while the total amount of time to solve a problem increases. The percentage overhead equals the time overhead divided by the total amount of time to solve the problem.

CONCLUSION AND FUTURE WORK


In this chapter, we presented two scalable fault tolerance techniques for large-scale high performance
computing. The introduced techniques are scalable in the sense that the overhead to survive k failures in
p processes does not increase as the number of processes p increases. Experimental results demonstrate
that the introduced techniques scale well as the total number of processors increases.

REFERENCES
Adiga, N. R., et al. (2002). An overview of the BlueGene/L supercomputer. In Proceedings of the Supercomputing Conference (SC2002), Baltimore, MD, USA (pp. 1–22).
Anfinson, J., & Luk, F. T. (1988, December). A Linear Algebraic Model of Algorithm-Based Fault Tolerance. IEEE Transactions on Computers, 37(12), 1599–1604. doi:10.1109/12.9736
Balasubramanian, V., & Banerjee, P. (1990). Compiler-Assisted Synthesis of Algorithm-Based Checking in Multiprocessors. IEEE Transactions on Computers, C-39, 436–446. doi:10.1109/12.54837

Figure 11. The total overhead (%) for fault tolerance


Banerjee, P., Rahmeh, J. T., Stunkel, C. B., Nair, V. S. S., Roy, K., & Balasubramanian, V. (1990). Algorithm-based fault tolerance on a hypercube multiprocessor. IEEE Transactions on Computers, C-39, 1132–1145. doi:10.1109/12.57055
Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., et al. (1994). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods (2nd ed.). Philadelphia, PA: SIAM.
Blackford, L. S., Choi, J., Cleary, A., Petitet, A., & Whaley, R. C. Demmel, et al. (1996). ScaLAPACK:
a portable linear algebra library for distributed memory computers - design issues and performance. In
Supercomputing 96: Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM),
(p. 5).
Boley, D. L., Brent, R. P., Golub, G. H., & Luk, F. T. (1992). Algorithmic fault tolerance using the Lanczos method. SIAM Journal on Matrix Analysis and Applications, 13, 312–332. doi:10.1137/0613023
Cannon, L. E. (1969). A cellular computer to implement the Kalman filter algorithm. Ph.D. thesis, Montana State University, Bozeman, MT.
Chen, Z., & Dongarra, J. (2005). Condition numbers of Gaussian random matrices. SIAM Journal on Matrix Analysis and Applications, 27(3), 603–620. doi:10.1137/040616413
Chiueh, T., & Deng, P. (1996). Evaluation of checkpoint mechanisms for massively parallel machines. In FTCS (pp. 370–379).
Dongarra, J., Meuer, H., & Strohmaier, E. (2004). TOP500 Supercomputer Sites, 24th edition. In Proceedings of the Supercomputing Conference (SC2004), Pittsburgh, PA. New York: ACM.
Edelman, A. (1988). Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9(4), 543–560. doi:10.1137/0609045
Fagg, G. E., & Dongarra, J. (2000). FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In PVM/MPI 2000 (pp. 346–353).
Fagg, G. E., Gabriel, E., Bosilca, G., Angskun, T., Chen, Z., Pjesivac-Grbovic, J., et al. (2004). Extending
the MPI specification for process fault tolerance on high performance computing systems. In Proceedings of the International Supercomputer Conference, Heidelberg, Germany.
Fagg, G. E., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., & Pjesivac-Grbovic, J. (2005). Process fault tolerance: Semantics, design and applications for high performance computing. International Journal of High Performance Computing Applications, 19(4), 465–477. doi:10.1177/1094342005056137
Fox, G. C., Johnson, M., Lyzenga, G., Otto, S. W., Salmon, J., & Walker, D. (1988). Solving Problems on Concurrent Processors: Vol. 1. Englewood Cliffs, NJ: Prentice-Hall.
Gelenbe, E. (1979). On the optimum checkpoint interval. Journal of the ACM, 26(2), 259–270. doi:10.1145/322123.322131
Golub, G. H., & Van Loan, C. F. (1989). Matrix Computations. Baltimore, MD: The Johns Hopkins University Press.


Huang, K.-H., & Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33, 518–528. doi:10.1109/TC.1984.1676475
Kim, Y. (1996, June). Fault Tolerant Matrix Operations for Parallel and Distributed Systems. Ph.D. dissertation, University of Tennessee, Knoxville.
Luk, F. T., & Park, H. (1986). An analysis of algorithm-based fault tolerance techniques. SPIE Adv. Alg. and Arch. for Signal Proc., 696, 222–228.
Message Passing Interface Forum. (1994). MPI: A Message Passing Interface Standard. (Technical Report ut-cs-94-230), University of Tennessee, Knoxville, TN.
Plank, J. S. (1997, September). A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software, Practice & Experience, 27(9), 995–1012. doi:10.1002/(SICI)1097-024X(199709)27:9<995::AID-SPE111>3.0.CO;2-6
Plank, J. S., Kim, Y., & Dongarra, J. (1997). Fault-tolerant matrix operations for networks of workstations using diskless checkpointing. Journal of Parallel and Distributed Computing, 43(2), 125–138. doi:10.1006/jpdc.1997.1336
Plank, J. S., & Li, K. (1994). Faster checkpointing with N+1 parity. In FTCS (pp. 288–297).
Plank, J. S., Li, K., & Puening, M. A. (1998). Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10), 972–986. doi:10.1109/71.730527
Plank, J. S., & Thomason, M. G. (2001, November). Processor allocation and checkpoint interval selection in cluster computing systems. Journal of Parallel and Distributed Computing, 61(11), 1570–1590. doi:10.1006/jpdc.2001.1757
Silva, L. M., & Silva, J. G. (1998). An experimental study about diskless checkpointing. In EUROMICRO'98 (pp. 395–402).
Young, J. W. (1974). A first order approximation to the optimal checkpoint interval. Communications of the ACM, 17(9), 530–531. doi:10.1145/361147.361115

KEY TERMS AND DEFINITIONS


Checkpointing: Checkpointing is a technique for incorporating fault tolerance into a system.
Erasure Correction Codes: An erasure correction code transforms a message of n blocks into a message with more than n blocks, such that the original message can be recovered from a subset of those blocks.
Fail-Stop Failure: A fail-stop failure is a type of failure that causes the affected component of a system to stop operating.
Fault Tolerance: Fault tolerance is the property of a system that enables it to continue operating properly after a failure has occurred in the system.
Message Passing Interface: Message Passing Interface is a specification for an API that allows
different processes to communicate with one another.


Parallel and Distributed Computing: Parallel and distributed computing is a sub-field of computer
science that handles computing involving more than one processing unit.
Pipeline: A pipeline is a set of data processing elements connected in series so that the output of one
element is the input of the next one.


Section 10

Applications


Chapter 34

Efficient Update Control of Bloom Filter Replicas in Large Scale Distributed Systems
Yifeng Zhu
University of Maine, USA
Hong Jiang
University of Nebraska Lincoln, USA

ABSTRACT
This chapter discusses the false rates of Bloom filters in a distributed environment. A Bloom filter (BF)
is a space-efficient data structure to support probabilistic membership query. In distributed systems, a
Bloom filter is often used to summarize local services or objects and this Bloom filter is replicated to
remote hosts. This allows remote hosts to perform fast membership query without contacting the original
host. However, when the services or objects are changed, the remote Bloom replica may become stale.
This chapter analyzes the impact of staleness on the false positive and false negative rates for membership queries on a Bloom filter replica. An efficient update control mechanism is then proposed based on the
analytical results to minimize the updating overhead. This chapter validates the analytical models and
the update control mechanism through simulation experiments.

INTRODUCTION TO BLOOM FILTERS


A standard Bloom filter (BF) (Bloom, 1970) is a lossy but space-efficient data structure to support
membership queries within a constant delay. As shown in Figure 1, a BF includes k independent random
hash functions and a bit vector B of length m. It is assumed that the BF represents a finite set S = {x₁, x₂, ..., x_n} of n elements from a universe U. The hash functions h_i(x), 1 ≤ i ≤ k, map the universe U to the bit address space [1, m], as follows:

H(x) = {h_i(x) | 1 ≤ h_i(x) ≤ m for 1 ≤ i ≤ k}    (1)



Figure 1. A Bloom filter with a bit vector of m bits, and k independent hash functions. When an element
x is added into the set represented, all bits indexed by those hash functions are set to 1.

Definition 1. For all x ∈ U, B[H(x)] ≡ {B[h_i(x)] | 1 ≤ i ≤ k}.

This notation facilitates the description of operations on the subset of B addressed by the hash functions. For example, B[H(x)] = 1 represents the condition in which all the bits in B at the positions h_1(x), ..., h_k(x) are 1. Setting B[H(x)] means that the bits at these positions in B are set to 1.
Representing the set S using a BF B is fast and simple. Initially, all the bits in B are set to 0. Then, for each x ∈ S, an operation of setting B[H(x)] is performed. Given an element x, to check whether x is in S, one only needs to test whether B[H(x)] = 1. If no, then x is not a member of S; if yes, x is conjectured to be in S. Figure 1 shows the result after the element x is inserted into the Bloom filter.
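A minimal standard Bloom filter sketch in Python (illustrative only; it derives k indices from one standard-library digest rather than k truly independent hash functions):

import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, x):
        # Derive k indices from a single digest; a stand-in for k independent hashes.
        digest = hashlib.sha256(str(x).encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, x):
        for pos in self._positions(x):      # set B[H(x)]
            self.bits[pos] = 1

    def __contains__(self, x):
        return all(self.bits[pos] for pos in self._positions(x))   # test B[H(x)] = 1

bf = BloomFilter(m=1200, k=6)
bf.add("object-42")
print("object-42" in bf)   # True
print("object-43" in bf)   # False with high probability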
A standard BF has two well-known properties that are described by the following two theorems.

Theorem 1. Zero false negative

For x ∈ U, if ∃i such that B[h_i(x)] ≠ 1, then x ∉ S.

For a static set S whose elements are not dynamically deleted, a membership query on the bit vector never returns a false negative. The proof is easy and is not given in this chapter.

Theorem 2. Possible false positive

For x ∈ U, if B[H(x)] = 1, then there is a small probability f⁺ that x ∉ S. This probability is called the false positive rate, and f⁺ ≈ (1 − e^{−kn/m})^k. Given a specific ratio of m/n, f⁺ is minimized when k = (m/n) ln 2, and f⁺_min ≈ (0.6185)^{m/n}.
Proof: The proof is based on the mathematical model proposed in (James, 1983; McIlroy, 1982). Detailed proofs can be found in (Li et al., 2000; Michael, 2002). For the convenience of the reader, the proof is briefly presented here.
After inserting n elements into a BF, the probability that a bit is zero is given by:


Figure 2. Expected false positive rate in a standard Bloom filter. A false positive is due to the collision
of hash functions, where all indexed bits happen to be set by other elements.

P₀(n) = (1 − 1/m)^{kn} ≈ e^{−kn/m}.    (2)

Thus the probability that k bits are set to 1 is

P(k bits set) = (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k.    (3)

Assuming each element is equally likely to be accessed and |S| ≪ |U|, then the false positive rate is

f⁺ = (1 − |S|/|U|) · P(k bits set) ≈ (1 − e^{−kn/m})^k.    (4)

Given a specific ratio of m/n, i.e., the number of bits per element, it can be proved that the false positive rate f⁺ is minimized when k = (m/n) ln 2, and the minimal false positive rate is, as has been shown in (Michael, 2002),

f⁺ ≈ 0.5^k = (0.6185)^{m/n}.    (5)

The key advantage of a Bloom filter is that its storage requirement falls several orders of magnitude below the lower bounds of error-free encoding structures. This space efficiency is achieved at the cost of allowing a certain (typically non-zero) probability of false positives; that is, it may incorrectly return a "yes" although x is actually not in S. Appropriately adjusting m and k can reduce this false positive probability to a sufficiently small value so that the benefits from the space and time efficiency far outweigh the penalty incurred by false positives in many applications. For example, when the bit-element ratio is 8 and the number of hash functions is 6, the expected false positive rate is only 0.0216. Figure 2 shows the false positive rate under different configurations.
In order to represent a dynamic set that changes over time, (Li et al., 2000) proposes a variant named the counting BF. A counting BF includes an array in which each entry is not a bit but rather a counter consisting of several bits. Counting Bloom filters can support element deletion operations. Let C = {c_j | 1 ≤ j ≤ m} denote the counter vector, where the counter c_j represents the difference between the number of setting operations and the number of unsetting operations made to the bit B[j]. All counters c_j, 1 ≤ j ≤ m, are initialized to zero. When an element x is inserted or deleted, the counters C[H(x)] are incremented or decremented by one, respectively. If c_j changes its value from one to zero, B[j] is reset to zero. While this counter array consumes some memory space, (Li et al., 2000) shows that 4 bits per counter keep the probability of overflow minuscule even with several hundred million elements in a BF.
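A counting Bloom filter sketch extending the class from the earlier sketch (again illustrative; BloomFilter and _positions are defined above):

class CountingBloomFilter(BloomFilter):
    def __init__(self, m, k):
        super().__init__(m, k)
        self.counters = [0] * m            # e.g. 4-bit counters in a real implementation

    def add(self, x):
        for pos in self._positions(x):
            self.counters[pos] += 1
            self.bits[pos] = 1

    def remove(self, x):
        # Assumes x was previously added; otherwise counters could underflow.
        for pos in self._positions(x):
            self.counters[pos] -= 1
            if self.counters[pos] == 0:
                self.bits[pos] = 0

cbf = CountingBloomFilter(m=1200, k=6)
cbf.add("object-42")
cbf.remove("object-42")
print("object-42" in cbf)   # False again after deletion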

APPLICATIONS OF BLOOM FILTERS IN DISTRIBUTED SYSTEMS


Bloom filters have been extensively used in many distributed systems where information dispersed
across the entire system needs to be shared. For example, to reduce the message traffic, (Li et al., 2000)
propose a web cache sharing protocol that employs a BF to represent the content of a web proxy cache
and then periodically propagates that filter to other proxies. If a cache miss occurs on a proxy, that proxy
checks the BFs replicated from other proxies to see whether they have the desired web objects in their
caches. (Hong & Tao, 2003; Hua et al., 2008; Ledlie et al., 2002; Matei & Ian, 2002; Zhu et al., 2004;
Zhu et al., 2008) use BFs to implement the function of mapping logical data identities to their physical
locations in distributed storage systems. In these schemes, each storage node constructs a Bloom filter
that summarizes the identities of data stored locally and broadcasts the Bloom filter to other nodes.
By checking all filters collected locally, a node can locate the requested data without sending massive
query messages to other nodes. Similar deployments of BFs have been found in geographic routing in
wireless mobile systems (Pai-Hsiang, 2001), peer-to-peer systems (Hailong & Jun, 2004; John et al.,
2000; Mohan & Kalogeraki, 2003; Rhea & Kubiatowicz, 2002), naming services (Little et al., 2002),
and wireless sensor networks (Ghose et al. 2003; Luk et al. 2007).
A common characteristic of distributed applications of BFs, including all those described above, is that
a BF at a local host is replicated to other remote hosts to efficiently support distributed queries. In such
dynamical distributed applications, the information that a BF represents evolves over time. However, the
updating processes are usually delayed due to the network latency or the delay necessary in aggregating

788

Efficient Update Control of Bloom Filter Replicas in Large Scale Distributed Systems

small changes into a single updating message in order to reduce the updating overhead. Accordingly, the
contents of the remote replicas may become partially outdated. This possible staleness in the remote
replicas not only changes the probability of false positive answers to membership queries on the remote
hosts, but also brings forth the possibility of false negatives. A false negative occurs when a BF replica
answers no to the membership query for an element while that element actually exists in its host. It
is generated when a new element is added to a host while the changes of the BF of this host, including
the addition of this new element, have not been propagated to its replicas on other hosts. In addition,
this staleness also changes the probability of false positives, an event in which an element is incorrectly
identified as a member. Throughout the rest of this chapter, the probabilities of false negatives and false
positives are referred to as the false negative rate and false positive rate, respectively.
While the false negative and false positive rates for a BF at a local host have been well studied in
the context of non-replicated BF (Bloom, 1970; Broder & Mitzenmacher, 2003; James, 1983; Li et al.,
2000; Michael, 2002), very little attention has been paid to the false rates in the Bloom filter replicas
in a distributed environment. In the distributed systems considered in this chapter, the false rates of the
replicas are more important since most membership queries are performed on these replicas. A good
understanding of the impact of the false negatives and false positives can provide the system designers
with important and useful insights into the development and deployment of distributed BFs in such important applications as distributed file, database, and web server management systems in super-scales.
Therefore, the first objective of this chapter is to analyze the false rates by developing analytical models
and considering the staleness.
Since different applications may desire different tradeoffs between false rate (e.g., miss/fault penalty) and update overhead (e.g., network traffic and processing due to broadcasting of updates), it is very important for the system's overall performance to be able to control such a tradeoff for
a given application adaptively and efficiently. The second objective is to develop an adaptive control
algorithm that can accurately and efficiently maintain a desirable level of false rate for any given application by dynamically and judiciously adjusting the update frequency.
The primary contribution of this chapter is its development of accurate closed-form expressions for the false negative and false positive rates in BF replicas, and of an adaptive replica-update control, based on our analytical model, that accurately and efficiently maintains a desirable level of false rate for any given application. To the best of our knowledge, this study is the first of its kind to consider the impact of staleness of replicated BF contents in a distributed environment and to develop a mechanism that adaptively minimizes such an impact so as to optimize system performance.
The rest of the chapter is organized as follows. Section 3 presents our analytical models that theoretically derive false negative and false positive rates of a BF replica, as well as the overall false rates
in distributed systems. Section 4 validates our theoretical results by comparing them against results
obtained from extensive experiments. The adaptive updating protocols based on our theoretical analysis
models are presented in Section 5. Section 6 gives related work and Section 7 concludes the chapter.
The chapter is extended from our previous publication (Zhu & Jiang, 2006).

FALSE RATES IN THEORY


In many distributed systems, the information about what data objects can be accessed through a host or
where data objects are located usually needs to be shared to facilitate the lookup. To provide high scalability, this information sharing usually takes a decentralized approach to avoid the potential performance bottleneck and vulnerability of a centralized architecture such as a dedicated server. While BFs were initially used in non-distributed systems to save memory space in the 1980s, when memory was considered a precious resource (Lee, 1982; McIlroy, 1982), they have recently been extensively used in many distributed systems as a scalable and efficient scheme for information sharing, due to their low network traffic overhead.

Figure 3. An example application of Bloom filters in a distributed system with 3 hosts.
The inherent nature of such information sharing in almost all these distributed systems, if not all,
can be abstracted as a location identification, or mapping problem, which is described next. Without
loss of generality, the distributed system considered throughout this chapter is assumed to consist of
a collection of autonomous data-storing host computers dispersed across a communication network.
These hosts partition a universe U of data objects into subsets S₁, S₂, ..., S_ℓ, with each subset stored on one of these hosts. Given an arbitrary object x in U, the problem is how to efficiently identify the host that stores x from any one of the hosts.
BFs are useful for solving this kind of problem. In a typical approach, each host constructs a BF representing the subset of objects stored in it, and then broadcasts that filter to all the other hosts. Thus each host keeps ℓ − 1 additional BFs, one for every other host. Figure 3 shows an example of a system with three hosts. Note that a filter B̂_i is a replica of B_i from Host i, and B̂_i may become outdated if the changes to B_i are not propagated instantaneously. While the solution to the above information-sharing problem can be implemented somewhat differently, giving rise to a number of solution variants (Hua et al., 2008; Ledlie et al., 2002; Zhu et al., 2004), the analysis of false rates presented in this chapter can be easily applied to these variants.
The detailed procedures of the operations of insertion, deletion and query of data objects are shown
in Figure 4. When an object x is deleted from or inserted into Host i, the values of the counting filters
Ci[H(x)] and bits Bi[H(x)] are adjusted accordingly. When the fraction of modified bits in Bi exceeds
some threshold, B_i is broadcast to all the other hosts to update B̂_i. To look up x, Host i performs the membership tests on all the BFs kept locally. If a test on B_i is positive, then x can potentially be accessed locally. If a test in the filter B̂_j for any j ≠ i is positive, then x is conjectured to be on Host j with high probability. Finally, if none of the tests is positive, x is considered nonexistent in the system.


Figure 4. Procedures of adding, deleting and querying object x at host i
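A sketch of the per-host procedures summarized in Figure 4, reusing the CountingBloomFilter from the earlier sketch; the Host class, the send hook and the threshold value are illustrative, not the chapter's code:

class Host:
    """Sketch of the per-host procedures in Figure 4 (illustrative names)."""

    def __init__(self, host_id, m, k, send=lambda hid, bits: None, update_threshold=0.05):
        self.id = host_id
        self.local = CountingBloomFilter(m, k)   # B_i and C_i for objects stored here
        self.replicas = {}                       # host id -> replicated bit vector
        self.dirty = set()                       # positions changed since the last broadcast
        self.send = send                         # network hook, a no-op in this sketch
        self.update_threshold = update_threshold

    def insert(self, x):
        self.local.add(x)
        self.dirty.update(self.local._positions(x))
        self._maybe_broadcast()

    def delete(self, x):
        self.local.remove(x)
        self.dirty.update(self.local._positions(x))
        self._maybe_broadcast()

    def _maybe_broadcast(self):
        # Propagate B_i to the other hosts once enough bits have changed.
        if len(self.dirty) / self.local.m > self.update_threshold:
            self.send(self.id, list(self.local.bits))
            self.dirty.clear()

    def lookup(self, x):
        if x in self.local:
            return self.id                        # possibly stored locally
        for j, bits in self.replicas.items():
            if all(bits[p] for p in self.local._positions(x)):
                return j                          # conjectured to be on Host j
        return None                               # considered nonexistent in the system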

In the following, we begin the analysis by examining the false negative and false positive rate of a
single BF replica and then present the analysis of the overall false rates of all BFs kept locally on a host.
The experimental validations of the analytical models are presented in the next section.

False Rates of Bloom Filter Replicas


Let B be a BF with m bits and B̂ a replica of B. Let n and n̂ be the numbers of objects in the sets represented by B and by B̂, respectively. We denote by Δ₁ (Δ₀) the set of all one (zero) bits in B that are different from (i.e., the complement of) the corresponding bits in B̂. More specifically,

Δ₁ = {B[i] | B[i] = 1, B̂[i] = 0, i ∈ [1, m]},
Δ₀ = {B[i] | B[i] = 0, B̂[i] = 1, i ∈ [1, m]}.

Figure 5. An example of a BF B and its replica B̂ where bits are reordered such that bits in Δ₁ and Δ₀ are placed together.


Thus, 0 + 1 represent the set of changed bits in B that have not been propagated to B . The number
of bits in this set is affected by the update threshold and update latency. Furthermore, if a nonempty 1
is hit by least one hash function of a membership test on B while all other hash functions of the same
test hit bits in B - D1 - D0 with a value of one, then a false negative occurs in B . Similarly, a false
positive occurs if the nonempty 1 is replaced by a nonempty 0 in the exact membership test scenario
on a B described above.
Lemma 1. Suppose that the numbers of bits in Δ₁ and in Δ₀ are mδ₁ and mδ₀, respectively. Then n̂ is a random variable following a normal distribution with an extremely small variance (i.e., extremely highly concentrated around its mean), that is,

E(n̂) = −(m/k) ln(e^{−kn/m} + δ₁ − δ₀).    (6)

Proof: In a given BF representing a set of n objects, each bit is zero with probability P0(n), given in
Equation 2, or one with probability P₁(n) = 1 − P₀(n). Thus the average fractions of zero and one bits
are P0(n) and P1(n), respectively. Ref. (Michael, 2002) shows formally that the fractions of zero and one
bits are random variables that are highly concentrated on P0(n) and P1(n) respectively.

Figure 5 shows an example of B and B̂ where the bits in Δ₁ and Δ₀ are extracted out and placed together. The expected numbers of zero bits in B − Δ₁ − Δ₀ and in B̂ − Δ₁ − Δ₀ should be equal, since the bits in them are always identical for any given B and B̂. Thus, for any given n, Δ₁ and Δ₀, we have


Figure 6. Expected false negative rate of a Bloom filter replica when the configuration of its original
Bloom filter is optimal.

P₀(n) − δ₀ = E(P₀(n̂)) − δ₁.    (7)

Substituting Equation 2 into the above equation, we have

e^{−kn/m} − δ₀ = e^{−kE(n̂)/m} − δ₁.    (8)

After solving Equation 8, we obtain Equation 6.


Pragmatically, in any given BF with n objects, the values of 1 and 0, which represent the probabilities of a bit falling in 1 and 0 respectively, are relatively small. Theoretically, the number of bits
in 1 is less than the total number of one bits in B, thus we have 1 1 ekn/m. In a similar way, we can
conclude that 0 ekn/m.

Theorem 3. False Negative Rate

The expected false negative rate f⁻ in the BF replica B̂ is P₁(n)^k − (P₁(n) − δ₁)^k, where P₁(n) = 1 − e^{−kn/m}.
Proof: As mentioned earlier, a false negative in B̂ occurs when at least one hash function hits the bits in Δ₁ in B while the others hit the bits in B − Δ₁ − Δ₀ with a value of one. Hence, the false negative rate is


Figure 7. Expected false positive rate of a Bloom filter replica when the configuration of its original
Bloom filter is optimal.

f⁻ = Σ_{i=1}^{k} C(k, i) δ₁^i (P₁(n̂) − δ₀)^{k−i} = (P₁(n̂) − δ₀ + δ₁)^k − (P₁(n̂) − δ₀)^k.

Since P₀(n) = 1 − P₁(n) and P₀(n̂) = 1 − P₁(n̂), Equation 7 can be rewritten as:

E(P₁(n̂)) = P₁(n) + δ₀ − δ₁.    (9)

Hence

E(f⁻) = (E(P₁(n̂)) − δ₀ + δ₁)^k − (E(P₁(n̂)) − δ₀)^k = P₁(n)^k − (P₁(n) − δ₁)^k.    (10)

Figure 6 shows the expected false negative rate when the false positive rate of the original BF is minimized. The minimal false positive rate is 0.0214, 0.0031 and 0.00046 when the bit-element ratio is 8, 12 and 16,


respectively. Figure 6 shows that the false negative rates of a BF replica are more than 50% of the false positive rate of the original BF when δ₁ is 5%, and more than 75% when δ₁ is 10%. This shows that the false negative rate may be significant and should not be neglected in distributed applications.

Theorem 4. False Positive Rate

The expected false positive rate f⁺ for the Bloom filter replica B̂ is (P₁(n) + δ₀ − δ₁)^k, where P₁(n) = 1 − e^{−kn/m}.
Proof: If B̂ confirms positively the membership of an object while this object actually does not belong to the set represented by B, then a false positive occurs. More specifically, a false positive occurs in B̂ if, for an x not in the set represented by B̂, all bits hit by the hash functions of the membership test for x are ones in B − Δ₁ − Δ₀, or, for any x ∈ U, all hit bits are ones in B̂ but at least one hit bit is in Δ₀. Thus, we find that

f⁺ = (1 − n̂/|U|)(P₁(n̂) − δ₀)^k + Σ_{i=1}^{k} C(k, i) δ₀^i (P₁(n̂) − δ₀)^{k−i} = P₁(n̂)^k − (n̂/|U|)(P₁(n̂) − δ₀)^k.    (11)

Considering n̂ ≪ |U| and Equation 9, we have

E(f⁺) = (E(P₁(n̂)))^k − (n̂/|U|)(E(P₁(n̂)) − δ₀)^k = (P₁(n) + δ₀ − δ₁)^k − (n̂/|U|)(P₁(n) − δ₁)^k ≈ (P₁(n) + δ₀ − δ₁)^k.    (12)

Overall False Rates


In the distributed system considered in this study, there are a total of ℓ hosts and each host has ℓ BFs, with ℓ − 1 of them replicated from the other hosts. To look up an object, a host performs the membership tests in all the ℓ BFs kept locally. This section analyzes the overall false rates on each BF replica and on each host.
Given any BF replica B̂, the events of a false positive and a false negative are exclusive. Thus it is easy to find that the overall false rate of B̂ is

E(f_overall) = E(f⁻) + E(f⁺),    (13)

where E(f⁻) and E(f⁺) are given in Equations 10 and 12, respectively.
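A small helper that evaluates the closed-form rates from Theorems 3 and 4 and Equation 13 (a sketch; the parameter names mirror the symbols in the text):

import math

def replica_false_rates(m, n, k, delta1, delta0):
    """Expected false negative/positive/overall rates of a stale BF replica.

    delta1, delta0: fractions of bits of B that are 1 (resp. 0) but whose
    changes have not yet reached the replica (the sets Delta_1 and Delta_0).
    """
    p1 = 1.0 - math.exp(-k * n / m)                 # fraction of one bits in B
    f_neg = p1 ** k - (p1 - delta1) ** k            # Equation 10
    f_pos = (p1 + delta0 - delta1) ** k             # Equation 12 (n_hat << |U|)
    return f_neg, f_pos, f_neg + f_pos              # Equation 13

print(replica_false_rates(m=1200, n=150, k=6, delta1=0.05, delta0=0.0))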


Figure 8. Comparisons of estimated and experimental f⁻ of B̂ when k is 6, 8 and 11, respectively. The initial object number in both B and B̂ is 25, 75, 150 and 300 (m = 1200).


Table 1. False positive rate comparisons when k is 6 and 8, respectively (m = 1200).

n     δ₀       δ₁       Estimated f⁺ (%)   Experimental f⁺ (%)
k = 6:
25    0.0942   0.2042    0.0002    –
25    0.0800   0.3650    0.0002    –
25    0.0600   0.4875    0.0001    –
75    0.0800   0.1608    0.0934    0.1090
75    0.0600   0.2833    0.0794    0.1090
75    0.0483   0.3758    0.0799    0.1090
150   0.0533   0.1042    2.2749    2.6510
150   0.0400   0.1800    2.3540    2.6510
150   0.0325   0.2508    2.1872    2.6530
300   0.0250   0.0417   23.6555   25.4790
300   0.0183   0.0692   25.4016   25.4710
300   0.0117   0.1000   24.7241   25.4750
k = 8:
25    0.1083   0.2425    0.00002   –
25    0.0792   0.4192    0.00002   –
25    0.0550   0.5425    0.00002   –
75    0.0792   0.1767    0.0525    0.0540
75    0.0550   0.3000    0.0504    0.0540
75    0.0425   0.3917    0.0506    0.0540
150   0.0475   0.1050    2.5163    2.5770
150   0.0350   0.1758    2.6783    2.5780
150   0.0283   0.2367    2.5384    2.5790
300   0.0192   0.0333   33.2078   33.2580
300   0.0133   0.0558   34.4915   33.2550
300   0.0083   0.0817   32.1779   33.2550

On Host i, the BF B_i represents all the objects stored locally. While only false positives occur in B_i, both false positives and false negatives can occur in the replicas B̂_j for any j ≠ i. Since a failed membership test in any BF leads to a lookup failure, the overall false positive and false negative rates on Host i are therefore

E(f⁺_host) = 1 − (1 − f_i⁺) ∏_{j=1, j≠i}^{ℓ} (1 − f̂_j⁺)    (14)


Figure 9. Comparisons of estimated and experimental f_overall in a distributed system with 5 hosts when k is 6, 8, and 11, respectively. The initial object number n on each host is 25, 75, 150 and 300, respectively. Then each host adds a set of new objects. The number of new objects on each host increases from 50 to 300 with a step size of 50. (m = 1200)


Table 2. Overall false rate comparisons under the optimum initial operation state when k is 6 and 8, respectively. 100 new objects are added on each host and then a set of existing objects is deleted from each host. The number of deleted objects increases from 10 to 100 with a step size of 10 (m = 1200). In the first group, initially n = 150 and m/n = 8; in the second group, initially n = 100 and m/n = 12.

δ₀       δ₁       Estimated f_overall (%)   Experimental f_overall (%)
Group 1 (n = 150, m/n = 8):
0.0100   0.1705   46.2259   45.2200
0.0227   0.1657   42.4850   40.6880
0.0347   0.1627   38.7101   37.2420
0.0458   0.1582   34.9268   33.8460
0.0593   0.1545   31.3748   30.4540
0.0715   0.1497   27.8831   27.3700
0.0837   0.1445   24.5657   24.8000
0.0938   0.1392   21.2719   22.5560
0.1045   0.1340   18.2490   20.4520
0.1165   0.1300   15.5103   18.7540
Group 2 (n = 100, m/n = 12):
0.0123   0.2375   30.9531   29.6280
0.0255   0.2275   25.7946   23.6280
0.0413   0.2180   21.0943   18.0000
0.0552   0.2123   16.7982   14.6720
0.0658   0.2043   12.9800   12.0040
0.0772   0.1965    9.7307    9.7320
0.0920   0.1900    7.1016    7.7520
0.1075   0.1848    4.9936    6.1280
0.1237   0.1788    3.4031    4.8400
0.1377   0.1732    2.2034    3.8160

and

E(f⁻_host) = 1 − ∏_{j=1, j≠i}^{ℓ} (1 − f̂_j⁻),    (15)

where f_i⁺, f̂_j⁻ and f̂_j⁺ are given in Theorems 2, 3 and 4, respectively.
The probability that Host i fails a membership lookup can be expressed as follows:

E(f_host) = E(f⁺_host + f⁻_host − f⁺_host · f⁻_host).    (16)

In practice, we can use the overall false rate of a BF replica to trigger the updating process and use the overall false rate of all BFs on a host to evaluate the whole system. In a typical distributed environment with many nodes, the updating of a Bloom filter replica B̂_i stored on node j can be triggered by either the home node i or the node j. Since many nodes hold a replica of B_i, it is more efficient to let the home node i initiate the updating process of all replicas of B_i. Otherwise, the procedure of checking whether an update is needed would be performed by all other nodes, wasting both network and CPU resources. Accordingly, we use only the overall false rate of a BF replica, E(f_overall), as the updating criterion. On the other hand, E(f_host) can be used to evaluate the overall efficiency of all BFs stored on the same host.
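A sketch of using the expected overall false rate of a replica as the update trigger at the home node (the formula comes from Equations 10, 12 and 13; the threshold value and function name are illustrative):

import math

def should_broadcast(m, n, k, delta1, delta0, target_false_rate=0.01):
    """Home node check: propagate B_i once the expected overall false rate
    of its stale replicas exceeds the target."""
    p1 = 1.0 - math.exp(-k * n / m)
    f_overall = (p1 ** k - (p1 - delta1) ** k) + (p1 + delta0 - delta1) ** k
    return f_overall > target_false_rate

# Example: with 5% of the bits newly set and not yet propagated,
# decide whether an update message is worth sending.
if should_broadcast(m=1200, n=150, k=6, delta1=0.05, delta0=0.0):
    print("broadcast updated Bloom filter to remote hosts")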

EXPERIMENTAL VALIDATION
This section validates our theoretical framework developed in this chapter by comparing the analytical
results produced by our models with experimental results obtained through real experiments.
We begin by examining a single BF replica. Initially the Bloom filter replica B̂ is exactly the same as B. Then we artificially change B by randomly inserting new objects into B or randomly deleting existing objects from B repeatedly. For each specific modification made to B, we calculate the corresponding δ₁ and δ₀ and use 100,000 randomly generated objects to test memberships against B̂. Since the actual objects represented in B are known in the experiments, the false negative and false positive rates can be easily measured.
Figure 8 compares analytical and real false negative rates, obtained from the theoretical models and from experiments respectively, by plotting the false negative rate in B̂ as a function of δ₁, a measure of the update threshold, for different numbers of hash functions (k = 6 and k = 8) when the initial number of objects in B is 25, 75, 150 and 300, respectively. Since the false negative rates are independent of
0, only object deletions are performed in B.
Table 1 compares the analytical and real false positive rates of B′ when k is 6 and 8 respectively. In these experiments, both object deletions and additions are performed in B′ while B remains unaltered. It is interesting that the false positive rate of B′ stays roughly constant for a specific n even though the represented set of objects changes during the experiments. It is true that if the number of objects in B increases or decreases, the false positive rate of B′ should decrease or increase accordingly before the changes of B are propagated to B′. However, because n is far smaller than the total number of objects in the universe U, the change in the false positive rate of B′ is too small to be perceptible. These tests are made consistent with the real scenarios of BF applications in distributed systems. In such applications, the number of possible objects is usually very large, and BFs are deployed to reduce network communication requirements efficiently. Hence, in these experiments the number of objects used to test B′ is much larger than the number of objects in B or B′ (100,000 random objects are tested). With such a large test sample, the influence of the modifications on the false positive rate of B′ is difficult to observe.
We also simulated the lookup problem in a distributed system with 5 hosts. Figure 9 shows comparisons of the analytical and experimental average overall false rates on each host. In these experiments, we only added new objects without deleting any existing items, so the dirty zero-bit fraction is kept at zero. The experiments presented in Table 2 consider both the deletion and addition of objects on each host when the initial state of the BF on each host is optimized, that is, when the number of hash functions is optimal for the ratio between m and the initial number of objects n. This specific setting aims to emulate real applications, where m/n and k are usually optimally or sub-optimally matched by dynamically adjusting the BF length m (Hong & Tao, 2003) or by designing the BF length according to the average number of objects (Ledlie et al., 2002; Li et al., 2000; Little et al., 2002; Matei & Ian, 2002; Zhu et al., 2004). All the analytical results are consistently and closely matched by their real (experimental) counterparts, strongly validating our theoretical models.

Figure 10. In an environment of two servers, the figures show the overall false rate on one server when the initial number of elements on that server is 25 and 150 respectively. The ratio of bits per element is 8 and 6 hash functions are used. The rates of element addition and deletion are 5 and 2 per time unit respectively on each server.


REPLICA UPDATE PROTOCOL


To reduce the false rate caused by staleness, the remote Bloom filter replica needs to be updated periodically. An update process is typically triggered if the percentage of dirty bits in a local BF exceeds some threshold. A small threshold causes heavy network traffic, while a large threshold increases the false rate, and this tradeoff is usually resolved by a trial-and-error approach that runs a large number of trials in real experiments or simulations. For example, the summary cache study (Li et al., 2000) recommends that if 10 percent of the bits in a BF are dirty, the BF propagates its changes to all replicas. However, this approach has the following disadvantages.
1. It cannot directly control the false rate. To keep the false rate under some target value, complicated simulations or experiments have to be conducted to adjust the threshold for dirty bits. If the target false rate changes, this tedious process has to be repeated to find a suitable threshold.
2. It treats all dirty bits equally and does not distinguish the zero-dirty bits from the one-dirty bits. In fact, as shown in previous sections, the dirty one bits and the dirty zero bits exert different impacts on the false rates.
3. It does not allow flexible update control. In many applications, the penalties of a false positive and a false negative are significantly different. For example, in summary cache (Li et al., 2000), a false positive occurs if a request is not a cache hit on some web proxy although the corresponding Bloom filter replica indicates that it is. The penalty of a false positive is a wasted query message to this web proxy. A false negative happens if a request could be hit in a local web proxy but the Bloom filter replica mistakenly indicates otherwise. The penalty of a false negative is a round-trip delay in retrieving the information from a remote web server through the Internet. Thus, the penalty of a false negative is much larger than that of a false positive. Updating protocols based on the percentage of dirty bits do not allow one to place more weight on the false negative rate, limiting the flexibility and efficiency of the updating process.

Based on the theoretical models presented in the previous sections, an updating protocol that directly controls the false rate is designed in this chapter. In a distributed system where each node has a local BF to represent all local elements, each node is responsible for automatically updating its BF replicas. Each node estimates the false rate of its remote BF replicas and, if the false rate exceeds some desired false rate (as opposed to a predetermined threshold on the percentage of dirty bits in conventional updating approaches), an updating process is triggered. To estimate the false rate of a remote BF replica B′, each node has to record the number of elements stored locally (n), in addition to a copy of the remote BF replica B′. This copy is essentially the local BF B at the time the last update was made. It is used to calculate the percentages of dirty one bits and dirty zero bits. Compared with conventional updating protocols based on the total percentage of dirty bits, this protocol only needs to record one more variable (n), so it does not significantly increase the maintenance overhead.
This protocol also allows more flexible updating policies that consider the penalty difference between a false positive and a false negative. The overall false rate can be a weighted sum of the false positive rate and the false negative rate, as follows:
E(foverall) = w+ E(f+) + w− E(f−)        (17)

where w+ and w− are the weights. The values of w+ and w− depend on the applications and also on the application environments.
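A node-side trigger based on Equation (17) might look like the sketch below. The functions estimate_fp and estimate_fn are hypothetical placeholders standing in for the closed-form estimators of the earlier theorems (not reproduced in this excerpt); only the bookkeeping — the element count n, the snapshot copy, the two dirty-bit fractions and the weighted threshold test — follows the protocol described above.

# Sketch of the update trigger (Equation (17)).  estimate_fp / estimate_fn are
# hypothetical placeholders for the chapter's closed-form estimators.

def dirty_fractions(snapshot_bits, current_bits):
    """Fractions of bits set since the snapshot and cleared since the snapshot
    (the snapshot is the state the remote replica still holds)."""
    m = len(snapshot_bits)
    newly_set = sum(1 for s, c in zip(snapshot_bits, current_bits) if s == 0 and c == 1) / m
    newly_cleared = sum(1 for s, c in zip(snapshot_bits, current_bits) if s == 1 and c == 0) / m
    return newly_set, newly_cleared

def should_update(snapshot_bits, current_bits, n, estimate_fp, estimate_fn,
                  target=0.10, w_pos=1.0, w_neg=1.0):
    """Trigger an update when the weighted estimated false rate of the stale
    replica exceeds the desired rate, rather than a fixed dirty-bit percentage."""
    set_frac, cleared_frac = dirty_fractions(snapshot_bits, current_bits)
    f_overall = (w_pos * estimate_fp(n, set_frac, cleared_frac)
                 + w_neg * estimate_fn(n, set_frac, cleared_frac))
    return f_overall > target

# Toy usage with constant placeholder estimators (illustration only).
print(should_update([0, 1, 1, 0], [1, 1, 0, 0], n=2,
                    estimate_fp=lambda n, s, c: 0.04,
                    estimate_fn=lambda n, s, c: 0.08))   # True: 0.12 > 0.10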
We demonstrate the effectiveness of this update protocol through event-driven simulations. In these simulations, we made the following assumptions.
1. Each item is accessed at random. This assumption may not be realistic for some real workloads, in which an item has a greater chance of being accessed again once it has been accessed. Though all previous theoretical studies on Bloom filters assume a workload with a uniform access spectrum, further studies are needed to investigate the impact of this assumption.
2. Each local node deletes or adds items at a constant rate. In practice, the deletion and addition rates change dynamically throughout the lifetime of an application. This simplifying assumption is employed just to prove our concept while keeping the experiments manageable in the absence of a real trace or benchmark.
3. The values of w+ and w− are 1. Their optimal values depend on the nature of the applications and environments.

We simulate a distributed system with two nodes where each node keeps a BF replica of the other. We assume the addition and deletion rates are 5 and 2 per time unit respectively and the desired false rate is 10%. Figure 10 shows the estimated false rate and the measured false rate of node 1 throughout the deletion, addition and updating processes. Due to space limitations, the false rate on node 2, which is similar to that on node 1, is not shown in this chapter. In addition, we have varied the addition and deletion rates; simulation results consistently indicate that our protocol is accurate and effective in controlling the false rate.

RELATED WORK
Standard Bloom filters (Bloom, 1970) have inspired many extensions and variants, such as the Counting Bloom filters (Li et al., 2000), compressed Bloom filters (Michael, 2002), the space-code Bloom
filters (Kumar et al., 2005), the spectral Bloom filters (Saar & Yossi, 2003), time-decaying Bloom filters
(Cheng et al., 2005), and the Bloom filter state machine (Bonomi et al., 2006). The counting Bloom
filters are used to support the deletion operation and handle a set that is changing over time (Li et al.,
2000). Time-decaying Bloom filters maintains the frequency count for each item stored in the Bloom
filters and the values of these frequency count decay with time (Cheng et al., 2005). Multi-Dimension
Dynamic Bloom Filters (MDDBF) supports representation and membership queries based on the multiattribute dimension (Guo et al., 2006). Its basic idea is to represent a dynamic set A with a dynamic s
m bit matrix, in which there are s standard Bloom filters and each Bloom filter has a length of m bits. A
novel Parallel Bloom Filters (PBF) and an additional hash table has been developed to maintain multiple
attributes of items and verify the dependency of multiple attributes, thereby significantly decreasing
false positives (Hua & Xiao, 2006).
Bloom filters have significant advantages in space saving and fast query operations and thus have been widely applied in many distributed applications, such as aiding longest prefix matching (Dharmapurikar et al., 2006) and packet classification (Baboescu & Varghese, 2005). The extended Bloom filter provides better throughput for hash-table-based router applications by using a small amount of multi-port on-chip memory (Song et al., 2005). Whenever space is a concern, a Bloom filter can be an excellent alternative to storing a complete explicit list.
In many distributed applications, BFs are often replicated to multiple hosts to support membership queries without contacting other hosts. However, these replicas might become stale because, in order to reduce the update overhead, changes to the BFs usually cannot be propagated instantly to all replicas. As a result, the BF replicas may return false negatives. This observation motivates the research presented in this chapter.

CONCLUSION
Although false negatives do not occur in a standard BF, this chapter shows that staleness in a BF replica can produce false negatives. We presented a theoretical analysis of the impact of the staleness found in many distributed BF applications on the false negative and false positive rates, and developed an adaptive update control mechanism that accurately and efficiently maintains a desired false rate for a given application. To the best of our knowledge, we are the first to derive accurate closed-form expressions that incorporate staleness into the analysis of the false negative and positive rates of a single BF replica, to develop analytical models of the overall false rates of the BF arrays widely used in many distributed systems, and to develop an adaptively controlled update process that accurately maintains a desired false rate for a given application. We have validated our analysis by conducting extensive experiments. The theoretical analysis presented here not only provides system designers with theoretical insights into the development and deployment of BFs in distributed systems, but is also useful in practice for accurately determining when to trigger updates of BF replicas in order to keep the false rates under some desired values or, equivalently, to minimize the frequency of updates and thus reduce the update overhead.

ACKNOWLEDGMENT
This work was partially supported by a faculty startup grant from the University of Maine and by National Science Foundation Research Grants (CCF #0621493, CCF #0754951, CNS #0723093, DRL #0737583, CNS #0619430, CCF #0621526).

REFERENCES
Baboescu, F., & Varghese, G. (2005). Scalable packet classification. IEEE/ACM Trans. Netw., 13(1), 2–14.
Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422–426. doi:10.1145/362686.362692


Bonomi, F., Mitzenmacher, M., Panigrah, R., Singh, S., & Varghese, G. (2006). Beyond bloom filters:
from approximate membership checks to approximate state machines. Paper presented at the Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer
communications.
Broder, A., & Mitzenmacher, M. (2003). Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4), 485–509.
Cheng, K., Xiang, L., Iwaihara, M., Xu, H., & Mohania, M. M. (2005). Time-Decaying Bloom Filters
for Data Streams with Skewed Distributions. Paper presented at the Proceedings of the 15th International
Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications.
Dharmapurikar, S., Krishnamurthy, P., & Taylor, D. E. (2006). Longest prefix matching using bloom filters. IEEE/ACM Trans. Netw., 14(2), 397–409.
Ghose, F., Grossklags, J., & Chuang, J. (2003). Resilient Data-Centric Storage in Wireless Ad-Hoc Sensor Networks. Proceedings of the 4th International Conference on Mobile Data Management (MDM'03), (pp. 45–62).
Guo, D., Wu, J., Chen, H., & Luo, X. (2006). Theory and Network Applications of Dynamic Bloom
Filters. Paper presented at the INFOCOM 2006. 25th IEEE International Conference on Computer
Communications.
Hailong, C., & Jun, W. (2004). Foreseer: a novel, locality-aware peer-to-peer system architecture for
keyword searches. Paper presented at the Proceedings of the 5th ACM/IFIP/USENIX International
Conference on Middleware.
Hong, T., & Tao, Y. (2003). An Efficient Data Location Protocol for Self-organizing Storage Clusters.
Paper presented at the Proceedings of the 2003 ACM/IEEE conference on Supercomputing.
Hua, Y., & Xiao, B. (2006). A Multi-attribute Data Structure with Parallel Bloom Filters for Network
Services. Proceedings of the 13th International Conference on High Performance Computing (HiPC), (pp.
277-288).
Hua, Y., Zhu, Y., Jiang, H., Feng, D., & Tian, L. (2008). Scalable and Adaptive Metadata Management
in Ultra Large-Scale File Systems. Proceedings of the 28th International Conference on Distributed
Computing Systems (ICDCS 2008).
James, K. M. (1983). A second look at bloom filters. Communications of the ACM, 26(8), 570–571. doi:10.1145/358161.358167
John, K., David, B., Yan, C., Steven, C., Patrick, E., & Dennis, G. (2000). OceanStore: an architecture for global-scale persistent storage. SIGPLAN Not., 35(11), 190–201. doi:10.1145/356989.357007
Kumar, A., Xu, J., & Zegura, E. W. (2005). Efficient and scalable query routing for unstructured peerto-peer networks. Paper presented at the Proceedings INFOCOM 2005, 24th Annual Joint Conference
of the IEEE Computer and Communications Societies.


Ledlie, J., Serban, L., & Toncheva, D. (2002). Scaling Filename Queries in a Large-Scale Distributed
File System. Harvard University, Cambridge, MA.
Lee, L. G. (1982). Designing a Bloom filter for differential file access. Communications of the ACM, 25(9), 600–604. doi:10.1145/358628.358632
Li, F., Pei, C., Jussara, A., & Andrei, Z. B. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3), 281–293.
Little, M. C., Shrivastava, S. K., & Speirs, N. A. (2002). … The Computer Journal, 45(6), 645–652. doi:10.1093/comjnl/45.6.645
Luk, M., Mezzour, G., Perrig, A., & Gligor, V. (2007). MiniSec: A Secure Sensor Network Communication Architecture. Proceedings of IEEE International Conference on Information Processing in Sensor
Networks (IPSN), (pp. 479-488).
Matei, R., & Ian, F. (2002). A Decentralized, Adaptive Replica Location Mechanism. Paper presented at the
Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing.
McIlroy, M. (1982). Development of a Spelling List. IEEE Transactions on Communications, 30(1), 91–99.
Michael, M. (2002). Compressed bloom filters. IEEE/ACM Trans. Netw., 10(5), 604–612.
Mohan, A., & Kalogeraki, V. (2003). Speculative routing and update propagation: a kundali centric
approach. Paper presented at the IEEE International Conference on Communications, 2003.
Pai-Hsiang, H. (2001). Geographical region summary service for geographical routing. Paper presented
at the Proceedings of the 2nd ACM international symposium on Mobile ad hoc networking and computing.
Rhea, S. C., & Kubiatowicz, J. (2002). Probabilistic location and routing. Paper presented at the IEEE
INFOCOM 2002, Twenty-First Annual Joint Conference of the IEEE Computer and Communications
Societies Proceedings.
Saar, C., & Yossi, M. (2003). Spectral bloom filters. Paper presented at the Proceedings of the 2003
ACM SIGMOD international conference on Management of data.
Song, H., Dharmapurikar, S., Turner, J., & Lockwood, J. (2005). Fast hash table lookup using extended
bloom filter: an aid to network processing. Paper presented at the Proceedings of the 2005 conference
on Applications, technologies, architectures, and protocols for computer communications.
Zhu, Y., & Jiang, H. (2006). False Rate Analysis of Bloom Filter Replicas in Distributed Systems. Paper
presented at the Proceedings of the 2006 International Conference on Parallel Processing.
Zhu, Y., Jiang, H., & Wang, J. (2004). Hierarchical Bloom filter arrays (HBA): a novel, scalable metadata management system for large cluster-based storage. Paper presented at the Proceedings of the 2004
IEEE International Conference on Cluster Computing.


Zhu, Y., Jiang, H., Wang, J., & Xian, F. (2008). HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems. IEEE Transactions on Parallel and Distributed Systems, 19(6), 750–763. doi:10.1109/TPDS.2007.70788

KEY TERMS AND DEFINITIONS


Bloom Filter: A Bloom filter is a space-efficient data structure that supports membership queries. It consists of a bit array in which all bits are initially set to 0, and it uses a fixed number of predefined independent hash functions. For each element, all hashed bits are set to 1. To check whether an element belongs to the set represented by a Bloom filter, one simply checks whether all bits pointed to by the hash functions are 1. If not, the element is not in the set; if yes, the element is considered a member.
Bloom Filter Array: A Bloom filter array, consisting of multiple Bloom filters, represents multiple sets. It is a space-efficient data structure for evaluating whether an element is within these sets and, if so, which set it belongs to.
Bloom Filter Replica: A Bloom filter replica is a copy of a Bloom filter. In a distributed environment, the original and replicated Bloom filters are typically stored on different servers for improved performance and fault tolerance. A Bloom filter replica can generate both false positives and false negatives.
Bloom Filter Update Protocol: When the set that a Bloom filter represents changes over time, the corresponding Bloom filter replica becomes outdated. In order to reduce the probability that the Bloom filter replica reports memberships incorrectly, the replica needs to be updated frequently. The Bloom filter update protocol determines when a Bloom filter replica needs to be updated.
Distributed Membership Query: A membership query is a fundamental function that reports where the target data, resource, or service is located. The membership query can be performed by a centralized server or by a group of distributed servers. The latter approach has better scalability and is referred to as a distributed membership query.
False Negative: A false negative happens when an element is a member of the set that a Bloom filter represents but the Bloom filter mistakenly reports that it is not. A standard Bloom filter has no false negatives. However, in a distributed system, a Bloom filter replica can generate false negatives when the replica is not updated in a timely manner.
False Positive: A false positive happens when an element is not a member of the set that a Bloom filter represents but the Bloom filter mistakenly reports that it is. The probability of false positives can be very low when the Bloom filter is appropriately designed.


Chapter 35

Image Partitioning on
Spiral Architecture
Qiang Wu
University of Technology, Australia
Xiangjian He
University of Technology, Australia

ABSTRACT
Spiral Architecture is a relatively new and powerful approach to image processing. It contains very useful
geometric and algebraic properties. Based on the abundant research achievements in the past decades,
it is shown that Spiral Architecture will play an increasingly important role in image processing and
computer vision. This chapter presents a significant application of Spiral Architecture for distributed
image processing. It demonstrates the impressive characteristics of spiral architecture for high performance image processing. The proposed method tackles several challenging practical problems during
the implementation. The proposed method reduces the data communication between the processing nodes
and is configurable. Moreover, the proposed partitioning scheme has a consistent approach: after image partitioning, each sub-image should be a representative of the original one without changing the basic object features, which is important to the related image-processing operations.

INTRODUCTION
Image processing is a traditional area in computing science which has been used widely in many applications including the film industry, medical imaging, industrial manufacturing, weather forecasting
etc. With the development of new algorithms and the rapid growth of application areas, a key issue has emerged and attracted more and more research in digital image processing: the dramatically increasing computation workload. The reasons can be classified into
three groups: relatively low-power computing platform, huge image data to be processed and the nature
of image-processing algorithms.
DOI: 10.4018/978-1-60566-661-7.ch035


Inefficient computing is a relative concept. Over the last decade the microcomputer has become powerful enough to make inexpensive personal image processing practically feasible for the individual researcher (Miller, 1993; Schowengerdt & Mehldau, 1993). In recent years, although such
systems still functionally satisfy the requirements of most general purpose image-processing needs, the
limited computing capacity in a standalone processing node has become inadequate to keep up with the
faster growth of image-processing applications in such practical areas as real-time image processing
and 3D image rendering.
The huge amount of image data is another issue which has been faced by many image-processing
applications today. Many applications such as computer graphics, rendering photo realistic images and
computer-animated films consume the aggregate power of whole farms of workstations (Oberhuber,
Although the common understanding of what counts as large image data has changed over time, sizes are typically expressed in megabytes or gigabytes from the application point of view (Goller, 1999). Over the past few decades, the images to be processed have become larger and larger. Consequently, how to decrease the processing time despite the growth of image data has become an urgent issue in digital image processing.
Moreover, the nature of the traditional image-processing algorithms is another issue which reduces
the processing speed. In digital image processing, the elementary image operators can be classified into point image operators, local image operators and global image operators (Braunl, Feyrer, Rapf, & Reinhardt, 2001). The main characteristic of a point operator is that a pixel in the output image depends
only on the corresponding pixel in the input image. Point operators are used to copy an image from one
memory location to another, in arithmetic and logic operations, table lookup and image composition
(Nicolescu & Jonker, 2002). Local operators create a destination pixel based on criteria that depend on
the source pixel and the values of the pixels in some neighbourhood surrounding it. They are used
widely in low-level image processing such as image enhancement by sharpening, blurring and noise
removal. Global operators create a destination pixel based on the entire image information. A representative example of an operator within this class is the Discrete Fourier Transform (DFT). Compared with
point operators, local operators and global operators are more computationally intensive.
As a consequence of the above, image processing-related tasks involve the execution of a large number
of operations on large sets of structured data. The processing power of the typical desktop workstation
can therefore become a severe bottleneck in many image-processing applications. Thus, it may make
sense to perform image processing on multiple workstations or on a parallel processing system. Actually,
many image-processing tasks exhibit a high degree of data locality and parallelism, and map quite readily to specialized massively parallel computing hardware (Chen, Lee, & Cho, 1990; Siegel, Armstrong,
& Watson, 1992; Stevenson, Adams, Jamieson, & Delp, April, 1993).
For several image-processing applications, a number of existing programs have been optimized for
execution on parallel computer architectures (Pitas, 1993). The parallel approach, as an alternative to
replace the original sequential processing, promises many benefits to the development of image processing. Gloller and Leberl (Goller & Leberl, 2000) implemented shape-from-shading, stereo matching,
re-sampling, gridding and visualization of terrain models, which are all compute-intensive algorithms in
radar image processing, in such a manner that they execute either on a parallel machine or on a cluster
of workstations which connects many computing nodes together via a local area network. Other typical applications of image processing on parallel computing platform can be seen in the field of remote
image processing such as the 3D object mediator (Kok, Pabst, & Afsarmanseh, April, 1997), internet-based distributed 3D rendering and animation (Lee, Lee, Lu, & Chen, 1997), remote image-processing


systems design using an IBM PC as front-end and a transputer network as back-end (Wu & Guan, 1995),
telemedicine projects (Marsh, 1997), satellite image processing (Hawick et al., 1997) and the general
approach for remote execution of software (Niederl & Goller, Jan, 1998).
Many parallel image-processing tasks map quite readily to specialized massively parallel computing
hardware. However, specific parallel machines require a significant investment but may only be needed
for a short time. Accessing such systems is difficult and requires in-depth knowledge of the particular
system. Alternatively, the users must turn to supercomputers, which may be unacceptable for many customers. These three aspects are the main reasons why parallel computing has not been widely adopted
for computer vision and image processing.
Clusters of workstations have been proposed as a cheap alternative to parallel machines. Driven by
advances in network technology, cluster management systems are becoming a viable and economical
parallel computing platform for the implementation of parallel processing algorithms. Moreover, the
utilization of workstation clusters can yield many potential benefits, such as performance and reliability.
It can be expected that workstation clusters can take over computing intensive tasks from supercomputers.
Offsetting the many advantages mentioned above, the main disadvantages of clusters of workstations
are high communication latency and irregular load patterns on the computing nodes. The system performance mainly depends on the amount and structure of communication between processing nodes. Thus,
many coarse-grained parallel algorithms perform well, while fine-grained data decomposition methods
like the ones in the Parallel Image-Processing Toolkit (PIPT) (Squyres, Lumsdaine, & Stevenson, 1995)
require such a high communication bandwidth that execution on the cluster may even be slower than
on a single workstation. Moreover, the coexistence of parallel and sequential jobs that is typical when
interactive users work on the cluster makes scheduling and mapping a hard problem (Arpaci et al., May,
1995).
Thus, taking care of the intercommunication required for processing is an important issue for distributed
processing. For instance, if a particular processor is processing a set of rows, it needs information about
the rows above and below its first and last rows, when row partitioning is effected (Bharadwaj, Li., &
Ko, 2000; Siegel, Siegel, & Feather, 1982). The additional information must be exchanged between the
corresponding nodes. It can be done by two approaches in general. In the first approach, explicit communication is built up on-demand between the processors (Siegel et al., 1992) and is carried out concurrently with the main processes. In another approach, the required data is over-supplied to the respective
processor at the distribution phase (Siegel et al., 1982). In many cases, the second approach is a natural
choice for the architecture that is considered, although it introduces additional data transfer.
The facts revealed above point to a key problem related to information partitioning. In terms of image-processing applications, it is defined as the problem of image data partitioning. Most image partitioning techniques can be classified into two groups: fine-grained decomposition and coarse-grained
decomposition (Squyres et al., 1995). A fine-grained decomposition-based image-processing operation
will assign an output pixel per processor and assign the required windowed data for each output pixel
to the processor. Thus, each processor will perform the necessary processing for its output pixel. A
coarse-grained decomposition will assign large contiguous regions of the output image to each of a
small number of processors. Each processor will perform the appropriate window based operations to
its own region of the image. Appropriate overlapping regions of the image will be assigned in order to
properly accommodate the processing at the image boundaries.
There are some difficulties as a consequence of the general data partitioning. The first one is the extra


communication required between the processors, which has been mentioned above. This is inevitable
when a processor participating in the parallel computation needs some additional information pertaining
to the data residing in other processors (Bertsekas & Tsitsiklis, 1989; Siegel et al., 1992) for processing
its share of the data. Another difficulty is that the number of processors available and the size of the input
image may vary in the different applications, so the sizes of sub-images for distribution and the number
of processors for a specific operation cannot be arbitrarily determined in the early stages.
This chapter presents a highly efficient image partitioning method which is based on a special image
architecture, Spiral Architecture. Using Spiral Architecture on a cluster of workstations, a new uniform
image partitioning scheme is derived in order to reduce many overhead components that otherwise
penalize time performance. With such a scheme, uniform sub-images can be produced which are near copies, rather than different portions, of the original image. Each sub-image can then be processed by a different processing node individually and independently. Moreover, this image-partitioning method offers a way to deal with many traditional image-processing tasks simultaneously: because each partitioned sub-image contains the main features of the original image, i.e. is a representation of the original image, different tasks can execute on different processing nodes in parallel without interfering with each other.
This method is a closed-form solution. In each application, the number of partitions can be decided
based on the practical requirements and the practical system conditions. A formula is derived to build
the relation between the number of partitions and the multiplier in Spiral Multiplication which is used
to achieve image partitioning.
The organization of this chapter is as follows. Spiral Architecture and its special mathematical operations are introduced in the Related Work section, which is followed by a detailed explanation of image partitioning on Spiral Architecture. In that section, several problems and their solutions regarding the implementation of image segmentation on the new architecture are discussed. Finally, the experimental results and the conclusion are presented.

RELATED WORK
Spiral Architecture
Traditionally, almost all image processing and image analysis is based on the rectangular architecture,
which is a collection of rectangular pixels in the column-row arrangement. However, rectangular
architecture is not historically the only one used in image-processing research. Another architecture
used often is the Spiral Architecture. Spiral Architecture is inspired by anatomical considerations of
primate vision (Schwartz, 1980). The cones on the retina possess the hexagonal distribution feature as
shown in Figure 1.
The cones, with the shape of hexagons, are arranged in a spiral cluster. Each unit is a set of seven
hexagons (Sheridan, Hintz, & Alexander, 2000). That is, each pixel has six neighbouring pixels. This
arrangement is different from the 3×3 rectangular vision unit in Rectangular Architecture, where each pixel has eight neighbouring pixels. A collection of hexagonal pixels represented using Spiral Architecture is shown in Figure 2. The origin is normally located at the centre of the Spiral Architecture. In
Spiral Architecture any pixel has only six neighbour pixels which have the same distance to the central
hexagon of the seven-hexagon unit of vision. From research on the geometry of the cones in the pri-


Figure 1. Distribution of Cones on the Retina (from (He, 1998))

Figure 2. A collection of hexagonal cells


Figure 3. A labelled cluster of seven hexagons

mates' retina, it can be concluded that the cone distribution is distinguished by its potentially powerful computational abilities.

Spiral Addressing
It is obvious that the hexagonal pixels on Figure 2 cannot be labelled in column-row order as in rectangular architecture. Instead of labelling the pixel with a pair of numbers (x, y), each pixel is labelled
with a unique number.
Addressing proceeds in a recursive manner. Initially, a collection of seven hexagons are labelled as
shown in Figure 3. Such a cluster of seven hexagons dilates so that six more clusters of seven hexagons
are placed around the original cluster. The addresses of the centres of the additional six clusters are
obtained by multiplying the adjacent address in Figure 3 by 10 (See Figure 4.)
In each new cluster, the other pixels are labelled consecutively from the centre as shown in Figure
3. Dilation can then repeat to grow the architecture in powers of seven with unique assigned addresses.
The hexagons thus tile the plane in a recursive modular manner along a spiral direction (Alexander,
Figure 4. Dilation of the cluster of seven hexagons


1995). It eventuates that a spiral address is in fact a base-seven number. A cluster of size 7^3 with the corresponding addresses is shown in Figure 5.

Mathematical Operations on Spiral Architecture


Spiral Architecture contains very useful geometric and algebraic properties, which can be interpreted
in terms of a mathematical object, the Euclidean ring. Two algebraic operations have been defined on
Spiral Architecture: Spiral Addition and Spiral Multiplication. The neighbouring relation among the
pixels on Spiral Architecture can be expressed uniquely by these two operations. Spiral Addition and
Spiral Multiplication will be used together to achieve uniform and complete image partitioning which
is very important to distributed image processing.

Figure 5. Hexagons with labelled addresses on Spiral Architecture (He, 1998)


Spiral Addition
Spiral Addition is an arithmetic operation with closure properties defined on the spiral address space
so that the result of Spiral Addition will be an address in the same finite set on which the operation is
performed (Sheridan, 1996). In addition, Spiral Addition incorporates a special form of modularity.
To develop Spiral Addition, a scalar form of Spiral Addition is defined first as shown in Table 1.
A procedure for Spiral Addition based on the Spiral Counting principle (Sheridan, 1996) is defined.
For the convenience of our explanation, a common naming convention is followed. Any number X = (Xn Xn−1 ... X1), Xi ∈ {0, 1, ..., 6}, where Xi is a digit of X. Let a = (an an−1 ... a1) and b = (bn bn−1 ... b1) be two spiral addresses. Then the result of their Spiral Addition is worked out as follows.
1. scale = 1; result = 0;
2. OP1 = (OP1n OP1n−1 ... OP11) = (an an−1 ... a1); OP2 = (OP2n OP2n−1 ... OP21) = (bn bn−1 ... b1);
3. C = OP1 + OP21 = (Cn Cn−1 ... C1) (Spiral Addition). Here the carry rule is applied; for Spiral Addition between two single-digit addresses, it follows the rules shown in Table 1;
4. result = result + scale × C1; scale = scale × 10 (here, + and × mean normal mathematical addition and multiplication respectively);
5. CA = OP1; CB = OP2;
6. OP1 = (CBn CBn−1 ... CB2); OP2 = (Cn Cn−1 ... C2);
7. Repeatedly apply steps 3 through 6 until OP1 = 0;
8. result = result + scale × OP2 (here, + and × mean normal mathematical addition and multiplication respectively);
9. Return result.

For example, for the Spiral Addition 26 + 14, the procedure is shown below. In the demonstration, a label a.b, such as 3.2, means step a applied for the b-th time:
1. scale = 1; result = 0;
2. OP1 = (2 6); OP2 = (1 4);

Table 1. Scalar Spiral Addition* (Sheridan, 1996)

+  |  0    1    2    3    4    5    6
0  |  0    1    2    3    4    5    6
1  |  1    63   15   2    0    6    64
2  |  2    15   14   26   3    0    1
3  |  3    2    26   25   31   4    0
4  |  4    0    3    31   36   42   5
5  |  5    6    0    4    42   41   53
6  |  6    64   1    0    5    53   52

* Bold type shows the scalar spiral address; normal type shows the results of Spiral Addition between the corresponding spiral addresses in the first row and the first column respectively.


3.1  C = 26 + 4 = (2 5);
4.1  result = 0 + 1 × 5 = 5; scale = 1 × 10 = 10;
5.1  CA = 26; CB = 14;
6.1  OP1 = 1; OP2 = 2;
3.2  C = 1 + 2 = (1 5);
4.2  result = 5 + 10 × 5 = 55; scale = 10 × 10 = 100;
5.2  CA = 1; CB = 2;
6.2  OP1 = 0; OP2 = 1;
7.   OP1 = 0;
8.   result = 55 + 100 × 1 = 155;
9.   Return 155.
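As a programming aid, the sketch below transcribes the scalar table and the nine-step procedure into Python. It is a best-effort reading rather than a reference implementation: addresses are handled as little-endian digit lists, the digit-shifting in step 6 is inferred from the worked example, and the modulus operation is omitted.

# Sketch of Spiral Addition following the chapter's procedure (modulus omitted).
# Addresses are handled as little-endian lists of base-seven digits.

# Scalar Spiral Addition (Table 1): ADD[a][b] is a one- or two-digit result.
ADD = [
    [0,  1,  2,  3,  4,  5,  6],
    [1, 63, 15,  2,  0,  6, 64],
    [2, 15, 14, 26,  3,  0,  1],
    [3,  2, 26, 25, 31,  4,  0],
    [4,  0,  3, 31, 36, 42,  5],
    [5,  6,  0,  4, 42, 41, 53],
    [6, 64,  1,  0,  5, 53, 52],
]

def digits(addr):                       # "26" -> [6, 2]
    return [int(c) for c in reversed(str(addr))] if addr else [0]

def value(ds):                          # [6, 2] -> 26
    return int("".join(str(d) for d in reversed(ds))) if ds else 0

def add_digit(ds, d):
    """Add a single scalar digit to a multi-digit address, propagating carries."""
    out, carry, i = list(ds), d, 0
    while carry:
        if i == len(out):
            out.append(carry)
            break
        s = ADD[out[i]][carry]
        out[i], carry = s % 10, s // 10
        i += 1
    return out

def spiral_add(a, b):
    """Multi-digit Spiral Addition, following steps 1-9 of the procedure."""
    op1, op2 = digits(a), digits(b)
    result, scale = 0, 1
    while any(op1):                                 # step 7: until OP1 = 0
        c = add_digit(op1, op2[0] if op2 else 0)    # step 3
        result += scale * c[0]                      # step 4
        scale *= 10
        op1, op2 = op2[1:], c[1:]                   # step 6 (drop the last digits)
    return result + scale * value(op2)              # step 8

print(spiral_add(26, 14))   # 155, as in the worked example above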

To guarantee that all the pixels are still located within the original image area after Spiral Addition
(Sheridan et al., 2000), a modulus operation is defined on the spiral address space. From Figure 5, it is
shown that the spiral address is a base-seven number, so modular operations based on such a number
system must execute accordingly. Suppose spiral_addressmax stands for the maximum spiral address in
the given Spiral Architecture area, the modulus number is
modulus = spiral_addressmax + 1        (1)

where + is Spiral Addition.


Then, the modular operation on the spiral addressing system can be performed as follows. First, the address and the corresponding modulus are converted to their decimal formats and the result of the modulus operation is worked out in the decimal number system. Then, the result in decimal format is converted back to its corresponding base-seven spiral address.
In addition, an Inverse Spiral Addition exists on the spiral address space. That means that for any given spiral address x there is a unique spiral address x̄ in the same image area which satisfies the condition x + x̄ = 0, where + stands for Spiral Addition. The procedure for computing the inverse value of a spiral address can be summarized briefly as follows.
According to Table 1, the inverse values of the seven basic spiral addresses 0, 1, 2, 3, 4, 5 and 6 are 0, 4, 5, 6, 1, 2 and 3 respectively. So the inverse value p̄ of any spiral address p = (pn pn−1 ... p1) can be computed as:

p̄ = (p̄n p̄n−1 ... p̄1)        (2)

Furthermore, Spiral Addition meets the requirement of a bijective mapping. That is, each pixel in the
original image maps one-to-one to each pixel in the output image after Spiral Addition.
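The digit-wise inverse of Equation (2) is a one-line mapping over the basic inverses listed above; a minimal sketch:

# Digit-wise Inverse Spiral Addition (Equation (2)).
INV = {0: 0, 1: 4, 2: 5, 3: 6, 4: 1, 5: 2, 6: 3}

def spiral_negate(addr):
    """Return the address x' with x + x' = 0 under Spiral Addition."""
    return int("".join(str(INV[int(c)]) for c in str(addr)))

print(spiral_negate(26))    # 53: each digit is replaced by its basic inverse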

Spiral Multiplication
Spiral Multiplication is also an arithmetic operation with closure properties defined on the spiral addressing system so that the resulting product will be a spiral address in the same finite set on which the


operation is performed. In addition, like Spiral Addition, Spiral Multiplication incorporates a special
form of modularity.
For basic Spiral Multiplication, a scalar form is defined as shown in Table 2.
The same naming convention is followed as in the Spiral Addition explanation above. Multiplication of address a by a scalar λ ∈ {0, 1, ..., 6} is obtained by applying scalar multiplication to each digit of a according to the above scalar form, and is denoted by:

λ(a) = (λan λan−1 ... λa1)        (3)

where a = (an an−1 ... a1), ai ∈ {0, 1, ..., 6}.

If the address in Spiral Multiplication is a common address like

b = (bn bn−1 ... b1), bi ∈ {0, 1, ..., 6}        (4)

then

a ⊗ b = ⊕ (i = 1 to n) [ (a ⊗ bi) ×nml 10^(i−1) ]        (5)

where ⊕ denotes Spiral Addition, ⊗ denotes Spiral Multiplication and ×nml denotes normal mathematical multiplication. A carry rule is required in Spiral Addition to handle the addition of numbers composed of more than one digit.
For example, to compute the Spiral Multiplication of 2614, the procedure is shown below:

Table 2. Scalar Spiral multiplication* (Sheridan, 1996)


0

* Bold type shows the scalar spiral address; normal type shows the results of Spiral Multiplication between the corresponding spiral addresses
on the first row and the first column respectively.


26 ⊗ 14 = (26 ⊗ 4) ×nml 1 ⊕ (26 ⊗ 1) ×nml 10
        = (2⊗4 6⊗4) ×nml 1 ⊕ (2⊗1 6⊗1) ×nml 10
        = 53 ×nml 1 ⊕ 26 ×nml 10
        = 53 ⊕ 260
        = 33

In the above demonstration, the Spiral Addition procedure is omitted.
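Building on the Spiral Addition sketch above (it reuses spiral_add, digits and value), Spiral Multiplication can be written directly from Equations (3) and (5). The scalar table is encoded by the rotation rule consistent with Table 2, and the modulus correction of Equations (7) to (9) is again left out.

# Sketch of Spiral Multiplication (Equations (3) and (5)); reuses spiral_add,
# digits() and value() from the Spiral Addition sketch.  Modulus handling omitted.

def scalar_mult(x, lam):
    """Scalar Spiral Multiplication of two single digits (Table 2)."""
    if x == 0 or lam == 0:
        return 0
    return (x + lam - 2) % 6 + 1

def mult_by_digit(a, lam):
    """Equation (3): apply the scalar to every digit of address a."""
    return value([scalar_mult(d, lam) for d in digits(a)])

def spiral_mult(a, b):
    """Equation (5): a (x) b = (+)_i (a (x) b_i) x_nml 10^(i-1)."""
    result = 0
    for i, bd in enumerate(digits(b)):
        result = spiral_add(result, mult_by_digit(a, bd) * 10 ** i)
    return result

print(spiral_mult(26, 14))   # 33, as in the worked example above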


Similarly to Spiral Addition, a modulus operation is defined on the spiral address space in order to guarantee that all the pixels are still located within the original Spiral area after Spiral Multiplication. Furthermore, the transformation through Spiral Multiplication defined on the spiral address space is a bijective mapping; that is, each pixel in the original image maps one-to-one to a pixel in the output image after Spiral Multiplication. The modulus operation for Spiral Multiplication is as follows. Let p be the product of two elements a and b. That is,
p = a ⊗ b        (6)

where a and b are two spiral addresses.


If p ≥ modulus then, if a is a multiple of 10,

p = (p + (p mod modulus)) mod modulus        (7)

otherwise,

p = p mod modulus        (8)

where

modulus = spiral_addressmax + 1        (9)

and + is Spiral Addition.

Finally, another point related to Spiral Multiplication is the existence of a multiplicative inverse. Given a spiral address a, there should be another address b such that a ⊗ b = 1 (Spiral Multiplication); b is denoted by a⁻¹, i.e., b = a⁻¹.
Two cases must be considered to find out the inverse value for a spiral address. Here, it is assumed


that spiral address 0 has no valid inverse value.

Case 1: a is Not a Multiple of 10


Let us assume a = (an an−1 ... a1), ∀ai ∈ {0, 1, ..., 6}, and the inverse value b = (bn bn−1 ... b1), ∀bi ∈ {0, 1, ..., 6}. In general, it is easy to obtain the inverse values for the basic spiral addresses 1, 2, 3, 4, 5 and 6: they are 1, 6, 5, 4, 3 and 2 respectively. So the inverse value b can be constructed by the following formula,
b1 = a1⁻¹
b2 = −(a2 ⊗ b1) ⊗ b1
...
bn = −( ⊕ (i = 0 to n−2) an−i ⊗ bi+1 ) ⊗ b1        (10)

Case 2: a is a Multiple of 10
a = k × 10^m (m < n)
modulus = 10^n = spiral_addressmax + 1        (11)

k⁻¹ can be obtained by Equation (10). Then the inverse value of a is

a⁻¹ = k⁻¹ ⊗ 10^(n−m) (Spiral Multiplication)        (12)

Mimicking Spiral Architecture


In order to apply the idea of Spiral Architecture to image-processing applications, it is inevitable to use a mimic Spiral Architecture based on the existing rectangular image architecture, because of the lack of mature devices for capturing and displaying images based on a hexagonal image architecture. The mimic Spiral Architecture plays an important role in image-processing applications on Spiral Architecture: it forwards image data between image-processing algorithms on Spiral Architecture and the rectangular image architecture for display purposes (see Figure 6). Such a mimic Spiral Architecture must retain the symmetrical properties of the hexagonal grid system. In addition, the mimic Spiral Architecture does not degrade the resolution of the original image.
For a given picture represented on the rectangular architecture, if it is re-represented on a Spiral Architecture in which each hexagonal grid has the same area as a square grid on the rectangular architecture, the image resolution is retained.
In order to work out the size of the hexagonal grid, the length of the side of a square grid is defined as


Figure 6. Image processing on mimic Spiral Architecture

Figure 7. A square grid and a hexagonal grid which have the same size of area

Figure 8. Relation between mimic hexagonal grid and the connected square grid. si is the size of overlap
area

1 unit length; namely, the area of a square grid is 1 unit area. Then, for a hexagonal grid which has the same area as a square grid, the distance from the centre to a side of the hexagonal grid is 0.537 (see Figure 7).
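The value 0.537 follows from the unit-area constraint: a regular hexagon of area 1 has side length sqrt(2/(3*sqrt(3))), and the centre-to-side distance (apothem) is sqrt(3)/2 times the side, as the short check below confirms.

# Check of the centre-to-side distance of a unit-area hexagonal grid cell.
import math

side = math.sqrt(2 / (3 * math.sqrt(3)))   # side length of a hexagon with area 1
apothem = side * math.sqrt(3) / 2          # distance from the centre to a side
print(round(apothem, 3))                   # 0.537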
In order to work out the grey value of a hexagonal grid, the relations between the hexagonal grid and its connected square grids must be investigated. The purpose is to find out the contribution of each connected square grid's grey value to the referenced hexagonal grid (see Figure 8).
Let N denote the number of square grids which are connected to a particular hexagonal grid, and let si denote the size of the overlap area between square grid i, one of the connected square grids, and the hexagonal grid. Because the size of a grid is 1 unit area (see Figure 7), the percentage of overlap area in a referenced


hexagonal grid is

pi = (si / 1) × 100% = si        (13)

Let gh denote the grey value of the hexagonal grid, and gsi denote the grey value of square grid i. Thus, the grey value of the hexagonal grid is calculated as the weighted average of the grey values of the connected square grids:

gh = Σ (i = 1 to N) pi × gsi        (14)

On the other hand, the reverse operation must be considered in order to map images from the virtual Spiral Architecture back to the rectangular architecture after image processing on Spiral Architecture (see Figure 6). After image processing on Spiral Architecture, the grey values of the virtual hexagonal grids have changed. Thus, the aim is to calculate the grey value of a square grid from the connected hexagonal grids (see Figure 8.b). The grey value of a square grid is calculated in the same way as in Equation (14), except that pi now stands for the percentage of overlap area in a referenced square grid (see Figure 8.b). Supposing there are M virtual hexagonal grids connected to a particular square grid, the square grid's grey value is

gs = Σ (i = 1 to M) pi × ghi        (15)

Using Equations (14) and (15), the grey values of the grids can be calculated easily as long as pi can be calculated. Wu et al. (Wu, He, & Hintz, 2004) proposed a practical method for easily calculating the relation between the mimic Spiral Architecture and the connected square grids on digital images.
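Equations (14) and (15) are the same weighted average applied in opposite directions, so a single helper covers both. The overlap fractions pi are assumed to be available already (for example, computed by the method of Wu et al., 2004); the sketch only illustrates the resampling step with made-up numbers.

# Weighted-average resampling between square and hexagonal grids
# (Equations (14) and (15)).  Overlap fractions p_i are assumed given.

def resample(grey_values, overlap_fractions):
    """Grey value of a target cell from the cells it overlaps, weighted by
    the fraction of the target cell's area covered by each of them."""
    return sum(p * g for p, g in zip(overlap_fractions, grey_values))

# Hexagonal cell covered 40% by a pixel of grey 100, 35% by 120 and 25% by 90.
print(resample([100, 120, 90], [0.40, 0.35, 0.25]))   # 104.5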

IMAGE PARTITIONING ON SPIRAL ARCHITECTURE


A novel image partitioning method is proposed for distributed image processing based on Spiral Architecture. Using this method each processing node will be assigned a uniform partitioned sub-image that
contains all the representative information, so each processing node can deal with the assigned information independently without data exchanges between the processing nodes. The first requirement for such
a partitioning scheme is that it should be configurable according to the number of partitions required.
Second, the partitioning has a consistent approach: after image partitioning each sub-image should be
a representative of the original one without changing the basic object features. Finally, the partitioning
should be fast without introducing extra cost to the system.

General Image Partitioning on Spiral Architecture


Under the traditional rectangular image architecture there are three basic image partitioning schemes: row partitioning, column partitioning and block partitioning (Bharadwaj et al., 2000; Koelbel, Loveman, Schreiber,


Jr., & Zosel, 1994). Compared with the rectangular image architecture, Spiral Architecture does not arrange the pixels row-wise, column-wise or in normal rectangular blocks. Instead, each pixel is positioned by a unique Spiral Address along the spiral rotation direction shown in Figure 9. The traditional partitioning methods are therefore infeasible, except for block partitioning. For example, the image in Figure 9 can be partitioned evenly into seven parts with seven sub-data sets [0, 1, ..., 6], [10, 11, ..., 16], [20, 21, ..., 26], [30, 31, ..., 36], [40, 41, ..., 46], [50, 51, ..., 56], [60, 61, ..., 66], where the numbers in the brackets are the spiral addresses of the pixels of Figure 9. That is, Spiral Architecture can split the original picture into M (M = 7^n, n = 1, 2, ...) parts. Based on the Spiral addressing scheme, consecutive hexagonal pixels are grouped together. Inside each part, the total number of pixels is also a power of seven. The index of the partitioned sub-area is consistent with the spiral addressing system, so the pixels in the different sub-areas can be identified immediately. A real example of image segmentation based on the partitioning scheme above is shown in Figure 10.
From Figure 10, it is seen that such a partitioning scheme simply splits the original image area into equal-size pieces but does not consider the image contents inside. For a global image-processing operation, such as global Gaussian processing on a distributed processing system, each node may process one segment of the original image, and during processing the nodes have to exchange necessary information between them. Such local communication is a disadvantage, and it grows as the number of partitions increases.

Uniform Image Partitioning on Spiral Architecture


In Spiral Architecture, two algebraic operations have been defined, Spiral Addition and Spiral Multiplication. After an image is projected onto Spiral Architecture, each pixel on the image is associated with a
particular hexagon and its Spiral address. The two operations mentioned above can then be used to define

Figure 9. Pixel arrangement on Spiral Architecture


Figure 10. Simple equal size image partitioning on Spiral Architecture

two transformations on the spiral address space: image translation and rotating image partitioning. In
our research, Spiral Multiplication is applied to achieve uniform image partitioning which is capable of
balancing workload among the processing nodes and achieves zero data exchange between the nodes.
From Figure 10, it is seen that simple image segmentation will result in much network overhead for
node synchronization and algorithm programming. In Spiral Architecture, after an image is multiplied
by a specific Spiral Address, the original image is partitioned into several parts. Each part is a near
copy of the original image. Each copy results from a unique sampling of the input image. Each sample
is mutually exclusive and the collection of all such samples represents a partitioning of the input image. As the scaling in effect represents the viewing of the image at a lower resolution, each copy has
less information. However, as none of the individual light intensities have been altered in any way, the
scaled images in total still hold all of the information contained in the original one (Sheridan, 1996).
Consequently, the sub-images can be processed independently by the corresponding processing nodes
without requiring data exchange between them.
Figure 11 shows an example of image partitioning with Spiral Multiplication. The original image has
16807 hexagon pixels. The multiplier used in Spiral Multiplication is 100001. With the novel uniform
image partitioning on Spiral Architecture, task parallelism can be achieved. An application involving complicated image processing often requires processing results of different kinds, such as a histogram, an edge map and a spectrum distribution. Under the proposed image partitioning, all these tasks can be dealt with independently on the assigned sub-images. Such a parallel processing scheme increases the system efficiency. Moreover, because each node possesses less information than the original image, the processing time is shortened dramatically. A detailed demonstration is given in the experiment section.
There are two points still existing in the above partitioning method that must be resolved before
it is utilized in practical applications. First, it is known that the uniform image partitioning is simply


achieved by Spiral Multiplication. However, the relation between the multiplier and the number of partitions has not been described yet. This is an important point in practical systems, since it must be able
to determine the number of partitions according to the image to be processed and the practical system
performance. Second, a complete sub-image may not be obtained when the multiplier used in Spiral
Multiplication is a general spiral address. For example, when the spiral address is 55555, the original
image is partitioned into several parts, but only the middle part holds the complete information of the
original image. Other sub-images are scattered on different areas (see Figure 12). It would be necessary
to collect the corresponding scattered parts together to form a complete sub-image before it is delivered
to the corresponding node for distributed processing.
In the following sections, solutions are proposed to deal with the two points mentioned above.

Computing the Number of Partitions


It is necessary to determine the relation between the multiplier and the number of partitions. Further, the
relation should be static, so for any given multiplier, the number of partitions is determined uniquely
when the corresponding Spiral Multiplication is executed. From the aspect of a distributed processing
application, the number of partitions often needs to be decided according to the image to be processed and
the performance of the system platform before the processing procedure commences. With the help of a
static relation between the multiplier and the partitioning number, it can be known what multiplier to use in Spiral Multiplication in order to partition the image into a specific number of parts.
In this work, it was found that such a relation between the Spiral Address and the partitioning number cannot be established directly. In order to achieve this goal, the Spiral Architecture is refined with the help of the structure proposed in (He, 1998). This redefined architecture was originally used to find the
spiral address for image rotation on Spiral Architecture. In this chapter, it will be used to construct the
relation between the multiplier (spiral address) which is used in the Spiral Multiplication for a particular

Figure 11. Seven part (near copies) image partitioning on Spiral Architecture


Figure 12. Spiral Multiplication by a common Spiral Address 55555

image partitioning and the number of partitions. The newly refined architecture contains three parameters
to identify each of its hexagonal pixels. Then, every spiral address will be mapped into a set of three
parameters. The refined Spiral Architecture is shown in Figure 13.
The original Spiral Architecture is divided into six regions, which are denoted by r = 1, 2,...,6. In each
region, the pixels are grouped into the different levels denoted by l = 0, 1,... along the radial direction.
On each level, each pixel is regarded as an item denoted by i, where i = 0, 1,...,l clockwise, as shown
in Figure 13. Each pixel can then be located uniquely by the three parameters, (r, l, i), in addition to the
Spiral Address within Spiral Architecture.
Based on the theory of Spiral Multiplication, every spiral address value x has a unique inverse value
y for an image of a given size. They should meet the condition,

Figure 13. Redefined Spiral architecture


(x × y) mod N = 1        (16)

where N is determined by the size of the image. Suppose the maximum spiral address of an image is
amax, then N = amax + 1 (Spiral Addition).
In order to find the relation between the multiplier and the number of partitions, the second step is to
work out the inverse value of the multiplier with Equation (16), which is also a spiral address. This can
be done instantly using the principles of Spiral Multiplication. Then, the parameter (r, l, i) corresponding
to the inverse value of the multiplier can be found in the refined Spiral Architecture as shown in Figure
13. In a practical application, a table is made to map each spiral address to its corresponding parameter
(r, l, i) beforehand.
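As a concrete illustration of this step, the required inverse could also be found by exhaustive search. The sketch below is illustrative only and assumes a helper spiral_multiply(a, b, N) implementing the chapter's Spiral Multiplication (with its modular wrap-around) for an image whose maximum spiral address is N - 1; that helper is not reproduced here.

def spiral_inverse(x: int, N: int, spiral_multiply) -> int:
    """Brute-force search for y with (x spiral-multiplied by y) mod N = 1, per Equation (16)."""
    for y in range(N):
        if spiral_multiply(x, y, N) % N == 1:
            return y
    raise ValueError(f"{x} has no inverse for N = {N}")

In practice the lookup table mentioned above would be built once with such a routine, so the inverse and its (r, l, i) parameters can be read off directly at run time.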
Naturally, there is no existing mathematical model which yields the relation between the multiplier and the number of partitions. In this research, an inductive method is followed to derive the principle. The number of
partitions is counted manually after the image is transformed by Spiral Multiplication with a particular
multiplier. For example, the number of partitions is counted manually when the inverse values of the
multiplier are 0, 1, 2, 14, 15, 63 whose corresponding parameters in the refined Spiral Architecture
are (0, 0, 0), (1, 1, 0), (1, 1, 1), (1, 2, 2), (1, 2, 1), (1, 2, 0). The numbers of partitions are 0, 1, 1, 4, 3,
4 respectively. In the work, more similar tests were made manually in order to reveal the relationship
between the multiplier and the number of partitions. Based on the inductive method, the following
formula is derived:
PNumber(r, l, i) = l² - i(l - 1) + i(i - 1),        (17)

where
r = 1, 2, ..., 6;
l = 0, 1, 2, ...; and
i = 0, 1, ..., l.

Then, the above formula is tested by a special image partitioning whose multiplier for Spiral Multiplication and numbers of partitions are known. Initially, it is known that, for an image of 49 hexagonal
pixels, it will be partitioned into seven near copies (See Figure 11) through Spiral Multiplication with
the multiplier 10. In this case, the inverse value of 10 is also 10 according to the principles of Spiral
Multiplication. From the refined Spiral Architecture, it is known that the corresponding parameters are
(1, 3, 2).
They are substituted in Equation (17). The number of partitions is calculated, and is seven, as expected.
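Equation (17) is easy to evaluate directly; the short sketch below (illustrative only) checks it against the manually counted cases quoted above, including the seven-part example.

def partition_count(l: int, i: int) -> int:
    """Equation (17): the number of partitions depends only on l and i."""
    return l * l - i * (l - 1) + i * (i - 1)

# Parameters (l, i) of the inverse multiplier -> expected number of partitions,
# taken from the cases listed in the text.
cases = {(0, 0): 0, (1, 0): 1, (1, 1): 1, (2, 2): 4, (2, 1): 3, (2, 0): 4, (3, 2): 7}
for (l, i), expected in cases.items():
    assert partition_count(l, i) == expected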
It is found that the number of partitions is only determined by the parameters l and i . That means
the partitioning number is only related to the level number and the item number of the inverse value
of the multiplier. The rotation angle is the only difference among the images transformed by Spiral
Multiplication with different multipliers that correspond to the same parameters l and i, but different
values of r. The angle difference is a multiple of 60 degrees. This point will be analysed in detail in the
next section.
In addition, every point on the border of two adjoining regions on the refined Spiral Architecture
(See Figure 13) has two different sets of parameters, because it strides over two regions with different
region numbers and different item numbers. Its corresponding number of partitions is identical, however,
regardless of which set of parameters is substituted into Equation (17). For example, address 14 has the parameters (1, 2, 2) and (2, 2, 0), but the corresponding number of partitions is 4 in both cases when the inverse value of the multiplier used by the Spiral Multiplication for image partitioning is 14.
Using the formula derived above, an image can be partitioned into as many parts as required, which
are subsampled copies of the original image in Spiral Architecture. This image partitioning method is
thus controllable and manageable according to the required precision and the capacity of processing
nodes on the network for distributed image processing.
Unfortunately, Spiral Multiplication cannot partition the original image into an arbitrary number of sub-images. For example, it is impossible to find a multiplier which can partition an image into two parts
by Spiral Multiplication. Thus, in practical applications, the approximate number of partitions must be
found to meet the requirements. The reason is that uniform image partitioning on Spiral Architecture
is the result of Spiral Multiplication. The new positions of the pixels are determined uniquely by the
principle of Spiral Multiplication. The relation of the pixel positions before and after Spiral Multiplication is a one-to-one mapping. Ordinary mathematical multiplication is defined on a continuous domain.
However, Spiral Multiplication is actually a kind of address counting operation which is a procedure for
pixel repositioning. Consequently, it cannot be guaranteed that a multiplier (spiral address) can be found
to partition the input image into any number of parts. From a mathematical point of view, an integral solution of the following multivariate equation cannot always be guaranteed:
l² - i(l - 1) + i(i - 1) - PNumber = 0,        (18)

where
l = 0, 1, 2, ...; and
i = 0, 1, ..., l.
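A small search (again only an illustration) makes this concrete: the sketch below looks for integer parameters (l, i) satisfying Equation (18) and, for example, finds none for a target of two partitions.

def find_parameters(p_number: int, max_level: int = 100):
    """Return some (l, i) with 0 <= i <= l solving Equation (18), or None."""
    for l in range(max_level + 1):
        for i in range(l + 1):
            if l * l - i * (l - 1) + i * (i - 1) == p_number:
                return (l, i)
    return None

print(find_parameters(2))   # None: no multiplier yields exactly two partitions
print(find_parameters(7))   # a solution exists, e.g. the seven near copies case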

Complete Image Partitioning in Spiral Architecture


With the formula developed in the previous section, a spiral multiplier can be chosen to partition the
original image into the required number of near copies. The number of partitions can be found to match
practical requirements by an adaptive method such as Divisible Load Theory (DLT) (Bharadwaj et al.,
2000).
However, it is found that when the number of partitions is not a power of seven (such as 7, 49 or 343), only one sub-image in the middle of the image area is a complete sub-image, while the other sub-images are segmented into several fragments scattered to different positions in the original image area.
For example, an image multiplied by a common spiral address 55555 gives the results shown in Figure
12. With the exception of the middle sub-image, the other three sub-images are each split into two fragments. This is unacceptable for distributed processing. Obviously, two problems must be resolved before
distributing the data to the processing nodes. One is that the corresponding fragments that belong to the
same sub-image must be identified. Another problem is that all the corresponding fragments must be
moved together to become a complete sub-image.
In this research, it is found that the boundaries of the different sub-image areas could be detected by
investigating the neighbouring relation of the spiral addresses between the reference point and its six
adjacent points: the neighbouring relation of spiral addresses along the boundary is different from the
neighbouring relation within the sub-image area. All the points belonging to the same sub-image area have a consistent relation. Consistency is destroyed only across a boundary between two different sub-image areas. Moreover, it is shown that the consistency can be expressed by Spiral Addition.

Figure 14. Seven hexagon cluster with six addends of Spiral Addition
Figure 14 shows a seven-hexagon cluster. The six numbers n1, n2,...,n6 shown are addends for Spiral
Addition, to be used later. The values of these addends are different under different Spiral Multiplications with the appropriate multipliers for the required image partitioning. The details of the method to
calculate the addends will be explained later. Here, it is assumed these addends have been already given.
Then, after image partitioning, all the points of the original image will move to new positions in the
new image. In the output partitioned image, if a point's original spiral address on the input image before partitioning is given, its six neighbouring points' original spiral addresses will be determined by Spiral Addition with the addends as shown in Figure 14. For example, suppose a point's spiral address on the
original image is x and the original address of its neighbouring point below is y, corresponding to the
position labelled n1 in Figure 14. If y = x + n1, these two points are in the same sub-image. Otherwise,
these two points are in different sub-images and they both stay on the boundaries of the sub-images.
Here, + stands for Spiral Addition including modular operation if necessary.
Definition 4.1. A point is defined as an inside point, i.e. a point within a sub-image area, if the relation between the point's address x and its six neighbouring points' addresses yi for i ∈ {1, 2, ..., 6} satisfies Equation (19); otherwise it is defined as an adjoining point, i.e. a point on the boundary between two sub-image areas.

yi = x + ni,    i ∈ {1, 2, ..., 6}        (19)

Addition in Equation (19) is Spiral Addition including any necessary modular operation (See Section
1.1) rather than normal mathematical addition.
Now, the remaining question is how to compute the addends ni, for i ∈ {1, 2, 3, 4, 5, 6}. During
image partitioning, the values of addends are determined by Spiral Multiplication, which achieves the
corresponding image partitioning. In other words, once the number of image partitions is determined, the
multiplier used in Spiral Multiplication is determined as explained in the previous section. The values
of addends as shown in Figure 14 are then fixed. Whether the point is an inside point or an adjoining
point is determined by the condition mentioned above. In fact, the values of addends in Figure 14 are
the original spiral addresses of the six points surrounding the centre of the image. An example is given
below.
Figure 15 shows the computation results of the Spiral Multiplication with multiplier 23 on an image
of 49 points. As shown in the figure, all the points move to unique new positions. Based on the above
explanation, the addends ni, i ∈ {1, 2, 3, 4, 5, 6}, are 15, 26, 31, 42, 53 and 64 respectively. The point with address 15 is an inside point because the relations between its address and its six neighbouring points' addresses meet the condition shown in Equation (19). The point with address 25 is an adjoining point because some of its neighbouring points cannot meet the address relation of Equation (19). For example, its upper neighbouring point's original address is 24. The corresponding addend used for Spiral Addition in Equation (19) is n4 = 42. According to Equation (19), if the point of address 25 were an inside point, the original address of the neighbouring point above it should be 30, i.e. 25 + 42 =
30 (Spiral Addition) rather than 24. So the point of address 25 is an adjoining point. This checking
procedure proceeds on each point as follows:
1. Initialize sub-image number sn = 1;
2. Choose any unchecked point on the image as the next point to be checked;
3. Label this point as sn;
4. Label all the unchecked neighbouring points which meet the condition in Equation (19) as sn;
5. Store the neighbouring points just labelled in step 4 temporarily in a buffer;
6. Choose any one of the neighbouring points which was just labelled in step 4 as the next point to be checked;
7. Repeat steps 3 to 6 until no unchecked neighbouring points can be found in step 4;
8. Choose any one of the unchecked points stored in the buffer as the next point to be checked;
9. Repeat steps 3 to 8 until no unchecked point can be found in the buffer;
10. Clear the buffer and set sn = sn + 1;
11. Repeat steps 2 to 10 until no unchecked point can be found on the image.

Figure 15. Relocation of points after Spiral Multiplication with multiplier 23

Figure 16. Three labelled sub-image areas after image partitioning
Then, all the points will be labelled by an area number. The fragments corresponding to the same
sub-image are found as shown in Figure 16.
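Viewed abstractly, the labelling procedure above is a connected-component (flood-fill) traversal in which two neighbouring points are connected only when Equation (19) holds. The sketch below expresses steps 1 to 11 in that form; it is an illustration only and assumes helpers defined elsewhere by the reader: spiral_add for Spiral Addition with its modular operation, neighbours(p) yielding the six neighbours of a point in the partitioned image together with their direction index 1 to 6, and original_address(p) giving a point's spiral address before partitioning.

from collections import deque

def label_sub_images(points, neighbours, original_address, spiral_add, addends):
    """Assign an area number sn to every point; addends maps direction 1..6 to n1..n6."""
    label = {}
    sn = 1
    for start in points:
        if start in label:
            continue
        label[start] = sn
        buffer = deque([start])                 # the temporary buffer of steps 5-9
        while buffer:
            p = buffer.popleft()
            for direction, q in neighbours(p):
                if q in label:
                    continue
                # Equation (19): q joins the same sub-image only if its original
                # address equals the Spiral Addition of p's address and n_i.
                if original_address(q) == spiral_add(original_address(p), addends[direction]):
                    label[q] = sn
                    buffer.append(q)
        sn += 1
    return label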
The last requirement is to collect the corresponding fragments together to form a complete sub-image.
Suppose the number of partitions is not a power of seven. After image partitioning in Spiral Architecture, all the sub-images are incomplete except the middle one. It is known that Spiral Addition with a common addend will move each point to a new position and guarantee a one-to-one mapping between the input image and the output image without changing the object shape, so it is a good technique for collecting the fragments of a sub-image. Moreover, from Figure 16 it is observed that all the sub-images have similar sizes and the sub-image in the middle area is always a complete sub-image. There is a special case: when the number of partitions is a power of seven, all the sub-images
have exactly the same size. This fact confirms that if the pixels in an incomplete sub-image can be moved
into the middle sub-image area properly, this sub-image will be restored successfully.
Since Spiral Addition is a consistent operation, if the point that was closest to the point with spiral
address 0 on the original image is moved, other points will be automatically located to corresponding
positions without changing the object shape in the image. Such movement is achieved using Spiral Addition as mentioned above. This operation is performed on each sub-image that has been given an area
number in the previous step, and then all the incomplete sub-images will be restored one by one.
Let us call the point which was closest to the point with spiral address 0 before image partitioning the relative centre of the sub-image. The addend of Spiral Addition for restoring the incomplete sub-image is computed as follows.


Figure 17. Four-part complete image partitioning in Spiral Architecture

Suppose the spiral address of the relative centre of the sub-image is x after image partitioning.
Then the addend of Spiral Addition for collecting the fragments of a sub-image is the inverse value of x, which is computed according to the principles of Spiral Addition. As a result, the relative centre is
moved to the point of spiral address 0 and other points in the fragments are moved to corresponding
positions to produce a complete sub-image.
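A compact sketch of this fragment-collection step is given below (illustrative only). It assumes the same hypothetical spiral_add helper together with a spiral_inverse routine returning the additive inverse under Spiral Addition, and it represents a fragmented sub-image as a mapping from current spiral addresses to pixel values.

def collect_fragments(fragments, relative_centre, spiral_add, spiral_inverse):
    """Translate an incomplete sub-image into the middle area to complete it.

    relative_centre is the current address of the point that was closest to
    spiral address 0 before partitioning; the common addend is its inverse.
    """
    addend = spiral_inverse(relative_centre)
    # Spiral Addition with a common addend is one-to-one and shape-preserving,
    # so every pixel of the fragments lands on a distinct new address.
    return {spiral_add(addr, addend): value for addr, value in fragments.items()}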
Figure 17 gives an example showing the procedure mentioned above. The original images contain
16807 points. They are partitioned into four parts with multiplier 55555 and three parts with multiplier
56123 respectively. The separated sub-image areas are shown using different illuminations and labelled
using different area numbers. Finally, the fragments of incomplete sub-images were collected together to
produce a complete partitioned sub-image. The addends used in Spiral Addition for fragment collection
are also shown on each sub-image. The complete sub-images so obtained can be distributed to different
nodes for further processing.
Figure 18 gives another example, which shows image partitioning into three parts on Spiral Architecture.
The relevant addends are shown on the pictures.

EXPERIMENTS
In order to demonstrate the performance advantage provided by distributed processing based on the special image partitioning on Spiral Architecture, global Gaussian processing for image blurring is chosen as
a testing algorithm. Gaussian processing is an algorithm used widely in image processing for several
applications such as edge detection, image denoising, and image blurring. It can be mathematically
explained as,


Figure 18. Three-part complete image partitioning on Spiral Architecture

Figure 19. Prototype of distributed system topology


L(x, y; t) = g(x, y; t) * f(x, y)
           = (1 / (2πt)) ∫∫_Ω f(u, v) exp(-[(x - u)² + (y - v)²] / (2t)) du dv.        (20)

where f maps the coordinates of the pixel (x, y) to a value representing the light intensity, i.e. f : ℝ² → ℝ. g(·) is the Gaussian kernel. L(·) represents the image after Gaussian processing. t is called the coarseness scale and t > 0. Ω stands for the set of points on the image area which participate in the Gaussian convolution. For global Gaussian processing, Ω stands for the whole area of the original image. As t increases, the signal L becomes gradually smoother.
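For experimentation, Equation (20) can be discretised; the sketch below does so on an ordinary square lattice with a sampled Gaussian kernel of scale t. It is only an approximation of the idea, since the chapter itself works on hexagonally sampled images.

import numpy as np

def gaussian_blur(image: np.ndarray, t: float, radius: int = 10) -> np.ndarray:
    """Global Gaussian smoothing of a 2-D image; larger t gives a smoother result."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * t)) / (2.0 * np.pi * t)
    kernel /= kernel.sum()                        # normalise the sampled kernel
    padded = np.pad(image, radius, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for dy in range(2 * radius + 1):              # direct convolution, for clarity
        for dx in range(2 * radius + 1):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0],
                                           dx:dx + image.shape[1]]
    return out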
In our work, the partitioning approach is implemented on a cluster of workstations, shown in Figure 19. One of the computers acts as a master computer (master node) and the remaining seven computers are used as slave computers (slave nodes). In the early phase of the processing, the master node is
responsible for initial data partitioning and data delivery to the slave nodes. The data is then processed
on the slave nodes. Depending on the image partition scheme, the slave nodes may or may not need to
exchange data (denoted by the dashed lines in Figure 19). For the scheme of simple image partitioning shown in Figure 10, data exchange between slave nodes is inevitable since each part does not represent the
information of the whole image. During the procedure of global Gaussian processing, each slave node
must obtain necessary pixel information which is located on other parts to complete the computation of
Equation (20). On the other hand, if the uniform image partitioning scheme based on Spiral Architecture is chosen (see Figure 11), each slave node can carry out the necessary processing independently without data exchange between nodes, because each node possesses a near copy of the original image. The individual processing results are sent to the master node, where a relatively simple process is carried out to combine them into the final result of global Gaussian processing. Thus, the dashed lines shown in Figure 19 can be removed.
A three-level algorithm is designed, which consists of a parent process, seven child processes and
seven remote processes. The parent process and the seven child processes reside on the master computer.
Each slave node executes one of the remote processes. The parent process is mainly responsible for
data management and process management, including data communication, command delivery, data
manipulation, child process creation, process monitoring and process synchronization. Each remote
process completes all the detailed work on the data block assigned by the master node.
Three techniques are applied for data communication between the processes: Shared Memory, Message Queues and Sockets. The former two are used for the data exchanges between the parent process and the child processes. The latter is used for client-server communication between the child
processes and the remote processes, and between the remote processes if required.
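The sketch below (an illustration only, not the original shared-memory, message-queue and socket implementation) shows the essential master/worker pattern that the uniform partitioning enables: every worker runs the complete task on its near-copy sub-image with no exchange between workers, and the master merely combines the partial results.

from multiprocessing import Pool

def process_sub_image(sub_image):
    # Stand-in for the per-node work (histogram, edge map, Gaussian pass, ...).
    return sum(sub_image) / len(sub_image)

def master(image_parts, workers=7):
    with Pool(processes=workers) as pool:
        partial_results = pool.map(process_sub_image, image_parts)
    # Combining independent partial results is the master's only remaining job.
    return sum(partial_results) / len(partial_results)

if __name__ == "__main__":
    parts = [[10, 20, 30], [12, 18, 33], [11, 19, 31]]   # toy stand-in data
    print(master(parts, workers=3))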
In the experiments, two approaches are used to achieve distributed processing. One used a single CPU
with multiple processes. Another used multiple computers in a network, where each of them had one process
to deal with the assigned sub-image. The test bed consists of eight computers (Sun Ultra workstations, each of which has a SPARC-based CPU with a clock rate of approximately 333.6 MHz).
The experimental results based on simple data partitioning (see Figure 10) are shown in Figure 20, where data communication between the slave nodes is necessary.
In the figures, 1 Process/1 Node actually is sequential processing, where only one computer and one
process deal with the task. 7 Processes/1 Node uses one CPU with multiple processes, as mentioned


above. Finally, 7 Processes/7 Nodes means that seven computers on a network are used to achieve
distributed processing and each of them has only one process to process the assigned sub-image, the
second approach mentioned above. As shown in the figure, distributed processing speeds up the data
processing for the case shown, but this is not always true. For an image of 2401 pixels, processing based
on a single CPU with multiple processes will take more time than sequential processing, because the
CPU requires extra time to deal with process management, so the time cost exceeds the time
saved by distributed processing.
This situation becomes more serious when the pixel number decreases to 343. Besides the extra cost for
process or node management, data communication becomes a significant issue during the procedure.
The total processing time is divided into data-processing time and non-data-processing time, including the time for data exchange, process management and sub-task synchronization. The statistical results
for processing times are shown in Figure 21, which gives the components of the processing time under the different situations based on simple image partitioning. It shows that the fraction of time for data processing decreases as the size of the image decreases. The reason is that when the size of the image decreases, the
system requires less time for data processing. This part of the time decreases dramatically. However,
the non-data-processing time decreases relatively more slowly. The reason is that the time for process
management does not change when the number of child processes stays fixed. In addition, the throughput of data I/O is determined by the system I/O performance. The response of a high-speed hard disk to an image of 1 MByte and an image of 100 KBytes is almost the same, so when the size of the image
decreases, the time cost for data I/O through the hard disk does not change much. The situation is the
same for data communication on a high-speed Local Area Network (LAN).
Moreover, Figure 10 (Simple Equal-Size Image Partitioning on Spiral Architecture) shows that after equal-size image partitioning, the processing nodes do not receive equal amounts of effective object information. Some
nodes contain much more object information, while some nodes do not contain any effective object
information. Consequently, some nodes finish their assigned tasks earlier than other nodes. The processing times on each node may range from one second to several minutes. The nodes with less object
information must therefore wait for the nodes with more object information before they can receive new
commands and update information for the next sub-task from the master node. That is another reason
that sequential processing is sometimes faster than distributed processing.
As discussed above, if the uniform image partitioning scheme (see Figure 11) is chosen, system overheads and the complexity of program design will be greatly reduced because there will be no data communication between slave nodes.
The same task, global Gaussian processing as in the previous section, is now carried out again here
based on the new partitioning scheme. Some statistics for processing time are shown in Figure 22 and
Figure 23.
Obviously, the computing complexity has been both reduced and nicely partitioned without discarding
any information for distributed processing. In addition, as shown in Figure 23, most of the processing
time is the cost of data processing. This processing system is clearly highly efficient. If the percentage
of data processing time in the total processing time is used as the index of system efficiency, the new partitioning scheme improves the system efficiency by about 2%, from 96.94% to 98.73%, compared with the same processing approach, 7 Processes/7 Nodes, using the simple partitioning scheme.

Figure 20. Image processing time based on simple image partitioning

Figure 21. The components of processing time under the different situations based on simple image partitioning

Figure 22. Processing time comparison based on the normal equal-size partitioning and uniform partitioning on Spiral Architecture (image of 16807 points)

Figure 23. The components of the processing time under the different partitioning schemes

CONCLUSION
This chapter presents an application of Spiral Architecture to image partitioning which is important for
distributed image processing. Based on the principle of Spiral Multiplication, a new image partitioning
scheme is proposed. Using Spiral Multiplication an image can be partitioned into a few parts. Each part
is an exclusive sampling of the original image and contains representative information from all areas
of the original image. Consequently, each sub-image can be viewed as a near copy of the original image. In distributed processing based on such an image partitioning scheme, each node will process the
assigned sub-image independently without the data exchange normally required. This should speed up
processing very much.
In a practical system, the number of partitions is determined by the application requirement, the image
to be processed and the system performance. However, the relation between the partitioning number and
the multiplier (spiral address) used in Spiral Multiplication was not known. In this chapter, an equation
was built up to describe this relationship, so the number of partitions can be worked out for the given
multiplier and vice versa as required.
Unfortunately, complete sub-images can be obtained by Spiral Multiplication only when the partitioning number is a power of seven. In other words, when the number of image partitions is some other
value like 4 and 5, all the sub-images except one are split into a few fragments and scattered to different positions. It was impossible to tell which fragments belonged to which sub-image, an unacceptable
flaw for parallel image processing. In this chapter, the neighbouring relation of the points is found out
and explicitly expressed after Spiral Multiplication using Spiral Addition. The different sub-image areas are identified. Then, the points on the different sub-image areas are labelled. Finally, the fragments
corresponding to the same sub-images are collected together to produce the complete sub-images. Such
complete sub-images can be distributed to the different nodes for further processing.

REFERENCES
Alexander, D. (1995). Recursively Modular Artificial Neural Network. Doctoral Thesis, Macquarie University, Sydney, Australia.
Arpaci, R. H., Dusseau, A. C., Vahdat, A. M., Liu, L. T., Anderson, T. E., & Patterson, D. A. (May, 1995).
The interaction of parallel and sequential workloads on a network of workstations. Paper presented at
the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods.
Englewood Cliffs, NJ: Prentice Hall.
Bharadwaj, V., Li, X., & Ko, C. C. (2000). Efficient partitioning and scheduling of computer vision and
image processing data on bus networks using divisible load analysis. Image and Vision Computing, 18,
919-938. doi:10.1016/S0262-8856(99)00085-2
Braunl, T., Feyrer, S., Rapf, W., & Reinhardt, M. (2001). Parallel Image Processing. Berlin: Springer-Verlag.
Chen, C. M., Lee, S. Y., & Cho, Z. H. (1990). A Parallel Implementation of 3D CT Image Reconstruction on HyperCube Multiprocessor. IEEE Transactions on Nuclear Science, 37(3), 1333-1346.
doi:10.1109/23.57385
Goller, A. (1999). Parallel and Distributed Processing of Large Image Data Sets. Doctoral Thesis, Graz
University of Technology, Graz, Austria.
Goller, A., & Leberl, F. (2000). Radar Image Processing with Clusters of Computers. Paper presented
at the IEEE Conference on Aerospace.

Hawick, K. A., James, H. A., Maciunas, K. J., Vaughan, F. A., Wendelborn, A. L., Buchhorn, M., et al.
(1997). Geostationary-satellite Imagery Application on Distributed, High-Performance Computing. Paper
presented at the High Performance Computing on the Information Superhighway: HPC Asia97.
He, X. (1998). 2D-Object Recognition With Spiral Architecture. Doctoral Thesis, University of Technology, Sydney, Australia.
Koelbel, C. H., Loveman, D. B., Schreiber, R. S., Steele, G. L., Jr., & Zosel, M. E. (1994). The High Performance Fortran Handbook. Cambridge, MA: MIT Press.
Kok, A. J. F., Pabst, J. L. v., & Afsarmanseh, H. (April, 1997). The 3D Object Mediator: Handling 3D
Models on Internet. Paper presented at the High-Performance Computing and Networking, Vienna,
Austria.
Lee, C., Lee, T.-y., Lu, T.-c., & Chen, Y.-t. (1997). A World-wide Web Based Distributed Animation Environment. Computer Networks and ISDN Systems, 29, 1635-1644. doi:10.1016/S0169-7552(97)00078-0
Marsh, A. (1997). EUROMED - Combining WWW and HPCN to Support Advanced Medical Imaging.
Paper presented at the High-Performance Computing and Networking, Vienna, Austria.
Miller, R. L. (1993). High Resolution Image Processing on Low-cost Microcomputer. International
Journal of Remote Sensing, 14(4), 655-667. doi:10.1080/01431169308904366
Nicolescu, C., & Jonker, P. (2002). A Data and Task Parallel Image Processing Environment. Parallel
Computing, 28, 945-965. doi:10.1016/S0167-8191(02)00105-9
Niederl, F., & Goller, A. (Jan, 1998). Method Execution On A Distributed Image Processing Backend.
Paper presented at the 6th EUROMICRO Workshop on Parallel and Distributed Processing, Madrid,
Spain.
Oberhuber, M. (1998). Distributed High-Performance Image Processing on the Internet. Doctoral Thesis,
Graz University of Technology, Austria.
Pitas, I. (1993). Parallel Algorithm for Digital Image Processing, Computer Vision and Neural Network.
Chichester, UK: John Wiley & Sons.
Schowengerdt, R. A., & Mehldau, G. (1993). Engineering a Scientific Image Processing Toolbox for the Macintosh II. International Journal of Remote Sensing, 14(4), 669-683.
doi:10.1080/01431169308904367
Schwartz, E. (1980). Computational Anatomy and Functional Architecture of Striate Cortex: A Spatial Mapping Approach to Perceptual Coding. Vision Research, 20, 645-669. doi:10.1016/0042-6989(80)90090-5
Sheridan, P. (1996). Spiral Architecture for Machine Vision. Doctoral Thesis, University of Technology,
Sydney.
Sheridan, P., Hintz, T., & Alexander, D. (2000). Pseudo-invariant Image Transformations on a Hexagonal
Lattice. Image and Vision Computing, 18(11), 907-917. doi:10.1016/S0262-8856(00)00036-6
Siegel, H. J., Armstrong, J. B., & Watson, D. W. (1992). Mapping Computer-Vision-Related Tasks onto
Reconfigurable Parallel-Processing Systems. IEEE Computer, 25(2), 54-63.

Siegel, L. J., Siegel, H. J., & Feather, A. E. (1982). Parallel Processing Approaches to Image Correlation.
IEEE Transactions on Computers, 31(3), 208-218. doi:10.1109/TC.1982.1675976
Squyres, J. M., Lumsdaine, A., & Stevenson, R. L. (1995). A Cluster-based Parallel Image Processing
Toolkit. Paper presented at the IS&T Conference on Image and Video Processing, San Jose, CA.
Stevenson, R. L., Adams, G. B., Jamieson, L. H., & Delp, E. J. (1993, April). Parallel Implementation
for Iterative Image Restoration Algorithms on a Parallel DSP Machine. The Journal of VLSI Signal
Processing, 5, 261-272. doi:10.1007/BF01581300
Wu, D. M., & Guan, L. (1995). A Distributed Real-Time Image Processing System. Real-Time Imaging,
1(6), 427-435. doi:10.1006/rtim.1995.1044
Wu, Q., He, X., & Hintz, T. (2004, June 21-24). Virtual Spiral Architecture. Paper presented at the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas,
Nevada, USA.

KEY TERMS AND DEFINITIONS


Distributed Processing: Distributed processing refers to a special computer system which is capable of running a program simultaneously on multiple nodes such as computers and processors. These nodes are connected to each other and managed by sophisticated software which detects idle nodes and parcels out programs to utilize them.
Image Partitioning: For distributed processing purposes, image partitioning is the efficient segmentation of an image into multiple parts. Each part is sent to a computing node and the parts are processed simultaneously.
Spiral Architecture: Spiral Architecture is a special image architecture where the image is displayed
by a set of hexagonal pixels. These pixels with the shape of hexagons are arranged in a spiral cluster.
Each unit is a set of seven hexagons. That is, each pixel has six neighbouring pixels.
Spiral Addressing: Spiral Addressing is a special addressing scheme which is used to uniquely
identify each pixel on Spiral Architecture. The address is in fact a base-seven number. Such addressing labels all pixels in a recursive modular manner along a spiral direction.
Spiral Addition: Spiral Addition is an arithmetic operation with closure properties defined on the
spiral address space. Applying spiral addition to an image labelled by spiral addresses achieves image translation on Spiral Architecture.
Spiral Multiplication: Spiral Multiplication is an arithmetic operation with closure properties defined
on the spiral address space. Applying spiral multiplication on the image labelled by spiral address can
achieve image rotation on Spiral Architecture.

ENDNOTE
1. This is a base 7 number. Unless specified otherwise, spiral addresses, addends used in Spiral Addition and multipliers used in Spiral Multiplication are base 7 numbers in the following sections.


Chapter 36

Scheduling Large-Scale DNA Sequencing Applications
Sudha Gunturu
Oklahoma State University, USA
Xiaolin Li
Oklahoma State University, USA
Laurence Tianruo Yang
St. Francis Xavier University, Canada

ABSTRACT
This chapter studies a load scheduling strategy with near-optimal processing time that is designed
to explore the computational characteristics of DNA sequence alignment algorithms, specifically, the
Needleman-Wunsch Algorithm. Following the divisible load scheduling theory, an efficient load scheduling strategy is designed in large-scale networks so that the overall processing time of the sequencing
tasks is minimized. In this study, the load distribution depends on the length of the sequence and the number of processors in the network, and the total processing time is also affected by the communication link
speed. Several cases have been considered in the study by varying the sequences, communication and
computation speeds, and number of processors. Through simulation and numerical analysis, this study
demonstrates that, for a constant sequence length, as the number of processors in the network increases, the processing time for the job decreases and a minimum overall processing time is achieved.

INTRODUCTION
Large-scale network-based computing has attracted tremendous efforts from both academia and industry
because it is scalable, flexible, extendable, and economical, with widespread applications across many
disciplines in science and engineering. To address scalability issues for an important class of applications,
researchers proposed a divisible load scheduling theory (DLT). These applications are structured as large
numbers of independent tasks with low granularity (Bharadwaj, Ghose, & Robertazzi, 2003). They are thus embarrassingly parallel and are typically executed in master-slave fashion.
Such applications are called divisible loads because a scheduler may divide the computation among worker processes arbitrarily, both in terms of the number of tasks and the task sizes. Scheduling the tasks of a parallel application on the resources of a distributed computing platform efficiently is critical for achieving optimal performance (Bharadwaj, Ghose, & Mani, 1995).
The load distribution problem in distributed computing networks, consisting of a number of processors
interconnected through communication links, has attracted a great deal of attention (Sameer Bataineh,
Te-Yu Hsiung, & Thomas Robertazzi, 1994). Divisible Load Theory (DLT) is a methodology that is
involved in the linear and continuous modeling of partitioning the computation and communication
loads for parallel processing (Robertazzi, 2003). DLT is primarily used for handling large-scale processing on network-based systems.
The DLT paradigm has demonstrated numerous applications such as edge detection in image processing, file compression, joining operations in relational databases, graph coloring and genetic searches
(Wong & Veeravalli, 2005). Some more examples of real divisible applications include searching for patterns in text, audio and graphic files, database and measurement processing, data retrieval systems, some linear algebra algorithms, and simulations (Drozdowski & Lawenda, 2005).
Over the past few decades, research in the field of molecular biology has advanced, coupled with advances in genomic technologies. This has led to an explosive growth in the biological information generated, which in turn has led to the requirement for computerized databases to store, organize, and
index the data and for specialized tools to view and analyze the data.
In this chapter a parallel strategy is designed to explore the computational characteristics of the Needleman-Wunsch algorithm, which is used for biological sequence comparison in the literature. In designing the strategy, the load is partitioned among the processors of the network using the DLT paradigm (Bharadwaj, Ghose, & Mani, 1995).
Two commonly used algorithms for sequence alignment are the Needleman-Wunsch Algorithm and
Smith-Waterman Algorithm where the former is employed for Global Alignment and the latter is used for
Local Alignment. The complexity of the Needleman-Wunsch Algorithm and the Smith-Waterman Algorithm for aligning sequences of length x is O(x²) (Wong & Veeravalli, 2005).
The algorithm used in this study is the Needleman-Wunsch Algorithm. The approach adopted in this study for parallelizing the Needleman-Wunsch Algorithm is to compute the matrix elements in a diagonal fashion using a Multiple Instruction Multiple Data (MIMD) system.
Divisible Load Theory is employed for handling the sequence alignment. The objective is to minimize the total processing time for sequence alignment. The partition of the load depends primarily on
the matrix that is generated by the Needleman-Wunsch Algorithm. The network has been studied for
variable link speed and constant link speed.

RELATED WORK
The merging of the two rapidly advancing technologies of molecular biology and computer science resulted in a new informatics science, namely bioinformatics (Wong & Veeravalli, 2005). Over the past few years, interest and research in the area of biotechnology have increased drastically. This area of study deals primarily with methodologies for operating on molecular biological information. Present-day molecular biology is characterized by the collection of large volumes of data.

Information science, when applied to biology, produced a field called bioinformatics. The areas of bioinformatics and computational biology involve the use of techniques and concepts including
applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and
biochemistry to solve biological problems, usually on the molecular level. The terms bioinformatics and computational biology are often used interchangeably. Research in computational biology often overlaps
with systems biology. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression
and protein-protein interactions, and the modeling of evolution. The area of bioinformatics more clearly
refers to the creation and advancement of algorithms, computational and statistical techniques, and also
includes the theory to solve formal and practical problems arising from the management and analysis of
biological data. Computational biology refers to hypothesis-driven investigation of a specific biological
problem using computers, carried out with experimental or simulated data, with the primary goal of
discovery and the advancement of biological knowledge. In other words, bioinformatics is concerned
with the information while computational biology is concerned with the hypotheses. The most common
operations on biological data include sequence analysis, protein structure prediction, genome sequence alignment, phylogeny tree construction, pathway research and sequence database placement. One of the most basic and important bioinformatics tasks is to find a set of homologies for a given sequence, because sequences that are similar are often related in function (Autenrieth, Isralewitz, Luthey-Schulten, Sethi, & Pogorelov, 2000; Jones & Pevzner, 2004).
The different bioinformatics applications such as sequence analysis, protein structure prediction, genome sequence alignment, and phylogeny tree construction are distributed across different individual projects and they require high-performance computational environments. Biologists use a tool called BLAST for performing research (Altschul, Gish, Miller, Myers, & Lipman, 1990). This tool performs database searches; in other words, it can be described as a Google for biological sequences. This tool provides
a method for searching a nucleotide and protein database. This BLAST is designed in such a way that
it can detect local and global alignments. Sequence alignment is often used in biological analysis. Any two newly discovered biological sequences can be aligned with the algorithms present in the literature and their similarity can be determined. This sequence alignment can
be useful in understanding the function, structure and origin of the new gene. In sequence alignment two
sequences are compared with the residues of one another while taking the positions of the residues into
account. Residues in the sequence can be inserted, deleted or substituted to achieve maximum similarity
or optimal alignment. For example, GenBank has been growing at an exponential rate, to over 100 million sequences (Wong & Veeravalli, 2005; Benson, Karsch-Mizrachi, Lipman, Ostell, Rapp, & Wheeler, 2000). To meet the growing needs a wide variety of heuristic methods have been proposed for aligning the sequences, such as FASTP, FASTA, BLAST, and FLASH (Yap, Frieder, & Martino, 1998).
The NIH Biomedical Information Science and Technology Initiative Consortium that was held on
July 17, 2000 has agreed on formal definitions for bioinformatics and computational biology. They also
recognized that there is no definition that could completely eliminate the overlap of the variations in
interpretation by different individuals and organizations. The definitions proposed by them are as follows:

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data (Huerta, Haseltine, & Liu, 2000).

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems (Huerta, Haseltine, & Liu, 2000).

The areas of bioinformatics and computational biology use mathematical tools to extract useful
information from data produced by high-throughput biological techniques such as genome sequencing.
One of the most common representative problems in bioinformatics is the assembly of high-quality
genome sequences from fragmentary shotgun DNA sequencing. Other common problems include
the study of gene regulation using data from microarrays or mass spectrometry (Cristianini & Hahn, 2006).
Sequence Analysis in biology can be explained by subjecting a DNA or peptide sequence to sequence alignment, sequence databases, repeated sequence searches, or other bioinformatics methods
on a computer (Autenrieth, Isralewitz, Luthey-Schulten, Sethi, & Pogorelov, 2000). Sequence analysis in molecular biology and bioinformatics is an automated, computer-based examination of characteristic fragments, for example a DNA strand. It basically includes five biologically relevant topics: (1) the comparison of sequences in order to find similar sequences (sequence alignment); (2) the identification of gene structures, reading frames, distributions of introns and exons, and regulatory elements; (3) the prediction of protein structures; (4) genome mapping; and (5) the comparison of homologous sequences to construct a molecular phylogeny.
Similarity detection is often used in biological analysis. It is widely used because comparing a new, unknown gene sequence against known sequences can give significant understanding of the function, structure and origin of the new gene. While comparing two gene sequences, which is also known as aligning the two sequences, the residues from one sequence are compared with the residues of the other, with the positions of the residues taken into consideration. The different operations that can be performed are insertion, deletion and substitution of residues in the other sequence.
Many algorithms have been proposed in the literature for comparing two biological sequences for
similarities. The most popular algorithm for aligning DNA is the Needleman-Wunsch algorithm; for protein alignment it is the Smith-Waterman algorithm. In the sequence comparison, a combination of the DLT approach and these algorithms is used in order to align the sequences accurately. In this chapter a load scheduling strategy is designed for sequence alignment in large-scale networks, and it has been observed that increasing the number of processors in the network results in a minimal computation time.

PROBLEM FORMULATION
The Needleman-Wunsch algorithm is one of the algorithms that perform a global alignment on two
sequences (called X and Y here). This algorithm finds its application in bioinformatics to align protein
or nucleotide sequences (Likic, 2000). The algorithm was first proposed by Saul Needleman and Christian Wunsch in 1970 (Goad, 1987). The Needleman-Wunsch algorithm is an example

Figure 1. Needleman-Wunsch algorithm after the generation of S matrix

of dynamic programming, and was the first application of dynamic programming to biological sequence
comparison. The algorithm can be explained in the following steps:

1. Initialize the matrix S = 0.
2. Fill in the matrix S with 1 if it is a match and 0 if it is a mismatch.
3. Compute the score from the bottom right based on the formula M[i, j] = S[i, j] + Max{M[i+1:x], M[j+1:y]}.
4. Trace back from the top-left corner, and select the maximum value from the adjacent column and row, and so on.

For example, let us consider two sequences GTCAGTC and GCCTC. In order to align these sequences we first need to construct the matrices as shown in the figures.

The Needleman-Wunsch Algorithm as well as some of the characteristics of the S and M matrices
that are generated by the algorithm are explained as follows. In aligning two biological sequences that
are denoted as Seq X and Seq Y of length x and y, respectively, the algorithm generates two matrices
represented by S and M, as shown in Figure 1 and Figure 2. The matrices S and M are related to each other by the equation M[i, j] = S[i, j] + Max{M[i+1 : x], M[j+1 : y]} for the range 1 <= p <= x, 1 <= q <= y, where Sp,q and Mp,q represent the elements in the pth row and qth column of the matrices S and M respectively.
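One reading of this simplified recurrence, in which the maximum is taken over the sub-matrix below and to the right of the current cell, is sketched below for the example sequences GTCAGTC and GCCTC. It is an illustration of that reading only, not the chapter's implementation, and gap penalties are ignored as in the description above.

import numpy as np

def nw_matrices(seq_x: str, seq_y: str):
    """Build the S (match) and M (score) matrices, filling M from the bottom right."""
    x, y = len(seq_x), len(seq_y)
    S = np.array([[1 if seq_x[i] == seq_y[j] else 0 for j in range(y)]
                  for i in range(x)], dtype=int)
    M = np.zeros((x, y), dtype=int)
    for i in range(x - 1, -1, -1):
        for j in range(y - 1, -1, -1):
            below_right = M[i + 1:, j + 1:]
            M[i, j] = S[i, j] + (below_right.max() if below_right.size else 0)
    return S, M

S, M = nw_matrices("GTCAGTC", "GCCTC")
print(M)   # the trace-back for the alignment starts from the top-left region of M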

Figure 2. Needleman-Wunsch algorithm after the generation of M matrix

Figure 3. Single level tree network

The network under consideration is a simple single level tree network (SLTN), which is shown in Figure 3; the root node can communicate with only one child at a time. The approach to the problem can be described in a series of steps. The first step is to create a simple SLTN with a fixed number of nodes and apply divisible load theory on that network. Then the number of nodes in the system is increased and the DLT technique is applied again. The two biological sequences are given to the network and the Needleman-Wunsch algorithm gives the alignment. The final aim of this chapter includes measuring the computation time involved in processing the job. From the results it can be observed that by applying the DLT technique the computation time decreases drastically.
The objective is to design a strategy such that the processing time or the computation time for the
alignment of the two biological sequences is a minimum. The two biological sequences are considered
to be of length x and y. These sequences may vary from one character to thousands of characters. In the results section, however, the sequence lengths are varied from 100 to 1000.
We assume that all the processors in the network, P1, P2, ..., Pm, already have Sequence x and Sequence y in their local memories, or that they can be initialized in this way. To carry out sequence alignment in a multiprocessor environment, one way is to keep a copy of the sequences in each processor's local memory.

LOAD SCHEDULING STRATEGIES AND ANALYSIS


The distribution strategy for the S matrix, which consists of 0s and 1s, does not require any special kind of distribution. The M matrix is partitioned into sub-matrices Mp,q, where p = 1, 2, ..., m and q = 1, 2, ..., z, and each portion of Seq x and Seq y is contained in one particular cell of the matrix M. This assignment is illustrated in Figure 4. The distribution pattern is as shown in Figures 4 and 5.
According to the Needleman-Wunsch Algorithm the last row will be calculated first. So the last row
is given to the first processor or the root node of the system. In accordance with the Needleman-Wunsch
Algorithm the timing diagram is as shown in Figure 6.
The generalized equations are as shown below. The two sequences can be divided into a number of
smaller parts. This can be explained with the example given below. Let us consider two sequences Seq x and Seq y, where Seq x = GCCTC and Seq y = GCTAC.

Figure 4. Illustration of the computational dependency of the element (p,q) in the M matrix

The length of Seq x is 5 and the length of Seq y is 5; therefore the parts into which each sequence is divided must sum to a total length of 5. From the above example we can write the generalized equations as shown in Equation (1).
x1 + x2 + x3 + ... + xn-1 + xn = x
y1 + y2 + y3 + ... + yn-1 + yn = y        (1)

From the timing diagram, the generalized equation for the load on each processor can be derived, as given in Equation (2):

xn = [xn-1 yn-1 En-1 - 2Cn-1 xn-1] / (yn En),  for n = 2, 3, ..., m        (2)

The load that is given to the first processor or the root node is given by equation (3).
x1 = x / [1 + Σ (from n = 2 to m) (y1 E1 - 2C1) / (yn En)]        (3)

The total completion time for the alignment of the two sequences can be given by

T(m) = x y E1 + Σ (from i = 2 to m) Ei + 2C(m - 1)        (4)

Figure 5. Illustration of load distribution

To enhance the understanding of the performance of the Needleman-Wunsch Algorithm and the divisible load strategy, a single machine is used as the baseline (Wong & Veeravalli, 2005). Therefore, the speedup can be given by

Speedup = T(1) / T(m)        (5)

where T(m) is the processing time of our strategy on a system using m processors. T(1) is the processing time using a single processor and is given by

T(1) = x y E1        (6)

RESULTS AND DISCUSSION


This section presents the evaluation of the results of the load scheduling technique for sequence alignment. The results have been tabulated for the single level tree network with constant link speed and also with variable link speed. In the experimental results the sequence lengths have been varied from 100 to 1000
characters.

Figure 6. Timing diagram

Figure 7. Variable link speed


Figure 8. Variable link speed 3-D graph

Table 1. Processing time variations for variable link speed


Number of processors        Processing Time (Sec)
                            166915.49
                            62671.53
                            33498.78
10                          20163.56
20                          11250.11
30                          6804.34
40                          4493.82
50                          3186.41
60                          2387.78
70                          1868.71
80                          1514.16
90                          1262.08
100                         1076.88


Figure 9. Graph for Variable Link Speed

Variable Link Speed


This section briefly discusses how the processing time changes when the link speed is varied. The graphs have been plotted for two ranges of link speed variation: experiments have been conducted with the link speed varied from 1-10 nanoseconds and from 1-100 nanoseconds. From the graphs it can be observed that the processing time depends on the communication link speed C; in other words, the higher the link speed of the network, the faster the job is processed. The link speed has been varied using a random number generator in Java. The results and tabulated values are shown in Figure 7, Figure 8 and Table 1.
Figure 7 plots Number of Processors vs. Processing Time, using the values given in Table 1, with the X-axis as the number of processors and the Y-axis as the processing time. This graph has been plotted for a constant sequence length of 1000. From Figure 7 it can be observed that, for a constant sequence length, the computation time decreases as the number of processors increases. This also reemphasizes the premise of DLT that as more processors are added to the network the processing time decreases.
Figure 8 demonstrates the 3-D representation of how the processing time varies with respect to the
length of the sequence and the number of processors. From the 3-D graph for the single level tree network it can be observed that, keeping the length of the sequence constant, the processing time decreases as the number of processors increases. On the other hand, it can also be observed that when the number of processors is kept constant and the length of the sequences increases, the computation time increases. As discussed in the Problem Formulation section, the speedup has been calculated and the values are tabulated for a constant sequence length of 1000.
Figure 9 represents the graph for Length of Sequence vs. Processing Time for a constant number of processors (m = 100), in which the X-axis represents the length of the sequence and the Y-axis represents the processing time. According to divisible load theory, for a constant number of processors, the processing time increases as the length of the sequence increases because each processor in the system carries more load. However, from the graph it can be seen that the processing time does not increase steadily. This can be attributed to the communication link, as the processing time is dependent on the communication link speed. From the results it can be concluded that the greater the communication link speed, the lower the processing time for the sequences.

Figure 10. Graph for constant link speed

Figure 11. 3-D graph for constant link speed

Table 2. Processing time variations for constant link speed

Number of processors        Processing Time (Sec)
                            165000
                            62500
                            33400
10                          20100
20                          11200
30                          6790
40                          4480
50                          3170
60                          2380
70                          1864
80                          1511
90                          1250
100                         1074

Constant Link Speed


This section examines how the processing time varies when the link speed is held constant. In the graphs shown in Figure 10 and Figure 11, the link speed (C) has been taken as 5 nanoseconds. The results are discussed below.
Figure 10 plots the number of processors versus the processing time, with the number of processors on the X-axis and the processing time on the Y-axis, for a constant sequence length of 1000. From the graph it can be observed that, for a constant sequence length, the computation time decreases as the number of processors increases. This again supports the premise of DLT that the processing time decreases as more processors are added to the network.
Figure 11 is a 3-D representation of how the processing time varies with respect to the length of the sequence and the number of processors. From the 3-D graph of the single-level tree network it can be observed that, keeping the length of the sequence constant, the processing time decreases as the number of processors increases. On the other hand, when the number of processors is kept constant and the length of the sequences increases, the computation time increases. As discussed in the Methodology section, the speedup has been calculated and the values tabulated for a constant sequence length of 1000.


Figure 12. Number of processors vs. processing time for a constant length of sequence

Figure 12 plots the number of processors versus the processing time for a constant length of sequence, with the number of processors on the X-axis and the processing time on the Y-axis. According to divisible load theory, for a constant sequence length the processing time should decrease as the number of processors increases, since the load is distributed among all the processors added to the system. From the graph, however, it can be observed that the processing time does not decrease monotonically. This can be attributed to the communication link speed, on which the processing time depends: the greater the communication link speed, the smaller the processing time.

CONCLUSION
This chapter presented a method for the alignment of two biological sequences following the divisible load scheduling paradigm (DLT). A parallel solution in a single-level tree network has been proposed, with the communication delays assumed to be non-zero. We adopted the Needleman-Wunsch algorithm for aligning the two biological sequences. Following Divisible Load Theory (DLT), we can determine the number of residues that should be assigned to each processor in the network.
The approach presented in this chapter is as follows. First we defined a matrix S of order x*y, where x is the length of the first sequence and y is the length of the second sequence. We then derived the M matrix, which gives the final values from which the sequences can be aligned. We derived the equations that determine the sizes of the sub-matrices according to the processor speeds, where it is assumed that all processors have equal speeds and the communication speeds are varied. With these constraints the equations have been derived and the graphs have been plotted.
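To make the Needleman-Wunsch step concrete, here is a minimal Java sketch of the global-alignment score matrix on which matrices such as S and M are based; the match, mismatch, and gap scores are illustrative assumptions, and the partitioning of the matrix into sub-matrices across processors, which is the subject of this chapter, is deliberately omitted.

/** Minimal Needleman-Wunsch score-matrix fill for two sequences. */
class NeedlemanWunsch {
    static int[][] scoreMatrix(String a, String b, int match, int mismatch, int gap) {
        int x = a.length(), y = b.length();
        int[][] m = new int[x + 1][y + 1];
        for (int i = 1; i <= x; i++) m[i][0] = i * gap;   // leading gaps in b
        for (int j = 1; j <= y; j++) m[0][j] = j * gap;   // leading gaps in a
        for (int i = 1; i <= x; i++) {
            for (int j = 1; j <= y; j++) {
                int diag = m[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int up   = m[i - 1][j] + gap;              // gap in sequence b
                int left = m[i][j - 1] + gap;              // gap in sequence a
                m[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return m;   // m[x][y] holds the optimal global alignment score
    }
}

Tracing back from m[x][y] through the maximizing choices yields the actual alignment; in the distributed setting described in this chapter, each processor would fill only the portion of the matrix corresponding to the residues assigned to it.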
We evaluated the performance by varying the communication link speed over the 10-100 nanosecond range and by using a constant link speed of 5 nanoseconds. First, we considered the performance of our strategy when the communication link speed was maintained at a constant value of 5 nanoseconds; the results clearly demonstrated that, for a constant length of sequence, the processing time decreased consistently as the number of processors increased. Then the communication link speed was varied from 10-100 nanoseconds and the performance was observed. From the graph it can be observed that, for a variable link speed, the computation time decreases for a constant length of sequence. In certain cases the behavior of the graph was not uniform, which shows that the communication link plays a major role in the processing time of the sequence alignment.
Extensions to this work include deriving solutions that further decrease the computation time. This can be achieved by applying a multi-installment strategy and performing the analysis using the Needleman-Wunsch algorithm. The same problem of aligning biological sequences can also be applied to various types of networks. The alignment of biological sequences can further be solved using the Sellers algorithm (Huerta, Haseltine, & Liu, 2000) together with the load distribution strategy. Further work can also be carried out on aligning multiple sequences with various types of clustering strategies. The same strategy of aligning sequences can be extended to aligning multiple sequences using algorithms such as the Berger-Munson algorithm (Berger & Munson, 1991).

REFERENCES
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.
Autenrieth, F., Isralewitz, B., Luthey-Schulten, Z., Sethi, A., & Pogorelov, T. Bioinformatics and sequence alignment.
Bataineh, S., Hsiung, T.-Y., & Robertazzi, T. (1994). Closed form solutions for bus and tree networks of processors load sharing a divisible job. IEEE Transactions on Computers, 43(10), 1184-1196.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A., & Wheeler, D. L. (2000, October). GenBank. Nucleic Acids Research, 28(1), 15-18. doi:10.1093/nar/28.1.15
Berger, M. P., & Munson, P. J. (1991). A novel randomized iteration strategy for aligning multiple protein sequences. Computer Applications in the Biosciences, 7, 479-484.
Bharadwaj, V., Ghose, D., & Mani, V. (1995, April). Multi-installment load distribution in tree networks with delays. IEEE Transactions on Aerospace and Electronic Systems, 31(2), 555-567.
Bharadwaj, V., Ghose, D., & Robertazzi, T. G. (2003, January). Divisible load theory: A new paradigm for load scheduling in distributed systems. Cluster Computing, 6(1), 7-17. doi:10.1023/A:1020958815308
Cristianini, N., & Hahn, M. (2006). Introduction to computational genomics. Cambridge, UK: Cambridge University Press.
Drozdowski, M., Lawenda, M., & Guinand, F. (2006). Scheduling multiple divisible loads. International Journal of High Performance Computing Applications, 20(1), 19-30. doi:10.1177/1094342006061879
Drozdowski, M., & Lawenda, M. (2005). On optimum multi-installment divisible load processing in heterogeneous distributed systems (LNCS 3648, pp. 231-240). Berlin: Springer.
Fourment, M., & Gillings, M. R. (2008, February). A comparison of common programming languages used in bioinformatics. Bioinformatics (Oxford, England), 9.
Goad, W. B. (1987). Sequence analysis. Los Alamos Science, (Special Issue), 288-291.
Huerta, M., Haseltine, F., & Liu, Y. (2004, July). NIH working definition of bioinformatics and computational biology.
Jones, N. C., & Pevzner, P. A. (2004, August). An introduction to bioinformatics algorithms.
Likic, V. (2000). The Needleman-Wunsch algorithm for sequence alignment. The University of Melbourne, Australia.
Min, W. H., & Veeravalli, B. (2005, December). Aligning biological sequences on distributed bus networks: A divisible load scheduling approach. IEEE Transactions on Information Technology in Biomedicine, 9(4), 489-501.
Robertazzi, T. (2003). Ten reasons to use divisible load theory. IEEE Computer, 36(5), 63-68.
Trelles, O., Andrade, M. A., Valencia, A., Zapata, E. L., & Carazo, J. M. (1998, June). Computational space reduction and parallelization of a new clustering approach for large groups of sequences. Bioinformatics (Oxford, England), 14(5), 439-451. doi:10.1093/bioinformatics/14.5.439
Yap, T., Frieder, O., & Martino, R. (1998, March). Parallel computation in biological sequence analysis. IEEE Transactions on Parallel and Distributed Systems, 9(3), 283-294.

KEY TERMS AND DEFINITIONS


Bioinformatics: Bioinformatics is the application of information technology to the field of molecular
biology. Bioinformatics entails the creation and advancement of databases, algorithms, computational
and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.
Cluster Computing: A computer cluster is a group of linked computers, working together closely
so that in many respects they form a single computer. The components of a cluster are commonly, but
not always, connected to each other through fast local area networks. Clusters are usually deployed to
improve performance and/or availability over that provided by a single computer, while typically being
much more cost-effective than single computers of comparable speed or availability.


Computational Biology: Computational biology refers to hypothesis-driven investigation of a biological problem using computers, carried out with experimental or simulated data, with the primary goal of discovery and the advancement of biological knowledge.
Computer Networks: A computer network is a group of interconnected computers. Networks may
be classified according to a wide variety of characteristics.
Divisible Load Theory: Divisible load theory is a methodology involving the linear and continuous modeling of partitionable computation and communication loads for parallel processing.
Parallel Computing: Parallel computing is a form of computation in which many calculations are
carried out simultaneously, operating on the principle that large problems can often be divided into
smaller ones, which are then solved concurrently.
Sequence Alignment: In bioinformatics, a sequence alignment is a way of arranging the primary
sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of
functional, structural, or evolutionary relationships between the sequences.

ENDNOTE
1. http://www.ncbi.nlm.nih.gov/Genbank/


Chapter 37

Multi-Core Supported
Deep Packet Inspection
Yang Xiang
Central Queensland University, Australia
Daxin Tian
Tianjin University, China

ABSTRACT
Network security applications such as intrusion detection systems (IDSs), firewalls, anti-virus/spyware systems, anti-spam systems, and security visualisation applications are all computing-intensive applications. These applications all rely heavily on deep packet inspection, which examines the content of each network packet's payload. Today these security applications cannot cope with the speed of the broadband Internet that has already been deployed; that is, the available processing power lags far behind the available bandwidth. Recently the development of multi-core processors has brought more processing power. Multi-core processors represent a major evolution in computing hardware technology. While two years ago most network processors and personal computer microprocessors had a single-core configuration, the majority of current microprocessors contain dual or quad cores, and the number of cores on a die is expected to grow exponentially over time. The purpose of this chapter is to discuss the research on using multi-core technologies to parallelize deep packet inspection algorithms, and how such an approach will improve the performance of deep packet inspection applications. This will eventually give a security system the capability of real-time packet inspection and thus significantly improve the overall status of security on the current Internet infrastructure.

1. INTRODUCTION
The current Internet is facing many serious attacks, such as financial fraud, viruses and worms, distributed
denial of service attacks, spyware, and spam. Although many network security applications such as intrusion detection systems (IDS), anti-virus/spam systems, and firewalls have been proposed to control
the attacks, securing distributed systems and networks is still extremely challenging. There are unknown
threats and zero-day attacks (exploits released before the vendor's patch is available to the public) appearing every day, which place an impractical burden on network security systems. The key question here is: can we have real-time solutions that identify and eliminate attacks without excessive security and management overhead burdening the networks and computer systems? To deal with the rapidly evolving
threats today and more intelligent and automatic threats in the future, we urgently need new methods
that support network security applications, at all times and in real time, without causing performance
penalty to normal network and system operations.
A multi-core processor combines two or more independent cores into a single package composed of
a single integrated circuit (called a die), or more dies packaged together (Intel, 2007). Multi-core processors represent a major evolution in computing hardware technology. While two years ago most network
processors and personal computer microprocessors had single core configuration, the majority of the
current microprocessors contain dual or quad cores and the number of cores on die is expected to grow
exponentially over time (Johnson & Welser, 2005). As the price of multi-core processors keeps falling,
multi-core will eventually provide affordable processing power to support the real-time requirement of
network security applications.
Multi-core provides a network security application with more processing power from the hardware
perspective. However, there are still significant software design challenges that must be overcome. Today
the difficulty is not in building multi-core hardware, but programming it in a way that lets applications
benefit from the continued growth in CPU performance (Sutter & Larus, 2005). From the server or router
side, if the network security software is not fast enough, it may be unable to process every incoming packet, which would slow down the traffic. From the client side, it can also be very difficult to run network security applications without any interruption to normal applications, because these computing-intensive applications significantly slow down other simultaneously running applications.
Taking full advantage of the power of a multi-core processor requires an in-depth approach to realizing speedups by parallelizing the traditional deep packet inspection applications. In this chapter we discuss
the research direction of using multi-core processors to support real-time deep packet inspection applications. Section 2 introduces the related work in the parallel approaches to enhance the performance of
deep packet inspection applications. Section 3 presents our new system architecture of using multi-core
to support deep packet inspection applications. Section 4 presents the basic packet-level parallelization
and flow-level parallelization. Section 5 presents a new parallel string matching algorithm. Benefits of
using multi-core are discussed in Section 6. Section 7 concludes this chapter.

2. RELATED WORK
2.1 Development of Multi-Core Processors
In 1965, Gordon Moore observed an exponential growth in the number of transistors per integrated circuit and predicted that this trend would continue - a prediction known today as Moore's Law (Moore, 1965). In reality, the doubling of transistors every couple of years has been maintained for almost 40 years. However, scaling up the processor's frequency has become more difficult because of several recently revealed constraints. First, memory speeds are not increasing as quickly as processor logic speeds. The processor now takes more clock cycles to access memory than before, and the wasted clock cycles can nullify the benefits of frequency increases in the processor. Second, manufacturing experience has shown that smaller and denser transistors on chips need to be threaded together with ever-increasing lengths of wire interconnects. As these interconnects stretch from hundreds to thousands of meters in length on a single processor, path delays can offset the speed increases of the transistors. Finally, power density has become an unsustainable problem. The number of transistors per chip has increased significantly in recent years, and the resulting power consumption and generated heat have become a serious problem for processors.
Instead of developing chips that can run faster, processor designers are adding more cores and more
cache to provide comparable or better performance at lower power. Additional transistors are being leveraged to create more diverse capability, such as virtualization technology or security features as opposed
to driving to higher clock speeds. Multi-core processors are clocked at slower speeds and supplied with
lower voltage to yield greater performance per watt.
The development of multi-core processors has a significant impact on software applications. To take
advantage of multi-core, software requires migration to a multi-threaded software model and necessitates incremental validation and performance tuning. Although kernel or system threads managed by the
operating system can enhance the application performance, it is essential to have multiple user threads
maintained by programmers to improve the performance of traditional applications.

2.2 Parallel Network Security Applications on Multi-Core Processors


As the Internet traffic volumes and rates continue to race forward, it has become difficult for network
security applications to process network packets in real-time. Many network security applications
nowadays can process network packets at Mbps level. However, most network backbones and many
local network interfaces operate at Gbps level. To improve the performance of network security applications, most previous research focuses on parallelism with hardware approaches such as ASICs and
FPGAs (Dharmapurikar, Krishnamurthy, Sproull, & Lockwood, 2004; Hayes & Luo, 2007; Liu, Zheng,
Liu, Zhang, & Liu, 2006; Piyachon & Luo, 2006; Villa, Scarpazza, & Petrini, 2008). They require highly
deliberate and customized programming, which is directly at odds with the pressing need to perform
diverse, increasingly sophisticated forms of analysis. In (Paxson et al., 2006) the authors argued that it is
time to fundamentally rethink the nature of using hardware to support network security applications.
Previously, efforts in multi-core software design have focused primarily on simultaneous multithreading (SMT) (Eggers et al., 1997; Tullsen, Lo, Eggers, & Levy, 1999) at a low level, which permits multiple independent threads of execution to better utilize the resources provided by microprocessor architectures. Most current research is still focused on automatically mapping general-purpose applications
onto multi-core systems with instruction, data, or thread level parallelization techniques (Sohi, Breach,
& Vijaykumar, 1995; Taylor et al., 2004; Yan & Zhang, 2007) or relying on virtualization technologies
such as VMware (VMware, 2008). Most of them are essentially extensions of techniques for shared-memory
multiprocessors and can only execute coarse-grained threads. Network security applications have their
own unique behavioral characteristics such as frequent memory or disk access, complex data structures,
and high bandwidth and high speed requirements. There is a distinct mismatch between current multi-core hardware development and the high performance demanded by network security applications. There
has been very little preliminary research done in this area (Paxson, Sommer, & Weaver, 2007; Qi et al.,
2007). In short, there is an imperative need for specifically re-designing network security applications
from a software perspective based on multi-core hardware architecture.


2.3 Parallel Deep Packet Inspection


Deep packet inspection refers to the process of checking packet payload and header in a network device.
The applications of deep packet inspection include, for example, network security applications that filter out packets containing certain malicious Internet worms or computer viruses; content-based billing
systems that analyze media files and bill the receiver based on the material transferred over the network;
and content forwarding applications that look at the hypertext transport protocol headers and distribute
the requests among the servers for load balancing (Dharmapurikar et al., 2004). Contrastingly, shallow
packet inspection refers to the process of only checking packet header in a network device. Deep packet
inspection requires much more processing power than shallow packet inspection.
Most deep packet inspection applications have a common requirement for string matching. For example, the presence of a string of certain byte sequences in packet payloads can identify the presence
of a virus, such as the well-known Internet worms Nimda, Code Red, and Slammer. One requirement
of deep packet inspection applications is that the applications must be able to detect strings of different
lengths starting at arbitrary locations in the packet payload because the location of such strings in the
packet payload and their length is normally unknown. The other requirement of deep packet inspection
applications is that they must be able to process network packets at line speed; otherwise they will cause delays in network traffic, or the deep packet inspection will be incomplete.
Dharmapurikar (Dharmapurikar et al., 2004) described a technique based on Bloom filters for detecting predefined signatures (a string of bytes) in the packet payload. A Bloom filter is a data structure for
representing a set of strings in order to support membership queries. It used parallel hardware Bloom
filters to isolate all packets that potentially contain predefined signatures. Another independent process
eliminates false positives produced by Bloom filters. The authors implemented a prototype system in a
Xilinx XCV2000E FPGA, using the Field-Programmable Port Extender (FPX) platform.
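As a software illustration of the Bloom-filter idea (the cited work uses parallel hardware filters on an FPGA, not this code), the following minimal Java sketch shows signature membership testing with possible false positives but no false negatives; the hash construction is an assumption chosen purely for readability.

import java.util.BitSet;

/** Minimal Bloom filter over byte-string signatures. */
class BloomFilter {
    private final BitSet bits;
    private final int size, hashes;

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(byte[] signature) {
        for (int i = 0; i < hashes; i++) bits.set(index(signature, i));
    }

    /** May return false positives, never false negatives. */
    boolean mightContain(byte[] data) {
        for (int i = 0; i < hashes; i++) if (!bits.get(index(data, i))) return false;
        return true;
    }

    private int index(byte[] data, int seed) {
        int h = seed * 0x9E3779B9;                 // simple seeded mixing, illustrative only
        for (byte b : data) h = h * 31 + b;
        return Math.floorMod(h, size);
    }
}

A separate, exact matching pass would still be needed to eliminate the false positives, as in the prototype described above.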
Finite state machine approach is another popular method in deep packet inspection. Tripp (Tripp,
2006) described a finite state machine approach to string matching for an intrusion detection system. By
splitting the search strings into multiple interleaved substrings and by combining the outputs from the
individual finite state machines in an appropriate way it can perform string matching in parallel across
multiple finite state machines. A VHDL model of a string matching engine based on the above ideas
has been developed on a Xilinx XC2V250-6 FPGA and tested via simulation. This implementation is
capable of matching up to 27 search strings in parallel, depending on the length of the strings.

3. SYSTEM DESIGN
The idea of using multi-core processors to enhance the performance of network security applications is
promising. However, the research in this area is just emerging and thus requires intensive exploration.
It faces many challenges such as

How can we actually use multi-core to continue running the network security applications while
keeping the overall system performance?
How can we efficiently partition and distribute the workload of network security applications
between the different cores in the multi-core processor?
How can we split network data and solve the data dependency problem?
As multi-core uses shared off-chip memory, how can we utilize the memory intelligently so that it incurs fewer memory access latencies?
How can we synchronize and coordinate the different threads of an application when it is parallelized on multi-core?

Figure 1. System architecture of using multi-core processors to support deep packet inspection

To best address these challenges, we propose the new system architecture shown in Figure 1. The essential ability of this architecture is that it can process network packets in parallel and thus meet the real-time requirement. As shown in the figure, the multi-processing scheduler coordinates and distributes the workload to different cores. The information from packets, events, flows, and messages is processed in the multi-core processor in parallel. The processor has spare cores to run other applications.
As is illustrated in Figure 1, the proposed architecture must be able to process network packets at line
speed. In other words, it must keep its performance in normal applications, such as forwarding packets in a router. To fully utilize the potential of multi-core, this system architecture will use different levels of parallelization, such as instruction-level parallelization, memory parallelization, loop-level parallelization, and fine-grained thread-level parallelization. High performance can be achieved through interaction
between algorithms, strategies, and architectural design, from high-level decisions on data allocation and
task partitioning to low-level micro-architectural decisions on instruction selection and scheduling. For
each network security application, we also need to identify what the potential bottlenecks are and how
to possibly avoid them. The potential bottlenecks could be packet processing, data normalization, data
correlation, pattern generation, and pattern matching in such a parallel computing environment.
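As a concrete, hedged illustration of the scheduling layer in Figure 1, the Java sketch below keeps one worker queue per core and routes each task by a caller-supplied key; the MultiCoreScheduler and dispatch names are invented here for illustration and are not part of the chapter's system.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** One single-threaded worker per core; tasks are routed by an integer key. */
class MultiCoreScheduler {
    private final ExecutorService[] workers;

    MultiCoreScheduler(int cores) {
        workers = new ExecutorService[cores];
        for (int i = 0; i < cores; i++) workers[i] = Executors.newSingleThreadExecutor();
    }

    /** Route the task to the core selected by the key (packet number, flow hash, ...). */
    void dispatch(int key, Runnable inspectionTask) {
        workers[Math.floorMod(key, workers.length)].execute(inspectionTask);
    }

    void shutdown() {
        for (ExecutorService w : workers) w.shutdown();
    }
}

Read this way, the packet-level and flow-level schemes evaluated in Section 4 differ mainly in how the dispatch key is chosen.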

4. PACKET-LEVEL AND FLOW-LEVEL PARALLELIZATIONS


As an instance of the aforementioned multi-core based deep packet inspection architecture, we test two levels of parallelization on multi-core: packet-level and flow-level parallelization. As current network traffic speeds and volumes increase at an exponential rate, the processing speed required by a deep packet inspection application is high.

Figure 2. Experiment environment

In this experiment, we evaluate the performance of multi-core
parallel deep packet inspection methods based on Snort (Roesch, 1999), an open source intrusion detection system. The test environment is shown in Figure 2. To simulate the large volumes of network
traffic, a network traffic generator is used to test the capability of the intrusion detection system Snort to
handle continuous high traffic loads. We use TG2.0 (McKenney, Lee, & Denny, 2008) to generate high
volumes of network traffic. A dual-core computer with a 2.26GHz Intel Pentium processor and 512MB
RAM is used in the experiment. The network adaptors used are 10/100M PCI Ethernet adaptors.
We first use packet-level parallelization to evaluate the multi-core supported intrusion detection system's performance on deep packet inspection. The TG client is set to open a UDP socket and send packets to the TG server listening at 192.168.10.1 on port 4322, with a packet data length of 576 bytes and a packet count of 2000. On the multi-core system, the odd-numbered packets are processed by one core and the even-numbered packets by another. The detailed procedure is specified in the basic algorithm in Figure 3.
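A minimal sketch of this packet-level scheme, reusing the illustrative MultiCoreScheduler from Section 3: packets are assigned to the two cores by the parity of their sequence number, and the inspect callback stands in for Snort's per-packet detection routine (all class and method names here are assumptions, not Snort's API).

import java.util.List;
import java.util.function.Consumer;

/** Packet-level parallelization sketch: even-numbered packets go to core 0,
 *  odd-numbered packets go to core 1, as in the basic algorithm described above. */
class PacketLevelDispatcher {
    static void run(List<byte[]> packets, Consumer<byte[]> inspect) {
        MultiCoreScheduler scheduler = new MultiCoreScheduler(2);   // dual-core testbed
        int seq = 0;
        for (byte[] pkt : packets) {
            scheduler.dispatch(seq++, () -> inspect.accept(pkt));   // seq % 2 selects the core
        }
        scheduler.shutdown();
    }
}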
The testing results under different inter-packet transmission times (0.02 seconds corresponds to 1/0.02 = 50 packets/sec) are shown in Figure 4, and the dropping rate (the percentage of dropped packets among all packets) is shown in Figure 5.

Figure 3. Algorithm of packet-level parallelization

Figure 4. The number of packets analyzed on packet-level parallelization

The system needs to capture incoming packets from the network adapter and analyze them for possible attacks. Depending on the packet capturing and analyzing speed (the processing speed) and the speed of incoming packets (the network speed), the system may be able to process all incoming packets or may have to drop some. If the processing speed is slower than the network speed (which may happen when the system is under heavy load), the system may drop packets and thus lose useful information, which increases the false positive and false negative rates. The figures show that if the system is parallelized at packet level using more than one core, the dropping rate is slightly decreased and the number of analyzed packets is correspondingly increased. This demonstrates that multi-core can increase the processing speed of the whole system.
For packet-level parallelization to be practical, no resource should share common information across separate packets. Although deep packet inspection applications like Snort receive input packet by packet, they must aggregate distinct packets into flows, such as TCP streams, to prevent an attacker from disguising malicious communications by splitting the data across several packets. Since packets from one flow will not affect the states of another flow, different flows can be processed by independent processing threads with no constraints on ordering. To test the flow-level parallelization performance on multi-core, we conduct another experiment based on flow-level parallelization.

Figure 5. Dropping rate on packet-level parallelization

Figure 6. Algorithm of flow-level parallelization

One TG client is set to open a UDP socket and send packets to the TG server listening at 192.168.10.1 on port 4322, with a packet data length of 576 bytes and a packet count of 1000; another TG client is set to open a UDP socket and send packets to the TG server listening at 10.10.10.1 on port 5322, with the same packet data length and packet count. Of the two Snort instances running on the multi-core processor, one analyzes the packets belonging to the 192.168.10.0 class C network and the other analyzes those of the 10.10.10.0 class C network. The detailed procedure is specified in algorithm 2 in Figure 6.
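A corresponding sketch of the flow-level dispatch key: deriving a stable key from the flow identifier keeps every packet of a flow, and hence any per-flow state such as stream reassembly, on one core. The hash-based key is an assumption for illustration; the experiment above simply splits traffic by destination class C network.

import java.util.Objects;

/** Flow-level dispatch: one stable key per flow keeps that flow on a single core. */
class FlowLevelDispatcher {
    static int flowKey(String srcIp, String dstIp, int srcPort, int dstPort) {
        return Objects.hash(srcIp, dstIp, srcPort, dstPort);
    }
}
// Usage with the earlier scheduler sketch:
//   scheduler.dispatch(FlowLevelDispatcher.flowKey("192.168.10.5", "192.168.10.1", 40000, 4322), task);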
The testing results are shown in Figure 7 and Figure 8. From the experiments we find that flow-level parallelization based on multi-core analyzes packets faster than a single-core system. However, the dual-core system does not achieve a twofold speedup. The experiment shows that the parallel strategies can improve deep packet inspection performance. From the comparison between packet-level parallelization and flow-level parallelization we also find that flow-level parallelization achieves similar performance speedups to packet-level parallelization. However, it is a practical solution for meaningfully examining each packet in real deep packet inspection applications.

5. PARALLEL STRING-MATCHING
In the above experiments we find that the parallel algorithms at packet level and flow level running on multi-core can reduce the number of dropped packets and thus potentially increase the detection rate. In this chapter we do not discuss detection rate measurement, because the detection rate depends not only on the parallel algorithms employed but also on the detection algorithms used, such as neural networks, finite state machines, and so on. The improvement in detection rate brought by multi-core can be found in our other papers (Chonka, Zhou, Knapp, & Xiang, 2008; Tian & Xiang, 2008).
Parallel deep packet inspection applications face the following challenges: network traffic must be distributed over several hosts while avoiding overloading any of the cores, and the traffic distribution scheme should not negatively impact the application's ability to perform packet inspection. Deep packet inspection applications must be able to check the incoming payload against a rule set that contains the threat signatures. However, given the growing number of rules, it is becoming more difficult for these applications to perform inspections in real time without fast string matching mechanisms. In this section, we evaluate another parallel mechanism that focuses on fast multiple string matching.

Figure 7. The number of packets analyzed on flow-level parallelization

Figure 8. Dropping rate on flow-level parallelization

Currently, many string matching algorithms are widely used in different applications. String matching is used in data filtering and data mining to find selected patterns, for example in a newsfeed stream; in security applications to detect suspicious keywords, for example in Snort; and in DNA searching, by translating an approximate search into a search for a large number of exact patterns. Among these algorithms, the Aho-Corasick algorithm (Aho & Corasick, 1975) is a linear-time algorithm for this problem, based on an automaton approach. The drawback of this automaton approach is the large amount of memory space it requires. The Aho-Corasick algorithm and its extensions deal well with regular expression matching, but if the keyword set is large they are not practical because of the large number of states.

Figure 9. Algorithm of parallel string matching

The Aho-Corasick algorithm constructs a trie with a suffix-tree-like set of links from each node representing
a string (e.g. abc) to the node corresponding to the longest proper suffix (e.g. bc if it exists, else c if that
exists, else the root). It also contains links from each node to the longest suffix node that corresponds
to a dictionary entry; thus all of the matches may be enumerated by following the resulting linked list.
It then uses the trie at runtime, moving along the input and keeping the longest match, using the suffix
links to make sure that computation is linear. For every node that is in the dictionary and every link
along the dictionary suffix linked list, an output is generated.
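For reference, here is a compact Java sketch of the construction just described - a goto trie, failure (suffix) links computed breadth-first, and output sets inherited along those links. This is a generic textbook rendering of Aho-Corasick, not the chapter's implementation.

import java.util.*;

/** Minimal Aho-Corasick automaton: goto trie + failure links + output sets. */
class AhoCorasick {
    private final List<Map<Character, Integer>> next = new ArrayList<>();
    private final List<Integer> fail = new ArrayList<>();
    private final List<Set<String>> out = new ArrayList<>();

    AhoCorasick(Collection<String> patterns) {
        newState();                                   // state 0 is the root
        for (String p : patterns) {                   // phase 1: build the goto trie
            int s = 0;
            for (char c : p.toCharArray()) {
                s = next.get(s).computeIfAbsent(c, k -> newState());
            }
            out.get(s).add(p);
        }
        Deque<Integer> queue = new ArrayDeque<>();    // phase 2: BFS failure links
        for (int s : next.get(0).values()) { fail.set(s, 0); queue.add(s); }
        while (!queue.isEmpty()) {
            int r = queue.remove();
            for (Map.Entry<Character, Integer> e : next.get(r).entrySet()) {
                int s = e.getValue();
                char c = e.getKey();
                queue.add(s);
                int f = fail.get(r);
                while (f != 0 && !next.get(f).containsKey(c)) f = fail.get(f);
                Integer g = next.get(f).get(c);
                fail.set(s, (g == null || g == s) ? 0 : g);
                out.get(s).addAll(out.get(fail.get(s)));   // inherit dictionary-suffix outputs
            }
        }
    }

    private int newState() {
        next.add(new HashMap<>());
        fail.add(0);
        out.add(new HashSet<>());
        return next.size() - 1;
    }

    /** Return every pattern occurrence found in the text (one entry per match). */
    List<String> search(CharSequence text) {
        List<String> matches = new ArrayList<>();
        int s = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (s != 0 && !next.get(s).containsKey(c)) s = fail.get(s);
            Integer g = next.get(s).get(c);
            s = (g == null) ? 0 : g;
            matches.addAll(out.get(s));
        }
        return matches;
    }
}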
Parallel string matching offers a scalable method for inspecting packets in a high speed network environment. But these parallel methods typically distribute the arriving packets evenly across an array of processors, each having a copy of the complete policy. Given the rising number of rules (Snort, for example, contains thousands of rules), it follows that, for each packet, many more signature strings need to be checked against the packet's payload. For multiple string matching algorithms such as the Aho-Corasick algorithm, this not only requires a significant amount of memory for the state machine and preprocessing, but also increases the content matching time, and thus cannot keep pace with high speed networks.
In this experiment we first partition the pattern string set into small sets, then build an Aho-Corasick automaton on each small set, and finally run the matching algorithm in parallel. The detailed procedure is specified in algorithm 3 in Figure 9 and Figure 10.
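A sketch of the partition-and-parallelize idea behind algorithm 3, reusing the illustrative AhoCorasick class above: the pattern set is split into k groups, one automaton is built per group, and all automatons search the same payload concurrently. The round-robin partitioning and the class names are assumptions, not the chapter's exact scheme.

import java.util.*;
import java.util.concurrent.*;

/** Partition the pattern set into k groups and match a payload with all
 *  group automatons in parallel. */
class ParallelMatcher {
    private final List<AhoCorasick> automatons = new ArrayList<>();
    private final ExecutorService pool;

    ParallelMatcher(List<String> patterns, int k) {
        pool = Executors.newFixedThreadPool(k);
        for (int i = 0; i < k; i++) {
            List<String> group = new ArrayList<>();
            for (int j = i; j < patterns.size(); j += k) group.add(patterns.get(j));
            automatons.add(new AhoCorasick(group));    // one small automaton per group
        }
    }

    /** Search the payload with every automaton concurrently and merge the matches. */
    List<String> match(String payload) throws Exception {
        List<Future<List<String>>> futures = new ArrayList<>();
        for (AhoCorasick ac : automatons) futures.add(pool.submit(() -> ac.search(payload)));
        List<String> all = new ArrayList<>();
        for (Future<List<String>> f : futures) all.addAll(f.get());
        return all;
    }

    void shutdown() { pool.shutdown(); }
}

Smaller per-group automatons also mean smaller state machines, which is consistent with the state-number reductions reported in Tables 1-4 below.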
We evaluate the performance of the parallel string matching algorithm and compare it with the original
Aho-Corasick algorithm. To test the performance of the parallel string matching algorithm, the size of
the pattern set is 1000, and the length of each pattern is from 3 to 33 characters; the size of the test data is
3953KB. The original and parallel Aho-Corasick algorithms run on the Linux OS. We test the performance under different set numbers. If the set number is k, we partition the pattern set into k groups and
each group contains 1000/k strings. The results are shown in Table 1, Table 2, Table 3, and Table 4.
In each table, the state number reflects the amount of memory consumed. As multi-core supports thread-level parallelization, using multiple threads can significantly reduce the time consumed for string matching. Parallel mechanisms can also reduce the time for automaton building. The total state numbers of the parallel algorithms are shown in Figure 11. The speed of a parallel algorithm is determined by the pattern string set that finishes the searching process last. The total time comparison is shown

Figure 10. The progress of parallel Aho-Corasick algorithm

Table 1. One thread algorithm

            Automaton Building    String Matching    State Number    Total Time
Thread 1    404.86ms              12756.67ms         3519            13161.53ms

Table 2. Two parallel threads algorithm

            Automaton Building    String Matching    State Number    Total Time
Thread 1    79.49ms               8013.18ms          3060            8092.67ms
Thread 2    102.46ms              7405.39ms          3509            7507.85ms

Table 3. Four parallel threads algorithm

            Automaton Building    String Matching    State Number    Total Time
Thread 1    41.24ms               2843.80ms          2216            2885.04ms
Thread 2    72.14ms               2811.93ms          2978            2884.07ms
Thread 3    59.34ms               2942.18ms          2224            3001.52ms
Thread 4    57.83ms               2568.61ms          3211            2626.44ms


Table 4. Eight parallel threads algorithm

            Automaton Building    String Matching    State Number    Total Time
Thread 1    18.13ms               1538.62ms          1073            1556.75ms
Thread 2    16.04ms               1415.35ms          1347            1431.39ms
Thread 3    18.56ms               1294.76ms          1448            1313.32ms
Thread 4    39.85ms               1624.30ms          1749            1664.15ms
Thread 5    18.26ms               1454.73ms          1175            1472.99ms
Thread 6    33.98ms               1611.35ms          1390            1645.33ms
Thread 7    18.37ms               1439.90ms          1402            1458.27ms
Thread 8    21.98ms               1276.07ms          2004            1298.05ms

Figure 11. The total state numbers

Figure 12. The total time


in Figure 12. From the results we find that this multi-core supported parallel mechanism can speed up the Aho-Corasick string matching algorithm significantly.

6. DEEPER THINKING OF USING MULTI-CORE


From the above evaluation results we find that there are many benefits to using multi-core to support deep packet inspection applications. We summarize the benefits of a multi-core supported system architecture for deep packet inspection applications as high performance, comprehensiveness, intelligence, and scalability.
Firstly, traditional deep packet inspection applications are based on serial or very limited parallel execution of packet processing (Dharmapurikar et al., 2004). For example, a traditional single-threaded network-based intrusion detection system logs the activities it finds to a safeguarded database and detects whether the events match any malicious event recorded in the knowledge base. It must read packet-level information and process it on the processor serially. Traditional anti-virus systems and network visualizers also rely heavily on serially reading packets, files, or logged data. The serial execution performance largely depends on the clock frequency of a single CPU, which has not improved much in recent years. Therefore, these applications cannot process large amounts of packets in real time. If we can parallelize them at the application level instead of relying on operating system level parallelization, then the workload of network security applications can be distributed to different cores to achieve high performance in terms of latency, throughput, and CPU utilization.
Secondly, multi-core supported deep packet inspection applications can provide comprehensive
protection against different threats. Currently, if a router performs deep packet inspection, for example,
to check for a certain virus signature in the packets' payloads, its forwarding capability will be significantly
affected. Most network providers cannot afford to slow down traffic to perform such security operations. Another fact is that current computer systems can only separately run a single network security
application at a time because these computing-intensive network security applications exclusively occupy CPU time. With the support from multi-core and application level parallelization, these computing tasks can be divided into many threads and distributed to different cores for processing. Thus if we
have enough cores, the network security applications will then be virtually invisible to users because
other applications still have free cores to perform their tasks. This enables comprehensive protection as
it can integrate as many modules (such as intrusion detection module, anti-virus module, and anti-spam
module) as necessary.
Thirdly, with the support from multi-core, deep packet inspection applications will have greatly improved intelligence compared to traditional applications, because we can employ many computing-intensive methods to perform packet inspection, classification, and anomaly detection. In (Xiang & Zhou, 2006) we tested the performance of using a neural network to detect attack packets with the aid of packet marking schemes. It has advantages such as a high detection rate and a low false positive rate because it relies on a more intelligent method than signature matching. However, it also has the limitation of a long training time, and thus cannot provide real-time protection. How to improve the performance of the intrusion detection system by utilizing the power of multi-core therefore becomes critical to building a highly intelligent deep packet inspection application.
Lastly, multi-core supported deep packet inspection applications are scalable. They can be used not only on network-level devices but also on end-host-level devices. As we know, pure network-based security applications cannot fully capture the profile of each end host. Therefore, in order to achieve the best protection, security checks must be performed on both network processing devices and end hosts. Many tasks that previously had to be done on infrastructure-level computing nodes can now be moved to the far end, the personal computers, which not only alleviates the load on the information infrastructure but also makes the security checks more meaningful. Conversely, many tasks that previously had to be done at the end-host level can now be performed at the infrastructure level, such as checking virus signatures, which can effectively prevent malicious code from reaching the end hosts. These applications are also customizable and scalable for different requirements, because switching different parallel applications on or off becomes easy with the support from multi-core.

7. CONCLUSION
In this chapter we present a new multi-core supported deep packet inspection architecture and an instance of the architecture. Leveraging the power of multi-core processors can be the answer to many as-yet-unsolved but crucial challenges in deep packet inspection applications, such as isolated security environments, real-time attack detection and attack packet filtering, and real-time visualization of network monitoring. It enables sophisticated and stateful network processing, rich in semantics and context, as a routine capability provided by a network's routers. The use of multi-core will support flexible recompilation of security software rather than redesign of hardware. It will provide significant benefits to the security of future distributed networks and systems.

REFERENCES
Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333-340. doi:10.1145/360825.360855
Amarasinghe, S. (2007). Multicore programming primer and programming competition. A course at
MIT, Cambridge, MA.
Chonka, A., Zhou, W., Knapp, K., & Xiang, Y. (2008). Protecting information systems from ddos attack
using multicore methodology. Proceedings of IEEE 8th International Conference on Computer and
Information Technology.
Dharmapurikar, S., Krishnamurthy, P., Sproull, T. S., & Lockwood, J. W. (2004). Deep packet inspection
using parallel bloom filters. IEEE Micro, 24(1), 52-61. doi:10.1109/MM.2004.1268997
Eggers, S., Emer, J., Levy, H., Lo, J., Stamm, R., & Tullsen, D. (1997). Simultaneous multithreading: A
platform for next-generation processors. IEEE Micro, 17(5), 12-19. doi:10.1109/40.621209
Hayes, C. L., & Luo, Y. (2007). Dpico: A high speed deep packet inspection engine using compact finite
automata. Proceedings of ACM/IEEE ANCS, (pp. 195-203).
Intel (2007). Intel multi-core: An overview.
Johnson, C., & Welser, J. (2005). Future processors: Flexible and modular. Proceedings of 3rd IEEE/ACM/
IFIP International Conference on Hardware/Software Codesign and System Synthesis, (pp. 4-6).


Liu, H., Zheng, K., Liu, B., Zhang, X., & Liu, Y. (2006). A memory-efficient parallel string matching
architecture for high-speed intrusion detection. IEEE Journal on Selected Areas in Communications,
24(10), 1793-1804. doi:10.1109/JSAC.2006.877221
McKenney, P. E., Lee, D. Y., & Denny, B. A. (2008). Traffic generator software release notes.
Moore, G. (1965). Cramming more components onto integrated circuits. Electronics Magazine, 38(8).
Paxson, V., Asanović, K., Dharmapurikar, S., Lockwood, J., Pang, R., Sommer, R., et al. (2006). Rethinking hardware support for network analysis and intrusion prevention. Proceedings of the 1st conference
on USENIX Workshop on Hot Topics in Security.
Paxson, V., Sommer, R., & Weaver, N. (2007). An architecture for exploiting multi-core processors to
parallelize network intrusion prevention. Proceedings of IEEE Sarnoff Symposium.
Piyachon, P., & Luo, Y. (2006). Efficient memory utilization on network processors for deep packet
inspection. Proceedings of ACM/IEEE ANCS, (pp. 71-80).
Qi, Y., Xu, B., He, F., Yang, B., Yu, J., & Li, J. (2007). Towards high-performance flow-level packet
processing on multi-core network processors. Proceedings of 3rd ACM/IEEE Symposium on Architecture
for Networking and Communications Systems, (pp. 17-26).
Roesch, M. (1999). Snort - lightweight intrusion detection for networks. Proceedings of 13th USENIX
LISA Conference, (pp. 229-238).
Sohi, G. S., Breach, S. E., & Vijaykumar, T. N. (1995). Multiscalar processors. Proceedings of 22nd
Annual International Symposium on Computer Architecture, (pp. 414-425).
Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue; Tomorrows
Computing Today, 3(7), 54-62. doi:10.1145/1095408.1095421
Taylor, M. B., Lee, W., Miller, J., Wentzlaff, D., Bratt, I., Greenwald, B., et al. (2004). Evaluation of
the raw microprocessor: An exposed-wire-delay architecture for ilp and streams. Proceedings of 31st
Annual International Symposium on Computer Architecture, (pp. 2-13).
Tian, D., & Xiang, Y. (2008). A multi-core supported intrusion detection system. Proceedings of IFIP
International Conference on Network and Parallel Computing.
Tripp, G. (2006). A parallel string matching engine for use in high speed network intrusion detection
systems. Journal in Computer Virology, 2(1), 21-34. doi:10.1007/s11416-006-0010-4
Tullsen, D., Lo, J., Eggers, S., & Levy, H. (1999). Supporting fine-grain synchronization on a simultaneous multithreaded processor. Proceedings of the 5th International Symposium on High Performance
Computer Architecture, (p. 54).
Villa, O., Scarpazza, D. P., & Petrini, F. (2008). Accelerating real-time string searching with multicore
processors. IEEE Computer, 41(4), 42-50.
VMware. (2008).


Xiang, Y., & Zhou, W. (2006). Protecting information infrastructure from ddos attacks by mark-aided
distributed filtering (madf). International Journal of High Performance Computing and Networking,
4(5/6), 357-367. doi:10.1504/IJHPCN.2006.013491
Yan, J., & Zhang, W. (2007). Hybrid multi-core architecture for boosting single-threaded performance.
ACM SIGARCH Computer Architecture News, 35(1), 141-148. doi:10.1145/1241601.1241603

KEY TERMS AND DEFINITIONS


Deep Packet Inspection: Deep Packet Inspection (DPI) is a form of computer network packet filtering that examines the data and/or header part of a packet as it passes an inspection point, searching for
protocol non-compliance, viruses, spam, intrusions or predefined criteria to decide if the packet can pass
or if it needs to be routed to a different destination, or for the purpose of collecting statistical information.
This is in contrast to shallow packet inspection which just checks the header portion of a packet.
High-Performance Security Systems: High-performance security systems refer to the software or
hardware systems that perform security functions at high performance in terms of processing speed,
data, or throughput.
Intrusion Detection: Intrusion detection is the act of detecting actions that attempt to compromise
the confidentiality, integrity or availability of a resource.
Multi-Core: Multi-core represents a major evolution in the development of processor. A multi-core
processor (or chip-level multiprocessor, CMP) combines two or more independent cores (normally
a CPU) into a single package composed of a single integrated circuit (IC), called a die, or more dies
packaged together.
Network Security: Network security consists of the provisions made in an underlying computer network infrastructure, the policies adopted by the network administrator to protect the network and the network-accessible resources from unauthorized access, and the consistent and continuous monitoring and measurement of their effectiveness (or lack thereof).
Parallel Algorithms: Parallel algorithms are algorithms that can be executed a piece at a time on many
different processing devices, and then put back together again at the end to get the correct result.
Router: Router is a networking device whose software and hardware are usually tailored to the tasks
of routing and forwarding information.


Chapter 38

State-Carrying Code for
Computation Mobility
Hai Jiang
Arkansas State University, USA
Yanqing Ji
Gonzaga University, USA

ABSTRACT
Computation mobility enables running programs to move around among machines and is the essence
of performance gain, fault tolerance, and system throughput increase. State-carrying code (SCC) is a
software mechanism to achieve such computation mobility by saving and retrieving computation states
during normal program execution in heterogeneous multi-core/many-core clusters. This chapter analyzes
different kinds of state saving/retrieving mechanisms for their pros and cons. To achieve a portable, flexible
and scalable solution, SCC adopts the application-level thread migration approach. Major deployment
features are explained and one example system, MigThread, is used to illustrate implementation details.
Future trends are given to point out how SCC can evolve into a complete lightweight virtual machine.
New high productivity languages might step in to raise SCC to language level. With SCC, thorough
resource utilization is expected.

INTRODUCTION
The way in which scientific and engineering research is conducted has radically changed in the past two
decades. Computers have been used widely for data processing, application simulation and performance
analysis. As application programs' complexity increases dramatically, powerful supercomputers are in
demand. Due to the cost/performance ratio, computer clusters are commonly utilized and treated as
virtual supercomputers. Such computing environments can be easily acquired for scientific and engineering applications.
For each individual computer node, multi-core/many-core architecture is becoming popular in the
computer industry. In the near future, hundreds and thousands of cores might be placed inside of computer
State-Carrying Code for Computation Mobility

nodes on server-clusters. Multi-core clusters are promising high performance computing platforms where
multiple processes can be generated and distributed across participating machines and the multithreading technique can be applied to take advantage of multi-core architecture on each node. This hybrid
distributed/shared memory infrastructure fits the natural layout of computer clusters.
Since computer clusters for high performance computing can change their configurations dynamically, i.e., computing nodes can join or leave the system at runtime, the ability to re-arrange running jobs is in demand in order to exploit otherwise wasted resources. Such dynamic rescheduling can optimize the execution of applications, utilize system resources effectively, and improve the overall system throughput. Since computation mobility, i.e., the ability to move computations around, is one of the essentials of this dynamic scheduling, it has become indispensable to scalable computing for the following outstanding features:

Load Balancing: Evenly distributing workloads over multiple cores/processors can improve the
whole computation's performance. For scientific applications, computations are partitioned into
multiple tasks running on different processors/computers. In addition to variant computing powers, multiple users and programs share the computation resources in non-dedicated computing
environments where load imbalance occurs frequently even though the workload was initially
distributed evenly. Therefore, dynamically and periodically adjusting workload distribution is
required to make sure that all running tasks at different locations finish their execution at the same
time in order to minimize total idle time. Such load reconfiguration needs to transfer tasks from
one location to another.
Load Sharing: From the system's point of view, load sharing typically increases the throughput of
computer clusters. Studies have indicated that a large fraction of workstations could be unused for
a large fraction of time. Scalable computing systems seek to exploit otherwise idle cores/processors and improve the overall system efficiency.
Data Locality: Sharing resources includes two approaches: moving data to computation or moving computation to data. Current applications favor data migration as in FTP, web, and Distributed
Shared Memory (DSM) systems. However, when computation sizes are much smaller than data
sizes, the code or computation migration might be more efficient. Communication frequency and
volume will be minimized by converting remote data accesses into local ones. In data intensive
computing, when client-server and RPC (Remote Procedure Call) infrastructure are not available,
computation migration is an effective approach to accessing massive remote data.
Fault Tolerance: Before a computer system crashes, local computations/jobs should be transferred
to other machines without losing most existing computing results. Computation migration and
checkpointing are effective approaches.

The computation migration feature has existed in some batch schedulers and task brokers, such as Condor
(Bricker, Litzkow, & Livny, 1992), LSF (Zhou, Zheng, Wang, & Delisle, 1993) and LoadLeveler (IBM
Corporation, 1993). However, they can only work at a coarse granularity (process) level and in homogeneous environments. So far, there is no effective task/computation migration solution in heterogeneous
environments. This has become a major bottleneck in dynamic schedulers and an obstacle for scalable computing to achieve high performance and effective resource utilization.
State-Carrying Code (SCC) is a software mechanism that intends to provide the ability of moving
computations around. It allows programs to contain both normal statements for application execution


and special primitives inserted by the SCC precompiler for portable computation state construction.
Such a program can stop on one machine, generate its computation state, and then restart on another
machine. SCC brings mobility to applications and addresses some critical issues including computation
granularity, virtualization level, computation/data locality, distributed synchronization, distributed data
sharing, and highly efficient data conversion.
The objective of this chapter is to introduce the relevant backgrounds, design strategies and limitations of SCC as well as its uses for computation mobility in heterogeneous environments. The next
section discusses the related research and their limitations. Then the strategies and design of SCC are
introduced to show how to support heterogeneous computation migration and take advantage of multi-core architecture. The future work and conclusion are given at the end.

Computation Mobility
Computation mobility concerns the movement of a computation starting at one node and continuing
execution at another network node in distributed systems. Such computation relocation requires the
movement of both programs and execution contexts. How to construct and reload the execution contexts
at runtime is the essence of computation mobility.
Sometimes data mobility might imply computation mobility when data is closely bound to computations and both of them have to be moved around together. In fact, such a scenario reflects the "data-computes" rule: a computation only works on its own data.
Computation mobility enabled systems can be classified by their computation granularities and
implementation levels. Candidates for computation units could be processes, threads, and user defined
objects. The engine of computation mobility can be placed at the data, language, application, library, virtual machine, kernel, or platform level. A variety of research activities have been reported in the
literature.

Process Migration
Process migration concerns the construction of process states. A process is an operating system abstraction
representing an instance of a running computer program. Each process contains its own address space,
program counter, stack, and heaps. All these dynamic factors form the process state which is normally
buried in operating systems.
Kernel-level process migration is supported by some distributed or networked operating systems.
Some operating systems such as MOSIX (Barak & Laadan, 1998), Sprite (Douglis & Ousterhout, 1991),
Mach (Accetta, et al., 1986), Locus (Walker, Popek, English, Kline, & Thiel, 1992), and OSF/1 (Zajcew, Roy, Black, & Peak, 1993) migrate the whole process images whereas others like Accent (Rashid
& Robertson, 1981) and V Kernel (Theimer, Lantz, & Cheriton, 1985) apply Copy-On-Reference
and precopying techniques to shorten the process freeze time. These operating systems can access
process states efficiently and support preemptive migration at virtually any point. Thus, they provide
good transparency and flexibility to end users. However, this approach brings much complexity into the kernel, and process images cannot be shared between different operating systems. The inability to provide a heterogeneous solution is a severe drawback of this approach.
User-level process migration as in Condor (Bricker, Litzkow, & Livny, 1992) and Libckpt (Plank,
Beck, Kinsley, & Li, 1995) achieves similar results without the kernel modification. Normally user
libraries are used and linked to the application at compile-time. Process state is constructed through
library calls. Since system calls are invoked in library calls to fetch process images from the kernel, this
approach only works on homogeneous platforms.
Application-level approach supports process migration in a heterogeneous environment since the
process state can be replicated in user programs. The Tui system (Smith & Hutchinson, 1998), MigThread
(Jiang & Chaudhary, 2004), SNOW (Chanchio & Sun, 2001), Porch (Ramkumar & Strumpen, 1997),
PREACHES (Ssu, Yao, & Fuchs, 1999), and Process Introspection (Ferrari, Chapin, & Grimshaw,
1997) apply source-to-source transformation to convert programs into semantically equivalent source
programs for saving and recovering process states across binary incompatible machines. Such an approach deliberately sacrifices transparency and reusability although pre-compilers/preprocessors are
usually available to improve transparency to a certain degree.
Since the original process states are buried in operating systems and no proper system calls are provided to fetch them easily, most process migration systems are not being widely used in open systems.
One solution is to duplicate process states in user space as in application or language approaches. The
state replicas are used instead for migration in heterogeneous environments.

Thread Migration
Threads are flows of control in running programs. One process might contain multiple threads which share the same address space, including text and data segments. However, each thread has its own stack and register context.
Thread migration enables fine-grained computation adjustment in parallel computing. As multi-threading becomes a popular programming practice, thread migration is increasingly important in fine-tuning high-end computing to fit dynamic and non-dedicated environments. Different threads can be
migrated to utilize different resources for load balancing and load sharing. The core of thread migration
is about transferring thread state and necessary data from the local heap to the destination.
Current thread migration research focuses on updating internal self-referential pointers in stacks and
heaps. Three approaches exist in the literature. The first approach uses language and compiler support
to maintain enough type information and identify pointers as in MigThread (Jiang & Chaudhary, 2004)
and Arachne (Dimitrov & Rego, 1998). The second approach requires scanning the stacks at run-time
to detect and translate the possible pointers dynamically. The representative implementation of this is
Ariadne (Mascarenhas & Rego, 1995). Since some pointers in stack cannot possibly be detected (Itzkovitz, Schuster, & Shalev, 1998), the resumed execution can be incorrect. The third approach is the
most popular one. It necessitates the partitioning of the address space and reservation of unique virtual
addresses for the stack of each thread so that the internal pointers maintain the same values. A common
solution is to preallocate memory space for threads on all machines and restrict each thread to migrate
to its corresponding location on other machines. This iso-address solution requires a large address
space and is not scalable since there are limitations on stacks and heaps (Itzkovitz, Schuster, & Shalev,
1998). Such systems include Millipede (Itzkovitz, Schuster, & Shalev, 1998), Amber (Chase, Amador,
Lazowska, Levy, & Littlefield, 1996), UPVM system (Casa, Konuru, Prouty, Walpole, & Otto, 1994),
PM2 (Antoniu & Bougé, 2001), Nomad system (Milton, 1998), and the one proposed by Cronk et al.
(Cronk & Mehrotra, 1997).
Based on the location, threads can be classified as kernel-, user-, and language-level threads. Kernel-level threads exist in operating systems and can be scheduled onto processors directly. User-level
threads are defined and scheduled by libraries in user space. Language-level threads are defined in a
programming language. For example, Java threads are defined in the Java language and implemented
in the Java Virtual Machine (JVM). According to thread types, migration systems have to fetch thread
states from different places and port them to different platforms. To our knowledge, only MigThread
(Jiang & Chaudhary, 2004) and Jessica2 (Zhu, Wang, & Lan, 2002) can support heterogeneous thread
migration. MigThread achieves this by defining its own data conversion scheme whereas Jessica2 relies
on modified JVMs.

Checkpointing
Checkpointing is the saving of computation state, usually in stable storage, so that it may be reconstructed
later. Therefore, the major difference between migration and checkpointing is the medium: memory-to-memory vs. memory-to-file transfer. Checkpointing may use most migration strategies.
Libckpt (Plank, Beck, Kinsley, & Li, 1995), PREACHES (Ssu, Yao, & Fuchs, 1999), Porch (Ramkumar & Strumpen, 1997), CosMic (Chung, 1997), and other user-directed checkpointing systems
save process states in stable storage, such as magnetic disks. The memory exclusion technique has been
employed effectively in incremental checkpointing, where pages are not checkpointed when they are
clean. Compiler-assisted checkpointing uses a compiler/preprocessor to ensure correct
memory exclusion calls for better performance.
For message passing and shared address space parallel computing applications, CoCheck (Stellner,
1996) and C3 (Bronevetsky, Marques, Schulz, Pingali, & Stodghill, 2004) manage to get clear-cut checkpoints which can be treated as computation states for migration. Hence, from the computation state's point of view, migration and checkpointing systems are equivalent.

Virtual Machines
To enable code portability in heterogeneous environments, Virtual Machine (VM) techniques are widely
used, for example in JVM (Lindholm & Yellin, 1999), VMware (VMware Inc., 1999), and Xen (Barham,
et al., 2003). They present the image of a dedicated raw machine to each user. Virtual machines allow
the configuration of an entire operating system to be independent from that of the physical resource; it
is possible to completely represent a VM guest machine by its virtual state and instantiate it in any
VM host. Therefore, VMs provide stable computing environments to programs.
Some migration systems, such as Jessica2 (Zhu, Wang, & Lan, 2002), can work on top of process VMs
to enable the migration in heterogeneous environments. Process VMs are used to interpret computation
states to hide low-level architecture variety. Since such process VMs play a role as data converter and
provide uniform platforms, states can be fetched in a unique way and heterogeneity issue is resolved
smoothly (Zhu, Wang, & Lan, 2002).
However, VMs do not support efficient computation mobility since they have difficulty distinguishing which resources a computation actually uses. A safe way is to wrap up the whole abstract view of the underlying physical machine. Obviously, the efficiency then drops dramatically and VMs themselves are not portable
across different physical hosts.


Mobile Agents
A mobile agent is a software object representing an autonomous computation that can travel in a network to perform tasks on behalf of its creator. It has the ability to interact with its execution environment, and to act asynchronously and autonomously upon it. Mobile agent code is typically written in an object-oriented context, and most existing mobile agent systems, including Charm++ (Kale & Krishnan, 1998), Emerald (Jul,
Levy, Hutchinson, & Blad, 1998), Telescript (White, 1996) and IBM Aglets (Lange & Oshima, 1998),
implement their agents in object-oriented languages such as Java and Telescript.
Mobile agents demand a different coding environment, including new language constructs, programming styles, compilers, and execution platforms. Although current mobile agent systems are intended for
general applications and have demonstrated some progress in Internet/mobile computing, it is still not
clear how they will perform for computation-intensive and high performance computing applications
with object-oriented technology.

Deployment of SCC
State-Carrying Code (SCC) takes advantage of multi-core SMP architecture, virtualizes computations,
and achieves better application performance in heterogeneous environments. Computation mobility is
the essential tool. Existing packages indicate that computation migration performance is mainly affected by the choice of computation unit and the location of computation states. SCC provides a flexible,
portable, practical, and efficient solution to computation mobility.

Granularity
The term granularity refers to the size of computation units which can move around individually. Normally it indicates the flexibility that mobility systems can provide.
Since Virtual Machines can only dump the whole system images, they support coarse-grained mobility.
All computations on the VMs will be transferred together and they cannot be distinguished explicitly. In
this case, sequential and parallel jobs are treated the same. This extreme migration case is efficient only when all of a VM's local jobs need to leave the current machine. VMs provide a virtually stable platform
for applications. However, VMs themselves are not portable over various physical machines. In VM
migration, applications are not aware of the movement, but obviously such coarse-grained migration
incurs high overhead and inflexibility.
From the traditional operating system's point of view, processes are the basic computational abstraction. A sequential computation executes in just a single process. In parallel computing, the overall
jobs need to be decomposed into multiple tasks. In multi-process parallel applications such as those
using MPI (Message Passing Interface), tasks are assigned to processes for parallel execution. In such
cases, processes are treated as computation units. Compared to VM migration, process migration has
its advantages in reducing overheads significantly without worrying about the execution environment.
It can also manipulate individual or partial applications. Parallel jobs with multiple processes can be
reconfigured dynamically. Different implementations of process migration exhibit various degrees of
transparency to programmers.
To reduce the heavy inter-process context switch overhead, many modern parallel computing applications adopt multi-threading techniques. Each process may contain multiple threads to share the same
address space. Former multi-processed applications can be replaced by multi-threaded counterparts.


Otherwise, former processes are further partitioned into multiple local threads to achieve finer task
decomposition. Synchronization overhead among computation units is further reduced. Sequential applications can be viewed as single-threaded instances whereas parallel ones consist of multiple threads.
Once threads are used as computation units, thread migration can move sequential jobs and entire or
partial parallel computations. Since process migration treats multi-threaded applications as a whole, it
will either move the whole computation or perform no migration at all. Such all-or-nothing scenarios
do not exist in thread migration. However, since some thread libraries are invisible to operating systems,
the difficulty of implementing thread migration is higher than that of process migration.
Charm++ (Kale & Krishnan, 1998), Emerald (Jul, Levy, Hutchinson, & Blad, 1998), and other
mobile agent systems provide mechanisms at the language level to migrate the user defined objects/
agents. However, new languages and compilers expose everything to programmers, and many legacy systems have to be re-deployed. The lack of transparency is the major drawback, so most systems have abandoned this approach.
Multi-core architecture has been widely adopted. To take advantage of the extra computing cores,
applications need to be implemented with multiple processes or threads. Normally they need to exchange or share data with each other, so multi-threading is the preferred choice. Also, all modern operating systems are multi-threaded, and threads in applications can be mapped onto kernel threads
naturally.
SCC follows this multi-threading trend and threads will be the basic units of computation mobility
although processes can be handled.

Positioning Computation States


The main issue in SCC and all other migration systems is how to retrieve computation states quickly
and precisely. Then the current computation can be stopped on the current machine and resumed on a
new machine based on its state. Computation states are buried or recreated at data, language, application, library, thread virtual machine, kernel, and platform virtual machine levels, as shown in Figure 1.
However, not all of them are suitable to user threads whose states consist of the execution contexts in
kernel and the thread contexts in thread libraries.
Platform Virtual Machine approach provides execution platforms such as VMware (VMware Inc.,
1999), and Xen (Barham, et al., 2003). These VMs can only deal with the whole environment, not the
fine-grained tasks such as threads.
Kernel-level approach is the most effective and transparent method for process migration. However,
it cannot deal with user threads and does not work in heterogeneous environments.
Thread Virtual Machine approach can only provide execution platform for running threads, such
as Jessica2 (Zhu, Wang, & Lan, 2002) in JVM (Lindholm & Yellin, 1999) which needs to be modified
to support Java thread migration. However, the requirement of distributing and installing the modified
JVMs makes applications only work in closed computer clusters.
User-level approach provides computation mobility function in user libraries (Bricker, Litzkow, &
Livny, 1992) (Plank, Beck, Kinsley, & Li, 1995). Unique library calls will be translated into different
system calls to fetch execution contexts inside operating systems. Normally it is not straightforward to
convert computation states from one system to another.
Figure 1. Implementation levels of computation mobility.

Application-level approach constructs computation states inside the source code for better portability and heterogeneity (Smith & Hutchinson, 1998) (Jiang & Chaudhary, 2004) (Chanchio & Sun, 2001).
Normally pre-compilers/preprocessors are provided for code transformation without programmers'
involvement. Computation units, such as processes or threads, are defined according to applications.
Flexibility is another major advantage of this approach.
Language-level approach requires new languages and compilers (Kale & Krishnan, 1998) (Jul, Levy,
Hutchinson, & Blad, 1998). It is hard to persuade programmers to learn new languages. Sometimes
data-level approach is applicable in regular scientific applications where computation states can be
represented by pure variable sets.
Due to the required characteristics in computer clusters, SCC selects the application-level approach
for portability, flexibility, heterogeneity and scalability.

Issues in Application-level Approach


There are two major issues in application-level thread migration:

Availability of Source Code

The major restriction of the application-level approach is the requirement of source code. Since thread
states need to be replicated in the programs, applications with only executable code cannot be transformed
for mobility. However, for high performance computing, users are normally also the programmers, so most of the time the source code is available.
To remove this restriction, the migration engine would have to be buried at lower levels, such as the user,
kernel and platform virtual machine levels. However, as we discussed before, approaches at those levels will not be able to utilize multi-core architecture effectively and performance improvement will be
limited. Therefore, application-level approach is the better choice if the source code is available.

Associated Overheads
In some fields, such as Internet Computing and Mobile Agents, prompt state construction is the key issue so that computations (agents) can start and stop quickly. However, in the application-level approach, the overhead of constructing computation states is relatively higher. Therefore, it is not suitable for Internet/Agent computing.

Figure 2. The Infrastructure of SCC.
SCC aims at high-end computing where most of the overhead lies in the scientific computing or business data processing itself (mainly floating-point operations or data searching). The cost to set up a state replica in the program is negligible. Many Grand Challenge problems in Computational Physics, Computational Chemistry, and Bioinformatics exhibit this characteristic. The work in MigThread (Jiang & Chaudhary, 2004) has demonstrated this phenomenon.

Infrastructure
SCC is designed to support both process and thread migration at the application level. Among many existing
application-level process migration systems, MigThread (Jiang & Chaudhary, 2004) covers the most
features required by SCC and supports thread migration to take advantage of multi-core architecture. In
this chapter, MigThread is used as an example to demonstrate major features in SCC.
SCC consists of two parts: a preprocessor (pre-compiler) and a run-time support module. The preprocessor is designed to transform users' source code into a format from which the run-time support module
can construct the computation state precisely and efficiently. The run-time support module constructs,
transfers, and restores computation states dynamically as well as provides other run-time safety checks,
as shown in Figure 2. Most of the time, user assistance is not required unless the preprocessor encounters third-party library calls it cannot resolve; manual support is then a necessity.
The preprocessor of MigThread is similar to the ones in other SCC systems and conducts the following tasks:

Information Collection: Collect related stack and heap data for future state construction. The stack data includes globally shared variables, local variables, function parameters, and program counters, etc., whereas the heap data contains dynamically allocated memory segments.
Tag Definition: Create tags for heterogeneous data blocks.
Position Labeling: Detect and label potential migration points.
Control Dispatching: Insert switch statements to orchestrate execution flows.
Safety Protection: Detect and overcome unsafe cases; seek human assistance/instruction for third-party library calls; and leave other unresolved cases to the run-time support module.

Its run-time support module consists of a thread record list (including globally shared data), a stack
management module, a memory block management module, and a pointer-casting closure. Since activation frames in the stack are arranged in last-in-first-out (LIFO) order, stacks are maintained in linked
lists. Meanwhile, heaps and PC Closures are maintained in red-black trees for random accesses.
The run-time support module is activated through primitives inserted by the preprocessor at compile
time. It is required to link this run-time support library with users' applications in the final compilation.
During the execution, its task list includes:

Stack Maintenance: Keep a user-level stack of activation frames for each thread.
Tag Generation: Fill out tag contents which are platform-dependent.
Heap Maintenance: Keep a user-level memory management subsystem for dynamically allocated
memory blocks.
Migration: Construct, transfer, and restore computation state.
Data Conversion: Translate computation states for destination platforms.
Safety Protection: Detect and recover remaining unsafe cases.
Pointer Updating: Identify and update pointers after migration or checkpointing.

State Construction
The state data typically consists of the process data segment, stack, heap and register contents. In SCC,
the computation state is in a platform-independent format to reduce migration restrictions. Therefore,
SCC does not rely on any type of thread library or operating system.
State construction is done by both the preprocessor and the run-time support module. The preprocessor collects globally shared variables, stack variables, function parameters, program counters, and
dynamically allocated memory regions into certain pre-defined data structures. Since the virtual address
spaces might be different, pointers are marked at compile time and updated at runtime. In many SCC
systems such as the Tui system (Smith & Hutchinson, 1998) and SNOW (Chanchio & Sun, 2001), related
variables are collected at the migration points to reduce the size of actual computation states. However,
due to pointer arithmetic operations, related variables are not always detected correctly.
Figure 3. The original function.

Figure 4. The transformed function in MigThread.

Figure 5. Tag definition and generation in MigThread.

MigThread puts variables in two predefined structures to speed up the state construction process. This approach can also discover hidden variables, although it might be over-conservative. Figures 3 and 4 show this process in MigThread. A simple function foo() is defined in Figure 3. In MigThread, the preprocessor transforms the function and generates a corresponding MTh_foo() shown in Figure 4. Non-pointer variables are collected in MThV whereas pointers are gathered in MThP (as shown in area 1). In thread stacks, each function's activation frame contains MThV and MThP to record the current function's computation state.
The program counter (PC) is a register that contains the memory address of the current execution point
within a program. Its content is represented as a series of integer values. In MigThread, it is declared
as MThV.stepno in each affected function. Since all possible positions for migration have been detected
at compile-time (as shown in area 4), different integer values of MThV.stepno correspond to different
adaptation points. In the transformed code, after the function initialization in area 2, a switch statement
is inserted to dispatch execution to each labeled point according to the value of MThV.stepno as shown
in area 3. The switch and goto statements help control jump to resumption points quickly.
SCC also supports user-level memory management for heaps. Eventually, all computation state related
contents, including stacks and heaps, are moved out to the user space and handled by SCC directly for
portability.
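Because Figures 3 and 4 are not reproduced here, the following C sketch illustrates, under stated assumptions, the flavor of the transformation described above. It is not MigThread's actual output: the structure layouts, labels, and helper calls (MTh_push_frame(), MTh_adapt_point(), MTh_malloc(), and so on) are hypothetical placeholders for the primitives the preprocessor would insert; only MThV, MThP, and MThV.stepno follow the description in the text.

    #include <stdlib.h>

    /* Hypothetical run-time support primitives; placeholder names, not MigThread's real API. */
    extern int  MTh_restarting;                      /* nonzero when resuming a migrated computation */
    extern void MTh_push_frame(void *v, size_t vsz, void *p, size_t psz);
    extern void MTh_pop_frame(void);
    extern int  MTh_restore_stepno(void);
    extern void MTh_adapt_point(void);
    extern void *MTh_malloc(size_t sz);
    extern void MTh_free(void *ptr);

    /* Original user function (cf. Figure 3). */
    int foo(int n)
    {
        int sum = 0;
        int *buf = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) {
            buf[i] = i;
            sum += buf[i];
        }
        free(buf);
        return sum;
    }

    /* Sketch of a transformed function (cf. Figure 4). */
    int MTh_foo(int n)
    {
        struct { int stepno; int n; int sum; int i; } MThV;  /* non-pointer variables (area 1) */
        struct { int *buf; } MThP;                           /* pointer variables (area 1)     */

        MTh_push_frame(&MThV, sizeof(MThV), &MThP, sizeof(MThP));

        /* Area 2: initialization; on restart the run-time module is assumed to have
         * already refilled MThV/MThP from the received computation state. */
        MThV.n = n;
        MThV.stepno = MTh_restarting ? MTh_restore_stepno() : 0;

        switch (MThV.stepno) {        /* area 3: dispatch to the resumption point */
        case 1: goto ADAPT_1;
        default: break;
        }

        MThV.sum = 0;
        MThP.buf = MTh_malloc(MThV.n * sizeof(int));   /* heap block tracked by the run-time module */
        for (MThV.i = 0; MThV.i < MThV.n; MThV.i++) {
            MThP.buf[MThV.i] = MThV.i;
            MThV.sum += MThP.buf[MThV.i];
    ADAPT_1:                                           /* area 4: a labeled potential adaptation point */
            MThV.stepno = 1;
            MTh_adapt_point();                         /* migrate or checkpoint here if the flag is set */
        }

        MTh_free(MThP.buf);
        MTh_pop_frame();
        return MThV.sum;
    }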

Data Conversion Schemes


Computation states can be transformed into pure data. If different platforms use different data formats, the
computation states constructed on one platform need to be interpreted by another. Thus, data conversion
is unavoidable. Some application-level migration systems only work in homogeneous systems without
the data conversion issue. Many SCC systems adopt a symmetric data conversion approach, which can
be easily implemented. Both the sender and receiver need to convert data to and from an intermediate
(universal) data format. Special compilers (Smith & Hutchinson, 1998) or data representation libraries
such as XDR (Srinivasan, 1995) are employed. In open systems, even between homogeneous machines, data conversion still has to be conducted twice although it is not necessary at all (in closed systems, it can be eliminated).
MigThread (Jiang & Chaudhary, 2004) adopts an asymmetric data conversion method that performs data conversion only on the receiver side. This approach is more flexible in open systems. Data conversion is only conducted when necessary, i.e., when senders and receivers are on different platforms. The module in MigThread is called "Coarse-Grain Tagged Receiver Makes it Right" (CGT-RMR). This tagged RMR scheme can tackle data alignment and padding physically, convert data structures as a whole,
and eventually generate a lighter workload compared to existing standards. It accepts ASCII character
sets, handles byte ordering, and adopts the IEEE 754 floating-point standard because of its dominance in
the market. Since CGT-RMR in MigThread converts variables as a whole, in most cases it is faster than
the ones in other SCC systems.
In MigThread, programmers do not need to worry about data formats. The preprocessor parses the
source code, sets up type systems, transforms source code, and communicates with the run-time support
module through inserted primitives. With help from the type system, CGT-RMR can analyze data types,
flatten down aggregate types recursively, detect padding patterns, and define tags as in Figure 5. But
the actual tag contents can be set only at run-time and they may not be the same on different platforms.
Since all of the tedious tag definition work has been performed by the preprocessor, the programming
style becomes extremely simple. Also, with global control, low-level issues such as data conversion
status can be conveyed to upper-level scheduling modules. Therefore, easy coding style and performance
gains come from the preprocessor. CGT-RMR is very efficient in handling large data chunks which are
common in migration and checkpointing (Jiang & Chaudhary, 2004).
Tags in CGT-RMR are used to describe data types and their paddings so that data conversion routines
can handle aggregate types as well as common scalar types. Tags are defined and generated for these
structures as well as dynamically allocated memory blocks in the heap. At compile time, it is still too
early to determine the content of the tags. The preprocessor defines rules to calculate structure members' sizes and varying padding patterns, and inserts sprintf() calls to glue partial results together. The actual tag
generation has to take place at run-time when the sprintf() statement is executed. Only one statement is
issued for each data type regardless of whether it is a scalar or aggregate type. The flattening procedure
is accomplished by MigThread's preprocessor during tag definition instead of the encoding/decoding
process at run-time. Hence, programmers are freed from this responsibility. In MigThread, all memory
segments for predefined data structures are represented in a tag-block format. The process/thread stack
becomes a sequence of these structures and their tags. Memory blocks in heaps are also associated with
such tags to express the actual layout in memory space. Therefore, the computation state physically consists of a group of memory segments associated with their own tags in a tag-segment pair format.
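As a concrete illustration of the tag mechanism, the following hedged C sketch shows the kind of statement a preprocessor could insert for a small structure. The tag format, the function name gen_sample_tag(), and the struct itself are invented for this example and do not reflect MigThread's actual tag grammar; only the general idea follows the description above: member sizes and padding are computed at run time and glued together with a formatted-print call (snprintf() is used here instead of sprintf() for safety).

    #include <stdio.h>
    #include <stddef.h>

    /* A user-defined aggregate type that the preprocessor would flatten. */
    struct sample {
        int    id;
        char   flag;
        double value;
    };

    /* Hypothetical tag-generation statement: each member contributes a "(size,padding)" pair;
     * the padding between members is derived from offsetof() at run time, so the same
     * statement yields different tag contents on different platforms. */
    void gen_sample_tag(char *tag, size_t len)
    {
        snprintf(tag, len, "(%zu,%zu)(%zu,%zu)(%zu,%zu)",
                 sizeof(int),    offsetof(struct sample, flag)  - sizeof(int),
                 sizeof(char),   offsetof(struct sample, value) - offsetof(struct sample, flag) - sizeof(char),
                 sizeof(double), sizeof(struct sample)          - offsetof(struct sample, value) - sizeof(double));
    }

    int main(void)
    {
        char tag[128];
        gen_sample_tag(tag, sizeof(tag));
        printf("tag for struct sample: %s\n", tag);   /* e.g. "(4,0)(1,3)(8,0)" on a typical x86-64 ABI */
        return 0;
    }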

State Restoration
To support open systems, if a symmetric data conversion scheme is adopted, both sides need to convert
data. But with the asymmetric one, the senders do not need to perform data conversion. Only the receivers have to convert the computation state, i.e., data, as required. Normally variables are converted one
by one. MigThread can do this block by block. Since activation frames in stacks are re-run and heaps
are recreated, a new set of segments in tag-block format is available on the new platform. MigThread
first compares architecture tags by strcmp(). If they are identical and the blocks have the same sizes, the
platforms remain unchanged and the old segment contents are simply copied over by memcpy() to the
new architectures. This enables prompt processing between homogeneous platforms while symmetric
conversion approaches still suffer data conversion overhead on both ends.
If platforms have been changed, conversion routines are applied on all memory segments. For each
segment, a walk-through process is conducted against its corresponding old segment from the previous
platform. In these segments, according to their tags, memory blocks are viewed to alternately consist
of scalar type data and padding slots. The high-level conversion unit is data slots rather than bytes in
order to achieve portability. The walk-through process contains two index pointers pointing to a pair
of matching scalar data slots in both blocks. The contents of the old data slots are converted and copied
to the new data slots if byte ordering changes, and then the index pointers are moved down to the next
slots. In the meantime, padding slots are skipped over. In MigThread, data items are expressed in a "scalar data slot - padding slot" pattern to support heterogeneity.
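The following hedged C sketch outlines the restoration logic described above. The tag comparison and the byte-swapping walk-through are simplified; the segment/slot data structures and helper names are hypothetical placeholders, and the real CGT-RMR handles many more cases (padding recomputation, pointer updates, and precision checks).

    #include <string.h>
    #include <stddef.h>

    /* Hypothetical descriptor for one tag-block pair (simplified). */
    struct slot { size_t size; size_t pad; };       /* one scalar slot plus its trailing padding */
    struct segment {
        char           arch_tag[64];   /* platform/layout tag generated at run time */
        size_t         nslots;
        struct slot   *slots;          /* per-slot layout, derived from the tag      */
        size_t         size;           /* total block size                           */
        unsigned char *data;           /* raw memory contents                        */
    };

    static void swap_bytes(unsigned char *p, size_t n)
    {
        for (size_t i = 0; i < n / 2; i++) {
            unsigned char t = p[i];
            p[i] = p[n - 1 - i];
            p[n - 1 - i] = t;
        }
    }

    /* Restore one segment received from the old platform into its new counterpart. */
    void restore_segment(struct segment *new_seg, const struct segment *old_seg, int byte_order_differs)
    {
        /* Fast path: identical architecture tags and sizes, copy the block as a whole. */
        if (strcmp(new_seg->arch_tag, old_seg->arch_tag) == 0 && new_seg->size == old_seg->size) {
            memcpy(new_seg->data, old_seg->data, old_seg->size);
            return;
        }

        /* Slow path: walk matching scalar slots, converting data and skipping padding. */
        size_t new_off = 0, old_off = 0;
        for (size_t i = 0; i < new_seg->nslots && i < old_seg->nslots; i++) {
            size_t n = old_seg->slots[i].size;      /* assume matching scalar widths for simplicity */
            memcpy(new_seg->data + new_off, old_seg->data + old_off, n);
            if (byte_order_differs)
                swap_bytes(new_seg->data + new_off, n);
            new_off += new_seg->slots[i].size + new_seg->slots[i].pad;   /* skip padding on both sides */
            old_off += old_seg->slots[i].size + old_seg->slots[i].pad;
        }
    }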

Safety Issues
Some SCC systems such as the Tui system (Smith & Hutchinson, 1998) and SNOW (Chanchio & Sun,
2001) declare that they only work with programs written in type-safe languages or substrates, whereas others simply assume so. Most migration safety issues can be eliminated by the type safety requirement, but
not all of them. MigThread can detect and handle more migration unsafe features, including pointer
casting, pointers in unions, library calls, and incompatible data conversion (Jiang & Chaudhary, 2004).
Then computation states will be precisely constructed to make those programs eligible for migration.
Programmers are free to code in any programming style.
Pointer casting here does not mean the cast between different pointer types, but the cast to/from other types, such as integer, long, or double. The problem is that pointers might hide in such variables.
The central issue is to detect those integral variables containing pointer values (or memory addresses)
so that they could be updated during state restoration. Casting could be direct or indirect.
Pointer arithmetic and operations may cause harmful pointer casting which is the most difficult safety
issue. In MigThread, an intra-procedural, flow-insensitive, and context-insensitive pointer inference algorithm was proposed to detect hidden pointers created by unsafe pointer casting, regardless of whether
it is applied in pointer assignments or memcpy() library calls. The static analysis at compile time and
dynamic checks at run-time work together to trace and recover unsafe pointer uses.
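A minimal C example of the kind of hidden pointer such an analysis must catch is shown below; the variable names are illustrative, and the example assumes a platform where long is wide enough to hold a pointer.

    #include <string.h>
    #include <stdlib.h>

    void hidden_pointer_examples(void)
    {
        int *p = malloc(16 * sizeof(int));

        /* Direct cast: the address now lives in an integer-typed variable
         * (assuming long can hold a pointer on this platform). */
        long addr = (long)p;

        /* Indirect cast: the address is copied byte-wise into an integral slot. */
        long addr2;
        memcpy(&addr2, &p, sizeof(p));

        /* After migration, ordinary longs keep their old values, but addr and addr2
         * hold addresses valid only in the old address space; they must be identified
         * and updated during state restoration. */
        int *q = (int *)addr;
        q[0] = 1;
        free(p);
        (void)addr2;
    }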
Library calls bring difficulties to all migration schemes since it is hard to determine what is going on
inside the library code. It is even harder for application-level schemes because they work on the source
code and memory leakage might happen in the libraries. Without the source code of libraries, it is
difficult to intercept all memory allocations because of the blackbox effect. The current version of
MigThread provides a user interface to specify the syntax of certain library calls so that the preprocessor
can know how to insert proper primitives for memory management and pointer tracing.
Another unsafe factor is the incompatible data conversion. Between incompatible platforms, if data
items are converted from higher precision formats to lower precision formats, precision loss may occur.
Detecting incompatible data formats and conveying this low-level information up to the scheduling
module can help move computations back or to proper nodes.
In MigThread, the pointer inference algorithm, a friendly user interface, and the data conversion scheme
CGT-RMR work together to eliminate unsafe factors to qualify almost all programs for migration.

Adaptation Point Analysis


An adaptation point is a location in a program where a thread/process can be migrated or checkpointed
correctly. The locations of adaptation points for computation migration are critical since the distance
between two consecutive adaptation points determines the migration scheme's sensitivity and overheads. If two adaptation points are too far apart, applications might be insensitive to the dynamic situation. But if they are too close, the overheads of constructing, saving, and retrieving computation states
will slow down the actual computation.
Several methods have been proposed in the literature regarding how to insert adaptation points.
The first approach is that adaptation points are inserted by users or initiated at a barrier (Abdel-Shafi,
Speight, & Bennet, 1999). This method is straightforward, but it places an undue burden on inexperienced programmers who do not know the structure and workload of their applications. For some large
and complex applications where many developers are involved, it is very difficult to insert adaptation
points by users.
Some automatic adaptation point placement methods were proposed in order to overcome the disadvantage of the above approach. SNOW handles migration points by counting the number of floating
point operations. That is, it inserts a migration point after a certain number of floating point operations.
Since we do not always know the upper bound of a loop at compile time, this scheme cannot determine
the count of operations inside a loop. Furthermore, this scheme is not applicable to non-scientific applications where most operations may not be floating point operations. Actually, it might be difficult to
adopt a quantitative method because of pipelining, caches, and compiler optimizations. Therefore, such
approach might be inaccurate under many circumstances.
The adaptation point placement approach in (Li, Stewart, & Fuchs, 1994) inserts potential adaptation points inside loops. It uses a counter to determine when checkpointing actually occurs. The counter
is initially set to a value called the reduction factor, and it is decremented by one on each loop iteration. When the counter reaches zero, the program performs an actual checkpoint. This scheme can only insert potential
adaptation points at certain sparse points. With many small loops or loops with unknown upper bounds,
checkpointing might not take place for a long time. Therefore, this method is fine with checkpointing,
but if it is applied to migration, it will be insensitive to dynamic environment since migration is only
allowed at sparse points.
MigThread aggressively inserts many potential adaptation points into users' programs using its
preprocessor. That is, it inserts at least one potential adaptation point into each nested loop, subroutine
or branch. Whether a potential adaptation point will be actually activated is decided by a scheduler (or
server) which determines the actual adaptation intervals according to the dynamic environment or using
any existing optimal adaptation interval estimation method, e.g., Young's law (Young, 1974). When an actual migration or checkpoint is needed, the scheduler sends a signal to users' programs in order to set
a flag that controls each potential adaptation point. This approach can tolerate more adaptation points,
and thus the applications can be more sensitive to their dynamic situations.
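A hedged C sketch of how such flag-controlled adaptation points might look is given below. The names (scc_adapt_requested, scc_do_adaptation(), and SIGUSR1 as the scheduler's signal) are illustrative assumptions, not MigThread's actual interface. The comment on the interval uses Young's first-order approximation for the optimal checkpoint interval, T ≈ sqrt(2 * C * M), where C is the cost of one checkpoint and M is the mean time between failures (Young, 1974).

    #include <signal.h>
    #include <math.h>

    /* Flag set by the scheduler's signal; checked at every potential adaptation point. */
    static volatile sig_atomic_t scc_adapt_requested = 0;

    static void scc_on_signal(int signo)
    {
        (void)signo;
        scc_adapt_requested = 1;
    }

    /* Hypothetical helper invoked when an adaptation point fires:
     * construct the state, then migrate or checkpoint. */
    extern void scc_do_adaptation(void);

    /* Each potential adaptation point inserted by the preprocessor reduces to a cheap test. */
    #define SCC_ADAPT_POINT()                 \
        do {                                  \
            if (scc_adapt_requested) {        \
                scc_adapt_requested = 0;      \
                scc_do_adaptation();          \
            }                                 \
        } while (0)

    /* A scheduler may pick the signalling interval with Young's first-order approximation:
     * T_opt = sqrt(2 * C * M), where C is the checkpoint cost and M is the MTBF. */
    double young_interval(double checkpoint_cost, double mtbf)
    {
        return sqrt(2.0 * checkpoint_cost * mtbf);
    }

    void scc_install_handler(void)
    {
        signal(SIGUSR1, scc_on_signal);   /* SIGUSR1 as the scheduler's signal is an assumption */
    }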
Library and system calls can cause problems for all application-level migration approaches since no source code is available to insert potential adaptation points at that level. However, to achieve better portability of the application-level approach, it is reasonable to give up sensitivity during third-party library calls. Luckily, the execution time of most library calls is relatively short.

For some parallel applications where relaxed memory consistency models are used, MigThread first inserts a pseudo-barrier primitive to synchronize the computation progress of multiple threads/processes across different machines. If an actual migration is scheduled, a real barrier operation will be activated
to synchronize both computation progress and data copies. Therefore, migration can take place with
consistent states.

Communication States
Networking applications set up communication channels by certain protocols such as TCP/IP. During
and after migration, messages and those channels themselves need to be handled properly. If migration
happens in a closed system using a distributed or networked operating system, the OS can re-establish
communication seamlessly. The kernel-level approach is always efficient and transparent. However,
portability is not supported.
Some SCC systems such as SNOW (Chanchio & Sun, 2001) proposed a new communication protocol above the regular TCP/IP and UDP/IP layers to deal with message channels in user space. This
connection-aware protocol can reconstruct communication channels after migration. However, it only
works within a closed PVM (Parallel Virtual Machine) system since the modified PVM library must be installed on all participating computers. This closed-system restriction is too strict for generic systems which have ports open to communicate with outside applications.
In open systems, resetting the communication layout is difficult since we do not have the full control
of the whole system. To achieve portability, system call wrappers can be used again to forward data
to shadow threads. The performance may not be as good as in kernel-level and VM-level approaches, but the portability gain is very attractive in heterogeneous distributed systems.
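A very rough C sketch of the wrapper idea follows, under the assumption that a shadow thread on the home node keeps the original socket and relays traffic; all names here are illustrative and not part of any described system.

    #include <unistd.h>

    /* Hypothetical globals maintained by the run-time support module. */
    extern int scc_migrated;        /* nonzero after this thread has migrated away   */
    extern int scc_shadow_sock;     /* channel to the shadow thread on the home node */

    /* A wrapper the preprocessor could substitute for write() on communication descriptors.
     * If the thread is still at home, the call goes straight through; otherwise the data is
     * relayed to the shadow thread, which replays it on the original socket. */
    ssize_t scc_write(int fd, const void *buf, size_t count)
    {
        if (!scc_migrated)
            return write(fd, buf, count);
        /* Very simplified relay: a real implementation would frame the request with the
         * original descriptor and handle partial writes and errors. */
        if (write(scc_shadow_sock, &fd, sizeof(fd)) < 0)
            return -1;
        return write(scc_shadow_sock, buf, count);
    }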

Future Trends
SCC adopts the application-level thread/process migration approach. To achieve the complete high-level
virtual machine abstraction, many features need to be enhanced or added. For example, communication
states need to be improved to support open systems properly. Some future trends are listed to indicate
possible research directions.

Resource Access Transparency


Most SCC systems focus on code and execution state migration. In fact, the most difficult task is to
access original system resources seamlessly. Such resources include data in memory, files in secondary
storage, signals, communication connections, and even databases. Communication states are one example: for connection migration, modified communication primitives can tear down and set up connections transparently. However, this requires a new communication library.
One possible solution is to install a proxy server at the original node before any migration. To access old resources such as printers and databases, requests are sent back so that the proxy server can perform the local access and return the results. However, this "leftover residue" method might slow down performance because of the extra communication channels it introduces.
Data in memory could be shared by multiple threads. After thread migration, such data will be shared
across threads on multiple machines. To achieve such distributed sharing, Distributed Shared Memory
(DSM) might be applied and a distributed lock mechanism needs to be implemented. This is similar to
cache coherence or page/object-based DSM.
Resources in secondary storage such as files can be accessed by migrated computations through global
references or copies. If the resources are supported by global naming, they can be accessed anywhere.
Otherwise, copying them to the new destination can enable local access to hide the remote access after
the migration.
Different resources require different strategies. SCC should be able to set up support accordingly.

Migration Policy
More and more researchers have realized the benefit of moving computations to data over the traditional
way of moving data to computations. SCC needs to analyze the data and computation locality. Then
it activates data/computation migration to minimize communication frequency and volume. Tasks and
their required data are distributed nearby since local accesses are much faster than remote ones. Such a
task-data relationship needs to be identified during the task mapping period. More importantly, it should
be adjusted dynamically. An SCC scheduler should be implemented to detect communication patterns
and orchestrate data/computation migration according to the predefined migration policy.
The migration policy is set up based on the sizes of data and computations. Its details need to be explored further.

New Languages
SCC adopts the application-level approach, which has several disadvantages. Firstly, it needs to handle
different kinds of thread libraries. Secondly, the source-to-source translation might not be able to handle
all programming styles, i.e., migration safety issues will appear. Finally, SCC has to deal with different
kinds of libraries which might handle stacks or heaps directly or indirectly.
To support all migration features smoothly, powerful new programming languages might be the
future direction. As part of DARPA's High Productivity Computing Systems (HPCS) program, several programming languages are being developed. The representative ones include IBM's X10 (Charles, et al., 2005), Sun Microsystems' Fortress (Allen, et al., 2008), and Cray's Chapel (Cray Inc., 2005). They
all support parallel and distributed computing over multi-core clusters. However, computation mobility
is not a prominent feature of these ongoing language efforts. If these languages take it as one of their main
goals, programmers can take advantage of all features together without having to face the difficulty of
combining multiple modules from different vendors. With the support from corresponding compilers
on different platforms, these languages can deploy more new portable features easily.
With such a new language, SCC could be raised to the language level. An integrated package with computation mobility and other new features might attract more programmers. At least up to now, no new language has been planned to tackle the issues in SCC because of the difficulties involved.

CONCLUSION
This chapter points out the benefits of computation mobility for performance improvement, fault tolerance, and throughput increase. To enable computation mobility, the key issue is to construct movable
computation states during application execution. Their granularity and position in software determine
the solution's portability and flexibility. The eventual goal is to support computation mobility in heterogeneous environments and take advantage of multi-core/many-core clusters.
State-carrying code (SCC) is a software mechanism to achieve the above-mentioned computation
mobility. Applications replicate their states at a high level during their normal execution. Whenever the computation needs to be moved between different platforms, its computation state can be constructed easily. Since multi-threading is widely used on multi-core architectures, threads are chosen as the basic
computing units. Therefore, SCC adopts application-level thread migration to achieve computation
mobility in heterogeneous systems.
Deployment and major features of SCC are explained. MigThread and several other existing systems
are used to demonstrate the implementation strategies. Some future trends are given to indicate how
SCC can encapsulate computation states further and evolve into a lightweight virtual machine. New high-productivity languages can even upgrade SCC to the language level. Once the much richer system resources in clusters, such as cores, are exploited thoroughly, performance gains and high availability can be expected.

REFERENCES
Abdel-Shafi, H., Speight, E., & Bennet, J. K. (1999). Efficient user-level thread migration and checkpointing on Windows NT clusters. Proceedings of the 3rd USENIX Windows NT Research Symposium,
(pp. 1-10).
Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., et al. (1986). Mach: A New Kernel Foundation for UNIX Development. Proceedings of the Summer USENIX Conference, (pp. 93-112).
Allen, E., Chase, D., Hallett, J., Luchangco, V., Maessen, J., Ryu, S., et al. (2008). The Fortress language specification, version 1.0. Santa Clara, CA: Sun Microsystems.
Antoniu, G., & Bougé, L. (2001). DSM-PM2: A portable implementation platform for multi-threaded DSM consistency protocols. Proceedings of the 6th International Workshop on High-Level Parallel Programming Models and Supportive Environments.
Barak, A., & Laadan, O. (1998). The MOSIX multicomputer operating system for high performance cluster computing. Future Generation Computer Systems, 13(4-5), 361–372. doi:10.1016/S0167-739X(97)00037-X
Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., et al. (2003). Xen and the art of virtualization. Proceedings of the ACM symposium on operating systems principles.
Bricker, A., Litzkow, M., & Livny, M. (1992). Condor Technical Summary, Version 4.1b. Madison, WI:
University of Wisconsin - Madison.
Bronevetsky, G., Marques, D., Schulz, M., Pingali, K., & Stodghill, P. (2004). Application-level checkpointing for shared memory programs. Proceedings of 11th international conference on architectural
support for programming languages and operating systems.
Casa, J., Konuru, R., Prouty, R., Walpole, J., & Otto, S. (1994). Adaptive Load Migration Systems for
PVM. Proceedings of supercomputing, (pp. 390-399). Washington D.C.
Chanchio, K., & Sun, X. H. (2001). Communication state transfer for the mobility of concurrent heterogeneous computing. Proceedings of the 2001 International Conference on Parallel Processing.
Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: an object-oriented approach to non-uniform cluster computing. Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, (pp. 519-538).
Chase, J. S., Amador, F. F., Lazowska, E. D., Levy, H. M., & Littlefield, R. J. (1996). The Amber system: Parallel programming on a network of multiprocessors. Proceedings of the ACM Symposium on Operating Systems Principles.
Chung, P. E. (1997). Checkpointing in cosmic: a user-level process migration environment. Proceedings
of Pacific Rim International Symposium on Fault-Tolerant Systems.
IBM Corporation. (1993). IBM LoadLeveler: User's Guide.
Cray Inc. (2005). The Chapel language specification, version 0.4.
Cronk, M. H., & Mehrotra, P. (1997). Thread migration in the presence of pointers. Proceedings of the Mini-Track on Multithreaded Systems, 30th Hawaii International Conference on System Sciences.
Dimitrov, B., & Rego, V. (1998). Arachne: A portable threads system supporting migrant threads on
heterogeneous network farms. IEEE Transactions on Parallel and Distributed Systems, 9(5), 459–469.
doi:10.1109/71.679216
Douglis, F., & Ousterhout, J. K. (1991). Transparent process migration: Design alternatives and the sprite
implementation. Software, Practice & Experience, 21(8), 757–785. doi:10.1002/spe.4380210802
Ferrari, A. J., Chapin, S. J., & Grimshaw, A. S. (1997). Process introspection: A heterogeneous checkpoint/
restart mechanism based on automatic code modification, (Technical Report: CS-97-05). University of
Virginia, Charlottesville, VA.
Itzkovitz, A., Schuster, A., & Shalev, L. (1998). Thread migration and its applications in distributed shared
memory systems. Journal of Systems and Software, 42(1), 71–87. doi:10.1016/S0164-1212(98)00008-9
Jiang, H., & Chaudhary, V. (2004). Process/thread migration and checkpointing in heterogeneous distributed systems. Proceedings of the 37th Hawaii International Conference on System Sciences, Hawaii,
USA.
Jul, E., Levy, H., Hutchinson, N., & Blad, A. (1998). Fine-grained mobility in the emerald system. ACM
Transactions on Computer Systems, 6(1), 109–133. doi:10.1145/35037.42182
Kale, L. V., & Krishnan, S. (1998). Charm++: Parallel Programming with Message-Driven Objects. In
G. V. Wilson, & P. Lu, Parallel programming using c++ (pp. 175-213). Cambridge, MA: MIT Press.
Lange, D., & Oshima, M. (1998). Mobile agents with Java: The Aglet API. World Wide Web (Bussum), 1(3). doi:10.1023/A:1019267832048
Li, C.-C. J., Stewart, E. M., & Fuchs, W. K. (1994). Compiler assisted full checkpointing. Software,
Practice & Experience, 24, 871–886. doi:10.1002/spe.4380241002
Lindholm, T., & Yellin, F. (1999). The Java(TM) Virtual Machine Specification (2nd Ed.). New York: Addison-Wesley.
Mascarenhas, E., & Rego, V. (1995). Ariadne: Architecture of a portable threads system supporting
mobile process, (Tech. Rep. No. CSD-TR 95-017). Dept. of Computer Sciences, Purdue University,
Southbend, IN.
Milton, S. (1998). Thread migration in distributed memory multicomputers, (Tech. Rep. No. TR-CS-98-01).
Dept. of Comp Sci & Comp Sciences Lab, Australia National University, Acton, Australia.
Plank, J. S., Beck, M., Kinsley, G., & Li, K. (1995). Libckpt: Transparent checkpointing under unix.
Usenix winter technical conference, (pp. 213-223).
Ramkumar, B., & Strumpen, V. (1997). Portable checkpointing for heterogeneous architectures. Symposium on Fault-Tolerant Computing, (pp. 58-67).
Rashid, R. F., & Robertson, G. (1981). Accent: A communication oriented network operating system
kernel. Proceedings of the eighth acm symposium on operating systems principles, (pp. 64-75).
Smith, P., & Hutchinson, N. C. (1998). Heterogeneous process migration: The Tui system. Software, Practice & Experience, 28(6), 611–639. doi:10.1002/(SICI)1097-024X(199805)28:6<611::AID-SPE169>3.0.CO;2-F
Srinivasan, R. (1995). XDR: External Data Representation Standard (Tech. Rep. No. RFC 1832).
Ssu, K., Yao, B., & Fuchs, W. K. (1999). An adaptive checkpointing protocol to bound recovery time
with message logging. Symposium on reliable distributed systems, (pp. 244-252).
Stellner, G. (1996). CoCheck: Checkpointing and process migration for MPI. Proceedings of the 10th International Parallel Processing Symposium.
Theimer, M. M., Lantz, K. A., & Cheriton, D. R. (1985). Preemptable remote execution facilities for the V-System. SIGOPS Operating Systems Review, 19(5), 2–12. doi:10.1145/323627.323629
VMware Inc. (1999). VMware virtual platform.
Walker, B., Popek, G., English, R., Kline, C., & Thiel, G. (1992). The LOCUS distributed operating system. Distributed Computing Systems: Concepts and Structures, 17(5).
White, J. E. (1996). Telescript technology: Mobile agents. Journal of Software Agents.
Young, J. W. (1974). A first order approximation to the optimum checkpoint interval. Communications
of the ACM, 17, 530–531. doi:10.1145/361147.361115
Zajcew, R., Roy, P., Black, D., & Peak, C. (1993). An OSF/1 UNIX for massively parallel multicomputers. Proceedings of the Winter 1993 Conference, (pp. 449-468).
Zhou, S., Zheng, X., Wang, J., & Delisle, P. (1993). Utopia: a load sharing facility for large, heterogeneous distributed computer systems. Software, Practice & Experience, 23(12), 1305–1336. doi:10.1002/spe.4380231203
Zhu, W., Wang, C.-L., & Lan, F. (2002). JESSICA2: a distributed Java virtual machine with transparent thread migration support. Proceedings of the IEEE International Conference on Cluster Computing.

KEY TERMS AND DEFINITIONS


Computation Mobility: The ability to move a running program from one machine to another.
Computation States: The required information to indicate the execution progress, including register contents, stacks, and heaps, etc.
Data Conversion: The function to translate data from one format to another.
Migration Safety: The necessary features of a program to enable its mobility.
State-Carrying Code: Transformed programs which can acquire the running state in order to stop and restart the execution.
Thread/Process Migration: The feature to move a running thread/process from one machine to another.
Virtualization: The abstraction of system resources where computations can be executed for portability.

Compilation of References

3GPP TS 23.234 V7.5.0 (2007). 3GPP system to WLAN interworking, 3GPP Specification. Retrieved May 1, 2008, from http://www.3gpp.org, 2007.
A Blueprint for the Open Science Grids. (2004, December). Snapshot v0.9.
Abawajy, J. (2004). Placement of file replicas in data grid
environments. In Proceedings of international conference
on computational science (Vol. 3038, pp. 66-73).
Abdel-Shafi, H., Speight, E., & Bennet, J. K. (1999). Efficient user-level thread migration and checkpointing on
Windows NT clusters. Proceedings of the 3rd USENIX
Windows NT Research Symposium, (pp. 1-10).
Abdennadher, N., & Boesch, R. (2005). Towards a peerto-peer platform for high performance computing. In
HPCASIA05 Proceedings of the Eighth International
Conference in High-Performance Computing in AsiaPacific Region, (pp. 354-361). Los Alamitos, CA: IEEE
Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/HPCASIA.2005.98

Abdennadher, N., & Boesch, R. (2006, August). A scheduling algorithm for high performance peer-to-peer
platform. In W. Lehner, N. Meyer, A. Streit, & C. Stewart
(Eds.), Coregrid Workshop, Euro-Par 2006 (p. 126-137).
Dresden, Germany: Springer.
Aberer, K., Cudré-Mauroux, P., Datta, A., Despotovic,
Z., Hauswirth, M., & Punceva, M. (2003). P-Grid: A
self-organizing structured p2p system. SIGMOD Record,
32(3), 2933. doi:10.1145/945721.945729
Abramson, D., Buyya, R., & Giddy, J. (2002). A computational economy for grid computing and its implementation
in the Nimrod-G resource broker. Future Generation
Computer Systems, 18(8), 10611074. doi:10.1016/S0167739X(02)00085-7
Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., et al. (1986). Mach: A New Kernel Foundation for UNIX Development. Proceedings of the Summer USENIX Conference, (pp. 93-112).
ACR-NEMA. (2005). DICOM (Digital Image and Communications in Medicine). Retrieved June 15th, 2008,
from http://medical.nema.org/

Adabala, S., Chadha, V., Chawla, P., Figueiredo, R., Fortes, J., & Krsul, I. (2005, June). From virtualized
resources to virtual computing Grids: the In-VIGO
system. Future Generation Computer Systems, 21(6),
896909. doi:10.1016/j.future.2003.12.021
Adamy, U., & Erlebach, T. (2004). Online coloring of
intervals with bandwidth (LNCS Vol. 2909, pp. 112).
Berlin: Springer.
Adiga, N. R., et al. (2002). An overview of the BlueGene/L
supercomputer. In Proceedings of the Supercomputing Conference (SC2002), Baltimore MD, USA, (pp.
122).
Adjie-Winoto, W., Schwartz, E., Balakrishnan, H., &
Lilley, J. (1999). The design and implementation of an
intentional naming system. Operating Systems Review,
34(5), 186201. doi:10.1145/319344.319164
Adler, M., Halperin, E., Karp, R. M., & Vazirani, V.
(2003, June). A stochastic process on the hypercube
with applications to peer-to-peer networks. In Proc. of
STOC.
Adve, S. V., & Hill, M. D. (1993). A Unified Formalization of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems, 4(6), 613-624. doi:10.1109/71.242161
Afgan, E., Velusamy, V., & Bangalore, P. (2005). Grid
resource broker using application benchmarking. European Grid Conference, (LNCS 3470, pp. 691-701).
Amsterdam: Springer Verlag.
Agarwal, A., Bianchini, R., Chaiken, D., Johnson, K. L.,
Kranz, D., Kubiatowicz, J., et al. (1995). The MIT Alewife
machine: architecture and performance. In Proceedings
of the 22nd Annual International Symposium on Computer
Architecture (ISCA95), S. Margherita Ligure, Italy, (pp.
2-13). New York: ACM Press.
Agarwal, A., Lim, B.-H., Kranz, D., & Kubiatowicz, J.
(1990). April: a processor architecture for multiprocessing. In Proceedings of the 17th Annual International
Symposium on Computer Architecture (ISCA90), (pp.
104-114), Seattle, WA: ACM Press.


Agarwal, S., Chuah, C. N., & Katz, R. H. (2003). OPCA: Robust Inter-domain Policy Routing and Traffic Control, OPENARCH.
Agrawal, D. P., & Zeng, Q.-A. (2006). Introduction to
wireless and mobile systems (2nd Ed.). Florence, KY:
Thomson.
Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of
the ACM, 18(6), 333340. doi:10.1145/360825.360855
Akenine-Moller, T., & Haines, E., (2002, July). Realtime rendering (2nd Ed.). Wellesley, MA: A. K. Peters
Publishing Company.
Akyildiz, I., Mohanty, S., & Xie, J. (2005). A ubiquitous mobile communication architecture for nextgeneration heterogeneous wireless systems. IEEE
Radio Communications, 43(6), 2936. doi:10.1109/
MCOM.2005.1452832
Alam, S. R., Meredith, J. S., & Vetter, J. S. (2007, Sept.). Balancing productivity and performance on the Cell Broadband Engine. IEEE Annual International Conference on Cluster Computing.
Aldinucci, M., Danelutto, M., & Teti, P. (2003). An advanced environment supporting structured parallel programming in Java. Future Generation Computer Systems,
19(5), 611626. doi:10.1016/S0167-739X(02)00172-3
Aldwairi, M., Conte, T., & Franzon, P., (2005) Configurable string matching hardware for speeding up intrusion detection. ACM SIGARCH Computer Architecture
News, 33(1).
Alexander, D. (1995). Recursively Modular Artificial Neural Network. Doctoral thesis, Macquarie University, Sydney, Australia.
Alexandrov, A. D., Ibel, M., Schauser, K. E., & Scheiman,
C. (1997, April). SuperWeb: Towards a global web-based
parallel computing infrastructure. In Proceedings of the
11th IEEE International Parallel Processing Symposium
(IPPS).


Alexandrov, A., Ionescu, M. F., Schauser, K. E., & Scheiman, C. (1995). LogGP: Incorporating long messages into the LogP model. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, (pp. 95-105). New York: ACM Press.
ALF for Cell BE Programmers Guide and API Reference. Retrieved from http://www01.ibm.com/chips/
techlib/techlib.nsf/techdocs/41838EDB5A15CCCD002
573530063D465
ALF for Hybrid-x86 Programmers Guide and API Reference. Retrieved from http://www01.ibm.com/chips/
techlib/techlib.nsf/techdocs/389BBE99638335B80025
735300624044

Ahlswede, R., Cai, N., Li, S.-Y. R., & Yeung, R. W. (2000). Network information flow: Single source. IEEE Transactions on Information Theory, (submitted for publication).
Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher,
B., & Mock, S. (2004). Kepler: an extensible system for
design and execution of scientific workflows. In Proceedings of the 16th International Conference on Scientific and
Statistical Database Management (SSDBM), Santorini
Island, Greece.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.

Ali, A., McClatchey, R., Anjum, A., Habib, I., Soomro, K., Asif, M., et al. (2006). From grid middleware to a grid operating system. In Proceedings of the Fifth International Conference on Grid and Cooperative Computing, (pp. 9-16). China: IEEE Computer Society.

Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., & Smith, B. (1990). The Tera computer system. In Proceedings of the 4th International Conference on Supercomputing (ICS'90), (pp. 1-6). Amsterdam: ACM Press.

Alima, L. O., El-Ansary, S., Brand, P., & Haridi, S. (2003). DKS (N, k, f): A Family of Low Communication, Scalable and Fault-tolerant Infrastructures for P2P Applications. In Proceedings of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid (pp. 344-350). New York: IEEE Computer Society Press.

Amarasinghe, S. (2007). Multicore programming primer and programming competition. A course at MIT, Cambridge, MA.

Allcock, W. (2003, March). GridFTP protocol specification. Global Grid Forum Recommendation GFD.20.

Amazon Elastic Compute Cloud. (2008, November). Retrieved from http://www.amazon.com/ec2

Allen, E., Chase, D., Hallett, J., Luchangco, V., Maessen, J.-W., Ryu, S., et al. (2008). The Fortress language specification, version 1.0. Santa Clara, CA: Sun Microsystems, Inc.
Allen, G., Davis, K., Dolkas, K. N., Doulamis, N. D.,
Goodale, T., Kielmann, T., et al. (2003). Enabling applications on the grid: A Gridlab overview. International
Journal of High Performance Computing Applications:
Special issue on Grid Computing: Infrastructure and
Applications.

Amazon Elastic Compute Cloud. (2007). Retrieved from www.amazon.com/ec2

Ambastha, N., Beak, I., Gokhale, S., & Mohr, A. (2003). A cache-based resource location approach for unstructured P2P network architectures. Graduate Research Conference, Department of Computer Science, Stony Brook University, NY.
Ammann, P., Jajodia, S., & Ray, I. (1997). Applying
formal methods to semantic-based decomposition of
transactions. [TODS]. ACM Transactions on Database
Systems, 22(2), 215254. doi:10.1145/249978.249981
Ancilotti, P., Lazzerini, B., & Prete, C. A. (1990). A
distributed commit protocol for a multicomputer system. IEEE Transactions on Computers, 39(5), 718724.
doi:10.1109/12.53589


Anderson, D. P. (2004). BOINC: A system for publicresource computing and storage. In Grid04 Proceedings
of the Fifth IEEE/ACM International Workshop on Grid
Computing, (pp. 4-10). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://dx.doi.org/10.1109/
GRID.2004.14
Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M.,
& Werthimer, D. (2002). SETI@home: An experiment
in public-resource computing. Communications of the
ACM, 45(11), 56-61. New York: ACM Press. Retrieved
from http://doi.acm.org/10.1145/581571.581573
Anderson, D., & Fedak, G. (2006). The computational
and storage potential of volunteer computing. In Proceedings of The IEEE International Symposium on Cluster
Computing and The Grid (CCGRID06).
Andrade, N., Brasileiro, F., Cirne, W., & Mowbray, M.
(2007). Automatic Grid assembly by promoting collaboration in peer-to-peer Grids. Journal of Parallel and
Distributed Computing, 67(8), 957966. doi:10.1016/j.
jpdc.2007.04.011
Andrade, N., Cirne, W., Brasileiro, F., & Roisenberg, R.
(2003, October). OurGrid: An approach to easily assemble
grids with equitable resource sharing. In JSSPP03 Proceedings of the 9th Workshop on Job Scheduling Strategies for Parallel Processing (LNCS). Berlin/Heidelberg,
Germany: Springer. doi: 10.1007/10968987
Androutsellis-Theotokis, S., & Spinellis, D. (2004).
A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36(4), 335371.
doi:10.1145/1041680.1041681
Andrzejak, A., Domingues, P., & Silva, L. (2006). Predicting Machine Availabilities in Desktop Pools. In IEEE/
IFIP Network Operations and Management Symposium
(pp. 225234).
Andrzejak, A., Kondo, D., & Anderson, D. P. (2008).
Ensuring collective availability in volatile resource pools
via forecasting. In 19th Ifip/Ieee Distributed Systems:
Operations And Management (DSOM 2008). Samos
Island, Greece.


Anfinson, J., & Luk, F. T. (1988, December). A Linear Algebraic Model of Algorithm-Based Fault Tolerance. IEEE Transactions on Computers, 37(12), 1599-1604. doi:10.1109/12.9736
Antoniu, G., & Bougé, L. (2001). DSM-PM2: A portable implementation platform for multithreaded DSM consistency protocols. Proceedings of the 6th international workshop on high-level parallel programming models and supportive environments.
Antoniu, G., Bougé, L., Hatcher, P., MacBeth, M., McGuigan, K., & Namyst, R. (2001). The Hyperion system: Compiling multithreaded Java bytecode for distributed execution. Parallel Computing, 27(10), 1279-1297. doi:10.1016/S0167-8191(01)00093-X
Araujo, F., Domingues, P., Kondo, D., & Silva, L. M.
(2008, April). Using cliques of nodes to store desktop
grid checkpoints. In Coregrid Integration Workshop,
Crete, Greece.
Aridor, Y., Factor, M., & Teperman, A. (1999). cJVM:
A Single System Image of a JVM on a Cluster. Paper
presented at the Proceedings of the 1999 International
Conference on Parallel Processing.
Aridor, Y., Factor, M., Teperman, A., Eilam, T., & Schuster, A. (2000). Transparently obtaining scalability for
Java applications on a cluster. Journal of Parallel and
Distributed Computing, 60(10), 11591193. doi:10.1006/
jpdc.2000.1649
ARM. (2008). ARM Achieves 10 Billion Processor Milestone. Retrieved March 10, 2008, from http://www.arm.
com/news/19720.html
Arpaci, R. H., Dusseau, A. C., Vahdat, A. M., Liu, L. T.,
Anderson, T. E., & Patterson, D. A. (May, 1995). The
interaction of parallel and sequential workloads on a
network of workstations. Paper presented at the 1995
ACM SIGMETRICS Conference on Measurement and
Modeling of Computer Systems.
Arpaci-Dusseau, R. H., Arpaci-Dusseau, A. C., Vahdat,
A., Liu, L. T., Anderson, T. E., & Patterson, D. A. (1995).
The interaction of parallel and sequential workloads on a
network of workstations. SIGMETRICS, (pp. 267-278).


Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J., Husbands, P., Keutzer, K., et al. (2006, Dec). The Landscape of Parallel Computing Research: A View from Berkeley (Tech. Rep. No. UCB/EECS-2006-183). EECS Department, University of California, Berkeley.
ASF. (2002). The Apache Tomcat Connector. Retrieved
June 18, 2008, from http://tomcat.apache.org/connectorsdoc/
Audsley, N. C., Burns, A., Richardson, M. F., Tindall, K.,
& Wellings, A. (1993). Applying New Scheduling Theory
to Static Priority Pre-emptive Scheduling. Software
Engineering Journal, 8(5).
Australian Partnership for Advanced Computing (APAC)
Grid. (2005). Retrieved from http://www.apac.edu.au/
programs/GRID/index.html.
Autenrieth, F., Isralewitz, B., Luthey-Schulten, Z., Sethi, A., & Pogorelov, T. Bioinformatics and sequence alignment.
Awduche, D.O., Chiu, A., Elqalid, A., Widjaja, I., & Xiao,
X. (2002). A Framework for Internet Traffic Engineering
[draft 2]. Retrieved from IETF draft database.
Azar, Y. Broder, A., et al. (1994). Balanced allocations.
In Proc. of STOC (pp. 593602).
Baboescu, F., & Varghese, G. (2005). Scalable packet
classification. IEEE/ACM Trans. Netw., 13(1), 214.
Bader, D. A., & Agarwal, V. (2007, Dec). FFTC: Fastest
fourier transform on the ibm cell broadband engine. In
14th IEEE international conference on high performance
computing (hipc 2007) Goa, India, (pp. 1821).
Badia, R. M., Labarta, J. S., Sirvent, R. L., Perez, J. M.,
Cela, J. M., & Grima, R. (2003). Programming grid applications with GRID superscalar. Journal of Grid Computing, 1, 151170. doi:10.1023/B:GRID.0000024072.93701.
f3
Bailey, D., Barszcz, E., Barton, J. T., Browning, D. S.,
Carter, R. L., & Dagum, L. (1991). The NAS parallel
benchmarks. The International Journal of Supercomputer
Applications, 5(3), 6373.

Baker, M., Buyya, R., & Laforenza, D. (2002). Grids and grid technologies for wide-area distributed computing. Software: Practice and Experience [SPE], 32, 1437-1466. doi:10.1002/spe.488
Baker, S. (2007). Google and the wisdom of clouds. Business Week, Dec. 13. Retrieved from www.businessweek.
com/magazine/content/07_52/b4064048925836.htm
Bal, H. E., & Haines, M. (1998). Approaches for integrating task and data parallelism. IEEE Concurrency, 6(3),
7484. doi:10.1109/4434.708258
Balasubramanian, V., & Banerjee, P. (1990). CompilerAssisted Synthesis of Algorithm-Based Checking in
Multiprocessors. IEEE Transactions on Computers,
C-39, 436446. doi:10.1109/12.54837
Balaton, Z., Gombas, G., Kacsuk, P., Kornafeld, A.,
Kovacs, J., & Marosi, A. C. (2007, March 26-30). Sztaki
desktop grid: a modular and scalable way of building large
computing grids. In Proceedings of the 21st International
Parallel And Distributed Processing Symposium, Long
Beach, CA.
Balazinska, M., Balakrishnan, H., & Stonebraker, M.
(2004, March). Contract-based load management in
federated distributed systems. In 1st Symposium on
Networked Systems Design and Implementation (NSDI)
(pp. 197-210). San Francisco: USENIX Association.
Balazinska, M., Blakrishnan, H., & Karger, D. (2002).
INS/Twine: a scalable peer-to-peer architecture for intentional resource discovery. In Pervasive 2002, Zurich,
Switzerland, August. Berlin: Springer Verlag.
Baldassari, J., Finkel, D., & Toth, D. (2006, November
13-15). Slinc: A framework for volunteer computing. In
Proceedings of the 18th Iasted International Conference
On Parallel And Distributed Computing And Systems
(PDCS 2006). Dallas, TX.
Baldridge, K., Biros, G., Chaturvedi, A., Douglas, C. C.,
Parashar, M., How, J., et al. (2006, January). National
Science Foundation DDDAS Workshop Report. Retrieved
from http://www.dddas.org/nsf-workshop-2006/wkshp
report.pdf.


Ban, B. (1997). JGroups - A Toolkit for Reliable Multicast Communication. Retrieved June 18, 2008, from http://www.jgroups.org/javagroupsnew/docs/index.html
Banerjee, P., Rahmeh, J. T., Stunkel, C. B., Nair, V. S.
S., Roy, K., & Balasubramanian, V. (1990). Algorithmbased fault tolerance on a hypercube multiprocessor.
IEEE Transactions on Computers, C-39, 11321145.
doi:10.1109/12.57055
Bangerth, W., Matossian, V., Parashar, M., Klie, H., &
Wheeler, M. (2005). An autonomic reservoir framework for the stochastic optimization of well placement.
Cluster Computing, 8(4), 255269. doi:10.1007/s10586005-4093-3
Banks, T. (2006). Web services resource framework
(WSRF). Organization for the Advancement of Structured
Information Standards (OASIS).
Barak, A., & Laadan, O. (1998). The MOSIX multicomputer operating system for high performance cluster computing. Journal of Future Generation Computer Systems, 13(4-5), 361-372. doi:10.1016/S0167-739X(97)00037-X
Barak, A., Guday, S., & Wheeler, R. G. (1993). The MOSIX Distributed Operating System, Load Balancing for UNIX (Vol. 672). Berlin: Springer-Verlag.
Baratloo, A., Karaul, M., Kedem, Z., & Wyckoff, P. (1996). Charlotte: Metacomputing on the Web. In Proceedings of the 9th International Conference on Parallel and Distributed Computing Systems (PDCS-96).
Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T.,
Ho, A., et al. (2003). Xen and the art of virtualization. In
19th ACM Symposium on Operating Systems Principles
(SOSP 03) (pp. 164177). New York: ACM Press.
Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato,
J., Dongarra, J., et al. (1994). Templates for the Solution
of Linear Systems: Building Blocks for Iterative Methods,
2nd Edition., Philadelphia, PA: SIAM.
Barua, S., Thulasiram, R. K., & Thulasiraman, P. (2005,
Aug.). High performance computing for a financial application using fast Fourier transform. In Euro-par parallel
processing (p. 1246-1253). Lisbon, Portugal.


Bassi, A., Beck, M., Fagg, G., Moore, T., Plank, J. S., &
Swany, M. (2002). The Internet BackPlane Protocol: A
Study in Resource Sharing. In Second ieee/acm international symposium on cluster computing and the grid,
Berlin, Germany.
Bataineh, S., Hsiung, T.-Y., & Robertazzi, T. (1994).
Closed form solutions for bus and tree networks of processors load sharing a divisible job. Institute of Electrical
and Electronic Engineers, 43(10), 1184119.
Baude, F., Caromel, D., Huet, F., & Vayssiere, J. (2000,
May). Communicating mobile active objects in java.
In R. W. Marian Bubak Hamideh Afsarmanesh & B.
Hetrzberger (Eds.), Proceedings of HPCN Europe 2000
(Vol. 1823, p. 633-643). Berlin: Springer. Retrieved from
http://www-sop.inria.fr/oasis/Julien.Vayssiere/publications/18230633.pdf
BBC. (2005). SMEF- Standard Media Exchange Framework. Retrieved June 15th, 2008, from http://www.bbc.
co.uk/guidelines/smef/.15th June, 2008.
BEinGRID. (2008). Business experiments in grids. Retrieved from www.beingrid.com
Belalem, G., & Slimani, Y. (2006). A hybrid approach
for consistency management in large scale systems. In
Proceedings of the international conference on networking and services (pp. 7176).
Belalem, G., & Slimani, Y. (2007). Consistency management for data grid in optorsim simulator. In Proceedings of the international conference on multimedia and
ubiquitous engineering (pp. 554560).
Bell, W. H., Cameron, D. G., Carvajal-Schiaffino, R.,
Millar, A. P., Stockinger, K., & Zini, F. (2003). Evaluation
of an economy-based file replication strategy for a data
grid. In Proceedings of the 3rdIEEE/ACM international
symposium on cluster computing and the grid.
Bell, W., Cameron, D., Capozza, L., Millar, P., Stockinger,
K., & Zini, F. (2003). Optorsim - A grid simulator for
studying dynamic data replication strategies. International Journal of High Performance Computing Applications, 17, 403416. doi:10.1177/10943420030174005


Bellavista, P., & Corradi, A. (2007). The Handbook of Mobile Middleware. New York: Auerbach publications.
Beltrame, F., Maggi, P., Melato, M., Molinari, E., Sisto,
R., & Torterolo, L. (2006, February 2-3). SRB Data grid
and compute grid integration via the enginframe grid
portal. In Proceedings of the 1st SRB Workshop, San
Diego, CA. Retrieved from www.sdsc.edu/srb/Workshop/
SRB-handout-v2.pdf
Ben Hassen, S., Bal, H. E., & Jacobs, C. J. H. (1998). A
task- and data-parallel programming language based
on shared objects. [TOPLAS]. ACM Transactions on
Programming Languages and Systems, 20(6), 11311170.
doi:10.1145/295656.295658
Bender, T. (1982). Community and Social Change in
America. Baltimore, MD: The Johns Hopkins University Press.
Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald,
M., & Widom, J. (2008). Databases with uncertainty and
lineage. The VLDB Journal, 17(2), 243264. doi:10.1007/
s00778-007-0080-z
Benkert, K. Gabriel, E. & Resch, M. M. (2008). Outlier
Detection in Performance Data of Parallel Applications.
In the 9th IEEE International Workshop on Parallel Distributed Scientific and Engineering Computing (PDESC),
Miami, Florida, USA.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell,
J., Rapp, B. A., & Wheller, D. L. (2000, October). Genbank. Nucleic Acids Research, 28(1), 1518. doi:10.1093/
nar/28.1.15
Berezin, Y. A., & Vshivkov, V. A. (1980). The method
of particles in rarefied plasma dynamic. Novosibirsk,
Russia: Nauka (Science).
Berger, M. P., & Munson, P. J. (1991). A novel randomized
iteration strategy for aligning multiple protein sequences.
Computer Applications in the Biosciences, 7, 479484.

Berman, F., Casanova, H., Chien, A. A., Cooper, K. D., Dail, H., & Dasgupta, A. (2005). New Grid scheduling and rescheduling methods in the GrADS project. International Journal of Parallel Programming, 33(2-3), 209-229. doi:10.1007/s10766-005-3584-4
Berman, F., Chien, A., Cooper, K., Dongarra, J., Foster, I.,
& Gannon, D. (2001). The grads project: Software support
for high-level grid application development. International
Journal of High Performance Computing Applications,
15(4), 327344. doi:10.1177/109434200101500401
Berman, F., Fox, G., & Hey, T. (Eds.). (2003). Grid computing making the global infrastructure a reality. New
York: Wiley Series in Communication Networking &
Distributed Systems.
Berman, F., Wolski, R., Casanova, H., Cirne, W., Dail,
H., & Faerman, M. (2003). Adaptive computing on the
Grid using AppLeS. IEEE Transactions on Parallel
and Distributed Systems, 14(4), 369382. doi:10.1109/
TPDS.2003.1195409
Berman, F., Wolski, R., Figueira, S., Schopf, J., & Shao,
G. (1996). Application-Level Scheduling on Distributed
Heterogeneous Networks. In Proc. of supercomputing96,
Pittsburgh, PA.
Berriman, G. B., Good, J. C., & Laity, A. C. (2003).
Montage: A grid enabled image mosaic service for the
national virtual observatory. In F. Ochsenbein (Ed.),
Astronomical Data Analysis Software and Systems XIII,
(pp. 145-167). Livermore, CA: ASP press.
Bertossi, A. A., Pinotti, M. C., Rizzi, R., & Gupta, P.
(2004). Allocating servers in infostations for bounded
simultaneous requests. Journal of Parallel and Distributed Computing, 64, 11131126. doi:10.1016/S07437315(03)00118-7
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and
Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice Hall.
Bharadwaj, V., Ghose, D., & Mani, V. (1995, April).
Multi-installment load distribution in tree network with
delays. Institute of Electrical and Electronic Engineers,
31(2), 555567.


Bharadwaj, V., Ghose, D., & Robertazzi, T. G. (2003, January). Divisible load theory: A new paradigm for load scheduling in distributed systems. Cluster Computing, 6(1), 7-17. doi:10.1023/A:1020958815308
Bharadwaj, V., Li, X., & Ko, C. C. (2000). Efficient partitioning and scheduling of computer vision and image processing data on bus networks using divisible load analysis. Image and Vision Computing, 18, 919-938. doi:10.1016/S0262-8856(99)00085-2
Bhatt, S. N., Chung, F. R. K., Leighton, F. T., & Rosenberg,
A. L. (1997). An optimal strategies for cycle-stealing in
networks of workstations. IEEE Transactions on Computers, 46(5), 545557. doi:10.1109/12.589220
Bhowmick, S. Eijkhout, V. Freund, Y. Fuentes, E. &
Keyes, D. (in press). Application of Machine Learning
in Selecting Sparse Linear Solver. Submitted for publication to the International Journal on High Performance
Computing Applications.
Bienkowski, M., Korzeniowski, M., & auf der Heide, F.
M. (2005). Dynamic load balancing in distributed hash
tables. In Proc. of IPTPS.
BIRN. (2008). Biomedical informatics research network.
Retrieved from www.nbirn.net/index.shtm
Bishop, P., & Warren, N. (2002). JavaSpaces in Practice.
New York: Addison Wesley.

Bloom, H. B. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM,
13(7), 422426. doi:10.1145/362686.362692
Bluetooth (2008). Retrieved November 2008 from www.
bluetooth.com
Boghosian, B., Coveney, P., Dong, S., Finn, L., Jha, S.,
Karniadakis, G. E., et al. (2006, June). Nektar, SPICE and
vortonics: Using federated Grids for large scale scientific
applications. In IEEE Workshop on Challenges of Large
Applications in Distributed Environments (CLADE).
Paris: IEEE Computing Society.
BOINC. (2008). Berkeley Open Infrastructure for Network Computing. Retrieved March 10, 2008 from http://
boinc.berkeley.edu
BOINCstats. (2008). Seti@home Project Statistics.
Retrieved March 10, 2008, from http://boincstats.com/
stats/project_graph.php?pr=sah
Bojadziew, G., & Bojadziew, M. (1997). Fuzzy Logic
for Business, Finance, and Management Modeling, (2nd
Ed.). Singapore: World Scientific Press.
Boley, D. L., Brent, R. P., Golub, G. H., & Luk, F. T. (1992).
Algorithmic fault tolerance using the lanczos method.
SIAM Journal on Matrix Analysis and Applications, 13,
312332. doi:10.1137/0613023

Black, F., & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. The Journal of Political Economy, 81(3). doi:10.1086/260062

Bolosky, W., Douceur, J., Ely, D., & Theimer, M. (2000). Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs. In Proceedings of SIGMETRICS.

Blackford, L. S., Choi, J., Cleary, A., Petitet, A., & Whaley,
R. C. Demmel, et al. (1996). ScaLAPACK: a portable
linear algebra library for distributed memory computers - design issues and performance. In Supercomputing
96: Proceedings of the 1996 ACM/IEEE conference on
Supercomputing (CDROM), (p. 5).

Bonomi, F., Mitzenmacher, M., Panigrahy, R., Singh, S., & Varghese, G. (2006). Beyond bloom filters: from approximate membership checks to approximate state machines. Paper presented at the Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer communications.

Blair, G. S., Coulson, G., Blair, L., Duran-Limon, H., Grace, P., Moreira, R., & Parlavantzas, N. (2002). Reflection, self-awareness and self-healing in OpenORB. In WOSS '02 Proceedings of the First Workshop on Self-Healing Systems, (pp. 9-14).

Boyd, C. (2008, March/April). Data-parallel computing. ACM Queue; Tomorrows Computing Today, 6(2).
doi:10.1145/1365490.1365499


Boyer, R., & Moore, J. (1977). A fast string searching algorithm. Communications of the ACM, 20(10), 762-772. doi:10.1145/359842.359859

Bulhões, P. T., Byun, C., Castrapel, R., & Hassaine, O. (2004, May). N1 Grid Engine 6 Features and Capabilities [White Paper]. Phoenix, AZ: Sun Microsystems.

Boyle, P. P. (1986). Option Valuing Using a Three Jump Process. International Options Journal, 3(2).

Burnett, I. (2006). MPEG-21: Digital Item Adaptation Coding Format Independence, Chichester, UK. Retrieved
15th June, 2008, from http://www.ipsi.fraunhofer.de/delite/projects/mpeg7/Documents/mpeg21-Overview4318.
htm#_Toc523031446.

Brandes, T. (1999). Exploiting advanced task parallelism in high performance Fortran via a task library. In Euro-Par '99: Proceedings of the 5th International Euro-Par Conference on Parallel Processing (pp. 833-844). London: Springer-Verlag.
Braun, T., Feyrer, S., Rapf, W., & Reinhardt, M. (2001).
Parallel Image Processing. Berlin: Springer-Verlag.
Brecht, T., Sandhu, H., Shan, M., & Talbot, J. (1996).
Paraweb: towards world-wide supercomputing. In Ew 7:
Proceedings of the 7th workshop on acm sigops european
workshop (pp. 181188). New York: ACM.
Bricker, A., Litzkow, M., & Livny, M. (1992). Condor
Technical Summary, Version 4.1b. Madison, WI: University of Wisconsin - Madison.
Brighten Godfrey, P., & Stoica, I. (2005). Heterogeneity
and load balance in distributed hash tables. In Proc. of
IEEE INFOCOM.
Broder, A., & Mitzenmacher, M. (2003). Network Applications of Bloom Filters: A Survey. Internet Mathematics,
1(4), 485509.
Bronevetsky, G., Marques, D., Schulz, M., Pingali, K.,
& Stodghill, P. (2004). Application-level checkpointing for shared memory programs. Proceedings of 11th
international conference on architectural support for
programming languages and operating systems.
Brune, M., Gehring, J., Keller, A., & Reinefeld, A.
(1999). Managing clusters of geographically distributed
high-performance computers. Concurrency (Chichester,
England), 11(15), 887911. doi:10.1002/(SICI)10969128(19991225)11:15<887::AID-CPE459>3.0.CO;2-J
Bruneo, D., Scarpa, M., Zaia, A., & Puliafito, A. (2003).
Communication paradigms for mobile grid users. In
CCGRID 03 (p. 669).

Burns, J., & Gaudiot, J.-L. (2002). SMT layout overhead and scalability. IEEE Transactions on Parallel and Distributed Systems, 13(2), 142-155. doi:10.1109/71.983942
Butt, A. R., Johnson, T. A., Zheng, Y., & Hu, Y. C. (2004).
Kosha: A Peer-to-Peer Enhancement for the Network File
System. In Proceeding of International Symposium On
Supercomputing SC04.
Butt, A. R., Zhang, R., & Hu, Y. C. (2003). A selforganizing flock of condors. In SC 03 Proceedings
of the ACM/IEEE Conference on Supercomputing, (p.
42). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/
SC.2003.10031
Buyya, R., Abramson, D., & Giddy, J. (2000). Nimrod/G:
An architecture for a resource management and scheduling system in a global computational grid. In Proceedings
of the 4th International Conference on High Performance
Computing in the Asia-Pacific Region. Retrieved from
www.csse.monash.edu.au/~davida/nimrod/nimrodg.
htm
Buyya, R., Abramson, D., & Giddy, J. (2000, June). An
economy driven resource management architecture for
global computational power grids. In 7th International
Conference on Parallel and Distributed Processing
Techniques and Applications (PDPTA 2000). Las Vegas,
AZ: CSREA Press.
Buyya, R., Abramson, D., & Venugopal, S. (2005). The
Grid Economy. IEEE Journal.


Buyya, R., Giddy, J., & Abramson, D. (2000). An evaluation of Economy-based Resource Trading and Scheduling
on Computational Power Grids for Parameter Sweep
Applications. Proceedings of the 2nd Workshop on Active Middleware Services, Pittsburgh, PA.

Cappello, F., Djilali, S., Fedak, G., Herault, T., Magniette, F., & Néri, V. (2004). Computing on large scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with grid. Future Generation Computer Science (FGCS).

Buyya, R., Yeo, C. S., & Venugopal, S. (2008, September). Market-oriented cloud computing: vision, hype, and
reality for delivering it services as computing utilities.
In HPCC08 Proceedings of the 10th IEEE International
Conference on High Performance Computing and Communications. Los Alamitos, CA: IEEE CS Press.

Cappello, P., Christiansen, B., Ionescu, M., Neary, M., Schauser, K., & Wu, D. (1997). Javelin: Internet-Based Parallel Computing Using Java. In Proceedings of the sixth ACM SIGPLAN symposium on principles and practice of parallel programming.

Byers, J., Considine, J., & Mitzenmacher, M. (2003, Feb.). Simple load balancing for distributed hash tables. In Proc. of IPTPS.
Cabrera, F., Copel, G., & Coxetal, B. (2002). Web Services
Transaction (WS- Transaction). Retrieved from http://
www.ibm.com/developerworks/library/ws-transpec.
Caesar, M., & Rexford, J. (2005, March). BGP routing
policies in ISP networks, (Tech. Rep. UCB/CSD-05-1377).
U. C. Berkeley, Berkeley, CA.
Camiel, N., London, S., Nisan, N., & Regev, O. (1997,
April). The PopCorn Project: Distributed computation
over the Internet in Java. In Proceedings of the 6th international world wide web conference.
Cannataro, M., & Talia, D. (2003). Towards the nextgeneration grid: A pervasive environment for knowledgebased computing. In Proceedings of the International
Conference on Information Technology: Computers and
Communications (pp.437-441), Italy.
Cannon, L. E. (1969). A cellular computer to implement
the kalman filter algorithm. Ph.D. thesis, Montana State
University, Bozeman, MT.
Cao, J., Jarvis, S. A., Saini, S., Kerbyson, D. J., & Nudd,
G. R. (2002). ARMS: An agent-based resource management system for grid computing. Science Progress,
10(2), 135148.


Carlsson, C., & Fullér, R. (2003). A Fuzzy Approach to Real Option Valuation. Journal of Fuzzy Sets and Systems, 39.
Caromel, D., di Costanzo, A., & Mathieu, C. (2007).
Peer-to-peer for computational Grids: Mixing clusters
and desktop machines. Parallel Computing, 33(45),
275288. doi:10.1016/j.parco.2007.02.011
Casa, J., Konuru, R., Prouty, R., Walpole, J., & Otto,
S. (1994). Adaptive Load Migration Systems for PVM.
Proceedings of supercomputing, (pp. 390-399). Washington D.C.
Casanova, H., Legrand, A., & Quinson, M. SimGrid: a
Generic Framework for Large-Scale Distributed Experimentations. In Proceedings of the 10th ieee international
conference on computer modelling and simulation (uksim/
eurosim08).
Casanova, H., Legrand, A., Zagorodnov, D., & Berman, F.
(2000, May). Heuristics for Scheduling Parameter Sweep
Applications in Grid Environments. In Proceedings of
the 9th heterogeneous computing workshop (hcw00)
(pp. 349363).
Casanova, H., Obertelli, G., Berman, F., & Wolski, R.
(2000, Nov.). The AppLeS Parameter Sweep Template:
User-Level Middleware for the Grid. In Proceedings of
supercomputing 2000 (sc00).
Castro, M., Costa, M., & Rowstron, A. (2004). Performance and Dependability of Structured Peer-to-Peer
Overlays. In Proceedings of the 2004 Intl. Conf. on Dependable Systems and Networks (pp. 9-18). New York:
IEEE Computer Society Press.


Castro, M., Druschel, P., Hu, Y. C., & Rowstron, A. (2002). Topology-aware routing in structured peer-to-peer overlay networks. In Future Directions in Distributed Computing.

Chang, R., & Chang, J. (2006). Adaptable replica consistency service for data grids. In Proceedings of the third
international conference on information technology:
New generations (ITNG06) (pp. 646651).

Catlett, C., Beckman, P., Skow, D., & Foster, I. (2006, May). Creating and operating national-scale cyberinfrastructure services. Cyberinfrastructure Technology Watch Quarterly, 2(2), 2-10.

Chang, R., & Chen, P. (2007). Complete and fragmented replica selection and retrieval in data grids. Future Generation Computer Systems, 23(4), 536-546. doi:10.1016/j.future.2006.09.006

Cazorla, F. J., Ramirez, A., Valero, M., & Fernandez, E. (2004). DCache Warn: an I-fetch policy to increase SMT efficiency. In Proceedings of the 18th International Parallel & Distributed Processing Symposium (IPDPS'04), (pp. 74-83). Santa Fe, NM: IEEE Computer Society Press.

Chang, R., Chang, J., & Lin, S. (2007). Job scheduling and data replication on data grids. Future Generation Computer Systems, 23(7), 846-860. doi:10.1016/j.future.2007.02.008

Cazorla, F. J., Ramirez, A., Valero, M., & Fernandez, E. (2004). Dynamically controlled resource allocation in SMT processors. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'04), (pp. 171-182). Portland, OR: IEEE Computer Society Press.
CDO2. (2008). CDOSheet for pricing and risk analysis.
Retrieved from www.cdo2.com

Chang, R., Wang, C., & Chen, P. (2005). Replica selection on co-allocation data grids. In Proceedings of the second international symposium on parallel and distributed processing and applications (Vol. 3358, pp. 584-593).
Chao, C.-H. (2006, April). An Interest-based architecture
for peer-to-peer network systems. In Proceedings of the
International Conference AINA.

Chaarawi, M., Squyres, J., Gabriel, E., & Feki, S. (2008). A Tool for Optimizing Runtime Parameters of Open MPI. Accepted for publication in EuroPVM/MPI, September 7-10, Dublin, Ireland.

Chapman, B. M., Mehrotra, P., van Rosendale, J., & Zima, H. P. (1994). A software architecture for multidisciplinary applications: integrating task and data parallelism. In CONPAR 94 - VAPP VI: Proceedings of the Third Joint International Conference on Vector and Parallel Processing (pp. 664-676). London: Springer-Verlag.

Chamberlain, B. L., Callahan, D., & Zima, H. P. (2007). Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3), 291-312. doi:10.1177/1094342007078442

Chapman, B., Haines, M., Mehrota, P., Zima, H., & van
Rosendale, J. (1997). Opus: A coordination language
for multidisciplinary applications. Science Progress,
6(4), 345362.

Chanchio, K., & Sun, X. H. (2001). Communication state transfer for the mobility of concurrent heterogeneous computing. Proceedings of the 2001 international conference on parallel processing.

Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications (pp. 519-538). New York: ACM.

Chandy, M., Foster, I., Kennedy, K., Koelbel, C., & Tseng, C.-W. (1994). Integrated support for task and data parallelism. The International Journal of Supercomputer Applications, 8(2), 80-98.


Chase, J. S., Amador, F. F., Lazowska, E. D., Levy, H. M., & Littlefield, R. J. (1996). The Amber system: Parallel programming on a network of multiprocessors. Proceedings of ACM symposium on operating system principles.

Chen, W., Liu, J., & Huang, H. (2004). An adaptive scheme for vertical handoff in wireless overlay networks. IEEE International Conference on Parallel and Distributed Systems (ICPADS) (pp. 541-548). Washington, DC: IEEE.

Chase, J. S., Irwin, D. E., Grit, L. E., Moore, J. D., & Sprenkle, S. E. (2003). Dynamic virtual clusters in a Grid site manager. In 12th IEEE International Symposium on High Performance Distributed Computing (HPDC 2003) (p. 90). Washington, DC: IEEE Computer Society.

Chen, Z., & Dongarra, J. (2005). Condition numbers of gaussian random matrices. SIAM Journal on
Matrix Analysis and Applications, 27(3), 603620.
doi:10.1137/040616413

Chaubal, Ch. (2003). Sun Grid Engine, Enterprise Edition: Software configuration guidelines and use cases. Sun Blueprints. Retrieved from www.sun.com/blueprints/0703/817-3179.pdf
Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N.,
& Shenker, S. (2003). Making gnutella-like p2p systems
scalable. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols
for Computer Communications (pp. 407-418).
Chen, C. M., Lee, S. Y., & Cho, Z. H. (1990). A Parallel Implementation of 3D CT Image Reconstruction on Hypercube Multiprocessor. IEEE Transactions on Nuclear Science, 37(3), 1333-1346. doi:10.1109/23.57385
Chen, C.-H., & Lee, C.-Y. (1999). A cost effective lighting processor for 3D graphics application. Proceedings
of International Conference on Image Processing, 2,
792796.
Chen, D. J., & Huang, T. H. (1992). Reliability analysis of
distributed systems based on a fast reliability algorithm.
IEEE Transactions on Parallel and Distributed Systems,
3(2), 139154. doi:10.1109/71.127256
Chen, D. J., Chen, R. S., & Huang, T. H. (1997). A heuristic
approach to generating file spanning trees for reliability
analysis of distributed computing systems. Computers
and Mathematics with Applications, 34(10), 115131.
doi:10.1016/S0898-1221(97)00210-1
Chen, T., Raghavan, R., Dale, J. N., & Iwata, E. (2007, Sept.). Cell Broadband Engine Architecture and its first implementation - A performance view. IBM Journal of Research and Development, 51(5), 559-572.


Cheng, A. H., & Joung, Y. J. (2006). Probabilistic file indexing and searching in unstructured peer-to-peer networks. Computer Networks, 50(1), 106-127. doi:10.1016/j.comnet.2005.12.008
Cheng, K., Xiang, L., Iwaihara, M., Xu, H., & Mohania,
M. M. (2005). Time-Decaying Bloom Filters for Data
Streams with Skewed Distributions. Paper presented
at the Proceedings of the 15th International Workshop
on Research Issues in Data Engineering: Stream Data
Mining and Applications.
Chervenak, A. (2002). Giggle: A framework for constructing scalable replica location services. In Proceedings of
the IEEE supercomputing (pp. 117).
Chervenak, A., Foster, L., Kesselman, C., Salisbury,
C., & Tueckem, S. (2000). The data grid: Towards an
architecture for the distributed management and analysis of large scientific data sets. Journal of Network and
Computer Applications, 23(3), 187200. doi:10.1006/
jnca.2000.0110
Chien, A., Calder, B., Elbert, S., & Bhatia, K. (2003).
Entropia: Architecture and performance of an enterprise desktop grid system. Journal of Parallel and
Distributed Computing, 63, 597610. doi:10.1016/S07437315(03)00006-6
Chinese National Grid (CNGrid) Project Web Site. (2007).
Retrieved from http://www.cngrid.org/
Chiueh, T., & Deng, P. (1996). Evaluation of checkpoint
mechanisms for massively parallel machines. In FTCS,
(pp. 370379).


Choi, S., & Yeung, D. (2006). Learning-based SMT processor resource distribution via hill-climbing. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA'06), (pp. 239-251). Boston: IEEE Computer Society Press.
Chonka, A., Zhou, W., Knapp, K., & Xiang, Y. (2008).
Protecting information systems from ddos attack using multicore methodology. Proceedings of IEEE 8th
International Conference on Computer and Information
Technology.
Chow, A. C., Gossum, G. C., & Brokenshire, D. A.
(2005). A programming example: Large fft on the cell
broadband engine. In Gspx. tech. conf. proc. of the global
signal processing expo.
Chrysanthis, P. K., & Ramamriham, K. (1994). Synthesis of extended transaction models using ACTA. ACM
Transactions on Database Systems, 19(3), 450491.
doi:10.1145/185827.185843
Chrysanthis, P., & Ramamriham, K. (Eds.). (1992).
ACTA: The SAGA continues. Transactions Models
for Advanced Database Applications. San Francisco:
Morgan Kaufmann.
Chu, D., & Humphrey, M. (2004, November 8). Bmobile
ogsi.net: Grid computing on mobile devices. In Grid
computing workshop (associated with supercomputing
2004), Pittsburgh, PA.
Chu, E., & George, A. (2000). Inside the fft black box:
Serial and parallel fast Fourier transform algorithms.
Boca Raton, FL: CRC Press LLC.
Chu, X., Nadiminti, K., Jin, C., Venugopal, S., &
Buyya, R. (2007, December). Aneka: Next-generation
enterprise grid platform for e-science and e-business
applications, e-Science07: In Proceedings of the 3rd
IEEE International Conference on e-Science and Grid
Computing, Bangalore, India (pp. 151-159). Los Alamitos,
CA: IEEE Computer Society Press. For more information, see http://doi.ieeecomputersociety.org/10.1109/ESCIENCE.2007.12

Chung, P. E. (1997). Checkpointing in cosmic: a userlevel process migration environment. Proceedings of


Pacific Rim International Symposium on Fault-Tolerant
Systems.
Ciancarini, P. (1996). Coordination Models and Languages as Software Integrators. ACM Comput. Surv., 28(2), 300-302. doi:10.1145/234528.234732
Ciarpaglini, S., Folchi, L., Orlando, S., Pelagatti, S., &
Perego, R. (2000). Integrating task and data parallelism
with taskHPF. In H. R. Arabnia (Ed.). Proceedings of
the International Conference on Parallel and Distributed
Processing Techniques and Applications, PDPTA 2000.
Las Vegas, NV: CSREA Press.
Cirne, W., Brasileiro, F., Andrade, N., Costa, L., Andrade,
A., & Novaes, R. (2006, September). Labs of the world,
unite!!! Journal of Grid Computing, 4(3), 225246.
doi:10.1007/s10723-006-9040-x
Clarke, B., & Humphrey, M. (2002, April 19). Beyond
the device as portal: Meeting the requirements of
wireless and mobile devices in the legion grid computing system. In 2nd International Workshop On Parallel
And Distributed Computing Issues In Wireless Networks
And Mobile Computing (associated with ipdps 2002), Ft.
Lauderdale, FL.
CloudCamp. (2008). Retrived from http://www.cloudcamp.com/
CNGrid GOS Project Web site. (2007). Retrieved from
http://vega.ict.ac.cn
Coffman, E. G., Galambos, G., Martello, S., & Vigo, D.
(1999). Bin-packing approximation algorithms: Combinatorial analysis. In D. Z. Du & P. M. Pardalos, (Ed.),
Handbook of Combinatorial Optimization, (pp. 151207).
Dondrecht, the Netherlands: Kluwer.
CoG Toolkit (n.d.). Retrieved from http://www.cogkit.
org/
Cohen, B. (2002). BitTorrent Protocol 1.0. Retrieved
from BitTorrent.org.


Cohen, B. (2003). Incentives build robustness in BitTorrent. In Workshop on economics of peer-to-peer systems,
Berkeley, CA.
Condor Team. (2006). CondorVersion 6.4.7 Manual.
Retrieved October 18, 2006, from www.cs.wisc.edu/
condor/manual/v6.4

Cray Inc. (2005). The Chapel language specification, version 0.4.
Cristianini, N., & Hahn, M. (2006). Introduction to
Computational Genomics. Cambridge, UK: Cambridge
University Press.

Corbató, F. J., & Vyssotsky, V. A. (1965). Introduction and overview of the Multics system. FJCC, Proc. AFIPS, 27(1), 185-196.

Cronk, M. H., & Mehrotra, P. (1997). Thread migration in the presence of pointers. Proceedings of the mini-track on multithreaded systems, 30th Hawaii International Conference on System Science.

Coronato, A., & Pietro, G. D. (2007). Mipeg: A middleware infrastructure for pervasive grids. Journal of Future
Generation Computer Systems.

Culler, D. E., Singh, J. P., & Gupta, A. (1998). Parallel computer architecture: a hardware/software approach (1st edition). San Francisco: Morgan Kaufmann.

IBM Corporation. (1993). IBM LoadLeveler: User's Guide.

Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, E., Subramonian, R., & von Eicken, T. (1993). LogP: Towards a realistic model of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, (pp. 1-12). New York: ACM Press.

Corradi, A., Leonardi, L., & Zambonelli, F. (1997). Performance comparison of load balancing policies based on
a diffusion scheme. In Proc. of the Euro-Par97 (LNCS
Vol. 1300). Springer: Germany.
Costa, F., Silva, L., Fedak, G., & Kelley, I. (2008, in press).
Optimizing the Data Distribution Layer of BOINC with
BitTorrent. In 2nd workshop on desktop grids and volunteer computing systems (pcgrid 2008), Miami, FL.
Cotroneo, D., Migliaccio, A., & Russo, S. (2007). The
Esperanto Broker: a communication platform for nomadic
computing systems. Software, Practice & Experience,
37(10), 10171046. doi:10.1002/spe.794
Cotton, W., Pielke, R., Walko, R., Liston, G., Tremback,
C., & Jiang, H. (2003). RAMS 2001: Current status and
future directions. Meteorology and Atmospheric Physics,
82(1-4), 529. doi:10.1007/s00703-001-0584-9
Coulson, G., Grace, P., Blair, G., Duce, D., Cooper, C.,
& Sagar, M. (2005, April). A middleware approach for
pervasive grid environments. In Uk-ubinet/ uk e-science
programme workshop on ubiquitous computing and
e-research.
Cox, J. C., Ross, S., & Rubinstein, M. (1979). Option
Pricing: A Simplified Approach. Journal of Financial
Economics, 3(7).


Czajkowski, K., Fitzgerald, S., Foster, I., & Kesselman, C. (2001). Grid information services for distributed resource sharing. 10th International Symposium on High Performance Distributed Computing (pp. 181-194). San Francisco: IEEE Computer Society Press.
Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., &
Stoica, I. (2001). Wide-Area Cooperative Storage with
CFS. In Proceedings of the 11th ACM Symp. on Operating Systems Principles (pp. 202-215). New York: ACM
Press.
Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., &
Stoica, I. (2003). Towards a common API for structured
peer-to-peer overlays. In IPTPS03 Proceedings of the
2nd International Workshop on Peer-to-Peer Systems,
(pp. 33-44). Heidelberg, Germany: SpringerLink. doi:
10.1007/b11823
Dai, Y. S., & Levitin, G. (2006). Reliability and
performance of tree-structured grid services . IEEE
Transactions on Reliability, 55(2), 337349. doi:10.1109/
TR.2006.874940


Dai, Y. S., Pan, Y., & Zou, X. K. (2006). A hierarchical modelling and analysis for grid service reliability. IEEE Transactions on Computers.
Dai, Y. S., Xie, M., & Poh, K. L. (2002), Reliability
analysis of grid computing systems. IEEE Pacific Rim
International Symposium on Dependable Computing
(PRDC2002), (pp. 97-104). New York: IEEE Computer
Press.
Dai, Y. S., Xie, M., & Poh, K. L. (2005). Markov renewal models for correlated software failures of multiple
types. IEEE Transactions on Reliability, 54(1), 100106.
doi:10.1109/TR.2004.841709
Dai, Y. S., Xie, M., & Poh, K. L. (2006).Availability
modeling and cost optimization for the grid resource
management system. IEEE Transactions on Systems,
Man, and Cybernetics. Part A . Systems and Humans: a
Publication of the IEEE Systems, Man, and Cybernetics
Society., 38(1), 170.
Dai, Y. S., Xie, M., Poh, K. L., & Liu, G. Q. (2003). A
study of service reliability and availability for distributed
systems. Reliability Engineering & System Safety, 79(1),
103112. doi:10.1016/S0951-8320(02)00200-4
Dai, Y. S., Xie, M., Poh, K. L., & Ng, S. H. (2004).
A model for correlated failures in N-version programming. IIE Transactions, 36(12), 11831192.
doi:10.1080/07408170490507729
Dalal, S., Temel, S., & Little, M. (2003). Coordinating
business transactions on the Web. IEEE Internet Computing, 7(1), 3039. doi:10.1109/MIC.2003.1167337
Dang, N. N., & Lim, S. B. (2007). Combination of replication and scheduling in data grids. International Journal
of Computer Science and Network Security, 7(3).
Dang, V. D. (2004). Coalition Formation and Operation
in Virtual Organisations. PhD thesis, Faculty of Engineering, Science and Mathematics, School of Electronics and Computer Science, University of Southampton,
Southampton, UK.

Das, S. K., Harvey, D. J., & Biswas, R. (2001). Parallel processing of adaptive meshes with load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(12), 1269-1280. doi:10.1109/71.970562
Davies, N., Friday, A., & Storz, O. (2004). Exploring
the grids potential for ubiquitous computing. IEEE
Pervasive Computing / IEEE Computer Society [and]
IEEE Communications Society, 3(2), 7475. doi:10.1109/
MPRV.2004.1316823
Davis, C. (2007). Could Android open door for cellphone
Grid computing? Retrieved March 10, 2008, from http://
www.google-phone.com/could-android-open-door-forcellphone-grid-computing-12217.php
de Assuno, M. D., & Buyya, R. (2008, December).
Performance analysis of multiple site resource provisioning: Effects of the precision of availability information
[Technical Report]. In International Conference on High
Performance Computing (HiPC 2008) (Vol. 5374, pp.
157168). Berlin/Heidelberg: Springer.
de Assuno, M. D., Buyya, R., & Venugopal, S. (2008,
June). InterGrid: A case for internetworking islands of
Grids. [CCPE]. Concurrency and Computation, 20(8),
9971024. doi:10.1002/cpe.1249
De Roure, D., Jennings, N., & Shadbolt, N. (2005,
March). The semantic grid: Past, present, and future.
Proceedings of the IEEE, 93(3), 669681. doi:10.1109/
JPROC.2004.842781
De Roure, M., & Surridge, D. (2003). Interoperability
challenges in Grid for industrial applications. GGF9
Semantic Grid Workshop, Chicago.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified
Data Processing on Large Clusters. In Osdi04: Sixth
symposium on operating system design and implementation, (pp. 137150). San Francisco, CA.
DECI. (2008). DEISA extreme computing initiative.
Retrieved from www.deisa.eu/science/deci


Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Patil, S., et al. (2004). Pegasus: Mapping scientific workflows onto the grid. In M. Dikaiakos (Ed.), AxGrids 2004, (LNCS 3165, pp. 11-20). Berlin: Springer Verlag.

Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2002). A border-based coordination language for integrating task and data parallelism. Journal of Parallel and Distributed Computing, 62(4), 715-740. doi:10.1006/jpdc.2001.1814

DEISA. (2008). Distributed European infrastructure for supercomputing applications. Retrieved from www.deisa.eu

Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2003). Domain interaction patterns to coordinate HPF tasks. Parallel Computing, 29(7), 925-951. doi:10.1016/S0167-8191(03)00064-4

Demers, A., Keshav, S., & Shenker, S. (1989). Analysis and Simulation of a Fair Queuing Algorithm. Proceedings of ACM SIGCOMM.
Deng, Z., Liu, J.W.S., Zhang, L., Mouna, S., & Frei, A.
(1999). An Open Environment for Real-Time Applications. Real-Time Systems Journal, 16(2/3).
DESHL. (2008). DEISA services for heterogeneous
management layer. http://forge.nesc.ac.uk/projects/
deisa-jra7/
Desprez, F., & Vernois, A. (2006). Simultaneous scheduling of replication and computation for data-intensive
applications on the grid. Journal of Grid Computing,
4(1), 6674. doi:10.1007/s10723-005-9016-2
D-Grid (2008). Retrieved from www.d-grid.de/index.
php?id=1&L=1

Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2004). SBASCO: Skeleton-based scientific components. In Proceedings of the 12th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP 2004) (pp. 318-325). Washington, DC: IEEE Computer Society.
Dikaiakos, M. D. (2007). Grid benchmarking: vision,
challenges, and current status. [New York: Wiley InterScience.]. Concurrency and Computation, 19, 89105.
doi:10.1002/cpe.1086
Dimitrakos, T., Golby, D., & Kearley, P. (2004, October).
Towards a trust and contract management framework for
dynamic virtual organisations. In eChallenges. Vienna,
Austria.

Dharmapurikar, S., & Lockwood, J. (2006, October). Fast and Scalable Pattern Matching for Network Intrusion Detection Systems. IEEE Journal on Selected Areas in Communications, 24(10).

Dimitrov, B., & Rego, V. (1998). Arachne: A portable threads system supporting migrant threads on
heterogeneous network farms. IEEE Transactions
on Parallel and Distributed Systems, 9(5), 459469.
doi:10.1109/71.679216

Dharmapurikar, S., Krishnamurthy, P., & Taylor, D. E. (2006). Longest prefix matching using bloom filters. IEEE/ACM Trans. Netw., 14(2), 397-409.

Ding, Q., Chen, G. L., & Gu, J. (2002). A unified resource mapping strategy in computational grid environments. Journal of Software, 13(7), 1303-1308.

Dharmapurikar, S., Krishnamurthy, P., Sproull, T. S., & Lockwood, J. W. (2004). Deep packet inspection using parallel bloom filters. IEEE Micro, 24(1), 52-61. doi:10.1109/MM.2004.1268997

Dixit, K. M. (1991). The SPEC benchmarks. Parallel Computing, 17(10-11), 1195-1209. doi:10.1016/S0167-8191(05)80033-X

Dheepak, R., Ali, S., Sengupta, S., & Chakrabarti, A. (2005). Study of scheduling strategies in a dynamic data grid environment. In Distributed Computing - IWDC 2004 (Vol. 3326). Berlin: Springer.


Dixit, S., & Wu, T. (2004). Content Networking in the Mobile Internet. New York: John Wiley & Sons.
Dixon, C., Bragin, T., Krishnamurthy, A., & Anderson,
T. (2006, September). Tit-for-Tat Distributed Resource
Allocation [Poster]. The ACM SIGCOMM 2006 Conference.


Domenici, A., Donno, F., Pucciani, G., & Stockinger, H. (2006). Relaxed data consistency with CONStanza. In Proceedings of the sixth IEEE international symposium on cluster computing and the grid (pp. 425-429).

Doolan, D. C., Tabirca, S., & Yang, L. T. (2006). Mobile Parallel Computing. In Proceedings of the Fifth International Symposium on Parallel and Distributed Computing (ISPDC 06), (pp. 161-167).

Domenici, A., Donno, F., Pucciani, G., Stockinger, H., & Stockinger, K. (2004, Nov). Replica consistency in a Data Grid. Nuclear Instruments and Methods in Physics Research, 534, 24-28. doi:10.1016/j.nima.2004.07.052

Dorigo, M. (1992). Optimization, learning and natural algorithms (Tech. Rep.). Ph.D. Thesis, Politecnico di Milano, Milan, Italy.

Domingues, P., Araujo, F., & Silva, L. M. (2006, December). A dht-based infrastructure for sharing checkpoints
in desktop grid computing. In Conference on e-science
and grid computing (escience 06), Amsterdam, The
Netherlands.
Donegan, B., Doolan, D. C., & Tabirca, S. (2008). Mobile Message Passing using a Scatternet Framework.
International Journal of Computers, Communications
& Control, 3(1), 5159.
Dong, X., Halevy, A. Y., & Yu, C. (2007). Data integration with uncertainty. In Vldb 07: Proceedings of the
33rd International Conference on Very Large Data Bases
(pp. 687698). VLDB Endowment.
Dongarra, J. J., & Eijkhout, V. (2003). Self-Adapting
Numerical Software for Next-Generation Applications. International Journal of High Performance Computing Applications, 17(2), 125131.
doi:10.1177/1094342003017002002
Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy,
K., Torczon, L., & White, A. (2003). Sourcebook of
parallel computing. San Francisco: Morgan Kaufmann
Publishers.
Dongarra, J., Luszczek, P., & Petitet, A. (2003, August). The LINPACK Benchmark: past, present and
future. Concurrency and Computation, 15(9), 803820.
doi:10.1002/cpe.728
Dongarra, J., Meuer, H., & Strohmaier, E. (2004). TOP500
Supercomputer Sites, 24th edition. In Proceedings of the
Supercomputing Conference (SC2004), Pittsburgh PA.
New York: ACM.

Dorta, A. J., González, J. A., Rodriguez, C., & de Sande, F. (2003). LLC: A parallel skeletal language. Parallel Processing Letters, 13(3), 437-448. doi:10.1142/S0129626403001409
Dorta, A. J., López, P., & de Sande, F. (2006). Basic skeletons in LLC. Parallel Computing, 32(7-8), 491-506. doi:10.1016/j.parco.2006.07.001
Douglis, F., & Ousterhout, J. K. (1991). Transparent
process migration: Design alternatives and the sprite
implementation. Software, Practice & Experience, 21(8),
757785. doi:10.1002/spe.4380210802
Draves, S. (2005, March). The electric sheep screen-saver:
A case study in aesthetic evolution. In 3rd european
workshop on evolutionary music and art.
Drozdowski, M., Lawenda, M., & Guinand, F. (2006). Scheduling multiple divisible loads. International Journal of High Performance Computing Applications, 20(1), 19-30. doi:10.1177/1094342006061879
Drozdowski, M., & Lawenda, M. (2005). On Optimum Multi-installment Divisible Load Processing in Heterogeneous Distributed Systems (LNCS 3648, pp. 231-240). Berlin: Springer.
Duan, R., Prodan, R., & Fahringer, T. (2006). Run-time optimization for Grid workflow applications. International Conference on Grid Computing. Barcelona, Spain: IEEE Computer Society Press.

Dumitrescu, C., & Foster, I. (2004). Usage policy-based CPU sharing in virtual organizations. In Proceedings of the fifth IEEE/ACM international workshop on grid computing (pp. 53–60).


Dumitrescu, C., & Foster, I. (2004). Usage policy-based CPU sharing in virtual organizations. In 5th IEEE/ACM International Workshop on Grid Computing (Grid 2004) (pp. 53–60). Washington, DC: IEEE Computer Society.

Dümmler, J., Rauber, T., & Rünger, G. (2008). Mapping algorithms for multiprocessor tasks on multi-core clusters. In Proceedings of the 37th International Conference on Parallel Processing (ICPP'08). New York: IEEE Computer Society.

Dumitrescu, C., & Foster, I. (2005, August). GRUBER: A Grid resource usage SLA broker. In J. C. Cunha & P. D. Medeiros (Eds.), Euro-Par 2005 (Vol. 3648, pp. 465–474). Berlin/Heidelberg: Springer.

Duvvuri, V., Shenoy, P., & Tewari, R. (2000). Adaptive leases: A strong consistency mechanism for the World Wide Web. In Proceedings of IEEE INFOCOM (pp. 834–843).

Dumitrescu, C., Raicu, I., & Foster, I. (2005). DI-GRUBER: A distributed approach to Grid resource brokering. In 2005 ACM/IEEE Conference on Supercomputing (SC 2005) (p. 38). Washington, DC: IEEE Computer Society.

Edelman, A. (1988). Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9(4), 543–560. doi:10.1137/0609045

Dumitrescu, C., Wilde, M., & Foster, I. (2005, June). A model for usage policy-based resource allocation in Grids. In 6th IEEE International Workshop on Policies for Distributed Systems and Networks (pp. 191–200). Washington, DC: IEEE Computer Society.
Dümmler, J., Kunis, R., & Rünger, G. (2007). A scheduling toolkit for multiprocessor-task programming with dependencies. In Proceedings of the 13th International Euro-Par Conference (pp. 23–32). Berlin: Springer.

Dümmler, J., Kunis, R., & Rünger, G. (2007). A comparison of scheduling algorithms for multiprocessor-tasks with precedence constraints. In Proceedings of the 2007 High Performance Computing & Simulation (HPCS'07) Conference (pp. 663–669). ECMS.

Dümmler, J., Rauber, T., & Rünger, G. (2007). Communicating multiprocessor-tasks. In Proceedings of the 20th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2007). Berlin: Springer.

Dümmler, J., Rauber, T., & Rünger, G. (2008). A transformation framework for communicating multiprocessor-tasks. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2008) (pp. 64–71). New York: IEEE Computer Society.


Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., Stamm, R. L., & Tullsen, D. M. (1997). Simultaneous multithreading: a platform for next-generation processors. IEEE Micro, 17(5), 12–19. doi:10.1109/40.621209
Elias, J. A., & Moldes, L. N. (2002). Behaviour of the fast
consistency algorithm in the set of replicas with multiple
zones with high demand. In Proceedings of symposium
in informatics and telecommunications.
Elias, J. A., & Moldes, L. N. (2002). A demand based algorithm for rapid updating of replicas. In Proceedings of IEEE workshop on resource sharing in massively distributed systems (pp. 686–691).

Elias, J. A., & Moldes, L. N. (2003). Generalization of the fast consistency algorithm to a grid with multiple high demand zones. In Proceedings of international conference on computational science (ICCS 2003) (pp. 275–284).
El-Moursy, A., & Albonesi, D. H. (2003). Front-end policies for improved issue efficiency in SMT processors. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA'03), (pp. 31-40). Anaheim, CA: IEEE Computer Society Press.

Elmroth, E., & Gardfjäll, P. (2005, December). Design and evaluation of a decentralized system for Grid-wide fairshare scheduling. In 1st IEEE International Conference on e-Science and Grid Computing (pp. 221–229). Melbourne, Australia: IEEE Computer Society Press.


Enabling Grids for E-sciencE (EGEE) project. (2005). Retrieved from http://public.eu-egee.org.

EnginFrame. (2008). Grid and cloud portal. Retrieved from www.nice-italy.com

Epema, D. H. J., Livny, M., van Dantzig, R., Evers, X., & Pruyne, J. (1996). A worldwide flock of condors: Load sharing among workstation clusters. Future Generation Computer Systems, 12(1), 53–65. doi:10.1016/0167-739X(95)00035-Q
ERCIM. (2005). Multimedia Informatics. ERCIM News,
62.
Erl, T. (2005). Service-Oriented Architecture (SOA):
Concepts, Technology, and Design. Upper Saddle River,
NJ: Prentice Hall.
Evans, J. J., Hood, C. S., & Gropp, W. D. (2003). Exploring the Relationship Between Parallel Application Run-Time Variability and Network Performance. In Proceedings of the Workshop on High-Speed Local Networks (HSLN), IEEE Conference on Local Computer Networks (LCN), (pp. 538-547).

Factor, M., Schuster, A., & Shagin, K. (2003). JavaSplit: a runtime for execution of monolithic Java programs on heterogeneous collections of commodity workstations. Paper presented at the Proceedings of the IEEE International Conference on Cluster Computing.

Fagg, G. E., & Dongarra, J. (2000). FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In PVM/MPI 2000, (pp. 346–353).
Fagg, G. E., Gabriel, E., Bosilca, G., Angskun, T., Chen,
Z., Pjesivac-Grbovic, J., et al. (2004). Extending the
MPI specification for process fault tolerance on high
performance computing systems. In Proceedings of the
International Supercomputer Conference, Heidelberg,
Germany.

Fagg, G. E., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., & Pjesivac-Grbovic, J. (2005). Process fault-tolerance: Semantics, design and applications for high performance computing. [Winter.]. International Journal of High Performance Computing Applications, 19(4), 465–477. doi:10.1177/1094342005056137
Fahringer, T., Prodan, R., Duan, R., Hofer, J., Nadeem, F., Nerieri, F., et al. (2006). ASKALON: A development and grid computing environment for scientific workflows. In I. J. Taylor, E. Deelman, D. Gannon, & M. Shields (Eds.), Workflows for e-Science (p. 530). Berlin: Springer Verlag.
Faraj, A., Patarasuk, P., & Yuan, X. (2007). A Study of Process Arrival Patterns for MPI Collective Operations. International Conference on Supercomputing, (pp. 168–179).

Faraj, A., Yuan, X., & Lowenthal, D. (2006). STAR-MPI: self tuned adaptive routines for MPI collective operations. In ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, (pp. 199-208). New York: ACM Press.
Fasttrack product description. (2001). http://www.fasttrack.nu/index.html.
Fedak, G., Germain, C., Neri, V., & Cappello, F. (2001, May). XtremWeb: A Generic Global Computing System. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid'01).

Fedak, G., Germain, C., Neri, V., & Cappello, F. (2002, May). XtremWeb: A generic global computing system. In CCGRID'01: Proceedings of the First IEEE Conference on Cluster and Grid Computing, Workshop on Global Computing on Personal Devices, Brisbane, (pp. 582–587). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/CCGRID.2001.923246
Fedak, G., He, H., & Cappello, F. (2008, November). BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. In Proceedings of the ACM/IEEE Supercomputing Conference (SC'08), Austin, TX.


Medical Data Federation: The Biomedical Informatics Research Network. (2003). In I. Foster & C. Kesselman (Eds.), The grid, blueprint for a new computing infrastructure (2nd ed.). San Francisco: Morgan Kaufmann.

Foster, I. (2002). What is the Grid? A three point checklist. Retrieved from http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf

Fernando, R., Harris, M., Wloka, M., & Zeller, C. (2004). Programming graphics hardware. In Tutorial on EUROGRAPHICS. NVIDIA Corporation.

Foster, I. (2006). Globus toolkit version 4: Software for service-oriented systems. In Proceedings of the international conference on network and parallel computing (pp. 2–13).

Fernandess, Y., & Malkhi, D. (2006). On Collaborative Content Distribution using Multi-Message Gossip. In Proceedings of the international parallel and distributed processing symposium. Rhodes Island, Greece: IEEE.

Foster, I., Kesselman, C., & Tuecke, S. (2002). The anatomy of the Grid: Enabling scalable virtual organizations. Retrieved from www.globus.org/alliance/publications/papers/anatomy.pdf

Ferrari, A. J., Chapin, S. J., & Grimshaw, A. S. (1997). Process introspection: A heterogeneous checkpoint/restart mechanism based on automatic code modification (Technical Report CS-97-05). University of Virginia, Charlottesville, VA.

Foster, I. T., & Chandy, K. M. (1995). Fortran M: A language for modular parallel programming. Journal of Parallel and Distributed Computing, 26(1), 24–35. doi:10.1006/jpdc.1995.1044

Fink, S. J. (1998). A programming model for block-structured scientific calculations on SMP clusters. Doctoral thesis, University of California, San Diego, CA.

Fischer, L. (Ed.). (2004). Workflow Handbook 2004. Lighthouse Point, FL: Future Strategies Inc.

Fisk, A. (2003). Gnutella dynamic query protocol v. 0.1. Retrieved from http://www9.limewire.com/developer/dynamic query.html.

Folding@home. (2008). Client statistics by OS. Retrieved March 10, 2008, from http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats
Fontán, J., Vázquez, T., Gonzalez, L., Montero, R. S., & Llorente, I. M. (2008, May). OpenNEbula: The open source virtual machine manager for cluster computing. In Open Source Grid and Cluster Software Conference Book of Abstracts. San Francisco.

Foster, I. (2000). Internet computing and the emerging grid. Nature. Retrieved from www.nature.com/nature/webmatters/grid/grid.html

Foster, I. (2002). The grid: A new infrastructure for 21st century science. Physics Today, 55, 42–47. doi:10.1063/1.1461327


Foster, I. T., & Iamnitchi, A. (2003). On death, taxes, and the convergence of peer-to-peer and grid computing. (LNCS 2735, pp. 118-128).

Foster, I., & Kesselman, C. (1997). Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications and High Performance Computing, 11(2), 115–128. doi:10.1177/109434209701100205

Foster, I., & Kesselman, C. (1999). The Grid: Blueprint for a New Computing Infrastructure. San Francisco: Morgan Kaufmann Publishers, Inc.

Foster, I., & Kesselman, C. (2003). The Grid 2: Blueprint for a new computing infrastructure. San Francisco: Morgan-Kaufmann.

Foster, I., & Kesselman, C. (2004). The Grid: Blueprint for a future computing infrastructure (2nd ed.). San Francisco: Morgan Kaufmann.
Foster, I., Freeman, T., Keahey, K., Scheftner, D., Sotomayor, B., & Zhang, X. (2006, May). Virtual clusters for Grid communities. In 6th IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2006) (pp. 513–520). Washington, DC: IEEE Computer Society.


Foster, I., Kesselman, C., & Nick, J. (2002). Grid services for distributed system integration. IEEE Computer, 35(6), 37–46.

Fox, G., Hiranandani, S., Kennedy, K., Koelbel, C., Kremer, U., Tseng, C.-W., et al. (1990). Fortran D Language Specification (No. CRPC-TR90079). Houston, TX.

Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200–222.

Fox, G., Williams, R., & Messina, P. (1994). Parallel computing works! San Francisco: Morgan Kaufmann Publishers.

Foster, I., Kesselman, C., Nick, J. M., & Tuecke, S. (2002). Grid services for distributed system integration. Computer, 35(6), 37–46. doi:10.1109/MC.2002.1009167

Foster, I., Kesselman, C., Nick, J., & Tuecke, S. (2002). The physiology of the grid: An open grid services architecture for distributed systems integration. Retrieved from citeseer.nj.nec.com/foster02physiology.html

Foster, I., Kesselman, C., Tsudik, G., & Tuecke, S. (1998). A Security Architecture for Computational Grids. ACM Conference on Computer and Communications Security.

Foster, I., Kohr, D. R., Krishnaiyer, R., & Choudhary, A. (1996). Double standards: Bringing task parallelism to HPF via the message passing interface. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (pp. 36-36). New York: IEEE Computer Society.

FreePastry. (2008, November). Retrieved from http://freepastry.rice.edu/FreePastry

Frey, J., Mori, T., Nick, J., Smith, C., Snelling, D., Srinivasan, L., & Unger, J. (2005). The open grid services architecture, Version 1.0. Retrieved from www.ggf.org/ggf_areas_architecture.htm

Frey, J., Tannenbaum, T., Livny, M., Foster, I. T., & Tuecke, S. (2001, August). Condor-G: A computation management agent for multi-institutional Grids. In 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001) (pp. 55–63). San Francisco: IEEE Computer Society.

Frigo, M., & Johnson, S. (2005). The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2), 216–231. doi:10.1109/JPROC.2004.840301

Fourment, M., & Gillings, M. R. (2008, February). A comparison of common programming languages used in bioinformatics. Bioinformatics (Oxford, England), 9.

Fritsch, D., Klinec, D., & Volz, S. (2000). NEXUS positioning and data management concepts for location aware applications. In the 2nd International Symposium on Telegeoprocessing (Nice-Sophia-Antipolis, France), (pp. 171-184).

Fowler, M. (2008, November). Inversion of control containers and the dependency injection pattern. Retrieved
from http://www.martinfowler.com/articles/injection.
html

Fu, S., Xu, C. Z., & Shen, H. (2008, April). Random choices for churn resilient load balancing in peer-to-peer networks. In Proc. of IEEE International Parallel and Distributed Processing Symposium.

Fox, G., & Gannon, D. (2001). Computational grids. Computing in Science & Engineering, 3(4), 74–77. doi:10.1109/5992.931906

Fu, Y., Chase, J., Chun, B., Schwab, S., & Vahdat, A. (2003). SHARP: An architecture for secure resource peering. In 19th ACM Symposium on Operating Systems Principles (SOSP 2003) (pp. 133–148). New York: ACM Press.

Fox, G. C., Johnson, M., Lyzenga, G., Otto, S. W., Salmon, J., & Walker, D. (1988). Solving Problems on Concurrent Processors: Vol. 1. Englewood Cliffs, NJ: Prentice-Hall.

Furmento, N., Hau, J., Lee, W., Newhouse, S., & Darlington, J. (2003). Implementations of a service-oriented architecture on top of Jini, JXTA and OGSA. In Proceedings of UK e-Science All Hands Meeting.


Gabriel, E., Fagg, G., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., et al. (2004). Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In D. Kranzlmueller, P. Kacsuk, & J. J. Dongarra (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface (LNCS, Vol. 3241, pp. 97-104). Berlin: Springer.

Gabriel, E., & Huang, S. (2007). Runtime optimization of application level communication patterns. In Proceedings of the 2007 International Parallel and Distributed Processing Symposium, 12th International Workshop on High-Level Parallel Programming Models and Supportive Environments, (p. 185).

Gaddah, A., & Kunz, T. (2003). A survey of middleware paradigms for mobile computing. Carleton University and Computing Engineering [Research Report]. Retrieved June 15th, 2008, from http://www.sce.carleton.ca/wmc/middleware/middleware.pdf
Gao, L. (2001). On inferring autonomous system relationships in the Internet. IEEE/ACM Transactions on Networking, 9(6), December.

Garbacki, P., Biskupski, B., & Bal, H. (2005). Transparent fault tolerance for grid application. In P. M. Sloot (Ed.), Advances in Grid Computing - EGC 2005, (pp. 671-680). Berlin: Springer Verlag.

Garcés-Erice, L., Biersack, E. W., Felber, P. A., Ross, K. W., & Urvoy-Keller, G. (2003). Hierarchical Peer-to-Peer Systems. In Proceedings of the 9th Intl. Euro-Par Conf. (pp. 1230-1239). Berlin: Springer-Verlag.

Garcia-Molina, H., & Salem, K. (1987). SAGAS. In Proceedings of ACM SIGMOD'87, International Conference on Management of Data, 16(3), 249-259.
Garey, M. R., & Johnson, D. S. (1979). Computers and
Intractability. San Francisco: Freeman.
GAT. (2005). Grid application toolkit. www.gridlab.org/
WorkPackages/wp-1/
Gelenbe, E. (1979). On the optimum checkpoint interval. Journal of the ACM, 26(2), 259–270. doi:10.1145/322123.322131


Gentzsch, W. (2008). Top 10 rules for building a sustainable Grid. In Grid thought leadership series. Retrieved from www.ogf.org/TLS/?id=1

Gentzsch, W. (2002). Response to Ian Foster's "What is the Grid?" GRIDtoday, August 5. Retrieved from www.gridtoday.com/02/0805/100191.html

Gentzsch, W. (2004). Enterprise resource management: Applications in research and industry. In I. Foster & C. Kesselman (Eds.), The Grid 2: Blueprint for a new computing infrastructure (pp. 157–166). San Francisco: Morgan Kaufmann Publishers.

Gentzsch, W. (2004). Grid computing adoption in research and industry. In A. Abbas (Ed.), Grid computing: A practical guide to technology and applications (pp. 309–340). Florence, KY: Charles River Media Publishers.

Gentzsch, W. (2007). Grid initiatives: Lessons learned and recommendations. RENCI Report. Retrieved from www.renci.org/publications/reports.php

Gentzsch, W. (Ed.). (2007). A sustainable Grid infrastructure for Europe. Executive Summary of the e-IRG Open Workshop on e-Infrastructures, Heidelberg, Germany. Retrieved from www.e-irg.org/meetings/2007-DE/workshop.html
GEONgrid. (2008). Retrieved from www.geongrid.org
Georgakopoulos, D., Hornick, M., & Sheth, A. (1995). An overview of workflow management: From process modeling to workflow automation infrastructure. Distributed and Parallel Databases, 3(2), 119–153. doi:10.1007/BF01277643
Ghare, G., & Leutenegger, L. (2004, June). Improving
Speedup and Response Times by Replicating Parallel Programs on a SNOW. In Proceedings of the 10th workshop
on job scheduling strategies for parallel processing.
Ghinita, G., & Teo, Y. M. (2006). An adaptive stabilization
framework for distributed hash tables. In Proceedings of
the 20th IEEE Intl. Parallel and Distributed Processing
Symp. New York: IEEE Computer Society Press.


Ghodsi, A., Alima, L. O., & Haridi, S. (2005). Low-bandwidth topology maintenance for robustness in structured overlay networks. In Proceedings of 38th Hawaii Intl. Conf. on System Sciences (p. 302). New York: IEEE Computer Society Press.

Ghodsi, A., Alima, L. O., & Haridi, S. (2005). Symmetric replication for structured peer-to-peer systems. In Proceedings of the 3rd Intl. Workshop on Databases, Information Systems and Peer-to-Peer Computing (p. 12). Berlin: Springer-Verlag.

Ghormley, D., Petrou, D., Rodrigues, S., Vahdat, A., & Anderson, T. (1998, July). GLUnix: A global layer unix for a network of workstations. Software, Practice & Experience, 28(9), 929. doi:10.1002/(SICI)1097-024X(19980725)28:9<929::AID-SPE183>3.0.CO;2-C

Ghose, F., Grossklags, J., & Chuang, J. (2003). Resilient Data-Centric Storage in Wireless Ad-Hoc Sensor Networks. Proceedings of the 4th International Conference on Mobile Data Management (MDM'03), (pp. 45-62).
Gill, P. E., Murray, W., & Wright, M. H. (1993). Practical Optimization. London: Academic Press Ltd.

Gilmont, T., Legat, J.-D., & Quisquater, J.-J. (1999). Enhancing the security in the memory management unit. In Proceedings of the 25th EuroMicro Conference (EUROMICRO'99), 1, 449-456. Milan, Italy: IEEE Computer Society Press.

Gkantsidis, C., & Rodriguez, P. (2005, March). Network Coding for Large Scale Content Distribution. In Proceedings of IEEE INFOCOM 2005, Miami, USA.
gLite - Lightweight Middleware for Grid Computing.
(2005). Retrieved from http://glite.web.cern.ch/glite.
Globus: Grid security infrastructure (GSI) (n.d.). Retrieved from http://www.globus.org/security/
Globus: The grid resource allocation and management
(GRAM) (n.d.). Retrieved from http://www.globus.org/
toolkit/docs/3.2/gram/

Goderis, D., et al. (2001, July). Service Level Specification Semantics and parameters: draft-tequila-sls-01.txt [Internet Draft].

Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2004). Load balancing in dynamic structured p2p systems. In Proceedings of INFOCOM (pp. 2253–2262). New York: IEEE Press.

Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2006). Load balancing in dynamic structured P2P systems. Performance Evaluation, 63(3).

Godfrey, P. B., & Stoica, I. (2005). Heterogeneity and load balance in distributed hash tables. In Proceedings of INFOCOM (pp. 596-606). New York: IEEE Press.

Goldberg, D. E. (1989). Genetic algorithm: In search, optimization and machine learning. New York: Addison-Wesley.
Golding, R. A. (1992, Dec). Weak-consistency group
communication and membership (Tech. Rep.). Computer
and Information Sciences, University of California,
Ph.D. Thesis.
Goller, A. (1999). Parallel and Distributed Processing of
Large Image Data Sets. Doctoral Thesis, Graz University
of Technology, Graz, Austria.
Goller, A., & Leberl, F. (2000). Radar Image Processing
with Clusters of Computers. Paper presented at the IEEE
Conference on Aerospace.
Golub, G. H., & Van Loan, C. F. (1989). Matrix Computations. Baltimore, MD: The Johns Hopkins University Press.
Golumbic, M. C. (1980). Algorithmic Graph Theory and
Perfect Graphs. New York: Academic Press.
Gong, L. (2001, June). JXTA: A network programming
environment. IEEE Internet Computing, 5(3), 88-95. Los
Alamitos, CA: IEEE Computer Society. Retrieved from
http://doi.ieeecomputersociety.org/10.1109/4236.93518

Goad, W. B. (1987). Sequence analysis. Los Alamos Science, (Special Issue), 288–291.


Gontmakher, A., Mendelson, A., Schuster, A., & Shklover, G. (2006). Speculative synchronization and thread management for fine granularity threads. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA'06), (pp. 278-287). Austin, TX: IEEE Computer Society Press.

Gonzalez-Castano, F. J., Vales-Alonso, J., Livny, M., Costa-Montenegro, E., & Anido-Rifo, L. (2003). Condor grid computing from mobile handheld devices. SIGMOBILE Mobile Comput. Commun. Rev., 7(1), 117–126. doi:10.1145/881978.882005

Gonzalo, C., & García-Martín, M.-A. (2006). The 3G IP Multimedia Subsystem (IMS): Merging the Internet and the Cellular Worlds. New York: Wiley.

Goodale, T., Jha, S., Kaiser, H., Kielmann, T., Kleijer, P., Merzky, A., et al. (2008). A simple API for Grid applications (SAGA). Grid Forum Document GFD.90. Open Grid Forum. Retrieved from www.ogf.org/documents/GFD.90.pdf

Goodman, D. J., Borras, J., Mandayam, N. B., & Yates, R. D. (1997). INFOSTATIONS: A new system model for data and messaging services. Proceedings of the 47th IEEE Vehicular Technology Conference (VTC), Phoenix, AZ, (Vol. 2, pp. 969–973).
Google (2008). Google App Engine. Retrieved from http://
code.google.com/appengine/
Google App Engine. (2008, November). Retrieved from
http://appengine.google.com
Google Groups. (2008). Cloud computing. Retrieved from
http://groups.google.ca/group/cloud-computing
Govindaraju, M., Krishnan, S., Chiu, K., Slominski,
A., Gannon, D., & Bramley, R. (2002, June). Xcat 2.0:
A component-based programming model for grid web
services (Tech. Rep. No. Technical Report-TR562). Dept.
of C.S., Indiana Univ., South Bend, IN.
Grabowski, P., Lewandowski, B., & Russell, M. (2004). Access from J2ME-enabled mobile devices to grid services. In Proceedings of Mobility Conference 2004, Singapore.


Grama, A., Gupta, A., Kumar, V., & Karypis, G. (2003). Introduction to parallel computing. Upper Saddle River, NJ: Pearson Education Limited.

Grassi, V., Donatiello, L., & Iazeolla, G. (1988). Performability evaluation of multicomponent fault tolerant systems. IEEE Transactions on Reliability, 37(2), 216–222. doi:10.1109/24.3744

Graupner, S., Kotov, V., Andrzejak, A., & Trinks, H. (2002, August). Control Architecture for Service Grids in a Federation of Utility Data Centers (Technical Report No. HPL-2002-235). Palo Alto, CA: HP Laboratories Palo Alto.

Gray, A. A., Arabshahi, P., Lamassoure, E., Okino, C., & Andringa, J. (2004). A Real Option Framework for Space Mission Design. Technical report, National Aeronautics and Space Administration (NASA).

Gray, J. (1981). The transaction concept: Virtues and limitations. In Proceedings of the 7th International Conference on VLDB, (pp. 144-154).
Grelck, C., Scholz, S.-B., & Shafarenko, A. V. (2007). Coordinating data parallel SAC programs with S-Net. In Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007) (pp. 1–8). New York: IEEE.

Grid Computing, IBM. (n.d.). Retrieved from http://www-1.ibm.com/grid/

Grid Engine. (2001). Open source project. Retrieved from http://gridengine.sunsource.net/

Grid Interoperability Now Community Group (GIN-CG). (2006). Retrieved from http://forge.ogf.org/sf/projects/gin.

GridFTP. (n.d.). Retrieved from http://www.globus.org/toolkit/docs/4.0/data/gridftp/

GridSphere. (2008). Retrieved from www.gridsphere.org/gridsphere/gridsphere

GridWay. (2008). Metascheduling technologies for the grid. Retrieved from www.gridway.org/


Grigg, A. (2002). Reservation-Based Timing Analysis: A Partitioned Timing Analysis Model for Distributed Real-Time Systems (YCST-2002-10). York, UK: University of York, Dept. of Computer Science.

Grigoras, D. (2005). Service-oriented Naming Scheme for Wireless Ad Hoc Networks. In the Proceedings of the NATO ARW Concurrent Information Processing and Computing, July 3-10 2003, Sinaia, Romania, 2005, (pp. 60-73). Amsterdam: IOS Press.

Grigoras, D., & Riordan, M. (2007). Cost-effective mobile ad hoc networks management. Future Generation Computer Systems, 23(8), 990–996. doi:10.1016/j.future.2007.04.001

Grigoras, D., & Zhao, Y. (2007). Simple Self-management of Mobile Ad Hoc Networks. In Proc. of the 9th IFIP/IEEE International Conference on Mobile and Wireless Communication Networks, 19-21 September 2007, Cork, Ireland.
Grimme, C., Lepping, J., & Papaspyrou, A. (2008, April). Prospects of collaboration between compute providers by means of job interchange. In Job Scheduling Strategies for Parallel Processing (Vol. 4942, pp. 132-151). Berlin/Heidelberg: Springer.

Grimshaw, A. S., & Wulf, W. A. (1997). The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), 39–45. doi:10.1145/242857.242867

Grit, L. E. (2005, October). Broker Architectures for Service-Oriented Systems [Technical Report]. Durham, NC: Department of Computer Science, Duke University.

Grit, L. E. (2007). Extensible Resource Management for Networked Virtual Computing. PhD thesis, Department of Computer Science, Duke University, Durham, NC. (Adviser: Jeffrey S. Chase)

Gschwind, M., Hofstee, H. P., Flachs, B., Hopkins, M., Watanabe, Y., & Yamazaki, T. (2006). Synergistic Processing in Cell's Multicore Architecture. IEEE Computer Society, 0272-1732/06.

Guiffaut, C., & Mahdjoubi, K. (2001, April). A Parallel FDTD Algorithm Using the MPI Library. IEEE Antennas and Propagation Magazine, 43(2), 94–103.

Gummadi, K., Gummadi, R., Gribble, S., Ratnasamy, S., Shenker, S., & Stoica, I. (2003). The impact of DHT routing geometry on resilience and proximity. In Proceedings of ACM SIGCOMM (pp. 381-394). New York: ACM Press.

Guo, D., Wu, J., Chen, H., & Luo, X. (2006). Theory and Network Applications of Dynamic Bloom Filters. Paper presented at INFOCOM 2006, 25th IEEE International Conference on Computer Communications.

Guo, S.-F., Zhang, W., Ma, D., & Zhang, W.-L. (2004, Aug.). Grid mobile service: using mobile software agents in grid mobile service. In Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, 1, 178-182.
Gupta, A., Sahin, O. D., Agarwal, D., & El Abbadi, A. (2004). Meghdoot: Content-based publish/subscribe over peer-to-peer networks. In Middleware '04: Proceedings of the 5th ACM/IFIP/USENIX International Conference on Middleware, (pp. 254-273). Heidelberg, Germany: SpringerLink. doi:10.1007/b101561

Gupta, I., Birman, K., Linga, P., Demers, A., & Renesse, R. V. (2003). Kelips: Building an efficient and stable P2P DHT through increased memory and background overhead. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 160-169). Berlin: Springer-Verlag.

Gustafson, J. (1988). Reevaluating Amdahl's law. Communications of the ACM, 31, 532–533. doi:10.1145/42411.42415

Haahr, M., Cunningham, R., & Cahill, V. (1999). Supporting CORBA applications in a mobile environment. In MobiCom '99: Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, (pp. 36-47).

GSI (Globus Security Infrastructure). Retrieved from http://www.globus.org/Security/


Hailong, C., & Jun, W. (2004). Foreseer: a novel, locality-aware peer-to-peer system architecture for keyword searches. Paper presented at the Proceedings of the 5th ACM/IFIP/USENIX International Conference on Middleware.

Haji, M. H., Gourlay, I., Djemame, K., & Dew, P. M. (2005). A SNAP-based community resource broker using a three-phase commit protocol: A performance study. The Computer Journal, 48(3), 333–346. doi:10.1093/comjnl/bxh088

Hakami, S. (1999). Optimum location of switching centers and the absolute centers and medians of a graph. Operations Research, 12, 450–459. doi:10.1287/opre.12.3.450
Halsall, F. (2000). Multimedia Communications: Applications, Networks, Protocols and Standards (Hardcover).
New York: Addison Wesley.
Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M., & Olukotun, K. (2000). The Stanford Hydra CMP. IEEE Micro, 20(2), 71–84. doi:10.1109/40.848474

Hammond, L., Nayfeh, B. A., & Olukotun, K. (1997). A single-chip multiprocessor. IEEE Computer, 30(9), 79–85.
Hammond, L., Wong, V., Chen, M., Carlstrom, B. D., Davis, J. D., & Hertzberg, B. (2004). Transactional Memory
Coherence and Consistency. SIGARCH Comput. Archit.
News, 32(2), 102. doi:10.1145/1028176.1006711
Harte, L., Wiblitzhouser, A., & Pazderka, T. (2006). Introduction to MPEG; MPEG-1, MPEG-2 and MPEG-4.
Fuquay Varina, NC: Althos Publishing.
Harvey, N. J., Jones, M. B., Saroiu, S., Theimer, M., &
Wolman, A. (2003). SkipNet: A scalable overlay network
with practical locality properties. In Proceedings of the
4th USENIX Symp. on Internet Technologies and Systems
(pp. 113-126). USENIX Association.


Hawick, K. A., James, H. A., Maciunas, K. J., Vaughan, F. A., Wendelborn, A. L., Buchhorn, M., et al. (1997). Geostationary-satellite Imagery Application on Distributed, High-Performance Computing. Paper presented at High Performance Computing on the Information Superhighway: HPC Asia '97.

Hayes, B. (2007). Computing in a parallel universe. American Scientist, 95.

Hayes, C. L., & Luo, Y. (2007). DPICO: A high speed deep packet inspection engine using compact finite automata. Proceedings of ACM/IEEE ANCS, (pp. 195-203).

He, X. (1998). 2D-Object Recognition with Spiral Architecture. Doctoral Thesis, University of Technology, Sydney, Australia.
He, X., & Sun, X. (2005). Incorporating data movement into grid task scheduling. In Proceedings of grid and cooperative computing (pp. 394–405).

He, X., Sun, X., & Laszewski, G. (2003). QoS guided Min-Min heuristic for grid task scheduling. Journal of Computer Science and Technology, Special Issue on Grid Computing, 18(4).
Heien, E., Fujimoto, N., & Hagihara, K. (2008). Computing low latency batches with unreliable workers in volunteer computing environments. In PCGrid.

Heine, F., Hovestadt, M., Kao, O., & Keller, A. (2005). Provision of fault tolerance with grid-enabled and SLA-aware resource management systems. In G. R. Joubert (Ed.), Parallel Computing: Current and Future Issues of High End Computing, (pp. 105-112). NIC-Directors.

Heine, F., Hovestadt, M., Kao, O., & Keller, A. (2005). SLA-aware job migration in grid environments. In L. Grandinetti (Ed.), Grid Computing: New Frontiers of High Performance Computing (pp. 345-367). Amsterdam, The Netherlands: Elsevier Press.
Hennessy, J., & Patterson, D. (2006). Computer architecture: a quantitative approach (4th Ed.). San Francisco:
Morgan Kaufmann.


Hey, T., & Trefethen, A. E. (2002). The UK e-science core programme and the Grid. Future Generation Computer Systems, 18(8), 1017–1031. doi:10.1016/S0167-739X(02)00082-1

Hoefler, T., Lichei, A., & Rehm, W. (2007). Low-Overhead LogGP Parameter Assessment for Modern Interconnect Networks. Proceedings of the IPDPS, Long Beach, CA, March 26-30. New York: IEEE.

Heymann, E., Fernandez, A., Senar, M. A., & Salt, J. (2003). The EU-CrossGrid approach for grid application scheduling. European Grid Conference (LNCS 2970, pp. 17-24). Amsterdam: Springer Verlag.

Hong, T., & Tao, Y. (2003). An Efficient Data Location Protocol for Self-organizing Storage Clusters. Paper presented at the Proceedings of the 2003 ACM/IEEE conference on Supercomputing.

High Performance Fortran Forum. (1993). High performance Fortran language specification, version 1.0
(No. CRPC-TR92225). Center for Research on Parallel
Computation, Rice University, Houston, TX.

Hopper, R. (2002). P/Meta - metadata exchange scheme. Retrieved June 15th, 2008, from http://www.ebu.ch/trev_290-hopper.pdf

High Performance Fortran Forum. (1997). High performance Fortran language specification 2.0. Center
for Research on Parallel Computation, Rice University,
Houston, TX.
Hill, M. D., & Marty, M. R. (2008, July). Amdahl's Law in the Multicore Era. HPCA 2008, IEEE 14th International Symposium (p. 187).
Hingne, V., Joshi, A., Finin, T., Kargupta, H., & Houstis, E. (2003). Towards a pervasive grid. In International Parallel and Distributed Processing Symposium (IPDPS'03) (p. 207).

Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., & Roussel, P. (2001). The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 5(1), 1-13.
Ho, T., Medard, M., Koetter, R., Karger, D. R., Effros,
M., Shi, J., & Leong, B. (2006, October). A random
linear network coding approach to multicast. IEEE
Transactions on Information Theory, 52(10). doi:10.1109/
TIT.2006.881746
Hockney, R., & Berry, M. (1994). PARKBENCH report: public international benchmarks for parallel computers. Science Progress, 3(2), 101–146.
Hockney, R., & Eastwood, J. (1981). Computer simulation
using particles. London: McGraw-Hill, Inc.

Hoschek, W., Jaen-Martinez, J., Samar, A., Stockinger, H., & Stockinger, K. (2000). Data management in an international data grid project. Grid Computing - GRID 2000 (pp. 333-361). UK.

Hovestadt, M. (2003). Scheduling in HPC resource management systems: Queuing vs. planning. In D. Feitelson (Ed.), Job Scheduling Strategies for Parallel Processing, (pp. 1-20). Berlin: Springer Verlag.
Hsiao, H.-C., & King, C.-T. (2003). A tree model for
structured peer-to-peer protocols. In Proceedings of
the 3rd IEEE Intl. Symp. on Cluster Computing and the
Grid (pp. 336-343). New York: IEEE Computer Society
Press.
Hua, Y., & Xiao, B. (2006). A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services. Proceedings of 13th International Conference of High Performance Computing (HiPC), (pp. 277-288).
Hua, Y., Zhu, Y., Jiang, H., Feng, D., & Tian, L. (2008).
Scalable and Adaptive Metadata Management in Ultra
Large-Scale File Systems. Proceedings of the 28th
International Conference on Distributed Computing
Systems (ICDCS 2008).
Huang, A. (2003) Hacking the Xbox: an introduction to
reverse engineering, (1st Ed.). San Francisco: No Starch
Press.


Huang, J., & Lilja, D. J. (1999). Exploiting basic block value locality with block reuse. Proceedings of 5th International Symposium on High-Performance Computer Architecture (HPCA'99), (pp. 106-114). Orlando, FL: IEEE Computer Society Press.

Huang, K.-H., & Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33, 518–528. doi:10.1109/TC.1984.1676475
Huang, R., Casanova, H., & Chien, A. A. (2006, April). Using virtual Grids to simplify application scheduling. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006). Rhodes Island, Greece: IEEE.

Huang, S. (2007). Applying Adaptive Software Technologies for Scientific Applications. Master Thesis, Department of Computer Science, University of Houston, Houston, TX.

Huedo, E., Montero, R. S., & Llorente, I. M. (2004). A framework for adaptive execution in Grids. Software, Practice & Experience, 34(7), 631–651. doi:10.1002/spe.584

Huerta, M., Haseltine, F., & Liu, Y. (2004, July). NIH working definition of bioinformatics and computational biology.

Hull, J. C. (2006). Options, Futures, and Other Derivatives (6th Edition). Upper Saddle River, NJ: Prentice Hall.
Hunold, S., Rauber, T., & Rünger, G. (2004). Multilevel hierarchical matrix-matrix multiplication on clusters. In Proceedings of the 18th International Conference of Supercomputing (ICS'04) (pp. 136–145). New York: ACM.

Hunold, S., Rauber, T., & Rünger, G. (2008). Combining building blocks for parallel multi-level matrix multiplication. Parallel Computing, 34(6-8), 411–426. doi:10.1016/j.parco.2008.03.003


Huston, G. (n.d.). Peering and settlements Part-1. The Internet Protocol Journal. San Jose, CA: CISCO Systems.

Hwang, J., & Arvamudham, P. (2004). Middleware services for P2P computing in wireless grid networks. IEEE Internet Computing, 8(4), 40–46. doi:10.1109/MIC.2004.19

Hwang, S., & Kesselman, C. (2003). GridWorkflow: A flexible failure handling framework for the Grid. In B. Lowekamp (Ed.), 12th IEEE International Symposium on High Performance Distributed Computing, (pp. 126–131). New York: IEEE Press.
Iamnitchi, A., Doraimani, S., & Garzoglio, G. (2006). Filecules in High-Energy Physics: Characteristics and Impact on Resource Management. In Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing (HPDC 15), Paris.

Iamnitchi, A., Foster, I. T., & Nurmi, D. (2002). A peer-to-peer approach to resource location in grid environments. In HPDC (p. 419).
IBM. (2007). Blue Gene. Retrieved March 10, 2008, from
http://domino.research.ibm.com/comm/research_projects.nsf/pages/bluegene.index.html
Information Services. (n.d.). Retrieved from http://www.
globus.org/toolkit/mds/
Intel (2007). Intel multi-core: An overview.
Intel News Release. (2006). New dual-core Intel Itanium 2 processor doubles performance, reduces power consumption. Santa Clara, CA: Author.

Iosevich, V., & Schuster, A. (2005). Software Distributed Shared Memory: a VIA-based implementation and comparison of sequential consistency with home-based lazy release consistency: Research Articles. Software, Practice & Experience, 35(8), 755–786. doi:10.1002/spe.656


Iosup, A., & Epema, D. H. (2006). GRENCHMARK: A framework for analyzing, testing, and comparing grids. International Conference on Cluster Computing and the Grid (pp. 313-320). Singapore: IEEE Computer Society Press.

Jacob, B., Ferreira, L., Bieberstein, N., Gilzean, C., Girard, J.-Y., Strachowski, R., & Yu, S. (2003). Enabling
applications for Grid computing with Globus. IBM Redbook. Retrieved from www.redbooks.ibm.com/abstracts/
sg246936.html?Open

Iosup, A., & Epema, D. H. (2007). Build-and-test workloads for Grid middleware: Problem, analysis, and applications. International Conference on Cluster Computing and the Grid (pp. 205-213). Rio de Janeiro, Brazil: IEEE Computer Society Press.

Jain, R. K. (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design,
Measurement, Simulation, and Modeling. New York:
Wiley.

Iosup, A., Epema, D. H. J., Tannenbaum, T., Farrellee, M., & Livny, M. (2007, November). Inter-operating Grids through delegated matchmaking. In 2007 ACM/IEEE Conference on Supercomputing (SC 2007) (pp. 1–12). New York: ACM Press.

Irwin, D., Chase, J., Grit, L., Yumerefendi, A., Becker, D., & Yocum, K. G. (2006, June). Sharing networked resources with brokered leases. In USENIX Annual Technical Conference (pp. 199–212). Berkeley, CA: USENIX Association.

Ishikawa, Y., Matsuda, M., Kudoh, T., Tezuka, H., & Sekiguchi, S. (2003). The design of a latency-aware MPI communication library. In Proceedings of SWoPP03.
ISO/IEC. (1995). SMDL (Standard Music Description
Language) Overview. Retrieved June 15th, 2008, from
http://xml.coverpages.org/gen-apps.html#smdl

James, K. M. (1983). A second look at Bloom filters. Communications of the ACM, 26(8), 570–571. doi:10.1145/358161.358167

Jamin, S., Jin, C., Jin, Y., Raz, D., Shavitt, Y., & Zhang, L. (2000). On the placement of Internet instrumentation. In Proc. of INFOCOM.

Jarvis, S. A., & Nudd, G. R. (2005, February). Performance-based middleware for Grid computing. Concurrency and Computation: Practice and Experience, 17(2-4), 215–234. doi:10.1002/cpe.925

Jayram, T. S., Kimbrel, T., Krauthgamer, R., Schieber, B., & Sviridenko, M. (2001). Online server allocation in server farm via benefit task systems. Proceedings of the ACM Symposium on Theory of Computing (STOC'01), Crete, Greece, (pp. 540–549).

ISO/IEC. (2003). MPEG-7 Overview. Retrieved June 15th, 2008, from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm.

Jean, K., Galis, A., & Tan, A. (2004). Context-aware grid services: Issues and approaches. In Computational Science - ICCS 2004: 4th International Conference, Krakow, Poland, June 6-9, 2004, Proceedings, Part III (LNCS Vol. 3038, p. 1296). Berlin: Springer.

Itzkovitz, A., Schuster, A., & Shalev, L. (1998). Thread migration and its applications in distributed shared memory systems. Journal of Systems and Software, 42(1), 71–87. doi:10.1016/S0164-1212(98)00008-9

Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., & Schulzrinne, H. (2008). REsource LOcation And Discovery (RELOAD). Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-bryan-p2psip-reload-04.txt

Iwata, T., & Kurosawa, K. (2003). OMAC: One-Key CBC MAC. In 10th International Workshop on Fast Software Encryption (FSE'03), (LNCS Vol. 2887/2003, pp. 129–153), Lund, Sweden. Berlin/Heidelberg: Springer.

Jha, S., Kaiser, H., El Khamra, Y., & Weidner, O. (2007, Dec. 10-13). Design and implementation of network performance aware applications using SAGA and Cactus. 3rd IEEE Conference on eScience and Grid Computing, (pp. 143-150). Bangalore, India.


Jiang, H., & Chaudhary, V. (2004). Process/thread migration and checkpointing in heterogeneous distributed systems. Proceedings of the 37th Hawaii International Conference on System Sciences, Hawaii, USA.

JSR166. (2004). Java concurrent utility package in J2SE 5.0 (JDK1.5). Retrieved June 24, 2008, from http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html

Jiang, J. L., Yang, G. W., & Shi, M. L. (2006). Transaction Model for Service Grid Environment and Implementation Considerations. In Proceedings of IEEE International Conference on Web Services (pp. 949–950).

Jul, E., Levy, H., Hutchinson, N., & Black, A. (1988). Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, 6(1), 109–133. doi:10.1145/35037.42182

Jiang, S., O'Hanlon, P., & Kirstein, P. (2004). Moving grid systems into the IPv6 era. In Proceedings of Grid and Cooperative Computing 2003 (LNCS 3033, pp. 490–499). Heidelberg, Germany: Springer-Verlag.

Jung, E. B., Choi, S.-J., Baik, M.-S., Hwang, C.-S., Park, C.-Y., & Young, S. (2005). Scheduling scheme based on dedication rate in volunteer computing environment. In Third International Symposium on Parallel and Distributed Computing (ISPDC 2005), Lille, France.

Jiang, X.-F., Zheng, H.-W., Macian, C., & Pascual, V. (2008). Service Extensible P2P Peer Protocol. Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-jiang-p2psip-sep-01.txt

Jin, H., Xiong, M., Wu, S., & Zou, D. (2006). Replica Based Distributed Metadata Management in Grid Environment. Computational Science (LNCS 3944, pp. 1055-1062). Berlin: Springer-Verlag.
John, K., David, B., Yan, C., Steven, C., Patrick, E., & Dennis, G. (2000). OceanStore: an architecture for global-scale persistent storage. SIGPLAN Not., 35(11), 190–201. doi:10.1145/356989.357007

Johnson, C., & Welser, J. (2005). Future processors: Flexible and modular. Proceedings of 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, (pp. 4-6).

Johnson, R. (2002). Spring Framework - a full-stack Java/JEE application framework. Retrieved June 18, 2008, from http://www.springframework.org/

Joisha, P. G., & Banerjee, P. (1999). PARADIGM (version 2.0): A new HPF compilation system. In IPPS '99/SPDP '99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing (pp. 609–615). Washington, DC: IEEE Computer Society.

Jones, N. C., & Pevzner, P. A. (2004, August). An Introduction to Bioinformatics Algorithms.


Kaashoek, M. F., & Karger, D. R. (2003). Koorde: A simple degree-optimal distributed hash table. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 98-107). Berlin: Springer-Verlag.

Kale, L. V., & Krishnan, S. (1998). Charm++: Parallel Programming with Message-Driven Objects. In G. V. Wilson & P. Lu (Eds.), Parallel Programming Using C++ (pp. 175-213). Cambridge, MA: MIT Press.

Kalogeraki, V., Gunopulos, D., & Zeinalipour-Yazti, D. (2002). A local search mechanism for peer-to-peer networks. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (pp. 300-307).

Kang, D.-S. (2004). Speculation-aware thread scheduling for simultaneous multithreading. Doctoral Dissertation, University of Southern California, Los Angeles, CA.

Kang, D.-S., Liu, C., & Gaudiot, J.-L. (2008). The impact of speculative execution on SMT processors. [IJPP]. International Journal of Parallel Programming, 36(4), 361–385. doi:10.1007/s10766-007-0052-3

Kangasharju, J. (2002). Implementing the Wireless CORBA Specification. PhD Dissertation, Computer Science Department, University of Helsinki, Helsinki, Finland. Retrieved June 15th, 2008, from http://www.cs.helsinki.fi/u/jkangash/laudatur-jjk.pdf


Karger, D. R., & Ruhl, M. (2004). Diminished Chord: A protocol for heterogeneous subgroup. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 288-297). Berlin: Springer-Verlag.

Karger, D. R., & Ruhl, M. (2004). Simple, efficient load balancing algorithms for peer-to-peer systems. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 131-140). Berlin: Springer-Verlag.

Karger, D., Lehman, E., Leighton, T., Levine, M., et al. (1997). Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proc. of STOC (pp. 654–663).

Karonis, N. T., Toonen, B., & Foster, I. (2003). MPICH-G2: A Grid-enabled implementation of the message passing interface. [JPDC]. Journal of Parallel and Distributed Computing, 63, 551–563. doi:10.1016/S0743-7315(03)00002-9

Karp, R. M. (1992). Online algorithms versus offline algorithms: How much is it worth to know the future? In J. van Leeuwen (Ed.), Proceedings of the 12th IFIP World Computer Congress. Volume 1: Algorithms, Software, Architecture, (pp. 416–429). Amsterdam: Elsevier.

Katzy, B., Zhang, C., & Löh, H. (2005). Virtual organizations: Systems and practices. In L. M. Camarinha-Matos, H. Afsarmanesh, & M. Ollus (Eds.), (pp. 45–58). New York: Springer Science+Business Media, Inc.
Keahey, K., Foster, I., Freeman, T., & Zhang, X. (2006). Virtual workspaces: Achieving quality of service and quality of life in the Grids. Science Progress, 13(4), 265–275.

Kedrinskii, V. K., Vshivkov, V. A., Dudnikova, G. I., Shokin, Yu. I., & Lazareva, G. G. (2004). Focusing of an oscillating shock wave emitted by a toroidal bubble cloud. Journal of Experimental and Theoretical Physics, 98(6), 1138–1145. doi:10.1134/1.1777626
Keleher, P., Cox, A. L., & Zwaenepoel, W. (1992). Lazy
release consistency for software distributed shared
memory. Paper presented at the Proceedings of the
19th annual international symposium on Computer
architecture.

Keleher, P., Cox, A. L., Dwarkadas, S., & Zwaenepoel, W. (1994). TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Paper presented at the Proceedings of Winter 1995 USENIX Conference.

Kelly, W., Roe, P., & Sumitomo, J. (2002). G2: A grid middleware for cycle donation using .NET. In Proceedings of the 2002 International Conference on Parallel and Distributed Processing Techniques and Applications.

Kephart, J. O., & Chess, D. M. (2003). The vision of autonomic computing. Computer, IEEE Computer Society, 36(1), 41–50.

Kertész, A., Farkas, Z., Kacsuk, P., & Kiss, T. (2008, April). Grid enabled remote instrumentation. In F. Davoli, N. Meyer, R. Pugliese, & S. Zappatore (Eds.), 2nd International Workshop on Distributed Cooperative Laboratories: Instrumenting the Grid (INGRID 2007) (pp. 303–312). New York: Springer US.
Kesselman, C., & Foster, I. (1998). The Grid: Blueprint
for a new computing infrastructure. San Francisco:
Morgan Kaufmann Publishers.
Kessler, C. W., & Löwe, W. (2007). A framework for performance-aware composition of explicitly parallel components. In Proceedings of the International Conference ParCo 2007 (pp. 227–234). Jülich/Aachen, Germany: IOS Press.

Khanna, G., Vydyanathan, N., Catalyurek, U., Kurc, T., Krishnamoorthy, S., Sadayappan, P., et al. (2006). Task scheduling and file replication for data-intensive jobs with batch-shared I/O. In Proceedings of high-performance distributed computing (HPDC) (pp. 241–252).

Kielmann, T., Hofman, R. F. H., Bal, H. E., Plaat, A., & Bhoedjang, R. A. F. (1999). MagPIe: MPI's collective communication operations for clustered wide area systems. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99), 34(8), 131–140.
Kim, J.-S., Nam, B., Keleher, P. J., Marsh, M. A., Bhattacharjee, B., & Sussman, A. (2006). Resource discovery
techniques in distributed desktop grid environments. In
Grid (pp. 9-16).


Kim, K. H., & Buyya, R. (2007, September). Fair resource sharing in hierarchical virtual organizations for global Grids. In 8th IEEE/ACM International Conference on Grid Computing (Grid 2007) (pp. 50–57). Austin, TX: IEEE.

Kim, S., & Weissman, J. B. (2004). A genetic algorithm based approach for scheduling decomposable data grid applications. In Proceedings of International Conference on Parallel Processing (Vol. 1, pp. 405–413).
Kim, Y. (1996, June). Fault Tolerant Matrix Operations
for Parallel and Distributed Systems. Ph.D. dissertation,
University of Tennessee, Knoxville.
Knuth, D. E. (1975). The art of computer programming.
Volume 1: Fundamental Algorithms. Reading, MA: Addison Wesley.
Koelbel, C. H., Loveman, D. B., Schreiber, R. S., Steele, G. L., Jr., & Zosel, M. E. (1994). The High Performance Fortran Handbook. Cambridge, MA: MIT Press.
Koetter, R., & Medard, M. (2003, October). An algebraic approach to network coding. IEEE/ACM Transactions on Networking (TON), 11(5), 782–795.
Kok, A. J. F., Pabst, J. L. v., & Afsarmanseh, H. (April,
1997). The 3D Object Mediator: Handling 3D Models
on Internet. Paper presented at the High-Performance
Computing and Networking, Vienna, Austria.
Kondo, D., Araujo, F., Malecot, P., Domingues, P., Silva, L. M., & Fedak, G. (2006). Characterizing result errors in internet desktop grids (Tech. Rep. No. INRIA-HAL Tech Report 00102840). INRIA, France.
Kondo, D., Chien, A. A., & Casanova, H. (2007). Scheduling task parallel applications for rapid turnaround on enterprise desktop grids. Journal of Grid Computing, 5(4), 379–405. doi:10.1007/s10723-007-9063-y

Kondo, D., Chien, A., & Casanova, H. (2004, November). Rapid Application Turnaround on Enterprise Desktop Grids. In ACM Conference on High Performance Computing and Networking, SC2004.


Kondo, D., Fedak, G., Cappello, F., Chien, A. A., & Casanova, H. (2006, December). On Resource Volatility in Enterprise Desktop Grids. In Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing (eScience'06) (pp. 78–86). Amsterdam, Netherlands.

Kondo, D., Taufer, M., Brooks, C., Casanova, H., & Chien, A. (2004, April). Characterizing and evaluating desktop grids: An empirical study. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'04).

Koskela, T., Kassinen, O., Korhonen, J., Ou, Z., & Ylianttila, M. (2008). Peer-to-Peer Community Management using Structured Overlay Networks. In the Proc. of International Conference on Mobile Technology, Applications and Systems, September 10-12, Yilan, Taiwan.

Koufaty, D., & Marr, D. (2003). Hyperthreading technology in the Netburst microarchitecture. IEEE Micro, 23(2), 56–65. doi:10.1109/MM.2003.1196115
Kowaliski, C. (2008). NVIDIA CEO talks down CPU-GPU hybrids, Larrabee. The Tech Report, April 11th. Retrieved from http://techreport.com/discussions.x/14538

Kraeva, M. A., & Malyshkin, V. E. (1997). Implementation of PIC method on MIMD multicomputers with assembly technology. In Proc. of the High Performance Computing and Networking Europe 1997 Int. Conference (LNCS, Vol. 1255, pp. 541-549). Berlin: Springer Verlag.

Kraeva, M. A., & Malyshkin, V. E. (1999). Algorithms of parallel realization of PIC method with assembly technology. In Proceedings of 7th High Performance Computing and Networking Europe (LNCS Vol. 1593, pp. 329-338). Berlin: Springer Verlag.

Kraeva, M. A., & Malyshkin, V. E. (2001). Assembly technology for parallel realization of numerical models on MIMD-multicomputers. International Journal on Future Generation Computer Systems, Elsevier Science, 17(6), 755–765. doi:10.1016/S0167-739X(00)00058-3


Krafzig, D., Banke, K., & Slama, D. (2005). Enterprise SOA: Service-Oriented Architecture Best Practices. Upper Saddle River, NJ: Prentice Hall.

Krauter, K., Buyya, R., & Maheswaran, M. (2002). A taxonomy and survey of grid resource management systems for distributed computing. Software, Practice & Experience, 32(2), 135–164. doi:10.1002/spe.432

Krishna, V., & Perry, M. (2007). Efficient Mechanism Design.

Krishnan, S., & Gannon, D. (2004). XCAT3: A framework for CCA components as OGSA services. In Proceedings of HIPS 2004, 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments.

Kubiatowicz, J., Bindel, D., Chen, Y., Eaton, P., Geels, D., Gummadi, R., et al. (2000). OceanStore: An Architecture for Global-Scale Persistent Storage. In Proceedings of the 9th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (pp. 190-201). New York: ACM Press.

Kühnemann, M., Rauber, T., & Rünger, G. (2004). A source code analyzer for performance prediction. In Proceedings of the IPDPS'04 Workshop on Massively Parallel Processing (WMPP'04). New York: IEEE.
Kuksheva, E. A., Malyshkin, V. E., Nikitin, S. A., Snytnikov, A. V., Snytnikov, V. N., & Vshivkov, V. A. (2005). Supercomputer simulation of self-gravitating media. International Journal on Future Generation Computer Systems, 21(5), 749–758. doi:10.1016/j.future.2004.05.019

Kumar, A. (2000). An efficient SuperGrid protocol for high availability and load balancing. IEEE Transactions on Computers, 49(10), 1126–1133. doi:10.1109/12.888048
Kumar, A., Xu, J., & Zegura, E. W. (2005). Efficient and
scalable query routing for unstructured peer-to-peer
networks. Paper presented at the Proceedings INFOCOM 2005, 24th Annual Joint Conference of the IEEE
Computer and Communications Societies.

Kumar, R., Tullsen, D. M., & Jouppi, N. P. (2006). Core


Architecture Optimization for Heterogeneous Chip
Multiprocessors. In Proceedings of the 15th International
Conference on Parallel Architecture and Compilation
Techniques (pact 2006) (pp. 23-32).
Kumar, V. K. P., Hariri, S., & Raghavendra, C. S. (1986).
Distributed program reliability analysis. IEEE Transactions on Software Engineering, SE-12, 4250.
Kumary, R. Tullsen D.M., Ranganathan, P., Jouppi,
N.P., & Farkas, K.I., (2004). Single-ISA Heterogeneous
Multi-Core Architecture for Multithreaded Workload
Performance. In Proceedings of the 31st International
Symposium on Computer Architecture (ISCA04), June,
2004.
Kurkovsky, S., Bhagyavati, Ray, A., & Yang, M. (2004).
Modeling a grid-based problem solving environment for
mobile devices. In ITCC (2) (p. 135). New York: IEEE
Computer Society.
Kwok, T. T.-O., & Kwok, Y.-K. (2007). Design and Evaluation of Parallel String Matching Algorithms for Network
Intrusion Detection Systems (NPC 2007), (LNCS 4672,
pp. 344-353). Berlin: Springer.
Lamehamedi, H., Szymanski, B., Shentu, Z., &
Deelman, E. (2002). Data replication strategies in grid
environments. In Proceedings of the fifth international
conference on algorithms and architectures for parallel
processing (pp. 378-383).
Lamehamedi, H., Szymanski, B., Shentu, Z., & Deelman, E. (2003). Simulation of dynamic data replication
strategies in data grids. In Proceedings of the international parallel and distributed processing symposium
(pp. 10-20).
Landers, M., Zhang, H., & Tan, K.-L. (2004). PeerStore:
Better performance by relaxing in peer-to-peer backup.
In Proceedings of the 4th Intl. Conf. on Peer-to-Peer
Computing (pp. 72-79). New York: IEEE Computer
Society Press.
Lange, D., & Oshima, M. (1998). Mobile agents with
Java: The Aglet API. World Wide Web (Bussum), 1(3).
doi:10.1023/A:1019267832048

Laszewski, G. v., Foster, I., & Gawor, J. (2000). CoG Kits:
A bridge between commodity distributed computing and
high-performance grids. In ACM 2000 Conference on Java
Grande (pp. 97-106). San Francisco, CA: ACM Press.

Lee, Y. C., & Zomaya, A. Y. (2006). Data sharing
pattern aware scheduling on grids. In Proceedings of
International Conference on Parallel Processing, (pp.
365-372).

Laure, E. (2001). OpusJava: A Java framework for
distributed high performance computing. Future Generation Computer Systems, 18(2), 235-251. doi:10.1016/
S0167-739X(00)00094-7

Legrand, I., Newman, H., Voicu, R., Cirstoiu, C., Grigoras, C., Toarta, M., et al. (2004, September-October).
MonALISA: An agent based, dynamic service system to
monitor, control and optimize Grid based applications. In
Computing in High Energy and Nuclear Physics (CHEP),
Interlaken, Switzerland.

Laure, E., Mehrotra, P., & Zima, H. P. (1999).
Opus: Heterogeneous computing with data parallel
tasks. Parallel Processing Letters, 9(2). doi:10.1142/
S0129626499000256
Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., &
Shmoys, D. B. (1993). Sequencing and Scheduling: Algorithms and Complexity. Amsterdam: North-Holland.
Ledlie, J., Serban, L., & Toncheva, D. (2002). Scaling
Filename Queries in a Large-Scale Distributed File
System. Harvard University, Cambridge, MA.

Lei, M., & Vrbsky, S. V. (2006). A data replication strategy
to increase data availability in data grids. In Proceedings
of the international conference on grid computing and
applications (pp. 221-227).
Leslie, M., Davies, J., & Huffman, T. (2006). Replication
strategies for reliable decentralised storage. In Proceedings of the 1st Workshop on Dependable and Sustainable
Peer-to-Peer Systems (pp. 740-747). New York: IEEE
Computer Society Press.

Lee, C. (2003). Grid programming models: Current
tools, issues and directions. In F. Berman, G. Fox, & T.
Hey (Eds.), Grid computing (pp. 555-578). New York:
Wiley Press.

Leutenegger, S., & Sun, X. (1993). Distributed computing
feasibility in a non-dedicated homogeneous distributed
system. In Proceedings of SC93, Portland, OR.

Lee, C., Lee, T.-y., Lu, T.-c., & Chen, Y.-t. (1997). A Worldwide Web Based Distributed Animation Environment.
Computer Networks and ISDN Systems, 29, 1635-1644.
doi:10.1016/S0169-7552(97)00078-0

Levitin, G., Dai, Y. S., & Ben-Haim, H. (2006). Reliability and performance of star topology grid service
with precedence constraints on subtask execution. IEEE
Transactions on Reliability, 55(3), 507-515. doi:10.1109/
TR.2006.879651

Lee, L. G. (1982). Designing a Bloom filter for differential
file access. Communications of the ACM, 25(9), 600-604.
doi:10.1145/358628.358632
Lee, S., Ren, X., & Eigenmann, R. (2008). Efficient
content search in iShare, a P2P based Internet-sharing
system. In PCGRID.
Lee, S.-W., & Gaudiot, J.-L. (2003). Clustered microarchitecture simultaneous multithreading. In 9th International Euro-Par Conference on Parallel Processing
(Euro-Par03), (LNCS Vol. 2790/2004, pp. 576-585),
Klagenfurt, Austria. Berlin/Heidelberg: Springer.

Levitin, G., Dai, Y. S., Xie, M., & Poh, K. L. (2003).
Optimizing survivability of multi-state systems with
multi-level protection by multi-processor genetic algorithm. Reliability Engineering & System Safety, 82,
93-104. doi:10.1016/S0951-8320(03)00136-4
Li, C.-C. J., Stewart, E. M., & Fuchs, W. K. (1994). Compiler assisted full checkpointing. Software, Practice &
Experience, 24, 871-886. doi:10.1002/spe.4380241002
Fan, L., Cao, P., Almeida, J., & Broder, A. Z. (2000). Summary cache: A scalable wide-area web cache sharing
protocol. IEEE/ACM Trans. Netw., 8(3), 281-293.

Li, J., Stribling, J., Gil, T. M., Morris, R., & Kaashoek,
M. F. (2004). Comparing the performance of distributed
hash tables under churn. In Proceedings of the 3rd Intl.
Workshop on Peer-to-Peer Systems (pp. 87-99). Berlin:
Springer-Verlag.
Li, J., Stribling, J., Morris, R., & Kaashoek, M. F. (2005).
Bandwidth-efficient management of dht routing tables.
In Proceedings of 2nd Symp. on Networked Systems
Design and Implementation (pp. 99-114). USENIX Association.
Li, S.-Y. R., Yeung, R. W., & Cai, N. (2003, Feb.). Linear
network coding. IEEE Transactions on Information
Theory, 49(2), 371-381. doi:10.1109/TIT.2002.807285
Li, X., & Gaudiot, J.-L. (2006). Design trade-offs and
deadlock prevention in transient fault-tolerant SMT processors. In Proceedings of 12th Pacific Rim International
Symposium on Dependable Computing (PRDC06),
(pp. 315-322). Riverside, CA: IEEE Computer Society
Press.
Zhang, Z.-L., Duan, Z., Gao, L., & Hou, Y. T. (2000).
Decoupling QoS control from core routers: A novel
bandwidth broker architecture for scalable support of
guaranteed services. In Proc. of SIGCOMM'00, Stockholm,
Sweden, (pp. 71-83).
Li, Z., & Mohapatra, P. (2004, January). QoS-aware
routing in overlay networks (QRON). IEEE Journal on
Selected Areas in Communications, 22(1).
Li, Z., Sun, L., & Ifeachor, E. (2005). Challenges of mobile ad-hoc grids and their applications in e-healthcare.
In Proceedings of Second International Conference on
Computational Intelligence in Medicine And Healthcare
(CIMED 2005).
Li, Z., Xu, X., Hu, W., & Tang, Z. (2006). Microarchitecture and performance analysis of Godson-2 SMT processor. In Proceedings of the 24th International Conference
on Computer Design (ICCD06), (pp. 485-490). San Jose,
CA: IEEE Computer Society Press.

Liang, D., & Tripathi, S. (1996). Performance analysis of
long-lived transaction processing systems with rollbacks
and aborts. IEEE Transactions on Knowledge and Data
Engineering, 8(5), 802-815. doi:10.1109/69.542031
Likic, V. (2000). The Needleman-Wunsch algorithm
for sequence alignment. The University of Melbourne,
Australia.
Lin, M. S., Chang, M. S., Chen, D. J., & Ku, K. L. (2001).
The distributed program reliability analysis on ring-type
topologies. Computers & Operations Research, 28,
625-635. doi:10.1016/S0305-0548(99)00151-3
Lin, Y., Liu, P., & Wu, J. (2006). Optimal placement of
replicas in data grid environments with locality assurance. In Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS06),
01, 465-474.
Lindholm, T., & Yellin, F. (1999). The Java(TM) virtual
machine specification (2nd Ed.). New York: Addison
Wesley.
Litchfield, S. (2008). A detailed comparison of Series 60
(S60) Symbian smartphones. Retrieved March 10, 2008,
from http://3lib.ukonline.co.uk/s60history.htm
Litke, A., Skoutas, D., & Varvarigou, T. (2004). Mobile
grid computing: Changes and challenges of resource
management in a mobile grid environment. In Proceedings of Practical Aspects of Knowledge Management
(PAKM 2004), Austria.
Little, M. C., Shrivastava, S. K., & Speirs, N. A. (2002)...
The Computer Journal, 45(6), 645-652. doi:10.1093/
comjnl/45.6.645
Litzkow, M. J., Livny, M., & Mutka, M. W. (1988, June).
Condor - a hunter of idle workstations. In 8th International Conference of Distributed Computing Systems (pp.
104-111). San Jose, CA: Computer Society.
Liu, C., & Gaudiot, J.-L. (2008). Resource sharing control
in simultaneous multithreading microarchitectures. In
Proceedings of the 13th IEEE Asia-Pacific Computer
Systems Conference (ACSAC08), (pp. 1-8). Hsinchu,
Taiwan: IEEE Computer Society Press.

Liu, C., Qian, D., Liu, Y., Li, Y., & Wang, C. (2006).
RSVP Context Extraction in IP Mobility Environments. Vehicular Technology Conference, 2006, VTC
2006-Spring, IEEE 63rd, (Vol. 2, pp. 756-760).
Liu, G. Q., Xie, M., Dai, Y. S., & Poh, K. L. (2004). On
program and file assignment for distributed systems.
Computer Systems Science and Engineering, 19(1),
39-48.
Liu, H., Zheng, K., Liu, B., Zhang, X., & Liu, Y. (2006).
A memory-efficient parallel string matching architecture
for high-speed intrusion detection. IEEE Journal on
Selected Areas in Communications, 24(10), 1793-1804.
doi:10.1109/JSAC.2006.877221
Liu, L.-L., Liu, Q., Natsev, A., Ross, K. A., Smith, J. R.,
& Varbanescu, A. L. (2007, July). Digital media indexing
on the cell processor. In 16th international conference
on parallel architecture and compilation techniques,
Beijing, China (pp. 425425).
Liu, S., & Gaudiot, J.-L. (2007). Synchronization mechanisms on modern multi-core architectures. In Proceedings
of the 12th Asia-Pacific Computer Systems Architecture
Conference (ACSAC07), (LNCS Vol. 4697/2007), (pp.
290-303), Seoul, Korea. Berlin/Heidelberg: Springer.
Liu, S., & Gaudiot, J.-L. (2008). The potential of fine-grained value prediction in enhancing the performance
of modern parallel machines. In Proceedings of the 13th
IEEE Asia-Pacific Computer Systems Conference (ACSAC08), (pp. 1-8). Hsinchu, Taiwan: IEEE Computer
Society Press.

Locke, C. D., Vogel, D. R., & Mesler, T. J. (1991). Building
A Predictable Avionics Platform in Ada. In Proceedings
of IEEE Real-Time Systems Symposium.
Lodygensky, O., Fedak, G., Cappello, F., Neri, V., Livny,
M., & Thain, D. (2003). XtremWeb & Condor: Sharing
resources between Internet connected condor pools.
In Proceedings of CCGRID2003, Third International
Workshop On Global And Peer-To-Peer Computing
(GP2PC'03) (pp. 382-389). Tokyo, Japan.
Loo, B. T., Huebsch, R., Stoica, I., & Hellerstein, J. M.
(2004). The case for a hybrid p2p search infrastructure.
In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer
Systems (pp. 141-150). Berlin: Springer-Verlag.
Lopez, J., Aeschlimann, M., Dinda, P., Kallivokas, L.,
Lowekamp, B., & O'Hallaron, D. (1999, June). Preliminary report on the design of a framework for distributed
visualization. In Proceedings of the international conference on parallel and distributed processing techniques
and applications (PDPTA'99) (pp. 1833-1839). Las
Vegas, NV.
Lovas, R., Dózsa, G., Kacsuk, P., Podhorszki, N., &
Drótos, D. (2004). Workflow support for complex Grid
applications: Integrated and portal solutions. In M. Dikaiakos (Ed.): AxGrids 2004, (LNCS 3165, pp. 129-138).
Berlin: Springer Verlag.
Ludtke, S., Baldwin, P., & Chiu, W. (1999). EMAN:
Semiautomated software for high-resolution single-particle reconstruction. Journal of Structural Biology,
128, 146-157. doi:10.1006/jsbi.1999.4174

Liu, X., Li, V., & Zhang, P. (2006). Joint radio resource
management through vertical handoffs in 4G networks.
IEEE GLOBECOM (pp. 1-5). Washington, DC: IEEE.

Luk, F. T., & Park, H. (1986). An analysis of algorithm-based fault tolerance techniques. SPIE Adv. Alg. and
Arch. for Signal Proc., 696, 222-228.

Livny, M., & Raman, R. (1998). High-throughput resource management. In The Grid: Blueprint for a new
computing infrastructure (pp. 311-338). San Francisco:
Morgan-Kaufmann

Luk, M., Mezzour, G., Perrig, A., & Gligor, V. (2007).
MiniSec: A Secure Sensor Network Communication
Architecture. Proceedings of IEEE International Conference on Information Processing in Sensor Networks
(IPSN), (pp. 479-488).

Loan, C. V. (1992). Computational frameworks for the
fast Fourier transform. Philadelphia, PA: Society for
Industrial and Applied Mathematics.

Luther, A., Buyya, R., Ranjan, R., & Venugopal, S. (2005).
Peer-to-peer grid computing and a .NET-based Alchemi
framework. In M. Guo (Ed.), High Performance Computing: Paradigm
and Infrastructure. New York: Wiley
Press. Retrieved from www.alchemi.net

Malyshkin, V. E. (1995). Functionality in ASSY system
and language of functional programming. In Proceedings
of the First Aizu International Symposium on Parallel
Algorithms/Architecture Synthesis (pp. 92-97). Aizu-Wakamatsu, Japan: IEEE Comp. Soc. Press.

Luther, A., Buyya, R., Ranjan, R., & Venugopal, S.
(2005, June). Alchemi: A .NET-based enterprise grid
computing system. In ICOMP'05, Proceedings of the
6th International Conference on Internet Computing,
Las Vegas, USA.

Mandelbrot Set. (2008, November). Retrieved from http://
mathworld.wolfram.com/MandelbrotSet.html.

Lv, C., Cao, P., Cohen, E., Li, K., & Shenker, S. (2002).
Search and replication in unstructured peer-to-peer networks. In Proceedings of the 2002 ACM SIGMETRICS
international conference on Measurement and modeling
of computer systems (pp.258-259).
Ma, M. J. M., Wang, C. L., & Lau, F. C. M. (2000). JESSICA: Java-enabled single-system-image computing
architecture. Journal of Parallel and Distributed Computing, 60(10), 1194-1222. doi:10.1006/jpdc.2000.1650
Mahadevan, U., & Ramakrishnan, S. (1994) Instruction
scheduling over regions: A framework for scheduling
across basic blocks. In Proceedings of the 5th International Conference on Compiler Construction (CC94),
Edinburgh, (LNCS Vol. 786/1994, pp. 419-434). Berlin/
Heidelberg: Springer.
Malécot, P., Kondo, D., & Fedak, G. (2006, June).
XtremLab: A system for characterizing Internet desktop
grids. Poster at the 15th IEEE International Symposium
on High Performance Distributed Computing (HPDC'06),
Paris, France.
Malyshkin, V. E., Sorokin, S. B., & Chauk, K. G. (2008, May). Fragmented numerical algorithms for the library parallel standard subroutines. Accepted for publication in Siberian Journal of Numerical Mathematics, Novosibirsk, Russia.
Malyshkin, V. (2006). How to create the magic wand?
Currently implementable formulation of the problem.
In New Trends in Software Methodologies, Tools and
Techniques, Proceedings of the Fifth SoMeT_06, 147,
127-132.

Manku, G. (2004). Balanced binary trees for ID management and load balance in distributed hash tables. In
Proc. of PODC.
March, V., Teo, Y. M., & Wang, X. (2007). DGRID: A
DHT-based resource indexing and discovery scheme for
computational grids. In Proceedings of the 5th Australasian Symp. on Grid Computing and e-Research (pp.
41-48). Australian Computer Society, Inc.
March, V., Teo, Y. M., Lim, H. B., Eriksson, P., & Ayani,
R. (2005). Collision detection and resolution in hierarchical peer-to-peer systems. In Proceedings of the 30th
IEEE Conf. on Local Computer Networks (pp. 2-9). New
York: IEEE Computer Society Press.
Marcuello, P., & Gonzalez, A. (1999) Exploiting speculative thread-level parallelism on a SMT processor.
In Proceedings of the 7th International Conference on
High-Performance Computing and Networking (HPCN
Europe99), Amsterdam, the Netherlands, (LNCS Vol.
1593/1999, pp. 754-763) Berlin/Heidelberg: Springer.
Marr, D.T., Binns, F., Hill, D.L., Hinton, G., Koufaty,
D.A, Miller, J.A., & Upton, M. (2002). Hyper-threading
technology architecture and microarchitecture. Intel
Technology Journal, 6(1), 4-15.
Marsh, A. (1997). EUROMED - Combining WWW and
HPCN to Support Advanced Medical Imaging. Paper
presented at the High-Performance Computing and
Networking, Vienna, Austria.
Mascarenhas, E., & Rego, V. (1995). Ariadne: Architecture of a portable threads system supporting mobile process, (Tech. Rep. No. CSD-TR 95-017). Dept. of Computer
Sciences, Purdue University, Southbend, IN.

Mason, R., & Kelly, W. (2005). G2-p2p: A fully decentralized fault-tolerant cycle-stealing framework. In R.
Buyya, P. Coddington, & A. Wendelborn (Eds.),
AusGrid'05: Australasian Workshop on Grid Computing and e-Research, Newcastle, Australia, (Vol. 44 of
CRPIT, pp. 33-39).
Ripeanu, M., & Foster, I. (2002). A Decentralized, Adaptive
Replica Location Mechanism. Paper presented at the
Proceedings of the 11th IEEE International Symposium
on High Performance Distributed Computing.
Mathe, J., Kuntner, K., Pota, S., & Juhasz, Z. (2003). The
use of jini technology in distributed and grid multimedia
systems. In MIPRO 2003, Hypermedia and Grid Systems
(p. 148-151). Opatija, Croatia.
Juric, M. B. (2008). BPEL and Java. Retrieved June 15th,
2008, from http://www.theserverside.com/tt/articles/
article.tss?l=BPELJava
Matossian, V., Bhat, V., Parashar, M., Peszynska, M., Sen,
M., & Stoffa, P. (2005). Autonomic oil reservoir optimization on the grid. [John Wiley and Sons.]. Concurrency
and Computation, 17(1), 1-26. doi:10.1002/cpe.871
Mattson, T., Sanders, B., & Massingill, B. (2004). Patterns for parallel programming. New York: Addison-Wesley.
Maymounkov, P., & Mazières, D. (2002). Kademlia: A
Peer-to-peer Information System Based on the XOR
Metric. In Proceedings of the 1st international workshop
on peer-to-peer systems (IPTPS'02) (pp. 53-65).
McGinnis, L., Wallom, D., & Gentzsch, W. (Eds.). (2007).
2nd International Workshop on Campus and Community
Grids. Retrieved from http://forge.gridforum.org/sf/go/
doc14617?nav=1
McIlroy, M. (1982). Development of a Spelling List. IEEE Transactions on Communications, 30(1), 91-99.
McKenney, P. E., Lee, D. Y., & Denny, B. A. (2008).
Traffic generator software release notes.

McKnight, L., Howison, J., & Bradner, S. (2004, July).
Wireless grids, distributed resource sharing by mobile,
nomadic and fixed devices. IEEE Internet Computing,
8(4), 24-31. doi:10.1109/MIC.2004.14
McNair, J., & Fang, Z. (2004). Vertical handoffs in
fourth-generation multinetwork environments. IEEE
Wireless Communications, 11(3), 8-15. doi:10.1109/
MWC.2004.1308935
Merlin, J. H., Baden, S. B., Fink, S., & Chapman, B. M.
(1999). Multiple data parallelism with HPF and KeLP.
Future Generation Computer Systems, 15(3), 393-405.
doi:10.1016/S0167-739X(98)00083-1
Merton, R. C. (1973). Theory of Rational Option Pricing. The
Bell Journal of Economics and Management Science,
4(1). doi:10.2307/3003143
Message Passing Interface Forum. (1994). MPI: A Message Passing Interface Standard. (Technical Report utcs-94-230), University of Tennessee, Knoxville, TN.
Messig, M., & Goscinski, A. (2007). Autonomic system
management in mobile grid environments. In Proceedings
of the Fifth Australasian Symposium on ACSW Frontiers
(ACSW '07), (pp. 49-58). Darlinghurst, Australia: Australian Computer Society, Inc.
Metz, C. (2001). Interconnecting ISP networks. IEEE Internet Computing, 5(2), 74-80. doi:10.1109/4236.914650
Meyer, J. (1980). On evaluating the performability of
degradable computing systems. IEEE Transactions on
Computers, 29, 720-731. doi:10.1109/TC.1980.1675654
Mitzenmacher, M. (2002). Compressed Bloom filters. IEEE/
ACM Trans. Netw., 10(5), 604-612.
Microsoft Live Mesh. (2008, November). Retrieved from
http://www.mesh.com.
Migliaccio, A. (2006). The Design and Development
of a Nomadic Computing Middleware: the Esperanto
Broker. PhD Dissertation, Department of Computer and
System Engineering, Federico II, University of Naples,
Naples, Italy.

Migliardi, M., & Sunderam, V. (1999). The harness
metacomputing framework. In Proceedings of Ninth
Siam Conference on Parallel Processing for Scientific
Computing. San Antonio, TX: SIAM.
Miller, R. L. (1993). High Resolution Image Processing on Low-cost Microcomputer. International Journal of Remote Sensing, 14(4), 655-667.
doi:10.1080/01431169308904366
Milton, S. (1998). Thread migration in distributed
memory multicomputers, (Tech. Rep. No. TR-CS-98-01).
Dept. of Comp Sci & Comp Sciences Lab, Australian
National University, Acton, Australia.
Min, W. H., & Veeravalli, B. (2005, December). Aligning biological sequences on distributed bus networks: a
divisible load scheduling approach. Institute of Electrical
and Electronic Engineering, 9(4), 489-501.
Mislove, A., & Druschel, P. (2004). Providing administrative control and autonomy in structured peer-to-peer
overlays. Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 162-172). Berlin: Springer-Verlag.
Mitzenmacher, M. (1997). On the analysis of randomized
load balancing schemes. In Proc. of SPAA.
Mohamed, H. H., & Epema, D. H. (2005). Experiences with the KOALA co-allocating scheduler in
multiclusters. International Conference of Cluster
Computing and the Grid (pp. 784-791). Cardiff, UK:
IEEE Computer Society Press.
Mohamed, H., & Epema, D. (in press). KOALA: A
co-allocating Grid scheduler. Concurrency and Computation.
Mohan, A., & Kalogeraki, V. (2003). Speculative routing
and update propagation: a kundali centric approach.
Paper presented at the IEEE International Conference
on Communications, 2003.
Mondal, A., Goda, K., & Kitsuregawa, M. (2003). Effective load-balancing of peer-to-peer systems. In Proc.
of IEICE DEWS DBSJ Annual Conference.

Montero, R. S., Huedo, E., & Llorente, I. M. (2008,
September/October). Dynamic deployment of custom
execution environments in Grids. In 2nd International
Conference on Advanced Engineering Computing and
Applications in Sciences (ADVCOMP '08) (pp. 33-38).
Valencia, Spain: IEEE Computer Society.
Montgomery, D. C. (2004). Design and analysis of
experiments (6th ed.). New York: Wiley.
Moore, G. E. (1965). Cramming more components onto
integrated circuits. Electronics Magazine, 38(8).
Motwani, R., & Raghavan, P. (1995). Randomized Algorithms. New York: Cambridge University Press.
MRML. (2003). MRML- Multimedia Retrieval Markup
Language. Retrieved June 15th, 2008, from http://www.
mrml.net/
Murphy, A. L., Picco, G. P., & Roman, G. (2001). LIME:
a middleware for physical and logical mobility. 21st
International Conference on Distributed Computing
Systems, (pp. 524-533).
Mutka, M. W., & Livny, M. (1987). Profiling workstations
available capacity for remote execution. In Proceedings
of performance-87, the 12th ifip w.g. 7.3 international
symposium on computer performance modeling, measurement and evaluation. Brussels, Belgium.
Mutka, M., & Livny, M. (1991, July). The available capacity of a privately owned workstation environment.
Performance Evaluation, 4(12).
Mutz, A., Wolski, R., & Brevik, J. (2007). Eliciting honest value information in a batch-queue environment. In
The 8th IEEE/ACM Int Conference on Grid Computing
(Grid 2007) Austin, Texas, USA.
Myers, D. S., Bazinet, A. L., & Cummings, M. P. (2008).
Expanding the reach of grid computing: combining globus- and boinc-based systems. In Grids for Bioinformatics
and Computational Biology. New York: Wiley.
MyGrid. (2008). Retrieved from www.mygrid.org.uk

N'Takpé, T., & Suter, F. (2006). Critical path and area
based scheduling of parallel task graphs on heterogeneous
platforms. In Proceedings of the Twelfth International
Conference on Parallel and Distributed Systems (ICPADS) (pp. 3-10), Minneapolis, MN.
N'Takpé, T., Suter, F., & Casanova, H. (2007). A comparison of scheduling approaches for mixed-parallel applications on heterogeneous platforms. In 6th International
Symposium on Parallel and Distributed Computing (pp.
35-42). Hagenberg, Austria: IEEE Computer Press.

Nesargi, S., & Prakash, R. (2002). MANETconf: Configuration of Hosts in a Mobile Ad Hoc Network. In
Proceedings of the IEEE Infocom 2002, New York,
June 2002.
Neuroth, H., Kerzel, M., & Gentzsch, W. (Eds.). (2007).
German Grid Initiative D-Grid. Göttingen, Germany:
Universitätsverlag Göttingen Publishers. Retrieved from
www.d-grid.de/index.php?id=4&L=1

Nabrzyski, J., Schopf, J. M., & Weglarz, J. (2003). Grid
Resource Management. Amsterdam: Kluwer Publishing.

Ni, J., Lin, C., Chen, Z., & Ungsunan, P. (2007, September). A Fast Multi-pattern Matching Algorithm for
Deep Packet Inspection on a Network Processor. In
Proceedings of International Conference on Parallel
Processing (ICPP 2007) (p. 16).

Nakada, H., Matsuoka, S., Seymour, K., Dongarra, J.,
Lee, C., & Casanova, H. (2003). GridRPC: A remote
procedure call API for grid computing.

Nickolls, J., Buck, I., & Garland, M. (2008). Scalable
Parallel Programming with CUDA. ACM Queue, March/
April, 6(2), 40-53.

Nam, M., Choi, N., Seok, Y., & Choi, Y. (2004). WISE:
Energy-efficient interface selection on vertical handoff
between 3G networks and WLANs. IEEE PIMRC 2004,
1, (pp. 692-698). Washington, DC: IEEE.

Nicolescu, C., & Jonker, P. (2002). A Data and Task Parallel Image Processing Environment. Parallel Computing,
28, 945-965. doi:10.1016/S0167-8191(02)00105-9

Nanda, P. (2008, January). A three layer policy based
architecture supporting Internet QoS. Ph.D. thesis,
University of Technology, Sydney, Australia.
Naor, M., & Wieder, U. (June 2003). Novel Architectures
for P2P applications: The continuous-discrete approach.
In Proc. SPAA.
National e-Science Centre. (2005). Retrieved from http://
www.nesc.ac.uk.
NEESgrid. (2008). Retrieved from www.nees.org/
Nelson, B. J. (1981). Remote Procedure Call. Palo Alto,
CA: Xerox - Palo Alto Research Center.
Nemirovsky, M. D., Brewer, F., & Wood, R. C. (1991).
DISC: dynamic instruction stream computer. In Proceedings of the 24th Annual International Symposium
on Microarchitecture (MICRO91), Albuquerque, NM
(pp. 163-171). New York: ACM Press.

Niederl, F., & Goller, A. (Jan, 1998). Method Execution
On A Distributed Image Processing Backend. Paper
presented at the 6th EUROMICRO Workshop on Parallel
and Distributed Processing, Madrid, Spain.
Nieuwpoort, R. V. v., Maassen, J., Wrzesinska, G., Hofman, R., Jacobs, C., & Kielmann, T. (2005). Ibis: a flexible
and efficient Java-based Grid programming environment.
Concurrency and Computation, 17(7/8), 1079-1108.
Nieuwpoort, R. V. v., Maassen, J., Wrzesinska, G., Kielmann, T., & Bal, H. E. (2004). Satin: Simple and efficient
Java-based grid programming. Journal of Parallel and
Distributed Computing Practices.
Nisan, N., London, S., Regev, O., & Camiel, N. (1998).
Globally distributed computation over the internet - the
popcorn project. In International conference on distributed computing systems 1998 (p. 592). New York: IEEE
Computer Society.

Norman, T. J., Preece, A., Chalmers, S., Jennings, N. R.,
Luck, M., & Dang, V. D. (2004). Agent-based formation
of virtual organisations. Knowledge-Based Systems, 17,
103-111. doi:10.1016/j.knosys.2004.03.005
Oaks, S., Traversat, B., & Gong, L. (2002). JXTA in a
Nutshell. Sebastopol, CA: O'Reilly Media, Inc.
Oberhuber, M. (1998). Distributed High-Performance
Image Processing on the Internet. Doctoral Thesis, Graz
University of Technology, Austria.
ObjectWeb. (2004). RUBBoS: Bulletin Board Benchmark.
Retrieved June 19, 2008, from http://jmob.objectweb.
org/rubbos.html
ObjectWeb. (2005). TPC-W Benchmark (Java Servlets
version). Retrieved June 19, 2008, from http://jmob.
objectweb.org/tpcw.html
OGF. (2008). Open Grid Forum. Retrieved from www.
ogf.org
Oh, J., Lee, S., & Lee, E. (2006). An adaptive mobile
system using mobile grid computing in wireless network.
In Computational Science And Its Applications - ICCSA
2006 (LNCS Vol. 3984, pp. 49-57). Berlin: Springer.
Olukotun, K., & Hammond, L. (2005, September). The Future of Microprocessors. ACM Queue, 3(7), 26-29.
OMG. (2002). Wireless Access and Terminal Mobility in
CORBA Specification. Retrieved June 15th, 2008, from
http://www.info.fundp.ac.be/~ven/CIS/OMG/new%20
documents%20from%20OMG%20on%20CORBA/
corba%20wireless.pdf
Open Science Grid. (2005). Retrieved from http://www.
opensciencegrid.org
Open Source Metascheduling for Virtual Organizations with the Community Scheduler Framework (CSF)
(Tech. Rep.) (2003, August). Ontario, Canada: Platform
Computing.
OpenPBS. The portable batch system software. (2005).
Veridian Systems, Inc., Mountain View, CA. Retrieved
from http://www.openpbs.org/scheduler.html

Oram, A. (2001). Peer-to-Peer: Harnessing the power
of disruptive technologies. O'Reilly.
Orlando, S., & Perego, R. (1999). COLTHPF, A run-time
support for the high-level co-ordination of HPF tasks.
Concurrency (Chichester, England), 11(8), 407-434.
doi:10.1002/(SICI)1096-9128(199907)11:8<407::AID-CPE435>3.0.CO;2-0
Orlando, S., Palmerini, P., & Perego, R. (2000). Coordinating HPF programs to mix task and data parallelism.
In Proceedings of the 2000 ACM Symposium on Applied
Computing (SAC'00) (pp. 240-247). New York: ACM
Press.
Otebolaku, A., Adigun, M., Iyilade, J., & Ekabua,
O. (2007). On modeling adaptation in context-aware
mobile grid systems. In ICAS '07: Proceedings of the
Third International Conference on Autonomic And
Autonomous Systems (p. 52). Washington, DC: IEEE
Computer Society.
Otoo, E., Rotem, D., & Romosan, A. (2004). Optimal
File-Bundle Caching Algorithms for Data-Grids. In SC
'04: Proceedings of the 2004 ACM/IEEE conference on
supercomputing (p. 6). Washington, DC: IEEE Computer Society.
p2psip working group. (2008). Peer-to-Peer Session
Initiation Protocol Specification. Retrieved June 15th,
2008, from http://www.ietf.org/html.charters/p2psip-charter.html
Padala, P., & Wilson, J. N. (2003). GridOS: Operating
system services for grid architectures. In High Performance Computing (pp. 353-362). Berlin: Springer.
Padala, P., Shin, K. G., Zhu, X., Uysal, M., Wang, Z.,
Singhal, S., et al. (2007, March). Adaptive control of
virtualized resources in utility computing environments.
In 2007 Conference on EuroSys (EuroSys 2007) (pp.
289-302). Lisbon, Portugal: ACM Press.
Hsiao, P.-H. (2001). Geographical region summary
service for geographical routing. Paper presented at the
Proceedings of the 2nd ACM international symposium
on Mobile ad hoc networking and computing.

Pairot, C., Garcia, P., Rallo, R., Blat, J., & Gomez
Skarmeta, A. F. (2005). The Planet Project: collaborative educational content repositories on structured
peer-to-peer grids. CCGrid 2005, IEEE International
Symposium on Cluster Computing and the Grid, (Vol.
1, pp. 35-42).
Palankar, M., Onibokun, A., Iamnitchi, A., & Ripeanu, M.
(2007). Amazon S3 for Science Grids: a Viable Solution?
Poster: 4th USENIX Symposium on Networked Systems
Design and Implementation (NSDI'07).
Pantry, S., & Griffiths, P. (1997). The Complete Guide to
Preparing and Implementing Service Level Agreements
(1st Ed.). London: Library Association Publishing.
Parashar, M., & Browne, J. (2005, Mar). Conceptual and implementation models for the grid. Proceedings of the IEEE, 93(3), 653-668. doi:10.1109/
JPROC.2004.842780
Parashar, M., & Hariri, S. (Eds.). (2006). Autonomic
computing: Concepts, infrastructure and applications.
Boca Raton, FL: CRC Press.
Parashar, M., & Lee, C. A. (2005, March). Scanning the
issue: Special issue on grid-computing. In Proceedings
of the IEEE, 93 (3), 479-484. Retrieved from http://www.
caip.rutgers.edu/TASSL/Papers/proc-ieee-intro-04.pdf
Parashar, M., Matossian, V., Klie, H., Thomas, S. G.,
Wheeler, M. F., Kurc, T., et al. (2006). Towards dynamic
data-driven management of the Ruby Gulch waste repository. In V. N. Alexandrov et al. (Eds.), Proceedings of
the Workshop on Distributed Data Driven Applications
and Systems, International Conference on Computational
Science 2006 (ICCS 2006) (Vol. 3993, pp. 384-392).
Berlin: Springer Verlag.
Parekh, A.K. & Gallager, R.G. (1994). A Generalised
Processor Sharing Approach to Flow Control in Integrated Services Networks. IEEE Transactions on
Networking 2(2).
Park, H.-S., Yoon, S.-H., Kim, T.-Y., Park, J.-S., Do, M.,
& Lee, J.-Y. (2003). Vertical handoff procedure and algorithm between IEEE 802.11 WLAN and CDMA cellular
network (LNCS, pp.103-112). Berlin: Springer.

Park, S., Kim, J., Ko, Y., & Yoon, W. (2003). Dynamic
data grid replication strategy based on Internet hierarchy.
In Proceedings of the second international workshop on
grid and cooperative computing (GCC2003).
Park, S.-M., Ko, Y.-B., & Kim, J.-H. (2003, December).
Disconnected operation service in mobile grid computing.
In First International Conference on Service Oriented
Computing (ICSOC2003), Trento, Italy.
Pascual, V., Matuszewski, M., Shim, E., Zheng, H., &
Song, Y. (2008). P2PSIP Clients. Retrieved June 15th,
2008, from http://tools.ietf.org/id/draft-pascual-p2psip-clients-01.txt
Patel, J., Teacy, L. W. T., Jennings, N. R., Luck, M.,
Chalmers, S., & Oren, N. (2005). Agent-based virtual
organisations for the Grids. International Journal of
Multi-Agent and Grid Systems, 1(4), 237-249.
Patterson, D.A., & Hennessy, J.L. Computer Organization and Design (3rd Ed.).
Pavlidou, F. N. (1994). Two-dimensional traffic models for cellular mobile systems. IEEE Transactions
on Communications, 42(234), 1505-1511. doi:10.1109/
TCOMM.1994.582831
Paxson, V., & Sommer, R. (2007). An Architecture Exploiting Multi-Core Processors to Parallelize Network
Intrusion Prevention. In Proceedings of IEEE Sarnoff
Symposium, 3(7), 26-29.
Paxson, V., Asanović, K., Dharmapurikar, S., Lockwood,
J., Pang, R., Sommer, R., et al. (2006). Rethinking
hardware support for network analysis and intrusion
prevention. Proceedings of the 1st conference on USENIX
Workshop on Hot Topics in Security.
Pedroso, J., Silva, L., & Silva, J. (1997, June). Web-based metacomputing with JET. In Proc. of the ACM
PPoPP workshop on Java for science and engineering
computation.
Pelagatti, S. (2003). Task and Data Parallelism in P3L. In
F. A. Rabhi & S. Gorlatch (Eds.), Patterns and skeletons
for parallel and distributed computing (pp. 155-186).
London: Springer-Verlag.

Pelagatti, S., & Skillicorn, D. B. (2001). Coordinating programs in the network of tasks model.
Journal of Systems Integration, 10(2), 107-126.
doi:10.1023/A:1011228808844
Pennebaker, W. B., & Mitchell, J. L. (1992). JPEG: Still
Image Data Compression Standard (Digital Multimedia
Standards). Berlin: Springer.
Perez, C. E. (2003). Open Source Distributed Cache
Solutions Written in Java. Retrieved June 24, 2008, from
http://www.manageability.org/blog/stuff/distributedcache-java
Perez, J.M., Bellens, P., Badia, R.M., & Labarta, J. (2007,
August). CellSs: Programming the Cell/ B.E. made easier.
IBM Journal of R&D, 51(5).
Pericas, M., Cristal, A., Cazorla, F. J., Gonzalez, R.,
Jimenez, D. A., & Valero, M. (2007). A Flexible Heterogeneous Multi-Core Architecture. In Proceedings of the
16th International Conference on Parallel Architecture
and Compilation Techniques, (pp. 13-24).
Perkins, C. (2003). RTP: Audio and Video for the Internet.
New York: Addison-Wesley.
Persistence of Vision Raytracer. (2008, November).
Retrieved from http://www.povray.org
Peterson, L., Muir, S., Roscoe, T., & Klingaman, A.
(2006, May). PlanetLab Architecture: An Overview
(Tech. Rep. No. PDN-06-031). Princeton, NJ: PlanetLab
Consortium.
Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The Case
of the Missing Supercomputer Performance: Achieving
Optimal Performance on the 8,192 Processors of ASCI
Q. Proceedings of the 2003 ACM/IEEE Conference on
Supercomputing.
P-GRADE. (2003). Parallel grid run-time and application development environment. Retrieved from www.
lpds.sztaki.hu/pgrade/
Pham, H. (2000). Software reliability. Singapore:
Springer-Verlag.

Phan, T., Huang, L., & Dulan, C. (2002). Challenge: integrating mobile wireless devices into the computational
grid. In MobiCom '02: Proceedings of the 8th annual
international conference on mobile computing and networking (pp. 271-278). New York: ACM Press.
Pierson, J.-M. (2006, June). A pervasive grid, from the data
side (Tech. Rep. No. RR-LIRIS-2006-015). LIRIS UMR
5205 CNRS/INSA de Lyon/Université Claude Bernard
Lyon 1/Université Lumière Lyon 2/École Centrale de Lyon.
Retrieved from http://liris.cnrs.fr/publis/?id=2436
Pitas, I. (1993). Parallel Algorithm for Digital Image
Processing, Computer Vision and Neural Network.
Chichester, UK: John Wiley & Sons.
Piyachon, P., & Luo, Y. (2006). Efficient memory utilization on network processors for deep packet inspection.
Proceedings of ACM/IEEE ANCS, (pp. 71-80).
Pjesivac-Grbovic, J., Bosilca, G., Fagg, G. E., Angskun,
T., & Dongarra, J. J. (2007). MPI Collective Algorithm
Selection and Quadtree Encoding. Parallel Computing,
33(9), 613-623. doi:10.1016/j.parco.2007.06.005
PlanetLab Europe. (2008). Retrieved from http://www.
planet-lab.eu/.
Plank, J. S. (1997, September). A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems.
Software, Practice & Experience, 27(9), 995-1012.
doi:10.1002/(SICI)1097-024X(199709)27:9<995::AID-SPE111>3.0.CO;2-6
Plank, J. S., & Li, K. (1994). Faster checkpointing with
n+1 parity. In FTCS, (pp. 288-297).
Plank, J. S., & Thomason, M. G. (2001, November).
Processor allocation and checkpoint interval selection
in cluster computing systems. Journal of Parallel and
Distributed Computing, 61(11), 1570-1590. doi:10.1006/
jpdc.2001.1757
Plank, J. S., Beck, M., Kinsley, G., & Li, K. (1995).
Libckpt: Transparent checkpointing under unix. Usenix
winter technical conference, (pp. 213-223).

Plank, J. S., Kim, Y., & Dongarra, J. (1997). Fault-tolerant matrix operations for networks of workstations
using diskless checkpointing. Journal of Parallel and
Distributed Computing, 43(2), 125-138. doi:10.1006/
jpdc.1997.1336
Plank, J. S., Li, K., & Puening, M. A. (1998). Diskless
checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10), 972-986. doi:10.1109/71.730527
Polak, S., Slota, R., Kitowski, J., & Otfinowski, J. (2001).
XML-based Tools for Multimedia Course Preparation.
Archiwum Informatyki Teoretycznej i Stosowanej, 13,
321.
CHRONOS Portal. (2004). Retrieved from http://
portal.chronos.org/gridsphere/gridsphere
Portio Research. (2008). Slicing Up the Mobile Services
Revenue Pie. Retrieved March 10, 2008, from http://www.
portioresearch.com/slicing_pie_press.html
PPDG. (2006). From fabric to physics (Tech. Rep.). The
Particle Physics Data Grid.
PRACE. (2008). Partnership for advanced computing in
Europe. Retrieved from www.prace-project.eu/
Preston, R. P., Badeau, R. W., Bailey, D. W., Bell, S. L.,
Biro, L. L., Bowhill, W. J., et al. (2002). Design of an
8-wide superscalar RISC microprocessor with simultaneous multithreading. In Digest of Technical Papers of the
2002 IEEE International Solid-State Circuits Conference
(ISSCC02), San Francisco, CA (Vol. 1, pp. 334-472).
New York: IEEE Press.
Proactive (2005). Proactive manual REVISED 2.2.,
Proactive, INRIA. Retrieved from http://www-sop.inria.
fr/oasis/Proactive/
Prodan, R., & Fahringer, T. (2008, March). Overhead
analysis of scientific workflows in grid environments.
Transactions on Parallel and Distributed Systems,
19(3), 378-393. doi:10.1109/TPDS.2007.70734
Pro-MPEG. (2005). Material eXchange Format (MXF).
Retrieved 15th June, 2008, from http://www.pro-mpeg.
org.

Pruyne, J., & Livny, M. (1996). A Worldwide Flock of
Condors: Load Sharing among Workstation Clusters.
Journal on Future Generations of Computer Systems,
12.
Qi, Y., Xu, B., He, F., Yang, B., Yu, J., & Li, J. (2007).
Towards high-performance flow-level packet processing
on multi-core network processors. Proceedings of 3rd
ACM/IEEE Symposium on Architecture for Networking
and Communications Systems, (pp. 17-26).
Qiu, D., & Srikant, R. (2004). Modeling and performance analysis of bittorrent-like peer-to-peer networks.
Computer Communication Review, 34(4), 367-378.
doi:10.1145/1030194.1015508
Quan, D. M. (Ed.). (2008). A Framework for SLA-aware
execution of Grid-based workflows. Saarbrücken, Germany: VDM Verlag.
Quan, D. M., & Altmann, J. (2007). Business model and
the policy of mapping light communication grid-based
workflow within the SLA Context. In Proceedings of the
International Conference on High Performance Computing and Communication (HPCC'07), (pp. 285-295).
Berlin: Springer Verlag.
Quan, D. M., & Altmann, J. (2007). Mapping a group of
jobs in the error recovery of the Grid-based workflow
within SLA context. In L. T. Yang (Ed.), Proceedings
of the 21st International Conference on Advanced Information Networking and Applications (AINA 2007), (pp.
986-993). New York: IEEE press.
Mahmoud, Q. H. (2004). Middleware for Communications.
Chichester, UK: John Wiley & Sons Ltd.
Quoitin, B., & Bonaventure, O. (2005). A Co-operative
approach to Inter-domain traffic engineering. 1st Conference on Next Generation Internet Networks Traffic
Engineering (NGI 2005), Rome, Italy, April 18-20th.
Quoitin, B., Uhlig, S., Pelsser, C., Swinnen, L., &
Bonaventure, O. (2003). Internet traffic engineering
with BGP: Quality of Future Internet Services. Berlin:
Springer

Raasch, S. E., & Reinhardt, S. K. (1999). Applications of
thread prioritization in SMT processors. In Proceedings
of the 3rd Workshop on Multithreaded Execution and
Compilation (MTEAC99), Orlando, FL.
Raasch, S. E., & Reinhardt, S. K. (2003). The impact of
resource partitioning on SMT processors. In Proceedings
of the 12th International Conference on Parallel Architectures and Compilation Techniques (PACT03), (pp.
15-25). New Orleans, LA: IEEE Computer Society.
Radulescu, A., & van Gemund, A. J. C. (2001). A low-cost
approach towards mixed task and data parallel scheduling. In Proceedings of the International Conference on
Parallel Processing (ICPP'01) (pp. 69-76). New York:
IEEE Computer Society.
Radulescu, A., Nicolescu, C., van Gemund, A. J. C.,
& Jonker, P. (2001). CPR: Mixed task and data parallel
scheduling for distributed systems. In Proceedings of the
15th International Parallel and Distributed Processing
Symposium (IPDPS01) (pp. 39-46). New York: IEEE
Computer Society.
Rahman, R. M., Barker, K., & Alhajj, R. (2005). Replica
selection in grid environment: A data-mining approach.
In Proceedings of the ACM symposium on applied computing (pp. 695-700).
Rahman, R. M., Barker, K., & Alhajj, R. (2005). Replica
placement in data grid: A multi-objective approach. In
Proceedings of the international conference on grid and
cooperative computing (pp. 645-656).
Rahman, R. M., Barker, K., & Alhajj, R. (2005). Replica
placement in data grid: Considering utility and risk. In
Proceedings of the international conference on information technology: Coding and computing (ITCC'05) (Vol.
1, pp. 354-359).
Raicu, I., Zhao, Y., Dumitrescu, C., Foster, I., & Wilde,
M. (2007). Falkon: a fast and light-weight task execution
framework. In IEEE/ACM Supercomputing.

Ramakrishnan, L., Irwin, D., Grit, L., Yumerefendi, A.,
Iamnitchi, A., & Chase, J. (2006). Toward a doctrine of
containment: Grid hosting with adaptive resource control.
In 2006 ACM/IEEE Conference on Supercomputing (SC
2006) (p. 101). New York: ACM Press.
Raman, R., Livny, M., & Solomon, M. H. (1998). Matchmaking: Distributed resource management for high
throughput computing. In HPDC (p. 140).
Raman, R., Livny, M., & Solomon, M. H. (1999).
Matchmaking: An extensible framework for distributed resource management. Cluster Computing, 2(2),
129-138. doi:10.1023/A:1019022624119
Ramaswamy, S. (1996). Simultaneous exploitation of
task and data parallelism in regular scientific computations. Doctoral thesis, University of Illinois at Urbana-Champaign.
Ramaswamy, S., Sapatnekar, S., & Banerjee, P. (1997).
A framework for exploiting task and data parallelism on
distributed memory multicomputers. IEEE Transactions
on Parallel and Distributed Systems, 8(11), 1098-1116.
doi:10.1109/71.642945
Ramaswamy, S., Simons, B., & Banerjee, P. (1996).
Optimizations for efficient array redistribution on distributed memory multicomputers. Journal of Parallel
and Distributed Computing, 38(2), 217-228. doi:10.1006/
jpdc.1996.0142
Ramjee, R., Li, L., La Porta, T., & Kasera, S. (2002). IP
paging service for mobile hosts. Wireless Networks, 8,
427-441. doi:10.1023/A:1016534027402
Ramkumar, B., & Strumpen, V. (1997). Portable checkpointing for heterogeneous architectures. Symposium on
fault-tolerent computing, (pp. 58-67).
Ranganathan, K., & Foster, I. (2001). Design and evaluation of dynamic replication strategies for a high performance data grid. In Proceedings of the international
conference on computing in high energy and nuclear
physics (pp. 260-263).

Ranganathan, K., & Foster, I. (2002). Decoupling computation and data scheduling in distributed data intensive
applications. In Proceedings of the 11th international
symposium for high performance distributed computing
(HPDC) (pp. 352-358).

Ranjan, R., Harwood, A., & Buyya, R. (2008, July). Peerto-peer resource discovery in global grids: A tutorial.
IEEE Communication Surveys and Tutorials (COMST),
10(2), 6-33. New York: IEEE Communications Society
Press. doi:10.1109/COMST.2008.4564477

Ranganathan, K., & Foster, I. (2003). Simulation studies of computation and data scheduling algorithms for
data grids. Journal of Grid Computing, 1(1), 53-62.
doi:10.1023/A:1024035627870

Ranjan, R., Rahman, M., & Buyya, R. (2008, May).
A decentralized and cooperative workflow scheduling
algorithm. In 8th IEEE International Symposium on
Cluster Computing and the Grid (CCGRID 2008). Lyon,
France: IEEE Computer Society.

Ranganathan, K., & Foster, I. T. (2001). Identifying
dynamic replication strategies for a high-performance
data grid. In Proceedings of the International Workshop
on Grid Computing (GRID2001) (pp. 75-86).
Ranganathan, K., Iamnitchi, A., & Foster, I. (2002).
Improving data availability through dynamic modeldriven replication in large peer-to-peer communities. In
Proceedings of the 2nd IEEE/ACM international symposium on cluster computing and the grid (CCGRID'02)
(pp. 376-381).
Ranjan, R. (2007, July). Coordinated resource provisioning in federated grids. Doctoral thesis, The University
of Melbourne, Australia.
Ranjan, R., Buyya, R., & Harwood, A. (2005, September).
A case for cooperative and incentive-based coupling of
distributed clusters. In 7th IEEE International Conference
on Cluster Computing. Boston, MA: IEEE CS Press.
Ranjan, R., Harwood, A., & Buyya, R. (2006, September). SLA-based coordinated superscheduling scheme
for computational Grids. In IEEE International Conference on Cluster Computing (Cluster 2006) (pp. 1-8).
Barcelona, Spain: IEEE.
Ranjan, R., Harwood, A., & Buyya, R. (2008). Coordinated load management in peer-to-peer coupled federated
grid systems. (Technical Report GRIDS-TR-2008-2).
Grid Computing and Distributed Systems Laboratory,
The University of Melbourne, Australia. Retrieved from http://www.
gridbus.org/reports/CoordinatedGrid2007.pdf

Rao, A., Lakshminarayanan, K., Surana, S., Karp, R.,
& Stoica, I. (2003). Load Balancing in structured P2P
systems. Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 68-79). Berlin: Springer-Verlag.
Rashid, R. F., & Robertson, G. (1981). Accent: A communication oriented network operating system kernel.
Proceedings of the eighth acm symposium on operating
systems principles, (pp. 64-75).
Ratnasamy, S., Francis, P., Handley, M., Karp, R., &
Schenker, S. (2001). A scalable content-addressable
network. In SIGCOMM'01, Proceedings of the 2001
Conference on Applications, Technologies, Architectures,
and Protocols for Computer Communications, (pp. 161-172). New York: ACM Press. Retrieved from http://doi.
acm.org/10.1145/ 383059.383072
Ratnasamy, S., Handley, M., Karp, R., & Shenker, S.
(2002). Topologically aware overlay construction and
server selection. In Proc. of INFOCOM.
Ratnasamy, S., Stoica, I., & Shenker, S. (2002). Routing
algorithms for DHTs: Some open questions. Proceedings of
the 1st Intl. Workshop on Peer-to-Peer Systems (pp. 45-52). Berlin: Springer-Verlag.
Rauber, T., & Rünger, G. (1996). The compiler TwoL
for the design of parallel implementations. In Proceedings of the 1996 Conference on Parallel Architectures
and Compilation Techniques (PACT'96) (pp. 292-301).
Washington, DC: IEEE Computer Society.

Rauber, T., & Rünger, G. (1999). Compiler support for
task scheduling in hierarchical execution models. Journal
of Systems Architecture, 45(6-7), 483-503. doi:10.1016/
S1383-7621(98)00019-8
Rauber, T., & Rünger, G. (1999). Parallel execution
of embedded and iterated Runge-Kutta methods.
Concurrency (Chichester, England), 11(7), 367-385.
doi:10.1002/(SICI)1096-9128(199906)11:7<367::AID-CPE430>3.0.CO;2-G
Rauber, T., & Rünger, G. (1999). Scheduling of data
parallel modules for scientific computing. In Proceedings of the 9th SIAM Conference on Parallel Processing
for Scientific Computing (PPSC), SIAM(CD-ROM), San
Antonio, TX.
Rauber, T., & Rünger, G. (2000). A transformation approach to derive efficient parallel implementations. IEEE
Transactions on Software Engineering, 26(4), 315-339.
doi:10.1109/32.844492
Rauber, T., & Rünger, G. (2005). TLib - A library to
support programming with hierarchical multi-processor
tasks. Journal of Parallel and Distributed Computing,
65(3), 347-360.
Rauber, T., & Rünger, G. (2006). A data re-distribution
library for multi-processor task programming. International Journal of Foundations of Computer Science,
17(2), 251-270. doi:10.1142/S0129054106003814
Rauber, T., & Rünger, G. (2007). Mixed task and data
parallel executions in general linear methods. Scientific
Programming, 15(3), 137-155.
Rauber, T., Reilein-Ru, R., & Rünger, G. (2004). Group-SPMD programming with orthogonal processor groups.
Concurrency and Computation: Practice and Experience
. Special Issue on Compilers for Parallel Computers,
16(2-3), 173-195.

Rauber, T., Reilein-Ru, R., & Rünger, G. (2004). On
compiler support for mixed task and data parallelism.
In G. R. Joubert, W. E. Nagel, F. J. Peter, & W. V. Walter (Eds.), Parallel Computing: Software Technology,
Algorithms, Architectures & Applications. Proceedings
of 12th International Conference on Parallel Computing
(ParCo'03) (pp. 23-30). New York: Elsevier.
Rauber, T., Rünger, G., & Wilhelm, R. (1995). Deriving
optimal data distributions for group parallel numerical
algorithms. In Proceedings of the Conference on Programming Models for Massively Parallel Computers
(PMMP'94) (pp. 33-41). Washington, DC: IEEE Computer Society.
Ray, E. (2003). Learning XML. Sebastopol, CA: O'Reilly
Media, Inc.
Reed, D. A. (2003). Grids: The teragrid, and beyond.
IEEE Computer, 36(1), 62-68.
Reilein-Ru, R. (2005). Eine komponentenbasierte
Realisierung der TwoL Spracharchitektur. PhD Thesis, TU Chemnitz, Fakultät für Informatik, Chemnitz,
Germany.
Reinhardt, S., & Mukherjee, S. (2000). Transient fault
detection via simultaneous multithreading. In ACM
SIGARCH Computer Architecture News: Special Issue:
Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), (pp. 25-36).
Vancouver, Canada: ACM Press.
Rekhter, Y. & Li, T. (2002, January). A border gateway
protocol 4 (BGP-4): draft-ietf-idr-bgp4-17.txt [Internet
draft, work in progress].
Replica Location Service (RLS) (n.d.). Retrieved from
http://www.globus.org/toolkit/docs/4.0/data/rls/
Reuters (2007). Global cellphone penetration reaches
50 pct. Retrieved March 10, 2008, from http://investing.
reuters.co.uk/news/articleinvesting.aspx?type=media&
storyID=nL29172095
Reeves, C. (1993). Modern heuristic techniques for
combinatorial problems. Oxford, UK: Oxford Blackwell
Scientific Publication.

Rhea, S. C., & Kubiatowicz, J. (2002). Probabilistic
location and routing. Paper presented at the IEEE INFOCOM 2002, Twenty-First Annual Joint Conference
of the IEEE Computer and Communications Societies
Proceedings.
Rhea, S. C., Eaton, P. R., Geels, D., Weatherspoon, H.,
Zhao, B. Y., & Kubiatowicz, J. (2003). Pond: The OceanStore prototype. In FAST.
Rodriguez, A., Gonzalez, A., & Malumbres, M. P. Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations. International Conference on Parallel Computing in Electrical Engineering (PARELEC'04), 354-357.
Rhea, S., Geels, D., Roscoe, T., & Kubiatowicz, J. (2004).
Handling Churn in a DHT. Proceedings of the USENIX
(pp. 127-140). USENIX Association.
Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., et al. (2005). OpenDHT: A public
DHT service and its uses. In Proceedings of ACM SIGCOMM (pp. 73-84). New York: ACM Press.
Ricci, R., Oppenheimer, D., Lepreau, J., & Vahdat, A.
(2006, January). Lessons from resource allocators for
large-scale multiuser testbeds. SIGOPS Operating Systems
Review, 40(1), 25-32. doi:10.1145/1113361.1113369
Richardson, I. E. G. (2003). H.264 and
MPEG-4 Video Compression: Video Coding for Next
Generation Multimedia. Chichester, UK: Wiley.
Ripeanu, M., Foster, I., & Iamnitchi, A. (2002). Mapping
the Gnutella network: properties of large-scale peer-to-peer systems and implications for system design. IEEE
Internet Computing, 6(1), 50-57.
Robertazzi, T. (2003). Ten reasons to use divisible load
theory. Institute of Electrical and Electronic Engineering, 36(5), 63-68.
Rodriguez, B. (2002). EDLXML serialization. Retrieved
15th June, 2008, from download.sybase.com/pdfdocs/
prg0390e/prsver39edl.pdf
Roesch, M. (1999). Snort - lightweight intrusion detection for networks. Proceedings of 13th USENIX LISA
Conference, (pp. 229-238).

Roman, M., Kon, F., & Campbell, R. (2001). Reflective
Middleware: From your Desk to your Hand. IEEE Communications Surveys, 2(5).
Roure, D. D. (2003). Semantic grid and pervasive computing.
Retrieved from http://www.semanticgrid.org/GGF/ggf9/gpc/
Rowstron, A., & Druschel, P. (2001). Pastry: Scalable,
decentralized object location, and routing for large-scale
peer-to-peer systems. In Middleware01 Proceedings of
the IFIP/ACM International Conference on Distributed
Systems Platforms, (pp. 329-350). Heidelberg, Germany:
SpringerLink. doi: 10.1007/3-540-45518-3
Rowstron, A., & Druschel, P. (2001). Pastry: Scalable,
distributed object location and routing for large-scale
peer-to-peer systems. In Proceedings of IFIP/ACM Intl.
Conf. on Distributed Systems Platforms (pp. 329-350).
Berlin: Springer-Verlag.
Rowstron, A., & Druschel, P. (2001, November). Pastry:
Scalable, distributed object location and routing for largescale peer-to-peer systems. In Proceedings of the 18th
ifip/acm international conference on distributed systems
platforms (middleware 2001), Heidelberg, Germany.
Rowstron, A., & Druschel, P. (2001). Pastry: Scalable,
decentralized object location and routing for large-scale
peer-to-peer systems. In Proc. of the 18th IFIP/ACM Intl
Conf. on Distributed Systems Platforms (Middleware).
Rubio-Montero, A., Huedo, E., Montero, R., & Llorente,
I. (2007, March). Management of virtual machines on
globus Grids using GridWay. In IEEE International
Parallel and Distributed Processing Symposium (IPDPS
2007) (pp. 1-7). Long Beach, USA: IEEE Computer
Society.
Ruth, P., Jiang, X., Xu, D., & Goasguen, S. (2005, May).
Virtual distributed environments in a shared infrastructure. IEEE Computer, 38(5), 63-69.
Ruth, P., McGachey, P., & Xu, D. (2005, September).
VioCluster: Virtualization for dynamic computational
domain. In IEEE International on Cluster Computing
(Cluster 2005) (pp. 1-10). Burlington, MA: IEEE.

Ruth, P., Rhee, J., Xu, D., Kennell, R., & Goasguen,
S. (2006, June). Autonomic live adaptation of virtual
computational environments in a multi-domain infrastructure. In 3rd IEEE International Conference on
Autonomic Computing (ICAC 2006) (pp. 5-14). Dublin,
Ireland: IEEE.
Cohen, S., & Matias, Y. (2003). Spectral Bloom filters. Paper
presented at the Proceedings of the 2003 ACM SIGMOD
international conference on Management of data.
Saara Vrt, S. (Ed.). (2008). Advancing science in
Europe. DEISA Distributed European Infrastructure
for Supercomputing Applications. EU FP6 Project.
Retrieved from www.deisa.eu/press/DEISA-AdvancingScienceInEurope.pdf
SAGA. (2006). SAGA implementation home page Retrieved from http://fortytwo.cct.lsu.edu:8000/SAGA
Sahai, A., Graupner, S., Machiraju, V., & Moorsel, A.
(2003). Specifying and monitoring guarantees in commercial grids through SLA. In F. Tisworth (Ed.), Proceeding
of the 3rd IEEE/ACM CCGrid2003, (pp. 292-300). New
York: IEEE press.
Sairamesh, J., Stanbridge, P., Ausio, J., Keser, C., &
Karabulut, Y. (2005, March). Business Models for Virtual
Organization Management and Interoperability (Deliverable A - WP8&15 WP - Business & Economic Models
No. V.1.5). Deliverable document 01945 prepared for
TrustCom and the European Commission.
Saito, Y., & Levy, H. M. (2000). Optimistic replication
for internet data services. In Proceedings of international
symposium on distributed computing (pp. 297-314).
Saito, Y., & Shapiro, M. (2005). Optimistic replication. ACM Computing Surveys, 37(1), 42-81.
doi:10.1145/1057977.1057980
Saleh, O., & Hefeeda, M. (2006). Modeling and caching
of peer-to-peer traffic. In Proc. of 14th IEEE International Conference on Network Protocols (ICNP06),
(pp. 249-258).

Salkintzis, A. K. (2004). Interworking techniques and architectures for WLAN-3G integration toward 4G mobile data networks. IEEE Wireless Communications, 11(3), 50-61. doi:10.1109/MWC.2004.1308950
Salkintzis, A. K., Fors, C., & Pazhyannur, R. (2002). WLAN-GPRS integration for next generation mobile data networks. IEEE Wireless Communications, 9(5), 112-124. doi:10.1109/MWC.2002.1043861
Salsano, S. (2001 October). COPS usage for Diffserv
resource allocation (COPS-DRA) [Internet Draft].
Samet, H. (2008, November). The design and analysis
of spatial data structures. New York: Addison-Wesley
Publishing Company.
Sanders, R. (2008). SETI@home looking for more volunteers. Retrieved 10 March, 2008, from http://www.
berkeley.edu/news/media/releases/2008/01/02_setiahome.shtml
Santos-Neto, E., Cirne, W., Brasileiro, F., & Lima, A.
(2004). Exploiting Replication and Data Reuse to Efficiently Schedule Data-intensive Applications on Grids.
In Proceedings of the 10th workshop on job scheduling
strategies for parallel processing.
Santos-Neto, E., Cirne, W., Brasileiro, F., & Lima, A.
(2004). Exploiting replication and data reuse to efficiently
schedule data-intensive applications on grids. In Proceedings of 10th workshop on job scheduling strategies for
parallel processing (Vol. 3277, pp. 210232).
Sarmenta, L. F. G. (2002). Sabotage-tolerance mechanisms for volunteer computing systems. Future Generation Computer Systems, 18(4), 561-572. doi:10.1016/
S0167-739X(01)00077-2
Sarmenta, L. F. G., & Hirano, S. (1999). Bayanihan:
Building and studying volunteer computing systems
using Java. Future Generation Computer Systems,
15(5/6), 675-686.
Saroiu, S., et al. (2002). A Measurement Study of Peer-to-Peer File Sharing Systems. In Proc. of MMCN.


Scales, D. J., & Gharachorloo, K. (1997). Towards transparent and efficient software distributed shared memory.
Paper presented at the Proceedings of the sixteenth ACM
symposium on Operating systems principles.
Schiffmann, W., Sulistio, A., & Buyya, R. (2007). Using
Revenue Management to Determine Pricing of Reservations. Proc. 3rd International Conference on e-Science
and Grid Computing (eScience 2007) Bangalore, India,
December 10-13.
Schilit, B., Adams, N., & Want, R. (1994). Context-aware
computing applications. In Proceedings of Mobile Computing Systems and Applications, (pp. 85-90).
Schintke, F., & Reinefeld, A. (2003). Modeling
replica availability in large data grids. Journal
of Grid Computing, 1(2), 219-227. doi:10.1023/
B:GRID.0000024086.50333.0d
Schirrmeister, F. (2007). Multi-core Processors: Fundamentals, Trends, and Challenges, Embedded Systems
Conference, (pp. 6-15).
Schowengerdt, R. A., & Mehldau, G. (1993). Engineering
a Scientific Image Processing Toolbox for the Macintosh II. International Journal of Remote Sensing, 14(4),
669-683. doi:10.1080/01431169308904367

Schwiegelshohn, U., & Yahyapour, R. (1999). Resource allocation and scheduling in metasystems. In 7th International Conference on High-Performance Computing and Networking (HPCN Europe '99) (pp. 851-860). London, UK: Springer-Verlag.
Schwiegelshohn, U., & Yahyapour, R. (2000). Fairness in
parallel job scheduling. Journal of Scheduling, 3(5), 297-320. doi:10.1002/1099-1425(200009/10)3:5<297::AID-JOS50>3.0.CO;2-D
Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash,
M., & Dubey, P. (2008). Larrabee: a many-core x86 architecture for visual computing. [TOG]. ACM Transactions
on Graphics, 27(3). doi:10.1145/1360612.1360617
Seltzer, M. I., Krinsky, D., & Smith, K. A. (1999). The
case for application-specific benchmarking. Workshop
on Hot Topics in Operating Systems (pp. 102-109). Rio
Rico, AZ: IEEE Computer Society Press.
Sensor Networks. Retrieved from http://www.sensornetworks.net.au/network.html
SETIstats. (2008). Seti@home Project Statistics. Retrieved March 10, 2008, from http://boincstats.com/stats/
project_graph.php?pr=bo

Schüller, F., Qin, J., Nadeem, F., Prodan, R., Fahringer, T., & Mayr, G. (2006). Performance, scalability and quality of the meteorological grid workflow MeteoAG. In Austrian Grid Symposium. Innsbruck, Austria: OCG Verlag.

Seymour, K., Nakada, H., Matsuoka, S., Dongarra, J., Lee, C., & Casanova, H. (2002). Overview of GridRPC: A remote procedure call API for Grid computing. In Proceedings of the Third International Workshop on Grid Computing, Baltimore, MD (LNCS 2536, pp. 274-278). Berlin: Springer.

Schwartz, E. (1980). Computational Anatomy and Functional Architecture of Striate Cortex: A Spatial Mapping
Approach to Perceptual Coding. Vision Research, 20,
645-669. doi:10.1016/0042-6989(80)90090-5

Sfiligoi, I., Koeroo, O., Venekamp, G., Yocum, D., Groep, D., & Petravick, D. (2007). Addressing the Pilot security problem with gLExec (Tech. Rep. No. FERMILAB-PUB-07-483-CD). Fermi National Laboratory, Batavia, IL.

Schwarz, K., Blaha, P., & Madsen, G. K. (2002). Electronic structure calculations of solids using the WIEN2k package for material sciences. Computer Physics Communications, 147(71).

Shankland, S. (2007). Sun starts bidding adieu to mobilespecific Java. Retrieved March 10, 2008, from http://
www.news.com/8301-13580_3-9800679-39.html?part=
rss&subj=news&tag=2547-1_3-0-20
SHARCNET. (2008). Shared Hierarchical Academic
Research Computing Network (SHARCNET).


ShareGrid Project. (2008, November). Retrieved from http://dcs.di.unipmn.it/sharegrid.
Shen, H., & Xu, C. (2006,April). Hash-based proximity
clustering for load balancing in heterogeneous DHT
networks. In Proc. of IPDPS.
Shen, H., & Xu, C.-Z. (2007). Locality-aware and Churn-resilient load balancing algorithms in structured peer-to-peer networks. IEEE Transactions on Parallel and Distributed Systems [TPDS], 18(6), 849-862. doi:10.1109/TPDS.2007.1040
Shen, H., Xu, C., & Chen, G. (2006). Cycloid: A scalable
constant-degree P2P overlay network. Performance Evaluation, 63(3), 195-216. doi:10.1016/j.peva.2005.01.004
Shen, J. P., & Lipasti, M. (2004). Modern Processor
Design: Fundamentals of Superscalar Processors (1st
Ed.).
Shen, W., & Zeng, Q.-A. (2007). Cost-function-based
network selection strategy in heterogeneous wireless
networks. IEEE International Symposium on Ubiquitous Computing and Intelligence (UCI-07).
Washington, DC: IEEE.
Shen, W., & Zeng, Q.-A. (2008). Cost-function-based
network selection strategy in integrated heterogeneous
wireless and mobile networks. To appear in IEEE Transactions on Vehicular Technology.
Sheridan, P. (1996). Spiral Architecture for Machine
Vision. Doctoral Thesis, University of Technology,
Sydney.
Sheridan, P., Hintz, T., & Alexander, D. (2000). Pseudo-invariant Image Transformations on a Hexagonal Lattice. Image and Vision Computing, 18(11), 907-917.
doi:10.1016/S0262-8856(00)00036-6
Shi, W., Lee, H.-H., Ghosh, M., & Lu, C. (2004). Architectural support for high speed protection of memory
integrity and confidentiality in multiprocessor systems.
In Proceedings of the 13th International Conference on
Parallel Architectures and Computation Techniques
(PACT04), Antibes Juan-les-Pins, France (pp.123-134).
New York: IEEE Computer Society.

Shin, C.-H., & Gaudiot, J.-L. (2006). Adaptive dynamic thread scheduling for simultaneous multithreaded architectures with a detector thread. Journal of Parallel and Distributed Computing, 66(10), 1304-1321. doi:10.1016/j.jpdc.2006.06.003
Shin, C.-H., Lee, S.-W., & Gaudiot, J.-L. (2003). Dynamic
scheduling issues in SMT architectures. In Proceedings
of the 17th International Symposium on Parallel and
Distributed Processing (IPDPS03), Nice, France, (p.
77b). New York: IEEE Computer Society.
Shirts, M., & Pande, V. (2000). Screen savers of the
world, unite! Science, 290, 1903-1904. doi:10.1126/science.290.5498.1903
Shoch, J. F., & Hupp, J. A. (1982, March). The worm programs - early experience with a distributed computation. Communications of the ACM, 25(3).
Shoykhet, A., Lange, J., & Dinda, P. (2004, July). Virtuoso:
A System For Virtual Machine Marketplaces [Technical
Report No. NWU-CS-04-39]. Evanston/Chicago: Electrical Engineering and Computer Science Department,
Northwestern University.
Siagri, R. (2007). Pervasive computers and the GRID:
The birth of a computational exoskeleton for augmented
reality. In 6th Joint Meeting of the European Software
Engineering Conference and the ACM SIGSOFT Symposium on The foundations of software engineering
(pp.1-4), Croatia.
Siddiqui, M., Villazón, A., & Fahringer, T. (2006).
Grid capacity planning with negotiation-based advance
reservation for optimized QoS. In 2006 ACM/IEEE
Conference on Supercomputing (SC 2006) (pp. 2121).
New York: ACM.
Siddiqui, M., Villazón, A., Hoffer, J., & Fahringer, T.
(2005). GLARE: A Grid activity registration, deployment, and provisioning framework. Supercomputing
Conference. Seattle, WA: IEEE Computer Society
Press.


Siegel, H. J., Armstrong, J. B., & Watson, D. W. (1992). Mapping Computer-Vision-Related Tasks onto Reconfigurable Parallel-Processing Systems. IEEE Computer, 25(2), 54-63.
Siegel, L. J., Siegel, H. J., & Feather, A. E. (1982). Parallel Processing Approaches to Image Correlation. IEEE
Transactions on Computers, 31(3), 208-218. doi:10.1109/
TC.1982.1675976
Silagadze, Z. (1997). Citations and the Mandelbrot-Zipf's law. Complex Systems, 11, 487-499.
Silva, L. M., & Silva, J. G. (1998). An experimental study
about diskless checkpointing. In EUROMICRO'98, (pp. 395-402).
SIMDAT. (2008). Grids for industrial product development. Retrieved from www.scai.fraunhofer.de/
about_simdat.html
Simmonds, A., & Nanda, P. (2002). Resource Management in Differentiated Services Networks. In C. McDonald (Ed.), Proceedings of Converged Networking: Data and Real-time Communications over IP, IFIP Interworking 2002, Perth, Australia, October 14-16, (pp. 313-323). Amsterdam: Kluwer Academic Publishers.
Singh, M. P., & Vouk, M. A. (1997). Scientific workflows:
Scientific computing meets transactional workflows.
Retrieved January 13, 2006 from http://www.csc.ncsu.edu/faculty/mpsingh/papers/databases/workflows/sciworkflows.html
Sinharoy, B., Kalla, R. N., Tendler, J. M., Eickemeyer,
R. J., & Joyner, J. B. (2005). Power5 system microarchitecture. IBM Journal of Research and Development,
49(4/5), 505-521.
Sips, H. J., & van Reeuwijk, C. (2004). An integrated
annotation and compilation framework for task and data
parallel programming in Java. In Parallel Computing
(PARCO): Software Technology, Algorithms, Architectures and Applications (pp. 111-118). New York:
Elsevier.


Skillicorn, D. B. (1999). The network of tasks model (TR1999-427). Queen's University, Kingston, Canada.
Smallen, S., Casanova, H., & Berman, F. (2001, Nov.).
Tunable on-line parallel tomography. In Proceedings of
Supercomputing01, Denver, CO.
Smarr, L., & Catlett, C. E. (1992, June). Metacomputing. Communications of the ACM, 35(6), 44-52.
doi:10.1145/129888.129890
SMIL/ W3C. (2005). SMIL- Synchronized Multimedia
Integration Language. Retrieved June 15th, 2008 from
http://www.w3.org/AudioVideo/
Smith, B. J. (1981). Architecture and applications of the
HEP multiprocessor computer system. In SPIE Proceedings of Real Time Signal Processing IV, 298, 241-248.
Smith, P., & Hutchinson, N. C. (1998). Heterogeneous
process migration: The tui system. Software, Practice & Experience, 28(6), 611-639. doi:10.1002/(SICI)1097-024X(199805)28:6<611::AID-SPE169>3.0.CO;2-F
SMPTE. (2004). Metadata dictionary registry of metadata element descriptions. Retrieved June 15th, 2008,
from http://www.smpte-ra.org/mdd/rp210-8.pdf
Snavely, A., & Weinberg, J. (2006). Symbiotic space-sharing on SDSC's DataStar system. Job Scheduling Strategies for Parallel Processing (LNCS 4376, pp. 192-209). St. Malo, France: Springer Verlag.
Snytnikov, V. N., Vshivkov, V. A., Kuksheva, E. A., Neupokoev, E. V., Nikitin, S. A., & Snytnikov, A. V. (2004).
Three-dimensional numerical simulation of a nonstationary gravitating n-body system with gas. Astronomy
Letters, 30(2), 124-138. doi:10.1134/1.1646697
SOAP/W3C. (2003). SOAP Version 1.2 Part 1: Messaging
Framework. Retrieved June 15th, 2008, from http://www.w3.org/TR/2003/REC-soap12-part1-20030624/
Soh, H., Haque, S., Liao, W., & Buyya, R. (2006). Grid programming models and environments. In Yuan-Shun Dai, et al. (Eds.), Advanced parallel and distributed computing (pp. 141-173). Hauppauge, NY: Nova Science Publishers.


Sohi, G. S., Breach, S. E., & Vijaykumar, T. N. (1995). Multiscalar processors. Proceedings of 22nd Annual International Symposium on Computer Architecture, (pp. 414-425).

Srinivasan, S. H. (2005). Pervasive wireless grid architecture. In Proceedings of The Second Annual Conference
on Wireless On-demand Network Systems and Services
(pp.83-88), Switzerland.

Song, H., Dharmapurikar, S., Turner, J., & Lockwood, J. (2005). Fast hash table lookup using extended bloom filter: an aid to network processing. Paper presented at the Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications.

Ssu, K., Yao, B., & Fuchs, W. K. (1999). An adaptive checkpointing protocol to bound recovery time with message logging. Symposium on reliable distributed systems, (pp. 244-252).

Song, Q., & Jamalipour, A. (2005). Network selection in an integrated wireless LAN and UMTS environment using mathematical modeling and computing techniques. IEEE Wireless Communications, 12(3), 42-48. doi:10.1109/MWC.2005.1452853
Song, Y., Jiang, X., Zheng, H., & Deng, H. (2008). P2PSIP
Client Protocol. Retrieved June 15th, 2008, from http://
tools.ietf.org/id/draft-jiang-p2psip-sep-01.txt.
Sonnek, J. D., Nathan, M., Chandra, A., & Weissman,
J. B. (2006). Reputation-based scheduling on unreliable
distributed infrastructures. In ICDCS (p. 30).
Spooner, D. P., Jarvis, S. A., Cao, J., Saini, S., & Nudd,
G. R. (2003). Local grid scheduling techniques using
performance prediction. In S. Govan (Ed.), IEEE Proceedings - Computers and Digital Techniques Vol 150,
(pp. 87-96). New York: IEEE Press.
Spring.NET. (2008, November). Retrieved from http://
www.springframework.net.
Squyres, J. M., Lumsdaine, A., & Stevenson, R. L. (1995).
A Cluster-based Parallel Image Processing Toolkit. Paper
presented at the IS&T Conference on Image and Video
Processing, San Jose, CA.
SRB (Storage Resource Broker) (n.d.). Retrieved from
http://www.sdsc.edu/srb/index.php/Main_Page
Srinivasan, R. (1995). XDR: External Data Representation Standard (Tech. Rep. No. RFC 1832).

van Steen, M., Homburg, P., & Tanenbaum, A. S. (1999). Globe: a wide area distributed system. IEEE Concurrency [See also IEEE Parallel & Distributed Technology], 7, 70-78.
Stellner, G. (1996). Cocheck: Checkpointing and process
migration for mpi. Proceedings of 10th international
parallel processing symposium.
Stemm, M., & Katz, R. H. (1998). Vertical handoffs in
wireless overlay networks. ACM Mobile Networking
(MONET) [New York: ACM.]. Special Issue on Mobile
Networking in the Internet, 3(4), 335-350.
Stets, R., Dwarkadas, S., Hardavellas, N., Hunt, G.,
Kontothanassis, L., & Parthasarathy, S. (1997). Cashmere-2L: software coherent shared memory on a clustered remote-write network. SIGOPS Oper. Syst. Rev., 31(5), 170-183. doi:10.1145/269005.266675
Stevenson, R. L., Adams, G. B., Jamieson, L. H., &
Delp, E. J. (1993, April). Parallel Implementation for
Iterative Image Restoration Algorithms on a Parallel
DSP Machine. The Journal of VLSI Signal Processing,
5, 261-272. doi:10.1007/BF01581300
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., &
Balakrishnan, H. (2001). Chord: A scalable peer-to-peer
lookup service for Internet applications. In Proceedings
of ACM SIGCOMM (pp. 149-160). New York: ACM
Press.
Stone, N. (2004). GWD-I: An architecture for grid checkpoint recovery services and a GridCPR API. Retrieved
October 15, 2006 from http://gridcpr.psc.edu/GGF/docs/
draft-ggf-gridcpr-Architecture-2.0.pdf


Storz, O., Friday, A., & Davies, N. (2003, October). Towards ubiquitous ubiquitous computing: an alliance with the grid. In Proceedings of the First Workshop On System Support For Ubiquitous Computing Workshop (UBISYS 2003) in association with Fifth International Conference On Ubiquitous Computing, Seattle, WA. Retrieved from http://ciae.cs.uiuc.edu/ubisys/papers/alliance-w-grid.pdf
Stuart, W., & Koch, T. (2000). The Dublin Core Metadata
Initiative: Mission, Current Activities, and Future Directions, (Vol. 6). Retrieved June 15th, 2008, from http://www.dlib.org/dlib/december00/weibel/12weibel.html
Su, M., El-Kady, I., Bader, D. A., & Lin, S. (2004, August). A Novel FDTD Application Featuring OpenMP-MPI Hybrid Parallelization. In 33rd International Conference on Parallel Processing (ICPP), Montreal, Canada, (pp. 373-379).
Subhlok, J., & Vondran, G. (1995). Optimal mapping of
sequences of data parallel tasks. ACM SIGPLAN Notices,
30(8), 134-143. doi:10.1145/209937.209951
Subhlok, J., & Yang, B. (1997). A new model for integrated nested task and data parallel programming. In
Proceedings of the 6th ACM SIGPLAN symposium on
Principles and Practice of Parallel Programming (pp.
1-12). New York: ACM Press.
Sun, M., Sun, J., Lu, E., & Yu, C. (2005). Ant algorithm
for file replica selection in data grid. In Proceedings of the
first international conference on semantics, knowledge,
and grid (SKG 2005) (pp. 64-66).
Sun, Y., & Xu, Z. (2004). Grid replication coherence
protocol. In Proceedings of the 18th international parallel
and distributed processing symposium (pp. 232-239).
SunGrid. (2005). Sun utility computing. Retrieved from
www.sun.com/service/sungrid/
SURA Southeastern Universities Research Association.
(2007). The Grid technology cookbook: Programming
concepts and challenges. Retrieved from www.sura.org/
cookbook/gtcb/


Suter, F., Desprez, F., & Casanova, H. (2004). From heterogeneous task scheduling to heterogeneous mixed parallel scheduling. In Proceedings of the 10th International Euro-Par Conference (Euro-Par'04), (LNCS: Vol. 3149, pp. 230-237). Pisa, Italy: Springer.
Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue: Tomorrow's Computing Today, 3(7), 54-62. doi:10.1145/1095408.1095421
Svirskas, A., Arevas, A., Wilson, M., & Matthews, B.
(2005, October). Secure and trusted virtual organization
management. ERCIM News (63).
Taesombut, N., & Chien, A. (2004). Distributed virtual
computer (dvc): Simplifying the development of high
performance grid applications. In Workshop on Grids and
Advanced Networks (GAN 04), IEEE Cluster Computing
and the Grid (ccgrid2004) Conference, Chicago.
Taflove, A., & Hagness, S. (2000). Computational
Electrodynamics: The Finite-Difference Time-Domain
Method, second edition. Boston: Artech House.
Tai, A., Meyer, J., & Avizienis, A. (1993). Performability enhancement of fault-tolerant software.
IEEE Transactions on Reliability, 42(2), 227-237.
doi:10.1109/24.229492
Tanenbaum, A. S., & Steen, M. V. (2008). Distributed
Systems: Principles and Paradigms. Upper Saddle River,
NJ: Prentice Hall.
Tang, F. L., Li, M. L., & Huang, Z. X. (2004). Real-time
transaction processing for autonomic Grid applications.
Engineering Applications of Artificial Intelligence, 17(7),
799-807. doi:10.1016/S0952-1976(04)00122-8
Tang, M., Lee, B., Tang, X., & Yeo, C. K. (2005). Combining data replication algorithms and job scheduling
heuristics in the data grid. In Proceedings of European
conference on parallel computing (pp. 381-390).
Tang, M., Lee, B., Yeo, C., & Tang, X. (2005). Dynamic
replication algorithms for the multi-tier data grid. Future Generation Computer Systems, 21(5), 775-790.
doi:10.1016/j.future.2004.08.001


Tang, M., Lee, B., Yeo, C., & Tang, X. (2006). The impact
of data replication on job scheduling performance in the
data grid. Future Generation Computer Systems, 22(3),
254-268. doi:10.1016/j.future.2005.08.004

Teo, Y. M., & Mihailescu, M. (2008). Collision avoidance in hierarchical peer-to-peer systems. In Proceedings of 7th Intl. Conf. on Networking (pp. 336-341). New York: IEEE Computer Society Press.

Tang, X., & Xu, J. (2005). QoS-aware replica placement for content distribution. IEEE Transactions on Parallel and Distributed Systems, 16(10), 921-932. doi:10.1109/TPDS.2005.126

Terzis, A., Wang, L., Ogawa, J., & Zhang, L. (1999, December). A two-tier resource management model for the Internet. Global Internet, (pp. 1808-1817).

Tanin, E., Harwood, A., & Samet, H. (2007). Using a distributed quadtree index in peer-to-peer networks. The VLDB Journal, 16(2), 165-178. Heidelberg, Germany: SpringerLink. doi:10.1007/s00778-005-0001-y
Taubman, D., & Marcellin, M. (2001). JPEG2000: Image
Compression Fundamentals, Standards and Practice.
Berlin: Springer.
Taufer, M., Anderson, D., Cicotti, P., & Brooks, C. L., III.
(2005). Homogeneous redundancy: a technique to
ensure integrity of molecular simulation results using
public computing. In Proceedings of The International
Heterogeneity In Computing Workshop.
TAVERNA. (2008). The Taverna Workbench 1.7. Retrieved from http://taverna.sourceforge.net/
Taylor, I., Shields, M., Wang, I., & Philp, R. (2003).
Distributed p2p computing within triana: A galaxy
visualization test case. In International Parallel and
Distributed Processing Symposium (IPDPS03). Nice,
France: IEEE Computer Society Press.

Thain, D., & Livny, M. (2004). Building reliable clients and services. In The Grid 2 (pp. 285-318). San Francisco: Morgan Kaufman.
Thain, D., Tannenbaum, T., & Livny, M. (2002). Condor
and the grid. John Wiley & Sons Inc.
Thatte, S. (2003). BPEL4WS, business process execution
language for web services. Retrieved June 15th, 2008,
from http://xml.coverpages.org/ni2003-04-16-a.html
The EU Data Grid Project (n.d.). Retrieved from http://
www.eu-datagrid.org/.
The Globus Alliance (n.d.). Retrieved from http://www.
globus.org/
The seti@home project. Retrieved from http://setiathome.
ssl.berkeley.edu/
The TrustCoM Project. (2005). Retrieved from http://
www.eu-trustcom.com.
Theimer, M. M., Lantz, K. A., & Cheriton, D. R.
(1985). Preemptable remote execution facilities for
the v-system. SIGOPS Oper. Syst. Rev., 19(5), 2-12.
doi:10.1145/323627.323629

Taylor, M. B., Lee, W., Miller, J., Wentzlaff, D., Bratt, I., Greenwald, B., et al. (2004). Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. Proceedings of 31st Annual International Symposium on Computer Architecture, (pp. 2-13).

Theiner, D., & Rutschmann, P. (2005). An inverse modelling approach for the estimation of hydrological model parameters. Journal of Hydroinformatics.

Tendler, J. M., Dodson, J. S. Jr, Fields, J. S., Le, H., & Sinharoy, B. (2002). Power4 system microarchitecture. IBM
Journal of Research and Development, 46(1), 5-25.

Thistle, M. R., & Smith, B. J. (1988). A processor architecture for Horizon. In Proceedings of the 1988 ACM/IEEE
conference on Supercomputing (SC88), Orlando, FL, (pp.
35-41). New York: IEEE Computer Society Press.


Thomasian, A. (1997). A performance comparison of locking methods with limited wait depth. IEEE Transactions on Knowledge and Data Engineering, 9(3), 421-434. doi:10.1109/69.599931
Thornton, J. E. (1970). Design of a computer - the Control
Data 6600. Upper Saddle River, NJ: Scott Foresman
& Co.
Thulasiraman, P., Khokhar, A., Heber, G., & Gao, G. (2004,
Jan.). A fine-grain load adaptive algorithm of the 2d discrete wavelet transform for multithreaded architectures.
Journal of Parallel and Distributed Computing [JPDC], 64(1), 68-78. doi:10.1016/j.jpdc.2003.06.003
Thulasiraman, P., Theobald, K. B., Khokhar, A. A., &
Gao, G. R. (2000, July). Multithreaded algorithms for
the fast Fourier transform. In Acm symposium on parallel algorithms and architectures Winnipeg, Canada,
(p. 176-185).
Tian, D., & Xiang, Y. (2008). A multi-core supported
intrusion detection system. Proceedings of IFIP International Conference on Network and Parallel Computing.
Tian, R., Xiong, Y., Zhang, Q., Li, B., Zhao, B. Y., & Li,
X. (2005). Hybrid Overlay Structure Based on Random
Walks. In Proceedings of the 4th Intl. Workshop on
Peer-to-Peer Systems (pp. 152-162). Berlin: Springer-Verlag.

Trelles, Andrade, Valencia, Zapata, & Carazo (1998, June). Computational space reduction and parallelization of a new clustering approach for large groups of sequences. Bioinformatics (Oxford, England), 14(5), 439-451. doi:10.1093/bioinformatics/14.5.439
TRIANA. (2003). The Triana Project. Retrieved from
www.trianacode.org/
Tripp, G. (2006). A parallel string matching engine for
use in high speed network intrusion detection systems.
Journal in Computer Virology, 2(1), 21-34. doi:10.1007/
s11416-006-0010-4
Tsaregorodtsev, A., Garonne, V., & Stokes-Rees, I.
(2004). Dirac: A scalable lightweight architecture for high
throughput computing. In Fifth IEEE/ACM International
Workshop On Grid Computing (Grid04).
Tseng, Y-C., Shen, C-C. & Chen, W-T. (2003). Integrating Mobile IP with ad hoc networks. IEEE Computer,
May, 48-55.
Tsouloupas, G., & Dikaiakos, M. D. (2007). GridBench:
A tool for the interactive performance exploration of
Grid infrastructures. Journal of Parallel and Distributed Computing, 67(9), 1029-1045. doi:10.1016/j.
jpdc.2007.04.009
Tsoumakos, D., & Rousseopoulos, N. (2006). Analysis
and comparison of p2p search methods. In Proceedings
of the 1st International Conference on Scalable Information Systems (INFOSCALE 2006), No. 25.

Tirado-Ramos, A., Tsouloupas, G., Dikaiakos, M. D., & Sloot, P. M. (2005). Grid resource selection by application benchmarking: A computational haemodynamics case study. International Conference on Computational Science (LNCS 3514, pp. 534-543). Atlanta, GA: Springer Verlag.

Tuck, N., & Tullsen, D. M. (2005). Multithreaded value prediction. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA'05), (pp. 5-15). San Francisco: IEEE Computer Society.

TOP500. (2007). TOP 500 Supercomputer Sites, Performance Development, November 2007. Retrieved March
10, 2008 from http://www.top500.org/lists/2007/11/
performance_development

Tullsen, D. M., & Brown, J. A. (2001). Handling long-latency loads in a simultaneous multithreading processor. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO'01), (pp. 318-327). Austin, TX: IEEE Computer Society.


Tullsen, D. M., Eggers, S. J., & Levy, H. M. (1995). Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA'95), Santa Margherita Ligure, Italy (pp. 392-403). New York: ACM Press.
Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M.,
Lo, J. L., & Stamm, R. L. (1996). Exploiting choice:
instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of
the 23rd Annual International Symposium on Computer
Architecture (ISCA'96), Philadelphia, (pp. 191-202).
New York: ACM Press.
Tullsen, D. M., Lo, J. L., Eggers, S. J., & Levy, H. M.
(1999). Supporting fine-grained synchronization on a
simultaneous multithreading processor. In Proceedings
of the 5th International Symposium on High Performance
Computer Architecture (HPCA99), Orlando, FL (pp.
54-58). New York: IEEE Computer Society.
Turner, D., & Chen, X. (2002). Protocol-dependent
message-passing performance on linux clusters. Proceedings of the 2002 IEEE International Conference on
Linux Clusters, pp. 187-194. New York: IEEE Computer
Society.
TV-Anytime. (2005). TV-Anytime. Retrieved June 15th,
2008, from http://www.tv-anytime.org
UDDI. (2004). UDDI Version 3.0.2. Retrieved June 15th,
2008, from http://www.Oasis-Open.org/committees/
uddi-spec/doc/spec/v3/uddi-v3.0.2-20041019.Htm
Uhlig, S., Bonaventure, O., & Quoitin, B. (2003). Internet
traffic engineering with minimal BGP configuration. 18th
International Teletraffic Congress.
Unicore (n.d.). Retrieved from http://unicore.sourceforge.net
UNICORE. (2008). UNiform Interface to COmputing
Resources. Retrieved from www.unicore.eu/
Ururahy, C., & Rodriguez, N. (2004). Programming and
coordinating grid environments and applications. In Concurrency and computation: Practice and experience.

Vakali, A., & Pallis, G. (2003). Content Delivery Networks: Status and Trends. IEEE Internet Computing, 7(6), 68-74. doi:10.1109/MIC.2003.1250586
Valkovskii, V. A., & Malyshkin, V. E. (1988). Synthesis
of parallel programs and systems on the basis of computational models. Novosibirsk, Russia: Nauka.
van der Houwen, P. J., & Messina, E. (1999). Parallel
Adams methods. Journal of Computational and Applied Mathematics, 101(1-2), 153-165. doi:10.1016/S0377-0427(98)00214-3
van der Houwen, P. J., & Sommeijer, B. P. (1991). Iterated Runge-Kutta methods on parallel computers. SIAM
Journal on Scientific and Statistical Computing, 12(5),
1000-1028. doi:10.1137/0912054
Van der Wijngaart, R. F., & Frumkin, M. A. (2004).
Evaluating the information power Grid using the
NAS Grid benchmarks. International Parallel and
Distributed Processing Symposium. Santa Fe, NM:
IEEE Computer Society Press.
van der Wijngaart, R. F., & Jin, H. (2003). The NAS parallel benchmarks, multi-zone versions (No. NAS-03-010).
NASA Ames Research Center, Sunnydale, CA.
van Reeuwijk, C., Kuijlman, F., & Sips, H. J. (2003).
Spar: A set of extensions to Java for scientific computation. Concurrency and Computation, 15, 277-299.
doi:10.1002/cpe.659
Vanneschi, M. (2002). The programming model of ASSIST, an environment for parallel and distributed portable
applications. Parallel Computing, 28(12), 1709-1732.
doi:10.1016/S0167-8191(02)00188-6
Vanneschi, M., & Veraldi, L. (2007). Dynamicity in
distributed applications: Issues, problems and the ASSIST approach. Parallel Computing, 33(12), 822845.
doi:10.1016/j.parco.2007.08.001
Vazhkudai, S. (2003, Nov). Enabling the co-allocation of
grid data transfers. In Proceedings of the fourth international workshop on grid computing (pp. 41-51).


Vazhkudai, S., Ma, X., Freeh, V., Strickland, J., Tammineedi, N., & Scott, S. (2005). FreeLoader: Scavenging desktop storage resources for scientific data. In Proceedings of Supercomputing 2005 (SC'05), Seattle, WA.

Venugopal, S., Nadiminti, K., Gibbins, H., & Buyya, R. (2008). Designing a resource broker for heterogeneous Grids. Software, Practice & Experience, 38(8), 793-825. doi:10.1002/spe.849

Vazhkudai, S., Syed, J., & Maginnis, T. (2002). PODOS - The design and implementation of a performance oriented Linux cluster. Future Generation Computer Systems, 18(3), 335-352. doi:10.1016/S0167-739X(01)00055-3

Villa, O., Scarpazza, D. P., & Petrini, F. (2008). Accelerating real-time string searching with multicore processors.
IEEE Computer, 41(4), 42-50.

Vazhkudai, S., Tuecke, S., & Foster, I. (2001). Replica selection in the Globus data grid. In Proceedings of the first IEEE/ACM international conference on cluster computing and the grid (CCGRID 2001) (pp. 106-113).
Vázquez-Poletti, J. L., Huedo, E., Montero, R. S., & Llorente, I. M. (2007). A comparison between two grid scheduling philosophies: EGEE WMS and GridWay. Multiagent and Grid Systems, 3(4), 429-439.
Vecchiola, C., & Chu, X. (2008). Aneka tutorial series on
developing task model applications. (Technical Report).
Grid Computing and Distributed Systems Laboratory,
The University of Melbourne, Australia.
Veldema, R., Hofman, R. F. H., Bhoedjang, R., & Bal, H.
E. (2001). Runtime optimizations for a Java DSM implementation. Paper presented at the Proceedings of the 2001
joint ACM-ISCOPE conference on Java Grande.
Venugopal, S., & Buyya, R. (2005, Oct). A deadline and
budget constrained scheduling algorithm for escience
applications on data grids. In Proceedings of the 6th
international conference on algorithms and architectures
for parallel processing (ICA3PP-2005) (pp. 60-72).
Venugopal, S., Buyya, R., & Ramamohanarao, K. (2006).
A taxonomy of data grids for distributed data sharing,
management, and processing. ACM Computing Surveys,
1, 153.
Venugopal, S., Buyya, R., & Winton, L. (2004). A grid
service broker for scheduling distributed data-oriented
applications on global grids. Proceedings of the 2nd
workshop on Middleware for grid computing, Toronto,
Canada, (pp. 75-80). Retrieved from www.gridbus.org/broker


VMware Inc. (1999). VMware virtual platform.


Voss, M. J., & Eigenmann, R. (2000). ADAPT: Automated
De-coupled Adaptive Program Transformation. International Conference on Parallel Processing, Toronto,
Canada, (pp. 163).
Vshivkov, V. A., Nikitin, S. A., & Snytnikov, V. N.
(2003). Studying instability of collisionless systems on
stochastic trajectories. JETP Letters, 78(6), 358-362.
doi:10.1134/1.1630127
Vuduc, R., Demmel, J., & Bilmes, J. A. (2004). Statistical Models for Empirical Search-Based Performance Tuning. International Journal of High Performance Computing Applications, 18(1), 65-94.
doi:10.1177/1094342004041293
Vydyanathan, N., Krishnamoorthy, S., Sabin, G., Çatalyürek, Ü. V., Kurç, T. M., Sadayappan, P., et al. (2006). An integrated approach for processor allocation and scheduling of mixed-parallel applications. In Proceedings of the 2006 International Conference on Parallel Processing (ICPP'06) (pp. 443-450). New York: IEEE.
Vydyanathan, N., Krishnamoorthy, S., Sabin, G., Çatalyürek, Ü. V., Kurç, T. M., Sadayappan, P., et al. (2006).
Locality conscious processor allocation and scheduling
for mixed parallel applications. In Proceedings of the
2006 IEEE International Conference on Cluster Computing, September 25-28, 2006, Barcelona, Spain. New
York: IEEE.
Wachter, H., & Reuter, A. (Eds.). (1992). Contracts:
A means for Extending Control Beyond Transaction
Boundaries. Advanced Transaction Models for New Applications. San Francisco: Morgan Kaufmann.


Waldburger, M., & Stiller, B. (2006). Toward the mobile grid: Service provisioning in a mobile dynamic virtual organization. In Proceedings of the IEEE International Conference on Computer Systems and Applications, 2006, (pp. 579-583).

Wang, J., Zeng, Q.-A., & Agrawal, D. P. (2003). Performance analysis of a preemptive and priority reservation
handoff scheme for integrated service-based wireless
mobile networks. IEEE Transactions on Mobile Computing, 2(1), 6575. doi:10.1109/TMC.2003.1195152

Waldvogel, M., & Rinaldi, R. (2002). Efficient topology-aware overlay network. In Proc. of HotNets-I.

Wang, T., Vonk, J., Kratz, B., & Grefen, P. (2008). A survey
on the history of transaction management: from flat to
grid transactions. Distributed and Parallel Databases,
23(3), 235-270. doi:10.1007/s10619-008-7028-1

Walker, B., Popek, G., English, R., Kline, C., & Thiel, G.
(1992). The locus distributed operating system. Distributed
Computing Systems: Concepts and Structures, 17(5).
Walker, D. W. (1990). Characterising the parallel
performance of a large-scale, particle-in-cell plasma
simulation code. International Journal on Concurrency:
Practice and Experience, 2(4), 257-288. doi:10.1002/
cpe.4330020402
Wall, D. W. (1991). Limits of instruction-level parallelism. In Proceedings of the 4th International Conference
on Architectural Support for Programming Languages
and Operating Systems, Santa Clara, CA (ASPLOS-IV),
(pp. 176-188). New York: ACM Press.
Wang, C., Hsu, C., Chen, H., & Wu, J. (2006). Efficient
multi-source data transfer in data grids. In Proceedings
of the sixth IEEE international symposium on cluster
computing and the grid (CCGRID'06) (pp. 421-424).
Wang, C., Xiao, L., Liu, Y., & Zheng, P. (2004). Distributed
caching and adaptive search in multilayer P2P networks.
In International Conference on Distributed Computing
Systems (ICDCS04) (pp. 219-226).
Wang, H., Katz, R., & Giese, J. (1999). Policy-enabled
handoffs across heterogeneous wireless networks. Mobile Computing Systems and Applications (PWMCSA),
(pp. 51-60).
Wang, H., Liu, P., & Wu, J. (2006). A QoS-aware heuristic algorithm for replica placement. Journal of Grid
Computing, 96-103.

Wang, Y., Scardaci, D., Yan, B., & Huang, Y. (2007). Interconnect EGEE and CNGRID e-infrastructures through
interoperability between gLite and GOS middlewares.
In International Grid Interoperability and Interoperation Workshop (IGIIW 2007) with e-Science 2007 (pp.
553-560). Bangalore, India: IEEE Computer Society.
Wang, Z., Yu, B., Chen, Q., & Gao, C. (2005). Wireless
grid computing over mobile ad-hoc networks with mobile
agent. In Skg 05: Proceedings of the first international
conference on semantics, knowledge and grid (p. 113).
Washington, DC: IEEE Computer Society.
Wasson, G., & Humphrey, M. (2003). Policy and enforcement in virtual organizations. In 4th International
Workshop on Grid Computing (pp. 125-132). Washington, DC: IEEE Computer Society.
Watt, A., Lilley, C., et al. (2003). SVG Unleashed. Indianapolis, IN: SAMS.
Wei, B., Fedak, G., & Cappello, F. (2005). Scheduling
independent tasks sharing large data distributed with
BitTorrent. In The 6th IEEE/ACM International Workshop
On Grid Computing, 2005, Seattle, WA.
Weiser, M. (1991, February). The computer for the 21st
century. Scientific American, 265(3), 66-75.
Wesner, S., Dimitrakos, T., & Jeffrey, K. (2004, October). Akogrimo - the Grid goes mobile. ERCIM News,
(59), 32-33.


West, E. A., & Grimshaw, A. S. (1995). Braid: Integrating task and data parallelism. In Proceedings of the
Fifth Symposium on the Frontiers of Massively Parallel
Computation (Frontiers95) (p. 211). New York: IEEE
Computer Society.
Whaley, R. C., & Petitet, A. (2005). Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software, Practice & Experience, 35(2), 101-121. doi:10.1002/spe.626
AMD White Paper. (2008). The industry-changing impact of accelerated computing.
White, J. E. (1996). Telescript technology: Mobile agents.
Journal of Software Agents.
Wieczorek, M., Prodan, R., & Fahringer, T. (2005).
Scheduling of scientific workflows in the ASKALON
Grid environment. SIGMOD Record, 09.
WiFi (2008). Retrieved November 2008 from http://
www.ieee802.org/11/
Wikipedia, Gauss-Jordan elimination. Retrieved from
http://en.wikipedia.org/wiki/Gauss-Jordan_elimination
Wikipedia, Max-flow min-cut theorem. Retrieved from
http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem
Wilkinson, T. (1998). Kaffe - a clean room implementation of the Java virtual machine. Retrieved 2002, from
http://www.kaffe.org/
Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands,
P., & Yelick, K. (2006, May). The Potential of the Cell
Processor for Scientific Computing. In Computing frontiers (cf06) Ischia, Italy (pp. 920).
Winkler, P., & Zhang, L. (2003). Wavelength assignment
and generalized interval graph coloring. In Proceedings
of the ACM-SIAM Symposium on Discrete Algorithms
(SODA'03), Baltimore, MD, (pp. 830-831).
Witten, I. H., & Frank, E. (2005). Data Mining: Practical
Machine Learning Tools and Techniques (2nd Ed.). San
Francisco: Morgan Kaufmann.


Woeginger, G. J. (1997). There is no asymptotic PTAS for two-dimensional vector packing. Information Processing Letters, 64, 293-297. doi:10.1016/S0020-0190(97)00179-8
Wolski, R. (2003). Experiences with predicting resource
performance on-line in computational grid settings. ACM
SIGMETRICS Performance Evaluation Review, 30(4),
41-49. doi:10.1145/773056.773064
Wong, S.-W., & Ng, K.-W. (2006). Security support for
mobile grid services framework. In NWeSP '06: Proceedings of the international conference on next generation web services practices (pp. 75-82). Washington, DC:
IEEE Computer Society.
World Wide Web Consortium (W3C). (n.d.). Web services
activity. Retrieved from http://www.w3.org/2002/ws/
WSDL/W3C. (2005). WSDL: Web Services Description
Language (WSDL) 1.1. Retrieved June 15th, 2008, from
http://www.w3.org/TR/wsdl.
Wu, D. M., & Guan, L. (1995). A Distributed Real-Time
Image Processing System. Real-Time Imaging, 1(6),
427-435. doi:10.1006/rtim.1995.1044
Wu, G., Chu, C. W., Wine, K., Evans, J., & Frenkiel, R.
(1999). WINMAC: A novel transmission protocol for
infostations. Proceedings of the 49th IEEE Vehicular
Technology Conference (VTC), Houston, TX, (Vol. 2,
pp. 1340-1344).
Wu, Q., He, X., & Hintz, T. (2004, June 21-24). Virtual
Spiral Architecture. Paper presented at the International
Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA.
Wyckoff, P., McLaughry, S. W., Lehman, T. J., & Ford,
D. A. (1998). T Spaces. IBM Systems Journal, 37,
454-474.
Xiang, Y., & Zhou, W. (2006). Protecting information infrastructure from ddos attacks by mark-aided
distributed filtering (madf). International Journal of
High Performance Computing and Networking, 4(5/6),
357-367. doi:10.1504/IJHPCN.2006.013491


Xie, M. (1991). Software reliability modeling. Hackensack, NJ: World Scientific Publishing Company.
Xie, M., Dai, Y. S., & Poh, K. L. (2004). Computing
systems reliability: Models and analysis. New York:
Kluwer Academic Publishers.
Xu, C. (2005). Scalable and Secure Internet Services
and Architecture. Boca Raton, FL: Chapman & Hall/
CRC Press.
Xu, J. (2003). On the fundamental tradeoffs between
routing table size and network diameter in peer-to-peer
networks. In Proceedings of INFOCOM (pp. 2177-2187).
New York: IEEE Press.
Xu, M., Sabouni, A., Thulasiraman, P., Noghanian,
S., & Pistorius, S. (2007, Sept.). Image Reconstruction
using Microwave Tomography for Breast Cancer Detection on Distributed Memory Machine. In International
conference on parallel processing (ICPP), Xi'an, China, (pp. 1-8).
Xu, Y., Liu, H., & Zeng, Q.-A. (2005). Resource management and Qos control in multiple traffic wireless and
mobile Internet systems. Wiley's Journal of Wireless Communications and Mobile Computing [WCMC], 2(1), 971-982. doi:10.1002/wcm.360
Xu, Z., Mahalingam, M., & Karlsson, M. (2003). Turning
heterogeneity into an advantage in overlay routing. In
Proc. of INFOCOM.
Xu, Z., Min, R., & Hu, Y. (2003). HIERAS: A DHT based
hierarchical p2p routing algorithm. In Proceedings of the
2003 Intl. Conf. on Parallel Processing (pp. 187-194).
New York: IEEE Computer Society Press.
Xu, Z., Tang, C., & Zhang, Z. (2003). Building topology-aware overlays using global soft-state. In Proc. of
ICDCS.
Yamamoto, W., & Nemirovsky, M. (1995). Increasing
superscalar performance through multistreaming. In
Proceedings of the IFIP WG10.3 Working Conference
on Parallel Architectures and Compilation Techniques
(PACT95), (pp. 49-58). Limassol, Cyprus: IFIP Working Group on Algol.

Yamin, A., Augustin, I., Barbosa, J., da Silva, L., Real, R., & Cavalheiro, G. (2003). Towards merging context-aware, mobile and grid computing. International Journal of High Performance Computing Applications, 17(2), 191-203. doi:10.1177/1094342003017002008
Yan, C., Rogers, B., Englender, D., Solihin, Y., & Prvulovic, M. (2006). Improving cost, performance, and
security of memory encryption and authentication. In
Proceedings of 33rd Annual International Symposium
on Computer Architecture (ISCA06), (pp. 179-190).
Boston: IEEE Computer Society Press.
Yan, J., & Zhang, W. (2007). Hybrid multi-core architecture for boosting single-threaded performance. ACM
SIGARCH Computer Architecture News, 35(1), 141-148.
doi:10.1145/1241601.1241603
Yang, B., & Garcia-Molina, H. (2002). Improving search
in peer-to-peer networks. In Proceedings of the 22nd
International Conference on Distributed Computing
Systems (ICDCS02) (pp. 5).
Yang, B., & Xie, M. (2000). A study of operational
and testing reliability in software reliability analysis.
Reliability Engineering & System Safety, 70, 323-329.
doi:10.1016/S0951-8320(00)00069-7
Yang, C., Yang, I., Chen, C., & Wang, S. (2006). Implementation of a dynamic adjustment mechanism with
efficient replica selection in data grid environments. In
Proceedings of the ACM symposium on applied computing (pp. 797-804).
Yang, H. T., Wang, Z. H., & Deng, Q. H. (2008).
Scheduling optimization in coupling independent
services as a Grid transaction. Journal of Parallel and
Distributed Computing, 68(6), 840-854. doi:10.1016/j.
jpdc.2008.01.004
Yang, Y. G., Jin, H., & Li, M. L. (2004). Grid computing
in China. Journal of Grid Computing, 2(2), 193-206.
doi:10.1007/s10723-004-4201-2
Yap, T., Frieder, O., & Martino, R. (1998, March). Parallel computation in biological sequence analysis. IEEE Transactions on Parallel and Distributed Systems, 9(3), 283-294.


Yau, D. K. Y., & Lam, S. S. (1996). Adaptive Rate-Controlled Scheduling for Multimedia Applications. In
Proceedings of ACM Multimedia Conference.
Yavatkar, R., Pendarakis, D., & Guerin, R. (2000, January). A framework for policy based admission control,
(RFC 2753).
Yeager, K. C. (1996). The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2), 28-40.
doi:10.1109/40.491460
Yee, K. (1966, May). Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media. IEEE Transactions on Antennas and Propagation, AP-14(3), 302-307.
Yeh, C.-H., Parhami, B., Varvarigos, E. A., & Lee, H.
(2002, July). VLSI layout and packaging of butterfly
networks. In Acm symposium on parallel algorithms and
architectures Winnipeg, Canada (pp. 196205).
Yeo, C. S., & Buyya, R. (2007). Integrated Risk Analysis
for a Commercial Computing Service. Proceedings of
the 21st IEEE International Parallel and Distributed
Processing Symposium (IPDPS 2007, IEEE CS Press,
Los Alamitos, CA, USA).
Xie, Y., & O'Hallaron, D. (2002). Locality in search engine queries and its implications for caching. In Proceedings of the IEEE Infocom (pp. 1238-1247).
Young, J. W. (1974). A first order approximation to the
optimal checkpoint interval. Communications of the
ACM, 17(9), 530-531. doi:10.1145/361147.361115
Yu, W., & Cox, A. (1997). Java/DSM: A Platform for
Heterogeneous Computing. Concurrency (Chichester,
England), 9(11), 1213-1224. doi:10.1002/(SICI)1096-9128(199711)9:11<1213::AID-CPE333>3.0.CO;2-J
Yu, W., Mittra, R., Su, T., Liu, Y., & Yang, X. (2006).
Parallel Finite-Difference Time-Domain Method. Boston: Artech House publishers.
Zajcew, R., Roy, P., Black, D., & Peak, C. (1993). An OSF/1 UNIX for massively parallel multi-computers. Proceedings of the winter 1993 conference, (pp. 449-468).


Zander, J. (2000). Trends and challenges in resource management for future wireless networks. In Proceedings of the IEEE Wireless Communications and Networks Conference (WCNC), Chicago, (Vol. 1, pp. 159-163).
Zegura, E. Calvert, K. et al. (1996). How to model an
Internetwork. In Proc. of INFOCOM.
Zhang, G., & Parashar, M. (2003). Dynamic context-aware
access control for grid applications. In 4th international
workshop on grid computing (grid 2003), (pp. 101-108).
Phoenix, AZ: IEEE Computer Society Press. Retrieved
from citeseer.ist.psu.edu/zhang03dynamic.html
Zhang, Q., Guo, C., Guo, Z., & Zhu, W. (2003). Efficient
mobility management for vertical handoff between
WWAN and WLAN. IEEE Communications Magazine,
41(11), 102-108. doi:10.1109/MCOM.2003.1244929
Zhang, X. Y., Zhang, Q., Zhang, Z., Song, G., & Zhu,
W. (2004). A construction of locality-aware overlay
network: mOverlay and its performance. IEEE Journal
on Selected Areas in Communications, 22(1), 18-28.
doi:10.1109/JSAC.2003.818780
Zhang, X., Freschl, J. L., & Schopf, J. M. (2003, June).
A performance study of monitoring and information
services for distributed systems. In HPDC03: Proceedings of the Twelfth International Symposium on High
Performance Distributed Computing, (pp. 270-281). Los
Alamitos, CA: IEEE Computer Society Press.
Zhao, B. Y., Duan, Y., Huang, L., Joseph, A., & Kubiatowicz, J. (2003). Brocade: landmark routing on overlay
networks. In Proceedings of the 2nd Intl. Workshop on
Peer-to-Peer Systems (pp. 34-44). Berlin: Springer-Verlag.
Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (2001).
Tapestry: An infrastructure for fault-tolerant wide-area
location and routing (Tech. Rep. UCB/CSD-01-1141).
University of California at Berkeley, Berkeley, CA.
Zhao, S., & Lo, V. (2001, May). Result Verification and
Trust-based Scheduling in Open Peer-to-Peer Cycle Sharing Systems. In Proceedings of IEEE Fifth International Conference on Peer-to-Peer Systems.


Zhou, D., & Lo, V. M. (2006). WaveGrid: A scalable fast-turnaround heterogeneous peer-based desktop grid system. In IPDPS.
Zhou, J., Ou, Z., Rautiainen, M., & Ylianttila, M. (2008b).
P2P SCCM: Service-oriented Community Coordinated
Multimedia over P2P. In Proceedings of 2008 IEEE International Conference on Web Services, Beijing, China,
September 23-26, (pp. 34-40).
Zhou, J., Rautiainen, M., & Ylianttila, M. (2008a). Community coordinated multimedia: Converging content-driven and service-driven models. In Proceedings of
2008 IEEE International Conference on Multimedia &
Expo, June 23-26, 2008, Hannover, Germany.
Zhou, S., Zheng, X., Wang, J., & Delisle, P. (1993). Utopia:
a load sharing facility for large, heterogeneous distributed
computer systems. Software, Practice & Experience,
23(12), 1305-1336. doi:10.1002/spe.4380231203
Zhou, X., Kim, E., Kim, J. W., & Yeom, H. Y. (2006).
ReCon: A fast and reliable replica retrieval service for the
data grid. In Proceedings of IEEE international symposium on cluster computing and the grid (pp. 446-453).
Zhou, Y., Iftode, L., & Li, K. (1996). Performance
evaluation of two home-based lazy release consistency
protocols for shared virtual memory systems. SIGOPS
Oper. Syst. Rev., 30(SI), 75-88.

Zhu, W., Wang, C. L., & Lau, F. C. M. (2002). JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support. Paper presented at the Proceedings of the IEEE International Conference on Cluster Computing.
Zhu, Y., & Hu, Y. (2005). Efficient, proximity-aware load balancing for DHT-based P2P systems. IEEE Transactions on Parallel and Distributed Systems [TPDS], 16(4).
Zhu, Y., & Jiang, H. (2006). False Rate Analysis of Bloom
Filter Replicas in Distributed Systems. Paper presented
at the Proceedings of the 2006 International Conference
on Parallel Processing.
Zhu, Y., Jiang, H., & Wang, J. (2004). Hierarchical
Bloom filter arrays (HBA): a novel, scalable metadata
management system for large cluster-based storage.
Paper presented at the Proceedings of the 2004 IEEE
International Conference on Cluster Computing.
Zhu, Y., Jiang, H., Wang, J., & Xian, F. (2008). HBA:
Distributed Metadata Management for Large Cluster-Based Storage Systems. IEEE Transactions on Parallel and Distributed Systems, 19(6), 750-763. doi:10.1109/
TPDS.2007.70788
Zilka, A. (2006). Terracotta - JVM Clustering, Scalability
and Reliability for Java. Retrieved June 19, 2008, from
http://www.terracotta.org

Zhu, F., & McNair, J. (2004). Optimizations for vertical handoff decision algorithms. IEEE Wireless Communications and Network Conference (WCNC), (pp. 867-872).


About the Contributors

Kuan-Ching Li received the PhD and MS degrees in Electrical Engineering and Licenciatura in
Mathematics from the University of São Paulo, Brazil. After he received his PhD, he was a postdoctoral
scholar in the University of California Irvine (UCI) and University of Southern California (USC). His
main research interests include cluster and grid computing, parallel software design, and life science
applications. He has authored over 60 research papers and book chapters, and is a co-editor of the book entitled
"Handbook of Research on Scalable Computing Technologies" published by IGI Global and volumes of
LNCS and LNAI published by Springer. He has served as Guest Editor of a number of journal special
issues, including The Journal of Supercomputing (TJS), International Journal of Ad Hoc and Ubiquitous
Computing (IJAHUC), and International Journal of Computer Applications in Technology (IJCAT). In
addition, he has served on the steering, organizing, and program committees of several conferences
and workshops, including Conference co-chair of CSE'2008 (Sao Paulo, Brazil) and Program co-chair
of APSCC'2008 (Yilan, Taiwan), AINA'2008 (Okinawa, Japan). He is a senior member of the IEEE.
Ching-Hsien Hsu received the B.S. and Ph.D. degrees in Computer Science from Tung Hai University and Feng Chia University, Taiwan, in 1995 and 1999, respectively. He is currently an associate
professor of the department of Computer Science and Information Engineering at Chung Hua University,
Taiwan. Dr. Hsu's research interest is primarily in parallel and distributed computing, grid computing,
P2P computing, RFID and services computing. Dr. Hsu has published more than 80 academic papers
in journals, books and conference proceedings. He was named an annual outstanding researcher by Chung Hua University in 2005, 2006 and 2007, and received the excellent research award in 2008. He
is serving in a number of journal editorial boards, including International Journal of Communication
Systems, International Journal of Computer Science, International Journal of Grid and High Performance Computing, International Journal of Smart Home and International Journal of Multimedia and
Ubiquitous Engineering.
Laurence T. Yang is a professor in the Department of Computer Science at St. Francis Xavier University, Canada. His research includes high performance computing and networking, embedded systems,
ubiquitous/pervasive computing and intelligence. He has published around 300 papers (including around
80+ international journal papers such as IEEE and ACM Transactions) in refereed journals, conference
proceedings and book chapters in these areas. He has been involved in more than 100 conferences and
workshops as a program/general/steering conference chair and in more than 300 conferences and workshops as a program committee member. He served as the vice-chair of the IEEE Technical Committee on Supercomputing Applications (TCSA) until 2004, and is currently the chair of the IEEE Technical Committee on Scalable Computing (TCSC) and the chair of the IEEE Task Force on Ubiquitous Computing and Intelligence. In addition, he is the editor-in-chief of several international journals and a few book series. He is serving
as an editor for numerous international journals. He has been acting as an author/co-author or an editor/
co-editor of 25 books from Kluwer, Springer, IGI, Nova Science, American Scientific Publishers and
John Wiley & Sons. He has won 5 Best Paper Awards (including at the IEEE 20th International Conference on Advanced Information Networking and Applications, AINA-06), 1 Best Paper Nomination in 2007, a Distinguished Achievement Award in 2005, and a Canada Foundation for Innovation Award in 2003.
Jack Dongarra holds an appointment at the University of Tennessee, the title of Distinguished Research Staff at Oak Ridge National Laboratory (ORNL), and a Turing Fellowship at the University
of Manchester. He was awarded the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches and in 2008 he was the recipient
of the IEEE Medal of Excellence in Scalable Computing. He is a Fellow of the AAAS, ACM, and the
IEEE and a member of the National Academy of Engineering.
Hans P. Zima is a Principal Scientist at the Jet Propulsion Laboratory, California Institute of Technology, and a Professor Emeritus of the University of Vienna, Austria. He received his Ph.D. degree
in Mathematics and Astronomy from the University of Vienna in 1964. His major research interests
have been in the fields of high-level programming languages, compilers, and advanced software tools.
In the early 1970s, while working in industry, he designed and implemented one of the first high-level
real-time languages for the German Air Traffic Control Agency. During his tenure as a Professor of
Computer Science at the University of Bonn, Germany, he contributed to the German supercomputer
project "SUPRENUM", leading the design of the first Fortran-based compilation system for distributedmemory architectures (1989). After his move to the University of Vienna, he became the chief designer
of the Vienna Fortran language (1992) that provided a major input for the High Performance Fortran
de-facto standard. From 1997 to 2007, Dr. Zima headed the Priority Research Program "Aurora", a ten-year program funded by the Austrian Science Foundation. His research over the past years focused on
the design of the "Chapel" programming language in the framework of the DARPA-sponsored HPCS
project "Cascade". More recently, Dr. Zima has become involved in the design of space-borne faulttolerant high capability computing systems. Dr. Zima is the author or co-author of about 200 publications, including 4 books.
***
David Allenotor received his B.Sc. and M.Sc. degrees in Computer Science from the University of Benin in 1996 and 2000, respectively, and an M.Sc. in Computer Science from the University of Manitoba, Canada, in 2005. At present he is a Ph.D. candidate in the Department of Computer Science, University of Manitoba, Canada, and a member of the Computational Finance Derivatives Lab (CFD). His research interests include Grid computing, Cloud computing, applications of fuzzy logic to computational finance derivatives, and the modeling of financial engineering problems.
Jörn Altmann is Associate Professor at the International University of Bruchsal, Germany, where he heads the Computer Networks and Distributed Systems group. Dr. Altmann received his B.Sc. degree, his M.Sc. degree (1993), and his Ph.D. (1996) from the University of Erlangen-Nürnberg, Germany. Dr. Altmann's current research centers on the economics of Internet services and Internet infrastructures, integrating economic models into distributed systems. In particular, he focuses on capacity planning, network topologies, and resource allocation.
Carlos Eduardo Rodrigues Alves is an Associate Professor in the Computer Science Department of São Judas Tadeu University. He obtained his Ph.D. in Computer Science at the Institute of Mathematics and Statistics of the University of São Paulo in 2002. He is a graduate of the Instituto Tecnológico de Aeronáutica, where he completed both his undergraduate course in Electronics Engineering and his M.Sc. degree in Electrical and Computer Engineering. His research interests are the design of efficient sequential and parallel algorithms.
Marcos Dias de Assunção is a PhD candidate at the University of Melbourne, Australia. His PhD thesis is on peering and resource allocation across Grids. He previously obtained a master's degree in network management at the Federal University of Santa Catarina, Brazil. His current topics of interest include Grid scheduling, virtual machines, and network virtualisation.
Alan A. Bertossi received the Laurea degree in Computer Science from the University of Pisa (Italy) in 1979. Currently, he is a Professor of Computer Science at the Department of Computer Science of the University of Bologna (Italy). His main research interests are the design and analysis of algorithms for high-performance, parallel, distributed, wireless, fault-tolerant, and real-time systems. He has published 45 refereed papers in international archival journals, as well as several other papers in conference proceedings, book chapters, and encyclopedias. He has served as a guest coeditor for special issues of international journals, mainly on algorithms for wireless networks. Since 2000, he has been on the editorial board of Information Processing Letters.
Rajkumar Buyya is an Associate Professor and Reader of Computer Science and Software Engineering, and Director of the Grid Computing and Distributed Systems (GRIDS) Laboratory at the University of Melbourne, Australia. Dr. Buyya has authored/co-authored over 250 publications. He has co-authored three books: Microprocessor x86 Programming (BPB Press, New Delhi, 1995), Mastering C++ (Tata McGraw Hill Press, New Delhi, 1997), and Design of PARAS Microkernel. The books on emerging topics that he has edited include High Performance Cluster Computing (Prentice Hall, USA, 1999), High Performance Mass Storage and Parallel I/O (IEEE and Wiley Press, USA, 2001), Content Delivery Networks (Springer, Germany, 2008), and Market Oriented Grid and Utility Computing (Wiley Press, USA, 2009).
Edson Norberto Cáceres is a Professor in the Department of Computer Science and Statistics of the Federal University of Mato Grosso do Sul, where he is a former Pro-Rector of Undergraduate Studies. He holds a PhD in Computer Science obtained at the Federal University of Rio de Janeiro in 1992. His research interests include the design of parallel algorithms, especially graph algorithms. He is the Director of Education of the Brazilian Computer Society. In addition to belonging to the Federal University of Mato Grosso do Sul since the early eighties, he is currently also with the Brazilian Ministry of Education as the General Coordinator of Student Relations.
Franck Cappello holds a Senior Researcher position at INRIA. He leads the Grand-Large project at INRIA, focusing on high-performance issues in large-scale distributed systems. He initiated the XtremWeb (Desktop Grid) and MPICH-V (fault-tolerant MPI) projects. He was the director of the Grid5000 project from its beginning until 2008 and is the scientific director of ALADDIN/Grid5000, the new four-year INRIA project aiming to sustain the Grid5000 infrastructure. He has contributed to more than 50 program committees. He is an editorial board member of the International Journal on Grid Computing, the Journal of Grid and Utility Computing, and the Journal of Cluster Computing. He is a steering committee member of IEEE HPDC and IEEE/ACM CCGRID. He is the Program co-Chair of IEEE CCGRID'2009 and System Software area co-chair of SC'2009, and was the General Chair of IEEE HPDC'2006.
Jih-Sheng Chang received his B.E. degree from the Department of Computer Science and Information Engineering, I-Shou University, Kaohsiung, Taiwan in 2002 and his M.S. degree from the Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien, Taiwan in 2004. He is currently a Ph.D. candidate in the Department of Computer Science and Information Engineering at National Dong Hwa University. His academic research interests focus on wireless network technology and grid computing.
Ruay-Shiung Chang received his B.S.E.E. degree from National Taiwan University in 1980 and his
Ph.D. degree in Computer Science from National Tsing Hua University in 1988. He is now a professor
in the Department of Computer Science and Information Engineering, National Dong Hwa University.
His research interests include Internet, wireless networks, and grid computing. Dr. Chang is a member of ACM, a senior member of IEEE, and a founding member of ROC Institute of Information and
Computing Machinery. Dr. Chang also served on the advisory council for the Public Interest Registry
(www.pir.org) from 2004/5 to 2007/4.
Jinjun Chen received his Ph.D. degree in Computer Science and Software Engineering from Swinburne University of Technology, Melbourne, Australia, in 2007. He is currently a Lecturer in the Centre for Complex Systems and Services in the Faculty of Information and Communication Technologies at
Swinburne University of Technology, Melbourne, Australia. His research interests include: Scientific
Workflow Management and Applications, Workflow Management and Applications in Web Service
or SOC Environments, Workflow Management and Applications in Grid (Service)/Cloud Computing
Environments, Software Verification and Validation in Workflow Systems, QoS and Resource Scheduling in Distributed Computing Systems such as Cloud Computing, Service Oriented Computing (SLA,
Negotiation, Engineering, Composition), Semantics and Knowledge Management, Cloud Computing.
Zizhong Chen received a B.S. degree in mathematics from Beijing Normal University, P. R. China,
in 1997, and M.S. and Ph.D. degrees in computer science from the University of Tennessee, Knoxville,
in 2003 and 2006, respectively. He is currently an assistant professor of computer science at Colorado
School of Mines. His research interests include high performance computing, parallel, distributed, and
grid computing, fault tolerance and reliability, numerical algorithms and software, and computational
science and engineering. The goal of his research is to develop techniques, design algorithms, and build
software tools for computational science applications to achieve both high performance and high reliability on a wide range of computational platforms.


Shang-Feng Chiang was a graduate student at the Department of Electrical Engineering, National Taiwan University.
Kuo Chiang and Ruo-Jian Yu were research assistants at the Department of Electrical Engineering, National Taiwan University.
Kenneth Chiu is an assistant professor at SUNY Binghamton. His interests are in the areas of
scientific data management, web services, and grid computing. He has served as program co-chair for
IEEE e-Science 2007, and as workshops chair for e-Science 2008. He is involved in a number of multidisciplinary projects with domain scientists, and he is PI or co-PI on five research/education awards
from the NSF or DOE, five of which are still active. He received his A.B. from Princeton University
and his Ph.D. from Indiana University, both in computer science.
Yuanshun Dai is currently an assistant professor with the Department of Electrical Engineering and
Computer Science and the Department of Industrial and Information Engineering at the University of
Tennessee, Knoxville. He was the program chair of the 12th IEEE Pacific Rim Symposium on Dependable Computing (PRDC 06). He was also the general chair of the IEEE Symposium on Dependable
Autonomic and Secure Computing (DASC) in 2005, 2006, 2007. He was an Associate Editor of IEEE
Transactions on Reliability. His research interests are dependability, security, grid computing, and
autonomic computing. He published over 60 papers and 5 books.
F. Dehne received an M.C.S. degree (Dipl. Inform.) from the RWTH Aachen University, Germany in 1983 and a Ph.D. (Dr. rer. nat.) from the University of Würzburg, Germany in 1986. In 1986 he joined the School of Computer Science at Carleton University in Ottawa, Canada as an Assistant Professor. He was appointed Associate Professor and Professor of Computer Science in 1990 and 1997, respectively. From 2000 to 2003 and 2006 to 2008 he served as Director of the School of Computer Science. His current research interests are in the areas of Parallel Computing, Coarse Grained Parallel Algorithms, Parallel Computational Geometry, Parallel Data Warehousing & OLAP, and Parallel Bioinformatics. He is a Senior Member of the IEEE, a member of the ACM Symposium on Parallel Algorithms and Architectures Steering Committee, and former Vice-Chair of the IEEE Technical Committee on Parallel Processing. He is an editorial board member for IEEE Transactions on Computers, Information Processing Letters, the Journal of Bioinformatics Research and Applications, and the International Journal of Data Warehousing and Mining.
Evgueni Dodonov is currently finishing his PhD research at the University of São Paulo, Brazil. He completed his master's degree at the Federal University of São Carlos in 2004, and worked in the computer industry for several years. His research interests include autonomic computing, file systems, process behavior evaluation and distributed programming.
Daniel C. Doolan is a lecturer in the School of Computing, Robert Gordon University, Scotland.
His main research interest is in Mobile and Parallel Computing. He has published over 40 articles in
the areas of mobile multimedia and parallel computation.


Dou Wanchun received his Ph.D. degree in Mechanical and Electronic Engineering from Nanjing University of Science and Technology, Nanjing, P.R. China, in 2001. He is currently a full professor in the Department of Computer Science and Technology at Nanjing University, Nanjing, P.R. China. His research interests include: Scientific Workflow Management and Applications, Workflow Management and Applications in Web Service, and QoS and Resource Scheduling in Distributed Computing Systems.
Jörg Dümmler received his Master degree in Computer Science from the Chemnitz University of Technology in 2004 and has been pursuing doctoral research since then. His research interests include scheduling and mapping of mixed parallel applications, parallel programming models for distributed memory platforms, and transformation tools for the development of parallel applications.
M. Rasit Eskicioglu received his B.Sc. in Chemical Engineering from Istanbul Technical University,
Turkey, M.Sc. in Computer Engineering from Middle East Technical University, Turkey, and Ph.D. in
Computing Science from University of Alberta, Canada. His research interests are mainly in the systems
area, including operating systems, cluster and grid computing, high-speed network interconnects, and
mobile networks. He has investigated ways to make software DSM systems more efficient and scalable
using high-speed, programmable interconnects. Currently he is looking at wireless sensor networks and
their applications to real world problems, such as environmental monitoring. Dr. Eskicioglu is currently
an associate professor in the Computer Science Department at the University of Manitoba, Canada. He is a member of the ACM and a senior member of the IEEE.
Thomas Fahringer received his Ph.D. in 1993 from the Vienna University of Technology. Between 1990 and 1998, Fahringer worked as an Assistant Professor at the University of Vienna, where he was promoted to Associate Professor in 1998. Since 2003, Fahringer has been a Full Professor in Computer Science at the Institute of Computer Science, University of Innsbruck, where he leads a research group developing the ASKALON Grid application development and computing environment. Fahringer's main research interests include software architectures, programming paradigms, compiler technology, performance analysis, and prediction for parallel and distributed Grid systems. Fahringer is currently coordinating the IST-034601 edutain@grid project and is involved in numerous Austrian (SFB Aurora, Austrian Grid) and European Grid (EGEE, CoreGrid, K-Wf Grid, ASG) projects. He is the author of over 100 papers, including two books and 20 journal articles, and has received two best paper awards (ACM and IEEE).
Tore Ferm received his Bachelor of Computer Science Degree from Sydney University in 2004. He
is currently working in the Telecommunication industry in Sydney, Australia.
Gilles Fedak received his PhD degree from University Paris-XI in 2003. He is currently a junior INRIA researcher at the LIP Laboratory. He is mainly interested in research around Desktop Grids. He has designed several Desktop Grid middleware systems, most notably XtremWeb (Desktop Grid) and BitDew (Data Management).
Edgar Gabriel is an Assistant Professor in the Department of Computer Science at the University of Houston, Texas, USA. He received his PhD and Dipl.-Ing. in mechanical engineering from the University of Stuttgart. His research interests are Message Passing Systems, High Performance Computing, Parallel Computing on Distributed Memory Machines, and Grid Computing.


Jean-Luc Gaudiot received his Diplôme d'Ingénieur from the École Supérieure d'Ingénieurs en Électrotechnique et Électronique, Paris, France in 1976 and the M.S. and Ph.D. degrees in Computer Science from the University of California, Los Angeles in 1977 and 1982, respectively. He is currently a Professor and Chair of the Electrical Engineering and Computer Science Department at the University of California, Irvine. His research interests include multithreaded architectures, fault-tolerant multiprocessors, and the implementation of reconfigurable architectures. He has published over 170 journal and conference papers. Dr. Gaudiot is a Fellow of the IEEE and AAAS.
Wolfgang Gentzsch is dissemination advisor for DEISA, the Distributed European Initiative for Supercomputing Applications, and a member of the Board of Directors of the Open Grid Forum. Before that, he was Chairman of the German D-Grid Initiative; managing director of MCNC Grid and Data Center Services in Durham; adjunct professor of computer science at Duke University; and visiting scientist at RENCI, the Renaissance Computing Institute at UNC Chapel Hill in North Carolina. At the same time, he was a member of the US President's Council of Advisors for Science and Technology. Before he joined Sun in Menlo Park, CA, in 2000, as the senior director of Grid Computing, he was the President, CEO, and CTO of the start-up companies Genias and Gridware, and a professor of mathematics and computer science at the University of Applied Sciences in Regensburg, Germany. Gentzsch studied mathematics and physics at the Technical Universities in Aachen and Darmstadt, Germany.
Lin Guan is a Lecturer in the Department of Computer Science at Loughborough University, UK.
Her research interests include performance modeling/evaluation of computer networks, Quality of
Service (QoS) analysis and enhancement, such as congestion control mechanisms with QoS constraints,
mobile computing and wireless networks.
Sudha Gunturu received her B.Tech. degree in Computer Science and Engineering from Jawaharlal Nehru Technological University, Hyderabad, India, in 2005. Currently, she is pursuing her MS degree in the Computer Science Department of Oklahoma State University, Stillwater, OK. Her research interests include bioinformatics, scheduling computational loads in parallel and distributed systems, and grid computing.
Minyi Guo received his Ph.D. degree in computer science from University of Tsukuba, Japan.
Before 2000, Dr. Guo was a research scientist at NEC Corp., Japan. He is now a full professor at the
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. His research
interests include pervasive computing, parallel and distributed processing, parallelizing compilers and
software engineering. He is a member of the ACM, IEEE, IEEE Computer Society, and IEICE.
Phalguni Gupta received the Doctoral degree from the Indian Institute of Technology Kharagpur,
India in 1986. He works in the fields of data structures, sequential algorithms, parallel algorithms, and online algorithms. From 1983 to 1987, he was in the Image Processing and Data Product Group of the Space
Applications Centre (ISRO), Ahmedabad, India and was responsible for software for correcting image
data received from Indian Remote Sensing Satellite. In 1987, he joined the Department of Computer
Science and Engineering, Indian Institute of Technology Kanpur, India. Currently he is a Professor in the
department. He is responsible for several research projects in the area of image processing, graph theory
and network flow. Dr. Gupta is a member of the Association for Computing Machinery (ACM).


Peter Graham is an Associate Professor in the Computer Science Department and Associate Dean
(Research) for the Faculty of Science at the University of Manitoba. He is also an adjunct scientist at
TRLabs, Winnipeg. His current research interests include large-scale parallel and distributed systems,
pervasive computing, and mobile computing. He is a member of the ACM, IEEE and USENIX.
Alan Grigg is a senior researcher in the Systems Engineering Research Centre at Loughborough University, UK, a BAE Systems funded position. His research interests include real-time embedded systems
design, analysis and implementation issues around scheduling, communication and reconfiguration.
Dan Grigoras is Senior Lecturer in the Department of Computer Science of National University of
Ireland, Cork, where he leads the Mobile and Cluster Computing Group. His main research interests are
in Mobile Networking and Parallel Computing, especially MANET management, middleware services,
mobile applications design, load balancing and load sharing. He has published one book, co-edited seven others, and authored 44 papers in journals and conference proceedings. He is also involved in many conferences and workshops.
Xiangjian He is an Associate Professor at the Faculty of Engineering and Information Technology,
University of Technology, Sydney, Australia. In the previous few years, he has received many research
grants including four Australian National Grants for research in the fields of computing and telecommunication. His current research interests include multi-scale computing, computer vision, and network
security and QoS. More information can be found at http://www-staff.it.uts.edu.au/~sean/
Yong J. Jang received the B.S. degree from the School of Electrical and Electronic Engineering at Yonsei University in March 2008. He is in the master's degree program at the School of Electrical and Electronic Engineering at Yonsei University. Yong's research interests include multi-processor systems and SOC design.
Hai Jiang is an Assistant Professor in the Department of Computer Science at Arkansas State University. He received his Ph. D. in Computer Science from Wayne State University, Detroit, Michigan in
December, 2003. His current research interests include Parallel Computing, Distributed Systems, High
Performance Computing and Communication, Modeling and Simulation, and System Security. He is
a member of the IEEE, the IEEE Computer Society, and the ACM. His personal web page is at http://
www.csm.astate.edu/~hjiang.
Hong Jiang received his B.Sc. in Computer Engineering from the Huazhong University of Science and Technology in 1982, his M.A.Sc. in Computer Engineering from the University of Toronto in 1987, and his PhD in Computer Science from Texas A&M University in 1991. Since 1991 he has been at the University of Nebraska-Lincoln, where he is a Professor in Computer Science and Engineering. His present research interests are computer architecture, parallel I/O, parallel/distributed computing, cluster and Grid computing, performance evaluation, real-time systems, middleware, and distributed systems for distance education. He has over 150 publications in major journals and international conferences in these areas, and his research has been supported by NSF, DOD and the State of Nebraska.


Yanqing Ji received his Ph.D. in Computer Engineering from Wayne State University, Detroit,
Michigan, in 2007. He is currently an Assistant Professor in the Department of Electrical and Computer
Engineering at Gonzaga University, Spokane, Washington. His research interests include parallel and
distributed systems, application-level thread migration/checkpointing, multi-agent systems, and their
biomedical applications. His website is at: http://barney.gonzaga.edu/~ece/yji.html.
Derrick Kondo received his PhD in Computer Science from the University of California at San
Diego, and his BS from Stanford University. Currently, he is a research scientist at INRIA Rhône-Alpes
in the MESCAL team. His interests lie in the area of volunteer computing and desktop grids. In particular, he leads research on the measurement and characterization of Internet distributed systems, their
simulation and modelling, and resource management. He founded and continues to serve as co-chair of
the Workshop on Volunteer Computing and Desktop Grids (PCGrid), and also co-chaired the BOINC
2008 workshop on volunteer computing and distributed thinking. He is serving as guest co-editor of a
special issue in 2009 of the Journal of Grid Computing on volunteer computing and desktop grids.
King Tin Lam received his B.Eng. degree in Electrical and Electronic Engineering and M.Sc. degree in Computer Science both from the University of Hong Kong in 2001 and 2006 respectively. He
worked in the IT Department of the Hongkong and Shanghai Banking Corporation for five years between the two degrees. Mr. Lam is currently a full-time Ph.D. candidate in the Department of
Computer Science at the University of Hong Kong. His research interests include distributed Java virtual
machines for cluster computing, software transactional memory and server clustering technologies.
Xiaobin Li received the B.S. degree in electrical engineering from Chongqing University, China, in 1990 and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California, Irvine, in 2001 and 2005, respectively. He is now a senior engineer in the Enterprise Microprocessor Group at Intel Corporation, where he is developing XEON microprocessors. His research interests are in fault-tolerant computing and power and thermal management.
Xiaolin Li is currently an assistant professor in the Computer Science Department at Oklahoma State University (OSU), USA, and director of the Scalable Software Systems Laboratory (S3Lab, http://s3lab.cs.okstate.edu/). He received the Ph.D. degree in Computer Engineering from Rutgers University, USA. His research interests include distributed systems, sensor networks, network security, and bioinformatics. He is on the executive committee of the IEEE Technical Committee on Scalable Computing (TCSC) and is the coordinator of Sensor Networks. He has been a TPC chair for several international conferences and workshops and is on the editorial board of three international journals. He regularly reviews NSF grant proposals as a panelist. He is a member of IEEE and ACM.
Chen Liu received his B.E. degree in Electrical Engineering from University of Science and Technology of China in 2000. He received the M.S. degree in Electrical Engineering from the University of
California, Riverside in 2002 and the Ph.D. degree in Electrical and Computer Engineering from the
University of California, Irvine in 2008. He currently works as an Assistant Professor in the Department of Electrical and Computer Engineering at Florida International University. His current research
interests are high-performance microprocessor design and multi-thread multi/many-core architecture.


Shaoshan Liu is currently a Ph.D. candidate in Computer Architecture at the University of California, Irvine. He received the B.S. degree in Computer Engineering, M.S. in Computer Engineering,
and M.S. in Biomedical Engineering in 2005, 2006, and 2007 respectively, all from the University of
California, Irvine. His research interests include high performance parallel computer systems, runtime
systems, and biomedical engineering. He has been with Intel Research as a member of the Managed
Runtime Optimization (MRO) lab, and Broadcom Corporation as a Device Verification and Test (DVT)
engineer.
Paul Malécot is a Ph.D. candidate in computer science at LRI, Paris South University (France), under the direction of Franck Cappello and Gilles Fedak. He is a member of the INRIA Grand-Large project. His research interests include the characterization of volunteer desktop computing platforms.
Victor Malyshkin received his M.S. degree in Mathematics from the State University of Tomsk (1970), his Ph.D. degree in Computer Science from the Computing Center of the Russian Academy of Sciences (1984), and his Doctor of Sciences (Dr.h.) degree from the State University of Novosibirsk (1993). From 1970 he worked in the software industry. In 1979 he joined the Computing Center RAS, where he is presently the head of the Supercomputer Software Department. He also founded the Chair of Parallel Computing Technologies at the State Technical University of Novosibirsk and the Chair of Parallel Computing at the Novosibirsk State (Classical) University. He is one of the organizers of the PaCT (Parallel Computing Technologies) series of international conferences, held in each odd year. He has published over 100 scientific papers on parallel and distributed computing, parallel program synthesis, supercomputer software and applications, and the parallel implementation of large-scale numerical models. His current research interests include parallel computing technologies; parallel programming languages and systems; methods for the parallel implementation of large-scale numerical models; dynamic load balancing; and methods, algorithms and tools for parallel program synthesis.
Verdi March is a Research Fellow at the Department of Computer Science, National University of Singapore, and a Research Scientist at the Asia-Pacific Science & Technology Center (APSTC), Sun Microsystems Inc. He completed his PhD at the Department of Computer Science, NUS, in 2007. He received his BSc in Computer Science from the University of Indonesia in 2000. Verdi is currently leading the HPC research projects in APSTC. His main research interests include the performance analysis of HPC systems, and distributed systems such as grid computing and peer-to-peer computing.
Rodrigo Fernandes de Mello is currently a faculty member at the Institute of Mathematics and
Computer Sciences, Department of Computer Science, University of São Paulo, São Carlos, Brazil. He completed his PhD degree at the University of São Paulo, São Carlos, in 2003. His research interests
include autonomic computing, load balancing, scheduling and bio-inspired computing.
Marian Mihailescu is pursuing a PhD in Computer Science at the Department of Computer Science,
National University of Singapore. He received his BSc in Computer Science in 2005 from the Polytechnic
University of Bucharest, Romania. His main research interests include grid computing, peer-to-peer
systems, resource allocation and game theory, with a focus on pricing mechanisms.


Farrukh Nadeem received his Master's degree in Computer Science from the Punjab University College of Information Technology, Lahore, Pakistan, in 2002. Currently he is a Ph.D. student at the Institute of Computer Science, University of Innsbruck, Austria, where he is working in the area of performance modeling and prediction for high performance Grid computing. Nadeem is the author of over 10 scientific papers and co-author of two book chapters.
Priyadarsi Nanda is a Lecturer at the School of Computing and Communications in the Faculty of Engineering and IT at the University of Technology Sydney (UTS), Australia. He has a wide-ranging career in teaching, research, industry and consultancy. He received his B.Eng. degree in Computer Engineering from Shivaji University, India, his M.Eng. degree in Computer and Telecommunication from the University of Wollongong, Australia, and his PhD in Computing Science from the University of Technology, Sydney, Australia, in 1990, 1996 and 2008 respectively. Details of his research and teaching are available at http://www-staff.it.uts.edu.au/~pnanda/
Doohwan Oh received the B.S. degree from Kyung Hee University in 2007. Doohwan is in the master's degree program at the School of Electrical and Electronic Engineering, Yonsei University. His research interests include ASIC design and SOC (system on a chip) development.
Zhonghong Ou received his M.Sc degree in electronic engineering from Beijing University of Posts
and Telecommunications, Beijing, in 2005. He is now pursuing his PhD degree at both Beijing University
of Posts and Telecommunications, China and University of Oulu, Finland. His current research interests
span the fields of P2PSIP systems, hierarchical P2P networks, routing algorithms, and protocols.
Manish Parashar is Professor of Electrical and Computer Engineering at Rutgers University, where he is also Director of the NSF Center for Autonomic Computing and of The Applied Software Systems Laboratory (TASSL). He received a BE degree in Electronics and Telecommunications from Bombay University, India, and MS and Ph.D. degrees in Computer Engineering from Syracuse University. His research interests include autonomic computing, parallel & distributed computing (including peer-to-peer and Grid computing), scientific computing, and software engineering. Manish has received the IBM Faculty Award (2008), the Rutgers University Board of Trustees Award for Excellence in Research (2004-2005), the NSF CAREER Award (1999), a TICAM (University of Texas at Austin) Distinguished Fellowship (1999-2001), and an Enrico Fermi Scholarship from Argonne National Laboratory (1996). He is a senior member of the IEEE/IEEE Computer Society and ACM. For more information please visit http://www.ece.rutgers.edu/~parashar/.
Jean-Marc Pierson: Since September 2006, Jean-Marc Pierson serves as a University Professor in Computer Science at the University Paul Sabatier, Toulouse 3 (France). Jean-Marc Pierson received his PhD from the ENS-Lyon, France in 1996. He was an Associate Professor at the University of Littoral Côte d'Opale (1997-2001) in Calais, then at INSA-Lyon (2001-2006). He is a member of the IRIT Laboratory. His main interests are related to large-scale distributed systems, funded by several projects in Grid and Pervasive environments, with applications in biomedical informatics. He serves on several PCs in the Grid and Pervasive computing area. His research focuses on security, cache and replica management, monitoring and, more recently, energy-aware distributed systems. For more information, please visit http://www.irit.fr/~Jean-Marc.Pierson/


M. Cristina Pinotti received the Laurea degree in Computer Science from the University of Pisa (Italy) in 1986. Currently, she is a Professor of Computer Science at the University of Perugia. She has spent visiting periods at the University of North Texas and at Old Dominion University (USA). Her research interests are the design and analysis of algorithms for wireless networks, sensor networks, parallel and distributed systems, and special purpose architectures. She has published about 50 refereed papers in international journals, conferences, and workshops. She has been a guest coeditor for special issues of international journals. She is on the editorial board of the International Journal of Parallel, Emergent and Distributed Systems.
Radu Prodan received his Master's degree in Computer Science from the Technical University of
Cluj-Napoca, Romania, in 1997. Between 1998 and 2001 he served as Research Assistant in Switzerland
at ETH Zurich, University of Basel and the Swiss Centre for Scientific Computing. In 2001 he joined
the Institute for Software Science, University of Vienna, where he earned his Ph.D. in 2004 from the
Vienna University of Technology. Prodan is currently an assistant professor at the Institute of Computer
Science, University of Innsbruck. He is interested in distributed software architectures, compiler technology, performance analysis, and scheduling for parallel and Grid computing. Prodan participated in
several national and European projects and is currently workpackage leader in the IST-034601 edutain@
grid project. He is the author of over 50 papers, including one book, over 10 journal articles, and one
IEEE best paper award.
Dang Minh Quan is a senior researcher at the School of Information Technology at the International University in Bruchsal. Dr. Quan received his Eng. degree (2001) and his M.Sc. degree (2003) from Hanoi University of Technology, Vietnam, and his Ph.D. (2006) from the University of Paderborn, Germany. Dr. Quan's current research focuses on High Performance Computing and Grid computing. In particular, he puts special focus on supporting the management of SLA-based workflows in the Grid.
Rajiv Ranjan is a postdoctoral research fellow in the GRIDS Laboratory, Department of Computer Science and Software Engineering, the University of Melbourne. Dr. Ranjan has authored/co-authored more than 15 papers, which have been published in well-reputed international conferences, journals, and edited books. His current research interests include the design, development, and implementation of algorithms, software frameworks, and middleware services for realizing autonomic Grid and Cloud computing systems. In particular, his research focuses on next-generation decentralized protocols, data indexing algorithms, and fault-tolerant scheduling heuristics for the autonomic management of applications in large-scale Grid and Cloud computing environments.
Thomas Rauber received his Master degree, his PhD degree, and the Habilitation in Computer Science from the Universität des Saarlandes (Saarbrücken) in 1986, 1990, and 1996, respectively. From 1996 to 2002, he was a professor for computer science at the Martin-Luther-University Halle-Wittenberg. He joined the University of Bayreuth in 2002, where he holds the chair for parallel and distributed systems. His research interests include parallel and distributed algorithms, programming environments for parallel and distributed systems, compiler optimizations, and performance prediction.
Mika Rautiainen is currently working as a post-doctoral researcher at the University of Oulu. He
received his M.Sc (eng.) and Dr. Tech. degrees from the Department of Electrical and Information En-

12

About the Contributors

gineering, University of Oulu, Finland, in 2001 and 2006, respectively. His research interests include
content-based multimedia management and retrieval systems, pattern recognition, and digital image
and video processing and understanding.
Ala Rezmerita is a Ph.D. student in the Cluster and Grid group of the LRI laboratory at Paris-South
University and is a member of the Grand-Large team of INRIA. She obtained a Master's in computer science in 2005 from the University of Paris 7 Denis Diderot, France. Her research interests include
parallel and distributed computing, grid middleware and Desktop Grid.
Romeo Rizzi was born in 1967. He received the Laurea degree in Electronic Engineering from
the Politecnico di Milano in 1991, and in 1997 he received a Ph.D. in Computational Mathematics and
Informatics from the University of Padova, Italy. Afterwards, he held Post-Doc and other temporary
positions at research centers like CWI (Amsterdam, Holland), BRICS (Aarhus, Denmark) and IRST
(Trento, Italy). In March 2001, he became an Assistant Professor at the University of Trento. Since 2005, he has been with the University of Udine as an Associate Professor. He is fond of combinatorial optimization
and algorithms and has a background in operations research.
Won W. Ro received the B.S. degree in Electrical Engineering from Yonsei University, Seoul, Korea, in 1996. He received the M.S. and Ph.D. degrees in Electrical Engineering from the University of Southern California in 1999 and 2004, respectively. He also worked as a research scientist in the Electrical Engineering and Computer Science Department at the University of California, Irvine. Dr. Ro has worked as an Assistant Professor in the Department of Electrical and Computer Engineering of the California State University, Northridge. He has also worked as a college intern at Apple Computer Inc. and as a contract software engineer at ARM Inc. His current research interests include high-performance microprocessor design, compiler optimization, and embedded system designs. (http://escal.yonsei.ac.kr)
Gudula Rünger received her Master degree and her PhD degree in mathematics from the University of Cologne in 1985 and 1989, respectively, and the Habilitation in Computer Science from the Universität des Saarlandes (Saarbrücken) in 1996. From 1997 to 2000, she was a professor for computer science at the University of Leipzig. Since 2000 she has been a full professor at the Chemnitz University of Technology. Her research interests include parallel applications, parallel programming languages and libraries, scientific computing, software tools for mixed programming models, as well as algorithmic and parallel adaptivity.
Haiying Shen received the BS degree in Computer Science and Engineering from Tongji University,
China in 2000, and the MS and Ph.D. degrees in Computer Engineering from Wayne State University
in 2004 and 2006, respectively. She is currently an Assistant Professor in the Department of Computer
Science and Computer Engineering of the University of Arkansas. Her research interests include distributed and parallel computer systems and networks, with an emphasis on peer-to-peer networks, wireless
networks, resource management in cluster and grid computing, and data processing. She has been a PC
member of many conferences, and a member of IEEE and ACM.
Wei Shen is currently a Ph.D. candidate at the University of Cincinnati, USA. He received his B.E.
degree from Anhui Normal University, China, in 1997, and an M.E. degree from Nanjing University of Posts and Telecommunications in 2001, both in electrical engineering. His current research interests
include resource and mobility management of wireless and mobile networks, QoS provision, and the
next generation heterogeneous wireless networks.
Mohammad Shorfuzzaman is a PhD student in the Department of Computer Science, University
of Manitoba (UofM), Canada. He received his B.Sc.Engg. (Bachelor of Science and Engineering) in
Computer Science and Engineering from Bangladesh University of Engineering and Technology, Bangladesh in 2001 and M.Sc. degree in Computer Science from University of Manitoba in 2005. Prior to
his current study in UofM, he worked as a Lecturer in Asian University of Bangladesh for one year. His
research interests include distributed systems and in particular Grid computing.
Siang Wun Song is a Professor in the Department of Computer Science, University of São Paulo, Brazil, where he is a former dean of the Institute of Mathematics and Statistics. He holds a PhD in Computer Science obtained at Carnegie Mellon University in 1981. He was on the editorial boards of Parallel Computing, and Parallel and Distributed Computing Practices. He is currently on the editorial boards of Parallel Processing Letters, Scalable Computing: Practice and Experience, and the Journal of the Brazilian Computer Society. His area of interest is the design of parallel algorithms.
Jun-Zhao Sun received his Dr. Eng. degree in computer science in 1999 from the Harbin Institute of
Technology in China. He has been a senior researcher at the Department of Electrical and Information
Engineering, University of Oulu in Finland since 2000. Since 2006 he has worked as an Academy Research Fellow for the Academy of Finland. His research interests are in mobile and pervasive computing, wireless sensor networks, context awareness, middleware, and mobility management.
Sabin Tabirca is a lecturer in the Department of Computer Science of the National University of Ireland,
Cork. His main research interest is in Mobile Multimedia with an emphasis on visualisation and graphics.
He has published more than 130 articles in the areas of HPC computing and Mobile Multimedia.
Feilong Tang received his Ph.D. degree in Computer Science and Technology from Shanghai Jiao Tong University (SJTU), China in 2005. He now works in the Department of Computer Science and
Engineering, Shanghai Jiao Tong University. His research interests focus on grid and pervasive computing, distributed transaction processing, wireless sensor networks, and distributed computing.
Yong Meng Teo is an Associate Professor with the Department of Computer Science at the National
University of Singapore, and an Associate Senior Scientist at the Asia-Pacific Science & Technology
Center, Sun Microsystems Inc. He heads the Computer Systems Research Laboratory and the Information Technology Unit. He was a Fellow of the Singapore-Massachusetts Institute of Technology Alliance
from 2002-2006. He received his MSc and PhD in Computer Science from the University of Manchester,
UK, in 1987 and 1989. His main research interest is in parallel and distributed systems covering the
organization, programming models, networking and performance of multi-core, grid and peer-to-peer
systems. Current projects include peer-to-peer networks, performance analysis of large systems, fault-tolerant consensus in distributed systems, and component-based modeling and simulation.


Parimala Thulasiraman received B.Eng. (Honors) and M.A.Sc. degrees in Computer Engineering from Concordia University, Montreal, Canada, and obtained her Ph.D. from the University of Delaware, Newark, DE, USA, after completing most of her degree requirements at McGill University, Montreal, Canada. She is now an Associate Professor with the Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada. The focus of her research is on parallel algorithms for applications such as computational biology, computational finance, medical imaging or computational medicine on advanced architectures. Over the past few years, she has been working on distributed algorithms for mobile networks using nature-inspired algorithms such as Ant Colony Optimization techniques. She has published several papers in the above areas in leading journals and conferences and has graduated many students. Parimala has organized conferences as local chair, program chair and tutorial chair. She has been serving as a reviewer and program committee member for many conferences. She has also been a reviewer for many leading journals. She is a member of the ACM and IEEE societies.
Ruppa K. Thulasiram (Tulsi) is an Associate Professor with the Department of Computer Science, University of Manitoba, Winnipeg, Manitoba. He received his Ph.D. from the Indian Institute of Science, Bangalore, India, and spent years at Concordia University, Montreal, Canada; the Georgia Institute of Technology, Atlanta; and the University of Delaware as a post-doc, research staff member and research faculty member before taking up his position with the University of Manitoba. Tulsi has undergone training in Mathematics, Applied Science, Aerospace Engineering, Computer Science and Finance during various stages of his schooling and postdoctoral positions. Tulsi's current primary research interest is in the emerging area of Computational Finance. Tulsi has developed a curriculum for a cross-disciplinary computational finance course at the University of Manitoba and is currently teaching it at both the graduate and undergraduate levels. He has trained and graduated many students in this area. His research interests include Scientific and Grid Computing, Bio-inspired Algorithms for Finance, M-Commerce Applications, and Mathematical Finance, where he has been training many graduate students. He has published a number of papers in the areas of High Temperature Physics, Gas Dynamics, Scientific Computing and Computational Finance in leading journals and conferences, and has won best and distinguished paper awards in prominent conferences. Tulsi has been serving on many conference technical committees related to parallel and distributed computing, Neural Networks, and Computational Finance as program chair, general chair, etc., and has been a reviewer for many conferences and journals. He is a member of the ACM and IEEE societies.
Daxin Tian received the B.S., M.S., and Ph.D. (Hons) degrees in computer science from Jilin University, Changchun, China, in July 2002, July 2005, and December 2007, respectively. His research interests include network security, intrusion detection systems, neural networks, and machine learning.
Sameer Tilak is an assistant research scientist at the University of California, San Diego. He is
involved in the design and development of the cyberinfrastructure for a number of large-scale sensor-based environmental observing system initiatives, including the Global Lake Ecological Observatory
Network (GLEON) and the Coral Reef Environmental Observatory Network (CREON). He received
his Ph.D. and M.S. in computer science from SUNY Binghamton in 2005 (degree conferred: January
2006) and 2002 respectively. He received his M.S. in computer science from the University of Rochester
in 2003. His research interests include wireless networks (specifically ad-hoc and sensor networks), grid
computing, stream data management, and parallel discrete-event simulation. He has served as a TPC member for numerous conferences and workshops including IEEE Percom 2009, DCOSS (2008-2009),
IEEE LCN (2007-2009), IEEE SECON 2009, ACM-IEEE MSWiM 2008, IEEE SenseApp (2007-2009)
and IEEE e-Science 2007.
Cho-Li Wang received his Ph.D. degree in Computer Engineering from the University of Southern California in 1995. He is currently an associate professor in the Department of Computer Science at the University of Hong Kong. Dr. Wang's research interests mainly focus on distributed Java virtual machines on clusters, Grid middleware, and software systems for pervasive/mobile computing. Dr. Wang serves on a number of editorial boards, including IEEE Transactions on Computers (TC), Multiagent and Grid Systems (MAGS), and the International Journal of Pervasive Computing and Communications (JPCC). He is the regional coordinator (Hong Kong) of the IEEE Technical Committee on Scalable Computing (TCSC).
Sheng-De Wang was born in Taiwan in 1957. He received the B.S. degree from National Tsing Hua University, Hsinchu, Taiwan, in 1980, and the M.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1982 and 1986, respectively. Since 1986 he has been on the faculty of the Department of Electrical Engineering at National Taiwan University, Taipei, Taiwan, where he is currently a professor. From 1995 to 2001, he also served as the director of the computer operating group of the Computer and Information Network Center, National Taiwan University. He was a visiting scholar in the Department of Electrical Engineering, University of Washington, Seattle, during the academic year 1998-1999. From 2001 to 2003, he served as the Department Chair of the Department of Electrical Engineering, National Chi Nan University, Puli, Taiwan, for a two-year appointment. His research interests include parallel and distributed computing, embedded systems, and intelligent systems. Dr. Wang is a member of the Association for Computing Machinery and the IEEE Computer Society. He is also a member of the Phi Tau Phi Honor Society.
Yang Xiang is currently with the School of Management and Information Systems, Central Queensland University. His research interests include network and system security, and wireless systems. He has served or is serving as PC Chair for the 11th IEEE International Conference on High Performance Computing and Communications (HPCC 09), the 3rd IEEE International Conference on Network and System Security (NSS 09), and the 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS 08). He has served or is serving as guest editor for ACM Transactions on Autonomous and Adaptive Systems, the Journal of Network and Computer Applications, and Concurrency and Computation: Practice and Experience.
Meilian Xu is a PhD student at the University of Manitoba. She received her M.Sc. degree in Computer Science from Peking University and her B.E. degree in Computer Science from East China Normal University in China. Her research interest is high performance computing and parallel algorithm design for applications such as medical imaging on parallel systems, focusing on multi-core architectures such as the Cell Broadband Engine architecture. She has published several papers in leading conferences in this direction. She is a member of the ACM and IEEE societies.
Jaeyoung Yi received the B.S. degree in Mathematics and Computer Science from Yonsei University in March 2008. Currently, she is in the master's degree program of the School of Electrical and Electronic Engineering, Yonsei University. Jaeyoung's research interests include multi-processor system-on-a-chip architectures.

Mika Ylianttila is a professor and adjunct professor in computer science and information networks
at the Information Processing laboratory and Research Manager at the MediaTeam Oulu research group
at the University of Oulu, Finland. His research interests include mobile applications and services,
protocol design and performance, and communication and middleware architectures. He is a senior
member of IEEE.
Jiehan Zhou is currently working as a research scientist at MediaTeam, Information Processing Laboratory, University of Oulu. He obtained his PhD in manufacturing and automation from the Huazhong University of Science and Technology, Wuhan, China in 2000. He did two years of postdoctoral research in CIMS, Department of Automation, Tsinghua University, Beijing, China, and worked at VTT/Oulu, Finland and INRIA/Sophia Antipolis, France on an 18-month ERCIM fellowship. His current research interests include middleware, community coordinated multimedia, service-oriented computing, ontology engineering, semantic Web, and protocol engineering.
Yifeng Zhu received his BSc in electrical engineering in 1998 from the Huazhong University of Science and Technology, China, and the MS and PhD degrees in computer science from the University of Nebraska, Lincoln, in 2002 and 2005, respectively. He is currently an assistant professor in Electrical and Computer Engineering at the University of Maine. His research interests include parallel I/O storage systems, supercomputing, energy-aware memory systems, and wireless sensor networks. He served as the program chair of IEEE NAS'09 and IEEE SNAPI'07, and as the guest editor of a special issue of the International Journal of High Performance Computing and Networking. He received the Best Paper Award at IEEE CLUSTER'07.
Albert Y. Zomaya currently holds the Chair of High Performance Computing and Networking in
the School of Information Technologies at Sydney University. He is the author/co-author of seven books,
more than 300 papers, and the editor of eight books and eight conference proceedings. He serves as an
associate editor for 16 leading journals. Professor Zomaya is the recipient of the Meritorious Service
Award (in 2000) and the Golden Core Recognition (in 2006), both from the IEEE Computer Society.
He is a Chartered Engineer (CEng), a Fellow of the American Association for the Advancement of Science, the IEEE, the Institution of Engineering and Technology (U.K.), and a Distinguished Engineer
of the ACM.


Index

A
-approximation algorithm 645, 649
Abstract Data and Communication Library
(ADCL) 583, 585, 587, 588, 589, 590,
591, 592, 593, 594, 595, 598, 600, 601,
602, 603
access points (APs) 719, 721, 725, 726
adaptation point 888
ALF programming system 297, 309
Amazon Cloud 63, 79, 83, 86
AMD 266, 277, 279, 291, 313, 314, 315, 317,
319, 322, 323, 332, 336
American option 472
analytical hierarchy process (AHP) 723
Aneka Coordinator 195, 196, 197, 198, 199,
200, 201, 208, 209
Aneka-Federation 191, 192, 193, 194, 195,
196, 197, 198, 199, 200, 202, 205, 207,
208, 209, 212, 213, 214, 217
Apache Tomcat 658, 659, 660, 661, 663, 664,
665, 669, 671, 672, 673, 674, 677, 678,
679
application-level approach 881, 882, 888, 890
application programmer's interface (API) 43,
66, 67, 71, 84, 86
application scalability 73
applications, data-intensive 1, 3, 8, 42, 51, 52,
74, 116
arithmetic operation 816, 840
assembly technology (AT) 295, 296, 297, 299,
300, 301, 303, 305, 309, 310
atomicity, consistency, isolation, & durability
(ACID) 422
atomic transaction 422, 425, 429, 441
authentication 573, 574, 576, 582

Automatically Tuned Collective Communications (ATCC) 586


Automatically Tuned Linear Algebra Software
(ATLAS) 586
autonomic computing 22, 26, 28
autonomous system (AS) 740, 742, 743, 744,
745, 746, 747, 748, 749, 750, 751, 752,
753, 754, 755, 756, 757, 759

B
back-end pipeline 555, 559, 562
base stations (BSs) 719, 721, 725, 726
Basic Linear Algebra Software (BLAS) library
586, 605
basic similarity algorithm 383, 384, 385, 387
behavior classification 339, 342, 344, 345, 346
behavior extraction 338, 339, 341, 346, 347,
348
behavior prediction 339, 342, 343, 345, 347
benchmarks 92, 95, 96, 102, 103, 105, 109,
112, 113, 114, 115
Berkeley Open Infrastructure for Network
Computing (BOINC) 32, 33, 37, 38, 39,
40, 41, 42, 43, 44, 48, 49, 50, 51, 52, 53,
54, 56
best effort (BE) model 739, 754
bin-packing 645
bioinformatics 843, 844, 856, 857
BitTorrent 124, 128, 138
Bloom filter 787, 793, 794, 799, 802, 803, 804,
806, 807, 861
Bloom filter array 807
Bloom filter replica 799, 802, 803, 807
Bloom filter update protocol 807
Bluetooth 705, 706, 707, 712, 713, 714, 715,
716, 717
Volume 1: pgs 1-485 Volume II: pgs 486-894


branch prediction 568


branch prediction units (BPU) 567

C
cache coherence communication 559
call option 472, 477, 481
call tracing 340
Cell Broadband Engine (Cell/B.E.) architecture
312, 314, 315, 320, 321, 322, 323, 324,
325, 327, 328, 329, 330, 331, 332, 333,
335, 336
checkpointing 875, 878, 883, 886, 888, 891,
892, 893
chip multi-processing (CMP) 556, 557, 558,
559, 578
chords 143, 145, 148, 149, 160
churn 128, 144, 146, 147, 154, 158, 160, 164,
165, 168, 169, 174, 176, 179, 180, 181,
182, 183, 184, 186
clients 33, 34, 36, 37, 38, 39, 44, 49, 50, 60,
65, 79
client schedule coordinator 33, 34
client schedule coordinator coordinator 44
close to the metal (CTM) parallel programming
tool 291
Cloud computing 53, 54, 62, 76, 79, 80, 83, 84
Clouds, Aneka 210, 211, 212
Clouds, enterprise 191, 192, 193, 194, 195,
198, 200, 208, 212, 217
Cloud services, decentralized 195, 196, 197,
198, 203, 207
cluster 234, 250, 262, 263, 264, 265, 266, 267,
268, 270, 275, 312, 322, 323, 324, 328,
330, 331, 332
cluster computing 856, 891, 892, 894
coarse-grained SMP designs 563
coarse-grained synchronization 564
coarse-grain multithreading 556
collaboration disciplines 414
communicating multiprocessor tasks (CM-tasks) 260, 261, 262, 265, 274
communication reliability 220
communication round 378, 379, 380, 393
communication, synchronization and 552, 564
community coordinated multimedia (CCM)
682, 683, 684, 685, 686, 687, 688, 689,
690, 691, 692, 693, 694, 695, 696, 697,
698, 703
compensating transaction 422, 424, 426, 433,
434, 435, 436, 437, 438
compiler tool 246, 248, 249, 250, 251, 254,
255, 256, 261, 262, 265, 269, 273, 277,
294, 321, 322, 323, 324, 330, 333
computational biology 843, 844, 856
computational grid 470, 471, 472
computation mobility 874, 875, 876, 878, 879,
880, 881, 890, 891
compute unified device architecture (CUDA)
288, 291
concurrent-read exclusive-write (CREW) 670,
678
condition number 776, 777
controlled flooding mechanisms 123
cross-organizational collaboration 397, 400,
402, 404, 413, 414
cross-organizational service invocation, QoS
of 416
cycle stealing 32

D
data conversion 876, 878, 885, 886, 887
data grids 2, 12, 63, 512, 513, 514, 515
data packets 280, 282, 283, 284, 285, 287
data parallelism 246, 247, 248, 249, 251, 268,
269, 270, 272, 273, 274, 275, 319
data replication 486, 487, 488, 489, 490, 491,
492, 493, 494, 495, 496, 497, 500, 501,
502, 503, 504, 505, 506, 507, 508, 509,
510, 511, 512, 513, 514, 515
data separation 69
deadlock monitoring, dynamic 571, 572
deadlock situations 563, 564, 567, 569, 570,
571
decoding process 285
deep packet inspection (DPI) 873
DEISA Extreme Computing Initiative (DECI)
62, 77, 78, 79, 80, 83, 86
desktop grid 41, 42, 48, 54, 55, 56, 57, 60
Differentiated Services (Diffserv) 739, 742,
751, 753, 754, 757, 758, 759
digital image processing 808, 809

diskless checkpointing 760, 761, 763, 764,
765, 767, 768, 770, 782
distributed computing 1, 2, 7, 13, 14, 21, 27,
32, 36, 39, 57, 58, 60, 86, 87, 88
Distributed European Initiative for Supercomputing Applications (DEISA) project 62,
70, 73, 76, 77, 78, 79, 80, 83, 86, 87
distributed hash tables (DHTs) 124, 143, 160,
161, 163, 164, 165, 166, 169, 174, 180,
186, 190, 193, 201, 207, 217
distributed Java virtual machine (DJVM) 658,
659, 660, 661, 662, 663, 664, 665, 666,
670, 671, 673, 674, 675, 677, 678, 679,
681
Distributed Membership Query 807
distributed processing 810, 822, 824, 827, 831,
833, 834, 838, 840
distributed transaction processing (DTP) 423
divisible load theory (DLT) 827, 841, 842, 844,
846, 851, 853, 854
domain-specific service 397, 398
dynamic evaluation 339
dynamic host configuration protocol (DHCP)
server 705, 707, 708
dynamic load balancing 296, 299, 300, 305,
306, 307, 308
dynamic queries 130, 131, 133, 134, 135, 136,
138
dynamic tunability 309
dynamism 163, 164, 165, 169, 171, 186

E
ECC-like mechanisms 568, 569
encoding algorithm 282, 283
encoding/decoding speed 281
encoding process 284, 285
EnginFrame (portal technology) 75, 76, 83, 87
error correcting code (ECC) 567, 568, 569
error recovery 442, 444, 446, 447, 448, 450,
451, 452, 456, 465, 466, 467, 468, 469
errors, large-scale 445, 468
errors, small-scale 450, 467, 468
European option 472
evolving systems 608, 609, 611, 643
execution traces 342
experimental design 91, 95, 97, 102, 105, 109,
116, 118
experimental results 385, 387, 392, 393

F
false negative 786, 789, 791, 792, 793, 794,
795, 797, 799, 800, 802, 803, 804, 805,
807, 864
false positive 786, 787, 788, 789, 791, 792,
794, 795, 797, 799, 800, 802, 804, 807,
864, 870
fast Fourier transform (FFT) 312, 314, 324,
325, 326, 327, 328, 329, 330, 331, 332,
335
fault tolerance 22, 39, 44, 45, 52, 447, 469,
486, 487, 490, 552, 566, 567, 570, 579
fetch policy 552, 559, 560, 561, 562, 578
financial options 472
fine-grained parallelism 564
fine-grained synchronization 565, 581
fine-grain multithreading 556
finger 143, 145, 147, 156, 157, 162
finite difference time domain (FDTD) 312, 314,
315, 316, 317, 318, 319, 320, 321, 323,
324, 325, 331, 332, 334, 335
Foster, Ian 2, 12, 15, 21, 22, 25, 26, 27, 33, 40,
41, 46, 56, 57, 59, 63, 70, 83, 84, 85, 89,
90, 92, 96, 119
front-end pipeline 555, 559, 560, 562
functional unit (FU) 566

G
Gaussian processing 822, 831, 833, 834
Gaussian random matrices 781
general purpose computation on GPUs (GPGPUs) 289
general timing constraint model 402
global object space (GOS) 659, 663, 665, 666,
667, 668, 672, 673, 674, 675, 676, 679,
681
Globus grid middleware 2, 4, 5, 6, 8, 12, 13,
21, 23, 24, 65, 67, 71, 75, 85, 87, 92, 96,
119
Google app engine 84
granularity 841, 875, 876, 879, 891
graphics data 279, 288, 289, 290, 291, 292

graphics processing unit (GPU) 278, 279, 288,
289, 290, 291, 293, 314
Greedy join algorithm 705, 709
grey relational analysis (GRA) 723
grid application toolkit (GAT) 21, 66, 84
grid-based workflow 469
grid compute commodities (gccs) 472, 473,
474
grid computing 2, 4, 6, 12, 13, 25, 26, 27, 28,
29, 56, 58, 66, 81, 82, 86, 87, 119, 220,
221, 222, 223, 242, 243, 421, 422,
472, 473, 478, 512, 514, 515
grid-enabled operating system (GridOS) 3, 12
grid engine 83
grid environment 346, 401, 404
GridFTP 5, 6, 10, 12, 13, 65, 66
grid index information service (GIIS) 65
grid infrastructure 396
grid middleware 1, 2, 4, 6, 12, 24, 26, 66
grid performance 220
grid performance measure 220
grid, pervasive 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 17,
25, 26, 28, 29, 30
grid portals 83
grid reliability 219, 220, 241
grid resource allocation and management
(GRAM) 5, 6, 12, 13
grid resource information service (GRIS) 65
grid resource pricing 472
grid resources utilization and pricing (GRUP)
matrix 478
grid security 13
grid security infrastructure (GSI) 5, 6, 8, 12, 13
grid service modeling 223
grid service performance 223
grid service reliability 223, 227, 241, 242
grid system performance 221
grid systems reliability 221
grid transaction service (GridTS) 421, 422,
425, 426, 427, 428, 429, 433, 434, 435,
436, 437, 438, 439
groups, collision of 141

H
handoff, horizontal 721, 722, 725, 728, 729,
730, 731, 732, 735
handoff, vertical 721, 722, 723, 724, 725, 728,
730, 734, 735, 736
handoff, vertical, downward (DVH) 722, 730
handoff, vertical, upward (UVH) 722, 730
heterogeneity 163, 164, 165, 166, 168, 171,
180, 181, 185, 186, 187, 189
heterogeneous multi-core processors 278, 314,
335
high performance computing (HPC) 583, 585,
587, 710, 711
home-based lazy release consistency 668, 669,
670, 677, 678
homogeneous multi-core processors 277, 291,
313
hyper-threading (HT) 559

I
IBM Blue Gene 583, 596, 599
ICOUNT policy 560, 561, 562, 563, 569, 571,
572
IEEE 802.11 (standard) 719, 736
IEEE 802.11x (standard) 705, 707, 717
ILP wall 313
image partitioning 808, 810, 811, 814, 821,
823, 824, 825, 826, 827, 828, 830, 831,
832, 833, 834, 835, 836, 838, 840
image processing 808, 809, 810, 811, 814, 819,
821, 822, 823, 827, 831, 836, 838, 842
improved similarity algorithm 387, 388, 389,
392, 393
index caching 123, 124, 127, 128, 134, 136
indirect swap network (ISN) 312, 314, 324,
326, 327, 329, 335
InfiniBand interconnect 584
infostation 645, 646, 647
instruction fetch queue (IFQ) 563, 568, 569,
571, 572
instruction per cycle (IPC) 555
integrated heterogeneous wireless and mobile
network (IHWMN) 718, 719, 720, 721,
722, 723, 724, 728, 734, 737
Integrated Services (Intserv) 739, 742, 751,
759
inter-domain 742, 743, 752, 754
interest groups 123, 125

Internet service provider (ISP) 744, 745, 747,
748, 758
Internet volunteer desktop grids (IVDG) 36
inter-operation 517, 518, 519, 526, 534, 535
interval graph coloring 645, 647, 655
intra-domain 740, 741, 742, 752, 753, 754
intrusion detection systems (IDSs) 277, 286,
287, 858
I/O redirection, transparent 663, 674
IP-based networks 707
issue queue (IssueQ) 567, 568, 569

J
Java 2 Platform, Enterprise Edition (J2EE)
658, 660
Java Bytecode 661, 662, 667, 678, 679
job deadlines 41
job descriptions 34, 96
job management 4, 13, 23, 39, 64, 94
job parameters 34
job scheduling 33, 56, 59, 120, 346, 349, 350,
352
jobs, execution of 37, 39, 46
jobs, high-throughput 36, 41
jobs, homogeneous 38
jobs, low latency 41, 57
jobs, pilot 46
jobs, stand-alone 37
job submission 6, 39

K
Kesselman, Carl 12, 15, 23, 25, 26, 41, 56, 63,
70, 84, 89, 90, 92, 96, 119
key-value pair 141, 142

L
language extension 249
leading thread (LT) 568, 569, 570, 571, 572
load balancing method 163, 164, 167, 186, 187
load distribution problem 842
load queue (LQ) 567, 570, 571
load value queue (LVQ) 568, 569, 570, 571
logical file names (LFN) 5, 10, 13
long-latency instructions 552, 559, 560, 561
long-lived transaction 421, 422, 424, 425, 431,
435, 439, 441

M
MANETs, IP-based 707, 708, 715
mapping 243, 246, 248, 249, 250, 251, 253,
258, 259, 262, 263, 264, 265, 267, 268,
269, 274, 302, 312, 314, 318, 324, 326
Master-Worker computing paradigm 32, 33, 60
matrix-matrix multiplication 770, 772, 773,
774, 775
memory hierarchy 277, 279, 331
memory management unit (MMU) 574, 576
Memory Wall problem 313, 553, 560
message passing 21, 22, 85
message passing interface (MPI) 71, 73, 761,
762, 763, 772, 777, 778, 779, 781, 782
microarchitecture (µarch) 552, 557, 559, 573,
576, 577, 579, 580, 581
middleware 659, 663, 677, 679, 682, 683, 684,
685, 686, 687, 688, 689, 690, 691, 694,
695, 697, 698, 700, 701, 703
mixed parallelism 248
mobile ad hoc networks (MANETs) 705, 706,
707, 708, 709, 710, 715, 717
mobile agent 879, 880
mobile computing 689, 690, 691, 700
mobile message passing interface (MMPI) 713,
714, 715, 717
mobile middleware 706
monitoring and discovery service (MDS) 6, 10,
65
Moore's law 553, 582
multi-core architecture 278, 292, 313, 314, 335
multi-core processing 858, 859, 860, 861, 862,
863, 864, 865, 867, 868, 870, 871, 872,
873, 874, 875, 876, 879, 881, 882, 890,
891
multi-core processors 248, 263, 266, 268, 269,
271, 275, 276, 277, 278, 279, 280, 281,
282, 285, 286, 287, 288, 291, 292, 294,
312, 313, 314, 315, 319, 333, 335, 336,
557, 559, 576, 577, 579
multi-dimensional queries 217
multi-mode interfaces 719, 721
multiprocessors 313, 314
multiprocessor scheduling 645, 647

multiprocessor task (M-task) 246, 247, 248,
255, 257, 258, 260, 262, 263, 264, 265,
267, 268, 274, 275
multi-programming workload 555, 557
multi protocol label switching (MPLS) 739,
741, 742, 751

N
near copy 823, 833, 836
NEC Earth Simulator 583
Needleman-Wunsch Algorithm 841, 842, 845,
846, 848, 855
net_id 705, 709, 710, 715
network bandwidth 51, 60, 67, 68, 69
network coding 277, 280, 281, 285, 293
network policy 739
nodes 124, 125, 139, 140, 141, 142, 144, 145,
146, 147, 149, 151, 152, 153, 154, 155,
157, 158, 162, 163, 164, 165, 166, 167,
168, 169, 170, 171, 172, 173, 174, 175,
176, 177, 178, 179, 180, 181, 182, 183,
184, 185, 186, 187, 190, 191, 192, 194,
195, 197, 200, 202, 206, 207, 208, 210,
212, 213, 214, 217
nodes, predecessor 142, 143, 145, 146, 149,
151, 152, 155, 156, 157, 171, 173
nodes, successor 142, 143, 145, 146, 147, 148,
149, 150, 151, 152, 155, 156, 157, 162,
171, 173
non-singular sub-matrix 768, 776
NVIDIA 279, 288, 291, 292, 293

O
object transaction service (OTS) 423
occupancy counter (OC) 562, 563
on-line algorithm 645, 648, 649, 652
Open Grid Forum (OGF) 66, 67, 80, 85, 87
open grid services architecture (OGSA) 23, 71,
87
overlay networking 191

P
parallel computing 809, 810, 862, 877, 878,
879
parallelism 246, 247, 248, 249, 250, 251, 254,
255, 256, 265, 268, 269, 270, 271, 272,
273, 274, 275, 276, 277, 280, 281, 287,
288, 292, 313, 315, 319, 320, 321, 336
parallelism, instruction level (ILP) 277, 294,
313, 552, 555, 556, 557, 559, 565, 566,
573, 577
parallelism, thread level (TLP) 294, 552, 556,
557, 558, 559, 561, 562, 573, 577
parallel processing system 809
parallel program, fragmented 295
parallel programming 246, 247, 248, 250, 269,
271, 273, 274, 288, 291, 295, 319
parallel-programming workload 555, 557
parallel programs 246, 250, 275, 295, 300,
301, 308
parallel shaders 290
parallel task 246, 251, 253, 272, 275
parameter jobs 72
particle in cell (PIC) method 295, 296, 297,
298, 299, 302, 303, 304, 305, 306, 307,
308, 310
partitionable analysis 607
partitioning, functional 607
partitioning, physical 607
patrolling thread (PT) 573, 574, 576
pattern matching 287, 288
peering 534, 535, 543, 546
peer-to-peer networks 123, 124, 125, 126, 128,
130, 138, 140, 161, 187, 188, 189, 215,
216
peer-to-peer networks, Gnutella-like 125
peer-to-peer networks, structured 124, 126,
189
peer-to-peer networks, unstructured 123, 124,
126, 138
peer to peer (P2P) 2, 4, 12, 18, 23, 24, 35, 37,
39, 40, 42, 43, 44, 52, 53
performance prediction 92, 95, 97, 99, 118
physical file names (PFN) 5, 13
Piconet 712, 713, 714
pipelining 766
pixel shader 290, 291
policy based management 742
policy based network architecture 740, 741
policy based networking (PBN) 739, 759

policy compliance 741
policy management 740
policy negotiation 742
policy rules 741, 756
policy statements 739
portal 6, 10, 25, 63, 72, 74, 75, 81, 83, 85, 87
power processor element (PPE) 314, 315, 320,
321, 322, 328, 329, 330, 335
Power Wall problem 313, 557
price variant factor (pvf) 473, 475
process migration 876, 877, 879, 880, 882,
889, 892, 893
process scheduling 338, 347
program composition 70
program counter (PC) 557, 573, 583, 584, 595
proximity 144, 159, 163, 164, 165, 168, 169,
175, 176, 179, 180, 186, 188, 189, 190
proxy-based clustered architecture 4
proxy-based wireless grid architecture 4
proxy server, dedicated 4
put option 472, 477

Q
QoS architecture, scalable internet 739, 740,
741, 742, 743, 745, 750
QoS, end-to-end 739, 740, 741, 742, 743, 749,
754, 755, 756, 757, 759
QoS evaluation, time-related 396, 398, 401,
402, 404, 412, 416
QoS routing 740, 749
QoS, workflow 402
quality of service (QoS) 3, 4, 8, 15, 19, 20,
396, 398, 400, 401, 402, 404, 412, 414,
415, 416, 418, 419, 443, 448, 472, 473,
479, 480, 482, 484, 485, 489, 490, 497,
508, 513, 515, 516, 549, 723, 724, 728,
730, 731, 732, 734, 735, 739, 740, 741,
742, 743, 745, 749, 750, 751, 752, 753,
754, 755, 756, 757, 758, 759

R
random walk search algorithm 131, 132, 133
real option 471, 472, 473, 474, 478, 483
real-time deep packet inspection 859
real-time packet inspection 858
real-time requirement 859, 862
real-time system, development of 606
reliability 421, 422, 427, 439, 441, 488, 492,
495, 496, 500, 502, 509, 543
reorder buffer (ROB) 561, 562, 563, 567, 568,
570, 571
replica consistency 486, 487, 489, 491, 501,
502, 512, 516
replica location service (RLS) 5, 10, 13
replica placement 489, 490, 492, 494, 495,
496, 497, 504, 505, 507, 508, 509, 510,
511, 515, 516
replica selection 490, 498, 499, 500, 501, 512,
515, 516
replica selection service 490, 498, 501, 516
replica servers 488, 496, 499, 501, 508
reservation-based analysis (RBA) 608, 609,
630, 631, 640, 641, 642, 643, 644
resource broker 530, 531, 532, 533, 539, 547,
549
resource discovery 144, 192, 193, 194, 195,
201, 203, 204, 216, 217
resource management (RM) 23, 27, 40, 59, 62,
63, 64, 83, 84, 92, 120, 741, 742, 743,
750, 752, 753, 757, 758
resource management system (RMS) 442, 443,
444, 445, 446, 448, 449, 450, 451, 453,
454, 455, 456, 457, 458, 459, 460, 461,
462, 463, 464, 466, 467, 470, 535
resource reliability 220
resource sharing control 552, 562, 563
resource sharing networks 521, 526, 543
result certification 44, 46, 53, 60
running time curves 385

S
scalability 89, 90, 95, 97, 101, 102, 104, 105,
109, 110, 112, 116, 118, 120, 415
scalable algorithm 645
ScaLAPACK algebra library 776, 778, 781
scatternet 707
schedule 247, 254, 255, 262
scheduler, adaptive 355
scheduler, non-adaptive 355
scheduling 90, 91, 92, 102, 103, 117, 118, 119,
120, 121, 338, 339, 342, 346, 347, 349,
350, 352, 354, 355, 356, 357, 358, 362,
365, 375, 376, 377, 384, 388, 389, 396,
397, 398, 401, 404, 405, 406, 407, 408,
409, 411, 415, 416, 418
scheduling decisions 338, 346
scheduling, distributed 339, 346
scheduling, non-preemptive 355
scheduling, preemptive 355, 377
scientific collaboration 397, 404, 405, 406,
407, 408, 409, 411, 416
scientific workflow 396, 397, 398, 400, 401,
404, 405, 406, 407, 409, 411, 412, 413,
414, 415, 416, 418
scientific workflow execution 396, 397, 398,
401, 404, 405, 406, 407, 412, 413, 414,
416
scientific workflow management system 397
semantic knowledge 14, 15, 19
sequence alignment 841, 842, 843, 844, 846,
848, 855, 856, 857
server allocation 645, 647, 648, 655
servers 33, 34, 36, 37, 38, 39, 44, 53, 65, 70,
82
service-based workflow system 404
service level agreements (SLA) x, xxii, 442, 443,
444, 445, 446, 448, 451, 468, 469, 470,
480, 481, 485, 529, 537, 545, 548, 740,
742, 745, 756
service-oriented provider 6, 8, 9
service repository 8
shaders 290, 291
Shared Hierarchical Academic Research Computing Network (SHARCNET) 473,
477, 480, 481, 482, 484
simple API for grid applications (SAGA) 66,
67, 84, 85, 86
simple storage service (S3) 472, 484
simultaneous multi-threading (SMT) 552, 556,
557, 558, 559, 560, 561, 562, 563, 564,
565, 566, 567, 570, 571, 573, 577, 578,
579, 580, 581
single process, multiple data (SPMD) 247,
253, 255, 258, 273
SLA workflow broker 444
Software as a service (SaaS) 79
space of modeling (SM) 297, 298, 302, 303,
304, 305, 306, 308
spanning tree 234, 237
Spiral Architecture 808, 811, 814, 816, 819,
820, 821, 822, 823, 824, 825, 826, 827,
830, 831, 832, 833, 834, 836, 837, 838,
839, 840
SPMD, group 253, 258
stabilization 141, 143, 144, 145, 146, 147, 148,
149, 150, 151, 152, 158, 159, 171
star topology 224, 227, 234, 243
static evaluation 339
storage broker 10
store queue (SQ) 567, 570, 571
string editing problem 383
string similarity problem 378, 379, 381, 382,
383
strong consistency algorithm 501
Sun Grid Engine 74, 75, 92
Sun Network.com 79
supernode 145, 147, 149, 150, 152, 153, 154,
155, 156, 157, 158, 172
superscalar processors 552, 555, 556, 561
superscalar sequential applications (STARSs)
71
symmetric multi-processing (SMP) 563, 564
synchronization 552, 563, 564, 565, 569, 577,
578, 581, 589
synergistic processor element (SPE) 315, 320,
321, 322, 324, 327, 328, 329, 330, 335

T
Tabu search 375, 376, 377
task coordination 261
task graph 254, 257, 260, 262
task parallelism 248, 249, 250, 256, 268, 269,
271
temporal-dependable service 411
temporal-dependency relations 397
temporal disciplines 407, 412
temporal model 398, 401, 404, 407, 416
thread migration 874, 877, 878, 880, 881, 882,
889, 891, 892, 894
threads 277, 278, 282, 283, 284, 287, 291, 319,
322, 324
throttling 562, 563
timing analysis 606, 607, 608, 609, 610, 611,
612, 614, 615, 622, 628, 630, 643, 644
timing analysis, abstract 609
timing analysis, target-specific 609
timing requirements 606, 607, 608, 609, 623,
624, 631, 632, 634, 635
trace queue (traceQ) 568, 569, 570, 572
tracing 340, 341, 347
traffic engineering (TE) 759
trailing thread (TT) 568, 569, 570, 571, 572
transaction processing 421, 427, 428, 432, 439,
441
transient faults 566, 567, 579
tree topology 224, 234
tuning, dynamic 585, 588
tuning, static 585

U
uncertainty 18, 19, 20, 25
uncertainty, application 19
uniform index caching 128
uniform interface for computing resources
(UNICORE) 2, 77, 78, 79, 86, 88

V
value prediction 565, 566, 580, 581
vertex shader 290, 291
very long instruction word (VLIW) 555, 557
virtual cluster 523, 528
virtual execution environment 523
virtualisation technology 519, 528, 534, 543
virtual machine 874, 876, 880, 881, 889, 891,
893, 894
virtual organizations (VOs) 24, 88, 510, 518,
519, 520, 524, 525, 526, 528, 530, 533,
536, 537, 538, 539, 541, 542
virus 858, 861, 870, 871
volatile consistency 666, 667, 668, 670, 671,
678
volunteer computing 38, 39, 48, 49, 54, 56, 57,
59

W
Wavefront algorithm 379, 383, 387, 393, 394,
395
weak consistency algorithms 501
Web services 71, 87, 88, 682, 683, 684, 687,
695, 698, 703
wireless and mobile networks (WMNs) 718,
719, 720, 724, 726, 730, 734
wireless grid 4, 12, 13, 28
workers 37, 39, 40, 41, 46, 47, 57, 60
workflow model 402, 406
workflows 7, 62, 63, 71, 72, 73, 80, 90, 92, 93,
94, 98, 99, 102, 105, 106, 107, 108, 109,
110, 115, 119, 120
workflow, scientific 108

X
XtremWeb 33, 36, 37, 39, 40, 41, 42, 43, 44,
46, 52, 56, 58, 61
